Production Data Quality Pipeline with SCD Type 1 and Type 2 in Azure Synapse Data Flows
In real projects, source data is messy. It has NULLs where there should be values, invalid emails, numeric garbage in city fields, and duplicate records with conflicting information. And when you finally clean the data, some changes are corrections (fix a typo) while others are real business events (customer moved to a new city). Your pipeline needs to handle both differently.
This post builds a production-grade pipeline that does it all — cleans messy data, deduplicates records, and applies BOTH SCD Type 1 (overwrite corrections) and SCD Type 2 (track business changes with full history) in a single Data Flow. This is the most complex and realistic pipeline we have built on this blog, and it is exactly what enterprise data platforms look like.
Think of it like a hospital intake process. The patient (raw data) arrives with incomplete paperwork (NULLs), illegible handwriting (invalid data), and sometimes the same patient has two conflicting forms (duplicates). The intake team (Data Flow) cleans the paperwork, resolves the conflicts, and routes the patient correctly — new patient goes to registration (INSERT), returning patient with a new address goes to records update (SCD Type 2), and returning patient with a corrected phone number gets their file updated in place (SCD Type 1).
Table of Contents
- What This Pipeline Does
- Why Two SCD Types in One Pipeline
- The Raw Data (Intentionally Messy)
- The Target Dimension Table
- The Complete Data Flow Architecture
- Stage 1: Null Handling
- Stage 2: Data Standardization
- Stage 3: Deduplication with Window Function
- Stage 4: Dual Hash Generation (SCD1 + SCD2)
- Stage 5: Lookup Against Existing Dimension Table
- Stage 6: Conditional Split (Four Streams)
- Stage 7: New Records Path
- Stage 8: SCD Type 2 Path (Expire + Insert New Version)
- Stage 9: SCD Type 1 Path (Overwrite in Place)
- Stage 10: Union and Three Sinks
- The Three Sinks Explained
- The Audit Trail Design
- First Run Walkthrough
- Second Run Walkthrough
- Why Type 2 Is Checked Before Type 1
- Complete Transformation Summary
- Common Errors and Fixes
- Interview Questions
- Wrapping Up
What This Pipeline Does
RAW CSV (15 rows, messy)
|
v
[1] NULL HANDLING → Replace missing values with defaults
|
v
[2] DATA STANDARDIZATION → Fix invalid emails, clean city values, proper casing
|
v
[3] DEDUPLICATION → Window function ranks by date, keeps latest → 13 unique rows
|
v
[4] DUAL HASH GENERATION → Hash_SCD1 (name, email) + Hash_SCD2 (city, country)
|
v
[5] LOOKUP → Join with existing dimension table
|
v
[6] CONDITIONAL SPLIT
     |              |               |              |
     v              v               v              v
    New          Type 2          Type 1        Unchanged
  Records        Changed         Changed       (dropped)
     |        (city moved)    (email fixed)
     |              |               |
     v              v               v
SinkInsert     SinkExpire     SinkType1Update
               + SinkInsert
Why Two SCD Types in One Pipeline
The business reality: Not all changes are equal.
When a customer fixes a typo in their email address (vishnu@gmail → vishnu@gmail.com), that is a correction. You do NOT want to create a new historical version for a typo fix. Just overwrite it silently. That is SCD Type 1.
When a customer moves from Chennai to Delhi, that is a real business event. You WANT to track this — for compliance, for analytics (“how many customers relocated?”), for historical accuracy (“where did this customer live when they placed that order?”). That is SCD Type 2.
Most tutorials teach Type 1 OR Type 2 separately. In production, you need BOTH in the same pipeline because real data has both corrections and business changes arriving together.
Real-life analogy: Think of a hospital medical record. If the receptionist misspelled your name, they correct it in place (Type 1 — overwrite the typo). If you change your address, they add a new entry with the date of the change while preserving the old address (Type 2 — full history). Different changes, different handling, same patient record.
The Raw Data (Intentionally Messy)
CustomerID,FirstName,LastName,Email,City,Country,UpdateDate
1001,Naveen,Vuppula,naveen@email.com,Hyderabad,India,2026-01-01
1002,Ninad,,ninad@email.com,Bangalore,India,2026-01-15
1002,Ninad,Patil,,Pune,India,2026-02-05
1003,Veda,Sharma,veda@email.com,Chennai,India,2026-01-10
1003,Veda,Sharma,veda@email.com,Chennai,India,2026-02-01
1004,Shreyash,Reddy,shreyash@email.com,,India,2026-01-20
1005,Vishnu,Kumar,vishnu@gmail,Mumbai,India,2026-01-05
1006,Anita,,anita.placeholder@email.com,Delhi,India,2026-01-25
1007,Priya,Mehta,priya@email.com,Kolkata,India,2026-02-10
1008, Ravi ,Desai,ravi@email.com,Jaipur,India,2026-01-30
1009,Sunita,Nair,sunita@email.com,12345,India,2026-02-15
1010,Arjun,Patel,arjun@email.com,Ahmedabad,India,2026-01-12
1011,Deepa,Iyer,deepa@email.com,Lucknow,,2026-02-20
1012,Kiran,Joshi,kiran@email.com,Chandigarh,India,2026-01-08
1013,Manoj,Gupta,,Patna,India,2026-02-25
The Data Quality Issues
| Row | Issue | Type |
|---|---|---|
| 1002 (row 2) | NULL LastName, valid email | NULL handling |
| 1002 (row 3) | NULL Email but has LastName | NULL + Duplicate |
| 1003 (rows 4-5) | Duplicate CustomerID, same data, different dates | Deduplication |
| 1004 | NULL City | NULL handling |
| 1005 | Invalid email (vishnu@gmail — missing .com) | Standardization |
| 1006 | NULL LastName | NULL handling |
| 1008 | Leading/trailing spaces in FirstName (` Ravi `) | Standardization |
| 1009 | City is numeric (12345) — not a real city | Standardization |
| 1011 | NULL Country | NULL handling |
| 1013 | NULL Email | NULL handling |
15 rows of raw data with 10+ quality issues. Real data is exactly like this.
The Target Dimension Table
CREATE TABLE DimCustomer_Project4 (
CustomerKey INT IDENTITY(1,1) PRIMARY KEY,
CustomerID INT NOT NULL,
FirstName VARCHAR(100),
LastName VARCHAR(100),
Email VARCHAR(200),
City VARCHAR(100),
Country VARCHAR(100),
UpdateDate DATE,
Hash_SCD1 VARCHAR(256),
Hash_SCD2 VARCHAR(256),
StartDate DATE NOT NULL,
EndDate DATE NOT NULL DEFAULT '9999-12-31',
is_active BIT NOT NULL DEFAULT 1,
inserted_tms DATETIME DEFAULT GETDATE(),
inserted_by VARCHAR(50) DEFAULT 'dataflow_insert',
updated_tms DATETIME NULL,
updated_by VARCHAR(50) NULL
);
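One thing worth adding that is not in the original DDL: the ExistingDimTable source filters on is_active and both update sinks key on CustomerKey, so an index covering those access paths can help on larger dimensions. A hedged sketch — the index name and column choice are my assumptions, and Synapse dedicated SQL pools index differently (clustered columnstore by default), so read this as Azure SQL-style syntax:

```sql
-- Hypothetical helper index (not from the original post): supports the
-- ExistingDimTable source query (WHERE is_active = 1, join on CustomerID)
-- without scanning expired history rows.
CREATE NONCLUSTERED INDEX IX_DimCustomer_Project4_Active
ON DimCustomer_Project4 (is_active, CustomerID)
INCLUDE (CustomerKey, Hash_SCD1, Hash_SCD2);
```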
Why Two Hash Columns
| Column | Covers | Change Behavior |
|---|---|---|
| Hash_SCD1 | FirstName, LastName, Email | If different → overwrite (Type 1) |
| Hash_SCD2 | City, Country | If different → expire old + insert new (Type 2) |
This dual-hash design is the key innovation. One hash for corrections, another for business changes. Compare each separately to determine the right action.
Real-life analogy: Think of two fingerprint scanners at a bank vault. Scanner 1 (Hash_SCD1) checks your personal details — if your name spelling changed, update the file. Scanner 2 (Hash_SCD2) checks your address — if you moved, create a new file while keeping the old one. Different scanners, different actions, same person.
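If you ever need to verify the Data Flow hashes from the SQL side, you can recompute them with HASHBYTES. A minimal sketch, assuming the same pipe delimiter and column order as the GenerateBothHashes stage described below; hex casing and string encoding can differ between T-SQL and the Data Flow sha2 function, so treat this as a spot check, not gospel:

```sql
-- Recompute both fingerprints in T-SQL to spot-check the Data Flow output.
-- Assumes the same '|' delimiter and column order as GenerateBothHashes.
SELECT
    CustomerID,
    LOWER(CONVERT(VARCHAR(64),
        HASHBYTES('SHA2_256', CONCAT_WS('|', FirstName, LastName, Email)), 2)) AS Hash_SCD1_check,
    LOWER(CONVERT(VARCHAR(64),
        HASHBYTES('SHA2_256', CONCAT_WS('|', City, Country)), 2)) AS Hash_SCD2_check
FROM DimCustomer_Project4
WHERE is_active = 1;
```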
The Complete Data Flow Architecture
SourceCSV (15 rows)
|
v
HandleNulls (Derived Column) — replace NULLs with defaults
|
v
StandardizeData (Derived Column) — fix emails, clean cities, trim/case names
|
v
AddRowNumber (Window) — rank by UpdateDate DESC within each CustomerID
|
v
KeepLatest (Filter) — RowNum == 1 → 13 unique rows
|
v
GenerateBothHashes (Derived Column) — Hash_SCD1 + Hash_SCD2
|
v
LookupExisting ←── ExistingDimTable (Source: active dim records)
| |
| RenameExisting (Select: DIM_ prefix)
|
v
SplitRecords (Conditional Split)
    |              |               |              |
    v              v               v              v
   New        Type2Changed    Type1Changed    Unchanged
 Records      (city moved)    (email fixed)   (dropped)
    |              |               |
    |         ┌────┴────┐          |
    |         |         |          |
    |       Expire   InsertNew     |
    |      (update)  (insert)      |
    |         |         |          |
    |     SinkExpire    |     SinkType1Update
    |                   |
    └── Union (New + InsertNew)
              |
          SinkInsert
Stage 1: Null Handling
What It Does
Replaces NULL values with sensible defaults so downstream transformations do not fail on NULLs.
Configuration
- Add Derived Column after Source
- Name: `HandleNulls`
- Columns:

| Column | Expression | Logic |
|---|---|---|
| LastName | `iif(isNull(LastName), 'Unknown', LastName)` | NULL → 'Unknown' |
| Email | `iif(isNull(Email), 'noemail@placeholder.com', Email)` | NULL → placeholder |
| City | `iif(isNull(City), 'Unknown', City)` | NULL → 'Unknown' |
| Country | `iif(isNull(Country), 'India', Country)` | NULL → 'India' (default for this dataset) |
After This Stage
- 1002 (row 2): LastName goes from NULL to ‘Unknown’
- 1002 (row 3): Email goes from NULL to ‘noemail@placeholder.com’
- 1004: City goes from NULL to ‘Unknown’
- 1013: Email goes from NULL to ‘noemail@placeholder.com’
Real-life analogy: When a patient arrives at the hospital without insurance information, the intake form does not stay blank — it gets marked “Insurance: Unknown — follow up required.” The form is complete enough to proceed while flagging the missing information.
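Because the defaults are sentinel values, you can also monitor how much of the dimension is running on placeholders after each load. A hedged example; the metric names are mine, not part of the pipeline:

```sql
-- Count active rows still carrying HandleNulls defaults.
-- A rising count means upstream data quality is getting worse.
SELECT
    SUM(CASE WHEN LastName = 'Unknown' THEN 1 ELSE 0 END)              AS unknown_lastnames,
    SUM(CASE WHEN Email = 'noemail@placeholder.com' THEN 1 ELSE 0 END) AS placeholder_emails,
    SUM(CASE WHEN City = 'Unknown' THEN 1 ELSE 0 END)                  AS unknown_cities
FROM DimCustomer_Project4
WHERE is_active = 1;
```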
Stage 2: Data Standardization
What It Does
Fixes invalid data formats, standardizes casing, and trims whitespace.
Configuration
- Add Derived Column after HandleNulls
- Name: `StandardizeData`
- Columns:

| Column | Expression | What It Fixes |
|---|---|---|
| FirstName | `initCap(trim(FirstName))` | ` Ravi ` → `Ravi` |
| LastName | `initCap(trim(LastName))` | Proper casing |
| Email | `iif(instr(Email, '.') == 0, concat(Email, '.com'), lower(trim(Email)))` | vishnu@gmail → vishnu@gmail.com |
| City | `iif(regexMatch(City, '^[0-9]+$'), 'Unknown', initCap(trim(City)))` | 12345 → Unknown |
Key Expressions Explained
initCap(trim(FirstName)) — trim removes leading and trailing spaces, and initCap capitalizes the first letter of each word. So ` ravi ` becomes `Ravi`.
iif(instr(Email, '.') == 0, concat(Email, '.com'), ...) — instr returns the position of a substring, or 0 when it is absent. If the email does not contain a dot (like vishnu@gmail), append .com. Otherwise, lowercase and trim. (The Data Flow contains() function works on arrays, not strings, so instr is the right tool for a substring test.)
regexMatch(City, '^[0-9]+$') — Checks if the city is ALL numbers (like 12345). If so, replace with ‘Unknown’. A city should never be numeric.
Real-life analogy: This is like a proofreader reviewing a document before publication. They fix spelling (trim/initCap), correct formatting (email validation), and flag nonsensical entries (numeric city → Unknown). The content is improved without changing its meaning.
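To confirm the standardization held, a quick post-load check helps. A sketch that mirrors the StandardizeData rules; both conditions should return zero rows after a clean run:

```sql
-- Rows that slipped past StandardizeData would show up here.
SELECT CustomerID, Email, City
FROM DimCustomer_Project4
WHERE is_active = 1
  AND (CHARINDEX('.', Email) = 0        -- email without a dot
       OR City NOT LIKE '%[^0-9]%');    -- city made up entirely of digits
```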
Stage 3: Deduplication with Window Function
What It Does
When the same CustomerID appears multiple times (duplicates), keeps only the most recent record based on UpdateDate.
Configuration
Window Transformation:
1. Add Window after StandardizeData
2. Name: AddRowNumber
3. Over tab: CustomerID (partition by)
4. Sort tab: UpdateDate → Descending (latest first)
5. Window columns: Column = RowNum, Expression = rowNumber()
Filter Transformation:
1. Add Filter after AddRowNumber
2. Name: KeepLatest
3. Expression: RowNum == 1
What Happens
CustomerID 1002 has 2 rows:
- Row 1: LastName='Unknown', Email='ninad@email.com', City='Bangalore', UpdateDate=2026-01-15
- Row 2: LastName='Patil', Email='noemail@placeholder.com', City='Pune', UpdateDate=2026-02-05
After Window + Filter: Row 2 wins (latest date). CustomerID 1002 = Patil, Pune, Feb 5.
CustomerID 1003 has 2 rows (same data, different dates). Latest wins.
Result: 15 rows → 13 unique rows.
Real-life analogy: When a student submits homework twice, the teacher keeps the latest submission. The earlier version is discarded. The Window function is the timestamp on the submission.
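The Window + Filter pair is the Data Flow equivalent of a classic ROW_NUMBER dedup in SQL. The same logic, sketched against a hypothetical staging table (stg_customers_raw is my name for the landed CSV, not something from this project):

```sql
-- Dedup in T-SQL: partition by CustomerID, keep the latest UpdateDate.
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY CustomerID
               ORDER BY UpdateDate DESC
           ) AS RowNum
    FROM stg_customers_raw              -- hypothetical staging table
)
SELECT CustomerID, FirstName, LastName, Email, City, Country, UpdateDate
FROM ranked
WHERE RowNum = 1;                       -- mirrors the KeepLatest filter
```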
Stage 4: Dual Hash Generation (SCD1 + SCD2)
What It Does
Creates two separate fingerprints for each row — one for Type 1 columns and one for Type 2 columns.
Configuration
- Add Derived Column after KeepLatest
- Name: `GenerateBothHashes`
- Columns:

| Column | Expression | Covers |
|---|---|---|
| Hash_SCD1 | `toString(sha2(256, concatWS('\|', toString(FirstName), toString(LastName), toString(Email))))` | Name and email (corrections) |
| Hash_SCD2 | `toString(sha2(256, concatWS('\|', toString(City), toString(Country))))` | City and country (business changes) |
Why Two Hashes Instead of One
With a single hash, any change — email typo fix or city relocation — triggers the same action. You cannot distinguish corrections from business events.
With two hashes:
- Hash_SCD1 changes + Hash_SCD2 same → Type 1 (overwrite the correction)
- Hash_SCD2 changes → Type 2 (expire + new version, regardless of Hash_SCD1)
- Both same → Unchanged (skip)
Stage 5: Lookup Against Existing Dimension Table
Second Source: Existing Dimension Table
- Add new Source (not connected to the main flow)
- Name: `ExistingDimTable`
- Query:
SELECT CustomerKey, CustomerID, Hash_SCD1, Hash_SCD2 FROM DimCustomer_Project4 WHERE is_active = 1
Why WHERE is_active = 1? Only compare against current records. Expired historical rows are irrelevant for change detection.
Rename with DIM_ Prefix
- Add Select after ExistingDimTable
- Name: `RenameExisting`
- Mappings:
  - `CustomerKey` → `DIM_CustomerKey`
  - `CustomerID` → `DIM_CustomerID`
  - `Hash_SCD1` → `DIM_Hash_SCD1`
  - `Hash_SCD2` → `DIM_Hash_SCD2`
Lookup
- Add Lookup after GenerateBothHashes
- Name: `LookupExisting`
- Primary stream: GenerateBothHashes
- Lookup stream: RenameExisting
- Condition: `CustomerID == DIM_CustomerID`
- Type: Left outer (keep all source rows)
Stage 6: Conditional Split (Four Streams)
Configuration
- Add Conditional Split after LookupExisting
- Name: `SplitRecords`
- Conditions (ORDER MATTERS):

| # | Stream | Condition | Logic |
|---|---|---|---|
| 1 | NewRecords | `isNull(DIM_CustomerID)` | No match → new customer |
| 2 | Type2Changed | `Hash_SCD2 != DIM_Hash_SCD2` | City/Country changed → expire + insert |
| 3 | Type1Changed | `Hash_SCD1 != DIM_Hash_SCD1` | Name/Email corrected → overwrite |
| Default | Unchanged | Both hashes match | Skip |
Why This Order
Type 2 is checked BEFORE Type 1. If a customer moved to a new city AND fixed their email (both hashes changed), we treat it as Type 2. The new version row will have the corrected email included, so the Type 1 fix comes along for free.
If we checked Type 1 first, we would overwrite the email in the old row and then ALSO create a new row for the city change — resulting in the email fix appearing in BOTH the old and new versions, which is wrong.
Real-life analogy: At airport immigration, they check your visa status (Type 2 — major change) before checking your name spelling (Type 1 — minor correction). If your visa changed, you go through the full re-entry process. A name spelling fix is handled within that process, not separately.
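The whole split can be expressed as a single CASE that evaluates top to bottom, which is exactly how the Conditional Split behaves. A SQL sketch of the decision matrix, assuming a hypothetical staging table stg_customers_clean left-joined to the active dimension rows:

```sql
-- The Conditional Split as a CASE expression: the first matching branch
-- wins, so Type 2 is tested before Type 1, just like in the Data Flow.
SELECT
    s.CustomerID,
    CASE
        WHEN d.CustomerID IS NULL       THEN 'NewRecords'
        WHEN s.Hash_SCD2 <> d.Hash_SCD2 THEN 'Type2Changed'
        WHEN s.Hash_SCD1 <> d.Hash_SCD1 THEN 'Type1Changed'
        ELSE 'Unchanged'
    END AS route
FROM stg_customers_clean AS s           -- hypothetical staging table
LEFT JOIN DimCustomer_Project4 AS d
       ON d.CustomerID = s.CustomerID
      AND d.is_active = 1;
```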
Stage 7: New Records Path
From the NewRecords stream:
- Derived Column — `PrepareNewInsert`:
  - `StartDate = currentDate()`
  - `EndDate = toDate('9999-12-31')`
  - `is_active = true()`
- Alter Row — `MarkNewInsert`:
  - Insert if: `true()`
Stage 8: SCD Type 2 Path (Expire + Insert New Version)
The Type2Changed stream splits into TWO parallel operations:
Path A: Expire the Old Row
- Derived Column — `PrepareExpire`:
  - `EndDate = subDays(currentDate(), 1)` (yesterday)
  - `is_active = false()`
- Select — `SelectExpireColumns`:
  - Only 3 columns: `DIM_CustomerKey` → `CustomerKey`, `EndDate`, `is_active`
- Alter Row — `MarkExpire`:
  - Update if: `true()`
- Sink — `SinkExpire`:
  - Dataset: DimCustomer_Project4
  - Allow update: checked
  - Key columns: `CustomerKey`

Path B: Insert New Version
- Derived Column — `PrepareNewVersion`:
  - `StartDate = currentDate()`
  - `EndDate = toDate('9999-12-31')`
  - `is_active = true()`
- Alter Row — `MarkNewVersionInsert`:
  - Insert if: `true()`
This stream joins the Union with NewRecords.
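Under the hood, SinkExpire issues an UPDATE keyed on CustomerKey. Roughly this shape in T-SQL; a sketch of the statement the sink generates, not its exact SQL:

```sql
-- Approximate shape of the per-row UPDATE issued by SinkExpire.
-- @CustomerKey comes from DIM_CustomerKey via SelectExpireColumns.
UPDATE DimCustomer_Project4
SET EndDate   = DATEADD(DAY, -1, CAST(GETDATE() AS DATE)),  -- subDays(currentDate(), 1)
    is_active = 0
WHERE CustomerKey = @CustomerKey;
```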
Stage 9: SCD Type 1 Path (Overwrite in Place)
From the Type1Changed stream:
- Derived Column — `PrepareType1Update`:
  - `updated_tms = currentTimestamp()`
  - `updated_by = 'dataflow_type1_update'`
- Select — `SelectType1Columns`:
  - Map: `DIM_CustomerKey` → `CustomerKey`, plus `FirstName`, `LastName`, `Email`, `Hash_SCD1`, `updated_tms`, `updated_by`
- Alter Row — `MarkType1Update`:
  - Update if: `true()`
- Sink — `SinkType1Update`:
  - Dataset: DimCustomer_Project4
  - Allow update: checked
  - Key columns: `CustomerKey`
Key design decision: The Type 1 sink only updates the corrected columns + audit trail. It does NOT touch StartDate, EndDate, is_active, City, Country, or Hash_SCD2. The existing row stays exactly as it was except for the corrected fields.
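The corresponding statement for SinkType1Update touches only the mapped columns, which is why the version columns survive untouched. Again a sketch of the shape, not the generated SQL:

```sql
-- Approximate shape of the SinkType1Update UPDATE: only corrected columns
-- plus the audit trail; StartDate/EndDate/is_active never appear in SET.
UPDATE DimCustomer_Project4
SET FirstName   = @FirstName,
    LastName    = @LastName,
    Email       = @Email,
    Hash_SCD1   = @Hash_SCD1,
    updated_tms = GETDATE(),
    updated_by  = 'dataflow_type1_update'
WHERE CustomerKey = @CustomerKey;
```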
Stage 10: Union and Three Sinks
Union
Combines MarkNewInsert (new records) + MarkNewVersionInsert (new SCD2 versions):
- Union — `UnionAllInserts`
- Inputs: MarkNewInsert + MarkNewVersionInsert
SinkInsert
- Sink after UnionAllInserts
- Name: `SinkInsert`
- Dataset: DimCustomer_Project4
- Allow insert: checked
- Do NOT map CustomerKey (auto-generated IDENTITY)
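Leaving CustomerKey unmapped means the generated INSERT simply omits that column and IDENTITY fills it in. A minimal sketch of the equivalent plain SQL, using the 1014 record from the second run below (hash values abbreviated):

```sql
-- CustomerKey is absent from the column list, so IDENTITY(1,1) assigns it.
-- inserted_tms and inserted_by fall back to their column DEFAULTs.
INSERT INTO DimCustomer_Project4
    (CustomerID, FirstName, LastName, Email, City, Country,
     UpdateDate, Hash_SCD1, Hash_SCD2, StartDate, EndDate, is_active)
VALUES
    (1014, 'Neha', 'Reddy', 'neha@email.com', 'Pune', 'India',
     '2026-03-10', 'aaa111...', 'bbb222...', '2026-04-21', '9999-12-31', 1);
```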
The Three Sinks Explained
| Sink | Operation | When | What It Does |
|---|---|---|---|
| SinkInsert | INSERT | New records + new SCD2 versions | Adds rows with auto-generated CustomerKey |
| SinkExpire | UPDATE | SCD Type 2 changes | Sets EndDate and is_active=0 on the OLD row using CustomerKey |
| SinkType1Update | UPDATE | SCD Type 1 corrections | Overwrites name/email on the EXISTING row using CustomerKey |
All three sinks point to the SAME table (DimCustomer_Project4). Different operations, same destination.
Real-life analogy: Three doors to the same room. Door 1 (SinkInsert) is for new arrivals — they get a new desk. Door 2 (SinkExpire) is for the records clerk — they stamp “CLOSED” on old files. Door 3 (SinkType1Update) is for corrections — they white-out and rewrite the typo on the existing file.
The Audit Trail Design
Inserted Records (New + New SCD2 Versions)
inserted_tms = GETDATE() ← SQL DEFAULT (auto-set on INSERT)
inserted_by = 'dataflow_insert' ← SQL DEFAULT
updated_tms = NULL ← Not yet updated
updated_by = NULL
SCD Type 2 — Expired Row
EndDate = yesterday ← Set by PrepareExpire
is_active = 0 ← Set by PrepareExpire
(inserted_tms preserved) ← NOT touched
(inserted_by preserved) ← NOT touched
SCD Type 1 — Corrected Row
updated_tms = NOW ← Set by PrepareType1Update
updated_by = 'dataflow_type1_update' ← Identifies what made the change
(inserted_tms preserved) ← NOT touched
(StartDate, EndDate preserved) ← NOT touched — no version change
This design lets you answer:
- “When was this record first created?” → inserted_tms
- “Was this record ever corrected?” → updated_tms IS NOT NULL
- “What corrected it?” → updated_by = 'dataflow_type1_update'
- “When did the customer move?” → check StartDate on the new version
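Each of those questions maps to a short query. Two hedged examples:

```sql
-- "Was this record ever corrected, and by what?"
SELECT CustomerID, updated_tms, updated_by
FROM DimCustomer_Project4
WHERE updated_tms IS NOT NULL;

-- "When did customer 1003 move?" -> StartDate of the current version
SELECT City, StartDate
FROM DimCustomer_Project4
WHERE CustomerID = 1003
  AND is_active = 1;
```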
First Run Walkthrough
Source: 15 rows → After cleaning and dedup: 13 rows.
Dimension table is empty → ALL 13 go to NewRecords → SinkInsert: 13 INSERTs.
DimCustomer_Project4 after first run:
| CKey | CID | FirstName | City | Hash_SCD2 | StartDate | EndDate | is_active |
|------|------|-----------|------------|-----------|------------|------------|-----------|
| 1 | 1001 | Naveen | Hyderabad | abc123... | 2026-04-20 | 9999-12-31 | 1 |
| 2 | 1002 | Ninad | Pune | def456... | 2026-04-20 | 9999-12-31 | 1 |
| 3 | 1003 | Veda | Chennai | ghi789... | 2026-04-20 | 9999-12-31 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 13 | 1013 | Manoj | Patna | xyz000... | 2026-04-20 | 9999-12-31 | 1 |
Second Run Walkthrough
New CSV with 4 records:
CustomerID,FirstName,LastName,Email,City,Country,UpdateDate
1003,Veda,Sharma,veda@email.com,Delhi,India,2026-03-15
1006,Anita,Unknown,anita.correct@email.com,Delhi,India,2026-03-20
1005,Vishnu,Kumar,vishnu@gmail.com,Mumbai,India,2026-03-25
1014,Neha,Reddy,neha@email.com,Pune,India,2026-03-10
What Happens
| CustomerID | Source Hash_SCD2 | Dim Hash_SCD2 | Source Hash_SCD1 | Dim Hash_SCD1 | Result |
|---|---|---|---|---|---|
| 1003 | hash(Delhi\|India) | hash(Chennai\|India) | SAME | SAME | Type 2 — city changed |
| 1006 | SAME | SAME | hash(Anita\|Unknown\|anita.correct@) | hash(Anita\|Unknown\|anita.placeholder@) | Type 1 — email corrected |
| 1005 | SAME | SAME | hash(Vishnu\|Kumar\|vishnu@gmail.com) | hash(Vishnu\|Kumar\|vishnu@gmail.com) | Unchanged (email was already fixed in Stage 2 on the first run) |
| 1014 | N/A | N/A | N/A | N/A | New — not in dim table |
Actions
- 1003 (Type 2): SinkExpire: UPDATE CustomerKey=3 SET EndDate='2026-04-20', is_active=0. SinkInsert: INSERT new row with City='Delhi', new CustomerKey.
- 1006 (Type 1): SinkType1Update: UPDATE CustomerKey=6 SET Email='anita.correct@email.com', updated_tms=NOW, updated_by='dataflow_type1_update'.
- 1005 (Unchanged): Dropped — nothing happens.
- 1014 (New): SinkInsert: INSERT new row.
Table After Second Run
| CKey | CID | City | Email | StartDate | EndDate | active | updated_by |
|------|------|-----------|--------------------------|------------|------------|--------|-------------------------|
| 1 | 1001 | Hyderabad | naveen@email.com | 2026-04-20 | 9999-12-31 | 1 | NULL |
| 2 | 1002 | Pune | noemail@placeholder.com | 2026-04-20 | 9999-12-31 | 1 | NULL |
| 3 | 1003 | Chennai | veda@email.com | 2026-04-20 | 2026-04-20 | 0 | NULL | ← EXPIRED
| ... | | | | | | | |
| 6 | 1006 | Delhi | anita.correct@email.com | 2026-04-20 | 9999-12-31 | 1 | dataflow_type1_update | ← TYPE 1 UPDATE
| ... | | | | | | | |
| 14 | 1003 | Delhi | veda@email.com | 2026-04-21 | 9999-12-31 | 1 | NULL | ← NEW VERSION
| 15 | 1014 | Pune | neha@email.com | 2026-04-21 | 9999-12-31 | 1 | NULL | ← NEW RECORD
CustomerID 1003 now has TWO rows — Chennai (expired) and Delhi (active). Full history preserved. CustomerID 1006 was corrected in place — no new version, just the email fixed with an audit timestamp.
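Querying that history is the payoff of the whole design. A quick sketch:

```sql
-- Full address history for customer 1003: expired versions first,
-- the active row last.
SELECT CustomerKey, City, StartDate, EndDate, is_active
FROM DimCustomer_Project4
WHERE CustomerID = 1003
ORDER BY StartDate;
```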
Complete Transformation Summary
| # | Transformation | Type | Input Rows | Output Rows | Purpose |
|---|---|---|---|---|---|
| 1 | HandleNulls | Derived Column | 15 | 15 | Replace NULLs with defaults |
| 2 | StandardizeData | Derived Column | 15 | 15 | Fix emails, cities, casing |
| 3 | AddRowNumber | Window | 15 | 15 | Rank records within each CustomerID |
| 4 | KeepLatest | Filter | 15 | 13 | Keep only latest record per CustomerID |
| 5 | GenerateBothHashes | Derived Column | 13 | 13 | Create Hash_SCD1 + Hash_SCD2 |
| 6 | ExistingDimTable | Source | – | varies | Read current active dim records |
| 7 | RenameExisting | Select | varies | varies | Prefix with DIM_ |
| 8 | LookupExisting | Lookup | 13 | 13 | Join source with dim |
| 9 | SplitRecords | Conditional Split | 13 | 4 streams | Route: New, Type2, Type1, Unchanged |
| 10 | SinkInsert | Sink | varies | – | INSERT new + new versions |
| 11 | SinkExpire | Sink | varies | – | UPDATE expired rows |
| 12 | SinkType1Update | Sink | varies | – | UPDATE corrections |
Common Errors and Fixes
| Error | Cause | Fix |
|---|---|---|
| “No updateable columns match” in SinkExpire | Column mapping mismatch | Ensure SelectExpireColumns outputs exactly: CustomerKey, EndDate, is_active |
| Type 1 updates creating new rows | SinkType1Update set to Allow insert instead of Allow update | Only check Allow update, set Key = CustomerKey |
| All rows going to NewRecords on second run | ExistingDimTable query missing WHERE is_active = 1 | Add the filter |
| Hash always different between runs | Column order in hash differs or NULL handling inconsistent | Use concatWS with consistent column order and coalesce |
| Duplicate CustomerKey error on insert | Mapping CustomerKey in SinkInsert | Do NOT map CustomerKey — let IDENTITY auto-generate |
| Type 2 and Type 1 both firing for same row | Condition order wrong | Put Type2Changed BEFORE Type1Changed in Conditional Split |
Interview Questions
Q: Why use two separate hash columns instead of one?
A: To distinguish between corrections (SCD Type 1 — overwrite) and business changes (SCD Type 2 — version history). Hash_SCD1 covers columns like name and email that should be silently corrected. Hash_SCD2 covers columns like city and country that represent real business events requiring historical tracking.

Q: How do you handle both SCD Type 1 and Type 2 in one data flow?
A: Use dual hash columns and a Conditional Split with four streams: new records, Type 2 changes (Hash_SCD2 differs), Type 1 changes (Hash_SCD1 differs), and unchanged. Type 2 changes go to an expire sink (update old) plus an insert sink (new version). Type 1 changes go to a separate update sink (overwrite in place). Three sinks, one dimension table.

Q: Why is Type 2 checked before Type 1 in the Conditional Split?
A: If both hashes changed (customer moved AND corrected their email), treating it as Type 2 is correct — the new version row will include the email correction. Checking Type 1 first would overwrite the email on the old row AND create a new row for the city change, so the correction would incorrectly appear in both versions.

Q: How does the deduplication work?
A: A Window transformation partitions by CustomerID, orders by UpdateDate descending, and assigns a row number. A Filter keeps only RowNum == 1 (the latest record). This resolves duplicate CustomerIDs by keeping the most recent version.

Q: What are the three sinks, and why do you need all three?
A: SinkInsert (INSERT new records + new SCD2 versions), SinkExpire (UPDATE EndDate and is_active on old SCD2 rows), and SinkType1Update (UPDATE corrected columns in place for SCD1). All three target the same table but perform different operations using different column sets and different key columns.
Wrapping Up
This pipeline represents what production data engineering actually looks like — messy data in, clean and properly versioned data out. It combines data quality (null handling, standardization, deduplication), change detection (dual hashing), and change management (SCD Type 1 + Type 2) in a single, coherent Data Flow.
The dual-hash design is the key takeaway. By separating “correction columns” from “business change columns,” you get precise control over which changes create history and which ones silently overwrite. No unnecessary version rows for typo fixes. Full audit trail for real business events.
Build this once, and you have a template that works for any dimension table in any data warehouse.
Related posts:
- SCD Types Explained (0, 1, 2, 3, 6)
- SCD Type 1 Hash Pipeline
- SCD Type 2 Pipeline
- Data Flows Guide
- Joins in Data Flows
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.