Production Data Quality Pipeline with SCD Type 1 and Type 2 in Azure Synapse Data Flows
In real projects, source data is messy. It has NULLs where there should be values, invalid emails, numeric garbage in city fields, and duplicate records with conflicting information. And when you finally clean the data, some changes are corrections (fix a typo) while others are real business events (customer moved to a new city). Your pipeline needs to handle both differently.
This post builds a production-grade pipeline that does it all — cleans messy data, deduplicates records, and applies BOTH SCD Type 1 (overwrite corrections) and SCD Type 2 (track business changes with full history) in a single Data Flow. This is the most complex and realistic pipeline we have built on this blog, and it is exactly what enterprise data platforms look like.
Think of it like a hospital intake process. The patient (raw data) arrives with incomplete paperwork (NULLs), illegible handwriting (invalid data), and sometimes the same patient has two conflicting forms (duplicates). The intake team (Data Flow) cleans the paperwork, resolves the conflicts, and routes the patient correctly — new patient goes to registration (INSERT), returning patient with a new address goes to records update (SCD Type 2), and returning patient with a corrected phone number gets their file updated in place (SCD Type 1).
Table of Contents
- What This Pipeline Does
- Why Two SCD Types in One Pipeline
- The Raw Data (Intentionally Messy)
- The Target Dimension Table
- The Complete Data Flow Architecture
- Stage 1: Null Handling
- Stage 2: Data Standardization
- Stage 3: Deduplication with Window Function
- Stage 4: Dual Hash Generation (SCD1 + SCD2)
- Stage 5: Lookup Against Existing Dimension Table
- Stage 6: Conditional Split (Four Streams)
- Stage 7: New Records Path
- Stage 8: SCD Type 2 Path (Expire + Insert New Version)
- Stage 9: SCD Type 1 Path (Overwrite in Place)
- Stage 10: Union and Three Sinks
- The Three Sinks Explained
- The Audit Trail Design
- First Run Walkthrough
- Second Run Walkthrough
- Why Type 2 Is Checked Before Type 1
- Complete Transformation Summary
- Common Errors and Fixes
- Interview Questions
- Wrapping Up
What This Pipeline Does
RAW CSV (15 rows, messy)
|
v
[1] NULL HANDLING → Replace missing values with defaults
|
v
[2] DATA STANDARDIZATION → Fix invalid emails, clean city values, proper casing
|
v
[3] DEDUPLICATION → Window function ranks by date, keeps latest → 13 unique rows
|
v
[4] DUAL HASH GENERATION → Hash_SCD1 (name, email) + Hash_SCD2 (city, country)
|
v
[5] LOOKUP → Join with existing dimension table
|
v
[6] CONDITIONAL SPLIT
     |              |               |              |
     v              v               v              v
    New          Type 2          Type 1        Unchanged
  Records        Changed         Changed       (dropped)
     |        (city moved)    (email fixed)
     |              |               |
     v              v               v
SinkInsert     SinkExpire     SinkType1Update
               + SinkInsert
Why Two SCD Types in One Pipeline
The business reality: Not all changes are equal.
When a customer fixes a typo in their email address (vishnu@gmail → vishnu@gmail.com), that is a correction. You do NOT want to create a new historical version for a typo fix. Just overwrite it silently. That is SCD Type 1.
When a customer moves from Chennai to Delhi, that is a real business event. You WANT to track this — for compliance, for analytics (“how many customers relocated?”), for historical accuracy (“where did this customer live when they placed that order?”). That is SCD Type 2.
Most tutorials teach Type 1 OR Type 2 separately. In production, you need BOTH in the same pipeline because real data has both corrections and business changes arriving together.
Real-life analogy: Think of a hospital medical record. If the receptionist misspelled your name, they correct it in place (Type 1 — overwrite the typo). If you change your address, they add a new entry with the date of the change while preserving the old address (Type 2 — full history). Different changes, different handling, same patient record.
The Raw Data (Intentionally Messy)
CustomerID,FirstName,LastName,Email,City,Country,UpdateDate
1001,Naveen,Vuppula,naveen@email.com,Hyderabad,India,2026-01-01
1002,Ninad,,ninad@email.com,Bangalore,India,2026-01-15
1002,Ninad,Patil,,Pune,India,2026-02-05
1003,Veda,Sharma,veda@email.com,Chennai,India,2026-01-10
1003,Veda,Sharma,veda@email.com,Chennai,India,2026-02-01
1004,Shreyash,Reddy,shreyash@email.com,,India,2026-01-20
1005,Vishnu,Kumar,vishnu@gmail,Mumbai,India,2026-01-05
1006,Anita,,anita.placeholder@email.com,Delhi,India,2026-01-25
1007,Priya,Mehta,priya@email.com,Kolkata,India,2026-02-10
1008, Ravi ,Desai,ravi@email.com,Jaipur,India,2026-01-30
1009,Sunita,Nair,sunita@email.com,12345,India,2026-02-15
1010,Arjun,Patel,arjun@email.com,Ahmedabad,India,2026-01-12
1011,Deepa,Iyer,deepa@email.com,Lucknow,,2026-02-20
1012,Kiran,Joshi,kiran@email.com,Chandigarh,India,2026-01-08
1013,Manoj,Gupta,,Patna,India,2026-02-25
The Data Quality Issues
| Row | Issue | Type |
|---|---|---|
| 1002 (row 2) | NULL LastName, valid email | NULL handling |
| 1002 (row 3) | NULL Email but has LastName | NULL + Duplicate |
| 1003 (rows 4-5) | Duplicate CustomerID, same data, different dates | Deduplication |
| 1004 | NULL City | NULL handling |
| 1005 | Invalid email (vishnu@gmail — missing .com) | Standardization |
| 1006 | NULL LastName | NULL handling |
| 1008 | Leading/trailing spaces in FirstName (` Ravi `) | Standardization |
| 1009 | City is numeric (12345) — not a real city | Standardization |
| 1011 | NULL Country | NULL handling |
| 1013 | NULL Email | NULL handling |
15 rows of raw data with 10+ quality issues. Real data is exactly like this.
The Target Dimension Table
CREATE TABLE DimCustomer_Project4 (
CustomerKey INT IDENTITY(1,1) PRIMARY KEY,
CustomerID INT NOT NULL,
FirstName VARCHAR(100),
LastName VARCHAR(100),
Email VARCHAR(200),
City VARCHAR(100),
Country VARCHAR(100),
UpdateDate DATE,
Hash_SCD1 VARCHAR(256),
Hash_SCD2 VARCHAR(256),
StartDate DATE NOT NULL,
EndDate DATE NOT NULL DEFAULT '9999-12-31',
is_active BIT NOT NULL DEFAULT 1,
inserted_tms DATETIME DEFAULT GETDATE(),
inserted_by VARCHAR(50) DEFAULT 'dataflow_insert',
updated_tms DATETIME NULL,
updated_by VARCHAR(50) NULL
);
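One thing worth adding that is not in the original DDL: the ExistingDimTable source filters on is_active and both update sinks key on CustomerKey, so an index covering those access paths can help on larger dimensions. A hedged sketch — the index name and column choice are my assumptions, and Synapse dedicated SQL pools index differently (clustered columnstore by default), so read this as Azure SQL-style syntax:

```sql
-- Hypothetical helper index (not from the original post): supports the
-- ExistingDimTable source query (WHERE is_active = 1, join on CustomerID)
-- without scanning expired history rows.
CREATE NONCLUSTERED INDEX IX_DimCustomer_Project4_Active
ON DimCustomer_Project4 (is_active, CustomerID)
INCLUDE (CustomerKey, Hash_SCD1, Hash_SCD2);
```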
Why Two Hash Columns
| Column | Covers | Change Behavior |
|---|---|---|
| Hash_SCD1 | FirstName, LastName, Email | If different → overwrite (Type 1) |
| Hash_SCD2 | City, Country | If different → expire old + insert new (Type 2) |
This dual-hash design is the key innovation. One hash for corrections, another for business changes. Compare each separately to determine the right action.
Real-life analogy: Think of two fingerprint scanners at a bank vault. Scanner 1 (Hash_SCD1) checks your personal details — if your name spelling changed, update the file. Scanner 2 (Hash_SCD2) checks your address — if you moved, create a new file while keeping the old one. Different scanners, different actions, same person.
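If you ever need to verify the Data Flow hashes from the SQL side, you can recompute them with HASHBYTES. A minimal sketch, assuming the same pipe delimiter and column order as the GenerateBothHashes stage described below; hex casing and string encoding can differ between T-SQL and the Data Flow sha2 function, so treat this as a spot check, not gospel:

```sql
-- Recompute both fingerprints in T-SQL to spot-check the Data Flow output.
-- Assumes the same '|' delimiter and column order as GenerateBothHashes.
SELECT
    CustomerID,
    LOWER(CONVERT(VARCHAR(64),
        HASHBYTES('SHA2_256', CONCAT_WS('|', FirstName, LastName, Email)), 2)) AS Hash_SCD1_check,
    LOWER(CONVERT(VARCHAR(64),
        HASHBYTES('SHA2_256', CONCAT_WS('|', City, Country)), 2)) AS Hash_SCD2_check
FROM DimCustomer_Project4
WHERE is_active = 1;
```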
The Complete Data Flow Architecture
SourceCSV (15 rows)
|
v
HandleNulls (Derived Column) — replace NULLs with defaults
|
v
StandardizeData (Derived Column) — fix emails, clean cities, trim/case names
|
v
AddRowNumber (Window) — rank by UpdateDate DESC within each CustomerID
|
v
KeepLatest (Filter) — RowNum == 1 → 13 unique rows
|
v
GenerateBothHashes (Derived Column) — Hash_SCD1 + Hash_SCD2
|
v
LookupExisting ←── ExistingDimTable (Source: active dim records)
| |
| RenameExisting (Select: DIM_ prefix)
|
v
SplitRecords (Conditional Split)
    |              |               |              |
    v              v               v              v
   New        Type2Changed    Type1Changed    Unchanged
 Records      (city moved)    (email fixed)   (dropped)
    |              |               |
    |         ┌────┴────┐          |
    |         |         |          |
    |       Expire   InsertNew     |
    |      (update)  (insert)      |
    |         |         |          |
    |     SinkExpire    |     SinkType1Update
    |                   |
    └── Union (New + InsertNew)
              |
          SinkInsert
Stage 1: Null Handling
What It Does
Replaces NULL values with sensible defaults so downstream transformations do not fail on NULLs.
Configuration
- Add Derived Column after Source
- Name: `HandleNulls`
- Columns:

| Column | Expression | Logic |
|---|---|---|
| LastName | `iif(isNull(LastName), 'Unknown', LastName)` | NULL → 'Unknown' |
| Email | `iif(isNull(Email), 'noemail@placeholder.com', Email)` | NULL → placeholder |
| City | `iif(isNull(City), 'Unknown', City)` | NULL → 'Unknown' |
| Country | `iif(isNull(Country), 'India', Country)` | NULL → 'India' (default for this dataset) |
After This Stage
- 1002 (row 2): LastName goes from NULL to ‘Unknown’
- 1002 (row 3): Email goes from NULL to ‘noemail@placeholder.com’
- 1004: City goes from NULL to ‘Unknown’
- 1013: Email goes from NULL to ‘noemail@placeholder.com’
Real-life analogy: When a patient arrives at the hospital without insurance information, the intake form does not stay blank — it gets marked “Insurance: Unknown — follow up required.” The form is complete enough to proceed while flagging the missing information.
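Because the defaults are sentinel values, you can also monitor how much of the dimension is running on placeholders after each load. A hedged example; the metric names are mine, not part of the pipeline:

```sql
-- Count active rows still carrying HandleNulls defaults.
-- A rising count means upstream data quality is getting worse.
SELECT
    SUM(CASE WHEN LastName = 'Unknown' THEN 1 ELSE 0 END)              AS unknown_lastnames,
    SUM(CASE WHEN Email = 'noemail@placeholder.com' THEN 1 ELSE 0 END) AS placeholder_emails,
    SUM(CASE WHEN City = 'Unknown' THEN 1 ELSE 0 END)                  AS unknown_cities
FROM DimCustomer_Project4
WHERE is_active = 1;
```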
Stage 2: Data Standardization
What It Does
Fixes invalid data formats, standardizes casing, and trims whitespace.
Configuration
- Add Derived Column after HandleNulls
- Name: `StandardizeData`
- Columns:

| Column | Expression | What It Fixes |
|---|---|---|
| FirstName | `initCap(trim(FirstName))` | ` Ravi ` → `Ravi` |
| LastName | `initCap(trim(LastName))` | Proper casing |
| Email | `iif(instr(Email, '.') == 0, concat(Email, '.com'), lower(trim(Email)))` | vishnu@gmail → vishnu@gmail.com |
| City | `iif(regexMatch(City, '^[0-9]+$'), 'Unknown', initCap(trim(City)))` | 12345 → Unknown |
Key Expressions Explained
initCap(trim(FirstName)) — trim removes leading and trailing spaces, and initCap capitalizes the first letter of each word. So ` ravi ` becomes `Ravi`.
iif(instr(Email, '.') == 0, concat(Email, '.com'), ...) — instr returns the position of a substring, or 0 when it is absent. If the email does not contain a dot (like vishnu@gmail), append .com. Otherwise, lowercase and trim. (The Data Flow contains() function works on arrays, not strings, so instr is the right tool for a substring test.)
regexMatch(City, '^[0-9]+$') — Checks if the city is ALL numbers (like 12345). If so, replace with ‘Unknown’. A city should never be numeric.
Real-life analogy: This is like a proofreader reviewing a document before publication. They fix spelling (trim/initCap), correct formatting (email validation), and flag nonsensical entries (numeric city → Unknown). The content is improved without changing its meaning.
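To confirm the standardization held, a quick post-load check helps. A sketch that mirrors the StandardizeData rules; both conditions should return zero rows after a clean run:

```sql
-- Rows that slipped past StandardizeData would show up here.
SELECT CustomerID, Email, City
FROM DimCustomer_Project4
WHERE is_active = 1
  AND (CHARINDEX('.', Email) = 0        -- email without a dot
       OR City NOT LIKE '%[^0-9]%');    -- city made up entirely of digits
```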
Stage 3: Deduplication with Window Function
What It Does
When the same CustomerID appears multiple times (duplicates), keeps only the most recent record based on UpdateDate.
Configuration
Window Transformation:
1. Add Window after StandardizeData
2. Name: AddRowNumber
3. Over tab: CustomerID (partition by)
4. Sort tab: UpdateDate → Descending (latest first)
5. Window columns: Column = RowNum, Expression = rowNumber()
Filter Transformation:
1. Add Filter after AddRowNumber
2. Name: KeepLatest
3. Expression: RowNum == 1
What Happens
CustomerID 1002 has 2 rows:
- Row 1: LastName='Unknown', Email='ninad@email.com', City='Bangalore', UpdateDate=2026-01-15
- Row 2: LastName='Patil', Email='noemail@placeholder.com', City='Pune', UpdateDate=2026-02-05
After Window + Filter: Row 2 wins (latest date). CustomerID 1002 = Patil, Pune, Feb 5.
CustomerID 1003 has 2 rows (same data, different dates). Latest wins.
Result: 15 rows → 13 unique rows.
Real-life analogy: When a student submits homework twice, the teacher keeps the latest submission. The earlier version is discarded. The Window function is the timestamp on the submission.
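The Window + Filter pair is the Data Flow equivalent of a classic ROW_NUMBER dedup in SQL. The same logic, sketched against a hypothetical staging table (stg_customers_raw is my name for the landed CSV, not something from this project):

```sql
-- Dedup in T-SQL: partition by CustomerID, keep the latest UpdateDate.
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY CustomerID
               ORDER BY UpdateDate DESC
           ) AS RowNum
    FROM stg_customers_raw              -- hypothetical staging table
)
SELECT CustomerID, FirstName, LastName, Email, City, Country, UpdateDate
FROM ranked
WHERE RowNum = 1;                       -- mirrors the KeepLatest filter
```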
Stage 4: Dual Hash Generation (SCD1 + SCD2)
What It Does
Creates two separate fingerprints for each row — one for Type 1 columns and one for Type 2 columns.
Configuration
- Add Derived Column after KeepLatest
- Name: `GenerateBothHashes`
- Columns:

| Column | Expression | Covers |
|---|---|---|
| Hash_SCD1 | `toString(sha2(256, concatWS('\|', toString(FirstName), toString(LastName), toString(Email))))` | Name and email (corrections) |
| Hash_SCD2 | `toString(sha2(256, concatWS('\|', toString(City), toString(Country))))` | City and country (business changes) |
Why Two Hashes Instead of One
With a single hash, any change — email typo fix or city relocation — triggers the same action. You cannot distinguish corrections from business events.
With two hashes:
- Hash_SCD1 changes + Hash_SCD2 same → Type 1 (overwrite the correction)
- Hash_SCD2 changes → Type 2 (expire + new version, regardless of Hash_SCD1)
- Both same → Unchanged (skip)
Stage 5: Lookup Against Existing Dimension Table
Second Source: Existing Dimension Table
- Add new Source (not connected to the main flow)
- Name: `ExistingDimTable`
- Query:
SELECT CustomerKey, CustomerID, Hash_SCD1, Hash_SCD2 FROM DimCustomer_Project4 WHERE is_active = 1
Why WHERE is_active = 1? Only compare against current records. Expired historical rows are irrelevant for change detection.
Rename with DIM_ Prefix
- Add Select after ExistingDimTable
- Name: `RenameExisting`
- Mappings:
  - `CustomerKey` → `DIM_CustomerKey`
  - `CustomerID` → `DIM_CustomerID`
  - `Hash_SCD1` → `DIM_Hash_SCD1`
  - `Hash_SCD2` → `DIM_Hash_SCD2`
Lookup
- Add Lookup after GenerateBothHashes
- Name: `LookupExisting`
- Primary stream: GenerateBothHashes
- Lookup stream: RenameExisting
- Condition: `CustomerID == DIM_CustomerID`
- Type: Left outer (keep all source rows)
Stage 6: Conditional Split (Four Streams)
Configuration
- Add Conditional Split after LookupExisting
- Name: `SplitRecords`
- Conditions (ORDER MATTERS):

| # | Stream | Condition | Logic |
|---|---|---|---|
| 1 | NewRecords | `isNull(DIM_CustomerID)` | No match → new customer |
| 2 | Type2Changed | `Hash_SCD2 != DIM_Hash_SCD2` | City/Country changed → expire + insert |
| 3 | Type1Changed | `Hash_SCD1 != DIM_Hash_SCD1` | Name/Email corrected → overwrite |
| Default | Unchanged | Both hashes match | Skip |
Why This Order
Type 2 is checked BEFORE Type 1. If a customer moved to a new city AND fixed their email (both hashes changed), we treat it as Type 2. The new version row will have the corrected email included, so the Type 1 fix comes along for free.
If we checked Type 1 first, we would overwrite the email in the old row and then ALSO create a new row for the city change — resulting in the email fix appearing in BOTH the old and new versions, which is wrong.
Real-life analogy: At airport immigration, they check your visa status (Type 2 — major change) before checking your name spelling (Type 1 — minor correction). If your visa changed, you go through the full re-entry process. A name spelling fix is handled within that process, not separately.
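The whole split can be expressed as a single CASE that evaluates top to bottom, which is exactly how the Conditional Split behaves. A SQL sketch of the decision matrix, assuming a hypothetical staging table stg_customers_clean left-joined to the active dimension rows:

```sql
-- The Conditional Split as a CASE expression: the first matching branch
-- wins, so Type 2 is tested before Type 1, just like in the Data Flow.
SELECT
    s.CustomerID,
    CASE
        WHEN d.CustomerID IS NULL       THEN 'NewRecords'
        WHEN s.Hash_SCD2 <> d.Hash_SCD2 THEN 'Type2Changed'
        WHEN s.Hash_SCD1 <> d.Hash_SCD1 THEN 'Type1Changed'
        ELSE 'Unchanged'
    END AS route
FROM stg_customers_clean AS s           -- hypothetical staging table
LEFT JOIN DimCustomer_Project4 AS d
       ON d.CustomerID = s.CustomerID
      AND d.is_active = 1;
```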
Stage 7: New Records Path
From the NewRecords stream:
- Derived Column — `PrepareNewInsert`:
  - `StartDate = currentDate()`
  - `EndDate = toDate('9999-12-31')`
  - `is_active = true()`
- Alter Row — `MarkNewInsert`:
  - Insert if: `true()`
Stage 8: SCD Type 2 Path (Expire + Insert New Version)
The Type2Changed stream splits into TWO parallel operations:
Path A: Expire the Old Row
- Derived Column — `PrepareExpire`:
  - `EndDate = subDays(currentDate(), 1)` (yesterday)
  - `is_active = false()`
- Select — `SelectExpireColumns`:
  - Only 3 columns: `DIM_CustomerKey` → `CustomerKey`, `EndDate`, `is_active`
- Alter Row — `MarkExpire`:
  - Update if: `true()`
- Sink — `SinkExpire`:
  - Dataset: DimCustomer_Project4
  - Allow update: checked
  - Key columns: `CustomerKey`

Path B: Insert New Version
- Derived Column — `PrepareNewVersion`:
  - `StartDate = currentDate()`
  - `EndDate = toDate('9999-12-31')`
  - `is_active = true()`
- Alter Row — `MarkNewVersionInsert`:
  - Insert if: `true()`
This stream joins the Union with NewRecords.
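Under the hood, SinkExpire issues an UPDATE keyed on CustomerKey. Roughly this shape in T-SQL; a sketch of the statement the sink generates, not its exact SQL:

```sql
-- Approximate shape of the per-row UPDATE issued by SinkExpire.
-- @CustomerKey comes from DIM_CustomerKey via SelectExpireColumns.
UPDATE DimCustomer_Project4
SET EndDate   = DATEADD(DAY, -1, CAST(GETDATE() AS DATE)),  -- subDays(currentDate(), 1)
    is_active = 0
WHERE CustomerKey = @CustomerKey;
```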
Stage 9: SCD Type 1 Path (Overwrite in Place)
From the Type1Changed stream:
- Derived Column — `PrepareType1Update`:
  - `updated_tms = currentTimestamp()`
  - `updated_by = 'dataflow_type1_update'`
- Select — `SelectType1Columns`:
  - Map: `DIM_CustomerKey` → `CustomerKey`, plus `FirstName`, `LastName`, `Email`, `Hash_SCD1`, `updated_tms`, `updated_by`
- Alter Row — `MarkType1Update`:
  - Update if: `true()`
- Sink — `SinkType1Update`:
  - Dataset: DimCustomer_Project4
  - Allow update: checked
  - Key columns: `CustomerKey`
Key design decision: The Type 1 sink only updates the corrected columns + audit trail. It does NOT touch StartDate, EndDate, is_active, City, Country, or Hash_SCD2. The existing row stays exactly as it was except for the corrected fields.
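The corresponding statement for SinkType1Update touches only the mapped columns, which is why the version columns survive untouched. Again a sketch of the shape, not the generated SQL:

```sql
-- Approximate shape of the SinkType1Update UPDATE: only corrected columns
-- plus the audit trail; StartDate/EndDate/is_active never appear in SET.
UPDATE DimCustomer_Project4
SET FirstName   = @FirstName,
    LastName    = @LastName,
    Email       = @Email,
    Hash_SCD1   = @Hash_SCD1,
    updated_tms = GETDATE(),
    updated_by  = 'dataflow_type1_update'
WHERE CustomerKey = @CustomerKey;
```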
Stage 10: Union and Three Sinks
Union
Combines MarkNewInsert (new records) + MarkNewVersionInsert (new SCD2 versions):
- Union — `UnionAllInserts`
- Inputs: MarkNewInsert + MarkNewVersionInsert
SinkInsert
- Sink after UnionAllInserts
- Name: `SinkInsert`
- Dataset: DimCustomer_Project4
- Allow insert: checked
- Do NOT map CustomerKey (auto-generated IDENTITY)
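Leaving CustomerKey unmapped means the generated INSERT simply omits that column and IDENTITY fills it in. A minimal sketch of the equivalent plain SQL, using the 1014 record from the second run below (hash values abbreviated):

```sql
-- CustomerKey is absent from the column list, so IDENTITY(1,1) assigns it.
-- inserted_tms and inserted_by fall back to their column DEFAULTs.
INSERT INTO DimCustomer_Project4
    (CustomerID, FirstName, LastName, Email, City, Country,
     UpdateDate, Hash_SCD1, Hash_SCD2, StartDate, EndDate, is_active)
VALUES
    (1014, 'Neha', 'Reddy', 'neha@email.com', 'Pune', 'India',
     '2026-03-10', 'aaa111...', 'bbb222...', '2026-04-21', '9999-12-31', 1);
```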
The Three Sinks Explained
| Sink | Operation | When | What It Does |
|---|---|---|---|
| SinkInsert | INSERT | New records + new SCD2 versions | Adds rows with auto-generated CustomerKey |
| SinkExpire | UPDATE | SCD Type 2 changes | Sets EndDate and is_active=0 on the OLD row using CustomerKey |
| SinkType1Update | UPDATE | SCD Type 1 corrections | Overwrites name/email on the EXISTING row using CustomerKey |
All three sinks point to the SAME table (DimCustomer_Project4). Different operations, same destination.
Real-life analogy: Three doors to the same room. Door 1 (SinkInsert) is for new arrivals — they get a new desk. Door 2 (SinkExpire) is for the records clerk — they stamp “CLOSED” on old files. Door 3 (SinkType1Update) is for corrections — they white-out and rewrite the typo on the existing file.
The Audit Trail Design
Inserted Records (New + New SCD2 Versions)
inserted_tms = GETDATE() ← SQL DEFAULT (auto-set on INSERT)
inserted_by = 'dataflow_insert' ← SQL DEFAULT
updated_tms = NULL ← Not yet updated
updated_by = NULL
SCD Type 2 — Expired Row
EndDate = yesterday ← Set by PrepareExpire
is_active = 0 ← Set by PrepareExpire
(inserted_tms preserved) ← NOT touched
(inserted_by preserved) ← NOT touched
SCD Type 1 — Corrected Row
updated_tms = NOW ← Set by PrepareType1Update
updated_by = 'dataflow_type1_update' ← Identifies what made the change
(inserted_tms preserved) ← NOT touched
(StartDate, EndDate preserved) ← NOT touched — no version change
This design lets you answer:
- “When was this record first created?” → inserted_tms
- “Was this record ever corrected?” → updated_tms IS NOT NULL
- “What corrected it?” → updated_by = 'dataflow_type1_update'
- “When did the customer move?” → check StartDate on the new version
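Each of those questions maps to a short query. Two hedged examples:

```sql
-- "Was this record ever corrected, and by what?"
SELECT CustomerID, updated_tms, updated_by
FROM DimCustomer_Project4
WHERE updated_tms IS NOT NULL;

-- "When did customer 1003 move?" -> StartDate of the current version
SELECT City, StartDate
FROM DimCustomer_Project4
WHERE CustomerID = 1003
  AND is_active = 1;
```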
First Run Walkthrough
Source: 15 rows → After cleaning and dedup: 13 rows.
Dimension table is empty → ALL 13 go to NewRecords → SinkInsert: 13 INSERTs.
DimCustomer_Project4 after first run:
| CKey | CID | FirstName | City | Hash_SCD2 | StartDate | EndDate | is_active |
|------|------|-----------|------------|-----------|------------|------------|-----------|
| 1 | 1001 | Naveen | Hyderabad | abc123... | 2026-04-20 | 9999-12-31 | 1 |
| 2 | 1002 | Ninad | Pune | def456... | 2026-04-20 | 9999-12-31 | 1 |
| 3 | 1003 | Veda | Chennai | ghi789... | 2026-04-20 | 9999-12-31 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 13 | 1013 | Manoj | Patna | xyz000... | 2026-04-20 | 9999-12-31 | 1 |
Second Run Walkthrough
New CSV with 4 records:
CustomerID,FirstName,LastName,Email,City,Country,UpdateDate
1003,Veda,Sharma,veda@email.com,Delhi,India,2026-03-15
1006,Anita,Unknown,anita.correct@email.com,Delhi,India,2026-03-20
1005,Vishnu,Kumar,vishnu@gmail.com,Mumbai,India,2026-03-25
1014,Neha,Reddy,neha@email.com,Pune,India,2026-03-10
What Happens
| CustomerID | Source Hash_SCD2 | Dim Hash_SCD2 | Source Hash_SCD1 | Dim Hash_SCD1 | Result |
|---|---|---|---|---|---|
| 1003 | hash(Delhi\|India) | hash(Chennai\|India) | SAME | SAME | Type 2 — city changed |
| 1006 | SAME | SAME | hash(Anita\|Unknown\|anita.correct@) | hash(Anita\|Unknown\|anita.placeholder@) | Type 1 — email corrected |
| 1005 | SAME | SAME | hash(Vishnu\|Kumar\|vishnu@gmail.com) | hash(Vishnu\|Kumar\|vishnu@gmail.com) | Unchanged (email was already fixed in Stage 2 on the first run) |
| 1014 | N/A | N/A | N/A | N/A | New — not in dim table |
Actions
- 1003 (Type 2): SinkExpire: UPDATE CustomerKey=3 SET EndDate='2026-04-20', is_active=0. SinkInsert: INSERT new row with City='Delhi', new CustomerKey.
- 1006 (Type 1): SinkType1Update: UPDATE CustomerKey=6 SET Email='anita.correct@email.com', updated_tms=NOW, updated_by='dataflow_type1_update'.
- 1005 (Unchanged): Dropped — nothing happens.
- 1014 (New): SinkInsert: INSERT new row.
Table After Second Run
| CKey | CID | City | Email | StartDate | EndDate | active | updated_by |
|------|------|-----------|--------------------------|------------|------------|--------|-------------------------|
| 1 | 1001 | Hyderabad | naveen@email.com | 2026-04-20 | 9999-12-31 | 1 | NULL |
| 2 | 1002 | Pune | noemail@placeholder.com | 2026-04-20 | 9999-12-31 | 1 | NULL |
| 3 | 1003 | Chennai | veda@email.com | 2026-04-20 | 2026-04-20 | 0 | NULL | ← EXPIRED
| ... | | | | | | | |
| 6 | 1006 | Delhi | anita.correct@email.com | 2026-04-20 | 9999-12-31 | 1 | dataflow_type1_update | ← TYPE 1 UPDATE
| ... | | | | | | | |
| 14 | 1003 | Delhi | veda@email.com | 2026-04-21 | 9999-12-31 | 1 | NULL | ← NEW VERSION
| 15 | 1014 | Pune | neha@email.com | 2026-04-21 | 9999-12-31 | 1 | NULL | ← NEW RECORD
CustomerID 1003 now has TWO rows — Chennai (expired) and Delhi (active). Full history preserved. CustomerID 1006 was corrected in place — no new version, just the email fixed with an audit timestamp.
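Querying that history is the payoff of the whole design. A quick sketch:

```sql
-- Full address history for customer 1003: expired versions first,
-- the active row last.
SELECT CustomerKey, City, StartDate, EndDate, is_active
FROM DimCustomer_Project4
WHERE CustomerID = 1003
ORDER BY StartDate;
```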
Complete Transformation Summary
| # | Transformation | Type | Input Rows | Output Rows | Purpose |
|---|---|---|---|---|---|
| 1 | HandleNulls | Derived Column | 15 | 15 | Replace NULLs with defaults |
| 2 | StandardizeData | Derived Column | 15 | 15 | Fix emails, cities, casing |
| 3 | AddRowNumber | Window | 15 | 15 | Rank records within each CustomerID |
| 4 | KeepLatest | Filter | 15 | 13 | Keep only latest record per CustomerID |
| 5 | GenerateBothHashes | Derived Column | 13 | 13 | Create Hash_SCD1 + Hash_SCD2 |
| 6 | ExistingDimTable | Source | – | varies | Read current active dim records |
| 7 | RenameExisting | Select | varies | varies | Prefix with DIM_ |
| 8 | LookupExisting | Lookup | 13 | 13 | Join source with dim |
| 9 | SplitRecords | Conditional Split | 13 | 4 streams | Route: New, Type2, Type1, Unchanged |
| 10 | SinkInsert | Sink | varies | – | INSERT new + new versions |
| 11 | SinkExpire | Sink | varies | – | UPDATE expired rows |
| 12 | SinkType1Update | Sink | varies | – | UPDATE corrections |
Common Errors and Fixes
| Error | Cause | Fix |
|---|---|---|
| “No updateable columns match” in SinkExpire | Column mapping mismatch | Ensure SelectExpireColumns outputs exactly: CustomerKey, EndDate, is_active |
| Type 1 updates creating new rows | SinkType1Update set to Allow insert instead of Allow update | Only check Allow update, set Key = CustomerKey |
| All rows going to NewRecords on second run | ExistingDimTable query missing WHERE is_active = 1 | Add the filter |
| Hash always different between runs | Column order in hash differs or NULL handling inconsistent | Use concatWS with consistent column order and coalesce |
| Duplicate CustomerKey error on insert | Mapping CustomerKey in SinkInsert | Do NOT map CustomerKey — let IDENTITY auto-generate |
| Type 2 and Type 1 both firing for same row | Condition order wrong | Put Type2Changed BEFORE Type1Changed in Conditional Split |
Interview Questions
Q: Why use two separate hash columns instead of one?
A: To distinguish between corrections (SCD Type 1 — overwrite) and business changes (SCD Type 2 — version history). Hash_SCD1 covers columns like name and email that should be silently corrected. Hash_SCD2 covers columns like city and country that represent real business events requiring historical tracking.

Q: How do you handle both SCD Type 1 and Type 2 in one data flow?
A: Use dual hash columns and a Conditional Split with four streams: new records, Type 2 changes (Hash_SCD2 differs), Type 1 changes (Hash_SCD1 differs), and unchanged. Type 2 changes go to an expire sink (update old) plus an insert sink (new version). Type 1 changes go to a separate update sink (overwrite in place). Three sinks, one dimension table.

Q: Why is Type 2 checked before Type 1 in the Conditional Split?
A: If both hashes changed (customer moved AND corrected their email), treating it as Type 2 is correct — the new version row will include the email correction. Checking Type 1 first would overwrite the email on the old row AND create a new row for the city change, so the correction would incorrectly appear in both versions.

Q: How does the deduplication work?
A: A Window transformation partitions by CustomerID, orders by UpdateDate descending, and assigns a row number. A Filter keeps only RowNum == 1 (the latest record). This resolves duplicate CustomerIDs by keeping the most recent version.

Q: What are the three sinks, and why do you need all three?
A: SinkInsert (INSERT new records + new SCD2 versions), SinkExpire (UPDATE EndDate and is_active on old SCD2 rows), and SinkType1Update (UPDATE corrected columns in place for SCD1). All three target the same table but perform different operations using different column sets and different key columns.
Wrapping Up
This pipeline represents what production data engineering actually looks like — messy data in, clean and properly versioned data out. It combines data quality (null handling, standardization, deduplication), change detection (dual hashing), and change management (SCD Type 1 + Type 2) in a single, coherent Data Flow.
The dual-hash design is the key takeaway. By separating “correction columns” from “business change columns,” you get precise control over which changes create history and which ones silently overwrite. No unnecessary version rows for typo fixes. Full audit trail for real business events.
Build this once, and you have a template that works for any dimension table in any data warehouse.
Related posts:
- SCD Types Explained (0, 1, 2, 3, 6)
- SCD Type 1 Hash Pipeline
- SCD Type 2 Pipeline
- Data Flows Guide
- Joins in Data Flows
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.