Production Data Quality Pipeline with SCD Type 1 and Type 2 in Azure Synapse Data Flows

In real projects, source data is messy. It has NULLs where there should be values, invalid emails, numeric garbage in city fields, and duplicate records with conflicting information. And when you finally clean the data, some changes are corrections (fix a typo) while others are real business events (customer moved to a new city). Your pipeline needs to handle both differently.

This post builds a production-grade pipeline that does it all — cleans messy data, deduplicates records, and applies BOTH SCD Type 1 (overwrite corrections) and SCD Type 2 (track business changes with full history) in a single Data Flow. This is the most complex and realistic pipeline we have built on this blog, and it is exactly what enterprise data platforms look like.

Think of it like a hospital intake process. The patient (raw data) arrives with incomplete paperwork (NULLs), illegible handwriting (invalid data), and sometimes the same patient has two conflicting forms (duplicates). The intake team (Data Flow) cleans the paperwork, resolves the conflicts, and routes the patient correctly — new patient goes to registration (INSERT), returning patient with a new address goes to records update (SCD Type 2), and returning patient with a corrected phone number gets their file updated in place (SCD Type 1).

Table of Contents

  • What This Pipeline Does
  • Why Two SCD Types in One Pipeline
  • The Raw Data (Intentionally Messy)
  • The Target Dimension Table
  • The Complete Data Flow Architecture
  • Stage 1: Null Handling
  • Stage 2: Data Standardization
  • Stage 3: Deduplication with Window Function
  • Stage 4: Dual Hash Generation (SCD1 + SCD2)
  • Stage 5: Lookup Against Existing Dimension Table
  • Stage 6: Conditional Split (Four Streams)
  • Stage 7: New Records Path
  • Stage 8: SCD Type 2 Path (Expire + Insert New Version)
  • Stage 9: SCD Type 1 Path (Overwrite in Place)
  • Stage 10: Union and Three Sinks
  • The Three Sinks Explained
  • The Audit Trail Design
  • First Run Walkthrough
  • Second Run Walkthrough
  • Why Type 2 Is Checked Before Type 1
  • Complete Transformation Summary
  • Common Errors and Fixes
  • Interview Questions
  • Wrapping Up

What This Pipeline Does

RAW CSV (15 rows, messy)
  |
  v
[1] NULL HANDLING → Replace missing values with defaults
  |
  v
[2] DATA STANDARDIZATION → Fix invalid emails, clean city values, proper casing
  |
  v
[3] DEDUPLICATION → Window function ranks by date, keeps latest → 13 unique rows
  |
  v
[4] DUAL HASH GENERATION → Hash_SCD1 (name, email) + Hash_SCD2 (city, country)
  |
  v
[5] LOOKUP → Join with existing dimension table
  |
  v
[6] CONDITIONAL SPLIT
    |           |              |              |
    v           v              v              v
  New       Type 2          Type 1        Unchanged
  Records   Changed         Changed       (dropped)
    |       (city moved)    (email fixed)
    |           |              |
    v           v              v
  SinkInsert  SinkExpire    SinkType1Update
              + SinkInsert

Why Two SCD Types in One Pipeline

The business reality: Not all changes are equal.

When a customer fixes a typo in their email address (vishnu@gmail → vishnu@gmail.com), that is a correction. You do NOT want to create a new historical version for a typo fix. Just overwrite it silently. That is SCD Type 1.

When a customer moves from Chennai to Delhi, that is a real business event. You WANT to track this — for compliance, for analytics (“how many customers relocated?”), for historical accuracy (“where did this customer live when they placed that order?”). That is SCD Type 2.

Most tutorials teach Type 1 OR Type 2 separately. In production, you need BOTH in the same pipeline because real data has both corrections and business changes arriving together.

Real-life analogy: Think of a hospital medical record. If the receptionist misspelled your name, they correct it in place (Type 1 — overwrite the typo). If you change your address, they add a new entry with the date of the change while preserving the old address (Type 2 — full history). Different changes, different handling, same patient record.

The Raw Data (Intentionally Messy)

CustomerID,FirstName,LastName,Email,City,Country,UpdateDate
1001,Naveen,Vuppula,naveen@email.com,Hyderabad,India,2026-01-01
1002,Ninad,,ninad@email.com,Bangalore,India,2026-01-15
1002,Ninad,Patil,,Pune,India,2026-02-05
1003,Veda,Sharma,veda@email.com,Chennai,India,2026-01-10
1003,Veda,Sharma,veda@email.com,Chennai,India,2026-02-01
1004,Shreyash,Reddy,shreyash@email.com,,India,2026-01-20
1005,Vishnu,Kumar,vishnu@gmail,Mumbai,India,2026-01-05
1006,Anita,,anita.placeholder@email.com,Delhi,India,2026-01-25
1007,Priya,Mehta,priya@email.com,Kolkata,India,2026-02-10
1008,  Ravi ,Desai,ravi@email.com,Jaipur,India,2026-01-30
1009,Sunita,Nair,sunita@email.com,12345,India,2026-02-15
1010,Arjun,Patel,arjun@email.com,Ahmedabad,India,2026-01-12
1011,Deepa,Iyer,deepa@email.com,Lucknow,,2026-02-20
1012,Kiran,Joshi,kiran@email.com,Chandigarh,India,2026-01-08
1013,Manoj,Gupta,,Patna,India,2026-02-25

The Data Quality Issues

| Row | Issue | Type |
|------|-------|------|
| 1002 (row 2) | NULL LastName, valid email | NULL handling |
| 1002 (row 3) | NULL Email but has LastName | NULL + Duplicate |
| 1003 (rows 4-5) | Duplicate CustomerID, same data, different dates | Deduplication |
| 1004 | NULL City | NULL handling |
| 1005 | Invalid email (vishnu@gmail, missing .com) | Standardization |
| 1006 | NULL LastName | NULL handling |
| 1008 | Leading/trailing spaces in FirstName ('  Ravi ') | Standardization |
| 1009 | City is numeric (12345), not a real city | Standardization |
| 1011 | NULL Country | NULL handling |
| 1013 | NULL Email | NULL handling |

15 rows of raw data with 10+ quality issues. Real data is exactly like this.

The Target Dimension Table

CREATE TABLE DimCustomer_Project4 (
    CustomerKey     INT IDENTITY(1,1) PRIMARY KEY,
    CustomerID      INT NOT NULL,
    FirstName       VARCHAR(100),
    LastName        VARCHAR(100),
    Email           VARCHAR(200),
    City            VARCHAR(100),
    Country         VARCHAR(100),
    UpdateDate      DATE,
    Hash_SCD1       VARCHAR(256),
    Hash_SCD2       VARCHAR(256),
    StartDate       DATE NOT NULL,
    EndDate         DATE NOT NULL DEFAULT '9999-12-31',
    is_active       BIT NOT NULL DEFAULT 1,
    inserted_tms    DATETIME DEFAULT GETDATE(),
    inserted_by     VARCHAR(50) DEFAULT 'dataflow_insert',
    updated_tms     DATETIME NULL,
    updated_by      VARCHAR(50) NULL
);

Why Two Hash Columns

| Column | Covers | Change Behavior |
|--------|--------|-----------------|
| Hash_SCD1 | FirstName, LastName, Email | If different → overwrite (Type 1) |
| Hash_SCD2 | City, Country | If different → expire old + insert new (Type 2) |

This dual-hash design is the key innovation. One hash for corrections, another for business changes. Compare each separately to determine the right action.

Real-life analogy: Think of two fingerprint scanners at a bank vault. Scanner 1 (Hash_SCD1) checks your personal details — if your name spelling changed, update the file. Scanner 2 (Hash_SCD2) checks your address — if you moved, create a new file while keeping the old one. Different scanners, different actions, same person.

The Complete Data Flow Architecture

SourceCSV (15 rows)
  |
  v
HandleNulls (Derived Column) — replace NULLs with defaults
  |
  v
StandardizeData (Derived Column) — fix emails, clean cities, trim/case names
  |
  v
AddRowNumber (Window) — rank by UpdateDate DESC within each CustomerID
  |
  v
KeepLatest (Filter) — RowNum == 1 → 13 unique rows
  |
  v
GenerateBothHashes (Derived Column) — Hash_SCD1 + Hash_SCD2
  |
  v
LookupExisting ←── ExistingDimTable (Source: active dim records)
  |                  |
  |                  RenameExisting (Select: DIM_ prefix)
  |
  v
SplitRecords (Conditional Split)
  |          |              |              |
  v          v              v              v
New       Type2Changed   Type1Changed   Unchanged
Records   (city moved)   (email fixed)  (dropped)
  |          |              |
  |     ┌────┴────┐         |
  |     |         |         |
  |  Expire    InsertNew    |
  |  (update)  (insert)     |
  |     |         |         |
  |  SinkExpire   |     SinkType1Update
  |               |
  Union (New + InsertNew)
  |
  SinkInsert

Stage 1: Null Handling

What It Does

Replaces NULL values with sensible defaults so downstream transformations do not fail on NULLs.

Configuration

  1. Add Derived Column after Source
  2. Name: HandleNulls
  3. Columns:

| Column | Expression | Logic |
|--------|------------|-------|
| LastName | iif(isNull(LastName), 'Unknown', LastName) | NULL → 'Unknown' |
| Email | iif(isNull(Email), 'noemail@placeholder.com', Email) | NULL → placeholder |
| City | iif(isNull(City), 'Unknown', City) | NULL → 'Unknown' |
| Country | iif(isNull(Country), 'India', Country) | NULL → 'India' (default for this dataset) |

After This Stage

  • 1002 (row 2): LastName goes from NULL to ‘Unknown’
  • 1002 (row 3): Email goes from NULL to ‘noemail@placeholder.com’
  • 1004: City goes from NULL to ‘Unknown’
  • 1013: Email goes from NULL to ‘noemail@placeholder.com’

Real-life analogy: When a patient arrives at the hospital without insurance information, the intake form does not stay blank — it gets marked “Insurance: Unknown — follow up required.” The form is complete enough to proceed while flagging the missing information.
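
The iif(isNull(...)) pattern above can be simulated outside Synapse. Here is a minimal Python sketch of the same per-column defaulting (the dict of defaults mirrors the table above; None stands in for NULL — this is an illustration, not the Data Flow runtime):

```python
# Sketch of the HandleNulls Derived Column logic, simulated in Python.
# Defaults mirror the expressions above; None stands in for NULL.
DEFAULTS = {
    "LastName": "Unknown",
    "Email": "noemail@placeholder.com",
    "City": "Unknown",
    "Country": "India",
}

def handle_nulls(row):
    """Replace None with the column default, like iif(isNull(col), default, col)."""
    return {c: (DEFAULTS[c] if v is None and c in DEFAULTS else v)
            for c, v in row.items()}

row = {"CustomerID": 1002, "FirstName": "Ninad", "LastName": None,
       "Email": "ninad@email.com", "City": "Bangalore", "Country": "India"}
print(handle_nulls(row)["LastName"])  # Unknown
```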

Stage 2: Data Standardization

What It Does

Fixes invalid data formats, standardizes casing, and trims whitespace.

Configuration

  1. Add Derived Column after HandleNulls
  2. Name: StandardizeData
  3. Columns:
| Column | Expression | What It Fixes |
|--------|------------|---------------|
| FirstName | initCap(trim(FirstName)) | '  Ravi ' → 'Ravi' |
| LastName | initCap(trim(LastName)) | Proper casing |
| Email | iif(!contains(Email, '.'), concat(Email, '.com'), lower(trim(Email))) | vishnu@gmail → vishnu@gmail.com |
| City | iif(regexMatch(City, '^[0-9]+$'), 'Unknown', initCap(trim(City))) | 12345 → Unknown |

Key Expressions Explained

initCap(trim(FirstName)) — trim removes leading/trailing spaces. initCap capitalizes the first letter of each word. So ravi becomes Ravi.

iif(!contains(Email, '.'), concat(Email, '.com'), ...) — If the email does not contain a dot (like vishnu@gmail), append .com. Otherwise, lowercase and trim.

regexMatch(City, '^[0-9]+$') — Checks if the city is ALL numbers (like 12345). If so, replace with ‘Unknown’. A city should never be numeric.

Real-life analogy: This is like a proofreader reviewing a document before publication. They fix spelling (trim/initCap), correct formatting (email validation), and flag nonsensical entries (numeric city → Unknown). The content is improved without changing its meaning.
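
The trim/initCap/regexMatch behavior can be approximated in Python. A hedged sketch — str.title() is only a rough stand-in for initCap (it diverges on names like McDonald), and re.fullmatch plays the role of regexMatch:

```python
import re

def standardize(row):
    """Simulate the StandardizeData Derived Column."""
    out = dict(row)
    out["FirstName"] = row["FirstName"].strip().title()   # ~ initCap(trim(...))
    out["LastName"]  = row["LastName"].strip().title()
    email = row["Email"]
    # No dot at all -> assume a truncated domain and append '.com';
    # otherwise normalize with lower(trim(...))
    out["Email"] = email + ".com" if "." not in email else email.strip().lower()
    city = row["City"].strip()
    # All-digit city (regexMatch '^[0-9]+$') is junk -> 'Unknown'
    out["City"] = "Unknown" if re.fullmatch(r"[0-9]+", city) else city.title()
    return out

dirty = {"FirstName": "  Ravi ", "LastName": "Desai",
         "Email": "vishnu@gmail", "City": "12345"}
clean = standardize(dirty)
print(clean["FirstName"], clean["Email"], clean["City"])
# Ravi vishnu@gmail.com Unknown
```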

Stage 3: Deduplication with Window Function

What It Does

When the same CustomerID appears multiple times (duplicates), keeps only the most recent record based on UpdateDate.

Configuration

Window Transformation:

  1. Add Window after StandardizeData
  2. Name: AddRowNumber
  3. Over tab: CustomerID (partition by)
  4. Sort tab: UpdateDate → Descending (latest first)
  5. Window columns: Column = RowNum, Expression = rowNumber()

Filter Transformation:

  1. Add Filter after AddRowNumber
  2. Name: KeepLatest
  3. Expression: RowNum == 1

What Happens

CustomerID 1002 has 2 rows:

  • Row 1: LastName='Unknown', Email='ninad@email.com', City='Bangalore', UpdateDate=2026-01-15
  • Row 2: LastName='Patil', Email='noemail@placeholder.com', City='Pune', UpdateDate=2026-02-05

After Window + Filter: Row 2 wins (latest date). CustomerID 1002 = Patil, Pune, Feb 5.

CustomerID 1003 has 2 rows (same data, different dates). Latest wins.

Result: 15 rows → 13 unique rows.

Real-life analogy: When a student submits homework twice, the teacher keeps the latest submission. The earlier version is discarded. The Window function is the timestamp on the submission.
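
The Window + Filter pair is equivalent to a keep-latest-per-key pass. A Python sketch of the same effect (ISO date strings compare correctly as plain strings, so no date parsing is needed here):

```python
def keep_latest(rows):
    """Partition by CustomerID, keep the row with the max UpdateDate --
    the same effect as rowNumber() over UpdateDate DESC + filter RowNum == 1."""
    latest = {}
    for r in rows:
        cid = r["CustomerID"]
        if cid not in latest or r["UpdateDate"] > latest[cid]["UpdateDate"]:
            latest[cid] = r
    return list(latest.values())

rows = [
    {"CustomerID": 1002, "LastName": "Unknown", "UpdateDate": "2026-01-15"},
    {"CustomerID": 1002, "LastName": "Patil",   "UpdateDate": "2026-02-05"},
    {"CustomerID": 1003, "LastName": "Sharma",  "UpdateDate": "2026-01-10"},
    {"CustomerID": 1003, "LastName": "Sharma",  "UpdateDate": "2026-02-01"},
]
deduped = keep_latest(rows)
print(len(deduped))  # 2
```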

Stage 4: Dual Hash Generation (SCD1 + SCD2)

What It Does

Creates two separate fingerprints for each row — one for Type 1 columns and one for Type 2 columns.

Configuration

  1. Add Derived Column after KeepLatest
  2. Name: GenerateBothHashes
  3. Columns:
| Column | Expression | Covers |
|--------|------------|--------|
| Hash_SCD1 | toString(sha2(256, concatWS('\|', toString(FirstName), toString(LastName), toString(Email)))) | Name and email (corrections) |
| Hash_SCD2 | toString(sha2(256, concatWS('\|', toString(City), toString(Country)))) | City and country (business changes) |

Why Two Hashes Instead of One

With a single hash, any change — email typo fix or city relocation — triggers the same action. You cannot distinguish corrections from business events.

With two hashes:

  • Hash_SCD1 changes + Hash_SCD2 same → Type 1 (overwrite the correction)
  • Hash_SCD2 changes → Type 2 (expire + new version, regardless of Hash_SCD1)
  • Both same → Unchanged (skip)
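
The concatWS + sha2 pattern maps directly onto Python's hashlib. A sketch demonstrating the key property: an email fix moves only Hash_SCD1, while a city move moves only Hash_SCD2 (sample values only, not real dimension rows):

```python
import hashlib

def row_hash(*values):
    """Analogue of toString(sha2(256, concatWS('|', ...)))."""
    return hashlib.sha256("|".join(str(v) for v in values).encode()).hexdigest()

def add_hashes(row):
    # Hash_SCD1 covers correction columns, Hash_SCD2 covers business-change columns
    return {**row,
            "Hash_SCD1": row_hash(row["FirstName"], row["LastName"], row["Email"]),
            "Hash_SCD2": row_hash(row["City"], row["Country"])}

old = add_hashes({"FirstName": "Veda", "LastName": "Sharma",
                  "Email": "veda@email.com", "City": "Chennai", "Country": "India"})
moved = add_hashes({**old, "City": "Delhi"})
print(old["Hash_SCD1"] == moved["Hash_SCD1"])  # True  (name/email untouched)
print(old["Hash_SCD2"] == moved["Hash_SCD2"])  # False (city changed)
```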

Stage 5: Lookup Against Existing Dimension Table

Second Source: Existing Dimension Table

  1. Add new Source (not connected to the main flow)
  2. Name: ExistingDimTable
  3. Query: SELECT CustomerKey, CustomerID, Hash_SCD1, Hash_SCD2 FROM DimCustomer_Project4 WHERE is_active = 1

Why WHERE is_active = 1? Only compare against current records. Expired historical rows are irrelevant for change detection.

Rename with DIM_ Prefix

  1. Add Select after ExistingDimTable
  2. Name: RenameExisting
  3. Mappings:
     • CustomerKey → DIM_CustomerKey
     • CustomerID → DIM_CustomerID
     • Hash_SCD1 → DIM_Hash_SCD1
     • Hash_SCD2 → DIM_Hash_SCD2

Lookup

  1. Add Lookup after GenerateBothHashes
  2. Name: LookupExisting
  3. Primary: GenerateBothHashes
  4. Lookup: RenameExisting
  5. Condition: CustomerID == DIM_CustomerID
  6. Type: Left outer (keep all source rows)
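
The Lookup behaves like a left outer join on CustomerID, with unmatched source rows carrying NULLs in the DIM_ columns. A sketch of that behavior (the function name lookup_existing is mine, not a Synapse API):

```python
def lookup_existing(source_rows, dim_rows):
    """Left outer join source -> dim on CustomerID; unmatched rows get None
    in the DIM_ columns, which the Conditional Split later reads as 'new'."""
    dim_by_id = {d["CustomerID"]: d for d in dim_rows}
    joined = []
    for s in source_rows:
        d = dim_by_id.get(s["CustomerID"])
        joined.append({**s,
                       "DIM_CustomerID": d["CustomerID"] if d else None,
                       "DIM_Hash_SCD1": d["Hash_SCD1"] if d else None,
                       "DIM_Hash_SCD2": d["Hash_SCD2"] if d else None})
    return joined

src = [{"CustomerID": 1003, "Hash_SCD1": "a", "Hash_SCD2": "b"},
       {"CustomerID": 1014, "Hash_SCD1": "c", "Hash_SCD2": "d"}]
dim = [{"CustomerID": 1003, "Hash_SCD1": "a", "Hash_SCD2": "x"}]
out = lookup_existing(src, dim)
print(out[1]["DIM_CustomerID"])  # None -> 1014 is a new customer
```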

Stage 6: Conditional Split (Four Streams)

Configuration

  1. Add Conditional Split after LookupExisting
  2. Name: SplitRecords
  3. Conditions (ORDER MATTERS):
| # | Stream | Condition | Logic |
|---|--------|-----------|-------|
| 1 | NewRecords | isNull(DIM_CustomerID) | No match → new customer |
| 2 | Type2Changed | Hash_SCD2 != DIM_Hash_SCD2 | City/Country changed → expire + insert |
| 3 | Type1Changed | Hash_SCD1 != DIM_Hash_SCD1 | Name/Email corrected → overwrite |
| Default | Unchanged | (no condition) | Both hashes match → skip |

Why This Order

Type 2 is checked BEFORE Type 1. If a customer moved to a new city AND fixed their email (both hashes changed), we treat it as Type 2. The new version row will have the corrected email included, so the Type 1 fix comes along for free.

If we checked Type 1 first, we would overwrite the email in the old row and then ALSO create a new row for the city change — resulting in the email fix appearing in BOTH the old and new versions, which is wrong.

Real-life analogy: At airport immigration, they check your visa status (Type 2 — major change) before checking your name spelling (Type 1 — minor correction). If your visa changed, you go through the full re-entry process. A name spelling fix is handled within that process, not separately.
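
The split logic, including the order-matters rule, condenses to a few ordered checks. A Python sketch (dim is None when the Lookup found no match; hash values here are placeholders):

```python
def route(src, dim):
    """Mirror SplitRecords: the first matching condition wins, so
    Type 2 must be tested before Type 1."""
    if dim is None:
        return "NewRecords"
    if src["Hash_SCD2"] != dim["Hash_SCD2"]:
        return "Type2Changed"          # wins even if Hash_SCD1 also differs
    if src["Hash_SCD1"] != dim["Hash_SCD1"]:
        return "Type1Changed"
    return "Unchanged"

# Customer moved AND fixed an email: both hashes differ -> routed as Type 2
print(route({"Hash_SCD1": "a2", "Hash_SCD2": "b2"},
            {"Hash_SCD1": "a1", "Hash_SCD2": "b1"}))  # Type2Changed
print(route({"Hash_SCD1": "a2", "Hash_SCD2": "b1"},
            {"Hash_SCD1": "a1", "Hash_SCD2": "b1"}))  # Type1Changed
print(route({"Hash_SCD1": "a1", "Hash_SCD2": "b1"}, None))  # NewRecords
```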

Stage 7: New Records Path

From NewRecords stream:

  1. Derived Column → PrepareNewInsert
     • StartDate = currentDate()
     • EndDate = toDate('9999-12-31')
     • is_active = true()
  2. Alter Row → MarkNewInsert
     • Insert if: true()

Stage 8: SCD Type 2 Path (Expire + Insert New Version)

The Type2Changed stream splits into TWO parallel operations:

Path A: Expire the Old Row

  1. Derived Column → PrepareExpire
     • EndDate = subDays(currentDate(), 1) (yesterday)
     • is_active = false()
  2. Select → SelectExpireColumns
     • Only 3 columns: DIM_CustomerKey → CustomerKey, EndDate, is_active
  3. Alter Row → MarkExpire
     • Update if: true()
  4. Sink → SinkExpire
     • Dataset: DimCustomer_Project4
     • Allow update: checked
     • Key columns: CustomerKey

Path B: Insert New Version

  1. Derived Column → PrepareNewVersion
     • StartDate = currentDate()
     • EndDate = toDate('9999-12-31')
     • is_active = true()
  2. Alter Row → MarkNewVersionInsert
     • Insert if: true()

This stream then flows into the Union alongside the NewRecords stream.
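
The expire-plus-insert pair can be expressed as two row-level operations. A sketch under stated assumptions: run_date stands in for currentDate(), and the max+1 key is only a stand-in for the SQL IDENTITY column, which generates the real CustomerKey:

```python
from datetime import date, timedelta

OPEN_END = date(9999, 12, 31)

def apply_type2(dim_rows, src_row, matched_key, run_date):
    """Expire the old version (EndDate = yesterday, is_active = 0)
    and append the new version (StartDate = today, open EndDate)."""
    for r in dim_rows:
        if r["CustomerKey"] == matched_key:
            r["EndDate"] = run_date - timedelta(days=1)
            r["is_active"] = 0
    # max+1 stands in for the IDENTITY column generating the new key
    new_key = max(r["CustomerKey"] for r in dim_rows) + 1
    dim_rows.append({**src_row, "CustomerKey": new_key,
                     "StartDate": run_date, "EndDate": OPEN_END, "is_active": 1})
    return dim_rows

dim = [{"CustomerKey": 3, "CustomerID": 1003, "City": "Chennai",
        "StartDate": date(2026, 4, 20), "EndDate": OPEN_END, "is_active": 1}]
apply_type2(dim, {"CustomerID": 1003, "City": "Delhi"}, 3, date(2026, 4, 21))
print(dim[0]["is_active"], dim[1]["City"])  # 0 Delhi
```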

Stage 9: SCD Type 1 Path (Overwrite in Place)

From Type1Changed stream:

  1. Derived Column → PrepareType1Update
     • updated_tms = currentTimestamp()
     • updated_by = 'dataflow_type1_update'
  2. Select → SelectType1Columns
     • Map: DIM_CustomerKey → CustomerKey, FirstName, LastName, Email, Hash_SCD1, updated_tms, updated_by
  3. Alter Row → MarkType1Update
     • Update if: true()
  4. Sink → SinkType1Update
     • Dataset: DimCustomer_Project4
     • Allow update: checked
     • Key columns: CustomerKey

Key design decision: The Type 1 sink only updates the corrected columns + audit trail. It does NOT touch StartDate, EndDate, is_active, City, Country, or Hash_SCD2. The existing row stays exactly as it was except for the corrected fields.
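
That partial-column update is the essential Type 1 property: only the corrected fields and the audit trail move. A sketch of the behavior (hash values and the apply_type1 name are mine, for illustration):

```python
from datetime import datetime

def apply_type1(dim_row, src_row):
    """Overwrite only the SCD1 columns + audit trail; StartDate/EndDate/
    is_active and the SCD2 columns (City, Country, Hash_SCD2) stay untouched."""
    dim_row.update({
        "FirstName": src_row["FirstName"],
        "LastName":  src_row["LastName"],
        "Email":     src_row["Email"],
        "Hash_SCD1": src_row["Hash_SCD1"],
        "updated_tms": datetime.now(),
        "updated_by": "dataflow_type1_update",
    })
    return dim_row

row = {"CustomerKey": 6, "FirstName": "Anita", "LastName": "Unknown",
       "Email": "anita.placeholder@email.com", "Hash_SCD1": "h_old",
       "City": "Delhi", "is_active": 1, "updated_tms": None, "updated_by": None}
apply_type1(row, {"FirstName": "Anita", "LastName": "Unknown",
                  "Email": "anita.correct@email.com", "Hash_SCD1": "h_new"})
print(row["Email"], row["is_active"])  # anita.correct@email.com 1
```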

Stage 10: Union and Three Sinks

Union

Combines MarkNewInsert (new records) + MarkNewVersionInsert (new SCD2 versions):

  1. Union → UnionAllInserts
  2. Inputs: MarkNewInsert + MarkNewVersionInsert

SinkInsert

  1. Sink after Union
  2. Name: SinkInsert
  3. Dataset: DimCustomer_Project4
  4. Allow insert: checked
  5. Do NOT map CustomerKey (auto-generated IDENTITY)

The Three Sinks Explained

| Sink | Operation | When | What It Does |
|------|-----------|------|--------------|
| SinkInsert | INSERT | New records + new SCD2 versions | Adds rows with auto-generated CustomerKey |
| SinkExpire | UPDATE | SCD Type 2 changes | Sets EndDate and is_active=0 on the OLD row using CustomerKey |
| SinkType1Update | UPDATE | SCD Type 1 corrections | Overwrites name/email on the EXISTING row using CustomerKey |

All three sinks point to the SAME table (DimCustomer_Project4). Different operations, same destination.

Real-life analogy: Three doors to the same room. Door 1 (SinkInsert) is for new arrivals — they get a new desk. Door 2 (SinkExpire) is for the records clerk — they stamp “CLOSED” on old files. Door 3 (SinkType1Update) is for corrections — they white-out and rewrite the typo on the existing file.

The Audit Trail Design

Inserted Records (New + New SCD2 Versions)

inserted_tms = GETDATE()       ← SQL DEFAULT (auto-set on INSERT)
inserted_by  = 'dataflow_insert'  ← SQL DEFAULT
updated_tms  = NULL              ← Not yet updated
updated_by   = NULL

SCD Type 2 — Expired Row

EndDate    = yesterday            ← Set by PrepareExpire
is_active  = 0                    ← Set by PrepareExpire
(inserted_tms preserved)          ← NOT touched
(inserted_by preserved)           ← NOT touched

SCD Type 1 — Corrected Row

updated_tms = NOW                 ← Set by PrepareType1Update
updated_by  = 'dataflow_type1_update'  ← Identifies what made the change
(inserted_tms preserved)          ← NOT touched
(StartDate, EndDate preserved)    ← NOT touched — no version change

This design lets you answer:

  • “When was this record first created?” → inserted_tms
  • “Was this record ever corrected?” → updated_tms IS NOT NULL
  • “What corrected it?” → updated_by = 'dataflow_type1_update'
  • “When did the customer move?” → Check StartDate on the new version

First Run Walkthrough

Source: 15 rows → After cleaning and dedup: 13 rows.

Dimension table is empty → ALL 13 go to NewRecords → SinkInsert: 13 INSERTs.

DimCustomer_Project4 after first run:
| CKey | CID  | FirstName | City       | Hash_SCD2 | StartDate  | EndDate    | is_active |
|------|------|-----------|------------|-----------|------------|------------|-----------|
| 1    | 1001 | Naveen    | Hyderabad  | abc123... | 2026-04-20 | 9999-12-31 | 1         |
| 2    | 1002 | Ninad     | Pune       | def456... | 2026-04-20 | 9999-12-31 | 1         |
| 3    | 1003 | Veda      | Chennai    | ghi789... | 2026-04-20 | 9999-12-31 | 1         |
| ...  | ...  | ...       | ...        | ...       | ...        | ...        | ...       |
| 13   | 1013 | Manoj     | Patna      | xyz000... | 2026-04-20 | 9999-12-31 | 1         |

Second Run Walkthrough

New CSV with 4 records:

CustomerID,FirstName,LastName,Email,City,Country,UpdateDate
1003,Veda,Sharma,veda@email.com,Delhi,India,2026-03-15
1006,Anita,Unknown,anita.correct@email.com,Delhi,India,2026-03-20
1005,Vishnu,Kumar,vishnu@gmail.com,Mumbai,India,2026-03-25
1014,Neha,Reddy,neha@email.com,Pune,India,2026-03-10

What Happens

| CustomerID | Source Hash_SCD2 | Dim Hash_SCD2 | Source Hash_SCD1 | Dim Hash_SCD1 | Result |
|------------|------------------|---------------|------------------|---------------|--------|
| 1003 | hash(Delhi\|India) | hash(Chennai\|India) | SAME | SAME | Type 2 — city changed |
| 1006 | SAME | SAME | hash(Anita\|Unknown\|anita.correct@) | hash(Anita\|Unknown\|anita.placeholder@) | Type 1 — email corrected |
| 1005 | SAME | SAME | hash(Vishnu\|Kumar\|vishnu@gmail.com) | hash(Vishnu\|Kumar\|vishnu@gmail.com) | Unchanged (email was already fixed in Stage 2) |
| 1014 | N/A | N/A | N/A | N/A | New — not in dim table |

Actions

  1. 1003 (Type 2): SinkExpire: UPDATE CustomerKey=3 SET EndDate='2026-04-20', is_active=0. SinkInsert: INSERT new row with City='Delhi', new CustomerKey.
  2. 1006 (Type 1): SinkType1Update: UPDATE CustomerKey=6 SET Email='anita.correct@email.com', updated_tms=NOW, updated_by='dataflow_type1_update'.
  3. 1005 (Unchanged): Dropped — nothing happens.
  4. 1014 (New): SinkInsert: INSERT new row.

Table After Second Run

| CKey | CID  | City      | Email                    | StartDate  | EndDate    | active | updated_by              |
|------|------|-----------|--------------------------|------------|------------|--------|-------------------------|
| 1    | 1001 | Hyderabad | naveen@email.com         | 2026-04-20 | 9999-12-31 | 1      | NULL                    |
| 2    | 1002 | Pune      | noemail@placeholder.com  | 2026-04-20 | 9999-12-31 | 1      | NULL                    |
| 3    | 1003 | Chennai   | veda@email.com           | 2026-04-20 | 2026-04-20 | 0      | NULL                    | ← EXPIRED
| ...  |      |           |                          |            |            |        |                         |
| 6    | 1006 | Delhi     | anita.correct@email.com  | 2026-04-20 | 9999-12-31 | 1      | dataflow_type1_update   | ← TYPE 1 UPDATE
| ...  |      |           |                          |            |            |        |                         |
| 14   | 1003 | Delhi     | veda@email.com           | 2026-04-21 | 9999-12-31 | 1      | NULL                    | ← NEW VERSION
| 15   | 1014 | Pune      | neha@email.com           | 2026-04-21 | 9999-12-31 | 1      | NULL                    | ← NEW RECORD

CustomerID 1003 now has TWO rows — Chennai (expired) and Delhi (active). Full history preserved. CustomerID 1006 was corrected in place — no new version, just the email fixed with an audit timestamp.
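
Because the expired and active versions coexist, a point-in-time question (“where did customer 1003 live on a given date?”) becomes a StartDate..EndDate window scan. A sketch using the two 1003 rows above (dates assume the second run happened on 2026-04-21):

```python
from datetime import date

# The two versions of CustomerID 1003 after the second run
history = [
    {"CustomerID": 1003, "City": "Chennai",
     "StartDate": date(2026, 4, 20), "EndDate": date(2026, 4, 20)},
    {"CustomerID": 1003, "City": "Delhi",
     "StartDate": date(2026, 4, 21), "EndDate": date(9999, 12, 31)},
]

def city_as_of(rows, customer_id, as_of):
    """Return the version whose validity window covers as_of."""
    for r in rows:
        if r["CustomerID"] == customer_id and r["StartDate"] <= as_of <= r["EndDate"]:
            return r["City"]
    return None

print(city_as_of(history, 1003, date(2026, 4, 20)))  # Chennai
print(city_as_of(history, 1003, date(2026, 5, 1)))   # Delhi
```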

Complete Transformation Summary

| # | Transformation | Type | Input Rows | Output Rows | Purpose |
|---|----------------|------|------------|-------------|---------|
| 1 | HandleNulls | Derived Column | 15 | 15 | Replace NULLs with defaults |
| 2 | StandardizeData | Derived Column | 15 | 15 | Fix emails, cities, casing |
| 3 | AddRowNumber | Window | 15 | 15 | Rank records within each CustomerID |
| 4 | KeepLatest | Filter | 15 | 13 | Keep only latest record per CustomerID |
| 5 | GenerateBothHashes | Derived Column | 13 | 13 | Create Hash_SCD1 + Hash_SCD2 |
| 6 | ExistingDimTable | Source | varies | varies | Read current active dim records |
| 7 | RenameExisting | Select | varies | varies | Prefix with DIM_ |
| 8 | LookupExisting | Lookup | 13 | 13 | Join source with dim |
| 9 | SplitRecords | Conditional Split | 13 | 4 streams | Route: New, Type2, Type1, Unchanged |
| 10 | SinkInsert | Sink | varies | varies | INSERT new + new versions |
| 11 | SinkExpire | Sink | varies | varies | UPDATE expired rows |
| 12 | SinkType1Update | Sink | varies | varies | UPDATE corrections |

Common Errors and Fixes

| Error | Cause | Fix |
|-------|-------|-----|
| “No updateable columns match” in SinkExpire | Column mapping mismatch | Ensure SelectExpireColumns outputs exactly: CustomerKey, EndDate, is_active |
| Type 1 updates creating new rows | SinkType1Update set to Allow insert instead of Allow update | Only check Allow update, set Key = CustomerKey |
| All rows going to NewRecords on second run | ExistingDimTable query missing WHERE is_active = 1 | Add the filter |
| Hash always different between runs | Column order in hash differs or NULL handling inconsistent | Use concatWS with consistent column order and coalesce |
| Duplicate CustomerKey error on insert | Mapping CustomerKey in SinkInsert | Do NOT map CustomerKey; let IDENTITY auto-generate |
| Type 2 and Type 1 both firing for same row | Condition order wrong | Put Type2Changed BEFORE Type1Changed in the Conditional Split |

Interview Questions

Q: Why use two separate hash columns instead of one?
A: To distinguish between corrections (SCD Type 1 — overwrite) and business changes (SCD Type 2 — version history). Hash_SCD1 covers columns like name and email that should be silently corrected. Hash_SCD2 covers columns like city and country that represent real business events requiring historical tracking.

Q: How do you handle both SCD Type 1 and Type 2 in one data flow?
A: Use dual hash columns and a Conditional Split with four streams: New records, Type 2 changes (Hash_SCD2 differs), Type 1 changes (Hash_SCD1 differs), and Unchanged. Type 2 changes go to an expire sink (update old) + insert sink (new version). Type 1 changes go to a separate update sink (overwrite in place). Three sinks, one dimension table.

Q: Why is Type 2 checked before Type 1 in the Conditional Split?
A: If both hashes changed (customer moved AND corrected email), treating it as Type 2 is correct — the new version row will include the email correction. Checking Type 1 first would overwrite the email on the old row AND create a new row for the city change, resulting in the correction appearing in both versions incorrectly.

Q: How does the deduplication work?
A: A Window transformation partitions by CustomerID, orders by UpdateDate descending, and assigns a row_number. A Filter keeps only RowNum == 1 (the latest record). This resolves duplicate CustomerIDs by keeping the most recent version.

Q: What are the three sinks and why do you need all three?
A: SinkInsert (INSERT new records + new SCD2 versions), SinkExpire (UPDATE EndDate and is_active on old SCD2 rows), SinkType1Update (UPDATE corrected columns in place for SCD1). All three target the same table but perform different operations using different column sets and different key columns.

Wrapping Up

This pipeline represents what production data engineering actually looks like — messy data in, clean and properly versioned data out. It combines data quality (null handling, standardization, deduplication), change detection (dual hashing), and change management (SCD Type 1 + Type 2) in a single, coherent Data Flow.

The dual-hash design is the key takeaway. By separating “correction columns” from “business change columns,” you get precise control over which changes create history and which ones silently overwrite. No unnecessary version rows for typo fixes. Full audit trail for real business events.

Build this once, and you have a template that works for any dimension table in any data warehouse.

Related posts: SCD Types Explained (0, 1, 2, 3, 6) · SCD Type 1 Hash Pipeline · SCD Type 2 Pipeline · Data Flows Guide · Joins in Data Flows


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
