Building an SCD Type 2 Pipeline in Azure Synapse Data Flows: Full History with Hash-Based Change Detection
In the SCD Type 1 hash pipeline post, we built a flow that detects changes and overwrites old values. Simple and effective, but history is lost: Alice moved from Toronto to Mumbai, and Toronto disappeared forever.
SCD Type 2 fixes this. When Alice moves, we do NOT overwrite Toronto. Instead, we expire the Toronto row (set an end date and mark it inactive) and insert a new row with Mumbai. Both versions coexist. You can answer “where does Alice live NOW?” and “where did Alice live in 2023?” from the same table.
Think of it like a passport system. When you renew your passport, the old one gets a “cancelled” stamp and an expiry date. The new passport gets issued with today’s date. Immigration can look at both passports and see your complete travel history. You never lose the old record.
This post walks through building the complete SCD Type 2 pipeline using Synapse Data Flows — every transformation, every expression, and the exact logic for expiring old records and inserting new versions.
Table of Contents
- How SCD Type 2 Differs from Type 1
- The Target Table Structure
- The Complete Pipeline Architecture
- The Source Data
- Building the Data Flow: Step by Step
- Step A: Source — Read CSV from ADLS Gen2
- Step B: Derived Column — Generate Source Hash
- Step C: Select — Prefix with SRC_
- Step D: Source 2 — Read Existing Dimension Table
- Step E: Select — Prefix Existing with DIM_
- Step F: Lookup — Join Source with Existing Dimension
- Step G: Conditional Split — New, Changed, Unchanged
- Step H: The New Records Path (Insert with Surrogate Key)
- Step I: The Changed Records Path — Two Operations
- Step I-1: Expire Old Row (Update EndDate and is_active)
- Step I-2: Insert New Version (New Row with Current Values)
- Step J: Union — Merge All Inserts
- Step K: Sink Insert — Write New and New-Version Rows
- Step L: Sink Expire — Update Existing Rows
- First Run vs Subsequent Runs
- The Surrogate Key Strategy
- Multiple Changes Over Time
- Querying the SCD Type 2 Table
- Common Errors and Fixes
- Interview Questions
- Wrapping Up
How SCD Type 2 Differs from Type 1
| Aspect | Type 1 (Previous Post) | Type 2 (This Post) |
|---|---|---|
| When data changes | Overwrite old value | Expire old row, insert new row |
| History | Lost | Preserved |
| Row count | Fixed (one row per entity) | Grows (multiple rows per entity) |
| Key | Business key (employee_id) | Surrogate key (CustomerKey) + business key (CustomerID) |
| Audit columns | updated_tms | StartDate, EndDate, is_active |
| Sinks needed | 1 (insert + update) | 2 (one for inserts, one for expire updates) |
Real-life analogy:
- Type 1 = Updating your phone contact. The old number is gone.
- Type 2 = Your phone bill history. Every number you ever had is listed with dates of service.
The Target Table Structure
CREATE TABLE DimCustomer_SCD2 (
CustomerKey INT IDENTITY(1,1) PRIMARY KEY, -- Surrogate key
CustomerID INT NOT NULL, -- Business key
FirstName VARCHAR(100),
LastName VARCHAR(100),
City VARCHAR(50),
StateProvince VARCHAR(50),
CompanyName VARCHAR(200),
HashValue VARCHAR(64), -- SHA-256 hash of all business columns
StartDate DATE DEFAULT CAST(GETDATE() AS DATE),
EndDate DATE DEFAULT '9999-12-31',
is_active BIT DEFAULT 1
);
Why Two Keys?
CustomerID (business key) = identifies the PERSON. Alice is always CustomerID=100, whether she lives in Toronto or Mumbai.
CustomerKey (surrogate key) = identifies the VERSION. Alice in Toronto is CustomerKey=1001. Alice in Mumbai is CustomerKey=1002. Different keys for different versions of the same person.
Fact tables reference CustomerKey, not CustomerID. This is how an order from 2023 links to Alice-in-Toronto, and an order from 2026 links to Alice-in-Mumbai.
Real-life analogy: Your national ID number (CustomerID) stays the same your whole life. But your passport number (CustomerKey) changes every time you renew. When immigration checks which country you visited in 2020, they look at the passport number (CustomerKey) that was active then — not your current passport.
The Magic Columns
| Column | Purpose | Example (Alice in Toronto) | Example (Alice in Mumbai) |
|---|---|---|---|
| StartDate | When this version became active | 2022-03-15 | 2026-01-15 |
| EndDate | When this version expired | 2026-01-14 | 9999-12-31 (still active) |
| is_active | Is this the current version? | 0 (expired) | 1 (current) |
| HashValue | Fingerprint of all business columns | a3f7b2c1… | e9d4f8a6… |
9999-12-31 is a convention meaning “this version has not expired yet.” It is easier to query than NULL and works with date range comparisons.
The Complete Pipeline Architecture
ADLS Gen2 (CSV) Synapse Data Flow Azure SQL DB
┌──────────┐ ┌─────────────────────────────────────┐ ┌──────────────────┐
│ │ │ │ │ │
│ source.csv│──>│ Source → Hash → Select(SRC_) │ │ DimCustomer_SCD2 │
│ │ │ │ │ │ │
│ │ │ Lookup ←── Source2(Dim) │ │ CustomerKey(PK) │
│ │ │ │ Select(DIM_) │<───│ CustomerID │
│ │ │ Conditional Split │ │ HashValue │
│ │ │ / | \ │ │ StartDate │
│ │ │ New Changed Unchanged │ │ EndDate │
│ │ │ | | (drop) │ │ is_active │
│ │ │ | ┌────┴────┐ │ │ │
│ │ │ | │ Expire │ InsertNew │ │ │
│ │ │ | │ (update)│ (insert) │ │ │
│ │ │ | │ │ │ │ │
│ │ │ | SinkExpire│ │ │ │
│ │ │ | │ │ │ │
│ │ │ Union (New + InsertNew) │ │ │
│ │ │ | │ │ │
│ │ │ SinkInsert ──────────────────>│ │ │
└──────────┘ └─────────────────────────────────────┘ └──────────────────┘
Key difference from Type 1: two sinks instead of one.
- SinkInsert inserts brand new records AND new versions of changed records.
- SinkExpire updates existing rows to set EndDate and is_active = 0.
The Source Data
Same CSV source as our Type 1 pipeline:
CustomerID,FirstName,LastName,City,StateProvince,CompanyName
100,Alice,Johnson,Toronto,Ontario,Acme Corp
200,Bob,Smith,Mumbai,Maharashtra,TechStart Inc
300,Carol,Williams,London,,Global Finance
400,Dave,Brown,Toronto,Ontario,DataPro Ltd
On a subsequent run with changes:
CustomerID,FirstName,LastName,City,StateProvince,CompanyName
100,Alice,Johnson,Mumbai,Maharashtra,Acme Corp
200,Bob,Smith,Mumbai,Maharashtra,TechStart Inc
300,Carol,Williams,London,,Global Finance
400,Dave,Brown,Toronto,Ontario,DataPro Ltd
500,Frank,Taylor,Berlin,Berlin,Euro Systems
Alice moved to Mumbai. Frank is new. Bob, Carol, Dave unchanged.
Building the Data Flow: Step by Step
Step A: Source — Read CSV from ADLS Gen2
- Add Source transformation
- Name: `SourceCSV`
- Dataset: your parameterized ADLS CSV dataset
- Preview: 4 rows (first run) or 5 rows (with Frank)
Step B: Derived Column — Generate Source Hash
- Add Derived Column after Source
- Name: `GenerateHash`
- New column: `SRC_HashValue`
- Expression:
sha2(256, concat(
toString(CustomerID), '|',
coalesce(FirstName,''), '|',
coalesce(LastName,''), '|',
coalesce(City,''), '|',
coalesce(StateProvince,''), '|',
coalesce(CompanyName,'')
))
Why coalesce? If any column is NULL, the entire concat returns NULL, so every row containing a NULL ends up with the same (NULL) hash. coalesce(column, '') converts NULL to an empty string, keeping hashes distinct.
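For intuition, the hash expression can be modeled in Python (an illustrative sketch; the real hashing happens in the Data Flow expression above, and the sample rows here are from our CSV):

```python
import hashlib

COLS = ["CustomerID", "FirstName", "LastName", "City", "StateProvince", "CompanyName"]

def row_hash(row):
    """Mimic sha2(256, concat(...)): NULLs become '' (the coalesce),
    values are pipe-delimited, and the result is a 64-char hex digest."""
    parts = ["" if row.get(c) is None else str(row[c]) for c in COLS]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

alice = {"CustomerID": 100, "FirstName": "Alice", "LastName": "Johnson",
         "City": "Toronto", "StateProvince": "Ontario", "CompanyName": "Acme Corp"}
carol = {"CustomerID": 300, "FirstName": "Carol", "LastName": "Williams",
         "City": "London", "StateProvince": None, "CompanyName": "Global Finance"}

# Carol's NULL StateProvince hashes cleanly instead of nulling out the whole row
print(row_hash(alice) != row_hash(carol))  # True
```

Any change to any business column changes the digest, which is exactly why a single string comparison is enough for change detection downstream.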
Step C: Select — Prefix with SRC_
- Add Select after GenerateHash
- Name: `SelectSourceColumns`
- Rename all columns:
  - CustomerID → SRC_CustomerID
  - FirstName → SRC_FirstName
  - LastName → SRC_LastName
  - City → SRC_City
  - StateProvince → SRC_StateProvince
  - CompanyName → SRC_CompanyName
  - SRC_HashValue → SRC_HashValue (keep as is)
Step D: Source 2 — Read Existing Dimension Table
- Add a second Source transformation (separate from Source A)
- Name: `ExistingDimTable`
- Source type: Inline or Dataset pointing to `DimCustomer_SCD2`
- Query:
SELECT CustomerKey, CustomerID, HashValue FROM DimCustomer_SCD2 WHERE is_active = 1
Why WHERE is_active = 1? We only compare against CURRENT records. Expired historical rows are irrelevant for change detection.
Step E: Select — Prefix Existing with DIM_
- Add Select after ExistingDimTable
- Name: `SelectExistingColumns`
- Rename:
  - CustomerKey → DIM_CustomerKey
  - CustomerID → DIM_CustomerID
  - HashValue → DIM_HashValue
Step F: Lookup — Join Source with Existing Dimension
- Add Lookup after SelectSourceColumns
- Name: `LookupExisting`
- Primary stream: `SelectSourceColumns`
- Lookup stream: `SelectExistingColumns`
- Condition: `SRC_CustomerID == DIM_CustomerID`
- Lookup type: Left outer (keep all source rows)
- Multiple matches: First match
Step G: Conditional Split — New, Changed, Unchanged
- Add Conditional Split after LookupExisting
- Name: `SplitNewChangedUnchanged`
- Conditions:

| Stream Name | Condition |
|---|---|
| NewRecords | `isNull(DIM_CustomerID)` |
| ChangedRecords | `SRC_HashValue != DIM_HashValue` |
| UnchangedRecords | Default stream |
Same as Type 1 — the split logic is identical.
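The three-way routing is easy to model outside the Data Flow. Here is a minimal Python sketch of the same logic (the hash strings are hypothetical placeholders, not real digests):

```python
def split_rows(source_rows, active_dim):
    """Route rows the way the Conditional Split does. active_dim maps
    CustomerID -> HashValue for the is_active = 1 dimension rows only."""
    new, changed, unchanged = [], [], []
    for row in source_rows:
        dim_hash = active_dim.get(row["CustomerID"])  # the left-outer Lookup
        if dim_hash is None:                          # isNull(DIM_CustomerID)
            new.append(row)
        elif row["HashValue"] != dim_hash:            # SRC_HashValue != DIM_HashValue
            changed.append(row)
        else:                                         # default stream
            unchanged.append(row)
    return new, changed, unchanged

# Second-run shape: Alice changed, Frank new, Bob/Carol/Dave unchanged
active = {100: "h_toronto", 200: "h_bob", 300: "h_carol", 400: "h_dave"}
source = [
    {"CustomerID": 100, "HashValue": "h_mumbai"},
    {"CustomerID": 200, "HashValue": "h_bob"},
    {"CustomerID": 300, "HashValue": "h_carol"},
    {"CustomerID": 400, "HashValue": "h_dave"},
    {"CustomerID": 500, "HashValue": "h_frank"},
]
new, changed, unchanged = split_rows(source, active)
print(len(new), len(changed), len(unchanged))  # 1 1 3
```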
Step H: The New Records Path (Insert)
- Add Derived Column after NewRecords
- Name: `PrepareNewInsert`
- Add columns:
  - StartDate = `currentDate()`
  - EndDate = `toDate('9999-12-31')`
  - is_active = `true()`
- Add Alter Row after PrepareNewInsert
- Name: `MarkNewInsert`
- Insert if: `true()`
Step I: The Changed Records Path — Two Operations
This is where Type 2 differs fundamentally from Type 1. A changed record requires TWO operations:
- Expire the old row — UPDATE the existing dim table row (set EndDate and is_active=0)
- Insert the new version — INSERT a new row with current values
These go to TWO different sinks.
Step I-1: Expire Old Row (Select + Alter Row + Sink)
From the ChangedRecords stream:
- Add Derived Column
- Name: `PrepareExpireColumns`
- Columns:
  - EndDate = `currentDate()`
  - is_active = `false()`
- Add Select after PrepareExpireColumns
- Name: `SelectExpireColumns`
- Keep only:
  - DIM_CustomerKey → CustomerKey
  - EndDate → EndDate
  - is_active → is_active
Why only 3 columns? The expire operation only UPDATES EndDate and is_active. It does not change any business columns. The CustomerKey is the key to find the correct row.
- Add Alter Row
- Name: `MarkExpireUpdate`
- Update if: `true()`
- Add Sink
- Name: `SinkExpire`
- Dataset: `DimCustomer_SCD2`
- Settings:
  - Allow update: checked
  - Key columns: `CustomerKey`
- Mapping: CustomerKey → CustomerKey, EndDate → EndDate, is_active → is_active
What this generates:
UPDATE DimCustomer_SCD2
SET EndDate = '2026-04-20', is_active = 0
WHERE CustomerKey = 1001; -- Alice's old Toronto row
Step I-2: Insert New Version
Also from the ChangedRecords stream (branch from the SAME split output):
- Add Derived Column (branching from ChangedRecords, NOT from the expire path)
- Name: `PrepareNewVersion`
- Columns:
  - StartDate = `currentDate()`
  - EndDate = `toDate('9999-12-31')`
  - is_active = `true()`
- Add Alter Row
- Name: `MarkNewVersionInsert`
- Insert if: `true()`
This stream will be unioned with the NewRecords path.
Step J: Union — Merge All Inserts
- Add Union transformation
- Name: `UnionAllInserts`
- Inputs:
  - Stream 1: `MarkNewInsert` (brand new records)
  - Stream 2: `MarkNewVersionInsert` (new versions of changed records)
Both are INSERT operations going to the same sink.
Step K: Sink Insert — Write New and New-Version Rows
- Add Sink after Union
- Name: `SinkInsert`
- Dataset: `DimCustomer_SCD2`
- Settings:
- Allow insert: checked
- Mapping:
- SRC_CustomerID → CustomerID
- SRC_FirstName → FirstName
- SRC_LastName → LastName
- SRC_City → City
- SRC_StateProvince → StateProvince
- SRC_CompanyName → CompanyName
- SRC_HashValue → HashValue
- StartDate → StartDate
- EndDate → EndDate
- is_active → is_active
- Do NOT map CustomerKey (auto-generated IDENTITY)
Step L: Sink Expire — Already Created in Step I-1
The SinkExpire was already configured. It handles UPDATE operations.
First Run vs Subsequent Runs
First Run (Empty Dim Table)
Source: 4 rows
Lookup: All DIM_ columns are NULL (empty table)
Split: 4 → NewRecords, 0 → Changed, 0 → Unchanged
SinkInsert: INSERT 4 rows
SinkExpire: 0 updates
DimCustomer_SCD2:
| CustomerKey | CustomerID | City | StartDate | EndDate | is_active |
|-------------|-----------|---------|------------|------------|-----------|
| 1 | 100 | Toronto | 2026-04-20 | 9999-12-31 | 1 |
| 2 | 200 | Mumbai | 2026-04-20 | 9999-12-31 | 1 |
| 3 | 300 | London | 2026-04-20 | 9999-12-31 | 1 |
| 4 | 400 | Toronto | 2026-04-20 | 9999-12-31 | 1 |
Second Run (Alice Changed, Frank New)
Source: 5 rows
Lookup: 4 matches, 1 NULL (Frank)
Split: 1 New (Frank), 1 Changed (Alice), 3 Unchanged
SinkExpire: UPDATE 1 row (Alice's old Toronto row → EndDate, is_active=0)
SinkInsert: INSERT 2 rows (Frank + Alice's new Mumbai version)
DimCustomer_SCD2:
| CustomerKey | CustomerID | City | StartDate | EndDate | is_active |
|-------------|-----------|---------|------------|------------|-----------|
| 1 | 100 | Toronto | 2026-04-20 | 2026-04-21 | 0 | ← EXPIRED
| 2 | 200 | Mumbai | 2026-04-20 | 9999-12-31 | 1 |
| 3 | 300 | London | 2026-04-20 | 9999-12-31 | 1 |
| 4 | 400 | Toronto | 2026-04-20 | 9999-12-31 | 1 |
| 5 | 100 | Mumbai | 2026-04-21 | 9999-12-31 | 1 | ← NEW VERSION
| 6 | 500 | Berlin | 2026-04-21 | 9999-12-31 | 1 | ← NEW RECORD
Alice now has TWO rows: CustomerKey=1 (Toronto, expired) and CustomerKey=5 (Mumbai, active).
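The whole expire-and-insert dance can be simulated in a few lines of Python. This is a model of the logic only (hypothetical hash strings; in the real pipeline SQL Server assigns CustomerKey via IDENTITY):

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # the "not expired yet" convention

def apply_scd2(dim, source_rows, run_date, next_key):
    """Apply one run: expire changed rows (SinkExpire) and append new
    rows / new versions (SinkInsert). Returns the next free surrogate key."""
    active = {r["CustomerID"]: r for r in dim if r["is_active"]}
    for src in source_rows:
        current = active.get(src["CustomerID"])
        if current and current["HashValue"] == src["HashValue"]:
            continue                       # unchanged: dropped
        if current:                        # changed: expire the old version
            current["EndDate"] = run_date
            current["is_active"] = False
        dim.append({**src, "CustomerKey": next_key, "StartDate": run_date,
                    "EndDate": HIGH_DATE, "is_active": True})
        next_key += 1
    return next_key

dim = []
run1 = [{"CustomerID": 100, "HashValue": "h_toronto"},
        {"CustomerID": 200, "HashValue": "h_bob"},
        {"CustomerID": 300, "HashValue": "h_carol"},
        {"CustomerID": 400, "HashValue": "h_dave"}]
key = apply_scd2(dim, run1, date(2026, 4, 20), 1)

# Second run: Bob/Carol/Dave unchanged, Alice changed, Frank new
run2 = run1[1:] + [{"CustomerID": 100, "HashValue": "h_mumbai"},
                   {"CustomerID": 500, "HashValue": "h_frank"}]
apply_scd2(dim, run2, date(2026, 4, 21), key)

alice = [r for r in dim if r["CustomerID"] == 100]
print(len(dim), len(alice), sum(r["is_active"] for r in alice))  # 6 2 1
```

Six rows total, two versions of Alice, exactly one of them active, matching the table above.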
The Surrogate Key Strategy
The CustomerKey is an IDENTITY column that auto-increments. The Data Flow does NOT generate it — SQL Server generates it on INSERT.
Why not use CustomerID as the primary key?
Because CustomerID=100 (Alice) now has TWO rows. If CustomerID were the primary key, the second INSERT would fail with a duplicate key violation. The surrogate key (CustomerKey) ensures every row — every VERSION — has a unique identifier.
Multiple Changes Over Time
After Alice moves a third time (Mumbai → Paris), the table shows three versions:
| CustomerKey | CustomerID | City | StartDate | EndDate | is_active |
|-------------|-----------|---------|------------|------------|-----------|
| 1 | 100 | Toronto | 2026-04-20 | 2026-04-21 | 0 |
| 5 | 100 | Mumbai | 2026-04-21 | 2026-06-15 | 0 |
| 8 | 100 | Paris | 2026-06-15 | 9999-12-31 | 1 |
Three versions. Complete history. Fact tables can link to any version.
Querying the SCD Type 2 Table
-- Current state (like Type 1)
SELECT * FROM DimCustomer_SCD2 WHERE is_active = 1;
-- All versions of Alice
SELECT * FROM DimCustomer_SCD2 WHERE CustomerID = 100 ORDER BY StartDate;
-- Where was Alice on 2026-05-01?
SELECT * FROM DimCustomer_SCD2
WHERE CustomerID = 100
AND '2026-05-01' >= StartDate
AND '2026-05-01' < EndDate;
-- Result: Mumbai (CustomerKey=5)
-- Join with fact table for historical accuracy
SELECT f.OrderID, f.Amount, d.City, d.StartDate, d.EndDate
FROM FactOrders f
JOIN DimCustomer_SCD2 d ON f.CustomerKey = d.CustomerKey;
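The point-in-time predicate translates directly to any language. Here is a Python sketch of the same half-open range check (StartDate <= d < EndDate), using made-up history rows for Alice:

```python
from datetime import date

def version_as_of(history, customer_id, as_of):
    """Return the row that was active for a customer on a given date,
    using the same half-open range as the SQL above."""
    for row in history:
        if (row["CustomerID"] == customer_id
                and row["StartDate"] <= as_of < row["EndDate"]):
            return row
    return None  # customer did not exist yet on that date

history = [
    {"CustomerKey": 1, "CustomerID": 100, "City": "Toronto",
     "StartDate": date(2026, 4, 20), "EndDate": date(2026, 4, 21)},
    {"CustomerKey": 5, "CustomerID": 100, "City": "Mumbai",
     "StartDate": date(2026, 4, 21), "EndDate": date(9999, 12, 31)},
]

print(version_as_of(history, 100, date(2026, 5, 1))["City"])   # Mumbai
print(version_as_of(history, 100, date(2026, 4, 20))["City"])  # Toronto
```

The half-open range matters: on the change date itself (2026-04-21), only the new version matches, so a date never resolves to two versions at once.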
Common Errors and Fixes
| Error | Cause | Fix |
|---|---|---|
| “No updateable columns match” | SinkExpire mapping does not match table columns | Ensure column names in Select match exactly: CustomerKey, EndDate, is_active |
| Duplicate CustomerKey | Trying to insert with an explicit CustomerKey value | Do NOT map CustomerKey in SinkInsert — let IDENTITY auto-generate |
| All rows going to NewRecords on second run | ExistingDimTable source not filtering `WHERE is_active = 1` | Add the filter so only current rows are compared |
| Hash always different | NULL handling in hash formula | Use coalesce() for every column in the hash |
| Expire sink updating wrong rows | Key column not set to CustomerKey | Set Key columns to CustomerKey in SinkExpire settings |
| Changed records not branching to both paths | Only one path from ChangedRecords | Branch from the Conditional Split output, not from a downstream transformation |
Interview Questions
Q: How does SCD Type 2 differ from Type 1 in terms of pipeline design?
A: Type 1 has one sink (insert + update to the same table). Type 2 needs two sinks: one for inserting new records and new versions (SinkInsert), and one for expiring old records (SinkExpire with UPDATE). The changed records path splits into two parallel operations: expire the old row AND insert the new version.
Q: Why do you need a surrogate key in SCD Type 2?
A: Because one business entity (CustomerID=100) can have multiple rows representing different time periods. The surrogate key (CustomerKey) uniquely identifies each VERSION. Without it, you cannot have multiple rows for the same customer. Fact tables reference the surrogate key to link to the correct historical version.
Q: How do you query the current state of an SCD Type 2 table?
A: WHERE is_active = 1 returns only the latest version of each record. For point-in-time queries, use WHERE date >= StartDate AND date < EndDate to find the version that was active on a specific date.
Q: Why does the Lookup only read active records from the dimension table?
A: Because we only need to compare against CURRENT records for change detection. If Alice has 3 historical rows, we only compare the source with her active row. Expired rows are irrelevant for detecting new changes.
Q: Is this pipeline idempotent?
A: Yes, for unchanged data. If you run it with the same source file, all hashes match, everything goes to Unchanged (dropped), and nothing is inserted or expired. However, if the source has changes, running it twice would create duplicate new versions, so in production you should process each source file exactly once.
Wrapping Up
SCD Type 2 is the gold standard for dimension tables that need historical tracking. The pipeline is more complex than Type 1 (two sinks instead of one, expire logic, surrogate keys), but the payoff is massive: complete history, accurate point-in-time reporting, and regulatory compliance.
The pattern we built — hash-based detection, conditional split, parallel expire+insert paths, dual sinks — works for any dimension table. Change the source, change the columns in the hash, and the rest of the flow stays the same.
Related posts:
- SCD Type 1 Hash Pipeline
- SCD Types Explained (0, 1, 2, 3, 6)
- Data Flows Guide
- Joins in Data Flows
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.