Building an SCD Type 2 Pipeline in Azure Synapse Data Flows: Full History with Hash-Based Change Detection

In the previous post on the SCD Type 1 hash pipeline, we built a flow that detects changes and overwrites old values. Simple, effective, but history is lost: Alice moved from Toronto to Mumbai, and Toronto disappeared forever.

SCD Type 2 fixes this. When Alice moves, we do NOT overwrite Toronto. Instead, we expire the Toronto row (set an end date and mark it inactive) and insert a new row with Mumbai. Both versions coexist. You can answer “where does Alice live NOW?” and “where did Alice live in 2023?” from the same table.

Think of it like a passport system. When you renew your passport, the old one gets a “cancelled” stamp and an expiry date. The new passport gets issued with today’s date. Immigration can look at both passports and see your complete travel history. You never lose the old record.

This post walks through building the complete SCD Type 2 pipeline using Synapse Data Flows — every transformation, every expression, and the exact logic for expiring old records and inserting new versions.

Table of Contents

  • How SCD Type 2 Differs from Type 1
  • The Target Table Structure
  • The Complete Pipeline Architecture
  • The Source Data
  • Building the Data Flow: Step by Step
  • Step A: Source — Read CSV from ADLS Gen2
  • Step B: Derived Column — Generate Source Hash
  • Step C: Select — Prefix with SRC_
  • Step D: Source 2 — Read Existing Dimension Table
  • Step E: Select — Prefix Existing with DIM_
  • Step F: Lookup — Join Source with Existing Dimension
  • Step G: Conditional Split — New, Changed, Unchanged
  • Step H: The New Records Path (Insert with Surrogate Key)
  • Step I: The Changed Records Path — Two Operations
  • Step I-1: Expire Old Row (Update EndDate and is_active)
  • Step I-2: Insert New Version (New Row with Current Values)
  • Step J: Union — Merge All Inserts
  • Step K: Sink Insert — Write New and New-Version Rows
  • Step L: Sink Expire — Update Existing Rows
  • First Run vs Subsequent Runs
  • The Surrogate Key Strategy
  • Multiple Changes Over Time
  • Querying the SCD Type 2 Table
  • Common Errors and Fixes
  • Interview Questions
  • Wrapping Up

How SCD Type 2 Differs from Type 1

| Aspect | Type 1 (Previous Post) | Type 2 (This Post) |
|---|---|---|
| When data changes | Overwrite old value | Expire old row, insert new row |
| History | Lost | Preserved |
| Row count | Fixed (one row per entity) | Grows (multiple rows per entity) |
| Key | Business key (employee_id) | Surrogate key (CustomerKey) + business key (CustomerID) |
| Audit columns | updated_tms | StartDate, EndDate, is_active |
| Sinks needed | 1 (insert + update) | 2 (one for inserts, one for expire updates) |

Real-life analogy: Type 1 is updating a phone contact; the old number is gone. Type 2 is your phone bill history; every number you ever had is listed with its dates of service.

The Target Table Structure

CREATE TABLE DimCustomer_SCD2 (
    CustomerKey     INT IDENTITY(1,1) PRIMARY KEY,  -- Surrogate key
    CustomerID      INT NOT NULL,                     -- Business key
    FirstName       VARCHAR(100),
    LastName        VARCHAR(100),
    City            VARCHAR(50),
    StateProvince   VARCHAR(50),
    CompanyName     VARCHAR(200),
    HashValue       VARCHAR(64),                      -- SHA-256 hash of all business columns
    StartDate       DATE DEFAULT CAST(GETDATE() AS DATE),
    EndDate         DATE DEFAULT '9999-12-31',
    is_active       BIT DEFAULT 1
);

Why Two Keys?

CustomerID (business key) = identifies the PERSON. Alice is always CustomerID=100, whether she lives in Toronto or Mumbai.

CustomerKey (surrogate key) = identifies the VERSION. Alice in Toronto is CustomerKey=1001. Alice in Mumbai is CustomerKey=1002. Different keys for different versions of the same person.

Fact tables reference CustomerKey, not CustomerID. This is how an order from 2023 links to Alice-in-Toronto, and an order from 2026 links to Alice-in-Mumbai.

Real-life analogy: Your national ID number (CustomerID) stays the same your whole life. But your passport number (CustomerKey) changes every time you renew. When immigration checks which country you visited in 2020, they look at the passport number (CustomerKey) that was active then — not your current passport.

The Magic Columns

| Column | Purpose | Example (Alice in Toronto) | Example (Alice in Mumbai) |
|---|---|---|---|
| StartDate | When this version became active | 2022-03-15 | 2026-01-15 |
| EndDate | When this version expired | 2026-01-15 | 9999-12-31 (still active) |
| is_active | Is this the current version? | 0 (expired) | 1 (current) |
| HashValue | Fingerprint of all business columns | a3f7b2c1… | e9d4f8a6… |

Note that the old version's EndDate equals the new version's StartDate: the intervals are half-open, so every date falls into exactly one version.

9999-12-31 is a convention meaning “this version has not expired yet.” It is easier to query than NULL and works with date range comparisons.
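The half-open [StartDate, EndDate) convention is easy to check in code. Here is a minimal Python sketch (illustrative only; the dates are hypothetical values echoing the Alice example):

```python
from datetime import date

# Illustrative sketch only (dates are hypothetical, echoing the Alice example).
# The 9999-12-31 sentinel lets one half-open range check [StartDate, EndDate)
# find the active version -- no special NULL handling for the current row.
OPEN_END = date(9999, 12, 31)

def version_active_on(start: date, end: date, as_of: date) -> bool:
    """True if a version valid over [start, end) covered the given date."""
    return start <= as_of < end

toronto = (date(2022, 3, 15), date(2026, 1, 15))  # expired version
mumbai = (date(2026, 1, 15), OPEN_END)            # current version

print(version_active_on(*toronto, as_of=date(2023, 6, 1)))  # True
print(version_active_on(*mumbai, as_of=date(2026, 2, 1)))   # True
```

Because the intervals are half-open and the old EndDate equals the new StartDate, every date matches exactly one version with no gaps and no overlaps.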

The Complete Pipeline Architecture

ADLS Gen2 (CSV)       Synapse Data Flow                          Azure SQL DB
┌────────────┐   ┌───────────────────────────────────────┐    ┌──────────────────┐
│            │   │                                       │    │                  │
│ source.csv │──>│ Source → Hash → Select(SRC_)          │    │ DimCustomer_SCD2 │
│            │   │                  │                    │    │                  │
│            │   │            Lookup ←── Source2(Dim)    │    │ CustomerKey (PK) │
│            │   │                  │     Select(DIM_)   │<───│ CustomerID       │
│            │   │           Conditional Split           │    │ HashValue        │
│            │   │          /       │         \          │    │ StartDate        │
│            │   │       New      Changed   Unchanged    │    │ EndDate          │
│            │   │        │      /       \    (drop)     │    │ is_active        │
│            │   │        │   Expire   InsertNew         │    │                  │
│            │   │        │  (update)  (insert)          │    │                  │
│            │   │        │      │         │             │    │                  │
│            │   │        │  SinkExpire    │             │    │                  │
│            │   │        └────────┬───────┘             │    │                  │
│            │   │     Union (New + InsertNew)           │    │                  │
│            │   │                 │                     │    │                  │
│            │   │            SinkInsert ────────────────┼───>│                  │
└────────────┘   └───────────────────────────────────────┘    └──────────────────┘

Key difference from Type 1: two sinks instead of one.

  • SinkInsert — inserts brand new records AND new versions of changed records
  • SinkExpire — updates existing rows to set EndDate and is_active = 0

The Source Data

Same CSV source as our Type 1 pipeline:

CustomerID,FirstName,LastName,City,StateProvince,CompanyName
100,Alice,Johnson,Toronto,Ontario,Acme Corp
200,Bob,Smith,Mumbai,Maharashtra,TechStart Inc
300,Carol,Williams,London,,Global Finance
400,Dave,Brown,Toronto,Ontario,DataPro Ltd

On subsequent run with changes:

CustomerID,FirstName,LastName,City,StateProvince,CompanyName
100,Alice,Johnson,Mumbai,Maharashtra,Acme Corp
200,Bob,Smith,Mumbai,Maharashtra,TechStart Inc
300,Carol,Williams,London,,Global Finance
400,Dave,Brown,Toronto,Ontario,DataPro Ltd
500,Frank,Taylor,Berlin,Berlin,Euro Systems

Alice moved to Mumbai. Frank is new. Bob, Carol, Dave unchanged.

Building the Data Flow: Step by Step

Step A: Source — Read CSV from ADLS Gen2

  1. Add Source transformation
  2. Name: SourceCSV
  3. Dataset: your parameterized ADLS CSV dataset
  4. Preview: 4 rows (first run) or 5 rows (with Frank)

Step B: Derived Column — Generate Source Hash

  1. Add Derived Column after Source
  2. Name: GenerateHash
  3. New column: SRC_HashValue
  4. Expression:
sha2(256, concat(
    toString(CustomerID), '|',
    coalesce(FirstName,''), '|',
    coalesce(LastName,''), '|',
    coalesce(City,''), '|',
    coalesce(StateProvince,''), '|',
    coalesce(CompanyName,'')
))

Why coalesce? In Data Flow expressions, concat() returns NULL if any argument is NULL, so every row containing a NULL would end up with the same NULL hash and changes in those rows would never be detected. coalesce(column, '') converts NULL to an empty string, so each row's hash reflects its actual values.
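The same logic can be sketched in Python to see why the NULL handling matters. This is an illustration only; the Data Flow sha2() expression above is what actually runs, and this sketch is not guaranteed to produce byte-identical digests:

```python
import hashlib

# Python sketch of the GenerateHash expression (illustration only; the
# pipeline uses the Data Flow sha2() function, not this code).
def row_hash(row: dict) -> str:
    cols = ["CustomerID", "FirstName", "LastName", "City",
            "StateProvince", "CompanyName"]
    # coalesce(col, '') equivalent: map None to '' so one NULL column
    # cannot NULL out the whole concatenation.
    payload = "|".join("" if row.get(c) is None else str(row[c]) for c in cols)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Carol has a NULL StateProvince but still gets a unique, stable hash.
carol = {"CustomerID": 300, "FirstName": "Carol", "LastName": "Williams",
         "City": "London", "StateProvince": None, "CompanyName": "Global Finance"}
print(row_hash(carol))  # 64-char hex digest
```

The `|` delimiter matters too: without it, ("ab", "c") and ("a", "bc") would concatenate to the same string and collide.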

Step C: Select — Prefix with SRC_

  1. Add Select after GenerateHash
  2. Name: SelectSourceColumns
  3. Rename all columns:
     • CustomerID → SRC_CustomerID
     • FirstName → SRC_FirstName
     • LastName → SRC_LastName
     • City → SRC_City
     • StateProvince → SRC_StateProvince
     • CompanyName → SRC_CompanyName
     • SRC_HashValue → SRC_HashValue (keep as is)

Step D: Source 2 — Read Existing Dimension Table

  1. Add a second Source transformation (separate from Source A)
  2. Name: ExistingDimTable
  3. Source type: Inline or Dataset pointing to DimCustomer_SCD2
  4. Query: SELECT CustomerKey, CustomerID, HashValue FROM DimCustomer_SCD2 WHERE is_active = 1

Why WHERE is_active = 1? We only compare against CURRENT records. Expired historical rows are irrelevant for change detection.

Step E: Select — Prefix Existing with DIM_

  1. Add Select after ExistingDimTable
  2. Name: SelectExistingColumns
  3. Rename:
     • CustomerKey → DIM_CustomerKey
     • CustomerID → DIM_CustomerID
     • HashValue → DIM_HashValue

Step F: Lookup — Join Source with Existing Dimension

  1. Add Lookup after SelectSourceColumns
  2. Name: LookupExisting
  3. Primary stream: SelectSourceColumns
  4. Lookup stream: SelectExistingColumns
  5. Condition: SRC_CustomerID == DIM_CustomerID
  6. Lookup type: Left outer (keep all source rows)
  7. Multiple matches: First match

Step G: Conditional Split — New, Changed, Unchanged

  1. Add Conditional Split after LookupExisting
  2. Name: SplitNewChangedUnchanged
  3. Conditions:
| Stream Name | Condition |
|---|---|
| NewRecords | isNull(DIM_CustomerID) |
| ChangedRecords | SRC_HashValue != DIM_HashValue |
| UnchangedRecords | Default stream |

Same as Type 1 — the split logic is identical.
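A Python sketch of the split logic, assuming each row carries the SRC_ columns plus the DIM_ columns attached by the left-outer Lookup (None where there was no match; the hash strings here are made-up placeholders):

```python
# Sketch of the three-way split (illustrative; stream and column names follow
# the flow above). After the left-outer Lookup, unmatched source rows carry
# None in the DIM_ columns.
def split_row(row: dict) -> str:
    if row.get("DIM_CustomerID") is None:        # isNull(DIM_CustomerID)
        return "NewRecords"
    if row["SRC_HashValue"] != row["DIM_HashValue"]:
        return "ChangedRecords"
    return "UnchangedRecords"                    # default stream

rows = [  # hash values are placeholders, not real digests
    {"SRC_CustomerID": 500, "SRC_HashValue": "f1a2", "DIM_CustomerID": None, "DIM_HashValue": None},
    {"SRC_CustomerID": 100, "SRC_HashValue": "e9d4", "DIM_CustomerID": 100, "DIM_HashValue": "a3f7"},
    {"SRC_CustomerID": 200, "SRC_HashValue": "b2c1", "DIM_CustomerID": 200, "DIM_HashValue": "b2c1"},
]
print([split_row(r) for r in rows])  # ['NewRecords', 'ChangedRecords', 'UnchangedRecords']
```

The order of the checks mirrors the Conditional Split: a row is only tested for a hash mismatch after it has failed the "new record" test.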

Step H: The New Records Path (Insert)

  1. Add Derived Column after NewRecords
  2. Name: PrepareNewInsert
  3. Add columns:
     • StartDate = currentDate()
     • EndDate = toDate('9999-12-31')
     • is_active = true()
  4. Add Alter Row after PrepareNewInsert
  5. Name: MarkNewInsert
  6. Insert if: true()

Step I: The Changed Records Path — Two Operations

This is where Type 2 differs fundamentally from Type 1. A changed record requires TWO operations:

  1. Expire the old row — UPDATE the existing dim table row (set EndDate and is_active=0)
  2. Insert the new version — INSERT a new row with current values

These go to TWO different sinks.
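To make the two-step mechanics concrete, here is a toy Python sketch that applies both operations to an in-memory list standing in for DimCustomer_SCD2. The surrogate key is hand-incremented here; in the real table, IDENTITY assigns it on INSERT:

```python
from datetime import date

# Toy in-memory version of the two sink operations for one changed record.
# A list of dicts stands in for DimCustomer_SCD2.
def apply_change(dim: list, business_key: int, new_values: dict, today: date) -> None:
    # SinkExpire: UPDATE the active row -- set EndDate and is_active = 0.
    for row in dim:
        if row["CustomerID"] == business_key and row["is_active"]:
            row["EndDate"] = today
            row["is_active"] = False
    # SinkInsert: INSERT the new version with a fresh surrogate key
    # (hand-incremented here; IDENTITY does this in the real table).
    dim.append({"CustomerKey": max(r["CustomerKey"] for r in dim) + 1,
                "CustomerID": business_key, **new_values,
                "StartDate": today, "EndDate": date(9999, 12, 31), "is_active": True})

dim = [{"CustomerKey": 1, "CustomerID": 100, "City": "Toronto",
        "StartDate": date(2026, 4, 20), "EndDate": date(9999, 12, 31), "is_active": True}]
apply_change(dim, 100, {"City": "Mumbai"}, today=date(2026, 4, 21))
print(len(dim), dim[0]["is_active"], dim[1]["City"])  # 2 False Mumbai
```

One changed source row produces exactly two table operations: one UPDATE and one INSERT. That is why the changed path must branch into two parallel streams.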

Step I-1: Expire Old Row (Select + Alter Row + Sink)

From the ChangedRecords stream:

  1. Add Derived Column
  2. Name: PrepareExpireColumns
  3. Columns:
     • EndDate = currentDate()
     • is_active = false()
  4. Add Select after PrepareExpireColumns
  5. Name: SelectExpireColumns
  6. Keep only:
     • DIM_CustomerKey → CustomerKey
     • EndDate → EndDate
     • is_active → is_active

Why only 3 columns? The expire operation only UPDATES EndDate and is_active. It does not change any business columns. The CustomerKey is the key to find the correct row.

  1. Add Alter Row
  2. Name: MarkExpireUpdate
  3. Update if: true()

  4. Add Sink

  5. Name: SinkExpire
  6. Dataset: DimCustomer_SCD2
  7. Settings:
    • Allow update: checked
    • Key columns: CustomerKey
  8. Mapping: CustomerKey → CustomerKey, EndDate → EndDate, is_active → is_active

What this generates:

UPDATE DimCustomer_SCD2
SET EndDate = '2026-04-20', is_active = 0
WHERE CustomerKey = 1001;  -- Alice's old Toronto row

Step I-2: Insert New Version

Also from the ChangedRecords stream (branch from the SAME split output):

  1. Add Derived Column (branching from ChangedRecords, NOT from the expire path)
  2. Name: PrepareNewVersion
  3. Columns:
     • StartDate = currentDate()
     • EndDate = toDate('9999-12-31')
     • is_active = true()
  4. Add Alter Row
  5. Name: MarkNewVersionInsert
  6. Insert if: true()

This stream will be unioned with the NewRecords path.

Step J: Union — Merge All Inserts

  1. Add Union transformation
  2. Name: UnionAllInserts
  3. Inputs:
     • Stream 1: MarkNewInsert (brand new records)
     • Stream 2: MarkNewVersionInsert (new versions of changed records)

Both are INSERT operations going to the same sink.

Step K: Sink Insert — Write New and New-Version Rows

  1. Add Sink after Union
  2. Name: SinkInsert
  3. Dataset: DimCustomer_SCD2
  4. Settings:
     • Allow insert: checked
  5. Mapping:
     • SRC_CustomerID → CustomerID
     • SRC_FirstName → FirstName
     • SRC_LastName → LastName
     • SRC_City → City
     • SRC_StateProvince → StateProvince
     • SRC_CompanyName → CompanyName
     • SRC_HashValue → HashValue
     • StartDate → StartDate
     • EndDate → EndDate
     • is_active → is_active
     • Do NOT map CustomerKey (auto-generated IDENTITY)

Step L: Sink Expire — Already Created in Step I-1

The SinkExpire was already configured. It handles UPDATE operations.

First Run vs Subsequent Runs

First Run (Empty Dim Table)

Source: 4 rows
Lookup: All DIM_ columns are NULL (empty table)
Split: 4 → NewRecords, 0 → Changed, 0 → Unchanged
SinkInsert: INSERT 4 rows
SinkExpire: 0 updates

DimCustomer_SCD2:
| CustomerKey | CustomerID | City    | StartDate  | EndDate    | is_active |
|-------------|-----------|---------|------------|------------|-----------|
| 1           | 100       | Toronto | 2026-04-20 | 9999-12-31 | 1         |
| 2           | 200       | Mumbai  | 2026-04-20 | 9999-12-31 | 1         |
| 3           | 300       | London  | 2026-04-20 | 9999-12-31 | 1         |
| 4           | 400       | Toronto | 2026-04-20 | 9999-12-31 | 1         |

Second Run (Alice Changed, Frank New)

Source: 5 rows
Lookup: 4 matches, 1 NULL (Frank)
Split: 1 New (Frank), 1 Changed (Alice), 3 Unchanged

SinkExpire: UPDATE 1 row (Alice's old Toronto row → EndDate, is_active=0)
SinkInsert: INSERT 2 rows (Frank + Alice's new Mumbai version)

DimCustomer_SCD2:
| CustomerKey | CustomerID | City    | StartDate  | EndDate    | is_active |
|-------------|-----------|---------|------------|------------|-----------|
| 1           | 100       | Toronto | 2026-04-20 | 2026-04-21 | 0         | ← EXPIRED
| 2           | 200       | Mumbai  | 2026-04-20 | 9999-12-31 | 1         |
| 3           | 300       | London  | 2026-04-20 | 9999-12-31 | 1         |
| 4           | 400       | Toronto | 2026-04-20 | 9999-12-31 | 1         |
| 5           | 100       | Mumbai  | 2026-04-21 | 9999-12-31 | 1         | ← NEW VERSION
| 6           | 500       | Berlin  | 2026-04-21 | 9999-12-31 | 1         | ← NEW RECORD

Alice now has TWO rows: CustomerKey=1 (Toronto, expired) and CustomerKey=5 (Mumbai, active).

The Surrogate Key Strategy

The CustomerKey is an IDENTITY column that auto-increments. The Data Flow does NOT generate it — SQL Server generates it on INSERT.

Why not use CustomerID as the primary key?

Because CustomerID=100 (Alice) now has TWO rows. If CustomerID were the primary key, the second INSERT would fail with a duplicate key violation. The surrogate key (CustomerKey) ensures every row — every VERSION — has a unique identifier.
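A quick Python illustration of the point (hypothetical, not pipeline code): storage keyed by the business key can hold only one version per customer, while storage keyed by a surrogate keeps them all:

```python
# Hypothetical illustration: why the business key alone cannot be the
# primary key once history is kept.
by_business_key = {}
by_business_key[100] = {"City": "Toronto"}
by_business_key[100] = {"City": "Mumbai"}  # overwrites Toronto; history lost
print(len(by_business_key))                # 1

# Keyed by surrogate key, both versions of CustomerID=100 coexist.
by_surrogate_key = {
    1: {"CustomerID": 100, "City": "Toronto"},
    5: {"CustomerID": 100, "City": "Mumbai"},
}
print(len(by_surrogate_key))               # 2
```

In SQL the second insert would fail with a duplicate key violation rather than silently overwrite, but the consequence is the same: one business key cannot index two versions.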

Multiple Changes Over Time

After Alice moves a third time (Mumbai → Paris), the table shows three versions:

| CustomerKey | CustomerID | City    | StartDate  | EndDate    | is_active |
|-------------|-----------|---------|------------|------------|-----------|
| 1           | 100       | Toronto | 2026-04-20 | 2026-04-21 | 0         |
| 5           | 100       | Mumbai  | 2026-04-21 | 2026-06-15 | 0         |
| 8           | 100       | Paris   | 2026-06-15 | 9999-12-31 | 1         |

Three versions. Complete history. Fact tables can link to any version.

Querying the SCD Type 2 Table

-- Current state (like Type 1)
SELECT * FROM DimCustomer_SCD2 WHERE is_active = 1;

-- All versions of Alice
SELECT * FROM DimCustomer_SCD2 WHERE CustomerID = 100 ORDER BY StartDate;

-- Where was Alice on 2026-05-01?
SELECT * FROM DimCustomer_SCD2
WHERE CustomerID = 100
  AND '2026-05-01' >= StartDate
  AND '2026-05-01' < EndDate;
-- Result: Mumbai (CustomerKey=5)

-- Join with fact table for historical accuracy
SELECT f.OrderID, f.Amount, d.City, d.StartDate, d.EndDate
FROM FactOrders f
JOIN DimCustomer_SCD2 d ON f.CustomerKey = d.CustomerKey;

Common Errors and Fixes

| Error | Cause | Fix |
|---|---|---|
| "No updateable columns match" | SinkExpire mapping does not match table columns | Ensure column names in Select match exactly: CustomerKey, EndDate, is_active |
| Duplicate CustomerKey | Trying to insert with an explicit CustomerKey value | Do NOT map CustomerKey in SinkInsert — let IDENTITY auto-generate |
| All rows going to NewRecords on second run | ExistingDimTable source not filtering WHERE is_active = 1 | Add the filter so only current rows are compared |
| Hash always different | NULL handling in hash formula | Use coalesce() for every column in the hash |
| Expire sink updating wrong rows | Key column not set to CustomerKey | Set Key columns to CustomerKey in SinkExpire settings |
| Changed records not branching to both paths | Only one path from ChangedRecords | Branch from the Conditional Split output, not from a downstream transformation |

Interview Questions

Q: How does SCD Type 2 differ from Type 1 in terms of pipeline design?
A: Type 1 has one sink (insert + update to the same table). Type 2 needs two sinks — one for inserting new records and new versions (SinkInsert), and one for expiring old records (SinkExpire with UPDATE). The changed records path splits into two parallel operations: expire the old row AND insert the new version.

Q: Why do you need a surrogate key in SCD Type 2?
A: Because one business entity (CustomerID=100) can have multiple rows representing different time periods. The surrogate key (CustomerKey) uniquely identifies each VERSION. Without it, you cannot have multiple rows for the same customer. Fact tables reference the surrogate key to link to the correct historical version.

Q: How do you query the current state of an SCD Type 2 table?
A: WHERE is_active = 1 returns only the latest version of each record. For point-in-time queries, use WHERE date >= StartDate AND date < EndDate to find the version that was active on a specific date.

Q: Why does the Lookup only read active records from the dimension table?
A: Because we only need to compare against CURRENT records for change detection. If Alice has 3 historical rows, we only compare the source with her active row. Expired rows are irrelevant for detecting new changes.

Q: Is this pipeline idempotent?
A: Yes, for unchanged data. If you run it with the same source file, all hashes match, everything goes to Unchanged (dropped), and nothing is inserted or expired. However, if the source has changes, running it twice would create duplicate new versions — so in production, you should process each source file exactly once.

Wrapping Up

SCD Type 2 is the gold standard for dimension tables that need historical tracking. The pipeline is more complex than Type 1 (two sinks instead of one, expire logic, surrogate keys), but the payoff is massive: complete history, accurate point-in-time reporting, and regulatory compliance.

The pattern we built — hash-based detection, conditional split, parallel expire+insert paths, dual sinks — works for any dimension table. Change the source, change the columns in the hash, and the rest of the flow stays the same.

Related posts:
  • SCD Type 1 Hash Pipeline
  • SCD Types Explained (0, 1, 2, 3, 6)
  • Data Flows Guide
  • Joins in Data Flows


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
