Building an SCD Type 2 Pipeline in Azure Synapse Data Flows: Full History with Hash-Based Change Detection

In the previous post on the SCD Type 1 hash pipeline, we built a flow that detects changes and overwrites old values. Simple, effective, but history is lost: Alice moved from Toronto to Mumbai, and Toronto disappeared forever.

SCD Type 2 fixes this. When Alice moves, we do NOT overwrite Toronto. Instead, we expire the Toronto row (set an end date and mark it inactive) and insert a new row with Mumbai. Both versions coexist. You can answer “where does Alice live NOW?” and “where did Alice live in 2023?” from the same table.

Think of it like a passport system. When you renew your passport, the old one gets a “cancelled” stamp and an expiry date. The new passport gets issued with today’s date. Immigration can look at both passports and see your complete travel history. You never lose the old record.

This post walks through building the complete SCD Type 2 pipeline using Synapse Data Flows — every transformation, every expression, and the exact logic for expiring old records and inserting new versions.

Table of Contents

  • How SCD Type 2 Differs from Type 1
  • The Target Table Structure
  • The Complete Pipeline Architecture
  • The Source Data
  • Building the Data Flow: Step by Step
  • Step A: Source — Read CSV from ADLS Gen2
  • Step B: Derived Column — Generate Source Hash
  • Step C: Select — Prefix with SRC_
  • Step D: Source 2 — Read Existing Dimension Table
  • Step E: Select — Prefix Existing with DIM_
  • Step F: Lookup — Join Source with Existing Dimension
  • Step G: Conditional Split — New, Changed, Unchanged
  • Step H: The New Records Path (Insert with Surrogate Key)
  • Step I: The Changed Records Path — Two Operations
  • Step I-1: Expire Old Row (Update EndDate and is_active)
  • Step I-2: Insert New Version (New Row with Current Values)
  • Step J: Union — Merge All Inserts
  • Step K: Sink Insert — Write New and New-Version Rows
  • Step L: Sink Expire — Update Existing Rows
  • First Run vs Subsequent Runs
  • The Surrogate Key Strategy
  • Multiple Changes Over Time
  • Querying the SCD Type 2 Table
  • Common Errors and Fixes
  • Interview Questions
  • Wrapping Up

How SCD Type 2 Differs from Type 1

| Aspect | Type 1 (Previous Post) | Type 2 (This Post) |
|---|---|---|
| When data changes | Overwrite old value | Expire old row, insert new row |
| History | Lost | Preserved |
| Row count | Fixed (one row per entity) | Grows (multiple rows per entity) |
| Key | Business key (employee_id) | Surrogate key (CustomerKey) + business key (CustomerID) |
| Audit columns | updated_tms | StartDate, EndDate, is_active |
| Sinks needed | 1 (insert + update) | 2 (one for inserts, one for expire updates) |

Real-life analogy: Type 1 is updating a phone contact; the old number is gone. Type 2 is your phone bill history; every number you ever had is listed with its dates of service.

The Target Table Structure

CREATE TABLE DimCustomer_SCD2 (
    CustomerKey     INT IDENTITY(1,1) PRIMARY KEY,  -- Surrogate key
    CustomerID      INT NOT NULL,                     -- Business key
    FirstName       VARCHAR(100),
    LastName        VARCHAR(100),
    City            VARCHAR(50),
    StateProvince   VARCHAR(50),
    CompanyName     VARCHAR(200),
    HashValue       VARCHAR(64),                      -- SHA-256 hash of all business columns
    StartDate       DATE DEFAULT CAST(GETDATE() AS DATE),
    EndDate         DATE DEFAULT '9999-12-31',
    is_active       BIT DEFAULT 1
);

Why Two Keys?

CustomerID (business key) = identifies the PERSON. Alice is always CustomerID=100, whether she lives in Toronto or Mumbai.

CustomerKey (surrogate key) = identifies the VERSION. Alice in Toronto is CustomerKey=1001. Alice in Mumbai is CustomerKey=1002. Different keys for different versions of the same person.

Fact tables reference CustomerKey, not CustomerID. This is how an order from 2023 links to Alice-in-Toronto, and an order from 2026 links to Alice-in-Mumbai.

Real-life analogy: Your national ID number (CustomerID) stays the same your whole life. But your passport number (CustomerKey) changes every time you renew. When immigration checks which country you visited in 2020, they look at the passport number (CustomerKey) that was active then — not your current passport.

The Magic Columns

| Column | Purpose | Example (Alice in Toronto) | Example (Alice in Mumbai) |
|---|---|---|---|
| StartDate | When this version became active | 2022-03-15 | 2026-01-15 |
| EndDate | When this version expired | 2026-01-15 | 9999-12-31 (still active) |
| is_active | Is this the current version? | 0 (expired) | 1 (current) |
| HashValue | Fingerprint of all business columns | a3f7b2c1… | e9d4f8a6… |

Note that the old version's EndDate equals the new version's StartDate: the intervals are half-open, so every date falls into exactly one version.

9999-12-31 is a convention meaning “this version has not expired yet.” It is easier to query than NULL and works with date range comparisons.
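The half-open [StartDate, EndDate) convention is easy to check in code. Here is a minimal Python sketch (illustrative only; the dates are hypothetical values echoing the Alice example):

```python
from datetime import date

# Illustrative sketch only (dates are hypothetical, echoing the Alice example).
# The 9999-12-31 sentinel lets one half-open range check [StartDate, EndDate)
# find the active version -- no special NULL handling for the current row.
OPEN_END = date(9999, 12, 31)

def version_active_on(start: date, end: date, as_of: date) -> bool:
    """True if a version valid over [start, end) covered the given date."""
    return start <= as_of < end

toronto = (date(2022, 3, 15), date(2026, 1, 15))  # expired version
mumbai = (date(2026, 1, 15), OPEN_END)            # current version

print(version_active_on(*toronto, as_of=date(2023, 6, 1)))  # True
print(version_active_on(*mumbai, as_of=date(2026, 2, 1)))   # True
```

Because the intervals are half-open and the old EndDate equals the new StartDate, every date matches exactly one version with no gaps and no overlaps.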

The Complete Pipeline Architecture

ADLS Gen2 (CSV)       Synapse Data Flow                          Azure SQL DB
┌────────────┐   ┌───────────────────────────────────────┐    ┌──────────────────┐
│            │   │                                       │    │                  │
│ source.csv │──>│ Source → Hash → Select(SRC_)          │    │ DimCustomer_SCD2 │
│            │   │                  │                    │    │                  │
│            │   │            Lookup ←── Source2(Dim)    │    │ CustomerKey (PK) │
│            │   │                  │     Select(DIM_)   │<───│ CustomerID       │
│            │   │           Conditional Split           │    │ HashValue        │
│            │   │          /       │         \          │    │ StartDate        │
│            │   │       New      Changed   Unchanged    │    │ EndDate          │
│            │   │        │      /       \    (drop)     │    │ is_active        │
│            │   │        │   Expire   InsertNew         │    │                  │
│            │   │        │  (update)  (insert)          │    │                  │
│            │   │        │      │         │             │    │                  │
│            │   │        │  SinkExpire    │             │    │                  │
│            │   │        └────────┬───────┘             │    │                  │
│            │   │     Union (New + InsertNew)           │    │                  │
│            │   │                 │                     │    │                  │
│            │   │            SinkInsert ────────────────┼───>│                  │
└────────────┘   └───────────────────────────────────────┘    └──────────────────┘

Key difference from Type 1: two sinks instead of one.

  • SinkInsert — inserts brand new records AND new versions of changed records
  • SinkExpire — updates existing rows to set EndDate and is_active = 0

The Source Data

Same CSV source as our Type 1 pipeline:

CustomerID,FirstName,LastName,City,StateProvince,CompanyName
100,Alice,Johnson,Toronto,Ontario,Acme Corp
200,Bob,Smith,Mumbai,Maharashtra,TechStart Inc
300,Carol,Williams,London,,Global Finance
400,Dave,Brown,Toronto,Ontario,DataPro Ltd

On subsequent run with changes:

CustomerID,FirstName,LastName,City,StateProvince,CompanyName
100,Alice,Johnson,Mumbai,Maharashtra,Acme Corp
200,Bob,Smith,Mumbai,Maharashtra,TechStart Inc
300,Carol,Williams,London,,Global Finance
400,Dave,Brown,Toronto,Ontario,DataPro Ltd
500,Frank,Taylor,Berlin,Berlin,Euro Systems

Alice moved to Mumbai. Frank is new. Bob, Carol, Dave unchanged.

Building the Data Flow: Step by Step

Step A: Source — Read CSV from ADLS Gen2

  1. Add Source transformation
  2. Name: SourceCSV
  3. Dataset: your parameterized ADLS CSV dataset
  4. Preview: 4 rows (first run) or 5 rows (with Frank)

Step B: Derived Column — Generate Source Hash

  1. Add Derived Column after Source
  2. Name: GenerateHash
  3. New column: SRC_HashValue
  4. Expression:
sha2(256, concat(
    toString(CustomerID), '|',
    coalesce(FirstName,''), '|',
    coalesce(LastName,''), '|',
    coalesce(City,''), '|',
    coalesce(StateProvince,''), '|',
    coalesce(CompanyName,'')
))

Why coalesce? In Data Flow expressions, concat() returns NULL if any argument is NULL, so every row containing a NULL would end up with the same NULL hash and changes in those rows would never be detected. coalesce(column, '') converts NULL to an empty string, so each row's hash reflects its actual values.
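The same logic can be sketched in Python to see why the NULL handling matters. This is an illustration only; the Data Flow sha2() expression above is what actually runs, and this sketch is not guaranteed to produce byte-identical digests:

```python
import hashlib

# Python sketch of the GenerateHash expression (illustration only; the
# pipeline uses the Data Flow sha2() function, not this code).
def row_hash(row: dict) -> str:
    cols = ["CustomerID", "FirstName", "LastName", "City",
            "StateProvince", "CompanyName"]
    # coalesce(col, '') equivalent: map None to '' so one NULL column
    # cannot NULL out the whole concatenation.
    payload = "|".join("" if row.get(c) is None else str(row[c]) for c in cols)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Carol has a NULL StateProvince but still gets a unique, stable hash.
carol = {"CustomerID": 300, "FirstName": "Carol", "LastName": "Williams",
         "City": "London", "StateProvince": None, "CompanyName": "Global Finance"}
print(row_hash(carol))  # 64-char hex digest
```

The `|` delimiter matters too: without it, ("ab", "c") and ("a", "bc") would concatenate to the same string and collide.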

Step C: Select — Prefix with SRC_

  1. Add Select after GenerateHash
  2. Name: SelectSourceColumns
  3. Rename all columns:
     • CustomerID → SRC_CustomerID
     • FirstName → SRC_FirstName
     • LastName → SRC_LastName
     • City → SRC_City
     • StateProvince → SRC_StateProvince
     • CompanyName → SRC_CompanyName
     • SRC_HashValue → SRC_HashValue (keep as is)

Step D: Source 2 — Read Existing Dimension Table

  1. Add a second Source transformation (separate from Source A)
  2. Name: ExistingDimTable
  3. Source type: Inline or Dataset pointing to DimCustomer_SCD2
  4. Query: SELECT CustomerKey, CustomerID, HashValue FROM DimCustomer_SCD2 WHERE is_active = 1

Why WHERE is_active = 1? We only compare against CURRENT records. Expired historical rows are irrelevant for change detection.

Step E: Select — Prefix Existing with DIM_

  1. Add Select after ExistingDimTable
  2. Name: SelectExistingColumns
  3. Rename:
     • CustomerKey → DIM_CustomerKey
     • CustomerID → DIM_CustomerID
     • HashValue → DIM_HashValue

Step F: Lookup — Join Source with Existing Dimension

  1. Add Lookup after SelectSourceColumns
  2. Name: LookupExisting
  3. Primary stream: SelectSourceColumns
  4. Lookup stream: SelectExistingColumns
  5. Condition: SRC_CustomerID == DIM_CustomerID
  6. Lookup type: Left outer (keep all source rows)
  7. Multiple matches: First match

Step G: Conditional Split — New, Changed, Unchanged

  1. Add Conditional Split after LookupExisting
  2. Name: SplitNewChangedUnchanged
  3. Conditions:
| Stream Name | Condition |
|---|---|
| NewRecords | isNull(DIM_CustomerID) |
| ChangedRecords | SRC_HashValue != DIM_HashValue |
| UnchangedRecords | Default stream |

Same as Type 1 — the split logic is identical.
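A Python sketch of the split logic, assuming each row carries the SRC_ columns plus the DIM_ columns attached by the left-outer Lookup (None where there was no match; the hash strings here are made-up placeholders):

```python
# Sketch of the three-way split (illustrative; stream and column names follow
# the flow above). After the left-outer Lookup, unmatched source rows carry
# None in the DIM_ columns.
def split_row(row: dict) -> str:
    if row.get("DIM_CustomerID") is None:        # isNull(DIM_CustomerID)
        return "NewRecords"
    if row["SRC_HashValue"] != row["DIM_HashValue"]:
        return "ChangedRecords"
    return "UnchangedRecords"                    # default stream

rows = [  # hash values are placeholders, not real digests
    {"SRC_CustomerID": 500, "SRC_HashValue": "f1a2", "DIM_CustomerID": None, "DIM_HashValue": None},
    {"SRC_CustomerID": 100, "SRC_HashValue": "e9d4", "DIM_CustomerID": 100, "DIM_HashValue": "a3f7"},
    {"SRC_CustomerID": 200, "SRC_HashValue": "b2c1", "DIM_CustomerID": 200, "DIM_HashValue": "b2c1"},
]
print([split_row(r) for r in rows])  # ['NewRecords', 'ChangedRecords', 'UnchangedRecords']
```

The order of the checks mirrors the Conditional Split: a row is only tested for a hash mismatch after it has failed the "new record" test.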

Step H: The New Records Path (Insert)

  1. Add Derived Column after NewRecords
  2. Name: PrepareNewInsert
  3. Add columns:
     • StartDate = currentDate()
     • EndDate = toDate('9999-12-31')
     • is_active = true()
  4. Add Alter Row after PrepareNewInsert
  5. Name: MarkNewInsert
  6. Insert if: true()

Step I: The Changed Records Path — Two Operations

This is where Type 2 differs fundamentally from Type 1. A changed record requires TWO operations:

  1. Expire the old row — UPDATE the existing dim table row (set EndDate and is_active=0)
  2. Insert the new version — INSERT a new row with current values

These go to TWO different sinks.
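To make the two-step mechanics concrete, here is a toy Python sketch that applies both operations to an in-memory list standing in for DimCustomer_SCD2. The surrogate key is hand-incremented here; in the real table, IDENTITY assigns it on INSERT:

```python
from datetime import date

# Toy in-memory version of the two sink operations for one changed record.
# A list of dicts stands in for DimCustomer_SCD2.
def apply_change(dim: list, business_key: int, new_values: dict, today: date) -> None:
    # SinkExpire: UPDATE the active row -- set EndDate and is_active = 0.
    for row in dim:
        if row["CustomerID"] == business_key and row["is_active"]:
            row["EndDate"] = today
            row["is_active"] = False
    # SinkInsert: INSERT the new version with a fresh surrogate key
    # (hand-incremented here; IDENTITY does this in the real table).
    dim.append({"CustomerKey": max(r["CustomerKey"] for r in dim) + 1,
                "CustomerID": business_key, **new_values,
                "StartDate": today, "EndDate": date(9999, 12, 31), "is_active": True})

dim = [{"CustomerKey": 1, "CustomerID": 100, "City": "Toronto",
        "StartDate": date(2026, 4, 20), "EndDate": date(9999, 12, 31), "is_active": True}]
apply_change(dim, 100, {"City": "Mumbai"}, today=date(2026, 4, 21))
print(len(dim), dim[0]["is_active"], dim[1]["City"])  # 2 False Mumbai
```

One changed source row produces exactly two table operations: one UPDATE and one INSERT. That is why the changed path must branch into two parallel streams.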

Step I-1: Expire Old Row (Select + Alter Row + Sink)

From the ChangedRecords stream:

  1. Add Derived Column
  2. Name: PrepareExpireColumns
  3. Columns:
     • EndDate = currentDate()
     • is_active = false()
  4. Add Select after PrepareExpireColumns
  5. Name: SelectExpireColumns
  6. Keep only:
     • DIM_CustomerKey → CustomerKey
     • EndDate → EndDate
     • is_active → is_active

Why only 3 columns? The expire operation only UPDATES EndDate and is_active. It does not change any business columns. The CustomerKey is the key to find the correct row.

  1. Add Alter Row
  2. Name: MarkExpireUpdate
  3. Update if: true()

  4. Add Sink

  5. Name: SinkExpire
  6. Dataset: DimCustomer_SCD2
  7. Settings:
    • Allow update: checked
    • Key columns: CustomerKey
  8. Mapping: CustomerKey → CustomerKey, EndDate → EndDate, is_active → is_active

What this generates:

UPDATE DimCustomer_SCD2
SET EndDate = '2026-04-20', is_active = 0
WHERE CustomerKey = 1001;  -- Alice's old Toronto row

Step I-2: Insert New Version

Also from the ChangedRecords stream (branch from the SAME split output):

  1. Add Derived Column (branching from ChangedRecords, NOT from the expire path)
  2. Name: PrepareNewVersion
  3. Columns:
     • StartDate = currentDate()
     • EndDate = toDate('9999-12-31')
     • is_active = true()
  4. Add Alter Row
  5. Name: MarkNewVersionInsert
  6. Insert if: true()

This stream will be unioned with the NewRecords path.

Step J: Union — Merge All Inserts

  1. Add Union transformation
  2. Name: UnionAllInserts
  3. Inputs:
     • Stream 1: MarkNewInsert (brand new records)
     • Stream 2: MarkNewVersionInsert (new versions of changed records)

Both are INSERT operations going to the same sink.

Step K: Sink Insert — Write New and New-Version Rows

  1. Add Sink after Union
  2. Name: SinkInsert
  3. Dataset: DimCustomer_SCD2
  4. Settings:
     • Allow insert: checked
  5. Mapping:
     • SRC_CustomerID → CustomerID
     • SRC_FirstName → FirstName
     • SRC_LastName → LastName
     • SRC_City → City
     • SRC_StateProvince → StateProvince
     • SRC_CompanyName → CompanyName
     • SRC_HashValue → HashValue
     • StartDate → StartDate
     • EndDate → EndDate
     • is_active → is_active
     • Do NOT map CustomerKey (auto-generated IDENTITY)

Step L: Sink Expire — Already Created in Step I-1

The SinkExpire was already configured. It handles UPDATE operations.

First Run vs Subsequent Runs

First Run (Empty Dim Table)

Source: 4 rows
Lookup: All DIM_ columns are NULL (empty table)
Split: 4 → NewRecords, 0 → Changed, 0 → Unchanged
SinkInsert: INSERT 4 rows
SinkExpire: 0 updates

DimCustomer_SCD2:
| CustomerKey | CustomerID | City    | StartDate  | EndDate    | is_active |
|-------------|-----------|---------|------------|------------|-----------|
| 1           | 100       | Toronto | 2026-04-20 | 9999-12-31 | 1         |
| 2           | 200       | Mumbai  | 2026-04-20 | 9999-12-31 | 1         |
| 3           | 300       | London  | 2026-04-20 | 9999-12-31 | 1         |
| 4           | 400       | Toronto | 2026-04-20 | 9999-12-31 | 1         |

Second Run (Alice Changed, Frank New)

Source: 5 rows
Lookup: 4 matches, 1 NULL (Frank)
Split: 1 New (Frank), 1 Changed (Alice), 3 Unchanged

SinkExpire: UPDATE 1 row (Alice's old Toronto row → EndDate, is_active=0)
SinkInsert: INSERT 2 rows (Frank + Alice's new Mumbai version)

DimCustomer_SCD2:
| CustomerKey | CustomerID | City    | StartDate  | EndDate    | is_active |
|-------------|-----------|---------|------------|------------|-----------|
| 1           | 100       | Toronto | 2026-04-20 | 2026-04-21 | 0         | ← EXPIRED
| 2           | 200       | Mumbai  | 2026-04-20 | 9999-12-31 | 1         |
| 3           | 300       | London  | 2026-04-20 | 9999-12-31 | 1         |
| 4           | 400       | Toronto | 2026-04-20 | 9999-12-31 | 1         |
| 5           | 100       | Mumbai  | 2026-04-21 | 9999-12-31 | 1         | ← NEW VERSION
| 6           | 500       | Berlin  | 2026-04-21 | 9999-12-31 | 1         | ← NEW RECORD

Alice now has TWO rows: CustomerKey=1 (Toronto, expired) and CustomerKey=5 (Mumbai, active).

The Surrogate Key Strategy

The CustomerKey is an IDENTITY column that auto-increments. The Data Flow does NOT generate it — SQL Server generates it on INSERT.

Why not use CustomerID as the primary key?

Because CustomerID=100 (Alice) now has TWO rows. If CustomerID were the primary key, the second INSERT would fail with a duplicate key violation. The surrogate key (CustomerKey) ensures every row — every VERSION — has a unique identifier.
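A quick Python illustration of the point (hypothetical, not pipeline code): storage keyed by the business key can hold only one version per customer, while storage keyed by a surrogate keeps them all:

```python
# Hypothetical illustration: why the business key alone cannot be the
# primary key once history is kept.
by_business_key = {}
by_business_key[100] = {"City": "Toronto"}
by_business_key[100] = {"City": "Mumbai"}  # overwrites Toronto; history lost
print(len(by_business_key))                # 1

# Keyed by surrogate key, both versions of CustomerID=100 coexist.
by_surrogate_key = {
    1: {"CustomerID": 100, "City": "Toronto"},
    5: {"CustomerID": 100, "City": "Mumbai"},
}
print(len(by_surrogate_key))               # 2
```

In SQL the second insert would fail with a duplicate key violation rather than silently overwrite, but the consequence is the same: one business key cannot index two versions.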

Multiple Changes Over Time

After Alice moves a third time (Mumbai → Paris), the table shows three versions:

| CustomerKey | CustomerID | City    | StartDate  | EndDate    | is_active |
|-------------|-----------|---------|------------|------------|-----------|
| 1           | 100       | Toronto | 2026-04-20 | 2026-04-21 | 0         |
| 5           | 100       | Mumbai  | 2026-04-21 | 2026-06-15 | 0         |
| 8           | 100       | Paris   | 2026-06-15 | 9999-12-31 | 1         |

Three versions. Complete history. Fact tables can link to any version.

Querying the SCD Type 2 Table

-- Current state (like Type 1)
SELECT * FROM DimCustomer_SCD2 WHERE is_active = 1;

-- All versions of Alice
SELECT * FROM DimCustomer_SCD2 WHERE CustomerID = 100 ORDER BY StartDate;

-- Where was Alice on 2026-05-01?
SELECT * FROM DimCustomer_SCD2
WHERE CustomerID = 100
  AND '2026-05-01' >= StartDate
  AND '2026-05-01' < EndDate;
-- Result: Mumbai (CustomerKey=5)

-- Join with fact table for historical accuracy
SELECT f.OrderID, f.Amount, d.City, d.StartDate, d.EndDate
FROM FactOrders f
JOIN DimCustomer_SCD2 d ON f.CustomerKey = d.CustomerKey;

Common Errors and Fixes

| Error | Cause | Fix |
|---|---|---|
| "No updateable columns match" | SinkExpire mapping does not match table columns | Ensure column names in Select match exactly: CustomerKey, EndDate, is_active |
| Duplicate CustomerKey | Trying to insert with an explicit CustomerKey value | Do NOT map CustomerKey in SinkInsert — let IDENTITY auto-generate |
| All rows going to NewRecords on second run | ExistingDimTable source not filtering WHERE is_active = 1 | Add the filter so only current rows are compared |
| Hash always different | NULL handling in hash formula | Use coalesce() for every column in the hash |
| Expire sink updating wrong rows | Key column not set to CustomerKey | Set Key columns to CustomerKey in SinkExpire settings |
| Changed records not branching to both paths | Only one path from ChangedRecords | Branch from the Conditional Split output, not from a downstream transformation |

Interview Questions

Q: How does SCD Type 2 differ from Type 1 in terms of pipeline design?
A: Type 1 has one sink (insert + update to the same table). Type 2 needs two sinks — one for inserting new records and new versions (SinkInsert), and one for expiring old records (SinkExpire with UPDATE). The changed records path splits into two parallel operations: expire the old row AND insert the new version.

Q: Why do you need a surrogate key in SCD Type 2?
A: Because one business entity (CustomerID=100) can have multiple rows representing different time periods. The surrogate key (CustomerKey) uniquely identifies each VERSION. Without it, you cannot have multiple rows for the same customer. Fact tables reference the surrogate key to link to the correct historical version.

Q: How do you query the current state of an SCD Type 2 table?
A: WHERE is_active = 1 returns only the latest version of each record. For point-in-time queries, use WHERE date >= StartDate AND date < EndDate to find the version that was active on a specific date.

Q: Why does the Lookup only read active records from the dimension table?
A: Because we only need to compare against CURRENT records for change detection. If Alice has 3 historical rows, we only compare the source with her active row. Expired rows are irrelevant for detecting new changes.

Q: Is this pipeline idempotent?
A: Yes, for unchanged data. If you run it with the same source file, all hashes match, everything goes to Unchanged (dropped), and nothing is inserted or expired. However, if the source has changes, running it twice would create duplicate new versions — so in production, you should process each source file exactly once.

Wrapping Up

SCD Type 2 is the gold standard for dimension tables that need historical tracking. The pipeline is more complex than Type 1 (two sinks instead of one, expire logic, surrogate keys), but the payoff is massive: complete history, accurate point-in-time reporting, and regulatory compliance.

The pattern we built — hash-based detection, conditional split, parallel expire+insert paths, dual sinks — works for any dimension table. Change the source, change the columns in the hash, and the rest of the flow stays the same.

Related posts:
  • SCD Type 1 Hash Pipeline
  • SCD Types Explained (0, 1, 2, 3, 6)
  • Data Flows Guide
  • Joins in Data Flows


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
