SCD Type 1 Pipeline with Hash-Based Change Detection in Azure Synapse: Every Activity Explained
In the previous post, we built an SCD Type 1 pipeline that does a full load every time — dump everything from source to destination. Simple, reliable, but wasteful. If you have 10 million rows and only 50 changed, you are rewriting 10 million rows just to update 50.
This post builds a smarter SCD Type 1 pipeline that uses hash-based change detection to figure out exactly which rows are new, which have changed, and which are identical — and only processes what needs processing.
Think of it like a hotel housekeeping team. The basic approach (full load) sends housekeeping to clean EVERY room, even if guests have not checked in. The smart approach (hash-based) checks the “Do Not Disturb” sign first. If the sign is out (hash matches = no change), skip the room. If the sign is NOT out (hash differs = something changed), go in and clean. If it is a brand new room (no hash = new record), set it up from scratch.
This is the exact pipeline pattern we built in Synapse using Data Flows, and this post explains every single activity, transformation, and expression.
Table of Contents
- What Is Hash-Based Change Detection?
- Why Hashing Instead of Column-by-Column Comparison?
- The Complete Pipeline Architecture
- The Source Data (CSV in ADLS Gen2)
- The Target Table (SQL Dimension Table with Audit Columns)
- The Pipeline Overview
- Activity 1: Data Flow (The Brain of the Pipeline)
- Transformation 1: Source (Reading the CSV)
- Transformation 2: Derived Column — Generate Source Hash
- Transformation 3: Select — Prefix Columns with SRC_
- Transformation 4: Lookup — Join with Existing Dimension Table
- Transformation 5: Conditional Split — Route to New, Changed, or Unchanged
- Transformation 6: Alter Row (Insert) — New Records
- Transformation 7: Alter Row (Update) — Changed Records
- Transformation 8: Derived Column — Add Update Audit Columns
- Transformation 9: Union — Merge Insert and Update Streams
- Transformation 10: Sink — Write to SQL Dimension Table
- The Audit Trail: inserted_tms vs updated_tms
- Idempotency: Why Running Twice Does Nothing
- First Run vs Subsequent Runs
- The Complete Data Flow Visual
- Parameterized Source Dataset
- Performance Considerations
- Common Mistakes
- Interview Questions
- Wrapping Up
What Is Hash-Based Change Detection?
Instead of comparing every column value between source and target to find changes, you create a single hash value from all columns. If the hash is the same, nothing changed. If the hash is different, something changed.
Source Row: Name="Alice", City="Toronto", Salary=95000
Source Hash: SHA256("Alice|Toronto|95000") = "a3f7b2c1..."
Target Row: Name="Alice", City="Toronto", Salary=95000
Target Hash: SHA256("Alice|Toronto|95000") = "a3f7b2c1..."
Comparison: "a3f7b2c1..." == "a3f7b2c1..." → NO CHANGE (skip)
Now Alice moves to Mumbai:
Source Row: Name="Alice", City="Mumbai", Salary=95000
Source Hash: SHA256("Alice|Mumbai|95000") = "e9d4f8a6..."
Target Row: Name="Alice", City="Toronto", Salary=95000
Target Hash: SHA256("Alice|Toronto|95000") = "a3f7b2c1..."
Comparison: "e9d4f8a6..." != "a3f7b2c1..." → CHANGED (update)
Real-life analogy: Hashing is like a fingerprint for your data. Every person (row) has a unique fingerprint (hash). If someone’s fingerprint changes, their identity has changed. You do not need to compare height, weight, hair color, and eye color individually — the fingerprint catches ALL changes in one comparison.
Why Hashing Instead of Column-by-Column Comparison?
| Approach | Comparison Logic | Complexity |
|---|---|---|
| Column-by-column | `source.name != target.name OR source.city != target.city OR source.salary != target.salary OR ...` | Grows with every column. 20 columns = 20 OR conditions. |
| Hash-based | `source.hash != target.hash` | Always ONE comparison, regardless of column count. |
Additional Benefits of Hashing
- NULL handling — hashing naturally handles NULLs (they become part of the hash string)
- Any data type — dates, numbers, strings all concatenate into one hash
- Performance — one string comparison vs 20 individual comparisons
- Maintainability — adding a new column means adding it to the hash formula, not rewriting the comparison logic
The Hash Formula
hash_value = SHA256(concat(column1, '|', column2, '|', column3, '|', ...))
The | delimiter prevents false matches. Without it, Name="AB" City="C" and Name="A" City="BC" would produce the same concatenation (“ABC”). With delimiter: “AB|C” vs “A|BC” — different hashes.
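If you want to validate the Data Flow's hashes from the database side, the same formula can be expressed in T-SQL. This is a minimal sketch, assuming SQL Server's HASHBYTES; the digests only line up if the concatenated string is byte-identical to what the Data Flow hashed (same column order, same delimiter, same number formatting):

```sql
-- T-SQL counterpart of the hash formula, handy for spot-checking or
-- backfilling hash_value on the dimension side. Use VARCHAR throughout:
-- NVARCHAR hashes UTF-16 bytes and produces a different digest.
SELECT employee_id,
       CONVERT(CHAR(64),
               HASHBYTES('SHA2_256',
                         CONCAT(CAST(employee_id AS VARCHAR(20)), '|',
                                name, '|', department, '|', city, '|',
                                CAST(salary AS VARCHAR(20)))),
               2) AS hash_value   -- style 2 = hex string without the 0x prefix
FROM dim_employee;
```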
The Complete Pipeline Architecture
ADLS Gen2 (CSV) Synapse Data Flow Azure SQL DB
┌──────────────┐ ┌──────────────────────────────────────────────┐ ┌──────────────┐
│ │ │ │ │ │
│ employees.csv│───>│ Source ──> Derived (Hash) ──> Select (SRC_) │ │ dim_employee │
│ │ │ │ │ │ │
│ │ │ Lookup (Join │ │ (hash_value, │
│ │ │ with dim table) │<───│ inserted_tms│
│ │ │ │ │ │ updated_tms)│
│ │ │ Conditional Split │ │ │
│ │ │ / | \ │ │ │
│ │ │ New Changed Unchanged│ │ │
│ │ │ | | (drop) │ │ │
│ │ │ AlterRow AlterRow+Derived │ │ │
│ │ │ (Insert) (Update+audit) │ │ │
│ │ │ \ / │ │ │
│ │ │ Union │ │ │
│ │ │ │ │ │ │
│ │ │ Sink ────────────────>│ │ │
│ │ │ │ │ │
└──────────────┘ └──────────────────────────────────────────────┘ └──────────────┘
The Source Data (CSV in ADLS Gen2)
A simple employee CSV file stored in your data lake:
employee_id,name,department,city,salary
1,Alice,Engineering,Toronto,95000
2,Bob,Sales,Mumbai,72000
3,Carol,Marketing,London,68000
4,Dave,Engineering,Toronto,91000
5,Eve,Sales,Paris,85000
This file is the source of truth. On subsequent runs, it might have changes:
employee_id,name,department,city,salary
1,Alice,Engineering,Mumbai,98000
2,Bob,Sales,Mumbai,72000
3,Carol,Marketing,London,68000
4,Dave,Engineering,Toronto,91000
5,Eve,Sales,Paris,85000
6,Frank,HR,Berlin,70000
Changes: Alice moved to Mumbai and got a raise. Frank is new. Bob, Carol, Dave, Eve are unchanged.
The Target Table (SQL Dimension Table with Audit Columns)
CREATE TABLE dim_employee (
employee_id INT PRIMARY KEY,
name VARCHAR(100),
department VARCHAR(50),
city VARCHAR(50),
salary INT,
hash_value VARCHAR(64),
inserted_tms DATETIME DEFAULT GETDATE(),
inserted_by VARCHAR(100) DEFAULT SYSTEM_USER,
updated_tms DATETIME NULL,
updated_by VARCHAR(100) NULL
);
Why These Audit Columns?
| Column | Purpose | Populated When |
|---|---|---|
| `hash_value` | Stores the SHA-256 hash of all business columns | Every insert and update |
| `inserted_tms` | When the record was FIRST created | Insert only (`DEFAULT GETDATE()`) |
| `inserted_by` | Who/what created the record | Insert only (`DEFAULT SYSTEM_USER`) |
| `updated_tms` | When the record was LAST modified | Update only (set by data flow) |
| `updated_by` | Who/what modified the record | Update only (set by data flow) |
The key design: inserted_tms and inserted_by have DEFAULT constraints and are NEVER overwritten on updates. This preserves the original creation audit trail even after multiple updates.
Real-life analogy: It is like a package tracking system. inserted_tms is when the package was originally shipped (never changes). updated_tms is when the last status update happened (changes with every scan). Even after 10 status updates, you can still see the original ship date.
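A minimal T-SQL sketch of how those DEFAULTs behave (hash values shortened for readability):

```sql
-- The insert omits the audit columns, so the DEFAULTs fire:
-- inserted_tms = GETDATE(), inserted_by = SYSTEM_USER, updated_* stay NULL.
INSERT INTO dim_employee (employee_id, name, department, city, salary, hash_value)
VALUES (1, 'Alice', 'Engineering', 'Toronto', 95000, 'a3f7b2c1...');

-- The update touches business columns and the update audit columns only,
-- so inserted_tms and inserted_by keep their original values.
UPDATE dim_employee
SET city        = 'Mumbai',
    salary      = 98000,
    hash_value  = 'e9d4f8a6...',
    updated_tms = GETDATE(),
    updated_by  = 'SynapseDataFlow'
WHERE employee_id = 1;
```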
The Pipeline Overview
The pipeline has one primary activity: a Data Flow activity that contains all the transformation logic. Unlike the previous post’s pipeline with separate Lookup, ForEach, and Copy activities, this pipeline uses a Data Flow to handle everything in a single Spark-powered execution.
Pipeline: PL_SCD1_HashBased
|
|-- Data Flow Activity: DF_SCD1_Employee
|
|-- Source (CSV from ADLS)
|-- Derived Column (generate hash)
|-- Select (prefix with SRC_)
|-- Lookup (join with dim table)
|-- Conditional Split (new / changed / unchanged)
|-- Alter Row + Union + Sink
Activity 1: Data Flow (The Brain of the Pipeline)
Why a Data Flow Instead of Copy + Stored Procedure?
The Copy activity can move data but cannot compare, hash, split, or conditionally insert/update. For change detection logic, you need either:
- Data Flow (visual, no code, runs on Spark) — what we use here
- Stored Procedure (SQL-based MERGE statement) — alternative approach (sketched at the end of this section)
- Spark Notebook (PySpark code) — for maximum flexibility
Data Flow is the best choice for visual, maintainable, code-free SCD implementation.
Real-life analogy: The Copy activity is a delivery truck — it moves boxes from A to B without opening them. The Data Flow is a sorting facility — it opens every box, inspects the contents, routes them to the right destination, and stamps them with tracking information.
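For comparison, here is a hedged sketch of the stored procedure alternative from the list above: the same new/changed/unchanged routing as a single T-SQL MERGE, assuming the CSV has first been staged into a hypothetical stg_employee table with a precomputed hash_value column:

```sql
MERGE dim_employee AS tgt
USING stg_employee AS src
   ON tgt.employee_id = src.employee_id
-- "NewRecords": no match in the dimension table, so insert
WHEN NOT MATCHED BY TARGET THEN
    INSERT (employee_id, name, department, city, salary, hash_value)
    VALUES (src.employee_id, src.name, src.department,
            src.city, src.salary, src.hash_value)
-- "ChangedRecords": key matches but the fingerprint differs, so update
WHEN MATCHED AND tgt.hash_value <> src.hash_value THEN
    UPDATE SET name        = src.name,
               department  = src.department,
               city        = src.city,
               salary      = src.salary,
               hash_value  = src.hash_value,
               updated_tms = GETDATE(),
               updated_by  = 'MergeProc';
-- "UnchangedRecords": matched rows with equal hashes fall through untouched.
```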
Transformation 1: Source (Reading the CSV)
What It Does
Reads the CSV file from ADLS Gen2 into the Data Flow as the starting dataset.
Configuration
- Add Source transformation
- Name: `SourceCSV`
- Source type: Dataset (or Inline)
- Dataset: parameterized ADLS CSV dataset
- Linked service: your ADLS Gen2
- File path: container + folder path (parameterized for reuse)
- Source options:
- Header: First row as header
- Schema: infer from connection (or define manually)
Why Parameterized
The source dataset accepts a FileName parameter, so the same pipeline can process different CSV files:
Run 1: FileName = "employees.csv"
Run 2: FileName = "departments.csv"
Run 3: FileName = "products.csv"
One pipeline, multiple use cases. The pipeline parameter feeds into the Data Flow parameter, which feeds into the dataset parameter.
Data Preview After Source
| employee_id | name | department | city | salary |
|---|---|---|---|---|
| 1 | Alice | Engineering | Toronto | 95000 |
| 2 | Bob | Sales | Mumbai | 72000 |
| 3 | Carol | Marketing | London | 68000 |
| 4 | Dave | Engineering | Toronto | 91000 |
| 5 | Eve | Sales | Paris | 85000 |
Transformation 2: Derived Column — Generate Source Hash
What It Does
Calculates a SHA-256 hash from all business columns to create a unique fingerprint for each row.
Configuration
- Add Derived Column after Source
- Name: `GenerateHash`
- New column: `src_hash_value`
- Expression:
sha2(256, concat(toString(employee_id), '|', name, '|', department, '|', city, '|', toString(salary)))
Breaking Down the Expression
| Part | Purpose |
|---|---|
| `sha2(256, ...)` | Generates a SHA-256 hash (64-character hex string) |
| `concat(...)` | Joins all column values into one string |
| `toString(employee_id)` | Converts integer to string for concatenation |
| `'\|'` | Delimiter between values to prevent false matches |
Why SHA-256?
- Fixed length — always 64 characters, regardless of input size
- Collision resistant — virtually impossible for two different inputs to produce the same hash
- Deterministic — same input always produces the same output (critical for comparison)
- One-way — you cannot reverse-engineer the original data from the hash (security benefit)
Data Preview After Hash
| employee_id | name | city | salary | src_hash_value |
|---|---|---|---|---|
| 1 | Alice | Toronto | 95000 | a3f7b2c1d9e4… |
| 2 | Bob | Mumbai | 72000 | b8c2e5f1a7d3… |
Real-life analogy: Generating the hash is like taking a photograph of each row. Instead of describing every detail (“Alice, Engineering, Toronto, 95000”), you take a snapshot. Next time, you take another snapshot. If the photos look different, something changed. You do not need to compare every detail — the photo tells you instantly.
Transformation 3: Select — Prefix Columns with SRC_
What It Does
Renames all source columns with a SRC_ prefix to distinguish them from the dimension table columns during the upcoming Lookup join.
Configuration
- Add Select after GenerateHash
- Name: `PrefixSRC`
- Rename mappings:
| Input Column | Output Column |
|---|---|
| employee_id | SRC_employee_id |
| name | SRC_name |
| department | SRC_department |
| city | SRC_city |
| salary | SRC_salary |
| src_hash_value | SRC_hash_value |
Why Prefixing Is Essential
When you join the source data with the dimension table in the next step, BOTH have columns called employee_id, name, city, etc. Without prefixing, the join output has ambiguous column names — which city is the source and which is the target?
Without prefix: employee_id, name, city, employee_id, name, city ← Ambiguous!
With SRC_ prefix: SRC_employee_id, SRC_name, SRC_city, employee_id, name, city ← Clear!
Real-life analogy: When two guests at a wedding have the same name (both called “Alice”), the seating chart adds a label: “Alice (Bride’s side)” and “Alice (Groom’s side).” The prefix is the label that prevents confusion.
Transformation 4: Lookup — Join with Existing Dimension Table
What It Does
Joins the source data (CSV) with the existing dimension table (SQL) to find which source records already exist in the target.
Configuration
- Add Lookup transformation after PrefixSRC
- Name: `LookupDim`
- Primary stream: `PrefixSRC` (source CSV)
- Lookup stream: add a second Source that reads from the `dim_employee` SQL table
- Lookup condition: `SRC_employee_id == employee_id`
- Multiple matches: First match (each employee_id should be unique)
- Lookup type: Left outer (keep ALL source rows, even if not in dim table)
Why Left Outer Lookup (Not Inner)
Inner would drop source rows that do not exist in the dim table — meaning new employees would be lost.
Left Outer keeps ALL source rows. If a source row has no match in the dim table (new employee), the dim columns come back as NULL. We use this NULL to identify new records in the Conditional Split.
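In SQL terms, this Lookup behaves like a LEFT JOIN. A sketch, again using a hypothetical staged copy of the CSV called stg_employee:

```sql
SELECT src.employee_id AS SRC_employee_id,
       src.name        AS SRC_name,
       src.hash_value  AS SRC_hash_value,
       tgt.employee_id,        -- NULL here means no match: a new record
       tgt.hash_value          -- the stored fingerprint to compare against
FROM stg_employee AS src
LEFT JOIN dim_employee AS tgt
       ON tgt.employee_id = src.employee_id;
```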
Data Preview After Lookup (First Run — Empty Dim Table)
| SRC_employee_id | SRC_name | SRC_city | SRC_hash_value | employee_id | hash_value |
|---|---|---|---|---|---|
| 1 | Alice | Toronto | a3f7… | NULL | NULL |
| 2 | Bob | Mumbai | b8c2… | NULL | NULL |
| 3 | Carol | London | f1d9… | NULL | NULL |
All dim columns are NULL because the dim table is empty. Every record is NEW.
Data Preview After Lookup (Subsequent Run — Dim Table Has Data)
| SRC_employee_id | SRC_name | SRC_city | SRC_hash_value | employee_id | hash_value |
|---|---|---|---|---|---|
| 1 | Alice | Mumbai | e9d4… | 1 | a3f7… |
| 2 | Bob | Mumbai | b8c2… | 2 | b8c2… |
| 6 | Frank | Berlin | c7a1… | NULL | NULL |
- Alice: hash differs (changed city and salary)
- Bob: hash matches (no change)
- Frank: no match in dim (new employee)
Transformation 5: Conditional Split — Route to New, Changed, or Unchanged
What It Does
Routes each row to a different stream based on whether it is new, changed, or unchanged.
Configuration
- Add Conditional Split after LookupDim
- Name: `SplitNewChangedUnchanged`
- Conditions (evaluated top to bottom):
| Stream Name | Condition | Logic |
|---|---|---|
| `NewRecords` | `isNull(employee_id)` | Dim table has no match → new employee |
| `ChangedRecords` | `SRC_hash_value != hash_value` | Hash differs → something changed |
| `UnchangedRecords` | (Default stream) | Everything else → hash matches, no change |
Why This Order Matters
Conditions are evaluated TOP to BOTTOM. A row matches the FIRST condition that is true.
- Check NEW first — if `employee_id` is NULL (no dim match), it is new. Do not check hash (the hash is also NULL for new records).
- Check CHANGED second — if employee_id exists but hash differs, it is changed.
- Default = UNCHANGED — everything else has matching hashes. No action needed.
Real-life analogy: Airport immigration. First lane: “No passport on file? → New visitor, go to registration.” Second lane: “Passport on file but photo looks different? → Changed appearance, go to verification.” Third lane: “Passport matches? → Welcome back, proceed.”
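The same routing can be sketched as a T-SQL CASE expression, which, like the split, evaluates branches top to bottom and stops at the first match (hypothetical stg_employee staging table again):

```sql
SELECT src.employee_id,
       CASE
           WHEN tgt.employee_id IS NULL          THEN 'NewRecords'     -- checked first
           WHEN src.hash_value <> tgt.hash_value THEN 'ChangedRecords'
           ELSE                                       'UnchangedRecords'
       END AS route
FROM stg_employee AS src
LEFT JOIN dim_employee AS tgt
       ON tgt.employee_id = src.employee_id;
```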
Row Counts After Split
First run (empty dim): All 5 rows go to NewRecords. 0 Changed. 0 Unchanged.
Second run (with changes): 1 New (Frank), 1 Changed (Alice), 4 Unchanged (Bob, Carol, Dave, Eve).
Transformation 6: Alter Row (Insert) — New Records
What It Does
Marks all rows in the NewRecords stream for INSERT into the SQL table.
Configuration
- Add Alter Row after the `NewRecords` output of Conditional Split
- Name: `MarkInsert`
- Alter row condition: Insert if `true()` (all rows in this stream are inserts)
Why Alter Row Is Needed
Data Flow sinks can perform INSERT, UPDATE, DELETE, or UPSERT. But the sink needs to know WHICH operation to perform for each row. Alter Row is the instruction that says “every row coming through this path should be INSERTed.”
Without Alter Row, the sink does not know what to do and defaults to INSERT only (which would fail for updates).
Real-life analogy: Alter Row is like stamping a package at the post office. “NEW DELIVERY” stamp means deliver to a new address. “RETURN TO SENDER” stamp means update the existing record. The delivery driver (sink) reads the stamp and acts accordingly.
Transformation 7: Alter Row (Update) — Changed Records
What It Does
Marks all rows in the ChangedRecords stream for UPDATE in the SQL table.
Configuration
- Add Alter Row after the `ChangedRecords` output
- Name: `MarkUpdate`
- Alter row condition: Update if `true()` (all rows in this stream are updates)
Transformation 8: Derived Column — Add Update Audit Columns
What It Does
Adds updated_tms and updated_by columns ONLY for changed records. New records do not get these columns — their inserted_tms comes from the SQL DEFAULT constraint.
Configuration
- Add Derived Column after `MarkUpdate` (on the Changed path only)
- Name: `AddUpdateAudit`
- Columns:
  - `updated_tms` = `currentTimestamp()`
  - `updated_by` = `'SynapseDataFlow'`
Why Only on the Update Path
New records should have:
– inserted_tms = GETDATE() (from SQL DEFAULT)
– inserted_by = SYSTEM_USER (from SQL DEFAULT)
– updated_tms = NULL (never updated yet)
– updated_by = NULL
Changed records should have:
– inserted_tms = original value (preserved, NOT overwritten)
– inserted_by = original value (preserved)
– updated_tms = NOW (set by this Derived Column)
– updated_by = ‘SynapseDataFlow’ (set by this Derived Column)
This is why we add audit columns on the update path only. The insert path relies on SQL DEFAULTs.
Real-life analogy: When a package is first shipped, the warehouse stamps the “Ship Date” (inserted_tms). When the package is redirected to a new address, the courier stamps a “Redirect Date” (updated_tms) but does NOT change the original “Ship Date.” Both timestamps coexist, telling the complete story.
Transformation 9: Union — Merge Insert and Update Streams
What It Does
Combines the Insert stream and Update stream back into a single stream that feeds into the Sink.
Configuration
- Add Union transformation
- Name: `UnionInsertUpdate`
- Inputs:
  - Stream 1: `MarkInsert` (new records marked for INSERT)
  - Stream 2: `AddUpdateAudit` (changed records marked for UPDATE, with audit columns)
Why Union Before Sink
You COULD have two separate sinks — one for inserts and one for updates. But Union + single sink is cleaner:
- One sink = one connection, one transaction
- Easier monitoring — one output shows total inserts + updates
- Less code — no duplicate sink configuration
The Alter Row stamps (INSERT vs UPDATE) travel through the Union. The sink reads each row’s stamp and performs the correct operation.
Real-life analogy: At the post office, packages stamped “NEW DELIVERY” and “RETURN/UPDATE” are placed on the same conveyor belt (Union). The sorting machine at the end (Sink) reads the stamp and routes each package to the correct truck.
Transformation 10: Sink — Write to SQL Dimension Table
What It Does
Writes the final data to the dim_employee SQL table. Inserts new rows and updates changed rows based on the Alter Row stamps.
Configuration
- Add Sink transformation after Union
- Name: `SinkDimEmployee`
- Dataset: SQL dataset pointing to `dim_employee`
- Settings:
  - Update method: Allow insert ✓ and Allow update ✓
  - Key columns: `SRC_employee_id` (the business key for matching updates)
- Mapping:
  - Map `SRC_employee_id` → `employee_id`
  - Map `SRC_name` → `name`
  - Map `SRC_department` → `department`
  - Map `SRC_city` → `city`
  - Map `SRC_salary` → `salary`
  - Map `SRC_hash_value` → `hash_value`
  - Map `updated_tms` → `updated_tms` (NULL for inserts, timestamp for updates)
  - Map `updated_by` → `updated_by` (NULL for inserts, value for updates)
  - Do NOT map `inserted_tms` or `inserted_by` (let SQL DEFAULTs handle them)
Why Key Columns Matter
Key columns tell the sink HOW to match for updates:
UPDATE dim_employee
SET name = 'Alice', city = 'Mumbai', salary = 98000, hash_value = 'e9d4...',
updated_tms = '2026-04-16 15:30:00', updated_by = 'SynapseDataFlow'
WHERE employee_id = 1; ← This is the key column
Without the key column, the sink does not know WHICH row to update.
The Audit Trail: inserted_tms vs updated_tms
After the first run, Alice’s record:
employee_id=1, name='Alice', city='Toronto', hash='a3f7...',
inserted_tms='2026-04-16 14:00:00', inserted_by='SynapseUser',
updated_tms=NULL, updated_by=NULL
After the second run (Alice moved to Mumbai):
employee_id=1, name='Alice', city='Mumbai', hash='e9d4...',
inserted_tms='2026-04-16 14:00:00', ← PRESERVED (original insert date)
inserted_by='SynapseUser', ← PRESERVED (original inserter)
updated_tms='2026-04-17 08:00:00', ← SET (when the update happened)
updated_by='SynapseDataFlow' ← SET (what did the update)
The inserted_tms is NEVER overwritten. You can always answer: “When was Alice first added to the system?” AND “When was she last updated?”
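Both questions reduce to a single query:

```sql
-- "When was Alice first added?" and "When was she last updated?"
SELECT employee_id,
       inserted_tms AS first_added,   -- never overwritten
       updated_tms  AS last_changed   -- NULL until the first update
FROM dim_employee
WHERE employee_id = 1;
```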
Idempotency: Why Running Twice Does Nothing
After an initial load, run the pipeline twice with the same unchanged CSV file:
Run 1:
– Lookup finds all records in dim table
– Hash comparison: ALL hashes match (nothing changed in the CSV)
– Conditional Split: ALL rows go to UnchangedRecords stream
– UnchangedRecords stream is dropped (no sink)
– Result: 0 inserts, 0 updates
Run 2 (identical file):
– Same result: 0 inserts, 0 updates
This is idempotency — running the same operation multiple times produces the same result. The pipeline is safe to rerun after failures without creating duplicates or false updates.
Real-life analogy: Pressing the elevator button 5 times does not make the elevator come 5 times. It comes once. Pressing again while it is already coming does nothing. The hash comparison is the “already pressed” check.
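A quick way to verify this from the database side, assuming you note the run's start time (the timestamp below is hypothetical):

```sql
DECLARE @run_start DATETIME = '2026-04-17 08:00:00';  -- hypothetical run start

-- After rerunning with an unchanged file, both counts should be zero.
SELECT SUM(CASE WHEN inserted_tms >= @run_start THEN 1 ELSE 0 END) AS rows_inserted,
       SUM(CASE WHEN updated_tms  >= @run_start THEN 1 ELSE 0 END) AS rows_updated
FROM dim_employee;
```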
First Run vs Subsequent Runs
First Run (Empty Dim Table)
Source CSV: 5 rows
→ Generate Hash: 5 rows with src_hash_value
→ Prefix SRC_: 5 rows with SRC_ columns
→ Lookup Dim: 5 rows, ALL dim columns are NULL (empty table)
→ Conditional Split:
NewRecords: 5 (all isNull(employee_id))
ChangedRecords: 0
UnchangedRecords: 0
→ Alter Row (Insert): 5 rows
→ Sink: INSERT 5 rows into dim_employee
Audit result: All 5 rows have inserted_tms, updated_tms=NULL
Second Run (Alice Changed, Frank Added)
Source CSV: 6 rows (Alice changed, Frank is new)
→ Generate Hash: 6 rows
→ Prefix SRC_: 6 rows
→ Lookup Dim: 6 rows joined with 5 existing dim rows
→ Conditional Split:
NewRecords: 1 (Frank, isNull(employee_id))
ChangedRecords: 1 (Alice, SRC_hash != hash)
UnchangedRecords: 4 (Bob, Carol, Dave, Eve, hashes match)
→ Alter Row (Insert): 1 row (Frank)
→ Alter Row (Update) + Derived (audit): 1 row (Alice)
→ Union: 2 rows
→ Sink: INSERT 1 + UPDATE 1
Audit result: Frank has inserted_tms, Alice has updated_tms set
Third Run (No Changes)
Source CSV: 6 rows (identical to previous run)
→ Conditional Split:
NewRecords: 0
ChangedRecords: 0
UnchangedRecords: 6 (all hashes match)
→ Nothing reaches the Sink
→ 0 inserts, 0 updates (idempotent)
The Complete Data Flow Visual
SourceCSV
│
▼
GenerateHash (SHA-256 of all business columns)
│
▼
PrefixSRC (rename columns with SRC_ prefix)
│
▼
LookupDim ◄──── DimSource (reads dim_employee SQL table)
│
▼
ConditionalSplit
│ │ │
▼ ▼ ▼
NewRecords Changed Unchanged
│ Records (dropped)
│ │
▼ ▼
AlterRow AlterRow
(Insert) (Update)
│ │
│ ▼
│ AddUpdateAudit
│ (updated_tms, updated_by)
│ │
▼ ▼
Union (merge both streams)
│
▼
SinkDimEmployee (INSERT new + UPDATE changed)
Parameterized Source Dataset
Pipeline Parameter
Pipeline parameter: SourceFileName (String)
Default: "employees.csv"
Data Flow Parameter
Data Flow parameter: DFSourceFile (String)
Dataset Parameter
Dataset parameter: FileName (String)
Connection: container/folder/@dataset().FileName
Wiring
Pipeline parameter SourceFileName
→ Data Flow activity parameter DFSourceFile = @pipeline().parameters.SourceFileName
→ Source dataset FileName = $DFSourceFile
This chain lets you reuse the same pipeline for different files by passing a different SourceFileName when you trigger it.
Performance Considerations
- Broadcast the dim table — if your dimension table is small (under 50 MB), broadcast it during the Lookup. This avoids shuffling the large source data.
- Index the hash column — add an index on `hash_value` in the SQL table for faster comparisons (see the sketch after this list).
- Index the key column — `employee_id` should be the PRIMARY KEY (already indexed).
- Partition large source files — if the CSV has millions of rows, Spark will auto-partition for parallel processing.
- Use the right cluster size — for dev, 4+4 cores is fine. For production with millions of rows, scale to 16+16 or more.
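A sketch of the supporting index from the list above; whether it pays off depends on how the sink and your queries actually filter, so treat it as a starting point, not a rule:

```sql
-- employee_id needs nothing extra: the PRIMARY KEY already indexes it.
CREATE NONCLUSTERED INDEX IX_dim_employee_hash_value
    ON dim_employee (hash_value);
```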
Common Mistakes
1. Forgetting the Delimiter in Hash Concatenation
WRONG: sha2(256, concat(name, city, salary))
Name="AB", City="C" and Name="A", City="BC" both concatenate to "ABC"
→ two different rows produce the SAME hash, so a real change is missed
RIGHT: sha2(256, concat(name, '|', city, '|', toString(salary)))
"AB|C" vs "A|BC" — different strings, different hashes
2. Not Handling NULLs in Hash
If a column is NULL, concat(NULL, '|', 'Toronto') returns NULL in some engines.
Fix: Use coalesce() to replace NULLs:
sha2(256, concat(coalesce(toString(employee_id),''), '|', coalesce(name,''), '|', coalesce(city,'')))
3. Mapping inserted_tms in the Sink
If you map inserted_tms in the sink, updates will overwrite the original insert timestamp.
Fix: Do NOT map inserted_tms or inserted_by. Let SQL DEFAULTs handle them on insert, and they are untouched on update.
4. Using Inner Lookup Instead of Left Outer
Inner Lookup drops new records (no match in dim table).
Fix: Always use Left Outer for SCD pipelines. New records must flow through.
5. Wrong Conditional Split Order
If you check Changed before New, records with NULL hashes (new records) might match SRC_hash != NULL unexpectedly.
Fix: Always check isNull(employee_id) FIRST (new records), then hash comparison (changed), then default (unchanged).
Interview Questions
Q: What is hash-based change detection? A: Instead of comparing every column individually, you concatenate all column values and generate a SHA-256 hash. Compare the source hash with the stored target hash — if they differ, the record changed. One comparison instead of N column comparisons, and it scales regardless of column count.
Q: Why prefix source columns with SRC_ in the data flow? A: When joining source and target, both have columns with the same names (employee_id, name, city). Prefixing source columns with SRC_ prevents ambiguity and makes it clear which column belongs to which stream in downstream transformations and mappings.
Q: How does the pipeline ensure inserted_tms is never overwritten? A: By NOT mapping inserted_tms in the Data Flow sink. On INSERT, the SQL DEFAULT constraint sets it to GETDATE(). On UPDATE, the column is not included in the SET clause, so it retains its original value. The updated_tms column is only added on the update path via a Derived Column transformation.
Q: What makes this pipeline idempotent? A: The hash comparison. If the source data has not changed since the last run, all hashes match, all rows go to the Unchanged stream (dropped), and the sink receives zero rows. No inserts, no updates, no side effects. The pipeline is safe to rerun indefinitely.
Q: What is the difference between Alter Row Insert and Alter Row Update? A: Alter Row stamps each row with an operation type. Insert tells the sink to INSERT the row (new record). Update tells the sink to UPDATE the matching row based on the key column. Without Alter Row, the sink does not know which operation to perform.
Q: Why Union before the Sink instead of two separate Sinks? A: Union merges the insert and update streams into one. One sink means one connection, one transaction, and simpler monitoring. The Alter Row stamps survive through the Union, so the sink still knows which rows to insert and which to update.
Wrapping Up
This pipeline is the gold standard for SCD Type 1 with change detection. Instead of blindly overwriting everything, it fingerprints each row with a hash, compares it against the existing data, and only processes what actually changed.
The result: faster execution (skip unchanged rows), complete audit trail (inserted_tms preserved, updated_tms tracked), and idempotency (safe to rerun).
The pattern works for any dimension table — employees, customers, products, stores. Change the source, change the columns in the hash, and the rest of the pipeline stays the same.
Related posts:
– SCD Type 1 Full Load Pipeline
– Slowly Changing Dimensions (All Types)
– Data Flows in ADF/Synapse
– Joins in Synapse Data Flows
– Metadata-Driven Pipeline in ADF
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.