SCD Type 1 Pipeline with Hash-Based Change Detection in Azure Synapse: Every Activity Explained
In the previous post, we built an SCD Type 1 pipeline that does a full load every time — dump everything from source to destination. Simple, reliable, but wasteful. If you have 10 million rows and only 50 changed, you are rewriting 10 million rows just to update 50.
This post builds a smarter SCD Type 1 pipeline that uses hash-based change detection to figure out exactly which rows are new, which have changed, and which are identical — and only processes what needs processing.
Think of it like a hotel housekeeping team. The basic approach (full load) sends housekeeping to clean EVERY room, even if guests have not checked in. The smart approach (hash-based) checks the “Do Not Disturb” sign first. If the sign is out (hash matches = no change), skip the room. If the sign is NOT out (hash differs = something changed), go in and clean. If it is a brand new room (no hash = new record), set it up from scratch.
This is the exact pipeline pattern we built in Synapse using Data Flows, and this post explains every single activity, transformation, and expression.
Table of Contents
- What Is Hash-Based Change Detection?
- Why Hashing Instead of Column-by-Column Comparison?
- The Complete Pipeline Architecture
- The Source Data (CSV in ADLS Gen2)
- The Target Table (SQL Dimension Table with Audit Columns)
- The Pipeline Overview
- Activity 1: Data Flow (The Brain of the Pipeline)
- Transformation 1: Source (Reading the CSV)
- Transformation 2: Derived Column — Generate Source Hash
- Transformation 3: Select — Prefix Columns with SRC_
- Transformation 4: Lookup — Join with Existing Dimension Table
- Transformation 5: Conditional Split — Route to New, Changed, or Unchanged
- Transformation 6: Alter Row (Insert) — New Records
- Transformation 7: Alter Row (Update) — Changed Records
- Transformation 8: Derived Column — Add Update Audit Columns
- Transformation 9: Union — Merge Insert and Update Streams
- Transformation 10: Sink — Write to SQL Dimension Table
- The Audit Trail: inserted_tms vs updated_tms
- Idempotency: Why Running Twice Does Nothing
- First Run vs Subsequent Runs
- The Complete Data Flow Visual
- Parameterized Source Dataset
- Performance Considerations
- Common Mistakes
- Interview Questions
- Wrapping Up
What Is Hash-Based Change Detection?
Instead of comparing every column value between source and target to find changes, you create a single hash value from all columns. If the hash is the same, nothing changed. If the hash is different, something changed.
Source Row: Name="Alice", City="Toronto", Salary=95000
Source Hash: SHA256("Alice|Toronto|95000") = "a3f7b2c1..."
Target Row: Name="Alice", City="Toronto", Salary=95000
Target Hash: SHA256("Alice|Toronto|95000") = "a3f7b2c1..."
Comparison: "a3f7b2c1..." == "a3f7b2c1..." → NO CHANGE (skip)
Now Alice moves to Mumbai:
Source Row: Name="Alice", City="Mumbai", Salary=95000
Source Hash: SHA256("Alice|Mumbai|95000") = "e9d4f8a6..."
Target Row: Name="Alice", City="Toronto", Salary=95000
Target Hash: SHA256("Alice|Toronto|95000") = "a3f7b2c1..."
Comparison: "e9d4f8a6..." != "a3f7b2c1..." → CHANGED (update)
Real-life analogy: Hashing is like a fingerprint for your data. Every person (row) has a unique fingerprint (hash). If someone’s fingerprint changes, their identity has changed. You do not need to compare height, weight, hair color, and eye color individually — the fingerprint catches ALL changes in one comparison.
Why Hashing Instead of Column-by-Column Comparison?
| Approach | Comparison Logic | Complexity |
|---|---|---|
| Column-by-column | `source.name != target.name OR source.city != target.city OR source.salary != target.salary OR ...` | Grows with every column. 20 columns = 20 OR conditions. |
| Hash-based | `source.hash != target.hash` | Always ONE comparison, regardless of column count. |
Additional Benefits of Hashing
- NULL handling — hashing naturally handles NULLs (they become part of the hash string)
- Any data type — dates, numbers, strings all concatenate into one hash
- Performance — one string comparison vs 20 individual comparisons
- Maintainability — adding a new column means adding it to the hash formula, not rewriting the comparison logic
The Hash Formula
hash_value = SHA256(concat(column1, '|', column2, '|', column3, '|', ...))
The | delimiter prevents false matches. Without it, Name="AB" City="C" and Name="A" City="BC" would produce the same concatenation (“ABC”). With delimiter: “AB|C” vs “A|BC” — different hashes.
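If you want to validate the Data Flow's hashes from the database side, the same formula can be expressed in T-SQL. This is a minimal sketch, assuming SQL Server's HASHBYTES; the digests only line up if the concatenated string is byte-identical to what the Data Flow hashed (same column order, same delimiter, same number formatting):

```sql
-- T-SQL counterpart of the hash formula, handy for spot-checking or
-- backfilling hash_value on the dimension side. Use VARCHAR throughout:
-- NVARCHAR hashes UTF-16 bytes and produces a different digest.
SELECT employee_id,
       CONVERT(CHAR(64),
               HASHBYTES('SHA2_256',
                         CONCAT(CAST(employee_id AS VARCHAR(20)), '|',
                                name, '|', department, '|', city, '|',
                                CAST(salary AS VARCHAR(20)))),
               2) AS hash_value   -- style 2 = hex string without the 0x prefix
FROM dim_employee;
```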
The Complete Pipeline Architecture
ADLS Gen2 (CSV) Synapse Data Flow Azure SQL DB
┌──────────────┐ ┌──────────────────────────────────────────────┐ ┌──────────────┐
│ │ │ │ │ │
│ employees.csv│───>│ Source ──> Derived (Hash) ──> Select (SRC_) │ │ dim_employee │
│ │ │ │ │ │ │
│ │ │ Lookup (Join │ │ (hash_value, │
│ │ │ with dim table) │<───│ inserted_tms│
│ │ │ │ │ │ updated_tms)│
│ │ │ Conditional Split │ │ │
│ │ │ / | \ │ │ │
│ │ │ New Changed Unchanged│ │ │
│ │ │ | | (drop) │ │ │
│ │ │ AlterRow AlterRow+Derived │ │ │
│ │ │ (Insert) (Update+audit) │ │ │
│ │ │ \ / │ │ │
│ │ │ Union │ │ │
│ │ │ │ │ │ │
│ │ │ Sink ────────────────>│ │ │
│ │ │ │ │ │
└──────────────┘ └──────────────────────────────────────────────┘ └──────────────┘
The Source Data (CSV in ADLS Gen2)
A simple employee CSV file stored in your data lake:
employee_id,name,department,city,salary
1,Alice,Engineering,Toronto,95000
2,Bob,Sales,Mumbai,72000
3,Carol,Marketing,London,68000
4,Dave,Engineering,Toronto,91000
5,Eve,Sales,Paris,85000
This file is the source of truth. On subsequent runs, it might have changes:
employee_id,name,department,city,salary
1,Alice,Engineering,Mumbai,98000
2,Bob,Sales,Mumbai,72000
3,Carol,Marketing,London,68000
4,Dave,Engineering,Toronto,91000
5,Eve,Sales,Paris,85000
6,Frank,HR,Berlin,70000
Changes: Alice moved to Mumbai and got a raise. Frank is new. Bob, Carol, Dave, Eve are unchanged.
The Target Table (SQL Dimension Table with Audit Columns)
CREATE TABLE dim_employee (
employee_id INT PRIMARY KEY,
name VARCHAR(100),
department VARCHAR(50),
city VARCHAR(50),
salary INT,
hash_value VARCHAR(64),
inserted_tms DATETIME DEFAULT GETDATE(),
inserted_by VARCHAR(100) DEFAULT SYSTEM_USER,
updated_tms DATETIME NULL,
updated_by VARCHAR(100) NULL
);
Why These Audit Columns?
| Column | Purpose | Populated When |
|---|---|---|
| `hash_value` | Stores the SHA-256 hash of all business columns | Every insert and update |
| `inserted_tms` | When the record was FIRST created | Insert only (`DEFAULT GETDATE()`) |
| `inserted_by` | Who/what created the record | Insert only (`DEFAULT SYSTEM_USER`) |
| `updated_tms` | When the record was LAST modified | Update only (set by data flow) |
| `updated_by` | Who/what modified the record | Update only (set by data flow) |
The key design: inserted_tms and inserted_by have DEFAULT constraints and are NEVER overwritten on updates. This preserves the original creation audit trail even after multiple updates.
Real-life analogy: It is like a package tracking system. inserted_tms is when the package was originally shipped (never changes). updated_tms is when the last status update happened (changes with every scan). Even after 10 status updates, you can still see the original ship date.
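A minimal T-SQL sketch of how those DEFAULTs behave (hash values shortened for readability):

```sql
-- The insert omits the audit columns, so the DEFAULTs fire:
-- inserted_tms = GETDATE(), inserted_by = SYSTEM_USER, updated_* stay NULL.
INSERT INTO dim_employee (employee_id, name, department, city, salary, hash_value)
VALUES (1, 'Alice', 'Engineering', 'Toronto', 95000, 'a3f7b2c1...');

-- The update touches business columns and the update audit columns only,
-- so inserted_tms and inserted_by keep their original values.
UPDATE dim_employee
SET city        = 'Mumbai',
    salary      = 98000,
    hash_value  = 'e9d4f8a6...',
    updated_tms = GETDATE(),
    updated_by  = 'SynapseDataFlow'
WHERE employee_id = 1;
```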
The Pipeline Overview
The pipeline has one primary activity: a Data Flow activity that contains all the transformation logic. Unlike the previous post’s pipeline with separate Lookup, ForEach, and Copy activities, this pipeline uses a Data Flow to handle everything in a single Spark-powered execution.
Pipeline: PL_SCD1_HashBased
|
|-- Data Flow Activity: DF_SCD1_Employee
|
|-- Source (CSV from ADLS)
|-- Derived Column (generate hash)
|-- Select (prefix with SRC_)
|-- Lookup (join with dim table)
|-- Conditional Split (new / changed / unchanged)
|-- Alter Row + Union + Sink
Activity 1: Data Flow (The Brain of the Pipeline)
Why a Data Flow Instead of Copy + Stored Procedure?
The Copy activity can move data but cannot compare, hash, split, or conditionally insert/update. For change detection logic, you need either:
- Data Flow (visual, no code, runs on Spark) — what we use here
- Stored Procedure (SQL-based MERGE statement) — alternative approach (sketched at the end of this section)
- Spark Notebook (PySpark code) — for maximum flexibility
Data Flow is the best choice for visual, maintainable, code-free SCD implementation.
Real-life analogy: The Copy activity is a delivery truck — it moves boxes from A to B without opening them. The Data Flow is a sorting facility — it opens every box, inspects the contents, routes them to the right destination, and stamps them with tracking information.
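For comparison, here is a hedged sketch of the stored procedure alternative from the list above: the same new/changed/unchanged routing as a single T-SQL MERGE, assuming the CSV has first been staged into a hypothetical stg_employee table with a precomputed hash_value column:

```sql
MERGE dim_employee AS tgt
USING stg_employee AS src
   ON tgt.employee_id = src.employee_id
-- "NewRecords": no match in the dimension table, so insert
WHEN NOT MATCHED BY TARGET THEN
    INSERT (employee_id, name, department, city, salary, hash_value)
    VALUES (src.employee_id, src.name, src.department,
            src.city, src.salary, src.hash_value)
-- "ChangedRecords": key matches but the fingerprint differs, so update
WHEN MATCHED AND tgt.hash_value <> src.hash_value THEN
    UPDATE SET name        = src.name,
               department  = src.department,
               city        = src.city,
               salary      = src.salary,
               hash_value  = src.hash_value,
               updated_tms = GETDATE(),
               updated_by  = 'MergeProc';
-- "UnchangedRecords": matched rows with equal hashes fall through untouched.
```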
Transformation 1: Source (Reading the CSV)
What It Does
Reads the CSV file from ADLS Gen2 into the Data Flow as the starting dataset.
Configuration
- Add Source transformation
- Name: `SourceCSV`
- Source type: Dataset (or Inline)
- Dataset: parameterized ADLS CSV dataset
- Linked service: your ADLS Gen2
- File path: container + folder path (parameterized for reuse)
- Source options:
- Header: First row as header
- Schema: infer from connection (or define manually)
Why Parameterized
The source dataset accepts a FileName parameter, so the same pipeline can process different CSV files:
Run 1: FileName = "employees.csv"
Run 2: FileName = "departments.csv"
Run 3: FileName = "products.csv"
One pipeline, multiple use cases. The pipeline parameter feeds into the Data Flow parameter, which feeds into the dataset parameter.
Data Preview After Source
| employee_id | name | department | city | salary |
|---|---|---|---|---|
| 1 | Alice | Engineering | Toronto | 95000 |
| 2 | Bob | Sales | Mumbai | 72000 |
| 3 | Carol | Marketing | London | 68000 |
| 4 | Dave | Engineering | Toronto | 91000 |
| 5 | Eve | Sales | Paris | 85000 |
Transformation 2: Derived Column — Generate Source Hash
What It Does
Calculates a SHA-256 hash from all business columns to create a unique fingerprint for each row.
Configuration
- Add Derived Column after Source
- Name: `GenerateHash`
- New column: `src_hash_value`
- Expression:
sha2(256, concat(toString(employee_id), '|', name, '|', department, '|', city, '|', toString(salary)))
Breaking Down the Expression
| Part | Purpose |
|---|---|
| `sha2(256, ...)` | Generates a SHA-256 hash (64-character hex string) |
| `concat(...)` | Joins all column values into one string |
| `toString(employee_id)` | Converts integer to string for concatenation |
| `'\|'` | Delimiter between values to prevent false matches |
Why SHA-256?
- Fixed length — always 64 characters, regardless of input size
- Collision resistant — virtually impossible for two different inputs to produce the same hash
- Deterministic — same input always produces the same output (critical for comparison)
- One-way — you cannot reverse-engineer the original data from the hash (security benefit)
Data Preview After Hash
| employee_id | name | city | salary | src_hash_value |
|---|---|---|---|---|
| 1 | Alice | Toronto | 95000 | a3f7b2c1d9e4… |
| 2 | Bob | Mumbai | 72000 | b8c2e5f1a7d3… |
Real-life analogy: Generating the hash is like taking a photograph of each row. Instead of describing every detail (“Alice, Engineering, Toronto, 95000”), you take a snapshot. Next time, you take another snapshot. If the photos look different, something changed. You do not need to compare every detail — the photo tells you instantly.
Transformation 3: Select — Prefix Columns with SRC_
What It Does
Renames all source columns with a SRC_ prefix to distinguish them from the dimension table columns during the upcoming Lookup join.
Configuration
- Add Select after GenerateHash
- Name: `PrefixSRC`
- Rename mappings:
| Input Column | Output Column |
|---|---|
| employee_id | SRC_employee_id |
| name | SRC_name |
| department | SRC_department |
| city | SRC_city |
| salary | SRC_salary |
| src_hash_value | SRC_hash_value |
Why Prefixing Is Essential
When you join the source data with the dimension table in the next step, BOTH have columns called employee_id, name, city, etc. Without prefixing, the join output has ambiguous column names — which city is the source and which is the target?
Without prefix: employee_id, name, city, employee_id, name, city ← Ambiguous!
With SRC_ prefix: SRC_employee_id, SRC_name, SRC_city, employee_id, name, city ← Clear!
Real-life analogy: When two guests at a wedding have the same name (both called “Alice”), the seating chart adds a label: “Alice (Bride’s side)” and “Alice (Groom’s side).” The prefix is the label that prevents confusion.
Transformation 4: Lookup — Join with Existing Dimension Table
What It Does
Joins the source data (CSV) with the existing dimension table (SQL) to find which source records already exist in the target.
Configuration
- Add Lookup transformation after PrefixSRC
- Name: `LookupDim`
- Primary stream: `PrefixSRC` (source CSV)
- Lookup stream: add a second Source that reads from the `dim_employee` SQL table
- Lookup condition: `SRC_employee_id == employee_id`
- Multiple matches: First match (each employee_id should be unique)
- Lookup type: Left outer (keep ALL source rows, even if not in dim table)
Why Left Outer Lookup (Not Inner)
Inner would drop source rows that do not exist in the dim table — meaning new employees would be lost.
Left Outer keeps ALL source rows. If a source row has no match in the dim table (new employee), the dim columns come back as NULL. We use this NULL to identify new records in the Conditional Split.
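In SQL terms, this Lookup behaves like a LEFT JOIN. A sketch, again using a hypothetical staged copy of the CSV called stg_employee:

```sql
SELECT src.employee_id AS SRC_employee_id,
       src.name        AS SRC_name,
       src.hash_value  AS SRC_hash_value,
       tgt.employee_id,        -- NULL here means no match: a new record
       tgt.hash_value          -- the stored fingerprint to compare against
FROM stg_employee AS src
LEFT JOIN dim_employee AS tgt
       ON tgt.employee_id = src.employee_id;
```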
Data Preview After Lookup (First Run — Empty Dim Table)
| SRC_employee_id | SRC_name | SRC_city | SRC_hash_value | employee_id | hash_value |
|---|---|---|---|---|---|
| 1 | Alice | Toronto | a3f7… | NULL | NULL |
| 2 | Bob | Mumbai | b8c2… | NULL | NULL |
| 3 | Carol | London | f1d9… | NULL | NULL |
All dim columns are NULL because the dim table is empty. Every record is NEW.
Data Preview After Lookup (Subsequent Run — Dim Table Has Data)
| SRC_employee_id | SRC_name | SRC_city | SRC_hash_value | employee_id | hash_value |
|---|---|---|---|---|---|
| 1 | Alice | Mumbai | e9d4… | 1 | a3f7… |
| 2 | Bob | Mumbai | b8c2… | 2 | b8c2… |
| 6 | Frank | Berlin | c7a1… | NULL | NULL |
- Alice: hash differs (changed city and salary)
- Bob: hash matches (no change)
- Frank: no match in dim (new employee)
Transformation 5: Conditional Split — Route to New, Changed, or Unchanged
What It Does
Routes each row to a different stream based on whether it is new, changed, or unchanged.
Configuration
- Add Conditional Split after LookupDim
- Name: `SplitNewChangedUnchanged`
- Conditions (evaluated top to bottom):
| Stream Name | Condition | Logic |
|---|---|---|
| `NewRecords` | `isNull(employee_id)` | Dim table has no match → new employee |
| `ChangedRecords` | `SRC_hash_value != hash_value` | Hash differs → something changed |
| `UnchangedRecords` | (Default stream) | Everything else → hash matches, no change |
Why This Order Matters
Conditions are evaluated TOP to BOTTOM. A row matches the FIRST condition that is true.
- Check NEW first — if `employee_id` is NULL (no dim match), it is new. Do not check hash (the hash is also NULL for new records).
- Check CHANGED second — if employee_id exists but hash differs, it is changed.
- Default = UNCHANGED — everything else has matching hashes. No action needed.
Real-life analogy: Airport immigration. First lane: “No passport on file? → New visitor, go to registration.” Second lane: “Passport on file but photo looks different? → Changed appearance, go to verification.” Third lane: “Passport matches? → Welcome back, proceed.”
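The same routing can be sketched as a T-SQL CASE expression, which, like the split, evaluates branches top to bottom and stops at the first match (hypothetical stg_employee staging table again):

```sql
SELECT src.employee_id,
       CASE
           WHEN tgt.employee_id IS NULL          THEN 'NewRecords'     -- checked first
           WHEN src.hash_value <> tgt.hash_value THEN 'ChangedRecords'
           ELSE                                       'UnchangedRecords'
       END AS route
FROM stg_employee AS src
LEFT JOIN dim_employee AS tgt
       ON tgt.employee_id = src.employee_id;
```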
Row Counts After Split
First run (empty dim): All 5 rows go to NewRecords. 0 Changed. 0 Unchanged.
Second run (with changes): 1 New (Frank), 1 Changed (Alice), 4 Unchanged (Bob, Carol, Dave, Eve).
Transformation 6: Alter Row (Insert) — New Records
What It Does
Marks all rows in the NewRecords stream for INSERT into the SQL table.
Configuration
- Add Alter Row after the `NewRecords` output of Conditional Split
- Name: `MarkInsert`
- Alter row condition: Insert if `true()` (all rows in this stream are inserts)
Why Alter Row Is Needed
Data Flow sinks can perform INSERT, UPDATE, DELETE, or UPSERT. But the sink needs to know WHICH operation to perform for each row. Alter Row is the instruction that says “every row coming through this path should be INSERTed.”
Without Alter Row, the sink does not know what to do and defaults to INSERT only (which would fail for updates).
Real-life analogy: Alter Row is like stamping a package at the post office. “NEW DELIVERY” stamp means deliver to a new address. “RETURN TO SENDER” stamp means update the existing record. The delivery driver (sink) reads the stamp and acts accordingly.
Transformation 7: Alter Row (Update) — Changed Records
What It Does
Marks all rows in the ChangedRecords stream for UPDATE in the SQL table.
Configuration
- Add Alter Row after the `ChangedRecords` output
- Name: `MarkUpdate`
- Alter row condition: Update if `true()` (all rows in this stream are updates)
Transformation 8: Derived Column — Add Update Audit Columns
What It Does
Adds updated_tms and updated_by columns ONLY for changed records. New records do not get these columns — their inserted_tms comes from the SQL DEFAULT constraint.
Configuration
- Add Derived Column after `MarkUpdate` (on the Changed path only)
- Name: `AddUpdateAudit`
- Columns:
  - `updated_tms` = `currentTimestamp()`
  - `updated_by` = `'SynapseDataFlow'`
Why Only on the Update Path
New records should have:
– inserted_tms = GETDATE() (from SQL DEFAULT)
– inserted_by = SYSTEM_USER (from SQL DEFAULT)
– updated_tms = NULL (never updated yet)
– updated_by = NULL
Changed records should have:
– inserted_tms = original value (preserved, NOT overwritten)
– inserted_by = original value (preserved)
– updated_tms = NOW (set by this Derived Column)
– updated_by = ‘SynapseDataFlow’ (set by this Derived Column)
This is why we add audit columns on the update path only. The insert path relies on SQL DEFAULTs.
Real-life analogy: When a package is first shipped, the warehouse stamps the “Ship Date” (inserted_tms). When the package is redirected to a new address, the courier stamps a “Redirect Date” (updated_tms) but does NOT change the original “Ship Date.” Both timestamps coexist, telling the complete story.
Transformation 9: Union — Merge Insert and Update Streams
What It Does
Combines the Insert stream and Update stream back into a single stream that feeds into the Sink.
Configuration
- Add Union transformation
- Name: `UnionInsertUpdate`
- Inputs:
  - Stream 1: `MarkInsert` (new records marked for INSERT)
  - Stream 2: `AddUpdateAudit` (changed records marked for UPDATE, with audit columns)
Why Union Before Sink
You COULD have two separate sinks — one for inserts and one for updates. But Union + single sink is cleaner:
- One sink = one connection, one transaction
- Easier monitoring — one output shows total inserts + updates
- Less code — no duplicate sink configuration
The Alter Row stamps (INSERT vs UPDATE) travel through the Union. The sink reads each row’s stamp and performs the correct operation.
Real-life analogy: At the post office, packages stamped “NEW DELIVERY” and “RETURN/UPDATE” are placed on the same conveyor belt (Union). The sorting machine at the end (Sink) reads the stamp and routes each package to the correct truck.
Transformation 10: Sink — Write to SQL Dimension Table
What It Does
Writes the final data to the dim_employee SQL table. Inserts new rows and updates changed rows based on the Alter Row stamps.
Configuration
- Add Sink transformation after Union
- Name: `SinkDimEmployee`
- Dataset: SQL dataset pointing to `dim_employee`
- Settings:
  - Update method: Allow insert ✓ and Allow update ✓
  - Key columns: `SRC_employee_id` (the business key for matching updates)
- Mapping:
  - Map `SRC_employee_id` → `employee_id`
  - Map `SRC_name` → `name`
  - Map `SRC_department` → `department`
  - Map `SRC_city` → `city`
  - Map `SRC_salary` → `salary`
  - Map `SRC_hash_value` → `hash_value`
  - Map `updated_tms` → `updated_tms` (NULL for inserts, timestamp for updates)
  - Map `updated_by` → `updated_by` (NULL for inserts, value for updates)
  - Do NOT map `inserted_tms` or `inserted_by` (let SQL DEFAULTs handle them)
Why Key Columns Matter
Key columns tell the sink HOW to match for updates:
UPDATE dim_employee
SET name = 'Alice', city = 'Mumbai', salary = 98000, hash_value = 'e9d4...',
updated_tms = '2026-04-16 15:30:00', updated_by = 'SynapseDataFlow'
WHERE employee_id = 1; ← This is the key column
Without the key column, the sink does not know WHICH row to update.
The Audit Trail: inserted_tms vs updated_tms
After the first run, Alice’s record:
employee_id=1, name='Alice', city='Toronto', hash='a3f7...',
inserted_tms='2026-04-16 14:00:00', inserted_by='SynapseUser',
updated_tms=NULL, updated_by=NULL
After the second run (Alice moved to Mumbai):
employee_id=1, name='Alice', city='Mumbai', hash='e9d4...',
inserted_tms='2026-04-16 14:00:00', ← PRESERVED (original insert date)
inserted_by='SynapseUser', ← PRESERVED (original inserter)
updated_tms='2026-04-17 08:00:00', ← SET (when the update happened)
updated_by='SynapseDataFlow' ← SET (what did the update)
The inserted_tms is NEVER overwritten. You can always answer: “When was Alice first added to the system?” AND “When was she last updated?”
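Both questions reduce to a single query:

```sql
-- "When was Alice first added?" and "When was she last updated?"
SELECT employee_id,
       inserted_tms AS first_added,   -- never overwritten
       updated_tms  AS last_changed   -- NULL until the first update
FROM dim_employee
WHERE employee_id = 1;
```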
Idempotency: Why Running Twice Does Nothing
After an initial load, run the pipeline twice with the same unchanged CSV file:
Run 1:
– Lookup finds all records in dim table
– Hash comparison: ALL hashes match (nothing changed in the CSV)
– Conditional Split: ALL rows go to UnchangedRecords stream
– UnchangedRecords stream is dropped (no sink)
– Result: 0 inserts, 0 updates
Run 2 (identical file):
– Same result: 0 inserts, 0 updates
This is idempotency — running the same operation multiple times produces the same result. The pipeline is safe to rerun after failures without creating duplicates or false updates.
Real-life analogy: Pressing the elevator button 5 times does not make the elevator come 5 times. It comes once. Pressing again while it is already coming does nothing. The hash comparison is the “already pressed” check.
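A quick way to verify this from the database side, assuming you note the run's start time (the timestamp below is hypothetical):

```sql
DECLARE @run_start DATETIME = '2026-04-17 08:00:00';  -- hypothetical run start

-- After rerunning with an unchanged file, both counts should be zero.
SELECT SUM(CASE WHEN inserted_tms >= @run_start THEN 1 ELSE 0 END) AS rows_inserted,
       SUM(CASE WHEN updated_tms  >= @run_start THEN 1 ELSE 0 END) AS rows_updated
FROM dim_employee;
```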
First Run vs Subsequent Runs
First Run (Empty Dim Table)
Source CSV: 5 rows
→ Generate Hash: 5 rows with src_hash_value
→ Prefix SRC_: 5 rows with SRC_ columns
→ Lookup Dim: 5 rows, ALL dim columns are NULL (empty table)
→ Conditional Split:
NewRecords: 5 (all isNull(employee_id))
ChangedRecords: 0
UnchangedRecords: 0
→ Alter Row (Insert): 5 rows
→ Sink: INSERT 5 rows into dim_employee
Audit result: All 5 rows have inserted_tms, updated_tms=NULL
Second Run (Alice Changed, Frank Added)
Source CSV: 6 rows (Alice changed, Frank is new)
→ Generate Hash: 6 rows
→ Prefix SRC_: 6 rows
→ Lookup Dim: 6 rows joined with 5 existing dim rows
→ Conditional Split:
NewRecords: 1 (Frank, isNull(employee_id))
ChangedRecords: 1 (Alice, SRC_hash != hash)
UnchangedRecords: 4 (Bob, Carol, Dave, Eve, hashes match)
→ Alter Row (Insert): 1 row (Frank)
→ Alter Row (Update) + Derived (audit): 1 row (Alice)
→ Union: 2 rows
→ Sink: INSERT 1 + UPDATE 1
Audit result: Frank has inserted_tms, Alice has updated_tms set
Third Run (No Changes)
Source CSV: 6 rows (identical to previous run)
→ Conditional Split:
NewRecords: 0
ChangedRecords: 0
UnchangedRecords: 6 (all hashes match)
→ Nothing reaches the Sink
→ 0 inserts, 0 updates (idempotent)
The Complete Data Flow Visual
SourceCSV
│
▼
GenerateHash (SHA-256 of all business columns)
│
▼
PrefixSRC (rename columns with SRC_ prefix)
│
▼
LookupDim ◄──── DimSource (reads dim_employee SQL table)
│
▼
ConditionalSplit
│ │ │
▼ ▼ ▼
NewRecords Changed Unchanged
│ Records (dropped)
│ │
▼ ▼
AlterRow AlterRow
(Insert) (Update)
│ │
│ ▼
│ AddUpdateAudit
│ (updated_tms, updated_by)
│ │
▼ ▼
Union (merge both streams)
│
▼
SinkDimEmployee (INSERT new + UPDATE changed)
Parameterized Source Dataset
Pipeline Parameter
Pipeline parameter: SourceFileName (String)
Default: "employees.csv"
Data Flow Parameter
Data Flow parameter: DFSourceFile (String)
Dataset Parameter
Dataset parameter: FileName (String)
Connection: container/folder/@dataset().FileName
Wiring
Pipeline parameter SourceFileName
→ Data Flow activity parameter DFSourceFile = @pipeline().parameters.SourceFileName
→ Source dataset FileName = $DFSourceFile
This chain lets you reuse the same pipeline for different files by passing a different SourceFileName when you trigger it.
Performance Considerations
- Broadcast the dim table — if your dimension table is small (under 50 MB), broadcast it during the Lookup. This avoids shuffling the large source data.
- Index the hash column — add an index on `hash_value` in the SQL table for faster comparisons (see the sketch after this list).
- Index the key column — `employee_id` should be the PRIMARY KEY (already indexed).
- Partition large source files — if the CSV has millions of rows, Spark will auto-partition for parallel processing.
- Use the right cluster size — for dev, 4+4 cores is fine. For production with millions of rows, scale to 16+16 or more.
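A sketch of the supporting index from the list above; whether it pays off depends on how the sink and your queries actually filter, so treat it as a starting point, not a rule:

```sql
-- employee_id needs nothing extra: the PRIMARY KEY already indexes it.
CREATE NONCLUSTERED INDEX IX_dim_employee_hash_value
    ON dim_employee (hash_value);
```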
Common Mistakes
1. Forgetting the Delimiter in Hash Concatenation
WRONG: sha2(256, concat(name, city, salary))
Name="AB", City="C" and Name="A", City="BC" both concatenate to "ABC"
→ two different rows produce the SAME hash, so a real change is missed
RIGHT: sha2(256, concat(name, '|', city, '|', toString(salary)))
"AB|C" vs "A|BC" — different strings, different hashes
2. Not Handling NULLs in Hash
If a column is NULL, concat(NULL, '|', 'Toronto') returns NULL in some engines.
Fix: Use coalesce() to replace NULLs:
sha2(256, concat(coalesce(toString(employee_id),''), '|', coalesce(name,''), '|', coalesce(city,'')))
3. Mapping inserted_tms in the Sink
If you map inserted_tms in the sink, updates will overwrite the original insert timestamp.
Fix: Do NOT map inserted_tms or inserted_by. Let SQL DEFAULTs handle them on insert, and they are untouched on update.
4. Using Inner Lookup Instead of Left Outer
Inner Lookup drops new records (no match in dim table).
Fix: Always use Left Outer for SCD pipelines. New records must flow through.
5. Wrong Conditional Split Order
If you check Changed before New, records with NULL hashes (new records) might match SRC_hash != NULL unexpectedly.
Fix: Always check isNull(employee_id) FIRST (new records), then hash comparison (changed), then default (unchanged).
Interview Questions
Q: What is hash-based change detection? A: Instead of comparing every column individually, you concatenate all column values and generate a SHA-256 hash. Compare the source hash with the stored target hash — if they differ, the record changed. One comparison instead of N column comparisons, and it scales regardless of column count.
Q: Why prefix source columns with SRC_ in the data flow? A: When joining source and target, both have columns with the same names (employee_id, name, city). Prefixing source columns with SRC_ prevents ambiguity and makes it clear which column belongs to which stream in downstream transformations and mappings.
Q: How does the pipeline ensure inserted_tms is never overwritten? A: By NOT mapping inserted_tms in the Data Flow sink. On INSERT, the SQL DEFAULT constraint sets it to GETDATE(). On UPDATE, the column is not included in the SET clause, so it retains its original value. The updated_tms column is only added on the update path via a Derived Column transformation.
Q: What makes this pipeline idempotent? A: The hash comparison. If the source data has not changed since the last run, all hashes match, all rows go to the Unchanged stream (dropped), and the sink receives zero rows. No inserts, no updates, no side effects. The pipeline is safe to rerun indefinitely.
Q: What is the difference between Alter Row Insert and Alter Row Update? A: Alter Row stamps each row with an operation type. Insert tells the sink to INSERT the row (new record). Update tells the sink to UPDATE the matching row based on the key column. Without Alter Row, the sink does not know which operation to perform.
Q: Why Union before the Sink instead of two separate Sinks? A: Union merges the insert and update streams into one. One sink means one connection, one transaction, and simpler monitoring. The Alter Row stamps survive through the Union, so the sink still knows which rows to insert and which to update.
Wrapping Up
This pipeline is the gold standard for SCD Type 1 with change detection. Instead of blindly overwriting everything, it fingerprints each row with a hash, compares it against the existing data, and only processes what actually changed.
The result: faster execution (skip unchanged rows), complete audit trail (inserted_tms preserved, updated_tms tracked), and idempotency (safe to rerun).
The pattern works for any dimension table — employees, customers, products, stores. Change the source, change the columns in the hash, and the rest of the pipeline stays the same.
Related posts:
– SCD Type 1 Full Load Pipeline
– Slowly Changing Dimensions (All Types)
– Data Flows in ADF/Synapse
– Joins in Synapse Data Flows
– Metadata-Driven Pipeline in ADF
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.