SCD Type 1 Pipeline with Hash-Based Change Detection in Azure Synapse: Every Activity Explained

In the previous post, we built an SCD Type 1 pipeline that does a full load every time — dump everything from source to destination. Simple, reliable, but wasteful. If you have 10 million rows and only 50 changed, you are rewriting 10 million rows just to update 50.

This post builds a smarter SCD Type 1 pipeline that uses hash-based change detection to figure out exactly which rows are new, which have changed, and which are identical — and only processes what needs processing.

Think of it like a hotel housekeeping team. The basic approach (full load) sends housekeeping to clean EVERY room, even if guests have not checked in. The smart approach (hash-based) checks the “Do Not Disturb” sign first. If the sign is out (hash matches = no change), skip the room. If the sign is NOT out (hash differs = something changed), go in and clean. If it is a brand new room (no hash = new record), set it up from scratch.

This is the exact pipeline pattern we built in Synapse using Data Flows, and this post explains every single activity, transformation, and expression.

Table of Contents

  • What Is Hash-Based Change Detection?
  • Why Hashing Instead of Column-by-Column Comparison?
  • The Complete Pipeline Architecture
  • The Source Data (CSV in ADLS Gen2)
  • The Target Table (SQL Dimension Table with Audit Columns)
  • The Pipeline Overview
  • Activity 1: Data Flow (The Brain of the Pipeline)
  • Transformation 1: Source (Reading the CSV)
  • Transformation 2: Derived Column — Generate Source Hash
  • Transformation 3: Select — Prefix Columns with SRC_
  • Transformation 4: Lookup — Join with Existing Dimension Table
  • Transformation 5: Conditional Split — Route to New, Changed, or Unchanged
  • Transformation 6: Alter Row (Insert) — New Records
  • Transformation 7: Alter Row (Update) — Changed Records
  • Transformation 8: Derived Column — Add Update Audit Columns
  • Transformation 9: Union — Merge Insert and Update Streams
  • Transformation 10: Sink — Write to SQL Dimension Table
  • The Audit Trail: inserted_tms vs updated_tms
  • Idempotency: Why Running Twice Does Nothing
  • First Run vs Subsequent Runs
  • The Complete Data Flow Visual
  • Parameterized Source Dataset
  • Performance Considerations
  • Common Mistakes
  • Interview Questions
  • Wrapping Up

What Is Hash-Based Change Detection?

Instead of comparing every column value between source and target to find changes, you create a single hash value from all columns. If the hash is the same, nothing changed. If the hash is different, something changed.

Source Row:    Name="Alice", City="Toronto", Salary=95000
Source Hash:   SHA256("Alice|Toronto|95000") = "a3f7b2c1..."

Target Row:    Name="Alice", City="Toronto", Salary=95000
Target Hash:   SHA256("Alice|Toronto|95000") = "a3f7b2c1..."

Comparison:    "a3f7b2c1..." == "a3f7b2c1..." → NO CHANGE (skip)

Now Alice moves to Mumbai:

Source Row:    Name="Alice", City="Mumbai", Salary=95000
Source Hash:   SHA256("Alice|Mumbai|95000") = "e9d4f8a6..."

Target Row:    Name="Alice", City="Toronto", Salary=95000
Target Hash:   SHA256("Alice|Toronto|95000") = "a3f7b2c1..."

Comparison:    "e9d4f8a6..." != "a3f7b2c1..." → CHANGED (update)

Real-life analogy: Hashing is like a fingerprint for your data. Every person (row) has a unique fingerprint (hash). If someone’s fingerprint changes, their identity has changed. You do not need to compare height, weight, hair color, and eye color individually — the fingerprint catches ALL changes in one comparison.
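The idea is easy to see outside of Synapse. A minimal sketch in plain Python (using the standard hashlib module, not Data Flow syntax):

```python
import hashlib

def row_hash(*values):
    # Join column values with '|' and return the SHA-256 hex digest
    joined = "|".join(str(v) for v in values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

before = row_hash("Alice", "Toronto", 95000)
after_no_change = row_hash("Alice", "Toronto", 95000)
after_move = row_hash("Alice", "Mumbai", 95000)

print(before == after_no_change)  # True  -> no change, skip the row
print(before == after_move)       # False -> something changed, update
```

One comparison catches a change in any column, which is exactly what the Data Flow expression does later in this post.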

Why Hashing Instead of Column-by-Column Comparison?

| Approach | Comparison Logic | Complexity |
|---|---|---|
| Column-by-column | source.name != target.name OR source.city != target.city OR source.salary != target.salary OR ... | Grows with every column. 20 columns = 20 OR conditions. |
| Hash-based | source.hash != target.hash | Always ONE comparison, regardless of column count. |

Additional Benefits of Hashing

  • NULL handling — hashing naturally handles NULLs (they become part of the hash string)
  • Any data type — dates, numbers, strings all concatenate into one hash
  • Performance — one string comparison vs 20 individual comparisons
  • Maintainability — adding a new column means adding it to the hash formula, not rewriting the comparison logic

The Hash Formula

hash_value = SHA256(concat(column1, '|', column2, '|', column3, '|', ...))

The | delimiter prevents false matches. Without it, Name="AB" City="C" and Name="A" City="BC" would produce the same concatenation (“ABC”). With delimiter: “AB|C” vs “A|BC” — different hashes.
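You can verify the false-match problem directly. A quick Python check (hashlib stands in for the Data Flow's sha2 function):

```python
import hashlib

def h(s):
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

# No delimiter: two different rows collapse into the same string "ABC"
print(h("AB" + "C") == h("A" + "BC"))  # True -- a real change would go undetected

# With the '|' delimiter the strings stay distinct, so the hashes differ
print(h("AB" + "|" + "C") == h("A" + "|" + "BC"))  # False
```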

The Complete Pipeline Architecture

ADLS Gen2 (CSV)                     Synapse Data Flow                        Azure SQL DB
┌──────────────┐    ┌──────────────────────────────────────────────┐    ┌──────────────┐
│              │    │                                              │    │              │
│ employees.csv│───>│ Source ──> Derived (Hash) ──> Select (SRC_) │    │ dim_employee │
│              │    │                                    │         │    │              │
│              │    │                              Lookup (Join    │    │ (hash_value, │
│              │    │                              with dim table) │<───│  inserted_tms│
│              │    │                                    │         │    │  updated_tms)│
│              │    │                           Conditional Split  │    │              │
│              │    │                           /       |      \   │    │              │
│              │    │                        New    Changed  Unchanged│  │              │
│              │    │                         |        |       (drop) │  │              │
│              │    │                    AlterRow  AlterRow+Derived  │    │              │
│              │    │                    (Insert)  (Update+audit)    │    │              │
│              │    │                         \       /              │    │              │
│              │    │                          Union                 │    │              │
│              │    │                            │                   │    │              │
│              │    │                          Sink ────────────────>│    │              │
│              │    │                                              │    │              │
└──────────────┘    └──────────────────────────────────────────────┘    └──────────────┘

The Source Data (CSV in ADLS Gen2)

A simple employee CSV file stored in your data lake:

employee_id,name,department,city,salary
1,Alice,Engineering,Toronto,95000
2,Bob,Sales,Mumbai,72000
3,Carol,Marketing,London,68000
4,Dave,Engineering,Toronto,91000
5,Eve,Sales,Paris,85000

This file is the source of truth. On subsequent runs, it might have changes:

employee_id,name,department,city,salary
1,Alice,Engineering,Mumbai,98000
2,Bob,Sales,Mumbai,72000
3,Carol,Marketing,London,68000
4,Dave,Engineering,Toronto,91000
5,Eve,Sales,Paris,85000
6,Frank,HR,Berlin,70000

Changes: Alice moved to Mumbai and got a raise. Frank is new. Bob, Carol, Dave, Eve are unchanged.

The Target Table (SQL Dimension Table with Audit Columns)

CREATE TABLE dim_employee (
    employee_id     INT PRIMARY KEY,
    name            VARCHAR(100),
    department      VARCHAR(50),
    city            VARCHAR(50),
    salary          INT,
    hash_value      VARCHAR(64),
    inserted_tms    DATETIME DEFAULT GETDATE(),
    inserted_by     VARCHAR(100) DEFAULT SYSTEM_USER,
    updated_tms     DATETIME NULL,
    updated_by      VARCHAR(100) NULL
);

Why These Audit Columns?

| Column | Purpose | Populated When |
|---|---|---|
| hash_value | Stores the SHA-256 hash of all business columns | Every insert and update |
| inserted_tms | When the record was FIRST created | Insert only (DEFAULT GETDATE()) |
| inserted_by | Who/what created the record | Insert only (DEFAULT SYSTEM_USER) |
| updated_tms | When the record was LAST modified | Update only (set by data flow) |
| updated_by | Who/what modified the record | Update only (set by data flow) |

The key design: inserted_tms and inserted_by have DEFAULT constraints and are NEVER overwritten on updates. This preserves the original creation audit trail even after multiple updates.

Real-life analogy: It is like a package tracking system. inserted_tms is when the package was originally shipped (never changes). updated_tms is when the last status update happened (changes with every scan). Even after 10 status updates, you can still see the original ship date.

The Pipeline Overview

The pipeline has one primary activity: a Data Flow activity that contains all the transformation logic. Unlike the previous post’s pipeline with separate Lookup, ForEach, and Copy activities, this pipeline uses a Data Flow to handle everything in a single Spark-powered execution.

Pipeline: PL_SCD1_HashBased
  |
  |-- Data Flow Activity: DF_SCD1_Employee
      |
      |-- Source (CSV from ADLS)
      |-- Derived Column (generate hash)
      |-- Select (prefix with SRC_)
      |-- Lookup (join with dim table)
      |-- Conditional Split (new / changed / unchanged)
      |-- Alter Row + Union + Sink

Activity 1: Data Flow (The Brain of the Pipeline)

Why a Data Flow Instead of Copy + Stored Procedure?

The Copy activity can move data but cannot compare, hash, split, or conditionally insert/update. For change detection logic, you need either:

  • Data Flow (visual, no code, runs on Spark) — what we use here
  • Stored Procedure (SQL-based MERGE statement) — alternative approach
  • Spark Notebook (PySpark code) — for maximum flexibility

Data Flow is the best choice for visual, maintainable, code-free SCD implementation.

Real-life analogy: The Copy activity is a delivery truck — it moves boxes from A to B without opening them. The Data Flow is a sorting facility — it opens every box, inspects the contents, routes them to the right destination, and stamps them with tracking information.

Transformation 1: Source (Reading the CSV)

What It Does

Reads the CSV file from ADLS Gen2 into the Data Flow as the starting dataset.

Configuration

  1. Add Source transformation
  2. Name: SourceCSV
  3. Source type: Dataset (or Inline)
  4. Dataset: parameterized ADLS CSV dataset
  5. Linked service: your ADLS Gen2
  6. File path: container + folder path (parameterized for reuse)
  7. Source options:
     • Header: First row as header
     • Schema: infer from connection (or define manually)

Why Parameterized

The source dataset accepts a FileName parameter, so the same pipeline can process different CSV files:

Run 1: FileName = "employees.csv"
Run 2: FileName = "departments.csv"
Run 3: FileName = "products.csv"

One pipeline, multiple use cases. The pipeline parameter feeds into the Data Flow parameter, which feeds into the dataset parameter.

Data Preview After Source

| employee_id | name | department | city | salary |
|---|---|---|---|---|
| 1 | Alice | Engineering | Toronto | 95000 |
| 2 | Bob | Sales | Mumbai | 72000 |
| 3 | Carol | Marketing | London | 68000 |
| 4 | Dave | Engineering | Toronto | 91000 |
| 5 | Eve | Sales | Paris | 85000 |

Transformation 2: Derived Column — Generate Source Hash

What It Does

Calculates a SHA-256 hash from all business columns to create a unique fingerprint for each row.

Configuration

  1. Add Derived Column after Source
  2. Name: GenerateHash
  3. New column: src_hash_value
  4. Expression:
sha2(256, concat(toString(employee_id), '|', name, '|', department, '|', city, '|', toString(salary)))

Breaking Down the Expression

| Part | Purpose |
|---|---|
| sha2(256, ...) | Generates a SHA-256 hash (64-character hex string) |
| concat(...) | Joins all column values into one string |
| toString(employee_id) | Converts the integer to a string for concatenation |
| the pipe delimiter | Separates values to prevent false matches |

Why SHA-256?

  • Fixed length — always 64 characters, regardless of input size
  • Collision resistant — virtually impossible for two different inputs to produce the same hash
  • Deterministic — same input always produces the same output (critical for comparison)
  • One-way — you cannot reverse-engineer the original data from the hash (security benefit)
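The fixed-length and deterministic properties are easy to confirm with Python's hashlib (illustrative only; Synapse computes the same SHA-256 via sha2(256, ...)):

```python
import hashlib

def sha256_hex(s):
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

tiny = sha256_hex("x")
huge = sha256_hex("x" * 1_000_000)

# Fixed length: always 64 hex characters, whether the input is 1 byte or 1 MB
print(len(tiny), len(huge))  # 64 64

# Deterministic: the same input yields the same digest on every run
print(sha256_hex("Alice|Toronto|95000") == sha256_hex("Alice|Toronto|95000"))  # True
```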

Data Preview After Hash

| employee_id | name | city | salary | src_hash_value |
|---|---|---|---|---|
| 1 | Alice | Toronto | 95000 | a3f7b2c1d9e4… |
| 2 | Bob | Mumbai | 72000 | b8c2e5f1a7d3… |

Real-life analogy: Generating the hash is like taking a photograph of each row. Instead of describing every detail (“Alice, Engineering, Toronto, 95000”), you take a snapshot. Next time, you take another snapshot. If the photos look different, something changed. You do not need to compare every detail — the photo tells you instantly.

Transformation 3: Select — Prefix Columns with SRC_

What It Does

Renames all source columns with a SRC_ prefix to distinguish them from the dimension table columns during the upcoming Lookup join.

Configuration

  1. Add Select after GenerateHash
  2. Name: PrefixSRC
  3. Rename mappings:
| Input Column | Output Column |
|---|---|
| employee_id | SRC_employee_id |
| name | SRC_name |
| department | SRC_department |
| city | SRC_city |
| salary | SRC_salary |
| src_hash_value | SRC_hash_value |

Why Prefixing Is Essential

When you join the source data with the dimension table in the next step, BOTH have columns called employee_id, name, city, etc. Without prefixing, the join output has ambiguous column names — which city is the source and which is the target?

Without prefix:   employee_id, name, city, employee_id, name, city  ← Ambiguous!
With SRC_ prefix: SRC_employee_id, SRC_name, SRC_city, employee_id, name, city  ← Clear!

Real-life analogy: When two guests at a wedding have the same name (both called “Alice”), the seating chart adds a label: “Alice (Bride’s side)” and “Alice (Groom’s side).” The prefix is the label that prevents confusion.

Transformation 4: Lookup — Join with Existing Dimension Table

What It Does

Joins the source data (CSV) with the existing dimension table (SQL) to find which source records already exist in the target.

Configuration

  1. Add Lookup transformation after PrefixSRC
  2. Name: LookupDim
  3. Primary stream: PrefixSRC (source CSV)
  4. Lookup stream: add a second Source that reads from dim_employee SQL table
  5. Lookup condition: SRC_employee_id == employee_id
  6. Multiple matches: First match (each employee_id should be unique)
  7. Lookup type: Left outer (keep ALL source rows, even if not in dim table)

Why Left Outer Lookup (Not Inner)

Inner would drop source rows that do not exist in the dim table — meaning new employees would be lost.

Left Outer keeps ALL source rows. If a source row has no match in the dim table (new employee), the dim columns come back as NULL. We use this NULL to identify new records in the Conditional Split.

Data Preview After Lookup (First Run — Empty Dim Table)

| SRC_employee_id | SRC_name | SRC_city | SRC_hash_value | employee_id | hash_value |
|---|---|---|---|---|---|
| 1 | Alice | Toronto | a3f7… | NULL | NULL |
| 2 | Bob | Mumbai | b8c2… | NULL | NULL |
| 3 | Carol | London | f1d9… | NULL | NULL |

All dim columns are NULL because the dim table is empty. Every record is NEW.

Data Preview After Lookup (Subsequent Run — Dim Table Has Data)

| SRC_employee_id | SRC_name | SRC_city | SRC_hash_value | employee_id | hash_value |
|---|---|---|---|---|---|
| 1 | Alice | Mumbai | e9d4… | 1 | a3f7… |
| 2 | Bob | Mumbai | b8c2… | 2 | b8c2… |
| 6 | Frank | Berlin | c7a1… | NULL | NULL |

  • Alice: hash differs (changed city and salary)
  • Bob: hash matches (no change)
  • Frank: no match in dim (new employee)
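The lookup-plus-split logic maps naturally onto a left join followed by a three-way classification. A minimal Python sketch (a dict stands in for the dimension table; the names are illustrative, not Synapse syntax):

```python
import hashlib

def row_hash(*values):
    return hashlib.sha256("|".join(str(v) for v in values).encode("utf-8")).hexdigest()

# Existing dimension table: business key -> stored hash (stands in for dim_employee)
dim = {
    1: row_hash(1, "Alice", "Toronto", 95000),
    2: row_hash(2, "Bob", "Mumbai", 72000),
}

# Incoming source rows: (employee_id, name, city, salary)
source = [
    (1, "Alice", "Mumbai", 98000),   # hash differs -> changed
    (2, "Bob", "Mumbai", 72000),     # hash matches -> unchanged
    (6, "Frank", "Berlin", 70000),   # no dim match -> new
]

def classify(rows, dim):
    new, changed, unchanged = [], [], []
    for row in rows:
        key, src_hash = row[0], row_hash(*row)
        if key not in dim:            # left-join miss (dim columns would be NULL)
            new.append(row)
        elif src_hash != dim[key]:    # hashes differ
            changed.append(row)
        else:                         # hashes match -> drop
            unchanged.append(row)
    return new, changed, unchanged

new, changed, unchanged = classify(source, dim)
print(len(new), len(changed), len(unchanged))  # 1 1 1
```

Note the check order mirrors the Conditional Split in the next transformation: test for a missing key first, then compare hashes.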

Transformation 5: Conditional Split — Route to New, Changed, or Unchanged

What It Does

Routes each row to a different stream based on whether it is new, changed, or unchanged.

Configuration

  1. Add Conditional Split after LookupDim
  2. Name: SplitNewChangedUnchanged
  3. Conditions (evaluated top to bottom):
| Stream Name | Condition | Logic |
|---|---|---|
| NewRecords | isNull(employee_id) | Dim table has no match → new employee |
| ChangedRecords | SRC_hash_value != hash_value | Hash differs → something changed |
| UnchangedRecords | (Default stream) | Everything else → hash matches, no change |

Why This Order Matters

Conditions are evaluated TOP to BOTTOM. A row matches the FIRST condition that is true.

  1. Check NEW first — if employee_id is NULL (no dim match), it is new. Do not check hash (hash is also NULL for new records).
  2. Check CHANGED second — if employee_id exists but hash differs, it is changed.
  3. Default = UNCHANGED — everything else has matching hashes. No action needed.

Real-life analogy: Airport immigration. First lane: “No passport on file? → New visitor, go to registration.” Second lane: “Passport on file but photo looks different? → Changed appearance, go to verification.” Third lane: “Passport matches? → Welcome back, proceed.”

Row Counts After Split

First run (empty dim): All 5 rows go to NewRecords. 0 Changed. 0 Unchanged.

Second run (with changes): 1 New (Frank), 1 Changed (Alice), 4 Unchanged (Bob, Carol, Dave, Eve).

Transformation 6: Alter Row (Insert) — New Records

What It Does

Marks all rows in the NewRecords stream for INSERT into the SQL table.

Configuration

  1. Add Alter Row after NewRecords output of Conditional Split
  2. Name: MarkInsert
  3. Alter row condition: Insert if true() (all rows in this stream are inserts)

Why Alter Row Is Needed

Data Flow sinks can perform INSERT, UPDATE, DELETE, or UPSERT. But the sink needs to know WHICH operation to perform for each row. Alter Row is the instruction that says “every row coming through this path should be INSERTed.”

Without Alter Row, the sink does not know what to do and defaults to INSERT only (which would fail for updates).

Real-life analogy: Alter Row is like stamping a package at the post office. “NEW DELIVERY” stamp means deliver to a new address. “RETURN TO SENDER” stamp means update the existing record. The delivery driver (sink) reads the stamp and acts accordingly.

Transformation 7: Alter Row (Update) — Changed Records

What It Does

Marks all rows in the ChangedRecords stream for UPDATE in the SQL table.

Configuration

  1. Add Alter Row after ChangedRecords output
  2. Name: MarkUpdate
  3. Alter row condition: Update if true() (all rows in this stream are updates)

Transformation 8: Derived Column — Add Update Audit Columns

What It Does

Adds updated_tms and updated_by columns ONLY for changed records. New records do not get these columns — their inserted_tms comes from the SQL DEFAULT constraint.

Configuration

  1. Add Derived Column after MarkUpdate (on the Changed path only)
  2. Name: AddUpdateAudit
  3. Columns:
     • updated_tms = currentTimestamp()
     • updated_by = 'SynapseDataFlow'

Why Only on the Update Path

New records should have:

  • inserted_tms = GETDATE() (from SQL DEFAULT)
  • inserted_by = SYSTEM_USER (from SQL DEFAULT)
  • updated_tms = NULL (never updated yet)
  • updated_by = NULL

Changed records should have:

  • inserted_tms = original value (preserved, NOT overwritten)
  • inserted_by = original value (preserved)
  • updated_tms = NOW (set by this Derived Column)
  • updated_by = 'SynapseDataFlow' (set by this Derived Column)

This is why we add audit columns on the update path only. The insert path relies on SQL DEFAULTs.
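The preserve-on-update rule can be sketched in a few lines of Python (an illustrative in-memory merge, not what Synapse executes; the hash strings are placeholders):

```python
from datetime import datetime, timezone

def apply_update(existing, incoming, actor="SynapseDataFlow"):
    # Overwrite business columns and the hash, but preserve the insert audit trail
    merged = {**existing, **incoming}
    merged["inserted_tms"] = existing["inserted_tms"]   # never overwritten
    merged["inserted_by"] = existing["inserted_by"]     # never overwritten
    merged["updated_tms"] = datetime.now(timezone.utc)  # stamped on every update
    merged["updated_by"] = actor
    return merged

existing = {"employee_id": 1, "city": "Toronto", "hash_value": "a3f7",
            "inserted_tms": "2026-04-16 14:00:00", "inserted_by": "SynapseUser",
            "updated_tms": None, "updated_by": None}
incoming = {"employee_id": 1, "city": "Mumbai", "hash_value": "e9d4"}

row = apply_update(existing, incoming)
print(row["city"], row["inserted_tms"], row["updated_by"])
```

In the actual pipeline the same effect comes for free: the sink simply never maps inserted_tms or inserted_by, so an UPDATE cannot touch them.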

Real-life analogy: When a package is first shipped, the warehouse stamps the “Ship Date” (inserted_tms). When the package is redirected to a new address, the courier stamps a “Redirect Date” (updated_tms) but does NOT change the original “Ship Date.” Both timestamps coexist, telling the complete story.

Transformation 9: Union — Merge Insert and Update Streams

What It Does

Combines the Insert stream and Update stream back into a single stream that feeds into the Sink.

Configuration

  1. Add Union transformation
  2. Name: UnionInsertUpdate
  3. Inputs:
     • Stream 1: MarkInsert (new records marked for INSERT)
     • Stream 2: AddUpdateAudit (changed records marked for UPDATE, with audit columns)

Why Union Before Sink

You COULD have two separate sinks — one for inserts and one for updates. But Union + single sink is cleaner:

  • One sink = one connection, one transaction
  • Easier monitoring — one output shows total inserts + updates
  • Less code — no duplicate sink configuration

The Alter Row stamps (INSERT vs UPDATE) travel through the Union. The sink reads each row’s stamp and performs the correct operation.

Real-life analogy: At the post office, packages stamped “NEW DELIVERY” and “RETURN/UPDATE” are placed on the same conveyor belt (Union). The sorting machine at the end (Sink) reads the stamp and routes each package to the correct truck.

Transformation 10: Sink — Write to SQL Dimension Table

What It Does

Writes the final data to the dim_employee SQL table. Inserts new rows and updates changed rows based on the Alter Row stamps.

Configuration

  1. Add Sink transformation after Union
  2. Name: SinkDimEmployee
  3. Dataset: SQL dataset pointing to dim_employee
  4. Settings:
     • Update method: Allow insert ✓ and Allow update ✓
     • Key columns: SRC_employee_id (the business key for matching updates)
  5. Mapping:
     • SRC_employee_id → employee_id
     • SRC_name → name
     • SRC_department → department
     • SRC_city → city
     • SRC_salary → salary
     • SRC_hash_value → hash_value
     • updated_tms → updated_tms (NULL for inserts, timestamp for updates)
     • updated_by → updated_by (NULL for inserts, value for updates)
     • Do NOT map inserted_tms or inserted_by (let SQL DEFAULTs handle them)

Why Key Columns Matter

Key columns tell the sink HOW to match for updates:

UPDATE dim_employee
SET name = 'Alice', city = 'Mumbai', salary = 98000, hash_value = 'e9d4...',
    updated_tms = '2026-04-16 15:30:00', updated_by = 'SynapseDataFlow'
WHERE employee_id = 1;  ← This is the key column

Without the key column, the sink does not know WHICH row to update.

The Audit Trail: inserted_tms vs updated_tms

After the first run, Alice’s record:

employee_id=1, name='Alice', city='Toronto', hash='a3f7...',
inserted_tms='2026-04-16 14:00:00', inserted_by='SynapseUser',
updated_tms=NULL, updated_by=NULL

After the second run (Alice moved to Mumbai):

employee_id=1, name='Alice', city='Mumbai', hash='e9d4...',
inserted_tms='2026-04-16 14:00:00',       ← PRESERVED (original insert date)
inserted_by='SynapseUser',                  ← PRESERVED (original inserter)
updated_tms='2026-04-17 08:00:00',         ← SET (when the update happened)
updated_by='SynapseDataFlow'                ← SET (what did the update)

The inserted_tms is NEVER overwritten. You can always answer: “When was Alice first added to the system?” AND “When was she last updated?”

Idempotency: Why Running Twice Does Nothing

Run the pipeline twice with the same unchanged CSV file:

Run 1:

  • Lookup finds all records in dim table
  • Hash comparison: ALL hashes match (nothing changed in the CSV)
  • Conditional Split: ALL rows go to UnchangedRecords stream
  • UnchangedRecords stream is dropped (no sink)
  • Result: 0 inserts, 0 updates

Run 2 (identical file):

  • Same result: 0 inserts, 0 updates

This is idempotency — running the same operation multiple times produces the same result. The pipeline is safe to rerun after failures without creating duplicates or false updates.

Real-life analogy: Pressing the elevator button 5 times does not make the elevator come 5 times. It comes once. Pressing again while it is already coming does nothing. The hash comparison is the “already pressed” check.
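The rerun-safety property can be demonstrated with a tiny Python simulation (an in-memory dict plays the dimension table; illustrative only):

```python
import hashlib

def row_hash(*values):
    return hashlib.sha256("|".join(str(v) for v in values).encode("utf-8")).hexdigest()

def run(source, dim):
    # Apply one pipeline pass to the in-memory dim table; return (inserts, updates)
    inserts = updates = 0
    for row in source:
        key, h = row[0], row_hash(*row)
        if key not in dim:       # new record
            dim[key] = h
            inserts += 1
        elif dim[key] != h:      # changed record
            dim[key] = h
            updates += 1
        # else: unchanged -- do nothing
    return inserts, updates

source = [(1, "Alice", "Mumbai", 98000), (2, "Bob", "Mumbai", 72000)]
dim = {}

print(run(source, dim))  # (2, 0) -- first run inserts everything
print(run(source, dim))  # (0, 0) -- rerun with identical data does nothing
```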

First Run vs Subsequent Runs

First Run (Empty Dim Table)

Source CSV: 5 rows
  → Generate Hash: 5 rows with src_hash_value
  → Prefix SRC_: 5 rows with SRC_ columns
  → Lookup Dim: 5 rows, ALL dim columns are NULL (empty table)
  → Conditional Split:
      NewRecords: 5 (all isNull(employee_id))
      ChangedRecords: 0
      UnchangedRecords: 0
  → Alter Row (Insert): 5 rows
  → Sink: INSERT 5 rows into dim_employee

Audit result: All 5 rows have inserted_tms, updated_tms=NULL

Second Run (Alice Changed, Frank Added)

Source CSV: 6 rows (Alice changed, Frank is new)
  → Generate Hash: 6 rows
  → Prefix SRC_: 6 rows
  → Lookup Dim: 6 rows joined with 5 existing dim rows
  → Conditional Split:
      NewRecords: 1 (Frank, isNull(employee_id))
      ChangedRecords: 1 (Alice, SRC_hash != hash)
      UnchangedRecords: 4 (Bob, Carol, Dave, Eve, hashes match)
  → Alter Row (Insert): 1 row (Frank)
  → Alter Row (Update) + Derived (audit): 1 row (Alice)
  → Union: 2 rows
  → Sink: INSERT 1 + UPDATE 1

Audit result: Frank has inserted_tms, Alice has updated_tms set

Third Run (No Changes)

Source CSV: 6 rows (identical to previous run)
  → Conditional Split:
      NewRecords: 0
      ChangedRecords: 0
      UnchangedRecords: 6 (all hashes match)
  → Nothing reaches the Sink
  → 0 inserts, 0 updates (idempotent)

The Complete Data Flow Visual

SourceCSV
    │
    ▼
GenerateHash (SHA-256 of all business columns)
    │
    ▼
PrefixSRC (rename columns with SRC_ prefix)
    │
    ▼
LookupDim ◄──── DimSource (reads dim_employee SQL table)
    │
    ▼
ConditionalSplit
    │          │           │
    ▼          ▼           ▼
NewRecords  Changed    Unchanged
    │       Records      (dropped)
    │          │
    ▼          ▼
AlterRow   AlterRow
(Insert)   (Update)
    │          │
    │          ▼
    │     AddUpdateAudit
    │     (updated_tms, updated_by)
    │          │
    ▼          ▼
    Union (merge both streams)
       │
       ▼
    SinkDimEmployee (INSERT new + UPDATE changed)

Parameterized Source Dataset

Pipeline Parameter

Pipeline parameter: SourceFileName (String)
Default: "employees.csv"

Data Flow Parameter

Data Flow parameter: DFSourceFile (String)

Dataset Parameter

Dataset parameter: FileName (String)
Connection: container/folder/@dataset().FileName

Wiring

Pipeline parameter SourceFileName
    → Data Flow activity parameter DFSourceFile = @pipeline().parameters.SourceFileName
        → Source dataset FileName = $DFSourceFile

This chain lets you reuse the same pipeline for different files by passing a different SourceFileName when you trigger it.

Performance Considerations

  1. Broadcast the dim table — if your dimension table is small (under 50 MB), broadcast it during the Lookup. This avoids shuffling the large source data.

  2. Index the hash column — add an index on hash_value in the SQL table for faster comparisons.

  3. Index the key column — employee_id should be the PRIMARY KEY (already indexed).

  4. Partition large source files — if the CSV has millions of rows, Spark will auto-partition for parallel processing.

  5. Use the right cluster size — for dev, 4+4 cores is fine. For production with millions of rows, scale to 16+16 or more.

Common Mistakes

1. Forgetting the Delimiter in Hash Concatenation

WRONG:  sha2(256, concat(name, city, salary))
        Name="AB", City="C" and Name="A", City="BC" both concatenate to "ABC".
        Two different rows produce the SAME hash, so a real change is missed.

RIGHT:  sha2(256, concat(name, '|', city, '|', toString(salary)))
        "AB|C" vs "A|BC" are different strings, so the hashes differ.

2. Not Handling NULLs in Hash

If a column is NULL, concat(NULL, '|', 'Toronto') returns NULL in some engines.

Fix: Use coalesce() to replace NULLs:

sha2(256, concat(coalesce(toString(employee_id),''), '|', coalesce(name,''), '|', coalesce(city,'')))
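The same coalesce pattern in plain Python (hashlib stands in for sha2; the function names here are illustrative):

```python
import hashlib

def coalesce(value, default=""):
    # Python stand-in for coalesce(): substitute a default when the value is NULL/None
    return default if value is None else str(value)

def row_hash(*values):
    return hashlib.sha256("|".join(coalesce(v) for v in values).encode("utf-8")).hexdigest()

# A NULL city neither breaks the concatenation nor poisons the whole hash
with_null = row_hash(1, "Alice", None, 95000)
print(len(with_null))  # 64 -- still a valid hash

# Trade-off to be aware of: NULL and empty string now hash identically
print(with_null == row_hash(1, "Alice", "", 95000))  # True
```

If your data must distinguish NULL from empty string, coalesce to a sentinel like '\N' instead of ''.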

3. Mapping inserted_tms in the Sink

If you map inserted_tms in the sink, updates will overwrite the original insert timestamp.

Fix: Do NOT map inserted_tms or inserted_by. Let SQL DEFAULTs handle them on insert, and they are untouched on update.

4. Using Inner Lookup Instead of Left Outer

Inner Lookup drops new records (no match in dim table).

Fix: Always use Left Outer for SCD pipelines. New records must flow through.

5. Wrong Conditional Split Order

If you check Changed before New, records with NULL hashes (new records) might match SRC_hash != NULL unexpectedly.

Fix: Always check isNull(employee_id) FIRST (new records), then hash comparison (changed), then default (unchanged).

Interview Questions

Q: What is hash-based change detection? A: Instead of comparing every column individually, you concatenate all column values and generate a SHA-256 hash. Compare the source hash with the stored target hash — if they differ, the record changed. One comparison instead of N column comparisons, and it scales regardless of column count.

Q: Why prefix source columns with SRC_ in the data flow? A: When joining source and target, both have columns with the same names (employee_id, name, city). Prefixing source columns with SRC_ prevents ambiguity and makes it clear which column belongs to which stream in downstream transformations and mappings.

Q: How does the pipeline ensure inserted_tms is never overwritten? A: By NOT mapping inserted_tms in the Data Flow sink. On INSERT, the SQL DEFAULT constraint sets it to GETDATE(). On UPDATE, the column is not included in the SET clause, so it retains its original value. The updated_tms column is only added on the update path via a Derived Column transformation.

Q: What makes this pipeline idempotent? A: The hash comparison. If the source data has not changed since the last run, all hashes match, all rows go to the Unchanged stream (dropped), and the sink receives zero rows. No inserts, no updates, no side effects. The pipeline is safe to rerun indefinitely.

Q: What is the difference between Alter Row Insert and Alter Row Update? A: Alter Row stamps each row with an operation type. Insert tells the sink to INSERT the row (new record). Update tells the sink to UPDATE the matching row based on the key column. Without Alter Row, the sink does not know which operation to perform.

Q: Why Union before the Sink instead of two separate Sinks? A: Union merges the insert and update streams into one. One sink means one connection, one transaction, and simpler monitoring. The Alter Row stamps survive through the Union, so the sink still knows which rows to insert and which to update.

Wrapping Up

This pipeline is the gold standard for SCD Type 1 with change detection. Instead of blindly overwriting everything, it fingerprints each row with a hash, compares it against the existing data, and only processes what actually changed.

The result: faster execution (skip unchanged rows), complete audit trail (inserted_tms preserved, updated_tms tracked), and idempotency (safe to rerun).

The pattern works for any dimension table — employees, customers, products, stores. Change the source, change the columns in the hash, and the rest of the pipeline stays the same.

Related posts:

  • SCD Type 1 Full Load Pipeline
  • Slowly Changing Dimensions (All Types)
  • Data Flows in ADF/Synapse
  • Joins in Synapse Data Flows
  • Metadata-Driven Pipeline in ADF


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
