Building an SCD Type 1 Pipeline in Azure: Full Load with Metadata, Audit Logging, and Every Activity Explained
In our SCD Types post, we covered the theory — Type 1 overwrites old values with new ones, no history preserved. But theory without practice is like reading a cookbook without ever stepping into the kitchen.
This post is the kitchen. We are going to build a complete SCD Type 1 pipeline from scratch in Azure Synapse (works identically in ADF), and I will explain every single activity, every expression, every dataset parameter, and WHY each piece exists.
By the end, you will have a production-grade pipeline that reads a configuration table, loops through multiple source tables, copies each one to ADLS Gen2 as Parquet, logs success and failure into an audit table, and creates date-partitioned output folders. This is the exact pattern used in real enterprise data platforms.
Table of Contents
- What Makes This SCD Type 1
- The Complete Architecture
- The Analogy: A Restaurant Kitchen
- Prerequisites: What You Need Before Building
- The Metadata Table: The Recipe Book
- The Audit Table: The Kitchen Log
- The Stored Procedure: The Log Writer
- Activity 1: Lookup (The Waiter Taking the Order)
- Activity 2: ForEach (The Kitchen Line)
- Activity 3: Copy Data (The Chef Cooking)
- Activity 4: Log Success (The Quality Check — Passed)
- Activity 5: Log Failure (The Quality Check — Failed)
- The Three Datasets Explained
- Dataset 1: DS_SqlDB_Metadata (The Menu)
- Dataset 2: DS_SqlDB_SourceTable (The Ingredient Shelf)
- Dataset 3: DS_ADLS_Sink_Parquet (The Serving Plate)
- The Linked Services: The Supply Chains
- Expression Deep Dive
- The Full Pipeline JSON
- How Data Flows Through the Pipeline
- Adding Date-Partitioned Output Folders
- Running and Verifying the Pipeline
- What Happens on the Second Run (The SCD Type 1 Behavior)
- Evolving to SCD Type 2
- Common Errors and Fixes
- Interview Questions
- Wrapping Up
What Makes This SCD Type 1
SCD Type 1 means overwrite. Every time the pipeline runs, it replaces the destination data with the latest source data. No history. No versioning. The destination always reflects the current state of the source.
Run 1 (Monday): Source has 100 customers → Destination gets 100 customers
Run 2 (Tuesday): Source has 102 customers (2 new + Alice moved to Mumbai)
→ Destination gets 102 customers (Alice shows Mumbai, no Toronto history)
Run 3 (Wednesday): Source has 101 customers (1 deleted)
→ Destination gets 101 customers (deleted customer is gone)
Real-life analogy: SCD Type 1 is like updating your phone’s GPS address. You type your new address, and the old one disappears completely. Your GPS always shows where you live NOW, but it has no memory of where you used to live.
When this is appropriate:
- Bronze layer ingestion (capture the latest source snapshot)
- Reference data that is corrected, not versioned (country codes, currency rates)
- Staging tables that are truncated and reloaded daily
- Any scenario where “what is the data NOW?” is the only question that matters
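In plain T-SQL terms, Type 1 is just an in-place overwrite. The sketch below is illustrative only (dim.Customer and staging.Customer are hypothetical tables; the pipeline in this post achieves the same effect by rewriting Parquet files in ADLS rather than running UPDATE statements):

-- SCD Type 1: overwrite attributes in place, keep no history (illustrative sketch)
UPDATE d
SET    d.City = s.City
FROM   dim.Customer AS d
JOIN   staging.Customer AS s
  ON   d.CustomerID = s.CustomerID;   -- Alice's Toronto is overwritten by Mumbai; no trace remains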
The Complete Architecture
Azure SQL Database ADLS Gen2
┌─────────────────────┐ ┌─────────────────────────┐
│ │ │ synapse-workspace/ │
│ metadata table │─── Lookup ──>│ sqldb/ │
│ (config: 5 tables) │ │ Customer/ │
│ │ │ 2026/04/16/14/30/ │
│ SalesLT.Customer │─── Copy ───>│ part-0000.parquet│
│ SalesLT.Address │─── Copy ───>│ Address/ │
│ SalesLT.Product │─── Copy ───>│ Product/ │
│ ... │ │ ... │
│ │ │ │
│ audit table │<── Log ─────│ │
│ (success/failure) │ │ │
│ │ │ │
│ sp_insert_audit_log │ │ │
│ (stored procedure) │ │ │
└─────────────────────┘ └─────────────────────────┘
Pipeline: PL_Copy_SqlToADLS_Parquet_WithAudit
┌──────────────────────────────────────────────────────────────┐
│ │
│ Lookup_Metadata ──> ForEach_Table │
│ │ │
│ ├── Copy_TableData │
│ │ │ │
│ │ ├── (Success) ──> Log_Success │
│ │ │ │
│ │ └── (Failure) ──> Log_Failure │
│ │
└──────────────────────────────────────────────────────────────┘
The Analogy: A Restaurant Kitchen
I will use a restaurant kitchen analogy throughout this post because the pipeline workflow maps perfectly to how a kitchen operates:
| Pipeline Component | Kitchen Equivalent | What It Does |
|---|---|---|
| Metadata table | The recipe book | Lists all dishes (tables) to prepare |
| Audit table | The kitchen log | Records what was cooked, how long, any problems |
| Stored procedure | The log writer (sous chef with clipboard) | Writes entries into the kitchen log |
| Lookup activity | The waiter taking the order | Reads the recipe book and brings the list to the kitchen |
| ForEach activity | The kitchen line | Processes each dish in the order |
| Copy activity | The chef cooking | Does the actual work of preparing each dish |
| Log_Success | Quality check — passed | “Dish looks great, log it as served” |
| Log_Failure | Quality check — failed | “Dish burned, log the problem for review” |
| Linked service | The supplier contracts | Agreements with ingredient suppliers and dish delivery service |
| Dataset | The ingredient location and serving plate | Where to find ingredients and how to plate the dish |
Prerequisites: What You Need Before Building
Before we start building, ensure you have:
- Azure SQL Database with AdventureWorksLT sample data (or any database)
- ADLS Gen2 storage account with a container (e.g., synapse-workspace or database)
- Synapse Workspace (or ADF) with linked services for both SQL and ADLS
- Permissions: Managed Identity has Storage Blob Data Contributor on ADLS
The Metadata Table: The Recipe Book
This table drives the entire pipeline. Adding a new table to copy means inserting ONE row here — no pipeline changes needed.
CREATE TABLE metadata (
Id INT IDENTITY(100,1),
TableName VARCHAR(100),
SchemaName VARCHAR(50),
ContainerName VARCHAR(100),
FolderName VARCHAR(200)
);
INSERT INTO metadata (TableName, SchemaName, ContainerName, FolderName) VALUES
('Customer', 'SalesLT', 'synapse-workspace', 'sqldb/Customer'),
('Address', 'SalesLT', 'synapse-workspace', 'sqldb/Address'),
('CustomerAddress', 'SalesLT', 'synapse-workspace', 'sqldb/CustomerAddress'),
('Product', 'SalesLT', 'synapse-workspace', 'sqldb/Product'),
('ProductCategory', 'SalesLT', 'synapse-workspace', 'sqldb/ProductCategory');
Why metadata-driven? Imagine you have 50 tables to copy. Without metadata, you build 50 separate Copy activities (or worse, 50 pipelines). With metadata, you build ONE pipeline and control everything from a SQL table. Adding table #51 is an INSERT statement, not a pipeline redesign.
Real-life analogy: The metadata table is like a restaurant’s daily specials board. The kitchen (pipeline) does not change. The board (metadata) changes daily. Today we serve 5 dishes, tomorrow we add a 6th by writing it on the board. The kitchen workflow stays the same.
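To make that concrete, onboarding a sixth table is a single statement against the metadata table (the SalesLT.SalesOrderHeader row below is only an example):

-- Add table #6: no pipeline edit, no redeployment
INSERT INTO metadata (TableName, SchemaName, ContainerName, FolderName)
VALUES ('SalesOrderHeader', 'SalesLT', 'synapse-workspace', 'sqldb/SalesOrderHeader');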
The Audit Table: The Kitchen Log
Every table copy operation — success or failure — gets logged here.
CREATE TABLE audit (
Id INT IDENTITY(1,1),
rows_read INT,
rows_copied INT,
copy_duration INT,
table_name VARCHAR(100),
schema_name VARCHAR(100),
error_message VARCHAR(500),
load_date DATETIME
);
Why audit logging? In production, pipelines run at 2 AM with no one watching. When the business asks “did the Customer table load successfully last night?” you need an answer. The audit table IS the answer.
After each run, you can query:
-- What happened last night?
SELECT table_name, rows_read, rows_copied, copy_duration, error_message, load_date
FROM audit
WHERE CAST(load_date AS DATE) = CAST(GETDATE() AS DATE)
ORDER BY load_date DESC;
Real-life analogy: The audit table is the kitchen log. After every dish, the sous chef notes: “Chicken was prepared at 7:15 PM, took 12 minutes, served to Table 4, no issues.” If a customer complains the next day, the log tells you exactly what happened.
The Stored Procedure: The Log Writer
CREATE PROCEDURE sp_insert_audit_log
@rows_read INT,
@rows_copied INT,
@copy_duration INT,
@table_name VARCHAR(100),
@schema_name VARCHAR(100),
@error_message VARCHAR(500)
AS
BEGIN
INSERT INTO audit (rows_read, rows_copied, copy_duration, table_name, schema_name, error_message, load_date)
VALUES (@rows_read, @rows_copied, @copy_duration, @table_name, @schema_name, @error_message, GETDATE());
END;
Why a stored procedure instead of direct INSERT? Three reasons:
- Encapsulation — the pipeline calls sp_insert_audit_log without knowing the table structure. If you add a column to the audit table later, you update the SP, not every pipeline.
- Reusability — multiple pipelines can call the same SP.
- Security — the pipeline needs EXECUTE permission on the SP, not INSERT permission on the table. Principle of least privilege.
Real-life analogy: The stored procedure is the sous chef with a clipboard. The chef (Copy activity) yells “5 rows cooked, 5 served, took 3 seconds, no problems!” The sous chef writes it down in the proper format. The chef does not touch the log book directly.
Activity 1: Lookup (The Waiter Taking the Order)
What It Does
Reads the metadata table and returns an array of rows. Each row describes one table to copy.
How to Configure
- Drag Lookup activity onto the canvas
- Name: Lookup_Metadata
- Settings tab:
  - Source dataset: DS_SqlDB_Metadata
  - Use query: Query
  - Query: SELECT TableName, SchemaName, ContainerName, FolderName FROM metadata
  - First row only: UNCHECKED (critical! If checked, only 1 table is processed)
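For reference, the resulting activity definition looks roughly like this in the pipeline JSON (a sketch of what the authoring UI generates; surrounding properties omitted):

{
  "name": "Lookup_Metadata",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": "SELECT TableName, SchemaName, ContainerName, FolderName FROM metadata"
    },
    "dataset": { "referenceName": "DS_SqlDB_Metadata", "type": "DatasetReference" },
    "firstRowOnly": false
  }
}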
What It Outputs
{
"count": 5,
"value": [
{"TableName": "Customer", "SchemaName": "SalesLT", "ContainerName": "synapse-workspace", "FolderName": "sqldb/Customer"},
{"TableName": "Address", "SchemaName": "SalesLT", "ContainerName": "synapse-workspace", "FolderName": "sqldb/Address"},
{"TableName": "CustomerAddress", "SchemaName": "SalesLT", "ContainerName": "synapse-workspace", "FolderName": "sqldb/CustomerAddress"},
{"TableName": "Product", "SchemaName": "SalesLT", "ContainerName": "synapse-workspace", "FolderName": "sqldb/Product"},
{"TableName": "ProductCategory", "SchemaName": "SalesLT", "ContainerName": "synapse-workspace", "FolderName": "sqldb/ProductCategory"}
]
}
Why This Activity Exists
Without Lookup, you would hardcode every table name in the pipeline. Adding a new table means editing the pipeline. With Lookup, the pipeline reads its instructions dynamically from a database table. The pipeline NEVER changes — only the metadata changes.
Real-life analogy: The waiter walks to the specials board (metadata table), reads today’s five dishes, and brings the list back to the kitchen. The kitchen does not decide what to cook — the board does.
Common Mistake: First Row Only
If First row only is checked, the Lookup returns a single object (not an array), and ForEach fails because it expects an array. This is the #1 mistake beginners make.
First row only CHECKED: {"TableName": "Customer", ...} ← Single object, ForEach fails
First row only UNCHECKED: {"count": 5, "value": [{...}, ...]} ← Array, ForEach works
Activity 2: ForEach (The Kitchen Line)
What It Does
Iterates over the array from Lookup and executes the inner activities (Copy + Log) for EACH item in the array.
How to Configure
- Drag ForEach activity onto the canvas
- Name: ForEach_Table
- Draw a green arrow from Lookup_Metadata to ForEach_Table
- Settings tab:
  - Sequential: UNCHECKED (runs tables in parallel — faster)
  - Batch count: 5 (max 5 tables processed simultaneously)
  - Items: @activity('Lookup_Metadata').output.value
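A rough sketch of the ForEach definition in pipeline JSON (inner activities omitted for brevity):

{
  "name": "ForEach_Table",
  "type": "ForEach",
  "dependsOn": [
    { "activity": "Lookup_Metadata", "dependencyConditions": [ "Succeeded" ] }
  ],
  "typeProperties": {
    "isSequential": false,
    "batchCount": 5,
    "items": {
      "value": "@activity('Lookup_Metadata').output.value",
      "type": "Expression"
    },
    "activities": [ ]
  }
}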
What Happens During Execution
Iteration 1: item() = {"TableName": "Customer", "SchemaName": "SalesLT", ...}
Iteration 2: item() = {"TableName": "Address", "SchemaName": "SalesLT", ...}
Iteration 3: item() = {"TableName": "CustomerAddress", ...}
Iteration 4: item() = {"TableName": "Product", ...}
Iteration 5: item() = {"TableName": "ProductCategory", ...}
Inside ForEach, @item() represents the current row being processed. So @item().TableName is “Customer” in iteration 1, “Address” in iteration 2, etc.
Why This Activity Exists
ForEach is the loop that processes multiple items without duplicating activities. Without it, you would need 5 separate Copy activities, 5 Log_Success activities, and 5 Log_Failure activities. With ForEach, you write the logic ONCE and it executes for each item.
Real-life analogy: The kitchen line. Five order tickets hang on the rail. The kitchen processes each ticket one at a time (sequential) or multiple tickets simultaneously (parallel with batch count). The cooking process is the same for each ticket — only the dish name changes.
Sequential vs Parallel
| Setting | Behavior | When to Use |
|---|---|---|
| Sequential = checked | Tables copied one at a time in order | When tables have dependencies or the source cannot handle concurrent reads |
| Sequential = unchecked, Batch = 5 | Up to 5 tables copied simultaneously | Default for independent tables — much faster |
The Items Expression Explained
@activity('Lookup_Metadata').output.value
Breaking it down:
| Part | Meaning |
|---|---|
| @activity('Lookup_Metadata') | Reference the Lookup activity by its exact name |
| .output | Access the activity’s output |
| .value | The array of rows returned by the Lookup |
If you mistype the activity name (e.g., 'Lookup_metadata' with lowercase m), the pipeline fails with “Activity not found.”
Activity 3: Copy Data (The Chef Cooking)
What It Does
Copies data from Azure SQL Database to ADLS Gen2 in Parquet format. This is the core work of the pipeline.
How to Configure
- Click the pencil icon on ForEach to open the inner canvas
- Drag Copy data activity
- Name: Copy_TableData
Source tab:
– Source dataset: DS_SqlDB_SourceTable
– SchemaName: @item().SchemaName
– TableName: @item().TableName
Sink tab:
– Sink dataset: DS_ADLS_Sink_Parquet
– ContainerName: @item().ContainerName
– FolderName: @item().FolderName
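Here is a trimmed sketch of how the Copy activity passes the current ForEach item into the dataset parameters (illustrative; the generated JSON carries more properties):

{
  "name": "Copy_TableData",
  "type": "Copy",
  "inputs": [
    {
      "referenceName": "DS_SqlDB_SourceTable",
      "type": "DatasetReference",
      "parameters": {
        "SchemaName": "@item().SchemaName",
        "TableName": "@item().TableName"
      }
    }
  ],
  "outputs": [
    {
      "referenceName": "DS_ADLS_Sink_Parquet",
      "type": "DatasetReference",
      "parameters": {
        "ContainerName": "@item().ContainerName",
        "FolderName": "@item().FolderName"
      }
    }
  ],
  "typeProperties": {
    "source": { "type": "AzureSqlSource" },
    "sink": { "type": "ParquetSink" }
  }
}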
What Happens During Execution (For Customer Table)
1. Copy reads @item().SchemaName = "SalesLT", @item().TableName = "Customer"
2. These values flow to DS_SqlDB_SourceTable via @dataset().SchemaName and @dataset().TableName
3. The dataset resolves to: SELECT * FROM SalesLT.Customer
4. Copy reads all 847 rows from Azure SQL Database
5. Copy writes to ADLS Gen2 at: synapse-workspace/sqldb/Customer/part-00000.snappy.parquet
6. Copy activity records: rowsRead=847, rowsCopied=847, copyDuration=3 (seconds)
The Parameter Flow (The Most Important Concept)
Metadata table row: {"TableName": "Customer", "SchemaName": "SalesLT", ...}
|
v
ForEach @item(): @item().TableName = "Customer", @item().SchemaName = "SalesLT"
|
v
Copy activity: Passes to dataset: SchemaName="SalesLT", TableName="Customer"
|
v
Dataset: @dataset().SchemaName = "SalesLT", @dataset().TableName = "Customer"
|
v
Connection: Schema: SalesLT, Table: Customer
|
v
SQL Query: SELECT * FROM [SalesLT].[Customer]
This chain — metadata → @item() → dataset parameters → @dataset() → actual query — is the core of every metadata-driven pipeline. Understanding this flow means understanding ADF/Synapse pipeline architecture.
Real-life analogy: The order ticket says “Grilled Salmon.” The chef reads the ticket (@item()), walks to the ingredient shelf (@dataset()), grabs the salmon from the section labeled “Fish” (SchemaName), shelf “Salmon” (TableName), cooks it, and plates it on the designated serving plate (the sink with FolderName).
Copy Activity Output (Used by Logging)
After successful execution, the Copy activity exposes:
@activity('Copy_TableData').output.rowsRead → 847
@activity('Copy_TableData').output.rowsCopied → 847
@activity('Copy_TableData').output.copyDuration → 3
@activity('Copy_TableData').output.throughput → 256.5 (KB/s)
After failure:
@activity('Copy_TableData').error.message → "Table 'SalesLT.Cstomer' not found"
These values feed directly into the audit logging activities.
Activity 4: Log Success (The Quality Check — Passed)
What It Does
When Copy_TableData succeeds, this activity calls the stored procedure to insert a success record into the audit table.
How to Configure
- Drag Stored Procedure activity onto the inner canvas
- Name: Log_Success
- Draw a green arrow (Success) from Copy_TableData to Log_Success
- Settings tab:
  - Linked service: LS_AzureSqlDB
  - Stored procedure name: sp_insert_audit_log
  - Parameters:
| Name | Type | Value |
|---|---|---|
| rows_read | Int32 | @activity('Copy_TableData').output.rowsRead |
| rows_copied | Int32 | @activity('Copy_TableData').output.rowsCopied |
| copy_duration | Int32 | @activity('Copy_TableData').output.copyDuration |
| table_name | String | @item().TableName |
| schema_name | String | @item().SchemaName |
| error_message | String | NA |
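In the pipeline JSON, those parameter mappings end up looking roughly like the sketch below (trimmed to three parameters; the rest follow the same pattern):

{
  "name": "Log_Success",
  "type": "SqlServerStoredProcedure",
  "dependsOn": [
    { "activity": "Copy_TableData", "dependencyConditions": [ "Succeeded" ] }
  ],
  "linkedServiceName": { "referenceName": "LS_AzureSqlDB", "type": "LinkedServiceReference" },
  "typeProperties": {
    "storedProcedureName": "sp_insert_audit_log",
    "storedProcedureParameters": {
      "rows_read": { "value": "@activity('Copy_TableData').output.rowsRead", "type": "Int32" },
      "table_name": { "value": "@item().TableName", "type": "String" },
      "error_message": { "value": "NA", "type": "String" }
    }
  }
}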
Why the Green Arrow Matters
The green arrow means this activity ONLY runs when the Copy activity succeeds. If Copy fails, Log_Success is skipped entirely, and Log_Failure runs instead.
Copy_TableData
|
├── Green arrow (Succeeded) ──> Log_Success ✓ Runs only on success
|
└── Red arrow (Failed) ──> Log_Failure ✓ Runs only on failure
What Gets Logged
-- Audit table after successful copy of Customer
INSERT INTO audit VALUES (847, 847, 3, 'Customer', 'SalesLT', 'NA', '2026-04-16 14:30:05');
Activity 5: Log Failure (The Quality Check — Failed)
What It Does
When Copy_TableData fails, this activity logs the error details.
How to Configure
- Drag another Stored Procedure activity
- Name: Log_Failure
- Draw a red arrow (Failure) from Copy_TableData to Log_Failure
- Settings tab:
  - Same linked service and stored procedure
  - Parameters:
| Name | Type | Value |
|---|---|---|
| rows_read | Int32 | 0 |
| rows_copied | Int32 | 0 |
| copy_duration | Int32 | 0 |
| table_name | String | @item().TableName |
| schema_name | String | @item().SchemaName |
| error_message | String | @activity('Copy_TableData').error.message |
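A trimmed sketch of the equivalent JSON; the key differences from Log_Success are the Failed dependency condition and the error_message value:

{
  "name": "Log_Failure",
  "type": "SqlServerStoredProcedure",
  "dependsOn": [
    { "activity": "Copy_TableData", "dependencyConditions": [ "Failed" ] }
  ],
  "linkedServiceName": { "referenceName": "LS_AzureSqlDB", "type": "LinkedServiceReference" },
  "typeProperties": {
    "storedProcedureName": "sp_insert_audit_log",
    "storedProcedureParameters": {
      "rows_read": { "value": "0", "type": "Int32" },
      "table_name": { "value": "@item().TableName", "type": "String" },
      "error_message": { "value": "@activity('Copy_TableData').error.message", "type": "String" }
    }
  }
}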
Why Both Success and Failure Handlers
Without failure logging, a failed table copy is invisible. The pipeline might show “Succeeded” overall (because ForEach continues with other tables), but one table silently failed. The audit table catches this:
-- Morning health check
SELECT table_name, error_message, load_date
FROM audit
WHERE error_message != 'NA'
AND CAST(load_date AS DATE) = CAST(GETDATE() AS DATE);
-- Result: ProductCategory | Invalid object name 'SalesLT.ProductCategori' | 2026-04-16 14:30:08
Without audit logging, you would not know ProductCategory failed until someone notices the dashboard showing old data — which could be hours or days later.
Real-life analogy: In a restaurant, if a dish is burned, the sous chef does not just throw it away silently. They LOG it: “Dish: Salmon, Problem: overcooked, Time: 7:30 PM.” The chef can review the log and see patterns: “Salmon burns every Friday night — we need more staff on Fridays.”
The Three Datasets Explained
Dataset 1: DS_SqlDB_Metadata (The Menu)
Purpose: Points to the metadata table. Used by the Lookup activity.
Configuration:
– Type: Azure SQL Table
– Linked service: LS_AzureSqlDB
– Table: dbo.metadata
– Parameters: None (fixed table)
Why it is non-parameterized: The metadata table never changes location. It is always dbo.metadata in the same database. No parameters needed.
Dataset 2: DS_SqlDB_SourceTable (The Ingredient Shelf)
Purpose: Points to any source table dynamically. Used by the Copy activity source.
Configuration:
– Type: Azure SQL Table
– Linked service: LS_AzureSqlDB
– Table: left blank (parameterized)
– Parameters:
– SchemaName (String)
– TableName (String)
– Connection tab:
– Schema: @dataset().SchemaName
– Table: @dataset().TableName
Why it is parameterized: This single dataset represents ALL five source tables. During execution, the Copy activity passes different values:
Iteration 1: SchemaName="SalesLT", TableName="Customer" → reads SalesLT.Customer
Iteration 2: SchemaName="SalesLT", TableName="Address" → reads SalesLT.Address
Iteration 3: SchemaName="SalesLT", TableName="Product" → reads SalesLT.Product
Without parameterization, you would need five separate datasets — one per table. With parameterization, one dataset handles them all.
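A sketch of the parameterized dataset definition (the Expression wrappers are how the authoring UI stores dynamic content; exact property names may vary slightly by version):

{
  "name": "DS_SqlDB_SourceTable",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": { "referenceName": "LS_AzureSqlDB", "type": "LinkedServiceReference" },
    "parameters": {
      "SchemaName": { "type": "string" },
      "TableName": { "type": "string" }
    },
    "typeProperties": {
      "schema": { "value": "@dataset().SchemaName", "type": "Expression" },
      "table": { "value": "@dataset().TableName", "type": "Expression" }
    }
  }
}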
Dataset 3: DS_ADLS_Sink_Parquet (The Serving Plate)
Purpose: Points to the destination in ADLS Gen2. Used by the Copy activity sink.
Configuration:
– Type: Parquet
– Linked service: naveen-synapse-ws-WorkspaceDefaultStorage (or your ADLS linked service)
– Parameters:
– ContainerName (String)
– FolderName (String)
– Connection tab:
– File system: @dataset().ContainerName
– Directory: @dataset().FolderName
– Compression: Snappy
Why Parquet with Snappy? Parquet is columnar (fast analytical queries), and Snappy compression offers a strong balance of speed and file size. A 10 MB CSV typically shrinks to roughly 2 MB as Snappy-compressed Parquet.
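And the sink dataset, again as a rough sketch (AzureBlobFSLocation is the location type used for ADLS Gen2):

{
  "name": "DS_ADLS_Sink_Parquet",
  "properties": {
    "type": "Parquet",
    "linkedServiceName": {
      "referenceName": "naveen-synapse-ws-WorkspaceDefaultStorage",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "ContainerName": { "type": "string" },
      "FolderName": { "type": "string" }
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": { "value": "@dataset().ContainerName", "type": "Expression" },
        "folderPath": { "value": "@dataset().FolderName", "type": "Expression" }
      },
      "compressionCodec": "snappy"
    }
  }
}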
The Linked Services: The Supply Chains
LS_AzureSqlDB (SQL Database Connection)
Type: Azure SQL Database
Server: sql-dataplatform-dev.database.windows.net
Database: AdventureWorksLT
Authentication: SQL Authentication (or Managed Identity)
Connect via: AutoResolveIntegrationRuntime
Why it exists: Tells ADF/Synapse HOW to connect to the database. Credentials are stored here, not in datasets or pipelines.
Default Storage (ADLS Gen2 Connection)
Type: Azure Data Lake Storage Gen2
URL: https://naveensynapsedl.dfs.core.windows.net
Authentication: Managed Identity (automatic, no credentials)
Why Managed Identity: No passwords to manage, rotate, or accidentally expose. The Synapse workspace authenticates using its own Azure identity.
Adding Date-Partitioned Output Folders
In production, you want each run’s output in a separate dated folder to avoid overwriting:
Change the FolderName in the Copy sink to:
@concat(item().FolderName, '/',
formatDateTime(convertTimeZone(utcnow(), 'UTC', 'Eastern Standard Time'), 'yyyy/MM/dd/HH/mm'))
This produces: sqldb/Customer/2026/04/16/14/30/
Each run creates a NEW folder. Previous runs are preserved. This is technically not pure SCD Type 1 (which overwrites), but it is the common production pattern because it allows:
- Rollback to the previous day’s data if needed
- An audit trail of what was loaded when
- Delta Lake or a downstream pipeline to read the latest partition
Running and Verifying the Pipeline
Debug Run
- Click Validate — fix any errors
- Click Debug — runs the pipeline immediately
- Watch the Output panel at the bottom
Expected Output
Lookup_Metadata → Succeeded (5 rows)
ForEach_Table → Succeeded (5 iterations)
Copy (Customer) → Succeeded (847 rows read, 847 copied, 3 sec)
Log_Success → Succeeded
Copy (Address) → Succeeded (450 rows read, 450 copied, 2 sec)
Log_Success → Succeeded
... (3 more tables)
Verify in ADLS Gen2
Navigate to your container. You should see:
synapse-workspace/
sqldb/
Customer/
part-00000.snappy.parquet
Address/
part-00000.snappy.parquet
CustomerAddress/
part-00000.snappy.parquet
Product/
part-00000.snappy.parquet
ProductCategory/
part-00000.snappy.parquet
Verify Audit Table
SELECT table_name, rows_read, rows_copied, copy_duration, error_message, load_date
FROM audit ORDER BY load_date DESC;
Expected: 5 rows, all with error_message = 'NA'.
What Happens on the Second Run (The SCD Type 1 Behavior)
This is the key SCD Type 1 characteristic:
Without date partitions: The second run OVERWRITES the same Parquet files. Old data is replaced. If Alice’s city changed from Toronto to Mumbai in the source, the new Parquet file shows Mumbai. Toronto is gone forever. That is Type 1.
With date partitions: The second run creates NEW files in a new folder (different timestamp). Both versions coexist in storage. A downstream process reads the latest partition. Old partitions can be archived or deleted by a cleanup pipeline.
Evolving to SCD Type 2
When business requirements change from “we only need current data” to “we need historical tracking,” you evolve this pipeline:
- Add a LoadType column to the metadata table (FULL or INCREMENTAL)
- Add DeltaColumnName and LastLoadedValue for incremental tables
- Add an If Condition inside ForEach to branch between full and incremental logic
- For SCD Type 2 target tables, use MERGE (in Delta Lake or stored procedures) instead of overwrite (see the sketch below)
We covered this evolution in detail in our incremental loading post and SCD types post.
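As a taste of that last step, here is a minimal Type 2 sketch in T-SQL. The dim.Customer and staging.Customer tables and the IsCurrent / ValidFrom / ValidTo columns are hypothetical, and the logic is written as expire-then-insert rather than a single MERGE statement, but it is the same upsert a MERGE (or Delta Lake MERGE INTO) would implement:

-- Step 1: expire the current row for customers whose tracked attributes changed
UPDATE d
SET    d.IsCurrent = 0,
       d.ValidTo   = GETDATE()
FROM   dim.Customer AS d
JOIN   staging.Customer AS s
  ON   d.CustomerID = s.CustomerID
WHERE  d.IsCurrent = 1
  AND  d.City <> s.City;

-- Step 2: insert a new current version for changed and brand-new customers
INSERT INTO dim.Customer (CustomerID, City, IsCurrent, ValidFrom, ValidTo)
SELECT s.CustomerID, s.City, 1, GETDATE(), NULL
FROM   staging.Customer AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM dim.Customer AS d
                   WHERE d.CustomerID = s.CustomerID
                     AND d.IsCurrent  = 1);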
Common Errors and Fixes
| Error | Cause | Fix |
|---|---|---|
| “First row only” gives single object | ForEach expects array, gets object | Uncheck “First row only” in Lookup |
| “Activity ‘Lookup_Metadata’ not found” | Typo in activity name in ForEach Items expression | Match exact name including case |
| “Invalid object name 'SalesLT.Custmer'” | Typo in metadata table’s TableName | Fix the metadata row: UPDATE metadata SET TableName='Customer' WHERE TableName='Custmer' |
| “Forbidden” writing to ADLS | Managed Identity lacks permission | Add Storage Blob Data Contributor role to Synapse MI |
| Copy succeeds but 0 rows | Source table is empty or query returns nothing | Check source data: SELECT COUNT(*) FROM SalesLT.Customer |
| Log_Failure not running | Copy succeeded (green path) | Log_Failure only runs on red arrow (failure). This is correct. |
| Both Log_Success AND Log_Failure run | Impossible by design | Check arrow colors: green to Success, red to Failure |
Interview Questions
Q: What is a metadata-driven pipeline? A: A pipeline that reads its configuration (which tables to copy, where to write) from a database table instead of hardcoding. Adding or removing tables requires updating the metadata table, not the pipeline. It uses Lookup to read config, ForEach to iterate, and Copy to move data dynamically.
Q: Why use a stored procedure for audit logging instead of a direct INSERT? A: Encapsulation (SP hides table structure), reusability (multiple pipelines call the same SP), security (pipeline needs EXECUTE permission only), and maintainability (add columns to audit table by updating SP, not every pipeline).
Q: What is the difference between @item() and @dataset() in ADF expressions? A: @item() references the current ForEach iteration element and is used in pipeline activities. @dataset() references a dataset’s own parameters and is used inside the dataset’s Connection tab. The pipeline passes values via @item(), and the dataset receives them via @dataset().
Q: How do you handle both success and failure logging in a ForEach? A: Add two Stored Procedure activities after the Copy activity. Connect one with a green arrow (Succeeded) for success logging and another with a red arrow (Failed) for failure logging. Only one path executes per Copy run.
Q: What makes this pipeline SCD Type 1? A: Each run overwrites the destination with the latest source data. No history is preserved. The destination always reflects the current state. This is appropriate for staging tables, reference data corrections, and scenarios where only the current state matters.
Wrapping Up
This pipeline — Lookup, ForEach, Copy, Log Success, Log Failure — is the workhorse of Azure data engineering. It is the first pipeline every data engineer builds and the pattern that scales to hundreds of tables.
The beauty of metadata-driven design is that the pipeline is FINISHED. You never touch it again. When the business says “add 10 more tables,” you insert 10 rows into the metadata table. When they say “stop loading ProductCategory,” you delete one row. The pipeline adapts without code changes.
That is the power of SCD Type 1 with metadata-driven orchestration. Simple, reliable, and production-ready.
Related posts: – Metadata-Driven Pipeline in ADF – Synapse Pipeline with Audit Logging – ADF Expressions Guide – Slowly Changing Dimensions – Incremental Data Loading – Parameterized Datasets
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.