Incremental Data Loading in Azure Synapse: The Delta Copy Pattern with Watermark Tables
Full load works fine when your table has 1,000 rows. But what happens when it has 10 million? Or 100 million?
At 10 million rows, a full copy takes about 20 minutes per table. At 100 million, it’s closer to 2 hours. Now multiply that by 100 tables — you’re looking at 200+ hours of daily pipeline runtime. That’s more than 8 days. Obviously, that doesn’t fit in a 24-hour window.
The solution is incremental loading (also called delta load or change data capture). Instead of copying every row every day, you copy only the rows that changed since the last run. A table with 100 million rows might only have 50,000 new or updated rows per day — that’s a 2,000x reduction in data movement.
In this guide, I’ll walk you through building an incremental load pipeline in Azure Synapse Analytics using the watermark pattern — a config table that tracks the last loaded value for each source table, and a pipeline that uses that value to extract only the delta.
This builds on my previous posts about metadata-driven pipelines in ADF and Synapse pipelines with audit logging. If you’re new to the Lookup → ForEach → Copy pattern, start with those first.
Table of Contents
- The Problem: Why Full Load Doesn’t Scale
- Full Load vs Incremental Load
- How the Watermark Pattern Works
- Two Types of Watermarks: ID-Based and Date-Based
- Step 1: Create the Config Table (Metadata + Watermark)
- Step 2: Create Source Tables and Sample Data
- Step 3: Create the Stored Procedure for Watermark Updates
- Step 4: Create Datasets
- Step 5: Build the Pipeline
- Step 6: Understanding the Delta Query Expression
- Step 7: Configure the Watermark Update
- Step 8: Add Date-Partitioned Output Folders
- Step 9: First Run — Initial Full Load
- Step 10: Test Incremental — Add New Data and Re-Run
- The dataset() vs item() Gotcha (Revisited)
- Common Errors and Fixes
- Full Load vs Incremental at Scale — The Numbers
- Production Best Practices
- Interview Questions
- Wrapping Up
The Problem: Why Full Load Doesn’t Scale
Let me paint the picture with real numbers:
| Scenario | Rows per Table | Time per Table | Tables | Total Daily Time |
|---|---|---|---|---|
| Small | 10,000 | 30 seconds | 10 | 5 minutes |
| Medium | 1,000,000 | 5 minutes | 50 | 4 hours |
| Large | 10,000,000 | 20 minutes | 100 | 33 hours |
| Enterprise | 100,000,000 | 2 hours | 100 | 200+ hours |
At the “Large” tier, you’re already over 24 hours — your pipeline literally can’t finish before the next day’s run starts. And that’s before accounting for retries, failures, and other pipelines competing for resources.
Beyond time, full load has other costs:
- Storage: You're writing 100 million identical rows every day, consuming terabytes of duplicate data
- Database pressure: Full table scans on the source database impact production queries
- Network: Transferring the same unchanged data wastes bandwidth
- Cost: More data movement = higher ADF/Synapse pipeline costs
Full Load vs Incremental Load
| Aspect | Full Load | Incremental Load |
|---|---|---|
| What it copies | Every row, every time | Only changed rows (inserts + updates) |
| Rows per run (100M table) | 100,000,000 | ~50,000 (0.05%) |
| Time per table | ~2 hours | ~1-5 minutes |
| Storage growth | Massive (full copy daily) | Minimal (only deltas) |
| Source DB impact | Heavy (full scan) | Light (indexed WHERE clause) |
| Complexity | Simple | Moderate (needs config table + watermark logic) |
| When to use | Small tables, reference data, initial loads | Large transactional tables, daily/hourly pipelines |
Rule of thumb: Use full load for small reference tables (countries, categories, status codes) and initial loads. Use incremental load for everything else in production.
How the Watermark Pattern Works
The watermark pattern uses a config table that stores the “last loaded value” for each source table. Here’s the flow:
1. Read config table → get LastLoadedValue for each table
2. Query source table → get MAX value of the delta column
3. Copy only rows WHERE DeltaColumn > LastLoadedValue
AND DeltaColumn <= CurrentMaxValue
4. After successful copy → update LastLoadedValue = CurrentMaxValue
5. Next run → only picks up rows newer than the updated watermark
Why the upper bound (<= CurrentMaxValue)? Without it, rows inserted between the copy and the watermark update would be missed. The upper bound creates a clean boundary: everything up to MaxValue is copied, and the watermark is set to MaxValue. Any rows inserted after the query runs will have values > MaxValue and will be picked up in the next run.
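The flow above can be sketched as a small simulation (Python here is purely illustrative; the rows mirror the EMPLOYEE sample used later in this guide):

```python
# Simulate one incremental run of the watermark pattern.
# Rows are (id, value) tuples; the watermark tracks the last loaded id.

def incremental_run(source_rows, last_loaded):
    """Copy only rows between the old watermark and the current MAX."""
    current_max = max(r[0] for r in source_rows)  # step 2: MAX of delta column
    delta = [r for r in source_rows
             if last_loaded < r[0] <= current_max]  # step 3: bounded delta
    return delta, current_max                       # step 4: new watermark

source = [(100, 'Naveen'), (101, 'Shrey'), (102, 'Vrushab'), (103, 'Vishnu')]

# First run: watermark starts at 0, so everything is copied.
delta, watermark = incremental_run(source, last_loaded=0)
print(len(delta), watermark)   # 4 rows copied, watermark advances to 103

# A row inserted *after* the MAX query ran has id > 103,
# so the upper bound guarantees it is picked up by the next run.
source.append((104, 'Ravi'))
delta, watermark = incremental_run(source, last_loaded=watermark)
print(delta)                   # only the new row
```

The upper bound is what makes the second run deterministic: every row is either inside exactly one run's `(last_loaded, current_max]` window or in a later one, never dropped and never duplicated.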
Two Types of Watermarks: ID-Based and Date-Based
Option A: ID-Based Watermark (Auto-Increment Column)
- The watermark tracks the last loaded ID (e.g., EMPID)
- Works for INSERT-only tables where IDs always increase
- Example: Last loaded EMPID = 103. Next run picks up rows WHERE EMPID > 103
- Limitation: Doesn’t detect UPDATES — if you update EMPID 101’s city, the watermark won’t pick it up because 101 < 103
Option B: Date/Timestamp-Based Watermark
- The watermark tracks the latest timestamp loaded
- Works for both INSERTS and UPDATES (if the table has a `modified_date` or `rcd_upd_ts` column)
- Example: Last loaded date = 2026-01-01. Next run picks up WHERE ORDERDATE > '2026-01-01'
- Advantage: When a row is updated, its timestamp changes to the current time, making it “newer” than the watermark
Which to use? If your source table has a modified_date or updated_at column, use date-based — it catches both inserts and updates. If it only has an auto-increment ID, use ID-based (but be aware that updates won’t be captured).
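To see the insert-versus-update difference concretely, here is an illustrative comparison in Python (hypothetical rows; `modified_date` stands in for the assumed update-timestamp column):

```python
from datetime import datetime

# Each row: (emp_id, city, modified_date)
rows = [
    (101, 'delhi', datetime(2026, 1, 1)),
    (102, 'pune',  datetime(2026, 1, 1)),
    (103, 'noida', datetime(2026, 1, 1)),
]

# Simulate an UPDATE to EMPID 101: city changes, modified_date is bumped.
rows[0] = (101, 'mumbai', datetime(2026, 1, 2))

id_watermark = 103                       # last loaded EMPID
date_watermark = datetime(2026, 1, 1)    # last loaded timestamp

# ID-based delta misses the update: 101 is not > 103.
id_delta = [r for r in rows if r[0] > id_watermark]

# Date-based delta catches it: the bumped modified_date is newer.
date_delta = [r for r in rows if r[2] > date_watermark]

print(id_delta)    # [] -> the update is silently missed
print(date_delta)  # the updated row for EMPID 101
```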
Step 1: Create the Config Table (Metadata + Watermark)
The config table extends the metadata table from the previous tutorials by adding two new columns:
CREATE TABLE configtable (
id INT IDENTITY(1, 1),
TableName VARCHAR(100),
SchemaName VARCHAR(50),
FolderName VARCHAR(100),
LastLoadedValue VARCHAR(100),
DeltaColumnName VARCHAR(100)
);
- LastLoadedValue — The watermark value from the last successful run. Starts at `'0'` for ID-based or `'1900-01-01 00:00:00'` for date-based.
- DeltaColumnName — The name of the column used as the watermark (e.g., `EMPID` or `ORDERDATE`).
Why VARCHAR for LastLoadedValue? Because it needs to store both IDs (integers) and dates (timestamps) in the same column. We CAST everything to VARCHAR in the pipeline expressions for consistency.
Insert the config rows:
-- ID-based watermark (EMPLOYEE table, starts at 0)
INSERT INTO configtable (TableName, SchemaName, FolderName, LastLoadedValue, DeltaColumnName)
VALUES ('EMPLOYEE', 'T1', 'T1/EMPLOYEE', '0', 'EMPID');
-- Date-based watermark (ORDERS table, starts at 1900-01-01)
INSERT INTO configtable (TableName, SchemaName, FolderName, LastLoadedValue, DeltaColumnName)
VALUES ('ORDERS', 'T1', 'T1/ORDERS', '1900-01-01 00:00:00', 'ORDERDATE');
-- Verify
SELECT * FROM configtable;
Step 2: Create Source Tables and Sample Data
-- Create a schema to keep things organized
CREATE SCHEMA T1;
-- ID-based table (watermark on EMPID)
CREATE TABLE T1.EMPLOYEE (
EMPID BIGINT,
EMPNAME VARCHAR(100),
EMPCITY VARCHAR(100)
);
INSERT INTO T1.EMPLOYEE VALUES (100, 'Naveen', 'hyd');
INSERT INTO T1.EMPLOYEE VALUES (101, 'Shrey', 'delhi');
INSERT INTO T1.EMPLOYEE VALUES (102, 'Vrushab', 'pune');
INSERT INTO T1.EMPLOYEE VALUES (103, 'Vishnu', 'noida');
-- Date-based table (watermark on ORDERDATE)
CREATE TABLE T1.ORDERS (
ORDERID INT,
ORDERDATE DATETIME,
ORDERNUMBER VARCHAR(100)
);
INSERT INTO T1.ORDERS VALUES (1, '2026-01-01 10:30:00', 'B1026');
INSERT INTO T1.ORDERS VALUES (2, '2026-10-01 10:40:00', 'B1046');
INSERT INTO T1.ORDERS VALUES (3, '2026-01-01 11:30:00', 'B1063');
Step 3: Create the Stored Procedure for Watermark Updates
This stored procedure updates the watermark after each successful copy:
CREATE PROCEDURE usp_update_config
@tablename VARCHAR(100),
@LASTLOADVALUE VARCHAR(100)
AS
UPDATE configtable
SET LastLoadedValue = @LASTLOADVALUE
WHERE TableName = @tablename
AND SchemaName = 'T1';
Why update after the copy, not before? If the copy fails, the watermark stays at the old value. The next run will retry the same data range. If you updated the watermark before the copy and the copy failed, you’d lose those rows forever.
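The retry-safety argument can be sketched as a simulation (Python; `copy_succeeds` is a stand-in for the Copy activity's outcome, not a real ADF API):

```python
# Sketch of why the watermark must advance only AFTER a successful copy.

def run_pipeline(watermark, source_rows, copy_succeeds):
    current_max = max(source_rows)
    delta = [v for v in source_rows if watermark < v <= current_max]
    if not copy_succeeds:
        # Copy failed: the watermark is untouched, so the next run
        # retries exactly the same range. No rows are lost.
        return watermark
    # Success path (the green arrow): advance the watermark.
    return current_max

source = [101, 102, 103]

wm = run_pipeline(0, source, copy_succeeds=False)
print(wm)   # still 0 -> next run retries the whole range

wm = run_pipeline(wm, source, copy_succeeds=True)
print(wm)   # 103 -> the range up to 103 is now committed
```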
Step 4: Create Datasets
You need three datasets:
A. DS_ConfigTable (Lookup Source)
- Type: Azure SQL Database
- Table: `dbo.configtable`
- No parameters needed

B. DS_SourceTable_Incremental (Parameterized Source)
- Type: Azure SQL Database
- Leave table blank
- Parameters: `SchemaName` (String), `TableName` (String)
- Connection tab (Enter manually):
  - Schema: `@dataset().SchemaName`
  - Table: `@dataset().TableName`

C. DS_ADLS_Sink_Incremental (Parameterized Sink with Date Folders)
- Type: ADLS Gen2 → Parquet
- Parameters: `FolderName` (String)
- Connection tab:
  - Container: `synapse-workspace` (or your container name — can be hardcoded or parameterized)
  - Directory: `@dataset().FolderName`
Step 5: Build the Pipeline
Create a new pipeline: PL_Incremental_DeltaLoad
The pipeline has a more complex structure inside the ForEach than our previous pipelines — three activities in sequence:
Lookup_Config
│ (Success)
▼
ForEach_Table
│
├── Lookup_MaxValue
│ │ (Success)
│ ▼
├── Copy_DeltaData
│ │ (Success)
│ ▼
└── Update_Watermark (Stored Procedure)
A. Lookup_Config
- Settings tab:
  - Source dataset: `DS_ConfigTable`
  - First row only: UNCHECKED
B. ForEach_Table
- Settings tab:
  - Items: `@activity('Lookup_Config').output.value`
  - Sequential: unchecked
C. Inside ForEach — Lookup_MaxValue
This Lookup queries the source table to find the current maximum value of the delta column:
- Drag a Lookup activity inside the ForEach
- Name: `Lookup_MaxValue`
- Settings:
  - Source dataset: `DS_SourceTable_Incremental`
  - SchemaName: `@item().SchemaName`
  - TableName: `@item().TableName`
  - Use query: Query
  - Query (dynamic content):

@concat('SELECT CAST(MAX(', item().DeltaColumnName, ') AS VARCHAR(100)) AS MaxValue FROM ', item().SchemaName, '.', item().TableName)
Why CAST to VARCHAR? Because the config table stores LastLoadedValue as VARCHAR. We need MaxValue in the same format so the stored procedure can update it consistently, regardless of whether it’s an integer (EMPID) or a datetime (ORDERDATE).
First row only: Leave this CHECKED (default) — we only need one value (the MAX).
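One caveat worth noting (my addition, not part of the original pipeline): comparing values as VARCHAR means string comparison, which follows dictionary order rather than numeric or date order. A quick Python illustration:

```python
# String comparison follows dictionary order, not numeric order:
print('9' > '103')    # True  -> '9' sorts after '1' as a character
print(9 > 103)        # False -> numeric order

# Dates compare correctly as strings ONLY in a sortable format:
print('2026-11-15' > '2026-10-01')      # True  (ISO format: safe)
print('Nov 15 2026' > 'Oct  1 2026')    # False ('N' < 'O': wrong order!)
```

The sample data in this guide happens to compare safely, but in production it is safer to cast datetimes to an ISO-sortable style (for example, CONVERT(VARCHAR(23), col, 121) in SQL Server) and to compare numeric IDs numerically or at a fixed width.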
D. Inside ForEach — Copy_DeltaData
- Drag a Copy data activity
- Name: `Copy_DeltaData`
- Draw green arrow from `Lookup_MaxValue` → `Copy_DeltaData`
Source tab:
- Dataset: `DS_SourceTable_Incremental`
- SchemaName: `@item().SchemaName`
- TableName: `@item().TableName`
- Use query: Query
- Query (dynamic content) — this is the key expression:

@concat('SELECT * FROM ', item().SchemaName, '.', item().TableName, ' WHERE CAST(', item().DeltaColumnName, ' AS VARCHAR(100)) > ''', item().LastLoadedValue, ''' AND CAST(', item().DeltaColumnName, ' AS VARCHAR(100)) <= ''', activity('Lookup_MaxValue').output.firstRow.MaxValue, '''')

Sink tab:
- Dataset: `DS_ADLS_Sink_Incremental`
- FolderName: `@item().FolderName`
Step 6: Understanding the Delta Query Expression
The Copy activity’s source query is the most complex expression in this pipeline. Let’s break it down:
@concat(
'SELECT * FROM ', item().SchemaName, '.', item().TableName,
' WHERE CAST(', item().DeltaColumnName, ' AS VARCHAR(100)) > ''',
item().LastLoadedValue,
''' AND CAST(', item().DeltaColumnName, ' AS VARCHAR(100)) <= ''',
activity('Lookup_MaxValue').output.firstRow.MaxValue,
''''
)
For the EMPLOYEE table (first run), this resolves to:
SELECT * FROM T1.EMPLOYEE
WHERE CAST(EMPID AS VARCHAR(100)) > '0'
AND CAST(EMPID AS VARCHAR(100)) <= '103'
This selects every employee with EMPID above 0 and up to 103 (all four seed rows), i.e., the initial full load, since the watermark starts at 0.
For the ORDERS table (first run), this resolves to:
SELECT * FROM T1.ORDERS
WHERE CAST(ORDERDATE AS VARCHAR(100)) > '1900-01-01 00:00:00'
AND CAST(ORDERDATE AS VARCHAR(100)) <= '2026-10-01 10:40:00'
This selects all orders since 1900 (effectively all rows for the initial load).
On the second run (after watermark is updated):
-- Only new employees (EMPID > 103)
SELECT * FROM T1.EMPLOYEE
WHERE CAST(EMPID AS VARCHAR(100)) > '103'
AND CAST(EMPID AS VARCHAR(100)) <= '105'
-- Only new orders (ORDERDATE > '2026-10-01 10:40:00')
SELECT * FROM T1.ORDERS
WHERE CAST(ORDERDATE AS VARCHAR(100)) > '2026-10-01 10:40:00'
AND CAST(ORDERDATE AS VARCHAR(100)) <= '2026-11-15 09:00:00'
Now only 2 employees and 1 order are copied — not the entire tables.
Why the single-quote escaping (''')? In ADF expressions, a single quote inside a string is escaped with another single quote. So ''' means: end the string, insert a literal single quote, start a new string. This produces the SQL string delimiters around the watermark values.
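To sanity-check the escaping, you can rebuild the same string outside ADF. This Python sketch mirrors the @concat() pieces (each doubled '' inside an ADF string literal becomes one literal quote in the output; in Python we simply write the quote directly):

```python
# Mirror the ADF @concat() expression in Python to preview the resolved SQL.

def delta_query(schema, table, delta_col, last_loaded, max_value):
    """Assemble the bounded delta query exactly as @concat() would."""
    return (
        "SELECT * FROM " + schema + "." + table +
        " WHERE CAST(" + delta_col + " AS VARCHAR(100)) > '" + last_loaded +
        "' AND CAST(" + delta_col + " AS VARCHAR(100)) <= '" + max_value + "'"
    )

sql = delta_query('T1', 'EMPLOYEE', 'EMPID', '0', '103')
print(sql)
# SELECT * FROM T1.EMPLOYEE WHERE CAST(EMPID AS VARCHAR(100)) > '0'
#   AND CAST(EMPID AS VARCHAR(100)) <= '103'
```

Previewing the resolved string like this is a quick way to count quotes before debugging the expression inside the pipeline editor.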
Step 7: Configure the Watermark Update
After Copy_DeltaData succeeds, update the watermark:
- Drag a Stored Procedure activity
- Name: `Update_Watermark`
- Draw green arrow (Success) from `Copy_DeltaData` → `Update_Watermark`
- Settings:
  - Linked service: `LS_AzureSqlDB`
  - Stored procedure: `usp_update_config`
  - Parameters:
| Parameter | Type | Value |
|---|---|---|
| tablename | String | @item().TableName |
| LASTLOADVALUE | String | @activity('Lookup_MaxValue').output.firstRow.MaxValue |
Critical: We update the watermark to MaxValue (from the Lookup), NOT to the Copy activity’s output. This is because MaxValue was determined before the copy ran — it’s a clean, consistent boundary.
Step 8: Add Date-Partitioned Output Folders
Instead of overwriting the same folder each run, partition by date and time. In the Copy activity’s Sink tab, change the FolderName to:
@concat(item().FolderName, '/', formatDateTime(utcnow(), 'yyyy'), '/', formatDateTime(utcnow(), 'MM'), '/', formatDateTime(utcnow(), 'dd'), '/', formatDateTime(utcnow(), 'HH'), '/', formatDateTime(utcnow(), 'mm'))
This creates a nested folder structure:
synapse-workspace/
└── T1/
└── EMPLOYEE/
└── 2026/
└── 04/
└── 05/
└── 14/
└── 30/
└── part-00000.snappy.parquet
Why this matters: Each pipeline run creates files in a unique folder based on the execution time. You never overwrite previous runs. You can easily query specific date ranges. And cleanup is straightforward — just delete old date folders.
Important: Put this concat expression in the pipeline’s Copy activity Sink tab (where you pass the FolderName value), NOT in the dataset’s Connection tab. The dataset should only have @dataset().FolderName. The pipeline builds the full path and passes it to the dataset.
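For reference, the same path construction can be reproduced outside ADF. This Python sketch uses strftime in place of formatDateTime (the base folder value is illustrative):

```python
from datetime import datetime, timezone

def partitioned_folder(base_folder, now=None):
    """Build the yyyy/MM/dd/HH/mm folder path, as the Sink expression does."""
    now = now or datetime.now(timezone.utc)
    return "/".join([base_folder, now.strftime('%Y'), now.strftime('%m'),
                     now.strftime('%d'), now.strftime('%H'), now.strftime('%M')])

print(partitioned_folder('T1/EMPLOYEE', datetime(2026, 4, 5, 14, 30)))
# T1/EMPLOYEE/2026/04/05/14/30
```

Note that both formatDateTime and strftime zero-pad months, days, hours, and minutes, which is what keeps the folders sorting correctly in the data lake.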
Step 9: First Run — Initial Full Load
- Click Validate → fix any errors
- Click Publish all
- Click Debug
After the first run completes, verify:
-- Check the config table — watermarks should be updated
SELECT * FROM configtable;
-- Expected:
-- EMPLOYEE: LastLoadedValue = '103'
-- ORDERS: LastLoadedValue = 'Oct 1 2026 10:40AM' (or similar format)
Check ADLS Gen2 — you should see Parquet files in the date-partitioned folders under T1/EMPLOYEE/ and T1/ORDERS/.
Step 10: Test Incremental — Add New Data and Re-Run
Add new rows to the source tables:
INSERT INTO T1.EMPLOYEE VALUES (104, 'Ravi', 'bangalore');
INSERT INTO T1.EMPLOYEE VALUES (105, 'Priya', 'chennai');
INSERT INTO T1.ORDERS VALUES (4, '2026-11-15 09:00:00', 'B2001');
Run the pipeline again (Debug).
What should happen:
- Only EMPID 104 and 105 are copied (not all 6 rows)
- Only ORDERID 4 is copied (not all 4 orders)
- New Parquet files appear in a new date-partitioned folder
Verify the updated watermarks:
SELECT * FROM configtable;
-- EMPLOYEE: LastLoadedValue = '105' (was '103')
-- ORDERS: LastLoadedValue = 'Nov 15 2026 9:00AM' (was 'Oct 1 2026 10:40AM')
This confirms incremental loading is working correctly. The pipeline only copied the delta.
The dataset() vs item() Gotcha (Revisited)
A common error in this pipeline is using @dataset() where you should use @item(), or vice versa:
Error: "The template function 'dataset' is not defined or not valid"
This happens when you put @dataset().FolderName in the pipeline’s Copy activity. Remember the rule:
| Where | Use This | Example |
|---|---|---|
| Inside a Dataset (Connection tab) | `@dataset()` | `@dataset().FolderName` |
| Inside a Pipeline (activity settings) | `@item()` | `@item().FolderName` |
| Inside a Pipeline (referencing another activity) | `@activity()` | `@activity('Lookup_MaxValue').output.firstRow.MaxValue` |
The dataset defines the template. The pipeline fills in the values. Never mix them.
Common Errors and Fixes
| Error | Cause | Fix |
|---|---|---|
| `"dataset function is not defined"` | Used `@dataset()` in the pipeline | Change to `@item()` in the pipeline; use `@dataset()` only in the dataset |
| `"Lookup activity returned no results"` | Source table is empty or watermark is ahead of all data | Check that LastLoadedValue isn't greater than the MAX in the source |
| Case-sensitivity error on `@item().tablename` | Column name doesn't match exactly | Use `@item().TableName`, matching your SQL column exactly |
| Single quote parsing error in WHERE clause | Missing or extra quotes in the `@concat()` expression | Count your quotes: `'''` = end string + literal quote + start string |
| `Update_Watermark` runs after failed Copy | Green arrow (Success) not properly set | Verify the dependency condition is "Succeeded", not "Completed" |
Full Load vs Incremental at Scale — The Numbers
| Metric | Full Load (Daily) | Incremental Load (Daily) |
|---|---|---|
| Rows copied (100M table) | 100,000,000 | ~50,000 (changes only) |
| Time per table | ~2 hours | ~1-5 minutes |
| Time for 100 tables | 200+ hours (impossible) | ~2-8 hours (feasible) |
| Daily storage growth | 100M rows x 100 tables | 50K rows x 100 tables |
| Source DB pressure | Heavy (full table scans) | Light (indexed WHERE reads) |
| Implementation effort | Low | Moderate (config table + stored proc) |
Bottom line: Full load is simple but doesn’t scale. Incremental load is essential for large transactional tables in production. The watermark pattern (tracking LastLoadedValue in a config table) is the most common and reliable approach in ADF and Synapse.
Production Best Practices
1. Always Update Watermark AFTER Successful Copy
If you update the watermark before the copy and the copy fails, you’ll lose those rows. The stored procedure should only run on the success path (green arrow from Copy).
2. Handle Empty Deltas Gracefully
If the source table is empty (or has been truncated), Lookup_MaxValue returns NULL, and the watermark update would wipe out the stored value. Guard the update by coalescing back to the old watermark:
@if(equals(activity('Lookup_MaxValue').output.firstRow.MaxValue, null), item().LastLoadedValue, activity('Lookup_MaxValue').output.firstRow.MaxValue)
Use this as the LASTLOADVALUE parameter value. Note that when the table has rows but nothing new, MaxValue simply equals the old watermark, the delta query returns zero rows, and the update is a harmless no-op.
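The fallback can be sketched as a one-line rule (Python, illustrative only):

```python
def next_watermark(max_value, last_loaded):
    """Fall back to the old watermark when the source returned no MAX."""
    return last_loaded if max_value is None else max_value

# Source table was empty: MAX(...) came back NULL.
print(next_watermark(None, '103'))   # '103' -> watermark unchanged
# New data exists: advance as usual.
print(next_watermark('105', '103'))  # '105'
```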
3. Index the Delta Column
Make sure your delta column (EMPID, ORDERDATE) has an index on the source database. Without an index, the WHERE clause in the delta query does a full table scan — defeating the purpose of incremental loading.
CREATE INDEX IX_EMPLOYEE_EMPID ON T1.EMPLOYEE (EMPID);
CREATE INDEX IX_ORDERS_ORDERDATE ON T1.ORDERS (ORDERDATE);
4. Combine with Audit Logging
Add the audit logging from the previous post to track how many delta rows were copied per table per run. This gives you a complete picture of your pipeline’s behavior over time.
5. Monitor Watermark Drift
Set up alerts if a watermark hasn’t moved in X days — it might indicate the source table isn’t getting updates, or the pipeline is silently failing.
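A minimal sketch of such an alert check, assuming date-based watermarks stored in an ISO yyyy-MM-dd HH:mm:ss format (the SHIPMENTS row and the threshold are hypothetical):

```python
from datetime import datetime, timedelta

def stale_watermarks(config_rows, max_age_days=3, now=None):
    """Flag tables whose date-based watermark hasn't moved in max_age_days."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for table_name, last_loaded in config_rows:
        wm = datetime.strptime(last_loaded, '%Y-%m-%d %H:%M:%S')
        if wm < cutoff:
            stale.append(table_name)
    return stale

config = [('ORDERS', '2026-11-15 09:00:00'), ('SHIPMENTS', '2026-10-01 10:40:00')]
print(stale_watermarks(config, max_age_days=3, now=datetime(2026, 11, 16)))
# only the table whose watermark stopped moving
```

In practice you would run a query like this on a schedule (Logic App, Azure Function, or a monitoring pipeline) and raise an alert for anything in the stale list.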
Interview Questions
Q: What is the difference between full load and incremental load?
A: Full load copies every row every time. Incremental load copies only rows that changed since the last run, using a watermark (like a last-loaded ID or timestamp). Incremental is essential for large tables where full load is too slow or expensive.

Q: What is the watermark pattern in ADF/Synapse?
A: It's a pattern where a config table stores the last loaded value (watermark) for each source table. The pipeline reads the watermark, queries only rows newer than that value, copies them, and then updates the watermark to the new MAX value.
Q: How do you handle both inserts and updates in incremental loading?
A: Use a timestamp-based watermark (like modified_date or rcd_upd_ts). When a row is updated, its timestamp changes to the current time, making it newer than the watermark. ID-based watermarks only catch inserts, not updates.
Q: Why do you use CAST(DeltaColumn AS VARCHAR) in the query?
A: Because the config table stores LastLoadedValue as VARCHAR (to support both integers and dates in the same column). We CAST the source column to VARCHAR for consistent comparison.

Q: What happens if the pipeline fails midway through a run?
A: The watermark only updates after a successful copy (green arrow dependency). If the copy fails, the watermark stays at the old value, and the next run retries the same data range. No data is lost.

Q: How would you handle deletes in incremental loading?
A: The watermark pattern doesn't handle hard deletes. Options include soft deletes (an is_deleted flag), SQL Server's CDC or Change Tracking features, or periodic full reconciliation runs that compare source and destination.
Wrapping Up
You’ve now built a complete incremental data loading pipeline with:
- Config table with watermark tracking (LastLoadedValue + DeltaColumnName)
- Dynamic delta queries that only copy changed rows
- Watermark updates after successful copies via stored procedure
- Date-partitioned output folders for organized data lake storage
- Support for both ID-based and date-based watermarks using CAST to VARCHAR
This is how real production data pipelines work. The watermark pattern is simple, reliable, and scales to thousands of tables with millions of rows.
Related posts:
- Building a Metadata-Driven Pipeline in Azure Data Factory
- Synapse Pipeline with Audit Logging
- Building a REST API with FastAPI on AWS Lambda
- Top 15 Azure Data Factory Interview Questions (coming soon)
If this guide helped you, share it with a fellow data engineer. Have questions? Drop a comment below.
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.