Incremental Data Loading in Azure Synapse: The Delta Copy Pattern with Watermark Tables

Full load works fine when your table has 1,000 rows. But what happens when it has 10 million? Or 100 million?

At 10 million rows, a full copy takes about 20 minutes per table. At 100 million, it’s closer to 2 hours. Now multiply that by 100 tables — you’re looking at 200+ hours of daily pipeline runtime. That’s more than 8 days. Obviously, that doesn’t fit in a 24-hour window.

The solution is incremental loading (also called delta load or change data capture). Instead of copying every row every day, you copy only the rows that changed since the last run. A table with 100 million rows might only have 50,000 new or updated rows per day — that’s a 2,000x reduction in data movement.
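
A quick back-of-the-envelope check of those figures (illustration only):

```python
# Back-of-the-envelope check of the claims above.
full_load_hours = 100 * 2         # 100 tables, ~2 hours each at 100M rows
assert full_load_hours == 200     # well past a 24-hour window

reduction = 100_000_000 / 50_000  # full table vs. a typical daily delta
assert reduction == 2_000
```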

In this guide, I’ll walk you through building an incremental load pipeline in Azure Synapse Analytics using the watermark pattern — a config table that tracks the last loaded value for each source table, and a pipeline that uses that value to extract only the delta.

This builds on my previous posts about metadata-driven pipelines in ADF and Synapse pipelines with audit logging. If you’re new to the Lookup → ForEach → Copy pattern, start with those first.

Table of Contents

  • The Problem: Why Full Load Doesn’t Scale
  • Full Load vs Incremental Load
  • How the Watermark Pattern Works
  • Two Types of Watermarks: ID-Based and Date-Based
  • Step 1: Create the Config Table (Metadata + Watermark)
  • Step 2: Create Source Tables and Sample Data
  • Step 3: Create the Stored Procedure for Watermark Updates
  • Step 4: Create Datasets
  • Step 5: Build the Pipeline
  • Step 6: Understanding the Delta Query Expression
  • Step 7: Configure the Watermark Update
  • Step 8: Add Date-Partitioned Output Folders
  • Step 9: First Run — Initial Full Load
  • Step 10: Test Incremental — Add New Data and Re-Run
  • The dataset() vs item() Gotcha (Revisited)
  • Common Errors and Fixes
  • Full Load vs Incremental at Scale — The Numbers
  • Production Best Practices
  • Interview Questions
  • Wrapping Up

The Problem: Why Full Load Doesn’t Scale

Let me paint the picture with real numbers:

| Scenario | Rows per Table | Time per Table | Tables | Total Daily Time |
|---|---|---|---|---|
| Small | 10,000 | 30 seconds | 10 | 5 minutes |
| Medium | 1,000,000 | 5 minutes | 50 | 4 hours |
| Large | 10,000,000 | 20 minutes | 100 | 33 hours |
| Enterprise | 100,000,000 | 2 hours | 100 | 200+ hours |

At the “Large” tier, you’re already over 24 hours — your pipeline literally can’t finish before the next day’s run starts. And that’s before accounting for retries, failures, and other pipelines competing for resources.

Beyond time, full load has other costs:

  • Storage: writing 100 million identical rows every day consumes terabytes of duplicate data
  • Database pressure: full table scans on the source database impact production queries
  • Network: transferring the same unchanged data wastes bandwidth
  • Cost: more data movement means higher ADF/Synapse pipeline costs

Full Load vs Incremental Load

| Aspect | Full Load | Incremental Load |
|---|---|---|
| What it copies | Every row, every time | Only changed rows (inserts + updates) |
| Rows per run (100M table) | 100,000,000 | ~50,000 (0.05%) |
| Time per table | ~2 hours | ~1-5 minutes |
| Storage growth | Massive (full copy daily) | Minimal (only deltas) |
| Source DB impact | Heavy (full scan) | Light (indexed WHERE clause) |
| Complexity | Simple | Moderate (needs config table + watermark logic) |
| When to use | Small tables, reference data, initial loads | Large transactional tables, daily/hourly pipelines |

Rule of thumb: Use full load for small reference tables (countries, categories, status codes) and initial loads. Use incremental load for everything else in production.

How the Watermark Pattern Works

The watermark pattern uses a config table that stores the “last loaded value” for each source table. Here’s the flow:

1. Read config table → get LastLoadedValue for each table
2. Query source table → get MAX value of the delta column
3. Copy only rows WHERE DeltaColumn > LastLoadedValue
                    AND DeltaColumn <= CurrentMaxValue
4. After successful copy → update LastLoadedValue = CurrentMaxValue
5. Next run → only picks up rows newer than the updated watermark

Why the upper bound (<= CurrentMaxValue)? Without it, rows inserted between the copy and the watermark update would be missed. The upper bound creates a clean boundary: everything up to MaxValue is copied, and the watermark is set to MaxValue. Any rows inserted after the query runs will have values > MaxValue and will be picked up in the next run.
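
The five steps above can be simulated in a few lines of plain Python (an illustration only; `run_incremental` is a hypothetical helper, not pipeline code):

```python
# Hypothetical in-memory model of one watermark cycle for an ID-based table.
def run_incremental(rows, watermark):
    """Copy rows in (watermark, max] and return (copied_rows, new_watermark)."""
    if not rows:
        return [], watermark                             # empty source: hold
    current_max = max(r["EMPID"] for r in rows)          # step 2: MAX lookup
    copied = [r for r in rows
              if watermark < r["EMPID"] <= current_max]  # step 3: bounded delta
    return copied, current_max                           # step 4: advance

rows = [{"EMPID": i} for i in (100, 101, 102, 103)]
copied, wm = run_incremental(rows, watermark=0)   # first run: full history
assert len(copied) == 4 and wm == 103

rows.append({"EMPID": 104})                       # a new row lands
copied, wm = run_incremental(rows, watermark=wm)  # next run: delta only
assert [r["EMPID"] for r in copied] == [104] and wm == 104
```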

Two Types of Watermarks: ID-Based and Date-Based

Option A: ID-Based Watermark (Auto-Increment Column)

  • The watermark tracks the last loaded ID (e.g., EMPID)
  • Works for INSERT-only tables where IDs always increase
  • Example: Last loaded EMPID = 103. Next run picks up rows WHERE EMPID > 103
  • Limitation: Doesn’t detect UPDATES — if you update EMPID 101’s city, the watermark won’t pick it up because 101 < 103

Option B: Date/Timestamp-Based Watermark

  • The watermark tracks the latest timestamp loaded
  • Works for both INSERTS and UPDATES (if the table has a modified_date or rcd_upd_ts column)
  • Example: Last loaded date = 2026-01-01. Next run picks up WHERE ORDERDATE > '2026-01-01'
  • Advantage: When a row is updated, its timestamp changes to the current time, making it “newer” than the watermark

Which to use? If your source table has a modified_date or updated_at column, use date-based — it catches both inserts and updates. If it only has an auto-increment ID, use ID-based (but be aware that updates won’t be captured).
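
To make the trade-off concrete, here is a small hypothetical Python simulation: an UPDATE bumps a row's timestamp but not its ID, so only the date-based filter sees it.

```python
from datetime import datetime

# Hypothetical rows carrying both an auto-increment ID and a modified timestamp.
rows = [
    {"EMPID": 101, "CITY": "delhi", "modified": datetime(2026, 1, 1)},
    {"EMPID": 103, "CITY": "noida", "modified": datetime(2026, 1, 2)},
]

# An UPDATE to EMPID 101 bumps its timestamp, not its ID.
rows[0]["CITY"] = "mumbai"
rows[0]["modified"] = datetime(2026, 1, 5)

id_watermark = 103                      # last loaded EMPID
date_watermark = datetime(2026, 1, 2)   # last loaded timestamp

id_delta = [r for r in rows if r["EMPID"] > id_watermark]
date_delta = [r for r in rows if r["modified"] > date_watermark]

assert id_delta == []                             # ID watermark misses the update
assert [r["EMPID"] for r in date_delta] == [101]  # date watermark catches it
```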

Step 1: Create the Config Table (Metadata + Watermark)

The config table extends the metadata table from the previous tutorials by adding two new columns:

CREATE TABLE configtable (
    id               INT IDENTITY(1, 1),
    TableName        VARCHAR(100),
    SchemaName       VARCHAR(50),
    FolderName       VARCHAR(100),
    LastLoadedValue  VARCHAR(100),
    DeltaColumnName  VARCHAR(100)
);
  • LastLoadedValue — The watermark value from the last successful run. Starts at '0' for ID-based or '1900-01-01 00:00:00' for date-based.
  • DeltaColumnName — The name of the column used as the watermark (e.g., EMPID or ORDERDATE).

Why VARCHAR for LastLoadedValue? Because it needs to store both IDs (integers) and dates (timestamps) in the same column. We CAST everything to VARCHAR in the pipeline expressions for consistency.

Insert the config rows:

-- ID-based watermark (EMPLOYEE table, starts at 0)
INSERT INTO configtable (TableName, SchemaName, FolderName, LastLoadedValue, DeltaColumnName)
VALUES ('EMPLOYEE', 'T1', 'T1/EMPLOYEE', '0', 'EMPID');

-- Date-based watermark (ORDERS table, starts at 1900-01-01)
INSERT INTO configtable (TableName, SchemaName, FolderName, LastLoadedValue, DeltaColumnName)
VALUES ('ORDERS', 'T1', 'T1/ORDERS', '1900-01-01 00:00:00', 'ORDERDATE');

-- Verify
SELECT * FROM configtable;

Step 2: Create Source Tables and Sample Data

-- Create a schema to keep things organized
CREATE SCHEMA T1;

-- ID-based table (watermark on EMPID)
CREATE TABLE T1.EMPLOYEE (
    EMPID    BIGINT,
    EMPNAME  VARCHAR(100),
    EMPCITY  VARCHAR(100)
);

INSERT INTO T1.EMPLOYEE VALUES (100, 'Naveen', 'hyd');
INSERT INTO T1.EMPLOYEE VALUES (101, 'Shrey', 'delhi');
INSERT INTO T1.EMPLOYEE VALUES (102, 'Vrushab', 'pune');
INSERT INTO T1.EMPLOYEE VALUES (103, 'Vishnu', 'noida');

-- Date-based table (watermark on ORDERDATE)
CREATE TABLE T1.ORDERS (
    ORDERID     INT,
    ORDERDATE   DATETIME,
    ORDERNUMBER VARCHAR(100)
);

INSERT INTO T1.ORDERS VALUES (1, '2026-01-01 10:30:00', 'B1026');
INSERT INTO T1.ORDERS VALUES (2, '2026-10-01 10:40:00', 'B1046');
INSERT INTO T1.ORDERS VALUES (3, '2026-01-01 11:30:00', 'B1063');

Step 3: Create the Stored Procedure for Watermark Updates

This stored procedure updates the watermark after each successful copy:

CREATE PROCEDURE usp_update_config
    @tablename     VARCHAR(100),
    @LASTLOADVALUE VARCHAR(100)
AS
    UPDATE configtable
    SET LastLoadedValue = @LASTLOADVALUE
    WHERE TableName = @tablename
    AND SchemaName = 'T1';

Why update after the copy, not before? If the copy fails, the watermark stays at the old value. The next run will retry the same data range. If you updated the watermark before the copy and the copy failed, you’d lose those rows forever.
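
The failure semantics can be sketched in Python (hypothetical names, standing in for the pipeline's success-path dependency):

```python
# Hypothetical sketch: the watermark advances only on the Success path.
def finish_run(copy_succeeded, old_watermark, max_value):
    """Return the watermark after a run; a failed copy leaves it untouched."""
    if copy_succeeded:
        return max_value      # Update_Watermark ran (green arrow)
    return old_watermark      # next run retries the same (old, max] range

wm = finish_run(copy_succeeded=False, old_watermark="0", max_value="103")
assert wm == "0"              # failure: nothing lost, range will be retried
wm = finish_run(copy_succeeded=True, old_watermark=wm, max_value="103")
assert wm == "103"            # success: boundary advances
```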

Step 4: Create Datasets

You need three datasets:

A. DS_ConfigTable (Lookup Source)

  • Type: Azure SQL Database
  • Table: dbo.configtable
  • No parameters needed

B. DS_SourceTable_Incremental (Parameterized Source)

  • Type: Azure SQL Database
  • Leave table blank
  • Parameters: SchemaName (String), TableName (String)
  • Connection tab (Enter manually):
     • Schema: @dataset().SchemaName
     • Table: @dataset().TableName

C. DS_ADLS_Sink_Incremental (Parameterized Sink with Date Folders)

  • Type: ADLS Gen2 → Parquet
  • Parameters: FolderName (String)
  • Connection tab:
     • Container: synapse-workspace (or your container name — can be hardcoded or parameterized)
     • Directory: @dataset().FolderName

Step 5: Build the Pipeline

Create a new pipeline: PL_Incremental_DeltaLoad

The pipeline has a more complex structure inside the ForEach than our previous pipelines — three activities in sequence:

Lookup_Config
      │ (Success)
      ▼
ForEach_Table
      │
      ├── Lookup_MaxValue
      │       │ (Success)
      │       ▼
      ├── Copy_DeltaData
      │       │ (Success)
      │       ▼
      └── Update_Watermark (Stored Procedure)

A. Lookup_Config

  • Settings tab:
     • Source dataset: DS_ConfigTable
     • First row only: UNCHECKED

B. ForEach_Table

  • Settings tab:
     • Items: @activity('Lookup_Config').output.value
     • Sequential: unchecked

C. Inside ForEach — Lookup_MaxValue

This Lookup queries the source table to find the current maximum value of the delta column:

  1. Drag a Lookup activity inside the ForEach
  2. Name: Lookup_MaxValue
  3. Settings:
     • Source dataset: DS_SourceTable_Incremental
     • SchemaName: @item().SchemaName
     • TableName: @item().TableName
     • Use query: Query
     • Query (dynamic content):
@concat('SELECT CAST(MAX(', item().DeltaColumnName, ') AS VARCHAR(100)) AS MaxValue FROM ', item().SchemaName, '.', item().TableName)

Why CAST to VARCHAR? Because the config table stores LastLoadedValue as VARCHAR. We need MaxValue in the same format so the stored procedure can update it consistently, regardless of whether it's an integer (EMPID) or a datetime (ORDERDATE). One caution: VARCHAR comparison is lexicographic, not numeric. As strings, '99' > '103', so a purely numeric watermark can silently miss rows once IDs cross a digit boundary. If that can happen in your data, zero-pad the values or compare the column natively (e.g., WHERE EMPID > CAST('103' AS BIGINT)), which also lets the database use an index on the delta column.

First row only: Leave this CHECKED (default) — we only need one value (the MAX).

D. Inside ForEach — Copy_DeltaData

  1. Drag a Copy data activity
  2. Name: Copy_DeltaData
  3. Draw green arrow from Lookup_MaxValue → Copy_DeltaData

Source tab:

  • Dataset: DS_SourceTable_Incremental
  • SchemaName: @item().SchemaName
  • TableName: @item().TableName
  • Use query: Query
  • Query (dynamic content) — this is the key expression:

@concat('SELECT * FROM ', item().SchemaName, '.', item().TableName, ' WHERE CAST(', item().DeltaColumnName, ' AS VARCHAR(100)) > ''', item().LastLoadedValue, ''' AND CAST(', item().DeltaColumnName, ' AS VARCHAR(100)) <= ''', activity('Lookup_MaxValue').output.firstRow.MaxValue, '''')

Sink tab:

  • Dataset: DS_ADLS_Sink_Incremental
  • FolderName: @item().FolderName

Step 6: Understanding the Delta Query Expression

The Copy activity’s source query is the most complex expression in this pipeline. Let’s break it down:

@concat(
    'SELECT * FROM ', item().SchemaName, '.', item().TableName,
    ' WHERE CAST(', item().DeltaColumnName, ' AS VARCHAR(100)) > ''',
    item().LastLoadedValue,
    ''' AND CAST(', item().DeltaColumnName, ' AS VARCHAR(100)) <= ''',
    activity('Lookup_MaxValue').output.firstRow.MaxValue,
    ''''
)

For the EMPLOYEE table (first run), this resolves to:

SELECT * FROM T1.EMPLOYEE
WHERE CAST(EMPID AS VARCHAR(100)) > '0'
  AND CAST(EMPID AS VARCHAR(100)) <= '103'

This selects all four employees (EMPID 100 through 103): the initial load, since the watermark starts at 0.

For the ORDERS table (first run), this resolves to:

SELECT * FROM T1.ORDERS
WHERE CAST(ORDERDATE AS VARCHAR(100)) > '1900-01-01 00:00:00'
  AND CAST(ORDERDATE AS VARCHAR(100)) <= '2026-10-01 10:40:00'

This selects all orders since 1900 (effectively all rows for the initial load).

On the second run (after watermark is updated):

-- Only new employees (EMPID > 103)
SELECT * FROM T1.EMPLOYEE
WHERE CAST(EMPID AS VARCHAR(100)) > '103'
  AND CAST(EMPID AS VARCHAR(100)) <= '105'

-- Only new orders (ORDERDATE later than the stored watermark)
SELECT * FROM T1.ORDERS
WHERE CAST(ORDERDATE AS VARCHAR(100)) > 'Oct  1 2026 10:40AM'
  AND CAST(ORDERDATE AS VARCHAR(100)) <= 'Nov 15 2026  9:00AM'

Now only 2 employees and 1 order are copied — not the entire tables.

A word of caution on dates: CAST with the default style renders a DATETIME as 'Mon dd yyyy hh:miAM', and those strings do not sort chronologically ('Nov ...' sorts before 'Oct ...' alphabetically). For a reliable date-based watermark, use CONVERT(VARCHAR(23), ORDERDATE, 121) in both the MAX lookup and the delta query so the values come out ISO-formatted ('2026-11-15 09:00:00.000') and string order matches date order.

Why the single-quote escaping (''')? In ADF expressions, a single quote inside a string is escaped with another single quote. So ''' means: end the string, insert a literal single quote, start a new string. This produces the SQL string delimiters around the watermark values.
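
If it helps to see the assembly another way, the same string construction can be mirrored with a Python f-string (illustration only; the pipeline itself uses the @concat expression above, and `delta_query` is a hypothetical helper):

```python
# Illustration: build the same SQL the @concat expression produces,
# so the resolved quoting is easy to inspect.
def delta_query(schema, table, col, last_value, max_value):
    return (f"SELECT * FROM {schema}.{table}"
            f" WHERE CAST({col} AS VARCHAR(100)) > '{last_value}'"
            f" AND CAST({col} AS VARCHAR(100)) <= '{max_value}'")

sql = delta_query("T1", "EMPLOYEE", "EMPID", "0", "103")
assert sql == ("SELECT * FROM T1.EMPLOYEE"
               " WHERE CAST(EMPID AS VARCHAR(100)) > '0'"
               " AND CAST(EMPID AS VARCHAR(100)) <= '103'")
```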

Step 7: Configure the Watermark Update

After Copy_DeltaData succeeds, update the watermark:

  1. Drag a Stored Procedure activity
  2. Name: Update_Watermark
  3. Draw green arrow (Success) from Copy_DeltaData → Update_Watermark
  4. Settings:
     • Linked service: LS_AzureSqlDB
     • Stored procedure: usp_update_config
     • Parameters:

| Parameter | Type | Value |
|---|---|---|
| tablename | String | @item().TableName |
| LASTLOADVALUE | String | @activity('Lookup_MaxValue').output.firstRow.MaxValue |

Critical: We update the watermark to MaxValue (from the Lookup), NOT to the Copy activity’s output. This is because MaxValue was determined before the copy ran — it’s a clean, consistent boundary.

Step 8: Add Date-Partitioned Output Folders

Instead of overwriting the same folder each run, partition by date and time. In the Copy activity’s Sink tab, change the FolderName to:

@concat(item().FolderName, '/', formatDateTime(utcnow(), 'yyyy'), '/', formatDateTime(utcnow(), 'MM'), '/', formatDateTime(utcnow(), 'dd'), '/', formatDateTime(utcnow(), 'HH'), '/', formatDateTime(utcnow(), 'mm'))

This creates a nested folder structure:

synapse-workspace/
└── T1/
    └── EMPLOYEE/
        └── 2026/
            └── 04/
                └── 05/
                    └── 14/
                        └── 30/
                            └── part-00000.snappy.parquet

Why this matters: Each pipeline run creates files in a unique folder based on the execution time. You never overwrite previous runs. You can easily query specific date ranges. And cleanup is straightforward — just delete old date folders.

Important: Put this concat expression in the pipeline’s Copy activity Sink tab (where you pass the FolderName value), NOT in the dataset’s Connection tab. The dataset should only have @dataset().FolderName. The pipeline builds the full path and passes it to the dataset.
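
A quick way to preview the folder a given run produces: this Python snippet mirrors the formatDateTime calls above (illustration only; a fixed timestamp stands in for utcnow(), and `partition_path` is a hypothetical helper):

```python
from datetime import datetime, timezone

# Mirror of @concat(item().FolderName, '/', formatDateTime(utcnow(), 'yyyy'), ...)
def partition_path(folder_name, now):
    return "/".join([folder_name,
                     now.strftime("%Y"), now.strftime("%m"),
                     now.strftime("%d"), now.strftime("%H"),
                     now.strftime("%M")])

run_time = datetime(2026, 4, 5, 14, 30, tzinfo=timezone.utc)
path = partition_path("T1/EMPLOYEE", run_time)
assert path == "T1/EMPLOYEE/2026/04/05/14/30"
```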

Step 9: First Run — Initial Full Load

  1. Click Validate → fix any errors
  2. Click Publish all
  3. Click Debug

After the first run completes, verify:

-- Check the config table — watermarks should be updated
SELECT * FROM configtable;

-- Expected:
-- EMPLOYEE: LastLoadedValue = '103'
-- ORDERS:   LastLoadedValue = 'Oct  1 2026 10:40AM' (or similar format)

Check ADLS Gen2 — you should see Parquet files in the date-partitioned folders under T1/EMPLOYEE/ and T1/ORDERS/.

Step 10: Test Incremental — Add New Data and Re-Run

Add new rows to the source tables:

INSERT INTO T1.EMPLOYEE VALUES (104, 'Ravi', 'bangalore');
INSERT INTO T1.EMPLOYEE VALUES (105, 'Priya', 'chennai');

INSERT INTO T1.ORDERS VALUES (4, '2026-11-15 09:00:00', 'B2001');

Run the pipeline again (Debug).

What should happen:

  • Only EMPID 104 and 105 are copied (not all 6 rows)
  • Only ORDERID 4 is copied (not all 4 orders)
  • New Parquet files appear in a new date-partitioned folder

Verify the updated watermarks:

SELECT * FROM configtable;

-- EMPLOYEE: LastLoadedValue = '105' (was '103')
-- ORDERS:   LastLoadedValue = 'Nov 15 2026  9:00AM' (was 'Oct  1 2026 10:40AM')

This confirms incremental loading is working correctly. The pipeline only copied the delta.

The dataset() vs item() Gotcha (Revisited)

A common error in this pipeline is using @dataset() where you should use @item(), or vice versa:

Error: "The template function 'dataset' is not defined or not valid"

This happens when you put @dataset().FolderName in the pipeline’s Copy activity. Remember the rule:

| Where | Use This | Example |
|---|---|---|
| Inside a Dataset (Connection tab) | @dataset() | @dataset().FolderName |
| Inside a Pipeline (activity settings) | @item() | @item().FolderName |
| Inside a Pipeline (referencing another activity) | @activity() | @activity('Lookup_MaxValue').output.firstRow.MaxValue |

The dataset defines the template. The pipeline fills in the values. Never mix them.

Common Errors and Fixes

| Error | Cause | Fix |
|---|---|---|
| "dataset function is not defined" | Used @dataset() in the pipeline | Change to @item() in the pipeline; use @dataset() only in the dataset |
| "Lookup activity returned no results" | Source table is empty or watermark is ahead of all data | Check your LastLoadedValue isn't greater than the MAX in the source |
| Case-sensitivity error on @item().tablename | Column name doesn't match exactly | Use @item().TableName matching your SQL column exactly |
| Single quote parsing error in WHERE clause | Missing or extra quotes in the @concat() expression | Count your quotes: ''' = end string + literal quote + start string |
| Update_Watermark runs after failed Copy | Green arrow (Success) not properly set | Verify the dependency condition is "Succeeded" not "Completed" |

Full Load vs Incremental at Scale — The Numbers

| Metric | Full Load (Daily) | Incremental Load (Daily) |
|---|---|---|
| Rows copied (100M table) | 100,000,000 | ~50,000 (changes only) |
| Time per table | ~2 hours | ~1-5 minutes |
| Time for 100 tables | 200+ hours (impossible) | ~2-8 hours (feasible) |
| Daily storage growth | 100M rows x 100 tables | 50K rows x 100 tables |
| Source DB pressure | Heavy (full table scans) | Light (indexed WHERE reads) |
| Implementation effort | Low | Moderate (config table + stored proc) |

Bottom line: Full load is simple but doesn’t scale. Incremental load is essential for large transactional tables in production. The watermark pattern (tracking LastLoadedValue in a config table) is the most common and reliable approach in ADF and Synapse.

Production Best Practices

1. Always Update Watermark AFTER Successful Copy

If you update the watermark before the copy and the copy fails, you’ll lose those rows. The stored procedure should only run on the success path (green arrow from Copy).

2. Handle Empty Deltas Gracefully

If no new data exists since the last watermark, Lookup_MaxValue returns a NULL MaxValue, and passing that straight to the stored procedure would wipe out the watermark. Guard the LASTLOADVALUE parameter so the watermark never regresses:

@if(equals(activity('Lookup_MaxValue').output.firstRow.MaxValue, null), item().LastLoadedValue, activity('Lookup_MaxValue').output.firstRow.MaxValue)

Note that this keeps the watermark unchanged but does not skip the Copy activity itself; the copy simply moves zero rows. To skip the copy entirely, wrap it in an If Condition activity that checks whether MaxValue differs from LastLoadedValue.
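
The guard's behavior, sketched as a hypothetical Python helper:

```python
# Sketch of the NULL-guard for the watermark parameter.
def next_watermark(max_value, last_loaded):
    # Mirrors @if(equals(MaxValue, null), LastLoadedValue, MaxValue)
    return last_loaded if max_value is None else max_value

assert next_watermark(None, "103") == "103"   # empty delta: watermark holds
assert next_watermark("105", "103") == "105"  # new data: watermark advances
```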

3. Index the Delta Column

Make sure your delta column (EMPID, ORDERDATE) has an index on the source database. Without an index, the WHERE clause in the delta query does a full table scan — defeating the purpose of incremental loading. Also note that wrapping the column itself in CAST(...) in the WHERE clause prevents an index seek; comparing the column natively and casting the watermark value instead keeps the query sargable.

CREATE INDEX IX_EMPLOYEE_EMPID ON T1.EMPLOYEE (EMPID);
CREATE INDEX IX_ORDERS_ORDERDATE ON T1.ORDERS (ORDERDATE);

4. Combine with Audit Logging

Add the audit logging from the previous post to track how many delta rows were copied per table per run. This gives you a complete picture of your pipeline’s behavior over time.

5. Monitor Watermark Drift

Set up alerts if a watermark hasn’t moved in X days — it might indicate the source table isn’t getting updates, or the pipeline is silently failing.

Interview Questions

Q: What is the difference between full load and incremental load? A: Full load copies every row every time. Incremental load copies only rows that changed since the last run, using a watermark (like a last-loaded ID or timestamp). Incremental is essential for large tables where full load is too slow or expensive.

Q: What is the watermark pattern in ADF/Synapse? A: It’s a pattern where a config table stores the last loaded value (watermark) for each source table. The pipeline reads the watermark, queries only rows newer than that value, copies them, and then updates the watermark to the new MAX value.

Q: How do you handle both inserts and updates in incremental loading? A: Use a timestamp-based watermark (like modified_date or rcd_upd_ts). When a row is updated, its timestamp changes to the current time, making it newer than the watermark. ID-based watermarks only catch inserts, not updates.

Q: Why do you use CAST(DeltaColumn AS VARCHAR) in the query? A: Because the config table stores LastLoadedValue as VARCHAR (to support both integers and dates in the same column). We CAST the source column to VARCHAR for consistent comparison.

Q: What happens if the pipeline fails midway through a run? A: The watermark only updates after a successful copy (green arrow dependency). If the copy fails, the watermark stays at the old value, and the next run retries the same data range. No data is lost.

Q: How would you handle deletes in incremental loading? A: The watermark pattern doesn’t handle hard deletes. Options include: soft deletes (is_deleted flag), CDC with change tracking features in SQL Server, or periodic full reconciliation runs that compare source and destination.

Wrapping Up

You’ve now built a complete incremental data loading pipeline with:

  • Config table with watermark tracking (LastLoadedValue + DeltaColumnName)
  • Dynamic delta queries that only copy changed rows
  • Watermark updates after successful copies via stored procedure
  • Date-partitioned output folders for organized data lake storage
  • Support for both ID-based and date-based watermarks using CAST to VARCHAR

This is how real production data pipelines work. The watermark pattern is simple, reliable, and scales to thousands of tables with millions of rows.

Related posts:

  • Building a Metadata-Driven Pipeline in Azure Data Factory
  • Synapse Pipeline with Audit Logging
  • Building a REST API with FastAPI on AWS Lambda
  • Top 15 Azure Data Factory Interview Questions (coming soon)

If this guide helped you, share it with a fellow data engineer. Have questions? Drop a comment below.


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
