Building a Metadata-Driven Pipeline in Azure Data Factory: The Complete Guide
If you’re working in data engineering, you’ll build dozens of pipelines that copy data from one place to another. The naive approach is creating a separate pipeline for each table — but that quickly becomes a maintenance nightmare when you have 50+ tables.
The smarter approach is a metadata-driven pipeline — a single, reusable pipeline that reads its configuration from a database table and dynamically copies any number of tables. Change the metadata, change what gets copied. No pipeline code changes needed.
In this guide, I’ll walk you through building a production-quality metadata-driven pipeline in Azure Data Factory (ADF) using the Lookup → ForEach → Copy pattern. This is the same pattern used in enterprise data platforms, and it’s one of the most commonly asked topics in data engineering interviews.
This tutorial is based on a real project I built copying SalesLT tables from Azure SQL Database to Azure Data Lake Storage Gen2. I’ll share every step, every gotcha, and every real error I hit along the way.
Table of Contents
- What is a Metadata-Driven Pipeline?
- Architecture Overview
- Azure Resources You’ll Need
- Step 1: Create the Metadata Table
- Step 2: Create Linked Services
- Step 3: Create Datasets (The Tricky Part)
- Step 4: Build the Pipeline
- Step 5: Configure the Lookup Activity
- Step 6: Configure the ForEach Activity
- Step 7: Configure the Copy Activity Inside ForEach
- Step 8: Run and Debug the Pipeline
- Step 9: Verify the Output in ADLS Gen2
- Understanding the Dynamic Expressions
- The BadRequest Null Error (And How to Fix It)
- Common Mistakes and How to Avoid Them
- Production Best Practices
- Interview Questions About This Pattern
- What’s Next: Adding Audit Logging
- Wrapping Up
What is a Metadata-Driven Pipeline?
A metadata-driven pipeline is a design pattern where the pipeline’s behavior — which tables to copy, where to copy them, what format to use — is controlled by configuration data stored in a database table, not hardcoded in the pipeline itself.
Instead of building this:
Pipeline_Copy_Customer → copies Customer table
Pipeline_Copy_Address → copies Address table
Pipeline_Copy_Product → copies Product table
Pipeline_Copy_Orders → copies Orders table
... (50 more pipelines)
You build this:
Pipeline_Copy_All_Tables → reads metadata table
→ loops through each row
→ copies each table dynamically
Why this matters:
- One pipeline handles everything. Adding a new table means inserting a row in the metadata table — not building a new pipeline.
- Easy to maintain. Change the destination folder? Update the metadata. Disable a table? Delete or flag the row.
- Consistent behavior. Every table goes through the same copy logic, error handling, and logging.
- Interview favorite. This is asked in almost every Azure Data Engineering interview. Understanding it deeply gives you a significant advantage.
Architecture Overview
Here’s the high-level flow of what we’re building:
Pipeline: PL_Copy_SqlToADLS
│
├── Lookup Activity: "Lookup_Metadata"
│     ├── Reads from: dbo.metadata table
│     └── Output: Array of table configurations
│
└── ForEach Activity: "ForEach_Table"
      └── Iterates over: Each row from Lookup output
            │
            └── Copy Activity: "Copy_TableData" (inside ForEach)
                  ├── Source: Azure SQL Database (dynamic schema + table)
                  └── Sink: ADLS Gen2 (dynamic container + folder path)
Data flow:
- The Lookup activity reads the metadata table and returns an array of objects — one for each table to copy
- The ForEach activity iterates over this array, executing the inner activities once per table
- The Copy activity inside ForEach dynamically resolves the source table and destination path using @item() expressions that reference the current metadata row
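Before we build it in ADF, here is a minimal Python sketch of the same control flow, so the pattern is concrete. The metadata rows and the `copy_table` helper are illustrative stand-ins, not ADF APIs:

```python
# Two example metadata rows, shaped like the Lookup output we'll build later
metadata = [
    {"TableName": "Customer", "SchemaName": "SalesLT",
     "ContainerName": "database", "FolderName": "sqldb/Customer"},
    {"TableName": "Address", "SchemaName": "SalesLT",
     "ContainerName": "database", "FolderName": "sqldb/Address"},
]

def copy_table(item):
    # The Copy activity resolves @item().SchemaName etc. once per iteration
    source = f"{item['SchemaName']}.{item['TableName']}"
    sink = f"{item['ContainerName']}/{item['FolderName']}"
    return f"{source} -> {sink}"

# Lookup returns the array; ForEach runs the Copy once per row
results = [copy_table(item) for item in metadata]
```

Adding a third table means appending a row to `metadata`; the loop body never changes. That is the whole idea of the pattern.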
Azure Resources You’ll Need
Before we start building, make sure you have these resources provisioned:
| Resource | Details |
|---|---|
| Azure Data Factory | Your ADF workspace (e.g., naveen-datafactory-de) |
| Azure SQL Database | AdventureWorksLT sample database (or any SQL DB with multiple tables) |
| Azure Data Lake Storage Gen2 | Storage account with hierarchical namespace enabled |
| ADLS Container | A container created manually (e.g., database) — ADF cannot create containers, only folders |
Important: The ADLS Gen2 container must exist before running the pipeline. ADF will auto-create folders inside the container, but it cannot create the container itself. If the container doesn’t exist, you’ll get a runtime error.
Step 1: Create the Metadata Table
The metadata table is the brain of your pipeline. Each row represents one table to be copied. Connect to your Azure SQL Database and run:
-- Create the metadata table
CREATE TABLE metadata (
id INT IDENTITY(100, 1),
TableName VARCHAR(100),
SchemaName VARCHAR(100),
ContainerName VARCHAR(100),
FolderName VARCHAR(100)
);
-- Insert table configurations
INSERT INTO metadata (TableName, SchemaName, ContainerName, FolderName)
VALUES
('Customer', 'SalesLT', 'database', 'sqldb/Customer'),
('Address', 'SalesLT', 'database', 'sqldb/Address'),
('CustomerAddress', 'SalesLT', 'database', 'sqldb/CustomerAddress'),
('Product', 'SalesLT', 'database', 'sqldb/Product'),
('ProductCategory', 'SalesLT', 'database', 'sqldb/ProductCategory');
-- Verify
SELECT * FROM metadata;
Output:
| id | TableName | SchemaName | ContainerName | FolderName |
|---|---|---|---|---|
| 100 | Customer | SalesLT | database | sqldb/Customer |
| 101 | Address | SalesLT | database | sqldb/Address |
| 102 | CustomerAddress | SalesLT | database | sqldb/CustomerAddress |
| 103 | Product | SalesLT | database | sqldb/Product |
| 104 | ProductCategory | SalesLT | database | sqldb/ProductCategory |
Understanding the Column Design
Each column serves a specific purpose in the pipeline:
- TableName — The table to copy from the source database. Used in the Copy activity’s source query.
- SchemaName — The schema prefix (e.g., SalesLT). Combined with TableName to form the fully qualified table name SalesLT.Customer.
- ContainerName — The ADLS Gen2 container where data will land. Must already exist.
- FolderName — The folder path inside the container. ADF creates this automatically. Using sqldb/Customer creates a nested folder structure.
Pro tip: In production, you’d add more columns like IsActive (to enable/disable tables), LoadType (full/incremental), FileFormat (CSV/Parquet/JSON), and LastLoadDate (for tracking).
Step 2: Create Linked Services
Linked Services are connection strings to your data sources. You need two:
A. Linked Service for Azure SQL Database
- In ADF Studio, go to Manage tab (wrench icon) → Linked services → + New
- Search for Azure SQL Database → select it
- Configure:
- Name: LS_AzureSqlDB
- Server name: select your Azure SQL server from the dropdown
- Database name: select your database
- Authentication: SQL authentication (enter username and password)
- Click Test connection — make sure it shows “Connection successful”
- Click Create
B. Linked Service for ADLS Gen2
- + New → search for Azure Data Lake Storage Gen2 → select it
- Configure:
- Name: LS_ADLS_Gen2
- Authentication: Account Key (easiest) or Managed Identity (more secure, recommended for production)
- Select your storage account from the dropdown
- Click Test connection → Create
Note: If you’re using Synapse instead of ADF, the default workspace storage linked service (naveen-synapse-ws-WorkspaceDefaultStorage) is already created for you. You only need to create the SQL linked service.
Step 3: Create Datasets (The Tricky Part)
This is where most people make mistakes. You need three datasets, and two of them are parameterized — meaning they don’t point to a specific table or folder. Instead, they accept parameters at runtime that tell them which table to read from and which folder to write to.
Dataset 1: DS_SqlDB_Metadata (Lookup Source)
This is the simplest dataset — it just points to the metadata table.
- Go to Author tab → + → Dataset → Azure SQL Database
- Name: DS_SqlDB_Metadata
- Linked service: LS_AzureSqlDB
- Table: select dbo.metadata
- Click OK
No parameters needed here — this dataset always reads from the same table.
Dataset 2: DS_SqlDB_SourceTable (Parameterized Copy Source)
This dataset needs to point to different tables depending on which iteration of the ForEach loop is running. We achieve this with parameters.
- + → Dataset → Azure SQL Database
- Name: DS_SqlDB_SourceTable
- Linked service: LS_AzureSqlDB
- Leave the table name blank — don’t select any table
- Click OK
- Go to the Parameters tab → add two parameters:
  - SchemaName (Type: String) — leave default value EMPTY
  - TableName (Type: String) — leave default value EMPTY
- Go to the Connection tab:
  - Check the “Edit” checkbox (in ADF) or “Enter manually” checkbox (in Synapse)
  - Two text boxes appear for Schema and Table
  - Schema field → click Add dynamic content → type: @dataset().SchemaName
  - Table field → click Add dynamic content → type: @dataset().TableName
CRITICAL: Leave the parameter default values EMPTY. Do not put any expression, variable reference, or placeholder text in the default value field. This is the #1 cause of the dreaded “BadRequest null” error. The values will be passed by the Copy activity at runtime.
Dataset 3: DS_ADLS_Sink (Parameterized Sink)
This dataset needs to write to different folders depending on the current table being copied.
- + → Dataset → Azure Data Lake Storage Gen2 → select DelimitedText (for CSV) or Parquet (for columnar format)
- Name: DS_ADLS_Sink
- Linked service: LS_ADLS_Gen2
- Check “First row as header” (for CSV)
- Leave file path blank
- Click OK
- Go to the Parameters tab → add two parameters:
  - ContainerName (Type: String) — leave default value EMPTY
  - FolderName (Type: String) — leave default value EMPTY
- Go to the Connection tab:
  - File system (Container) → Add dynamic content → @dataset().ContainerName
  - Directory (Folder) → Add dynamic content → @dataset().FolderName
  - File name → leave blank (ADF auto-generates file names)
The Critical Rule for Avoiding BadRequest Errors
This deserves its own callout because it’s the most common mistake:
Never put expressions, variable references, or placeholder text in dataset parameter default values.
When you add parameters to a dataset, ADF shows a “Default value” field. Leave it empty. If you put something like @pipeline().parameters.TableName or even just the word “test” there, and later remove or rename that pipeline parameter, ADF will throw the cryptic “BadRequest null” error with no useful error message.
The default value field is only for when you want a fallback — and in a metadata-driven pipeline, you always want the value to come from the ForEach loop, not from a default.
Step 4: Build the Pipeline
Now let’s create the pipeline and connect the activities:
- Go to Author tab → + → Pipeline
- Name it: PL_Copy_SqlToADLS
Add the activities:
- Drag a Lookup activity onto the canvas → name it: Lookup_Metadata
- Drag a ForEach activity onto the canvas → name it: ForEach_Table
- Connect them: hover over the Lookup activity → drag the green success arrow to the ForEach activity
- Double-click the ForEach activity to open its inner canvas
- Drag a Copy Data activity inside → name it: Copy_TableData
Your canvas should look like:
[Lookup_Metadata] ──green arrow──> [ForEach_Table]
                                         │
                                         └── Inside: [Copy_TableData]
Step 5: Configure the Lookup Activity
Click on the Lookup_Metadata activity:
- Go to the Settings tab
- Source dataset: select DS_SqlDB_Metadata
- UNCHECK “First row only” — this is critical! If checked, the Lookup returns only one row (a single object). Unchecked, it returns an array of all rows, which is what ForEach needs.
- Query: Use Table (default) — this reads all rows from the metadata table
What the Lookup returns (when “First row only” is unchecked):
{
"count": 5,
"value": [
{
"id": 100,
"TableName": "Customer",
"SchemaName": "SalesLT",
"ContainerName": "database",
"FolderName": "sqldb/Customer"
},
{
"id": 101,
"TableName": "Address",
"SchemaName": "SalesLT",
"ContainerName": "database",
"FolderName": "sqldb/Address"
},
...
]
}
What happens if “First row only” is checked (common mistake):
{
"firstRow": {
"id": 100,
"TableName": "Customer",
"SchemaName": "SalesLT",
...
}
}
The output structure is completely different — firstRow vs value array. This breaks the ForEach activity because it expects an array to iterate over, not a single object.
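A small Python sketch (the dictionaries are trimmed versions of the JSON above) makes the failure mode obvious:

```python
# Shape with "First row only" UNCHECKED: an array under "value"
lookup_unchecked = {"count": 2, "value": [
    {"TableName": "Customer"}, {"TableName": "Address"}]}

# Shape with "First row only" CHECKED: a single object under "firstRow"
lookup_checked = {"firstRow": {"TableName": "Customer"}}

# ForEach needs a list to iterate over
items = lookup_unchecked["value"]
assert isinstance(items, list) and len(items) == 2

# With the box checked there is no "value" key at all, so the
# @activity('Lookup_Metadata').output.value expression cannot resolve
assert "value" not in lookup_checked
```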
Step 6: Configure the ForEach Activity
Click on the ForEach_Table activity:
- Go to the Settings tab
- Items field → click Add dynamic content → enter:
@activity('Lookup_Metadata').output.value
This expression does the following:
- @activity('Lookup_Metadata') — references the Lookup activity by name (must match exactly, case-sensitive)
- .output — accesses the activity’s output
- .value — accesses the array of rows (this is why “First row only” must be unchecked)
- Sequential checkbox: Leave unchecked for parallel execution (faster), or check it if you want tables to be copied one at a time (useful for debugging or when resources are limited)
- Batch count: Default is 20 (max parallel copies). For a small number of tables (5-10), leave the default.
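Batch count behaves like a bounded worker pool: at most that many Copy activities run at once. A rough Python analogy (the table list and `copy` function are illustrative, not ADF APIs):

```python
from concurrent.futures import ThreadPoolExecutor

tables = ["Customer", "Address", "CustomerAddress",
          "Product", "ProductCategory"]

def copy(table):
    # stand-in for one Copy activity execution
    return f"copied {table}"

batch_count = 20  # ADF's default cap on parallel iterations
with ThreadPoolExecutor(max_workers=batch_count) as pool:
    # like ForEach with Sequential unchecked: iterations run concurrently,
    # bounded by batch_count
    results = list(pool.map(copy, tables))
```

Checking the Sequential box is equivalent to `max_workers=1`: the same work, one table at a time.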
Step 7: Configure the Copy Activity Inside ForEach
Double-click the ForEach activity to open its inner canvas. Click on the Copy_TableData activity:
Source Tab
- Source dataset: select DS_SqlDB_SourceTable
- Two parameter fields will appear (SchemaName and TableName)
- Click Add dynamic content for each:
  - SchemaName → @item().SchemaName
  - TableName → @item().TableName
What does @item() mean?
Inside a ForEach activity, @item() refers to the current element being processed in the loop. If the ForEach is iterating over the array from the Lookup output, then on the first iteration, @item() is:
{
"id": 100,
"TableName": "Customer",
"SchemaName": "SalesLT",
"ContainerName": "database",
"FolderName": "sqldb/Customer"
}
So @item().SchemaName resolves to "SalesLT" and @item().TableName resolves to "Customer". The Copy activity then reads from SalesLT.Customer.
On the second iteration, @item() becomes the Address row, and the Copy reads from SalesLT.Address. And so on.
Sink Tab
- Sink dataset: select DS_ADLS_Sink
- Two parameter fields will appear (ContainerName and FolderName)
- Click Add dynamic content for each:
  - ContainerName → @item().ContainerName
  - FolderName → @item().FolderName
Remember: @dataset() vs @item()
This confuses many people, so let me clarify:
| Expression | Where It’s Used | What It Does |
|---|---|---|
| @dataset().SchemaName | In the Dataset Connection tab | References the dataset’s own parameter — like a function parameter |
| @item().SchemaName | In the Pipeline Copy activity | Passes a value from the current ForEach item to the dataset parameter |
Think of it like a function call:
// Dataset is like a function definition
function copyFromTable(SchemaName, TableName) {
// @dataset().SchemaName is used inside the function body
return readFrom(SchemaName + "." + TableName);
}
// Pipeline is like calling the function
for (item in lookupResults) {
// @item().SchemaName is the argument passed to the function
copyFromTable(item.SchemaName, item.TableName);
}
Step 8: Run and Debug the Pipeline
Before running, do this pre-flight checklist:
- Verify the ADLS container exists — Go to your Storage Account → Containers → confirm the database container is there
- Validate the pipeline — Click the Validate button (checkmark icon) at the top. Fix any errors.
- Publish all — Click Publish all to save your changes to the ADF service
Now click Debug to run the pipeline.
What to watch in the output pane:
- Lookup_Metadata should show: Status: Succeeded, Output: (click the glasses icon to see the JSON array)
- ForEach_Table should show: Status: Succeeded, with 5 iterations
- Click on ForEach_Table → View iterations → each Copy_TableData should show Succeeded with rows read/written
Successful run example:
Lookup_Metadata → Succeeded (5 rows returned)
ForEach_Table → Succeeded
├── Copy_TableData (Customer) → 847 rows read, 847 rows written
├── Copy_TableData (Address) → 450 rows read, 450 rows written
├── Copy_TableData (CustomerAddress) → 417 rows read, 417 rows written
├── Copy_TableData (Product) → 295 rows read, 295 rows written
└── Copy_TableData (ProductCategory) → 41 rows read, 41 rows written
Step 9: Verify the Output in ADLS Gen2
After the pipeline succeeds, check your ADLS Gen2 storage:
- Go to your Storage Account → Containers → database
- Navigate into the sqldb folder
- You should see five subfolders:
database/
└── sqldb/
    ├── Customer/
    │   └── part-00000-xxxx.csv (847 rows)
    ├── Address/
    │   └── part-00000-xxxx.csv (450 rows)
    ├── CustomerAddress/
    │   └── part-00000-xxxx.csv (417 rows)
    ├── Product/
    │   └── part-00000-xxxx.csv (295 rows)
    └── ProductCategory/
        └── part-00000-xxxx.csv (41 rows)
The file names are auto-generated by ADF. Each CSV file contains the complete table data with headers in the first row.
Understanding the Dynamic Expressions
Let me consolidate all the dynamic expressions used in this pipeline so you have a complete reference:
Lookup Output Expression
@activity('Lookup_Metadata').output.value
- @activity('Lookup_Metadata') — references the Lookup activity (name must match exactly)
- .output — the activity’s full output object
- .value — the array of rows (only exists when “First row only” is unchecked)
ForEach Item Expressions
@item().TableName → "Customer" (current iteration's table name)
@item().SchemaName → "SalesLT" (current iteration's schema)
@item().ContainerName → "database" (ADLS container name)
@item().FolderName → "sqldb/Customer" (ADLS folder path)
- @item() — the current element in the ForEach loop
- Property names are case-sensitive — @item().tablename won’t work if the column is TableName
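The case-sensitivity rule behaves like Python dictionary lookups, which makes it easy to remember (the dictionary below is just an example row):

```python
# One metadata row, as it arrives in @item()
item = {"TableName": "Customer", "SchemaName": "SalesLT"}

assert item["TableName"] == "Customer"  # exact case: resolves
assert item.get("tablename") is None    # wrong case: nothing found
```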
Dataset Parameter Expressions
@dataset().SchemaName → used in dataset Connection tab
@dataset().TableName → used in dataset Connection tab
@dataset().ContainerName → used in dataset Connection tab
@dataset().FolderName → used in dataset Connection tab
- @dataset() can only be used inside a dataset’s own configuration
- It references the parameters defined on that dataset
- Values are passed by the pipeline activity that uses the dataset
The BadRequest Null Error (And How to Fix It)
If you’ve worked with ADF, you’ve probably seen this error:
{
"code": "BadRequest",
"message": null,
"target": "pipeline//runid/03227a01-...",
"details": null,
"error": null
}
This is the most frustrating error in ADF — the message is literally null, giving you zero information about what went wrong.
The Top 5 Causes (In Order of Likelihood)
1. Stale dataset parameter defaults — You had a default value in a dataset parameter that referenced a pipeline variable you later deleted. Even though you’re not using the default, ADF still tries to resolve it. Fix: Open every dataset, go to Parameters tab, and clear all default values.
2. Trailing whitespace on names — An activity named "Copy_TableData " (with a trailing space) will cause this error, and it passes validation. Fix: Carefully retype activity, dataset, and parameter names.
3. Incorrect expression references — An expression like @activity('LookupMetadata').output.value when the activity is actually named Lookup_Metadata (with underscore). Fix: Copy-paste activity names into expressions to avoid typos.
4. Key Vault linked service UI bug — Editing a Key Vault-based linked service through the UI can inject stray characters. Fix: Edit the linked service JSON directly using the {} code button.
5. Nested ForEach loops — ADF does not support ForEach inside another ForEach. Fix: Use an Execute Pipeline activity to call a child pipeline for the inner loop.
How to Debug When the Message is Null
Use the binary chop method:
- Add a breakpoint after the Lookup activity → Debug → does it succeed?
- If yes, enable ForEach but replace the Copy with a Set Variable activity → Debug → does iteration work?
- If yes, add the Copy back → the problem is in the Copy activity’s dataset parameter mapping
- Check each dataset’s JSON by clicking the {} code button → look for stale references
Common Mistakes and How to Avoid Them
Based on my experience and common issues I’ve seen:
| Mistake | Impact | Prevention |
|---|---|---|
| Leaving “First row only” checked on Lookup | ForEach gets an object instead of an array — fails silently or copies only one table | Always uncheck for metadata-driven pipelines |
| Default values in dataset parameters | BadRequest null error at runtime | Never put values in dataset parameter defaults |
| Case mismatch in @item() properties | Runtime error — @item().tablename won’t find TableName | Match the exact column name case from your SQL table |
| Missing ADLS container | Runtime error — ADF cannot create containers | Always create the container manually before running |
| Not connecting Lookup to ForEach | ForEach runs before Lookup completes — empty or stale data | Drag the green success arrow from Lookup to ForEach |
| Using @dataset() in the pipeline | Error: “dataset function is not valid” | Use @dataset() only inside datasets. Use @item() in the pipeline. |
| Not clicking “Publish all” before Debug | Runs with old configuration | Always Publish before Debugging |
Production Best Practices
If you’re taking this pattern to production, consider these enhancements:
1. Add an IsActive Flag
ALTER TABLE metadata ADD IsActive BIT DEFAULT 1;
Then in your Lookup, use a query instead of the table:
SELECT * FROM metadata WHERE IsActive = 1
This lets you disable tables without deleting rows.
2. Add Audit Logging
Create a stored procedure that logs success/failure details:
CREATE PROCEDURE sp_LogAudit
@table_name VARCHAR(100),
@schema_name VARCHAR(100),
@rows_read INT,
@rows_copied INT,
@copy_duration INT,
@error_message VARCHAR(500)
AS
BEGIN
INSERT INTO audit (table_name, schema_name, rows_read, rows_copied,
copy_duration, error_message, load_date)
VALUES (@table_name, @schema_name, @rows_read, @rows_copied,
@copy_duration, @error_message, GETDATE());
END
I cover this in detail in my next post: Building a Synapse Pipeline with Audit Logging.
3. Use Parquet Instead of CSV
Parquet is better for analytics workloads:
- 70-90% smaller files (columnar compression)
- Faster query performance in Synapse, Databricks, and Spark
- Schema is embedded in the file — no guessing data types
- Supports nested data structures
Just change your sink dataset from DelimitedText to Parquet format.
4. Add Date-Based Folder Partitioning
Instead of writing all data to sqldb/Customer/, add date partitions:
@concat(item().FolderName, '/', formatDateTime(utcnow(), 'yyyy'), '/',
formatDateTime(utcnow(), 'MM'), '/', formatDateTime(utcnow(), 'dd'))
This creates:
sqldb/Customer/2026/04/05/part-00000.csv
Making it easy to query specific dates and manage data retention.
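If the ADF expression syntax feels opaque, here is what it computes, sketched in Python with `datetime` standing in for ADF's `utcnow()`/`formatDateTime()` (the function name is mine, not an ADF API):

```python
from datetime import datetime, timezone

def partitioned_path(folder_name, now=None):
    # mirrors @concat(item().FolderName, '/', yyyy, '/', MM, '/', dd)
    now = now or datetime.now(timezone.utc)
    return f"{folder_name}/{now:%Y}/{now:%m}/{now:%d}"

path = partitioned_path("sqldb/Customer",
                        datetime(2026, 4, 5, tzinfo=timezone.utc))
# path == "sqldb/Customer/2026/04/05"
```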
5. Add Error Handling
In the ForEach activity, you can handle individual table failures without stopping the entire pipeline:
- In the ForEach Settings tab, check “Continue on error”
- This means if the Customer table fails, the pipeline continues with Address, Product, etc.
- Combine with audit logging to track which tables succeeded and which failed
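The behavior you get from "Continue on error" plus audit logging looks like this in Python terms (the `copy_table` failure and the audit tuple layout are illustrative assumptions):

```python
def copy_table(name):
    # stand-in for one Copy activity; pretend Customer hits a source error
    if name == "Customer":
        raise RuntimeError("source timeout")
    return 100  # pretend rows copied

audit = []
for name in ["Customer", "Address", "Product"]:
    try:
        rows = copy_table(name)
        audit.append((name, "Succeeded", rows, None))
    except RuntimeError as err:
        # log the failure and keep going, like "Continue on error"
        audit.append((name, "Failed", 0, str(err)))
```

One table fails, the other two still land, and the audit trail records all three outcomes, which is exactly what you want from a nightly load.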
Interview Questions About This Pattern
Here are questions you might face in an Azure Data Engineering interview:
Q: What is a metadata-driven pipeline in ADF?
A: It’s a pipeline that reads its configuration (which tables to copy, where to write, what format to use) from a metadata table rather than having these details hardcoded. It uses Lookup to read the config, ForEach to iterate, and Copy to move data dynamically.
Q: Why use @item() instead of @dataset() in the Copy activity?
A: @item() references the current element in the ForEach loop and is used in the pipeline. @dataset() references the dataset’s own parameters and is used only inside the dataset’s configuration (Connection tab). The pipeline passes values to the dataset using @item(), and the dataset uses those values internally via @dataset().
Q: What happens if “First row only” is checked on the Lookup?
A: The Lookup returns a single object (output.firstRow) instead of an array (output.value). The ForEach activity expects an array, so it will fail or behave unexpectedly.
Q: How do you add a new table to the pipeline?
A: Insert a new row in the metadata table with the table name, schema, container, and folder path. No pipeline changes needed. The next time the pipeline runs, it picks up the new row automatically.
Q: How do you handle failures for individual tables?
A: Enable “Continue on error” on the ForEach activity. Combine with audit logging using Stored Procedure activities that capture error messages for failed tables.
Q: What’s the difference between ADF and Synapse Pipelines?
A: They use the same engine and almost identical UI. Synapse Pipelines are integrated into the Synapse workspace and can leverage Synapse-specific features (Spark pools, SQL pools, data flows). ADF is a standalone service. For pure ETL workloads, both work equally well.
What’s Next: Adding Audit Logging
The pipeline we built works great, but in production you need visibility into what happened during each run. In my next post, I’ll extend this pipeline in Azure Synapse Analytics to include:
- Parquet output instead of CSV (smaller, faster, better for analytics)
- Stored procedure audit logging that captures rows_read, rows_copied, copy_duration, and error_message for every table
- Success and failure tracking — the pipeline logs both successful copies and failures separately
- Incremental loading with a watermark pattern using a config table
Stay tuned for: Building a Metadata-Driven Pipeline in Azure Synapse Analytics with Audit Logging
Wrapping Up
You now have a production-quality, metadata-driven pipeline in Azure Data Factory that:
- Dynamically copies any number of tables from Azure SQL Database to ADLS Gen2
- Uses parameterized datasets for maximum reusability
- Can be extended with new tables by simply adding rows to a metadata table
- Follows the Lookup → ForEach → Copy pattern used in enterprise data platforms
This pattern is fundamental to modern data engineering on Azure. Whether you’re building a data lake, a warehouse loading process, or preparing for an interview — mastering this pattern will serve you well.
Related posts:
- Building a REST API with Python FastAPI on AWS Lambda
- Building a Synapse Pipeline with Audit Logging (coming soon)
- Incremental Data Loading with Delta Copy Pattern (coming soon)
- Top 15 Azure Data Factory Interview Questions (coming soon)
If this guide helped you, share it with a fellow data engineer. Have questions? Drop a comment below.
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.