Top 15 Azure Data Factory Interview Questions Every Data Engineer Should Know
Azure Data Factory (ADF) is one of the most in-demand skills for data engineering roles in 2026. Whether you’re interviewing for a Data Engineer, ETL Developer, or Cloud Data Architect position, you’ll almost certainly face ADF questions.
The problem is that most interview prep resources give you textbook answers that sound generic. Interviewers can tell when you’ve memorized a definition versus when you’ve actually built pipelines in production.
In this post, I’m sharing 15 real ADF interview questions with detailed answers based on hands-on experience. These aren’t theoretical — they’re the exact questions I’ve encountered and the kind of answers that demonstrate real understanding.
For each question, I’ll give you the concise answer (what to say in the interview) and the deep explanation (what to study to truly understand the topic).
Table of Contents
- Beginner Level (Questions 1-5)
- Intermediate Level (Questions 6-10)
- Advanced Level (Questions 11-15)
- Bonus: Questions to Ask the Interviewer
- Wrapping Up
Beginner Level
Question 1: What is Azure Data Factory and when would you use it?
Concise answer:
Azure Data Factory is a cloud-based ETL/ELT service that lets you create data pipelines to move and transform data between different sources and destinations. It’s a fully managed, serverless service — you don’t manage any infrastructure.
Deep explanation:
ADF is Microsoft’s answer to the data integration problem: how do you get data from Point A to Point B reliably, at scale, and on a schedule?
You’d use ADF when you need to:
- Ingest data from on-premises databases, SaaS applications, or files into a cloud data lake or warehouse
- Orchestrate complex data workflows with dependencies, retries, and error handling
- Move data between Azure services (SQL Database → ADLS Gen2 → Synapse SQL Pool)
- Schedule recurring data pipelines (daily loads, hourly refreshes)
ADF is NOT a transformation engine by itself. It orchestrates data movement and can call external transformation services (Databricks, Synapse Spark, SQL stored procedures), but the heavy data transformation usually happens outside ADF.
Key components to mention:
- Pipelines — the workflow containers
- Activities — individual tasks inside a pipeline (Copy, Lookup, ForEach, etc.)
- Datasets — pointers to your data (table, file, blob)
- Linked Services — connection strings to data stores
- Triggers — what starts the pipeline (schedule, event, manual)
- Integration Runtime — the compute that actually moves the data
Question 2: What is the difference between ADF and Synapse Pipelines?
Concise answer:
They use the same pipeline engine and nearly identical UI. Synapse Pipelines are embedded inside the Synapse Analytics workspace and can integrate directly with Synapse SQL pools and Spark pools. ADF is a standalone service focused purely on data integration.
Deep explanation:
| Aspect | Azure Data Factory | Synapse Pipelines |
|---|---|---|
| Deployment | Standalone service | Inside Synapse workspace |
| Pipeline engine | Same | Same |
| Expression language | Same | Same |
| UI navigation | Author tab | Integrate tab |
| Default ADLS linked service | Must create manually | Auto-created with workspace |
| Spark integration | Requires Databricks | Built-in Spark pools |
| SQL pool integration | Not available | Native support |
| Pricing | Per activity run + data movement | Same pricing model |
When to use which:
- Use ADF when you only need data movement/orchestration and don’t need Synapse-specific features
- Use Synapse Pipelines when you’re already in the Synapse ecosystem and want everything in one workspace
Interview tip: Many interviewers ask this to see if you understand that the underlying engine is the same. Saying “they’re the same thing but Synapse embeds it in the analytics workspace” shows you know both.
Question 3: What are the different types of activities in ADF?
Concise answer:
ADF has three categories of activities: Data Movement (Copy), Data Transformation (Data Flow, Databricks, Stored Procedure, HDInsight), and Control Flow (ForEach, If Condition, Lookup, Set Variable, Wait, Execute Pipeline, Web).
Deep explanation:
Data Movement Activities:
- Copy Activity — the workhorse of ADF. Copies data between 90+ supported sources and sinks. Supports schema mapping, type conversion, fault tolerance, and parallel copying.
Data Transformation Activities:
- Data Flow — visual data transformation using a Spark-based engine (no code required)
- Databricks Notebook/Jar/Python — runs transformation logic in Azure Databricks
- Stored Procedure — executes a SQL stored procedure
- HDInsight activities — runs Hive, Pig, MapReduce, Spark on HDInsight clusters
Control Flow Activities:
- Lookup — reads data and passes it to downstream activities
- ForEach — iterates over an array and executes child activities for each item
- If Condition — branches pipeline execution based on a boolean expression
- Set Variable — sets a pipeline variable value
- Execute Pipeline — calls another pipeline (useful for modularity)
- Wait — pauses execution for a specified duration
- Web — calls an external REST API
- Validation — checks if a file exists before proceeding
What interviewers really want to hear: That you know the difference between Copy (data movement), Data Flow (transformation), and control activities (orchestration). Bonus points for mentioning you’ve used Lookup + ForEach + Copy together in a metadata-driven pattern.
Question 4: What is a Linked Service and how is it different from a Dataset?
Concise answer:
A Linked Service is the connection string — it defines HOW to connect to a data store (server address, credentials, authentication method). A Dataset is the data reference — it defines WHAT data to read or write (which table, which file, which folder).
Deep explanation:
Think of it like a database connection in application code:
```python
# Linked Service = the connection (HOW to connect)
connection = connect(server="myserver.database.windows.net",
                     database="mydb",
                     username="admin",
                     password="secret")

# Dataset = what to read (WHAT data)
data = connection.query("SELECT * FROM SalesLT.Customer")
```
Multiple datasets can share the same linked service. For example, you might have one linked service for your Azure SQL Database, but three datasets pointing to different tables (Customer, Address, Product).
Key relationships:
Linked Service: LS_AzureSqlDB
├── Dataset: DS_Customer (table: SalesLT.Customer)
├── Dataset: DS_Address (table: SalesLT.Address)
└── Dataset: DS_Product (table: SalesLT.Product)
Interview tip: Mention that in parameterized datasets, you can make the table name dynamic — so one dataset handles multiple tables. This shows you understand metadata-driven patterns.
Question 5: What is Integration Runtime (IR) and what are the types?
Concise answer:
Integration Runtime is the compute infrastructure that ADF uses to actually move and transform data. There are three types: Azure IR (for cloud-to-cloud), Self-hosted IR (for on-premises or private network access), and Azure-SSIS IR (for running legacy SSIS packages).
Deep explanation:
Azure Integration Runtime (default):
- Fully managed by Microsoft
- Used for cloud-to-cloud data movement (Azure SQL → ADLS, S3 → Blob Storage)
- Auto-resolves the region for best performance
- Supports Data Flows (Spark-based transformations)

Self-hosted Integration Runtime:
- Installed on your on-premises machine or VM
- Required when accessing data behind a firewall or VPN (on-prem SQL Server, file shares)
- You manage the installation, updates, and high availability
- Acts as a bridge between your private network and Azure

Azure-SSIS Integration Runtime:
- Runs SQL Server Integration Services (SSIS) packages in the cloud
- Used for lift-and-shift migration of existing SSIS workloads
- Provisions a dedicated cluster of VMs
Interview tip: The most common follow-up is “when would you use a Self-hosted IR?” Answer: whenever the source data is behind a corporate firewall or in a private network that ADF’s Azure IR can’t reach directly.
Intermediate Level
Question 6: What is a metadata-driven pipeline and how do you build one?
Concise answer:
A metadata-driven pipeline reads its configuration (which tables to copy, where to write them) from a database table instead of hardcoding these details. It uses a Lookup activity to read the config, a ForEach activity to iterate, and a Copy activity to move data dynamically.
Deep explanation:
Instead of building a separate pipeline for each table:
Pipeline_Customer → copies Customer
Pipeline_Address → copies Address
Pipeline_Product → copies Product
You build one pipeline that reads a metadata table:
SELECT TableName, SchemaName, ContainerName, FolderName
FROM metadata;
And dynamically copies each table using parameterized datasets with @item() expressions.
The pattern:
Lookup (read metadata) → ForEach (iterate rows) → Copy (dynamic source + sink)
Key expressions:
ForEach items: @activity('Lookup').output.value
Copy source: @item().SchemaName, @item().TableName
Copy sink: @item().ContainerName, @item().FolderName
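The Lookup → ForEach → Copy pattern above can be sketched in plain Python. This is an analogy, not an ADF API: `lookup_metadata` stands in for the Lookup activity's output and `copy_table` for the parameterized Copy activity; the metadata rows shown are hypothetical.

```python
def lookup_metadata():
    # Simulates: @activity('Lookup').output.value
    return [
        {"SchemaName": "SalesLT", "TableName": "Customer",
         "ContainerName": "database", "FolderName": "customer"},
        {"SchemaName": "SalesLT", "TableName": "Address",
         "ContainerName": "database", "FolderName": "address"},
    ]

def copy_table(item):
    # Simulates the Copy activity with its @item() expressions resolved
    source = f"{item['SchemaName']}.{item['TableName']}"          # @item().SchemaName, @item().TableName
    sink = f"{item['ContainerName']}/{item['FolderName']}"        # @item().ContainerName, @item().FolderName
    return f"{source} -> {sink}"

# ForEach: iterate the Lookup output and run Copy once per row
results = [copy_table(item) for item in lookup_metadata()]
```

Adding a new table to the pipeline is now a metadata insert, not a pipeline change, which is the whole point of the pattern.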
Why interviewers love this question: It tests whether you can build reusable, scalable pipelines rather than one-off copy jobs. Being able to walk through this pattern confidently is a strong signal.
I’ve written a complete guide on this: Building a Metadata-Driven Pipeline in Azure Data Factory
Question 7: What is the difference between @item(), @dataset(), and @activity()?
Concise answer:
@item() references the current element in a ForEach loop and is used in pipeline activities. @dataset() references dataset parameters and is used only inside a dataset’s configuration. @activity() references another activity’s output and is used in pipeline activities.
Deep explanation:
| Expression | Where to Use | What It Does | Example |
|---|---|---|---|
| @item() | Pipeline (inside ForEach) | Current loop element | @item().TableName |
| @dataset() | Dataset (Connection tab) | Dataset’s own parameters | @dataset().SchemaName |
| @activity() | Pipeline | Another activity’s output | @activity('Lookup').output.value |
| @pipeline() | Pipeline | Pipeline-level properties | @pipeline().RunId |
The most common mistake: Using @dataset() in a pipeline activity. This causes the error: “The template function ‘dataset’ is not defined.”
Think of @dataset() like a function parameter and @item() like the argument you pass:
```python
# Dataset = function definition; @dataset() reads its own parameters
def copy_table(SchemaName, TableName):          # @dataset().SchemaName
    read_from(SchemaName + "." + TableName)

# Pipeline = function call; @item() supplies the arguments
for row in metadata:                            # ForEach
    copy_table(row.SchemaName, row.TableName)   # @item().SchemaName
```
Question 8: How do you implement incremental loading in ADF?
Concise answer:
Use a watermark pattern with a config table that stores the last loaded value for each source table. The pipeline reads the current MAX value, copies only rows WHERE delta_column > last_loaded_value, and then updates the watermark after a successful copy.
Deep explanation:
The watermark pattern involves:
- A config table with LastLoadedValue and DeltaColumnName columns
- A Lookup to get the current MAX value of the delta column
- A Copy with a dynamic WHERE clause: WHERE DeltaCol > LastLoadedValue AND DeltaCol <= MaxValue
- A Stored Procedure to update the watermark after a successful copy
Two watermark types:
- ID-based: Track the last loaded EMPID (catches inserts only)
- Date-based: Track the last loaded timestamp (catches inserts AND updates)
Key expressions:
Delta query:
@concat('SELECT * FROM ', item().SchemaName, '.', item().TableName,
' WHERE ', item().DeltaColumnName, ' > ''', item().LastLoadedValue, '''')
Watermark update:
@activity('Lookup_MaxValue').output.firstRow.MaxValue
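The delta query that the @concat expression assembles can be reproduced in Python to make the string-building explicit. The metadata row below is hypothetical, and the upper bound mirrors the WHERE clause from the pattern description above:

```python
def build_delta_query(item, max_value):
    # Produces: SELECT * FROM schema.table
    #           WHERE delta_col > 'last_loaded' AND delta_col <= 'max'
    return (
        f"SELECT * FROM {item['SchemaName']}.{item['TableName']} "
        f"WHERE {item['DeltaColumnName']} > '{item['LastLoadedValue']}' "
        f"AND {item['DeltaColumnName']} <= '{max_value}'"
    )

row = {"SchemaName": "SalesLT", "TableName": "Customer",
       "DeltaColumnName": "ModifiedDate",
       "LastLoadedValue": "2025-01-01"}
query = build_delta_query(row, "2025-01-02")
```

Capping the query at the MAX value captured by the Lookup (rather than "everything newer than the watermark") keeps the copied range and the watermark update consistent even if new rows arrive mid-run.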
Why this matters: Full loads are fine for small tables, but reloading a 100-million-row table can take hours. An incremental load copies only the changes (maybe 50K rows) in minutes.
Full guide: Incremental Data Loading with Delta Copy
Question 9: How do you handle errors and logging in ADF pipelines?
Concise answer:
Use dependency conditions (green arrow for success, red arrow for failure) to branch execution. Add Stored Procedure activities after the Copy activity — one on the success path that logs rows_read, rows_copied, and copy_duration, and one on the failure path that logs the error message. Both write to an audit table.
Deep explanation:
Dependency conditions:
| Condition | Arrow Color | When It Runs |
|---|---|---|
| Succeeded | Green | Only when parent succeeds |
| Failed | Red | Only when parent fails |
| Completed | Blue | Always (success or failure) |
| Skipped | Gray | Only when parent is skipped |
Success logging expressions:
@activity('Copy').output.rowsRead → rows read from source
@activity('Copy').output.rowsCopied → rows written to sink
@activity('Copy').output.copyDuration → time in seconds
@item().TableName → which table
Failure logging:
@concat('Copy failed for ', item().SchemaName, '.', item().TableName)
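The success and failure log rows can be sketched in Python using a simulated Copy activity output. The field names match the expressions above; the audit-row shape itself is a hypothetical example, since the real schema lives in your audit table:

```python
def success_log(item, copy_output):
    # Row written by the success-path Stored Procedure
    return {
        "TableName": f"{item['SchemaName']}.{item['TableName']}",  # @item()
        "RowsRead": copy_output["rowsRead"],         # @activity('Copy').output.rowsRead
        "RowsCopied": copy_output["rowsCopied"],     # @activity('Copy').output.rowsCopied
        "DurationSec": copy_output["copyDuration"],  # @activity('Copy').output.copyDuration
        "Status": "Succeeded",
    }

def failure_log(item, error_message):
    # Row written by the failure-path Stored Procedure
    return {
        "TableName": f"{item['SchemaName']}.{item['TableName']}",
        "ErrorMessage": error_message,
        "Status": "Failed",
    }

item = {"SchemaName": "SalesLT", "TableName": "Customer"}
ok = success_log(item, {"rowsRead": 847, "rowsCopied": 847, "copyDuration": 12})
```

Comparing RowsRead against RowsCopied in the audit table is a cheap reconciliation check: a mismatch usually means rows were skipped by fault tolerance settings.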
For ForEach failures: Enable “Continue on error” so that if one table fails, the others still get processed. Without this, the entire ForEach stops on the first failure.
Full guide: Synapse Pipeline with Audit Logging
Question 10: What are Triggers in ADF and what types are available?
Concise answer:
Triggers define when a pipeline runs. There are three types: Schedule triggers (time-based — run daily at 2 AM), Tumbling Window triggers (time-based with state tracking — processes each time window exactly once), and Event triggers (react to file creation/deletion in Blob Storage or ADLS Gen2).
Deep explanation:
Schedule Trigger:
- Runs on a cron-like schedule (daily, hourly, every 15 minutes)
- Simplest type — most commonly used
- Example: Run the daily ETL pipeline at 2:00 AM UTC

Tumbling Window Trigger:
- Similar to Schedule but tracks state per time window
- If a window is missed (pipeline was paused), it catches up on the missed windows
- Supports dependencies between windows (wait for previous window to complete)
- Example: Process each hour’s data independently — if the 3 PM window fails, 4 PM still runs, and 3 PM is retried

Event Trigger:
- Fires when a file is created or deleted in Blob Storage or ADLS Gen2
- Can filter by folder path and file name pattern
- Example: When a CSV file is uploaded to /incoming/, trigger the ingestion pipeline
Interview tip: Mention that Tumbling Window is preferred over Schedule for data pipelines because it guarantees exactly-once processing per time window and handles backfills automatically.
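The key difference is easy to see if you model windows explicitly: a tumbling window trigger divides time into discrete, non-overlapping [start, end) intervals, and each interval gets exactly one tracked run. The helper below is an illustration, not an ADF API:

```python
from datetime import datetime, timedelta

def tumbling_windows(start, end, interval_hours=1):
    # Generate non-overlapping [start, end) windows covering the range
    windows = []
    cursor = start
    step = timedelta(hours=interval_hours)
    while cursor < end:
        windows.append((cursor, cursor + step))
        cursor += step
    return windows

# If the trigger was paused from 14:00 to 17:00, the catch-up behavior
# means one pipeline execution per missed window:
missed = tumbling_windows(datetime(2026, 1, 1, 14), datetime(2026, 1, 1, 17))
```

A plain Schedule trigger has no such window state: a missed 3 PM run is simply gone, which is why tumbling windows are the safer choice for backfills.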
Advanced Level
Question 11: How do you parameterize pipelines and datasets for reusability?
Concise answer:
Add parameters to datasets (for dynamic table names and file paths) and pipelines (for passing configuration values). Use @dataset() in the dataset connection tab and @pipeline().parameters in pipeline activities. For metadata-driven patterns, use @item() inside ForEach to pass values from the metadata table to parameterized datasets.
Deep explanation:
Dataset parameters (most common):
Dataset: DS_SqlDB_SourceTable
Parameters:
- SchemaName (String)
- TableName (String)
Connection tab:
- Schema: @dataset().SchemaName
- Table: @dataset().TableName
Pipeline parameters:
Pipeline: PL_Copy_Data
Parameters:
- SourceSchema (String, default: "SalesLT")
- TargetContainer (String, default: "database")
Copy activity:
- Source dataset SchemaName: @pipeline().parameters.SourceSchema
The critical rule: Dataset parameter default values must be EMPTY. Putting expressions or placeholder text in defaults is the #1 cause of the BadRequest null error.
Question 12: What is the difference between Copy Activity and Data Flow?
Concise answer:
Copy Activity moves data as-is from source to sink — it’s fast, simple, and handles schema mapping but doesn’t transform data. Data Flow is a visual transformation tool powered by Spark that can do joins, aggregations, pivots, derived columns, and complex transformations.
Deep explanation:
| Feature | Copy Activity | Data Flow |
|---|---|---|
| Purpose | Move data from A to B | Transform data (ETL) |
| Engine | Optimized native copy engine | Apache Spark (managed) |
| Transformations | Column mapping, type conversion only | Joins, aggregates, window functions, pivots, derived columns |
| Speed for simple copy | Fast (direct copy) | Slower (Spark startup overhead) |
| Cost | Cheaper (DIU-based billing) | More expensive (Spark cluster billing) |
| Code required | No | No (visual drag-and-drop) |
| When to use | Moving data without transformation | Cleaning, joining, aggregating, reshaping data |
Interview tip: Say “I use Copy Activity for ingestion (landing raw data) and Data Flow or Databricks for transformation. This follows the ELT pattern — Extract and Load first, then Transform in the target system.”
Question 13: How would you migrate on-premises data to Azure using ADF?
Concise answer:
Install a Self-hosted Integration Runtime on the on-premises network, create a linked service that uses this IR to connect to the on-prem database, and build a Copy activity pipeline that reads from the on-prem source and writes to an Azure destination (ADLS Gen2, Azure SQL, Synapse).
Deep explanation:
Step-by-step approach:
- Install Self-hosted IR on a Windows machine inside the corporate network
- Register the IR with your ADF workspace using the authentication key
- Create a Linked Service for the on-prem database (e.g., SQL Server) using the Self-hosted IR
- Create source dataset pointing to the on-prem tables
- Create sink dataset pointing to Azure (ADLS Gen2, Azure SQL)
- Build the pipeline with Copy activities
- Test connectivity and run a debug execution
- Set up triggers for scheduled execution
Key considerations:
- The Self-hosted IR machine needs network access to both the on-prem database AND the internet (to communicate with ADF)
- For high availability, install the IR on multiple nodes
- For large data volumes, consider using AzCopy or Azure Data Box for the initial bulk migration, then ADF for ongoing incremental syncs
Question 14: How do you handle schema drift in ADF?
Concise answer:
Schema drift is when the source data structure changes (new columns added, columns removed, data types changed) without updating the pipeline. ADF handles this through the Copy activity’s “Allow schema drift” option and Data Flow’s built-in schema drift capabilities.
Deep explanation:
In Copy Activity:
- Enable “Column mapping” with dynamic schema detection
- ADF can auto-map columns by name when source and sink have matching column names
- For Parquet/JSON sinks, new columns are added automatically
- For SQL sinks, you may need to handle schema changes manually or use stored procedures

In Data Flow:
- Built-in schema drift setting allows processing columns that aren’t in the defined schema
- Use derived column patterns to handle unknown columns
- Column pattern matching lets you apply transformations based on column name patterns instead of specific names
Interview tip: Mention that Parquet and JSON sinks are more forgiving of schema drift than SQL sinks because they don’t enforce a rigid table structure. This is one reason data lakes (schema-on-read) are preferred for raw ingestion.
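The schema-on-read behavior described above can be demonstrated in a few lines: each record keeps its own columns, and the effective schema is simply the union of every column seen so far, which is roughly what Parquet/JSON sinks give you for free. The sample batches are hypothetical:

```python
# Day 1 batch has two columns; day 2 drifts by adding a new one
day1 = [{"id": 1, "name": "Ada"}]
day2 = [{"id": 2, "name": "Grace", "email": "g@example.com"}]

def effective_schema(batches):
    # Union of all column names across every row in every batch
    cols = set()
    for batch in batches:
        for row in batch:
            cols.update(row.keys())
    return sorted(cols)

schema = effective_schema([day1, day2])
```

A SQL sink, by contrast, would reject day 2's rows (or silently drop the new column) until someone runs an ALTER TABLE, which is exactly why raw ingestion usually lands in the lake first.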
Question 15: How do you monitor and optimize ADF pipeline performance?
Concise answer:
Use the Monitor tab for real-time and historical run data, set up alerts for failures, check Copy activity throughput and DIU utilization, and optimize by tuning parallelism, DIU allocation, partition settings, and choosing the right Integration Runtime.
Deep explanation:
Monitoring:
- Monitor tab — shows all pipeline runs with status, duration, and activity-level details
- Azure Monitor — integrates ADF metrics (pipeline runs, activity runs, trigger runs) with Azure alerting
- Log Analytics — stores detailed diagnostic logs for trend analysis
- Custom audit tables — build your own logging with Stored Procedure activities
Performance optimization:
- Increase DIUs (Data Integration Units) — each DIU adds more compute for the Copy activity. Default is “Auto” (4 DIU). For large datasets, try 16-32 DIU.
- Enable parallel copy — set parallelCopies in the Copy activity for partitioned sources. This reads multiple partitions simultaneously.
- Use staging — for cross-region or on-prem copies, staging through Azure Blob can be faster than direct copy.
- Choose the right IR region — place the Azure IR close to both source and sink for minimum latency.
- Partition your source — if your table has a numeric or date partition key, ADF can read partitions in parallel.
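The tuning decision can be reduced to a simple heuristic: compute throughput from the Copy activity output, then decide whether more DIUs would help. The thresholds below are illustrative assumptions, not ADF defaults, and the `dataRead` / `copyDuration` fields mirror the Copy activity's output JSON:

```python
def tuning_hint(copy_output, diu_utilization_pct):
    # Throughput in MB/s from bytes read and duration in seconds
    mb_per_sec = copy_output["dataRead"] / (1024 * 1024) / copy_output["copyDuration"]
    if mb_per_sec < 10 and diu_utilization_pct < 100:
        advice = "increase DIUs"
    elif diu_utilization_pct >= 100:
        advice = "optimize the source query or partition the source"
    else:
        advice = "throughput looks healthy"
    return round(mb_per_sec, 1), advice

# 500 MB copied in 120 seconds at 80% DIU utilization
rate, advice = tuning_hint({"dataRead": 500 * 1024 * 1024, "copyDuration": 120}, 80)
```

The branch structure encodes the rule of thumb in the interview tip that follows: low throughput with spare DIU headroom means scale up; saturated DIUs mean the bottleneck is elsewhere.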
Interview tip: Mention specific metrics: “I typically monitor Copy throughput (MB/s), DIU utilization percentage, and pipeline duration trends over time. If throughput is low, I increase DIUs. If DIU utilization is already at 100%, I optimize the source query instead.”
Bonus: Questions to Ask the Interviewer
These show you understand production data engineering, not just ADF features:
- “What’s the typical data volume your pipelines handle daily?” — shows you think about scale
- “Do you use ADF for orchestration only, or also for transformations?” — shows you understand ELT vs ETL
- “How do you handle pipeline failures and retries?” — shows you think about reliability
- “Do you have a CI/CD process for ADF pipelines?” — shows you understand DevOps for data
- “Are you using incremental loading or full loads for most tables?” — shows you understand performance patterns
Wrapping Up
Azure Data Factory interviews aren’t just about knowing what each activity does — they’re about demonstrating that you’ve built real pipelines, debugged real errors, and understand production patterns.
The questions above cover the full range from basic concepts to advanced architecture. If you can walk through a metadata-driven pipeline, explain incremental loading, discuss error handling, and talk about performance optimization — you’ll stand out from candidates who only know textbook definitions.
Study these posts for deeper understanding:
- Metadata-Driven Pipeline in Azure Data Factory
- Synapse Pipeline with Audit Logging
- Incremental Data Loading with Delta Copy
- Common ADF/Synapse Pipeline Errors
Good luck with your interview! If you have questions or want to share your experience, drop a comment below.
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.