Top 15 Azure Data Factory Interview Questions Every Data Engineer Should Know

Azure Data Factory (ADF) is one of the most in-demand skills for data engineering roles in 2026. Whether you’re interviewing for a Data Engineer, ETL Developer, or Cloud Data Architect position, you’ll almost certainly face ADF questions.

The problem is that most interview prep resources give you textbook answers that sound generic. Interviewers can tell when you’ve memorized a definition versus when you’ve actually built pipelines in production.

In this post, I’m sharing 15 real ADF interview questions with detailed answers based on hands-on experience. These aren’t theoretical — they’re the exact questions I’ve encountered and the kind of answers that demonstrate real understanding.

For each question, I’ll give you the concise answer (what to say in the interview) and the deep explanation (what to study to truly understand the topic).

Table of Contents

  • Beginner Level (Questions 1-5)
  • Intermediate Level (Questions 6-10)
  • Advanced Level (Questions 11-15)
  • Bonus: Questions to Ask the Interviewer
  • Wrapping Up

Beginner Level

Question 1: What is Azure Data Factory and when would you use it?

Concise answer:

Azure Data Factory is a cloud-based ETL/ELT service that lets you create data pipelines to move and transform data between different sources and destinations. It’s a fully managed, serverless service — you don’t manage any infrastructure.

Deep explanation:

ADF is Microsoft’s answer to the data integration problem: how do you get data from Point A to Point B reliably, at scale, and on a schedule?

You’d use ADF when you need to:

  • Ingest data from on-premises databases, SaaS applications, or files into a cloud data lake or warehouse
  • Orchestrate complex data workflows with dependencies, retries, and error handling
  • Move data between Azure services (SQL Database → ADLS Gen2 → Synapse SQL Pool)
  • Schedule recurring data pipelines (daily loads, hourly refreshes)

ADF is NOT a transformation engine by itself. It orchestrates data movement and can call external transformation services (Databricks, Synapse Spark, SQL stored procedures), but the heavy data transformation usually happens outside ADF.

Key components to mention:

  • Pipelines — the workflow containers
  • Activities — individual tasks inside a pipeline (Copy, Lookup, ForEach, etc.)
  • Datasets — pointers to your data (table, file, blob)
  • Linked Services — connection strings to data stores
  • Triggers — what starts the pipeline (schedule, event, manual)
  • Integration Runtime — the compute that actually moves the data
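
To make these components concrete, here is a hedged sketch of the JSON that ADF generates behind the authoring UI for a one-activity pipeline. The names (PL_Copy_Customer, DS_Customer, DS_Customer_Raw) are hypothetical, and the exact property layout can vary by connector:

```json
{
  "name": "PL_Copy_Customer",
  "properties": {
    "activities": [
      {
        "name": "Copy_Customer",
        "type": "Copy",
        "inputs":  [ { "referenceName": "DS_Customer",     "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "DS_Customer_Raw", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "AzureSqlSource" },
          "sink":   { "type": "ParquetSink" }
        }
      }
    ]
  }
}
```

Being able to read this JSON is useful in interviews: it shows the pipeline is a container, the activity is a task, and the datasets are references, exactly as the component list describes.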

Question 2: What is the difference between ADF and Synapse Pipelines?

Concise answer:

They use the same pipeline engine and nearly identical UI. Synapse Pipelines are embedded inside the Synapse Analytics workspace and can integrate directly with Synapse SQL pools and Spark pools. ADF is a standalone service focused purely on data integration.

Deep explanation:

| Aspect | Azure Data Factory | Synapse Pipelines |
|---|---|---|
| Deployment | Standalone service | Inside Synapse workspace |
| Pipeline engine | Same | Same |
| Expression language | Same | Same |
| UI navigation | Author tab | Integrate tab |
| Default ADLS linked service | Must create manually | Auto-created with workspace |
| Spark integration | Requires Databricks | Built-in Spark pools |
| SQL pool integration | Not available | Native support |
| Pricing | Per activity run + data movement | Same pricing model |

When to use which:

  • Use ADF when you only need data movement/orchestration and don’t need Synapse-specific features
  • Use Synapse Pipelines when you’re already in the Synapse ecosystem and want everything in one workspace

Interview tip: Many interviewers ask this to see if you understand that the underlying engine is the same. Saying “they’re the same thing but Synapse embeds it in the analytics workspace” shows you know both.

Question 3: What are the different types of activities in ADF?

Concise answer:

ADF has three categories of activities: Data Movement (Copy), Data Transformation (Data Flow, Databricks, Stored Procedure, HDInsight), and Control Flow (ForEach, If Condition, Lookup, Set Variable, Wait, Execute Pipeline, Web).

Deep explanation:

Data Movement Activities:

  • Copy Activity — the workhorse of ADF. Copies data between 90+ supported sources and sinks. Supports schema mapping, type conversion, fault tolerance, and parallel copying.

Data Transformation Activities:

  • Data Flow — visual data transformation using a Spark-based engine (no code required)
  • Databricks Notebook/Jar/Python — runs transformation logic in Azure Databricks
  • Stored Procedure — executes a SQL stored procedure
  • HDInsight activities — runs Hive, Pig, MapReduce, Spark on HDInsight clusters

Control Flow Activities:

  • Lookup — reads data and passes it to downstream activities
  • ForEach — iterates over an array and executes child activities for each item
  • If Condition — branches pipeline execution based on a boolean expression
  • Set Variable — sets a pipeline variable value
  • Execute Pipeline — calls another pipeline (useful for modularity)
  • Wait — pauses execution for a specified duration
  • Web — calls an external REST API
  • Validation — checks if a file exists before proceeding

What interviewers really want to hear: That you know the difference between Copy (data movement), Data Flow (transformation), and control activities (orchestration). Bonus points for mentioning you’ve used Lookup + ForEach + Copy together in a metadata-driven pattern.

Question 4: What is a Linked Service and how is it different from a Dataset?

Concise answer:

A Linked Service is the connection string — it defines HOW to connect to a data store (server address, credentials, authentication method). A Dataset is the data reference — it defines WHAT data to read or write (which table, which file, which folder).

Deep explanation:

Think of it like a database connection in application code:

# Linked Service = the connection
connection = connect(server="myserver.database.windows.net",
                     database="mydb",
                     username="admin",
                     password="secret")

# Dataset = what to read
data = connection.query("SELECT * FROM SalesLT.Customer")

Multiple datasets can share the same linked service. For example, you might have one linked service for your Azure SQL Database, but three datasets pointing to different tables (Customer, Address, Product).

Key relationships:

Linked Service: LS_AzureSqlDB
    ├── Dataset: DS_Customer (table: SalesLT.Customer)
    ├── Dataset: DS_Address (table: SalesLT.Address)
    └── Dataset: DS_Product (table: SalesLT.Product)
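
Under the hood, a dataset's JSON points at its linked service by name, which is exactly how three datasets can share one connection. A rough sketch (field layout may differ slightly by connector version):

```json
{
  "name": "DS_Customer",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": {
      "referenceName": "LS_AzureSqlDB",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "schema": "SalesLT",
      "table": "Customer"
    }
  }
}
```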

Interview tip: Mention that in parameterized datasets, you can make the table name dynamic — so one dataset handles multiple tables. This shows you understand metadata-driven patterns.

Question 5: What is Integration Runtime (IR) and what are the types?

Concise answer:

Integration Runtime is the compute infrastructure that ADF uses to actually move and transform data. There are three types: Azure IR (for cloud-to-cloud), Self-hosted IR (for on-premises or private network access), and Azure-SSIS IR (for running legacy SSIS packages).

Deep explanation:

Azure Integration Runtime (default):

  • Fully managed by Microsoft
  • Used for cloud-to-cloud data movement (Azure SQL → ADLS, S3 → Blob Storage)
  • Auto-resolves the region for best performance
  • Supports Data Flows (Spark-based transformations)

Self-hosted Integration Runtime:

  • Installed on your on-premises machine or VM
  • Required when accessing data behind a firewall or VPN (on-prem SQL Server, file shares)
  • You manage the installation, updates, and high availability
  • Acts as a bridge between your private network and Azure

Azure-SSIS Integration Runtime:

  • Runs SQL Server Integration Services (SSIS) packages in the cloud
  • Used for lift-and-shift migration of existing SSIS workloads
  • Provisions a dedicated cluster of VMs

Interview tip: The most common follow-up is “when would you use a Self-hosted IR?” Answer: whenever the source data is behind a corporate firewall or in a private network that ADF’s Azure IR can’t reach directly.

Intermediate Level

Question 6: What is a metadata-driven pipeline and how do you build one?

Concise answer:

A metadata-driven pipeline reads its configuration (which tables to copy, where to write them) from a database table instead of hardcoding these details. It uses a Lookup activity to read the config, a ForEach activity to iterate, and a Copy activity to move data dynamically.

Deep explanation:

Instead of building a separate pipeline for each table:

Pipeline_Customer  → copies Customer
Pipeline_Address   → copies Address
Pipeline_Product   → copies Product

You build one pipeline that reads a metadata table:

SELECT TableName, SchemaName, ContainerName, FolderName
FROM metadata;

And dynamically copies each table using parameterized datasets with @item() expressions.
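
The metadata table itself is just a plain control table. A hedged T-SQL sketch, with column names taken from the query above and sample rows that are purely illustrative:

```sql
CREATE TABLE dbo.metadata (
    SchemaName    NVARCHAR(128) NOT NULL,
    TableName     NVARCHAR(128) NOT NULL,
    ContainerName NVARCHAR(128) NOT NULL,
    FolderName    NVARCHAR(256) NOT NULL
);

-- One row per table to copy; the Lookup activity returns these rows as an array
INSERT INTO dbo.metadata VALUES
    ('SalesLT', 'Customer', 'database', 'customer'),
    ('SalesLT', 'Address',  'database', 'address'),
    ('SalesLT', 'Product',  'database', 'product');
```

Adding a table to the pipeline then becomes an INSERT into this table, with no pipeline redeployment.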

The pattern:

Lookup (read metadata) → ForEach (iterate rows) → Copy (dynamic source + sink)

Key expressions:

ForEach items:  @activity('Lookup').output.value
Copy source:    @item().SchemaName, @item().TableName
Copy sink:      @item().ContainerName, @item().FolderName

Why interviewers love this question: It tests whether you can build reusable, scalable pipelines rather than one-off copy jobs. Being able to walk through this pattern confidently is a strong signal.

I’ve written a complete guide on this: Building a Metadata-Driven Pipeline in Azure Data Factory

Question 7: What is the difference between @item(), @dataset(), and @activity()?

Concise answer:

@item() references the current element in a ForEach loop and is used in pipeline activities. @dataset() references dataset parameters and is used only inside a dataset’s configuration. @activity() references another activity’s output and is used in pipeline activities.

Deep explanation:

| Expression | Where to Use | What It Does | Example |
|---|---|---|---|
| @item() | Pipeline (inside ForEach) | Current loop element | @item().TableName |
| @dataset() | Dataset (Connection tab) | Dataset's own parameters | @dataset().SchemaName |
| @activity() | Pipeline | Another activity's output | @activity('Lookup').output.value |
| @pipeline() | Pipeline | Pipeline-level properties | @pipeline().RunId |

The most common mistake: Using @dataset() in a pipeline activity. This causes the error: “The template function ‘dataset’ is not defined.”

Think of @dataset() like a function parameter and @item() like the argument you pass:

# Dataset = function definition
def copy_table(SchemaName, TableName):    # @dataset().SchemaName
    read_from(SchemaName + "." + TableName)

# Pipeline = function call
for row in metadata:                      # ForEach
    copy_table(row.SchemaName, row.TableName)  # @item().SchemaName

Question 8: How do you implement incremental loading in ADF?

Concise answer:

Use a watermark pattern with a config table that stores the last loaded value for each source table. The pipeline reads the current MAX value, copies only rows WHERE delta_column > last_loaded_value, and then updates the watermark after a successful copy.

Deep explanation:

The watermark pattern involves:

  1. A config table with LastLoadedValue and DeltaColumnName columns
  2. A Lookup to get the current MAX value of the delta column
  3. A Copy with a dynamic WHERE clause: WHERE DeltaCol > LastLoadedValue AND DeltaCol <= MaxValue
  4. A Stored Procedure to update the watermark after successful copy
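
A hedged T-SQL sketch of what steps 1 and 4 might look like (the table and procedure names are hypothetical):

```sql
-- Step 1: one watermark row per source table
CREATE TABLE dbo.watermark_config (
    SchemaName      NVARCHAR(128) NOT NULL,
    TableName       NVARCHAR(128) NOT NULL,
    DeltaColumnName NVARCHAR(128) NOT NULL,
    LastLoadedValue NVARCHAR(64)  NOT NULL
);

-- Step 4: called by a Stored Procedure activity after a successful copy
CREATE PROCEDURE dbo.usp_UpdateWatermark
    @SchemaName NVARCHAR(128),
    @TableName  NVARCHAR(128),
    @NewValue   NVARCHAR(64)
AS
BEGIN
    UPDATE dbo.watermark_config
    SET    LastLoadedValue = @NewValue
    WHERE  SchemaName = @SchemaName
      AND  TableName  = @TableName;
END;
```

Updating the watermark only after a successful copy is what makes the pattern safe to re-run: a failed load leaves the old watermark in place.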

Two watermark types:

  • ID-based: Track the last loaded EMPID (catches inserts only)
  • Date-based: Track the last loaded timestamp (catches inserts AND updates)

Key expressions:

Delta query:
@concat('SELECT * FROM ', item().SchemaName, '.', item().TableName,
        ' WHERE ', item().DeltaColumnName, ' > ''', item().LastLoadedValue, '''')

Watermark update:
@activity('Lookup_MaxValue').output.firstRow.MaxValue

Why this matters: Full load works for small tables, but reloading a 100-million-row table can take hours. Incremental load copies only the changes (maybe 50K rows) in minutes.
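
To see what the dynamic WHERE clause actually produces at runtime, here is the same query construction written in plain Python. This is an illustration only (ADF evaluates the @concat expression itself); the function name is hypothetical:

```python
def build_delta_query(schema: str, table: str, delta_col: str,
                      last_value: str, max_value: str) -> str:
    """Mirror of the Copy activity's dynamic source query:
    copy only rows between the stored watermark and the current max."""
    return (
        f"SELECT * FROM {schema}.{table} "
        f"WHERE {delta_col} > '{last_value}' "
        f"AND {delta_col} <= '{max_value}'"
    )

# Values that would come from @item() and the Lookup_MaxValue activity
query = build_delta_query("SalesLT", "Customer", "ModifiedDate",
                          "2024-06-01", "2024-06-02")
print(query)
```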

Full guide: Incremental Data Loading with Delta Copy

Question 9: How do you handle errors and logging in ADF pipelines?

Concise answer:

Use dependency conditions (green arrow for success, red arrow for failure) to branch execution. Add Stored Procedure activities after the Copy activity — one on the success path that logs rows_read, rows_copied, and copy_duration, and one on the failure path that logs the error message. Both write to an audit table.

Deep explanation:

Dependency conditions:

| Condition | Arrow Color | When It Runs |
|---|---|---|
| Succeeded | Green | Only when parent succeeds |
| Failed | Red | Only when parent fails |
| Completed | Blue | Always (success or failure) |
| Skipped | Gray | Only when parent is skipped |

Success logging expressions:

@activity('Copy').output.rowsRead       → rows read from source
@activity('Copy').output.rowsCopied     → rows written to sink
@activity('Copy').output.copyDuration   → time in seconds
@item().TableName                        → which table

Failure logging:

@concat('Copy failed for ', item().SchemaName, '.', item().TableName)

For ForEach failures: Enable “Continue on error” so that if one table fails, the others still get processed. Without this, the entire ForEach stops on the first failure.
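
A hedged sketch of the audit table those Stored Procedure activities could write to (table and column names are hypothetical; the comments map each column back to the expressions above):

```sql
CREATE TABLE dbo.pipeline_audit (
    PipelineRunId NVARCHAR(64),          -- from @pipeline().RunId
    SchemaName    NVARCHAR(128),
    TableName     NVARCHAR(128),
    RowsRead      BIGINT        NULL,    -- from @activity('Copy').output.rowsRead
    RowsCopied    BIGINT        NULL,    -- from @activity('Copy').output.rowsCopied
    CopyDuration  INT           NULL,    -- seconds, from copyDuration
    Status        NVARCHAR(20),          -- 'Succeeded' or 'Failed'
    ErrorMessage  NVARCHAR(MAX) NULL,    -- populated only on the failure path
    LoggedAt      DATETIME2 DEFAULT SYSUTCDATETIME()
);
```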

Full guide: Synapse Pipeline with Audit Logging

Question 10: What are Triggers in ADF and what types are available?

Concise answer:

Triggers define when a pipeline runs. There are three types: Schedule triggers (time-based — run daily at 2 AM), Tumbling Window triggers (time-based with state tracking — processes each time window exactly once), and Event triggers (react to file creation/deletion in Blob Storage or ADLS Gen2).

Deep explanation:

Schedule Trigger:

  • Runs on a cron-like schedule (daily, hourly, every 15 minutes)
  • Simplest type — most commonly used
  • Example: Run the daily ETL pipeline at 2:00 AM UTC

Tumbling Window Trigger:

  • Similar to Schedule but tracks state per time window
  • If a window is missed (pipeline was paused), it catches up on the missed windows
  • Supports dependencies between windows (wait for previous window to complete)
  • Example: Process each hour’s data independently — if the 3 PM window fails, 4 PM still runs, and 3 PM is retried

Event Trigger:

  • Fires when a file is created or deleted in Blob Storage or ADLS Gen2
  • Can filter by folder path and file name pattern
  • Example: When a CSV file is uploaded to /incoming/, trigger the ingestion pipeline
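
As a concrete illustration, a tumbling window trigger definition looks roughly like this (a sketch; the trigger and pipeline names are hypothetical):

```json
{
  "name": "TR_Hourly_Window",
  "properties": {
    "type": "TumblingWindowTrigger",
    "typeProperties": {
      "frequency": "Hour",
      "interval": 1,
      "startTime": "2024-01-01T00:00:00Z",
      "maxConcurrency": 1,
      "retryPolicy": { "count": 2, "intervalInSeconds": 300 }
    },
    "pipeline": {
      "pipelineReference": { "referenceName": "PL_Hourly_Load", "type": "PipelineReference" },
      "parameters": {
        "windowStart": "@trigger().outputs.windowStartTime",
        "windowEnd": "@trigger().outputs.windowEndTime"
      }
    }
  }
}
```

Passing windowStartTime and windowEndTime into the pipeline is what lets each run process exactly its own slice of data.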

Interview tip: Mention that Tumbling Window is preferred over Schedule for data pipelines because it guarantees exactly-once processing per time window and handles backfills automatically.

Advanced Level

Question 11: How do you parameterize pipelines and datasets for reusability?

Concise answer:

Add parameters to datasets (for dynamic table names and file paths) and pipelines (for passing configuration values). Use @dataset() in the dataset connection tab and @pipeline().parameters in pipeline activities. For metadata-driven patterns, use @item() inside ForEach to pass values from the metadata table to parameterized datasets.

Deep explanation:

Dataset parameters (most common):

Dataset: DS_SqlDB_SourceTable
Parameters:
  - SchemaName (String)
  - TableName (String)

Connection tab:
  - Schema: @dataset().SchemaName
  - Table: @dataset().TableName

Pipeline parameters:

Pipeline: PL_Copy_Data
Parameters:
  - SourceSchema (String, default: "SalesLT")
  - TargetContainer (String, default: "database")

Copy activity:
  - Source dataset SchemaName: @pipeline().parameters.SourceSchema

The critical rule: Dataset parameter default values must be EMPTY. Putting expressions or placeholder text in defaults is the #1 cause of the BadRequest null error.
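
In JSON, a parameterized dataset looks roughly like this; note that the parameters carry no default values (a sketch using the names above):

```json
{
  "name": "DS_SqlDB_SourceTable",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": {
      "referenceName": "LS_AzureSqlDB",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "SchemaName": { "type": "string" },
      "TableName":  { "type": "string" }
    },
    "typeProperties": {
      "schema": { "value": "@dataset().SchemaName", "type": "Expression" },
      "table":  { "value": "@dataset().TableName",  "type": "Expression" }
    }
  }
}
```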

Question 12: What is the difference between Copy Activity and Data Flow?

Concise answer:

Copy Activity moves data as-is from source to sink — it’s fast, simple, and handles schema mapping but doesn’t transform data. Data Flow is a visual transformation tool powered by Spark that can do joins, aggregations, pivots, derived columns, and complex transformations.

Deep explanation:

| Feature | Copy Activity | Data Flow |
|---|---|---|
| Purpose | Move data from A to B | Transform data (ETL) |
| Engine | Optimized native copy engine | Apache Spark (managed) |
| Transformations | Column mapping, type conversion only | Joins, aggregates, window functions, pivots, derived columns |
| Speed for simple copy | Fast (direct copy) | Slower (Spark startup overhead) |
| Cost | Cheaper (DIU-based billing) | More expensive (Spark cluster billing) |
| Code required | No | No (visual drag-and-drop) |
| When to use | Moving data without transformation | Cleaning, joining, aggregating, reshaping data |

Interview tip: Say “I use Copy Activity for ingestion (landing raw data) and Data Flow or Databricks for transformation. This follows the ELT pattern — Extract and Load first, then Transform in the target system.”

Question 13: How would you migrate on-premises data to Azure using ADF?

Concise answer:

Install a Self-hosted Integration Runtime on the on-premises network, create a linked service that uses this IR to connect to the on-prem database, and build a Copy activity pipeline that reads from the on-prem source and writes to an Azure destination (ADLS Gen2, Azure SQL, Synapse).

Deep explanation:

Step-by-step approach:

  1. Install Self-hosted IR on a Windows machine inside the corporate network
  2. Register the IR with your ADF workspace using the authentication key
  3. Create a Linked Service for the on-prem database (e.g., SQL Server) using the Self-hosted IR
  4. Create source dataset pointing to the on-prem tables
  5. Create sink dataset pointing to Azure (ADLS Gen2, Azure SQL)
  6. Build the pipeline with Copy activities
  7. Test connectivity and run a debug execution
  8. Set up triggers for scheduled execution

Key considerations:

  • The Self-hosted IR machine needs network access to both the on-prem database AND the internet (to communicate with ADF)
  • For high availability, install the IR on multiple nodes
  • For large data volumes, consider using AzCopy or Azure Data Box for the initial bulk migration, then ADF for ongoing incremental syncs
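
The only structural difference from a cloud linked service is the connectVia block, which routes the connection through the Self-hosted IR. A rough sketch with hypothetical names:

```json
{
  "name": "LS_OnPrem_SqlServer",
  "properties": {
    "type": "SqlServer",
    "typeProperties": {
      "connectionString": "Server=onprem-sql01;Database=SalesDB;Integrated Security=True;"
    },
    "connectVia": {
      "referenceName": "SelfHostedIR",
      "type": "IntegrationRuntimeReference"
    }
  }
}
```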

Question 14: How do you handle schema drift in ADF?

Concise answer:

Schema drift is when the source data structure changes (new columns added, columns removed, data types changed) without updating the pipeline. ADF handles this through the Copy activity’s “Allow schema drift” option and Data Flow’s built-in schema drift capabilities.

Deep explanation:

In Copy Activity:

  • Enable “Column mapping” with dynamic schema detection
  • ADF can auto-map columns by name when source and sink have matching column names
  • For Parquet/JSON sinks, new columns are added automatically
  • For SQL sinks, you may need to handle schema changes manually or use stored procedures

In Data Flow:

  • Built-in schema drift setting allows processing columns that aren’t in the defined schema
  • Use derived column patterns to handle unknown columns
  • Column pattern matching lets you apply transformations based on column name patterns instead of specific names

Interview tip: Mention that Parquet and JSON sinks are more forgiving of schema drift than SQL sinks because they don’t enforce a rigid table structure. This is one reason data lakes (schema-on-read) are preferred for raw ingestion.

Question 15: How do you monitor and optimize ADF pipeline performance?

Concise answer:

Use the Monitor tab for real-time and historical run data, set up alerts for failures, check Copy activity throughput and DIU utilization, and optimize by tuning parallelism, DIU allocation, partition settings, and choosing the right Integration Runtime.

Deep explanation:

Monitoring:

  • Monitor tab — shows all pipeline runs with status, duration, and activity-level details
  • Azure Monitor — integrates ADF metrics (pipeline runs, activity runs, trigger runs) with Azure alerting
  • Log Analytics — stores detailed diagnostic logs for trend analysis
  • Custom audit tables — build your own logging with Stored Procedure activities

Performance optimization:

  1. Increase DIUs (Data Integration Units) — each DIU adds more compute for the Copy activity. Default is “Auto” (4 DIU). For large datasets, try 16-32 DIU.

  2. Enable parallel copy — set parallelCopies in the Copy activity for partitioned sources. This reads multiple partitions simultaneously.

  3. Use staging — for cross-region or on-prem copies, staging through Azure Blob can be faster than direct copy.

  4. Choose the right IR region — place the Azure IR close to both source and sink for minimum latency.

  5. Partition your source — if your table has a numeric or date partition key, ADF can read partitions in parallel.

  6. Trim the source query — select only the columns you need instead of SELECT *; less data to serialize and move means faster copies.

Interview tip: Mention specific metrics: “I typically monitor Copy throughput (MB/s), DIU utilization percentage, and pipeline duration trends over time. If throughput is low, I increase DIUs. If DIU utilization is already at 100%, I optimize the source query instead.”

Bonus: Questions to Ask the Interviewer

These show you understand production data engineering, not just ADF features:

  1. “What’s the typical data volume your pipelines handle daily?” — shows you think about scale
  2. “Do you use ADF for orchestration only, or also for transformations?” — shows you understand ELT vs ETL
  3. “How do you handle pipeline failures and retries?” — shows you think about reliability
  4. “Do you have a CI/CD process for ADF pipelines?” — shows you understand DevOps for data
  5. “Are you using incremental loading or full loads for most tables?” — shows you understand performance patterns

Wrapping Up

Azure Data Factory interviews aren’t just about knowing what each activity does — they’re about demonstrating that you’ve built real pipelines, debugged real errors, and understand production patterns.

The questions above cover the full range from basic concepts to advanced architecture. If you can walk through a metadata-driven pipeline, explain incremental loading, discuss error handling, and talk about performance optimization — you’ll stand out from candidates who only know textbook definitions.

Study these posts for deeper understanding:

  • Metadata-Driven Pipeline in Azure Data Factory
  • Synapse Pipeline with Audit Logging
  • Incremental Data Loading with Delta Copy
  • Common ADF/Synapse Pipeline Errors
  • Building a REST API with FastAPI on AWS Lambda

Good luck with your interview! If you have questions or want to share your experience, drop a comment below.


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
