What is Azure Data Factory? A Beginner’s Guide for Data Engineers
If you’re starting your journey in Azure data engineering, Azure Data Factory (ADF) is one of the first services you’ll encounter — and one of the most important to understand. It’s the backbone of data movement on Azure, and it shows up in almost every data engineering job description.
But what exactly IS it? How does it work? And when should you use it versus other tools?
This guide will give you a solid foundation — no prior Azure experience required. By the end, you’ll understand what ADF does, how its components fit together, and when to reach for it in a real project.
Table of Contents
- What is Azure Data Factory?
- The Problem ADF Solves
- ADF in the Real World — A Simple Example
- The 6 Core Components of ADF
- How the Components Work Together
- What ADF is NOT
- ADF vs Other Data Integration Tools
- When to Use ADF
- When NOT to Use ADF
- How Much Does ADF Cost?
- Your First Pipeline — Where to Start
- Common Terminology Cheat Sheet
- Wrapping Up
What is Azure Data Factory?
Azure Data Factory is Microsoft’s cloud-based data integration service. In plain English, it’s a tool that moves data from one place to another and orchestrates data workflows — all without you managing any servers.
Think of ADF as a traffic controller for your data. It doesn’t store data itself. It doesn’t transform data in complex ways by itself. What it does is coordinate the movement of data between systems, schedule when things happen, and handle errors when things go wrong.
ADF is:
- Fully managed — Microsoft handles all the infrastructure
- Serverless — you don’t provision or manage any VMs
- Code-free — you build pipelines using a visual drag-and-drop interface (though you can also use code)
- Pay-per-use — you pay only for what you run
The Problem ADF Solves
Every organization has data scattered across multiple systems:
- Customer data in a CRM (Salesforce, Dynamics 365)
- Financial data in an on-premises SQL Server
- Product data in an e-commerce platform (Shopify, SAP)
- Log data in cloud storage (S3, Blob Storage)
- Analytics data in a data warehouse (Synapse, Snowflake)
To do anything useful with this data — build dashboards, train ML models, generate reports — you need to bring it all together in one place. That’s the data integration problem.
Before cloud tools like ADF existed, companies would:
- Write custom Python/Java scripts to extract and load data
- Use expensive enterprise ETL tools (Informatica, DataStage)
- Build fragile cron jobs that broke silently
- Hire entire teams just to manage data pipelines
ADF replaces all of that with a managed service that connects to 90+ data sources, handles scheduling and retries, and provides monitoring out of the box.
ADF in the Real World — A Simple Example
Let’s say you work at a company that has:
- An Azure SQL Database with customer and order data
- An Azure Data Lake where the analytics team wants the data
- A requirement to refresh the data lake daily at 2 AM
Here’s what the ADF solution looks like:
Step 1: Create a Linked Service to Azure SQL Database (connection)
Step 2: Create a Linked Service to ADLS Gen2 (connection)
Step 3: Create a Dataset for the Customer table (source)
Step 4: Create a Dataset for the ADLS folder (destination)
Step 5: Create a Pipeline with a Copy Activity (move data)
Step 6: Create a Schedule Trigger (run daily at 2 AM)
That’s it. Six configuration steps — no code, no servers, no cron jobs. ADF handles the execution, monitoring, retries, and alerting.
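Under the hood, every object you configure is stored as a JSON definition (you can view it via the “code” button in ADF Studio). A minimal sketch of the pipeline from Step 5 might look like the following — the dataset names match the steps above, but the exact source and sink types depend on your connector and file format, so treat this as illustrative rather than copy-paste ready:

```json
{
  "name": "PL_Copy_Customer",
  "properties": {
    "activities": [
      {
        "name": "Copy_Customer",
        "type": "Copy",
        "inputs": [
          { "referenceName": "DS_Customer", "type": "DatasetReference" }
        ],
        "outputs": [
          { "referenceName": "DS_ADLS_Output", "type": "DatasetReference" }
        ],
        "typeProperties": {
          "source": { "type": "AzureSqlSource" },
          "sink": { "type": "ParquetSink" }
        }
      }
    ]
  }
}
```

In practice you almost never hand-write this — the visual editor generates it — but recognizing the JSON shape helps enormously when you start using Git integration and CI/CD.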
Now imagine you have 50 tables instead of one. Instead of creating 50 pipelines, you build a metadata-driven pipeline that reads a config table and dynamically copies all 50 tables in a single pipeline. I cover this in detail in my post on metadata-driven pipelines in ADF.
The 6 Core Components of ADF
ADF has six building blocks that work together. Understanding how they relate is the key to mastering ADF.
1. Linked Services (The Connections)
A Linked Service is a connection definition for a data store or compute service — essentially a managed connection string. It defines HOW to connect — the server address, authentication method, and credentials.
Examples:
- LS_AzureSqlDB → connects to Azure SQL Database
- LS_ADLS_Gen2 → connects to Azure Data Lake Storage
- LS_Salesforce → connects to Salesforce CRM
- LS_OnPremSQL → connects to on-premises SQL Server (via Self-hosted IR)
Think of it like a saved connection in a database client. You configure it once, and everything else references it.
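For reference, here is a hedged sketch of what a Linked Service definition looks like in JSON. The server, database, and Key Vault names (`myserver`, `SalesDb`, `LS_KeyVault`, `sql-password`) are placeholders — in a real setup you would store the secret in Azure Key Vault exactly as shown, or use a managed identity instead:

```json
{
  "name": "LS_AzureSqlDB",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:myserver.database.windows.net,1433;Database=SalesDb;",
      "password": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "LS_KeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "sql-password"
      }
    }
  }
}
```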
2. Datasets (The Data Pointers)
A Dataset points to a specific piece of data within a Linked Service. It defines WHAT data to read or write — which table, which file, which folder.
Examples:
- DS_Customer → points to SalesLT.Customer table in Azure SQL
- DS_ADLS_Output → points to /data/customer/ folder in ADLS Gen2
Multiple datasets can share the same Linked Service. One SQL connection, many table datasets.
Key feature: Datasets can be parameterized — instead of pointing to a specific table, the table name comes from a parameter at runtime. This enables metadata-driven patterns.
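A parameterized dataset definition looks roughly like this — `DS_SQL_Generic` is an illustrative name, and the `@dataset()` expressions pull the schema and table from whatever the calling activity passes in at runtime:

```json
{
  "name": "DS_SQL_Generic",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": {
      "referenceName": "LS_AzureSqlDB",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "SchemaName": { "type": "string" },
      "TableName": { "type": "string" }
    },
    "typeProperties": {
      "schema": { "value": "@dataset().SchemaName", "type": "Expression" },
      "table": { "value": "@dataset().TableName", "type": "Expression" }
    }
  }
}
```

One dataset like this can replace fifty single-table datasets, which is exactly what metadata-driven pipelines rely on.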
3. Activities (The Tasks)
An Activity is a single task within a pipeline. ADF has three categories:
Data Movement:
- Copy Activity — the most-used activity. Copies data from a source to a sink with schema mapping and type conversion.

Data Transformation:
- Data Flow — visual Spark-based transformations (joins, aggregations, pivots)
- Databricks — runs notebooks in Azure Databricks
- Stored Procedure — executes SQL stored procedures
- HDInsight — runs Hive, Pig, MapReduce

Control Flow:
- Lookup — reads data and passes it to other activities
- ForEach — loops over an array
- If Condition — branches based on a boolean
- Set Variable — stores a value for later use
- Execute Pipeline — calls another pipeline
- Wait — pauses for a specified duration
- Web — calls a REST API
4. Pipelines (The Workflows)
A Pipeline is a container of activities that defines the execution flow. Activities within a pipeline are connected by dependency arrows (success, failure, completion, skip).
Example pipeline: PL_Daily_ETL
├── Lookup (read config table)
├── ForEach (iterate over tables)
│ └── Copy Activity (move each table)
└── Web Activity (send completion notification)
Pipelines can call other pipelines using the Execute Pipeline activity, enabling modular design.
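The Lookup → ForEach pattern in PL_Daily_ETL above translates into pipeline JSON roughly like this (abridged — the activity names, config lookup, and dataset parameters are illustrative, and the inner Copy would also need outputs and typeProperties):

```json
{
  "name": "ForEach_Tables",
  "type": "ForEach",
  "dependsOn": [
    { "activity": "Lookup_Config", "dependencyConditions": [ "Succeeded" ] }
  ],
  "typeProperties": {
    "items": {
      "value": "@activity('Lookup_Config').output.value",
      "type": "Expression"
    },
    "activities": [
      {
        "name": "Copy_Table",
        "type": "Copy",
        "inputs": [
          {
            "referenceName": "DS_SQL_Generic",
            "type": "DatasetReference",
            "parameters": { "TableName": "@item().TableName" }
          }
        ]
      }
    ]
  }
}
```

Note how the `@activity(...)` and `@item()` expressions wire the Lookup’s output into each loop iteration — this is the same `@` expression syntax listed in the cheat sheet later in this post.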
5. Triggers (The Schedulers)
A Trigger defines when a pipeline runs:
- Schedule Trigger — runs on a time schedule (every day at 2 AM)
- Tumbling Window Trigger — time-based with state tracking (each window processed exactly once)
- Event Trigger — runs when a file is created/deleted in storage
Pipelines can also be run manually (Debug button) or via the REST API.
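A Schedule Trigger for the daily-at-2-AM requirement would look roughly like this in JSON (the trigger name and start date are placeholders; pick a time zone that matches your business requirement):

```json
{
  "name": "TR_Daily_2AM",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2025-01-01T02:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "PL_Copy_Customer",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```

One trigger can start multiple pipelines, and one pipeline can be started by multiple triggers — the relationship is many-to-many.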
6. Integration Runtime (The Compute)
The Integration Runtime (IR) is the compute engine that actually executes your activities. Three types:
- Azure IR — managed cloud compute for cloud-to-cloud data movement
- Self-hosted IR — installed on your machine for accessing on-premises or private network data
- Azure-SSIS IR — runs legacy SSIS packages in the cloud
Most beginners use Azure IR exclusively. Self-hosted IR becomes important when your source data is behind a corporate firewall.
How the Components Work Together
Here’s how everything connects:
Trigger (WHEN)
└── starts Pipeline (WHAT to do)
└── contains Activities (HOW to do it)
├── reads from Dataset (SOURCE data)
│ └── uses Linked Service (SOURCE connection)
└── writes to Dataset (DESTINATION data)
└── uses Linked Service (DESTINATION connection)
All executed by Integration Runtime (WHERE computation happens)
Real example:
Schedule Trigger (daily at 2 AM)
└── Pipeline: PL_Copy_Customer
└── Copy Activity: Copy_Customer
├── Source Dataset: DS_SQL_Customer
│ └── Linked Service: LS_AzureSqlDB
└── Sink Dataset: DS_ADLS_Customer
└── Linked Service: LS_ADLS_Gen2
Executed by: Azure Integration Runtime (auto-managed)
What ADF is NOT
This is important because many beginners have incorrect expectations:
ADF is NOT a database. It doesn’t store data. It moves data between systems that DO store data.
ADF is NOT a transformation engine (primarily). While it has Data Flows for transformation, heavy transformations are better done in Databricks, Synapse Spark, or SQL stored procedures. ADF’s strength is orchestration and data movement.
ADF is NOT real-time. It’s designed for batch processing (every hour, every day). For real-time streaming, use Azure Event Hubs, Azure Stream Analytics, or Kafka.
ADF is NOT a scheduler alone. While it has triggers, it’s much more than a cron job — it provides monitoring, retry logic, dependency management, and integration with 90+ data sources.
ADF vs Other Data Integration Tools
| Tool | Type | Best For | Cost |
|---|---|---|---|
| Azure Data Factory | Cloud-native, serverless | Azure ecosystem, hybrid cloud | Pay-per-use |
| AWS Glue | Cloud-native, serverless | AWS ecosystem | Pay-per-use |
| Informatica | Enterprise ETL | Large enterprises, complex transformations | License-based (expensive) |
| Apache Airflow | Open-source orchestrator | Custom orchestration, multi-cloud | Free (but you manage infrastructure) |
| dbt | Transformation only | SQL-based transformations in warehouse | Free (open-source) / paid (cloud) |
| Fivetran/Airbyte | Managed ELT | SaaS data ingestion | Per-connector pricing |
ADF’s sweet spot: You’re on Azure, you need to move data from multiple sources to a data lake or warehouse, and you want a managed service with visual pipeline building.
When to Use ADF
- Moving data from on-premises to Azure (via Self-hosted IR)
- Copying data from SaaS applications (Salesforce, SAP, Dynamics) to Azure
- Building daily/hourly ETL pipelines to refresh a data warehouse
- Orchestrating complex workflows with dependencies, retries, and error handling
- Metadata-driven ingestion — copying dozens or hundreds of tables with a single pipeline
- Incremental loading — copying only changed data using watermark patterns
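To make the watermark idea concrete: in an incremental pipeline, the Copy Activity’s source typically uses a dynamic query that filters on the last-loaded timestamp. A sketch of that source definition follows — the table, column, and Lookup activity names are illustrative:

```json
{
  "source": {
    "type": "AzureSqlSource",
    "sqlReaderQuery": {
      "value": "SELECT * FROM SalesLT.Customer WHERE ModifiedDate > '@{activity('Lookup_Watermark').output.firstRow.WatermarkValue}'",
      "type": "Expression"
    }
  }
}
```

After a successful copy, a Stored Procedure activity updates the stored watermark so the next run picks up only rows changed since then.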
When NOT to Use ADF
- Real-time streaming — use Event Hubs or Stream Analytics instead
- Heavy data transformations — use Databricks or Synapse Spark (ADF can call them, but don’t try to do complex joins/aggregations in ADF itself)
- Simple file copies — if you just need to copy a single file once, AzCopy or Storage Explorer is simpler
- Non-Azure environments — if you’re fully on AWS, use AWS Glue instead
How Much Does ADF Cost?
ADF pricing has three components:
| Component | What It Measures | Cost |
|---|---|---|
| Pipeline orchestration | Activity runs per month | $1.00 per 1,000 activity runs |
| Data movement | Data volume moved by Copy activities | ~$0.25 per DIU-hour |
| Data Flow | Spark cluster time for transformations | ~$0.27 per vCore-hour |
Real-world example: a pipeline that copies 10 tables daily, each taking 2 minutes at 4 DIUs, consumes about 10 × (2/60) × 4 ≈ 1.3 DIU-hours per day — roughly $0.33/day in data movement, plus well under a dollar a month in activity-run charges. Call it about $10/month. That’s significantly cheaper than maintaining your own ETL server.
Free tier: ADF doesn’t have a free tier, but the costs are so low for small workloads that you’ll barely notice them on your Azure bill.
Your First Pipeline — Where to Start
If you’re new to ADF, here’s the learning path I recommend:
1. Start with a simple Copy pipeline — copy one table from Azure SQL to ADLS Gen2. This teaches you Linked Services, Datasets, and the Copy Activity.
2. Add parameterized datasets — make the table name dynamic. This teaches you parameters and expressions.
3. Build a metadata-driven pipeline — add Lookup + ForEach to copy multiple tables from a config table. This is the pattern you’ll use in production. I have a complete guide: Metadata-Driven Pipeline in ADF
4. Add audit logging — track success/failure with Stored Procedure activities. Guide: Synapse Pipeline with Audit Logging
5. Implement incremental loading — copy only changed data using watermarks. Guide: Incremental Data Loading
Each step builds on the previous one, and by step 5, you have a production-quality data pipeline.
Common Terminology Cheat Sheet
| Term | Meaning |
|---|---|
| Pipeline | A workflow containing activities |
| Activity | A single task (Copy, Lookup, ForEach, etc.) |
| Dataset | A pointer to specific data (table, file, folder) |
| Linked Service | A connection to a data store |
| Trigger | What starts a pipeline (schedule, event) |
| Integration Runtime | The compute that runs activities |
| DIU | Data Integration Unit — compute power for Copy activities |
| Sink | The destination in a Copy activity |
| Source | The origin in a Copy activity |
| Expression | Dynamic content using @ syntax (e.g., @item().TableName) |
| Publish | Save changes to the ADF service (like “deploy”) |
| Debug | Run a pipeline manually for testing |
| Monitor | View pipeline run history and status |
Wrapping Up
Azure Data Factory is the entry point to Azure data engineering. It’s not the flashiest tool, but it’s the one that makes everything else possible — without ADF (or something like it), your data stays siloed in disconnected systems.
The key takeaway: ADF is an orchestrator and data mover, not a transformation engine. Use it to get data from A to B reliably, and use other tools (Databricks, Synapse, dbt) for complex transformations.
Continue your learning:
- Building a Metadata-Driven Pipeline in ADF
- Synapse Pipeline with Audit Logging
- Incremental Data Loading with Delta Copy
- Common ADF/Synapse Errors
- Top 15 ADF Interview Questions
If this guide helped you get started, share it with someone who’s new to Azure. Questions? Drop a comment below.
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.