What is Azure Data Factory? A Beginner’s Guide for Data Engineers
If you’re starting your journey in Azure data engineering, Azure Data Factory (ADF) is one of the first services you’ll encounter — and one of the most important to understand. It’s the backbone of data movement on Azure, and it shows up in almost every data engineering job description.
But what exactly IS it? How does it work? And when should you use it versus other tools?
This guide will give you a solid foundation — no prior Azure experience required. By the end, you’ll understand what ADF does, how its components fit together, and when to reach for it in a real project.
Table of Contents
- What is Azure Data Factory?
- The Problem ADF Solves
- ADF in the Real World — A Simple Example
- The 6 Core Components of ADF
- How the Components Work Together
- What ADF is NOT
- ADF vs Other Data Integration Tools
- When to Use ADF
- When NOT to Use ADF
- How Much Does ADF Cost?
- Your First Pipeline — Where to Start
- Common Terminology Cheat Sheet
- Wrapping Up
What is Azure Data Factory?
Azure Data Factory is Microsoft’s cloud-based data integration service. In plain English, it’s a tool that moves data from one place to another and orchestrates data workflows — all without you managing any servers.
Think of ADF as a traffic controller for your data. It doesn’t store data itself. It doesn’t transform data in complex ways by itself. What it does is coordinate the movement of data between systems, schedule when things happen, and handle errors when things go wrong.
ADF is:
- Fully managed — Microsoft handles all the infrastructure
- Serverless — you don’t provision or manage any VMs
- Code-free — you build pipelines using a visual drag-and-drop interface (though you can also use code)
- Pay-per-use — you pay only for what you run
The Problem ADF Solves
Every organization has data scattered across multiple systems:
- Customer data in a CRM (Salesforce, Dynamics 365)
- Financial data in an on-premises SQL Server
- Product data in an e-commerce platform (Shopify, SAP)
- Log data in cloud storage (S3, Blob Storage)
- Analytics data in a data warehouse (Synapse, Snowflake)
To do anything useful with this data — build dashboards, train ML models, generate reports — you need to bring it all together in one place. That’s the data integration problem.
Before cloud tools like ADF existed, companies would:
- Write custom Python/Java scripts to extract and load data
- Use expensive enterprise ETL tools (Informatica, DataStage)
- Build fragile cron jobs that broke silently
- Hire entire teams just to manage data pipelines
ADF replaces all of that with a managed service that connects to 90+ data sources, handles scheduling and retries, and provides monitoring out of the box.
ADF in the Real World — A Simple Example
Let’s say you work at a company that has:
- An Azure SQL Database with customer and order data
- An Azure Data Lake where the analytics team wants the data
- A requirement to refresh the data lake daily at 2 AM
Here’s what the ADF solution looks like:
Step 1: Create a Linked Service to Azure SQL Database (connection)
Step 2: Create a Linked Service to ADLS Gen2 (connection)
Step 3: Create a Dataset for the Customer table (source)
Step 4: Create a Dataset for the ADLS folder (destination)
Step 5: Create a Pipeline with a Copy Activity (move data)
Step 6: Create a Schedule Trigger (run daily at 2 AM)
That’s it. Six configuration steps — no code, no servers, no cron jobs. ADF handles the execution, monitoring, retries, and alerting.
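Under the hood, every object you configure is stored as a JSON definition (you can view it via the “code” button in ADF Studio). A minimal sketch of the pipeline from Step 5 might look like the following — the dataset names match the steps above, but the exact source and sink types depend on your connector and file format, so treat this as illustrative rather than copy-paste ready:

```json
{
  "name": "PL_Copy_Customer",
  "properties": {
    "activities": [
      {
        "name": "Copy_Customer",
        "type": "Copy",
        "inputs": [
          { "referenceName": "DS_Customer", "type": "DatasetReference" }
        ],
        "outputs": [
          { "referenceName": "DS_ADLS_Output", "type": "DatasetReference" }
        ],
        "typeProperties": {
          "source": { "type": "AzureSqlSource" },
          "sink": { "type": "ParquetSink" }
        }
      }
    ]
  }
}
```

In practice you almost never hand-write this — the visual editor generates it — but recognizing the JSON shape helps enormously when you start using Git integration and CI/CD.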
Now imagine you have 50 tables instead of one. Instead of creating 50 pipelines, you build a metadata-driven pipeline that reads a config table and dynamically copies all 50 tables in a single pipeline. I cover this in detail in my post on metadata-driven pipelines in ADF.
The 6 Core Components of ADF
ADF has six building blocks that work together. Understanding how they relate is the key to mastering ADF.
1. Linked Services (The Connections)
A Linked Service is a connection definition for a data store or compute service — essentially a managed connection string. It defines HOW to connect — the server address, authentication method, and credentials.
Examples:
- LS_AzureSqlDB → connects to Azure SQL Database
- LS_ADLS_Gen2 → connects to Azure Data Lake Storage
- LS_Salesforce → connects to Salesforce CRM
- LS_OnPremSQL → connects to on-premises SQL Server (via Self-hosted IR)
Think of it like a saved connection in a database client. You configure it once, and everything else references it.
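For reference, here is a hedged sketch of what a Linked Service definition looks like in JSON. The server, database, and Key Vault names (`myserver`, `SalesDb`, `LS_KeyVault`, `sql-password`) are placeholders — in a real setup you would store the secret in Azure Key Vault exactly as shown, or use a managed identity instead:

```json
{
  "name": "LS_AzureSqlDB",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:myserver.database.windows.net,1433;Database=SalesDb;",
      "password": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "LS_KeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "sql-password"
      }
    }
  }
}
```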
2. Datasets (The Data Pointers)
A Dataset points to a specific piece of data within a Linked Service. It defines WHAT data to read or write — which table, which file, which folder.
Examples:
- DS_Customer → points to SalesLT.Customer table in Azure SQL
- DS_ADLS_Output → points to /data/customer/ folder in ADLS Gen2
Multiple datasets can share the same Linked Service. One SQL connection, many table datasets.
Key feature: Datasets can be parameterized — instead of pointing to a specific table, the table name comes from a parameter at runtime. This enables metadata-driven patterns.
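A parameterized dataset definition looks roughly like this — `DS_SQL_Generic` is an illustrative name, and the `@dataset()` expressions pull the schema and table from whatever the calling activity passes in at runtime:

```json
{
  "name": "DS_SQL_Generic",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": {
      "referenceName": "LS_AzureSqlDB",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "SchemaName": { "type": "string" },
      "TableName": { "type": "string" }
    },
    "typeProperties": {
      "schema": { "value": "@dataset().SchemaName", "type": "Expression" },
      "table": { "value": "@dataset().TableName", "type": "Expression" }
    }
  }
}
```

One dataset like this can replace fifty single-table datasets, which is exactly what metadata-driven pipelines rely on.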
3. Activities (The Tasks)
An Activity is a single task within a pipeline. ADF has three categories:
Data Movement:
- Copy Activity — the most-used activity. Copies data from a source to a sink with schema mapping and type conversion.

Data Transformation:
- Data Flow — visual Spark-based transformations (joins, aggregations, pivots)
- Databricks — runs notebooks in Azure Databricks
- Stored Procedure — executes SQL stored procedures
- HDInsight — runs Hive, Pig, MapReduce

Control Flow:
- Lookup — reads data and passes it to other activities
- ForEach — loops over an array
- If Condition — branches based on a boolean
- Set Variable — stores a value for later use
- Execute Pipeline — calls another pipeline
- Wait — pauses for a specified duration
- Web — calls a REST API
4. Pipelines (The Workflows)
A Pipeline is a container of activities that defines the execution flow. Activities within a pipeline are connected by dependency arrows (success, failure, completion, skip).
Example pipeline: PL_Daily_ETL
├── Lookup (read config table)
├── ForEach (iterate over tables)
│ └── Copy Activity (move each table)
└── Web Activity (send completion notification)
Pipelines can call other pipelines using the Execute Pipeline activity, enabling modular design.
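The Lookup → ForEach pattern in PL_Daily_ETL above translates into pipeline JSON roughly like this (abridged — the activity names, config lookup, and dataset parameters are illustrative, and the inner Copy would also need outputs and typeProperties):

```json
{
  "name": "ForEach_Tables",
  "type": "ForEach",
  "dependsOn": [
    { "activity": "Lookup_Config", "dependencyConditions": [ "Succeeded" ] }
  ],
  "typeProperties": {
    "items": {
      "value": "@activity('Lookup_Config').output.value",
      "type": "Expression"
    },
    "activities": [
      {
        "name": "Copy_Table",
        "type": "Copy",
        "inputs": [
          {
            "referenceName": "DS_SQL_Generic",
            "type": "DatasetReference",
            "parameters": { "TableName": "@item().TableName" }
          }
        ]
      }
    ]
  }
}
```

Note how the `@activity(...)` and `@item()` expressions wire the Lookup’s output into each loop iteration — this is the same `@` expression syntax listed in the cheat sheet later in this post.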
5. Triggers (The Schedulers)
A Trigger defines when a pipeline runs:
- Schedule Trigger — runs on a time schedule (every day at 2 AM)
- Tumbling Window Trigger — time-based with state tracking (each window processed exactly once)
- Event Trigger — runs when a file is created/deleted in storage
Pipelines can also be run manually (Debug button) or via the REST API.
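A Schedule Trigger for the daily-at-2-AM requirement would look roughly like this in JSON (the trigger name and start date are placeholders; pick a time zone that matches your business requirement):

```json
{
  "name": "TR_Daily_2AM",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2025-01-01T02:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "PL_Copy_Customer",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```

One trigger can start multiple pipelines, and one pipeline can be started by multiple triggers — the relationship is many-to-many.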
6. Integration Runtime (The Compute)
The Integration Runtime (IR) is the compute engine that actually executes your activities. Three types:
- Azure IR — managed cloud compute for cloud-to-cloud data movement
- Self-hosted IR — installed on your machine for accessing on-premises or private network data
- Azure-SSIS IR — runs legacy SSIS packages in the cloud
Most beginners use Azure IR exclusively. Self-hosted IR becomes important when your source data is behind a corporate firewall.
How the Components Work Together
Here’s how everything connects:
Trigger (WHEN)
└── starts Pipeline (WHAT to do)
└── contains Activities (HOW to do it)
├── reads from Dataset (SOURCE data)
│ └── uses Linked Service (SOURCE connection)
└── writes to Dataset (DESTINATION data)
└── uses Linked Service (DESTINATION connection)
All executed by Integration Runtime (WHERE computation happens)
Real example:
Schedule Trigger (daily at 2 AM)
└── Pipeline: PL_Copy_Customer
└── Copy Activity: Copy_Customer
├── Source Dataset: DS_SQL_Customer
│ └── Linked Service: LS_AzureSqlDB
└── Sink Dataset: DS_ADLS_Customer
└── Linked Service: LS_ADLS_Gen2
Executed by: Azure Integration Runtime (auto-managed)
What ADF is NOT
This is important because many beginners have incorrect expectations:
ADF is NOT a database. It doesn’t store data. It moves data between systems that DO store data.
ADF is NOT a transformation engine (primarily). While it has Data Flows for transformation, heavy transformations are better done in Databricks, Synapse Spark, or SQL stored procedures. ADF’s strength is orchestration and data movement.
ADF is NOT real-time. It’s designed for batch processing (every hour, every day). For real-time streaming, use Azure Event Hubs, Azure Stream Analytics, or Kafka.
ADF is NOT a scheduler alone. While it has triggers, it’s much more than a cron job — it provides monitoring, retry logic, dependency management, and integration with 90+ data sources.
ADF vs Other Data Integration Tools
| Tool | Type | Best For | Cost |
|---|---|---|---|
| Azure Data Factory | Cloud-native, serverless | Azure ecosystem, hybrid cloud | Pay-per-use |
| AWS Glue | Cloud-native, serverless | AWS ecosystem | Pay-per-use |
| Informatica | Enterprise ETL | Large enterprises, complex transformations | License-based (expensive) |
| Apache Airflow | Open-source orchestrator | Custom orchestration, multi-cloud | Free (but you manage infrastructure) |
| dbt | Transformation only | SQL-based transformations in warehouse | Free (open-source) / paid (cloud) |
| Fivetran/Airbyte | Managed ELT | SaaS data ingestion | Per-connector pricing |
ADF’s sweet spot: You’re on Azure, you need to move data from multiple sources to a data lake or warehouse, and you want a managed service with visual pipeline building.
When to Use ADF
- Moving data from on-premises to Azure (via Self-hosted IR)
- Copying data from SaaS applications (Salesforce, SAP, Dynamics) to Azure
- Building daily/hourly ETL pipelines to refresh a data warehouse
- Orchestrating complex workflows with dependencies, retries, and error handling
- Metadata-driven ingestion — copying dozens or hundreds of tables with a single pipeline
- Incremental loading — copying only changed data using watermark patterns
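To make the watermark idea concrete: in an incremental pipeline, the Copy Activity’s source typically uses a dynamic query that filters on the last-loaded timestamp. A sketch of that source definition follows — the table, column, and Lookup activity names are illustrative:

```json
{
  "source": {
    "type": "AzureSqlSource",
    "sqlReaderQuery": {
      "value": "SELECT * FROM SalesLT.Customer WHERE ModifiedDate > '@{activity('Lookup_Watermark').output.firstRow.WatermarkValue}'",
      "type": "Expression"
    }
  }
}
```

After a successful copy, a Stored Procedure activity updates the stored watermark so the next run picks up only rows changed since then.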
When NOT to Use ADF
- Real-time streaming — use Event Hubs or Stream Analytics instead
- Heavy data transformations — use Databricks or Synapse Spark (ADF can call them, but don’t try to do complex joins/aggregations in ADF itself)
- Simple file copies — if you just need to copy a single file once, AzCopy or Storage Explorer is simpler
- Non-Azure environments — if you’re fully on AWS, use AWS Glue instead
How Much Does ADF Cost?
ADF pricing has three components:
| Component | What It Measures | Cost |
|---|---|---|
| Pipeline orchestration | Activity runs per month | $1.00 per 1,000 activity runs |
| Data movement | Data volume moved by Copy activities | ~$0.25 per DIU-hour |
| Data Flow | Spark cluster time for transformations | ~$0.27 per vCore-hour |
Real-world example: a pipeline that copies 10 tables daily, each taking 2 minutes at 4 DIUs, consumes about 10 × (2/60) × 4 ≈ 1.3 DIU-hours per day — roughly $0.33/day in data movement, plus well under a dollar a month in activity-run charges. Call it about $10/month. That’s significantly cheaper than maintaining your own ETL server.
Free tier: ADF doesn’t have a free tier, but the costs are so low for small workloads that you’ll barely notice them on your Azure bill.
Your First Pipeline — Where to Start
If you’re new to ADF, here’s the learning path I recommend:
1. Start with a simple Copy pipeline — copy one table from Azure SQL to ADLS Gen2. This teaches you Linked Services, Datasets, and the Copy Activity.
2. Add parameterized datasets — make the table name dynamic. This teaches you parameters and expressions.
3. Build a metadata-driven pipeline — add Lookup + ForEach to copy multiple tables from a config table. This is the pattern you’ll use in production. I have a complete guide: Metadata-Driven Pipeline in ADF
4. Add audit logging — track success/failure with Stored Procedure activities. Guide: Synapse Pipeline with Audit Logging
5. Implement incremental loading — copy only changed data using watermarks. Guide: Incremental Data Loading
Each step builds on the previous one, and by step 5, you have a production-quality data pipeline.
Common Terminology Cheat Sheet
| Term | Meaning |
|---|---|
| Pipeline | A workflow containing activities |
| Activity | A single task (Copy, Lookup, ForEach, etc.) |
| Dataset | A pointer to specific data (table, file, folder) |
| Linked Service | A connection to a data store |
| Trigger | What starts a pipeline (schedule, event) |
| Integration Runtime | The compute that runs activities |
| DIU | Data Integration Unit — compute power for Copy activities |
| Sink | The destination in a Copy activity |
| Source | The origin in a Copy activity |
| Expression | Dynamic content using @ syntax (e.g., @item().TableName) |
| Publish | Save changes to the ADF service (like “deploy”) |
| Debug | Run a pipeline manually for testing |
| Monitor | View pipeline run history and status |
Wrapping Up
Azure Data Factory is the entry point to Azure data engineering. It’s not the flashiest tool, but it’s the one that makes everything else possible — without ADF (or something like it), your data stays siloed in disconnected systems.
The key takeaway: ADF is an orchestrator and data mover, not a transformation engine. Use it to get data from A to B reliably, and use other tools (Databricks, Synapse, dbt) for complex transformations.
Continue your learning:
- Building a Metadata-Driven Pipeline in ADF
- Synapse Pipeline with Audit Logging
- Incremental Data Loading with Delta Copy
- Common ADF/Synapse Errors
- Top 15 ADF Interview Questions
If this guide helped you get started, share it with someone who’s new to Azure. Questions? Drop a comment below.
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.