Data Flows in Azure Data Factory and Synapse: Visual Data Transformation Without Code
In all our previous pipelines, we moved data from Point A to Point B using the Copy activity. We never changed the data — we just transported it. That is like a moving company that carries your furniture to a new house but never assembles, cleans, or rearranges anything.
Data Flows are the assembly team. They take raw, messy data and transform it — join tables, filter rows, add calculated columns, aggregate numbers, remove duplicates, and reshape data — all through a visual drag-and-drop interface powered by Apache Spark under the hood.
The best part? You do not write a single line of Spark or Python code. You drag boxes, connect them with arrows, and configure each step visually. ADF generates the Spark code behind the scenes.
Table of Contents
- What Are Data Flows?
- Data Flows vs Copy Activity: When to Use Which
- How Data Flows Work Under the Hood
- The Debug Cluster (And Why It Costs Money)
- The Transformation Types
- Building Your First Data Flow: Step by Step
- Source Transformation
- Filter Transformation
- Derived Column Transformation
- Join Transformation
- Aggregate Transformation
- Select Transformation
- Sort Transformation
- Sink Transformation
- Conditional Split (If/Else for Data)
- Union (Combining Datasets)
- Lookup Transformation
- Exists Transformation
- Alter Row (Insert, Update, Delete, Upsert)
- Window Transformation
- Data Flow Parameters
- Data Preview and Debugging
- Performance Optimization
- Cost and Pricing
- Real-World Scenarios
- Common Mistakes
- Interview Questions
- Wrapping Up
What Are Data Flows?
Think of a Data Flow like a kitchen assembly line in a restaurant:
Raw ingredients (Source) --> Wash (Filter) --> Chop (Derived Column) -->
Combine (Join) --> Cook (Aggregate) --> Plate (Select) --> Serve (Sink)
Each station does one thing. The ingredients flow from one station to the next. At the end, you have a finished dish (transformed data) ready to serve (write to the destination).
In technical terms, a Data Flow is a visual data transformation designer that runs on a managed Spark cluster. You build the transformation logic by dragging transformation blocks onto a canvas and connecting them.
Data Flows vs Copy Activity: When to Use Which
| Feature | Copy Activity | Data Flow |
|---|---|---|
| Purpose | Move data as-is (A to B) | Transform data (clean, join, aggregate) |
| Engine | Native optimized copy engine | Apache Spark cluster |
| Transformations | Column mapping only | Joins, aggregations, filters, derived columns, and 20+ more |
| Speed for simple copy | Fast (optimized for data movement) | Slower (Spark startup overhead) |
| Cost | Cheaper (DIU-based) | More expensive (Spark cluster) |
| Code required | None | None (visual designer) |
| Startup time | Instant | 3-5 minutes (cluster spin-up) |
Real-life analogy:
- Copy Activity = A courier service. They pick up your package and deliver it to the destination. They do not open it, inspect it, or change it. Fast and cheap.
- Data Flow = A factory. Raw materials go in, finished products come out. The materials are cleaned, shaped, assembled, tested, and packaged. Slower to start but does the actual work.
When to use Copy Activity: Raw data ingestion (Bronze layer). Moving data from source to data lake without transformation.
When to use Data Flow: Silver layer transformations. Cleaning, joining, aggregating, and reshaping data before loading into Gold layer or warehouse.
How Data Flows Work Under the Hood
When you add a Data Flow activity to a pipeline and run it:
- ADF/Synapse provisions a Spark cluster (managed, serverless)
- Your visual transformation logic is translated into Spark code automatically
- The Spark cluster executes the transformations
- Results are written to the Sink
- The cluster shuts down (unless TTL keeps it warm)
You never see the Spark code, never manage the cluster, never install libraries. It is fully managed.
Think of it like driving an automatic car. You press the gas (click Run), the car handles the gears (Spark optimization), and you arrive at your destination (transformed data). You do not need to know how the transmission works — but knowing it exists helps you understand performance.
The Debug Cluster (And Why It Costs Money)
This is where many people get surprised with their Azure bill.
When you turn on Data Flow Debug in ADF/Synapse Studio, it starts a Spark cluster that stays running so you can preview data at each transformation step. This is incredibly useful for development — you see your data flowing through each transformation in real-time.
But here is the catch: That cluster costs money every minute it runs, even if you are not actively using it. It is like leaving your car engine running while you go grocery shopping. The engine burns fuel whether you are driving or not.
Debug cluster cost: ~$0.27 per vCore-hour
Default: 8 cores = ~$2.16/hour
Left on for 8 hours while you are in meetings = ~$17 wasted
Left on overnight (forgot to turn off) = ~$40+ wasted
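The arithmetic above is easy to sanity-check. Here is a quick Python sketch; the $0.27/vCore-hour rate is illustrative, and actual pricing varies by region, so check the current Azure pricing page before budgeting:

```python
# Rough debug-cluster cost estimator. The rate below is illustrative only --
# real Azure Data Flow pricing varies by region and compute type.
RATE_PER_VCORE_HOUR = 0.27

def debug_cost(vcores: int, hours: float) -> float:
    """Cost of keeping a debug cluster of `vcores` cores alive for `hours`."""
    return round(vcores * RATE_PER_VCORE_HOUR * hours, 2)

print(debug_cost(8, 1))   # default 8-core cluster, one hour -> 2.16
print(debug_cost(8, 8))   # a workday of meetings -> 17.28
print(debug_cost(8, 18))  # forgotten overnight -> 38.88
```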
How to Manage Debug Costs
- Set Time to Live (TTL) to 60 minutes. The cluster auto-shuts down after 60 minutes of inactivity.
- Turn off Debug when you are done developing. Click the Debug toggle OFF.
- Use the smallest cluster for development (4 cores is enough for testing).
- Never leave Debug on overnight. Set a calendar reminder or auto-shutdown policy.
Real-life analogy: Debug mode is like keeping a taxi waiting outside with the meter running. Great when you need quick rides between testing steps. Expensive if you forget the taxi is there.
The Transformation Types
Data Flows offer 20+ transformation types. Here are the most important ones, organized by what they do:
Filtering and Routing
| Transformation | What It Does | Real-Life Analogy |
|---|---|---|
| Filter | Removes rows that do not meet a condition | Sorting mail — throwing away junk, keeping bills |
| Conditional Split | Routes rows to different paths based on conditions | Airport baggage sorting — domestic vs international vs oversized |
| Exists | Keeps only rows that exist in another dataset | Checking a guest list — only people on the list get in |
Shaping and Cleaning
| Transformation | What It Does | Real-Life Analogy |
|---|---|---|
| Derived Column | Adds new calculated columns or modifies existing ones | Adding a tip calculation column to a restaurant bill |
| Select | Picks specific columns, renames, reorders | Choosing which items to pack for a trip — leave the rest |
| Cast | Changes data types (string to integer, etc.) | Converting currency — turning dollars into euros |
Combining Data
| Transformation | What It Does | Real-Life Analogy |
|---|---|---|
| Join | Combines two datasets on matching keys (like SQL JOIN) | Matching students with their grades using student ID |
| Union | Stacks two datasets vertically (like SQL UNION) | Combining two guest lists into one big list |
| Lookup | Enriches rows with values from a reference dataset | Looking up a zip code to add city and state |
Aggregating
| Transformation | What It Does | Real-Life Analogy |
|---|---|---|
| Aggregate | Groups rows and calculates SUM, AVG, COUNT, etc. | Monthly expense report — total spending per category |
| Window | Calculates over a window of rows (running totals, ranks) | Leaderboard rankings in a video game |
Writing
| Transformation | What It Does | Real-Life Analogy |
|---|---|---|
| Sink | Writes the final result to a destination (ADLS, SQL, etc.) | Filing the finished report in a cabinet |
| Alter Row | Marks rows for INSERT, UPDATE, DELETE, or UPSERT | Deciding what to do with each item — keep, replace, or throw away |
Building Your First Data Flow: Step by Step
Let us build a Data Flow that:
1. Reads customer data from ADLS Gen2 (Parquet)
2. Filters out inactive customers
3. Adds a full name column (first + last)
4. Joins with orders data
5. Aggregates total order amount per customer
6. Writes the result to a SQL table
Step 1: Create the Data Flow
- In ADF Studio: Author > Data flows > + New data flow > Mapping Data Flow
- Name: DF_Customer_Orders_Summary
- In Synapse Studio: Develop > + > Data flow
Step 2: Turn On Debug (Optional but Recommended)
Click the Data flow debug toggle at the top. Select cluster size (4+4 cores for dev). Wait 3-5 minutes for the cluster to start. You will see a green indicator when ready.
Remember: this starts the meter. Turn it off when done!
Source Transformation
Every Data Flow starts with at least one Source — where the data comes from.
- Click Add Source on the canvas
- Name: CustomerSource
- Dataset: select your ADLS Parquet dataset (or create inline)
- Configure:
- Source type: Dataset or Inline (inline lets you configure without a separate dataset)
- Sampling: you can sample rows for debugging (e.g., first 1000 rows)
With Debug on, click Data preview tab at the bottom to see actual data flowing through.
Think of the Source as opening a book to read it. You have not done anything with the content yet — you are just loading it into the data flow.
Multiple Sources
You can have multiple sources in one Data Flow. For our example, add a second source:
- Click Add Source again
- Name: OrderSource
- Dataset: orders Parquet dataset
Filter Transformation
Removes rows that do not meet your condition. Like a bouncer at a club checking IDs.
- Click the + icon on the CustomerSource output arrow
- Select Filter
- Name: FilterActive
- Filter expression: status == 'Active'
Or using the expression builder:
isNull(deactivated_date)
Data preview now shows only active customers.
Common Filter Expressions
status == 'Active' -- exact match
salary > 50000 -- numeric comparison
!isNull(email) -- not null
year(order_date) == 2026 -- date function
contains(product_name, 'Pro') -- string contains
length(phone) == 10 -- string length
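Under the hood, a filter simply keeps the rows where the predicate evaluates to true. A rough Python equivalent of the FilterActive step (sample rows are made up; this is not the Spark code ADF generates):

```python
# Toy customer rows standing in for the Parquet source
customers = [
    {"customer_id": 1, "first_name": "Ana",  "status": "Active"},
    {"customer_id": 2, "first_name": "Ben",  "status": "Inactive"},
    {"customer_id": 3, "first_name": "Carl", "status": "Active"},
]

# Equivalent of the Data Flow filter expression: status == 'Active'
active = [row for row in customers if row["status"] == "Active"]

print(len(active))  # 2 rows survive the filter
```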
Derived Column Transformation
Adds new columns or modifies existing ones. Like adding a calculated field to a spreadsheet.
- Click + on FilterActive output
- Select Derived column
- Name: AddFullName
- Add columns:
  - Column: full_name, Expression: concat(first_name, ' ', last_name)
  - Column: email_domain, Expression: split(email, '@')[2]
  - Column: salary_band, Expression: iif(salary > 80000, 'Senior', iif(salary > 60000, 'Mid', 'Junior'))
Common Derived Column Expressions
concat(first_name, ' ', last_name) -- combine strings
upper(city) -- uppercase
trim(name) -- remove whitespace
toInteger(string_column) -- type conversion
year(order_date) -- extract year
addDays(currentDate(), -30) -- date arithmetic
iif(amount > 1000, 'High', 'Low') -- conditional
coalesce(phone, 'N/A') -- null handling
sha2(256, email) -- hashing (for PII)
regexReplace(phone, '[^0-9]', '') -- regex cleanup
Real-life analogy: Derived columns are like adding a calculated column to a spreadsheet. The original data stays, and you add new columns based on formulas.
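If it helps to see the semantics, here is a rough Python translation of a few of those expressions. Note that Data Flow expression arrays are 1-based, which is why the example above uses split(email, '@')[2] for the domain part; Python is 0-based. The sample row is made up for illustration:

```python
import re

row = {"first_name": "Ada", "last_name": "Lovelace",
       "email": "ada@example.com", "salary": 85000,
       "phone": "(555) 123-4567"}

full_name = row["first_name"] + " " + row["last_name"]   # concat(first_name, ' ', last_name)
email_domain = row["email"].split("@")[1]                # split(email, '@')[2] -- Data Flow arrays are 1-based
salary_band = ("Senior" if row["salary"] > 80000         # nested iif(...)
               else "Mid" if row["salary"] > 60000 else "Junior")
clean_phone = re.sub(r"[^0-9]", "", row["phone"])        # regexReplace(phone, '[^0-9]', '')

print(full_name, email_domain, salary_band, clean_phone)
```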
Join Transformation
Combines two data streams on matching keys. Just like SQL JOIN but visual.
- Click + on AddFullName output
- Select Join
- Name: JoinOrders
- Right stream: select OrderSource
- Join type: Inner (or Left Outer, Right Outer, Full Outer, Cross)
- Join condition: CustomerSource@customer_id == OrderSource@customer_id
Join Types Available
| Join Type | What It Does | Real-Life Analogy |
|---|---|---|
| Inner | Only matching rows from both sides | Wedding seating — only couples where both RSVP’d |
| Left Outer | All from left, matching from right | Employee list with optional department info |
| Right Outer | All from right, matching from left | Department list with optional employee info |
| Full Outer | Everything from both sides | Complete inventory — items in warehouse AND in transit |
| Cross | Every combination (Cartesian product) | Every color x every size = all product variants |
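To make the inner vs left-outer difference concrete, here is a small Python sketch with made-up rows (real Data Flows run this as a distributed Spark join, not nested loops):

```python
customers = [{"customer_id": 1, "name": "Ana"},
             {"customer_id": 2, "name": "Ben"}]
orders = [{"customer_id": 1, "order_id": 101, "amount": 50.0},
          {"customer_id": 1, "order_id": 102, "amount": 75.0}]

# Inner join: only customers with at least one matching order
inner = [{**c, **o} for c in customers for o in orders
         if c["customer_id"] == o["customer_id"]]

# Left outer join: every customer, with order fields set to None when unmatched
left = []
for c in customers:
    matches = [o for o in orders if o["customer_id"] == c["customer_id"]]
    if matches:
        left.extend({**c, **o} for o in matches)
    else:
        left.append({**c, "order_id": None, "amount": None})

print(len(inner), len(left))  # 2 inner rows; 3 left rows (Ben kept with Nones)
```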
Aggregate Transformation
Groups rows and calculates summary values. Like creating a pivot table in Excel.
- Click + on JoinOrders output
- Select Aggregate
- Name: AggByCustomer
- Group by: customer_id, full_name, email
- Aggregates:
  - Column: total_orders, Expression: count(order_id)
  - Column: total_amount, Expression: sum(amount)
  - Column: avg_order_value, Expression: round(avg(amount), 2)
  - Column: last_order_date, Expression: max(order_date)
Real-life analogy: Aggregate is like your credit card monthly statement. It groups your transactions by merchant and shows the total amount spent at each one.
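The same grouping written out in plain Python, mirroring the AggByCustomer configuration above (rows are illustrative; Spark does this with a distributed group-by):

```python
joined = [
    {"customer_id": 1, "order_id": 101, "amount": 50.0, "order_date": "2026-01-05"},
    {"customer_id": 1, "order_id": 102, "amount": 70.0, "order_date": "2026-02-10"},
    {"customer_id": 2, "order_id": 103, "amount": 30.0, "order_date": "2026-01-20"},
]

summary = {}
for row in joined:
    agg = summary.setdefault(row["customer_id"],
                             {"total_orders": 0, "total_amount": 0.0, "last_order_date": ""})
    agg["total_orders"] += 1                                                 # count(order_id)
    agg["total_amount"] += row["amount"]                                     # sum(amount)
    agg["last_order_date"] = max(agg["last_order_date"], row["order_date"])  # max(order_date)

for agg in summary.values():
    agg["avg_order_value"] = round(agg["total_amount"] / agg["total_orders"], 2)  # round(avg(amount), 2)

print(summary[1])  # customer 1: 2 orders, 120.0 total, 60.0 average, last on 2026-02-10
```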
Select Transformation
Picks, renames, and reorders columns. Like choosing which columns to show in a final report.
- Click + on AggByCustomer output
- Select Select
- Name: SelectFinalColumns
- Choose columns: customer_id, full_name, email, total_orders, total_amount, avg_order_value, last_order_date
- Rename if needed: total_amount to lifetime_value
Use Select to clean up before writing to the Sink. Remove intermediate columns, rename for clarity, and set the final column order.
Sort Transformation
Orders rows by one or more columns.
- Click + on SelectFinalColumns output
- Select Sort
- Name: SortByValue
- Sort by: lifetime_value, Descending
Your highest-value customers appear first.
Sink Transformation
Writes the final transformed data to a destination. Every Data Flow must end with at least one Sink.
- Click + on SortByValue output
- Select Sink
- Name: WriteToSQL
- Dataset: Azure SQL Database dataset (or ADLS Parquet dataset)
- Settings:
- Update method: Insert (or Upsert if using Alter Row)
- Table action: Recreate table (for dev) or None (for production)
Sink Options
| Setting | Options | When to Use |
|---|---|---|
| Table action | None, Recreate, Truncate | Recreate for dev, None for prod |
| Update method | Insert, Upsert, Update, Delete | Insert for append, Upsert for merge |
| Batch size | Default or custom | Increase for large datasets |
| Pre/Post SQL | Custom SQL to run before/after | Truncate staging table before insert |
Conditional Split (If/Else for Data)
Routes rows to different outputs based on conditions. Like a highway interchange — different cars take different exits.
All Orders
|-- Amount > 1000 --> HighValueOrders (Sink 1)
|-- Amount 100-1000 --> MediumOrders (Sink 2)
|-- Amount < 100 --> SmallOrders (Sink 3)
Configure:
- Stream 1: HighValue, Condition: amount > 1000
- Stream 2: Medium, Condition: amount >= 100 && amount <= 1000
- Default stream: Small (everything else)
Each output stream can have its own downstream transformations and sinks.
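In Python terms, a conditional split is a first-match-wins routing step. A minimal sketch using the thresholds from the example above (rows are made up):

```python
orders = [{"order_id": 1, "amount": 2500.0},
          {"order_id": 2, "amount": 450.0},
          {"order_id": 3, "amount": 20.0}]

high, medium, small = [], [], []
for row in orders:
    if row["amount"] > 1000:            # Stream 1: HighValue
        high.append(row)
    elif 100 <= row["amount"] <= 1000:  # Stream 2: Medium
        medium.append(row)
    else:                               # Default stream: Small
        small.append(row)

print(len(high), len(medium), len(small))  # 1 1 1
```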
Data Flow Parameters
Data Flows can accept parameters from the pipeline, making them reusable:
- In the Data Flow designer, click Parameters at the top
- Add: SourceFolder (String), TargetTable (String)
- Use in transformations: $SourceFolder, $TargetTable
In the pipeline’s Data Flow activity, pass values:
SourceFolder: @pipeline().parameters.FolderPath
TargetTable: @concat('staging_', pipeline().parameters.TableName)
Real-life analogy: Parameters are like adjustable settings on a washing machine. The same machine washes different loads — you just change the settings (temperature, cycle) each time.
Data Preview and Debugging
With Debug mode on, click Data preview on ANY transformation to see the data at that point:
Source (1000 rows) --> Filter (750 rows) --> Join (680 rows) --> Aggregate (45 rows) --> Sink
You can see exactly how many rows survive each step and inspect the actual data. This is incredibly powerful for debugging.
Real-life analogy: Data preview is like a glass-walled factory. You can walk along the production line and peek through the glass at any station to see the work in progress. Without it, you only see what goes in and what comes out, with no visibility into the middle.
Performance Optimization
1. Choose the Right Cluster Size
| Workload | Recommended Cores |
|---|---|
| Development/testing | 4+4 (8 total) |
| Small transformations (under 1 GB) | 8+8 (16 total) |
| Medium transformations (1-10 GB) | 16+16 (32 total) |
| Large transformations (10+ GB) | 32+32 or more |
2. Partition Your Data
Spark works best with partitioned data. If your source is Parquet with date partitions, Spark reads only the partitions it needs.
3. Avoid Unnecessary Columns Early
Use a Select transformation after the Source to drop columns you do not need. Less data flowing through the pipeline = faster execution.
4. Use Broadcast for Small Lookup Tables
When joining a large table with a small reference table, enable Broadcast on the small side. This copies the small table to every Spark node, avoiding an expensive shuffle.
5. Set TTL on the Integration Runtime
TTL keeps the Spark cluster warm between runs. Second run starts instantly instead of waiting 3-5 minutes.
TTL = 0: Every run waits 3-5 minutes (cold start)
TTL = 10: Second run within 10 minutes = instant start
TTL = 60: Cluster stays warm for 1 hour
Cost and Pricing
| Component | Cost | Example |
|---|---|---|
| Data Flow execution | ~$0.27 per vCore-hour | 8 cores for 15 minutes = $0.54 |
| Debug cluster | Same rate | 8 cores for 2 hours = $4.32 |
| Cluster startup | Included (but takes 3-5 min) | No extra cost, just time |
Cost optimization:
- Use smallest cluster for development
- Set TTL to avoid repeated cold starts
- Turn off Debug when not actively developing
- Schedule Data Flows during off-peak hours if possible
Real-World Scenarios
Scenario 1: Customer 360 View
Source: CRM data (customers)
Source: Order history (transactions)
Source: Support tickets (interactions)
|
Join: CRM + Orders on customer_id
Join: Result + Tickets on customer_id
|
Derived Column: customer_segment (based on total spend)
Derived Column: health_score (based on recent activity)
|
Aggregate: total_spend, ticket_count, last_interaction
|
Sink: dim_customer table in SQL warehouse
Scenario 2: Daily Sales Aggregation
Source: Raw transactions (Bronze layer, Parquet)
|
Filter: Remove test transactions (keep amount > 0)
Filter: Remove future dates (keep order_date <= currentDate())
|
Derived Column: revenue = quantity * unit_price
Derived Column: order_month = toString(order_date, 'yyyy-MM')
|
Aggregate: Group by product, region, order_month
SUM(revenue), COUNT(order_id), AVG(unit_price)
|
Sink: fact_daily_sales (Gold layer, SQL pool)
Scenario 3: Data Quality Cleanup
Source: Raw customer data (messy CSV from vendor)
|
Derived Column: email = lower(trim(email))
Derived Column: phone = regexReplace(phone, '[^0-9]', '')
Derived Column: name = initCap(trim(name))
|
Filter: !isNull(email) && length(email) > 5
|
Conditional Split:
Valid (email contains '@') --> Sink: clean_customers
Invalid --> Sink: rejected_customers (for review)
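The whole Scenario 3 flow fits in a few lines of Python, which can help you reason about edge cases before building it visually. This is a sketch with made-up rows; initCap is approximated with str.title, which behaves slightly differently on apostrophes and hyphens:

```python
import re

raw = [
    {"email": "  ADA@Example.COM ", "phone": "(555) 123-4567", "name": " ada lovelace "},
    {"email": "bad-address",        "phone": "555.987.6543",   "name": "grace hopper"},
]

clean, rejected = [], []
for row in raw:
    cleaned = {
        "email": row["email"].strip().lower(),        # lower(trim(email))
        "phone": re.sub(r"[^0-9]", "", row["phone"]), # regexReplace(phone, '[^0-9]', '')
        "name": row["name"].strip().title(),          # roughly initCap(trim(name))
    }
    # Filter + conditional split: keep plausible emails, route the rest for review
    if len(cleaned["email"]) > 5 and "@" in cleaned["email"]:
        clean.append(cleaned)
    else:
        rejected.append(cleaned)

print(len(clean), len(rejected))  # 1 1
```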
Common Mistakes
1. Leaving Debug On Overnight
The Spark cluster keeps running and charging. Set TTL or turn Debug off manually. Think of it as leaving your car engine running while you sleep — expensive and unnecessary.
2. Using Data Flow for Simple Copies
If you just need to move data without transformations, use Copy Activity. It is faster, cheaper, and does not need a Spark cluster. Do not use a factory assembly line when you just need a courier.
3. Joining Large Tables Without Broadcast
Two large tables joined without optimization cause a “shuffle” — all data moves across the network. If one table is small (under 50 MB), broadcast it.
4. Not Previewing Data at Each Step
Build incrementally. Add one transformation, preview the data, confirm it looks right, then add the next. Do not build 10 transformations and then wonder why the output is wrong.
5. Ignoring Partition Skew
If you partition by a column with uneven distribution (e.g., country where 80% of data is USA), one partition gets overwhelmed. Use hash partitioning or salt the key.
Interview Questions
Q: What is a Data Flow in ADF and when would you use it?
A: A Data Flow is a visual data transformation designer powered by Spark. Use it when you need to clean, join, aggregate, or reshape data — transformations beyond what Copy Activity can do. It is code-free and runs on managed Spark clusters.
Q: What is the difference between Data Flow and Copy Activity?
A: Copy Activity moves data as-is from source to sink (fast, cheap). Data Flow transforms data with joins, aggregations, filters, and derived columns (slower, more expensive, runs on Spark). Use Copy for ingestion, Data Flow for transformation.
Q: Why does the Debug cluster cost money even when idle?
A: Debug mode provisions a Spark cluster that stays running for instant data previews. The cluster consumes compute resources whether you are actively using it or not. Set TTL to auto-shutdown or manually turn off Debug when not needed.
Q: What is the difference between a Join and a Lookup in Data Flows?
A: Join combines two full datasets on matching keys (like SQL JOIN). Lookup enriches the primary stream by fetching matching values from a reference dataset without duplicating rows. Use Join for combining equal datasets, Lookup for adding reference columns.
Q: How do you optimize Data Flow performance?
A: Right-size the cluster, broadcast small tables in joins, drop unnecessary columns early with Select, use partitioned source data, and set TTL on the Integration Runtime to avoid cold starts.
Wrapping Up
Data Flows bring the power of Apache Spark to your fingertips without writing code. They are the transformation engine that turns raw Bronze layer data into clean, aggregated, business-ready Gold layer data.
The key is knowing WHEN to use them: not for simple data movement (that is Copy Activity’s job), but for the heavy lifting — joins, aggregations, cleanup, and reshaping that turn raw data into insights.
And always remember to turn off the Debug cluster when you leave. Your Azure bill will thank you.
Related posts:
- What is Azure Data Factory?
- Metadata-Driven Pipeline in ADF
- Parquet vs CSV vs JSON
- Schema-on-Write vs Schema-on-Read
- ADF Expressions Guide
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.