Data Flows in Azure Data Factory and Synapse: Visual Data Transformation Without Code

In all our previous pipelines, we moved data from Point A to Point B using the Copy activity. We never changed the data — we just transported it. That is like a moving company that carries your furniture to a new house but never assembles, cleans, or rearranges anything.

Data Flows are the assembly team. They take raw, messy data and transform it — join tables, filter rows, add calculated columns, aggregate numbers, remove duplicates, and reshape data — all through a visual drag-and-drop interface powered by Apache Spark under the hood.

The best part? You do not write a single line of Spark or Python code. You drag boxes, connect them with arrows, and configure each step visually. ADF generates the Spark code behind the scenes.

Table of Contents

  • What Are Data Flows?
  • Data Flows vs Copy Activity: When to Use Which
  • How Data Flows Work Under the Hood
  • The Debug Cluster (And Why It Costs Money)
  • The Transformation Types
  • Building Your First Data Flow: Step by Step
  • Source Transformation
  • Filter Transformation
  • Derived Column Transformation
  • Join Transformation
  • Aggregate Transformation
  • Select Transformation
  • Sort Transformation
  • Sink Transformation
  • Conditional Split (If/Else for Data)
  • Union (Combining Datasets)
  • Lookup Transformation
  • Exists Transformation
  • Alter Row (Insert, Update, Delete, Upsert)
  • Window Transformation
  • Data Flow Parameters
  • Data Preview and Debugging
  • Performance Optimization
  • Cost and Pricing
  • Real-World Scenarios
  • Common Mistakes
  • Interview Questions
  • Wrapping Up

What Are Data Flows?

Think of a Data Flow like a kitchen assembly line in a restaurant:

Raw ingredients (Source) --> Wash (Filter) --> Chop (Derived Column) -->
Combine (Join) --> Cook (Aggregate) --> Plate (Select) --> Serve (Sink)

Each station does one thing. The ingredients flow from one station to the next. At the end, you have a finished dish (transformed data) ready to serve (write to the destination).

In technical terms, a Data Flow is a visual data transformation designer that runs on a managed Spark cluster. You build the transformation logic by dragging transformation blocks onto a canvas and connecting them.
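The kitchen-line idea maps naturally onto function composition. Here is a minimal plain-Python sketch (illustrative names and data, not ADF APIs) of a Source -> Filter -> Derived Column chain:

```python
# Minimal sketch of the assembly-line idea: each transformation is a small
# function, and the Data Flow is just the chain. Illustrative names only.
def filter_active(rows):
    """Filter: keep rows where status == 'Active'."""
    return [r for r in rows if r["status"] == "Active"]

def add_full_name(rows):
    """Derived Column: add full_name = concat(first, ' ', last)."""
    return [{**r, "full_name": f"{r['first']} {r['last']}"} for r in rows]

def run_flow(rows):
    # Source -> Filter -> Derived Column -> Sink (here the Sink is just the return)
    return add_full_name(filter_active(rows))

customers = [
    {"first": "Ada", "last": "Lovelace", "status": "Active"},
    {"first": "Bob", "last": "Smith", "status": "Inactive"},
]
print(run_flow(customers))
```

In a real Data Flow the same chain is Spark code generated for you; this sketch only shows the shape of the logic.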

Data Flows vs Copy Activity: When to Use Which

| Feature | Copy Activity | Data Flow |
| --- | --- | --- |
| Purpose | Move data as-is (A to B) | Transform data (clean, join, aggregate) |
| Engine | Native optimized copy engine | Apache Spark cluster |
| Transformations | Column mapping only | Joins, aggregations, filters, derived columns, and 20+ more |
| Speed for simple copy | Fast (optimized for data movement) | Slower (Spark startup overhead) |
| Cost | Cheaper (DIU-based) | More expensive (Spark cluster) |
| Code required | None | None (visual designer) |
| Startup time | Instant | 3-5 minutes (cluster spin-up) |

Real-life analogy:

  • Copy Activity = A courier service. They pick up your package and deliver it to the destination. They do not open it, inspect it, or change it. Fast and cheap.
  • Data Flow = A factory. Raw materials go in, finished products come out. The materials are cleaned, shaped, assembled, tested, and packaged. Slower to start but does the actual work.

When to use Copy Activity: Raw data ingestion (Bronze layer). Moving data from source to data lake without transformation.

When to use Data Flow: Silver layer transformations. Cleaning, joining, aggregating, and reshaping data before loading into Gold layer or warehouse.

How Data Flows Work Under the Hood

When you add a Data Flow activity to a pipeline and run it:

  1. ADF/Synapse provisions a Spark cluster (managed, serverless)
  2. Your visual transformation logic is translated into Spark code automatically
  3. The Spark cluster executes the transformations
  4. Results are written to the Sink
  5. The cluster shuts down (unless TTL keeps it warm)

You never see the Spark code, never manage the cluster, never install libraries. It is fully managed.

Think of it like driving an automatic car. You press the gas (click Run), the car handles the gears (Spark optimization), and you arrive at your destination (transformed data). You do not need to know how the transmission works — but knowing it exists helps you understand performance.

The Debug Cluster (And Why It Costs Money)

This is where many people get surprised with their Azure bill.

When you turn on Data Flow Debug in ADF/Synapse Studio, it starts a Spark cluster that stays running so you can preview data at each transformation step. This is incredibly useful for development — you see your data flowing through each transformation in real-time.

But here is the catch: That cluster costs money every minute it runs, even if you are not actively using it. It is like leaving your car engine running while you go grocery shopping. The engine burns fuel whether you are driving or not.

Debug cluster cost: ~$0.27 per vCore-hour
Default: 8 cores = ~$2.16/hour
Left on for 8 hours while you are in meetings = ~$17 wasted
Left on overnight (forgot to turn off) = ~$40+ wasted
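A quick sanity check on those numbers, as a tiny Python helper using the ~$0.27/vCore-hour rate quoted above (actual rates vary by region and Azure IR configuration):

```python
# Back-of-the-envelope debug-cluster cost at the ~$0.27/vCore-hour rate
# quoted in the article (assumption: rate is region-dependent in practice).
RATE_PER_VCORE_HOUR = 0.27

def debug_cost(vcores: int, hours: float) -> float:
    """Approximate cost of keeping a debug cluster alive."""
    return round(vcores * hours * RATE_PER_VCORE_HOUR, 2)

print(debug_cost(8, 1))   # default 8-core cluster, one hour
print(debug_cost(8, 8))   # left on for a full workday
```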

How to Manage Debug Costs

  1. Set Time to Live (TTL) to 60 minutes. The cluster auto-shuts down after 60 minutes of inactivity.
  2. Turn off Debug when you are done developing. Click the Debug toggle OFF.
  3. Use the smallest cluster for development (4 cores is enough for testing).
  4. Never leave Debug on overnight. Set a calendar reminder or auto-shutdown policy.

Real-life analogy: Debug mode is like keeping a taxi waiting outside with the meter running. Great when you need quick rides between testing steps. Expensive if you forget the taxi is there.

The Transformation Types

Data Flows offer 20+ transformation types. Here are the most important ones, organized by what they do:

Filtering and Routing

| Transformation | What It Does | Real-Life Analogy |
| --- | --- | --- |
| Filter | Removes rows that do not meet a condition | Sorting mail — throwing away junk, keeping bills |
| Conditional Split | Routes rows to different paths based on conditions | Airport baggage sorting — domestic vs international vs oversized |
| Exists | Keeps only rows that exist in another dataset | Checking a guest list — only people on the list get in |

Shaping and Cleaning

| Transformation | What It Does | Real-Life Analogy |
| --- | --- | --- |
| Derived Column | Adds new calculated columns or modifies existing ones | Adding a tip calculation column to a restaurant bill |
| Select | Picks specific columns, renames, reorders | Choosing which items to pack for a trip — leave the rest |
| Cast | Changes data types (string to integer, etc.) | Converting currency — turning dollars into euros |

Combining Data

| Transformation | What It Does | Real-Life Analogy |
| --- | --- | --- |
| Join | Combines two datasets on matching keys (like SQL JOIN) | Matching students with their grades using student ID |
| Union | Stacks two datasets vertically (like SQL UNION) | Combining two guest lists into one big list |
| Lookup | Enriches rows with values from a reference dataset | Looking up a zip code to add city and state |

Aggregating

| Transformation | What It Does | Real-Life Analogy |
| --- | --- | --- |
| Aggregate | Groups rows and calculates SUM, AVG, COUNT, etc. | Monthly expense report — total spending per category |
| Window | Calculates over a window of rows (running totals, ranks) | Leaderboard rankings in a video game |

Writing

| Transformation | What It Does | Real-Life Analogy |
| --- | --- | --- |
| Sink | Writes the final result to a destination (ADLS, SQL, etc.) | Filing the finished report in a cabinet |
| Alter Row | Marks rows for INSERT, UPDATE, DELETE, or UPSERT | Deciding what to do with each item — keep, replace, or throw away |

Building Your First Data Flow: Step by Step

Let us build a Data Flow that:

  1. Reads customer data from ADLS Gen2 (Parquet)
  2. Filters out inactive customers
  3. Adds a full name column (first + last)
  4. Joins with orders data
  5. Aggregates total order amount per customer
  6. Writes the result to a SQL table

Step 1: Create the Data Flow

  1. In ADF Studio: Author > Data flows > + New data flow > Mapping Data Flow
  2. Name: DF_Customer_Orders_Summary

In Synapse Studio: Develop > + > Data flow

Click the Data flow debug toggle at the top. Select cluster size (4+4 cores for dev). Wait 3-5 minutes for the cluster to start. You will see a green indicator when ready.

Remember: this starts the meter. Turn it off when done!

Source Transformation

Every Data Flow starts with at least one Source — where the data comes from.

  1. Click Add Source on the canvas
  2. Name: CustomerSource
  3. Dataset: select your ADLS Parquet dataset (or create inline)
  4. Configure:
     • Source type: Dataset or Inline (inline lets you configure without a separate dataset)
     • Sampling: you can sample rows for debugging (e.g., first 1000 rows)

With Debug on, click Data preview tab at the bottom to see actual data flowing through.

Think of the Source as opening a book to read it. You have not done anything with the content yet — you are just loading it into the data flow.

Multiple Sources

You can have multiple sources in one Data Flow. For our example, add a second source:

  1. Click Add Source again
  2. Name: OrderSource
  3. Dataset: orders Parquet dataset

Filter Transformation

Removes rows that do not meet your condition. Like a bouncer at a club checking IDs.

  1. Click the + icon on the CustomerSource output arrow
  2. Select Filter
  3. Name: FilterActive
  4. Filter expression: status == 'Active'

Or using the expression builder:

isNull(deactivated_date)

Data preview now shows only active customers.

Common Filter Expressions

status == 'Active'                           -- exact match
salary > 50000                               -- numeric comparison
!isNull(email)                               -- not null
year(order_date) == 2026                     -- date function
instr(product_name, 'Pro') > 0               -- string contains
length(phone) == 10                          -- string length
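To make the filter semantics concrete, here is a plain-Python sketch applying a few of those conditions to rows modeled as dicts (illustrative only; a real Data Flow compiles the expressions to Spark):

```python
# Plain-Python counterparts of common Filter conditions. Rows are dicts here;
# in a Data Flow they are records in a Spark DataFrame.
rows = [
    {"status": "Active", "salary": 60000, "email": "a@x.com", "phone": "5551234567"},
    {"status": "Inactive", "salary": 40000, "email": None, "phone": "555"},
]

active = [r for r in rows if r["status"] == "Active"]       # status == 'Active'
well_paid = [r for r in rows if r["salary"] > 50000]        # salary > 50000
has_email = [r for r in rows if r["email"] is not None]     # !isNull(email)
ten_digit = [r for r in rows if len(r["phone"]) == 10]      # length(phone) == 10

print(len(active), len(well_paid), len(has_email), len(ten_digit))
```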

Derived Column Transformation

Adds new columns or modifies existing ones. Like adding a calculated field to a spreadsheet.

  1. Click + on FilterActive output
  2. Select Derived column
  3. Name: AddFullName
  4. Add columns:
     • Column: full_name, Expression: concat(first_name, ' ', last_name)
     • Column: email_domain, Expression: split(email, '@')[2] (arrays are 1-based in Data Flow expressions, so [2] is the part after the @)
     • Column: salary_band, Expression: iif(salary > 80000, 'Senior', iif(salary > 60000, 'Mid', 'Junior'))

Common Derived Column Expressions

concat(first_name, ' ', last_name)           -- combine strings
upper(city)                                   -- uppercase
trim(name)                                    -- remove whitespace
toInteger(string_column)                      -- type conversion
year(order_date)                              -- extract year
addDays(currentDate(), -30)                   -- date arithmetic
iif(amount > 1000, 'High', 'Low')            -- conditional
coalesce(phone, 'N/A')                        -- null handling
sha2(256, email)                              -- hashing (for PII)
regexReplace(phone, '[^0-9]', '')            -- regex cleanup

Real-life analogy: Derived columns are like adding a calculated column to a spreadsheet. The original data stays, and you add new columns based on formulas.
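As a rough sketch, here are Python counterparts for several of the expressions above, with plain dicts standing in for the data stream. Note that Python indexing is 0-based where Data Flow arrays are 1-based:

```python
# Illustrative Python equivalents of derived-column expressions.
import hashlib
import re

row = {"first_name": "Ada", "last_name": "Lovelace",
       "email": "ada@example.com", "phone": "(555) 123-4567", "salary": 90000}

derived = {
    **row,
    "full_name": f"{row['first_name']} {row['last_name']}",              # concat(...)
    "email_domain": row["email"].split("@")[1],                          # split(email, '@')[2]
    "salary_band": ("Senior" if row["salary"] > 80000
                    else "Mid" if row["salary"] > 60000 else "Junior"),  # nested iif(...)
    "phone_digits": re.sub(r"[^0-9]", "", row["phone"]),                 # regexReplace(...)
    "email_hash": hashlib.sha256(row["email"].encode()).hexdigest(),     # sha2(256, email)
}
print(derived["full_name"], derived["email_domain"],
      derived["salary_band"], derived["phone_digits"])
```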

Join Transformation

Combines two data streams on matching keys. Just like SQL JOIN but visual.

  1. Click + on AddFullName output
  2. Select Join
  3. Name: JoinOrders
  4. Right stream: select OrderSource
  5. Join type: Inner (or Left Outer, Right Outer, Full Outer, Cross)
  6. Join condition: CustomerSource@customer_id == OrderSource@customer_id

Join Types Available

| Join Type | What It Does | Real-Life Analogy |
| --- | --- | --- |
| Inner | Only matching rows from both sides | Wedding seating — only couples where both RSVP’d |
| Left Outer | All from left, matching from right | Employee list with optional department info |
| Right Outer | All from right, matching from left | Department list with optional employee info |
| Full Outer | Everything from both sides | Complete inventory — items in warehouse AND in transit |
| Cross | Every combination (Cartesian product) | Every color x every size = all product variants |
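The inner vs. left-outer semantics can be illustrated with a few lines of Python over dict rows — a sketch of the behavior, not of how Spark implements it:

```python
# Tiny inner and left-outer joins over dict rows, keyed on customer_id.
customers = [{"customer_id": 1, "name": "Ada"}, {"customer_id": 2, "name": "Bob"}]
orders = [{"customer_id": 1, "amount": 100}, {"customer_id": 1, "amount": 50}]

def inner_join(left, right, key):
    """Only rows whose key matches on both sides survive."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

def left_outer_join(left, right, key):
    """Every left row survives; unmatched rows get a null right-side column."""
    out = []
    for l in left:
        matches = [{**l, **r} for r in right if r[key] == l[key]]
        # Simplification: we null out just the one known right-side column.
        out.extend(matches or [{**l, "amount": None}])
    return out

print(len(inner_join(customers, orders, "customer_id")))       # Bob drops out
print(len(left_outer_join(customers, orders, "customer_id")))  # Bob kept, amount=None
```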

Aggregate Transformation

Groups rows and calculates summary values. Like creating a pivot table in Excel.

  1. Click + on JoinOrders output
  2. Select Aggregate
  3. Name: AggByCustomer
  4. Group by: customer_id, full_name, email
  5. Aggregates:
     • Column: total_orders, Expression: count(order_id)
     • Column: total_amount, Expression: sum(amount)
     • Column: avg_order_value, Expression: round(avg(amount), 2)
     • Column: last_order_date, Expression: max(order_date)

Real-life analogy: Aggregate is like your credit card monthly statement. It groups your transactions by merchant and shows the total amount spent at each one.
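The same group-and-summarize logic, sketched in plain Python with `collections.defaultdict` (illustrative data; a real Data Flow does this distributed on Spark):

```python
# Group by customer_id and compute count, sum, avg, max — like AggByCustomer.
from collections import defaultdict

orders = [
    {"customer_id": 1, "order_id": "A", "amount": 100.0, "order_date": "2026-01-05"},
    {"customer_id": 1, "order_id": "B", "amount": 50.0,  "order_date": "2026-02-01"},
    {"customer_id": 2, "order_id": "C", "amount": 75.0,  "order_date": "2026-01-20"},
]

groups = defaultdict(list)
for o in orders:
    groups[o["customer_id"]].append(o)

summary = {
    cid: {
        "total_orders": len(rows),
        "total_amount": sum(r["amount"] for r in rows),
        "avg_order_value": round(sum(r["amount"] for r in rows) / len(rows), 2),
        "last_order_date": max(r["order_date"] for r in rows),  # ISO dates sort lexically
    }
    for cid, rows in groups.items()
}
print(summary[1])
```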

Select Transformation

Picks, renames, and reorders columns. Like choosing which columns to show in a final report.

  1. Click + on AggByCustomer output
  2. Select Select
  3. Name: SelectFinalColumns
  4. Choose columns: customer_id, full_name, email, total_orders, total_amount, avg_order_value, last_order_date
  5. Rename if needed: total_amount to lifetime_value

Use Select to clean up before writing to the Sink. Remove intermediate columns, rename for clarity, and set the final column order.
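Select boils down to "keep these columns, rename some of them". A plain-Python sketch of that idea (hypothetical column names):

```python
# Keep a subset of keys and rename total_amount -> lifetime_value.
row = {"customer_id": 1, "full_name": "Ada Lovelace", "email": "ada@example.com",
       "total_amount": 150.0, "debug_temp_col": "drop me"}

keep = ["customer_id", "full_name", "email", "total_amount"]
renames = {"total_amount": "lifetime_value"}

selected = {renames.get(k, k): row[k] for k in keep}
print(selected)
```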

Sort Transformation

Orders rows by one or more columns.

  1. Click + on SelectFinalColumns output
  2. Select Sort
  3. Name: SortByValue
  4. Sort by: lifetime_value Descending

Your highest-value customers appear first.

Sink Transformation

Writes the final transformed data to a destination. Every Data Flow must end with at least one Sink.

  1. Click + on SortByValue output
  2. Select Sink
  3. Name: WriteToSQL
  4. Dataset: Azure SQL Database dataset (or ADLS Parquet dataset)
  5. Settings:
     • Update method: Insert (or Upsert if using Alter Row)
     • Table action: Recreate table (for dev) or None (for production)

Sink Options

| Setting | Options | When to Use |
| --- | --- | --- |
| Table action | None, Recreate, Truncate | Recreate for dev, None for prod |
| Update method | Insert, Upsert, Update, Delete | Insert for append, Upsert for merge |
| Batch size | Default or custom | Increase for large datasets |
| Pre/Post SQL | Custom SQL to run before/after | Truncate staging table before insert |

Conditional Split (If/Else for Data)

Routes rows to different outputs based on conditions. Like a highway interchange — different cars take different exits.

All Orders
  |-- Amount > 1000 --> HighValueOrders (Sink 1)
  |-- Amount 100-1000 --> MediumOrders (Sink 2)
  |-- Amount < 100 --> SmallOrders (Sink 3)

Configure:

  • Stream 1: HighValue, Condition: amount > 1000
  • Stream 2: Medium, Condition: amount >= 100 && amount <= 1000
  • Default stream: Small (everything else)

Each output stream can have its own downstream transformations and sinks.
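A plain-Python sketch of the split logic: conditions are checked in order, and the default stream catches everything else, just like the Conditional Split above:

```python
# Route each order into exactly one stream; first matching condition wins.
def split_orders(orders):
    streams = {"HighValue": [], "Medium": [], "Small": []}
    for o in orders:
        if o["amount"] > 1000:
            streams["HighValue"].append(o)
        elif 100 <= o["amount"] <= 1000:
            streams["Medium"].append(o)
        else:                              # default stream
            streams["Small"].append(o)
    return streams

orders = [{"amount": 1500}, {"amount": 500}, {"amount": 20}]
result = split_orders(orders)
print({k: len(v) for k, v in result.items()})
```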

Data Flow Parameters

Data Flows can accept parameters from the pipeline, making them reusable:

  1. In the Data Flow designer, click Parameters at the top
  2. Add: SourceFolder (String), TargetTable (String)
  3. Use in transformations: $SourceFolder, $TargetTable

In the pipeline’s Data Flow activity, pass values:

SourceFolder: @pipeline().parameters.FolderPath
TargetTable:  @concat('staging_', pipeline().parameters.TableName)

Real-life analogy: Parameters are like adjustable settings on a washing machine. The same machine washes different loads — you just change the settings (temperature, cycle) each time.

Data Preview and Debugging

With Debug mode on, click Data preview on ANY transformation to see the data at that point:

Source (1000 rows) --> Filter (750 rows) --> Join (680 rows) --> Aggregate (45 rows) --> Sink

You can see exactly how many rows survive each step and inspect the actual data. This is incredibly powerful for debugging.

Real-life analogy: Data preview is like a glass-walled factory. You can walk along the production line and peek through the glass at any station to see the work in progress. Without it, you only see what goes in and what comes out, with no visibility into the middle.

Performance Optimization

1. Choose the Right Cluster Size

| Workload | Recommended Cores |
| --- | --- |
| Development/testing | 4+4 (8 total) |
| Small transformations (under 1 GB) | 8+8 (16 total) |
| Medium transformations (1-10 GB) | 16+16 (32 total) |
| Large transformations (10+ GB) | 32+32 or more |

2. Partition Your Data

Spark works best with partitioned data. If your source is Parquet with date partitions, Spark reads only the partitions it needs.

3. Avoid Unnecessary Columns Early

Use a Select transformation after the Source to drop columns you do not need. Less data flowing through the pipeline = faster execution.

4. Use Broadcast for Small Lookup Tables

When joining a large table with a small reference table, enable Broadcast on the small side. This copies the small table to every Spark node, avoiding an expensive shuffle.
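Conceptually, a broadcast join replaces a network shuffle with an in-memory hash lookup: every worker gets a full copy of the small table and probes it row by row. A plain-Python sketch of the idea (hypothetical data):

```python
# The essence of a broadcast join: build a dict from the small reference
# table (one copy per worker, conceptually), then probe it per large-side row.
small_ref = [{"zip": "10001", "city": "New York"},
             {"zip": "94105", "city": "San Francisco"}]
large = [{"order_id": 1, "zip": "10001"}, {"order_id": 2, "zip": "94105"}]

broadcast = {r["zip"]: r["city"] for r in small_ref}  # "broadcast" the small side

enriched = [{**row, "city": broadcast.get(row["zip"])} for row in large]
print(enriched)
```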

5. Set TTL on the Integration Runtime

TTL keeps the Spark cluster warm between runs. Second run starts instantly instead of waiting 3-5 minutes.

TTL = 0:  Every run waits 3-5 minutes (cold start)
TTL = 10: Second run within 10 minutes = instant start
TTL = 60: Cluster stays warm for 1 hour

Cost and Pricing

| Component | Cost | Example |
| --- | --- | --- |
| Data Flow execution | ~$0.27 per vCore-hour | 8 cores for 15 minutes = $0.54 |
| Debug cluster | Same rate | 8 cores for 2 hours = $4.32 |
| Cluster startup | Included (but takes 3-5 min) | No extra cost, just time |

Cost optimization:

  • Use smallest cluster for development
  • Set TTL to avoid repeated cold starts
  • Turn off Debug when not actively developing
  • Schedule Data Flows during off-peak hours if possible

Real-World Scenarios

Scenario 1: Customer 360 View

Source: CRM data (customers)
Source: Order history (transactions)
Source: Support tickets (interactions)
  |
  Join: CRM + Orders on customer_id
  Join: Result + Tickets on customer_id
  |
  Derived Column: customer_segment (based on total spend)
  Derived Column: health_score (based on recent activity)
  |
  Aggregate: total_spend, ticket_count, last_interaction
  |
  Sink: dim_customer table in SQL warehouse

Scenario 2: Daily Sales Aggregation

Source: Raw transactions (Bronze layer, Parquet)
  |
  Filter: Remove test transactions (amount > 0)
  Filter: Remove future dates (order_date <= currentDate())
  |
  Derived Column: revenue = quantity * unit_price
  Derived Column: order_month = toString(order_date, 'yyyy-MM')
  |
  Aggregate: Group by product, region, order_month
             SUM(revenue), COUNT(order_id), AVG(unit_price)
  |
  Sink: fact_daily_sales (Gold layer, SQL pool)

Scenario 3: Data Quality Cleanup

Source: Raw customer data (messy CSV from vendor)
  |
  Derived Column: email = lower(trim(email))
  Derived Column: phone = regexReplace(phone, '[^0-9]', '')
  Derived Column: name = initCap(trim(name))
  |
  Filter: !isNull(email) && length(email) > 5
  |
  Conditional Split:
    Valid (email contains '@') --> Sink: clean_customers
    Invalid --> Sink: rejected_customers (for review)

Common Mistakes

1. Leaving Debug On Overnight

The Spark cluster keeps running and charging. Set TTL or turn Debug off manually. Think of it as leaving your car engine running while you sleep — expensive and unnecessary.

2. Using Data Flow for Simple Copies

If you just need to move data without transformations, use Copy Activity. It is faster, cheaper, and does not need a Spark cluster. Do not use a factory assembly line when you just need a courier.

3. Joining Large Tables Without Broadcast

Two large tables joined without optimization cause a “shuffle” — all data moves across the network. If one table is small (under 50 MB), broadcast it.

4. Not Previewing Data at Each Step

Build incrementally. Add one transformation, preview the data, confirm it looks right, then add the next. Do not build 10 transformations and then wonder why the output is wrong.

5. Ignoring Partition Skew

If you partition by a column with uneven distribution (e.g., country where 80% of data is USA), one partition gets overwhelmed. Use hash partitioning or salt the key.
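Salting in a nutshell, as a plain-Python sketch: append a small rotating suffix to the hot key so its rows spread across several buckets. In a real Data Flow this maps to partitioning settings or a derived salted key, not hand-written code:

```python
# A skewed dataset: 8 of 10 rows share one key value. Salting the key with a
# rotating suffix spreads the hot value across up to num_salts buckets.
from collections import Counter

rows = [{"country": "USA"}] * 8 + [{"country": "CA"}] * 2
num_salts = 4

salted = [{**r, "salted_key": f"{r['country']}_{i % num_salts}"}
          for i, r in enumerate(rows)]

# Unsalted, one bucket would hold 8 rows; salted, no bucket holds more than 2.
print(Counter(r["salted_key"] for r in salted))
```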

Interview Questions

Q: What is a Data Flow in ADF and when would you use it?
A: A Data Flow is a visual data transformation designer powered by Spark. Use it when you need to clean, join, aggregate, or reshape data — transformations beyond what Copy Activity can do. It is code-free and runs on managed Spark clusters.

Q: What is the difference between Data Flow and Copy Activity?
A: Copy Activity moves data as-is from source to sink (fast, cheap). Data Flow transforms data with joins, aggregations, filters, and derived columns (slower, more expensive, runs on Spark). Use Copy for ingestion, Data Flow for transformation.

Q: Why does the Debug cluster cost money even when idle?
A: Debug mode provisions a Spark cluster that stays running for instant data previews. The cluster consumes compute resources whether you are actively using it or not. Set TTL to auto-shutdown or manually turn off Debug when not needed.

Q: What is the difference between a Join and a Lookup in Data Flows?
A: Join combines two full datasets on matching keys (like SQL JOIN). Lookup enriches the primary stream by fetching matching values from a reference dataset without duplicating rows. Use Join for combining equal datasets, Lookup for adding reference columns.

Q: How do you optimize Data Flow performance?
A: Right-size the cluster, broadcast small tables in joins, drop unnecessary columns early with Select, use partitioned source data, and set TTL on the Integration Runtime to avoid cold starts.

Wrapping Up

Data Flows bring the power of Apache Spark to your fingertips without writing code. They are the transformation engine that turns raw Bronze layer data into clean, aggregated, business-ready Gold layer data.

The key is knowing WHEN to use them: not for simple data movement (that is Copy Activity’s job), but for the heavy lifting — joins, aggregations, cleanup, and reshaping that turn raw data into insights.

And always remember to turn off the Debug cluster when you leave. Your Azure bill will thank you.



Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
