Data Flows in Azure Data Factory and Synapse: Visual Data Transformation Without Code
In all our previous pipelines, we moved data from Point A to Point B using the Copy activity. We never changed the data — we just transported it. That is like a moving company that carries your furniture to a new house but never assembles, cleans, or rearranges anything.
Data Flows are the assembly team. They take raw, messy data and transform it — join tables, filter rows, add calculated columns, aggregate numbers, remove duplicates, and reshape data — all through a visual drag-and-drop interface powered by Apache Spark under the hood.
The best part? You do not write a single line of Spark or Python code. You drag boxes, connect them with arrows, and configure each step visually. ADF generates the Spark code behind the scenes.
Table of Contents
- What Are Data Flows?
- Data Flows vs Copy Activity: When to Use Which
- How Data Flows Work Under the Hood
- The Debug Cluster (And Why It Costs Money)
- The Transformation Types
- Building Your First Data Flow: Step by Step
- Source Transformation
- Filter Transformation
- Derived Column Transformation
- Join Transformation
- Aggregate Transformation
- Select Transformation
- Sort Transformation
- Sink Transformation
- Conditional Split (If/Else for Data)
- Union (Combining Datasets)
- Lookup Transformation
- Exists Transformation
- Alter Row (Insert, Update, Delete, Upsert)
- Window Transformation
- Data Flow Parameters
- Data Preview and Debugging
- Performance Optimization
- Cost and Pricing
- Real-World Scenarios
- Common Mistakes
- Interview Questions
- Wrapping Up
What Are Data Flows?
Think of a Data Flow like a kitchen assembly line in a restaurant:
Raw ingredients (Source) --> Wash (Filter) --> Chop (Derived Column) -->
Combine (Join) --> Cook (Aggregate) --> Plate (Select) --> Serve (Sink)
Each station does one thing. The ingredients flow from one station to the next. At the end, you have a finished dish (transformed data) ready to serve (write to the destination).
In technical terms, a Data Flow is a visual data transformation designer that runs on a managed Spark cluster. You build the transformation logic by dragging transformation blocks onto a canvas and connecting them.
Data Flows vs Copy Activity: When to Use Which
| Feature | Copy Activity | Data Flow |
|---|---|---|
| Purpose | Move data as-is (A to B) | Transform data (clean, join, aggregate) |
| Engine | Native optimized copy engine | Apache Spark cluster |
| Transformations | Column mapping only | Joins, aggregations, filters, derived columns, and 20+ more |
| Speed for simple copy | Fast (optimized for data movement) | Slower (Spark startup overhead) |
| Cost | Cheaper (DIU-based) | More expensive (Spark cluster) |
| Code required | None | None (visual designer) |
| Startup time | Instant | 3-5 minutes (cluster spin-up) |
Real-life analogy:
- Copy Activity = A courier service. They pick up your package and deliver it to the destination. They do not open it, inspect it, or change it. Fast and cheap.
- Data Flow = A factory. Raw materials go in, finished products come out. The materials are cleaned, shaped, assembled, tested, and packaged. Slower to start but does the actual work.
When to use Copy Activity: Raw data ingestion (Bronze layer). Moving data from source to data lake without transformation.
When to use Data Flow: Silver layer transformations. Cleaning, joining, aggregating, and reshaping data before loading into Gold layer or warehouse.
How Data Flows Work Under the Hood
When you add a Data Flow activity to a pipeline and run it:
- ADF/Synapse provisions a Spark cluster (managed, serverless)
- Your visual transformation logic is translated into Spark code automatically
- The Spark cluster executes the transformations
- Results are written to the Sink
- The cluster shuts down (unless TTL keeps it warm)
You never see the Spark code, never manage the cluster, never install libraries. It is fully managed.
Think of it like driving an automatic car. You press the gas (click Run), the car handles the gears (Spark optimization), and you arrive at your destination (transformed data). You do not need to know how the transmission works — but knowing it exists helps you understand performance.
The Debug Cluster (And Why It Costs Money)
This is where many people get surprised with their Azure bill.
When you turn on Data Flow Debug in ADF/Synapse Studio, it starts a Spark cluster that stays running so you can preview data at each transformation step. This is incredibly useful for development — you see your data flowing through each transformation in real-time.
But here is the catch: That cluster costs money every minute it runs, even if you are not actively using it. It is like leaving your car engine running while you go grocery shopping. The engine burns fuel whether you are driving or not.
Debug cluster cost: ~$0.27 per vCore-hour
Default: 8 cores = ~$2.16/hour
Left on for 8 hours while you are in meetings = ~$17 wasted
Left on overnight (forgot to turn off) = ~$40+ wasted
How to Manage Debug Costs
- Set Time to Live (TTL) to 60 minutes. The cluster auto-shuts down after 60 minutes of inactivity.
- Turn off Debug when you are done developing. Click the Debug toggle OFF.
- Use the smallest cluster for development (4 cores is enough for testing).
- Never leave Debug on overnight. Set a calendar reminder or auto-shutdown policy.
Real-life analogy: Debug mode is like keeping a taxi waiting outside with the meter running. Great when you need quick rides between testing steps. Expensive if you forget the taxi is there.
The Transformation Types
Data Flows offer 20+ transformation types. Here are the most important ones, organized by what they do:
Filtering and Routing
| Transformation | What It Does | Real-Life Analogy |
|---|---|---|
| Filter | Removes rows that do not meet a condition | Sorting mail — throwing away junk, keeping bills |
| Conditional Split | Routes rows to different paths based on conditions | Airport baggage sorting — domestic vs international vs oversized |
| Exists | Keeps only rows that exist in another dataset | Checking a guest list — only people on the list get in |
Shaping and Cleaning
| Transformation | What It Does | Real-Life Analogy |
|---|---|---|
| Derived Column | Adds new calculated columns or modifies existing ones | Adding a tip calculation column to a restaurant bill |
| Select | Picks specific columns, renames, reorders | Choosing which items to pack for a trip — leave the rest |
| Cast | Changes data types (string to integer, etc.) | Converting currency — turning dollars into euros |
Combining Data
| Transformation | What It Does | Real-Life Analogy |
|---|---|---|
| Join | Combines two datasets on matching keys (like SQL JOIN) | Matching students with their grades using student ID |
| Union | Stacks two datasets vertically (like SQL UNION) | Combining two guest lists into one big list |
| Lookup | Enriches rows with values from a reference dataset | Looking up a zip code to add city and state |
Aggregating
| Transformation | What It Does | Real-Life Analogy |
|---|---|---|
| Aggregate | Groups rows and calculates SUM, AVG, COUNT, etc. | Monthly expense report — total spending per category |
| Window | Calculates over a window of rows (running totals, ranks) | Leaderboard rankings in a video game |
Writing
| Transformation | What It Does | Real-Life Analogy |
|---|---|---|
| Sink | Writes the final result to a destination (ADLS, SQL, etc.) | Filing the finished report in a cabinet |
| Alter Row | Marks rows for INSERT, UPDATE, DELETE, or UPSERT | Deciding what to do with each item — keep, replace, or throw away |
Building Your First Data Flow: Step by Step
Let us build a Data Flow that: 1. Reads customer data from ADLS Gen2 (Parquet) 2. Filters out inactive customers 3. Adds a full name column (first + last) 4. Joins with orders data 5. Aggregates total order amount per customer 6. Writes the result to a SQL table
Step 1: Create the Data Flow
- In ADF Studio: Author > Data flows > + New data flow > Mapping Data Flow
- Name:
DF_Customer_Orders_Summary
In Synapse Studio: Develop > + > Data flow
Step 2: Turn On Debug (Optional but Recommended)
Click the Data flow debug toggle at the top. Select cluster size (4+4 cores for dev). Wait 3-5 minutes for the cluster to start. You will see a green indicator when ready.
Remember: this starts the meter. Turn it off when done!
Source Transformation
Every Data Flow starts with at least one Source — where the data comes from.
- Click Add Source on the canvas
- Name:
CustomerSource - Dataset: select your ADLS Parquet dataset (or create inline)
- Configure:
- Source type: Dataset or Inline (inline lets you configure without a separate dataset)
- Sampling: you can sample rows for debugging (e.g., first 1000 rows)
With Debug on, click Data preview tab at the bottom to see actual data flowing through.
Think of the Source as opening a book to read it. You have not done anything with the content yet — you are just loading it into the data flow.
Multiple Sources
You can have multiple sources in one Data Flow. For our example, add a second source:
- Click Add Source again
- Name:
OrderSource - Dataset: orders Parquet dataset
Filter Transformation
Removes rows that do not meet your condition. Like a bouncer at a club checking IDs.
- Click the + icon on the CustomerSource output arrow
- Select Filter
- Name:
FilterActive - Filter expression:
status == 'Active'
Or using the expression builder:
isNull(deactivated_date)
Data preview now shows only active customers.
Common Filter Expressions
status == 'Active' -- exact match
salary > 50000 -- numeric comparison
!isNull(email) -- not null
year(order_date) == 2026 -- date function
contains(product_name, 'Pro') -- string contains
length(phone) == 10 -- string length
Derived Column Transformation
Adds new columns or modifies existing ones. Like adding a calculated field to a spreadsheet.
- Click + on FilterActive output
- Select Derived column
- Name:
AddFullName - Add columns:
- Column:
full_name, Expression:concat(first_name, ' ', last_name) - Column:
email_domain, Expression:split(email, '@')[2] - Column:
salary_band, Expression:iif(salary > 80000, 'Senior', iif(salary > 60000, 'Mid', 'Junior'))
Common Derived Column Expressions
concat(first_name, ' ', last_name) -- combine strings
upper(city) -- uppercase
trim(name) -- remove whitespace
toInteger(string_column) -- type conversion
year(order_date) -- extract year
addDays(currentDate(), -30) -- date arithmetic
iif(amount > 1000, 'High', 'Low') -- conditional
coalesce(phone, 'N/A') -- null handling
sha2(256, email) -- hashing (for PII)
regexReplace(phone, '[^0-9]', '') -- regex cleanup
Real-life analogy: Derived columns are like adding a calculated column to a spreadsheet. The original data stays, and you add new columns based on formulas.
Join Transformation
Combines two data streams on matching keys. Just like SQL JOIN but visual.
- Click + on AddFullName output
- Select Join
- Name:
JoinOrders - Right stream: select
OrderSource - Join type: Inner (or Left Outer, Right Outer, Full Outer, Cross)
- Join condition:
CustomerSource@customer_id == OrderSource@customer_id
Join Types Available
| Join Type | What It Does | Real-Life Analogy |
|---|---|---|
| Inner | Only matching rows from both sides | Wedding seating — only couples where both RSVP’d |
| Left Outer | All from left, matching from right | Employee list with optional department info |
| Right Outer | All from right, matching from left | Department list with optional employee info |
| Full Outer | Everything from both sides | Complete inventory — items in warehouse AND in transit |
| Cross | Every combination (Cartesian product) | Every color x every size = all product variants |
Aggregate Transformation
Groups rows and calculates summary values. Like creating a pivot table in Excel.
- Click + on JoinOrders output
- Select Aggregate
- Name:
AggByCustomer - Group by:
customer_id,full_name,email - Aggregates:
- Column:
total_orders, Expression:count(order_id) - Column:
total_amount, Expression:sum(amount) - Column:
avg_order_value, Expression:round(avg(amount), 2) - Column:
last_order_date, Expression:max(order_date)
Real-life analogy: Aggregate is like your credit card monthly statement. It groups your transactions by merchant and shows the total amount spent at each one.
Select Transformation
Picks, renames, and reorders columns. Like choosing which columns to show in a final report.
- Click + on AggByCustomer output
- Select Select
- Name:
SelectFinalColumns - Choose columns: customer_id, full_name, email, total_orders, total_amount, avg_order_value, last_order_date
- Rename if needed:
total_amounttolifetime_value
Use Select to clean up before writing to the Sink. Remove intermediate columns, rename for clarity, and set the final column order.
Sort Transformation
Orders rows by one or more columns.
- Click + on SelectFinalColumns output
- Select Sort
- Name:
SortByValue - Sort by:
lifetime_valueDescending
Your highest-value customers appear first.
Sink Transformation
Writes the final transformed data to a destination. Every Data Flow must end with at least one Sink.
- Click + on SortByValue output
- Select Sink
- Name:
WriteToSQL - Dataset: Azure SQL Database dataset (or ADLS Parquet dataset)
- Settings:
- Update method: Insert (or Upsert if using Alter Row)
- Table action: Recreate table (for dev) or None (for production)
Sink Options
| Setting | Options | When to Use |
|---|---|---|
| Table action | None, Recreate, Truncate | Recreate for dev, None for prod |
| Update method | Insert, Upsert, Update, Delete | Insert for append, Upsert for merge |
| Batch size | Default or custom | Increase for large datasets |
| Pre/Post SQL | Custom SQL to run before/after | Truncate staging table before insert |
Conditional Split (If/Else for Data)
Routes rows to different outputs based on conditions. Like a highway interchange — different cars take different exits.
All Orders
|-- Amount > 1000 --> HighValueOrders (Sink 1)
|-- Amount 100-1000 --> MediumOrders (Sink 2)
|-- Amount < 100 --> SmallOrders (Sink 3)
Configure:
– Stream 1: HighValue, Condition: amount > 1000
– Stream 2: Medium, Condition: amount >= 100 && amount <= 1000
– Default stream: Small (everything else)
Each output stream can have its own downstream transformations and sinks.
Union (Combining Datasets)
Stacks two or more data streams vertically — like SQL UNION ALL. Both streams must have the same column structure.
- Click + on any transformation output
- Select Union
- Name:
CombineAllCustomers - Add streams: click + Add stream for each additional input
Example: Combine customers from three regional sources
USCustomers (500 rows) ──┐
├──→ Union (1,300 rows) ──→ Sink
EUCustomers (450 rows) ──┤
│
APACCustomers (350 rows)──┘
All three streams must have the same columns (or be mapped in the Union settings).
Column Mapping in Union
If column names differ between streams, Union lets you map them:
Stream 1: customer_name, customer_email
Stream 2: cust_name, cust_email
Union mapping:
customer_name ←→ cust_name
customer_email ←→ cust_email
Output: customer_name, customer_email (uses first stream's names)
Key difference from Join: Join combines columns side-by-side (horizontal). Union stacks rows top-to-bottom (vertical). Join needs a matching key. Union just appends rows.
Real-life analogy: Union is like combining three separate guest lists into one master list. Each list has the same columns (name, email, RSVP status). You stack them together. Join is like matching names from a guest list with a seating chart — side-by-side combination on a key.
Where we used Union: In our SCD Type 2 pipeline, Union combines new records with new SCD2 versions before they flow to the insert sink.
Lookup Transformation
Enriches the primary stream by fetching matching values from a reference dataset. Similar to a left outer join but designed for reference lookups — it does not duplicate rows when there are multiple matches.
- Click + on a transformation output
- Select Lookup
- Name:
LookupCountry - Primary stream: your main data
- Lookup stream: the reference dataset (e.g., country codes)
- Lookup condition:
primary@country_code == lookup@code - Multiple matches: First match, Last match, or Any match
Primary stream (orders):
| order_id | product_id | amount |
|----------|-----------|--------|
| 1 | P100 | 50.00 |
| 2 | P200 | 75.00 |
Lookup stream (products):
| product_id | product_name | category |
|-----------|-------------|------------|
| P100 | Widget | Hardware |
| P200 | Gadget | Electronics|
After Lookup (product_id == product_id):
| order_id | product_id | amount | product_name | category |
|----------|-----------|--------|-------------|------------|
| 1 | P100 | 50.00 | Widget | Hardware |
| 2 | P200 | 75.00 | Gadget | Electronics|
Lookup vs Join: When to Use Which
| Feature | Join | Lookup |
|---|---|---|
| Purpose | Combine two equal datasets | Enrich primary stream with reference data |
| Multiple matches | Duplicates rows (one per match) | Returns first/last/any match (no duplication) |
| No match behavior | Depends on join type (inner drops, outer keeps) | Adds NULL columns for non-matches |
| Use case | Customer + Orders (many-to-many) | Orders + Product names (enrich with reference) |
Real-life analogy: Join is like merging two spreadsheets side-by-side. Lookup is like using VLOOKUP in Excel — you have your main data and you pull in values from a reference table.
Where we used Lookup: In every SCD pipeline on this blog, the Lookup transformation joins source records with the existing dimension table to detect new vs changed vs unchanged records.
Exists Transformation
Filters the primary stream to keep only rows that exist (or do NOT exist) in a second stream. Like SQL WHERE EXISTS or WHERE NOT EXISTS.
- Click + on a transformation output
- Select Exists
- Name:
ExistsInActiveList - Right stream: the reference dataset to check against
- Exists type: Exists (keep matches) or Doesn’t exist (keep non-matches)
- Condition:
primary@customer_id == reference@customer_id
Example 1: Keep only customers who have placed orders (Exists)
Customers (1000 rows) ──→ Exists (customer_id == order_customer_id) ──→ 750 rows
Orders (5000 rows) ──┘
Result: 750 customers who have at least one order
Example 2: Find customers who have NEVER ordered (Doesn't Exist)
Customers (1000 rows) ──→ Doesn't Exist (customer_id == order_customer_id) ──→ 250 rows
Orders (5000 rows) ──┘
Result: 250 customers with zero orders (target for marketing campaign)
Exists vs Join vs Lookup
| Transformation | Output | Adds columns? | Use When |
|---|---|---|---|
| Join | Combined rows from both sides | Yes — all columns from both | You need data from BOTH streams |
| Lookup | Primary rows enriched with reference | Yes — adds lookup columns | You need to add reference values |
| Exists | Primary rows only (filtered) | No — only primary columns | You just need to filter by existence |
Real-life analogy: Exists is like checking a guest list at a party. You do not care about the guest list details (name, table number) — you only care whether the person IS on the list or not. Join would give you the person’s assigned table. Lookup would give you their RSVP details. Exists just says “yes, they are on the list” or “no, they are not.”
Alter Row (Insert, Update, Delete, Upsert)
Marks each row with an operation — INSERT, UPDATE, DELETE, or UPSERT. Required before any Sink that does more than plain inserts.
- Click + before the Sink
- Select Alter Row
- Name:
SetRowAction - Conditions:
| Action | Condition Expression | When |
|---|---|---|
| Insert if | isNull(DIM_CustomerKey) |
New record — no match in dimension |
| Update if | !isNull(DIM_CustomerKey) && hashChanged == true() |
Existing record with changes |
| Upsert if | true() |
Insert if new, update if exists (catch-all) |
| Delete if | is_deleted == true() |
Soft delete flag from source |
Without Alter Row:
Sink can only INSERT → "Allow insert" is the only option
With Alter Row:
Sink can INSERT, UPDATE, DELETE, or UPSERT
The Alter Row tells the Sink WHAT to do with each row
Important: Alter Row does NOT perform the operation itself.
It only MARKS the row. The Sink performs the actual database operation.
Sink configuration after Alter Row:
- Allow insert: checked (for rows marked as Insert)
- Allow update: checked (for rows marked as Update)
- Allow upsert: checked (for rows marked as Upsert)
- Allow delete: checked (for rows marked as Delete)
- Key columns: the column(s) used to match existing rows (e.g., CustomerKey)
Real-life analogy: Alter Row is like a postal sorting machine. Each letter gets stamped: “deliver” (insert), “return to sender” (delete), “update address” (update), or “deliver or forward” (upsert). The stamp does not deliver the letter — it tells the mail carrier (Sink) what to do with it.
Where we used Alter Row: In every SCD pipeline — Type 1 uses Insert + Update, Type 2 uses Insert (new versions) + Update (expire old rows), and the combined pipeline uses all three sinks with different Alter Row configurations.
Window Transformation
Calculates values over a “window” of rows — running totals, rankings, row numbers — without collapsing rows like Aggregate does. Equivalent to SQL window functions (ROW_NUMBER, RANK, SUM OVER).
- Click + on a transformation output
- Select Window
- Name:
RankCustomers - Over tab: partition columns (e.g.,
region) - Sort tab: order within partition (e.g.,
total_spend DESC) - Window columns: new column name + expression
Common Window Expressions
| Expression | What It Does | Use Case |
|---|---|---|
rowNumber() |
Sequential number within partition | Deduplication — keep RowNum == 1 |
rank() |
Rank with gaps (1, 2, 2, 4) | Top N per category with ties |
denseRank() |
Rank without gaps (1, 2, 2, 3) | Continuous ranking |
cumSum(amount) |
Running total | YTD revenue by month |
lag(price, 1) |
Previous row’s value | Price change detection |
lead(price, 1) |
Next row’s value | Forward-looking comparison |
Window vs Aggregate
Aggregate: 100 rows → Group by region → 5 rows (one per region)
Collapses rows. You lose individual row detail.
Window: 100 rows → Partition by region, add rank → 100 rows (each with a rank)
Keeps all rows. Adds calculated columns alongside existing data.
Example:
Input:
| customer | region | spend |
|----------|--------|-------|
| Alice | East | 500 |
| Bob | East | 300 |
| Carol | West | 700 |
Aggregate (SUM by region):
| region | total_spend |
|--------|-------------|
| East | 800 |
| West | 700 |
(2 rows — individual customers gone)
Window (RANK by spend within region):
| customer | region | spend | rank |
|----------|--------|-------|------|
| Alice | East | 500 | 1 |
| Bob | East | 300 | 2 |
| Carol | West | 700 | 1 |
(3 rows — all customers preserved, rank added)
Real-life analogy: Aggregate is like a class report card that shows only the class average — individual students disappear. Window is like a report card that shows each student’s grade AND their rank in the class — everyone is still there, but now you know where they stand relative to peers.
Where we used Window: In our production DQ pipeline, the Window transformation with rowNumber() partitioned by CustomerID and sorted by UpdateDate DESC is how we deduplicate — keeping only the latest record per customer.
Data Flow Parameters
Data Flows can accept parameters from the pipeline, making them reusable:
- In the Data Flow designer, click Parameters at the top
- Add:
SourceFolder(String),TargetTable(String) - Use in transformations:
$SourceFolder,$TargetTable
In the pipeline’s Data Flow activity, pass values:
SourceFolder: @pipeline().parameters.FolderPath
TargetTable: @concat('staging_', pipeline().parameters.TableName)
Real-life analogy: Parameters are like adjustable settings on a washing machine. The same machine washes different loads — you just change the settings (temperature, cycle) each time.
Data Preview and Debugging
With Debug mode on, click Data preview on ANY transformation to see the data at that point:
Source (1000 rows) --> Filter (750 rows) --> Join (680 rows) --> Aggregate (45 rows) --> Sink
You can see exactly how many rows survive each step and inspect the actual data. This is incredibly powerful for debugging.
Real-life analogy: Data preview is like a glass-walled factory. You can walk along the production line and peek through the glass at any station to see the work in progress. Without it, you only see what goes in and what comes out, with no visibility into the middle.
Performance Optimization
1. Choose the Right Cluster Size
| Workload | Recommended Cores |
|---|---|
| Development/testing | 4+4 (8 total) |
| Small transformations (under 1 GB) | 8+8 (16 total) |
| Medium transformations (1-10 GB) | 16+16 (32 total) |
| Large transformations (10+ GB) | 32+32 or more |
2. Partition Your Data
Spark works best with partitioned data. If your source is Parquet with date partitions, Spark reads only the partitions it needs.
3. Avoid Unnecessary Columns Early
Use a Select transformation after the Source to drop columns you do not need. Less data flowing through the pipeline = faster execution.
4. Use Broadcast for Small Lookup Tables
When joining a large table with a small reference table, enable Broadcast on the small side. This copies the small table to every Spark node, avoiding an expensive shuffle.
5. Set TTL on the Integration Runtime
TTL keeps the Spark cluster warm between runs. Second run starts instantly instead of waiting 3-5 minutes.
TTL = 0: Every run waits 3-5 minutes (cold start)
TTL = 10: Second run within 10 minutes = instant start
TTL = 60: Cluster stays warm for 1 hour
Cost and Pricing
| Component | Cost | Example |
|---|---|---|
| Data Flow execution | ~$0.27 per vCore-hour | 8 cores for 15 minutes = $0.54 |
| Debug cluster | Same rate | 8 cores for 2 hours = $4.32 |
| Cluster startup | Included (but takes 3-5 min) | No extra cost, just time |
Cost optimization: – Use smallest cluster for development – Set TTL to avoid repeated cold starts – Turn off Debug when not actively developing – Schedule Data Flows during off-peak hours if possible
Real-World Scenarios
Scenario 1: Customer 360 View
Source: CRM data (customers)
Source: Order history (transactions)
Source: Support tickets (interactions)
|
Join: CRM + Orders on customer_id
Join: Result + Tickets on customer_id
|
Derived Column: customer_segment (based on total spend)
Derived Column: health_score (based on recent activity)
|
Aggregate: total_spend, ticket_count, last_interaction
|
Sink: dim_customer table in SQL warehouse
Scenario 2: Daily Sales Aggregation
Source: Raw transactions (Bronze layer, Parquet)
|
Filter: Remove test transactions (amount > 0)
Filter: Remove future dates (order_date <= currentDate())
|
Derived Column: revenue = quantity * unit_price
Derived Column: order_month = format(order_date, 'yyyy-MM')
|
Aggregate: Group by product, region, order_month
SUM(revenue), COUNT(order_id), AVG(unit_price)
|
Sink: fact_daily_sales (Gold layer, SQL pool)
Scenario 3: Data Quality Cleanup
Source: Raw customer data (messy CSV from vendor)
|
Derived Column: email = lower(trim(email))
Derived Column: phone = regexReplace(phone, '[^0-9]', '')
Derived Column: name = initCap(trim(name))
|
Filter: !isNull(email) && length(email) > 5
|
Conditional Split:
Valid (email contains '@') --> Sink: clean_customers
Invalid --> Sink: rejected_customers (for review)
Common Mistakes
1. Leaving Debug On Overnight
The Spark cluster keeps running and charging. Set TTL or turn Debug off manually. Think of it as leaving your car engine running while you sleep — expensive and unnecessary.
2. Using Data Flow for Simple Copies
If you just need to move data without transformations, use Copy Activity. It is faster, cheaper, and does not need a Spark cluster. Do not use a factory assembly line when you just need a courier.
3. Joining Large Tables Without Broadcast
Two large tables joined without optimization cause a “shuffle” — all data moves across the network. If one table is small (under 50 MB), broadcast it.
4. Not Previewing Data at Each Step
Build incrementally. Add one transformation, preview the data, confirm it looks right, then add the next. Do not build 10 transformations and then wonder why the output is wrong.
5. Ignoring Partition Skew
If you partition by a column with uneven distribution (e.g., country where 80% of data is USA), one partition gets overwhelmed. Use hash partitioning or salt the key.
Interview Questions
Q: What is a Data Flow in ADF and when would you use it? A: A Data Flow is a visual data transformation designer powered by Spark. Use it when you need to clean, join, aggregate, or reshape data — transformations beyond what Copy Activity can do. It is code-free and runs on managed Spark clusters.
Q: What is the difference between Data Flow and Copy Activity? A: Copy Activity moves data as-is from source to sink (fast, cheap). Data Flow transforms data with joins, aggregations, filters, and derived columns (slower, more expensive, runs on Spark). Use Copy for ingestion, Data Flow for transformation.
Q: Why does the Debug cluster cost money even when idle? A: Debug mode provisions a Spark cluster that stays running for instant data previews. The cluster consumes compute resources whether you are actively using it or not. Set TTL to auto-shutdown or manually turn off Debug when not needed.
Q: What is the difference between a Join and a Lookup in Data Flows? A: Join combines two full datasets on matching keys (like SQL JOIN). Lookup enriches the primary stream by fetching matching values from a reference dataset without duplicating rows. Use Join for combining equal datasets, Lookup for adding reference columns.
Q: How do you optimize Data Flow performance? A: Right-size the cluster, broadcast small tables in joins, drop unnecessary columns early with Select, use partitioned source data, and set TTL on the Integration Runtime to avoid cold starts.
Wrapping Up
Data Flows bring the power of Apache Spark to your fingertips without writing code. They are the transformation engine that turns raw Bronze layer data into clean, aggregated, business-ready Gold layer data.
The key is knowing WHEN to use them: not for simple data movement (that is Copy Activity’s job), but for the heavy lifting — joins, aggregations, cleanup, and reshaping that turn raw data into insights.
And always remember to turn off the Debug cluster when you leave. Your Azure bill will thank you.
Related posts: – What is Azure Data Factory? – Metadata-Driven Pipeline in ADF – Parquet vs CSV vs JSON – Schema-on-Write vs Schema-on-Read – ADF Expressions Guide
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.