Joins in Azure Synapse Data Flows: Every Join Type Explained with Real-World Examples
In SQL, you write JOIN in a query. In Synapse Data Flows, you drag a Join box onto the canvas. Same concept, visual interface, Apache Spark engine underneath.
But here is where most tutorials fail — they show you HOW to configure a join but not WHEN to use each type, what happens to your data, and why your row count suddenly tripled.
This post covers every join type available in Synapse (and ADF) Data Flows with real datasets, step-by-step UI configuration, visual before/after tables, and the real-world scenarios where each join is the right choice.
Think of joins like matchmaking at a conference. You have two guest lists (datasets), and you need to combine them based on some common attribute (the join key). The join type determines what happens to guests who are on one list but not the other.
Table of Contents
- What Is a Join in Data Flows?
- The Sample Data
- Inner Join
- Left Outer Join
- Right Outer Join
- Full Outer Join
- Cross Join
- Exists (Anti-Join Alternative)
- Lookup vs Join: When to Use Which
- Join on Multiple Keys
- Join with Expressions (Non-Equality Joins)
- Handling Duplicate Column Names
- Broadcast Optimization
- Real-World Scenarios
- Performance Tips
- Common Mistakes
- Interview Questions
- Wrapping Up
What Is a Join in Data Flows?
A Join transformation in Data Flows combines two data streams based on matching keys. It takes a left stream (primary) and a right stream (secondary) and produces a single output stream with columns from both.
Left Stream (Customers)              Right Stream (Orders)
+----+-------+                       +--------+------+--------+
| id | name  |                       | ord_id | c_id | amount |
+----+-------+         JOIN          +--------+------+--------+
| 1  | Alice |          ON           | 101    | 1    | 500    |
| 2  | Bob   |        c.id =         | 102    | 1    | 300    |
| 3  | Carol |        o.c_id         | 103    | 3    | 700    |
| 4  | Dave  |                       +--------+------+--------+
+----+-------+
Real-life analogy: A join is like matching RSVPs with a seating chart. The RSVP list has guest names and dietary preferences. The seating chart has guest names and table assignments. You combine them to get: guest name, dietary preference, AND table number — all in one row.
The Sample Data
I will use these two datasets throughout:
Customers (Left Stream):
| customer_id | name | city | status |
|---|---|---|---|
| 1 | Alice | Toronto | Active |
| 2 | Bob | Mumbai | Active |
| 3 | Carol | London | Inactive |
| 4 | Dave | Toronto | Active |
| 5 | Eve | Paris | Active |
Orders (Right Stream):
| order_id | customer_id | amount | order_date |
|---|---|---|---|
| 101 | 1 | 500 | 2026-01-15 |
| 102 | 1 | 300 | 2026-02-10 |
| 103 | 3 | 700 | 2026-03-05 |
| 104 | 6 | 150 | 2026-03-20 |
Notice:
- Alice (id=1) has 2 orders
- Carol (id=3) has 1 order
- Dave (id=4) and Eve (id=5) have NO orders
- Order 104 references customer_id=6, who does NOT exist in Customers
This is realistic data — mismatches and gaps are normal.
Inner Join
What It Does
Returns only rows where BOTH sides have a match. Like a married couples dance — only people who came WITH a partner get on the dance floor. Singles wait at the bar.
How to Configure
- Add a Join transformation after your left stream
- Name: JoinOrders
- Right stream: select OrdersSource
- Join type: Inner
- Join condition: Customers@customer_id == Orders@customer_id
Result
| customer_id | name | city | order_id | amount | order_date |
|---|---|---|---|---|---|
| 1 | Alice | Toronto | 101 | 500 | 2026-01-15 |
| 1 | Alice | Toronto | 102 | 300 | 2026-02-10 |
| 3 | Carol | London | 103 | 700 | 2026-03-05 |
3 rows. Alice appears twice because she has 2 orders (one-to-many relationship).
Who is missing:
- Bob, Dave, and Eve: they have no orders (no match on the right side)
- Order 104: customer_id=6 does not exist in Customers (no match on the left side)
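The inner-join logic above can be sketched in plain Python on the sample data (illustrative only; the actual Join transformation runs on Spark):

```python
customers = [
    {"customer_id": 1, "name": "Alice"},
    {"customer_id": 2, "name": "Bob"},
    {"customer_id": 3, "name": "Carol"},
    {"customer_id": 4, "name": "Dave"},
    {"customer_id": 5, "name": "Eve"},
]
orders = [
    {"order_id": 101, "customer_id": 1, "amount": 500},
    {"order_id": 102, "customer_id": 1, "amount": 300},
    {"order_id": 103, "customer_id": 3, "amount": 700},
    {"order_id": 104, "customer_id": 6, "amount": 150},
]

# Inner join: one combined row per matching (customer, order) pair.
inner = [
    {**c, **o}
    for c in customers
    for o in orders
    if c["customer_id"] == o["customer_id"]
]

print(len(inner))                  # 3
print([r["name"] for r in inner])  # ['Alice', 'Alice', 'Carol']
```

Note that Alice appears twice (one row per order), while unmatched rows on either side simply drop out.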
When to Use Inner Join
- You only want customers WHO HAVE ordered
- Building fact tables where every row needs valid dimension references
- Matching two datasets where incomplete records should be excluded
Real-life analogy: An Inner Join is like a job interview match. Only candidates (left) who match an open position (right) get an interview. Candidates with no matching positions and positions with no matching candidates are both excluded.
Left Outer Join
What It Does
Returns ALL rows from the left stream and matching rows from the right. If a left row has no match on the right, the right columns are filled with NULL.
Real-life analogy: It is like a class attendance sheet. Every student (left) appears on the sheet. If they submitted homework (right), you note the grade. If they did not submit, you write “N/A”. No student is left off the sheet.
How to Configure
- Join type: Left outer
- Same join condition:
Customers@customer_id == Orders@customer_id
Result
| customer_id | name | city | order_id | amount | order_date |
|---|---|---|---|---|---|
| 1 | Alice | Toronto | 101 | 500 | 2026-01-15 |
| 1 | Alice | Toronto | 102 | 300 | 2026-02-10 |
| 2 | Bob | Mumbai | NULL | NULL | NULL |
| 3 | Carol | London | 103 | 700 | 2026-03-05 |
| 4 | Dave | Toronto | NULL | NULL | NULL |
| 5 | Eve | Paris | NULL | NULL | NULL |
6 rows. All five customers appear; Alice appears twice because she has two orders. Bob, Dave, and Eve have NULLs for the order columns because they have no orders. Order 104 (customer_id=6) is still excluded because it has no matching customer on the left.
When to Use Left Outer Join
- Customer 360 view — show all customers with their orders (including customers who never ordered)
- Data quality checks — find customers without orders (filter WHERE order_id IS NULL after the join)
- Preserving the complete primary dataset while enriching with optional data
Finding Records Without Matches
Add a Filter transformation after the Left Outer Join:
isNull(order_id)
Result: Bob, Dave, Eve — customers who have never placed an order. This is one of the most common data engineering patterns.
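The pattern can be sketched in plain Python: left-outer join, then keep only rows where the right side is NULL (again illustrative, not the Data Flow engine itself):

```python
customers = [
    {"customer_id": 1, "name": "Alice"},
    {"customer_id": 2, "name": "Bob"},
    {"customer_id": 3, "name": "Carol"},
    {"customer_id": 4, "name": "Dave"},
    {"customer_id": 5, "name": "Eve"},
]
orders = [
    {"order_id": 101, "customer_id": 1, "amount": 500},
    {"order_id": 102, "customer_id": 1, "amount": 300},
    {"order_id": 103, "customer_id": 3, "amount": 700},
    {"order_id": 104, "customer_id": 6, "amount": 150},
]

left_outer = []
for c in customers:
    matches = [o for o in orders if o["customer_id"] == c["customer_id"]]
    if matches:
        left_outer.extend({**c, **o} for o in matches)
    else:
        # No match on the right: order columns become NULL (None)
        left_outer.append({**c, "order_id": None, "amount": None})

never_ordered = [r["name"] for r in left_outer if r["order_id"] is None]
print(len(left_outer))  # 6
print(never_ordered)    # ['Bob', 'Dave', 'Eve']
```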
Right Outer Join
What It Does
The mirror of Left Outer. Returns ALL rows from the right stream and matching rows from the left. Unmatched right rows have NULLs for left columns.
Real-life analogy: Like a package delivery report. Every package (right) appears in the report. If the recipient (left) is in the system, their details are included. If not, the recipient columns show “Unknown.”
Result
| customer_id | name | city | order_id | amount | order_date |
|---|---|---|---|---|---|
| 1 | Alice | Toronto | 101 | 500 | 2026-01-15 |
| 1 | Alice | Toronto | 102 | 300 | 2026-02-10 |
| 3 | Carol | London | 103 | 700 | 2026-03-05 |
| NULL | NULL | NULL | 104 | 150 | 2026-03-20 |
4 rows. All orders appear. Order 104 has NULLs for customer columns because customer_id=6 does not exist. Bob, Dave, and Eve are excluded because they have no orders.
When to Use Right Outer Join
In practice, rarely used. You can always swap the streams and use Left Outer instead:
Right Outer Join (Customers LEFT, Orders RIGHT)
= Left Outer Join (Orders LEFT, Customers RIGHT)
Same result, better readability. Most data engineers standardize on Left Outer and simply swap the stream order when needed.
Full Outer Join
What It Does
Returns ALL rows from BOTH sides. Matching rows are combined. Unmatched rows from either side have NULLs.
Real-life analogy: A company merger. You combine the employee lists from both companies. Some employees are in both lists (they transfer). Some are only in Company A. Some are only in Company B. Everyone appears in the merged list, with NULLs where the other company’s data is missing.
Result
| customer_id | name | city | order_id | amount | order_date |
|---|---|---|---|---|---|
| 1 | Alice | Toronto | 101 | 500 | 2026-01-15 |
| 1 | Alice | Toronto | 102 | 300 | 2026-02-10 |
| 2 | Bob | Mumbai | NULL | NULL | NULL |
| 3 | Carol | London | 103 | 700 | 2026-03-05 |
| 4 | Dave | Toronto | NULL | NULL | NULL |
| 5 | Eve | Paris | NULL | NULL | NULL |
| NULL | NULL | NULL | 104 | 150 | 2026-03-20 |
7 rows. Everyone and everything is included. No data is lost.
When to Use Full Outer Join
- Data reconciliation — comparing source vs target to find mismatches
- Merge two systems — combining data from two sources where either side might have records the other does not
- Data quality audit — finding orphaned records on both sides
Finding Orphans on Both Sides
After a Full Outer Join, filter:
isNull(Customers@customer_id) || isNull(Orders@order_id)
This gives you: customers without orders AND orders without valid customers. Perfect for data quality reports.
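A minimal Python sketch of the same reconciliation idea, using key sets (illustrative, not the Data Flow expression itself):

```python
customer_ids = {1, 2, 3, 4, 5}
orders = [(101, 1), (102, 1), (103, 3), (104, 6)]  # (order_id, customer_id)

# Orphans on the left: customers whose id never appears in orders
customers_without_orders = sorted(customer_ids - {cid for _, cid in orders})

# Orphans on the right: orders pointing at a customer that does not exist
orders_without_customers = [oid for oid, cid in orders if cid not in customer_ids]

print(customers_without_orders)  # [2, 4, 5]
print(orders_without_customers)  # [104]
```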
Cross Join
What It Does
Produces every possible combination of rows from both sides. No join condition needed.
Real-life analogy: A restaurant menu where you choose one appetizer AND one main course. If there are 5 appetizers and 4 mains, you have 20 possible meal combinations. That is a Cross Join.
How to Configure
- Join type: Cross
- No join condition (it combines everything with everything)
Result
5 customers x 4 orders = 20 rows. Every customer paired with every order.
When to Use Cross Join
- Generate all combinations — all products x all stores for inventory planning
- Create date dimensions — years x months x days
- Test data generation
Warning: Cross Join on large datasets explodes the row count. 10,000 x 10,000 = 100 million rows. Only use on small reference datasets.
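The menu analogy maps directly to a Cartesian product. A sketch with hypothetical menu items:

```python
import itertools

appetizers = ["soup", "salad", "bruschetta", "wings", "dumplings"]  # 5 items
mains = ["pasta", "steak", "curry", "risotto"]                      # 4 items

# Cross join = every appetizer paired with every main
menu = list(itertools.product(appetizers, mains))
print(len(menu))  # 20 combinations, exactly 5 x 4
```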
Exists (Anti-Join Alternative)
Data Flows have a separate Exists transformation that works differently from Join:
What It Does
Filters the left stream to keep only rows that DO or DO NOT exist in the right stream. It does NOT add columns from the right side.
Real-life analogy: A bouncer with a guest list. They check if your name is on the list (exists) or not (not exists). They do not ask about your table reservation details — they just let you in or turn you away.
How to Configure
- Add an Exists transformation (not Join)
- Right stream: OrdersSource
- Exist type: Exists (or Not exists)
- Condition: Customers@customer_id == Orders@customer_id
Exists Result (Customers Who Have Orders)
| customer_id | name | city | status |
|---|---|---|---|
| 1 | Alice | Toronto | Active |
| 3 | Carol | London | Inactive |
Only customers who have at least one order. No order columns added. Alice appears ONCE (not twice like Inner Join), because Exists is a filter, not a combiner.
Not Exists Result (Customers Without Orders)
| customer_id | name | city | status |
|---|---|---|---|
| 2 | Bob | Mumbai | Active |
| 4 | Dave | Toronto | Active |
| 5 | Eve | Paris | Active |
Customers who have zero orders. This is the anti-join pattern.
Exists vs Left Outer Join + Filter
Both achieve “customers without orders” but differently:
| Approach | How | Result Columns | Performance |
|---|---|---|---|
| Exists (Not exists) | Filter only, no column merge | Left stream columns only | Faster (no data shuffling) |
| Left Outer Join + Filter(isNull) | Full join then filter | Both stream columns | Slower (joins all data first) |
Use Exists when you only need to know IF a match exists, not the matched data. Use Join when you need columns from both sides.
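The filter-not-combine behavior can be sketched in plain Python; note Alice appears once even though she has two orders:

```python
customers = [
    {"customer_id": 1, "name": "Alice"},
    {"customer_id": 2, "name": "Bob"},
    {"customer_id": 3, "name": "Carol"},
    {"customer_id": 4, "name": "Dave"},
    {"customer_id": 5, "name": "Eve"},
]
# Set of customer_ids present in Orders: duplicates collapse automatically
order_customer_ids = {1, 1, 3, 6}

# Exists: keep customers with at least one order; Not exists: the opposite
exists = [c["name"] for c in customers if c["customer_id"] in order_customer_ids]
not_exists = [c["name"] for c in customers if c["customer_id"] not in order_customer_ids]

print(exists)      # ['Alice', 'Carol']
print(not_exists)  # ['Bob', 'Dave', 'Eve']
```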
Lookup vs Join: When to Use Which
Data Flows have both Join and Lookup transformations. They seem similar but behave differently:
| Feature | Join | Lookup |
|---|---|---|
| Output | Combined rows from both streams | Left stream enriched with right stream values |
| Row multiplication | Yes (one-to-many creates multiple rows) | No (returns first/last/any match) |
| Multiple matches | Creates duplicate rows | Returns one value per row |
| Use case | Combining equal datasets | Adding reference data to primary stream |
Example: Alice Has 2 Orders
Join result: 2 rows (one per order)
| customer_id | name | order_id | amount |
|---|---|---|---|
| 1 | Alice | 101 | 500 |
| 1 | Alice | 102 | 300 |
Lookup result: 1 row (picks first match)
| customer_id | name | first_order_id | first_amount |
|---|---|---|---|
| 1 | Alice | 101 | 500 |
Real-life analogy: Join is like matching students with ALL their exam scores (one row per exam). Lookup is like matching students with their LATEST exam score only (one row per student).
Use Lookup when: You want to enrich data without duplicating rows — adding a dimension value to a fact record.
Use Join when: You genuinely need all matching combinations.
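A sketch of the Lookup first-match behavior in plain Python (illustrative; the transformation also supports last/any match):

```python
customers = [{"customer_id": 1, "name": "Alice"}, {"customer_id": 2, "name": "Bob"}]
orders = [
    {"order_id": 101, "customer_id": 1, "amount": 500},
    {"order_id": 102, "customer_id": 1, "amount": 300},
]

first_order = {}
for o in orders:
    first_order.setdefault(o["customer_id"], o)  # keep only the FIRST match

# One output row per input row, even when multiple orders match
lookup_rows = [
    {**c, "first_order_id": (first_order.get(c["customer_id"]) or {}).get("order_id")}
    for c in customers
]
print(lookup_rows[0])    # Alice gets order 101 only
print(len(lookup_rows))  # 2 -- still one row per customer
```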
Join on Multiple Keys
Sometimes a single column is not enough. You need to match on multiple columns:
How to Configure
In the Join transformation, click + to add multiple conditions:
Condition 1: Left@year == Right@year
Condition 2: Left@month == Right@month
Condition 3: Left@region == Right@region
All conditions must be TRUE for rows to match (AND logic).
Real-life analogy: Multi-key joins are like identifying a person. Name alone is not enough (there are many “Johns”). Name + city is better. Name + city + birthday is almost unique. More keys = more precise matching.
Conceptually, the condition maps the left and right keys pairwise. A simplified sketch (not the exact data flow script syntax):
"joinCondition": {
  "left": ["year", "month", "region"],
  "right": ["yr", "mo", "reg"]
}
Column names can differ between left and right as long as they contain comparable data.
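A plain-Python sketch of a composite-key join with differently named keys on each side (hypothetical data):

```python
left = [
    {"year": 2026, "month": 1, "region": "EU", "sales": 100},
    {"year": 2026, "month": 2, "region": "EU", "sales": 120},
]
right = [
    {"yr": 2026, "mo": 1, "reg": "EU", "target": 90},
]

# All three keys must match (AND logic) for rows to join
joined = [
    {**l, **r}
    for l in left
    for r in right
    if (l["year"], l["month"], l["region"]) == (r["yr"], r["mo"], r["reg"])
]
print(len(joined))          # 1 -- only the Jan/EU rows match on all three keys
print(joined[0]["target"])  # 90
```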
Join with Expressions (Non-Equality Joins)
Standard joins use equality (==). But sometimes you need range-based or expression-based joins:
Date Range Join (SCD Type 2 Lookup)
Match orders to the customer dimension version that was active on the order date:
Customers@customer_id == Orders@customer_id
AND Orders@order_date >= Customers@effective_from
AND Orders@order_date < Customers@effective_to
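The range condition above can be sketched in plain Python. The dimension versions and dates here are hypothetical:

```python
from datetime import date

dim_customer = [
    {"customer_id": 1, "city": "Ottawa",
     "effective_from": date(2025, 1, 1), "effective_to": date(2026, 1, 1)},
    {"customer_id": 1, "city": "Toronto",
     "effective_from": date(2026, 1, 1), "effective_to": date(9999, 12, 31)},
]
order = {"customer_id": 1, "order_date": date(2026, 1, 15)}

# Match the dimension version whose validity window contains the order date
matched = [
    v for v in dim_customer
    if v["customer_id"] == order["customer_id"]
    and v["effective_from"] <= order["order_date"] < v["effective_to"]
]
print(matched[0]["city"])  # the version active on 2026-01-15
```

The half-open interval (>= from, < to) guarantees each order matches exactly one version.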
Fuzzy Matching (Contains)
contains(Left@product_name, Right@search_term)
Note: Non-equality joins cannot use broadcast optimization and may be slower than equality joins.
Handling Duplicate Column Names
When both streams have a column with the same name (e.g., both have customer_id), the output has duplicates. This causes errors downstream.
Solution 1: Use Select After Join
Add a Select transformation to drop or rename duplicate columns:
Keep: Customers@customer_id (rename to customer_id)
Drop: Orders@customer_id (duplicate)
Keep: all other columns
Solution 2: Stream Prefix
In the Join configuration, Data Flows automatically prefix ambiguous columns:
Customers@customer_id
Orders@customer_id
You can reference them with the stream prefix in downstream transformations.
Best Practice
Always add a Select transformation after every Join to clean up column names. Remove duplicates, rename ambiguous columns, and drop unnecessary fields. Think of it as tidying up after combining two piles of documents — remove the duplicates before filing.
Broadcast Optimization
The Problem: Shuffle
When joining two large datasets, Spark must shuffle data across the cluster — sending rows with matching keys to the same node. This is like reorganizing a library by moving books between shelves. Slow and expensive.
The Solution: Broadcast
If one dataset is small (under 50 MB), you can broadcast it — send the entire small dataset to every node. Each node then joins locally without shuffling. This is like giving every librarian a photocopy of the small catalog — they can match books locally without walking to other shelves.
How to Configure
- In the Join transformation, click Optimize tab
- Select Fixed broadcast
- Choose which side to broadcast: Left or Right (pick the smaller one)
Or use Auto: Data Flows will automatically broadcast a side when it is small enough.
When to Broadcast
| Left Stream | Right Stream | Strategy |
|---|---|---|
| 10 million rows | 500 rows (dim table) | Broadcast Right |
| 200 rows (config table) | 5 million rows | Broadcast Left |
| 10 million rows | 10 million rows | No broadcast (shuffle) |
Rule of thumb: Broadcast the side that is under 50 MB. Never broadcast large datasets — it overwhelms node memory.
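Conceptually, broadcasting turns the join into a local hash-map probe on every node. A minimal sketch of that idea (illustrative; Spark handles this internally):

```python
# The small side, shipped to every worker as an in-memory hash map
small_dim = {1: "Toronto", 2: "Mumbai", 3: "London"}

# The large side streams through locally: (order_id, customer_id)
large_facts = [(101, 1), (102, 1), (103, 3), (104, 6)]

# Local hash-join probe (inner join: unmatched fact rows drop out) -- no shuffle
joined = [(oid, small_dim[cid]) for oid, cid in large_facts if cid in small_dim]
print(joined)  # [(101, 'Toronto'), (102, 'Toronto'), (103, 'London')]
```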
Real-World Scenarios
Scenario 1: Customer 360 with Orders and Support Tickets
Customers (Source)
|
LEFT OUTER JOIN Orders ON customer_id
|
LEFT OUTER JOIN Support_Tickets ON customer_id
|
Aggregate: total_orders, total_spend, open_tickets per customer
|
Derived Column: health_score = iif(open_tickets > 3, 'At Risk', 'Healthy')
|
Sink: dim_customer_360
Left Outer because you want ALL customers, even those with no orders or no tickets.
Scenario 2: Data Reconciliation (Source vs Target)
Source_Data (Left)
|
FULL OUTER JOIN Target_Data ON primary_key
|
Conditional Split:
Missing in Target: isNull(Target@primary_key) --> Sink: missing_in_target
Missing in Source: isNull(Source@primary_key) --> Sink: missing_in_source
Mismatched: Source@amount != Target@amount --> Sink: mismatched_records
Matched: Default --> (discard or log)
Full Outer catches mismatches on both sides.
Scenario 3: Slowly Changing Dimension Type 2 Lookup
Fact_Orders (Left)
|
INNER JOIN Dim_Customer ON customer_id
AND order_date >= effective_from
AND order_date < effective_to
|
Result: Each order matched to the correct customer version
This ensures an order from 2025 matches the customer’s 2025 address, not their 2026 address.
Scenario 4: Product Enrichment from Multiple Catalogs
Internal_Products (Left)
|
LEFT OUTER JOIN Vendor_Catalog ON product_sku
|
LEFT OUTER JOIN Pricing_Service ON product_id
|
Select: product_name, vendor_description, current_price
|
Sink: enriched_products
Left Outer because some internal products may not be in the vendor catalog or pricing service yet.
Performance Tips
- Broadcast small tables — always broadcast dimension tables (under 50 MB) when joining with large fact tables.
- Filter before joining — reduce row count BEFORE the join. If you only need Active customers, filter first, then join. Less data to shuffle = faster join.
- Drop unnecessary columns before joining — add a Select before the Join to keep only the columns you need. Less data per row = faster processing.
- Use Exists instead of Join for filtering — if you only need to know IF a match exists (not the matched columns), Exists is faster because it does not combine data.
- Avoid Cross Joins on large datasets — 10K x 10K = 100M rows. If you must, filter one side first.
- Partition your data — Spark joins are fastest when both sides are partitioned on the join key.
Common Mistakes
1. Unexpected Row Multiplication
You join customers (5 rows) with orders (4 rows) using a Left Outer Join and expect 5 rows, one per customer. Instead you get 6, because Alice has 2 orders and a join creates a row for EACH match.
Fix: If you need one row per customer with aggregated order data, use Aggregate after the Join (SUM amount, COUNT orders).
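A sketch of what that Aggregate step does: collapse the one-to-many join output back to one row per customer:

```python
# One-to-many join output: Alice duplicated, one row per order
joined = [
    {"customer_id": 1, "name": "Alice", "amount": 500},
    {"customer_id": 1, "name": "Alice", "amount": 300},
    {"customer_id": 3, "name": "Carol", "amount": 700},
]

# Group by customer_id: SUM(amount), COUNT(orders)
agg = {}
for row in joined:
    entry = agg.setdefault(
        row["customer_id"],
        {"name": row["name"], "total_spend": 0, "total_orders": 0},
    )
    entry["total_spend"] += row["amount"]
    entry["total_orders"] += 1

print(agg[1])    # Alice: total_spend 800 over 2 orders
print(len(agg))  # 2 -- back to one row per customer
```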
2. Joining on Wrong Column
You accidentally join on customer_name instead of customer_id. Two different people named “Alex” get merged into one.
Fix: Always join on the unique identifier (primary key / foreign key), not on descriptive columns.
3. Not Handling NULLs in Join Keys
If customer_id is NULL in some rows, NULL != NULL in a join. Those rows will never match.
Fix: Filter out NULLs before joining, or use coalesce(customer_id, -1) to replace NULLs with a sentinel value.
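The coalesce-to-sentinel fix can be sketched in plain Python (hypothetical rows; -1 stands in for the sentinel value):

```python
rows = [{"customer_id": 1}, {"customer_id": None}, {"customer_id": 3}]

# Replace NULL (None) keys with -1 so they can participate in the join
cleaned = [
    {**r, "customer_id": -1 if r["customer_id"] is None else r["customer_id"]}
    for r in rows
]
print([r["customer_id"] for r in cleaned])  # [1, -1, 3]
```

Pick a sentinel that cannot collide with a real key, or the sentinel rows will match each other across both streams.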
4. Duplicate Columns Breaking Downstream
After joining, both sides have customer_id. A downstream Aggregate on customer_id fails because it is ambiguous.
Fix: Always add a Select after Join to resolve duplicate column names.
5. Broadcasting a Large Table
You broadcast a 2 GB table to every node. Each node runs out of memory.
Fix: Only broadcast tables under 50 MB. Let Spark shuffle for larger tables.
Interview Questions
Q: What join types are available in Synapse Data Flows? A: Inner, Left Outer, Right Outer, Full Outer, and Cross. Additionally, the Exists transformation provides an anti-join pattern, and Lookup provides single-match enrichment.
Q: What is the difference between Join and Lookup in Data Flows? A: Join creates a row for every match (one-to-many produces duplicate rows). Lookup returns only one match per input row (no duplication). Use Join for combining datasets. Use Lookup for enriching with reference data.
Q: What is broadcast optimization and when should you use it? A: Broadcasting sends a small dataset to every Spark node, eliminating the need for data shuffling during the join. Use it when one side is under 50 MB (dimension tables, config tables). Never broadcast large datasets.
Q: How do you find records in one dataset that do not exist in another? A: Use the Exists transformation with “Not exists” type. It filters the left stream to keep only rows with no match on the right. Alternatively, use a Left Outer Join followed by a Filter where the right key is NULL.
Q: How do you handle duplicate column names after a join? A: Add a Select transformation after the Join to rename or drop duplicate columns. Data Flows use stream prefixes (Left@column, Right@column) to disambiguate, but downstream transformations need clean column names.
Wrapping Up
Joins in Data Flows work exactly like SQL joins — same logic, visual interface, Spark engine. The key decisions are:
- Inner when you only want matched records
- Left Outer when you want all records from the primary stream
- Full Outer for reconciliation and finding mismatches on both sides
- Exists for filtering without adding columns
- Lookup for enriching without duplicating rows
And always remember: filter early, broadcast small tables, and clean up duplicate columns after every join.
Related posts: – SQL Joins Explained – Data Flows in ADF/Synapse – Schema-on-Write vs Schema-on-Read – SQL Window Functions
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.