Joins in Azure Synapse Data Flows: Every Join Type Explained with Real-World Examples

In SQL, you write JOIN in a query. In Synapse Data Flows, you drag a Join box onto the canvas. Same concept, visual interface, Apache Spark engine underneath.

But here is where most tutorials fail — they show you HOW to configure a join but not WHEN to use each type, what happens to your data, and why your row count suddenly tripled.

This post covers every join type available in Synapse (and ADF) Data Flows with real datasets, step-by-step UI configuration, visual before/after tables, and the real-world scenarios where each join is the right choice.

Think of joins like matchmaking at a conference. You have two guest lists (datasets), and you need to combine them based on some common attribute (the join key). The join type determines what happens to guests who are on one list but not the other.

Table of Contents

  • What Is a Join in Data Flows?
  • The Sample Data
  • Inner Join
  • Left Outer Join
  • Right Outer Join
  • Full Outer Join
  • Cross Join
  • Exists (Anti-Join Alternative)
  • Lookup vs Join: When to Use Which
  • Join on Multiple Keys
  • Join with Expressions (Non-Equality Joins)
  • Handling Duplicate Column Names
  • Broadcast Optimization
  • Real-World Scenarios
  • Performance Tips
  • Common Mistakes
  • Interview Questions
  • Wrapping Up

What Is a Join in Data Flows?

A Join transformation in Data Flows combines two data streams based on matching keys. It takes a left stream (primary) and a right stream (secondary) and produces a single output stream with columns from both.

Left Stream (Customers)     Right Stream (Orders)
+----+--------+            +--------+------+--------+
| id | name   |            | ord_id | c_id | amount |
+----+--------+     JOIN   +--------+------+--------+
| 1  | Alice  |     ON     | 101    | 1    | 500    |
| 2  | Bob    |   c.id =   | 102    | 1    | 300    |
| 3  | Carol  |   o.c_id   | 103    | 3    | 700    |
| 4  | Dave   |            +--------+------+--------+
+----+--------+

Real-life analogy: A join is like matching RSVPs with a seating chart. The RSVP list has guest names and dietary preferences. The seating chart has guest names and table assignments. You combine them to get: guest name, dietary preference, AND table number — all in one row.

The Sample Data

I will use these two datasets throughout:

Customers (Left Stream):

customer_id  name   city     status
1            Alice  Toronto  Active
2            Bob    Mumbai   Active
3            Carol  London   Inactive
4            Dave   Toronto  Active
5            Eve    Paris    Active

Orders (Right Stream):

order_id  customer_id  amount  order_date
101       1            500     2026-01-15
102       1            300     2026-02-10
103       3            700     2026-03-05
104       6            150     2026-03-20

Notice:

  • Alice (id=1) has 2 orders
  • Carol (id=3) has 1 order
  • Dave (id=4) and Eve (id=5) have NO orders
  • Order 104 references customer_id=6, who does NOT exist in Customers

This is realistic data — mismatches and gaps are normal.

Inner Join

What It Does

Returns only rows where BOTH sides have a match. Like a married couples dance — only people who came WITH a partner get on the dance floor. Singles wait at the bar.

How to Configure

  1. Add a Join transformation after your left stream
  2. Name: JoinOrders
  3. Right stream: select OrdersSource
  4. Join type: Inner
  5. Join condition: Customers@customer_id == Orders@customer_id

Result

customer_id  name   city     order_id  amount  order_date
1            Alice  Toronto  101       500     2026-01-15
1            Alice  Toronto  102       300     2026-02-10
3            Carol  London   103       700     2026-03-05

3 rows. Alice appears twice because she has 2 orders (one-to-many relationship).

Who is missing:

  • Bob, Dave, Eve — they have no orders (no match on right side)
  • Order 104 — customer_id=6 does not exist in Customers (no match on left side)

When to Use Inner Join

  • You only want customers WHO HAVE ordered
  • Building fact tables where every row needs valid dimension references
  • Matching two datasets where incomplete records should be excluded

Real-life analogy: An Inner Join is like a job interview match. Only candidates (left) who match an open position (right) get an interview. Candidates with no matching positions and positions with no matching candidates are both excluded.
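The Inner Join behavior above can be modeled in a few lines of plain Python. This is an illustrative sketch of what the engine does with the sample data, not Data Flow code, and all variable names are my own:

```python
# Sample data mirroring the post's Customers and Orders streams
customers = [(1, "Alice"), (2, "Bob"), (3, "Carol"), (4, "Dave"), (5, "Eve")]
orders = [(101, 1, 500), (102, 1, 300), (103, 3, 700), (104, 6, 150)]

# Inner join: emit a combined row only when BOTH sides match on customer_id
inner = [
    (c_id, name, o_id, amount)
    for (c_id, name) in customers
    for (o_id, o_cid, amount) in orders
    if c_id == o_cid
]

# Alice appears twice (2 orders), Carol once; Bob, Dave, Eve, and order 104 drop out
print(inner)
```

Running this yields exactly the 3-row result table shown above.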

Left Outer Join

What It Does

Returns ALL rows from the left stream and matching rows from the right. If a left row has no match on the right, the right columns are filled with NULL.

Real-life analogy: It is like a class attendance sheet. Every student (left) appears on the sheet. If they submitted homework (right), you note the grade. If they did not submit, you write “N/A”. No student is left off the sheet.

How to Configure

  1. Join type: Left outer
  2. Same join condition: Customers@customer_id == Orders@customer_id

Result

customer_id  name   city     order_id  amount  order_date
1            Alice  Toronto  101       500     2026-01-15
1            Alice  Toronto  102       300     2026-02-10
2            Bob    Mumbai   NULL      NULL    NULL
3            Carol  London   103       700     2026-03-05
4            Dave   Toronto  NULL      NULL    NULL
5            Eve    Paris    NULL      NULL    NULL

6 rows, covering all 5 customers (Alice is duplicated because she has 2 orders). Bob, Dave, and Eve have NULLs for the order columns because they have no orders. Order 104 (customer_id=6) is still excluded because it has no matching customer on the left.

When to Use Left Outer Join

  • Customer 360 view — show all customers with their orders (including customers who never ordered)
  • Data quality checks — find customers without orders (filter WHERE order_id IS NULL after the join)
  • Preserving the complete primary dataset while enriching with optional data

Finding Records Without Matches

Add a Filter transformation after the Left Outer Join:

isNull(order_id)

Result: Bob, Dave, Eve — customers who have never placed an order. This is one of the most common data engineering patterns.
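That Left Outer Join plus isNull filter pattern can be sketched in plain Python (an illustrative model, not Data Flow code; the names are hypothetical):

```python
customers = [(1, "Alice"), (2, "Bob"), (3, "Carol"), (4, "Dave"), (5, "Eve")]
orders = [(101, 1, 500), (102, 1, 300), (103, 3, 700), (104, 6, 150)]

# Left outer join: keep every left row; fill right columns with None when unmatched
left_outer = []
for c_id, name in customers:
    matches = [(o_id, amount) for (o_id, o_cid, amount) in orders if o_cid == c_id]
    if matches:
        for o_id, amount in matches:
            left_outer.append((c_id, name, o_id, amount))
    else:
        left_outer.append((c_id, name, None, None))  # unmatched left row keeps NULLs

# The "customers who never ordered" pattern: filter where the right key is NULL
never_ordered = [row for row in left_outer if row[2] is None]
print(never_ordered)
```

The join produces 6 rows, and the filter leaves Bob, Dave, and Eve.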

Right Outer Join

What It Does

The mirror of Left Outer. Returns ALL rows from the right stream and matching rows from the left. Unmatched right rows have NULLs for left columns.

Real-life analogy: Like a package delivery report. Every package (right) appears in the report. If the recipient (left) is in the system, their details are included. If not, the recipient columns show “Unknown.”

Result

customer_id  name   city     order_id  amount  order_date
1            Alice  Toronto  101       500     2026-01-15
1            Alice  Toronto  102       300     2026-02-10
3            Carol  London   103       700     2026-03-05
NULL         NULL   NULL     104       150     2026-03-20

4 rows. All orders appear. Order 104 has NULLs for customer columns because customer_id=6 does not exist. Bob, Dave, and Eve are excluded because they have no orders.

When to Use Right Outer Join

In practice, rarely used. You can always swap the streams and use Left Outer instead:

Right Outer Join (Customers LEFT, Orders RIGHT)
  = Left Outer Join (Orders LEFT, Customers RIGHT)

Same result, better readability. Most data engineers stick with Left Outer and swap the stream order when needed.

Full Outer Join

What It Does

Returns ALL rows from BOTH sides. Matching rows are combined. Unmatched rows from either side have NULLs.

Real-life analogy: A company merger. You combine the employee lists from both companies. Some employees are in both lists (they transfer). Some are only in Company A. Some are only in Company B. Everyone appears in the merged list, with NULLs where the other company’s data is missing.

Result

customer_id  name   city     order_id  amount  order_date
1            Alice  Toronto  101       500     2026-01-15
1            Alice  Toronto  102       300     2026-02-10
2            Bob    Mumbai   NULL      NULL    NULL
3            Carol  London   103       700     2026-03-05
4            Dave   Toronto  NULL      NULL    NULL
5            Eve    Paris    NULL      NULL    NULL
NULL         NULL   NULL     104       150     2026-03-20

7 rows. Everyone and everything is included. No data is lost.

When to Use Full Outer Join

  • Data reconciliation — comparing source vs target to find mismatches
  • Merge two systems — combining data from two sources where either side might have records the other does not
  • Data quality audit — finding orphaned records on both sides

Finding Orphans on Both Sides

After a Full Outer Join, filter:

isNull(Customers@customer_id) || isNull(Orders@order_id)

This gives you: customers without orders AND orders without valid customers. Perfect for data quality reports.
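A plain-Python model of the Full Outer Join plus orphan filter (illustrative only, not Data Flow code; names are hypothetical):

```python
customers = [(1, "Alice"), (2, "Bob"), (3, "Carol"), (4, "Dave"), (5, "Eve")]
orders = [(101, 1, 500), (102, 1, 300), (103, 3, 700), (104, 6, 150)]

# Full outer join: all left rows (with None for unmatched right columns),
# then every right row that never matched (with None for left columns)
matched_order_ids = set()
full = []
for c_id, name in customers:
    hit = False
    for o_id, o_cid, amount in orders:
        if o_cid == c_id:
            full.append((c_id, name, o_id, amount))
            matched_order_ids.add(o_id)
            hit = True
    if not hit:
        full.append((c_id, name, None, None))
for o_id, o_cid, amount in orders:
    if o_id not in matched_order_ids:
        full.append((None, None, o_id, amount))

# Orphans on either side: NULL left key OR NULL right key
orphans = [row for row in full if row[0] is None or row[2] is None]
print(orphans)
```

This yields the 7-row result above, and the filter surfaces Bob, Dave, Eve, and order 104.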

Cross Join

What It Does

Produces every possible combination of rows from both sides. No join condition needed.

Real-life analogy: A restaurant menu where you choose one appetizer AND one main course. If there are 5 appetizers and 4 mains, you have 20 possible meal combinations. That is a Cross Join.

How to Configure

  1. Join type: Cross
  2. No join condition (it combines everything with everything)

Result

5 customers x 4 orders = 20 rows. Every customer paired with every order.

When to Use Cross Join

  • Generate all combinations — all products x all stores for inventory planning
  • Create date dimensions — years x months x days
  • Test data generation

Warning: Cross Join on large datasets explodes the row count. 10,000 x 10,000 = 100 million rows. Only use on small reference datasets.
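The menu analogy maps directly onto a Cartesian product. A quick stdlib sketch (item names are made up for illustration):

```python
from itertools import product

appetizers = ["soup", "salad", "bruschetta", "spring rolls", "dumplings"]  # 5 options
mains = ["pasta", "curry", "steak", "risotto"]                             # 4 options

# Cross join: every appetizer paired with every main, no join condition
combos = list(product(appetizers, mains))
print(len(combos))  # 5 x 4 = 20 combinations
```

This is also why row counts explode: the output size is always |left| x |right|.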

Exists (Anti-Join Alternative)

Data Flows have a separate Exists transformation that works differently from Join:

What It Does

Filters the left stream to keep only rows that DO or DO NOT exist in the right stream. It does NOT add columns from the right side.

Real-life analogy: A bouncer with a guest list. They check if your name is on the list (exists) or not (not exists). They do not ask about your table reservation details — they just let you in or turn you away.

How to Configure

  1. Add an Exists transformation (not Join)
  2. Right stream: OrdersSource
  3. Exist type: Exists (or Doesn't exist)
  4. Condition: Customers@customer_id == Orders@customer_id

Exists Result (Customers Who Have Orders)

customer_id  name   city     status
1            Alice  Toronto  Active
3            Carol  London   Inactive

Only customers who have at least one order. No order columns added. Alice appears ONCE (not twice like Inner Join), because Exists is a filter, not a combiner.

Not Exists Result (Customers Without Orders)

customer_id  name   city     status
2            Bob    Mumbai   Active
4            Dave   Toronto  Active
5            Eve    Paris    Active

Customers who have zero orders. This is the anti-join pattern.

Exists vs Left Outer Join + Filter

Both achieve “customers without orders” but differently:

Approach                          How                           Result Columns            Performance
Exists (Not exists)               Filter only, no column merge  Left stream columns only  Faster (no data shuffling)
Left Outer Join + Filter(isNull)  Full join, then filter        Both streams' columns     Slower (joins all data first)

Use Exists when you only need to know IF a match exists, not the matched data. Use Join when you need columns from both sides.
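The "filter, not combiner" distinction can be modeled in plain Python (illustrative only; in particular, note Alice appears once, not twice):

```python
customers = [(1, "Alice"), (2, "Bob"), (3, "Carol"), (4, "Dave"), (5, "Eve")]
orders = [(101, 1), (102, 1), (103, 3), (104, 6)]  # (order_id, customer_id)

# Build the set of customer_ids that appear on the right side
order_customer_ids = {cid for _, cid in orders}

# Exists: keep left rows with at least one match; no right columns added,
# and no duplication even when a customer has multiple orders
exists = [c for c in customers if c[0] in order_customer_ids]

# Not exists: the anti-join pattern
not_exists = [c for c in customers if c[0] not in order_customer_ids]
```

A set membership test is also a decent mental model for why Exists avoids combining data.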

Lookup vs Join: When to Use Which

Data Flows have both Join and Lookup transformations. They seem similar but behave differently:

Feature             Join                                     Lookup
Output              Combined rows from both streams          Left stream enriched with right stream values
Row multiplication  Yes (one-to-many creates multiple rows)  No (returns first/last/any match)
Multiple matches    Creates duplicate rows                   Returns one value per row
Use case            Combining equal datasets                 Adding reference data to primary stream

Example: Alice Has 2 Orders

Join result: 2 rows (one per order)

customer_id  name   order_id  amount
1            Alice  101       500
1            Alice  102       300

Lookup result: 1 row (picks first match)

customer_id  name   first_order_id  first_amount
1            Alice  101             500

Real-life analogy: Join is like matching students with ALL their exam scores (one row per exam). Lookup is like matching students with their LATEST exam score only (one row per student).

Use Lookup when: You want to enrich data without duplicating rows — adding a dimension value to a fact record.

Use Join when: You genuinely need all matching combinations.
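A minimal Python sketch of the Lookup behavior (first-match enrichment, one output row per input row). This is an illustrative model, not the Lookup transformation itself:

```python
customers = [(1, "Alice"), (2, "Bob")]
orders = [(101, 1, 500), (102, 1, 300)]  # (order_id, customer_id, amount)

def lookup_first(c_id):
    """Return the FIRST matching order, or (None, None) if no match."""
    for o_id, o_cid, amount in orders:
        if o_cid == c_id:
            return (o_id, amount)
    return (None, None)

# One row per customer, even though Alice has two orders
enriched = [(c_id, name) + lookup_first(c_id) for c_id, name in customers]
print(enriched)
```

Compare with the Join sketch earlier, which would emit two Alice rows.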

Join on Multiple Keys

Sometimes a single column is not enough. You need to match on multiple columns:

How to Configure

In the Join transformation, click + to add multiple conditions:

Condition 1: Left@year == Right@year
Condition 2: Left@month == Right@month
Condition 3: Left@region == Right@region

All conditions must be TRUE for rows to match (AND logic).

Real-life analogy: Multi-key joins are like identifying a person. Name alone is not enough (there are many “Johns”). Name + city is better. Name + city + birthday is almost unique. More keys = more precise matching.

In the JSON

"joinCondition": {
    "left": ["year", "month", "region"],
    "right": ["yr", "mo", "reg"]
}

Column names can differ between left and right as long as they contain comparable data.
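Under the hood, a multi-key equality join is just matching on a composite key. A plain-Python sketch (all dataset and column names here are invented for illustration):

```python
# Left: (year, month, region, sales); Right: (yr, mo, reg, discount)
left = [(2026, 1, "EMEA", 100), (2026, 1, "APAC", 200)]
right = [(2026, 1, "EMEA", 0.2)]

# All three key columns must match: tuple comparison gives the AND logic
joined = [
    (y, m, r, sales, disc)
    for (y, m, r, sales) in left
    for (yr, mo, reg, disc) in right
    if (y, m, r) == (yr, mo, reg)
]
print(joined)
```

Only the EMEA row matches on all three keys; APAC falls out.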

Join with Expressions (Non-Equality Joins)

Standard joins use equality (==). But sometimes you need range-based or expression-based joins:

Date Range Join (SCD Type 2 Lookup)

Match orders to the customer dimension version that was active on the order date:

Customers@customer_id == Orders@customer_id
  && Orders@order_date >= Customers@effective_from
  && Orders@order_date < Customers@effective_to

Note that the Data Flow expression language chains conditions with && rather than SQL's AND. To enter a range predicate like this, set the join type to Custom (cross join) and supply the full expression.

Fuzzy Matching (Contains)

contains(Left@product_name, Right@search_term)

Note: Non-equality joins cannot use broadcast optimization and may be slower than equality joins.

Handling Duplicate Column Names

When both streams have a column with the same name (e.g., both have customer_id), the output has duplicates. This causes errors downstream.

Solution 1: Use Select After Join

Add a Select transformation to drop or rename duplicate columns:

Keep: Customers@customer_id (rename to customer_id)
Drop: Orders@customer_id (duplicate)
Keep: all other columns

Solution 2: Stream Prefix

In the Join configuration, Data Flows automatically prefix ambiguous columns:

Customers@customer_id
Orders@customer_id

You can reference them with the stream prefix in downstream transformations.

Best Practice

Always add a Select transformation after every Join to clean up column names. Remove duplicates, rename ambiguous columns, and drop unnecessary fields. Think of it as tidying up after combining two piles of documents — remove the duplicates before filing.

Broadcast Optimization

The Problem: Shuffle

When joining two large datasets, Spark must shuffle data across the cluster — sending rows with matching keys to the same node. This is like reorganizing a library by moving books between shelves. Slow and expensive.

The Solution: Broadcast

If one dataset is small (under 50 MB), you can broadcast it — send the entire small dataset to every node. Each node then joins locally without shuffling. This is like giving every librarian a photocopy of the small catalog — they can match books locally without walking to other shelves.

How to Configure

  1. In the Join transformation, click Optimize tab
  2. Select Fixed broadcast
  3. Choose which side to broadcast: Left or Right (pick the smaller one)

Or use Auto: Data Flows automatically broadcast when one side is small enough.

When to Broadcast

Left Stream              Right Stream          Strategy
10 million rows          500 rows (dim table)  Broadcast Right
200 rows (config table)  5 million rows        Broadcast Left
10 million rows          10 million rows       No broadcast (shuffle)

Rule of thumb: Broadcast the side that is under 50 MB. Never broadcast large datasets — it overwhelms node memory.
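Conceptually, a broadcast join is a hash join where the small side becomes an in-memory map that every node receives. A simplified stdlib sketch of the idea (not Spark code; names are hypothetical):

```python
# Small dimension side -- the one that gets "broadcast"
small_dim = [(1, "Toronto"), (2, "Mumbai")]           # (customer_id, city)
# Large fact side -- stays partitioned across the cluster
big_fact = [(101, 1, 500), (102, 2, 300), (103, 1, 700)]  # (order_id, customer_id, amount)

# Build the hash map once; this map is what Spark ships to every node
lookup = dict(small_dim)

# Each node then probes the map locally per fact row -- no shuffle of the big side
joined = [
    (o_id, c_id, amount, lookup.get(c_id))
    for (o_id, c_id, amount) in big_fact
]
print(joined)
```

This is also why broadcasting a large table fails: the whole map must fit in each node's memory.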

Real-World Scenarios

Scenario 1: Customer 360 with Orders and Support Tickets

Customers (Source)
  |
  LEFT OUTER JOIN Orders ON customer_id
  |
  LEFT OUTER JOIN Support_Tickets ON customer_id
  |
  Aggregate: total_orders, total_spend, open_tickets per customer
  |
  Derived Column: health_score = iif(open_tickets > 3, 'At Risk', 'Healthy')
  |
  Sink: dim_customer_360

Left Outer because you want ALL customers, even those with no orders or no tickets.

Scenario 2: Data Reconciliation (Source vs Target)

Source_Data (Left)
  |
  FULL OUTER JOIN Target_Data ON primary_key
  |
  Conditional Split:
    Missing in Target: isNull(Target@primary_key) --> Sink: missing_in_target
    Missing in Source: isNull(Source@primary_key) --> Sink: missing_in_source
    Mismatched:        Source@amount != Target@amount --> Sink: mismatched_records
    Matched:           Default --> (discard or log)

Full Outer catches mismatches on both sides.

Scenario 3: Slowly Changing Dimension Type 2 Lookup

Fact_Orders (Left)
  |
  INNER JOIN Dim_Customer ON customer_id
    AND order_date >= effective_from
    AND order_date < effective_to
  |
  Result: Each order matched to the correct customer version

This ensures an order from 2025 matches the customer’s 2025 address, not their 2026 address.

Scenario 4: Product Enrichment from Multiple Catalogs

Internal_Products (Left)
  |
  LEFT OUTER JOIN Vendor_Catalog ON product_sku
  |
  LEFT OUTER JOIN Pricing_Service ON product_id
  |
  Select: product_name, vendor_description, current_price
  |
  Sink: enriched_products

Left Outer because some internal products may not be in the vendor catalog or pricing service yet.

Performance Tips

  1. Broadcast small tables — always broadcast dimension tables (under 50 MB) when joining with large fact tables.

  2. Filter before joining — reduce row count BEFORE the join. If you only need Active customers, filter first, then join. Less data to shuffle = faster join.

  3. Drop unnecessary columns before joining — add a Select before the Join to keep only the columns you need. Less data per row = faster processing.

  4. Use Exists instead of Join for filtering — if you only need to know IF a match exists (not the matched columns), Exists is faster because it does not combine data.

  5. Avoid Cross Joins on large datasets — 10K x 10K = 100M rows. If you must, filter one side first.

  6. Partition your data — Spark joins are fastest when both sides are partitioned on the join key.

Common Mistakes

1. Unexpected Row Multiplication

You left-join customers (5 rows) with orders (4 rows) and expect 5 rows, one per customer. Instead you get 6, because Alice has 2 orders. A join creates a row for EACH match.

Fix: If you need one row per customer with aggregated order data, use Aggregate after the Join (SUM amount, COUNT orders).
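The aggregate-after-join fix, sketched in plain Python (illustrative only; input rows mirror the post's sample data):

```python
# Post-join rows: one row per (customer, order) match
joined = [(1, "Alice", 500), (1, "Alice", 300), (3, "Carol", 700)]

# Collapse back to one row per customer: SUM(amount), COUNT(orders)
totals = {}
for c_id, name, amount in joined:
    total, count = totals.get((c_id, name), (0, 0))
    totals[(c_id, name)] = (total + amount, count + 1)

print(totals)  # one entry per customer
```

Alice's two order rows collapse into a single (800, 2) summary, which is usually what the report actually wanted.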

2. Joining on Wrong Column

You accidentally join on customer_name instead of customer_id. Two different people named “Alex” get merged into one.

Fix: Always join on the unique identifier (primary key / foreign key), not on descriptive columns.

3. Not Handling NULLs in Join Keys

If customer_id is NULL in some rows, NULL != NULL in a join. Those rows will never match.

Fix: Filter out NULLs before joining, or use coalesce(customer_id, -1) to replace NULLs with a sentinel value.
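The sentinel fix can be sketched in a couple of lines (illustrative model; the sentinel value -1 is an arbitrary choice that must not collide with real keys):

```python
rows = [(1, 500), (None, 300), (2, 700)]  # (customer_id, amount), one key is NULL

# NULL join keys never match (NULL != NULL), so replace them with a sentinel
SENTINEL = -1
cleaned = [(cid if cid is not None else SENTINEL, amt) for cid, amt in rows]
print(cleaned)
```

In a Data Flow you would do the equivalent with a Derived Column using coalesce(customer_id, -1) before the Join.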

4. Duplicate Columns Breaking Downstream

After joining, both sides have customer_id. A downstream Aggregate on customer_id fails because it is ambiguous.

Fix: Always add a Select after Join to resolve duplicate column names.

5. Broadcasting a Large Table

You broadcast a 2 GB table to every node. Each node runs out of memory.

Fix: Only broadcast tables under 50 MB. Let Spark shuffle for larger tables.

Interview Questions

Q: What join types are available in Synapse Data Flows?
A: Inner, Left Outer, Right Outer, Full Outer, and Cross. Additionally, the Exists transformation provides an anti-join pattern, and Lookup provides single-match enrichment.

Q: What is the difference between Join and Lookup in Data Flows?
A: Join creates a row for every match (one-to-many produces duplicate rows). Lookup returns only one match per input row (no duplication). Use Join for combining datasets. Use Lookup for enriching with reference data.

Q: What is broadcast optimization and when should you use it?
A: Broadcasting sends a small dataset to every Spark node, eliminating the need for data shuffling during the join. Use it when one side is under 50 MB (dimension tables, config tables). Never broadcast large datasets.

Q: How do you find records in one dataset that do not exist in another?
A: Use the Exists transformation with "Not exists" type. It filters the left stream to keep only rows with no match on the right. Alternatively, use a Left Outer Join followed by a Filter where the right key is NULL.

Q: How do you handle duplicate column names after a join?
A: Add a Select transformation after the Join to rename or drop duplicate columns. Data Flows use stream prefixes (Left@column, Right@column) to disambiguate, but downstream transformations need clean column names.

Wrapping Up

Joins in Data Flows work exactly like SQL joins — same logic, visual interface, Spark engine. The key decisions are:

  • Inner when you only want matched records
  • Left Outer when you want all records from the primary stream
  • Full Outer for reconciliation and finding mismatches on both sides
  • Exists for filtering without adding columns
  • Lookup for enriching without duplicating rows

And always remember: filter early, broadcast small tables, and clean up duplicate columns after every join.



Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
