Joins in Azure Synapse Data Flows: Every Join Type Explained with Real-World Examples

In SQL, you write JOIN in a query. In Synapse Data Flows, you drag a Join box onto the canvas. Same concept, visual interface, Apache Spark engine underneath.

But here is where most tutorials fail — they show you HOW to configure a join but not WHEN to use each type, what happens to your data, and why your row count suddenly tripled.

This post covers every join type available in Synapse (and ADF) Data Flows with real datasets, step-by-step UI configuration, visual before/after tables, and the real-world scenarios where each join is the right choice.

Think of joins like matchmaking at a conference. You have two guest lists (datasets), and you need to combine them based on some common attribute (the join key). The join type determines what happens to guests who are on one list but not the other.

Table of Contents

  • What Is a Join in Data Flows?
  • The Sample Data
  • Inner Join
  • Left Outer Join
  • Right Outer Join
  • Full Outer Join
  • Cross Join
  • Exists (Anti-Join Alternative)
  • Lookup vs Join: When to Use Which
  • Join on Multiple Keys
  • Join with Expressions (Non-Equality Joins)
  • Handling Duplicate Column Names
  • Broadcast Optimization
  • Real-World Scenarios
  • Performance Tips
  • Common Mistakes
  • Interview Questions
  • Wrapping Up

What Is a Join in Data Flows?

A Join transformation in Data Flows combines two data streams based on matching keys. It takes a left stream (primary) and a right stream (secondary) and produces a single output stream with columns from both.

Left Stream (Customers)     Right Stream (Orders)
+----+--------+            +--------+------+--------+
| id | name   |            | ord_id | c_id | amount |
+----+--------+     JOIN   +--------+------+--------+
| 1  | Alice  |     ON     | 101    | 1    | 500    |
| 2  | Bob    |   c.id =   | 102    | 1    | 300    |
| 3  | Carol  |   o.c_id   | 103    | 3    | 700    |
| 4  | Dave   |            +--------+------+--------+
+----+--------+

Real-life analogy: A join is like matching RSVPs with a seating chart. The RSVP list has guest names and dietary preferences. The seating chart has guest names and table assignments. You combine them to get: guest name, dietary preference, AND table number — all in one row.

The Sample Data

I will use these two datasets throughout:

Customers (Left Stream):

customer_id  name   city     status
1            Alice  Toronto  Active
2            Bob    Mumbai   Active
3            Carol  London   Inactive
4            Dave   Toronto  Active
5            Eve    Paris    Active

Orders (Right Stream):

order_id  customer_id  amount  order_date
101       1            500     2026-01-15
102       1            300     2026-02-10
103       3            700     2026-03-05
104       6            150     2026-03-20

Notice:

  • Alice (id=1) has 2 orders
  • Carol (id=3) has 1 order
  • Dave (id=4) and Eve (id=5) have NO orders
  • Order 104 references customer_id=6, who does NOT exist in Customers

This is realistic data — mismatches and gaps are normal.

Inner Join

What It Does

Returns only rows where BOTH sides have a match. Like a married couples dance — only people who came WITH a partner get on the dance floor. Singles wait at the bar.

How to Configure

  1. Add a Join transformation after your left stream
  2. Name: JoinOrders
  3. Right stream: select OrdersSource
  4. Join type: Inner
  5. Join condition: Customers@customer_id == Orders@customer_id

Result

customer_id  name   city     order_id  amount  order_date
1            Alice  Toronto  101       500     2026-01-15
1            Alice  Toronto  102       300     2026-02-10
3            Carol  London   103       700     2026-03-05

3 rows. Alice appears twice because she has 2 orders (one-to-many relationship).

Who is missing:

  • Bob, Dave, Eve — they have no orders (no match on right side)
  • Order 104 — customer_id=6 does not exist in Customers (no match on left side)

When to Use Inner Join

  • You only want customers WHO HAVE ordered
  • Building fact tables where every row needs valid dimension references
  • Matching two datasets where incomplete records should be excluded

Real-life analogy: An Inner Join is like a job interview match. Only candidates (left) who match an open position (right) get an interview. Candidates with no matching positions and positions with no matching candidates are both excluded.
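The Inner Join behavior above can be modeled in a few lines of plain Python. This is an illustrative sketch of what the engine does with the sample data, not Data Flow code, and all variable names are my own:

```python
# Sample data mirroring the post's Customers and Orders streams
customers = [(1, "Alice"), (2, "Bob"), (3, "Carol"), (4, "Dave"), (5, "Eve")]
orders = [(101, 1, 500), (102, 1, 300), (103, 3, 700), (104, 6, 150)]

# Inner join: emit a combined row only when BOTH sides match on customer_id
inner = [
    (c_id, name, o_id, amount)
    for (c_id, name) in customers
    for (o_id, o_cid, amount) in orders
    if c_id == o_cid
]

# Alice appears twice (2 orders), Carol once; Bob, Dave, Eve, and order 104 drop out
print(inner)
```

Running this yields exactly the 3-row result table shown above.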

Left Outer Join

What It Does

Returns ALL rows from the left stream and matching rows from the right. If a left row has no match on the right, the right columns are filled with NULL.

Real-life analogy: It is like a class attendance sheet. Every student (left) appears on the sheet. If they submitted homework (right), you note the grade. If they did not submit, you write “N/A”. No student is left off the sheet.

How to Configure

  1. Join type: Left outer
  2. Same join condition: Customers@customer_id == Orders@customer_id

Result

customer_id  name   city     order_id  amount  order_date
1            Alice  Toronto  101       500     2026-01-15
1            Alice  Toronto  102       300     2026-02-10
2            Bob    Mumbai   NULL      NULL    NULL
3            Carol  London   103       700     2026-03-05
4            Dave   Toronto  NULL      NULL    NULL
5            Eve    Paris    NULL      NULL    NULL

6 rows, covering all 5 customers (Alice is duplicated because she has 2 orders). Bob, Dave, and Eve have NULLs for the order columns because they have no orders. Order 104 (customer_id=6) is still excluded because it has no matching customer on the left.

When to Use Left Outer Join

  • Customer 360 view — show all customers with their orders (including customers who never ordered)
  • Data quality checks — find customers without orders (filter WHERE order_id IS NULL after the join)
  • Preserving the complete primary dataset while enriching with optional data

Finding Records Without Matches

Add a Filter transformation after the Left Outer Join:

isNull(order_id)

Result: Bob, Dave, Eve — customers who have never placed an order. This is one of the most common data engineering patterns.
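That Left Outer Join plus isNull filter pattern can be sketched in plain Python (an illustrative model, not Data Flow code; the names are hypothetical):

```python
customers = [(1, "Alice"), (2, "Bob"), (3, "Carol"), (4, "Dave"), (5, "Eve")]
orders = [(101, 1, 500), (102, 1, 300), (103, 3, 700), (104, 6, 150)]

# Left outer join: keep every left row; fill right columns with None when unmatched
left_outer = []
for c_id, name in customers:
    matches = [(o_id, amount) for (o_id, o_cid, amount) in orders if o_cid == c_id]
    if matches:
        for o_id, amount in matches:
            left_outer.append((c_id, name, o_id, amount))
    else:
        left_outer.append((c_id, name, None, None))  # unmatched left row keeps NULLs

# The "customers who never ordered" pattern: filter where the right key is NULL
never_ordered = [row for row in left_outer if row[2] is None]
print(never_ordered)
```

The join produces 6 rows, and the filter leaves Bob, Dave, and Eve.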

Right Outer Join

What It Does

The mirror of Left Outer. Returns ALL rows from the right stream and matching rows from the left. Unmatched right rows have NULLs for left columns.

Real-life analogy: Like a package delivery report. Every package (right) appears in the report. If the recipient (left) is in the system, their details are included. If not, the recipient columns show “Unknown.”

Result

customer_id  name   city     order_id  amount  order_date
1            Alice  Toronto  101       500     2026-01-15
1            Alice  Toronto  102       300     2026-02-10
3            Carol  London   103       700     2026-03-05
NULL         NULL   NULL     104       150     2026-03-20

4 rows. All orders appear. Order 104 has NULLs for customer columns because customer_id=6 does not exist. Bob, Dave, and Eve are excluded because they have no orders.

When to Use Right Outer Join

In practice, rarely used. You can always swap the streams and use Left Outer instead:

Right Outer Join (Customers LEFT, Orders RIGHT)
  = Left Outer Join (Orders LEFT, Customers RIGHT)

Same result, better readability. Most data engineers stick with Left Outer and swap the stream order when needed.

Full Outer Join

What It Does

Returns ALL rows from BOTH sides. Matching rows are combined. Unmatched rows from either side have NULLs.

Real-life analogy: A company merger. You combine the employee lists from both companies. Some employees are in both lists (they transfer). Some are only in Company A. Some are only in Company B. Everyone appears in the merged list, with NULLs where the other company’s data is missing.

Result

customer_id  name   city     order_id  amount  order_date
1            Alice  Toronto  101       500     2026-01-15
1            Alice  Toronto  102       300     2026-02-10
2            Bob    Mumbai   NULL      NULL    NULL
3            Carol  London   103       700     2026-03-05
4            Dave   Toronto  NULL      NULL    NULL
5            Eve    Paris    NULL      NULL    NULL
NULL         NULL   NULL     104       150     2026-03-20

7 rows. Everyone and everything is included. No data is lost.

When to Use Full Outer Join

  • Data reconciliation — comparing source vs target to find mismatches
  • Merge two systems — combining data from two sources where either side might have records the other does not
  • Data quality audit — finding orphaned records on both sides

Finding Orphans on Both Sides

After a Full Outer Join, filter:

isNull(Customers@customer_id) || isNull(Orders@order_id)

This gives you: customers without orders AND orders without valid customers. Perfect for data quality reports.
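A plain-Python model of the Full Outer Join plus orphan filter (illustrative only, not Data Flow code; names are hypothetical):

```python
customers = [(1, "Alice"), (2, "Bob"), (3, "Carol"), (4, "Dave"), (5, "Eve")]
orders = [(101, 1, 500), (102, 1, 300), (103, 3, 700), (104, 6, 150)]

# Full outer join: all left rows (with None for unmatched right columns),
# then every right row that never matched (with None for left columns)
matched_order_ids = set()
full = []
for c_id, name in customers:
    hit = False
    for o_id, o_cid, amount in orders:
        if o_cid == c_id:
            full.append((c_id, name, o_id, amount))
            matched_order_ids.add(o_id)
            hit = True
    if not hit:
        full.append((c_id, name, None, None))
for o_id, o_cid, amount in orders:
    if o_id not in matched_order_ids:
        full.append((None, None, o_id, amount))

# Orphans on either side: NULL left key OR NULL right key
orphans = [row for row in full if row[0] is None or row[2] is None]
print(orphans)
```

This yields the 7-row result above, and the filter surfaces Bob, Dave, Eve, and order 104.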

Cross Join

What It Does

Produces every possible combination of rows from both sides. No join condition needed.

Real-life analogy: A restaurant menu where you choose one appetizer AND one main course. If there are 5 appetizers and 4 mains, you have 20 possible meal combinations. That is a Cross Join.

How to Configure

  1. Join type: Cross
  2. No join condition (it combines everything with everything)

Result

5 customers x 4 orders = 20 rows. Every customer paired with every order.

When to Use Cross Join

  • Generate all combinations — all products x all stores for inventory planning
  • Create date dimensions — years x months x days
  • Test data generation

Warning: Cross Join on large datasets explodes the row count. 10,000 x 10,000 = 100 million rows. Only use on small reference datasets.
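The menu analogy maps directly onto a Cartesian product. A quick stdlib sketch (item names are made up for illustration):

```python
from itertools import product

appetizers = ["soup", "salad", "bruschetta", "spring rolls", "dumplings"]  # 5 options
mains = ["pasta", "curry", "steak", "risotto"]                             # 4 options

# Cross join: every appetizer paired with every main, no join condition
combos = list(product(appetizers, mains))
print(len(combos))  # 5 x 4 = 20 combinations
```

This is also why row counts explode: the output size is always |left| x |right|.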

Exists (Anti-Join Alternative)

Data Flows have a separate Exists transformation that works differently from Join:

What It Does

Filters the left stream to keep only rows that DO or DO NOT exist in the right stream. It does NOT add columns from the right side.

Real-life analogy: A bouncer with a guest list. They check if your name is on the list (exists) or not (not exists). They do not ask about your table reservation details — they just let you in or turn you away.

How to Configure

  1. Add an Exists transformation (not Join)
  2. Right stream: OrdersSource
  3. Exist type: Exists (or Doesn't exist)
  4. Condition: Customers@customer_id == Orders@customer_id

Exists Result (Customers Who Have Orders)

customer_id  name   city     status
1            Alice  Toronto  Active
3            Carol  London   Inactive

Only customers who have at least one order. No order columns added. Alice appears ONCE (not twice like Inner Join), because Exists is a filter, not a combiner.

Not Exists Result (Customers Without Orders)

customer_id  name   city     status
2            Bob    Mumbai   Active
4            Dave   Toronto  Active
5            Eve    Paris    Active

Customers who have zero orders. This is the anti-join pattern.

Exists vs Left Outer Join + Filter

Both achieve “customers without orders” but differently:

Approach                          How                           Result Columns            Performance
Exists (Not exists)               Filter only, no column merge  Left stream columns only  Faster (no data shuffling)
Left Outer Join + Filter(isNull)  Full join, then filter        Both streams' columns     Slower (joins all data first)

Use Exists when you only need to know IF a match exists, not the matched data. Use Join when you need columns from both sides.
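The "filter, not combiner" distinction can be modeled in plain Python (illustrative only; in particular, note Alice appears once, not twice):

```python
customers = [(1, "Alice"), (2, "Bob"), (3, "Carol"), (4, "Dave"), (5, "Eve")]
orders = [(101, 1), (102, 1), (103, 3), (104, 6)]  # (order_id, customer_id)

# Build the set of customer_ids that appear on the right side
order_customer_ids = {cid for _, cid in orders}

# Exists: keep left rows with at least one match; no right columns added,
# and no duplication even when a customer has multiple orders
exists = [c for c in customers if c[0] in order_customer_ids]

# Not exists: the anti-join pattern
not_exists = [c for c in customers if c[0] not in order_customer_ids]
```

A set membership test is also a decent mental model for why Exists avoids combining data.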

Lookup vs Join: When to Use Which

Data Flows have both Join and Lookup transformations. They seem similar but behave differently:

Feature             Join                                     Lookup
Output              Combined rows from both streams          Left stream enriched with right stream values
Row multiplication  Yes (one-to-many creates multiple rows)  No (returns first/last/any match)
Multiple matches    Creates duplicate rows                   Returns one value per row
Use case            Combining equal datasets                 Adding reference data to primary stream

Example: Alice Has 2 Orders

Join result: 2 rows (one per order)

customer_id  name   order_id  amount
1            Alice  101       500
1            Alice  102       300

Lookup result: 1 row (picks first match)

customer_id  name   first_order_id  first_amount
1            Alice  101             500

Real-life analogy: Join is like matching students with ALL their exam scores (one row per exam). Lookup is like matching students with their LATEST exam score only (one row per student).

Use Lookup when: You want to enrich data without duplicating rows — adding a dimension value to a fact record.

Use Join when: You genuinely need all matching combinations.
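A minimal Python sketch of the Lookup behavior (first-match enrichment, one output row per input row). This is an illustrative model, not the Lookup transformation itself:

```python
customers = [(1, "Alice"), (2, "Bob")]
orders = [(101, 1, 500), (102, 1, 300)]  # (order_id, customer_id, amount)

def lookup_first(c_id):
    """Return the FIRST matching order, or (None, None) if no match."""
    for o_id, o_cid, amount in orders:
        if o_cid == c_id:
            return (o_id, amount)
    return (None, None)

# One row per customer, even though Alice has two orders
enriched = [(c_id, name) + lookup_first(c_id) for c_id, name in customers]
print(enriched)
```

Compare with the Join sketch earlier, which would emit two Alice rows.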

Join on Multiple Keys

Sometimes a single column is not enough. You need to match on multiple columns:

How to Configure

In the Join transformation, click + to add multiple conditions:

Condition 1: Left@year == Right@year
Condition 2: Left@month == Right@month
Condition 3: Left@region == Right@region

All conditions must be TRUE for rows to match (AND logic).

Real-life analogy: Multi-key joins are like identifying a person. Name alone is not enough (there are many “Johns”). Name + city is better. Name + city + birthday is almost unique. More keys = more precise matching.

In the JSON

"joinCondition": {
    "left": ["year", "month", "region"],
    "right": ["yr", "mo", "reg"]
}

Column names can differ between left and right as long as they contain comparable data.
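Under the hood, a multi-key equality join is just matching on a composite key. A plain-Python sketch (all dataset and column names here are invented for illustration):

```python
# Left: (year, month, region, sales); Right: (yr, mo, reg, discount)
left = [(2026, 1, "EMEA", 100), (2026, 1, "APAC", 200)]
right = [(2026, 1, "EMEA", 0.2)]

# All three key columns must match: tuple comparison gives the AND logic
joined = [
    (y, m, r, sales, disc)
    for (y, m, r, sales) in left
    for (yr, mo, reg, disc) in right
    if (y, m, r) == (yr, mo, reg)
]
print(joined)
```

Only the EMEA row matches on all three keys; APAC falls out.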

Join with Expressions (Non-Equality Joins)

Standard joins use equality (==). But sometimes you need range-based or expression-based joins:

Date Range Join (SCD Type 2 Lookup)

Match orders to the customer dimension version that was active on the order date:

Customers@customer_id == Orders@customer_id
  && Orders@order_date >= Customers@effective_from
  && Orders@order_date < Customers@effective_to

Note that the Data Flow expression language chains conditions with && rather than SQL's AND. To enter a range predicate like this, set the join type to Custom (cross join) and supply the full expression.

Fuzzy Matching (Contains)

contains(Left@product_name, Right@search_term)

Note: Non-equality joins cannot use broadcast optimization and may be slower than equality joins.

Handling Duplicate Column Names

When both streams have a column with the same name (e.g., both have customer_id), the output has duplicates. This causes errors downstream.

Solution 1: Use Select After Join

Add a Select transformation to drop or rename duplicate columns:

Keep: Customers@customer_id (rename to customer_id)
Drop: Orders@customer_id (duplicate)
Keep: all other columns

Solution 2: Stream Prefix

In the Join configuration, Data Flows automatically prefix ambiguous columns:

Customers@customer_id
Orders@customer_id

You can reference them with the stream prefix in downstream transformations.

Best Practice

Always add a Select transformation after every Join to clean up column names. Remove duplicates, rename ambiguous columns, and drop unnecessary fields. Think of it as tidying up after combining two piles of documents — remove the duplicates before filing.

Broadcast Optimization

The Problem: Shuffle

When joining two large datasets, Spark must shuffle data across the cluster — sending rows with matching keys to the same node. This is like reorganizing a library by moving books between shelves. Slow and expensive.

The Solution: Broadcast

If one dataset is small (under 50 MB), you can broadcast it — send the entire small dataset to every node. Each node then joins locally without shuffling. This is like giving every librarian a photocopy of the small catalog — they can match books locally without walking to other shelves.

How to Configure

  1. In the Join transformation, click Optimize tab
  2. Select Fixed broadcast
  3. Choose which side to broadcast: Left or Right (pick the smaller one)

Or use Auto: Data Flows automatically broadcast when one side is small enough.

When to Broadcast

Left Stream              Right Stream          Strategy
10 million rows          500 rows (dim table)  Broadcast Right
200 rows (config table)  5 million rows        Broadcast Left
10 million rows          10 million rows       No broadcast (shuffle)

Rule of thumb: Broadcast the side that is under 50 MB. Never broadcast large datasets — it overwhelms node memory.
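Conceptually, a broadcast join is a hash join where the small side becomes an in-memory map that every node receives. A simplified stdlib sketch of the idea (not Spark code; names are hypothetical):

```python
# Small dimension side -- the one that gets "broadcast"
small_dim = [(1, "Toronto"), (2, "Mumbai")]           # (customer_id, city)
# Large fact side -- stays partitioned across the cluster
big_fact = [(101, 1, 500), (102, 2, 300), (103, 1, 700)]  # (order_id, customer_id, amount)

# Build the hash map once; this map is what Spark ships to every node
lookup = dict(small_dim)

# Each node then probes the map locally per fact row -- no shuffle of the big side
joined = [
    (o_id, c_id, amount, lookup.get(c_id))
    for (o_id, c_id, amount) in big_fact
]
print(joined)
```

This is also why broadcasting a large table fails: the whole map must fit in each node's memory.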

Real-World Scenarios

Scenario 1: Customer 360 with Orders and Support Tickets

Customers (Source)
  |
  LEFT OUTER JOIN Orders ON customer_id
  |
  LEFT OUTER JOIN Support_Tickets ON customer_id
  |
  Aggregate: total_orders, total_spend, open_tickets per customer
  |
  Derived Column: health_score = iif(open_tickets > 3, 'At Risk', 'Healthy')
  |
  Sink: dim_customer_360

Left Outer because you want ALL customers, even those with no orders or no tickets.

Scenario 2: Data Reconciliation (Source vs Target)

Source_Data (Left)
  |
  FULL OUTER JOIN Target_Data ON primary_key
  |
  Conditional Split:
    Missing in Target: isNull(Target@primary_key) --> Sink: missing_in_target
    Missing in Source: isNull(Source@primary_key) --> Sink: missing_in_source
    Mismatched:        Source@amount != Target@amount --> Sink: mismatched_records
    Matched:           Default --> (discard or log)

Full Outer catches mismatches on both sides.

Scenario 3: Slowly Changing Dimension Type 2 Lookup

Fact_Orders (Left)
  |
  INNER JOIN Dim_Customer ON customer_id
    AND order_date >= effective_from
    AND order_date < effective_to
  |
  Result: Each order matched to the correct customer version

This ensures an order from 2025 matches the customer’s 2025 address, not their 2026 address.

Scenario 4: Product Enrichment from Multiple Catalogs

Internal_Products (Left)
  |
  LEFT OUTER JOIN Vendor_Catalog ON product_sku
  |
  LEFT OUTER JOIN Pricing_Service ON product_id
  |
  Select: product_name, vendor_description, current_price
  |
  Sink: enriched_products

Left Outer because some internal products may not be in the vendor catalog or pricing service yet.

Performance Tips

  1. Broadcast small tables — always broadcast dimension tables (under 50 MB) when joining with large fact tables.

  2. Filter before joining — reduce row count BEFORE the join. If you only need Active customers, filter first, then join. Less data to shuffle = faster join.

  3. Drop unnecessary columns before joining — add a Select before the Join to keep only the columns you need. Less data per row = faster processing.

  4. Use Exists instead of Join for filtering — if you only need to know IF a match exists (not the matched columns), Exists is faster because it does not combine data.

  5. Avoid Cross Joins on large datasets — 10K x 10K = 100M rows. If you must, filter one side first.

  6. Partition your data — Spark joins are fastest when both sides are partitioned on the join key.

Common Mistakes

1. Unexpected Row Multiplication

You left-join customers (5 rows) with orders (4 rows) and expect 5 rows, one per customer. Instead you get 6, because Alice has 2 orders. A join creates a row for EACH match.

Fix: If you need one row per customer with aggregated order data, use Aggregate after the Join (SUM amount, COUNT orders).
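The aggregate-after-join fix, sketched in plain Python (illustrative only; input rows mirror the post's sample data):

```python
# Post-join rows: one row per (customer, order) match
joined = [(1, "Alice", 500), (1, "Alice", 300), (3, "Carol", 700)]

# Collapse back to one row per customer: SUM(amount), COUNT(orders)
totals = {}
for c_id, name, amount in joined:
    total, count = totals.get((c_id, name), (0, 0))
    totals[(c_id, name)] = (total + amount, count + 1)

print(totals)  # one entry per customer
```

Alice's two order rows collapse into a single (800, 2) summary, which is usually what the report actually wanted.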

2. Joining on Wrong Column

You accidentally join on customer_name instead of customer_id. Two different people named “Alex” get merged into one.

Fix: Always join on the unique identifier (primary key / foreign key), not on descriptive columns.

3. Not Handling NULLs in Join Keys

If customer_id is NULL in some rows, NULL != NULL in a join. Those rows will never match.

Fix: Filter out NULLs before joining, or use coalesce(customer_id, -1) to replace NULLs with a sentinel value.
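The sentinel fix can be sketched in a couple of lines (illustrative model; the sentinel value -1 is an arbitrary choice that must not collide with real keys):

```python
rows = [(1, 500), (None, 300), (2, 700)]  # (customer_id, amount), one key is NULL

# NULL join keys never match (NULL != NULL), so replace them with a sentinel
SENTINEL = -1
cleaned = [(cid if cid is not None else SENTINEL, amt) for cid, amt in rows]
print(cleaned)
```

In a Data Flow you would do the equivalent with a Derived Column using coalesce(customer_id, -1) before the Join.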

4. Duplicate Columns Breaking Downstream

After joining, both sides have customer_id. A downstream Aggregate on customer_id fails because it is ambiguous.

Fix: Always add a Select after Join to resolve duplicate column names.

5. Broadcasting a Large Table

You broadcast a 2 GB table to every node. Each node runs out of memory.

Fix: Only broadcast tables under 50 MB. Let Spark shuffle for larger tables.

Interview Questions

Q: What join types are available in Synapse Data Flows?
A: Inner, Left Outer, Right Outer, Full Outer, and Cross. Additionally, the Exists transformation provides an anti-join pattern, and Lookup provides single-match enrichment.

Q: What is the difference between Join and Lookup in Data Flows?
A: Join creates a row for every match (one-to-many produces duplicate rows). Lookup returns only one match per input row (no duplication). Use Join for combining datasets. Use Lookup for enriching with reference data.

Q: What is broadcast optimization and when should you use it?
A: Broadcasting sends a small dataset to every Spark node, eliminating the need for data shuffling during the join. Use it when one side is under 50 MB (dimension tables, config tables). Never broadcast large datasets.

Q: How do you find records in one dataset that do not exist in another?
A: Use the Exists transformation with "Not exists" type. It filters the left stream to keep only rows with no match on the right. Alternatively, use a Left Outer Join followed by a Filter where the right key is NULL.

Q: How do you handle duplicate column names after a join?
A: Add a Select transformation after the Join to rename or drop duplicate columns. Data Flows use stream prefixes (Left@column, Right@column) to disambiguate, but downstream transformations need clean column names.

Wrapping Up

Joins in Data Flows work exactly like SQL joins — same logic, visual interface, Spark engine. The key decisions are:

  • Inner when you only want matched records
  • Left Outer when you want all records from the primary stream
  • Full Outer for reconciliation and finding mismatches on both sides
  • Exists for filtering without adding columns
  • Lookup for enriching without duplicating rows

And always remember: filter early, broadcast small tables, and clean up duplicate columns after every join.



Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
