PySpark Joins in Azure Databricks: Every Join Type Explained with Examples and Real-Life Analogies
Joins are the bread and butter of data engineering. Every pipeline you build, every report you generate, every dimension table you enrich — somewhere in the process, two DataFrames need to be combined.
But PySpark has EIGHT join types, not just the four you know from SQL textbooks. Inner, left, right, and full outer are familiar. But what about left_semi and left_anti? What about cross and self joins? These lesser-known joins solve real problems that would otherwise require clunky workarounds.
This post walks through all eight join types using the SAME two DataFrames, so you can see exactly how each join behaves differently on identical data. Same input, different outputs — that is what makes the differences click.
Think of joins like matching guests at a wedding. The left table is the bride’s guest list. The right table is the groom’s guest list. Different join types answer different questions: “Who is on BOTH lists?” (inner). “Who is on the bride’s list regardless of the groom’s?” (left). “Who is on the bride’s list but NOT the groom’s?” (anti). Same guest lists, different questions, different results.
Table of Contents
- The Setup: Two DataFrames
- Why the Setup Data Is Designed This Way
- Join 1: INNER JOIN — Only Matches
- Join 2: LEFT JOIN — All Left, Matches from Right
- Join 3: RIGHT JOIN — All Right, Matches from Left
- Join 4: FULL OUTER JOIN — Everything from Both
- Join 5: LEFT SEMI JOIN — Left Rows WITH a Match
- Join 6: LEFT ANTI JOIN — Left Rows WITHOUT a Match
- Join 7: CROSS JOIN — Every Row Times Every Row
- Join 8: SELF JOIN — Table Joined with Itself
- Side-by-Side Comparison
- Broadcast Joins (Performance Optimization)
- Handling Duplicate Column Names
- Joining on Multiple Conditions
- Joining on Inequality Conditions
- SQL Equivalents for Every Join
- Common Mistakes
- When to Use Which Join
- Interview Questions
- Wrapping Up
The Setup: Two DataFrames
We will use these two DataFrames for EVERY join example:
from pyspark.sql.functions import col
# Employees table — 5 employees, one has NULL department
emp_data = [
(1, "Naveen", "DE"),
(2, "Shrey", "DS"),
(3, "Vrushab", "DE"),
(4, "Vishnu", "QA"),
(5, "Ravi", None), # No department assigned
]
df_emp = spark.createDataFrame(emp_data, ["emp_id", "name", "dept_code"])
# Departments table — 4 departments, one has no employees
dept_data = [
("DE", "Data Engineering", 500000),
("DS", "Data Science", 450000),
("QA", "Quality Assurance", 300000),
("PM", "Project Management", 400000), # No employees in PM
]
df_dept = spark.createDataFrame(dept_data, ["dept_code", "dept_name", "budget"])
Employees:
| emp_id | name | dept_code |
|---|---|---|
| 1 | Naveen | DE |
| 2 | Shrey | DS |
| 3 | Vrushab | DE |
| 4 | Vishnu | QA |
| 5 | Ravi | NULL |
Departments:
| dept_code | dept_name | budget |
|---|---|---|
| DE | Data Engineering | 500000 |
| DS | Data Science | 450000 |
| QA | Quality Assurance | 300000 |
| PM | Project Management | 400000 |
Why the Setup Data Is Designed This Way
The data is intentionally designed so every join produces a DIFFERENT result:
- Ravi has a NULL dept_code — he exists in employees but has NO matching department
- PM (Project Management) exists in departments but NO employee belongs to it
- Naveen and Vrushab both belong to DE — useful for showing duplicate matches and self-join
These three facts create every possible join scenario: matches, left-only rows, right-only rows, NULLs, and duplicates.
Join 1: INNER JOIN — Only Matches
What It Returns
Only rows where the join condition matches in BOTH tables. No NULLs, no unmatched rows.
df_inner = df_emp.join(df_dept, df_emp.dept_code == df_dept.dept_code, "inner")
df_inner.show()
| emp_id | name | dept_code | dept_code | dept_name | budget |
|---|---|---|---|---|---|
| 1 | Naveen | DE | DE | Data Engineering | 500000 |
| 2 | Shrey | DS | DS | Data Science | 450000 |
| 3 | Vrushab | DE | DE | Data Engineering | 500000 |
| 4 | Vishnu | QA | QA | Quality Assurance | 300000 |
4 rows. Ravi is excluded (NULL dept has no match). PM is excluded (no employee has dept_code = PM).
Real-Life Analogy
Inner join is like a guest list checkpoint at an exclusive event. Only people whose name appears on BOTH the invite list AND the RSVP list get in. If you were invited but did not RSVP (Ravi — in employees but no matching dept), you are turned away. If a seat was reserved but nobody claimed it (PM — in departments but no matching employee), the seat stays empty.
When to Use
- Standard enrichment: “Give me employees WITH their department details”
- When you only want records that have complete data on both sides
- Most common join type — the default when you just say “join”
Join 2: LEFT JOIN (LEFT OUTER) — All Left, Matches from Right
What It Returns
ALL rows from the left table (employees), plus matching columns from the right table (departments). Where there is no match, right-side columns are NULL.
df_left = df_emp.join(df_dept, df_emp.dept_code == df_dept.dept_code, "left")
df_left.show()
| emp_id | name | dept_code | dept_code | dept_name | budget |
|---|---|---|---|---|---|
| 1 | Naveen | DE | DE | Data Engineering | 500000 |
| 2 | Shrey | DS | DS | Data Science | 450000 |
| 3 | Vrushab | DE | DE | Data Engineering | 500000 |
| 4 | Vishnu | QA | QA | Quality Assurance | 300000 |
| 5 | Ravi | NULL | NULL | NULL | NULL |
5 rows. All employees are kept. Ravi shows up with NULLs for department details. PM is excluded (it is on the right side, not the left).
Real-Life Analogy
Left join is like a class attendance sheet. Every student (left table) is listed, whether they submitted homework (right table) or not. Students who submitted get a grade next to their name. Students who did not submit get a blank — but they are still on the sheet. The left side is always complete.
When to Use
- “Give me ALL employees, whether they have a department or not”
- When the left table is your primary dataset and you want to enrich it with optional data
- Finding employees WITHOUT departments: df_left.filter(col("dept_name").isNull())
Join 3: RIGHT JOIN (RIGHT OUTER) — All Right, Matches from Left
What It Returns
ALL rows from the right table (departments), plus matching columns from the left table (employees). Where there is no match, left-side columns are NULL.
df_right = df_emp.join(df_dept, df_emp.dept_code == df_dept.dept_code, "right")
df_right.show()
| emp_id | name | dept_code | dept_code | dept_name | budget |
|---|---|---|---|---|---|
| 1 | Naveen | DE | DE | Data Engineering | 500000 |
| 2 | Shrey | DS | DS | Data Science | 450000 |
| 3 | Vrushab | DE | DE | Data Engineering | 500000 |
| 4 | Vishnu | QA | QA | Quality Assurance | 300000 |
| NULL | NULL | NULL | PM | Project Management | 400000 |
5 rows. All departments are kept. PM shows up with NULLs for employee details. Ravi is excluded (he is on the left side with no match on the right).
Real-Life Analogy
Right join is like a department budget report. Every department (right table) is listed, whether it has employees or not. Departments with employees show the headcount. Empty departments show up with zeroes — but they are still on the report. The right side is always complete.
When to Use
- “Give me ALL departments, whether they have employees or not”
- Finding empty departments: df_right.filter(col("name").isNull())
- In practice, most engineers rewrite right joins as left joins by swapping the table order — same result, easier to read
Join 4: FULL OUTER JOIN — Everything from Both
What It Returns
ALL rows from BOTH tables. Matched rows are combined. Unmatched rows from either side get NULLs for the other side’s columns.
df_full = df_emp.join(df_dept, df_emp.dept_code == df_dept.dept_code, "full")
df_full.show()
| emp_id | name | dept_code | dept_code | dept_name | budget |
|---|---|---|---|---|---|
| 1 | Naveen | DE | DE | Data Engineering | 500000 |
| 2 | Shrey | DS | DS | Data Science | 450000 |
| 3 | Vrushab | DE | DE | Data Engineering | 500000 |
| 4 | Vishnu | QA | QA | Quality Assurance | 300000 |
| 5 | Ravi | NULL | NULL | NULL | NULL |
| NULL | NULL | NULL | PM | Project Management | 400000 |
6 rows. Everyone is included — matched employees, Ravi (no dept), AND PM (no employees). No data is lost from either side.
Real-Life Analogy
Full outer join is like merging two contact lists from different phones. Contacts in both phones get combined into one entry. Contacts only in Phone A (Ravi) appear with blanks for Phone B fields. Contacts only in Phone B (PM) appear with blanks for Phone A fields. Nothing is lost from either phone.
When to Use
- Data reconciliation: “Show me everything from both systems, highlighting mismatches”
- Finding orphans on BOTH sides simultaneously
- Rarely used in production pipelines — usually you know which side is primary (left or right)
Join 5: LEFT SEMI JOIN — Left Rows WITH a Match
What It Returns
Rows from the left table ONLY where a match EXISTS in the right table. No columns from the right table are included. This is the PySpark equivalent of SQL’s WHERE EXISTS.
df_semi = df_emp.join(df_dept, df_emp.dept_code == df_dept.dept_code, "left_semi")
df_semi.show()
| emp_id | name | dept_code |
|---|---|---|
| 1 | Naveen | DE |
| 2 | Shrey | DS |
| 3 | Vrushab | DE |
| 4 | Vishnu | QA |
4 rows. Only employee columns — no dept_name or budget. Ravi is excluded (no matching department). Only employees who HAVE a department are returned.
Real-Life Analogy
Semi join is like a nightclub bouncer checking a VIP list. The bouncer (left_semi) checks if your name (employee) is on the VIP list (departments). If it is, you get in — but the bouncer does not hand you the VIP list itself. You only keep YOUR information (left columns). You do not get the list’s information (right columns).
SQL Equivalent
SELECT * FROM employees e
WHERE EXISTS (SELECT 1 FROM departments d WHERE d.dept_code = e.dept_code)
When to Use
- Filtering: “Give me employees who belong to a VALID department” without adding department columns
- Cleaner than inner join when you do not need right-side columns
- Data validation: check which records have matching references
Join 6: LEFT ANTI JOIN — Left Rows WITHOUT a Match
What It Returns
Rows from the left table ONLY where NO match exists in the right table. The OPPOSITE of semi join. This is PySpark’s WHERE NOT EXISTS.
df_anti = df_emp.join(df_dept, df_emp.dept_code == df_dept.dept_code, "left_anti")
df_anti.show()
| emp_id | name | dept_code |
|---|---|---|
| 5 | Ravi | NULL |
1 row. Only Ravi — the only employee WITHOUT a matching department. Everyone else was filtered out because they DO have matches.
Real-Life Analogy
Anti join is like a teacher checking who did NOT submit homework. The teacher compares the student list (employees) against the submissions pile (departments). Students whose name is NOT in the pile are the result. It answers: “Who is missing?”
SQL Equivalent
SELECT * FROM employees e
WHERE NOT EXISTS (SELECT 1 FROM departments d WHERE d.dept_code = e.dept_code)
When to Use
- Finding orphan records: “Which employees have no department?”
- Data quality checks: “Which orders reference a product that does not exist?”
- Identifying missing references before loading into a dimension table
- Incremental loads: “Which source records are NOT yet in the target?”
Join 7: CROSS JOIN — Every Row Times Every Row
What It Returns
The cartesian product — every row from the left paired with every row from the right. No join condition needed.
df_cross = df_emp.crossJoin(df_dept)
print(f"Rows: {df_emp.count()} x {df_dept.count()} = {df_cross.count()}")
df_cross.show(20)
20 rows (5 employees x 4 departments). Every employee is paired with every department, regardless of whether they belong to it. Naveen appears 4 times — once paired with each department.
Real-Life Analogy
Cross join is like a speed dating event. Every person on the left side meets every person on the right side. 5 people on the left, 4 on the right = 20 conversations. Most conversations will not be a match, but every combination is tried.
When to Use
- Generating all possible combinations (products x colors, employees x training courses)
- Creating date spine tables (every date x every metric)
- Very rare in data engineering — usually a sign you need a different join type
- Dangerous on large tables: 1M x 1M = 1 TRILLION rows — crashes the cluster
Join 8: SELF JOIN — Table Joined with Itself
What It Returns
A table joined with itself. Requires aliases to distinguish the two copies.
# Find all pairs of employees in the SAME department
df_self = df_emp.alias("a").join(
df_emp.alias("b"),
(col("a.dept_code") == col("b.dept_code")) & (col("a.emp_id") < col("b.emp_id")),
"inner"
).select(
col("a.name").alias("employee_1"),
col("b.name").alias("employee_2"),
col("a.dept_code").alias("department")
)
df_self.show()
| employee_1 | employee_2 | department |
|---|---|---|
| Naveen | Vrushab | DE |
1 row. Naveen and Vrushab are both in DE. The condition a.emp_id < b.emp_id prevents duplicate pairs (Naveen-Vrushab and Vrushab-Naveen) and self-pairs (Naveen-Naveen).
Real-Life Analogy
Self join is like a doubles tennis tournament within a single team. You pair every player with every other player on the SAME team. The a.emp_id < b.emp_id condition ensures you do not pair someone with themselves and you do not count Alice-Bob AND Bob-Alice as separate pairs.
When to Use
- Manager-employee relationships (same table, different roles)
- Finding duplicates or near-duplicates within a table
- Comparing records: “Which employees have the same salary?”
- Hierarchical data: organizational trees, parent-child relationships
Side-by-Side Comparison
print(f"Employees: {df_emp.count()} rows | Departments: {df_dept.count()} rows")
print(f"Matching: Naveen(DE), Shrey(DS), Vrushab(DE), Vishnu(QA)")
print(f"Left only: Ravi(NULL) | Right only: PM")
print()
print(f"{'Join Type':<20} {'Rows':>5} What It Returns")
print("-" * 75)
print(f"{'inner':<20} {'4':>5} Only rows matching in BOTH tables")
print(f"{'left':<20} {'5':>5} ALL left + matching right (NULLs for Ravi)")
print(f"{'right':<20} {'5':>5} ALL right + matching left (NULLs for PM)")
print(f"{'full':<20} {'6':>5} EVERYTHING from both (Ravi + PM both included)")
print(f"{'left_semi':<20} {'4':>5} Left rows WITH match (left columns only)")
print(f"{'left_anti':<20} {'1':>5} Left rows WITHOUT match (only Ravi)")
print(f"{'cross':<20} {'20':>5} Every row x every row (5 x 4 = 20)")
print(f"{'self':<20} {'1':>5} Same-table pairs (Naveen & Vrushab in DE)")
Broadcast Joins (Performance Optimization)
When one DataFrame is small (under 10 MB), broadcast it to every worker node to avoid shuffling the large DataFrame:
from pyspark.sql.functions import broadcast
# Without broadcast: both DataFrames are shuffled across the cluster
df_slow = df_large.join(df_small, "key_column")
# With broadcast: small DataFrame is sent to every node, large stays in place
df_fast = df_large.join(broadcast(df_small), "key_column")
When to use broadcast:
- Lookup/dimension tables (countries, departments, product categories) — usually small
- Any DataFrame under 10 MB
- Spark auto-broadcasts tables under the 10 MB threshold by default (configurable via spark.sql.autoBroadcastJoinThreshold)
Real-life analogy: Without broadcast, both teams travel to a central meeting room (shuffle). With broadcast, the small team (3 people) visits every office of the large team (100 people) — much faster because you do not move 100 people.
Handling Duplicate Column Names
The join condition creates duplicate dept_code columns. Fix it:
# Method 1: Drop the duplicate column
df_joined = df_emp.join(df_dept, df_emp.dept_code == df_dept.dept_code, "inner").drop(df_dept.dept_code)
# Method 2: Join on a list (auto-deduplicates)
df_joined = df_emp.join(df_dept, ["dept_code"], "inner")
# Only ONE dept_code column in the result
# Method 3: Rename before joining
df_dept_renamed = df_dept.withColumnRenamed("dept_code", "d_dept_code")
df_joined = df_emp.join(df_dept_renamed, df_emp.dept_code == df_dept_renamed.d_dept_code, "inner")
Recommendation: Method 2 (join on a list) is cleanest when the column name is the same in both DataFrames.
Joining on Multiple Conditions
# Join on TWO columns
df_result = df_orders.join(
df_products,
(df_orders.product_id == df_products.product_id) &
(df_orders.region == df_products.region),
"inner"
)
# Shorthand when column names match
df_result = df_orders.join(df_products, ["product_id", "region"], "inner")
Joining on Inequality Conditions
# Range join: find which salary band an employee belongs to
salary_bands = spark.createDataFrame([
("Junior", 0, 70000),
("Mid", 70001, 90000),
("Senior", 90001, 120000),
], ["band", "min_salary", "max_salary"])
df_banded = df_emp_with_salary.join(
salary_bands,
(df_emp_with_salary.salary >= salary_bands.min_salary) &
(df_emp_with_salary.salary <= salary_bands.max_salary),
"left"
)
SQL Equivalents for Every Join
df_emp.createOrReplaceTempView("employees")
df_dept.createOrReplaceTempView("departments")
-- INNER
SELECT * FROM employees e INNER JOIN departments d ON e.dept_code = d.dept_code;
-- LEFT
SELECT * FROM employees e LEFT JOIN departments d ON e.dept_code = d.dept_code;
-- RIGHT
SELECT * FROM employees e RIGHT JOIN departments d ON e.dept_code = d.dept_code;
-- FULL OUTER
SELECT * FROM employees e FULL OUTER JOIN departments d ON e.dept_code = d.dept_code;
-- SEMI (SQL uses EXISTS)
SELECT * FROM employees e WHERE EXISTS (SELECT 1 FROM departments d WHERE d.dept_code = e.dept_code);
-- ANTI (SQL uses NOT EXISTS)
SELECT * FROM employees e WHERE NOT EXISTS (SELECT 1 FROM departments d WHERE d.dept_code = e.dept_code);
-- CROSS
SELECT * FROM employees CROSS JOIN departments;
-- SELF
SELECT a.name AS emp1, b.name AS emp2, a.dept_code
FROM employees a INNER JOIN employees b ON a.dept_code = b.dept_code AND a.emp_id < b.emp_id;
Common Mistakes
- Using inner when you mean left — inner drops unmatched rows silently. If your source has 1,000 rows and the result has 950, you lost 50 rows. Use left join to preserve all source rows and investigate NULLs.
- Forgetting to handle duplicate columns — joining on df_a.id == df_b.id creates TWO id columns. Use join(df_b, ["id"]) or .drop() the duplicate.
- Cross join by accident — joining without a condition creates a cartesian product. 10K x 10K = 100M rows. Always specify a join condition unless you intentionally want a cross join.
- Not using broadcast for small tables — joining a 500M-row fact table with a 1,000-row lookup without broadcast causes an expensive shuffle. Broadcast the small table.
- Confusing semi and anti — semi keeps rows WITH a match (filter IN). Anti keeps rows WITHOUT a match (filter NOT IN). Think: semi = “exists,” anti = “not exists.”
- NULL join keys — NULLs never match anything, even other NULLs. Ravi (NULL dept) does not match the NULL department. If you need NULL-safe joins, use eqNullSafe: df_emp.dept_code.eqNullSafe(df_dept.dept_code).
- Self-join without alias — df.join(df, ...) fails with ambiguous column references. Always use .alias("a") and .alias("b").
When to Use Which Join
| Question You Are Asking | Join Type |
|---|---|
| “Give me combined data where both sides match” | inner |
| “Give me ALL employees with department info if available” | left |
| “Give me ALL departments with employee info if available” | right |
| “Give me EVERYTHING from both sides” | full |
| “Which employees HAVE a valid department?” (no dept columns needed) | left_semi |
| “Which employees are MISSING a department?” | left_anti |
| “Generate every possible employee-department combination” | cross |
| “Find employee pairs in the same department” | self |
Interview Questions
Q: What is the difference between inner join and left join?
A: Inner join returns only rows where the join condition matches in BOTH tables — unmatched rows from either side are dropped. Left join returns ALL rows from the left table plus matching rows from the right — unmatched left rows get NULLs for right-side columns. Left join preserves the left table completely.
Q: What is a left_semi join and when would you use it?
A: Left_semi returns rows from the left table where a match EXISTS in the right table, but does NOT include columns from the right table. It is equivalent to SQL’s WHERE EXISTS. Use it to filter a table by the existence of related records without adding extra columns.
Q: What is a left_anti join?
A: Left_anti returns rows from the left table where NO match exists in the right table — the opposite of semi join. Equivalent to SQL’s WHERE NOT EXISTS. Use it to find orphan records, missing references, or records not yet loaded into a target table.
Q: Why should you broadcast small tables in joins?
A: Without broadcast, both DataFrames are shuffled across the cluster (expensive network transfer). With broadcast, the small DataFrame is copied to every worker node, and the large DataFrame stays in place — no shuffle needed. Use broadcast for any lookup/dimension table under 10 MB.
Q: How do you handle duplicate column names after a join?
A: Three methods: join on a column list (join(df_b, ["id"])) which auto-deduplicates, drop the duplicate column (.drop(df_b.id)), or rename before joining (.withColumnRenamed()). Joining on a list is the cleanest approach when column names match.
Q: Do NULLs match in joins?
A: No. NULL never equals NULL in standard joins. A row with NULL in the join key will not match any row, including other NULLs. For NULL-safe joins, use .eqNullSafe() instead of ==.
Wrapping Up
Eight join types, one setup, eight different results. That is the complete picture of PySpark joins. In practice, you will use inner and left 80% of the time, anti for data quality checks, broadcast for performance, and the rest for specific scenarios.
The key insight: every join answers a different question about the relationship between two datasets. Choose the join based on the QUESTION you are asking, not the data.
Related posts:
- SQL Joins (INNER, LEFT, FULL, CROSS, SELF)
- PySpark Transformations Cookbook
- Apache Spark and PySpark Architecture
- Joins in Synapse Data Flows
- PySpark Foundations
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.