PySpark Joins in Azure Databricks: Every Join Type Explained with Examples and Real-Life Analogies
Joins are the bread and butter of data engineering. Every pipeline you build, every report you generate, every dimension table you enrich — somewhere in the process, two DataFrames need to be combined.
But PySpark has EIGHT join types, not just the four you know from SQL textbooks. Inner, left, right, and full outer are familiar. But what about left_semi and left_anti? What about cross and self joins? These lesser-known joins solve real problems that would otherwise require clunky workarounds.
This post walks through all eight join types using the SAME two DataFrames, so you can see exactly how each join behaves differently on identical data. Same input, different outputs — that is what makes the differences click.
Think of joins like matching guests at a wedding. The left table is the bride’s guest list. The right table is the groom’s guest list. Different join types answer different questions: “Who is on BOTH lists?” (inner). “Who is on the bride’s list regardless of the groom’s?” (left). “Who is on the bride’s list but NOT the groom’s?” (anti). Same guest lists, different questions, different results.
Table of Contents
- The Setup: Two DataFrames
- Why the Setup Data Is Designed This Way
- Join 1: INNER JOIN — Only Matches
- Join 2: LEFT JOIN — All Left, Matches from Right
- Join 3: RIGHT JOIN — All Right, Matches from Left
- Join 4: FULL OUTER JOIN — Everything from Both
- Join 5: LEFT SEMI JOIN — Left Rows WITH a Match
- Join 6: LEFT ANTI JOIN — Left Rows WITHOUT a Match
- Join 7: CROSS JOIN — Every Row Times Every Row
- Join 8: SELF JOIN — Table Joined with Itself
- Side-by-Side Comparison
- Broadcast Joins (Performance Optimization)
- Handling Duplicate Column Names
- Joining on Multiple Conditions
- Joining on Inequality Conditions
- SQL Equivalents for Every Join
- Common Mistakes
- When to Use Which Join
- Interview Questions
- Wrapping Up
The Setup: Two DataFrames
We will use these two DataFrames for EVERY join example:
from pyspark.sql.functions import col
# Employees table — 5 employees, one has NULL department
emp_data = [
(1, "Naveen", "DE"),
(2, "Shrey", "DS"),
(3, "Vrushab", "DE"),
(4, "Vishnu", "QA"),
(5, "Ravi", None), # No department assigned
]
df_emp = spark.createDataFrame(emp_data, ["emp_id", "name", "dept_code"])
# Departments table — 4 departments, one has no employees
dept_data = [
("DE", "Data Engineering", 500000),
("DS", "Data Science", 450000),
("QA", "Quality Assurance", 300000),
("PM", "Project Management", 400000), # No employees in PM
]
df_dept = spark.createDataFrame(dept_data, ["dept_code", "dept_name", "budget"])
Employees:
| emp_id | name | dept_code |
|---|---|---|
| 1 | Naveen | DE |
| 2 | Shrey | DS |
| 3 | Vrushab | DE |
| 4 | Vishnu | QA |
| 5 | Ravi | NULL |
Departments:
| dept_code | dept_name | budget |
|---|---|---|
| DE | Data Engineering | 500000 |
| DS | Data Science | 450000 |
| QA | Quality Assurance | 300000 |
| PM | Project Management | 400000 |
Why the Setup Data Is Designed This Way
The data is intentionally designed so every join produces a DIFFERENT result:
- Ravi has a NULL dept_code — he exists in employees but has NO matching department
- PM (Project Management) exists in departments but NO employee belongs to it
- Naveen and Vrushab both belong to DE — useful for showing duplicate matches and self-join
These three facts create every possible join scenario: matches, left-only rows, right-only rows, NULLs, and duplicates.
Join 1: INNER JOIN — Only Matches
What It Returns
Only rows where the join condition matches in BOTH tables. No NULLs, no unmatched rows.
df_inner = df_emp.join(df_dept, df_emp.dept_code == df_dept.dept_code, "inner")
df_inner.show()
| emp_id | name | dept_code | dept_code | dept_name | budget |
|---|---|---|---|---|---|
| 1 | Naveen | DE | DE | Data Engineering | 500000 |
| 2 | Shrey | DS | DS | Data Science | 450000 |
| 3 | Vrushab | DE | DE | Data Engineering | 500000 |
| 4 | Vishnu | QA | QA | Quality Assurance | 300000 |
4 rows. Ravi is excluded (NULL dept has no match). PM is excluded (no employee has dept_code = PM).
Real-Life Analogy
Inner join is like a guest list checkpoint at an exclusive event. Only people whose name appears on BOTH the invite list AND the RSVP list get in. If you were invited but did not RSVP (Ravi — in employees but no matching dept), you are turned away. If a seat was reserved but nobody claimed it (PM — in departments but no matching employee), the seat stays empty.
When to Use
- Standard enrichment: “Give me employees WITH their department details”
- When you only want records that have complete data on both sides
- Most common join type — the default when you just say “join”
Join 2: LEFT JOIN (LEFT OUTER) — All Left, Matches from Right
What It Returns
ALL rows from the left table (employees), plus matching columns from the right table (departments). Where there is no match, right-side columns are NULL.
df_left = df_emp.join(df_dept, df_emp.dept_code == df_dept.dept_code, "left")
df_left.show()
| emp_id | name | dept_code | dept_code | dept_name | budget |
|---|---|---|---|---|---|
| 1 | Naveen | DE | DE | Data Engineering | 500000 |
| 2 | Shrey | DS | DS | Data Science | 450000 |
| 3 | Vrushab | DE | DE | Data Engineering | 500000 |
| 4 | Vishnu | QA | QA | Quality Assurance | 300000 |
| 5 | Ravi | NULL | NULL | NULL | NULL |
5 rows. All employees are kept. Ravi shows up with NULLs for department details. PM is excluded (it is on the right side, not the left).
Real-Life Analogy
Left join is like a class attendance sheet. Every student (left table) is listed, whether they submitted homework (right table) or not. Students who submitted get a grade next to their name. Students who did not submit get a blank — but they are still on the sheet. The left side is always complete.
When to Use
- “Give me ALL employees, whether they have a department or not”
- When the left table is your primary dataset and you want to enrich it with optional data
- Finding employees WITHOUT departments: df_left.filter(col("dept_name").isNull())
Join 3: RIGHT JOIN (RIGHT OUTER) — All Right, Matches from Left
What It Returns
ALL rows from the right table (departments), plus matching columns from the left table (employees). Where there is no match, left-side columns are NULL.
df_right = df_emp.join(df_dept, df_emp.dept_code == df_dept.dept_code, "right")
df_right.show()
| emp_id | name | dept_code | dept_code | dept_name | budget |
|---|---|---|---|---|---|
| 1 | Naveen | DE | DE | Data Engineering | 500000 |
| 2 | Shrey | DS | DS | Data Science | 450000 |
| 3 | Vrushab | DE | DE | Data Engineering | 500000 |
| 4 | Vishnu | QA | QA | Quality Assurance | 300000 |
| NULL | NULL | NULL | PM | Project Management | 400000 |
5 rows. All departments are kept. PM shows up with NULLs for employee details. Ravi is excluded (he is on the left side with no match on the right).
Real-Life Analogy
Right join is like a department budget report. Every department (right table) is listed, whether it has employees or not. Departments with employees show the headcount. Empty departments show up with zeroes — but they are still on the report. The right side is always complete.
When to Use
- “Give me ALL departments, whether they have employees or not”
- Finding empty departments: df_right.filter(col("name").isNull())
- In practice, most engineers rewrite right joins as left joins by swapping the table order — same result, easier to read
Join 4: FULL OUTER JOIN — Everything from Both
What It Returns
ALL rows from BOTH tables. Matched rows are combined. Unmatched rows from either side get NULLs for the other side’s columns.
df_full = df_emp.join(df_dept, df_emp.dept_code == df_dept.dept_code, "full")
df_full.show()
| emp_id | name | dept_code | dept_code | dept_name | budget |
|---|---|---|---|---|---|
| 1 | Naveen | DE | DE | Data Engineering | 500000 |
| 2 | Shrey | DS | DS | Data Science | 450000 |
| 3 | Vrushab | DE | DE | Data Engineering | 500000 |
| 4 | Vishnu | QA | QA | Quality Assurance | 300000 |
| 5 | Ravi | NULL | NULL | NULL | NULL |
| NULL | NULL | NULL | PM | Project Management | 400000 |
6 rows. Everyone is included — matched employees, Ravi (no dept), AND PM (no employees). No data is lost from either side.
Real-Life Analogy
Full outer join is like merging two contact lists from different phones. Contacts in both phones get combined into one entry. Contacts only in Phone A (Ravi) appear with blanks for Phone B fields. Contacts only in Phone B (PM) appear with blanks for Phone A fields. Nothing is lost from either phone.
When to Use
- Data reconciliation: “Show me everything from both systems, highlighting mismatches”
- Finding orphans on BOTH sides simultaneously
- Rarely used in production pipelines — usually you know which side is primary (left or right)
Join 5: LEFT SEMI JOIN — Left Rows WITH a Match
What It Returns
Rows from the left table ONLY where a match EXISTS in the right table. No columns from the right table are included. This is the PySpark equivalent of SQL’s WHERE EXISTS.
df_semi = df_emp.join(df_dept, df_emp.dept_code == df_dept.dept_code, "left_semi")
df_semi.show()
| emp_id | name | dept_code |
|---|---|---|
| 1 | Naveen | DE |
| 2 | Shrey | DS |
| 3 | Vrushab | DE |
| 4 | Vishnu | QA |
4 rows. Only employee columns — no dept_name or budget. Ravi is excluded (no matching department). Only employees who HAVE a department are returned.
Real-Life Analogy
Semi join is like a nightclub bouncer checking a VIP list. The bouncer (left_semi) checks if your name (employee) is on the VIP list (departments). If it is, you get in — but the bouncer does not hand you the VIP list itself. You only keep YOUR information (left columns). You do not get the list’s information (right columns).
SQL Equivalent
SELECT * FROM employees e
WHERE EXISTS (SELECT 1 FROM departments d WHERE d.dept_code = e.dept_code)
When to Use
- Filtering: “Give me employees who belong to a VALID department” without adding department columns
- Cleaner than inner join when you do not need right-side columns
- Data validation: check which records have matching references
Join 6: LEFT ANTI JOIN — Left Rows WITHOUT a Match
What It Returns
Rows from the left table ONLY where NO match exists in the right table. The OPPOSITE of semi join. This is PySpark’s WHERE NOT EXISTS.
df_anti = df_emp.join(df_dept, df_emp.dept_code == df_dept.dept_code, "left_anti")
df_anti.show()
| emp_id | name | dept_code |
|---|---|---|
| 5 | Ravi | NULL |
1 row. Only Ravi — the only employee WITHOUT a matching department. Everyone else was filtered out because they DO have matches.
Real-Life Analogy
Anti join is like a teacher checking who did NOT submit homework. The teacher compares the student list (employees) against the submissions pile (departments). Students whose name is NOT in the pile are the result. It answers: “Who is missing?”
SQL Equivalent
SELECT * FROM employees e
WHERE NOT EXISTS (SELECT 1 FROM departments d WHERE d.dept_code = e.dept_code)
When to Use
- Finding orphan records: “Which employees have no department?”
- Data quality checks: “Which orders reference a product that does not exist?”
- Identifying missing references before loading into a dimension table
- Incremental loads: “Which source records are NOT yet in the target?”
Join 7: CROSS JOIN — Every Row Times Every Row
What It Returns
The cartesian product — every row from the left paired with every row from the right. No join condition needed.
df_cross = df_emp.crossJoin(df_dept)
print(f"Rows: {df_emp.count()} x {df_dept.count()} = {df_cross.count()}")
df_cross.show(20)
20 rows (5 employees x 4 departments). Every employee is paired with every department, regardless of whether they belong to it. Naveen appears 4 times — once paired with each department.
Real-Life Analogy
Cross join is like a speed dating event. Every person on the left side meets every person on the right side. 5 people on the left, 4 on the right = 20 conversations. Most conversations will not be a match, but every combination is tried.
When to Use
- Generating all possible combinations (products x colors, employees x training courses)
- Creating date spine tables (every date x every metric)
- Very rare in data engineering — usually a sign you need a different join type
- Dangerous on large tables: 1M x 1M = 1 TRILLION rows — crashes the cluster
Join 8: SELF JOIN — Table Joined with Itself
What It Returns
A table joined with itself. Requires aliases to distinguish the two copies.
# Find all pairs of employees in the SAME department
df_self = df_emp.alias("a").join(
df_emp.alias("b"),
(col("a.dept_code") == col("b.dept_code")) & (col("a.emp_id") < col("b.emp_id")),
"inner"
).select(
col("a.name").alias("employee_1"),
col("b.name").alias("employee_2"),
col("a.dept_code").alias("department")
)
df_self.show()
| employee_1 | employee_2 | department |
|---|---|---|
| Naveen | Vrushab | DE |
1 row. Naveen and Vrushab are both in DE. The condition a.emp_id < b.emp_id prevents duplicate pairs (Naveen-Vrushab and Vrushab-Naveen) and self-pairs (Naveen-Naveen).
Real-Life Analogy
Self join is like a doubles tennis tournament within a single team. You pair every player with every other player on the SAME team. The a.emp_id < b.emp_id condition ensures you do not pair someone with themselves and you do not count Alice-Bob AND Bob-Alice as separate pairs.
When to Use
- Manager-employee relationships (same table, different roles)
- Finding duplicates or near-duplicates within a table
- Comparing records: “Which employees have the same salary?”
- Hierarchical data: organizational trees, parent-child relationships
Side-by-Side Comparison
print(f"Employees: {df_emp.count()} rows | Departments: {df_dept.count()} rows")
print(f"Matching: Naveen(DE), Shrey(DS), Vrushab(DE), Vishnu(QA)")
print(f"Left only: Ravi(NULL) | Right only: PM")
print()
print(f"{'Join Type':<20} {'Rows':>5} What It Returns")
print("-" * 75)
print(f"{'inner':<20} {'4':>5} Only rows matching in BOTH tables")
print(f"{'left':<20} {'5':>5} ALL left + matching right (NULLs for Ravi)")
print(f"{'right':<20} {'5':>5} ALL right + matching left (NULLs for PM)")
print(f"{'full':<20} {'6':>5} EVERYTHING from both (Ravi + PM both included)")
print(f"{'left_semi':<20} {'4':>5} Left rows WITH match (left columns only)")
print(f"{'left_anti':<20} {'1':>5} Left rows WITHOUT match (only Ravi)")
print(f"{'cross':<20} {'20':>5} Every row x every row (5 x 4 = 20)")
print(f"{'self':<20} {'1':>5} Same-table pairs (Naveen & Vrushab in DE)")
Broadcast Joins (Performance Optimization)
When one DataFrame is small (under 10 MB), broadcast it to every worker node to avoid shuffling the large DataFrame:
from pyspark.sql.functions import broadcast
# Without broadcast: both DataFrames are shuffled across the cluster
df_slow = df_large.join(df_small, "key_column")
# With broadcast: small DataFrame is sent to every node, large stays in place
df_fast = df_large.join(broadcast(df_small), "key_column")
When to use broadcast:
- Lookup/dimension tables (countries, departments, product categories) — usually small
- Any DataFrame under 10 MB
- Spark auto-broadcasts tables under the 10 MB threshold by default (configurable via spark.sql.autoBroadcastJoinThreshold)
Real-life analogy: Without broadcast, both teams travel to a central meeting room (shuffle). With broadcast, the small team (3 people) visits every office of the large team (100 people) — much faster because you do not move 100 people.
Handling Duplicate Column Names
The join condition creates duplicate dept_code columns. Fix it:
# Method 1: Drop the duplicate column
df_joined = df_emp.join(df_dept, df_emp.dept_code == df_dept.dept_code, "inner").drop(df_dept.dept_code)
# Method 2: Join on a list (auto-deduplicates)
df_joined = df_emp.join(df_dept, ["dept_code"], "inner")
# Only ONE dept_code column in the result
# Method 3: Rename before joining
df_dept_renamed = df_dept.withColumnRenamed("dept_code", "d_dept_code")
df_joined = df_emp.join(df_dept_renamed, df_emp.dept_code == df_dept_renamed.d_dept_code, "inner")
Recommendation: Method 2 (join on a list) is cleanest when the column name is the same in both DataFrames.
Joining on Multiple Conditions
# Join on TWO columns
df_result = df_orders.join(
df_products,
(df_orders.product_id == df_products.product_id) &
(df_orders.region == df_products.region),
"inner"
)
# Shorthand when column names match
df_result = df_orders.join(df_products, ["product_id", "region"], "inner")
Joining on Inequality Conditions
# Range join: find which salary band an employee belongs to
salary_bands = spark.createDataFrame([
("Junior", 0, 70000),
("Mid", 70001, 90000),
("Senior", 90001, 120000),
], ["band", "min_salary", "max_salary"])
df_banded = df_emp_with_salary.join(
salary_bands,
(df_emp_with_salary.salary >= salary_bands.min_salary) &
(df_emp_with_salary.salary <= salary_bands.max_salary),
"left"
)
SQL Equivalents for Every Join
df_emp.createOrReplaceTempView("employees")
df_dept.createOrReplaceTempView("departments")
-- INNER
SELECT * FROM employees e INNER JOIN departments d ON e.dept_code = d.dept_code;
-- LEFT
SELECT * FROM employees e LEFT JOIN departments d ON e.dept_code = d.dept_code;
-- RIGHT
SELECT * FROM employees e RIGHT JOIN departments d ON e.dept_code = d.dept_code;
-- FULL OUTER
SELECT * FROM employees e FULL OUTER JOIN departments d ON e.dept_code = d.dept_code;
-- SEMI (SQL uses EXISTS)
SELECT * FROM employees e WHERE EXISTS (SELECT 1 FROM departments d WHERE d.dept_code = e.dept_code);
-- ANTI (SQL uses NOT EXISTS)
SELECT * FROM employees e WHERE NOT EXISTS (SELECT 1 FROM departments d WHERE d.dept_code = e.dept_code);
-- CROSS
SELECT * FROM employees CROSS JOIN departments;
-- SELF
SELECT a.name AS emp1, b.name AS emp2, a.dept_code
FROM employees a INNER JOIN employees b ON a.dept_code = b.dept_code AND a.emp_id < b.emp_id;
Common Mistakes
- Using inner when you mean left — inner drops unmatched rows silently. If your source has 1,000 rows and the result has 950, you lost 50 rows. Use left join to preserve all source rows and investigate NULLs.
- Forgetting to handle duplicate columns — joining on df_a.id == df_b.id creates TWO id columns. Use join(df_b, ["id"]) or .drop() the duplicate.
- Cross join by accident — joining without a condition creates a cartesian product. 10K x 10K = 100M rows. Always specify a join condition unless you intentionally want a cross join.
- Not using broadcast for small tables — joining a 500M-row fact table with a 1,000-row lookup without broadcast causes an expensive shuffle. Broadcast the small table.
- Confusing semi and anti — semi keeps rows WITH a match (filter IN). Anti keeps rows WITHOUT a match (filter NOT IN). Think: semi = “exists,” anti = “not exists.”
- NULL join keys — NULLs never match anything, even other NULLs. Ravi (NULL dept) does not match the NULL department. If you need NULL-safe joins, use eqNullSafe: df_emp.dept_code.eqNullSafe(df_dept.dept_code).
- Self-join without alias — df.join(df, ...) fails with ambiguous column references. Always use .alias("a") and .alias("b").
When to Use Which Join
| Question You Are Asking | Join Type |
|---|---|
| “Give me combined data where both sides match” | inner |
| “Give me ALL employees with department info if available” | left |
| “Give me ALL departments with employee info if available” | right |
| “Give me EVERYTHING from both sides” | full |
| “Which employees HAVE a valid department?” (no dept columns needed) | left_semi |
| “Which employees are MISSING a department?” | left_anti |
| “Generate every possible employee-department combination” | cross |
| “Find employee pairs in the same department” | self |
Interview Questions
Q: What is the difference between inner join and left join?
A: Inner join returns only rows where the join condition matches in BOTH tables — unmatched rows from either side are dropped. Left join returns ALL rows from the left table plus matching rows from the right — unmatched left rows get NULLs for right-side columns. Left join preserves the left table completely.
Q: What is a left_semi join and when would you use it?
A: Left_semi returns rows from the left table where a match EXISTS in the right table, but does NOT include columns from the right table. It is equivalent to SQL’s WHERE EXISTS. Use it to filter a table by the existence of related records without adding extra columns.
Q: What is a left_anti join?
A: Left_anti returns rows from the left table where NO match exists in the right table — the opposite of semi join. Equivalent to SQL’s WHERE NOT EXISTS. Use it to find orphan records, missing references, or records not yet loaded into a target table.
Q: Why should you broadcast small tables in joins?
A: Without broadcast, both DataFrames are shuffled across the cluster (expensive network transfer). With broadcast, the small DataFrame is copied to every worker node, and the large DataFrame stays in place — no shuffle needed. Use broadcast for any lookup/dimension table under 10 MB.
Q: How do you handle duplicate column names after a join?
A: Three methods: join on a column list (join(df_b, ["id"])) which auto-deduplicates, drop the duplicate column (.drop(df_b.id)), or rename before joining (.withColumnRenamed()). Joining on a list is the cleanest approach when column names match.
Q: Do NULLs match in joins?
A: No. NULL never equals NULL in standard joins. A row with NULL in the join key will not match any row, including other NULLs. For NULL-safe joins, use .eqNullSafe() instead of ==.
Wrapping Up
Eight join types, one setup, eight different results. That is the complete picture of PySpark joins. In practice, you will use inner and left 80% of the time, anti for data quality checks, broadcast for performance, and the rest for specific scenarios.
The key insight: every join answers a different question about the relationship between two datasets. Choose the join based on the QUESTION you are asking, not the data.
Related posts:
- SQL Joins (INNER, LEFT, FULL, CROSS, SELF)
- PySpark Transformations Cookbook
- Apache Spark and PySpark Architecture
- Joins in Synapse Data Flows
- PySpark Foundations
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.