Lazy Evaluation in PySpark: Why Spark Waits, How It Optimizes, and When Your Code Actually Runs
You write five lines of PySpark code. You run the cell. It finishes in 0.2 seconds. You think: “That was fast — Spark just processed 10 million rows in 0.2 seconds!”
It did not. Spark did NOTHING. It just wrote down your instructions on a piece of paper and said “noted.” Your 10 million rows are still sitting untouched in the data lake.
This is lazy evaluation — the single most important concept in PySpark that most tutorials mention once and never explain properly. If you do not understand lazy evaluation, you will write code that looks correct but runs 10x slower than it should, or you will stare at your screen wondering why Spark is “not working” when it is actually working exactly as designed.
Table of Contents
- What Is Lazy Evaluation?
- The Restaurant Analogy
- Transformations vs Actions: The Two Types of Operations
- Transformations (Lazy — Just Instructions)
- Actions (Eager — Triggers Everything)
- Why Spark Is Lazy (The Optimization Story)
- The Catalyst Optimizer: Spark’s Secret Brain
- Optimization 1: Predicate Pushdown
- Optimization 2: Column Pruning
- Optimization 3: Join Reordering
- Optimization 4: Filter Pushdown Before Join
- Seeing the Execution Plan (explain())
- The DAG: Spark’s Recipe Book
- Narrow vs Wide Transformations
- When Does Spark Actually Execute?
- The Cache Trap: When Laziness Bites You
- Lazy Evaluation in Your SCD Type 2 Pipeline
- Common Misconceptions
- Proving Laziness: A Hands-On Experiment
- How to Think in Lazy Evaluation
- Common Mistakes
- Interview Questions
- Wrapping Up
What Is Lazy Evaluation?
Lazy evaluation means Spark records your transformations but does not execute them until you request a result (an action). Every filter(), select(), groupBy(), join(), and withColumn() is recorded as a plan — not executed. Only when you call show(), count(), collect(), or write() does Spark look at the complete plan, optimize it, and execute everything at once.
# Line 1: Spark records "read parquet" → NO execution
df = spark.read.parquet("/data/sales/")
# Line 2: Spark records "filter year=2026" → NO execution
df_filtered = df.filter(col("year") == 2026)
# Line 3: Spark records "group by region" → NO execution
df_grouped = df_filtered.groupBy("region")
# Line 4: Spark records "sum amount" → NO execution
df_result = df_grouped.agg(sum("amount").alias("total"))
# Line 5: THIS triggers execution of ALL lines above
df_result.show() # NOW Spark reads, filters, groups, and sums
Lines 1-4 complete in milliseconds because nothing happens. Line 5 takes 30 seconds because Spark executes the entire pipeline.
Real-life analogy: Lazy evaluation is like writing a shopping list. Writing “buy milk” on the list does not put milk in your fridge. Writing “buy eggs” does not cook breakfast. The list just grows. Only when you drive to the store (action) do you actually buy everything on the list — and you buy it all in one efficient trip, not one item at a time.
The Restaurant Analogy
This analogy makes lazy evaluation click for everyone:
Eager evaluation (how Pandas works):
You: "I'd like the menu please."
Waiter: *runs to kitchen, prints menu, brings it back* (30 seconds)
You: "I'll have the soup."
Waiter: *runs to kitchen, makes soup, brings it back* (5 minutes)
You: "Actually, can I add a salad?"
Waiter: *runs to kitchen, makes salad, brings it back* (3 minutes)
You: "And a steak, medium rare."
Waiter: *runs to kitchen, cooks steak, brings it back* (15 minutes)
Total: 23 minutes, 30 seconds. 4 separate kitchen trips.
Lazy evaluation (how PySpark works):
You: "I'd like the menu please."
Waiter: *writes it down* (instant)
You: "I'll have the soup."
Waiter: *writes it down* (instant)
You: "Actually, can I add a salad?"
Waiter: *writes it down* (instant)
You: "And a steak, medium rare."
Waiter: *writes it down* (instant)
You: "That's everything. Please bring it all." ← ACTION
Waiter: *goes to kitchen ONCE, the chef sees the full order,
starts the steak first (takes longest), soup and salad in parallel,
everything arrives together*
Total: 15 minutes. 1 optimized kitchen trip.
The waiter (Spark) does not run to the kitchen after every item. They collect the ENTIRE order (all transformations), then hand it to the chef (executor) who optimizes the cooking order. The steak starts first because it takes longest. The soup and salad cook in parallel. Everything arrives together.
That is lazy evaluation. Record everything, optimize the plan, execute once.
Transformations vs Actions: The Two Types of Operations
Every PySpark operation is either a transformation (lazy) or an action (eager).
Transformations (Lazy — Just Instructions)
These are RECORDED but NOT executed. They build up the execution plan.
| Transformation | What It Records |
|---|---|
df.filter(...) |
“Filter rows matching this condition” |
df.select(...) |
“Keep only these columns” |
df.withColumn(...) |
“Add or modify this column” |
df.groupBy(...) |
“Group rows by these columns” |
df.agg(...) |
“Compute these aggregations” |
df.join(...) |
“Join with another DataFrame” |
df.orderBy(...) |
“Sort by these columns” |
df.distinct() |
“Remove duplicate rows” |
df.dropDuplicates(...) |
“Remove duplicates by specific columns” |
df.union(...) |
“Combine two DataFrames” |
df.repartition(...) |
“Redistribute data across partitions” |
df.coalesce(...) |
“Reduce number of partitions” |
df.withColumnRenamed(...) |
“Rename this column” |
df.drop(...) |
“Remove these columns” |
df.cache() |
“Remember to cache this when computed” |
spark.read.parquet(...) |
“Plan to read from this path” |
None of these touch any data. They are instructions on a piece of paper.
Actions (Eager — Triggers Everything)
These TRIGGER execution of all recorded transformations.
| Action | What It Does |
|---|---|
df.show() |
Execute plan, display results |
df.show(5) |
Execute plan, display first 5 rows |
df.count() |
Execute plan, return row count |
df.collect() |
Execute plan, return ALL rows to driver |
df.take(n) |
Execute plan, return first n rows |
df.first() |
Execute plan, return first row |
df.head(n) |
Execute plan, return first n rows |
df.write.parquet(...) |
Execute plan, write results to storage |
df.write.format("delta").save(...) |
Execute plan, write Delta table |
df.toPandas() |
Execute plan, convert to Pandas DataFrame |
df.foreach(...) |
Execute plan, apply function to each row |
df.describe() |
Execute plan, compute statistics |
display(df) |
Execute plan, show in Databricks UI |
Real-life analogy: Transformations are like adding items to your Amazon cart. Nothing ships. Nothing is charged. Actions are like clicking “Place Order” — NOW everything ships, your card is charged, and delivery begins. You can add and remove items from the cart all day (transformations) without spending a penny. Only “Place Order” (action) makes things real.
Why Spark Is Lazy (The Optimization Story)
Lazy evaluation is not just a design choice — it is the KEY to Spark’s performance. By waiting until an action to execute, Spark can see the ENTIRE plan and optimize it before running anything.
Without Lazy Evaluation (Eager — Like Pandas)
# Step 1: Read 10M rows from 50 columns (reads ALL data)
df = pd.read_parquet("sales.parquet") # Reads 10M rows, 50 columns
# Step 2: Filter to 2026 (scans ALL 10M rows)
df = df[df["year"] == 2026] # Scans 10M, keeps 1M
# Step 3: Select 3 columns (already read 50, now throws away 47)
df = df[["region", "amount", "year"]] # Wasted work reading 47 columns
# Step 4: Group and sum
result = df.groupby("region")["amount"].sum() # Works on 1M rows
# Total: Read 10M rows x 50 columns, then threw away 90% of the data
With Lazy Evaluation (PySpark)
# Step 1: Records "read parquet" — NO execution
df = spark.read.parquet("sales.parquet")
# Step 2: Records "filter year=2026" — NO execution
df = df.filter(col("year") == 2026)
# Step 3: Records "select 3 columns" — NO execution
df = df.select("region", "amount", "year")
# Step 4: Records "group by region, sum amount" — NO execution
result = df.groupBy("region").agg(sum("amount"))
# Step 5: ACTION — Spark optimizes and executes
result.show()
# What Spark ACTUALLY does (optimized plan):
# 1. Read ONLY 3 columns from Parquet (column pruning — skips 47 columns)
# 2. Read ONLY year=2026 partitions (predicate pushdown — skips 9M rows)
# 3. Group and sum on 1M rows x 3 columns
# Total: Read 1M rows x 3 columns instead of 10M rows x 50 columns
The result: PySpark reads 97% LESS data than Pandas for the same query. Not because PySpark is magic — but because lazy evaluation lets Spark see all the steps BEFORE executing and optimize the plan.
The Catalyst Optimizer: Spark’s Secret Brain
When you trigger an action, Spark does not execute your transformations in the order you wrote them. Instead, the Catalyst Optimizer rewrites your plan into a more efficient version:
Your Code (Logical Plan):
1. Read ALL columns from sales.parquet
2. Filter year = 2026
3. Select region, amount, year
4. GroupBy region, sum amount
Catalyst Optimizer (Physical Plan):
1. Read ONLY region, amount, year FROM sales.parquet ← Column pruning (moved step 3 up)
WHERE year = 2026 ← Predicate pushdown (moved step 2 into read)
2. GroupBy region, sum amount ← Only processes needed data
The optimizer reordered steps 2 and 3 to happen DURING the read, not after. This is only possible because Spark saw the entire plan before executing.
Real-life analogy: The Catalyst Optimizer is like a smart GPS. You type three stops: grocery store, post office, gas station. A naive GPS routes you in that order — even if the post office is next door and the grocery store is 20 minutes away. A smart GPS reorders: post office first (closest), then gas station (on the way), then grocery store. Same stops, faster route. Catalyst does this for your data operations.
Optimization 1: Predicate Pushdown
Spark pushes filters DOWN to the data source, so only matching rows are read:
# You write:
df = spark.read.parquet("/data/sales/")
df = df.filter(col("year") == 2026)
# Spark executes:
# Read only Parquet row groups where year=2026 (skips 90% of data)
With Parquet files, each row group stores min/max statistics. If a row group’s max year is 2025, Spark skips it entirely without reading any data from it.
Optimization 2: Column Pruning
Spark reads ONLY the columns you actually use:
# You write:
df = spark.read.parquet("/data/sales/") # File has 50 columns
df = df.select("region", "amount") # You only use 2
# Spark executes:
# Read ONLY region and amount columns from Parquet (skips 48 columns)
Parquet stores each column separately, so Spark can physically skip unneeded columns.
Optimization 3: Join Reordering
Spark reorders joins for optimal performance:
# You write:
result = big_table.join(medium_table, "key1").join(small_table, "key2")
# Spark might execute:
# Join big_table with small_table first (broadcast small), then join with medium
# Because broadcasting small_table is cheaper than shuffling medium_table
Optimization 4: Filter Pushdown Before Join
# You write:
result = orders.join(products, "product_id").filter(col("year") == 2026)
# Spark executes:
# Filter orders to year=2026 FIRST (1M rows instead of 10M)
# THEN join the filtered 1M rows with products
# You wrote the filter after the join, but Spark moved it before
This is only possible because Spark sees the filter AND the join in the plan before executing either.
Seeing the Execution Plan (explain())
You can see what Spark plans to do with explain():
df = spark.read.parquet("/data/sales/")
df_result = df.filter(col("year") == 2026) .select("region", "amount") .groupBy("region") .agg(sum("amount").alias("total"))
# See the plan WITHOUT executing
df_result.explain()
Output shows:
== Physical Plan ==
*(2) HashAggregate(keys=[region], functions=[sum(amount)])
+- Exchange hashpartitioning(region, 200)
+- *(1) HashAggregate(keys=[region], functions=[partial_sum(amount)])
+- *(1) FileScan parquet [region,amount] ← Only 2 columns read!
PushedFilters: [IsNotNull(year), EqualTo(year,2026)] ← Filter pushed to read!
Notice: Spark reads only 2 columns and pushes the year filter into the file scan. This is the optimized plan that runs when you call .show().
# Extended explain shows all plan stages
df_result.explain(True)
# Shows: Parsed → Analyzed → Optimized → Physical
The DAG: Spark’s Recipe Book
The DAG (Directed Acyclic Graph) is the complete execution plan Spark creates before running:
Read Parquet (Stage 1)
|
Filter year=2026
|
Select region, amount
|
Partial Aggregate (local sum per partition)
|
─── SHUFFLE (Exchange) ─── Stage boundary
|
Final Aggregate (global sum per region) (Stage 2)
|
Show (collect to driver)
Each stage runs in parallel across the cluster. The shuffle (data exchange) is the boundary between stages.
Real-life analogy: The DAG is like a recipe with steps. “Chop vegetables (Stage 1)” can be done by 4 chefs in parallel (4 partitions). “Combine and cook (Stage 2)” requires all vegetables to be in one pot (shuffle). The DAG ensures chefs know what to do and in what order.
Narrow vs Wide Transformations
Narrow Transformations (No Shuffle — Fast)
Data stays on the same partition. No network transfer:
df.filter(...) # Each partition filters independently
df.select(...) # Each partition selects independently
df.withColumn(...) # Each partition adds column independently
df.map(...) # Each partition maps independently
Real-life analogy: Narrow transformations are like each chef peeling their own pile of potatoes. No chef needs anything from another chef’s pile.
Wide Transformations (Shuffle Required — Expensive)
Data must move between partitions across the network:
df.groupBy(...) # All rows with same key must be on same node
df.join(...) # Matching keys must be co-located
df.orderBy(...) # Global sort requires data exchange
df.distinct() # Must compare across all partitions
df.repartition(...) # Explicitly redistributes data
Real-life analogy: Wide transformations are like sorting all the potatoes by size ACROSS all chefs. Every chef must pass their large potatoes to one chef and their small potatoes to another. This passing (shuffle) is the most expensive part.
The performance rule: Minimize wide transformations. Every shuffle moves data across the network and is orders of magnitude slower than narrow transformations.
When Does Spark Actually Execute?
Here is a timeline of what happens when you run a notebook:
# Cell 1 — Runs in 0.1 seconds (lazy — just recording)
df = spark.read.parquet("/data/10GB_file/")
df_clean = df.filter(col("status") == "Active") .withColumn("name", upper(trim(col("name")))) .select("id", "name", "city", "salary")
# Cell 2 — Runs in 0.1 seconds (lazy — still recording)
df_agg = df_clean.groupBy("city").agg(
count("*").alias("emp_count"),
avg("salary").alias("avg_salary")
)
# Cell 3 — Runs in 45 SECONDS (action — executes EVERYTHING above)
df_agg.show()
# NOW Spark: reads 10GB, filters, cleans, selects, groups, aggregates, displays
# Cell 4 — Runs in 45 SECONDS AGAIN (another action — re-executes EVERYTHING)
df_agg.count()
# Spark re-reads 10GB, re-filters, re-cleans, re-selects, re-groups, re-aggregates
Notice Cell 4: Spark re-executes the entire plan because it does not cache results by default. This is where cache() helps.
The Cache Trap: When Laziness Bites You
Without caching, every action re-executes the entire transformation chain:
# WITHOUT cache — 3 actions = 3 full executions
df_clean = df.filter(...).withColumn(...).groupBy(...).agg(...)
df_clean.show() # Executes full chain (45 sec)
df_clean.count() # Executes full chain AGAIN (45 sec)
df_clean.write() # Executes full chain AGAIN (45 sec)
# Total: 135 seconds
# WITH cache — 1 full execution + 2 cache reads
df_clean = df.filter(...).withColumn(...).groupBy(...).agg(...).cache()
df_clean.show() # First action: executes and caches result (45 sec)
df_clean.count() # Reads from cache (0.5 sec)
df_clean.write() # Reads from cache (2 sec)
# Total: 47.5 seconds — 3x faster!
# Free memory when done
df_clean.unpersist()
Important: cache() itself is a transformation (lazy). The actual caching happens on the FIRST action. After that, subsequent actions read from cache.
Lazy Evaluation in Your SCD Type 2 Pipeline
Here is how lazy evaluation works in the SCD Type 2 pipeline you are building:
# All of these are LAZY — no execution yet
df_source = spark.read.format("delta").load("/silver/customers/") # Lazy
df_source = df_source.withColumn("hash", sha2(concat_ws("|", ...), 256)) # Lazy
df_target = spark.read.format("delta").load("/gold/dim_customer/") # Lazy
df_target = df_target.filter(col("is_active") == True) # Lazy
df_joined = df_source.join(df_target, "customer_id", "left") # Lazy
df_new = df_joined.filter(col("target_id").isNull()) # Lazy
df_changed = df_joined.filter(col("source_hash") != col("target_hash")) # Lazy
# THIS triggers execution of ALL the above
df_new.count() # NOW Spark reads both tables, joins, filters, counts
Spark sees the ENTIRE SCD pipeline before executing. It can:
– Read only needed columns from both Delta tables (column pruning)
– Push the is_active = True filter into the Delta table scan
– Choose broadcast join if the target table is small enough
– Optimize the join order
If you were using eager execution (Pandas), each step would execute immediately — reading ALL columns, ALL rows, without any optimization.
Common Misconceptions
Misconception 1: “My cell ran instantly, so the data is processed”
Wrong. The cell ran instantly because Spark recorded the transformation. Data is NOT processed until an action.
Misconception 2: “spark.read.parquet() reads the file”
Wrong. It records the INTENT to read. The file is not actually opened until an action triggers execution. Spark does read the file schema (column names and types) immediately, but the data rows are not loaded.
Misconception 3: “cache() stores data in memory immediately”
Wrong. cache() is a lazy transformation. It marks the DataFrame for caching. The actual caching happens on the FIRST action that triggers the computation.
Misconception 4: “Each cell in a Databricks notebook is a separate execution”
Wrong. Cells that contain only transformations do nothing. Only cells with actions trigger execution. Three cells of transformations followed by one cell with .show() = one execution.
Misconception 5: “More transformations = slower execution”
Not necessarily. More transformations give Spark more opportunities to optimize. A chain of filter → select → join might execute faster than join → filter → select because Spark reorders the filter before the join. The optimizer works better with more context.
Proving Laziness: A Hands-On Experiment
Try this in your Databricks notebook:
# Cell 1: Read a file that DOES NOT EXIST
df = spark.read.parquet("/this/path/does/not/exist/")
print("Cell 1 completed!")
# Output: "Cell 1 completed!" — NO ERROR! Spark did not try to read the file.
# Cell 2: Add transformations on the non-existent data
df_clean = df.filter(col("year") == 2026).select("id", "name")
print("Cell 2 completed!")
# Output: "Cell 2 completed!" — STILL NO ERROR! Spark just recorded instructions.
# Cell 3: Trigger an action
df_clean.show()
# ERROR: "Path does not exist: /this/path/does/not/exist/"
# NOW Spark tried to actually read the file and discovered it does not exist.
The error only appears in Cell 3 because that is when Spark first attempts to execute. Cells 1 and 2 succeed despite referencing a non-existent path — proof that Spark does nothing until an action forces it.
How to Think in Lazy Evaluation
When writing PySpark, think of your code in two phases:
Phase 1: Build the recipe (transformations) – What data do I need to read? – What filters apply? – What columns do I need? – What joins and aggregations? – Build the entire transformation chain WITHOUT worrying about performance
Phase 2: Execute the recipe (action)
– Call .show(), .count(), .write(), or display()
– Spark optimizes and executes the entire chain
– If you need the result multiple times, .cache() before the first action
# Phase 1: Build (all lazy — milliseconds)
df = spark.read.parquet("/data/sales/")
df_clean = df.filter(col("year") == 2026) .select("region", "product", "amount") .withColumn("amount_usd", col("amount") * 1.35)
df_agg = df_clean.groupBy("region").agg(sum("amount_usd").alias("total"))
# Cache if you need it more than once
df_agg.cache()
# Phase 2: Execute (triggers everything)
df_agg.show() # Action 1 — computes and caches
df_agg.write.format("delta") # Action 2 — reads from cache
.save("/gold/regional_sales/")
df_agg.unpersist() # Free cache memory
Common Mistakes
-
Calling
.count()after every transformation to “check” — every.count()triggers a FULL execution. If you have 5.count()calls, you execute the entire chain 5 times. Remove unnecessary counts. -
Not caching reused DataFrames — if you use
df_cleanin 3 actions, Spark recomputes it 3 times. Cache it after the first action. -
Thinking
.cache()is an action — it is lazy. The data is only cached on the first action that triggers computation. -
Debugging with
.show()everywhere — each.show()triggers execution. In development, keep one.show()at the end. In production, remove ALL.show()calls. -
Not understanding why errors appear late — a typo in a column name inside
.filter()only errors when an action runs, not when.filter()is called. This confuses beginners who think the filter cell worked fine. -
Writing eager-style code — collecting data to the driver with
.collect()or.toPandas()breaks laziness and can crash the driver on large data. Let Spark keep data distributed.
Interview Questions
Q: What is lazy evaluation in PySpark? A: Spark records transformations (filter, select, join, groupBy) without executing them. Execution only happens when an action (show, count, write, collect) is called. This allows Spark’s Catalyst Optimizer to see the entire execution plan and optimize it before running — reordering operations, pushing filters down, pruning columns, and choosing optimal join strategies.
Q: What is the difference between a transformation and an action? A: Transformations (filter, select, join, withColumn) are lazy — they build the execution plan without processing data. Actions (show, count, write, collect) are eager — they trigger execution of all recorded transformations. Only actions cause Spark to actually read and process data.
Q: Why is lazy evaluation faster than eager evaluation? A: Because Spark sees the entire plan before executing. It can push filters before joins (reducing data volume), read only needed columns (column pruning), reorder joins for optimal performance, and choose broadcast vs shuffle joins. Eager execution processes each step immediately without this global optimization.
Q: What happens if you call two actions on the same DataFrame without caching?
A: Spark re-executes the entire transformation chain for EACH action. Two .show() calls on the same DataFrame = two full executions. Use .cache() to store the result after the first action so subsequent actions read from memory.
Q: How do you see what Spark plans to do before executing?
A: Call .explain() on the DataFrame. It shows the physical execution plan including filter pushdown, column pruning, join strategy, and shuffle operations — without actually executing the query. Use .explain(True) for the full plan from parsed to physical.
Q: Give an example of how lazy evaluation optimizes a query.
A: If you write df.read.parquet(50 columns).join(other_df).filter(year==2026).select(3 columns), Spark reorders this to: read ONLY 3 columns from Parquet, push the year filter INTO the file scan, THEN join the smaller filtered result. Instead of reading 50 columns and 10M rows, it reads 3 columns and 1M rows. 97% less data, same result.
Wrapping Up
Lazy evaluation is not a quirk of PySpark — it is the foundation that makes distributed data processing possible at scale. Without it, every transformation would execute immediately, with no opportunity to optimize. With it, Spark sees your entire recipe before turning on the stove, and cooks everything in the most efficient order.
The mental model is simple: – Transformations = writing instructions on paper (free, instant) – Actions = handing the paper to Spark and saying “go” (expensive, one-time) – Cache = telling Spark “keep the result, I will need it again”
Write your transformations freely. Let the Catalyst Optimizer do its magic. Trigger one action at the end. Cache if you need the result twice. That is lazy evaluation in practice.
Related posts: – Apache Spark and PySpark Architecture – PySpark Foundations – PySpark Transformations Cookbook – Delta Lake and PySpark Optimization – PySpark Joins
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.