PySpark Window Functions Deep Dive: ROW_NUMBER, RANK, LAG, LEAD, Running Totals, and Real-World Patterns
Window functions are the most asked topic in data engineering interviews and the most used feature in production pipelines. They let you compare rows against other rows, compute running totals, find the previous or next value, rank records within groups, and deduplicate data — all WITHOUT collapsing your DataFrame like groupBy does.
Think of a window function like looking out of a window on a train. At any moment, you can see a few rows before you (LAG), a few rows after you (LEAD), and your position relative to others in your car (RANK). But unlike groupBy (which squashes everyone in the car into one summary), window functions keep EVERY row intact and add a new column with the calculation.
This post covers every window function with realistic business scenarios, not toy examples.
Table of Contents
- What Makes Window Functions Different from GroupBy
- The Window Specification: partitionBy + orderBy
- The Sample Data
- ROW_NUMBER — Unique Sequential Ranking
- RANK — Ranking with Gaps
- DENSE_RANK — Ranking Without Gaps
- ROW_NUMBER vs RANK vs DENSE_RANK Comparison
- LAG — Look at the Previous Row
- LEAD — Look at the Next Row
- FIRST_VALUE and LAST_VALUE
- NTILE — Divide into Buckets
- SUM with Window — Running Totals
- AVG with Window — Moving Average
- COUNT with Window — Running Count
- MAX/MIN with Window — Running Max/Min
- Percent of Group Total
- Cumulative Distribution (CUME_DIST and PERCENT_RANK)
- Frame Specification: ROWS BETWEEN
- Real-World Pattern 1: Deduplication (Keep Latest)
- Real-World Pattern 2: Gap Detection
- Real-World Pattern 3: Sessionization
- Real-World Pattern 4: Year-over-Year Comparison
- Real-World Pattern 5: Top N per Group
- Performance Considerations
- Common Mistakes
- Interview Questions
- Wrapping Up
What Makes Window Functions Different from GroupBy
# GroupBy: 10 rows → 3 rows (one per department)
df.groupBy("department").agg(avg("salary"))
# Result: 3 rows — original row details LOST
# Window: 10 rows → 10 rows (one per employee, with department avg added)
window_spec = Window.partitionBy("department")
df.withColumn("dept_avg_salary", avg("salary").over(window_spec))
# Result: 10 rows — every employee kept, new column added
Real-life analogy: GroupBy is like a class photo where each row of students merges into one row in the yearbook — you lose individual faces. Window functions are like giving each student a badge that says “Row 3 of 5 in the Science class” — every student keeps their identity, plus gets context about their position in the group.
The Window Specification: partitionBy + orderBy
Every window function needs a Window specification — the “window” through which Spark looks at the data:
from pyspark.sql.window import Window
# The window is defined by:
# 1. partitionBy — which groups to compute within (like GROUP BY)
# 2. orderBy — how to sort rows within each group
window_spec = Window.partitionBy("department").orderBy(col("salary").desc())
partitionBy = “Process each department separately.” Without it, the entire DataFrame is one group.
orderBy = “Within each department, sort by salary descending.” Required for ranking and LAG/LEAD. Optional for aggregate windows.
Real-life analogy: partitionBy divides a marathon into age groups (18-25, 26-35, 36-45). orderBy sorts runners within each age group by finish time. Your rank is within YOUR age group, not the entire marathon.
The Sample Data
from pyspark.sql.functions import *
from pyspark.sql.window import Window
data = [
(1, "Naveen", "Engineering", "Toronto", 105000, "2022-01-15"),
(2, "Shrey", "Engineering", "Toronto", 95000, "2023-03-20"),
(3, "Vrushab", "Engineering", "Vancouver", 92000, "2023-06-10"),
(4, "Vishnu", "Analytics", "Ottawa", 91000, "2023-09-01"),
(5, "Ravi", "Analytics", "Calgary", 85000, "2024-01-05"),
(6, "Priya", "Analytics", "Toronto", 85000, "2023-07-15"),
(7, "Anita", "Sales", "Mumbai", 78000, "2024-03-15"),
(8, "Deepak", "Sales", "Delhi", 82000, "2024-05-01"),
(9, "Kavya", "Sales", "Pune", 78000, "2024-02-10"),
(10, "Manoj", "Sales", "Mumbai", 75000, "2023-11-20"),
]
df = spark.createDataFrame(data,
["emp_id", "name", "department", "city", "salary", "hire_date"]
).withColumn("hire_date", to_date(col("hire_date")))
df.show()
10 employees across 3 departments: Engineering (3), Analytics (3), Sales (4).
ROW_NUMBER — Unique Sequential Ranking
What It Does
Assigns a unique sequential number (1, 2, 3…) to each row within the partition. No ties — even if two employees have the same salary, they get different numbers.
window_spec = Window.partitionBy("department").orderBy(col("salary").desc())
df.withColumn("row_num", row_number().over(window_spec)) .select("department", "name", "salary", "row_num") .show()
| department | name | salary | row_num |
|---|---|---|---|
| Analytics | Vishnu | 91000 | 1 |
| Analytics | Priya | 85000 | 2 |
| Analytics | Ravi | 85000 | 3 |
| Engineering | Naveen | 105000 | 1 |
| Engineering | Shrey | 95000 | 2 |
| Engineering | Vrushab | 92000 | 3 |
| Sales | Deepak | 82000 | 1 |
| Sales | Anita | 78000 | 2 |
| Sales | Kavya | 78000 | 3 |
| Sales | Manoj | 75000 | 4 |
Notice: Priya and Ravi both earn 85000 but get different row numbers (2 and 3). The order between ties is non-deterministic unless you add a tiebreaker column to orderBy.
Real-life analogy: ROW_NUMBER is like numbering seats in a theater row — 1, 2, 3, 4. Even if two people are the same height (salary), they sit in different seats. No sharing.
When to Use
- Deduplication — keep only
row_num = 1per group - Pagination — show rows 11-20 for page 2
- Assigning sequence IDs — deterministic ordering within groups
RANK — Ranking with Gaps
What It Does
Assigns ranks with gaps after ties. If two employees tie at rank 2, the next person gets rank 4 (skipping 3).
df.withColumn("rank", rank().over(window_spec)) .select("department", "name", "salary", "rank") .show()
| department | name | salary | rank |
|---|---|---|---|
| Analytics | Vishnu | 91000 | 1 |
| Analytics | Priya | 85000 | 2 |
| Analytics | Ravi | 85000 | 2 |
| Sales | Deepak | 82000 | 1 |
| Sales | Anita | 78000 | 2 |
| Sales | Kavya | 78000 | 2 |
| Sales | Manoj | 75000 | 4 |
Priya and Ravi tie at rank 2. Next person in Analytics would be rank 4 (if there was one). Anita and Kavya tie at rank 2 in Sales. Manoj is rank 4, not 3.
Real-life analogy: Olympic medals. Two athletes tie for silver (rank 2). No bronze is awarded — next person is rank 4. The gap reflects that two people already share a higher position.
DENSE_RANK — Ranking Without Gaps
What It Does
Assigns ranks without gaps after ties. If two tie at rank 2, the next person gets rank 3.
df.withColumn("dense_rank", dense_rank().over(window_spec)) .select("department", "name", "salary", "dense_rank") .show()
| department | name | salary | dense_rank |
|---|---|---|---|
| Analytics | Vishnu | 91000 | 1 |
| Analytics | Priya | 85000 | 2 |
| Analytics | Ravi | 85000 | 2 |
| Sales | Deepak | 82000 | 1 |
| Sales | Anita | 78000 | 2 |
| Sales | Kavya | 78000 | 2 |
| Sales | Manoj | 75000 | 3 |
Manoj is now rank 3 (not 4 like with RANK). No gaps.
Real-life analogy: Class grades. Two students get an A (rank 1). The next grade is B (rank 2), not C. The consecutive ranking makes sense when you care about “how many distinct levels” rather than “how many people are above you.”
ROW_NUMBER vs RANK vs DENSE_RANK Comparison
Using Sales department (Deepak=82K, Anita=78K, Kavya=78K, Manoj=75K):
| name | salary | row_number | rank | dense_rank |
|---|---|---|---|---|
| Deepak | 82000 | 1 | 1 | 1 |
| Anita | 78000 | 2 | 2 | 2 |
| Kavya | 78000 | 3 | 2 | 2 |
| Manoj | 75000 | 4 | 4 | 3 |
| Function | Ties | Next after tie | Use when |
|---|---|---|---|
| ROW_NUMBER | Different numbers | Continues sequence | Dedup, pagination |
| RANK | Same number | Skips (gap) | Competition ranking |
| DENSE_RANK | Same number | No gap | “How many distinct levels above me” |
LAG — Look at the Previous Row
What It Does
Returns a value from the previous row in the window (or N rows back).
window_hire = Window.partitionBy("department").orderBy("hire_date")
df.withColumn("prev_hire_name", lag("name", 1).over(window_hire)) .withColumn("prev_hire_date", lag("hire_date", 1).over(window_hire)) .select("department", "name", "hire_date", "prev_hire_name", "prev_hire_date") .show()
| department | name | hire_date | prev_hire_name | prev_hire_date |
|---|---|---|---|---|
| Analytics | Priya | 2023-07-15 | NULL | NULL |
| Analytics | Vishnu | 2023-09-01 | Priya | 2023-07-15 |
| Analytics | Ravi | 2024-01-05 | Vishnu | 2023-09-01 |
| Engineering | Naveen | 2022-01-15 | NULL | NULL |
| Engineering | Shrey | 2023-03-20 | Naveen | 2022-01-15 |
| Engineering | Vrushab | 2023-06-10 | Shrey | 2023-03-20 |
The first person in each department has NULL (no previous row).
Real-life analogy: LAG is like looking in the rearview mirror while driving. You see the car (row) that was behind you in the lane (partition). The first car in the lane has no rearview — NULL.
Practical Use: Days Since Last Hire
df.withColumn("prev_hire", lag("hire_date", 1).over(window_hire)) .withColumn("days_since_last_hire", datediff(col("hire_date"), col("prev_hire"))) .select("department", "name", "hire_date", "days_since_last_hire") .show()
LEAD — Look at the Next Row
What It Does
Returns a value from the next row in the window.
df.withColumn("next_hire_name", lead("name", 1).over(window_hire)) .withColumn("next_hire_date", lead("hire_date", 1).over(window_hire)) .select("department", "name", "hire_date", "next_hire_name", "next_hire_date") .show()
The last person in each department has NULL (no next row).
Real-life analogy: LEAD is like looking through the windshield. You see the car ahead of you. The last car in the lane sees nothing — NULL.
FIRST_VALUE and LAST_VALUE
# Who was the first hire in each department?
df.withColumn("first_hired", first("name").over(window_hire)) .withColumn("last_hired", last("name").over(window_hire)) .select("department", "name", "hire_date", "first_hired", "last_hired") .show()
NTILE — Divide into Buckets
# Divide all employees into 4 salary quartiles
window_all = Window.orderBy(col("salary").desc())
df.withColumn("quartile", ntile(4).over(window_all)) .select("name", "salary", "quartile") .show()
| name | salary | quartile |
|---|---|---|
| Naveen | 105000 | 1 |
| Shrey | 95000 | 1 |
| Vrushab | 92000 | 1 |
| Vishnu | 91000 | 2 |
| Ravi | 85000 | 2 |
| Priya | 85000 | 2 |
| Deepak | 82000 | 3 |
| Anita | 78000 | 3 |
| Kavya | 78000 | 4 |
| Manoj | 75000 | 4 |
Top 25% earners are quartile 1. Bottom 25% are quartile 4.
Real-life analogy: NTILE is like dividing a deck of cards into 4 equal piles. Each pile (quartile) has roughly the same number of cards, sorted by value.
SUM with Window — Running Totals
window_running = Window.partitionBy("department").orderBy("hire_date") .rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("running_salary_total", sum("salary").over(window_running)) .select("department", "name", "hire_date", "salary", "running_salary_total") .show()
| department | name | hire_date | salary | running_salary_total |
|---|---|---|---|---|
| Analytics | Priya | 2023-07-15 | 85000 | 85000 |
| Analytics | Vishnu | 2023-09-01 | 91000 | 176000 |
| Analytics | Ravi | 2024-01-05 | 85000 | 261000 |
| Engineering | Naveen | 2022-01-15 | 105000 | 105000 |
| Engineering | Shrey | 2023-03-20 | 95000 | 200000 |
| Engineering | Vrushab | 2023-06-10 | 92000 | 292000 |
Each row shows the cumulative salary total up to that point within the department.
Real-life analogy: Running total is like your bank balance. Each deposit (hire) adds to the previous balance. At any point, you see the total accumulated so far.
AVG with Window — Moving Average
# 3-row moving average of salary
window_moving = Window.partitionBy("department").orderBy("hire_date") .rowsBetween(-1, 1) # Previous row, current row, next row
df.withColumn("moving_avg_salary", round(avg("salary").over(window_moving), 0)) .select("department", "name", "salary", "moving_avg_salary") .show()
Percent of Group Total
window_dept = Window.partitionBy("department")
df.withColumn("dept_total", sum("salary").over(window_dept)) .withColumn("pct_of_dept", round(col("salary") / col("dept_total") * 100, 1)) .select("department", "name", "salary", "dept_total", "pct_of_dept") .show()
| department | name | salary | dept_total | pct_of_dept |
|---|---|---|---|---|
| Analytics | Vishnu | 91000 | 261000 | 34.9 |
| Analytics | Priya | 85000 | 261000 | 32.6 |
| Analytics | Ravi | 85000 | 261000 | 32.6 |
| Engineering | Naveen | 105000 | 292000 | 36.0 |
Naveen accounts for 36% of Engineering’s total salary budget.
Frame Specification: ROWS BETWEEN
The frame defines WHICH rows the window function considers:
# From start of partition to current row (running total)
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
# Previous 2 rows + current row (3-row moving average)
.rowsBetween(-2, 0)
# Previous 1 + current + next 1 (centered moving average)
.rowsBetween(-1, 1)
# Entire partition (total — no running aspect)
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
Real-World Pattern 1: Deduplication (Keep Latest)
# Keep only the latest record per customer_id
window_dedup = Window.partitionBy("customer_id").orderBy(col("updated_date").desc())
df_deduped = df.withColumn("rn", row_number().over(window_dedup)) .filter(col("rn") == 1) .drop("rn")
This is the most common window function pattern in production — used in every SCD pipeline we built.
Real-World Pattern 2: Gap Detection
# Find gaps in order sequence
window_seq = Window.orderBy("order_id")
df_gaps = df.withColumn("prev_id", lag("order_id", 1).over(window_seq)) .withColumn("gap", col("order_id") - col("prev_id")) .filter(col("gap") > 1)
# Shows where order IDs skip numbers — data quality issue
Real-World Pattern 3: Sessionization
# Group web clicks into sessions (new session if >30 min gap)
window_user = Window.partitionBy("user_id").orderBy("click_time")
df_sessions = df .withColumn("prev_click", lag("click_time", 1).over(window_user)) .withColumn("gap_minutes", (col("click_time").cast("long") - col("prev_click").cast("long")) / 60) .withColumn("new_session", when(col("gap_minutes") > 30, 1).otherwise(0)) .withColumn("session_id", sum("new_session").over(window_user))
Real-World Pattern 4: Year-over-Year Comparison
# Compare each month's revenue with the same month last year
window_yoy = Window.partitionBy("month").orderBy("year")
df_yoy = df .withColumn("last_year_revenue", lag("revenue", 1).over(window_yoy)) .withColumn("yoy_growth_pct", round(
(col("revenue") - col("last_year_revenue")) / col("last_year_revenue") * 100, 1
))
Real-World Pattern 5: Top N per Group
# Top 2 earners per department
window_top = Window.partitionBy("department").orderBy(col("salary").desc())
df_top2 = df.withColumn("rank", row_number().over(window_top)) .filter(col("rank") <= 2) .drop("rank")
df_top2.show()
Performance Considerations
-
Minimize the number of distinct windows — each unique Window spec causes a separate sort. Reuse the same window object.
-
partitionBy reduces shuffle scope — without partitionBy, the entire DataFrame is one partition (global sort = expensive).
-
orderBy triggers a sort — avoid unnecessary orderBy when using aggregate window functions (sum, avg without running totals).
-
Use row_number for dedup, not distinct — row_number lets you CHOOSE which duplicate to keep (latest, highest). distinct() is random.
Common Mistakes
-
Forgetting orderBy for ranking functions —
row_number()withoutorderBygives non-deterministic results. Always specify the sort. -
Using row_number when you need rank — row_number gives different numbers to ties. If you want ties to have the same rank, use rank() or dense_rank().
-
Wrong frame for running totals — default frame with orderBy is
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. For exact row-based running totals, use.rowsBetween(Window.unboundedPreceding, Window.currentRow). -
Applying window function then filtering in wrong order —
df.filter(row_number > 1)does not work directly. You must use withColumn first, then filter. -
Missing partitionBy — without it, the window covers the ENTIRE DataFrame. rank() without partitionBy ranks ALL employees globally, not within departments.
Interview Questions
Q: What is the difference between ROW_NUMBER, RANK, and DENSE_RANK? A: ROW_NUMBER assigns unique sequential numbers (no ties). RANK assigns same rank to ties but leaves gaps (1,2,2,4). DENSE_RANK assigns same rank to ties without gaps (1,2,2,3). Use ROW_NUMBER for dedup, RANK for competition ranking, DENSE_RANK for counting distinct levels.
Q: How do LAG and LEAD work? A: LAG returns a value from a previous row in the window (looking backward). LEAD returns a value from the next row (looking forward). Both take an offset parameter (default 1). The first row’s LAG is NULL, the last row’s LEAD is NULL. Used for gap detection, YoY comparison, and sessionization.
Q: How do you calculate a running total in PySpark?
A: Use sum(column).over(Window.partitionBy(group).orderBy(sort).rowsBetween(Window.unboundedPreceding, Window.currentRow)). The frame specification ensures the sum includes all rows from the start of the partition up to the current row.
Q: How do you get the top N records per group?
A: Use row_number().over(Window.partitionBy(group).orderBy(sort_desc)) to assign ranks within each group, then filter where rank <= N. This is more flexible than groupBy().agg() because you keep all columns.
Q: What is the difference between groupBy and window functions? A: groupBy collapses rows — 10 rows become 3 (one per group). Window functions keep ALL rows and add a computed column. Use groupBy for summary reports. Use window functions when you need row-level detail WITH group-level context.
Wrapping Up
Window functions are the most powerful feature in PySpark for row-level analytics. They let you rank, compare, accumulate, and partition WITHOUT losing row details. Master the five patterns — dedup, gap detection, sessionization, YoY comparison, top N per group — and you can solve 90% of window function problems in production and interviews.
Related posts: – SQL Window Functions (ROW_NUMBER, RANK, LAG, LEAD) – PySpark Transformations Cookbook – PySpark Joins – Apache Spark and PySpark Architecture
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.