PySpark Window Functions Deep Dive: ROW_NUMBER, RANK, LAG, LEAD, Running Totals, and Real-World Patterns

Window functions are the most asked topic in data engineering interviews and the most used feature in production pipelines. They let you compare rows against other rows, compute running totals, find the previous or next value, rank records within groups, and deduplicate data — all WITHOUT collapsing your DataFrame like groupBy does.

Think of a window function like looking out of a window on a train. At any moment, you can see a few rows before you (LAG), a few rows after you (LEAD), and your position relative to others in your car (RANK). But unlike groupBy (which squashes everyone in the car into one summary), window functions keep EVERY row intact and add a new column with the calculation.

This post covers every window function with realistic business scenarios, not toy examples.

What Makes Window Functions Different from GroupBy
The Window Specification: partitionBy + orderBy
The Sample Data
ROW_NUMBER — Unique Sequential Ranking
RANK — Ranking with Gaps
DENSE_RANK — Ranking Without Gaps
ROW_NUMBER vs RANK vs DENSE_RANK Comparison
LAG — Look at the Previous Row
LEAD — Look at the Next Row
FIRST_VALUE and LAST_VALUE
NTILE — Divide into Buckets
SUM with Window — Running Totals
AVG with Window — Moving Average
COUNT with Window — Running Count
MAX/MIN with Window — Running Max/Min
Percent of Group Total
Cumulative Distribution (CUME_DIST and PERCENT_RANK)
Frame Specification: ROWS BETWEEN
Real-World Pattern 1: Deduplication (Keep Latest)
Real-World Pattern 2: Gap Detection
Real-World Pattern 3: Sessionization
Real-World Pattern 4: Year-over-Year Comparison
Real-World Pattern 5: Top N per Group
Performance Considerations
Common Mistakes
Interview Questions
Wrapping Up

What Makes Window Functions Different from GroupBy

# GroupBy: 10 rows → 3 rows (one per department)
df.groupBy("department").agg(avg("salary"))
# Result: 3 rows — original row details LOST

# Window: 10 rows → 10 rows (one per employee, with department avg added)
window_spec = Window.partitionBy("department")
df.withColumn("dept_avg_salary", avg("salary").over(window_spec))
# Result: 10 rows — every employee kept, new column added

Real-life analogy: GroupBy is like a class photo where each row of students merges into one row in the yearbook — you lose individual faces. Window functions are like giving each student a badge that says “Row 3 of 5 in the Science class” — every student keeps their identity, plus gets context about their position in the group.

The Window Specification: partitionBy + orderBy

Every window function needs a Window specification — the “window” through which Spark looks at the data:

from pyspark.sql.window import Window

# The window is defined by:
# 1. partitionBy — which groups to compute within (like GROUP BY)
# 2. orderBy — how to sort rows within each group

window_spec = Window.partitionBy("department").orderBy(col("salary").desc())

partitionBy = “Process each department separately.” Without it, the entire DataFrame is one group.

orderBy = “Within each department, sort by salary descending.” Required for ranking and LAG/LEAD. Optional for aggregate windows.

Real-life analogy: partitionBy divides a marathon into age groups (18-25, 26-35, 36-45). orderBy sorts runners within each age group by finish time. Your rank is within YOUR age group, not the entire marathon.

The Sample Data

from pyspark.sql.functions import *
from pyspark.sql.window import Window

data = [
    (1, "Naveen", "Engineering", "Toronto", 105000, "2022-01-15"),
    (2, "Shrey", "Engineering", "Toronto", 95000, "2023-03-20"),
    (3, "Vrushab", "Engineering", "Vancouver", 92000, "2023-06-10"),
    (4, "Vishnu", "Analytics", "Ottawa", 91000, "2023-09-01"),
    (5, "Ravi", "Analytics", "Calgary", 85000, "2024-01-05"),
    (6, "Priya", "Analytics", "Toronto", 85000, "2023-07-15"),
    (7, "Anita", "Sales", "Mumbai", 78000, "2024-03-15"),
    (8, "Deepak", "Sales", "Delhi", 82000, "2024-05-01"),
    (9, "Kavya", "Sales", "Pune", 78000, "2024-02-10"),
    (10, "Manoj", "Sales", "Mumbai", 75000, "2023-11-20"),
]

df = spark.createDataFrame(data,
    ["emp_id", "name", "department", "city", "salary", "hire_date"]
).withColumn("hire_date", to_date(col("hire_date")))

df.show()

10 employees across 3 departments: Engineering (3), Analytics (3), Sales (4).

ROW_NUMBER — Unique Sequential Ranking

What It Does

Assigns a unique sequential number (1, 2, 3…) to each row within the partition. No ties — even if two employees have the same salary, they get different numbers.

window_spec = Window.partitionBy("department").orderBy(col("salary").desc())

df.withColumn("row_num", row_number().over(window_spec))   .select("department", "name", "salary", "row_num")   .show()

department	name	salary	row_num
Analytics	Vishnu	91000	1
Analytics	Priya	85000	2
Analytics	Ravi	85000	3
Engineering	Naveen	105000	1
Engineering	Shrey	95000	2
Engineering	Vrushab	92000	3
Sales	Deepak	82000	1
Sales	Anita	78000	2
Sales	Kavya	78000	3
Sales	Manoj	75000	4

Notice: Priya and Ravi both earn 85000 but get different row numbers (2 and 3). The order between ties is non-deterministic unless you add a tiebreaker column to orderBy.

Real-life analogy: ROW_NUMBER is like numbering seats in a theater row — 1, 2, 3, 4. Even if two people are the same height (salary), they sit in different seats. No sharing.

When to Use

Deduplication — keep only row_num = 1 per group
Pagination — show rows 11-20 for page 2
Assigning sequence IDs — deterministic ordering within groups

RANK — Ranking with Gaps

What It Does

Assigns ranks with gaps after ties. If two employees tie at rank 2, the next person gets rank 4 (skipping 3).

df.withColumn("rank", rank().over(window_spec))   .select("department", "name", "salary", "rank")   .show()

department	name	salary	rank
Analytics	Vishnu	91000	1
Analytics	Priya	85000	2
Analytics	Ravi	85000	2
Sales	Deepak	82000	1
Sales	Anita	78000	2
Sales	Kavya	78000	2
Sales	Manoj	75000	4

Priya and Ravi tie at rank 2. Next person in Analytics would be rank 4 (if there was one). Anita and Kavya tie at rank 2 in Sales. Manoj is rank 4, not 3.

Real-life analogy: Olympic medals. Two athletes tie for silver (rank 2). No bronze is awarded — next person is rank 4. The gap reflects that two people already share a higher position.

DENSE_RANK — Ranking Without Gaps

What It Does

Assigns ranks without gaps after ties. If two tie at rank 2, the next person gets rank 3.

df.withColumn("dense_rank", dense_rank().over(window_spec))   .select("department", "name", "salary", "dense_rank")   .show()

department	name	salary	dense_rank
Analytics	Vishnu	91000	1
Analytics	Priya	85000	2
Analytics	Ravi	85000	2
Sales	Deepak	82000	1
Sales	Anita	78000	2
Sales	Kavya	78000	2
Sales	Manoj	75000	3

Manoj is now rank 3 (not 4 like with RANK). No gaps.

Real-life analogy: Class grades. Two students get an A (rank 1). The next grade is B (rank 2), not C. The consecutive ranking makes sense when you care about “how many distinct levels” rather than “how many people are above you.”

ROW_NUMBER vs RANK vs DENSE_RANK Comparison

Using Sales department (Deepak=82K, Anita=78K, Kavya=78K, Manoj=75K):

name	salary	row_number	rank	dense_rank
Deepak	82000	1	1	1
Anita	78000	2	2	2
Kavya	78000	3	2	2
Manoj	75000	4	4	3

Function	Ties	Next after tie	Use when
ROW_NUMBER	Different numbers	Continues sequence	Dedup, pagination
RANK	Same number	Skips (gap)	Competition ranking
DENSE_RANK	Same number	No gap	“How many distinct levels above me”

LAG — Look at the Previous Row

What It Does

Returns a value from the previous row in the window (or N rows back).

window_hire = Window.partitionBy("department").orderBy("hire_date")

df.withColumn("prev_hire_name", lag("name", 1).over(window_hire))   .withColumn("prev_hire_date", lag("hire_date", 1).over(window_hire))   .select("department", "name", "hire_date", "prev_hire_name", "prev_hire_date")   .show()

department	name	hire_date	prev_hire_name	prev_hire_date
Analytics	Priya	2023-07-15	NULL	NULL
Analytics	Vishnu	2023-09-01	Priya	2023-07-15
Analytics	Ravi	2024-01-05	Vishnu	2023-09-01
Engineering	Naveen	2022-01-15	NULL	NULL
Engineering	Shrey	2023-03-20	Naveen	2022-01-15
Engineering	Vrushab	2023-06-10	Shrey	2023-03-20

The first person in each department has NULL (no previous row).

Real-life analogy: LAG is like looking in the rearview mirror while driving. You see the car (row) that was behind you in the lane (partition). The first car in the lane has no rearview — NULL.

Practical Use: Days Since Last Hire

df.withColumn("prev_hire", lag("hire_date", 1).over(window_hire))   .withColumn("days_since_last_hire", datediff(col("hire_date"), col("prev_hire")))   .select("department", "name", "hire_date", "days_since_last_hire")   .show()

LEAD — Look at the Next Row

What It Does

Returns a value from the next row in the window.

df.withColumn("next_hire_name", lead("name", 1).over(window_hire))   .withColumn("next_hire_date", lead("hire_date", 1).over(window_hire))   .select("department", "name", "hire_date", "next_hire_name", "next_hire_date")   .show()

The last person in each department has NULL (no next row).

Real-life analogy: LEAD is like looking through the windshield. You see the car ahead of you. The last car in the lane sees nothing — NULL.

FIRST_VALUE and LAST_VALUE

# Who was the first hire in each department?
df.withColumn("first_hired", first("name").over(window_hire))   .withColumn("last_hired", last("name").over(window_hire))   .select("department", "name", "hire_date", "first_hired", "last_hired")   .show()

NTILE — Divide into Buckets

# Divide all employees into 4 salary quartiles
window_all = Window.orderBy(col("salary").desc())
df.withColumn("quartile", ntile(4).over(window_all))   .select("name", "salary", "quartile")   .show()

name	salary	quartile
Naveen	105000	1
Shrey	95000	1
Vrushab	92000	1
Vishnu	91000	2
Ravi	85000	2
Priya	85000	2
Deepak	82000	3
Anita	78000	3
Kavya	78000	4
Manoj	75000	4

Top 25% earners are quartile 1. Bottom 25% are quartile 4.

Real-life analogy: NTILE is like dividing a deck of cards into 4 equal piles. Each pile (quartile) has roughly the same number of cards, sorted by value.

SUM with Window — Running Totals

window_running = Window.partitionBy("department").orderBy("hire_date")     .rowsBetween(Window.unboundedPreceding, Window.currentRow)

df.withColumn("running_salary_total", sum("salary").over(window_running))   .select("department", "name", "hire_date", "salary", "running_salary_total")   .show()

department	name	hire_date	salary	running_salary_total
Analytics	Priya	2023-07-15	85000	85000
Analytics	Vishnu	2023-09-01	91000	176000
Analytics	Ravi	2024-01-05	85000	261000
Engineering	Naveen	2022-01-15	105000	105000
Engineering	Shrey	2023-03-20	95000	200000
Engineering	Vrushab	2023-06-10	92000	292000

Each row shows the cumulative salary total up to that point within the department.

Real-life analogy: Running total is like your bank balance. Each deposit (hire) adds to the previous balance. At any point, you see the total accumulated so far.

AVG with Window — Moving Average

# 3-row moving average of salary
window_moving = Window.partitionBy("department").orderBy("hire_date")     .rowsBetween(-1, 1)  # Previous row, current row, next row

df.withColumn("moving_avg_salary", round(avg("salary").over(window_moving), 0))   .select("department", "name", "salary", "moving_avg_salary")   .show()

COUNT with Window — Running Count

# Running count of hires per department over time
window_running = Window.partitionBy("department").orderBy("hire_date") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

df.withColumn("hire_number", count("*").over(window_running)) \
  .select("department", "name", "hire_date", "hire_number") \
  .show()

# Result for Engineering:
# Naveen  2022-01-15  1  (first hire)
# Shrey   2023-03-20  2  (second hire)
# Vrushab 2023-06-10  3  (third hire)

# Total headcount per department (no orderBy = entire partition)
window_dept = Window.partitionBy("department")
df.withColumn("dept_headcount", count("*").over(window_dept)) \
  .select("department", "name", "dept_headcount") \
  .show()

# Every Engineering row shows 3, every Analytics row shows 3, every Sales row shows 4

Real-life analogy: Running count is like a ticket counter at a deli. Each customer who arrives gets the next number. At any point, the counter shows how many customers have been served so far — not the total for the day, just up to now.

MAX/MIN with Window — Running Max/Min

# Running max salary seen so far (within department, ordered by hire date)
window_running = Window.partitionBy("department").orderBy("hire_date") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

df.withColumn("max_salary_so_far", max("salary").over(window_running)) \
  .withColumn("min_salary_so_far", min("salary").over(window_running)) \
  .select("department", "name", "hire_date", "salary", "max_salary_so_far", "min_salary_so_far") \
  .show()

# Result for Engineering:
# Naveen  2022-01-15  105000  105000  105000  (only hire so far)
# Shrey   2023-03-20   95000  105000   95000  (max stays, min drops)
# Vrushab 2023-06-10   92000  105000   92000  (max stays, min drops again)

# Department-wide max/min (no running — entire partition)
window_dept = Window.partitionBy("department")
df.withColumn("dept_max", max("salary").over(window_dept)) \
  .withColumn("dept_min", min("salary").over(window_dept)) \
  .withColumn("salary_range", col("dept_max") - col("dept_min")) \
  .select("department", "name", "salary", "dept_max", "dept_min", "salary_range") \
  .show()

# Practical use: flag employees below department minimum threshold
df.withColumn("dept_avg", avg("salary").over(window_dept)) \
  .withColumn("below_avg", when(col("salary") < col("dept_avg"), "Yes").otherwise("No")) \
  .show()

Real-life analogy: Running max is like tracking your personal best in a sport. After each game (row), you check: “Is this my highest score ever?” The running max only goes up or stays flat — it never decreases. Running min is the opposite: your worst performance so far.

Percent of Group Total

window_dept = Window.partitionBy("department")

df.withColumn("dept_total", sum("salary").over(window_dept))   .withColumn("pct_of_dept", round(col("salary") / col("dept_total") * 100, 1))   .select("department", "name", "salary", "dept_total", "pct_of_dept")   .show()

department	name	salary	dept_total	pct_of_dept
Analytics	Vishnu	91000	261000	34.9
Analytics	Priya	85000	261000	32.6
Analytics	Ravi	85000	261000	32.6
Engineering	Naveen	105000	292000	36.0

Naveen accounts for 36% of Engineering’s total salary budget.

Cumulative Distribution (CUME_DIST and PERCENT_RANK)

These functions tell you where a row sits relative to the entire partition as a percentage:

window_salary = Window.partitionBy("department").orderBy(col("salary").asc())

df.withColumn("percent_rank", round(percent_rank().over(window_salary), 2)) \
  .withColumn("cume_dist", round(cume_dist().over(window_salary), 2)) \
  .select("department", "name", "salary", "percent_rank", "cume_dist") \
  .show()

department	name	salary	percent_rank	cume_dist
Sales	Manoj	75000	0.0	0.25
Sales	Anita	78000	0.33	0.75
Sales	Kavya	78000	0.33	0.75
Sales	Deepak	82000	1.0	1.0

# PERCENT_RANK = (rank - 1) / (total_rows - 1)
# Manoj: (1-1)/(4-1) = 0.0   → bottom of the group
# Anita: (2-1)/(4-1) = 0.33  → 33rd percentile
# Deepak: (4-1)/(4-1) = 1.0  → top of the group

# CUME_DIST = count of rows with value <= current / total rows
# Manoj: 1/4 = 0.25  → 25% of Sales earns <= 75K
# Anita: 3/4 = 0.75  → 75% of Sales earns <= 78K (includes Kavya tie)
# Deepak: 4/4 = 1.0  → 100% earns <= 82K

# Practical use: find employees in the top 10% of their department
window_desc = Window.partitionBy("department").orderBy(col("salary").desc())
df.withColumn("pct_rank", percent_rank().over(window_desc)) \
  .filter(col("pct_rank") <= 0.10) \
  .show()  # Top 10% earners per department

Key difference: PERCENT_RANK ranges 0.0 to 1.0 and gives 0.0 to the first row (relative position). CUME_DIST ranges above 0.0 to 1.0 and tells you what fraction of the group is at or below this value. Use PERCENT_RANK for “what percentile am I?” Use CUME_DIST for “what percentage of the group is at or below me?”

Frame Specification: ROWS BETWEEN

The frame defines WHICH rows the window function considers:

# From start of partition to current row (running total)
.rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Previous 2 rows + current row (3-row moving average)
.rowsBetween(-2, 0)

# Previous 1 + current + next 1 (centered moving average)
.rowsBetween(-1, 1)

# Entire partition (total — no running aspect)
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

Real-World Pattern 1: Deduplication (Keep Latest)

# Keep only the latest record per customer_id
window_dedup = Window.partitionBy("customer_id").orderBy(col("updated_date").desc())

df_deduped = df.withColumn("rn", row_number().over(window_dedup))                .filter(col("rn") == 1)                .drop("rn")

This is the most common window function pattern in production — used in every SCD pipeline we built.

Real-World Pattern 2: Gap Detection

# Find gaps in order sequence
window_seq = Window.orderBy("order_id")
df_gaps = df.withColumn("prev_id", lag("order_id", 1).over(window_seq))             .withColumn("gap", col("order_id") - col("prev_id"))             .filter(col("gap") > 1)
# Shows where order IDs skip numbers — data quality issue

Real-World Pattern 3: Sessionization

# Group web clicks into sessions (new session if >30 min gap)
window_user = Window.partitionBy("user_id").orderBy("click_time")
df_sessions = df     .withColumn("prev_click", lag("click_time", 1).over(window_user))     .withColumn("gap_minutes", (col("click_time").cast("long") - col("prev_click").cast("long")) / 60)     .withColumn("new_session", when(col("gap_minutes") > 30, 1).otherwise(0))     .withColumn("session_id", sum("new_session").over(window_user))

Real-World Pattern 4: Year-over-Year Comparison

# Compare each month's revenue with the same month last year
window_yoy = Window.partitionBy("month").orderBy("year")
df_yoy = df     .withColumn("last_year_revenue", lag("revenue", 1).over(window_yoy))     .withColumn("yoy_growth_pct", round(
        (col("revenue") - col("last_year_revenue")) / col("last_year_revenue") * 100, 1
    ))

Real-World Pattern 5: Top N per Group

# Top 2 earners per department
window_top = Window.partitionBy("department").orderBy(col("salary").desc())
df_top2 = df.withColumn("rank", row_number().over(window_top))             .filter(col("rank") <= 2)             .drop("rank")
df_top2.show()

Performance Considerations

Minimize the number of distinct windows — each unique Window spec causes a separate sort. Reuse the same window object.
partitionBy reduces shuffle scope — without partitionBy, the entire DataFrame is one partition (global sort = expensive).
orderBy triggers a sort — avoid unnecessary orderBy when using aggregate window functions (sum, avg without running totals).
Use row_number for dedup, not distinct — row_number lets you CHOOSE which duplicate to keep (latest, highest). distinct() is random.

Common Mistakes

Forgetting orderBy for ranking functions — row_number() without orderBy gives non-deterministic results. Always specify the sort.
Using row_number when you need rank — row_number gives different numbers to ties. If you want ties to have the same rank, use rank() or dense_rank().
Wrong frame for running totals — default frame with orderBy is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. For exact row-based running totals, use .rowsBetween(Window.unboundedPreceding, Window.currentRow).
Applying window function then filtering in wrong order — df.filter(row_number > 1) does not work directly. You must use withColumn first, then filter.
Missing partitionBy — without it, the window covers the ENTIRE DataFrame. rank() without partitionBy ranks ALL employees globally, not within departments.

Interview Questions

Q: What is the difference between ROW_NUMBER, RANK, and DENSE_RANK? A: ROW_NUMBER assigns unique sequential numbers (no ties). RANK assigns same rank to ties but leaves gaps (1,2,2,4). DENSE_RANK assigns same rank to ties without gaps (1,2,2,3). Use ROW_NUMBER for dedup, RANK for competition ranking, DENSE_RANK for counting distinct levels.

Q: How do LAG and LEAD work? A: LAG returns a value from a previous row in the window (looking backward). LEAD returns a value from the next row (looking forward). Both take an offset parameter (default 1). The first row’s LAG is NULL, the last row’s LEAD is NULL. Used for gap detection, YoY comparison, and sessionization.

Q: How do you calculate a running total in PySpark? A: Use sum(column).over(Window.partitionBy(group).orderBy(sort).rowsBetween(Window.unboundedPreceding, Window.currentRow)). The frame specification ensures the sum includes all rows from the start of the partition up to the current row.

Q: How do you get the top N records per group? A: Use row_number().over(Window.partitionBy(group).orderBy(sort_desc)) to assign ranks within each group, then filter where rank <= N. This is more flexible than groupBy().agg() because you keep all columns.

Q: What is the difference between groupBy and window functions? A: groupBy collapses rows — 10 rows become 3 (one per group). Window functions keep ALL rows and add a computed column. Use groupBy for summary reports. Use window functions when you need row-level detail WITH group-level context.

Wrapping Up

Window functions are the most powerful feature in PySpark for row-level analytics. They let you rank, compare, accumulate, and partition WITHOUT losing row details. Master the five patterns — dedup, gap detection, sessionization, YoY comparison, top N per group — and you can solve 90% of window function problems in production and interviews.

← Previous: Lazy Evaluation in PySpark

Python & PySpark (8/11)

Next: PySpark Data Cleaning & Validation →

Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.

PySpark Window Functions Deep Dive: ROW_NUMBER, RANK, LAG, LEAD, Running Totals, and Real-World Patterns

PySpark Window Functions Deep Dive: ROW_NUMBER, RANK, LAG, LEAD, Running Totals, and Real-World Patterns

Table of Contents

What Makes Window Functions Different from GroupBy

The Window Specification: partitionBy + orderBy

The Sample Data

ROW_NUMBER — Unique Sequential Ranking

What It Does

When to Use

RANK — Ranking with Gaps

What It Does

DENSE_RANK — Ranking Without Gaps

What It Does

ROW_NUMBER vs RANK vs DENSE_RANK Comparison

LAG — Look at the Previous Row

What It Does

Practical Use: Days Since Last Hire

LEAD — Look at the Next Row

What It Does

FIRST_VALUE and LAST_VALUE

NTILE — Divide into Buckets

SUM with Window — Running Totals

AVG with Window — Moving Average

COUNT with Window — Running Count

MAX/MIN with Window — Running Max/Min

Percent of Group Total

Cumulative Distribution (CUME_DIST and PERCENT_RANK)

Frame Specification: ROWS BETWEEN

Real-World Pattern 1: Deduplication (Keep Latest)

Real-World Pattern 2: Gap Detection

Real-World Pattern 3: Sessionization

Real-World Pattern 4: Year-over-Year Comparison

Real-World Pattern 5: Top N per Group

Performance Considerations

Common Mistakes

Interview Questions

Wrapping Up

Leave a Comment Cancel Reply

PySpark Window Functions Deep Dive: ROW_NUMBER, RANK, LAG, LEAD, Running Totals, and Real-World Patterns

Table of Contents

What Makes Window Functions Different from GroupBy

The Window Specification: partitionBy + orderBy

The Sample Data

ROW_NUMBER — Unique Sequential Ranking

What It Does

When to Use

RANK — Ranking with Gaps

What It Does

DENSE_RANK — Ranking Without Gaps

What It Does

ROW_NUMBER vs RANK vs DENSE_RANK Comparison

LAG — Look at the Previous Row

What It Does

Practical Use: Days Since Last Hire

LEAD — Look at the Next Row

What It Does

FIRST_VALUE and LAST_VALUE

NTILE — Divide into Buckets

SUM with Window — Running Totals

AVG with Window — Moving Average

COUNT with Window — Running Count

MAX/MIN with Window — Running Max/Min

Percent of Group Total

Cumulative Distribution (CUME_DIST and PERCENT_RANK)

Frame Specification: ROWS BETWEEN

Real-World Pattern 1: Deduplication (Keep Latest)

Real-World Pattern 2: Gap Detection

Real-World Pattern 3: Sessionization

Real-World Pattern 4: Year-over-Year Comparison

Real-World Pattern 5: Top N per Group

Performance Considerations

Common Mistakes

Interview Questions

Wrapping Up

Related Posts

Leave a Comment Cancel Reply