SQL CTEs and Subqueries Explained: When to Use Each and Why CTEs Win

SQL CTEs and Subqueries Explained: When to Use Each and Why CTEs Win

If SQL joins let you combine tables and window functions let you add calculations, then CTEs and subqueries let you build queries in layers — breaking complex problems into manageable steps.

Every senior SQL developer uses CTEs daily. Every data engineering pipeline has at least one. And almost every SQL interview includes a question that requires one.

Yet many developers still write deeply nested subqueries that are impossible to debug. This post will show you why CTEs are almost always the better choice and how to use both effectively.

Table of Contents

  • What Is a Subquery?
  • Types of Subqueries
  • What Is a CTE (Common Table Expression)?
  • CTE Syntax and Examples
  • CTEs vs Subqueries: The Comparison
  • Multiple CTEs (Chaining)
  • CTEs with Window Functions
  • CTEs for Deduplication
  • CTEs for Running Totals and Rankings
  • Recursive CTEs
  • CTEs in INSERT, UPDATE, and DELETE
  • Performance: CTEs vs Subqueries vs Temp Tables
  • When to Use Subqueries Instead of CTEs
  • Common Mistakes
  • Real-World Data Engineering Patterns
  • Interview Questions
  • Wrapping Up

What Is a Subquery?

A subquery is a query nested inside another query. It runs first, and its result is used by the outer query.

-- Find employees who earn more than the average salary
SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

The inner query (SELECT AVG(salary) FROM employees) runs first, returns a number (e.g., 72000), and then the outer query uses that number in the WHERE clause.

Types of Subqueries

Scalar Subquery (Returns a Single Value)

-- Used in SELECT or WHERE
SELECT name, salary,
       salary - (SELECT AVG(salary) FROM employees) as diff_from_avg
FROM employees;

Row Subquery (Returns a Single Row)

SELECT * FROM employees
WHERE (department_id, salary) = (SELECT department_id, MAX(salary)
                                  FROM employees
                                  GROUP BY department_id
                                  LIMIT 1);

Table Subquery (Returns Multiple Rows and Columns)

-- Used in FROM clause (derived table)
SELECT dept_name, avg_sal
FROM (
    SELECT d.department_name as dept_name,
           AVG(e.salary) as avg_sal
    FROM employees e
    JOIN departments d ON e.department_id = d.id
    GROUP BY d.department_name
) as dept_averages
WHERE avg_sal > 70000;

Correlated Subquery (References Outer Query)

-- Find employees who earn more than their department average
SELECT name, salary, department_id
FROM employees e1
WHERE salary > (
    SELECT AVG(salary)
    FROM employees e2
    WHERE e2.department_id = e1.department_id  -- references outer query
);

Warning: Correlated subqueries run once PER ROW of the outer query. On a million-row table, that is a million executions of the inner query. They are slow.

EXISTS Subquery

-- Find departments that have at least one employee
SELECT department_name
FROM departments d
WHERE EXISTS (
    SELECT 1 FROM employees e
    WHERE e.department_id = d.id
);

EXISTS returns TRUE/FALSE — it does not return data. It stops as soon as it finds one matching row, making it efficient for existence checks.

What Is a CTE (Common Table Expression)?

A CTE is a named temporary result set that you define at the beginning of a query using the WITH keyword. It exists only for the duration of that single query.

WITH high_earners AS (
    SELECT name, salary, department_id
    FROM employees
    WHERE salary > 80000
)
SELECT h.name, h.salary, d.department_name
FROM high_earners h
JOIN departments d ON h.department_id = d.id;

The CTE high_earners is defined first, then used in the main query like a regular table. After the query completes, high_earners disappears.

CTE Syntax and Examples

Basic Syntax

WITH cte_name AS (
    -- Your SELECT query here
    SELECT column1, column2
    FROM some_table
    WHERE condition
)
SELECT *
FROM cte_name
WHERE additional_condition;

Example: Average Salary per Department

WITH dept_stats AS (
    SELECT department_id,
           AVG(salary) as avg_salary,
           COUNT(*) as emp_count,
           MAX(salary) as max_salary
    FROM employees
    GROUP BY department_id
)
SELECT d.department_name,
       ds.avg_salary,
       ds.emp_count,
       ds.max_salary
FROM dept_stats ds
JOIN departments d ON ds.department_id = d.id
ORDER BY ds.avg_salary DESC;

Example: Employees Above Department Average

WITH dept_avg AS (
    SELECT department_id, AVG(salary) as avg_salary
    FROM employees
    GROUP BY department_id
)
SELECT e.name, e.salary, e.department_id,
       da.avg_salary,
       e.salary - da.avg_salary as above_avg
FROM employees e
JOIN dept_avg da ON e.department_id = da.department_id
WHERE e.salary > da.avg_salary;

This replaces the correlated subquery approach and is much faster because the average is calculated once per department, not once per employee.

CTEs vs Subqueries: The Comparison

Aspect CTE Subquery
Readability Excellent — named, top-to-bottom flow Poor when deeply nested
Reusability Can reference the same CTE multiple times Must repeat the subquery each time
Debugging Easy — run each CTE independently Hard — must extract and test inner queries
Recursion Supported (recursive CTE) Not supported
Performance Usually same as subquery (optimizer rewrites) Usually same as CTE
Naming Must have a name (self-documenting) Anonymous (no name)
Scope Only within the single query Only within the single query

The Readability Difference

Subquery approach (hard to read):

SELECT name, salary, department_name
FROM (
    SELECT e.name, e.salary, d.department_name,
           ROW_NUMBER() OVER(PARTITION BY e.department_id ORDER BY e.salary DESC) as rn
    FROM employees e
    JOIN departments d ON e.department_id = d.id
    WHERE e.salary > (
        SELECT AVG(salary) FROM employees
    )
) ranked
WHERE rn = 1;

CTE approach (easy to read):

WITH company_avg AS (
    SELECT AVG(salary) as avg_salary
    FROM employees
),
above_average AS (
    SELECT e.name, e.salary, d.department_name, e.department_id
    FROM employees e
    JOIN departments d ON e.department_id = d.id
    CROSS JOIN company_avg ca
    WHERE e.salary > ca.avg_salary
),
ranked AS (
    SELECT name, salary, department_name,
           ROW_NUMBER() OVER(PARTITION BY department_id ORDER BY salary DESC) as rn
    FROM above_average
)
SELECT name, salary, department_name
FROM ranked
WHERE rn = 1;

Same result, but the CTE version tells a story: first calculate the average, then filter, then rank, then pick the top.

Multiple CTEs (Chaining)

You can define multiple CTEs separated by commas, and later CTEs can reference earlier ones:

WITH
-- Step 1: Get monthly revenue
monthly_revenue AS (
    SELECT DATE_TRUNC('month', order_date) as month,
           SUM(amount) as revenue
    FROM orders
    GROUP BY DATE_TRUNC('month', order_date)
),
-- Step 2: Add previous month for comparison (references Step 1)
with_previous AS (
    SELECT month, revenue,
           LAG(revenue) OVER(ORDER BY month) as prev_revenue
    FROM monthly_revenue
),
-- Step 3: Calculate growth percentage (references Step 2)
with_growth AS (
    SELECT month, revenue, prev_revenue,
           CASE WHEN prev_revenue > 0
                THEN ROUND((revenue - prev_revenue) * 100.0 / prev_revenue, 2)
                ELSE NULL
           END as growth_pct
    FROM with_previous
)
-- Final: Select results
SELECT * FROM with_growth
ORDER BY month;

Each CTE builds on the previous one. This is the layered query pattern that data engineers use constantly.

CTEs with Window Functions

CTEs and window functions are a powerful combination because you cannot filter by a window function in the same SELECT — you need a wrapper:

-- Top 3 earners per department
WITH ranked AS (
    SELECT name, department_id, salary,
           ROW_NUMBER() OVER(
               PARTITION BY department_id
               ORDER BY salary DESC
           ) as rn
    FROM employees
)
SELECT name, department_id, salary
FROM ranked
WHERE rn <= 3;

Without the CTE, you cannot write WHERE ROW_NUMBER() <= 3 because window functions execute after WHERE.

CTEs for Deduplication

-- Keep only the latest record per customer
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER(
               PARTITION BY customer_id
               ORDER BY updated_at DESC
           ) as rn
    FROM customer_records
)
SELECT customer_id, name, email, updated_at
FROM ranked
WHERE rn = 1;

This is the single most common CTE pattern in data engineering.

CTEs for Running Totals and Rankings

WITH daily_sales AS (
    SELECT order_date, SUM(amount) as daily_total
    FROM orders
    GROUP BY order_date
),
with_running AS (
    SELECT order_date, daily_total,
           SUM(daily_total) OVER(ORDER BY order_date) as running_total,
           RANK() OVER(ORDER BY daily_total DESC) as best_day_rank
    FROM daily_sales
)
SELECT * FROM with_running
ORDER BY order_date;

Recursive CTEs

A recursive CTE references itself to traverse hierarchical data:

-- Employee hierarchy (who reports to whom)
WITH RECURSIVE org_chart AS (
    -- Base case: top-level managers (no manager)
    SELECT id, name, manager_id, 1 as level
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive case: employees who report to someone in the previous level
    SELECT e.id, e.name, e.manager_id, oc.level + 1
    FROM employees e
    JOIN org_chart oc ON e.manager_id = oc.id
)
SELECT name, level
FROM org_chart
ORDER BY level, name;

Note: SQL Server uses WITH cte AS (...) without the RECURSIVE keyword. PostgreSQL and MySQL require WITH RECURSIVE.

Other Recursive CTE Use Cases

  • Bill of materials: Parts that contain sub-parts
  • File system navigation: Folders containing subfolders
  • Date generation: Generate a sequence of dates
-- Generate dates for a date dimension
WITH RECURSIVE dates AS (
    SELECT CAST('2024-01-01' AS DATE) as dt
    UNION ALL
    SELECT DATEADD(day, 1, dt)
    FROM dates
    WHERE dt < '2026-12-31'
)
SELECT dt FROM dates;

CTEs in INSERT, UPDATE, and DELETE

CTEs are not just for SELECT. You can use them with data modification statements:

CTE with INSERT

WITH new_records AS (
    SELECT s.id, s.name, s.email
    FROM staging s
    LEFT JOIN production p ON s.id = p.id
    WHERE p.id IS NULL  -- only records not in production
)
INSERT INTO production (id, name, email)
SELECT id, name, email FROM new_records;

CTE with DELETE

-- Delete duplicate records (keep first occurrence)
WITH duplicates AS (
    SELECT id,
           ROW_NUMBER() OVER(PARTITION BY email ORDER BY created_at) as rn
    FROM users
)
DELETE FROM duplicates WHERE rn > 1;

CTE with UPDATE

WITH dept_avg AS (
    SELECT department_id, AVG(salary) as avg_sal
    FROM employees
    GROUP BY department_id
)
UPDATE employees
SET bonus = salary * 0.15
FROM dept_avg
WHERE employees.department_id = dept_avg.department_id
  AND employees.salary > dept_avg.avg_sal;

Performance: CTEs vs Subqueries vs Temp Tables

Approach Materialized? Reusable? Best For
CTE No (usually inlined by optimizer) Within the query Readability, single-query logic
Subquery No (inlined) No (must repeat) Simple one-time filters
Temp Table Yes (stored on disk) Across multiple queries Large intermediate results, complex multi-step ETL
View No (re-executed each time) Across queries and sessions Reusable abstractions
Materialized View Yes (pre-computed) Across queries Expensive aggregations

Key insight: In most databases (SQL Server, PostgreSQL), CTEs and subqueries have identical performance because the optimizer rewrites them the same way. Choose based on readability.

Exception: In PostgreSQL (before version 12), CTEs were always materialized (computed and stored in memory). This made them slower for simple cases. PostgreSQL 12+ inlines CTEs by default.

When to use temp tables instead of CTEs: – The intermediate result is used by multiple separate queries – The intermediate result is very large and benefits from indexing – You need to debug step-by-step in a stored procedure

When to Use Subqueries Instead of CTEs

CTEs are not always the answer. Use subqueries when:

  • Simple scalar values: WHERE salary > (SELECT AVG(salary) FROM employees) is cleaner than a CTE for one value
  • EXISTS checks: WHERE EXISTS (SELECT 1 FROM ...) is idiomatic and clear
  • IN lists: WHERE department_id IN (SELECT id FROM departments WHERE region = 'NA')
  • Single-use, simple filters where a CTE would add unnecessary verbosity

Common Mistakes

1. Referencing a CTE Outside Its Query

WITH my_cte AS (SELECT * FROM employees)
SELECT * FROM my_cte;  -- This works

SELECT * FROM my_cte;  -- ERROR: my_cte does not exist here

A CTE only lives for the duration of the immediately following statement.

2. Missing Comma Between Multiple CTEs

-- WRONG: missing comma
WITH cte1 AS (SELECT 1)
     cte2 AS (SELECT 2)  -- Error!
SELECT * FROM cte1, cte2;

-- RIGHT: comma between CTEs
WITH cte1 AS (SELECT 1),
     cte2 AS (SELECT 2)
SELECT * FROM cte1, cte2;

3. Using CTE When a Simple Subquery Is Cleaner

-- Overkill CTE for a simple value
WITH avg_sal AS (SELECT AVG(salary) as val FROM employees)
SELECT name FROM employees
CROSS JOIN avg_sal WHERE salary > avg_sal.val;

-- Simpler with subquery
SELECT name FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

4. Infinite Recursion

Recursive CTEs without a proper termination condition run forever (until the database kills them):

-- DANGEROUS: no WHERE clause to stop recursion
WITH RECURSIVE nums AS (
    SELECT 1 as n
    UNION ALL
    SELECT n + 1 FROM nums  -- never stops!
)
SELECT * FROM nums;

Always include a termination condition: WHERE n < 1000.

Real-World Data Engineering Patterns

Pattern 1: Staging to Production with Dedup

WITH deduped AS (
    SELECT *,
           ROW_NUMBER() OVER(PARTITION BY id ORDER BY load_date DESC) as rn
    FROM staging_customers
),
clean AS (
    SELECT id, name, email, phone, load_date
    FROM deduped
    WHERE rn = 1
)
INSERT INTO production_customers
SELECT id, name, email, phone, load_date
FROM clean;

Pattern 2: Incremental Load Detection

WITH source_max AS (
    SELECT MAX(modified_date) as max_date
    FROM source_table
),
watermark AS (
    SELECT last_loaded_value
    FROM pipeline_config
    WHERE table_name = 'customers'
),
delta AS (
    SELECT s.*
    FROM source_table s
    CROSS JOIN watermark w
    WHERE s.modified_date > w.last_loaded_value
      AND s.modified_date <= (SELECT max_date FROM source_max)
)
SELECT * FROM delta;

Pattern 3: Data Quality Report

WITH total_rows AS (
    SELECT COUNT(*) as cnt FROM customers
),
null_checks AS (
    SELECT
        SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) as null_emails,
        SUM(CASE WHEN name IS NULL THEN 1 ELSE 0 END) as null_names,
        SUM(CASE WHEN phone IS NULL THEN 1 ELSE 0 END) as null_phones
    FROM customers
),
dup_check AS (
    SELECT COUNT(*) - COUNT(DISTINCT email) as duplicate_emails
    FROM customers
)
SELECT
    t.cnt as total_rows,
    n.null_emails, n.null_names, n.null_phones,
    d.duplicate_emails
FROM total_rows t
CROSS JOIN null_checks n
CROSS JOIN dup_check d;

Interview Questions

Q: What is a CTE and how is it different from a subquery? A: A CTE is a named temporary result set defined with the WITH keyword. Unlike subqueries which are anonymous and nested, CTEs are named, defined at the top of the query, and can be referenced multiple times. They improve readability without typically affecting performance.

Q: Can you reference a CTE multiple times in the same query? A: Yes. A CTE can be referenced multiple times in the main query, unlike a subquery which must be repeated. This avoids code duplication and makes self-joins on derived data cleaner.

Q: What is a recursive CTE? A: A CTE that references itself to traverse hierarchical data. It has a base case (anchor) and a recursive case joined with UNION ALL. Common uses: org charts, bill of materials, date generation.

Q: When would you use a temp table instead of a CTE? A: When the intermediate result is used across multiple separate queries, when it is very large and needs indexing, or when debugging step-by-step in a stored procedure. CTEs only exist for one statement.

Q: Do CTEs improve query performance? A: Usually not — most optimizers inline CTEs like subqueries, producing identical execution plans. CTEs improve readability and maintainability, which are equally important in production code.

Q: How do you deduplicate data using a CTE? A: Use ROW_NUMBER in a CTE partitioned by the dedup key, ordered by the tiebreaker (e.g., latest date). Then filter WHERE rn = 1 in the outer query.

Wrapping Up

CTEs are the data engineer’s best friend for writing readable, maintainable SQL. They let you break complex queries into named, logical steps that anyone can follow.

The pattern is always the same: define your steps as CTEs at the top, then combine them in the final SELECT. Each CTE is testable independently, named descriptively, and builds on the previous one.

Combined with window functions and joins, CTEs complete your SQL toolkit for any data engineering challenge.

Related posts:SQL Window FunctionsSQL Joins ExplainedTop 20 Data Engineering Interview QuestionsPython for Data Engineers

If this guide leveled up your SQL, share it with your team. Questions? Drop a comment below.


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top
Share via
Copy link