Apache Spark and PySpark for Data Engineers: Architecture, Python vs PySpark, and Big Data Processing

Imagine you need to move a mountain of sand from one side of a construction site to the other. You could use one wheelbarrow and one worker — it works, but it takes weeks. Or you could hire 100 workers with 100 wheelbarrows, divide the sand pile into 100 sections, and move it all in a single day.

That is the difference between Python (one worker) and PySpark (100 workers). Python processes data on a single machine. PySpark distributes the work across a CLUSTER of machines, processing data in parallel. Same destination, dramatically different speed.

This post covers everything a data engineer needs to know about Apache Spark and PySpark — the architecture that makes it fast, the differences from regular Python/Pandas, and the code patterns you will use in production.

Table of Contents

  • What Is Apache Spark?
  • What Is PySpark?
  • When Do You Need Spark (And When You Do Not)
  • The Architecture: How Spark Actually Works
  • Driver and Executors: The Manager and Workers
  • SparkSession: The Entry Point
  • RDDs, DataFrames, and Datasets
  • Lazy Evaluation: The Recipe Book Analogy
  • Transformations vs Actions
  • Python vs PySpark: The Key Differences
  • PySpark DataFrame Operations (With Pandas Comparison)
  • Reading and Writing Data
  • Filtering, Selecting, and Sorting
  • GroupBy and Aggregations
  • Joins
  • Window Functions
  • Spark SQL: Writing SQL on DataFrames
  • Partitioning and Shuffling
  • PySpark in Azure (Synapse, Databricks, HDInsight)
  • Performance Tuning
  • Common Mistakes
  • Interview Questions
  • Wrapping Up

What Is Apache Spark?

Apache Spark is a distributed data processing engine that processes large datasets across a cluster of machines in parallel. It was created at UC Berkeley in 2009 and is now the de facto standard for big data processing.

Key characteristics:

  • Distributed — spreads work across multiple machines (nodes)
  • In-memory — processes data in RAM, not on disk (up to 100x faster than Hadoop MapReduce)
  • Unified — handles batch processing, streaming, machine learning, and graph processing
  • Multi-language — supports Python (PySpark), Scala, Java, R, and SQL

Real-life analogy: Spark is like a restaurant kitchen with 20 chefs (worker nodes) and 1 head chef (driver). The head chef takes the order (your code), divides it into tasks (chop vegetables, grill meat, prepare sauce), and assigns each task to a different chef. All chefs work simultaneously. The result comes together on the plate much faster than if one chef did everything.

What Is PySpark?

PySpark is the Python API for Apache Spark. It lets you write Spark programs using Python syntax. Under the hood, your Python code drives Spark’s JVM-based execution engine, which is written in Scala.

# This is PySpark — looks like Python, runs on a distributed cluster
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg

spark = SparkSession.builder.appName("SalesAnalysis").getOrCreate()

df = spark.read.parquet("s3://datalake/sales/")
result = (
    df.filter(col("year") == 2026)
    .groupBy("region")
    .agg(sum("amount").alias("total_sales"))
)
result.show()

This code looks simple, but it might be processing 500 GB of data across 50 machines simultaneously.

When Do You Need Spark (And When You Do Not)

Use Spark When:

Scenario | Why Spark
Data exceeds a single machine’s RAM (50+ GB) | Distributes data across the cluster
Processing takes hours with Pandas | Parallelism cuts time to minutes
Joining two billion-row tables | Distributed join strategies handle this efficiently
Streaming data (Kafka, Event Hubs) | Spark Structured Streaming
Data lake transformations (Bronze → Silver → Gold) | Native Parquet/Delta support
Machine learning on large datasets | MLlib library

Do NOT Use Spark When:

Scenario | Use Instead
Data fits in memory (under 10 GB) | Pandas — simpler, faster startup
Simple file copy (no transformation) | ADF Copy Activity — cheaper
SQL-only transformations | SQL Pool or Athena — simpler
Quick one-off analysis | Pandas or SQL — no cluster overhead
Real-time single-record processing | Azure Functions or Lambda

The rule of thumb: If Pandas can handle it, use Pandas. If Pandas crashes with “out of memory” or takes hours, use PySpark.

Real-life analogy: You do not hire 100 construction workers to hang a picture frame. That is a one-person job (Pandas). But when you need to build a house (process 500 GB), one person takes a year. 100 workers finish in a month (PySpark).

The Architecture: How Spark Actually Works

The Cluster

┌────────────────────────────────────────────────────┐
│                   SPARK CLUSTER                    │
│                                                    │
│  ┌──────────────┐                                  │
│  │    DRIVER    │  ← Your code runs here           │
│  │   (Manager)  │  ← Creates the execution plan    │
│  │              │  ← Distributes work to executors │
│  └──────┬───────┘                                  │
│         │                                          │
│    ┌────┴──┬───────┬───────┐                       │
│    │       │       │       │                       │
│  ┌─┴──┐  ┌─┴──┐  ┌─┴──┐  ┌─┴──┐                    │
│  │ E1 │  │ E2 │  │ E3 │  │ E4 │  ← Executors       │
│  │    │  │    │  │    │  │    │  ← Do the work     │
│  │Core│  │Core│  │Core│  │Core│                    │
│  │Core│  │Core│  │Core│  │Core│                    │
│  │RAM │  │RAM │  │RAM │  │RAM │                    │
│  └────┘  └────┘  └────┘  └────┘                    │
│                                                    │
│  Cluster Manager: YARN / Kubernetes / Standalone   │
└────────────────────────────────────────────────────┘

The Components

Component | What It Does | Real-Life Equivalent
Driver | Runs your code, creates the execution plan, coordinates executors | Head chef — reads the recipe, assigns tasks
Executors | Execute the actual data processing tasks | Line cooks — do the chopping, cooking, plating
Cluster Manager | Allocates resources (CPU, memory) to drivers and executors | Restaurant manager — assigns chefs to stations
Cores | CPU threads within each executor | Hands of each cook — more hands = more tasks
Partitions | Chunks of data distributed across executors | Portions of ingredients divided among cooks

Driver and Executors: The Manager and Workers

The Driver (Brain)

The Driver is where YOUR code runs. It:

  1. Parses your PySpark code
  2. Creates a Directed Acyclic Graph (DAG) of operations
  3. Optimizes the plan (Catalyst optimizer)
  4. Splits the work into tasks
  5. Sends tasks to executors
  6. Collects results back

The Driver does NOT process data. It manages the process.

Real-life analogy: The Driver is like a movie director. They do not act in the movie (process data). They plan every scene (execution plan), tell actors where to stand (distribute tasks), and assemble the final cut (collect results).
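You can watch the Driver do its planning without running anything. A minimal sketch, assuming a sales.parquet file with year and region columns:

from pyspark.sql.functions import col

df = spark.read.parquet("sales.parquet")   # illustrative path
plan = df.filter(col("year") == 2026).groupBy("region").count()

# explain() prints the plans the Driver built — no data has been processed
plan.explain(True)   # parsed, analyzed, optimized, and physical plans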

The Executors (Muscle)

Executors are the worker processes that actually process data:

  1. Receive tasks from the Driver
  2. Read data from storage (ADLS, S3, HDFS)
  3. Process the data (filter, join, aggregate)
  4. Store intermediate results in memory
  5. Return results to the Driver

Each executor has its own memory and CPU cores. More executors = more parallelism.
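How much muscle you get is configured when the session or cluster is created. A sketch with illustrative numbers — in Synapse and Databricks you normally set these in the pool or cluster settings instead of in code:

from pyspark.sql import SparkSession

# Illustrative sizing: 4 executors x 4 cores each = up to 16 tasks in parallel
spark = (
    SparkSession.builder
    .appName("ExecutorSizingDemo")
    .config("spark.executor.instances", "4")   # how many executor processes
    .config("spark.executor.cores", "4")       # CPU cores per executor
    .config("spark.executor.memory", "8g")     # RAM per executor
    .getOrCreate()
)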

How Data Flows

1. You write: df.filter(col("year") == 2026).groupBy("region").sum("amount")

2. Driver creates plan:
   Stage 1: Read Parquet files (parallel across executors)
   Stage 2: Filter year=2026 (each executor filters its partition)
   Stage 3: Shuffle data by region (redistribute data so same regions are together)
   Stage 4: Aggregate sum(amount) per region (each executor sums its partition)

3. Executors execute:
   Executor 1: Reads files 1-100, filters, partial sum for "North America"
   Executor 2: Reads files 101-200, filters, partial sum for "Europe"
   Executor 3: Reads files 201-300, filters, partial sum for "Asia"
   Executor 4: Reads files 301-400, filters, partial sum for "North America"

4. Shuffle: Regroup by region
   Executor 1 gets ALL "North America" data → final sum
   Executor 2 gets ALL "Europe" data → final sum
   Executor 3 gets ALL "Asia" data → final sum

5. Driver collects: [{North America, 5M}, {Europe, 3M}, {Asia, 2M}]

SparkSession: The Entry Point

Every PySpark program starts with a SparkSession:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("MyDataPipeline")
    .config("spark.sql.shuffle.partitions", 200)
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)

In Synapse and Databricks, the SparkSession is pre-created. You just use spark directly:

# In Synapse/Databricks notebooks — spark is already available
df = spark.read.parquet("abfss://container@storage.dfs.core.windows.net/data/")

Real-life analogy: SparkSession is like turning on the ignition of a car. You must start the engine before you can drive. All Spark operations go through this session.

RDDs, DataFrames, and Datasets

Spark has evolved through three APIs:

API | Era | Description | When to Use
RDD | Spark 1.0 (2014) | Low-level, unstructured | Almost never (legacy code only)
DataFrame | Spark 1.3 (2015) | Table-like, optimized, SQL-compatible | Always in PySpark
Dataset | Spark 1.6 (2016) | Type-safe DataFrame (Scala/Java only) | Not available in Python

Use DataFrames. They are optimized by the Catalyst query optimizer, support SQL syntax, and integrate with Parquet, Delta, JSON, and CSV natively.

# DataFrame — this is what you use
from pyspark.sql.functions import col

df = spark.read.parquet("data.parquet")
df.filter(col("salary") > 70000).show()

# RDD — old way (avoid unless required)
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
rdd.filter(lambda x: x > 3).collect()

Lazy Evaluation: The Recipe Book Analogy

This is the most important concept in Spark.

When you write PySpark transformations, nothing happens immediately. Spark records your instructions but does not execute them until you request a result.

# These lines do NOT execute yet — Spark just records them
df = spark.read.parquet("sales.parquet")        # Lazy — recorded
filtered = df.filter(col("year") == 2026)       # Lazy — recorded
grouped = filtered.groupBy("region")            # Lazy — recorded
result = grouped.agg(sum("amount"))             # Lazy — recorded

# THIS triggers execution — Spark runs everything above
result.show()  # ACTION — now Spark reads, filters, groups, and aggregates

Why lazy evaluation? Because Spark can optimize the ENTIRE plan before executing. If you filter after a join, Spark might push the filter BEFORE the join (reducing data before the expensive operation). This is impossible if each step executes immediately.

Real-life analogy: Imagine you are planning a road trip. Lazy evaluation is like writing down ALL the stops on the map FIRST (transformations), then finding the optimal route that visits them all (optimization), and THEN driving (execution). If you drove to each stop immediately without planning, you might backtrack and waste time.
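A quick way to feel the difference: transformations return instantly, while the action does the work. A minimal sketch using a generated dataset, so it runs anywhere:

import time
from pyspark.sql.functions import col

t0 = time.time()
big = spark.range(100_000_000)                    # lazy — just a plan
doubled = big.withColumn("x", col("id") * 2)      # lazy — the plan grows
print(f"Defining took {time.time() - t0:.3f}s")   # near-instant

t0 = time.time()
print(doubled.count())                            # action — Spark executes now
print(f"Executing took {time.time() - t0:.3f}s")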

Transformations (Lazy — Recorded)

df.filter(...)       # Lazy
df.select(...)       # Lazy
df.groupBy(...)      # Lazy
df.join(...)         # Lazy
df.withColumn(...)   # Lazy
df.orderBy(...)      # Lazy

Actions (Eager — Triggers Execution)

df.show()            # Action — display results
df.count()           # Action — count rows
df.collect()         # Action — return all rows to driver (dangerous on large data!)
df.write.parquet()   # Action — write to storage
df.first()           # Action — return first row
df.take(10)          # Action — return first 10 rows

Python vs PySpark: The Key Differences

Feature | Python (Pandas) | PySpark
Processing | Single machine | Distributed cluster (many machines)
Data size | Fits in RAM (up to ~10 GB) | Unlimited (TBs to PBs)
Execution | Eager (runs immediately) | Lazy (runs on action)
Memory | Limited to machine RAM | Cluster aggregate RAM (100s of GB)
API | df['column'] or df.column | col('column') or df['column']
Mutability | DataFrames are mutable | DataFrames are immutable
Index | Row index exists | No row index
Null handling | NaN | None / null
String operations | df['name'].str.upper() | upper(col('name'))
Apply function | df.apply(func) | udf(func) or transform
Startup time | Instant | 30 sec – 5 min (cluster startup)
Best for | Small data, exploration | Big data, production pipelines

Code Comparison

Read a file:

# Pandas
import pandas as pd
df = pd.read_parquet("sales.parquet")

# PySpark
df = spark.read.parquet("abfss://container@storage/sales/")

Filter rows:

# Pandas
result = df[df["salary"] > 70000]

# PySpark
from pyspark.sql.functions import col
result = df.filter(col("salary") > 70000)

Add a column:

# Pandas
df["bonus"] = df["salary"] * 0.10

# PySpark
from pyspark.sql.functions import col
df = df.withColumn("bonus", col("salary") * 0.10)

Group and aggregate:

# Pandas
result = df.groupby("department")["salary"].mean()

# PySpark
from pyspark.sql.functions import avg
result = df.groupBy("department").agg(avg("salary").alias("avg_salary"))

Join:

# Pandas
merged = pd.merge(employees, departments, on="dept_id", how="inner")

# PySpark
merged = employees.join(departments, employees.dept_id == departments.id, "inner")

Real-life analogy: Pandas is like cooking in your home kitchen — quick, simple, limited by your kitchen size. PySpark is like cooking in a commercial kitchen with 20 stations — more setup time, but can prepare food for 1,000 guests that your home kitchen never could.

Reading and Writing Data

# Read Parquet (most common for data lakes)
df = spark.read.parquet("abfss://container@storage.dfs.core.windows.net/silver/customers/")

# Read CSV
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

# Read JSON
df = spark.read.json("path/to/data.json")

# Read Delta Lake
df = spark.read.format("delta").load("path/to/delta/table/")

# Read from SQL Database
df = spark.read.format("jdbc")     .option("url", "jdbc:sqlserver://server.database.windows.net:1433;database=mydb")     .option("dbtable", "SalesLT.Customer")     .option("user", "admin")     .option("password", "password")     .load()

# Write Parquet
df.write.mode("overwrite").parquet("path/to/output/")

# Write Delta
df.write.format("delta").mode("overwrite").save("path/to/delta/output/")

# Write with partitioning
df.write.partitionBy("year", "month").parquet("path/to/partitioned/")

Filtering, Selecting, and Sorting

from pyspark.sql.functions import col, upper, trim, when, lit, concat

# Select columns
df.select("name", "city", "salary")
df.select(col("name"), col("salary") * 12)  # calculated column

# Filter rows
df.filter(col("salary") > 70000)
df.filter((col("city") == "Toronto") & (col("status") == "Active"))
df.filter(col("name").like("%Alice%"))
df.filter(col("department").isin("Engineering", "Sales"))
df.filter(col("email").isNotNull())

# Add/modify columns
df.withColumn("full_name", concat(col("first_name"), lit(" "), col("last_name")))
df.withColumn("salary_band", when(col("salary") > 80000, "Senior")
                             .when(col("salary") > 60000, "Mid")
                             .otherwise("Junior"))

# Rename columns
df.withColumnRenamed("old_name", "new_name")

# Drop columns
df.drop("temp_column", "debug_column")

# Sort
df.orderBy(col("salary").desc())
df.orderBy("department", col("salary").desc())

# Distinct / Deduplicate
df.distinct()
df.dropDuplicates(["email"])

GroupBy and Aggregations

from pyspark.sql.functions import sum, avg, count, min, max, countDistinct

# Basic aggregation
df.groupBy("department").agg(
    count("*").alias("emp_count"),
    avg("salary").alias("avg_salary"),
    sum("salary").alias("total_salary"),
    max("salary").alias("max_salary"),
    min("salary").alias("min_salary")
)

# Multiple group by columns
df.groupBy("department", "city").agg(
    count("*").alias("emp_count"),
    avg("salary").alias("avg_salary")
)

# Count distinct
df.groupBy("department").agg(
    countDistinct("city").alias("unique_cities")
)

# Pivot (cross-tab)
df.groupBy("department").pivot("year").agg(sum("salary"))

Joins

# Inner join
result = employees.join(departments, employees.dept_id == departments.id, "inner")

# Left outer join
result = employees.join(departments, employees.dept_id == departments.id, "left")

# Full outer join
result = employees.join(departments, employees.dept_id == departments.id, "full")

# Cross join
result = products.crossJoin(colors)

# Join with same column name (avoid ambiguity)
result = employees.join(departments, employees["dept_id"] == departments["id"], "inner")     .drop(departments["id"])  # drop duplicate column

# Broadcast join (small table optimization)
from pyspark.sql.functions import broadcast
result = fact_table.join(broadcast(dim_table), "key_column")
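Spark also broadcasts small tables automatically below a size threshold; broadcast() just forces it. A sketch of the relevant setting — the 100 MB value is illustrative (the default is 10 MB):

# Tables smaller than this are broadcast automatically in joins
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)  # 100 MB

# Setting it to -1 disables automatic broadcasting entirely
# spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)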

Window Functions

from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number, rank, dense_rank, lag, lead, sum

# Define window
window_dept = Window.partitionBy("department").orderBy(col("salary").desc())

# Row number (unique rank)
df.withColumn("row_num", row_number().over(window_dept))

# Rank (with gaps)
df.withColumn("rank", rank().over(window_dept))

# Running total
window_date = (
    Window.partitionBy("department")
    .orderBy("hire_date")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
df.withColumn("running_total", sum("salary").over(window_date))

# Previous value (LAG)
df.withColumn("prev_salary", lag("salary", 1).over(window_dept))

# Top 3 per department
df.withColumn("rn", row_number().over(window_dept)).filter(col("rn") <= 3)

Spark SQL: Writing SQL on DataFrames

You can register DataFrames as temporary views and write standard SQL:

# Register as temp view
df.createOrReplaceTempView("employees")

# Write SQL
result = spark.sql("""
    SELECT department,
           COUNT(*) AS emp_count,
           AVG(salary) AS avg_salary
    FROM employees
    WHERE status = 'Active'
    GROUP BY department
    HAVING COUNT(*) > 5
    ORDER BY avg_salary DESC
""")

result.show()

This is extremely useful for data engineers who are more comfortable with SQL than with the DataFrame API.
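The two APIs also mix freely — a SQL result is just a DataFrame, so you can keep chaining. A small sketch, continuing from the employees view above:

from pyspark.sql.functions import col

# spark.sql() returns a regular DataFrame...
dept_salaries = spark.sql(
    "SELECT department, AVG(salary) AS avg_salary FROM employees GROUP BY department"
)

# ...which you can keep transforming with the DataFrame API
dept_salaries.filter(col("avg_salary") > 80000).orderBy(col("avg_salary").desc()).show()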

Partitioning and Shuffling

Partitions

Data in Spark is divided into partitions — chunks distributed across executors. More partitions = more parallelism (up to a point).

# Check number of partitions
df.rdd.getNumPartitions()

# Repartition (change the partition count — always causes a full shuffle)
df = df.repartition(200)

# Repartition by column (all rows with same key go to same partition)
df = df.repartition("department")

# Coalesce (decrease partitions — no shuffle, more efficient)
df = df.coalesce(10)  # reduce to 10 partitions before writing

Shuffling (The Expensive Operation)

A shuffle redistributes data across the cluster. It is the most expensive operation in Spark because data moves over the network between nodes.

Operations that cause shuffles:

  • groupBy() + aggregation
  • join() (unless broadcast)
  • repartition()
  • distinct()
  • orderBy() (global sort)

Real-life analogy: Shuffling is like reorganizing a library by author instead of by genre. Every book might need to move to a different shelf. If the library has 10 floors (10 nodes), books need to travel between floors (network transfer). Slow and expensive.

How to minimize shuffles (a sketch for spotting them follows below):

  1. Filter early — reduce data before joins and aggregations
  2. Broadcast small tables — avoids the shuffle for joins
  3. Partition by the join key — pre-partition so data is already co-located
  4. Use coalesce instead of repartition when reducing partitions
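Before minimizing shuffles, confirm where they occur: they show up as Exchange operators in the physical plan. A minimal sketch, assuming a sales.parquet file with a region column:

df = spark.read.parquet("sales.parquet")
agg = df.groupBy("region").count()

# Look for "Exchange hashpartitioning(region, ...)" in the output — that is the shuffle
agg.explain()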

PySpark in Azure (Synapse, Databricks, HDInsight)

Synapse Spark Pool

# Pre-configured — spark session already available
df = spark.read.parquet("abfss://synapse-workspace@naveensynapsedl.dfs.core.windows.net/silver/")

# Write Delta
df.write.format("delta").mode("overwrite").save("abfss://synapse-workspace@naveensynapsedl.dfs.core.windows.net/gold/customers/")

Databricks

# Also pre-configured — spark session available
df = spark.read.format("delta").load("/mnt/datalake/silver/customers/")

# Databricks-specific: display() instead of show()
display(df)

When to Use Which

Service | Best For | Cost Model
Synapse Spark Pool | Integrated with Synapse pipelines and SQL pools | Per node-hour
Databricks | Best Spark experience, Delta Lake native, ML | Per DBU
HDInsight | Legacy Hadoop workloads, Kafka, HBase | Per VM-hour

Performance Tuning

1. Right-Size Your Cluster

Workload | Recommended
Development | 3 nodes, Small (4 cores, 32 GB each)
Medium ETL (1-10 GB) | 5 nodes, Medium (8 cores, 64 GB each)
Large ETL (10-100 GB) | 10-20 nodes, Large (16 cores, 128 GB each)
Very large (100+ GB) | 20+ nodes, auto-scale enabled

2. Use Parquet and Delta (Not CSV)

  • CSV — full scan required, no column pruning, no predicate pushdown
  • Parquet — column pruning, predicate pushdown, compression; roughly 10x faster
  • Delta — everything Parquet offers, plus ACID transactions and Z-ORDER
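If you inherit CSV data, converting it once to Parquet pays for itself quickly. A minimal sketch with illustrative paths:

# One-time conversion: read the CSV, write Parquet partitioned by year
# (assumes the data has a year column)
df = spark.read.csv("path/to/raw.csv", header=True, inferSchema=True)
df.write.mode("overwrite").partitionBy("year").parquet("path/to/converted/")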

3. Cache Wisely

# Cache a DataFrame that is used multiple times
df_filtered = df.filter(col("year") == 2026).cache()

# Use it multiple times (reads from memory, not disk)
df_filtered.groupBy("region").count().show()
df_filtered.groupBy("product").sum("amount").show()

# Unpersist when done
df_filtered.unpersist()

4. Avoid collect() on Large Data

# DANGEROUS — brings ALL data to the driver. If data > driver memory → crash
all_data = df.collect()

# SAFE — brings only 10 rows
sample = df.take(10)
# or
df.show(10)
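A related pattern: when you genuinely need a Pandas DataFrame (for plotting, for example), cap the row count first so only a small sample reaches the driver:

# SAFE — at most 1,000 rows are collected to the driver
pdf = df.limit(1000).toPandas()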

5. Use Adaptive Query Execution (AQE)

spark.conf.set("spark.sql.adaptive.enabled", "true")  # Default in Spark 3.0+

AQE automatically optimizes joins, handles skew, and adjusts partition sizes at runtime.
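AQE has tunable sub-features. A sketch of two commonly used ones — both are Spark 3.x settings and are on by default in recent versions:

# Merge many tiny shuffle partitions into fewer, right-sized ones
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Split skewed partitions during joins so one executor is not overloaded
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")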

Common Mistakes

  1. Using Pandas when data is too large — Pandas loads everything into RAM. 50 GB file on a 16 GB machine = crash. Use PySpark.

  2. Calling collect() on large data — brings all data to the driver, crashing it. Use show(), take(), or write() instead.

  3. Not caching reused DataFrames — if you use the same filtered DataFrame 5 times, Spark recomputes it 5 times without caching.

  4. Too many small files — writing 10,000 small Parquet files. Use coalesce(10) before writing.

  5. Ignoring partition skew — one partition has 90% of the data. Use salting or repartition by a more even key.

  6. Using UDFs when built-in functions exist — UDFs are slow (row-by-row Python). Built-in functions run in the JVM (much faster). Always check if pyspark.sql.functions has what you need before writing a UDF (see the sketch after this list).

  7. Not filtering early — filter BEFORE joins and aggregations to reduce data volume.
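To make mistake 6 concrete, here is the same transformation written both ways. A minimal sketch — the name column is illustrative:

from pyspark.sql.functions import col, udf, upper
from pyspark.sql.types import StringType

# SLOW: a Python UDF — every row is serialized to a Python worker and back
to_upper = udf(lambda s: s.upper() if s is not None else None, StringType())
df = df.withColumn("name_upper", to_upper(col("name")))

# FAST: the built-in runs inside the JVM — no Python round-trip
df = df.withColumn("name_upper", upper(col("name")))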

Interview Questions

Q: What is Apache Spark and why is it faster than Hadoop MapReduce?
A: Spark is a distributed data processing engine that processes data in-memory across a cluster. It is faster than MapReduce because MapReduce writes intermediate results to disk between stages, while Spark keeps data in RAM. This makes Spark up to 100x faster for iterative workloads.

Q: What is the difference between a transformation and an action in Spark?
A: Transformations (filter, select, groupBy, join) are lazy — they record the operation but do not execute it. Actions (show, count, collect, write) trigger execution of all recorded transformations. Lazy evaluation allows Spark to optimize the entire execution plan before running.

Q: What is the difference between Python/Pandas and PySpark?
A: Pandas processes data on a single machine using RAM — limited to data that fits in memory. PySpark distributes data across a cluster of machines, processing in parallel. Pandas is eager (executes immediately), PySpark is lazy (executes on action). Use Pandas for small data, PySpark for big data.

Q: What is shuffling and why is it expensive?
A: Shuffling is the redistribution of data across cluster nodes during operations like groupBy, join, and repartition. It is expensive because data moves over the network between machines. Minimize shuffles by filtering early, broadcasting small tables, and pre-partitioning by join keys.

Q: What is the difference between repartition() and coalesce()?
A: repartition() creates a new set of partitions by fully shuffling data — it can increase or decrease partitions. coalesce() reduces partitions WITHOUT a full shuffle — it only merges existing partitions. Use coalesce() when reducing partitions (e.g., before writing output files).

Q: How would you optimize a slow PySpark job?
A: Filter early to reduce data volume, broadcast small tables in joins, use Parquet/Delta instead of CSV, cache reused DataFrames, avoid collect() on large data, use coalesce() before writing to reduce small files, enable AQE, and right-size the cluster.

Wrapping Up

PySpark is the bridge between Python’s simplicity and Spark’s power. You write Python code. Spark distributes it across a cluster. The result: data processing that scales from gigabytes to petabytes without changing your code — just add more nodes.

For data engineers, PySpark is essential for Silver and Gold layer transformations in data lakes, especially when Data Flows are not flexible enough or when data volumes exceed what SQL pools handle efficiently. Combined with Delta Lake, PySpark becomes the backbone of modern lakehouse architectures.

Start with Pandas for small data. Graduate to PySpark when you outgrow a single machine. The syntax is similar enough that the transition is smooth — the hardest part is understanding lazy evaluation and distributed thinking. Once those click, PySpark becomes second nature.

Related posts:

  • Python for Data Engineers
  • Data Flows in ADF/Synapse
  • Data File Formats (Parquet, Delta)
  • SQL Window Functions
  • Fine-Tuning LLMs


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
