Apache Spark and PySpark for Data Engineers: Architecture, Python vs PySpark, and Big Data Processing
Imagine you need to move a mountain of sand from one side of a construction site to the other. You could use one wheelbarrow and one worker — it works, but it takes weeks. Or you could hire 100 workers with 100 wheelbarrows, divide the sand pile into 100 sections, and move it all in a single day.
That is the difference between Python (one worker) and PySpark (100 workers). Python processes data on a single machine. PySpark distributes the work across a CLUSTER of machines, processing data in parallel. Same destination, dramatically different speed.
This post covers everything a data engineer needs to know about Apache Spark and PySpark — the architecture that makes it fast, the differences from regular Python/Pandas, and the code patterns you will use in production.
Table of Contents
- What Is Apache Spark?
- What Is PySpark?
- When Do You Need Spark (And When You Do Not)
- The Architecture: How Spark Actually Works
- Driver and Executors: The Manager and Workers
- SparkSession: The Entry Point
- RDDs, DataFrames, and Datasets
- Lazy Evaluation: The Recipe Book Analogy
- Transformations vs Actions
- Python vs PySpark: The Key Differences
- PySpark DataFrame Operations (With Pandas Comparison)
- Reading and Writing Data
- Filtering, Selecting, and Sorting
- GroupBy and Aggregations
- Joins
- Window Functions
- UDFs (User-Defined Functions)
- Spark SQL: Writing SQL on DataFrames
- Partitioning and Shuffling
- Broadcast Variables and Accumulators
- PySpark in Azure (Synapse, Databricks, HDInsight)
- Performance Tuning
- Common Mistakes
- Interview Questions
- Wrapping Up
What Is Apache Spark?
Apache Spark is a distributed data processing engine that processes large datasets across a cluster of machines in parallel. It was created at UC Berkeley in 2009 and is now the de facto standard for big data processing.
Key characteristics:
- Distributed — spreads work across multiple machines (nodes)
- In-memory — processes data in RAM, not on disk (up to 100x faster than Hadoop MapReduce)
- Unified — handles batch processing, streaming, machine learning, and graph processing
- Multi-language — supports Python (PySpark), Scala, Java, R, and SQL
Real-life analogy: Spark is like a restaurant kitchen with 20 chefs (worker nodes) and 1 head chef (driver). The head chef takes the order (your code), divides it into tasks (chop vegetables, grill meat, prepare sauce), and assigns each task to a different chef. All chefs work simultaneously. The result comes together on the plate much faster than if one chef did everything.
What Is PySpark?
PySpark is the Python API for Apache Spark. It lets you write Spark programs using Python syntax. Under the hood, your Python code is translated into Spark’s execution engine (written in Scala/JVM).
# This is PySpark — looks like Python, runs on a distributed cluster
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg
spark = SparkSession.builder.appName("SalesAnalysis").getOrCreate()
df = spark.read.parquet("s3://datalake/sales/")
result = (
    df.filter(col("year") == 2026)
      .groupBy("region")
      .agg(sum("amount").alias("total_sales"))
)
result.show()
This code looks simple, but it might be processing 500 GB of data across 50 machines simultaneously.
When Do You Need Spark (And When You Do Not)
Use Spark When:
| Scenario | Why Spark |
|---|---|
| Data exceeds a single machine’s RAM (50+ GB) | Distributes across cluster |
| Processing takes hours with Pandas | Parallelism cuts time to minutes |
| Joining two billion-row tables | Distributed joins handle this efficiently |
| Streaming data (Kafka, Event Hubs) | Spark Structured Streaming |
| Data lake transformations (Bronze → Silver → Gold) | Native Parquet/Delta support |
| Machine learning on large datasets | MLlib library |
Do NOT Use Spark When:
| Scenario | Use Instead |
|---|---|
| Data fits in memory (under 10 GB) | Pandas — simpler, faster startup |
| Simple file copy (no transformation) | ADF Copy Activity — cheaper |
| SQL-only transformations | SQL Pool or Athena — simpler |
| Quick one-off analysis | Pandas or SQL — no cluster overhead |
| Real-time single-record processing | Azure Functions or Lambda |
The rule of thumb: If Pandas can handle it, use Pandas. If Pandas crashes with “out of memory” or takes hours, use PySpark.
Real-life analogy: You do not hire 100 construction workers to hang a picture frame. That is a one-person job (Pandas). But when you need to build a house (process 500 GB), one person takes a year. 100 workers finish in a month (PySpark).
The Architecture: How Spark Actually Works
The Cluster
┌─────────────────────────────────────────────────────────┐
│ SPARK CLUSTER │
│ │
│ ┌──────────────┐ │
│ │ DRIVER │ ← Your code runs here │
│ │ (Manager) │ ← Creates the execution plan │
│ │ │ ← Distributes work to executors │
│ └──────┬───────┘ │
│ │ │
│ ┌────┴────┬──────────┬──────────┐ │
│ │ │ │ │ │
│ ┌─┴──┐ ┌──┴──┐ ┌───┴──┐ ┌───┴──┐ │
│ │ E1 │ │ E2 │ │ E3 │ │ E4 │ ← Executors │
│ │ │ │ │ │ │ │ │ ← Do the work │
│ │Core│ │Core │ │Core │ │Core │ │
│ │Core│ │Core │ │Core │ │Core │ │
│ │RAM │ │RAM │ │RAM │ │RAM │ │
│ └────┘ └─────┘ └──────┘ └──────┘ │
│ │
│ Cluster Manager: YARN / Kubernetes / Standalone │
└─────────────────────────────────────────────────────────┘
The Components
| Component | What It Does | Real-Life Equivalent |
|---|---|---|
| Driver | Runs your code, creates execution plan, coordinates executors | Head chef — reads the recipe, assigns tasks |
| Executors | Execute the actual data processing tasks | Line cooks — do the chopping, cooking, plating |
| Cluster Manager | Allocates resources (CPU, memory) to drivers and executors | Restaurant manager — assigns chefs to stations |
| Cores | CPU threads within each executor | Hands of each cook — more hands = more tasks |
| Partitions | Chunks of data distributed across executors | Portions of ingredients divided among cooks |
Driver and Executors: The Manager and Workers
The Driver (Brain)
The Driver is where YOUR code runs. It:
- Parses your PySpark code
- Creates a Directed Acyclic Graph (DAG) of operations
- Optimizes the plan (Catalyst optimizer)
- Splits the work into tasks
- Sends tasks to executors
- Collects results back
The Driver does NOT process data. It manages the process.
Real-life analogy: The Driver is like a movie director. They do not act in the movie (process data). They plan every scene (execution plan), tell actors where to stand (distribute tasks), and assemble the final cut (collect results).
The Executors (Muscle)
Executors are the worker processes that actually process data:
- Receive tasks from the Driver
- Read data from storage (ADLS, S3, HDFS)
- Process the data (filter, join, aggregate)
- Store intermediate results in memory
- Return results to the Driver
Each executor has its own memory and CPU cores. More executors = more parallelism.
How Data Flows
1. You write: df.filter(col("year") == 2026).groupBy("region").sum("amount")
2. Driver creates plan:
Stage 1: Read Parquet files (parallel across executors)
Stage 2: Filter year=2026 (each executor filters its partition)
Stage 3: Shuffle data by region (redistribute data so same regions are together)
Stage 4: Aggregate sum(amount) per region (each executor sums its partition)
3. Executors execute:
Executor 1: Reads files 1-100, filters, partial sum for "North America"
Executor 2: Reads files 101-200, filters, partial sum for "Europe"
Executor 3: Reads files 201-300, filters, partial sum for "Asia"
Executor 4: Reads files 301-400, filters, partial sum for "North America"
4. Shuffle: Regroup by region
Executor 1 gets ALL "North America" data → final sum
Executor 2 gets ALL "Europe" data → final sum
Executor 3 gets ALL "Asia" data → final sum
5. Driver collects: [{North America, 5M}, {Europe, 3M}, {Asia, 2M}]
SparkSession: The Entry Point
Every PySpark program starts with a SparkSession:
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("MyDataPipeline")
    .config("spark.sql.shuffle.partitions", 200)
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)
In Synapse and Databricks, the SparkSession is pre-created. You just use spark directly:
# In Synapse/Databricks notebooks — spark is already available
df = spark.read.parquet("abfss://container@storage.dfs.core.windows.net/data/")
Real-life analogy: SparkSession is like turning on the ignition of a car. You must start the engine before you can drive. All Spark operations go through this session.
RDDs, DataFrames, and Datasets
Spark has evolved through three APIs:
| API | Era | Description | When to Use |
|---|---|---|---|
| RDD | Spark 1.0 (2014) | Low-level, unstructured, type-safe | Almost never (legacy code only) |
| DataFrame | Spark 1.3 (2015) | Table-like, optimized, SQL-compatible | Always in PySpark |
| Dataset | Spark 1.6 (2016) | Type-safe DataFrame (Scala/Java only) | Not available in Python |
Use DataFrames. They are optimized by the Catalyst query optimizer, support SQL syntax, and integrate with Parquet, Delta, JSON, and CSV natively.
# DataFrame — this is what you use
df = spark.read.parquet("data.parquet")
df.filter(col("salary") > 70000).show()
# RDD — old way (avoid unless required)
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
rdd.filter(lambda x: x > 3).collect()
Lazy Evaluation: The Recipe Book Analogy
This is the most important concept in Spark.
When you write PySpark transformations, nothing happens immediately. Spark records your instructions but does not execute them until you request a result.
# These lines do NOT execute yet — Spark just records them
df = spark.read.parquet("sales.parquet") # Lazy — recorded
filtered = df.filter(col("year") == 2026) # Lazy — recorded
grouped = filtered.groupBy("region") # Lazy — recorded
result = grouped.agg(sum("amount")) # Lazy — recorded
# THIS triggers execution — Spark runs everything above
result.show() # ACTION — now Spark reads, filters, groups, and aggregates
Why lazy evaluation? Because Spark can optimize the ENTIRE plan before executing. If you filter after a join, Spark might push the filter BEFORE the join (reducing data before the expensive operation). This is impossible if each step executes immediately.
Real-life analogy: Imagine you are planning a road trip. Lazy evaluation is like writing down ALL the stops on the map FIRST (transformations), then finding the optimal route that visits them all (optimization), and THEN driving (execution). If you drove to each stop immediately without planning, you might backtrack and waste time.
Transformations (Lazy — Recorded)
df.filter(...) # Lazy
df.select(...) # Lazy
df.groupBy(...) # Lazy
df.join(...) # Lazy
df.withColumn(...) # Lazy
df.orderBy(...) # Lazy
Actions (Eager — Triggers Execution)
df.show() # Action — display results
df.count() # Action — count rows
df.collect() # Action — return all rows to driver (dangerous on large data!)
df.write.parquet() # Action — write to storage
df.first() # Action — return first row
df.take(10) # Action — return first 10 rows
Python vs PySpark: The Key Differences
| Feature | Python (Pandas) | PySpark |
|---|---|---|
| Processing | Single machine | Distributed cluster (many machines) |
| Data size | Fits in RAM (up to ~10 GB) | Unlimited (TBs to PBs) |
| Execution | Eager (runs immediately) | Lazy (runs on action) |
| Memory | Limited to machine RAM | Cluster aggregate RAM (100s of GB) |
| API | df['column'] or df.column | col('column') or df['column'] |
| Mutability | DataFrames are mutable | DataFrames are immutable |
| Index | Row index exists | No row index |
| Null handling | NaN | None / null |
| String operations | df['name'].str.upper() | upper(col('name')) |
| Apply function | df.apply(func) | udf(func) or transform |
| Startup time | Instant | 30 sec – 5 min (cluster startup) |
| Best for | Small data, exploration | Big data, production pipelines |
Code Comparison
Read a file:
# Pandas
import pandas as pd
df = pd.read_parquet("sales.parquet")
# PySpark
df = spark.read.parquet("abfss://container@storage/sales/")
Filter rows:
# Pandas
result = df[df["salary"] > 70000]
# PySpark
from pyspark.sql.functions import col
result = df.filter(col("salary") > 70000)
Add a column:
# Pandas
df["bonus"] = df["salary"] * 0.10
# PySpark
from pyspark.sql.functions import col
df = df.withColumn("bonus", col("salary") * 0.10)
Group and aggregate:
# Pandas
result = df.groupby("department")["salary"].mean()
# PySpark
from pyspark.sql.functions import avg
result = df.groupBy("department").agg(avg("salary").alias("avg_salary"))
Join:
# Pandas
merged = pd.merge(employees, departments, on="dept_id", how="inner")
# PySpark
merged = employees.join(departments, employees.dept_id == departments.id, "inner")
Real-life analogy: Pandas is like cooking in your home kitchen — quick, simple, limited by your kitchen size. PySpark is like cooking in a commercial kitchen with 20 stations — more setup time, but can prepare food for 1,000 guests that your home kitchen never could.
Reading and Writing Data
# Read Parquet (most common for data lakes)
df = spark.read.parquet("abfss://container@storage.dfs.core.windows.net/silver/customers/")
# Read CSV
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)
# Read JSON
df = spark.read.json("path/to/data.json")
# Read Delta Lake
df = spark.read.format("delta").load("path/to/delta/table/")
# Read from SQL Database
df = spark.read.format("jdbc") .option("url", "jdbc:sqlserver://server.database.windows.net:1433;database=mydb") .option("dbtable", "SalesLT.Customer") .option("user", "admin") .option("password", "password") .load()
# Write Parquet
df.write.mode("overwrite").parquet("path/to/output/")
# Write Delta
df.write.format("delta").mode("overwrite").save("path/to/delta/output/")
# Write with partitioning
df.write.partitionBy("year", "month").parquet("path/to/partitioned/")
Filtering, Selecting, and Sorting
from pyspark.sql.functions import col, upper, trim, when, lit, concat
# Select columns
df.select("name", "city", "salary")
df.select(col("name"), col("salary") * 12) # calculated column
# Filter rows
df.filter(col("salary") > 70000)
df.filter((col("city") == "Toronto") & (col("status") == "Active"))
df.filter(col("name").like("%Alice%"))
df.filter(col("department").isin("Engineering", "Sales"))
df.filter(col("email").isNotNull())
# Add/modify columns
df.withColumn("full_name", concat(col("first_name"), lit(" "), col("last_name")))
df.withColumn("salary_band", when(col("salary") > 80000, "Senior")
.when(col("salary") > 60000, "Mid")
.otherwise("Junior"))
# Rename columns
df.withColumnRenamed("old_name", "new_name")
# Drop columns
df.drop("temp_column", "debug_column")
# Sort
df.orderBy(col("salary").desc())
df.orderBy("department", col("salary").desc())
# Distinct / Deduplicate
df.distinct()
df.dropDuplicates(["email"])
GroupBy and Aggregations
from pyspark.sql.functions import sum, avg, count, min, max, countDistinct
# Basic aggregation
df.groupBy("department").agg(
count("*").alias("emp_count"),
avg("salary").alias("avg_salary"),
sum("salary").alias("total_salary"),
max("salary").alias("max_salary"),
min("salary").alias("min_salary")
)
# Multiple group by columns
df.groupBy("department", "city").agg(
count("*").alias("emp_count"),
avg("salary").alias("avg_salary")
)
# Count distinct
df.groupBy("department").agg(
countDistinct("city").alias("unique_cities")
)
# Pivot (cross-tab)
df.groupBy("department").pivot("year").agg(sum("salary"))
Joins
# Inner join
result = employees.join(departments, employees.dept_id == departments.id, "inner")
# Left outer join
result = employees.join(departments, employees.dept_id == departments.id, "left")
# Full outer join
result = employees.join(departments, employees.dept_id == departments.id, "full")
# Cross join
result = products.crossJoin(colors)
# Join with same column name (avoid ambiguity)
result = (
    employees.join(departments, employees["dept_id"] == departments["id"], "inner")
             .drop(departments["id"])  # drop duplicate column
)
# Broadcast join (small table optimization)
from pyspark.sql.functions import broadcast
result = fact_table.join(broadcast(dim_table), "key_column")
Window Functions
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, rank, dense_rank, lag, lead, sum
# Define window
window_dept = Window.partitionBy("department").orderBy(col("salary").desc())
# Row number (unique rank)
df.withColumn("row_num", row_number().over(window_dept))
# Rank (with gaps)
df.withColumn("rank", rank().over(window_dept))
# Running total
window_date = (
    Window.partitionBy("department")
          .orderBy("hire_date")
          .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
df.withColumn("running_total", sum("salary").over(window_date))
# Previous value (LAG)
df.withColumn("prev_salary", lag("salary", 1).over(window_dept))
# Top 3 per department
df.withColumn("rn", row_number().over(window_dept)).filter(col("rn") <= 3)
Spark SQL: Writing SQL on DataFrames
You can register DataFrames as temporary views and write standard SQL:
# Register as temp view
df.createOrReplaceTempView("employees")
# Write SQL
result = spark.sql("""
    SELECT department,
           COUNT(*) AS emp_count,
           AVG(salary) AS avg_salary
    FROM employees
    WHERE status = 'Active'
    GROUP BY department
    HAVING COUNT(*) > 5
    ORDER BY avg_salary DESC
""")
result.show()
This is extremely useful for data engineers who are more comfortable with SQL than with the DataFrame API.
Partitioning and Shuffling
Partitions
Data in Spark is divided into partitions — chunks distributed across executors. More partitions = more parallelism (up to a point).
# Check number of partitions
df.rdd.getNumPartitions()
# Repartition (increase partitions — causes shuffle)
df = df.repartition(200)
# Repartition by column (all rows with same key go to same partition)
df = df.repartition("department")
# Coalesce (decrease partitions — no shuffle, more efficient)
df = df.coalesce(10) # reduce to 10 partitions before writing
Shuffling (The Expensive Operation)
A shuffle redistributes data across the cluster. It is the most expensive operation in Spark because data moves over the network between nodes.
Operations that cause shuffles:
- groupBy() + aggregation
- join() (unless broadcast)
- repartition()
- distinct()
- orderBy() (global sort)
Real-life analogy: Shuffling is like reorganizing a library by author instead of by genre. Every book might need to move to a different shelf. If the library has 10 floors (10 nodes), books need to travel between floors (network transfer). Slow and expensive.
How to minimize shuffles:
1. Filter early — reduce data before joins and aggregations
2. Broadcast small tables — avoids shuffle for joins
3. Partition by join key — pre-partition so data is already co-located
4. Use coalesce instead of repartition when reducing partitions
PySpark in Azure
Synapse Spark Pool
# Pre-configured — spark session already available
df = spark.read.parquet("abfss://synapse-workspace@naveensynapsedl.dfs.core.windows.net/silver/")
# Write Delta
df.write.format("delta").mode("overwrite").save("abfss://synapse-workspace@naveensynapsedl.dfs.core.windows.net/gold/customers/")
Databricks
# Also pre-configured — spark session available
df = spark.read.format("delta").load("/mnt/datalake/silver/customers/")
# Databricks-specific: display() instead of show()
display(df)
When to Use Which
| Service | Best For | Cost Model |
|---|---|---|
| Synapse Spark Pool | Integrated with Synapse pipelines and SQL pools | Per node-hour |
| Databricks | Best Spark experience, Delta Lake native, ML | Per DBU |
| HDInsight | Legacy Hadoop workloads, Kafka, HBase | Per VM-hour |
Performance Tuning
1. Right-Size Your Cluster
| Workload | Recommended |
|---|---|
| Development | 3 nodes, Small (4 cores, 32 GB each) |
| Medium ETL (1-10 GB) | 5 nodes, Medium (8 cores, 64 GB each) |
| Large ETL (10-100 GB) | 10-20 nodes, Large (16 cores, 128 GB each) |
| Very large (100+ GB) | 20+ nodes, auto-scale enabled |
2. Use Parquet and Delta (Not CSV)
- CSV: full scan required, no column pruning, no predicate pushdown.
- Parquet: column pruning, predicate pushdown, compression. 10x faster.
- Delta: everything Parquet offers, plus ACID transactions and Z-ORDER.
3. Cache Wisely
# Cache a DataFrame that is used multiple times
df_filtered = df.filter(col("year") == 2026).cache()
# Use it multiple times (reads from memory, not disk)
df_filtered.groupBy("region").count().show()
df_filtered.groupBy("product").sum("amount").show()
# Unpersist when done
df_filtered.unpersist()
4. Avoid collect() on Large Data
# DANGEROUS — brings ALL data to the driver. If data > driver memory → crash
all_data = df.collect()
# SAFE — brings only 10 rows
sample = df.take(10)
# or
df.show(10)
5. Use Adaptive Query Execution (AQE)
spark.conf.set("spark.sql.adaptive.enabled", "true")  # enabled by default since Spark 3.2
AQE automatically optimizes joins, handles skew, and adjusts partition sizes at runtime.
Common Mistakes
- Using Pandas when data is too large — Pandas loads everything into RAM. A 50 GB file on a 16 GB machine = crash. Use PySpark.
- Calling collect() on large data — brings all data to the driver, crashing it. Use show(), take(), or write() instead.
- Not caching reused DataFrames — if you use the same filtered DataFrame 5 times, Spark recomputes it 5 times without caching.
- Too many small files — writing 10,000 small Parquet files. Use coalesce(10) before writing.
- Ignoring partition skew — one partition has 90% of the data. Use salting or repartition by a more even key.
- Using UDFs when built-in functions exist — UDFs are slow (row-by-row Python). Built-in functions run in the JVM (much faster). Always check if pyspark.sql.functions has what you need before writing a UDF.
- Not filtering early — filter BEFORE joins and aggregations to reduce data volume.
Interview Questions
Q: What is Apache Spark and why is it faster than Hadoop MapReduce? A: Spark is a distributed data processing engine that processes data in-memory across a cluster. It is faster than MapReduce because MapReduce writes intermediate results to disk between stages, while Spark keeps data in RAM. This makes Spark up to 100x faster for iterative workloads.
Q: What is the difference between a transformation and an action in Spark? A: Transformations (filter, select, groupBy, join) are lazy — they record the operation but do not execute it. Actions (show, count, collect, write) trigger execution of all recorded transformations. Lazy evaluation allows Spark to optimize the entire execution plan before running.
Q: What is the difference between Python/Pandas and PySpark? A: Pandas processes data on a single machine using RAM — limited to data that fits in memory. PySpark distributes data across a cluster of machines, processing in parallel. Pandas is eager (executes immediately), PySpark is lazy (executes on action). Use Pandas for small data, PySpark for big data.
Q: What is shuffling and why is it expensive? A: Shuffling is the redistribution of data across cluster nodes during operations like groupBy, join, and repartition. It is expensive because data moves over the network between machines. Minimize shuffles by filtering early, broadcasting small tables, and pre-partitioning by join keys.
Q: What is the difference between repartition() and coalesce()? A: repartition() creates a new set of partitions by fully shuffling data — can increase or decrease partitions. coalesce() reduces partitions WITHOUT a full shuffle — only merges existing partitions. Use coalesce() when reducing partitions (e.g., before writing output files).
Q: How would you optimize a slow PySpark job? A: Filter early to reduce data volume, broadcast small tables in joins, use Parquet/Delta instead of CSV, cache reused DataFrames, avoid collect() on large data, use coalesce() before writing to reduce small files, enable AQE, and right-size the cluster.
Wrapping Up
PySpark is the bridge between Python’s simplicity and Spark’s power. You write Python code. Spark distributes it across a cluster. The result: data processing that scales from gigabytes to petabytes without changing your code — just add more nodes.
For data engineers, PySpark is essential for Silver and Gold layer transformations in data lakes, especially when Data Flows are not flexible enough or when data volumes exceed what SQL pools handle efficiently. Combined with Delta Lake, PySpark becomes the backbone of modern lakehouse architectures.
Start with Pandas for small data. Graduate to PySpark when you outgrow a single machine. The syntax is similar enough that the transition is smooth — the hardest part is understanding lazy evaluation and distributed thinking. Once those click, PySpark becomes second nature.
Related posts:
- Python for Data Engineers
- Data Flows in ADF/Synapse
- Data File Formats (Parquet, Delta)
- SQL Window Functions
- Fine-Tuning LLMs
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.