PySpark Foundations: SparkSession, Imports, Configuration, and the Basics Nobody Teaches
Every PySpark tutorial starts with df = spark.read.parquet("path") and assumes you know what spark is, where it came from, and why it exists. But if you are new to PySpark, you are left wondering: “Where did this spark object come from? Do I need to create it? What happens if I do not? And when do I shut it down?”
In Databricks, the SparkSession is pre-created — it just magically exists. But in job interviews, local development, and non-Databricks environments, you NEED to understand how to create, configure, and terminate a Spark session. Without this foundation, everything else feels like magic.
Think of the SparkSession like the ignition key of a car. In Databricks, someone already started the car for you — you just get in and drive. But to truly understand driving (PySpark), you need to know how the ignition works, what happens when you turn the key, and how to turn the engine off.
Table of Contents
- What Is a SparkSession?
- SparkSession vs SparkContext: The History
- The Import Statements You Need
- Creating a SparkSession (Non-Databricks)
- SparkSession in Databricks (Pre-Created)
- SparkSession Builder: Every Configuration Option
- Checking Your Spark Version and Configuration
- spark.conf.set vs Builder Config
- The SparkContext (Under the Hood)
- Stopping the Session
- Running PySpark Locally (Without Databricks)
- Installing PySpark Locally
- Local Mode vs Cluster Mode
- The Complete Import Reference
- Common Import Patterns by Task
- Environment Comparison: Local vs Databricks vs Synapse
- Common Mistakes
- Interview Questions
- Wrapping Up
What Is a SparkSession?
A SparkSession is the single entry point to all PySpark functionality. Every operation — reading files, creating DataFrames, running SQL, configuring Spark — goes through the SparkSession.
# Everything starts from spark (the SparkSession)
spark.read.parquet(...) # Read files
spark.sql(...) # Run SQL queries
spark.createDataFrame(...) # Create DataFrames manually
spark.conf.set(...) # Configure settings
spark.catalog.listDatabases() # Browse the catalog
spark.stop() # Shut down Spark
Without a SparkSession, you cannot do anything in PySpark. It is like trying to browse the internet without opening a browser — you need the browser (SparkSession) first, then you navigate (read, transform, write).
Real-life analogy: The SparkSession is like logging into your computer. Before you log in, the computer exists but you cannot use it. After login (creating the session), you can open files, run programs, and do work. Logging out (stopping the session) frees up resources.
SparkSession vs SparkContext: The History
The Old Way (Spark 1.x — Before 2016)
In Spark 1.x, you had to create multiple entry points:
# Old way — multiple objects needed
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, HiveContext
conf = SparkConf().setAppName("MyApp").setMaster("local[*]")
sc = SparkContext(conf=conf) # For RDD operations
sqlContext = SQLContext(sc) # For SQL operations
hiveContext = HiveContext(sc) # For Hive operations
Three separate objects for different tasks. Confusing and error-prone.
The New Way (Spark 2.0+ — Current)
SparkSession unified everything into ONE object:
# New way — one object for everything
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()
# spark does EVERYTHING: RDDs, DataFrames, SQL, Hive, streaming, catalog
The SparkContext still exists — it is embedded inside the SparkSession. You can access it if needed:
sc = spark.sparkContext # Access the underlying SparkContext
sc.setLogLevel("ERROR") # Set log level (reduce noise)
But 99% of the time, you only use spark (the SparkSession).
Real-life analogy: Before Spark 2.0, you needed different apps for different tasks — one for email, one for calendar, one for tasks. SparkSession is like a unified app (Outlook) that handles email, calendar, and tasks in one place.
The Import Statements You Need
The Essential Imports
# 1. SparkSession — always needed (except in Databricks where it is pre-created)
from pyspark.sql import SparkSession
# 2. Functions — needed for EVERY transformation
from pyspark.sql.functions import *
# Or import specific functions (cleaner but more typing):
from pyspark.sql.functions import col, lit, when, sum, avg, count, upper, lower, trim
# 3. Types — needed when defining schemas manually
from pyspark.sql.types import *
# Or import specific types:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType
# 4. Window — needed for window functions
from pyspark.sql.window import Window
What Each Import Provides
| Import | What You Get | When You Need It |
|---|---|---|
| pyspark.sql.SparkSession | The SparkSession class | Creating the session (not needed in Databricks) |
| pyspark.sql.functions | All transformation functions (col, lit, when, sum, upper, trim, etc.) | Every notebook that transforms data |
| pyspark.sql.types | Data type classes (StringType, IntegerType, StructType, etc.) | Defining manual schemas |
| pyspark.sql.window.Window | Window specification for window functions | Using row_number, rank, lag, lead, etc. |
In Databricks — What Is Already Available
# These are pre-imported in Databricks notebooks:
# - spark (SparkSession instance)
# - sc (SparkContext instance)
# - sqlContext (SQL context)
# - dbutils (Databricks utilities)
# - display() (Databricks display function)
# You still need to import:
from pyspark.sql.functions import * # Always import this
from pyspark.sql.types import * # When defining schemas
from pyspark.sql.window import Window # When using window functions
Real-life analogy: Imports are like loading tools into your workshop. SparkSession is the power generator (runs everything). functions are your power tools (drill, saw, sander). types are your measurement tools (ruler, level, tape). Window is a specialty tool (used for specific operations). In Databricks, the generator is already running — you just need to grab the tools.
Creating a SparkSession (Non-Databricks)
The Basic Session
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder .appName("MyDataPipeline") .getOrCreate()
print(f"Spark version: {spark.version}")
print("SparkSession created successfully!")
What .builder Does
The builder pattern configures the session step by step:
spark = SparkSession.builder .appName("MyDataPipeline") # Name shown in Spark UI
.master("local[*]") # Where to run (local or cluster)
.config("spark.sql.shuffle.partitions", "200") # Config settings
.config("spark.driver.memory", "4g") # Driver memory
.config("spark.executor.memory", "8g") # Executor memory
.enableHiveSupport() # Enable Hive metastore
.getOrCreate() # Create new or get existing session
What Each Builder Method Does
| Method | What It Does | Example |
|---|---|---|
| .appName("name") | Sets the application name (visible in Spark UI and logs) | "ETL_Daily_Sales" |
| .master("mode") | Where Spark runs: local or cluster | "local[*]", "yarn", "k8s://" |
| .config("key", "value") | Sets Spark configuration properties | ("spark.driver.memory", "4g") |
| .enableHiveSupport() | Enables Hive metastore for SQL table access | Used with Hive/catalog tables |
| .getOrCreate() | Returns existing session or creates a new one | Always the last call |
Master Mode Options
| Master | What It Does | When to Use |
|---|---|---|
| local | 1 thread on your machine | Quick testing |
| local[4] | 4 threads on your machine | Local development with parallelism |
| local[*] | All available cores on your machine | Full local power |
| yarn | Run on a Hadoop YARN cluster | Production on Hadoop |
| k8s://host:port | Run on Kubernetes | Production on K8s |
| (not set) | Databricks/Synapse manages this | Cloud-managed environments |
In Databricks and Synapse: Do NOT set .master() — the platform manages the cluster. Setting it manually can cause errors.
getOrCreate() vs new Session
# getOrCreate() — reuses existing session if one exists
spark = SparkSession.builder.appName("App1").getOrCreate() # Creates new
spark2 = SparkSession.builder.appName("App2").getOrCreate() # Returns SAME session (App1)
# Only one SparkContext can exist per JVM, and getOrCreate() returns
# the existing default session instead of creating a duplicate
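A quick way to confirm the reuse is to compare the two references directly (a minimal sketch for a standalone environment; in Databricks the same rule applies to the pre-created session):
# Both names point to the SAME session object; no second session was created
print(spark is spark2)              # True
print(spark.sparkContext.appName)   # App1 (the first session's settings are kept)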
SparkSession in Databricks (Pre-Created)
In Databricks notebooks, the SparkSession is already created and available as the variable spark:
# In Databricks — NO creation needed
# Just use spark directly:
df = spark.read.parquet("/mnt/datalake/customers/")
spark.sql("SELECT * FROM silver.customers").show()
print(f"Spark version: {spark.version}")
Why Databricks Pre-Creates It
When you attach a notebook to a cluster, Databricks:
1. Starts the Spark driver on the cluster
2. Creates a SparkSession configured for that cluster
3. Injects it as the variable spark into your notebook
4. Also injects sc (SparkContext) and dbutils
You NEVER need to call SparkSession.builder...getOrCreate() in Databricks. If you do, it returns the existing pre-created session — no harm, but unnecessary.
# This works in Databricks but is redundant:
spark = SparkSession.builder.appName("MyApp").getOrCreate()
# Returns the same pre-created session — does not create a new one
What Happens If You Call spark.stop() in Databricks?
spark.stop() # DON'T DO THIS IN DATABRICKS!
# The session is terminated
# All subsequent cells fail with "SparkSession has been stopped"
# You must restart the cluster to get a new session
Never call spark.stop() in Databricks. The platform manages the session lifecycle. When you detach the notebook or the cluster terminates, the session is cleaned up automatically.
Real-life analogy: In Databricks, calling spark.stop() is like turning off the power generator while everyone is still working. All machines (cells) stop immediately. In a standalone app, spark.stop() is like turning off the generator when you are done for the day — proper shutdown.
SparkSession Builder: Every Configuration Option
Common Configurations
spark = SparkSession.builder .appName("Production_ETL") .config("spark.sql.shuffle.partitions", "200") .config("spark.driver.memory", "4g") .config("spark.executor.memory", "8g") .config("spark.executor.cores", "4") .config("spark.sql.adaptive.enabled", "true") .config("spark.sql.adaptive.coalescePartitions.enabled", "true") .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .config("spark.sql.parquet.compression.codec", "snappy") .config("spark.sql.sources.partitionOverwriteMode", "dynamic") .config("spark.databricks.delta.optimizeWrite.enabled", "true") .getOrCreate()
Configuration Reference
| Config | What It Controls | Default | Recommended |
|---|---|---|---|
| spark.sql.shuffle.partitions | Number of partitions after shuffle operations | 200 | 200 for medium, 2-4x cores for large |
| spark.driver.memory | Memory for the driver process | 1g | 4g-8g for development |
| spark.executor.memory | Memory per executor | 1g | 4g-16g based on data size |
| spark.executor.cores | CPU cores per executor | 1 | 2-4 per executor |
| spark.sql.adaptive.enabled | Adaptive Query Execution | true (Spark 3.2+) | Always true |
| spark.sql.parquet.compression.codec | Parquet compression | snappy | snappy (fast) or zstd (smaller) |
| spark.serializer | Object serialization | Java | Kryo (faster) |
| spark.sql.sources.partitionOverwriteMode | How partition overwrite works | static | dynamic (safer) |
Checking Your Spark Version and Configuration
# Spark version
print(f"Spark version: {spark.version}")
# All configurations
for item in spark.sparkContext.getConf().getAll():
    print(f"{item[0]} = {item[1]}")
# Specific configuration
print(spark.conf.get("spark.sql.shuffle.partitions"))
print(spark.conf.get("spark.driver.memory"))
# Application name
print(spark.sparkContext.appName)
# Master (where Spark is running)
print(spark.sparkContext.master)
# Default parallelism
print(spark.sparkContext.defaultParallelism)
# Active SparkSession
print(SparkSession.getActiveSession())
spark.conf.set vs Builder Config
Both set configurations, but at different times:
Builder Config (At Creation Time)
# Set BEFORE the session is created — some configs ONLY work here
spark = SparkSession.builder .config("spark.driver.memory", "4g") .config("spark.executor.memory", "8g") .getOrCreate()
spark.conf.set (Runtime)
# Set AFTER the session is created — for runtime-adjustable configs
spark.conf.set("spark.sql.shuffle.partitions", "100")
spark.conf.set("spark.sql.adaptive.enabled", "true")
Which Configs Work Where?
| Config | Builder Only | Runtime (conf.set) |
|---|---|---|
| spark.driver.memory | ✅ Yes | ❌ No (must be set before session starts) |
| spark.executor.memory | ✅ Yes | ❌ No |
| spark.sql.shuffle.partitions | ✅ Yes | ✅ Yes |
| spark.sql.adaptive.enabled | ✅ Yes | ✅ Yes |
| fs.azure.account.key.* | ❌ No | ✅ Yes (storage auth) |
Rule: Memory and resource configs must be in the builder. SQL and runtime behavior configs can be changed anytime with spark.conf.set().
In Databricks: Memory configs are set on the cluster configuration page, not in code. Use spark.conf.set() for SQL and storage configs.
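For example, storage authentication is a typical runtime config in Databricks. A short sketch, where the storage account name and secret scope/key are hypothetical placeholders:
# Runtime config: ADLS Gen2 account-key auth (placeholder account and secret names)
storage_account = "mydatalake"                                            # hypothetical
account_key = dbutils.secrets.get(scope="kv-scope", key="datalake-key")   # hypothetical
spark.conf.set(f"fs.azure.account.key.{storage_account}.dfs.core.windows.net", account_key)
# SQL behavior can also be adjusted on the fly
spark.conf.set("spark.sql.shuffle.partitions", "64")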
The SparkContext (Under the Hood)
The SparkContext is the original Spark entry point (pre-Spark 2.0). It manages the connection to the cluster and is responsible for:
- Communicating with the cluster manager
- Distributing tasks to executors
- Managing RDDs (the old data structure)
# Access SparkContext from SparkSession
sc = spark.sparkContext
# Useful SparkContext operations
sc.setLogLevel("ERROR") # Reduce log noise (INFO → ERROR)
print(sc.version) # Spark version
print(sc.master) # Cluster mode
print(sc.defaultParallelism) # Default partition count
print(sc.applicationId) # Unique app ID
# Create RDD (old way — rarely needed)
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.collect()) # [1, 2, 3, 4, 5]
# Read text file as RDD
rdd = sc.textFile("/mnt/datalake/data.txt")
print(rdd.take(5))
When you need SparkContext directly: Almost never. Use the SparkSession for everything. The only common use is sc.setLogLevel("ERROR") to reduce log noise during development.
Stopping the Session
When to Stop
| Environment | Stop the Session? | Why |
|---|---|---|
| Local PySpark script | ✅ Yes — spark.stop() | Frees resources, clean shutdown |
| Databricks notebook | ❌ Never | Platform manages lifecycle |
| Synapse notebook | ❌ Never | Platform manages lifecycle |
| Unit tests | ✅ Yes — in teardown | Prevent resource leaks between tests |
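For the unit-test row above, a common approach is a session-scoped pytest fixture that creates the session once and stops it in teardown. A minimal sketch, assuming pytest is available:
# conftest.py (sketch)
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    session = (
        SparkSession.builder
        .appName("unit-tests")
        .master("local[2]")
        .config("spark.sql.shuffle.partitions", "4")  # small value keeps tests fast
        .getOrCreate()
    )
    yield session
    session.stop()  # teardown: clean shutdown after the whole test run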
How to Stop
# Stop the session (only in standalone apps)
spark.stop()
# After stopping:
# - All executors are terminated
# - All cached data is released
# - All temporary views are dropped
# - The SparkContext is stopped
# - You cannot use spark anymore (must create a new session)
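If the same process needs Spark again later, you simply build a fresh session; getOrCreate() creates a new one because the old session is no longer active. A short sketch:
# After spark.stop(), the old reference is unusable; build a new session
spark = SparkSession.builder.appName("SecondRun").getOrCreate()
print(spark.version)  # works again, backed by a brand-new SparkContext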
Graceful Shutdown Pattern (Standalone Apps)
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("DailyETL").getOrCreate()
    try:
        # Your ETL logic here
        df = spark.read.parquet("/data/input/")
        df_clean = df.filter(df.status == "Active")
        df_clean.write.parquet("/data/output/")
        print("ETL completed successfully!")
    except Exception as e:
        print(f"ETL failed: {e}")
        raise
    finally:
        spark.stop()  # Always stop, even if there's an error
        print("Spark session stopped.")

if __name__ == "__main__":
    main()
Running PySpark Locally (Without Databricks)
Installing PySpark
# Install PySpark
pip install pyspark
# Install with specific version
pip install pyspark==3.5.0
# Install with the SQL extras (pandas + PyArrow, used by toPandas and pandas UDFs)
pip install pyspark[sql]
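To confirm the install worked, a quick check from the command line (the version printed is whatever you installed):
# Verify the installation
python -c "import pyspark; print(pyspark.__version__)"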
Prerequisites
- Java 8 or 11 (Spark runs on the JVM)
- Python 3.8+
# Check Java version
java -version
# If not installed (Mac):
brew install openjdk@11
# If not installed (Ubuntu):
sudo apt install openjdk-11-jdk
Your First Local PySpark Script
# save as etl_local.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, coalesce, upper, trim, current_date
# Create session (local mode — runs on your machine)
spark = SparkSession.builder .appName("LocalETL") .master("local[*]") .getOrCreate()
# Create sample data
data = [
(1, " alice ", "Toronto", 95000),
(2, "BOB", "Mumbai", 72000),
(3, "Carol", None, 68000),
]
df = spark.createDataFrame(data, ["id", "name", "city", "salary"])
# Transform
df_clean = (
    df
    .withColumn("name", upper(trim(col("name"))))                # Trim whitespace, uppercase
    .withColumn("city", coalesce(col("city"), lit("Unknown")))   # Replace null city values
    .withColumn("processed_date", current_date())                # Add audit column
)
df_clean.show()
df_clean.printSchema()
# Stop session
spark.stop()
print("Done!")
Run it:
python etl_local.py
# Or with spark-submit (preferred for production):
spark-submit etl_local.py
python vs spark-submit
| Method | When to Use | What It Does |
|---|---|---|
| python script.py | Quick local testing | Runs as a regular Python script |
| spark-submit script.py | Production, cluster deployment | Sets up the Spark environment properly, supports cluster configs |
# spark-submit with configuration
spark-submit \
  --master local[4] \
  --driver-memory 4g \
  --executor-memory 8g \
  --conf spark.sql.shuffle.partitions=100 \
  etl_local.py
Local Mode vs Cluster Mode
| Feature | Local Mode | Cluster Mode (Databricks/YARN) |
|---|---|---|
| Where it runs | Your laptop/desktop | Cluster of machines |
| Master | local[*] | Managed by platform |
| Max data | Limited by your RAM | Unlimited (add nodes) |
| Speed | One machine | Parallel across many machines |
| Cost | Free | Cloud compute costs |
| Use for | Learning, unit tests, small data | Production, big data |
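For the local side of this table, it usually helps to shrink the shuffle partition count so small jobs do not fan out into 200 tiny tasks. A typical local-development session might look like this (values are illustrative):
from pyspark.sql import SparkSession

# Local development session: modest resources, small shuffle partition count
spark = (
    SparkSession.builder
    .appName("local-dev")
    .master("local[*]")
    .config("spark.driver.memory", "4g")
    .config("spark.sql.shuffle.partitions", "8")  # default 200 is overkill on a laptop
    .getOrCreate()
)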
The Complete Import Reference
# === ESSENTIAL (every notebook) ===
from pyspark.sql import SparkSession # Session creation
from pyspark.sql.functions import * # All transformation functions
# === SCHEMA DEFINITION ===
from pyspark.sql.types import (
StructType, StructField, # Schema structure
StringType, IntegerType, LongType, DoubleType, # Numeric/string types
FloatType, BooleanType, DateType, TimestampType, # More types
ArrayType, MapType, DecimalType, BinaryType # Complex types
)
# === WINDOW FUNCTIONS ===
from pyspark.sql.window import Window # Window specifications
# === DELTA LAKE (Databricks) ===
from delta.tables import DeltaTable # Delta operations (MERGE, etc.)
# === PANDAS INTEGRATION ===
import pandas as pd # Pandas interop
from pyspark.sql.functions import pandas_udf # Vectorized UDFs
# === STREAMING ===
from pyspark.sql.streaming import StreamingQuery # Structured Streaming
# === SPARK CONTEXT (rarely needed) ===
from pyspark import SparkContext, SparkConf # Low-level access
Common Import Patterns by Task
# Reading/writing files: just SparkSession + functions
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
# Data cleaning: functions for string/date/null handling
from pyspark.sql.functions import (
col, lit, when, coalesce, trim, upper, lower, initcap,
to_date, to_timestamp, current_date, year, month,
regexp_replace, concat, concat_ws, split, length
)
# Aggregations and analytics: agg functions + window
from pyspark.sql.functions import (
sum, avg, count, min, max, countDistinct,
row_number, rank, dense_rank, lag, lead, first, last
)
from pyspark.sql.window import Window
# Schema definition: types
from pyspark.sql.types import (
StructType, StructField, StringType, IntegerType,
DoubleType, DateType, TimestampType, BooleanType
)
# Delta Lake operations: DeltaTable
from delta.tables import DeltaTable
Environment Comparison: Local vs Databricks vs Synapse
| Feature | Local PySpark | Databricks | Synapse Spark |
|---|---|---|---|
| SparkSession | You create it | Pre-created as spark | Pre-created as spark |
| spark.stop() | Required at end | Never call | Never call |
| Master | local[*] or spark-submit | Managed by cluster | Managed by pool |
| Imports | All manual | spark, sc, dbutils pre-imported | spark pre-imported |
| Java required | Yes (install manually) | No (pre-installed) | No (pre-installed) |
| Delta Lake | Install separately | Native | Supported |
| display() | Not available | Available | Available |
| dbutils | Not available | Available | Not available (use mssparkutils) |
| Cost | Free | Per DBU + VM | Per node-hour |
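One practical consequence of this table: code that must run both locally and on a managed platform should not assume the session already exists. A portable bootstrap sketch using SparkSession.getActiveSession() (available in PySpark 3.0+):
from pyspark.sql import SparkSession

# Reuse the platform-provided session if one exists (Databricks/Synapse),
# otherwise create a local one for plain Python environments
spark = SparkSession.getActiveSession()
if spark is None:
    spark = (
        SparkSession.builder
        .appName("portable-job")
        .master("local[*]")
        .getOrCreate()
    )
print(spark.sparkContext.master)  # local[*] locally, platform-managed otherwise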
Common Mistakes
- Calling spark.stop() in Databricks — terminates the session, all cells fail. Restart the cluster to recover.
- Setting spark.driver.memory via spark.conf.set() — this does not work at runtime. Memory configs must be set in the builder or cluster config.
- Forgetting to import functions — col("salary") fails with NameError: name 'col' is not defined. Add from pyspark.sql.functions import *.
- Creating a SparkSession in Databricks — not harmful but unnecessary. SparkSession.builder.getOrCreate() returns the existing pre-created session.
- Not installing Java for local PySpark — PySpark requires Java 8 or 11. Without it, spark-submit fails with “Java not found.”
- Using import pyspark.sql.functions instead of from pyspark.sql.functions import * — the former requires pyspark.sql.functions.col() every time. The latter gives you col() directly.
- Confusing spark with sc — spark is the SparkSession (use this). sc is the SparkContext (rarely needed). They are different objects.
Interview Questions
Q: What is a SparkSession and why is it needed?
A: SparkSession is the unified entry point to all PySpark functionality — reading data, creating DataFrames, running SQL, configuring settings, and accessing the catalog. It was introduced in Spark 2.0 to unify the separate SQLContext and HiveContext and to wrap the SparkContext. Every PySpark operation goes through the SparkSession.
Q: What is the difference between SparkSession and SparkContext?
A: SparkContext is the original Spark 1.x entry point that manages the cluster connection and RDD operations. SparkSession (Spark 2.0+) wraps SparkContext and adds DataFrame, SQL, and catalog support in a unified API. Use SparkSession for everything — SparkContext is accessible via spark.sparkContext if needed.
Q: What does getOrCreate() do?
A: It returns the existing active SparkSession if one exists, or creates a new one if none exists. This prevents accidentally spinning up duplicate sessions: a JVM holds only one SparkContext, and getOrCreate() reuses the default session bound to it.
Q: Why should you never call spark.stop() in Databricks?
A: Because Databricks manages the SparkSession lifecycle. Calling spark.stop() terminates the session, and all subsequent notebook cells fail. The platform automatically stops the session when the cluster terminates or the notebook is detached.
Q: What imports do you need for a typical PySpark notebook?
A: At minimum: from pyspark.sql.functions import * for transformation functions. Add from pyspark.sql.types import * for schema definitions. Add from pyspark.sql.window import Window for window functions. In non-Databricks environments, also add from pyspark.sql import SparkSession.
Q: What is the difference between spark.conf.set() and builder.config()?
A: .config() on the builder sets properties before the session is created — required for resource configs like spark.driver.memory. spark.conf.set() changes properties at runtime — works for SQL behavior and storage authentication. Memory configs ONLY work in the builder; they are ignored at runtime.
Wrapping Up
The SparkSession is the foundation of every PySpark program. In Databricks, it is pre-created and ready to use. In standalone environments, you create it with SparkSession.builder...getOrCreate() and stop it with spark.stop() when done.
The key things to remember:
– Databricks: spark is already available. Import functions, types, and Window as needed. Never call spark.stop().
– Local/standalone: Create the session with the builder. Set the master to local[*]. Stop the session in a finally block.
– Always import: from pyspark.sql.functions import * — you need this in every notebook.
This is the foundation. Everything else — reading files, transforming DataFrames, writing Delta tables — builds on top of this session.
Related posts: – Apache Spark and PySpark Architecture – PySpark Transformations Cookbook – Azure Databricks Introduction – Reading/Writing File Formats in Databricks – Python for Data Engineers
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.