PySpark Foundations: SparkSession, Imports, Configuration, and the Basics Nobody Teaches

Every PySpark tutorial starts with df = spark.read.parquet("path") and assumes you know what spark is, where it came from, and why it exists. But if you are new to PySpark, you are left wondering: “Where did this spark object come from? Do I need to create it? What happens if I do not? And when do I shut it down?”

In Databricks, the SparkSession is pre-created — it just magically exists. But in job interviews, local development, and non-Databricks environments, you NEED to understand how to create, configure, and terminate a Spark session. Without this foundation, everything else feels like magic.

Think of the SparkSession like the ignition key of a car. In Databricks, someone already started the car for you — you just get in and drive. But to truly understand driving (PySpark), you need to know how the ignition works, what happens when you turn the key, and how to turn the engine off.

Table of Contents

  • What Is a SparkSession?
  • SparkSession vs SparkContext: The History
  • The Import Statements You Need
  • Creating a SparkSession (Non-Databricks)
  • SparkSession in Databricks (Pre-Created)
  • SparkSession Builder: Every Configuration Option
  • Checking Your Spark Version and Configuration
  • spark.conf.set vs Builder Config
  • The SparkContext (Under the Hood)
  • Stopping the Session
  • Running PySpark Locally (Without Databricks)
  • Installing PySpark Locally
  • Local Mode vs Cluster Mode
  • The Complete Import Reference
  • Common Import Patterns by Task
  • Environment Comparison: Local vs Databricks vs Synapse
  • Common Mistakes
  • Interview Questions
  • Wrapping Up

What Is a SparkSession?

A SparkSession is the single entry point to all PySpark functionality. Every operation — reading files, creating DataFrames, running SQL, configuring Spark — goes through the SparkSession.

# Everything starts from spark (the SparkSession)
spark.read.parquet(...)        # Read files
spark.sql(...)                 # Run SQL queries
spark.createDataFrame(...)     # Create DataFrames manually
spark.conf.set(...)            # Configure settings
spark.catalog.listDatabases()  # Browse the catalog
spark.stop()                   # Shut down Spark

Without a SparkSession, you cannot do anything in PySpark. It is like trying to browse the internet without opening a browser — you need the browser (SparkSession) first, then you navigate (read, transform, write).

Real-life analogy: The SparkSession is like logging into your computer. Before you log in, the computer exists but you cannot use it. After login (creating the session), you can open files, run programs, and do work. Logging out (stopping the session) frees up resources.

SparkSession vs SparkContext: The History

The Old Way (Spark 1.x — Before 2016)

In Spark 1.x, you had to create multiple entry points:

# Old way — multiple objects needed
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, HiveContext

conf = SparkConf().setAppName("MyApp").setMaster("local[*]")
sc = SparkContext(conf=conf)     # For RDD operations
sqlContext = SQLContext(sc)      # For SQL operations
hiveContext = HiveContext(sc)    # For Hive operations

Three separate objects for different tasks. Confusing and error-prone.

The New Way (Spark 2.0+ — Current)

SparkSession unified everything into ONE object:

# New way — one object for everything
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()
# spark does EVERYTHING: RDDs, DataFrames, SQL, Hive, streaming, catalog

The SparkContext still exists — it is embedded inside the SparkSession. You can access it if needed:

sc = spark.sparkContext  # Access the underlying SparkContext
sc.setLogLevel("ERROR")  # Set log level (reduce noise)

But 99% of the time, you only use spark (the SparkSession).

Real-life analogy: Before Spark 2.0, you needed different apps for different tasks — one for email, one for calendar, one for tasks. SparkSession is like a unified app (Outlook) that handles email, calendar, and tasks in one place.

The Import Statements You Need

The Essential Imports

# 1. SparkSession — always needed (except in Databricks where it is pre-created)
from pyspark.sql import SparkSession

# 2. Functions — needed for EVERY transformation
from pyspark.sql.functions import *
# Or import specific functions (cleaner but more typing):
from pyspark.sql.functions import col, lit, when, sum, avg, count, upper, lower, trim

# 3. Types — needed when defining schemas manually
from pyspark.sql.types import *
# Or import specific types:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType

# 4. Window — needed for window functions
from pyspark.sql.window import Window

What Each Import Provides

Import What You Get When You Need It
pyspark.sql.SparkSession The SparkSession class Creating the session (not needed in Databricks)
pyspark.sql.functions All transformation functions (col, lit, when, sum, upper, trim, etc.) Every notebook that transforms data
pyspark.sql.types Data type classes (StringType, IntegerType, StructType, etc.) Defining manual schemas
pyspark.sql.window.Window Window specification for window functions Using row_number, rank, lag, lead, etc.

In Databricks — What Is Already Available

# These are pre-imported in Databricks notebooks:
# - spark (SparkSession instance)
# - sc (SparkContext instance)
# - sqlContext (SQL context)
# - dbutils (Databricks utilities)
# - display() (Databricks display function)

# You still need to import:
from pyspark.sql.functions import *    # Always import this
from pyspark.sql.types import *         # When defining schemas
from pyspark.sql.window import Window   # When using window functions

Real-life analogy: Imports are like loading tools into your workshop. SparkSession is the power generator (runs everything). functions are your power tools (drill, saw, sander). types are your measurement tools (ruler, level, tape). Window is a specialty tool (used for specific operations). In Databricks, the generator is already running — you just need to grab the tools.

Creating a SparkSession (Non-Databricks)

The Basic Session

from pyspark.sql import SparkSession

# Create a SparkSession
spark = (SparkSession.builder
    .appName("MyDataPipeline")
    .getOrCreate())

print(f"Spark version: {spark.version}")
print("SparkSession created successfully!")

What .builder Does

The builder pattern configures the session step by step:

spark = (SparkSession.builder
    .appName("MyDataPipeline")                        # Name shown in Spark UI
    .master("local[*]")                               # Where to run (local or cluster)
    .config("spark.sql.shuffle.partitions", "200")    # Config settings
    .config("spark.driver.memory", "4g")              # Driver memory
    .config("spark.executor.memory", "8g")            # Executor memory
    .enableHiveSupport()                              # Enable Hive metastore
    .getOrCreate())                                   # Create new or get existing session

What Each Builder Method Does

Method What It Does Example
.appName("name") Sets the application name (visible in Spark UI and logs) "ETL_Daily_Sales"
.master("mode") Where Spark runs: local or cluster "local[*]", "yarn", "k8s://"
.config("key", "value") Sets Spark configuration properties ("spark.driver.memory", "4g")
.enableHiveSupport() Enables Hive metastore for SQL table access Used with Hive/catalog tables
.getOrCreate() Returns existing session or creates a new one Always the last call

Master Mode Options

Master What It Does When to Use
local 1 thread on your machine Quick testing
local[4] 4 threads on your machine Local development with parallelism
local[*] All available cores on your machine Full local power
yarn Run on a Hadoop YARN cluster Production on Hadoop
k8s://host:port Run on Kubernetes Production on K8s
(not set) Databricks/Synapse manages this Cloud-managed environments

In Databricks and Synapse: Do NOT set .master() — the platform manages the cluster. Setting it manually can cause errors.

getOrCreate() vs new Session

# getOrCreate() — reuses existing session if one exists
spark = SparkSession.builder.appName("App1").getOrCreate()  # Creates new
spark2 = SparkSession.builder.appName("App2").getOrCreate() # Returns SAME session (App1)

# A JVM runs a single SparkContext; getOrCreate() returns the session
# bound to it, preventing accidental duplicates

SparkSession in Databricks (Pre-Created)

In Databricks notebooks, the SparkSession is already created and available as the variable spark:

# In Databricks — NO creation needed
# Just use spark directly:
df = spark.read.parquet("/mnt/datalake/customers/")
spark.sql("SELECT * FROM silver.customers").show()
print(f"Spark version: {spark.version}")

Why Databricks Pre-Creates It

When you attach a notebook to a cluster, Databricks:

  1. Starts the Spark driver on the cluster
  2. Creates a SparkSession configured for that cluster
  3. Injects it as the variable spark into your notebook
  4. Also injects sc (SparkContext) and dbutils

You NEVER need to call SparkSession.builder...getOrCreate() in Databricks. If you do, it returns the existing pre-created session — no harm, but unnecessary.

# This works in Databricks but is redundant:
spark = SparkSession.builder.appName("MyApp").getOrCreate()
# Returns the same pre-created session — does not create a new one

What Happens If You Call spark.stop() in Databricks?

spark.stop()  # DON'T DO THIS IN DATABRICKS!
# The session is terminated
# All subsequent cells fail with "SparkSession has been stopped"
# You must restart the cluster to get a new session

Never call spark.stop() in Databricks. The platform manages the session lifecycle. When you detach the notebook or the cluster terminates, the session is cleaned up automatically.

Real-life analogy: In Databricks, calling spark.stop() is like turning off the power generator while everyone is still working. All machines (cells) stop immediately. In a standalone app, spark.stop() is like turning off the generator when you are done for the day — proper shutdown.

SparkSession Builder: Every Configuration Option

Common Configurations

spark = (SparkSession.builder
    .appName("Production_ETL")
    .config("spark.sql.shuffle.partitions", "200")
    .config("spark.driver.memory", "4g")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.parquet.compression.codec", "snappy")
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .config("spark.databricks.delta.optimizeWrite.enabled", "true")
    .getOrCreate())

Configuration Reference

Config What It Controls Default Recommended
spark.sql.shuffle.partitions Number of partitions after shuffle operations 200 200 for medium, 2-4x cores for large
spark.driver.memory Memory for the driver process 1g 4g-8g for development
spark.executor.memory Memory per executor 1g 4g-16g based on data size
spark.executor.cores CPU cores per executor 1 2-4 per executor
spark.sql.adaptive.enabled Adaptive Query Execution true (since Spark 3.2) Always true
spark.sql.parquet.compression.codec Parquet compression snappy snappy (fast) or zstd (smaller)
spark.serializer Object serialization Java Kryo (faster)
spark.sql.sources.partitionOverwriteMode How partition overwrite works static dynamic (safer)

Checking Your Spark Version and Configuration

# Spark version
print(f"Spark version: {spark.version}")

# All configurations
for item in spark.sparkContext.getConf().getAll():
    print(f"{item[0]} = {item[1]}")

# Specific configuration
print(spark.conf.get("spark.sql.shuffle.partitions"))
print(spark.conf.get("spark.driver.memory"))

# Application name
print(spark.sparkContext.appName)

# Master (where Spark is running)
print(spark.sparkContext.master)

# Default parallelism
print(spark.sparkContext.defaultParallelism)

# Active SparkSession
print(SparkSession.getActiveSession())

spark.conf.set vs Builder Config

Both set configurations, but at different times:

Builder Config (At Creation Time)

# Set BEFORE the session is created — some configs ONLY work here
spark = (SparkSession.builder
    .config("spark.driver.memory", "4g")
    .config("spark.executor.memory", "8g")
    .getOrCreate())

spark.conf.set (Runtime)

# Set AFTER the session is created — for runtime-adjustable configs
spark.conf.set("spark.sql.shuffle.partitions", "100")
spark.conf.set("spark.sql.adaptive.enabled", "true")

Which Configs Work Where?

Config Builder Only Runtime (conf.set)
spark.driver.memory ✅ Yes ❌ No (must be set before session starts)
spark.executor.memory ✅ Yes ❌ No
spark.sql.shuffle.partitions ✅ Yes ✅ Yes
spark.sql.adaptive.enabled ✅ Yes ✅ Yes
fs.azure.account.key.* ❌ No ✅ Yes (storage auth)

Rule: Memory and resource configs must be in the builder. SQL and runtime behavior configs can be changed anytime with spark.conf.set().

In Databricks: Memory configs are set on the cluster configuration page, not in code. Use spark.conf.set() for SQL and storage configs.

The SparkContext (Under the Hood)

The SparkContext is the original Spark entry point (pre-Spark 2.0). It manages the connection to the cluster and is responsible for:

  • Communicating with the cluster manager
  • Distributing tasks to executors
  • Managing RDDs (the old data structure)

# Access SparkContext from SparkSession
sc = spark.sparkContext

# Useful SparkContext operations
sc.setLogLevel("ERROR")                    # Reduce log noise (INFO → ERROR)
print(sc.version)                           # Spark version
print(sc.master)                            # Cluster mode
print(sc.defaultParallelism)                # Default partition count
print(sc.applicationId)                     # Unique app ID

# Create RDD (old way — rarely needed)
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.collect())  # [1, 2, 3, 4, 5]

# Read text file as RDD
rdd = sc.textFile("/mnt/datalake/data.txt")
print(rdd.take(5))

When you need SparkContext directly: Almost never. Use the SparkSession for everything. The only common use is sc.setLogLevel("ERROR") to reduce log noise during development.

Stopping the Session

When to Stop

Environment Stop the Session? Why
Local PySpark script ✅ Yes — spark.stop() Frees resources, clean shutdown
Databricks notebook ❌ Never Platform manages lifecycle
Synapse notebook ❌ Never Platform manages lifecycle
Unit tests ✅ Yes — in teardown Prevent resource leaks between tests

How to Stop

# Stop the session (only in standalone apps)
spark.stop()

# After stopping:
# - All executors are terminated
# - All cached data is released
# - All temporary views are dropped
# - The SparkContext is stopped
# - You cannot use spark anymore (must create a new session)

Graceful Shutdown Pattern (Standalone Apps)

from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("DailyETL").getOrCreate()

    try:
        # Your ETL logic here
        df = spark.read.parquet("/data/input/")
        df_clean = df.filter(df.status == "Active")
        df_clean.write.parquet("/data/output/")
        print("ETL completed successfully!")
    except Exception as e:
        print(f"ETL failed: {e}")
        raise
    finally:
        spark.stop()  # Always stop, even if there's an error
        print("Spark session stopped.")

if __name__ == "__main__":
    main()

Running PySpark Locally (Without Databricks)

Installing PySpark

# Install PySpark
pip install pyspark

# Install with specific version
pip install pyspark==3.5.0

# Install with the "sql" extra (adds pandas/pyarrow for Spark SQL and pandas interop)
pip install pyspark[sql]

Prerequisites

  • Java 8, 11, or 17, depending on your Spark version (Spark runs on the JVM)
  • Python 3.8+
# Check Java version
java -version

# If not installed (Mac):
brew install openjdk@11

# If not installed (Ubuntu):
sudo apt install openjdk-11-jdk

Your First Local PySpark Script

# save as etl_local.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, upper, trim, coalesce, current_date

# Create session (local mode — runs on your machine)
spark = (SparkSession.builder
    .appName("LocalETL")
    .master("local[*]")
    .getOrCreate())

# Create sample data
data = [
    (1, "  alice ", "Toronto", 95000),
    (2, "BOB", "Mumbai", 72000),
    (3, "Carol", None, 68000),
]

df = spark.createDataFrame(data, ["id", "name", "city", "salary"])

# Transform
df_clean = (df
    .withColumn("name", upper(trim(col("name"))))
    .withColumn("city", coalesce(col("city"), lit("Unknown")))   # fill missing cities
    .withColumn("processed_date", current_date()))

df_clean.show()
df_clean.printSchema()

# Stop session
spark.stop()
print("Done!")

Run it:

python etl_local.py
# Or with spark-submit (preferred for production):
spark-submit etl_local.py

python vs spark-submit

Method When to Use What It Does
python script.py Quick local testing Runs as a regular Python script
spark-submit script.py Production, cluster deployment Sets up Spark environment properly, supports cluster configs

# spark-submit with configuration
spark-submit \
    --master local[4] \
    --driver-memory 4g \
    --executor-memory 8g \
    --conf spark.sql.shuffle.partitions=100 \
    etl_local.py

Local Mode vs Cluster Mode

Feature Local Mode Cluster Mode (Databricks/YARN)
Where it runs Your laptop/desktop Cluster of machines
Master local[*] Managed by platform
Max data Limited by your RAM Unlimited (add nodes)
Speed One machine Parallel across many machines
Cost Free Cloud compute costs
Use for Learning, unit tests, small data Production, big data

The Complete Import Reference

# === ESSENTIAL (every notebook) ===
from pyspark.sql import SparkSession                     # Session creation
from pyspark.sql.functions import *                       # All transformation functions

# === SCHEMA DEFINITION ===
from pyspark.sql.types import (
    StructType, StructField,                              # Schema structure
    StringType, IntegerType, LongType, DoubleType,        # Numeric/string types
    FloatType, BooleanType, DateType, TimestampType,      # More types
    ArrayType, MapType, DecimalType, BinaryType            # Complex types
)

# === WINDOW FUNCTIONS ===
from pyspark.sql.window import Window                     # Window specifications

# === DELTA LAKE (Databricks) ===
from delta.tables import DeltaTable                       # Delta operations (MERGE, etc.)

# === PANDAS INTEGRATION ===
import pandas as pd                                       # Pandas interop
from pyspark.sql.functions import pandas_udf              # Vectorized UDFs

# === STREAMING ===
from pyspark.sql.streaming import StreamingQuery          # Structured Streaming

# === SPARK CONTEXT (rarely needed) ===
from pyspark import SparkContext, SparkConf                # Low-level access

Common Import Patterns by Task

# Reading/writing files: just SparkSession + functions
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

# Data cleaning: functions for string/date/null handling
from pyspark.sql.functions import (
    col, lit, when, coalesce, trim, upper, lower, initcap,
    to_date, to_timestamp, current_date, year, month,
    regexp_replace, concat, concat_ws, split, length
)

# Aggregations and analytics: agg functions + window
# (note: sum/min/max here shadow Python's built-ins —
#  use "import pyspark.sql.functions as F" if you want to avoid that)
from pyspark.sql.functions import (
    sum, avg, count, min, max, countDistinct,
    row_number, rank, dense_rank, lag, lead, first, last
)
from pyspark.sql.window import Window

# Schema definition: types
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType,
    DoubleType, DateType, TimestampType, BooleanType
)

# Delta Lake operations: DeltaTable
from delta.tables import DeltaTable

Environment Comparison: Local vs Databricks vs Synapse

Feature Local PySpark Databricks Synapse Spark
SparkSession You create it Pre-created as spark Pre-created as spark
spark.stop() Required at end Never call Never call
Master local[*] or spark-submit Managed by cluster Managed by pool
Imports All manual spark, sc, dbutils pre-imported spark pre-imported
Java required Yes (install manually) No (pre-installed) No (pre-installed)
Delta Lake Install separately Native Supported
display() Not available Available Available
dbutils Not available Available Not available (use mssparkutils)
Cost Free Per DBU + VM Per node-hour

Common Mistakes

  1. Calling spark.stop() in Databricks — terminates the session, all cells fail. Restart the cluster to recover.

  2. Setting spark.driver.memory via spark.conf.set() — this does not work at runtime. Memory configs must be set in the builder or cluster config.

  3. Forgetting to import functionscol("salary") fails with NameError: name 'col' is not defined. Add from pyspark.sql.functions import *.

  4. Creating SparkSession in Databricks — not harmful but unnecessary. SparkSession.builder.getOrCreate() returns the existing pre-created session.

  5. Not installing Java for local PySpark — PySpark needs a compatible JVM (Java 8/11, or 17 on newer Spark versions). Without it, spark-submit fails with “Java not found.”

  6. Using import pyspark.sql.functions instead of from pyspark.sql.functions import * — the former requires pyspark.sql.functions.col() every time. The latter gives you col() directly.

  7. Confusing spark with scspark is the SparkSession (use this). sc is the SparkContext (rarely needed). They are different objects.

Interview Questions

Q: What is a SparkSession and why is it needed? A: SparkSession is the unified entry point to all PySpark functionality — reading data, creating DataFrames, running SQL, configuring settings, and accessing the catalog. It was introduced in Spark 2.0 to replace the separate SparkContext, SQLContext, and HiveContext. Every PySpark operation goes through the SparkSession.

Q: What is the difference between SparkSession and SparkContext? A: SparkContext is the original Spark 1.x entry point that manages the cluster connection and RDD operations. SparkSession (Spark 2.0+) wraps SparkContext and adds DataFrame, SQL, and catalog support in a unified API. Use SparkSession for everything — SparkContext is accessible via spark.sparkContext if needed.

Q: What does getOrCreate() do? A: It returns the existing active SparkSession if one exists, or creates a new one if none exists. This prevents accidental duplicates: a JVM runs a single SparkContext, and getOrCreate() hands back the session bound to it.

Q: Why should you never call spark.stop() in Databricks? A: Because Databricks manages the SparkSession lifecycle. Calling spark.stop() terminates the session, and all subsequent notebook cells fail. The platform automatically stops the session when the cluster terminates or the notebook is detached.

Q: What imports do you need for a typical PySpark notebook? A: At minimum: from pyspark.sql.functions import * for transformation functions. Add from pyspark.sql.types import * for schema definitions. Add from pyspark.sql.window import Window for window functions. In non-Databricks environments, also add from pyspark.sql import SparkSession.

Q: What is the difference between spark.conf.set() and builder .config()? A: Builder .config() sets properties before the session is created — required for resource configs like spark.driver.memory. spark.conf.set() changes properties at runtime — works for SQL behavior and storage authentication. Memory configs ONLY work in the builder; they are ignored at runtime.

Wrapping Up

The SparkSession is the foundation of every PySpark program. In Databricks, it is pre-created and ready to use. In standalone environments, you create it with SparkSession.builder...getOrCreate() and stop it with spark.stop() when done.

The key things to remember: – Databricks: spark is already available. Import functions, types, and Window as needed. Never call spark.stop(). – Local/standalone: Create the session with builder. Set master to local[*]. Stop the session in finally block. – Always import: from pyspark.sql.functions import * — you need this in every notebook.

This is the foundation. Everything else — reading files, transforming DataFrames, writing Delta tables — builds on top of this session.

Related posts: Apache Spark and PySpark Architecture · PySpark Transformations Cookbook · Azure Databricks Introduction · Reading/Writing File Formats in Databricks · Python for Data Engineers


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
