PySpark Foundations: SparkSession, Imports, Configuration, and the Basics Nobody Teaches

Every PySpark tutorial starts with df = spark.read.parquet("path") and assumes you know what spark is, where it came from, and why it exists. But if you are new to PySpark, you are left wondering: “Where did this spark object come from? Do I need to create it? What happens if I do not? And when do I shut it down?”

In Databricks, the SparkSession is pre-created — it just magically exists. But in job interviews, local development, and non-Databricks environments, you NEED to understand how to create, configure, and terminate a Spark session. Without this foundation, everything else feels like magic.

Think of the SparkSession like the ignition key of a car. In Databricks, someone already started the car for you — you just get in and drive. But to truly understand driving (PySpark), you need to know how the ignition works, what happens when you turn the key, and how to turn the engine off.

Table of Contents

  • What Is a SparkSession?
  • SparkSession vs SparkContext: The History
  • The Import Statements You Need
  • Creating a SparkSession (Non-Databricks)
  • SparkSession in Databricks (Pre-Created)
  • SparkSession Builder: Every Configuration Option
  • Checking Your Spark Version and Configuration
  • spark.conf.set vs Builder Config
  • The SparkContext (Under the Hood)
  • Stopping the Session
  • Running PySpark Locally (Without Databricks)
  • Installing PySpark Locally
  • Local Mode vs Cluster Mode
  • The Complete Import Reference
  • Common Import Patterns by Task
  • Environment Comparison: Local vs Databricks vs Synapse
  • Common Mistakes
  • Interview Questions
  • Wrapping Up

What Is a SparkSession?

A SparkSession is the single entry point to all PySpark functionality. Every operation — reading files, creating DataFrames, running SQL, configuring Spark — goes through the SparkSession.

# Everything starts from spark (the SparkSession)
spark.read.parquet(...)        # Read files
spark.sql(...)                 # Run SQL queries
spark.createDataFrame(...)     # Create DataFrames manually
spark.conf.set(...)            # Configure settings
spark.catalog.listDatabases()  # Browse the catalog
spark.stop()                   # Shut down Spark

Without a SparkSession, you cannot do anything in PySpark. It is like trying to browse the internet without opening a browser — you need the browser (SparkSession) first, then you navigate (read, transform, write).

Real-life analogy: The SparkSession is like logging into your computer. Before you log in, the computer exists but you cannot use it. After login (creating the session), you can open files, run programs, and do work. Logging out (stopping the session) frees up resources.

SparkSession vs SparkContext: The History

The Old Way (Spark 1.x — Before 2016)

In Spark 1.x, you had to create multiple entry points:

# Old way — multiple objects needed
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, HiveContext

conf = SparkConf().setAppName("MyApp").setMaster("local[*]")
sc = SparkContext(conf=conf)     # For RDD operations
sqlContext = SQLContext(sc)      # For SQL operations
hiveContext = HiveContext(sc)    # For Hive operations

Three separate objects for different tasks. Confusing and error-prone.

The New Way (Spark 2.0+ — Current)

SparkSession unified everything into ONE object:

# New way — one object for everything
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()
# spark does EVERYTHING: RDDs, DataFrames, SQL, Hive, streaming, catalog

The SparkContext still exists — it is embedded inside the SparkSession. You can access it if needed:

sc = spark.sparkContext  # Access the underlying SparkContext
sc.setLogLevel("ERROR")  # Set log level (reduce noise)

But 99% of the time, you only use spark (the SparkSession).

Real-life analogy: Before Spark 2.0, you needed different apps for different tasks — one for email, one for calendar, one for tasks. SparkSession is like a unified app (Outlook) that handles email, calendar, and tasks in one place.

The Import Statements You Need

The Essential Imports

# 1. SparkSession — always needed (except in Databricks where it is pre-created)
from pyspark.sql import SparkSession

# 2. Functions — needed for EVERY transformation
from pyspark.sql.functions import *
# Or import specific functions (cleaner but more typing):
from pyspark.sql.functions import col, lit, when, sum, avg, count, upper, lower, trim

# 3. Types — needed when defining schemas manually
from pyspark.sql.types import *
# Or import specific types:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType

# 4. Window — needed for window functions
from pyspark.sql.window import Window

What Each Import Provides

Import What You Get When You Need It
pyspark.sql.SparkSession The SparkSession class Creating the session (not needed in Databricks)
pyspark.sql.functions All transformation functions (col, lit, when, sum, upper, trim, etc.) Every notebook that transforms data
pyspark.sql.types Data type classes (StringType, IntegerType, StructType, etc.) Defining manual schemas
pyspark.sql.window.Window Window specification for window functions Using row_number, rank, lag, lead, etc.

In Databricks — What Is Already Available

# These are pre-imported in Databricks notebooks:
# - spark (SparkSession instance)
# - sc (SparkContext instance)
# - sqlContext (SQL context)
# - dbutils (Databricks utilities)
# - display() (Databricks display function)

# You still need to import:
from pyspark.sql.functions import *    # Always import this
from pyspark.sql.types import *         # When defining schemas
from pyspark.sql.window import Window   # When using window functions

Real-life analogy: Imports are like loading tools into your workshop. SparkSession is the power generator (runs everything). functions are your power tools (drill, saw, sander). types are your measurement tools (ruler, level, tape). Window is a specialty tool (used for specific operations). In Databricks, the generator is already running — you just need to grab the tools.

Creating a SparkSession (Non-Databricks)

The Basic Session

from pyspark.sql import SparkSession

# Create a SparkSession
spark = (SparkSession.builder
    .appName("MyDataPipeline")
    .getOrCreate())

print(f"Spark version: {spark.version}")
print("SparkSession created successfully!")

What .builder Does

The builder pattern configures the session step by step:

spark = (SparkSession.builder
    .appName("MyDataPipeline")                        # Name shown in Spark UI
    .master("local[*]")                               # Where to run (local or cluster)
    .config("spark.sql.shuffle.partitions", "200")    # Config settings
    .config("spark.driver.memory", "4g")              # Driver memory
    .config("spark.executor.memory", "8g")            # Executor memory
    .enableHiveSupport()                              # Enable Hive metastore
    .getOrCreate())                                   # Create new or get existing session

What Each Builder Method Does

Method What It Does Example
.appName("name") Sets the application name (visible in Spark UI and logs) "ETL_Daily_Sales"
.master("mode") Where Spark runs: local or cluster "local[*]", "yarn", "k8s://"
.config("key", "value") Sets Spark configuration properties ("spark.driver.memory", "4g")
.enableHiveSupport() Enables Hive metastore for SQL table access Used with Hive/catalog tables
.getOrCreate() Returns existing session or creates a new one Always the last call

Master Mode Options

Master What It Does When to Use
local 1 thread on your machine Quick testing
local[4] 4 threads on your machine Local development with parallelism
local[*] All available cores on your machine Full local power
yarn Run on a Hadoop YARN cluster Production on Hadoop
k8s://host:port Run on Kubernetes Production on K8s
(not set) Databricks/Synapse manages this Cloud-managed environments

In Databricks and Synapse: Do NOT set .master() — the platform manages the cluster. Setting it manually can cause errors.

getOrCreate() vs new Session

# getOrCreate() — reuses existing session if one exists
spark = SparkSession.builder.appName("App1").getOrCreate()  # Creates new
spark2 = SparkSession.builder.appName("App2").getOrCreate() # Returns SAME session (App1)

# A JVM runs a single SparkContext; getOrCreate() returns the session
# bound to it, preventing accidental duplicates

SparkSession in Databricks (Pre-Created)

In Databricks notebooks, the SparkSession is already created and available as the variable spark:

# In Databricks — NO creation needed
# Just use spark directly:
df = spark.read.parquet("/mnt/datalake/customers/")
spark.sql("SELECT * FROM silver.customers").show()
print(f"Spark version: {spark.version}")

Why Databricks Pre-Creates It

When you attach a notebook to a cluster, Databricks:

  1. Starts the Spark driver on the cluster
  2. Creates a SparkSession configured for that cluster
  3. Injects it as the variable spark into your notebook
  4. Also injects sc (SparkContext) and dbutils

You NEVER need to call SparkSession.builder...getOrCreate() in Databricks. If you do, it returns the existing pre-created session — no harm, but unnecessary.

# This works in Databricks but is redundant:
spark = SparkSession.builder.appName("MyApp").getOrCreate()
# Returns the same pre-created session — does not create a new one

What Happens If You Call spark.stop() in Databricks?

spark.stop()  # DON'T DO THIS IN DATABRICKS!
# The session is terminated
# All subsequent cells fail with "SparkSession has been stopped"
# You must restart the cluster to get a new session

Never call spark.stop() in Databricks. The platform manages the session lifecycle. When you detach the notebook or the cluster terminates, the session is cleaned up automatically.

Real-life analogy: In Databricks, calling spark.stop() is like turning off the power generator while everyone is still working. All machines (cells) stop immediately. In a standalone app, spark.stop() is like turning off the generator when you are done for the day — proper shutdown.

SparkSession Builder: Every Configuration Option

Common Configurations

spark = (SparkSession.builder
    .appName("Production_ETL")
    .config("spark.sql.shuffle.partitions", "200")
    .config("spark.driver.memory", "4g")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.parquet.compression.codec", "snappy")
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .config("spark.databricks.delta.optimizeWrite.enabled", "true")
    .getOrCreate())

Configuration Reference

Config What It Controls Default Recommended
spark.sql.shuffle.partitions Number of partitions after shuffle operations 200 200 for medium, 2-4x cores for large
spark.driver.memory Memory for the driver process 1g 4g-8g for development
spark.executor.memory Memory per executor 1g 4g-16g based on data size
spark.executor.cores CPU cores per executor 1 2-4 per executor
spark.sql.adaptive.enabled Adaptive Query Execution true (since Spark 3.2) Always true
spark.sql.parquet.compression.codec Parquet compression snappy snappy (fast) or zstd (smaller)
spark.serializer Object serialization Java Kryo (faster)
spark.sql.sources.partitionOverwriteMode How partition overwrite works static dynamic (safer)

Checking Your Spark Version and Configuration

# Spark version
print(f"Spark version: {spark.version}")

# All configurations
for item in spark.sparkContext.getConf().getAll():
    print(f"{item[0]} = {item[1]}")

# Specific configuration
print(spark.conf.get("spark.sql.shuffle.partitions"))
print(spark.conf.get("spark.driver.memory"))

# Application name
print(spark.sparkContext.appName)

# Master (where Spark is running)
print(spark.sparkContext.master)

# Default parallelism
print(spark.sparkContext.defaultParallelism)

# Active SparkSession
print(SparkSession.getActiveSession())

spark.conf.set vs Builder Config

Both set configurations, but at different times:

Builder Config (At Creation Time)

# Set BEFORE the session is created — some configs ONLY work here
spark = (SparkSession.builder
    .config("spark.driver.memory", "4g")
    .config("spark.executor.memory", "8g")
    .getOrCreate())

spark.conf.set (Runtime)

# Set AFTER the session is created — for runtime-adjustable configs
spark.conf.set("spark.sql.shuffle.partitions", "100")
spark.conf.set("spark.sql.adaptive.enabled", "true")

Which Configs Work Where?

Config Builder Only Runtime (conf.set)
spark.driver.memory ✅ Yes ❌ No (must be set before session starts)
spark.executor.memory ✅ Yes ❌ No
spark.sql.shuffle.partitions ✅ Yes ✅ Yes
spark.sql.adaptive.enabled ✅ Yes ✅ Yes
fs.azure.account.key.* ❌ No ✅ Yes (storage auth)

Rule: Memory and resource configs must be in the builder. SQL and runtime behavior configs can be changed anytime with spark.conf.set().

In Databricks: Memory configs are set on the cluster configuration page, not in code. Use spark.conf.set() for SQL and storage configs.

The SparkContext (Under the Hood)

The SparkContext is the original Spark entry point (pre-Spark 2.0). It manages the connection to the cluster and is responsible for:

  • Communicating with the cluster manager
  • Distributing tasks to executors
  • Managing RDDs (the old data structure)

# Access SparkContext from SparkSession
sc = spark.sparkContext

# Useful SparkContext operations
sc.setLogLevel("ERROR")                    # Reduce log noise (INFO → ERROR)
print(sc.version)                           # Spark version
print(sc.master)                            # Cluster mode
print(sc.defaultParallelism)                # Default partition count
print(sc.applicationId)                     # Unique app ID

# Create RDD (old way — rarely needed)
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.collect())  # [1, 2, 3, 4, 5]

# Read text file as RDD
rdd = sc.textFile("/mnt/datalake/data.txt")
print(rdd.take(5))

When you need SparkContext directly: Almost never. Use the SparkSession for everything. The only common use is sc.setLogLevel("ERROR") to reduce log noise during development.

Stopping the Session

When to Stop

Environment Stop the Session? Why
Local PySpark script ✅ Yes — spark.stop() Frees resources, clean shutdown
Databricks notebook ❌ Never Platform manages lifecycle
Synapse notebook ❌ Never Platform manages lifecycle
Unit tests ✅ Yes — in teardown Prevent resource leaks between tests

How to Stop

# Stop the session (only in standalone apps)
spark.stop()

# After stopping:
# - All executors are terminated
# - All cached data is released
# - All temporary views are dropped
# - The SparkContext is stopped
# - You cannot use spark anymore (must create a new session)

Graceful Shutdown Pattern (Standalone Apps)

from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("DailyETL").getOrCreate()

    try:
        # Your ETL logic here
        df = spark.read.parquet("/data/input/")
        df_clean = df.filter(df.status == "Active")
        df_clean.write.parquet("/data/output/")
        print("ETL completed successfully!")
    except Exception as e:
        print(f"ETL failed: {e}")
        raise
    finally:
        spark.stop()  # Always stop, even if there's an error
        print("Spark session stopped.")

if __name__ == "__main__":
    main()

Running PySpark Locally (Without Databricks)

Installing PySpark

# Install PySpark
pip install pyspark

# Install with specific version
pip install pyspark==3.5.0

# Install with the "sql" extra (adds pandas/pyarrow for Spark SQL and pandas interop)
pip install pyspark[sql]

Prerequisites

  • Java 8, 11, or 17, depending on your Spark version (Spark runs on the JVM)
  • Python 3.8+
# Check Java version
java -version

# If not installed (Mac):
brew install openjdk@11

# If not installed (Ubuntu):
sudo apt install openjdk-11-jdk

Your First Local PySpark Script

# save as etl_local.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, upper, trim, coalesce, current_date

# Create session (local mode — runs on your machine)
spark = (SparkSession.builder
    .appName("LocalETL")
    .master("local[*]")
    .getOrCreate())

# Create sample data
data = [
    (1, "  alice ", "Toronto", 95000),
    (2, "BOB", "Mumbai", 72000),
    (3, "Carol", None, 68000),
]

df = spark.createDataFrame(data, ["id", "name", "city", "salary"])

# Transform
df_clean = (df
    .withColumn("name", upper(trim(col("name"))))
    .withColumn("city", coalesce(col("city"), lit("Unknown")))   # fill missing cities
    .withColumn("processed_date", current_date()))

df_clean.show()
df_clean.printSchema()

# Stop session
spark.stop()
print("Done!")

Run it:

python etl_local.py
# Or with spark-submit (preferred for production):
spark-submit etl_local.py

python vs spark-submit

Method When to Use What It Does
python script.py Quick local testing Runs as a regular Python script
spark-submit script.py Production, cluster deployment Sets up Spark environment properly, supports cluster configs

# spark-submit with configuration
spark-submit \
    --master local[4] \
    --driver-memory 4g \
    --executor-memory 8g \
    --conf spark.sql.shuffle.partitions=100 \
    etl_local.py

Local Mode vs Cluster Mode

Feature Local Mode Cluster Mode (Databricks/YARN)
Where it runs Your laptop/desktop Cluster of machines
Master local[*] Managed by platform
Max data Limited by your RAM Unlimited (add nodes)
Speed One machine Parallel across many machines
Cost Free Cloud compute costs
Use for Learning, unit tests, small data Production, big data

The Complete Import Reference

# === ESSENTIAL (every notebook) ===
from pyspark.sql import SparkSession                     # Session creation
from pyspark.sql.functions import *                       # All transformation functions

# === SCHEMA DEFINITION ===
from pyspark.sql.types import (
    StructType, StructField,                              # Schema structure
    StringType, IntegerType, LongType, DoubleType,        # Numeric/string types
    FloatType, BooleanType, DateType, TimestampType,      # More types
    ArrayType, MapType, DecimalType, BinaryType            # Complex types
)

# === WINDOW FUNCTIONS ===
from pyspark.sql.window import Window                     # Window specifications

# === DELTA LAKE (Databricks) ===
from delta.tables import DeltaTable                       # Delta operations (MERGE, etc.)

# === PANDAS INTEGRATION ===
import pandas as pd                                       # Pandas interop
from pyspark.sql.functions import pandas_udf              # Vectorized UDFs

# === STREAMING ===
from pyspark.sql.streaming import StreamingQuery          # Structured Streaming

# === SPARK CONTEXT (rarely needed) ===
from pyspark import SparkContext, SparkConf                # Low-level access

Common Import Patterns by Task

# Reading/writing files: just SparkSession + functions
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

# Data cleaning: functions for string/date/null handling
from pyspark.sql.functions import (
    col, lit, when, coalesce, trim, upper, lower, initcap,
    to_date, to_timestamp, current_date, year, month,
    regexp_replace, concat, concat_ws, split, length
)

# Aggregations and analytics: agg functions + window
# (note: sum/min/max here shadow Python's built-ins —
#  use "import pyspark.sql.functions as F" if you want to avoid that)
from pyspark.sql.functions import (
    sum, avg, count, min, max, countDistinct,
    row_number, rank, dense_rank, lag, lead, first, last
)
from pyspark.sql.window import Window

# Schema definition: types
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType,
    DoubleType, DateType, TimestampType, BooleanType
)

# Delta Lake operations: DeltaTable
from delta.tables import DeltaTable

Environment Comparison: Local vs Databricks vs Synapse

Feature Local PySpark Databricks Synapse Spark
SparkSession You create it Pre-created as spark Pre-created as spark
spark.stop() Required at end Never call Never call
Master local[*] or spark-submit Managed by cluster Managed by pool
Imports All manual spark, sc, dbutils pre-imported spark pre-imported
Java required Yes (install manually) No (pre-installed) No (pre-installed)
Delta Lake Install separately Native Supported
display() Not available Available Available
dbutils Not available Available Not available (use mssparkutils)
Cost Free Per DBU + VM Per node-hour

Common Mistakes

  1. Calling spark.stop() in Databricks — terminates the session, all cells fail. Restart the cluster to recover.

  2. Setting spark.driver.memory via spark.conf.set() — this does not work at runtime. Memory configs must be set in the builder or cluster config.

  3. Forgetting to import functionscol("salary") fails with NameError: name 'col' is not defined. Add from pyspark.sql.functions import *.

  4. Creating SparkSession in Databricks — not harmful but unnecessary. SparkSession.builder.getOrCreate() returns the existing pre-created session.

  5. Not installing Java for local PySpark — PySpark needs a compatible JVM (Java 8/11, or 17 on newer Spark versions). Without it, spark-submit fails with “Java not found.”

  6. Using import pyspark.sql.functions instead of from pyspark.sql.functions import * — the former requires pyspark.sql.functions.col() every time. The latter gives you col() directly.

  7. Confusing spark with scspark is the SparkSession (use this). sc is the SparkContext (rarely needed). They are different objects.

Interview Questions

Q: What is a SparkSession and why is it needed? A: SparkSession is the unified entry point to all PySpark functionality — reading data, creating DataFrames, running SQL, configuring settings, and accessing the catalog. It was introduced in Spark 2.0 to replace the separate SparkContext, SQLContext, and HiveContext. Every PySpark operation goes through the SparkSession.

Q: What is the difference between SparkSession and SparkContext? A: SparkContext is the original Spark 1.x entry point that manages the cluster connection and RDD operations. SparkSession (Spark 2.0+) wraps SparkContext and adds DataFrame, SQL, and catalog support in a unified API. Use SparkSession for everything — SparkContext is accessible via spark.sparkContext if needed.

Q: What does getOrCreate() do? A: It returns the existing active SparkSession if one exists, or creates a new one if none exists. This prevents accidental duplicates: a JVM runs a single SparkContext, and getOrCreate() hands back the session bound to it.

Q: Why should you never call spark.stop() in Databricks? A: Because Databricks manages the SparkSession lifecycle. Calling spark.stop() terminates the session, and all subsequent notebook cells fail. The platform automatically stops the session when the cluster terminates or the notebook is detached.

Q: What imports do you need for a typical PySpark notebook? A: At minimum: from pyspark.sql.functions import * for transformation functions. Add from pyspark.sql.types import * for schema definitions. Add from pyspark.sql.window import Window for window functions. In non-Databricks environments, also add from pyspark.sql import SparkSession.

Q: What is the difference between spark.conf.set() and builder .config()? A: Builder .config() sets properties before the session is created — required for resource configs like spark.driver.memory. spark.conf.set() changes properties at runtime — works for SQL behavior and storage authentication. Memory configs ONLY work in the builder; they are ignored at runtime.

Wrapping Up

The SparkSession is the foundation of every PySpark program. In Databricks, it is pre-created and ready to use. In standalone environments, you create it with SparkSession.builder...getOrCreate() and stop it with spark.stop() when done.

The key things to remember: – Databricks: spark is already available. Import functions, types, and Window as needed. Never call spark.stop(). – Local/standalone: Create the session with builder. Set master to local[*]. Stop the session in finally block. – Always import: from pyspark.sql.functions import * — you need this in every notebook.

This is the foundation. Everything else — reading files, transforming DataFrames, writing Delta tables — builds on top of this session.

Related posts: Apache Spark and PySpark Architecture · PySpark Transformations Cookbook · Azure Databricks Introduction · Reading/Writing File Formats in Databricks · Python for Data Engineers


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
