Apache Spark in Fabric: Runtime Configurations, Starter Pools, Custom Environments, V-Order, Adaptive Query Execution, and Performance Tuning

Every Fabric Notebook runs on Apache Spark. But most data engineers treat Spark as a black box — write PySpark, click Run, wait. When a notebook takes 45 minutes instead of 5, they do not know why or how to fix it.

This post opens the black box. It covers how Spark clusters work in Fabric, how to configure them for your workload, when to use starter pools versus custom environments, how V-Order optimizes Delta writes, how Adaptive Query Execution (AQE) tunes queries at runtime, and the specific Spark settings that can turn a 45-minute notebook into a 5-minute one.

Think of Spark like a team of workers (executors) in a warehouse. The warehouse manager (driver) assigns tasks. If you assign 200 workers to sort 100 boxes (200 shuffle partitions for small data), most workers stand idle — waste. If you assign 2 workers to sort 10 million boxes, they work for hours — bottleneck. Configuration is about matching the team size to the job.

Table of Contents

  • How Spark Works in Fabric
  • Driver and Executors
  • Spark Sessions in Fabric
  • Starter Pools (Quick Start)
  • Custom Spark Pools (via Environments)
  • Spark Runtime Versions
  • Key Spark Configurations
  • Shuffle Partitions
  • Adaptive Query Execution (AQE)
  • Auto-Optimize and Auto-Compact
  • V-Order Optimization
  • Broadcast Joins
  • Memory Configuration
  • Spark UI: Understanding Your Job
  • Stages, Tasks, and Shuffles
  • Reading the Spark UI
  • Identifying Bottlenecks
  • Performance Tuning Patterns
  • Pattern 1: Small Data Optimization
  • Pattern 2: Large Data Optimization
  • Pattern 3: Join Optimization
  • Pattern 4: Write Optimization
  • Spark Job Definitions (Scheduled Spark Jobs)
  • High Concurrency Mode
  • Lakehouse Attached vs Multi-Lakehouse
  • Common Mistakes
  • Interview Questions
  • Wrapping Up

How Spark Works in Fabric

┌──────────────────────────────────────────────────────┐
│  SPARK SESSION (your notebook)                       │
│                                                      │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐      │
│  │ DRIVER     │  │ EXECUTOR   │  │ EXECUTOR   │      │
│  │ (brain)    │  │ (worker)   │  │ (worker)   │ ...  │
│  │            │  │            │  │            │      │
│  │ Plans      │  │ Processes  │  │ Processes  │      │
│  │ queries    │  │ data       │  │ data       │      │
│  │ Coordinates│  │ partitions │  │ partitions │      │
│  └────────────┘  └────────────┘  └────────────┘      │
│                                                      │
│  Reading from: OneLake (Delta/Parquet files)         │
│  Writing to:   OneLake (Delta tables)                │
└──────────────────────────────────────────────────────┘

Driver and Executors

  • Driver: The brain — plans the query, distributes work, collects results
  • Executors: The workers — process data partitions in parallel
  • More executors = more parallelism = faster processing (up to a point)
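
The "up to a point" caveat can be made concrete with a little arithmetic. The sketch below is a simplification that assumes every task takes the same time and that each executor core runs one task at a time; `task_waves` is a hypothetical helper, not a Spark or Fabric API, and the cluster sizes are made up for illustration.

```python
import math


def task_waves(num_tasks: int, executors: int, cores_per_executor: int) -> int:
    """Return how many sequential 'waves' are needed to run num_tasks tasks."""
    slots = executors * cores_per_executor  # tasks that can run at the same time
    if slots <= 0:
        raise ValueError("need at least one executor core")
    return math.ceil(num_tasks / slots)


# Hypothetical cluster sizes, for illustration only
print(task_waves(200, executors=2, cores_per_executor=4))   # 25 waves on 8 slots
print(task_waves(200, executors=25, cores_per_executor=8))  # 1 wave on 200 slots
print(task_waves(200, executors=50, cores_per_executor=8))  # still 1 wave; half the slots sit idle
```

Once the number of task slots matches the number of tasks, adding executors no longer shortens the stage; it only adds idle capacity.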

Spark Sessions in Fabric

When you open a notebook and run a cell, Fabric:

  1. Allocates a Spark session (driver + executors)
  2. Starts the session (about 10 seconds with a starter pool, 1-3 minutes with a custom pool)
  3. Runs your code
  4. Keeps the session alive for reuse (default idle timeout: 20 minutes)

Starter Pools (Quick Start)

Starter pools are pre-warmed Spark clusters maintained by Fabric:

You run a cell → Fabric assigns a pre-started session → Code runs in ~10 seconds
(No cluster startup wait!)

| Feature         | Starter Pool                | Custom Environment            |
|-----------------|-----------------------------|-------------------------------|
| Startup time    | ~10 seconds                 | 1-3 minutes                   |
| Configuration   | Default (Fabric-managed)    | Fully customizable            |
| Libraries       | Default only                | Custom PyPI/wheel/jar         |
| Best for        | Ad-hoc queries, exploration | Production notebooks          |
| Pool management | Fabric manages              | You configure via Environment |

Use starter pools for: Quick data exploration, ad-hoc queries, testing. Use custom environments for: Production notebooks with specific libraries and tuned configurations.

Key Spark Configurations

Shuffle Partitions (Most Impactful Setting)

# DEFAULT: 200 shuffle partitions
# Every shuffle (join, group by, window function) produces 200 partitions,
# and writing that result can produce up to 200 files

# For SMALL data (< 1 million rows):
spark.conf.set("spark.sql.shuffle.partitions", "10")
# 200 partitions for 10K rows = 200 partitions with 50 rows each (terrible!)
# 10 partitions = 10 partitions with 1,000 rows each (efficient)

# For MEDIUM data (1-100 million rows):
spark.conf.set("spark.sql.shuffle.partitions", "50")

# For LARGE data (100M+ rows):
spark.conf.set("spark.sql.shuffle.partitions", "200")  # Default is fine

# BEST: Let AQE handle it automatically
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# AQE automatically reduces 200 partitions to the optimal number at runtime

Real-life analogy: Shuffle partitions are like splitting a pizza. 200 slices for 2 people (small data) = ridiculously thin, impossible to eat. 4 slices for 2 people = perfect. 200 slices for 100 people (big data) = reasonable.
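
The size tiers above can also be expressed as a rough calculation. This is a minimal sketch that assumes a rule-of-thumb target of about 128 MB of shuffle data per partition; that target, the cap of 2,000, and the helper name `suggest_shuffle_partitions` are illustrative assumptions, not Fabric or Spark defaults.

```python
import math

MB = 1024 * 1024


def suggest_shuffle_partitions(
    shuffle_bytes: int,
    target_partition_bytes: int = 128 * MB,  # assumed rule of thumb, tune for your workload
    min_partitions: int = 1,
    max_partitions: int = 2000,              # assumed upper bound for illustration
) -> int:
    """Estimate a shuffle partition count from the amount of data being shuffled."""
    if shuffle_bytes < 0:
        raise ValueError("shuffle_bytes must be non-negative")
    wanted = math.ceil(shuffle_bytes / target_partition_bytes)
    return max(min_partitions, min(max_partitions, wanted))


print(suggest_shuffle_partitions(50 * MB))            # 1
print(suggest_shuffle_partitions(10 * 1024 * MB))     # 80
print(suggest_shuffle_partitions(1024 * 1024 * MB))   # 2000 (capped)

# In a notebook, the result would be applied like this:
# spark.conf.set("spark.sql.shuffle.partitions", str(suggest_shuffle_partitions(size_in_bytes)))
```

An estimate like this is only a starting point; with AQE coalescing enabled, Spark can still shrink the partition count at runtime.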

Adaptive Query Execution (AQE)

AQE automatically optimizes Spark queries AT RUNTIME based on actual data statistics:

# Enable AQE (on by default since Apache Spark 3.2, so Fabric runtimes ship with it enabled)
spark.conf.set("spark.sql.adaptive.enabled", "true")

# What AQE optimizes at runtime:
# 1. Coalesces shuffle partitions (reduces 200 → a smaller number sized to the data)
# 2. Converts sort-merge joins to broadcast joins (if one side turns out to be small)
# 3. Optimizes skewed joins (splits oversized partitions)
#
# Related, but a separate Spark feature rather than part of AQE:
# 4. Dynamic partition pruning (skips irrelevant partitions)

AQE is like an autopilot. It adjusts in real-time based on what it discovers about the data. You set it and forget it.
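
To make the partition coalescing step less abstract, here is a simplified model of the idea in plain Python: adjacent shuffle partitions are merged until a group approaches an advisory size (Spark exposes this as `spark.sql.adaptive.advisoryPartitionSizeInBytes`, 64 MB by default). This is a sketch of the concept only; Spark's real implementation has additional rules, and `coalesce_partitions` is a hypothetical name.

```python
def coalesce_partitions(sizes_mb, advisory_mb=64):
    """Greedily merge adjacent partitions so each group stays near advisory_mb."""
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes_mb):
        # Close the current group if adding this partition would overshoot the target
        if current and current_size + size > advisory_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Eight shuffle partitions with made-up sizes in MB
sizes = [10, 10, 10, 50, 5, 5, 70, 1]
print(coalesce_partitions(sizes))  # [[0, 1, 2], [3, 4, 5], [6], [7]]
```

Eight small partitions become four tasks here, which is the same effect the post describes when AQE reduces 200 partitions to a smaller number.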

Auto-Optimize and Auto-Compact

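# Optimize write: produces fewer, larger files during Delta writes
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
# Rows are redistributed before the write so each partition produces fewer small files
# Note: Fabric documentation also lists this setting as spark.microsoft.delta.optimizeWrite.enabled;
# check which key your runtime honors

# Auto-compact: automatically compacts small files after writes
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
# Runs a mini-OPTIMIZE after a write when enough small files have accumulated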

V-Order Optimization

V-Order is a Fabric-specific optimization that sorts data within Parquet files for faster reads:

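# V-Order has typically been enabled by default in Fabric, but defaults can vary
# by workspace and runtime, so check rather than assume
spark.conf.get("spark.sql.parquet.vorder.enabled")  # "true" when enabled
# (Microsoft's documentation also lists spark.sql.parquet.vorder.default; check the key for your runtime)

# What V-Order does:
# 1. Sorts data within each Parquet file for optimal columnar access
# 2. Improves Power BI Direct Lake read performance (the ~50% figure is workload-dependent)
# 3. Improves SQL endpoint query performance
# 4. No separate charge; it is applied during write, at the price of somewhat slower writes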

Broadcast Joins

When joining a large table with a small table, broadcast the small one:

from pyspark.sql.functions import broadcast

# ❌ SLOW if Spark plans a sort-merge join: both tables are shuffled across executors
result = large_df.join(small_df, "key")

# ✅ FAST: Small table is broadcast to all executors (the large table is not shuffled)
result = large_df.join(broadcast(small_df), "key")

# Auto-broadcast threshold (default 10MB)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "50m")  # Increase to 50MB
# Tables that Spark estimates to be under 50MB are broadcast automatically
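
Why a broadcast join avoids the shuffle is easier to see without Spark. The sketch below models the idea in plain Python, assuming rows are dictionaries: the small table becomes a lookup that every worker would hold a copy of, and the large table is streamed past it in place. `broadcast_hash_join` is a hypothetical function written for illustration, not Spark's implementation.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner-join two lists of dict rows by building a hash table from the small side."""
    # "Broadcast": build one lookup from the small side; each worker would get a full copy
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    # Stream the large side; each row is matched locally, so the large side never moves
    for row in large_rows:
        for match in lookup.get(row[key], []):
            yield {**match, **row}


orders = [{"key": 1, "amount": 10}, {"key": 2, "amount": 20}, {"key": 3, "amount": 5}]
customers = [{"key": 1, "name": "Ana"}, {"key": 2, "name": "Bo"}]

for joined in broadcast_hash_join(orders, customers, "key"):
    print(joined)
```

The trade-off is memory: the lookup must fit on every executor, which is why Spark only broadcasts tables below a size threshold.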

Spark UI: Understanding Your Job

Access the Spark UI from the notebook → Monitor tab (or click the job link after running a cell):

Spark UI tabs:
  Jobs → Shows all Spark jobs triggered by your code
  Stages → Shows stages within each job (read, shuffle, write); open a stage to see its individual tasks
  Storage → Shows cached DataFrames
  Environment → Shows Spark configuration
  Executors → Shows per-executor memory, task counts, and shuffle read/write
  SQL → Shows query plans (most useful for optimization)

Identifying Bottlenecks

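| Symptom                               | Likely Cause                                 | Fix                                                                                 |
|---------------------------------------|----------------------------------------------|-------------------------------------------------------------------------------------|
| One task takes 10x longer than others | Data skew (one partition has much more data) | Repartition, salt the key, or use AQE skew join                                     |
| Many stages with tiny tasks           | Too many shuffle partitions                  | Reduce shuffle partitions or enable AQE coalescing                                  |
| Long wait between stages              | Shuffle I/O bottleneck                       | Use a broadcast join, optimize partition count                                      |
| Out of memory errors                  | Data too large for driver/executor           | Increase memory, use more partitions so each one is smaller, filter earlier         |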
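
The first row of the table ("one task takes 10x longer than others") can be checked mechanically once you have task durations, for example copied from a stage's task list in the Spark UI. This is a minimal sketch; `find_skewed_tasks` is a hypothetical helper, and comparing against the median with a factor of 10 is an assumption taken from the rule of thumb above.

```python
from statistics import median


def find_skewed_tasks(durations_sec, factor=10.0):
    """Return indexes of tasks that ran at least `factor` times longer than the median task."""
    if not durations_sec:
        return []
    typical = median(durations_sec)
    if typical <= 0:
        return []
    return [i for i, d in enumerate(durations_sec) if d >= factor * typical]


# Made-up durations in seconds for six tasks in one stage
durations = [4, 5, 5, 6, 5, 90]
print(find_skewed_tasks(durations))  # [5]
```

If this flags a task, the partition it processed is a good candidate for salting or for AQE's skew join handling.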

Performance Tuning Patterns

Pattern 1: Small Data (< 1M rows)

spark.conf.set("spark.sql.shuffle.partitions", "8")
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Use .coalesce(1) before writing if you want a single output file
df.coalesce(1).write.format("delta").mode("overwrite").saveAsTable("small_table")

Pattern 2: Large Data (100M+ rows)

spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
# Partition output by a frequently filtered column
df.write.format("delta").partitionBy("year", "month").saveAsTable("large_table")

Pattern 3: Join Optimization

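from pyspark.sql.functions import broadcast, col, concat, lit, rand

# Small lookup table: broadcast it
result = fact_df.join(broadcast(dim_df), "customer_key")

# Skewed join: salt the key (10 salt values spread each hot key over up to 10 partitions)
salt = (rand() * 10).cast("int").cast("string")
salted_df = skewed_df.withColumn("salted_key", concat(col("key"), lit("_"), salt))
# The other side of the join must be replicated once per salt value (0-9) before joining on salted_key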
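
The effect of salting can be simulated without a cluster. The sketch below uses a toy partitioner (sum of the key's bytes modulo the partition count) as a stand-in for Spark's hash partitioner, and a round-robin salt instead of `rand()` so the numbers are reproducible; both are assumptions made for illustration.

```python
from collections import Counter

NUM_PARTITIONS = 8
NUM_SALTS = 10


def toy_partition(key: str, num_partitions: int) -> int:
    # Toy stand-in for Spark's hash partitioner: deterministic and easy to reason about
    return sum(key.encode("utf-8")) % num_partitions


def partition_loads(keys, num_partitions):
    """Count how many rows land in each partition."""
    return Counter(toy_partition(k, num_partitions) for k in keys)


hot_rows = ["hot"] * 1000  # one key dominates the data
unsalted = partition_loads(hot_rows, NUM_PARTITIONS)

# Round-robin salt (instead of rand()) keeps the example reproducible
salted_keys = [f"hot_{i % NUM_SALTS}" for i in range(len(hot_rows))]
salted = partition_loads(salted_keys, NUM_PARTITIONS)

print(max(unsalted.values()))  # 1000: every row lands in the same partition
print(max(salted.values()))    # 200: the busiest partition now holds a fifth of the rows
```

The single 1,000-row partition becomes eight partitions of 100 to 200 rows each, so no one task dominates the stage.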

Pattern 4: Write Optimization

# Optimize file sizes on write
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

# After large loads, run OPTIMIZE
spark.sql("OPTIMIZE gold.fact_sales ZORDER BY (customer_key, date_key)")
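
The settings from these patterns can be collected into named presets so each notebook does not repeat them. This is one possible arrangement, not a Fabric feature: `TUNING_PROFILES` and `apply_profile` are hypothetical names, and the values simply mirror the ones used in the patterns above.

```python
TUNING_PROFILES = {
    # Pattern 1: small data
    "small": {
        "spark.sql.shuffle.partitions": "8",
        "spark.sql.adaptive.enabled": "true",
    },
    # Pattern 2: large data
    "large": {
        "spark.sql.shuffle.partitions": "200",
        "spark.sql.adaptive.enabled": "true",
        "spark.databricks.delta.optimizeWrite.enabled": "true",
    },
    # Pattern 4: write optimization
    "write": {
        "spark.databricks.delta.optimizeWrite.enabled": "true",
        "spark.databricks.delta.autoCompact.enabled": "true",
    },
}


def apply_profile(spark, name):
    """Apply every setting in the named profile to the session and return what was set."""
    settings = TUNING_PROFILES[name]  # raises KeyError for an unknown profile name
    for key, value in settings.items():
        spark.conf.set(key, value)
    return settings


# In a Fabric notebook, `spark` is the built-in session:
# apply_profile(spark, "small")
```

Keeping the presets in one place also makes it easier to see which settings a slow notebook was actually running with.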

Spark Job Definitions (Scheduled Spark Jobs)

For production PySpark scripts that do not need an interactive notebook:

  1. + New item → Spark Job Definition
  2. Upload your .py file or reference from OneLake
  3. Configure: runtime, environment, arguments
  4. Schedule or trigger from a Pipeline

# my_etl_job.py (standalone script, not a notebook)
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DailyETL").getOrCreate()
table_name = sys.argv[1]  # Passed as argument

df = spark.table(f"bronze.{table_name}")
# ... transformations ...
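# Overwrite so reruns do not fail on an existing table; use "append" for incremental loads
df.write.format("delta").mode("overwrite").saveAsTable(f"silver.{table_name}")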
spark.stop()

Use Spark Job Definitions when: You have a stable, production PySpark script that does not need interactive cell-by-cell execution.
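
Because the script above interpolates its argument straight into a table name, a scheduled job benefits from validating it first. The sketch below shows one way to do that with the standard library; `parse_job_args` and the allowed-name pattern are assumptions for illustration, so adjust the pattern to your own naming rules.

```python
import argparse
import re

# Assumed naming rule: letters, digits, and underscores, not starting with a digit
_TABLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def parse_job_args(argv):
    """Parse and validate the arguments passed to the Spark Job Definition."""
    parser = argparse.ArgumentParser(description="Daily ETL job (illustrative)")
    parser.add_argument("table_name", help="table to copy from bronze to silver")
    args = parser.parse_args(argv)
    if not _TABLE_NAME.fullmatch(args.table_name):
        # parser.error prints a usage message and exits with status 2
        parser.error(f"invalid table name: {args.table_name!r}")
    return args


args = parse_job_args(["orders"])
print(args.table_name)  # orders

# In the job script, this would replace the bare sys.argv[1] lookup:
# table_name = parse_job_args(sys.argv[1:]).table_name
```

Failing fast on a bad argument is cheaper than discovering it after the Spark session has started and the read has failed.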

Common Mistakes

  1. Not setting shuffle partitions for small data — 200 partitions for 10,000 rows creates 200 tiny files. Set to 8-20 for small datasets.

  2. Disabling AQE — AQE is free optimization. Never disable it. It automatically coalesces partitions, optimizes joins, and handles skew.

  3. Not using broadcast for small dimension tables — shuffling a 10,000-row dimension table across the cluster wastes network I/O. Broadcast it.

  4. Writing without optimizeWrite — creates many small files. Enable optimizeWrite and autoCompact for all production writes.

  5. Ignoring V-Order — V-Order is enabled by default in Fabric. Verify it is on. Disabling it degrades Direct Lake and SQL endpoint performance significantly.

  6. Not checking the Spark UI — when a notebook is slow, the Spark UI tells you exactly why. One skewed partition, too many shuffles, or OOM errors — all visible in the UI.

Interview Questions

Q: What is the most impactful Spark configuration for performance? A: Shuffle partitions (spark.sql.shuffle.partitions). The default of 200 suits large data but creates overhead for small and medium data. Combined with AQE (Adaptive Query Execution), which coalesces partitions automatically at runtime, these two settings address a large share of common performance issues.

Q: What is V-Order in Fabric? A: A Fabric-specific write optimization that sorts data within Parquet files for optimal columnar access. It has typically been enabled by default. It improves Power BI Direct Lake performance (by roughly 50%, depending on the workload) and SQL endpoint query speed. There is no separate charge; it is applied automatically during Delta writes, though it adds some write-time overhead.

Q: How do you handle data skew in Spark? A: Three approaches: enable AQE skew join optimization (automatic), broadcast the smaller table in a join, or salt the skewed key (add a random suffix to distribute data evenly). Check the Spark UI for tasks that take 10x longer than others — that indicates skew.

Wrapping Up

Spark in Fabric is not just “write PySpark and click run.” Understanding shuffle partitions, AQE, V-Order, broadcast joins, and the Spark UI transforms you from someone who writes Spark to someone who masters it. The configurations in this post can turn a 45-minute notebook into a 5-minute one.


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
