Apache Spark in Fabric: Runtime Configurations, Starter Pools, Custom Environments, V-Order, Adaptive Query Execution, and Performance Tuning

Every Fabric Notebook runs on Apache Spark. But most data engineers treat Spark as a black box — write PySpark, click Run, wait. When a notebook takes 45 minutes instead of 5, they do not know why or how to fix it.

This post opens the black box: how Spark clusters work in Fabric, how to configure them for your workload, how starter pools differ from custom environments, how V-Order optimizes Delta writes, how Adaptive Query Execution (AQE) auto-tunes at runtime, and the specific Spark settings that can turn a 45-minute notebook into a 5-minute one.

Think of Spark like a team of workers (executors) in a warehouse. The warehouse manager (driver) assigns tasks. If you assign 200 workers to sort 100 boxes (200 shuffle partitions for small data), most workers stand idle — waste. If you assign 2 workers to sort 10 million boxes, they work for hours — bottleneck. Configuration is about matching the team size to the job.

How Spark Works in Fabric
Driver and Executors
Spark Sessions in Fabric
Starter Pools (Quick Start)
Custom Spark Pools (via Environments)
Spark Runtime Versions
Key Spark Configurations
Shuffle Partitions
Adaptive Query Execution (AQE)
Auto-Optimize and Auto-Compact
V-Order Optimization
Resource Profiles (writeHeavy, readHeavyForSpark, readHeavyForPBI)
Broadcast Joins
Memory Configuration
Spark UI: Understanding Your Job
Stages, Tasks, and Shuffles
Reading the Spark UI
Identifying Bottlenecks
Performance Tuning Patterns
Pattern 1: Small Data Optimization
Pattern 2: Large Data Optimization
Pattern 3: Join Optimization
Pattern 4: Write Optimization
Spark Job Definitions (Scheduled Spark Jobs)
High Concurrency Mode
Lakehouse Attached vs Multi-Lakehouse
Common Mistakes
Interview Questions
Wrapping Up

How Spark Works in Fabric

┌─────────────────────────────────────────────────┐
│  SPARK SESSION (your notebook)                    │
│                                                   │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐       │
│  │ DRIVER   │  │ EXECUTOR │  │ EXECUTOR │       │
│  │ (brain)  │  │ (worker) │  │ (worker) │ ...   │
│  │          │  │          │  │          │       │
│  │ Plans    │  │ Processes│  │ Processes│       │
│  │ queries  │  │ data     │  │ data     │       │
│  │ Coords   │  │ partitions│ │ partitions│      │
│  └──────────┘  └──────────┘  └──────────┘       │
│                                                   │
│  Reading from: OneLake (Delta/Parquet files)      │
│  Writing to:   OneLake (Delta tables)             │
└─────────────────────────────────────────────────┘

Driver and Executors

Driver: The brain — plans the query, distributes work, collects results
Executors: The workers — process data partitions in parallel
More executors = more parallelism = faster processing (up to a point)

Spark Sessions in Fabric

When you open a notebook and run a cell, Fabric:
1. Allocates a Spark session (driver + executors)
2. Starts the session (about 10 seconds with a starter pool, 1-3 minutes with a custom pool)
3. Runs your code
4. Keeps the session alive for reuse (idle timeout: 20 minutes by default)

Starter Pools (Quick Start)

Starter pools are pre-warmed Spark clusters maintained by Fabric:

You run a cell → Fabric assigns a pre-started session → Code runs in ~10 seconds
(No cluster startup wait!)

Feature	Starter Pool	Custom Environment
Startup time	~10 seconds	1-3 minutes
Configuration	Default (Fabric-managed)	Fully customizable
Libraries	Default only	Custom PyPI/wheel/jar
Best for	Ad-hoc queries, exploration	Production notebooks
Pool management	Fabric manages	You configure via Environment

Use starter pools for: Quick data exploration, ad-hoc queries, testing. Use custom environments for: Production notebooks with specific libraries and tuned configurations.

Custom Spark Pools (via Environments)

Custom pools are configured through Spark Environments. Unlike starter pools (instant but default config), custom pools give you full control over cluster size, libraries, and Spark settings:

Create a custom pool:
  1. + New item → Environment → name: "prod_spark_env"
  2. Compute tab:
     Node Family: Memory Optimized
     Node Size:   Medium (8 vCores, 56 GB)
     Nodes:       3-10 (min/max autoscale range)
  3. Public Libraries tab:
     Add: openpyxl, great-expectations, azure-identity
  4. Spark Properties tab:
     spark.sql.shuffle.partitions = 50
     spark.databricks.delta.optimizeWrite.enabled = true
  5. Click Publish → Environment builds (2-5 minutes one-time)

Attach to notebook:
  Notebook → Environment dropdown → select "prod_spark_env"
  First cell run: 1-3 minutes startup (cluster provisions)
  Subsequent cells: instant (session is alive)

Key difference from starter pools: Custom pools take 1-3 minutes to start but run with your exact libraries, your Spark settings, and your chosen node size. Starter pools start in 10 seconds but with Fabric’s defaults only. For production workloads that run on a schedule (pipeline-triggered), the 1-3 minute startup cost is negligible.

Spark Runtime Versions

Fabric provides multiple Spark runtime versions, each bundling a specific Spark, Python, and Delta Lake version:

Runtime	Spark Version	Python	Delta Lake	Status
Runtime 1.2	Spark 3.4	Python 3.10	Delta 2.4	GA
Runtime 1.3	Spark 3.5	Python 3.11	Delta 3.2	GA

Choose the runtime in your Environment settings. Use Runtime 1.3 for new projects: it includes Spark 3.5 performance improvements, updated Python libraries, and Delta Lake 3.2 features (liquid clustering support, improved MERGE performance). Use Runtime 1.2 only if you have compatibility requirements with older Delta tables or libraries.

Key Spark Configurations

Shuffle Partitions

# DEFAULT: 200 shuffle partitions
# Every shuffle (join, group by, window function) produces 200 partitions,
# and writing that result directly can produce up to 200 output files

# For SMALL data (< 1 million rows):
spark.conf.set("spark.sql.shuffle.partitions", "10")
# 200 partitions for 10K rows = 200 files with 50 rows each (terrible!)
# 10 partitions = 10 files with 1,000 rows each (efficient)

# For MEDIUM data (1-100 million rows):
spark.conf.set("spark.sql.shuffle.partitions", "50")

# For LARGE data (100M+ rows):
spark.conf.set("spark.sql.shuffle.partitions", "200")  # Default is fine

# BEST: Let AQE handle it automatically
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# AQE automatically reduces 200 partitions to the optimal number at runtime

Real-life analogy: Shuffle partitions are like splitting a pizza. 200 slices for 2 people (small data) = ridiculously thin, impossible to eat. 4 slices for 2 people = perfect. 200 slices for 100 people (big data) = reasonable.
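
The size tiers above can be captured in a small helper. This is a minimal sketch in plain Python, not a Fabric or Spark API: the function names are hypothetical, and the thresholds are simply the rules of thumb from this section, meant as starting points rather than tuned values.

```python
def suggest_shuffle_partitions(row_count: int) -> int:
    """Return a starting value for spark.sql.shuffle.partitions (hypothetical helper)."""
    if row_count < 0:
        raise ValueError("row_count must be non-negative")
    if row_count < 1_000_000:       # small data
        return 10
    if row_count < 100_000_000:     # medium data
        return 50
    return 200                      # large data: Spark's default


def rows_per_partition(row_count: int, partitions: int) -> float:
    """Average number of rows each shuffle partition would hold."""
    return row_count / partitions


# 10K rows: 200 partitions hold 50 rows each, 10 partitions hold 1,000 each
print(rows_per_partition(10_000, 200), rows_per_partition(10_000, 10))
```

In a notebook you would pass the result to spark.conf.set("spark.sql.shuffle.partitions", str(n)). With AQE coalescing enabled, the value acts more like an upper bound than an exact count.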

Adaptive Query Execution (AQE)

AQE automatically optimizes Spark queries AT RUNTIME based on actual data statistics:

# Enable AQE (default in Fabric Runtime 1.2+)
spark.conf.set("spark.sql.adaptive.enabled", "true")

# What AQE auto-optimizes:
# 1. Coalesce shuffle partitions (reduces 200 → optimal number)
# 2. Convert sort-merge join to broadcast join (if one side is small)
# 3. Optimize skewed joins (splits oversized partitions)
# Dynamic partition pruning (skipping irrelevant partitions) is a separate optimizer
# feature (spark.sql.optimizer.dynamicPartitionPruning.enabled), not part of AQE

AQE is like an autopilot. It adjusts in real-time based on what it discovers about the data. You set it and forget it.
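
To build intuition for partition coalescing, here is a simplified model in plain Python. It is not Spark's actual implementation: it greedily merges adjacent shuffle partitions until a group approaches a target size. The 64 MB target mirrors Spark's default advisory partition size (spark.sql.adaptive.advisoryPartitionSizeInBytes), and the function name is hypothetical.

```python
from typing import List

ADVISORY_SIZE_MB = 64.0  # assumption: Spark's default advisory size; your session may differ


def coalesce_partitions(sizes_mb: List[float],
                        target_mb: float = ADVISORY_SIZE_MB) -> List[List[int]]:
    """Greedily group adjacent partition indexes so each group stays near target_mb."""
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0.0
    for index, size in enumerate(sizes_mb):
        # Close the current group if adding this partition would exceed the target
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0.0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


# 200 shuffle partitions of 1 MB each collapse into a handful of groups
print(len(coalesce_partitions([1.0] * 200)))
```

A partition that is already larger than the target stays in its own group, which is why coalescing alone does not fix skew.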

Auto-Optimize and Auto-Compact

# Auto-optimize: automatically optimizes Delta writes
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
# Optimizes file sizes during writes (fewer small files)

# Auto-compact: automatically compacts small files after writes
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
# Runs mini-OPTIMIZE after each write operation

V-Order Optimization

V-Order is a Fabric-specific optimization that sorts data within Parquet files for faster reads:

# V-Order status depends on your resource profile:
# writeHeavy (default for new workspaces): V-Order DISABLED
# readHeavyForPBI: V-Order ENABLED
# readHeavyForSpark: V-Order DISABLED

# Check current setting:
spark.conf.get("spark.sql.parquet.vorder.default")

# Enable V-Order manually (if using writeHeavy but need read optimization):
spark.conf.set("spark.sql.parquet.vorder.default", "true")

# What V-Order does:
# 1. Sorts data within each Parquet file for optimal columnar access
# 2. Improves Power BI Direct Lake read performance by ~50%
# 3. Improves SQL endpoint query performance
# 4. Small write overhead (~5-10% slower writes) — worth it for read-heavy tables
# 5. Automatically applied to readHeavyForPBI profile

Resource Profiles (writeHeavy, readHeavyForSpark, readHeavyForPBI)

Fabric provides predefined Spark resource profiles — preconfigured sets of Spark settings optimized for different workload patterns. Instead of tuning individual settings, you apply a profile and Fabric sets the optimal values for you.

Profile	V-Order	optimizeWrite	binSize	Best For
`writeHeavy` (default)	Disabled	null	128 MB	ETL, ingestion, streaming — write speed prioritized
`readHeavyForSpark`	Disabled	Enabled	128 MB	Spark interactive queries, frequent joins and reads
`readHeavyForPBI`	Enabled	Enabled	1 GB	Power BI Direct Lake — read performance prioritized
`custom`	User-defined	User-defined	User-defined	Custom workloads with specific requirements

What Each Profile Configures Under the Hood

# writeHeavy (default for all new workspaces):
{
    "spark.sql.parquet.vorder.default": "false",          # V-Order OFF → faster writes
    "spark.databricks.delta.optimizeWrite.enabled": "null",
    "spark.databricks.delta.optimizeWrite.binSize": "128",
    "spark.databricks.delta.optimizeWrite.partitioned.enabled": "true"
}

# readHeavyForPBI (Gold layer for Power BI):
{
    "spark.sql.parquet.vorder.default": "true",           # V-Order ON → faster reads
    "spark.databricks.delta.optimizeWrite.enabled": "true",
    "spark.databricks.delta.optimizeWrite.binSize": "1g"  # Larger files → fewer reads
}

# readHeavyForSpark (Silver/dimension tables):
{
    "spark.databricks.delta.optimizeWrite.enabled": "true",
    "spark.databricks.delta.optimizeWrite.partitioned.enabled": "true",
    "spark.databricks.delta.optimizeWrite.binSize": "128"
}
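
The binSize difference between profiles is easier to picture as a file count. The sketch below is rough arithmetic in plain Python, assuming files land close to the target bin size. Actual file counts depend on compression, partitioning, and how the write is distributed, and the function name is hypothetical.

```python
import math


def estimated_file_count(table_size_mb: float, bin_size_mb: float) -> int:
    """Approximate number of files if each file is close to the target bin size."""
    if bin_size_mb <= 0:
        raise ValueError("bin_size_mb must be positive")
    if table_size_mb <= 0:
        return 0
    return math.ceil(table_size_mb / bin_size_mb)


ten_gb = 10 * 1024
print(estimated_file_count(ten_gb, 128))    # 128 MB bins (writeHeavy, readHeavyForSpark)
print(estimated_file_count(ten_gb, 1024))   # 1 GB bins (readHeavyForPBI)
```

Fewer, larger files mean fewer file opens for a reader scanning the whole table, which is the trade readHeavyForPBI makes.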

How to Set Resource Profiles

# Option 1: Set at RUNTIME in a notebook (overrides environment setting)
spark.conf.set("spark.fabric.resourceProfile", "readHeavyForSpark")

# Option 2: Set at ENVIRONMENT level (applies to all notebooks using this environment)
# Workspace → Environment → Spark Configurations:
#   spark.fabric.resourceProfile = writeHeavy
# This becomes the default for all Spark jobs in the environment.

# Runtime takes precedence over environment.
# Useful when different stages need different profiles in the same notebook:

# Stage 1: Ingest raw data (write-heavy)
spark.conf.set("spark.fabric.resourceProfile", "writeHeavy")
df = spark.read.format("csv").load("Files/raw/sales.csv")
df.write.format("delta").saveAsTable("bronze_sales")

# Stage 2: Build Gold table for Power BI (read-heavy)
spark.conf.set("spark.fabric.resourceProfile", "readHeavyForPBI")
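# Alias the aggregate: Delta rejects column names containing parentheses, such as sum(amount)
gold_df = spark.sql("SELECT region, SUM(amount) AS total_amount FROM silver_sales GROUP BY region")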
gold_df.write.format("delta").mode("overwrite").saveAsTable("gold_sales_summary")

Medallion Architecture with Resource Profiles

Layer      Profile              Why
─────      ──────────           ─────────────────────────────────────────────
Bronze     writeHeavy           Ingestion-heavy. Fast writes, no V-Order overhead.
Silver     readHeavyForSpark    Dimension tables frequently read and joined by Spark.
Gold       readHeavyForPBI      Reporting layer. V-Order ON for Direct Lake + large bins.

Pipeline flow:
  Copy Activity → Bronze (writeHeavy)
  Notebook       → Silver (readHeavyForSpark)
  Notebook       → Gold   (readHeavyForPBI)
  Semantic Model Refresh → Power BI reads Gold with optimal V-Ordered files
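
One way to keep the layer-to-profile mapping in a single place is a small lookup that each notebook calls before it writes. This is a sketch, assuming the three layer names used above. LAYER_PROFILES, profile_for_layer, and apply_profile are hypothetical names, and spark is whatever SparkSession the notebook provides.

```python
LAYER_PROFILES = {
    "bronze": "writeHeavy",
    "silver": "readHeavyForSpark",
    "gold": "readHeavyForPBI",
}


def profile_for_layer(layer: str) -> str:
    """Map a medallion layer name to the resource profile suggested in this post."""
    key = layer.strip().lower()
    if key not in LAYER_PROFILES:
        raise ValueError(f"unknown layer: {layer!r}")
    return LAYER_PROFILES[key]


def apply_profile(spark, layer: str) -> None:
    """Set the profile on an existing session (spark is a SparkSession)."""
    spark.conf.set("spark.fabric.resourceProfile", profile_for_layer(layer))


print(profile_for_layer("Gold"))
```

Calling apply_profile(spark, "gold") at the top of a Gold notebook makes the intent explicit even when the Environment default is writeHeavy.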

Real-life analogy: Resource profiles are like oven presets on a smart oven. “Bake” (writeHeavy) heats fast for cooking. “Broil” (readHeavyForSpark) keeps consistent high heat for crisping. “Warm” (readHeavyForPBI) maintains low, steady heat for serving. You pick the preset instead of manually setting temperature, fan speed, and timer.

DP-700 exam note: Know that new workspaces default to writeHeavy, which disables V-Order. If a question asks why Power BI Direct Lake performance is slow on a new workspace, the answer is likely: switch the Gold layer profile to readHeavyForPBI to enable V-Order and larger file bins.

Broadcast Joins

When joining a large table with a small table, broadcast the small one:

from pyspark.sql.functions import broadcast

# ❌ SLOW: Both tables shuffled across executors
result = large_df.join(small_df, "key")

# ✅ FAST: Small table broadcasted to all executors (no shuffle)
result = large_df.join(broadcast(small_df), "key")

# Auto-broadcast threshold (default 10MB)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "50m")  # Increase to 50MB
# Tables under 50MB are automatically broadcasted
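
Why does broadcasting avoid a shuffle? The plain-Python model below sketches the idea without Spark: each executor receives a full copy of the small table, builds a hash map from it, and streams its own slice of the large table through that map. No rows from the large side have to move. The function name and the (key, value) row shape are assumptions for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[int, str]


def broadcast_hash_join(large_rows: Iterable[Row],
                        small_rows: Iterable[Row]) -> List[Tuple[int, str, str]]:
    """Inner-join (key, value) rows by building a lookup from the small side."""
    lookup: Dict[int, List[str]] = {}
    for key, value in small_rows:          # build phase: small table only
        lookup.setdefault(key, []).append(value)
    joined: List[Tuple[int, str, str]] = []
    for key, value in large_rows:          # probe phase: stream the large table
        for match in lookup.get(key, []):
            joined.append((key, value, match))
    return joined


facts = [(1, "order-a"), (2, "order-b"), (1, "order-c"), (3, "order-d")]
dims = [(1, "Toronto"), (2, "Ottawa")]
print(broadcast_hash_join(facts, dims))
```

The cost is memory: the lookup must fit on every executor and pass through the driver first, which is why the threshold should stay modest.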

Memory Configuration

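# Driver and executor memory are fixed when the session starts. Calling
# spark.conf.set("spark.driver.memory", ...) in a running notebook is rejected or ignored,
# because Spark 3.x does not allow these core settings to change at runtime.
# Set them in the Environment's Spark Properties or compute settings instead.
# Values below are illustrative; the defaults depend on the node size of your pool.
#   spark.driver.memory          = 8g     # raise for collect(), toPandas(), large broadcast joins
#   spark.executor.memory        = 16g    # raise for large shuffles, joins, and transformations
#
# Memory fraction for storage vs execution (also set before the session starts):
#   spark.memory.fraction        = 0.8    # share of (heap - 300 MB) managed by Spark; Spark's default is 0.6
#   spark.memory.storageFraction = 0.3    # 30% of that protected for caching; Spark's default is 0.5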

# Common OOM scenarios and fixes:
# 1. collect() or toPandas() on large DataFrame → increase driver memory
# 2. Large shuffle/join → increase executor memory
# 3. Many cached DataFrames → increase storageFraction or unpersist unused caches

# In Fabric, set these in the Environment's Spark Properties for consistency
# (rather than per-notebook — ensures all notebooks get the same memory config)
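
The two fraction settings are easier to reason about with numbers. The sketch below works through Spark's unified memory arithmetic in plain Python: Spark reserves 300 MB of the heap, applies spark.memory.fraction to the remainder, and protects spark.memory.storageFraction of that region for cached data. The function name is hypothetical, and the example values are the illustrative ones from this section rather than Fabric defaults.

```python
from typing import Dict

RESERVED_MB = 300  # fixed amount Spark sets aside from the heap


def unified_memory_mb(heap_mb: float,
                      memory_fraction: float = 0.6,
                      storage_fraction: float = 0.5) -> Dict[str, float]:
    """Split an executor heap into Spark's unified, storage, and execution regions."""
    unified = (heap_mb - RESERVED_MB) * memory_fraction
    storage = unified * storage_fraction
    return {
        "unified": unified,               # memory Spark manages for execution + storage
        "storage": storage,               # portion protected from eviction by execution
        "execution": unified - storage,   # remainder available to shuffles, joins, sorts
    }


regions = unified_memory_mb(16 * 1024, memory_fraction=0.8, storage_fraction=0.3)
print({name: round(mb, 2) for name, mb in regions.items()})
```

The boundary is soft: execution can borrow unused storage memory and the reverse, so treat these numbers as guaranteed minimums, not hard caps.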

Spark UI: Understanding Your Job

Access the Spark UI from the notebook → Monitor tab (or click the job link after running a cell):

Spark UI tabs:
  Jobs → Shows all Spark jobs triggered by your code
  Stages → Shows stages within each job (read, shuffle, write)
  Tasks → Listed inside each stage's detail page; shows individual task execution on each executor
  Storage → Shows cached DataFrames
  Environment → Shows Spark configuration
  SQL → Shows query plans (most useful for optimization)

Stages, Tasks, and Shuffles

Spark breaks your code into three levels of execution:

JOB: One action (count(), write(), show()) = one Spark job
STAGE: Separated by shuffles (every group by, join, or window = new stage boundary)
TASK: One partition processed on one executor core

Example: df.groupBy("city").count().show()
  Stage 1: Read data from OneLake and pre-aggregate (one task per input partition/file)
  --- SHUFFLE (data redistributed across executors by "city") ---
  Stage 2: Final count per city (one task per shuffle partition); results return to the driver for show()

Why this matters:
  - More stages = more shuffles = more network I/O = slower
  - Filter early (before joins) to reduce data entering shuffles
  - Broadcast small tables to eliminate shuffles entirely

Reading the Spark UI

The SQL tab in the Spark UI is the most useful for optimization. It shows the query execution plan — which operations happened, how much data flowed through each step, and where time was spent.

What to look for in the Spark UI:

Jobs tab:
  Each action (write, count, show) is a job. Check which job took the longest.

Stages tab:
  Click into the slow job. Look at stage durations:
  - Stage with high "Shuffle Read" → reduce partitions or broadcast
  - Stage with high "Shuffle Write" → data is being redistributed (join/groupby)

Tasks table (inside a stage's detail page):
  Sort by duration. If one task takes 10x longer than others → DATA SKEW
  - 75th percentile task: 2 seconds
  - Max task: 45 seconds ← This partition likely holds about 20x more data

SQL tab:
  Shows the logical and physical query plan as a visual DAG.
  Look for: SortMergeJoin (expensive) vs BroadcastHashJoin (cheap)
  If you see SortMergeJoin on a small table → add broadcast() hint

Identifying Bottlenecks

Symptom	Likely Cause	Fix
One task takes 10x longer than others	Data skew (one partition has much more data)	Repartition, salt the key, use AQE skew join
Many stages with tiny tasks	Too many shuffle partitions	Reduce shuffle partitions or enable AQE coalesce
Long wait between stages	Shuffle I/O bottleneck	Use broadcast join, optimize partition count
Out of memory errors	Data too large for driver/executor	Increase memory, use more (smaller) partitions, filter earlier, avoid collect() on large data
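
The "one task takes 10x longer" rule can be checked mechanically if you copy task durations out of the Spark UI. This is a small sketch in plain Python. The function names and the 10x threshold are assumptions taken from the rule of thumb above, not anything Spark computes for you.

```python
from statistics import median
from typing import Sequence


def skew_ratio(task_seconds: Sequence[float]) -> float:
    """Ratio of the slowest task to the median task in a stage."""
    if not task_seconds:
        raise ValueError("need at least one task duration")
    mid = median(task_seconds)
    if mid == 0:
        return float("inf")
    return max(task_seconds) / mid


def is_skewed(task_seconds: Sequence[float], threshold: float = 10.0) -> bool:
    """Flag a stage whose slowest task exceeds threshold times the median."""
    return skew_ratio(task_seconds) >= threshold


durations = [2, 2, 2, 2, 45]   # seconds, as in the example above
print(skew_ratio(durations), is_skewed(durations))
```

Comparing against the median rather than the mean keeps a single straggler from hiding itself by dragging the average up.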

Performance Tuning Patterns

Pattern 1: Small Data Optimization

# Small data (< 1M rows): reduce partitions, coalesce output
spark.conf.set("spark.sql.shuffle.partitions", "8")
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Use .coalesce(1) before writing if you want a single output file
df.coalesce(1).write.format("delta").mode("overwrite").saveAsTable("small_table")

# For very small lookups, consider collecting to driver:
lookup_dict = {row["key"]: row["value"] for row in small_df.collect()}

Pattern 2: Large Data Optimization

# Large data (100M+ rows): let Spark parallelize fully
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

# Partition output by a frequently filtered column
df.write.format("delta").partitionBy("year", "month").saveAsTable("large_table")

# After loading, run OPTIMIZE for read performance
spark.sql("OPTIMIZE large_table ZORDER BY (customer_key, product_key)")

# Filter early: reduce data before expensive operations
from pyspark.sql.functions import broadcast, col

df = spark.table("bronze.events") \
    .filter(col("event_date") >= "2026-01-01") \
    .filter(col("status") != "deleted")        # Reduce BEFORE join
result = df.join(broadcast(dim_df), "key")     # dim_df: a small dimension DataFrame loaded earlier

Pattern 3: Join Optimization

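from pyspark.sql.functions import array, broadcast, col, concat, explode, floor, lit, rand

# Small lookup table: broadcast it
result = fact_df.join(broadcast(dim_df), "customer_key")

# Skewed join: salt the key on the skewed side (10 salt values here)
SALT_BUCKETS = 10
salted_df = skewed_df.withColumn("salted_key", concat(col("key").cast("string"), lit("_"), floor(rand() * SALT_BUCKETS).cast("string")))
# Replicate the other side once per salt value so every salted key finds its match
# (other_df is a placeholder name for the non-skewed side of the join)
other_salted = other_df.withColumn("salt", explode(array(*[lit(i) for i in range(SALT_BUCKETS)]))) \
    .withColumn("salted_key", concat(col("key").cast("string"), lit("_"), col("salt").cast("string")))
result = salted_df.join(other_salted, "salted_key")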
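
To see what salting buys you, the plain-Python model below hashes keys into partitions before and after salting. It is an illustration only: Spark uses its own hash function and a random salt, while this sketch uses zlib.crc32 and a round-robin salt so the result is reproducible. All function names are hypothetical.

```python
import zlib
from collections import Counter
from typing import List


def partition_of(key: str, num_partitions: int) -> int:
    """Deterministic stand-in for hash partitioning."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def largest_partition(keys: List[str], num_partitions: int) -> int:
    """Row count of the most loaded partition."""
    loads = Counter(partition_of(k, num_partitions) for k in keys)
    return max(loads.values())


def salt_keys(keys: List[str], buckets: int) -> List[str]:
    """Append a salt suffix; round-robin here, rand() in real Spark code."""
    return [f"{key}_{i % buckets}" for i, key in enumerate(keys)]


def max_rows_per_key(keys: List[str]) -> int:
    """Row count of the most frequent key."""
    return max(Counter(keys).values())


hot = ["hot"] * 1000                     # one key carrying every row
salted = salt_keys(hot, 10)
print(max_rows_per_key(hot), max_rows_per_key(salted))
print(largest_partition(hot, 8), largest_partition(salted, 8))
```

Before salting, every row hashes to the same partition. After salting, the hot key becomes ten smaller keys that can land on different partitions, at the cost of replicating the other side of the join ten times.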

Pattern 4: Write Optimization

# Optimize file sizes on write
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

# After large loads, run OPTIMIZE
spark.sql("OPTIMIZE gold.fact_sales ZORDER BY (customer_key, date_key)")

Spark Job Definitions (Scheduled Spark Jobs)

For production PySpark scripts that do not need an interactive notebook:

+ New item → Spark Job Definition
Upload your .py file or reference from OneLake
Configure: runtime, environment, arguments
Schedule or trigger from a Pipeline

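# my_etl_job.py (standalone script, not a notebook)
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DailyETL").getOrCreate()
table_name = sys.argv[1]  # Passed as a command-line argument in the job definition

df = spark.table(f"bronze.{table_name}")
# ... transformations ...
# Overwrite so a scheduled rerun does not fail because the table already exists
# (use mode("append") instead for incremental loads)
df.write.format("delta").mode("overwrite").saveAsTable(f"silver.{table_name}")
spark.stop()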

Use Spark Job Definitions when: You have a stable, production PySpark script that does not need interactive cell-by-cell execution.

High Concurrency Mode

By default, each notebook gets its own Spark session (isolated resources, independent failures). High Concurrency Mode shares one Spark session across multiple notebooks:

Default mode:
  Notebook A → Session 1 (own driver, own executors)
  Notebook B → Session 2 (own driver, own executors)
  Notebook C → Session 3 (own driver, own executors)
  Total: 3 sessions × cluster startup time = expensive

High Concurrency mode:
  Notebook A ─┐
  Notebook B ─┼→ Session 1 (shared driver, shared executors)
  Notebook C ─┘
  Total: 1 session, faster startup, less memory used

Enable: Open notebook → click the Session dropdown in the toolbar → select High Concurrency.

Scenario	Best Mode
Multiple team members doing ad-hoc exploration	High Concurrency (shared resources, fast start)
Production pipeline notebooks	Standard (isolated, no interference between notebooks)
Notebooks with different library requirements	Standard (each gets its own environment)
Quick data validation by multiple analysts	High Concurrency (saves CUs)

Lakehouse Attached vs Multi-Lakehouse

Every notebook has a default lakehouse (pinned). You can also access other lakehouses explicitly:

# Default lakehouse (pinned via the left panel):
df = spark.table("customers")              # Reads from default lakehouse
spark.sql("SELECT * FROM orders")          # Same — default lakehouse

# Access OTHER lakehouses (cross-lakehouse queries):
df_silver = spark.table("silver_lakehouse.dbo.customers_clean")
df_gold = spark.table("gold_lakehouse.dbo.dim_customer")

# Write to a different lakehouse:
df.write.format("delta").saveAsTable("silver_lakehouse.dbo.customers_clean")

# Access Warehouse tables from a notebook:
df_wh = spark.sql("SELECT * FROM gold_warehouse.dbo.fact_sales")

Best practice: Pin your most-used lakehouse as default (the one you read from most). Reference other lakehouses with the full three-part name lakehouse_name.schema.table (on lakehouses without schemas enabled, use lakehouse_name.table). You can attach multiple lakehouses from the left panel: click Add Lakehouse and select additional ones. They appear in the explorer for browsing, but only the pinned one is the default for unqualified table names.

Common Mistakes

Not setting shuffle partitions for small data — 200 partitions for 10,000 rows creates 200 tiny files. Set to 8-20 for small datasets.
Disabling AQE — AQE is free optimization. Never disable it. It automatically coalesces partitions, optimizes joins, and handles skew.
Not using broadcast for small dimension tables — shuffling a 10,000-row dimension table across the cluster wastes network I/O. Broadcast it.
Writing without optimizeWrite — creates many small files. Enable optimizeWrite and autoCompact for all production writes.
Not enabling V-Order on Gold layer tables — new workspaces default to writeHeavy, which disables V-Order. For Gold tables consumed by Power BI, switch to readHeavyForPBI profile or manually enable spark.sql.parquet.vorder.default = true. Without V-Order, Direct Lake and SQL endpoint reads are ~50% slower.
Not checking the Spark UI — when a notebook is slow, the Spark UI tells you exactly why. One skewed partition, too many shuffles, or OOM errors — all visible in the UI.

Interview Questions

Q: What is the most impactful Spark configuration for performance? A: Shuffle partitions (spark.sql.shuffle.partitions). The default of 200 suits large data but creates overhead for small-to-medium data. Combined with AQE (Adaptive Query Execution), which auto-coalesces partitions at runtime, these two settings resolve a large share of everyday performance issues.

Q: What is V-Order in Fabric? A: A Fabric-specific write optimization that sorts data within Parquet files for optimal columnar access. In new workspaces, V-Order is disabled by default (because the writeHeavy resource profile is the default). It is automatically enabled when you switch to readHeavyForPBI. Improves Power BI Direct Lake performance by ~50% and SQL endpoint query speed. For Gold layer tables consumed by Power BI, always ensure V-Order is on.

Q: What are Spark resource profiles in Fabric and when would you use each? A: Resource profiles are predefined Spark configuration sets. writeHeavy (default) disables V-Order and optimizes for ingestion — use for Bronze layer ETL. readHeavyForSpark enables optimizeWrite for faster reads — use for Silver dimension tables. readHeavyForPBI enables V-Order and uses 1 GB bin sizes — use for Gold layer tables consumed by Power BI Direct Lake. Set via spark.conf.set("spark.fabric.resourceProfile", "readHeavyForPBI") at runtime or in the Environment’s Spark Configurations.

Q: How do you handle data skew in Spark? A: Three approaches: enable AQE skew join optimization (automatic), broadcast the smaller table in a join, or salt the skewed key (add a random suffix to distribute data evenly). Check the Spark UI for tasks that take 10x longer than others — that indicates skew.

Wrapping Up

Spark in Fabric is not just “write PySpark and click run.” Understanding shuffle partitions, AQE, V-Order, broadcast joins, and the Spark UI transforms you from someone who writes Spark to someone who masters it. The configurations in this post can turn a 45-minute notebook into a 5-minute one.


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
