Fabric Optimization Guide: Lakehouse, Pipelines, Warehouse, Spark, Eventstream, and Query Performance Tuning Across All Workloads

Performance optimization is the skill that separates a working data platform from a FAST data platform. The gap shows up as a pipeline that takes 2 hours instead of 20 minutes, a dashboard that loads in 15 seconds instead of 2, or a Spark notebook that costs $50 per run instead of $5.

This post consolidates optimization techniques across ALL Fabric workloads: Lakehouse, Pipelines, Warehouse, Spark, Eventstream, and queries. One reference for all optimization needs.

Table of Contents

Lakehouse Optimization
  Delta Table Optimization (OPTIMIZE, VACUUM, Z-ORDER, V-Order)
  File Size and Partitioning
  Schema Design
Pipeline Optimization
  Copy Activity Tuning
  Parallel Execution
  Scheduling Strategy
  Reducing Pipeline Duration
Warehouse Optimization
  Statistics and Query Plans
  Result Set Caching
  Table Design
  Query Optimization Patterns
Spark Optimization
  Shuffle Partitions
  Broadcast Joins
  AQE (Adaptive Query Execution)
  Memory Management
  Caching and Persistence
Eventstream and Eventhouse Optimization
  Ingestion Optimization
  KQL Query Optimization
  Retention and Caching Policies
Query Performance (Cross-Workload)
  Common Query Anti-Patterns
  Index-Like Optimization in Fabric
Capacity Optimization
  Right-Sizing
  Staggering Workloads
  Pause/Resume
The Optimization Checklist
Common Mistakes
Interview Questions
Wrapping Up

Lakehouse Optimization

Delta Table Optimization (OPTIMIZE, VACUUM, Z-ORDER, V-Order)

%%sql
-- OPTIMIZE: Compact small files into larger ones (faster reads)
OPTIMIZE silver.customers_clean;

-- OPTIMIZE with Z-ORDER: Co-locate data by frequently filtered columns
OPTIMIZE gold.fact_sales ZORDER BY (customer_key, date_key);
-- Queries filtering on customer_key or date_key can skip far more files (often several times faster; measure on your own data)

-- VACUUM: Remove old file versions (save storage)
VACUUM silver.customers_clean RETAIN 168 HOURS;  -- Keep 7 days of history

-- Check table details (file count, size)
DESCRIBE DETAIL gold.fact_sales;

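V-Order is a Fabric-specific write optimization that sorts and encodes data within Parquet files for faster Direct Lake reads (around 50% in favorable cases), at the cost of somewhat slower writes. It has been enabled by default, but newer runtimes and workspaces may ship with it turned off, so verify with spark.conf.get("spark.sql.parquet.vorder.default") (older runtimes use spark.sql.parquet.vorder.enabled).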

When to run: OPTIMIZE weekly or after large loads. VACUUM weekly after OPTIMIZE. Z-ORDER on columns used in WHERE clauses and JOINs.

File Size and Partitioning

# Target: 128MB - 1GB per file
# Too many small files = slow reads (metadata overhead)
# Too few large files = slow writes (memory pressure)

# For small tables (< 1M rows): single partition
df.coalesce(1).write.format("delta").saveAsTable("small_table")

# For large tables: partition by date
df.write.format("delta").partitionBy("year", "month").saveAsTable("large_table")

# Enable auto-optimize (compacts during writes)
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
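
To decide whether a table actually needs compaction, the file count and size reported by DESCRIBE DETAIL can be compared against the target file size. The following is a minimal sketch in plain Python, assuming you have already read the numFiles and sizeInBytes values from DESCRIBE DETAIL. The compaction_report helper and the "more than 4x the ideal file count" rule are hypothetical illustrations, not Fabric defaults.

```python
import math

TARGET_FILE_BYTES = 128 * 1024 * 1024  # lower end of the 128 MB - 1 GB target


def compaction_report(num_files: int, size_in_bytes: int,
                      target_file_bytes: int = TARGET_FILE_BYTES) -> dict:
    """Summarize whether a table looks fragmented.

    num_files and size_in_bytes correspond to the numFiles and sizeInBytes
    columns returned by DESCRIBE DETAIL.
    """
    if num_files <= 0:
        return {"avg_file_mb": 0.0, "ideal_files": 0, "needs_optimize": False}
    avg_bytes = size_in_bytes / num_files
    ideal_files = max(1, math.ceil(size_in_bytes / target_file_bytes))
    # Hypothetical rule of thumb: flag the table when it holds more than
    # 4x the ideal number of files.
    needs_optimize = num_files > 4 * ideal_files
    return {
        "avg_file_mb": round(avg_bytes / (1024 * 1024), 2),
        "ideal_files": ideal_files,
        "needs_optimize": needs_optimize,
    }


# Example: a 10 GiB table spread over 5,000 files
report = compaction_report(num_files=5000, size_in_bytes=10 * 1024**3)
print(report)
```

A table that comes back with needs_optimize set is a reasonable candidate for the OPTIMIZE command shown earlier; the threshold itself is something to tune for your own workloads.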

Schema Design

Star schema (fact + dimensions) outperforms wide denormalized tables:

✅ Star Schema (optimized):
  fact_sales: date_key, customer_key, product_key, amount, quantity
  dim_customer: customer_key, name, city, segment
  dim_product: product_key, name, category, price
  dim_date: date_key, date, month, quarter, year

  → Smaller fact table (narrow, only keys + measures)
  → Dimensions queried only when needed (star join)
  → Direct Lake loads faster (less data to cache)
  → Z-ORDER on fact keys is highly effective

❌ Wide Denormalized Table (slow):
  sales_wide: date, customer_name, customer_city, customer_segment,
              product_name, product_category, product_price, amount, quantity

  → Every row repeats customer and product data (storage waste)
  → Direct Lake caches redundant data (hits guardrails sooner)
  → Z-ORDER less effective on wide tables

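Data types matter: Use INT (4 bytes) instead of BIGINT (8 bytes) for keys when the value range allows it. Use DATE instead of a full timestamp (8 bytes) when the time of day is not needed. These are uncompressed sizes, and Parquet compression narrows the gap, but narrower types still mean smaller files, faster reads, and less CU consumption.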
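
The effect of key width is easy to estimate with back-of-the-envelope arithmetic. This is a rough sketch only: the one-billion-row table is a hypothetical example, and the numbers are raw sizes before Parquet encoding and compression, so real savings on disk will be smaller.

```python
def uncompressed_bytes(row_count: int, bytes_per_value: int, columns: int = 1) -> int:
    """Raw (pre-compression) size of fixed-width columns."""
    return row_count * bytes_per_value * columns


rows = 1_000_000_000   # hypothetical fact table size
key_columns = 3        # date_key, customer_key, product_key

as_bigint = uncompressed_bytes(rows, 8, key_columns)
as_int = uncompressed_bytes(rows, 4, key_columns)
saved_gb = (as_bigint - as_int) / 1e9

print(f"BIGINT keys: {as_bigint / 1e9:.0f} GB")
print(f"INT keys:    {as_int / 1e9:.0f} GB")
print(f"Saved:       {saved_gb:.0f} GB")
```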

Pipeline Optimization

Copy Activity Tuning

Parallel copy: Increase DIU (Data Integration Units) for large copies
  Default: Auto → Let Fabric decide
  Manual: Set higher for large tables (10-50 DIU)

Degree of copy parallelism:
  Default: 4
  For large tables: 8-16 (more parallel threads)

Staging: Enable staging for cross-region copies
  Source (US East) → Staging (blob) → Destination (Canada Central)

Parallel Execution

❌ SLOW: Sequential
  Copy_Customers → Copy_Orders → Copy_Products → Copy_Returns
  Total: 4 × 5 min = 20 minutes

✅ FAST: Parallel (ForEach with Sequential unchecked, i.e. isSequential=false)
  ForEach → [Copy_Customers, Copy_Orders, Copy_Products, Copy_Returns]
  Total: 5 minutes (all run at once)
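
The sequential and parallel totals above can be reproduced with a small scheduling simulation. This is a sketch under simplifying assumptions: each copy takes a fixed number of minutes, capacity is not throttled, and batch_count stands in for how many ForEach iterations are allowed to run at once. The foreach_duration_minutes function is a hypothetical helper, not part of any Fabric API.

```python
import heapq


def foreach_duration_minutes(activity_minutes, batch_count):
    """Estimate wall-clock time when up to batch_count items run at once.

    Each item starts as soon as a slot becomes free.
    """
    if batch_count < 1:
        raise ValueError("batch_count must be at least 1")
    slots = [0.0] * min(batch_count, max(1, len(activity_minutes)))
    heapq.heapify(slots)
    for minutes in activity_minutes:
        start = heapq.heappop(slots)            # earliest free slot
        heapq.heappush(slots, start + minutes)  # slot is busy until then
    return max(slots)


copies = [5, 5, 5, 5]  # Copy_Customers, Copy_Orders, Copy_Products, Copy_Returns

sequential = foreach_duration_minutes(copies, batch_count=1)
parallel = foreach_duration_minutes(copies, batch_count=4)
print(f"Sequential: {sequential:.0f} min, parallel: {parallel:.0f} min")
```

The same function shows the middle ground: with two items at a time, the four copies finish in 10 minutes.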

Scheduling Strategy

❌ BAD: All pipelines at 6:00 AM
  → CU spike → throttling → pipelines delayed → stale dashboards

✅ GOOD: Staggered schedule
  6:00 AM: PL_Ingest (high priority)
  6:15 AM: PL_Transform (depends on ingest)
  6:30 AM: PL_Gold_Build (depends on transform)
  6:45 AM: Semantic Model Refresh
  → Spread load → no throttling → predictable completion

Reducing Pipeline Duration

Techniques to cut pipeline duration:

1. INCREMENTAL LOADING (biggest impact):
   ❌ Full load: Copy ALL 100M rows every run → 45 minutes
   ✅ Incremental: Copy only NEW/CHANGED rows since last run → 2 minutes
   Use: watermark column (modified_date), Change Data Capture, or Change Tracking

2. SKIP UNCHANGED TABLES:
   Add a Lookup + If Condition at the start:
     Lookup: "SELECT MAX(modified_date) FROM source"
     If Condition: IF max_date > last_run_date → run Copy, ELSE skip
   Result: pipelines that skip 80% of tables (no new data) finish in 5 min not 30

3. REPLACE DATAFLOW GEN2 WITH NOTEBOOKS for large data:
   Dataflow Gen2: Power Query engine → struggles with 50M+ rows → 30 min
   Notebook: Spark engine → handles 50M+ rows easily → 5 min
   Use DFG2 for small transforms, notebooks for heavy lifting

4. NOTEBOOK OPTIMIZATION:
   Start notebook with: %run NB_Config (not %pip install)
   Use Environment (pre-installed libs) not %pip (2-3 min wasted per run)
   Exit early on no-data: if df.isEmpty(): mssparkutils.notebook.exit("SKIP")  (isEmpty() avoids the full scan that df.count() triggers)
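
The watermark logic behind techniques 1 and 2 can be sketched in a few lines of plain Python. This is an illustration only: rows are modeled as dictionaries, and incremental_rows is a hypothetical helper. In a real pipeline the filter would be pushed into the source query and the watermark stored in a control table.

```python
from datetime import datetime


def incremental_rows(rows, last_watermark):
    """Return (rows changed since last_watermark, new watermark)."""
    new_rows = [r for r in rows if r["modified_date"] > last_watermark]
    if not new_rows:
        # Nothing changed: keep the old watermark so the caller can skip the copy
        return [], last_watermark
    new_watermark = max(r["modified_date"] for r in new_rows)
    return new_rows, new_watermark


source = [
    {"id": 1, "modified_date": datetime(2026, 1, 1, 8, 0)},
    {"id": 2, "modified_date": datetime(2026, 1, 2, 9, 30)},
    {"id": 3, "modified_date": datetime(2026, 1, 3, 7, 15)},
]
last_run = datetime(2026, 1, 2, 0, 0)

changed, watermark = incremental_rows(source, last_run)
print([r["id"] for r in changed], watermark)

# A second run with the new watermark finds nothing and can be skipped
changed_again, watermark_again = incremental_rows(source, watermark)
print(changed_again, watermark_again)
```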

Warehouse Optimization

Statistics and Query Plans

-- Create statistics for better query plans
CREATE STATISTICS stat_fact_date ON gold.fact_sales (date_key);
CREATE STATISTICS stat_fact_cust ON gold.fact_sales (customer_key);

-- Update after large loads
UPDATE STATISTICS gold.fact_sales;

-- Without statistics: optimizer guesses row counts → bad plans
-- With statistics: optimizer knows actual data distribution → good plans

-- View the query execution plan:
-- In the Warehouse query editor, click "Explain" before running
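-- Look for: full scans of large tables where a filter should have eliminated most of the data (Fabric has no B-tree indexes, so there are no seek operations to look for)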
-- Look for: estimated vs actual row counts (large mismatch = stale statistics)

Result Set Caching

-- Enable: repeated queries return cached results instantly
ALTER DATABASE CURRENT SET RESULT_SET_CACHING ON;

-- A query that takes 30 seconds runs once, then returns in <1 second
-- Cache invalidated when underlying data changes

Table Design

-- Star schema in Warehouse: fact table + dimension tables
-- Fact tables: narrow (keys + measures only), millions/billions of rows
-- Dimension tables: wide (descriptive attributes), thousands of rows

-- Use appropriate data types:
CREATE TABLE gold.fact_sales (
    date_key        INT,                -- Not BIGINT (saves 50% storage on key columns)
    customer_key    INT,
    product_key     INT,
    amount          DECIMAL(10,2),      -- Not FLOAT (exact for money)
    quantity        SMALLINT,           -- Not INT (values 0-32K fit in 2 bytes)
    order_date      DATE                -- Not DATETIME2 (3 bytes vs 6-8 bytes)
);

-- Create views for complex reporting queries (reuse, not rewrite):
CREATE VIEW reports.vw_daily_revenue AS
SELECT d.full_date, d.month_name, d.quarter,
       SUM(f.amount) AS revenue, COUNT(*) AS orders
FROM gold.fact_sales f
JOIN gold.dim_date d ON f.date_key = d.date_key
GROUP BY d.full_date, d.month_name, d.quarter;

Query Optimization Patterns

-- ❌ SLOW: SELECT * (reads all columns from columnar storage)
SELECT * FROM gold.fact_sales;

-- ✅ FAST: SELECT only needed columns
SELECT date_key, customer_key, amount FROM gold.fact_sales;

-- ❌ SLOW: Functions on filter columns
SELECT * FROM gold.fact_sales WHERE YEAR(order_date) = 2026;

-- ✅ FAST: Range predicate (enables partition elimination)
SELECT * FROM gold.fact_sales WHERE order_date >= '2026-01-01' AND order_date < '2027-01-01';

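-- ❌ SLOW: Correlated subquery (inner query references the outer row, so it may run per row)
SELECT c.customer_key,
       (SELECT SUM(f.amount) FROM fact_sales f WHERE f.customer_key = c.customer_key) AS revenue
FROM dim_customer c;

-- ✅ FAST: JOIN + GROUP BY (set-based, one pass over fact_sales)
SELECT c.customer_key, SUM(f.amount) AS revenue
FROM dim_customer c
LEFT JOIN fact_sales f ON c.customer_key = f.customer_key
GROUP BY c.customer_key;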

Spark Optimization

Shuffle Partitions

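# Default is 200 shuffle partitions, which is too many for small/medium data
# Every GROUP BY, JOIN, and window function shuffles into this many partitions (and tasks);
# writing the result straight to a table can then produce that many small files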

# Small data (< 1M rows):
spark.conf.set("spark.sql.shuffle.partitions", "8")

# Medium data (1M-100M rows):
spark.conf.set("spark.sql.shuffle.partitions", "50")

# Large data (100M+ rows): default 200 is fine

# BEST: let AQE auto-tune (see below)
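
The row-count thresholds above can be wrapped in a small helper so every notebook applies the same rule. This is a minimal sketch: suggest_shuffle_partitions is a hypothetical function name, and the cut-offs are simply the ones listed above, not Spark defaults.

```python
def suggest_shuffle_partitions(row_count: int) -> int:
    """Map an approximate row count to a shuffle partition setting."""
    if row_count < 1_000_000:
        return 8      # small data
    if row_count < 100_000_000:
        return 50     # medium data
    return 200        # large data: keep the Spark default


for rows in (10_000, 5_000_000, 250_000_000):
    print(rows, "->", suggest_shuffle_partitions(rows))

# In a notebook, the result would be applied like this (requires a Spark session):
# spark.conf.set("spark.sql.shuffle.partitions", str(suggest_shuffle_partitions(rows)))
```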

Broadcast Joins

from pyspark.sql.functions import broadcast

# ❌ SLOW: Both tables shuffled across executors
result = large_fact.join(small_dim, "key")

# ✅ FAST: Small table broadcasted to all executors (no shuffle)
result = large_fact.join(broadcast(small_dim), "key")

# Auto-broadcast threshold (default 10MB — increase for larger dims)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "50m")  # 50MB

AQE (Adaptive Query Execution)

# AQE re-optimizes the plan at runtime using actual shuffle statistics
# (on by default since Spark 3.2; setting it explicitly is harmless)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# What AQE does automatically:
# 1. Coalesces shuffle partitions (200 → optimal number)
# 2. Converts sort-merge join to broadcast join when one side is small
# 3. Handles skewed partitions (splits oversized partitions in joins)
# Dynamic partition pruning is a separate optimizer feature, not part of AQE;
# it is also on by default (spark.sql.optimizer.dynamicPartitionPruning.enabled)

Memory Management

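# Driver and executor memory are fixed when the Spark session starts, so
# spark.conf.set() cannot change them mid-session.
# In Fabric, set them in the Environment / Spark pool settings, or in a
# %%configure cell at the very top of the notebook (before the session starts):
#   %%configure
#   { "driverMemory": "56g", "executorMemory": "56g" }
# (the sizes you can request depend on the pool's node size)
# More driver memory helps collect() and toPandas(); more executor memory helps large joins and shuffles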

# Common OOM fixes:
# 1. collect() on large DataFrame → increase driver memory or avoid collect()
# 2. Large broadcast join → reduce broadcast threshold, let Spark use sort-merge
# 3. Many cached DataFrames → unpersist() unused ones
# 4. Skewed partition → one partition has 10x more data → use AQE or salt the key

# Monitor: check Spark UI → Executors tab → memory usage per executor
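
"Salting the key" in fix 4 means appending a random suffix to the hot key so its rows spread across several partitions. The sketch below shows the idea in plain Python with a made-up key distribution. The salted_key helper, the bucket count of 10, and the sample data are all illustrative assumptions. In a real join, the other side must also be replicated once per salt value so the salted keys still match.

```python
import random
from collections import Counter


def salted_key(key: str, salt_buckets: int, rng: random.Random) -> str:
    """Append a random bucket number so one hot key becomes several keys."""
    return f"{key}_{rng.randrange(salt_buckets)}"


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical skewed data: one key holds 90% of the rows
keys = ["hot"] * 9000 + ["a"] * 500 + ["b"] * 500

before = Counter(keys)
after = Counter(salted_key(k, 10, rng) if k == "hot" else k for k in keys)

largest_before = max(before.values())
largest_after = max(after.values())
hot_buckets = [k for k in after if k.startswith("hot_")]

print("Largest group before salting:", largest_before)
print("Largest group after salting: ", largest_after)
print("Hot key split into", len(hot_buckets), "buckets")
```

The largest group shrinks to roughly a tenth of its original size, which is what lets the work spread evenly across executors.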

Caching and Persistence

# Cache a DataFrame that is used multiple times
df_customers = spark.table("dim_customer").cache()
# First action: reads from storage and caches in memory
# Subsequent actions: reads from cache (instant)

# Use cache when:
#   ✅ DataFrame is used in 3+ joins or actions
#   ✅ DataFrame is expensive to compute (complex transformations)
# Do NOT cache when:
#   ❌ DataFrame is used once (cache overhead > benefit)
#   ❌ DataFrame is very large (exceeds executor memory)

# Unpersist when done (free memory)
df_customers.unpersist()

# Persist with storage level (when data doesn't fit in memory)
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)  # Spill to disk if memory full

Eventstream and Eventhouse Optimization

Ingestion Optimization

Eventstream ingestion tuning:

1. FILTER AT SOURCE (not at destination):
   ❌ Send all events → filter in Eventhouse query
   ✅ Add Filter transform in Eventstream → only relevant events reach Eventhouse
   Result: less storage, less ingestion CU, faster queries

2. BATCH SIZE:
   Larger batches = fewer writes = less overhead
   Default batching is usually fine, but for very high-volume sources:
   Increase Event Hub partition count (more parallel consumers)

3. SCHEMA:
   Flatten JSON before ingestion (Eventstream Manage Fields transform)
   Avoid storing deeply nested JSON — parse at ingestion, not query time

4. DUAL ROUTING:
   Raw events → Eventhouse (real-time queries, short retention)
   Aggregated events → Lakehouse (historical analysis, long retention)
   One Eventstream, two destinations, different granularity

KQL Query Optimization

// ALWAYS filter by time first (most important optimization)
sensor_readings
| where timestamp > ago(1h)          // Prunes partitions immediately
| where device_id == "sensor-042"    // Then filter further

// Use 'has' instead of 'contains' for text search (uses term index)
logs | where message has "error"     // ✅ Fast (term index)
logs | where message contains "err"  // ❌ Slow (full scan)

// Use materialized views for repeated aggregations
// Query raw table: scans 1 billion rows every time → 30 seconds
// Query materialized view: reads pre-computed result → 0.1 seconds

// Limit output (take/top instead of scanning everything)
sensor_readings | top 100 by timestamp desc   // Fast
sensor_readings | order by timestamp desc     // Slow (sorts ALL rows)

Retention and Caching Policies

// Retention: auto-delete old data (prevents storage growth)
.alter-merge table sensor_readings policy retention softdelete = 90d

// Caching: keep recent data on fast SSD (hot cache)
.alter table sensor_readings policy caching hot = 30d
// Last 30 days: sub-second queries (SSD)
// Older data: seconds (cold storage, still queryable)

// Materialized views: pre-compute frequent aggregations
.create materialized-view HourlyStats on table sensor_readings
{
    sensor_readings | summarize avg(temperature), count() by bin(timestamp, 1h), device_id
}

Query Performance (Cross-Workload)

Common Query Anti-Patterns

Anti-Pattern	Why It’s Slow	Fix
`SELECT *`	Reads ALL columns from columnar storage	Select only needed columns
Functions on filter columns	`WHERE YEAR(date)=2026` prevents partition pruning	Use range: `WHERE date >= '2026-01-01'`
No time filter on KQL	Scans entire Eventhouse table (billions of rows)	Add `| where timestamp > ago(1h)` first
Correlated subqueries	Executes inner query once per outer row	Rewrite as JOIN
DISTINCT on large tables	Requires full sort of all data	Use GROUP BY or dcount() (approximate)
Joining unoptimized tables	Thousands of small files = slow scan	Run OPTIMIZE before querying
`contains` in KQL	Substring scan that cannot use the term index	Use `has` (term index, often much faster)

Index-Like Optimization in Fabric

Fabric does not have traditional B-tree indexes (like SQL Server). Instead, it uses these techniques for fast data access:

Technique	What It Does	Equivalent To
Z-ORDER	Co-locates data by column values within files	Clustered index (approximate)
Data skipping	Min/max stats per file skip irrelevant files	Index seek (file-level)
V-Order	Sorts within Parquet for optimal columnar reads	Columnstore index
Partitioning	Physically separates data by partition column	Partition scheme
Statistics	Data distribution info for query optimizer	SQL Server statistics
Result set caching	Caches query results for repeated queries	Query result cache
Materialized views (KQL)	Pre-computed aggregations	Indexed/materialized view

The combination of Z-ORDER + data skipping + V-Order achieves query performance comparable to indexed tables — just with different mechanics. Instead of maintaining a B-tree, Delta maintains per-file statistics that the query engine uses to skip irrelevant files.
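
File-level data skipping is simple to picture: each file carries a minimum and maximum value per column, and a query reads only the files whose range overlaps its filter. The sketch below models that in plain Python. FileStats and files_to_read are hypothetical names for illustration, and real Delta statistics live in the transaction log rather than in Python objects.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    min_key: int
    max_key: int


def files_to_read(files, lo, hi):
    """Keep only files whose [min_key, max_key] range overlaps [lo, hi]."""
    return [f.path for f in files if f.max_key >= lo and f.min_key <= hi]


# Clustered layout (for example after Z-ORDER): each file covers a narrow key range
clustered = [
    FileStats("part-0", 1, 100),
    FileStats("part-1", 101, 200),
    FileStats("part-2", 201, 300),
]

# Unclustered layout: every file spans almost the whole key range
scattered = [
    FileStats("part-0", 1, 300),
    FileStats("part-1", 2, 299),
    FileStats("part-2", 1, 300),
]

clustered_hits = files_to_read(clustered, 120, 130)
scattered_hits = files_to_read(scattered, 120, 130)
print("Clustered:", clustered_hits)
print("Scattered:", scattered_hits)
```

With the clustered layout the filter touches one file; with the scattered layout the statistics cannot rule anything out, so all three files are read. This is why Z-ORDER and data skipping work as a pair.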

Capacity Optimization

Right-Sizing

Monitor first, then right-size:

1. Install Capacity Metrics app → observe 2 weeks of usage
2. Check average CU utilization:
   < 30% average → over-provisioned, downgrade
   50-70% average → well-sized (headroom for spikes)
   > 80% average → under-provisioned or needs optimization

Example:
  F64 capacity, average utilization 18% → downgrade to F16
  Savings: $8,384 - $2,096 = $6,288/month (approximate pay-as-you-go list prices, which vary by region)

  F16 capacity, throttling 10+ times/day → upgrade to F32 OR optimize workloads
  Try optimization first (cheaper than upgrading)
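
Before downgrading, it helps to project what the same workload would look like on the smaller SKU. The sketch below assumes the F-SKU number equals its capacity units (F64 = 64 CU, F16 = 16 CU) and that average load stays constant. Both function names are hypothetical, and the "borderline" label covers the ranges the thresholds above leave unspecified.

```python
def classify_utilization(avg_pct: float) -> str:
    """Apply the average-utilization thresholds from the list above."""
    if avg_pct < 30:
        return "over-provisioned"
    if avg_pct > 80:
        return "under-provisioned"
    if 50 <= avg_pct <= 70:
        return "well-sized"
    return "borderline"  # 30-50% and 70-80% are not covered by the thresholds


def projected_utilization(current_cu: int, avg_pct: float, target_cu: int) -> float:
    """Average utilization if the same load ran on a capacity with target_cu units."""
    return current_cu * avg_pct / target_cu


current = classify_utilization(18)
after_downgrade = projected_utilization(current_cu=64, avg_pct=18, target_cu=16)

print(f"F64 at 18%: {current}")
print(f"Same load on F16: {after_downgrade:.0f}% ({classify_utilization(after_downgrade)})")
```

An average is only half the picture: check peak utilization in the Capacity Metrics app as well, since a capacity that looks fine on average can still throttle during spikes.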

Staggering Workloads

CU usage over time with staggering:

❌ All at 6 AM (spike, throttling):
  |████████████████████|  ← 200% capacity at 6 AM
  |                    |  ← 5% rest of day
  Throttling delays everything by 30+ minutes

✅ Staggered (smooth, no throttling):
  |████ ████ ████ ████|  ← 60% capacity, spread 6-7 AM
  |     ████          |  ← 30% capacity, 7-8 AM
  Everything finishes on time, no throttling

How to stagger:
  Pipeline triggers: offset by 10-15 minutes
  Notebook schedules: round-robin across the hour
  Semantic Model refresh: after pipelines complete (not at fixed time)
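
Offsets like these are easy to generate rather than maintain by hand. A minimal sketch, assuming a fixed offset between pipelines: staggered_start_times is a hypothetical helper, the date is arbitrary, and the pipeline names are the ones used in the scheduling example earlier.

```python
from datetime import datetime, timedelta


def staggered_start_times(names, first_start, offset_minutes):
    """Assign each pipeline a start time offset_minutes after the previous one."""
    return {
        name: first_start + timedelta(minutes=i * offset_minutes)
        for i, name in enumerate(names)
    }


schedule = staggered_start_times(
    ["PL_Ingest", "PL_Transform", "PL_Gold_Build"],
    first_start=datetime(2026, 1, 5, 6, 0),
    offset_minutes=15,
)

for name, start in schedule.items():
    print(start.strftime("%H:%M"), name)
```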

Pause/Resume

Pause dev/test capacities during off-hours:

  Dev (F4): Pause 8 PM → Resume 7 AM weekdays. Pause all weekend.
    Hours running: 13h × 5 days = 65h/week (vs 168h = ~61% savings)
    Savings: ~$320/month on F4

  Automate: Azure Automation runbook or Logic App
    Trigger: Recurrence (8 PM daily) → Action: Pause capacity
    Trigger: Recurrence (7 AM Mon-Fri) → Action: Resume capacity

  NEVER pause production if:
    - Pipelines run overnight
    - Reports need 24/7 access
    - Streaming (Eventstream) is active
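
The savings from a pause schedule follow directly from the hours the capacity stays up. The sketch below counts the hours between the 7 AM resume and the 8 PM pause on weekdays and compares them with a full 168-hour week. The helper names are hypothetical, and the calculation assumes billing is proportional to running hours.

```python
HOURS_PER_WEEK = 7 * 24


def weekly_running_hours(resume_hour: int, pause_hour: int, days_per_week: int) -> int:
    """Hours per week the capacity is up, given daily resume and pause times."""
    return (pause_hour - resume_hour) * days_per_week


def savings_fraction(running_hours: int) -> float:
    """Share of an always-on week that is no longer billed."""
    return 1 - running_hours / HOURS_PER_WEEK


running = weekly_running_hours(resume_hour=7, pause_hour=20, days_per_week=5)
savings = savings_fraction(running)

print(f"Running {running}h of {HOURS_PER_WEEK}h per week")
print(f"Savings: {savings:.0%}")
```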

The Optimization Checklist

LAKEHOUSE:
  ☐ Run OPTIMIZE on Gold tables weekly
  ☐ Run VACUUM to clean old versions
  ☐ Z-ORDER on frequently filtered columns
  ☐ V-Order enabled (default — verify)
  ☐ Enable optimizeWrite and autoCompact
  ☐ Partition large tables by date

PIPELINE:
  ☐ Use parallel ForEach (Sequential unchecked)
  ☐ Stagger pipeline schedules (not all at 6 AM)
  ☐ Set retry on Copy activities (2-3 retries)
  ☐ Use event triggers instead of frequent polling
  ☐ Incremental loads instead of full loads

WAREHOUSE:
  ☐ Create statistics on filter/join columns
  ☐ Update statistics after large loads
  ☐ Enable result set caching
  ☐ SELECT only needed columns (never SELECT *)
  ☐ Use range predicates, not functions on columns

SPARK:
  ☐ Set shuffle partitions (8-50 for small/medium data)
  ☐ Enable AQE
  ☐ Broadcast small dimension tables in joins
  ☐ Filter early in the query plan
  ☐ Cache reused DataFrames, unpersist when done

EVENTHOUSE:
  ☐ Set retention policies (30-90 days)
  ☐ Configure hot cache duration
  ☐ Create materialized views for frequent aggregations
  ☐ Filter by time first in every KQL query

CAPACITY:
  ☐ Monitor CU usage with Capacity Metrics app
  ☐ Right-size your F-SKU
  ☐ Pause dev/test during off-hours
  ☐ Stagger heavy workloads

Common Mistakes

Not running OPTIMIZE: thousands of small files can slow reads by an order of magnitude.
SELECT * everywhere: columnar storage reads only the requested columns, so SELECT * forces it to read ALL of them.
All pipelines at 6 AM: CU spike = throttling = delayed dashboards.
200 shuffle partitions for small data: 200 partitions for 10K rows = 200 tiny files.
No statistics on Warehouse tables: the optimizer makes bad query plans without knowledge of the data distribution.

Interview Questions

Q: How do you optimize a Lakehouse table in Fabric? A: Run OPTIMIZE to compact small files, Z-ORDER on frequently filtered columns, VACUUM to remove old versions, enable V-Order for optimal columnar reads, and enable optimizeWrite + autoCompact for ongoing maintenance. Partition large tables by date for faster range queries.

Q: How do you optimize Spark performance in Fabric? A: Right-size shuffle partitions (reduce from default 200 for medium data), enable AQE for automatic optimization, broadcast small tables in joins, filter data early in the query, cache frequently reused DataFrames, and enable optimizeWrite on Delta writes.

Wrapping Up

Optimization is not a one-time task — it is an ongoing practice. Run OPTIMIZE weekly, update statistics after loads, monitor CU usage, and review slow queries monthly. The optimization checklist in this post covers every Fabric workload.


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
