Fabric Optimization Guide: Lakehouse, Pipelines, Warehouse, Spark, Eventstream, and Query Performance Tuning Across All Workloads
Performance optimization is the skill that separates a working data platform from a FAST data platform. A pipeline that takes 2 hours instead of 20 minutes. A dashboard that loads in 15 seconds instead of 2 seconds. A Spark notebook that costs $50 per run instead of $5.
This post consolidates optimization techniques across ALL Fabric workloads: Lakehouse, Pipelines, Warehouse, Spark, Eventstream, and queries. One reference for all optimization needs.
Table of Contents
- Lakehouse Optimization
- Delta Table Optimization (OPTIMIZE, VACUUM, Z-ORDER, V-Order)
- File Size and Partitioning
- Schema Design
- Pipeline Optimization
- Copy Activity Tuning
- Parallel Execution
- Scheduling Strategy
- Reducing Pipeline Duration
- Warehouse Optimization
- Statistics and Query Plans
- Result Set Caching
- Table Design
- Query Optimization Patterns
- Spark Optimization
- Shuffle Partitions
- Broadcast Joins
- AQE (Adaptive Query Execution)
- Memory Management
- Caching and Persistence
- Eventstream and Eventhouse Optimization
- Ingestion Optimization
- KQL Query Optimization
- Retention and Caching Policies
- Query Performance (Cross-Workload)
- Common Query Anti-Patterns
- Index-Like Optimization in Fabric
- Capacity Optimization
- Right-Sizing
- Staggering Workloads
- Pause/Resume
- The Optimization Checklist
- Common Mistakes
- Interview Questions
- Wrapping Up
Lakehouse Optimization
Delta Table Optimization
%%sql
-- OPTIMIZE: Compact small files into larger ones (faster reads)
OPTIMIZE silver.customers_clean;
-- OPTIMIZE with Z-ORDER: Co-locate data by frequently filtered columns
OPTIMIZE gold.fact_sales ZORDER BY (customer_key, date_key);
-- Queries filtering on customer_key or date_key will be 5-10x faster
-- VACUUM: Remove old file versions (save storage)
VACUUM silver.customers_clean RETAIN 168 HOURS; -- Keep 7 days of history
-- Frees storage from deleted/updated file versions
-- Check table details (file count, size)
DESCRIBE DETAIL gold.fact_sales;
When to run: – OPTIMIZE: Weekly, or after large data loads – VACUUM: Weekly (after OPTIMIZE) – Z-ORDER: On columns frequently used in WHERE clauses and JOINs
V-Order (Fabric-Specific)
# V-Order is ON by default in Fabric — verify:
spark.conf.get("spark.sql.parquet.vorder.enabled") # Should be "true"
# V-Order sorts data within Parquet files for optimal columnar access
# Impact: ~50% faster Direct Lake reads, faster SQL endpoint queries
# Cost: Zero (applied during write, no extra storage or compute)
File Size and Partitioning
# Too many small files = slow reads (metadata overhead)
# Too few large files = slow writes (memory pressure)
# Target: 128MB - 1GB per file
# For small tables (< 1M rows): single partition
df.coalesce(1).write.format("delta").saveAsTable("small_table")
# For large tables: partition by date
df.write.format("delta").partitionBy("year", "month").saveAsTable("large_table")
# Enable auto-optimize (compacts during writes)
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
Pipeline Optimization
Copy Activity Tuning
Parallel copy: Increase DIU (Data Integration Units) for large copies
Default: Auto → Let Fabric decide
Manual: Set higher for large tables (10-50 DIU)
Degree of copy parallelism:
Default: 4
For large tables: 8-16 (more parallel threads)
Staging: Enable staging for cross-region copies
Source (US East) → Staging (blob) → Destination (Canada Central)
Parallel Execution
❌ SLOW: Sequential
Copy_Customers → Copy_Orders → Copy_Products → Copy_Returns
Total: 4 × 5 min = 20 minutes
✅ FAST: Parallel (ForEach with sequential=false)
ForEach → [Copy_Customers, Copy_Orders, Copy_Products, Copy_Returns]
Total: 5 minutes (all run at once)
Scheduling Strategy
❌ BAD: All pipelines at 6:00 AM
→ CU spike → throttling → pipelines delayed → stale dashboards
✅ GOOD: Staggered schedule
6:00 AM: PL_Ingest (high priority)
6:15 AM: PL_Transform (depends on ingest)
6:30 AM: PL_Gold_Build (depends on transform)
6:45 AM: Semantic Model Refresh
→ Spread load → no throttling → predictable completion
Warehouse Optimization
Statistics
-- Create statistics for better query plans
CREATE STATISTICS stat_fact_date ON gold.fact_sales (date_key);
CREATE STATISTICS stat_fact_cust ON gold.fact_sales (customer_key);
-- Update after large loads
UPDATE STATISTICS gold.fact_sales;
-- Without statistics: optimizer guesses row counts → bad plans
-- With statistics: optimizer knows actual data distribution → good plans
Result Set Caching
-- Enable: repeated queries return cached results instantly
ALTER DATABASE CURRENT SET RESULT_SET_CACHING ON;
-- A query that takes 30 seconds runs once, then returns in <1 second
-- Cache invalidated when underlying data changes
Query Optimization Patterns
-- ❌ SLOW: SELECT * (reads all columns from columnar storage)
SELECT * FROM gold.fact_sales;
-- ✅ FAST: SELECT only needed columns
SELECT date_key, customer_key, total_amount FROM gold.fact_sales;
-- ❌ SLOW: Functions on filter columns
SELECT * FROM gold.fact_sales WHERE YEAR(order_date) = 2026;
-- ✅ FAST: Range predicate (enables partition elimination)
SELECT * FROM gold.fact_sales WHERE order_date >= '2026-01-01' AND order_date < '2027-01-01';
-- ❌ SLOW: Correlated subquery (runs per row)
SELECT * FROM dim_customer WHERE customer_key IN (SELECT customer_key FROM fact_sales);
-- ✅ FAST: JOIN (set-based)
SELECT DISTINCT c.* FROM dim_customer c JOIN fact_sales f ON c.customer_key = f.customer_key;
Spark Optimization
# The top 5 Spark optimizations:
# 1. Right-size shuffle partitions
spark.conf.set("spark.sql.shuffle.partitions", "50") # Not 200 for medium data
# 2. Enable AQE (auto-optimization)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# 3. Broadcast small dimension tables
from pyspark.sql.functions import broadcast
result = large_fact.join(broadcast(small_dim), "key")
# 4. Filter early (reduce data before joins)
df = spark.table("fact_sales").filter(col("date") > "2026-01-01") # Filter FIRST
# 5. Cache frequently reused DataFrames
df_customers = spark.table("dim_customer").cache() # Cached in memory
# Use df_customers in multiple joins without re-reading
Eventstream and Eventhouse Optimization
// Retention: auto-delete old data
.alter table sensor_readings policy retention "{'SoftDeletePeriod': '90.00:00:00'}"
// Caching: keep hot data in SSD
.alter table sensor_readings policy caching hot = 30d
// Materialized views: pre-compute frequent aggregations
.create materialized-view HourlyStats on table sensor_readings
{
sensor_readings | summarize avg(temperature), count() by bin(timestamp, 1h), device_id
}
// Partition: optimize for time-range queries
.alter table sensor_readings policy partitioning
@'{"PartitionKeys": [{"ColumnName": "timestamp", "Kind": "UniformRange", "Properties": {"Reference": "2020-01-01", "RangeSize": "1.00:00:00"}}]}'
The Optimization Checklist
LAKEHOUSE:
☐ Run OPTIMIZE on Gold tables weekly
☐ Run VACUUM to clean old versions
☐ Z-ORDER on frequently filtered columns
☐ V-Order enabled (default — verify)
☐ Enable optimizeWrite and autoCompact
☐ Partition large tables by date
PIPELINE:
☐ Use parallel ForEach (sequential=false)
☐ Stagger pipeline schedules (not all at 6 AM)
☐ Set retry on Copy activities (2-3 retries)
☐ Use event triggers instead of frequent polling
WAREHOUSE:
☐ Create statistics on filter/join columns
☐ Update statistics after large loads
☐ Enable result set caching
☐ SELECT only needed columns (never SELECT *)
☐ Use range predicates, not functions on columns
SPARK:
☐ Set shuffle partitions (10-50 for medium data)
☐ Enable AQE
☐ Broadcast small dimension tables in joins
☐ Filter early in the query plan
☐ Cache reused DataFrames
EVENTHOUSE:
☐ Set retention policies (30-90 days)
☐ Configure hot cache duration
☐ Create materialized views for frequent aggregations
☐ Partition by timestamp
CAPACITY:
☐ Monitor CU usage with Capacity Metrics app
☐ Right-size your F-SKU
☐ Pause dev/test during off-hours
☐ Stagger heavy workloads
Common Mistakes
- Not running OPTIMIZE — thousands of small files degrade read performance by 10x.
- SELECT * everywhere — columnar storage reads only requested columns. SELECT * reads ALL columns wastefully.
- All pipelines at 6 AM — CU spike = throttling = delayed dashboards.
- 200 shuffle partitions for small data — 200 partitions for 10K rows = 200 tiny files.
- No statistics on Warehouse tables — optimizer makes bad query plans without data distribution knowledge.
Interview Questions
Q: How do you optimize a Lakehouse table in Fabric? A: Run OPTIMIZE to compact small files, Z-ORDER on frequently filtered columns, VACUUM to remove old versions, enable V-Order for optimal columnar reads, and enable optimizeWrite + autoCompact for ongoing maintenance. Partition large tables by date for faster range queries.
Q: How do you optimize Spark performance in Fabric? A: Right-size shuffle partitions (reduce from default 200 for medium data), enable AQE for automatic optimization, broadcast small tables in joins, filter data early in the query, cache frequently reused DataFrames, and enable optimizeWrite on Delta writes.
Wrapping Up
Optimization is not a one-time task — it is an ongoing practice. Run OPTIMIZE weekly, update statistics after loads, monitor CU usage, and review slow queries monthly. The optimization checklist in this post covers every Fabric workload.
Related posts: – Apache Spark in Fabric – Delta Lake Optimization – Administration & Cost – SQL Indexes & Execution Plans
Naveen Vuppula is a Senior Data Engineering Consultant based in Ontario, Canada.