Fabric Notebooks: The Complete Guide to Spark Environments, Library Management, mssparkutils, Multi-Language Cells, and Production Notebook Patterns
The notebook is the data engineer’s daily tool in Fabric. Every transformation, every Delta MERGE, every data quality check, every ML feature table — all built in notebooks. Yet most tutorials show you how to write spark.read.table() and stop there.
This post covers everything ELSE: how Spark environments work, how to install custom libraries, how to chain notebooks with %run and mssparkutils.notebook.run(), how to mix PySpark and SQL in the same notebook, how to manage sessions efficiently, and the production patterns that separate a prototype notebook from a production-grade one.
Think of a notebook as a Swiss Army knife. Our previous posts taught you the blade (PySpark) and the scissors (SparkSQL). This post teaches you the other 15 tools — the bottle opener (mssparkutils), the screwdriver (library management), the saw (notebook chaining), and the file (Spark configuration). Same knife, full capability.
Table of Contents
- Notebook Basics Refresher
- Creating and Organizing Notebooks
- The Four Languages in One Notebook
- PySpark (Default)
- SparkSQL with %%sql Magic
- Scala with %%scala Magic
- R with %%r Magic
- Mixing Languages: Sharing Data Between Cells
- Spark Environments
- What Is a Spark Environment
- Creating and Configuring an Environment
- Attaching an Environment to a Notebook
- Runtime Versions
- Library Management
- Installing PyPI Packages
- Installing Wheel Files
- Installing Jar Files (Scala/Java)
- Inline Installation (Quick and Dirty)
- Environment-Level Installation (Production)
- mssparkutils: The Utility Toolkit
- File System Utilities (mssparkutils.fs)
- Notebook Utilities (mssparkutils.notebook)
- Credentials Utilities (mssparkutils.credentials)
- Environment Utilities (mssparkutils.env)
- Notebook Chaining and Orchestration
- %run: Simple Chaining
- mssparkutils.notebook.run(): Parameterized Chaining
- mssparkutils.notebook.runMultiple(): Parallel Execution
- The Config/Functions/Main Pattern
- Notebook Parameters and Widgets
- Receiving Parameters from Pipelines
- Setting Default Values
- Returning Values to Pipelines
- Session Management
- High Concurrency Mode
- Session Timeouts and Configuration
- Starter Pools vs Custom Pools
- Spark Configuration for Performance
- Key Spark Settings
- Setting Configurations in Notebook
- Setting Configurations in Environment
- Notebook Resources (Embedded Files)
- Production Notebook Patterns
- Pattern 1: Config-Driven Notebook
- Pattern 2: Reusable Functions Notebook
- Pattern 3: Parameterized ETL Notebook
- Pattern 4: Error Handling and Logging
- Notebook Visualization and Display
- Common Mistakes
- Interview Questions
- Wrapping Up
Notebook Basics Refresher
A Fabric Notebook:
├── Cell 1 (PySpark): df = spark.read.table("customers")
├── Cell 2 (SQL): %%sql SELECT * FROM customers LIMIT 10
├── Cell 3 (PySpark): df_clean = df.filter(col("id").isNotNull())
├── Cell 4 (Markdown): ## Data Quality Report
└── Cell 5 (PySpark): df_clean.write.saveAsTable("silver.customers")
Each cell runs independently. Results display inline.
Cells can be PySpark, SQL, Scala, R, or Markdown — mixed freely.
Creating and Organizing Notebooks
Creating a Notebook
- Open your Fabric workspace
- Click + New item → Notebook
- Name it following a convention:
NB_Clean_Customersornb_bronze_to_silver - Attach a Lakehouse (left panel → Add Lakehouse)
Naming Conventions (Production)
NB_Bronze_Load_Customers ← Stage + purpose + entity
NB_Silver_Transform_Orders ← Clear what it does
NB_Gold_Build_Dim_Customer ← Which layer it writes to
NB_Utils_Common_Functions ← Reusable functions notebook
NB_Config_Settings ← Configuration notebook
Attaching a Lakehouse
Every notebook needs a default Lakehouse for table access:
- In the notebook, click Lakehouses in the left panel
- Click Add → select your lakehouse
- Click Pin as default (makes it the default for
spark.table()calls)
# With default lakehouse pinned:
df = spark.table("customers") # Reads from default lakehouse
df = spark.sql("SELECT * FROM orders") # Same — uses default lakehouse
# Access a DIFFERENT lakehouse explicitly:
df = spark.sql("SELECT * FROM other_lakehouse.dbo.products")
The Four Languages in One Notebook
PySpark (Default)
# Default language — no magic command needed
from pyspark.sql.functions import col, trim, lower, current_timestamp
df = spark.table("bronze.raw_customers")
df_clean = df.withColumn("email", lower(trim(col("email")))) .filter(col("customer_id").isNotNull())
df_clean.write.format("delta").mode("overwrite").saveAsTable("silver.customers")
print(f"Rows written: {df_clean.count()}")
SparkSQL with %%sql Magic
%%sql
-- Write SQL directly in a notebook cell
SELECT department, COUNT(*) AS headcount, ROUND(AVG(salary), 0) AS avg_salary
FROM silver.employees
GROUP BY department
ORDER BY avg_salary DESC
Results display as a formatted table automatically — no df.show() needed.
Scala with %%scala Magic
%%scala
// Useful for Spark-native operations or Java library interop
val df = spark.table("silver.customers")
val count = df.count()
println(s"Customer count: $count")
R with %%r Magic
%%r
# Useful for statistical analysis
library(SparkR)
df <- sql("SELECT salary FROM silver.employees")
summary(df)
Mixing Languages: Sharing Data Between Cells
The challenge: variables in PySpark cells are NOT visible in SQL cells. Use temp views to bridge:
# Cell 1 (PySpark): Create a temp view
df_filtered = spark.table("customers").filter(col("city") == "Toronto")
df_filtered.createOrReplaceTempView("toronto_customers")
%%sql
-- Cell 2 (SQL): Query the temp view created in PySpark
SELECT * FROM toronto_customers ORDER BY name
# Cell 3 (PySpark): Read the SQL result back
df_result = spark.sql("SELECT * FROM toronto_customers")
Temp views are the bridge between languages. PySpark creates them, SQL queries them, and they disappear when the session ends.
Spark Environments
What Is a Spark Environment
A Spark Environment is a reusable configuration that defines the Spark runtime version, installed libraries, and Spark settings. Instead of installing libraries in every notebook, you install them ONCE in an environment and attach the environment to your notebooks.
Without Environments:
Notebook A: %pip install pandas openpyxl requests
Notebook B: %pip install pandas openpyxl requests ← Same install, repeated
Notebook C: %pip install pandas openpyxl requests ← Again!
Each notebook spends 2-3 minutes installing on every run.
With Environments:
Environment: "data_eng_env"
Libraries: pandas, openpyxl, requests (pre-installed)
Notebook A: attached to data_eng_env → libraries ready instantly
Notebook B: attached to data_eng_env → libraries ready instantly
Notebook C: attached to data_eng_env → libraries ready instantly
Creating and Configuring an Environment
- + New item → Environment
- Name:
data_eng_environment - Public Libraries tab:
- Click Add from PyPI
- Search and add:
openpyxl,great-expectations,azure-identity - Spark Properties tab:
- Add custom Spark configurations (optional)
- Click Publish → Fabric builds the environment (takes a few minutes)
Attaching an Environment to a Notebook
- Open your notebook
- Click the Environment dropdown in the toolbar (shows “Workspace default”)
- Select your custom environment:
data_eng_environment - All libraries from that environment are now available
Runtime Versions
Fabric provides Spark runtime versions:
| Runtime | Spark Version | Python | Delta Lake | Status |
|---|---|---|---|---|
| Runtime 1.2 | Spark 3.4 | Python 3.10 | Delta 2.4 | GA |
| Runtime 1.3 | Spark 3.5 | Python 3.11 | Delta 3.0 | GA |
Choose the runtime in the Environment settings. Newer runtimes have performance improvements and updated libraries.
Library Management
Installing PyPI Packages
Method 1: Inline (per session — quick testing)
# Install in the current session only (not persistent)
%pip install openpyxl
%pip install great-expectations==0.18.0 # Specific version
%pip install azure-identity azure-storage-blob # Multiple packages
Method 2: Environment-level (persistent — production)
- Open your Environment item
- Public Libraries → Add from PyPI
- Add packages with optional version constraints
- Click Publish (rebuilds environment — takes 2-5 minutes)
- All notebooks using this environment get the libraries automatically
Installing Wheel Files
For custom or private packages:
# Upload .whl file to lakehouse Files/
# Then install from the file path
%pip install /lakehouse/default/Files/libs/my_custom_lib-1.0.0-py3-none-any.whl
Or upload the .whl to the Environment → Custom Libraries tab.
Installing Jar Files (Scala/Java)
For Scala/Java libraries (e.g., custom connectors):
- Open your Environment
- Custom Libraries → Upload .jar file
- Publish → available in
%%scalacells
Best Practice
Development: Use %pip install (quick iteration)
Production: Use Environments (consistent, no install time per run)
mssparkutils: The Utility Toolkit
mssparkutils is Fabric’s built-in utility library — similar to dbutils in Databricks.
File System Utilities (mssparkutils.fs)
# List files in a directory
files = mssparkutils.fs.ls("Files/raw_csv/")
for f in files:
print(f"{f.name} — {f.size} bytes — isDir: {f.isDir}")
# Check if a file/folder exists
exists = mssparkutils.fs.exists("Files/raw_csv/customers.csv")
# Copy a file
mssparkutils.fs.cp("Files/raw_csv/old.csv", "Files/archive/old.csv")
# Move a file (rename)
mssparkutils.fs.mv("Files/raw_csv/temp.csv", "Files/raw_csv/final.csv")
# Delete a file
mssparkutils.fs.rm("Files/archive/old.csv")
# Delete a directory recursively
mssparkutils.fs.rm("Files/temp_folder/", recurse=True)
# Create a directory
mssparkutils.fs.mkdirs("Files/output/2026/05/")
# Read a small text file
content = mssparkutils.fs.head("Files/config/settings.json", maxBytes=1024)
print(content)
# Write a small text file
mssparkutils.fs.put("Files/output/status.txt", "Pipeline completed at 2026-05-25", overwrite=True)
Notebook Utilities (mssparkutils.notebook)
# Run another notebook (synchronous — waits for completion)
result = mssparkutils.notebook.run("NB_Clean_Customers", timeout_seconds=600)
print(f"Result: {result}") # Gets the exit value from the called notebook
# Run with parameters
result = mssparkutils.notebook.run("NB_Clean_Customers", timeout_seconds=600,
arguments={"table_name": "customers", "load_date": "2026-05-25"})
# Run multiple notebooks in parallel
mssparkutils.notebook.runMultiple([
{"name": "NB_Clean_Customers", "timeout": 600},
{"name": "NB_Clean_Orders", "timeout": 600},
{"name": "NB_Clean_Products", "timeout": 600},
])
# All three run in parallel — returns when ALL complete
# Exit a notebook with a return value
mssparkutils.notebook.exit("SUCCESS: 15000 rows processed")
Credentials Utilities (mssparkutils.credentials)
# Get an Azure AD token (for calling REST APIs)
token = mssparkutils.credentials.getToken("https://management.azure.com/")
# Get a secret from Azure Key Vault
secret = mssparkutils.credentials.getSecret(
"https://my-keyvault.vault.azure.net/",
"my-secret-name"
)
# Use in API calls
import requests
headers = {"Authorization": f"Bearer {token}"}
response = requests.get("https://api.example.com/data", headers=headers)
Environment Utilities (mssparkutils.env)
# Get workspace information
workspace_id = mssparkutils.env.getWorkspaceId()
workspace_name = mssparkutils.env.getWorkspaceName()
lakehouse_id = mssparkutils.env.getLakehouseId()
notebook_id = mssparkutils.env.getNotebookId()
print(f"Workspace: {workspace_name} ({workspace_id})")
print(f"Lakehouse: {lakehouse_id}")
Notebook Chaining and Orchestration
%run: Simple Chaining
%run executes another notebook in the SAME session (shares variables):
# Cell 1: Run a functions notebook (loads all its functions into this session)
%run NB_Utils_Common_Functions
# Cell 2: Now you can use functions defined in NB_Utils_Common_Functions
df_clean = clean_dataframe(df) # Function defined in the other notebook
Important: %run executes in the same Spark session — all variables and functions from the called notebook become available in the calling notebook. Use for shared utility functions.
mssparkutils.notebook.run(): Parameterized Chaining
Runs a notebook in a SEPARATE session (isolated):
# Run with parameters and get the result
result = mssparkutils.notebook.run(
"NB_Bronze_To_Silver",
timeout_seconds=1800, # 30 minute timeout
arguments={
"source_table": "raw_customers",
"target_table": "customers_clean",
"load_date": "2026-05-25"
}
)
print(f"Notebook returned: {result}")
In the CALLED notebook (NB_Bronze_To_Silver):
# Read parameters
source = mssparkutils.widgets.get("source_table") # "raw_customers"
target = mssparkutils.widgets.get("target_table") # "customers_clean"
load_date = mssparkutils.widgets.get("load_date") # "2026-05-25"
# Do the work
df = spark.table(f"bronze.{source}")
# ... transform ...
df_clean.write.saveAsTable(f"silver.{target}")
# Return a value
mssparkutils.notebook.exit(f"SUCCESS: {df_clean.count()} rows")
mssparkutils.notebook.runMultiple(): Parallel Execution
# Run 3 notebooks in parallel
results = mssparkutils.notebook.runMultiple([
{
"name": "NB_Clean_Customers",
"timeoutPerCellInSeconds": 300,
"arguments": {"load_date": "2026-05-25"}
},
{
"name": "NB_Clean_Orders",
"timeoutPerCellInSeconds": 300,
"arguments": {"load_date": "2026-05-25"}
},
{
"name": "NB_Clean_Products",
"timeoutPerCellInSeconds": 300,
"arguments": {"load_date": "2026-05-25"}
}
])
# All three execute simultaneously — total time = slowest notebook, not sum
The Config/Functions/Main Pattern
Production notebooks follow a three-notebook pattern:
NB_Config_Settings (config)
→ Defines: database names, paths, table lists, environment variables
NB_Utils_Functions (shared functions)
→ Defines: clean_dataframe(), validate_schema(), log_activity()
NB_Main_ETL (main logic)
→ Cell 1: %run NB_Config_Settings ← Load config
→ Cell 2: %run NB_Utils_Functions ← Load functions
→ Cell 3: Main ETL logic using config and functions
# NB_Config_Settings
BRONZE_LAKEHOUSE = "bronze_lakehouse"
SILVER_LAKEHOUSE = "silver_lakehouse"
TABLES_TO_PROCESS = ["customers", "orders", "products"]
LOG_TABLE = "etl_log"
# NB_Utils_Functions
def clean_dataframe(df, trim_strings=True, drop_nulls_on=None):
from pyspark.sql.functions import trim, col
if trim_strings:
for c in [f.name for f in df.schema.fields if f.dataType.simpleString() == 'string']:
df = df.withColumn(c, trim(col(c)))
if drop_nulls_on:
df = df.filter(col(drop_nulls_on).isNotNull())
return df
def log_activity(table_name, rows, status, message=""):
spark.sql(f'''
INSERT INTO {LOG_TABLE} VALUES (
'{table_name}', {rows}, '{status}', '{message}', current_timestamp()
)
''')
# NB_Main_ETL
%run NB_Config_Settings
%run NB_Utils_Functions
for table in TABLES_TO_PROCESS:
try:
df = spark.table(f"{BRONZE_LAKEHOUSE}.{table}")
df_clean = clean_dataframe(df, drop_nulls_on="id")
df_clean.write.format("delta").mode("overwrite").saveAsTable(f"{SILVER_LAKEHOUSE}.{table}")
log_activity(table, df_clean.count(), "SUCCESS")
print(f"✅ {table}: {df_clean.count()} rows")
except Exception as e:
log_activity(table, 0, "FAILED", str(e))
print(f"❌ {table}: {str(e)}")
Notebook Parameters and Widgets
Receiving Parameters from Pipelines
When a notebook is called from a Fabric Pipeline (Notebook Activity):
# Pipeline passes: {"table_name": "customers", "load_date": "2026-05-25"}
# Read parameters in the notebook:
table_name = mssparkutils.widgets.get("table_name") # "customers"
load_date = mssparkutils.widgets.get("load_date") # "2026-05-25"
Setting Default Values
# Define parameters with defaults (for manual testing)
mssparkutils.widgets.text("table_name", "customers", "Table to process")
mssparkutils.widgets.text("load_date", "2026-05-25", "Load date")
mssparkutils.widgets.dropdown("mode", "overwrite", ["overwrite", "append"], "Write mode")
# Read (uses pipeline value if passed, default if running manually)
table_name = mssparkutils.widgets.get("table_name")
load_date = mssparkutils.widgets.get("load_date")
mode = mssparkutils.widgets.get("mode")
Returning Values to Pipelines
# At the end of the notebook
import json
result = json.dumps({
"rows_processed": df.count(),
"status": "SUCCESS",
"duration_seconds": 45
})
mssparkutils.notebook.exit(result)
# In the Pipeline, access with:
# @activity('Notebook_Activity').output.result.exitValue
Session Management
High Concurrency Mode
By default, each notebook gets its own Spark session. High Concurrency Mode shares one session across multiple notebooks — faster start, less memory:
- Open notebook → click the Session dropdown
- Select High Concurrency (if available)
- Multiple notebooks share the same Spark cluster
Session Timeouts
# Default session timeout: 20 minutes of inactivity
# Configure in notebook or environment:
spark.conf.set("spark.fabric.session.idle.timeout", "30m") # 30 minutes
Starter Pools vs Custom Pools
| Feature | Starter Pool | Custom Pool (via Environment) |
|---|---|---|
| Startup time | Near instant (~10 seconds) | 1-3 minutes |
| Configuration | Default | Fully customizable |
| Libraries | Default only | Custom libraries pre-installed |
| Best for | Ad-hoc exploration, quick queries | Production notebooks |
Spark Configuration for Performance
Key Spark Settings
# Shuffle partitions (default 200 — too many for small data, too few for big)
spark.conf.set("spark.sql.shuffle.partitions", "50") # For medium datasets
# Adaptive Query Execution (auto-optimizes at runtime)
spark.conf.set("spark.sql.adaptive.enabled", "true") # Default in Fabric
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Delta Lake optimizations
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
# Memory management
spark.conf.set("spark.driver.memory", "4g")
spark.conf.set("spark.executor.memory", "8g")
Setting Configurations in Environment
For persistent settings, add them to the Environment → Spark Properties tab:
spark.sql.shuffle.partitions = 50
spark.databricks.delta.optimizeWrite.enabled = true
spark.databricks.delta.autoCompact.enabled = true
These apply to ALL notebooks using this environment — no per-notebook configuration needed.
Notebook Resources (Embedded Files)
Upload small files (config, lookup data) directly to the notebook:
- Click Resources in the notebook sidebar
- Upload a file (e.g.,
config.json,lookup.csv) - Access in code:
import json
# Read an embedded resource file
with open("/synfs/nb_resource/config.json", "r") as f:
config = json.load(f)
print(config)
Use for: Small config files, lookup tables, schema definitions. NOT for large data files.
Production Notebook Patterns
Pattern 1: Config-Driven Notebook
Shown above in the Config/Functions/Main pattern. All settings in one place, all functions reusable, main logic clean.
Pattern 2: Error Handling and Logging
import json
from datetime import datetime
start_time = datetime.now()
results = {"tables_processed": 0, "tables_failed": 0, "errors": []}
for table in TABLES_TO_PROCESS:
try:
df = spark.table(f"bronze.{table}")
df_clean = clean_dataframe(df)
df_clean.write.format("delta").mode("overwrite").saveAsTable(f"silver.{table}")
results["tables_processed"] += 1
print(f"✅ {table}: {df_clean.count()} rows")
except Exception as e:
results["tables_failed"] += 1
results["errors"].append({"table": table, "error": str(e)})
print(f"❌ {table}: {str(e)}")
# Summary
duration = (datetime.now() - start_time).total_seconds()
results["duration_seconds"] = duration
results["status"] = "SUCCESS" if results["tables_failed"] == 0 else "PARTIAL_FAILURE"
print(f"
{'='*50}")
print(f"Processed: {results['tables_processed']}, Failed: {results['tables_failed']}")
print(f"Duration: {duration:.0f} seconds")
print(f"Status: {results['status']}")
# Return to pipeline
mssparkutils.notebook.exit(json.dumps(results))
Common Mistakes
-
Installing libraries with %pip in production notebooks — every run spends 2-3 minutes installing. Use Environments instead — libraries are pre-installed, notebooks start instantly.
-
Not attaching a default lakehouse —
spark.table("customers")fails without a default lakehouse. Always pin one. -
Using %run for data passing instead of tables — %run shares variables but makes debugging hard. Write intermediate results to tables, read in the next step. Tables are persistent; variables disappear when the session ends.
-
Hardcoding table names and paths — use parameters or a config notebook. Hardcoded values break when you deploy to a different environment.
-
Not returning exit values from notebooks called by pipelines — without
mssparkutils.notebook.exit(), the pipeline cannot access results or status. Always return a status. -
Ignoring Spark shuffle partitions — default 200 partitions for a 10,000-row table creates 200 tiny files. Set to 10-50 for medium data.
Interview Questions
Q: How do you pass parameters between a Fabric Pipeline and a Notebook?
A: The Pipeline’s Notebook Activity passes parameters as key-value pairs. The notebook reads them with mssparkutils.widgets.get("param_name"). The notebook returns values with mssparkutils.notebook.exit(json_string), which the pipeline accesses via @activity('Notebook').output.result.exitValue.
Q: What is the difference between %run and mssparkutils.notebook.run()?
A: %run executes another notebook in the SAME Spark session — all variables and functions become available in the calling notebook. Best for loading shared functions. mssparkutils.notebook.run() executes in a SEPARATE session — isolated, parameterized, returns a value. Best for orchestrating independent notebooks.
Q: What is a Spark Environment in Fabric? A: A reusable configuration that defines the Spark runtime version, pre-installed libraries, and Spark settings. Attach an environment to notebooks so libraries are available instantly without per-session installation. Essential for production — consistent dependencies across all notebooks.
Wrapping Up
The notebook is where all Fabric data engineering happens. Master the four languages, mssparkutils utilities, notebook chaining patterns, and environment management — and you can build any data pipeline Fabric supports.
Related posts: – Fabric Lakehouse Practical Guide – PySpark Transformations Cookbook – Delta Lake Deep Dive – Fabric Data Factory & Pipelines
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.