Fabric Notebooks: The Complete Guide to Spark Environments, Library Management, mssparkutils, Multi-Language Cells, and Production Notebook Patterns

Fabric Notebooks: The Complete Guide to Spark Environments, Library Management, mssparkutils, Multi-Language Cells, and Production Notebook Patterns

The notebook is the data engineer’s daily tool in Fabric. Every transformation, every Delta MERGE, every data quality check, every ML feature table — all built in notebooks. Yet most tutorials show you how to write spark.read.table() and stop there.

This post covers everything ELSE: how Spark environments work, how to install custom libraries, how to chain notebooks with %run and mssparkutils.notebook.run(), how to mix PySpark and SQL in the same notebook, how to manage sessions efficiently, and the production patterns that separate a prototype notebook from a production-grade one.

Think of a notebook as a Swiss Army knife. Our previous posts taught you the blade (PySpark) and the scissors (SparkSQL). This post teaches you the other 15 tools — the bottle opener (mssparkutils), the screwdriver (library management), the saw (notebook chaining), and the file (Spark configuration). Same knife, full capability.

Table of Contents

  • Notebook Basics Refresher
  • Creating and Organizing Notebooks
  • The Four Languages in One Notebook
  • PySpark (Default)
  • SparkSQL with %%sql Magic
  • Scala with %%scala Magic
  • R with %%r Magic
  • Mixing Languages: Sharing Data Between Cells
  • Spark Environments
  • What Is a Spark Environment
  • Creating and Configuring an Environment
  • Attaching an Environment to a Notebook
  • Runtime Versions
  • Library Management
  • Installing PyPI Packages
  • Installing Wheel Files
  • Installing Jar Files (Scala/Java)
  • Inline Installation (Quick and Dirty)
  • Environment-Level Installation (Production)
  • mssparkutils: The Utility Toolkit
  • File System Utilities (mssparkutils.fs)
  • Notebook Utilities (mssparkutils.notebook)
  • Credentials Utilities (mssparkutils.credentials)
  • Environment Utilities (mssparkutils.env)
  • Notebook Chaining and Orchestration
  • %run: Simple Chaining
  • mssparkutils.notebook.run(): Parameterized Chaining
  • mssparkutils.notebook.runMultiple(): Parallel Execution
  • The Config/Functions/Main Pattern
  • Notebook Parameters and Widgets
  • Receiving Parameters from Pipelines
  • Setting Default Values
  • Returning Values to Pipelines
  • Session Management
  • High Concurrency Mode
  • Session Timeouts and Configuration
  • Starter Pools vs Custom Pools
  • Spark Configuration for Performance
  • Key Spark Settings
  • Setting Configurations in Notebook
  • Setting Configurations in Environment
  • Notebook Resources (Embedded Files)
  • Production Notebook Patterns
  • Pattern 1: Config-Driven Notebook
  • Pattern 2: Reusable Functions Notebook
  • Pattern 3: Parameterized ETL Notebook
  • Pattern 4: Error Handling and Logging
  • Notebook Visualization and Display
  • Common Mistakes
  • Interview Questions
  • Wrapping Up

Notebook Basics Refresher

A Fabric Notebook:
  ├── Cell 1 (PySpark):  df = spark.read.table("customers")
  ├── Cell 2 (SQL):      %%sql SELECT * FROM customers LIMIT 10
  ├── Cell 3 (PySpark):  df_clean = df.filter(col("id").isNotNull())
  ├── Cell 4 (Markdown): ## Data Quality Report
  └── Cell 5 (PySpark):  df_clean.write.saveAsTable("silver.customers")

Each cell runs independently. Results display inline.
Cells can be PySpark, SQL, Scala, R, or Markdown — mixed freely.

Creating and Organizing Notebooks

Creating a Notebook

  1. Open your Fabric workspace
  2. Click + New itemNotebook
  3. Name it following a convention: NB_Clean_Customers or nb_bronze_to_silver
  4. Attach a Lakehouse (left panel → Add Lakehouse)

Naming Conventions (Production)

NB_Bronze_Load_Customers       ← Stage + purpose + entity
NB_Silver_Transform_Orders     ← Clear what it does
NB_Gold_Build_Dim_Customer     ← Which layer it writes to
NB_Utils_Common_Functions      ← Reusable functions notebook
NB_Config_Settings             ← Configuration notebook

Attaching a Lakehouse

Every notebook needs a default Lakehouse for table access:

  1. In the notebook, click Lakehouses in the left panel
  2. Click Add → select your lakehouse
  3. Click Pin as default (makes it the default for spark.table() calls)
# With default lakehouse pinned:
df = spark.table("customers")          # Reads from default lakehouse
df = spark.sql("SELECT * FROM orders") # Same — uses default lakehouse

# Access a DIFFERENT lakehouse explicitly:
df = spark.sql("SELECT * FROM other_lakehouse.dbo.products")

The Four Languages in One Notebook

PySpark (Default)

# Default language — no magic command needed
from pyspark.sql.functions import col, trim, lower, current_timestamp

df = spark.table("bronze.raw_customers")
df_clean = df.withColumn("email", lower(trim(col("email"))))              .filter(col("customer_id").isNotNull())
df_clean.write.format("delta").mode("overwrite").saveAsTable("silver.customers")
print(f"Rows written: {df_clean.count()}")

SparkSQL with %%sql Magic

%%sql
-- Write SQL directly in a notebook cell
SELECT department, COUNT(*) AS headcount, ROUND(AVG(salary), 0) AS avg_salary
FROM silver.employees
GROUP BY department
ORDER BY avg_salary DESC

Results display as a formatted table automatically — no df.show() needed.

Scala with %%scala Magic

%%scala
// Useful for Spark-native operations or Java library interop
val df = spark.table("silver.customers")
val count = df.count()
println(s"Customer count: $count")

R with %%r Magic

%%r
# Useful for statistical analysis
library(SparkR)
df <- sql("SELECT salary FROM silver.employees")
summary(df)

Mixing Languages: Sharing Data Between Cells

The challenge: variables in PySpark cells are NOT visible in SQL cells. Use temp views to bridge:

# Cell 1 (PySpark): Create a temp view
df_filtered = spark.table("customers").filter(col("city") == "Toronto")
df_filtered.createOrReplaceTempView("toronto_customers")
%%sql
-- Cell 2 (SQL): Query the temp view created in PySpark
SELECT * FROM toronto_customers ORDER BY name
# Cell 3 (PySpark): Read the SQL result back
df_result = spark.sql("SELECT * FROM toronto_customers")

Temp views are the bridge between languages. PySpark creates them, SQL queries them, and they disappear when the session ends.

Spark Environments

What Is a Spark Environment

A Spark Environment is a reusable configuration that defines the Spark runtime version, installed libraries, and Spark settings. Instead of installing libraries in every notebook, you install them ONCE in an environment and attach the environment to your notebooks.

Without Environments:
  Notebook A: %pip install pandas openpyxl requests
  Notebook B: %pip install pandas openpyxl requests   ← Same install, repeated
  Notebook C: %pip install pandas openpyxl requests   ← Again!
  Each notebook spends 2-3 minutes installing on every run.

With Environments:
  Environment: "data_eng_env"
    Libraries: pandas, openpyxl, requests (pre-installed)

  Notebook A: attached to data_eng_env → libraries ready instantly
  Notebook B: attached to data_eng_env → libraries ready instantly
  Notebook C: attached to data_eng_env → libraries ready instantly

Creating and Configuring an Environment

  1. + New itemEnvironment
  2. Name: data_eng_environment
  3. Public Libraries tab:
  4. Click Add from PyPI
  5. Search and add: openpyxl, great-expectations, azure-identity
  6. Spark Properties tab:
  7. Add custom Spark configurations (optional)
  8. Click Publish → Fabric builds the environment (takes a few minutes)

Attaching an Environment to a Notebook

  1. Open your notebook
  2. Click the Environment dropdown in the toolbar (shows “Workspace default”)
  3. Select your custom environment: data_eng_environment
  4. All libraries from that environment are now available

Runtime Versions

Fabric provides Spark runtime versions:

Runtime Spark Version Python Delta Lake Status
Runtime 1.2 Spark 3.4 Python 3.10 Delta 2.4 GA
Runtime 1.3 Spark 3.5 Python 3.11 Delta 3.0 GA

Choose the runtime in the Environment settings. Newer runtimes have performance improvements and updated libraries.

Library Management

Installing PyPI Packages

Method 1: Inline (per session — quick testing)

# Install in the current session only (not persistent)
%pip install openpyxl
%pip install great-expectations==0.18.0  # Specific version
%pip install azure-identity azure-storage-blob  # Multiple packages

Method 2: Environment-level (persistent — production)

  1. Open your Environment item
  2. Public LibrariesAdd from PyPI
  3. Add packages with optional version constraints
  4. Click Publish (rebuilds environment — takes 2-5 minutes)
  5. All notebooks using this environment get the libraries automatically

Installing Wheel Files

For custom or private packages:

# Upload .whl file to lakehouse Files/
# Then install from the file path
%pip install /lakehouse/default/Files/libs/my_custom_lib-1.0.0-py3-none-any.whl

Or upload the .whl to the Environment → Custom Libraries tab.

Installing Jar Files (Scala/Java)

For Scala/Java libraries (e.g., custom connectors):

  1. Open your Environment
  2. Custom Libraries → Upload .jar file
  3. Publish → available in %%scala cells

Best Practice

Development: Use %pip install (quick iteration)
Production:  Use Environments (consistent, no install time per run)

mssparkutils: The Utility Toolkit

mssparkutils is Fabric’s built-in utility library — similar to dbutils in Databricks.

File System Utilities (mssparkutils.fs)

# List files in a directory
files = mssparkutils.fs.ls("Files/raw_csv/")
for f in files:
    print(f"{f.name} — {f.size} bytes — isDir: {f.isDir}")

# Check if a file/folder exists
exists = mssparkutils.fs.exists("Files/raw_csv/customers.csv")

# Copy a file
mssparkutils.fs.cp("Files/raw_csv/old.csv", "Files/archive/old.csv")

# Move a file (rename)
mssparkutils.fs.mv("Files/raw_csv/temp.csv", "Files/raw_csv/final.csv")

# Delete a file
mssparkutils.fs.rm("Files/archive/old.csv")

# Delete a directory recursively
mssparkutils.fs.rm("Files/temp_folder/", recurse=True)

# Create a directory
mssparkutils.fs.mkdirs("Files/output/2026/05/")

# Read a small text file
content = mssparkutils.fs.head("Files/config/settings.json", maxBytes=1024)
print(content)

# Write a small text file
mssparkutils.fs.put("Files/output/status.txt", "Pipeline completed at 2026-05-25", overwrite=True)

Notebook Utilities (mssparkutils.notebook)

# Run another notebook (synchronous — waits for completion)
result = mssparkutils.notebook.run("NB_Clean_Customers", timeout_seconds=600)
print(f"Result: {result}")  # Gets the exit value from the called notebook

# Run with parameters
result = mssparkutils.notebook.run("NB_Clean_Customers", timeout_seconds=600,
    arguments={"table_name": "customers", "load_date": "2026-05-25"})

# Run multiple notebooks in parallel
mssparkutils.notebook.runMultiple([
    {"name": "NB_Clean_Customers", "timeout": 600},
    {"name": "NB_Clean_Orders", "timeout": 600},
    {"name": "NB_Clean_Products", "timeout": 600},
])
# All three run in parallel — returns when ALL complete

# Exit a notebook with a return value
mssparkutils.notebook.exit("SUCCESS: 15000 rows processed")

Credentials Utilities (mssparkutils.credentials)

# Get an Azure AD token (for calling REST APIs)
token = mssparkutils.credentials.getToken("https://management.azure.com/")

# Get a secret from Azure Key Vault
secret = mssparkutils.credentials.getSecret(
    "https://my-keyvault.vault.azure.net/",
    "my-secret-name"
)

# Use in API calls
import requests
headers = {"Authorization": f"Bearer {token}"}
response = requests.get("https://api.example.com/data", headers=headers)

Environment Utilities (mssparkutils.env)

# Get workspace information
workspace_id = mssparkutils.env.getWorkspaceId()
workspace_name = mssparkutils.env.getWorkspaceName()
lakehouse_id = mssparkutils.env.getLakehouseId()
notebook_id = mssparkutils.env.getNotebookId()

print(f"Workspace: {workspace_name} ({workspace_id})")
print(f"Lakehouse: {lakehouse_id}")

Notebook Chaining and Orchestration

%run: Simple Chaining

%run executes another notebook in the SAME session (shares variables):

# Cell 1: Run a functions notebook (loads all its functions into this session)
%run NB_Utils_Common_Functions

# Cell 2: Now you can use functions defined in NB_Utils_Common_Functions
df_clean = clean_dataframe(df)  # Function defined in the other notebook

Important: %run executes in the same Spark session — all variables and functions from the called notebook become available in the calling notebook. Use for shared utility functions.

mssparkutils.notebook.run(): Parameterized Chaining

Runs a notebook in a SEPARATE session (isolated):

# Run with parameters and get the result
result = mssparkutils.notebook.run(
    "NB_Bronze_To_Silver",
    timeout_seconds=1800,  # 30 minute timeout
    arguments={
        "source_table": "raw_customers",
        "target_table": "customers_clean",
        "load_date": "2026-05-25"
    }
)
print(f"Notebook returned: {result}")

In the CALLED notebook (NB_Bronze_To_Silver):

# Read parameters
source = mssparkutils.widgets.get("source_table")     # "raw_customers"
target = mssparkutils.widgets.get("target_table")      # "customers_clean"
load_date = mssparkutils.widgets.get("load_date")      # "2026-05-25"

# Do the work
df = spark.table(f"bronze.{source}")
# ... transform ...
df_clean.write.saveAsTable(f"silver.{target}")

# Return a value
mssparkutils.notebook.exit(f"SUCCESS: {df_clean.count()} rows")

mssparkutils.notebook.runMultiple(): Parallel Execution

# Run 3 notebooks in parallel
results = mssparkutils.notebook.runMultiple([
    {
        "name": "NB_Clean_Customers",
        "timeoutPerCellInSeconds": 300,
        "arguments": {"load_date": "2026-05-25"}
    },
    {
        "name": "NB_Clean_Orders",
        "timeoutPerCellInSeconds": 300,
        "arguments": {"load_date": "2026-05-25"}
    },
    {
        "name": "NB_Clean_Products",
        "timeoutPerCellInSeconds": 300,
        "arguments": {"load_date": "2026-05-25"}
    }
])
# All three execute simultaneously — total time = slowest notebook, not sum

The Config/Functions/Main Pattern

Production notebooks follow a three-notebook pattern:

NB_Config_Settings (config)
  → Defines: database names, paths, table lists, environment variables

NB_Utils_Functions (shared functions)
  → Defines: clean_dataframe(), validate_schema(), log_activity()

NB_Main_ETL (main logic)
  → Cell 1: %run NB_Config_Settings        ← Load config
  → Cell 2: %run NB_Utils_Functions         ← Load functions
  → Cell 3: Main ETL logic using config and functions
# NB_Config_Settings
BRONZE_LAKEHOUSE = "bronze_lakehouse"
SILVER_LAKEHOUSE = "silver_lakehouse"
TABLES_TO_PROCESS = ["customers", "orders", "products"]
LOG_TABLE = "etl_log"
# NB_Utils_Functions
def clean_dataframe(df, trim_strings=True, drop_nulls_on=None):
    from pyspark.sql.functions import trim, col
    if trim_strings:
        for c in [f.name for f in df.schema.fields if f.dataType.simpleString() == 'string']:
            df = df.withColumn(c, trim(col(c)))
    if drop_nulls_on:
        df = df.filter(col(drop_nulls_on).isNotNull())
    return df

def log_activity(table_name, rows, status, message=""):
    spark.sql(f'''
        INSERT INTO {LOG_TABLE} VALUES (
            '{table_name}', {rows}, '{status}', '{message}', current_timestamp()
        )
    ''')
# NB_Main_ETL
%run NB_Config_Settings
%run NB_Utils_Functions

for table in TABLES_TO_PROCESS:
    try:
        df = spark.table(f"{BRONZE_LAKEHOUSE}.{table}")
        df_clean = clean_dataframe(df, drop_nulls_on="id")
        df_clean.write.format("delta").mode("overwrite").saveAsTable(f"{SILVER_LAKEHOUSE}.{table}")
        log_activity(table, df_clean.count(), "SUCCESS")
        print(f"✅ {table}: {df_clean.count()} rows")
    except Exception as e:
        log_activity(table, 0, "FAILED", str(e))
        print(f"❌ {table}: {str(e)}")

Notebook Parameters and Widgets

Receiving Parameters from Pipelines

When a notebook is called from a Fabric Pipeline (Notebook Activity):

# Pipeline passes: {"table_name": "customers", "load_date": "2026-05-25"}

# Read parameters in the notebook:
table_name = mssparkutils.widgets.get("table_name")     # "customers"
load_date = mssparkutils.widgets.get("load_date")        # "2026-05-25"

Setting Default Values

# Define parameters with defaults (for manual testing)
mssparkutils.widgets.text("table_name", "customers", "Table to process")
mssparkutils.widgets.text("load_date", "2026-05-25", "Load date")
mssparkutils.widgets.dropdown("mode", "overwrite", ["overwrite", "append"], "Write mode")

# Read (uses pipeline value if passed, default if running manually)
table_name = mssparkutils.widgets.get("table_name")
load_date = mssparkutils.widgets.get("load_date")
mode = mssparkutils.widgets.get("mode")

Returning Values to Pipelines

# At the end of the notebook
import json
result = json.dumps({
    "rows_processed": df.count(),
    "status": "SUCCESS",
    "duration_seconds": 45
})
mssparkutils.notebook.exit(result)

# In the Pipeline, access with:
# @activity('Notebook_Activity').output.result.exitValue

Session Management

High Concurrency Mode

By default, each notebook gets its own Spark session. High Concurrency Mode shares one session across multiple notebooks — faster start, less memory:

  1. Open notebook → click the Session dropdown
  2. Select High Concurrency (if available)
  3. Multiple notebooks share the same Spark cluster

Session Timeouts

# Default session timeout: 20 minutes of inactivity
# Configure in notebook or environment:
spark.conf.set("spark.fabric.session.idle.timeout", "30m")  # 30 minutes

Starter Pools vs Custom Pools

Feature Starter Pool Custom Pool (via Environment)
Startup time Near instant (~10 seconds) 1-3 minutes
Configuration Default Fully customizable
Libraries Default only Custom libraries pre-installed
Best for Ad-hoc exploration, quick queries Production notebooks

Spark Configuration for Performance

Key Spark Settings

# Shuffle partitions (default 200 — too many for small data, too few for big)
spark.conf.set("spark.sql.shuffle.partitions", "50")  # For medium datasets

# Adaptive Query Execution (auto-optimizes at runtime)
spark.conf.set("spark.sql.adaptive.enabled", "true")   # Default in Fabric
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Delta Lake optimizations
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

# Memory management
spark.conf.set("spark.driver.memory", "4g")
spark.conf.set("spark.executor.memory", "8g")

Setting Configurations in Environment

For persistent settings, add them to the Environment → Spark Properties tab:

spark.sql.shuffle.partitions = 50
spark.databricks.delta.optimizeWrite.enabled = true
spark.databricks.delta.autoCompact.enabled = true

These apply to ALL notebooks using this environment — no per-notebook configuration needed.

Notebook Resources (Embedded Files)

Upload small files (config, lookup data) directly to the notebook:

  1. Click Resources in the notebook sidebar
  2. Upload a file (e.g., config.json, lookup.csv)
  3. Access in code:
import json

# Read an embedded resource file
with open("/synfs/nb_resource/config.json", "r") as f:
    config = json.load(f)
print(config)

Use for: Small config files, lookup tables, schema definitions. NOT for large data files.

Production Notebook Patterns

Pattern 1: Config-Driven Notebook

Shown above in the Config/Functions/Main pattern. All settings in one place, all functions reusable, main logic clean.

Pattern 2: Error Handling and Logging

import json
from datetime import datetime

start_time = datetime.now()
results = {"tables_processed": 0, "tables_failed": 0, "errors": []}

for table in TABLES_TO_PROCESS:
    try:
        df = spark.table(f"bronze.{table}")
        df_clean = clean_dataframe(df)
        df_clean.write.format("delta").mode("overwrite").saveAsTable(f"silver.{table}")
        results["tables_processed"] += 1
        print(f"✅ {table}: {df_clean.count()} rows")
    except Exception as e:
        results["tables_failed"] += 1
        results["errors"].append({"table": table, "error": str(e)})
        print(f"❌ {table}: {str(e)}")

# Summary
duration = (datetime.now() - start_time).total_seconds()
results["duration_seconds"] = duration
results["status"] = "SUCCESS" if results["tables_failed"] == 0 else "PARTIAL_FAILURE"

print(f"
{'='*50}")
print(f"Processed: {results['tables_processed']}, Failed: {results['tables_failed']}")
print(f"Duration: {duration:.0f} seconds")
print(f"Status: {results['status']}")

# Return to pipeline
mssparkutils.notebook.exit(json.dumps(results))

Common Mistakes

  1. Installing libraries with %pip in production notebooks — every run spends 2-3 minutes installing. Use Environments instead — libraries are pre-installed, notebooks start instantly.

  2. Not attaching a default lakehousespark.table("customers") fails without a default lakehouse. Always pin one.

  3. Using %run for data passing instead of tables — %run shares variables but makes debugging hard. Write intermediate results to tables, read in the next step. Tables are persistent; variables disappear when the session ends.

  4. Hardcoding table names and paths — use parameters or a config notebook. Hardcoded values break when you deploy to a different environment.

  5. Not returning exit values from notebooks called by pipelines — without mssparkutils.notebook.exit(), the pipeline cannot access results or status. Always return a status.

  6. Ignoring Spark shuffle partitions — default 200 partitions for a 10,000-row table creates 200 tiny files. Set to 10-50 for medium data.

Interview Questions

Q: How do you pass parameters between a Fabric Pipeline and a Notebook? A: The Pipeline’s Notebook Activity passes parameters as key-value pairs. The notebook reads them with mssparkutils.widgets.get("param_name"). The notebook returns values with mssparkutils.notebook.exit(json_string), which the pipeline accesses via @activity('Notebook').output.result.exitValue.

Q: What is the difference between %run and mssparkutils.notebook.run()? A: %run executes another notebook in the SAME Spark session — all variables and functions become available in the calling notebook. Best for loading shared functions. mssparkutils.notebook.run() executes in a SEPARATE session — isolated, parameterized, returns a value. Best for orchestrating independent notebooks.

Q: What is a Spark Environment in Fabric? A: A reusable configuration that defines the Spark runtime version, pre-installed libraries, and Spark settings. Attach an environment to notebooks so libraries are available instantly without per-session installation. Essential for production — consistent dependencies across all notebooks.

Wrapping Up

The notebook is where all Fabric data engineering happens. Master the four languages, mssparkutils utilities, notebook chaining patterns, and environment management — and you can build any data pipeline Fabric supports.

Related posts:Fabric Lakehouse Practical GuidePySpark Transformations CookbookDelta Lake Deep DiveFabric Data Factory & Pipelines


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top
Share via
Copy link