Fabric Notebooks: The Complete Guide to Spark Environments, Library Management, mssparkutils, Multi-Language Cells, and Production Notebook Patterns

The notebook is the data engineer’s daily tool in Fabric. Every transformation, every Delta MERGE, every data quality check, every ML feature table — all built in notebooks. Yet most tutorials show you how to write spark.read.table() and stop there.

This post covers everything ELSE: how Spark environments work, how to install custom libraries, how to chain notebooks with %run and mssparkutils.notebook.run(), how to mix PySpark and SQL in the same notebook, how to manage sessions efficiently, and the production patterns that separate a prototype notebook from a production-grade one.

Think of a notebook as a Swiss Army knife. Our previous posts taught you the blade (PySpark) and the scissors (SparkSQL). This post teaches you the other 15 tools — the bottle opener (mssparkutils), the screwdriver (library management), the saw (notebook chaining), and the file (Spark configuration). Same knife, full capability.

Notebook Basics Refresher
Creating and Organizing Notebooks
The Four Languages in One Notebook
PySpark (Default)
SparkSQL with %%sql Magic
Scala with %%scala Magic
R with %%r Magic
Mixing Languages: Sharing Data Between Cells
Spark Environments
What Is a Spark Environment
Creating and Configuring an Environment
Attaching an Environment to a Notebook
Runtime Versions
Library Management
Installing PyPI Packages
Installing Wheel Files
Installing Jar Files (Scala/Java)
Inline Installation (Quick and Dirty)
Environment-Level Installation (Production)
mssparkutils: The Utility Toolkit
File System Utilities (mssparkutils.fs)
Notebook Utilities (mssparkutils.notebook)
Credentials Utilities (mssparkutils.credentials)
Environment Utilities (mssparkutils.env)
Notebook Chaining and Orchestration
%run: Simple Chaining
mssparkutils.notebook.run(): Parameterized Chaining
mssparkutils.notebook.runMultiple(): Parallel Execution
The Config/Functions/Main Pattern
Notebook Parameters and Widgets
Receiving Parameters from Pipelines
Setting Default Values
Returning Values to Pipelines
Session Management
High Concurrency Mode
Session Timeouts and Configuration
Starter Pools vs Custom Pools
Spark Configuration for Performance
Key Spark Settings
Setting Configurations in Notebook
Setting Configurations in Environment
Notebook Resources (Embedded Files)
Production Notebook Patterns
Pattern 1: Config-Driven Notebook
Pattern 2: Reusable Functions Notebook
Pattern 3: Parameterized ETL Notebook
Pattern 4: Error Handling and Logging
Notebook Visualization and Display
Common Mistakes
Interview Questions
Wrapping Up

Notebook Basics Refresher

A Fabric Notebook:
  ├── Cell 1 (PySpark):  df = spark.read.table("customers")
  ├── Cell 2 (SQL):      %%sql SELECT * FROM customers LIMIT 10
  ├── Cell 3 (PySpark):  df_clean = df.filter(col("id").isNotNull())
  ├── Cell 4 (Markdown): ## Data Quality Report
  └── Cell 5 (PySpark):  df_clean.write.saveAsTable("silver.customers")

Each cell runs independently. Results display inline.
Cells can be PySpark, SQL, Scala, R, or Markdown — mixed freely.

Creating and Organizing Notebooks

Creating a Notebook

Open your Fabric workspace
Click + New item → Notebook
Name it following a convention: NB_Clean_Customers or nb_bronze_to_silver
Attach a Lakehouse (left panel → Add Lakehouse)

Naming Conventions (Production)

NB_Bronze_Load_Customers       ← Stage + purpose + entity
NB_Silver_Transform_Orders     ← Clear what it does
NB_Gold_Build_Dim_Customer     ← Which layer it writes to
NB_Utils_Common_Functions      ← Reusable functions notebook
NB_Config_Settings             ← Configuration notebook

Attaching a Lakehouse

Every notebook needs a default Lakehouse for table access:

In the notebook, click Lakehouses in the left panel
Click Add → select your lakehouse
Click Pin as default (makes it the default for spark.table() calls)

# With default lakehouse pinned:
df = spark.table("customers")          # Reads from default lakehouse
df = spark.sql("SELECT * FROM orders") # Same — uses default lakehouse

# Access a DIFFERENT lakehouse explicitly:
df = spark.sql("SELECT * FROM other_lakehouse.dbo.products")

The Four Languages in One Notebook

PySpark (Default)

# Default language — no magic command needed
from pyspark.sql.functions import col, trim, lower, current_timestamp

df = spark.table("bronze.raw_customers")
df_clean = df.withColumn("email", lower(trim(col("email"))))              .filter(col("customer_id").isNotNull())
df_clean.write.format("delta").mode("overwrite").saveAsTable("silver.customers")
print(f"Rows written: {df_clean.count()}")

SparkSQL with %%sql Magic

%%sql
-- Write SQL directly in a notebook cell
SELECT department, COUNT(*) AS headcount, ROUND(AVG(salary), 0) AS avg_salary
FROM silver.employees
GROUP BY department
ORDER BY avg_salary DESC

Results display as a formatted table automatically — no df.show() needed.

Scala with %%scala Magic

%%scala
// Useful for Spark-native operations or Java library interop
val df = spark.table("silver.customers")
val count = df.count()
println(s"Customer count: $count")

R with %%r Magic

%%r
# Useful for statistical analysis
library(SparkR)
df <- sql("SELECT salary FROM silver.employees")
summary(df)

The challenge: variables in PySpark cells are NOT visible in SQL cells. Use temp views to bridge:

# Cell 1 (PySpark): Create a temp view
df_filtered = spark.table("customers").filter(col("city") == "Toronto")
df_filtered.createOrReplaceTempView("toronto_customers")

%%sql
-- Cell 2 (SQL): Query the temp view created in PySpark
SELECT * FROM toronto_customers ORDER BY name

# Cell 3 (PySpark): Read the SQL result back
df_result = spark.sql("SELECT * FROM toronto_customers")

Temp views are the bridge between languages. PySpark creates them, SQL queries them, and they disappear when the session ends.

Spark Environments

What Is a Spark Environment

A Spark Environment is a reusable configuration that defines the Spark runtime version, installed libraries, and Spark settings. Instead of installing libraries in every notebook, you install them ONCE in an environment and attach the environment to your notebooks.

Without Environments:
  Notebook A: %pip install pandas openpyxl requests
  Notebook B: %pip install pandas openpyxl requests   ← Same install, repeated
  Notebook C: %pip install pandas openpyxl requests   ← Again!
  Each notebook spends 2-3 minutes installing on every run.

With Environments:
  Environment: "data_eng_env"
    Libraries: pandas, openpyxl, requests (pre-installed)

  Notebook A: attached to data_eng_env → libraries ready instantly
  Notebook B: attached to data_eng_env → libraries ready instantly
  Notebook C: attached to data_eng_env → libraries ready instantly

Creating and Configuring an Environment

+ New item → Environment
Name: data_eng_environment
Public Libraries tab:
Click Add from PyPI
Search and add: openpyxl, great-expectations, azure-identity
Spark Properties tab:
Add custom Spark configurations (optional)
Click Publish → Fabric builds the environment (takes a few minutes)

Attaching an Environment to a Notebook

Open your notebook
Click the Environment dropdown in the toolbar (shows “Workspace default”)
Select your custom environment: data_eng_environment
All libraries from that environment are now available

Runtime Versions

Fabric provides Spark runtime versions:

Runtime	Spark Version	Python	Delta Lake	Status
Runtime 1.2	Spark 3.4	Python 3.10	Delta 2.4	GA
Runtime 1.3	Spark 3.5	Python 3.11	Delta 3.0	GA

Choose the runtime in the Environment settings. Newer runtimes have performance improvements and updated libraries.

Library Management

Installing PyPI Packages

Method 1: Inline (per session — quick testing)

# Install in the current session only (not persistent)
%pip install openpyxl
%pip install great-expectations==0.18.0  # Specific version
%pip install azure-identity azure-storage-blob  # Multiple packages

Method 2: Environment-level (persistent — production)

Open your Environment item
Public Libraries → Add from PyPI
Add packages with optional version constraints
Click Publish (rebuilds environment — takes 2-5 minutes)
All notebooks using this environment get the libraries automatically

Installing Wheel Files

For custom or private packages:

# Upload .whl file to lakehouse Files/
# Then install from the file path
%pip install /lakehouse/default/Files/libs/my_custom_lib-1.0.0-py3-none-any.whl

Or upload the .whl to the Environment → Custom Libraries tab.

Installing Jar Files (Scala/Java)

For Scala/Java libraries (e.g., custom connectors):

Open your Environment
Custom Libraries → Upload .jar file
Publish → available in %%scala cells

Inline Installation (Quick and Dirty)

Per-session only — installs every time the notebook runs. Good for quick testing and exploration. Not for production (adds 2-3 minutes overhead per run):

%pip install openpyxl
%pip install great-expectations==0.18.0
%pip install azure-identity azure-storage-blob

# After %pip install, restart the Spark session for the library to take effect
# Use for: quick testing, one-time exploration
# NOT for: production notebooks (wastes 2-3 min per run)

Environment-Level Installation (Production)

Install once in a Spark Environment — available instantly in all notebooks attached to that environment:

# Install once in Environment (not per-notebook):
# 1. Create Environment: + New item → Environment → "data_eng_env"
# 2. Public Libraries tab → Add from PyPI → search "openpyxl" → Add
# 3. Add more: great-expectations, azure-identity, etc.
# 4. Click Publish (builds environment — 2-5 minutes one-time)
# 5. Attach to notebook: Environment dropdown → select "data_eng_env"
#
# Libraries are pre-installed — notebook starts instantly, no %pip needed
# Update libraries: edit Environment → Publish → all notebooks get the update

Best Practice

Development: Use %pip install (quick iteration)
Production:  Use Environments (consistent, no install time per run)

mssparkutils: The Utility Toolkit

mssparkutils is Fabric’s built-in utility library — similar to dbutils in Databricks.

File System Utilities (mssparkutils.fs)

# List files in a directory
files = mssparkutils.fs.ls("Files/raw_csv/")
for f in files:
    print(f"{f.name} — {f.size} bytes — isDir: {f.isDir}")

# Check if a file/folder exists
exists = mssparkutils.fs.exists("Files/raw_csv/customers.csv")

# Copy a file
mssparkutils.fs.cp("Files/raw_csv/old.csv", "Files/archive/old.csv")

# Move a file (rename)
mssparkutils.fs.mv("Files/raw_csv/temp.csv", "Files/raw_csv/final.csv")

# Delete a file
mssparkutils.fs.rm("Files/archive/old.csv")

# Delete a directory recursively
mssparkutils.fs.rm("Files/temp_folder/", recurse=True)

# Create a directory
mssparkutils.fs.mkdirs("Files/output/2026/05/")

# Read a small text file
content = mssparkutils.fs.head("Files/config/settings.json", maxBytes=1024)
print(content)

# Write a small text file
mssparkutils.fs.put("Files/output/status.txt", "Pipeline completed at 2026-05-25", overwrite=True)

Notebook Utilities (mssparkutils.notebook)

# Run another notebook (synchronous — waits for completion)
result = mssparkutils.notebook.run("NB_Clean_Customers", timeout_seconds=600)
print(f"Result: {result}")  # Gets the exit value from the called notebook

# Run with parameters
result = mssparkutils.notebook.run("NB_Clean_Customers", timeout_seconds=600,
    arguments={"table_name": "customers", "load_date": "2026-05-25"})

# Run multiple notebooks in parallel
mssparkutils.notebook.runMultiple([
    {"name": "NB_Clean_Customers", "timeout": 600},
    {"name": "NB_Clean_Orders", "timeout": 600},
    {"name": "NB_Clean_Products", "timeout": 600},
])
# All three run in parallel — returns when ALL complete

# Exit a notebook with a return value
mssparkutils.notebook.exit("SUCCESS: 15000 rows processed")

Credentials Utilities (mssparkutils.credentials)

# Get an Azure AD token (for calling REST APIs)
token = mssparkutils.credentials.getToken("https://management.azure.com/")

# Get a secret from Azure Key Vault
secret = mssparkutils.credentials.getSecret(
    "https://my-keyvault.vault.azure.net/",
    "my-secret-name"
)

# Use in API calls
import requests
headers = {"Authorization": f"Bearer {token}"}
response = requests.get("https://api.example.com/data", headers=headers)

Environment Utilities (mssparkutils.env)

# Get workspace information
workspace_id = mssparkutils.env.getWorkspaceId()
workspace_name = mssparkutils.env.getWorkspaceName()
lakehouse_id = mssparkutils.env.getLakehouseId()
notebook_id = mssparkutils.env.getNotebookId()

print(f"Workspace: {workspace_name} ({workspace_id})")
print(f"Lakehouse: {lakehouse_id}")

Notebook Chaining and Orchestration

%run: Simple Chaining

%run executes another notebook in the SAME session (shares variables):

# Cell 1: Run a functions notebook (loads all its functions into this session)
%run NB_Utils_Common_Functions

# Cell 2: Now you can use functions defined in NB_Utils_Common_Functions
df_clean = clean_dataframe(df)  # Function defined in the other notebook

Important: %run executes in the same Spark session — all variables and functions from the called notebook become available in the calling notebook. Use for shared utility functions.

mssparkutils.notebook.run(): Parameterized Chaining

Runs a notebook in a SEPARATE session (isolated):

# Run with parameters and get the result
result = mssparkutils.notebook.run(
    "NB_Bronze_To_Silver",
    timeout_seconds=1800,  # 30 minute timeout
    arguments={
        "source_table": "raw_customers",
        "target_table": "customers_clean",
        "load_date": "2026-05-25"
    }
)
print(f"Notebook returned: {result}")

In the CALLED notebook (NB_Bronze_To_Silver):

# Read parameters
source = mssparkutils.widgets.get("source_table")     # "raw_customers"
target = mssparkutils.widgets.get("target_table")      # "customers_clean"
load_date = mssparkutils.widgets.get("load_date")      # "2026-05-25"

# Do the work
df = spark.table(f"bronze.{source}")
# ... transform ...
df_clean.write.saveAsTable(f"silver.{target}")

# Return a value
mssparkutils.notebook.exit(f"SUCCESS: {df_clean.count()} rows")

mssparkutils.notebook.runMultiple(): Parallel Execution

# Run 3 notebooks in parallel
results = mssparkutils.notebook.runMultiple([
    {
        "name": "NB_Clean_Customers",
        "timeoutPerCellInSeconds": 300,
        "arguments": {"load_date": "2026-05-25"}
    },
    {
        "name": "NB_Clean_Orders",
        "timeoutPerCellInSeconds": 300,
        "arguments": {"load_date": "2026-05-25"}
    },
    {
        "name": "NB_Clean_Products",
        "timeoutPerCellInSeconds": 300,
        "arguments": {"load_date": "2026-05-25"}
    }
])
# All three execute simultaneously — total time = slowest notebook, not sum

The Config/Functions/Main Pattern

Production notebooks follow a three-notebook pattern:

NB_Config_Settings (config)
  → Defines: database names, paths, table lists, environment variables

NB_Utils_Functions (shared functions)
  → Defines: clean_dataframe(), validate_schema(), log_activity()

NB_Main_ETL (main logic)
  → Cell 1: %run NB_Config_Settings        ← Load config
  → Cell 2: %run NB_Utils_Functions         ← Load functions
  → Cell 3: Main ETL logic using config and functions

# NB_Config_Settings
BRONZE_LAKEHOUSE = "bronze_lakehouse"
SILVER_LAKEHOUSE = "silver_lakehouse"
TABLES_TO_PROCESS = ["customers", "orders", "products"]
LOG_TABLE = "etl_log"

# NB_Utils_Functions
def clean_dataframe(df, trim_strings=True, drop_nulls_on=None):
    from pyspark.sql.functions import trim, col
    if trim_strings:
        for c in [f.name for f in df.schema.fields if f.dataType.simpleString() == 'string']:
            df = df.withColumn(c, trim(col(c)))
    if drop_nulls_on:
        df = df.filter(col(drop_nulls_on).isNotNull())
    return df

def log_activity(table_name, rows, status, message=""):
    spark.sql(f'''
        INSERT INTO {LOG_TABLE} VALUES (
            '{table_name}', {rows}, '{status}', '{message}', current_timestamp()
        )
    ''')

# NB_Main_ETL
%run NB_Config_Settings
%run NB_Utils_Functions

for table in TABLES_TO_PROCESS:
    try:
        df = spark.table(f"{BRONZE_LAKEHOUSE}.{table}")
        df_clean = clean_dataframe(df, drop_nulls_on="id")
        df_clean.write.format("delta").mode("overwrite").saveAsTable(f"{SILVER_LAKEHOUSE}.{table}")
        log_activity(table, df_clean.count(), "SUCCESS")
        print(f"✅ {table}: {df_clean.count()} rows")
    except Exception as e:
        log_activity(table, 0, "FAILED", str(e))
        print(f"❌ {table}: {str(e)}")

Notebook Parameters and Widgets

Receiving Parameters from Pipelines

When a notebook is called from a Fabric Pipeline (Notebook Activity):

# Pipeline passes: {"table_name": "customers", "load_date": "2026-05-25"}

# Read parameters in the notebook:
table_name = mssparkutils.widgets.get("table_name")     # "customers"
load_date = mssparkutils.widgets.get("load_date")        # "2026-05-25"

Setting Default Values

# Define parameters with defaults (for manual testing)
mssparkutils.widgets.text("table_name", "customers", "Table to process")
mssparkutils.widgets.text("load_date", "2026-05-25", "Load date")
mssparkutils.widgets.dropdown("mode", "overwrite", ["overwrite", "append"], "Write mode")

# Read (uses pipeline value if passed, default if running manually)
table_name = mssparkutils.widgets.get("table_name")
load_date = mssparkutils.widgets.get("load_date")
mode = mssparkutils.widgets.get("mode")

Returning Values to Pipelines

# At the end of the notebook
import json
result = json.dumps({
    "rows_processed": df.count(),
    "status": "SUCCESS",
    "duration_seconds": 45
})
mssparkutils.notebook.exit(result)

# In the Pipeline, access with:
# @activity('Notebook_Activity').output.result.exitValue

Session Management

High Concurrency Mode

By default, each notebook gets its own Spark session. High Concurrency Mode shares one session across multiple notebooks — faster start, less memory:

Open notebook → click the Session dropdown
Select High Concurrency (if available)
Multiple notebooks share the same Spark cluster

Session Timeouts and Configuration

# Default session timeout: 20 minutes of inactivity
# If you step away for 20 min, session expires — next cell starts a new session

# Increase timeout for long exploration sessions:
spark.conf.set("spark.fabric.session.idle.timeout", "60m")  # 60 minutes

# For pipeline-triggered notebooks: timeout doesn't matter
# (notebook runs to completion, session closes automatically)

# Check current session configuration:
print(spark.conf.get("spark.fabric.session.idle.timeout", "20m"))

Starter Pools vs Custom Pools

Feature	Starter Pool	Custom Pool (via Environment)
Startup time	Near instant (~10 seconds)	1-3 minutes
Configuration	Default	Fully customizable
Libraries	Default only	Custom libraries pre-installed
Best for	Ad-hoc exploration, quick queries	Production notebooks

Spark Configuration for Performance

Key Spark Settings

# Shuffle partitions (default 200 — too many for small data, too few for big)
spark.conf.set("spark.sql.shuffle.partitions", "50")  # For medium datasets

# Adaptive Query Execution (auto-optimizes at runtime)
spark.conf.set("spark.sql.adaptive.enabled", "true")   # Default in Fabric
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Delta Lake optimizations
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

# Memory management
spark.conf.set("spark.driver.memory", "4g")
spark.conf.set("spark.executor.memory", "8g")

Setting Configurations in Notebook

Set Spark configs at the start of your notebook — applies to the current session only:

# Per-session settings (apply to THIS notebook run only)
spark.conf.set("spark.sql.shuffle.partitions", "50")
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

# Verify a setting
print(spark.conf.get("spark.sql.shuffle.partitions"))  # "50"

# These reset when the session ends
# For persistent settings, use the Environment instead (see below)

Setting Configurations in Environment

For persistent settings, add them to the Environment → Spark Properties tab:

spark.sql.shuffle.partitions = 50
spark.databricks.delta.optimizeWrite.enabled = true
spark.databricks.delta.autoCompact.enabled = true

These apply to ALL notebooks using this environment — no per-notebook configuration needed.

Notebook Resources (Embedded Files)

Upload small files (config, lookup data) directly to the notebook:

Click Resources in the notebook sidebar
Upload a file (e.g., config.json, lookup.csv)
Access in code:

import json

# Read an embedded resource file
with open("/synfs/nb_resource/config.json", "r") as f:
    config = json.load(f)
print(config)

Use for: Small config files, lookup tables, schema definitions. NOT for large data files.

Production Notebook Patterns

Pattern 1: Config-Driven Notebook

Shown above in the Config/Functions/Main pattern. All settings in one place, all functions reusable, main logic clean.

Pattern 2: Reusable Functions Notebook

# NB_Utils_Functions — shared across all notebooks via %run

def clean_dataframe(df, trim_strings=True, drop_nulls_on=None):
    """Standard cleaning: trim strings, drop nulls on key column."""
    from pyspark.sql.functions import trim, col
    if trim_strings:
        for c in [f.name for f in df.schema.fields if f.dataType.simpleString() == 'string']:
            df = df.withColumn(c, trim(col(c)))
    if drop_nulls_on:
        df = df.filter(col(drop_nulls_on).isNotNull())
    return df

def deduplicate(df, key_columns, order_column=None):
    """Remove duplicates keeping the latest record."""
    from pyspark.sql.functions import row_number, col
    from pyspark.sql.window import Window
    if order_column:
        w = Window.partitionBy(*key_columns).orderBy(col(order_column).desc())
        return df.withColumn("_rn", row_number().over(w)).filter("_rn = 1").drop("_rn")
    return df.dropDuplicates(key_columns)

def log_activity(spark, table_name, rows, status, message=""):
    """Log ETL activity for monitoring."""
    spark.sql(f"""INSERT INTO etl_log VALUES
        ('{table_name}', {rows}, '{status}', '{message}', current_timestamp())""")

# Usage in main notebook:
# %run NB_Utils_Functions
# df_clean = clean_dataframe(df, drop_nulls_on="customer_id")
# df_dedup = deduplicate(df_clean, ["customer_id"], "updated_at")

Pattern 3: Parameterized ETL Notebook

A single notebook that processes ANY table — called from a pipeline with parameters:

# NB_Generic_Bronze_to_Silver — processes any table via parameters

# Cell 1: Read parameters (passed from Pipeline Notebook Activity)
table_name = mssparkutils.widgets.get("table_name")     # "customers"
load_date = mssparkutils.widgets.get("load_date")        # "2026-06-09"
mode = mssparkutils.widgets.get("mode")                  # "overwrite" or "append"

# Cell 2: Load shared functions
%run NB_Utils_Functions

# Cell 3: Process
df = spark.table(f"bronze.{table_name}")
print(f"Read {df.count()} rows from bronze.{table_name}")

df_clean = clean_dataframe(df, drop_nulls_on="id")
df_dedup = deduplicate(df_clean, ["id"], "updated_at")

# Cell 4: Write
df_dedup.write.format("delta").mode(mode).saveAsTable(f"silver.{table_name}")
row_count = df_dedup.count()
print(f"Wrote {row_count} rows to silver.{table_name}")

# Cell 5: Return result to pipeline
mssparkutils.notebook.exit(f"SUCCESS: {row_count} rows to silver.{table_name}")

In the Pipeline, use a ForEach activity to call this notebook for each table — one notebook handles all tables.

Pattern 4: Error Handling and Logging

import json
from datetime import datetime

start_time = datetime.now()
results = {"tables_processed": 0, "tables_failed": 0, "errors": []}

for table in TABLES_TO_PROCESS:
    try:
        df = spark.table(f"bronze.{table}")
        df_clean = clean_dataframe(df)
        df_clean.write.format("delta").mode("overwrite").saveAsTable(f"silver.{table}")
        results["tables_processed"] += 1
        print(f"✅ {table}: {df_clean.count()} rows")
    except Exception as e:
        results["tables_failed"] += 1
        results["errors"].append({"table": table, "error": str(e)})
        print(f"❌ {table}: {str(e)}")

# Summary
duration = (datetime.now() - start_time).total_seconds()
results["duration_seconds"] = duration
results["status"] = "SUCCESS" if results["tables_failed"] == 0 else "PARTIAL_FAILURE"

print(f"
{'='*50}")
print(f"Processed: {results['tables_processed']}, Failed: {results['tables_failed']}")
print(f"Duration: {duration:.0f} seconds")
print(f"Status: {results['status']}")

# Return to pipeline
mssparkutils.notebook.exit(json.dumps(results))

Notebook Visualization and Display

# Display DataFrame as formatted table (Fabric-specific rich display)
display(df)                          # Interactive table with sorting, filtering
df.show(20, truncate=False)          # Plain text table (standard Spark)

# Display with built-in charts
display(df)   # Then click the chart icon below the table
              # Switch between: Table, Bar, Line, Pie, Area, Scatter
              # Drag columns to axes — no code needed

# Display summary statistics
display(df.summary())                # count, mean, stddev, min, max per column
display(df.describe())               # Same but fewer stats

# Convert to pandas for custom visualization (small DataFrames only)
import matplotlib.pyplot as plt
pdf = df.toPandas()                  # Only for small DataFrames (< 1M rows)

# Bar chart
pdf.groupby("category")["revenue"].sum().plot(kind="bar", title="Revenue by Category")
plt.tight_layout()
plt.show()

# Histogram
pdf["amount"].plot(kind="hist", bins=30, title="Amount Distribution")
plt.show()

# Seaborn (if installed in Environment)
import seaborn as sns
sns.boxplot(data=pdf, x="category", y="amount")
plt.show()

Key point: display(df) is Fabric-specific (like Databricks). df.show() is standard Spark. Use display() for interactive exploration, df.show() for logging output in pipeline-triggered notebooks.

Common Mistakes

Installing libraries with %pip in production notebooks — every run spends 2-3 minutes installing. Use Environments instead — libraries are pre-installed, notebooks start instantly.
Not attaching a default lakehouse — spark.table("customers") fails without a default lakehouse. Always pin one.
Using %run for data passing instead of tables — %run shares variables but makes debugging hard. Write intermediate results to tables, read in the next step. Tables are persistent; variables disappear when the session ends.
Hardcoding table names and paths — use parameters or a config notebook. Hardcoded values break when you deploy to a different environment.
Not returning exit values from notebooks called by pipelines — without mssparkutils.notebook.exit(), the pipeline cannot access results or status. Always return a status.
Ignoring Spark shuffle partitions — default 200 partitions for a 10,000-row table creates 200 tiny files. Set to 10-50 for medium data.

Interview Questions

Q: How do you pass parameters between a Fabric Pipeline and a Notebook? A: The Pipeline’s Notebook Activity passes parameters as key-value pairs. The notebook reads them with mssparkutils.widgets.get("param_name"). The notebook returns values with mssparkutils.notebook.exit(json_string), which the pipeline accesses via @activity('Notebook').output.result.exitValue.

Q: What is the difference between %run and mssparkutils.notebook.run()? A: %run executes another notebook in the SAME Spark session — all variables and functions become available in the calling notebook. Best for loading shared functions. mssparkutils.notebook.run() executes in a SEPARATE session — isolated, parameterized, returns a value. Best for orchestrating independent notebooks.

Q: What is a Spark Environment in Fabric? A: A reusable configuration that defines the Spark runtime version, pre-installed libraries, and Spark settings. Attach an environment to notebooks so libraries are available instantly without per-session installation. Essential for production — consistent dependencies across all notebooks.

Wrapping Up

The notebook is where all Fabric data engineering happens. Master the four languages, mssparkutils utilities, notebook chaining patterns, and environment management — and you can build any data pipeline Fabric supports.

← Previous: M Language (Power Query) Guide Fabric (18/38) Next: Apache Spark in Fabric →

Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.

Table of Contents