Azure Databricks for Data Engineers: Introduction, Architecture, and dbutils Commands Explained
If Azure Synapse is a Swiss Army knife with many blades, Azure Databricks is a professional chef’s knife — designed for one thing and optimized to do it brilliantly: Apache Spark workloads. It is one of the most popular platforms for big data processing, machine learning, and lakehouse architecture in 2026.
Databricks was founded by the creators of Apache Spark themselves. They took Spark, added a collaborative notebook environment, built-in cluster management, Delta Lake, Unity Catalog, and a polished user experience — then hosted it on Azure, AWS, and GCP.
For data engineers, Databricks is where you write PySpark code, build Delta Lake tables, run ETL pipelines, and collaborate with data scientists. And the tool you will use most inside Databricks is dbutils — a utility library that handles file operations, secrets, widgets, and notebook orchestration.
This post covers everything you need to get started with Databricks and master the dbutils commands you will use daily.
Table of Contents
- What Is Azure Databricks?
- Databricks vs Synapse Spark: When to Use Which
- Databricks Architecture
- The Workspace: Your Digital Office
- Clusters: The Engines
- Notebooks: The Workbench
- What Is dbutils?
- dbutils.fs — File System Operations
- dbutils.secrets — Secure Secret Management
- dbutils.widgets — Parameterize Your Notebooks
- dbutils.notebook — Orchestrate Notebooks
- Mounting Storage with dbutils
- Working with Delta Lake in Databricks
- Databricks Workflows (Jobs)
- Real-World Scenarios
- Cost Management
- Common Mistakes
- Interview Questions
- Wrapping Up
What Is Azure Databricks?
Azure Databricks is a managed Apache Spark platform jointly developed by Databricks and Microsoft. It provides:
- Collaborative notebooks — write PySpark, SQL, Scala, R in Jupyter-like notebooks
- Managed Spark clusters — auto-scaling, auto-termination, zero infrastructure management
- Delta Lake — ACID transactions, time travel, MERGE (native and default)
- Unity Catalog — centralized governance, access control, data lineage
- MLflow — experiment tracking, model registry, deployment
- Databricks Workflows — job scheduling and orchestration
- SQL Warehouses — serverless SQL endpoints for BI tools
Real-life analogy: If building a data platform is like building a house, Databricks is like hiring a construction company that brings their own tools, scaffolding, project management, and cleanup crew. You just design the house (write code) and they handle the rest (infrastructure, scaling, security).
Databricks vs Synapse Spark: When to Use Which
| Feature | Azure Databricks | Synapse Spark Pool |
|---|---|---|
| Spark experience | Best-in-class (created by Spark founders) | Good (standard Spark) |
| Notebook experience | Superior (real-time coauthoring, versioning, commenting) | Basic |
| Delta Lake | Native, deeply integrated, optimized | Supported but not default |
| Cluster management | Advanced (auto-scaling, spot instances, pools) | Basic (auto-scale, auto-pause) |
| SQL analytics | SQL Warehouses (serverless BI endpoint) | Serverless SQL Pool (different engine) |
| ML capabilities | MLflow, AutoML, Feature Store | Basic ML library support |
| Unity Catalog | Full governance platform | Synapse uses Purview |
| Pipeline integration | Databricks Workflows or ADF | Synapse Pipelines (built-in) |
| Cost | Per DBU (Databricks Unit) + VM cost | Per node-hour |
| Best for | Spark-heavy, ML, lakehouse architecture | Integrated analytics (SQL + Spark + Pipelines) |
Rule of thumb: Choose Databricks when Spark and Delta Lake are your primary tools. Choose Synapse Spark when you need Spark as ONE component of a larger analytics platform with SQL pools and built-in pipelines.
Databricks Architecture
┌─────────────────────────────────────────────────────────────────┐
│ AZURE DATABRICKS │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ CONTROL PLANE (Managed by Databricks) │ │
│ │ │ │
│ │ Workspace UI │ Notebook Service │ Cluster Manager │ │
│ │ Job Scheduler │ Unity Catalog │ REST API │ │
│ └──────────────────────────┬───────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────┴───────────────────────────────┐ │
│ │ DATA PLANE (Runs in YOUR Azure subscription) │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Cluster 1│ │ Cluster 2│ │ Cluster 3│ │ │
│ │ │ (Driver) │ │ (Driver) │ │ (Driver) │ │ │
│ │ │ Workers │ │ Workers │ │ Workers │ │ │
│ │ └────┬─────┘ └────┬─────┘ └────┬─────┘ │ │
│ │ │ │ │ │ │
│ │ └──────────────┼──────────────┘ │ │
│ │ │ │ │
│ │ ┌───────┴────────┐ │ │
│ │ │ ADLS Gen2 / │ │ │
│ │ │ Azure Storage │ │ │
│ │ │ (Your Data) │ │ │
│ │ └────────────────┘ │ │
│ └────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Two Planes
Control Plane (Managed by Databricks): The workspace UI, notebook service, cluster manager, job scheduler, and REST API. You interact with this through your browser. Databricks manages it entirely.
Data Plane (Your Azure Subscription): The actual compute (VMs running Spark) and storage (ADLS Gen2). Clusters run as VMs in YOUR Azure subscription. Your data never leaves your environment.
Real-life analogy: The Control Plane is like the airline’s booking system — you search flights, book tickets, and check in online. The Data Plane is the actual airplane — it flies from YOUR airport with YOUR passengers. The airline manages the booking system, but the passengers (data) are always under your control.
The Workspace: Your Digital Office
When you open Databricks, you land in the Workspace — your central hub for everything:
| Section | What It Contains | Icon |
|---|---|---|
| Workspace | Notebooks, folders, libraries, repos | Folder icon |
| Repos | Git-linked repositories | Git branch icon |
| Data | Databases, tables, file browser (DBFS) | Database icon |
| Compute | Clusters, SQL Warehouses | Server icon |
| Workflows | Scheduled jobs and pipelines | Clock icon |
| Catalog | Unity Catalog (databases, schemas, tables, permissions) | Shield icon |
Creating a Workspace (Azure Portal)
- Azure Portal > Create a resource > Azure Databricks
- Workspace name: `dbw-dataplatform-dev`
- Region: Canada Central
- Pricing tier: Premium (required for Unity Catalog, RBAC, secrets)
- Click Review + Create > Create
- After deployment, click Launch Workspace — opens Databricks UI
Clusters: The Engines
A cluster is a set of VMs running Apache Spark. When you run a notebook, it runs on a cluster.
Creating a Cluster
- Click Compute in the sidebar
- Click Create compute
- Configure:
| Setting | Dev/Learning | Production |
|---|---|---|
| Cluster name | `dev-cluster` | `prod-etl-cluster` |
| Cluster mode | Single Node (cheapest) | Standard (multi-node) |
| Databricks Runtime | Latest LTS (e.g., 15.4 LTS) | Latest LTS |
| Worker type | Standard_DS3_v2 (4 cores, 14 GB) | Standard_E8s_v3 or larger |
| Workers | 0 (single node) | 2-10 (auto-scale) |
| Auto-terminate | 30 minutes | 60 minutes |
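For reference, these settings map onto the Databricks Clusters API. Here is a minimal sketch of the dev configuration as an API payload (field names follow the Clusters API; the exact runtime version string is an assumption, and the spark_conf and custom_tags entries are the documented markers for a single-node cluster):
{
  "cluster_name": "dev-cluster",
  "spark_version": "15.4.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 0,
  "autotermination_minutes": 30,
  "spark_conf": {
    "spark.databricks.cluster.profile": "singleNode",
    "spark.master": "local[*]"
  },
  "custom_tags": { "ResourceClass": "SingleNode" }
}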
Auto-Termination (Save Money!)
Clusters auto-terminate after a period of inactivity. If you walk away from your laptop, the cluster shuts down after 30 minutes. No more paying for idle compute overnight.
Real-life analogy: Auto-terminate is like a car engine that turns off automatically when you stop at a red light. The engine restarts when you press the gas. You save fuel (money) during every idle period.
Cluster Types
| Type | Nodes | Best For |
|---|---|---|
| Single Node | 1 (driver only, no workers) | Development, small data, learning |
| Standard | 1 driver + N workers | Production ETL, big data |
| High Concurrency | Shared by multiple users | Interactive analytics, shared notebooks |
Notebooks: The Workbench
Notebooks are interactive documents where you write and execute code. Each cell runs independently, and results appear inline.
Creating a Notebook
- Click Workspace > + > Notebook
- Name: `ETL_Customers`
- Default language: Python (can also use SQL, Scala, R)
- Cluster: attach to your cluster
Multi-Language Support
You can mix languages in the same notebook using magic commands:
# Default: Python
df = spark.read.parquet("/mnt/datalake/customers/")
df.count()
%sql
-- Switch to SQL
SELECT city, COUNT(*) as customer_count
FROM customers
GROUP BY city
ORDER BY customer_count DESC
%scala
// Switch to Scala
val df = spark.read.parquet("/mnt/datalake/customers/")
df.show()
%md
### This is Markdown
You can write documentation directly in notebooks using `%md`.
Real-life analogy: Notebooks are like a lab journal. You write notes (markdown), run experiments (code cells), see results (output), and iterate. The entire thought process is captured in one document that you can share with colleagues.
What Is dbutils?
dbutils (Databricks Utilities) is a built-in library available in every Databricks notebook. It provides commands for file system operations, secret management, notebook parameters, and notebook orchestration.
Think of dbutils as the toolbox that comes with every Databricks workspace. You do not install it. You do not import it. It is always there, ready to use.
# It is already available — no import needed
dbutils.help()
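Each module also has its own help, which prints the available commands and their signatures:
# Module-level help: lists every command in dbutils.fs
dbutils.fs.help()
# Command-level help: detailed usage for a single command
dbutils.fs.help("cp")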
The Four dbutils Modules
| Module | What It Does | Real-Life Equivalent |
|---|---|---|
| `dbutils.fs` | File system operations (list, copy, move, delete) | File Explorer / Finder |
| `dbutils.secrets` | Read secrets from Key Vault securely | Password manager |
| `dbutils.widgets` | Create parameterized notebooks with input fields | Form fields on a web page |
| `dbutils.notebook` | Run other notebooks and pass parameters | Calling a coworker to do a subtask |
dbutils.fs — File System Operations
List Files and Directories
# List files in a directory
files = dbutils.fs.ls("/mnt/datalake/bronze/")
for f in files:
    print(f.name, f.size, f.path)
# Output:
# customers/ 0 dbfs:/mnt/datalake/bronze/customers/
# orders/ 0 dbfs:/mnt/datalake/bronze/orders/
# products/ 0 dbfs:/mnt/datalake/bronze/products/
# List with details
display(dbutils.fs.ls("/mnt/datalake/bronze/customers/"))
# Shows: path, name, size, modificationTime in a nice table
View File Contents
# Read the first 1000 bytes of a file
print(dbutils.fs.head("/mnt/datalake/bronze/config.json", 1000))
Copy Files
# Copy a single file
dbutils.fs.cp(
"/mnt/datalake/bronze/customers/data.parquet",
"/mnt/datalake/archive/customers/data.parquet"
)
# Copy entire directory (recursive)
dbutils.fs.cp(
"/mnt/datalake/bronze/customers/",
"/mnt/datalake/archive/customers/",
recurse=True
)
Move Files
# Move (rename) a file
dbutils.fs.mv(
"/mnt/datalake/incoming/daily_sales.csv",
"/mnt/datalake/bronze/sales/daily_sales.csv"
)
# Move entire directory
dbutils.fs.mv(
"/mnt/datalake/incoming/",
"/mnt/datalake/processed/",
recurse=True
)
Delete Files
# Delete a single file
dbutils.fs.rm("/mnt/datalake/temp/old_file.parquet")
# Delete a directory and all contents (recursive)
dbutils.fs.rm("/mnt/datalake/temp/", recurse=True)
Create Directories
# Create a directory
dbutils.fs.mkdirs("/mnt/datalake/silver/customers/2026/04/")
Put Content (Write Text Files)
# Write a small text file
dbutils.fs.put(
"/mnt/datalake/config/pipeline_status.txt",
"Pipeline completed successfully at 2026-04-20 14:30:00",
overwrite=True
)
DBFS vs ADLS Paths
# DBFS path (Databricks File System — internal)
dbutils.fs.ls("dbfs:/mnt/datalake/")
# ABFSS path (direct ADLS Gen2 — external)
dbutils.fs.ls("abfss://container@storageaccount.dfs.core.windows.net/")
# Both work — DBFS paths are often shorter and easier to type
Real-life analogy: dbutils.fs is like the File Explorer on your computer. You can browse folders, copy files, move files, delete files, and create directories. The difference is that these files are in a cloud data lake, not on your local hard drive.
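One pattern worth knowing: dbutils.fs.ls() is not recursive. A small hypothetical helper (the dir_size name is mine) can walk subdirectories, relying on the fact that directory entries are listed with names ending in "/":
def dir_size(path):
    """Recursively sum file sizes under a path (directory names end with '/')."""
    total = 0
    for f in dbutils.fs.ls(path):
        if f.name.endswith("/"):   # directory entry — recurse into it
            total += dir_size(f.path)
        else:
            total += f.size
    return total

print(f"Bronze layer size: {dir_size('/mnt/datalake/bronze/') / (1024 * 1024):.2f} MB")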
dbutils.secrets — Secure Secret Management
Why Secrets Matter
Never hardcode passwords, API keys, or connection strings in notebooks. Anyone with notebook access can see them. Notebooks are committed to Git. Passwords in Git = security breach.
dbutils.secrets reads secrets from Azure Key Vault at runtime. The secret value is never displayed in the notebook output — it shows [REDACTED].
Setting Up a Secret Scope
Before using secrets, link your Databricks workspace to Azure Key Vault:
- Go to `https://<databricks-workspace-url>#secrets/createScope`
- Scope Name: `keyvault-scope`
- Manage Principal: All Users (or Creator for restricted access)
- DNS Name: your Key Vault URI (e.g., `https://kv-dataplatform-dev.vault.azure.net/`)
- Resource ID: your Key Vault’s full resource ID (from Azure Portal > Key Vault > Properties)
- Click Create
Reading Secrets
# Get a secret value
password = dbutils.secrets.get(scope="keyvault-scope", key="sql-admin-password")
# The value is available in your code but NEVER displayed
print(password) # Output: [REDACTED] — Databricks hides it!
# Use it in a connection string
jdbc_url = f"jdbc:sqlserver://server.database.windows.net:1433;database=mydb;user=admin;password={password}"
df = spark.read.format("jdbc").option("url", jdbc_url).option("dbtable", "customers").load()
List Available Scopes and Secrets
# List all secret scopes
dbutils.secrets.listScopes()
# List secrets in a scope (shows key names, NOT values)
dbutils.secrets.list(scope="keyvault-scope")
# Output: [SecretMetadata(key='sql-admin-password'), SecretMetadata(key='storage-key')]
Common Secrets for Data Engineers
| Secret Key | What It Stores | Used For |
|---|---|---|
| `sql-admin-password` | Azure SQL admin password | JDBC connections |
| `storage-access-key` | ADLS Gen2 access key | Storage mounting |
| `service-principal-secret` | SP client secret | Service authentication |
| `api-key` | External API key | REST API calls |
| `cosmosdb-key` | Cosmos DB primary key | Cosmos DB connections |
Real-life analogy: dbutils.secrets is like a safe in a hotel room. You put your valuables (passwords) in the safe (Key Vault). When you need them, you enter the code (scope + key) and the safe opens temporarily. The valuables never sit out in the open (never displayed in notebook output).
dbutils.widgets — Parameterize Your Notebooks
Why Widgets Matter
Instead of hardcoding values like file paths, dates, or table names, widgets let you create input parameters that can be changed at runtime. This makes notebooks reusable without editing code.
Creating Widgets
# Text widget (free-form input)
dbutils.widgets.text("source_date", "2026-04-20", "Enter Source Date")
# Dropdown widget (select from predefined options)
dbutils.widgets.dropdown("environment", "dev", ["dev", "uat", "prod"], "Select Environment")
# Combobox widget (dropdown + free-form input)
dbutils.widgets.combobox("table_name", "customers", ["customers", "orders", "products"], "Select Table")
# Multiselect widget (select multiple values)
dbutils.widgets.multiselect("regions", "ALL", ["ALL", "NA", "EU", "APAC"], "Select Regions")
When you run these, input fields appear at the TOP of the notebook.
Reading Widget Values
# Get the current value of a widget
source_date = dbutils.widgets.get("source_date")
environment = dbutils.widgets.get("environment")
table_name = dbutils.widgets.get("table_name")
print(f"Processing {table_name} for {source_date} in {environment}")
# Output: Processing customers for 2026-04-20 in dev
Using Widgets in Spark Code
source_date = dbutils.widgets.get("source_date")
env = dbutils.widgets.get("environment")
# Dynamic path based on widget values
input_path = f"/mnt/datalake/{env}/bronze/customers/{source_date}/"
output_path = f"/mnt/datalake/{env}/silver/customers/{source_date}/"
df = spark.read.parquet(input_path)
df_clean = df.filter(df.status == "Active")
df_clean.write.format("delta").mode("overwrite").save(output_path)
print(f"Processed {df_clean.count()} rows from {input_path}")
Removing Widgets
# Remove a specific widget
dbutils.widgets.remove("source_date")
# Remove ALL widgets
dbutils.widgets.removeAll()
Widgets with Databricks Workflows (Jobs)
When you schedule a notebook as a job, you pass widget values as parameters:
{
"source_date": "2026-04-20",
"environment": "prod",
"table_name": "customers"
}
The job fills the widget values automatically — no manual input needed.
Real-life analogy: Widgets are like the settings dials on a washing machine. The washing machine (notebook) is the same, but you adjust the temperature (source_date), cycle type (environment), and load size (table_name) each time. One machine, many configurations.
dbutils.notebook — Orchestrate Notebooks
Running Another Notebook
# Run a notebook and wait for it to finish
result = dbutils.notebook.run(
"/Workspace/ETL/Process_Customers", # Path to the notebook
timeout_seconds=300, # Max wait time (5 minutes)
arguments={"source_date": "2026-04-20", "environment": "prod"}
)
print(f"Notebook returned: {result}")
Returning a Value from a Notebook
In the called notebook (Process_Customers):
# At the end of the notebook
row_count = df.count()
dbutils.notebook.exit(f"Processed {row_count} rows")
The calling notebook receives this string as the return value.
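Since exit() can only return a string, a common pattern is to serialize structured results as JSON (a sketch; the status and row_count fields are my own example, not a Databricks convention):
# In the called notebook: pack results into a JSON string
import json
dbutils.notebook.exit(json.dumps({"status": "ok", "row_count": row_count}))

# In the calling notebook: parse the string back into a dict
result = json.loads(dbutils.notebook.run("/Workspace/ETL/Process_Customers", 300))
print(result["status"], result["row_count"])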
Building a Pipeline with Notebook Orchestration
# Master notebook that orchestrates multiple ETL steps
# Step 1: Ingest raw data
result1 = dbutils.notebook.run("/ETL/01_Ingest_Raw", 600,
{"source": "sql_database", "target": "bronze"})
print(f"Ingest: {result1}")
# Step 2: Clean and standardize
result2 = dbutils.notebook.run("/ETL/02_Clean_Data", 600,
{"source": "bronze", "target": "silver"})
print(f"Clean: {result2}")
# Step 3: Build aggregations
result3 = dbutils.notebook.run("/ETL/03_Build_Gold", 600,
{"source": "silver", "target": "gold"})
print(f"Gold: {result3}")
print("Pipeline completed successfully!")
Error Handling
try:
    result = dbutils.notebook.run("/ETL/Process_Customers", 300,
                                  {"source_date": "2026-04-20"})
    print(f"Success: {result}")
except Exception as e:
    print(f"Notebook failed: {str(e)}")
    # Log error, send alert, etc.
Real-life analogy: dbutils.notebook.run() is like a manager delegating tasks. “Alice, process the customer data and tell me how many rows you handled.” Alice works, finishes, and reports back: “Processed 5,000 rows.” The manager then says “Bob, take Alice’s output and build the reports.” Each person (notebook) does their part, and the manager (master notebook) coordinates.
Mounting Storage with dbutils
What Is Mounting?
Mounting creates a shortcut from a DBFS path to an external storage location (ADLS Gen2). Instead of typing the full abfss://container@account.dfs.core.windows.net/ path every time, you type /mnt/datalake/.
Without mount: abfss://synapse-workspace@naveensynapsedl.dfs.core.windows.net/bronze/customers/
With mount: /mnt/datalake/bronze/customers/
Mount with Service Principal
configs = {
"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": dbutils.secrets.get("keyvault-scope", "sp-client-id"),
"fs.azure.account.oauth2.client.secret": dbutils.secrets.get("keyvault-scope", "sp-client-secret"),
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}
dbutils.fs.mount(
source="abfss://synapse-workspace@naveensynapsedl.dfs.core.windows.net/",
mount_point="/mnt/datalake",
extra_configs=configs
)
print("Mounted successfully!")
Mount with Access Key (Simpler)
dbutils.fs.mount(
source="abfss://synapse-workspace@naveensynapsedl.dfs.core.windows.net/",
mount_point="/mnt/datalake",
extra_configs={
"fs.azure.account.key.naveensynapsedl.dfs.core.windows.net":
dbutils.secrets.get("keyvault-scope", "storage-access-key")
}
)
List and Unmount
# List all mounts
display(dbutils.fs.mounts())
# Unmount
dbutils.fs.unmount("/mnt/datalake")
After Mounting
# Now you can use simple paths
df = spark.read.parquet("/mnt/datalake/bronze/customers/")
df.write.format("delta").save("/mnt/datalake/silver/customers/")
dbutils.fs.ls("/mnt/datalake/bronze/")
Note: In modern Databricks with Unity Catalog, direct access using abfss:// paths with external locations is preferred over mounting. Mounts are still widely used but considered legacy.
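If you skip mounting, you can set credentials on the Spark session and read abfss:// paths directly. A minimal sketch using the same service principal secrets as above (the storage account name is reused from earlier examples; the tenant ID placeholder is yours to fill in):
storage_account = "naveensynapsedl"

# Session-scoped OAuth credentials for direct abfss:// access
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
               dbutils.secrets.get("keyvault-scope", "sp-client-id"))
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
               dbutils.secrets.get("keyvault-scope", "sp-client-secret"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

df = spark.read.parquet(f"abfss://synapse-workspace@{storage_account}.dfs.core.windows.net/bronze/customers/")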
Real-life analogy: Mounting is like creating a shortcut on your desktop. Instead of navigating through Computer > Network > Server > Share > Folder every time, you create a shortcut called “DataLake” that takes you directly there.
Working with Delta Lake in Databricks
Delta Lake is the DEFAULT format in Databricks. Every table you create is Delta unless you specify otherwise.
# Write Delta table
df.write.format("delta").mode("overwrite").save("/mnt/datalake/silver/customers/")
# Read Delta table
df = spark.read.format("delta").load("/mnt/datalake/silver/customers/")
# Create managed table (registered in catalog)
df.write.format("delta").saveAsTable("silver.customers")
# Query with SQL
spark.sql("SELECT * FROM silver.customers WHERE city = 'Toronto'").show()
# MERGE (upsert)
from delta.tables import DeltaTable
target = DeltaTable.forPath(spark, "/mnt/datalake/silver/customers/")
target.alias("t").merge(
source_df.alias("s"),
"t.customer_id = s.customer_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
# Time travel
df_yesterday = spark.read.format("delta").option("versionAsOf", 5).load("/mnt/datalake/silver/customers/")
# OPTIMIZE (compact small files)
spark.sql("OPTIMIZE silver.customers")
# Z-ORDER (cluster data for faster queries)
spark.sql("OPTIMIZE silver.customers ZORDER BY (city)")
# Vacuum (clean up old files)
spark.sql("VACUUM silver.customers RETAIN 168 HOURS")
Databricks Workflows (Jobs)
Creating a Job
- Click Workflows in the sidebar
- Click Create Job
- Configure:
  - Task name: `ETL_Daily_Customers`
  - Type: Notebook
  - Source: select your notebook
  - Cluster: select existing or create new job cluster
  - Parameters: `{"source_date": "{{job.start_date}}", "environment": "prod"}`
- Schedule: Add a schedule (cron expression or UI picker)
- Click Create
Multi-Task Workflows
Task 1: Ingest_Raw ──→ Task 2: Clean_Data ──→ Task 3: Build_Gold
                                                      │
                                                      ▼
                                          Task 4: Update_Dashboard
Each task can be a different notebook, with dependencies between them.
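The same dependency graph can be expressed as a Jobs API payload fragment (a sketch; task_key and depends_on are the Jobs API fields that encode the arrows above, and the 04_Update_Dashboard path is hypothetical):
{
  "name": "ETL_Daily",
  "tasks": [
    { "task_key": "Ingest_Raw",
      "notebook_task": { "notebook_path": "/ETL/01_Ingest_Raw" } },
    { "task_key": "Clean_Data",
      "depends_on": [{ "task_key": "Ingest_Raw" }],
      "notebook_task": { "notebook_path": "/ETL/02_Clean_Data" } },
    { "task_key": "Build_Gold",
      "depends_on": [{ "task_key": "Clean_Data" }],
      "notebook_task": { "notebook_path": "/ETL/03_Build_Gold" } },
    { "task_key": "Update_Dashboard",
      "depends_on": [{ "task_key": "Build_Gold" }],
      "notebook_task": { "notebook_path": "/ETL/04_Update_Dashboard" } }
  ]
}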
Job Clusters vs All-Purpose Clusters
| Type | When Created | When Destroyed | Cost |
|---|---|---|---|
| All-Purpose | Manually by user | After auto-terminate timeout | Higher (interactive use premium) |
| Job Cluster | Automatically when job starts | Automatically when job ends | Lower (no interactive premium) |
Always use Job Clusters for scheduled production workloads — they are cheaper and auto-cleanup.
Real-World Scenarios
Scenario 1: Daily ETL Pipeline
# Notebook: ETL_Daily_Pipeline
source_date = dbutils.widgets.get("source_date")
password = dbutils.secrets.get("keyvault-scope", "sql-password")
# Extract from SQL
df = spark.read.format("jdbc") .option("url", f"jdbc:sqlserver://server:1433;database=mydb;password={password}") .option("query", f"SELECT * FROM orders WHERE order_date = '{source_date}'") .load()
# Transform
df_clean = df.dropDuplicates(["order_id"]).filter("amount > 0")
# Load to Delta
df_clean.write.format("delta").mode("append") .partitionBy("order_date") .save("/mnt/datalake/silver/orders/")
dbutils.notebook.exit(f"Loaded {df_clean.count()} orders for {source_date}")
Scenario 2: Data Lake File Management
# Archive old files
old_files = dbutils.fs.ls("/mnt/datalake/bronze/incoming/")
for f in old_files:
    if "2025" in f.name:
        dbutils.fs.mv(f.path, f"/mnt/datalake/archive/{f.name}")
        print(f"Archived: {f.name}")
# Check file sizes
files = dbutils.fs.ls("/mnt/datalake/silver/customers/")
total_size = sum(f.size for f in files)
print(f"Total size: {total_size / (1024*1024):.2f} MB")
Cost Management
| Component | Pricing | Tip |
|---|---|---|
| All-Purpose Cluster | DBU rate + VM cost | Set auto-terminate to 30 min |
| Job Cluster | Lower DBU rate + VM cost | Use for all scheduled jobs |
| SQL Warehouse | Per DBU (serverless) | Scales to zero when idle |
| Storage | ADLS Gen2 rates (~$0.02/GB) | Standard Azure pricing |
Cost Saving Tips
- Auto-terminate clusters after 30 minutes of inactivity
- Use Job Clusters for scheduled workloads (cheaper DBU rate)
- Use spot instances for fault-tolerant batch jobs (60-90% cheaper VMs; see the snippet after this list)
- Right-size clusters — start with 2 workers, scale up if needed
- VACUUM Delta tables — delete old file versions to save storage
- Use single-node clusters for development
- Monitor usage — Databricks has a built-in cost dashboard
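For the spot-instance tip above, here is a sketch of the relevant cluster-spec fragment (azure_attributes fields per the Clusters API; first_on_demand keeps the driver on a regular on-demand VM, and spot_bid_max_price of -1 caps the bid at the on-demand price):
"azure_attributes": {
  "first_on_demand": 1,
  "availability": "SPOT_WITH_FALLBACK_AZURE",
  "spot_bid_max_price": -1
}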
Common Mistakes
- Hardcoding passwords in notebooks — use `dbutils.secrets.get()` instead. Notebooks are committed to Git. Passwords in Git = security breach.
- Leaving clusters running overnight — set auto-terminate to 30 minutes. An 8-node cluster running overnight costs $200+.
- Using All-Purpose clusters for scheduled jobs — Job Clusters have a lower DBU rate and auto-terminate when the job finishes.
- Not using Delta Lake — writing Parquet directly means no ACID, no MERGE, no time travel. Delta is the default for a reason.
- Forgetting `recurse=True` on directory operations — `dbutils.fs.rm("/path/")` without `recurse=True` fails on non-empty directories.
- Mounting with access keys instead of service principals — access keys give full account access. Service principals can be scoped to specific containers.
- Not parameterizing notebooks — hardcoded paths and dates make notebooks non-reusable. Use `dbutils.widgets` for everything that changes between runs.
Interview Questions
Q: What is Azure Databricks?
A: A managed Apache Spark platform for big data processing, machine learning, and lakehouse architecture. It provides collaborative notebooks, managed clusters, native Delta Lake integration, and Unity Catalog for governance. Created by the founders of Spark and jointly developed with Microsoft for Azure.
Q: What is dbutils and what are its main modules?
A: dbutils is a built-in utility library in Databricks notebooks. Four modules: dbutils.fs for file operations (list, copy, move, delete), dbutils.secrets for reading secrets from Key Vault, dbutils.widgets for notebook parameterization, and dbutils.notebook for orchestrating other notebooks.
Q: How do you securely handle credentials in Databricks?
A: Store secrets in Azure Key Vault, create a secret scope linking Databricks to Key Vault, and read secrets at runtime using dbutils.secrets.get(scope, key). Secret values are automatically redacted in notebook output. Never hardcode credentials.
Q: What is the difference between mounting and direct ABFSS access?
A: Mounting creates a DBFS shortcut (/mnt/datalake/) to an ADLS Gen2 path, persisted across sessions. Direct ABFSS access uses the full path (abfss://container@account.dfs.core.windows.net/) each time. Mounting is simpler but considered legacy. Unity Catalog external locations are the modern approach.
Q: How do you parameterize a notebook for different environments?
A: Use dbutils.widgets to create input parameters (text, dropdown, multiselect). Read values with dbutils.widgets.get("param_name"). When scheduling as a job, pass parameters as JSON. This makes one notebook work for dev, UAT, and prod without code changes.
Q: What is the difference between All-Purpose and Job Clusters?
A: All-Purpose clusters are created manually, persist across notebooks, and have a higher DBU rate — best for interactive development. Job Clusters are created automatically when a job starts and destroyed when it ends, with a lower DBU rate — best for scheduled production workloads.
Wrapping Up
Azure Databricks is where data engineering meets world-class Spark. The notebooks are intuitive, the cluster management is painless, Delta Lake is native, and dbutils gives you the file system, secret management, parameterization, and orchestration tools you need — all built in.
The key dbutils commands to memorize:
– dbutils.fs.ls() — browse your data lake
– dbutils.secrets.get() — read passwords securely
– dbutils.widgets.get() — parameterize your notebooks
– dbutils.notebook.run() — orchestrate your pipeline
Master these four, and you can build any data pipeline in Databricks.
Related posts: – Apache Spark and PySpark – Data File Formats (Delta Lake) – Azure Synapse Workspace Setup – Python for Data Engineers – Data Flows in ADF/Synapse
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.