Azure Databricks for Data Engineers: Introduction, Architecture, and dbutils Commands Explained

If Azure Synapse is a Swiss Army knife with many blades, Azure Databricks is a professional chef’s knife — designed for one thing and optimized to do it brilliantly: Apache Spark workloads. It is one of the most popular platforms for big data processing, machine learning, and lakehouse architecture.

Databricks was founded by the creators of Apache Spark themselves. They took Spark, added a collaborative notebook environment, built-in cluster management, Delta Lake, Unity Catalog, and a polished user experience — then hosted it on Azure, AWS, and GCP.

For data engineers, Databricks is where you write PySpark code, build Delta Lake tables, run ETL pipelines, and collaborate with data scientists. And the tool you will use most inside Databricks is dbutils — a utility library that handles file operations, secrets, widgets, and notebook orchestration.

This post covers everything you need to get started with Databricks and master the dbutils commands you will use daily.

Table of Contents

  • What Is Azure Databricks?
  • Databricks vs Synapse Spark: When to Use Which
  • Databricks Architecture
  • The Workspace: Your Digital Office
  • Clusters: The Engines
  • Notebooks: The Workbench
  • What Is dbutils?
  • dbutils.fs — File System Operations
  • dbutils.secrets — Secure Secret Management
  • dbutils.widgets — Parameterize Your Notebooks
  • dbutils.notebook — Orchestrate Notebooks
  • dbutils.help() — Discover Commands
  • Mounting Storage with dbutils
  • Working with Delta Lake in Databricks
  • Unity Catalog Basics
  • Databricks Workflows (Jobs)
  • Real-World Scenarios
  • Cost Management
  • Common Mistakes
  • Interview Questions
  • Wrapping Up

What Is Azure Databricks?

Azure Databricks is a managed Apache Spark platform jointly developed by Databricks and Microsoft. It provides:

  • Collaborative notebooks — write PySpark, SQL, Scala, R in Jupyter-like notebooks
  • Managed Spark clusters — auto-scaling, auto-termination, zero infrastructure management
  • Delta Lake — ACID transactions, time travel, MERGE (native and default)
  • Unity Catalog — centralized governance, access control, data lineage
  • MLflow — experiment tracking, model registry, deployment
  • Databricks Workflows — job scheduling and orchestration
  • SQL Warehouses — serverless SQL endpoints for BI tools

Real-life analogy: If building a data platform is like building a house, Databricks is like hiring a construction company that brings their own tools, scaffolding, project management, and cleanup crew. You just design the house (write code) and they handle the rest (infrastructure, scaling, security).

Databricks vs Synapse Spark: When to Use Which

| Feature | Azure Databricks | Synapse Spark Pool |
| --- | --- | --- |
| Spark experience | Best-in-class (created by Spark founders) | Good (standard Spark) |
| Notebook experience | Superior (real-time coauthoring, versioning, commenting) | Basic |
| Delta Lake | Native, deeply integrated, optimized | Supported but not default |
| Cluster management | Advanced (auto-scaling, spot instances, pools) | Basic (auto-scale, auto-pause) |
| SQL analytics | SQL Warehouses (serverless BI endpoint) | Serverless SQL Pool (different engine) |
| ML capabilities | MLflow, AutoML, Feature Store | Basic ML library support |
| Unity Catalog | Full governance platform | Synapse uses Purview |
| Pipeline integration | Databricks Workflows or ADF | Synapse Pipelines (built-in) |
| Cost | Per DBU (Databricks Unit) + VM cost | Per node-hour |
| Best for | Spark-heavy, ML, lakehouse architecture | Integrated analytics (SQL + Spark + Pipelines) |

Rule of thumb: Choose Databricks when Spark and Delta Lake are your primary tools. Choose Synapse Spark when you need Spark as ONE component of a larger analytics platform with SQL pools and built-in pipelines.

Databricks Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     AZURE DATABRICKS                            │
│                                                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                CONTROL PLANE (Managed by Databricks)     │    │
│  │                                                          │    │
│  │  Workspace UI │ Notebook Service │ Cluster Manager       │    │
│  │  Job Scheduler │ Unity Catalog │ REST API                │    │
│  └──────────────────────────┬───────────────────────────────┘    │
│                              │                                    │
│  ┌──────────────────────────┴───────────────────────────────┐    │
│  │                DATA PLANE (Runs in YOUR Azure subscription) │  │
│  │                                                            │  │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐               │  │
│  │  │ Cluster 1│  │ Cluster 2│  │ Cluster 3│               │  │
│  │  │ (Driver) │  │ (Driver) │  │ (Driver) │               │  │
│  │  │ Workers  │  │ Workers  │  │ Workers  │               │  │
│  │  └────┬─────┘  └────┬─────┘  └────┬─────┘               │  │
│  │       │              │              │                      │  │
│  │       └──────────────┼──────────────┘                      │  │
│  │                      │                                      │  │
│  │              ┌───────┴────────┐                             │  │
│  │              │ ADLS Gen2 /    │                             │  │
│  │              │ Azure Storage  │                             │  │
│  │              │ (Your Data)    │                             │  │
│  │              └────────────────┘                             │  │
│  └────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Two Planes

Control Plane (Managed by Databricks): The workspace UI, notebook service, cluster manager, job scheduler, and REST API. You interact with this through your browser. Databricks manages it entirely.

Data Plane (Your Azure Subscription): The actual compute (VMs running Spark) and storage (ADLS Gen2). Clusters run as VMs in YOUR Azure subscription. Your data never leaves your environment.

Real-life analogy: The Control Plane is like the airline’s booking system — you search flights, book tickets, and check in online. The Data Plane is the actual airplane — it flies from YOUR airport with YOUR passengers. The airline manages the booking system, but the passengers (data) are always under your control.

The Workspace: Your Digital Office

When you open Databricks, you land in the Workspace — your central hub for everything:

| Section | What It Contains | Icon |
| --- | --- | --- |
| Workspace | Notebooks, folders, libraries, repos | Folder icon |
| Repos | Git-linked repositories | Git branch icon |
| Data | Databases, tables, file browser (DBFS) | Database icon |
| Compute | Clusters, SQL Warehouses | Server icon |
| Workflows | Scheduled jobs and pipelines | Clock icon |
| Catalog | Unity Catalog (databases, schemas, tables, permissions) | Shield icon |

Creating a Workspace (Azure Portal)

  1. Azure Portal > Create a resource > Azure Databricks
  2. Workspace name: dbw-dataplatform-dev
  3. Region: Canada Central
  4. Pricing tier: Premium (required for Unity Catalog, RBAC, secrets)
  5. Click Review + Create > Create
  6. After deployment, click Launch Workspace — opens Databricks UI

Clusters: The Engines

A cluster is a set of VMs running Apache Spark. When you run a notebook, it runs on a cluster.

Creating a Cluster

  1. Click Compute in the sidebar
  2. Click Create compute
  3. Configure:
| Setting | Dev/Learning | Production |
| --- | --- | --- |
| Cluster name | dev-cluster | prod-etl-cluster |
| Cluster mode | Single Node (cheapest) | Standard (multi-node) |
| Databricks Runtime | Latest LTS (e.g., 15.4 LTS) | Latest LTS |
| Worker type | Standard_DS3_v2 (4 cores, 14 GB) | Standard_E8s_v3 or larger |
| Workers | 0 (single node) | 2-10 (auto-scale) |
| Auto-terminate | 30 minutes | 60 minutes |

Auto-Termination (Save Money!)

Clusters auto-terminate after a period of inactivity. If you walk away from your laptop, the cluster shuts down after 30 minutes. No more paying for idle compute overnight.

Real-life analogy: Auto-terminate is like a car engine that turns off automatically when you stop at a red light. The engine restarts when you press the gas. You save fuel (money) during every idle period.
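Auto-termination is configured per cluster. As a sketch of what that setting looks like programmatically (field names follow the Databricks Clusters API; the runtime string, node type, and worker counts here are illustrative, not recommendations):

```python
import json

# Illustrative cluster spec for the Databricks Clusters API create endpoint.
# Field names follow the public API docs; runtime and node types are examples.
cluster_spec = {
    "cluster_name": "dev-cluster",
    "spark_version": "15.4.x-scala2.12",   # a Databricks Runtime LTS version string
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,         # shut down after 30 idle minutes
}

# In practice you would POST this JSON to <workspace-url>/api/2.0/clusters/create
# with a personal access token; here we just render the payload.
print(json.dumps(cluster_spec, indent=2))
```

The same `autotermination_minutes` value is what the "Terminate after ... minutes of inactivity" field in the Compute UI maps to.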

Cluster Types

| Type | Nodes | Best For |
| --- | --- | --- |
| Single Node | 1 (driver only, no workers) | Development, small data, learning |
| Standard | 1 driver + N workers | Production ETL, big data |
| High Concurrency | Shared by multiple users | Interactive analytics, shared notebooks |

Notebooks: The Workbench

Notebooks are interactive documents where you write and execute code. Each cell runs independently, and results appear inline.

Creating a Notebook

  1. Click Workspace > + > Notebook
  2. Name: ETL_Customers
  3. Default language: Python (can also use SQL, Scala, R)
  4. Cluster: attach to your cluster

Multi-Language Support

You can mix languages in the same notebook using magic commands:

# Default: Python
df = spark.read.parquet("/mnt/datalake/customers/")
df.count()
%sql
-- Switch to SQL
SELECT city, COUNT(*) as customer_count
FROM customers
GROUP BY city
ORDER BY customer_count DESC
%scala
// Switch to Scala
val df = spark.read.parquet("/mnt/datalake/customers/")
df.show()
%md
### This is Markdown
You can write documentation directly in notebooks using `%md`.

Real-life analogy: Notebooks are like a lab journal. You write notes (markdown), run experiments (code cells), see results (output), and iterate. The entire thought process is captured in one document that you can share with colleagues.

What Is dbutils?

dbutils (Databricks Utilities) is a built-in library available in every Databricks notebook. It provides commands for file system operations, secret management, notebook parameters, and notebook orchestration.

Think of dbutils as the toolbox that comes with every Databricks workspace. You do not install it. You do not import it. It is always there, ready to use.

# It is already available — no import needed
dbutils.help()

The Four dbutils Modules

| Module | What It Does | Real-Life Equivalent |
| --- | --- | --- |
| dbutils.fs | File system operations (list, copy, move, delete) | File Explorer / Finder |
| dbutils.secrets | Read secrets from Key Vault securely | Password manager |
| dbutils.widgets | Create parameterized notebooks with input fields | Form fields on a web page |
| dbutils.notebook | Run other notebooks and pass parameters | Calling a coworker to do a subtask |

dbutils.fs — File System Operations

List Files and Directories

# List files in a directory
files = dbutils.fs.ls("/mnt/datalake/bronze/")
for f in files:
    print(f.name, f.size, f.path)

# Output:
# customers/  0  dbfs:/mnt/datalake/bronze/customers/
# orders/     0  dbfs:/mnt/datalake/bronze/orders/
# products/   0  dbfs:/mnt/datalake/bronze/products/
# List with details
display(dbutils.fs.ls("/mnt/datalake/bronze/customers/"))
# Shows: path, name, size, modificationTime in a nice table

View File Contents

# Read the first 1000 characters of a file
print(dbutils.fs.head("/mnt/datalake/bronze/config.json", 1000))

Copy Files

# Copy a single file
dbutils.fs.cp(
    "/mnt/datalake/bronze/customers/data.parquet",
    "/mnt/datalake/archive/customers/data.parquet"
)

# Copy entire directory (recursive)
dbutils.fs.cp(
    "/mnt/datalake/bronze/customers/",
    "/mnt/datalake/archive/customers/",
    recurse=True
)

Move Files

# Move (rename) a file
dbutils.fs.mv(
    "/mnt/datalake/incoming/daily_sales.csv",
    "/mnt/datalake/bronze/sales/daily_sales.csv"
)

# Move entire directory
dbutils.fs.mv(
    "/mnt/datalake/incoming/",
    "/mnt/datalake/processed/",
    recurse=True
)

Delete Files

# Delete a single file
dbutils.fs.rm("/mnt/datalake/temp/old_file.parquet")

# Delete a directory and all contents (recursive)
dbutils.fs.rm("/mnt/datalake/temp/", recurse=True)

Create Directories

# Create a directory
dbutils.fs.mkdirs("/mnt/datalake/silver/customers/2026/04/")

Put Content (Write Text Files)

# Write a small text file
dbutils.fs.put(
    "/mnt/datalake/config/pipeline_status.txt",
    "Pipeline completed successfully at 2026-04-20 14:30:00",
    overwrite=True
)

DBFS vs ADLS Paths

# DBFS path (Databricks File System — internal)
dbutils.fs.ls("dbfs:/mnt/datalake/")

# ABFSS path (direct ADLS Gen2 — external)
dbutils.fs.ls("abfss://container@storageaccount.dfs.core.windows.net/")

# Both work — DBFS paths are often shorter and easier to type

Real-life analogy: dbutils.fs is like the File Explorer on your computer. You can browse folders, copy files, move files, delete files, and create directories. The difference is that these files are in a cloud data lake, not on your local hard drive.

dbutils.secrets — Secure Secret Management

Why Secrets Matter

Never hardcode passwords, API keys, or connection strings in notebooks. Anyone with notebook access can see them. Notebooks are committed to Git. Passwords in Git = security breach.

dbutils.secrets reads secrets from Azure Key Vault at runtime. The secret value is never displayed in the notebook output — it shows [REDACTED].

Setting Up a Secret Scope

Before using secrets, link your Databricks workspace to Azure Key Vault:

  1. Go to https://<databricks-workspace-url>#secrets/createScope
  2. Scope Name: keyvault-scope
  3. Manage Principal: All Users (or Creator for restricted access)
  4. DNS Name: your Key Vault URI (e.g., https://kv-dataplatform-dev.vault.azure.net/)
  5. Resource ID: your Key Vault’s full resource ID (from Azure Portal > Key Vault > Properties)
  6. Click Create

Reading Secrets

# Get a secret value
password = dbutils.secrets.get(scope="keyvault-scope", key="sql-admin-password")

# The value is available in your code but NEVER displayed
print(password)  # Output: [REDACTED] — Databricks hides it!

# Use it in a connection string
jdbc_url = f"jdbc:sqlserver://server.database.windows.net:1433;database=mydb;user=admin;password={password}"
df = spark.read.format("jdbc").option("url", jdbc_url).option("dbtable", "customers").load()

List Available Scopes and Secrets

# List all secret scopes
dbutils.secrets.listScopes()

# List secrets in a scope (shows key names, NOT values)
dbutils.secrets.list(scope="keyvault-scope")
# Output: [SecretMetadata(key='sql-admin-password'), SecretMetadata(key='storage-key')]

Common Secrets for Data Engineers

| Secret Key | What It Stores | Used For |
| --- | --- | --- |
| sql-admin-password | Azure SQL admin password | JDBC connections |
| storage-access-key | ADLS Gen2 access key | Storage mounting |
| service-principal-secret | SP client secret | Service authentication |
| api-key | External API key | REST API calls |
| cosmosdb-key | Cosmos DB primary key | Cosmos DB connections |

Real-life analogy: dbutils.secrets is like a safe in a hotel room. You put your valuables (passwords) in the safe (Key Vault). When you need them, you enter the code (scope + key) and the safe opens temporarily. The valuables never sit out in the open (never displayed in notebook output).

dbutils.widgets — Parameterize Your Notebooks

Why Widgets Matter

Instead of hardcoding values like file paths, dates, or table names, widgets let you create input parameters that can be changed at runtime. This makes notebooks reusable without editing code.

Creating Widgets

# Text widget (free-form input)
dbutils.widgets.text("source_date", "2026-04-20", "Enter Source Date")

# Dropdown widget (select from predefined options)
dbutils.widgets.dropdown("environment", "dev", ["dev", "uat", "prod"], "Select Environment")

# Combobox widget (dropdown + free-form input)
dbutils.widgets.combobox("table_name", "customers", ["customers", "orders", "products"], "Select Table")

# Multiselect widget (select multiple values)
dbutils.widgets.multiselect("regions", "ALL", ["ALL", "NA", "EU", "APAC"], "Select Regions")

When you run these, input fields appear at the TOP of the notebook.

Reading Widget Values

# Get the current value of a widget
source_date = dbutils.widgets.get("source_date")
environment = dbutils.widgets.get("environment")
table_name = dbutils.widgets.get("table_name")

print(f"Processing {table_name} for {source_date} in {environment}")
# Output: Processing customers for 2026-04-20 in dev

Using Widgets in Spark Code

source_date = dbutils.widgets.get("source_date")
env = dbutils.widgets.get("environment")

# Dynamic path based on widget values
input_path = f"/mnt/datalake/{env}/bronze/customers/{source_date}/"
output_path = f"/mnt/datalake/{env}/silver/customers/{source_date}/"

df = spark.read.parquet(input_path)
df_clean = df.filter(df.status == "Active")
df_clean.write.format("delta").mode("overwrite").save(output_path)

print(f"Processed {df_clean.count()} rows from {input_path}")

Removing Widgets

# Remove a specific widget
dbutils.widgets.remove("source_date")

# Remove ALL widgets
dbutils.widgets.removeAll()

Widgets with Databricks Workflows (Jobs)

When you schedule a notebook as a job, you pass widget values as parameters:

{
    "source_date": "2026-04-20",
    "environment": "prod",
    "table_name": "customers"
}

The job fills the widget values automatically — no manual input needed.

Real-life analogy: Widgets are like the settings dials on a washing machine. The washing machine (notebook) is the same, but you adjust the temperature (source_date), cycle type (environment), and load size (table_name) each time. One machine, many configurations.

dbutils.notebook — Orchestrate Notebooks

Running Another Notebook

# Run a notebook and wait for it to finish
result = dbutils.notebook.run(
    "/Workspace/ETL/Process_Customers",  # Path to the notebook
    timeout_seconds=300,                  # Max wait time (5 minutes)
    arguments={"source_date": "2026-04-20", "environment": "prod"}
)

print(f"Notebook returned: {result}")

Returning a Value from a Notebook

In the called notebook (Process_Customers):

# At the end of the notebook
row_count = df.count()
dbutils.notebook.exit(f"Processed {row_count} rows")

The calling notebook receives this string as the return value.
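Because the return value is always a string, a common pattern is to serialize a small JSON payload in the called notebook and parse it in the caller. A sketch (the dbutils calls are commented out so the JSON logic runs anywhere; inside Databricks you would use the real calls):

```python
import json

# Called notebook: pack results into a single string, since
# dbutils.notebook.exit() can only return a str.
payload = json.dumps({"status": "success", "row_count": 5000, "table": "customers"})
# dbutils.notebook.exit(payload)   # runs only inside Databricks

# Calling notebook: dbutils.notebook.run() hands that string back.
# result = dbutils.notebook.run("/ETL/Process_Customers", 300, {})
result = payload  # stand-in so this sketch runs outside Databricks
outcome = json.loads(result)

if outcome["status"] == "success":
    print(f"Loaded {outcome['row_count']} rows into {outcome['table']}")
```

This gives the orchestrating notebook structured fields (status, counts, table names) to branch on, instead of parsing a free-form sentence.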

Building a Pipeline with Notebook Orchestration

# Master notebook that orchestrates multiple ETL steps

# Step 1: Ingest raw data
result1 = dbutils.notebook.run("/ETL/01_Ingest_Raw", 600,
    {"source": "sql_database", "target": "bronze"})
print(f"Ingest: {result1}")

# Step 2: Clean and standardize
result2 = dbutils.notebook.run("/ETL/02_Clean_Data", 600,
    {"source": "bronze", "target": "silver"})
print(f"Clean: {result2}")

# Step 3: Build aggregations
result3 = dbutils.notebook.run("/ETL/03_Build_Gold", 600,
    {"source": "silver", "target": "gold"})
print(f"Gold: {result3}")

print("Pipeline completed successfully!")

Error Handling

try:
    result = dbutils.notebook.run("/ETL/Process_Customers", 300,
        {"source_date": "2026-04-20"})
    print(f"Success: {result}")
except Exception as e:
    print(f"Notebook failed: {str(e)}")
    # Log error, send alert, etc.

Real-life analogy: dbutils.notebook.run() is like a manager delegating tasks. “Alice, process the customer data and tell me how many rows you handled.” Alice works, finishes, and reports back: “Processed 5,000 rows.” The manager then says “Bob, take Alice’s output and build the reports.” Each person (notebook) does their part, and the manager (master notebook) coordinates.

Mounting Storage with dbutils

What Is Mounting?

Mounting creates a shortcut from a DBFS path to an external storage location (ADLS Gen2). Instead of typing the full abfss://container@account.dfs.core.windows.net/ path every time, you type /mnt/datalake/.

Without mount: abfss://synapse-workspace@naveensynapsedl.dfs.core.windows.net/bronze/customers/
With mount:    /mnt/datalake/bronze/customers/

Mount with Service Principal

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("keyvault-scope", "sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("keyvault-scope", "sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}

dbutils.fs.mount(
    source="abfss://synapse-workspace@naveensynapsedl.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs
)

print("Mounted successfully!")

Mount with Access Key (Simpler)

dbutils.fs.mount(
    source="abfss://synapse-workspace@naveensynapsedl.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs={
        "fs.azure.account.key.naveensynapsedl.dfs.core.windows.net":
            dbutils.secrets.get("keyvault-scope", "storage-access-key")
    }
)

List and Unmount

# List all mounts
display(dbutils.fs.mounts())

# Unmount
dbutils.fs.unmount("/mnt/datalake")

After Mounting

# Now you can use simple paths
df = spark.read.parquet("/mnt/datalake/bronze/customers/")
df.write.format("delta").save("/mnt/datalake/silver/customers/")
dbutils.fs.ls("/mnt/datalake/bronze/")

Note: In modern Databricks with Unity Catalog, direct access using abfss:// paths with external locations is preferred over mounting. Mounts are still widely used but considered legacy.

Real-life analogy: Mounting is like creating a shortcut on your desktop. Instead of navigating through Computer > Network > Server > Share > Folder every time, you create a shortcut called “DataLake” that takes you directly there.
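If you go the modern direct-access route instead of mounting, you set per-storage-account Spark configs and then read abfss:// paths directly. A sketch of a helper that builds those settings — the config key names follow the Azure Databricks documentation for ADLS Gen2 OAuth access, while the account, client, and tenant values are placeholders:

```python
def adls_oauth_confs(account: str, client_id: str,
                     client_secret: str, tenant_id: str) -> dict:
    """Build Spark confs for direct abfss:// access via a service principal."""
    suffix = f"{account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }

# Inside Databricks you would apply these and read directly:
# for k, v in adls_oauth_confs("naveensynapsedl", cid, secret, tid).items():
#     spark.conf.set(k, v)
# df = spark.read.parquet("abfss://container@naveensynapsedl.dfs.core.windows.net/bronze/customers/")
confs = adls_oauth_confs("mystorageacct", "app-id", "sp-secret", "tenant-id")
```

In a Unity Catalog workspace, external locations with storage credentials replace even this per-session config, but the keys above are useful to recognize in existing notebooks.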

Working with Delta Lake in Databricks

Delta Lake is the DEFAULT format in Databricks. Every table you create is Delta unless you specify otherwise.

# Write Delta table
df.write.format("delta").mode("overwrite").save("/mnt/datalake/silver/customers/")

# Read Delta table
df = spark.read.format("delta").load("/mnt/datalake/silver/customers/")

# Create managed table (registered in catalog)
df.write.format("delta").saveAsTable("silver.customers")

# Query with SQL
spark.sql("SELECT * FROM silver.customers WHERE city = 'Toronto'").show()

# MERGE (upsert)
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/mnt/datalake/silver/customers/")
target.alias("t").merge(
    source_df.alias("s"),
    "t.customer_id = s.customer_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

# Time travel
df_yesterday = spark.read.format("delta").option("versionAsOf", 5).load("/mnt/datalake/silver/customers/")

# OPTIMIZE (compact small files)
spark.sql("OPTIMIZE silver.customers")

# Z-ORDER (cluster data for faster queries)
spark.sql("OPTIMIZE silver.customers ZORDER BY (city)")

# Vacuum (clean up old files)
spark.sql("VACUUM silver.customers RETAIN 168 HOURS")

Databricks Workflows (Jobs)

Creating a Job

  1. Click Workflows in the sidebar
  2. Click Create Job
  3. Configure:
  4. Task name: ETL_Daily_Customers
  5. Type: Notebook
  6. Source: select your notebook
  7. Cluster: select existing or create new job cluster
  8. Parameters: {"source_date": "{{job.start_date}}", "environment": "prod"}
  9. Schedule: Add a schedule (cron expression or UI picker)
  10. Click Create

Multi-Task Workflows

Task 1: Ingest_Raw ──→ Task 2: Clean_Data ──→ Task 3: Build_Gold
                                                    |
                                               Task 4: Update_Dashboard

Each task can be a different notebook, with dependencies between them.
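In the Jobs API, that DAG is expressed with task_key and depends_on entries. A sketch of the diagram above as a job payload — field names follow the Jobs API 2.1, while the job name and notebook paths (including 04_Update_Dashboard) are illustrative:

```python
# Illustrative Jobs API 2.1 payload; each task would also carry a
# job_cluster_key or new_cluster block in a real submission.
job_payload = {
    "name": "Daily_Lakehouse_Pipeline",
    "tasks": [
        {"task_key": "Ingest_Raw",
         "notebook_task": {"notebook_path": "/ETL/01_Ingest_Raw"}},
        {"task_key": "Clean_Data",
         "depends_on": [{"task_key": "Ingest_Raw"}],
         "notebook_task": {"notebook_path": "/ETL/02_Clean_Data"}},
        {"task_key": "Build_Gold",
         "depends_on": [{"task_key": "Clean_Data"}],
         "notebook_task": {"notebook_path": "/ETL/03_Build_Gold"}},
        {"task_key": "Update_Dashboard",
         "depends_on": [{"task_key": "Build_Gold"}],
         "notebook_task": {"notebook_path": "/ETL/04_Update_Dashboard"}},
    ],
}

# Print the dependency chain the payload encodes.
for task in job_payload["tasks"]:
    deps = [d["task_key"] for d in task.get("depends_on", [])]
    print(task["task_key"], "<-", deps or "no upstream tasks")
```

Unlike dbutils.notebook.run() chains, the Jobs scheduler handles retries, parallel branches, and per-task clusters for you.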

Job Clusters vs All-Purpose Clusters

| Type | When Created | When Destroyed | Cost |
| --- | --- | --- | --- |
| All-Purpose | Manually by user | After auto-terminate timeout | Higher (interactive use premium) |
| Job Cluster | Automatically when job starts | Automatically when job ends | Lower (no interactive premium) |

Always use Job Clusters for scheduled production workloads — they are cheaper and auto-cleanup.

Real-World Scenarios

Scenario 1: Daily ETL Pipeline

# Notebook: ETL_Daily_Pipeline
source_date = dbutils.widgets.get("source_date")
password = dbutils.secrets.get("keyvault-scope", "sql-password")

# Extract from SQL
df = (spark.read.format("jdbc")
    .option("url", f"jdbc:sqlserver://server:1433;database=mydb;password={password}")
    .option("query", f"SELECT * FROM orders WHERE order_date = '{source_date}'")
    .load())

# Transform
df_clean = df.dropDuplicates(["order_id"]).filter("amount > 0")

# Load to Delta
(df_clean.write.format("delta").mode("append")
    .partitionBy("order_date")
    .save("/mnt/datalake/silver/orders/"))

dbutils.notebook.exit(f"Loaded {df_clean.count()} orders for {source_date}")

Scenario 2: Data Lake File Management

# Archive old files
old_files = dbutils.fs.ls("/mnt/datalake/bronze/incoming/")
for f in old_files:
    if "2025" in f.name:
        dbutils.fs.mv(f.path, f"/mnt/datalake/archive/{f.name}")
        print(f"Archived: {f.name}")

# Check file sizes
files = dbutils.fs.ls("/mnt/datalake/silver/customers/")
total_size = sum(f.size for f in files)
print(f"Total size: {total_size / (1024*1024):.2f} MB")

Cost Management

| Component | Pricing | Tip |
| --- | --- | --- |
| All-Purpose Cluster | DBU rate + VM cost | Set auto-terminate to 30 min |
| Job Cluster | Lower DBU rate + VM cost | Use for all scheduled jobs |
| SQL Warehouse | Per DBU (serverless) | Scales to zero when idle |
| Storage | ADLS Gen2 rates (~$0.02/GB) | Standard Azure pricing |

Cost Saving Tips

  1. Auto-terminate clusters after 30 minutes of inactivity
  2. Use Job Clusters for scheduled workloads (cheaper DBU rate)
  3. Use spot instances for fault-tolerant batch jobs (60-90% cheaper VMs)
  4. Right-size clusters — start with 2 workers, scale up if needed
  5. VACUUM Delta tables — delete old file versions to save storage
  6. Use single-node clusters for development
  7. Monitor usage — Databricks has a built-in cost dashboard
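To make the "DBU rate + VM cost" pricing model concrete, here is a rough back-of-the-envelope calculation. Every rate below is an illustrative placeholder, not a current Azure or Databricks price; check the official pricing pages for real numbers:

```python
# Illustrative rates (NOT official pricing) for a 4-node, 8-hour workload.
vm_cost_per_hour = 0.50       # hypothetical VM rate per node-hour
dbu_per_node_hour = 0.75      # hypothetical DBUs consumed per node-hour
all_purpose_dbu_rate = 0.55   # hypothetical $/DBU, interactive clusters
job_dbu_rate = 0.30           # hypothetical $/DBU, job clusters

nodes, hours = 4, 8

def run_cost(dbu_rate: float) -> float:
    """Total = VM cost + DBU cost, both scaling with node-hours."""
    node_hours = nodes * hours
    return node_hours * vm_cost_per_hour + node_hours * dbu_per_node_hour * dbu_rate

print(f"All-Purpose: ${run_cost(all_purpose_dbu_rate):.2f}")  # $29.20 at these rates
print(f"Job Cluster: ${run_cost(job_dbu_rate):.2f}")          # $23.20 at these rates
```

The VM cost is identical either way; the DBU rate is the lever, which is why moving scheduled workloads to Job Clusters saves money without touching the code.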

Common Mistakes

  1. Hardcoding passwords in notebooks — use dbutils.secrets.get() instead. Notebooks are committed to Git. Passwords in Git = security breach.

  2. Leaving clusters running overnight — set auto-terminate to 30 minutes. An 8-node cluster running overnight costs $200+.

  3. Using All-Purpose clusters for scheduled jobs — Job Clusters have a lower DBU rate and auto-terminate when the job finishes.

  4. Not using Delta Lake — writing Parquet directly means no ACID, no MERGE, no time travel. Delta is the default for a reason.

  5. Forgetting recurse=True on directory operations — dbutils.fs.rm("/path/") without recurse=True fails on non-empty directories.

  6. Mounting with access keys instead of service principals — access keys give full account access. Service principals can be scoped to specific containers.

  7. Not parameterizing notebooks — hardcoded paths and dates make notebooks non-reusable. Use dbutils.widgets for everything that changes between runs.

Interview Questions

Q: What is Azure Databricks? A: A managed Apache Spark platform for big data processing, machine learning, and lakehouse architecture. It provides collaborative notebooks, managed clusters, native Delta Lake integration, and Unity Catalog for governance. Created by the founders of Spark and jointly developed with Microsoft for Azure.

Q: What is dbutils and what are its main modules? A: dbutils is a built-in utility library in Databricks notebooks. Four modules: dbutils.fs for file operations (list, copy, move, delete), dbutils.secrets for reading secrets from Key Vault, dbutils.widgets for notebook parameterization, and dbutils.notebook for orchestrating other notebooks.

Q: How do you securely handle credentials in Databricks? A: Store secrets in Azure Key Vault, create a secret scope linking Databricks to Key Vault, and read secrets at runtime using dbutils.secrets.get(scope, key). Secret values are automatically redacted in notebook output. Never hardcode credentials.

Q: What is the difference between mounting and direct ABFSS access? A: Mounting creates a DBFS shortcut (/mnt/datalake/) to an ADLS Gen2 path, persisted across sessions. Direct ABFSS access uses the full path (abfss://container@account.dfs.core.windows.net/) each time. Mounting is simpler but considered legacy. Unity Catalog external locations are the modern approach.

Q: How do you parameterize a notebook for different environments? A: Use dbutils.widgets to create input parameters (text, dropdown, multiselect). Read values with dbutils.widgets.get("param_name"). When scheduling as a job, pass parameters as JSON. This makes one notebook work for dev, UAT, and prod without code changes.

Q: What is the difference between All-Purpose and Job Clusters? A: All-Purpose clusters are created manually, persist across notebooks, and have a higher DBU rate — best for interactive development. Job Clusters are created automatically when a job starts and destroyed when it ends, with a lower DBU rate — best for scheduled production workloads.

Wrapping Up

Azure Databricks is where data engineering meets world-class Spark. The notebooks are intuitive, the cluster management is painless, Delta Lake is native, and dbutils gives you the file system, secret management, parameterization, and orchestration tools you need — all built in.

The key dbutils commands to memorize:

  • dbutils.fs.ls() — browse your data lake
  • dbutils.secrets.get() — read passwords securely
  • dbutils.widgets.get() — parameterize your notebooks
  • dbutils.notebook.run() — orchestrate your pipeline

Master these four, and you can build any data pipeline in Databricks.

Related posts: Apache Spark and PySpark · Data File Formats (Delta Lake) · Azure Synapse Workspace Setup · Python for Data Engineers · Data Flows in ADF/Synapse


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
