Azure Databricks for Data Engineers: Introduction, Architecture, and dbutils Commands Explained
If Azure Synapse is a Swiss Army knife with many blades, Azure Databricks is a professional chef’s knife — designed for one thing and optimized to do it brilliantly: Apache Spark workloads. It is one of the most popular platforms for big data processing, machine learning, and lakehouse architecture in 2026.
Databricks was founded by the creators of Apache Spark themselves. They took Spark, added a collaborative notebook environment, built-in cluster management, Delta Lake, Unity Catalog, and a polished user experience — then hosted it on Azure, AWS, and GCP.
For data engineers, Databricks is where you write PySpark code, build Delta Lake tables, run ETL pipelines, and collaborate with data scientists. And the tool you will use most inside Databricks is dbutils — a utility library that handles file operations, secrets, widgets, and notebook orchestration.
This post covers everything you need to get started with Databricks and master the dbutils commands you will use daily.
Table of Contents
- What Is Azure Databricks?
- Databricks vs Synapse Spark: When to Use Which
- Databricks Architecture
- The Workspace: Your Digital Office
- Clusters: The Engines
- Notebooks: The Workbench
- What Is dbutils?
- dbutils.fs — File System Operations
- dbutils.secrets — Secure Secret Management
- dbutils.widgets — Parameterize Your Notebooks
- dbutils.notebook — Orchestrate Notebooks
- Mounting Storage with dbutils
- Working with Delta Lake in Databricks
- Databricks Workflows (Jobs)
- Real-World Scenarios
- Cost Management
- Common Mistakes
- Interview Questions
- Wrapping Up
What Is Azure Databricks?
Azure Databricks is a managed Apache Spark platform jointly developed by Databricks and Microsoft. It provides:
- Collaborative notebooks — write PySpark, SQL, Scala, R in Jupyter-like notebooks
- Managed Spark clusters — auto-scaling, auto-termination, zero infrastructure management
- Delta Lake — ACID transactions, time travel, MERGE (native and default)
- Unity Catalog — centralized governance, access control, data lineage
- MLflow — experiment tracking, model registry, deployment
- Databricks Workflows — job scheduling and orchestration
- SQL Warehouses — serverless SQL endpoints for BI tools
Real-life analogy: If building a data platform is like building a house, Databricks is like hiring a construction company that brings their own tools, scaffolding, project management, and cleanup crew. You just design the house (write code) and they handle the rest (infrastructure, scaling, security).
Databricks vs Synapse Spark: When to Use Which
| Feature | Azure Databricks | Synapse Spark Pool |
|---|---|---|
| Spark experience | Best-in-class (created by Spark founders) | Good (standard Spark) |
| Notebook experience | Superior (real-time coauthoring, versioning, commenting) | Basic |
| Delta Lake | Native, deeply integrated, optimized | Supported but not default |
| Cluster management | Advanced (auto-scaling, spot instances, pools) | Basic (auto-scale, auto-pause) |
| SQL analytics | SQL Warehouses (serverless BI endpoint) | Serverless SQL Pool (different engine) |
| ML capabilities | MLflow, AutoML, Feature Store | Basic ML library support |
| Unity Catalog | Full governance platform | Synapse uses Purview |
| Pipeline integration | Databricks Workflows or ADF | Synapse Pipelines (built-in) |
| Cost | Per DBU (Databricks Unit) + VM cost | Per node-hour |
| Best for | Spark-heavy, ML, lakehouse architecture | Integrated analytics (SQL + Spark + Pipelines) |
Rule of thumb: Choose Databricks when Spark and Delta Lake are your primary tools. Choose Synapse Spark when you need Spark as ONE component of a larger analytics platform with SQL pools and built-in pipelines.
Databricks Architecture
┌─────────────────────────────────────────────────────────────────┐
│ AZURE DATABRICKS │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ CONTROL PLANE (Managed by Databricks) │ │
│ │ │ │
│ │ Workspace UI │ Notebook Service │ Cluster Manager │ │
│ │ Job Scheduler │ Unity Catalog │ REST API │ │
│ └──────────────────────────┬───────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────┴───────────────────────────────┐ │
│ │ DATA PLANE (Runs in YOUR Azure subscription) │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Cluster 1│ │ Cluster 2│ │ Cluster 3│ │ │
│ │ │ (Driver) │ │ (Driver) │ │ (Driver) │ │ │
│ │ │ Workers │ │ Workers │ │ Workers │ │ │
│ │ └────┬─────┘ └────┬─────┘ └────┬─────┘ │ │
│ │ │ │ │ │ │
│ │ └──────────────┼──────────────┘ │ │
│ │ │ │ │
│ │ ┌───────┴────────┐ │ │
│ │ │ ADLS Gen2 / │ │ │
│ │ │ Azure Storage │ │ │
│ │ │ (Your Data) │ │ │
│ │ └────────────────┘ │ │
│ └────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Two Planes
Control Plane (Managed by Databricks): The workspace UI, notebook service, cluster manager, job scheduler, and REST API. You interact with this through your browser. Databricks manages it entirely.
Data Plane (Your Azure Subscription): The actual compute (VMs running Spark) and storage (ADLS Gen2). Clusters run as VMs in YOUR Azure subscription. Your data never leaves your environment.
Real-life analogy: The Control Plane is like the airline’s booking system — you search flights, book tickets, and check in online. The Data Plane is the actual airplane — it flies from YOUR airport with YOUR passengers. The airline manages the booking system, but the passengers (data) are always under your control.
The Workspace: Your Digital Office
When you open Databricks, you land in the Workspace — your central hub for everything:
| Section | What It Contains | Icon |
|---|---|---|
| Workspace | Notebooks, folders, libraries, repos | Folder icon |
| Repos | Git-linked repositories | Git branch icon |
| Data | Databases, tables, file browser (DBFS) | Database icon |
| Compute | Clusters, SQL Warehouses | Server icon |
| Workflows | Scheduled jobs and pipelines | Clock icon |
| Catalog | Unity Catalog (databases, schemas, tables, permissions) | Shield icon |
Creating a Workspace (Azure Portal)
- Azure Portal > Create a resource > Azure Databricks
- Workspace name: `dbw-dataplatform-dev`
- Region: Canada Central
- Pricing tier: Premium (required for Unity Catalog, RBAC, secrets)
- Click Review + Create > Create
- After deployment, click Launch Workspace — opens Databricks UI
Clusters: The Engines
A cluster is a set of VMs running Apache Spark. When you run a notebook, it runs on a cluster.
Creating a Cluster
- Click Compute in the sidebar
- Click Create compute
- Configure:
| Setting | Dev/Learning | Production |
|---|---|---|
| Cluster name | `dev-cluster` | `prod-etl-cluster` |
| Cluster mode | Single Node (cheapest) | Standard (multi-node) |
| Databricks Runtime | Latest LTS (e.g., 15.4 LTS) | Latest LTS |
| Worker type | Standard_DS3_v2 (4 cores, 14 GB) | Standard_E8s_v3 or larger |
| Workers | 0 (single node) | 2-10 (auto-scale) |
| Auto-terminate | 30 minutes | 60 minutes |
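For reference, these settings map onto the Databricks Clusters API. Here is a minimal sketch of the dev configuration as an API payload (field names follow the Clusters API; the exact runtime version string is an assumption, and the spark_conf and custom_tags entries are the documented markers for a single-node cluster):
{
  "cluster_name": "dev-cluster",
  "spark_version": "15.4.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 0,
  "autotermination_minutes": 30,
  "spark_conf": {
    "spark.databricks.cluster.profile": "singleNode",
    "spark.master": "local[*]"
  },
  "custom_tags": { "ResourceClass": "SingleNode" }
}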
Auto-Termination (Save Money!)
Clusters auto-terminate after a period of inactivity. If you walk away from your laptop, the cluster shuts down after 30 minutes. No more paying for idle compute overnight.
Real-life analogy: Auto-terminate is like a car engine that turns off automatically when you stop at a red light. The engine restarts when you press the gas. You save fuel (money) during every idle period.
Cluster Types
| Type | Nodes | Best For |
|---|---|---|
| Single Node | 1 (driver only, no workers) | Development, small data, learning |
| Standard | 1 driver + N workers | Production ETL, big data |
| High Concurrency | Shared by multiple users | Interactive analytics, shared notebooks |
Notebooks: The Workbench
Notebooks are interactive documents where you write and execute code. Each cell runs independently, and results appear inline.
Creating a Notebook
- Click Workspace > + > Notebook
- Name: `ETL_Customers`
- Default language: Python (can also use SQL, Scala, R)
- Cluster: attach to your cluster
Multi-Language Support
You can mix languages in the same notebook using magic commands:
# Default: Python
df = spark.read.parquet("/mnt/datalake/customers/")
df.count()
%sql
-- Switch to SQL
SELECT city, COUNT(*) as customer_count
FROM customers
GROUP BY city
ORDER BY customer_count DESC
%scala
// Switch to Scala
val df = spark.read.parquet("/mnt/datalake/customers/")
df.show()
%md
### This is Markdown
You can write documentation directly in notebooks using `%md`.
Real-life analogy: Notebooks are like a lab journal. You write notes (markdown), run experiments (code cells), see results (output), and iterate. The entire thought process is captured in one document that you can share with colleagues.
What Is dbutils?
dbutils (Databricks Utilities) is a built-in library available in every Databricks notebook. It provides commands for file system operations, secret management, notebook parameters, and notebook orchestration.
Think of dbutils as the toolbox that comes with every Databricks workspace. You do not install it. You do not import it. It is always there, ready to use.
# It is already available — no import needed
dbutils.help()
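Each module also has its own help, which prints the available commands and their signatures:
# Module-level help: lists every command in dbutils.fs
dbutils.fs.help()
# Command-level help: detailed usage for a single command
dbutils.fs.help("cp")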
The Four dbutils Modules
| Module | What It Does | Real-Life Equivalent |
|---|---|---|
| `dbutils.fs` | File system operations (list, copy, move, delete) | File Explorer / Finder |
| `dbutils.secrets` | Read secrets from Key Vault securely | Password manager |
| `dbutils.widgets` | Create parameterized notebooks with input fields | Form fields on a web page |
| `dbutils.notebook` | Run other notebooks and pass parameters | Calling a coworker to do a subtask |
dbutils.fs — File System Operations
List Files and Directories
# List files in a directory
files = dbutils.fs.ls("/mnt/datalake/bronze/")
for f in files:
    print(f.name, f.size, f.path)
# Output:
# customers/ 0 dbfs:/mnt/datalake/bronze/customers/
# orders/ 0 dbfs:/mnt/datalake/bronze/orders/
# products/ 0 dbfs:/mnt/datalake/bronze/products/
# List with details
display(dbutils.fs.ls("/mnt/datalake/bronze/customers/"))
# Shows: path, name, size, modificationTime in a nice table
View File Contents
# Read the first 1000 bytes of a file
print(dbutils.fs.head("/mnt/datalake/bronze/config.json", 1000))
Copy Files
# Copy a single file
dbutils.fs.cp(
"/mnt/datalake/bronze/customers/data.parquet",
"/mnt/datalake/archive/customers/data.parquet"
)
# Copy entire directory (recursive)
dbutils.fs.cp(
"/mnt/datalake/bronze/customers/",
"/mnt/datalake/archive/customers/",
recurse=True
)
Move Files
# Move (rename) a file
dbutils.fs.mv(
"/mnt/datalake/incoming/daily_sales.csv",
"/mnt/datalake/bronze/sales/daily_sales.csv"
)
# Move entire directory
dbutils.fs.mv(
"/mnt/datalake/incoming/",
"/mnt/datalake/processed/",
recurse=True
)
Delete Files
# Delete a single file
dbutils.fs.rm("/mnt/datalake/temp/old_file.parquet")
# Delete a directory and all contents (recursive)
dbutils.fs.rm("/mnt/datalake/temp/", recurse=True)
Create Directories
# Create a directory
dbutils.fs.mkdirs("/mnt/datalake/silver/customers/2026/04/")
Put Content (Write Text Files)
# Write a small text file
dbutils.fs.put(
"/mnt/datalake/config/pipeline_status.txt",
"Pipeline completed successfully at 2026-04-20 14:30:00",
overwrite=True
)
DBFS vs ADLS Paths
# DBFS path (Databricks File System — internal)
dbutils.fs.ls("dbfs:/mnt/datalake/")
# ABFSS path (direct ADLS Gen2 — external)
dbutils.fs.ls("abfss://container@storageaccount.dfs.core.windows.net/")
# Both work — DBFS paths are often shorter and easier to type
Real-life analogy: dbutils.fs is like the File Explorer on your computer. You can browse folders, copy files, move files, delete files, and create directories. The difference is that these files are in a cloud data lake, not on your local hard drive.
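One pattern worth knowing: dbutils.fs.ls() is not recursive. A small hypothetical helper (the dir_size name is mine) can walk subdirectories, relying on the fact that directory entries are listed with names ending in "/":
def dir_size(path):
    """Recursively sum file sizes under a path (directory names end with '/')."""
    total = 0
    for f in dbutils.fs.ls(path):
        if f.name.endswith("/"):   # directory entry — recurse into it
            total += dir_size(f.path)
        else:
            total += f.size
    return total

print(f"Bronze layer size: {dir_size('/mnt/datalake/bronze/') / (1024 * 1024):.2f} MB")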
dbutils.secrets — Secure Secret Management
Why Secrets Matter
Never hardcode passwords, API keys, or connection strings in notebooks. Anyone with notebook access can see them. Notebooks are committed to Git. Passwords in Git = security breach.
dbutils.secrets reads secrets from Azure Key Vault at runtime. The secret value is never displayed in the notebook output — it shows [REDACTED].
Setting Up a Secret Scope
Before using secrets, link your Databricks workspace to Azure Key Vault:
- Go to `https://<databricks-workspace-url>#secrets/createScope`
- Scope Name: `keyvault-scope`
- Manage Principal: All Users (or Creator for restricted access)
- DNS Name: your Key Vault URI (e.g., `https://kv-dataplatform-dev.vault.azure.net/`)
- Resource ID: your Key Vault’s full resource ID (from Azure Portal > Key Vault > Properties)
- Click Create
Reading Secrets
# Get a secret value
password = dbutils.secrets.get(scope="keyvault-scope", key="sql-admin-password")
# The value is available in your code but NEVER displayed
print(password) # Output: [REDACTED] — Databricks hides it!
# Use it in a connection string
jdbc_url = f"jdbc:sqlserver://server.database.windows.net:1433;database=mydb;user=admin;password={password}"
df = spark.read.format("jdbc").option("url", jdbc_url).option("dbtable", "customers").load()
List Available Scopes and Secrets
# List all secret scopes
dbutils.secrets.listScopes()
# List secrets in a scope (shows key names, NOT values)
dbutils.secrets.list(scope="keyvault-scope")
# Output: [SecretMetadata(key='sql-admin-password'), SecretMetadata(key='storage-key')]
Common Secrets for Data Engineers
| Secret Key | What It Stores | Used For |
|---|---|---|
| `sql-admin-password` | Azure SQL admin password | JDBC connections |
| `storage-access-key` | ADLS Gen2 access key | Storage mounting |
| `service-principal-secret` | SP client secret | Service authentication |
| `api-key` | External API key | REST API calls |
| `cosmosdb-key` | Cosmos DB primary key | Cosmos DB connections |
Real-life analogy: dbutils.secrets is like a safe in a hotel room. You put your valuables (passwords) in the safe (Key Vault). When you need them, you enter the code (scope + key) and the safe opens temporarily. The valuables never sit out in the open (never displayed in notebook output).
dbutils.widgets — Parameterize Your Notebooks
Why Widgets Matter
Instead of hardcoding values like file paths, dates, or table names, widgets let you create input parameters that can be changed at runtime. This makes notebooks reusable without editing code.
Creating Widgets
# Text widget (free-form input)
dbutils.widgets.text("source_date", "2026-04-20", "Enter Source Date")
# Dropdown widget (select from predefined options)
dbutils.widgets.dropdown("environment", "dev", ["dev", "uat", "prod"], "Select Environment")
# Combobox widget (dropdown + free-form input)
dbutils.widgets.combobox("table_name", "customers", ["customers", "orders", "products"], "Select Table")
# Multiselect widget (select multiple values)
dbutils.widgets.multiselect("regions", "ALL", ["ALL", "NA", "EU", "APAC"], "Select Regions")
When you run these, input fields appear at the TOP of the notebook.
Reading Widget Values
# Get the current value of a widget
source_date = dbutils.widgets.get("source_date")
environment = dbutils.widgets.get("environment")
table_name = dbutils.widgets.get("table_name")
print(f"Processing {table_name} for {source_date} in {environment}")
# Output: Processing customers for 2026-04-20 in dev
Using Widgets in Spark Code
source_date = dbutils.widgets.get("source_date")
env = dbutils.widgets.get("environment")
# Dynamic path based on widget values
input_path = f"/mnt/datalake/{env}/bronze/customers/{source_date}/"
output_path = f"/mnt/datalake/{env}/silver/customers/{source_date}/"
df = spark.read.parquet(input_path)
df_clean = df.filter(df.status == "Active")
df_clean.write.format("delta").mode("overwrite").save(output_path)
print(f"Processed {df_clean.count()} rows from {input_path}")
Removing Widgets
# Remove a specific widget
dbutils.widgets.remove("source_date")
# Remove ALL widgets
dbutils.widgets.removeAll()
Widgets with Databricks Workflows (Jobs)
When you schedule a notebook as a job, you pass widget values as parameters:
{
"source_date": "2026-04-20",
"environment": "prod",
"table_name": "customers"
}
The job fills the widget values automatically — no manual input needed.
Real-life analogy: Widgets are like the settings dials on a washing machine. The washing machine (notebook) is the same, but you adjust the temperature (source_date), cycle type (environment), and load size (table_name) each time. One machine, many configurations.
dbutils.notebook — Orchestrate Notebooks
Running Another Notebook
# Run a notebook and wait for it to finish
result = dbutils.notebook.run(
"/Workspace/ETL/Process_Customers", # Path to the notebook
timeout_seconds=300, # Max wait time (5 minutes)
arguments={"source_date": "2026-04-20", "environment": "prod"}
)
print(f"Notebook returned: {result}")
Returning a Value from a Notebook
In the called notebook (Process_Customers):
# At the end of the notebook
row_count = df.count()
dbutils.notebook.exit(f"Processed {row_count} rows")
The calling notebook receives this string as the return value.
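Since exit() can only return a string, a common pattern is to serialize structured results as JSON (a sketch; the status and row_count fields are my own example, not a Databricks convention):
# In the called notebook: pack results into a JSON string
import json
dbutils.notebook.exit(json.dumps({"status": "ok", "row_count": row_count}))

# In the calling notebook: parse the string back into a dict
result = json.loads(dbutils.notebook.run("/Workspace/ETL/Process_Customers", 300))
print(result["status"], result["row_count"])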
Building a Pipeline with Notebook Orchestration
# Master notebook that orchestrates multiple ETL steps
# Step 1: Ingest raw data
result1 = dbutils.notebook.run("/ETL/01_Ingest_Raw", 600,
{"source": "sql_database", "target": "bronze"})
print(f"Ingest: {result1}")
# Step 2: Clean and standardize
result2 = dbutils.notebook.run("/ETL/02_Clean_Data", 600,
{"source": "bronze", "target": "silver"})
print(f"Clean: {result2}")
# Step 3: Build aggregations
result3 = dbutils.notebook.run("/ETL/03_Build_Gold", 600,
{"source": "silver", "target": "gold"})
print(f"Gold: {result3}")
print("Pipeline completed successfully!")
Error Handling
try:
    result = dbutils.notebook.run("/ETL/Process_Customers", 300,
                                  {"source_date": "2026-04-20"})
    print(f"Success: {result}")
except Exception as e:
    print(f"Notebook failed: {str(e)}")
    # Log error, send alert, etc.
Real-life analogy: dbutils.notebook.run() is like a manager delegating tasks. “Alice, process the customer data and tell me how many rows you handled.” Alice works, finishes, and reports back: “Processed 5,000 rows.” The manager then says “Bob, take Alice’s output and build the reports.” Each person (notebook) does their part, and the manager (master notebook) coordinates.
Mounting Storage with dbutils
What Is Mounting?
Mounting creates a shortcut from a DBFS path to an external storage location (ADLS Gen2). Instead of typing the full abfss://container@account.dfs.core.windows.net/ path every time, you type /mnt/datalake/.
Without mount: abfss://synapse-workspace@naveensynapsedl.dfs.core.windows.net/bronze/customers/
With mount: /mnt/datalake/bronze/customers/
Mount with Service Principal
configs = {
"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": dbutils.secrets.get("keyvault-scope", "sp-client-id"),
"fs.azure.account.oauth2.client.secret": dbutils.secrets.get("keyvault-scope", "sp-client-secret"),
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}
dbutils.fs.mount(
source="abfss://synapse-workspace@naveensynapsedl.dfs.core.windows.net/",
mount_point="/mnt/datalake",
extra_configs=configs
)
print("Mounted successfully!")
Mount with Access Key (Simpler)
dbutils.fs.mount(
source="abfss://synapse-workspace@naveensynapsedl.dfs.core.windows.net/",
mount_point="/mnt/datalake",
extra_configs={
"fs.azure.account.key.naveensynapsedl.dfs.core.windows.net":
dbutils.secrets.get("keyvault-scope", "storage-access-key")
}
)
List and Unmount
# List all mounts
display(dbutils.fs.mounts())
# Unmount
dbutils.fs.unmount("/mnt/datalake")
After Mounting
# Now you can use simple paths
df = spark.read.parquet("/mnt/datalake/bronze/customers/")
df.write.format("delta").save("/mnt/datalake/silver/customers/")
dbutils.fs.ls("/mnt/datalake/bronze/")
Note: In modern Databricks with Unity Catalog, direct access using abfss:// paths with external locations is preferred over mounting. Mounts are still widely used but considered legacy.
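If you skip mounting, you can set credentials on the Spark session and read abfss:// paths directly. A minimal sketch using the same service principal secrets as above (the storage account name is reused from earlier examples; the tenant ID placeholder is yours to fill in):
storage_account = "naveensynapsedl"

# Session-scoped OAuth credentials for direct abfss:// access
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
               dbutils.secrets.get("keyvault-scope", "sp-client-id"))
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
               dbutils.secrets.get("keyvault-scope", "sp-client-secret"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

df = spark.read.parquet(f"abfss://synapse-workspace@{storage_account}.dfs.core.windows.net/bronze/customers/")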
Real-life analogy: Mounting is like creating a shortcut on your desktop. Instead of navigating through Computer > Network > Server > Share > Folder every time, you create a shortcut called “DataLake” that takes you directly there.
Working with Delta Lake in Databricks
Delta Lake is the DEFAULT format in Databricks. Every table you create is Delta unless you specify otherwise.
# Write Delta table
df.write.format("delta").mode("overwrite").save("/mnt/datalake/silver/customers/")
# Read Delta table
df = spark.read.format("delta").load("/mnt/datalake/silver/customers/")
# Create managed table (registered in catalog)
df.write.format("delta").saveAsTable("silver.customers")
# Query with SQL
spark.sql("SELECT * FROM silver.customers WHERE city = 'Toronto'").show()
# MERGE (upsert)
from delta.tables import DeltaTable
target = DeltaTable.forPath(spark, "/mnt/datalake/silver/customers/")
target.alias("t").merge(
source_df.alias("s"),
"t.customer_id = s.customer_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
# Time travel
df_yesterday = spark.read.format("delta").option("versionAsOf", 5).load("/mnt/datalake/silver/customers/")
# OPTIMIZE (compact small files)
spark.sql("OPTIMIZE silver.customers")
# Z-ORDER (cluster data for faster queries)
spark.sql("OPTIMIZE silver.customers ZORDER BY (city)")
# Vacuum (clean up old files)
spark.sql("VACUUM silver.customers RETAIN 168 HOURS")
Databricks Workflows (Jobs)
Creating a Job
- Click Workflows in the sidebar
- Click Create Job
- Configure:
  - Task name: `ETL_Daily_Customers`
  - Type: Notebook
  - Source: select your notebook
  - Cluster: select existing or create new job cluster
  - Parameters: `{"source_date": "{{job.start_date}}", "environment": "prod"}`
- Schedule: Add a schedule (cron expression or UI picker)
- Click Create
Multi-Task Workflows
Task 1: Ingest_Raw ──→ Task 2: Clean_Data ──→ Task 3: Build_Gold
                                                      │
                                                      ▼
                                          Task 4: Update_Dashboard
Each task can be a different notebook, with dependencies between them.
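The same dependency graph can be expressed as a Jobs API payload fragment (a sketch; task_key and depends_on are the Jobs API fields that encode the arrows above, and the 04_Update_Dashboard path is hypothetical):
{
  "name": "ETL_Daily",
  "tasks": [
    { "task_key": "Ingest_Raw",
      "notebook_task": { "notebook_path": "/ETL/01_Ingest_Raw" } },
    { "task_key": "Clean_Data",
      "depends_on": [{ "task_key": "Ingest_Raw" }],
      "notebook_task": { "notebook_path": "/ETL/02_Clean_Data" } },
    { "task_key": "Build_Gold",
      "depends_on": [{ "task_key": "Clean_Data" }],
      "notebook_task": { "notebook_path": "/ETL/03_Build_Gold" } },
    { "task_key": "Update_Dashboard",
      "depends_on": [{ "task_key": "Build_Gold" }],
      "notebook_task": { "notebook_path": "/ETL/04_Update_Dashboard" } }
  ]
}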
Job Clusters vs All-Purpose Clusters
| Type | When Created | When Destroyed | Cost |
|---|---|---|---|
| All-Purpose | Manually by user | After auto-terminate timeout | Higher (interactive use premium) |
| Job Cluster | Automatically when job starts | Automatically when job ends | Lower (no interactive premium) |
Always use Job Clusters for scheduled production workloads — they are cheaper and auto-cleanup.
Real-World Scenarios
Scenario 1: Daily ETL Pipeline
# Notebook: ETL_Daily_Pipeline
source_date = dbutils.widgets.get("source_date")
password = dbutils.secrets.get("keyvault-scope", "sql-password")
# Extract from SQL
df = spark.read.format("jdbc") .option("url", f"jdbc:sqlserver://server:1433;database=mydb;password={password}") .option("query", f"SELECT * FROM orders WHERE order_date = '{source_date}'") .load()
# Transform
df_clean = df.dropDuplicates(["order_id"]).filter("amount > 0")
# Load to Delta
df_clean.write.format("delta").mode("append") .partitionBy("order_date") .save("/mnt/datalake/silver/orders/")
dbutils.notebook.exit(f"Loaded {df_clean.count()} orders for {source_date}")
Scenario 2: Data Lake File Management
# Archive old files
old_files = dbutils.fs.ls("/mnt/datalake/bronze/incoming/")
for f in old_files:
    if "2025" in f.name:
        dbutils.fs.mv(f.path, f"/mnt/datalake/archive/{f.name}")
        print(f"Archived: {f.name}")
# Check file sizes
files = dbutils.fs.ls("/mnt/datalake/silver/customers/")
total_size = sum(f.size for f in files)
print(f"Total size: {total_size / (1024*1024):.2f} MB")
Cost Management
| Component | Pricing | Tip |
|---|---|---|
| All-Purpose Cluster | DBU rate + VM cost | Set auto-terminate to 30 min |
| Job Cluster | Lower DBU rate + VM cost | Use for all scheduled jobs |
| SQL Warehouse | Per DBU (serverless) | Scales to zero when idle |
| Storage | ADLS Gen2 rates (~$0.02/GB) | Standard Azure pricing |
Cost Saving Tips
- Auto-terminate clusters after 30 minutes of inactivity
- Use Job Clusters for scheduled workloads (cheaper DBU rate)
- Use spot instances for fault-tolerant batch jobs (60-90% cheaper VMs; see the snippet after this list)
- Right-size clusters — start with 2 workers, scale up if needed
- VACUUM Delta tables — delete old file versions to save storage
- Use single-node clusters for development
- Monitor usage — Databricks has a built-in cost dashboard
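For the spot-instance tip above, here is a sketch of the relevant cluster-spec fragment (azure_attributes fields per the Clusters API; first_on_demand keeps the driver on a regular on-demand VM, and spot_bid_max_price of -1 caps the bid at the on-demand price):
"azure_attributes": {
  "first_on_demand": 1,
  "availability": "SPOT_WITH_FALLBACK_AZURE",
  "spot_bid_max_price": -1
}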
Common Mistakes
- Hardcoding passwords in notebooks — use `dbutils.secrets.get()` instead. Notebooks are committed to Git. Passwords in Git = security breach.
- Leaving clusters running overnight — set auto-terminate to 30 minutes. An 8-node cluster running overnight costs $200+.
- Using All-Purpose clusters for scheduled jobs — Job Clusters have a lower DBU rate and auto-terminate when the job finishes.
- Not using Delta Lake — writing Parquet directly means no ACID, no MERGE, no time travel. Delta is the default for a reason.
- Forgetting `recurse=True` on directory operations — `dbutils.fs.rm("/path/")` without `recurse=True` fails on non-empty directories.
- Mounting with access keys instead of service principals — access keys give full account access. Service principals can be scoped to specific containers.
- Not parameterizing notebooks — hardcoded paths and dates make notebooks non-reusable. Use `dbutils.widgets` for everything that changes between runs.
Interview Questions
Q: What is Azure Databricks?
A: A managed Apache Spark platform for big data processing, machine learning, and lakehouse architecture. It provides collaborative notebooks, managed clusters, native Delta Lake integration, and Unity Catalog for governance. Created by the founders of Spark and jointly developed with Microsoft for Azure.
Q: What is dbutils and what are its main modules?
A: dbutils is a built-in utility library in Databricks notebooks. Four modules: dbutils.fs for file operations (list, copy, move, delete), dbutils.secrets for reading secrets from Key Vault, dbutils.widgets for notebook parameterization, and dbutils.notebook for orchestrating other notebooks.
Q: How do you securely handle credentials in Databricks?
A: Store secrets in Azure Key Vault, create a secret scope linking Databricks to Key Vault, and read secrets at runtime using dbutils.secrets.get(scope, key). Secret values are automatically redacted in notebook output. Never hardcode credentials.
Q: What is the difference between mounting and direct ABFSS access?
A: Mounting creates a DBFS shortcut (/mnt/datalake/) to an ADLS Gen2 path, persisted across sessions. Direct ABFSS access uses the full path (abfss://container@account.dfs.core.windows.net/) each time. Mounting is simpler but considered legacy. Unity Catalog external locations are the modern approach.
Q: How do you parameterize a notebook for different environments?
A: Use dbutils.widgets to create input parameters (text, dropdown, multiselect). Read values with dbutils.widgets.get("param_name"). When scheduling as a job, pass parameters as JSON. This makes one notebook work for dev, UAT, and prod without code changes.
Q: What is the difference between All-Purpose and Job Clusters?
A: All-Purpose clusters are created manually, persist across notebooks, and have a higher DBU rate — best for interactive development. Job Clusters are created automatically when a job starts and destroyed when it ends, with a lower DBU rate — best for scheduled production workloads.
Wrapping Up
Azure Databricks is where data engineering meets world-class Spark. The notebooks are intuitive, the cluster management is painless, Delta Lake is native, and dbutils gives you the file system, secret management, parameterization, and orchestration tools you need — all built in.
The key dbutils commands to memorize:
– dbutils.fs.ls() — browse your data lake
– dbutils.secrets.get() — read passwords securely
– dbutils.widgets.get() — parameterize your notebooks
– dbutils.notebook.run() — orchestrate your pipeline
Master these four, and you can build any data pipeline in Databricks.
Related posts: – Apache Spark and PySpark – Data File Formats (Delta Lake) – Azure Synapse Workspace Setup – Python for Data Engineers – Data Flows in ADF/Synapse
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.