Databricks Workflows and Jobs: Scheduling, Multi-Task Pipelines, Alerts, and Production Orchestration
You have built notebooks that read data, transform it, and write Delta tables. They work perfectly when you click “Run All.” But clicking a button manually at 2 AM every night is not a production strategy.
Databricks Workflows turn your notebooks into automated, scheduled, monitored production pipelines. You define WHAT to run, WHEN to run it, in WHAT ORDER, and WHAT TO DO if something fails — then Databricks handles the rest.
Think of Workflows like a factory assembly line. Each station (task) does one specific job. Station 1 cuts the metal (ingest raw data). Station 2 welds it (transform to Silver). Station 3 paints it (build Gold tables). Station 4 inspects it (data quality checks). The assembly line runs on a schedule — 6 AM every day — without anyone pressing a button. If a station breaks, the line stops and an alarm sounds (email alert).
Table of Contents
- What Are Databricks Workflows?
- Workflows vs ADF Pipelines vs Synapse Pipelines
- Creating Your First Job
- Job Clusters vs All-Purpose Clusters
- Multi-Task Workflows (DAG Pipelines)
- Task Dependencies
- Passing Parameters Between Tasks
- Schedule Types
- Retry and Timeout Configuration
- Alerts and Notifications
- Monitoring Job Runs
- The Complete Medallion Workflow
- Triggering Workflows from ADF
- Cost Optimization
- Common Errors and Fixes
- Interview Questions
- Wrapping Up
What Are Databricks Workflows?
Workflows is Databricks’ built-in job scheduler and orchestrator. It lets you:
- Schedule notebooks to run at specific times (cron)
- Chain multiple notebooks into a pipeline (multi-task DAG)
- Pass parameters between tasks
- Retry failed tasks automatically
- Alert via email/Slack/PagerDuty on failure
- Monitor run history with detailed logs
Workflow: Daily_ETL_Pipeline
Schedule: 2:00 AM daily
Task 1: Ingest_Bronze ──→ Task 2: Transform_Silver ──→ Task 3: Build_Gold
|
Task 4: Data_Quality_Check
|
Task 5: Notify_Success
Workflows vs ADF Pipelines vs Synapse Pipelines
| Feature | Databricks Workflows | ADF / Synapse Pipelines |
|---|---|---|
| Orchestrates | Databricks notebooks, Python scripts, JARs | Any Azure service (Copy, Data Flow, Databricks, SQL) |
| UI | Databricks workspace | Azure Portal / Synapse Studio |
| Triggers | Cron schedule, manual, API, file arrival | Schedule, tumbling window, event-based, manual |
| Parameters | JSON key-value, task values | Pipeline parameters, global parameters |
| Retry | Per task | Per activity |
| Monitoring | Databricks run history | ADF Monitor hub |
| Best for | Databricks-only workloads | Multi-service orchestration |
When to use which:
- Databricks Workflows: All your transformation logic is in Databricks notebooks
- ADF/Synapse: You need to orchestrate across services (ADF Copy + Databricks + SQL Pool + Logic App)
- Hybrid: ADF triggers a Databricks Workflow using the Databricks activity
Creating Your First Job
Step 1: Navigate to Workflows
- Click Workflows in the Databricks sidebar
- Click Create Job
Step 2: Configure the Task
| Field | Value | Notes |
|---|---|---|
| Job name | Daily_Bronze_Ingest | Descriptive name |
| Task name | ingest_customers | Name for this specific task |
| Type | Notebook | Can also be Python script, JAR, SQL, dbt |
| Source | Workspace | Or Git repo |
| Path | /ETL/01_Ingest_Customers | Path to your notebook |
| Cluster | New Job Cluster | Cheaper than all-purpose |
| Parameters | {"source_date": "2026-05-18", "env": "prod"} | Passed as widgets |
Step 3: Configure the Cluster
For a Job Cluster:
| Setting | Dev/Test | Production |
|---|---|---|
| Node type | Standard_DS3_v2 | Standard_E8s_v3 |
| Workers | 1 (single node) | 2-10 (auto-scale) |
| Databricks Runtime | Latest LTS | Latest LTS |
| Spot instances | No | Yes (60-90% cheaper) |
Step 4: Save and Run
Click Create → then Run now to test.
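The same job can also be described as data instead of clicks. The following is a minimal sketch of a request body for the Jobs API `POST /api/2.1/jobs/create` endpoint that mirrors the values in the tables above. The runtime version string is an assumption (pick a current LTS from your workspace), and nothing is sent to Databricks here.

```python
import json

def build_single_task_job(notebook_path, params):
    """Return a Jobs API 2.1 create-job body for one notebook task."""
    return {
        "name": "Daily_Bronze_Ingest",
        "tasks": [
            {
                "task_key": "ingest_customers",
                "notebook_task": {
                    "notebook_path": notebook_path,
                    "source": "WORKSPACE",
                    "base_parameters": params,  # surfaced to the notebook as widgets
                },
                "new_cluster": {  # a job cluster created for this run only
                    "spark_version": "15.4.x-scala2.12",  # assumed LTS version
                    "node_type_id": "Standard_DS3_v2",
                    "num_workers": 1,
                },
            }
        ],
    }

job_body = build_single_task_job(
    "/ETL/01_Ingest_Customers",
    {"source_date": "2026-05-18", "env": "prod"},
)
print(json.dumps(job_body, indent=2))
```

Keeping the definition in code like this makes it easy to review in Git and to recreate the job in another workspace.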
Real-life analogy: Creating a job is like programming a washing machine. You select the cycle (notebook), set the temperature (parameters), choose the load size (cluster), and set the timer (schedule). Once programmed, it runs automatically.
Job Clusters vs All-Purpose Clusters
| Feature | Job Cluster | All-Purpose Cluster |
|---|---|---|
| Created | Automatically when job starts | Manually by user |
| Destroyed | Automatically when job ends | After auto-terminate timeout |
| DBU rate | Lower (jobs compute pricing) | Higher (all-purpose pricing) |
| Startup time | 3-5 minutes per job | Instant (if already running) |
| Shared | One job only | Multiple users/notebooks |
| Cost | Pay only during job execution | Pay while running (even idle) |
| Use for | Scheduled production jobs | Interactive development |
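Use Job Clusters for scheduled production workloads. Jobs compute is billed at a lower DBU rate than all-purpose compute; savings of roughly 40-60% are typical, though the exact figure depends on cloud, pricing tier, and region.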
Real-life analogy: A Job Cluster is like a rental car — pick it up when you need it, return it when done, pay only for the hours used. An All-Purpose Cluster is like owning a car — always available, but you pay insurance and parking even when it is sitting in the garage.
Multi-Task Workflows (DAG Pipelines)
Real pipelines have multiple steps that depend on each other:
Creating a Multi-Task Workflow
- Create a job with the first task
- Click + Add Task to add more tasks
- Set Depends on to define the execution order
Task 1: Ingest_Bronze
→ No dependencies (runs first)
→ Notebook: /ETL/01_Ingest_Customers
→ Parameters: {"source": "sql_db", "target": "bronze"}
Task 2: Transform_Silver
→ Depends on: Ingest_Bronze (runs after Task 1 succeeds)
→ Notebook: /ETL/02_Transform_Silver
→ Parameters: {"source": "bronze", "target": "silver"}
Task 3: Build_Gold_Dimensions
→ Depends on: Transform_Silver
→ Notebook: /ETL/03_Build_Gold_Dims
→ Parameters: {"source": "silver", "target": "gold"}
Task 4: Build_Gold_Facts
→ Depends on: Transform_Silver (SAME dependency as Task 3)
→ Notebook: /ETL/04_Build_Gold_Facts
Task 5: Data_Quality_Report
→ Depends on: Build_Gold_Dimensions AND Build_Gold_Facts (both must complete)
→ Notebook: /ETL/05_Quality_Check
The DAG Visualization
        Ingest_Bronze
              |
       Transform_Silver
          /         \
  Build_Dims     Build_Facts     ← Run in PARALLEL (both depend on Silver)
          \         /
     Data_Quality_Report         ← Runs after BOTH complete
Tasks 3 and 4 run in parallel because they both depend only on Task 2 (not on each other). Task 5 waits for both to finish.
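To see why Tasks 3 and 4 can overlap, it may help to compute the execution order from the dependency map. The sketch below uses Python's standard `graphlib` module to group the five tasks into waves, and also translates the same map into the `depends_on` shape the Jobs API expects. It is a local illustration of the ordering, not Databricks' own scheduler, and the task bodies (notebook paths, clusters) are left out for brevity.

```python
from graphlib import TopologicalSorter

# task -> set of upstream tasks, copied from the five tasks above
DEPENDS_ON = {
    "Ingest_Bronze": set(),
    "Transform_Silver": {"Ingest_Bronze"},
    "Build_Gold_Dimensions": {"Transform_Silver"},
    "Build_Gold_Facts": {"Transform_Silver"},
    "Data_Quality_Report": {"Build_Gold_Dimensions", "Build_Gold_Facts"},
}

def execution_waves(depends_on):
    """Group tasks into waves; every task in a wave has all upstreams done."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves

def to_tasks(depends_on):
    """Translate the map into the `tasks` array of a Jobs API job body."""
    return [
        {"task_key": name, "depends_on": [{"task_key": up} for up in sorted(ups)]}
        for name, ups in depends_on.items()
    ]

waves = execution_waves(DEPENDS_ON)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

The third wave contains both Gold tasks, which is exactly the parallel step in the diagram.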
Real-life analogy: The assembly line. Cutting metal (Task 1) must finish before welding (Task 2). After welding, painting (Task 3) and wiring (Task 4) can happen simultaneously on different stations. Final inspection (Task 5) waits for both painting and wiring to complete.
Task Dependencies
| Run if condition | What It Means | Example |
|---|---|---|
| All succeeded (default) | Task runs only if every upstream task succeeded | Ingest → Transform |
| At least one failed | Task runs only if an upstream task failed | Any task → Send_Failure_Alert |
| All done | Task runs once its upstream tasks finish, regardless of success/failure | Cleanup task that always runs |
Ingest_Bronze
|
├── (Success) → Transform_Silver → Build_Gold
|
└── (Failed) → Send_Alert_Email → Log_Failure
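In a job definition, the failure branch is expressed with the task-level `run_if` field. Here is a small sketch of two follow-up tasks attached to `Ingest_Bronze`; the `Cleanup` task name is hypothetical, and notebook paths are omitted to keep the focus on the conditions.

```python
def with_failure_handling(main_task_key):
    """Return two follow-up tasks: a failure path and an always-run path."""
    return [
        {
            "task_key": "Send_Alert_Email",
            "depends_on": [{"task_key": main_task_key}],
            "run_if": "AT_LEAST_ONE_FAILED",  # only when an upstream task failed
        },
        {
            "task_key": "Cleanup",  # hypothetical task name
            "depends_on": [{"task_key": main_task_key}],
            "run_if": "ALL_DONE",  # runs after success or failure
        },
    ]

follow_ups = with_failure_handling("Ingest_Bronze")
print([(task["task_key"], task["run_if"]) for task in follow_ups])
```

Tasks without a `run_if` value use the default, which requires all upstream tasks to succeed.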
Passing Parameters Between Tasks
Method 1: Hardcoded Parameters
{
"source_date": "2026-05-18",
"environment": "prod",
"table_name": "customers"
}
The notebook reads them with dbutils.widgets.get("source_date").
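Widget values always arrive as strings, so it is usually worth parsing and validating them at the top of the notebook. The sketch below assumes the three parameters above; the allowed environment names are an assumption, and a plain dictionary stands in for `dbutils.widgets.get` so the example runs outside Databricks.

```python
from datetime import date

def read_job_params(get_widget):
    """Read and validate the job parameters.

    `get_widget` is any callable mapping a name to a string; inside a
    notebook you would pass `dbutils.widgets.get`.
    """
    params = {
        "source_date": date.fromisoformat(get_widget("source_date")),
        "environment": get_widget("environment"),
        "table_name": get_widget("table_name"),
    }
    if params["environment"] not in {"dev", "test", "prod"}:  # assumed names
        raise ValueError(f"unknown environment: {params['environment']}")
    return params

# Local stand-in for dbutils.widgets.get
fake_widgets = {
    "source_date": "2026-05-18",
    "environment": "prod",
    "table_name": "customers",
}
params = read_job_params(fake_widgets.__getitem__)
print(params["source_date"].year, params["table_name"])
```

Failing fast on a malformed date is cheaper than discovering it halfway through a load.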
Method 2: Task Values (Dynamic)
A task can output a value that downstream tasks read:
Task 1 (Producer):
# At the end of the notebook
row_count = df.count()
dbutils.jobs.taskValues.set(key="bronze_row_count", value=row_count)
dbutils.jobs.taskValues.set(key="load_date", value="2026-05-18")
Task 2 (Consumer):
# Read value from Task 1
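bronze_rows = dbutils.jobs.taskValues.get(
    taskKey="Ingest_Bronze",
    key="bronze_row_count",
    default=0,      # returned if the key was never set by Ingest_Bronze
    debugValue=0    # returned when the notebook runs interactively, outside a job
)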
print(f"Bronze ingested {bronze_rows} rows")
Real-life analogy: Task values are like passing a baton in a relay race. Runner 1 (Ingest) passes the baton (row count) to Runner 2 (Transform). Runner 2 knows exactly how many rows to expect.
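One practical use of the row count is a sanity check in the consumer. This is a minimal sketch, assuming the Transform task knows its own output row count; the 5% threshold and the sample numbers are illustrative only, not a recommendation.

```python
def check_row_counts(bronze_rows, silver_rows, max_drop_ratio=0.05):
    """Raise if Silver lost a larger share of rows than allowed."""
    if bronze_rows == 0:
        raise ValueError("Bronze ingested 0 rows; nothing to transform")
    ratio = (bronze_rows - silver_rows) / bronze_rows
    if ratio > max_drop_ratio:
        raise ValueError(f"Silver dropped {ratio:.1%} of Bronze rows")
    return ratio

# In the notebook, bronze_rows would come from
# dbutils.jobs.taskValues.get(taskKey="Ingest_Bronze", key="bronze_row_count", default=0)
ratio = check_row_counts(bronze_rows=1000, silver_rows=980)
print(f"dropped {ratio:.1%}")
```

Raising an exception fails the task, which in turn stops downstream tasks that depend on its success.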
Schedule Types
Cron Schedule
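Databricks schedules use Quartz cron syntax, which has six fields (seconds, minutes, hours, day of month, month, day of week). The time zone is a separate setting on the schedule.
# Every day at 2:00 AM
0 0 2 * * ?
# Every weekday at 6:00 AM
0 0 6 ? * MON-FRI
# Every hour
0 0 * * * ?
# Every Sunday at midnight
0 0 0 ? * SUN
# First day of every month at 3 AM
0 0 3 1 * ?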
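When a schedule is set through the Jobs API rather than the UI, it is expressed as a Quartz cron expression (six fields, seconds first) together with a time zone. The helper below is a small sketch that builds such a `schedule` block for a once-a-day run; the function name is hypothetical.

```python
def daily_schedule(hour, minute=0, timezone_id="UTC"):
    """Build a Jobs API `schedule` block that fires once a day."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    return {
        # Quartz field order: seconds minutes hours day-of-month month day-of-week
        "quartz_cron_expression": f"0 {minute} {hour} * * ?",
        "timezone_id": timezone_id,
        "pause_status": "UNPAUSED",
    }

schedule = daily_schedule(hour=2)
print(schedule["quartz_cron_expression"])  # 0 0 2 * * ?
```

Setting `timezone_id` explicitly avoids surprises when a "2 AM" job is expected to follow local time rather than UTC.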
Manual Trigger
Click Run now in the Workflows UI or trigger via REST API.
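For the REST route, a run is started with a `POST` to the Jobs API `run-now` endpoint. The sketch below only prepares the request using the standard library; the workspace URL, token, and job ID are placeholders, and the request is not sent.

```python
import json
import urllib.request

def build_run_now_request(host, token, job_id, notebook_params=None):
    """Prepare (but do not send) a POST to the Jobs API run-now endpoint."""
    body = {"job_id": job_id}
    if notebook_params:
        body["notebook_params"] = notebook_params
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.1/jobs/run-now",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

request = build_run_now_request(
    host="https://adb-1234567890123456.7.azuredatabricks.net",  # placeholder
    token="<personal-access-token>",  # placeholder
    job_id=123,  # placeholder
    notebook_params={"source_date": "2026-05-18"},
)
print(request.get_method(), request.full_url)
# To actually trigger the run: urllib.request.urlopen(request)
```

In practice the token would come from a secret store rather than source code.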
File Arrival Trigger
{
  "file_arrival": {
    "url": "abfss://raw-data@storage.dfs.core.windows.net/incoming/",
    "min_time_between_triggers_seconds": 60,
    "wait_after_last_change_seconds": 60
  }
}
Triggers the workflow when new files land in the specified path.
Retry and Timeout Configuration
Retry Policy
| Setting | Value | What It Does |
|---|---|---|
| Max retries | 2 | Retry a failed task up to 2 times (3 attempts in total) |
| Min retry interval | 30 seconds | Wait at least 30 seconds between a failed attempt and the next retry |
| Retry on timeout | Enabled | Also retry when the attempt was stopped by the task timeout |
Timeout
| Setting | Value | What It Does |
|---|---|---|
| Task timeout | 3600 seconds (1 hour) | Kill the task if it runs longer |
| Job timeout | 14400 seconds (4 hours) | Kill the entire job if it exceeds this |
Real-life analogy: Retry is like an automatic redial on a phone. If the call fails, try again after 30 seconds. If it fails 3 times, give up and send an alert. Timeout is like a kitchen timer — if the dish is not ready in 1 hour, something is wrong.
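In a job definition these settings map to a handful of task-level and job-level fields. The following sketch applies the values from the tables above; note that the API takes the retry interval in milliseconds. The helper name is hypothetical.

```python
def with_retry_and_timeout(task, max_retries=2, min_retry_seconds=30,
                           timeout_seconds=3600):
    """Return a copy of `task` with retry and timeout settings applied."""
    return {
        **task,
        "max_retries": max_retries,
        "min_retry_interval_millis": min_retry_seconds * 1000,
        "retry_on_timeout": False,  # a timed-out attempt is not retried
        "timeout_seconds": timeout_seconds,
    }

task = with_retry_and_timeout({"task_key": "Transform_Silver"})
job_settings = {
    "name": "Daily_ETL_Pipeline",
    "timeout_seconds": 14400,  # job-level limit: 4 hours
    "tasks": [task],
}
print(task["min_retry_interval_millis"], job_settings["timeout_seconds"])
```

Retries suit transient failures such as a reclaimed spot VM; a persistent data error will simply fail again on every attempt.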
Alerts and Notifications
Email Alerts
Configure under Notifications in the job settings:
| Event | Send To | When |
|---|---|---|
| On Start | team@company.com | Job starts running |
| On Success | team@company.com | All tasks completed successfully |
| On Failure | oncall@company.com | Any task failed (after retries) |
| On Duration | oncall@company.com | Job exceeds expected duration |
Slack/PagerDuty Integration
Configure webhook URLs in the notification settings for real-time alerts to Slack channels or PagerDuty incidents.
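The same notification settings can be written as part of the job definition. This is a sketch of the relevant Jobs API keys under the assumption that a workspace admin has already created a notification destination for Slack or PagerDuty; its ID is a placeholder here.

```python
def build_notifications(team, oncall, max_duration_seconds):
    """Return the notification-related keys of a Jobs API job body."""
    return {
        "email_notifications": {
            "on_start": [team],
            "on_success": [team],
            "on_failure": [oncall],
            "on_duration_warning_threshold_exceeded": [oncall],
        },
        # The duration warning fires when this health rule is breached.
        "health": {
            "rules": [
                {
                    "metric": "RUN_DURATION_SECONDS",
                    "op": "GREATER_THAN",
                    "value": max_duration_seconds,
                }
            ]
        },
        # Slack/PagerDuty are addressed by notification destination ID.
        "webhook_notifications": {
            "on_failure": [{"id": "<notification-destination-id>"}],
        },
    }

notifications = build_notifications(
    team="team@company.com",
    oncall="oncall@company.com",
    max_duration_seconds=3600,
)
print(sorted(notifications))
```

Routing failures to the on-call address and routine events to the team address keeps the on-call inbox limited to actionable messages.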
Monitoring Job Runs
Run History
Click on a job → Runs tab shows:
| Column | What It Shows |
|---|---|
| Run ID | Unique identifier |
| Start time | When the job started |
| Duration | Total execution time |
| Status | Succeeded, Failed, Cancelled, Running |
| Tasks | Individual task statuses |
Task-Level Details
Click on a specific run → see each task’s:
- Duration
- Status (green/red)
- Output logs
- Spark UI link (for performance debugging)
- Error message (if failed)
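Run history is also available programmatically through the Jobs API `GET /api/2.1/jobs/runs/list` endpoint. The sketch below flattens a response of that shape into simple rows; the sample payload is hand-written for illustration and is not real output.

```python
from datetime import datetime, timezone

def summarize_runs(response):
    """Flatten a runs/list response into one small dict per run."""
    rows = []
    for run in response.get("runs", []):
        state = run.get("state", {})
        started = datetime.fromtimestamp(run["start_time"] / 1000, tz=timezone.utc)
        end_ms = run.get("end_time") or 0  # missing or 0 while still running
        rows.append({
            "run_id": run["run_id"],
            "started_utc": started.isoformat(),
            "minutes": round((end_ms - run["start_time"]) / 60000, 1) if end_ms else None,
            "status": state.get("result_state") or state.get("life_cycle_state"),
        })
    return rows

# Hand-written sample in the shape of the API response
sample = {
    "runs": [
        {"run_id": 1, "start_time": 1_700_000_000_000, "end_time": 1_700_000_600_000,
         "state": {"life_cycle_state": "TERMINATED", "result_state": "FAILED"}},
        {"run_id": 2, "start_time": 1_700_003_600_000, "end_time": 0,
         "state": {"life_cycle_state": "RUNNING"}},
    ]
}
failed = [row["run_id"] for row in summarize_runs(sample) if row["status"] == "FAILED"]
print(failed)
```

A summary like this can feed a dashboard or a daily digest without anyone opening the Runs tab.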
The Complete Medallion Workflow
Here is a production-ready workflow that implements the full Medallion Architecture:
Job: Daily_Medallion_ETL
Schedule: 2:00 AM daily
Task 1: Config_Setup
Notebook: /Config/Storage_Config
Purpose: Set up storage connections, define paths
Task 2: Ingest_Customers_Bronze
Depends: Config_Setup
Notebook: /Bronze/Ingest_Customers
Parameters: {"source_table": "SalesLT.Customer"}
Task 3: Ingest_Products_Bronze
Depends: Config_Setup
Notebook: /Bronze/Ingest_Products
Parameters: {"source_table": "SalesLT.Product"}
Task 4: Transform_Customers_Silver
Depends: Ingest_Customers_Bronze
Notebook: /Silver/Transform_Customers
Purpose: Null handling, dedup, schema enforcement
Task 5: Transform_Products_Silver
Depends: Ingest_Products_Bronze
Notebook: /Silver/Transform_Products
Task 6: Build_Dim_Customer (SCD Type 2)
Depends: Transform_Customers_Silver
Notebook: /Gold/SCD2_Dim_Customer
Task 7: Build_Fact_Orders
Depends: Transform_Customers_Silver AND Transform_Products_Silver
Notebook: /Gold/Build_Fact_Orders
Task 8: Data_Quality_Check
Depends: Build_Dim_Customer AND Build_Fact_Orders
Notebook: /Quality/Run_DQ_Checks
Task 9: Optimize_Tables
Depends: Data_Quality_Check
Notebook: /Maintenance/Optimize_Vacuum
DAG:
            Config_Setup
            /          \
   Ingest_Cust      Ingest_Prod        ← Parallel ingestion
        |                |
 Transform_Cust    Transform_Prod      ← Parallel transformation
        |      \         |
        |       \        |
   Build_Dim     Build_Fact            ← Fact depends on BOTH
         \        /
          DQ_Check                     ← After all Gold tables
             |
          Optimize                     ← Maintenance
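With nine tasks, writing the job definition by hand gets error-prone, so it can help to generate it from a compact specification. The sketch below builds a Jobs API job body from the task list above and rejects dependencies on undefined tasks. Compute settings are omitted, and the UTC time zone is an assumption.

```python
# (task_key, notebook_path, upstream task_keys), copied from the list above
SPEC = [
    ("Config_Setup", "/Config/Storage_Config", []),
    ("Ingest_Customers_Bronze", "/Bronze/Ingest_Customers", ["Config_Setup"]),
    ("Ingest_Products_Bronze", "/Bronze/Ingest_Products", ["Config_Setup"]),
    ("Transform_Customers_Silver", "/Silver/Transform_Customers", ["Ingest_Customers_Bronze"]),
    ("Transform_Products_Silver", "/Silver/Transform_Products", ["Ingest_Products_Bronze"]),
    ("Build_Dim_Customer", "/Gold/SCD2_Dim_Customer", ["Transform_Customers_Silver"]),
    ("Build_Fact_Orders", "/Gold/Build_Fact_Orders",
     ["Transform_Customers_Silver", "Transform_Products_Silver"]),
    ("Data_Quality_Check", "/Quality/Run_DQ_Checks", ["Build_Dim_Customer", "Build_Fact_Orders"]),
    ("Optimize_Tables", "/Maintenance/Optimize_Vacuum", ["Data_Quality_Check"]),
]
PARAMS = {
    "Ingest_Customers_Bronze": {"source_table": "SalesLT.Customer"},
    "Ingest_Products_Bronze": {"source_table": "SalesLT.Product"},
}

def build_medallion_job(spec, params):
    """Build the job body, rejecting dependencies on undefined tasks."""
    known = {key for key, _, _ in spec}
    tasks = []
    for key, path, upstream in spec:
        missing = set(upstream) - known
        if missing:
            raise ValueError(f"{key} depends on undefined tasks: {sorted(missing)}")
        tasks.append({
            "task_key": key,
            "depends_on": [{"task_key": up} for up in upstream],
            "notebook_task": {
                "notebook_path": path,
                "base_parameters": params.get(key, {}),
            },
        })
    return {
        "name": "Daily_Medallion_ETL",
        "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
        "tasks": tasks,
    }

job = build_medallion_job(SPEC, PARAMS)
print(len(job["tasks"]), "tasks")
```

Adding a new table to the pipeline then means adding rows to `SPEC` rather than editing nested JSON.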
Triggering Workflows from ADF
If you use ADF for orchestration, trigger Databricks workflows using the Databricks activity in ADF:
- ADF Pipeline → add Databricks activity (under Databricks section)
- Configure linked service to your Databricks workspace
- Select Run Databricks Job
- Enter the Job ID (from Workflows UI)
- Pass parameters from ADF pipeline parameters
This gives you the best of both worlds: ADF for multi-service orchestration (Copy + Databricks + SQL), Databricks Workflows for notebook-level task management.
Cost Optimization
- Use Job Clusters — 40-60% cheaper than all-purpose clusters
- Use spot instances — for fault-tolerant batch jobs, 60-90% cheaper VMs
- Right-size clusters — do not use 10 workers for a job that processes 1 GB
- Set timeouts — prevent runaway jobs from consuming resources for hours
- Schedule during off-peak hours — some regions have lower spot prices at night
- Share Job Clusters across tasks — multiple tasks in the same workflow can reuse the same cluster
- Auto-scale workers — set min=2, max=10 and let Databricks adjust
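Several of these points can be combined in one job definition. Here is a sketch of a shared job cluster with autoscaling and Azure spot instances that every task reuses. The cluster key and the runtime version string are assumptions, and notebook settings are left out for brevity.

```python
def shared_cluster_job(task_keys, min_workers=2, max_workers=10):
    """Job body in which every task reuses one autoscaling spot cluster."""
    cluster_key = "shared_etl_cluster"  # hypothetical key
    return {
        "job_clusters": [
            {
                "job_cluster_key": cluster_key,
                "new_cluster": {
                    "spark_version": "15.4.x-scala2.12",  # assumed LTS version
                    "node_type_id": "Standard_E8s_v3",
                    "autoscale": {
                        "min_workers": min_workers,
                        "max_workers": max_workers,
                    },
                    "azure_attributes": {
                        "first_on_demand": 1,  # keep the driver on an on-demand VM
                        "availability": "SPOT_WITH_FALLBACK_AZURE",
                    },
                },
            }
        ],
        # Each task points at the shared cluster instead of defining its own.
        "tasks": [
            {"task_key": key, "job_cluster_key": cluster_key} for key in task_keys
        ],
    }

job = shared_cluster_job(["Ingest_Bronze", "Transform_Silver", "Build_Gold"])
print(job["job_clusters"][0]["new_cluster"]["autoscale"])
```

Sharing one cluster means the startup time is paid once per run instead of once per task, and the fallback setting lets the job continue on on-demand VMs if spot capacity is reclaimed.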
Common Errors and Fixes
| Error | Cause | Fix |
|---|---|---|
| “Cluster startup timeout” | Job cluster took too long to provision | Increase timeout or use a cluster pool |
| “Notebook not found” | Wrong path or notebook was moved | Verify the path in Workflows settings |
| “Permission denied” | Job owner lacks access to notebooks or storage | Check workspace permissions and storage roles |
| “Task failed after max retries” | Persistent error (data issue, not transient) | Check task logs, fix the root cause, rerun |
| “Job timed out” | Processing took longer than expected | Increase timeout or optimize the notebook |
| “Spot instance reclaimed” | Azure reclaimed the spot VM | Add retry policy (spot reclaims are transient) |
Interview Questions
Q: What are Databricks Workflows?
A: The built-in job scheduler and orchestrator in Databricks. It lets you schedule notebooks, chain them into multi-task DAG pipelines with dependencies, pass parameters between tasks, configure retries and timeouts, and send alerts on success/failure.
Q: What is the difference between Job Clusters and All-Purpose Clusters?
A: Job Clusters are created when a job starts and destroyed when it ends, so they get the cheaper DBU rate and carry no idle cost. All-Purpose Clusters persist and are shared by multiple users, at a higher rate but with instant availability. Use Job Clusters for production jobs.
Q: How do you pass data between tasks in a Workflow?
A: Using Task Values. The producing task calls dbutils.jobs.taskValues.set(key, value) and the consuming task calls dbutils.jobs.taskValues.get(taskKey, key). This allows downstream tasks to know row counts, file paths, or status from upstream tasks.
Q: When would you use Databricks Workflows vs ADF Pipelines?
A: Use Workflows when all logic is in Databricks notebooks. Use ADF when you need multi-service orchestration (ADF Copy + Databricks + SQL Pool + Azure Functions). Use both together: ADF triggers the Databricks Workflow for notebook-level task management.
Wrapping Up
Databricks Workflows turn your notebooks from manual experiments into automated production pipelines. The multi-task DAG lets you model complex dependencies, parallel execution cuts processing time, and email alerts ensure you know when something breaks.
The pattern is simple: one workflow per domain (Daily_Sales_ETL, Weekly_Customer_Refresh), tasks following the Medallion layers (Bronze → Silver → Gold → Quality), job clusters for cost efficiency, and alerts for reliability.
Related posts: – Azure Databricks Introduction – Medallion Architecture – SCD Type 1 and 2 in PySpark – Delta Lake Optimization – PySpark Foundations
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.