Databricks Asset Bundles (DABs): YAML-Based CI/CD, Project Structure, Deployment from Dev to Prod, and Modern Databricks DevOps
Our Git Integration and CI/CD post covered the fundamentals — Repos, GitHub Actions, and deployment workflows. But Databricks has introduced a newer, more powerful approach: Databricks Asset Bundles (DABs).
DABs let you define your entire Databricks project — notebooks, jobs, pipelines, clusters, permissions — in YAML configuration files. You deploy everything with a single command: databricks bundle deploy. Think of it as Infrastructure-as-Code, but for Databricks resources.
If the old approach (manually creating jobs in the UI, exporting JSON configs) was like hand-writing letters to every department, DABs is like setting up an automated mail system — write the template once, and it delivers the right version to every environment (Dev, Staging, Prod) automatically.
Table of Contents
- What Are Databricks Asset Bundles?
- Why DABs Over Manual Deployment
- Prerequisites and CLI Setup
- Project Structure
- The databricks.yml Configuration File
- Defining Resources
  - Jobs
  - Delta Live Tables Pipelines
- Environment Targets (Dev, Staging, Prod)
- Variables and Substitutions
- Deploying a Bundle
  - Validate, Deploy, Run
  - What Happens During Deployment
- CI/CD with DABs and GitHub Actions
  - GitHub Actions Workflow
  - Branch Strategy
- Permissions and Access Control
- DABs vs Repos-Based CI/CD
- Common Mistakes
- Interview Questions
- Wrapping Up
What Are Databricks Asset Bundles?
A Databricks Asset Bundle is a project directory that contains your code (notebooks, Python files) AND your infrastructure definitions (jobs, clusters, pipelines) as YAML files. Everything your Databricks project needs — code, compute configuration, schedules, and permissions — lives together in version-controlled files.
Without DABs:
Code lives in Git (notebooks, .py files)
Jobs created manually in Databricks UI
Cluster configs set through UI clicks
Promoting Dev → Prod = manual recreation or JSON export/import
Risk: UI settings drift between environments
With DABs:
Code lives in Git (same)
Jobs defined in databricks.yml (YAML file in Git)
Cluster configs defined in databricks.yml
Promoting Dev → Prod = "databricks bundle deploy --target prod"
Guarantee: environments stay consistent because they are built from the same YAML, differing only in the overrides you declare per target
Why DABs Over Manual Deployment
| Aspect | Manual/UI Deployment | DABs |
|---|---|---|
| Reproducibility | Environments drift over time | Identical config across all environments |
| Version control | Job configs not in Git | Everything in Git (code + config) |
| Deployment | Manual clicks or JSON import | One command: databricks bundle deploy |
| Rollback | Recreate manually | git revert + redeploy |
| Collaboration | “Who changed the job schedule?” | Git blame shows who changed what and when |
| Multi-environment | Copy-paste between workspaces | Targets: dev, staging, prod in one YAML |
Prerequisites and CLI Setup
# Install Databricks CLI v2 (required for DABs)
# macOS
brew tap databricks/tap
brew install databricks
# Linux / WSL
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
# Verify installation
databricks --version # Should report 0.205 or later; the legacy 0.1x CLI does not support bundles
# Configure authentication
databricks configure --profile DEFAULT
# Enter: Databricks workspace URL (https://adb-xxx.azuredatabricks.net)
# Enter: Personal access token (or use OAuth / Azure CLI auth)
# Verify connection
databricks workspace list /
Project Structure
my_data_project/
├── databricks.yml              ← Main configuration (jobs, targets, resources)
├── src/
│   ├── bronze_ingestion.py     ← PySpark scripts
│   ├── silver_transform.py
│   └── gold_aggregate.py
├── notebooks/
│   ├── exploration.py          ← Interactive notebooks (optional)
│   └── data_quality_check.py
├── tests/
│   ├── test_transforms.py      ← Unit tests
│   └── test_data_quality.py
├── resources/
│   ├── jobs.yml                ← Job definitions (can be separate files)
│   └── pipelines.yml           ← DLT pipeline definitions
├── .github/
│   └── workflows/
│       └── deploy.yml          ← GitHub Actions CI/CD
└── README.md
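The tree keeps job and pipeline definitions under resources/, separate from databricks.yml. A minimal sketch of how the main file can pull them in, assuming the bundle-level `include` mapping and the file names from the tree above (the commented job body is illustrative only):

```yaml
# databricks.yml (excerpt)
bundle:
  name: sales_data_pipeline

# Load every YAML file under resources/ as part of this bundle
include:
  - resources/*.yml

# --- resources/jobs.yml (separate file) would then start like this ---
# resources:
#   jobs:
#     daily_etl:
#       name: "Daily Sales ETL"
```

Each included file uses the same top-level keys as databricks.yml, so a job can move between the main file and resources/ without changing its definition.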
The databricks.yml Configuration File
# databricks.yml: the heart of your DAB project
bundle:
  name: sales_data_pipeline
# Variables: reusable values across the config
variables:
  warehouse_id:
    description: "SQL Warehouse ID for queries"
    default: "abc123def456"
# Resources: what gets deployed
resources:
  jobs:
    daily_etl:
      name: "Daily Sales ETL"
      schedule:
        quartz_cron_expression: "0 0 6 * * ?" # 6 AM daily
        timezone_id: "America/Toronto"
      tasks:
        - task_key: bronze_ingest
          spark_python_task:
            python_file: ./src/bronze_ingestion.py
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 2
        - task_key: silver_transform
          depends_on:
            - task_key: bronze_ingest
          spark_python_task:
            python_file: ./src/silver_transform.py
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 4
        - task_key: gold_aggregate
          depends_on:
            - task_key: silver_transform
          spark_python_task:
            python_file: ./src/gold_aggregate.py
          new_cluster: # Every task needs compute; this size is an assumption
            spark_version: "14.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 2
# Targets: environment-specific overrides
targets:
  dev:
    default: true
    workspace:
      host: https://adb-dev.azuredatabricks.net
    resources:
      jobs:
        daily_etl:
          name: "[DEV] Daily Sales ETL"
          schedule:
            pause_status: PAUSED # No scheduled runs in dev: trigger manually
  staging:
    workspace:
      host: https://adb-staging.azuredatabricks.net
    resources:
      jobs:
        daily_etl:
          name: "[STAGING] Daily Sales ETL"
  prod:
    workspace:
      host: https://adb-prod.azuredatabricks.net
    resources:
      jobs:
        daily_etl:
          name: "Daily Sales ETL"
          tasks:
            - task_key: bronze_ingest
              new_cluster:
                num_workers: 8 # More workers in prod
            - task_key: silver_transform
              new_cluster:
                num_workers: 16 # Scale up for production data
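The tasks in the configuration above repeat the same `new_cluster` settings. One way to reduce that repetition is a shared job cluster that tasks reference by key. The sketch below assumes all tasks can run on one cluster size; `etl_cluster` is a hypothetical key name, not something from the original project.

```yaml
resources:
  jobs:
    daily_etl:
      name: "Daily Sales ETL"
      job_clusters:
        - job_cluster_key: etl_cluster # hypothetical key name
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 2
      tasks:
        - task_key: bronze_ingest
          job_cluster_key: etl_cluster
          spark_python_task:
            python_file: ./src/bronze_ingestion.py
        - task_key: silver_transform
          depends_on:
            - task_key: bronze_ingest
          job_cluster_key: etl_cluster
          spark_python_task:
            python_file: ./src/silver_transform.py

targets:
  prod:
    resources:
      jobs:
        daily_etl:
          job_clusters:
            - job_cluster_key: etl_cluster # matched by key, then merged
              new_cluster:
                num_workers: 8
```

The trade-off is that every task shares one size, so per-task scaling like the 8 and 16 worker split in prod would need separate job clusters.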
Defining Resources
Jobs
resources:
  jobs:
    my_job:
      name: "Customer Pipeline"
      email_notifications:
        on_failure: ["team@company.com"]
      max_concurrent_runs: 1
      tasks:
        - task_key: ingest
          spark_python_task:
            python_file: ./src/ingest.py
            parameters: ["--date", "{{job.start_time.iso_date}}"] # Run start date (UTC)
          existing_cluster_id: "0123-456789-abcdef" # Or use new_cluster
Delta Live Tables Pipelines
resources:
  pipelines:
    sales_dlt:
      name: "Sales DLT Pipeline"
      target: "gold"
      libraries:
        - notebook:
            path: ./src/dlt_sales_pipeline.py
      configuration:
        "spark.databricks.delta.optimizeWrite.enabled": "true"
      clusters:
        - label: "default"
          num_workers: 4
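A pipeline defined in the bundle can also be triggered from a job in the same bundle. The sketch below assumes the `sales_dlt` pipeline key from above; the job key and task key are hypothetical names chosen for illustration.

```yaml
resources:
  jobs:
    sales_dlt_refresh: # hypothetical job key
      name: "Sales DLT Refresh"
      tasks:
        - task_key: run_sales_dlt
          pipeline_task:
            # Resolved at deploy time to the ID of the pipeline defined in this bundle
            pipeline_id: ${resources.pipelines.sales_dlt.id}
```

Referencing the pipeline through `${resources.pipelines.sales_dlt.id}` avoids hardcoding an ID that differs in every workspace.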
Environment Targets (Dev, Staging, Prod)
The same YAML, three different deployments:
$ databricks bundle deploy --target dev
→ Creates "[DEV] Daily Sales ETL" in dev workspace
→ No scheduled runs (manual runs only)
→ 2-4 workers per task (base cluster sizes)
$ databricks bundle deploy --target staging
→ Creates "[STAGING] Daily Sales ETL" in staging workspace
→ Scheduled at 6 AM
→ 2-4 workers per task (no cluster overrides, so base sizes apply)
$ databricks bundle deploy --target prod
→ Creates "Daily Sales ETL" in prod workspace
→ Scheduled at 6 AM
→ 8-16 workers (production-scale)
Same code. Different configs. One command.
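Targets can also declare a deployment mode instead of hand-writing every dev-specific override. A minimal sketch, assuming the `mode` setting behaves as I understand it: development mode prefixes deployed resource names with the deploying user and pauses schedules, while production mode adds stricter validation at deploy time. The hosts are the placeholder URLs used earlier.

```yaml
targets:
  dev:
    mode: development # names get a [dev <user>] prefix; schedules are paused
    default: true
    workspace:
      host: https://adb-dev.azuredatabricks.net
  prod:
    mode: production # stricter checks before deploying
    workspace:
      host: https://adb-prod.azuredatabricks.net
```

With `mode: development` on the dev target, a manual "[DEV]" name prefix and schedule override become largely redundant, so it is worth picking one approach rather than combining both.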
Variables and Substitutions
variables:
  catalog:
    description: "Unity Catalog name"
  warehouse_id:
    description: "SQL Warehouse for queries"
targets:
  dev:
    variables:
      catalog: "dev_catalog"
      warehouse_id: "dev_warehouse_123"
  prod:
    variables:
      catalog: "prod_catalog"
      warehouse_id: "prod_warehouse_456"
# Use in task parameters:
tasks:
  - task_key: transform
    spark_python_task:
      python_file: ./src/transform.py
      parameters: ["--catalog", "${var.catalog}"]
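Besides custom variables, bundles expose built-in substitutions such as `${bundle.target}`. The sketch below combines both; the default value, the tag, and the `--env` parameter are assumptions added for illustration, and compute settings are omitted for brevity.

```yaml
variables:
  catalog:
    description: "Unity Catalog name"
    default: "dev_catalog" # assumed fallback when a target does not set it

resources:
  jobs:
    daily_etl:
      name: "Daily Sales ETL (${bundle.target})" # resolves to dev, staging, or prod
      tags:
        environment: ${bundle.target}
      tasks:
        - task_key: transform
          spark_python_task:
            python_file: ./src/transform.py
            # --env is a hypothetical argument the script would need to accept
            parameters: ["--catalog", "${var.catalog}", "--env", "${bundle.target}"]
```

A variable can also be supplied at deploy time, for example with `--var="catalog=prod_catalog"` on the command line or a `BUNDLE_VAR_catalog` environment variable, which is handy in CI where the value should not live in Git.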
Deploying a Bundle
Validate, Deploy, Run
# Step 1: Validate — check YAML syntax and resource references
databricks bundle validate --target dev
# Output: "Validation OK!" or detailed errors
# Step 2: Deploy — create/update resources in the workspace
databricks bundle deploy --target dev
# Creates jobs, uploads code, sets schedules
# Idempotent: safe to run multiple times
# Step 3: Run — trigger the job immediately (optional)
databricks bundle run daily_etl --target dev
# Runs the job and shows real-time output
# Destroy — remove all deployed resources (cleanup)
databricks bundle destroy --target dev
# Deletes jobs, pipelines created by this bundle
What Happens During Deployment
$ databricks bundle deploy --target prod
1. Reads databricks.yml + target overrides for "prod"
2. Uploads source files (src/*.py) to workspace file system
3. Creates or updates the Databricks Job:
- Name: "Daily Sales ETL"
- Schedule: 6 AM daily
- Tasks: bronze → silver → gold (with dependencies)
- Cluster: 8-16 workers
4. Sets permissions and email notifications
5. Reports "Deployment complete!" when all resources are in place
Next deploy (same target):
- Updates only what changed (incremental)
- Does NOT duplicate — updates the existing job
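Step 2 uploads files into a bundle folder in the workspace. As far as I know, the default location sits under the deploying user's home folder in a `.bundle` directory, which is awkward when a CI identity deploys to prod. A sketch of pinning the location explicitly, assuming a shared folder is acceptable in your workspace:

```yaml
targets:
  prod:
    workspace:
      host: https://adb-prod.azuredatabricks.net
      # Assumed shared location; adjust to your workspace conventions
      root_path: /Workspace/Shared/.bundle/${bundle.name}/${bundle.target}
```

Keeping the path keyed on `${bundle.name}` and `${bundle.target}` means two bundles, or two targets of the same bundle, do not overwrite each other's files.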
CI/CD with DABs and GitHub Actions
GitHub Actions Workflow
# .github/workflows/deploy.yml
name: Deploy Databricks Bundle
on:
  push:
    branches: [main] # Deploy to prod on merge to main
  pull_request:
    branches: [main] # Validate on PR
jobs:
  validate:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - run: databricks bundle validate --target staging
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_STAGING_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_STAGING_TOKEN }}
  deploy-prod:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - run: databricks bundle deploy --target prod
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_PROD_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_PROD_TOKEN }}
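Personal access tokens are tied to a user. A common alternative for CI is a service principal authenticated with OAuth client credentials. The sketch below is one possible variant of the deploy job, assuming the CLI picks up `DATABRICKS_CLIENT_ID` and `DATABRICKS_CLIENT_SECRET` from the environment; the secret names are hypothetical.

```yaml
# .github/workflows/deploy.yml (alternative deploy job, sketch)
jobs:
  deploy-prod:
    # Only run on pushes to main, never on pull request events
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    env:
      DATABRICKS_HOST: ${{ secrets.DATABRICKS_PROD_HOST }}
      DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_PROD_CLIENT_ID }} # hypothetical secret name
      DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_PROD_CLIENT_SECRET }} # hypothetical secret name
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      # Validate against prod first so a bad config fails before anything changes
      - run: databricks bundle validate --target prod
      - run: databricks bundle deploy --target prod
```

Setting `env` at the job level keeps the credentials in one place when several steps call the CLI.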
Branch Strategy
feature/DE-1234 → PR to main → Validate (staging) → Merge → Deploy (prod)
- feature/DE-1234: develop locally
- PR to main: code review + approve
- Validate (staging): bundle validate against staging
- Merge: approved by team
- Deploy (prod): bundle deploy to production
Permissions and Access Control
# Set permissions in databricks.yml
resources:
  jobs:
    daily_etl:
      permissions:
        - user_name: "data-engineers@company.com"
          level: CAN_MANAGE
        - group_name: "analysts"
          level: CAN_VIEW
        - service_principal_name: "cicd-principal" # Use the service principal's application ID here
          level: IS_OWNER
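Permissions can also be declared once at the top level of the bundle so they apply to every resource, and a target can name the identity its jobs run as. A minimal sketch, assuming the top-level `permissions` and target-level `run_as` mappings; the group names and the application ID are placeholders.

```yaml
# Applies to all resources defined in the bundle
permissions:
  - group_name: "data-engineers" # placeholder group
    level: CAN_MANAGE
  - group_name: "analysts"
    level: CAN_VIEW

targets:
  prod:
    run_as:
      # Placeholder application ID of a service principal
      service_principal_name: "00000000-0000-0000-0000-000000000000"
```

Running prod jobs as a service principal means they keep working when the engineer who last deployed leaves the team or loses access.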
DABs vs Repos-Based CI/CD
| Feature | Repos + GitHub Actions (old) | DABs (modern) |
|---|---|---|
| Job definition | Created via UI or REST API | Defined in YAML (version-controlled) |
| Multi-environment | Separate scripts per environment | Targets in one YAML file |
| Cluster config | UI or JSON | YAML with target overrides |
| Deployment | REST API calls or dbx tool | databricks bundle deploy |
| Rollback | Manual or REST API | git revert + redeploy |
| Learning curve | Lower (familiar UI) | Moderate (YAML + CLI) |
| Recommended by Databricks | Legacy approach | Current best practice |
Common Mistakes
- Not using targets for environments: deploying the same config to Dev and Prod means Prod runs with Dev-sized clusters. Always use target overrides for cluster sizing, schedules, and naming.
- Hardcoding workspace URLs: keep each workspace host under its target instead of scattering URLs through resource definitions. Never put Prod credentials in the YAML file; use environment variables or secrets.
- Forgetting bundle validate: always validate before deploying. A typo in YAML can silently break a job configuration.
- Leaving max_concurrent_runs unmanaged: keep it at 1 (the Jobs default) in the YAML unless the job is designed for parallel runs. Overlapping scheduled runs can cause data corruption or duplicate processing.
- Mixing UI changes with DABs: if you modify a DABs-deployed job through the UI, the next bundle deploy overwrites your UI changes. All changes should go through YAML + Git.
Interview Questions
Q: What are Databricks Asset Bundles and why should you use them?
A: DABs are a project-based approach to Databricks CI/CD. You define all resources (jobs, pipelines, clusters, permissions) in YAML files alongside your code. This ensures environments are identical (same YAML, different targets), enables Git-based version control for infrastructure, and simplifies deployment to a single command. They replace manual UI-based job creation and legacy tools like dbx.
Q: How do DABs handle multi-environment deployment?
A: Through targets in databricks.yml. Each target (dev, staging, prod) specifies a workspace URL and resource overrides (cluster size, schedule, naming). The command databricks bundle deploy --target prod applies the prod overrides. Same code, same YAML, different environments — guaranteed consistency.
Q: How would you set up CI/CD with DABs and GitHub Actions?
A: On pull request: run databricks bundle validate --target staging to check configuration. On merge to main: run databricks bundle deploy --target prod to deploy to production. Store workspace credentials as GitHub Secrets. This gives you automated validation on every PR and automated deployment on every merge.
Wrapping Up
Databricks Asset Bundles are the modern way to manage Databricks projects. Code and infrastructure live together in Git. Environments are defined as targets, not separate configurations. Deployment is a single command. Combined with GitHub Actions, you get a complete CI/CD pipeline that validates on every PR and deploys on every merge.
If you are still creating jobs through the UI and manually replicating them across environments, DABs is the upgrade you need.
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.