Databricks Asset Bundles (DABs): YAML-Based CI/CD, Project Structure, Deployment from Dev to Prod, and Modern Databricks DevOps

Our Git Integration and CI/CD post covered the fundamentals — Repos, GitHub Actions, and deployment workflows. But Databricks has introduced a newer, more powerful approach: Databricks Asset Bundles (DABs).

DABs let you define your entire Databricks project — notebooks, jobs, pipelines, clusters, permissions — in YAML configuration files. You deploy everything with a single command: databricks bundle deploy. Think of it as Infrastructure-as-Code, but for Databricks resources.

If the old approach (manually creating jobs in the UI, exporting JSON configs) was like hand-writing letters to every department, DABs is like setting up an automated mail system — write the template once, and it delivers the right version to every environment (Dev, Staging, Prod) automatically.

Table of Contents

  • What Are Databricks Asset Bundles?
  • Why DABs Over Manual Deployment
  • Prerequisites and CLI Setup
  • Project Structure
  • The databricks.yml Configuration File
  • Defining Resources
      • Jobs
      • Delta Live Tables Pipelines
  • Environment Targets (Dev, Staging, Prod)
  • Variables and Substitutions
  • Deploying a Bundle
      • Validate, Deploy, Run
      • What Happens During Deployment
  • CI/CD with DABs and GitHub Actions
      • GitHub Actions Workflow
      • Branch Strategy
  • Permissions and Access Control
  • DABs vs Repos-Based CI/CD
  • Common Mistakes
  • Interview Questions
  • Wrapping Up

What Are Databricks Asset Bundles?

A Databricks Asset Bundle is a project directory that contains your code (notebooks, Python files) AND your infrastructure definitions (jobs, clusters, pipelines) as YAML files. Everything your Databricks project needs — code, compute configuration, schedules, and permissions — lives together in version-controlled files.

Without DABs:
  Code lives in Git (notebooks, .py files)
  Jobs created manually in Databricks UI
  Cluster configs set through UI clicks
  Promoting Dev → Prod = manual recreation or JSON export/import
  Risk: UI settings drift between environments

With DABs:
  Code lives in Git (same)
  Jobs defined in databricks.yml (YAML file in Git)
  Cluster configs defined in databricks.yml
  Promoting Dev → Prod = "databricks bundle deploy --target prod"
  Result: every environment is built from the same YAML, so differences are limited to explicit target overrides

Why DABs Over Manual Deployment

| Aspect | Manual/UI Deployment | DABs |
| --- | --- | --- |
| Reproducibility | Environments drift over time | Identical config across all environments |
| Version control | Job configs not in Git | Everything in Git (code + config) |
| Deployment | Manual clicks or JSON import | One command: databricks bundle deploy |
| Rollback | Recreate manually | git revert + redeploy |
| Collaboration | “Who changed the job schedule?” | Git blame shows who changed what and when |
| Multi-environment | Copy-paste between workspaces | Targets: dev, staging, prod in one YAML |

Prerequisites and CLI Setup

# Install Databricks CLI v2 (required for DABs)
# macOS
brew tap databricks/tap
brew install databricks

# Linux / WSL
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

# Verify installation
databricks --version   # Bundle commands need CLI 0.218.0 or above

# Configure authentication
databricks configure --profile DEFAULT
# Enter: Databricks workspace URL (https://adb-xxx.azuredatabricks.net)
# Enter: Personal access token (or use OAuth / Azure CLI auth)

# Verify connection
databricks workspace list /

Project Structure

my_data_project/
├── databricks.yml              ← Main configuration (jobs, targets, resources)
├── src/
│   ├── bronze_ingestion.py     ← PySpark scripts
│   ├── silver_transform.py
│   └── gold_aggregate.py
├── notebooks/
│   ├── exploration.py          ← Interactive notebooks (optional)
│   └── data_quality_check.py
├── tests/
│   ├── test_transforms.py      ← Unit tests
│   └── test_data_quality.py
├── resources/
│   ├── jobs.yml                ← Job definitions (can be separate files)
│   └── pipelines.yml           ← DLT pipeline definitions
├── .github/
│   └── workflows/
│       └── deploy.yml          ← GitHub Actions CI/CD
└── README.md
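
One way to make the tests/ directory useful without a cluster is to keep row-level logic in plain Python functions that the PySpark scripts call. The sketch below is hypothetical: clean_amount and its cleaning rules are not part of the project above, they only illustrate the shape a test in tests/test_transforms.py might take.

```python
# Hypothetical helper that could live in src/silver_transform.py.
# The name and the cleaning rules are illustrative assumptions.
def clean_amount(raw: str) -> float:
    """Parse a currency string such as ' $1,234.50 ' into a float."""
    cleaned = raw.strip().replace("$", "").replace(",", "")
    if not cleaned:
        raise ValueError("empty amount")
    return float(cleaned)


# Hypothetical pytest-style tests for tests/test_transforms.py.
def test_clean_amount_strips_symbols():
    assert clean_amount(" $1,234.50 ") == 1234.5


def test_clean_amount_rejects_empty():
    try:
        clean_amount("   ")
    except ValueError:
        return
    raise AssertionError("expected ValueError for an empty amount")
```

Because the helper has no Spark dependency, a test runner such as pytest can execute these checks on a laptop or in a CI job before anything is deployed.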

The databricks.yml Configuration File

# databricks.yml — the heart of your DAB project
bundle:
  name: sales_data_pipeline

# Variables — reusable values across the config
variables:
  warehouse_id:
    description: "SQL Warehouse ID for queries"
    default: "abc123def456"

# Resources — what gets deployed
resources:
  jobs:
    daily_etl:
      name: "Daily Sales ETL"
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"  # 6 AM daily
        timezone_id: "America/Toronto"
      tasks:
        - task_key: bronze_ingest
          spark_python_task:
            python_file: ./src/bronze_ingestion.py
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 2

        - task_key: silver_transform
          depends_on:
            - task_key: bronze_ingest
          spark_python_task:
            python_file: ./src/silver_transform.py
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 4

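        - task_key: gold_aggregate
          depends_on:
            - task_key: silver_transform
          spark_python_task:
            python_file: ./src/gold_aggregate.py
          # Each task needs compute; this cluster size is an assumed example
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 2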

# Targets — environment-specific overrides
targets:
  dev:
    default: true
    workspace:
      host: https://adb-dev.azuredatabricks.net
    resources:
      jobs:
        daily_etl:
          name: "[DEV] Daily Sales ETL"
          schedule:
            pause_status: PAUSED  # Keep the schedule paused in dev; run manually

  staging:
    workspace:
      host: https://adb-staging.azuredatabricks.net
    resources:
      jobs:
        daily_etl:
          name: "[STAGING] Daily Sales ETL"

  prod:
    workspace:
      host: https://adb-prod.azuredatabricks.net
    resources:
      jobs:
        daily_etl:
          name: "Daily Sales ETL"
          tasks:
            - task_key: bronze_ingest
              new_cluster:
                num_workers: 8        # More workers in prod
            - task_key: silver_transform
              new_cluster:
                num_workers: 16       # Scale up for production data
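
The schedule in the configuration above uses Quartz cron syntax, which has a leading seconds field that classic five-field cron lacks. As a reading aid only, the sketch below labels each field of an expression. The function name describe_quartz is made up for this example, and it does not validate the values.

```python
# Quartz cron fields in order; the seventh (year) is optional.
QUARTZ_FIELDS = (
    "seconds", "minutes", "hours",
    "day_of_month", "month", "day_of_week", "year",
)


def describe_quartz(expr: str) -> dict:
    """Label each field of a Quartz cron expression (no validation of values)."""
    parts = expr.split()
    if len(parts) not in (6, 7):
        raise ValueError(f"expected 6 or 7 fields, got {len(parts)}")
    # zip stops at the shorter sequence, so 'year' is omitted when absent.
    return dict(zip(QUARTZ_FIELDS, parts))


fields = describe_quartz("0 0 6 * * ?")
# fields["hours"] is "6" and fields["day_of_week"] is "?" (no specific value)
```

Read this way, "0 0 6 * * ?" means second 0, minute 0, hour 6, on every day of every month.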

Defining Resources

Jobs

resources:
  jobs:
    my_job:
      name: "Customer Pipeline"
      email_notifications:
        on_failure: ["team@company.com"]
      max_concurrent_runs: 1
      tasks:
        - task_key: ingest
          spark_python_task:
            python_file: ./src/ingest.py
            parameters: ["--date", "{{job.start_time.iso_date}}"]
          existing_cluster_id: "0123-456789-abcdef"  # Or use new_cluster

Delta Live Tables Pipelines

resources:
  pipelines:
    sales_dlt:
      name: "Sales DLT Pipeline"
      target: "gold"
      libraries:
        - notebook:
            path: ./src/dlt_sales_pipeline.py
      configuration:
        "spark.databricks.delta.optimizeWrite.enabled": "true"
      clusters:
        - label: "default"
          num_workers: 4

Environment Targets (Dev, Staging, Prod)

The same YAML, three different deployments:

$ databricks bundle deploy --target dev
  → Creates "[DEV] Daily Sales ETL" in dev workspace
  → No scheduled runs (manual runs only)
  → 2-4 workers (base cluster sizes)

$ databricks bundle deploy --target staging
  → Creates "[STAGING] Daily Sales ETL" in staging workspace
  → Scheduled at 6 AM
  → 2-4 workers (same base cluster sizes, no override)

$ databricks bundle deploy --target prod
  → Creates "Daily Sales ETL" in prod workspace
  → Scheduled at 6 AM
  → 8-16 workers on the bronze and silver tasks (production-scale)

Same code. Different configs. One command.
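
To build intuition for how a target override combines with the base definition, here is a simplified model in Python. It is an assumption-laden sketch, not the CLI's implementation: mappings are merged key by key, and task lists are matched on task_key so an override only needs to name the fields it changes.

```python
import copy


def _is_task_list(value) -> bool:
    """True for a list whose items are all dicts carrying a task_key."""
    return isinstance(value, list) and all(
        isinstance(item, dict) and "task_key" in item for item in value
    )


def merge(base, override):
    """Return base with override applied; neither input is modified."""
    if isinstance(base, dict) and isinstance(override, dict):
        merged = copy.deepcopy(base)
        for key, value in override.items():
            merged[key] = merge(base[key], value) if key in base else copy.deepcopy(value)
        return merged
    if _is_task_list(base) and _is_task_list(override):
        by_key = {task["task_key"]: copy.deepcopy(task) for task in base}
        for task in override:
            key = task["task_key"]
            by_key[key] = merge(by_key[key], task) if key in by_key else copy.deepcopy(task)
        return list(by_key.values())
    # Scalars and other lists: the override wins outright.
    return copy.deepcopy(override)


base_job = {
    "name": "Daily Sales ETL",
    "tasks": [
        {"task_key": "bronze_ingest",
         "new_cluster": {"node_type_id": "Standard_DS3_v2", "num_workers": 2}},
        {"task_key": "silver_transform",
         "new_cluster": {"node_type_id": "Standard_DS3_v2", "num_workers": 4}},
    ],
}
prod_override = {
    "tasks": [
        {"task_key": "bronze_ingest", "new_cluster": {"num_workers": 8}},
        {"task_key": "silver_transform", "new_cluster": {"num_workers": 16}},
    ],
}
prod_job = merge(base_job, prod_override)
```

In this model the prod job keeps the base name and node type while the worker counts become 8 and 16, which matches the prod result listed above.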

Variables and Substitutions

variables:
  catalog:
    description: "Unity Catalog name"
  warehouse_id:
    description: "SQL Warehouse for queries"

targets:
  dev:
    variables:
      catalog: "dev_catalog"
      warehouse_id: "dev_warehouse_123"
  prod:
    variables:
      catalog: "prod_catalog"
      warehouse_id: "prod_warehouse_456"

# Use in task parameters:
tasks:
  - task_key: transform
    spark_python_task:
      python_file: ./src/transform.py
      parameters: ["--catalog", "${var.catalog}"]
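
On the receiving side, the task script has to read the substituted value from its command line. A minimal sketch of what src/transform.py might do with it, assuming argparse and hypothetical schema and table names (silver, sales):

```python
import argparse


def parse_args(argv=None):
    """Parse the parameters the job passes to the script."""
    parser = argparse.ArgumentParser(description="Transform step (illustrative)")
    parser.add_argument("--catalog", required=True,
                        help="Unity Catalog name supplied by the bundle variable")
    return parser.parse_args(argv)


def table_name(catalog: str, schema: str, table: str) -> str:
    """Build a three-level Unity Catalog name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


# Simulate what the dev target would pass; the schema and table are assumptions.
args = parse_args(["--catalog", "dev_catalog"])
target_table = table_name(args.catalog, "silver", "sales")
# target_table is "dev_catalog.silver.sales"
```

With this shape, the same script writes to dev_catalog or prod_catalog depending only on the target that deployed it.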

Deploying a Bundle

Validate, Deploy, Run

# Step 1: Validate — check YAML syntax and resource references
databricks bundle validate --target dev
# Output: "Validation OK!" or detailed errors

# Step 2: Deploy — create/update resources in the workspace
databricks bundle deploy --target dev
# Creates jobs, uploads code, sets schedules
# Idempotent: safe to run multiple times

# Step 3: Run — trigger the job immediately (optional)
databricks bundle run daily_etl --target dev
# Runs the job and shows real-time output

# Destroy — remove all deployed resources (cleanup)
databricks bundle destroy --target dev
# Deletes jobs, pipelines created by this bundle
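
If you script these steps, it helps to stop as soon as validation fails. The wrapper below is a sketch under stated assumptions: the databricks CLI is on PATH and already authenticated, and the function name validate_then_deploy is invented for this example. The run parameter exists only so the function can be exercised without the CLI.

```python
import subprocess


def validate_then_deploy(target: str, run=subprocess.run) -> bool:
    """Run `bundle validate`, then `bundle deploy`, stopping at the first failure."""
    for action in ("validate", "deploy"):
        result = run(["databricks", "bundle", action, "--target", target])
        if result.returncode != 0:
            # A non-zero exit code means the step failed; skip what follows.
            return False
    return True
```

Calling validate_then_deploy("dev") returns False without deploying when validation exits with a non-zero code. If the CLI is not installed, subprocess.run raises FileNotFoundError instead.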

What Happens During Deployment

$ databricks bundle deploy --target prod

1. Reads databricks.yml + target overrides for "prod"
2. Uploads source files (src/*.py) to workspace file system
3. Creates or updates the Databricks Job:
   - Name: "Daily Sales ETL"
   - Schedule: 6 AM daily
   - Tasks: bronze → silver → gold (with dependencies)
   - Cluster: 8-16 workers
4. Sets permissions and email notifications
5. Reports "Deployment complete!" when finished

Next deploy (same target):
  - Updates only what changed (incremental)
  - Does NOT duplicate — updates the existing job

CI/CD with DABs and GitHub Actions

GitHub Actions Workflow

# .github/workflows/deploy.yml
name: Deploy Databricks Bundle

on:
  push:
    branches: [main]        # Deploy to prod on merge to main
  pull_request:
    branches: [main]        # Validate on PR

jobs:
  validate:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - run: databricks bundle validate --target staging
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_STAGING_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_STAGING_TOKEN }}

  deploy-prod:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - run: databricks bundle deploy --target prod
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_PROD_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_PROD_TOKEN }}

Branch Strategy

feature/DE-1234  →  PR to main  →  Validate (staging)  →  Merge  →  Deploy (prod)
     |                  |                |                    |            |
  Develop          Code review      Bundle validate      Approved    bundle deploy
  locally          + approve        against staging      by team     to production
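
The routing in this branch strategy reduces to a small decision function. The sketch below mirrors it in Python for illustration; pipeline_action is a hypothetical name, and the event and ref strings follow GitHub's conventions for pull_request and push events.

```python
def pipeline_action(event_name: str, ref: str):
    """Map a GitHub event to the (bundle command, target) this strategy would run."""
    if event_name == "pull_request":
        return ("validate", "staging")
    if event_name == "push" and ref == "refs/heads/main":
        return ("deploy", "prod")
    # Pushes to feature branches trigger nothing in this strategy.
    return None


pr_action = pipeline_action("pull_request", "refs/pull/42/merge")
merge_action = pipeline_action("push", "refs/heads/main")
# pr_action is ("validate", "staging"); merge_action is ("deploy", "prod")
```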

Permissions and Access Control

# Set permissions in databricks.yml
resources:
  jobs:
    daily_etl:
      permissions:
        - user_name: "data-engineers@company.com"
          level: CAN_MANAGE
        - group_name: "analysts"
          level: CAN_VIEW
        - service_principal_name: "cicd-principal"
          level: IS_OWNER
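
A permissions list is easy to get subtly wrong, so a small check before deploying can help. The sketch below works on the list after it has been loaded into Python dicts. The set of level names is an assumption about what job permissions accept, and check_permissions is a hypothetical helper, so verify both against your workspace.

```python
PRINCIPAL_KEYS = {"user_name", "group_name", "service_principal_name"}
# Assumed set of permission levels for jobs; confirm against your workspace.
JOB_LEVELS = {"CAN_VIEW", "CAN_MANAGE_RUN", "CAN_MANAGE", "IS_OWNER"}


def check_permissions(entries: list) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for index, entry in enumerate(entries):
        principals = PRINCIPAL_KEYS.intersection(entry)
        if len(principals) != 1:
            problems.append(
                f"entry {index}: expected exactly one principal key, found {len(principals)}"
            )
        if entry.get("level") not in JOB_LEVELS:
            problems.append(f"entry {index}: unknown level {entry.get('level')!r}")
    owners = sum(1 for entry in entries if entry.get("level") == "IS_OWNER")
    if owners > 1:
        problems.append("more than one IS_OWNER entry")
    return problems


permissions = [
    {"user_name": "data-engineers@company.com", "level": "CAN_MANAGE"},
    {"group_name": "analysts", "level": "CAN_VIEW"},
    {"service_principal_name": "cicd-principal", "level": "IS_OWNER"},
]
issues = check_permissions(permissions)  # [] for the example above
```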

DABs vs Repos-Based CI/CD

| Feature | Repos + GitHub Actions (old) | DABs (modern) |
| --- | --- | --- |
| Job definition | Created via UI or REST API | Defined in YAML (version-controlled) |
| Multi-environment | Separate scripts per environment | Targets in one YAML file |
| Cluster config | UI or JSON | YAML with target overrides |
| Deployment | REST API calls or dbx tool | databricks bundle deploy |
| Rollback | Manual or REST API | git revert + redeploy |
| Learning curve | Lower (familiar UI) | Moderate (YAML + CLI) |
| Recommended by Databricks | Legacy approach | Current best practice |

Common Mistakes

  1. Not using targets for environments — deploying the same config to Dev and Prod means Prod runs with Dev-sized clusters. Always use target overrides for cluster sizing, schedules, and naming.
  2. Hardcoding credentials and environment-specific values — keep workspace hosts in the target blocks and IDs in variables. Never put tokens in the YAML file; supply them through environment variables or CI secrets.
  3. Forgetting bundle validate — always validate before deploying. A typo in YAML can silently break a job configuration.
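  4. Raising max_concurrent_runs on a pipeline that is not idempotent — the Jobs default is 1, which skips a new run while the previous one is still active. Allowing overlap can cause duplicate processing or data corruption, so set max_concurrent_runs: 1 explicitly to record the intent.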
  5. Mixing UI changes with DABs — if you modify a DABs-deployed job through the UI, the next bundle deploy overwrites your UI changes. All changes should go through YAML + Git.
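
Mistake 2 can be caught mechanically before a commit. The sketch below scans text for something that looks like a personal access token. The pattern is an assumption (tokens starting with dapi followed by 32 hexadecimal characters) and find_token_lines is a hypothetical helper, so treat a clean result as a hint rather than proof.

```python
import re

# Assumption: personal access tokens look like "dapi" + 32 hex characters.
TOKEN_PATTERN = re.compile(r"dapi[0-9a-f]{32}")


def find_token_lines(text: str) -> list:
    """Return the 1-based line numbers that contain a token-like string."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if TOKEN_PATTERN.search(line)
    ]


# A fabricated token-shaped string used purely for demonstration.
fake_token = "dapi" + "0123456789abcdef" * 2
sample = "bundle:\n  name: demo\ntoken: " + fake_token + "\n"
flagged = find_token_lines(sample)  # [3]
```

A check like this could run as a pre-commit hook or an early CI step over databricks.yml and the files under resources/.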

Interview Questions

Q: What are Databricks Asset Bundles and why should you use them?
A: DABs are a project-based approach to Databricks CI/CD. You define all resources (jobs, pipelines, clusters, permissions) in YAML files alongside your code. This keeps environments consistent, because every target is built from the same YAML and differs only through explicit overrides. It also puts infrastructure under Git version control and reduces deployment to a single command. DABs replace manual UI-based job creation and legacy tools like dbx.

Q: How do DABs handle multi-environment deployment?
A: Through targets in databricks.yml. Each target (dev, staging, prod) specifies a workspace URL and resource overrides (cluster size, schedule, naming). The command databricks bundle deploy --target prod applies the prod overrides. The code and base YAML are the same everywhere, so environments differ only where a target says they should.

Q: How would you set up CI/CD with DABs and GitHub Actions?
A: On pull request: run databricks bundle validate --target staging to check configuration. On merge to main: run databricks bundle deploy --target prod to deploy to production. Store workspace credentials as GitHub Secrets. This gives you automated validation on every PR and automated deployment on every merge.

Wrapping Up

Databricks Asset Bundles are the modern way to manage Databricks projects. Code and infrastructure live together in Git. Environments are defined as targets, not separate configurations. Deployment is a single command. Combined with GitHub Actions, you get a complete CI/CD pipeline that validates on every PR and deploys on every merge.

If you are still creating jobs through the UI and manually replicating them across environments, DABs is the upgrade you need.

Related posts: Databricks Git Integration and CI/CD; Databricks Workflows and Jobs; Unity Catalog Deep Dive


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
