Databricks Git Integration and CI/CD: Repos, Branching, Notebook Versioning, and Deploying Across Environments

In our earlier posts, we built CI/CD for ADF/Synapse with GitHub Actions and Azure DevOps. Those work because ADF stores everything as JSON files — pipelines, datasets, linked services — all serializable into Git.

But Databricks notebooks are different. They are interactive, cell-based, and often treated like scratch pads. Engineers run cells out of order, leave debug code in, and share notebooks by copy-pasting. This works for exploration but is a disaster for production.

Databricks Repos solves this by connecting your workspace directly to a Git repository. Every notebook becomes a versioned file. Changes are tracked. Branches work. Pull requests enforce code review. And CI/CD pipelines deploy tested code from Dev to UAT to Production — no manual copying.

Think of it like the difference between writing a book in a Word document emailed back and forth (copy-paste notebooks) versus writing in Google Docs with version history, comments, and approval workflows (Git-integrated Repos). Both produce a book. One produces chaos. The other produces a reliable, auditable process.

Why Git Integration Matters for Databricks
The Real-World Workflow (Dev → UAT → Prod)
Databricks Repos: What It Is and How It Works
Step 1: Connect Databricks to GitHub
Step 2: Clone a Repository into Databricks
Step 3: Create a Branch and Make Changes
Step 4: Commit, Push, and Create a Pull Request
Step 5: Code Review and Merge
Folder Structure for a Production Databricks Project
Environment Promotion: Dev → UAT → Prod
Option A: Separate Workspaces per Environment
Option B: Folder-Based Environments (Single Workspace)
CI/CD with GitHub Actions for Databricks
The Workflow File Explained
Deploying Notebooks with the Databricks CLI
Deploying Notebooks with the Databricks REST API
CI/CD with Azure DevOps for Databricks
Running Tests Automatically on Pull Request
Parameterizing Notebooks for Environments
Databricks Asset Bundles (DABs) — The Modern Approach
Secrets Management Across Environments
Comparing ADF CI/CD vs Databricks CI/CD
Common Mistakes
Interview Questions
Wrapping Up

Why Git Integration Matters for Databricks

Without Git (The Chaos)

Developer A edits Notebook_ETL in the shared workspace
Developer B edits the SAME notebook at the same time
Developer A saves → Developer B saves → A's changes are GONE
Nobody knows what changed, when, or why
The "production" notebook has debug print() statements from last Tuesday
Rolling back means "does anyone remember what it looked like before?"

With Git (The Order)

Developer A creates branch feature/add-scd-type2
Developer B creates branch feature/fix-null-handling
Both work independently — no conflicts
Both create Pull Requests with code review
Reviewer catches a bug in A's code before it reaches production
Changes are merged → CI/CD deploys tested code to production
Rolling back = git revert (one command, full audit trail)

Real-life analogy: Without Git, your team writes a report by passing a USB drive around the office — one person at a time, no track changes, hope nobody overwrites your section. With Git, everyone edits their own copy in Google Docs, changes are tracked, conflicts are highlighted, and a manager approves the final version.

The Real-World Workflow (Dev → UAT → Prod)

┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│  DEV Workspace   │     │  UAT Workspace   │     │  PROD Workspace  │
│                  │     │                  │     │                  │
│  Engineer edits  │────>│  Testers verify  │────>│  Scheduled jobs  │
│  notebooks in    │ PR  │  with UAT data   │ PR  │  run against     │
│  feature branch  │merge│                  │merge│  production data │
│                  │     │  (reads main)    │     │  (reads release) │
│  Connected to    │     │  Connected to    │     │  Connected to    │
│  GitHub (develop)│     │  GitHub (main)   │     │  GitHub (release)│
└──────────────────┘     └──────────────────┘     └──────────────────┘
        |                        |                        |
        v                        v                        v
   GitHub Repository (Single Source of Truth)
   ├── develop branch (Dev workspace reads this)
   ├── main branch (UAT workspace reads this)
   └── release branch (Prod workspace reads this)

The golden rule: Code flows one direction — Dev → UAT → Prod. Nobody edits UAT or Prod directly. All changes go through Git.
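The branch-to-environment mapping in the diagram can be written down as a small lookup, which is a handy guard in deployment scripts. The sketch below is illustrative only: `BRANCH_TO_ENV` and `target_for_branch` are hypothetical names, not part of any Databricks or GitHub API.

```python
from typing import Optional

# Hypothetical lookup that mirrors the develop -> main -> release flow above.
BRANCH_TO_ENV = {
    "develop": "dev",
    "main": "uat",
    "release": "prod",
}


def target_for_branch(ref: str) -> Optional[str]:
    """Return the environment a branch deploys to, or None for any other branch."""
    prefix = "refs/heads/"
    branch = ref[len(prefix):] if ref.startswith(prefix) else ref
    return BRANCH_TO_ENV.get(branch)


print(target_for_branch("refs/heads/main"))        # uat
print(target_for_branch("feature/add-scd-type2"))  # None
```

Feature branches map to no environment, so a deploy script that calls this helper can refuse to run for them.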

Databricks Repos: What It Is and How It Works

Databricks Repos is a built-in Git client inside the Databricks workspace. It lets you:

Clone a GitHub/Azure DevOps/GitLab repository into your workspace
Create branches, commit changes, push, and pull — all from the Databricks UI
Run notebooks directly from the cloned repo (not copied to workspace)
Work on several branches in parallel by cloning the same repository more than once (a single clone has one branch checked out at a time)

GitHub Repository                    Databricks Workspace
├── notebooks/                       Repos/
│   ├── Bronze/                        └── my-project/
│   │   └── Ingest_Customers.py            ├── notebooks/
│   ├── Silver/                            │   ├── Bronze/
│   │   └── Transform_Customers.py         │   │   └── Ingest_Customers
│   ├── Gold/                              │   ├── Silver/
│   │   └── SCD2_Dim_Customer.py           │   │   └── Transform_Customers
│   └── Config/                            │   ├── Gold/
│       └── Storage_Config.py              │   │   └── SCD2_Dim_Customer
├── tests/                                 │   └── Config/
│   └── test_transforms.py                 │       └── Storage_Config
├── .github/                               └── (synced via Git)
│   └── workflows/
│       └── deploy.yml
└── README.md

Step 1: Connect Databricks to GitHub

Generate a GitHub Personal Access Token

Go to GitHub → Settings → Developer settings → Personal access tokens → Tokens (classic)
Click Generate new token (classic)
Name: databricks-git-integration
Scopes: check repo (full control of private repositories)
Click Generate token → copy the token (you will not see it again)

Configure Git in Databricks

In Databricks, click your user icon (top right) → Settings
Click Linked accounts (under User section)
Git provider: GitHub
Git provider username: your GitHub username
Token: paste the personal access token
Click Save

Step 2: Clone a Repository into Databricks

Create the GitHub Repository First

# On your local machine
mkdir databricks-etl-project
cd databricks-etl-project
git init
mkdir -p notebooks/Bronze notebooks/Silver notebooks/Gold notebooks/Config tests

# Create a placeholder
echo "# Databricks ETL Project" > README.md
git add .
git commit -m "Initial project structure"
git branch -M main
git remote add origin https://github.com/YOUR_USERNAME/databricks-etl-project.git
git push -u origin main

Clone into Databricks

In Databricks, click Workspace in the sidebar
Click Repos (under your username)
Click Add Repo
Paste the repository URL: https://github.com/YOUR_USERNAME/databricks-etl-project.git
Click Create Repo

The repository is now cloned. You can see all files and notebooks inside Databricks.
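The same clone can also be scripted, which helps when a CI job has to set up the repo in a fresh workspace. The sketch below only builds the request for the Repos REST API (`POST /api/2.0/repos`) and does not send it. The helper name `build_clone_request` and the target folder are assumptions for illustration, so check the field names against the API reference for your workspace.

```python
import json


def build_clone_request(host: str, repo_url: str, workspace_path: str) -> dict:
    """Describe the HTTP call that would clone repo_url into workspace_path."""
    return {
        "method": "POST",
        "url": f"{host.rstrip('/')}/api/2.0/repos",
        "json": {
            "url": repo_url,
            "provider": "gitHub",
            "path": workspace_path,
        },
    }


request = build_clone_request(
    "https://adb-XXXXXXXXXXXX.azuredatabricks.net",
    "https://github.com/YOUR_USERNAME/databricks-etl-project.git",
    "/Repos/YOUR_USERNAME/databricks-etl-project",  # assumed target folder
)
print(json.dumps(request, indent=2))
```

Passing this dictionary to `requests.request(**request, headers=...)` with a bearer token would perform the clone. Keeping the builder separate from the network call makes it easy to test.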

Step 3: Create a Branch and Make Changes

Create a Feature Branch

In the Repos section, click on your repo
Click the branch dropdown (shows main) at the top
Click Create branch
Name: feature/add-bronze-ingest
Click Create

You are now on the feature branch. Any changes here do NOT affect main.

Edit a Notebook

Navigate to notebooks/Bronze/
Create a new notebook: Ingest_Customers
Write your code:

# Ingest_Customers notebook

# Cell 1: load shared config (%run must be the only command in its cell)
%run ../Config/Storage_Config

# Cell 2: ingest from SQL to Bronze
from pyspark.sql.functions import current_date

# sql_url, sql_properties, and BASE_PATH are expected to come from Storage_Config
df = spark.read.jdbc(url=sql_url, table="SalesLT.Customer", properties=sql_properties)

(
    df.withColumn("load_date", current_date())
    .write.format("delta")
    .mode("overwrite")
    .save(f"{BASE_PATH}/bronze/customers/")
)

print(f"Ingested {df.count()} customers to Bronze")

Step 4: Commit, Push, and Create a Pull Request

Commit from Databricks UI

Click the branch name at the top of the Repos panel
You will see a list of changed files
Enter a commit message: feat: add bronze customer ingest notebook
Click Commit & Push

Create a Pull Request on GitHub

Go to your GitHub repository
You will see a banner: “feature/add-bronze-ingest had recent pushes — Compare & pull request”
Click Compare & pull request
Add a description explaining what the notebook does
Assign a reviewer (or self-review for personal projects)
Click Create pull request

Step 5: Code Review and Merge

The reviewer can:

See every line of code changed (diff view)
Leave comments on specific lines
Request changes or approve
Once approved, click Merge pull request

After merge, the code is in main and ready for deployment.

Real-life analogy: A Pull Request is like submitting a building permit. You design the renovation (feature branch). You submit the plans for approval (PR). The inspector reviews (code review). If it passes, the permit is granted (merge). If not, you revise and resubmit. Nobody starts construction without an approved permit.

Folder Structure for a Production Databricks Project

databricks-etl-project/
├── notebooks/
│   ├── Config/
│   │   ├── Storage_Config.py          # ADLS + SQL connection setup
│   │   └── Environment_Config.py      # Environment-specific settings
│   ├── Bronze/
│   │   ├── Ingest_Customers.py
│   │   ├── Ingest_Products.py
│   │   └── Ingest_Orders.py
│   ├── Silver/
│   │   ├── Transform_Customers.py
│   │   ├── Transform_Products.py
│   │   └── Data_Quality_Checks.py
│   ├── Gold/
│   │   ├── SCD2_Dim_Customer.py
│   │   ├── Build_Fact_Orders.py
│   │   └── Agg_Daily_Revenue.py
│   └── Maintenance/
│       └── Optimize_Vacuum.py
├── tests/
│   ├── test_transforms.py
│   ├── test_data_quality.py
│   └── conftest.py
├── job_configs/
│   └── daily_etl.json                  # Job definition referenced by deploy.yml
├── .github/
│   └── workflows/
│       ├── ci.yml                      # Run tests on PR
│       └── deploy.yml                  # Deploy to UAT/Prod
├── .gitignore
├── requirements.txt
└── README.md

.gitignore for Databricks

# Databricks
.databricks/
*.pyc
__pycache__/
.ipynb_checkpoints/

# Environment
.env
*.egg-info/

# IDE
.vscode/
.idea/

Environment Promotion: Dev → UAT → Prod

Option A: Separate Workspaces per Environment (Recommended)

Dev Workspace  → Repos synced to 'develop' branch
UAT Workspace  → Repos synced to 'main' branch
Prod Workspace → Notebooks deployed via CI/CD (not manually synced)

Each workspace has its own cluster configs, secret scopes, and storage connections.

Pros: Complete isolation, different permissions per environment, impossible to accidentally run dev code against prod data.

Cons: Higher cost (3 workspaces), more infrastructure to manage.

Option B: Folder-Based Environments (Single Workspace)

Repos/
├── dev/databricks-etl-project/    → develop branch
├── uat/databricks-etl-project/    → main branch
└── prod/databricks-etl-project/   → release branch

Notebooks use a config parameter to determine which storage account and Key Vault to connect to.

Pros: Lower cost, simpler management.

Cons: Risk of running dev code against prod data if misconfigured. Less isolation.
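In a single workspace the folder name is the only thing separating environments, so one way to reduce the misconfiguration risk is to derive the environment from the path instead of typing it in. The sketch below assumes the Option B layout (`/Repos/<env>/<repo>/...`). The function name `env_from_repo_path` is hypothetical, and how the notebook obtains its own path is left out.

```python
from typing import Optional

KNOWN_ENVS = {"dev", "uat", "prod"}


def env_from_repo_path(notebook_path: str) -> Optional[str]:
    """Extract the environment folder from a path like /Repos/<env>/<repo>/..."""
    parts = [p for p in notebook_path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "Repos" and parts[1] in KNOWN_ENVS:
        return parts[1]
    return None


path = "/Repos/uat/databricks-etl-project/notebooks/Bronze/Ingest_Customers"
print(env_from_repo_path(path))  # uat
```

Returning `None` for unrecognized paths lets the caller stop early rather than fall back to a default environment.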

CI/CD with GitHub Actions for Databricks

The CI Workflow (Run Tests on Pull Request)

# .github/workflows/ci.yml
name: Databricks CI

on:
  pull_request:
    branches: [main, develop]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install pyspark pytest delta-spark

      - name: Run unit tests
        run: |
          # python -m pytest adds the repo root to sys.path so tests can import notebooks.*
          python -m pytest tests/ -v --tb=short

      - name: Lint notebooks
        run: |
          pip install flake8
          flake8 notebooks/ --max-line-length=120 --ignore=E501,W503

The Deploy Workflow (Deploy to UAT/Prod)

# .github/workflows/deploy.yml
name: Deploy to Databricks

on:
  push:
    branches:
      - main        # Deploy to UAT on merge to main
      - release      # Deploy to Prod on merge to release

env:
  DATABRICKS_HOST_UAT: https://adb-uat-workspace.azuredatabricks.net
  DATABRICKS_HOST_PROD: https://adb-prod-workspace.azuredatabricks.net

jobs:
  deploy-uat:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: UAT
    steps:
      - uses: actions/checkout@v4

      - name: Install Databricks CLI
        run: pip install databricks-cli

      - name: Configure Databricks CLI
        run: |
          cat > ~/.databrickscfg << EOF
          [DEFAULT]
          host = ${{ env.DATABRICKS_HOST_UAT }}
          token = ${{ secrets.DATABRICKS_TOKEN_UAT }}
          EOF

      - name: Deploy notebooks to UAT
        run: |
          databricks workspace import_dir notebooks/ /Repos/production/notebooks --overwrite

      - name: Deploy workflow jobs
        run: |
          databricks jobs reset --job-id ${{ secrets.UAT_JOB_ID }} --json-file job_configs/daily_etl.json

  deploy-prod:
    if: github.ref == 'refs/heads/release'
    runs-on: ubuntu-latest
    environment: Production
    steps:
      - uses: actions/checkout@v4

      - name: Install Databricks CLI
        run: pip install databricks-cli

      - name: Configure Databricks CLI
        run: |
          cat > ~/.databrickscfg << EOF
          [DEFAULT]
          host = ${{ env.DATABRICKS_HOST_PROD }}
          token = ${{ secrets.DATABRICKS_TOKEN_PROD }}
          EOF

      - name: Deploy notebooks to Production
        run: |
          databricks workspace import_dir notebooks/ /Repos/production/notebooks --overwrite

      - name: Trigger smoke test job
        run: |
          databricks jobs run-now --job-id ${{ secrets.PROD_SMOKE_TEST_JOB_ID }}

GitHub Secrets to Configure

Secret Name	What It Is
`DATABRICKS_TOKEN_UAT`	Personal access token for UAT workspace
`DATABRICKS_TOKEN_PROD`	Personal access token for Prod workspace
`UAT_JOB_ID`	Workflow job ID in UAT workspace
`PROD_SMOKE_TEST_JOB_ID`	Smoke test job ID in Prod workspace
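A deploy that starts with a missing secret usually fails halfway through, so a pre-flight check can be worthwhile. The sketch below assumes the secrets from the table are exposed to the step as environment variables with the same names, which the workflow above does not do by default. `REQUIRED_SETTINGS` and `missing_settings` are hypothetical names.

```python
import os

# Secret names from the table above, grouped by deployment target.
REQUIRED_SETTINGS = {
    "uat": ["DATABRICKS_TOKEN_UAT", "UAT_JOB_ID"],
    "prod": ["DATABRICKS_TOKEN_PROD", "PROD_SMOKE_TEST_JOB_ID"],
}


def missing_settings(target, environ=None):
    """Return the required setting names that are unset or empty for a target."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_SETTINGS[target] if not environ.get(name)]


# Example with a fake environment in which the job ID is missing
fake_env = {"DATABRICKS_TOKEN_UAT": "placeholder"}
print(missing_settings("uat", fake_env))  # ['UAT_JOB_ID']
```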

Deploying Notebooks with the Databricks CLI

Install and Configure

# Installs the legacy Databricks CLI (the Python package), which provides the commands below
pip install databricks-cli

# Configure with your workspace
databricks configure --token
# Host: https://adb-XXXXXXXXXXXX.azuredatabricks.net
# Token: your personal access token

Common CLI Commands

# List workspace contents
databricks workspace ls /Repos/

# Export a notebook (-o overwrites the local file if it already exists)
databricks workspace export -o /Repos/my-project/notebooks/Bronze/Ingest_Customers ./local_copy.py

# Import a notebook
databricks workspace import ./local_copy.py /Repos/production/notebooks/Bronze/Ingest_Customers --overwrite --language PYTHON

# Import entire directory
databricks workspace import_dir ./notebooks/ /Repos/production/notebooks/ --overwrite

# List jobs
databricks jobs list

# Trigger a job
databricks jobs run-now --job-id 12345

Deploying Notebooks with the Databricks REST API

For CI/CD pipelines that cannot use the CLI:

import base64
import os
from pathlib import Path

import requests

# Read credentials from the environment instead of hardcoding them
DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]  # e.g. https://adb-XXXXXXXXXXXX.azuredatabricks.net
TOKEN = os.environ["DATABRICKS_TOKEN"]
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def deploy_notebook(local_path, remote_path):
    content = Path(local_path).read_bytes()

    payload = {
        "path": remote_path,
        "format": "SOURCE",
        "language": "PYTHON",
        "overwrite": True,
        "content": base64.b64encode(content).decode(),
    }

    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/workspace/import",
        headers=HEADERS,
        json=payload,
        timeout=60,
    )

    if resp.status_code == 200:
        print(f"Deployed: {remote_path}")
    else:
        # Fail the pipeline instead of printing and carrying on
        raise RuntimeError(f"FAILED {remote_path}: {resp.text}")

# Deploy all notebooks
for local in sorted(Path("notebooks").rglob("*.py")):
    # Drop the .py suffix so the notebook is not named "Ingest_Customers.py",
    # and use forward slashes regardless of the operating system
    remote = f"/Repos/production/{local.with_suffix('').as_posix()}"
    deploy_notebook(local, remote)

CI/CD with Azure DevOps for Databricks

# azure-pipelines.yml
trigger:
  branches:
    include:
      - main

pool:
  vmImage: 'ubuntu-latest'

stages:
  - stage: Deploy_UAT
    displayName: 'Deploy to UAT'
    jobs:
      - deployment: DeployUAT
        environment: 'UAT'
        strategy:
          runOnce:
            deploy:
              steps:
                - checkout: self

                - task: UsePythonVersion@0
                  inputs:
                    versionSpec: '3.11'

                - script: pip install databricks-cli
                  displayName: 'Install Databricks CLI'

                - script: |
                    echo "[DEFAULT]" > ~/.databrickscfg
                    echo "host = $(DATABRICKS_HOST_UAT)" >> ~/.databrickscfg
                    echo "token = $(DATABRICKS_TOKEN_UAT)" >> ~/.databrickscfg
                  displayName: 'Configure CLI'

                - script: |
                    databricks workspace import_dir notebooks/ /Repos/production/notebooks --overwrite
                  displayName: 'Deploy Notebooks'

Running Tests Automatically on Pull Request

Writing Testable Notebooks

The trick is to write transformation FUNCTIONS in separate Python files, not inline in notebooks:

notebooks/Silver/transforms.py (importable functions):

from pyspark.sql.functions import col, initcap, lower, trim

def clean_customer_data(df):
    """Normalize names and emails, fill missing locations, drop duplicate customers."""
    return (
        df.withColumn("name", initcap(trim(col("name"))))
        .withColumn("email", lower(trim(col("email"))))
        .fillna({"city": "Unknown", "country": "Unknown"})
        .dropDuplicates(["customer_id"])
    )

def validate_emails(df):
    # Raw string so the backslash in \. reaches the regex engine unchanged
    email_regex = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
    return df.filter(col("email").rlike(email_regex))

tests/test_transforms.py (unit tests):

import pytest
from pyspark.sql import SparkSession
from notebooks.Silver.transforms import clean_customer_data, validate_emails

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[*]").appName("tests").getOrCreate()

def test_clean_trims_names(spark):
    data = [(1, "  naveen  ", "test@email.com", "Toronto", "Canada")]
    df = spark.createDataFrame(data, ["customer_id", "name", "email", "city", "country"])
    result = clean_customer_data(df)
    assert result.collect()[0]["name"] == "Naveen"

def test_clean_fills_nulls(spark):
    data = [(1, "Naveen", "test@email.com", None, None)]
    # Explicit schema: Spark cannot infer a type for columns that contain only None
    schema = "customer_id INT, name STRING, email STRING, city STRING, country STRING"
    df = spark.createDataFrame(data, schema)
    result = clean_customer_data(df)
    row = result.collect()[0]
    assert row["city"] == "Unknown"
    assert row["country"] == "Unknown"

def test_validate_emails_filters_invalid(spark):
    data = [(1, "good@email.com"), (2, "bad-email"), (3, "also@good.org")]
    df = spark.createDataFrame(data, ["id", "email"])
    result = validate_emails(df)
    assert result.count() == 2
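Before paying the cost of starting Spark, it can help to check the email pattern itself in plain Python. The sketch below runs the same regular expression through the standard `re` module, on the assumption that Python and Spark's Java regex engine treat this particular pattern the same way. The sample addresses are made up.

```python
import re

# Same pattern as validate_emails, written as a raw string
EMAIL_REGEX = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"

samples = ["good@email.com", "bad-email", "also@good.org", "no-tld@host"]
valid = [s for s in samples if re.match(EMAIL_REGEX, s)]

print(valid)  # ['good@email.com', 'also@good.org']
```

This runs in milliseconds, so edge cases for the pattern can live in a fast test while the Spark tests cover the DataFrame plumbing.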

Parameterizing Notebooks for Environments

# Environment_Config.py
# Use dbutils widgets to determine environment at runtime
dbutils.widgets.text("environment", "dev", "Environment")
env = dbutils.widgets.get("environment")

config = {
    "dev": {
        "storage_account": "devstorageaccount",
        "key_vault_scope": "dev-scope",
        "sql_server": "dev-sql.database.windows.net",
    },
    "uat": {
        "storage_account": "uatstorageaccount",
        "key_vault_scope": "uat-scope",
        "sql_server": "uat-sql.database.windows.net",
    },
    "prod": {
        "storage_account": "prodstorageaccount",
        "key_vault_scope": "prod-scope",
        "sql_server": "prod-sql.database.windows.net",
    }
}

# Use environment-specific config
STORAGE_ACCOUNT = config[env]["storage_account"]
SCOPE = config[env]["key_vault_scope"]
SQL_SERVER = config[env]["sql_server"]

print(f"Environment: {env} | Storage: {STORAGE_ACCOUNT} | SQL: {SQL_SERVER}")

When triggering a Workflow job, pass the environment parameter:

databricks jobs run-now --job-id 12345 --notebook-params '{"environment": "prod"}'
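One gap in a plain dictionary lookup is that a typo such as "prd" surfaces as a bare KeyError. The sketch below shows one possible guard, using a trimmed copy of the config dictionary from above. `resolve_config` is a hypothetical helper, and the widget call is left out so the snippet runs outside Databricks.

```python
CONFIG = {
    "dev": {"storage_account": "devstorageaccount", "key_vault_scope": "dev-scope"},
    "uat": {"storage_account": "uatstorageaccount", "key_vault_scope": "uat-scope"},
    "prod": {"storage_account": "prodstorageaccount", "key_vault_scope": "prod-scope"},
}


def resolve_config(env: str) -> dict:
    """Return the settings for env, failing loudly on unknown or mistyped values."""
    key = env.strip().lower()
    if key not in CONFIG:
        raise ValueError(f"Unknown environment {env!r}; expected one of {sorted(CONFIG)}")
    return CONFIG[key]


print(resolve_config("PROD")["storage_account"])  # prodstorageaccount
```

In a notebook, the argument would be the value returned by `dbutils.widgets.get("environment")`.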

Databricks Asset Bundles (DABs) — The Modern Approach

Databricks Asset Bundles (DABs) is the newest approach to deploying Databricks projects. It bundles notebooks, jobs, libraries, and configurations into a single deployable unit:

# databricks.yml
bundle:
  name: etl-pipeline

workspace:
  host: https://adb-XXXXXXXXXXXX.azuredatabricks.net

resources:
  jobs:
    daily_etl:
      name: "Daily ETL Pipeline"
      tasks:
        - task_key: ingest_bronze
          notebook_task:
            notebook_path: ./notebooks/Bronze/Ingest_Customers.py
        - task_key: transform_silver
          depends_on:
            - task_key: ingest_bronze
          notebook_task:
            notebook_path: ./notebooks/Silver/Transform_Customers.py

targets:
  dev:
    workspace:
      host: https://adb-dev.azuredatabricks.net
  prod:
    workspace:
      host: https://adb-prod.azuredatabricks.net

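Deploy with the newer Databricks CLI (versions 0.205 and above). The legacy pip-installed databricks-cli used earlier in this post has no bundle command: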

databricks bundle deploy --target prod

DABs is the direction Databricks is heading for CI/CD — it replaces manual CLI deployment scripts with a declarative config file.
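Because the job graph is declared as data, simple mistakes such as a `depends_on` that names a task which does not exist can be caught before deployment. The sketch below checks a Python dictionary shaped like the `daily_etl` job above. `undefined_dependencies` is a hypothetical helper, and in practice `databricks bundle validate` is the authoritative check.

```python
def undefined_dependencies(job: dict) -> list:
    """Return (task, missing_dependency) pairs for depends_on entries with no matching task."""
    task_keys = {task["task_key"] for task in job["tasks"]}
    problems = []
    for task in job["tasks"]:
        for dep in task.get("depends_on", []):
            if dep["task_key"] not in task_keys:
                problems.append((task["task_key"], dep["task_key"]))
    return problems


daily_etl = {
    "name": "Daily ETL Pipeline",
    "tasks": [
        {"task_key": "ingest_bronze"},
        {"task_key": "transform_silver", "depends_on": [{"task_key": "ingest_bronze"}]},
    ],
}
print(undefined_dependencies(daily_etl))  # []
```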

Secrets Management Across Environments

Environment	Key Vault	Secret Scope	Who Has Access
Dev	`dev-keyvault`	`dev-scope`	All engineers
UAT	`uat-keyvault`	`uat-scope`	Senior engineers + testers
Prod	`prod-keyvault`	`prod-scope`	Service principals only (no human access)

The SAME notebook code runs in all environments — only the secret scope name changes, controlled by the environment parameter.
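A minimal sketch of that idea follows, assuming scope names follow the `<env>-scope` pattern from the table. The secret getter is passed in as an argument so the snippet runs outside Databricks. In a notebook it would be `dbutils.secrets.get`, and the key name `sql-password` is made up for illustration.

```python
def scope_for(env: str) -> str:
    """Build the secret scope name for an environment, e.g. 'uat' -> 'uat-scope'."""
    return f"{env}-scope"


def read_secret(env: str, key: str, secret_getter):
    """Fetch a secret from the scope that belongs to the given environment."""
    return secret_getter(scope=scope_for(env), key=key)


# Stand-in for dbutils.secrets.get so the example runs anywhere
def fake_getter(scope, key):
    return f"<{scope}/{key}>"


print(read_secret("uat", "sql-password", fake_getter))  # <uat-scope/sql-password>
```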

Comparing ADF CI/CD vs Databricks CI/CD

Aspect	ADF/Synapse CI/CD	Databricks CI/CD
What is versioned	JSON (pipelines, datasets, linked services)	Python notebooks, SQL files, configs
Git integration	Built-in (ADF Studio → Git)	Databricks Repos
Deployment artifact	ARM templates (auto-generated)	Notebooks + job configs
Deployment method	ARM template deployment	Databricks CLI / REST API / DABs
Parameterization	ARM parameter files per environment	`dbutils.widgets` + environment config
Testing	Limited (no built-in test framework)	`pytest` with local PySpark
Branch strategy	`adf_publish` / `workspace_publish` branch	Standard Git flow (develop → main → release)

Common Mistakes

Editing notebooks in Prod workspace directly — violates the one-way flow. All changes must go through Git. Prod should be read-only for humans.
Hardcoding environment-specific values — storage account names, server URLs, and Key Vault scopes must come from parameters, not hardcoded strings.
Not writing testable code — inline notebook code cannot be unit tested. Extract transformation functions into importable Python files.
Committing secrets to Git — never commit tokens, passwords, or connection strings. Use .gitignore for .env files and use Key Vault for secrets.
No branch protection — without requiring PR reviews, anyone can push broken code to main. Enable branch protection rules on GitHub.
Forgetting to sync Repos before editing — if someone else merged changes, your local Repos copy is stale. Always pull before starting new work.
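The secrets mistake in particular lends itself to an automated check. The sketch below scans text for strings that look like tokens. The two patterns (a `dapi` prefix followed by 32 hex characters, and a `ghp_` prefix followed by 36 alphanumerics) are assumptions about common token shapes, not an official specification, and `find_suspected_tokens` is a hypothetical helper. A dedicated secret scanner will be more thorough.

```python
import re

# Assumed token shapes; adjust to the credentials your team actually uses.
TOKEN_PATTERNS = {
    "databricks_pat": re.compile(r"dapi[0-9a-f]{32}"),
    "github_classic_pat": re.compile(r"ghp_[A-Za-z0-9]{36}"),
}


def find_suspected_tokens(text: str) -> list:
    """Return (line_number, pattern_name) for every line that looks like it holds a token."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in TOKEN_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits


sample = 'HOST = "https://example"\nTOKEN = "dapi' + "0" * 32 + '"\n'
print(find_suspected_tokens(sample))  # [(2, 'databricks_pat')]
```

Running a check like this in the CI workflow on every pull request catches the problem before the token reaches main.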

Interview Questions

Q: How does CI/CD work for Databricks? A: Notebooks are stored in a Git repository via Databricks Repos. Engineers develop in feature branches, create pull requests with code review, and merge to main. CI runs tests automatically on PR. CD deploys notebooks to UAT/Prod workspaces using the Databricks CLI, REST API, or Asset Bundles. Each environment has its own workspace, secret scope, and storage connections.

Q: What is Databricks Repos? A: A built-in Git client in Databricks that lets you clone repositories, create branches, commit changes, and push — all from the workspace UI. Notebooks inside Repos are synced with Git, enabling version control, code review, and CI/CD.

Q: How do you handle environment differences (dev/uat/prod) in Databricks? A: Use parameterized config notebooks with dbutils.widgets. A single environment parameter drives which storage account, Key Vault scope, and SQL server to use. The same notebook code runs in all environments — only the config values change.

Q: What are Databricks Asset Bundles? A: DABs is the modern approach to Databricks CI/CD. A databricks.yml file declares notebooks, jobs, clusters, and environment targets in one config. Deploy with databricks bundle deploy --target prod. It replaces manual CLI scripts with a declarative, repeatable deployment process.

Q: How do you test Databricks notebooks in CI? A: Extract transformation logic into importable Python functions. Write pytest unit tests that create a local SparkSession and test those functions with sample data. Run pytest in the CI pipeline (GitHub Actions or Azure DevOps) on every pull request.

Wrapping Up

Databricks CI/CD is fundamentally about treating notebooks as production code — versioned, reviewed, tested, and deployed through an automated pipeline. Repos provides the Git integration. GitHub Actions or Azure DevOps provides the automation. Environment configs and secret scopes handle environment differences. And Databricks Asset Bundles are the future of declarative deployment.

The pattern is the same as ADF CI/CD — just different tools. Code flows one direction: Dev → UAT → Prod. Nobody edits production directly. Everything goes through Git.

Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.