Databricks Git Integration and CI/CD: Repos, Branching, Notebook Versioning, and Deploying Across Environments
In our earlier posts, we built CI/CD for ADF/Synapse with GitHub Actions and Azure DevOps. Those work because ADF stores everything as JSON files — pipelines, datasets, linked services — all serializable into Git.
But Databricks notebooks are different. They are interactive, cell-based, and often treated like scratch pads. Engineers run cells out of order, leave debug code in, and share notebooks by copy-pasting. This works for exploration but is a disaster for production.
Databricks Repos solves this by connecting your workspace directly to a Git repository. Every notebook becomes a versioned file. Changes are tracked. Branches work. Pull requests enforce code review. And CI/CD pipelines deploy tested code from Dev to UAT to Production — no manual copying.
Think of it like the difference between writing a book in a Word document emailed back and forth (copy-paste notebooks) versus writing in Google Docs with version history, comments, and approval workflows (Git-integrated Repos). Both produce a book. One produces chaos. The other produces a reliable, auditable process.
Table of Contents
- Why Git Integration Matters for Databricks
- The Real-World Workflow (Dev → UAT → Prod)
- Databricks Repos: What It Is and How It Works
- Step 1: Connect Databricks to GitHub
- Step 2: Clone a Repository into Databricks
- Step 3: Create a Branch and Make Changes
- Step 4: Commit, Push, and Create a Pull Request
- Step 5: Code Review and Merge
- Folder Structure for a Production Databricks Project
- Environment Promotion: Dev → UAT → Prod
- Option A: Separate Workspaces per Environment
- Option B: Folder-Based Environments (Single Workspace)
- CI/CD with GitHub Actions for Databricks
- The Workflow File Explained
- Deploying Notebooks with the Databricks CLI
- Deploying Notebooks with the Databricks REST API
- CI/CD with Azure DevOps for Databricks
- Running Tests Automatically on Pull Request
- Parameterizing Notebooks for Environments
- Databricks Asset Bundles (DABs) — The Modern Approach
- Secrets Management Across Environments
- Comparing ADF CI/CD vs Databricks CI/CD
- Common Mistakes
- Interview Questions
- Wrapping Up
Why Git Integration Matters for Databricks
Without Git (The Chaos)
Developer A edits Notebook_ETL in the shared workspace
Developer B edits the SAME notebook at the same time
Developer A saves → Developer B saves → A's changes are GONE
Nobody knows what changed, when, or why
The "production" notebook has debug print() statements from last Tuesday
Rolling back means "does anyone remember what it looked like before?"
With Git (The Order)
Developer A creates branch feature/add-scd-type2
Developer B creates branch feature/fix-null-handling
Both work independently — no conflicts
Both create Pull Requests with code review
Reviewer catches a bug in A's code before it reaches production
Changes are merged → CI/CD deploys tested code to production
Rolling back = git revert (one command, full audit trail)
Real-life analogy: Without Git, your team writes a report by passing a USB drive around the office — one person at a time, no track changes, hope nobody overwrites your section. With Git, everyone edits their own copy in Google Docs, changes are tracked, conflicts are highlighted, and a manager approves the final version.
The Real-World Workflow (Dev → UAT → Prod)
┌─────────────────────┐     ┌─────────────────────┐     ┌─────────────────────┐
│ DEV Workspace       │     │ UAT Workspace       │     │ PROD Workspace      │
│                     │     │                     │     │                     │
│ Engineer edits      │────>│ Testers verify      │────>│ Scheduled jobs      │
│ notebooks in        │ PR  │ with UAT data       │ PR  │ run against         │
│ feature branch      │merge│                     │merge│ production data     │
│                     │     │ (read from main)    │     │ (read from release) │
│ Connected to        │     │ Connected to        │     │ Connected to        │
│ GitHub (develop)    │     │ GitHub (main)       │     │ GitHub (release)    │
└─────────────────────┘     └─────────────────────┘     └─────────────────────┘
           |                           |                           |
           v                           v                           v
GitHub Repository (Single Source of Truth)
├── develop branch (Dev workspace reads this)
├── main branch (UAT workspace reads this)
└── release branch (Prod workspace reads this)
The golden rule: Code flows one direction — Dev → UAT → Prod. Nobody edits UAT or Prod directly. All changes go through Git.
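The one-way flow can be made explicit in a small lookup that a deploy script consults before it touches a workspace. This is a minimal sketch, assuming the branch names from the diagram above; `BRANCH_TO_ENV` and `environment_for_ref` are hypothetical names, not part of any Databricks or GitHub tooling.

```python
# Hypothetical helper: map a Git ref to the one environment it may deploy to.
# Branch names follow the diagram above (develop -> Dev, main -> UAT, release -> Prod).
BRANCH_TO_ENV = {
    "develop": "dev",
    "main": "uat",
    "release": "prod",
}


def environment_for_ref(git_ref: str) -> str:
    """Return the target environment for a ref such as 'refs/heads/main'."""
    prefix = "refs/heads/"
    branch = git_ref[len(prefix):] if git_ref.startswith(prefix) else git_ref
    if branch not in BRANCH_TO_ENV:
        # Feature branches and anything unmapped never deploy anywhere.
        raise ValueError(f"Branch '{branch}' is not mapped to an environment")
    return BRANCH_TO_ENV[branch]


if __name__ == "__main__":
    print(environment_for_ref("refs/heads/main"))
```

Failing loudly on an unmapped branch is a deliberate choice here: a feature branch should never reach a shared environment by accident.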
Databricks Repos: What It Is and How It Works
Databricks Repos is a built-in Git client inside the Databricks workspace. It lets you:
- Clone a GitHub/Azure DevOps/GitLab repository into your workspace
- Create branches, commit changes, push, and pull — all from the Databricks UI
- Run notebooks directly from the cloned repo (not copied to workspace)
- Work on several branches at once by cloning the repository more than once (each clone has exactly one branch checked out at a time)
GitHub Repository                      Databricks Workspace
├── notebooks/                         Repos/
│   ├── Bronze/                        └── my-project/
│   │   └── Ingest_Customers.py            ├── notebooks/
│   ├── Silver/                            │   ├── Bronze/
│   │   └── Transform_Customers.py         │   │   └── Ingest_Customers
│   ├── Gold/                              │   ├── Silver/
│   │   └── SCD2_Dim_Customer.py           │   │   └── Transform_Customers
│   └── Config/                            │   ├── Gold/
│       └── Storage_Config.py              │   │   └── SCD2_Dim_Customer
├── tests/                                 │   └── Config/
│   └── test_transforms.py                 │       └── Storage_Config
├── .github/                               └── (synced via Git)
│   └── workflows/
│       └── deploy.yml
└── README.md
Step 1: Connect Databricks to GitHub
Generate a GitHub Personal Access Token
- Go to GitHub → Settings → Developer settings → Personal access tokens → Tokens (classic)
- Click Generate new token (classic)
- Name: databricks-git-integration
- Scopes: check repo (full control of private repositories)
- Click Generate token → copy the token (you will not see it again)
Configure Git in Databricks
- In Databricks, click your user icon (top right) → Settings
- Click Linked accounts (under User section)
- Git provider: GitHub
- Git provider username: your GitHub username
- Token: paste the personal access token
- Click Save
Step 2: Clone a Repository into Databricks
Create the GitHub Repository First
# On your local machine
mkdir databricks-etl-project
cd databricks-etl-project
git init
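mkdir -p notebooks/Bronze notebooks/Silver notebooks/Gold notebooks/Config tests
# Git does not track empty directories, so add a placeholder file to each one
for d in notebooks/Bronze notebooks/Silver notebooks/Gold notebooks/Config tests; do touch "$d/.gitkeep"; done
# Create a README
echo "# Databricks ETL Project" > README.md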
git add .
git commit -m "Initial project structure"
git branch -M main
git remote add origin https://github.com/YOUR_USERNAME/databricks-etl-project.git
git push -u origin main
Clone into Databricks
- In Databricks, click Workspace in the sidebar
- Click Repos (under your username)
- Click Add Repo
- Paste the repository URL: https://github.com/YOUR_USERNAME/databricks-etl-project.git
- Click Create Repo
The repository is now cloned. You can see all files and notebooks inside Databricks.
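The same clone can be scripted, which is useful when a pipeline has to create or update a repo in a workspace nobody logs into. The sketch below only builds the HTTP requests and does not send them. It assumes the Databricks Repos REST API (`/api/2.0/repos`), and the function names and the `/Repos/production/...` path are illustrative placeholders.

```python
# Sketch only: builds request descriptions, sends nothing over the network.
# Assumes the Databricks Repos REST API (/api/2.0/repos); host and paths are placeholders.


def clone_repo_request(host: str, git_url: str, workspace_path: str) -> dict:
    """Describe the request that clones a GitHub repository into the workspace."""
    return {
        "method": "POST",
        "url": f"{host.rstrip('/')}/api/2.0/repos",
        "json": {"url": git_url, "provider": "gitHub", "path": workspace_path},
    }


def checkout_branch_request(host: str, repo_id: int, branch: str) -> dict:
    """Describe the request that switches an existing workspace repo to a branch."""
    return {
        "method": "PATCH",
        "url": f"{host.rstrip('/')}/api/2.0/repos/{repo_id}",
        "json": {"branch": branch},
    }


if __name__ == "__main__":
    req = clone_repo_request(
        "https://adb-XXXXXXXXXXXX.azuredatabricks.net",
        "https://github.com/YOUR_USERNAME/databricks-etl-project.git",
        "/Repos/production/databricks-etl-project",
    )
    print(req["method"], req["url"])
```

To actually send one of these, pass it to `requests.request(**req, headers={"Authorization": f"Bearer {token}"})` with a token read from a secret store.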
Step 3: Create a Branch and Make Changes
Create a Feature Branch
- In the Repos section, click on your repo
- Click the branch dropdown (shows main) at the top
- Click Create branch
- Name: feature/add-bronze-ingest
- Click Create
You are now on the feature branch. Any changes here do NOT affect main.
Edit a Notebook
- Navigate to notebooks/Bronze/
- Create a new notebook: Ingest_Customers
- Write your code:
# Ingest_Customers notebook
from pyspark.sql.functions import current_date

# Read config (in Databricks, %run must be the only command in its cell)
%run ../Config/Storage_Config

# Ingest from SQL to Bronze
df = spark.read.jdbc(url=sql_url, table="SalesLT.Customer", properties=sql_properties)
(
    df.withColumn("load_date", current_date())
    .write.format("delta")
    .mode("overwrite")
    .save(f"{BASE_PATH}/bronze/customers/")
)
print(f"Ingested {df.count()} customers to Bronze")
Step 4: Commit, Push, and Create a Pull Request
Commit from Databricks UI
- Click the branch name at the top of the Repos panel
- You will see a list of changed files
- Enter a commit message: feat: add bronze customer ingest notebook
- Click Commit & Push
Create a Pull Request on GitHub
- Go to your GitHub repository
- You will see a banner: “feature/add-bronze-ingest had recent pushes — Compare & pull request”
- Click Compare & pull request
- Add a description explaining what the notebook does
- Assign a reviewer (or self-review for personal projects)
- Click Create pull request
Code Review
The reviewer can:
- See every line of code changed (diff view)
- Leave comments on specific lines
- Request changes or approve
- Once approved, click Merge pull request
After merge, the code is in main and ready for deployment.
Real-life analogy: A Pull Request is like submitting a building permit. You design the renovation (feature branch). You submit the plans for approval (PR). The inspector reviews (code review). If it passes, the permit is granted (merge). If not, you revise and resubmit. Nobody starts construction without an approved permit.
Folder Structure for a Production Databricks Project
databricks-etl-project/
├── notebooks/
│   ├── Config/
│   │   ├── Storage_Config.py        # ADLS + SQL connection setup
│   │   └── Environment_Config.py    # Environment-specific settings
│   ├── Bronze/
│   │   ├── Ingest_Customers.py
│   │   ├── Ingest_Products.py
│   │   └── Ingest_Orders.py
│   ├── Silver/
│   │   ├── Transform_Customers.py
│   │   ├── Transform_Products.py
│   │   └── Data_Quality_Checks.py
│   ├── Gold/
│   │   ├── SCD2_Dim_Customer.py
│   │   ├── Build_Fact_Orders.py
│   │   └── Agg_Daily_Revenue.py
│   └── Maintenance/
│       └── Optimize_Vacuum.py
├── tests/
│   ├── test_transforms.py
│   ├── test_data_quality.py
│   └── conftest.py
├── .github/
│   └── workflows/
│       ├── ci.yml                   # Run tests on PR
│       └── deploy.yml               # Deploy to UAT/Prod
├── .gitignore
├── requirements.txt
└── README.md
.gitignore for Databricks
# Databricks
.databricks/
*.pyc
__pycache__/
.ipynb_checkpoints/
# Environment
.env
*.egg-info/
# IDE
.vscode/
.idea/
Environment Promotion: Dev → UAT → Prod
Option A: Separate Workspaces per Environment (Recommended)
Dev Workspace → Repos synced to 'develop' branch
UAT Workspace → Repos synced to 'main' branch
Prod Workspace → Notebooks deployed via CI/CD (not manually synced)
Each workspace has its own cluster configs, secret scopes, and storage connections.
Pros: Complete isolation, different permissions per environment, impossible to accidentally run dev code against prod data.
Cons: Higher cost (3 workspaces), more infrastructure to manage.
Option B: Folder-Based Environments (Single Workspace)
Repos/
├── dev/databricks-etl-project/ → develop branch
├── uat/databricks-etl-project/ → main branch
└── prod/databricks-etl-project/ → release branch
Notebooks use a config parameter to determine which storage account and Key Vault to connect to.
Pros: Lower cost, simpler management.
Cons: Risk of running dev code against prod data if misconfigured. Less isolation.
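One way to reduce the misconfiguration risk in the single-workspace layout is to derive the environment from where the notebook lives and compare it with the parameter that was passed in. The helper below is a sketch under that assumption. `env_from_repo_path` is a hypothetical name, the optional `/Workspace` prefix handling is an assumption about how paths may be reported, and how you obtain the notebook path inside Databricks is left out.

```python
# Hypothetical helper for the folder layout shown above:
#   /Repos/<env>/databricks-etl-project/...
VALID_ENVS = ("dev", "uat", "prod")


def env_from_repo_path(notebook_path: str) -> str:
    """Extract the environment folder name from a path under /Repos/."""
    parts = [p for p in notebook_path.split("/") if p]
    # Assumption: some paths may be reported with a leading /Workspace segment.
    if parts and parts[0] == "Workspace":
        parts = parts[1:]
    # Expected shape: ["Repos", "<env>", "<repo>", ...]
    if len(parts) < 2 or parts[0] != "Repos" or parts[1] not in VALID_ENVS:
        raise ValueError(f"Cannot determine environment from path: {notebook_path}")
    return parts[1]


if __name__ == "__main__":
    path = "/Repos/uat/databricks-etl-project/notebooks/Bronze/Ingest_Customers"
    print(env_from_repo_path(path))
```

A notebook could refuse to run when the derived value disagrees with its `environment` parameter, which turns a silent dev-against-prod mistake into an immediate failure.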
CI/CD with GitHub Actions for Databricks
The CI Workflow (Run Tests on Pull Request)
# .github/workflows/ci.yml
name: Databricks CI
on:
  pull_request:
    branches: [main, develop]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install pyspark pytest delta-spark
      - name: Run unit tests
        # python -m pytest adds the repo root to sys.path so tests can import from notebooks/
        run: |
          python -m pytest tests/ -v --tb=short
      - name: Lint notebooks
        run: |
          pip install flake8
          flake8 notebooks/ --max-line-length=120 --ignore=E501,W503
The Deploy Workflow (Deploy to UAT/Prod)
# .github/workflows/deploy.yml
name: Deploy to Databricks
on:
  push:
    branches:
      - main      # Deploy to UAT on merge to main
      - release   # Deploy to Prod on merge to release
env:
  DATABRICKS_HOST_UAT: https://adb-uat-workspace.azuredatabricks.net
  DATABRICKS_HOST_PROD: https://adb-prod-workspace.azuredatabricks.net
jobs:
  deploy-uat:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: UAT
    steps:
      - uses: actions/checkout@v4
      - name: Install Databricks CLI
        run: pip install databricks-cli
      - name: Configure Databricks CLI
        run: |
          cat > ~/.databrickscfg << EOF
          [DEFAULT]
          host = ${{ env.DATABRICKS_HOST_UAT }}
          token = ${{ secrets.DATABRICKS_TOKEN_UAT }}
          EOF
      - name: Deploy notebooks to UAT
        run: |
          databricks workspace import_dir notebooks/ /Repos/production/notebooks --overwrite
      - name: Deploy workflow jobs
        run: |
          databricks jobs reset --job-id ${{ secrets.UAT_JOB_ID }} --json-file job_configs/daily_etl.json
  deploy-prod:
    if: github.ref == 'refs/heads/release'
    runs-on: ubuntu-latest
    environment: Production
    steps:
      - uses: actions/checkout@v4
      - name: Install Databricks CLI
        run: pip install databricks-cli
      - name: Configure Databricks CLI
        run: |
          cat > ~/.databrickscfg << EOF
          [DEFAULT]
          host = ${{ env.DATABRICKS_HOST_PROD }}
          token = ${{ secrets.DATABRICKS_TOKEN_PROD }}
          EOF
      - name: Deploy notebooks to Production
        run: |
          databricks workspace import_dir notebooks/ /Repos/production/notebooks --overwrite
      - name: Trigger smoke test job
        run: |
          databricks jobs run-now --job-id ${{ secrets.PROD_SMOKE_TEST_JOB_ID }}
GitHub Secrets to Configure
| Secret Name | What It Is |
|---|---|
| DATABRICKS_TOKEN_UAT | Personal access token for UAT workspace |
| DATABRICKS_TOKEN_PROD | Personal access token for Prod workspace |
| UAT_JOB_ID | Workflow job ID in UAT workspace |
| PROD_SMOKE_TEST_JOB_ID | Smoke test job ID in Prod workspace |
Deploying Notebooks with the Databricks CLI
Install and Configure
pip install databricks-cli
# Configure with your workspace
databricks configure --token
# Host: https://adb-XXXXXXXXXXXX.azuredatabricks.net
# Token: your personal access token
Common CLI Commands
# List workspace contents
databricks workspace ls /Repos/
# Export a notebook
databricks workspace export /Repos/my-project/notebooks/Bronze/Ingest_Customers -o ./local_copy.py
# Import a notebook
databricks workspace import ./local_copy.py /Repos/production/notebooks/Bronze/Ingest_Customers --overwrite --language PYTHON
# Import entire directory
databricks workspace import_dir ./notebooks/ /Repos/production/notebooks/ --overwrite
# List jobs
databricks jobs list
# Trigger a job
databricks jobs run-now --job-id 12345
Deploying Notebooks with the Databricks REST API
For CI/CD pipelines that cannot use the CLI:
import base64
import os

import requests

DATABRICKS_HOST = "https://adb-XXXXXXXXXXXX.azuredatabricks.net"
# Assumption: the token comes from an environment variable instead of being hardcoded
TOKEN = os.environ["DATABRICKS_TOKEN"]
HEADERS = {"Authorization": f"Bearer {TOKEN}"}


def deploy_notebook(local_path, remote_path):
    """Upload one local .py file as a Python notebook, overwriting any existing copy."""
    with open(local_path, "r") as f:
        content = f.read()
    payload = {
        "path": remote_path,
        "language": "PYTHON",
        "overwrite": True,
        "content": base64.b64encode(content.encode()).decode(),
    }
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/workspace/import",
        headers=HEADERS,
        json=payload,
    )
    if resp.status_code == 200:
        print(f"Deployed: {remote_path}")
    else:
        print(f"FAILED: {resp.text}")


# Deploy all notebooks
for root, dirs, files in os.walk("notebooks"):
    for file in files:
        if file.endswith(".py"):
            local = os.path.join(root, file)
            # Drop the .py extension and use forward slashes for the workspace path
            remote = "/Repos/production/" + os.path.splitext(local)[0].replace(os.sep, "/")
            deploy_notebook(local, remote)
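Before pointing a script like this at a real workspace, it helps to print what it would upload and where. The following dry-run sketch walks a local folder and returns the planned local-to-workspace mapping without making any API call. `plan_imports` is a hypothetical helper, and it assumes notebooks should appear in the workspace without the `.py` extension, as in the workspace tree shown earlier.

```python
import os
import tempfile


def plan_imports(local_root: str, remote_root: str) -> list:
    """Return sorted (local_file, remote_notebook_path) pairs for every .py file."""
    plan = []
    for root, _dirs, files in os.walk(local_root):
        for name in files:
            if not name.endswith(".py"):
                continue  # skip READMEs, configs, and other non-notebook files
            local = os.path.join(root, name)
            rel = os.path.relpath(local, local_root)
            # Drop the extension and normalise separators to forward slashes.
            rel_no_ext = os.path.splitext(rel)[0].replace(os.sep, "/")
            plan.append((local, f"{remote_root.rstrip('/')}/{rel_no_ext}"))
    return sorted(plan)


if __name__ == "__main__":
    # Demo against a throwaway directory so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        os.makedirs(os.path.join(tmp, "Bronze"))
        open(os.path.join(tmp, "Bronze", "Ingest_Customers.py"), "w").close()
        for _local, remote in plan_imports(tmp, "/Repos/production/notebooks"):
            print(remote)
```

Keeping the path logic in a pure function also means it can be unit tested in CI, separately from the code that talks to the workspace.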
CI/CD with Azure DevOps for Databricks
# azure-pipelines.yml
trigger:
  branches:
    include:
      - main
pool:
  vmImage: 'ubuntu-latest'
stages:
  - stage: Deploy_UAT
    displayName: 'Deploy to UAT'
    jobs:
      - deployment: DeployUAT
        environment: 'UAT'
        strategy:
          runOnce:
            deploy:
              steps:
                - checkout: self
                - task: UsePythonVersion@0
                  inputs:
                    versionSpec: '3.11'
                - script: pip install databricks-cli
                  displayName: 'Install Databricks CLI'
                - script: |
                    echo "[DEFAULT]" > ~/.databrickscfg
                    echo "host = $(DATABRICKS_HOST_UAT)" >> ~/.databrickscfg
                    echo "token = $(DATABRICKS_TOKEN_UAT)" >> ~/.databrickscfg
                  displayName: 'Configure CLI'
                - script: |
                    databricks workspace import_dir notebooks/ /Repos/production/notebooks --overwrite
                  displayName: 'Deploy Notebooks'
Running Tests Automatically on Pull Request
Writing Testable Notebooks
The trick is to write transformation FUNCTIONS in separate Python files, not inline in notebooks:
notebooks/Silver/transforms.py (importable functions):
from pyspark.sql.functions import col, initcap, lower, trim

def clean_customer_data(df):
    return (
        df.withColumn("name", initcap(trim(col("name"))))
        .withColumn("email", lower(trim(col("email"))))
        .fillna({"city": "Unknown", "country": "Unknown"})
        .dropDuplicates(["customer_id"])
    )

def validate_emails(df):
    # Raw string so the backslash in \. reaches the regex engine unchanged
    email_regex = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
    return df.filter(col("email").rlike(email_regex))
tests/test_transforms.py (unit tests):
import pytest
from pyspark.sql import SparkSession

from notebooks.Silver.transforms import clean_customer_data, validate_emails


@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[*]").appName("tests").getOrCreate()


def test_clean_trims_names(spark):
    data = [(1, " naveen ", "test@email.com", "Toronto", "Canada")]
    df = spark.createDataFrame(data, ["customer_id", "name", "email", "city", "country"])
    result = clean_customer_data(df)
    assert result.collect()[0]["name"] == "Naveen"


def test_clean_fills_nulls(spark):
    data = [(1, "Naveen", "test@email.com", None, None)]
    # Explicit schema: Spark cannot infer a type for columns that contain only None
    schema = "customer_id INT, name STRING, email STRING, city STRING, country STRING"
    df = spark.createDataFrame(data, schema)
    result = clean_customer_data(df)
    row = result.collect()[0]
    assert row["city"] == "Unknown"
    assert row["country"] == "Unknown"


def test_validate_emails_filters_invalid(spark):
    data = [(1, "good@email.com"), (2, "bad-email"), (3, "also@good.org")]
    df = spark.createDataFrame(data, ["id", "email"])
    result = validate_emails(df)
    assert result.count() == 2
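Starting a SparkSession takes a while, so it can be handy to sanity-check a pattern in plain Python before wiring it into a Spark test. The sketch below reuses the email pattern from validate_emails with Python's `re` module. It assumes the pattern behaves the same in Python's regex engine as in the Java engine behind `rlike`, which holds for simple character classes like these but is not guaranteed for every pattern. `is_valid_email` is a hypothetical helper.

```python
import re

# Same pattern as validate_emails, written as a raw string.
EMAIL_REGEX = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"


def is_valid_email(value: str) -> bool:
    """Pure-Python check mirroring the Spark rlike filter (assumes equivalent regex semantics)."""
    return re.search(EMAIL_REGEX, value) is not None


if __name__ == "__main__":
    for sample in ["good@email.com", "bad-email", "also@good.org"]:
        print(sample, is_valid_email(sample))
```

This does not replace the Spark test; it only gives fast feedback on the pattern itself while you iterate.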
Parameterizing Notebooks for Environments
# Environment_Config.py
# Use dbutils widgets to determine environment at runtime
dbutils.widgets.text("environment", "dev", "Environment")
env = dbutils.widgets.get("environment")

config = {
    "dev": {
        "storage_account": "devstorageaccount",
        "key_vault_scope": "dev-scope",
        "sql_server": "dev-sql.database.windows.net",
    },
    "uat": {
        "storage_account": "uatstorageaccount",
        "key_vault_scope": "uat-scope",
        "sql_server": "uat-sql.database.windows.net",
    },
    "prod": {
        "storage_account": "prodstorageaccount",
        "key_vault_scope": "prod-scope",
        "sql_server": "prod-sql.database.windows.net",
    },
}

# Use environment-specific config
STORAGE_ACCOUNT = config[env]["storage_account"]
SCOPE = config[env]["key_vault_scope"]
SQL_SERVER = config[env]["sql_server"]
print(f"Environment: {env} | Storage: {STORAGE_ACCOUNT} | SQL: {SQL_SERVER}")
When triggering a Workflow job, pass the environment parameter:
databricks jobs run-now --job-id 12345 --notebook-params '{"environment": "prod"}'
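A mistyped environment value (for example "production" instead of "prod") would surface as a bare KeyError in the config lookup. The sketch below wraps the same settings in a function that normalises the input and fails with a readable message. `resolve_config` is a hypothetical helper, and the storage account and server names are the placeholder values from Environment_Config.

```python
# Sketch: same placeholder settings as Environment_Config, plus input validation.
CONFIG = {
    "dev": {
        "storage_account": "devstorageaccount",
        "key_vault_scope": "dev-scope",
        "sql_server": "dev-sql.database.windows.net",
    },
    "uat": {
        "storage_account": "uatstorageaccount",
        "key_vault_scope": "uat-scope",
        "sql_server": "uat-sql.database.windows.net",
    },
    "prod": {
        "storage_account": "prodstorageaccount",
        "key_vault_scope": "prod-scope",
        "sql_server": "prod-sql.database.windows.net",
    },
}


def resolve_config(env: str) -> dict:
    """Return the settings for an environment, rejecting unknown names clearly."""
    key = env.strip().lower()
    if key not in CONFIG:
        allowed = ", ".join(sorted(CONFIG))
        raise ValueError(f"Unknown environment '{env}'. Expected one of: {allowed}")
    return CONFIG[key]


if __name__ == "__main__":
    print(resolve_config("prod")["storage_account"])
```

In a notebook, the widget value would be passed straight in: `settings = resolve_config(dbutils.widgets.get("environment"))`.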
Databricks Asset Bundles (DABs) — The Modern Approach
Databricks Asset Bundles (DABs) is the newest approach to deploying Databricks projects. It bundles notebooks, jobs, libraries, and configurations into a single deployable unit:
# databricks.yml
bundle:
  name: etl-pipeline
workspace:
  host: https://adb-XXXXXXXXXXXX.azuredatabricks.net
resources:
  jobs:
    daily_etl:
      name: "Daily ETL Pipeline"
      tasks:
        - task_key: ingest_bronze
          notebook_task:
            notebook_path: ./notebooks/Bronze/Ingest_Customers.py
        - task_key: transform_silver
          depends_on:
            - task_key: ingest_bronze
          notebook_task:
            notebook_path: ./notebooks/Silver/Transform_Customers.py
targets:
  dev:
    workspace:
      host: https://adb-dev.azuredatabricks.net
  prod:
    workspace:
      host: https://adb-prod.azuredatabricks.net
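Deploy with the newer Databricks CLI (the legacy databricks-cli pip package used in the earlier examples does not include the bundle command):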
databricks bundle deploy --target prod
DABs is the direction Databricks is heading for CI/CD — it replaces manual CLI deployment scripts with a declarative config file.
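In a pipeline it is common to validate a bundle before deploying it. The sketch below builds the two CLI invocations for a target and only executes them when `dry_run` is turned off. It assumes the newer Databricks CLI is on the PATH and already authenticated. `bundle_commands` and `run_bundle` are hypothetical names.

```python
import subprocess


def bundle_commands(target: str) -> list:
    """Return the CLI invocations for validating and then deploying a bundle target."""
    return [
        ["databricks", "bundle", "validate", "--target", target],
        ["databricks", "bundle", "deploy", "--target", target],
    ]


def run_bundle(target: str, dry_run: bool = True) -> list:
    """Run validate then deploy; with dry_run=True only return the planned commands."""
    commands = bundle_commands(target)
    if not dry_run:
        for cmd in commands:
            # check=True stops the sequence if validation or deployment fails.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    for cmd in run_bundle("dev"):
        print(" ".join(cmd))
```

Running validation first means a malformed databricks.yml fails the pipeline before anything in the workspace changes.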
Secrets Management Across Environments
| Environment | Key Vault | Secret Scope | Who Has Access |
|---|---|---|---|
| Dev | dev-keyvault | dev-scope | All engineers |
| UAT | uat-keyvault | uat-scope | Senior engineers + testers |
| Prod | prod-keyvault | prod-scope | Service principals only (no human access) |
The SAME notebook code runs in all environments — only the secret scope name changes, controlled by the environment parameter.
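To illustrate how only the scope name changes, the sketch below reads a secret through an object that exposes `get(scope=..., key=...)`, which is the shape of `dbutils.secrets.get` in a notebook. A small fake stands in for it so the example runs outside Databricks. The helper name, the `FakeSecrets` class, and the `sql-password` key are all hypothetical.

```python
class FakeSecrets:
    """Stand-in for dbutils.secrets so the sketch runs outside Databricks."""

    def __init__(self, store: dict):
        self._store = store

    def get(self, scope: str, key: str) -> str:
        return self._store[(scope, key)]


def get_secret_for_env(secrets, env: str, key: str) -> str:
    """Look up a secret in the scope that belongs to the given environment."""
    scope = f"{env}-scope"  # dev-scope, uat-scope, prod-scope as in the table above
    return secrets.get(scope=scope, key=key)


if __name__ == "__main__":
    fake = FakeSecrets({
        ("dev-scope", "sql-password"): "dev-value",
        ("prod-scope", "sql-password"): "prod-value",
    })
    # Same call, different environment parameter, different secret.
    print(get_secret_for_env(fake, "dev", "sql-password"))
    print(get_secret_for_env(fake, "prod", "sql-password"))
```

Inside a notebook the call would be `get_secret_for_env(dbutils.secrets, env, "sql-password")`. Passing the secrets object in as an argument keeps the function testable without a cluster.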
Comparing ADF CI/CD vs Databricks CI/CD
| Aspect | ADF/Synapse CI/CD | Databricks CI/CD |
|---|---|---|
| What is versioned | JSON (pipelines, datasets, linked services) | Python notebooks, SQL files, configs |
| Git integration | Built-in (ADF Studio → Git) | Databricks Repos |
| Deployment artifact | ARM templates (auto-generated) | Notebooks + job configs |
| Deployment method | ARM template deployment | Databricks CLI / REST API / DABs |
| Parameterization | ARM parameter files per environment | dbutils.widgets + environment config |
| Testing | Limited (no built-in test framework) | pytest with local PySpark |
| Branch strategy | adf_publish / workspace_publish branch | Standard Git flow (develop → main → release) |
Common Mistakes
- Editing notebooks in Prod workspace directly: this violates the one-way flow. All changes must go through Git. Prod should be read-only for humans.
- Hardcoding environment-specific values: storage account names, server URLs, and Key Vault scopes must come from parameters, not hardcoded strings.
- Not writing testable code: inline notebook code cannot be unit tested. Extract transformation functions into importable Python files.
- Committing secrets to Git: never commit tokens, passwords, or connection strings. Use .gitignore for .env files and use Key Vault for secrets.
- No branch protection: without requiring PR reviews, anyone can push broken code to main. Enable branch protection rules on GitHub.
- Forgetting to sync Repos before editing: if someone else merged changes, your local Repos copy is stale. Always pull before starting new work.
Interview Questions
Q: How does CI/CD work for Databricks?
A: Notebooks are stored in a Git repository via Databricks Repos. Engineers develop in feature branches, create pull requests with code review, and merge to main. CI runs tests automatically on PR. CD deploys notebooks to UAT/Prod workspaces using the Databricks CLI, REST API, or Asset Bundles. Each environment has its own workspace, secret scope, and storage connections.
Q: What is Databricks Repos?
A: A built-in Git client in Databricks that lets you clone repositories, create branches, commit changes, and push, all from the workspace UI. Notebooks inside Repos are synced with Git, enabling version control, code review, and CI/CD.
Q: How do you handle environment differences (dev/uat/prod) in Databricks?
A: Use parameterized config notebooks with dbutils.widgets. A single environment parameter drives which storage account, Key Vault scope, and SQL server to use. The same notebook code runs in all environments — only the config values change.
Q: What are Databricks Asset Bundles?
A: DABs is the modern approach to Databricks CI/CD. A databricks.yml file declares notebooks, jobs, clusters, and environment targets in one config. Deploy with databricks bundle deploy --target prod. It replaces manual CLI scripts with a declarative, repeatable deployment process.
Q: How do you test Databricks notebooks in CI?
A: Extract transformation logic into importable Python functions. Write pytest unit tests that create a local SparkSession and test those functions with sample data. Run pytest in the CI pipeline (GitHub Actions or Azure DevOps) on every pull request.
Wrapping Up
Databricks CI/CD is fundamentally about treating notebooks as production code — versioned, reviewed, tested, and deployed through an automated pipeline. Repos provides the Git integration. GitHub Actions or Azure DevOps provides the automation. Environment configs and secret scopes handle environment differences. And Databricks Asset Bundles are the future of declarative deployment.
The pattern is the same as ADF CI/CD — just different tools. Code flows one direction: Dev → UAT → Prod. Nobody edits production directly. Everything goes through Git.
Related posts:
- CI/CD with GitHub Actions (ADF/Synapse)
- CI/CD with Azure DevOps (ADF/Synapse)
- Databricks Workflows and Jobs
- Databricks Introduction and dbutils
- Databricks Secret Scopes
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.