Connecting Azure Databricks to Azure SQL Database: JDBC Read, Write, and Production Patterns

You have data in Azure SQL Database. You have a Databricks notebook ready to transform it. But between them is a JDBC connection that needs the right URL, the right driver, the right credentials, and the right syntax — get any one wrong and you get cryptic Java exceptions that make no sense.

This post walks through everything: connecting Databricks to Azure SQL via JDBC, reading tables and custom queries, writing data back, the ORDER BY subquery trap that catches everyone, and a production-ready three-notebook architecture with reusable functions.

Think of JDBC as a phone call between Databricks and SQL Server. You need the right phone number (server URL), the right extension (port 1433), the correct name to ask for (database), and your ID badge (credentials). Get the phone number wrong and you hear static. Forget your badge and security hangs up on you.

Table of Contents

  • What Is JDBC and Why Databricks Needs It
  • The JDBC Connection URL
  • Method 1: Direct Connection (Quick Start)
  • Method 2: Secure Connection with Key Vault
  • Reading SQL Tables into DataFrames
  • Reading with Custom SQL Queries
  • The ORDER BY Subquery Trap
  • Writing DataFrames to SQL Tables
  • Write Modes: Append, Overwrite, Ignore
  • Upsert (MERGE) Pattern
  • The Three-Notebook Architecture (Production)
  • Notebook 1: Config_SQL (Connection Setup)
  • Notebook 2: Functions_SQL (Reusable Functions)
  • Notebook 3: SQL_Operations (Worker)
  • Data Quality Functions
  • End-to-End Pipeline: SQL to ADLS to SQL
  • Firewall Configuration
  • Performance Optimization
  • Common Errors and Fixes
  • Interview Questions
  • Wrapping Up

What Is JDBC and Why Databricks Needs It

JDBC (Java Database Connectivity) is how Spark connects to relational databases. Unlike ADLS Gen2 where Spark reads files directly, SQL databases require a protocol-level connection — Spark sends SQL commands over JDBC and the database returns result sets.

Databricks (Spark)                     Azure SQL Database
┌──────────────┐                      ┌──────────────────┐
│              │   JDBC Connection    │                  │
│ spark.read   │ ──── Port 1433 ────> │ SalesLT.Customer │
│   .jdbc()    │ <── Result Set ───── │ (847 rows)       │
│              │                      │                  │
└──────────────┘                      └──────────────────┘

Real-life analogy: Reading from ADLS is like picking up a book from a shelf — you just grab it. Reading from SQL via JDBC is like calling a librarian — you ask for a specific book, they find it, and read it to you over the phone.

The JDBC Connection URL

Every SQL connection needs a JDBC URL:

jdbc:sqlserver://SERVER_NAME:1433;database=DATABASE_NAME;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30

Breaking It Down

| Part | Value | Meaning |
|---|---|---|
| `jdbc:sqlserver://` | Protocol prefix | "I'm using the SQL Server JDBC driver" |
| `SERVER_NAME` | sql-dataplatform-dev.database.windows.net | Your SQL server's address |
| `:1433` | Port number | SQL Server's default port |
| `database=` | AdventureWorksLT | Which database to connect to |
| `encrypt=true` | Security setting | Encrypt the connection |
| `loginTimeout=30` | Timeout in seconds | Give up after 30 seconds |
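The parts in the table can be assembled with a small helper. This is a sketch mirroring the URL used throughout this post; the server and database names are the examples from the table:

```python
def build_jdbc_url(server, database, port=1433, login_timeout=30):
    """Assemble a SQL Server JDBC URL from its parts."""
    return (
        f"jdbc:sqlserver://{server}:{port};"
        f"database={database};"
        "encrypt=true;"
        "trustServerCertificate=false;"
        "hostNameInCertificate=*.database.windows.net;"
        f"loginTimeout={login_timeout}"
    )

url = build_jdbc_url("sql-dataplatform-dev.database.windows.net", "AdventureWorksLT")
print(url)
```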

The JDBC Driver

Spark uses the Microsoft SQL Server JDBC driver. In Databricks, it is pre-installed. In local PySpark, you must download it.

# The driver class name — same for all SQL Server connections
driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
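For local PySpark, the usual route is to let Spark pull the driver from Maven at session start. The coordinate version below is an assumption; check Maven Central for the current mssql-jdbc release:

```python
# Maven coordinate for the SQL Server JDBC driver (the version here is an
# assumption; pin whatever current release Maven Central shows)
MSSQL_JDBC = "com.microsoft.sqlserver:mssql-jdbc:12.4.2.jre11"

# Local PySpark only; on Databricks the driver ships pre-installed:
# from pyspark.sql import SparkSession
# spark = (SparkSession.builder
#          .config("spark.jars.packages", MSSQL_JDBC)
#          .getOrCreate())
```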

Method 1: Direct Connection (Quick Start)

For learning and quick testing:

# Connection details
jdbc_url = "jdbc:sqlserver://sql-dataplatform-dev.database.windows.net:1433;database=AdventureWorksLT;encrypt=true"

connection_properties = {
    "user": "sqladmin",
    "password": "YourPassword123!",   # ❌ NEVER hardcode in production
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

# Read a table
df = spark.read.jdbc(
    url=jdbc_url,
    table="SalesLT.Customer",
    properties=connection_properties
)

df.show(5)
print(f"Read {df.count()} rows")

This works but is insecure. The password is visible in the notebook. Use Key Vault for anything beyond quick testing.

Method 2: Secure Connection with Key Vault

Prerequisites

Store these secrets in Azure Key Vault:

| Secret Name | Value | Example |
|---|---|---|
| sql-server-name | Your SQL server URL | sql-dataplatform-dev.database.windows.net |
| sql-database-name | Database name | AdventureWorksLT |
| sql-username | SQL admin username | sqladmin |
| sql-password | SQL admin password | P@ssw0rd!2026 |

Connect Using Key Vault

# Read credentials from Key Vault (never visible in notebook)
scope = "keyvault-scope"

server = dbutils.secrets.get(scope, "sql-server-name")
database = dbutils.secrets.get(scope, "sql-database-name")
username = dbutils.secrets.get(scope, "sql-username")
password = dbutils.secrets.get(scope, "sql-password")

# Build JDBC URL
sql_url = f"jdbc:sqlserver://{server}:1433;database={database};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30"

# Build properties dictionary
sql_properties = {
    "user": username,
    "password": password,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

# Verify (values are [REDACTED] in output)
print(f"SQL URL: {sql_url}")
print(f"SQL Properties keys: {list(sql_properties.keys())}")

# Test connection
df_test = spark.read.jdbc(url=sql_url, table="SalesLT.Customer", properties=sql_properties)
print(f"Connected! Read {df_test.count()} rows from SalesLT.Customer")

Real-life analogy: Method 1 is like writing your building access code on a sticky note on your laptop. Method 2 is like using a key card stored in a secure safe — you retrieve it when needed, use it, and it is never visible to anyone looking at your screen.

Reading SQL Tables into DataFrames

Read an Entire Table

# Read full table
df_customers = spark.read.jdbc(
    url=sql_url,
    table="SalesLT.Customer",
    properties=sql_properties
)

df_customers.show(5)
df_customers.printSchema()
print(f"Rows: {df_customers.count()}, Columns: {len(df_customers.columns)}")

Read Multiple Tables

# Read several tables
tables = ["SalesLT.Customer", "SalesLT.Product", "SalesLT.Address", "SalesLT.SalesOrderHeader"]

for table in tables:
    df = spark.read.jdbc(url=sql_url, table=table, properties=sql_properties)
    print(f"{table}: {df.count()} rows, {len(df.columns)} columns")

Reading with Custom SQL Queries

The Subquery Syntax

Spark wraps your query as a subquery. You must provide an alias:

# Custom query — wrap in parentheses with an alias
query = "(SELECT TOP 10 ProductID, Name, ListPrice FROM SalesLT.Product ORDER BY ListPrice DESC) AS top_products"

df = spark.read.jdbc(
    url=sql_url,
    table=query,        # Not actually a table — it's a subquery
    properties=sql_properties
)

df.show(truncate=False)

As a Reusable Function

def read_sql_query(query, alias="query_result"):
    '''Read data using a custom SQL query'''
    df = spark.read.jdbc(
        url=sql_url,
        table=f"({query}) AS {alias}",
        properties=sql_properties
    )
    return df

# Usage
df = read_sql_query(
    "SELECT CustomerID, FirstName, LastName, CompanyName FROM SalesLT.Customer WHERE CompanyName LIKE '%Bike%'",
    "bike_customers"
)
df.show()

The ORDER BY Subquery Trap

This is the #1 error that catches everyone:

# ❌ THIS FAILS
query = '''(
    SELECT TABLE_SCHEMA, TABLE_NAME 
    FROM INFORMATION_SCHEMA.TABLES 
    ORDER BY TABLE_NAME
) AS tables'''

df = spark.read.jdbc(url=sql_url, table=query, properties=sql_properties)
# ERROR: The ORDER BY clause is invalid in views, inline functions,
#        derived tables, subqueries, and common table expressions,
#        unless TOP, OFFSET or FOR XML is also specified.

Why It Fails

Spark wraps your query inside another SELECT:

SELECT * FROM (
    SELECT TABLE_SCHEMA, TABLE_NAME 
    FROM INFORMATION_SCHEMA.TABLES 
    ORDER BY TABLE_NAME          -- ❌ ORDER BY inside subquery = illegal in SQL Server
) AS tables

SQL Server does not allow ORDER BY inside a subquery unless you use TOP, OFFSET, or FOR XML.

The Fix: Sort in Spark

# ✅ Remove ORDER BY from SQL, sort in Spark instead
query = '''(
    SELECT TABLE_SCHEMA, TABLE_NAME 
    FROM INFORMATION_SCHEMA.TABLES
) AS tables'''

df = spark.read.jdbc(url=sql_url, table=query, properties=sql_properties)
df.orderBy("TABLE_NAME").show()  # Sort on the Spark side

Rule: Never use ORDER BY in JDBC queries. Always use .orderBy() on the DataFrame after reading.
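One exception worth knowing: if you genuinely need the server to do the top-N ordering, SQL Server does allow ORDER BY in a derived table when TOP or OFFSET ... FETCH accompanies it, exactly as the error message says. A sketch, reusing the table and columns from the examples above (the Spark call is commented out because it needs a live connection):

```python
# ORDER BY becomes legal inside the subquery once OFFSET ... FETCH is present,
# because the row limit gives SQL Server a reason to honor the sort
query = (
    "(SELECT ProductID, Name, ListPrice FROM SalesLT.Product "
    "ORDER BY ListPrice DESC "
    "OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY) AS top10_products"
)
# df = spark.read.jdbc(url=sql_url, table=query, properties=sql_properties)
```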

Real-life analogy: It is like asking a warehouse to ship you boxes sorted by size. The warehouse says “I cannot sort them — I just put them on the truck.” But once the truck arrives (DataFrame), YOU can sort them however you want (.orderBy()). Let the warehouse do what it is good at (filtering, joining) and let Spark do what IT is good at (sorting, aggregating).

Writing DataFrames to SQL Tables

Basic Write

# Create sample data
from pyspark.sql.functions import *

data = [(1, "Laptop", 999.99, 10), (2, "Mouse", 29.99, 50), (3, "Keyboard", 79.99, 30)]
df = spark.createDataFrame(data, ["product_id", "name", "price", "quantity"])

# Write to SQL table
df.write.jdbc(
    url=sql_url,
    table="dbo.products_from_databricks",
    mode="overwrite",
    properties=sql_properties
)

print("Written to SQL successfully!")

Verify the Write

# Read back to verify
df_verify = spark.read.jdbc(url=sql_url, table="dbo.products_from_databricks", properties=sql_properties)
df_verify.show()

Write Modes: Append, Overwrite, Ignore

| Mode | What It Does | When to Use |
|---|---|---|
| append | Adds rows to existing table | Incremental loads, daily inserts |
| overwrite | Drops and recreates the table, writes new data | Full refresh, dev/testing |
| ignore | Does nothing if table already exists | Prevent accidental overwrites |
| error (default) | Throws error if table exists | Catch unexpected situations |

# Append — add to existing
df.write.jdbc(url=sql_url, table="dbo.sales", mode="append", properties=sql_properties)

# Overwrite — replace entire table
df.write.jdbc(url=sql_url, table="dbo.sales", mode="overwrite", properties=sql_properties)

# Ignore — skip if exists
df.write.jdbc(url=sql_url, table="dbo.sales", mode="ignore", properties=sql_properties)

Warning with overwrite: It DROPS the entire table and recreates it from the DataFrame schema. Indexes, constraints, and permissions on the original table are LOST. For production, use append or the upsert pattern.
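One middle ground worth knowing: Spark's JDBC writer accepts a truncate option which, combined with overwrite, issues TRUNCATE TABLE instead of DROP TABLE, so the existing table definition survives. A sketch with placeholder credentials (in practice, pull them from Key Vault; verify on your own table that the objects you care about survive the truncate):

```python
# "truncate": "true" + mode="overwrite" makes Spark TRUNCATE the table
# instead of dropping it, keeping the existing table definition
overwrite_props = {
    "user": "sqladmin",            # placeholder; use Key Vault in practice
    "password": "<from-key-vault>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "truncate": "true",
}
# df.write.jdbc(url=sql_url, table="dbo.sales", mode="overwrite", properties=overwrite_props)
```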

Upsert (MERGE) Pattern

For production updates without losing the table structure:

def upsert_to_sql(df, table_name, key_column):
    '''Insert new rows, update existing rows using MERGE via staging table'''
    staging_table = f"{table_name}_staging"

    # Step 1: Write to staging table (overwrite is fine — it's temporary)
    df.write.jdbc(url=sql_url, table=staging_table, mode="overwrite", properties=sql_properties)

    # Step 2: Build and execute MERGE statement
    columns = df.columns
    update_cols = [c for c in columns if c != key_column]

    update_set = ", ".join([f"target.{c} = source.{c}" for c in update_cols])
    insert_cols = ", ".join(columns)
    insert_vals = ", ".join([f"source.{c}" for c in columns])

    merge_sql = f'''
        MERGE {table_name} AS target
        USING {staging_table} AS source
        ON target.{key_column} = source.{key_column}
        WHEN MATCHED THEN UPDATE SET {update_set}
        WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals});
    '''

    # Step 3: Execute MERGE on SQL Server. Spark's JDBC reader cannot run DML,
    # so open a plain JDBC connection through the JVM that ships the driver.
    conn = spark._jvm.java.sql.DriverManager.getConnection(
        sql_url, sql_properties["user"], sql_properties["password"]
    )
    stmt = conn.createStatement()
    stmt.executeUpdate(merge_sql)

    # Step 4: Drop the staging table
    stmt.executeUpdate(f"DROP TABLE IF EXISTS {staging_table}")
    stmt.close()
    conn.close()

    print(f"Upserted {df.count()} rows into {table_name}")

Real-life analogy: The upsert pattern is like a hotel check-in system. New guests (not matched) get assigned a room (INSERT). Returning guests (matched) get their room updated with new preferences (UPDATE). The staging table is the check-in queue — temporary, cleared after everyone is processed.

The Three-Notebook Architecture (Production)

In production, you separate concerns into three notebooks:

Config_SQL          — Connection setup (credentials from Key Vault)
Functions_SQL       — Reusable functions (read, write, validate)
SQL_Operations      — Actual work (%run Config_SQL, %run Functions_SQL)

All three must be in the SAME folder for %run ./NotebookName to work.

Notebook 1: Config_SQL (Connection Setup)

# Config_SQL — Connection configuration from Key Vault
SCOPE = "keyvault-scope"

# Read credentials securely
server = dbutils.secrets.get(SCOPE, "sql-server-name")
database = dbutils.secrets.get(SCOPE, "sql-database-name")
username = dbutils.secrets.get(SCOPE, "sql-username")
password = dbutils.secrets.get(SCOPE, "sql-password")

# Build JDBC URL
sql_url = (
    f"jdbc:sqlserver://{server}:1433;"
    f"database={database};"
    "encrypt=true;"
    "trustServerCertificate=false;"
    "hostNameInCertificate=*.database.windows.net;"
    "loginTimeout=30"
)

# Build properties
sql_properties = {
    "user": username,
    "password": password,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

# ADLS Config (if also needed)
STORAGE_ACCOUNT = "naveenadlsgen2de"
CONTAINER = "synapse-workspace"
BASE_PATH = f"abfss://{CONTAINER}@{STORAGE_ACCOUNT}.dfs.core.windows.net"

storage_key = dbutils.secrets.get(SCOPE, "adls-storage-key")
spark.conf.set(f"fs.azure.account.key.{STORAGE_ACCOUNT}.dfs.core.windows.net", storage_key)

print("Config loaded: SQL + ADLS connections ready")

Notebook 2: Functions_SQL (Reusable Functions)

# Functions_SQL — Reusable read/write/validate functions

# ── SQL READ FUNCTIONS ──

def read_sql_table(table_name):
    '''Read an entire SQL table into a DataFrame'''
    df = spark.read.jdbc(url=sql_url, table=table_name, properties=sql_properties)
    print(f"Read {df.count()} rows from {table_name}")
    return df

def read_sql_query(query, alias="query_result"):
    '''Read data using a custom SQL query (never use ORDER BY — sort in Spark)'''
    df = spark.read.jdbc(
        url=sql_url,
        table=f"({query}) AS {alias}",
        properties=sql_properties
    )
    print(f"Query returned {df.count()} rows")
    return df

def list_sql_tables(schema="dbo"):
    '''List all tables in a schema'''
    query = f"SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_TYPE FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = '{schema}'"
    return read_sql_query(query, "tables_list").orderBy("TABLE_NAME")

# ── SQL WRITE FUNCTIONS ──

def write_to_sql(df, table_name, mode="append"):
    '''Write DataFrame to SQL table (modes: append, overwrite, ignore)'''
    df.write.jdbc(url=sql_url, table=table_name, mode=mode, properties=sql_properties)
    print(f"Written {df.count()} rows to {table_name} (mode={mode})")

# ── ADLS FUNCTIONS ──

def read_adls_csv(path, delimiter=",", header=True):
    '''Read CSV from ADLS Gen2'''
    full_path = f"{BASE_PATH}/{path}"
    df = spark.read.format("csv").option("header", str(header)).option("delimiter", delimiter).option("inferSchema", "true").load(full_path)
    print(f"Read {df.count()} rows from {path}")
    return df

def write_adls_parquet(df, path, mode="overwrite"):
    '''Write DataFrame as Parquet to ADLS Gen2'''
    full_path = f"{BASE_PATH}/{path}"
    df.write.mode(mode).parquet(full_path)
    print(f"Written {df.count()} rows to {path} as Parquet")

def write_adls_delta(df, path, mode="overwrite", partition_cols=None):
    '''Write DataFrame as Delta to ADLS Gen2'''
    full_path = f"{BASE_PATH}/{path}"
    writer = df.write.format("delta").mode(mode)
    if partition_cols:
        writer = writer.partitionBy(*partition_cols)
    writer.save(full_path)
    print(f"Written {df.count()} rows to {path} as Delta")

# ── DATA QUALITY FUNCTIONS ──

def check_nulls(df, columns):
    '''Check null counts for specified columns'''
    from pyspark.sql.functions import col, count, when
    print("Null Analysis:")
    for c in columns:
        null_count = df.filter(col(c).isNull()).count()
        total = df.count()
        pct = round(null_count / total * 100, 2) if total > 0 else 0
        status = "⚠️" if null_count > 0 else "✓"
        print(f"  {status} {c}: {null_count} nulls ({pct}%)")

def check_duplicates(df, key_columns):
    '''Check for duplicate rows based on key columns'''
    from pyspark.sql.functions import col, count
    total = df.count()
    distinct = df.dropDuplicates(key_columns).count()
    dupes = total - distinct
    if dupes > 0:
        print(f"⚠️ Found {dupes} duplicate rows (out of {total})")
        df.groupBy(key_columns).count().filter(col("count") > 1).show()
    else:
        print(f"✓ No duplicates found ({total} rows, all unique)")

def profile_data(df):
    '''Quick data profiling'''
    print(f"Rows: {df.count()}")
    print(f"Columns: {len(df.columns)}")
    df.printSchema()
    df.show(5, truncate=False)
    df.describe().show()

# ── LOGGING ──

def log_activity(activity, table_name, rows, status, message=""):
    '''Log pipeline activity'''
    from datetime import datetime
    ts = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(f"[{ts}] {activity} | {table_name} | {rows} rows | {status} | {message}")

print("Functions loaded: read_sql_table, read_sql_query, list_sql_tables, write_to_sql,")
print("  read_adls_csv, write_adls_parquet, write_adls_delta,")
print("  check_nulls, check_duplicates, profile_data, log_activity")

Notebook 3: SQL_Operations (Worker)

# Load config and functions (must be first cells, each in its own cell)
%run ./Config_SQL
%run ./Functions_SQL
# 1. List all tables
df_tables = list_sql_tables()
df_tables.show(truncate=False)
# 2. Read a full table
df_customers = read_sql_table("SalesLT.Customer")
df_customers.show(5)
# 3. Read with custom query
from pyspark.sql.functions import col
df_top_products = read_sql_query(
    "SELECT TOP 10 p.ProductID, p.Name, p.ListPrice, pc.Name AS Category "
    "FROM SalesLT.Product p "
    "JOIN SalesLT.ProductCategory pc ON p.ProductCategoryID = pc.ProductCategoryID",
    "top_products"
)
df_top_products.orderBy(col("ListPrice").desc()).show(truncate=False)
# 4. Data quality checks
profile_data(df_customers)
check_nulls(df_customers, ["FirstName", "LastName", "EmailAddress", "Phone", "CompanyName"])
check_duplicates(df_customers, ["EmailAddress"])
# 5. Read CSV from ADLS, transform, write to SQL
from pyspark.sql.functions import *

df_sales = read_adls_csv("sales/sales_pipe.csv", delimiter="|")

df_transformed = (
    df_sales
    .withColumn("price", col("price").cast("double"))
    .withColumn("quantity", coalesce(col("quantity").cast("int"), lit(0)))
    .withColumn("total_amount", col("price") * col("quantity"))
    .withColumn("load_date", current_date())
    .filter(col("id").isNotNull())
)

write_to_sql(df_transformed, "dbo.sales_from_databricks", mode="overwrite")
log_activity("Write_SQL", "sales_from_databricks", df_transformed.count(), "SUCCESS")
# 6. SQL to ADLS Gen2 (export as Delta)
df_products = read_sql_table("SalesLT.Product")
df_products = (
    df_products
    .withColumn("export_year", year(current_date()))
    .withColumn("export_month", month(current_date()))
)

write_adls_delta(df_products, "output/products_delta/", partition_cols=["export_year", "export_month"])
log_activity("Export", "SalesLT.Product", df_products.count(), "SUCCESS", "Delta with partitions")
# 7. Save as managed table in catalog
df_products = read_sql_table("SalesLT.Product")
df_products.write.format("delta").mode("overwrite").saveAsTable("sales_db.products_from_sql")
log_activity("Create_Table", "sales_db.products_from_sql", df_products.count(), "SUCCESS")

Firewall Configuration

Azure SQL Database blocks all connections by default. You must allow Databricks:

Option 1: Allow Azure Services (Easiest)

  1. Azure Portal → SQL Server → Networking
  2. Set Allow Azure services and resources to access this server to Yes
  3. Save

This allows ALL Azure services (including Databricks) to connect.

Option 2: Add Databricks IPs (More Secure)

  1. Find your Databricks cluster’s outbound IPs (cluster logs or Azure Databricks docs)
  2. Add each IP as a firewall rule in SQL Server → Networking

Option 3: Private Endpoint (Production)

Use a private endpoint to connect Databricks to SQL over private network — no public internet.

Performance Optimization

1. Use Partitioned Reads for Large Tables

# Read in parallel partitions (much faster for large tables)
df = spark.read.jdbc(
    url=sql_url,
    table="SalesLT.SalesOrderDetail",
    column="SalesOrderDetailID",      # Partition column (numeric, evenly distributed)
    lowerBound=1,                      # Min value of partition column
    upperBound=100000,                 # Max value
    numPartitions=10,                  # Number of parallel readers
    properties=sql_properties
)

This creates 10 parallel JDBC connections, each reading a disjoint range of rows. For large tables this is often several times faster than a single connection.
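Hardcoded lowerBound/upperBound values go stale as the table grows. A common refinement is to ask the database for the real bounds first, using the same subquery trick from earlier. A sketch (the Spark calls are commented out because they need a live connection):

```python
def bounds_query(table, column):
    """One-row query returning MIN/MAX of the partition column, wrapped for JDBC."""
    return f"(SELECT MIN({column}) AS lo, MAX({column}) AS hi FROM {table}) AS bounds"

q = bounds_query("SalesLT.SalesOrderDetail", "SalesOrderDetailID")
# row = spark.read.jdbc(url=sql_url, table=q, properties=sql_properties).first()
# df = spark.read.jdbc(
#     url=sql_url, table="SalesLT.SalesOrderDetail",
#     column="SalesOrderDetailID",
#     lowerBound=row["lo"], upperBound=row["hi"],
#     numPartitions=10, properties=sql_properties,
# )
```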

2. Select Only Needed Columns

# ❌ Reads ALL columns (slow for wide tables)
df = spark.read.jdbc(url=sql_url, table="SalesLT.Customer", properties=sql_properties)

# ✅ Read only needed columns using a query
df = read_sql_query(
    "SELECT CustomerID, FirstName, LastName, EmailAddress FROM SalesLT.Customer",
    "customers_slim"
)

3. Push Filters to SQL Server

# ❌ Reads ALL rows, then filters in Spark (slow)
df = spark.read.jdbc(url=sql_url, table="SalesLT.Customer", properties=sql_properties)
df = df.filter(col("CompanyName").like("%Bike%"))

# ✅ Filter on SQL Server side (much faster — less data transferred)
df = read_sql_query(
    "SELECT * FROM SalesLT.Customer WHERE CompanyName LIKE '%Bike%'",
    "bike_customers"
)

4. Batch Writes

# Write with batch size for better performance
df.write.jdbc(
    url=sql_url,
    table="dbo.large_table",
    mode="append",
    properties={**sql_properties, "batchsize": "10000"}  # 10K rows per batch
)
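batchsize has a read-side counterpart, fetchsize, which controls how many rows each network round trip pulls. Both travel in the same properties dictionary. The values below are starting points to tune, not universal defaults, and the credentials are placeholders:

```python
base_props = {
    "user": "sqladmin",            # placeholder; use Key Vault in practice
    "password": "<from-key-vault>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# fetchsize tunes reads, batchsize tunes writes; both are passed as strings
read_props = {**base_props, "fetchsize": "5000"}
write_props = {**base_props, "batchsize": "10000"}
# spark.read.jdbc(url=sql_url, table="SalesLT.Customer", properties=read_props)
# df.write.jdbc(url=sql_url, table="dbo.large_table", mode="append", properties=write_props)
```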

Common Errors and Fixes

| Error | Cause | Fix |
|---|---|---|
| "Cannot open server requested by the login" | Wrong server name or firewall blocking | Verify server URL, enable "Allow Azure services" in SQL firewall |
| "Login failed for user" | Wrong username or password | Reset password in Azure Portal, update Key Vault secret |
| "ORDER BY is invalid in subqueries" | Using ORDER BY in JDBC query | Remove ORDER BY from SQL, use .orderBy() in Spark |
| "The TCP/IP connection has failed" | Firewall blocking port 1433 | Add firewall rule or enable Azure services access |
| "driver not found" | Missing JDBC driver class | Use com.microsoft.sqlserver.jdbc.SQLServerDriver (pre-installed in Databricks) |
| "Table or view not found" | Wrong schema or table name | Verify with SELECT * FROM INFORMATION_SCHEMA.TABLES |
| "File './Config_SQL' not found" | Notebooks not in the same folder | Move all 3 notebooks to the same Databricks folder |
| "truncated data" on write | DataFrame column types wider than SQL column | Cast columns before writing or alter SQL table |

Interview Questions

Q: How do you connect Databricks to Azure SQL Database?
A: Using JDBC. Build a JDBC URL with the server name, port 1433, and database name. Create a properties dictionary with username, password (from Key Vault), and the SQL Server JDBC driver class. Use spark.read.jdbc() to read and df.write.jdbc() to write.

Q: Why can't you use ORDER BY in JDBC queries?
A: Because Spark wraps your query as a subquery: SELECT * FROM (your query) AS alias. SQL Server prohibits ORDER BY inside subqueries unless TOP or OFFSET is used. The fix is to sort on the Spark side using .orderBy() after reading the data.

Q: How do you optimize reading large SQL tables via JDBC?
A: Use partitioned reads with column, lowerBound, upperBound, and numPartitions parameters. This creates multiple parallel JDBC connections, each reading a range of the partition column. Also push filters and column selection into the SQL query to reduce data transfer.

Q: What is the three-notebook architecture?
A: Config_SQL (credentials from Key Vault + connection setup), Functions_SQL (reusable read/write/validate functions), and SQL_Operations (actual work using %run ./Config_SQL and %run ./Functions_SQL). This separates concerns, centralizes credentials, and makes functions reusable across multiple worker notebooks.

Q: What write modes are available for JDBC writes?
A: append (add rows), overwrite (drop and recreate table — loses indexes/constraints), ignore (skip if table exists), and error (throw exception if exists, default). For production, use append or implement a staging table + MERGE pattern for upserts.

Wrapping Up

Connecting Databricks to Azure SQL is a fundamental skill for any data engineer working in Azure. The JDBC connection is the bridge between your relational database (OLTP) and your Spark processing engine (big data). Master the connection setup, learn the ORDER BY trap, and adopt the three-notebook architecture for production.

The key takeaways:

  • Always use Key Vault for credentials — never hardcode
  • Never use ORDER BY in JDBC queries — sort in Spark
  • Use the three-notebook pattern — Config, Functions, Operations
  • Push filters to SQL Server — transfer less data
  • Use partitioned reads for large tables — parallel is faster

Related posts:

  • Azure Databricks Introduction and dbutils
  • Databricks Secret Scopes
  • Connecting to Blob/ADLS Gen2
  • Azure SQL Database Guide
  • PySpark Transformations Cookbook


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
