Connecting Azure Databricks to Azure SQL Database: JDBC Read, Write, and Production Patterns
You have data in Azure SQL Database. You have a Databricks notebook ready to transform it. But between them is a JDBC connection that needs the right URL, the right driver, the right credentials, and the right syntax — get any one wrong and you get cryptic Java exceptions that make no sense.
This post walks through everything: connecting Databricks to Azure SQL via JDBC, reading tables and custom queries, writing data back, the ORDER BY subquery trap that catches everyone, and a production-ready three-notebook architecture with reusable functions.
Think of JDBC as a phone call between Databricks and SQL Server. You need the right phone number (server URL), the right extension (port 1433), the correct name to ask for (database), and your ID badge (credentials). Get the phone number wrong and you hear static. Forget your badge and security hangs up on you.
Table of Contents
- What Is JDBC and Why Databricks Needs It
- The JDBC Connection URL
- Method 1: Direct Connection (Quick Start)
- Method 2: Secure Connection with Key Vault
- Reading SQL Tables into DataFrames
- Reading with Custom SQL Queries
- The ORDER BY Subquery Trap
- Writing DataFrames to SQL Tables
- Write Modes: Append, Overwrite, Ignore
- Upsert (MERGE) Pattern
- The Three-Notebook Architecture (Production)
- Notebook 1: Config_SQL (Connection Setup)
- Notebook 2: Functions_SQL (Reusable Functions)
- Notebook 3: SQL_Operations (Worker)
- Data Quality Functions
- End-to-End Pipeline: SQL to ADLS to SQL
- Firewall Configuration
- Performance Optimization
- Common Errors and Fixes
- Interview Questions
- Wrapping Up
What Is JDBC and Why Databricks Needs It
JDBC (Java Database Connectivity) is how Spark connects to relational databases. Unlike ADLS Gen2 where Spark reads files directly, SQL databases require a protocol-level connection — Spark sends SQL commands over JDBC and the database returns result sets.
Databricks (Spark)                     Azure SQL Database
┌──────────────┐                      ┌──────────────────┐
│              │   JDBC Connection    │                  │
│  spark.read  │ ──── Port 1433 ───>  │ SalesLT.Customer │
│  .jdbc()     │ <─── Result Set ───  │    (847 rows)    │
│              │                      │                  │
└──────────────┘                      └──────────────────┘
Real-life analogy: Reading from ADLS is like picking up a book from a shelf — you just grab it. Reading from SQL via JDBC is like calling a librarian — you ask for a specific book, they find it, and read it to you over the phone.
The JDBC Connection URL
Every SQL connection needs a JDBC URL:
jdbc:sqlserver://SERVER_NAME:1433;database=DATABASE_NAME;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30
Breaking It Down
| Part | Value | Meaning |
|---|---|---|
| `jdbc:sqlserver://` | Protocol prefix | “I’m using SQL Server JDBC driver” |
| `SERVER_NAME` | `sql-dataplatform-dev.database.windows.net` | Your SQL server’s address |
| `:1433` | Port number | SQL Server’s default port |
| `database=` | `AdventureWorksLT` | Which database to connect to |
| `encrypt=true` | Security setting | Encrypt the connection |
| `loginTimeout=30` | Timeout in seconds | Give up after 30 seconds |
The JDBC Driver
Spark uses the Microsoft SQL Server JDBC driver. In Databricks, it is pre-installed. In local PySpark, you must download it.
# The driver class name — same for all SQL Server connections
driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
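If you are running PySpark locally rather than in Databricks, one way to pull the driver is through spark.jars.packages at session startup. A minimal sketch; the Maven coordinate version shown is illustrative, so check for the latest mssql-jdbc release:
from pyspark.sql import SparkSession
# Download the SQL Server JDBC driver from Maven when the session starts
# (version shown is illustrative; use the latest mssql-jdbc release)
spark = (
    SparkSession.builder
    .appName("local-sql-test")
    .config("spark.jars.packages", "com.microsoft.sqlserver:mssql-jdbc:12.4.2.jre11")
    .getOrCreate()
)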
Method 1: Direct Connection (Quick Start)
For learning and quick testing:
# Connection details
jdbc_url = "jdbc:sqlserver://sql-dataplatform-dev.database.windows.net:1433;database=AdventureWorksLT;encrypt=true"
connection_properties = {
"user": "sqladmin",
"password": "YourPassword123!", # ❌ NEVER hardcode in production
"driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
# Read a table
df = spark.read.jdbc(
url=jdbc_url,
table="SalesLT.Customer",
properties=connection_properties
)
df.show(5)
print(f"Read {df.count()} rows")
This works but is insecure. The password is visible in the notebook. Use Key Vault for anything beyond quick testing.
Method 2: Secure Connection with Key Vault
Prerequisites
Store these secrets in Azure Key Vault:
| Secret Name | Value | Example |
|---|---|---|
| `sql-server-name` | Your SQL server URL | `sql-dataplatform-dev.database.windows.net` |
| `sql-database-name` | Database name | `AdventureWorksLT` |
| `sql-username` | SQL admin username | `sqladmin` |
| `sql-password` | SQL admin password | `P@ssw0rd!2026` |
Connect Using Key Vault
# Read credentials from Key Vault (never visible in notebook)
scope = "keyvault-scope"
server = dbutils.secrets.get(scope, "sql-server-name")
database = dbutils.secrets.get(scope, "sql-database-name")
username = dbutils.secrets.get(scope, "sql-username")
password = dbutils.secrets.get(scope, "sql-password")
# Build JDBC URL
sql_url = f"jdbc:sqlserver://{server}:1433;database={database};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30"
# Build properties dictionary
sql_properties = {
"user": username,
"password": password,
"driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
# Verify (values are [REDACTED] in output)
print(f"SQL URL: {sql_url}")
print(f"SQL Properties keys: {list(sql_properties.keys())}")
# Test connection
df_test = spark.read.jdbc(url=sql_url, table="SalesLT.Customer", properties=sql_properties)
print(f"Connected! Read {df_test.count()} rows from SalesLT.Customer")
Real-life analogy: Method 1 is like writing your building access code on a sticky note on your laptop. Method 2 is like using a key card stored in a secure safe — you retrieve it when needed, use it, and it is never visible to anyone looking at your screen.
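If the first dbutils.secrets.get call fails, check that the scope and secret names actually exist. The secrets utility can list both (names only; the values themselves stay hidden):
# List available scopes, then the secret names inside one
print([s.name for s in dbutils.secrets.listScopes()])
print([s.key for s in dbutils.secrets.list("keyvault-scope")])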
Reading SQL Tables into DataFrames
Read an Entire Table
# Read full table
df_customers = spark.read.jdbc(
url=sql_url,
table="SalesLT.Customer",
properties=sql_properties
)
df_customers.show(5)
df_customers.printSchema()
print(f"Rows: {df_customers.count()}, Columns: {len(df_customers.columns)}")
Read Multiple Tables
# Read several tables
tables = ["SalesLT.Customer", "SalesLT.Product", "SalesLT.Address", "SalesLT.SalesOrderHeader"]
for table in tables:
df = spark.read.jdbc(url=sql_url, table=table, properties=sql_properties)
print(f"{table}: {df.count()} rows, {len(df.columns)} columns")
Reading with Custom SQL Queries
The Subquery Syntax
Spark wraps your query as a subquery. You must provide an alias:
# Custom query — wrap in parentheses with an alias
query = "(SELECT TOP 10 ProductID, Name, ListPrice FROM SalesLT.Product ORDER BY ListPrice DESC) AS top_products"
df = spark.read.jdbc(
url=sql_url,
table=query, # Not actually a table — it's a subquery
properties=sql_properties
)
df.show(truncate=False)
As a Reusable Function
def read_sql_query(query, alias="query_result"):
'''Read data using a custom SQL query'''
df = spark.read.jdbc(
url=sql_url,
table=f"({query}) AS {alias}",
properties=sql_properties
)
return df
# Usage
df = read_sql_query(
"SELECT CustomerID, FirstName, LastName, CompanyName FROM SalesLT.Customer WHERE CompanyName LIKE '%Bike%'",
"bike_customers"
)
df.show()
The ORDER BY Subquery Trap
This is the #1 error that catches everyone:
# ❌ THIS FAILS
query = '''(
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
ORDER BY TABLE_NAME
) AS tables'''
df = spark.read.jdbc(url=sql_url, table=query, properties=sql_properties)
# ERROR: The ORDER BY clause is invalid in views, inline functions,
# derived tables, subqueries, and common table expressions,
# unless TOP, OFFSET or FOR XML is also specified.
Why It Fails
Spark wraps your query inside another SELECT:
SELECT * FROM (
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
ORDER BY TABLE_NAME -- ❌ ORDER BY inside subquery = illegal in SQL Server
) AS tables
SQL Server does not allow ORDER BY inside a subquery unless you use TOP, OFFSET, or FOR XML.
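That last clause is the escape hatch: when TOP (or OFFSET) is present, SQL Server accepts the ORDER BY because it is needed to define which rows qualify. So a TOP-N query like this works fine over JDBC:
# ✅ ORDER BY is legal here because TOP is also specified
query = "(SELECT TOP 5 TABLE_SCHEMA, TABLE_NAME FROM INFORMATION_SCHEMA.TABLES ORDER BY TABLE_NAME) AS first_tables"
df = spark.read.jdbc(url=sql_url, table=query, properties=sql_properties)
df.show()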
The Fix: Sort in Spark
# ✅ Remove ORDER BY from SQL, sort in Spark instead
query = '''(
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
) AS tables'''
df = spark.read.jdbc(url=sql_url, table=query, properties=sql_properties)
df.orderBy("TABLE_NAME").show() # Sort on the Spark side
Rule: Never use ORDER BY in JDBC queries. Always use .orderBy() on the DataFrame after reading.
Real-life analogy: It is like asking a warehouse to ship you boxes sorted by size. The warehouse says “I cannot sort them — I just put them on the truck.” But once the truck arrives (DataFrame), YOU can sort them however you want (.orderBy()). Let the warehouse do what it is good at (filtering, joining) and let Spark do what IT is good at (sorting, aggregating).
Writing DataFrames to SQL Tables
Basic Write
# Create sample data
from pyspark.sql.functions import *
data = [(1, "Laptop", 999.99, 10), (2, "Mouse", 29.99, 50), (3, "Keyboard", 79.99, 30)]
df = spark.createDataFrame(data, ["product_id", "name", "price", "quantity"])
# Write to SQL table
df.write.jdbc(
url=sql_url,
table="dbo.products_from_databricks",
mode="overwrite",
properties=sql_properties
)
print("Written to SQL successfully!")
Verify the Write
# Read back to verify
df_verify = spark.read.jdbc(url=sql_url, table="dbo.products_from_databricks", properties=sql_properties)
df_verify.show()
Write Modes: Append, Overwrite, Ignore
| Mode | What It Does | When to Use |
|---|---|---|
| `append` | Adds rows to existing table | Incremental loads, daily inserts |
| `overwrite` | Drops and recreates the table, writes new data | Full refresh, dev/testing |
| `ignore` | Does nothing if table already exists | Prevent accidental overwrites |
| `error` (default) | Throws error if table exists | Catch unexpected situations |
# Append — add to existing
df.write.jdbc(url=sql_url, table="dbo.sales", mode="append", properties=sql_properties)
# Overwrite — replace entire table
df.write.jdbc(url=sql_url, table="dbo.sales", mode="overwrite", properties=sql_properties)
# Ignore — skip if exists
df.write.jdbc(url=sql_url, table="dbo.sales", mode="ignore", properties=sql_properties)
Warning with overwrite: It DROPS the entire table and recreates it from the DataFrame schema. Indexes, constraints, and permissions on the original table are LOST. For production, use append or the upsert pattern.
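If you need overwrite semantics but want to keep the existing table definition, Spark's JDBC writer has a truncate option: combined with mode overwrite, Spark issues TRUNCATE TABLE and reloads the data instead of dropping the table. A sketch using the explicit option syntax:
# Overwrite the data but keep the table, its indexes, and constraints
(df.write.format("jdbc")
    .option("url", sql_url)
    .option("dbtable", "dbo.sales")
    .option("user", sql_properties["user"])
    .option("password", sql_properties["password"])
    .option("truncate", "true")  # TRUNCATE instead of DROP on overwrite
    .mode("overwrite")
    .save())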
Upsert (MERGE) Pattern
For production updates without losing the table structure:
def upsert_to_sql(df, table_name, key_column):
    '''Insert new rows, update existing rows using MERGE via a staging table'''
    staging_table = f"{table_name}_staging"
    # Step 1: Write to a staging table (overwrite is fine — it's temporary)
    df.write.jdbc(url=sql_url, table=staging_table, mode="overwrite", properties=sql_properties)
    # Step 2: Build the MERGE statement
    columns = df.columns
    update_cols = [c for c in columns if c != key_column]
    update_set = ", ".join([f"target.{c} = source.{c}" for c in update_cols])
    insert_cols = ", ".join(columns)
    insert_vals = ", ".join([f"source.{c}" for c in columns])
    merge_sql = f'''
    MERGE {table_name} AS target
    USING {staging_table} AS source
    ON target.{key_column} = source.{key_column}
    WHEN MATCHED THEN UPDATE SET {update_set}
    WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals});
    '''
    # Step 3: Execute MERGE and drop the staging table. Spark's JDBC reader/writer
    # cannot run arbitrary DML, so open a plain JDBC connection through the driver
    # already on the cluster classpath (via the Py4J gateway).
    conn = spark.sparkContext._gateway.jvm.java.sql.DriverManager.getConnection(
        sql_url, sql_properties["user"], sql_properties["password"]
    )
    try:
        stmt = conn.createStatement()
        stmt.execute(merge_sql)
        stmt.execute(f"DROP TABLE IF EXISTS {staging_table}")
        stmt.close()
    finally:
        conn.close()
    print(f"Upserted {df.count()} rows into {table_name}")
Real-life analogy: The upsert pattern is like a hotel check-in system. New guests (not matched) get assigned a room (INSERT). Returning guests (matched) get their room updated with new preferences (UPDATE). The staging table is the check-in queue — temporary, cleared after everyone is processed.
The Three-Notebook Architecture (Production)
In production, you separate concerns into three notebooks:
1. Config_SQL — Connection setup (credentials from Key Vault)
2. Functions_SQL — Reusable functions (read, write, validate)
3. SQL_Operations — Actual work (%run Config_SQL, %run Functions_SQL)
All three must be in the SAME folder for %run ./NotebookName to work.
Notebook 1: Config_SQL (Connection Setup)
# Config_SQL — Connection configuration from Key Vault
SCOPE = "keyvault-scope"
# Read credentials securely
server = dbutils.secrets.get(SCOPE, "sql-server-name")
database = dbutils.secrets.get(SCOPE, "sql-database-name")
username = dbutils.secrets.get(SCOPE, "sql-username")
password = dbutils.secrets.get(SCOPE, "sql-password")
# Build JDBC URL
sql_url = (
f"jdbc:sqlserver://{server}:1433;"
f"database={database};"
"encrypt=true;"
"trustServerCertificate=false;"
"hostNameInCertificate=*.database.windows.net;"
"loginTimeout=30"
)
# Build properties
sql_properties = {
"user": username,
"password": password,
"driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
# ADLS Config (if also needed)
STORAGE_ACCOUNT = "naveenadlsgen2de"
CONTAINER = "synapse-workspace"
BASE_PATH = f"abfss://{CONTAINER}@{STORAGE_ACCOUNT}.dfs.core.windows.net"
storage_key = dbutils.secrets.get(SCOPE, "adls-storage-key")
spark.conf.set(f"fs.azure.account.key.{STORAGE_ACCOUNT}.dfs.core.windows.net", storage_key)
print("Config loaded: SQL + ADLS connections ready")
Notebook 2: Functions_SQL (Reusable Functions)
# Functions_SQL — Reusable read/write/validate functions
# ── SQL READ FUNCTIONS ──
def read_sql_table(table_name):
'''Read an entire SQL table into a DataFrame'''
df = spark.read.jdbc(url=sql_url, table=table_name, properties=sql_properties)
print(f"Read {df.count()} rows from {table_name}")
return df
def read_sql_query(query, alias="query_result"):
'''Read data using a custom SQL query (never use ORDER BY — sort in Spark)'''
df = spark.read.jdbc(
url=sql_url,
table=f"({query}) AS {alias}",
properties=sql_properties
)
print(f"Query returned {df.count()} rows")
return df
def list_sql_tables(schema="dbo"):
'''List all tables in a schema'''
query = f"SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_TYPE FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = '{schema}'"
return read_sql_query(query, "tables_list").orderBy("TABLE_NAME")
# ── SQL WRITE FUNCTIONS ──
def write_to_sql(df, table_name, mode="append"):
'''Write DataFrame to SQL table (modes: append, overwrite, ignore)'''
df.write.jdbc(url=sql_url, table=table_name, mode=mode, properties=sql_properties)
print(f"Written {df.count()} rows to {table_name} (mode={mode})")
# ── ADLS FUNCTIONS ──
def read_adls_csv(path, delimiter=",", header=True):
'''Read CSV from ADLS Gen2'''
full_path = f"{BASE_PATH}/{path}"
df = spark.read.format("csv").option("header", str(header)).option("delimiter", delimiter).option("inferSchema", "true").load(full_path)
print(f"Read {df.count()} rows from {path}")
return df
def write_adls_parquet(df, path, mode="overwrite"):
'''Write DataFrame as Parquet to ADLS Gen2'''
full_path = f"{BASE_PATH}/{path}"
df.write.mode(mode).parquet(full_path)
print(f"Written {df.count()} rows to {path} as Parquet")
def write_adls_delta(df, path, mode="overwrite", partition_cols=None):
'''Write DataFrame as Delta to ADLS Gen2'''
full_path = f"{BASE_PATH}/{path}"
writer = df.write.format("delta").mode(mode)
if partition_cols:
writer = writer.partitionBy(*partition_cols)
writer.save(full_path)
print(f"Written {df.count()} rows to {path} as Delta")
# ── DATA QUALITY FUNCTIONS ──
def check_nulls(df, columns):
'''Check null counts for specified columns'''
from pyspark.sql.functions import col, count, when
print("Null Analysis:")
for c in columns:
null_count = df.filter(col(c).isNull()).count()
total = df.count()
pct = round(null_count / total * 100, 2) if total > 0 else 0
status = "⚠️" if null_count > 0 else "✓"
print(f" {status} {c}: {null_count} nulls ({pct}%)")
def check_duplicates(df, key_columns):
'''Check for duplicate rows based on key columns'''
from pyspark.sql.functions import col, count
total = df.count()
distinct = df.dropDuplicates(key_columns).count()
dupes = total - distinct
if dupes > 0:
print(f"⚠️ Found {dupes} duplicate rows (out of {total})")
df.groupBy(key_columns).count().filter(col("count") > 1).show()
else:
print(f"✓ No duplicates found ({total} rows, all unique)")
def profile_data(df):
'''Quick data profiling'''
print(f"Rows: {df.count()}")
print(f"Columns: {len(df.columns)}")
df.printSchema()
df.show(5, truncate=False)
df.describe().show()
# ── LOGGING ──
def log_activity(activity, table_name, rows, status, message=""):
'''Log pipeline activity'''
from datetime import datetime
ts = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print(f"[{ts}] {activity} | {table_name} | {rows} rows | {status} | {message}")
print("Functions loaded: read_sql_table, read_sql_query, list_sql_tables, write_to_sql,")
print(" read_adls_csv, write_adls_parquet, write_adls_delta,")
print(" check_nulls, check_duplicates, profile_data, log_activity")
Notebook 3: SQL_Operations (Worker)
# Load config and functions (must be first cells, each in its own cell)
%run ./Config_SQL
%run ./Functions_SQL
# 1. List all tables
df_tables = list_sql_tables()
df_tables.show(truncate=False)
# 2. Read a full table
df_customers = read_sql_table("SalesLT.Customer")
df_customers.show(5)
# 3. Read with custom query
df_top_products = read_sql_query(
"SELECT TOP 10 p.ProductID, p.Name, p.ListPrice, pc.Name AS Category "
"FROM SalesLT.Product p "
"JOIN SalesLT.ProductCategory pc ON p.ProductCategoryID = pc.ProductCategoryID",
"top_products"
)
from pyspark.sql.functions import col  # needed here; the wildcard import only arrives in step 5
df_top_products.orderBy(col("ListPrice").desc()).show(truncate=False)
# 4. Data quality checks
profile_data(df_customers)
check_nulls(df_customers, ["FirstName", "LastName", "EmailAddress", "Phone", "CompanyName"])
check_duplicates(df_customers, ["EmailAddress"])
# 5. Read CSV from ADLS, transform, write to SQL
from pyspark.sql.functions import *
df_sales = read_adls_csv("sales/sales_pipe.csv", delimiter="|")
df_transformed = (
    df_sales
    .withColumn("price", col("price").cast("double"))
    .withColumn("quantity", coalesce(col("quantity").cast("int"), lit(0)))
    .withColumn("total_amount", col("price") * col("quantity"))
    .withColumn("load_date", current_date())
    .filter(col("id").isNotNull())
)
write_to_sql(df_transformed, "dbo.sales_from_databricks", mode="overwrite")
log_activity("Write_SQL", "sales_from_databricks", df_transformed.count(), "SUCCESS")
# 6. SQL to ADLS Gen2 (export as Delta)
df_products = read_sql_table("SalesLT.Product")
df_products = (
    df_products
    .withColumn("export_year", year(current_date()))
    .withColumn("export_month", month(current_date()))
)
write_adls_delta(df_products, "output/products_delta/", partition_cols=["export_year", "export_month"])
log_activity("Export", "SalesLT.Product", df_products.count(), "SUCCESS", "Delta with partitions")
# 7. Save as managed table in catalog
df_products = read_sql_table("SalesLT.Product")
df_products.write.format("delta").mode("overwrite").saveAsTable("sales_db.products_from_sql")
log_activity("Create_Table", "sales_db.products_from_sql", df_products.count(), "SUCCESS")
Firewall Configuration
Azure SQL Database blocks all connections by default. You must allow Databricks:
Option 1: Allow Azure Services (Easiest)
- Azure Portal → SQL Server → Networking
- Toggle Allow Azure services and resources to access this server → Yes
- Save
This allows ALL Azure services (including Databricks) to connect.
Option 2: Add Databricks IPs (More Secure)
- Find your Databricks cluster’s outbound IPs (cluster logs or Azure Databricks docs)
- Add each IP as a firewall rule in SQL Server → Networking
Option 3: Private Endpoint (Production)
Use a private endpoint to connect Databricks to SQL over private network — no public internet.
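Before digging through JDBC stack traces, it can help to confirm the cluster can reach port 1433 at all: a timeout here points to the firewall, while a successful connection points to credentials or naming. A small driver-side check (the server name is illustrative):
import socket
# Runs on the driver node only; enough to distinguish firewall vs. login issues
server_host = "sql-dataplatform-dev.database.windows.net"  # your server
try:
    socket.create_connection((server_host, 1433), timeout=5).close()
    print("Port 1433 is reachable")
except OSError as e:
    print(f"Cannot reach {server_host}:1433 - likely a firewall rule ({e})")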
Performance Optimization
1. Use Partitioned Reads for Large Tables
# Read in parallel partitions (much faster for large tables)
df = spark.read.jdbc(
url=sql_url,
table="SalesLT.SalesOrderDetail",
column="SalesOrderDetailID", # Partition column (numeric, evenly distributed)
lowerBound=1, # Min value of partition column
upperBound=100000, # Max value
numPartitions=10, # Number of parallel readers
properties=sql_properties
)
This creates 10 parallel JDBC connections, each reading one range of the partition column. For large tables this can be close to 10x faster than a single connection.
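The bounds above are hardcoded; rows outside them are still read, but skewed or stale bounds produce unbalanced partitions. One common refinement is to query the real min/max first (this sketch reuses the same connection settings):
# Fetch the actual ID range so partitions match the data
bounds = spark.read.jdbc(
    url=sql_url,
    table="(SELECT MIN(SalesOrderDetailID) AS lo, MAX(SalesOrderDetailID) AS hi FROM SalesLT.SalesOrderDetail) AS b",
    properties=sql_properties
).first()
df = spark.read.jdbc(
    url=sql_url,
    table="SalesLT.SalesOrderDetail",
    column="SalesOrderDetailID",
    lowerBound=bounds["lo"],
    upperBound=bounds["hi"],
    numPartitions=10,
    properties=sql_properties
)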
2. Select Only Needed Columns
# ❌ Reads ALL columns (slow for wide tables)
df = spark.read.jdbc(url=sql_url, table="SalesLT.Customer", properties=sql_properties)
# ✅ Read only needed columns using a query
df = read_sql_query(
"SELECT CustomerID, FirstName, LastName, EmailAddress FROM SalesLT.Customer",
"customers_slim"
)
3. Push Filters to SQL Server
# ❌ Reads ALL rows, then filters in Spark (slow)
df = spark.read.jdbc(url=sql_url, table="SalesLT.Customer", properties=sql_properties)
df = df.filter(col("CompanyName").like("%Bike%"))
# ✅ Filter on SQL Server side (much faster — less data transferred)
df = read_sql_query(
"SELECT * FROM SalesLT.Customer WHERE CompanyName LIKE '%Bike%'",
"bike_customers"
)
4. Batch Writes
# Write with batch size for better performance
df.write.jdbc(
url=sql_url,
table="dbo.large_table",
mode="append",
properties={**sql_properties, "batchsize": "10000"} # 10K rows per batch
)
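Each DataFrame partition opens its own JDBC connection on write, so the partition count controls write parallelism and the load on the database. A sketch (8 is an illustrative number; tune it for your SQL tier):
# More partitions = more concurrent JDBC writers (and more load on SQL)
df.repartition(8).write.jdbc(
    url=sql_url,
    table="dbo.large_table",
    mode="append",
    properties={**sql_properties, "batchsize": "10000"}
)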
Common Errors and Fixes
| Error | Cause | Fix |
|---|---|---|
| “Cannot open server requested by the login” | Wrong server name or firewall blocking | Verify server URL, enable “Allow Azure services” in SQL firewall |
| “Login failed for user” | Wrong username or password | Reset password in Azure Portal, update Key Vault secret |
| “ORDER BY is invalid in subqueries” | Using ORDER BY in JDBC query | Remove ORDER BY from SQL, use .orderBy() in Spark |
| “The TCP/IP connection has failed” | Firewall blocking port 1433 | Add firewall rule or enable Azure services access |
| “driver not found” | Missing JDBC driver class | Use com.microsoft.sqlserver.jdbc.SQLServerDriver (pre-installed in Databricks) |
| “Table or view not found” | Wrong schema or table name | Verify with SELECT * FROM INFORMATION_SCHEMA.TABLES |
| “File ‘./Config_SQL’ not found” | Notebooks not in the same folder | Move all 3 notebooks to the same Databricks folder |
| “truncated data” on write | DataFrame column types wider than SQL column | Cast columns before writing or alter SQL table |
Interview Questions
Q: How do you connect Databricks to Azure SQL Database?
A: Using JDBC. Build a JDBC URL with the server name, port 1433, and database name. Create a properties dictionary with username, password (from Key Vault), and the SQL Server JDBC driver class. Use spark.read.jdbc() to read and df.write.jdbc() to write.
Q: Why can’t you use ORDER BY in JDBC queries?
A: Because Spark wraps your query as a subquery: SELECT * FROM (your query) AS alias. SQL Server prohibits ORDER BY inside subqueries unless TOP or OFFSET is used. The fix is to sort on the Spark side using .orderBy() after reading the data.
Q: How do you optimize reading large SQL tables via JDBC?
A: Use partitioned reads with column, lowerBound, upperBound, and numPartitions parameters. This creates multiple parallel JDBC connections, each reading a range of the partition column. Also push filters and column selection into the SQL query to reduce data transfer.
Q: What is the three-notebook architecture?
A: Config_SQL (credentials from Key Vault + connection setup), Functions_SQL (reusable read/write/validate functions), and SQL_Operations (actual work using %run ./Config_SQL and %run ./Functions_SQL). This separates concerns, centralizes credentials, and makes functions reusable across multiple worker notebooks.
Q: What write modes are available for JDBC writes?
A: append (add rows), overwrite (drop and recreate table — loses indexes/constraints), ignore (skip if table exists), and error (throw exception if exists, default). For production, use append or implement a staging table + MERGE pattern for upserts.
Wrapping Up
Connecting Databricks to Azure SQL is a fundamental skill for any data engineer working in Azure. The JDBC connection is the bridge between your relational database (OLTP) and your Spark processing engine (big data). Master the connection setup, learn the ORDER BY trap, and adopt the three-notebook architecture for production.
The key takeaways:
- Always use Key Vault for credentials — never hardcode
- Never use ORDER BY in JDBC queries — sort in Spark
- Use the three-notebook pattern — Config, Functions, Operations
- Push filters to SQL Server — transfer less data
- Use partitioned reads for large tables — parallel is faster
Related posts:
- Azure Databricks Introduction and dbutils
- Databricks Secret Scopes
- Connecting to Blob/ADLS Gen2
- Azure SQL Database Guide
- PySpark Transformations Cookbook
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.