Azure SQL Database for Data Engineers: Setup, Configuration, and Pipeline Integration
Azure SQL Database is the relational database you will use most often in Azure data engineering projects. It is where your metadata tables, audit logs, config tables, and structured data live. Every pipeline we have built in this blog series reads from or writes to Azure SQL Database.
Yet many data engineers treat it as a black box — they know how to write queries but cannot explain the difference between DTUs and vCores, do not understand service tiers, and struggle with authentication and firewall configuration.
This post gives you the complete picture of Azure SQL Database from a data engineering perspective — not DBA-level depth, but everything you need to create databases, configure access, optimize performance, and integrate with your pipelines.
Table of Contents
- What Is Azure SQL Database?
- Azure SQL Database vs SQL Server vs Managed Instance
- Creating an Azure SQL Database
- Service Tiers and Compute Models
- Authentication: SQL Auth vs Azure AD
- Firewall Rules and Network Access
- Connecting from Different Tools
- Key SQL Server Features Data Engineers Use
- Integrating with ADF and Synapse Pipelines
- Performance Monitoring Basics
- Cost Optimization
- Backup and Recovery
- Interview Questions
- Wrapping Up
What Is Azure SQL Database?
Azure SQL Database is a fully managed relational database service built on the SQL Server engine. Microsoft handles patching, backups, high availability, and the underlying infrastructure. You focus on your data and queries.
Key characteristics:
- Fully managed — no OS, no patching, no infrastructure
- Built-in high availability — 99.99% SLA with automatic failover
- Automatic backups — point-in-time restore up to 35 days
- Elastic scaling — scale up/down without downtime
- SQL Server compatible — same T-SQL syntax, same tools (SSMS, Azure Data Studio)
- Pay per use — choose between DTU and vCore pricing models
Azure SQL Database vs SQL Server vs Managed Instance
| Feature | Azure SQL Database | Azure SQL Managed Instance | SQL Server (on-prem/VM) |
|---|---|---|---|
| Management | Fully managed (PaaS) | Fully managed (PaaS) | You manage everything |
| SQL Server features | Most features | Near 100% compatibility | Full features |
| Cross-database queries | Limited (elastic query) | Supported | Supported |
| SQL Agent jobs | Not available | Available | Available |
| Linked servers | Not supported | Supported | Supported |
| CLR integration | Limited | Supported | Full support |
| Cost | Lowest | Higher | Depends on licensing |
| Best for | Single databases, new projects | Lift-and-shift from on-prem | Full control, legacy apps |
For data engineering: Azure SQL Database is the best choice for new projects. Use Managed Instance only if you need cross-database queries, SQL Agent, or near-100% SQL Server compatibility for migration.
Creating an Azure SQL Database
Step 1: Create a SQL Server (Logical Server)
The “server” in Azure SQL is a logical container — not a physical machine. It holds databases, firewall rules, and login credentials.
- Azure Portal > Create a resource > SQL Server (logical server)
- Configure:
  - Server name: `sql-dataplatform-dev` (globally unique, becomes `sql-dataplatform-dev.database.windows.net`)
  - Region: same as your ADF/Synapse workspace
  - Authentication: SQL authentication (set admin username and password) or Azure AD only
- Click Review + Create
Step 2: Create the Database
- Azure Portal > Create a resource > SQL Database
- Configure:
  - Database name: `AdventureWorksLT` (or your project name)
  - Server: select the server you created
  - Want to use SQL elastic pool: No (for single databases)
  - Compute + storage: click Configure database
    - For dev/test: Basic (5 DTUs, ~$5/month) or Serverless (auto-pause)
    - For production: General Purpose vCore
- Under Additional settings:
  - Use existing data: Sample (loads AdventureWorksLT sample data — great for learning)
- Click Review + Create
Step 3: Configure Firewall
By default, no external connections are allowed. You must add firewall rules:
- Go to your SQL Server > Networking
- Under Firewall rules:
- Click Add your client IPv4 address (adds your current IP)
- Toggle Allow Azure services and resources to access this server to Yes (required for ADF/Synapse)
- Click Save
Service Tiers and Compute Models
DTU Model (Simpler)
DTU (Database Transaction Unit) bundles CPU, memory, and I/O into a single metric:
| Tier | DTUs | Storage | Cost/Month | Use Case |
|---|---|---|---|---|
| Basic | 5 | 2 GB | ~$5 | Dev/test, tiny workloads |
| Standard S0 | 10 | 250 GB | ~$15 | Light production |
| Standard S3 | 100 | 250 GB | ~$150 | Medium workloads |
| Premium P1 | 125 | 500 GB | ~$465 | High-performance |
vCore Model (More Control)
You choose CPU cores and storage independently:
| Tier | Description | Use Case |
|---|---|---|
| General Purpose | Balanced compute and storage | Most production workloads |
| Business Critical | High I/O, built-in read replicas | Mission-critical, low-latency |
| Hyperscale | Auto-scaling storage up to 100 TB | Very large databases |
Serverless Compute (Cost Saver)
The database auto-pauses when not in use and auto-resumes on the first connection:
- Min vCores: 0.5 (auto-pause when idle)
- Max vCores: up to 80
- Auto-pause delay: 1 hour (configurable)
- You pay only for compute used + storage
Perfect for: Dev/test databases, intermittent workloads, metadata databases that are only accessed during pipeline runs.
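As a rough illustration of why auto-pause matters, here is a back-of-the-envelope cost sketch in Python. The per-vCore-second rate below is a placeholder, not a real Azure price; check the pricing page for your region.

```python
# Rough serverless cost sketch. RATE_PER_VCORE_SECOND is a made-up
# placeholder, not an actual Azure price; serverless is billed per
# vCore-second of active compute, and paused time costs nothing.
RATE_PER_VCORE_SECOND = 0.000145  # assumed rate for illustration

def serverless_compute_cost(active_hours_per_day: float, avg_vcores: float,
                            days: int = 30) -> float:
    """Estimate monthly compute cost for the active (non-paused) time."""
    active_seconds = active_hours_per_day * 3600 * days
    return active_seconds * avg_vcores * RATE_PER_VCORE_SECOND

# A metadata DB active ~2 hours/day at 1 vCore vs. running 24/7:
intermittent = serverless_compute_cost(2, 1.0)
always_on = serverless_compute_cost(24, 1.0)
print(f"intermittent: ${intermittent:.2f}, always-on: ${always_on:.2f}")
```

The gap between the two numbers is the whole argument for serverless on pipeline-driven metadata databases.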
Authentication: SQL Auth vs Azure AD
SQL Authentication (Simple)
Traditional username/password:
Server: sql-dataplatform-dev.database.windows.net
Database: AdventureWorksLT
Username: sqladmin
Password: MyP@ssw0rd123
Pros: Simple to set up, works from anywhere. Cons: Password management, no MFA, credentials in linked services.
Azure AD (Entra ID) Authentication (Recommended)
Uses Azure Active Directory identities:
-- Create an Azure AD user in the database
CREATE USER [naveen@company.com] FROM EXTERNAL PROVIDER;
ALTER ROLE db_datareader ADD MEMBER [naveen@company.com];
ALTER ROLE db_datawriter ADD MEMBER [naveen@company.com];
Pros: No passwords to manage, MFA support, Managed Identity support. Cons: Requires Azure AD setup, slightly more complex initial configuration.
Managed Identity (Best for Pipelines)
ADF/Synapse uses its managed identity to connect — no credentials needed:
- Set ADF managed identity as Azure AD admin on the SQL server, OR
- Create the managed identity as a database user:
CREATE USER [naveen-datafactory-de] FROM EXTERNAL PROVIDER;
ALTER ROLE db_datareader ADD MEMBER [naveen-datafactory-de];
ALTER ROLE db_datawriter ADD MEMBER [naveen-datafactory-de];
ALTER ROLE db_ddladmin ADD MEMBER [naveen-datafactory-de];
In the ADF linked service, select Managed Identity as the authentication type. No password, no key rotation, no secrets.
Firewall Rules and Network Access
Public Access (Default)
The database is accessible over the internet. Firewall rules control which IPs can connect:
Server firewall rules:
- My office IP: 203.0.113.50
- My home IP: 198.51.100.25
- Allow Azure services: Yes (required for ADF)
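Server-level firewall rules are inclusive IPv4 ranges (a single allowed IP is a range where start equals end). A small stdlib sketch of how a client IP is matched, using the example IPs above; the "Allow Azure services" toggle is a separate mechanism not modeled here:

```python
import ipaddress

# Server-level firewall rules are inclusive start-to-end IPv4 ranges.
# These mirror the example rules above; a single IP has start == end.
rules = [
    ("office", "203.0.113.50", "203.0.113.50"),
    ("home", "198.51.100.25", "198.51.100.25"),
]

def is_allowed(client_ip: str) -> bool:
    """Return True if client_ip falls inside any firewall rule range."""
    ip = ipaddress.IPv4Address(client_ip)
    return any(
        ipaddress.IPv4Address(start) <= ip <= ipaddress.IPv4Address(end)
        for _, start, end in rules
    )

print(is_allowed("203.0.113.50"))  # office IP
print(is_allowed("192.0.2.10"))    # unknown IP
```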
Private Access (Production)
Use Private Endpoints to make the database accessible only from your Virtual Network:
- Go to SQL Server > Networking > Private endpoint connections
- Create a private endpoint in your VNet
- The database gets a private IP address (e.g., 10.0.1.5)
- Disable public access
ADF/Synapse connects via Managed VNet with a managed private endpoint.
Connecting from Different Tools
SQL Server Management Studio (SSMS)
Server: sql-dataplatform-dev.database.windows.net
Authentication: SQL Server Authentication (or Azure AD)
Login: sqladmin
Password: ****
Azure Data Studio
Same connection details as SSMS. Lighter weight, cross-platform, supports notebooks.
Python (SQLAlchemy + pyodbc)
from sqlalchemy import create_engine
engine = create_engine(
'mssql+pyodbc://sqladmin:password@sql-dataplatform-dev.database.windows.net/AdventureWorksLT'
'?driver=ODBC+Driver+18+for+SQL+Server'
'&encrypt=yes'
'&TrustServerCertificate=no'
)
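One gotcha with this URL format: special characters in the password (such as @) must be URL-encoded, or the URL parses incorrectly. A stdlib sketch using the example password from earlier:

```python
from urllib.parse import quote_plus

# Passwords containing @, :, or / must be URL-encoded before being
# embedded in a SQLAlchemy/pyodbc connection URL, otherwise the parser
# treats them as URL delimiters.
password = "MyP@ssw0rd123"  # example credential from above
safe_password = quote_plus(password)

url = (
    f"mssql+pyodbc://sqladmin:{safe_password}"
    "@sql-dataplatform-dev.database.windows.net/AdventureWorksLT"
    "?driver=ODBC+Driver+18+for+SQL+Server&encrypt=yes"
)
print(safe_password)  # MyP%40ssw0rd123
```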
ADF/Synapse Linked Service
- Manage > Linked services > + New > Azure SQL Database
- Server: `sql-dataplatform-dev.database.windows.net`
- Database: `AdventureWorksLT`
- Authentication: Managed Identity (recommended) or SQL auth
- Test connection > Create
Key SQL Server Features Data Engineers Use
Stored Procedures
Encapsulate reusable SQL logic:
CREATE PROCEDURE sp_insert_audit_log
@rows_read INT,
@table_name VARCHAR(100),
@error_message VARCHAR(500)
AS
BEGIN
INSERT INTO audit (rows_read, table_name, error_message, load_date)
VALUES (@rows_read, @table_name, @error_message, GETDATE());
END;
Called from ADF via the Stored Procedure activity.
Views
Create reusable query abstractions:
CREATE VIEW v_active_customers AS
SELECT customer_id, first_name, last_name, email
FROM customers
WHERE is_active = 1;
Indexes
Speed up queries on large tables:
-- Index on the column used in WHERE clauses
CREATE INDEX IX_orders_customer_id ON orders (customer_id);
-- Composite index for multi-column filters
CREATE INDEX IX_orders_date_customer ON orders (order_date, customer_id);
-- Index on the watermark column for incremental loads
CREATE INDEX IX_employees_empid ON employees (empid);
Temporal Tables (System-Versioned)
Automatically track row changes over time:
CREATE TABLE customers (
id INT PRIMARY KEY,
name VARCHAR(100),
email VARCHAR(100),
valid_from DATETIME2 GENERATED ALWAYS AS ROW START,
valid_to DATETIME2 GENERATED ALWAYS AS ROW END,
PERIOD FOR SYSTEM_TIME (valid_from, valid_to)
) WITH (SYSTEM_VERSIONING = ON);
Query historical data:
-- What did the customer record look like on March 1?
SELECT * FROM customers
FOR SYSTEM_TIME AS OF '2026-03-01';
Dynamic Data Masking
Hide sensitive data from non-privileged users:
ALTER TABLE customers
ALTER COLUMN email ADD MASKED WITH (FUNCTION = 'email()');
ALTER TABLE customers
ALTER COLUMN phone ADD MASKED WITH (FUNCTION = 'partial(0,"XXX-XXX-",4)');
Non-admin users see: aXXX@XXXX.com and XXX-XXX-1234.
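To make the mask output concrete, here is a small Python illustration of what the two mask functions display. This is a local re-implementation of the visible behavior for demonstration, not Azure code:

```python
def mask_email(email: str) -> str:
    """Mimic the email() mask: first letter, then XXX@XXXX.com."""
    return email[0] + "XXX@XXXX.com"

def mask_partial(value: str, prefix: int, padding: str, suffix: int) -> str:
    """Mimic partial(prefix, padding, suffix): keep the first `prefix`
    and last `suffix` characters, replace the middle with `padding`."""
    return value[:prefix] + padding + (value[-suffix:] if suffix else "")

print(mask_email("alice@contoso.com"))                 # aXXX@XXXX.com
print(mask_partial("416-555-1234", 0, "XXX-XXX-", 4))  # XXX-XXX-1234
```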
Integrating with ADF and Synapse Pipelines
As a Source (Reading Data)
- Lookup activity: Read metadata/config tables
- Copy activity source: Copy table data to ADLS Gen2
- Stored Procedure: Execute queries and return results
As a Sink (Writing Data)
- Copy activity sink: Write data from other sources into SQL tables
- Stored Procedure: Insert audit logs, update watermarks
- Script activity: Run DDL or DML statements
The Metadata + Audit Pattern
The pattern we have used throughout this blog series:
Azure SQL Database stores:
- metadata table (which tables to copy)
- config table (watermarks for incremental load)
- audit table (pipeline execution logs)
ADF/Synapse reads metadata, copies data to ADLS Gen2,
and writes audit logs back to Azure SQL.
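The watermark half of this pattern can be sketched in a few lines. The table and column names below are illustrative, not a fixed schema:

```python
# Illustrative watermark logic for the metadata-driven pattern above.
# Table and column names are examples, not a prescribed schema.
def build_incremental_query(table: str, watermark_col: str,
                            last_watermark: str) -> str:
    """Build the delta query a pipeline would run against the source.
    In a real pipeline, pass the watermark as a query parameter rather
    than formatting it into the string, to avoid SQL injection."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE {watermark_col} > '{last_watermark}'"
    )

q = build_incremental_query("orders", "modified_date", "2026-04-01T00:00:00")
print(q)
```

After each successful copy, the pipeline writes the new maximum watermark value back to the config table so the next run picks up only newer rows.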
Performance Monitoring Basics
Query Performance Insight
Built into the Azure Portal:
- Go to your database > Query Performance Insight
- See top resource-consuming queries
- Identify slow queries and optimization opportunities
Key DMVs (Dynamic Management Views)
-- Top 10 queries by average CPU time (total_worker_time is in microseconds)
SELECT TOP 10
qs.total_worker_time / qs.execution_count as avg_cpu,
qs.execution_count,
SUBSTRING(st.text, 1, 200) as query_text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY avg_cpu DESC;
-- Current connections
SELECT COUNT(*) as connection_count
FROM sys.dm_exec_connections;
-- Database size
SELECT
SUM(size * 8 / 1024) as size_mb
FROM sys.database_files;
Automatic Tuning
Azure SQL Database automatically:
- Creates missing indexes
- Drops unused indexes
- Forces good query plans (plan regression fix)
Enable in Database > Automatic tuning > turn on all options.
Cost Optimization
- Use Serverless for dev/test (auto-pauses when idle)
- Right-size DTUs/vCores — most dev databases are over-provisioned
- Reserved capacity for production (save up to 65% with 3-year reservation)
- Elastic pools if you have multiple databases with variable usage
- Short-term retention — default backup retention is 7 days. Only increase if needed.
- Scale down during off-hours using Azure Automation or Logic Apps
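The off-hours scaling idea boils down to a schedule decision. A minimal sketch, with assumed tier names and an assumed business-hours window; adjust both to your workload:

```python
from datetime import datetime

# Pick a target service objective by time of day. The tier names and
# the business-hours window here are assumptions for illustration.
def target_tier(now: datetime) -> str:
    """Return 'S3' during weekday business hours, 'S0' otherwise."""
    business_hours = 8 <= now.hour < 18 and now.weekday() < 5
    return "S3" if business_hours else "S0"

# An Azure Automation runbook or Logic App would feed this value to
# something like `az sql db update --service-objective <tier>`.
print(target_tier(datetime(2026, 4, 6, 10, 0)))  # Monday 10:00 -> S3
print(target_tier(datetime(2026, 4, 6, 22, 0)))  # Monday 22:00 -> S0
```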
Backup and Recovery
Azure SQL Database handles backups automatically:
| Feature | Details |
|---|---|
| Full backups | Weekly |
| Differential backups | Every 12-24 hours |
| Transaction log backups | Every 5-10 minutes |
| Retention | 7-35 days (configurable) |
| Point-in-time restore | Any second within retention period |
| Geo-redundant backup | Optional (copies to paired region) |
| Long-term retention | Up to 10 years (for compliance) |
Restore to a Point in Time
az sql db restore \
  --dest-name AdventureWorksLT_restored \
  --resource-group rg-dev \
  --server sql-dataplatform-dev \
  --name AdventureWorksLT \
  --time "2026-04-07T10:00:00Z"
This creates a NEW database with data as of the specified time.
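Before running a restore, it is worth sanity-checking that the requested time falls inside the retention window. A minimal sketch, assuming the default 7-day point-in-time retention:

```python
from datetime import datetime, timedelta

def can_restore(requested: datetime, now: datetime,
                retention_days: int = 7) -> bool:
    """Point-in-time restore only works within the retention window
    and never for a future timestamp. Defaults to 7-day retention."""
    return now - timedelta(days=retention_days) <= requested <= now

now = datetime(2026, 4, 8, 12, 0)
print(can_restore(datetime(2026, 4, 7, 10, 0), now))  # inside window
print(can_restore(datetime(2026, 3, 1, 0, 0), now))   # too old
```

Note this is only a window check; the exact earliest restore point also depends on when the oldest retained backup was taken.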
Interview Questions
Q: What is Azure SQL Database? A: A fully managed relational database service based on SQL Server. Microsoft handles infrastructure, patching, backups, and high availability. You manage your data and queries.
Q: What is the difference between DTU and vCore pricing? A: DTU bundles CPU, memory, and I/O into a single metric — simpler to understand. vCore lets you choose CPU and storage independently — more control. vCore also supports Serverless (auto-pause) and reserved capacity discounts.
Q: How does ADF connect to Azure SQL Database? A: Through a Linked Service. Authentication can be SQL authentication (username/password), Azure AD, or Managed Identity (recommended — no credentials to manage).
Q: Why would you use Managed Identity instead of SQL authentication? A: No passwords to store, rotate, or risk leaking. ADF authenticates using its Azure-managed identity automatically. More secure and easier to maintain.
Q: What is the Serverless compute tier? A: The database auto-pauses after idle time and auto-resumes on first connection. You pay only for compute used. Perfect for dev/test databases and intermittent workloads.
Wrapping Up
Azure SQL Database is the workhorse behind every Azure data pipeline. Understanding service tiers, authentication, firewall configuration, and pipeline integration gives you the foundation to build production-grade data platforms.
For data engineers, the key skills are: creating databases with the right tier, configuring Managed Identity access for ADF/Synapse, writing stored procedures for audit logging, and understanding the metadata/config/audit pattern.
Related posts:
- What is Azure Data Factory?
- Metadata-Driven Pipeline in ADF
- Audit Logging in Data Pipelines
- Azure Fundamentals
- Synapse Pipeline with Audit Logging
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.