Azure Synapse Analytics Workspace Setup Guide: From Creation to Your First Pipeline
Azure Synapse Analytics is Microsoft’s answer to the question: “What if we combined Azure Data Factory, a data lake, a SQL data warehouse, and Apache Spark into one unified platform?”
Before Synapse, building a data platform meant stitching together 5-6 separate Azure services — ADF for orchestration, ADLS for storage, SQL Pool for warehousing, Databricks for Spark, Power BI for visualization. Each had its own portal, its own security model, its own billing. Managing them felt like running a restaurant where the kitchen, dining room, bar, and reception each had separate management systems that did not talk to each other.
Synapse puts everything under one roof. One workspace. One studio. One security model. One place to build pipelines, write SQL, run Spark notebooks, and query your data lake.
This post walks you through setting up a Synapse workspace from scratch, configuring every component, and running your first pipeline — with the context of why each piece exists and when you need it.
Table of Contents
- What Is Azure Synapse Analytics?
- Synapse Components: The Restaurant Analogy
- Prerequisites
- Step 1: Create the Synapse Workspace
- Step 2: Explore Synapse Studio
- Step 3: Understanding the Default Storage (ADLS Gen2)
- Step 4: Create a SQL Linked Service
- Step 5: Serverless SQL Pool (Built-In)
- Step 6: Dedicated SQL Pool (Optional)
- Step 7: Apache Spark Pool
- Step 8: Synapse Pipelines (ADF Inside Synapse)
- Step 9: Build Your First Pipeline
- Step 10: Connecting to Git (Source Control)
- Synapse vs ADF: What Is the Same, What Is Different
- Managed VNet and Private Endpoints
- Access Control (RBAC and Synapse Roles)
- Monitoring and Troubleshooting
- Cost Management
- Best Practices
- Interview Questions
- Wrapping Up
What Is Azure Synapse Analytics?
Azure Synapse Analytics is a unified analytics platform that combines:
| Component | What It Does | Standalone Equivalent |
|---|---|---|
| Synapse Pipelines | Data orchestration (ETL/ELT) | Azure Data Factory |
| Serverless SQL Pool | Query data lake files with SQL (no infrastructure) | N/A (unique to Synapse) |
| Dedicated SQL Pool | Enterprise data warehouse (provisioned resources) | Azure SQL Data Warehouse (old name) |
| Apache Spark Pool | Big data processing, ML, notebooks | Azure Databricks / HDInsight |
| Synapse Studio | Unified web interface for everything | Azure Portal + multiple service UIs |
| Data Explorer Pool | Real-time analytics on streaming data | Azure Data Explorer |
The key value: Everything shares the same workspace, the same security model, the same storage, and the same studio. No context-switching between portals.
Synapse Components: The Restaurant Analogy
Imagine Synapse as a modern restaurant:
Synapse Workspace = The restaurant building itself. It holds everything.
ADLS Gen2 (Primary Storage) = The pantry and walk-in refrigerator. All your raw ingredients (data) are stored here. Every component in the restaurant can access the pantry.
Serverless SQL Pool = The salad bar. Customers (analysts) walk up, pick what they want (query data lake files), and pay only for what they take (per TB scanned). No chef needed. No reservation needed. Always available.
Dedicated SQL Pool = The fine dining kitchen. Reserved capacity, dedicated chefs (compute), and a menu (schema) designed for the most important dishes (critical reports). Expensive to keep running, but delivers consistent, high-performance results.
Spark Pool = The experimental test kitchen. Chefs (data engineers and scientists) experiment with new recipes (ML models, complex transformations), work with exotic ingredients (unstructured data), and create new dishes (derived datasets). Scales up for big events, scales down when quiet.
Synapse Pipelines = The kitchen manager. Coordinates the flow: ingredients arrive from suppliers (source systems), get prepped (transformed), and are delivered to the right station (sink). Manages the schedule, handles failures, logs everything.
Synapse Studio = The restaurant’s management dashboard. One screen to monitor the salad bar, fine dining kitchen, test kitchen, and delivery schedule. No running between rooms.
Prerequisites
Before creating the workspace, you need:
- Azure subscription — Pay-As-You-Go or higher
- Resource group — create one if you do not have it (e.g., rg-dataplatform-dev)
- Azure ADLS Gen2 storage account — Synapse creates one during workspace setup, OR you can use an existing one
- Permissions — Owner or Contributor on the resource group
Step 1: Create the Synapse Workspace
Using Azure Portal
- Go to Azure Portal > search Synapse > click Azure Synapse Analytics
- Click + Create
Basics Tab
| Field | Value | Notes |
|---|---|---|
| Subscription | Your subscription | |
| Resource group | `rg-dataplatform-dev` | Create new if needed |
| Workspace name | `naveen-synapse-ws` | Globally unique |
| Region | Canada Central | Choose closest to your data |
| Data Lake Storage Gen2 | Create new or select existing | |
| Account name | `naveensynapsedl` | Only if creating new |
| File system name | `synapse-workspace` | This becomes your default container |
Important: The storage account MUST have hierarchical namespace enabled (ADLS Gen2, not regular Blob Storage). If you select an existing account without hierarchical namespace, the creation will fail.
Security Tab
| Field | Value | Notes |
|---|---|---|
| SQL administrator login | `sqladmin` | Admin for dedicated SQL pools |
| SQL administrator password | Strong password | Save this — you will need it |
| System assigned managed identity | Enabled (default) | Synapse uses this to access storage |
Networking Tab
For learning/dev:
| Field | Value |
|---|---|
| Managed virtual network | Disable (simpler for dev) |
| Allow connections from all IP addresses | Yes |
For production:
| Field | Value |
|---|---|
| Managed virtual network | Enable |
| Allow connections from all IP addresses | No (use private endpoints) |
Review + Create
Click Create. Deployment takes 3-5 minutes.
What Gets Created
After deployment, Azure creates:
```
Resource Group: rg-dataplatform-dev
|-- Synapse Workspace: naveen-synapse-ws
|-- ADLS Gen2 Storage: naveensynapsedl
|   |-- Container: synapse-workspace (your file system)
|-- Managed Identity: naveen-synapse-ws (auto-created)
```
Step 2: Explore Synapse Studio
Open Synapse Studio
- Go to your Synapse workspace in Azure Portal
- Click Open Synapse Studio (or go directly to web.azuresynapse.net)
- Select your workspace
The Six Hubs
Synapse Studio has six main sections (hubs) in the left sidebar:
| Hub | Icon | What It Does |
|---|---|---|
| Home | House | Overview, quick actions, recent items |
| Data | Database cylinder | Browse data lake files, SQL databases, linked data |
| Develop | Code brackets | SQL scripts, Spark notebooks, Data Flows |
| Integrate | Pipeline icon | Pipelines (same as ADF Author tab) |
| Monitor | Chart | Pipeline runs, Spark jobs, SQL queries |
| Manage | Gear | Linked services, integration runtimes, pools, Git config |
Real-life analogy: Think of the hubs as rooms in a house:
- Data = the library (browse and explore your data)
- Develop = the workshop (write code and scripts)
- Integrate = the control room (build and schedule pipelines)
- Monitor = the security room (watch everything running)
- Manage = the utility room (configure connections and infrastructure)
Step 3: Understanding the Default Storage (ADLS Gen2)
When you created the workspace, Synapse set up a primary storage account with a default container. This is where your data lake lives.
Exploring Storage in Synapse Studio
- Click Data hub > Linked tab
- Expand Azure Data Lake Storage Gen2
- You see your primary storage account and the synapse-workspace container
- Right-click the container to create folders, upload files, or browse
The Default Linked Service
Synapse automatically creates a linked service for your primary storage:
Name: naveen-synapse-ws-WorkspaceDefaultStorage
Type: Azure Data Lake Storage Gen2
Authentication: Managed Identity (automatic, no credentials)
You do NOT need to create a linked service for your default storage. It is ready to use in datasets and pipelines.
Grant Storage Permissions
The Synapse managed identity needs the Storage Blob Data Contributor role on the ADLS Gen2 account:
- Go to your storage account in Azure Portal
- Click Access Control (IAM) > + Add > Add role assignment
- Role: Storage Blob Data Contributor
- Members: search for your workspace name (naveen-synapse-ws)
- Click Review + assign
Without this, pipelines will fail with “Forbidden” errors when writing to storage.
Real-life analogy: The managed identity is like a building key card. Synapse (the employee) uses its key card (managed identity) to access the storage room (ADLS Gen2). But someone (you, the admin) must first program the key card to unlock that specific room (role assignment).
Step 4: Create a SQL Linked Service
To read from Azure SQL Database, create a linked service:
- Manage hub > Linked services > + New
- Search Azure SQL Database > Continue
- Configure:
  - Name: LS_AzureSqlDB
  - Server: sql-dataplatform-dev.database.windows.net
  - Database: AdventureWorksLT
  - Authentication: SQL authentication or Managed Identity
- Test connection > Create
SQL Authentication vs Managed Identity
| Method | Pros | Cons |
|---|---|---|
| SQL Authentication | Simple setup | Password management, rotation |
| Managed Identity | No passwords, most secure | Requires Azure AD admin setup on SQL Server |
For production, always use Managed Identity. For learning, SQL authentication is fine.
Setting Up Managed Identity for SQL
If you want the more secure approach:
```sql
-- Run in Azure SQL Database
CREATE USER [naveen-synapse-ws] FROM EXTERNAL PROVIDER;
ALTER ROLE db_datareader ADD MEMBER [naveen-synapse-ws];
ALTER ROLE db_datawriter ADD MEMBER [naveen-synapse-ws];
```
Then in the linked service, select Managed Identity as authentication. No password needed.
Step 5: Serverless SQL Pool (Built-In)
What It Is
A query engine that reads files directly from your data lake (ADLS Gen2) using standard SQL. No data loading, no infrastructure to manage, no cluster to start. It is always available.
You pay only for the data scanned — approximately $5 per TB. If your query scans 10 GB, you pay $0.05.
Real-life analogy: Serverless SQL Pool is like a library with a photocopier. The books (data files) stay on the shelves (ADLS Gen2). You walk up, photocopy the pages you need (query), and pay per page copied (per TB scanned). No need to check out the book or carry it home.
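To make the pay-per-scan model concrete, here is a tiny cost estimator. It is a sketch, not an official billing formula: it uses the ~$5/TB list price quoted above and 1 TB = 1,000 GB, matching the 10 GB example.

```python
# Estimate the cost of a serverless SQL query from the data it scans.
# The ~$5/TB rate is the approximate list price quoted above; actual
# regional pricing may differ.
PRICE_PER_TB_USD = 5.0

def serverless_query_cost(gb_scanned: float) -> float:
    """Approximate USD cost for a query scanning gb_scanned gigabytes."""
    return (gb_scanned / 1000) * PRICE_PER_TB_USD

print(f"10 GB scan: ${serverless_query_cost(10):.2f}")  # $0.05
```

Note there is no per-hour charge at all: a month with zero queries costs zero compute.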
Querying Data Lake Files
```sql
-- Query Parquet files directly from ADLS Gen2
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://naveensynapsedl.dfs.core.windows.net/synapse-workspace/sqldb/Customer/*.parquet',
    FORMAT = 'PARQUET'
) AS result;

-- Query CSV files (HEADER_ROW requires parser version 2.0)
SELECT *
FROM OPENROWSET(
    BULK 'https://naveensynapsedl.dfs.core.windows.net/synapse-workspace/raw/data.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS result;

-- Query Delta Lake tables
SELECT *
FROM OPENROWSET(
    BULK 'https://naveensynapsedl.dfs.core.windows.net/synapse-workspace/silver/customers/',
    FORMAT = 'DELTA'
) AS result;
```
Creating External Tables (Permanent Views on Data Lake)
```sql
-- Create a database for your logical data warehouse
CREATE DATABASE analytics;
GO
USE analytics;
GO

-- Create an external data source
CREATE EXTERNAL DATA SOURCE datalake
WITH (
    LOCATION = 'https://naveensynapsedl.dfs.core.windows.net/synapse-workspace/'
);

-- Create the Parquet file format referenced below
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

-- Create an external table via CETAS: the query results are written to
-- LOCATION (which must be an empty folder) and the table is registered over them
CREATE EXTERNAL TABLE customers
WITH (
    LOCATION = 'gold/customers/',
    DATA_SOURCE = datalake,
    FILE_FORMAT = ParquetFormat
)
AS SELECT * FROM OPENROWSET(
    BULK 'silver/customers/*.parquet',
    DATA_SOURCE = 'datalake',
    FORMAT = 'PARQUET'
) AS r;
```
Now analysts can query SELECT * FROM analytics.dbo.customers like a regular SQL table, but the data lives in ADLS Gen2.
When to Use Serverless SQL Pool
- Ad-hoc exploration — “What does this data look like?”
- Logical data warehouse — create SQL views on top of data lake files
- Power BI reports — connect Power BI directly to serverless SQL
- Data quality checks — quick row counts, null checks, profiling
- Cost-effective querying — pay only when you query, $0 when idle
Step 6: Dedicated SQL Pool (Optional)
What It Is
A provisioned data warehouse with reserved compute resources. Unlike serverless (pay per query), dedicated SQL pool runs continuously with guaranteed performance. This is the old Azure SQL Data Warehouse, rebranded.
Real-life analogy: If serverless SQL Pool is a library photocopier (pay per page), Dedicated SQL Pool is your own personal printer. You pay a monthly fee whether you print or not, but it is always ready and handles high-volume printing faster.
When You Need It
- High-concurrency reporting — 50+ analysts querying simultaneously
- Sub-second query performance — dashboards that need instant response
- Complex aggregations — star schema queries with billions of fact rows
- Always-on workloads — 24/7 production dashboards
When You Do NOT Need It
- Learning and development — use serverless instead (free when idle)
- Ad-hoc exploration — serverless is cheaper
- Small datasets — serverless handles these efficiently
- Variable workloads — dedicated charges even when idle
Creating a Dedicated SQL Pool
- Manage hub > SQL pools > + New
- Name: dwh_pool
- Performance level: DW100c (smallest, ~$1.50/hour)
- Click Create
Cost warning: DW100c costs approximately $1.50/hour (~$36/day, ~$1,080/month). Always pause when not in use:
Manage > SQL pools > dwh_pool > Pause
A paused dedicated pool costs $0 for compute (you still pay for storage).
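The pause arithmetic is worth internalizing. A quick sketch using the ~$1.50/hour DW100c rate above (compute only; storage is billed separately, and the rate is approximate):

```python
# Monthly DW100c compute cost for different pause schedules.
# Assumes the approximate ~$1.50/hour rate quoted above.
DW100C_HOURLY_USD = 1.50

def monthly_compute_cost(active_hours_per_day: float, days: int = 30) -> float:
    """Compute-only cost for a pool active N hours/day and paused otherwise."""
    return active_hours_per_day * DW100C_HOURLY_USD * days

print(f"Never paused:        ${monthly_compute_cost(24):.0f}")  # $1080
print(f"Paused after 9 h/day: ${monthly_compute_cost(9):.0f}")  # $405
```

Pausing outside business hours cuts the compute bill by more than half at any DWU level, since cost scales linearly with active hours.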
Loading Data into Dedicated SQL Pool
```sql
-- COPY INTO is the fastest way to load from ADLS Gen2
COPY INTO dbo.customers
FROM 'https://naveensynapsedl.dfs.core.windows.net/synapse-workspace/silver/customers/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);
```
Step 7: Apache Spark Pool
What It Is
A managed Spark cluster for big data processing, machine learning, and notebook-based development. You write PySpark, Scala, or SQL in Jupyter-like notebooks.
Real-life analogy: The Spark Pool is like a catering team. When you have a big event (large data processing job), you hire a team of chefs (Spark nodes). They work together to prepare the food (process data) much faster than one chef alone. After the event, they leave (auto-scale down). You do not pay chefs to stand around waiting.
Creating a Spark Pool
- Manage hub > Apache Spark pools > + New
- Configure:
  - Name: sparkpool
  - Node size: Small (4 vCores, 32 GB) — cheapest for learning
  - Autoscale: Enabled
  - Min nodes: 3 (minimum required)
  - Max nodes: 5 (for learning — increase for production)
  - Auto-pause: Enabled, idle minutes: 15
- Click Review + Create > Create
Auto-Pause (Save Money!)
When enabled, the Spark pool automatically shuts down after the idle timeout. Next time you run a notebook, it starts up again (takes 2-3 minutes).
With auto-pause: Pay only during active use
Without auto-pause: Pay 24/7 whether you use it or not
Always enable auto-pause for development. In production, disable it if you need instant notebook startup.
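To see what auto-pause actually saves, here is a rough model. The node count, rate, and usage pattern are hypothetical illustration values; the key idea is that a paused pool bills for each session plus the idle timeout after it, while an always-on pool bills around the clock.

```python
# Rough daily Spark pool cost with and without auto-pause.
# Hypothetical figures: 3 Small nodes at ~$0.40/node-hour, 2 notebook
# sessions per day, 15-minute idle timeout. Real rates vary by region.
NODES = 3
NODE_HOUR_USD = 0.40

def daily_spark_cost(active_hours: float, auto_pause: bool,
                     sessions: int = 2, idle_timeout_min: float = 15) -> float:
    if auto_pause:
        # Billed for active time plus the idle timeout after each session
        billed_hours = active_hours + sessions * idle_timeout_min / 60
    else:
        billed_hours = 24.0  # pool runs around the clock
    return billed_hours * NODES * NODE_HOUR_USD

print(f"With auto-pause:    ${daily_spark_cost(2, True):.2f}/day")   # $3.00
print(f"Without auto-pause: ${daily_spark_cost(2, False):.2f}/day")  # $28.80
```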
Running a Spark Notebook
- Develop hub > + > Notebook
- Attach to: sparkpool
- Language: PySpark
```python
# Read Parquet from data lake
df = spark.read.parquet(
    'abfss://synapse-workspace@naveensynapsedl.dfs.core.windows.net/sqldb/Customer/'
)
# Show data
df.show(5)

# Transform
from pyspark.sql.functions import upper, col, current_timestamp
df_clean = (df.withColumn('FullName', upper(col('FirstName')))
              .withColumn('LoadTimestamp', current_timestamp())
              .filter(col('CompanyName').isNotNull()))

# Write as Delta
(df_clean.write.format('delta').mode('overwrite')
    .save('abfss://synapse-workspace@naveensynapsedl.dfs.core.windows.net/silver/customers/'))

print(f"Written {df_clean.count()} rows")
```
When to Use Spark Pool
- Large-scale transformations — processing GBs to TBs of data
- Machine learning — training models on big datasets
- Delta Lake operations — MERGE, OPTIMIZE, time travel
- Complex ETL — when Data Flows are not flexible enough
- Notebook-based development — exploratory analysis, prototyping
Step 8: Synapse Pipelines (ADF Inside Synapse)
What It Is
Synapse Pipelines is Azure Data Factory built into Synapse. Same visual designer, same activities, same expressions, same triggers. If you know ADF, you know Synapse Pipelines.
Key Differences from Standalone ADF
| Feature | Standalone ADF | Synapse Pipelines |
|---|---|---|
| Portal | adf.azure.com | Synapse Studio |
| Default storage | Must create linked service | Default storage built-in |
| Spark | Not available | Run Spark notebooks from pipeline |
| SQL Pool | Not available | Script activity runs on SQL pools |
| Data Flows | Runs on ADF-managed Spark | Runs on Synapse Spark |
| Publish branch | `adf_publish` | `workspace_publish` |
| Pricing | Separate billing | Included in Synapse billing |
What Is Identical
- Pipeline designer (canvas, activities, arrows)
- All activities (Copy, Lookup, ForEach, If Condition, Web, Stored Procedure, etc.)
- Expressions (@item(), @pipeline(), @activity(), etc.)
- Parameterized datasets
- Triggers (Schedule, Tumbling Window, Event)
- Integration Runtimes (Azure IR, Self-Hosted IR)
- Git integration (Azure DevOps and GitHub)
Bottom line: If you built our metadata-driven pipeline in ADF, you can build the exact same pipeline in Synapse with almost no changes.
Step 9: Build Your First Pipeline
Let us build the classic Lookup + ForEach + Copy pipeline in Synapse.
Create Datasets
DS_SqlDB_Metadata:
1. Data hub > + > Integration dataset > Azure SQL Database
2. Name: DS_SqlDB_Metadata
3. Linked service: LS_AzureSqlDB
4. Table: dbo.metadata
DS_SqlDB_SourceTable (parameterized):
1. + > Integration dataset > Azure SQL Database
2. Name: DS_SqlDB_SourceTable
3. Linked service: LS_AzureSqlDB
4. Leave table blank
5. Add parameters: SchemaName (String), TableName (String)
6. Connection tab: Schema = @dataset().SchemaName, Table = @dataset().TableName
DS_ADLS_Sink_Parquet (parameterized):
1. + > Integration dataset > ADLS Gen2 > Parquet
2. Name: DS_ADLS_Sink_Parquet
3. Linked service: naveen-synapse-ws-WorkspaceDefaultStorage
4. Add parameters: ContainerName (String), FolderName (String)
5. Connection tab: File system = @dataset().ContainerName, Directory = @dataset().FolderName
Create the Pipeline
- Integrate hub > + > Pipeline
- Name: PL_Copy_SqlToADLS
Lookup activity:
- Name: Lookup_Metadata
- Dataset: DS_SqlDB_Metadata
- First row only: unchecked

ForEach activity:
- Name: ForEach_Table
- Items: @activity('Lookup_Metadata').output.value
- Sequential: unchecked

Copy activity (inside ForEach):
- Name: Copy_TableData
- Source: DS_SqlDB_SourceTable
  - SchemaName: @item().SchemaName
  - TableName: @item().TableName
- Sink: DS_ADLS_Sink_Parquet
  - ContainerName: @item().ContainerName
  - FolderName: @item().FolderName
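Conceptually, this pattern expands each metadata row into one parameterized copy operation. A plain-Python sketch of that control flow (the metadata rows and the copy_table helper are illustrative stand-ins, not real Synapse APIs):

```python
# What the Lookup + ForEach + Copy pipeline does at run time, in plain Python.
# The rows below and copy_table() are hypothetical stand-ins, not Synapse APIs.
metadata = [  # what Lookup_Metadata returns as output.value
    {"SchemaName": "SalesLT", "TableName": "Customer",
     "ContainerName": "synapse-workspace", "FolderName": "sqldb/Customer"},
    {"SchemaName": "SalesLT", "TableName": "Product",
     "ContainerName": "synapse-workspace", "FolderName": "sqldb/Product"},
]

def copy_table(item: dict) -> str:
    """Stand-in for the Copy activity: one source table -> one sink folder."""
    source = f"{item['SchemaName']}.{item['TableName']}"    # @item().SchemaName / TableName
    sink = f"{item['ContainerName']}/{item['FolderName']}"  # sink dataset parameters
    return f"Copied {source} -> {sink}"

# ForEach_Table iterates @activity('Lookup_Metadata').output.value
for item in metadata:
    print(copy_table(item))
```

Adding a new table to the pipeline means adding a row to the metadata table, with no pipeline changes.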
Validate, Debug, Publish
- Click Validate — fix any errors
- Click Debug — test the pipeline
- Click Publish all — save to the live service
Step 10: Connecting to Git (Source Control)
Why Connect to Git
Without Git, clicking Publish saves directly to the live Synapse workspace. There is no undo, no history, no code review. That is fine for learning but dangerous in production.
With Git, every change is versioned. You work in feature branches, create pull requests, and deploy through CI/CD.
Connect to GitHub
- Manage hub > Git configuration > Configure
- Repository type: GitHub (or Azure DevOps)
- GitHub account: authorize access
- Repository: select your repo
- Collaboration branch: main
- Publish branch: workspace_publish
- Root folder: /
- Import existing resources: Yes
- Click Apply
What Changes
After connecting:
- A branch dropdown appears in the top toolbar
- Changes auto-save to the Git branch (not to live Synapse)
- You must Publish to deploy changes to the live workspace
- Pull Requests enable code review before publishing
Synapse vs ADF: What Is the Same, What Is Different
| Aspect | Choose Synapse | Choose Standalone ADF |
|---|---|---|
| Need SQL warehouse | Yes (dedicated SQL pool) | No |
| Need Spark/notebooks | Yes | No (use Databricks separately) |
| Need serverless SQL | Yes | No |
| Pipeline-only workload | Synapse works but overkill | ADF is simpler and cheaper |
| Existing ADF investment | Consider migration later | Keep using ADF |
| Unified platform | Yes (one studio for everything) | No (separate portals) |
Rule of thumb: If you only need pipelines (ETL orchestration), use standalone ADF. If you need pipelines + SQL + Spark + data lake exploration, use Synapse.
Managed VNet and Private Endpoints
What Is Managed VNet
When enabled, Synapse creates a virtual network that isolates all outbound traffic. Your pipelines and Spark jobs run inside this VNet, and you control which external services they can connect to via managed private endpoints.
When to Enable
- Production workloads — data should not traverse the public internet
- Compliance — regulatory requirements for network isolation
- Enterprise security — company policy requires private connectivity
When to Skip (Dev/Learning)
- Learning and development — adds complexity without benefit
- Small projects — public access with firewall rules is sufficient
Creating a Managed Private Endpoint
- Manage hub > Managed private endpoints > + New
- Select the target service (e.g., Azure SQL Database)
- Name: pe-sql-database
- Target resource: select your SQL server
- Click Create
- The target resource owner must approve the private endpoint connection
Access Control (RBAC and Synapse Roles)
Azure RBAC (Resource Level)
| Role | What They Can Do |
|---|---|
| Owner | Full access including role assignments |
| Contributor | Create and manage all resources, cannot assign roles |
| Reader | View resources only |
Synapse-Specific Roles
| Role | Scope | Permissions |
|---|---|---|
| Synapse Administrator | Workspace | Full control of everything |
| Synapse SQL Administrator | SQL pools | Manage SQL pools and queries |
| Synapse Spark Administrator | Spark pools | Manage Spark pools and notebooks |
| Synapse Contributor | Workspace | Create and edit pipelines, scripts, notebooks |
| Synapse Artifact User | Workspace | Run pipelines and read artifacts |
| Synapse Credential User | Workspace | Use credentials in pipelines |
Best Practice
Assign the minimum role needed:
- Data engineers: Synapse Contributor (create pipelines and notebooks)
- Analysts: Synapse Artifact User (run queries and read data)
- Admins: Synapse Administrator (manage everything)
Monitoring and Troubleshooting
Monitor Hub
The Monitor hub shows:
| Tab | What It Shows |
|---|---|
| Pipeline runs | All pipeline executions with status, duration, errors |
| Trigger runs | Which triggers fired and what they started |
| Integration runtime | IR status, node health, usage |
| Apache Spark applications | Spark job history, logs, performance |
| SQL requests | Queries running on serverless and dedicated pools |
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| “Forbidden” on storage write | Managed identity lacks permissions | Add Storage Blob Data Contributor role |
| Spark pool takes 5 minutes to start | Cold start (auto-pause was active) | Expected behavior. Reduce by increasing auto-pause timeout |
| Pipeline runs in ADF but fails in Synapse | Different default IR or linked service | Verify linked services and IR configuration |
| “Cannot connect to SQL” | Firewall not configured | Add your IP or enable “Allow Azure services” |
| Publish button grayed out | Not on collaboration branch | Switch to main branch |
Cost Management
What Costs Money
| Component | Pricing Model | Approximate Cost |
|---|---|---|
| Serverless SQL | Per TB scanned | ~$5/TB |
| Dedicated SQL Pool | Per DWU-hour | DW100c: ~$1.50/hr |
| Spark Pool | Per node-hour | Small node: ~$0.40/hr |
| Pipeline activities | Per activity run + DIU-hours | Similar to ADF |
| Data Flow | Per vCore-hour | ~$0.27/vCore-hr |
| Storage (ADLS Gen2) | Per GB stored + transactions | ~$0.02/GB/month |
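The rates above can be combined into a back-of-envelope monthly estimator. All figures are the ballpark list prices from the table, not a billing formula; check the Azure pricing pages for your region.

```python
# Back-of-envelope monthly bill from the approximate rates in the table above.
# All figures are ballpark list prices; real pricing varies by region.
RATES = {
    "serverless_per_tb": 5.0,       # Serverless SQL, per TB scanned
    "dw100c_per_hour": 1.50,        # Dedicated SQL pool (DW100c)
    "spark_node_hour": 0.40,        # Spark pool, Small node
    "storage_per_gb_month": 0.02,   # ADLS Gen2 storage
}

def estimate_monthly_cost(tb_scanned=0.0, dedicated_hours=0.0,
                          spark_node_hours=0.0, storage_gb=0.0) -> float:
    return (tb_scanned * RATES["serverless_per_tb"]
            + dedicated_hours * RATES["dw100c_per_hour"]
            + spark_node_hours * RATES["spark_node_hour"]
            + storage_gb * RATES["storage_per_gb_month"])

# Example dev month: 0.5 TB scanned, no dedicated pool,
# 60 Spark node-hours, 200 GB stored
print(f"${estimate_monthly_cost(0.5, 0, 60, 200):.2f}")  # $30.50
```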
Cost Saving Tips
- Pause Dedicated SQL Pool when not in use — saves ~$36/day (DW100c)
- Enable Spark auto-pause — stops the cluster after idle timeout
- Use Serverless SQL for ad-hoc queries instead of spinning up Dedicated
- Turn off Data Flow Debug when not actively developing
- Set Spark pool to minimum nodes (3) for development
- Monitor costs — set up Azure Cost Management alerts
- Use reserved capacity for production Dedicated SQL Pools (save up to 65%)
Real-life analogy: Managing Synapse costs is like managing electricity in a house. Turn off lights (pause pools) when you leave a room. Do not leave the oven on (Debug cluster) overnight. Use energy-efficient appliances (serverless instead of dedicated) when possible.
Best Practices
- Start with Serverless SQL and Pipelines — do not create Dedicated SQL or Spark pools until you need them. They cost money when idle.
- Use Managed Identity everywhere — no passwords to manage, rotate, or leak.
- Connect to Git from day one — even for learning. It builds good habits and protects against accidental deletions.
- Organize your data lake — use the Bronze/Silver/Gold pattern in your default storage container.
- Use the default storage linked service — do not create a separate linked service for your primary ADLS account.
- Set auto-pause on Spark pools — 15 minutes for dev, 30 minutes for production.
- Pause Dedicated SQL Pool when not running queries — this is the biggest cost trap for new users.
- Use Synapse Roles for access control — do not give everyone Synapse Administrator.
Interview Questions
Q: What is Azure Synapse Analytics? A: A unified analytics platform that combines data integration (pipelines), serverless SQL (query data lake), dedicated SQL pool (data warehouse), and Apache Spark (big data processing) in one workspace with a single management studio.
Q: What is the difference between Serverless SQL Pool and Dedicated SQL Pool? A: Serverless SQL Pool queries data lake files directly with no provisioned infrastructure — you pay per TB scanned. Dedicated SQL Pool is a provisioned data warehouse with reserved compute — you pay per DWU-hour whether you query or not. Serverless is for ad-hoc queries; Dedicated is for high-concurrency production workloads.
Q: How is Synapse different from standalone ADF? A: Synapse includes ADF’s pipeline capabilities plus serverless SQL, dedicated SQL pools, Spark pools, and a unified studio. The pipeline engine is identical. Choose standalone ADF for pipeline-only workloads; choose Synapse when you need the full analytics platform.
Q: What is the default storage in Synapse? A: Every Synapse workspace has a primary ADLS Gen2 storage account configured during creation. A default linked service is auto-created, and the workspace’s managed identity is used for authentication. All components (pipelines, Spark, SQL) can access this storage natively.
Q: How do you manage costs in Synapse? A: Pause Dedicated SQL Pools when idle, enable auto-pause on Spark pools, use Serverless SQL for ad-hoc queries, turn off Data Flow Debug when not developing, and monitor costs with Azure Cost Management. The biggest cost traps are Dedicated SQL and Spark pools left running overnight.
Q: What Synapse roles should a data engineer have? A: Synapse Contributor — allows creating and editing pipelines, notebooks, SQL scripts, and data flows. Add Synapse Credential User if they need to use linked service credentials. Avoid giving Synapse Administrator to everyone.
Wrapping Up
Azure Synapse Analytics is the Swiss Army knife of Azure data platforms. It puts pipelines, SQL, Spark, and data lake exploration under one roof. The setup is straightforward — create a workspace, explore with serverless SQL, build pipelines in the Integrate hub, and connect to Git for source control.
Start small: workspace + serverless SQL + pipelines. Add Spark pools when you need big data processing. Add Dedicated SQL Pool only when you need a high-performance production warehouse. This approach keeps costs low while you learn.
Related posts:
- What is Azure Data Factory?
- ADF vs Synapse Comparison
- Metadata-Driven Pipeline in ADF
- Synapse Pipeline with Audit Logging
- Azure Fundamentals
- Data Flows Guide
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.