Azure Synapse Analytics Workspace Setup Guide: From Creation to Your First Pipeline
Azure Synapse Analytics is Microsoft’s answer to the question: “What if we combined Azure Data Factory, a data lake, a SQL data warehouse, and Apache Spark into one unified platform?”
Before Synapse, building a data platform meant stitching together 5-6 separate Azure services — ADF for orchestration, ADLS for storage, SQL Pool for warehousing, Databricks for Spark, Power BI for visualization. Each had its own portal, its own security model, its own billing. Managing them felt like running a restaurant where the kitchen, dining room, bar, and reception each had separate management systems that did not talk to each other.
Synapse puts everything under one roof. One workspace. One studio. One security model. One place to build pipelines, write SQL, run Spark notebooks, and query your data lake.
This post walks you through setting up a Synapse workspace from scratch, configuring every component, and running your first pipeline — with the context of why each piece exists and when you need it.
Table of Contents
- What Is Azure Synapse Analytics?
- Synapse Components: The Restaurant Analogy
- Prerequisites
- Step 1: Create the Synapse Workspace
- Step 2: Explore Synapse Studio
- Step 3: Understanding the Default Storage (ADLS Gen2)
- Step 4: Create a SQL Linked Service
- Step 5: Serverless SQL Pool (Built-In)
- Step 6: Dedicated SQL Pool (Optional)
- Step 7: Apache Spark Pool
- Step 8: Synapse Pipelines (ADF Inside Synapse)
- Step 9: Build Your First Pipeline
- Step 10: Connecting to Git (Source Control)
- Synapse vs ADF: What Is the Same, What Is Different
- Managed VNet and Private Endpoints
- Access Control (RBAC and Synapse Roles)
- Monitoring and Troubleshooting
- Cost Management
- Best Practices
- Interview Questions
- Wrapping Up
What Is Azure Synapse Analytics?
Azure Synapse Analytics is a unified analytics platform that combines:
| Component | What It Does | Standalone Equivalent |
|---|---|---|
| Synapse Pipelines | Data orchestration (ETL/ELT) | Azure Data Factory |
| Serverless SQL Pool | Query data lake files with SQL (no infrastructure) | N/A (unique to Synapse) |
| Dedicated SQL Pool | Enterprise data warehouse (provisioned resources) | Azure SQL Data Warehouse (old name) |
| Apache Spark Pool | Big data processing, ML, notebooks | Azure Databricks / HDInsight |
| Synapse Studio | Unified web interface for everything | Azure Portal + multiple service UIs |
| Data Explorer Pool | Real-time analytics on streaming data | Azure Data Explorer |
The key value: Everything shares the same workspace, the same security model, the same storage, and the same studio. No context-switching between portals.
Synapse Components: The Restaurant Analogy
Imagine Synapse as a modern restaurant:
Synapse Workspace = The restaurant building itself. It holds everything.
ADLS Gen2 (Primary Storage) = The pantry and walk-in refrigerator. All your raw ingredients (data) are stored here. Every component in the restaurant can access the pantry.
Serverless SQL Pool = The salad bar. Customers (analysts) walk up, pick what they want (query data lake files), and pay only for what they take (per TB scanned). No chef needed. No reservation needed. Always available.
Dedicated SQL Pool = The fine dining kitchen. Reserved capacity, dedicated chefs (compute), and a menu (schema) designed for the most important dishes (critical reports). Expensive to keep running, but delivers consistent, high-performance results.
Spark Pool = The experimental test kitchen. Chefs (data engineers and scientists) experiment with new recipes (ML models, complex transformations), work with exotic ingredients (unstructured data), and create new dishes (derived datasets). Scales up for big events, scales down when quiet.
Synapse Pipelines = The kitchen manager. Coordinates the flow: ingredients arrive from suppliers (source systems), get prepped (transformed), and are delivered to the right station (sink). Manages the schedule, handles failures, logs everything.
Synapse Studio = The restaurant’s management dashboard. One screen to monitor the salad bar, fine dining kitchen, test kitchen, and delivery schedule. No running between rooms.
Prerequisites
Before creating the workspace, you need:
- Azure subscription — Pay-As-You-Go or higher
- Resource group — create one if you do not have it (e.g., rg-dataplatform-dev)
- Azure ADLS Gen2 storage account — Synapse creates one during workspace setup, OR you can use an existing one
- Permissions — Owner or Contributor on the resource group
Step 1: Create the Synapse Workspace
Using Azure Portal
- Go to Azure Portal > search Synapse > click Azure Synapse Analytics
- Click + Create
Basics Tab
| Field | Value | Notes |
|---|---|---|
| Subscription | Your subscription | |
| Resource group | `rg-dataplatform-dev` | Create new if needed |
| Workspace name | `naveen-synapse-ws` | Globally unique |
| Region | Canada Central | Choose closest to your data |
| Data Lake Storage Gen2 | Create new or select existing | |
| Account name | `naveensynapsedl` | Only if creating new |
| File system name | `synapse-workspace` | This becomes your default container |
Important: The storage account MUST have hierarchical namespace enabled (ADLS Gen2, not regular Blob Storage). If you select an existing account without hierarchical namespace, the creation will fail.
Security Tab
| Field | Value | Notes |
|---|---|---|
| SQL administrator login | `sqladmin` | Admin for dedicated SQL pools |
| SQL administrator password | Strong password | Save this — you will need it |
| System assigned managed identity | Enabled (default) | Synapse uses this to access storage |
Networking Tab
For learning/dev:
| Field | Value |
|---|---|
| Managed virtual network | Disable (simpler for dev) |
| Allow connections from all IP addresses | Yes |
For production:
| Field | Value |
|---|---|
| Managed virtual network | Enable |
| Allow connections from all IP addresses | No (use private endpoints) |
Review + Create
Click Create. Deployment takes 3-5 minutes.
What Gets Created
After deployment, Azure creates:
```
Resource Group: rg-dataplatform-dev
|-- Synapse Workspace: naveen-synapse-ws
|-- ADLS Gen2 Storage: naveensynapsedl
|   |-- Container: synapse-workspace (your file system)
|-- Managed Identity: naveen-synapse-ws (auto-created)
```
Step 2: Explore Synapse Studio
Open Synapse Studio
- Go to your Synapse workspace in Azure Portal
- Click Open Synapse Studio (or go directly to web.azuresynapse.net)
- Select your workspace
The Six Hubs
Synapse Studio has six main sections (hubs) in the left sidebar:
| Hub | Icon | What It Does |
|---|---|---|
| Home | House | Overview, quick actions, recent items |
| Data | Database cylinder | Browse data lake files, SQL databases, linked data |
| Develop | Code brackets | SQL scripts, Spark notebooks, Data Flows |
| Integrate | Pipeline icon | Pipelines (same as ADF Author tab) |
| Monitor | Chart | Pipeline runs, Spark jobs, SQL queries |
| Manage | Gear | Linked services, integration runtimes, pools, Git config |
Real-life analogy: Think of the hubs as rooms in a house:
- Data = the library (browse and explore your data)
- Develop = the workshop (write code and scripts)
- Integrate = the control room (build and schedule pipelines)
- Monitor = the security room (watch everything running)
- Manage = the utility room (configure connections and infrastructure)
Step 3: Understanding the Default Storage (ADLS Gen2)
When you created the workspace, Synapse set up a primary storage account with a default container. This is where your data lake lives.
Exploring Storage in Synapse Studio
- Click Data hub > Linked tab
- Expand Azure Data Lake Storage Gen2
- You see your primary storage account and the synapse-workspace container
- Right-click the container to create folders, upload files, or browse
The Default Linked Service
Synapse automatically creates a linked service for your primary storage:
Name: naveen-synapse-ws-WorkspaceDefaultStorage
Type: Azure Data Lake Storage Gen2
Authentication: Managed Identity (automatic, no credentials)
You do NOT need to create a linked service for your default storage. It is ready to use in datasets and pipelines.
Grant Storage Permissions
The Synapse managed identity needs the Storage Blob Data Contributor role on the ADLS Gen2 account:
- Go to your storage account in Azure Portal
- Click Access Control (IAM) > + Add > Add role assignment
- Role: Storage Blob Data Contributor
- Members: search for your workspace name (naveen-synapse-ws)
- Click Review + assign
Without this, pipelines will fail with “Forbidden” errors when writing to storage.
Real-life analogy: The managed identity is like a building key card. Synapse (the employee) uses its key card (managed identity) to access the storage room (ADLS Gen2). But someone (you, the admin) must first program the key card to unlock that specific room (role assignment).
Step 4: Create a SQL Linked Service
To read from Azure SQL Database, create a linked service:
- Manage hub > Linked services > + New
- Search Azure SQL Database > Continue
- Configure:
  - Name: LS_AzureSqlDB
  - Server: sql-dataplatform-dev.database.windows.net
  - Database: AdventureWorksLT
  - Authentication: SQL authentication or Managed Identity
- Test connection > Create
SQL Authentication vs Managed Identity
| Method | Pros | Cons |
|---|---|---|
| SQL Authentication | Simple setup | Password management, rotation |
| Managed Identity | No passwords, most secure | Requires Azure AD admin setup on SQL Server |
For production, always use Managed Identity. For learning, SQL authentication is fine.
Setting Up Managed Identity for SQL
If you want the more secure approach:
```sql
-- Run in Azure SQL Database
CREATE USER [naveen-synapse-ws] FROM EXTERNAL PROVIDER;
ALTER ROLE db_datareader ADD MEMBER [naveen-synapse-ws];
ALTER ROLE db_datawriter ADD MEMBER [naveen-synapse-ws];
```
Then in the linked service, select Managed Identity as authentication. No password needed.
Step 5: Serverless SQL Pool (Built-In)
What It Is
A query engine that reads files directly from your data lake (ADLS Gen2) using standard SQL. No data loading, no infrastructure to manage, no cluster to start. It is always available.
You pay only for the data scanned — approximately $5 per TB. If your query scans 10 GB, you pay $0.05.
Real-life analogy: Serverless SQL Pool is like a library with a photocopier. The books (data files) stay on the shelves (ADLS Gen2). You walk up, photocopy the pages you need (query), and pay per page copied (per TB scanned). No need to check out the book or carry it home.
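To make the pay-per-scan model concrete, here is a tiny cost estimator. It is a sketch, not an official billing formula: it uses the ~$5/TB list price quoted above and 1 TB = 1,000 GB, matching the 10 GB example.

```python
# Estimate the cost of a serverless SQL query from the data it scans.
# The ~$5/TB rate is the approximate list price quoted above; actual
# regional pricing may differ.
PRICE_PER_TB_USD = 5.0

def serverless_query_cost(gb_scanned: float) -> float:
    """Approximate USD cost for a query scanning gb_scanned gigabytes."""
    return (gb_scanned / 1000) * PRICE_PER_TB_USD

print(f"10 GB scan: ${serverless_query_cost(10):.2f}")  # $0.05
```

Note there is no per-hour charge at all: a month with zero queries costs zero compute.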
Querying Data Lake Files
```sql
-- Query Parquet files directly from ADLS Gen2
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://naveensynapsedl.dfs.core.windows.net/synapse-workspace/sqldb/Customer/*.parquet',
    FORMAT = 'PARQUET'
) AS result;

-- Query CSV files (HEADER_ROW requires parser version 2.0)
SELECT *
FROM OPENROWSET(
    BULK 'https://naveensynapsedl.dfs.core.windows.net/synapse-workspace/raw/data.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS result;

-- Query Delta Lake tables
SELECT *
FROM OPENROWSET(
    BULK 'https://naveensynapsedl.dfs.core.windows.net/synapse-workspace/silver/customers/',
    FORMAT = 'DELTA'
) AS result;
```
Creating External Tables (Permanent Views on Data Lake)
```sql
-- Create a database for your logical data warehouse
CREATE DATABASE analytics;
GO
USE analytics;
GO

-- Create an external data source
CREATE EXTERNAL DATA SOURCE datalake
WITH (
    LOCATION = 'https://naveensynapsedl.dfs.core.windows.net/synapse-workspace/'
);

-- Create the Parquet file format referenced below
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

-- Create an external table via CETAS: the query results are written to
-- LOCATION (which must be an empty folder) and the table is registered over them
CREATE EXTERNAL TABLE customers
WITH (
    LOCATION = 'gold/customers/',
    DATA_SOURCE = datalake,
    FILE_FORMAT = ParquetFormat
)
AS SELECT * FROM OPENROWSET(
    BULK 'silver/customers/*.parquet',
    DATA_SOURCE = 'datalake',
    FORMAT = 'PARQUET'
) AS r;
```
Now analysts can query SELECT * FROM analytics.dbo.customers like a regular SQL table, but the data lives in ADLS Gen2.
When to Use Serverless SQL Pool
- Ad-hoc exploration — “What does this data look like?”
- Logical data warehouse — create SQL views on top of data lake files
- Power BI reports — connect Power BI directly to serverless SQL
- Data quality checks — quick row counts, null checks, profiling
- Cost-effective querying — pay only when you query, $0 when idle
Step 6: Dedicated SQL Pool (Optional)
What It Is
A provisioned data warehouse with reserved compute resources. Unlike serverless (pay per query), dedicated SQL pool runs continuously with guaranteed performance. This is the old Azure SQL Data Warehouse, rebranded.
Real-life analogy: If serverless SQL Pool is a library photocopier (pay per page), Dedicated SQL Pool is your own personal printer. You pay a monthly fee whether you print or not, but it is always ready and handles high-volume printing faster.
When You Need It
- High-concurrency reporting — 50+ analysts querying simultaneously
- Sub-second query performance — dashboards that need instant response
- Complex aggregations — star schema queries with billions of fact rows
- Always-on workloads — 24/7 production dashboards
When You Do NOT Need It
- Learning and development — use serverless instead (free when idle)
- Ad-hoc exploration — serverless is cheaper
- Small datasets — serverless handles these efficiently
- Variable workloads — dedicated charges even when idle
Creating a Dedicated SQL Pool
- Manage hub > SQL pools > + New
- Name: dwh_pool
- Performance level: DW100c (smallest, ~$1.50/hour)
- Click Create
Cost warning: DW100c costs approximately $1.50/hour (~$36/day, ~$1,080/month). Always pause when not in use:
Manage > SQL pools > dwh_pool > Pause
A paused dedicated pool costs $0 for compute (you still pay for storage).
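The pause arithmetic is worth internalizing. A quick sketch using the ~$1.50/hour DW100c rate above (compute only; storage is billed separately, and the rate is approximate):

```python
# Monthly DW100c compute cost for different pause schedules.
# Assumes the approximate ~$1.50/hour rate quoted above.
DW100C_HOURLY_USD = 1.50

def monthly_compute_cost(active_hours_per_day: float, days: int = 30) -> float:
    """Compute-only cost for a pool active N hours/day and paused otherwise."""
    return active_hours_per_day * DW100C_HOURLY_USD * days

print(f"Never paused:        ${monthly_compute_cost(24):.0f}")  # $1080
print(f"Paused after 9 h/day: ${monthly_compute_cost(9):.0f}")  # $405
```

Pausing outside business hours cuts the compute bill by more than half at any DWU level, since cost scales linearly with active hours.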
Loading Data into Dedicated SQL Pool
```sql
-- COPY INTO is the fastest way to load from ADLS Gen2
COPY INTO dbo.customers
FROM 'https://naveensynapsedl.dfs.core.windows.net/synapse-workspace/silver/customers/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);
```
Step 7: Apache Spark Pool
What It Is
A managed Spark cluster for big data processing, machine learning, and notebook-based development. You write PySpark, Scala, or SQL in Jupyter-like notebooks.
Real-life analogy: The Spark Pool is like a catering team. When you have a big event (large data processing job), you hire a team of chefs (Spark nodes). They work together to prepare the food (process data) much faster than one chef alone. After the event, they leave (auto-scale down). You do not pay chefs to stand around waiting.
Creating a Spark Pool
- Manage hub > Apache Spark pools > + New
- Configure:
  - Name: sparkpool
  - Node size: Small (4 vCores, 32 GB) — cheapest for learning
  - Autoscale: Enabled
  - Min nodes: 3 (minimum required)
  - Max nodes: 5 (for learning — increase for production)
  - Auto-pause: Enabled, idle minutes: 15
- Click Review + Create > Create
Auto-Pause (Save Money!)
When enabled, the Spark pool automatically shuts down after the idle timeout. Next time you run a notebook, it starts up again (takes 2-3 minutes).
With auto-pause: Pay only during active use
Without auto-pause: Pay 24/7 whether you use it or not
Always enable auto-pause for development. In production, disable it if you need instant notebook startup.
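To see what auto-pause actually saves, here is a rough model. The node count, rate, and usage pattern are hypothetical illustration values; the key idea is that a paused pool bills for each session plus the idle timeout after it, while an always-on pool bills around the clock.

```python
# Rough daily Spark pool cost with and without auto-pause.
# Hypothetical figures: 3 Small nodes at ~$0.40/node-hour, 2 notebook
# sessions per day, 15-minute idle timeout. Real rates vary by region.
NODES = 3
NODE_HOUR_USD = 0.40

def daily_spark_cost(active_hours: float, auto_pause: bool,
                     sessions: int = 2, idle_timeout_min: float = 15) -> float:
    if auto_pause:
        # Billed for active time plus the idle timeout after each session
        billed_hours = active_hours + sessions * idle_timeout_min / 60
    else:
        billed_hours = 24.0  # pool runs around the clock
    return billed_hours * NODES * NODE_HOUR_USD

print(f"With auto-pause:    ${daily_spark_cost(2, True):.2f}/day")   # $3.00
print(f"Without auto-pause: ${daily_spark_cost(2, False):.2f}/day")  # $28.80
```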
Running a Spark Notebook
- Develop hub > + > Notebook
- Attach to: sparkpool
- Language: PySpark
```python
# Read Parquet from data lake
df = spark.read.parquet(
    'abfss://synapse-workspace@naveensynapsedl.dfs.core.windows.net/sqldb/Customer/'
)
# Show data
df.show(5)

# Transform
from pyspark.sql.functions import upper, col, current_timestamp
df_clean = (df.withColumn('FullName', upper(col('FirstName')))
              .withColumn('LoadTimestamp', current_timestamp())
              .filter(col('CompanyName').isNotNull()))

# Write as Delta
(df_clean.write.format('delta').mode('overwrite')
    .save('abfss://synapse-workspace@naveensynapsedl.dfs.core.windows.net/silver/customers/'))

print(f"Written {df_clean.count()} rows")
```
When to Use Spark Pool
- Large-scale transformations — processing GBs to TBs of data
- Machine learning — training models on big datasets
- Delta Lake operations — MERGE, OPTIMIZE, time travel
- Complex ETL — when Data Flows are not flexible enough
- Notebook-based development — exploratory analysis, prototyping
Step 8: Synapse Pipelines (ADF Inside Synapse)
What It Is
Synapse Pipelines is Azure Data Factory built into Synapse. Same visual designer, same activities, same expressions, same triggers. If you know ADF, you know Synapse Pipelines.
Key Differences from Standalone ADF
| Feature | Standalone ADF | Synapse Pipelines |
|---|---|---|
| Portal | adf.azure.com | Synapse Studio |
| Default storage | Must create linked service | Default storage built-in |
| Spark | Not available | Run Spark notebooks from pipeline |
| SQL Pool | Not available | Script activity runs on SQL pools |
| Data Flows | Runs on ADF-managed Spark | Runs on Synapse Spark |
| Publish branch | `adf_publish` | `workspace_publish` |
| Pricing | Separate billing | Included in Synapse billing |
What Is Identical
- Pipeline designer (canvas, activities, arrows)
- All activities (Copy, Lookup, ForEach, If Condition, Web, Stored Procedure, etc.)
- Expressions (@item(), @pipeline(), @activity(), etc.)
- Parameterized datasets
- Triggers (Schedule, Tumbling Window, Event)
- Integration Runtimes (Azure IR, Self-Hosted IR)
- Git integration (Azure DevOps and GitHub)
Bottom line: If you built our metadata-driven pipeline in ADF, you can build the exact same pipeline in Synapse with almost no changes.
Step 9: Build Your First Pipeline
Let us build the classic Lookup + ForEach + Copy pipeline in Synapse.
Create Datasets
DS_SqlDB_Metadata:
1. Data hub > + > Integration dataset > Azure SQL Database
2. Name: DS_SqlDB_Metadata
3. Linked service: LS_AzureSqlDB
4. Table: dbo.metadata
DS_SqlDB_SourceTable (parameterized):
1. + > Integration dataset > Azure SQL Database
2. Name: DS_SqlDB_SourceTable
3. Linked service: LS_AzureSqlDB
4. Leave table blank
5. Add parameters: SchemaName (String), TableName (String)
6. Connection tab: Schema = @dataset().SchemaName, Table = @dataset().TableName
DS_ADLS_Sink_Parquet (parameterized):
1. + > Integration dataset > ADLS Gen2 > Parquet
2. Name: DS_ADLS_Sink_Parquet
3. Linked service: naveen-synapse-ws-WorkspaceDefaultStorage
4. Add parameters: ContainerName (String), FolderName (String)
5. Connection tab: File system = @dataset().ContainerName, Directory = @dataset().FolderName
Create the Pipeline
- Integrate hub > + > Pipeline
- Name: PL_Copy_SqlToADLS
Lookup activity:
- Name: Lookup_Metadata
- Dataset: DS_SqlDB_Metadata
- First row only: unchecked

ForEach activity:
- Name: ForEach_Table
- Items: @activity('Lookup_Metadata').output.value
- Sequential: unchecked

Copy activity (inside ForEach):
- Name: Copy_TableData
- Source: DS_SqlDB_SourceTable
  - SchemaName: @item().SchemaName
  - TableName: @item().TableName
- Sink: DS_ADLS_Sink_Parquet
  - ContainerName: @item().ContainerName
  - FolderName: @item().FolderName
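Conceptually, this pattern expands each metadata row into one parameterized copy operation. A plain-Python sketch of that control flow (the metadata rows and the copy_table helper are illustrative stand-ins, not real Synapse APIs):

```python
# What the Lookup + ForEach + Copy pipeline does at run time, in plain Python.
# The rows below and copy_table() are hypothetical stand-ins, not Synapse APIs.
metadata = [  # what Lookup_Metadata returns as output.value
    {"SchemaName": "SalesLT", "TableName": "Customer",
     "ContainerName": "synapse-workspace", "FolderName": "sqldb/Customer"},
    {"SchemaName": "SalesLT", "TableName": "Product",
     "ContainerName": "synapse-workspace", "FolderName": "sqldb/Product"},
]

def copy_table(item: dict) -> str:
    """Stand-in for the Copy activity: one source table -> one sink folder."""
    source = f"{item['SchemaName']}.{item['TableName']}"    # @item().SchemaName / TableName
    sink = f"{item['ContainerName']}/{item['FolderName']}"  # sink dataset parameters
    return f"Copied {source} -> {sink}"

# ForEach_Table iterates @activity('Lookup_Metadata').output.value
for item in metadata:
    print(copy_table(item))
```

Adding a new table to the pipeline means adding a row to the metadata table, with no pipeline changes.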
Validate, Debug, Publish
- Click Validate — fix any errors
- Click Debug — test the pipeline
- Click Publish all — save to the live service
Step 10: Connecting to Git (Source Control)
Why Connect to Git
Without Git, clicking Publish saves directly to the live Synapse workspace. There is no undo, no history, no code review. That is fine for learning but dangerous in production.
With Git, every change is versioned. You work in feature branches, create pull requests, and deploy through CI/CD.
Connect to GitHub
- Manage hub > Git configuration > Configure
- Repository type: GitHub (or Azure DevOps)
- GitHub account: authorize access
- Repository: select your repo
- Collaboration branch: main
- Publish branch: workspace_publish
- Root folder: /
- Import existing resources: Yes
- Click Apply
What Changes
After connecting:
- A branch dropdown appears in the top toolbar
- Changes auto-save to the Git branch (not to live Synapse)
- You must Publish to deploy changes to the live workspace
- Pull Requests enable code review before publishing
Synapse vs ADF: What Is the Same, What Is Different
| Aspect | Choose Synapse | Choose Standalone ADF |
|---|---|---|
| Need SQL warehouse | Yes (dedicated SQL pool) | No |
| Need Spark/notebooks | Yes | No (use Databricks separately) |
| Need serverless SQL | Yes | No |
| Pipeline-only workload | Synapse works but overkill | ADF is simpler and cheaper |
| Existing ADF investment | Consider migration later | Keep using ADF |
| Unified platform | Yes (one studio for everything) | No (separate portals) |
Rule of thumb: If you only need pipelines (ETL orchestration), use standalone ADF. If you need pipelines + SQL + Spark + data lake exploration, use Synapse.
Managed VNet and Private Endpoints
What Is Managed VNet
When enabled, Synapse creates a virtual network that isolates all outbound traffic. Your pipelines and Spark jobs run inside this VNet, and you control which external services they can connect to via managed private endpoints.
When to Enable
- Production workloads — data should not traverse the public internet
- Compliance — regulatory requirements for network isolation
- Enterprise security — company policy requires private connectivity
When to Skip (Dev/Learning)
- Learning and development — adds complexity without benefit
- Small projects — public access with firewall rules is sufficient
Creating a Managed Private Endpoint
- Manage hub > Managed private endpoints > + New
- Select the target service (e.g., Azure SQL Database)
- Name: pe-sql-database
- Target resource: select your SQL server
- Click Create
- The target resource owner must approve the private endpoint connection
Access Control (RBAC and Synapse Roles)
Azure RBAC (Resource Level)
| Role | What They Can Do |
|---|---|
| Owner | Full access including role assignments |
| Contributor | Create and manage all resources, cannot assign roles |
| Reader | View resources only |
Synapse-Specific Roles
| Role | Scope | Permissions |
|---|---|---|
| Synapse Administrator | Workspace | Full control of everything |
| Synapse SQL Administrator | SQL pools | Manage SQL pools and queries |
| Synapse Spark Administrator | Spark pools | Manage Spark pools and notebooks |
| Synapse Contributor | Workspace | Create and edit pipelines, scripts, notebooks |
| Synapse Artifact User | Workspace | Run pipelines and read artifacts |
| Synapse Credential User | Workspace | Use credentials in pipelines |
Best Practice
Assign the minimum role needed:
- Data engineers: Synapse Contributor (create pipelines and notebooks)
- Analysts: Synapse Artifact User (run queries and read data)
- Admins: Synapse Administrator (manage everything)
Monitoring and Troubleshooting
Monitor Hub
The Monitor hub shows:
| Tab | What It Shows |
|---|---|
| Pipeline runs | All pipeline executions with status, duration, errors |
| Trigger runs | Which triggers fired and what they started |
| Integration runtime | IR status, node health, usage |
| Apache Spark applications | Spark job history, logs, performance |
| SQL requests | Queries running on serverless and dedicated pools |
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| “Forbidden” on storage write | Managed identity lacks permissions | Add Storage Blob Data Contributor role |
| Spark pool takes 5 minutes to start | Cold start (auto-pause was active) | Expected behavior. Reduce by increasing auto-pause timeout |
| Pipeline runs in ADF but fails in Synapse | Different default IR or linked service | Verify linked services and IR configuration |
| “Cannot connect to SQL” | Firewall not configured | Add your IP or enable “Allow Azure services” |
| Publish button grayed out | Not on collaboration branch | Switch to main branch |
Cost Management
What Costs Money
| Component | Pricing Model | Approximate Cost |
|---|---|---|
| Serverless SQL | Per TB scanned | ~$5/TB |
| Dedicated SQL Pool | Per DWU-hour | DW100c: ~$1.50/hr |
| Spark Pool | Per node-hour | Small node: ~$0.40/hr |
| Pipeline activities | Per activity run + DIU-hours | Similar to ADF |
| Data Flow | Per vCore-hour | ~$0.27/vCore-hr |
| Storage (ADLS Gen2) | Per GB stored + transactions | ~$0.02/GB/month |
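The rates above can be combined into a back-of-envelope monthly estimator. All figures are the ballpark list prices from the table, not a billing formula; check the Azure pricing pages for your region.

```python
# Back-of-envelope monthly bill from the approximate rates in the table above.
# All figures are ballpark list prices; real pricing varies by region.
RATES = {
    "serverless_per_tb": 5.0,       # Serverless SQL, per TB scanned
    "dw100c_per_hour": 1.50,        # Dedicated SQL pool (DW100c)
    "spark_node_hour": 0.40,        # Spark pool, Small node
    "storage_per_gb_month": 0.02,   # ADLS Gen2 storage
}

def estimate_monthly_cost(tb_scanned=0.0, dedicated_hours=0.0,
                          spark_node_hours=0.0, storage_gb=0.0) -> float:
    return (tb_scanned * RATES["serverless_per_tb"]
            + dedicated_hours * RATES["dw100c_per_hour"]
            + spark_node_hours * RATES["spark_node_hour"]
            + storage_gb * RATES["storage_per_gb_month"])

# Example dev month: 0.5 TB scanned, no dedicated pool,
# 60 Spark node-hours, 200 GB stored
print(f"${estimate_monthly_cost(0.5, 0, 60, 200):.2f}")  # $30.50
```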
Cost Saving Tips
- Pause Dedicated SQL Pool when not in use — saves ~$36/day (DW100c)
- Enable Spark auto-pause — stops the cluster after idle timeout
- Use Serverless SQL for ad-hoc queries instead of spinning up Dedicated
- Turn off Data Flow Debug when not actively developing
- Set Spark pool to minimum nodes (3) for development
- Monitor costs — set up Azure Cost Management alerts
- Use reserved capacity for production Dedicated SQL Pools (save up to 65%)
Real-life analogy: Managing Synapse costs is like managing electricity in a house. Turn off lights (pause pools) when you leave a room. Do not leave the oven on (Debug cluster) overnight. Use energy-efficient appliances (serverless instead of dedicated) when possible.
Best Practices
- Start with Serverless SQL and Pipelines — do not create Dedicated SQL or Spark pools until you need them. They cost money when idle.
- Use Managed Identity everywhere — no passwords to manage, rotate, or leak.
- Connect to Git from day one — even for learning. It builds good habits and protects against accidental deletions.
- Organize your data lake — use the Bronze/Silver/Gold pattern in your default storage container.
- Use the default storage linked service — do not create a separate linked service for your primary ADLS account.
- Set auto-pause on Spark pools — 15 minutes for dev, 30 minutes for production.
- Pause Dedicated SQL Pool when not running queries — this is the biggest cost trap for new users.
- Use Synapse Roles for access control — do not give everyone Synapse Administrator.
Interview Questions
Q: What is Azure Synapse Analytics? A: A unified analytics platform that combines data integration (pipelines), serverless SQL (query data lake), dedicated SQL pool (data warehouse), and Apache Spark (big data processing) in one workspace with a single management studio.
Q: What is the difference between Serverless SQL Pool and Dedicated SQL Pool? A: Serverless SQL Pool queries data lake files directly with no provisioned infrastructure — you pay per TB scanned. Dedicated SQL Pool is a provisioned data warehouse with reserved compute — you pay per DWU-hour whether you query or not. Serverless is for ad-hoc queries; Dedicated is for high-concurrency production workloads.
Q: How is Synapse different from standalone ADF? A: Synapse includes ADF’s pipeline capabilities plus serverless SQL, dedicated SQL pools, Spark pools, and a unified studio. The pipeline engine is identical. Choose standalone ADF for pipeline-only workloads; choose Synapse when you need the full analytics platform.
Q: What is the default storage in Synapse? A: Every Synapse workspace has a primary ADLS Gen2 storage account configured during creation. A default linked service is auto-created, and the workspace’s managed identity is used for authentication. All components (pipelines, Spark, SQL) can access this storage natively.
Q: How do you manage costs in Synapse? A: Pause Dedicated SQL Pools when idle, enable auto-pause on Spark pools, use Serverless SQL for ad-hoc queries, turn off Data Flow Debug when not developing, and monitor costs with Azure Cost Management. The biggest cost traps are Dedicated SQL and Spark pools left running overnight.
Q: What Synapse roles should a data engineer have? A: Synapse Contributor — allows creating and editing pipelines, notebooks, SQL scripts, and data flows. Add Synapse Credential User if they need to use linked service credentials. Avoid giving Synapse Administrator to everyone.
Wrapping Up
Azure Synapse Analytics is the Swiss Army knife of Azure data platforms. It puts pipelines, SQL, Spark, and data lake exploration under one roof. The setup is straightforward — create a workspace, explore with serverless SQL, build pipelines in the Integrate hub, and connect to Git for source control.
Start small: workspace + serverless SQL + pipelines. Add Spark pools when you need big data processing. Add Dedicated SQL Pool only when you need a high-performance production warehouse. This approach keeps costs low while you learn.
Related posts:
- What is Azure Data Factory?
- ADF vs Synapse Comparison
- Metadata-Driven Pipeline in ADF
- Synapse Pipeline with Audit Logging
- Azure Fundamentals
- Data Flows Guide
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.