Azure Synapse Analytics Workspace Setup Guide: From Creation to Your First Pipeline

Azure Synapse Analytics is Microsoft’s answer to the question: “What if we combined Azure Data Factory, a data lake, a SQL data warehouse, and Apache Spark into one unified platform?”

Before Synapse, building a data platform meant stitching together 5-6 separate Azure services — ADF for orchestration, ADLS for storage, Azure SQL Data Warehouse for warehousing, Databricks for Spark, Power BI for visualization. Each had its own portal, its own security model, its own billing. Managing them felt like running a restaurant where the kitchen, dining room, bar, and reception each had separate management systems that did not talk to each other.

Synapse puts everything under one roof. One workspace. One studio. One security model. One place to build pipelines, write SQL, run Spark notebooks, and query your data lake.

This post walks you through setting up a Synapse workspace from scratch, configuring every component, and running your first pipeline — with the context of why each piece exists and when you need it.

Table of Contents

  • What Is Azure Synapse Analytics?
  • Synapse Components: The Restaurant Analogy
  • Prerequisites
  • Step 1: Create the Synapse Workspace
  • Step 2: Explore Synapse Studio
  • Step 3: Understanding the Default Storage (ADLS Gen2)
  • Step 4: Create a SQL Linked Service
  • Step 5: Serverless SQL Pool (Built-In)
  • Step 6: Dedicated SQL Pool (Optional)
  • Step 7: Apache Spark Pool
  • Step 8: Synapse Pipelines (ADF Inside Synapse)
  • Step 9: Build Your First Pipeline
  • Step 10: Connecting to Git (Source Control)
  • Synapse vs ADF: What Is the Same, What Is Different
  • Managed VNet and Private Endpoints
  • Access Control (RBAC and Synapse Roles)
  • Monitoring and Troubleshooting
  • Cost Management
  • Best Practices
  • Interview Questions
  • Wrapping Up

What Is Azure Synapse Analytics?

Azure Synapse Analytics is a unified analytics platform that combines:

Component | What It Does | Standalone Equivalent
Synapse Pipelines | Data orchestration (ETL/ELT) | Azure Data Factory
Serverless SQL Pool | Query data lake files with SQL (no infrastructure) | N/A (unique to Synapse)
Dedicated SQL Pool | Enterprise data warehouse (provisioned resources) | Azure SQL Data Warehouse (old name)
Apache Spark Pool | Big data processing, ML, notebooks | Azure Databricks / HDInsight
Synapse Studio | Unified web interface for everything | Azure Portal + multiple service UIs
Data Explorer Pool | Real-time analytics on streaming data | Azure Data Explorer

The key value: Everything shares the same workspace, the same security model, the same storage, and the same studio. No context-switching between portals.

Synapse Components: The Restaurant Analogy

Imagine Synapse as a modern restaurant:

Synapse Workspace = The restaurant building itself. It holds everything.

ADLS Gen2 (Primary Storage) = The pantry and walk-in refrigerator. All your raw ingredients (data) are stored here. Every component in the restaurant can access the pantry.

Serverless SQL Pool = The salad bar. Customers (analysts) walk up, pick what they want (query data lake files), and pay only for what they take (per TB scanned). No chef needed. No reservation needed. Always available.

Dedicated SQL Pool = The fine dining kitchen. Reserved capacity, dedicated chefs (compute), and a menu (schema) designed for the most important dishes (critical reports). Expensive to keep running, but delivers consistent, high-performance results.

Spark Pool = The experimental test kitchen. Chefs (data engineers and scientists) experiment with new recipes (ML models, complex transformations), work with exotic ingredients (unstructured data), and create new dishes (derived datasets). Scales up for big events, scales down when quiet.

Synapse Pipelines = The kitchen manager. Coordinates the flow: ingredients arrive from suppliers (source systems), get prepped (transformed), and are delivered to the right station (sink). Manages the schedule, handles failures, logs everything.

Synapse Studio = The restaurant’s management dashboard. One screen to monitor the salad bar, fine dining kitchen, test kitchen, and delivery schedule. No running between rooms.

Prerequisites

Before creating the workspace, you need:

  1. Azure subscription — Pay-As-You-Go or higher
  2. Resource group — create one if you do not have it (e.g., rg-dataplatform-dev)
  3. Azure ADLS Gen2 storage account — Synapse creates one during workspace setup, OR you can use an existing one
  4. Permissions — Owner or Contributor on the resource group

Step 1: Create the Synapse Workspace

Using Azure Portal

  1. Go to Azure Portal > search Synapse > click Azure Synapse Analytics
  2. Click + Create

Basics Tab

Field | Value | Notes
Subscription | Your subscription
Resource group | rg-dataplatform-dev | Create new if needed
Workspace name | naveen-synapse-ws | Globally unique
Region | Canada Central | Choose closest to your data
Data Lake Storage Gen2 | Create new or select existing
Account name | naveensynapsedl | Only if creating new
File system name | synapse-workspace | This becomes your default container

Important: The storage account MUST have hierarchical namespace enabled (ADLS Gen2, not regular Blob Storage). If you select an existing account without hierarchical namespace, the creation will fail.
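
If you are scripting the storage account yourself, the hierarchical namespace switch is a single flag. A minimal sketch, assuming the azure-mgmt-storage Python SDK and the example names from the table above:

# Hedged sketch: create an ADLS Gen2 account (is_hns_enabled=True is what makes it Gen2)
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.storage_accounts.begin_create(
    "rg-dataplatform-dev",
    "naveensynapsedl",
    {
        "location": "canadacentral",
        "kind": "StorageV2",
        "sku": {"name": "Standard_LRS"},
        "is_hns_enabled": True,  # hierarchical namespace; without this, workspace creation fails
    },
)
print(poller.result().is_hns_enabled)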

Security Tab

Field | Value | Notes
SQL administrator login | sqladmin | Admin for dedicated SQL pools
SQL administrator password | Strong password | Save this — you will need it
System assigned managed identity | Enabled (default) | Synapse uses this to access storage

Networking Tab

For learning/dev:

Field | Value
Managed virtual network | Disable (simpler for dev)
Allow connections from all IP addresses | Yes

For production:

Field | Value
Managed virtual network | Enable
Allow connections from all IP addresses | No (use private endpoints)

Review + Create

Click Create. Deployment takes 3-5 minutes.

What Gets Created

After deployment, Azure creates:

Resource Group: rg-dataplatform-dev
  |-- Synapse Workspace: naveen-synapse-ws
  |-- ADLS Gen2 Storage: naveensynapsedl
  |   |-- Container: synapse-workspace (your file system)
  |-- Managed Identity: naveen-synapse-ws (auto-created)
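
The portal works fine, but the same deployment can be scripted. A hedged sketch using the azure-mgmt-synapse Python SDK with the example values above (field names follow the ARM workspace schema; verify them against the SDK version you install):

# Hypothetical scripted deployment of the workspace described above
from azure.identity import DefaultAzureCredential
from azure.mgmt.synapse import SynapseManagementClient

client = SynapseManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.workspaces.begin_create_or_update(
    "rg-dataplatform-dev",
    "naveen-synapse-ws",
    {
        "location": "canadacentral",
        "identity": {"type": "SystemAssigned"},
        "default_data_lake_storage": {
            "account_url": "https://naveensynapsedl.dfs.core.windows.net",
            "filesystem": "synapse-workspace",
        },
        "sql_administrator_login": "sqladmin",
        "sql_administrator_login_password": "<strong-password>",
    },
)
workspace = poller.result()
print(workspace.connectivity_endpoints)  # dev, sql, sqlOnDemand, web endpoints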

Step 2: Explore Synapse Studio

Open Synapse Studio

  1. Go to your Synapse workspace in Azure Portal
  2. Click Open Synapse Studio (or go directly to web.azuresynapse.net)
  3. Select your workspace

The Six Hubs

Synapse Studio has six main sections (hubs) in the left sidebar:

Hub | Icon | What It Does
Home | House | Overview, quick actions, recent items
Data | Database cylinder | Browse data lake files, SQL databases, linked data
Develop | Code brackets | SQL scripts, Spark notebooks, Data Flows
Integrate | Pipeline icon | Pipelines (same as ADF Author tab)
Monitor | Chart | Pipeline runs, Spark jobs, SQL queries
Manage | Gear | Linked services, integration runtimes, pools, Git config

Real-life analogy: Think of the hubs as rooms in a house:

  • Data = the library (browse and explore your data)
  • Develop = the workshop (write code and scripts)
  • Integrate = the control room (build and schedule pipelines)
  • Monitor = the security room (watch everything running)
  • Manage = the utility room (configure connections and infrastructure)

Step 3: Understanding the Default Storage (ADLS Gen2)

When you created the workspace, Synapse set up a primary storage account with a default container. This is where your data lake lives.

Exploring Storage in Synapse Studio

  1. Click Data hub > Linked tab
  2. Expand Azure Data Lake Storage Gen2
  3. You see your primary storage account and the synapse-workspace container
  4. Right-click the container to create folders, upload files, or browse

The Default Linked Service

Synapse automatically creates a linked service for your primary storage:

Name: naveen-synapse-ws-WorkspaceDefaultStorage
Type: Azure Data Lake Storage Gen2
Authentication: Managed Identity (automatic, no credentials)

You do NOT need to create a linked service for your default storage. It is ready to use in datasets and pipelines.

Grant Storage Permissions

The Synapse managed identity needs the Storage Blob Data Contributor role on the ADLS Gen2 account:

  1. Go to your storage account in Azure Portal
  2. Click Access Control (IAM) > + Add > Add role assignment
  3. Role: Storage Blob Data Contributor
  4. Members: search for your workspace name (naveen-synapse-ws)
  5. Click Review + assign

Without this, pipelines will fail with “Forbidden” errors when writing to storage.

Real-life analogy: The managed identity is like a building key card. Synapse (the employee) uses its key card (managed identity) to access the storage room (ADLS Gen2). But someone (you, the admin) must first program the key card to unlock that specific room (role assignment).

Step 4: Create a SQL Linked Service

To read from Azure SQL Database, create a linked service:

  1. Manage hub > Linked services > + New
  2. Search Azure SQL Database > Continue
  3. Configure:
     • Name: LS_AzureSqlDB
     • Server: sql-dataplatform-dev.database.windows.net
     • Database: AdventureWorksLT
     • Authentication: SQL authentication or Managed Identity
  4. Test connection > Create

SQL Authentication vs Managed Identity

Method | Pros | Cons
SQL Authentication | Simple setup | Password management, rotation
Managed Identity | No passwords, most secure | Requires Azure AD admin setup on SQL Server

For production, always use Managed Identity. For learning, SQL authentication is fine.

Setting Up Managed Identity for SQL

If you want the more secure approach:

-- Run in Azure SQL Database
CREATE USER [naveen-synapse-ws] FROM EXTERNAL PROVIDER;
ALTER ROLE db_datareader ADD MEMBER [naveen-synapse-ws];
ALTER ROLE db_datawriter ADD MEMBER [naveen-synapse-ws];

Then in the linked service, select Managed Identity as authentication. No password needed.

Step 5: Serverless SQL Pool (Built-In)

What It Is

A query engine that reads files directly from your data lake (ADLS Gen2) using standard SQL. No data loading, no infrastructure to manage, no cluster to start. It is always available.

You pay only for the data scanned — approximately $5 per TB. If your query scans 10 GB, you pay $0.05.

Real-life analogy: Serverless SQL Pool is like a library with a photocopier. The books (data files) stay on the shelves (ADLS Gen2). You walk up, photocopy the pages you need (query), and pay per page copied (per TB scanned). No need to check out the book or carry it home.

Querying Data Lake Files

-- Query Parquet files directly from ADLS Gen2
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://naveensynapsedl.dfs.core.windows.net/synapse-workspace/sqldb/Customer/*.parquet',
    FORMAT = 'PARQUET'
) AS result;

-- Query CSV files
SELECT *
FROM OPENROWSET(
    BULK 'https://naveensynapsedl.dfs.core.windows.net/synapse-workspace/raw/data.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',  -- required when using HEADER_ROW
    HEADER_ROW = TRUE
) AS result;

-- Query Delta Lake tables
SELECT *
FROM OPENROWSET(
    BULK 'https://naveensynapsedl.dfs.core.windows.net/synapse-workspace/silver/customers/',
    FORMAT = 'DELTA'
) AS result;

Creating External Tables (Permanent Views on Data Lake)

-- Create a database for your logical data warehouse
CREATE DATABASE analytics;
GO

USE analytics;
GO

-- Create an external data source
CREATE EXTERNAL DATA SOURCE datalake
WITH (
    LOCATION = 'https://naveensynapsedl.dfs.core.windows.net/synapse-workspace/'
);

-- Create the external file format referenced by the table below
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (
    FORMAT_TYPE = PARQUET
);

-- Create an external table with CETAS: the query runs once and its result is
-- written to LOCATION as Parquet files (the target folder must be empty, and
-- the data stays in ADLS Gen2 -- only metadata lives in the SQL database)
CREATE EXTERNAL TABLE customers
WITH (
    LOCATION = 'gold/customers/',
    DATA_SOURCE = datalake,
    FILE_FORMAT = ParquetFormat
)
AS SELECT * FROM OPENROWSET(
    BULK 'silver/customers/*.parquet',
    DATA_SOURCE = 'datalake',
    FORMAT = 'PARQUET'
) AS r;

Now analysts can query SELECT * FROM analytics.dbo.customers like a regular SQL table, but the data lives in ADLS Gen2.
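
Anything that speaks TDS can reach those tables through the serverless endpoint. A hedged sketch from Python, assuming pyodbc with ODBC Driver 18 installed and the usual <workspace>-ondemand.sql.azuresynapse.net endpoint pattern:

# Hypothetical analyst-style query against the serverless SQL endpoint
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=naveen-synapse-ws-ondemand.sql.azuresynapse.net;"
    "Database=analytics;"
    "Authentication=ActiveDirectoryInteractive;"  # interactive Azure AD sign-in
    "Encrypt=yes;"
)

cursor = conn.cursor()
cursor.execute("SELECT TOP 10 * FROM dbo.customers;")
for row in cursor.fetchall():
    print(row)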

When to Use Serverless SQL Pool

  • Ad-hoc exploration — “What does this data look like?”
  • Logical data warehouse — create SQL views on top of data lake files
  • Power BI reports — connect Power BI directly to serverless SQL
  • Data quality checks — quick row counts, null checks, profiling
  • Cost-effective querying — pay only when you query, $0 when idle

Step 6: Dedicated SQL Pool (Optional)

What It Is

A provisioned data warehouse with reserved compute resources. Unlike serverless (pay per query), dedicated SQL pool runs continuously with guaranteed performance. This is the old Azure SQL Data Warehouse, rebranded.

Real-life analogy: If serverless SQL Pool is a library photocopier (pay per page), Dedicated SQL Pool is your own personal printer. You pay a monthly fee whether you print or not, but it is always ready and handles high-volume printing faster.

When You Need It

  • High-concurrency reporting — 50+ analysts querying simultaneously
  • Sub-second query performance — dashboards that need instant response
  • Complex aggregations — star schema queries with billions of fact rows
  • Always-on workloads — 24/7 production dashboards

When You Do NOT Need It

  • Learning and development — use serverless instead (free when idle)
  • Ad-hoc exploration — serverless is cheaper
  • Small datasets — serverless handles these efficiently
  • Variable workloads — dedicated charges even when idle

Creating a Dedicated SQL Pool

  1. Manage hub > SQL pools > + New
  2. Name: dwh_pool
  3. Performance level: DW100c (smallest, ~$1.50/hour)
  4. Click Create

Cost warning: DW100c costs approximately $1.50/hour (~$36/day, ~$1,080/month). Always pause when not in use:

Manage > SQL pools > dwh_pool > Pause

A paused dedicated pool costs $0 for compute (you still pay for storage).
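
Pausing can also be scripted, which makes a nightly "pause everything" job easy. A minimal sketch, assuming the azure-mgmt-synapse Python SDK (verify the operation names against your installed version):

from azure.identity import DefaultAzureCredential
from azure.mgmt.synapse import SynapseManagementClient

client = SynapseManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Pause compute on the dedicated pool (storage keeps billing, compute stops)
client.sql_pools.begin_pause("rg-dataplatform-dev", "naveen-synapse-ws", "dwh_pool").result()

# Resume it again before the morning reporting window
# client.sql_pools.begin_resume("rg-dataplatform-dev", "naveen-synapse-ws", "dwh_pool").result()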

Loading Data into Dedicated SQL Pool

-- COPY INTO is the fastest way to load from ADLS Gen2
COPY INTO dbo.customers
FROM 'https://naveensynapsedl.dfs.core.windows.net/synapse-workspace/silver/customers/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);

Step 7: Apache Spark Pool

What It Is

A managed Spark cluster for big data processing, machine learning, and notebook-based development. You write PySpark, Scala, or SQL in Jupyter-like notebooks.

Real-life analogy: The Spark Pool is like a catering team. When you have a big event (large data processing job), you hire a team of chefs (Spark nodes). They work together to prepare the food (process data) much faster than one chef alone. After the event, they leave (auto-scale down). You do not pay chefs to stand around waiting.

Creating a Spark Pool

  1. Manage hub > Apache Spark pools > + New
  2. Configure:
     • Name: sparkpool
     • Node size: Small (4 vCores, 32 GB) — cheapest for learning
     • Autoscale: Enabled
     • Min nodes: 3 (minimum required)
     • Max nodes: 5 (for learning — increase for production)
     • Auto-pause: Enabled, idle minutes: 15
  3. Click Review + Create > Create
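
The same pool can be provisioned from code. A hedged sketch mirroring the settings above with the azure-mgmt-synapse Python SDK (the spark_version value and the exact field names are assumptions to check against the runtimes available in your region):

from azure.identity import DefaultAzureCredential
from azure.mgmt.synapse import SynapseManagementClient

client = SynapseManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.big_data_pools.begin_create_or_update(
    "rg-dataplatform-dev",
    "naveen-synapse-ws",
    "sparkpool",
    {
        "location": "canadacentral",
        "node_size": "Small",                  # 4 vCores, 32 GB
        "node_size_family": "MemoryOptimized",
        "spark_version": "3.4",                # assumption: pick a currently supported runtime
        "auto_scale": {"enabled": True, "min_node_count": 3, "max_node_count": 5},
        "auto_pause": {"enabled": True, "delay_in_minutes": 15},
    },
)
print(poller.result().provisioning_state)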

Auto-Pause (Save Money!)

When enabled, the Spark pool automatically shuts down after the idle timeout. Next time you run a notebook, it starts up again (takes 2-3 minutes).

With auto-pause: Pay only during active use
Without auto-pause: Pay 24/7 whether you use it or not

Always enable auto-pause for development. In production, disable it if you need instant notebook startup.

Running a Spark Notebook

  1. Develop hub > + > Notebook
  2. Attach to: sparkpool
  3. Language: PySpark
# Read Parquet from data lake
df = spark.read.parquet(
    'abfss://synapse-workspace@naveensynapsedl.dfs.core.windows.net/sqldb/Customer/'
)

# Show data
df.show(5)

# Transform
from pyspark.sql.functions import upper, col, current_timestamp

df_clean = (
    df
    .withColumn('FullName', upper(col('FirstName')))
    .withColumn('LoadTimestamp', current_timestamp())
    .filter(col('CompanyName').isNotNull())
)

# Write as Delta
(df_clean.write
    .format('delta')
    .mode('overwrite')
    .save('abfss://synapse-workspace@naveensynapsedl.dfs.core.windows.net/silver/customers/'))

print(f"Written {df_clean.count()} rows")

When to Use Spark Pool

  • Large-scale transformations — processing GBs to TBs of data
  • Machine learning — training models on big datasets
  • Delta Lake operations — MERGE, OPTIMIZE, time travel (see the MERGE sketch after this list)
  • Complex ETL — when Data Flows are not flexible enough
  • Notebook-based development — exploratory analysis, prototyping
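
The Delta Lake bullet above is worth a concrete example. A minimal PySpark sketch of a MERGE (upsert) into the silver customers table, assuming it runs in a Synapse notebook (so spark already exists), that the Delta library is available on the pool, and a hypothetical updates_df DataFrame of changed rows keyed by CustomerID:

from delta.tables import DeltaTable

silver_path = 'abfss://synapse-workspace@naveensynapsedl.dfs.core.windows.net/silver/customers/'

target = DeltaTable.forPath(spark, silver_path)

(
    target.alias('t')
    .merge(updates_df.alias('s'), 't.CustomerID = s.CustomerID')  # match on the business key
    .whenMatchedUpdateAll()       # update customers that already exist
    .whenNotMatchedInsertAll()    # insert brand-new customers
    .execute()
)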

Step 8: Synapse Pipelines (ADF Inside Synapse)

What It Is

Synapse Pipelines is Azure Data Factory built into Synapse. Same visual designer, same activities, same expressions, same triggers. If you know ADF, you know Synapse Pipelines.

Key Differences from Standalone ADF

Feature | Standalone ADF | Synapse Pipelines
Portal | adf.azure.com | Synapse Studio
Default storage | Must create linked service | Default storage built-in
Spark | Not available | Run Spark notebooks from pipeline
SQL Pool | Not available | Script activity runs on SQL pools
Data Flows | Runs on ADF-managed Spark | Runs on Synapse Spark
Publish branch | adf_publish | workspace_publish
Pricing | Separate billing | Included in Synapse billing

What Is Identical

  • Pipeline designer (canvas, activities, arrows)
  • All activities (Copy, Lookup, ForEach, If Condition, Web, Stored Procedure, etc.)
  • Expressions (@item(), @pipeline(), @activity(), etc.)
  • Parameterized datasets
  • Triggers (Schedule, Tumbling Window, Event)
  • Integration Runtimes (Azure IR, Self-Hosted IR)
  • Git integration (Azure DevOps and GitHub)

Bottom line: If you built our metadata-driven pipeline in ADF, you can build the exact same pipeline in Synapse with almost no changes.

Step 9: Build Your First Pipeline

Let us build the classic Lookup + ForEach + Copy pipeline in Synapse.

Create Datasets

DS_SqlDB_Metadata:

  1. Data hub > + > Integration dataset > Azure SQL Database
  2. Name: DS_SqlDB_Metadata
  3. Linked service: LS_AzureSqlDB
  4. Table: dbo.metadata

DS_SqlDB_SourceTable (parameterized):

  1. + > Integration dataset > Azure SQL Database
  2. Name: DS_SqlDB_SourceTable
  3. Linked service: LS_AzureSqlDB
  4. Leave table blank
  5. Add parameters: SchemaName (String), TableName (String)
  6. Connection tab: Schema = @dataset().SchemaName, Table = @dataset().TableName

DS_ADLS_Sink_Parquet (parameterized):

  1. + > Integration dataset > ADLS Gen2 > Parquet
  2. Name: DS_ADLS_Sink_Parquet
  3. Linked service: naveen-synapse-ws-WorkspaceDefaultStorage
  4. Add parameters: ContainerName (String), FolderName (String)
  5. Connection tab: File system = @dataset().ContainerName, Directory = @dataset().FolderName

Create the Pipeline

  1. Integrate hub > + > Pipeline
  2. Name: PL_Copy_SqlToADLS

Lookup activity:

  • Name: Lookup_Metadata
  • Dataset: DS_SqlDB_Metadata
  • First row only: unchecked

ForEach activity:

  • Name: ForEach_Table
  • Items: @activity('Lookup_Metadata').output.value
  • Sequential: unchecked

Copy activity (inside ForEach):

  • Name: Copy_TableData
  • Source: DS_SqlDB_SourceTable
    • SchemaName: @item().SchemaName
    • TableName: @item().TableName
  • Sink: DS_ADLS_Sink_Parquet
    • ContainerName: @item().ContainerName
    • FolderName: @item().FolderName

Validate, Debug, Publish

  1. Click Validate — fix any errors
  2. Click Debug — test the pipeline
  3. Click Publish all — save to the live service
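
Once published, the pipeline does not have to be started from the Studio UI. A hedged sketch using the azure-synapse-artifacts Python package (the client and operation names here are assumptions to verify against the package version you install):

from azure.identity import DefaultAzureCredential
from azure.synapse.artifacts import ArtifactsClient

client = ArtifactsClient(
    credential=DefaultAzureCredential(),
    endpoint="https://naveen-synapse-ws.dev.azuresynapse.net",
)

# Kick off the published pipeline and remember the run id
run = client.pipeline.create_pipeline_run("PL_Copy_SqlToADLS")
print(run.run_id)

# Check on it later (the same run also shows up in the Monitor hub)
status = client.pipeline_run.get_pipeline_run(run.run_id)
print(status.status)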

Step 10: Connecting to Git (Source Control)

Why Connect to Git

Without Git, clicking Publish saves directly to the live Synapse workspace. There is no undo, no history, no code review. That is fine for learning but dangerous in production.

With Git, every change is versioned. You work in feature branches, create pull requests, and deploy through CI/CD.

Connect to GitHub

  1. Manage hub > Git configuration > Configure
  2. Repository type: GitHub (or Azure DevOps)
  3. GitHub account: authorize access
  4. Repository: select your repo
  5. Collaboration branch: main
  6. Publish branch: workspace_publish
  7. Root folder: /
  8. Import existing resources: Yes
  9. Click Apply

What Changes

After connecting:

  • A branch dropdown appears in the top toolbar
  • Changes auto-save to the Git branch (not to live Synapse)
  • You must Publish to deploy changes to the live workspace
  • Pull Requests enable code review before publishing

Synapse vs ADF: What Is the Same, What Is Different

Aspect | Choose Synapse | Choose Standalone ADF
Need SQL warehouse | Yes (dedicated SQL pool) | No
Need Spark/notebooks | Yes | No (use Databricks separately)
Need serverless SQL | Yes | No
Pipeline-only workload | Synapse works but overkill | ADF is simpler and cheaper
Existing ADF investment | Consider migration later | Keep using ADF
Unified platform | Yes (one studio for everything) | No (separate portals)

Rule of thumb: If you only need pipelines (ETL orchestration), use standalone ADF. If you need pipelines + SQL + Spark + data lake exploration, use Synapse.

Managed VNet and Private Endpoints

What Is Managed VNet

When enabled, Synapse creates a virtual network that isolates all outbound traffic. Your pipelines and Spark jobs run inside this VNet, and you control which external services they can connect to via managed private endpoints.

When to Enable

  • Production workloads — data should not traverse the public internet
  • Compliance — regulatory requirements for network isolation
  • Enterprise security — company policy requires private connectivity

When to Skip (Dev/Learning)

  • Learning and development — adds complexity without benefit
  • Small projects — public access with firewall rules is sufficient

Creating a Managed Private Endpoint

  1. Manage hub > Managed private endpoints > + New
  2. Select the target service (e.g., Azure SQL Database)
  3. Name: pe-sql-database
  4. Target resource: select your SQL server
  5. Click Create
  6. The target resource owner must approve the private endpoint connection

Access Control (RBAC and Synapse Roles)

Azure RBAC (Resource Level)

Role | What They Can Do
Owner | Full access including role assignments
Contributor | Create and manage all resources, cannot assign roles
Reader | View resources only

Synapse-Specific Roles

Role | Scope | Permissions
Synapse Administrator | Workspace | Full control of everything
Synapse SQL Administrator | SQL pools | Manage SQL pools and queries
Synapse Spark Administrator | Spark pools | Manage Spark pools and notebooks
Synapse Contributor | Workspace | Create and edit pipelines, scripts, notebooks
Synapse Artifact User | Workspace | Run pipelines and read artifacts
Synapse Credential User | Workspace | Use credentials in pipelines

Best Practice

Assign the minimum role needed:

  • Data engineers: Synapse Contributor (create pipelines and notebooks)
  • Analysts: Synapse Artifact User (run queries and read data)
  • Admins: Synapse Administrator (manage everything)

Monitoring and Troubleshooting

Monitor Hub

The Monitor hub shows:

Tab | What It Shows
Pipeline runs | All pipeline executions with status, duration, errors
Trigger runs | Which triggers fired and what they started
Integration runtime | IR status, node health, usage
Apache Spark applications | Spark job history, logs, performance
SQL requests | Queries running on serverless and dedicated pools

Common Issues

Problem | Cause | Fix
"Forbidden" on storage write | Managed identity lacks permissions | Add the Storage Blob Data Contributor role
Spark pool takes 5 minutes to start | Cold start (auto-pause was active) | Expected behavior; fewer cold starts if you increase the auto-pause timeout
Pipeline runs in ADF but fails in Synapse | Different default IR or linked service | Verify linked services and IR configuration
"Cannot connect to SQL" | Firewall not configured | Add your IP or enable "Allow Azure services"
Publish button grayed out | Not on collaboration branch | Switch to main branch

Cost Management

What Costs Money

Component | Pricing Model | Approximate Cost
Serverless SQL | Per TB scanned | ~$5/TB
Dedicated SQL Pool | Per DWU-hour | DW100c: ~$1.50/hr
Spark Pool | Per node-hour | Small node: ~$0.40/hr
Pipeline activities | Per activity run + DIU-hours | Similar to ADF
Data Flow | Per vCore-hour | ~$0.27/vCore-hr
Storage (ADLS Gen2) | Per GB stored + transactions | ~$0.02/GB/month

Cost Saving Tips

  1. Pause Dedicated SQL Pool when not in use — saves ~$36/day (DW100c); see the quick arithmetic after this list
  2. Enable Spark auto-pause — stops the cluster after idle timeout
  3. Use Serverless SQL for ad-hoc queries instead of spinning up Dedicated
  4. Turn off Data Flow Debug when not actively developing
  5. Set Spark pool to minimum nodes (3) for development
  6. Monitor costs — set up Azure Cost Management alerts
  7. Use reserved capacity for production Dedicated SQL Pools (save up to 65%)
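
To put tip 1 in numbers, here is a quick back-of-the-envelope sketch using the ~$1.50/hour DW100c rate from the table above (the 8-hours-a-day, 22-working-days schedule is only an illustrative assumption):

# Rough monthly compute cost for a DW100c dedicated pool: always on vs. paused off-hours
rate_per_hour = 1.50           # approximate DW100c rate used earlier in this post
always_on_hours = 24 * 30      # left running all month
paused_hours = 8 * 22          # assumption: 8 hours/day, ~22 working days

print(f"Always on       : ${rate_per_hour * always_on_hours:,.0f}/month")   # ~$1,080
print(f"Paused off-hours: ${rate_per_hour * paused_hours:,.0f}/month")      # ~$264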

Real-life analogy: Managing Synapse costs is like managing electricity in a house. Turn off lights (pause pools) when you leave a room. Do not leave the oven on (Debug cluster) overnight. Use energy-efficient appliances (serverless instead of dedicated) when possible.

Best Practices

  1. Start with Serverless SQL and Pipelines — do not create Dedicated SQL or Spark pools until you need them. They cost money when idle.

  2. Use Managed Identity everywhere — no passwords to manage, rotate, or leak.

  3. Connect to Git from day one — even for learning. It builds good habits and protects against accidental deletions.

  4. Organize your data lake — use the Bronze/Silver/Gold pattern in your default storage container.

  5. Use the default storage linked service — do not create a separate linked service for your primary ADLS account.

  6. Set auto-pause on Spark pools — 15 minutes for dev, 30 minutes for production.

  7. Pause Dedicated SQL Pool when not running queries — this is the biggest cost trap for new users.

  8. Use Synapse Roles for access control — do not give everyone Synapse Administrator.

Interview Questions

Q: What is Azure Synapse Analytics? A: A unified analytics platform that combines data integration (pipelines), serverless SQL (query data lake), dedicated SQL pool (data warehouse), and Apache Spark (big data processing) in one workspace with a single management studio.

Q: What is the difference between Serverless SQL Pool and Dedicated SQL Pool? A: Serverless SQL Pool queries data lake files directly with no provisioned infrastructure — you pay per TB scanned. Dedicated SQL Pool is a provisioned data warehouse with reserved compute — you pay per DWU-hour whether you query or not. Serverless is for ad-hoc queries; Dedicated is for high-concurrency production workloads.

Q: How is Synapse different from standalone ADF? A: Synapse includes ADF’s pipeline capabilities plus serverless SQL, dedicated SQL pools, Spark pools, and a unified studio. The pipeline engine is identical. Choose standalone ADF for pipeline-only workloads; choose Synapse when you need the full analytics platform.

Q: What is the default storage in Synapse? A: Every Synapse workspace has a primary ADLS Gen2 storage account configured during creation. A default linked service is auto-created, and the workspace’s managed identity is used for authentication. All components (pipelines, Spark, SQL) can access this storage natively.

Q: How do you manage costs in Synapse? A: Pause Dedicated SQL Pools when idle, enable auto-pause on Spark pools, use Serverless SQL for ad-hoc queries, turn off Data Flow Debug when not developing, and monitor costs with Azure Cost Management. The biggest cost traps are Dedicated SQL and Spark pools left running overnight.

Q: What Synapse roles should a data engineer have? A: Synapse Contributor — allows creating and editing pipelines, notebooks, SQL scripts, and data flows. Add Synapse Credential User if they need to use linked service credentials. Avoid giving Synapse Administrator to everyone.

Wrapping Up

Azure Synapse Analytics is the Swiss Army knife of Azure data platforms. It puts pipelines, SQL, Spark, and data lake exploration under one roof. The setup is straightforward — create a workspace, explore with serverless SQL, build pipelines in the Integrate hub, and connect to Git for source control.

Start small: workspace + serverless SQL + pipelines. Add Spark pools when you need big data processing. Add Dedicated SQL Pool only when you need a high-performance production warehouse. This approach keeps costs low while you learn.

Related posts:

  • What is Azure Data Factory?
  • ADF vs Synapse Comparison
  • Metadata-Driven Pipeline in ADF
  • Synapse Pipeline with Audit Logging
  • Azure Fundamentals
  • Data Flows Guide


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
