Azure Data Lake Storage Gen2: The Complete Guide for Data Engineers

If Azure Blob Storage is the foundation, ADLS Gen2 is the evolution — purpose-built for big data analytics. It adds a hierarchical file system, fine-grained access control, and performance optimizations for Spark, Synapse, and Databricks.

Every ADF and Synapse pipeline I’ve built writes to ADLS Gen2. It’s the standard destination for data lake ingestion.

What is ADLS Gen2?

ADLS Gen2 is Azure Blob Storage with hierarchical namespace enabled. Not a separate service — a capability you activate on a storage account. You get all Blob Storage features plus:

  • Real directories — rename, move, delete directories atomically
  • POSIX ACLs — fine-grained access at directory/file level
  • Analytics optimization — faster Spark, Synapse Serverless, Databricks
  • DFS API — dedicated endpoint (dfs.core.windows.net)

ADLS Gen2 vs Blob Storage vs Gen1

Feature            Blob Storage              ADLS Gen2                  ADLS Gen1 (deprecated)
Namespace          Flat (virtual folders)    Hierarchical (real dirs)   Hierarchical
Rename directory   Copy + delete each blob   Atomic operation           Atomic
POSIX ACLs         No                        Yes                        Yes
Cost               Standard                  Same as Blob               Higher
Status             Active                    Recommended                Deprecated

The Hierarchical Namespace Explained

Flat Namespace (Regular Blob Storage)

Renaming folder customer to customers with 10,000 files = 30,000 API calls (list + copy + delete each).

Hierarchical Namespace (ADLS Gen2)

Same rename = 1 atomic operation. Instant.

Why this matters:

  • Spark writes temp files and then renames them to their final names. Flat = slow. Hierarchical = instant.
  • Delta Lake/Iceberg rely on atomic directory operations for ACID transactions.
  • Pipeline folder management: organizing date-partitioned folders is fast.
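The API-call arithmetic behind the rename example is worth seeing explicitly. A quick back-of-the-envelope sketch (an illustration of the call counts, not a measurement):

```python
def rename_calls_flat(n_files: int) -> int:
    # Flat namespace: each blob must be listed, copied to the new prefix,
    # and the original deleted -- three calls per file.
    return 3 * n_files

def rename_calls_hns(n_files: int) -> int:
    # Hierarchical namespace: one atomic directory rename, regardless of size.
    return 1

print(rename_calls_flat(10_000))  # 30000
print(rename_calls_hns(10_000))   # 1
```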

Setting Up ADLS Gen2

  1. Azure Portal > Create resource > Storage account
  2. Name: globally unique (e.g., naveendatalake)
  3. Region: same as your ADF/Synapse workspace
  4. Performance: Standard
  5. Advanced tab > Check “Enable hierarchical namespace” — this makes it ADLS Gen2
  6. Review + Create

Create Containers

raw (or bronze) — raw ingested data
curated (or silver) — cleaned data
analytics (or gold) — business-ready data
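The same containers can be created from the Azure CLI. A sketch assuming you are logged in with `az login` and the account is named naveendatalake (substitute your own account name):

```shell
# Create the three medallion containers (file systems) on an ADLS Gen2 account
az storage fs create --name raw       --account-name naveendatalake --auth-mode login
az storage fs create --name curated   --account-name naveendatalake --auth-mode login
az storage fs create --name analytics --account-name naveendatalake --auth-mode login
```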

Grant Access for Pipelines

  1. Storage Account > Access Control (IAM) > + Add role assignment
  2. Role: Storage Blob Data Contributor
  3. Members: your ADF or Synapse workspace managed identity
  4. Review + assign

Security: RBAC vs ACLs

RBAC (Broad Access)

Granted at the storage account or container level:

  • Storage Blob Data Reader: read all
  • Storage Blob Data Contributor: read + write + delete all
  • Storage Blob Data Owner: full control

Use for: Service identities (ADF, Synapse, Databricks).

POSIX ACLs (Fine-Grained)

Directory and file level permissions (Read, Write, Execute — like Linux):

/datalake/raw/                  -- Marketing: Read+Execute
/datalake/raw/finance/          -- Finance: Read+Write+Execute
/datalake/curated/              -- Analytics: Read+Execute

Use for: Multi-team data lakes with different access requirements.

Best practice: RBAC for services, ACLs for human users.
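As a sketch, here is what the two mechanisms look like from the Azure CLI. The principal IDs, scope, and paths below are placeholders, not values from this walkthrough:

```shell
# RBAC: broad access for a service identity (e.g., an ADF managed identity)
az role assignment create \
  --assignee <adf-managed-identity-object-id> \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/naveendatalake"

# ACL: fine-grained access for a team's security group on one directory
az storage fs access set \
  --acl "user::rwx,group::r-x,group:<finance-group-object-id>:rwx,other::---" \
  --path "raw/finance" \
  --file-system "datalake" \
  --account-name naveendatalake \
  --auth-mode login
```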

Bronze/Silver/Gold Folder Structure

datalake/
-- bronze/                        (Raw - schema-on-read)
   -- sqldb/
      -- Customer/
         -- 2026/04/05/
            -- part-00000.snappy.parquet
      -- Address/
      -- Product/
   -- api/
      -- weather/
         -- 2026-04-05.json
-- silver/                        (Cleaned - light schema)
   -- customer_cleaned/
   -- address_standardized/
-- gold/                          (Business-ready - schema-on-write)
   -- dim_customer/
   -- fact_sales/

Bronze: Raw data as-is. No transforms. Append-only. ADF/Synapse Copy writes here.

Silver: Cleaned, standardized. Nulls handled, dupes removed. Spark notebooks or Data Flows.

Gold: Aggregated, modeled. Star schema. Dashboards and reports query this layer.
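A small helper for generating date-partitioned bronze paths like the ones in the tree above (a sketch; the function and argument names are illustrative):

```python
from datetime import date

def bronze_path(source: str, table: str, run_date: date) -> str:
    # Builds a bronze-layer path like bronze/sqldb/Customer/2026/04/05/
    return f"bronze/{source}/{table}/{run_date:%Y/%m/%d}/"

print(bronze_path("sqldb", "Customer", date(2026, 4, 5)))
# bronze/sqldb/Customer/2026/04/05/
```

In ADF this kind of path is usually built with dataset parameters and expressions, but the shape of the output is the same.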

Using ADLS Gen2 with Pipelines

As a Sink (Most Common)

Create Linked Service > Dataset (Parquet) with parameterized container/folder > Use as Copy Sink.

Both patterns are covered in detail in the metadata-driven pipelines and Synapse audit logging posts.

Event-Triggered Pipelines

File uploaded to /raw/uploads/ > Event Grid > ADF Trigger > Pipeline runs automatically.

Connecting from Different Tools

Tool                 Connection                          URL Format
ADF/Synapse          Linked Service (Managed Identity)   https://{account}.dfs.core.windows.net
Databricks           Mount or direct ABFSS               abfss://{container}@{account}.dfs.core.windows.net/
Synapse Serverless   OPENROWSET                          https://{account}.dfs.core.windows.net/{container}/
Spark                ABFSS driver                        abfss://{container}@{account}.dfs.core.windows.net/
Python               azure-storage-file-datalake SDK     Connection string

Note the DFS endpoint: ADLS Gen2 uses dfs.core.windows.net (not blob.core.windows.net) for hierarchical operations.
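A tiny helper that assembles the two URL styles from the table (illustrative only; the function names are my own):

```python
def dfs_url(account: str, container: str = "") -> str:
    # HTTPS form used by ADF/Synapse linked services and OPENROWSET
    base = f"https://{account}.dfs.core.windows.net"
    return f"{base}/{container}/" if container else base

def abfss_url(account: str, container: str, path: str = "") -> str:
    # ABFSS form used by Spark and Databricks
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"

print(dfs_url("naveendatalake"))
# https://naveendatalake.dfs.core.windows.net
print(abfss_url("naveendatalake", "bronze", "sqldb/"))
# abfss://bronze@naveendatalake.dfs.core.windows.net/sqldb/
```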

Performance Tips

  1. Use Parquet — columnar + compression = less data to read
  2. Partition by date — a year=2026/month=04/ layout enables partition pruning
  3. Right-size files — aim for 256MB to 1GB per Parquet file
  4. Use DFS endpoint — better performance than Blob endpoint
  5. Colocate compute and storage — same Azure region
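Tip 3 (right-sizing) reduces to simple arithmetic. A sketch assuming a target of 512 MB per file, in the middle of the recommended range:

```python
import math

def target_file_count(dataset_mb: float, target_file_mb: int = 512) -> int:
    # How many output files to write so each lands near the 256MB-1GB sweet spot
    return max(1, math.ceil(dataset_mb / target_file_mb))

# e.g., a 10 GB dataset written as Parquet:
print(target_file_count(10 * 1024))  # 20
```

In Spark, that number would typically feed a repartition before the write so each task produces one right-sized file.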

Cost

Same as Blob Storage. Enabling hierarchical namespace does NOT increase cost.

Component                Hot       Cool
Storage (per GB/month)   $0.0208   $0.0115
Write ops (per 10K)      $0.065    $0.13
Read ops (per 10K)       $0.0052   $0.013

Example: 1 TB Parquet in Hot = ~$21/month.
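The monthly estimate works out as follows (a sketch using the Hot rate from the table above; actual prices vary by region and change over time):

```python
HOT_PER_GB_MONTH = 0.0208  # $/GB/month, Hot tier rate from the table above

def monthly_storage_cost(gb: float, rate: float = HOT_PER_GB_MONTH) -> float:
    # Storage component only; transaction costs are billed separately
    return gb * rate

# 1 TB (1024 GB) of Parquet in Hot:
print(round(monthly_storage_cost(1024), 2))  # 21.3
```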

Interview Questions

Q: What is ADLS Gen2?
A: Blob Storage with hierarchical namespace enabled. Adds real directories, POSIX ACLs, and analytics optimizations. Same cost as regular Blob Storage.

Q: Why enable hierarchical namespace?
A: Atomic directory operations make Spark, Delta Lake, and pipelines faster. Without it, renaming 10K files = 30K API calls. With it = 1 call.

Q: RBAC vs ACLs?
A: RBAC = broad access at account/container level. ACLs = fine-grained at directory/file level. Use both together.

Q: Recommended folder structure?
A: Bronze (raw), Silver (cleaned), Gold (business-ready). Called Medallion architecture.

Q: Which endpoint for analytics?
A: DFS endpoint (dfs.core.windows.net) for better hierarchical namespace performance.

Wrapping Up

ADLS Gen2 is the standard for modern data lakes on Azure. Same cost as Blob Storage, better performance for analytics. Use Parquet, partition by date, follow Bronze/Silver/Gold — you’re following industry best practice.

Related posts:

  • Azure Blob Storage Guide
  • Parquet vs CSV vs JSON
  • Schema-on-Write vs Schema-on-Read
  • Metadata-Driven Pipeline in ADF


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
