Azure Data Lake Storage Gen2: The Complete Guide for Data Engineers

If Azure Blob Storage is the foundation, ADLS Gen2 is the evolution — purpose-built for big data analytics. It adds a hierarchical file system, fine-grained access control, and performance optimizations for Spark, Synapse, and Databricks.

Every ADF and Synapse pipeline I’ve built writes to ADLS Gen2. It’s the standard destination for data lake ingestion.

What is ADLS Gen2?

ADLS Gen2 is Azure Blob Storage with hierarchical namespace enabled. Not a separate service — a capability you activate on a storage account. You get all Blob Storage features plus:

  • Real directories — rename, move, delete directories atomically
  • POSIX ACLs — fine-grained access at directory/file level
  • Analytics optimization — faster Spark, Synapse Serverless, Databricks
  • DFS API — dedicated endpoint (dfs.core.windows.net)

ADLS Gen2 vs Blob Storage vs Gen1

Feature            Blob Storage              ADLS Gen2                  ADLS Gen1 (deprecated)
Namespace          Flat (virtual folders)    Hierarchical (real dirs)   Hierarchical
Rename directory   Copy + delete each blob   Atomic operation           Atomic
POSIX ACLs         No                        Yes                        Yes
Cost               Standard                  Same as Blob               Higher
Status             Active                    Recommended                Deprecated

The Hierarchical Namespace Explained

Flat Namespace (Regular Blob Storage)

Renaming folder customer to customers with 10,000 files = 30,000 API calls (list + copy + delete each).

Hierarchical Namespace (ADLS Gen2)

Same rename = 1 atomic operation. Instant.

Why this matters:

  • Spark writes temp files and then renames them to their final names. Flat = slow. Hierarchical = instant.
  • Delta Lake/Iceberg rely on atomic directory operations for ACID transactions.
  • Pipeline folder management: organizing date-partitioned folders is fast.
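The API-call arithmetic behind the rename example is worth seeing explicitly. A quick back-of-the-envelope sketch (an illustration of the call counts, not a measurement):

```python
def rename_calls_flat(n_files: int) -> int:
    # Flat namespace: each blob must be listed, copied to the new prefix,
    # and the original deleted -- three calls per file.
    return 3 * n_files

def rename_calls_hns(n_files: int) -> int:
    # Hierarchical namespace: one atomic directory rename, regardless of size.
    return 1

print(rename_calls_flat(10_000))  # 30000
print(rename_calls_hns(10_000))   # 1
```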

Setting Up ADLS Gen2

  1. Azure Portal > Create resource > Storage account
  2. Name: globally unique (e.g., naveendatalake)
  3. Region: same as your ADF/Synapse workspace
  4. Performance: Standard
  5. Advanced tab > Check “Enable hierarchical namespace” — this makes it ADLS Gen2
  6. Review + Create

Create Containers

raw (or bronze) — raw ingested data
curated (or silver) — cleaned data
analytics (or gold) — business-ready data
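The same containers can be created from the Azure CLI. A sketch assuming you are logged in with `az login` and the account is named naveendatalake (substitute your own account name):

```shell
# Create the three medallion containers (file systems) on an ADLS Gen2 account
az storage fs create --name raw       --account-name naveendatalake --auth-mode login
az storage fs create --name curated   --account-name naveendatalake --auth-mode login
az storage fs create --name analytics --account-name naveendatalake --auth-mode login
```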

Grant Access for Pipelines

  1. Storage Account > Access Control (IAM) > + Add role assignment
  2. Role: Storage Blob Data Contributor
  3. Members: your ADF or Synapse workspace managed identity
  4. Review + assign

Security: RBAC vs ACLs

RBAC (Broad Access)

Granted at the storage account or container level:

  • Storage Blob Data Reader: read all
  • Storage Blob Data Contributor: read + write + delete all
  • Storage Blob Data Owner: full control

Use for: Service identities (ADF, Synapse, Databricks).

POSIX ACLs (Fine-Grained)

Directory and file level permissions (Read, Write, Execute — like Linux):

/datalake/raw/                  -- Marketing: Read+Execute
/datalake/raw/finance/          -- Finance: Read+Write+Execute
/datalake/curated/              -- Analytics: Read+Execute

Use for: Multi-team data lakes with different access requirements.

Best practice: RBAC for services, ACLs for human users.
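As a sketch, here is what the two mechanisms look like from the Azure CLI. The principal IDs, scope, and paths below are placeholders, not values from this walkthrough:

```shell
# RBAC: broad access for a service identity (e.g., an ADF managed identity)
az role assignment create \
  --assignee <adf-managed-identity-object-id> \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/naveendatalake"

# ACL: fine-grained access for a team's security group on one directory
az storage fs access set \
  --acl "user::rwx,group::r-x,group:<finance-group-object-id>:rwx,other::---" \
  --path "raw/finance" \
  --file-system "datalake" \
  --account-name naveendatalake \
  --auth-mode login
```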

Bronze/Silver/Gold Folder Structure

datalake/
-- bronze/                        (Raw - schema-on-read)
   -- sqldb/
      -- Customer/
         -- 2026/04/05/
            -- part-00000.snappy.parquet
      -- Address/
      -- Product/
   -- api/
      -- weather/
         -- 2026-04-05.json
-- silver/                        (Cleaned - light schema)
   -- customer_cleaned/
   -- address_standardized/
-- gold/                          (Business-ready - schema-on-write)
   -- dim_customer/
   -- fact_sales/

Bronze: Raw data as-is. No transforms. Append-only. ADF/Synapse Copy writes here.

Silver: Cleaned, standardized. Nulls handled, dupes removed. Spark notebooks or Data Flows.

Gold: Aggregated, modeled. Star schema. Dashboards and reports query this layer.
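A small helper for generating date-partitioned bronze paths like the ones in the tree above (a sketch; the function and argument names are illustrative):

```python
from datetime import date

def bronze_path(source: str, table: str, run_date: date) -> str:
    # Builds a bronze-layer path like bronze/sqldb/Customer/2026/04/05/
    return f"bronze/{source}/{table}/{run_date:%Y/%m/%d}/"

print(bronze_path("sqldb", "Customer", date(2026, 4, 5)))
# bronze/sqldb/Customer/2026/04/05/
```

In ADF this kind of path is usually built with dataset parameters and expressions, but the shape of the output is the same.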

Using ADLS Gen2 with Pipelines

As a Sink (Most Common)

Create Linked Service > Dataset (Parquet) with parameterized container/folder > Use as Copy Sink.

Both patterns are covered in detail in the metadata-driven pipelines and Synapse audit logging posts.

Event-Triggered Pipelines

File uploaded to /raw/uploads/ > Event Grid > ADF Trigger > Pipeline runs automatically.

Connecting from Different Tools

Tool                 Connection                          URL Format
ADF/Synapse          Linked Service (Managed Identity)   https://{account}.dfs.core.windows.net
Databricks           Mount or direct ABFSS               abfss://{container}@{account}.dfs.core.windows.net/
Synapse Serverless   OPENROWSET                          https://{account}.dfs.core.windows.net/{container}/
Spark                ABFSS driver                        abfss://{container}@{account}.dfs.core.windows.net/
Python               azure-storage-file-datalake SDK     Connection string

Note the DFS endpoint: ADLS Gen2 uses dfs.core.windows.net (not blob.core.windows.net) for hierarchical operations.
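A tiny helper that assembles the two URL styles from the table (illustrative only; the function names are my own):

```python
def dfs_url(account: str, container: str = "") -> str:
    # HTTPS form used by ADF/Synapse linked services and OPENROWSET
    base = f"https://{account}.dfs.core.windows.net"
    return f"{base}/{container}/" if container else base

def abfss_url(account: str, container: str, path: str = "") -> str:
    # ABFSS form used by Spark and Databricks
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"

print(dfs_url("naveendatalake"))
# https://naveendatalake.dfs.core.windows.net
print(abfss_url("naveendatalake", "bronze", "sqldb/"))
# abfss://bronze@naveendatalake.dfs.core.windows.net/sqldb/
```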

Performance Tips

  1. Use Parquet — columnar + compression = less data to read
  2. Partition by date — a year=2026/month=04/ layout enables partition pruning
  3. Right-size files — aim for 256MB to 1GB per Parquet file
  4. Use DFS endpoint — better performance than Blob endpoint
  5. Colocate compute and storage — same Azure region
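Tip 3 (right-sizing) reduces to simple arithmetic. A sketch assuming a target of 512 MB per file, in the middle of the recommended range:

```python
import math

def target_file_count(dataset_mb: float, target_file_mb: int = 512) -> int:
    # How many output files to write so each lands near the 256MB-1GB sweet spot
    return max(1, math.ceil(dataset_mb / target_file_mb))

# e.g., a 10 GB dataset written as Parquet:
print(target_file_count(10 * 1024))  # 20
```

In Spark, that number would typically feed a repartition before the write so each task produces one right-sized file.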

Cost

Same as Blob Storage. Enabling hierarchical namespace does NOT increase cost.

Component                Hot       Cool
Storage (per GB/month)   $0.0208   $0.0115
Write ops (per 10K)      $0.065    $0.13
Read ops (per 10K)       $0.0052   $0.013

Example: 1 TB Parquet in Hot = ~$21/month.
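The monthly estimate works out as follows (a sketch using the Hot rate from the table above; actual prices vary by region and change over time):

```python
HOT_PER_GB_MONTH = 0.0208  # $/GB/month, Hot tier rate from the table above

def monthly_storage_cost(gb: float, rate: float = HOT_PER_GB_MONTH) -> float:
    # Storage component only; transaction costs are billed separately
    return gb * rate

# 1 TB (1024 GB) of Parquet in Hot:
print(round(monthly_storage_cost(1024), 2))  # 21.3
```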

Interview Questions

Q: What is ADLS Gen2?
A: Blob Storage with hierarchical namespace enabled. Adds real directories, POSIX ACLs, and analytics optimizations. Same cost as regular Blob Storage.

Q: Why enable hierarchical namespace?
A: Atomic directory operations make Spark, Delta Lake, and pipelines faster. Without it, renaming 10K files = 30K API calls. With it = 1 call.

Q: RBAC vs ACLs?
A: RBAC = broad access at account/container level. ACLs = fine-grained at directory/file level. Use both together.

Q: Recommended folder structure?
A: Bronze (raw), Silver (cleaned), Gold (business-ready). Called Medallion architecture.

Q: Which endpoint for analytics?
A: DFS endpoint (dfs.core.windows.net) for better hierarchical namespace performance.

Wrapping Up

ADLS Gen2 is the standard for modern data lakes on Azure. Same cost as Blob Storage, better performance for analytics. Use Parquet, partition by date, follow Bronze/Silver/Gold — you’re following industry best practice.

Related posts:

  • Azure Blob Storage Guide
  • Parquet vs CSV vs JSON
  • Schema-on-Write vs Schema-on-Read
  • Metadata-Driven Pipeline in ADF


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
