Azure Data Lake Storage Gen2: The Complete Guide for Data Engineers

If Azure Blob Storage is the foundation, ADLS Gen2 is the evolution — purpose-built for big data analytics. It adds a hierarchical file system, fine-grained access control, and performance optimizations for Spark, Synapse, and Databricks.

Every ADF and Synapse pipeline on this blog writes to ADLS Gen2. It is the standard destination for data lake ingestion — our metadata-driven pipeline writes Parquet to ADLS, our Databricks connectivity post reads from it, and our Medallion Architecture organizes layers inside it.

Think of ADLS Gen2 like upgrading from a basic storage locker (Blob Storage) to a fully organized filing system (ADLS Gen2). Both hold the same amount of stuff at the same price. But the filing system has labeled drawers (real directories), access controls per drawer (POSIX ACLs), and a librarian who can instantly move an entire drawer to a new cabinet (atomic directory operations). The storage locker makes you move each item one by one.

What is ADLS Gen2?
ADLS Gen2 vs Blob Storage vs Gen1
The Hierarchical Namespace Explained
Setting Up ADLS Gen2
Security: RBAC vs ACLs
POSIX ACLs Deep Dive
Bronze/Silver/Gold Folder Structure
Using ADLS Gen2 with Pipelines
Connecting from Different Tools
Performance Tips
Cost
Common Mistakes
Interview Questions
Wrapping Up

What is ADLS Gen2?

ADLS Gen2 is Azure Blob Storage with hierarchical namespace enabled. Not a separate service — a capability you activate on a storage account. You get all Blob Storage features plus:

Real directories — rename, move, delete directories atomically
POSIX ACLs — fine-grained access at directory/file level
Analytics optimization — faster Spark, Synapse Serverless, Databricks
DFS API — dedicated endpoint (dfs.core.windows.net)

ADLS Gen2 vs Blob Storage vs Gen1

Feature	Blob Storage	ADLS Gen2	ADLS Gen1 (deprecated)
Namespace	Flat (virtual folders)	Hierarchical (real dirs)	Hierarchical
Rename directory	Copy+delete each blob	Atomic operation	Atomic
POSIX ACLs	No	Yes	Yes
Cost	Standard	Same as Blob	Higher
Status	Active	Recommended	Deprecated

The Hierarchical Namespace Explained

Flat Namespace (Regular Blob Storage)

Renaming folder customer to customers with 10,000 files = 30,000 API calls (list + copy + delete each).

Hierarchical Namespace (ADLS Gen2)

Same rename = 1 atomic operation. Instant.
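As a rough sketch of what that single operation looks like from Python, the snippet below uses the azure-storage-file-datalake SDK against the DFS endpoint. The helper names `rename_target` and `rename_directory` are made up for illustration, and the account, container, and directory names are assumptions taken from the examples in this post. It also assumes `azure-identity` is installed and that the signed-in identity has write access to the container.

```python
def rename_target(file_system: str, new_directory: str) -> str:
    """Build the '<file system>/<path>' value that rename_directory expects."""
    return f"{file_system}/{new_directory.strip('/')}"


def rename_directory(account: str, file_system: str,
                     old_directory: str, new_directory: str) -> None:
    """Rename a directory with one call against the DFS endpoint."""
    # Imported lazily so rename_target() works without the Azure SDK installed.
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        account_url=f"https://{account}.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    directory = (
        service.get_file_system_client(file_system)
        .get_directory_client(old_directory)
    )
    # One request, however many files the directory contains.
    directory.rename_directory(new_name=rename_target(file_system, new_directory))


# Example call (hypothetical names, needs real credentials):
# rename_directory("naveendatalake", "raw", "customer", "customers")
```

The new name is passed as container plus path because the SDK allows a rename to land in a different container of the same account.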

Why this matters:
– Spark writes temp files, then renames them to their final names. Flat = slow. Hierarchical = instant.
– Delta Lake (and Iceberg with file-system catalogs) relies on atomic renames to commit ACID transactions.
– Pipeline folder management: organizing date-partitioned folders is fast.

Real-life analogy: Flat namespace is like a warehouse where every box has a long label: “Aisle-3/Shelf-B/Row-2/Box-15.” To “move Shelf B to Aisle 5,” a worker must pick up every box from Shelf B, cross out “Aisle-3” on each label, write “Aisle-5,” and put each box back — one at a time. Hierarchical namespace is like a real filing cabinet — grab the entire drawer labeled “Shelf B” and slide it into Aisle 5. One move, instant, done.

Impact on real operations:

Operation                    Flat (Blob)              Hierarchical (ADLS Gen2)
─────────────────────────    ──────────────────       ──────────────────────
Rename folder (10K files)    ~30,000 API calls        1 atomic operation
Delete folder (10K files)    ~20,000 API calls        1 atomic operation
Spark job commit             Slow (rename temp dirs)  Instant (atomic rename)
Delta Lake ACID commit       Relies on file renames   Fast atomic directory ops
List files in directory      Scans ALL blob names     Direct directory listing
Pipeline folder management   Slow at scale            Fast at any scale
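To make the difference concrete, here is a toy model in plain Python. It is only a sketch: a dict of full blob names stands in for the flat namespace, a nested dict stands in for real directories, and the call counting is an assumption (one listing call, then one copy and one delete per blob). Real clients may issue more or fewer calls, which is why the figures in the table are approximate.

```python
def rename_flat(blobs: dict[str, bytes], old_prefix: str, new_prefix: str) -> int:
    """Rename a virtual folder in a flat namespace; returns storage calls made."""
    calls = 1  # one listing call to find every blob under the prefix
    for name in [n for n in blobs if n.startswith(old_prefix)]:
        blobs[new_prefix + name[len(old_prefix):]] = blobs[name]  # copy
        del blobs[name]                                           # delete
        calls += 2
    return calls


def rename_hierarchical(root: dict, old_name: str, new_name: str) -> int:
    """Rename a real directory: one move, regardless of how many files it holds."""
    root[new_name] = root.pop(old_name)
    return 1


files = [f"part-{i:05d}.parquet" for i in range(10_000)]
flat = {f"customer/{f}": b"" for f in files}
tree = {"customer": {f: b"" for f in files}}

flat_calls = rename_flat(flat, "customer/", "customers/")
tree_calls = rename_hierarchical(tree, "customer", "customers")
print(flat_calls, tree_calls)
```

In the flat model the work grows with the number of files. In the hierarchical model it stays at one operation.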

Setting Up ADLS Gen2

Azure Portal > Create resource > Storage account
Name: globally unique (e.g., naveendatalake)
Region: same as your ADF/Synapse workspace
Performance: Standard
Advanced tab > Check “Enable hierarchical namespace” — this makes it ADLS Gen2
Review + Create

Create Containers

raw (or bronze) — raw ingested data
curated (or silver) — cleaned data
analytics (or gold) — business-ready data

Grant Access for Pipelines

Storage Account > Access Control (IAM) > + Add role assignment
Role: Storage Blob Data Contributor
Members: your ADF or Synapse workspace managed identity
Review + assign

Security: RBAC vs ACLs

RBAC (Broad Access)

Storage account or container level:
– Storage Blob Data Reader: read all
– Storage Blob Data Contributor: read + write + delete all
– Storage Blob Data Owner: full control

Use for: Service identities (ADF, Synapse, Databricks).

POSIX ACLs (Fine-Grained)

Directory and file level permissions (Read, Write, Execute — like Linux):

/datalake/raw/                  -- Marketing: Read+Execute
/datalake/raw/finance/          -- Finance: Read+Write+Execute
/datalake/curated/              -- Analytics: Read+Execute

Use for: Multi-team data lakes with different access requirements.

Best practice: RBAC for services, ACLs for human users.

POSIX ACLs Deep Dive

POSIX ACLs work like Linux file permissions — three permission types applied to directories and files:

Permission	On Files	On Directories	Symbol
Read (r)	Read file contents	List directory contents	r
Write (w)	Write to file	Create/delete child items	w
Execute (x)	N/A for data files	Traverse (access children)	x

Critical concept — the “Execute on directories” trap: To read a file at /raw/customers/2026/data.parquet, a user needs Read on the file AND Execute on every parent directory (/raw/, /raw/customers/, /raw/customers/2026/). Without Execute on the parent directories, the user cannot traverse the path to reach the file — even if they have Read on the file itself.

ACL example: Marketing team reads curated data, Finance reads everything

/datalake/
  ├── raw/                        Marketing: --x    Finance: r-x
  │     ├── customers/            Marketing: --x    Finance: r-x
  │     └── finance/              Marketing: ---    Finance: rwx
  ├── curated/                    Marketing: r-x    Finance: r-x
  │     ├── customer_clean/       Marketing: r-x    Finance: r-x
  │     └── finance_clean/        Marketing: ---    Finance: r-x
  └── analytics/                  Marketing: r-x    Finance: r-x

Marketing can:    Read curated/customer_clean/ ✓    Read raw/finance/ ✗
Finance can:      Read everything ✓                 Write to raw/finance/ ✓
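The traversal rule can be checked with a small simulation. This is a simplified sketch, not how Azure evaluates access: it ignores RBAC (which is checked first), owners, masks, and the "other" entry, and only looks at the named entries from the tree above. The `ACLS` dict and both function names are made up for illustration.

```python
from pathlib import PurePosixPath

# Access ACLs from the tree above, keyed by path relative to the container root.
ACLS = {
    "raw": {"marketing": "--x", "finance": "r-x"},
    "raw/customers": {"marketing": "--x", "finance": "r-x"},
    "raw/finance": {"marketing": "---", "finance": "rwx"},
    "curated": {"marketing": "r-x", "finance": "r-x"},
    "curated/customer_clean": {"marketing": "r-x", "finance": "r-x"},
    "curated/finance_clean": {"marketing": "---", "finance": "r-x"},
    "analytics": {"marketing": "r-x", "finance": "r-x"},
}


def has(principal: str, path: str, permission: str) -> bool:
    """True if the principal's entry on path contains the permission letter."""
    return permission in ACLS.get(path, {}).get(principal, "---")


def can_list(principal: str, directory: str) -> bool:
    """Listing needs Execute on every ancestor plus Read and Execute on the target."""
    target = PurePosixPath(directory)
    ancestors = [str(p) for p in target.parents if str(p) != "."]
    return (
        all(has(principal, a, "x") for a in ancestors)
        and has(principal, str(target), "r")
        and has(principal, str(target), "x")
    )


print(can_list("marketing", "curated/customer_clean"))  # True
print(can_list("marketing", "raw/finance"))             # False
print(can_list("finance", "raw/finance"))               # True
```

Marketing holds only Execute on raw/, so it can pass through that directory but cannot list it.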

Two types of ACLs:

ACL Type	What It Does	When It Applies
Access ACL	Controls access to the specific directory or file	Every directory and file
Default ACL	Template inherited by NEW child items created inside a directory	Directories only

Key rule: Setting a Default ACL on a directory does NOT change permissions on existing children — it only applies to items created after the default is set. To fix existing items, you must recursively update ACLs using Azure CLI or Storage Explorer.

Real-life analogy: An Access ACL is a lock on a specific door — only people with the key can enter. A Default ACL is a policy saying “every new room built in this wing gets this same lock type.” Rooms built before the policy keep their old locks.
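The "new rooms only" behavior is easy to see in a minimal model. The `Directory` class below is hypothetical and leaves out masks and umask handling. It only shows that children copy the parent's Default ACL at creation time and are not touched afterwards.

```python
from dataclasses import dataclass, field


@dataclass
class Directory:
    name: str
    access_acl: dict[str, str] = field(default_factory=dict)   # applies to this directory
    default_acl: dict[str, str] = field(default_factory=dict)  # template for new children
    children: dict[str, "Directory"] = field(default_factory=dict)

    def create_child(self, name: str) -> "Directory":
        """A new child directory copies the parent's Default ACL when it is created."""
        child = Directory(name, dict(self.default_acl), dict(self.default_acl))
        self.children[name] = child
        return child


raw = Directory("raw")
old = raw.create_child("2025")        # created before any default exists
raw.default_acl["finance"] = "r-x"    # Default ACL set afterwards
new = raw.create_child("2026")        # created after the default is set

print(old.access_acl)  # {}
print(new.access_acl)  # {'finance': 'r-x'}
```

The 2025 directory keeps its empty ACL until someone updates it explicitly. In this model, that explicit update plays the role of the recursive ACL update.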

Bronze/Silver/Gold Folder Structure

datalake/
-- bronze/                        (Raw - schema-on-read)
   -- sqldb/
      -- Customer/
         -- 2026/04/05/
            -- part-00000.snappy.parquet
      -- Address/
      -- Product/
   -- api/
      -- weather/
         -- 2026-04-05.json
-- silver/                        (Cleaned - light schema)
   -- customer_cleaned/
   -- address_standardized/
-- gold/                          (Business-ready - schema-on-write)
   -- dim_customer/
   -- fact_sales/

Bronze: Raw data as-is. No transforms. Append-only. ADF/Synapse Copy writes here.

Silver: Cleaned, standardized. Nulls handled, dupes removed. Spark notebooks or Data Flows.

Gold: Aggregated, modeled. Star schema. Dashboards and reports query this layer.

Using ADLS Gen2 with Pipelines

As a Sink (Most Common)

Create Linked Service > Dataset (Parquet) with parameterized container/folder > Use as Copy Sink.

Covered in detail in the metadata-driven pipelines post and the Synapse audit logging post.

Event-Triggered Pipelines

File uploaded to /raw/uploads/ > Event Grid > ADF Trigger > Pipeline runs automatically.
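ADF applies this path filter for you in the trigger definition, but it helps to see what the filter is matching. The sketch below assumes the Event Grid storage event shape, trimmed to the two fields it uses, where `subject` looks like `/blobServices/default/containers/<container>/blobs/<path>`. The function name and the sample event are made up for illustration.

```python
import re

SUBJECT_PATTERN = re.compile(
    r"^/blobServices/default/containers/(?P<container>[^/]+)/blobs/(?P<path>.+)$"
)


def should_trigger(event: dict, container: str, folder: str) -> bool:
    """True for a BlobCreated event whose blob sits under container/folder."""
    if event.get("eventType") != "Microsoft.Storage.BlobCreated":
        return False
    match = SUBJECT_PATTERN.match(event.get("subject", ""))
    if match is None:
        return False
    prefix = folder.strip("/") + "/"
    return match["container"] == container and match["path"].startswith(prefix)


sample = {
    "eventType": "Microsoft.Storage.BlobCreated",
    "subject": "/blobServices/default/containers/raw/blobs/uploads/orders.csv",
}
print(should_trigger(sample, "raw", "uploads"))  # True
```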

As a Source (Reading Data Back)

ADLS Gen2 is also used as a pipeline source when downstream processes need to read landed files:

Common source patterns:

1. Databricks reads Bronze Parquet from ADLS → transforms → writes Silver Delta
   df = spark.read.parquet("abfss://raw@naveendatalake.dfs.core.windows.net/customers/")

2. Synapse Serverless SQL queries files directly in ADLS (no copy needed)
   SELECT * FROM OPENROWSET(
     BULK 'https://naveendatalake.dfs.core.windows.net/raw/customers/*.parquet',
     FORMAT = 'PARQUET'
   ) AS customers

3. ADF Lookup activity reads a config file from ADLS
   Source: JSON file at /config/pipeline_config.json

4. Power BI Direct Lake reads Gold Delta tables from ADLS via Fabric shortcuts

Folder Naming Conventions

Consistent naming makes pipelines parameterizable and debugging easier:

Convention 1: Date-partitioned (most common for Bronze)
  /raw/customers/2026/06/12/customers.parquet
  /raw/orders/2026/06/12/orders.parquet
  ADF expression: @concat('raw/', item().TableName, '/', formatDateTime(utcNow(),'yyyy/MM/dd'), '/')

Convention 2: Hive-style partitioning (best for Spark/Databricks)
  /raw/orders/year=2026/month=06/day=12/part-00000.parquet
  Spark reads with: df = spark.read.parquet("/raw/orders/").filter("year = 2026 AND month = 6")

Convention 3: Full-load snapshots vs incremental loads
  /raw/customers/full/2026-06-12/customers.parquet
  /raw/customers/incremental/2026-06-12/delta.parquet

Rules:
  • Lowercase everything (no mixed case)
  • Underscores for multi-word names (customer_address, not customerAddress)
  • No spaces, no special characters
  • Table name matches source table name
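These conventions and rules are simple enough to put in one helper so every pipeline builds paths the same way. The following is a minimal sketch. The function names are made up, and the `raw/` prefix is taken from the examples above.

```python
import re
from datetime import date


def normalize(name: str) -> str:
    """Lowercase, underscores between words, no spaces or special characters."""
    snake = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)  # split camelCase words
    return re.sub(r"[^a-z0-9]+", "_", snake.lower()).strip("_")


def date_partitioned(table: str, day: date) -> str:
    """Convention 1: raw/<table>/yyyy/MM/dd/"""
    return f"raw/{normalize(table)}/{day:%Y/%m/%d}/"


def hive_partitioned(table: str, day: date) -> str:
    """Convention 2: raw/<table>/year=/month=/day=/"""
    return f"raw/{normalize(table)}/year={day:%Y}/month={day:%m}/day={day:%d}/"


day = date(2026, 6, 12)
print(date_partitioned("customerAddress", day))  # raw/customer_address/2026/06/12/
print(hive_partitioned("Orders", day))           # raw/orders/year=2026/month=06/day=12/
```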

Connecting from Different Tools

Tool	Connection	URL Format
ADF/Synapse	Linked Service (Managed Identity)	`https://{account}.dfs.core.windows.net`
Databricks	Mount or direct	`abfss://{container}@{account}.dfs.core.windows.net/`
Synapse Serverless	OPENROWSET	`https://{account}.dfs.core.windows.net/{container}/`
Spark	ABFSS driver	`abfss://{container}@{account}.dfs.core.windows.net/`
Python	azure-storage-file-datalake SDK	Connection string

Note the DFS endpoint: ADLS Gen2 uses dfs.core.windows.net (not blob.core.windows.net) for hierarchical operations.
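Since the two URL shapes in the table differ only in how they arrange the same three pieces, a pair of small helpers keeps them consistent. This is an illustrative sketch with made-up function names. The account and container values are the ones used earlier in this post.

```python
def abfss_uri(account: str, container: str, path: str = "") -> str:
    """URI form used by Spark and Databricks (ABFS driver over TLS)."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def dfs_url(account: str, container: str = "", path: str = "") -> str:
    """HTTPS form used by linked services and OPENROWSET."""
    parts = [p.strip("/") for p in (container, path) if p.strip("/")]
    return "/".join([f"https://{account}.dfs.core.windows.net", *parts])


print(abfss_uri("naveendatalake", "raw", "customers/"))
print(dfs_url("naveendatalake", "raw", "customers/*.parquet"))
```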

Performance Tips

Use Parquet — columnar + compression = less data to read
Partition by date — year=2026/month=04/ enables partition pruning
Right-size files — aim for 256MB to 1GB per Parquet file
Use the DFS endpoint: the ABFS driver (abfss://) calls it for native directory operations instead of emulating them through the Blob API
Colocate compute and storage — same Azure region
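For the file-size tip, the number of output files follows from simple arithmetic. The sketch below assumes a 512 MB target, a midpoint of the range above, and a made-up helper name. The result is the kind of value you might pass to repartition() or coalesce() before writing.

```python
import math


def target_file_count(total_bytes: int, target_file_bytes: int = 512 * 1024**2) -> int:
    """Number of output files needed so each lands near the target size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


total = 50 * 1024**3              # 50 GiB of Parquet output
print(target_file_count(total))   # 100
```

Compressed Parquet size is hard to predict exactly, so treat the result as a starting point and adjust after checking the files that actually land.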

Cost

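Priced in line with Blob Storage. Enabling hierarchical namespace does NOT add a separate charge. Rates vary by region and redundancy option, and transaction pricing can differ on hierarchical-namespace accounts, so treat the figures below as illustrative and check the Azure pricing page for your region.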

Component	Hot	Cool
Storage (per GB/month)	$0.0208	$0.0115
Write ops (per 10K)	$0.065	$0.13
Read ops (per 10K)	$0.0052	$0.013

Example: 1 TB Parquet in Hot = ~$21/month.
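The example above is just the table applied to 1,024 GB. A small calculator makes it easy to add transaction costs as well. The prices are copied from the table and are illustrative only. The function name and the operation counts in the second call are made up.

```python
PRICES = {  # USD, copied from the table above
    "hot": {"storage_gb": 0.0208, "write_10k": 0.065, "read_10k": 0.0052},
    "cool": {"storage_gb": 0.0115, "write_10k": 0.13, "read_10k": 0.013},
}


def monthly_cost(tier: str, stored_gb: float, writes: int = 0, reads: int = 0) -> float:
    """Storage plus transaction cost for one month, rounded to cents."""
    p = PRICES[tier]
    return round(
        stored_gb * p["storage_gb"]
        + writes / 10_000 * p["write_10k"]
        + reads / 10_000 * p["read_10k"],
        2,
    )


print(monthly_cost("hot", 1024))  # 21.3
print(monthly_cost("hot", 1024, writes=1_000_000, reads=5_000_000))
```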

Common Mistakes

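Forgetting to enable hierarchical namespace at creation: it can be enabled later, but only through a one-way account upgrade that cannot be reversed and that requires checking the account for incompatible features first. The alternative is creating a new account and migrating data. It is far simpler to check the “Enable hierarchical namespace” box in the Advanced tab when you create the account.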
Using the Blob endpoint instead of the DFS endpoint — blob.core.windows.net works but bypasses hierarchical namespace optimizations. Always use dfs.core.windows.net for ADLS Gen2 operations, especially from Spark and Databricks.
Setting ACLs on a directory but forgetting Execute on parents — granting Read on /raw/customers/ does nothing if the user lacks Execute on /raw/. Every parent directory in the path needs Execute permission.
Confusing Default ACLs with Access ACLs — setting a Default ACL on a directory does NOT change existing children. New items inherit the default, but existing items keep their old permissions. Use recursive ACL updates for existing data.
Storing too many small files — thousands of small files (under 1 MB) kill Spark performance. Use coalesce() or repartition() to create fewer, larger files (256 MB to 1 GB target). This is the “small file problem” that plagues every data lake.
Not colocating storage and compute — if your ADLS Gen2 is in East US but your Databricks workspace is in Canada Central, every read crosses Azure regions. This adds latency and egress charges. Always put storage and compute in the same region.
Still referencing ADLS Gen1: Gen1 was retired on February 29, 2024, and Microsoft published migration guides for moving to Gen2. Any remaining Gen1 endpoints in pipelines, linked services, or documentation should be moved to Gen2, which costs the same or less with better performance.

Interview Questions

Q: What is ADLS Gen2? A: Blob Storage with hierarchical namespace enabled. Adds real directories, POSIX ACLs, and analytics optimizations. Same cost as regular Blob Storage.

Q: Why enable hierarchical namespace? A: Atomic directory operations make Spark, Delta Lake, and pipelines faster. Without it, renaming 10K files = 30K API calls. With it = 1 call.

Q: RBAC vs ACLs? A: RBAC = broad access at account/container level. ACLs = fine-grained at directory/file level. Use both together.

Q: Recommended folder structure? A: Bronze (raw), Silver (cleaned), Gold (business-ready). Called Medallion architecture.

Q: Which endpoint for analytics? A: DFS endpoint (dfs.core.windows.net) for better hierarchical namespace performance.

Q: What is the difference between Access ACLs and Default ACLs? A: Access ACLs control permissions on a specific directory or file. Default ACLs are templates on directories that are automatically inherited by new child items created inside. Default ACLs do not retroactively change existing children — you must recursively update for existing data.

Q: Why do you need Execute permission on parent directories? A: Execute on a directory means “traverse” — the ability to pass through the directory to reach children. To read a file at /raw/customers/data.parquet, you need Execute on /raw/ and /raw/customers/, plus Read on the file itself. Without Execute on parents, the path is blocked even with Read on the file.

Q: How do you handle the small file problem in ADLS Gen2? A: Small files (under 1 MB each) cause Spark to create too many tasks and slow down reads. Fix with: coalesce() or repartition() when writing to create fewer larger files (256 MB–1 GB target), Delta Lake OPTIMIZE command to compact small files, or ADF Copy activity with “merge files” option.

Q: Can you convert an existing Blob Storage account to ADLS Gen2? A: Yes, with caveats. Azure supports upgrading an existing account by enabling hierarchical namespace, but the upgrade is one-way and some Blob features may be incompatible, so validate the account first. The alternative is to create a new ADLS Gen2 account, copy data using AzCopy or ADF, update all connection references, and decommission the old account.

Wrapping Up

ADLS Gen2 is the standard for modern data lakes on Azure. Same cost as Blob Storage, better performance for analytics. Use Parquet, partition by date, follow Bronze/Silver/Gold — you’re following industry best practice.

Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
