Azure Data Lake Storage Gen2: The Complete Guide for Data Engineers
If Azure Blob Storage is the foundation, ADLS Gen2 is the evolution — purpose-built for big data analytics. It adds a hierarchical file system, fine-grained access control, and performance optimizations for Spark, Synapse, and Databricks.
Every ADF and Synapse pipeline on this blog writes to ADLS Gen2. It is the standard destination for data lake ingestion — our metadata-driven pipeline writes Parquet to ADLS, our Databricks connectivity post reads from it, and our Medallion Architecture organizes layers inside it.
Think of ADLS Gen2 like upgrading from a basic storage locker (Blob Storage) to a fully organized filing system (ADLS Gen2). Both hold the same amount of stuff at the same price. But the filing system has labeled drawers (real directories), access controls per drawer (POSIX ACLs), and a librarian who can instantly move an entire drawer to a new cabinet (atomic directory operations). The storage locker makes you move each item one by one.
Table of Contents
- What is ADLS Gen2?
- ADLS Gen2 vs Blob Storage vs Gen1
- The Hierarchical Namespace Explained
- Setting Up ADLS Gen2
- Security: RBAC vs ACLs
- POSIX ACLs Deep Dive
- Bronze/Silver/Gold Folder Structure
- Using ADLS Gen2 with Pipelines
- Connecting from Different Tools
- Performance Tips
- Cost
- Common Mistakes
- Interview Questions
- Wrapping Up
What is ADLS Gen2?
ADLS Gen2 is Azure Blob Storage with hierarchical namespace enabled. Not a separate service — a capability you activate on a storage account. You get all Blob Storage features plus:
- Real directories — rename, move, delete directories atomically
- POSIX ACLs — fine-grained access at directory/file level
- Analytics optimization — faster Spark, Synapse Serverless, Databricks
- DFS API: dedicated endpoint (dfs.core.windows.net)
ADLS Gen2 vs Blob Storage vs Gen1
| Feature | Blob Storage | ADLS Gen2 | ADLS Gen1 (deprecated) |
|---|---|---|---|
| Namespace | Flat (virtual folders) | Hierarchical (real dirs) | Hierarchical |
| Rename directory | Copy+delete each blob | Atomic operation | Atomic |
| POSIX ACLs | No | Yes | Yes |
| Cost | Standard | Same as Blob | Higher |
| Status | Active | Recommended | Deprecated |
The Hierarchical Namespace Explained
Flat Namespace (Regular Blob Storage)
Renaming folder customer to customers with 10,000 files = 30,000 API calls (list + copy + delete each).
Hierarchical Namespace (ADLS Gen2)
Same rename = 1 atomic operation. Instant.
Why this matters:
- Spark writes temp files, then renames them to their final names. Flat = slow. Hierarchical = instant.
- Delta Lake/Iceberg rely on atomic directory operations for ACID transactions.
- Pipeline folder management: organizing date-partitioned folders is fast.
Real-life analogy: Flat namespace is like a warehouse where every box has a long label: “Aisle-3/Shelf-B/Row-2/Box-15.” To “move Shelf B to Aisle 5,” a worker must pick up every box from Shelf B, cross out “Aisle-3” on each label, write “Aisle-5,” and put each box back — one at a time. Hierarchical namespace is like a real filing cabinet — grab the entire drawer labeled “Shelf B” and slide it into Aisle 5. One move, instant, done.
Impact on real operations:
| Operation | Flat (Blob) | Hierarchical (ADLS Gen2) |
|---|---|---|
| Rename folder (10K files) | ~30,000 API calls | 1 atomic operation |
| Delete folder (10K files) | ~20,000 API calls | 1 atomic operation |
| Spark job commit | Slow (rename temp dirs) | Instant (atomic rename) |
| Delta Lake ACID commit | Relies on file renames | Fast atomic directory ops |
| List files in directory | Scans ALL blob names | Direct directory listing |
| Pipeline folder management | Slow at scale | Fast at any scale |
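The call counts above are simple arithmetic, and a few lines of Python make the scaling visible. This is a back-of-the-envelope model, not a measurement: it assumes one list, one copy, and one delete call per blob for a flat-namespace rename (the breakdown quoted earlier), and one list plus one delete per blob for a flat-namespace folder delete. The function names are illustrative only.

```python
def flat_rename_calls(file_count: int) -> int:
    """Approximate API calls to rename a virtual folder on a flat namespace.

    Assumes one list, one copy, and one delete call per blob.
    """
    return 3 * file_count


def flat_delete_calls(file_count: int) -> int:
    """Approximate API calls to delete a virtual folder on a flat namespace.

    Assumes one list and one delete call per blob.
    """
    return 2 * file_count


def hierarchical_calls(file_count: int) -> int:
    """With a hierarchical namespace the directory operation is one call,
    regardless of how many files the directory holds."""
    return 1


for n in (100, 10_000, 1_000_000):
    print(f"{n:>9} files: flat rename ~{flat_rename_calls(n):>9} calls, "
          f"flat delete ~{flat_delete_calls(n):>9} calls, "
          f"hierarchical {hierarchical_calls(n)} call")
```

The flat-namespace cost grows linearly with the number of files, while the hierarchical cost stays constant, which is why the gap widens as the lake grows.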
Setting Up ADLS Gen2
- Azure Portal > Create resource > Storage account
- Name: globally unique (e.g., naveendatalake)
- Region: same as your ADF/Synapse workspace
- Performance: Standard
- Advanced tab > Check “Enable hierarchical namespace” — this makes it ADLS Gen2
- Review + Create
Create Containers
raw (or bronze) — raw ingested data
curated (or silver) — cleaned data
analytics (or gold) — business-ready data
Grant Access for Pipelines
- Storage Account > Access Control (IAM) > + Add role assignment
- Role: Storage Blob Data Contributor
- Members: your ADF or Synapse workspace managed identity
- Review + assign
Security: RBAC vs ACLs
RBAC (Broad Access)
Storage account or container level:
– Storage Blob Data Reader — read all
– Storage Blob Data Contributor — read + write + delete all
– Storage Blob Data Owner — full control
Use for: Service identities (ADF, Synapse, Databricks).
POSIX ACLs (Fine-Grained)
Directory and file level permissions (Read, Write, Execute — like Linux):
/datalake/raw/ -- Marketing: Read+Execute
/datalake/raw/finance/ -- Finance: Read+Write+Execute
/datalake/curated/ -- Analytics: Read+Execute
Use for: Multi-team data lakes with different access requirements.
Best practice: RBAC for services, ACLs for human users.
POSIX ACLs Deep Dive
POSIX ACLs work like Linux file permissions — three permission types applied to directories and files:
| Permission | On Files | On Directories | Symbol |
|---|---|---|---|
| Read (r) | Read file contents | List directory contents | r |
| Write (w) | Write to file | Create/delete child items | w |
| Execute (x) | N/A for data files | Traverse (access children) | x |
Critical concept — the “Execute on directories” trap: To read a file at /raw/customers/2026/data.parquet, a user needs Read on the file AND Execute on every parent directory (/raw/, /raw/customers/, /raw/customers/2026/). Without Execute on the parent directories, the user cannot traverse the path to reach the file — even if they have Read on the file itself.
ACL example: Marketing team reads curated data, Finance reads everything
/datalake/
├── raw/ Marketing: --x Finance: r-x
│ ├── customers/ Marketing: --x Finance: r-x
│ └── finance/ Marketing: --- Finance: rwx
├── curated/ Marketing: r-x Finance: r-x
│ ├── customer_clean/ Marketing: r-x Finance: r-x
│ └── finance_clean/ Marketing: --- Finance: r-x
└── analytics/ Marketing: r-x Finance: r-x
Marketing can: Read curated/customer_clean/ ✓ Read raw/finance/ ✗
Finance can: Read everything ✓ Write to raw/finance/ ✓
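The traversal rule can be checked with a small simulation. The sketch below is a simplified model rather than the real ADLS Gen2 evaluation: it tracks a single rwx string per directory for one principal and ignores owners, groups, masks, and RBAC. The dictionaries copy the Marketing and Finance columns from the tree above, the container root is assumed to be traversable, and the `TRAP` entries and helper names are made up for illustration.

```python
from pathlib import PurePosixPath
from typing import Mapping

# Permissions per directory, copied from the example tree (paths relative to the container).
MARKETING = {
    "/raw": "--x",
    "/raw/customers": "--x",
    "/raw/finance": "---",
    "/curated": "r-x",
    "/curated/customer_clean": "r-x",
    "/curated/finance_clean": "---",
    "/analytics": "r-x",
}
FINANCE = {
    "/raw": "r-x",
    "/raw/customers": "r-x",
    "/raw/finance": "rwx",
    "/curated": "r-x",
    "/curated/customer_clean": "r-x",
    "/curated/finance_clean": "r-x",
    "/analytics": "r-x",
}
# Hypothetical ACL showing the Execute trap: Read on the file, no Execute on /raw.
TRAP = {
    "/raw": "---",
    "/raw/customers": "r-x",
    "/raw/customers/data.parquet": "r--",
}


def has(acl: Mapping[str, str], path: str, needed: str) -> bool:
    """True if every permission letter in `needed` is granted on `path`."""
    granted = acl.get(path, "---")  # paths not listed grant nothing
    return all(letter in granted for letter in needed)


def can_access(acl: Mapping[str, str], path: str, needed: str) -> bool:
    """Require `needed` on the target and Execute on every parent directory."""
    target = PurePosixPath(path)
    # The root "/" is skipped: this sketch assumes it is traversable.
    parents = [str(p) for p in target.parents if str(p) != "/"]
    return all(has(acl, p, "x") for p in parents) and has(acl, str(target), needed)


print(can_access(MARKETING, "/curated/customer_clean", "rx"))  # True
print(can_access(MARKETING, "/raw/finance", "rx"))             # False
print(can_access(FINANCE, "/raw/finance", "wx"))               # True
print(can_access(TRAP, "/raw/customers/data.parquet", "r"))    # False
```

The last call is the trap in action: the file itself grants Read, but the missing Execute on /raw blocks the path before the file is ever reached.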
Two types of ACLs:
| ACL Type | What It Does | When It Applies |
|---|---|---|
| Access ACL | Controls access to the specific directory or file | Every directory and file |
| Default ACL | Template inherited by NEW child items created inside a directory | Directories only |
Key rule: Setting a Default ACL on a directory does NOT change permissions on existing children — it only applies to items created after the default is set. To fix existing items, you must recursively update ACLs using Azure CLI or Storage Explorer.
Real-life analogy: An Access ACL is a lock on a specific door — only people with the key can enter. A Default ACL is a policy saying “every new room built in this wing gets this same lock type.” Rooms built before the policy keep their old locks.
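The difference between the two ACL types is easiest to see by modelling it. The following is a toy in-memory model, not the storage service itself: the `Directory` class and its methods are hypothetical, permissions are reduced to a single rwx string, and a child created with no default ACL in place simply gets "---" (the real service derives it from other settings such as the umask).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Directory:
    name: str
    access_acl: str = "---"            # applies to this directory itself
    default_acl: Optional[str] = None  # template for children created later
    children: Dict[str, "Directory"] = field(default_factory=dict)

    def create_child(self, name: str) -> "Directory":
        """A new child directory copies the parent's default ACL, if one is set."""
        if self.default_acl is not None:
            child = Directory(name, access_acl=self.default_acl,
                              default_acl=self.default_acl)
        else:
            child = Directory(name)  # simplification: no default means no access
        self.children[name] = child
        return child

    def set_access_recursive(self, acl: str) -> None:
        """Explicitly rewrite the access ACL on this directory and all descendants."""
        self.access_acl = acl
        for child in self.children.values():
            child.set_access_recursive(acl)


raw = Directory("raw", access_acl="r-x")
old = raw.create_child("2025")   # created before any default ACL exists
raw.default_acl = "r-x"          # default ACL set afterwards
new = raw.create_child("2026")   # created after the default is set

print(old.access_acl)  # ---  (unchanged by the new default)
print(new.access_acl)  # r-x  (inherited from the default)

raw.set_access_recursive("r-x")  # the explicit recursive fix for existing items
print(old.access_acl)  # r-x
```

Setting the default only affects the child created afterwards; the older child changes only when the recursive update runs, which mirrors the key rule above.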
Bronze/Silver/Gold Folder Structure
datalake/
├── bronze/                      (Raw - schema-on-read)
│   ├── sqldb/
│   │   ├── Customer/
│   │   │   └── 2026/04/05/
│   │   │       └── part-00000.snappy.parquet
│   │   ├── Address/
│   │   └── Product/
│   └── api/
│       └── weather/
│           └── 2026-04-05.json
├── silver/                      (Cleaned - light schema)
│   ├── customer_cleaned/
│   └── address_standardized/
└── gold/                        (Business-ready - schema-on-write)
    ├── dim_customer/
    └── fact_sales/
Bronze: Raw data as-is. No transforms. Append-only. ADF/Synapse Copy writes here.
Silver: Cleaned, standardized. Nulls handled, dupes removed. Spark notebooks or Data Flows.
Gold: Aggregated, modeled. Star schema. Dashboards and reports query this layer.
Using ADLS Gen2 with Pipelines
As a Sink (Most Common)
Create Linked Service > Dataset (Parquet) with parameterized container/folder > Use as Copy Sink.
Covered in detail in metadata-driven pipelines and Synapse with audit logging.
Event-Triggered Pipelines
File uploaded to /raw/uploads/ > Event Grid > ADF Trigger > Pipeline runs automatically.
As a Source (Reading Data Back)
ADLS Gen2 is also used as a pipeline source when downstream processes need to read landed files:
Common source patterns:
1. Databricks reads Bronze Parquet from ADLS → transforms → writes Silver Delta
df = spark.read.parquet("abfss://raw@naveendatalake.dfs.core.windows.net/customers/")
2. Synapse Serverless SQL queries files directly in ADLS (no copy needed)
SELECT * FROM OPENROWSET(
BULK 'https://naveendatalake.dfs.core.windows.net/raw/customers/*.parquet',
FORMAT = 'PARQUET'
) AS customers
3. ADF Lookup activity reads a config file from ADLS
Source: JSON file at /config/pipeline_config.json
4. Power BI Direct Lake reads Gold Delta tables from ADLS via Fabric shortcuts
Folder Naming Conventions
Consistent naming makes pipelines parameterizable and debugging easier:
Convention 1: Date-partitioned (most common for Bronze)
/raw/customers/2026/06/12/customers.parquet
/raw/orders/2026/06/12/orders.parquet
ADF expression: @concat('raw/', item().TableName, '/', formatDateTime(utcNow(),'yyyy/MM/dd'), '/')
Convention 2: Hive-style partitioning (best for Spark/Databricks)
/raw/orders/year=2026/month=06/day=12/part-00000.parquet
Spark reads with: df = spark.read.parquet("/raw/orders/").filter("year = 2026 AND month = 6")
Convention 3: Full-load and incremental snapshots (kept in separate subfolders)
/raw/customers/full/2026-06-12/customers.parquet
/raw/customers/incremental/2026-06-12/delta.parquet
Rules:
• Lowercase everything (no mixed case)
• Underscores for multi-word names (customer_address, not customerAddress)
• No spaces, no special characters
• Table name matches source table name
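For code that runs outside ADF (a Databricks notebook, for example), the same conventions can be produced with small helpers. This is a minimal sketch: the function names and the `layer` parameter are illustrative, and lowercasing the table name is an assumption taken from the rules above, not something the ADF expression does on its own.

```python
from datetime import date


def date_partitioned_path(table_name: str, run_date: date, layer: str = "raw") -> str:
    """Convention 1: <layer>/<table>/yyyy/MM/dd/"""
    return f"{layer}/{table_name.lower()}/{run_date:%Y/%m/%d}/"


def hive_partitioned_path(table_name: str, run_date: date, layer: str = "raw") -> str:
    """Convention 2: <layer>/<table>/year=yyyy/month=MM/day=dd/"""
    return (f"{layer}/{table_name.lower()}/"
            f"year={run_date:%Y}/month={run_date:%m}/day={run_date:%d}/")


run_date = date(2026, 6, 12)
print(date_partitioned_path("Customers", run_date))  # raw/customers/2026/06/12/
print(hive_partitioned_path("orders", run_date))     # raw/orders/year=2026/month=06/day=12/
```

Keeping path construction in one place means a convention change touches one function instead of every notebook.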
Connecting from Different Tools
| Tool | Connection | URL Format |
|---|---|---|
| ADF/Synapse | Linked Service (Managed Identity) | https://{account}.dfs.core.windows.net |
| Databricks | Mount or direct | abfss://{container}@{account}.dfs.core.windows.net/ |
| Synapse Serverless | OPENROWSET | https://{account}.dfs.core.windows.net/{container}/ |
| Spark | ABFSS driver | abfss://{container}@{account}.dfs.core.windows.net/ |
| Python | azure-storage-file-datalake SDK | Connection string |
Note the DFS endpoint: ADLS Gen2 uses dfs.core.windows.net (not blob.core.windows.net) for hierarchical operations.
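Since the two URL shapes in the table differ only in where the container goes, it can help to build them from the same inputs. The helpers below are an illustrative sketch (the function names are not part of any SDK); they only format strings and do not contact Azure.

```python
DFS_SUFFIX = "dfs.core.windows.net"


def abfss_url(account: str, container: str, path: str = "") -> str:
    """ABFSS form used by Spark and Databricks: container comes before the @."""
    return f"abfss://{container}@{account}.{DFS_SUFFIX}/{path.lstrip('/')}"


def https_dfs_url(account: str, container: str, path: str = "") -> str:
    """HTTPS form used by Synapse Serverless OPENROWSET: container is the first path segment."""
    return f"https://{account}.{DFS_SUFFIX}/{container}/{path.lstrip('/')}"


print(abfss_url("naveendatalake", "raw", "customers/"))
print(https_dfs_url("naveendatalake", "raw", "customers/*.parquet"))
```

Both calls reproduce the URLs used in the Databricks and Synapse Serverless examples earlier in this post.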
Performance Tips
- Use Parquet — columnar + compression = less data to read
- Partition by date: year=2026/month=04/ enables partition pruning
- Right-size files: aim for 256 MB to 1 GB per Parquet file
- Use DFS endpoint — better performance than Blob endpoint
- Colocate compute and storage — same Azure region
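To turn the file-size guideline into a number you can pass to a Spark repartition or coalesce call, divide the dataset size by a target file size. The sketch below assumes a 512 MB target (a midpoint of the 256 MB to 1 GB range, chosen here for illustration) and assumes you already know the approximate on-disk size of the output; the function name is hypothetical.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def target_file_count(total_bytes: int, target_file_bytes: int = 512 * MB) -> int:
    """Number of output files needed so each is roughly `target_file_bytes`.

    Always returns at least 1, so tiny datasets still produce one file.
    """
    return max(1, math.ceil(total_bytes / target_file_bytes))


print(target_file_count(100 * GB))  # 200
print(target_file_count(10 * MB))   # 1
```

Compressed Parquet size is hard to predict exactly from in-memory size, so treat the result as a starting point and adjust after inspecting the files a job actually writes.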
Cost
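Storage capacity is priced in line with Blob Storage, so enabling hierarchical namespace does not add a separate charge. Transaction rates can be metered differently on hierarchical-namespace accounts, and all prices vary by region and redundancy, so treat the figures below as illustrative and confirm them on the Azure pricing page.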
| Component | Hot | Cool |
|---|---|---|
| Storage (per GB/month) | $0.0208 | $0.0115 |
| Write ops (per 10K) | $0.065 | $0.13 |
| Read ops (per 10K) | $0.0052 | $0.013 |
Example: 1 TB Parquet in Hot = ~$21/month.
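The same estimate can be extended to include transactions. This sketch hard-codes the figures from the table above as assumptions (real prices vary by region and change over time) and counts 1 TB as 1,024 GB; the function and dictionary names are illustrative.

```python
# Illustrative prices copied from the table above (USD).
PRICES = {
    "hot":  {"storage_gb_month": 0.0208, "write_per_10k": 0.065, "read_per_10k": 0.0052},
    "cool": {"storage_gb_month": 0.0115, "write_per_10k": 0.13,  "read_per_10k": 0.013},
}


def monthly_cost(tier: str, stored_gb: float, writes: int = 0, reads: int = 0) -> float:
    """Estimated monthly cost: storage plus write and read operations."""
    p = PRICES[tier]
    return (stored_gb * p["storage_gb_month"]
            + writes / 10_000 * p["write_per_10k"]
            + reads / 10_000 * p["read_per_10k"])


print(f"1 TB hot, storage only:  ${monthly_cost('hot', 1024):.2f}")
print(f"1 TB cool, storage only: ${monthly_cost('cool', 1024):.2f}")
print(f"1 TB hot + 1M writes + 10M reads: "
      f"${monthly_cost('hot', 1024, writes=1_000_000, reads=10_000_000):.2f}")
```

Cool storage is cheaper per GB but charges more per operation, so the tier that wins depends on how often the data is read and rewritten.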
Common Mistakes
- Forgetting to enable hierarchical namespace at creation. Enabling it later is possible through Microsoft's in-place upgrade for existing accounts, but the upgrade is one-way, some Blob Storage features are not compatible with it, and the account needs validation and testing first. It is far simpler to check the “Enable hierarchical namespace” box in the Advanced tab when you create the account.
- Using the Blob endpoint instead of the DFS endpoint. blob.core.windows.net works but bypasses hierarchical namespace optimizations. Always use dfs.core.windows.net for ADLS Gen2 operations, especially from Spark and Databricks.
- Setting ACLs on a directory but forgetting Execute on parents. Granting Read on /raw/customers/ does nothing if the user lacks Execute on /raw/. Every parent directory in the path needs Execute permission.
- Confusing Default ACLs with Access ACLs. Setting a Default ACL on a directory does NOT change existing children. New items inherit the default, but existing items keep their old permissions. Use recursive ACL updates for existing data.
- Storing too many small files. Thousands of small files (under 1 MB) kill Spark performance. Use coalesce() or repartition() to create fewer, larger files (256 MB to 1 GB target). This is the “small file problem” that plagues every data lake.
- Not colocating storage and compute. If your ADLS Gen2 is in East US but your Databricks workspace is in Canada Central, every read crosses Azure regions. This adds latency and egress charges. Always put storage and compute in the same region.
- Still using ADLS Gen1. Gen1 is deprecated, and Microsoft has published migration guides to move to Gen2. If you are on Gen1, migrate now: Gen2 costs the same or less with better performance.
Interview Questions
Q: What is ADLS Gen2? A: Blob Storage with hierarchical namespace enabled. Adds real directories, POSIX ACLs, and analytics optimizations. Same cost as regular Blob Storage.
Q: Why enable hierarchical namespace? A: Atomic directory operations make Spark, Delta Lake, and pipelines faster. Without it, renaming 10K files = 30K API calls. With it = 1 call.
Q: RBAC vs ACLs? A: RBAC = broad access at account/container level. ACLs = fine-grained at directory/file level. Use both together.
Q: Recommended folder structure? A: Bronze (raw), Silver (cleaned), Gold (business-ready). Called Medallion architecture.
Q: Which endpoint for analytics?
A: DFS endpoint (dfs.core.windows.net) for better hierarchical namespace performance.
Q: What is the difference between Access ACLs and Default ACLs? A: Access ACLs control permissions on a specific directory or file. Default ACLs are templates on directories that are automatically inherited by new child items created inside. Default ACLs do not retroactively change existing children — you must recursively update for existing data.
Q: Why do you need Execute permission on parent directories?
A: Execute on a directory means “traverse” — the ability to pass through the directory to reach children. To read a file at /raw/customers/data.parquet, you need Execute on /raw/ and /raw/customers/, plus Read on the file itself. Without Execute on parents, the path is blocked even with Read on the file.
Q: How do you handle the small file problem in ADLS Gen2?
A: Small files (under 1 MB each) cause Spark to create too many tasks and slow down reads. Fix with: coalesce() or repartition() when writing to create fewer larger files (256 MB–1 GB target), Delta Lake OPTIMIZE command to compact small files, or ADF Copy activity with “merge files” option.
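Q: Can you convert an existing Blob Storage account to ADLS Gen2? A: Yes, in most cases. Microsoft provides an in-place upgrade that enables hierarchical namespace on an existing account. The upgrade is one-way and some Blob Storage features are not supported with hierarchical namespace, so validate the account first. The alternative is to create a new ADLS Gen2 account, copy data using AzCopy or ADF, update all connection references, and decommission the old account.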
Wrapping Up
ADLS Gen2 is the standard for modern data lakes on Azure. Same cost as Blob Storage, better performance for analytics. Use Parquet, partition by date, follow Bronze/Silver/Gold — you’re following industry best practice.
Related posts:
- Azure Blob Storage Guide
- Parquet vs CSV vs JSON
- Schema-on-Write vs Schema-on-Read
- Metadata-Driven Pipeline in ADF
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.