Azure Data Lake Storage Gen2: The Complete Guide for Data Engineers

If Azure Blob Storage is the foundation, ADLS Gen2 is the evolution — purpose-built for big data analytics. It adds a hierarchical file system, fine-grained access control, and performance optimizations for Spark, Synapse, and Databricks.

Every ADF and Synapse pipeline on this blog writes to ADLS Gen2. It is the standard destination for data lake ingestion — our metadata-driven pipeline writes Parquet to ADLS, our Databricks connectivity post reads from it, and our Medallion Architecture organizes layers inside it.

Think of ADLS Gen2 like upgrading from a basic storage locker (Blob Storage) to a fully organized filing system (ADLS Gen2). Both hold the same amount of stuff at the same price. But the filing system has labeled drawers (real directories), access controls per drawer (POSIX ACLs), and a librarian who can instantly move an entire drawer to a new cabinet (atomic directory operations). The storage locker makes you move each item one by one.

Table of Contents

  • What is ADLS Gen2?
  • ADLS Gen2 vs Blob Storage vs Gen1
  • The Hierarchical Namespace Explained
  • Setting Up ADLS Gen2
  • Security: RBAC vs ACLs
  • POSIX ACLs Deep Dive
  • Bronze/Silver/Gold Folder Structure
  • Using ADLS Gen2 with Pipelines
  • Connecting from Different Tools
  • Performance Tips
  • Cost
  • Common Mistakes
  • Interview Questions
  • Wrapping Up

What is ADLS Gen2?

ADLS Gen2 is Azure Blob Storage with hierarchical namespace enabled. Not a separate service — a capability you activate on a storage account. You get all Blob Storage features plus:

  • Real directories — rename, move, delete directories atomically
  • POSIX ACLs — fine-grained access at directory/file level
  • Analytics optimization — faster Spark, Synapse Serverless, Databricks
  • DFS API — dedicated endpoint (dfs.core.windows.net)

ADLS Gen2 vs Blob Storage vs Gen1

Feature             Blob Storage              ADLS Gen2                   ADLS Gen1 (deprecated)
────────────────    ──────────────────────    ────────────────────────    ──────────────────────
Namespace           Flat (virtual folders)    Hierarchical (real dirs)    Hierarchical
Rename directory    Copy+delete each blob     Atomic operation            Atomic
POSIX ACLs          No                        Yes                         Yes
Cost                Standard                  Same as Blob                Higher
Status              Active                    Recommended                 Deprecated

The Hierarchical Namespace Explained

Flat Namespace (Regular Blob Storage)

Renaming folder customer to customers with 10,000 files = 30,000 API calls (list + copy + delete each).

Hierarchical Namespace (ADLS Gen2)

Same rename = 1 atomic operation. Instant.

Why this matters:

  • Spark writes temp files then renames to final names. Flat = slow. Hierarchical = instant.
  • Delta Lake/Iceberg rely on atomic directory operations for ACID transactions.
  • Pipeline folder management: organizing date-partitioned folders is fast.

Real-life analogy: Flat namespace is like a warehouse where every box has a long label: “Aisle-3/Shelf-B/Row-2/Box-15.” To “move Shelf B to Aisle 5,” a worker must pick up every box from Shelf B, cross out “Aisle-3” on each label, write “Aisle-5,” and put each box back — one at a time. Hierarchical namespace is like a real filing cabinet — grab the entire drawer labeled “Shelf B” and slide it into Aisle 5. One move, instant, done.

Impact on real operations:

Operation                    Flat (Blob)              Hierarchical (ADLS Gen2)
─────────────────────────    ──────────────────       ──────────────────────
Rename folder (10K files)    ~30,000 API calls        1 atomic operation
Delete folder (10K files)    ~20,000 API calls        1 atomic operation
Spark job commit             Slow (rename temp dirs)  Instant (atomic rename)
Delta Lake ACID commit       Relies on file renames   Fast atomic directory ops
List files in directory      Scans ALL blob names     Direct directory listing
Pipeline folder management   Slow at scale            Fast at any scale
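
The call counts in the table are simple arithmetic. Here is a minimal sketch, assuming the rough model used above: three calls per file for a flat rename (list, copy, delete) and two per file for a flat delete (list, delete). The function names are illustrative only and are not part of any Azure SDK.

```python
def flat_rename_calls(file_count: int) -> int:
    """Flat namespace: each blob is listed, copied, then deleted (3 calls per file)."""
    return 3 * file_count


def flat_delete_calls(file_count: int) -> int:
    """Flat namespace: each blob is listed, then deleted (2 calls per file)."""
    return 2 * file_count


def hierarchical_calls(file_count: int) -> int:
    """Hierarchical namespace: one directory-level call, whatever the file count."""
    return 1


if __name__ == "__main__":
    n = 10_000
    print("flat rename:", flat_rename_calls(n))
    print("flat delete:", flat_delete_calls(n))
    print("hierarchical rename or delete:", hierarchical_calls(n))
```

The point of the sketch is the shape of the curve: flat-namespace cost grows linearly with the number of files, while the hierarchical cost stays constant.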

Setting Up ADLS Gen2

  1. Azure Portal > Create resource > Storage account
  2. Name: globally unique (e.g., naveendatalake)
  3. Region: same as your ADF/Synapse workspace
  4. Performance: Standard
  5. Advanced tab > Check “Enable hierarchical namespace” — this makes it ADLS Gen2
  6. Review + Create

Create Containers

raw (or bronze) — raw ingested data
curated (or silver) — cleaned data
analytics (or gold) — business-ready data

Grant Access for Pipelines

  1. Storage Account > Access Control (IAM) > + Add role assignment
  2. Role: Storage Blob Data Contributor
  3. Members: your ADF or Synapse workspace managed identity
  4. Review + assign

Security: RBAC vs ACLs

RBAC (Broad Access)

Storage account or container level:

  • Storage Blob Data Reader: read all
  • Storage Blob Data Contributor: read + write + delete all
  • Storage Blob Data Owner: full control

Use for: Service identities (ADF, Synapse, Databricks).

POSIX ACLs (Fine-Grained)

Directory and file level permissions (Read, Write, Execute — like Linux):

/datalake/raw/                  -- Marketing: Read+Execute
/datalake/raw/finance/          -- Finance: Read+Write+Execute
/datalake/curated/              -- Analytics: Read+Execute

Use for: Multi-team data lakes with different access requirements.

Best practice: RBAC for services, ACLs for human users.

POSIX ACLs Deep Dive

POSIX ACLs work like Linux file permissions — three permission types applied to directories and files:

Permission     On Files              On Directories                Symbol
───────────    ──────────────────    ──────────────────────────    ──────
Read (r)       Read file contents    List directory contents       r
Write (w)      Write to file         Create/delete child items     w
Execute (x)    N/A for data files    Traverse (access children)    x

Critical concept — the “Execute on directories” trap: To read a file at /raw/customers/2026/data.parquet, a user needs Read on the file AND Execute on every parent directory (/raw/, /raw/customers/, /raw/customers/2026/). Without Execute on the parent directories, the user cannot traverse the path to reach the file — even if they have Read on the file itself.
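
The traversal rule is easy to model in a few lines. This is a simplified sketch, not how Azure evaluates ACLs internally: it assumes a single user whose effective permissions are already known per path, it ignores RBAC, groups, and masks, and it also requires Execute on the root directory "/" (an assumption added here for completeness). The function name `can_read_file` is hypothetical.

```python
from pathlib import PurePosixPath
from typing import Dict


def can_read_file(path: str, perms: Dict[str, str]) -> bool:
    """Return True if one user can read `path`.

    `perms` maps a path (no trailing slash) to that user's effective
    permission string, e.g. "r-x". Paths missing from the map count as "---".
    """
    target = PurePosixPath(path)
    # Execute (x) is required on every ancestor directory to traverse the path.
    for parent in target.parents:
        if "x" not in perms.get(str(parent), "---"):
            return False
    # Read (r) is required on the file itself.
    return "r" in perms.get(str(target), "---")


perms = {
    "/": "--x",
    "/raw": "--x",
    "/raw/customers": "--x",
    "/raw/customers/2026": "--x",
    "/raw/customers/2026/data.parquet": "r--",
}
print(can_read_file("/raw/customers/2026/data.parquet", perms))

perms["/raw/customers"] = "---"  # remove Execute on one parent directory
print(can_read_file("/raw/customers/2026/data.parquet", perms))
```

The second call fails even though the file still grants Read, which is exactly the trap described above.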

ACL example: Marketing team reads curated data, Finance reads everything

/datalake/
  ├── raw/                        Marketing: --x    Finance: r-x
  │     ├── customers/            Marketing: --x    Finance: r-x
  │     └── finance/              Marketing: ---    Finance: rwx
  ├── curated/                    Marketing: r-x    Finance: r-x
  │     ├── customer_clean/       Marketing: r-x    Finance: r-x
  │     └── finance_clean/        Marketing: ---    Finance: r-x
  └── analytics/                  Marketing: r-x    Finance: r-x

Marketing can:    Read curated/customer_clean/ ✓    Read raw/finance/ ✗
Finance can:      Read everything ✓                 Write to raw/finance/ ✓

Two types of ACLs:

ACL Type       What It Does                                                        When It Applies
───────────    ────────────────────────────────────────────────────────────────    ────────────────────────
Access ACL     Controls access to the specific directory or file                   Every directory and file
Default ACL    Template inherited by NEW child items created inside a directory    Directories only

Key rule: Setting a Default ACL on a directory does NOT change permissions on existing children — it only applies to items created after the default is set. To fix existing items, you must recursively update ACLs using Azure CLI or Storage Explorer.

Real-life analogy: An Access ACL is a lock on a specific door — only people with the key can enter. A Default ACL is a policy saying “every new room built in this wing gets this same lock type.” Rooms built before the policy keep their old locks.
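
To make the "new rooms only" behavior concrete, here is a toy in-memory model. It is a sketch under loud assumptions: one principal, a single permission string per item, no umask or mask handling, and a placeholder `fallback` permission for children created when no Default ACL exists. `Node` and `set_acl_recursive` are hypothetical names, not Azure SDK classes.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """A directory or file with a simplified one-principal ACL such as 'r-x'."""
    access_acl: str
    default_acl: Optional[str] = None  # only meaningful on directories
    children: Dict[str, "Node"] = field(default_factory=dict)

    def create_child(self, name: str, fallback: str = "---") -> "Node":
        # A new child takes the parent's Default ACL as its Access ACL.
        # With no Default ACL, this sketch uses `fallback` (an assumption).
        acl = self.default_acl if self.default_acl is not None else fallback
        child = Node(access_acl=acl, default_acl=self.default_acl)
        self.children[name] = child
        return child


def set_acl_recursive(node: Node, acl: str) -> None:
    """Overwrite the Access ACL on a node and on every existing descendant."""
    node.access_acl = acl
    for child in node.children.values():
        set_acl_recursive(child, acl)


wing = Node(access_acl="rwx")
old_room = wing.create_child("built_before_policy")
wing.default_acl = "r-x"  # the policy is set after old_room exists
new_room = wing.create_child("built_after_policy")
print(old_room.access_acl, new_room.access_acl)

set_acl_recursive(wing, "r-x")  # the recursive fix for existing items
print(old_room.access_acl)
```

Setting `default_acl` leaves `old_room` untouched; only the explicit recursive update changes it, which mirrors the recursive ACL update you would run with Azure CLI or Storage Explorer.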

Bronze/Silver/Gold Folder Structure

datalake/
-- bronze/                        (Raw - schema-on-read)
   -- sqldb/
      -- Customer/
         -- 2026/04/05/
            -- part-00000.snappy.parquet
      -- Address/
      -- Product/
   -- api/
      -- weather/
         -- 2026-04-05.json
-- silver/                        (Cleaned - light schema)
   -- customer_cleaned/
   -- address_standardized/
-- gold/                          (Business-ready - schema-on-write)
   -- dim_customer/
   -- fact_sales/

Bronze: Raw data as-is. No transforms. Append-only. ADF/Synapse Copy writes here.

Silver: Cleaned, standardized. Nulls handled, dupes removed. Spark notebooks or Data Flows.

Gold: Aggregated, modeled. Star schema. Dashboards and reports query this layer.

Using ADLS Gen2 with Pipelines

As a Sink (Most Common)

Create Linked Service > Dataset (Parquet) with parameterized container/folder > Use as Copy Sink.

Covered in detail in the posts on metadata-driven pipelines and Synapse with audit logging.

Event-Triggered Pipelines

File uploaded to /raw/uploads/ > Event Grid > ADF Trigger > Pipeline runs automatically.

As a Source (Reading Data Back)

ADLS Gen2 is also used as a pipeline source when downstream processes need to read landed files:

Common source patterns:

1. Databricks reads Bronze Parquet from ADLS → transforms → writes Silver Delta
   df = spark.read.parquet("abfss://raw@naveendatalake.dfs.core.windows.net/customers/")

2. Synapse Serverless SQL queries files directly in ADLS (no copy needed)
   SELECT * FROM OPENROWSET(
     BULK 'https://naveendatalake.dfs.core.windows.net/raw/customers/*.parquet',
     FORMAT = 'PARQUET'
   ) AS customers

3. ADF Lookup activity reads a config file from ADLS
   Source: JSON file at /config/pipeline_config.json

4. Power BI Direct Lake reads Gold Delta tables from ADLS via Fabric shortcuts

Folder Naming Conventions

Consistent naming makes pipelines parameterizable and debugging easier:

Convention 1: Date-partitioned (most common for Bronze)
  /raw/customers/2026/06/12/customers.parquet
  /raw/orders/2026/06/12/orders.parquet
  ADF expression: @concat('raw/', item().TableName, '/', formatDateTime(utcNow(),'yyyy/MM/dd'), '/')

Convention 2: Hive-style partitioning (best for Spark/Databricks)
  /raw/orders/year=2026/month=06/day=12/part-00000.parquet
  Spark reads with: df = spark.read.parquet("/raw/orders/").filter("year = 2026 AND month = 6")

Convention 3: Full load snapshots
  /raw/customers/full/2026-06-12/customers.parquet
  /raw/customers/incremental/2026-06-12/delta.parquet

Rules:
  • Lowercase everything (no mixed case)
  • Underscores for multi-word names (customer_address, not customerAddress)
  • No spaces, no special characters
  • Table name matches source table name
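
The first two conventions and the naming rules can be captured in small helper functions. This is a minimal sketch: `to_snake_case`, `date_partitioned_path`, and `hive_partitioned_path` are hypothetical helper names, and the `raw/` prefix is taken from the examples above.

```python
import re
from datetime import date


def to_snake_case(name: str) -> str:
    """customerAddress -> customer_address; spaces and symbols become underscores."""
    with_breaks = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", with_breaks)
    return cleaned.strip("_").lower()


def date_partitioned_path(table: str, run_date: date) -> str:
    """Convention 1: raw/<table>/yyyy/MM/dd/"""
    return f"raw/{to_snake_case(table)}/{run_date:%Y/%m/%d}/"


def hive_partitioned_path(table: str, run_date: date) -> str:
    """Convention 2: raw/<table>/year=yyyy/month=MM/day=dd/"""
    return (
        f"raw/{to_snake_case(table)}/"
        f"year={run_date:%Y}/month={run_date:%m}/day={run_date:%d}/"
    )


run = date(2026, 6, 12)
print(date_partitioned_path("customers", run))
print(hive_partitioned_path("orders", run))
print(to_snake_case("customerAddress"))
```

Generating paths in one place, rather than concatenating strings in each notebook, keeps the layout identical between whatever writes the folder and whatever reads it back.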

Connecting from Different Tools

Tool                  Connection                           URL Format
──────────────────    ─────────────────────────────────    ───────────────────────────────────────────────────
ADF/Synapse           Linked Service (Managed Identity)    https://{account}.dfs.core.windows.net
Databricks            Mount or direct                      abfss://{container}@{account}.dfs.core.windows.net/
Synapse Serverless    OPENROWSET                           https://{account}.dfs.core.windows.net/{container}/
Spark                 ABFSS driver                         abfss://{container}@{account}.dfs.core.windows.net/
Python                azure-storage-file-datalake SDK      Connection string

Note the DFS endpoint: ADLS Gen2 uses dfs.core.windows.net (not blob.core.windows.net) for hierarchical operations.
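
Because the two URL shapes are easy to mix up, it can help to build them from the same inputs. A small sketch follows; the helper names `abfss_uri` and `dfs_https_url` are made up for illustration, and the account and container values come from the examples earlier in this post.

```python
def abfss_uri(account: str, container: str, path: str = "") -> str:
    """URI form used by Spark and Databricks: the container comes before the '@'."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def dfs_https_url(account: str, container: str, path: str = "") -> str:
    """HTTPS form used by OPENROWSET: the container is the first path segment."""
    return f"https://{account}.dfs.core.windows.net/{container}/{path.lstrip('/')}"


print(abfss_uri("naveendatalake", "raw", "customers/"))
print(dfs_https_url("naveendatalake", "raw", "customers/*.parquet"))
```

Both helpers hard-code dfs.core.windows.net, so the Blob endpoint cannot slip in by accident.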

Performance Tips

  1. Use Parquet — columnar + compression = less data to read
  2. Partition by date: year=2026/month=04/ enables partition pruning
  3. Right-size files — aim for 256MB to 1GB per Parquet file
  4. Use DFS endpoint — better performance than Blob endpoint
  5. Colocate compute and storage — same Azure region
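
Tip 3 can be turned into a number you pass to repartition() or coalesce(). The sketch below is only arithmetic: it assumes you already know (or can estimate) the output size on disk, and the 512 MB default is simply a midpoint of the 256 MB to 1 GB range above. `target_file_count` is a hypothetical helper.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def target_file_count(total_bytes: int, target_file_bytes: int = 512 * MB) -> int:
    """How many output files give roughly `target_file_bytes` each (at least 1)."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


n_files = target_file_count(100 * GB)
print(n_files)
# In a Spark job you might then write: df.repartition(n_files).write.parquet(...)
```

Estimating the compressed Parquet size up front is the hard part in practice; a common shortcut is to measure the size of a previous run's output.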

Cost

Same as Blob Storage. Enabling hierarchical namespace does NOT increase cost.

Component                 Hot        Cool
──────────────────────    ───────    ───────
Storage (per GB/month)    $0.0208    $0.0115
Write ops (per 10K)       $0.065     $0.13
Read ops (per 10K)        $0.0052    $0.013

Example: 1 TB Parquet in Hot = ~$21/month.
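
The 1 TB figure can be reproduced, and extended to include operations, with a short calculation. This sketch uses only the prices in the table above as illustrative inputs; actual prices vary by region and redundancy option, and `monthly_cost` is a hypothetical helper.

```python
HOT = {"storage_per_gb": 0.0208, "write_per_10k": 0.065, "read_per_10k": 0.0052}
COOL = {"storage_per_gb": 0.0115, "write_per_10k": 0.13, "read_per_10k": 0.013}


def monthly_cost(tier: dict, stored_gb: float, writes: int = 0, reads: int = 0) -> float:
    """Storage cost plus per-10K operation charges for one month, in dollars."""
    return (
        stored_gb * tier["storage_per_gb"]
        + writes / 10_000 * tier["write_per_10k"]
        + reads / 10_000 * tier["read_per_10k"]
    )


one_tb_gb = 1024
print(round(monthly_cost(HOT, one_tb_gb), 2))
print(round(monthly_cost(COOL, one_tb_gb), 2))
print(round(monthly_cost(HOT, one_tb_gb, writes=1_000_000), 2))
```

Cool storage is cheaper per GB but roughly twice as expensive per operation, so the right tier depends on how often pipelines touch the data.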

Common Mistakes

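  1. Forgetting to enable hierarchical namespace at creation. Azure does support upgrading an existing storage account in place, but the upgrade is one-way (it cannot be reversed) and the account must not depend on Blob Storage features that are incompatible with hierarchical namespace. If the upgrade is not an option, the fix is creating a new account and migrating data. It is far simpler to check the “Enable hierarchical namespace” box in the Advanced tab when you create the account.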

  2. Using the Blob endpoint instead of the DFS endpoint: blob.core.windows.net works but bypasses hierarchical namespace optimizations. Always use dfs.core.windows.net for ADLS Gen2 operations, especially from Spark and Databricks.

  3. Setting ACLs on a directory but forgetting Execute on parents — granting Read on /raw/customers/ does nothing if the user lacks Execute on /raw/. Every parent directory in the path needs Execute permission.

  4. Confusing Default ACLs with Access ACLs — setting a Default ACL on a directory does NOT change existing children. New items inherit the default, but existing items keep their old permissions. Use recursive ACL updates for existing data.

  5. Storing too many small files — thousands of small files (under 1 MB) kill Spark performance. Use coalesce() or repartition() to create fewer, larger files (256 MB to 1 GB target). This is the “small file problem” that plagues every data lake.

  6. Not colocating storage and compute — if your ADLS Gen2 is in East US but your Databricks workspace is in Canada Central, every read crosses Azure regions. This adds latency and egress charges. Always put storage and compute in the same region.

  7. Still using ADLS Gen1 — Gen1 is deprecated. Microsoft has published migration guides to move to Gen2. If you are on Gen1, migrate now — Gen2 costs the same or less with better performance.

Interview Questions

Q: What is ADLS Gen2? A: Blob Storage with hierarchical namespace enabled. Adds real directories, POSIX ACLs, and analytics optimizations. Same cost as regular Blob Storage.

Q: Why enable hierarchical namespace? A: Atomic directory operations make Spark, Delta Lake, and pipelines faster. Without it, renaming 10K files = 30K API calls. With it = 1 call.

Q: RBAC vs ACLs? A: RBAC = broad access at account/container level. ACLs = fine-grained at directory/file level. Use both together.

Q: Recommended folder structure? A: Bronze (raw), Silver (cleaned), Gold (business-ready). Called Medallion architecture.

Q: Which endpoint for analytics? A: DFS endpoint (dfs.core.windows.net) for better hierarchical namespace performance.

Q: What is the difference between Access ACLs and Default ACLs? A: Access ACLs control permissions on a specific directory or file. Default ACLs are templates on directories that are automatically inherited by new child items created inside. Default ACLs do not retroactively change existing children — you must recursively update for existing data.

Q: Why do you need Execute permission on parent directories? A: Execute on a directory means “traverse” — the ability to pass through the directory to reach children. To read a file at /raw/customers/data.parquet, you need Execute on /raw/ and /raw/customers/, plus Read on the file itself. Without Execute on parents, the path is blocked even with Read on the file.

Q: How do you handle the small file problem in ADLS Gen2? A: Small files (under 1 MB each) cause Spark to create too many tasks and slow down reads. Fix with: coalesce() or repartition() when writing to create fewer larger files (256 MB–1 GB target), Delta Lake OPTIMIZE command to compact small files, or ADF Copy activity with “merge files” option.

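Q: Can you convert an existing Blob Storage account to ADLS Gen2? A: Yes, with caveats. Azure supports a one-way, in-place upgrade that enables hierarchical namespace on an existing account, provided the account does not rely on Blob features that are incompatible with it. The upgrade cannot be undone. The alternative is to create a new ADLS Gen2 account, copy data using AzCopy or ADF, update all connection references, and decommission the old account.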

Wrapping Up

ADLS Gen2 is the standard for modern data lakes on Azure. Same cost as Blob Storage, better performance for analytics. Use Parquet, partition by date, follow Bronze/Silver/Gold — you’re following industry best practice.

Related posts:

  • Azure Blob Storage Guide
  • Parquet vs CSV vs JSON
  • Schema-on-Write vs Schema-on-Read
  • Metadata-Driven Pipeline in ADF


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
