Azure Blob Storage Explained: A Complete Guide for Data Engineers

Azure Blob Storage is the foundation of cloud storage on Azure. Whether you are building data pipelines, hosting static websites, or archiving data — Blob Storage is where your data lives. Every pipeline we have built on this blog — metadata-driven loads, incremental loading, SCD pipelines — reads from or writes to storage. Understanding Blob Storage is not optional for data engineers. It is the first building block.

Think of Azure Blob Storage like a massive, pay-per-use warehouse facility. You rent a building (Storage Account), divide it into sections (Containers), and store boxes (Blobs) in each section. You only pay for the space you actually use. Boxes you access daily stay on the ground floor (Hot tier). Boxes you rarely touch go to the basement (Cool/Cold). Boxes you need for legal compliance but never open go to off-site deep storage (Archive tier).

Table of Contents

  • What is Azure Blob Storage?
  • The Storage Hierarchy
  • Creating a Storage Account (Step by Step)
  • Types of Blobs
  • Access Tiers
  • Authentication and Access Control
  • Key RBAC Roles
  • Accessing Blob Storage
  • Blob Storage vs ADLS Gen2
  • Blob Storage in Data Engineering Pipelines
  • Redundancy Options
  • Lifecycle Management
  • Cost Optimization
  • Common Mistakes
  • Interview Questions
  • Wrapping Up

What is Azure Blob Storage?

Azure Blob Storage is Microsoft’s object storage service for unstructured data — files, images, videos, logs, data files (CSV, JSON, Parquet), and anything representable as bytes.

“Blob” = Binary Large Object. Key characteristics: massively scalable (petabytes), durable (designed for at least 99.999999999%, or eleven nines, durability over a year), accessible via HTTP/HTTPS, pay-per-use, and integrated with most Azure services.

The Storage Hierarchy

Storage Account (top-level resource)
  -- Container (like a root folder - must exist before uploading)
    -- Blob (the actual file)

Real-life analogy: A Storage Account is like a building in a storage facility. A Container is a room inside that building. A Blob is a box on the shelf inside that room. You cannot put a box (blob) directly in the facility — it must go in a room (container). And you cannot create a room without first having a building (storage account).

Example: A data engineering storage account

Storage Account: naveendatalake
  ├── Container: raw          ← Bronze layer (raw source data)
  │     ├── customers/2026/06/11/customers.csv
  │     ├── orders/2026/06/11/orders.json
  │     └── products/2026/06/11/products.parquet
  ├── Container: curated      ← Silver layer (cleaned data)
  │     ├── customers/part-00000.snappy.parquet
  │     └── orders/part-00000.snappy.parquet
  ├── Container: analytics    ← Gold layer (business-ready)
  │     ├── dim_customer/
  │     └── fact_orders/
  └── Container: quarantine   ← Bad records from validation
        └── customers/2026/06/11/rejected.parquet

Important: “Folders” in Blob Storage are virtual. A blob named data/customer/file1.csv appears as nested folders, but there are no real directories. The / in the name creates a visual hierarchy. Exception: ADLS Gen2 (hierarchical namespace) has real directories.
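To make the "virtual folder" idea concrete, here is a minimal pure-Python sketch that groups flat blob names by the / delimiter, roughly what a delimiter-based listing does. The function name list_level and the sample names are illustrative only and are not part of any Azure SDK.

```python
def list_level(blob_names, prefix="", delimiter="/"):
    """Return (virtual_folders, blobs) found directly under `prefix`.

    The namespace is flat: "folders" are derived from the names alone.
    """
    folders, blobs = set(), []
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if delimiter in rest:
            # Everything up to the next delimiter looks like a folder.
            folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            blobs.append(name)
    return sorted(folders), blobs


names = [
    "data/customer/file1.csv",
    "data/customer/file2.csv",
    "data/orders/file1.csv",
    "readme.txt",
]
top_folders, top_blobs = list_level(names)
data_folders, data_blobs = list_level(names, prefix="data/")
```

Nothing named data/ is ever stored; the folder exists only because some blob names start with that prefix. Delete the last blob under it and the "folder" disappears.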

Creating a Storage Account (Step by Step)

  1. Go to Azure Portal → search “Storage accounts” → click + Create
  2. Basics tab:
       • Subscription: your Azure subscription
       • Resource group: select or create (e.g., rg-datalake)
       • Storage account name: naveendatalake (globally unique, lowercase, no hyphens)
       • Region: Canada Central (same region as your other resources)
       • Performance: Standard (use Premium only for high-IOPS workloads)
       • Redundancy: LRS for dev, ZRS or GRS for production
  3. Advanced tab:
       • Enable hierarchical namespace: YES (this makes it ADLS Gen2; always enable for data engineering)
       • Enable blob public access: No
       • Default access tier: Hot
  4. Review + Create → Create

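The single most important checkbox: “Enable hierarchical namespace.” This one checkbox is the difference between basic Blob Storage and ADLS Gen2. Enabling it costs nothing extra but gives you real directories, POSIX ACLs, and analytics-optimized performance. Decide at creation time: Azure can upgrade an existing account to a hierarchical namespace later, but the upgrade is one-way (it cannot be switched off afterwards) and needs compatibility checks first, so choosing up front is far simpler.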
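The portal rejects invalid account names, but it can be handy to check them earlier, for example in a deployment script. The sketch below assumes the documented naming rule of 3 to 24 characters, lowercase letters and digits only; the function name is hypothetical, and global uniqueness can only be confirmed by Azure itself.

```python
import re

# Assumed rule: 3-24 characters, lowercase letters and digits only.
_ACCOUNT_NAME_RE = re.compile(r"[a-z0-9]{3,24}")


def is_valid_storage_account_name(name: str) -> bool:
    """Check the local syntax of a storage account name (not its uniqueness)."""
    return _ACCOUNT_NAME_RE.fullmatch(name) is not None


checks = {
    name: is_valid_storage_account_name(name)
    for name in ["naveendatalake", "naveen-datalake", "NaveenDataLake", "ab"]
}
```

Only naveendatalake passes here: the others contain a hyphen, uppercase letters, or are too short.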

Types of Blobs

| Blob Type | How It Works | Max Size | Use Case | Data Engineering? |
|---|---|---|---|---|
| Block Blob | Uploaded in blocks, reassembled on read. Optimized for sequential reads. | ~190.7 TiB | CSV, JSON, Parquet, images, videos, backups | Yes (99% of your work) |
| Append Blob | Optimized for append-only writes. New data added to the end only. | ~195 GiB | Log files, audit trails, pipeline run logs | Sometimes (logging pipelines) |
| Page Blob | Optimized for random read/write. 512-byte pages. | 8 TiB | Azure VM disks (VHD files) | Rarely (infrastructure only) |

Real-life analogy: Block Blobs are like a novel — written in chapters (blocks) and read front to back. Append Blobs are like a diary — you only add new entries at the end, never edit old ones. Page Blobs are like a whiteboard — you erase and rewrite any section at random.
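The block blob and append blob maximums listed above fall out of simple arithmetic: a blob can hold a fixed number of blocks, each with a maximum size. The sketch below assumes the limits of recent service versions as I understand them (50,000 blocks per blob, 4,000 MiB per block for block blobs, 4 MiB per block for append blobs); older API versions have smaller limits.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4

MAX_BLOCKS_PER_BLOB = 50_000          # assumed limit for block and append blobs
BLOCK_BLOB_MAX_BLOCK = 4000 * MIB     # assumed max size of one block
APPEND_BLOB_MAX_BLOCK = 4 * MIB       # assumed max size of one appended block

# Maximum blob size = number of blocks x maximum block size.
block_blob_max_tib = MAX_BLOCKS_PER_BLOB * BLOCK_BLOB_MAX_BLOCK / TIB
append_blob_max_gib = MAX_BLOCKS_PER_BLOB * APPEND_BLOB_MAX_BLOCK / GIB

print(f"Block blob max:  {block_blob_max_tib:.1f} TiB")
print(f"Append blob max: {append_blob_max_gib:.1f} GiB")
```

In practice, data engineering files stay far below these ceilings; many small-to-medium Parquet files are usually easier to process than one enormous blob.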

Access Tiers

| Tier | Storage Cost (vs Hot, approximate) | Access Cost | Use Case |
|---|---|---|---|
| Hot | Highest | Lowest | Current month’s pipeline output |
| Cool | ~40% less | Higher | 30+ days old, occasional access |
| Cold | ~55% less | Even higher | 90+ days old, rare access |
| Archive | ~75% less | Highest (hours to rehydrate) | Compliance, long-term retention |

Real-life analogy: Hot tier is your desk drawer — instant access, but limited expensive space. Cool tier is the filing cabinet across the room — a short walk, much cheaper per drawer. Cold tier is the storage closet down the hall — takes a minute, but very cheap. Archive tier is the off-site warehouse — you call them, they deliver it tomorrow, but storage costs almost nothing.

Archive blobs are offline. Rehydration takes up to 15 hours (standard) or under 1 hour (high priority). Never archive data you need quickly.

Authentication and Access Control

| Method | When to Use |
|---|---|
| Storage Account Key | Quick setup. Don’t use in production. |
| SAS Token | Temporary, scoped access. Good for sharing. |
| Azure AD (Microsoft Entra ID) | Production standard. RBAC. |
| Managed Identity | For Azure services (ADF, Synapse). Recommended. |

Real-life analogy: A Storage Account Key is the master key to the building — anyone who has it can open every door. A SAS Token is a visitor pass with an expiration date and restricted access to specific rooms. Azure AD is the building’s security system — each person badges in with their own ID, and access is controlled per room. Managed Identity is a special badge that Azure services carry automatically — no human has to manage it.

SAS Token Deep Dive

Shared Access Signatures (SAS) are the most nuanced authentication method. Three types exist:

| SAS Type | Scope | Created From | Use Case |
|---|---|---|---|
| Account SAS | Entire storage account (all containers) | Storage account key | Broad access for trusted services |
| Service SAS | Single service (Blob, Queue, Table, or File) | Storage account key | Scoped access to blob service only |
| User Delegation SAS | Single container or blob | Azure AD credentials (no key needed) | Most secure; recommended over Account/Service SAS |

SAS Token URL example:
https://naveendatalake.blob.core.windows.net/raw/customers.csv
  ?sv=2023-11-03                    ← API version
  &st=2026-06-11T00:00:00Z         ← Start time
  &se=2026-06-12T00:00:00Z         ← Expiry time (24 hours)
  &sr=b                             ← Resource: b=blob, c=container
  &sp=r                             ← Permissions: r=read, w=write, d=delete
  &sig=abc123...                    ← Signature (proves authenticity)

The token controls: WHAT (resource), WHEN (start/expiry), and HOW (permissions).
If the token leaks, damage is limited to its scope and duration.

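Best practice: Use User Delegation SAS (backed by Azure AD) instead of Account/Service SAS (backed by storage keys). If an ad hoc key-based SAS leaks, the only way to invalidate it before it expires is to rotate the storage key, which also breaks every other SAS token created from that key. (A service SAS tied to a stored access policy is the exception: revoking the policy revokes the token.)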
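Before handing out or using a SAS URL, it helps to inspect what it actually grants. The following is a minimal sketch using only the Python standard library; it reads the query parameters shown in the example above and does not contact Azure or verify the signature. The function name describe_sas is hypothetical, the signature value is a placeholder, and the timestamp format is assumed to match the example.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit

SAS_TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # assumption: second-precision UTC timestamps


def _parse_time(value: str) -> datetime:
    return datetime.strptime(value, SAS_TIME_FORMAT).replace(tzinfo=timezone.utc)


def describe_sas(url: str, now: datetime) -> dict:
    """Summarize the WHAT / WHEN / HOW of a SAS URL without calling Azure."""
    query = parse_qs(urlsplit(url).query)
    expiry = _parse_time(query["se"][0])
    # The start time (st) is optional; a missing value means "already started".
    start = _parse_time(query["st"][0]) if "st" in query else None
    return {
        "resource": query["sr"][0],          # WHAT: b = blob, c = container
        "permissions": set(query["sp"][0]),  # HOW: e.g. {"r"} or {"r", "w"}
        "active": (start is None or start <= now) and now < expiry,  # WHEN
    }


sample_url = (
    "https://naveendatalake.blob.core.windows.net/raw/customers.csv"
    "?sv=2023-11-03&st=2026-06-11T00:00:00Z&se=2026-06-12T00:00:00Z"
    "&sr=b&sp=r&sig=abc123"  # placeholder signature, not a real token
)
info = describe_sas(sample_url, now=datetime(2026, 6, 11, 12, 0, tzinfo=timezone.utc))
```

A check like this is useful in code review or logging: a token with write or delete permissions, or an expiry months away, is worth questioning before it leaves your hands.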

Key RBAC Roles

| Role | Allows |
|---|---|
| Storage Blob Data Reader | Read blobs |
| Storage Blob Data Contributor | Read + write + delete blobs |
| Storage Blob Data Owner | Full control |
| Storage Account Contributor | Manage account, NOT blob data |

Common mistake: Assigning Storage Account Contributor for data access. It doesn’t grant blob data access — you need Storage Blob Data Contributor.

Accessing Blob Storage

Python SDK

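import os

from azure.storage.blob import BlobServiceClient

# Assumption: the connection string is supplied through this environment variable.
conn_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
blob_service = BlobServiceClient.from_connection_string(conn_str)
container = blob_service.get_container_client("my-container")

# Upload (overwrite=True replaces an existing blob instead of raising an error)
with open("data.csv", "rb") as data:
    container.upload_blob("path/to/data.csv", data, overwrite=True)

# Download the blob and read its full contents into memory as bytes
downloader = container.download_blob("path/to/data.csv")
content = downloader.readall()

# List blobs under a virtual folder prefix
for item in container.list_blobs(name_starts_with="path/"):
    print(item.name, item.size)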
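The connection-string example above embeds an account key, which this guide advises against in production. A hedged alternative is to authenticate with Azure AD through the azure-identity package, which picks up a Managed Identity when running inside Azure and a developer login locally. This sketch assumes both azure-identity and azure-storage-blob are installed and that the identity holds a role such as Storage Blob Data Contributor; the helper names are hypothetical.

```python
def blob_account_url(account_name: str) -> str:
    """Build the Blob endpoint URL (assumes the public Azure cloud suffix)."""
    return f"https://{account_name}.blob.core.windows.net"


def make_blob_service_client(account_name: str):
    """Create a BlobServiceClient that authenticates with Azure AD, not a key."""
    # Imported lazily so the helper above works even without the SDKs installed.
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import BlobServiceClient

    return BlobServiceClient(
        account_url=blob_account_url(account_name),
        credential=DefaultAzureCredential(),
    )


url = blob_account_url("naveendatalake")
```

Usage would look like make_blob_service_client("naveendatalake").get_container_client("raw"), after which the upload, download, and list calls are the same as in the connection-string version.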

AzCopy

azcopy copy "local/file.csv" "https://account.blob.core.windows.net/container/file.csv"
azcopy sync "local/dir/" "https://account.blob.core.windows.net/container/dir/" --recursive

Blob Storage vs ADLS Gen2

ADLS Gen2 IS Blob Storage with hierarchical namespace enabled.

| Feature | Blob Storage | ADLS Gen2 |
|---|---|---|
| Namespace | Flat (virtual folders) | Hierarchical (real directories) |
| Rename folder | Copy+delete each blob | Atomic single operation |
| POSIX permissions | Not supported | Supported (ACLs) |
| Analytics optimized | No | Yes |
| Cost | Same | Same |

For data engineering, always use ADLS Gen2. Same cost, better performance.

Blob Storage in Data Engineering Pipelines

Every pipeline on this blog touches Blob Storage or ADLS Gen2. Here is how storage fits into the Medallion Architecture:

Source Systems                    ADLS Gen2 / Blob Storage                    Consumers
┌──────────────┐    ┌─────────────────────────────────────────────┐    ┌──────────────┐
│ SQL Database │──>│  raw/          curated/        analytics/    │──>│ Power BI     │
│ APIs         │──>│  (Bronze)  ──>  (Silver)  ──>  (Gold)        │──>│ Analysts     │
│ CSV files    │──>│                                              │──>│ Databricks   │
└──────────────┘    └─────────────────────────────────────────────┘    └──────────────┘
                     ADF Copy       Notebook/        SCD MERGE
                     Activity       Data Flow        Notebook

Key pipeline patterns that use Blob Storage:

| Pipeline Pattern | How It Uses Storage | Blog Post |
|---|---|---|
| Metadata-Driven Load | Copy from SQL → Parquet files in ADLS raw/ container | Metadata-Driven Pipeline |
| Incremental Loading | Copy only new rows → append to date-partitioned folders | Incremental Loading |
| Data Quality + SCD | Read from raw/, clean, write to curated/ | Data Quality + SCD |
| Databricks Bronze → Silver | Notebooks read Parquet from ADLS, write Delta to Silver | PySpark Transformations |

Common storage URL formats:

Blob Storage endpoint:
  https://naveendatalake.blob.core.windows.net/raw/customers/file.csv

ADLS Gen2 (DFS) endpoint:
  https://naveendatalake.dfs.core.windows.net/raw/customers/file.csv
  abfss://raw@naveendatalake.dfs.core.windows.net/customers/file.csv

In ADF Sink path:     raw/customers/@{formatDateTime(utcNow(),'yyyy/MM/dd')}/
In Databricks:        abfss://raw@naveendatalake.dfs.core.windows.net/customers/
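The URL formats above differ only in how they arrange the account, container, and path, so converting between them is mechanical. Below is a small standard-library sketch that turns an https URL into the abfss form and builds a date-partitioned folder like the ADF expression does. Both function names are hypothetical, and the code assumes the public cloud suffix core.windows.net.

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit


def https_to_abfss(url: str) -> str:
    """Convert an https Blob/DFS URL into an abfss:// URI for Spark."""
    parts = urlsplit(url)
    account = parts.netloc.split(".", 1)[0]           # text before .blob/.dfs
    container, _, path = parts.path.lstrip("/").partition("/")
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"


def dated_folder(base: str, when: datetime) -> str:
    """Build a yyyy/MM/dd partition folder under `base`."""
    return f"{base}/{when:%Y/%m/%d}/"


abfss_uri = https_to_abfss(
    "https://naveendatalake.blob.core.windows.net/raw/customers/file.csv"
)
folder = dated_folder("raw/customers", datetime(2026, 6, 11, tzinfo=timezone.utc))
```

In a real pipeline you would pass datetime.now(timezone.utc) to dated_folder, which mirrors utcNow() in the ADF expression.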

Redundancy Options

Azure replicates your data for durability. The redundancy option you choose affects cost, availability, and disaster recovery:

| Redundancy | Copies | Where | Survives | Cost | Use Case |
|---|---|---|---|---|---|
| LRS (Locally Redundant) | 3 | Same data center | Disk/rack failure | Cheapest | Dev/test, non-critical data |
| ZRS (Zone Redundant) | 3 | 3 availability zones in same region | Data center failure | ~25% more than LRS | Production data within one region |
| GRS (Geo Redundant) | 6 | 3 local + 3 in a paired region | Regional disaster | ~2x LRS | Business-critical data |
| RA-GRS (Read-Access Geo) | 6 | Same as GRS + read access to secondary | Regional disaster + read failover | Most expensive | High availability, read during outage |

Real-life analogy: LRS keeps three photocopies of your documents in the same filing cabinet — protects against a page getting damaged, but if the cabinet burns, all copies are gone. ZRS keeps copies in three different offices in the same city. GRS keeps copies in offices in two different cities. RA-GRS lets you read from the second city even while the first is offline.

For data engineering: Use LRS for dev/test and Bronze layer scratch data. Use ZRS or GRS for production Silver/Gold layers. The cost difference is significant at scale — 100 TB on LRS vs GRS saves roughly $1,000/month.

Lifecycle Management

Automate tier transitions:

{
  "rules": [{
    "enabled": true,
    "name": "archive-old-data",
    "type": "Lifecycle",
    "definition": {
      "filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["data/"]},
      "actions": {
        "baseBlob": {
          "tierToCool": {"daysAfterModificationGreaterThan": 30},
          "tierToArchive": {"daysAfterModificationGreaterThan": 365},
          "delete": {"daysAfterModificationGreaterThan": 730}
        }
      }
    }
  }]
}

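Set up in Storage Account > Data management > Lifecycle management. Note that a prefixMatch value starts with the container name, so "data/" here targets blobs in a container named data.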
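To reason about what the policy will do to a given blob, the thresholds can be simulated locally. This sketch mirrors the three actions in the JSON above and assumes, as I understand the documented behavior, that when several conditions match the cheapest action wins (delete, then archive, then cool). The function and constant names are illustrative, and the real service evaluates rules on its own schedule rather than instantly.

```python
# (threshold in days, action), ordered from cheapest outcome to most expensive.
RULE_ACTIONS = [
    (730, "delete"),
    (365, "tierToArchive"),
    (30, "tierToCool"),
]


def lifecycle_action(days_since_modified: int) -> str:
    """Return the action the sample policy would apply to a blob of this age."""
    for threshold, action in RULE_ACTIONS:
        # The policy uses "greater than", so a blob exactly 30 days old is untouched.
        if days_since_modified > threshold:
            return action
    return "none"


for age in (10, 30, 31, 400, 800):
    print(age, lifecycle_action(age))
```

Running a few representative ages through a function like this is a cheap way to sanity-check thresholds before a policy starts moving or deleting production data.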

Cost Optimization

  1. Right access tier: moving 1 TB from Hot to Cool saves ~$8/month
  2. Use Parquet: files are often 5-10x smaller than CSV, so storage is 5-10x cheaper
  3. Lifecycle policies: automate tier transitions
  4. LRS vs GRS: LRS costs roughly half as much as GRS. Use it for non-critical data.
  5. Delete old outputs: daily full loads create redundant data
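The ~$8/month figure for moving 1 TB from Hot to Cool can be reproduced with back-of-the-envelope arithmetic. The sketch below uses the ~40% Cool discount from the access tier table and an assumed Hot price of $0.02 per GB per month. That price is a placeholder for illustration, not a quoted Azure rate, and the calculation ignores access charges and early-deletion fees.

```python
def monthly_tier_savings(size_gb: float, hot_price_per_gb: float, discount: float) -> float:
    """Storage-only savings per month from moving data to a cheaper tier."""
    return size_gb * hot_price_per_gb * discount


ASSUMED_HOT_PRICE = 0.02   # USD per GB per month, placeholder value
COOL_DISCOUNT = 0.40       # "~40% less" from the access tier table

savings_1tb = monthly_tier_savings(1024, ASSUMED_HOT_PRICE, COOL_DISCOUNT)
print(f"Estimated savings for 1 TB: ${savings_1tb:.2f}/month")
```

Because Cool charges more per read, the saving only holds for data that is rarely accessed; plug in current prices for your region before relying on the number.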

Common Mistakes

  1. Using Blob Storage instead of ADLS Gen2 for data engineering — ADLS Gen2 costs the same but adds hierarchical namespace, real directories, and ACLs. Always enable hierarchical namespace when creating storage accounts for data lakes.

  2. Leaving everything in Hot tier — Bronze layer data older than 30 days is rarely re-read. Move it to Cool or Cold with lifecycle policies and save 40-55% on storage.

  3. Using Storage Account Keys in production — keys give full access to everything. Use Managed Identity for Azure services (ADF, Databricks, Synapse) and Azure AD RBAC for users.

  4. Confusing Storage Account Contributor with Blob Data Contributor — the first manages the account (settings, keys). The second reads and writes blob data. Data engineers need Storage Blob Data Contributor.

  5. Not creating containers before running pipelines — ADF and Synapse can create folders (blob prefixes), but containers must exist beforehand. A pipeline targeting a non-existent container fails.

  6. Archiving data you might need soon — Archive tier takes up to 15 hours to rehydrate. If there is any chance you need the data within a day, use Cool or Cold instead.

Interview Questions

Q: What is Azure Blob Storage? A: Object storage for unstructured data. Stores blobs in containers within storage accounts. Massively scalable, durable, accessible via HTTP.

Q: Blob Storage vs ADLS Gen2? A: ADLS Gen2 is Blob Storage with hierarchical namespace. Adds real directories, POSIX ACLs, and analytics optimizations. Same cost.

Q: What are access tiers? A: Hot (frequent, expensive storage), Cool (infrequent), Cold (rare), Archive (offline, cheapest, hours to rehydrate).

Q: Why can’t ADF create containers? A: Containers are storage account resources. ADF creates folders (blob name prefixes) but containers must exist before pipeline runs.

Q: What authentication method should you use for ADF connecting to Blob Storage? A: Managed Identity is the recommended approach for production. ADF’s managed identity is granted the Storage Blob Data Contributor role on the storage account. No passwords to manage, no keys to rotate, no secrets to store.

Q: What is the difference between Block Blobs, Append Blobs, and Page Blobs? A: Block Blobs are the default for data files (CSV, Parquet, JSON) — uploaded in blocks, optimized for large reads. Append Blobs are optimized for append-only writes — ideal for log files. Page Blobs are for random read/write operations — used internally for Azure VM disks. Data engineers work almost exclusively with Block Blobs.

Q: How do lifecycle management policies work? A: You define rules based on last modified date. For example: move blobs to Cool after 30 days, to Archive after 365 days, delete after 730 days. Rules run automatically once per day. Set up in Storage Account → Data management → Lifecycle management.

Wrapping Up

Blob Storage is the bedrock of Azure data engineering. Master the hierarchy (Account > Container > Blob), access tiers, RBAC roles, and the distinction from ADLS Gen2.

Related posts:
  • ADLS Gen2 Complete Guide
  • Parquet vs CSV vs JSON
  • What is Azure Data Factory?


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
