Azure Blob Storage Explained: A Complete Guide for Data Engineers

Azure Blob Storage is the foundation of cloud storage on Azure. Whether you are building data pipelines, hosting static websites, or archiving data — Blob Storage is where your data lives. Every pipeline we have built on this blog — metadata-driven loads, incremental loading, SCD pipelines — reads from or writes to storage. Understanding Blob Storage is not optional for data engineers. It is the first building block.

Think of Azure Blob Storage like a massive, pay-per-use warehouse facility. You rent a building (Storage Account), divide it into sections (Containers), and store boxes (Blobs) in each section. You only pay for the space you actually use. Boxes you access daily stay on the ground floor (Hot tier). Boxes you rarely touch go to the basement (Cool/Cold). Boxes you need for legal compliance but never open go to off-site deep storage (Archive tier).

What is Azure Blob Storage?
The Storage Hierarchy
Creating a Storage Account (Step by Step)
Types of Blobs
Access Tiers
Authentication and Access Control
Key RBAC Roles
Accessing Blob Storage
Blob Storage vs ADLS Gen2
Blob Storage in Data Engineering Pipelines
Redundancy Options
Lifecycle Management
Cost Optimization
Common Mistakes
Interview Questions
Wrapping Up

What is Azure Blob Storage?

Azure Blob Storage is Microsoft’s object storage service for unstructured data — files, images, videos, logs, data files (CSV, JSON, Parquet), and anything representable as bytes.

“Blob” stands for Binary Large Object. Key characteristics: massively scalable (petabytes), durable (at least 99.999999999%, or 11 nines, per object per year), accessible over HTTP/HTTPS, pay-per-use, and integrated across Azure services.
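
To put the durability figure in perspective, here is a rough back-of-the-envelope sketch. It assumes the 11-nines figure is an annual, per-object durability and that losses are independent, which is a simplification. The function name is made up for illustration.

```python
def expected_annual_losses(object_count: int, nines: int = 11) -> float:
    """Expected number of objects lost per year at the given durability."""
    # 11 nines of durability leaves a 10^-11 chance of losing any one object per year.
    annual_loss_probability = 10.0 ** -nines
    return object_count * annual_loss_probability


losses = expected_annual_losses(10_000_000)
print(f"Expected losses per year for 10 million blobs: {losses:.4f}")
```

In other words, under these assumptions you would expect to lose about one blob out of 10 million every 10,000 years.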

The Storage Hierarchy

Storage Account (top-level resource)
  -- Container (like a root folder - must exist before uploading)
    -- Blob (the actual file)

Real-life analogy: A Storage Account is like a building in a storage facility. A Container is a room inside that building. A Blob is a box on the shelf inside that room. You cannot put a box (blob) directly in the facility — it must go in a room (container). And you cannot create a room without first having a building (storage account).

Example: A data engineering storage account

Storage Account: naveendatalake
  ├── Container: raw          ← Bronze layer (raw source data)
  │     ├── customers/2026/06/11/customers.csv
  │     ├── orders/2026/06/11/orders.json
  │     └── products/2026/06/11/products.parquet
  ├── Container: curated      ← Silver layer (cleaned data)
  │     ├── customers/part-00000.snappy.parquet
  │     └── orders/part-00000.snappy.parquet
  ├── Container: analytics    ← Gold layer (business-ready)
  │     ├── dim_customer/
  │     └── fact_orders/
  └── Container: quarantine   ← Bad records from validation
        └── customers/2026/06/11/rejected.parquet

Important: “Folders” in Blob Storage are virtual. A blob named data/customer/file1.csv appears as nested folders, but there are no real directories. The / in the name creates a visual hierarchy. Exception: ADLS Gen2 (hierarchical namespace) has real directories.
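
To make the "virtual folder" idea concrete, here is a minimal sketch in plain Python (no Azure SDK needed). It treats blob names as a flat list of strings and derives the folder view from the / delimiter, roughly the way a delimiter-based listing does. The function name and the readme.txt blob are hypothetical.

```python
def list_virtual_folder(blob_names, prefix="", delimiter="/"):
    """Return (subfolders, files) that sit directly under a prefix in a flat namespace."""
    subfolders, files = set(), []
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        remainder = name[len(prefix):]
        head, sep, _ = remainder.partition(delimiter)
        if sep:
            # More path segments follow, so this name shows up as a "folder".
            subfolders.add(prefix + head + delimiter)
        else:
            # No delimiter left: the blob sits directly under the prefix.
            files.append(name)
    return sorted(subfolders), sorted(files)


names = [
    "customers/2026/06/11/customers.csv",
    "orders/2026/06/11/orders.json",
    "products/2026/06/11/products.parquet",
    "readme.txt",  # hypothetical blob at the container root
]

print(list_virtual_folder(names))
print(list_virtual_folder(names, prefix="customers/2026/06/11/"))
```

Nothing in the list is a directory. The "folders" exist only because some blob names share a prefix, which is why renaming a folder in flat Blob Storage means rewriting every blob under it.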

Creating a Storage Account (Step by Step)

Go to Azure Portal → search “Storage accounts” → click + Create
Basics tab:
  - Subscription: your Azure subscription
  - Resource group: select or create (e.g., rg-datalake)
  - Storage account name: naveendatalake (globally unique, 3 to 24 characters, lowercase letters and digits only, no hyphens)
  - Region: Canada Central (same region as your other resources)
  - Performance: Standard (use Premium only for high-IOPS workloads)
  - Redundancy: LRS for dev, ZRS or GRS for production
Advanced tab:
  - Enable hierarchical namespace: YES (this makes it ADLS Gen2; always enable for data engineering)
  - Enable blob public access: No
  - Default access tier: Hot
Review + Create → Create

The single most important checkbox: “Enable hierarchical namespace.” This one checkbox is the difference between basic Blob Storage and ADLS Gen2. Enabling it costs nothing extra but gives you real directories, POSIX ACLs, and analytics-optimized performance. An existing account can be upgraded later, but the upgrade is one-way and cannot be undone, so it is far simpler to choose at creation time.
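
Storage account names are a common stumbling block in the Basics step, so here is a small sketch that checks a candidate name locally before you try it in the portal. It only covers the format rules (3 to 24 characters, lowercase letters and digits). Global uniqueness can only be checked by Azure itself. The function name is made up for illustration.

```python
import re

# Format rules only: 3 to 24 characters, lowercase letters and digits.
_ACCOUNT_NAME = re.compile(r"[a-z0-9]{3,24}")


def is_valid_storage_account_name(name: str) -> bool:
    """Return True if the name satisfies the format rules (not uniqueness)."""
    return _ACCOUNT_NAME.fullmatch(name) is not None


for candidate in ["naveendatalake", "naveen-datalake", "NaveenDataLake", "dl"]:
    print(candidate, is_valid_storage_account_name(candidate))
```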

Types of Blobs

Blob Type	How It Works	Max Size	Use Case	Data Engineering?
Block Blob	Uploaded in blocks that are committed together as one blob. Optimized for uploading and reading large files.	~190.7 TiB	CSV, JSON, Parquet, images, videos, backups	Yes: 99% of your work
Append Blob	Optimized for append-only writes. New data added to the end only.	~195 GiB	Log files, audit trails, pipeline run logs	Sometimes: logging pipelines
Page Blob	Optimized for random read/write. 512-byte pages.	8 TiB	Azure VM disks (VHD files)	Rarely: infrastructure only

Real-life analogy: Block Blobs are like a novel — written in chapters (blocks) and read front to back. Append Blobs are like a diary — you only add new entries at the end, never edit old ones. Page Blobs are like a whiteboard — you erase and rewrite any section at random.
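
The odd-looking size limits in the table are just block counts multiplied by block sizes. A quick sketch of the arithmetic, assuming the documented limits of 50,000 blocks per blob, 4,000 MiB per block for block blobs (recent service versions), and 4 MiB per block for append blobs:

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4

MAX_BLOCKS = 50_000  # assumed per-blob block limit for block and append blobs

block_blob_max = MAX_BLOCKS * 4000 * MIB   # 4,000 MiB per block
append_blob_max = MAX_BLOCKS * 4 * MIB     # 4 MiB per appended block

print(f"Block blob max:  {block_blob_max / TIB:.1f} TiB")   # 190.7 TiB
print(f"Append blob max: {append_blob_max / GIB:.1f} GiB")  # 195.3 GiB
```

Note that the results are in binary units (TiB and GiB), which is how the limits are usually quoted.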

Access Tiers

Tier	Storage Cost	Access Cost	Use Case
Hot	Highest	Lowest	Current month’s pipeline output
Cool	~40% less	Higher	30+ days old, occasional access
Cold	~55% less	Even higher	90+ days old, rare access
Archive	~75% less	Highest (hours to rehydrate)	Compliance, long-term retention

Real-life analogy: Hot tier is your desk drawer — instant access, but limited expensive space. Cool tier is the filing cabinet across the room — a short walk, much cheaper per drawer. Cold tier is the storage closet down the hall — takes a minute, but very cheap. Archive tier is the off-site warehouse — you call them, they deliver it tomorrow, but storage costs almost nothing.

Archive blobs are offline. Rehydration takes up to 15 hours (standard) or under 1 hour (high priority). Never archive data you need quickly.

Authentication and Access Control

Method	When to Use
Storage Account Key	Quick setup. Don’t use in production.
SAS Token	Temporary, scoped access. Good for sharing.
Azure AD (Entra ID)	Production standard. RBAC.
Managed Identity	For Azure services (ADF, Synapse). Recommended.

Real-life analogy: A Storage Account Key is the master key to the building — anyone who has it can open every door. A SAS Token is a visitor pass with an expiration date and restricted access to specific rooms. Azure AD is the building’s security system — each person badges in with their own ID, and access is controlled per room. Managed Identity is a special badge that Azure services carry automatically — no human has to manage it.

SAS Token Deep Dive

Shared Access Signatures (SAS) are the most nuanced authentication method. Three types exist:

SAS Type	Scope	Created From	Use Case
Account SAS	Entire storage account (all containers)	Storage account key	Broad access for trusted services
Service SAS	Single service (Blob, Queue, Table, or File)	Storage account key	Scoped access to blob service only
User Delegation SAS	Single container or blob	Azure AD credentials (no key needed)	Most secure — recommended over Account/Service SAS

SAS Token URL example:
https://naveendatalake.blob.core.windows.net/raw/customers.csv
  ?sv=2023-11-03                    ← API version
  &st=2026-06-11T00:00:00Z         ← Start time
  &se=2026-06-12T00:00:00Z         ← Expiry time (24 hours)
  &sr=b                             ← Resource: b=blob, c=container
  &sp=r                             ← Permissions: r=read, w=write, d=delete
  &sig=abc123...                    ← Signature (proves authenticity)

The token controls: WHAT (resource), WHEN (start/expiry), and HOW (permissions).
If the token leaks, damage is limited to its scope and duration.

Best practice: Use User Delegation SAS (backed by Azure AD) instead of Account/Service SAS (backed by storage keys). If a key-based SAS leaks and it was not issued against a stored access policy, the only way to invalidate it is to rotate the storage key, which also breaks every other SAS token signed with that key.
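
When someone hands you a SAS URL, it is worth checking what it actually grants before using it. The sketch below uses only the Python standard library to read the query parameters from the example URL above. It does not verify the signature (only the storage service can do that), the sig value is a placeholder, and the function name is made up for illustration.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit

RESOURCE_NAMES = {"b": "blob", "c": "container"}
PERMISSION_NAMES = {"r": "read", "w": "write", "d": "delete"}


def describe_sas(url: str, now: datetime) -> dict:
    """Summarize the WHAT, WHEN, and HOW encoded in a SAS URL."""
    query = {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}
    # SAS timestamps are UTC with a trailing "Z"; make them timezone-aware.
    start = datetime.fromisoformat(query["st"].replace("Z", "+00:00"))
    expiry = datetime.fromisoformat(query["se"].replace("Z", "+00:00"))
    return {
        "resource": RESOURCE_NAMES.get(query["sr"], query["sr"]),
        "permissions": [PERMISSION_NAMES.get(p, p) for p in query["sp"]],
        "lifetime_hours": (expiry - start).total_seconds() / 3600,
        "active": start <= now < expiry,
    }


sas_url = (
    "https://naveendatalake.blob.core.windows.net/raw/customers.csv"
    "?sv=2023-11-03&st=2026-06-11T00:00:00Z&se=2026-06-12T00:00:00Z"
    "&sr=b&sp=r&sig=PLACEHOLDER"  # placeholder, not a real signature
)

print(describe_sas(sas_url, now=datetime(2026, 6, 11, 12, 0, tzinfo=timezone.utc)))
```

A token that reports broad permissions or a lifetime of months is a sign it should be reissued with a tighter scope.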

Key RBAC Roles

Role	Allows
Storage Blob Data Reader	Read blobs
Storage Blob Data Contributor	Read + write + delete blobs
Storage Blob Data Owner	Full control
Storage Account Contributor	Manage account, NOT blob data

Common mistake: Assigning Storage Account Contributor for data access. It doesn’t grant blob data access — you need Storage Blob Data Contributor.
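
The role table can be read as a simple lookup. The sketch below is a deliberately simplified model, not the real Azure RBAC engine: the action names (read, write, delete, manage_access) are my own shorthand for the "Allows" column, and it only models data access authorized through Azure AD.

```python
# Simplified model of the table above. Action names are illustrative shorthand.
ROLE_DATA_ACTIONS = {
    "Storage Blob Data Reader": {"read"},
    "Storage Blob Data Contributor": {"read", "write", "delete"},
    "Storage Blob Data Owner": {"read", "write", "delete", "manage_access"},
    # Management-plane role: no blob data actions through Azure AD authorization.
    "Storage Account Contributor": set(),
}


def can_perform(roles, action: str) -> bool:
    """Return True if any assigned role allows the given data action."""
    return any(action in ROLE_DATA_ACTIONS.get(role, set()) for role in roles)


print(can_perform(["Storage Account Contributor"], "read"))     # False
print(can_perform(["Storage Blob Data Contributor"], "write"))  # True
```

This is the reason a pipeline identity with only Storage Account Contributor gets authorization errors when it tries to read or write blobs.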

Accessing Blob Storage

Python SDK

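import os

from azure.storage.blob import BlobServiceClient

# Read the connection string from the environment instead of hard-coding it.
# The variable name is a common convention; adjust it to your setup.
conn_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]

blob_service = BlobServiceClient.from_connection_string(conn_str)
container = blob_service.get_container_client("my-container")

# Upload (overwrite=True replaces an existing blob instead of raising an error)
with open("data.csv", "rb") as data:
    container.upload_blob("path/to/data.csv", data, overwrite=True)

# Download the blob and read its full contents into memory as bytes
downloader = container.download_blob("path/to/data.csv")
content = downloader.readall()

# List blobs under a virtual folder (name prefix)
for item in container.list_blobs(name_starts_with="path/"):
    print(item.name, item.size)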
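
The snippet above authenticates with a connection string, which embeds the account key. For production, a sketch along these lines avoids keys entirely by using Azure AD credentials (Managed Identity when running in Azure, your developer login locally). It assumes the azure-identity and azure-storage-blob packages are installed and that the identity holds a Storage Blob Data role on the account. The helper names are made up for illustration.

```python
def account_url(account_name: str) -> str:
    """Build the Blob service endpoint for a storage account."""
    return f"https://{account_name}.blob.core.windows.net"


def make_blob_service_client(account_name: str):
    """Create a BlobServiceClient that authenticates with Azure AD, not keys."""
    # Imported lazily so account_url() still works where the SDK is not installed.
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import BlobServiceClient

    # DefaultAzureCredential tries managed identity, environment variables,
    # Azure CLI login, and other sources in turn.
    credential = DefaultAzureCredential()
    return BlobServiceClient(account_url=account_url(account_name), credential=credential)


print(account_url("naveendatalake"))
```

Calling make_blob_service_client("naveendatalake") then gives you a client that works with the same upload, download, and list calls shown above.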

AzCopy

azcopy copy "local/file.csv" "https://account.blob.core.windows.net/container/file.csv"
azcopy sync "local/dir/" "https://account.blob.core.windows.net/container/dir/" --recursive

Blob Storage vs ADLS Gen2

ADLS Gen2 IS Blob Storage with hierarchical namespace enabled.

Feature	Blob Storage	ADLS Gen2
Namespace	Flat (virtual folders)	Hierarchical (real directories)
Rename folder	Copy+delete each blob	Atomic single operation
POSIX permissions	Not supported	Supported (ACLs)
Analytics optimized	No	Yes
Cost	Same	Same

For data engineering, always use ADLS Gen2. Same cost, better performance.

Blob Storage in Data Engineering Pipelines

Every pipeline on this blog touches Blob Storage or ADLS Gen2. Here is how storage fits into the Medallion Architecture:

Source Systems                    ADLS Gen2 / Blob Storage                    Consumers
┌──────────────┐    ┌─────────────────────────────────────────────┐    ┌──────────────┐
│ SQL Database │──>│  raw/          curated/        analytics/    │──>│ Power BI     │
│ APIs         │──>│  (Bronze)  ──>  (Silver)  ──>  (Gold)        │──>│ Analysts     │
│ CSV files    │──>│                                              │──>│ Databricks   │
└──────────────┘    └─────────────────────────────────────────────┘    └──────────────┘
                     ADF Copy       Notebook/        SCD MERGE
                     Activity       Data Flow        Notebook

Key pipeline patterns that use Blob Storage:

Pipeline Pattern	How It Uses Storage	Blog Post
Metadata-Driven Load	Copy from SQL → Parquet files in ADLS `raw/` container	Metadata-Driven Pipeline
Incremental Loading	Copy only new rows → append to date-partitioned folders	Incremental Loading
Data Quality + SCD	Read from `raw/`, clean, write to `curated/`	Data Quality + SCD
Databricks Bronze → Silver	Notebooks read Parquet from ADLS, write Delta to Silver	PySpark Transformations

Common storage URL formats:

Blob Storage endpoint:
  https://naveendatalake.blob.core.windows.net/raw/customers/file.csv

ADLS Gen2 (DFS) endpoint:
  https://naveendatalake.dfs.core.windows.net/raw/customers/file.csv
  abfss://raw@naveendatalake.dfs.core.windows.net/customers/file.csv

In ADF Sink path:     raw/customers/@{formatDateTime(utcNow(),'yyyy/MM/dd')}/
In Databricks:        abfss://raw@naveendatalake.dfs.core.windows.net/customers/
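
Because the three URL formats differ only in scheme and host, it is easy to generate them from the same inputs. Here is a small sketch in plain Python. The helper names are made up, and the date-partitioned folder mirrors the yyyy/MM/dd pattern from the ADF expression above.

```python
from datetime import datetime, timezone


def blob_url(account: str, container: str, path: str) -> str:
    """HTTPS URL on the Blob endpoint."""
    return f"https://{account}.blob.core.windows.net/{container}/{path}"


def dfs_url(account: str, container: str, path: str) -> str:
    """HTTPS URL on the ADLS Gen2 (DFS) endpoint."""
    return f"https://{account}.dfs.core.windows.net/{container}/{path}"


def abfss_uri(account: str, container: str, path: str) -> str:
    """abfss URI as used by Spark and Databricks (container goes before the @)."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"


def date_partitioned_folder(base: str, when: datetime) -> str:
    """Folder path such as raw/customers/2026/06/11/ for a given date."""
    return f"{base}/{when:%Y/%m/%d}/"


print(blob_url("naveendatalake", "raw", "customers/file.csv"))
print(abfss_uri("naveendatalake", "raw", "customers/file.csv"))
print(date_partitioned_folder("raw/customers", datetime.now(timezone.utc)))
```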

Redundancy Options

Azure replicates your data for durability. The redundancy option you choose affects cost, availability, and disaster recovery:

Redundancy	Copies	Where	Survives	Cost	Use Case
LRS (Locally Redundant)	3	Same data center	Disk/rack failure	Cheapest	Dev/test, non-critical data
ZRS (Zone Redundant)	3	3 availability zones in same region	Data center failure	~25% more than LRS	Production data within one region
GRS (Geo Redundant)	6	3 local + 3 in a paired region	Regional disaster	~2x LRS	Business-critical data
RA-GRS (Read-Access Geo)	6	Same as GRS + read access to secondary	Regional disaster + read failover	Most expensive	High availability, read during outage

Real-life analogy: LRS keeps three photocopies of your documents in the same filing cabinet — protects against a page getting damaged, but if the cabinet burns, all copies are gone. ZRS keeps copies in three different offices in the same city. GRS keeps copies in offices in two different cities. RA-GRS lets you read from the second city even while the first is offline.

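For data engineering: Use LRS for dev/test and Bronze layer scratch data. Use ZRS or GRS for production Silver/Gold layers. The cost difference is significant at scale: GRS costs roughly twice as much as LRS, so keeping 100 TB on LRS instead of GRS can save on the order of $1,000 or more per month, depending on tier and region.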

Lifecycle Management

Automate tier transitions:

{
  "rules": [{
    "enabled": true,
    "name": "archive-old-data",
    "type": "Lifecycle",
    "definition": {
      "filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["data/"]},
      "actions": {
        "baseBlob": {
          "tierToCool": {"daysAfterModificationGreaterThan": 30},
          "tierToArchive": {"daysAfterModificationGreaterThan": 365},
          "delete": {"daysAfterModificationGreaterThan": 730}
        }
      }
    }
  }]
}

Set up in Storage Account > Data management > Lifecycle management.
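
To reason about what the policy above will do to a given blob, here is a small sketch that mimics the rule in plain Python. It assumes the documented behavior that when several conditions match, the cheapest outcome wins (delete, then archive, then cool). It is a model for thinking about the rule, not the service's actual evaluation code, and the names are made up.

```python
# Thresholds copied from the rule above (days since last modification).
RULE_THRESHOLDS = {"tierToCool": 30, "tierToArchive": 365, "delete": 730}

# When several actions qualify, the cheapest outcome is applied first.
PRECEDENCE = ["delete", "tierToArchive", "tierToCool"]


def action_for(days_since_modification: int):
    """Return the action the rule would apply, or None if the blob is too recent."""
    for action in PRECEDENCE:
        # "GreaterThan" is strict: a blob modified exactly 30 days ago is not yet moved.
        if days_since_modification > RULE_THRESHOLDS[action]:
            return action
    return None


for age in [10, 31, 400, 800]:
    print(age, action_for(age))
```

Running a few ages through a model like this before deploying a policy is a cheap way to catch a delete threshold that is shorter than your retention requirement.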

Cost Optimization

Right access tier: moving 1 TB from Hot to Cool saves ~$8/month
Use Parquet: often 5-10x smaller than the equivalent CSV, so roughly 5-10x cheaper to store
Lifecycle policies: automate tier transitions
LRS vs GRS: LRS costs about half as much as GRS. Use it for non-critical data.
Delete old outputs: daily full loads create redundant data
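
To see how tier and redundancy choices combine, here is a rough estimator built from the relative figures quoted in this post (Cool ~40% less, Cold ~55% less, Archive ~75% less than Hot; ZRS ~25% more and GRS ~2x LRS). The Hot LRS price of $0.02 per GB per month is an assumed placeholder, not a quoted Azure price, and the sketch ignores transaction, retrieval, and early-deletion charges. Check the pricing page for your region before relying on any number.

```python
# Relative storage price factors taken from the tables in this post (approximate).
TIER_FACTOR = {"hot": 1.00, "cool": 0.60, "cold": 0.45, "archive": 0.25}
REDUNDANCY_FACTOR = {"LRS": 1.00, "ZRS": 1.25, "GRS": 2.00}

ASSUMED_HOT_LRS_PRICE_PER_GB = 0.02  # placeholder, NOT an official price


def monthly_storage_cost(size_gb, tier, redundancy,
                         hot_lrs_price_per_gb=ASSUMED_HOT_LRS_PRICE_PER_GB):
    """Estimate monthly storage cost only (no transactions or retrieval fees)."""
    return size_gb * hot_lrs_price_per_gb * TIER_FACTOR[tier] * REDUNDANCY_FACTOR[redundancy]


one_tb = 1024  # GB
hot = monthly_storage_cost(one_tb, "hot", "LRS")
cool = monthly_storage_cost(one_tb, "cool", "LRS")
print(f"1 TB Hot:  ${hot:.2f}/month")
print(f"1 TB Cool: ${cool:.2f}/month")
print(f"Savings:   ${hot - cool:.2f}/month")
```

With the placeholder price, moving 1 TB from Hot to Cool saves about $8 a month, which lines up with the figure in the list above.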

Common Mistakes

Using Blob Storage instead of ADLS Gen2 for data engineering — ADLS Gen2 costs the same but adds hierarchical namespace, real directories, and ACLs. Always enable hierarchical namespace when creating storage accounts for data lakes.
Leaving everything in Hot tier — Bronze layer data older than 30 days is rarely re-read. Move it to Cool or Cold with lifecycle policies and save 40-55% on storage.
Using Storage Account Keys in production — keys give full access to everything. Use Managed Identity for Azure services (ADF, Databricks, Synapse) and Azure AD RBAC for users.
Confusing Storage Account Contributor with Blob Data Contributor — the first manages the account (settings, keys). The second reads and writes blob data. Data engineers need Storage Blob Data Contributor.
Not creating containers before running pipelines — ADF and Synapse can create folders (blob prefixes), but containers must exist beforehand. A pipeline targeting a non-existent container fails.
Archiving data you might need soon — Archive tier takes up to 15 hours to rehydrate. If there is any chance you need the data within a day, use Cool or Cold instead.

Interview Questions

Q: What is Azure Blob Storage? A: Object storage for unstructured data. Stores blobs in containers within storage accounts. Massively scalable, durable, accessible via HTTP.

Q: Blob Storage vs ADLS Gen2? A: ADLS Gen2 is Blob Storage with hierarchical namespace. Adds real directories, POSIX ACLs, and analytics optimizations. Same cost.

Q: What are access tiers? A: Hot (frequent access, highest storage cost, lowest access cost), Cool (infrequent access), Cold (rare access), Archive (offline, cheapest storage, hours to rehydrate).

Q: Why can’t ADF create containers? A: Containers are storage account resources. ADF creates folders (blob name prefixes) but containers must exist before pipeline runs.

Q: What authentication method should you use for ADF connecting to Blob Storage? A: Managed Identity is the recommended approach for production. ADF’s managed identity is granted the Storage Blob Data Contributor role on the storage account. No passwords to manage, no keys to rotate, no secrets to store.

Q: What is the difference between Block Blobs, Append Blobs, and Page Blobs? A: Block Blobs are the default for data files (CSV, Parquet, JSON) — uploaded in blocks, optimized for large reads. Append Blobs are optimized for append-only writes — ideal for log files. Page Blobs are for random read/write operations — used internally for Azure VM disks. Data engineers work almost exclusively with Block Blobs.

Q: How do lifecycle management policies work? A: You define rules based on last modified date. For example: move blobs to Cool after 30 days, to Archive after 365 days, delete after 730 days. Rules run automatically once per day. Set up in Storage Account → Data management → Lifecycle management.

Wrapping Up

Blob Storage is the bedrock of Azure data engineering. Master the hierarchy (Account > Container > Blob), access tiers, RBAC roles, and the distinction from ADLS Gen2.

Related posts: ADLS Gen2 Complete Guide, Parquet vs CSV vs JSON, What is Azure Data Factory?

Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.