Azure Blob Storage Explained: A Complete Guide for Data Engineers
Azure Blob Storage is the foundation of cloud storage on Azure. Whether you are building data pipelines, hosting static websites, or archiving data — Blob Storage is where your data lives. Every pipeline we have built on this blog — metadata-driven loads, incremental loading, SCD pipelines — reads from or writes to storage. Understanding Blob Storage is not optional for data engineers. It is the first building block.
Think of Azure Blob Storage like a massive, pay-per-use warehouse facility. You rent a building (Storage Account), divide it into sections (Containers), and store boxes (Blobs) in each section. You only pay for the space you actually use. Boxes you access daily stay on the ground floor (Hot tier). Boxes you rarely touch go to the basement (Cool/Cold). Boxes you need for legal compliance but never open go to off-site deep storage (Archive tier).
Table of Contents
- What is Azure Blob Storage?
- The Storage Hierarchy
- Creating a Storage Account (Step by Step)
- Types of Blobs
- Access Tiers
- Authentication and Access Control
- Key RBAC Roles
- Accessing Blob Storage
- Blob Storage vs ADLS Gen2
- Blob Storage in Data Engineering Pipelines
- Redundancy Options
- Lifecycle Management
- Cost Optimization
- Common Mistakes
- Interview Questions
- Wrapping Up
What is Azure Blob Storage?
Azure Blob Storage is Microsoft’s object storage service for unstructured data — files, images, videos, logs, data files (CSV, JSON, Parquet), and anything representable as bytes.
“Blob” = Binary Large Object. Key characteristics: massively scalable (petabytes), durable (99.999999999%), accessible via HTTP/HTTPS, pay-per-use, integrated with all Azure services.
The Storage Hierarchy
Storage Account (top-level resource)
└── Container (like a root folder; must exist before uploading)
    └── Blob (the actual file)
Real-life analogy: A Storage Account is like a building in a storage facility. A Container is a room inside that building. A Blob is a box on the shelf inside that room. You cannot put a box (blob) directly in the facility — it must go in a room (container). And you cannot create a room without first having a building (storage account).
Example: A data engineering storage account
Storage Account: naveendatalake
├── Container: raw            ← Bronze layer (raw source data)
│   ├── customers/2026/06/11/customers.csv
│   ├── orders/2026/06/11/orders.json
│   └── products/2026/06/11/products.parquet
├── Container: curated        ← Silver layer (cleaned data)
│   ├── customers/part-00000.snappy.parquet
│   └── orders/part-00000.snappy.parquet
├── Container: analytics      ← Gold layer (business-ready)
│   ├── dim_customer/
│   └── fact_orders/
└── Container: quarantine     ← Bad records from validation
    └── customers/2026/06/11/rejected.parquet
Important: “Folders” in Blob Storage are virtual. A blob named data/customer/file1.csv appears as nested folders, but there are no real directories. The / in the name creates a visual hierarchy. Exception: ADLS Gen2 (hierarchical namespace) has real directories.
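To make the "virtual folder" idea concrete, here is a minimal pure-Python sketch of how a flat list of blob names can be presented as one level of folders by splitting on the `/` delimiter. The helper name `list_level` and the sample names are illustrative assumptions; the code only mimics delimiter-based listing and calls no Azure API.

```python
def list_level(blob_names, prefix="", delimiter="/"):
    """Return (virtual_folders, blobs) that sit directly under `prefix`."""
    folders, blobs = set(), []
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # A delimiter remains, so this name lives inside a virtual folder.
            folders.add(prefix + head + delimiter)
        else:
            # No delimiter left: the blob sits directly at this level.
            blobs.append(name)
    return sorted(folders), sorted(blobs)


# Hypothetical flat blob names, as they would be stored in one container.
names = [
    "customers/2026/06/11/customers.csv",
    "orders/2026/06/11/orders.json",
    "readme.txt",
]
print(list_level(names))
print(list_level(names, prefix="customers/2026/"))
```

Nothing is stored for the "folders" themselves: they are derived from the names on every listing, which is why renaming a folder in a flat namespace means rewriting every blob under it.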
Creating a Storage Account (Step by Step)
- Go to Azure Portal → search “Storage accounts” → click + Create
- Basics tab:
  - Subscription: your Azure subscription
  - Resource group: select or create (e.g., rg-datalake)
  - Storage account name: naveendatalake (globally unique, lowercase, no hyphens)
  - Region: Canada Central (same region as your other resources)
  - Performance: Standard (use Premium only for high-IOPS workloads)
  - Redundancy: LRS for dev, ZRS or GRS for production
- Advanced tab:
  - Enable hierarchical namespace: Yes (this makes it ADLS Gen2; always enable for data engineering)
  - Enable blob public access: No
  - Default access tier: Hot
- Review + Create → Create
The single most important checkbox: “Enable hierarchical namespace.” This one checkbox is the difference between basic Blob Storage and ADLS Gen2. Enabling it costs nothing extra but gives you real directories, POSIX ACLs, and analytics-optimized performance. Choose it at creation time: an existing account can be upgraded later, but the upgrade is one-way and some Blob features may not be compatible with it, so retrofitting is riskier than getting it right up front.
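The naming rule from the Basics tab can be checked before you ever open the portal. The sketch below assumes the standard rule of 3 to 24 characters, lowercase letters and digits only; the function name is illustrative, and global uniqueness cannot be verified locally.

```python
import re

# 3-24 characters, lowercase letters and digits only (no hyphens, no uppercase).
_ACCOUNT_NAME = re.compile(r"[a-z0-9]{3,24}")


def is_valid_storage_account_name(name: str) -> bool:
    """Check the syntactic naming rule only; uniqueness is decided by Azure."""
    return _ACCOUNT_NAME.fullmatch(name) is not None


for candidate in ["naveendatalake", "naveen-datalake", "NaveenLake", "ab"]:
    print(candidate, is_valid_storage_account_name(candidate))
```

A check like this is handy in deployment scripts, where a bad name would otherwise fail only after the deployment has started.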
Types of Blobs
| Blob Type | How It Works | Max Size | Use Case | Data Engineering? |
|---|---|---|---|---|
| Block Blob | Uploaded in blocks, reassembled on read. Optimized for sequential reads. | ~190.7 TiB | CSV, JSON, Parquet, images, videos, backups | Yes, 99% of your work |
| Append Blob | Optimized for append-only writes. New data added to the end only. | ~195 GiB | Log files, audit trails, pipeline run logs | Sometimes, for logging pipelines |
| Page Blob | Optimized for random read/write. 512-byte pages. | 8 TiB | Azure VM disks (VHD files) | Rarely, infrastructure only |
Real-life analogy: Block Blobs are like a novel — written in chapters (blocks) and read front to back. Append Blobs are like a diary — you only add new entries at the end, never edit old ones. Page Blobs are like a whiteboard — you erase and rewrite any section at random.
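The odd-looking 190.7 and 195 figures in the table fall out of simple arithmetic. The sketch below assumes the commonly documented service limits of 50,000 blocks per blob, up to 4,000 MiB per block for block blobs, and up to 4 MiB per block for append blobs; treat those constants as assumptions to confirm against the current Azure limits page.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4

MAX_BLOCKS = 50_000                 # assumed block count limit per blob
BLOCK_BLOB_BLOCK_SIZE = 4000 * MIB  # assumed max block size for block blobs
APPEND_BLOB_BLOCK_SIZE = 4 * MIB    # assumed max block size for append blobs

block_blob_max_tib = MAX_BLOCKS * BLOCK_BLOB_BLOCK_SIZE / TIB
append_blob_max_gib = MAX_BLOCKS * APPEND_BLOB_BLOCK_SIZE / GIB

print(f"Block blob max:  {block_blob_max_tib:.1f} TiB")
print(f"Append blob max: {append_blob_max_gib:.1f} GiB")
```

The results are in binary units (TiB and GiB), which is why they do not look like round numbers.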
Access Tiers
| Tier | Storage Cost | Access Cost | Use Case |
|---|---|---|---|
| Hot | Highest | Lowest | Current month’s pipeline output |
| Cool | ~40% less | Higher | 30+ days old, occasional access |
| Cold | ~55% less | Even higher | 90+ days old, rare access |
| Archive | ~75% less | Highest (hours to rehydrate) | Compliance, long-term retention |
Real-life analogy: Hot tier is your desk drawer — instant access, but limited expensive space. Cool tier is the filing cabinet across the room — a short walk, much cheaper per drawer. Cold tier is the storage closet down the hall — takes a minute, but very cheap. Archive tier is the off-site warehouse — you call them, they deliver it tomorrow, but storage costs almost nothing.
Archive blobs are offline. Rehydration takes up to 15 hours (standard) or under 1 hour (high priority). Never archive data you need quickly.
Authentication and Access Control
| Method | When to Use |
|---|---|
| Storage Account Key | Quick setup. Don’t use in production. |
| SAS Token | Temporary, scoped access. Good for sharing. |
| Azure AD (Entra ID) | Production standard. RBAC. |
| Managed Identity | For Azure services (ADF, Synapse). Recommended. |
Real-life analogy: A Storage Account Key is the master key to the building — anyone who has it can open every door. A SAS Token is a visitor pass with an expiration date and restricted access to specific rooms. Azure AD is the building’s security system — each person badges in with their own ID, and access is controlled per room. Managed Identity is a special badge that Azure services carry automatically — no human has to manage it.
SAS Token Deep Dive
Shared Access Signatures (SAS) are the most nuanced authentication method. Three types exist:
| SAS Type | Scope | Created From | Use Case |
|---|---|---|---|
| Account SAS | Entire storage account (all containers) | Storage account key | Broad access for trusted services |
| Service SAS | Single service (Blob, Queue, Table, or File) | Storage account key | Scoped access to blob service only |
| User Delegation SAS | Single container or blob | Azure AD credentials (no key needed) | Most secure — recommended over Account/Service SAS |
SAS Token URL example:
https://naveendatalake.blob.core.windows.net/raw/customers.csv
?sv=2023-11-03 ← API version
&st=2026-06-11T00:00:00Z ← Start time
&se=2026-06-12T00:00:00Z ← Expiry time (24 hours)
&sr=b ← Resource: b=blob, c=container
&sp=r ← Permissions: r=read, w=write, d=delete
&sig=abc123... ← Signature (proves authenticity)
The token controls: WHAT (resource), WHEN (start/expiry), and HOW (permissions).
If the token leaks, damage is limited to its scope and duration.
Best practice: Use User Delegation SAS (backed by Azure AD) instead of Account/Service SAS (backed by storage keys). If an ad hoc key-based SAS leaks, you must rotate the storage key to invalidate it, which breaks every other SAS token created from that key. (A Service SAS tied to a stored access policy can instead be revoked by changing or deleting that policy.)
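The WHAT / WHEN / HOW breakdown can be read straight off a SAS URL. Here is a minimal sketch using only the standard library; the function name, the placeholder signature, and the small permission map (limited to the three letters shown in the example above) are assumptions for illustration, not part of any Azure SDK.

```python
from urllib.parse import urlsplit, parse_qs

# Only the codes used in the example above; real tokens can carry more.
RESOURCE_TYPES = {"b": "blob", "c": "container"}
PERMISSIONS = {"r": "read", "w": "write", "d": "delete"}


def describe_sas(url: str) -> dict:
    """Summarise what a SAS URL grants, without contacting Azure."""
    parts = urlsplit(url)
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "resource_path": parts.path,                                   # WHAT
        "resource_type": RESOURCE_TYPES.get(query.get("sr"), query.get("sr")),
        "starts": query.get("st"),                                     # WHEN
        "expires": query.get("se"),
        "permissions": [PERMISSIONS.get(c, c) for c in query.get("sp", "")],  # HOW
        "has_signature": "sig" in query,
    }


SAMPLE_URL = (
    "https://naveendatalake.blob.core.windows.net/raw/customers.csv"
    "?sv=2023-11-03&st=2026-06-11T00:00:00Z&se=2026-06-12T00:00:00Z"
    "&sr=b&sp=r&sig=PLACEHOLDER"
)
info = describe_sas(SAMPLE_URL)
print(info)
```

Running a check like this before sharing a token is a cheap way to catch an overly broad permission string or an expiry that is months away.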
Key RBAC Roles
| Role | Allows |
|---|---|
| Storage Blob Data Reader | Read blobs |
| Storage Blob Data Contributor | Read + write + delete blobs |
| Storage Blob Data Owner | Full control |
| Storage Account Contributor | Manage account, NOT blob data |
Common mistake: Assigning Storage Account Contributor for data access. It doesn’t grant blob data access — you need Storage Blob Data Contributor.
Accessing Blob Storage
Python SDK
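import os
from azure.storage.blob import BlobServiceClient

# Connection string read from an environment variable (fine for dev/test; prefer Entra ID in production)
conn_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
blob_service = BlobServiceClient.from_connection_string(conn_str)
container = blob_service.get_container_client("my-container")

# Upload (raises an error if the blob already exists)
with open("data.csv", "rb") as data:
    container.upload_blob("path/to/data.csv", data)

# Download
downloader = container.download_blob("path/to/data.csv")
content = downloader.readall()

# List blobs under a prefix
for item in container.list_blobs(name_starts_with="path/"):
    print(item.name, item.size)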
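Since the authentication section recommends Entra ID and Managed Identity over keys, here is a hedged sketch of the same client built without a connection string. It assumes the `azure-identity` package is installed alongside `azure-storage-blob`, that the caller's identity holds a blob data role on the account, and that the account lives in the public Azure cloud; the helper names are illustrative.

```python
def account_url(account_name: str) -> str:
    """Blob endpoint for a storage account (public Azure cloud assumed)."""
    return f"https://{account_name}.blob.core.windows.net"


def make_blob_service(account_name: str):
    """Build a BlobServiceClient that authenticates with Entra ID instead of a key."""
    # Imported lazily so the URL helper above works without the SDK installed.
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import BlobServiceClient

    # DefaultAzureCredential tries several sources in turn, such as
    # environment variables, a managed identity, and an Azure CLI login.
    credential = DefaultAzureCredential()
    return BlobServiceClient(account_url(account_name), credential=credential)


print(account_url("naveendatalake"))
# Usage (needs network access and an RBAC role such as Storage Blob Data Reader):
# service = make_blob_service("naveendatalake")
# container = service.get_container_client("raw")
```

The rest of the upload, download, and list calls stay the same; only the way the client is constructed changes.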
AzCopy
azcopy copy "local/file.csv" "https://account.blob.core.windows.net/container/file.csv"
azcopy sync "local/dir/" "https://account.blob.core.windows.net/container/dir/" --recursive
Blob Storage vs ADLS Gen2
ADLS Gen2 IS Blob Storage with hierarchical namespace enabled.
| Feature | Blob Storage | ADLS Gen2 |
|---|---|---|
| Namespace | Flat (virtual folders) | Hierarchical (real directories) |
| Rename folder | Copy+delete each blob | Atomic single operation |
| POSIX permissions | Not supported | Supported (ACLs) |
| Analytics optimized | No | Yes |
| Cost | Same | Same |
For data engineering, always use ADLS Gen2. Same cost, better performance.
Blob Storage in Data Engineering Pipelines
Every pipeline on this blog touches Blob Storage or ADLS Gen2. Here is how storage fits into the Medallion Architecture:
Source Systems     ADLS Gen2 / Blob Storage                          Consumers
┌──────────────┐   ┌─────────────────────────────────────────────┐   ┌──────────────┐
│ SQL Database │──>│  raw/           curated/        analytics/  │──>│ Power BI     │
│ APIs         │──>│  (Bronze)  ──>  (Silver)   ──>  (Gold)      │──>│ Analysts     │
│ CSV files    │──>│                                             │──>│ Databricks   │
└──────────────┘   └─────────────────────────────────────────────┘   └──────────────┘
                      ADF Copy       Notebook/       SCD MERGE
                      Activity       Data Flow       Notebook
Key pipeline patterns that use Blob Storage:
| Pipeline Pattern | How It Uses Storage | Blog Post |
|---|---|---|
| Metadata-Driven Load | Copy from SQL → Parquet files in ADLS raw/ container | Metadata-Driven Pipeline |
| Incremental Loading | Copy only new rows → append to date-partitioned folders | Incremental Loading |
| Data Quality + SCD | Read from raw/, clean, write to curated/ | Data Quality + SCD |
| Databricks Bronze → Silver | Notebooks read Parquet from ADLS, write Delta to Silver | PySpark Transformations |
Common storage URL formats:
Blob Storage endpoint:
https://naveendatalake.blob.core.windows.net/raw/customers/file.csv
ADLS Gen2 (DFS) endpoint:
https://naveendatalake.dfs.core.windows.net/raw/customers/file.csv
abfss://raw@naveendatalake.dfs.core.windows.net/customers/file.csv
In ADF Sink path: raw/customers/@{formatDateTime(utcNow(),'yyyy/MM/dd')}/
In Databricks: abfss://raw@naveendatalake.dfs.core.windows.net/customers/
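The URL formats above differ only in endpoint and in where the container name goes, so they are easy to generate from the same three inputs. A small sketch follows; the function names are made up for illustration, and `dated_folder` is a Python stand-in for the ADF `formatDateTime(utcNow(),'yyyy/MM/dd')` expression rather than anything ADF itself runs.

```python
from datetime import datetime, timezone


def blob_url(account: str, container: str, path: str) -> str:
    """HTTPS URL on the Blob endpoint."""
    return f"https://{account}.blob.core.windows.net/{container}/{path}"


def dfs_url(account: str, container: str, path: str) -> str:
    """HTTPS URL on the ADLS Gen2 (DFS) endpoint."""
    return f"https://{account}.dfs.core.windows.net/{container}/{path}"


def abfss_uri(account: str, container: str, path: str) -> str:
    """abfss URI as used by Spark and Databricks: container comes before the @."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"


def dated_folder(base: str, when: datetime) -> str:
    """Date-partitioned folder path in yyyy/MM/dd layout."""
    return f"{base}/{when:%Y/%m/%d}/"


run_date = datetime(2026, 6, 11, tzinfo=timezone.utc)
print(blob_url("naveendatalake", "raw", "customers/file.csv"))
print(abfss_uri("naveendatalake", "raw", "customers/file.csv"))
print(dated_folder("raw/customers", run_date))
```

Keeping these builders in one place avoids the most common path bug, which is putting the container name after the host in an `abfss://` URI.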
Redundancy Options
Azure replicates your data for durability. The redundancy option you choose affects cost, availability, and disaster recovery:
| Redundancy | Copies | Where | Survives | Cost | Use Case |
|---|---|---|---|---|---|
| LRS (Locally Redundant) | 3 | Same data center | Disk/rack failure | Cheapest | Dev/test, non-critical data |
| ZRS (Zone Redundant) | 3 | 3 availability zones in same region | Data center failure | ~25% more than LRS | Production data within one region |
| GRS (Geo Redundant) | 6 | 3 local + 3 in a paired region | Regional disaster | ~2x LRS | Business-critical data |
| RA-GRS (Read-Access Geo) | 6 | Same as GRS + read access to secondary | Regional disaster + read failover | Most expensive | High availability, read during outage |
Real-life analogy: LRS keeps three photocopies of your documents in the same filing cabinet — protects against a page getting damaged, but if the cabinet burns, all copies are gone. ZRS keeps copies in three different offices in the same city. GRS keeps copies in offices in two different cities. RA-GRS lets you read from the second city even while the first is offline.
For data engineering: Use LRS for dev/test and Bronze layer scratch data. Use ZRS or GRS for production Silver/Gold layers. The cost difference is significant at scale: keeping 100 TB on LRS instead of GRS saves roughly $1,000/month (the exact figure depends on access tier and region).
Lifecycle Management
Automate tier transitions:
{
  "rules": [{
    "name": "archive-old-data",
    "enabled": true,
    "type": "Lifecycle",
    "definition": {
      "filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["data/"]},
      "actions": {
        "baseBlob": {
          "tierToCool": {"daysAfterModificationGreaterThan": 30},
          "tierToArchive": {"daysAfterModificationGreaterThan": 365},
          "delete": {"daysAfterModificationGreaterThan": 730}
        }
      }
    }
  }]
}
Set up in Storage Account > Data management > Lifecycle management.
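To reason about what the policy above will do to a given blob, it can help to simulate it. This is a rough sketch, not the service's implementation: it assumes the thresholds are strict "greater than" comparisons (as the property name suggests) and that when several actions qualify, the cheapest one wins (delete, then archive, then cool), which is my understanding of the documented behaviour. The function name is illustrative.

```python
def lifecycle_action(days_since_modified: int,
                     cool_after: int = 30,
                     archive_after: int = 365,
                     delete_after: int = 730):
    """Return the action the sample policy would apply, or None if none applies."""
    # Checked from cheapest outcome to most expensive, so only one action is returned.
    if days_since_modified > delete_after:
        return "delete"
    if days_since_modified > archive_after:
        return "tierToArchive"
    if days_since_modified > cool_after:
        return "tierToCool"
    return None


for age in (10, 30, 31, 400, 731):
    print(age, lifecycle_action(age))
```

Note that a blob modified exactly 30 days ago is left alone; it becomes eligible for Cool only once more than 30 days have passed.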
Cost Optimization
- Right access tier — 1 TB from Hot to Cool saves ~$8/month
- Use Parquet — 5-10x smaller = 5-10x cheaper storage
- Lifecycle policies — automate tier transitions
- LRS vs GRS — LRS is 2x cheaper. Use for non-critical data.
- Delete old outputs — daily full loads create redundant data
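The tier savings can be turned into a quick back-of-the-envelope estimator. The sketch below uses only the article's own approximate ratios (Cool ~40% less, Cold ~55% less, Archive ~75% less) and a Hot price of $20 per TB per month, which is an assumption implied by the "~$8/month" figure above rather than a quoted Azure price. It ignores access, transaction, and early-deletion charges.

```python
# Assumed baseline, implied by "1 TB from Hot to Cool saves ~$8/month" at ~40% less.
HOT_USD_PER_TB_MONTH = 20.0

# Approximate storage discounts relative to Hot, taken from the Access Tiers table.
DISCOUNT = {"hot": 0.0, "cool": 0.40, "cold": 0.55, "archive": 0.75}


def monthly_storage_cost(tb: float, tier: str) -> float:
    """Storage-only cost in USD per month for `tb` terabytes in `tier`."""
    return tb * HOT_USD_PER_TB_MONTH * (1 - DISCOUNT[tier])


def monthly_saving(tb: float, from_tier: str, to_tier: str) -> float:
    """Storage-only saving in USD per month from moving data between tiers."""
    return monthly_storage_cost(tb, from_tier) - monthly_storage_cost(tb, to_tier)


print(f"1 TB Hot -> Cool:     ${monthly_saving(1, 'hot', 'cool'):.2f}/month")
print(f"50 TB Hot -> Archive: ${monthly_saving(50, 'hot', 'archive'):.2f}/month")
```

For a real decision, replace the baseline with the current price for your region and redundancy option, and add the read and rehydration costs of the colder tier.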
Common Mistakes
- Using Blob Storage instead of ADLS Gen2 for data engineering: ADLS Gen2 costs the same but adds hierarchical namespace, real directories, and ACLs. Always enable hierarchical namespace when creating storage accounts for data lakes.
- Leaving everything in Hot tier: Bronze layer data older than 30 days is rarely re-read. Move it to Cool or Cold with lifecycle policies and save 40-55% on storage.
- Using Storage Account Keys in production: keys give full access to everything. Use Managed Identity for Azure services (ADF, Databricks, Synapse) and Azure AD RBAC for users.
- Confusing Storage Account Contributor with Blob Data Contributor: the first manages the account (settings, keys). The second reads and writes blob data. Data engineers need Storage Blob Data Contributor.
- Not creating containers before running pipelines: ADF and Synapse can create folders (blob prefixes), but containers must exist beforehand. A pipeline targeting a non-existent container fails.
- Archiving data you might need soon: Archive tier takes up to 15 hours to rehydrate. If there is any chance you need the data within a day, use Cool or Cold instead.
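The "containers must exist beforehand" mistake is easy to guard against with a small pre-flight step. Below is a minimal sketch that relies only on three methods I expect on the `azure-storage-blob` clients (`get_container_client`, `exists`, `create_container`); the function name is illustrative, the check-then-create pattern is not race-free if two jobs run it at once, and the calling identity needs permission to create containers.

```python
def ensure_container(service_client, name: str):
    """Create the container if it is missing, then return its client."""
    container = service_client.get_container_client(name)
    if not container.exists():
        # Only reached on the first run; later runs reuse the existing container.
        container.create_container()
    return container


# Usage, assuming `service` is an authenticated BlobServiceClient:
# for layer in ("raw", "curated", "analytics", "quarantine"):
#     ensure_container(service, layer)
```

Running this once at deployment time, rather than inside every pipeline run, keeps the pipeline identity's permissions narrower.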
Interview Questions
Q: What is Azure Blob Storage?
A: Object storage for unstructured data. Stores blobs in containers within storage accounts. Massively scalable, durable, accessible via HTTP.
Q: Blob Storage vs ADLS Gen2?
A: ADLS Gen2 is Blob Storage with hierarchical namespace. Adds real directories, POSIX ACLs, and analytics optimizations. Same cost.
Q: What are access tiers?
A: Hot (frequent, expensive storage), Cool (infrequent), Cold (rare), Archive (offline, cheapest, hours to rehydrate).
Q: Why can’t ADF create containers?
A: Containers are storage account resources. ADF creates folders (blob name prefixes) but containers must exist before pipeline runs.
Q: What authentication method should you use for ADF connecting to Blob Storage?
A: Managed Identity is the recommended approach for production. ADF’s managed identity is granted the Storage Blob Data Contributor role on the storage account. No passwords to manage, no keys to rotate, no secrets to store.
Q: What is the difference between Block Blobs, Append Blobs, and Page Blobs?
A: Block Blobs are the default for data files (CSV, Parquet, JSON): uploaded in blocks, optimized for large reads. Append Blobs are optimized for append-only writes and are ideal for log files. Page Blobs are for random read/write operations and are used internally for Azure VM disks. Data engineers work almost exclusively with Block Blobs.
Q: How do lifecycle management policies work?
A: You define rules based on last modified date. For example: move blobs to Cool after 30 days, to Archive after 365 days, delete after 730 days. Rules run automatically once per day. Set up in Storage Account → Data management → Lifecycle management.
Wrapping Up
Blob Storage is the bedrock of Azure data engineering. Master the hierarchy (Account > Container > Blob), access tiers, RBAC roles, and the distinction from ADLS Gen2.
Related posts:
- ADLS Gen2 Complete Guide
- Parquet vs CSV vs JSON
- What is Azure Data Factory?
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.