Azure Blob Storage Explained: A Complete Guide for Data Engineers
Azure Blob Storage is the foundation of cloud storage on Azure. Whether you’re building data pipelines, hosting static websites, or archiving data — Blob Storage is where your data lives.
What is Azure Blob Storage?
Azure Blob Storage is Microsoft’s object storage service for unstructured data — files, images, videos, logs, data files (CSV, JSON, Parquet), and anything representable as bytes.
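“Blob” stands for Binary Large Object. Key characteristics: massively scalable (petabytes and beyond), durable (at least 99.999999999%, or eleven nines, even with locally redundant storage), accessible over HTTP/HTTPS, billed per use, and integrated with most Azure data services such as Azure Data Factory and Synapse.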
The Storage Hierarchy
Storage Account (top-level resource)
  -- Container (like a root folder; must exist before uploading)
    -- Blob (the actual file)
Important: “Folders” in Blob Storage are virtual. A blob named data/customer/file1.csv appears as nested folders, but there are no real directories. The / in the name creates a visual hierarchy. Exception: ADLS Gen2 (hierarchical namespace) has real directories.
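To see why these folders are only a naming convention, here is a minimal sketch in plain Python (no Azure SDK involved) that derives a one-level "folder" view from a flat list of blob names, similar in spirit to listing with a prefix and a `/` delimiter. The function name `list_level` and the sample blob names are hypothetical.

```python
def list_level(blob_names, prefix="", delimiter="/"):
    """Return (virtual_folders, blobs) that sit directly under `prefix`."""
    folders, blobs = set(), []
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # A delimiter follows `head`, so it is displayed as a folder.
            folders.add(prefix + head + delimiter)
        else:
            blobs.append(name)
    return sorted(folders), sorted(blobs)


# Hypothetical blob names; the container stores only these flat strings.
names = [
    "data/customer/file1.csv",
    "data/customer/file2.csv",
    "data/orders/2024.csv",
    "readme.txt",
]

root_folders, root_blobs = list_level(names)
customer_folders, customer_blobs = list_level(names, prefix="data/customer/")
```

Nothing named `data/` is stored anywhere. The "folder" exists only because at least one blob name starts with that prefix, which is why deleting the last blob under a prefix makes the folder disappear.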
Types of Blobs
- Block Blobs — most common, used for data files, images, documents. Up to about 190.7 TiB (50,000 blocks of up to 4,000 MiB each).
- Append Blobs — optimized for append operations. Perfect for log files.
- Page Blobs — for random read/write. Used for VM disks. Rarely used in data engineering.
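The block blob size limit follows from two service limits: a blob can hold up to 50,000 blocks, and each block can be up to 4,000 MiB. A quick check of the arithmetic, as a small sketch:

```python
MAX_BLOCKS = 50_000          # maximum number of blocks in one block blob
MAX_BLOCK_SIZE_MIB = 4_000   # maximum size of a single block, in MiB

max_size_mib = MAX_BLOCKS * MAX_BLOCK_SIZE_MIB
max_size_tib = max_size_mib / 1024**2  # 1 TiB = 1024 * 1024 MiB

print(f"Maximum block blob size: {max_size_tib:.1f} TiB")  # prints 190.7 TiB
```

In practice you rarely get close to this limit. Data engineering workloads usually split data into many smaller files so that engines can read them in parallel.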
Access Tiers
| Tier | Storage Cost | Access Cost | Use Case |
|---|---|---|---|
| Hot | Highest | Lowest | Current month’s pipeline output |
| Cool | ~40% less | Higher | 30+ days old, occasional access |
| Cold | ~55% less | Even higher | 90+ days old, rare access |
| Archive | ~75% less | Highest (hours to rehydrate) | Compliance, long-term retention |
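Archive blobs are offline: you cannot read them until they are rehydrated to an online tier. Rehydration takes up to 15 hours at standard priority, or may finish in under 1 hour at high priority for objects smaller than 10 GB. Never archive data you need quickly.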
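The tier table can be read as a simple decision rule. The sketch below is one hypothetical way to encode it; `suggest_tier` and its `needs_fast_reads` flag are illustrative names, and the 30 and 90 day thresholds come straight from the table above.

```python
def suggest_tier(days_since_last_access, needs_fast_reads=True):
    """Suggest an access tier from data age, using the table's thresholds."""
    if days_since_last_access < 30:
        return "Hot"
    if days_since_last_access < 90:
        return "Cool"
    if needs_fast_reads:
        # Archive is offline, so data that must be readable right away stays in Cold.
        return "Cold"
    return "Archive"


recent = suggest_tier(5)
last_quarter = suggest_tier(45)
old_but_queryable = suggest_tier(400)
compliance_copy = suggest_tier(400, needs_fast_reads=False)
```

The extra flag is the important part: age alone should not send data to Archive, because the hours-long rehydration delay only works for data nobody is waiting on.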
Authentication and Access Control
| Method | When to Use |
|---|---|
| Storage Account Key | Quick setup. Don’t use in production. |
| SAS Token | Temporary, scoped access. Good for sharing. |
| Microsoft Entra ID (formerly Azure AD) | Production standard. RBAC. |
| Managed Identity | For Azure services (ADF, Synapse). Recommended. |
Key RBAC Roles
| Role | Allows |
|---|---|
| Storage Blob Data Reader | Read blobs |
| Storage Blob Data Contributor | Read + write + delete blobs |
| Storage Blob Data Owner | Full control |
| Storage Account Contributor | Manage account, NOT blob data |
Common mistake: Assigning Storage Account Contributor for data access. It doesn’t grant blob data access — you need Storage Blob Data Contributor.
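The mistake is easier to spot when the roles are written down as data. This is a deliberately simplified model of the table above, not the real Azure role definitions; the action names (`read`, `write`, `delete`, `manage_access`) and the helper `can` are hypothetical.

```python
# Simplified view of which blob data actions each role allows.
DATA_PERMISSIONS = {
    "Storage Blob Data Reader": {"read"},
    "Storage Blob Data Contributor": {"read", "write", "delete"},
    "Storage Blob Data Owner": {"read", "write", "delete", "manage_access"},
    # Manages the account itself, but grants no access to blob data.
    "Storage Account Contributor": set(),
}


def can(role, action):
    """Return True if `role` allows the blob data `action` in this toy model."""
    return action in DATA_PERMISSIONS.get(role, set())


pipeline_can_write = can("Storage Blob Data Contributor", "write")
admin_can_read = can("Storage Account Contributor", "read")
```

Here `admin_can_read` is `False`: a role with "Contributor" in its name still cannot read a single blob when requests are authorized through RBAC.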
Accessing Blob Storage
Python SDK
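import os

from azure.storage.blob import BlobServiceClient

# Assumes the connection string is stored in an environment variable.
conn_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
blob_service = BlobServiceClient.from_connection_string(conn_str)
container = blob_service.get_container_client("my-container")

# Upload (raises ResourceExistsError if the blob exists; pass overwrite=True to replace it)
with open("data.csv", "rb") as data:
    container.upload_blob("path/to/data.csv", data)

# Download the blob and read its full content into memory
downloader = container.download_blob("path/to/data.csv")
content = downloader.readall()

# List blobs under a prefix
for item in container.list_blobs(name_starts_with="path/"):
    print(item.name, item.size)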
AzCopy
# Authenticate first with "azcopy login", or append a SAS token to each URL
azcopy copy "local/file.csv" "https://account.blob.core.windows.net/container/file.csv"
azcopy sync "local/dir/" "https://account.blob.core.windows.net/container/dir/" --recursive
Blob Storage vs ADLS Gen2
ADLS Gen2 IS Blob Storage with hierarchical namespace enabled.
| Feature | Blob Storage | ADLS Gen2 |
|---|---|---|
| Namespace | Flat (virtual folders) | Hierarchical (real directories) |
| Rename folder | Copy+delete each blob | Atomic single operation |
| POSIX permissions | Not supported | Supported (ACLs) |
| Analytics optimized | No | Yes |
| Cost | Same | Same |
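For data engineering, prefer ADLS Gen2 unless you depend on a Blob Storage feature that hierarchical namespace does not support. The cost is the same, and directory operations such as renames and deletes are much faster.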
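The "Rename folder" row is where the difference shows up most in pipelines. The toy model below uses plain dictionaries, not the Azure SDK, to contrast the two namespaces; both function names and the sample data are hypothetical.

```python
def rename_prefix_flat(blobs, old_prefix, new_prefix):
    """Rename a virtual folder in a flat namespace; return the operation count."""
    operations = 0
    for name in [n for n in blobs if n.startswith(old_prefix)]:
        blobs[new_prefix + name[len(old_prefix):]] = blobs[name]  # copy
        del blobs[name]                                            # delete
        operations += 2
    return operations


def rename_directory_hns(tree, old_name, new_name):
    """Rename a real directory in a toy hierarchical namespace: one metadata change."""
    tree[new_name] = tree.pop(old_name)
    return 1


flat = {"raw/a.csv": b"1", "raw/b.csv": b"2", "raw/c.csv": b"3"}
tree = {"raw": {"a.csv": b"1", "b.csv": b"2", "c.csv": b"3"}}

flat_ops = rename_prefix_flat(flat, "raw/", "bronze/")
hns_ops = rename_directory_hns(tree, "raw", "bronze")
```

With three blobs the flat rename already needs six operations, and it grows with the number of files. It is also not atomic: a failure halfway leaves files under both prefixes, which matters for jobs that commit output by renaming a temporary directory.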
Lifecycle Management
Automate tier transitions:
{
  "rules": [{
    "enabled": true,
    "name": "archive-old-data",
    "type": "Lifecycle",
    "definition": {
      "filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["data/"]},
      "actions": {
        "baseBlob": {
          "tierToCool": {"daysAfterModificationGreaterThan": 30},
          "tierToArchive": {"daysAfterModificationGreaterThan": 365},
          "delete": {"daysAfterModificationGreaterThan": 730}
        }
      }
    }
  }]
}
Set up in Storage Account > Data management > Lifecycle management.
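To reason about what this policy does to a given blob, it can help to evaluate it by hand. The sketch below is a simplified local model, not the service's implementation. It assumes the 30, 365, and 730 day thresholds and the `data/` prefix from the policy above, and applies the documented rule that when several actions match, the least expensive one wins (delete, then archive, then cool). The function name `action_for` is hypothetical.

```python
POLICY_ACTIONS = {
    "tierToCool": 30,
    "tierToArchive": 365,
    "delete": 730,
}

# When more than one action matches, the cheapest one is applied.
PRECEDENCE = ["delete", "tierToArchive", "tierToCool"]


def action_for(name, days_since_modified, prefix="data/"):
    """Return the action this policy would apply to a blob, or None."""
    if not name.startswith(prefix):
        return None  # outside the prefixMatch filter
    for action in PRECEDENCE:
        # "GreaterThan" is strict: exactly 30 days does not trigger tierToCool.
        if days_since_modified > POLICY_ACTIONS[action]:
            return action
    return None


fresh = action_for("data/sales.parquet", 10)
month_old = action_for("data/sales.parquet", 31)
year_old = action_for("data/sales.parquet", 400)
expired = action_for("data/sales.parquet", 731)
unmatched = action_for("logs/app.log", 1000)
```

Note that the real service evaluates policies on its own schedule, so a blob may change tier some time after it crosses a threshold rather than at that exact moment.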
Cost Optimization
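- Right access tier: moving 1 TB from Hot to Cool saves roughly $8/month in storage charges (varies by region and redundancy)
- Use Parquet: files are typically 5-10x smaller than the equivalent CSV, so storage is roughly 5-10x cheaper
- Lifecycle policies: automate tier transitions
- LRS vs GRS: LRS costs roughly half as much as GRS. Use it for non-critical or easily reproducible data.
- Delete old outputs: daily full loads leave redundant copies behind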
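The first two savings are easy to estimate yourself. The per-GB prices below are illustrative assumptions chosen to match the ~$8/month figure, not current Azure pricing; check the pricing page for your region and redundancy option before relying on them.

```python
HOT_PRICE_PER_GB = 0.018   # assumed, USD per GB per month
COOL_PRICE_PER_GB = 0.010  # assumed, USD per GB per month


def monthly_storage_cost(size_gb, price_per_gb):
    """Storage-only cost; ignores transaction and retrieval charges."""
    return size_gb * price_per_gb


size_gb = 1024  # 1 TB of CSV data

# Saving from moving the same data from Hot to Cool.
tier_saving = (monthly_storage_cost(size_gb, HOT_PRICE_PER_GB)
               - monthly_storage_cost(size_gb, COOL_PRICE_PER_GB))

# Saving from storing it as Parquet instead, at the low end of 5-10x smaller.
parquet_gb = size_gb / 5
format_saving = (monthly_storage_cost(size_gb, HOT_PRICE_PER_GB)
                 - monthly_storage_cost(parquet_gb, HOT_PRICE_PER_GB))

print(f"Hot -> Cool saves ${tier_saving:.2f}/month")
print(f"CSV -> Parquet saves ${format_saving:.2f}/month")
```

This only models storage charges. Cool, Cold, and Archive bill more per read and carry early deletion fees, so frequently accessed data can end up costing more in a cheaper tier.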
Interview Questions
Q: What is Azure Blob Storage? A: Object storage for unstructured data. Stores blobs in containers within storage accounts. Massively scalable, durable, accessible via HTTP.
Q: Blob Storage vs ADLS Gen2? A: ADLS Gen2 is Blob Storage with hierarchical namespace. Adds real directories, POSIX ACLs, and analytics optimizations. Same cost.
Q: What are access tiers? A: Hot (frequent, expensive storage), Cool (infrequent), Cold (rare), Archive (offline, cheapest, hours to rehydrate).
Q: Why can’t ADF create containers? A: Containers are storage account resources. ADF creates folders (blob name prefixes) but containers must exist before pipeline runs.
Wrapping Up
Blob Storage is the bedrock of Azure data engineering. Master the hierarchy (Account > Container > Blob), access tiers, RBAC roles, and the distinction from ADLS Gen2.
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.