Azure Blob Storage Explained: A Complete Guide for Data Engineers

Azure Blob Storage is the foundation of cloud storage on Azure. Whether you’re building data pipelines, hosting static websites, or archiving data — Blob Storage is where your data lives.

What is Azure Blob Storage?

Azure Blob Storage is Microsoft’s object storage service for unstructured data — files, images, videos, logs, data files (CSV, JSON, Parquet), and anything representable as bytes.

“Blob” = Binary Large Object. Key characteristics: massively scalable (petabytes), durable (99.999999999%), accessible via HTTP/HTTPS, pay-per-use, integrated with all Azure services.

The Storage Hierarchy

Storage Account (top-level resource)
  -- Container (like a root folder - must exist before uploading)
    -- Blob (the actual file)

Important: “Folders” in Blob Storage are virtual. A blob named data/customer/file1.csv appears as nested folders, but there are no real directories. The / in the name creates a visual hierarchy. Exception: ADLS Gen2 (hierarchical namespace) has real directories.
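
The virtual hierarchy is easy to reproduce locally. A pure-Python sketch of how first-level "folders" fall out of the `/` characters in flat blob names (server-side, `ContainerClient.walk_blobs` does essentially this):

```python
def virtual_folders(blob_names, prefix=""):
    """First-level virtual 'folders' under a prefix, derived purely from
    the / characters in flat blob names -- no real directories involved."""
    folders = set()
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if "/" in rest:
            folders.add(prefix + rest.split("/", 1)[0] + "/")
    return sorted(folders)

# virtual_folders(["data/customer/f1.csv", "readme.txt"]) -> ["data/"]
```

Note that a "folder" disappears the moment its last blob is deleted, because the folder was never a resource to begin with.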

Types of Blobs

  • Block Blobs — most common, used for data files, images, documents. Up to 190.7 TB.
  • Append Blobs — optimized for append operations. Perfect for log files.
  • Page Blobs — for random read/write. Used for VM disks. Rarely used in data engineering.
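
For the log-file case, here is a minimal sketch (assumes `azure-storage-blob` is installed; the `container_client` is an SDK `ContainerClient`, and the blob/event names are placeholders):

```python
import datetime

def format_log_line(event: str) -> bytes:
    """One timestamped line per event; append blobs only add to the end."""
    ts = datetime.datetime.now(datetime.timezone.utc).isoformat()
    return f"{ts} {event}\n".encode()

def append_log(container_client, blob_name: str, event: str) -> None:
    """Append a line to a log blob, creating the append blob on first use."""
    blob = container_client.get_blob_client(blob_name)
    if not blob.exists():
        blob.create_append_blob()
    blob.append_block(format_log_line(event))
```

A block blob would require rewriting (or re-staging) content to add a line; an append blob makes each write a cheap `append_block` call.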

Access Tiers

Tier      Storage Cost   Access Cost                    Use Case
Hot       Highest        Lowest                         Current month's pipeline output
Cool      ~40% less      Higher                         30+ days old, occasional access
Cold      ~55% less      Even higher                    90+ days old, rare access
Archive   ~75% less      Highest (hours to rehydrate)   Compliance, long-term retention

Archive blobs are offline. Rehydration takes up to 15 hours (standard) or under 1 hour (high priority). Never archive data you need quickly.
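
The thresholds in the table map naturally to a tier-selection helper. A local sketch (the day thresholds are illustrative, not Azure-mandated; the actual tier change is a single SDK call, `blob_client.set_standard_blob_tier("Cool")`):

```python
def choose_tier(days_since_modified: int) -> str:
    """Map blob age to a target access tier, mirroring the table above.
    Thresholds are illustrative; tune them to your access patterns."""
    if days_since_modified >= 90:
        # Rare access. Only go to "Archive" if hours-long rehydration
        # is acceptable -- archived blobs are offline.
        return "Cold"
    if days_since_modified >= 30:
        return "Cool"
    return "Hot"
```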

Authentication and Access Control

Method                When to Use
Storage Account Key   Quick setup. Don't use in production.
SAS Token             Temporary, scoped access. Good for sharing.
Azure AD (Entra ID)   Production standard. RBAC.
Managed Identity      For Azure services (ADF, Synapse). Recommended.
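
The Entra ID path looks like this in code. A sketch (assumes `azure-identity` and `azure-storage-blob` are installed; the account name is a placeholder):

```python
def account_url(account_name: str) -> str:
    """Blob service endpoint for a storage account (public Azure cloud)."""
    return f"https://{account_name}.blob.core.windows.net"

def blob_service_with_entra_id(account_name: str):
    """Authenticate with Entra ID instead of account keys.
    DefaultAzureCredential tries managed identity, environment
    variables, and az login credentials, in order."""
    # Lazy imports: require azure-identity and azure-storage-blob.
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import BlobServiceClient
    return BlobServiceClient(account_url(account_name),
                             credential=DefaultAzureCredential())
```

The same code then works unchanged on a developer laptop (via `az login`) and inside an Azure service (via managed identity), which is why this is the production standard.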

Key RBAC Roles

Role                            Allows
Storage Blob Data Reader        Read blobs
Storage Blob Data Contributor   Read + write + delete blobs
Storage Blob Data Owner         Full control
Storage Account Contributor     Manage the account, NOT blob data

Common mistake: Assigning Storage Account Contributor for data access. It doesn’t grant blob data access — you need Storage Blob Data Contributor.

Accessing Blob Storage

Python SDK

import os

from azure.storage.blob import BlobServiceClient

# Connection string from the portal (Access keys blade); here read from an
# environment variable rather than hardcoded.
conn_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
blob_service = BlobServiceClient.from_connection_string(conn_str)
container = blob_service.get_container_client("my-container")

# Upload (raises ResourceExistsError if the blob exists; pass overwrite=True to replace)
with open("data.csv", "rb") as data:
    container.upload_blob("path/to/data.csv", data)

# Download
blob = container.download_blob("path/to/data.csv")
content = blob.readall()

# List blobs
for blob in container.list_blobs(name_starts_with="path/"):
    print(blob.name, blob.size)

AzCopy

azcopy copy "local/file.csv" "https://account.blob.core.windows.net/container/file.csv"
azcopy sync "local/dir/" "https://account.blob.core.windows.net/container/dir/" --recursive

Blob Storage vs ADLS Gen2

ADLS Gen2 IS Blob Storage with hierarchical namespace enabled.

Feature               Blob Storage              ADLS Gen2
Namespace             Flat (virtual folders)    Hierarchical (real directories)
Rename folder         Copy + delete each blob   Atomic single operation
POSIX permissions     Not supported             Supported (ACLs)
Analytics optimized   No                        Yes
Cost                  Same                      Same

For data engineering, always use ADLS Gen2. Same cost, better performance.
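
The rename row is worth quantifying. A pure-Python sketch of the operation counts (the real atomic rename on ADLS Gen2 is `DataLakeDirectoryClient.rename_directory` in the `azure-storage-file-datalake` SDK):

```python
def rename_ops(blob_names, old_prefix):
    """Operations needed to 'rename a folder': a flat namespace must
    copy + delete every blob under the prefix; ADLS Gen2 does one
    atomic directory rename regardless of file count."""
    affected = [n for n in blob_names if n.startswith(old_prefix)]
    return {"flat": 2 * len(affected), "hns": 1}
```

With a million files under a prefix, that is two million operations versus one, which is why Spark and other analytics engines that rename output directories run much faster on ADLS Gen2.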

Lifecycle Management

Automate tier transitions:

{
  "rules": [{
    "name": "archive-old-data",
    "enabled": true,
    "type": "Lifecycle",
    "definition": {
      "filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["data/"]},
      "actions": {
        "baseBlob": {
          "tierToCool": {"daysAfterModificationGreaterThan": 30},
          "tierToArchive": {"daysAfterModificationGreaterThan": 365},
          "delete": {"daysAfterModificationGreaterThan": 730}
        }
      }
    }
  }]
}

Set up in Storage Account > Data management > Lifecycle management.
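
The rule's semantics can be sanity-checked with a tiny local evaluator (a sketch of how the thresholds compose, not how the service actually evaluates policies):

```python
from typing import Optional

def lifecycle_action(days_since_modified: int) -> Optional[str]:
    """Which action the policy above would take for a blob under data/.
    The most aggressive matching action wins."""
    if days_since_modified > 730:
        return "delete"
    if days_since_modified > 365:
        return "tierToArchive"
    if days_since_modified > 30:
        return "tierToCool"
    return None  # stays in its current tier
```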

Cost Optimization

  1. Right access tier — 1 TB from Hot to Cool saves ~$8/month
  2. Use Parquet — 5-10x smaller = 5-10x cheaper storage
  3. Lifecycle policies — automate tier transitions
  4. LRS vs GRS — LRS is 2x cheaper. Use for non-critical data.
  5. Delete old outputs — daily full loads create redundant data
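
Point 1 is easy to sanity-check. A sketch using illustrative list prices (assumed, not quoted; check your region's actual pricing):

```python
def monthly_savings_usd(gb: float, hot_per_gb: float = 0.018,
                        cool_per_gb: float = 0.010) -> float:
    """Monthly saving from moving data Hot -> Cool.
    Per-GB prices are illustrative assumptions; real rates vary by
    region and redundancy option, and Cool adds access charges."""
    return gb * (hot_per_gb - cool_per_gb)

# monthly_savings_usd(1024)  -> roughly $8/month for 1 TB
```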

Interview Questions

Q: What is Azure Blob Storage? A: Object storage for unstructured data. Stores blobs in containers within storage accounts. Massively scalable, durable, accessible via HTTP.

Q: Blob Storage vs ADLS Gen2? A: ADLS Gen2 is Blob Storage with hierarchical namespace. Adds real directories, POSIX ACLs, and analytics optimizations. Same cost.

Q: What are access tiers? A: Hot (frequent, expensive storage), Cool (infrequent), Cold (rare), Archive (offline, cheapest, hours to rehydrate).

Q: Why can’t ADF create containers? A: Containers are storage account resources. ADF creates folders (blob name prefixes) but containers must exist before pipeline runs.
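
Because containers must exist before the pipeline runs, it is common to create and validate them up front in deployment code. A sketch (assumes `azure-storage-blob`/`azure-core` are installed; `ensure_container` and the names are illustrative):

```python
import re

def is_valid_container_name(name: str) -> bool:
    """Container names: 3-63 chars, lowercase letters/digits/hyphens,
    must start and end with a letter or digit, no consecutive hyphens."""
    return bool(re.fullmatch(r"(?!.*--)[a-z0-9][a-z0-9-]{1,61}[a-z0-9]", name))

def ensure_container(blob_service, name: str) -> None:
    """Create the container if it doesn't exist (e.g. before an ADF run)."""
    if not is_valid_container_name(name):
        raise ValueError(f"invalid container name: {name}")
    # Lazy import: needs azure-core (pulled in by azure-storage-blob).
    from azure.core.exceptions import ResourceExistsError
    try:
        blob_service.create_container(name)
    except ResourceExistsError:
        pass  # already there -- fine
```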

Wrapping Up

Blob Storage is the bedrock of Azure data engineering. Master the hierarchy (Account > Container > Blob), access tiers, RBAC roles, and the distinction from ADLS Gen2.

Related posts: ADLS Gen2 Complete Guide · Parquet vs CSV vs JSON · What is Azure Data Factory?


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
