Azure Blob Storage Explained: A Complete Guide for Data Engineers

Azure Blob Storage is the foundation of cloud storage on Azure. Whether you’re building data pipelines, hosting static websites, or archiving data — Blob Storage is where your data lives.

What is Azure Blob Storage?

Azure Blob Storage is Microsoft’s object storage service for unstructured data — files, images, videos, logs, data files (CSV, JSON, Parquet), and anything representable as bytes.

“Blob” = Binary Large Object. Key characteristics: massively scalable (petabytes), durable (99.999999999%), accessible via HTTP/HTTPS, pay-per-use, integrated with all Azure services.

The Storage Hierarchy

Storage Account (top-level resource)
  -- Container (like a root folder - must exist before uploading)
    -- Blob (the actual file)

Important: “Folders” in Blob Storage are virtual. A blob named data/customer/file1.csv appears as nested folders, but there are no real directories. The / in the name creates a visual hierarchy. Exception: ADLS Gen2 (hierarchical namespace) has real directories.
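
The virtual hierarchy is easy to reproduce locally. A pure-Python sketch of how first-level "folders" fall out of the `/` characters in flat blob names (server-side, `ContainerClient.walk_blobs` does essentially this):

```python
def virtual_folders(blob_names, prefix=""):
    """First-level virtual 'folders' under a prefix, derived purely from
    the / characters in flat blob names -- no real directories involved."""
    folders = set()
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if "/" in rest:
            folders.add(prefix + rest.split("/", 1)[0] + "/")
    return sorted(folders)

# virtual_folders(["data/customer/f1.csv", "readme.txt"]) -> ["data/"]
```

Note that a "folder" disappears the moment its last blob is deleted, because the folder was never a resource to begin with.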

Types of Blobs

  • Block Blobs — most common, used for data files, images, documents. Up to 190.7 TB.
  • Append Blobs — optimized for append operations. Perfect for log files.
  • Page Blobs — for random read/write. Used for VM disks. Rarely used in data engineering.
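
For the log-file case, here is a minimal sketch (assumes `azure-storage-blob` is installed; the `container_client` is an SDK `ContainerClient`, and the blob/event names are placeholders):

```python
import datetime

def format_log_line(event: str) -> bytes:
    """One timestamped line per event; append blobs only add to the end."""
    ts = datetime.datetime.now(datetime.timezone.utc).isoformat()
    return f"{ts} {event}\n".encode()

def append_log(container_client, blob_name: str, event: str) -> None:
    """Append a line to a log blob, creating the append blob on first use."""
    blob = container_client.get_blob_client(blob_name)
    if not blob.exists():
        blob.create_append_blob()
    blob.append_block(format_log_line(event))
```

A block blob would require rewriting (or re-staging) content to add a line; an append blob makes each write a cheap `append_block` call.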

Access Tiers

Tier      Storage Cost   Access Cost                    Use Case
Hot       Highest        Lowest                         Current month's pipeline output
Cool      ~40% less      Higher                         30+ days old, occasional access
Cold      ~55% less      Even higher                    90+ days old, rare access
Archive   ~75% less      Highest (hours to rehydrate)   Compliance, long-term retention

Archive blobs are offline. Rehydration takes up to 15 hours (standard) or under 1 hour (high priority). Never archive data you need quickly.
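
The thresholds in the table map naturally to a tier-selection helper. A local sketch (the day thresholds are illustrative, not Azure-mandated; the actual tier change is a single SDK call, `blob_client.set_standard_blob_tier("Cool")`):

```python
def choose_tier(days_since_modified: int) -> str:
    """Map blob age to a target access tier, mirroring the table above.
    Thresholds are illustrative; tune them to your access patterns."""
    if days_since_modified >= 90:
        # Rare access. Only go to "Archive" if hours-long rehydration
        # is acceptable -- archived blobs are offline.
        return "Cold"
    if days_since_modified >= 30:
        return "Cool"
    return "Hot"
```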

Authentication and Access Control

Method                When to Use
Storage Account Key   Quick setup. Don't use in production.
SAS Token             Temporary, scoped access. Good for sharing.
Azure AD (Entra ID)   Production standard. RBAC.
Managed Identity      For Azure services (ADF, Synapse). Recommended.
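
The Entra ID path looks like this in code. A sketch (assumes `azure-identity` and `azure-storage-blob` are installed; the account name is a placeholder):

```python
def account_url(account_name: str) -> str:
    """Blob service endpoint for a storage account (public Azure cloud)."""
    return f"https://{account_name}.blob.core.windows.net"

def blob_service_with_entra_id(account_name: str):
    """Authenticate with Entra ID instead of account keys.
    DefaultAzureCredential tries managed identity, environment
    variables, and az login credentials, in order."""
    # Lazy imports: require azure-identity and azure-storage-blob.
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import BlobServiceClient
    return BlobServiceClient(account_url(account_name),
                             credential=DefaultAzureCredential())
```

The same code then works unchanged on a developer laptop (via `az login`) and inside an Azure service (via managed identity), which is why this is the production standard.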

Key RBAC Roles

Role                            Allows
Storage Blob Data Reader        Read blobs
Storage Blob Data Contributor   Read + write + delete blobs
Storage Blob Data Owner         Full control
Storage Account Contributor     Manage the account, NOT blob data

Common mistake: Assigning Storage Account Contributor for data access. It doesn’t grant blob data access — you need Storage Blob Data Contributor.

Accessing Blob Storage

Python SDK

import os

from azure.storage.blob import BlobServiceClient

# Connection string from the portal (Access keys blade); here read from an
# environment variable rather than hardcoded.
conn_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
blob_service = BlobServiceClient.from_connection_string(conn_str)
container = blob_service.get_container_client("my-container")

# Upload (raises ResourceExistsError if the blob exists; pass overwrite=True to replace)
with open("data.csv", "rb") as data:
    container.upload_blob("path/to/data.csv", data)

# Download
blob = container.download_blob("path/to/data.csv")
content = blob.readall()

# List blobs
for blob in container.list_blobs(name_starts_with="path/"):
    print(blob.name, blob.size)

AzCopy

azcopy copy "local/file.csv" "https://account.blob.core.windows.net/container/file.csv"
azcopy sync "local/dir/" "https://account.blob.core.windows.net/container/dir/" --recursive

Blob Storage vs ADLS Gen2

ADLS Gen2 IS Blob Storage with hierarchical namespace enabled.

Feature               Blob Storage              ADLS Gen2
Namespace             Flat (virtual folders)    Hierarchical (real directories)
Rename folder         Copy + delete each blob   Atomic single operation
POSIX permissions     Not supported             Supported (ACLs)
Analytics optimized   No                        Yes
Cost                  Same                      Same

For data engineering, always use ADLS Gen2. Same cost, better performance.
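
The rename row is worth quantifying. A pure-Python sketch of the operation counts (the real atomic rename on ADLS Gen2 is `DataLakeDirectoryClient.rename_directory` in the `azure-storage-file-datalake` SDK):

```python
def rename_ops(blob_names, old_prefix):
    """Operations needed to 'rename a folder': a flat namespace must
    copy + delete every blob under the prefix; ADLS Gen2 does one
    atomic directory rename regardless of file count."""
    affected = [n for n in blob_names if n.startswith(old_prefix)]
    return {"flat": 2 * len(affected), "hns": 1}
```

With a million files under a prefix, that is two million operations versus one, which is why Spark and other analytics engines that rename output directories run much faster on ADLS Gen2.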

Lifecycle Management

Automate tier transitions:

{
  "rules": [{
    "name": "archive-old-data",
    "enabled": true,
    "type": "Lifecycle",
    "definition": {
      "filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["data/"]},
      "actions": {
        "baseBlob": {
          "tierToCool": {"daysAfterModificationGreaterThan": 30},
          "tierToArchive": {"daysAfterModificationGreaterThan": 365},
          "delete": {"daysAfterModificationGreaterThan": 730}
        }
      }
    }
  }]
}

Set up in Storage Account > Data management > Lifecycle management.
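
The rule's semantics can be sanity-checked with a tiny local evaluator (a sketch of how the thresholds compose, not how the service actually evaluates policies):

```python
from typing import Optional

def lifecycle_action(days_since_modified: int) -> Optional[str]:
    """Which action the policy above would take for a blob under data/.
    The most aggressive matching action wins."""
    if days_since_modified > 730:
        return "delete"
    if days_since_modified > 365:
        return "tierToArchive"
    if days_since_modified > 30:
        return "tierToCool"
    return None  # stays in its current tier
```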

Cost Optimization

  1. Right access tier — 1 TB from Hot to Cool saves ~$8/month
  2. Use Parquet — 5-10x smaller = 5-10x cheaper storage
  3. Lifecycle policies — automate tier transitions
  4. LRS vs GRS — LRS is 2x cheaper. Use for non-critical data.
  5. Delete old outputs — daily full loads create redundant data
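
Point 1 is easy to sanity-check. A sketch using illustrative list prices (assumed, not quoted; check your region's actual pricing):

```python
def monthly_savings_usd(gb: float, hot_per_gb: float = 0.018,
                        cool_per_gb: float = 0.010) -> float:
    """Monthly saving from moving data Hot -> Cool.
    Per-GB prices are illustrative assumptions; real rates vary by
    region and redundancy option, and Cool adds access charges."""
    return gb * (hot_per_gb - cool_per_gb)

# monthly_savings_usd(1024)  -> roughly $8/month for 1 TB
```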

Interview Questions

Q: What is Azure Blob Storage? A: Object storage for unstructured data. Stores blobs in containers within storage accounts. Massively scalable, durable, accessible via HTTP.

Q: Blob Storage vs ADLS Gen2? A: ADLS Gen2 is Blob Storage with hierarchical namespace. Adds real directories, POSIX ACLs, and analytics optimizations. Same cost.

Q: What are access tiers? A: Hot (frequent, expensive storage), Cool (infrequent), Cold (rare), Archive (offline, cheapest, hours to rehydrate).

Q: Why can’t ADF create containers? A: Containers are storage account resources. ADF creates folders (blob name prefixes) but containers must exist before pipeline runs.
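
Because containers must exist before the pipeline runs, it is common to create and validate them up front in deployment code. A sketch (assumes `azure-storage-blob`/`azure-core` are installed; `ensure_container` and the names are illustrative):

```python
import re

def is_valid_container_name(name: str) -> bool:
    """Container names: 3-63 chars, lowercase letters/digits/hyphens,
    must start and end with a letter or digit, no consecutive hyphens."""
    return bool(re.fullmatch(r"(?!.*--)[a-z0-9][a-z0-9-]{1,61}[a-z0-9]", name))

def ensure_container(blob_service, name: str) -> None:
    """Create the container if it doesn't exist (e.g. before an ADF run)."""
    if not is_valid_container_name(name):
        raise ValueError(f"invalid container name: {name}")
    # Lazy import: needs azure-core (pulled in by azure-storage-blob).
    from azure.core.exceptions import ResourceExistsError
    try:
        blob_service.create_container(name)
    except ResourceExistsError:
        pass  # already there -- fine
```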

Wrapping Up

Blob Storage is the bedrock of Azure data engineering. Master the hierarchy (Account > Container > Blob), access tiers, RBAC roles, and the distinction from ADLS Gen2.

Related posts: ADLS Gen2 Complete Guide · Parquet vs CSV vs JSON · What is Azure Data Factory?


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
