The Medallion Architecture in Azure: Bronze, Silver, and Gold Layers Explained with Real Pipelines

The Medallion Architecture in Azure: Bronze, Silver, and Gold Layers Explained with Real Pipelines

Every data platform tutorial mentions “Bronze, Silver, Gold” like it is obvious. But when you sit down to build one, the questions start: What exactly goes in each layer? What format should Bronze use? When does Silver become Gold? Who reads from which layer? How do the layers connect to the pipelines we have built?

The Medallion Architecture is NOT a product you install. It is a design pattern — a way of organizing your data lake into three layers with clear rules about what data looks like at each stage. Think of it like a water purification plant.

Raw water from the river (Bronze) is collected as-is — muddy, unfiltered, full of particles. The filtration stage (Silver) removes contaminants, standardizes pH levels, and tests for quality. The final treated water (Gold) is clean, safe, and ready for consumption. You never serve river water directly to homes. You always purify it first.

That is exactly what the Medallion Architecture does with data.

Table of Contents

  • What Is the Medallion Architecture?
  • The Water Purification Analogy
  • Bronze Layer — Raw, Untouched Data
  • Silver Layer — Cleaned, Validated, Standardized
  • Gold Layer — Business-Ready, Aggregated, Modeled
  • Format Selection by Layer
  • How Our Blog Pipelines Map to Medallion
  • Building the Medallion Architecture in Azure
  • Bronze to Silver: Transformation Patterns
  • Silver to Gold: Aggregation and Modeling
  • Ownership and Access by Layer
  • Medallion in Databricks vs Synapse
  • Governance and Data Quality Per Layer
  • Cost Optimization Per Layer
  • Common Anti-Patterns
  • Interview Questions
  • Wrapping Up

What Is the Medallion Architecture?

SOURCE SYSTEMS                 DATA LAKE (ADLS Gen2)                    CONSUMERS
┌──────────────┐    ┌─────────────────────────────────────────────┐    ┌──────────────┐
│              │    │                                             │    │              │
│ SQL Database │──>│  BRONZE         SILVER         GOLD         │──>│ Power BI     │
│ APIs         │──>│  (Raw)    ──>   (Clean)   ──>  (Business)   │──>│ Analysts     │
│ CSV files    │──>│                                             │──>│ ML Models    │
│ Kafka        │──>│  "Land it"      "Clean it"     "Serve it"   │──>│ Applications │
│              │    │                                             │    │              │
└──────────────┘    └─────────────────────────────────────────────┘    └──────────────┘

Three layers, three purposes, three quality levels.

The Water Purification Analogy

Layer Water Plant Data Platform
Bronze Raw river water collected in a holding tank. Muddy, untested. Keep as-is for reference. Raw data landed as-is from sources. No transformations. CSV, JSON, Avro — whatever the source sends.
Silver Filtered water. Sediment removed, chemicals balanced, bacteria killed. Tested for quality. Cleaned, deduplicated, typed, validated data. NULLs handled, schemas enforced, bad records quarantined.
Gold Bottled water. Labeled, packaged, ready for specific consumers. Different products for different needs. Business-ready datasets. Star schema dimensions, aggregated facts, KPIs, ML feature tables. Tailored for specific consumers.

The key rule: Each layer only reads from the layer before it. Gold reads from Silver, never from Bronze. Silver reads from Bronze, never from source systems directly. This creates a clear lineage and makes each layer independently testable.

Bronze Layer — Raw, Untouched Data

Purpose

Capture raw data EXACTLY as the source provides it. No transformations, no cleaning, no type conversions. Bronze is your safety net — if anything goes wrong downstream, you can always reprocess from Bronze.

What Goes In

bronze/
  sales/
    2026/05/18/
      sales_raw.csv                 ← Exact file from source
  customers/
    2026/05/18/
      customers_raw.json            ← API response as-is
  events/
    2026/05/18/14/30/
      event_stream.avro             ← Kafka capture as-is
  erp_export/
    2026/05/18/
      GL_ACCOUNTS.csv               ← SAP export as-is

Rules

  • Never transform — land data exactly as received
  • Never delete — keep historical loads (date-partitioned folders)
  • Any format — CSV, JSON, Avro, Parquet — whatever the source sends
  • Append-only — each load adds a new folder, never overwrites
  • Schema-on-read — do not enforce schema at this layer

In Our Pipeline

This is what our metadata-driven pipeline does — Copy activity reads from SQL and writes Parquet to ADLS. The destination folders (sqldb/Customer/2026/05/18/) are the Bronze layer.

Who Reads Bronze

  • Silver layer transformation notebooks/pipelines
  • Data engineers debugging source data issues
  • Nobody else. Analysts and BI tools should NEVER read from Bronze.

Real-life analogy: Bronze is the warehouse receiving dock. Packages (data) arrive from suppliers (sources). They are placed on shelves exactly as they came — unopened, unchecked. The dock worker’s job is to receive and store, NOT to open or inspect.

Silver Layer — Cleaned, Validated, Standardized

Purpose

Clean, standardize, validate, and deduplicate Bronze data. Silver is your trusted, reusable data — multiple Gold layers can read from the same Silver table.

What Happens Here

Transformation Example
Null handling Replace NULL city with ‘Unknown’
Data type enforcement Cast “price” from STRING to DOUBLE
Deduplication Window function to keep latest record per key
Standardization trim(), initCap(), fix invalid emails
Schema enforcement Reject rows that do not match expected schema
Data quality checks Validate email format, check referential integrity
Joins Enrich customer with address (denormalize)
SCD Type 1 Overwrite corrections (email fixes)
Filtering Remove test data, internal accounts

What It Looks Like

silver/
  customers/
    _delta_log/
    part-00000.snappy.parquet         ← Cleaned, deduplicated, Delta format
  products/
    _delta_log/
    part-00000.snappy.parquet
  orders/
    _delta_log/
    year=2026/
      month=05/
        part-00000.snappy.parquet     ← Partitioned by date

Rules

  • Always Delta format — ACID transactions, schema enforcement, MERGE
  • Deduplicated — one row per entity (or one active row for SCD)
  • Typed — all columns have correct data types
  • Validated — bad data quarantined or defaulted
  • Reusable — multiple Gold tables read from the same Silver table

In Our Pipelines

This is what our Project 4 data quality pipeline does — HandleNulls, StandardizeData, Deduplicate, then write to the dimension table. And our PySpark transformations cookbook shows exactly these operations in code.

Who Reads Silver

  • Gold layer transformation notebooks
  • Data scientists for feature engineering
  • Advanced analysts who understand the data model

Real-life analogy: Silver is the kitchen prep station. Raw ingredients from the warehouse (Bronze) are washed, peeled, chopped, and portioned. The prep cook ensures everything is clean and measured correctly. Multiple chefs (Gold layers) use the same prepped ingredients for different dishes.

Gold Layer — Business-Ready, Aggregated, Modeled

Purpose

Serve business-specific, consumption-ready datasets. Gold tables are designed for specific consumers — a Power BI dashboard, a machine learning model, a regulatory report.

What Happens Here

Transformation Example
Star schema modeling Create fact_orders + dim_customer + dim_product
SCD Type 2 Historical dimension tables with versioning
Aggregations Daily/weekly/monthly revenue rollups
KPIs Customer lifetime value, churn rate, retention
Feature tables ML features: avg_order_value, days_since_last_order
Denormalization Pre-join dimensions into fact for faster BI queries
Business logic Calculate margins, apply tax rules, categorize customers

What It Looks Like

gold/
  dim_customer/           ← SCD Type 2 dimension (our SCD2 pipelines)
    _delta_log/
    part-00000.snappy.parquet
  dim_product/            ← Product dimension
  fact_orders/            ← Fact table (grain: one row per order line)
    _delta_log/
    year=2026/
      month=05/
  agg_daily_revenue/      ← Pre-aggregated daily revenue
  ml_features/            ← Feature table for ML models

Rules

  • Always Delta format — ACID, time travel, MERGE for SCD
  • Consumer-specific — each Gold table is designed for a specific use case
  • Business logic applied — calculations, categorizations, classifications
  • Optimized for reading — partitioned, Z-ORDERED, OPTIMIZED
  • Documented — clear definitions of every column and business rule

In Our Pipelines

This is where our SCD Type 2 tables live — dim_customer with start_date, end_date, is_active. And our DB vs DW post explains why Gold uses star schema.

Who Reads Gold

  • Power BI dashboards (DirectQuery or Import)
  • Business analysts (SQL queries)
  • ML models (feature tables)
  • APIs and applications
  • Regulatory reports

Real-life analogy: Gold is the served dish at a restaurant. The ingredients were received (Bronze), prepped (Silver), and now cooked and plated specifically for the customer’s order (Gold). A steak dinner and a Caesar salad use the same prepped lettuce (Silver) but are completely different dishes (Gold) for different customers.

Format Selection by Layer

Layer Format Why
Bronze Whatever source sends (CSV, JSON, Avro, Parquet) No transformation, preserve original
Silver Delta Lake ACID, schema enforcement, MERGE for dedup/SCD
Gold Delta Lake ACID, time travel, OPTIMIZE, Z-ORDER for fast queries

Why not Parquet for Silver/Gold? Because Parquet does not support UPDATE, DELETE, MERGE, schema enforcement, or ACID transactions. Delta adds all of these while using Parquet underneath.

How Our Blog Pipelines Map to Medallion

Our Pipeline Layer What It Does
Metadata-Driven Pipeline Source → Bronze Copy tables from SQL to ADLS as Parquet
Incremental Loading Source → Bronze Copy only new rows using watermark
Unified Full+Incremental Source → Bronze Both load types in one pipeline
Project 4: Data Quality Bronze → Silver Null handling, standardization, dedup
PySpark Transformations Bronze → Silver Same cleaning in code
SCD Type 1 Hash Silver → Gold Overwrite dimension changes
SCD Type 2 Pipeline Silver → Gold Historical dimension tracking
SCD PySpark MERGE Silver → Gold Same in Databricks

Every pipeline we built fits neatly into the Medallion pattern.

Building the Medallion Architecture in Azure

Folder Structure in ADLS Gen2

Container: datalake/
  bronze/
    sales/
      YYYY/MM/DD/        ← Date-partitioned raw data
    customers/
    products/
    events/

  silver/
    customers/           ← Delta table (cleaned, deduplicated)
    products/
    orders/

  gold/
    dim_customer/        ← SCD Type 2 dimension
    dim_product/
    fact_orders/
    agg_daily_revenue/
    ml_customer_features/

  quarantine/            ← Bad records from Silver validation
    customers/
    orders/

The Pipeline Chain

# Bronze → Silver notebook
df_bronze = spark.read.parquet("abfss://.../bronze/customers/2026/05/18/")
df_silver = df_bronze     .withColumn("name", initcap(trim(col("name"))))     .withColumn("email", lower(trim(col("email"))))     .fillna({"city": "Unknown"})     .dropDuplicates(["customer_id"])
df_silver.write.format("delta").mode("overwrite").save("abfss://.../silver/customers/")

# Silver → Gold notebook
df_silver = spark.read.format("delta").load("abfss://.../silver/customers/")
# Apply SCD Type 2 MERGE to Gold dim_customer
target = DeltaTable.forName(spark, "gold.dim_customer")
target.alias("t").merge(df_silver.alias("s"), ...).execute()

Ownership and Access by Layer

Layer Owner Write Access Read Access
Bronze Data Engineers ADF/Synapse pipelines Data Engineers only
Silver Data Engineers Databricks notebooks, Data Flows Data Engineers, Data Scientists
Gold Data Engineers + Analytics Databricks notebooks Everyone — Analysts, BI, ML, APIs
Quarantine Data Engineers Silver pipeline (rejects) Data Engineers (to investigate)

Analysts should never read from Bronze. If they do, they are working with messy, unvalidated data. Gold is their layer.

Medallion in Databricks vs Synapse

Aspect Databricks Synapse
Bronze → Silver PySpark notebooks Data Flows or Spark notebooks
Silver → Gold Delta MERGE in notebooks Data Flows with SCD transformations
Orchestration Databricks Workflows or ADF Synapse Pipelines
Delta support Native (default format) Supported in Spark pools
Governance Unity Catalog Purview + Synapse roles
Best for Code-first, ML-heavy Visual, SQL-heavy

Many enterprises use BOTH: Synapse for ingestion (Bronze) and SQL workloads, Databricks for transformation (Silver/Gold) and ML.

Governance and Data Quality Per Layer

Layer Schema Enforcement Data Quality Lineage
Bronze None (schema-on-read) None — accept everything Source system → Bronze timestamp
Silver Strict (Delta schema enforcement) Null checks, type validation, referential integrity Bronze path → Silver table
Gold Strict (Delta + business rules) Business rule validation, KPI accuracy Silver table → Gold table

Quarantine Pattern

# During Silver processing — separate good and bad records
df_good = df_bronze.filter(
    col("email").rlike("^.+@.+\..+$") & col("customer_id").isNotNull()
)
df_bad = df_bronze.filter(
    ~col("email").rlike("^.+@.+\..+$") | col("customer_id").isNull()
)

df_good.write.format("delta").mode("append").save(".../silver/customers/")
df_bad.write.format("delta").mode("append").save(".../quarantine/customers/")
print(f"Good: {df_good.count()}, Quarantined: {df_bad.count()}")

Cost Optimization Per Layer

Layer Storage Strategy Cost Tip
Bronze Keep raw for 90 days, archive to Cool/Archive tier Move to Cool after 30 days, Archive after 90
Silver Delta with OPTIMIZE + VACUUM VACUUM after 7 days, OPTIMIZE weekly
Gold Delta with Z-ORDER on query columns OPTIMIZE + Z-ORDER for BI query speed

Common Anti-Patterns

  1. Reading Bronze directly in Power BI — messy data, no types, no validation. Always read from Gold.

  2. Skipping Silver entirely — going straight from Bronze to Gold means Gold has no clean, reusable intermediate layer. When a second Gold table needs the same data, you duplicate the cleaning logic.

  3. Putting business logic in Silver — Silver should be source-aligned (cleaned version of the source). Business rules (margins, KPIs, categorizations) belong in Gold.

  4. Overwriting Bronze — Bronze should be append-only. If you overwrite, you lose the ability to reprocess historical data.

  5. No quarantine layer — bad records silently dropped or corrupting Silver. Always route bad data to a quarantine folder for investigation.

  6. Using CSV in Silver/Gold — no ACID, no MERGE, no schema enforcement. Always use Delta.

Interview Questions

Q: What is the Medallion Architecture? A: A data lake design pattern with three layers: Bronze (raw, untouched data from sources), Silver (cleaned, validated, standardized data), and Gold (business-ready, aggregated, modeled data). Each layer reads from the previous one, creating clear data lineage and quality progression.

Q: What format should each layer use? A: Bronze uses whatever the source provides (CSV, JSON, Avro). Silver and Gold use Delta Lake for ACID transactions, schema enforcement, MERGE, and time travel. Delta is Parquet underneath with a transaction log on top.

Q: What is the difference between Silver and Gold? A: Silver is source-aligned — a cleaned version of the raw data. Gold is consumer-aligned — designed for specific business use cases. Silver has one customers table. Gold might have dim_customer (SCD2), agg_customer_revenue (aggregation), and ml_customer_features (ML) — all derived from the same Silver customers table.

Q: Who should read from which layer? A: Data engineers read/write all layers. Data scientists read Silver and Gold. Business analysts and Power BI read Gold only. Nobody except data engineers should read Bronze.

Q: How does the quarantine pattern work? A: During Bronze-to-Silver transformation, records that fail validation (null IDs, invalid emails, type mismatches) are routed to a quarantine folder instead of being silently dropped. Data engineers investigate quarantined records and fix source issues.

Wrapping Up

The Medallion Architecture is not complicated — it is disciplined. Bronze captures everything. Silver cleans everything. Gold serves everything. Three layers, three purposes, three quality levels.

Every pipeline on this blog fits into this pattern. The metadata-driven pipeline fills Bronze. The data quality transformations produce Silver. The SCD dimension tables are Gold. You have been building a Medallion Architecture all along — now you know the name for it.

Related posts:Data File Formats (CSV, Parquet, Delta)Project 4: Data Quality + SCDSCD Type 1 and 2 in PySparkDelta Lake OptimizationMetadata-Driven Pipeline


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top
Share via
Copy link