The Medallion Architecture in Azure: Bronze, Silver, and Gold Layers Explained with Real Pipelines

Every data platform tutorial mentions “Bronze, Silver, Gold” like it is obvious. But when you sit down to build one, the questions start: What exactly goes in each layer? What format should Bronze use? When does Silver become Gold? Who reads from which layer? How do the layers connect to the pipelines we have built?

The Medallion Architecture is NOT a product you install. It is a design pattern — a way of organizing your data lake into three layers with clear rules about what data looks like at each stage. Think of it like a water purification plant.

Raw water from the river (Bronze) is collected as-is — muddy, unfiltered, full of particles. The filtration stage (Silver) removes contaminants, standardizes pH levels, and tests for quality. The final treated water (Gold) is clean, safe, and ready for consumption. You never serve river water directly to homes. You always purify it first.

That is exactly what the Medallion Architecture does with data.

What Is the Medallion Architecture?
The Water Purification Analogy
Bronze Layer — Raw, Untouched Data
Silver Layer — Cleaned, Validated, Standardized
Gold Layer — Business-Ready, Aggregated, Modeled
Format Selection by Layer
How Our Blog Pipelines Map to Medallion
Building the Medallion Architecture in Azure
Bronze to Silver: Transformation Patterns
Silver to Gold: Aggregation and Modeling
Ownership and Access by Layer
Medallion in Databricks vs Synapse
Governance and Data Quality Per Layer
Cost Optimization Per Layer
Common Anti-Patterns
Interview Questions
Wrapping Up

What Is the Medallion Architecture?

SOURCE SYSTEMS                 DATA LAKE (ADLS Gen2)                    CONSUMERS
┌──────────────┐    ┌─────────────────────────────────────────────┐    ┌──────────────┐
│              │    │                                             │    │              │
│ SQL Database │──>│  BRONZE         SILVER         GOLD         │──>│ Power BI     │
│ APIs         │──>│  (Raw)    ──>   (Clean)   ──>  (Business)   │──>│ Analysts     │
│ CSV files    │──>│                                             │──>│ ML Models    │
│ Kafka        │──>│  "Land it"      "Clean it"     "Serve it"   │──>│ Applications │
│              │    │                                             │    │              │
└──────────────┘    └─────────────────────────────────────────────┘    └──────────────┘

Three layers, three purposes, three quality levels.

The Water Purification Analogy

Layer	Water Plant	Data Platform
Bronze	Raw river water collected in a holding tank. Muddy, untested. Keep as-is for reference.	Raw data landed as-is from sources. No transformations. CSV, JSON, Avro — whatever the source sends.
Silver	Filtered water. Sediment removed, chemicals balanced, bacteria killed. Tested for quality.	Cleaned, deduplicated, typed, validated data. NULLs handled, schemas enforced, bad records quarantined.
Gold	Bottled water. Labeled, packaged, ready for specific consumers. Different products for different needs.	Business-ready datasets. Star schema dimensions, aggregated facts, KPIs, ML feature tables. Tailored for specific consumers.

The key rule: Each layer only reads from the layer before it. Gold reads from Silver, never from Bronze. Silver reads from Bronze, never from source systems directly. This creates a clear lineage and makes each layer independently testable.

Bronze Layer — Raw, Untouched Data

Purpose

Capture raw data EXACTLY as the source provides it. No transformations, no cleaning, no type conversions. Bronze is your safety net — if anything goes wrong downstream, you can always reprocess from Bronze.

What Goes In

bronze/
  sales/
    2026/05/18/
      sales_raw.csv                 ← Exact file from source
  customers/
    2026/05/18/
      customers_raw.json            ← API response as-is
  events/
    2026/05/18/14/30/
      event_stream.avro             ← Kafka capture as-is
  erp_export/
    2026/05/18/
      GL_ACCOUNTS.csv               ← SAP export as-is

Rules

Never transform — land data exactly as received
Never delete — keep historical loads (date-partitioned folders)
Any format — CSV, JSON, Avro, Parquet — whatever the source sends
Append-only — each load adds a new folder, never overwrites
Schema-on-read — do not enforce schema at this layer

In Our Pipeline

This is what our metadata-driven pipeline does — Copy activity reads from SQL and writes Parquet to ADLS. The destination folders (sqldb/Customer/2026/05/18/) are the Bronze layer.

Who Reads Bronze

Silver layer transformation notebooks/pipelines
Data engineers debugging source data issues
Nobody else. Analysts and BI tools should NEVER read from Bronze.

Real-life analogy: Bronze is the warehouse receiving dock. Packages (data) arrive from suppliers (sources). They are placed on shelves exactly as they came — unopened, unchecked. The dock worker’s job is to receive and store, NOT to open or inspect.

Silver Layer — Cleaned, Validated, Standardized

Purpose

Clean, standardize, validate, and deduplicate Bronze data. Silver is your trusted, reusable data — multiple Gold layers can read from the same Silver table.

What Happens Here

Transformation	Example
Null handling	Replace NULL city with ‘Unknown’
Data type enforcement	Cast “price” from STRING to DOUBLE
Deduplication	Window function to keep latest record per key
Standardization	`trim()`, `initCap()`, fix invalid emails
Schema enforcement	Reject rows that do not match expected schema
Data quality checks	Validate email format, check referential integrity
Joins	Enrich customer with address (denormalize)
SCD Type 1	Overwrite corrections (email fixes)
Filtering	Remove test data, internal accounts

What It Looks Like

silver/
  customers/
    _delta_log/
    part-00000.snappy.parquet         ← Cleaned, deduplicated, Delta format
  products/
    _delta_log/
    part-00000.snappy.parquet
  orders/
    _delta_log/
    year=2026/
      month=05/
        part-00000.snappy.parquet     ← Partitioned by date

Rules

Always Delta format — ACID transactions, schema enforcement, MERGE
Deduplicated — one row per entity (or one active row for SCD)
Typed — all columns have correct data types
Validated — bad data quarantined or defaulted
Reusable — multiple Gold tables read from the same Silver table

In Our Pipelines

This is what our Project 4 data quality pipeline does — HandleNulls, StandardizeData, Deduplicate, then write to the dimension table. And our PySpark transformations cookbook shows exactly these operations in code.

Who Reads Silver

Gold layer transformation notebooks
Data scientists for feature engineering
Advanced analysts who understand the data model

Real-life analogy: Silver is the kitchen prep station. Raw ingredients from the warehouse (Bronze) are washed, peeled, chopped, and portioned. The prep cook ensures everything is clean and measured correctly. Multiple chefs (Gold layers) use the same prepped ingredients for different dishes.

Gold Layer — Business-Ready, Aggregated, Modeled

Purpose

Serve business-specific, consumption-ready datasets. Gold tables are designed for specific consumers — a Power BI dashboard, a machine learning model, a regulatory report.

What Happens Here

Transformation	Example
Star schema modeling	Create fact_orders + dim_customer + dim_product
SCD Type 2	Historical dimension tables with versioning
Aggregations	Daily/weekly/monthly revenue rollups
KPIs	Customer lifetime value, churn rate, retention
Feature tables	ML features: avg_order_value, days_since_last_order
Denormalization	Pre-join dimensions into fact for faster BI queries
Business logic	Calculate margins, apply tax rules, categorize customers

What It Looks Like

gold/
  dim_customer/           ← SCD Type 2 dimension (our SCD2 pipelines)
    _delta_log/
    part-00000.snappy.parquet
  dim_product/            ← Product dimension
  fact_orders/            ← Fact table (grain: one row per order line)
    _delta_log/
    year=2026/
      month=05/
  agg_daily_revenue/      ← Pre-aggregated daily revenue
  ml_features/            ← Feature table for ML models

Rules

Always Delta format — ACID, time travel, MERGE for SCD
Consumer-specific — each Gold table is designed for a specific use case
Business logic applied — calculations, categorizations, classifications
Optimized for reading — partitioned, Z-ORDERED, OPTIMIZED
Documented — clear definitions of every column and business rule

In Our Pipelines

This is where our SCD Type 2 tables live — dim_customer with start_date, end_date, is_active. And our DB vs DW post explains why Gold uses star schema.

Who Reads Gold

Power BI dashboards (DirectQuery or Import)
Business analysts (SQL queries)
ML models (feature tables)
APIs and applications
Regulatory reports

Real-life analogy: Gold is the served dish at a restaurant. The ingredients were received (Bronze), prepped (Silver), and now cooked and plated specifically for the customer’s order (Gold). A steak dinner and a Caesar salad use the same prepped lettuce (Silver) but are completely different dishes (Gold) for different customers.

Format Selection by Layer

Layer	Format	Why
Bronze	Whatever source sends (CSV, JSON, Avro, Parquet)	No transformation, preserve original
Silver	Delta Lake	ACID, schema enforcement, MERGE for dedup/SCD
Gold	Delta Lake	ACID, time travel, OPTIMIZE, Z-ORDER for fast queries

Why not Parquet for Silver/Gold? Because Parquet does not support UPDATE, DELETE, MERGE, schema enforcement, or ACID transactions. Delta adds all of these while using Parquet underneath.

How Our Blog Pipelines Map to Medallion

Our Pipeline	Layer	What It Does
Metadata-Driven Pipeline	Source → Bronze	Copy tables from SQL to ADLS as Parquet
Incremental Loading	Source → Bronze	Copy only new rows using watermark
Unified Full+Incremental	Source → Bronze	Both load types in one pipeline
Project 4: Data Quality	Bronze → Silver	Null handling, standardization, dedup
PySpark Transformations	Bronze → Silver	Same cleaning in code
SCD Type 1 Hash	Silver → Gold	Overwrite dimension changes
SCD Type 2 Pipeline	Silver → Gold	Historical dimension tracking
SCD PySpark MERGE	Silver → Gold	Same in Databricks

Every pipeline we built fits neatly into the Medallion pattern.

Building the Medallion Architecture in Azure

Folder Structure in ADLS Gen2

Container: datalake/
  bronze/
    sales/
      YYYY/MM/DD/        ← Date-partitioned raw data
    customers/
    products/
    events/

  silver/
    customers/           ← Delta table (cleaned, deduplicated)
    products/
    orders/

  gold/
    dim_customer/        ← SCD Type 2 dimension
    dim_product/
    fact_orders/
    agg_daily_revenue/
    ml_customer_features/

  quarantine/            ← Bad records from Silver validation
    customers/
    orders/

Bronze to Silver: Transformation Patterns

The Bronze→Silver transformation focuses on cleaning and standardizing — making raw data trustworthy:

# Bronze → Silver notebook: Clean customers
df_bronze = spark.read.parquet("abfss://.../bronze/customers/2026/05/18/")

# Standard Silver cleaning pattern
df_silver = df_bronze \
    .withColumn("name", initcap(trim(col("name"))))          # Standardize casing
    .withColumn("email", lower(trim(col("email"))))           # Lowercase emails
    .withColumn("city", trim(col("city")))                    # Remove whitespace
    .withColumn("salary", col("salary").cast("double"))       # Enforce types
    .fillna({"city": "Unknown", "country": "Unknown"})        # Handle NULLs
    .filter(col("customer_id").isNotNull())                   # Remove invalid rows
    .dropDuplicates(["customer_id"])                          # Deduplicate

# Write as Delta (Silver is always Delta)
df_silver.write.format("delta").mode("overwrite").save("abfss://.../silver/customers/")

Common Bronze→Silver patterns: null handling (fillna, coalesce), type casting, trimming and standardizing strings, deduplication with dropDuplicates or window functions, filtering out test/invalid records, and quarantining rows that fail validation.

Silver to Gold: Aggregation and Modeling

The Silver→Gold transformation focuses on business logic — turning clean data into consumer-ready datasets:

# Silver → Gold notebook: Build star schema and aggregations
df_silver = spark.read.format("delta").load("abfss://.../silver/customers/")

# Gold Pattern 1: SCD Type 2 dimension via Delta MERGE
from delta.tables import DeltaTable
target = DeltaTable.forName(spark, "gold.dim_customer")
target.alias("t").merge(
    df_silver.alias("s"),
    "t.customer_id = s.customer_id AND t.is_active = true"
).whenMatchedUpdate(...).whenNotMatchedInsert(...).execute()

# Gold Pattern 2: Aggregated summary table
df_orders = spark.read.format("delta").load("abfss://.../silver/orders/")
df_daily_rev = df_orders \
    .groupBy("order_date", "region") \
    .agg(
        sum("amount").alias("total_revenue"),
        countDistinct("customer_id").alias("unique_customers"),
        count("order_id").alias("total_orders")
    )
df_daily_rev.write.format("delta").mode("overwrite").saveAsTable("gold.agg_daily_revenue")

# Gold Pattern 3: ML feature table
df_features = df_orders.join(df_silver, "customer_id") \
    .groupBy("customer_id") \
    .agg(
        count("order_id").alias("lifetime_orders"),
        sum("amount").alias("lifetime_value"),
        avg("amount").alias("avg_order_value"),
        datediff(current_date(), max("order_date")).alias("days_since_last_order")
    )
df_features.write.format("delta").mode("overwrite").saveAsTable("gold.ml_customer_features")

Common Silver→Gold patterns: star schema dimensions (SCD Type 1 and 2 via Delta MERGE), fact tables with surrogate keys, pre-aggregated summary tables for dashboards, denormalized wide tables for BI, and ML feature tables. Each Gold table is designed for a specific consumer.

Ownership and Access by Layer

Layer	Owner	Write Access	Read Access
Bronze	Data Engineers	ADF/Synapse pipelines	Data Engineers only
Silver	Data Engineers	Databricks notebooks, Data Flows	Data Engineers, Data Scientists
Gold	Data Engineers + Analytics	Databricks notebooks	Everyone — Analysts, BI, ML, APIs
Quarantine	Data Engineers	Silver pipeline (rejects)	Data Engineers (to investigate)

Analysts should never read from Bronze. If they do, they are working with messy, unvalidated data. Gold is their layer.

Medallion in Databricks vs Synapse

Aspect	Databricks	Synapse
Bronze → Silver	PySpark notebooks	Data Flows or Spark notebooks
Silver → Gold	Delta MERGE in notebooks	Data Flows with SCD transformations
Orchestration	Databricks Workflows or ADF	Synapse Pipelines
Delta support	Native (default format)	Supported in Spark pools
Governance	Unity Catalog	Purview + Synapse roles
Best for	Code-first, ML-heavy	Visual, SQL-heavy

Many enterprises use BOTH: Synapse for ingestion (Bronze) and SQL workloads, Databricks for transformation (Silver/Gold) and ML.

Governance and Data Quality Per Layer

Layer	Schema Enforcement	Data Quality	Lineage
Bronze	None (schema-on-read)	None — accept everything	Source system → Bronze timestamp
Silver	Strict (Delta schema enforcement)	Null checks, type validation, referential integrity	Bronze path → Silver table
Gold	Strict (Delta + business rules)	Business rule validation, KPI accuracy	Silver table → Gold table

Quarantine Pattern

# During Silver processing — separate good and bad records
df_good = df_bronze.filter(
    col("email").rlike("^.+@.+\..+$") & col("customer_id").isNotNull()
)
df_bad = df_bronze.filter(
    ~col("email").rlike("^.+@.+\..+$") | col("customer_id").isNull()
)

df_good.write.format("delta").mode("append").save(".../silver/customers/")
df_bad.write.format("delta").mode("append").save(".../quarantine/customers/")
print(f"Good: {df_good.count()}, Quarantined: {df_bad.count()}")

Cost Optimization Per Layer

Layer	Storage Strategy	Cost Tip
Bronze	Keep raw for 90 days, archive to Cool/Archive tier	Move to Cool after 30 days, Archive after 90
Silver	Delta with OPTIMIZE + VACUUM	VACUUM after 7 days, OPTIMIZE weekly
Gold	Delta with Z-ORDER on query columns	OPTIMIZE + Z-ORDER for BI query speed

Common Anti-Patterns

Reading Bronze directly in Power BI — messy data, no types, no validation. Always read from Gold.
Skipping Silver entirely — going straight from Bronze to Gold means Gold has no clean, reusable intermediate layer. When a second Gold table needs the same data, you duplicate the cleaning logic.
Putting business logic in Silver — Silver should be source-aligned (cleaned version of the source). Business rules (margins, KPIs, categorizations) belong in Gold.
Overwriting Bronze — Bronze should be append-only. If you overwrite, you lose the ability to reprocess historical data.
No quarantine layer — bad records silently dropped or corrupting Silver. Always route bad data to a quarantine folder for investigation.
Using CSV in Silver/Gold — no ACID, no MERGE, no schema enforcement. Always use Delta.

Interview Questions

Q: What is the Medallion Architecture? A: A data lake design pattern with three layers: Bronze (raw, untouched data from sources), Silver (cleaned, validated, standardized data), and Gold (business-ready, aggregated, modeled data). Each layer reads from the previous one, creating clear data lineage and quality progression.

Q: What format should each layer use? A: Bronze uses whatever the source provides (CSV, JSON, Avro). Silver and Gold use Delta Lake for ACID transactions, schema enforcement, MERGE, and time travel. Delta is Parquet underneath with a transaction log on top.

Q: What is the difference between Silver and Gold? A: Silver is source-aligned — a cleaned version of the raw data. Gold is consumer-aligned — designed for specific business use cases. Silver has one customers table. Gold might have dim_customer (SCD2), agg_customer_revenue (aggregation), and ml_customer_features (ML) — all derived from the same Silver customers table.

Q: Who should read from which layer? A: Data engineers read/write all layers. Data scientists read Silver and Gold. Business analysts and Power BI read Gold only. Nobody except data engineers should read Bronze.

Q: How does the quarantine pattern work? A: During Bronze-to-Silver transformation, records that fail validation (null IDs, invalid emails, type mismatches) are routed to a quarantine folder instead of being silently dropped. Data engineers investigate quarantined records and fix source issues.

Wrapping Up

The Medallion Architecture is not complicated — it is disciplined. Bronze captures everything. Silver cleans everything. Gold serves everything. Three layers, three purposes, three quality levels.

Every pipeline on this blog fits into this pattern. The metadata-driven pipeline fills Bronze. The data quality transformations produce Silver. The SCD dimension tables are Gold. You have been building a Medallion Architecture all along — now you know the name for it.

← Previous: Databricks Workflows and Jobs

Databricks (12/16)

Next: Databricks Git Integration and CI/CD →

Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.

The Medallion Architecture in Azure: Bronze, Silver, and Gold Layers Explained with Real Pipelines

The Medallion Architecture in Azure: Bronze, Silver, and Gold Layers Explained with Real Pipelines

Table of Contents

What Is the Medallion Architecture?

The Water Purification Analogy

Bronze Layer — Raw, Untouched Data

Purpose

What Goes In

Rules

In Our Pipeline

Who Reads Bronze

Silver Layer — Cleaned, Validated, Standardized

Purpose

What Happens Here

What It Looks Like

Rules

In Our Pipelines

Who Reads Silver

Gold Layer — Business-Ready, Aggregated, Modeled

Purpose

What Happens Here

What It Looks Like

Rules

In Our Pipelines

Who Reads Gold

Format Selection by Layer

How Our Blog Pipelines Map to Medallion

Building the Medallion Architecture in Azure

Folder Structure in ADLS Gen2

Bronze to Silver: Transformation Patterns

Silver to Gold: Aggregation and Modeling

Ownership and Access by Layer

Medallion in Databricks vs Synapse

Governance and Data Quality Per Layer

Quarantine Pattern

Cost Optimization Per Layer

Common Anti-Patterns

Interview Questions

Wrapping Up

Leave a Comment Cancel Reply

The Medallion Architecture in Azure: Bronze, Silver, and Gold Layers Explained with Real Pipelines

Table of Contents

What Is the Medallion Architecture?

The Water Purification Analogy

Bronze Layer — Raw, Untouched Data

Purpose

What Goes In

Rules

In Our Pipeline

Who Reads Bronze

Silver Layer — Cleaned, Validated, Standardized

Purpose

What Happens Here

What It Looks Like

Rules

In Our Pipelines

Who Reads Silver

Gold Layer — Business-Ready, Aggregated, Modeled

Purpose

What Happens Here

What It Looks Like

Rules

In Our Pipelines

Who Reads Gold

Format Selection by Layer

How Our Blog Pipelines Map to Medallion

Building the Medallion Architecture in Azure

Folder Structure in ADLS Gen2

Bronze to Silver: Transformation Patterns

Silver to Gold: Aggregation and Modeling

Ownership and Access by Layer

Medallion in Databricks vs Synapse

Governance and Data Quality Per Layer

Quarantine Pattern

Cost Optimization Per Layer

Common Anti-Patterns

Interview Questions

Wrapping Up

Related Posts

Leave a Comment Cancel Reply