The Medallion Architecture in Azure: Bronze, Silver, and Gold Layers Explained with Real Pipelines
Every data platform tutorial mentions “Bronze, Silver, Gold” like it is obvious. But when you sit down to build one, the questions start: What exactly goes in each layer? What format should Bronze use? When does Silver become Gold? Who reads from which layer? How do the layers connect to the pipelines we have built?
The Medallion Architecture is NOT a product you install. It is a design pattern — a way of organizing your data lake into three layers with clear rules about what data looks like at each stage. Think of it like a water purification plant.
Raw water from the river (Bronze) is collected as-is — muddy, unfiltered, full of particles. The filtration stage (Silver) removes contaminants, standardizes pH levels, and tests for quality. The final treated water (Gold) is clean, safe, and ready for consumption. You never serve river water directly to homes. You always purify it first.
That is exactly what the Medallion Architecture does with data.
Table of Contents
- What Is the Medallion Architecture?
- The Water Purification Analogy
- Bronze Layer — Raw, Untouched Data
- Silver Layer — Cleaned, Validated, Standardized
- Gold Layer — Business-Ready, Aggregated, Modeled
- Format Selection by Layer
- How Our Blog Pipelines Map to Medallion
- Building the Medallion Architecture in Azure
- Bronze to Silver: Transformation Patterns
- Silver to Gold: Aggregation and Modeling
- Ownership and Access by Layer
- Medallion in Databricks vs Synapse
- Governance and Data Quality Per Layer
- Cost Optimization Per Layer
- Common Anti-Patterns
- Interview Questions
- Wrapping Up
What Is the Medallion Architecture?
SOURCE SYSTEMS DATA LAKE (ADLS Gen2) CONSUMERS
┌──────────────┐ ┌─────────────────────────────────────────────┐ ┌──────────────┐
│ │ │ │ │ │
│ SQL Database │──>│ BRONZE SILVER GOLD │──>│ Power BI │
│ APIs │──>│ (Raw) ──> (Clean) ──> (Business) │──>│ Analysts │
│ CSV files │──>│ │──>│ ML Models │
│ Kafka │──>│ "Land it" "Clean it" "Serve it" │──>│ Applications │
│ │ │ │ │ │
└──────────────┘ └─────────────────────────────────────────────┘ └──────────────┘
Three layers, three purposes, three quality levels.
The Water Purification Analogy
| Layer | Water Plant | Data Platform |
|---|---|---|
| Bronze | Raw river water collected in a holding tank. Muddy, untested. Keep as-is for reference. | Raw data landed as-is from sources. No transformations. CSV, JSON, Avro — whatever the source sends. |
| Silver | Filtered water. Sediment removed, chemicals balanced, bacteria killed. Tested for quality. | Cleaned, deduplicated, typed, validated data. NULLs handled, schemas enforced, bad records quarantined. |
| Gold | Bottled water. Labeled, packaged, ready for specific consumers. Different products for different needs. | Business-ready datasets. Star schema dimensions, aggregated facts, KPIs, ML feature tables. Tailored for specific consumers. |
The key rule: Each layer only reads from the layer before it. Gold reads from Silver, never from Bronze. Silver reads from Bronze, never from source systems directly. This creates a clear lineage and makes each layer independently testable.
Bronze Layer — Raw, Untouched Data
Purpose
Capture raw data EXACTLY as the source provides it. No transformations, no cleaning, no type conversions. Bronze is your safety net — if anything goes wrong downstream, you can always reprocess from Bronze.
What Goes In
bronze/
sales/
2026/05/18/
sales_raw.csv ← Exact file from source
customers/
2026/05/18/
customers_raw.json ← API response as-is
events/
2026/05/18/14/30/
event_stream.avro ← Kafka capture as-is
erp_export/
2026/05/18/
GL_ACCOUNTS.csv ← SAP export as-is
Rules
- Never transform — land data exactly as received
- Never delete — keep historical loads (date-partitioned folders)
- Any format — CSV, JSON, Avro, Parquet — whatever the source sends
- Append-only — each load adds a new folder, never overwrites
- Schema-on-read — do not enforce schema at this layer
In Our Pipeline
This is what our metadata-driven pipeline does — Copy activity reads from SQL and writes Parquet to ADLS. The destination folders (sqldb/Customer/2026/05/18/) are the Bronze layer.
Who Reads Bronze
- Silver layer transformation notebooks/pipelines
- Data engineers debugging source data issues
- Nobody else. Analysts and BI tools should NEVER read from Bronze.
Real-life analogy: Bronze is the warehouse receiving dock. Packages (data) arrive from suppliers (sources). They are placed on shelves exactly as they came — unopened, unchecked. The dock worker’s job is to receive and store, NOT to open or inspect.
Silver Layer — Cleaned, Validated, Standardized
Purpose
Clean, standardize, validate, and deduplicate Bronze data. Silver is your trusted, reusable data — multiple Gold layers can read from the same Silver table.
What Happens Here
| Transformation | Example |
|---|---|
| Null handling | Replace NULL city with ‘Unknown’ |
| Data type enforcement | Cast “price” from STRING to DOUBLE |
| Deduplication | Window function to keep latest record per key |
| Standardization | trim(), initCap(), fix invalid emails |
| Schema enforcement | Reject rows that do not match expected schema |
| Data quality checks | Validate email format, check referential integrity |
| Joins | Enrich customer with address (denormalize) |
| SCD Type 1 | Overwrite corrections (email fixes) |
| Filtering | Remove test data, internal accounts |
What It Looks Like
silver/
customers/
_delta_log/
part-00000.snappy.parquet ← Cleaned, deduplicated, Delta format
products/
_delta_log/
part-00000.snappy.parquet
orders/
_delta_log/
year=2026/
month=05/
part-00000.snappy.parquet ← Partitioned by date
Rules
- Always Delta format — ACID transactions, schema enforcement, MERGE
- Deduplicated — one row per entity (or one active row for SCD)
- Typed — all columns have correct data types
- Validated — bad data quarantined or defaulted
- Reusable — multiple Gold tables read from the same Silver table
In Our Pipelines
This is what our Project 4 data quality pipeline does — HandleNulls, StandardizeData, Deduplicate, then write to the dimension table. And our PySpark transformations cookbook shows exactly these operations in code.
Who Reads Silver
- Gold layer transformation notebooks
- Data scientists for feature engineering
- Advanced analysts who understand the data model
Real-life analogy: Silver is the kitchen prep station. Raw ingredients from the warehouse (Bronze) are washed, peeled, chopped, and portioned. The prep cook ensures everything is clean and measured correctly. Multiple chefs (Gold layers) use the same prepped ingredients for different dishes.
Gold Layer — Business-Ready, Aggregated, Modeled
Purpose
Serve business-specific, consumption-ready datasets. Gold tables are designed for specific consumers — a Power BI dashboard, a machine learning model, a regulatory report.
What Happens Here
| Transformation | Example |
|---|---|
| Star schema modeling | Create fact_orders + dim_customer + dim_product |
| SCD Type 2 | Historical dimension tables with versioning |
| Aggregations | Daily/weekly/monthly revenue rollups |
| KPIs | Customer lifetime value, churn rate, retention |
| Feature tables | ML features: avg_order_value, days_since_last_order |
| Denormalization | Pre-join dimensions into fact for faster BI queries |
| Business logic | Calculate margins, apply tax rules, categorize customers |
What It Looks Like
gold/
dim_customer/ ← SCD Type 2 dimension (our SCD2 pipelines)
_delta_log/
part-00000.snappy.parquet
dim_product/ ← Product dimension
fact_orders/ ← Fact table (grain: one row per order line)
_delta_log/
year=2026/
month=05/
agg_daily_revenue/ ← Pre-aggregated daily revenue
ml_features/ ← Feature table for ML models
Rules
- Always Delta format — ACID, time travel, MERGE for SCD
- Consumer-specific — each Gold table is designed for a specific use case
- Business logic applied — calculations, categorizations, classifications
- Optimized for reading — partitioned, Z-ORDERED, OPTIMIZED
- Documented — clear definitions of every column and business rule
In Our Pipelines
This is where our SCD Type 2 tables live — dim_customer with start_date, end_date, is_active. And our DB vs DW post explains why Gold uses star schema.
Who Reads Gold
- Power BI dashboards (DirectQuery or Import)
- Business analysts (SQL queries)
- ML models (feature tables)
- APIs and applications
- Regulatory reports
Real-life analogy: Gold is the served dish at a restaurant. The ingredients were received (Bronze), prepped (Silver), and now cooked and plated specifically for the customer’s order (Gold). A steak dinner and a Caesar salad use the same prepped lettuce (Silver) but are completely different dishes (Gold) for different customers.
Format Selection by Layer
| Layer | Format | Why |
|---|---|---|
| Bronze | Whatever source sends (CSV, JSON, Avro, Parquet) | No transformation, preserve original |
| Silver | Delta Lake | ACID, schema enforcement, MERGE for dedup/SCD |
| Gold | Delta Lake | ACID, time travel, OPTIMIZE, Z-ORDER for fast queries |
Why not Parquet for Silver/Gold? Because Parquet does not support UPDATE, DELETE, MERGE, schema enforcement, or ACID transactions. Delta adds all of these while using Parquet underneath.
How Our Blog Pipelines Map to Medallion
| Our Pipeline | Layer | What It Does |
|---|---|---|
| Metadata-Driven Pipeline | Source → Bronze | Copy tables from SQL to ADLS as Parquet |
| Incremental Loading | Source → Bronze | Copy only new rows using watermark |
| Unified Full+Incremental | Source → Bronze | Both load types in one pipeline |
| Project 4: Data Quality | Bronze → Silver | Null handling, standardization, dedup |
| PySpark Transformations | Bronze → Silver | Same cleaning in code |
| SCD Type 1 Hash | Silver → Gold | Overwrite dimension changes |
| SCD Type 2 Pipeline | Silver → Gold | Historical dimension tracking |
| SCD PySpark MERGE | Silver → Gold | Same in Databricks |
Every pipeline we built fits neatly into the Medallion pattern.
Building the Medallion Architecture in Azure
Folder Structure in ADLS Gen2
Container: datalake/
bronze/
sales/
YYYY/MM/DD/ ← Date-partitioned raw data
customers/
products/
events/
silver/
customers/ ← Delta table (cleaned, deduplicated)
products/
orders/
gold/
dim_customer/ ← SCD Type 2 dimension
dim_product/
fact_orders/
agg_daily_revenue/
ml_customer_features/
quarantine/ ← Bad records from Silver validation
customers/
orders/
The Pipeline Chain
# Bronze → Silver notebook
df_bronze = spark.read.parquet("abfss://.../bronze/customers/2026/05/18/")
df_silver = df_bronze .withColumn("name", initcap(trim(col("name")))) .withColumn("email", lower(trim(col("email")))) .fillna({"city": "Unknown"}) .dropDuplicates(["customer_id"])
df_silver.write.format("delta").mode("overwrite").save("abfss://.../silver/customers/")
# Silver → Gold notebook
df_silver = spark.read.format("delta").load("abfss://.../silver/customers/")
# Apply SCD Type 2 MERGE to Gold dim_customer
target = DeltaTable.forName(spark, "gold.dim_customer")
target.alias("t").merge(df_silver.alias("s"), ...).execute()
Ownership and Access by Layer
| Layer | Owner | Write Access | Read Access |
|---|---|---|---|
| Bronze | Data Engineers | ADF/Synapse pipelines | Data Engineers only |
| Silver | Data Engineers | Databricks notebooks, Data Flows | Data Engineers, Data Scientists |
| Gold | Data Engineers + Analytics | Databricks notebooks | Everyone — Analysts, BI, ML, APIs |
| Quarantine | Data Engineers | Silver pipeline (rejects) | Data Engineers (to investigate) |
Analysts should never read from Bronze. If they do, they are working with messy, unvalidated data. Gold is their layer.
Medallion in Databricks vs Synapse
| Aspect | Databricks | Synapse |
|---|---|---|
| Bronze → Silver | PySpark notebooks | Data Flows or Spark notebooks |
| Silver → Gold | Delta MERGE in notebooks | Data Flows with SCD transformations |
| Orchestration | Databricks Workflows or ADF | Synapse Pipelines |
| Delta support | Native (default format) | Supported in Spark pools |
| Governance | Unity Catalog | Purview + Synapse roles |
| Best for | Code-first, ML-heavy | Visual, SQL-heavy |
Many enterprises use BOTH: Synapse for ingestion (Bronze) and SQL workloads, Databricks for transformation (Silver/Gold) and ML.
Governance and Data Quality Per Layer
| Layer | Schema Enforcement | Data Quality | Lineage |
|---|---|---|---|
| Bronze | None (schema-on-read) | None — accept everything | Source system → Bronze timestamp |
| Silver | Strict (Delta schema enforcement) | Null checks, type validation, referential integrity | Bronze path → Silver table |
| Gold | Strict (Delta + business rules) | Business rule validation, KPI accuracy | Silver table → Gold table |
Quarantine Pattern
# During Silver processing — separate good and bad records
df_good = df_bronze.filter(
col("email").rlike("^.+@.+\..+$") & col("customer_id").isNotNull()
)
df_bad = df_bronze.filter(
~col("email").rlike("^.+@.+\..+$") | col("customer_id").isNull()
)
df_good.write.format("delta").mode("append").save(".../silver/customers/")
df_bad.write.format("delta").mode("append").save(".../quarantine/customers/")
print(f"Good: {df_good.count()}, Quarantined: {df_bad.count()}")
Cost Optimization Per Layer
| Layer | Storage Strategy | Cost Tip |
|---|---|---|
| Bronze | Keep raw for 90 days, archive to Cool/Archive tier | Move to Cool after 30 days, Archive after 90 |
| Silver | Delta with OPTIMIZE + VACUUM | VACUUM after 7 days, OPTIMIZE weekly |
| Gold | Delta with Z-ORDER on query columns | OPTIMIZE + Z-ORDER for BI query speed |
Common Anti-Patterns
-
Reading Bronze directly in Power BI — messy data, no types, no validation. Always read from Gold.
-
Skipping Silver entirely — going straight from Bronze to Gold means Gold has no clean, reusable intermediate layer. When a second Gold table needs the same data, you duplicate the cleaning logic.
-
Putting business logic in Silver — Silver should be source-aligned (cleaned version of the source). Business rules (margins, KPIs, categorizations) belong in Gold.
-
Overwriting Bronze — Bronze should be append-only. If you overwrite, you lose the ability to reprocess historical data.
-
No quarantine layer — bad records silently dropped or corrupting Silver. Always route bad data to a quarantine folder for investigation.
-
Using CSV in Silver/Gold — no ACID, no MERGE, no schema enforcement. Always use Delta.
Interview Questions
Q: What is the Medallion Architecture? A: A data lake design pattern with three layers: Bronze (raw, untouched data from sources), Silver (cleaned, validated, standardized data), and Gold (business-ready, aggregated, modeled data). Each layer reads from the previous one, creating clear data lineage and quality progression.
Q: What format should each layer use? A: Bronze uses whatever the source provides (CSV, JSON, Avro). Silver and Gold use Delta Lake for ACID transactions, schema enforcement, MERGE, and time travel. Delta is Parquet underneath with a transaction log on top.
Q: What is the difference between Silver and Gold? A: Silver is source-aligned — a cleaned version of the raw data. Gold is consumer-aligned — designed for specific business use cases. Silver has one customers table. Gold might have dim_customer (SCD2), agg_customer_revenue (aggregation), and ml_customer_features (ML) — all derived from the same Silver customers table.
Q: Who should read from which layer? A: Data engineers read/write all layers. Data scientists read Silver and Gold. Business analysts and Power BI read Gold only. Nobody except data engineers should read Bronze.
Q: How does the quarantine pattern work? A: During Bronze-to-Silver transformation, records that fail validation (null IDs, invalid emails, type mismatches) are routed to a quarantine folder instead of being silently dropped. Data engineers investigate quarantined records and fix source issues.
Wrapping Up
The Medallion Architecture is not complicated — it is disciplined. Bronze captures everything. Silver cleans everything. Gold serves everything. Three layers, three purposes, three quality levels.
Every pipeline on this blog fits into this pattern. The metadata-driven pipeline fills Bronze. The data quality transformations produce Silver. The SCD dimension tables are Gold. You have been building a Medallion Architecture all along — now you know the name for it.
Related posts: – Data File Formats (CSV, Parquet, Delta) – Project 4: Data Quality + SCD – SCD Type 1 and 2 in PySpark – Delta Lake Optimization – Metadata-Driven Pipeline
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.