Data File Formats in Azure Explained: CSV, Parquet, Delta, Avro, ORC, and JSON — When to Use Each

Imagine you are moving to a new house. You can pack your belongings in garbage bags (CSV), organized labeled boxes (Parquet), or a professional moving system with tracking, insurance, and inventory (Delta Lake). All three get your stuff to the new house, but the experience and reliability are vastly different.

Choosing the right file format in your data lake is one of the most impactful decisions you make as a data engineer. The wrong format means slow queries, bloated storage costs, failed pipelines, and frustrated analysts. The right format means fast queries, efficient storage, reliable data, and happy stakeholders.

This post covers every file format you will encounter in Azure data engineering, with honest comparisons, real-world scenarios, and clear recommendations for each layer of your data lake.

Table of Contents

  • Why File Formats Matter
  • CSV (Comma-Separated Values)
  • JSON (JavaScript Object Notation)
  • Parquet
  • Delta Lake (Delta Format)
  • Avro
  • ORC (Optimized Row Columnar)
  • The Complete Comparison Table
  • Row-Oriented vs Column-Oriented: The Fundamental Split
  • Choosing Formats for the Medallion Architecture
  • File Formats in Azure Services
  • Compression: Snappy, GZIP, ZSTD, and LZ4
  • Small Files Problem and How Each Format Handles It
  • Schema Evolution: How Each Format Handles Changes
  • Real-World Scenarios
  • Common Mistakes
  • Interview Questions
  • Wrapping Up

Why File Formats Matter

Consider this query:

SELECT city, SUM(amount) FROM orders WHERE year = 2026 GROUP BY city

On CSV (10 GB file): Reads the ENTIRE 10 GB file. Every row, every column. Even though you only need city, amount, and year. Takes 45 seconds.

On Parquet (10 GB → 2 GB compressed): Reads ONLY the city, amount, and year columns. Skips all other columns entirely. If the data is partitioned by year, it reads only the year=2026 partitions. Takes 3 seconds.

On Delta (same Parquet underneath + metadata): Same speed as Parquet, PLUS ACID transactions, time travel, and automatic file optimization. Takes 3 seconds with reliability guarantees.

The format determines how much data is read, how fast queries run, how much storage costs, and how reliable your pipeline is.

Real-life analogy: File formats are like book organization systems:

  • CSV = a handwritten notebook. Easy to start writing, but finding anything requires reading every page.
  • Parquet = an organized filing cabinet with labeled tabs. Jump directly to the section you need.
  • Delta = the filing cabinet PLUS a librarian who tracks every change, prevents conflicts, and can show you what the cabinet looked like last Tuesday.

CSV (Comma-Separated Values)

What It Is

The simplest data format. Plain text with commas separating values. Every spreadsheet, database, and programming language can read and write CSV.

customer_id,name,city,amount
1,Alice,Toronto,500
2,Bob,Mumbai,300
3,Carol,London,700

Strengths

  • Universal compatibility — everything reads CSV. Excel, Python, R, ADF, Spark, Athena, even Notepad.
  • Human-readable — open it in any text editor and see the data immediately.
  • Easy to produce — source systems, APIs, and manual exports often output CSV.
  • No special tools needed — cat, head, and grep work on CSV files.

Weaknesses

  • No schema — column names are in the header (maybe), but data types are not stored. Is “2026” a string or an integer? Is “01/02/2026” January 2nd or February 1st? CSV does not know.
  • No compression — raw text takes the most storage space.
  • Full scan required — to read one column, you must read ALL columns (row-oriented).
  • No column pruning — Athena or Spark cannot skip columns. Reads everything.
  • Delimiter conflicts — what if your data contains commas? "Alice, Bob" breaks naive CSV parsers.
  • No data types — everything is text. “500” and “five hundred” look the same to CSV.

When to Use CSV

  • Landing zone — source systems drop CSV files. Accept them, then convert.
  • Data exchange — sending data to non-technical stakeholders or legacy systems.
  • Small datasets — under 100 MB where performance does not matter.
  • Debugging — quick inspection with a text editor.

When NOT to Use CSV

  • Analytics — use Parquet instead. CSV scans are 10-50x slower.
  • Data lake storage — never store production data as CSV in Silver/Gold layers.
  • Large datasets — anything over 1 GB should be Parquet or Delta.

Real-life analogy: CSV is like a handwritten letter. Anyone can read it, but it takes forever to search through a stack of 10,000 letters to find ones about a specific topic. No index, no organization, no shortcuts.

CSV in Azure

# ADF: Copy activity with CSV dataset
# Python:
import pandas as pd
df = pd.read_csv('data.csv')

# Spark:
df = spark.read.csv('abfss://container@storage.dfs.core.windows.net/data.csv',
                     header=True, inferSchema=True)

JSON (JavaScript Object Notation)

What It Is

Semi-structured data format that supports nested objects and arrays. The language of APIs, web applications, and NoSQL databases.

{
    "customer_id": 1,
    "name": "Alice",
    "address": {
        "city": "Toronto",
        "country": "Canada"
    },
    "orders": [
        {"id": 101, "amount": 500},
        {"id": 102, "amount": 300}
    ]
}

Strengths

  • Supports nesting — objects within objects, arrays within objects. Perfect for complex, hierarchical data.
  • Self-describing — field names are included with every record.
  • Human-readable — easy to inspect and debug.
  • API standard — every REST API returns JSON.
  • Schema-flexible — different records can have different fields.

Weaknesses

  • Verbose — field names repeated in every record. “customer_id” appears 1 million times in 1 million records.
  • No compression — raw text format (larger than CSV because of field names and syntax characters).
  • Slow to query — parsing nested JSON is expensive for analytical engines.
  • No column pruning — must read entire records to extract one field.
  • Schema inference is messy — Spark guesses types, sometimes wrong.

JSON Lines (JSONL) — The Better Format

Standard JSON wraps everything in an array. JSONL puts one record per line:

{"customer_id": 1, "name": "Alice", "city": "Toronto"}
{"customer_id": 2, "name": "Bob", "city": "Mumbai"}
{"customer_id": 3, "name": "Carol", "city": "London"}

Why JSONL is better: Spark and ADF can read JSONL line by line without loading the entire file into memory. Standard JSON requires parsing the whole array first.
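
A quick way to see the difference in Spark — a minimal sketch, with placeholder paths:

# JSONL: one record per line, so Spark can split and parse lines in parallel (default behavior)
df_jsonl = spark.read.json('abfss://container@storage.dfs.core.windows.net/bronze/events.jsonl')

# Standard JSON (one big array or multi-line objects): each file must be parsed as a whole
df_json = spark.read.option("multiLine", "true") \
    .json('abfss://container@storage.dfs.core.windows.net/bronze/events.json')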

When to Use JSON

  • API responses — store raw API data in Bronze layer as JSON.
  • Event data — IoT sensors, clickstream, application logs often produce JSON.
  • Nested/complex data — when flat CSV cannot represent the structure.
  • Configuration files — pipeline configs, metadata definitions.

When NOT to Use JSON

  • Analytical queries — flatten to Parquet for analytics.
  • Large-scale storage — JSON is verbose. Convert to Parquet in Silver layer.
  • Tabular data — if your data is flat (no nesting), CSV or Parquet is simpler.

Real-life analogy: JSON is like a detailed conversation transcript. Every sentence includes context (“Alice said:”, “Bob replied:”). Rich in detail but takes a long time to search through for specific facts. Great for capturing complex interactions, poor for quick lookups.

JSON in Azure

# ADF: Copy activity with JSON dataset (supports nested)
# Python:
import json
with open('data.json') as f:
    data = json.load(f)

# Spark (JSONL):
df = spark.read.json('abfss://container@storage.dfs.core.windows.net/data.jsonl')

# Flatten nested JSON in Spark:
from pyspark.sql.functions import explode, col
df_flat = df.select("customer_id", "name", explode("orders").alias("order"))
df_final = df_flat.select("customer_id", "name", "order.id", "order.amount")

Parquet

What It Is

A columnar binary format designed for analytical workloads. An Apache project that grew out of the Hadoop ecosystem, and now the de facto standard for data lakes.

Row format (CSV):       Column format (Parquet):
id, name, city, amt     id column:   [1, 2, 3, 4, 5]
1, Alice, Toronto, 500  name column: [Alice, Bob, Carol, Dave, Eve]
2, Bob, Mumbai, 300     city column: [Toronto, Mumbai, London, Toronto, Paris]
3, Carol, London, 700   amt column:  [500, 300, 700, 200, 400]
4, Dave, Toronto, 200
5, Eve, Paris, 400

Why Columnar Matters

When you query SELECT city, SUM(amount) FROM orders:

  • CSV (row format): Reads ALL columns for ALL rows. 100% of data scanned.
  • Parquet (column format): Reads ONLY the city and amount columns. Maybe 20% of data scanned. 5x faster, 5x cheaper.
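
In Spark, the same idea looks like this. A sketch with an illustrative path; only the columns the query references are actually read from the Parquet files:

from pyspark.sql.functions import sum as sum_

orders = spark.read.parquet('abfss://container@storage.dfs.core.windows.net/silver/orders/')

# Only city, amount, and year are read from storage; every other column is skipped
result = (orders
          .where("year = 2026")
          .groupBy("city")
          .agg(sum_("amount").alias("total_amount")))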

Strengths

  • Column pruning — reads only the columns your query needs. Massive performance boost.
  • Compression — similar values in a column compress extremely well. 5-10x smaller than CSV.
  • Embedded schema — data types are stored in the file. No guessing.
  • Predicate pushdown — Spark/Athena skip row groups where min/max statistics eliminate them from the query.
  • Universal support — Spark, ADF, Synapse, Athena, Redshift, BigQuery, Databricks all read Parquet natively.
  • Partition pruning — combined with Hive-style partitioning, only relevant partitions are scanned.
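
Partition pruning in practice, as a sketch (the year column and paths are illustrative):

# Writing with partitionBy creates Hive-style folders: .../orders/year=2025/, .../orders/year=2026/
df.write.partitionBy("year") \
    .parquet('abfss://container@storage.dfs.core.windows.net/silver/orders/')

# A filter on the partition column means only the matching folders are listed and read
df_2026 = spark.read.parquet('abfss://container@storage.dfs.core.windows.net/silver/orders/') \
    .where("year = 2026")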

Weaknesses

  • Not human-readable — binary format. Cannot open in Notepad or Excel directly.
  • Append-only — you cannot update or delete individual rows. Must rewrite the entire file.
  • No ACID transactions — concurrent writes can corrupt data. No rollback capability.
  • No time travel — once overwritten, the old data is gone.
  • Small files problem — many small Parquet files kill query performance.

When to Use Parquet

  • Silver layer — cleaned, transformed data ready for analytics.
  • Data lake storage — the default format for storing analytical data.
  • Any Spark/Athena/Synapse workload — these engines are optimized for Parquet.
  • Long-term archival — compressed, efficient, widely supported.

When to Consider Delta Instead

  • You need updates/deletes on individual rows.
  • You need ACID transactions (concurrent writes safety).
  • You need time travel (query data as it was yesterday).
  • You need schema enforcement (reject bad data automatically).

Real-life analogy: Parquet is like a well-organized filing cabinet with color-coded tabs. Need all the financial documents? Pull the green tab. Need all the HR documents? Pull the blue tab. You never have to search through the entire cabinet. But if you want to change one document, you have to pull the entire folder out, make the change, and put it all back.

Parquet in Azure

# ADF: Copy activity with Parquet dataset (Snappy compression)
# Python:
import pandas as pd
df = pd.read_parquet('data.parquet')
df.to_parquet('output.parquet', compression='snappy')

# Spark:
df = spark.read.parquet('abfss://container@storage.dfs.core.windows.net/silver/customers/')
df.write.parquet('abfss://container@storage.dfs.core.windows.net/gold/summary/')

Delta Lake (Delta Format)

What It Is

Delta Lake is NOT a new file format — it is Parquet files + a transaction log. The data is still stored as Parquet. Delta adds a _delta_log/ folder that tracks every change: inserts, updates, deletes, schema changes.

silver/customers/
  _delta_log/
    00000000000000000000.json    (transaction 0: initial load)
    00000000000000000001.json    (transaction 1: updated 50 rows)
    00000000000000000002.json    (transaction 2: deleted 3 rows)
  part-00000.snappy.parquet     (data file)
  part-00001.snappy.parquet     (data file)
  part-00002.snappy.parquet     (new data from update)

Think of it this way: Parquet is the filing cabinet. Delta is the filing cabinet PLUS a security camera, a change log, and a librarian who prevents two people from filing at the same time.

What Delta Adds Over Parquet

| Feature | Parquet | Delta Lake |
| --- | --- | --- |
| ACID transactions | No | Yes — concurrent writes are safe |
| UPDATE/DELETE | No — must rewrite entire file | Yes — row-level updates and deletes |
| Time travel | No | Yes — query any historical version |
| Schema enforcement | No — any schema is accepted | Yes — rejects data that does not match |
| Schema evolution | Manual | Automatic (mergeSchema option) |
| MERGE (upsert) | Not possible | Native support |
| File compaction | Manual | OPTIMIZE command auto-compacts |
| Data skipping | Basic (min/max stats) | Enhanced with Z-ORDER clustering |

ACID Transactions (The Bank Analogy)

Imagine two pipelines writing to the same table simultaneously. Without ACID:

Pipeline A: Writing 1 million rows to customers/
Pipeline B: Also writing to customers/ at the same time
Result: Corrupted files. Some rows from A, some from B, some half-written.

With Delta Lake ACID:

Pipeline A: Writing 1 million rows → Transaction 5 committed ✓
Pipeline B: Also writing → Transaction 6 committed ✓
Result: Both complete, no corruption. Each sees a consistent view.

Real-life analogy: ACID transactions are like a bank transfer. When you send $100 from checking to savings, either BOTH accounts update (debit checking AND credit savings) or NEITHER updates. You never lose $100 in the middle. Delta Lake gives your data the same guarantee.

Time Travel (The Security Camera)

-- What does the customer table look like RIGHT NOW?
SELECT * FROM delta.`/silver/customers/`

-- What did it look like YESTERDAY? (before the bad update)
SELECT * FROM delta.`/silver/customers/` VERSION AS OF 5

-- What did it look like at a specific timestamp?
SELECT * FROM delta.`/silver/customers/` TIMESTAMP AS OF '2026-04-10 14:00:00'

When time travel saves you:

  • Someone ran a bad UPDATE that corrupted 10,000 rows. Restore the previous version.
  • An auditor asks “what was Alice’s address on March 1?” Query that point in time.
  • A pipeline failed halfway. Roll back to the last good version.

Real-life analogy: Time travel is like having security camera footage of your filing cabinet. If someone misfiled a document, you can rewind the footage, see what it looked like before, and restore the correct state. Without Delta, once the document is misfiled, you have no record of the original state.
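
The same time travel reads from PySpark, as a sketch (the version number and timestamp are placeholders):

# Read the table as it was at version 5
df_v5 = spark.read.format("delta") \
    .option("versionAsOf", 5) \
    .load("/silver/customers/")

# Read the table as it was at a specific point in time
df_before = spark.read.format("delta") \
    .option("timestampAsOf", "2026-04-10 14:00:00") \
    .load("/silver/customers/")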

MERGE (The Killer Feature for Data Engineers)

MERGE combines INSERT, UPDATE, and DELETE in one atomic operation. This is how you implement SCD Type 2, incremental loads, and deduplication:

MERGE INTO silver.customers AS target
USING staging.customers AS source
ON target.customer_id = source.customer_id

WHEN MATCHED AND source.city != target.city THEN
    UPDATE SET city = source.city, updated_date = current_date()

WHEN NOT MATCHED THEN
    INSERT (customer_id, name, city, updated_date)
    VALUES (source.customer_id, source.name, source.city, current_date())

WHEN NOT MATCHED BY SOURCE THEN
    DELETE

This single statement handles: new customers (INSERT), changed customers (UPDATE), and removed customers (DELETE). All in one transaction. All or nothing.
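
If you prefer the PySpark API over SQL, the same upsert can be expressed with DeltaTable. A sketch: staging_df is an assumed DataFrame of incoming records, and the not-matched-by-source delete from the SQL above requires a recent Delta Lake version, so it is left out here:

from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/silver/customers/")

(target.alias("t")
    .merge(staging_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdate(
        condition="s.city != t.city",
        set={"city": "s.city", "updated_date": "current_date()"})
    .whenNotMatchedInsert(values={
        "customer_id": "s.customer_id",
        "name": "s.name",
        "city": "s.city",
        "updated_date": "current_date()"})
    .execute())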

Schema Enforcement and Evolution

# Schema enforcement: reject data with wrong schema
df_bad = spark.createDataFrame([{"customer_id": "ABC", "name": 123}])  # wrong types
df_bad.write.format("delta").mode("append").save("/silver/customers/")
# ERROR: schema mismatch. Data rejected. Table stays clean.

# Schema evolution: add new columns safely
df_new = spark.createDataFrame([{"customer_id": 1, "name": "Alice", "email": "alice@test.com"}])
df_new.write.format("delta").mode("append") \
    .option("mergeSchema", "true") \
    .save("/silver/customers/")
# Success: "email" column added to the table automatically

Real-life analogy: Schema enforcement is like a club bouncer with a dress code. Wrong outfit (wrong schema)? You are not getting in. Schema evolution is like the club updating the dress code to allow new styles — existing guests stay, new guests can wear the new style.

OPTIMIZE and Z-ORDER

-- Compact small files into larger ones (fixes small files problem)
OPTIMIZE silver.customers

-- Cluster data by a column for faster queries
OPTIMIZE silver.customers ZORDER BY (city)

After Z-ORDER by city, all “Toronto” rows are stored near each other. A query filtering WHERE city = 'Toronto' reads far fewer files.

When to Use Delta Lake

  • Silver and Gold layers — the standard for modern data lakes
  • Any table that needs updates/deletes — SCD, deduplication, corrections
  • Concurrent pipelines — multiple writers to the same table
  • Compliance/audit — time travel for regulatory requirements
  • Production data platforms — reliability guarantees that Parquet cannot provide

Delta Lake in Azure

# Databricks (native Delta support):
df.write.format("delta").mode("overwrite").save("/silver/customers/")
df = spark.read.format("delta").load("/silver/customers/")

# Synapse Spark pool:
df.write.format("delta").save("abfss://container@storage.dfs.core.windows.net/silver/customers/")

# Synapse Serverless SQL:
SELECT * FROM OPENROWSET(
    BULK 'abfss://container@storage.dfs.core.windows.net/silver/customers/',
    FORMAT = 'DELTA'
) AS result

Avro

What It Is

A row-oriented binary format with an embedded schema. Created for the Hadoop ecosystem, now widely used with Apache Kafka and Azure Event Hubs.

Schema (JSON):
{
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "customer_id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "city", "type": "string"}
    ]
}

Data: binary-encoded rows following this schema

Strengths

  • Schema embedded — the schema travels WITH the data. No guessing types.
  • Schema evolution — add/remove fields without breaking consumers.
  • Fast serialization/deserialization — optimized for write-heavy workloads.
  • Compact — binary encoding is smaller than JSON.
  • Kafka standard — the default format for message queues and streaming.

Weaknesses

  • Row-oriented — not optimized for analytical queries (cannot skip columns).
  • Not human-readable — binary format.
  • Less efficient for analytics — Parquet beats Avro for SELECT queries by 5-10x.

When to Use Avro

  • Streaming/messaging — Kafka, Event Hubs, data streaming pipelines
  • Write-heavy workloads — fast serialization for high-throughput producers
  • Schema registry — Avro schemas work with Confluent Schema Registry
  • Bronze layer ingestion — capture streaming data before converting to Parquet

When NOT to Use Avro

  • Analytics — use Parquet. Avro scans are much slower for column-based queries.
  • Data lake storage — convert Avro to Parquet in the Silver layer.

Real-life analogy: Avro is like a package with a packing slip attached. The packing slip (schema) describes exactly what is inside. Great for shipping (streaming), but when you need to search through 1,000 packages for all items of a certain color (analytical query), you have to open every package. Parquet is a warehouse where items are sorted by color on different shelves.

Avro in Azure

# ADF: Copy activity supports Avro source/sink
# Spark:
df = spark.read.format("avro").load("abfss://container@storage.dfs.core.windows.net/events/")

# Event Hubs capture stores data as Avro automatically

ORC (Optimized Row Columnar)

What It Is

A columnar format created by Hortonworks for the Hive ecosystem. Very similar to Parquet in concept but with different internal structure.

Strengths

  • Excellent compression — often slightly better than Parquet.
  • Built-in indexes — bloom filters and min/max statistics per stripe.
  • Hive optimized — the default format for Hive tables.
  • ACID support — Hive 3+ supports ACID transactions with ORC.

ORC vs Parquet

| Feature | ORC | Parquet |
| --- | --- | --- |
| Ecosystem | Hive/Hortonworks | Spark/Databricks/everything |
| Compression | Slightly better | Very good |
| Azure support | Supported | First-class citizen |
| Spark performance | Good | Excellent (optimized) |
| Industry adoption | Declining | Growing (dominant) |
| Recommendation | Legacy Hive migrations | Everything else |

When to Use ORC

  • Migrating from on-premises Hive — your existing Hive tables are ORC, keep them.
  • Hive-specific workloads — Hive queries on ORC can be slightly faster than on Parquet.

When NOT to Use ORC

  • New projects — use Parquet or Delta. Better ecosystem support.
  • Spark workloads — Spark is more optimized for Parquet.
  • Azure-native services — Synapse, ADF, Databricks all prefer Parquet/Delta.

Real-life analogy: ORC is like Betamax. Technically excellent, but VHS (Parquet) won the format war through wider adoption. If you already have a Betamax collection (Hive tables), keep using it. But for new purchases (new projects), go with the standard.
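
ORC in Azure

If you do need to read or write ORC, Spark supports it natively. A minimal sketch with placeholder paths:

# Spark:
df = spark.read.orc('abfss://container@storage.dfs.core.windows.net/legacy/hive_orders/')
df.write.orc('abfss://container@storage.dfs.core.windows.net/bronze/hive_orders/')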

The Complete Comparison Table

| Feature | CSV | JSON | Parquet | Delta | Avro | ORC |
| --- | --- | --- | --- | --- | --- | --- |
| Type | Row, text | Row, text | Column, binary | Column + log | Row, binary | Column, binary |
| Human-readable | Yes | Yes | No | No | No | No |
| Schema embedded | No | Partial | Yes | Yes | Yes | Yes |
| Compression | None (manual gzip) | None (manual gzip) | Snappy/GZIP/ZSTD | Snappy/GZIP/ZSTD | Deflate/Snappy | ZLIB/Snappy/LZO |
| Column pruning | No | No | Yes | Yes | No | Yes |
| UPDATE/DELETE | No | No | No | Yes | No | Hive 3+ only |
| ACID transactions | No | No | No | Yes | No | Hive 3+ only |
| Time travel | No | No | No | Yes | No | No |
| MERGE | No | No | No | Yes | No | Hive 3+ only |
| Schema evolution | No | Flexible | Manual | Auto (mergeSchema) | Yes | Yes |
| Best for | Landing zone | APIs, events | Analytics | Production DL | Streaming | Legacy Hive |
| 10 GB query speed | 45 sec | 60 sec | 3 sec | 3 sec | 30 sec | 4 sec |
| 10 GB storage size | 10 GB | 12 GB | 2 GB | 2 GB | 4 GB | 1.8 GB |

Row-Oriented vs Column-Oriented: The Fundamental Split

This is the single most important concept in file formats:

Row-Oriented (CSV, JSON, Avro)

Data is stored row by row:

Row 1: [1, Alice, Toronto, 500]
Row 2: [2, Bob, Mumbai, 300]
Row 3: [3, Carol, London, 700]

Great for: Writing individual records (streaming, transactions). Each row is self-contained.

Bad for: Analytical queries that touch few columns but many rows. Must read ALL columns to get one.

Column-Oriented (Parquet, Delta, ORC)

Data is stored column by column:

Column 'id':     [1, 2, 3]
Column 'name':   [Alice, Bob, Carol]
Column 'city':   [Toronto, Mumbai, London]
Column 'amount': [500, 300, 700]

Great for: Analytical queries. SELECT city, SUM(amount) reads only 2 columns out of 4. Plus, similar values in a column compress better.

Bad for: Writing individual records (must update multiple column files).

Real-life analogy:

  • Row-oriented = a phone book sorted by person. Great for finding Alice’s number. Bad for finding all people in Toronto.
  • Column-oriented = a spreadsheet with one column per attribute. Great for finding all people in Toronto (scan one column). Bad for finding everything about Alice (scan every column).

Choosing Formats for the Medallion Architecture

BRONZE (Raw Ingestion)
  Format: Whatever the source provides (CSV, JSON, Avro)
  Why: Accept data as-is, do not transform yet
  Example: API response → JSON, database export → CSV, Kafka → Avro

SILVER (Cleaned + Transformed)
  Format: Delta Lake (or Parquet if Delta is not available)
  Why: Need schema enforcement, updates for corrections, data quality
  Example: Cleaned customers (deduped, nulls handled, types enforced)

GOLD (Business-Ready)
  Format: Delta Lake
  Why: Need MERGE for SCD, time travel for audit, ACID for reliability
  Example: dim_customer (SCD Type 2), fact_sales (daily aggregation)

The Simple Rule

Source drops CSV/JSON  →  Bronze (keep as-is)
                       →  Silver (convert to Delta)
                       →  Gold (Delta with MERGE and OPTIMIZE)
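
A minimal Bronze-to-Silver conversion following that rule might look like this. A sketch: the container names, columns, and the dedup key are illustrative:

# Bronze: read the raw CSV exactly as the source dropped it
bronze_df = spark.read.csv('abfss://bronze@storage.dfs.core.windows.net/orders/2026-04-10/',
                           header=True, inferSchema=True)

# Silver: basic cleaning, then write as Delta so schema enforcement and MERGE are available
(bronze_df
    .dropDuplicates(["order_id"])
    .filter("order_id IS NOT NULL")
    .write.format("delta")
    .mode("append")
    .save('abfss://silver@storage.dfs.core.windows.net/orders/'))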

File Formats in Azure Services

| Service | Preferred Formats | Notes |
| --- | --- | --- |
| ADF Copy Activity | CSV, Parquet, JSON, Avro, ORC | Format conversion during copy |
| ADF Data Flows | All formats | Spark engine reads everything |
| Synapse Spark Pool | Delta, Parquet | Delta is first-class citizen |
| Synapse Serverless SQL | Parquet, Delta, CSV | OPENROWSET function |
| Synapse Dedicated SQL | Parquet (via COPY INTO) | Fastest bulk load format |
| Databricks | Delta (native) | Delta is the default |
| Azure HDInsight | ORC, Parquet, Avro | Hive prefers ORC |
| Azure Stream Analytics | JSON, Avro, CSV | Input from Event Hubs |
| Power BI | Any (via Synapse/Databricks) | Reads through compute layer |

Compression: Snappy, GZIP, ZSTD, and LZ4

| Algorithm | Speed | Compression Ratio | Use Case |
| --- | --- | --- | --- |
| Snappy | Fastest | Good (3-4x) | Default for Parquet/Delta. Best balance of speed and size. |
| GZIP | Slow | Best (5-8x) | Archival. Maximum compression when storage cost matters more than speed. |
| ZSTD | Fast | Very good (4-6x) | Modern alternative to GZIP. Better compression than Snappy, almost as fast. |
| LZ4 | Fastest | Moderate (2-3x) | When speed is critical and storage is cheap. |
| None | Instant | 1x (no compression) | Temporary files, debugging. |

Recommendation: Use Snappy for everything in production. Switch to ZSTD if you need better compression without sacrificing too much speed. Use GZIP only for long-term archival.
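
Setting the codec is a per-write option in Spark. A sketch (ZSTD support assumes a reasonably recent Spark runtime):

# Snappy is the Parquet default, but the codec can be set explicitly per write
df.write.option("compression", "snappy").parquet('abfss://container@storage.dfs.core.windows.net/silver/orders/')

# ZSTD trades a little write speed for a smaller storage footprint
df.write.option("compression", "zstd").parquet('abfss://container@storage.dfs.core.windows.net/archive/orders/')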

Small Files Problem and How Each Format Handles It

The Problem

Spark jobs and streaming pipelines often create thousands of tiny files (under 1 MB each). Each file requires a separate storage API call and read task, so scanning 10,000 small files can be orders of magnitude slower than scanning the same data in 10 large files.

How Each Format Handles It

| Format | Solution | How |
| --- | --- | --- |
| CSV | Manual merge | Write a script to concatenate files |
| Parquet | Coalesce in Spark | df.coalesce(10).write.parquet(path) |
| Delta | OPTIMIZE command | OPTIMIZE table_name — auto-compacts small files |
| Avro | Manual merge | Similar to CSV — no built-in solution |

Delta Lake’s OPTIMIZE is a game-changer because it compacts files without rewriting the table or causing downtime.
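
In code, the two fixes from the table look like this. A sketch: the OPTIMIZE statement assumes Databricks or a Delta Lake 2.0+ runtime with the Delta SQL extension enabled:

# Parquet: merge output into a handful of larger files before writing
df.coalesce(10).write.parquet('abfss://container@storage.dfs.core.windows.net/silver/orders/')

# Delta: compact existing small files in place, no table rewrite or downtime
spark.sql("OPTIMIZE silver.customers")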

Schema Evolution: How Each Format Handles Changes

Source system adds a new column email. What happens?

| Format | Behavior | Risk |
| --- | --- | --- |
| CSV | New column appears. Old files do not have it. Queries may break. | High — no type safety |
| JSON | New field appears naturally. Old records just do not have it. | Low — flexible by nature |
| Parquet | Old files have old schema, new files have new schema. Query engine merges them (schema merge). | Medium — requires explicit merge |
| Delta | With mergeSchema=true, new column added automatically. Old rows show NULL. Schema enforcement prevents bad data. | Low — controlled and safe |
| Avro | Schema registry manages evolution. New fields need defaults. | Low — schema registry enforces rules |

Real-World Scenarios

Scenario 1: E-Commerce Data Platform

Source:  Order API (JSON), Product CSV export, Customer database (SQL)
Bronze:  JSON files, CSV files, Parquet from ADF Copy
Silver:  Delta Lake — cleaned, deduped, joined
Gold:    Delta Lake — fact_orders, dim_customer (SCD2), dim_product
Query:   Synapse Serverless SQL or Databricks SQL

Scenario 2: IoT Sensor Data

Source:  IoT Hub/Event Hubs (Avro from capture)
Bronze:  Avro files in ADLS Gen2 (auto-captured)
Silver:  Convert to Delta, aggregate by minute/hour
Gold:    Delta — sensor_readings_hourly, alert_thresholds
Query:   Synapse Spark for ML, Power BI for dashboards

Scenario 3: Legacy Hive Migration to Azure

On-prem:  Hive tables in ORC format
Migration: ADF Copy from HDFS to ADLS Gen2 (keep ORC temporarily)
Silver:    Convert ORC to Delta Lake using Spark
Gold:      Delta — new dimensional model
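
The ORC-to-Delta step in that migration can be a single Spark job. A sketch with assumed paths:

# Read the ORC files copied from HDFS, write them back out as Delta
orc_df = spark.read.orc('abfss://bronze@storage.dfs.core.windows.net/hive/orders/')
orc_df.write.format("delta").mode("overwrite") \
    .save('abfss://silver@storage.dfs.core.windows.net/orders/')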

Common Mistakes

  1. Storing analytics data as CSV — converts a 3-second query into a 45-second query. Always convert to Parquet or Delta for analytics.

  2. Using Parquet when you need updates — Parquet does not support UPDATE/DELETE. If you need row-level changes, use Delta Lake.

  3. Ignoring the small files problem — thousands of tiny Parquet files after a Spark job. Use Delta’s OPTIMIZE or coalesce in Spark.

  4. Not enabling schema enforcement — Delta Lake can reject bad data. Turn it on. Catching a schema error at ingestion is 100x cheaper than finding corrupted data in a dashboard.

  5. Using GZIP for real-time queries — GZIP compression is slow to decompress. Use Snappy for anything that needs fast query performance.

  6. Storing production data as JSON in the analytics layer — JSON is great for landing, terrible for querying. Always convert to Parquet or Delta in Silver.

Interview Questions

Q: What is the difference between row-oriented and column-oriented formats?
A: Row-oriented formats (CSV, JSON, Avro) store data row by row — good for writing individual records. Column-oriented formats (Parquet, Delta, ORC) store data column by column — good for analytical queries that read few columns across many rows. Column formats enable column pruning and better compression.

Q: What does Delta Lake add over Parquet?
A: Delta Lake adds ACID transactions (safe concurrent writes), row-level UPDATE/DELETE, time travel (query historical versions), schema enforcement (reject bad data), MERGE (upsert), and OPTIMIZE (auto file compaction). The data files are still Parquet; Delta adds a transaction log.

Q: Which file format would you use for each layer of the Medallion architecture?
A: Bronze — whatever the source provides (CSV, JSON, Avro). Silver — Delta Lake for cleaned, deduplicated data with schema enforcement. Gold — Delta Lake for aggregated, business-ready data with MERGE for SCD and time travel for audit.

Q: What is the small files problem and how do you solve it?
A: Too many small files (under 128 MB) cause slow query performance because each file requires a separate read operation. Solutions: coalesce in Spark before writing, Delta Lake OPTIMIZE command, or schedule compaction jobs. Delta Lake handles this most elegantly.

Q: When would you use Avro instead of Parquet?
A: For streaming and messaging workloads — Kafka, Event Hubs, high-throughput data ingestion. Avro is row-oriented and optimized for fast serialization. For analytics, convert Avro to Parquet or Delta in the Silver layer.

Q: What compression algorithm do you recommend for Parquet files?
A: Snappy for most production workloads (best balance of speed and compression). ZSTD for better compression with minimal speed loss. GZIP only for long-term archival where storage cost matters more than query speed.

Wrapping Up

The file format decision comes down to three questions:

  1. Is this for landing (Bronze)? — Accept whatever the source gives you. CSV, JSON, Avro — all fine.
  2. Is this for analytics (Silver/Gold)? — Use Delta Lake. If Delta is not available, use Parquet.
  3. Is this for streaming? — Use Avro for ingestion, convert to Delta for analysis.

Delta Lake is the modern standard because it adds everything Parquet is missing: ACID transactions, updates, time travel, and schema enforcement. If your platform supports Delta (Databricks, Synapse Spark), there is almost no reason to use raw Parquet in Silver and Gold layers.

Choose your format wisely. It is the foundation that everything else sits on.

Related posts:

  • Parquet vs CSV vs JSON
  • Schema-on-Write vs Schema-on-Read
  • ADLS Gen2 Complete Guide
  • Slowly Changing Dimensions
  • AWS S3 for Data Engineers


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
