Data File Formats in Azure Explained: CSV, Parquet, Delta, Avro, ORC, and JSON — When to Use Each
Imagine you are moving to a new house. You can pack your belongings in garbage bags (CSV), organized labeled boxes (Parquet), or a professional moving system with tracking, insurance, and inventory (Delta Lake). All three get your stuff to the new house, but the experience and reliability are vastly different.
Choosing the right file format in your data lake is one of the most impactful decisions you make as a data engineer. The wrong format means slow queries, bloated storage costs, failed pipelines, and frustrated analysts. The right format means fast queries, efficient storage, reliable data, and happy stakeholders.
This post covers every file format you will encounter in Azure data engineering, with honest comparisons, real-world scenarios, and clear recommendations for each layer of your data lake.
Table of Contents
- Why File Formats Matter
- CSV (Comma-Separated Values)
- JSON (JavaScript Object Notation)
- Parquet
- Delta Lake (Delta Format)
- Avro
- ORC (Optimized Row Columnar)
- The Complete Comparison Table
- Row-Oriented vs Column-Oriented: The Fundamental Split
- Choosing Formats for the Medallion Architecture
- File Formats in Azure Services
- Compression: Snappy, GZIP, ZSTD, and LZ4
- Small Files Problem and How Each Format Handles It
- Schema Evolution: How Each Format Handles Changes
- Real-World Scenarios
- Common Mistakes
- Interview Questions
- Wrapping Up
Why File Formats Matter
Consider this query:
SELECT city, SUM(amount) FROM orders WHERE year = 2026 GROUP BY city
On CSV (10 GB file): Reads the ENTIRE 10 GB file. Every row, every column. Even though you only need city, amount, and year. Takes 45 seconds.
On Parquet (10 GB → 2 GB compressed): Reads ONLY the city, amount, and year columns. Skips all other columns entirely. Reads only 2026 partitions. Takes 3 seconds.
On Delta (same Parquet underneath + metadata): Same speed as Parquet, PLUS ACID transactions, time travel, and automatic file optimization. Takes 3 seconds with reliability guarantees.
The format determines how much data is read, how fast queries run, how much storage costs, and how reliable your pipeline is.
Real-life analogy: File formats are like book organization systems: – CSV = a handwritten notebook. Easy to start writing, but finding anything requires reading every page. – Parquet = an organized filing cabinet with labeled tabs. Jump directly to the section you need. – Delta = the filing cabinet PLUS a librarian who tracks every change, prevents conflicts, and can show you what the cabinet looked like last Tuesday.
CSV (Comma-Separated Values)
What It Is
The simplest data format. Plain text with commas separating values. Every spreadsheet, database, and programming language can read and write CSV.
customer_id,name,city,amount
1,Alice,Toronto,500
2,Bob,Mumbai,300
3,Carol,London,700
Strengths
- Universal compatibility — everything reads CSV. Excel, Python, R, ADF, Spark, Athena, even Notepad.
- Human-readable — open it in any text editor and see the data immediately.
- Easy to produce — source systems, APIs, and manual exports often output CSV.
- No special tools needed — cat, head, and grep work on CSV files.
Weaknesses
- No schema — column names are in the header (maybe), but data types are not stored. Is “2026” a string or an integer? Is “01/02/2026” January 2nd or February 1st? CSV does not know.
- No compression — raw text takes the most storage space.
- Full scan required — to read one column, you must read ALL columns (row-oriented).
- No column pruning — Athena or Spark cannot skip columns. Reads everything.
- Delimiter conflicts — what if your data contains commas? "Alice, Bob" breaks naive CSV parsers.
- No data types — everything is text. "500" and "five hundred" look the same to CSV.
When to Use CSV
- Landing zone — source systems drop CSV files. Accept them, then convert.
- Data exchange — sending data to non-technical stakeholders or legacy systems.
- Small datasets — under 100 MB where performance does not matter.
- Debugging — quick inspection with a text editor.
When NOT to Use CSV
- Analytics — use Parquet instead. CSV scans are 10-50x slower.
- Data lake storage — never store production data as CSV in Silver/Gold layers.
- Large datasets — anything over 1 GB should be Parquet or Delta.
Real-life analogy: CSV is like a handwritten letter. Anyone can read it, but it takes forever to search through a stack of 10,000 letters to find ones about a specific topic. No index, no organization, no shortcuts.
CSV in Azure
# ADF: Copy activity with CSV dataset
# Python:
import pandas as pd
df = pd.read_csv('data.csv')
# Spark:
df = spark.read.csv('abfss://container@storage.dfs.core.windows.net/data.csv',
                    header=True, inferSchema=True)
JSON (JavaScript Object Notation)
What It Is
Semi-structured data format that supports nested objects and arrays. The language of APIs, web applications, and NoSQL databases.
{
"customer_id": 1,
"name": "Alice",
"address": {
"city": "Toronto",
"country": "Canada"
},
"orders": [
{"id": 101, "amount": 500},
{"id": 102, "amount": 300}
]
}
Strengths
- Supports nesting — objects within objects, arrays within objects. Perfect for complex, hierarchical data.
- Self-describing — field names are included with every record.
- Human-readable — easy to inspect and debug.
- API standard — every REST API returns JSON.
- Schema-flexible — different records can have different fields.
Weaknesses
- Verbose — field names repeated in every record. “customer_id” appears 1 million times in 1 million records.
- No compression — raw text format (larger than CSV because of field names and syntax characters).
- Slow to query — parsing nested JSON is expensive for analytical engines.
- No column pruning — must read entire records to extract one field.
- Schema inference is messy — Spark guesses types, sometimes wrong.
JSON Lines (JSONL) — The Better Format
Standard JSON wraps everything in an array. JSONL puts one record per line:
{"customer_id": 1, "name": "Alice", "city": "Toronto"}
{"customer_id": 2, "name": "Bob", "city": "Mumbai"}
{"customer_id": 3, "name": "Carol", "city": "London"}
Why JSONL is better: Spark and ADF can read JSONL line by line without loading the entire file into memory. Standard JSON requires parsing the whole array first.
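The difference can be shown locally with the standard library: JSONL parses one line at a time, so memory use stays constant no matter how large the file is.

```python
import json
import io

jsonl = (
    '{"customer_id": 1, "name": "Alice"}\n'
    '{"customer_id": 2, "name": "Bob"}\n'
)

# Stream record by record -- each line is an independent JSON document.
# This is exactly why Spark prefers JSONL over one giant JSON array.
names = [json.loads(line)["name"] for line in io.StringIO(jsonl)]
print(names)  # ['Alice', 'Bob']
```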
When to Use JSON
- API responses — store raw API data in Bronze layer as JSON.
- Event data — IoT sensors, clickstream, application logs often produce JSON.
- Nested/complex data — when flat CSV cannot represent the structure.
- Configuration files — pipeline configs, metadata definitions.
When NOT to Use JSON
- Analytical queries — flatten to Parquet for analytics.
- Large-scale storage — JSON is verbose. Convert to Parquet in Silver layer.
- Tabular data — if your data is flat (no nesting), CSV or Parquet is simpler.
Real-life analogy: JSON is like a detailed conversation transcript. Every sentence includes context (“Alice said:”, “Bob replied:”). Rich in detail but takes a long time to search through for specific facts. Great for capturing complex interactions, poor for quick lookups.
JSON in Azure
# ADF: Copy activity with JSON dataset (supports nested)
# Python:
import json
with open('data.json') as f:
    data = json.load(f)
# Spark (JSONL):
df = spark.read.json('abfss://container@storage.dfs.core.windows.net/data.jsonl')
# Flatten nested JSON in Spark:
from pyspark.sql.functions import explode, col
df_flat = df.select("customer_id", "name", explode("orders").alias("order"))
df_final = df_flat.select("customer_id", "name", "order.id", "order.amount")
Parquet
What It Is
A columnar binary format designed for analytical workloads. Originally developed by Twitter and Cloudera in the Hadoop ecosystem, now an Apache project and the de facto standard for data lakes.
Row format (CSV): Column format (Parquet):
id, name, city, amt id column: [1, 2, 3, 4, 5]
1, Alice, Toronto, 500 name column: [Alice, Bob, Carol, Dave, Eve]
2, Bob, Mumbai, 300 city column: [Toronto, Mumbai, London, Toronto, Paris]
3, Carol, London, 700 amt column: [500, 300, 700, 200, 400]
4, Dave, Toronto, 200
5, Eve, Paris, 400
Why Columnar Matters
When you query SELECT city, SUM(amount) FROM orders:
– CSV (row format): Reads ALL columns for ALL rows. 100% of data scanned.
– Parquet (column format): Reads ONLY city and amount columns. Maybe 20% of data scanned. 5x faster, 5x cheaper.
Strengths
- Column pruning — reads only the columns your query needs. Massive performance boost.
- Compression — similar values in a column compress extremely well. 5-10x smaller than CSV.
- Embedded schema — data types are stored in the file. No guessing.
- Predicate pushdown — Spark/Athena skip row groups where min/max statistics eliminate them from the query.
- Universal support — Spark, ADF, Synapse, Athena, Redshift, BigQuery, Databricks all read Parquet natively.
- Partition pruning — combined with Hive-style partitioning, only relevant partitions are scanned.
Weaknesses
- Not human-readable — binary format. Cannot open in Notepad or Excel directly.
- Immutable files — you cannot update, delete, or even append individual rows. Any change means rewriting the file.
- No ACID transactions — concurrent writes can corrupt data. No rollback capability.
- No time travel — once overwritten, the old data is gone.
- Small files problem — many small Parquet files kill query performance.
When to Use Parquet
- Silver layer — cleaned, transformed data ready for analytics.
- Data lake storage — the default format for storing analytical data.
- Any Spark/Athena/Synapse workload — these engines are optimized for Parquet.
- Long-term archival — compressed, efficient, widely supported.
When to Consider Delta Instead
- You need updates/deletes on individual rows.
- You need ACID transactions (concurrent writes safety).
- You need time travel (query data as it was yesterday).
- You need schema enforcement (reject bad data automatically).
Real-life analogy: Parquet is like a well-organized filing cabinet with color-coded tabs. Need all the financial documents? Pull the green tab. Need all the HR documents? Pull the blue tab. You never have to search through the entire cabinet. But if you want to change one document, you have to pull the entire folder out, make the change, and put it all back.
Parquet in Azure
# ADF: Copy activity with Parquet dataset (Snappy compression)
# Python:
import pandas as pd
df = pd.read_parquet('data.parquet')
df.to_parquet('output.parquet', compression='snappy')
# Spark:
df = spark.read.parquet('abfss://container@storage.dfs.core.windows.net/silver/customers/')
df.write.parquet('abfss://container@storage.dfs.core.windows.net/gold/summary/')
Delta Lake (Delta Format)
What It Is
Delta Lake is NOT a new file format — it is Parquet files + a transaction log. The data is still stored as Parquet. Delta adds a _delta_log/ folder that tracks every change: inserts, updates, deletes, schema changes.
silver/customers/
_delta_log/
00000000000000000000.json (transaction 0: initial load)
00000000000000000001.json (transaction 1: updated 50 rows)
00000000000000000002.json (transaction 2: deleted 3 rows)
part-00000.snappy.parquet (data file)
part-00001.snappy.parquet (data file)
part-00002.snappy.parquet (new data from update)
Think of it this way: Parquet is the filing cabinet. Delta is the filing cabinet PLUS a security camera, a change log, and a librarian who prevents two people from filing at the same time.
What Delta Adds Over Parquet
| Feature | Parquet | Delta Lake |
|---|---|---|
| ACID transactions | No | Yes — concurrent writes are safe |
| UPDATE/DELETE | No — must rewrite entire file | Yes — row-level updates and deletes |
| Time travel | No | Yes — query any historical version |
| Schema enforcement | No — any schema is accepted | Yes — rejects data that does not match |
| Schema evolution | Manual | Automatic (mergeSchema option) |
| MERGE (upsert) | Not possible | Native support |
| File compaction | Manual | OPTIMIZE command auto-compacts |
| Data skipping | Basic (min/max stats) | Enhanced with Z-ORDER clustering |
ACID Transactions (The Bank Analogy)
Imagine two pipelines writing to the same table simultaneously. Without ACID:
Pipeline A: Writing 1 million rows to customers/
Pipeline B: Also writing to customers/ at the same time
Result: Corrupted files. Some rows from A, some from B, some half-written.
With Delta Lake ACID:
Pipeline A: Writing 1 million rows → Transaction 5 committed ✓
Pipeline B: Also writing → Transaction 6 committed ✓
Result: Both complete, no corruption. Each sees a consistent view.
Real-life analogy: ACID transactions are like a bank transfer. When you send $100 from checking to savings, either BOTH accounts update (debit checking AND credit savings) or NEITHER updates. You never lose $100 in the middle. Delta Lake gives your data the same guarantee.
Time Travel (The Security Camera)
-- What does the customer table look like RIGHT NOW?
SELECT * FROM delta.`/silver/customers/`
-- What did it look like YESTERDAY? (before the bad update)
SELECT * FROM delta.`/silver/customers/` VERSION AS OF 5
-- What did it look like at a specific timestamp?
SELECT * FROM delta.`/silver/customers/` TIMESTAMP AS OF '2026-04-10 14:00:00'
When time travel saves you: – Someone ran a bad UPDATE that corrupted 10,000 rows. Restore the previous version. – An auditor asks “what was Alice’s address on March 1?” Query that point in time. – A pipeline failed halfway. Roll back to the last good version.
Real-life analogy: Time travel is like having security camera footage of your filing cabinet. If someone misfiled a document, you can rewind the footage, see what it looked like before, and restore the correct state. Without Delta, once the document is misfiled, you have no record of the original state.
MERGE (The Killer Feature for Data Engineers)
MERGE combines INSERT, UPDATE, and DELETE in one atomic operation. This is how you implement SCD Type 2, incremental loads, and deduplication:
MERGE INTO silver.customers AS target
USING staging.customers AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND source.city != target.city THEN
UPDATE SET city = source.city, updated_date = current_date()
WHEN NOT MATCHED THEN
INSERT (customer_id, name, city, updated_date)
VALUES (source.customer_id, source.name, source.city, current_date())
WHEN NOT MATCHED BY SOURCE THEN
DELETE
This single statement handles: new customers (INSERT), changed customers (UPDATE), and removed customers (DELETE). All in one transaction. All or nothing.
Schema Enforcement and Evolution
# Schema enforcement: reject data with wrong schema
df_bad = spark.createDataFrame([{"customer_id": "ABC", "name": 123}]) # wrong types
df_bad.write.format("delta").mode("append").save("/silver/customers/")
# ERROR: schema mismatch. Data rejected. Table stays clean.
# Schema evolution: add new columns safely
df_new = spark.createDataFrame([{"customer_id": 1, "name": "Alice", "email": "alice@test.com"}])
df_new.write.format("delta").mode("append").option("mergeSchema", "true").save("/silver/customers/")
# Success: "email" column added to the table automatically
Real-life analogy: Schema enforcement is like a club bouncer with a dress code. Wrong outfit (wrong schema)? You are not getting in. Schema evolution is like the club updating the dress code to allow new styles — existing guests stay, new guests can wear the new style.
OPTIMIZE and Z-ORDER
-- Compact small files into larger ones (fixes small files problem)
OPTIMIZE silver.customers
-- Cluster data by a column for faster queries
OPTIMIZE silver.customers ZORDER BY (city)
After Z-ORDER by city, all “Toronto” rows are stored near each other. A query filtering WHERE city = 'Toronto' reads far fewer files.
When to Use Delta Lake
- Silver and Gold layers — the standard for modern data lakes
- Any table that needs updates/deletes — SCD, deduplication, corrections
- Concurrent pipelines — multiple writers to the same table
- Compliance/audit — time travel for regulatory requirements
- Production data platforms — reliability guarantees that Parquet cannot provide
Delta Lake in Azure
# Databricks (native Delta support):
df.write.format("delta").mode("overwrite").save("/silver/customers/")
df = spark.read.format("delta").load("/silver/customers/")
# Synapse Spark pool:
df.write.format("delta").save("abfss://container@storage.dfs.core.windows.net/silver/customers/")
# Synapse Serverless SQL:
SELECT * FROM OPENROWSET(
BULK 'abfss://container@storage.dfs.core.windows.net/silver/customers/',
FORMAT = 'DELTA'
) AS result
Avro
What It Is
A row-oriented binary format with an embedded schema. Created for the Hadoop ecosystem, now widely used with Apache Kafka and Azure Event Hubs.
Schema (JSON):
{
"type": "record",
"name": "Customer",
"fields": [
{"name": "customer_id", "type": "int"},
{"name": "name", "type": "string"},
{"name": "city", "type": "string"}
]
}
Data: binary-encoded rows following this schema
Strengths
- Schema embedded — the schema travels WITH the data. No guessing types.
- Schema evolution — add/remove fields without breaking consumers.
- Fast serialization/deserialization — optimized for write-heavy workloads.
- Compact — binary encoding is smaller than JSON.
- Kafka standard — the default format for message queues and streaming.
Weaknesses
- Row-oriented — not optimized for analytical queries (cannot skip columns).
- Not human-readable — binary format.
- Less efficient for analytics — Parquet beats Avro for SELECT queries by 5-10x.
When to Use Avro
- Streaming/messaging — Kafka, Event Hubs, data streaming pipelines
- Write-heavy workloads — fast serialization for high-throughput producers
- Schema registry — Avro schemas work with Confluent Schema Registry
- Bronze layer ingestion — capture streaming data before converting to Parquet
When NOT to Use Avro
- Analytics — use Parquet. Avro scans are much slower for column-based queries.
- Data lake storage — convert Avro to Parquet in the Silver layer.
Real-life analogy: Avro is like a package with a packing slip attached. The packing slip (schema) describes exactly what is inside. Great for shipping (streaming), but when you need to search through 1,000 packages for all items of a certain color (analytical query), you have to open every package. Parquet is a warehouse where items are sorted by color on different shelves.
Avro in Azure
# ADF: Copy activity supports Avro source/sink
# Spark:
df = spark.read.format("avro").load("abfss://container@storage.dfs.core.windows.net/events/")
# Event Hubs capture stores data as Avro automatically
ORC (Optimized Row Columnar)
What It Is
A columnar format created by Hortonworks for the Hive ecosystem. Very similar to Parquet in concept but with different internal structure.
Strengths
- Excellent compression — often slightly better than Parquet.
- Built-in indexes — bloom filters and min/max statistics per stripe.
- Hive optimized — the default format for Hive tables.
- ACID support — Hive 3+ supports ACID transactions with ORC.
ORC vs Parquet
| Feature | ORC | Parquet |
|---|---|---|
| Ecosystem | Hive/Hortonworks | Spark/Databricks/everything |
| Compression | Slightly better | Very good |
| Azure support | Supported | First-class citizen |
| Spark performance | Good | Excellent (optimized) |
| Industry adoption | Declining | Growing (dominant) |
| Recommendation | Legacy Hive migrations | Everything else |
When to Use ORC
- Migrating from on-premises Hive — your existing Hive tables are ORC, keep them.
- Hive-specific workloads — Hive queries on ORC can be slightly faster than on Parquet.
When NOT to Use ORC
- New projects — use Parquet or Delta. Better ecosystem support.
- Spark workloads — Spark is more optimized for Parquet.
- Azure-native services — Synapse, ADF, Databricks all prefer Parquet/Delta.
Real-life analogy: ORC is like Betamax. Technically excellent, but VHS (Parquet) won the format war through wider adoption. If you already have a Betamax collection (Hive tables), keep using it. But for new purchases (new projects), go with the standard.
The Complete Comparison Table
| Feature | CSV | JSON | Parquet | Delta | Avro | ORC |
|---|---|---|---|---|---|---|
| Type | Row, text | Row, text | Column, binary | Column + log | Row, binary | Column, binary |
| Human-readable | Yes | Yes | No | No | No | No |
| Schema embedded | No | Partial | Yes | Yes | Yes | Yes |
| Compression | None (manual gzip) | None (manual gzip) | Snappy/GZIP/ZSTD | Snappy/GZIP/ZSTD | Deflate/Snappy | ZLIB/Snappy/LZO |
| Column pruning | No | No | Yes | Yes | No | Yes |
| UPDATE/DELETE | No | No | No | Yes | No | Hive 3+ only |
| ACID transactions | No | No | No | Yes | No | Hive 3+ only |
| Time travel | No | No | No | Yes | No | No |
| MERGE | No | No | No | Yes | No | Hive 3+ only |
| Schema evolution | No | Flexible | Manual | Auto (mergeSchema) | Yes | Yes |
| Best for | Landing zone | APIs, events | Analytics | Production DL | Streaming | Legacy Hive |
| 10 GB query speed | 45 sec | 60 sec | 3 sec | 3 sec | 30 sec | 4 sec |
| 10 GB storage size | 10 GB | 12 GB | 2 GB | 2 GB | 4 GB | 1.8 GB |
Row-Oriented vs Column-Oriented: The Fundamental Split
This is the single most important concept in file formats:
Row-Oriented (CSV, JSON, Avro)
Data is stored row by row:
Row 1: [1, Alice, Toronto, 500]
Row 2: [2, Bob, Mumbai, 300]
Row 3: [3, Carol, London, 700]
Great for: Writing individual records (streaming, transactions). Each row is self-contained.
Bad for: Analytical queries that touch few columns but many rows. Must read ALL columns to get one.
Column-Oriented (Parquet, Delta, ORC)
Data is stored column by column:
Column 'id': [1, 2, 3]
Column 'name': [Alice, Bob, Carol]
Column 'city': [Toronto, Mumbai, London]
Column 'amount': [500, 300, 700]
Great for: Analytical queries. SELECT city, SUM(amount) reads only 2 columns out of 4. Plus, similar values in a column compress better.
Bad for: Writing individual records (must update multiple column files).
Real-life analogy: – Row-oriented = a phone book sorted by person. Great for finding Alice’s number. Bad for finding all people in Toronto. – Column-oriented = a spreadsheet with one column per attribute. Great for finding all people in Toronto (scan one column). Bad for finding everything about Alice (scan every column).
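The split can be sketched in plain Python — same data, two layouts:

```python
# Row layout (CSV/JSON/Avro style): each record is self-contained
rows = [
    {"id": 1, "city": "Toronto", "amount": 500},
    {"id": 2, "city": "Mumbai", "amount": 300},
    {"id": 3, "city": "London", "amount": 700},
]
# Summing one column still touches every full record
row_total = sum(r["amount"] for r in rows)

# Column layout (Parquet/ORC style): one contiguous array per column
cols = {
    "id": [1, 2, 3],
    "city": ["Toronto", "Mumbai", "London"],
    "amount": [500, 300, 700],
}
# Summing one column reads exactly one array -- the others are never touched
col_total = sum(cols["amount"])

print(row_total, col_total)  # 1500 1500
```

Same answer either way, but the column layout reads a fraction of the bytes — and a sorted array of amounts compresses far better than interleaved mixed-type rows.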
Choosing Formats for the Medallion Architecture
BRONZE (Raw Ingestion)
Format: Whatever the source provides (CSV, JSON, Avro)
Why: Accept data as-is, do not transform yet
Example: API response → JSON, database export → CSV, Kafka → Avro
SILVER (Cleaned + Transformed)
Format: Delta Lake (or Parquet if Delta is not available)
Why: Need schema enforcement, updates for corrections, data quality
Example: Cleaned customers (deduped, nulls handled, types enforced)
GOLD (Business-Ready)
Format: Delta Lake
Why: Need MERGE for SCD, time travel for audit, ACID for reliability
Example: dim_customer (SCD Type 2), fact_sales (daily aggregation)
The Simple Rule
Source drops CSV/JSON → Bronze (keep as-is)
→ Silver (convert to Delta)
→ Gold (Delta with MERGE and OPTIMIZE)
File Formats in Azure Services
| Service | Preferred Formats | Notes |
|---|---|---|
| ADF Copy Activity | CSV, Parquet, JSON, Avro, ORC | Format conversion during copy |
| ADF Data Flows | All formats | Spark engine reads everything |
| Synapse Spark Pool | Delta, Parquet | Delta is first-class citizen |
| Synapse Serverless SQL | Parquet, Delta, CSV | OPENROWSET function |
| Synapse Dedicated SQL | Parquet (via COPY INTO) | Fastest bulk load format |
| Databricks | Delta (native) | Delta is the default |
| Azure HDInsight | ORC, Parquet, Avro | Hive prefers ORC |
| Azure Stream Analytics | JSON, Avro, CSV | Input from Event Hubs |
| Power BI | Any (via Synapse/Databricks) | Reads through compute layer |
Compression: Snappy, GZIP, ZSTD, and LZ4
| Algorithm | Speed | Compression Ratio | Use Case |
|---|---|---|---|
| Snappy | Fastest | Good (3-4x) | Default for Parquet/Delta. Best balance of speed and size. |
| GZIP | Slow | Best (5-8x) | Archival. Maximum compression when storage cost matters more than speed. |
| ZSTD | Fast | Very Good (4-6x) | Modern alternative to GZIP. Better compression than Snappy, almost as fast. |
| LZ4 | Fastest | Moderate (2-3x) | When speed is critical and storage is cheap. |
| None | Instant | 1x (no compression) | Temporary files, debugging. |
Recommendation: Use Snappy for everything in production. Switch to ZSTD if you need better compression without sacrificing too much speed. Use GZIP only for long-term archival.
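Snappy and ZSTD are not in the Python standard library, but zlib's compression levels (a stand-in illustration, not the actual codecs) show the same speed-versus-ratio tradeoff on column-like repetitive data:

```python
import zlib

# Repetitive data, like a sorted column in Parquet, compresses extremely well
data = b"Toronto,Mumbai,London," * 50_000

fast = zlib.compress(data, level=1)  # speed-first, in the spirit of Snappy/LZ4
best = zlib.compress(data, level=9)  # ratio-first, in the spirit of GZIP

print(len(data), len(fast), len(best))
```

The higher level squeezes out more bytes but burns more CPU per write and per read — which is exactly why Snappy wins for hot analytical data and GZIP-class compression is reserved for archives.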
Small Files Problem and How Each Format Handles It
The Problem
Spark jobs and streaming pipelines often create thousands of tiny files (a few MB each or less). Every file requires a separate API call to open and read, so scanning 10,000 small files can be orders of magnitude slower than scanning 10 large files holding the same data.
How Each Format Handles It
| Format | Solution | How |
|---|---|---|
| CSV | Manual merge | Write a script to concatenate files |
| Parquet | Coalesce in Spark | df.coalesce(10).write.parquet(path) |
| Delta | OPTIMIZE command | OPTIMIZE table_name — auto-compacts small files |
| Avro | Manual merge | Similar to CSV — no built-in solution |
Delta Lake’s OPTIMIZE is a game-changer because it compacts files without rewriting the table or causing downtime.
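For the coalesce approach, a hypothetical helper (not from any library) shows the usual rule of thumb — pick a partition count so output files land near the ~128 MB sweet spot:

```python
def target_partitions(total_size_bytes: int, target_file_mb: int = 128) -> int:
    """Rule-of-thumb partition count so each output file is ~target_file_mb."""
    target = target_file_mb * 1024 * 1024
    return max(1, -(-total_size_bytes // target))  # ceiling division

# 10 GB of data -> 80 output files of roughly 128 MB each
n = target_partitions(10 * 1024**3)
print(n)  # 80
# In Spark this would feed something like df.coalesce(n).write.parquet(path)
```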
Schema Evolution: How Each Format Handles Changes
Source system adds a new column email. What happens?
| Format | Behavior | Risk |
|---|---|---|
| CSV | New column appears. Old files do not have it. Queries may break. | High — no type safety |
| JSON | New field appears naturally. Old records just do not have it. | Low — flexible by nature |
| Parquet | Old files have old schema, new files have new schema. Query engine merges them (schema merge). | Medium — requires explicit merge |
| Delta | With mergeSchema=true, new column added automatically. Old rows show NULL. Schema enforcement prevents bad data. | Low — controlled and safe |
| Avro | Schema registry manages evolution. New fields need defaults. | Low — schema registry enforces rules |
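What mergeSchema does conceptually can be shown locally with pandas (a sketch of the behavior, not the Delta implementation): the schemas are unioned and rows written before the new column existed surface as NULL.

```python
import pandas as pd

old = pd.DataFrame({"customer_id": [1], "name": ["Alice"]})
new = pd.DataFrame({"customer_id": [2], "name": ["Bob"], "email": ["bob@test.com"]})

# Union the schemas; the old row gets NULL (NaN) for the new 'email' column
merged = pd.concat([old, new], ignore_index=True)
print(merged["email"].isna().tolist())  # [True, False]
```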
Real-World Scenarios
Scenario 1: E-Commerce Data Platform
Source: Order API (JSON), Product CSV export, Customer database (SQL)
Bronze: JSON files, CSV files, Parquet from ADF Copy
Silver: Delta Lake — cleaned, deduped, joined
Gold: Delta Lake — fact_orders, dim_customer (SCD2), dim_product
Query: Synapse Serverless SQL or Databricks SQL
Scenario 2: IoT Sensor Data
Source: IoT Hub/Event Hubs (Avro from capture)
Bronze: Avro files in ADLS Gen2 (auto-captured)
Silver: Convert to Delta, aggregate by minute/hour
Gold: Delta — sensor_readings_hourly, alert_thresholds
Query: Synapse Spark for ML, Power BI for dashboards
Scenario 3: Legacy Hive Migration to Azure
On-prem: Hive tables in ORC format
Migration: ADF Copy from HDFS to ADLS Gen2 (keep ORC temporarily)
Silver: Convert ORC to Delta Lake using Spark
Gold: Delta — new dimensional model
Common Mistakes
- Storing analytics data as CSV — converts a 3-second query into a 45-second query. Always convert to Parquet or Delta for analytics.
- Using Parquet when you need updates — Parquet does not support UPDATE/DELETE. If you need row-level changes, use Delta Lake.
- Ignoring the small files problem — thousands of tiny Parquet files after a Spark job. Use Delta's OPTIMIZE or coalesce in Spark.
- Not enabling schema enforcement — Delta Lake can reject bad data. Turn it on. Catching a schema error at ingestion is 100x cheaper than finding corrupted data in a dashboard.
- Using GZIP for real-time queries — GZIP compression is slow to decompress. Use Snappy for anything that needs fast query performance.
- Storing production data as JSON in the analytics layer — JSON is great for landing, terrible for querying. Always convert to Parquet or Delta in Silver.
Interview Questions
Q: What is the difference between row-oriented and column-oriented formats? A: Row-oriented formats (CSV, JSON, Avro) store data row by row — good for writing individual records. Column-oriented formats (Parquet, Delta, ORC) store data column by column — good for analytical queries that read few columns across many rows. Column formats enable column pruning and better compression.
Q: What does Delta Lake add over Parquet? A: Delta Lake adds ACID transactions (safe concurrent writes), row-level UPDATE/DELETE, time travel (query historical versions), schema enforcement (reject bad data), MERGE (upsert), and OPTIMIZE (auto file compaction). The data files are still Parquet; Delta adds a transaction log.
Q: Which file format would you use for each layer of the Medallion architecture? A: Bronze — whatever the source provides (CSV, JSON, Avro). Silver — Delta Lake for cleaned, deduplicated data with schema enforcement. Gold — Delta Lake for aggregated, business-ready data with MERGE for SCD and time travel for audit.
Q: What is the small files problem and how do you solve it? A: Too many small files (under 128 MB) cause slow query performance because each file requires a separate read operation. Solutions: coalesce in Spark before writing, Delta Lake OPTIMIZE command, or schedule compaction jobs. Delta Lake handles this most elegantly.
Q: When would you use Avro instead of Parquet? A: For streaming and messaging workloads — Kafka, Event Hubs, high-throughput data ingestion. Avro is row-oriented and optimized for fast serialization. For analytics, convert Avro to Parquet or Delta in the Silver layer.
Q: What compression algorithm do you recommend for Parquet files? A: Snappy for most production workloads (best balance of speed and compression). ZSTD for better compression with minimal speed loss. GZIP only for long-term archival where storage cost matters more than query speed.
Wrapping Up
The file format decision comes down to three questions:
- Is this for landing (Bronze)? — Accept whatever the source gives you. CSV, JSON, Avro — all fine.
- Is this for analytics (Silver/Gold)? — Use Delta Lake. If Delta is not available, use Parquet.
- Is this for streaming? — Use Avro for ingestion, convert to Delta for analysis.
Delta Lake is the modern standard because it adds everything Parquet is missing: ACID transactions, updates, time travel, and schema enforcement. If your platform supports Delta (Databricks, Synapse Spark), there is almost no reason to use raw Parquet in Silver and Gold layers.
Choose your format wisely. It is the foundation that everything else sits on.
Related posts: – Parquet vs CSV vs JSON – Schema-on-Write vs Schema-on-Read – ADLS Gen2 Complete Guide – Slowly Changing Dimensions – AWS S3 for Data Engineers
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.