Data File Formats in Azure Explained: CSV, Parquet, Delta, Avro, ORC, and JSON — When to Use Each
Imagine you are moving to a new house. You can pack your belongings in garbage bags (CSV), organized labeled boxes (Parquet), or a professional moving system with tracking, insurance, and inventory (Delta Lake). All three get your stuff to the new house, but the experience and reliability are vastly different.
Choosing the right file format in your data lake is one of the most impactful decisions you make as a data engineer. The wrong format means slow queries, bloated storage costs, failed pipelines, and frustrated analysts. The right format means fast queries, efficient storage, reliable data, and happy stakeholders.
This post covers every file format you will encounter in Azure data engineering, with honest comparisons, real-world scenarios, and clear recommendations for each layer of your data lake.
Table of Contents
- Why File Formats Matter
- CSV (Comma-Separated Values)
- JSON (JavaScript Object Notation)
- Parquet
- Delta Lake (Delta Format)
- Avro
- ORC (Optimized Row Columnar)
- The Complete Comparison Table
- Row-Oriented vs Column-Oriented: The Fundamental Split
- Choosing Formats for the Medallion Architecture
- File Formats in Azure Services
- Compression: Snappy, GZIP, ZSTD, and LZ4
- Small Files Problem and How Each Format Handles It
- Schema Evolution: How Each Format Handles Changes
- Real-World Scenarios
- Common Mistakes
- Interview Questions
- Wrapping Up
Why File Formats Matter
Consider this query:
SELECT city, SUM(amount) FROM orders WHERE year = 2026 GROUP BY city
On CSV (10 GB file): Reads the ENTIRE 10 GB file. Every row, every column. Even though you only need city, amount, and year. Takes 45 seconds.
On Parquet (10 GB → 2 GB compressed): Reads ONLY the city, amount, and year columns. Skips all other columns entirely. Reads only 2026 partitions. Takes 3 seconds.
On Delta (same Parquet underneath + metadata): Same speed as Parquet, PLUS ACID transactions, time travel, and automatic file optimization. Takes 3 seconds with reliability guarantees.
The format determines how much data is read, how fast queries run, how much storage costs, and how reliable your pipeline is.
Real-life analogy: File formats are like book organization systems: – CSV = a handwritten notebook. Easy to start writing, but finding anything requires reading every page. – Parquet = an organized filing cabinet with labeled tabs. Jump directly to the section you need. – Delta = the filing cabinet PLUS a librarian who tracks every change, prevents conflicts, and can show you what the cabinet looked like last Tuesday.
CSV (Comma-Separated Values)
What It Is
The simplest data format. Plain text with commas separating values. Every spreadsheet, database, and programming language can read and write CSV.
customer_id,name,city,amount
1,Alice,Toronto,500
2,Bob,Mumbai,300
3,Carol,London,700
Strengths
- Universal compatibility — everything reads CSV. Excel, Python, R, ADF, Spark, Athena, even Notepad.
- Human-readable — open it in any text editor and see the data immediately.
- Easy to produce — source systems, APIs, and manual exports often output CSV.
- No special tools needed — cat, head, and grep work on CSV files.
Weaknesses
- No schema — column names are in the header (maybe), but data types are not stored. Is “2026” a string or an integer? Is “01/02/2026” January 2nd or February 1st? CSV does not know.
- No compression — raw text takes the most storage space.
- Full scan required — to read one column, you must read ALL columns (row-oriented).
- No column pruning — Athena or Spark cannot skip columns. Reads everything.
- Delimiter conflicts — what if your data contains commas? "Alice, Bob" breaks naive CSV parsers.
- No data types — everything is text. "500" and "five hundred" look the same to CSV.
When to Use CSV
- Landing zone — source systems drop CSV files. Accept them, then convert.
- Data exchange — sending data to non-technical stakeholders or legacy systems.
- Small datasets — under 100 MB where performance does not matter.
- Debugging — quick inspection with a text editor.
When NOT to Use CSV
- Analytics — use Parquet instead. CSV scans are 10-50x slower.
- Data lake storage — never store production data as CSV in Silver/Gold layers.
- Large datasets — anything over 1 GB should be Parquet or Delta.
Real-life analogy: CSV is like a handwritten letter. Anyone can read it, but it takes forever to search through a stack of 10,000 letters to find ones about a specific topic. No index, no organization, no shortcuts.
CSV in Azure
# ADF: Copy activity with CSV dataset
# Python:
import pandas as pd
df = pd.read_csv('data.csv')
# Spark:
df = spark.read.csv('abfss://container@storage.dfs.core.windows.net/data.csv',
                    header=True, inferSchema=True)
JSON (JavaScript Object Notation)
What It Is
Semi-structured data format that supports nested objects and arrays. The language of APIs, web applications, and NoSQL databases.
{
"customer_id": 1,
"name": "Alice",
"address": {
"city": "Toronto",
"country": "Canada"
},
"orders": [
{"id": 101, "amount": 500},
{"id": 102, "amount": 300}
]
}
Strengths
- Supports nesting — objects within objects, arrays within objects. Perfect for complex, hierarchical data.
- Self-describing — field names are included with every record.
- Human-readable — easy to inspect and debug.
- API standard — every REST API returns JSON.
- Schema-flexible — different records can have different fields.
Weaknesses
- Verbose — field names repeated in every record. “customer_id” appears 1 million times in 1 million records.
- No compression — raw text format (larger than CSV because of field names and syntax characters).
- Slow to query — parsing nested JSON is expensive for analytical engines.
- No column pruning — must read entire records to extract one field.
- Schema inference is messy — Spark guesses types, sometimes wrong.
JSON Lines (JSONL) — The Better Format
Standard JSON wraps everything in an array. JSONL puts one record per line:
{"customer_id": 1, "name": "Alice", "city": "Toronto"}
{"customer_id": 2, "name": "Bob", "city": "Mumbai"}
{"customer_id": 3, "name": "Carol", "city": "London"}
Why JSONL is better: Spark and ADF can read JSONL line by line without loading the entire file into memory. Standard JSON requires parsing the whole array first.
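The difference can be shown locally with the standard library: JSONL parses one line at a time, so memory use stays constant no matter how large the file is.

```python
import json
import io

jsonl = (
    '{"customer_id": 1, "name": "Alice"}\n'
    '{"customer_id": 2, "name": "Bob"}\n'
)

# Stream record by record -- each line is an independent JSON document.
# This is exactly why Spark prefers JSONL over one giant JSON array.
names = [json.loads(line)["name"] for line in io.StringIO(jsonl)]
print(names)  # ['Alice', 'Bob']
```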
When to Use JSON
- API responses — store raw API data in Bronze layer as JSON.
- Event data — IoT sensors, clickstream, application logs often produce JSON.
- Nested/complex data — when flat CSV cannot represent the structure.
- Configuration files — pipeline configs, metadata definitions.
When NOT to Use JSON
- Analytical queries — flatten to Parquet for analytics.
- Large-scale storage — JSON is verbose. Convert to Parquet in Silver layer.
- Tabular data — if your data is flat (no nesting), CSV or Parquet is simpler.
Real-life analogy: JSON is like a detailed conversation transcript. Every sentence includes context (“Alice said:”, “Bob replied:”). Rich in detail but takes a long time to search through for specific facts. Great for capturing complex interactions, poor for quick lookups.
JSON in Azure
# ADF: Copy activity with JSON dataset (supports nested)
# Python:
import json
with open('data.json') as f:
    data = json.load(f)
# Spark (JSONL):
df = spark.read.json('abfss://container@storage.dfs.core.windows.net/data.jsonl')
# Flatten nested JSON in Spark:
from pyspark.sql.functions import explode, col
df_flat = df.select("customer_id", "name", explode("orders").alias("order"))
df_final = df_flat.select("customer_id", "name", "order.id", "order.amount")
Parquet
What It Is
A columnar binary format designed for analytical workloads. Originally developed by Twitter and Cloudera in the Hadoop ecosystem, now an Apache project and the de facto standard for data lakes.
Row format (CSV): Column format (Parquet):
id, name, city, amt id column: [1, 2, 3, 4, 5]
1, Alice, Toronto, 500 name column: [Alice, Bob, Carol, Dave, Eve]
2, Bob, Mumbai, 300 city column: [Toronto, Mumbai, London, Toronto, Paris]
3, Carol, London, 700 amt column: [500, 300, 700, 200, 400]
4, Dave, Toronto, 200
5, Eve, Paris, 400
Why Columnar Matters
When you query SELECT city, SUM(amount) FROM orders:
– CSV (row format): Reads ALL columns for ALL rows. 100% of data scanned.
– Parquet (column format): Reads ONLY city and amount columns. Maybe 20% of data scanned. 5x faster, 5x cheaper.
Strengths
- Column pruning — reads only the columns your query needs. Massive performance boost.
- Compression — similar values in a column compress extremely well. 5-10x smaller than CSV.
- Embedded schema — data types are stored in the file. No guessing.
- Predicate pushdown — Spark/Athena skip row groups where min/max statistics eliminate them from the query.
- Universal support — Spark, ADF, Synapse, Athena, Redshift, BigQuery, Databricks all read Parquet natively.
- Partition pruning — combined with Hive-style partitioning, only relevant partitions are scanned.
Weaknesses
- Not human-readable — binary format. Cannot open in Notepad or Excel directly.
- Immutable files — you cannot update, delete, or even append individual rows. Any change means rewriting the file.
- No ACID transactions — concurrent writes can corrupt data. No rollback capability.
- No time travel — once overwritten, the old data is gone.
- Small files problem — many small Parquet files kill query performance.
When to Use Parquet
- Silver layer — cleaned, transformed data ready for analytics.
- Data lake storage — the default format for storing analytical data.
- Any Spark/Athena/Synapse workload — these engines are optimized for Parquet.
- Long-term archival — compressed, efficient, widely supported.
When to Consider Delta Instead
- You need updates/deletes on individual rows.
- You need ACID transactions (concurrent writes safety).
- You need time travel (query data as it was yesterday).
- You need schema enforcement (reject bad data automatically).
Real-life analogy: Parquet is like a well-organized filing cabinet with color-coded tabs. Need all the financial documents? Pull the green tab. Need all the HR documents? Pull the blue tab. You never have to search through the entire cabinet. But if you want to change one document, you have to pull the entire folder out, make the change, and put it all back.
Parquet in Azure
# ADF: Copy activity with Parquet dataset (Snappy compression)
# Python:
import pandas as pd
df = pd.read_parquet('data.parquet')
df.to_parquet('output.parquet', compression='snappy')
# Spark:
df = spark.read.parquet('abfss://container@storage.dfs.core.windows.net/silver/customers/')
df.write.parquet('abfss://container@storage.dfs.core.windows.net/gold/summary/')
Delta Lake (Delta Format)
What It Is
Delta Lake is NOT a new file format — it is Parquet files + a transaction log. The data is still stored as Parquet. Delta adds a _delta_log/ folder that tracks every change: inserts, updates, deletes, schema changes.
silver/customers/
_delta_log/
00000000000000000000.json (transaction 0: initial load)
00000000000000000001.json (transaction 1: updated 50 rows)
00000000000000000002.json (transaction 2: deleted 3 rows)
part-00000.snappy.parquet (data file)
part-00001.snappy.parquet (data file)
part-00002.snappy.parquet (new data from update)
Think of it this way: Parquet is the filing cabinet. Delta is the filing cabinet PLUS a security camera, a change log, and a librarian who prevents two people from filing at the same time.
What Delta Adds Over Parquet
| Feature | Parquet | Delta Lake |
|---|---|---|
| ACID transactions | No | Yes — concurrent writes are safe |
| UPDATE/DELETE | No — must rewrite entire file | Yes — row-level updates and deletes |
| Time travel | No | Yes — query any historical version |
| Schema enforcement | No — any schema is accepted | Yes — rejects data that does not match |
| Schema evolution | Manual | Automatic (mergeSchema option) |
| MERGE (upsert) | Not possible | Native support |
| File compaction | Manual | OPTIMIZE command auto-compacts |
| Data skipping | Basic (min/max stats) | Enhanced with Z-ORDER clustering |
ACID Transactions (The Bank Analogy)
Imagine two pipelines writing to the same table simultaneously. Without ACID:
Pipeline A: Writing 1 million rows to customers/
Pipeline B: Also writing to customers/ at the same time
Result: Corrupted files. Some rows from A, some from B, some half-written.
With Delta Lake ACID:
Pipeline A: Writing 1 million rows → Transaction 5 committed ✓
Pipeline B: Also writing → Transaction 6 committed ✓
Result: Both complete, no corruption. Each sees a consistent view.
Real-life analogy: ACID transactions are like a bank transfer. When you send $100 from checking to savings, either BOTH accounts update (debit checking AND credit savings) or NEITHER updates. You never lose $100 in the middle. Delta Lake gives your data the same guarantee.
Time Travel (The Security Camera)
-- What does the customer table look like RIGHT NOW?
SELECT * FROM delta.`/silver/customers/`
-- What did it look like YESTERDAY? (before the bad update)
SELECT * FROM delta.`/silver/customers/` VERSION AS OF 5
-- What did it look like at a specific timestamp?
SELECT * FROM delta.`/silver/customers/` TIMESTAMP AS OF '2026-04-10 14:00:00'
When time travel saves you: – Someone ran a bad UPDATE that corrupted 10,000 rows. Restore the previous version. – An auditor asks “what was Alice’s address on March 1?” Query that point in time. – A pipeline failed halfway. Roll back to the last good version.
Real-life analogy: Time travel is like having security camera footage of your filing cabinet. If someone misfiled a document, you can rewind the footage, see what it looked like before, and restore the correct state. Without Delta, once the document is misfiled, you have no record of the original state.
MERGE (The Killer Feature for Data Engineers)
MERGE combines INSERT, UPDATE, and DELETE in one atomic operation. This is how you implement SCD Type 2, incremental loads, and deduplication:
MERGE INTO silver.customers AS target
USING staging.customers AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND source.city != target.city THEN
UPDATE SET city = source.city, updated_date = current_date()
WHEN NOT MATCHED THEN
INSERT (customer_id, name, city, updated_date)
VALUES (source.customer_id, source.name, source.city, current_date())
WHEN NOT MATCHED BY SOURCE THEN
DELETE
This single statement handles: new customers (INSERT), changed customers (UPDATE), and removed customers (DELETE). All in one transaction. All or nothing.
Schema Enforcement and Evolution
# Schema enforcement: reject data with wrong schema
df_bad = spark.createDataFrame([{"customer_id": "ABC", "name": 123}]) # wrong types
df_bad.write.format("delta").mode("append").save("/silver/customers/")
# ERROR: schema mismatch. Data rejected. Table stays clean.
# Schema evolution: add new columns safely
df_new = spark.createDataFrame([{"customer_id": 1, "name": "Alice", "email": "alice@test.com"}])
df_new.write.format("delta").mode("append").option("mergeSchema", "true").save("/silver/customers/")
# Success: "email" column added to the table automatically
Real-life analogy: Schema enforcement is like a club bouncer with a dress code. Wrong outfit (wrong schema)? You are not getting in. Schema evolution is like the club updating the dress code to allow new styles — existing guests stay, new guests can wear the new style.
OPTIMIZE and Z-ORDER
-- Compact small files into larger ones (fixes small files problem)
OPTIMIZE silver.customers
-- Cluster data by a column for faster queries
OPTIMIZE silver.customers ZORDER BY (city)
After Z-ORDER by city, all “Toronto” rows are stored near each other. A query filtering WHERE city = 'Toronto' reads far fewer files.
When to Use Delta Lake
- Silver and Gold layers — the standard for modern data lakes
- Any table that needs updates/deletes — SCD, deduplication, corrections
- Concurrent pipelines — multiple writers to the same table
- Compliance/audit — time travel for regulatory requirements
- Production data platforms — reliability guarantees that Parquet cannot provide
Delta Lake in Azure
# Databricks (native Delta support):
df.write.format("delta").mode("overwrite").save("/silver/customers/")
df = spark.read.format("delta").load("/silver/customers/")
# Synapse Spark pool:
df.write.format("delta").save("abfss://container@storage.dfs.core.windows.net/silver/customers/")
# Synapse Serverless SQL:
SELECT * FROM OPENROWSET(
BULK 'abfss://container@storage.dfs.core.windows.net/silver/customers/',
FORMAT = 'DELTA'
) AS result
Avro
What It Is
A row-oriented binary format with an embedded schema. Created for the Hadoop ecosystem, now widely used with Apache Kafka and Azure Event Hubs.
Schema (JSON):
{
"type": "record",
"name": "Customer",
"fields": [
{"name": "customer_id", "type": "int"},
{"name": "name", "type": "string"},
{"name": "city", "type": "string"}
]
}
Data: binary-encoded rows following this schema
Strengths
- Schema embedded — the schema travels WITH the data. No guessing types.
- Schema evolution — add/remove fields without breaking consumers.
- Fast serialization/deserialization — optimized for write-heavy workloads.
- Compact — binary encoding is smaller than JSON.
- Kafka standard — the default format for message queues and streaming.
Weaknesses
- Row-oriented — not optimized for analytical queries (cannot skip columns).
- Not human-readable — binary format.
- Less efficient for analytics — Parquet beats Avro for SELECT queries by 5-10x.
When to Use Avro
- Streaming/messaging — Kafka, Event Hubs, data streaming pipelines
- Write-heavy workloads — fast serialization for high-throughput producers
- Schema registry — Avro schemas work with Confluent Schema Registry
- Bronze layer ingestion — capture streaming data before converting to Parquet
When NOT to Use Avro
- Analytics — use Parquet. Avro scans are much slower for column-based queries.
- Data lake storage — convert Avro to Parquet in the Silver layer.
Real-life analogy: Avro is like a package with a packing slip attached. The packing slip (schema) describes exactly what is inside. Great for shipping (streaming), but when you need to search through 1,000 packages for all items of a certain color (analytical query), you have to open every package. Parquet is a warehouse where items are sorted by color on different shelves.
Avro in Azure
# ADF: Copy activity supports Avro source/sink
# Spark:
df = spark.read.format("avro").load("abfss://container@storage.dfs.core.windows.net/events/")
# Event Hubs capture stores data as Avro automatically
ORC (Optimized Row Columnar)
What It Is
A columnar format created by Hortonworks for the Hive ecosystem. Very similar to Parquet in concept but with different internal structure.
Strengths
- Excellent compression — often slightly better than Parquet.
- Built-in indexes — bloom filters and min/max statistics per stripe.
- Hive optimized — the default format for Hive tables.
- ACID support — Hive 3+ supports ACID transactions with ORC.
ORC vs Parquet
| Feature | ORC | Parquet |
|---|---|---|
| Ecosystem | Hive/Hortonworks | Spark/Databricks/everything |
| Compression | Slightly better | Very good |
| Azure support | Supported | First-class citizen |
| Spark performance | Good | Excellent (optimized) |
| Industry adoption | Declining | Growing (dominant) |
| Recommendation | Legacy Hive migrations | Everything else |
When to Use ORC
- Migrating from on-premises Hive — your existing Hive tables are ORC, keep them.
- Hive-specific workloads — Hive queries on ORC can be slightly faster than on Parquet.
When NOT to Use ORC
- New projects — use Parquet or Delta. Better ecosystem support.
- Spark workloads — Spark is more optimized for Parquet.
- Azure-native services — Synapse, ADF, Databricks all prefer Parquet/Delta.
Real-life analogy: ORC is like Betamax. Technically excellent, but VHS (Parquet) won the format war through wider adoption. If you already have a Betamax collection (Hive tables), keep using it. But for new purchases (new projects), go with the standard.
The Complete Comparison Table
| Feature | CSV | JSON | Parquet | Delta | Avro | ORC |
|---|---|---|---|---|---|---|
| Type | Row, text | Row, text | Column, binary | Column + log | Row, binary | Column, binary |
| Human-readable | Yes | Yes | No | No | No | No |
| Schema embedded | No | Partial | Yes | Yes | Yes | Yes |
| Compression | None (manual gzip) | None (manual gzip) | Snappy/GZIP/ZSTD | Snappy/GZIP/ZSTD | Deflate/Snappy | ZLIB/Snappy/LZO |
| Column pruning | No | No | Yes | Yes | No | Yes |
| UPDATE/DELETE | No | No | No | Yes | No | Hive 3+ only |
| ACID transactions | No | No | No | Yes | No | Hive 3+ only |
| Time travel | No | No | No | Yes | No | No |
| MERGE | No | No | No | Yes | No | Hive 3+ only |
| Schema evolution | No | Flexible | Manual | Auto (mergeSchema) | Yes | Yes |
| Best for | Landing zone | APIs, events | Analytics | Production DL | Streaming | Legacy Hive |
| 10 GB query speed | 45 sec | 60 sec | 3 sec | 3 sec | 30 sec | 4 sec |
| 10 GB storage size | 10 GB | 12 GB | 2 GB | 2 GB | 4 GB | 1.8 GB |
Row-Oriented vs Column-Oriented: The Fundamental Split
This is the single most important concept in file formats:
Row-Oriented (CSV, JSON, Avro)
Data is stored row by row:
Row 1: [1, Alice, Toronto, 500]
Row 2: [2, Bob, Mumbai, 300]
Row 3: [3, Carol, London, 700]
Great for: Writing individual records (streaming, transactions). Each row is self-contained.
Bad for: Analytical queries that touch few columns but many rows. Must read ALL columns to get one.
Column-Oriented (Parquet, Delta, ORC)
Data is stored column by column:
Column 'id': [1, 2, 3]
Column 'name': [Alice, Bob, Carol]
Column 'city': [Toronto, Mumbai, London]
Column 'amount': [500, 300, 700]
Great for: Analytical queries. SELECT city, SUM(amount) reads only 2 columns out of 4. Plus, similar values in a column compress better.
Bad for: Writing individual records (must update multiple column files).
Real-life analogy: – Row-oriented = a phone book sorted by person. Great for finding Alice’s number. Bad for finding all people in Toronto. – Column-oriented = a spreadsheet with one column per attribute. Great for finding all people in Toronto (scan one column). Bad for finding everything about Alice (scan every column).
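The split can be sketched in plain Python — same data, two layouts:

```python
# Row layout (CSV/JSON/Avro style): each record is self-contained
rows = [
    {"id": 1, "city": "Toronto", "amount": 500},
    {"id": 2, "city": "Mumbai", "amount": 300},
    {"id": 3, "city": "London", "amount": 700},
]
# Summing one column still touches every full record
row_total = sum(r["amount"] for r in rows)

# Column layout (Parquet/ORC style): one contiguous array per column
cols = {
    "id": [1, 2, 3],
    "city": ["Toronto", "Mumbai", "London"],
    "amount": [500, 300, 700],
}
# Summing one column reads exactly one array -- the others are never touched
col_total = sum(cols["amount"])

print(row_total, col_total)  # 1500 1500
```

Same answer either way, but the column layout reads a fraction of the bytes — and a sorted array of amounts compresses far better than interleaved mixed-type rows.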
Choosing Formats for the Medallion Architecture
BRONZE (Raw Ingestion)
Format: Whatever the source provides (CSV, JSON, Avro)
Why: Accept data as-is, do not transform yet
Example: API response → JSON, database export → CSV, Kafka → Avro
SILVER (Cleaned + Transformed)
Format: Delta Lake (or Parquet if Delta is not available)
Why: Need schema enforcement, updates for corrections, data quality
Example: Cleaned customers (deduped, nulls handled, types enforced)
GOLD (Business-Ready)
Format: Delta Lake
Why: Need MERGE for SCD, time travel for audit, ACID for reliability
Example: dim_customer (SCD Type 2), fact_sales (daily aggregation)
The Simple Rule
Source drops CSV/JSON → Bronze (keep as-is)
→ Silver (convert to Delta)
→ Gold (Delta with MERGE and OPTIMIZE)
File Formats in Azure Services
| Service | Preferred Formats | Notes |
|---|---|---|
| ADF Copy Activity | CSV, Parquet, JSON, Avro, ORC | Format conversion during copy |
| ADF Data Flows | All formats | Spark engine reads everything |
| Synapse Spark Pool | Delta, Parquet | Delta is first-class citizen |
| Synapse Serverless SQL | Parquet, Delta, CSV | OPENROWSET function |
| Synapse Dedicated SQL | Parquet (via COPY INTO) | Fastest bulk load format |
| Databricks | Delta (native) | Delta is the default |
| Azure HDInsight | ORC, Parquet, Avro | Hive prefers ORC |
| Azure Stream Analytics | JSON, Avro, CSV | Input from Event Hubs |
| Power BI | Any (via Synapse/Databricks) | Reads through compute layer |
Compression: Snappy, GZIP, ZSTD, and LZ4
| Algorithm | Speed | Compression Ratio | Use Case |
|---|---|---|---|
| Snappy | Fastest | Good (3-4x) | Default for Parquet/Delta. Best balance of speed and size. |
| GZIP | Slow | Best (5-8x) | Archival. Maximum compression when storage cost matters more than speed. |
| ZSTD | Fast | Very Good (4-6x) | Modern alternative to GZIP. Better compression than Snappy, almost as fast. |
| LZ4 | Fastest | Moderate (2-3x) | When speed is critical and storage is cheap. |
| None | Instant | 1x (no compression) | Temporary files, debugging. |
Recommendation: Use Snappy for everything in production. Switch to ZSTD if you need better compression without sacrificing too much speed. Use GZIP only for long-term archival.
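Snappy and ZSTD are not in the Python standard library, but zlib's compression levels (a stand-in illustration, not the actual codecs) show the same speed-versus-ratio tradeoff on column-like repetitive data:

```python
import zlib

# Repetitive data, like a sorted column in Parquet, compresses extremely well
data = b"Toronto,Mumbai,London," * 50_000

fast = zlib.compress(data, level=1)  # speed-first, in the spirit of Snappy/LZ4
best = zlib.compress(data, level=9)  # ratio-first, in the spirit of GZIP

print(len(data), len(fast), len(best))
```

The higher level squeezes out more bytes but burns more CPU per write and per read — which is exactly why Snappy wins for hot analytical data and GZIP-class compression is reserved for archives.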
Small Files Problem and How Each Format Handles It
The Problem
Spark jobs and streaming pipelines often create thousands of tiny files (a few MB each or less). Every file requires a separate API call to open and read, so scanning 10,000 small files can be orders of magnitude slower than scanning 10 large files holding the same data.
How Each Format Handles It
| Format | Solution | How |
|---|---|---|
| CSV | Manual merge | Write a script to concatenate files |
| Parquet | Coalesce in Spark | df.coalesce(10).write.parquet(path) |
| Delta | OPTIMIZE command | OPTIMIZE table_name — auto-compacts small files |
| Avro | Manual merge | Similar to CSV — no built-in solution |
Delta Lake’s OPTIMIZE is a game-changer because it compacts files without rewriting the table or causing downtime.
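For the coalesce approach, a hypothetical helper (not from any library) shows the usual rule of thumb — pick a partition count so output files land near the ~128 MB sweet spot:

```python
def target_partitions(total_size_bytes: int, target_file_mb: int = 128) -> int:
    """Rule-of-thumb partition count so each output file is ~target_file_mb."""
    target = target_file_mb * 1024 * 1024
    return max(1, -(-total_size_bytes // target))  # ceiling division

# 10 GB of data -> 80 output files of roughly 128 MB each
n = target_partitions(10 * 1024**3)
print(n)  # 80
# In Spark this would feed something like df.coalesce(n).write.parquet(path)
```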
Schema Evolution: How Each Format Handles Changes
Source system adds a new column email. What happens?
| Format | Behavior | Risk |
|---|---|---|
| CSV | New column appears. Old files do not have it. Queries may break. | High — no type safety |
| JSON | New field appears naturally. Old records just do not have it. | Low — flexible by nature |
| Parquet | Old files have old schema, new files have new schema. Query engine merges them (schema merge). | Medium — requires explicit merge |
| Delta | With mergeSchema=true, new column added automatically. Old rows show NULL. Schema enforcement prevents bad data. | Low — controlled and safe |
| Avro | Schema registry manages evolution. New fields need defaults. | Low — schema registry enforces rules |
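What mergeSchema does conceptually can be shown locally with pandas (a sketch of the behavior, not the Delta implementation): the schemas are unioned and rows written before the new column existed surface as NULL.

```python
import pandas as pd

old = pd.DataFrame({"customer_id": [1], "name": ["Alice"]})
new = pd.DataFrame({"customer_id": [2], "name": ["Bob"], "email": ["bob@test.com"]})

# Union the schemas; the old row gets NULL (NaN) for the new 'email' column
merged = pd.concat([old, new], ignore_index=True)
print(merged["email"].isna().tolist())  # [True, False]
```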
Real-World Scenarios
Scenario 1: E-Commerce Data Platform
Source: Order API (JSON), Product CSV export, Customer database (SQL)
Bronze: JSON files, CSV files, Parquet from ADF Copy
Silver: Delta Lake — cleaned, deduped, joined
Gold: Delta Lake — fact_orders, dim_customer (SCD2), dim_product
Query: Synapse Serverless SQL or Databricks SQL
Scenario 2: IoT Sensor Data
Source: IoT Hub/Event Hubs (Avro from capture)
Bronze: Avro files in ADLS Gen2 (auto-captured)
Silver: Convert to Delta, aggregate by minute/hour
Gold: Delta — sensor_readings_hourly, alert_thresholds
Query: Synapse Spark for ML, Power BI for dashboards
Scenario 3: Legacy Hive Migration to Azure
On-prem: Hive tables in ORC format
Migration: ADF Copy from HDFS to ADLS Gen2 (keep ORC temporarily)
Silver: Convert ORC to Delta Lake using Spark
Gold: Delta — new dimensional model
Common Mistakes
- Storing analytics data as CSV — converts a 3-second query into a 45-second query. Always convert to Parquet or Delta for analytics.
- Using Parquet when you need updates — Parquet does not support UPDATE/DELETE. If you need row-level changes, use Delta Lake.
- Ignoring the small files problem — thousands of tiny Parquet files after a Spark job. Use Delta's OPTIMIZE or coalesce in Spark.
- Not enabling schema enforcement — Delta Lake can reject bad data. Turn it on. Catching a schema error at ingestion is 100x cheaper than finding corrupted data in a dashboard.
- Using GZIP for real-time queries — GZIP compression is slow to decompress. Use Snappy for anything that needs fast query performance.
- Storing production data as JSON in the analytics layer — JSON is great for landing, terrible for querying. Always convert to Parquet or Delta in Silver.
Interview Questions
Q: What is the difference between row-oriented and column-oriented formats? A: Row-oriented formats (CSV, JSON, Avro) store data row by row — good for writing individual records. Column-oriented formats (Parquet, Delta, ORC) store data column by column — good for analytical queries that read few columns across many rows. Column formats enable column pruning and better compression.
Q: What does Delta Lake add over Parquet? A: Delta Lake adds ACID transactions (safe concurrent writes), row-level UPDATE/DELETE, time travel (query historical versions), schema enforcement (reject bad data), MERGE (upsert), and OPTIMIZE (auto file compaction). The data files are still Parquet; Delta adds a transaction log.
Q: Which file format would you use for each layer of the Medallion architecture? A: Bronze — whatever the source provides (CSV, JSON, Avro). Silver — Delta Lake for cleaned, deduplicated data with schema enforcement. Gold — Delta Lake for aggregated, business-ready data with MERGE for SCD and time travel for audit.
Q: What is the small files problem and how do you solve it? A: Too many small files (under 128 MB) cause slow query performance because each file requires a separate read operation. Solutions: coalesce in Spark before writing, Delta Lake OPTIMIZE command, or schedule compaction jobs. Delta Lake handles this most elegantly.
Q: When would you use Avro instead of Parquet? A: For streaming and messaging workloads — Kafka, Event Hubs, high-throughput data ingestion. Avro is row-oriented and optimized for fast serialization. For analytics, convert Avro to Parquet or Delta in the Silver layer.
Q: What compression algorithm do you recommend for Parquet files? A: Snappy for most production workloads (best balance of speed and compression). ZSTD for better compression with minimal speed loss. GZIP only for long-term archival where storage cost matters more than query speed.
Wrapping Up
The file format decision comes down to three questions:
- Is this for landing (Bronze)? — Accept whatever the source gives you. CSV, JSON, Avro — all fine.
- Is this for analytics (Silver/Gold)? — Use Delta Lake. If Delta is not available, use Parquet.
- Is this for streaming? — Use Avro for ingestion, convert to Delta for analysis.
Delta Lake is the modern standard because it adds everything Parquet is missing: ACID transactions, updates, time travel, and schema enforcement. If your platform supports Delta (Databricks, Synapse Spark), there is almost no reason to use raw Parquet in Silver and Gold layers.
Choose your format wisely. It is the foundation that everything else sits on.
Related posts: – Parquet vs CSV vs JSON – Schema-on-Write vs Schema-on-Read – ADLS Gen2 Complete Guide – Slowly Changing Dimensions – AWS S3 for Data Engineers
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.