Parquet vs CSV vs JSON: Choosing the Right File Format for Your Data Pipeline
Every data pipeline ends with writing data somewhere — usually a file in a data lake. But what format should that file be? CSV? JSON? Parquet? ORC?
The answer isn’t always obvious, and choosing wrong can cost you real money and performance. I’ve seen pipelines where switching from CSV to Parquet reduced storage costs by 80% and query times by 10x. I’ve also seen cases where using Parquet was overkill and CSV was the right choice.
In this guide, I’ll compare the three most common data lake formats — CSV, JSON, and Parquet — with real benchmarks, practical advice, and clear recommendations for when to use each.
Table of Contents
- The Quick Answer
- Understanding File Format Types: Row vs Column
- CSV — The Universal Format
- JSON — The Flexible Format
- Parquet — The Analytics Format
- Head-to-Head Comparison
- File Size Comparison with Real Data
- Query Performance Comparison
- Which Format for Which Use Case?
- How to Switch from CSV to Parquet in ADF/Synapse
- Other Formats Worth Knowing
- Interview Questions
- Wrapping Up
The Quick Answer
If you don’t want to read the whole post:
- CSV — Use for debugging, small datasets, human-readable exports, and sharing data with non-technical users
- JSON — Use for nested/hierarchical data, API responses, semi-structured data, and configuration files
- Parquet — Use for everything else, especially data lakes, analytics, and production pipelines
For data engineering pipelines in 2026, Parquet is the default choice unless you have a specific reason to use something else.
Understanding File Format Types: Row vs Column
Before comparing formats, you need to understand the fundamental difference between row-oriented and column-oriented storage.
Row-Oriented (CSV, JSON)
Data is stored row by row:
Row 1: {id: 1, name: "Alice", age: 30, city: "Toronto"}
Row 2: {id: 2, name: "Bob", age: 25, city: "Mumbai"}
Row 3: {id: 3, name: "Carol", age: 35, city: "London"}
When you query SELECT name FROM users, the engine reads ALL columns for every row, then discards the ones you don’t need. If the table has 50 columns, you’re reading 50 columns to get 1.
Column-Oriented (Parquet, ORC)
Data is stored column by column:
Column "id": [1, 2, 3]
Column "name": ["Alice", "Bob", "Carol"]
Column "age": [30, 25, 35]
Column "city": ["Toronto", "Mumbai", "London"]
When you query SELECT name FROM users, the engine reads ONLY the “name” column. In the 50-column table from earlier, the other 49 columns are never touched. This is called column pruning, and it’s why Parquet queries are so fast.
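The difference is easy to see with a toy sketch in plain Python. This is a conceptual illustration only, not a real storage engine:

```python
# The same three rows laid out row-wise vs column-wise, and what
# "SELECT name" has to touch in each layout.
rows = [
    {"id": 1, "name": "Alice", "age": 30, "city": "Toronto"},
    {"id": 2, "name": "Bob", "age": 25, "city": "Mumbai"},
    {"id": 3, "name": "Carol", "age": 35, "city": "London"},
]

# Row-oriented: to get one column we still iterate every full record.
names_row_store = [record["name"] for record in rows]  # touches all 4 fields per row

# Column-oriented: each column is its own contiguous array.
columns = {key: [record[key] for record in rows] for key in rows[0]}
names_col_store = columns["name"]  # touches only the "name" array

print(names_row_store)  # ['Alice', 'Bob', 'Carol']
print(names_col_store)  # ['Alice', 'Bob', 'Carol']
```

Both layouts return the same answer; the difference is how much data had to be read to produce it, and at data-lake scale that difference dominates.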
CSV — The Universal Format
CSV (Comma-Separated Values) is the oldest and most universally supported data format. Every tool, every language, every system can read CSV.
How CSV Stores Data
id,name,age,city
1,Alice,30,Toronto
2,Bob,25,Mumbai
3,Carol,35,London
Advantages
- Universal compatibility — literally every tool can read CSV (Excel, Python, R, databases, text editors)
- Human-readable — you can open it in any text editor and understand the data
- Simple to produce — no special libraries needed, just string concatenation
- Easy to debug — when something goes wrong, you can visually inspect the file
Disadvantages
- No schema — everything is a string. Is “30” an integer, a float, or a string? The reader decides.
- No compression — CSV files are large. A 1GB CSV might be 200MB in Parquet.
- No nested data — can’t represent hierarchical structures
- No column pruning — reading one column requires scanning the entire file
- Delimiter conflicts — what if your data contains commas? You need quoting, escaping, and suddenly it’s not so simple.
- Encoding issues — UTF-8 vs Latin-1 vs Windows-1252 causes constant headaches
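The delimiter problem is why you should let a library handle quoting instead of concatenating strings yourself. A minimal standard-library sketch (the values here are made up) also shows the "no schema" problem in action:

```python
import csv
import io

# Let the csv module handle quoting/escaping; never hand-build CSV once
# fields can contain commas or quotes.
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
writer.writerow(["id", "name", "city"])
writer.writerow([1, 'Smith, John "JJ"', "Toronto"])  # comma AND quotes in one field

# Round-trip: the tricky field comes back intact...
rows = list(csv.reader(io.StringIO(buf.getvalue())))
print(rows[1])  # ['1', 'Smith, John "JJ"', 'Toronto']

# ...but note that the id came back as the string '1', not the integer 1.
# CSV carries no types; the reader has to guess.
```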
When to Use CSV
- Quick exports for business users who’ll open it in Excel
- Small datasets (under 100MB) where performance doesn’t matter
- Debugging pipeline output — “let me see what the data actually looks like”
- Exchanging data with systems that only support CSV
- One-time data transfers where you’ll never query the data again
JSON — The Flexible Format
JSON (JavaScript Object Notation) is the standard format for APIs and semi-structured data. It handles nested and hierarchical data naturally.
How JSON Stores Data
[
{"id": 1, "name": "Alice", "age": 30, "address": {"city": "Toronto", "country": "Canada"}},
{"id": 2, "name": "Bob", "age": 25, "address": {"city": "Mumbai", "country": "India"}},
{"id": 3, "name": "Carol", "age": 35, "address": {"city": "London", "country": "UK"}}
]
Advantages
- Supports nested data — objects inside objects, arrays inside objects
- Self-describing — field names are in every record
- Schema-flexible — different records can have different fields (schema-on-read)
- API standard — every REST API speaks JSON
- Human-readable — easier to read than XML, especially for nested structures
- Data types — distinguishes between strings, numbers, booleans, and null (unlike CSV)
Disadvantages
- Verbose — field names are repeated in every single record. In a million-row file, “customer_name” appears a million times.
- Large file size — typically 2-5x larger than CSV for the same data (due to field name repetition and syntax characters)
- No column pruning — must read entire records
- Slow to parse — especially for large files. Parsing JSON is CPU-intensive.
- No built-in compression — you can gzip it, but that’s an extra step
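The gzip step is extra, but it's a small one. A common pattern for event data is newline-delimited JSON (one object per line) compressed on write; here is a standard-library sketch with a made-up file name:

```python
import gzip
import json
import os
import tempfile

records = [
    {"id": 1, "name": "Alice", "address": {"city": "Toronto", "country": "Canada"}},
    {"id": 2, "name": "Bob", "address": {"city": "Mumbai", "country": "India"}},
]

# Newline-delimited JSON, gzipped on write.
path = os.path.join(tempfile.gettempdir(), "events.jsonl.gz")  # hypothetical path
with gzip.open(path, "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Reading it back is just as simple, and the nesting survives intact.
with gzip.open(path, "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded[1]["address"]["city"])  # Mumbai
```

The line-per-record layout also means you can process the file as a stream rather than loading one giant array into memory.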
When to Use JSON
- Storing API responses from REST services
- Semi-structured data where different records have different fields
- Nested or hierarchical data (addresses, order items, user preferences)
- Configuration files and settings
- Event data from applications (logs, clickstream)
- Small to medium datasets where flexibility matters more than performance
Parquet — The Analytics Format
Parquet is an open-source, columnar storage format designed for analytics workloads. It was created by Twitter and Cloudera and is now the de facto standard for data lakes.
How Parquet Stores Data
Parquet organizes data by columns and divides files into row groups:
Row Group 1:
Column "id": [1, 2, 3, ..., 1000] (compressed)
Column "name": ["Alice", "Bob", ...] (compressed)
Column "age": [30, 25, 35, ...] (compressed)
Column "city": ["Toronto", "Mumbai", ...] (compressed)
Row Group 2:
Column "id": [1001, 1002, ...] (compressed)
...
Footer: Schema definition + column statistics (min/max/count)
Advantages
- Columnar storage — only reads the columns your query needs (column pruning)
- Excellent compression — 70-90% smaller than CSV because similar values in a column compress extremely well
- Embedded schema — data types are stored in the file. No guessing if “30” is an int or string.
- Predicate pushdown — the footer contains min/max statistics per column per row group. If you query WHERE age > 50 and a row group’s max age is 45, the entire row group is skipped.
- Encoding optimizations — dictionary encoding for low-cardinality columns (e.g., country names), run-length encoding for repeated values
- Partitioning — works naturally with Hive-style partitioning (/year=2026/month=04/)
- Industry standard — supported by Spark, Synapse, Databricks, Athena, BigQuery, Snowflake, Presto, and Trino
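Predicate pushdown is the least intuitive item on that list, so here is a toy model of it in plain Python. This is a conceptual sketch of the idea, not the actual Parquet reader:

```python
# Toy model of Parquet row groups: each group stores its column values plus
# min/max statistics, and a WHERE filter can skip whole groups on stats alone.
row_groups = [
    {"age": [21, 30, 45], "stats": {"age": (21, 45)}},
    {"age": [52, 60, 75], "stats": {"age": (52, 75)}},
    {"age": [18, 25, 33], "stats": {"age": (18, 33)}},
]

def scan_where_age_gt(threshold):
    matched, groups_read = [], 0
    for group in row_groups:
        lo, hi = group["stats"]["age"]
        if hi <= threshold:  # no row in this group can match: skip without reading
            continue
        groups_read += 1
        matched.extend(v for v in group["age"] if v > threshold)
    return matched, groups_read

values, groups_read = scan_where_age_gt(50)
print(values)       # [52, 60, 75]
print(groups_read)  # 1 -- two of the three row groups were never read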
Disadvantages
- Not human-readable — binary format, can’t open in a text editor
- Not streamable — must read the footer (at the end of the file) to understand the schema, so you can’t stream it like CSV
- Append-only — you can’t update or delete individual rows. For updates, you rewrite the entire file (or use Delta Lake/Iceberg on top).
- Overhead for tiny files — the metadata overhead makes Parquet inefficient for very small files (under 1MB). Better to batch small records into larger files.
- Write speed — writing Parquet is slower than CSV because of compression and encoding overhead. Read speed is much faster though.
When to Use Parquet
- Data lakes — this is the standard format for Azure Data Lake, S3, and GCS
- Analytics and BI — any data that will be queried by Synapse, Spark, Databricks, or Athena
- Production pipelines — the default output format for ETL/ELT pipelines
- Large datasets — anything over 100MB benefits from Parquet’s compression
- Long-term storage — smaller files = lower storage costs over time
Head-to-Head Comparison
| Feature | CSV | JSON | Parquet |
|---|---|---|---|
| Storage type | Row-oriented | Row-oriented | Column-oriented |
| Human readable | Yes | Yes | No (binary) |
| Schema embedded | No (headers only) | Partial (field names) | Yes (full schema with types) |
| Data types | Everything is text | String, number, boolean, null | Int, long, float, double, string, date, timestamp, etc. |
| Nested data | Not supported | Native support | Supported |
| Compression ratio | Poor (1x) | Worse (1.5-2x larger than CSV) | Excellent (5-10x smaller than CSV) |
| Column pruning | Not possible | Not possible | Built-in |
| Predicate pushdown | Not possible | Not possible | Built-in |
| Write speed | Fast | Medium | Slower (compression overhead) |
| Read speed (full scan) | Medium | Slow | Fast |
| Read speed (select columns) | Same as full scan | Same as full scan | Much faster (only reads needed columns) |
| Tool support | Universal | Universal | Analytics tools (Spark, Synapse, Athena, etc.) |
| Best for | Debugging, exports, small data | APIs, semi-structured, nested data | Analytics, data lakes, production pipelines |
File Size Comparison with Real Data
Here’s a real comparison using the AdventureWorks SalesLT.Customer table (847 rows, 15 columns):
| Format | File Size | Relative Size |
|---|---|---|
| CSV | 285 KB | 1x (baseline) |
| JSON | 520 KB | 1.8x larger |
| Parquet (Snappy) | 48 KB | 6x smaller |
| Parquet (Gzip) | 38 KB | 7.5x smaller |
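The CSV-vs-JSON half of that ratio is easy to reproduce with the standard library alone. This uses synthetic records, so your exact ratio will vary with column names and values:

```python
import csv
import json
import os
import tempfile

# Synthetic records; real ratios depend on your field names and data.
rows = [{"id": i, "name": f"user{i}", "age": 20 + i % 50, "city": "Toronto"}
        for i in range(10_000)]

csv_path = os.path.join(tempfile.gettempdir(), "sample.csv")
json_path = os.path.join(tempfile.gettempdir(), "sample.json")

with open(csv_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

with open(json_path, "w") as f:
    json.dump(rows, f)

csv_size = os.path.getsize(csv_path)
json_size = os.path.getsize(json_path)
# JSON comes out roughly 2-3x larger here: field names are repeated in
# every record, while CSV states them once in the header.
print(f"JSON is {json_size / csv_size:.1f}x the size of CSV")
```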
At scale (100 million rows), those ratios become dramatic:
| Format | Estimated Size | Storage Cost/Month (ADLS Gen2) |
|---|---|---|
| CSV | 30 GB | $0.62 |
| JSON | 54 GB | $1.12 |
| Parquet (Snappy) | 5 GB | $0.10 |
Over a year with daily full loads, you’d store 365 copies. CSV = 10.9 TB ($226/yr). Parquet = 1.8 TB ($37/yr). That’s a $189/year savings on storage alone — for ONE table.
Query Performance Comparison
Querying SELECT name, email FROM customers WHERE age > 30 on a 100-million-row dataset:
| Format | Engine | Time | Data Scanned |
|---|---|---|---|
| CSV | Synapse Serverless | ~45 seconds | 30 GB (full file) |
| JSON | Synapse Serverless | ~60 seconds | 54 GB (full file) |
| Parquet | Synapse Serverless | ~3 seconds | 800 MB (2 columns + predicate pushdown) |
Parquet is 15x faster because:
1. It only reads 2 columns out of 15 (column pruning)
2. It skips row groups where max(age) ≤ 30 (predicate pushdown)
3. The data is compressed, so there’s less I/O
Which Format for Which Use Case?
| Use Case | Recommended Format | Why |
|---|---|---|
| Data lake ingestion | Parquet | Compression, schema, column pruning |
| Daily ETL pipeline output | Parquet | Standard format for analytics |
| API response storage | JSON | Preserves nested structure |
| Debug/inspect pipeline output | CSV | Open in Excel or text editor |
| Share with business users | CSV | Everyone can open CSV |
| Event/log data | JSON or Parquet | JSON for flexibility, Parquet for volume |
| Configuration/settings | JSON | Human-readable, supports nesting |
| Data warehouse staging | Parquet | Fastest load into SQL pools |
| ML training data | Parquet | Standard for Spark/pandas |
| One-time data export | CSV | Simple, universal |
How to Switch from CSV to Parquet in ADF/Synapse
If you’re currently outputting CSV and want to switch to Parquet, the change is simple:
Step 1: Create a new dataset
– Instead of selecting DelimitedText, select Parquet
– Same linked service, same parameters
Step 2: Update the Copy activity’s sink
– Change the sink dataset from DS_ADLS_Sink_CSV to DS_ADLS_Sink_Parquet
Step 3: That’s it.
The Copy activity handles the format conversion automatically. It reads from the source (SQL, CSV, whatever) and writes in Parquet format with Snappy compression by default.
No code changes. No schema mapping changes. No pipeline restructuring. Just swap the dataset.
Other Formats Worth Knowing
ORC (Optimized Row Columnar):
– Similar to Parquet but optimized for Hive
– Used primarily in the Hadoop ecosystem
– Parquet has become more popular due to broader tool support

Avro:
– Row-based but with schema evolution support
– Popular for Kafka message serialization
– Good for write-heavy workloads where schema changes frequently

Delta Lake (not a format, a layer on top of Parquet):
– Adds ACID transactions, time travel, and schema enforcement to Parquet
– Supports UPDATE, DELETE, and MERGE operations on Parquet files
– Used in Databricks and now supported in Synapse

Iceberg and Hudi:
– Similar to Delta Lake — table formats that add transaction support to Parquet
– Growing in popularity, especially in the AWS and open-source ecosystems
Interview Questions
Q: Why would you choose Parquet over CSV for a data lake?
A: Parquet offers 5-10x better compression (lower storage costs), column pruning (reads only the columns needed), predicate pushdown (skips irrelevant row groups), and embedded schema (no type guessing). CSV is only better for human readability and universal tool support.
Q: What is column pruning?
A: Column pruning is the ability to read only the specific columns needed by a query, skipping all others. In a 50-column Parquet file, a query that needs 3 columns reads only 6% of the data. CSV must read 100% regardless.
Q: What is predicate pushdown in Parquet?
A: Parquet files store min/max statistics for each column in each row group. When a query has a WHERE clause (e.g., WHERE age > 50), the engine checks these statistics and skips entire row groups where the max value doesn’t meet the condition — without reading any actual data rows.
Q: When would you NOT use Parquet?
A: For small files (under 1MB, where metadata overhead matters), for data that needs to be human-readable (debugging), for sharing with non-technical users who need Excel compatibility, or for real-time streaming where you need to append individual records frequently.
Wrapping Up
Choosing the right file format isn’t glamorous, but it directly impacts your pipeline’s performance, storage costs, and query speed. In 2026, the decision is straightforward:
- Parquet for production data pipelines and analytics (the default)
- JSON for API data and semi-structured content
- CSV for debugging, exports, and human readability
If you’re building pipelines in ADF or Synapse, start with Parquet. Your future self (and your cloud bill) will thank you.
Related posts: – Building a Metadata-Driven Pipeline in ADF – Synapse Pipeline with Audit Logging (Parquet output) – Common ADF/Synapse Pipeline Errors – Top 15 ADF Interview Questions
If this comparison helped you make a decision, share it with your team. Questions? Drop a comment below.
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.