Parquet vs CSV vs JSON: Choosing the Right File Format for Your Data Pipeline

Every data pipeline ends with writing data somewhere — usually a file in a data lake. But what format should that file be? CSV? JSON? Parquet? ORC?

The answer isn’t always obvious, and choosing wrong can cost you real money and performance. I’ve seen pipelines where switching from CSV to Parquet reduced storage costs by 80% and query times by 10x. I’ve also seen cases where using Parquet was overkill and CSV was the right choice.

In this guide, I’ll compare the three most common data lake formats — CSV, JSON, and Parquet — with real benchmarks, practical advice, and clear recommendations for when to use each.

Table of Contents

  • The Quick Answer
  • Understanding File Format Types: Row vs Column
  • CSV — The Universal Format
  • JSON — The Flexible Format
  • Parquet — The Analytics Format
  • Head-to-Head Comparison
  • File Size Comparison with Real Data
  • Query Performance Comparison
  • Which Format for Which Use Case?
  • How to Switch from CSV to Parquet in ADF/Synapse
  • Other Formats Worth Knowing
  • Interview Questions
  • Wrapping Up

The Quick Answer

If you don’t want to read the whole post:

  • CSV — Use for debugging, small datasets, human-readable exports, and sharing data with non-technical users
  • JSON — Use for nested/hierarchical data, API responses, semi-structured data, and configuration files
  • Parquet — Use for everything else, especially data lakes, analytics, and production pipelines

For data engineering pipelines in 2026, Parquet is the default choice unless you have a specific reason to use something else.

Understanding File Format Types: Row vs Column

Before comparing formats, you need to understand the fundamental difference between row-oriented and column-oriented storage.

Row-Oriented (CSV, JSON)

Data is stored row by row:

Row 1: {id: 1, name: "Alice", age: 30, city: "Toronto"}
Row 2: {id: 2, name: "Bob",   age: 25, city: "Mumbai"}
Row 3: {id: 3, name: "Carol", age: 35, city: "London"}

When you query SELECT name FROM users, the engine reads ALL columns for every row, then discards the ones you don’t need. If the table has 50 columns, you’re reading 50 columns to get 1.

Column-Oriented (Parquet, ORC)

Data is stored column by column:

Column "id":   [1, 2, 3]
Column "name": ["Alice", "Bob", "Carol"]
Column "age":  [30, 25, 35]
Column "city": ["Toronto", "Mumbai", "London"]

When you query SELECT name FROM users, the engine reads ONLY the “name” column. The other 49 columns are never touched. This is called column pruning and it’s why Parquet queries are so fast.

CSV — The Universal Format

CSV (Comma-Separated Values) is the oldest and most universally supported data format. Every tool, every language, every system can read CSV.

How CSV Stores Data

id,name,age,city
1,Alice,30,Toronto
2,Bob,25,Mumbai
3,Carol,35,London

Advantages

  • Universal compatibility — literally every tool can read CSV (Excel, Python, R, databases, text editors)
  • Human-readable — you can open it in any text editor and understand the data
  • Simple to produce — no special libraries needed, just string concatenation
  • Easy to debug — when something goes wrong, you can visually inspect the file

Disadvantages

  • No schema — everything is a string. Is “30” an integer, a float, or a string? The reader decides.
  • No compression — CSV files are large. A 1GB CSV might be 200MB in Parquet.
  • No nested data — can’t represent hierarchical structures
  • No column pruning — reading one column requires scanning the entire file
  • Delimiter conflicts — what if your data contains commas? You need quoting, escaping, and suddenly it’s not so simple.
  • Encoding issues — UTF-8 vs Latin-1 vs Windows-1252 causes constant headaches
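
The "no schema" problem is easy to demonstrate with Python's standard csv module:

```python
import csv
import io

# A small CSV payload; note there is no type information anywhere
data = "id,name,age,city\n1,Alice,30,Toronto\n2,Bob,25,Mumbai\n"
rows = list(csv.DictReader(io.StringIO(data)))

# Every value comes back as text: "30" is a string, not an integer
print(type(rows[0]["age"]))              # <class 'str'>
print(rows[0]["age"] + rows[1]["age"])   # '3025' -- concatenation, not addition
```

Every CSV reader has to guess (or be told) the types, and different tools guess differently.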

When to Use CSV

  • Quick exports for business users who’ll open it in Excel
  • Small datasets (under 100MB) where performance doesn’t matter
  • Debugging pipeline output — “let me see what the data actually looks like”
  • Exchanging data with systems that only support CSV
  • One-time data transfers where you’ll never query the data again

JSON — The Flexible Format

JSON (JavaScript Object Notation) is the standard format for APIs and semi-structured data. It handles nested and hierarchical data naturally.

How JSON Stores Data

[
    {"id": 1, "name": "Alice", "age": 30, "address": {"city": "Toronto", "country": "Canada"}},
    {"id": 2, "name": "Bob", "age": 25, "address": {"city": "Mumbai", "country": "India"}},
    {"id": 3, "name": "Carol", "age": 35, "address": {"city": "London", "country": "UK"}}
]
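
A quick sketch with Python's standard json module shows two of JSON's strengths at once: nesting and data types both survive the round trip.

```python
import json

record = json.loads(
    '{"id": 1, "name": "Alice", "age": 30,'
    ' "address": {"city": "Toronto", "country": "Canada"}}'
)

# Nested objects are first-class values...
print(record["address"]["city"])   # Toronto
# ...and numbers stay numbers (unlike CSV, where everything is text)
print(type(record["age"]))         # <class 'int'>
```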

Advantages

  • Supports nested data — objects inside objects, arrays inside objects
  • Self-describing — field names are in every record
  • Schema-flexible — different records can have different fields (schema-on-read)
  • API standard — every REST API speaks JSON
  • Human-readable — easier to read than XML, especially for nested structures
  • Data types — distinguishes between strings, numbers, booleans, and null (unlike CSV)

Disadvantages

  • Verbose — field names are repeated in every single record. In a million-row file, “customer_name” appears a million times.
  • Large file size — typically 2-5x larger than CSV for the same data (due to field name repetition and syntax characters)
  • No column pruning — must read entire records
  • Slow to parse — especially for large files. Parsing JSON is CPU-intensive.
  • No built-in compression — you can gzip it, but that’s an extra step

When to Use JSON

  • Storing API responses from REST services
  • Semi-structured data where different records have different fields
  • Nested or hierarchical data (addresses, order items, user preferences)
  • Configuration files and settings
  • Event data from applications (logs, clickstream)
  • Small to medium datasets where flexibility matters more than performance

Parquet — The Analytics Format

Parquet is an open-source, columnar storage format designed for analytics workloads. It was created by Twitter and Cloudera and is now the de facto standard for data lakes.

How Parquet Stores Data

Parquet organizes data by columns and divides files into row groups:

Row Group 1:
  Column "id":   [1, 2, 3, ..., 1000]     (compressed)
  Column "name": ["Alice", "Bob", ...]      (compressed)
  Column "age":  [30, 25, 35, ...]          (compressed)
  Column "city": ["Toronto", "Mumbai", ...] (compressed)

Row Group 2:
  Column "id":   [1001, 1002, ...]          (compressed)
  ...

Footer: Schema definition + column statistics (min/max/count)

Advantages

  • Columnar storage — only reads the columns your query needs (column pruning)
  • Excellent compression — 70-90% smaller than CSV because similar values in a column compress extremely well
  • Embedded schema — data types are stored in the file. No guessing if “30” is an int or string.
  • Predicate pushdown — the footer contains min/max statistics per column per row group. If you query WHERE age > 50 and a row group’s max age is 45, the entire row group is skipped.
  • Encoding optimizations — dictionary encoding for low-cardinality columns (e.g., country names), run-length encoding for repeated values
  • Partitioning — works naturally with Hive-style partitioning (/year=2026/month=04/)
  • Industry standard — supported by Spark, Synapse, Databricks, Athena, BigQuery, Snowflake, Presto, Trino

Disadvantages

  • Not human-readable — binary format, can’t open in a text editor
  • Not streamable — must read the footer (at the end of the file) to understand the schema, so you can’t stream it like CSV
  • Append-only — you can’t update or delete individual rows. For updates, you rewrite the entire file (or use Delta Lake/Iceberg on top).
  • Overhead for tiny files — the metadata overhead makes Parquet inefficient for very small files (under 1MB). Better to batch small records into larger files.
  • Write speed — writing Parquet is slower than CSV because of compression and encoding overhead. Read speed is much faster though.

When to Use Parquet

  • Data lakes — this is the standard format for Azure Data Lake, S3, and GCS
  • Analytics and BI — any data that will be queried by Synapse, Spark, Databricks, or Athena
  • Production pipelines — the default output format for ETL/ELT pipelines
  • Large datasets — anything over 100MB benefits from Parquet’s compression
  • Long-term storage — smaller files = lower storage costs over time

Head-to-Head Comparison

Feature                     | CSV                            | JSON                               | Parquet
Storage type                | Row-oriented                   | Row-oriented                       | Column-oriented
Human readable              | Yes                            | Yes                                | No (binary)
Schema embedded             | No (headers only)              | Partial (field names)              | Yes (full schema with types)
Data types                  | Everything is text             | String, number, boolean, null      | Int, long, float, double, string, date, timestamp, etc.
Nested data                 | Not supported                  | Native support                     | Supported
Relative file size          | Baseline (1x)                  | 1.5-2x larger than CSV             | 5-10x smaller than CSV
Column pruning              | Not possible                   | Not possible                       | Built-in
Predicate pushdown          | Not possible                   | Not possible                       | Built-in
Write speed                 | Fast                           | Medium                             | Slower (compression overhead)
Read speed (full scan)      | Medium                         | Slow                               | Fast
Read speed (select columns) | Same as full scan              | Same as full scan                  | Much faster (only reads needed columns)
Tool support                | Universal                      | Universal                          | Analytics tools (Spark, Synapse, Athena, etc.)
Best for                    | Debugging, exports, small data | APIs, semi-structured, nested data | Analytics, data lakes, production pipelines

File Size Comparison with Real Data

Here’s a real comparison using the AdventureWorks SalesLT.Customer table (847 rows, 15 columns):

Format           | File Size | Relative Size
CSV              | 285 KB    | 1x (baseline)
JSON             | 520 KB    | 1.8x larger
Parquet (Snappy) | 48 KB     | 6x smaller
Parquet (Gzip)   | 38 KB     | 7.5x smaller

At scale (100 million rows), those ratios become dramatic:

Format           | Estimated Size | Storage Cost/Month (ADLS Gen2)
CSV              | 30 GB          | $0.62
JSON             | 54 GB          | $1.12
Parquet (Snappy) | 5 GB           | $0.10

Over a year with daily full loads, you’d store 365 copies. CSV = 10.9 TB ($226/yr). Parquet = 1.8 TB ($37/yr). That’s a $189/year savings on storage alone — for ONE table.

Query Performance Comparison

Querying SELECT name, email FROM customers WHERE age > 30 on a 100-million-row dataset:

Format  | Engine             | Time        | Data Scanned
CSV     | Synapse Serverless | ~45 seconds | 30 GB (full file)
JSON    | Synapse Serverless | ~60 seconds | 54 GB (full file)
Parquet | Synapse Serverless | ~3 seconds  | 800 MB (2 columns + predicate pushdown)

Parquet is 15x faster because:

  1. It only reads 2 columns out of 15 (column pruning)
  2. It skips row groups where max(age) < 30 (predicate pushdown)
  3. The data is compressed, so less I/O

Which Format for Which Use Case?

Use Case                      | Recommended Format | Why
Data lake ingestion           | Parquet            | Compression, schema, column pruning
Daily ETL pipeline output     | Parquet            | Standard format for analytics
API response storage          | JSON               | Preserves nested structure
Debug/inspect pipeline output | CSV                | Open in Excel or text editor
Share with business users     | CSV                | Everyone can open CSV
Event/log data                | JSON or Parquet    | JSON for flexibility, Parquet for volume
Configuration/settings        | JSON               | Human-readable, supports nesting
Data warehouse staging        | Parquet            | Fastest load into SQL pools
ML training data              | Parquet            | Standard for Spark/pandas
One-time data export          | CSV                | Simple, universal

How to Switch from CSV to Parquet in ADF/Synapse

If you’re currently outputting CSV and want to switch to Parquet, the change is simple:

Step 1: Create a new dataset
  • Instead of selecting DelimitedText, select Parquet
  • Same linked service, same parameters

Step 2: Update the Copy activity’s sink
  • Change the sink dataset from DS_ADLS_Sink_CSV to DS_ADLS_Sink_Parquet

Step 3: That’s it.

The Copy activity handles the format conversion automatically. It reads from the source (SQL, CSV, whatever) and writes in Parquet format with Snappy compression by default.

No code changes. No schema mapping changes. No pipeline restructuring. Just swap the dataset.

Other Formats Worth Knowing

ORC (Optimized Row Columnar):
  • Similar to Parquet but optimized for Hive
  • Used primarily in the Hadoop ecosystem
  • Parquet has become more popular due to broader tool support

Avro:
  • Row-based but with schema evolution support
  • Popular for Kafka message serialization
  • Good for write-heavy workloads where schema changes frequently

Delta Lake (not a format, a layer on top of Parquet):
  • Adds ACID transactions, time travel, and schema enforcement to Parquet
  • Supports UPDATE, DELETE, and MERGE operations on Parquet files
  • Used in Databricks and now supported in Synapse

Iceberg and Hudi:
  • Similar to Delta Lake — table formats that add transaction support to Parquet
  • Growing in popularity, especially in the AWS and open-source ecosystems

Interview Questions

Q: Why would you choose Parquet over CSV for a data lake?
A: Parquet offers 5-10x better compression (lower storage costs), column pruning (reads only the columns needed), predicate pushdown (skips irrelevant row groups), and embedded schema (no type guessing). CSV is only better for human readability and universal tool support.

Q: What is column pruning?
A: Column pruning is the ability to read only the specific columns needed by a query, skipping all others. In a 50-column Parquet file, a query that needs 3 columns reads only 6% of the data. CSV must read 100% regardless.

Q: What is predicate pushdown in Parquet?
A: Parquet files store min/max statistics for each column in each row group. When a query has a WHERE clause (e.g., WHERE age > 50), the engine checks these statistics and skips entire row groups where the max value doesn’t meet the condition — without reading any actual data rows.

Q: When would you NOT use Parquet?
A: For small files (under 1MB where metadata overhead matters), for data that needs to be human-readable (debugging), for sharing with non-technical users who need Excel compatibility, or for real-time streaming where you need to append individual records frequently.

Wrapping Up

Choosing the right file format isn’t glamorous, but it directly impacts your pipeline’s performance, storage costs, and query speed. In 2026, the decision is straightforward:

  • Parquet for production data pipelines and analytics (the default)
  • JSON for API data and semi-structured content
  • CSV for debugging, exports, and human readability

If you’re building pipelines in ADF or Synapse, start with Parquet. Your future self (and your cloud bill) will thank you.

Related posts:
  • Building a Metadata-Driven Pipeline in ADF
  • Synapse Pipeline with Audit Logging (Parquet output)
  • Common ADF/Synapse Pipeline Errors
  • Top 15 ADF Interview Questions

If this comparison helped you make a decision, share it with your team. Questions? Drop a comment below.


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
