Parquet vs CSV vs JSON: Choosing the Right File Format for Your Data Pipeline

Every data pipeline ends with writing data somewhere — usually a file in a data lake. But what format should that file be? CSV? JSON? Parquet? ORC?

The answer isn’t always obvious, and choosing wrong can cost you real money and performance. I’ve seen pipelines where switching from CSV to Parquet reduced storage costs by 80% and query times by 10x. I’ve also seen cases where using Parquet was overkill and CSV was the right choice.

In this guide, I’ll compare the three most common data lake formats — CSV, JSON, and Parquet — with real benchmarks, practical advice, and clear recommendations for when to use each.

Table of Contents

  • The Quick Answer
  • Understanding File Format Types: Row vs Column
  • CSV — The Universal Format
  • JSON — The Flexible Format
  • Parquet — The Analytics Format
  • Head-to-Head Comparison
  • File Size Comparison with Real Data
  • Query Performance Comparison
  • Which Format for Which Use Case?
  • How to Switch from CSV to Parquet in ADF/Synapse
  • Other Formats Worth Knowing
  • Interview Questions
  • Wrapping Up

The Quick Answer

If you don’t want to read the whole post:

  • CSV — Use for debugging, small datasets, human-readable exports, and sharing data with non-technical users
  • JSON — Use for nested/hierarchical data, API responses, semi-structured data, and configuration files
  • Parquet — Use for everything else, especially data lakes, analytics, and production pipelines

For data engineering pipelines in 2026, Parquet is the default choice unless you have a specific reason to use something else.

Understanding File Format Types: Row vs Column

Before comparing formats, you need to understand the fundamental difference between row-oriented and column-oriented storage.

Row-Oriented (CSV, JSON)

Data is stored row by row:

Row 1: {id: 1, name: "Alice", age: 30, city: "Toronto"}
Row 2: {id: 2, name: "Bob",   age: 25, city: "Mumbai"}
Row 3: {id: 3, name: "Carol", age: 35, city: "London"}

When you query SELECT name FROM users, the engine reads ALL columns for every row, then discards the ones you don’t need. If the table has 50 columns, you’re reading 50 columns to get 1.

Column-Oriented (Parquet, ORC)

Data is stored column by column:

Column "id":   [1, 2, 3]
Column "name": ["Alice", "Bob", "Carol"]
Column "age":  [30, 25, 35]
Column "city": ["Toronto", "Mumbai", "London"]

When you query SELECT name FROM users, the engine reads ONLY the “name” column. The other 49 columns are never touched. This is called column pruning and it’s why Parquet queries are so fast.

CSV — The Universal Format

CSV (Comma-Separated Values) is the oldest and most universally supported data format. Every tool, every language, every system can read CSV.

How CSV Stores Data

id,name,age,city
1,Alice,30,Toronto
2,Bob,25,Mumbai
3,Carol,35,London

Advantages

  • Universal compatibility — literally every tool can read CSV (Excel, Python, R, databases, text editors)
  • Human-readable — you can open it in any text editor and understand the data
  • Simple to produce — no special libraries needed, just string concatenation
  • Easy to debug — when something goes wrong, you can visually inspect the file

Disadvantages

  • No schema — everything is a string. Is “30” an integer, a float, or a string? The reader decides.
  • No compression — CSV files are large. A 1GB CSV might be 200MB in Parquet.
  • No nested data — can’t represent hierarchical structures
  • No column pruning — reading one column requires scanning the entire file
  • Delimiter conflicts — what if your data contains commas? You need quoting, escaping, and suddenly it’s not so simple.
  • Encoding issues — UTF-8 vs Latin-1 vs Windows-1252 causes constant headaches
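
The "no schema" problem is easy to demonstrate with Python's standard csv module:

```python
import csv
import io

# A small CSV payload; note there is no type information anywhere
data = "id,name,age,city\n1,Alice,30,Toronto\n2,Bob,25,Mumbai\n"
rows = list(csv.DictReader(io.StringIO(data)))

# Every value comes back as text: "30" is a string, not an integer
print(type(rows[0]["age"]))              # <class 'str'>
print(rows[0]["age"] + rows[1]["age"])   # '3025' -- concatenation, not addition
```

Every CSV reader has to guess (or be told) the types, and different tools guess differently.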

When to Use CSV

  • Quick exports for business users who’ll open it in Excel
  • Small datasets (under 100MB) where performance doesn’t matter
  • Debugging pipeline output — “let me see what the data actually looks like”
  • Exchanging data with systems that only support CSV
  • One-time data transfers where you’ll never query the data again

JSON — The Flexible Format

JSON (JavaScript Object Notation) is the standard format for APIs and semi-structured data. It handles nested and hierarchical data naturally.

How JSON Stores Data

[
    {"id": 1, "name": "Alice", "age": 30, "address": {"city": "Toronto", "country": "Canada"}},
    {"id": 2, "name": "Bob", "age": 25, "address": {"city": "Mumbai", "country": "India"}},
    {"id": 3, "name": "Carol", "age": 35, "address": {"city": "London", "country": "UK"}}
]
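
A quick sketch with Python's standard json module shows two of JSON's strengths at once: nesting and data types both survive the round trip.

```python
import json

record = json.loads(
    '{"id": 1, "name": "Alice", "age": 30,'
    ' "address": {"city": "Toronto", "country": "Canada"}}'
)

# Nested objects are first-class values...
print(record["address"]["city"])   # Toronto
# ...and numbers stay numbers (unlike CSV, where everything is text)
print(type(record["age"]))         # <class 'int'>
```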

Advantages

  • Supports nested data — objects inside objects, arrays inside objects
  • Self-describing — field names are in every record
  • Schema-flexible — different records can have different fields (schema-on-read)
  • API standard — every REST API speaks JSON
  • Human-readable — easier to read than XML, especially for nested structures
  • Data types — distinguishes between strings, numbers, booleans, and null (unlike CSV)

Disadvantages

  • Verbose — field names are repeated in every single record. In a million-row file, “customer_name” appears a million times.
  • Large file size — typically 2-5x larger than CSV for the same data (due to field name repetition and syntax characters)
  • No column pruning — must read entire records
  • Slow to parse — especially for large files. Parsing JSON is CPU-intensive.
  • No built-in compression — you can gzip it, but that’s an extra step

When to Use JSON

  • Storing API responses from REST services
  • Semi-structured data where different records have different fields
  • Nested or hierarchical data (addresses, order items, user preferences)
  • Configuration files and settings
  • Event data from applications (logs, clickstream)
  • Small to medium datasets where flexibility matters more than performance

Parquet — The Analytics Format

Parquet is an open-source, columnar storage format designed for analytics workloads. It was created by Twitter and Cloudera and is now the de facto standard for data lakes.

How Parquet Stores Data

Parquet organizes data by columns and divides files into row groups:

Row Group 1:
  Column "id":   [1, 2, 3, ..., 1000]     (compressed)
  Column "name": ["Alice", "Bob", ...]      (compressed)
  Column "age":  [30, 25, 35, ...]          (compressed)
  Column "city": ["Toronto", "Mumbai", ...] (compressed)

Row Group 2:
  Column "id":   [1001, 1002, ...]          (compressed)
  ...

Footer: Schema definition + column statistics (min/max/count)

Advantages

  • Columnar storage — only reads the columns your query needs (column pruning)
  • Excellent compression — 70-90% smaller than CSV because similar values in a column compress extremely well
  • Embedded schema — data types are stored in the file. No guessing if “30” is an int or string.
  • Predicate pushdown — the footer contains min/max statistics per column per row group. If you query WHERE age > 50 and a row group’s max age is 45, the entire row group is skipped.
  • Encoding optimizations — dictionary encoding for low-cardinality columns (e.g., country names), run-length encoding for repeated values
  • Partitioning — works naturally with Hive-style partitioning (/year=2026/month=04/)
  • Industry standard — supported by Spark, Synapse, Databricks, Athena, BigQuery, Snowflake, Presto, Trino

Disadvantages

  • Not human-readable — binary format, can’t open in a text editor
  • Not streamable — must read the footer (at the end of the file) to understand the schema, so you can’t stream it like CSV
  • Append-only — you can’t update or delete individual rows. For updates, you rewrite the entire file (or use Delta Lake/Iceberg on top).
  • Overhead for tiny files — the metadata overhead makes Parquet inefficient for very small files (under 1MB). Better to batch small records into larger files.
  • Write speed — writing Parquet is slower than CSV because of compression and encoding overhead. Read speed is much faster though.

When to Use Parquet

  • Data lakes — this is the standard format for Azure Data Lake, S3, and GCS
  • Analytics and BI — any data that will be queried by Synapse, Spark, Databricks, or Athena
  • Production pipelines — the default output format for ETL/ELT pipelines
  • Large datasets — anything over 100MB benefits from Parquet’s compression
  • Long-term storage — smaller files = lower storage costs over time

Head-to-Head Comparison

Feature                     | CSV                            | JSON                               | Parquet
Storage type                | Row-oriented                   | Row-oriented                       | Column-oriented
Human readable              | Yes                            | Yes                                | No (binary)
Schema embedded             | No (headers only)              | Partial (field names)              | Yes (full schema with types)
Data types                  | Everything is text             | String, number, boolean, null      | Int, long, float, double, string, date, timestamp, etc.
Nested data                 | Not supported                  | Native support                     | Supported
Relative file size          | Baseline (1x)                  | 1.5-2x larger than CSV             | 5-10x smaller than CSV
Column pruning              | Not possible                   | Not possible                       | Built-in
Predicate pushdown          | Not possible                   | Not possible                       | Built-in
Write speed                 | Fast                           | Medium                             | Slower (compression overhead)
Read speed (full scan)      | Medium                         | Slow                               | Fast
Read speed (select columns) | Same as full scan              | Same as full scan                  | Much faster (only reads needed columns)
Tool support                | Universal                      | Universal                          | Analytics tools (Spark, Synapse, Athena, etc.)
Best for                    | Debugging, exports, small data | APIs, semi-structured, nested data | Analytics, data lakes, production pipelines

File Size Comparison with Real Data

Here’s a real comparison using the AdventureWorks SalesLT.Customer table (847 rows, 15 columns):

Format           | File Size | Relative Size
CSV              | 285 KB    | 1x (baseline)
JSON             | 520 KB    | 1.8x larger
Parquet (Snappy) | 48 KB     | 6x smaller
Parquet (Gzip)   | 38 KB     | 7.5x smaller

At scale (100 million rows), those ratios become dramatic:

Format           | Estimated Size | Storage Cost/Month (ADLS Gen2)
CSV              | 30 GB          | $0.62
JSON             | 54 GB          | $1.12
Parquet (Snappy) | 5 GB           | $0.10

Over a year with daily full loads, you’d store 365 copies. CSV = 10.9 TB ($226/yr). Parquet = 1.8 TB ($37/yr). That’s a $189/year savings on storage alone — for ONE table.

Query Performance Comparison

Querying SELECT name, email FROM customers WHERE age > 30 on a 100-million-row dataset:

Format  | Engine             | Time        | Data Scanned
CSV     | Synapse Serverless | ~45 seconds | 30 GB (full file)
JSON    | Synapse Serverless | ~60 seconds | 54 GB (full file)
Parquet | Synapse Serverless | ~3 seconds  | 800 MB (2 columns + predicate pushdown)

Parquet is 15x faster because:

  1. It only reads 2 columns out of 15 (column pruning)
  2. It skips row groups where max(age) < 30 (predicate pushdown)
  3. The data is compressed, so less I/O

Which Format for Which Use Case?

Use Case                      | Recommended Format | Why
Data lake ingestion           | Parquet            | Compression, schema, column pruning
Daily ETL pipeline output     | Parquet            | Standard format for analytics
API response storage          | JSON               | Preserves nested structure
Debug/inspect pipeline output | CSV                | Open in Excel or text editor
Share with business users     | CSV                | Everyone can open CSV
Event/log data                | JSON or Parquet    | JSON for flexibility, Parquet for volume
Configuration/settings        | JSON               | Human-readable, supports nesting
Data warehouse staging        | Parquet            | Fastest load into SQL pools
ML training data              | Parquet            | Standard for Spark/pandas
One-time data export          | CSV                | Simple, universal

How to Switch from CSV to Parquet in ADF/Synapse

If you’re currently outputting CSV and want to switch to Parquet, the change is simple:

Step 1: Create a new dataset
  • Instead of selecting DelimitedText, select Parquet
  • Same linked service, same parameters

Step 2: Update the Copy activity’s sink
  • Change the sink dataset from DS_ADLS_Sink_CSV to DS_ADLS_Sink_Parquet

Step 3: That’s it.

The Copy activity handles the format conversion automatically. It reads from the source (SQL, CSV, whatever) and writes in Parquet format with Snappy compression by default.

No code changes. No schema mapping changes. No pipeline restructuring. Just swap the dataset.

Other Formats Worth Knowing

ORC (Optimized Row Columnar):
  • Similar to Parquet but optimized for Hive
  • Used primarily in the Hadoop ecosystem
  • Parquet has become more popular due to broader tool support

Avro:
  • Row-based but with schema evolution support
  • Popular for Kafka message serialization
  • Good for write-heavy workloads where schema changes frequently

Delta Lake (not a format, a layer on top of Parquet):
  • Adds ACID transactions, time travel, and schema enforcement to Parquet
  • Supports UPDATE, DELETE, and MERGE operations on Parquet files
  • Used in Databricks and now supported in Synapse

Iceberg and Hudi:
  • Similar to Delta Lake — table formats that add transaction support to Parquet
  • Growing in popularity, especially in the AWS and open-source ecosystems

Interview Questions

Q: Why would you choose Parquet over CSV for a data lake?
A: Parquet offers 5-10x better compression (lower storage costs), column pruning (reads only the columns needed), predicate pushdown (skips irrelevant row groups), and embedded schema (no type guessing). CSV is only better for human readability and universal tool support.

Q: What is column pruning?
A: Column pruning is the ability to read only the specific columns needed by a query, skipping all others. In a 50-column Parquet file, a query that needs 3 columns reads only 6% of the data. CSV must read 100% regardless.

Q: What is predicate pushdown in Parquet?
A: Parquet files store min/max statistics for each column in each row group. When a query has a WHERE clause (e.g., WHERE age > 50), the engine checks these statistics and skips entire row groups where the max value doesn’t meet the condition — without reading any actual data rows.

Q: When would you NOT use Parquet?
A: For small files (under 1MB where metadata overhead matters), for data that needs to be human-readable (debugging), for sharing with non-technical users who need Excel compatibility, or for real-time streaming where you need to append individual records frequently.

Wrapping Up

Choosing the right file format isn’t glamorous, but it directly impacts your pipeline’s performance, storage costs, and query speed. In 2026, the decision is straightforward:

  • Parquet for production data pipelines and analytics (the default)
  • JSON for API data and semi-structured content
  • CSV for debugging, exports, and human readability

If you’re building pipelines in ADF or Synapse, start with Parquet. Your future self (and your cloud bill) will thank you.

Related posts:
  • Building a Metadata-Driven Pipeline in ADF
  • Synapse Pipeline with Audit Logging (Parquet output)
  • Common ADF/Synapse Pipeline Errors
  • Top 15 ADF Interview Questions

If this comparison helped you make a decision, share it with your team. Questions? Drop a comment below.


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
