Reading and Writing Every File Format in Azure Databricks: CSV, Parquet, JSON, Delta, and Tricky CSV Variations

You connected Databricks to your storage account. You can list files. Now you try spark.read.csv("path") and get a DataFrame where every column is named _c0, _c1, _c2 with all string types. Or worse — a CSV with pipe delimiters, single-quote text qualifiers, escape characters, and multiline values that Spark completely butchers.

Reading files in Databricks is not just spark.read.format("csv").load(). Real-world files come with quirks — no headers, mixed delimiters, embedded commas, escaped quotes, values that span multiple lines, NULL markers, corrupted rows. Each quirk needs a specific option to handle correctly.

Think of Spark’s file reader like a universal remote control. It can control any TV (file format), but you need to press the right buttons (options) for each TV brand. Use the wrong button and you get static (garbled data).

This post covers every file format you will read and write in Databricks, with the exact options for every real-world scenario — including the tricky CSV variations that trip up even experienced engineers.

Table of Contents

  • The spark.read Pattern
  • Reading Standard CSV Files
  • CSV Without Headers (Manual Schema)
  • CSV with Pipe Delimiter
  • CSV with Corrupted Data (NULL Handling)
  • CSV with Embedded Commas (Quote Character)
  • CSV with Pipe + Single Quote as Text Qualifier
  • CSV with Escape Characters
  • CSV with Multiline Values
  • CSV Options Reference Table
  • Reading JSON Files
  • Reading Parquet Files
  • Reading Delta Lake Tables
  • Writing Data — Modes and Formats
  • Writing with Partitioning (partitionBy)
  • Writing Parquet
  • Writing Delta
  • Creating Managed Tables in Unity Catalog
  • Managed vs External Tables
  • Delta Table Operations (History, Time Travel, Vacuum)
  • Schema Definition (StructType)
  • Real-World Pipeline: Read, Transform, Write, Table
  • Common Mistakes
  • Interview Questions
  • Wrapping Up

The spark.read Pattern

Every file read in Databricks follows the same structure:

df = (spark.read
    .format("csv")                   # What format? csv, parquet, json, delta
    .option("header", "true")        # Format-specific options
    .option("inferSchema", "true")
    .load("abfss://container@account.dfs.core.windows.net/path/"))

Or the shorthand:

df = spark.read.csv("path", header=True, inferSchema=True)
df = spark.read.parquet("path")
df = spark.read.json("path")

Real-life analogy: spark.read is like opening a package. The format tells Spark what KIND of package (envelope, box, crate). The options are the opening instructions (cut here, pull tab, twist cap). The load is the address where the package is.

Reading Standard CSV Files

The Simplest CSV

order_id,customer_name,product,quantity,amount
1,Alice Johnson,Laptop,2,85000
2,Bob Smith,Mouse,5,2500
3,Carol Williams,Keyboard,3,4500

Reading It

df = (spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(f"{base_path}/sales/sales_standard.csv"))

df.show()
df.printSchema()

What the Options Do

  • header=true: Without it, columns are named _c0, _c1, _c2 and the first row is treated as data. With it, column names come from the first row: order_id, customer_name, product.
  • inferSchema=true: Without it, ALL columns are StringType. With it, Spark samples the data and detects types: order_id=IntegerType, amount=IntegerType.

The inferSchema Gotcha

inferSchema reads the file TWICE — once to infer types, once to load data. For large files (10+ GB), this is slow. For production, define the schema manually instead.

Real-life analogy: inferSchema is like a customs officer opening every package to see what is inside before deciding how to classify it. Fast for a few packages, painfully slow for a shipping container of 10,000 packages. Better to have a manifest (defined schema) upfront.

CSV Without Headers (Manual Schema)

The File

1,Alice Johnson,Laptop,2,85000
2,Bob Smith,Mouse,5,2500
3,Carol Williams,Keyboard,3,4500

No header row. Spark does not know the column names.

Reading with Manual Schema

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Define the schema manually
schema = StructType([
    StructField("order_id", IntegerType(), True),
    StructField("customer_name", StringType(), True),
    StructField("product", StringType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("amount", IntegerType(), True)
])

df = (spark.read.format("csv")
    .option("header", "false")
    .schema(schema)
    .load(f"{base_path}/sales/sales_no_header.csv"))

df.show()
df.printSchema()

Why Manual Schema Is Better for Production

  1. Faster — no double-read for inference
  2. Reliable — you control the types (inferSchema might guess wrong)
  3. Self-documenting — the schema IS the documentation
  4. Catches errors — if the file has wrong types, you get a clear error instead of silent corruption
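
Point 4 deserves a quick illustration. A minimal sketch, pairing an explicit schema with FAILFAST mode so a bad file fails loudly instead of loading silently (the path and columns here are placeholders, not the post's sample files):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

strict_schema = StructType([
    StructField("order_id", IntegerType(), True),
    StructField("customer_name", StringType(), True),
    StructField("quantity", IntegerType(), True)
])

# FAILFAST aborts the read on the first row that does not match the schema,
# instead of silently nulling out bad values (the PERMISSIVE default)
df = (spark.read.format("csv")
    .option("header", "true")
    .option("mode", "FAILFAST")
    .schema(strict_schema)
    .load("abfss://container@account.dfs.core.windows.net/raw/orders.csv"))   # placeholder path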

CSV with Pipe Delimiter

The File

order_id|customer_name|product|quantity|amount
1|Alice Johnson|Laptop Pro|2|85000
2|Bob Smith|Wireless Mouse|5|2500

Reading It

df = (spark.read.format("csv")
    .option("header", "true")
    .option("delimiter", "|")
    .option("inferSchema", "true")
    .load(f"{base_path}/sales/sales_pipe.csv"))

df.show(truncate=False)

The delimiter option (also called sep) tells Spark which character separates fields. The default is a comma. Change it to |, \t (tab), or any other single character.

Real-life analogy: The delimiter is like the divider between compartments in a toolbox. A comma-delimited file uses thin dividers. A pipe-delimited file uses thick dividers. You need to tell Spark which divider to look for, or it tries to use the wrong ones and tools end up in the wrong compartments.

CSV with Corrupted Data (NULL Handling)

The File

order_id,customer_name,product,quantity,amount,remarks
1,John Doe,"Laptop, 16GB RAM",2,85000,"Delivered on time"
2,NULL,Mouse,NULL,2500,"Customer satisfied"
3,Carol,Keyboard,3,NULL,NULL

Reading with NULL Handling

df = (spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("nullValue", "NULL")
    .load(f"{base_path}/sales/sales_corrupted.csv"))

df.show(truncate=False)

What nullValue Does

Without nullValue="NULL":
  • The string "NULL" is treated as a regular text value
  • customer_name for row 2 is the STRING "NULL", not an actual null

With nullValue="NULL":
  • Every occurrence of the text NULL is converted to a real null value
  • df.filter(col("customer_name").isNull()) correctly finds row 2
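
A quick way to confirm the option worked, sketched against the DataFrame read above:

from pyspark.sql.functions import col, count, when

# Real nulls now show up where the file had the literal text NULL
df.filter(col("customer_name").isNull()).show(truncate=False)

# Null count per column -- handy after any nullValue change
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()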

Handling Embedded Commas

Notice row 1: "Laptop, 16GB RAM". The product name contains a comma, but it is wrapped in double quotes. Spark’s default quote character is ", so it correctly reads this as ONE field.

Without quote handling: Laptop and 16GB RAM would be split into two separate columns — breaking the entire row.

CSV with Pipe + Single Quote as Text Qualifier

The File

order_id|customer_name|product|quantity|amount|remarks
1|John Doe|'Laptop | 16GB RAM'|2|85000|'Delivered on time'
2|Jane Smith|Mouse|5|2500|'Customer satisfied'

The pipe | is the delimiter. Single quotes ' wrap text fields (not the standard double quotes). The product name contains a pipe INSIDE single quotes.

Reading It

df = (spark.read.format("csv")
    .option("header", "true")
    .option("delimiter", "|")
    .option("quote", "'")
    .option("inferSchema", "true")
    .load(f"{base_path}/sales/sales_pipe_single_quote.csv"))

df.show(truncate=False)

Why quote="'" Is Critical

Without quote="'", Spark uses the default " as the text qualifier. When it sees 'Laptop | 16GB RAM', it does not recognize the single quotes as text boundaries. The | inside gets treated as a delimiter, splitting the product into two fields and breaking the entire row.

With quote="'", Spark knows: “Everything between single quotes is ONE field. Ignore any pipes inside.”

Real-life analogy: The quote character is like quotation marks in speech. When someone says “I want the ‘large, extra cheese’ pizza,” you know “large, extra cheese” is ONE item, not three separate requests. The quote character tells Spark where a field begins and ends.

CSV with Escape Characters

The File

order_id|customer_name|product|quantity|amount|remarks
1|John Doe|'Laptop Pro'|2|85000|'Good product'
2|Jane Smith|Mouse|5|2500|'Customer said #'excellent#' work'

Row 2 remarks contain single quotes INSIDE a single-quoted field: 'Customer said #'excellent#' work'. The # before each inner quote is the escape character — it tells Spark “this next single quote is part of the text, not the end of the field.”

Reading It

df = (spark.read.format("csv")
    .option("header", "true")
    .option("delimiter", "|")
    .option("quote", "'")
    .option("escape", "#")
    .option("inferSchema", "true")
    .load(f"{base_path}/sales/sales_pipe_single_quote_escape.csv"))

df.show(truncate=False)

Result

The remarks field for row 2 correctly shows: Customer said 'excellent' work

The # characters are consumed during parsing — they do not appear in the output. They only served as signals to Spark: “the quote after me is literal, not structural.”

Real-life analogy: The escape character is like saying “quote” in a conversation. If you say to a robot: “Print: She said ‘hello'”, the robot gets confused — is ‘hello’ a command or text? With escape: “Print: She said QUOTE hello QUOTE” — now the robot knows those quotes are part of the text.

CSV with Multiline Values

The File

order_id|customer_name|product|quantity|amount|remarks
1|John Doe|'Laptop Pro'|2|85000|'Good product'
2|Jane Smith|Mouse|5|2500|'Customer said #'excellent#' work
and he praised the laptop a lot'
3|Carol|Keyboard|3|4500|'Fast delivery'

Row 2’s remarks span TWO LINES. The text starts on one line and continues on the next. Without special handling, Spark treats each line as a separate row and the data is completely mangled.

Reading It

df = (spark.read.format("csv")
    .option("header", "true")
    .option("delimiter", "|")
    .option("quote", "'")
    .option("escape", "#")
    .option("multiLine", "true")
    .option("inferSchema", "true")
    .load(f"{base_path}/sales/sales_pipe_single_quote_escape_multiline.csv"))

df.show(truncate=False)

What multiLine=true Does

Without multiLine=true: Spark reads line by line. It sees the first line of row 2, then treats the continuation as a separate (broken) row.

With multiLine=true: Spark reads the entire file and respects quote boundaries across line breaks. It sees the opening ' before “Customer said” and reads until the closing ' after “a lot” — even though they are on different lines.

Performance Warning

multiLine=true is slower because Spark cannot parallelize reading (it must read sequentially to find quote boundaries across lines). Only use it when your data actually has multiline values.

CSV Options Reference Table

  • header: first row is column names. Default: false. Example: "true"
  • delimiter / sep: field separator character. Default: ,. Example: "|" for pipe
  • quote: character wrapping text fields. Default: ". Example: "'" for single quote
  • escape: character escaping quotes within text. Default: \. Example: "#"
  • multiLine: allow values to span multiple lines. Default: false. Example: "true"
  • inferSchema: auto-detect data types (reads the file twice). Default: false. Example: "true"
  • nullValue: string to interpret as null. Default: none. Example: "NULL" or "N/A"
  • schema: manually define column types. Default: none. Example: StructType([...])
  • encoding: character encoding. Default: UTF-8. Example: "ISO-8859-1"
  • dateFormat: date parsing format. Default: yyyy-MM-dd. Example: "dd/MM/yyyy"
  • timestampFormat: timestamp parsing format. Default: ISO format. Example: "yyyy-MM-dd HH:mm:ss"
  • mode: error handling (PERMISSIVE, DROPMALFORMED, FAILFAST). Default: PERMISSIVE. Example: "FAILFAST"
  • columnNameOfCorruptRecord: column to store bad rows. Default: _corrupt_record. Example: "bad_row"
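
The last two options in the list, mode and columnNameOfCorruptRecord, work together. A hedged sketch: with PERMISSIVE mode, rows Spark cannot parse land in a designated column instead of killing the job (the column must also appear in the schema you supply; the path reuses the corrupted-data example from earlier):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema_with_bad = StructType([
    StructField("order_id", IntegerType(), True),
    StructField("customer_name", StringType(), True),
    StructField("product", StringType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("amount", IntegerType(), True),
    StructField("remarks", StringType(), True),
    StructField("bad_row", StringType(), True)   # receives any unparseable line
])

df = (spark.read.format("csv")
    .option("header", "true")
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "bad_row")
    .schema(schema_with_bad)
    .load(f"{base_path}/sales/sales_corrupted.csv"))

df.show(truncate=False)   # clean rows have bad_row = null; malformed rows carry their raw text there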

Reading JSON Files

Standard JSON

df = spark.read.format("json")     .option("multiLine", "true")     .load(f"{base_path}/data.json")

df.show(truncate=False)
df.printSchema()  # Spark infers schema from JSON structure

JSON Lines (One Record Per Line)

# No multiLine needed — this is the default
df = spark.read.json(f"{base_path}/data.jsonl")
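
The difference is purely in how the file is laid out. Hypothetical contents of each, just to show why the option matters:

data.json (pretty-printed, one array spanning many lines -- needs multiLine=true):

[
  {"id": 1, "name": "Alice"},
  {"id": 2, "name": "Bob"}
]

data.jsonl (one complete object per line -- the default reader handles it):

{"id": 1, "name": "Alice"}
{"id": 2, "name": "Bob"}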

Nested JSON (Flattening)

from pyspark.sql.functions import col, explode

# Read JSON with nested structure
df = spark.read.json(f"{base_path}/nested_data.json")

# Flatten nested fields
df_flat = df.select(
    col("id"),
    col("name"),
    col("address.city").alias("city"),
    col("address.country").alias("country"),
    explode("orders").alias("order")
).select(
    "id", "name", "city", "country",
    col("order.order_id").alias("order_id"),
    col("order.amount").alias("amount")
)

df_flat.show()

Real-life analogy: Reading JSON is like unpacking a Russian nesting doll (matryoshka). Each doll (object) contains smaller dolls (nested objects). explode opens the innermost doll and lays everything flat on the table.

Reading Parquet Files

Parquet is the simplest to read — schema is embedded, no options needed:

# That is it. No options needed.
df = spark.read.parquet(f"{base_path}/customers/")

# Or with format syntax
df = spark.read.format("parquet").load(f"{base_path}/customers/")

# Read specific columns only (Parquet supports column pruning)
df = spark.read.parquet(f"{base_path}/customers/").select("name", "city")

Why Parquet is easiest: The schema (column names, types) is stored INSIDE the Parquet file. Spark reads the metadata and knows everything instantly — no inferSchema, no header option, no delimiter. Open and read.

Reading Delta Lake Tables

# Read Delta format
df = spark.read.format("delta").load(f"{base_path}/silver/customers/")

# Read a specific version (time travel)
df_v5 = spark.read.format("delta").option("versionAsOf", 5).load(f"{base_path}/silver/customers/")

# Read at a specific timestamp
df_past = (spark.read.format("delta")
    .option("timestampAsOf", "2026-04-20 14:00:00")
    .load(f"{base_path}/silver/customers/"))

# Read registered table (if in Unity Catalog)
df = spark.table("catalog.schema.customers")

Writing Data — Modes and Formats

Write Modes

  • overwrite: replaces existing data entirely. Use for full refreshes and dev/testing.
  • append: adds new data to existing data. Use for incremental loads and log data.
  • ignore: does nothing if data already exists. Use to prevent accidental overwrites.
  • errorIfExists (default): throws an error if data exists. Use to catch unexpected overwrites.

df.write.mode("overwrite").parquet("path/")
df.write.mode("append").format("delta").save("path/")

Writing with Partitioning (partitionBy)

from pyspark.sql.functions import year, month, col

# Add partition columns
df_with_date = (df
    .withColumn("year", year(col("order_date")))
    .withColumn("month", month(col("order_date"))))

# Write with partitioning
(df_with_date.write
    .format("parquet")
    .mode("overwrite")
    .partitionBy("year", "month")
    .save(f"{base_path}/output/sales_partitioned/"))

What partitionBy Creates

output/sales_partitioned/
  year=2025/
    month=11/
      part-00000.parquet
    month=12/
      part-00000.parquet
  year=2026/
    month=1/
      part-00000.parquet
    month=2/
      part-00000.parquet

Hive-style partition folders. When you query WHERE year=2026 AND month=1, Spark reads ONLY that folder — skipping everything else. This is partition pruning and it makes queries on large datasets dramatically faster.

Real-life analogy: Partitioning is like organizing files in a filing cabinet by year and month. When someone asks for January 2026 invoices, you open one drawer instead of searching through every drawer in the cabinet.
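
A sketch of what a pruned read looks like in practice (reusing the partitioned output written above):

from pyspark.sql.functions import col

df_jan = (spark.read.parquet(f"{base_path}/output/sales_partitioned/")
    .filter((col("year") == 2026) & (col("month") == 1)))

df_jan.explain()   # the FileScan node lists PartitionFilters, confirming only matching folders are scanned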

Writing Delta

# Write Delta (same as Parquet but with transaction log)
(df.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("year", "month")
    .save(f"{base_path}/output/sales_delta/"))

# The folder now contains:
#   part-00000.snappy.parquet (data — same as Parquet)
#   _delta_log/               (transaction log — the Delta magic)
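
# To see this layout yourself from a notebook, dbutils can list the folder
# (a quick sketch -- assumes the write above has already run):
display(dbutils.fs.ls(f"{base_path}/output/sales_delta/"))
display(dbutils.fs.ls(f"{base_path}/output/sales_delta/_delta_log/"))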

Creating Managed Tables in Unity Catalog

Method 1: saveAsTable (From DataFrame)

# Write directly to a managed table
(df.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("my_catalog.my_schema.sales"))

# Query it
spark.sql("SELECT * FROM my_catalog.my_schema.sales").show()

Method 2: SQL CREATE TABLE + insertInto

# Create the table structure
spark.sql(
    "CREATE TABLE IF NOT EXISTS my_catalog.my_schema.customers ("
    "customer_id INT, name STRING, city STRING, amount DOUBLE"
    ") USING DELTA"
)

# Insert data
df.write.insertInto("my_catalog.my_schema.customers")

Method 3: Partitioned Managed Table

(df.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("year", "month")
    .saveAsTable("my_catalog.my_schema.sales_partitioned"))

Managed vs External Tables

  • Data location: managed tables live in Databricks-managed (internal) storage; external tables live in YOUR storage (an ADLS Gen2 path).
  • DROP TABLE: managed deletes metadata AND data; external deletes metadata only, the data stays.
  • Created with: managed via saveAsTable() or CREATE TABLE; external via CREATE TABLE ... LOCATION 'path'.
  • Best for: managed for curated, Unity Catalog-governed tables; external for raw data and shared storage.
  • Lineage: managed gets full lineage in Unity Catalog; external is limited.

# Managed table
df.write.saveAsTable("catalog.schema.customers")

# External table
df.write.format("delta").save("abfss://container@account/path/")
spark.sql(
    "CREATE TABLE catalog.schema.customers_external "
    "USING DELTA "
    "LOCATION 'abfss://container@account/path/'"
)

Delta Table Operations

View History

DESCRIBE HISTORY my_catalog.my_schema.sales

Shows every operation: INSERT, UPDATE, DELETE, MERGE — with timestamps, user, and row counts.
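
The same history is available from Python through the DeltaTable API (a sketch; the table name assumes the managed table created earlier):

from delta.tables import DeltaTable

history_df = DeltaTable.forName(spark, "my_catalog.my_schema.sales").history()
history_df.select("version", "timestamp", "operation", "operationMetrics").show(truncate=False)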

Time Travel

# Read version 3 of the table
df_v3 = spark.read.format("delta").option("versionAsOf", 3).load("path/")

# Read version 3 via SQL
df_v3_sql = spark.sql("SELECT * FROM sales VERSION AS OF 3")

# Read by timestamp
df_past = spark.sql("SELECT * FROM sales TIMESTAMP AS OF '2026-04-20 14:00:00'")
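
Time travel reads do not change the current table. If you actually want to roll the table back, Delta has a RESTORE command (a sketch, assuming the managed table from earlier):

spark.sql("RESTORE TABLE my_catalog.my_schema.sales TO VERSION AS OF 3")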

Vacuum (Clean Old Files)

-- Delete files older than 7 days (168 hours)
VACUUM my_catalog.my_schema.sales RETAIN 168 HOURS

Optimize (Compact Small Files)

OPTIMIZE my_catalog.my_schema.sales
OPTIMIZE my_catalog.my_schema.sales ZORDER BY (city)

Schema Definition (StructType)

For production, always define schemas explicitly:

from pyspark.sql.types import (
    StructType, StructField, 
    IntegerType, StringType, DoubleType, 
    DateType, TimestampType, BooleanType
)

schema = StructType([
    StructField("order_id", IntegerType(), nullable=False),
    StructField("customer_name", StringType(), nullable=True),
    StructField("product", StringType(), nullable=True),
    StructField("quantity", IntegerType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
    StructField("order_date", DateType(), nullable=True),
    StructField("is_returned", BooleanType(), nullable=True),
    StructField("created_at", TimestampType(), nullable=True)
])

df = (spark.read.format("csv")
    .option("header", "true")
    .schema(schema)
    .load("path/"))

Real-World Pipeline: Read, Transform, Write, Table

# Step 1: Read messy CSV from bronze layer
df_raw = (spark.read.format("csv")
    .option("header", "true")
    .option("delimiter", "|")
    .option("quote", "'")
    .option("nullValue", "NULL")
    .option("inferSchema", "true")
    .load(f"{base_path}/bronze/sales_raw.csv"))

print(f"Raw: {df_raw.count()} rows")

# Step 2: Clean and transform
from pyspark.sql.functions import *

df_clean = (df_raw
    .withColumn("customer_name", initcap(trim(col("customer_name"))))
    .withColumn("amount", col("amount").cast("double"))
    .withColumn("order_date", to_date(col("order_date"), "dd/MM/yyyy"))
    .withColumn("year", year(col("order_date")))
    .withColumn("month", month(col("order_date")))
    .filter(col("amount") > 0)
    .dropDuplicates(["order_id"]))

print(f"Clean: {df_clean.count()} rows")

# Step 3: Write as partitioned Delta to silver layer
(df_clean.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("year", "month")
    .save(f"{base_path}/silver/sales/"))

# Step 4: Register as managed table in Unity Catalog
(df_clean.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("my_catalog.silver.sales"))

print("Pipeline complete!")

# Step 5: Verify
spark.sql("SELECT year, month, count(*) as orders FROM my_catalog.silver.sales GROUP BY year, month").show()

Common Mistakes

  1. Forgetting header=true — columns named _c0, _c1, _c2 instead of real names. The most common beginner mistake.

  2. Not setting inferSchema=true — all columns are StringType. Aggregations fail because SUM("500") does not work on strings.

  3. Wrong delimiter for pipe files — using default comma delimiter on pipe-delimited files. Result: one giant column with pipes in the data.

  4. Not handling embedded delimiters — "Laptop, 16GB" splits into two columns if quote handling is wrong. Always verify with .show(truncate=False).

  5. Using collect() on large data — brings millions of rows to the driver, crashing it. Use .show(), .take(10), or .write().

  6. Not using coalesce() before writing — Spark creates one file per partition, resulting in thousands of tiny files. Use df.coalesce(10).write.parquet() (see the sketch after this list).

  7. Forgetting multiLine=true for multiline CSV — rows break across lines, data is mangled. Only needed when text values contain newlines.

  8. Writing Parquet when you need updates later — Parquet does not support UPDATE/DELETE. Use Delta format for any table that will be modified.
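
For mistake 6, a minimal sketch of capping output files before a write (the target of 10 files and the output path are illustrative; size to your data, or rely on Delta's OPTIMIZE afterwards):

(df_clean.coalesce(10)        # at most 10 output files for this write
    .write
    .format("delta")
    .mode("overwrite")
    .save(f"{base_path}/silver/sales_compacted/"))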

Interview Questions

Q: How do you read a CSV file with pipe delimiter and single-quote text qualifier in PySpark?
A: Use spark.read.format("csv") with .option("delimiter", "|") and .option("quote", "'"). The delimiter option sets the field separator. The quote option tells Spark which character wraps text fields, preventing delimiter characters inside quoted text from being treated as field boundaries.

Q: What is the difference between inferSchema and defining a schema manually?
A: inferSchema=true reads the file twice — once to detect types, once to load. It is convenient but slow for large files and can guess wrong. Manual schema (StructType) is faster (single read), reliable, and self-documenting. Always use manual schema in production.

Q: What is the difference between writing Parquet and Delta?
A: Both store data in columnar Parquet format. Delta adds a _delta_log transaction log that enables ACID transactions, UPDATE/DELETE, MERGE, time travel, schema enforcement, and OPTIMIZE/VACUUM. Delta is Parquet + reliability guarantees.

Q: What does partitionBy do when writing data?
A: Creates Hive-style folder structure (year=2026/month=4/). When querying with a filter on partition columns, Spark reads only the relevant folders (partition pruning), dramatically improving performance on large datasets.

Q: What is the difference between managed and external tables in Databricks?
A: Managed tables — Databricks manages both metadata and data files. DROP TABLE deletes everything. External tables — Databricks manages only metadata, data stays in your ADLS Gen2. DROP TABLE removes the metadata but data files remain. Use managed for curated tables, external for shared storage.

Wrapping Up

Reading files in Databricks is about knowing the right options for each format and each quirk. Standard CSV needs header and inferSchema. Pipe-delimited needs delimiter. Quoted fields need quote. Escaped quotes need escape. Multiline values need multiLine. Once you know the options, every file format is just a combination of these settings.

For writing, the progression is clear: write CSV for compatibility, Parquet for analytics, and Delta for everything production. Partition by date for large datasets. Register as managed tables for governance.

The formula is always: Read (with correct options) → Transform (clean, standardize) → Write (Delta with partitions) → Register (managed table in catalog).

Related posts:
  • Azure Databricks Introduction and dbutils
  • Connecting Databricks to Storage
  • Data File Formats (CSV, Parquet, Delta, Avro, ORC)
  • Apache Spark and PySpark
  • Python for Data Engineers


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
