Delta Lake Deep Dive in Azure Databricks: Time Travel, Versioning, MERGE, Schema Evolution, and Every Operation Explained

Delta Lake is what transforms a simple data lake into a reliable data platform. Without Delta, your Parquet files are just files — no transactions, no updates, no history, no rollbacks. With Delta, your data lake gets the reliability of a database.

We have mentioned Delta in several posts, but never done a proper hands-on deep dive. This post fixes that. We are going to create a Delta table, perform every operation (INSERT, UPDATE, DELETE, MERGE), watch the version history grow, time travel to any point in the past, restore the table to an earlier version, evolve the schema, and clean up old files — all step by step in a Databricks notebook.

Think of a Delta table like a document in Google Docs with version history. Every edit (operation) creates a new version. You can see who changed what and when (DESCRIBE HISTORY). You can view the document as it was last Tuesday (time travel). You can restore an earlier version if someone messed up (RESTORE). And you can clean up old versions to save storage (VACUUM). Parquet files are like a Word document saved locally — no history, no versions, no undo.

Table of Contents

  • What Makes Delta Different from Parquet
  • Setup: Create the Schema
  • Version 0: Create the Delta Table
  • Version 1: INSERT — Add New Rows
  • Version 2: UPDATE — Modify Existing Rows
  • Version 3: DELETE — Remove Rows
  • Version 4: MERGE — Upsert (Insert + Update in One Operation)
  • DESCRIBE HISTORY — The Complete Audit Trail
  • Time Travel — Query Any Previous Version
  • Time Travel Method 1: VERSION AS OF
  • Time Travel Method 2: TIMESTAMP AS OF
  • Time Travel Method 3: PySpark API
  • Compare Two Versions — What Changed?
  • Track One Employee Across All Versions
  • RESTORE — Roll Back the Entire Table
  • VACUUM — Clean Up Old Files
  • Schema Evolution — Add New Columns
  • DESCRIBE DETAIL — Table Metadata
  • DeltaTable API — Python Access
  • Delta vs Parquet vs CSV Comparison
  • Common Mistakes
  • Interview Questions
  • Wrapping Up

What Makes Delta Different from Parquet

Parquet file:
  data/
    part-00000.parquet
    part-00001.parquet

Delta table:
  data/
    _delta_log/                    ← THIS is the magic
      00000000000000000000.json    ← Version 0: created table
      00000000000000000001.json    ← Version 1: inserted 2 rows
      00000000000000000002.json    ← Version 2: updated salary
      00000000000000000003.json    ← Version 3: deleted Ravi
      00000000000000000004.json    ← Version 4: MERGE upsert
    part-00000.snappy.parquet      ← Data files (same as Parquet)
    part-00001.snappy.parquet

The _delta_log/ folder is the transaction log — a JSON file for every operation. This log is what enables ACID transactions, time travel, and all Delta features. Without it, you just have Parquet files.

Real-life analogy: Parquet is a photo album with photos but no dates, no captions, and no way to recover deleted photos. Delta is Google Photos — every photo is timestamped, edits are tracked, deleted photos go to trash (recoverable for 30 days), and you can view your library as it was on any date.

Setup: Create the Schema

CREATE SCHEMA IF NOT EXISTS delta_demo;
USE delta_demo;

Version 0: Create the Delta Table

from pyspark.sql.functions import *

# Create sample employee data
data = [
    (1, "Naveen", "Data Engineering", "Mississauga", 95000, "2023-01-15"),
    (2, "Shrey", "Data Science", "Toronto", 88000, "2023-03-20"),
    (3, "Vrushab", "DevOps", "Vancouver", 92000, "2023-06-10"),
    (4, "Vishnu", "Analytics", "Ottawa", 91000, "2023-09-01"),
    (5, "Ravi", "Engineering", "Calgary", 85000, "2024-01-05"),
]

columns = ["emp_id", "name", "department", "city", "salary", "hire_date"]

df = (
    spark.createDataFrame(data, columns)
    .withColumn("hire_date", to_date(col("hire_date")))
)

# Save as managed Delta table — this is VERSION 0
df.write.format("delta").mode("overwrite").saveAsTable("delta_demo.employees")
print("Table created: delta_demo.employees (Version 0)")
-- Verify
SELECT * FROM delta_demo.employees ORDER BY emp_id;
emp_id  name     department        city         salary  hire_date
1       Naveen   Data Engineering  Mississauga  95000   2023-01-15
2       Shrey    Data Science      Toronto      88000   2023-03-20
3       Vrushab  DevOps            Vancouver    92000   2023-06-10
4       Vishnu   Analytics         Ottawa       91000   2023-09-01
5       Ravi     Engineering       Calgary      85000   2024-01-05

5 employees. Version 0. This is our baseline.

Version 1: INSERT — Add New Rows

INSERT INTO delta_demo.employees VALUES
    (6, 'Anita', 'Marketing', 'Montreal', 78000, '2024-03-15'),
    (7, 'Deepak', 'Finance', 'Edmonton', 82000, '2024-05-01');
SELECT * FROM delta_demo.employees ORDER BY emp_id;
-- Now 7 employees. Version 1.

What happened internally: Delta created a new Parquet file with the 2 new rows and added a JSON entry to _delta_log/ recording this INSERT. The original 5-row Parquet file is untouched.
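
If you want to see this for yourself, list the transaction log folder. A minimal sketch, assuming you are in a Databricks notebook (so dbutils is available) and reading the table location from DESCRIBE DETAIL rather than hard-coding a path:

# Peek at the transaction log: one JSON file per committed version
location = spark.sql("DESCRIBE DETAIL delta_demo.employees").collect()[0]["location"]
for f in dbutils.fs.ls(f"{location}/_delta_log"):
    print(f.name, f.size)
# You should see 00000000000000000000.json (create) and 00000000000000000001.json (insert)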

Version 2: UPDATE — Modify Existing Rows

Naveen got promoted and relocated:

UPDATE delta_demo.employees
SET salary = 105000, city = 'Toronto'
WHERE emp_id = 1;
SELECT * FROM delta_demo.employees WHERE emp_id = 1;
-- Naveen now shows salary=105000, city=Toronto. Version 2.

What happened internally: Delta does NOT modify the original Parquet file. It creates a NEW Parquet file with Naveen’s updated row and marks the old file as “removed” in the transaction log. The old file still physically exists on disk (for time travel) until VACUUM cleans it up.

Real-life analogy: An UPDATE in Delta is like editing a Wikipedia article. The old version is preserved in the revision history. The current page shows the latest version. You can view any previous revision.
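
If you want proof of this copy-on-write behaviour, the operation metrics recorded in the history show how many files the UPDATE added and removed. A quick check (exact metric names can vary by runtime version):

from delta.tables import DeltaTable

# Metrics of the most recent commit, i.e. the UPDATE above
last_op = DeltaTable.forName(spark, "delta_demo.employees").history(1).collect()[0]
print(last_op["operation"])           # UPDATE
print(last_op["operationMetrics"])    # e.g. numAddedFiles, numRemovedFiles, numUpdatedRows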

Version 3: DELETE — Remove Rows

Ravi left the company:

DELETE FROM delta_demo.employees
WHERE emp_id = 5;
SELECT * FROM delta_demo.employees ORDER BY emp_id;
-- Ravi is gone. 6 employees remain. Version 3.

What happened internally: The transaction log records that the Parquet file containing Ravi’s row is “removed.” The file still exists physically (for time travel). A new Parquet file is created with the remaining rows from that file.

Version 4: MERGE — Upsert (Insert + Update in One Operation)

MERGE is the most powerful Delta operation. It combines INSERT, UPDATE, and DELETE in a single atomic statement:

from delta.tables import DeltaTable

# Source data — mix of new and existing employees
merge_data = [
    (2, "Shrey", "Machine Learning", "Toronto", 95000, "2023-03-20"),   # EXISTING — department changed, salary increased
    (8, "Manju", "QA", "Pune", 72000, "2024-07-01"),                    # NEW
    (9, "Teja", "Backend", "Hyderabad", 88000, "2024-08-15"),           # NEW
]

df_source = (
    spark.createDataFrame(merge_data, columns)
    .withColumn("hire_date", to_date(col("hire_date")))
)

# Load the target Delta table
target_table = DeltaTable.forName(spark, "delta_demo.employees")

# MERGE: update if exists, insert if new
target_table.alias("target").merge(
    df_source.alias("source"),
    "target.emp_id = source.emp_id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

print("MERGE completed — Version 4")
SELECT * FROM delta_demo.employees ORDER BY emp_id;
-- Shrey updated (ML, 95K), Manju and Teja inserted. 8 employees. Version 4.

Real-life analogy: MERGE is like checking guests into a hotel. Returning guests (matched) get their room preferences updated. New guests (not matched) get registered and assigned a room. One operation handles both — no need to check “are they new?” first.

DESCRIBE HISTORY — The Complete Audit Trail

DESCRIBE HISTORY delta_demo.employees;

Shows every operation performed on the table:

version  timestamp            operation     operationParameters
4        2026-05-05 10:25:00  MERGE         predicate: target.emp_id = source.emp_id
3        2026-05-05 10:20:00  DELETE        predicate: emp_id = 5
2        2026-05-05 10:15:00  UPDATE        predicate: emp_id = 1
1        2026-05-05 10:10:00  WRITE         mode: Append
0        2026-05-05 10:05:00  CREATE TABLE

Every operation is logged with: version number, timestamp, operation type, user who performed it, and parameters. This is your complete audit trail.

Real-life analogy: DESCRIBE HISTORY is like a security camera log for your data. “At 10:15, Version 2 was created by user naveen — operation: UPDATE. At 10:20, Version 3 was created — operation: DELETE.” Every change is recorded. Nothing happens silently.

Time Travel — Query Any Previous Version

Method 1: VERSION AS OF (By Version Number)

-- Current state (Version 4) — 8 employees, Ravi gone, Shrey updated
SELECT * FROM delta_demo.employees ORDER BY emp_id;

-- Version 0 — original 5 employees
SELECT * FROM delta_demo.employees VERSION AS OF 0 ORDER BY emp_id;

-- Version 1 — 7 employees (after INSERT)
SELECT * FROM delta_demo.employees VERSION AS OF 1 ORDER BY emp_id;

-- Version 2 — Naveen's salary changed
SELECT * FROM delta_demo.employees VERSION AS OF 2 WHERE emp_id = 1;

-- Version 3 — Ravi still exists in Version 2 but gone in Version 3
SELECT * FROM delta_demo.employees VERSION AS OF 2 WHERE emp_id = 5;  -- Ravi EXISTS
SELECT * FROM delta_demo.employees VERSION AS OF 3 WHERE emp_id = 5;  -- Ravi GONE

Method 2: TIMESTAMP AS OF (By Point in Time)

-- What did the table look like at a specific timestamp?
SELECT * FROM delta_demo.employees TIMESTAMP AS OF '2026-05-05 10:12:00' ORDER BY emp_id;

Useful when you know WHEN something happened but not WHICH version.
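
If you also need to know which version was current at that moment, you can derive it from the history. A small sketch using the demo timestamps above:

from pyspark.sql.functions import col

# Latest version committed at or before the given point in time
history_df = spark.sql("DESCRIBE HISTORY delta_demo.employees")
version_at = (
    history_df
    .filter(col("timestamp") <= "2026-05-05 10:12:00")
    .orderBy(col("version").desc())
    .first()["version"]
)
print(f"Version in effect at that time: {version_at}")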

Method 3: PySpark API (Programmatic)

# Read Version 0 programmatically
df_v0 = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .table("delta_demo.employees")
)

df_v0.show()
print(f"Version 0 had {df_v0.count()} employees")

# Read by timestamp
df_past = (
    spark.read.format("delta")
    .option("timestampAsOf", "2026-05-05 10:12:00")
    .table("delta_demo.employees")
)

df_past.show()

When Time Travel Saves You

  • Bad UPDATE corrupted 10,000 rows → query the version BEFORE the update to see the correct data.
  • Auditor asks “what was Alice’s salary on March 1?” → TIMESTAMP AS OF '2026-03-01'.
  • Accidental DELETE removed critical records → query the previous version to find them (see the sketch below).
  • Need to compare this week vs last week → query two different versions.
  • Debug a pipeline — “when did this row change?” → track the row across versions.
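
For example, the accidental-DELETE scenario can often be fixed without restoring the whole table: copy the missing rows back from a version that still has them. A sketch against our demo table (running it mid-demo would add an extra version, so treat it as an illustration):

# Ravi (emp_id = 5) was deleted in Version 3; copy his row back from Version 2
spark.sql("""
    INSERT INTO delta_demo.employees
    SELECT * FROM delta_demo.employees VERSION AS OF 2
    WHERE emp_id = 5
""")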

Compare Two Versions — What Changed?

# Load two versions
df_v1 = spark.read.format("delta").option("versionAsOf", 1).table("delta_demo.employees")
df_v3 = spark.read.format("delta").option("versionAsOf", 3).table("delta_demo.employees")

# Find rows in Version 1 but NOT in Version 3 (deleted or changed)
df_deleted = df_v1.exceptAll(df_v3)
print("Rows removed or changed between V1 and V3:")
df_deleted.show()

# Find rows in Version 3 but NOT in Version 1 (new or changed)
df_added = df_v3.exceptAll(df_v1)
print("Rows added or changed between V1 and V3:")
df_added.show()

Track One Employee Across All Versions

# Track Naveen (emp_id=1) across all versions
print("Naveen's journey through time:")
print("=" * 60)

for v in range(5):
    try:
        df_v = spark.read.format("delta").option("versionAsOf", v).table("delta_demo.employees")
        naveen = df_v.filter(col("emp_id") == 1).collect()
        if naveen:
            row = naveen[0]
            print(f"Version {v}: city={row['city']}, salary={row['salary']}, dept={row['department']}")
        else:
            print(f"Version {v}: Not found (deleted)")
    except Exception as e:
        print(f"Version {v}: Error - {e}")

Output:

Version 0: city=Mississauga, salary=95000, dept=Data Engineering
Version 1: city=Mississauga, salary=95000, dept=Data Engineering
Version 2: city=Toronto, salary=105000, dept=Data Engineering     ← UPDATED
Version 3: city=Toronto, salary=105000, dept=Data Engineering
Version 4: city=Toronto, salary=105000, dept=Data Engineering

RESTORE — Roll Back the Entire Table

Someone accidentally ran a bad DELETE. Roll back the entire table:

-- Table currently at Version 4 (8 employees)

-- Restore to Version 1 (7 employees, before UPDATE/DELETE/MERGE)
RESTORE TABLE delta_demo.employees TO VERSION AS OF 1;
-- Verify — table is back to Version 1 state
SELECT * FROM delta_demo.employees ORDER BY emp_id;
-- 7 employees! Ravi is back. Naveen has original salary. This is now Version 5.

Important: RESTORE creates a NEW version (Version 5). It does not erase history — Versions 0-4 still exist and are still accessible via time travel. RESTORE is safe and reversible.

Real-life analogy: RESTORE is like pressing “Undo” in Google Docs. The document goes back to an earlier state, but the revision history still shows all the edits that happened in between. You can undo the undo.
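
You can confirm this by checking the history right after the restore; the RESTORE itself is simply the newest entry. A quick check:

# The restore appears as its own operation on top of the existing history
spark.sql("DESCRIBE HISTORY delta_demo.employees") \
    .select("version", "operation", "operationParameters") \
    .show(3, truncate=False)
# Expect the newest row to have operation = RESTORE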

After RESTORE, let us re-apply our operations to get back to a full dataset:

-- Re-apply the operations to continue the demo
RESTORE TABLE delta_demo.employees TO VERSION AS OF 4;
-- Back to Version 4 state (8 employees). This is now Version 6.

VACUUM — Clean Up Old Files

The Problem

Every UPDATE, DELETE, and MERGE creates new Parquet files while keeping old ones for time travel. Over time, old files accumulate and consume storage.

The Solution

-- See how much space old files consume
DESCRIBE DETAIL delta_demo.employees;
-- Check: numFiles and sizeInBytes

-- VACUUM removes old files beyond the retention period
-- Default retention: 7 days (168 hours)
VACUUM delta_demo.employees RETAIN 168 HOURS;
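
Before deleting anything, you can preview which files a VACUUM would remove. A minimal sketch using the DRY RUN clause, which lists candidate files and deletes nothing:

# Preview the files VACUUM would delete, without removing anything
spark.sql("VACUUM delta_demo.employees RETAIN 168 HOURS DRY RUN").show(truncate=False)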

What VACUUM Does

Before VACUUM:
  _delta_log/  (transaction log — NOT touched by VACUUM)
  part-00000.parquet  (current — in use)
  part-00001.parquet  (current — in use)
  part-00002.parquet  (old — from Version 0, replaced in Version 2)
  part-00003.parquet  (old — from Version 1, replaced in Version 3)

After VACUUM (if old files exceed retention):
  _delta_log/  (still intact)
  part-00000.parquet  (current)
  part-00001.parquet  (current)
  (old files deleted — time travel to those versions stops working!)

The Retention Warning

-- ⚠️ DANGEROUS: Setting retention to 0 removes ALL old files immediately
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM delta_demo.employees RETAIN 0 HOURS;
-- Time travel to old versions will FAIL after this!

Never set retention to 0 in production. Keep the default 7 days or set it based on your audit requirements.

Real-life analogy: VACUUM is like emptying the trash on your computer. Files in the trash (old Parquet files) are still recoverable. Emptying the trash (VACUUM) permanently deletes them. Set a retention period like “keep trash for 7 days before auto-emptying.”
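
One way to align retention with your audit requirements is to set it per table. A hedged sketch using Delta table properties, here keeping 30 days of history for both data files and the log:

# Keep removed files (and therefore time travel) for 30 days instead of the default 7
spark.sql("""
    ALTER TABLE delta_demo.employees SET TBLPROPERTIES (
        'delta.deletedFileRetentionDuration' = 'interval 30 days',
        'delta.logRetentionDuration' = 'interval 30 days'
    )
""")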

Schema Evolution — Add New Columns

Without mergeSchema (Fails)

# Create data with a new column 'level'
new_data = [
    (10, "Sanju", "Backend", "Pune", 89000, "2024-09-01", "Senior"),
    (11, "Ayushi", "Frontend", "Jaipur", 86000, "2024-10-15", "Junior"),
]

df_new = (
    spark.createDataFrame(
        new_data,
        ["emp_id", "name", "department", "city", "salary", "hire_date", "level"]
    )
    .withColumn("hire_date", to_date(col("hire_date")))
)

# This FAILS — new column 'level' does not exist in the table
try:
    df_new.write.format("delta").mode("append").saveAsTable("delta_demo.employees")
except Exception as e:
    print(f"ERROR: {e}")
    print("Need to enable schema evolution!")

With mergeSchema (Works)

# Enable schema evolution with mergeSchema option
df_new.write \
    .format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .saveAsTable("delta_demo.employees")

print("Schema evolved! New column 'level' added automatically.")
-- Verify — old rows have NULL for 'level', new rows have values
SELECT emp_id, name, city, salary, level 
FROM delta_demo.employees 
ORDER BY emp_id;
emp_id  name    city     salary  level
1       Naveen  Toronto  105000  NULL
2       Shrey   Toronto  95000   NULL
10      Sanju   Pune     89000   Senior
11      Ayushi  Jaipur   86000   Junior

Old rows get NULL for the new column. New rows have the value. No data loss. No table recreation.

Real-life analogy: Schema evolution is like adding a new column to a spreadsheet. The existing rows get blank cells for the new column. New rows fill in the new column. The spreadsheet expands without losing any existing data.
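
If your pipelines evolve schemas regularly (for example a MERGE-based load that picks up new source columns), there are two alternatives to passing mergeSchema on every write. A sketch; the 'bonus' column is purely hypothetical, and the autoMerge conf is session-wide, so enable it deliberately:

# Session-wide alternative to mergeSchema (mainly relevant for MERGE-based pipelines)
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# Or evolve the schema explicitly and keep individual writes strict
spark.sql("ALTER TABLE delta_demo.employees ADD COLUMNS (bonus DOUBLE)")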

DESCRIBE DETAIL — Table Metadata

DESCRIBE DETAIL delta_demo.employees;

Shows: table name, format (delta), location (storage path), number of files, total size in bytes, number of partitions, and more.

# Programmatic access to details
detail = spark.sql("DESCRIBE DETAIL delta_demo.employees").collect()[0]
print(f"Format: {detail['format']}")
print(f"Files: {detail['numFiles']}")
print(f"Size: {detail['sizeInBytes'] / 1024:.1f} KB")
print(f"Location: {detail['location']}")

DeltaTable API — Python Access

from delta.tables import DeltaTable

# Load the Delta table as a DeltaTable object
dt = DeltaTable.forName(spark, "delta_demo.employees")

# Get the history
dt.history().show(truncate=False)

# Get the current version number
latest_version = dt.history(1).select("version").collect()[0][0]
print(f"Current version: {latest_version}")

# Convert to DataFrame for transformations
df = dt.toDF()
df.show()

# Delete using DeltaTable API
dt.delete("emp_id = 99")  # Delete rows matching condition

# Update using DeltaTable API
dt.update(
    condition="emp_id = 2",
    set={"salary": "100000"}
)

Delta vs Parquet vs CSV Comparison

Feature              CSV          Parquet        Delta
Schema embedded      No           Yes            Yes
Columnar storage     No           Yes            Yes
Compression          No           Yes (Snappy)   Yes (Snappy)
ACID transactions    No           No             Yes
INSERT               Append only  Append only    Yes
UPDATE               No           No             Yes
DELETE               No           No             Yes
MERGE (upsert)       No           No             Yes
Time travel          No           No             Yes
Schema evolution     No           No             Yes (mergeSchema)
RESTORE              No           No             Yes
VACUUM               N/A          N/A            Yes
OPTIMIZE             N/A          N/A            Yes
Concurrent writes    Corrupt      Corrupt        Safe (ACID)

Common Mistakes

  1. VACUUM with 0 hours retention — permanently deletes ALL old files. Time travel stops working. Never do this in production.

  2. Forgetting mergeSchema on schema evolution — the write fails with a schema-mismatch error when new columns are present. Always add .option("mergeSchema", "true") when the source might have new columns.

  3. Using DESCRIBE HISTORY for row-level auditing — DESCRIBE HISTORY shows operations, NOT which specific rows changed. For row-level auditing, compare versions using time travel.

  4. Expecting VERSION AS OF to work after VACUUM — if VACUUM deleted the files for that version, time travel fails with “file not found.” Only versions within the retention window are accessible.

  5. Running RESTORE and expecting history to be erased — RESTORE creates a new version with the old state. It does NOT delete intermediate versions. All history is preserved.

  6. Not running OPTIMIZE on frequently updated tables — many small UPDATE/DELETE operations create many small files. Run OPTIMIZE table_name regularly to compact them (a sketch follows this list).
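
A minimal sketch of the compaction mentioned in mistake 6 (the ZORDER column is just an example; pick one you filter on often):

# Compact small files; optionally co-locate rows by a frequently filtered column
spark.sql("OPTIMIZE delta_demo.employees")
spark.sql("OPTIMIZE delta_demo.employees ZORDER BY (department)")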

Interview Questions

Q: What is Delta Lake and how does it differ from Parquet?
A: Delta Lake adds a transaction log (_delta_log/) on top of Parquet files. This log enables ACID transactions, UPDATE/DELETE/MERGE operations, time travel, schema evolution, and RESTORE — none of which are possible with raw Parquet. The data files are still Parquet format; Delta adds reliability and manageability.

Q: How does time travel work in Delta Lake?
A: Every operation creates a new version in the transaction log. You can query any previous version using VERSION AS OF n or TIMESTAMP AS OF 'datetime'. The old Parquet files are preserved on disk until VACUUM removes them, so time travel works as long as the files for that version still exist (within the retention period).

Q: What does MERGE do and when would you use it?
A: MERGE combines INSERT, UPDATE, and (optionally) DELETE in a single atomic operation. It matches source rows against target rows using a condition (typically the business key): matched rows are updated, unmatched rows are inserted. It is used for SCD implementations, incremental loads, and deduplication.

Q: What is the relationship between VACUUM and time travel?
A: VACUUM removes old Parquet files that are no longer referenced by the current version. After VACUUM, time travel to versions whose files were deleted will fail. The default retention is 7 days. In production, set retention based on how far back you need time travel to work.

Q: How does schema evolution work in Delta?
A: When writing data with new columns that do not exist in the target table, add .option("mergeSchema", "true"). Delta automatically adds the new columns. Existing rows get NULL for the new columns; new rows have the actual values. No ALTER TABLE needed.

Wrapping Up

Delta Lake transforms your data lake from a collection of files into a proper data platform with database-grade reliability. Every operation is versioned, every change is audited, and any mistake is recoverable.

The key operations to master: INSERT/UPDATE/DELETE/MERGE for data management. DESCRIBE HISTORY for auditing. VERSION AS OF for time travel. RESTORE for recovery. VACUUM for storage management. Schema evolution for adapting to changing sources.

This is not theoretical — this is exactly how production data platforms work in Databricks, Synapse Spark, and any environment that supports Delta Lake.

Related posts:
  • Azure Databricks Introduction and dbutils
  • Data File Formats (CSV, Parquet, Delta, Avro, ORC)
  • PySpark Transformations Cookbook
  • Reading/Writing File Formats in Databricks
  • SCD Type 2 Pipeline


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
