File Storage in Azure Databricks: Volumes, DBFS, /tmp/, External Locations, and Where Your Files Actually Live

You created a file in your Databricks notebook using open("/tmp/test.csv", "w"). It worked. You restarted the cluster. The file is gone. You tried open("/Volumes/workspace/default/myvol/test.csv", "a") and got Illegal seek. You tried dbutils.fs.rm("file:/tmp/test.csv") and got LocalFilesystemAccessDeniedException.

Welcome to the confusing world of Databricks file storage — where /tmp/, /dbfs/, /Volumes/, dbfs:/, abfss://, and file:/ all look similar but behave completely differently.

This post clears up every file path, storage location, and access method in Databricks. By the end, you will know exactly where to put files, how to access them, and which path prefix to use in which context.

Think of Databricks file storage like a building with multiple rooms. /tmp/ is a temporary locker that gets emptied every night (cluster restart). DBFS is a shared storage room that persists. Volumes are labeled filing cabinets organized by catalog and schema. External Locations are doors that open to your own warehouse next door (ADLS Gen2). Each room has its own key (path prefix) — use the wrong key and the door does not open.

Table of Contents

  • The Five Storage Locations in Databricks
  • /tmp/ — Driver-Local Temporary Storage
  • DBFS — Databricks File System
  • Unity Catalog Volumes — The Modern Way
  • External Locations — Your Own ADLS Gen2
  • The Workspace FileStore
  • The Path Prefix Cheat Sheet
  • Creating and Reading Files in Each Location
  • Managed Volumes vs External Volumes
  • Creating a Volume (Step by Step)
  • The Append Mode Bug (Illegal Seek)
  • Python open() vs dbutils.fs vs spark.read
  • Uploading Files via UI
  • Which Storage for Which Use Case
  • Common Errors and Fixes
  • Interview Questions
  • Wrapping Up

The Five Storage Locations in Databricks

| Location | Path | Persists? | Visible in UI? | Best For |
|---|---|---|---|---|
| /tmp/ | /tmp/file.csv | No (lost on restart) | No | Quick scratch files |
| DBFS | dbfs:/FileStore/file.csv | Yes | Data tab (DBFS) | Legacy file storage |
| Volumes | /Volumes/catalog/schema/vol/file.csv | Yes | Catalog Explorer | Modern file storage |
| External Location | abfss://container@account.dfs.core.windows.net/ | Yes (your storage) | No (browse in Azure) | Production data lake |
| Workspace FileStore | /FileStore/file.csv | Yes | No (access via URL) | Small files, images for notebooks |

/tmp/ — Driver-Local Temporary Storage

What It Is

/tmp/ is the local filesystem of the driver VM — the actual machine running your notebook. It is NOT distributed, NOT shared between nodes, and NOT persistent.

# Write a file to /tmp/
with open("/tmp/quick_test.csv", "w") as f:
    f.write("id,name\n1,Naveen\n2,Shrey\n")

# Read it back
with open("/tmp/quick_test.csv", "r") as f:
    print(f.read())

The Catch

Cluster starts → /tmp/ is empty
You create /tmp/test.csv → file exists
Cluster restarts → /tmp/test.csv is GONE
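
Because the file can vanish between runs, code that reads from /tmp/ is safer when it can recreate what it needs. The sketch below shows one way to do that; the helper name read_or_rebuild and the rebuild callback are illustrative assumptions, not part of any Databricks API. It uses tempfile.gettempdir(), which normally resolves to /tmp on a Linux driver.

```python
import os
import tempfile


def read_or_rebuild(path, rebuild):
    """Return the text stored at `path`, recreating the file first if it is missing.

    `rebuild` is a zero-argument callable that returns the text to write.
    """
    if not os.path.exists(path):
        # Typical after a cluster restart: the driver's local disk starts empty.
        with open(path, "w") as f:
            f.write(rebuild())
    with open(path, "r") as f:
        return f.read()


# Usage: the scratch file is rebuilt only when it does not exist yet.
scratch = os.path.join(tempfile.gettempdir(), "quick_test.csv")
print(read_or_rebuild(scratch, lambda: "id,name\n1,Naveen\n2,Shrey\n"))
```

This only hides the restart problem for data you can regenerate. Anything you cannot regenerate belongs in a persistent location.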

When to Use

  • Quick one-off tests
  • Temporary intermediate files during a notebook run
  • Never for anything you need to keep

When NOT to Use

  • Storing data between runs
  • Sharing files with other notebooks
  • Anything production

Real-life analogy: /tmp/ is a whiteboard in a meeting room. Write whatever you need during the meeting. The janitor erases it overnight. It is never meant for permanent notes.

DBFS — Databricks File System

What It Is

DBFS is Databricks’ built-in distributed file system. Files persist across cluster restarts. It is backed by cloud storage (Azure Blob behind the scenes).

The Confusing Part: Two Path Styles

DBFS files have TWO valid paths depending on which tool you use:

# Using dbutils (Databricks utility): the dbfs:/ prefix is optional
dbutils.fs.put("/FileStore/my_files/test.csv", "id,name\n1,Naveen\n", overwrite=True)
dbutils.fs.ls("/FileStore/my_files/")
dbutils.fs.head("/FileStore/my_files/test.csv")

# Using Python open() — use /dbfs/ prefix
with open("/dbfs/FileStore/my_files/test.csv", "r") as f:
    print(f.read())

# Using Spark — no prefix needed
df = spark.read.csv("/FileStore/my_files/test.csv", header=True)
df.show()

The same file, three different paths:

| Tool | Path |
|---|---|
| dbutils.fs | /FileStore/my_files/test.csv |
| Python open() | /dbfs/FileStore/my_files/test.csv |
| spark.read | /FileStore/my_files/test.csv or dbfs:/FileStore/my_files/test.csv |
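
Since the three forms differ only in their prefix, converting between them is plain string handling. The following is a minimal sketch; the function names to_dbfs_uri and to_fuse_path are made up for this post and are not Databricks functions.

```python
def _bare(path):
    """Strip a leading dbfs:/ or /dbfs/ so only the DBFS-relative path remains."""
    if path.startswith("dbfs:"):
        path = path[len("dbfs:"):]
    elif path == "/dbfs" or path.startswith("/dbfs/"):
        path = path[len("/dbfs"):]
    return "/" + path.lstrip("/")


def to_dbfs_uri(path):
    """Return the dbfs:/ form accepted by dbutils.fs and spark.read."""
    return "dbfs:/" + _bare(path).lstrip("/")


def to_fuse_path(path):
    """Return the /dbfs/ form that Python open() and os functions need."""
    return "/dbfs/" + _bare(path).lstrip("/")


print(to_fuse_path("dbfs:/FileStore/my_files/test.csv"))
print(to_dbfs_uri("/dbfs/FileStore/my_files/test.csv"))
```

A helper like this lets a notebook keep one canonical path per file and derive the form each tool expects.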

When to Use

  • Legacy workspaces without Unity Catalog
  • Databricks Community Edition (free tier — no Volumes available)
  • Storing small reference files

When NOT to Use

  • Unity Catalog workspaces (use Volumes instead)
  • Production data (use External Locations pointing to ADLS)
  • Large datasets

Real-life analogy: DBFS is like a shared network drive in an office. Everyone can access it, files persist, but it belongs to Databricks — you do not control the underlying storage.

Unity Catalog Volumes — The Modern Way

What It Is

Volumes are Unity Catalog’s managed file storage. They sit inside the catalog hierarchy: Catalog → Schema → Volume → Files. Think of them as organized folders governed by Unity Catalog permissions.

Catalog: workspace
  Schema: default
    Volume: naveenvol
      File: employees.csv
      File: pipeline_log.txt
      File: config.json

The Path

/Volumes/catalog_name/schema_name/volume_name/filename
/Volumes/workspace/default/naveenvol/employees.csv
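
The path is just the three-level name joined under /Volumes/, so it can be built and taken apart with ordinary string functions. A small sketch follows, where volume_path and split_volume_path are hypothetical helper names chosen for illustration.

```python
import posixpath


def volume_path(catalog, schema, volume, *parts):
    """Build /Volumes/<catalog>/<schema>/<volume>/<parts...>."""
    for name in (catalog, schema, volume):
        if not name or "/" in name:
            raise ValueError(f"invalid name component: {name!r}")
    # Strip leading slashes so a part cannot reset the join to the root.
    cleaned = [p.lstrip("/") for p in parts]
    return posixpath.join("/Volumes", catalog, schema, volume, *cleaned)


def split_volume_path(path):
    """Return (catalog, schema, volume, path_inside_volume)."""
    pieces = path.strip("/").split("/")
    if len(pieces) < 4 or pieces[0] != "Volumes":
        raise ValueError(f"not a volume path: {path!r}")
    catalog, schema, volume = pieces[1:4]
    return catalog, schema, volume, "/".join(pieces[4:])


print(volume_path("workspace", "default", "naveenvol", "employees.csv"))
print(split_volume_path("/Volumes/workspace/default/naveenvol/employees.csv"))
```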

How to Use

# Write
with open("/Volumes/workspace/default/naveenvol/employees.csv", "w") as f:
    f.write("id,name,dept,salary\n")
    f.write("1001,Naveen,Data Engineering,95000\n")
    f.write("1002,Shrey,Data Science,88000\n")

# Read with Python
with open("/Volumes/workspace/default/naveenvol/employees.csv", "r") as f:
    print(f.read())

# Read with Spark
df = spark.read.csv("/Volumes/workspace/default/naveenvol/employees.csv", header=True)
df.show()

# List files
import os
files = os.listdir("/Volumes/workspace/default/naveenvol/")
print(f"Files: {files}")

# Delete
os.remove("/Volumes/workspace/default/naveenvol/employees.csv")

Visible in UI

  1. Click Catalog in the sidebar
  2. Navigate: workspace → default → naveenvol
  3. Click Files tab — your files show up here

When to Use

  • All file operations on workspaces with Unity Catalog
  • Storing CSV/JSON files for notebook exercises
  • Landing files for Spark to read
  • Config files, lookup files, reference data

Real-life analogy: Volumes are like labeled filing cabinets in an office. Each cabinet (volume) belongs to a department (schema) in a building (catalog). Files are organized, labeled, findable, and access-controlled. Anyone with permission can browse the cabinet and see what is inside.

External Locations — Your Own ADLS Gen2

What It Is

External Locations let Databricks access YOUR storage account (ADLS Gen2). The files live in YOUR Azure subscription, not in Databricks-managed storage.

# Read from your ADLS Gen2 directly
df = spark.read.parquet("abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/bronze/customers/")

# Write to your ADLS Gen2
df.write.format("delta").save("abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/silver/customers/")
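
The abfss:// URIs above follow the pattern abfss://container@account.dfs.core.windows.net/path. The sketch below builds and parses that pattern with the standard library only; abfss_uri and parse_abfss are illustrative names, not part of Spark or the Azure SDK.

```python
from urllib.parse import urlsplit


def abfss_uri(container, account, path=""):
    """Build an ADLS Gen2 URI from a container, a storage account and a path."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def parse_abfss(uri):
    """Return (container, account, path) for an abfss:// URI."""
    parts = urlsplit(uri)
    if parts.scheme != "abfss" or not parts.username or not parts.hostname:
        raise ValueError(f"not an abfss URI: {uri!r}")
    # The storage account is the first label of the host name.
    account = parts.hostname.split(".")[0]
    return parts.username, account, parts.path.lstrip("/")


bronze = abfss_uri("raw-data", "naveenadlsgen2de", "bronze/customers/")
print(bronze)
print(parse_abfss(bronze))
```

Building the URI from parts keeps the container and account names in one place, which helps when the same notebook runs against different storage accounts.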

Setup Required

  1. Storage Credential — how to authenticate (Access Connector managed identity)
  2. External Location — which ADLS path to allow access to
  3. See our External Tables post for full setup

When to Use

  • Production data lake (Bronze/Silver/Gold layers)
  • Data shared across Databricks workspaces, Synapse, ADF, Power BI
  • Compliance requirements (data must stay in your storage)

Real-life analogy: External Locations are like having a key to the warehouse next door. The warehouse (ADLS) is yours — you own it, you control it. The key (External Location) lets Databricks open the door and access your inventory.

The Path Prefix Cheat Sheet

| What You Are Doing | Path to Use | Example |
|---|---|---|
| Python open() to /tmp/ | /tmp/ | /tmp/test.csv |
| Python open() to DBFS | /dbfs/ | /dbfs/FileStore/test.csv |
| Python open() to Volume | /Volumes/ | /Volumes/workspace/default/naveenvol/test.csv |
| dbutils.fs to DBFS | No prefix or dbfs:/ | /FileStore/test.csv |
| dbutils.fs to Volume | /Volumes/ | /Volumes/workspace/default/naveenvol/ |
| dbutils.fs to ADLS | abfss:// | abfss://container@account.dfs.core.windows.net/ |
| spark.read from DBFS | No prefix | /FileStore/test.csv |
| spark.read from Volume | /Volumes/ | /Volumes/workspace/default/naveenvol/test.csv |
| spark.read from ADLS | abfss:// | abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/ |

The rule: Python open() needs /dbfs/ for DBFS. Everything else uses the path directly. Volumes always use /Volumes/. ADLS always uses abfss://.
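
The same rule can be written as a small lookup that answers "where does this path point, given the tool I am using?". This is a sketch of the cheat sheet only, under the assumption that a bare path means DBFS for dbutils.fs and spark.read but the driver's local disk for Python open(). The function name storage_for and its return labels are made up for illustration.

```python
URI_PREFIXES = ("abfss://", "dbfs:/", "file:/")


def storage_for(path, tool):
    """Say which storage a path refers to for tool in {"open", "dbutils", "spark"}."""
    if tool not in ("open", "dbutils", "spark"):
        raise ValueError(f"unknown tool: {tool!r}")
    if tool == "open":
        # Python open() only understands plain filesystem paths.
        if path.startswith(URI_PREFIXES):
            return "unsupported"
        if path.startswith("/Volumes/"):
            return "Volume"
        if path.startswith("/dbfs/"):
            return "DBFS"
        return "driver-local"
    if path.startswith("abfss://"):
        return "ADLS Gen2"
    if path.startswith("/Volumes/"):
        return "Volume"
    if path.startswith("file:/"):
        return "driver-local"
    # dbutils.fs and spark.read treat dbfs:/ and bare paths as DBFS paths.
    return "DBFS"


for tool in ("open", "dbutils", "spark"):
    print(tool, "->", storage_for("/tmp/test.csv", tool))
```

The loop at the end shows the trap: the same string /tmp/test.csv means the driver's disk to open() and a DBFS folder to the other two tools.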

Managed Volumes vs External Volumes

| Feature | Managed Volume | External Volume |
|---|---|---|
| Data stored in | Databricks-managed storage | YOUR ADLS Gen2 |
| DROP VOLUME | Deletes data + metadata | Deletes metadata, data stays in ADLS |
| Create SQL | CREATE VOLUME myvol | CREATE EXTERNAL VOLUME myvol LOCATION 'abfss://...' |
| Best for | Notebooks, exercises, small files | Production, shared files |
| Visible in | Catalog Explorer | Catalog Explorer + Azure Portal |

Creating an External Volume

-- Requires an External Location already set up
CREATE EXTERNAL VOLUME workspace.default.adls_volume
LOCATION 'abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/volumes/';

Files uploaded to this volume physically land in your ADLS Gen2 container.

Creating a Volume (Step by Step)

Via UI

  1. Click Catalog in the sidebar
  2. Select your catalog (e.g., workspace)
  3. Select a schema (e.g., default)
  4. Click Create → Volume
  5. Name: naveenvol
  6. Type: Managed (or External with ADLS path)
  7. Click Create

Via SQL

-- Managed volume
CREATE VOLUME IF NOT EXISTS workspace.default.naveenvol;

-- External volume (requires External Location)
CREATE EXTERNAL VOLUME IF NOT EXISTS workspace.default.adls_volume
LOCATION 'abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/volumes/';

Via Python

spark.sql("CREATE VOLUME IF NOT EXISTS workspace.default.naveenvol")
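
When several volumes are created from a notebook, it can help to generate the statements instead of typing each one. The sketch below only builds the SQL strings shown above; create_volume_sql is a hypothetical helper, and the identifier check is a deliberately strict assumption (letters, digits and underscores) rather than the full Unity Catalog naming rules.

```python
import re

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def create_volume_sql(catalog, schema, volume, location=None):
    """Return a CREATE VOLUME statement; pass `location` for an external volume."""
    for part in (catalog, schema, volume):
        if not _IDENT.match(part):
            raise ValueError(f"unsupported identifier: {part!r}")
    name = f"{catalog}.{schema}.{volume}"
    if location is None:
        return f"CREATE VOLUME IF NOT EXISTS {name}"
    if "'" in location:
        raise ValueError("location must not contain a single quote")
    return f"CREATE EXTERNAL VOLUME IF NOT EXISTS {name} LOCATION '{location}'"


print(create_volume_sql("workspace", "default", "naveenvol"))
# In a notebook the result would be passed to spark.sql(...), for example:
# spark.sql(create_volume_sql("workspace", "default", "naveenvol"))
```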

The Append Mode Bug (Illegal Seek)

This error caught us during our practice session:

# This FAILS on Volumes
with open("/Volumes/workspace/default/naveenvol/log.txt", "a") as f:
    f.write("New line\n")
# OSError: [Errno 29] Illegal seek

Why It Fails

Databricks Volumes use cloud object storage underneath, which does NOT support file append natively. The "a" (append) mode tries to seek to the end of the file — cloud storage does not support seeking.
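
Since the raw error message says nothing about Volumes, one option is to fail early with a clearer message. The wrapper below is a sketch of that idea; open_checked is a made-up name, and it only checks the "a" mode described here.

```python
def open_checked(path, mode="r", **kwargs):
    """Open a file, but refuse append mode on /Volumes/ paths with a clear error."""
    if path.startswith("/Volumes/") and "a" in mode:
        raise ValueError(
            f"append mode {mode!r} is not supported on Volumes: {path}. "
            "Read the file, add the new content in memory, and rewrite it with 'w'."
        )
    return open(path, mode, **kwargs)


try:
    open_checked("/Volumes/workspace/default/naveenvol/log.txt", "a")
except ValueError as err:
    print(err)
```

Paths outside /Volumes/ pass straight through to the built-in open(), so the wrapper can replace open() in a notebook without changing other behavior.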

The Workaround

Read existing content, append new content, rewrite the entire file:

import os

vol_path = "/Volumes/workspace/default/naveenvol/log.txt"

# Read existing content (if file exists)
existing = ""
if os.path.exists(vol_path):
    with open(vol_path, "r") as f:
        existing = f.read()

# Append new content and write everything
with open(vol_path, "w") as f:
    f.write(existing)
    f.write("New line added!\n")

print("Appended successfully!")
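
An alternative to rewriting in place is to append to a copy on the driver's local disk, where "a" mode works, and then copy the whole file to the Volume. The sketch below assumes that a full-file copy with shutil.copyfile is acceptable on the target, since it writes sequentially. The function name append_and_publish and the example paths are illustrative.

```python
import os
import shutil


def append_and_publish(local_path, target_path, line):
    """Append `line` to a local file, then overwrite `target_path` with the result."""
    # If the local copy is gone (for example after a restart), seed it from the target.
    if not os.path.exists(local_path) and os.path.exists(target_path):
        shutil.copyfile(target_path, local_path)
    with open(local_path, "a") as f:  # append is fine on the local disk
        f.write(line + "\n")
    shutil.copyfile(local_path, target_path)  # one sequential full-file write


# Hypothetical usage in a notebook:
# append_and_publish("/tmp/log.txt",
#                    "/Volumes/workspace/default/naveenvol/log.txt",
#                    "New line added!")
```

Both approaches rewrite the entire file on every append, so they suit small logs. For high-volume logging, a table is usually a better fit than a single text file.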

File Modes That Work on Volumes

| Mode | Works? | What It Does |
|---|---|---|
| "r" | ✅ Yes | Read |
| "w" | ✅ Yes | Write (overwrite) |
| "a" | ❌ No | Append (fails with Illegal seek) |
| "rb" | ✅ Yes | Read binary |
| "wb" | ✅ Yes | Write binary |

Python open() vs dbutils.fs vs spark.read

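| Feature | Python open() | dbutils.fs | spark.read / spark.write |
|---|---|---|---|
| Read small files | ✅ Best | head() | Overkill |
| Read large data | ❌ Slow (single thread) | ❌ Not for processing | ✅ Best (distributed) |
| Write small files | ✅ Best | put() | Overkill |
| Write large data | ❌ Slow | ❌ Not for this | ✅ Best (distributed) |
| Delete files | os.remove() | dbutils.fs.rm() | N/A |
| List files | os.listdir() | dbutils.fs.ls() | N/A |
| Works with /tmp/ | ✅ | ❌ file:/ access can be blocked | ❌ Not recommended (file exists only on the driver) |
| Works with DBFS | ✅ (with /dbfs/ prefix) | ✅ | ✅ |
| Works with Volumes | ✅ | ✅ | ✅ |
| Works with ADLS | ❌ (open() does not understand abfss://) | ✅ (abfss://) | ✅ (abfss://) |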

Rule of thumb:

  • Small files (config, logs, CSVs under 1 MB) → Python open()
  • File management (list, delete, move, copy) → dbutils.fs
  • Data processing (Parquet, Delta, large CSV) → spark.read / spark.write
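
The size-based part of this rule can be expressed as a tiny check. The sketch below uses the 1 MB guideline from this post as a default threshold; suggest_reader is a hypothetical helper, and it only looks at a single file that is reachable through a local-style path such as /tmp/, /dbfs/ or /Volumes/.

```python
import os

SMALL_FILE_LIMIT = 1024 * 1024  # 1 MB, the guideline used in this post


def suggest_reader(path, limit=SMALL_FILE_LIMIT):
    """Suggest "open" for files under `limit` bytes, otherwise "spark.read"."""
    size = os.path.getsize(path)
    return "open" if size < limit else "spark.read"


# Hypothetical usage in a notebook:
# suggest_reader("/Volumes/workspace/default/naveenvol/employees.csv")
```

File size is only one signal. A small Parquet or Delta file is still easier to read with Spark because of the format, not the size.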

Uploading Files via UI

Upload to Volume

  1. Catalog → navigate to your volume
  2. Click Upload to volume button
  3. Drag and drop files
  4. Files appear under the Files tab

Upload to DBFS

  1. Click Data in the sidebar
  2. Click DBFS tab
  3. Navigate to /FileStore/
  4. Click Upload → drag and drop

Upload to Table (Create Table UI)

  1. Click Data in the sidebar
  2. Click Create Table
  3. Upload CSV → Databricks creates a managed table directly

Which Storage for Which Use Case

| Use Case | Storage | Path |
|---|---|---|
| Quick scratch file during development | /tmp/ | /tmp/test.csv |
| Notebook exercise files | Volume | /Volumes/workspace/default/naveenvol/ |
| Config files, lookup data | Volume | /Volumes/catalog/schema/config_vol/ |
| Production data lake (Bronze/Silver/Gold) | External Location (ADLS) | abfss://container@account.dfs.core.windows.net/ |
| Legacy workspace without Unity Catalog | DBFS | /FileStore/my_files/ |
| Files shared across workspaces | External Location (ADLS) | abfss:// |
| Small images for notebooks | Workspace FileStore | /FileStore/images/ |

Common Errors and Fixes

| Error | Cause | Fix |
|---|---|---|
| FileNotFoundError: /tmp/test.csv | Cluster restarted, /tmp/ was cleared | Use Volumes instead for persistent files |
| OSError: [Errno 29] Illegal seek | Append mode "a" on Volumes | Read existing → append in memory → write all with "w" |
| LocalFilesystemAccessDeniedException | dbutils.fs.rm("file:/tmp/...") blocked | Use os.remove("/tmp/...") or ignore (cleared on restart) |
| No such file or directory: /FileStore/... | Missing /dbfs/ prefix for Python open() on DBFS | Add /dbfs/ prefix: /dbfs/FileStore/test.csv |
| Path does not exist: /Volumes/... | Volume not created or wrong catalog/schema | Verify in Catalog Explorer, check catalog and schema names |
| PERMISSION_DENIED creating volume | Lack CREATE VOLUME privilege | Ask workspace admin for permission on the schema |
| Cannot access non /Workspace local filesystem | Databricks blocking local filesystem access entirely | Use Volumes instead of /tmp/ |

Interview Questions

Q: What are the storage options in Databricks? A: Five options: /tmp/ (driver-local, non-persistent), DBFS (Databricks-managed, persistent), Unity Catalog Volumes (modern governed storage), External Locations (your ADLS Gen2), and Workspace FileStore (legacy small file storage). For production, use External Locations for the data lake and Volumes for config/reference files.

Q: What is the difference between a Managed Volume and an External Volume? A: A Managed Volume stores data in Databricks-managed storage — DROP VOLUME deletes the data. An External Volume points to your ADLS Gen2 — DROP VOLUME removes only the metadata, data files remain in your storage. External Volumes require an External Location to be set up first.

Q: Why does append mode fail on Volumes? A: Volumes use cloud object storage (Azure Blob) underneath, which does not support native file append or seek operations. The workaround is to read the existing file content, append the new content in memory, and write the entire file using write mode ("w").

Q: When should you use Python open() vs spark.read? A: Use Python open() for small files (configs, logs, CSVs under 1 MB). Use spark.read for data processing (Parquet, Delta, large CSVs). Python open() runs on the driver only (single thread). Spark distributes the read across the cluster (parallel).

Wrapping Up

File storage in Databricks is confusing because there are five different locations with different path prefixes, persistence rules, and access methods. But once you understand the map, it is simple:

  • Development scratch files → /tmp/ (temporary) or Volumes (persistent)
  • Notebook data files → Volumes (/Volumes/catalog/schema/vol/)
  • Production data lake → External Locations (abfss://)
  • Legacy workspaces → DBFS (/dbfs/FileStore/)

Use Volumes for everyday file operations. Use External Locations for production data. Forget /tmp/ exists unless you need a 5-second scratch file. And always remember: append mode does not work on Volumes — read, modify, rewrite.

Related posts: Azure Databricks Introduction and dbutils; Connecting to Blob/ADLS Gen2; External Tables and Unity Catalog; Reading/Writing File Formats


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
