File Storage in Azure Databricks: Volumes, DBFS, /tmp/, External Locations, and Where Your Files Actually Live
You created a file in your Databricks notebook using open("/tmp/test.csv", "w"). It worked. You restarted the cluster. The file is gone. You tried open("/Volumes/workspace/default/myvol/test.csv", "a") and got Illegal seek. You tried dbutils.fs.rm("file:/tmp/test.csv") and got LocalFilesystemAccessDeniedException.
Welcome to the confusing world of Databricks file storage — where /tmp/, /dbfs/, /Volumes/, dbfs:/, abfss://, and file:/ all look similar but behave completely differently.
This post clears up every file path, storage location, and access method in Databricks. By the end, you will know exactly where to put files, how to access them, and which path prefix to use in which context.
Think of Databricks file storage like a building with multiple rooms. /tmp/ is a temporary locker that gets emptied every night (cluster restart). DBFS is a shared storage room that persists. Volumes are labeled filing cabinets organized by catalog and schema. External Locations are doors that open to your own warehouse next door (ADLS Gen2). Each room has its own key (path prefix) — use the wrong key and the door does not open.
Table of Contents
- The Five Storage Locations in Databricks
- /tmp/ — Driver-Local Temporary Storage
- DBFS — Databricks File System
- Unity Catalog Volumes — The Modern Way
- External Locations — Your Own ADLS Gen2
- The Workspace FileStore
- The Path Prefix Cheat Sheet
- Creating and Reading Files in Each Location
- Managed Volumes vs External Volumes
- Creating a Volume (Step by Step)
- The Append Mode Bug (Illegal Seek)
- Python open() vs dbutils.fs vs spark.read
- Uploading Files via UI
- Which Storage for Which Use Case
- Common Errors and Fixes
- Interview Questions
- Wrapping Up
The Five Storage Locations in Databricks
| Location | Path | Persists? | Visible in UI? | Best For |
|---|---|---|---|---|
| /tmp/ | /tmp/file.csv | No (lost on restart) | No | Quick scratch files |
| DBFS | dbfs:/FileStore/file.csv | Yes | Data tab (DBFS) | Legacy file storage |
| Volumes | /Volumes/catalog/schema/vol/file.csv | Yes | Catalog Explorer | Modern file storage |
| External Location | abfss://container@account.dfs.core.windows.net/ | Yes (your storage) | No (browse in Azure) | Production data lake |
| Workspace FileStore | /FileStore/file.csv | Yes | No (access via URL) | Small files, images for notebooks |
/tmp/ — Driver-Local Temporary Storage
What It Is
/tmp/ is the local filesystem of the driver VM — the actual machine running your notebook. It is NOT distributed, NOT shared between nodes, and NOT persistent.
# Write a file to /tmp/
with open("/tmp/quick_test.csv", "w") as f:
    f.write("id,name\n1,Naveen\n2,Shrey\n")

# Read it back
with open("/tmp/quick_test.csv", "r") as f:
    print(f.read())
The Catch
Cluster starts → /tmp/ is empty
You create /tmp/test.csv → file exists
Cluster restarts → /tmp/test.csv is GONE
When to Use
- Quick one-off tests
- Temporary intermediate files during a notebook run
- Never for anything you need to keep
When NOT to Use
- Storing data between runs
- Sharing files with other notebooks
- Anything production
Real-life analogy: /tmp/ is a whiteboard in a meeting room. Write whatever you need during the meeting. The janitor erases it overnight. It is never meant for permanent notes.
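For the scratch-file use case above, a minimal sketch using Python's standard tempfile module is shown below. It assumes the driver's temp directory is the usual /tmp location; the scratch_ prefix and the sample rows are illustrative only.

```python
import os
import tempfile

# Create a scratch file in the driver's temp directory (usually /tmp on Linux).
# delete=False keeps the file after the `with` block so it can be read again.
with tempfile.NamedTemporaryFile(
    mode="w", suffix=".csv", prefix="scratch_", delete=False
) as tmp:
    tmp.write("id,name\n1,Naveen\n2,Shrey\n")
    scratch_path = tmp.name

# Read it back like any other local file.
with open(scratch_path, "r") as f:
    contents = f.read()
print(contents)

# Clean up explicitly instead of waiting for a cluster restart.
os.remove(scratch_path)
```

Letting tempfile pick the file name avoids collisions when several notebooks share the same driver.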
DBFS — Databricks File System
What It Is
DBFS is Databricks’ built-in distributed file system. Files persist across cluster restarts. It is backed by cloud storage (Azure Blob behind the scenes).
The Confusing Part: Two Path Styles
The same DBFS file is addressed in two styles depending on the tool: a bare path (or the dbfs:/ scheme) for dbutils.fs and Spark, and the /dbfs/ mount for Python open():
# Using dbutils (Databricks utility): bare path or dbfs:/ prefix
dbutils.fs.put("/FileStore/my_files/test.csv", "id,name\n1,Naveen\n", overwrite=True)
dbutils.fs.ls("/FileStore/my_files/")
dbutils.fs.head("/FileStore/my_files/test.csv")

# Using Python open(): use the /dbfs/ prefix
with open("/dbfs/FileStore/my_files/test.csv", "r") as f:
    print(f.read())

# Using Spark: bare path or dbfs:/ prefix
df = spark.read.csv("/FileStore/my_files/test.csv", header=True)
df.show()
The same file, as addressed by each of the three tools:
| Tool | Path |
|---|---|
| dbutils.fs | /FileStore/my_files/test.csv |
| Python open() | /dbfs/FileStore/my_files/test.csv |
| spark.read | /FileStore/my_files/test.csv or dbfs:/FileStore/my_files/test.csv |
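Because the two styles differ only in their prefix, the conversion can be sketched as plain string handling. The helper names to_fuse_path and to_dbfs_uri are hypothetical, not part of any Databricks API.

```python
def to_fuse_path(path: str) -> str:
    """Return the /dbfs/... form that Python open() expects."""
    if path.startswith("/dbfs/"):
        return path
    if path.startswith("dbfs:/"):
        path = path[len("dbfs:"):]
    return "/dbfs/" + path.lstrip("/")


def to_dbfs_uri(path: str) -> str:
    """Return the dbfs:/... form accepted by dbutils.fs and spark.read."""
    if path.startswith("dbfs:/"):
        return path
    if path.startswith("/dbfs/"):
        path = path[len("/dbfs"):]
    return "dbfs:/" + path.lstrip("/")


print(to_fuse_path("/FileStore/my_files/test.csv"))
print(to_dbfs_uri("/dbfs/FileStore/my_files/test.csv"))
```

Both helpers treat a bare path such as /FileStore/... as a DBFS path, which matches how dbutils.fs and Spark resolve it in the examples above.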
When to Use
- Legacy workspaces without Unity Catalog
- Databricks Community Edition (free tier — no Volumes available)
- Storing small reference files
When NOT to Use
- Unity Catalog workspaces (use Volumes instead)
- Production data (use External Locations pointing to ADLS)
- Large datasets
Real-life analogy: DBFS is like a shared network drive in an office. Everyone can access it, files persist, but it belongs to Databricks — you do not control the underlying storage.
Unity Catalog Volumes — The Modern Way
What It Is
Volumes are Unity Catalog’s managed file storage. They sit inside the catalog hierarchy: Catalog → Schema → Volume → Files. Think of them as organized folders governed by Unity Catalog permissions.
Catalog: workspace
  Schema: default
    Volume: naveenvol
      File: employees.csv
      File: pipeline_log.txt
      File: config.json
The Path
/Volumes/catalog_name/schema_name/volume_name/filename
/Volumes/workspace/default/naveenvol/employees.csv
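Since the path always follows the same catalog/schema/volume pattern, it can be assembled with a small helper instead of hand-written strings. This is only a sketch; volume_path is a hypothetical name, not a Databricks function.

```python
import posixpath


def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build /Volumes/<catalog>/<schema>/<volume>/<parts...>."""
    for name in (catalog, schema, volume):
        if not name or "/" in name:
            raise ValueError(f"invalid name: {name!r}")
    # Strip slashes so a leading "/" in a part cannot reset the join.
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    return posixpath.join("/Volumes", catalog, schema, volume, *cleaned)


print(volume_path("workspace", "default", "naveenvol", "employees.csv"))
```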
How to Use
import os

# Write
with open("/Volumes/workspace/default/naveenvol/employees.csv", "w") as f:
    f.write("id,name,dept,salary\n")
    f.write("1001,Naveen,Data Engineering,95000\n")
    f.write("1002,Shrey,Data Science,88000\n")

# Read with Python
with open("/Volumes/workspace/default/naveenvol/employees.csv", "r") as f:
    print(f.read())

# Read with Spark
df = spark.read.csv("/Volumes/workspace/default/naveenvol/employees.csv", header=True)
df.show()

# List files
files = os.listdir("/Volumes/workspace/default/naveenvol/")
print(f"Files: {files}")

# Delete
os.remove("/Volumes/workspace/default/naveenvol/employees.csv")
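Config files such as the config.json shown in the hierarchy above can be handled the same way. The sketch below round-trips a JSON config with the standard json module; save_config, load_config, and the config keys are hypothetical, and a temporary directory stands in for the volume so the example runs anywhere.

```python
import json
import os
import tempfile


def save_config(directory: str, name: str, config: dict) -> str:
    """Write the whole config in one pass using "w" mode."""
    path = os.path.join(directory, name)
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
    return path


def load_config(path: str) -> dict:
    """Read the config back as a dictionary."""
    with open(path, "r") as f:
        return json.load(f)


# On Databricks, `directory` would be a volume path such as
# "/Volumes/workspace/default/naveenvol"; a temp directory is used here.
with tempfile.TemporaryDirectory() as demo_dir:
    cfg_path = save_config(demo_dir, "config.json", {"env": "dev", "batch_size": 500})
    loaded = load_config(cfg_path)

print(loaded)
```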
Visible in UI
- Click Catalog in the sidebar
- Navigate: workspace → default → naveenvol
- Click Files tab — your files show up here
When to Use
- All file operations on workspaces with Unity Catalog
- Storing CSV/JSON files for notebook exercises
- Landing files for Spark to read
- Config files, lookup files, reference data
Real-life analogy: Volumes are like labeled filing cabinets in an office. Each cabinet (volume) belongs to a department (schema) in a building (catalog). Files are organized, labeled, findable, and access-controlled. Anyone with permission can browse the cabinet and see what is inside.
External Locations — Your Own ADLS Gen2
What It Is
External Locations let Databricks access YOUR storage account (ADLS Gen2). The files live in YOUR Azure subscription, not in Databricks-managed storage.
# Read from your ADLS Gen2 directly
df = spark.read.parquet("abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/bronze/customers/")
# Write to your ADLS Gen2
df.write.format("delta").save("abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/silver/customers/")
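The abfss:// paths above follow a fixed pattern of container, storage account, and relative path. A small helper can assemble them; abfss_uri is a hypothetical name used for illustration only.

```python
def abfss_uri(container: str, account: str, path: str = "") -> str:
    """Build abfss://<container>@<account>.dfs.core.windows.net/<path>."""
    base = f"abfss://{container}@{account}.dfs.core.windows.net/"
    return base + path.lstrip("/")


bronze_customers = abfss_uri("raw-data", "naveenadlsgen2de", "bronze/customers/")
print(bronze_customers)
```

The result can be passed to spark.read or df.write exactly like the literal strings in the example above.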
Setup Required
- Storage Credential — how to authenticate (Access Connector managed identity)
- External Location — which ADLS path to allow access to
- See our External Tables post for full setup
When to Use
- Production data lake (Bronze/Silver/Gold layers)
- Data shared across Databricks workspaces, Synapse, ADF, Power BI
- Compliance requirements (data must stay in your storage)
Real-life analogy: External Locations are like having a key to the warehouse next door. The warehouse (ADLS) is yours — you own it, you control it. The key (External Location) lets Databricks open the door and access your inventory.
The Path Prefix Cheat Sheet
| What You Are Doing | Path to Use | Example |
|---|---|---|
| Python open() to /tmp/ | /tmp/ | /tmp/test.csv |
| Python open() to DBFS | /dbfs/ | /dbfs/FileStore/test.csv |
| Python open() to Volume | /Volumes/ | /Volumes/workspace/default/naveenvol/test.csv |
| dbutils.fs to DBFS | No prefix or dbfs:/ | /FileStore/test.csv |
| dbutils.fs to Volume | /Volumes/ | /Volumes/workspace/default/naveenvol/ |
| dbutils.fs to ADLS | abfss:// | abfss://container@account.dfs.core.windows.net/ |
| spark.read from DBFS | No prefix | /FileStore/test.csv |
| spark.read from Volume | /Volumes/ | /Volumes/workspace/default/naveenvol/test.csv |
| spark.read from ADLS | abfss:// | abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/ |
The rule: Python open() needs /dbfs/ for DBFS. Everything else uses the path directly. Volumes always use /Volumes/. ADLS always uses abfss://.
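The prefix rule can also be read in reverse: given a path, the prefix tells you which storage location it points to. A rough sketch follows, with classify_path as a hypothetical helper and the return labels chosen for illustration.

```python
def classify_path(path: str) -> str:
    """Guess the storage location from the path prefix alone."""
    if path.startswith("abfss://"):
        return "External Location (ADLS Gen2)"
    if path.startswith("/Volumes/"):
        return "Unity Catalog Volume"
    if path.startswith(("dbfs:/", "/dbfs/")):
        return "DBFS"
    if path.startswith(("file:/", "/tmp/")):
        return "Driver-local filesystem"
    # A bare path such as /FileStore/... means DBFS to dbutils.fs and Spark,
    # but a local path to Python open(), so the tool decides.
    return "Ambiguous: depends on the tool"


for p in ["/tmp/test.csv", "/dbfs/FileStore/test.csv", "/FileStore/test.csv"]:
    print(p, "->", classify_path(p))
```

The ambiguous case is the one behind most of the confusion: the same bare string resolves differently in open() than in dbutils.fs or spark.read.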
Managed Volumes vs External Volumes
| Feature | Managed Volume | External Volume |
|---|---|---|
| Data stored in | Databricks-managed storage | YOUR ADLS Gen2 |
| DROP VOLUME | Deletes data + metadata | Deletes metadata, data stays in ADLS |
| Create SQL | CREATE VOLUME myvol | CREATE EXTERNAL VOLUME myvol LOCATION 'abfss://...' |
| Best for | Notebooks, exercises, small files | Production, shared files |
| Visible in | Catalog Explorer | Catalog Explorer + Azure Portal |
Creating an External Volume
-- Requires an External Location already set up
CREATE EXTERNAL VOLUME workspace.default.adls_volume
LOCATION 'abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/volumes/';
Files uploaded to this volume physically land in your ADLS Gen2 container.
Creating a Volume (Step by Step)
Via UI
- Click Catalog in the sidebar
- Select your catalog (e.g., workspace)
- Select a schema (e.g., default)
- Click + Create → Volume
- Name: naveenvol
- Type: Managed (or External with ADLS path)
- Click Create
Via SQL
-- Managed volume
CREATE VOLUME IF NOT EXISTS workspace.default.naveenvol;
-- External volume (requires External Location)
CREATE EXTERNAL VOLUME IF NOT EXISTS workspace.default.adls_volume
LOCATION 'abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/volumes/';
Via Python
spark.sql("CREATE VOLUME IF NOT EXISTS workspace.default.naveenvol")
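When volumes are created from Python for several schemas, the statement can be generated rather than typed out. The sketch below only builds the SQL text; create_volume_sql is a hypothetical helper, and the resulting string would still be passed to spark.sql() as in the example above.

```python
from typing import Optional


def create_volume_sql(
    catalog: str, schema: str, volume: str, location: Optional[str] = None
) -> str:
    """Return a CREATE VOLUME statement; external when a location is given."""
    for part in (catalog, schema, volume):
        if not part or "`" in part:
            raise ValueError(f"invalid identifier: {part!r}")
    name = ".".join(f"`{part}`" for part in (catalog, schema, volume))
    if location is None:
        return f"CREATE VOLUME IF NOT EXISTS {name}"
    if "'" in location:
        raise ValueError("location must not contain single quotes")
    return f"CREATE EXTERNAL VOLUME IF NOT EXISTS {name} LOCATION '{location}'"


managed_sql = create_volume_sql("workspace", "default", "naveenvol")
print(managed_sql)
# In a notebook: spark.sql(managed_sql)
```

Backtick-quoting the identifiers is a defensive choice for names that contain hyphens or other special characters.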
The Append Mode Bug (Illegal Seek)
This error caught us during our practice session:
# This FAILS on Volumes
with open("/Volumes/workspace/default/naveenvol/log.txt", "a") as f:
    f.write("New line\n")
# OSError: [Errno 29] Illegal seek
Why It Fails
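Databricks Volumes are backed by cloud object storage, where a file is written as a whole object and cannot be extended in place. Opening a file in "a" (append) mode makes Python seek to the end of the file before writing, and the Volume's file interface rejects that seek, which surfaces as Illegal seek.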
The Workaround
Read existing content, append new content, rewrite the entire file:
import os

vol_path = "/Volumes/workspace/default/naveenvol/log.txt"

# Read existing content (if file exists)
existing = ""
if os.path.exists(vol_path):
    with open(vol_path, "r") as f:
        existing = f.read()

# Append new content and write everything
with open(vol_path, "w") as f:
    f.write(existing)
    f.write("New line added!\n")

print("Appended successfully!")
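One possible variation on this workaround is to do the append on local disk, where "a" mode works, and then copy the finished file over the target. The sketch below assumes that a whole-file copy onto a Volume succeeds in the same way the "w" write above does; append_via_local_copy is a hypothetical helper, and a temporary directory stands in for the volume so the example runs anywhere.

```python
import os
import shutil
import tempfile


def append_via_local_copy(target_path: str, text: str) -> None:
    """Append text without ever opening target_path in "a" mode."""
    fd, staging_path = tempfile.mkstemp(suffix=".staging")
    os.close(fd)
    try:
        # Pull the current contents down to local disk, if the file exists.
        if os.path.exists(target_path):
            shutil.copyfile(target_path, staging_path)
        # Append mode is fine on the driver's local filesystem.
        with open(staging_path, "a") as f:
            f.write(text)
        # Replace the target with one whole-file write.
        shutil.copyfile(staging_path, target_path)
    finally:
        os.remove(staging_path)


# On Databricks, log_path would be e.g. a /Volumes/... path.
with tempfile.TemporaryDirectory() as demo_dir:
    log_path = os.path.join(demo_dir, "log.txt")
    append_via_local_copy(log_path, "first line\n")
    append_via_local_copy(log_path, "second line\n")
    with open(log_path, "r") as f:
        log_contents = f.read()

print(log_contents)
```

Like the in-memory version, this rewrites the whole file on every append, so it suits small logs rather than high-volume writes, and it is not safe if two writers append at the same time.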
File Modes That Work on Volumes
| Mode | Works? | What It Does |
|---|---|---|
| "r" | ✅ Yes | Read |
| "w" | ✅ Yes | Write (overwrite) |
| "a" | ❌ No | Append (fails with Illegal seek) |
| "rb" | ✅ Yes | Read binary |
| "wb" | ✅ Yes | Write binary |
Python open() vs dbutils.fs vs spark.read
| Feature | Python open() | dbutils.fs | spark.read |
|---|---|---|---|
| Read small files | ✅ Best | ✅ head() | Overkill |
| Read large data | ❌ Slow (single thread) | ❌ Not for processing | ✅ Best (distributed) |
| Write small files | ✅ Best | ✅ put() | Overkill |
| Write large data | ❌ Slow | ❌ Not for this | ✅ Best (distributed) |
| Delete files | os.remove() | dbutils.fs.rm() | N/A |
| List files | os.listdir() | dbutils.fs.ls() | N/A |
| Works with /tmp/ | ✅ | ❌ | ❌ |
| Works with DBFS | ✅ (with /dbfs/ prefix) | ✅ | ✅ |
| Works with Volumes | ✅ | ✅ | ✅ |
| Works with ADLS | ❌ | ✅ | ✅ |
Rule of thumb:
- Small files (config, logs, CSVs under 1 MB) → Python open()
- File management (list, delete, move, copy) → dbutils.fs
- Data processing (Parquet, Delta, large CSV) → spark.read / spark.write
Uploading Files via UI
Upload to Volume
- Catalog → navigate to your volume
- Click Upload to volume button
- Drag and drop files
- Files appear under the Files tab
Upload to DBFS
- Click Data in the sidebar
- Click DBFS tab
- Navigate to /FileStore/
- Click Upload → drag and drop
Upload to Table (Create Table UI)
- Click Data in the sidebar
- Click Create Table
- Upload CSV → Databricks creates a managed table directly
Which Storage for Which Use Case
| Use Case | Storage | Path |
|---|---|---|
| Quick scratch file during development | /tmp/ | /tmp/test.csv |
| Notebook exercise files | Volume | /Volumes/workspace/default/naveenvol/ |
| Config files, lookup data | Volume | /Volumes/catalog/schema/config_vol/ |
| Production data lake (Bronze/Silver/Gold) | External Location (ADLS) | abfss://container@account.dfs.core.windows.net/ |
| Legacy workspace without Unity Catalog | DBFS | /FileStore/my_files/ |
| Files shared across workspaces | External Location (ADLS) | abfss:// |
| Small images for notebooks | Workspace FileStore | /FileStore/images/ |
Common Errors and Fixes
| Error | Cause | Fix |
|---|---|---|
| FileNotFoundError: /tmp/test.csv | Cluster restarted, /tmp/ was cleared | Use Volumes instead for persistent files |
| OSError: [Errno 29] Illegal seek | Append mode "a" on Volumes | Read existing → append in memory → write all with "w" |
| LocalFilesystemAccessDeniedException | dbutils.fs.rm("file:/tmp/...") blocked | Use os.remove("/tmp/...") or ignore (cleared on restart) |
| No such file or directory: /dbfs/... | Missing /dbfs/ prefix for Python open() on DBFS | Add /dbfs/ prefix: /dbfs/FileStore/test.csv |
| Path does not exist: /Volumes/... | Volume not created or wrong catalog/schema | Verify in Catalog Explorer, check catalog and schema names |
| PERMISSION_DENIED creating volume | Lack CREATE VOLUME privilege | Ask workspace admin for permission on the schema |
| Cannot access non /Workspace local filesystem | Databricks blocking local filesystem access entirely | Use Volumes instead of /tmp/ |
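Two of the errors in the table are ordinary Python OSError subclasses, so they can be caught and turned into a hint at the point of failure. The sketch below covers only those two cases; explain_file_error and its messages are hypothetical, and the Databricks-specific exceptions in the table are not handled.

```python
import errno


def explain_file_error(exc: OSError) -> str:
    """Map a Python file error to the matching fix from the table above."""
    if isinstance(exc, FileNotFoundError):
        return "Path not found: check the prefix, or the file was lost on restart."
    if exc.errno == errno.ESPIPE:  # reported by the OS as "Illegal seek"
        return 'Append ("a") is not supported here: read, modify, rewrite with "w".'
    return f"Unrecognized file error: {exc}"


try:
    # This path is assumed not to exist, so open() raises FileNotFoundError.
    with open("/tmp/this_file_should_not_exist_12345.csv", "r") as f:
        f.read()
except OSError as exc:
    hint = explain_file_error(exc)
    print(hint)
```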
Interview Questions
Q: What are the storage options in Databricks?
A: Five options: /tmp/ (driver-local, non-persistent), DBFS (Databricks-managed, persistent), Unity Catalog Volumes (modern governed storage), External Locations (your ADLS Gen2), and Workspace FileStore (legacy small file storage). For production, use External Locations for the data lake and Volumes for config/reference files.
Q: What is the difference between a Managed Volume and an External Volume?
A: A Managed Volume stores data in Databricks-managed storage, so DROP VOLUME deletes the data. An External Volume points to your ADLS Gen2, so DROP VOLUME removes only the metadata and the data files remain in your storage. External Volumes require an External Location to be set up first.
Q: Why does append mode fail on Volumes?
A: Volumes use cloud object storage (Azure Blob) underneath, which does not support native file append or seek operations. The workaround is to read the existing file content, append the new content in memory, and write the entire file using write mode ("w").
Q: When should you use Python open() vs spark.read?
A: Use Python open() for small files (configs, logs, CSVs under 1 MB). Use spark.read for data processing (Parquet, Delta, large CSVs). Python open() runs on the driver only (single thread). Spark distributes the read across the cluster (parallel).
Wrapping Up
File storage in Databricks is confusing because there are five different locations with different path prefixes, persistence rules, and access methods. But once you understand the map, it is simple:
- Development scratch files → /tmp/ (temporary) or Volumes (persistent)
- Notebook data files → Volumes (/Volumes/catalog/schema/vol/)
- Production data lake → External Locations (abfss://)
- Legacy workspaces → DBFS (/dbfs/FileStore/)
Use Volumes for everyday file operations. Use External Locations for production data. Forget /tmp/ exists unless you need a 5-second scratch file. And always remember: append mode does not work on Volumes — read, modify, rewrite.
Related posts:
- Azure Databricks Introduction and dbutils
- Connecting to Blob/ADLS Gen2
- External Tables and Unity Catalog
- Reading/Writing File Formats
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.