Managed vs External Tables in Azure Databricks: Unity Catalog, External Locations, Data Persistence, and Every Operation Explained
You just created a Delta table in Databricks and everything works. But then you run DROP TABLE and your data disappears. Gone. Hundreds of thousands of rows, deleted because you dropped a table.
That is a managed table. Databricks owns the data. DROP TABLE = delete metadata AND data files.
Now imagine a different setup: you drop the table, but the data files are still sitting in your ADLS Gen2 storage account. You run CREATE TABLE pointing to the same path, and the data is back — instantly, with full Delta history.
That is an external table. YOU own the data. Databricks only owns the metadata. DROP TABLE = delete the catalog entry, but your files are untouched.
Think of it like renting vs owning a house. With a managed table (renting), when you end the lease (DROP TABLE), you leave and the house is demolished. With an external table (owning), when you remove your name from the directory (DROP TABLE), the house is still there — you just re-register and move back in.
This post walks through setting up external tables from scratch — Storage Credentials, External Locations, creating tables, proving data persistence after DROP, and understanding when to use managed vs external.
Table of Contents
- Managed vs External Tables: The Core Difference
- When to Use Managed vs External
- The Three Layers: Credential → Location → Table
- Prerequisites
- Step 1: Create a Storage Credential
- Step 2: Create an External Location
- Step 3: Create a Schema
- Step 4: Create an External Table
- Step 5: Insert Data
- Step 6: DROP Table — Prove Data Persists
- Step 7: Recreate Table — Data Comes Back
- Step 8: Delta Operations on External Tables
- How to Check If a Table Is Managed or External
- External Tables with Partitioning
- External Tables and VACUUM
- Granting Access to Other Users
- Common Errors and Fixes
- Interview Questions
- Wrapping Up
Managed vs External Tables: The Core Difference
| Feature | Managed Table | External Table |
|---|---|---|
| Data location | Databricks-managed storage (internal) | YOUR storage (ADLS Gen2, S3) |
| Created with | saveAsTable() or CREATE TABLE (no LOCATION) | CREATE TABLE ... LOCATION 'abfss://...' |
| DROP TABLE | Deletes metadata AND data files | Deletes metadata only, data files stay |
| Who owns the data? | Databricks | You |
| Data accessible outside Databricks? | No (locked in managed storage) | Yes (other tools can read ADLS directly) |
| Recreate after DROP? | Data lost forever | Point to same path → data is back |
| Best for | Curated tables, sandbox, internal analytics | Shared data, production data lakes, cross-tool access |
| Cost | Databricks storage charges | Your ADLS storage charges |
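The "Created with" row is the whole trick. A minimal sketch of the two CREATE statements (the demo schema and path are illustrative, and the external form assumes the External Location registered later in this post):

# Managed: CREATE TABLE with no LOCATION clause; Databricks owns the files
spark.sql("CREATE TABLE demo.events_managed (id INT, name STRING)")

# External: the identical statement plus LOCATION pointing at YOUR storage
spark.sql("""
    CREATE TABLE demo.events_external (id INT, name STRING)
    LOCATION 'abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/external_tables/events/'
""")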
The Visual Difference
Managed Table:
Databricks Catalog: "employees" → data at dbfs:/user/hive/warehouse/employees/
DROP TABLE → Catalog entry removed + data files DELETED
External Table:
Databricks Catalog: "employees" → data at abfss://raw-data@storageaccount/employees/
DROP TABLE → Catalog entry removed + data files STILL IN ADLS
Real-life analogy:
- Managed table = renting furniture from a store. When you return it (DROP), it is gone.
- External table = buying furniture and putting it in your own house. If the store closes (DROP from catalog), your furniture is still in your house.
When to Use Managed vs External
| Scenario | Use Managed | Use External |
|---|---|---|
| Sandbox / exploration / temp data | ✅ | |
| Production data lake (Bronze/Silver/Gold) | | ✅ |
| Data shared with Synapse/ADF/Power BI | | ✅ (they read ADLS directly) |
| Short-lived tables (dev/testing) | ✅ | |
| Compliance (data must stay in your storage) | | ✅ |
| Data used by multiple Databricks workspaces | | ✅ |
| Quick prototyping | ✅ | |
| Long-term production tables | | ✅ |
Rule of thumb: If the data matters and other tools need to access it, use external. If it is temporary or internal to Databricks, use managed.
The Three Layers: Credential → Location → Table
Unity Catalog enforces a three-layer security model for external tables:
Layer 1: Storage Credential
"Here is HOW to authenticate to the storage account"
(Access Connector managed identity)
|
v
Layer 2: External Location
"Here is WHERE in the storage account you can access"
(abfss://container@account.dfs.core.windows.net/)
|
v
Layer 3: External Table
"Here is the specific path for this table's data"
(abfss://container@account.dfs.core.windows.net/external_tables/employees/)
Why three layers?
Without this structure, any user could create an external table pointing to ANY path in ANY storage account — a massive security risk. The three layers ensure:
- Storage Credential: only authorized identities can access storage
- External Location: only approved paths are accessible
- External Table: users can only create tables under approved locations
Real-life analogy: Getting into a secure office building:
- Storage Credential = your employee badge (proves WHO you are)
- External Location = the floors your badge can access (WHERE you can go)
- External Table = the specific office you enter (your workspace within the allowed floors)
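Once Steps 1 and 2 below are in place, you can inspect each layer from a notebook. A quick sketch (the names match the ones created later in this post):

spark.sql("SHOW STORAGE CREDENTIALS").show(truncate=False)   # Layer 1: HOW to authenticate
spark.sql("SHOW EXTERNAL LOCATIONS").show(truncate=False)    # Layer 2: WHERE you may access
spark.sql("SHOW TABLES IN external_demo").show()             # Layer 3: WHAT is registered there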
Prerequisites
Before starting, you need:
- Azure Databricks workspace (Premium tier for Unity Catalog)
- ADLS Gen2 storage account with a container (e.g., raw-data)
- Access Connector for Azure Databricks (created in Azure Portal)
- Storage Blob Data Contributor role assigned to the Access Connector on your storage account
Creating the Access Connector (If Not Done)
- Azure Portal → Create a resource → search Access Connector for Azure Databricks
- Name: unity-catalog-access-connector
- Resource group: same as your Databricks workspace
- Click Create
Granting Storage Access to the Connector
- Go to your Storage account → Access Control (IAM)
- + Add → Add role assignment
- Role: Storage Blob Data Contributor
- Members → Managed identity → select Access Connector for Azure Databricks → select your connector
- Review + assign
Step 1: Create a Storage Credential
The Storage Credential tells Unity Catalog HOW to authenticate to your storage:
CREATE STORAGE CREDENTIAL IF NOT EXISTS adls_gen2_credential
WITH (AZURE_MANAGED_IDENTITY = (
ACCESS_CONNECTOR_ID = '/subscriptions/<your-sub-id>/resourceGroups/<your-rg>/providers/Microsoft.Databricks/accessConnectors/unity-catalog-access-connector'
));
Where to find the ACCESS_CONNECTOR_ID:
1. Azure Portal → Access Connector resource → Properties (or Overview)
2. Copy the Resource ID — it looks like: /subscriptions/abc123.../providers/Microsoft.Databricks/accessConnectors/unity-catalog-access-connector
Verify:
SHOW STORAGE CREDENTIALS;
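To confirm which Access Connector a credential points at, describe it. A quick sketch (run on a Unity Catalog-enabled cluster):

spark.sql("DESCRIBE STORAGE CREDENTIAL adls_gen2_credential").show(truncate=False)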
Step 2: Create an External Location
The External Location tells Unity Catalog WHERE in the storage account you can access:
CREATE EXTERNAL LOCATION IF NOT EXISTS adls_raw_data
URL 'abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL adls_gen2_credential);
This registers the entire raw-data container. Any external table pointing to a path under this container will now work.
Verify:
SHOW EXTERNAL LOCATIONS;
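A quick smoke test: if the External Location and its credential are wired up correctly, listing the container from a notebook succeeds without any spark.conf OAuth settings, because Unity Catalog authenticates through the Access Connector. A sketch:

# No service principal keys, no spark.conf setup: Unity Catalog handles auth
for f in dbutils.fs.ls("abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/"):
    print(f.path)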
Real-life analogy: The Storage Credential is like getting a master key card for a building. The External Location is like programming that key card to access specific floors. Without the External Location, the key card exists but opens nothing.
Step 3: Create a Schema
CREATE SCHEMA IF NOT EXISTS external_demo;
USE external_demo;
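A related option worth knowing: Unity Catalog schemas can also pin their MANAGED tables to your own storage via MANAGED LOCATION. A hedged sketch (the schema name and path are illustrative, and you need CREATE MANAGED STORAGE on the External Location covering the path); note that DROP still deletes data for managed tables, wherever they live:

# Managed tables in this schema land under YOUR container instead of the
# metastore root, but keep managed-table semantics (DROP deletes data)
spark.sql("""
    CREATE SCHEMA IF NOT EXISTS external_demo_managed
    MANAGED LOCATION 'abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/managed_schemas/'
""")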
Step 4: Create an External Table
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Create sample data
data = [
(1, "Naveen", "Data Engineering", "Mississauga", 95000, "2023-01-15"),
(2, "Shrey", "Data Science", "Toronto", 88000, "2023-03-20"),
(3, "Vrushab", "DevOps", "Vancouver", 92000, "2023-06-10"),
(4, "Vishnu", "Analytics", "Ottawa", 91000, "2023-09-01"),
(5, "Ravi", "Engineering", "Calgary", 85000, "2024-01-05"),
]
columns = ["emp_id", "name", "department", "city", "salary", "hire_date"]
df = spark.createDataFrame(data, columns) \
    .withColumn("hire_date", to_date(col("hire_date")))
# Write to ADLS Gen2 as Delta
external_path = "abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/external_tables/employees"
df.write.format("delta").mode("overwrite").save(external_path)
print(f"Data written to: {external_path}")
Now register it as an external table in Unity Catalog:
CREATE TABLE IF NOT EXISTS external_demo.employees
USING DELTA
LOCATION 'abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/external_tables/employees';
-- Verify
SELECT * FROM external_demo.employees ORDER BY emp_id;
The key difference from managed tables: The LOCATION clause. Without it, Databricks stores data in its managed storage. With it, data lives in YOUR ADLS Gen2.
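You can also collapse the write-then-register pattern above into one call: giving saveAsTable() an explicit path option creates the external table and writes the data together. A minimal sketch reusing external_path:

# One step: write Delta files to YOUR path AND register the external table
df.write.format("delta") \
    .mode("overwrite") \
    .option("path", external_path) \
    .saveAsTable("external_demo.employees")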
Step 5: Insert Data
INSERT INTO external_demo.employees VALUES
(6, 'Anita', 'Marketing', 'Montreal', 78000, '2024-03-15'),
(7, 'Deepak', 'Finance', 'Edmonton', 82000, '2024-05-01');
SELECT * FROM external_demo.employees ORDER BY emp_id;
-- 7 employees now
The new rows are written as new Parquet files in the SAME ADLS Gen2 path. The Delta transaction log tracks the addition.
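You can watch this in the transaction log itself: each committed write adds a numbered JSON file under _delta_log/. A quick sketch reusing external_path from Step 4:

# Commit files like 00000000000000000001.json appear after each INSERT
for f in sorted(dbutils.fs.ls(external_path + "/_delta_log/"), key=lambda f: f.name):
    print(f.name)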
Step 6: DROP Table — Prove Data Persists
This is the moment of truth:
-- Drop the table
DROP TABLE external_demo.employees;
-- Verify it is gone from the catalog
SHOW TABLES IN external_demo;
-- employees is NOT listed
But the data files are still in ADLS Gen2:
# Check ADLS directly — files are STILL THERE
files = dbutils.fs.ls("abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/external_tables/employees/")
for f in files:
print(f"{f.name} ({f.size} bytes)")
# You will see:
# _delta_log/
# part-00000-xxxx.snappy.parquet
# part-00001-xxxx.snappy.parquet
# ...
The data files and the Delta transaction log are completely untouched. DROP TABLE only removed the catalog entry — the metadata that says “there is a table called employees.” The actual Parquet files in your storage account are still there.
You can even read the data directly without a table:
# Read directly from the ADLS path — no table needed
df_direct = spark.read.format("delta").load(
"abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/external_tables/employees/"
)
df_direct.show()
print(f"Rows: {df_direct.count()}")
# All 7 employees are here!
Real-life analogy: Dropping an external table is like removing a bookmark from your browser. The website (data) still exists — you just removed the shortcut to it. You can re-bookmark it (recreate the table) anytime.
Step 7: Recreate Table — Data Comes Back
-- Recreate the table pointing to the same ADLS path
CREATE TABLE IF NOT EXISTS external_demo.employees
USING DELTA
LOCATION 'abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/external_tables/employees';
-- All data is back instantly!
SELECT * FROM external_demo.employees ORDER BY emp_id;
-- 7 employees — exactly as before DROP
-- Even the Delta history is preserved!
DESCRIBE HISTORY external_demo.employees;
-- Shows all operations: CREATE, INSERT, etc.
The Delta transaction log (_delta_log/) was preserved in ADLS, so Databricks reads it and restores the complete table — data AND history.
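The same verification works from Python even while the table is dropped, because DeltaTable.forPath reads the preserved log straight from storage. A sketch:

from delta.tables import DeltaTable

# History survives DROP because _delta_log/ lives in YOUR storage
DeltaTable.forPath(spark, external_path) \
    .history() \
    .select("version", "timestamp", "operation") \
    .show(truncate=False)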
Step 8: Delta Operations on External Tables
External tables support ALL the same Delta operations as managed tables:
UPDATE
UPDATE external_demo.employees
SET salary = 105000, city = 'Toronto'
WHERE emp_id = 1;
DELETE
DELETE FROM external_demo.employees
WHERE emp_id = 7;
MERGE
from delta.tables import DeltaTable
source_data = [(2, "Shrey", "ML Engineering", "Toronto", 95000, "2023-03-20"),
(8, "Priya", "QA", "Pune", 72000, "2024-06-01")]
df_source = spark.createDataFrame(source_data, columns).withColumn("hire_date", to_date(col("hire_date")))
target = DeltaTable.forName(spark, "external_demo.employees")
target.alias("t").merge(
df_source.alias("s"), "t.emp_id = s.emp_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
Time Travel
-- View all versions
DESCRIBE HISTORY external_demo.employees;
-- Query a previous version
SELECT * FROM external_demo.employees VERSION AS OF 0 ORDER BY emp_id;
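Time travel also works straight from the ADLS path, no table registration needed, since every version is tracked by the _delta_log in your storage. A sketch:

# Load version 0 directly from the path
df_v0 = spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load(external_path)
df_v0.show()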
OPTIMIZE and VACUUM
-- Compact small files
OPTIMIZE external_demo.employees;
-- Clean old files (respects retention period)
VACUUM external_demo.employees RETAIN 168 HOURS;
How to Check If a Table Is Managed or External
Method 1: DESCRIBE EXTENDED
DESCRIBE EXTENDED external_demo.employees;
-- Look for "Type" in the output:
-- Type = MANAGED → managed table
-- Type = EXTERNAL → external table
Method 2: Check the Location
DESCRIBE DETAIL external_demo.employees;
-- Look at the "location" column:
-- A path under an External Location you registered → external (YOUR storage)
-- dbfs:/user/hive/warehouse/... → managed (legacy hive_metastore)
-- Caveat: Unity Catalog managed tables ALSO live at abfss:// paths (the metastore
-- root storage), so in UC trust the Type field from DESCRIBE EXTENDED, not the prefix.
Method 3: Quick PySpark
# spark.catalog.getTable reports the table type directly; in Unity Catalog this
# beats guessing from the path prefix, since managed tables also use abfss://
tbl = spark.catalog.getTable("external_demo.employees")
detail = spark.sql("DESCRIBE DETAIL external_demo.employees").collect()[0]
print(f"Type: {tbl.tableType}")
print(f"Location: {detail['location']}")
External Tables with Partitioning
# Write partitioned data to ADLS
df.write.format("delta") \
    .mode("overwrite") \
    .partitionBy("department") \
    .save("abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/external_tables/employees_partitioned/")
# Register as external table
spark.sql('''
CREATE TABLE IF NOT EXISTS external_demo.employees_partitioned
USING DELTA
LOCATION 'abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/external_tables/employees_partitioned/'
''')
ADLS folder structure:
external_tables/employees_partitioned/
_delta_log/
department=Analytics/
part-00000.parquet
department=Data Engineering/
part-00000.parquet
department=DevOps/
part-00000.parquet
Partition pruning works exactly the same as managed tables.
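You can confirm the pruning in the physical plan. A sketch (exact plan text varies by Databricks runtime):

# Filtering on the partition column lets Delta skip entire department=... folders
pruned = spark.table("external_demo.employees_partitioned") \
    .filter(col("department") == "DevOps")
pruned.explain()   # the file scan should list department under PartitionFilters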
External Tables and VACUUM
VACUUM on external tables works identically to managed tables — it removes old Parquet files from YOUR ADLS storage beyond the retention period:
-- Default 7-day retention
VACUUM external_demo.employees RETAIN 168 HOURS;
Important difference: When you VACUUM an external table, the old files are deleted from YOUR storage account. Make sure your retention period covers your time travel needs before running VACUUM.
After VACUUM, time travel to versions whose files were deleted will fail — same behavior as managed tables.
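If you rely on long time-travel windows, widen the retention before you VACUUM. A hedged sketch using standard Delta table properties (30 days is an illustrative value):

spark.sql("""
    ALTER TABLE external_demo.employees SET TBLPROPERTIES (
        'delta.deletedFileRetentionDuration' = 'interval 30 days',
        'delta.logRetentionDuration' = 'interval 30 days'
    )
""")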
Granting Access to Other Users
With Unity Catalog, you control who can access external tables:
-- Grant SELECT to a user
GRANT SELECT ON TABLE external_demo.employees TO `analyst@company.com`;
-- Grant SELECT on the entire schema
GRANT SELECT ON SCHEMA external_demo TO `analyst@company.com`;
-- Grant usage on the external location (needed to CREATE external tables)
GRANT CREATE EXTERNAL TABLE ON EXTERNAL LOCATION adls_raw_data TO `engineer@company.com`;
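To audit what has been granted, SHOW GRANTS works on tables and external locations alike. A quick sketch:

spark.sql("SHOW GRANTS ON TABLE external_demo.employees").show(truncate=False)
spark.sql("SHOW GRANTS ON EXTERNAL LOCATION adls_raw_data").show(truncate=False)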
Common Errors and Fixes
| Error | Cause | Fix |
|---|---|---|
| “No matching external location found” | No External Location registered for the ADLS path | Create External Location pointing to the container |
| “PARSE_SYNTAX_ERROR at STORAGE” | Cluster does not support Unity Catalog commands | Use a cluster with Unity Catalog enabled (shared or single-user access mode) |
| “ACCESS_DENIED” on External Location | Access Connector lacks Storage Blob Data Contributor role | Add the role on the storage account in Azure Portal |
| “Permission denied” creating External Location | Your Databricks user lacks metastore admin or CREATE EXTERNAL LOCATION | Ask workspace admin for the permission |
| “Table already exists” | LOCATION path already registered to another table | Drop the existing table first or use a different path |
| “Cannot create external table — schema does not exist” | Schema not created | Run CREATE SCHEMA IF NOT EXISTS external_demo |
| “Storage credential not found” | Typo in credential name | Run SHOW STORAGE CREDENTIALS to verify the exact name |
| Files not showing after DROP | You dropped a MANAGED table (not external) | Managed tables delete data on DROP. Data is lost. |
Interview Questions
Q: What is the difference between managed and external tables in Databricks?
A: Managed tables store data in Databricks-managed storage — DROP TABLE deletes both metadata and data. External tables store data in YOUR storage (ADLS Gen2) — DROP TABLE deletes only the catalog metadata, data files remain. Recreating an external table with the same LOCATION restores all data and history.

Q: What are the three layers needed for external tables in Unity Catalog?
A: Storage Credential (HOW to authenticate — Access Connector managed identity), External Location (WHERE to access — registered ADLS path), and External Table (WHAT data — specific path under the location). Each layer provides security governance.

Q: What happens when you DROP an external table?
A: Only the catalog entry (metadata) is deleted. The Delta data files and transaction log remain in ADLS Gen2. You can read the data directly from the path, or recreate the table with the same LOCATION to restore it with full history.

Q: When would you use an external table instead of a managed table?
A: When data needs to be accessed by tools outside Databricks (Synapse, ADF, Power BI), when compliance requires data in your own storage, when multiple Databricks workspaces share the same data, or when you want data to survive table drops and workspace changes.

Q: Do external tables support all Delta operations?
A: Yes. INSERT, UPDATE, DELETE, MERGE, time travel, OPTIMIZE, VACUUM, schema evolution — all work identically to managed tables. The only difference is where the data files physically reside.
Wrapping Up
External tables give you the best of both worlds: the convenience of SQL table access with the durability of data in YOUR storage. DROP a table by accident? No problem — the data is still in ADLS. Need to share data with Synapse or Power BI? They read ADLS directly. Need to migrate to a different Databricks workspace? Point the new workspace at the same ADLS path.
The setup is a one-time process: create an Access Connector, assign storage permissions, register a Storage Credential and External Location in Unity Catalog. After that, creating external tables is just adding LOCATION 'abfss://...' to your CREATE TABLE statement.
For production data lakes, external tables are the standard. Managed tables are for sandboxes and experiments. Know the difference. Choose wisely.
Related posts:
- Azure Databricks Introduction and dbutils
- Connecting to Blob/ADLS Gen2
- Delta Lake Deep Dive
- Databricks Secret Scopes
- ADLS Gen2 Complete Guide
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.