Databricks Unity Catalog Deep Dive: Metastore, Three-Level Namespace, Governance, Lineage, and Securing Your Lakehouse Like a Production Engineer
You have built Delta Lake tables, run PySpark notebooks, and created workflows in Databricks. But who controls access to those tables? Where is the metadata stored? How do you prevent a junior analyst from accidentally deleting a production table? How do you track which notebook reads which table?
Before Unity Catalog, Databricks used the Hive metastore — a workspace-level catalog with no cross-workspace sharing, no fine-grained access control, and no data lineage. Every workspace had its own isolated metadata. Sharing a table between two workspaces meant copying data or mounting the same storage account in both. Security was all-or-nothing: you either had access to the entire workspace or you did not.
Unity Catalog replaces all of that with a centralized, account-level governance layer. One metastore governs ALL workspaces. One permission model controls access down to the row and column level. One lineage graph tracks every table, every notebook, every query. One sharing mechanism lets you share data across workspaces, across accounts, and even across clouds — without copying.
Think of Unity Catalog like a city government. Before Unity Catalog, each neighborhood (workspace) had its own rules, its own property registry, and its own police force. If someone in Neighborhood A wanted to use a park in Neighborhood B, they had to build a duplicate park. With Unity Catalog, the city government maintains ONE property registry for ALL neighborhoods, ONE set of rules, and ONE police force. Every property (table) is registered centrally. Access is controlled by city permits (grants). And every transaction is logged in city records (audit logs).
Table of Contents
- The Problem Before Unity Catalog
- What Is Unity Catalog?
- The Three-Level Namespace: Catalog.Schema.Table
- Unity Catalog Metastore
- Creating and Assigning a Metastore
- Catalogs — The Top-Level Container
- Creating Catalogs
- Catalog Organization Patterns
- Schemas — The Middle Layer
- Creating Schemas
- Tables in Unity Catalog
- Managed Tables vs External Tables
- Views and Functions
- Volumes — Unstructured Data Governance
- Storage Credentials and External Locations
- Storage Credentials
- External Locations
- How They Work Together
- Access Control — GRANT and REVOKE
- Securable Objects Hierarchy
- Privileges Reference
- Granting Access (Complete Examples)
- Ownership
- Row-Level Security
- Column Masking (Dynamic Data Masking)
- Data Lineage
- Audit Logging
- Delta Sharing — Share Data Without Copying
- Identity Federation and Account-Level Identity
- Migrating from Hive Metastore to Unity Catalog
- Unity Catalog in Azure vs AWS vs GCP
- Unity Catalog vs Microsoft Fabric Governance
- Common Mistakes
- Interview Questions
- Wrapping Up
The Problem Before Unity Catalog
BEFORE Unity Catalog (Hive Metastore era):
Workspace A (Data Engineering) Workspace B (Data Science)
├── hive_metastore ├── hive_metastore
│ ├── default.customers │ ├── default.customers ← COPY of same data!
│ ├── default.orders │ ├── default.ml_features
│ └── default.products │ └── default.predictions
│ │
└── Mounted storage: └── Mounted storage:
/mnt/datalake/... /mnt/datalake/... ← Same mount, no governance
Problems:
❌ Two copies of customers table (storage waste, sync nightmare)
❌ No way to share tables across workspaces without mounting
❌ Permissions are workspace-level only (all or nothing)
❌ No lineage — who reads what? No idea.
❌ No audit trail — who deleted that table last Tuesday? Unknown.
❌ Cluster-scoped credentials — every user on the cluster has the same access
Real-life analogy: The Hive metastore era was like a company where every department kept its own filing cabinet with its own copies of documents. HR had a copy of the employee list. Finance had a different copy. When an employee changed their address, HR updated their copy but Finance did not. Nobody knew which copy was current. And there was no central log of who accessed which documents.
What Is Unity Catalog?
Unity Catalog is Databricks’ unified governance layer for all data and AI assets. It provides:
Unity Catalog provides:
1. CENTRALIZED METASTORE — one catalog for all workspaces in the account
2. THREE-LEVEL NAMESPACE — catalog.schema.table for organizing everything
3. FINE-GRAINED ACCESS CONTROL — GRANT/REVOKE at catalog, schema, table, row, column level
4. DATA LINEAGE — automatic tracking of table → notebook → table dependencies
5. AUDIT LOGGING — every access, every query, every permission change is logged
6. DELTA SHARING — share data externally without copying (open protocol)
7. VOLUMES — governance for unstructured data (files, images, models)
8. IDENTITY FEDERATION — account-level users and groups, not per-workspace
AFTER Unity Catalog:
Databricks Account
└── Unity Catalog Metastore (ONE for the entire account)
├── Catalog: prod
│ ├── Schema: bronze
│ │ ├── customers
│ │ ├── orders
│ │ └── products
│ ├── Schema: silver
│ │ ├── customers_clean
│ │ └── orders_validated
│ └── Schema: gold
│ ├── dim_customer
│ └── fact_orders
│
├── Catalog: dev
│ └── Schema: sandbox
│ └── test_customers
│
└── Catalog: shared
└── Schema: ml_features
└── customer_features
Workspace A (Data Engineering) ──► Same metastore
Workspace B (Data Science) ──────► Same metastore
Workspace C (Analytics) ─────────► Same metastore
✅ ONE copy of customers — all workspaces access the same table
✅ Fine-grained permissions — analysts can read gold, not bronze
✅ Full lineage — see exactly which notebooks read/write which tables
✅ Audit logs — every query logged with user, timestamp, table
✅ Identity-based access — permissions follow the USER, not the cluster
The Three-Level Namespace: Catalog.Schema.Table
Unity Catalog organizes all data assets into a three-level hierarchy:
Level 1: CATALOG — the top-level container (like a database server)
Level 2: SCHEMA — a logical grouping within a catalog (like a database)
Level 3: TABLE/VIEW — the actual data object
Full reference: catalog_name.schema_name.table_name
Examples:
prod.bronze.customers — production bronze customers table
prod.gold.dim_customer — production gold dimension
dev.sandbox.test_data — development sandbox
shared.ml_features.customer_features — shared ML feature table
-- Always use the three-level name for clarity
SELECT * FROM prod.gold.dim_customer;
-- Or set a default catalog and schema
USE CATALOG prod;
USE SCHEMA gold;
SELECT * FROM dim_customer; -- Resolves to prod.gold.dim_customer
-- In PySpark
df = spark.table("prod.silver.customers_clean")
df.write.format("delta").saveAsTable("prod.gold.dim_customer")
Real-life analogy: The three-level namespace is like a postal address. The catalog is the country (Canada). The schema is the city (Toronto). The table is the street address (123 Main St). Just like “123 Main St” exists in many cities, a table named “customers” can exist in multiple schemas — prod.bronze.customers and dev.sandbox.customers are completely separate tables.
| Level | What It Is | Analogy | Example |
|---|---|---|---|
| Catalog | Top-level container for all data assets | A building in a campus | prod, dev, staging |
| Schema | Logical grouping of tables, views, functions | A floor in the building | bronze, silver, gold |
| Table/View | The actual data object | A room on the floor | customers, dim_product |
Unity Catalog Metastore
The metastore is the top-level container for ALL Unity Catalog metadata in a Databricks account. It stores catalog definitions, schema definitions, table metadata, permissions, lineage, and audit logs.
Databricks Account
└── Metastore (one per region)
├── Catalogs
├── Schemas
├── Tables / Views / Functions
├── Volumes
├── Storage Credentials
├── External Locations
├── Permissions (GRANT / REVOKE)
├── Lineage Graph
└── Audit Logs
Key facts:
• ONE metastore per cloud region (e.g., one for Canada Central, one for East US)
• ALL workspaces in that region attach to the SAME metastore
• The metastore stores METADATA only — actual data stays in your storage account
• Metastore is managed by the account admin (not workspace admin)
Real-life analogy: The metastore is like a city’s land registry office. It does not own any houses (data). It maintains the RECORDS of every property — who owns it, where it is located, who has permission to enter, and a history of every transaction. The houses (data files) stay where they are. The registry (metastore) tracks everything about them.
Creating and Assigning a Metastore
Step 1: Create a metastore (Account Admin → account.cloud.databricks.com)
• Go to Data → Create Metastore
• Name: "main" (or your company name)
• Region: Canada Central (must match your workspace region)
• ADLS Gen2 container for managed storage:
abfss://unity-catalog@yourstorageaccount.dfs.core.windows.net/
Step 2: Create an Access Connector for Azure Databricks
• In Azure Portal → Create "Access Connector for Azure Databricks"
• This gives the metastore a Managed Identity to access ADLS Gen2
• Assign "Storage Blob Data Contributor" role to the Access Connector
on your ADLS Gen2 storage account
Step 3: Assign the metastore to workspaces
• Account Console → Workspaces → select workspace → assign metastore
• All assigned workspaces now share this metastore
Important: A workspace can only be assigned to ONE metastore at a time. But one metastore can serve MANY workspaces. This is what enables cross-workspace data sharing.
Catalogs — The Top-Level Container
A catalog is the highest organizational unit in Unity Catalog. It contains schemas, which contain tables, views, and functions.
Creating Catalogs
-- Create a catalog
CREATE CATALOG prod;
CREATE CATALOG dev;
CREATE CATALOG staging;
CREATE CATALOG shared;
-- Create a catalog with a comment
CREATE CATALOG prod COMMENT 'Production data — Bronze, Silver, Gold layers';
-- Create a catalog with a managed location (where managed table data is stored)
CREATE CATALOG prod
MANAGED LOCATION 'abfss://unity-catalog@naveenadlsgen2de.dfs.core.windows.net/prod/';
-- List all catalogs
SHOW CATALOGS;
-- Describe a catalog
DESCRIBE CATALOG prod;
-- Drop a catalog (must be empty — no schemas)
DROP CATALOG dev;
-- Drop a catalog and ALL its contents (dangerous!)
DROP CATALOG dev CASCADE;
Catalog Organization Patterns
Pattern 1: By Environment (Most Common)
├── prod — production data (restricted access)
├── staging — pre-production validation
├── dev — development and experimentation
└── shared — cross-team shared datasets
Pattern 2: By Domain / Business Unit
├── sales — sales team data
├── marketing — marketing data
├── finance — financial data
└── platform — shared platform data
Pattern 3: Hybrid (Recommended for Enterprise)
├── prod_sales — production sales data
├── prod_marketing — production marketing data
├── dev — all development work
└── shared — cross-domain shared assets
Pattern 4: Medallion Within Each Catalog
└── prod
├── bronze — raw ingested data (schema)
├── silver — cleaned, validated data (schema)
└── gold — business-ready aggregates (schema)
Real-life analogy: Catalogs are like separate buildings on a university campus. The Science Building (prod) has labs (schemas) with equipment (tables). The Library Building (shared) has books everyone can access. The Student Center (dev) is for experimentation. You need different key cards (permissions) for different buildings.
Schemas — The Middle Layer
A schema (also called a database) groups related tables, views, and functions within a catalog.
Creating Schemas
-- Create schemas within a catalog
CREATE SCHEMA prod.bronze;
CREATE SCHEMA prod.silver;
CREATE SCHEMA prod.gold;
-- Create with a managed location
CREATE SCHEMA prod.bronze
MANAGED LOCATION 'abfss://unity-catalog@naveenadlsgen2de.dfs.core.windows.net/prod/bronze/'
COMMENT 'Raw ingested data — no transformations applied';
-- Create with properties
CREATE SCHEMA prod.gold
COMMENT 'Business-ready tables — star schema, aggregates'
WITH DBPROPERTIES ('team' = 'data-engineering', 'layer' = 'gold');
-- List schemas in a catalog
SHOW SCHEMAS IN prod;
-- Describe schema
DESCRIBE SCHEMA prod.gold;
-- Set default catalog and schema for the session
USE CATALOG prod;
USE SCHEMA gold;
Tables in Unity Catalog
Managed Tables vs External Tables
| Feature | Managed Table | External Table |
|---|---|---|
| Data location | Managed by Unity Catalog (in the catalog/schema’s managed location) | Your storage account (you choose the path) |
| DROP TABLE behavior | Deletes metadata AND data files | Deletes metadata only — data files remain |
| Governance | Full UC governance (permissions, lineage, audit) | Full UC governance (permissions, lineage, audit) |
| Storage management | UC handles file organization | You manage the storage path |
| Best for | Most tables — simpler, fully managed | Tables shared with non-Databricks tools, existing data lakes |
| VACUUM/OPTIMIZE | Works normally | Works normally |
-- Managed table (UC controls the data location)
CREATE TABLE prod.bronze.customers (
customer_id INT,
name STRING,
email STRING,
city STRING,
signup_date DATE
) USING DELTA;
-- External table (you specify the data location)
CREATE TABLE prod.bronze.orders
USING DELTA
LOCATION 'abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/orders/';
-- Check if a table is managed or external
DESCRIBE EXTENDED prod.bronze.customers;
-- Look for "Type" → MANAGED or EXTERNAL
Real-life analogy: A managed table is like renting a storage unit from a company — they handle the locks, maintenance, and security. You just put your stuff in. If you cancel the rental (DROP TABLE), your stuff is removed. An external table is like storing furniture in your own garage and giving the storage company a KEY to access it — if you stop using the company (DROP TABLE), your furniture is still in your garage.
Views and Functions
-- Views are registered in Unity Catalog with full governance
CREATE VIEW prod.gold.v_active_customers AS
SELECT customer_id, name, email, city
FROM prod.gold.dim_customer
WHERE is_active = true;
-- Functions are also governed
CREATE FUNCTION prod.gold.mask_email(email STRING)
RETURNS STRING
RETURN CONCAT(LEFT(email, 2), '****@', SPLIT(email, '@')[1]);
-- Usage
SELECT name, prod.gold.mask_email(email) AS masked_email
FROM prod.gold.dim_customer;
Volumes — Unstructured Data Governance
Volumes extend Unity Catalog governance to unstructured data — files, images, CSVs, models, JARs. Before Volumes, files in ADLS/S3 had no UC governance. Volumes bring files into the three-level namespace with full permissions and audit logging.
-- Create a managed volume (UC manages the storage)
CREATE VOLUME prod.bronze.raw_files;
-- Create an external volume (your storage, UC governs access)
CREATE EXTERNAL VOLUME prod.bronze.landing_zone
LOCATION 'abfss://landing@naveenadlsgen2de.dfs.core.windows.net/incoming/';
-- Upload files to a volume
-- In notebook:
-- dbutils.fs.cp("file:/tmp/data.csv", "/Volumes/prod/bronze/raw_files/data.csv")
-- List files in a volume
LIST '/Volumes/prod/bronze/raw_files/';
-- Read a file from a volume in PySpark
df = spark.read.csv("/Volumes/prod/bronze/raw_files/customers.csv", header=True)
-- Read in SQL
SELECT * FROM csv.`/Volumes/prod/bronze/raw_files/customers.csv`;
Real-life analogy: If tables are like books in a library (structured, cataloged, searchable), Volumes are like the library’s special collections room — maps, photographs, manuscripts (unstructured data). Before Volumes, these were stored in a random closet with no catalog card. Now they are registered, governed, and tracked like any other library asset.
Storage Credentials and External Locations
These two objects control how Unity Catalog accesses your cloud storage for external tables and external volumes.
Storage Credentials
A Storage Credential tells Unity Catalog HOW to authenticate to your storage.
In Azure:
→ Azure Managed Identity (via Access Connector for Azure Databricks)
→ Service Principal (App Registration + Client Secret)
In AWS:
→ IAM Role (cross-account role for S3 access)
In GCP:
→ Service Account (GCS access)
-- Create a storage credential (account admin only)
CREATE STORAGE CREDENTIAL azure_adls_credential
WITH (AZURE_MANAGED_IDENTITY, ACCESS_CONNECTOR_ID =
'/subscriptions/xxxx/resourceGroups/rg-databricks/providers/Microsoft.Databricks/accessConnectors/unity-catalog-connector');
-- List storage credentials
SHOW STORAGE CREDENTIALS;
External Locations
An External Location maps a CLOUD STORAGE PATH to a STORAGE CREDENTIAL.
It says: "This path in ADLS/S3 can be accessed using this credential."
-- Create an external location
CREATE EXTERNAL LOCATION adls_raw_data
URL 'abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL azure_adls_credential)
COMMENT 'Raw data landing zone in ADLS Gen2';
-- Now you can create external tables pointing to paths under this location
CREATE TABLE prod.bronze.orders
USING DELTA
LOCATION 'abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/orders/';
-- Unity Catalog checks: Does an external location cover this path? Does the user have permission?
-- List external locations
SHOW EXTERNAL LOCATIONS;
How They Work Together
User runs: CREATE TABLE ... LOCATION 'abfss://raw-data@storage.dfs.core.windows.net/orders/'
│
├── Step 1: Unity Catalog checks — does an External Location cover this path?
│ → YES: adls_raw_data covers abfss://raw-data@storage.dfs.core.windows.net/
│
├── Step 2: Does the user have CREATE EXTERNAL TABLE permission on the external location?
│ → YES: GRANT CREATE EXTERNAL TABLE ON EXTERNAL LOCATION adls_raw_data TO data_engineers
│
├── Step 3: Unity Catalog uses the Storage Credential linked to the external location
│ → azure_adls_credential (Managed Identity)
│
└── Step 4: Table is created. Data stays in ADLS. Metadata registered in UC.
Without an external location covering the path → CREATE TABLE fails.
Without permission on the external location → CREATE TABLE fails.
The storage credential is NEVER exposed to the user — only UC uses it internally.
Real-life analogy: The Storage Credential is the master key to a building. The External Location is a sign on the door that says “Authorized personnel may enter Rooms 100-200.” The user has a badge (GRANT) that says “Allowed to enter via this door.” The user never touches the master key — the building management system (Unity Catalog) uses it on their behalf.
Access Control — GRANT and REVOKE
Securable Objects Hierarchy
Metastore
└── Catalog ← GRANT at this level → applies to ALL schemas and tables below
└── Schema ← GRANT at this level → applies to ALL tables in this schema
├── Table ← GRANT at this level → applies to this table only
├── View
├── Function
└── Volume
Permissions INHERIT downward:
GRANT USE CATALOG ON CATALOG prod TO analysts;
→ analysts can now see the catalog and attempt to access schemas inside it
GRANT USE SCHEMA ON SCHEMA prod.gold TO analysts;
→ analysts can see schema and attempt to access tables inside it
GRANT SELECT ON TABLE prod.gold.dim_customer TO analysts;
→ analysts can query this specific table
All THREE are needed to read a table:
USE CATALOG + USE SCHEMA + SELECT = can query the table
Privileges Reference
| Privilege | What It Allows | Applies To |
|---|---|---|
| USE CATALOG | See and access the catalog | Catalog |
| USE SCHEMA | See and access the schema | Schema |
| SELECT | Read data from a table or view | Table, View |
| MODIFY | INSERT, UPDATE, DELETE, MERGE data | Table |
| CREATE TABLE | Create managed tables in a schema | Schema |
| CREATE EXTERNAL TABLE | Create external tables using an external location | External Location |
| CREATE SCHEMA | Create schemas in a catalog | Catalog |
| CREATE CATALOG | Create new catalogs | Metastore |
| CREATE VOLUME | Create volumes in a schema | Schema |
| READ VOLUME | Read files from a volume | Volume |
| WRITE VOLUME | Write files to a volume | Volume |
| EXECUTE | Run a function | Function |
| ALL PRIVILEGES | All available privileges on the object | Any |
Granting Access (Complete Examples)
-- SCENARIO 1: Give analysts read-only access to Gold tables
GRANT USE CATALOG ON CATALOG prod TO analysts;
GRANT USE SCHEMA ON SCHEMA prod.gold TO analysts;
GRANT SELECT ON SCHEMA prod.gold TO analysts;
-- analysts can now query ALL current and future tables in prod.gold
-- SCENARIO 2: Give data engineers full access to Bronze and Silver
GRANT USE CATALOG ON CATALOG prod TO data_engineers;
GRANT USE SCHEMA ON SCHEMA prod.bronze TO data_engineers;
GRANT USE SCHEMA ON SCHEMA prod.silver TO data_engineers;
GRANT ALL PRIVILEGES ON SCHEMA prod.bronze TO data_engineers;
GRANT ALL PRIVILEGES ON SCHEMA prod.silver TO data_engineers;
-- SCENARIO 3: Give a specific user access to ONE table only
GRANT USE CATALOG ON CATALOG prod TO `naveen@company.com`;
GRANT USE SCHEMA ON SCHEMA prod.gold TO `naveen@company.com`;
GRANT SELECT ON TABLE prod.gold.dim_customer TO `naveen@company.com`;
-- SCENARIO 4: Revoke access
REVOKE SELECT ON SCHEMA prod.gold FROM contractors;
-- SCENARIO 5: Check who has access
SHOW GRANTS ON TABLE prod.gold.dim_customer;
SHOW GRANTS ON SCHEMA prod.gold;
SHOW GRANTS TO `naveen@company.com`;
Ownership
-- Every object has an OWNER — the creator by default
-- Owners have ALL privileges on the object plus the ability to GRANT to others
-- Transfer ownership
ALTER TABLE prod.gold.dim_customer SET OWNER TO data_engineering_team;
ALTER SCHEMA prod.gold SET OWNER TO data_engineering_lead;
ALTER CATALOG prod SET OWNER TO platform_admin;
Row-Level Security
Row-level security (RLS) restricts which ROWS a user can see based on their identity. A sales manager in Toronto should only see Toronto customers — not all customers worldwide.
-- Row filters are SQL functions that return TRUE for rows the user can see
-- Step 1: Create a row filter function
CREATE FUNCTION prod.gold.region_filter(region STRING)
RETURN IF(
IS_ACCOUNT_GROUP_MEMBER('global_admins'),
true, -- Admins see everything
region = CURRENT_USER_REGION() -- Others see their region only
);
-- Simpler example: analysts only see their own department's data
CREATE FUNCTION prod.gold.dept_filter(dept STRING)
RETURN IF(
IS_ACCOUNT_GROUP_MEMBER('all_access'),
true,
dept = (SELECT department FROM prod.gold.user_dept_mapping WHERE email = CURRENT_USER())
);
-- Step 2: Apply the row filter to a table
ALTER TABLE prod.gold.dim_customer
SET ROW FILTER prod.gold.region_filter ON (region);
-- Now when a Toronto analyst queries dim_customer:
SELECT * FROM prod.gold.dim_customer;
-- They see only rows where region = 'Toronto'
-- An admin sees ALL rows
-- Remove the row filter
ALTER TABLE prod.gold.dim_customer DROP ROW FILTER;
Real-life analogy: Row-level security is like a hotel receptionist who only shows you the key to YOUR room. You walk up to the front desk (query the table), and the receptionist (row filter) checks your ID (CURRENT_USER) and only gives you the key to room 305 (your rows). The other 200 rooms exist — you just cannot see or access them.
Column Masking (Dynamic Data Masking)
Column masking hides or obfuscates sensitive column values based on the user’s identity. An analyst might see nav****@email.com while an admin sees naveen@email.com.
-- Step 1: Create a masking function
CREATE FUNCTION prod.gold.mask_email_col(email STRING)
RETURN IF(
IS_ACCOUNT_GROUP_MEMBER('pii_access'),
email, -- PII group sees full email
CONCAT(LEFT(email, 2), '****@', SPLIT(email, '@')[1]) -- Others see masked
);
CREATE FUNCTION prod.gold.mask_salary_col(salary DECIMAL)
RETURN IF(
IS_ACCOUNT_GROUP_MEMBER('hr_team'),
salary, -- HR sees actual salary
NULL -- Others see NULL
);
-- Step 2: Apply column masks to the table
ALTER TABLE prod.gold.dim_customer
ALTER COLUMN email SET MASK prod.gold.mask_email_col;
ALTER TABLE prod.gold.dim_customer
ALTER COLUMN salary SET MASK prod.gold.mask_salary_col;
-- Now queries automatically apply the mask:
-- HR team member: SELECT email, salary → naveen@email.com, 105000
-- Regular analyst: SELECT email, salary → na****@email.com, NULL
-- Remove a column mask
ALTER TABLE prod.gold.dim_customer
ALTER COLUMN email DROP MASK;
Real-life analogy: Column masking is like a redacted document. When the CIA shares a document with the public (regular analysts), names and dates are blacked out (masked). When the director (PII access group) reads the same document, nothing is redacted. Same document, different views based on who is reading it.
Data Lineage
Unity Catalog automatically captures lineage — which tables are read by which notebooks, which tables feed which downstream tables, and which queries touched which columns.
Lineage is captured automatically — no configuration needed.
Example lineage graph:
ADLS Gen2 (external)
│
▼
prod.bronze.raw_customers ◄── NB_Ingest_Customers (notebook)
│
▼
prod.silver.customers_clean ◄── NB_Clean_Customers (notebook)
│
▼
prod.gold.dim_customer ◄── NB_Build_Dimensions (notebook)
│
├──► Power BI Dashboard (downstream consumer)
└──► prod.gold.fact_orders ◄── NB_Build_Facts (notebook)
What lineage tells you:
• "Which tables does NB_Build_Dimensions read?"
→ prod.silver.customers_clean, prod.silver.orders_validated
• "What happens if I drop prod.silver.customers_clean?"
→ It will break NB_Build_Dimensions → which breaks prod.gold.dim_customer → Power BI
• "Who reads prod.gold.dim_customer?"
→ NB_Build_Facts, Power BI Dashboard, 3 ad-hoc queries last week
How to view lineage:
1. Open Databricks workspace
2. Go to Catalog Explorer
3. Navigate to any table → Click the "Lineage" tab
4. See upstream tables (where data comes FROM) and downstream tables (where data goes TO)
5. Click any node to see the notebook or query that created the dependency
Column-level lineage: Unity Catalog tracks not just table-level lineage but column-level. If dim_customer.full_name is derived from CONCAT(raw_customers.first_name, raw_customers.last_name), the lineage graph shows this column-to-column mapping.
Real-life analogy: Lineage is like the ingredient list and supply chain of a restaurant dish. You can trace the pasta dish (Gold table) back to the sauce (Silver table) back to the tomatoes (Bronze table) back to the farm (ADLS source). If the farm has a contamination issue (bad source data), you know exactly which dishes are affected.
Audit Logging
Unity Catalog logs EVERY operation — every query, every permission change, every table creation, every data access. These logs are essential for compliance (SOC 2, HIPAA, GDPR) and troubleshooting.
What is logged:
• Table/View access (SELECT, INSERT, UPDATE, DELETE)
• Permission changes (GRANT, REVOKE)
• Object creation/deletion (CREATE TABLE, DROP TABLE)
• Schema changes (ALTER TABLE ADD COLUMN)
• Authentication events (login, logout)
• Catalog/Schema operations
Where logs are stored:
• Delivered to your cloud storage via Audit Log Delivery
• In Azure: Azure Diagnostic Settings → Log Analytics or Storage Account
• Format: JSON events with timestamp, user, action, resource, IP address
Example audit log entry:
{
"timestamp": "2026-06-11T10:30:15Z",
"user": "naveen@company.com",
"action": "SELECT",
"resource": "prod.gold.dim_customer",
"workspace_id": "1234567890",
"source_ip": "10.0.0.50",
"rows_returned": 15000,
"query_text": "SELECT * FROM prod.gold.dim_customer WHERE city = 'Toronto'"
}
Real-life analogy: Audit logs are like security camera footage for your data. Every door (table) has a camera. Every time someone enters (queries), leaves (disconnects), or moves furniture (modifies data), it is recorded with a timestamp and their ID badge photo. If something goes wrong, you can rewind the footage and see exactly who did what.
Delta Sharing — Share Data Without Copying
Delta Sharing is an open protocol built into Unity Catalog that lets you share data with external organizations, other Databricks accounts, or non-Databricks tools — without copying.
Sharing Scenarios:
1. Share data with another Databricks workspace (same account) → Internal sharing via UC
2. Share data with another Databricks account (different company) → Databricks-to-Databricks sharing
3. Share data with non-Databricks tools (Pandas, Power BI, Spark) → Open Delta Sharing
How Delta Sharing works:
Provider (you) Recipient (partner company)
┌─────────────────┐ ┌─────────────────┐
│ prod.gold │ │ Pandas / Spark │
│ .dim_customer │──── Delta Sharing ────►│ Power BI │
│ .fact_orders │ (no data copied) │ Databricks │
└─────────────────┘ └─────────────────┘
Data stays in YOUR storage. The recipient gets a token to read it.
You control exactly which tables, which columns, and which rows are shared.
-- Create a share
CREATE SHARE customer_analytics_share;
-- Add tables to the share
ALTER SHARE customer_analytics_share ADD TABLE prod.gold.dim_customer;
ALTER SHARE customer_analytics_share ADD TABLE prod.gold.agg_monthly_revenue;
-- Share with partitions (only share specific data)
ALTER SHARE customer_analytics_share ADD TABLE prod.gold.fact_orders
PARTITION (region = 'Canada'); -- Only share Canadian orders
-- Create a recipient
CREATE RECIPIENT partner_company
USING ID 'partner-databricks-account-id';
-- Grant the share to the recipient
GRANT SELECT ON SHARE customer_analytics_share TO RECIPIENT partner_company;
-- Recipient accesses the data in their Databricks workspace:
-- CREATE CATALOG partner_data USING SHARE provider_account.customer_analytics_share;
-- SELECT * FROM partner_data.default.dim_customer;
Real-life analogy: Delta Sharing is like sharing a Google Doc with “view only” access. The document stays in YOUR Google Drive. The recipient can read it from anywhere — even from a different app (Pandas, Power BI). You can revoke access at any time. You never sent a copy. They never downloaded a copy. The one source of truth is always yours.
Identity Federation and Account-Level Identity
Unity Catalog uses account-level users and groups, not workspace-level. This means permissions follow the USER across all workspaces.
Before UC (workspace-level identity):
Workspace A: naveen@company.com → admin
Workspace B: naveen@company.com → contributor
→ Same person, different permissions in different workspaces
→ Had to manage users PER workspace
With UC (account-level identity):
Account: naveen@company.com → data_engineers group
→ GRANT SELECT ON prod.gold TO data_engineers
→ naveen can access prod.gold from ANY workspace assigned to the same metastore
→ Manage users ONCE at the account level
Identity sources:
• Azure AD (Microsoft Entra ID) — sync users and groups via SCIM
• AWS IAM Identity Center
• Okta, OneLogin, or any SAML/SCIM provider
Azure AD Group "DataEngineers" → synced to Databricks group "data_engineers"
→ GRANT permissions to "data_engineers" in Unity Catalog
→ Add/remove users in Azure AD → automatically reflected in Databricks
Migrating from Hive Metastore to Unity Catalog
Migration approach:
Step 1: Create your UC metastore and assign workspaces
Step 2: Create catalogs and schemas matching your desired structure
CREATE CATALOG prod;
CREATE SCHEMA prod.bronze;
Step 3: Upgrade Hive tables to UC (two options):
Option A: SYNC (non-destructive, keeps Hive table too)
-- Creates a UC table pointing to the same data as the Hive table
CREATE TABLE prod.bronze.customers
USING DELTA
LOCATION 'dbfs:/user/hive/warehouse/customers';
Option B: UCX Migration Tool (recommended for large migrations)
-- Databricks provides the UCX (Unity Catalog Migration) tool
-- It scans your Hive metastore, identifies all tables, and migrates them
-- Handles clusters, notebooks, jobs, and permissions migration
Step 4: Update notebooks from 2-level to 3-level names
-- BEFORE: spark.table("customers")
-- AFTER: spark.table("prod.bronze.customers")
Step 5: Update GRANT permissions (Hive metastore permissions do NOT migrate)
Step 6: Test everything in a dev catalog before touching production
Step 7: Deprecate Hive metastore access (set workspace to UC-only mode)
Unity Catalog in Azure vs AWS vs GCP
| Feature | Azure | AWS | GCP |
|---|---|---|---|
| Managed storage | ADLS Gen2 | S3 | GCS |
| Storage Credential | Access Connector (Managed Identity) or Service Principal | IAM Role | Service Account |
| Identity Provider | Azure AD (Entra ID) via SCIM | AWS IAM Identity Center or SCIM | Google Workspace or SCIM |
| Account Console | accounts.azuredatabricks.net | accounts.cloud.databricks.com | accounts.gcp.databricks.com |
| Audit Logs | Azure Diagnostic Settings → Log Analytics | CloudTrail / S3 delivery | Cloud Logging |
| Delta Sharing | Supported | Supported | Supported |
Unity Catalog features are identical across all clouds. The only difference is the storage and identity integration — which cloud-specific service provides the storage and which identity provider syncs users.
Unity Catalog vs Microsoft Fabric Governance
| Feature | Databricks Unity Catalog | Microsoft Fabric |
|---|---|---|
| Namespace | 3-level: catalog.schema.table | 2-level: workspace/lakehouse.table |
| Metastore | Centralized, account-level | OneLake (automatic, no setup) |
| Row-Level Security | SQL row filters | RLS in semantic model (Power BI) or T-SQL in Warehouse |
| Column Masking | SQL column mask functions | Dynamic data masking in Warehouse, CLS in semantic model |
| Data Lineage | Built-in (table + column level) | Microsoft Purview |
| Audit Logging | Built-in audit log delivery | Microsoft Purview + Azure Monitor |
| Data Sharing | Delta Sharing (open protocol) | OneLake Shortcuts + Power BI sharing |
| Identity | Azure AD / IAM via SCIM | Azure AD (native) |
| Unstructured Data | Volumes | Lakehouse Files section |
Common Mistakes
-
Forgetting USE CATALOG + USE SCHEMA before GRANT SELECT — granting SELECT on a table is not enough. The user also needs USE CATALOG on the catalog and USE SCHEMA on the schema. Without all three, the query fails with a “table not found” error — not a permission error — which is confusing.
-
Using cluster-scoped credentials instead of Unity Catalog — mounting ADLS with access keys on the cluster bypasses UC entirely. No governance, no lineage, no audit. Use Storage Credentials and External Locations instead.
-
Not migrating from Hive metastore — keeping tables in hive_metastore means they have no UC governance. All new tables should be created in UC catalogs. Migrate existing tables using the UCX tool.
-
Granting ALL PRIVILEGES when SELECT is sufficient — the principle of least privilege applies. Analysts need SELECT, not MODIFY. Over-granting creates security risks.
-
Creating one giant catalog for everything — use separate catalogs for dev, staging, and prod environments. This prevents accidental writes to production from development notebooks.
-
Not setting up SCIM provisioning from Azure AD — manually managing users in Databricks means users and groups get out of sync with your identity provider. Set up SCIM so Azure AD is the single source of truth.
-
Dropping a managed table thinking only metadata is deleted — DROP TABLE on a MANAGED table deletes both metadata AND data files. For external tables, only metadata is deleted. Know the difference before running DROP.
Interview Questions
Q: What is Unity Catalog and why is it important? A: Unity Catalog is Databricks’ centralized governance layer for all data and AI assets. It provides a three-level namespace (catalog.schema.table), fine-grained access control (GRANT/REVOKE at table, row, and column level), automatic data lineage (table and column level), audit logging, Delta Sharing for external data sharing without copying, and Volumes for unstructured data governance. It replaces the workspace-scoped Hive metastore with an account-level metastore that spans all workspaces.
Q: What is the three-level namespace in Unity Catalog?
A: catalog.schema.table. The catalog is the top-level container (typically representing an environment like prod or dev). The schema is a logical grouping within a catalog (like bronze, silver, gold). The table is the actual data object. Full reference example: prod.gold.dim_customer. This replaces the two-level database.table in the Hive metastore.
Q: What is the difference between managed and external tables in Unity Catalog? A: Managed tables store their data in the catalog’s managed location — Unity Catalog controls the storage. DROP TABLE deletes both metadata and data files. External tables point to data in your own storage account — you specify the LOCATION. DROP TABLE deletes only the metadata; data files remain. Both types receive full UC governance (permissions, lineage, audit).
Q: What are Storage Credentials and External Locations? A: A Storage Credential stores the authentication method (Managed Identity, Service Principal, IAM Role) for accessing cloud storage. An External Location maps a cloud storage path to a Storage Credential, defining which storage paths Unity Catalog can access. To create an external table, there must be an External Location covering the table’s storage path, and the user must have CREATE EXTERNAL TABLE permission on that location. The user never sees the underlying credential.
Q: How does row-level security work in Unity Catalog? A: You create a SQL function that returns TRUE for rows the user should see. Then you apply that function to a table with ALTER TABLE SET ROW FILTER. When anyone queries the table, Unity Catalog automatically applies the filter based on the user’s identity (CURRENT_USER or group membership). Admins can see all rows. Regular users see only their authorized rows. The filter is transparent — users do not know it is applied.
Q: What is Delta Sharing and how does it work? A: Delta Sharing is an open protocol for sharing data without copying. The data provider creates a Share, adds tables to it, and grants access to a Recipient. The recipient accesses the data through the Delta Sharing protocol — from Databricks, Pandas, Power BI, or any compatible client. Data stays in the provider’s storage. No copies are made. The provider can restrict sharing to specific partitions, columns, or rows, and can revoke access at any time.
Q: How does Unity Catalog lineage work? A: Unity Catalog automatically captures lineage whenever a query reads from or writes to a table. The lineage graph shows upstream tables (where data comes from), downstream tables (where data goes), and the notebooks or queries that create those dependencies. It captures both table-level and column-level lineage. No configuration is needed — lineage is captured automatically for all SQL and PySpark operations on UC tables. You can view lineage in the Catalog Explorer UI.
Wrapping Up
Unity Catalog transforms Databricks from “a collection of independent workspaces” into “a governed data platform.” The three-level namespace organizes everything. GRANT/REVOKE controls who can access what. Row-level security and column masking protect sensitive data without creating separate views. Lineage shows you the full dependency graph. Audit logs give you the compliance trail. And Delta Sharing lets you share data without the copies, pipelines, and sync nightmares.
The pattern for every enterprise Databricks deployment is clear: one metastore per region, catalogs for environments (prod, dev, staging), schemas for data layers (bronze, silver, gold), and permissions assigned to Azure AD groups synced via SCIM. Build this foundation, and your lakehouse is not just fast — it is governed.
Related posts: – ADLS Gen2 Connectivity in Databricks – Managed vs External Tables – Delta Lake Deep Dive – Azure RBAC Roles – Azure Connections & Authentication – Fabric Security & Governance (comparison)
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.