Databricks Unity Catalog Deep Dive: Metastore, Three-Level Namespace, Governance, Lineage, and Securing Your Lakehouse Like a Production Engineer

Databricks Unity Catalog Deep Dive: Metastore, Three-Level Namespace, Governance, Lineage, and Securing Your Lakehouse Like a Production Engineer

You have built Delta Lake tables, run PySpark notebooks, and created workflows in Databricks. But who controls access to those tables? Where is the metadata stored? How do you prevent a junior analyst from accidentally deleting a production table? How do you track which notebook reads which table?

Before Unity Catalog, Databricks used the Hive metastore — a workspace-level catalog with no cross-workspace sharing, no fine-grained access control, and no data lineage. Every workspace had its own isolated metadata. Sharing a table between two workspaces meant copying data or mounting the same storage account in both. Security was all-or-nothing: you either had access to the entire workspace or you did not.

Unity Catalog replaces all of that with a centralized, account-level governance layer. One metastore governs ALL workspaces. One permission model controls access down to the row and column level. One lineage graph tracks every table, every notebook, every query. One sharing mechanism lets you share data across workspaces, across accounts, and even across clouds — without copying.

Think of Unity Catalog like a city government. Before Unity Catalog, each neighborhood (workspace) had its own rules, its own property registry, and its own police force. If someone in Neighborhood A wanted to use a park in Neighborhood B, they had to build a duplicate park. With Unity Catalog, the city government maintains ONE property registry for ALL neighborhoods, ONE set of rules, and ONE police force. Every property (table) is registered centrally. Access is controlled by city permits (grants). And every transaction is logged in city records (audit logs).

Table of Contents

  • The Problem Before Unity Catalog
  • What Is Unity Catalog?
  • The Three-Level Namespace: Catalog.Schema.Table
  • Unity Catalog Metastore
  • Creating and Assigning a Metastore
  • Catalogs — The Top-Level Container
  • Creating Catalogs
  • Catalog Organization Patterns
  • Schemas — The Middle Layer
  • Creating Schemas
  • Tables in Unity Catalog
  • Managed Tables vs External Tables
  • Views and Functions
  • Volumes — Unstructured Data Governance
  • Storage Credentials and External Locations
  • Storage Credentials
  • External Locations
  • How They Work Together
  • Access Control — GRANT and REVOKE
  • Securable Objects Hierarchy
  • Privileges Reference
  • Granting Access (Complete Examples)
  • Ownership
  • Row-Level Security
  • Column Masking (Dynamic Data Masking)
  • Data Lineage
  • Audit Logging
  • Delta Sharing — Share Data Without Copying
  • Identity Federation and Account-Level Identity
  • Migrating from Hive Metastore to Unity Catalog
  • Unity Catalog in Azure vs AWS vs GCP
  • Unity Catalog vs Microsoft Fabric Governance
  • Common Mistakes
  • Interview Questions
  • Wrapping Up

The Problem Before Unity Catalog

BEFORE Unity Catalog (Hive Metastore era):

Workspace A (Data Engineering)          Workspace B (Data Science)
  ├── hive_metastore                      ├── hive_metastore
  │     ├── default.customers             │     ├── default.customers  ← COPY of same data!
  │     ├── default.orders                │     ├── default.ml_features
  │     └── default.products              │     └── default.predictions
  │                                       │
  └── Mounted storage:                    └── Mounted storage:
        /mnt/datalake/...                       /mnt/datalake/...  ← Same mount, no governance

Problems:
  ❌ Two copies of customers table (storage waste, sync nightmare)
  ❌ No way to share tables across workspaces without mounting
  ❌ Permissions are workspace-level only (all or nothing)
  ❌ No lineage — who reads what? No idea.
  ❌ No audit trail — who deleted that table last Tuesday? Unknown.
  ❌ Cluster-scoped credentials — every user on the cluster has the same access

Real-life analogy: The Hive metastore era was like a company where every department kept its own filing cabinet with its own copies of documents. HR had a copy of the employee list. Finance had a different copy. When an employee changed their address, HR updated their copy but Finance did not. Nobody knew which copy was current. And there was no central log of who accessed which documents.

What Is Unity Catalog?

Unity Catalog is Databricks’ unified governance layer for all data and AI assets. It provides:

Unity Catalog provides:
  1. CENTRALIZED METASTORE — one catalog for all workspaces in the account
  2. THREE-LEVEL NAMESPACE — catalog.schema.table for organizing everything
  3. FINE-GRAINED ACCESS CONTROL — GRANT/REVOKE at catalog, schema, table, row, column level
  4. DATA LINEAGE — automatic tracking of table → notebook → table dependencies
  5. AUDIT LOGGING — every access, every query, every permission change is logged
  6. DELTA SHARING — share data externally without copying (open protocol)
  7. VOLUMES — governance for unstructured data (files, images, models)
  8. IDENTITY FEDERATION — account-level users and groups, not per-workspace
AFTER Unity Catalog:

Databricks Account
  └── Unity Catalog Metastore (ONE for the entire account)
        ├── Catalog: prod
        │     ├── Schema: bronze
        │     │     ├── customers
        │     │     ├── orders
        │     │     └── products
        │     ├── Schema: silver
        │     │     ├── customers_clean
        │     │     └── orders_validated
        │     └── Schema: gold
        │           ├── dim_customer
        │           └── fact_orders
        │
        ├── Catalog: dev
        │     └── Schema: sandbox
        │           └── test_customers
        │
        └── Catalog: shared
              └── Schema: ml_features
                    └── customer_features

  Workspace A (Data Engineering) ──► Same metastore
  Workspace B (Data Science) ──────► Same metastore
  Workspace C (Analytics) ─────────► Same metastore

  ✅ ONE copy of customers — all workspaces access the same table
  ✅ Fine-grained permissions — analysts can read gold, not bronze
  ✅ Full lineage — see exactly which notebooks read/write which tables
  ✅ Audit logs — every query logged with user, timestamp, table
  ✅ Identity-based access — permissions follow the USER, not the cluster

The Three-Level Namespace: Catalog.Schema.Table

Unity Catalog organizes all data assets into a three-level hierarchy:

Level 1: CATALOG     — the top-level container (like a database server)
Level 2: SCHEMA      — a logical grouping within a catalog (like a database)
Level 3: TABLE/VIEW  — the actual data object

Full reference: catalog_name.schema_name.table_name

Examples:
  prod.bronze.customers          — production bronze customers table
  prod.gold.dim_customer         — production gold dimension
  dev.sandbox.test_data          — development sandbox
  shared.ml_features.customer_features — shared ML feature table
-- Always use the three-level name for clarity
SELECT * FROM prod.gold.dim_customer;

-- Or set a default catalog and schema
USE CATALOG prod;
USE SCHEMA gold;
SELECT * FROM dim_customer;  -- Resolves to prod.gold.dim_customer

-- In PySpark
df = spark.table("prod.silver.customers_clean")
df.write.format("delta").saveAsTable("prod.gold.dim_customer")

Real-life analogy: The three-level namespace is like a postal address. The catalog is the country (Canada). The schema is the city (Toronto). The table is the street address (123 Main St). Just like “123 Main St” exists in many cities, a table named “customers” can exist in multiple schemas — prod.bronze.customers and dev.sandbox.customers are completely separate tables.

Level What It Is Analogy Example
Catalog Top-level container for all data assets A building in a campus prod, dev, staging
Schema Logical grouping of tables, views, functions A floor in the building bronze, silver, gold
Table/View The actual data object A room on the floor customers, dim_product

Unity Catalog Metastore

The metastore is the top-level container for ALL Unity Catalog metadata in a Databricks account. It stores catalog definitions, schema definitions, table metadata, permissions, lineage, and audit logs.

Databricks Account
  └── Metastore (one per region)
        ├── Catalogs
        ├── Schemas
        ├── Tables / Views / Functions
        ├── Volumes
        ├── Storage Credentials
        ├── External Locations
        ├── Permissions (GRANT / REVOKE)
        ├── Lineage Graph
        └── Audit Logs

Key facts:
  • ONE metastore per cloud region (e.g., one for Canada Central, one for East US)
  • ALL workspaces in that region attach to the SAME metastore
  • The metastore stores METADATA only — actual data stays in your storage account
  • Metastore is managed by the account admin (not workspace admin)

Real-life analogy: The metastore is like a city’s land registry office. It does not own any houses (data). It maintains the RECORDS of every property — who owns it, where it is located, who has permission to enter, and a history of every transaction. The houses (data files) stay where they are. The registry (metastore) tracks everything about them.

Creating and Assigning a Metastore

Step 1: Create a metastore (Account Admin → account.cloud.databricks.com)
  • Go to Data → Create Metastore
  • Name: "main" (or your company name)
  • Region: Canada Central (must match your workspace region)
  • ADLS Gen2 container for managed storage:
      abfss://unity-catalog@yourstorageaccount.dfs.core.windows.net/

Step 2: Create an Access Connector for Azure Databricks
  • In Azure Portal → Create "Access Connector for Azure Databricks"
  • This gives the metastore a Managed Identity to access ADLS Gen2
  • Assign "Storage Blob Data Contributor" role to the Access Connector
    on your ADLS Gen2 storage account

Step 3: Assign the metastore to workspaces
  • Account Console → Workspaces → select workspace → assign metastore
  • All assigned workspaces now share this metastore

Important: A workspace can only be assigned to ONE metastore at a time. But one metastore can serve MANY workspaces. This is what enables cross-workspace data sharing.

Catalogs — The Top-Level Container

A catalog is the highest organizational unit in Unity Catalog. It contains schemas, which contain tables, views, and functions.

Creating Catalogs

-- Create a catalog
CREATE CATALOG prod;
CREATE CATALOG dev;
CREATE CATALOG staging;
CREATE CATALOG shared;

-- Create a catalog with a comment
CREATE CATALOG prod COMMENT 'Production data — Bronze, Silver, Gold layers';

-- Create a catalog with a managed location (where managed table data is stored)
CREATE CATALOG prod
MANAGED LOCATION 'abfss://unity-catalog@naveenadlsgen2de.dfs.core.windows.net/prod/';

-- List all catalogs
SHOW CATALOGS;

-- Describe a catalog
DESCRIBE CATALOG prod;

-- Drop a catalog (must be empty — no schemas)
DROP CATALOG dev;

-- Drop a catalog and ALL its contents (dangerous!)
DROP CATALOG dev CASCADE;

Catalog Organization Patterns

Pattern 1: By Environment (Most Common)
  ├── prod          — production data (restricted access)
  ├── staging       — pre-production validation
  ├── dev           — development and experimentation
  └── shared        — cross-team shared datasets

Pattern 2: By Domain / Business Unit
  ├── sales         — sales team data
  ├── marketing     — marketing data
  ├── finance       — financial data
  └── platform      — shared platform data

Pattern 3: Hybrid (Recommended for Enterprise)
  ├── prod_sales       — production sales data
  ├── prod_marketing   — production marketing data
  ├── dev              — all development work
  └── shared           — cross-domain shared assets

Pattern 4: Medallion Within Each Catalog
  └── prod
        ├── bronze      — raw ingested data (schema)
        ├── silver      — cleaned, validated data (schema)
        └── gold        — business-ready aggregates (schema)

Real-life analogy: Catalogs are like separate buildings on a university campus. The Science Building (prod) has labs (schemas) with equipment (tables). The Library Building (shared) has books everyone can access. The Student Center (dev) is for experimentation. You need different key cards (permissions) for different buildings.

Schemas — The Middle Layer

A schema (also called a database) groups related tables, views, and functions within a catalog.

Creating Schemas

-- Create schemas within a catalog
CREATE SCHEMA prod.bronze;
CREATE SCHEMA prod.silver;
CREATE SCHEMA prod.gold;

-- Create with a managed location
CREATE SCHEMA prod.bronze
MANAGED LOCATION 'abfss://unity-catalog@naveenadlsgen2de.dfs.core.windows.net/prod/bronze/'
COMMENT 'Raw ingested data — no transformations applied';

-- Create with properties
CREATE SCHEMA prod.gold
COMMENT 'Business-ready tables — star schema, aggregates'
WITH DBPROPERTIES ('team' = 'data-engineering', 'layer' = 'gold');

-- List schemas in a catalog
SHOW SCHEMAS IN prod;

-- Describe schema
DESCRIBE SCHEMA prod.gold;

-- Set default catalog and schema for the session
USE CATALOG prod;
USE SCHEMA gold;

Tables in Unity Catalog

Managed Tables vs External Tables

Feature Managed Table External Table
Data location Managed by Unity Catalog (in the catalog/schema’s managed location) Your storage account (you choose the path)
DROP TABLE behavior Deletes metadata AND data files Deletes metadata only — data files remain
Governance Full UC governance (permissions, lineage, audit) Full UC governance (permissions, lineage, audit)
Storage management UC handles file organization You manage the storage path
Best for Most tables — simpler, fully managed Tables shared with non-Databricks tools, existing data lakes
VACUUM/OPTIMIZE Works normally Works normally

-- Managed table (UC controls the data location)
CREATE TABLE prod.bronze.customers (
    customer_id INT,
    name STRING,
    email STRING,
    city STRING,
    signup_date DATE
) USING DELTA;

-- External table (you specify the data location)
CREATE TABLE prod.bronze.orders
USING DELTA
LOCATION 'abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/orders/';

-- Check if a table is managed or external
DESCRIBE EXTENDED prod.bronze.customers;
-- Look for "Type" → MANAGED or EXTERNAL

Real-life analogy: A managed table is like renting a storage unit from a company — they handle the locks, maintenance, and security. You just put your stuff in. If you cancel the rental (DROP TABLE), your stuff is removed. An external table is like storing furniture in your own garage and giving the storage company a KEY to access it — if you stop using the company (DROP TABLE), your furniture is still in your garage.

Views and Functions

-- Views are registered in Unity Catalog with full governance
CREATE VIEW prod.gold.v_active_customers AS
SELECT customer_id, name, email, city
FROM prod.gold.dim_customer
WHERE is_active = true;

-- Functions are also governed
CREATE FUNCTION prod.gold.mask_email(email STRING)
RETURNS STRING
RETURN CONCAT(LEFT(email, 2), '****@', SPLIT(email, '@')[1]);

-- Usage
SELECT name, prod.gold.mask_email(email) AS masked_email
FROM prod.gold.dim_customer;

Volumes — Unstructured Data Governance

Volumes extend Unity Catalog governance to unstructured data — files, images, CSVs, models, JARs. Before Volumes, files in ADLS/S3 had no UC governance. Volumes bring files into the three-level namespace with full permissions and audit logging.

-- Create a managed volume (UC manages the storage)
CREATE VOLUME prod.bronze.raw_files;

-- Create an external volume (your storage, UC governs access)
CREATE EXTERNAL VOLUME prod.bronze.landing_zone
LOCATION 'abfss://landing@naveenadlsgen2de.dfs.core.windows.net/incoming/';

-- Upload files to a volume
-- In notebook:
-- dbutils.fs.cp("file:/tmp/data.csv", "/Volumes/prod/bronze/raw_files/data.csv")

-- List files in a volume
LIST '/Volumes/prod/bronze/raw_files/';

-- Read a file from a volume in PySpark
df = spark.read.csv("/Volumes/prod/bronze/raw_files/customers.csv", header=True)

-- Read in SQL
SELECT * FROM csv.`/Volumes/prod/bronze/raw_files/customers.csv`;

Real-life analogy: If tables are like books in a library (structured, cataloged, searchable), Volumes are like the library’s special collections room — maps, photographs, manuscripts (unstructured data). Before Volumes, these were stored in a random closet with no catalog card. Now they are registered, governed, and tracked like any other library asset.

Storage Credentials and External Locations

These two objects control how Unity Catalog accesses your cloud storage for external tables and external volumes.

Storage Credentials

A Storage Credential tells Unity Catalog HOW to authenticate to your storage.

In Azure:
  → Azure Managed Identity (via Access Connector for Azure Databricks)
  → Service Principal (App Registration + Client Secret)

In AWS:
  → IAM Role (cross-account role for S3 access)

In GCP:
  → Service Account (GCS access)
-- Create a storage credential (account admin only)
CREATE STORAGE CREDENTIAL azure_adls_credential
WITH (AZURE_MANAGED_IDENTITY, ACCESS_CONNECTOR_ID =
  '/subscriptions/xxxx/resourceGroups/rg-databricks/providers/Microsoft.Databricks/accessConnectors/unity-catalog-connector');

-- List storage credentials
SHOW STORAGE CREDENTIALS;

External Locations

An External Location maps a CLOUD STORAGE PATH to a STORAGE CREDENTIAL.
It says: "This path in ADLS/S3 can be accessed using this credential."
-- Create an external location
CREATE EXTERNAL LOCATION adls_raw_data
URL 'abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL azure_adls_credential)
COMMENT 'Raw data landing zone in ADLS Gen2';

-- Now you can create external tables pointing to paths under this location
CREATE TABLE prod.bronze.orders
USING DELTA
LOCATION 'abfss://raw-data@naveenadlsgen2de.dfs.core.windows.net/orders/';
-- Unity Catalog checks: Does an external location cover this path? Does the user have permission?

-- List external locations
SHOW EXTERNAL LOCATIONS;

How They Work Together

User runs: CREATE TABLE ... LOCATION 'abfss://raw-data@storage.dfs.core.windows.net/orders/'
  │
  ├── Step 1: Unity Catalog checks — does an External Location cover this path?
  │     → YES: adls_raw_data covers abfss://raw-data@storage.dfs.core.windows.net/
  │
  ├── Step 2: Does the user have CREATE EXTERNAL TABLE permission on the external location?
  │     → YES: GRANT CREATE EXTERNAL TABLE ON EXTERNAL LOCATION adls_raw_data TO data_engineers
  │
  ├── Step 3: Unity Catalog uses the Storage Credential linked to the external location
  │     → azure_adls_credential (Managed Identity)
  │
  └── Step 4: Table is created. Data stays in ADLS. Metadata registered in UC.

Without an external location covering the path → CREATE TABLE fails.
Without permission on the external location → CREATE TABLE fails.
The storage credential is NEVER exposed to the user — only UC uses it internally.

Real-life analogy: The Storage Credential is the master key to a building. The External Location is a sign on the door that says “Authorized personnel may enter Rooms 100-200.” The user has a badge (GRANT) that says “Allowed to enter via this door.” The user never touches the master key — the building management system (Unity Catalog) uses it on their behalf.

Access Control — GRANT and REVOKE

Securable Objects Hierarchy

Metastore
  └── Catalog          ← GRANT at this level → applies to ALL schemas and tables below
        └── Schema     ← GRANT at this level → applies to ALL tables in this schema
              ├── Table       ← GRANT at this level → applies to this table only
              ├── View
              ├── Function
              └── Volume

Permissions INHERIT downward:
  GRANT USE CATALOG ON CATALOG prod TO analysts;
  → analysts can now see the catalog and attempt to access schemas inside it

  GRANT USE SCHEMA ON SCHEMA prod.gold TO analysts;
  → analysts can see schema and attempt to access tables inside it

  GRANT SELECT ON TABLE prod.gold.dim_customer TO analysts;
  → analysts can query this specific table

All THREE are needed to read a table:
  USE CATALOG + USE SCHEMA + SELECT = can query the table

Privileges Reference

Privilege What It Allows Applies To
USE CATALOG See and access the catalog Catalog
USE SCHEMA See and access the schema Schema
SELECT Read data from a table or view Table, View
MODIFY INSERT, UPDATE, DELETE, MERGE data Table
CREATE TABLE Create managed tables in a schema Schema
CREATE EXTERNAL TABLE Create external tables using an external location External Location
CREATE SCHEMA Create schemas in a catalog Catalog
CREATE CATALOG Create new catalogs Metastore
CREATE VOLUME Create volumes in a schema Schema
READ VOLUME Read files from a volume Volume
WRITE VOLUME Write files to a volume Volume
EXECUTE Run a function Function
ALL PRIVILEGES All available privileges on the object Any

Granting Access (Complete Examples)

-- SCENARIO 1: Give analysts read-only access to Gold tables
GRANT USE CATALOG ON CATALOG prod TO analysts;
GRANT USE SCHEMA ON SCHEMA prod.gold TO analysts;
GRANT SELECT ON SCHEMA prod.gold TO analysts;
-- analysts can now query ALL current and future tables in prod.gold

-- SCENARIO 2: Give data engineers full access to Bronze and Silver
GRANT USE CATALOG ON CATALOG prod TO data_engineers;
GRANT USE SCHEMA ON SCHEMA prod.bronze TO data_engineers;
GRANT USE SCHEMA ON SCHEMA prod.silver TO data_engineers;
GRANT ALL PRIVILEGES ON SCHEMA prod.bronze TO data_engineers;
GRANT ALL PRIVILEGES ON SCHEMA prod.silver TO data_engineers;

-- SCENARIO 3: Give a specific user access to ONE table only
GRANT USE CATALOG ON CATALOG prod TO `naveen@company.com`;
GRANT USE SCHEMA ON SCHEMA prod.gold TO `naveen@company.com`;
GRANT SELECT ON TABLE prod.gold.dim_customer TO `naveen@company.com`;

-- SCENARIO 4: Revoke access
REVOKE SELECT ON SCHEMA prod.gold FROM contractors;

-- SCENARIO 5: Check who has access
SHOW GRANTS ON TABLE prod.gold.dim_customer;
SHOW GRANTS ON SCHEMA prod.gold;
SHOW GRANTS TO `naveen@company.com`;

Ownership

-- Every object has an OWNER — the creator by default
-- Owners have ALL privileges on the object plus the ability to GRANT to others

-- Transfer ownership
ALTER TABLE prod.gold.dim_customer SET OWNER TO data_engineering_team;
ALTER SCHEMA prod.gold SET OWNER TO data_engineering_lead;
ALTER CATALOG prod SET OWNER TO platform_admin;

Row-Level Security

Row-level security (RLS) restricts which ROWS a user can see based on their identity. A sales manager in Toronto should only see Toronto customers — not all customers worldwide.

-- Row filters are SQL functions that return TRUE for rows the user can see

-- Step 1: Create a row filter function
CREATE FUNCTION prod.gold.region_filter(region STRING)
RETURN IF(
    IS_ACCOUNT_GROUP_MEMBER('global_admins'),
    true,                                        -- Admins see everything
    region = CURRENT_USER_REGION()                -- Others see their region only
);

-- Simpler example: analysts only see their own department's data
CREATE FUNCTION prod.gold.dept_filter(dept STRING)
RETURN IF(
    IS_ACCOUNT_GROUP_MEMBER('all_access'),
    true,
    dept = (SELECT department FROM prod.gold.user_dept_mapping WHERE email = CURRENT_USER())
);

-- Step 2: Apply the row filter to a table
ALTER TABLE prod.gold.dim_customer
SET ROW FILTER prod.gold.region_filter ON (region);

-- Now when a Toronto analyst queries dim_customer:
SELECT * FROM prod.gold.dim_customer;
-- They see only rows where region = 'Toronto'
-- An admin sees ALL rows

-- Remove the row filter
ALTER TABLE prod.gold.dim_customer DROP ROW FILTER;

Real-life analogy: Row-level security is like a hotel receptionist who only shows you the key to YOUR room. You walk up to the front desk (query the table), and the receptionist (row filter) checks your ID (CURRENT_USER) and only gives you the key to room 305 (your rows). The other 200 rooms exist — you just cannot see or access them.

Column Masking (Dynamic Data Masking)

Column masking hides or obfuscates sensitive column values based on the user’s identity. An analyst might see nav****@email.com while an admin sees naveen@email.com.

-- Step 1: Create a masking function
CREATE FUNCTION prod.gold.mask_email_col(email STRING)
RETURN IF(
    IS_ACCOUNT_GROUP_MEMBER('pii_access'),
    email,                                           -- PII group sees full email
    CONCAT(LEFT(email, 2), '****@', SPLIT(email, '@')[1])  -- Others see masked
);

CREATE FUNCTION prod.gold.mask_salary_col(salary DECIMAL)
RETURN IF(
    IS_ACCOUNT_GROUP_MEMBER('hr_team'),
    salary,                                          -- HR sees actual salary
    NULL                                             -- Others see NULL
);

-- Step 2: Apply column masks to the table
ALTER TABLE prod.gold.dim_customer
ALTER COLUMN email SET MASK prod.gold.mask_email_col;

ALTER TABLE prod.gold.dim_customer
ALTER COLUMN salary SET MASK prod.gold.mask_salary_col;

-- Now queries automatically apply the mask:
-- HR team member:     SELECT email, salary → naveen@email.com, 105000
-- Regular analyst:    SELECT email, salary → na****@email.com, NULL

-- Remove a column mask
ALTER TABLE prod.gold.dim_customer
ALTER COLUMN email DROP MASK;

Real-life analogy: Column masking is like a redacted document. When the CIA shares a document with the public (regular analysts), names and dates are blacked out (masked). When the director (PII access group) reads the same document, nothing is redacted. Same document, different views based on who is reading it.

Data Lineage

Unity Catalog automatically captures lineage — which tables are read by which notebooks, which tables feed which downstream tables, and which queries touched which columns.

Lineage is captured automatically — no configuration needed.

Example lineage graph:

  ADLS Gen2 (external)
       │
       ▼
  prod.bronze.raw_customers  ◄── NB_Ingest_Customers (notebook)
       │
       ▼
  prod.silver.customers_clean  ◄── NB_Clean_Customers (notebook)
       │
       ▼
  prod.gold.dim_customer  ◄── NB_Build_Dimensions (notebook)
       │
       ├──► Power BI Dashboard (downstream consumer)
       └──► prod.gold.fact_orders ◄── NB_Build_Facts (notebook)

What lineage tells you:
  • "Which tables does NB_Build_Dimensions read?"
      → prod.silver.customers_clean, prod.silver.orders_validated
  • "What happens if I drop prod.silver.customers_clean?"
      → It will break NB_Build_Dimensions → which breaks prod.gold.dim_customer → Power BI
  • "Who reads prod.gold.dim_customer?"
      → NB_Build_Facts, Power BI Dashboard, 3 ad-hoc queries last week

How to view lineage:

1. Open Databricks workspace
2. Go to Catalog Explorer
3. Navigate to any table → Click the "Lineage" tab
4. See upstream tables (where data comes FROM) and downstream tables (where data goes TO)
5. Click any node to see the notebook or query that created the dependency

Column-level lineage: Unity Catalog tracks not just table-level lineage but column-level. If dim_customer.full_name is derived from CONCAT(raw_customers.first_name, raw_customers.last_name), the lineage graph shows this column-to-column mapping.

Real-life analogy: Lineage is like the ingredient list and supply chain of a restaurant dish. You can trace the pasta dish (Gold table) back to the sauce (Silver table) back to the tomatoes (Bronze table) back to the farm (ADLS source). If the farm has a contamination issue (bad source data), you know exactly which dishes are affected.

Audit Logging

Unity Catalog logs EVERY operation — every query, every permission change, every table creation, every data access. These logs are essential for compliance (SOC 2, HIPAA, GDPR) and troubleshooting.

What is logged:
  • Table/View access (SELECT, INSERT, UPDATE, DELETE)
  • Permission changes (GRANT, REVOKE)
  • Object creation/deletion (CREATE TABLE, DROP TABLE)
  • Schema changes (ALTER TABLE ADD COLUMN)
  • Authentication events (login, logout)
  • Catalog/Schema operations

Where logs are stored:
  • Delivered to your cloud storage via Audit Log Delivery
  • In Azure: Azure Diagnostic Settings → Log Analytics or Storage Account
  • Format: JSON events with timestamp, user, action, resource, IP address

Example audit log entry:
{
    "timestamp": "2026-06-11T10:30:15Z",
    "user": "naveen@company.com",
    "action": "SELECT",
    "resource": "prod.gold.dim_customer",
    "workspace_id": "1234567890",
    "source_ip": "10.0.0.50",
    "rows_returned": 15000,
    "query_text": "SELECT * FROM prod.gold.dim_customer WHERE city = 'Toronto'"
}

Real-life analogy: Audit logs are like security camera footage for your data. Every door (table) has a camera. Every time someone enters (queries), leaves (disconnects), or moves furniture (modifies data), it is recorded with a timestamp and their ID badge photo. If something goes wrong, you can rewind the footage and see exactly who did what.

Delta Sharing — Share Data Without Copying

Delta Sharing is an open protocol built into Unity Catalog that lets you share data with external organizations, other Databricks accounts, or non-Databricks tools — without copying.

Sharing Scenarios:
  1. Share data with another Databricks workspace (same account) → Internal sharing via UC
  2. Share data with another Databricks account (different company) → Databricks-to-Databricks sharing
  3. Share data with non-Databricks tools (Pandas, Power BI, Spark) → Open Delta Sharing

How Delta Sharing works:
  Provider (you)                              Recipient (partner company)
  ┌─────────────────┐                        ┌─────────────────┐
  │ prod.gold       │                        │ Pandas / Spark  │
  │  .dim_customer  │──── Delta Sharing ────►│ Power BI        │
  │  .fact_orders   │   (no data copied)     │ Databricks      │
  └─────────────────┘                        └─────────────────┘

  Data stays in YOUR storage. The recipient gets a token to read it.
  You control exactly which tables, which columns, and which rows are shared.
-- Create a share
CREATE SHARE customer_analytics_share;

-- Add tables to the share
ALTER SHARE customer_analytics_share ADD TABLE prod.gold.dim_customer;
ALTER SHARE customer_analytics_share ADD TABLE prod.gold.agg_monthly_revenue;

-- Share with partitions (only share specific data)
ALTER SHARE customer_analytics_share ADD TABLE prod.gold.fact_orders
PARTITION (region = 'Canada');  -- Only share Canadian orders

-- Create a recipient
CREATE RECIPIENT partner_company
USING ID 'partner-databricks-account-id';

-- Grant the share to the recipient
GRANT SELECT ON SHARE customer_analytics_share TO RECIPIENT partner_company;

-- Recipient accesses the data in their Databricks workspace:
-- CREATE CATALOG partner_data USING SHARE provider_account.customer_analytics_share;
-- SELECT * FROM partner_data.default.dim_customer;

Real-life analogy: Delta Sharing is like sharing a Google Doc with “view only” access. The document stays in YOUR Google Drive. The recipient can read it from anywhere — even from a different app (Pandas, Power BI). You can revoke access at any time. You never sent a copy. They never downloaded a copy. The one source of truth is always yours.

Identity Federation and Account-Level Identity

Unity Catalog uses account-level users and groups, not workspace-level. This means permissions follow the USER across all workspaces.

Before UC (workspace-level identity):
  Workspace A: naveen@company.com → admin
  Workspace B: naveen@company.com → contributor
  → Same person, different permissions in different workspaces
  → Had to manage users PER workspace

With UC (account-level identity):
  Account: naveen@company.com → data_engineers group
  → GRANT SELECT ON prod.gold TO data_engineers
  → naveen can access prod.gold from ANY workspace assigned to the same metastore
  → Manage users ONCE at the account level

Identity sources:
  • Azure AD (Microsoft Entra ID) — sync users and groups via SCIM
  • AWS IAM Identity Center
  • Okta, OneLogin, or any SAML/SCIM provider

  Azure AD Group "DataEngineers" → synced to Databricks group "data_engineers"
  → GRANT permissions to "data_engineers" in Unity Catalog
  → Add/remove users in Azure AD → automatically reflected in Databricks

Migrating from Hive Metastore to Unity Catalog

Migration approach:

Step 1: Create your UC metastore and assign workspaces

Step 2: Create catalogs and schemas matching your desired structure
  CREATE CATALOG prod;
  CREATE SCHEMA prod.bronze;

Step 3: Upgrade Hive tables to UC (two options):

  Option A: SYNC (non-destructive, keeps Hive table too)
  -- Creates a UC table pointing to the same data as the Hive table
  CREATE TABLE prod.bronze.customers
  USING DELTA
  LOCATION 'dbfs:/user/hive/warehouse/customers';

  Option B: UCX Migration Tool (recommended for large migrations)
  -- Databricks provides the UCX (Unity Catalog Migration) tool
  -- It scans your Hive metastore, identifies all tables, and migrates them
  -- Handles clusters, notebooks, jobs, and permissions migration

Step 4: Update notebooks from 2-level to 3-level names
  -- BEFORE: spark.table("customers")
  -- AFTER:  spark.table("prod.bronze.customers")

Step 5: Update GRANT permissions (Hive metastore permissions do NOT migrate)

Step 6: Test everything in a dev catalog before touching production

Step 7: Deprecate Hive metastore access (set workspace to UC-only mode)

Unity Catalog in Azure vs AWS vs GCP

Feature Azure AWS GCP
Managed storage ADLS Gen2 S3 GCS
Storage Credential Access Connector (Managed Identity) or Service Principal IAM Role Service Account
Identity Provider Azure AD (Entra ID) via SCIM AWS IAM Identity Center or SCIM Google Workspace or SCIM
Account Console accounts.azuredatabricks.net accounts.cloud.databricks.com accounts.gcp.databricks.com
Audit Logs Azure Diagnostic Settings → Log Analytics CloudTrail / S3 delivery Cloud Logging
Delta Sharing Supported Supported Supported

Unity Catalog features are identical across all clouds. The only difference is the storage and identity integration — which cloud-specific service provides the storage and which identity provider syncs users.

Unity Catalog vs Microsoft Fabric Governance

Feature Databricks Unity Catalog Microsoft Fabric
Namespace 3-level: catalog.schema.table 2-level: workspace/lakehouse.table
Metastore Centralized, account-level OneLake (automatic, no setup)
Row-Level Security SQL row filters RLS in semantic model (Power BI) or T-SQL in Warehouse
Column Masking SQL column mask functions Dynamic data masking in Warehouse, CLS in semantic model
Data Lineage Built-in (table + column level) Microsoft Purview
Audit Logging Built-in audit log delivery Microsoft Purview + Azure Monitor
Data Sharing Delta Sharing (open protocol) OneLake Shortcuts + Power BI sharing
Identity Azure AD / IAM via SCIM Azure AD (native)
Unstructured Data Volumes Lakehouse Files section

Common Mistakes

  1. Forgetting USE CATALOG + USE SCHEMA before GRANT SELECT — granting SELECT on a table is not enough. The user also needs USE CATALOG on the catalog and USE SCHEMA on the schema. Without all three, the query fails with a “table not found” error — not a permission error — which is confusing.

  2. Using cluster-scoped credentials instead of Unity Catalog — mounting ADLS with access keys on the cluster bypasses UC entirely. No governance, no lineage, no audit. Use Storage Credentials and External Locations instead.

  3. Not migrating from Hive metastore — keeping tables in hive_metastore means they have no UC governance. All new tables should be created in UC catalogs. Migrate existing tables using the UCX tool.

  4. Granting ALL PRIVILEGES when SELECT is sufficient — the principle of least privilege applies. Analysts need SELECT, not MODIFY. Over-granting creates security risks.

  5. Creating one giant catalog for everything — use separate catalogs for dev, staging, and prod environments. This prevents accidental writes to production from development notebooks.

  6. Not setting up SCIM provisioning from Azure AD — manually managing users in Databricks means users and groups get out of sync with your identity provider. Set up SCIM so Azure AD is the single source of truth.

  7. Dropping a managed table thinking only metadata is deleted — DROP TABLE on a MANAGED table deletes both metadata AND data files. For external tables, only metadata is deleted. Know the difference before running DROP.

Interview Questions

Q: What is Unity Catalog and why is it important? A: Unity Catalog is Databricks’ centralized governance layer for all data and AI assets. It provides a three-level namespace (catalog.schema.table), fine-grained access control (GRANT/REVOKE at table, row, and column level), automatic data lineage (table and column level), audit logging, Delta Sharing for external data sharing without copying, and Volumes for unstructured data governance. It replaces the workspace-scoped Hive metastore with an account-level metastore that spans all workspaces.

Q: What is the three-level namespace in Unity Catalog? A: catalog.schema.table. The catalog is the top-level container (typically representing an environment like prod or dev). The schema is a logical grouping within a catalog (like bronze, silver, gold). The table is the actual data object. Full reference example: prod.gold.dim_customer. This replaces the two-level database.table in the Hive metastore.

Q: What is the difference between managed and external tables in Unity Catalog? A: Managed tables store their data in the catalog’s managed location — Unity Catalog controls the storage. DROP TABLE deletes both metadata and data files. External tables point to data in your own storage account — you specify the LOCATION. DROP TABLE deletes only the metadata; data files remain. Both types receive full UC governance (permissions, lineage, audit).

Q: What are Storage Credentials and External Locations? A: A Storage Credential stores the authentication method (Managed Identity, Service Principal, IAM Role) for accessing cloud storage. An External Location maps a cloud storage path to a Storage Credential, defining which storage paths Unity Catalog can access. To create an external table, there must be an External Location covering the table’s storage path, and the user must have CREATE EXTERNAL TABLE permission on that location. The user never sees the underlying credential.

Q: How does row-level security work in Unity Catalog? A: You create a SQL function that returns TRUE for rows the user should see. Then you apply that function to a table with ALTER TABLE SET ROW FILTER. When anyone queries the table, Unity Catalog automatically applies the filter based on the user’s identity (CURRENT_USER or group membership). Admins can see all rows. Regular users see only their authorized rows. The filter is transparent — users do not know it is applied.

Q: What is Delta Sharing and how does it work? A: Delta Sharing is an open protocol for sharing data without copying. The data provider creates a Share, adds tables to it, and grants access to a Recipient. The recipient accesses the data through the Delta Sharing protocol — from Databricks, Pandas, Power BI, or any compatible client. Data stays in the provider’s storage. No copies are made. The provider can restrict sharing to specific partitions, columns, or rows, and can revoke access at any time.

Q: How does Unity Catalog lineage work? A: Unity Catalog automatically captures lineage whenever a query reads from or writes to a table. The lineage graph shows upstream tables (where data comes from), downstream tables (where data goes), and the notebooks or queries that create those dependencies. It captures both table-level and column-level lineage. No configuration is needed — lineage is captured automatically for all SQL and PySpark operations on UC tables. You can view lineage in the Catalog Explorer UI.

Wrapping Up

Unity Catalog transforms Databricks from “a collection of independent workspaces” into “a governed data platform.” The three-level namespace organizes everything. GRANT/REVOKE controls who can access what. Row-level security and column masking protect sensitive data without creating separate views. Lineage shows you the full dependency graph. Audit logs give you the compliance trail. And Delta Sharing lets you share data without the copies, pipelines, and sync nightmares.

The pattern for every enterprise Databricks deployment is clear: one metastore per region, catalogs for environments (prod, dev, staging), schemas for data layers (bronze, silver, gold), and permissions assigned to Azure AD groups synced via SCIM. Build this foundation, and your lakehouse is not just fast — it is governed.

Related posts:ADLS Gen2 Connectivity in DatabricksManaged vs External TablesDelta Lake Deep DiveAzure RBAC RolesAzure Connections & AuthenticationFabric Security & Governance (comparison)


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top
Share via
Copy link