Microsoft Fabric Lakehouse: The Complete Practical Guide — Tables, Files, Notebooks, SQL Endpoint, Delta Lake, and Building Your First Data Lake

You know what a Lakehouse IS from our comparison post. Now it is time to BUILD one. This post is the hands-on guide — creating a lakehouse, uploading data, building Delta tables, writing PySpark notebooks, querying through the SQL analytics endpoint, managing schemas, and organizing a production data lake.

Think of this post as your first day in a new kitchen. You know what a kitchen IS (the comparison post). Now you need to learn where everything goes — ingredients in the pantry (Files folder), prepared dishes in the display case (Tables), the recipe book (notebooks), and the serving window where customers order (SQL analytics endpoint). By the end of this post, you will have cooked your first meal.

Creating a Lakehouse
The Two Sections: Tables and Files
Tables vs Files — What Goes Where
Uploading Data to the Lakehouse
Upload via UI (Drag and Drop)
Upload via Notebook
Upload via Pipeline (Copy Activity)
Working with the Files Section
Reading Files with Notebooks
Organizing the Files Folder
Working with the Tables Section
Creating Delta Tables from Files
Creating Tables with PySpark
Creating Tables with SparkSQL
Table Properties and Metadata
Managed vs Unmanaged Tables
Schema Management
Default Schema (dbo)
Custom Schemas (bronze, silver, gold)
Essential Notebook Operations
Reading Tables
Writing Tables
Appending Data
Overwriting Data
Upsert with Delta MERGE
Table Maintenance (OPTIMIZE, VACUUM)
Time Travel (Version History)
The SQL Analytics Endpoint in Practice
Connecting and Querying
Creating Views
What You CANNOT Do (Read-Only Reminder)
Connecting Power BI via Direct Lake
Using Shortcuts in a Lakehouse
Internal Shortcuts (Cross-Workspace)
External Shortcuts (ADLS Gen2, S3)
Building a Medallion Architecture Lakehouse
Bronze Lakehouse Setup
Silver Lakehouse Setup
End-to-End Example: CSV to Dashboard
Lakehouse Settings and Configuration
Common Mistakes
Interview Questions
Wrapping Up

Creating a Lakehouse

Step by Step

Open your Fabric workspace
Click + New item
Select Lakehouse
Name: sales_lakehouse (lowercase, underscores — no spaces)
Click Create

The lakehouse opens with two empty sections: Tables and Files.

sales_lakehouse
  ├── Tables/          ← Empty (Delta tables will go here)
  │     └── (no tables yet)
  └── Files/           ← Empty (raw files will go here)
        └── (no files yet)

The Two Sections: Tables and Files

┌─────────────────────────────────────────────────────────┐
│  LAKEHOUSE: sales_lakehouse                              │
│                                                          │
│  ┌──────────────────────┐  ┌──────────────────────────┐ │
│  │  📊 TABLES            │  │  📁 FILES                 │ │
│  │                       │  │                           │ │
│  │  Delta tables         │  │  Raw files (any format)   │ │
│  │  Queryable via SQL    │  │  NOT queryable via SQL    │ │
│  │  Structured schema    │  │  No schema enforced       │ │
│  │                       │  │                           │ │
│  │  customers            │  │  raw_csv/                 │ │
│  │  orders               │  │    sales_2026.csv         │ │
│  │  products             │  │  json_feeds/              │ │
│  │                       │  │    api_response.json      │ │
│  │  Accessed via:        │  │  images/                  │ │
│  │  • Spark notebooks    │  │    product_photos/        │ │
│  │  • SQL endpoint       │  │                           │ │
│  │  • Power BI           │  │  Accessed via:            │ │
│  │                       │  │  • Spark notebooks only   │ │
│  └──────────────────────┘  └──────────────────────────┘ │
└─────────────────────────────────────────────────────────┘

Tables vs Files — What Goes Where

Data	Goes In	Why
Delta tables (structured, queryable)	Tables/	SQL-queryable, schema enforced, versioned
Raw CSV uploads from vendors	Files/	Staging area before processing into Tables
JSON from API responses	Files/	Semi-structured, needs parsing before table
Parquet from pipeline Copy activity	Tables/ (direct) or Files/ (staging)	Depends on whether it needs cleaning first
Excel from business users	Files/	Need to parse and clean before table
Images, PDFs, logs	Files/	Unstructured — cannot be tables
Config files, lookup CSVs	Files/	Reference data for notebooks

The pattern: Raw, unprocessed data → Files. Clean, structured data → Tables. Most production data flows: Files (land) → Notebook (transform) → Tables (serve).

Uploading Data to the Lakehouse

Upload via UI (Drag and Drop)

To Files: 1. Click Files in the left panel 2. Click Upload → Upload files or Upload folder 3. Drag and drop your CSV/JSON/Excel files 4. Files appear under Files/

To Tables: 1. Click Tables in the left panel 2. Click Upload → Upload files (for Parquet/CSV/Delta) 3. Fabric auto-creates a Delta table from the file 4. Table appears under Tables/

Upload via Notebook

# Upload a CSV from Files/ and create a Delta table in Tables/

# Step 1: Read the CSV from Files
df = spark.read.format("csv")     .option("header", "true")     .option("inferSchema", "true")     .load("Files/raw_csv/sales_2026.csv")

df.show(5)
print(f"Rows: {df.count()}, Columns: {len(df.columns)}")

# Step 2: Write as a Delta table in Tables
df.write.format("delta")     .mode("overwrite")     .saveAsTable("sales")

# The table now appears under Tables/sales in the Lakehouse Explorer

Upload via Pipeline (Copy Activity)

Fabric Pipeline:
  Copy Activity:
    Source: Azure SQL Database → SalesLT.Customer
    Destination: Lakehouse → sales_lakehouse → Tables → customers

    The Copy Activity creates a Delta table directly — no notebook needed.

Working with the Files Section

Reading Files with Notebooks

# Read CSV
df_csv = spark.read.format("csv")     .option("header", "true")     .option("inferSchema", "true")     .load("Files/raw_csv/sales_data.csv")

# Read JSON
df_json = spark.read.format("json")     .load("Files/json_feeds/api_response.json")

# Read Parquet
df_parquet = spark.read.format("parquet")     .load("Files/parquet_exports/customers.parquet")

# Read Excel (requires library)
# pip install openpyxl
df_excel = pd.read_excel("/lakehouse/default/Files/uploads/budget.xlsx")
df_spark = spark.createDataFrame(df_excel)

# Read all CSVs in a folder (wildcard)
df_all = spark.read.format("csv")     .option("header", "true")     .load("Files/raw_csv/*.csv")

# List files in a folder
files = mssparkutils.fs.ls("Files/raw_csv/")
for f in files:
    print(f"{f.name} — {f.size} bytes")

Organizing the Files Folder

Files/
  ├── raw_csv/               ← CSV uploads from vendors/partners
  │     ├── customers_20260525.csv
  │     └── orders_20260525.csv
  ├── json_feeds/            ← API response dumps
  │     └── salesforce_contacts.json
  ├── excel_uploads/         ← Business user uploads
  │     └── q2_budget.xlsx
  ├── config/                ← Configuration files
  │     └── table_mappings.json
  └── archive/               ← Processed files moved here
        └── customers_20260524.csv (already loaded)

Working with the Tables Section

Creating Delta Tables from Files

# Method 1: Read file → Write as table
df = spark.read.csv("Files/raw_csv/customers.csv", header=True, inferSchema=True)
df.write.format("delta").mode("overwrite").saveAsTable("customers")

# Method 2: SQL (CREATE TABLE from file)
spark.sql('''
    CREATE TABLE IF NOT EXISTS customers
    USING DELTA
    AS SELECT * FROM csv.`Files/raw_csv/customers.csv`
''')

Creating Tables with PySpark

# Create from DataFrame
from pyspark.sql.types import *

schema = StructType([
    StructField("customer_id", IntegerType(), False),
    StructField("name", StringType(), False),
    StructField("email", StringType(), True),
    StructField("city", StringType(), True),
    StructField("signup_date", DateType(), True)
])

data = [
    (1, "Naveen", "naveen@email.com", "Toronto", "2024-01-15"),
    (2, "Shrey", "shrey@email.com", "Mumbai", "2024-03-20"),
]

df = spark.createDataFrame(data, schema)
df.write.format("delta").mode("overwrite").saveAsTable("customers")

Creating Tables with SparkSQL

%%sql
-- Create empty table with schema
CREATE TABLE IF NOT EXISTS orders (
    order_id INT,
    customer_id INT,
    product STRING,
    amount DECIMAL(10,2),
    order_date DATE
) USING DELTA;

-- Insert data
INSERT INTO orders VALUES
    (101, 1, 'Laptop', 999.99, '2026-01-15'),
    (102, 2, 'Mouse', 29.99, '2026-01-20');

-- Verify
SELECT * FROM orders;

Table Properties and Metadata

# List all tables
spark.sql("SHOW TABLES").show()

# Describe table schema
spark.sql("DESCRIBE TABLE customers").show()

# Describe extended (location, format, properties)
spark.sql("DESCRIBE EXTENDED customers").show(100, truncate=False)

# Table history (Delta versioning)
spark.sql("DESCRIBE HISTORY customers").show()

# Table details
spark.sql("DETAIL customers").show()

Managed vs Unmanaged Tables

# Managed table (default) — Fabric controls data AND metadata
df.write.format("delta").saveAsTable("managed_customers")
# DROP TABLE → deletes data AND metadata

# Unmanaged (external) table — you control data location
df.write.format("delta").save("Files/external/customers")
spark.sql('''
    CREATE TABLE external_customers
    USING DELTA
    LOCATION 'Files/external/customers'
''')
# DROP TABLE → deletes metadata only, data stays in Files/

Rule: Use managed tables (default) unless you need data to survive table drops.

Schema Management

Default Schema (dbo)

Every lakehouse starts with a default schema. Tables created without specifying a schema go into the default:

# These both create in the default schema
df.write.saveAsTable("customers")
spark.sql("CREATE TABLE orders (...) USING DELTA")

Custom Schemas (bronze, silver, gold)

Fabric Lakehouse supports custom schemas for organization:

%%sql
-- Create schemas
CREATE SCHEMA IF NOT EXISTS bronze;
CREATE SCHEMA IF NOT EXISTS silver;
CREATE SCHEMA IF NOT EXISTS gold;

-- Create tables in specific schemas
CREATE TABLE bronze.raw_customers (...) USING DELTA;
CREATE TABLE silver.customers_clean (...) USING DELTA;
CREATE TABLE gold.dim_customer (...) USING DELTA;

-- Query with schema prefix
SELECT * FROM silver.customers_clean;

Tables/
  ├── bronze/
  │     ├── raw_customers
  │     ├── raw_orders
  │     └── raw_products
  ├── silver/
  │     ├── customers_clean
  │     ├── orders_validated
  │     └── products_standardized
  └── gold/
        ├── dim_customer
        ├── dim_product
        └── fact_orders

Essential Notebook Operations

Reading Tables

# Method 1: PySpark
df = spark.read.table("customers")
df = spark.table("customers")          # Shorthand

# Method 2: SparkSQL
df = spark.sql("SELECT * FROM customers WHERE city = 'Toronto'")

# Method 3: SparkSQL magic command
%%sql
SELECT * FROM customers WHERE city = 'Toronto'

# Read with schema
df = spark.table("silver.customers_clean")

Writing Tables

# Overwrite entire table
df_clean.write.format("delta").mode("overwrite").saveAsTable("silver.customers_clean")

# Append new rows
df_new.write.format("delta").mode("append").saveAsTable("silver.customers_clean")

# Write with options
df_clean.write.format("delta")     .mode("overwrite")     .option("overwriteSchema", "true")      .saveAsTable("silver.customers_clean")

Appending Data

# Append new rows to an existing table (keeps existing data)
df_new = spark.read.csv("Files/raw_csv/orders_20260608.csv", header=True, inferSchema=True)

df_new.write.format("delta") \
    .mode("append") \
    .saveAsTable("silver.orders_clean")

print(f"Total rows after append: {spark.table('silver.orders_clean').count()}")

# WARNING: Append is NOT idempotent — running twice creates duplicates!
# Use Delta MERGE for idempotent loads (insert if not exists, update if changed)

Overwriting Data

# Overwrite: replace entire table with new data
df_clean.write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("silver.customers_clean")

# Overwrite with schema change (when source added new columns)
df_clean.write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("silver.customers_clean")

# Overwrite IS idempotent — running twice produces the same result
# Safe for full-refresh loads (dimension tables, lookup tables)

Upsert with Delta MERGE

from delta.tables import DeltaTable

# Load target table as DeltaTable
target = DeltaTable.forName(spark, "silver.customers_clean")

# Source: new/changed records
source = spark.read.table("bronze.raw_customers")

# MERGE: update existing + insert new
target.alias("t").merge(
    source.alias("s"),
    "t.customer_id = s.customer_id"
).whenMatchedUpdate(set={
    "name": "s.name",
    "email": "s.email",
    "city": "s.city",
    "updated_at": "current_timestamp()"
}).whenNotMatchedInsert(values={
    "customer_id": "s.customer_id",
    "name": "s.name",
    "email": "s.email",
    "city": "s.city",
    "created_at": "current_timestamp()",
    "updated_at": "current_timestamp()"
}).execute()

print(f"MERGE complete. Target rows: {spark.table('silver.customers_clean').count()}")

Table Maintenance (OPTIMIZE, VACUUM)

%%sql
-- OPTIMIZE: compact small files into larger ones (faster reads)
OPTIMIZE silver.customers_clean;

-- OPTIMIZE with Z-ORDER (co-locate data by frequently filtered column)
OPTIMIZE silver.customers_clean ZORDER BY (city);

-- VACUUM: remove old file versions (save storage)
-- Default: keeps 7 days of history
VACUUM silver.customers_clean RETAIN 168 HOURS;

-- Check file count before/after
DESCRIBE DETAIL silver.customers_clean;

Time Travel (Version History)

# View version history
spark.sql("DESCRIBE HISTORY customers").show()

# Read a specific version
df_v1 = spark.read.format("delta").option("versionAsOf", 1).table("customers")

# Read as of a timestamp
df_yesterday = spark.read.format("delta")     .option("timestampAsOf", "2026-05-24")     .table("customers")

# Restore to a previous version
spark.sql("RESTORE TABLE customers TO VERSION AS OF 2")

The SQL Analytics Endpoint in Practice

Connecting and Querying

Every lakehouse automatically has a SQL analytics endpoint. To access it:

In your lakehouse, click the dropdown next to the lakehouse name (top-left)
Select SQL analytics endpoint
The view changes — you see tables with a SQL query editor

-- Query tables with T-SQL (read-only)
SELECT * FROM customers WHERE city = 'Toronto';

SELECT department, COUNT(*) AS emp_count, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
ORDER BY avg_salary DESC;

-- Join tables
SELECT o.order_id, c.name, o.product, o.amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;

Creating Views

-- Views ARE supported on the SQL analytics endpoint
CREATE VIEW vw_active_customers AS
SELECT customer_id, name, email, city
FROM customers
WHERE is_active = 1;

-- Query the view
SELECT * FROM vw_active_customers;

-- Views can join tables
CREATE VIEW vw_customer_orders AS
SELECT c.name, c.city, COUNT(o.order_id) AS total_orders, SUM(o.amount) AS total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.name, c.city;

What You CANNOT Do (Read-Only Reminder)

-- ❌ ALL of these FAIL on the SQL analytics endpoint:
INSERT INTO customers VALUES (999, 'Test', 'test@email.com');   -- ERROR
UPDATE customers SET city = 'Vancouver' WHERE customer_id = 1;  -- ERROR
DELETE FROM customers WHERE customer_id = 999;                   -- ERROR
MERGE INTO customers ...;                                        -- ERROR
CREATE TABLE new_table (...);                                    -- ERROR
-- Use Spark notebooks for ALL write operations

Connecting Power BI via Direct Lake

Open Power BI Desktop or create a new Semantic Model in the workspace
Data source: Microsoft Fabric → Lakehouse
Select your lakehouse → select tables
Connection mode: Direct Lake (automatic in Fabric)
Build your report — data is always fresh, no scheduled refresh needed

Using Shortcuts in a Lakehouse

Internal Shortcuts (Cross-Workspace)

Workspace: Analytics
  Lakehouse: analytics_lakehouse
    Tables/
      customers ← SHORTCUT to DataEng workspace / bronze_lakehouse / customers

# Analysts query 'customers' as if it were local — data stays in DataEng workspace

Right-click Tables → New shortcut
Select OneLake → choose source workspace → lakehouse → table
Name the shortcut → Create

External Shortcuts (ADLS Gen2, S3)

Lakehouse: sales_lakehouse
  Tables/
    external_customers ← SHORTCUT to ADLS Gen2 → silver/customers/

# Fabric reads Delta table from your existing ADLS — zero data copied

Right-click Tables → New shortcut
Select Azure Data Lake Storage Gen2
Enter connection details → browse to the Delta table folder
Name the shortcut → Create

Building a Medallion Architecture Lakehouse

Option A: Single Lakehouse with Schemas

sales_lakehouse
  ├── Tables/
  │     ├── bronze.raw_customers
  │     ├── bronze.raw_orders
  │     ├── silver.customers_clean
  │     ├── silver.orders_validated
  │     ├── gold.dim_customer
  │     └── gold.fact_orders
  └── Files/
        └── raw_csv/ (landing zone for uploads)

Option B: Separate Lakehouses per Layer (Recommended for Production)

Workspace: DataEng_Prod
  ├── bronze_lakehouse
  │     ├── Tables/ (raw Delta tables from pipeline Copy activities)
  │     └── Files/ (raw CSV/JSON uploads)
  │
  ├── silver_lakehouse
  │     └── Tables/ (cleaned, validated tables)
  │
  └── gold_lakehouse (or use Warehouse for Gold)
        └── Tables/ (star schema: dim + fact tables)

Bronze Lakehouse Setup

Bronze Lakehouse: bronze_lakehouse
  Purpose: Landing zone for raw data — no transformations
  
  Tables/
    raw_customers    ← Pipeline Copy Activity from Azure SQL
    raw_orders       ← Pipeline Copy Activity from Azure SQL
    raw_products     ← Pipeline Copy Activity from API response
    
  Files/
    incoming_csv/    ← Vendor CSV uploads (drag and drop)
    api_responses/   ← Raw JSON from API calls
    excel_uploads/   ← Business user Excel files
    
  Rules:
    ✅ Data arrives as-is (no cleaning, no filtering)
    ✅ Schema matches the source exactly
    ✅ Append-only (never delete raw data)
    ❌ No transformations, no business logic

Silver Lakehouse Setup

Silver Lakehouse: silver_lakehouse
  Purpose: Cleaned, validated, deduplicated data
  
  Tables/
    customers_clean   ← Notebook: trim, lowercase email, dedup, null handling
    orders_validated  ← Notebook: cast types, filter invalid, dedup by order_id
    products_std      ← Notebook: standardize category names, fill nulls
    
  Rules:
    ✅ Consistent data types (strings trimmed, dates parsed)
    ✅ Nulls handled (imputed or filtered)
    ✅ Duplicates removed (dedup by business key)
    ✅ Data quality checks passed
    ❌ No business-specific aggregations (that is Gold)

End-to-End Example: CSV to Dashboard

Step 1: Upload CSV → Files/raw_csv/customers.csv (drag and drop)

Step 2: Notebook — Read and clean:
  df = spark.read.csv("Files/raw_csv/customers.csv", header=True, inferSchema=True)
  df_clean = df.withColumn("name", initcap(trim(col("name"))))                .withColumn("email", lower(trim(col("email"))))                .filter(col("customer_id").isNotNull())                .dropDuplicates(["customer_id"])
  df_clean.write.format("delta").mode("overwrite").saveAsTable("customers")

Step 3: SQL analytics endpoint — Create view:
  CREATE VIEW vw_customer_summary AS
  SELECT city, COUNT(*) AS customer_count FROM customers GROUP BY city;

Step 4: Power BI — Connect via Direct Lake:
  Select 'customers' and 'vw_customer_summary' → build dashboard
  Dashboard auto-refreshes when notebook reruns

Step 5: Pipeline — Schedule notebook to run daily at 6 AM

Lakehouse Settings and Configuration

Access Lakehouse settings by clicking the gear icon in the Lakehouse explorer:

Setting	What It Controls	Recommendation
SQL analytics endpoint	Auto-generated read-only SQL interface	Always on (default)
Default semantic model	Auto-generated Power BI dataset	Disable on Bronze/Silver, enable on Gold only
OneLake data access roles	Table/folder-level security	Configure for multi-team access control
Sensitivity labels	Data classification	Apply based on data content
Endorsement	Promoted or Certified status	Certify Gold lakehouses, leave Bronze as none

Common Mistakes

Putting everything in Files and never creating Tables — Files are a staging area. Production data should be in Tables (Delta format) for SQL access, Power BI, and governance.
Not using schemas for organization — a lakehouse with 50 tables in the default schema is chaos. Use bronze/silver/gold schemas.
Trying to write via the SQL analytics endpoint — it is read-only. Use Spark notebooks for all INSERT, UPDATE, DELETE, and MERGE operations.
Forgetting to run OPTIMIZE — Delta tables accumulate small files over time. Run OPTIMIZE weekly or after large loads for better read performance.
Not using VACUUM — old file versions consume storage. Run VACUUM to clean up versions older than 7 days.
Creating one giant lakehouse for everything — separate by purpose: bronze_lakehouse, silver_lakehouse. Or use schemas within one lakehouse. Either way, organize.

Interview Questions

Q: What is the difference between the Tables and Files sections in a Fabric Lakehouse? A: Tables store structured Delta Lake tables that are queryable via Spark notebooks AND the SQL analytics endpoint. Files store raw, unstructured data in any format (CSV, JSON, Excel, images) accessible only via Spark notebooks. Data typically flows: Files (land raw data) → Notebook (transform) → Tables (serve clean data).

Q: How do you create a Delta table in a Fabric Lakehouse? A: Three methods: read a file and write as table (df.write.saveAsTable("name")), use SparkSQL (CREATE TABLE ... USING DELTA), or use the Copy Activity in a pipeline which creates Delta tables directly. All tables in a Lakehouse are Delta format by default.

Q: What is the SQL analytics endpoint and what are its limitations? A: An auto-generated T-SQL interface on top of Lakehouse Delta tables. Supports SELECT queries, JOINs, views, and table-valued functions. Does NOT support INSERT, UPDATE, DELETE, MERGE, or CREATE TABLE — it is read-only. Write operations must go through Spark notebooks.

Wrapping Up

The Fabric Lakehouse is your workshop for data engineering — raw files land in Files, get transformed by notebooks, and become queryable Delta tables in Tables. The SQL analytics endpoint gives analysts SQL access without touching Spark. Shortcuts connect to external data without copying. And the Medallion Architecture (bronze → silver → gold schemas) organizes everything for production.

← Previous: Lakehouse vs Warehouse Fabric (7/38) Next: Warehouse Practical Guide →

Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.

Table of Contents