Microsoft Fabric Lakehouse: The Complete Practical Guide — Tables, Files, Notebooks, SQL Endpoint, Delta Lake, and Building Your First Data Lake

Microsoft Fabric Lakehouse: The Complete Practical Guide — Tables, Files, Notebooks, SQL Endpoint, Delta Lake, and Building Your First Data Lake

You know what a Lakehouse IS from our comparison post. Now it is time to BUILD one. This post is the hands-on guide — creating a lakehouse, uploading data, building Delta tables, writing PySpark notebooks, querying through the SQL analytics endpoint, managing schemas, and organizing a production data lake.

Think of this post as your first day in a new kitchen. You know what a kitchen IS (the comparison post). Now you need to learn where everything goes — ingredients in the pantry (Files folder), prepared dishes in the display case (Tables), the recipe book (notebooks), and the serving window where customers order (SQL analytics endpoint). By the end of this post, you will have cooked your first meal.

Table of Contents

  • Creating a Lakehouse
  • The Two Sections: Tables and Files
  • Tables vs Files — What Goes Where
  • Uploading Data to the Lakehouse
  • Upload via UI (Drag and Drop)
  • Upload via Notebook
  • Upload via Pipeline (Copy Activity)
  • Working with the Files Section
  • Reading Files with Notebooks
  • Organizing the Files Folder
  • Working with the Tables Section
  • Creating Delta Tables from Files
  • Creating Tables with PySpark
  • Creating Tables with SparkSQL
  • Table Properties and Metadata
  • Managed vs Unmanaged Tables
  • Schema Management
  • Default Schema (dbo)
  • Custom Schemas (bronze, silver, gold)
  • Essential Notebook Operations
  • Reading Tables
  • Writing Tables
  • Appending Data
  • Overwriting Data
  • Upsert with Delta MERGE
  • Table Maintenance (OPTIMIZE, VACUUM)
  • Time Travel (Version History)
  • The SQL Analytics Endpoint in Practice
  • Connecting and Querying
  • Creating Views
  • What You CANNOT Do (Read-Only Reminder)
  • Connecting Power BI via Direct Lake
  • Using Shortcuts in a Lakehouse
  • Internal Shortcuts (Cross-Workspace)
  • External Shortcuts (ADLS Gen2, S3)
  • Building a Medallion Architecture Lakehouse
  • Bronze Lakehouse Setup
  • Silver Lakehouse Setup
  • End-to-End Example: CSV to Dashboard
  • Lakehouse Settings and Configuration
  • Common Mistakes
  • Interview Questions
  • Wrapping Up

Creating a Lakehouse

Step by Step

  1. Open your Fabric workspace
  2. Click + New item
  3. Select Lakehouse
  4. Name: sales_lakehouse (lowercase, underscores — no spaces)
  5. Click Create

The lakehouse opens with two empty sections: Tables and Files.

sales_lakehouse
  ├── Tables/          ← Empty (Delta tables will go here)
  │     └── (no tables yet)
  └── Files/           ← Empty (raw files will go here)
        └── (no files yet)

The Two Sections: Tables and Files

┌─────────────────────────────────────────────────────────┐
│  LAKEHOUSE: sales_lakehouse                              │
│                                                          │
│  ┌──────────────────────┐  ┌──────────────────────────┐ │
│  │  📊 TABLES            │  │  📁 FILES                 │ │
│  │                       │  │                           │ │
│  │  Delta tables         │  │  Raw files (any format)   │ │
│  │  Queryable via SQL    │  │  NOT queryable via SQL    │ │
│  │  Structured schema    │  │  No schema enforced       │ │
│  │                       │  │                           │ │
│  │  customers            │  │  raw_csv/                 │ │
│  │  orders               │  │    sales_2026.csv         │ │
│  │  products             │  │  json_feeds/              │ │
│  │                       │  │    api_response.json      │ │
│  │  Accessed via:        │  │  images/                  │ │
│  │  • Spark notebooks    │  │    product_photos/        │ │
│  │  • SQL endpoint       │  │                           │ │
│  │  • Power BI           │  │  Accessed via:            │ │
│  │                       │  │  • Spark notebooks only   │ │
│  └──────────────────────┘  └──────────────────────────┘ │
└─────────────────────────────────────────────────────────┘

Tables vs Files — What Goes Where

Data Goes In Why
Delta tables (structured, queryable) Tables/ SQL-queryable, schema enforced, versioned
Raw CSV uploads from vendors Files/ Staging area before processing into Tables
JSON from API responses Files/ Semi-structured, needs parsing before table
Parquet from pipeline Copy activity Tables/ (direct) or Files/ (staging) Depends on whether it needs cleaning first
Excel from business users Files/ Need to parse and clean before table
Images, PDFs, logs Files/ Unstructured — cannot be tables
Config files, lookup CSVs Files/ Reference data for notebooks

The pattern: Raw, unprocessed data → Files. Clean, structured data → Tables. Most production data flows: Files (land) → Notebook (transform) → Tables (serve).

Uploading Data to the Lakehouse

Upload via UI (Drag and Drop)

To Files: 1. Click Files in the left panel 2. Click UploadUpload files or Upload folder 3. Drag and drop your CSV/JSON/Excel files 4. Files appear under Files/

To Tables: 1. Click Tables in the left panel 2. Click UploadUpload files (for Parquet/CSV/Delta) 3. Fabric auto-creates a Delta table from the file 4. Table appears under Tables/

Upload via Notebook

# Upload a CSV from Files/ and create a Delta table in Tables/

# Step 1: Read the CSV from Files
df = spark.read.format("csv")     .option("header", "true")     .option("inferSchema", "true")     .load("Files/raw_csv/sales_2026.csv")

df.show(5)
print(f"Rows: {df.count()}, Columns: {len(df.columns)}")

# Step 2: Write as a Delta table in Tables
df.write.format("delta")     .mode("overwrite")     .saveAsTable("sales")

# The table now appears under Tables/sales in the Lakehouse Explorer

Upload via Pipeline (Copy Activity)

Fabric Pipeline:
  Copy Activity:
    Source: Azure SQL Database → SalesLT.Customer
    Destination: Lakehouse → sales_lakehouse → Tables → customers

    The Copy Activity creates a Delta table directly — no notebook needed.

Working with the Files Section

Reading Files with Notebooks

# Read CSV
df_csv = spark.read.format("csv")     .option("header", "true")     .option("inferSchema", "true")     .load("Files/raw_csv/sales_data.csv")

# Read JSON
df_json = spark.read.format("json")     .load("Files/json_feeds/api_response.json")

# Read Parquet
df_parquet = spark.read.format("parquet")     .load("Files/parquet_exports/customers.parquet")

# Read Excel (requires library)
# pip install openpyxl
df_excel = pd.read_excel("/lakehouse/default/Files/uploads/budget.xlsx")
df_spark = spark.createDataFrame(df_excel)

# Read all CSVs in a folder (wildcard)
df_all = spark.read.format("csv")     .option("header", "true")     .load("Files/raw_csv/*.csv")

# List files in a folder
files = mssparkutils.fs.ls("Files/raw_csv/")
for f in files:
    print(f"{f.name} — {f.size} bytes")

Organizing the Files Folder

Files/
  ├── raw_csv/               ← CSV uploads from vendors/partners
  │     ├── customers_20260525.csv
  │     └── orders_20260525.csv
  ├── json_feeds/            ← API response dumps
  │     └── salesforce_contacts.json
  ├── excel_uploads/         ← Business user uploads
  │     └── q2_budget.xlsx
  ├── config/                ← Configuration files
  │     └── table_mappings.json
  └── archive/               ← Processed files moved here
        └── customers_20260524.csv (already loaded)

Working with the Tables Section

Creating Delta Tables from Files

# Method 1: Read file → Write as table
df = spark.read.csv("Files/raw_csv/customers.csv", header=True, inferSchema=True)
df.write.format("delta").mode("overwrite").saveAsTable("customers")

# Method 2: SQL (CREATE TABLE from file)
spark.sql('''
    CREATE TABLE IF NOT EXISTS customers
    USING DELTA
    AS SELECT * FROM csv.`Files/raw_csv/customers.csv`
''')

Creating Tables with PySpark

# Create from DataFrame
from pyspark.sql.types import *

schema = StructType([
    StructField("customer_id", IntegerType(), False),
    StructField("name", StringType(), False),
    StructField("email", StringType(), True),
    StructField("city", StringType(), True),
    StructField("signup_date", DateType(), True)
])

data = [
    (1, "Naveen", "naveen@email.com", "Toronto", "2024-01-15"),
    (2, "Shrey", "shrey@email.com", "Mumbai", "2024-03-20"),
]

df = spark.createDataFrame(data, schema)
df.write.format("delta").mode("overwrite").saveAsTable("customers")

Creating Tables with SparkSQL

%%sql
-- Create empty table with schema
CREATE TABLE IF NOT EXISTS orders (
    order_id INT,
    customer_id INT,
    product STRING,
    amount DECIMAL(10,2),
    order_date DATE
) USING DELTA;

-- Insert data
INSERT INTO orders VALUES
    (101, 1, 'Laptop', 999.99, '2026-01-15'),
    (102, 2, 'Mouse', 29.99, '2026-01-20');

-- Verify
SELECT * FROM orders;

Table Properties and Metadata

# List all tables
spark.sql("SHOW TABLES").show()

# Describe table schema
spark.sql("DESCRIBE TABLE customers").show()

# Describe extended (location, format, properties)
spark.sql("DESCRIBE EXTENDED customers").show(100, truncate=False)

# Table history (Delta versioning)
spark.sql("DESCRIBE HISTORY customers").show()

# Table details
spark.sql("DETAIL customers").show()

Managed vs Unmanaged Tables

# Managed table (default) — Fabric controls data AND metadata
df.write.format("delta").saveAsTable("managed_customers")
# DROP TABLE → deletes data AND metadata

# Unmanaged (external) table — you control data location
df.write.format("delta").save("Files/external/customers")
spark.sql('''
    CREATE TABLE external_customers
    USING DELTA
    LOCATION 'Files/external/customers'
''')
# DROP TABLE → deletes metadata only, data stays in Files/

Rule: Use managed tables (default) unless you need data to survive table drops.

Schema Management

Default Schema (dbo)

Every lakehouse starts with a default schema. Tables created without specifying a schema go into the default:

# These both create in the default schema
df.write.saveAsTable("customers")
spark.sql("CREATE TABLE orders (...) USING DELTA")

Custom Schemas (bronze, silver, gold)

Fabric Lakehouse supports custom schemas for organization:

%%sql
-- Create schemas
CREATE SCHEMA IF NOT EXISTS bronze;
CREATE SCHEMA IF NOT EXISTS silver;
CREATE SCHEMA IF NOT EXISTS gold;

-- Create tables in specific schemas
CREATE TABLE bronze.raw_customers (...) USING DELTA;
CREATE TABLE silver.customers_clean (...) USING DELTA;
CREATE TABLE gold.dim_customer (...) USING DELTA;

-- Query with schema prefix
SELECT * FROM silver.customers_clean;
Tables/
  ├── bronze/
  │     ├── raw_customers
  │     ├── raw_orders
  │     └── raw_products
  ├── silver/
  │     ├── customers_clean
  │     ├── orders_validated
  │     └── products_standardized
  └── gold/
        ├── dim_customer
        ├── dim_product
        └── fact_orders

Essential Notebook Operations

Reading Tables

# Method 1: PySpark
df = spark.read.table("customers")
df = spark.table("customers")          # Shorthand

# Method 2: SparkSQL
df = spark.sql("SELECT * FROM customers WHERE city = 'Toronto'")

# Method 3: SparkSQL magic command
%%sql
SELECT * FROM customers WHERE city = 'Toronto'

# Read with schema
df = spark.table("silver.customers_clean")

Writing Tables

# Overwrite entire table
df_clean.write.format("delta").mode("overwrite").saveAsTable("silver.customers_clean")

# Append new rows
df_new.write.format("delta").mode("append").saveAsTable("silver.customers_clean")

# Write with options
df_clean.write.format("delta")     .mode("overwrite")     .option("overwriteSchema", "true")      .saveAsTable("silver.customers_clean")

Upsert with Delta MERGE

from delta.tables import DeltaTable

# Load target table as DeltaTable
target = DeltaTable.forName(spark, "silver.customers_clean")

# Source: new/changed records
source = spark.read.table("bronze.raw_customers")

# MERGE: update existing + insert new
target.alias("t").merge(
    source.alias("s"),
    "t.customer_id = s.customer_id"
).whenMatchedUpdate(set={
    "name": "s.name",
    "email": "s.email",
    "city": "s.city",
    "updated_at": "current_timestamp()"
}).whenNotMatchedInsert(values={
    "customer_id": "s.customer_id",
    "name": "s.name",
    "email": "s.email",
    "city": "s.city",
    "created_at": "current_timestamp()",
    "updated_at": "current_timestamp()"
}).execute()

print(f"MERGE complete. Target rows: {spark.table('silver.customers_clean').count()}")

Table Maintenance (OPTIMIZE, VACUUM)

%%sql
-- OPTIMIZE: compact small files into larger ones (faster reads)
OPTIMIZE silver.customers_clean;

-- OPTIMIZE with Z-ORDER (co-locate data by frequently filtered column)
OPTIMIZE silver.customers_clean ZORDER BY (city);

-- VACUUM: remove old file versions (save storage)
-- Default: keeps 7 days of history
VACUUM silver.customers_clean RETAIN 168 HOURS;

-- Check file count before/after
DESCRIBE DETAIL silver.customers_clean;

Time Travel (Version History)

# View version history
spark.sql("DESCRIBE HISTORY customers").show()

# Read a specific version
df_v1 = spark.read.format("delta").option("versionAsOf", 1).table("customers")

# Read as of a timestamp
df_yesterday = spark.read.format("delta")     .option("timestampAsOf", "2026-05-24")     .table("customers")

# Restore to a previous version
spark.sql("RESTORE TABLE customers TO VERSION AS OF 2")

The SQL Analytics Endpoint in Practice

Connecting and Querying

Every lakehouse automatically has a SQL analytics endpoint. To access it:

  1. In your lakehouse, click the dropdown next to the lakehouse name (top-left)
  2. Select SQL analytics endpoint
  3. The view changes — you see tables with a SQL query editor
-- Query tables with T-SQL (read-only)
SELECT * FROM customers WHERE city = 'Toronto';

SELECT department, COUNT(*) AS emp_count, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
ORDER BY avg_salary DESC;

-- Join tables
SELECT o.order_id, c.name, o.product, o.amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;

Creating Views

-- Views ARE supported on the SQL analytics endpoint
CREATE VIEW vw_active_customers AS
SELECT customer_id, name, email, city
FROM customers
WHERE is_active = 1;

-- Query the view
SELECT * FROM vw_active_customers;

-- Views can join tables
CREATE VIEW vw_customer_orders AS
SELECT c.name, c.city, COUNT(o.order_id) AS total_orders, SUM(o.amount) AS total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.name, c.city;

What You CANNOT Do (Read-Only Reminder)

-- ❌ ALL of these FAIL on the SQL analytics endpoint:
INSERT INTO customers VALUES (999, 'Test', 'test@email.com');   -- ERROR
UPDATE customers SET city = 'Vancouver' WHERE customer_id = 1;  -- ERROR
DELETE FROM customers WHERE customer_id = 999;                   -- ERROR
MERGE INTO customers ...;                                        -- ERROR
CREATE TABLE new_table (...);                                    -- ERROR
-- Use Spark notebooks for ALL write operations

Connecting Power BI via Direct Lake

  1. Open Power BI Desktop or create a new Semantic Model in the workspace
  2. Data source: Microsoft FabricLakehouse
  3. Select your lakehouse → select tables
  4. Connection mode: Direct Lake (automatic in Fabric)
  5. Build your report — data is always fresh, no scheduled refresh needed

Using Shortcuts in a Lakehouse

Internal Shortcuts (Cross-Workspace)

Workspace: Analytics
  Lakehouse: analytics_lakehouse
    Tables/
      customers ← SHORTCUT to DataEng workspace / bronze_lakehouse / customers

# Analysts query 'customers' as if it were local — data stays in DataEng workspace
  1. Right-click TablesNew shortcut
  2. Select OneLake → choose source workspace → lakehouse → table
  3. Name the shortcut → Create

External Shortcuts (ADLS Gen2, S3)

Lakehouse: sales_lakehouse
  Tables/
    external_customers ← SHORTCUT to ADLS Gen2 → silver/customers/

# Fabric reads Delta table from your existing ADLS — zero data copied
  1. Right-click TablesNew shortcut
  2. Select Azure Data Lake Storage Gen2
  3. Enter connection details → browse to the Delta table folder
  4. Name the shortcut → Create

Building a Medallion Architecture Lakehouse

Option A: Single Lakehouse with Schemas

sales_lakehouse
  ├── Tables/
  │     ├── bronze.raw_customers
  │     ├── bronze.raw_orders
  │     ├── silver.customers_clean
  │     ├── silver.orders_validated
  │     ├── gold.dim_customer
  │     └── gold.fact_orders
  └── Files/
        └── raw_csv/ (landing zone for uploads)
Workspace: DataEng_Prod
  ├── bronze_lakehouse
  │     ├── Tables/ (raw Delta tables from pipeline Copy activities)
  │     └── Files/ (raw CSV/JSON uploads)
  │
  ├── silver_lakehouse
  │     └── Tables/ (cleaned, validated tables)
  │
  └── gold_lakehouse (or use Warehouse for Gold)
        └── Tables/ (star schema: dim + fact tables)

End-to-End Example: CSV to Dashboard

Step 1: Upload CSV → Files/raw_csv/customers.csv (drag and drop)

Step 2: Notebook — Read and clean:
  df = spark.read.csv("Files/raw_csv/customers.csv", header=True, inferSchema=True)
  df_clean = df.withColumn("name", initcap(trim(col("name"))))                .withColumn("email", lower(trim(col("email"))))                .filter(col("customer_id").isNotNull())                .dropDuplicates(["customer_id"])
  df_clean.write.format("delta").mode("overwrite").saveAsTable("customers")

Step 3: SQL analytics endpoint — Create view:
  CREATE VIEW vw_customer_summary AS
  SELECT city, COUNT(*) AS customer_count FROM customers GROUP BY city;

Step 4: Power BI — Connect via Direct Lake:
  Select 'customers' and 'vw_customer_summary' → build dashboard
  Dashboard auto-refreshes when notebook reruns

Step 5: Pipeline — Schedule notebook to run daily at 6 AM

Common Mistakes

  1. Putting everything in Files and never creating Tables — Files are a staging area. Production data should be in Tables (Delta format) for SQL access, Power BI, and governance.

  2. Not using schemas for organization — a lakehouse with 50 tables in the default schema is chaos. Use bronze/silver/gold schemas.

  3. Trying to write via the SQL analytics endpoint — it is read-only. Use Spark notebooks for all INSERT, UPDATE, DELETE, and MERGE operations.

  4. Forgetting to run OPTIMIZE — Delta tables accumulate small files over time. Run OPTIMIZE weekly or after large loads for better read performance.

  5. Not using VACUUM — old file versions consume storage. Run VACUUM to clean up versions older than 7 days.

  6. Creating one giant lakehouse for everything — separate by purpose: bronze_lakehouse, silver_lakehouse. Or use schemas within one lakehouse. Either way, organize.

Interview Questions

Q: What is the difference between the Tables and Files sections in a Fabric Lakehouse? A: Tables store structured Delta Lake tables that are queryable via Spark notebooks AND the SQL analytics endpoint. Files store raw, unstructured data in any format (CSV, JSON, Excel, images) accessible only via Spark notebooks. Data typically flows: Files (land raw data) → Notebook (transform) → Tables (serve clean data).

Q: How do you create a Delta table in a Fabric Lakehouse? A: Three methods: read a file and write as table (df.write.saveAsTable("name")), use SparkSQL (CREATE TABLE ... USING DELTA), or use the Copy Activity in a pipeline which creates Delta tables directly. All tables in a Lakehouse are Delta format by default.

Q: What is the SQL analytics endpoint and what are its limitations? A: An auto-generated T-SQL interface on top of Lakehouse Delta tables. Supports SELECT queries, JOINs, views, and table-valued functions. Does NOT support INSERT, UPDATE, DELETE, MERGE, or CREATE TABLE — it is read-only. Write operations must go through Spark notebooks.

Wrapping Up

The Fabric Lakehouse is your workshop for data engineering — raw files land in Files, get transformed by notebooks, and become queryable Delta tables in Tables. The SQL analytics endpoint gives analysts SQL access without touching Spark. Shortcuts connect to external data without copying. And the Medallion Architecture (bronze → silver → gold schemas) organizes everything for production.

Related posts:Lakehouse vs WarehouseFabric Foundations: Capacity, Workspaces, ItemsOneLake ShortcutsDelta Lake Deep DivePySpark Transformations Cookbook


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top
Share via
Copy link