Table Creation and Governance in Enterprise Data Engineering: Why Real Projects Are Nothing Like Tutorials — Dev/Test/Prod, Change Management, File Ingestion Patterns, and Production Standards

Table Creation and Governance in Enterprise Data Engineering: Why Real Projects Are Nothing Like Tutorials — Dev/Test/Prod, Change Management, File Ingestion Patterns, and Production Standards

Open any data engineering course on Udemy or YouTube. Within 10 minutes, the instructor creates a Lakehouse, opens a notebook, and runs spark.sql("CREATE TABLE customers ..."). Five minutes later, they create another. And another. Bronze table, Silver table, Gold table — created like sticky notes on a whiteboard. No approvals, no tickets, no reviews.

Now walk into an enterprise like Capgemini, TCS, Infosys, Deloitte, or any Fortune 500 company. Try creating a table in production. You will need a Jira ticket, an architect’s review, a naming convention check, a DBA sign-off, a deployment through CI/CD from Dev to Test to Prod, and probably two managers approving the pull request. Altering a single column? Change request, impact analysis, downstream dependency check, regression testing, and a deployment window.

This gap — between tutorial-world and real-world — is the number one shock for new data engineers. Nobody teaches this. No course covers it. But it is the reality of every production data platform.

This post bridges that gap. How tables are actually created in enterprise data engineering, who has permission to do what, what happens when a new file arrives from a client, and why the Medallion Architecture in production looks nothing like the version in your Udemy course.

Table of Contents

  • The Tutorial World vs Enterprise Reality
  • The Analogy: Building a House vs Renovating a Hospital
  • Environment Separation: Dev, Test, Prod
  • Who Creates Tables? Roles and Responsibilities
  • The Change Management Workflow
  • Step-by-Step: From Jira Ticket to Production Table
  • What Happens When a New File Arrives from a Client
  • Scenario 1: New CSV File, New Source System
  • Scenario 2: Existing Source Adds a New Column
  • Scenario 3: Daily Automated File Drop
  • Scenario 4: One-Time Historical Data Load
  • Do You Always Create a Table for Every File?
  • The Medallion Architecture in Production
  • Bronze Layer Governance
  • Silver Layer Governance
  • Gold Layer Governance
  • Naming Conventions and Standards
  • Table Naming Patterns
  • Column Naming Standards
  • Schema and Namespace Organization
  • Governance in Fabric
  • Workspace Separation
  • Deployment Pipelines (Dev → Test → Prod)
  • Git Integration for Table Definitions
  • Fabric Workspace Roles
  • Governance in Databricks
  • Unity Catalog and Three-Level Namespace
  • Permissions and Access Control
  • The Approval Chain: Who Approves What
  • Documentation Requirements
  • What Data Dictionaries Look Like
  • Why Tutorials Skip All This
  • Common Mistakes New Engineers Make
  • Interview Questions
  • Wrapping Up

The Tutorial World vs Enterprise Reality

Aspect Tutorial / Course Enterprise Production
Who creates tables? You — the student Defined roles (DE creates, Architect reviews, DBA approves)
Where? One workspace, one environment Dev → Test → Prod (separate workspaces)
Approvals needed? None — just click Run Jira ticket, PR review, architect sign-off, DBA approval
Naming conventions? Whatever you want — customers2, test_final_v3 Strict standards — brz_crm_customer, slv_crm_customer_clean
Altering a column? Run ALTER TABLE, done Change request, impact analysis, downstream check, deployment
Dropping a table? DROP TABLE, gone Almost never. Deprecate, archive, then remove after 90 days
Time to create a table? 30 seconds 1-5 business days (with all approvals)
Access to production? Full admin — you own everything Read-only for most. Only CI/CD pipelines write to prod.

This is not bureaucracy for the sake of bureaucracy. A single wrong ALTER TABLE in production can break 15 Power BI dashboards, 3 downstream pipelines, and 2 regulatory reports. The governance exists because the cost of a mistake in production is measured in hours of downtime, lost revenue, and compliance violations — not in “I will just recreate the table.”

The Analogy: Building a House vs Renovating a Hospital

Tutorials are like building a house on an empty plot. You are the owner, architect, builder, and inspector. Want to knock down a wall? Do it. Want to add a room? Grab a hammer. No permits, no inspections, no neighbors affected.

Enterprise data engineering is like renovating a floor in a working hospital. The hospital is running 24/7. Patients are being treated (dashboards serving business users). Medications are being dispensed (pipelines processing data). Life support systems are humming (real-time feeds to regulatory bodies).

Now you want to “add a column to the patient records table.” You cannot just walk in with a drill. You need permits (Jira ticket), architecture review (will this affect the pharmacy system?), safety checks (will the backup power handle the construction load?), a scheduled window (do it at 2 AM when fewer patients are in surgery), and a rollback plan (if the renovation fails, can we restore the original wall in 30 minutes?).

That is why real data engineering is slow, deliberate, and heavily governed. The “hospital” never stops running.

Environment Separation: Dev, Test, Prod

┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│   DEV Workspace   │    │  TEST Workspace   │    │  PROD Workspace   │
│                    │    │                    │    │                    │
│ Data Engineers     │    │ QA / Testers       │    │ CI/CD Pipeline     │
│ create freely      │    │ validate changes   │    │ deploys approved   │
│ experiment         │    │ run regression     │    │ code only          │
│ break things       │    │ test with real     │    │                    │
│ use sample data    │    │ data (masked)      │    │ NO human writes    │
│                    │    │                    │    │ directly            │
│ Tables: 50+        │    │ Tables: same as    │    │ Tables: same as    │
│ (many experimental)│    │ prod (mirrored)    │    │ test (promoted)    │
└────────┬───────────┘    └────────┬───────────┘    └──────────────────┘
         │                         │
         │    Git PR + Review      │    Deployment Pipeline
         └────────────────────────→└───────────────────────→
                                        (automated, no manual access)

The golden rule: Nobody — not even the lead data engineer — creates tables directly in production. Production changes ONLY go through the CI/CD deployment pipeline after passing code review and testing.

In Dev: You CAN create tables freely. This is your sandbox. Create test_customer_v3, experiment, break things. Nobody cares. But even in Dev, teams often follow naming conventions and create tables via notebooks or SQL scripts (not manually via UI clicks) so the same scripts can be promoted to Test and Prod.

In Test: The schema mirrors Prod. You deploy your Dev changes here and run automated tests. QA validates that existing dashboards still work, row counts are reasonable, and no columns broke.

In Prod: Locked down. Read-only for humans. Only the deployment pipeline (Azure DevOps, GitHub Actions, Fabric Deployment Pipelines) can create, alter, or modify tables. If you need to fix something urgently, there is a formal “hotfix” process — not “let me just run this ALTER TABLE real quick.”

Who Creates Tables? Roles and Responsibilities

Role Can Create Tables? Where? Needs Approval?
Data Engineer Yes Dev workspace only For promotion to Test/Prod — yes
Senior/Lead DE Yes Dev, sometimes Test For Prod — yes (architect review)
Data Architect Designs, does not create Reviews all table designs Approves others’ designs
DBA (Database Admin) Yes (in traditional DW) All environments Final sign-off on schema changes
Data Analyst / BI Dev No (except personal sandbox) Read-only on Prod N/A
CI/CD Pipeline Yes (automated) Test and Prod Triggered by approved PR merge
Platform/Infra Team Creates workspaces, not tables All environments Manages capacity and access

Real-life analogy: Think of a restaurant kitchen. The chef (Data Engineer) can experiment with new dishes in the test kitchen (Dev). But putting a dish on the actual menu (Prod) requires the head chef’s approval (Architect), the restaurant manager’s sign-off (Project Lead), and a trial run during a slow shift (Test environment). A line cook (Analyst) can taste and give feedback but cannot change the menu.

The Change Management Workflow

Step-by-Step: From Jira Ticket to Production Table

STEP 1: REQUIREMENT
  Business says: "We need to track customer shipping preferences."
  Product Owner creates a Jira ticket: DE-1234

STEP 2: DESIGN
  Data Engineer writes a design document:
    - Table name: slv_crm_customer_shipping_pref
    - Columns: customer_id (INT), preference (VARCHAR), updated_date (DATE)
    - Source: CRM API daily extract
    - Target: Silver Lakehouse
    - Downstream consumers: Gold fact_orders table, Shipping dashboard

  Attaches the design to Jira ticket.

STEP 3: ARCHITECTURE REVIEW
  Data Architect reviews:
    ✓ Naming convention follows standards
    ✓ Data types are appropriate
    ✓ No duplicate of existing table
    ✓ Fits within the data model
    ✗ "Add a surrogate key column" — feedback given

  DE updates design, re-submits.

STEP 4: DEVELOPMENT (Dev workspace)
  DE writes a notebook/SQL script that creates the table:
    CREATE TABLE slv_crm_customer_shipping_pref (...)

  DE writes the pipeline that populates the table.
  DE tests in Dev with sample data.
  DE commits code to Git (feature branch).

STEP 5: PULL REQUEST + CODE REVIEW
  DE opens a PR: feature/DE-1234-shipping-preferences
  Peer engineer reviews the code.
  Lead engineer approves.
  PR is merged to the develop branch.

STEP 6: DEPLOY TO TEST
  CI/CD pipeline automatically deploys to Test workspace.
  QA runs regression tests:
    ✓ Table created with correct schema
    ✓ Pipeline loads data correctly
    ✓ Downstream Gold table still works
    ✓ Shipping dashboard renders correctly

STEP 7: DEPLOY TO PROD
  After QA sign-off, code is merged to main branch.
  CI/CD pipeline deploys to Prod workspace.
  Table is created in Prod.
  Pipeline is scheduled.

STEP 8: MONITORING
  First production run is monitored.
  Audit logging confirms data loaded.
  Dashboard team confirms new column appears.

TOTAL TIME: 3-7 business days (depending on review queue)
TOTAL APPROVALS: Architect + Peer + Lead + QA = 4 people

Compare this to a tutorial: spark.sql("CREATE TABLE shipping_prefs ...") — 5 seconds, zero approvals.

What Happens When a New File Arrives from a Client

This is where real-world data engineering diverges most from tutorials. In courses, data magically appears in a folder. In reality, receiving data from clients involves contracts, format negotiations, validation, and governance decisions.

Scenario 1: New CSV File, New Source System

SITUATION:
  Client (a retail company) wants to send you daily sales data.
  They will SFTP a CSV file every night at 11 PM.

WHAT HAPPENS — NOT what tutorials show:

Week 1: Planning (no code written yet)
  - Data engineer receives sample file from client
  - Reviews: column names, data types, file format, delimiter, encoding
  - Identifies: 45 columns, pipe-delimited, ISO-8859-1 encoding, no header row
  - Creates data mapping document:
      Source Column    Target Column       Type        Transformation
      col_1            customer_id         INT         None
      col_2            customer_name       VARCHAR     Trim, Title Case
      col_3            order_date          DATE        Convert from DD/MM/YYYY
      col_15           revenue             DECIMAL     Convert from string with commas
  - Architect reviews mapping + proposes table structure
  - Decision: Create 1 Bronze table (raw landing) + 1 Silver table (cleaned)

Week 2: Development
  - DE builds in Dev workspace:
      1. Landing zone: Files/incoming/retail_client/  (raw CSV lands here)
      2. Bronze table: brz_retail_daily_sales (raw, no transformation)
      3. Silver table: slv_retail_daily_sales (cleaned, typed, deduplicated)
      4. Pipeline: PL_Ingest_Retail_Sales
         - Copy Activity: SFTP → Landing zone
         - Notebook: Landing zone → Bronze (schema validation)
         - Notebook: Bronze → Silver (cleaning + transformation)
         - Audit logging at every step

Week 3: Testing
  - Client sends 3 days of test files
  - QA validates row counts, data types, edge cases
  - DE fixes: found that revenue column has "N/A" strings → handle in Silver
  - Regression: existing pipelines unaffected

Week 4: Go Live
  - Deploy to Prod via CI/CD
  - First production file arrives
  - Monitor first 5 days closely
  - Sign-off from business stakeholders

Scenario 2: Existing Source Adds a New Column

SITUATION:
  The retail client adds a "loyalty_points" column to their daily CSV.
  They notify you (hopefully) or the pipeline breaks (realistically).

OPTION A: Client Notified in Advance (Best Case)
  1. Client sends updated spec with new column
  2. DE creates Jira ticket: DE-1301 "Add loyalty_points to retail sales"
  3. Design review: add column to Bronze and Silver tables
  4. Dev: ALTER TABLE brz_retail_daily_sales ADD loyalty_points VARCHAR(50)
         ALTER TABLE slv_retail_daily_sales ADD loyalty_points INT
  5. Update pipeline notebook to handle the new column
  6. Test → Deploy → Monitor

OPTION B: Column Appears Without Warning (Common)
  1. Pipeline breaks: "Column count mismatch" or schema drift error
  2. DE investigates: "New column 'loyalty_points' not in schema"
  3. Emergency Jira ticket created
  4. Fix depends on platform:
     - Databricks AutoLoader with schema evolution: handles automatically
     - Fabric pipeline with fixed schema: manual ALTER TABLE needed
     - If Bronze is schemaless (Parquet/Delta with mergeSchema): no break
  5. This is why Bronze layer should be PERMISSIVE (accept everything)
     and Silver should be STRICT (validated schema)

LESSON: Design your Bronze layer to absorb schema changes gracefully.
        Design your Silver layer to reject bad data explicitly.

Scenario 3: Daily Automated File Drop

SITUATION:
  An API sends a JSON file to ADLS/OneLake every day at 3 AM.
  You already have Bronze and Silver tables set up.

WHAT HAPPENS DAILY (fully automated, no human involvement):

  3:00 AM — File lands: Files/incoming/api_data/2026-06-21/orders.json
  3:05 AM — Event trigger fires (file arrival detected)
  3:06 AM — Pipeline starts:
             Step 1: Validate file (not empty, correct format)
             Step 2: Load to Bronze table (append, preserve raw data)
             Step 3: Transform to Silver (clean, deduplicate, type-cast)
             Step 4: Update Gold (aggregate into fact/dim tables)
             Step 5: Log audit record (rows loaded, duration, status)
  3:25 AM — Pipeline completes
  3:30 AM — Semantic model refresh triggers
  7:00 AM — Business users open dashboards with fresh data

DO YOU CREATE NEW TABLES? No.
  - The tables already exist (created during initial setup)
  - The daily pipeline INSERTS/APPENDS into existing tables
  - New tables are only created when a NEW source system is onboarded

The only code that runs daily is the pipeline — not CREATE TABLE statements.

Scenario 4: One-Time Historical Data Load

SITUATION:
  Client sends a 50 GB ZIP file with 3 years of historical transaction data.
  This is a one-time load, not recurring.

DECISION: Do we create permanent tables for this?

OPTION A: Create Permanent Tables (if data will be queried regularly)
  - Create brz_historical_transactions and slv_historical_transactions
  - Follow full governance: Jira, design review, naming conventions
  - Load data once, tables stay for ongoing queries

OPTION B: Load into Existing Tables (if schema matches current daily data)
  - Backfill existing brz_retail_daily_sales and slv_retail_daily_sales
  - Use tumbling window pattern (process each month as a separate window)
  - No new tables needed — data joins naturally with daily data

OPTION C: Temporary Staging (if one-time analysis only)
  - Load into a STAGING table: stg_historical_analysis
  - Analyst runs their analysis
  - Table is dropped after 30 days (documented in Jira)
  - Even staging tables follow naming conventions

NEVER: Load 50 GB into a table called "test_data_final_2" with no documentation.

Do You Always Create a Table for Every File?

No. This is a critical concept that tutorials get wrong. Not every file becomes a table.

Scenario Create New Table? What Happens Instead
New source system onboarded Yes — Bronze + Silver + possibly Gold Full governance process (Jira → Design → Review → Deploy)
Daily file from existing source No — append to existing tables Pipeline inserts into existing Bronze/Silver tables
Source adds a new column No — ALTER existing table Schema evolution or ALTER TABLE via change request
One-time historical backfill Maybe — depends on reuse Backfill existing tables or create temporary staging
Ad-hoc analysis request No — use a view or temp table Analyst queries existing tables with SQL views
New business KPI needed Maybe — new Gold table or view Often a new view on existing Gold tables suffices
Testing / experimentation Yes — in Dev only Create freely in Dev, never promote throwaway tables

Real-life analogy: A hospital does not create a new patient record system every time a patient visits. The system exists. New patients are added (INSERT). Existing patients update their address (ALTER). The system itself (table structure) changes rarely — and only through a formal process.

The Medallion Architecture in Production

In tutorials, Bronze-Silver-Gold is presented as three folders where you freely create tables. In production, each layer has different governance rules:

Bronze Layer Governance

Purpose: Raw data landing zone — preserve source data exactly as received

Who creates Bronze tables: Data Engineers (with standard naming)
Governance level: MODERATE
  - Naming: brz_{source}_{entity}  (e.g., brz_crm_customer, brz_sftp_sales)
  - Schema: PERMISSIVE — accept new columns, use mergeSchema
  - Data: APPEND-ONLY — never overwrite or delete raw data
  - Retention: Keep for 90-365 days depending on policy

Why moderate (not strict)?
  Bronze is the safety net. If Silver transformation has a bug,
  you re-process from Bronze. If the source schema changes,
  Bronze absorbed it without breaking. Bronze tables change rarely
  after initial creation.

Silver Layer Governance

Purpose: Cleaned, validated, business-ready data

Who creates Silver tables: Data Engineers (architect-reviewed design)
Governance level: STRICT
  - Naming: slv_{source}_{entity}  (e.g., slv_crm_customer_clean)
  - Schema: ENFORCED — explicit data types, NOT NULL constraints, validated
  - Data quality: NULL checks, deduplication, type casting, standardization
  - Changes: Any schema change requires Jira + impact analysis

Why strict?
  Silver tables are consumed by Gold layer AND directly by analysts.
  A wrong data type in Silver propagates to every downstream consumer.
  If Silver breaks, dashboards show wrong numbers.

Gold Layer Governance

Purpose: Business-facing aggregations, dimensions, facts

Who creates Gold tables: Senior DEs + Data Architects
Governance level: VERY STRICT
  - Naming: dim_{entity}, fact_{process}  (e.g., dim_customer, fact_sales)
  - Schema: LOCKED — changes require business stakeholder approval
  - Testing: Unit tests, data quality checks, row count validations
  - Consumers: Power BI dashboards, executive reports, regulatory filings
  - Changes: Full change management + regression testing + stakeholder sign-off

Why very strict?
  Gold tables power the CEO's dashboard. A wrong number in fact_sales
  means a wrong revenue figure in the board presentation. Gold changes
  are treated like production database changes in traditional software.

Naming Conventions and Standards

Table Naming Patterns

Layer Pattern Example
Bronze brz_{source}_{entity} brz_crm_customer, brz_sftp_daily_sales
Silver slv_{source}_{entity} slv_crm_customer_clean, slv_erp_orders
Gold dim_{entity} / fact_{process} dim_customer, fact_sales, agg_monthly_revenue
Staging stg_{purpose} stg_migration_2026, stg_reconciliation

The prefix tells everyone — instantly — which layer a table belongs to and how sensitive it is.

Column Naming Standards

Standard conventions (vary by organization):
  snake_case:          customer_id, order_date, total_amount    ✓ Most common
  camelCase:           customerId, orderDate, totalAmount       ✓ Some organizations
  PascalCase:          CustomerId, OrderDate, TotalAmount       ✓ SQL Server tradition

Common column suffixes:
  _id                  Foreign/primary keys        customer_id, order_id
  _date / _dt          Date columns                order_date, created_dt
  _ts / _tms           Timestamps                  created_tms, updated_ts
  _flag / _ind         Boolean indicators           is_active_flag, deleted_ind
  _amt                 Monetary amounts             total_amt, discount_amt
  _cnt / _count        Counts                       order_cnt, item_count
  _desc                Descriptions                 product_desc, status_desc
  _code / _cd          Codes                        country_cd, currency_code

NEVER:
  column1, column2, temp, test, data, value, field1
  (meaningless names that tell you nothing)

Schema and Namespace Organization

Fabric (Lakehouse):
  bronze_lakehouse.dbo.brz_crm_customer
  silver_lakehouse.dbo.slv_crm_customer_clean
  gold_lakehouse.dbo.dim_customer

Fabric (Warehouse):
  gold_warehouse.sales.fact_daily_sales      ← schema = business domain
  gold_warehouse.finance.fact_revenue
  gold_warehouse.hr.dim_employee

Databricks (Unity Catalog):
  catalog.bronze.crm_customer
  catalog.silver.crm_customer_clean
  catalog.gold.dim_customer

Governance in Fabric

Workspace Separation

Enterprise Fabric setup:

  WS_Dev_DataPlatform
    ├── LH_Bronze_Dev        (Data Engineers: Contributor)
    ├── LH_Silver_Dev        (Data Engineers: Contributor)
    ├── LH_Gold_Dev          (Data Engineers: Contributor)
    ├── WH_Gold_Dev          (Data Engineers: Contributor)
    └── Notebooks, Pipelines (full access to experiment)

  WS_Test_DataPlatform
    ├── LH_Bronze_Test       (QA: Viewer, CI/CD: Contributor)
    ├── LH_Silver_Test
    ├── LH_Gold_Test
    └── Deployed via Fabric Deployment Pipelines

  WS_Prod_DataPlatform
    ├── LH_Bronze_Prod       (Everyone: Viewer, CI/CD: Contributor)
    ├── LH_Silver_Prod       (Analysts: Viewer via SQL endpoint)
    ├── LH_Gold_Prod         (BI Team: Viewer, connects Power BI)
    └── NO direct human writes — only CI/CD pipeline

Deployment Pipelines (Dev → Test → Prod)

Fabric Deployment Pipelines promote items across workspaces:

Dev Workspace ──→ Test Workspace ──→ Prod Workspace
  (click Deploy)    (click Deploy)

What gets promoted:
  ✓ Pipelines          ✓ Notebooks        ✓ Semantic Models
  ✓ Dataflows Gen2     ✓ Spark Job Defs

What does NOT get promoted:
  ✗ Data (tables, files) — each environment has its own data
  ✗ Lakehouse/Warehouse items — re-created via notebooks/SQL scripts

This means: your CREATE TABLE statement lives in a notebook or SQL script
that IS promoted. When it runs in Test, it creates the table in Test.
When it runs in Prod, it creates the table in Prod. Same code, different data.

Git Integration for Table Definitions

All table creation code lives in Git:
  /notebooks/
    01_create_bronze_tables.py      ← CREATE TABLE statements
    02_create_silver_tables.py
    03_create_gold_tables.py
  /pipelines/
    PL_Daily_ETL.json               ← Pipeline definitions
  /sql/
    create_warehouse_tables.sql     ← Warehouse DDL

Pull Request flow:
  1. DE creates branch: feature/DE-1234-shipping-pref
  2. Adds CREATE TABLE to 02_create_silver_tables.py
  3. Opens PR → Peer reviews → Lead approves → Merge
  4. Deployment pipeline picks up the change
  5. Table is created in Test (automated), then Prod (after QA sign-off)

Fabric Workspace Roles

Role Can Create Tables? Can Read Data? Typical User
Admin Yes Yes Platform team, workspace owner
Member Yes Yes Data Engineers (Dev workspace)
Contributor Yes (items, not workspace settings) Yes CI/CD service principal
Viewer No Yes (read-only) Analysts, BI developers, business users

In production workspaces, Data Engineers are typically Viewers (read-only). Only the CI/CD service principal is a Contributor. This prevents accidental changes.

Governance in Databricks

Unity Catalog and Three-Level Namespace

Unity Catalog enforces governance through a three-level namespace:

  Metastore (organization-level)
    └── Catalog (environment or domain)
          └── Schema (layer or subject area)
                └── Table

  Example:
    dev_catalog.bronze.crm_customer          ← Dev, Bronze, CRM source
    prod_catalog.silver.crm_customer_clean   ← Prod, Silver, cleaned
    prod_catalog.gold.dim_customer           ← Prod, Gold, dimension

  Access control:
    GRANT SELECT ON SCHEMA prod_catalog.gold TO analysts;
    GRANT ALL PRIVILEGES ON SCHEMA dev_catalog.bronze TO data_engineers;
    -- Analysts can READ Gold but cannot CREATE or ALTER anything
    -- Data Engineers have full access in Dev but READ-ONLY in Prod

Permissions and Access Control

Typical Databricks permission setup:

  Data Engineers:
    Dev catalog: CREATE TABLE, ALTER, INSERT, SELECT, DELETE
    Prod catalog: SELECT only (read-only)

  Data Analysts:
    All catalogs: SELECT only on Silver and Gold schemas
    No access to Bronze (raw data may contain PII)

  CI/CD Service Principal:
    Prod catalog: CREATE TABLE, ALTER, INSERT (automated deployments)

  Data Architect:
    All catalogs: SELECT + ability to review and approve PRs (not direct access)

The Approval Chain: Who Approves What

Change Type Approvers Typical Time
New Bronze table (new source) Peer DE + Lead DE 1-2 days
New Silver table Peer DE + Lead DE + Architect 2-3 days
New Gold table (dim/fact) Peer DE + Lead DE + Architect + Business Stakeholder 3-5 days
ALTER column (add/rename) Peer DE + Lead DE + Impact Analysis 1-3 days
DROP column Lead DE + Architect + all downstream owners 3-7 days
DROP table Architect + Business Owner + documented deprecation 5-30 days (deprecation period)
Hotfix (emergency) Lead DE (verbal) + retroactive PR Hours (but documented after)

Documentation Requirements

What Data Dictionaries Look Like

Every production table should have an entry in the data dictionary — a central document (Confluence, SharePoint, or a metadata tool like Atlan/Alation) that describes every table and column:

TABLE: slv_crm_customer_clean
  Layer:        Silver
  Source:       CRM system (Salesforce) → Bronze: brz_crm_customer
  Owner:        Data Engineering team (Naveen Vuppula)
  Created:      2026-03-15 (Jira: DE-1102)
  Last altered: 2026-06-01 (Jira: DE-1234 — added loyalty_points column)
  Refresh:      Daily at 6:30 AM via PL_CRM_Bronze_to_Silver
  Consumers:    Gold: dim_customer, Dashboard: Customer 360

  COLUMNS:
  Column Name        Type        Nullable    Description
  ──────────         ────        ────────    ───────────
  customer_id        INT         No          Unique customer identifier from CRM
  first_name         VARCHAR(100) No         Customer first name (Title Case)
  last_name          VARCHAR(100) No         Customer last name (Title Case)
  email              VARCHAR(200) Yes        Email address (lowercase, validated)
  city               VARCHAR(100) No         Current city (standardized)
  country            VARCHAR(50)  No         Country (ISO code)
  loyalty_points     INT          Yes        Loyalty program points (added DE-1234)
  is_active          BIT          No         1 = active customer, 0 = churned
  created_date       DATE         No         Date customer was first created in CRM
  updated_tms        DATETIME     Yes        Last update timestamp from pipeline

Why this matters: When a new analyst joins the team, they do not ask “what does this column mean?” They look it up. When a DE needs to alter a column, the data dictionary shows who consumes the table (impact analysis). When auditors ask “where does this number come from,” the lineage is documented.

Why Tutorials Skip All This

  1. Governance is not exciting — “CREATE TABLE” gets clicks. “How to write a Jira ticket for ALTER TABLE” does not.
  2. Single-user environment — tutorials use one workspace, one account, full admin. No need for roles or permissions.
  3. No downstream consumers — in a tutorial, nobody’s dashboard breaks if you drop a column. In production, 50 people notice.
  4. No compliance requirements — tutorials do not worry about GDPR, SOX, HIPAA, or audit trails.
  5. Time constraints — a 2-hour course cannot spend 30 minutes on change management. They skip to the technical content.

This is not a criticism of tutorials — they serve their purpose (teaching the technology). But the gap between “I can CREATE TABLE in a notebook” and “I can deliver a production-grade table in an enterprise” is where real data engineering skill lives.

Common Mistakes New Engineers Make

  1. Creating tables directly in production — “I will just run this CREATE TABLE real quick.” Never. Use the CI/CD pipeline. Even for “small” changes.

  2. Naming tables without conventionscustomer_data_v2_final_backup_REAL tells nobody what layer, source, or purpose this table serves. Follow the standard: slv_crm_customer_clean.

  3. Not creating a Jira ticket before starting work — “I will document it later” means it never gets documented. Create the ticket FIRST, reference it in your PR, and link it to the deployment.

  4. Skipping impact analysis for ALTER TABLE — adding a column is usually safe. Renaming or dropping a column breaks everything downstream. Always check: who reads from this table?

  5. Not handling schema evolution in Bronze — if your Bronze pipeline breaks when a source adds a column, your architecture is fragile. Use mergeSchema in Delta or AutoLoader’s schema evolution to absorb changes gracefully.

  6. Creating tables via UI clicks instead of code — clicking “New Table” in the Lakehouse UI works in Dev, but that action cannot be promoted via CI/CD. Always use SQL scripts or notebooks that are version-controlled in Git.

  7. Treating Dev like Prod — “My Dev table works, so just copy it to Prod.” Dev often has sample data, different permissions, and experimental code. Promote through the proper pipeline: Dev → Test → Prod.

Interview Questions

Q: How do you manage table creation in a production environment? A: All table definitions live in version-controlled SQL scripts or notebooks in Git. A Data Engineer creates the table in the Dev workspace, opens a pull request, gets peer and architect review, and merges to the main branch. A CI/CD pipeline (Fabric Deployment Pipelines, Azure DevOps, or GitHub Actions) deploys the change to Test for validation, then to Prod after QA sign-off. Nobody creates tables directly in production.

Q: What happens when a source system adds a new column to a file? A: In a well-designed pipeline, the Bronze layer absorbs it gracefully using schema evolution (Delta mergeSchema or AutoLoader schema evolution). The new column appears in Bronze automatically. For Silver and Gold, a change request is needed: add the column to the Silver transformation, update the data dictionary, test the impact on downstream tables, and deploy through CI/CD.

Q: Why are naming conventions important for tables? A: Naming conventions like brz_crm_customer instantly tell you the layer (Bronze), source (CRM), and entity (customer). Without conventions, a team of 10 engineers creates tables like customers2, cust_data_final, test_cust_backup — nobody knows which is authoritative. Conventions enable self-documenting schemas and make impact analysis possible.

Q: How do you handle Dev vs Test vs Prod environments in Fabric? A: Create separate workspaces for each environment (WS_Dev, WS_Test, WS_Prod). Data Engineers have Contributor access in Dev and Viewer (read-only) in Prod. Only a CI/CD service principal has write access to Prod. Table creation code lives in notebooks or SQL scripts that are promoted via Fabric Deployment Pipelines from Dev to Test to Prod.

Q: Do you create a new table every time you receive a file from a client? A: No. Tables are created during initial source onboarding (one-time setup). Daily files are loaded into existing Bronze and Silver tables via pipelines. New tables are only created when a completely new source system or entity is onboarded. Schema changes to existing files are handled via ALTER TABLE through the change management process.

Q: What is the difference between Bronze, Silver, and Gold governance? A: Bronze is moderate — permissive schemas, append-only, DE-created with standard naming. Silver is strict — enforced schemas, data quality checks, architect-reviewed designs. Gold is very strict — locked schemas, business stakeholder approval, regression-tested, because Gold powers executive dashboards and regulatory reports.

Wrapping Up

The biggest skill gap in data engineering is not technical — it is understanding how real organizations operate. Anyone can write CREATE TABLE. The real skill is knowing WHEN to create it, WHO needs to approve it, HOW to deploy it safely, and WHAT documentation to leave behind.

Tutorials teach you the syntax. Enterprise projects teach you the process. This post bridges the gap so that when you walk into your first (or next) enterprise data engineering role, you are not surprised when a simple ALTER TABLE takes 3 days and 4 approvals. That process is not bureaucracy — it is what separates a working data platform from chaos.

Remember: in production, the hospital never stops running. Every change is a surgery — planned, reviewed, tested, and executed with precision.

Related posts:The Medallion ArchitectureFabric Git Integration & CI/CDFabric Security & GovernanceUnity Catalog Deep DiveHow Real Companies Receive Data


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top
Share via
Copy link