Table Creation and Governance in Enterprise Data Engineering: Why Real Projects Are Nothing Like Tutorials — Dev/Test/Prod, Change Management, File Ingestion Patterns, and Production Standards
Open any data engineering course on Udemy or YouTube. Within 10 minutes, the instructor creates a Lakehouse, opens a notebook, and runs spark.sql("CREATE TABLE customers ..."). Five minutes later, they create another. And another. Bronze table, Silver table, Gold table — created like sticky notes on a whiteboard. No approvals, no tickets, no reviews.
Now walk into an enterprise like Capgemini, TCS, Infosys, Deloitte, or any Fortune 500 company. Try creating a table in production. You will need a Jira ticket, an architect’s review, a naming convention check, a DBA sign-off, a deployment through CI/CD from Dev to Test to Prod, and probably two managers approving the pull request. Altering a single column? Change request, impact analysis, downstream dependency check, regression testing, and a deployment window.
This gap — between tutorial-world and real-world — is the number one shock for new data engineers. Nobody teaches this. No course covers it. But it is the reality of every production data platform.
This post bridges that gap. How tables are actually created in enterprise data engineering, who has permission to do what, what happens when a new file arrives from a client, and why the Medallion Architecture in production looks nothing like the version in your Udemy course.
Table of Contents
- The Tutorial World vs Enterprise Reality
- The Analogy: Building a House vs Renovating a Hospital
- Environment Separation: Dev, Test, Prod
- Who Creates Tables? Roles and Responsibilities
- The Change Management Workflow
- Step-by-Step: From Jira Ticket to Production Table
- What Happens When a New File Arrives from a Client
- Scenario 1: New CSV File, New Source System
- Scenario 2: Existing Source Adds a New Column
- Scenario 3: Daily Automated File Drop
- Scenario 4: One-Time Historical Data Load
- Do You Always Create a Table for Every File?
- The Medallion Architecture in Production
- Bronze Layer Governance
- Silver Layer Governance
- Gold Layer Governance
- Naming Conventions and Standards
- Table Naming Patterns
- Column Naming Standards
- Schema and Namespace Organization
- Governance in Fabric
- Workspace Separation
- Deployment Pipelines (Dev → Test → Prod)
- Git Integration for Table Definitions
- Fabric Workspace Roles
- Governance in Databricks
- Unity Catalog and Three-Level Namespace
- Permissions and Access Control
- The Approval Chain: Who Approves What
- Documentation Requirements
- What Data Dictionaries Look Like
- Why Tutorials Skip All This
- Common Mistakes New Engineers Make
- Interview Questions
- Wrapping Up
The Tutorial World vs Enterprise Reality
| Aspect | Tutorial / Course | Enterprise Production |
|---|---|---|
| Who creates tables? | You — the student | Defined roles (DE creates, Architect reviews, DBA approves) |
| Where? | One workspace, one environment | Dev → Test → Prod (separate workspaces) |
| Approvals needed? | None — just click Run | Jira ticket, PR review, architect sign-off, DBA approval |
| Naming conventions? | Whatever you want — customers2, test_final_v3 |
Strict standards — brz_crm_customer, slv_crm_customer_clean |
| Altering a column? | Run ALTER TABLE, done | Change request, impact analysis, downstream check, deployment |
| Dropping a table? | DROP TABLE, gone | Almost never. Deprecate, archive, then remove after 90 days |
| Time to create a table? | 30 seconds | 1-5 business days (with all approvals) |
| Access to production? | Full admin — you own everything | Read-only for most. Only CI/CD pipelines write to prod. |
This is not bureaucracy for the sake of bureaucracy. A single wrong ALTER TABLE in production can break 15 Power BI dashboards, 3 downstream pipelines, and 2 regulatory reports. The governance exists because the cost of a mistake in production is measured in hours of downtime, lost revenue, and compliance violations — not in “I will just recreate the table.”
The Analogy: Building a House vs Renovating a Hospital
Tutorials are like building a house on an empty plot. You are the owner, architect, builder, and inspector. Want to knock down a wall? Do it. Want to add a room? Grab a hammer. No permits, no inspections, no neighbors affected.
Enterprise data engineering is like renovating a floor in a working hospital. The hospital is running 24/7. Patients are being treated (dashboards serving business users). Medications are being dispensed (pipelines processing data). Life support systems are humming (real-time feeds to regulatory bodies).
Now you want to “add a column to the patient records table.” You cannot just walk in with a drill. You need permits (Jira ticket), architecture review (will this affect the pharmacy system?), safety checks (will the backup power handle the construction load?), a scheduled window (do it at 2 AM when fewer patients are in surgery), and a rollback plan (if the renovation fails, can we restore the original wall in 30 minutes?).
That is why real data engineering is slow, deliberate, and heavily governed. The “hospital” never stops running.
Environment Separation: Dev, Test, Prod
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ DEV Workspace │ │ TEST Workspace │ │ PROD Workspace │
│ │ │ │ │ │
│ Data Engineers │ │ QA / Testers │ │ CI/CD Pipeline │
│ create freely │ │ validate changes │ │ deploys approved │
│ experiment │ │ run regression │ │ code only │
│ break things │ │ test with real │ │ │
│ use sample data │ │ data (masked) │ │ NO human writes │
│ │ │ │ │ directly │
│ Tables: 50+ │ │ Tables: same as │ │ Tables: same as │
│ (many experimental)│ │ prod (mirrored) │ │ test (promoted) │
└────────┬───────────┘ └────────┬───────────┘ └──────────────────┘
│ │
│ Git PR + Review │ Deployment Pipeline
└────────────────────────→└───────────────────────→
(automated, no manual access)
The golden rule: Nobody — not even the lead data engineer — creates tables directly in production. Production changes ONLY go through the CI/CD deployment pipeline after passing code review and testing.
In Dev: You CAN create tables freely. This is your sandbox. Create test_customer_v3, experiment, break things. Nobody cares. But even in Dev, teams often follow naming conventions and create tables via notebooks or SQL scripts (not manually via UI clicks) so the same scripts can be promoted to Test and Prod.
In Test: The schema mirrors Prod. You deploy your Dev changes here and run automated tests. QA validates that existing dashboards still work, row counts are reasonable, and no columns broke.
In Prod: Locked down. Read-only for humans. Only the deployment pipeline (Azure DevOps, GitHub Actions, Fabric Deployment Pipelines) can create, alter, or modify tables. If you need to fix something urgently, there is a formal “hotfix” process — not “let me just run this ALTER TABLE real quick.”
Who Creates Tables? Roles and Responsibilities
| Role | Can Create Tables? | Where? | Needs Approval? |
|---|---|---|---|
| Data Engineer | Yes | Dev workspace only | For promotion to Test/Prod — yes |
| Senior/Lead DE | Yes | Dev, sometimes Test | For Prod — yes (architect review) |
| Data Architect | Designs, does not create | Reviews all table designs | Approves others’ designs |
| DBA (Database Admin) | Yes (in traditional DW) | All environments | Final sign-off on schema changes |
| Data Analyst / BI Dev | No (except personal sandbox) | Read-only on Prod | N/A |
| CI/CD Pipeline | Yes (automated) | Test and Prod | Triggered by approved PR merge |
| Platform/Infra Team | Creates workspaces, not tables | All environments | Manages capacity and access |
Real-life analogy: Think of a restaurant kitchen. The chef (Data Engineer) can experiment with new dishes in the test kitchen (Dev). But putting a dish on the actual menu (Prod) requires the head chef’s approval (Architect), the restaurant manager’s sign-off (Project Lead), and a trial run during a slow shift (Test environment). A line cook (Analyst) can taste and give feedback but cannot change the menu.
The Change Management Workflow
Step-by-Step: From Jira Ticket to Production Table
STEP 1: REQUIREMENT
Business says: "We need to track customer shipping preferences."
Product Owner creates a Jira ticket: DE-1234
STEP 2: DESIGN
Data Engineer writes a design document:
- Table name: slv_crm_customer_shipping_pref
- Columns: customer_id (INT), preference (VARCHAR), updated_date (DATE)
- Source: CRM API daily extract
- Target: Silver Lakehouse
- Downstream consumers: Gold fact_orders table, Shipping dashboard
Attaches the design to Jira ticket.
STEP 3: ARCHITECTURE REVIEW
Data Architect reviews:
✓ Naming convention follows standards
✓ Data types are appropriate
✓ No duplicate of existing table
✓ Fits within the data model
✗ "Add a surrogate key column" — feedback given
DE updates design, re-submits.
STEP 4: DEVELOPMENT (Dev workspace)
DE writes a notebook/SQL script that creates the table:
CREATE TABLE slv_crm_customer_shipping_pref (...)
DE writes the pipeline that populates the table.
DE tests in Dev with sample data.
DE commits code to Git (feature branch).
STEP 5: PULL REQUEST + CODE REVIEW
DE opens a PR: feature/DE-1234-shipping-preferences
Peer engineer reviews the code.
Lead engineer approves.
PR is merged to the develop branch.
STEP 6: DEPLOY TO TEST
CI/CD pipeline automatically deploys to Test workspace.
QA runs regression tests:
✓ Table created with correct schema
✓ Pipeline loads data correctly
✓ Downstream Gold table still works
✓ Shipping dashboard renders correctly
STEP 7: DEPLOY TO PROD
After QA sign-off, code is merged to main branch.
CI/CD pipeline deploys to Prod workspace.
Table is created in Prod.
Pipeline is scheduled.
STEP 8: MONITORING
First production run is monitored.
Audit logging confirms data loaded.
Dashboard team confirms new column appears.
TOTAL TIME: 3-7 business days (depending on review queue)
TOTAL APPROVALS: Architect + Peer + Lead + QA = 4 people
Compare this to a tutorial: spark.sql("CREATE TABLE shipping_prefs ...") — 5 seconds, zero approvals.
What Happens When a New File Arrives from a Client
This is where real-world data engineering diverges most from tutorials. In courses, data magically appears in a folder. In reality, receiving data from clients involves contracts, format negotiations, validation, and governance decisions.
Scenario 1: New CSV File, New Source System
SITUATION:
Client (a retail company) wants to send you daily sales data.
They will SFTP a CSV file every night at 11 PM.
WHAT HAPPENS — NOT what tutorials show:
Week 1: Planning (no code written yet)
- Data engineer receives sample file from client
- Reviews: column names, data types, file format, delimiter, encoding
- Identifies: 45 columns, pipe-delimited, ISO-8859-1 encoding, no header row
- Creates data mapping document:
Source Column Target Column Type Transformation
col_1 customer_id INT None
col_2 customer_name VARCHAR Trim, Title Case
col_3 order_date DATE Convert from DD/MM/YYYY
col_15 revenue DECIMAL Convert from string with commas
- Architect reviews mapping + proposes table structure
- Decision: Create 1 Bronze table (raw landing) + 1 Silver table (cleaned)
Week 2: Development
- DE builds in Dev workspace:
1. Landing zone: Files/incoming/retail_client/ (raw CSV lands here)
2. Bronze table: brz_retail_daily_sales (raw, no transformation)
3. Silver table: slv_retail_daily_sales (cleaned, typed, deduplicated)
4. Pipeline: PL_Ingest_Retail_Sales
- Copy Activity: SFTP → Landing zone
- Notebook: Landing zone → Bronze (schema validation)
- Notebook: Bronze → Silver (cleaning + transformation)
- Audit logging at every step
Week 3: Testing
- Client sends 3 days of test files
- QA validates row counts, data types, edge cases
- DE fixes: found that revenue column has "N/A" strings → handle in Silver
- Regression: existing pipelines unaffected
Week 4: Go Live
- Deploy to Prod via CI/CD
- First production file arrives
- Monitor first 5 days closely
- Sign-off from business stakeholders
Scenario 2: Existing Source Adds a New Column
SITUATION:
The retail client adds a "loyalty_points" column to their daily CSV.
They notify you (hopefully) or the pipeline breaks (realistically).
OPTION A: Client Notified in Advance (Best Case)
1. Client sends updated spec with new column
2. DE creates Jira ticket: DE-1301 "Add loyalty_points to retail sales"
3. Design review: add column to Bronze and Silver tables
4. Dev: ALTER TABLE brz_retail_daily_sales ADD loyalty_points VARCHAR(50)
ALTER TABLE slv_retail_daily_sales ADD loyalty_points INT
5. Update pipeline notebook to handle the new column
6. Test → Deploy → Monitor
OPTION B: Column Appears Without Warning (Common)
1. Pipeline breaks: "Column count mismatch" or schema drift error
2. DE investigates: "New column 'loyalty_points' not in schema"
3. Emergency Jira ticket created
4. Fix depends on platform:
- Databricks AutoLoader with schema evolution: handles automatically
- Fabric pipeline with fixed schema: manual ALTER TABLE needed
- If Bronze is schemaless (Parquet/Delta with mergeSchema): no break
5. This is why Bronze layer should be PERMISSIVE (accept everything)
and Silver should be STRICT (validated schema)
LESSON: Design your Bronze layer to absorb schema changes gracefully.
Design your Silver layer to reject bad data explicitly.
Scenario 3: Daily Automated File Drop
SITUATION:
An API sends a JSON file to ADLS/OneLake every day at 3 AM.
You already have Bronze and Silver tables set up.
WHAT HAPPENS DAILY (fully automated, no human involvement):
3:00 AM — File lands: Files/incoming/api_data/2026-06-21/orders.json
3:05 AM — Event trigger fires (file arrival detected)
3:06 AM — Pipeline starts:
Step 1: Validate file (not empty, correct format)
Step 2: Load to Bronze table (append, preserve raw data)
Step 3: Transform to Silver (clean, deduplicate, type-cast)
Step 4: Update Gold (aggregate into fact/dim tables)
Step 5: Log audit record (rows loaded, duration, status)
3:25 AM — Pipeline completes
3:30 AM — Semantic model refresh triggers
7:00 AM — Business users open dashboards with fresh data
DO YOU CREATE NEW TABLES? No.
- The tables already exist (created during initial setup)
- The daily pipeline INSERTS/APPENDS into existing tables
- New tables are only created when a NEW source system is onboarded
The only code that runs daily is the pipeline — not CREATE TABLE statements.
Scenario 4: One-Time Historical Data Load
SITUATION:
Client sends a 50 GB ZIP file with 3 years of historical transaction data.
This is a one-time load, not recurring.
DECISION: Do we create permanent tables for this?
OPTION A: Create Permanent Tables (if data will be queried regularly)
- Create brz_historical_transactions and slv_historical_transactions
- Follow full governance: Jira, design review, naming conventions
- Load data once, tables stay for ongoing queries
OPTION B: Load into Existing Tables (if schema matches current daily data)
- Backfill existing brz_retail_daily_sales and slv_retail_daily_sales
- Use tumbling window pattern (process each month as a separate window)
- No new tables needed — data joins naturally with daily data
OPTION C: Temporary Staging (if one-time analysis only)
- Load into a STAGING table: stg_historical_analysis
- Analyst runs their analysis
- Table is dropped after 30 days (documented in Jira)
- Even staging tables follow naming conventions
NEVER: Load 50 GB into a table called "test_data_final_2" with no documentation.
Do You Always Create a Table for Every File?
No. This is a critical concept that tutorials get wrong. Not every file becomes a table.
| Scenario | Create New Table? | What Happens Instead |
|---|---|---|
| New source system onboarded | Yes — Bronze + Silver + possibly Gold | Full governance process (Jira → Design → Review → Deploy) |
| Daily file from existing source | No — append to existing tables | Pipeline inserts into existing Bronze/Silver tables |
| Source adds a new column | No — ALTER existing table | Schema evolution or ALTER TABLE via change request |
| One-time historical backfill | Maybe — depends on reuse | Backfill existing tables or create temporary staging |
| Ad-hoc analysis request | No — use a view or temp table | Analyst queries existing tables with SQL views |
| New business KPI needed | Maybe — new Gold table or view | Often a new view on existing Gold tables suffices |
| Testing / experimentation | Yes — in Dev only | Create freely in Dev, never promote throwaway tables |
Real-life analogy: A hospital does not create a new patient record system every time a patient visits. The system exists. New patients are added (INSERT). Existing patients update their address (ALTER). The system itself (table structure) changes rarely — and only through a formal process.
The Medallion Architecture in Production
In tutorials, Bronze-Silver-Gold is presented as three folders where you freely create tables. In production, each layer has different governance rules:
Bronze Layer Governance
Purpose: Raw data landing zone — preserve source data exactly as received
Who creates Bronze tables: Data Engineers (with standard naming)
Governance level: MODERATE
- Naming: brz_{source}_{entity} (e.g., brz_crm_customer, brz_sftp_sales)
- Schema: PERMISSIVE — accept new columns, use mergeSchema
- Data: APPEND-ONLY — never overwrite or delete raw data
- Retention: Keep for 90-365 days depending on policy
Why moderate (not strict)?
Bronze is the safety net. If Silver transformation has a bug,
you re-process from Bronze. If the source schema changes,
Bronze absorbed it without breaking. Bronze tables change rarely
after initial creation.
Silver Layer Governance
Purpose: Cleaned, validated, business-ready data
Who creates Silver tables: Data Engineers (architect-reviewed design)
Governance level: STRICT
- Naming: slv_{source}_{entity} (e.g., slv_crm_customer_clean)
- Schema: ENFORCED — explicit data types, NOT NULL constraints, validated
- Data quality: NULL checks, deduplication, type casting, standardization
- Changes: Any schema change requires Jira + impact analysis
Why strict?
Silver tables are consumed by Gold layer AND directly by analysts.
A wrong data type in Silver propagates to every downstream consumer.
If Silver breaks, dashboards show wrong numbers.
Gold Layer Governance
Purpose: Business-facing aggregations, dimensions, facts
Who creates Gold tables: Senior DEs + Data Architects
Governance level: VERY STRICT
- Naming: dim_{entity}, fact_{process} (e.g., dim_customer, fact_sales)
- Schema: LOCKED — changes require business stakeholder approval
- Testing: Unit tests, data quality checks, row count validations
- Consumers: Power BI dashboards, executive reports, regulatory filings
- Changes: Full change management + regression testing + stakeholder sign-off
Why very strict?
Gold tables power the CEO's dashboard. A wrong number in fact_sales
means a wrong revenue figure in the board presentation. Gold changes
are treated like production database changes in traditional software.
Naming Conventions and Standards
Table Naming Patterns
| Layer | Pattern | Example |
|---|---|---|
| Bronze | brz_{source}_{entity} |
brz_crm_customer, brz_sftp_daily_sales |
| Silver | slv_{source}_{entity} |
slv_crm_customer_clean, slv_erp_orders |
| Gold | dim_{entity} / fact_{process} |
dim_customer, fact_sales, agg_monthly_revenue |
| Staging | stg_{purpose} |
stg_migration_2026, stg_reconciliation |
The prefix tells everyone — instantly — which layer a table belongs to and how sensitive it is.
Column Naming Standards
Standard conventions (vary by organization):
snake_case: customer_id, order_date, total_amount ✓ Most common
camelCase: customerId, orderDate, totalAmount ✓ Some organizations
PascalCase: CustomerId, OrderDate, TotalAmount ✓ SQL Server tradition
Common column suffixes:
_id Foreign/primary keys customer_id, order_id
_date / _dt Date columns order_date, created_dt
_ts / _tms Timestamps created_tms, updated_ts
_flag / _ind Boolean indicators is_active_flag, deleted_ind
_amt Monetary amounts total_amt, discount_amt
_cnt / _count Counts order_cnt, item_count
_desc Descriptions product_desc, status_desc
_code / _cd Codes country_cd, currency_code
NEVER:
column1, column2, temp, test, data, value, field1
(meaningless names that tell you nothing)
Schema and Namespace Organization
Fabric (Lakehouse):
bronze_lakehouse.dbo.brz_crm_customer
silver_lakehouse.dbo.slv_crm_customer_clean
gold_lakehouse.dbo.dim_customer
Fabric (Warehouse):
gold_warehouse.sales.fact_daily_sales ← schema = business domain
gold_warehouse.finance.fact_revenue
gold_warehouse.hr.dim_employee
Databricks (Unity Catalog):
catalog.bronze.crm_customer
catalog.silver.crm_customer_clean
catalog.gold.dim_customer
Governance in Fabric
Workspace Separation
Enterprise Fabric setup:
WS_Dev_DataPlatform
├── LH_Bronze_Dev (Data Engineers: Contributor)
├── LH_Silver_Dev (Data Engineers: Contributor)
├── LH_Gold_Dev (Data Engineers: Contributor)
├── WH_Gold_Dev (Data Engineers: Contributor)
└── Notebooks, Pipelines (full access to experiment)
WS_Test_DataPlatform
├── LH_Bronze_Test (QA: Viewer, CI/CD: Contributor)
├── LH_Silver_Test
├── LH_Gold_Test
└── Deployed via Fabric Deployment Pipelines
WS_Prod_DataPlatform
├── LH_Bronze_Prod (Everyone: Viewer, CI/CD: Contributor)
├── LH_Silver_Prod (Analysts: Viewer via SQL endpoint)
├── LH_Gold_Prod (BI Team: Viewer, connects Power BI)
└── NO direct human writes — only CI/CD pipeline
Deployment Pipelines (Dev → Test → Prod)
Fabric Deployment Pipelines promote items across workspaces:
Dev Workspace ──→ Test Workspace ──→ Prod Workspace
(click Deploy) (click Deploy)
What gets promoted:
✓ Pipelines ✓ Notebooks ✓ Semantic Models
✓ Dataflows Gen2 ✓ Spark Job Defs
What does NOT get promoted:
✗ Data (tables, files) — each environment has its own data
✗ Lakehouse/Warehouse items — re-created via notebooks/SQL scripts
This means: your CREATE TABLE statement lives in a notebook or SQL script
that IS promoted. When it runs in Test, it creates the table in Test.
When it runs in Prod, it creates the table in Prod. Same code, different data.
Git Integration for Table Definitions
All table creation code lives in Git:
/notebooks/
01_create_bronze_tables.py ← CREATE TABLE statements
02_create_silver_tables.py
03_create_gold_tables.py
/pipelines/
PL_Daily_ETL.json ← Pipeline definitions
/sql/
create_warehouse_tables.sql ← Warehouse DDL
Pull Request flow:
1. DE creates branch: feature/DE-1234-shipping-pref
2. Adds CREATE TABLE to 02_create_silver_tables.py
3. Opens PR → Peer reviews → Lead approves → Merge
4. Deployment pipeline picks up the change
5. Table is created in Test (automated), then Prod (after QA sign-off)
Fabric Workspace Roles
| Role | Can Create Tables? | Can Read Data? | Typical User |
|---|---|---|---|
| Admin | Yes | Yes | Platform team, workspace owner |
| Member | Yes | Yes | Data Engineers (Dev workspace) |
| Contributor | Yes (items, not workspace settings) | Yes | CI/CD service principal |
| Viewer | No | Yes (read-only) | Analysts, BI developers, business users |
In production workspaces, Data Engineers are typically Viewers (read-only). Only the CI/CD service principal is a Contributor. This prevents accidental changes.
Governance in Databricks
Unity Catalog and Three-Level Namespace
Unity Catalog enforces governance through a three-level namespace:
Metastore (organization-level)
└── Catalog (environment or domain)
└── Schema (layer or subject area)
└── Table
Example:
dev_catalog.bronze.crm_customer ← Dev, Bronze, CRM source
prod_catalog.silver.crm_customer_clean ← Prod, Silver, cleaned
prod_catalog.gold.dim_customer ← Prod, Gold, dimension
Access control:
GRANT SELECT ON SCHEMA prod_catalog.gold TO analysts;
GRANT ALL PRIVILEGES ON SCHEMA dev_catalog.bronze TO data_engineers;
-- Analysts can READ Gold but cannot CREATE or ALTER anything
-- Data Engineers have full access in Dev but READ-ONLY in Prod
Permissions and Access Control
Typical Databricks permission setup:
Data Engineers:
Dev catalog: CREATE TABLE, ALTER, INSERT, SELECT, DELETE
Prod catalog: SELECT only (read-only)
Data Analysts:
All catalogs: SELECT only on Silver and Gold schemas
No access to Bronze (raw data may contain PII)
CI/CD Service Principal:
Prod catalog: CREATE TABLE, ALTER, INSERT (automated deployments)
Data Architect:
All catalogs: SELECT + ability to review and approve PRs (not direct access)
The Approval Chain: Who Approves What
| Change Type | Approvers | Typical Time |
|---|---|---|
| New Bronze table (new source) | Peer DE + Lead DE | 1-2 days |
| New Silver table | Peer DE + Lead DE + Architect | 2-3 days |
| New Gold table (dim/fact) | Peer DE + Lead DE + Architect + Business Stakeholder | 3-5 days |
| ALTER column (add/rename) | Peer DE + Lead DE + Impact Analysis | 1-3 days |
| DROP column | Lead DE + Architect + all downstream owners | 3-7 days |
| DROP table | Architect + Business Owner + documented deprecation | 5-30 days (deprecation period) |
| Hotfix (emergency) | Lead DE (verbal) + retroactive PR | Hours (but documented after) |
Documentation Requirements
What Data Dictionaries Look Like
Every production table should have an entry in the data dictionary — a central document (Confluence, SharePoint, or a metadata tool like Atlan/Alation) that describes every table and column:
TABLE: slv_crm_customer_clean
Layer: Silver
Source: CRM system (Salesforce) → Bronze: brz_crm_customer
Owner: Data Engineering team (Naveen Vuppula)
Created: 2026-03-15 (Jira: DE-1102)
Last altered: 2026-06-01 (Jira: DE-1234 — added loyalty_points column)
Refresh: Daily at 6:30 AM via PL_CRM_Bronze_to_Silver
Consumers: Gold: dim_customer, Dashboard: Customer 360
COLUMNS:
Column Name Type Nullable Description
────────── ──── ──────── ───────────
customer_id INT No Unique customer identifier from CRM
first_name VARCHAR(100) No Customer first name (Title Case)
last_name VARCHAR(100) No Customer last name (Title Case)
email VARCHAR(200) Yes Email address (lowercase, validated)
city VARCHAR(100) No Current city (standardized)
country VARCHAR(50) No Country (ISO code)
loyalty_points INT Yes Loyalty program points (added DE-1234)
is_active BIT No 1 = active customer, 0 = churned
created_date DATE No Date customer was first created in CRM
updated_tms DATETIME Yes Last update timestamp from pipeline
Why this matters: When a new analyst joins the team, they do not ask “what does this column mean?” They look it up. When a DE needs to alter a column, the data dictionary shows who consumes the table (impact analysis). When auditors ask “where does this number come from,” the lineage is documented.
Why Tutorials Skip All This
- Governance is not exciting — “CREATE TABLE” gets clicks. “How to write a Jira ticket for ALTER TABLE” does not.
- Single-user environment — tutorials use one workspace, one account, full admin. No need for roles or permissions.
- No downstream consumers — in a tutorial, nobody’s dashboard breaks if you drop a column. In production, 50 people notice.
- No compliance requirements — tutorials do not worry about GDPR, SOX, HIPAA, or audit trails.
- Time constraints — a 2-hour course cannot spend 30 minutes on change management. They skip to the technical content.
This is not a criticism of tutorials — they serve their purpose (teaching the technology). But the gap between “I can CREATE TABLE in a notebook” and “I can deliver a production-grade table in an enterprise” is where real data engineering skill lives.
Common Mistakes New Engineers Make
-
Creating tables directly in production — “I will just run this CREATE TABLE real quick.” Never. Use the CI/CD pipeline. Even for “small” changes.
-
Naming tables without conventions —
customer_data_v2_final_backup_REALtells nobody what layer, source, or purpose this table serves. Follow the standard:slv_crm_customer_clean. -
Not creating a Jira ticket before starting work — “I will document it later” means it never gets documented. Create the ticket FIRST, reference it in your PR, and link it to the deployment.
-
Skipping impact analysis for ALTER TABLE — adding a column is usually safe. Renaming or dropping a column breaks everything downstream. Always check: who reads from this table?
-
Not handling schema evolution in Bronze — if your Bronze pipeline breaks when a source adds a column, your architecture is fragile. Use
mergeSchemain Delta or AutoLoader’s schema evolution to absorb changes gracefully. -
Creating tables via UI clicks instead of code — clicking “New Table” in the Lakehouse UI works in Dev, but that action cannot be promoted via CI/CD. Always use SQL scripts or notebooks that are version-controlled in Git.
-
Treating Dev like Prod — “My Dev table works, so just copy it to Prod.” Dev often has sample data, different permissions, and experimental code. Promote through the proper pipeline: Dev → Test → Prod.
Interview Questions
Q: How do you manage table creation in a production environment? A: All table definitions live in version-controlled SQL scripts or notebooks in Git. A Data Engineer creates the table in the Dev workspace, opens a pull request, gets peer and architect review, and merges to the main branch. A CI/CD pipeline (Fabric Deployment Pipelines, Azure DevOps, or GitHub Actions) deploys the change to Test for validation, then to Prod after QA sign-off. Nobody creates tables directly in production.
Q: What happens when a source system adds a new column to a file? A: In a well-designed pipeline, the Bronze layer absorbs it gracefully using schema evolution (Delta mergeSchema or AutoLoader schema evolution). The new column appears in Bronze automatically. For Silver and Gold, a change request is needed: add the column to the Silver transformation, update the data dictionary, test the impact on downstream tables, and deploy through CI/CD.
Q: Why are naming conventions important for tables?
A: Naming conventions like brz_crm_customer instantly tell you the layer (Bronze), source (CRM), and entity (customer). Without conventions, a team of 10 engineers creates tables like customers2, cust_data_final, test_cust_backup — nobody knows which is authoritative. Conventions enable self-documenting schemas and make impact analysis possible.
Q: How do you handle Dev vs Test vs Prod environments in Fabric? A: Create separate workspaces for each environment (WS_Dev, WS_Test, WS_Prod). Data Engineers have Contributor access in Dev and Viewer (read-only) in Prod. Only a CI/CD service principal has write access to Prod. Table creation code lives in notebooks or SQL scripts that are promoted via Fabric Deployment Pipelines from Dev to Test to Prod.
Q: Do you create a new table every time you receive a file from a client? A: No. Tables are created during initial source onboarding (one-time setup). Daily files are loaded into existing Bronze and Silver tables via pipelines. New tables are only created when a completely new source system or entity is onboarded. Schema changes to existing files are handled via ALTER TABLE through the change management process.
Q: What is the difference between Bronze, Silver, and Gold governance? A: Bronze is moderate — permissive schemas, append-only, DE-created with standard naming. Silver is strict — enforced schemas, data quality checks, architect-reviewed designs. Gold is very strict — locked schemas, business stakeholder approval, regression-tested, because Gold powers executive dashboards and regulatory reports.
Wrapping Up
The biggest skill gap in data engineering is not technical — it is understanding how real organizations operate. Anyone can write CREATE TABLE. The real skill is knowing WHEN to create it, WHO needs to approve it, HOW to deploy it safely, and WHAT documentation to leave behind.
Tutorials teach you the syntax. Enterprise projects teach you the process. This post bridges the gap so that when you walk into your first (or next) enterprise data engineering role, you are not surprised when a simple ALTER TABLE takes 3 days and 4 approvals. That process is not bureaucracy — it is what separates a working data platform from chaos.
Remember: in production, the hospital never stops running. Every change is a surgery — planned, reviewed, tested, and executed with precision.
Related posts: – The Medallion Architecture – Fabric Git Integration & CI/CD – Fabric Security & Governance – Unity Catalog Deep Dive – How Real Companies Receive Data
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.