Data Engineering - DriveDataScience

Databricks Unity Catalog Deep Dive: Metastore, Three-Level Namespace, Governance, Lineage, and Securing Your Lakehouse Like a Production Engineer

Data Engineering, Databricks

The complete Unity Catalog deep dive. Three-level namespace (catalog.schema.table), metastore setup with Access Connector, catalogs and schemas with 4 organization patterns, managed vs external tables, Volumes for unstructured data, Storage Credentials and External Locations explained, full GRANT/REVOKE with 5 complete scenarios, row-level security with SQL row filters, column masking with dynamic data masking functions, automatic data lineage (table and column level), audit logging, Delta Sharing for zero-copy data sharing, identity federation with Azure AD SCIM, Hive metastore migration with the UCX tool, Azure vs AWS vs GCP comparison, and Unity Catalog vs Microsoft Fabric governance.

Feature Engineering: The Skill That Matters More Than Algorithm Choice — Encoding, Scaling, Creating Features, Handling Dates, Text Features, and Feature Selection

DP-700 Certification Study Guide: Every Exam Objective Mapped to DriveDataScience Posts, Study Plan, and Tips to Pass the Microsoft Fabric Data Engineer Associate Exam

Azure, Data Engineering

The complete DP-700 study guide with every exam objective mapped to 34 DriveDataScience Fabric posts. Three domains: Implement and Manage, Ingest and Transform, Monitor and Optimize. 8-week study plan, quick reference cards, exam-day tips, and 5 practice questions.

Fabric Data Factory Expression Language: Dynamic Pipelines with @pipeline(), @activity(), @formatDateTime(), Conditional Logic, and Every Expression You Need

Azure, Data Engineering

Complete Fabric Data Factory expression language guide. Where expressions work (Copy, Notebook, If Condition, ForEach, Web activities). Pipeline functions (@pipeline, @activity, @variables, @item). String functions (concat, replace, split, trim). Date functions (utcNow, formatDateTime, addDays, startOfMonth). Logical functions (if, equals, coalesce, and, or). Eight real-world patterns: yesterday's incremental load, dynamic file paths with date partitioning, conditional full vs incremental load, dynamic SQL, ForEach over a table list, notebook parameters, error handling, dynamic email subjects. ADF vs Fabric comparison (identical syntax).

Model Evaluation Deep Dive: Confusion Matrix, Precision, Recall, F1 Score, ROC-AUC, Cross-Validation, Bias-Variance Tradeoff, and Choosing the Right Metric

KQL (Kusto Query Language) Complete Guide: Syntax, Operators, Functions, Joins, Time Series, Anomaly Detection, and Real-World Query Patterns for Fabric Real-Time Intelligence

Azure, Data Engineering

The complete KQL reference guide. Pipe-based syntax explained with SQL comparison. Filtering with where (comparison, string operators, has vs contains performance). Selecting with project and extend. Aggregation with summarize (count, dcount, percentile, arg_max, make_list). Time-based analysis with bin and ago. All join types including lookup optimization. Complete function reference: string, date/time, numeric, dynamic/JSON. Let statements for variables and subqueries. Render for inline visualization. Advanced: make-series for time series, series_decompose_anomalies for anomaly detection, materialized views, stored functions. Eight real-world query patterns (latest per device, sessions, error rate, top N, funnel, spike detection, week-over-week, distributed tracing). Full KQL vs SQL comparison table.

M Language (Power Query) Complete Guide: Every Function You Need, Text, Date, Number, Table Operations, Error Handling, Custom Functions, and Real-World Patterns

Azure, Data Engineering

The complete M language reference for Dataflow Gen2. The let-in structure explained. Every essential function: text (20+ functions including split, combine, replace, extract), numbers (round, math, conversion), dates (30+ functions including arithmetic, extraction, start/end of periods, formatting), logical (if-then-else, null coalescing), lists (aggregate, filter, transform), and tables (filter rows, add columns, join with all 6 join kinds, group by, pivot, unpivot, buffer). Error handling with try-otherwise. Custom functions. Query folding explained. Five real-world patterns.

Fabric Connections and Gateways: Connection Types, On-Premises Data Gateway, VNet Gateway, Managing Connections, and Accessing Data Behind Firewalls

Azure, Data Engineering

Complete Fabric connections and gateways guide. Connection types for Azure, on-premises, cross-cloud (AWS S3, GCS), and SaaS sources with authentication options. Connection vs ADF Linked Service migration. On-premises data gateway installation, architecture (outbound HTTPS only, no inbound ports), and high-availability clustering. VNet data gateway for private Azure resources. Creating, sharing, and reusing connections. Security best practices (Service Principal, Key Vault). Troubleshooting connection errors. Three real-world scenarios.

Delta Lake Table Properties: Every TBLPROPERTIES Setting, Retention, Change Data Feed, Column Mapping, Auto-Optimize, and Managing Delta Tables Like a Production Engineer

Azure, Data Engineering

The complete Delta Lake Table Properties reference. Every TBLPROPERTIES setting explained: deletedFileRetentionDuration and logRetentionDuration with VACUUM interaction, Change Data Feed (CDF) for materialized lake views (MLVs) and streaming, column mapping for renaming and dropping columns, autoOptimize.optimizeWrite and autoCompact, targetFileSize, protocol versions, schema.autoMerge, and data skipping. Five real-world configurations (Gold SCD dimension, Bronze staging, Silver standard, streaming target, large fact table). Fabric vs Databricks property comparison.

Fabric Optimization Guide: Lakehouse, Pipelines, Warehouse, Spark, Eventstream, and Query Performance Tuning Across All Workloads

Azure, Data Engineering

The consolidated Fabric optimization reference. Lakehouse: OPTIMIZE, VACUUM, Z-ORDER, V-Order, file sizing, partitioning. Pipelines: parallel execution, scheduling strategy, Copy Activity tuning. Warehouse: statistics, result set caching, query patterns. Spark: shuffle partitions, AQE, broadcast joins, caching. Eventstream: retention, caching policies, materialized views. Complete optimization checklist across all workloads.
