Blog - DriveDataScience

File Storage in Azure Databricks: Volumes, DBFS, /tmp/, External Locations, and Where Your Files Actually Live

Azure, Data Engineering

Master every file storage option in Databricks: /tmp/ (temporary), DBFS (legacy), Unity Catalog Volumes (modern), External Locations (ADLS Gen2), and FileStore. Includes a path prefix cheat sheet, managed vs external volumes, the append-mode Illegal Seek bug and its workaround, a comparison of Python open(), dbutils.fs, and spark.read, and guidance on which storage fits which use case.

Data Quality in Azure Databricks: Validation Rules, Quarantine Patterns, and Building a DQ Framework from Scratch

Azure, Data Engineering, Python

Build a production data quality framework in PySpark. Eight validation checks (nulls, duplicates, range, regex, referential integrity, freshness, row count, type), the quarantine pattern, a reusable DQ class, DQ reports, integration with the Medallion Architecture, and a table of real-world DQ issues.

Databricks Workflows and Jobs: Scheduling, Multi-Task Pipelines, Alerts, and Production Orchestration

Azure, Data Engineering

Master Databricks Workflows for production orchestration. Job creation, Job vs All-Purpose clusters, multi-task DAG pipelines, task dependencies, parameter passing with Task Values, cron scheduling, retry and timeout config, email alerts, the complete Medallion workflow, triggering from ADF, and cost optimization.

The Medallion Architecture in Azure: Bronze, Silver, and Gold Layers Explained with Real Pipelines

Azure, Data Engineering

Master the Medallion Architecture with the water purification analogy. Bronze (raw), Silver (cleaned), and Gold (business-ready) layers explained with format selection, transformation patterns, the ownership model, cost optimization, and an exact mapping to every pipeline we built on the blog.

PySpark Window Functions Deep Dive: ROW_NUMBER, RANK, LAG, LEAD, Running Totals, and Real-World Patterns

Azure, Data Engineering, Python

Master every PySpark window function with real business scenarios. ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, NTILE, running totals, moving averages, percent of group. Plus 5 real-world patterns: deduplication, gap detection, sessionization, YoY comparison, and top N per group.

SCD Type 1 and Type 2 Using PySpark and Delta Lake MERGE in Azure Databricks

Azure, Data Engineering, Python

Implement both SCD Type 1 and Type 2 using PySpark and Delta Lake MERGE in Databricks. Type 1: one MERGE with whenMatchedUpdate + whenNotMatchedInsertAll. Type 2: MERGE to expire the current versions + APPEND for the new versions. Hash-based change detection, version numbering, idempotency tests, reusable functions, and a complete mapping from Synapse Data Flow to PySpark MERGE.

Lazy Evaluation in PySpark: Why Spark Waits, How It Optimizes, and When Your Code Actually Runs

Azure, Data Engineering, Python

Master lazy evaluation — the most important PySpark concept nobody explains properly. Why Spark waits, how the Catalyst Optimizer rewrites your code, transformations vs actions with complete lists, predicate pushdown, column pruning, the DAG, narrow vs wide transformations, the cache trap, proving laziness with a hands-on experiment, and how it powers your SCD Type 2 pipeline.

Delta Lake and PySpark Optimization in Azure Databricks: OPTIMIZE, Z-ORDER, VACUUM, AQE, Broadcast Joins, and the Production Playbook

Azure, Data Engineering, Python

The complete optimization playbook for Databricks. Small file problem and OPTIMIZE compaction, Z-ORDER for file skipping, conditional OPTIMIZE, VACUUM with retention and time travel interaction, partitioning strategy, AQE, broadcast joins, join best practices, caching, coalesce vs repartition, and the production checklist that turns 45-minute pipelines into 5-minute pipelines.

PySpark Joins in Azure Databricks: Every Join Type Explained with Examples and Real-Life Analogies

Azure, Data Engineering, Python

Master all 8 PySpark join types using the same two DataFrames. Inner, left, right, full outer, left_semi, left_anti, cross, and self joins with output tables, real-life analogies, SQL equivalents, broadcast optimization, duplicate column handling, and a decision table for choosing the right join.

Azure RBAC Roles Demystified: Every Role, Every Identity, and When to Assign What to Whom

Azure, Data Engineering

Master Azure RBAC with the hotel key card analogy. Every role organized by service: Storage (the confusing ones explained), SQL, Synapse, Databricks, Data Factory, Key Vault, Networking, and Compute. Management plane vs data plane, four identity types, five real-world scenarios, the decision framework, principle of least privilege, and common 403 error fixes.
