Python

Python tutorials, tips, and best practices

Data Quality in Azure Databricks: Validation Rules, Quarantine Patterns, and Building a DQ Framework from Scratch

Build a production data quality framework in PySpark. Covers eight validation checks (nulls, duplicates, range, regex, referential integrity, freshness, row count, type), the quarantine pattern, a reusable DQ class, DQ reports, integration with the Medallion Architecture, and a table of real-world DQ issues.

PySpark Window Functions Deep Dive: ROW_NUMBER, RANK, LAG, LEAD, Running Totals, and Real-World Patterns

Master every PySpark window function with real business scenarios. Covers ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, NTILE, running totals, moving averages, and percent of group, plus five real-world patterns: deduplication, gap detection, sessionization, YoY comparison, and top N per group.

SCD Type 1 and Type 2 Using PySpark and Delta Lake MERGE in Azure Databricks

Implement both SCD Type 1 and Type 2 using PySpark and Delta Lake MERGE in Databricks. Type 1: one MERGE with whenMatchedUpdate + whenNotMatchedInsertAll. Type 2: a MERGE to expire rows plus an APPEND for new versions. Also covers hash-based change detection, version numbering, idempotent tests, reusable functions, and a complete Synapse Data Flow to PySpark MERGE mapping.

Lazy Evaluation in PySpark: Why Spark Waits, How It Optimizes, and When Your Code Actually Runs

Master lazy evaluation, the most important PySpark concept nobody explains properly. Covers why Spark waits, how the Catalyst Optimizer rewrites your code, transformations vs actions with complete lists, predicate pushdown, column pruning, the DAG, narrow vs wide transformations, the cache trap, proving laziness with a hands-on experiment, and how it powers your SCD Type 2 pipeline.

Delta Lake and PySpark Optimization in Azure Databricks: OPTIMIZE, Z-ORDER, VACUUM, AQE, Broadcast Joins, and the Production Playbook

The complete optimization playbook for Databricks. Covers the small file problem and OPTIMIZE compaction, Z-ORDER for file skipping, conditional OPTIMIZE, VACUUM with retention and its interaction with time travel, partitioning strategy, AQE, broadcast joins, join best practices, caching, coalesce vs repartition, and the production checklist that turns 45-minute pipelines into 5-minute pipelines.

PySpark Joins in Azure Databricks: Every Join Type Explained with Examples and Real-Life Analogies

Master all 8 PySpark join types using the same two DataFrames. Inner, left, right, full outer, left_semi, left_anti, cross, and self joins with output tables, real-life analogies, SQL equivalents, broadcast optimization, duplicate column handling, and a decision table for choosing the right join.

Managed vs External Tables in Azure Databricks: Unity Catalog, External Locations, Data Persistence, and Every Operation Explained

Master managed vs external tables in Databricks. Complete setup: Access Connector, Storage Credential, External Location, and external table creation. Proves data persistence after DROP TABLE with a step-by-step walkthrough. Covers Delta operations on external tables, partitioning, VACUUM, granting access, and the three-layer Unity Catalog security model.

Delta Lake Deep Dive in Azure Databricks: Time Travel, Versioning, MERGE, Schema Evolution, and Every Operation Explained

Hands-on Delta Lake deep dive in Databricks. Every operation step by step: INSERT, UPDATE, DELETE, and MERGE, each creating a new version. Three methods of time travel. Compare versions and track entities across history. Also covers RESTORE, VACUUM, schema evolution, and the DeltaTable Python API.

Connecting Azure Databricks to Azure SQL Database: JDBC Read, Write, and Production Patterns

Master Databricks to Azure SQL Database connectivity. JDBC connection setup, secure credentials with Key Vault, reading tables and custom queries, the ORDER BY subquery trap, write modes, upsert pattern, the three-notebook production architecture (Config + Functions + Operations), data quality functions, performance optimization with partitioned reads, and common JDBC errors.

PySpark Foundations: SparkSession, Imports, Configuration, and the Basics Nobody Teaches

Master PySpark foundations that every tutorial skips. Covers SparkSession creation and configuration, SparkSession vs SparkContext history, every import you need, builder options, spark.conf.set vs builder config, stopping sessions, running PySpark locally, spark-submit, and an environment comparison (Local vs Databricks vs Synapse).
