Python - DriveDataScience

PySpark Reading from APIs and Databases: JDBC Connections, REST API Pagination, Parallel Reads, Connection Pooling, and Production Ingestion Patterns

Data Engineering, Python

Complete guide to reading databases and APIs with PySpark. Basic JDBC read with connection properties, custom SQL query pushdown, parallel JDBC reads (partitionColumn, numPartitions, lowerBound, upperBound), choosing partition columns, REST API basic read to DataFrame, pagination handling, rate limiting with retry and exponential backoff, parallel API calls, API to Delta pipeline, Key Vault for credentials, JDBC with Managed Identity, production patterns (full load, incremental watermark, API to Bronze to Silver), 5 common mistakes, and 3 interview Q&As.
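As a rough illustration of the pagination and retry-with-exponential-backoff items in that list, here is a minimal plain-Python sketch. It is not taken from the guide itself: `fetch_page`, `TransientError`, the `items`/`next` response shape, and the delay values are all hypothetical stand-ins for whatever the real API and HTTP client provide.

```python
import time
from typing import Any, Callable, Dict, List, Optional


class TransientError(Exception):
    """Hypothetical error standing in for a rate-limit (429) or 5xx response."""


def call_with_backoff(
    fn: Callable[[], Any],
    max_retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fn, retrying on TransientError with delays of base_delay * 2**attempt."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error
            sleep(base_delay * (2 ** attempt))


def read_all_pages(
    fetch_page: Callable[[Optional[str]], Dict[str, Any]],
    max_retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Any]:
    """Follow cursor-based pagination until the response has no next cursor.

    Assumes each page looks like {"items": [...], "next": <cursor or None>}.
    """
    records: List[Any] = []
    cursor: Optional[str] = None
    while True:
        page = call_with_backoff(
            lambda: fetch_page(cursor),
            max_retries=max_retries,
            base_delay=base_delay,
            sleep=sleep,
        )
        records.extend(page["items"])
        cursor = page.get("next")
        if cursor is None:
            break
    return records
```

The collected records could then be handed to `spark.createDataFrame` on the driver. Injecting `sleep` as a parameter is a design choice that keeps the backoff logic testable without real waiting.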


PySpark UDFs and Higher-Order Functions: When to Use UDFs, Performance Pitfalls, Pandas UDFs, and Array/Map Functions for Complex Transformations

Data Engineering, Python

Complete guide to PySpark UDFs and higher-order functions. Built-in vs UDF comparison, creating standard Python UDFs, the serialization performance problem (10-100x slower), Pandas UDFs (vectorized 5-10x faster), scalar and grouped map Pandas UDFs, higher-order functions (transform, filter, aggregate, exists) for arrays without UDFs, working with complex types (arrays, maps, structs), when to use each approach, 5 common mistakes, and 3 interview Q&As.


PySpark Data Cleaning and Validation: Null Handling, Deduplication, Regex, Type Casting, Data Quality Checks, and Production Cleaning Patterns

Data Engineering, Python

Complete PySpark data cleaning guide. Null handling (check, drop, fill, coalesce), deduplication (dropDuplicates vs window-based keep-latest), type casting with safe conversion, string cleaning (trim, regex replace, regex extract, format standardization), date parsing for multiple formats, column-level and row-level validation, quarantine pattern (clean vs bad rows), production cleaning pipeline function, 5 common mistakes, and 3 interview Q&As.


Data Quality in Azure Databricks: Validation Rules, Quarantine Patterns, and Building a DQ Framework from Scratch

Azure, Data Engineering, Python

Build a production data quality framework in PySpark. Eight validation checks (nulls, duplicates, range, regex, referential integrity, freshness, row count, type), the quarantine pattern, a reusable DQ class, DQ reports, integration with Medallion Architecture, and a table of real-world DQ issues.


PySpark Window Functions Deep Dive: ROW_NUMBER, RANK, LAG, LEAD, Running Totals, and Real-World Patterns

Azure, Data Engineering, Python

Master every PySpark window function with real business scenarios. ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, NTILE, running totals, moving averages, percent of group. Plus 5 real-world patterns: deduplication, gap detection, sessionization, YoY comparison, and top N per group.


SCD Type 1 and Type 2 Using PySpark and Delta Lake MERGE in Azure Databricks

Azure, Data Engineering, Python

Implement both SCD Type 1 and Type 2 using PySpark and Delta Lake MERGE in Databricks. Type 1: one MERGE with whenMatchedUpdate + whenNotMatchedInsertAll. Type 2: MERGE to expire + APPEND for new versions. Hash-based change detection, version numbering, idempotent tests, reusable functions, and a complete Synapse Data Flow to PySpark MERGE mapping.
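The Type 2 flow summarized above (expire the current row, then append a new version, using a hash to detect changes) can be illustrated in plain Python. This is only a sketch of the logic on lists of dicts, not the Delta Lake MERGE code from the post. The column names `row_hash`, `version`, `start_date`, `end_date`, and `is_current` are assumptions, and the sketch assumes at most one incoming row per key per batch.

```python
import hashlib
from typing import Any, Dict, List

Row = Dict[str, Any]


def row_hash(row: Row, tracked_cols: List[str]) -> str:
    """Hash the tracked columns so a change can be detected with one comparison."""
    payload = "|".join(str(row[c]) for c in tracked_cols)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def apply_scd2(
    dim: List[Row],
    incoming: List[Row],
    key: str,
    tracked_cols: List[str],
    load_date: str,
) -> List[Row]:
    """Return a new dimension table with SCD Type 2 history applied."""
    dim = [dict(r) for r in dim]  # copy so the caller's rows are not mutated
    current = {r[key]: r for r in dim if r["is_current"]}
    new_versions: List[Row] = []

    for src in incoming:
        h = row_hash(src, tracked_cols)
        existing = current.get(src[key])
        if existing is not None and existing["row_hash"] == h:
            continue  # unchanged: skipping here is what makes reruns idempotent

        version = 1
        if existing is not None:
            # "MERGE to expire": close out the currently active version
            existing["is_current"] = False
            existing["end_date"] = load_date
            version = existing["version"] + 1

        # "APPEND for new versions": add the new active row
        new_versions.append({
            **src,
            "row_hash": h,
            "version": version,
            "start_date": load_date,
            "end_date": None,
            "is_current": True,
        })

    return dim + new_versions
```

Applying the same batch twice leaves the table unchanged on the second run, which is the property an idempotency test would check.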


Lazy Evaluation in PySpark: Why Spark Waits, How It Optimizes, and When Your Code Actually Runs

Azure, Data Engineering, Python

Master lazy evaluation, the most important PySpark concept nobody explains properly. Why Spark waits, how the Catalyst Optimizer rewrites your code, transformations vs actions with complete lists, predicate pushdown, column pruning, the DAG, narrow vs wide transformations, the cache trap, proving laziness with a hands-on experiment, and how it powers your SCD Type 2 pipeline.


Delta Lake and PySpark Optimization in Azure Databricks: OPTIMIZE, Z-ORDER, VACUUM, AQE, Broadcast Joins, and the Production Playbook

Azure, Data Engineering, Python

The complete optimization playbook for Databricks. Small file problem and OPTIMIZE compaction, Z-ORDER for file skipping, conditional OPTIMIZE, VACUUM with retention and time travel interaction, partitioning strategy, AQE, broadcast joins, join best practices, caching, coalesce vs repartition, and the production checklist that turns 45-minute pipelines into 5-minute pipelines.


PySpark Joins in Azure Databricks: Every Join Type Explained with Examples and Real-Life Analogies

Azure, Data Engineering, Python

Master all 8 PySpark join types using the same two DataFrames. Inner, left, right, full outer, left_semi, left_anti, cross, and self joins with output tables, real-life analogies, SQL equivalents, broadcast optimization, duplicate column handling, and a decision table for choosing the right join.


Managed vs External Tables in Azure Databricks: Unity Catalog, External Locations, Data Persistence, and Every Operation Explained

Azure, Data Engineering, Python

Master managed vs external tables in Databricks. Complete setup: Access Connector, Storage Credential, External Location, and external table creation. Proves data persistence after DROP TABLE with a step-by-step walkthrough. Covers Delta operations on external tables, partitioning, VACUUM, granting access, and the three-layer Unity Catalog security model.
