Top 20 Data Engineering Interview Questions: Concepts Every Candidate Must Know

Data engineering interviews test a unique combination of skills: SQL, Python, cloud services, pipeline design, data modeling, and system design. Unlike software engineering interviews that focus heavily on algorithms, data engineering interviews focus on practical, production-level thinking.

This post covers the 20 most common data engineering interview questions that go beyond any specific tool. These are the fundamental concepts that come up whether you are interviewing for an Azure, AWS, or GCP role.

For Azure-specific questions, see my post on Top 15 ADF Interview Questions.

Table of Contents

  • Data Pipeline Design (Questions 1-5)
  • Data Modeling and Architecture (Questions 6-10)
  • ETL, ELT, and Data Processing (Questions 11-15)
  • Data Quality, Testing, and Operations (Questions 16-20)
  • Bonus: Questions You Should Ask the Interviewer
  • Wrapping Up

Data Pipeline Design

Question 1: What is the difference between ETL and ELT?

ETL (Extract, Transform, Load): Data is extracted from the source, transformed in a staging area (cleaning, joining, aggregating), and then loaded into the target system. The transformation happens BEFORE loading.

ELT (Extract, Load, Transform): Data is extracted and loaded into the target system AS-IS (raw). Transformation happens INSIDE the target system using its compute power (e.g., SQL in Snowflake, Spark in Databricks).

When to use ETL:

  • The target system has limited compute (cannot handle raw data transformation)
  • Data must be cleaned before entering the target (compliance, PII scrubbing)
  • Legacy systems expect pre-formatted data

When to use ELT:

  • The target is a modern cloud data warehouse or lakehouse (Snowflake, BigQuery, Databricks)
  • You want to preserve raw data for reprocessing
  • Transformations are SQL-based and the target system is optimized for them

The trend: ELT has become the dominant pattern in modern data engineering because cloud warehouses have massive compute power. Load raw data first, transform later using SQL or Spark.

Question 2: How do you handle late-arriving data in a pipeline?

Late-arriving data is data that arrives after its expected processing window. For example, an order from March 31 arrives on April 5.

Strategies:

  • Reprocessing windows: Pipeline looks back N days and reprocesses recent partitions, catching any late arrivals
  • Event-time processing: Use the event timestamp (when it happened) not the processing timestamp (when you received it) for partitioning
  • Idempotent pipelines: Design pipelines so running them twice on the same data produces the same result. Then reprocessing is safe.
  • Delta tables with MERGE: Use Delta Lake or Iceberg MERGE statements to upsert late data into existing partitions
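The reprocessing-window strategy above can be sketched in plain Python. This is a minimal illustration, not any library's API; the function name and window size are assumptions:

```python
from datetime import date, timedelta

def lookback_partitions(run_date: date, lookback_days: int) -> list:
    """Return the event-date partitions a reprocessing-window pipeline
    rebuilds on each run: the run date plus the previous N days, so any
    late-arriving records land in their correct (event-time) partition."""
    return [
        (run_date - timedelta(days=offset)).isoformat()
        for offset in range(lookback_days + 1)
    ]

# A run on April 5 with a 5-day window re-covers March 31, so the late
# March 31 order from the example above gets picked up.
partitions = lookback_partitions(date(2026, 4, 5), lookback_days=5)
print(partitions[0], partitions[-1])  # 2026-04-05 2026-03-31
```

Because the pipeline rewrites those partitions on every run, this only works safely when the writes are idempotent (see the next question).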

Question 3: What is idempotency and why does it matter in data pipelines?

An idempotent pipeline produces the same result regardless of how many times you run it with the same input. Running it once or five times gives identical output.

Why it matters: Pipelines fail and get rerun. If rerunning creates duplicate rows, your data is corrupted.

How to achieve it:

  • Use MERGE/UPSERT instead of INSERT (avoids duplicates)
  • Use overwrite mode for partition writes instead of append
  • Include deduplication logic (delete existing partition data before writing)
  • Design with natural keys that prevent duplicate inserts

Example:

-- NOT idempotent (creates duplicates on rerun)
INSERT INTO target SELECT * FROM staging;

-- Idempotent (replaces existing data for the partition)
DELETE FROM target WHERE load_date = '2026-04-07';
INSERT INTO target SELECT * FROM staging WHERE load_date = '2026-04-07';

Question 4: How do you design a pipeline for exactly-once processing?

Exactly-once means every record is processed once and only once — no duplicates, no missed records.

The three guarantees:

  • At-most-once: Records might be lost but never duplicated
  • At-least-once: Records are never lost but might be duplicated
  • Exactly-once: Records are processed exactly one time

Achieving exactly-once in batch pipelines:

  • Use watermark/checkpoint tables to track what has been processed
  • Use transactions (commit the data write and the checkpoint update together)
  • Use idempotent writes (MERGE/UPSERT)

Achieving exactly-once in streaming:

  • Use Kafka consumer offsets with transactional writes
  • Use Spark Structured Streaming with checkpointing
  • Use Delta Lake with ACID transactions
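The batch watermark/checkpoint pattern can be shown in a few lines of Python. This is a sketch of the idea only; in production the data write and the checkpoint update share one transaction, and the names here are illustrative:

```python
def process_increment(records, checkpoint):
    """Process only the records past the stored watermark, then advance it.
    A rerun after a crash re-reads the same input but skips everything the
    watermark already covers, so no record is processed twice."""
    new_rows = [r for r in records if r["id"] > checkpoint["last_id"]]
    processed = [r["value"] for r in new_rows]        # stand-in for the write
    if new_rows:
        checkpoint["last_id"] = new_rows[-1]["id"]    # advance the watermark
    return processed

records = [{"id": i, "value": i * 10} for i in range(1, 6)]
ckpt = {"last_id": 0}
first = process_increment(records, ckpt)    # processes ids 1-5
second = process_increment(records, ckpt)   # rerun: nothing new to do
print(len(first), len(second))  # 5 0
```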

Question 5: What is backfilling and how do you handle it?

Backfilling is processing historical data that was missed or needs to be reprocessed. For example, you deploy a new pipeline on April 1 and need to process data from January 1 through March 31.

Approaches:

  • Date-parameterized pipelines: Pass start_date and end_date as parameters; for backfill, run the pipeline for each historical date
  • Incremental backfill: Process one day or week at a time to avoid overwhelming the system
  • Bulk backfill: Process all historical data in one large run (works if the system can handle it)

Key requirement: The pipeline must be idempotent so backfill runs do not create duplicates.
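An incremental backfill driver is often just a loop over dates that calls the same daily pipeline. A minimal sketch, assuming a `run_for_date` callable that represents your (idempotent) daily run:

```python
from datetime import date, timedelta

def backfill(start: date, end: date, run_for_date):
    """Incremental backfill: invoke the daily pipeline once per historical
    date, oldest first. Safe to rerun only if run_for_date is idempotent."""
    current = start
    while current <= end:
        run_for_date(current)
        current += timedelta(days=1)

ran = []
backfill(date(2026, 1, 1), date(2026, 1, 5), ran.append)
print(len(ran), ran[0], ran[-1])
```

In an orchestrator this loop usually becomes one pipeline run per date (e.g. a triggered run per partition), which also makes retries of individual days cheap.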

Data Modeling and Architecture

Question 6: What is the difference between a data lake and a data warehouse?

Comparing the two aspect by aspect (lake first, warehouse second):

  • Data format: raw data in any format (CSV, JSON, Parquet) vs. structured data with a predefined schema
  • Schema: schema-on-read vs. schema-on-write
  • Storage: cheap object storage (S3, ADLS) vs. more expensive structured storage
  • Users: data engineers and data scientists vs. business analysts and BI tools
  • Transformations: Spark, Python, and SQL vs. SQL
  • Data quality: varies (raw data may be messy) vs. high (enforced by schema)
  • Cost: low storage with variable compute vs. higher (provisioned resources)

Modern approach (Lakehouse): Combines both — store raw data in a lake, add a table format (Delta, Iceberg) for ACID transactions and schema enforcement, query with SQL.

Question 7: Explain the Bronze, Silver, Gold (Medallion) architecture.

Bronze (Raw):

  • Data ingested as-is from source systems
  • No transformations, no cleaning
  • Append-only, preserves complete history
  • Format: Parquet, JSON, CSV as received

Silver (Cleaned):

  • Deduplicated, cleaned, and standardized
  • Nulls handled, data types corrected
  • Business logic applied (joins, filters)
  • Ready for data science and ad-hoc analysis

Gold (Business-ready):

  • Aggregated, modeled for specific use cases
  • Star schema or dimensional model
  • Optimized for BI dashboards and reports
  • Represents the “single source of truth”

Why this matters: Each layer serves different users. Data engineers work on Bronze-to-Silver. Analytics engineers work on Silver-to-Gold. Business users consume Gold.

Question 8: What is a star schema?

A star schema organizes data into a central fact table surrounded by dimension tables:

Fact table: Contains the measurable events (sales amount, order count, click count). Has foreign keys to dimension tables. Usually very large.

Dimension tables: Contain the descriptive context (customer name, product category, date details). Smaller, slowly changing.

             dim_customer
                  |
dim_date --- fact_sales --- dim_product
                  |
              dim_store

Example:

-- Fact table (millions of rows)
fact_sales: sale_id, date_key, customer_key, product_key, store_key, amount, quantity

-- Dimension tables (thousands of rows)
dim_customer: customer_key, name, email, city, segment
dim_product: product_key, name, category, brand, price
dim_date: date_key, full_date, year, quarter, month, day_of_week
dim_store: store_key, store_name, city, region, manager

Why star schema: Optimized for analytics queries. Simple joins. Fast aggregations. Every BI tool understands it.

Question 9: What are Slowly Changing Dimensions (SCD)?

When dimension data changes over time (customer moves to a new city, product price changes), how do you handle it?

SCD Type 1: Overwrite

  • Update the row in place with the new value
  • Loses history
  • Simple; used when history does not matter

-- Customer moved from Toronto to Mumbai
UPDATE dim_customer SET city = 'Mumbai' WHERE customer_id = 100;
-- Old value (Toronto) is gone forever

SCD Type 2: Add New Row

  • Insert a new row with the new value
  • Keep the old row with an end date
  • Preserves complete history

-- Before: customer_id=100, city='Toronto', effective_from='2024-01-01', effective_to='9999-12-31', is_current=1
-- After:
UPDATE dim_customer SET effective_to='2026-04-07', is_current=0 WHERE customer_id=100 AND is_current=1;
INSERT INTO dim_customer VALUES (100, 'Mumbai', '2026-04-07', '9999-12-31', 1);

SCD Type 3: Add New Column

  • Add a column for the previous value
  • Keeps only one level of history

-- dim_customer: customer_id, city, previous_city
-- 100, Mumbai, Toronto

Most common: SCD Type 2 is the standard for enterprise data warehouses.

Question 10: What is data partitioning and why does it matter?

Partitioning divides a large table/file into smaller, manageable pieces based on a column value (usually date).

sales_data/
  year=2025/
    month=01/
      part-00000.parquet
    month=02/
      part-00000.parquet
  year=2026/
    month=01/
      part-00000.parquet

Why it matters:

  • Query performance: WHERE year=2026 AND month=04 only reads one partition instead of scanning the entire table (partition pruning)
  • Maintenance: Delete old data by dropping partitions instead of running DELETE statements
  • Pipeline efficiency: Write to a specific partition without touching others
  • Cost: Less data scanned = lower query costs (especially in Synapse Serverless, Athena, BigQuery)

Common partition keys: date, region, department, customer_segment
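Partition pruning can be demonstrated in miniature with plain Python over the directory layout shown above. This is an illustration of the concept, not how any engine implements it:

```python
def prune_partitions(paths, year, month):
    """Partition pruning in miniature: from a list of partitioned file
    paths, select only those whose directory matches the filter, instead
    of scanning every file in the table."""
    target = f"year={year}/month={month:02d}"
    return [p for p in paths if target in p]

paths = [
    "sales_data/year=2025/month=01/part-00000.parquet",
    "sales_data/year=2025/month=02/part-00000.parquet",
    "sales_data/year=2026/month=01/part-00000.parquet",
]
# WHERE year=2026 AND month=01 touches one file out of three.
print(prune_partitions(paths, 2026, 1))
```

Real engines do the same thing against the table's partition metadata before any data is read, which is why filtering on the partition column is so much cheaper than filtering on an ordinary column.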

ETL, ELT, and Data Processing

Question 11: What is the difference between batch and stream processing?

Batch processing:

  • Processes data in discrete chunks (hourly, daily)
  • Higher latency (data is not available until the batch completes)
  • Simpler to build and debug
  • Tools: ADF, Spark batch, Airflow, Glue

Stream processing:

  • Processes data continuously as it arrives
  • Low latency (seconds to minutes)
  • More complex (handles out-of-order events, exactly-once, state management)
  • Tools: Kafka, Spark Streaming, Flink, Event Hubs

When to use batch: Daily reporting, nightly ETL, historical analysis, data warehouse loading

When to use streaming: Real-time dashboards, fraud detection, live recommendations, IoT sensor data

Question 12: What is Apache Spark and when do you use it?

Spark is a distributed processing engine that handles large-scale data transformation, analysis, and machine learning.

Use Spark when:

  • Data is too large for a single machine (100GB+)
  • You need complex transformations (joins across billion-row tables)
  • You need both batch and streaming in the same framework
  • You are working in a lakehouse (Databricks, Synapse Spark)

Do not use Spark when:

  • Data fits in memory on a single machine (use pandas instead)
  • You only need simple data movement (use ADF Copy activity)
  • Transformations are SQL-only (use the warehouse SQL engine)

Spark interfaces: PySpark (Python), Spark SQL, Scala, Java, R

Question 13: What is data skew and how do you handle it?

Data skew occurs when data is unevenly distributed across partitions in a distributed system. One partition has significantly more data than others, causing that partition to take much longer to process (the “straggler” problem).

Example: Partitioning by customer_id where one customer (Walmart) has 10 million orders and most others have 100. The Walmart partition takes 100x longer.

Solutions:

  • Salting: Add a random prefix to the skewed key, distribute across multiple partitions, then aggregate
  • Broadcast join: If one side of the join is small, broadcast it to all nodes instead of shuffling
  • Repartitioning: Manually repartition with a more evenly distributed key
  • Adaptive Query Execution (AQE): Modern Spark (3.0+) handles some skew automatically
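Salting is easy to show without a cluster. The sketch below just demonstrates the key trick: a random suffix spreads one hot key across several partitions (the bucket count and naming are illustrative, not a Spark API):

```python
import random

def salted_key(key, buckets=8, rng=random):
    """Salting: append a random bucket suffix so a hot key ('walmart')
    is distributed across several partitions instead of landing on a
    single straggler task."""
    return f"{key}#{rng.randrange(buckets)}"

rng = random.Random(42)  # seeded for reproducibility
salted = {salted_key("walmart", buckets=8, rng=rng) for _ in range(10_000)}
print(len(salted))  # 8 -- the one hot key now maps to 8 distinct partitions
```

The trade-off: after grouping on the salted key you get partial results per bucket, so a second aggregation over the original key is needed to merge them.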

Question 14: What is schema evolution and how do you handle it?

Schema evolution is when the structure of your source data changes over time — new columns added, columns removed, data types changed.

Challenges:

  • Old Parquet files have 10 columns, new ones have 12
  • Pipeline expects column “phone_number” but the source renamed it to “mobile”
  • Column type changed from string to integer

Solutions:

  • Schema-on-read: Store raw data as-is and handle schema differences at query time (this is what data lakes do)
  • Delta Lake schema evolution: The mergeSchema option automatically adds new columns
  • Schema registry: Centralized schema tracking (common with Kafka/Avro)
  • Data contracts: Formal agreements between producers and consumers about schema changes
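The "add new columns" behavior can be illustrated in plain Python with rows as dicts. This mimics the idea behind mergeSchema (old rows get nulls for newly added columns); it is a conceptual sketch, not Delta Lake's implementation:

```python
def merge_schema(old_rows, new_rows):
    """Union the columns seen across both batches (preserving first-seen
    order), then back-fill missing columns with None, the way a lakehouse
    table format surfaces old files after a column is added."""
    columns = []
    for row in (*old_rows, *new_rows):
        for col in row:
            if col not in columns:
                columns.append(col)
    return [{col: row.get(col) for col in columns}
            for row in (*old_rows, *new_rows)]

old = [{"id": 1, "phone_number": "555-0100"}]
new = [{"id": 2, "phone_number": "555-0101", "mobile": "555-0202"}]
merged = merge_schema(old, new)
print(merged[0])  # old row now carries mobile=None
```

Note what this does not solve: a renamed column ("phone_number" to "mobile") shows up as one column dying and another appearing, which is why renames usually need a data contract or an explicit mapping.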

Question 15: What is the difference between a view and a materialized view?

View: A saved SQL query. Does NOT store data. Every time you query the view, it re-executes the underlying SQL.

CREATE VIEW active_customers AS
SELECT * FROM customers WHERE status = 'Active';
-- Every query against this view runs the full SELECT

Materialized view: A saved SQL query that STORES the result. The data is pre-computed and cached. Faster to query but needs to be refreshed.

CREATE MATERIALIZED VIEW monthly_revenue AS
SELECT month, SUM(amount) as total
FROM orders GROUP BY month;
-- Data is stored. Queries are fast. But needs periodic refresh.

When to use views: Simple abstraction, security (restrict column access), always need current data

When to use materialized views: Expensive aggregations, dashboards that do not need real-time data, reducing repeated computation

Data Quality, Testing, and Operations

Question 16: How do you ensure data quality in a pipeline?

Validation checks at each pipeline stage:

  • Row count validation: Source count matches destination count
  • Null checks: Critical columns have no unexpected nulls
  • Uniqueness checks: Primary keys are unique
  • Range checks: Values fall within expected bounds (age between 0 and 150)
  • Referential integrity: Foreign keys exist in the referenced table
  • Freshness checks: Data is not stale (last update within expected window)
  • Schema validation: Columns and data types match expectations

Tools: Great Expectations, dbt tests, Soda, custom SQL assertions, ADF audit logging
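A few of the checks above, written as plain Python over a batch of rows. This is a hand-rolled sketch (a real pipeline would use one of the tools listed and log failures to an audit table rather than just returning them):

```python
def run_quality_checks(rows):
    """Run null, uniqueness, range, and freshness-style checks on a batch
    and return the list of failure messages (empty means all checks pass)."""
    failures = []
    if not rows:
        failures.append("freshness: batch is empty")
        return failures
    ids = [r.get("id") for r in rows]
    if any(i is None for i in ids):
        failures.append("null check: id contains nulls")
    if len(ids) != len(set(ids)):
        failures.append("uniqueness: duplicate ids")
    if any(not (0 <= r.get("age", 0) <= 150) for r in rows):
        failures.append("range check: age out of bounds")
    return failures

good = [{"id": 1, "age": 34}, {"id": 2, "age": 29}]
bad = [{"id": 1, "age": 34}, {"id": 1, "age": 200}]
print(run_quality_checks(good), run_quality_checks(bad))
```

The important design point is the same regardless of tool: checks run at each stage boundary, and a failing check either quarantines the batch or fails the pipeline loudly instead of letting bad data flow downstream.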

Question 17: What is data lineage and why does it matter?

Data lineage tracks the complete journey of data from source to destination — where it came from, what transformations were applied, and where it ended up.

Why it matters:

  • Debugging: When a dashboard shows wrong numbers, trace back to find where the error occurred
  • Impact analysis: Before changing a source table, see which downstream reports will be affected
  • Compliance: Regulators (GDPR, CCPA) require you to show how personal data flows through your systems
  • Trust: Business users need confidence that the numbers they see are derived correctly

Tools: Apache Atlas, Microsoft Purview, Collibra, dbt lineage graph, OpenLineage

Question 18: How do you monitor a production data pipeline?

What to monitor:

  • Pipeline status: Did it succeed, fail, or timeout?
  • Row counts: How many rows were processed? Does it match expectations?
  • Duration: Is the pipeline taking longer than usual? (Performance regression)
  • Data freshness: When was the last successful run?
  • Error rates: Are certain tables failing more than others?
  • Resource utilization: CPU, memory, DIU usage

How to monitor:

  • Built-in monitoring: ADF Monitor tab, Airflow UI, Databricks job history
  • Custom audit tables: Log pipeline metrics to a SQL table (as we built in our audit logging post)
  • Alerting: Email/Slack/PagerDuty notifications on failures
  • Dashboards: Grafana, Power BI, or Datadog dashboards showing pipeline health

Question 19: What is orchestration and what tools do you use?

Orchestration is coordinating the execution of multiple tasks in the right order, handling dependencies, retries, and failures.

What an orchestrator does:

  • Runs Task B only after Task A succeeds
  • Retries failed tasks with exponential backoff
  • Sends alerts on failure
  • Manages scheduling (daily, hourly, event-driven)
  • Provides a visual history of all runs

Common orchestration tools:

  • Azure Data Factory: managed, cloud-native; best for the Azure ecosystem
  • Apache Airflow: open-source, Python DAGs; best for multi-cloud and custom workflows
  • Databricks Workflows: managed, Spark-native; best for Databricks-centric platforms
  • Prefect / Dagster: modern open-source; best for Python-native data pipelines
  • AWS Step Functions: managed, serverless; best for the AWS ecosystem
  • dbt Cloud: managed, SQL-focused; best for SQL transformations
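Retry with exponential backoff, the behavior every orchestrator above provides, fits in a few lines. A minimal sketch (the sleep is injectable here purely so the example runs instantly):

```python
import time

def run_with_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call task(); on failure wait base_delay * 2**attempt (1s, 2s, 4s...)
    and try again, re-raising the error once attempts are exhausted."""
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)

calls = {"n": 0}
def flaky():
    """Simulated transient failure: fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

delays = []
result = run_with_retries(flaky, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

Exponential (rather than fixed) backoff matters because transient failures are often caused by an overloaded dependency; hammering it on a fixed short interval makes the outage worse.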

Question 20: How do you handle PII (Personally Identifiable Information) in data pipelines?

PII includes names, emails, phone numbers, addresses, SSNs, and any data that can identify a person.

Strategies:

  • Encryption at rest: All storage should be encrypted (ADLS, S3 default encryption)
  • Encryption in transit: TLS/HTTPS for all data movement
  • Masking: Replace PII with fake but realistic data in non-production environments
  • Tokenization: Replace PII with a reversible token (original stored securely)
  • Hashing: One-way hash for fields you need to match but not read (email matching)
  • Access control: RBAC and ACLs to restrict who can see PII columns
  • Column-level security: Some warehouses support restricting access to specific columns
  • Data retention policies: Automatically delete PII after the retention period
  • Right to deletion: Support GDPR “right to be forgotten” by design

Pipeline design for PII:

  • Land raw data (with PII) in a restricted Bronze zone
  • Mask/tokenize PII during Silver transformation
  • Gold layer contains no raw PII (only aggregated or masked data)
  • Only authorized pipelines and users can access Bronze
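The "hash for matching" strategy looks like this in Python's standard library. A sketch only: the salt value is illustrative, and in production the salt would come from a secret store, with keyed hashing (e.g. HMAC) preferred over plain concatenation:

```python
import hashlib

def hash_email(email, salt="illustrative-salt"):
    """One-way hash for match-but-not-read fields: the same email always
    produces the same token (so joins and deduplication still work), but
    the address itself cannot be recovered. Normalizing first ensures
    'Ada@Example.com' and 'ada@example.com ' match."""
    normalized = email.strip().lower()
    return hashlib.sha256((salt + normalized).encode()).hexdigest()

a = hash_email("Ada@Example.com")
b = hash_email("ada@example.com ")
print(a == b, len(a))  # True 64
```

The salt is what separates hashing from a lookup table: without it, an attacker with the Silver data could precompute hashes of known email addresses and reverse the tokens.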

Bonus: Questions You Should Ask the Interviewer

  1. “What does a typical data pipeline look like here?” — Shows you think practically
  2. “How do you handle schema changes from source systems?” — Shows you understand real challenges
  3. “What is your monitoring and alerting setup?” — Shows you think about operations
  4. “Do you follow the Medallion architecture or a different pattern?” — Shows architecture awareness
  5. “How does the team handle on-call and production incidents?” — Shows you are ready for real work
  6. “What is the team’s approach to testing data pipelines?” — Shows quality focus

Wrapping Up

Data engineering interviews test your ability to think about systems, not just write code. The questions above cover the five pillars that every interviewer evaluates:

  1. Pipeline design: ETL vs ELT, idempotency, backfill, exactly-once
  2. Data modeling: Star schema, SCD, partitioning, normalization
  3. Processing: Batch vs stream, Spark, data skew, schema evolution
  4. Data quality: Validation, lineage, monitoring, testing
  5. Operations: Orchestration, PII handling, alerting, incident response

Study these concepts, connect them to real projects you have built, and you will be well prepared for any data engineering interview.

Related posts:

  • Top 15 ADF Interview Questions
  • SQL Window Functions
  • SQL Joins Explained
  • Schema-on-Write vs Schema-on-Read
  • Parquet vs CSV vs JSON
  • Audit Logging in Data Pipelines

Good luck with your interviews. If this guide helped, share it with someone preparing for a data engineering role. Questions? Drop a comment below.


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
