How Real Companies Receive Data: SFTP, APIs, CDC, Event Streaming, and Every Ingestion Pattern Explained

You have learned to build pipelines, transform data, and write Delta tables. But there is a question nobody answers in tutorials: “Where does the data actually COME FROM?”

Tutorials show you a CSV file already sitting in ADLS Gen2. They say “read this Parquet file” as if it magically appeared. But in a real project, somebody has to GET that data into the data lake first. And how that happens depends entirely on who the source is, what technology they use, and how often the data changes.

This post covers every real-world data ingestion pattern — not theory, but what actually happens at banks, e-commerce companies, telecoms, and hospitals. By the end, you will understand the six approaches companies use, which industries prefer which patterns, and how the data flows from source to your Bronze layer.

Think of your data lake like a warehouse. Data does not magically appear on the shelves. It arrives through different doors: some vendors send a delivery truck on a schedule (SFTP batch files). Some let you call and request specific items (API). Some stream a constant supply line (Event Streaming). Some replicate their entire inventory to your warehouse in real-time (CDC). Understanding the doors is understanding data engineering in the real world.

The Six Ingestion Patterns
Pattern 1: SFTP File Drop (Traditional Industries)
Pattern 2: REST API Ingestion (Modern SaaS)
Pattern 3: Database Replication / CDC (Transactional Systems)
Pattern 4: Event Streaming / Real-Time (Data-Intensive Companies)
Pattern 5: Direct Cloud Storage Drop (Internal Teams)
Pattern 6: Third-Party Ingestion Tools (SaaS-Heavy Companies)
How Real Companies Combine Patterns
Banking (TD Bank, RBC Type)
E-Commerce (Amazon, Shopify Type)
Telecom (Rogers, Bell Type)
Healthcare (Hospital Networks)
Retail (Walmart, Costco Type)
Insurance
The Complete Architecture
Comparison Table
Mapping Patterns to Our Blog Pipelines
Common Mistakes in Data Ingestion
Interview Questions
Wrapping Up

The Six Ingestion Patterns

#	Pattern	How Data Arrives	Frequency	Common In
1	SFTP	Vendor drops files on SFTP, pipeline picks up	Hourly/Daily	Banks, Insurance, Telecom, Healthcare
2	API	Pipeline calls REST API, writes response	Minutes/Hourly	SaaS-heavy companies, Modern enterprises
3	CDC	Database changes streamed to data lake	Minutes/Real-time	Any company with transactional databases
4	Event Streaming	Kafka/Event Hubs continuous stream	Seconds/Real-time	Tech companies, IoT, E-commerce
5	Direct Drop	Internal teams upload to cloud storage	Ad-hoc/Daily	Internal data sharing
6	Third-Party Tools	Fivetran/Airbyte sync from 100+ sources	Automatic	Companies with many SaaS tools

Most real projects use a combination — SFTP for vendor files, CDC for internal databases, APIs for SaaS tools, and streaming for real-time events. The data lake is the landing zone for all of them.

Pattern 1: SFTP File Drop (Traditional Industries)

How It Works

The vendor does NOT drop files directly into your ADLS. They drop files on an SFTP server (Secure File Transfer Protocol), and your automated pipeline picks them up and moves them to the data lake.

Vendor System
    |
    | Uploads CSV/XML/Excel to SFTP server
    v
SFTP Server (sftp.company.com)
    |
    | ADF/Synapse polls every hour
    v
ADF Copy Activity
    |
    | Copies new files to ADLS Gen2
    v
ADLS Gen2 / bronze / vendor_name / 2026/05/20/
    |
    | Storage event trigger fires
    v
Databricks / Synapse transforms → Silver → Gold

Why SFTP and Not Direct Upload?

Security: The vendor never sees your cloud storage. They only know the SFTP address.
Control: You control when and how files are processed. The vendor just drops and leaves.
Compliance: Regulated industries (banking, healthcare) require files to land on controlled infrastructure first.
Compatibility: SFTP has existed since the 1990s. Every vendor, every system, every country supports it. No SDK, no cloud knowledge needed.

Real Examples

Banks receive daily transaction files from Visa and Mastercard via SFTP
Insurance companies receive claim files from hospitals via SFTP
Retail chains receive inventory files from warehouse management systems via SFTP
Telecom companies receive Call Detail Records (CDRs) from network equipment via SFTP
Government agencies receive tax filing data from regional offices via SFTP

The ADF Pipeline

Trigger: Schedule (every hour) or Event (new file detected)
    |
    v
Lookup: Check SFTP for new files since last poll
    |
    v
ForEach: For each new file
    |
    v
Copy Activity: SFTP source → ADLS Gen2 sink (Parquet or raw format)
    |
    v
Stored Procedure: Log file name, row count, timestamp to PIPELINE_LOG
    |
    v
Delete: Archive or remove file from SFTP after successful copy

Real-life analogy: SFTP is like a P.O. Box at the post office. The vendor mails a letter (file) to your P.O. Box (SFTP). You check the box on a schedule (pipeline polls), pick up the letter, and bring it home (ADLS). The vendor never knows your home address (cloud storage). They only know the P.O. Box number.

Pattern 2: REST API Ingestion (Modern SaaS)

How It Works

Data comes from REST APIs. A scheduled pipeline calls the API, receives a JSON response, and writes it to the data lake.

SaaS Application (Salesforce, Stripe, Shopify)
    |
    | Exposes REST API: GET /api/v2/orders?since=2026-05-19
    v
ADF Web Activity / Azure Function / Databricks Notebook
    |
    | Calls API, receives JSON response
    v
Transform JSON → write to ADLS Gen2 / bronze / salesforce / 2026/05/20/
    |
    v
Databricks transforms → Silver → Gold

Why APIs?

Real-time data: APIs can return data as recent as the last minute
Structured responses: JSON with defined schemas, not messy CSV files
Incremental: Most APIs support ?since=timestamp or ?updated_after=date for delta pulls
No file management: No SFTP servers, no file naming conventions, no missing files

Real Examples

Salesforce API → pull customer accounts and opportunity data daily
Stripe/PayPal API → pull payment transactions every hour
Shopify API → pull order and product catalog data
Google Analytics API → pull website traffic and user behavior data daily
Weather APIs → pull hourly weather data for logistics route optimization
Social media APIs → pull brand mentions for sentiment analysis
Jira/Azure DevOps API → pull project metrics for engineering dashboards

The Challenge: Pagination and Rate Limits

# Real API ingestion pattern in a Databricks notebook
import requests
import json
from datetime import datetime, timedelta

API_URL = "https://api.salesforce.com/services/data/v56.0/query"
headers = {"Authorization": f"Bearer {access_token}"}

# Incremental: only records modified since yesterday
yesterday = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%dT%H:%M:%SZ")
query = f"SELECT Id, Name, Email, LastModifiedDate FROM Contact WHERE LastModifiedDate > {yesterday}"

all_records = []
next_url = f"{API_URL}?q={query}"

# Handle pagination (APIs return results in pages)
while next_url:
    response = requests.get(next_url, headers=headers)
    data = response.json()
    all_records.extend(data["records"])

    if data.get("nextRecordsUrl"):
        next_url = f"https://api.salesforce.com{data['nextRecordsUrl']}"
    else:
        next_url = None

# Write to Delta
df = spark.createDataFrame(all_records)
df.write.format("delta").mode("append").save("/bronze/salesforce/contacts/")
print(f"Ingested {len(all_records)} contacts")

Real-life analogy: API ingestion is like calling a restaurant for delivery. You call (API request), place your order (query parameters), and they deliver the food (JSON response). You do not go to the restaurant — they bring it to you. But the restaurant has limits: “maximum 5 deliveries per hour” (rate limits) and “maximum 100 items per delivery” (pagination).

Pattern 3: Database Replication / CDC (Transactional Systems)

How It Works

Change Data Capture (CDC) monitors a source database and captures every INSERT, UPDATE, and DELETE. Changes are streamed to the data lake as they happen — no files, no APIs, no polling.

Source Database (Oracle/SQL Server/PostgreSQL)
    |
    | CDC tool captures every INSERT/UPDATE/DELETE from the transaction log
    v
CDC Tool (Debezium / AWS DMS / Oracle GoldenGate / ADF CDC)
    |
    | Streams changes as events
    v
ADLS Gen2 / bronze / orders_cdc / → Delta format with operation column (I/U/D)
    |
    v
Databricks applies changes → Silver (current state) → Gold

Why CDC?

Near real-time: Changes arrive within seconds to minutes
Efficient: Only changed rows are transferred, not the entire table
Complete: Captures updates and deletes, not just inserts (unlike watermark-based incremental)
Low impact: Reads from the database’s transaction log, not by querying the tables directly

CDC Tools

Tool	Vendor	Best For
Debezium	Open source	PostgreSQL, MySQL, MongoDB CDC to Kafka
AWS DMS	Amazon	Any database to AWS (S3, Redshift)
Oracle GoldenGate	Oracle	Oracle database replication
ADF CDC	Microsoft	SQL Server to ADLS Gen2
Fivetran	SaaS	Managed CDC from any database
Qlik Replicate (Attunity)	Qlik	Enterprise database replication

Real Examples

E-commerce: PostgreSQL orders database replicated to Delta Lake every 5 minutes via Debezium + Kafka
Banking: Core banking Oracle database replicated nightly to the analytics platform via GoldenGate
Hospital: Patient management system replicated to the data lake for clinical dashboards
Retail: Point-of-sale SQL Server database replicated for real-time inventory tracking

What CDC Data Looks Like

| operation | order_id | customer_id | amount | timestamp           |
|-----------|----------|-------------|--------|---------------------|
| INSERT    | 5001     | 1001        | 150.00 | 2026-05-20 10:00:01 |
| INSERT    | 5002     | 1002        | 89.99  | 2026-05-20 10:00:03 |
| UPDATE    | 5001     | 1001        | 175.00 | 2026-05-20 10:05:00 |
| DELETE    | 4999     | 1003        | 25.00  | 2026-05-20 10:10:00 |

The operation column tells your pipeline what to do: INSERT new rows, UPDATE existing rows, DELETE removed rows. This maps directly to Delta Lake’s MERGE command.

Real-life analogy: CDC is like a security camera on a warehouse door. Instead of taking inventory every night (full load) or checking what is new since your last visit (watermark), the camera records every item that enters (INSERT), gets modified (UPDATE), or leaves (DELETE) in real-time. At any point, you can replay the tape to rebuild the current inventory.

Pattern 4: Event Streaming / Real-Time (Data-Intensive Companies)

How It Works

Data arrives as a continuous stream of events through a message broker. No files, no batches — events flow in real-time.

Application / IoT Device / Website
    |
    | Sends events (JSON messages) continuously
    v
Azure Event Hubs / Apache Kafka / AWS Kinesis
    |
    | Millions of events per second
    v
Spark Structured Streaming / Azure Stream Analytics
    |
    | Processes micro-batches (every few seconds)
    v
Delta Lake (append new events)
    |
    v
Real-time dashboards / ML models / Alerts

Why Streaming?

Real-time insights: React to events as they happen, not hours later
Scale: Handle millions of events per second (Event Hubs supports 1M+ events/sec)
Decoupled: Producers send events without knowing who consumes them
Event sourcing: Complete history of every event, not just current state

Real Examples

Uber: Every ride request, driver location update, price surge calculation, and trip event streams through Kafka in real-time
Netflix: Every play, pause, skip, and browse event streams for real-time recommendations
E-commerce clickstream: Every page view, add-to-cart, and checkout step streams for personalization
IoT in manufacturing: Temperature, pressure, vibration readings from factory sensors every second
Fintech: Stock market tick data streaming for real-time trading analytics
Food delivery: Order status updates (placed, preparing, picked up, delivered) stream in real-time

The Architecture

# Spark Structured Streaming in Databricks
df_stream = spark.readStream     .format("eventhubs")     .options(**event_hub_config)     .load()

# Parse JSON events
df_parsed = df_stream     .withColumn("body", col("body").cast("string"))     .withColumn("event", from_json(col("body"), event_schema))     .select("event.*", "enqueuedTime")

# Write to Delta Lake (micro-batch every 10 seconds)
df_parsed.writeStream     .format("delta")     .outputMode("append")     .option("checkpointLocation", "/checkpoints/clickstream/")     .trigger(processingTime="10 seconds")     .start("/bronze/clickstream/")

Real-life analogy: Event streaming is like a live TV news broadcast. Events (stories) happen continuously. The broadcast (Kafka/Event Hubs) streams them immediately to millions of viewers (consumers). You can watch live (real-time dashboard) or record and watch later (write to Delta Lake). Nobody waits for the evening news (batch processing) — the news arrives as it happens.

Pattern 5: Direct Cloud Storage Drop (Internal Teams)

How It Works

Internal teams or cloud-native partners upload files directly to a cloud storage location (ADLS Gen2, S3).

Internal Team (Marketing, Finance, HR)
    |
    | Uploads CSV/Excel via Azure Portal, Storage Explorer, or automated script
    v
ADLS Gen2 / bronze / marketing / campaign_data / 2026/05/20/
    |
    | Storage event trigger detects new file
    v
ADF/Databricks pipeline runs → Silver → Gold

When This Actually Happens

This is less common than the other patterns but does happen: – Internal marketing team uploads weekly campaign performance CSVs – Finance team uploads monthly budget spreadsheets – HR team uploads employee census data for workforce analytics – Cloud-native partner with Azure Service Principal drops data directly – Azure Logic Apps or Power Automate writes Microsoft 365 data to ADLS

How to Secure It

Storage Account: datalake-prod
  Container: partner-drop
    Folder: /marketing/     ← Marketing team has write access ONLY here
    Folder: /finance/       ← Finance team has write access ONLY here
    Folder: /partner-abc/   ← Partner ABC has scoped Service Principal access

Each team/partner gets a Service Principal with Storage Blob Data Contributor
role scoped to THEIR specific folder only — they cannot see other folders.

Real-life analogy: This is like having labeled drop-off bins at a warehouse. The marketing team can only drop packages in the “Marketing” bin. The finance team can only use the “Finance” bin. Nobody can see what is in other bins. A warehouse worker (pipeline) checks the bins on a schedule and processes the contents.

Pattern 6: Third-Party Ingestion Tools (SaaS-Heavy Companies)

How It Works

Companies use managed ingestion platforms that connect to hundreds of sources and automatically sync data to the data lake.

100+ Source Systems (Salesforce, HubSpot, Jira, Zendesk, Stripe, etc.)
    |
    | Fivetran / Airbyte / Matillion connects to all of them
    v
Managed Ingestion Platform
    |
    | Automatically syncs data on schedule
    v
ADLS Gen2 / Snowflake / Databricks Lakehouse
    |
    v
Your pipelines transform the data

Why Third-Party Tools?

Connectors: Fivetran has 300+ pre-built connectors. Building each one manually would take months.
Maintenance: When Salesforce changes their API, Fivetran updates the connector — you do nothing.
Schema handling: Automatic schema evolution when source systems add columns.
Reliability: Automatic retries, deduplication, and incremental syncing built in.

Popular Tools

Tool	Type	Price	Best For
Fivetran	Managed SaaS	$$$$	Enterprise, 300+ connectors
Airbyte	Open source	Free/$$	Startups, cost-conscious teams
Matillion	Cloud ETL	$$$	Snowflake/Databricks heavy companies
Stitch	Managed SaaS	$$	Small-medium businesses
Hevo Data	Managed SaaS	$$	Indian market, real-time pipelines

Real Examples

Company with 50 SaaS tools uses Fivetran to sync all of them to Databricks automatically
Startup uses Airbyte (open source) to pull from 30 APIs into their data lake at zero connector cost

Real-life analogy: Third-party ingestion tools are like a mail forwarding service. You give them the addresses of all your 50 magazine subscriptions (sources). They collect every magazine (data), sort them, and deliver everything to your house (data lake) on schedule. You never visit 50 different publishers — the service handles it all.

How Real Companies Combine Patterns

Banking (TD Bank, RBC Type)

Core Banking System (Oracle)  ──── CDC (GoldenGate) ────►  ADLS Bronze
Visa/Mastercard Transactions  ──── SFTP (daily files) ───►  ADLS Bronze
Regulatory Feeds (OSFI)       ──── SFTP (daily files) ───►  ADLS Bronze
Market Data (Bloomberg)       ──── API (every 15 min) ───►  ADLS Bronze
CRM (Salesforce)              ──── Fivetran (hourly) ────►  ADLS Bronze
                                                               │
                                                    Synapse/Databricks
                                                               │
                                                    Gold ──► Power BI + Risk Models

E-Commerce (Amazon, Shopify Type)

Orders DB (PostgreSQL)        ──── CDC (Debezium+Kafka) ──►  Delta Lake
Clickstream (web/mobile)      ──── Event Hubs (real-time) ─►  Delta Lake
Payments (Stripe API)         ──── API (every hour) ───────►  ADLS Bronze
Product Catalog               ──── API (daily) ───────────►  ADLS Bronze
Warehouse Inventory           ──── SFTP (hourly files) ────►  ADLS Bronze
                                                               │
                                                    Databricks
                                                               │
                                            Recommendations + Dashboards + ML

Telecom (Rogers, Bell Type)

Network Equipment CDRs        ──── SFTP (hourly files) ────►  ADLS Bronze
CRM (Salesforce)              ──── Fivetran (hourly) ──────►  ADLS Bronze
Billing System (Oracle)       ──── CDC (GoldenGate) ───────►  ADLS Bronze
Cell Tower IoT Sensors        ──── Event Hubs (real-time) ──►  Delta Lake
Customer App Events           ──── Kafka (real-time) ──────►  Delta Lake
                                                               │
                                                    Synapse/Databricks
                                                               │
                                        Network Analytics + Churn Prediction

Healthcare (Hospital Networks)

Patient Records (Epic/Cerner) ──── HL7/FHIR API (real-time) ─►  ADLS Bronze
Lab Results                   ──── SFTP (hourly files) ────────►  ADLS Bronze
Insurance Claims              ──── SFTP (daily files) ─────────►  ADLS Bronze
Medical Device IoT            ──── IoT Hub (real-time) ────────►  Delta Lake
Pharmacy Systems              ──── CDC (SQL Server) ───────────►  ADLS Bronze
                                                                    │
                                                         Databricks (with PHI masking)
                                                                    │
                                                Clinical Dashboards + Research Datasets

Retail (Walmart, Costco Type)

Point-of-Sale (POS)           ──── CDC (SQL Server) ───────►  ADLS Bronze
Supplier Inventory             ──── SFTP (daily files) ─────►  ADLS Bronze
E-commerce Orders              ──── CDC + API (real-time) ──►  ADLS Bronze
Customer Loyalty Program       ──── API (hourly) ──────────►  ADLS Bronze
Supply Chain Tracking          ──── Event Hubs (real-time) ─►  Delta Lake
Weather Data (for demand)      ──── API (hourly) ──────────►  ADLS Bronze
                                                                │
                                                     Databricks
                                                                │
                                        Demand Forecasting + Inventory Optimization

Insurance

Policy Admin System (Oracle)  ──── CDC (nightly) ──────────►  ADLS Bronze
Claims (from hospitals)       ──── SFTP (daily files) ─────►  ADLS Bronze
Reinsurance Data              ──── SFTP (monthly files) ───►  ADLS Bronze
Agent Portal                  ──── API (hourly) ──────────►  ADLS Bronze
Third-Party Risk Data         ──── Fivetran (daily) ──────►  ADLS Bronze
                                                                │
                                                     Synapse
                                                                │
                                        Claims Analytics + Fraud Detection

The Complete Architecture

Sources (left)                    Landing (center)           Processing (right)

SFTP Files      ─────┐
API Responses   ─────┤
CDC Streams     ─────┤──►  ADLS Gen2 BRONZE  ──►  Databricks/Synapse  ──►  GOLD
Event Streams   ─────┤         (Raw data)          (Silver transforms)    (Dashboards)
Direct Drops    ─────┤                                                     (ML Models)
Fivetran/Airbyte ────┘                                                     (Reports)

Every source lands in Bronze (raw, as-is). Transformations happen in Silver. Business-ready data lives in Gold. The ingestion pattern changes. The downstream architecture stays the same.

Comparison Table

Factor	SFTP	REST API	CDC	Event Streaming	Direct Drop	Third-Party Tools
Latency	Hours (batch)	Minutes to hours	Seconds to minutes	Seconds (real-time)	Ad-hoc	Minutes to hours
Data format	CSV, XML, Excel, fixed-width	JSON, XML	Row-level changes (I/U/D)	JSON events	CSV, Excel, Parquet	Structured tables
Captures deletes?	❌ (only what is in the file)	❌ (usually not)	✅ (full I/U/D)	Depends on producer	❌	✅ (some tools)
Setup complexity	Low	Medium (pagination, auth, rate limits)	High (log access, CDC tool config)	High (Kafka/Event Hubs infra)	Very low	Low (managed service)
Maintenance	Low	Medium (API changes break pipelines)	Medium (source DB upgrades can break CDC)	High (cluster management, scaling)	Very low	Very low (vendor maintains connectors)
Source impact	None (file is pre-generated)	Low-Medium (API calls use source compute)	Very low (reads transaction log, not tables)	None (producer pushes events)	None	Low
Cost	Low (SFTP server only)	Low (API calls are usually free)	Medium (CDC tool licensing)	High (Kafka/Event Hubs cluster)	Free	High (SaaS subscription per connector)
Best for	Vendor files, regulated industries	SaaS data, third-party services	Internal databases, near real-time sync	Clickstream, IoT, real-time analytics	Internal ad-hoc sharing	Companies with 20+ SaaS sources
Azure tool	ADF Copy + SFTP connector	ADF Web Activity / Azure Function	ADF CDC / Debezium / Fabric Mirroring	Event Hubs + Spark Streaming	ADLS + Event Trigger	Fivetran / Airbyte

The decision rule: If the source is a vendor with legacy systems → SFTP. If the source is a SaaS product → API or third-party tool. If the source is your own database → CDC. If the source produces continuous events → streaming. If the source is an internal team → direct drop. Most companies use 3-4 of these patterns simultaneously.

Mapping Patterns to Our Blog Pipelines

Ingestion Pattern	Our Blog Pipeline	Post
SFTP file pickup	ADF Copy from SQL to ADLS (similar pattern)	Metadata-Driven Pipeline
API incremental pull	Watermark-based incremental load	Incremental Loading
CDC apply changes	Delta MERGE for SCD	SCD with Delta MERGE
Event Streaming	Structured Streaming to Delta	(Future post)
File-triggered pipeline	Event trigger on ADLS	ADF Triggers
Multi-source config	Unified Full+Incremental	Unified Pipeline
Bronze → Silver transform	Data quality + cleaning	Data Quality Framework
Silver → Gold	SCD dimensions + aggregations	Medallion Architecture

Common Mistakes in Data Ingestion

Assuming all data comes as files — in a modern company, APIs and CDC are often more common than file drops. Design your architecture for multiple ingestion patterns.
Polling SFTP too frequently — if the vendor drops files at 6 AM daily, polling every 5 minutes wastes resources. Match polling frequency to the source’s delivery schedule.
Not handling API rate limits — calling Salesforce API 1,000 times per minute gets you blocked. Implement exponential backoff and respect rate limit headers.
Full load when CDC is available — copying the entire 10M-row table daily when only 1,000 rows changed wastes time and money. Use CDC for transactional databases.
No dead letter queue for streaming — if a streaming event fails to process, it should go to a dead letter topic, not be silently dropped.
Not validating incoming data — vendor files can have wrong formats, missing columns, or corrupt rows. Always validate at the Bronze layer before processing.

Interview Questions

Q: How do companies typically receive data from external vendors? A: Most commonly via SFTP — the vendor uploads files to an SFTP server, and an ADF pipeline polls for new files on a schedule. For modern SaaS vendors, REST APIs are increasingly common. Some vendors use CDC for database replication. The choice depends on the vendor’s capabilities and the required data freshness.

Q: What is CDC and when would you use it? A: Change Data Capture monitors a source database’s transaction log and captures every INSERT, UPDATE, and DELETE as it happens. Use CDC when you need near real-time replication of a transactional database without running expensive SELECT queries against it. Tools include Debezium, AWS DMS, Oracle GoldenGate, and ADF CDC.

Q: What is the difference between batch and streaming ingestion? A: Batch ingestion processes data in scheduled intervals (hourly, daily files via SFTP or API). Streaming ingestion processes data continuously as events arrive (via Kafka or Event Hubs). Batch is simpler and sufficient for most use cases. Streaming is necessary when you need sub-minute latency (real-time dashboards, fraud detection, live recommendations).

Q: In a real project, how many ingestion patterns would you typically use? A: Most enterprise projects use 3-4 patterns simultaneously: SFTP for vendor files, CDC for internal databases, APIs for SaaS tools, and possibly streaming for real-time events. The data lake (ADLS Gen2) serves as the unified landing zone for all patterns, and the downstream transformation layer (Databricks/Synapse) is the same regardless of how data arrives.

Wrapping Up

Data does not magically appear in your data lake. It arrives through specific doors: SFTP for traditional vendors, APIs for modern SaaS, CDC for database replication, Event Hubs for real-time streaming, direct drops for internal teams, and Fivetran for managed ingestion.

Understanding these patterns is what separates a pipeline builder from a data engineer. A pipeline builder knows how to read a CSV from ADLS. A data engineer knows how that CSV got to ADLS in the first place — and can design the architecture to handle 6 different source types landing in Bronze, flowing through Silver, and serving Gold to 100 analysts.

Every pipeline we built on this blog fits into this picture. The metadata-driven pipeline is how you process SFTP files at scale. The incremental load is how you pull API data efficiently. Delta MERGE is how you apply CDC changes. And the Medallion Architecture is how you organize it all.

← Previous: Schema-on-Write vs Read

More (5/6)

Next: Table Creation and Governance →

Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.

How Real Companies Receive Data: SFTP, APIs, CDC, Event Streaming, and Every Ingestion Pattern Explained

Table of Contents

The Six Ingestion Patterns

Pattern 1: SFTP File Drop (Traditional Industries)

How It Works

Why SFTP and Not Direct Upload?

Real Examples

The ADF Pipeline

Pattern 2: REST API Ingestion (Modern SaaS)

How It Works

Why APIs?

Real Examples

The Challenge: Pagination and Rate Limits

Pattern 3: Database Replication / CDC (Transactional Systems)

How It Works

Why CDC?

CDC Tools

Real Examples

What CDC Data Looks Like

Pattern 4: Event Streaming / Real-Time (Data-Intensive Companies)

How It Works

Why Streaming?

Real Examples

The Architecture

Pattern 5: Direct Cloud Storage Drop (Internal Teams)

How It Works

When This Actually Happens

How to Secure It

Pattern 6: Third-Party Ingestion Tools (SaaS-Heavy Companies)

How It Works

Why Third-Party Tools?

Popular Tools

Real Examples

How Real Companies Combine Patterns

Banking (TD Bank, RBC Type)

E-Commerce (Amazon, Shopify Type)

Telecom (Rogers, Bell Type)

Healthcare (Hospital Networks)

Retail (Walmart, Costco Type)

Insurance

The Complete Architecture

Comparison Table

Mapping Patterns to Our Blog Pipelines

Common Mistakes in Data Ingestion

Interview Questions

Wrapping Up

Related Posts

Leave a Comment Cancel Reply