How Real Companies Receive Data: SFTP, APIs, CDC, Event Streaming, and Every Ingestion Pattern Explained
You have learned to build pipelines, transform data, and write Delta tables. But there is a question nobody answers in tutorials: “Where does the data actually COME FROM?”
Tutorials show you a CSV file already sitting in ADLS Gen2. They say “read this Parquet file” as if it magically appeared. But in a real project, somebody has to GET that data into the data lake first. And how that happens depends entirely on who the source is, what technology they use, and how often the data changes.
This post covers every real-world data ingestion pattern — not theory, but what actually happens at banks, e-commerce companies, telecoms, and hospitals. By the end, you will understand the six approaches companies use, which industries prefer which patterns, and how the data flows from source to your Bronze layer.
Think of your data lake like a warehouse. Data does not magically appear on the shelves. It arrives through different doors: some vendors send a delivery truck on a schedule (SFTP batch files). Some let you call and request specific items (API). Some stream a constant supply line (Event Streaming). Some replicate their entire inventory to your warehouse in real-time (CDC). Understanding the doors is understanding data engineering in the real world.
Table of Contents
- The Six Ingestion Patterns
- Pattern 1: SFTP File Drop (Traditional Industries)
- Pattern 2: REST API Ingestion (Modern SaaS)
- Pattern 3: Database Replication / CDC (Transactional Systems)
- Pattern 4: Event Streaming / Real-Time (Data-Intensive Companies)
- Pattern 5: Direct Cloud Storage Drop (Internal Teams)
- Pattern 6: Third-Party Ingestion Tools (SaaS-Heavy Companies)
- How Real Companies Combine Patterns
- Banking (TD Bank, RBC Type)
- E-Commerce (Amazon, Shopify Type)
- Telecom (Rogers, Bell Type)
- Healthcare (Hospital Networks)
- Retail (Walmart, Costco Type)
- Insurance
- The Complete Architecture
- Comparison Table
- Mapping Patterns to Our Blog Pipelines
- Common Mistakes in Data Ingestion
- Interview Questions
- Wrapping Up
The Six Ingestion Patterns
| # | Pattern | How Data Arrives | Frequency | Common In |
|---|---|---|---|---|
| 1 | SFTP | Vendor drops files on SFTP, pipeline picks up | Hourly/Daily | Banks, Insurance, Telecom, Healthcare |
| 2 | API | Pipeline calls REST API, writes response | Minutes/Hourly | SaaS-heavy companies, Modern enterprises |
| 3 | CDC | Database changes streamed to data lake | Minutes/Real-time | Any company with transactional databases |
| 4 | Event Streaming | Kafka/Event Hubs continuous stream | Seconds/Real-time | Tech companies, IoT, E-commerce |
| 5 | Direct Drop | Internal teams upload to cloud storage | Ad-hoc/Daily | Internal data sharing |
| 6 | Third-Party Tools | Fivetran/Airbyte sync from 100+ sources | Automatic | Companies with many SaaS tools |
Most real projects use a combination — SFTP for vendor files, CDC for internal databases, APIs for SaaS tools, and streaming for real-time events. The data lake is the landing zone for all of them.
Pattern 1: SFTP File Drop (Traditional Industries)
How It Works
The vendor does NOT drop files directly into your ADLS. They drop files on an SFTP server (Secure File Transfer Protocol), and your automated pipeline picks them up and moves them to the data lake.
Vendor System
|
| Uploads CSV/XML/Excel to SFTP server
v
SFTP Server (sftp.company.com)
|
| ADF/Synapse polls every hour
v
ADF Copy Activity
|
| Copies new files to ADLS Gen2
v
ADLS Gen2 / bronze / vendor_name / 2026/05/20/
|
| Storage event trigger fires
v
Databricks / Synapse transforms → Silver → Gold
Why SFTP and Not Direct Upload?
- Security: The vendor never sees your cloud storage. They only know the SFTP address.
- Control: You control when and how files are processed. The vendor just drops and leaves.
- Compliance: Regulated industries (banking, healthcare) require files to land on controlled infrastructure first.
- Compatibility: SFTP has existed since the 1990s. Every vendor, every system, every country supports it. No SDK, no cloud knowledge needed.
Real Examples
- Banks receive daily transaction files from Visa and Mastercard via SFTP
- Insurance companies receive claim files from hospitals via SFTP
- Retail chains receive inventory files from warehouse management systems via SFTP
- Telecom companies receive Call Detail Records (CDRs) from network equipment via SFTP
- Government agencies receive tax filing data from regional offices via SFTP
The ADF Pipeline
Trigger: Schedule (every hour) or Event (new file detected)
|
v
Lookup: Check SFTP for new files since last poll
|
v
ForEach: For each new file
|
v
Copy Activity: SFTP source → ADLS Gen2 sink (Parquet or raw format)
|
v
Stored Procedure: Log file name, row count, timestamp to PIPELINE_LOG
|
v
Delete: Archive or remove file from SFTP after successful copy
Real-life analogy: SFTP is like a P.O. Box at the post office. The vendor mails a letter (file) to your P.O. Box (SFTP). You check the box on a schedule (pipeline polls), pick up the letter, and bring it home (ADLS). The vendor never knows your home address (cloud storage). They only know the P.O. Box number.
Pattern 2: REST API Ingestion (Modern SaaS)
How It Works
Data comes from REST APIs. A scheduled pipeline calls the API, receives a JSON response, and writes it to the data lake.
SaaS Application (Salesforce, Stripe, Shopify)
|
| Exposes REST API: GET /api/v2/orders?since=2026-05-19
v
ADF Web Activity / Azure Function / Databricks Notebook
|
| Calls API, receives JSON response
v
Transform JSON → write to ADLS Gen2 / bronze / salesforce / 2026/05/20/
|
v
Databricks transforms → Silver → Gold
Why APIs?
- Real-time data: APIs can return data as recent as the last minute
- Structured responses: JSON with defined schemas, not messy CSV files
- Incremental: Most APIs support
?since=timestampor?updated_after=datefor delta pulls - No file management: No SFTP servers, no file naming conventions, no missing files
Real Examples
- Salesforce API → pull customer accounts and opportunity data daily
- Stripe/PayPal API → pull payment transactions every hour
- Shopify API → pull order and product catalog data
- Google Analytics API → pull website traffic and user behavior data daily
- Weather APIs → pull hourly weather data for logistics route optimization
- Social media APIs → pull brand mentions for sentiment analysis
- Jira/Azure DevOps API → pull project metrics for engineering dashboards
The Challenge: Pagination and Rate Limits
# Real API ingestion pattern in a Databricks notebook
import requests
import json
from datetime import datetime, timedelta
API_URL = "https://api.salesforce.com/services/data/v56.0/query"
headers = {"Authorization": f"Bearer {access_token}"}
# Incremental: only records modified since yesterday
yesterday = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%dT%H:%M:%SZ")
query = f"SELECT Id, Name, Email, LastModifiedDate FROM Contact WHERE LastModifiedDate > {yesterday}"
all_records = []
next_url = f"{API_URL}?q={query}"
# Handle pagination (APIs return results in pages)
while next_url:
response = requests.get(next_url, headers=headers)
data = response.json()
all_records.extend(data["records"])
if data.get("nextRecordsUrl"):
next_url = f"https://api.salesforce.com{data['nextRecordsUrl']}"
else:
next_url = None
# Write to Delta
df = spark.createDataFrame(all_records)
df.write.format("delta").mode("append").save("/bronze/salesforce/contacts/")
print(f"Ingested {len(all_records)} contacts")
Real-life analogy: API ingestion is like calling a restaurant for delivery. You call (API request), place your order (query parameters), and they deliver the food (JSON response). You do not go to the restaurant — they bring it to you. But the restaurant has limits: “maximum 5 deliveries per hour” (rate limits) and “maximum 100 items per delivery” (pagination).
Pattern 3: Database Replication / CDC (Transactional Systems)
How It Works
Change Data Capture (CDC) monitors a source database and captures every INSERT, UPDATE, and DELETE. Changes are streamed to the data lake as they happen — no files, no APIs, no polling.
Source Database (Oracle/SQL Server/PostgreSQL)
|
| CDC tool captures every INSERT/UPDATE/DELETE from the transaction log
v
CDC Tool (Debezium / AWS DMS / Oracle GoldenGate / ADF CDC)
|
| Streams changes as events
v
ADLS Gen2 / bronze / orders_cdc / → Delta format with operation column (I/U/D)
|
v
Databricks applies changes → Silver (current state) → Gold
Why CDC?
- Near real-time: Changes arrive within seconds to minutes
- Efficient: Only changed rows are transferred, not the entire table
- Complete: Captures updates and deletes, not just inserts (unlike watermark-based incremental)
- Low impact: Reads from the database’s transaction log, not by querying the tables directly
CDC Tools
| Tool | Vendor | Best For |
|---|---|---|
| Debezium | Open source | PostgreSQL, MySQL, MongoDB CDC to Kafka |
| AWS DMS | Amazon | Any database to AWS (S3, Redshift) |
| Oracle GoldenGate | Oracle | Oracle database replication |
| ADF CDC | Microsoft | SQL Server to ADLS Gen2 |
| Fivetran | SaaS | Managed CDC from any database |
| Qlik Replicate (Attunity) | Qlik | Enterprise database replication |
Real Examples
- E-commerce: PostgreSQL orders database replicated to Delta Lake every 5 minutes via Debezium + Kafka
- Banking: Core banking Oracle database replicated nightly to the analytics platform via GoldenGate
- Hospital: Patient management system replicated to the data lake for clinical dashboards
- Retail: Point-of-sale SQL Server database replicated for real-time inventory tracking
What CDC Data Looks Like
| operation | order_id | customer_id | amount | timestamp |
|-----------|----------|-------------|--------|---------------------|
| INSERT | 5001 | 1001 | 150.00 | 2026-05-20 10:00:01 |
| INSERT | 5002 | 1002 | 89.99 | 2026-05-20 10:00:03 |
| UPDATE | 5001 | 1001 | 175.00 | 2026-05-20 10:05:00 |
| DELETE | 4999 | 1003 | 25.00 | 2026-05-20 10:10:00 |
The operation column tells your pipeline what to do: INSERT new rows, UPDATE existing rows, DELETE removed rows. This maps directly to Delta Lake’s MERGE command.
Real-life analogy: CDC is like a security camera on a warehouse door. Instead of taking inventory every night (full load) or checking what is new since your last visit (watermark), the camera records every item that enters (INSERT), gets modified (UPDATE), or leaves (DELETE) in real-time. At any point, you can replay the tape to rebuild the current inventory.
Pattern 4: Event Streaming / Real-Time (Data-Intensive Companies)
How It Works
Data arrives as a continuous stream of events through a message broker. No files, no batches — events flow in real-time.
Application / IoT Device / Website
|
| Sends events (JSON messages) continuously
v
Azure Event Hubs / Apache Kafka / AWS Kinesis
|
| Millions of events per second
v
Spark Structured Streaming / Azure Stream Analytics
|
| Processes micro-batches (every few seconds)
v
Delta Lake (append new events)
|
v
Real-time dashboards / ML models / Alerts
Why Streaming?
- Real-time insights: React to events as they happen, not hours later
- Scale: Handle millions of events per second (Event Hubs supports 1M+ events/sec)
- Decoupled: Producers send events without knowing who consumes them
- Event sourcing: Complete history of every event, not just current state
Real Examples
- Uber: Every ride request, driver location update, price surge calculation, and trip event streams through Kafka in real-time
- Netflix: Every play, pause, skip, and browse event streams for real-time recommendations
- E-commerce clickstream: Every page view, add-to-cart, and checkout step streams for personalization
- IoT in manufacturing: Temperature, pressure, vibration readings from factory sensors every second
- Fintech: Stock market tick data streaming for real-time trading analytics
- Food delivery: Order status updates (placed, preparing, picked up, delivered) stream in real-time
The Architecture
# Spark Structured Streaming in Databricks
df_stream = spark.readStream .format("eventhubs") .options(**event_hub_config) .load()
# Parse JSON events
df_parsed = df_stream .withColumn("body", col("body").cast("string")) .withColumn("event", from_json(col("body"), event_schema)) .select("event.*", "enqueuedTime")
# Write to Delta Lake (micro-batch every 10 seconds)
df_parsed.writeStream .format("delta") .outputMode("append") .option("checkpointLocation", "/checkpoints/clickstream/") .trigger(processingTime="10 seconds") .start("/bronze/clickstream/")
Real-life analogy: Event streaming is like a live TV news broadcast. Events (stories) happen continuously. The broadcast (Kafka/Event Hubs) streams them immediately to millions of viewers (consumers). You can watch live (real-time dashboard) or record and watch later (write to Delta Lake). Nobody waits for the evening news (batch processing) — the news arrives as it happens.
Pattern 5: Direct Cloud Storage Drop (Internal Teams)
How It Works
Internal teams or cloud-native partners upload files directly to a cloud storage location (ADLS Gen2, S3).
Internal Team (Marketing, Finance, HR)
|
| Uploads CSV/Excel via Azure Portal, Storage Explorer, or automated script
v
ADLS Gen2 / bronze / marketing / campaign_data / 2026/05/20/
|
| Storage event trigger detects new file
v
ADF/Databricks pipeline runs → Silver → Gold
When This Actually Happens
This is less common than the other patterns but does happen: – Internal marketing team uploads weekly campaign performance CSVs – Finance team uploads monthly budget spreadsheets – HR team uploads employee census data for workforce analytics – Cloud-native partner with Azure Service Principal drops data directly – Azure Logic Apps or Power Automate writes Microsoft 365 data to ADLS
How to Secure It
Storage Account: datalake-prod
Container: partner-drop
Folder: /marketing/ ← Marketing team has write access ONLY here
Folder: /finance/ ← Finance team has write access ONLY here
Folder: /partner-abc/ ← Partner ABC has scoped Service Principal access
Each team/partner gets a Service Principal with Storage Blob Data Contributor
role scoped to THEIR specific folder only — they cannot see other folders.
Real-life analogy: This is like having labeled drop-off bins at a warehouse. The marketing team can only drop packages in the “Marketing” bin. The finance team can only use the “Finance” bin. Nobody can see what is in other bins. A warehouse worker (pipeline) checks the bins on a schedule and processes the contents.
Pattern 6: Third-Party Ingestion Tools (SaaS-Heavy Companies)
How It Works
Companies use managed ingestion platforms that connect to hundreds of sources and automatically sync data to the data lake.
100+ Source Systems (Salesforce, HubSpot, Jira, Zendesk, Stripe, etc.)
|
| Fivetran / Airbyte / Matillion connects to all of them
v
Managed Ingestion Platform
|
| Automatically syncs data on schedule
v
ADLS Gen2 / Snowflake / Databricks Lakehouse
|
v
Your pipelines transform the data
Why Third-Party Tools?
- Connectors: Fivetran has 300+ pre-built connectors. Building each one manually would take months.
- Maintenance: When Salesforce changes their API, Fivetran updates the connector — you do nothing.
- Schema handling: Automatic schema evolution when source systems add columns.
- Reliability: Automatic retries, deduplication, and incremental syncing built in.
Popular Tools
| Tool | Type | Price | Best For |
|---|---|---|---|
| Fivetran | Managed SaaS | $$$$ | Enterprise, 300+ connectors |
| Airbyte | Open source | Free/$$ | Startups, cost-conscious teams |
| Matillion | Cloud ETL | $$$ | Snowflake/Databricks heavy companies |
| Stitch | Managed SaaS | $$ | Small-medium businesses |
| Hevo Data | Managed SaaS | $$ | Indian market, real-time pipelines |
Real Examples
- Company with 50 SaaS tools uses Fivetran to sync all of them to Databricks automatically
- Startup uses Airbyte (open source) to pull from 30 APIs into their data lake at zero connector cost
Real-life analogy: Third-party ingestion tools are like a mail forwarding service. You give them the addresses of all your 50 magazine subscriptions (sources). They collect every magazine (data), sort them, and deliver everything to your house (data lake) on schedule. You never visit 50 different publishers — the service handles it all.
How Real Companies Combine Patterns
Banking (TD Bank, RBC Type)
Core Banking System (Oracle) ──── CDC (GoldenGate) ────► ADLS Bronze
Visa/Mastercard Transactions ──── SFTP (daily files) ───► ADLS Bronze
Regulatory Feeds (OSFI) ──── SFTP (daily files) ───► ADLS Bronze
Market Data (Bloomberg) ──── API (every 15 min) ───► ADLS Bronze
CRM (Salesforce) ──── Fivetran (hourly) ────► ADLS Bronze
│
Synapse/Databricks
│
Gold ──► Power BI + Risk Models
E-Commerce (Amazon, Shopify Type)
Orders DB (PostgreSQL) ──── CDC (Debezium+Kafka) ──► Delta Lake
Clickstream (web/mobile) ──── Event Hubs (real-time) ─► Delta Lake
Payments (Stripe API) ──── API (every hour) ───────► ADLS Bronze
Product Catalog ──── API (daily) ───────────► ADLS Bronze
Warehouse Inventory ──── SFTP (hourly files) ────► ADLS Bronze
│
Databricks
│
Recommendations + Dashboards + ML
Telecom (Rogers, Bell Type)
Network Equipment CDRs ──── SFTP (hourly files) ────► ADLS Bronze
CRM (Salesforce) ──── Fivetran (hourly) ──────► ADLS Bronze
Billing System (Oracle) ──── CDC (GoldenGate) ───────► ADLS Bronze
Cell Tower IoT Sensors ──── Event Hubs (real-time) ──► Delta Lake
Customer App Events ──── Kafka (real-time) ──────► Delta Lake
│
Synapse/Databricks
│
Network Analytics + Churn Prediction
Healthcare (Hospital Networks)
Patient Records (Epic/Cerner) ──── HL7/FHIR API (real-time) ─► ADLS Bronze
Lab Results ──── SFTP (hourly files) ────────► ADLS Bronze
Insurance Claims ──── SFTP (daily files) ─────────► ADLS Bronze
Medical Device IoT ──── IoT Hub (real-time) ────────► Delta Lake
Pharmacy Systems ──── CDC (SQL Server) ───────────► ADLS Bronze
│
Databricks (with PHI masking)
│
Clinical Dashboards + Research Datasets
Retail (Walmart, Costco Type)
Point-of-Sale (POS) ──── CDC (SQL Server) ───────► ADLS Bronze
Supplier Inventory ──── SFTP (daily files) ─────► ADLS Bronze
E-commerce Orders ──── CDC + API (real-time) ──► ADLS Bronze
Customer Loyalty Program ──── API (hourly) ──────────► ADLS Bronze
Supply Chain Tracking ──── Event Hubs (real-time) ─► Delta Lake
Weather Data (for demand) ──── API (hourly) ──────────► ADLS Bronze
│
Databricks
│
Demand Forecasting + Inventory Optimization
Insurance
Policy Admin System (Oracle) ──── CDC (nightly) ──────────► ADLS Bronze
Claims (from hospitals) ──── SFTP (daily files) ─────► ADLS Bronze
Reinsurance Data ──── SFTP (monthly files) ───► ADLS Bronze
Agent Portal ──── API (hourly) ──────────► ADLS Bronze
Third-Party Risk Data ──── Fivetran (daily) ──────► ADLS Bronze
│
Synapse
│
Claims Analytics + Fraud Detection
The Complete Architecture
Sources (left) Landing (center) Processing (right)
SFTP Files ─────┐
API Responses ─────┤
CDC Streams ─────┤──► ADLS Gen2 BRONZE ──► Databricks/Synapse ──► GOLD
Event Streams ─────┤ (Raw data) (Silver transforms) (Dashboards)
Direct Drops ─────┤ (ML Models)
Fivetran/Airbyte ────┘ (Reports)
Every source lands in Bronze (raw, as-is). Transformations happen in Silver. Business-ready data lives in Gold. The ingestion pattern changes. The downstream architecture stays the same.
Mapping Patterns to Our Blog Pipelines
| Ingestion Pattern | Our Blog Pipeline | Post |
|---|---|---|
| SFTP file pickup | ADF Copy from SQL to ADLS (similar pattern) | Metadata-Driven Pipeline |
| API incremental pull | Watermark-based incremental load | Incremental Loading |
| CDC apply changes | Delta MERGE for SCD | SCD with Delta MERGE |
| Event Streaming | Structured Streaming to Delta | (Future post) |
| File-triggered pipeline | Event trigger on ADLS | ADF Triggers |
| Multi-source config | Unified Full+Incremental | Unified Pipeline |
| Bronze → Silver transform | Data quality + cleaning | Data Quality Framework |
| Silver → Gold | SCD dimensions + aggregations | Medallion Architecture |
Common Mistakes in Data Ingestion
-
Assuming all data comes as files — in a modern company, APIs and CDC are often more common than file drops. Design your architecture for multiple ingestion patterns.
-
Polling SFTP too frequently — if the vendor drops files at 6 AM daily, polling every 5 minutes wastes resources. Match polling frequency to the source’s delivery schedule.
-
Not handling API rate limits — calling Salesforce API 1,000 times per minute gets you blocked. Implement exponential backoff and respect rate limit headers.
-
Full load when CDC is available — copying the entire 10M-row table daily when only 1,000 rows changed wastes time and money. Use CDC for transactional databases.
-
No dead letter queue for streaming — if a streaming event fails to process, it should go to a dead letter topic, not be silently dropped.
-
Not validating incoming data — vendor files can have wrong formats, missing columns, or corrupt rows. Always validate at the Bronze layer before processing.
Interview Questions
Q: How do companies typically receive data from external vendors? A: Most commonly via SFTP — the vendor uploads files to an SFTP server, and an ADF pipeline polls for new files on a schedule. For modern SaaS vendors, REST APIs are increasingly common. Some vendors use CDC for database replication. The choice depends on the vendor’s capabilities and the required data freshness.
Q: What is CDC and when would you use it? A: Change Data Capture monitors a source database’s transaction log and captures every INSERT, UPDATE, and DELETE as it happens. Use CDC when you need near real-time replication of a transactional database without running expensive SELECT queries against it. Tools include Debezium, AWS DMS, Oracle GoldenGate, and ADF CDC.
Q: What is the difference between batch and streaming ingestion? A: Batch ingestion processes data in scheduled intervals (hourly, daily files via SFTP or API). Streaming ingestion processes data continuously as events arrive (via Kafka or Event Hubs). Batch is simpler and sufficient for most use cases. Streaming is necessary when you need sub-minute latency (real-time dashboards, fraud detection, live recommendations).
Q: In a real project, how many ingestion patterns would you typically use? A: Most enterprise projects use 3-4 patterns simultaneously: SFTP for vendor files, CDC for internal databases, APIs for SaaS tools, and possibly streaming for real-time events. The data lake (ADLS Gen2) serves as the unified landing zone for all patterns, and the downstream transformation layer (Databricks/Synapse) is the same regardless of how data arrives.
Wrapping Up
Data does not magically appear in your data lake. It arrives through specific doors: SFTP for traditional vendors, APIs for modern SaaS, CDC for database replication, Event Hubs for real-time streaming, direct drops for internal teams, and Fivetran for managed ingestion.
Understanding these patterns is what separates a pipeline builder from a data engineer. A pipeline builder knows how to read a CSV from ADLS. A data engineer knows how that CSV got to ADLS in the first place — and can design the architecture to handle 6 different source types landing in Bronze, flowing through Silver, and serving Gold to 100 analysts.
Every pipeline we built on this blog fits into this picture. The metadata-driven pipeline is how you process SFTP files at scale. The incremental load is how you pull API data efficiently. Delta MERGE is how you apply CDC changes. And the Medallion Architecture is how you organize it all.
Related posts: – Metadata-Driven Pipeline – Incremental Loading with Watermark – Unified Full + Incremental Pipeline – ADF Triggers – Medallion Architecture – SCD Type 1 and 2 with Delta MERGE
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.