How Real Companies Receive Data: SFTP, APIs, CDC, Event Streaming, and Every Ingestion Pattern Explained

How Real Companies Receive Data: SFTP, APIs, CDC, Event Streaming, and Every Ingestion Pattern Explained

You have learned to build pipelines, transform data, and write Delta tables. But there is a question nobody answers in tutorials: “Where does the data actually COME FROM?”

Tutorials show you a CSV file already sitting in ADLS Gen2. They say “read this Parquet file” as if it magically appeared. But in a real project, somebody has to GET that data into the data lake first. And how that happens depends entirely on who the source is, what technology they use, and how often the data changes.

This post covers every real-world data ingestion pattern — not theory, but what actually happens at banks, e-commerce companies, telecoms, and hospitals. By the end, you will understand the six approaches companies use, which industries prefer which patterns, and how the data flows from source to your Bronze layer.

Think of your data lake like a warehouse. Data does not magically appear on the shelves. It arrives through different doors: some vendors send a delivery truck on a schedule (SFTP batch files). Some let you call and request specific items (API). Some stream a constant supply line (Event Streaming). Some replicate their entire inventory to your warehouse in real-time (CDC). Understanding the doors is understanding data engineering in the real world.

Table of Contents

  • The Six Ingestion Patterns
  • Pattern 1: SFTP File Drop (Traditional Industries)
  • Pattern 2: REST API Ingestion (Modern SaaS)
  • Pattern 3: Database Replication / CDC (Transactional Systems)
  • Pattern 4: Event Streaming / Real-Time (Data-Intensive Companies)
  • Pattern 5: Direct Cloud Storage Drop (Internal Teams)
  • Pattern 6: Third-Party Ingestion Tools (SaaS-Heavy Companies)
  • How Real Companies Combine Patterns
  • Banking (TD Bank, RBC Type)
  • E-Commerce (Amazon, Shopify Type)
  • Telecom (Rogers, Bell Type)
  • Healthcare (Hospital Networks)
  • Retail (Walmart, Costco Type)
  • Insurance
  • The Complete Architecture
  • Comparison Table
  • Mapping Patterns to Our Blog Pipelines
  • Common Mistakes in Data Ingestion
  • Interview Questions
  • Wrapping Up

The Six Ingestion Patterns

# Pattern How Data Arrives Frequency Common In
1 SFTP Vendor drops files on SFTP, pipeline picks up Hourly/Daily Banks, Insurance, Telecom, Healthcare
2 API Pipeline calls REST API, writes response Minutes/Hourly SaaS-heavy companies, Modern enterprises
3 CDC Database changes streamed to data lake Minutes/Real-time Any company with transactional databases
4 Event Streaming Kafka/Event Hubs continuous stream Seconds/Real-time Tech companies, IoT, E-commerce
5 Direct Drop Internal teams upload to cloud storage Ad-hoc/Daily Internal data sharing
6 Third-Party Tools Fivetran/Airbyte sync from 100+ sources Automatic Companies with many SaaS tools

Most real projects use a combination — SFTP for vendor files, CDC for internal databases, APIs for SaaS tools, and streaming for real-time events. The data lake is the landing zone for all of them.

Pattern 1: SFTP File Drop (Traditional Industries)

How It Works

The vendor does NOT drop files directly into your ADLS. They drop files on an SFTP server (Secure File Transfer Protocol), and your automated pipeline picks them up and moves them to the data lake.

Vendor System
    |
    | Uploads CSV/XML/Excel to SFTP server
    v
SFTP Server (sftp.company.com)
    |
    | ADF/Synapse polls every hour
    v
ADF Copy Activity
    |
    | Copies new files to ADLS Gen2
    v
ADLS Gen2 / bronze / vendor_name / 2026/05/20/
    |
    | Storage event trigger fires
    v
Databricks / Synapse transforms → Silver → Gold

Why SFTP and Not Direct Upload?

  • Security: The vendor never sees your cloud storage. They only know the SFTP address.
  • Control: You control when and how files are processed. The vendor just drops and leaves.
  • Compliance: Regulated industries (banking, healthcare) require files to land on controlled infrastructure first.
  • Compatibility: SFTP has existed since the 1990s. Every vendor, every system, every country supports it. No SDK, no cloud knowledge needed.

Real Examples

  • Banks receive daily transaction files from Visa and Mastercard via SFTP
  • Insurance companies receive claim files from hospitals via SFTP
  • Retail chains receive inventory files from warehouse management systems via SFTP
  • Telecom companies receive Call Detail Records (CDRs) from network equipment via SFTP
  • Government agencies receive tax filing data from regional offices via SFTP

The ADF Pipeline

Trigger: Schedule (every hour) or Event (new file detected)
    |
    v
Lookup: Check SFTP for new files since last poll
    |
    v
ForEach: For each new file
    |
    v
Copy Activity: SFTP source → ADLS Gen2 sink (Parquet or raw format)
    |
    v
Stored Procedure: Log file name, row count, timestamp to PIPELINE_LOG
    |
    v
Delete: Archive or remove file from SFTP after successful copy

Real-life analogy: SFTP is like a P.O. Box at the post office. The vendor mails a letter (file) to your P.O. Box (SFTP). You check the box on a schedule (pipeline polls), pick up the letter, and bring it home (ADLS). The vendor never knows your home address (cloud storage). They only know the P.O. Box number.

Pattern 2: REST API Ingestion (Modern SaaS)

How It Works

Data comes from REST APIs. A scheduled pipeline calls the API, receives a JSON response, and writes it to the data lake.

SaaS Application (Salesforce, Stripe, Shopify)
    |
    | Exposes REST API: GET /api/v2/orders?since=2026-05-19
    v
ADF Web Activity / Azure Function / Databricks Notebook
    |
    | Calls API, receives JSON response
    v
Transform JSON → write to ADLS Gen2 / bronze / salesforce / 2026/05/20/
    |
    v
Databricks transforms → Silver → Gold

Why APIs?

  • Real-time data: APIs can return data as recent as the last minute
  • Structured responses: JSON with defined schemas, not messy CSV files
  • Incremental: Most APIs support ?since=timestamp or ?updated_after=date for delta pulls
  • No file management: No SFTP servers, no file naming conventions, no missing files

Real Examples

  • Salesforce API → pull customer accounts and opportunity data daily
  • Stripe/PayPal API → pull payment transactions every hour
  • Shopify API → pull order and product catalog data
  • Google Analytics API → pull website traffic and user behavior data daily
  • Weather APIs → pull hourly weather data for logistics route optimization
  • Social media APIs → pull brand mentions for sentiment analysis
  • Jira/Azure DevOps API → pull project metrics for engineering dashboards

The Challenge: Pagination and Rate Limits

# Real API ingestion pattern in a Databricks notebook
import requests
import json
from datetime import datetime, timedelta

API_URL = "https://api.salesforce.com/services/data/v56.0/query"
headers = {"Authorization": f"Bearer {access_token}"}

# Incremental: only records modified since yesterday
yesterday = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%dT%H:%M:%SZ")
query = f"SELECT Id, Name, Email, LastModifiedDate FROM Contact WHERE LastModifiedDate > {yesterday}"

all_records = []
next_url = f"{API_URL}?q={query}"

# Handle pagination (APIs return results in pages)
while next_url:
    response = requests.get(next_url, headers=headers)
    data = response.json()
    all_records.extend(data["records"])

    if data.get("nextRecordsUrl"):
        next_url = f"https://api.salesforce.com{data['nextRecordsUrl']}"
    else:
        next_url = None

# Write to Delta
df = spark.createDataFrame(all_records)
df.write.format("delta").mode("append").save("/bronze/salesforce/contacts/")
print(f"Ingested {len(all_records)} contacts")

Real-life analogy: API ingestion is like calling a restaurant for delivery. You call (API request), place your order (query parameters), and they deliver the food (JSON response). You do not go to the restaurant — they bring it to you. But the restaurant has limits: “maximum 5 deliveries per hour” (rate limits) and “maximum 100 items per delivery” (pagination).

Pattern 3: Database Replication / CDC (Transactional Systems)

How It Works

Change Data Capture (CDC) monitors a source database and captures every INSERT, UPDATE, and DELETE. Changes are streamed to the data lake as they happen — no files, no APIs, no polling.

Source Database (Oracle/SQL Server/PostgreSQL)
    |
    | CDC tool captures every INSERT/UPDATE/DELETE from the transaction log
    v
CDC Tool (Debezium / AWS DMS / Oracle GoldenGate / ADF CDC)
    |
    | Streams changes as events
    v
ADLS Gen2 / bronze / orders_cdc / → Delta format with operation column (I/U/D)
    |
    v
Databricks applies changes → Silver (current state) → Gold

Why CDC?

  • Near real-time: Changes arrive within seconds to minutes
  • Efficient: Only changed rows are transferred, not the entire table
  • Complete: Captures updates and deletes, not just inserts (unlike watermark-based incremental)
  • Low impact: Reads from the database’s transaction log, not by querying the tables directly

CDC Tools

Tool Vendor Best For
Debezium Open source PostgreSQL, MySQL, MongoDB CDC to Kafka
AWS DMS Amazon Any database to AWS (S3, Redshift)
Oracle GoldenGate Oracle Oracle database replication
ADF CDC Microsoft SQL Server to ADLS Gen2
Fivetran SaaS Managed CDC from any database
Qlik Replicate (Attunity) Qlik Enterprise database replication

Real Examples

  • E-commerce: PostgreSQL orders database replicated to Delta Lake every 5 minutes via Debezium + Kafka
  • Banking: Core banking Oracle database replicated nightly to the analytics platform via GoldenGate
  • Hospital: Patient management system replicated to the data lake for clinical dashboards
  • Retail: Point-of-sale SQL Server database replicated for real-time inventory tracking

What CDC Data Looks Like

| operation | order_id | customer_id | amount | timestamp           |
|-----------|----------|-------------|--------|---------------------|
| INSERT    | 5001     | 1001        | 150.00 | 2026-05-20 10:00:01 |
| INSERT    | 5002     | 1002        | 89.99  | 2026-05-20 10:00:03 |
| UPDATE    | 5001     | 1001        | 175.00 | 2026-05-20 10:05:00 |
| DELETE    | 4999     | 1003        | 25.00  | 2026-05-20 10:10:00 |

The operation column tells your pipeline what to do: INSERT new rows, UPDATE existing rows, DELETE removed rows. This maps directly to Delta Lake’s MERGE command.

Real-life analogy: CDC is like a security camera on a warehouse door. Instead of taking inventory every night (full load) or checking what is new since your last visit (watermark), the camera records every item that enters (INSERT), gets modified (UPDATE), or leaves (DELETE) in real-time. At any point, you can replay the tape to rebuild the current inventory.

Pattern 4: Event Streaming / Real-Time (Data-Intensive Companies)

How It Works

Data arrives as a continuous stream of events through a message broker. No files, no batches — events flow in real-time.

Application / IoT Device / Website
    |
    | Sends events (JSON messages) continuously
    v
Azure Event Hubs / Apache Kafka / AWS Kinesis
    |
    | Millions of events per second
    v
Spark Structured Streaming / Azure Stream Analytics
    |
    | Processes micro-batches (every few seconds)
    v
Delta Lake (append new events)
    |
    v
Real-time dashboards / ML models / Alerts

Why Streaming?

  • Real-time insights: React to events as they happen, not hours later
  • Scale: Handle millions of events per second (Event Hubs supports 1M+ events/sec)
  • Decoupled: Producers send events without knowing who consumes them
  • Event sourcing: Complete history of every event, not just current state

Real Examples

  • Uber: Every ride request, driver location update, price surge calculation, and trip event streams through Kafka in real-time
  • Netflix: Every play, pause, skip, and browse event streams for real-time recommendations
  • E-commerce clickstream: Every page view, add-to-cart, and checkout step streams for personalization
  • IoT in manufacturing: Temperature, pressure, vibration readings from factory sensors every second
  • Fintech: Stock market tick data streaming for real-time trading analytics
  • Food delivery: Order status updates (placed, preparing, picked up, delivered) stream in real-time

The Architecture

# Spark Structured Streaming in Databricks
df_stream = spark.readStream     .format("eventhubs")     .options(**event_hub_config)     .load()

# Parse JSON events
df_parsed = df_stream     .withColumn("body", col("body").cast("string"))     .withColumn("event", from_json(col("body"), event_schema))     .select("event.*", "enqueuedTime")

# Write to Delta Lake (micro-batch every 10 seconds)
df_parsed.writeStream     .format("delta")     .outputMode("append")     .option("checkpointLocation", "/checkpoints/clickstream/")     .trigger(processingTime="10 seconds")     .start("/bronze/clickstream/")

Real-life analogy: Event streaming is like a live TV news broadcast. Events (stories) happen continuously. The broadcast (Kafka/Event Hubs) streams them immediately to millions of viewers (consumers). You can watch live (real-time dashboard) or record and watch later (write to Delta Lake). Nobody waits for the evening news (batch processing) — the news arrives as it happens.

Pattern 5: Direct Cloud Storage Drop (Internal Teams)

How It Works

Internal teams or cloud-native partners upload files directly to a cloud storage location (ADLS Gen2, S3).

Internal Team (Marketing, Finance, HR)
    |
    | Uploads CSV/Excel via Azure Portal, Storage Explorer, or automated script
    v
ADLS Gen2 / bronze / marketing / campaign_data / 2026/05/20/
    |
    | Storage event trigger detects new file
    v
ADF/Databricks pipeline runs → Silver → Gold

When This Actually Happens

This is less common than the other patterns but does happen: – Internal marketing team uploads weekly campaign performance CSVs – Finance team uploads monthly budget spreadsheets – HR team uploads employee census data for workforce analytics – Cloud-native partner with Azure Service Principal drops data directly – Azure Logic Apps or Power Automate writes Microsoft 365 data to ADLS

How to Secure It

Storage Account: datalake-prod
  Container: partner-drop
    Folder: /marketing/     ← Marketing team has write access ONLY here
    Folder: /finance/       ← Finance team has write access ONLY here
    Folder: /partner-abc/   ← Partner ABC has scoped Service Principal access

Each team/partner gets a Service Principal with Storage Blob Data Contributor
role scoped to THEIR specific folder only — they cannot see other folders.

Real-life analogy: This is like having labeled drop-off bins at a warehouse. The marketing team can only drop packages in the “Marketing” bin. The finance team can only use the “Finance” bin. Nobody can see what is in other bins. A warehouse worker (pipeline) checks the bins on a schedule and processes the contents.

Pattern 6: Third-Party Ingestion Tools (SaaS-Heavy Companies)

How It Works

Companies use managed ingestion platforms that connect to hundreds of sources and automatically sync data to the data lake.

100+ Source Systems (Salesforce, HubSpot, Jira, Zendesk, Stripe, etc.)
    |
    | Fivetran / Airbyte / Matillion connects to all of them
    v
Managed Ingestion Platform
    |
    | Automatically syncs data on schedule
    v
ADLS Gen2 / Snowflake / Databricks Lakehouse
    |
    v
Your pipelines transform the data

Why Third-Party Tools?

  • Connectors: Fivetran has 300+ pre-built connectors. Building each one manually would take months.
  • Maintenance: When Salesforce changes their API, Fivetran updates the connector — you do nothing.
  • Schema handling: Automatic schema evolution when source systems add columns.
  • Reliability: Automatic retries, deduplication, and incremental syncing built in.
Tool Type Price Best For
Fivetran Managed SaaS $$$$ Enterprise, 300+ connectors
Airbyte Open source Free/$$ Startups, cost-conscious teams
Matillion Cloud ETL $$$ Snowflake/Databricks heavy companies
Stitch Managed SaaS $$ Small-medium businesses
Hevo Data Managed SaaS $$ Indian market, real-time pipelines

Real Examples

  • Company with 50 SaaS tools uses Fivetran to sync all of them to Databricks automatically
  • Startup uses Airbyte (open source) to pull from 30 APIs into their data lake at zero connector cost

Real-life analogy: Third-party ingestion tools are like a mail forwarding service. You give them the addresses of all your 50 magazine subscriptions (sources). They collect every magazine (data), sort them, and deliver everything to your house (data lake) on schedule. You never visit 50 different publishers — the service handles it all.

How Real Companies Combine Patterns

Banking (TD Bank, RBC Type)

Core Banking System (Oracle)  ──── CDC (GoldenGate) ────►  ADLS Bronze
Visa/Mastercard Transactions  ──── SFTP (daily files) ───►  ADLS Bronze
Regulatory Feeds (OSFI)       ──── SFTP (daily files) ───►  ADLS Bronze
Market Data (Bloomberg)       ──── API (every 15 min) ───►  ADLS Bronze
CRM (Salesforce)              ──── Fivetran (hourly) ────►  ADLS Bronze
                                                               │
                                                    Synapse/Databricks
                                                               │
                                                    Gold ──► Power BI + Risk Models

E-Commerce (Amazon, Shopify Type)

Orders DB (PostgreSQL)        ──── CDC (Debezium+Kafka) ──►  Delta Lake
Clickstream (web/mobile)      ──── Event Hubs (real-time) ─►  Delta Lake
Payments (Stripe API)         ──── API (every hour) ───────►  ADLS Bronze
Product Catalog               ──── API (daily) ───────────►  ADLS Bronze
Warehouse Inventory           ──── SFTP (hourly files) ────►  ADLS Bronze
                                                               │
                                                    Databricks
                                                               │
                                            Recommendations + Dashboards + ML

Telecom (Rogers, Bell Type)

Network Equipment CDRs        ──── SFTP (hourly files) ────►  ADLS Bronze
CRM (Salesforce)              ──── Fivetran (hourly) ──────►  ADLS Bronze
Billing System (Oracle)       ──── CDC (GoldenGate) ───────►  ADLS Bronze
Cell Tower IoT Sensors        ──── Event Hubs (real-time) ──►  Delta Lake
Customer App Events           ──── Kafka (real-time) ──────►  Delta Lake
                                                               │
                                                    Synapse/Databricks
                                                               │
                                        Network Analytics + Churn Prediction

Healthcare (Hospital Networks)

Patient Records (Epic/Cerner) ──── HL7/FHIR API (real-time) ─►  ADLS Bronze
Lab Results                   ──── SFTP (hourly files) ────────►  ADLS Bronze
Insurance Claims              ──── SFTP (daily files) ─────────►  ADLS Bronze
Medical Device IoT            ──── IoT Hub (real-time) ────────►  Delta Lake
Pharmacy Systems              ──── CDC (SQL Server) ───────────►  ADLS Bronze
                                                                    │
                                                         Databricks (with PHI masking)
                                                                    │
                                                Clinical Dashboards + Research Datasets

Retail (Walmart, Costco Type)

Point-of-Sale (POS)           ──── CDC (SQL Server) ───────►  ADLS Bronze
Supplier Inventory             ──── SFTP (daily files) ─────►  ADLS Bronze
E-commerce Orders              ──── CDC + API (real-time) ──►  ADLS Bronze
Customer Loyalty Program       ──── API (hourly) ──────────►  ADLS Bronze
Supply Chain Tracking          ──── Event Hubs (real-time) ─►  Delta Lake
Weather Data (for demand)      ──── API (hourly) ──────────►  ADLS Bronze
                                                                │
                                                     Databricks
                                                                │
                                        Demand Forecasting + Inventory Optimization

Insurance

Policy Admin System (Oracle)  ──── CDC (nightly) ──────────►  ADLS Bronze
Claims (from hospitals)       ──── SFTP (daily files) ─────►  ADLS Bronze
Reinsurance Data              ──── SFTP (monthly files) ───►  ADLS Bronze
Agent Portal                  ──── API (hourly) ──────────►  ADLS Bronze
Third-Party Risk Data         ──── Fivetran (daily) ──────►  ADLS Bronze
                                                                │
                                                     Synapse
                                                                │
                                        Claims Analytics + Fraud Detection

The Complete Architecture

Sources (left)                    Landing (center)           Processing (right)

SFTP Files      ─────┐
API Responses   ─────┤
CDC Streams     ─────┤──►  ADLS Gen2 BRONZE  ──►  Databricks/Synapse  ──►  GOLD
Event Streams   ─────┤         (Raw data)          (Silver transforms)    (Dashboards)
Direct Drops    ─────┤                                                     (ML Models)
Fivetran/Airbyte ────┘                                                     (Reports)

Every source lands in Bronze (raw, as-is). Transformations happen in Silver. Business-ready data lives in Gold. The ingestion pattern changes. The downstream architecture stays the same.

Mapping Patterns to Our Blog Pipelines

Ingestion Pattern Our Blog Pipeline Post
SFTP file pickup ADF Copy from SQL to ADLS (similar pattern) Metadata-Driven Pipeline
API incremental pull Watermark-based incremental load Incremental Loading
CDC apply changes Delta MERGE for SCD SCD with Delta MERGE
Event Streaming Structured Streaming to Delta (Future post)
File-triggered pipeline Event trigger on ADLS ADF Triggers
Multi-source config Unified Full+Incremental Unified Pipeline
Bronze → Silver transform Data quality + cleaning Data Quality Framework
Silver → Gold SCD dimensions + aggregations Medallion Architecture

Common Mistakes in Data Ingestion

  1. Assuming all data comes as files — in a modern company, APIs and CDC are often more common than file drops. Design your architecture for multiple ingestion patterns.

  2. Polling SFTP too frequently — if the vendor drops files at 6 AM daily, polling every 5 minutes wastes resources. Match polling frequency to the source’s delivery schedule.

  3. Not handling API rate limits — calling Salesforce API 1,000 times per minute gets you blocked. Implement exponential backoff and respect rate limit headers.

  4. Full load when CDC is available — copying the entire 10M-row table daily when only 1,000 rows changed wastes time and money. Use CDC for transactional databases.

  5. No dead letter queue for streaming — if a streaming event fails to process, it should go to a dead letter topic, not be silently dropped.

  6. Not validating incoming data — vendor files can have wrong formats, missing columns, or corrupt rows. Always validate at the Bronze layer before processing.

Interview Questions

Q: How do companies typically receive data from external vendors? A: Most commonly via SFTP — the vendor uploads files to an SFTP server, and an ADF pipeline polls for new files on a schedule. For modern SaaS vendors, REST APIs are increasingly common. Some vendors use CDC for database replication. The choice depends on the vendor’s capabilities and the required data freshness.

Q: What is CDC and when would you use it? A: Change Data Capture monitors a source database’s transaction log and captures every INSERT, UPDATE, and DELETE as it happens. Use CDC when you need near real-time replication of a transactional database without running expensive SELECT queries against it. Tools include Debezium, AWS DMS, Oracle GoldenGate, and ADF CDC.

Q: What is the difference between batch and streaming ingestion? A: Batch ingestion processes data in scheduled intervals (hourly, daily files via SFTP or API). Streaming ingestion processes data continuously as events arrive (via Kafka or Event Hubs). Batch is simpler and sufficient for most use cases. Streaming is necessary when you need sub-minute latency (real-time dashboards, fraud detection, live recommendations).

Q: In a real project, how many ingestion patterns would you typically use? A: Most enterprise projects use 3-4 patterns simultaneously: SFTP for vendor files, CDC for internal databases, APIs for SaaS tools, and possibly streaming for real-time events. The data lake (ADLS Gen2) serves as the unified landing zone for all patterns, and the downstream transformation layer (Databricks/Synapse) is the same regardless of how data arrives.

Wrapping Up

Data does not magically appear in your data lake. It arrives through specific doors: SFTP for traditional vendors, APIs for modern SaaS, CDC for database replication, Event Hubs for real-time streaming, direct drops for internal teams, and Fivetran for managed ingestion.

Understanding these patterns is what separates a pipeline builder from a data engineer. A pipeline builder knows how to read a CSV from ADLS. A data engineer knows how that CSV got to ADLS in the first place — and can design the architecture to handle 6 different source types landing in Bronze, flowing through Silver, and serving Gold to 100 analysts.

Every pipeline we built on this blog fits into this picture. The metadata-driven pipeline is how you process SFTP files at scale. The incremental load is how you pull API data efficiently. Delta MERGE is how you apply CDC changes. And the Medallion Architecture is how you organize it all.

Related posts:Metadata-Driven PipelineIncremental Loading with WatermarkUnified Full + Incremental PipelineADF TriggersMedallion ArchitectureSCD Type 1 and 2 with Delta MERGE


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top
Share via
Copy link