ETL & Data Pipelines
Master the art of building robust, scalable data pipelines. From extraction patterns to orchestration, learn how data moves from source to insight.
Understanding the automated workflows that move and transform data between systems.
A data pipeline is an automated workflow that moves data from one or more sources to a destination, typically transforming the data along the way. Think of it as the plumbing of the data world — invisible when it works well, but catastrophic when it breaks.
Every organization that uses data for decision-making relies on data pipelines. Whether it is a simple nightly batch job that loads sales data into a reporting database, or a complex real-time streaming system processing millions of events per second, the fundamental concept is the same: extract data, process it, and deliver it where it needs to go.
Data pipelines sit at the heart of the modern data stack. Without them, data scientists would have no clean datasets to model, analysts would have no dashboards to monitor, and machine learning systems would have no features to train on. The quality of your data pipeline directly determines the quality of every downstream decision.
Data pipelines come in several flavors, each suited to different use cases and latency requirements. Understanding these types helps you choose the right architecture for your needs.
Batch pipelines process data in scheduled chunks — hourly, daily, or weekly. They are the workhorses of most analytics teams, handling large volumes of data efficiently. Most traditional ETL falls into this category.
Streaming pipelines process data continuously in real-time or near-real-time. They are essential for use cases like fraud detection, live dashboards, and recommendation engines where latency matters.
Micro-batch pipelines sit in between, processing small batches every few seconds or minutes. Apache Spark Structured Streaming is a popular example of this approach, offering a balance between throughput and latency.
Hybrid pipelines combine batch and streaming approaches — often called the Lambda or Kappa architecture. A streaming layer handles real-time needs while a batch layer ensures accuracy through periodic reprocessing.
A data pipeline is like an assembly line for data. Raw materials (data) go in one end, get processed at various stations, and finished products (insights) come out the other end. Just like a factory, the pipeline must be reliable, efficient, and able to handle varying throughput without breaking down.
Two paradigms for moving and transforming data — and why the industry is shifting.
The distinction between ETL and ELT may seem subtle — it is just rearranging three letters — but the architectural implications are profound. The order in which you transform data fundamentally shapes your pipeline design, tooling choices, and scalability.
Data is extracted from sources, transformed in a separate staging area or ETL server, and only then loaded into the target data warehouse in its final form.
Data is shaped and cleaned before it reaches the warehouse. The target system receives only clean, structured data ready for querying.
Ideal when the destination system is expensive or has limited processing power. The heavy lifting happens elsewhere.
On-premise data warehouse loads where an ETL tool like Informatica or Talend transforms data on a dedicated server before loading into Oracle or Teradata.
Data is extracted from sources, loaded raw into the target system first, and then transformed inside the target using its native compute power.
Raw data lands in the warehouse as-is. Transformations happen using SQL or tools like dbt, leveraging the warehouse's massive compute capabilities.
Perfect for cloud warehouses like Snowflake, BigQuery, and Databricks where compute scales elastically and storage is cheap.
Load raw data into Snowflake via Fivetran or Airbyte, then transform with dbt models. The warehouse handles all the heavy computation.
In the traditional ETL approach, data passes through a transformation engine before reaching the warehouse: Source → Extract → Transform (staging server) → Load → Warehouse.
In the modern ELT approach, raw data is loaded first and transformed inside the warehouse: Source → Extract → Load → Warehouse → Transform (in-warehouse SQL).
The industry is rapidly shifting toward ELT because cloud warehouses like Snowflake, BigQuery, and Databricks have massive, elastic compute power. It is far more efficient to transform data where it already lives rather than moving it to a separate transformation server. Tools like dbt have made SQL-based transformation elegant, testable, and version-controlled, accelerating ELT adoption across organizations of all sizes.
Getting data out of source systems reliably and efficiently.
Data extraction is the first step in any pipeline, and it is often the most challenging. Source systems were not designed to have data pulled out of them — they were designed to serve applications. Your extraction strategy must respect source system limitations while ensuring complete, timely data delivery.
The extraction method you choose depends on the source system type, data volume, freshness requirements, and how much control you have over the source. Here are the most common extraction patterns used in production pipelines:
REST/GraphQL APIs with paginated fetches and rate limiting. The most common method for pulling data from SaaS platforms like Salesforce, Stripe, and HubSpot.
CDC (Change Data Capture) and log-based replication capture every insert, update, and delete. Tools: Debezium, Airbyte, Fivetran.
CSV, JSON, and Parquet files from SFTP, S3, or shared drives. Batch-oriented and common for partner data exchanges and legacy systems.
Real-time events from Kafka, Kinesis, or Pub/Sub. Continuous flow for event-driven architectures with sub-second latency requirements.
Extracting data from websites using Beautiful Soup, Scrapy, or Playwright. Use responsibly and always respect robots.txt and rate limits.
Time-series data from devices via MQTT or HTTP endpoints. High volume, high frequency. Requires buffering and aggregation strategies.
Extraction is where many pipeline problems originate. Following these best practices will save you countless hours of debugging and ensure your pipelines are reliable at scale.
Only pull new or changed data since the last successful run. Use timestamps (updated_at), auto-incrementing IDs, or Change Data Capture (CDC) to identify changes. Incremental extraction is orders of magnitude faster than full loads and puts far less strain on source systems. Always track your high-water mark — the last successfully extracted point — in a persistent state store.
Running the same extraction twice should produce the same result without duplicating data. Use deduplication keys (natural or composite keys) and implement "extract-or-skip" logic. If a run fails halfway through, you should be able to re-run the entire extraction safely. This is the single most important principle in data engineering.
Auto-detect source schema changes and alert on breaking changes immediately. Sources evolve — columns get added, renamed, or removed. Your extraction layer should detect these changes, log them, and either adapt automatically or fail loudly rather than silently ingesting corrupt data. Tools like Airbyte and Fivetran handle schema evolution automatically.
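The incremental, high-water-mark pattern described above can be sketched as follows. This is a minimal illustration using Python's sqlite3, with an in-memory dict standing in for the persistent state store; the table and column names are illustrative:

```python
import sqlite3

# Stand-ins: an in-memory SQLite "source" and a dict as the state store.
# In production the watermark lives in a persistent state table.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, "2024-01-01"), (2, "2024-01-05"), (3, "2024-01-09")])

state = {"orders_high_water_mark": "2024-01-03"}  # from the last successful run

def extract_incremental(conn, watermark):
    """Pull only rows changed since the last successful run."""
    rows = conn.execute(
        "SELECT id, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    if rows:
        # Advance the high-water mark only after a successful extraction.
        state["orders_high_water_mark"] = rows[-1][1]
    return rows

new_rows = extract_incremental(source, state["orders_high_water_mark"])
print(new_rows)                            # only rows 2 and 3
print(state["orders_high_water_mark"])     # advanced to "2024-01-09"
```

Because the watermark only advances after a successful run, a failed run simply re-reads the same window on retry, which keeps the extraction idempotent.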
Turning raw data into clean, trustworthy, analysis-ready datasets.
Transformation is where raw data becomes valuable. It is the process of cleaning, restructuring, enriching, and validating data so that downstream consumers — analysts, data scientists, ML models, and dashboards — can trust and use it effectively. A well-designed transformation layer is the difference between a data platform people trust and one they avoid.
Handle nulls, fix data types, remove corrupted or malformed records. This is quality gate number one. Common tasks include casting strings to proper types, trimming whitespace, standardizing null representations (empty strings, "N/A", "null" text), and filtering out test or dummy records. Clean data is the foundation everything else builds on.
Identify and remove duplicate records that inevitably creep into your data. Use business keys combined with timestamps to determine which record to keep. Common strategies include keeping the latest record (last-write-wins), keeping the first occurrence, or merging fields from multiple duplicates. Window functions like ROW_NUMBER() are your best friend here.
Join with reference data to add context and computed fields. This includes looking up customer names from IDs, geocoding addresses into latitude/longitude, adding currency conversion rates, or computing derived metrics like customer lifetime value. Enrichment turns isolated facts into connected, meaningful information.
Summarize detailed data into actionable metrics: daily revenue totals, running averages, rolling window computations, and period-over-period comparisons. SQL window functions (SUM() OVER, AVG() OVER, LAG()) are essential tools for creating these aggregations efficiently without losing access to the underlying detail.
Standardize formats across all sources so downstream consumers do not need to handle variations. This means consistent date formats (ISO 8601), unified currency codes, standardized country names, normalized phone number formats, and consistent units of measurement. When data comes from ten different sources, normalization is what makes it feel like one.
Check business rules to ensure data integrity: value ranges, referential integrity between tables, completeness checks, and cross-field consistency. For example, an order amount should never be negative, every order should reference a valid customer, and a shipment date should never precede the order date. Failed validations should quarantine records, not crash the pipeline.
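Deduplication and window-based aggregation, described above, can be sketched with SQL window functions, run here through Python's sqlite3 (window functions require SQLite 3.25+); the tables and values are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # window functions need SQLite 3.25+
conn.executescript("""
    CREATE TABLE raw_orders (order_id INT, status TEXT, loaded_at TEXT);
    INSERT INTO raw_orders VALUES
        (1001, 'pending',   '2024-01-01'),
        (1001, 'completed', '2024-01-02'),   -- later version of the same order
        (1002, 'pending',   '2024-01-01');
""")

# Deduplication, last-write-wins: keep only the newest row per business key.
deduped = conn.execute("""
    SELECT order_id, status FROM (
        SELECT *, ROW_NUMBER() OVER (
                      PARTITION BY order_id ORDER BY loaded_at DESC) AS rn
        FROM raw_orders)
    WHERE rn = 1
    ORDER BY order_id
""").fetchall()
print(deduped)  # [(1001, 'completed'), (1002, 'pending')]

conn.executescript("""
    CREATE TABLE daily_sales (day TEXT, revenue REAL);
    INSERT INTO daily_sales VALUES
        ('2024-01-01', 100.0), ('2024-01-02', 150.0), ('2024-01-03', 120.0);
""")

# Aggregation without losing detail: running total and day-over-day change.
windowed = conn.execute("""
    SELECT day,
           SUM(revenue) OVER (ORDER BY day)           AS running_total,
           revenue - LAG(revenue) OVER (ORDER BY day) AS vs_prev_day
    FROM daily_sales
""").fetchall()
print(windowed)
```

Note that the window queries keep every detail row available; the aggregates ride alongside the raw values instead of collapsing them.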
Here is a practical example of a silver-layer transformation that cleans, enriches, and categorizes order data. This is the kind of transformation you would write in dbt or as a warehouse view. The raw input:
| order_id | customer | amount | currency | status |
|---|---|---|---|---|
| 1001 | Ali Ahmed | 2500 | PKR | completed |
| 1002 | NULL | 75 | USD | pending |
| 1003 | sara khan | 1800 | PKR | COMPLETED |
| 1004 | Omar Farooq | NULL | EUR | completed |
After the transformation, statuses are normalized, amounts are converted to PKR, and an order tier is derived:

| order_id | customer | amount_pkr | status | order_tier |
|---|---|---|---|---|
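A sketch of that transformation over the sample orders, run through Python's sqlite3; the currency conversion rates and tier thresholds are illustrative assumptions (a real pipeline would look them up from reference tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE raw_orders
                (order_id INT, customer TEXT, amount REAL, currency TEXT, status TEXT)""")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?, ?)", [
    (1001, "Ali Ahmed",   2500, "PKR", "completed"),
    (1002, None,            75, "USD", "pending"),
    (1003, "sara khan",   1800, "PKR", "COMPLETED"),
    (1004, "Omar Farooq", None, "EUR", "completed"),
])

silver = conn.execute("""
    SELECT order_id, customer, amount_pkr, status,
           CASE WHEN amount_pkr >= 10000 THEN 'high'      -- illustrative tiers
                ELSE 'standard' END AS order_tier
    FROM (
        SELECT order_id,
               customer,
               CAST(amount * CASE currency   -- illustrative rates; use a rates table
                                 WHEN 'PKR' THEN 1
                                 WHEN 'USD' THEN 280
                                 WHEN 'EUR' THEN 300
                             END AS INTEGER) AS amount_pkr,
               LOWER(status)                 AS status     -- normalize casing
        FROM raw_orders
        WHERE amount IS NOT NULL   -- quarantine in production; dropped here
    )
    ORDER BY order_id
""").fetchall()

for row in silver:
    print(row)
```

Order 1004, with its NULL amount, is filtered out here for brevity; in a production pipeline it would be routed to a quarantine table for review rather than silently dropped.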
Always test transformations with edge cases: nulls, empty strings, extreme values, timezone boundaries, and Unicode characters. A transformation that works for 99% of data but fails on edge cases will silently corrupt your data. Write unit tests for your SQL models using tools like dbt's built-in testing framework. Test with production-like data volumes, not just five sample rows.
Delivering processed data to its final destination reliably.
Modern data platforms organize data into layers, often called the medallion architecture (or bronze/silver/gold). Each layer represents a different level of data quality and processing. Understanding these layers is key to designing your loading strategy.
Bronze layer: an exact copy of source data, untouched. Append-only with metadata (load timestamp, source system). Your safety net for reprocessing.
Silver layer: deduplicated, type-cast, and enriched data that conforms to business rules. Loaded via incremental upserts (MERGE operations).
Gold layer: dimensional models, KPI tables, and pre-computed metrics. Optimized for BI tools and dashboards. Full refreshes or carefully managed incrementals.
How you load data into each layer depends on the data characteristics, target system capabilities, and freshness requirements. Here are the three primary loading strategies:
Full refresh: replace the entire target table each run. Simple and guarantees consistency, but slow for large tables. Best for small reference/lookup tables under a few million rows.
Append-only: add only new records to the target. Fast and efficient, but does not handle updates to existing records. Ideal for immutable event/log data and clickstream data.
Upsert (merge): insert new records and update existing ones in a single atomic operation. Handles both cases gracefully. The standard approach for dimension tables and any data that changes over time.
Use MERGE statements (or Delta Lake's MERGE INTO) for reliable upserts. They handle insert-or-update logic atomically, preventing race conditions and partial writes. Most modern warehouses — Snowflake, BigQuery, Databricks — support MERGE natively. Combined with proper partitioning and clustering, MERGE operations can handle billions of rows efficiently.
Scheduling, dependency management, and keeping everything running smoothly.
A single pipeline task is easy to manage. But in the real world, you have dozens or hundreds of interconnected tasks with complex dependencies: table A must finish loading before transformation B can run, and transformation B must complete before dashboard C refreshes. Orchestration is the discipline of managing this complexity.
An orchestrator handles scheduling (when tasks run), dependency management (what order they run in), retries (what happens when they fail), alerting (who gets notified), and observability (what happened and why). Without orchestration, you end up with a fragile web of cron jobs and prayer.
Apache Airflow: the industry standard. Python-based DAGs with a massive community and an extensive operator library, battle-tested at scale.
Prefect: a modern Airflow alternative. Pythonic API, superior error handling, and built-in observability. Great developer experience.
Dagster: a software-defined assets approach. Type-safe, testable pipelines with first-class support for data quality and lineage.
dbt: the SQL transformation orchestrator. Version controlled, documented, and tested. The standard for ELT transformation layers.
Mage: a visual pipeline builder with a notebook-style interface. Great for quick prototyping and teams new to data engineering.
Cron: the classic Unix scheduler. Simple time-based triggers but no dependency management, retries, or monitoring. Fine for one-off scripts only.
At the core of every orchestrator is the concept of a DAG — Directed Acyclic Graph. A DAG defines the order of tasks and their dependencies. "Directed" means each edge has a direction (task A must run before task B). "Acyclic" means there are no circular dependencies — you cannot have A depend on B which depends on A.
DAGs are powerful because they allow the orchestrator to determine which tasks can run in parallel and which must wait. This maximizes throughput while respecting data dependencies. Here is a typical DAG for a pipeline that combines data from two sources:
1. Extract customers: Pull customer data from the application database. This task has no upstream dependencies and starts immediately when the DAG is triggered.
2. Extract transactions: Pull transaction data from the payment API. This runs concurrently with step 1 since they are independent; neither depends on the other's output.
3. Transform customers: Clean and deduplicate customer records. This task waits for the customer extraction to complete before starting.
4. Transform transactions: Validate and enrich transaction records. This task waits for the transaction extraction to complete. It can run in parallel with step 3.
5. Join datasets: Combine customer and transaction data into a unified fact table. Both transformations must complete before this step can begin.
6. Load to warehouse: MERGE the joined dataset into the production warehouse table. Handles both new inserts and updates to existing records.
7. Run quality checks: Validate the final output: check row counts, null rates, value distributions, and business rule compliance. Alert on anomalies.
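The seven steps above form a DAG that can be expressed and walked with Python's standard-library graphlib; the task names are illustrative, and a real orchestrator would attach executable work to each node:

```python
from graphlib import TopologicalSorter

# The pipeline steps above, expressed as task -> set of upstream dependencies.
dag = {
    "extract_customers":      set(),
    "extract_transactions":   set(),
    "transform_customers":    {"extract_customers"},
    "transform_transactions": {"extract_transactions"},
    "join_datasets":          {"transform_customers", "transform_transactions"},
    "load_warehouse":         {"join_datasets"},
    "quality_checks":         {"load_warehouse"},
}

# A scheduler walks the graph in "waves": every task in a wave has all of its
# dependencies satisfied, so the whole wave can run in parallel.
waves = []
ts = TopologicalSorter(dag)
ts.prepare()
while ts.is_active():
    ready = sorted(ts.get_ready())
    waves.append(ready)
    print(ready)
    ts.done(*ready)
```

The first wave contains both extractions, the second both transformations, and the final three waves run one task each, exactly the parallelism the step descriptions call for.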
Building resilient pipelines that fail gracefully and recover automatically.
Pipelines will fail. Networks go down, APIs return errors, schemas change unexpectedly, and data arrives in formats you never anticipated. The question is not whether your pipeline will fail, but how gracefully it handles failure. These four patterns form the foundation of resilient pipeline design:
Transient failures — network timeouts, API rate limits, temporary database locks — should be retried automatically with exponential backoff. Start with a short delay (1 second), then double it on each retry (2s, 4s, 8s, 16s) up to a maximum. Add jitter (random variation) to prevent thundering herd problems when many tasks retry simultaneously. Most orchestrators have built-in retry configuration.
Records that cannot be processed — malformed JSON, invalid data types, business rule violations — should be routed to a separate table or queue (the "dead letter queue") for manual review. This prevents a few bad records from blocking the entire pipeline. Log enough context with each dead letter (error message, original payload, timestamp) to make debugging straightforward.
If the error rate exceeds a threshold (for example, more than 10% of records failing), stop the pipeline and alert immediately. This prevents cascading failures — if a source system is returning garbage data, you do not want to load that garbage into your warehouse and corrupt downstream tables. The circuit breaker pattern is borrowed from electrical engineering and is essential for production systems.
Design every step so it can be safely re-run without side effects. Use transaction boundaries to ensure atomicity — either all changes commit or none do. Store processing state externally (not in memory) so recovery is possible after crashes. If step 3 of 5 fails, you should be able to restart from step 3 without re-running steps 1 and 2 or duplicating their output.
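Two of these patterns, retries with exponential backoff and dead-letter routing, can be sketched as follows; the error types, business rule, and delays are illustrative:

```python
import json
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=16.0):
    """Retry transient failures, doubling the delay each attempt, with jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the failure to the orchestrator
            delay = min(base_delay * 2 ** attempt, max_delay)
            time.sleep(delay + random.uniform(0, delay))  # jitter avoids herds

processed, dead_letters = [], []

def process_batch(raw_records):
    """Route unparseable or invalid records to a dead-letter list, not a crash."""
    for raw in raw_records:
        try:
            record = json.loads(raw)
            if record["amount"] < 0:  # illustrative business rule
                raise ValueError("negative amount")
            processed.append(record)
        except (json.JSONDecodeError, KeyError, ValueError) as exc:
            # Keep enough context to debug: the error plus the original payload.
            dead_letters.append({"error": str(exc), "payload": raw})

# Simulated flaky extraction: fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("network timeout")
    return ['{"amount": 10}', "not json", '{"amount": -5}']

batch = retry_with_backoff(flaky_fetch, base_delay=0.01)
process_batch(batch)
print(len(processed), len(dead_letters))  # 1 2
```

The transient network error is retried away, while the two genuinely bad records land in the dead-letter list with their payloads for later review.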
Monitoring is not optional — it is a core feature of any production pipeline. Without monitoring, you are flying blind. These four dimensions of monitoring give you complete visibility into your pipeline health:
Track success/failure rates, run durations, and SLA compliance. Build dashboards showing trends over time. Alert when runs take significantly longer than usual or fail repeatedly.
How old is the data in your warehouse? Track the maximum updated_at timestamp in each table. Alert when data is staler than your SLA allows. Freshness is the metric stakeholders care about most.
Monitor row counts and byte sizes for each pipeline run. Detect unexpected drops (source system issue) or spikes (duplicate data, schema change). Volume anomalies are early warning signs of problems.
Measure null rates, schema drift, uniqueness violations, and business rule compliance. Tools like Great Expectations and dbt tests automate quality checks as part of your pipeline.
The worst data pipeline failures are silent ones. Your pipeline runs successfully — green checkmarks everywhere — but the data is wrong. A schema change caused a column to be misaligned, a join condition matched incorrectly, or a filter silently dropped 80% of records. Always implement data quality checks that validate the OUTPUT, not just the process. Check row counts, distributions, and business invariants after every load.
A step-by-step walkthrough of building a real-world data pipeline from scratch.
Let us walk through building a complete data pipeline that extracts order data from a PostgreSQL application database, transforms it by joining with customer and product information, loads the results into a warehouse, and validates the output. This example demonstrates every stage of the ETL process in a realistic scenario.
Each step below represents a discrete, testable unit of work that would be a separate task in your orchestration DAG. Breaking pipelines into small, focused steps makes them easier to debug, test, and maintain.
Start by defining your source connection and specifying which tables to extract. Configuration should be externalized, never hardcoded. Here we define a YAML config that our extraction framework will read:
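A sketch of such a configuration in YAML; the field names are illustrative rather than any specific framework's schema:

```yaml
# extraction config (illustrative schema, not tied to a specific tool)
source:
  type: postgres
  host: ${APP_DB_HOST}          # resolved from the environment, never hardcoded
  port: 5432
  database: app_production
  credentials_secret: app-db-readonly   # fetched from a secret manager
tables:
  - name: orders
    mode: incremental
    cursor_column: updated_at
  - name: customers
    mode: incremental
    cursor_column: updated_at
  - name: products
    mode: full_refresh          # small lookup table; a full reload is cheap
```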
Pull only new or updated records since the last successful run. The last_run_timestamp variable is managed by our orchestrator and stored in a state table. This incremental approach keeps extraction fast and minimizes load on the source system:
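A sketch of this step using sqlite3 as a stand-in for PostgreSQL; in production, last_run_timestamp would be read from the orchestrator's state table rather than assigned inline:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL source
conn.execute("CREATE TABLE orders (order_id INT, customer_id INT, updated_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10, "2024-03-01 09:00:00"),
    (2, 11, "2024-03-02 09:00:00"),
    (3, 12, "2024-03-03 09:00:00"),
])

# In production this comes from the orchestrator's state table.
last_run_timestamp = "2024-03-01 12:00:00"

new_orders = conn.execute(
    """
    SELECT order_id, customer_id, updated_at
    FROM orders
    WHERE updated_at > ?      -- only rows changed since the last successful run
    ORDER BY updated_at
    """,
    (last_run_timestamp,),
).fetchall()

print([r[0] for r in new_orders])  # [2, 3]
```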
Clean and enrich the extracted data by joining orders with customer and product information. We filter out cancelled orders, compute the total amount, and extract the order date. This transformation runs in the staging schema after all three extractions complete:
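A sketch of this transformation, again with sqlite3 standing in for the warehouse; the staging table names and prices are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_orders    (order_id INT, customer_id INT, product_id INT,
                                quantity INT, status TEXT, created_at TEXT);
    CREATE TABLE stg_customers (customer_id INT, name TEXT);
    CREATE TABLE stg_products  (product_id INT, unit_price REAL);

    INSERT INTO stg_orders VALUES
        (1, 10, 100, 2, 'completed', '2024-03-01 09:30:00'),
        (2, 11, 101, 1, 'cancelled', '2024-03-01 10:00:00');
    INSERT INTO stg_customers VALUES (10, 'Ali Ahmed'), (11, 'Sara Khan');
    INSERT INTO stg_products  VALUES (100, 25.0), (101, 99.0);
""")

enriched = conn.execute("""
    SELECT o.order_id,
           c.name                      AS customer_name,
           o.quantity * p.unit_price   AS total_amount,   -- computed field
           DATE(o.created_at)          AS order_date
    FROM stg_orders o
    JOIN stg_customers c ON c.customer_id = o.customer_id
    JOIN stg_products  p ON p.product_id  = o.product_id
    WHERE o.status != 'cancelled'      -- drop cancelled orders
""").fetchall()

print(enriched)  # [(1, 'Ali Ahmed', 50.0, '2024-03-01')]
```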
Merge the transformed data into the production warehouse table. The MERGE statement handles both new orders (INSERT) and updated orders (UPDATE) in a single atomic operation, ensuring data consistency:
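Warehouses like Snowflake and BigQuery express this as a MERGE statement; the sketch below uses SQLite's equivalent upsert syntax, INSERT ... ON CONFLICT DO UPDATE (SQLite 3.24+), with an illustrative target table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_orders (
        order_id     INTEGER PRIMARY KEY,   -- conflict target for the upsert
        total_amount REAL,
        status       TEXT
    )
""")
conn.execute("INSERT INTO fact_orders VALUES (1, 50.0, 'pending')")

batch = [(1, 50.0, "completed"),   # update: existing order changed status
         (2, 99.0, "pending")]     # insert: brand-new order

conn.executemany("""
    INSERT INTO fact_orders (order_id, total_amount, status)
    VALUES (?, ?, ?)
    ON CONFLICT (order_id) DO UPDATE SET
        total_amount = excluded.total_amount,
        status       = excluded.status
""", batch)

print(conn.execute("SELECT * FROM fact_orders ORDER BY order_id").fetchall())
# [(1, 50.0, 'completed'), (2, 99.0, 'pending')]
```

Both the update and the insert land in one atomic statement, so a re-run of the same batch produces the same final table, which is exactly the idempotency property the pipeline needs.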
Run quality checks on the final output to ensure data integrity. These checks should run after every load and alert on failures. A validation that returns any rows means something is wrong:
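A sketch of output validation, where each check query returns the offending rows and an empty result means the check passed; the checks and table are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE fact_orders
                (order_id INT, customer_name TEXT, total_amount REAL)""")
conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", [
    (1, "Ali Ahmed", 50.0),
    (2, None,        99.0),    # bad: missing customer
    (3, "Sara Khan", -10.0),   # bad: negative amount
])

# Each check returns offending rows; any non-empty result should trigger an alert.
checks = {
    "null_customer":   "SELECT order_id FROM fact_orders WHERE customer_name IS NULL",
    "negative_amount": "SELECT order_id FROM fact_orders WHERE total_amount < 0",
    "duplicate_ids":   """SELECT order_id FROM fact_orders
                          GROUP BY order_id HAVING COUNT(*) > 1""",
}

failures = {name: conn.execute(sql).fetchall() for name, sql in checks.items()}
for name, rows in failures.items():
    print(name, "FAILED" if rows else "passed", rows)
```

Note these checks validate the output data itself, not the process: a pipeline can finish green while every one of these queries returns rows.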
Hard-won lessons from production pipelines — what to do and what to avoid.
Building a pipeline that works once is easy. Building one that works reliably for months and years, across schema changes, data volume growth, and team turnover, requires discipline. These best practices are distilled from real-world experience across hundreds of production pipelines.
Every step in your pipeline should be safely re-runnable. Use MERGE instead of INSERT, track processing state externally, and use transaction boundaries. When something fails at 3 AM, you want to re-run the pipeline without worrying about duplicates.
Avoid full loads whenever possible. Incremental extraction and loading are faster, cheaper, and put less strain on source systems. Design your tables with timestamps and use high-water marks to track progress.
Treat pipeline code like application code. Use Git, write pull requests, do code reviews. Every SQL model, every DAG definition, every configuration file should be in version control. You need to know what changed and when.
A pipeline can succeed (green checkmark) while producing wrong data. Implement automated data quality checks that validate row counts, null rates, value distributions, and business rules after every run.
Track where data comes from, how it is transformed, and where it goes. Data lineage helps with debugging, impact analysis, and compliance. Tools like dbt auto-generate lineage from your SQL models.
Break pipelines into small, testable stages. A single monolithic script that does extraction, transformation, and loading is impossible to debug and maintain. Each task should do one thing well.
Source schemas will change — columns get added, renamed, or removed. Plan for it. Implement schema detection, use flexible data types in your bronze layer, and alert immediately on breaking changes.
Test with edge cases and production-like data volumes. Unit test your transformations, integration test your end-to-end flow, and load test with realistic data sizes. The bugs that matter only show up at scale.
Use secret managers like AWS Secrets Manager, HashiCorp Vault, or your orchestrator's built-in secrets. Credentials in code end up in Git history, in logs, and eventually compromised. No exceptions.
Design pipelines so they can reprocess historical data. You will need to backfill when you fix bugs, add new columns, or change transformation logic. Parameterize your date ranges and make backfilling a first-class operation.
The best data pipeline is one that is boring. It runs reliably, handles errors gracefully, and you rarely need to think about it. Invest time upfront in idempotency, monitoring, and testing to achieve this. A pipeline that requires constant babysitting is a pipeline that needs to be redesigned. Your goal is not to build something clever — it is to build something that just works, day after day, without surprises.