PySpark
Complete Guide
From transformations and actions to Spark architecture. Master PySpark with interactive before-and-after demos, interview prep, and real ETL pipeline walkthroughs.
Transformations are lazy operations that define a computation but don't execute until an action is called. Click any transformation below to see before & after data.
In PySpark, transformations create a new DataFrame from an existing one without modifying the original. They are lazy — Spark builds a logical plan (DAG) but delays execution until an action like show() or collect() triggers it. This allows Spark's Catalyst optimizer to rearrange and optimize the entire chain before running a single line.
No computation happens when you call df.filter() or df.select(). Spark only records what you want to do. The actual processing happens when you call an action like .show(), .count(), or .write.
Click a transformation to see the PySpark code and the before/after data side by side.
Additional powerful transformations every PySpark developer should know.
Actions trigger execution of the transformation chain and return results to the driver or write data to storage.
While transformations are lazy and build a DAG, actions are what trigger actual computation. When you call an action, Spark's DAG scheduler converts the logical plan into physical stages and tasks, distributes them across executors, and returns the result.
Click an action to see the code and the output it produces. Notice how each action triggers the full Spark execution.
Click any card to reveal the answer. These are the most frequently asked Spark & PySpark questions in data engineering interviews.
Broadcast join: use the broadcast(df_small) hint to copy a small table to every executor, turning a shuffle join into a local hash join. Ideal for dimension table lookups, since it avoids data movement across the cluster.

cache() vs persist(): cache() stores the DataFrame at the default storage level (MEMORY_AND_DISK for DataFrames), while persist() lets you choose the level explicitly: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, etc. Use caching when a DataFrame is reused multiple times in the same job, and call unpersist() when done to free resources.

Walk through a complete 12-step production ETL pipeline built with PySpark. Click Next to advance through each stage and see the code and data at every step.
This pipeline reads raw CSV data, cleans it, transforms it, and writes the final output as partitioned Parquet — exactly as you would in a real data engineering job.
Here is the complete pipeline in one view. Each step builds on the previous, following the Extract → Clean → Transform → Load pattern.
Understanding how Spark distributes and executes your PySpark code across a cluster. Click any component in the diagram to learn more.
Spark follows a master-worker architecture. The Driver program orchestrates work, and Executors run tasks in parallel across the cluster. Click each component below to understand its role.
Select any box in the architecture diagram to learn about its role in the Spark cluster.
When you write PySpark code, here is exactly what happens under the hood:
Always chain transformations and let Catalyst optimize them. Only trigger actions when you need results. This is the single most important concept in PySpark.
Wide transformations (groupBy, join) cause data shuffling across the network. Use broadcast joins for small tables, partition your data wisely, and avoid unnecessary orderBy operations.
Prefer the DataFrame API over raw RDDs. It benefits from Catalyst optimization and Tungsten's code generation, often running an order of magnitude faster; RDD operations bypass the optimizer entirely.
Essential advanced topics every PySpark developer must master — from UDFs and partitioning strategies to Spark SQL and broadcast variables.
Both control the number of partitions, but they work very differently under the hood.
UDFs let you extend PySpark with custom Python logic when built-in functions aren't enough. But use them wisely — they bypass Catalyst optimization.
Prefer built-in functions (F.upper(), F.when(), etc.) whenever possible; when you do need custom logic, register it with the @udf decorator. You can also write SQL queries directly on DataFrames by registering them as temporary views, which is powerful for analysts who prefer SQL syntax.
Both produce the same execution plan — Catalyst optimizes them identically. Use whichever your team prefers. Many production pipelines mix both: SQL for complex aggregations, DataFrame API for programmatic transformations.
Two special shared variables for distributed computing:
Breaks the DAG lineage by materializing data to disk, preventing StackOverflowError on very deep lineage chains. Requires a checkpoint directory set via spark.sparkContext.setCheckpointDir(). Use df.checkpoint() after many chained transformations or in iterative algorithms.
Spark 3.x feature that re-optimizes queries at runtime based on actual data statistics: it handles data skew, coalesces small shuffle partitions, and switches join strategies dynamically. Controlled by spark.sql.adaptive.enabled (on by default since Spark 3.2).
Open-source storage layer that adds ACID transactions, schema enforcement, and time travel to Parquet files. The foundation of modern Lakehouse architecture. Write with df.write.format("delta").
Default is 200 partitions after any shuffle (groupBy, join). Too many = overhead from small tasks. Too few = memory pressure. Tune with spark.sql.shuffle.partitions based on data size (~128MB per partition).
When one partition has 100x more data than others, it bottlenecks the entire job. Solutions: salt keys, use broadcast joins, enable AQE skew join optimization, or repartition on a different column.
Process real-time data streams with the same DataFrame API using Structured Streaming. Supports Kafka, files, sockets as sources. Triggers: micro-batch or continuous processing mode.