Spark

Pandas DataFrames mutate inside functions

See why pandas DataFrames are mutable, how in-place ops leak changes across function boundaries, and how to make intent explicit. Includes a runnable repro, expected output, and safer patterns.

Build a Spark streaming Data Source

Implement a minimal Data Source API reader with real offsets, a clear schema, and a usable format. You will compare the naive batch approach vs real streaming and run it end-to-end.

Fix skewed Spark joins

Detect skewed joins in Spark and apply salting to spread hot keys. You will compare before/after stage and shuffle times, with a synthetic repro and a real dataset plus downloads at the end.

PySpark basics for everyday work

Practical guide with clear examples and expected outputs to master core DataFrame transformations. Includes readable chaining patterns and quick validations.

Query past versions in Delta

Learn versionAsOf and timestampAsOf, validate changes, and understand when time travel is best for auditing, recovery, and regression analysis in Delta Lake.

Read Kafka with Spark Streaming

Connect local Kafka to Spark Structured Streaming, define a schema, and run a continuous read. Includes simple metrics and validations to confirm the stream is working.

Spark local, first run

Hands‑on guide to bring up the local stack, check UI/health, and run a first job. Includes minimal checks to confirm Master/Workers are healthy and ready for the rest of the series.

Spark partitions without the pain

Introduce spark.sql.shuffle.partitions, repartition, and coalesce with a reproducible example to see impact on stages, time, and shuffle size.

Delta storage layout: what's really on disk

What Delta stores on disk

Explore the on‑disk layout, commits, and checkpoints, and see why it matters for performance, maintenance, and troubleshooting in production.

Your first Delta table, step by step

End‑to‑end walkthrough: create a Delta table, insert data, read, filter, and validate results with expected outputs. The minimal base before any optimization work.