Pandas DataFrames mutate inside functions

See why pandas DataFrames are mutable, how in-place operations leak changes across function boundaries, and how to make intent explicit. Includes a runnable repro, expected output, and safer patterns.

February 4, 2026 · 2 min · 301 words · pw

Build a Spark streaming Data Source

Implement a minimal Data Source API reader with real offsets, a clear schema, and a usable format. You will compare the naive batch approach with real streaming and run the reader end to end.

February 1, 2026 · 3 min · 441 words · pw

Fix skewed Spark joins

Detect skewed joins in Spark and apply salting to spread hot keys. You will compare stage and shuffle times before and after, using a synthetic repro and a real dataset. Downloads are at the end.

February 1, 2026 · 4 min · 725 words · pw

Spark partitions without the pain

Learn spark.sql.shuffle.partitions, repartition, and coalesce through a reproducible example that shows their impact on stages, run time, and shuffle size.

February 1, 2026 · 2 min · 252 words · pw