Pandas DataFrames mutate inside functions

See why pandas DataFrames are mutable, how in-place ops leak changes across function boundaries, and how to make intent explicit. Includes a runnable repro, expected output, and safer patterns.

February 4, 2026 · 2 min · 301 words · pw

Build a Spark streaming Data Source

Implement a minimal Data Source API reader with real offsets, a clear schema, and a usable format. You will compare the naive batch approach vs real streaming and run it end-to-end.

February 1, 2026 · 3 min · 441 words · pw

Fix skewed Spark joins

Detect skewed joins in Spark and apply salting to spread hot keys. You will compare before/after stage and shuffle times using a synthetic repro and a real dataset, with downloads at the end.

February 1, 2026 · 4 min · 725 words · pw

PySpark basics for everyday work

A practical guide to core DataFrame transformations, with clear examples and expected outputs. Includes readable chaining patterns and quick validations.

February 1, 2026 · 2 min · 343 words · pw

Query past versions in Delta

Learn versionAsOf and timestampAsOf, validate changes, and understand when time travel is best for auditing, recovery, and regression analysis in Delta Lake.

February 1, 2026 · 2 min · 308 words · pw

Read Kafka with Spark Streaming

Connect local Kafka to Spark Structured Streaming, define a schema, and run a continuous read. Includes simple metrics and validations to confirm the stream is working.

February 1, 2026 · 1 min · 210 words · pw

Spark local, first run

Hands‑on guide to bring up the local stack, check UI/health, and run a first job. Includes minimal checks to confirm Master/Workers are healthy and ready for the rest of the series.

February 1, 2026 · 1 min · 211 words · pw

Spark partitions without the pain

An introduction to spark.sql.shuffle.partitions, repartition, and coalesce, with a reproducible example that shows their impact on stages, run time, and shuffle size.

February 1, 2026 · 2 min · 252 words · pw

What Delta stores on disk

Explore the on‑disk layout, commits, and checkpoints, and see why it matters for performance, maintenance, and troubleshooting in production.

February 1, 2026 · 2 min · 295 words · pw

Your first Delta table, step by step

End‑to‑end walkthrough: create a Delta table, insert data, read, filter, and validate results with expected outputs. The minimal base before any optimization work.

February 1, 2026 · 2 min · 321 words · pw