Build a custom Spark data source with the Data Source API (real streaming)

Implement `SimpleDataSourceStreamReader`, define schema and offsets, and expose a custom format to read streaming events with control and observability, without external connectors.

February 1, 2026 · 3 min · 452 words · pw

Delta storage layout: what's really on disk

Explore the on‑disk layout, commits, and checkpoints, and see why it matters for performance, maintenance, and troubleshooting in production.

February 1, 2026 · 2 min · 302 words · pw

Delta Table 101: your first table end‑to‑end

End‑to‑end walkthrough: create a Delta table, insert data, read, filter, and validate results with expected outputs. The minimal base before any optimization work.

February 1, 2026 · 2 min · 328 words · pw

Delta Time Travel: query the past with confidence

Learn `versionAsOf` and `timestampAsOf`, validate changes, and understand when time travel is best for auditing, recovery, and regression analysis in Delta Lake.

February 1, 2026 · 2 min · 315 words · pw

Fix skewed joins in Spark with salting

Detect skewed joins in Spark and apply salting to spread hot keys. You will compare stage and shuffle times before and after, using a synthetic repro and a real dataset, with downloads at the end.

February 1, 2026 · 4 min · 725 words · pw

Kafka + Spark: your first streaming read

Connect local Kafka to Spark Structured Streaming, define a schema, and run a continuous read. Includes simple metrics and validations to confirm the stream is working.

February 1, 2026 · 2 min · 214 words · pw

Kafka 101: your first local topic

Kafka CLI first steps: create topics, produce events, and consume them from the console in a reproducible local environment. Perfect for practice without cloud dependencies.

February 1, 2026 · 1 min · 211 words · pw

Kafka consumer groups: how work gets split

Walk through offsets, partitions, and rebalances with a runnable example that shows how consumption is split across consumers and what happens when you scale out or a consumer fails.

February 1, 2026 · 1 min · 198 words · pw

PySpark DataFrames: the three daily moves

Practical guide with clear examples and expected outputs to master core DataFrame transformations. Includes readable chaining patterns and quick validations.

February 1, 2026 · 2 min · 350 words · pw

Spark local: first run and verification

Hands‑on guide to bring up the local stack, check UI/health, and run a first job. Includes minimal checks to confirm Master/Workers are healthy and ready for the rest of the series.

February 1, 2026 · 1 min · 211 words · pw