Series: Spark & Delta 101
- 1/5. Your first Delta table, step by step
- 2/5. PySpark basics for everyday work
- 3/5. Query past versions in Delta
- 4/5. What Delta stores on disk
- 5/5. Spark partitions without the pain
If you are new to Spark, start with these three operations: select, filter, and write. This is a short, practical tour with a small dataset you can run anywhere. Reference: select, filter.
Downloads are at the end of the post: see the Downloads section.
Quick takeaways
- DataFrames are the core API you will use every day.
- The basics (select, filter, groupBy) cover most daily tasks.
- Writing data is part of the workflow, not an afterthought.
Run it yourself
- Local Spark (Docker): main path for this blog.
- Databricks Free Edition: quick alternative if you do not want Docker.
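Whichever option you pick, the snippets below assume an active SparkSession named spark. A minimal sketch for the local path (on Databricks, spark is already provided; the app name is just an example):

```python
from pyspark.sql import SparkSession

# Minimal local session for the Docker path; on Databricks the `spark`
# object already exists, so you can skip this cell there.
spark = (
    SparkSession.builder
    .appName("pyspark-basics")   # illustrative app name
    .master("local[*]")          # use every local core
    .getOrCreate()
)
```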
Create a tiny dataset
We build a small DataFrame we can explore.
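A minimal sketch that matches the sample outputs further down: about 50,000 rows with an id, a country code (PE, MX, CO), and a random amount. The column names, row count, and the id-based country split are assumptions inferred from those outputs.

```python
from pyspark.sql import functions as F

# Synthetic data: ids 1..50000, a country derived from the id,
# and a random amount between 0 and 100 rounded to two decimals.
df = (
    spark.range(1, 50001)  # yields a single column named "id"
    .withColumn(
        "country",
        F.when(F.col("id") % 3 == 1, "PE")
         .when(F.col("id") % 3 == 0, "MX")
         .otherwise("CO"),
    )
    .withColumn("amount", F.round(F.rand() * 100, 2))
)

df.show(5)
```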
Select and filter
Pick columns and filter to keep relevant rows.
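A sketch of both operations chained together; the column list and the 50.0 threshold are illustrative, so your numbers will differ from the sample below:

```python
# Keep only the columns we care about, then drop rows at or below the threshold.
filtered = (
    df.select("id", "country", "amount")
      .filter(F.col("amount") > 50.0)
)

filtered.show(5)
```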
Expected output (example):
+---+-------+------+
| id|country|amount|
+---+-------+------+
|  1|     PE| 78.21|
...
Group and aggregate
Group by country to get a quick summary.
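One way to sketch it, counting rows per country on the full dataset:

```python
# Row count per country; groupBy(...).count() adds a column named "count".
summary = df.groupBy("country").count()
summary.show()
```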
Expected output (example):
+-------+-----+
|country|count|
+-------+-----+
|     PE|16667|
|     MX|16666|
|     CO|16667|
+-------+-----+
Write the result
Persist the output to see the on‑disk layout.
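A sketch of the write; the output path is an example, so point it wherever is convenient in your setup:

```python
# Persist the filtered rows as Parquet under an example output path.
(
    filtered.write
    .mode("overwrite")                  # replace the folder on re-runs
    .parquet("output/filtered_amounts")
)
```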
Expected output: The output folder is created with Parquet files.
What to verify
- filtered.count() is less than the original count.
- The output folder exists and contains Parquet files.
- The group counts make sense for your distribution.
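These checks are easy to script; a small sketch that reuses the names and the example path from the snippets above:

```python
# Sanity checks: the filter removed rows, and the written data reads back intact.
assert filtered.count() < df.count()

written = spark.read.parquet("output/filtered_amounts")
assert written.count() == filtered.count()
```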
Notes from practice
- Always start by inspecting a small sample with show().
- Keep paths simple when teaching new users.
- Save outputs to build intuition about file layouts.
Downloads
If you want to run this without copying code, download the notebook or the .py export.