Series: Spark & Delta 101
- 1/5. Your first Delta table, step by step
- 2/5. PySpark basics for everyday work
- 3/5. Query past versions in Delta
- 4/5. What Delta stores on disk
- 5/5. Spark partitions without the pain
If you are new to Spark, start with these three operations: select, filter, and write. This is a short, practical tour with a small dataset you can run anywhere. Reference: select, filter.
Downloads are at the end of the post: see the Downloads section.
Quick takeaways
- DataFrames are the core API you will use every day.
- The basics (select, filter, groupBy) cover most daily tasks.
- Writing data is part of the workflow, not an afterthought.
Run it yourself
- Local Spark (Docker): main path for this blog.
- Databricks Free Edition: quick alternative if you do not want Docker.
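Whichever option you pick, the snippets below assume an active SparkSession named spark. A minimal sketch for the local path (on Databricks, spark is already provided; the app name is just an example):

```python
from pyspark.sql import SparkSession

# Minimal local session for the Docker path; on Databricks the `spark`
# object already exists, so you can skip this cell there.
spark = (
    SparkSession.builder
    .appName("pyspark-basics")   # illustrative app name
    .master("local[*]")          # use every local core
    .getOrCreate()
)
```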
Create a tiny dataset
We build a small DataFrame we can explore.
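A minimal sketch that matches the sample outputs further down: about 50,000 rows with an id, a country code (PE, MX, CO), and a random amount. The column names, row count, and the id-based country split are assumptions inferred from those outputs.

```python
from pyspark.sql import functions as F

# Synthetic data: ids 1..50000, a country derived from the id,
# and a random amount between 0 and 100 rounded to two decimals.
df = (
    spark.range(1, 50001)  # yields a single column named "id"
    .withColumn(
        "country",
        F.when(F.col("id") % 3 == 1, "PE")
         .when(F.col("id") % 3 == 0, "MX")
         .otherwise("CO"),
    )
    .withColumn("amount", F.round(F.rand() * 100, 2))
)

df.show(5)
```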
Select and filter
Pick columns and filter to keep relevant rows.
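A sketch of both operations chained together; the column list and the 50.0 threshold are illustrative, so your numbers will differ from the sample below:

```python
# Keep only the columns we care about, then drop rows at or below the threshold.
filtered = (
    df.select("id", "country", "amount")
      .filter(F.col("amount") > 50.0)
)

filtered.show(5)
```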
Expected output (example):
+---+-------+------+
| id|country|amount|
+---+-------+------+
|  1|     PE| 78.21|
...
Group and aggregate
Group by country to get a quick summary.
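One way to sketch it, counting rows per country on the full dataset:

```python
# Row count per country; groupBy(...).count() adds a column named "count".
summary = df.groupBy("country").count()
summary.show()
```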
Expected output (example):
+-------+-----+
|country|count|
+-------+-----+
|     PE|16667|
|     MX|16666|
|     CO|16667|
+-------+-----+
Write the result
Persist the output to see the on‑disk layout.
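A sketch of the write; the output path is an example, so point it wherever is convenient in your setup:

```python
# Persist the filtered rows as Parquet under an example output path.
(
    filtered.write
    .mode("overwrite")                  # replace the folder on re-runs
    .parquet("output/filtered_amounts")
)
```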
Expected output: The output folder is created with Parquet files.
What to verify
- filtered.count() is less than the original count.
- The output folder exists and contains Parquet files.
- The group counts make sense for your distribution.
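These checks are easy to script; a small sketch that reuses the names and the example path from the snippets above:

```python
# Sanity checks: the filter removed rows, and the written data reads back intact.
assert filtered.count() < df.count()

written = spark.read.parquet("output/filtered_amounts")
assert written.count() == filtered.count()
```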
Notes from practice
- Always start by inspecting a small sample with show().
- Keep paths simple when teaching new users.
- Save outputs to build intuition about file layouts.
Downloads
If you want to run this without copying code, download the notebook or the .py export.