If you are new to Spark, start with a handful of operations: select, filter, groupBy, and write. This is a short, practical tour that uses a small generated dataset you can run anywhere. Reference: select, filter.

The notebook and the .py export are listed in the Downloads section at the end.

Quick takeaways

  • DataFrames are the core API you will use every day.
  • The basics (select, filter, groupBy) cover most daily tasks.
  • Writing data is part of the workflow, not an afterthought.

Run it yourself

  • Local Spark (Docker): main path for this blog.
  • Databricks Free Edition: quick alternative if you do not want Docker.

docker compose up

Create a tiny dataset

We build a small DataFrame we can explore.

from pyspark.sql import functions as F

df = (
    spark.range(0, 100_000)
         .withColumn(
             "country",
             F.when(F.col("id") % 3 == 0, "MX")
              .when(F.col("id") % 3 == 1, "PE")
              .otherwise("CO"),
         )
         .withColumn("amount", (F.rand() * 100).cast("double"))
)

Select and filter

Pick columns and filter to keep relevant rows.

filtered = df.select("id", "country", "amount").filter("amount > 50")
filtered.show(5)

Expected output (example):

+---+-------+------+
| id|country|amount|
+---+-------+------+
|  1|     PE| 78.21|
...

Group and aggregate

Group by country to get a quick summary.

summary = filtered.groupBy("country").count()
summary.show()

Expected output (example; exact counts and row order vary between runs because amount is random):

+-------+-----+
|country|count|
+-------+-----+
|     PE|16667|
|     MX|16666|
|     CO|16667|
+-------+-----+

Write the result

Persist the output to see the on‑disk layout.

out_path = "/tmp/pyspark/basics"
summary.write.mode("overwrite").parquet(out_path)

Expected result: the output folder is created and contains one or more part-*.parquet files, plus a _SUCCESS marker file.


What to verify

  • filtered.count() is less than df.count(); expect roughly half of the 100,000 rows, since amount is uniform between 0 and 100.
  • The output folder exists and contains Parquet files.
  • The group counts make sense for your distribution.

Notes from practice

  • Always start by inspecting a small sample with show().
  • Keep paths simple when teaching new users.
  • Save outputs to build intuition about file layouts.

Downloads

If you want to run this without copying code, download the notebook or the .py export.