Series: Spark & Delta 101
1/5. Your first Delta table, step by step
2/5. PySpark basics for everyday work
3/5. Query past versions in Delta
4/5. What Delta stores on disk
5/5. Spark partitions without the pain
Partitions are the unit of parallelism in Spark. This post shows how the partition count changes task distribution and performance, using repartition and coalesce.
Quick takeaways
- Too few partitions leave cores idle and underutilize the cluster.
- Too many partitions add scheduling overhead and produce lots of tiny tasks.
- You can inspect partitions and adjust them safely.
Run it yourself
- Local Spark (Docker): main path for this blog.
- Databricks Free Edition: quick alternative if you do not want Docker.
Create a dataset
Use a large range to see partition behavior.
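A minimal sketch, assuming a local SparkSession; the app name and the 10 million row count are arbitrary choices (on Databricks, `spark` already exists and the builder line can be skipped):

```python
from pyspark.sql import SparkSession

# Assumption: a local session started for this demo.
spark = SparkSession.builder.appName("partitions-demo").getOrCreate()

# A large range makes partition behavior visible; 10 million rows is an arbitrary choice.
df = spark.range(0, 10_000_000)
```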
Check current partitions
Inspect how many partitions the DataFrame has.
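One way to check, via the DataFrame's underlying RDD:

```python
# Number of partitions backing the DataFrame (depends on cores and Spark defaults).
print(df.rdd.getNumPartitions())
```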
Expected output (example):
8
Repartition vs coalesce
Compare both to understand their impact.
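A sketch using the names from the expected output below; 64 and 4 are arbitrary targets:

```python
# repartition: full shuffle, can increase or decrease the partition count.
df_repart = df.repartition(64)

# coalesce: merges existing partitions without a full shuffle, can only decrease the count.
df_coal = df.coalesce(4)

print(df_repart.rdd.getNumPartitions())  # 64
print(df_coal.rdd.getNumPartitions())    # 4 (never more than the original count)
```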
Expected output:
df_repart has 64 partitions; df_coal has 8 or fewer.
What to verify
- The number of partitions changes as expected.
- More partitions increase tasks; fewer partitions reduce them.
- Task duration becomes more balanced with a reasonable count.
Notes from practice
- Start with defaults; adjust only when the evidence is clear.
- Repartition triggers a full shuffle; coalesce avoids one by merging existing partitions, which is why it can only reduce the count.
- Use the Spark UI to see how partitions map to tasks (see the sketch after this list).
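A small sketch, assuming the session and DataFrames from the examples above; count() is just a convenient action so that jobs and their tasks actually show up in the UI:

```python
# Run an action: one task per partition appears in the Spark UI's Jobs/Stages pages.
df_repart.count()

# URL of the Spark UI for this session (typically http://localhost:4040 for a local run).
print(spark.sparkContext.uiWebUrl)
```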
Downloads
If you want to run this without copying code, download the notebook or the .py export.