Partitions are the unit of parallelism in Spark: each partition is processed by a single task. This post shows how the partition count changes task distribution and performance, using repartition and coalesce.

Downloads (the notebook and a .py export) are in the Downloads section at the end.

Quick takeaways

  • Too few partitions leave cores idle and underutilize the cluster.
  • Too many partitions add scheduling and per-task overhead.
  • You can inspect partitions and adjust them safely.

Run it yourself

  • Local Spark (Docker): the main path used in this post; start it with the command below.
  • Databricks Free Edition: a quick alternative if you do not want to run Docker.

docker compose up

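The snippets in the rest of this post assume a spark session. On Databricks one already exists; for the local path, a minimal sketch (local[*] and the app name are placeholders, not part of the compose setup):

from pyspark.sql import SparkSession

# Minimal local session; local[*] uses all available cores, the app name is just a label
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("partitions-demo")
    .getOrCreate()
)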

Create a dataset

Use a large range to see partition behavior.

df = spark.range(0, 5_000_000)

Check current partitions

Inspect how many partitions the DataFrame has.

df.rdd.getNumPartitions()

Expected output (example):

8

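To see how rows are spread across those partitions, a rough sketch (it reuses df from above and collects only one integer per partition, so it is cheap to run):

# Count the rows in each partition; returns one integer per partition
counts = df.rdd.mapPartitions(lambda rows: [sum(1 for _ in rows)]).collect()
print(counts)  # e.g. roughly 625_000 rows in each of 8 partitions
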
Repartition vs coalesce

Compare both to understand their impact.

df_repart = df.repartition(64)
df_coal = df.coalesce(8)

Expected output: df_repart has 64 partitions; df_coal has 8 or fewer.
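
One way to confirm, under the same assumptions (the exact plan text varies by Spark version):

# Check the resulting partition counts
print(df_repart.rdd.getNumPartitions())  # 64
print(df_coal.rdd.getNumPartitions())    # 8 or fewer

# The plans differ: repartition adds an Exchange (a shuffle), coalesce a Coalesce step
df_repart.explain()
df_coal.explain()

The Exchange line in the first plan is the full shuffle mentioned in the notes below.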


What to verify

  • The number of partitions changes as expected.
  • More partitions increase tasks; fewer partitions reduce them.
  • Task duration becomes more balanced with a reasonable count; the timing sketch below is one quick way to compare.
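
A rough timing comparison, assuming the frames from the previous section; the absolute numbers depend entirely on your machine and only the relative difference matters:

import time

# Trigger an action on each frame and time it; count() runs one task per partition
for name, frame in [("repartition(64)", df_repart), ("coalesce(8)", df_coal)]:
    start = time.time()
    frame.count()
    print(name, round(time.time() - start, 2), "s")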

Notes from practice

  • Start with defaults; adjust only when the evidence is clear. The snippet below shows where the defaults live.
  • Repartition triggers a full shuffle; coalesce merges existing partitions without one and can only reduce the count.
  • Use the Spark UI to see how partitions map to tasks.
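
A small sketch for checking those defaults and finding the UI, assuming the same spark session (the config name is a standard Spark setting):

print(spark.sparkContext.defaultParallelism)           # default parallelism for new RDDs and spark.range
print(spark.conf.get("spark.sql.shuffle.partitions"))  # partition count after shuffles (200 by default)
print(spark.sparkContext.uiWebUrl)                     # address of the Spark UI for this session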

Downloads

If you want to run this without copying code, download the notebook or the .py export.