Series: Spark & Delta 101
1/5. Your first Delta table, step by step
2/5. PySpark basics for everyday work
3/5. Query past versions in Delta
4/5. What Delta stores on disk
5/5. Spark partitions without the pain
Partitions are the unit of parallelism in Spark. This post shows how the partition count changes task distribution and performance, using repartition and coalesce.
Quick takeaways
- Too few partitions leave cores idle and underutilize the cluster.
- Too many partitions add scheduling overhead and produce lots of tiny tasks.
- You can inspect partitions and adjust them safely.
Run it yourself
- Local Spark (Docker): main path for this blog.
- Databricks Free Edition: quick alternative if you do not want Docker.
Create a dataset
Use a large range to see partition behavior.
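A minimal sketch, assuming a local SparkSession; the app name and the 10 million row count are arbitrary choices (on Databricks, `spark` already exists and the builder line can be skipped):

```python
from pyspark.sql import SparkSession

# Assumption: a local session started for this demo.
spark = SparkSession.builder.appName("partitions-demo").getOrCreate()

# A large range makes partition behavior visible; 10 million rows is an arbitrary choice.
df = spark.range(0, 10_000_000)
```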
Check current partitions
Inspect how many partitions the DataFrame has.
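One way to check, via the DataFrame's underlying RDD:

```python
# Number of partitions backing the DataFrame (depends on cores and Spark defaults).
print(df.rdd.getNumPartitions())
```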
Expected output (example):
8
Repartition vs coalesce
Compare both to understand their impact.
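A sketch using the names from the expected output below; 64 and 4 are arbitrary targets:

```python
# repartition: full shuffle, can increase or decrease the partition count.
df_repart = df.repartition(64)

# coalesce: merges existing partitions without a full shuffle, can only decrease the count.
df_coal = df.coalesce(4)

print(df_repart.rdd.getNumPartitions())  # 64
print(df_coal.rdd.getNumPartitions())    # 4 (never more than the original count)
```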
Expected output:
df_repart has 64 partitions; df_coal has 8 or fewer.
What to verify
- The number of partitions changes as expected.
- More partitions increase tasks; fewer partitions reduce them.
- Task duration becomes more balanced with a reasonable count.
Notes from practice
- Start with defaults; adjust only when the evidence is clear.
- Repartition triggers a full shuffle; coalesce avoids one by merging existing partitions, which is why it can only reduce the count.
- Use the Spark UI to see how partitions map to tasks (see the sketch after this list).
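A small sketch, assuming the session and DataFrames from the examples above; count() is just a convenient action so that jobs and their tasks actually show up in the UI:

```python
# Run an action: one task per partition appears in the Spark UI's Jobs/Stages pages.
df_repart.count()

# URL of the Spark UI for this session (typically http://localhost:4040 for a local run).
print(spark.sparkContext.uiWebUrl)
```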
Downloads
If you want to run this without copying code, download the notebook or the .py export.