Series: Spark & Delta 101
1/5. Your first Delta table, step by step
2/5. PySpark basics for everyday work
3/5. Query past versions in Delta
4/5. What Delta stores on disk
5/5. Spark partitions without the pain
If you are new to Delta Lake, start with this post. It covers the minimal actions you perform in real work: create a Delta table, read it back, and overwrite it safely. Reference: Delta Lake.
Download links are at the end of the post: see Downloads.
Quick takeaways
- Delta tables are regular Parquet data files plus a transaction log.
- You can read and write Delta like a normal table, but with ACID reliability guarantees.
- This post gives you a minimal, reproducible flow to start.
Run it yourself
- Local Spark (Docker): main path for this blog.
- Databricks Free Edition: quick alternative if you do not want Docker.
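Either way, you need a SparkSession with Delta enabled. The exact setup depends on your environment; the sketch below assumes a local install with the `pyspark` and `delta-spark` packages (`pip install pyspark delta-spark`). On Databricks, Delta is already configured and you can use the provided `spark` session directly.

```python
# Minimal local SparkSession with Delta Lake enabled (sketch, local setup).
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.appName("delta-101")
    # Register Delta's SQL extension and catalog implementation.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

# configure_spark_with_delta_pip adds the Delta JARs matching the installed
# delta-spark version; skip this step on Databricks.
spark = configure_spark_with_delta_pip(builder).getOrCreate()
```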
Minimal setup
We generate a small dataset, write it as Delta, then read it back. Ref: Spark range.
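A sketch of the data generation, assuming 20,000 rows split evenly into two groups so the counts match the example output further down (the size and column name are illustrative):

```python
from pyspark.sql import functions as F

# 20,000 rows with an `id` column from spark.range, plus a `group`
# column that splits them evenly into groups 0 and 1.
df = (
    spark.range(0, 20000)
    .withColumn("group", (F.col("id") % 2).cast("int"))
)

df.show(5)
```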
Create the Delta table
Persist the DataFrame as Delta in a local path. Ref: DataFrameWriter.
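A minimal write can look like the sketch below; the path is illustrative, and any writable local directory works.

```python
# Local filesystem path where the Delta table will live (illustrative).
delta_path = "/tmp/delta-101/events"

(
    df.write
    .format("delta")   # be explicit about the format
    .save(delta_path)  # default mode errors if the path already holds data
)
```

If you re-run the notebook, either delete the folder first or jump ahead to the overwrite step.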
Read it back
Read the same path to validate it. Ref: DataFrameReader.
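Reading back is symmetric; this sketch loads the same illustrative path and aggregates per group to produce the table below.

```python
# Load the Delta table from disk and count rows per group.
df_read = spark.read.format("delta").load(delta_path)

df_read.groupBy("group").count().orderBy("group").show()
```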
Expected output (example):
```
+-----+-----+
|group|count|
+-----+-----+
|    0|10000|
|    1|10000|
...
```
Overwrite safely (same schema)
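One way to exercise this, sketched under the same assumptions as above, is to write a smaller DataFrame with the same schema in overwrite mode:

```python
from pyspark.sql import functions as F

# Smaller dataset with the same schema (id, group).
df_small = (
    spark.range(0, 5000)
    .withColumn("group", (F.col("id") % 2).cast("int"))
)

(
    df_small.write
    .format("delta")
    .mode("overwrite")   # replaces the table contents in one atomic commit
    .save(delta_path)
)
```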
Expected output: none directly. When you read the table again, the counts should drop.
What to verify
- The table reads without errors.
- Counts change after overwrite.
- The folder contains a `_delta_log` directory (see the quick check below).
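A quick way to check the last point, assuming the illustrative `delta_path` defined above:

```python
import os

# The table folder should contain Parquet data files plus a _delta_log directory.
print(sorted(os.listdir(delta_path)))
```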
Notes from practice
- Always use `format("delta")` explicitly to avoid ambiguity.
- Start with a local path so you can inspect files on disk.
- Keep paths simple for beginners.
Downloads
If you want to run this without copying code, download the notebook or the .py export.