If you are new to Delta Lake, this is the first post to run. It focuses on the minimal actions you perform in real work: creating a Delta table, reading it back, and overwriting it safely. Reference: Delta Lake.

Downloads at the end: go to Downloads.

Quick takeaways

  • Delta tables are regular Parquet data files plus a transaction log (_delta_log).
  • You can read and write Delta like a normal table, but with ACID transaction guarantees.
  • This post gives you a minimal, reproducible flow to start.

Run it yourself

  • Local Spark (Docker): the main path for this blog; start it with the command below.
  • Databricks Free Edition: a quick alternative if you do not want Docker.

docker compose up
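
If you run PySpark outside the Docker environment, the spark session used below needs Delta Lake enabled. A minimal sketch, assuming the delta-spark pip package is installed (the app name is arbitrary):

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Enable the Delta SQL extension and catalog on a local session
builder = (
    SparkSession.builder.appName("delta-101")
        .config("spark.sql.extensions",
                "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.catalog.DeltaCatalog")
)

# configure_spark_with_delta_pip adds the matching Delta jars at session start
spark = configure_spark_with_delta_pip(builder).getOrCreate()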

Minimal setup

We generate a small dataset, write it as Delta, then read it back. Ref: Spark range.

from pyspark.sql import functions as F

# 100,000 rows with a "group" column holding the values 0 through 9
df = (
    spark.range(0, 100_000)
         .withColumn("group", (F.col("id") % 10).cast("int"))
)
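
A quick peek at the generated rows:

df.show(3)

Expected output (example):

+---+-----+
| id|group|
+---+-----+
|  0|    0|
|  1|    1|
|  2|    2|
+---+-----+
only showing top 3 rows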

Create the Delta table

Persist the DataFrame as Delta in a local path. Ref: DataFrameWriter.

delta_path = "/tmp/delta/table_101"

df.write.format("delta").mode("overwrite").save(delta_path)

Read it back

Read the same path to validate it. Ref: DataFrameReader.

read_back = spark.read.format("delta").load(delta_path)
read_back.groupBy("group").count().show()

Expected output (example):

+-----+-----+
|group|count|
+-----+-----+
|    0|10000|
|    1|10000|
...

Overwrite safely (same schema)

Delta applies mode("overwrite") as a single atomic commit, so readers never see a partially written table. Here we keep only groups 0 through 4:

df_filtered = df.filter("group < 5")
df_filtered.write.format("delta").mode("overwrite").save(delta_path)

Expected output: the write itself prints nothing. After reading the path again, only groups 0-4 remain, so the count drops.
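
A quick check, re-reading the same delta_path:

after = spark.read.format("delta").load(delta_path)
print(after.count())                   # 50,000 rows remain (groups 0-4 only)
after.groupBy("group").count().show()  # groups 5-9 no longer appear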


What to verify

  • The table reads without errors.
  • Counts change after overwrite.
  • The folder contains a _delta_log directory (see the check after this list).
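
A minimal check for the last point, run from the same session (uses the delta_path defined above):

import os

# The table folder holds Parquet data files plus the _delta_log directory,
# which stores one JSON commit file per write.
print(sorted(os.listdir(delta_path)))
print(sorted(os.listdir(os.path.join(delta_path, "_delta_log"))))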

Notes from practice

  • Always use format("delta") explicitly to avoid ambiguity.
  • Start with a local path so you can inspect files on disk.
  • Keep paths simple for beginners.

Downloads

If you want to run this without copying code, download the notebook or the .py export.