This post connects Spark Structured Streaming to a local Kafka topic and reads messages in real time. Ref: the Structured Streaming + Kafka Integration Guide in the Spark docs.

The notebook and .py export are linked at the end of the post: go to Downloads.

Quick takeaways

  • Spark can read Kafka topics directly using the Kafka connector.
  • You can validate end-to-end streaming locally.
  • This is the bridge between ingestion and processing.

Run it yourself

  • Local Docker: the default path for this blog (a minimal compose sketch follows the command below).
docker compose up
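
The blog's own compose file isn't reproduced here. If you need a stand-in, a minimal single-broker setup might look like the sketch below; the apache/kafka image, version, and single-node defaults are assumptions, not the blog's actual file.

# docker-compose.yml -- hypothetical minimal single-node Kafka broker
services:
  kafka:
    image: apache/kafka:3.7.0   # assumed image/version; use whatever your stack pins
    ports:
      - "9092:9092"             # expose the broker on localhost:9092 for Spark and the CLI tools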


Produce messages

kafka-console-producer.sh --topic demo-events --bootstrap-server localhost:9092
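
The command above sends value-only messages, so the key column in Spark will show up as null. If you also want keys, the console producer can parse them from each line; this is a sketch using standard console-producer properties (the ":" separator is an arbitrary choice):

kafka-console-producer.sh --topic demo-events --bootstrap-server localhost:9092 \
  --property parse.key=true --property key.separator=:

# then type lines such as:
#   user1:hello from kafka
#   user2:second event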

Read with Spark Structured Streaming

from pyspark.sql import SparkSession

# The Kafka source ships separately from Spark core, so the
# spark-sql-kafka-0-10 connector must be on the classpath
# (e.g. supplied via spark-submit --packages).
spark = SparkSession.builder.appName("kafka-demo").getOrCreate()

# Subscribe to the demo-events topic on the local broker.
df = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "demo-events")
         .load()
)

# Kafka delivers key and value as binary; cast both to strings for display.
out = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Print each micro-batch to the console so new messages show up immediately.
q = (
    out.writeStream.format("console")
       .outputMode("append")
       .option("truncate", False)
       .start()
)

# Keep the script alive; omit this line if you run the query from a notebook.
q.awaitTermination()

Expected output: each message you produce shows up as a new row in the next console micro-batch.
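
The console sink prints raw strings, which is enough for this demo. If your values are JSON, a natural next step is to parse them into columns; this is a minimal sketch, assuming a hypothetical two-field payload (event, ts) rather than anything defined in this post:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Hypothetical schema for the JSON value; adjust to your real messages.
schema = StructType([
    StructField("event", StringType()),
    StructField("ts", TimestampType()),
])

# Turn the string value into typed columns instead of printing raw text.
parsed = out.withColumn("data", F.from_json(F.col("value"), schema)).select("key", "data.*")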


What to verify

  • Messages appear in the Spark console sink.
  • The streaming query stays active while you produce data (a quick check follows this list).
  • Stopping the producer does not crash the query.
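
For the "stays active" check, the StreamingQuery handle started above exposes standard accessors you can inspect from the same session; nothing here is specific to this blog:

# True while the streaming query is still running.
print(q.isActive)

# High-level status, e.g. whether the trigger is waiting for new data.
print(q.status)

# Metrics for the most recent micro-batch (None until the first batch completes).
print(q.lastProgress)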

Downloads

If you want to run this without copying code, download the notebook or the .py export.