This post connects Spark Structured Streaming to a local Kafka topic and reads messages in real time. Ref: the Structured Streaming + Kafka Integration Guide in the Spark docs.

The notebook and .py export are linked at the end of the post: go to Downloads.

Quick takeaways

  • Spark can read Kafka topics directly using the Kafka connector.
  • You can validate end-to-end streaming locally.
  • This is the bridge between ingestion and processing.

Run it yourself

  • Local Docker: the default path for this blog (a minimal compose sketch follows the command below).
docker compose up
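
The blog's own compose file isn't reproduced here. If you need a stand-in, a minimal single-broker setup might look like the sketch below; the apache/kafka image, version, and single-node defaults are assumptions, not the blog's actual file.

# docker-compose.yml -- hypothetical minimal single-node Kafka broker
services:
  kafka:
    image: apache/kafka:3.7.0   # assumed image/version; use whatever your stack pins
    ports:
      - "9092:9092"             # expose the broker on localhost:9092 for Spark and the CLI tools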


Produce messages

kafka-console-producer.sh --topic demo-events --bootstrap-server localhost:9092
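
The command above sends value-only messages, so the key column in Spark will show up as null. If you also want keys, the console producer can parse them from each line; this is a sketch using standard console-producer properties (the ":" separator is an arbitrary choice):

kafka-console-producer.sh --topic demo-events --bootstrap-server localhost:9092 \
  --property parse.key=true --property key.separator=:

# then type lines such as:
#   user1:hello from kafka
#   user2:second event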

Read with Spark Structured Streaming

from pyspark.sql import SparkSession

# The Kafka source ships separately from Spark core, so the
# spark-sql-kafka-0-10 connector must be on the classpath
# (e.g. supplied via spark-submit --packages).
spark = SparkSession.builder.appName("kafka-demo").getOrCreate()

# Subscribe to the demo-events topic on the local broker.
df = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "demo-events")
         .load()
)

# Kafka delivers key and value as binary; cast both to strings for display.
out = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Print each micro-batch to the console so new messages show up immediately.
q = (
    out.writeStream.format("console")
       .outputMode("append")
       .option("truncate", False)
       .start()
)

# Keep the script alive; omit this line if you run the query from a notebook.
q.awaitTermination()

Expected output: each message you produce shows up as a new row in the next console micro-batch.
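
The console sink prints raw strings, which is enough for this demo. If your values are JSON, a natural next step is to parse them into columns; this is a minimal sketch, assuming a hypothetical two-field payload (event, ts) rather than anything defined in this post:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Hypothetical schema for the JSON value; adjust to your real messages.
schema = StructType([
    StructField("event", StringType()),
    StructField("ts", TimestampType()),
])

# Turn the string value into typed columns instead of printing raw text.
parsed = out.withColumn("data", F.from_json(F.col("value"), schema)).select("key", "data.*")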


What to verify

  • Messages appear in the Spark console sink.
  • The streaming query stays active while you produce data (a quick check follows this list).
  • Stopping the producer does not crash the query.
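
For the "stays active" check, the StreamingQuery handle started above exposes standard accessors you can inspect from the same session; nothing here is specific to this blog:

# True while the streaming query is still running.
print(q.isActive)

# High-level status, e.g. whether the trigger is waiting for new data.
print(q.status)

# Metrics for the most recent micro-batch (None until the first batch completes).
print(q.lastProgress)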

Downloads

If you want to run this without copying code, download the notebook or the .py export.