This post is the first step before running any notebook. We verify that Spark starts, the UI responds, and you can write and read Parquet.

Downloads are at the end of the post; see the Downloads section.

At a glance

  • Confirm Spark starts without errors.
  • Verify Spark UI and version.
  • Write/read Parquet on the local volume.

Run it yourself

Use the Spark Docker stack from this blog.


1) Start Spark and check version

This confirms the Spark session starts and reports its version.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pw0 - env check")
    .config("spark.ui.port", "4040")
    .getOrCreate()
)

spark.version

Expected output (example):

'3.5.1'

Open the UI at http://localhost:4040 and confirm the app name.
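
If you prefer to confirm the same information from the notebook instead of the browser, the SparkContext exposes both the app name and the UI address (inside a container the printed URL may show the container hostname rather than localhost):

sc = spark.sparkContext
print(sc.appName)    # should print "pw0 - env check"
print(sc.uiWebUrl)   # address where Spark is serving the UI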


2) Simple count

A basic count validates that Spark can schedule and execute jobs.

df = spark.range(0, 1_000_000)
df.count()

Expected output:

1000000
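
If you also want to exercise a shuffle rather than a single narrow count, a small aggregation on the same range is enough. A minimal sketch, reusing the df from above:

from pyspark.sql import functions as F

# Group by a derived bucket column to force a shuffle stage
buckets = df.groupBy((F.col("id") % 10).alias("bucket")).count()
buckets.count()  # expect 10 buckets of 100,000 rows each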

3) Write and read Parquet

This validates that local volumes are mounted correctly.

out_path = "/home/jovyan/work/data/env_check_parquet"
df.write.mode("overwrite").parquet(out_path)

df2 = spark.read.parquet(out_path)
df2.count()

Expected output:

1000000
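
To double-check that the files really landed on the mounted volume, you can list the output directory from Python. A minimal sketch, assuming the default committer (which leaves a _SUCCESS marker next to the part files):

import os

files = sorted(os.listdir(out_path))
print(files)                       # part-*.parquet files plus _SUCCESS
assert "_SUCCESS" in files         # commit marker written by the default committer
assert df2.count() == df.count()   # round trip preserves the row count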

Notes from practice

  • If the UI does not load, check that port 4040 is published in the Docker configuration.
  • If the write path fails, review the volume mounts (a quick check is sketched below).
  • This post is the baseline before Delta Table 101.
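
For the volume-mount bullet above, a quick check from inside the container can save a failed write. A minimal sketch, assuming the same /home/jovyan/work/data directory used in step 3:

import os

data_dir = "/home/jovyan/work/data"
print(os.path.isdir(data_dir))        # True if the volume is mounted
print(os.access(data_dir, os.W_OK))   # True if the notebook user can write to it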

Downloads

If you prefer not to copy the code, download the notebook or the .py file.