This post is the first step before running any notebook. We verify that Spark starts, the UI responds, and you can write and read Parquet.

Downloads are at the end of the post; see the Downloads section.

At a glance

  • Confirm Spark starts without errors.
  • Verify Spark UI and version.
  • Write/read Parquet on the local volume.

Run it yourself

Use the Spark Docker stack from this blog.


1) Start Spark and check version

This confirms the Spark session starts and reports its version.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pw0 - env check")
    .config("spark.ui.port", "4040")
    .getOrCreate()
)

spark.version

Expected output (example):

'3.5.1'

Open the UI at http://localhost:4040 and confirm the app name.
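
If you prefer to confirm the same information from the notebook instead of the browser, the SparkContext exposes both the app name and the UI address (inside a container the printed URL may show the container hostname rather than localhost):

sc = spark.sparkContext
print(sc.appName)    # should print "pw0 - env check"
print(sc.uiWebUrl)   # address where Spark is serving the UI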


2) Simple count

A basic count validates that Spark can schedule and execute jobs.

df = spark.range(0, 1_000_000)
df.count()

Expected output:

1000000
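
If you also want to exercise a shuffle rather than a single narrow count, a small aggregation on the same range is enough. A minimal sketch, reusing the df from above:

from pyspark.sql import functions as F

# Group by a derived bucket column to force a shuffle stage
buckets = df.groupBy((F.col("id") % 10).alias("bucket")).count()
buckets.count()  # expect 10 buckets of 100,000 rows each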

3) Write and read Parquet

This validates that local volumes are mounted correctly.

out_path = "/home/jovyan/work/data/env_check_parquet"
df.write.mode("overwrite").parquet(out_path)

df2 = spark.read.parquet(out_path)
df2.count()

Expected output:

1000000
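
To double-check that the files really landed on the mounted volume, you can list the output directory from Python. A minimal sketch, assuming the default committer (which leaves a _SUCCESS marker next to the part files):

import os

files = sorted(os.listdir(out_path))
print(files)                       # part-*.parquet files plus _SUCCESS
assert "_SUCCESS" in files         # commit marker written by the default committer
assert df2.count() == df.count()   # round trip preserves the row count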

Notes from practice

  • If the UI does not load, check that port 4040 is published in the Docker configuration.
  • If the write path fails, review the volume mounts (a quick check is sketched below).
  • This post is the baseline before Delta Table 101.
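
For the volume-mount bullet above, a quick check from inside the container can save a failed write. A minimal sketch, assuming the same /home/jovyan/work/data directory used in step 3:

import os

data_dir = "/home/jovyan/work/data"
print(os.path.isdir(data_dir))        # True if the volume is mounted
print(os.access(data_dir, os.W_OK))   # True if the notebook user can write to it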

Downloads

If you prefer not to copy the code, download the notebook or the .py file.