Lab 0 - Basic environment check (Jupyter + Spark, local)

Quick steps

  • Start a local Spark session
  • Print the Spark version
  • Run a simple row count
  • Write & read a small Parquet dataset under ./data

Tip: Keep this notebook as your first-run check for any lab session

1) Paths and data folder

We resolve the path to the data/ folder; everything written there persists on your host

from pathlib import Path

base_dir = Path.cwd().parent
data_dir = base_dir / "data" / "00_env_check"

print("Project base folder:", base_dir)
print("Project data folder:", data_dir)
Project base folder: /home/jovyan/work
Project data folder: /home/jovyan/work/data/00_env_check
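
On a fresh checkout the folder may not exist yet; a minimal sketch (using the same data_dir as above) that creates it up front and confirms it is a writable directory:

data_dir.mkdir(parents=True, exist_ok=True)  # no-op if the folder already exists
print("Folder exists and is a directory:", data_dir.is_dir())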

2) Spark session and version

Create a local Spark session; the Spark UI should be served on port 4040

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pw0 - env check")
    .config("spark.ui.port", "4040")
    .getOrCreate()
)

spark
SparkSession - in-memory

SparkContext

Spark UI: http://5bebdeb0c884:4041

Version: v4.0.1
Master: local[*]
AppName: pw0 - env check

Open http://localhost:4040 to confirm the Spark UI loads (header and app name). If port 4040 is already taken, Spark binds the next free port instead, which is why the output above shows 4041
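
The Quick steps call for printing the Spark version explicitly; a small check using only standard SparkSession/SparkContext attributes:

print("Spark version:", spark.version)           # same value shown in the session summary
print("Master:", spark.sparkContext.master)      # local[*] = one worker thread per local core
print("Spark UI:", spark.sparkContext.uiWebUrl)  # actual bound URL (may be port 4041)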

3) Simple count

Run a basic row count and time how long the operation takes

import time

start = time.perf_counter()
df1 = spark.range(0, 1_000_000)  # lazy: only builds the query plan
total = df1.count()              # action: triggers the actual Spark job
elapsed = time.perf_counter() - start

print("Rows counted:", total)
print(f"Time elapsed: {elapsed:.3f} s")
Rows counted: 1000000
Time elapsed: 0.098 s
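
Since spark.range is lazy, nearly all of the elapsed time belongs to the count() action; a sketch that splits the two timings to make this visible (df_lazy and n are illustrative names):

t0 = time.perf_counter()
df_lazy = spark.range(0, 1_000_000)  # transformation only, no job runs yet
t1 = time.perf_counter()
n = df_lazy.count()                  # the action launches the job
t2 = time.perf_counter()

print(f"Build plan: {t1 - t0:.4f} s, run count: {t2 - t1:.4f} s, rows: {n}")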

4) Write & read Parquet

Write a small dataset to data/ and read it back; this verifies your folder mapping and permissions

df1.write.mode("overwrite").parquet(str(data_dir))  # overwrite any previous run
df2 = spark.read.parquet(str(data_dir))             # read back from the same folder

print("Rows read from parquet:", df2.count())
Rows read from parquet: 1000000
%ls ../data/00_env_check
_SUCCESS
part-00000-b68ec6ae-95c2-4359-bbe6-0245b81925da-c000.snappy.parquet
part-00001-b68ec6ae-95c2-4359-bbe6-0245b81925da-c000.snappy.parquet
part-00002-b68ec6ae-95c2-4359-bbe6-0245b81925da-c000.snappy.parquet
part-00003-b68ec6ae-95c2-4359-bbe6-0245b81925da-c000.snappy.parquet
part-00004-b68ec6ae-95c2-4359-bbe6-0245b81925da-c000.snappy.parquet
part-00005-b68ec6ae-95c2-4359-bbe6-0245b81925da-c000.snappy.parquet
part-00006-b68ec6ae-95c2-4359-bbe6-0245b81925da-c000.snappy.parquet
part-00007-b68ec6ae-95c2-4359-bbe6-0245b81925da-c000.snappy.parquet
part-00008-b68ec6ae-95c2-4359-bbe6-0245b81925da-c000.snappy.parquet
part-00009-b68ec6ae-95c2-4359-bbe6-0245b81925da-c000.snappy.parquet
part-00010-b68ec6ae-95c2-4359-bbe6-0245b81925da-c000.snappy.parquet
part-00011-b68ec6ae-95c2-4359-bbe6-0245b81925da-c000.snappy.parquet
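
The twelve part files mirror the partitioning of df1 (spark.range splits the rows across the local cores, twelve here). If you would rather get a single file for a dataset this small, a sketch that collapses to one partition before writing, using a hypothetical sibling folder single_file_dir:

single_file_dir = base_dir / "data" / "00_env_check_single"  # hypothetical output folder
df1.coalesce(1).write.mode("overwrite").parquet(str(single_file_dir))  # 1 partition -> 1 part file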

5) Shutdown

Close the session to release resources when you’re done

spark.stop()
print("Spark session stopped")
Spark session stopped
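
If you need Spark again later in the same kernel, getOrCreate() will build a fresh session; a minimal sketch reusing the builder from step 2:

spark = (
    SparkSession.builder
    .appName("pw0 - env check")
    .getOrCreate()  # a stopped session is not reusable, so this creates a new one
)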