📡 Streaming 101 with Spark: file/Auto Loader → console (no installs)

1‑line value: Spin up Structured Streaming without external services: read files (Auto Loader or classic file source), do a tiny transform, and print to console.

Executive summary
- Use file-based streaming instead of rate: either Auto Loader (cloudFiles) on Databricks or the built‑in file source on vanilla Spark.
- Works with existing public/sample data—no Kafka, no sockets, no netcat.
- Add a tiny transform (filter + derived column) and stream to console for instant feedback.
- Tune throughput/latency with trigger(availableNow=True) (one‑shot catch‑up) or processingTime (micro‑batches).
- Include copy‑ready snippets, plus a minimalist checklist to move toward production.

1) Problem & context
I want a minimal streaming skeleton that anyone can run today—locally or on Databricks—without provisioning brokers or external services. The goal: read → transform → print to validate the pipeline shape and metrics. ...
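A minimal sketch of that read → transform → console shape on vanilla Spark's classic file source (the landing folder, schema, field names, and conversion rate are illustrative assumptions; on Databricks you would swap the source for cloudFiles):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("streaming-101").getOrCreate()

# File sources need an explicit schema (assumed example fields).
schema = StructType([
    StructField("event", StringType()),
    StructField("amount", DoubleType()),
])

raw = (spark.readStream
       .format("json")           # classic file source; cloudFiles on Databricks
       .schema(schema)
       .load("data/incoming/"))  # assumed landing folder

# Tiny transform: filter + derived column.
curated = (raw
           .filter(F.col("amount") > 0)
           .withColumn("amount_eur", F.col("amount") * 0.92))  # illustrative rate

query = (curated.writeStream
         .format("console")
         .outputMode("append")
         .trigger(availableNow=True)  # one-shot catch-up
         .start())

query.awaitTermination()
```

For a continuously running micro-batch loop, replace the trigger with `.trigger(processingTime="10 seconds")` (interval is an example) and stop the query manually when done.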

October 13, 2025

🛠️ Basic Environment Check (Jupyter + Spark Local)

Quick steps
- Start a local Spark session
- Print the Spark version
- Run a simple row count
- Write & read a small Parquet dataset under ./data

Tip: Keep this notebook as your first-run check for any lab session.

1) Paths and data folder
We resolve the path of the data/ folder; all files written here persist on your host.

```python
from pathlib import Path

base_dir = Path.cwd().parent
data_dir = base_dir / "data" / "00_env_check"

print("Project base folder:", base_dir)
print("Project data folder:", data_dir)
```

Project base folder: /home/jovyan/work
Project data folder: /home/jovyan/work/data/00_env_check

2) Spark session and version
Creates a local Spark session; the Spark UI should come up on port 4040. ...
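A minimal sketch of the remaining checks (session, version, row count, Parquet round trip), assuming a plain local session; the app name and output subfolder are illustrative:

```python
from pathlib import Path
from pyspark.sql import SparkSession

# Local session; the Spark UI should be reachable on http://localhost:4040
spark = (SparkSession.builder
         .master("local[*]")
         .appName("00-env-check")
         .getOrCreate())

print("Spark version:", spark.version)

# Simple row count to confirm jobs actually execute
df = spark.range(1000)
print("Row count:", df.count())

# Write & read a small Parquet dataset under the data folder resolved above
out_path = Path.cwd().parent / "data" / "00_env_check" / "range_check"
df.write.mode("overwrite").parquet(str(out_path))
print("Rows read back:", spark.read.parquet(str(out_path)).count())
```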

October 6, 2025