# Observability & Profiling
Updated: 2026-04-16 | Environment: OSC Pitzer, V100 (16 GB), CUDA 12.6, PyTorch 2.8, PyG 2.7
## Architecture

Two stores: MLflow for run-level metadata + scalar-metric time series + device telemetry, OTel for spans + structured-log events. They share a Resource populated by `SlurmResourceDetector` and an identity-derived `run_name` that links rows across both stores.
Phase A (process startup — `graphids/_otel.py:init_providers`, called from the Typer `@app.callback()` in `graphids/cli/app.py`):

- `TracerProvider` + optional Wandb Weave OTLP exporter (gated on `WANDB_API_KEY`)
- `LoggerProvider` -> `ConsoleLogRecordExporter(out=stderr)` + `LoggingHandler` bridges stdlib logging -> OTel
- `SlurmResourceDetector` merges SLURM env vars into the shared Resource

Phase B (after `run_dir` is known — `graphids/_otel.py:wire_file_exporters`, called from `cli/training.py::_prepare`):

- `BatchSpanProcessor` -> `ConsoleSpanExporter(out=run_dir/traces.jsonl)` — the `training.fit` span + structured-log events

Phase C (at fit-start — `graphids/_mlflow.py::start_training_run`, called from `orchestrate/stage.py::train`):

- Opens the MLflow run (SQLite backend at `{lake_root}/mlflow.db`), logs params/tags/`cache_digest`
- Enables the MLflow system-metrics sampler (background thread, 5s interval) — GPU util, VRAM, CPU, memory, disk, network
- `MLflowTrainingCallback` appends per-epoch `train_loss`/`val_loss`/`lr`/`early_stop.wait` at `step=epoch`
- At fit-end: peak VRAM + `epochs_run` + ckpt SHA256 tag; run closes with status FINISHED (or FAILED on exception)
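A minimal sketch of Phases A and B, assuming the OTel Python SDK; the `SlurmResourceDetector` and `wire_file_exporters` names mirror the description above, but the bodies here are illustrative, not the actual `graphids/_otel.py`:

```python
# Illustrative sketch of Phases A/B; names mirror the text, bodies are assumptions.
import os

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource, ResourceDetector, get_aggregated_resources
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter


class SlurmResourceDetector(ResourceDetector):
    """Phase A: merge SLURM_* env vars into the shared Resource."""

    def detect(self) -> Resource:
        return Resource.create(
            {k.lower(): v for k, v in os.environ.items() if k.startswith("SLURM_")}
        )


def init_providers() -> TracerProvider:
    # Phase A: one process-wide TracerProvider carrying the SLURM-enriched Resource.
    provider = TracerProvider(resource=get_aggregated_resources([SlurmResourceDetector()]))
    trace.set_tracer_provider(provider)
    return provider


def wire_file_exporters(provider: TracerProvider, run_dir: str) -> None:
    # Phase B: once run_dir exists, append every finished span as a JSON line.
    out = open(os.path.join(run_dir, "traces.jsonl"), "a")
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter(out=out)))
```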
## Wired tooling

| Layer | Tool | Where |
|---|---|---|
| Run metadata + scalar metrics | MLflow SQLite backend | `graphids/_mlflow.py` |
| Per-epoch metrics | `MLflowTrainingCallback` | `configs/_lib/defaults.libsonnet` `callbacks.mlflow` |
| Device telemetry (GPU/CPU/mem) | MLflow system-metrics sampler (psutil + nvidia-ml-py) | `_mlflow.start_training_run` |
| Structured logging | `_StructuredAdapter` -> `LoggingHandler` | `graphids/_otel.py` |
| Traces + log events (per-run) | `traces.jsonl` via `ConsoleSpanExporter` | `{run_dir}/traces.jsonl` |
| Wandb Weave (optional) | OTLP HTTP exporter to trace.wandb.ai | `graphids/_otel.py`, gated on `WANDB_API_KEY` |
| Op-level profiling | PyTorchProfiler (chrome traces) | `scripts/run --mode gpu --length short --command "python -m graphids profile"` |
| SLURM job accounting | sacct summary + log rotation | `_epilog.sh` |
| CUDA alloc config | `expandable_segments:True,garbage_collection_threshold:0.8` | `_preamble.sh` |
| Mixed precision | `precision: 16-mixed` | `configs/_lib/defaults.libsonnet` |
| Gradient checkpointing | `use_reentrant=False` | `_conv.py` |
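The last two rows are config knobs rather than telemetry. For reference, the non-reentrant checkpointing path looks like the sketch below (a generic example, not the `_conv.py` code):

```python
import torch
from torch.utils.checkpoint import checkpoint


class CheckpointedBlock(torch.nn.Module):
    """Generic example; the real checkpointed module lives in _conv.py."""

    def __init__(self) -> None:
        super().__init__()
        self.lin = torch.nn.Linear(64, 64)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Trade compute for memory: activations are recomputed during backward.
        # use_reentrant=False is the maintained, autograd-graph-based path.
        return checkpoint(self.lin, x, use_reentrant=False)
```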
## MLflowTrainingCallback (`graphids/core/mlflow_callback.py`)

Installed via `defaults.libsonnet` `callbacks.mlflow`. Run lifecycle is owned by `_mlflow.start_training_run` (called from `stage.train` before `trainer.fit`); this callback only writes into the active run.

- `on_train_epoch_end`: `mlflow.log_metrics({train_loss, val_loss, lr, early_stop.wait, early_stop.best_score}, step=current_epoch)`
- `on_fit_end`: `log_final_fit(peak_vram_mb, epochs_run, best_ckpt_path, run_dir)` + `end_training_run("FINISHED")`
- `on_exception`: `end_training_run("FAILED")`
Device telemetry is captured by MLflow's background system-metrics thread while the run is active — no per-batch NVML hooks needed. Span lifecycle for `training.fit` is a single span created implicitly via `trainer.fit` wrapping; cross-stage KD lineage (VGAE→GAT→fusion) is recoverable via `graphids.ckpt_sha256` tags + upstream ckpt paths stored in downstream `resolved.json`.
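A condensed sketch of the callback's shape, assuming Lightning's `Callback` API; the metric names come from this page, everything else is illustrative:

```python
import mlflow
from lightning.pytorch import LightningModule, Trainer
from lightning.pytorch.callbacks import Callback


class MLflowTrainingCallbackSketch(Callback):
    """Writes into the already-active MLflow run; run lifecycle is owned elsewhere."""

    def on_train_epoch_end(self, trainer: Trainer, pl_module: LightningModule) -> None:
        wanted = {"train_loss", "val_loss"}
        # callback_metrics holds 0-dim tensors logged by the LightningModule.
        metrics = {
            k: float(v) for k, v in trainer.callback_metrics.items() if k in wanted
        }
        if metrics:
            mlflow.log_metrics(metrics, step=trainer.current_epoch)
```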
## Storage layers

- MLflow run store (`{lake_root}/mlflow.db` + `mlartifacts/`) — authoritative for run metadata, params, per-epoch scalar metrics, and device telemetry. One fit-phase row (opened at fit-start, closed at fit-end) + one test-phase row (post-hoc sink in `stage.evaluate`), both sharing `run_name = {group}_{variant}_{dataset}_seed{N}[_{cluster}]` and distinguished by the `graphids.phase` tag. Query via `mlflow.search_runs` or `client.get_metric_history(run_id, key)`.
- Per-run traces (`{run_dir}/traces.jsonl`) — OTel spans + structured-log events (`budget_probed`, `vram_drift_detected`, `early_stopping`, etc.). Parsed by `graphids/core/run_io.py::load_traces` (polars NDJSON). Useful for debugging single runs; not a query surface for cross-run analysis.
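For cross-run analysis, query the MLflow store directly; a sketch under the layout above (tag, metric, and path values are illustrative):

```python
import mlflow
import polars as pl
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("sqlite:///mlflow.db")  # {lake_root}/mlflow.db in practice

# All fit-phase rows, as a pandas DataFrame of params/tags/latest metric values.
runs = mlflow.search_runs(
    search_all_experiments=True,
    filter_string="tags.`graphids.phase` = 'fit'",
)

# Full per-epoch series for one run (search_runs only returns the latest value).
client = MlflowClient()
history = client.get_metric_history(runs.loc[0, "run_id"], "val_loss")

# Per-run span/event debugging stays file-local:
spans = pl.read_ndjson("traces.jsonl")  # i.e. {run_dir}/traces.jsonl
```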
## GPU profiling tools

| Tool | Best for | On OSC? |
|---|---|---|
| `torch.cuda.max_memory_allocated()` | Peak memory -> batch sizing | Yes |
| `torch.cuda.memory._record_memory_history()` | Memory leak debugging (pickle -> pytorch.org/memory_viz) | Yes |
| `torch.profiler.profile` | Per-operator cost, CPU<->GPU gaps (JSON/CSV) | Yes |
| `nsys` (`module load nvhpc/25.1`) | System-wide CPU<->GPU bottlenecks | Yes |
| `ncu` (`module load nvhpc/25.1`) | Per-kernel roofline (10-100x slower) | Yes |
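For the second row, the snapshot workflow is roughly the following (`_record_memory_history` is a private API, stable since PyTorch 2.1 but subject to change; the workload line is a placeholder):

```python
import torch

torch.cuda.memory._record_memory_history(max_entries=100_000)

# ... run the suspected-leaky training steps here ...
x = torch.randn(4096, 4096, device="cuda")  # placeholder workload

torch.cuda.memory._dump_snapshot("leak_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
# Drop leak_snapshot.pickle onto https://pytorch.org/memory_viz to inspect.
```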
### nsys invocation (OSC)

```bash
module load nvhpc/25.1   # nsys 2024.7.1.84
nsys profile --pytorch=autograd-shapes-nvtx -t cuda,nvtx,osrt,cudnn,cublas \
  -o /fs/scratch/PAS1266/profiles/my_run \
  python -m graphids fit --config configs/stages/autoencoder.jsonnet

nsys stats my_run.nsys-rep                                          # summary
nsys stats --report cuda_gpu_kern_sum --format csv my_run.nsys-rep  # kernel CSV
nsys export --type=sqlite my_run.nsys-rep                           # for queries
```
Focused profiling: `torch.cuda.cudart().cudaProfilerStart()` / `Stop()` in code, then `nsys profile --capture-range=cudaProfilerApi ...`
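In code, the capture-range pattern brackets only the region of interest (generic sketch; anything outside the markers is skipped by nsys):

```python
import torch

x = torch.randn(1024, 1024, device="cuda")  # warmup, not captured

torch.cuda.cudart().cudaProfilerStart()  # nsys begins capturing here
for _ in range(10):                      # region of interest
    x = x @ x
torch.cuda.synchronize()
torch.cuda.cudart().cudaProfilerStop()   # nsys stops capturing here
```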
### ncu invocation (use after nsys finds slow kernels)

```bash
ncu --kernel-name "scatter_mean" --launch-count 5 \
  -o /fs/scratch/PAS1266/profiles/scatter_report \
  python -m graphids fit ...
```
WARNING: ncu replays each kernel 10-100x. GraphIDS bottleneck is CPU-side (data loading), so ncu priority is LOW.
## PyG-specific notes

- Variable-size graph batches cause kernel dimension variance per step. Use `--pytorch=autograd-shapes-nvtx` to see batch-size effects.
- `max_memory_allocated` (tensors used) is the correct metric for batch sizing, not `max_memory_reserved` (allocator blocks); see the probe sketch below.
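A minimal probe along those lines, assuming the model returns a loss-like tensor on the largest expected batch (public APIs only):

```python
import torch


def peak_vram_mb(model: torch.nn.Module, batch) -> float:
    """One forward+backward on the largest expected batch; report peak MiB."""
    torch.cuda.reset_peak_memory_stats()
    model(batch).sum().backward()  # include backward: gradients dominate the peak
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20
```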
## Tool decisions (don't re-investigate)
Adopt: OpenTelemetry (traces + metrics + logs), MLflow run store (SQLite on GPFS, file artifacts), PyTorchProfiler, nsys (one-off), torch.cuda memory APIs, sacct profiler
Skip (with reasons):
- nvprof: deprecated. ncu: 10-100x slower, only after nsys finds bad kernel. DCGM: needs admin (error -37 conflicts with SLURM GPU accounting).
- cuGraph/cugraph-pyg: graph classification, not sampling. kvikIO/GDS: no OSC infra.
- cudnn.benchmark: CNN-only. channels_last: image tensors. TF32: Ampere+ only. CUDA Graphs: variable-size batches.
- Aim: RocksDB NFS issues. Neptune: dead. DVC: duplicates staging. pytorch_memlab: abandoned. MLflow file-store backend: deprecated Feb 2026; we use the SQLite backend.
- torch.compile reduce-overhead: increases memory. Use default mode only (sketch after this list).
- wandb (direct dep): removed in favor of OTel + optional Weave OTLP; Wandb Weave still receives traces when `WANDB_API_KEY` is set.
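For the torch.compile line, "default mode" just means calling it with no `mode` argument (sketch):

```python
import torch

model = torch.nn.Linear(64, 64).cuda()
# Default mode only: mode="reduce-overhead" builds CUDA Graphs and holds extra
# memory, which also clashes with variable-size PyG batches.
model = torch.compile(model)
```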
## V100 deprecation warning

cuDNN 9.11+ drops V100 (Volta, compute capability 7.0). PyTorch 2.8 ships cuDNN 9.10.2, the last Volta-capable version. Pin `torch<2.9` once 2.9 ships. Sources: PyTorch #162574, cuDNN 9.11.0 release notes