GraphIDS Write Path Inventory¶
Audited 2026-04-09.
The Rule¶
- Code lives in
~/graphids/(NFS). Read-only at runtime. - ALL runtime writes go to
/fs/ess/PAS1266/graphids/(ESS), routed viaGRAPHIDS_LAKE_ROOT. - Scratch (
/fs/scratch/PAS1266/) is for transient data: staged data copies. - Nothing should write to the repo directory. Ever.
Single Source of Truth¶
graphids/config/constants.py declares write path constants. graphids/config/settings.py owns all GRAPHIDS_* env vars.
Constants: CKPT_SUBPATH, LAST_CKPT_SUBPATH, PHASE_MARKERS (all in graphids/config/constants.py). MLflow backend path (mlflow.db) and artifact subpath (mlartifacts) live in graphids/_mlflow.py.
Filesystem Layout¶
/fs/ess/PAS1266/graphids/ <-- $GRAPHIDS_LAKE_ROOT (persistent, shared)
+-- dev/{user}/{dataset}/
| +-- ablations/{group}/{variant}/
| +-- seed_{N}/ <-- trainer.default_root_dir
| +-- checkpoints/
| | +-- best_model.ckpt <-- ModelCheckpoint (dirpath pinned by instantiate.py)
| | +-- last.ckpt <-- ModelCheckpoint (save_last: true)
| +-- traces.jsonl <-- OTel spans (wire_file_exporters, SimpleSpanProcessor)
| +-- artifacts/ <-- analysis outputs (written by `graphids analyze`, not the pipeline driver)
| +-- .train_complete <-- phase marker (fit done; diagnostic only)
| +-- .test_complete <-- phase marker (test done; diagnostic only)
| +-- resolved.json <-- Pydantic-validated rendered config (cli/training._prepare)
| +-- overrides.json <-- TLA dict + --set payload (cli/training._prepare)
+-- mlflow.db <-- MLflow SQLite backend (runs + params + metrics + tags)
+-- mlartifacts/{exp_id}/{run_id}/ <-- MLflow artifact store (per-experiment per-run)
+-- raw/{dataset}/ <-- source CSV data
+-- cache/v{ver}/{dataset}/ <-- preprocessed graph .pt files
+-- slurm/ <-- SLURM stdout/stderr (default)
/fs/scratch/PAS1266/ <-- transient (90-day purge)
+-- graphids-data/ <-- staged data (scratch -> TMPDIR)
$TMPDIR/graphids-data/ <-- per-job local SSD (ephemeral)
Checkpoint dirpath is owned by core.callbacks.ModelCheckpoint._resolve_dirpath: {default_root_dir}/checkpoints unless an explicit dirpath is configured. The /checkpoints subdir convention lives on the callback, not the instantiator.
Write Path Detail¶
1. Trainer (custom, post-Lightning)¶
All training writes land under trainer.default_root_dir from the rendered jsonnet config.
| What | Relative path | Who writes |
|---|---|---|
| Best checkpoint | checkpoints/best_model.ckpt |
core.callbacks.ModelCheckpoint (self-describing — class_path + state_dict + hyper_parameters) |
| Resume checkpoint | checkpoints/last.ckpt |
ModelCheckpoint (save_last: true) |
| Train/val predictions | predictions/{train,val}.pt |
orchestrate.stage.train after trainer.fit |
| Per-test-set predictions | predictions/test/{set_name}.pt |
orchestrate.stage.evaluate |
| OTel spans + log events | traces.jsonl |
wire_file_exporters -> BatchSpanProcessor -> ConsoleSpanExporter |
2. Training-time tracking¶
graphids/_mlflow.py::start_training_run opens the MLflow run in stage.train before trainer.fit, logs params + tags + cache digest, and enables the MLflow system-metrics sampler (psutil + nvidia-ml-py, 5s interval).
graphids/core/mlflow_callback.py::MLflowTrainingCallback appends train_loss/val_loss/lr/early_stop.wait at step=epoch via on_train_epoch_end, stamps peak_vram_mb + epochs_run + ckpt SHA256 tag at on_fit_end, and closes the run (FINISHED / FAILED via on_exception).
graphids/_otel.py::wire_file_exporters wires the traces.jsonl span exporter (Phase B). Structured-log events emitted via log.info("event_name", ...) land here alongside the single training.fit span. Wandb Weave OTLP export is optional when WANDB_API_KEY is set.
3. Phase Markers¶
Diagnostic only — resume is authoritative on checkpoints/best_model.ckpt
existence, not these markers.
| What | Path | Who writes | Code |
|---|---|---|---|
| Train phase marker | {run_dir}/.train_complete |
stage.py::train after trainer.fit |
PHASE_MARKERS["train"] |
| Test phase marker | {run_dir}/.test_complete |
stage.py::evaluate after trainer.test |
PHASE_MARKERS["test"] |
| Analysis artifacts | {run_dir}/artifacts/ |
core/analysis/analyzer.py via python -m graphids analyze (not pipeline) |
analyze.py |
4. SLURM¶
| What | Path | Who writes |
|---|---|---|
| Job stdout/stderr | {slurm_log_dir}/{name}_%j.{out,err} |
sbatch/OS |
slurm_log_dir defaults to {lake_root}/slurm (settings.py:46); override via GRAPHIDS_SLURM_LOG_DIR.
5. Data / Preprocessing¶
| What | Path | Who writes |
|---|---|---|
| Graph cache .pt files | {lake_root}/cache/v{ver}/{dataset}/processed/data_*.pt |
core/data/io.py::atomic_save |
| Cache metadata (v2) | {lake_root}/cache/v{ver}/{dataset}/cache_metadata.json |
core/data/metadata.py::merge_split_into_metadata |
| NFS advisory lock | {cache_dir}/processed/.lock |
core/data/io.py::nfs_lock |
| Metadata merge lock | {cache_dir}/.metadata_lock |
core/data/metadata.py (fcntl.flock) |
6. MLflow Store¶
{lake_root}/mlflow.db — MLflow SQLite backend (runs, params, metrics, tags). Written by graphids/_mlflow.py::log_run at the end of orchestrate/stage.py::evaluate. Each run is an MLflow row keyed by run_name = {group}_{variant}_{dataset}_seed{N}[_{cluster}]. Artifacts (if any) land under {lake_root}/mlartifacts/{exp_id}/{run_id}/. Browse via the OSC OnDemand MLflow app pointed at the SQLite URI.
Execution Order¶
SLURM JOB (compute node)
------------------------
_preamble.sh (env, venv)
python -m graphids fit --config <preset.jsonnet> --tla dataset=... --tla seed=...
+- ensure_spawn()
+- render(config, tla)
+- apply_overrides(rendered, --set ...)
+- ResolvedConfig.from_rendered(rendered)
+- build(resolved)
| +- gc + torch.cuda reset
| +- build_run(rendered) -> trainer/model/datamodule
+- wire_file_exporters(run_dir)
+- train(artifacts, resolved, resume_from=...)
| +- trainer.fit() -> touch .train_complete
# (separate invocation for eval)
python -m graphids test --config <preset.jsonnet> --tla dataset=... --tla seed=...
+- ... evaluate(...) -> trainer.test() -> touch .test_complete
# Analysis is decoupled:
python -m graphids analyze --ckpt-path <run_dir>/checkpoints/best_model.ckpt \
--dataset <dataset>
Env Var -> Path Mapping¶
| Env var | Default | Set in | Controls |
|---|---|---|---|
GRAPHIDS_LAKE_ROOT |
"experimentruns" (relative) |
.env -> /fs/ess/PAS1266/graphids |
All experiment IO |
GRAPHIDS_SLURM_LOG_DIR |
{lake_root}/slurm (derived) |
.env |
SLURM stdout/stderr |
GRAPHIDS_LAKE_WRITE |
false |
.env (set to 1 in SLURM jobs) |
Guards lake writes |
MLFLOW_TRACKING_URI |
sqlite:///{lake_root}/mlflow.db |
derived by _mlflow.ensure_tracking_uri |
MLflow backend |
WANDB_API_KEY |
(none) | .env |
Enables Wandb Weave OTLP export |
TMPDIR |
(SLURM sets) | OS | Per-job local SSD |