GraphIDS Write Path Inventory

Status: current

Rule

Source code lives in the repo and should be read-only at runtime.
Shared persistent data lives under $GRAPHIDS_LAKE_ROOT.
Per-run manifests, events, checkpoints, and artifacts live under the resolved run directory.
SLURM text logs live under the configured SLURM log directory.
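
One way to enforce the read-only-repo rule is a guard that rejects any write target resolving inside the repo root. The sketch below is hypothetical: `assert_writable` and its `repo_root` argument are illustrative names, not part of graphids.

```python
from pathlib import Path


def assert_writable(target: Path, repo_root: Path) -> Path:
    """Hypothetical guard: refuse to write anywhere inside the source repo."""
    # resolve() follows symlinks, so a link pointing back into the repo is caught.
    resolved = target.expanduser().resolve()
    repo = repo_root.expanduser().resolve()
    # Path.is_relative_to requires Python 3.9+; it is also True for repo itself.
    if resolved.is_relative_to(repo):
        raise PermissionError(f"refusing to write inside repo: {resolved}")
    return resolved
```

A caller would run every output path (cache, run directory, SLURM log) through the guard before opening it for writing.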

Roots

| Root | Default/current behavior | Contents |
|---|---|---|
| Repo | `/users/PAS2022/rf15/graphids` | source code |
| Lake | `$GRAPHIDS_LAKE_ROOT`, typically `/fs/ess/PAS1266/graphids` | raw data, caches, MLflow DB, MLflow artifacts |
| Runs | `Path.home() / "ray_results"` (`~/ray_results`) when running outside Ray | run journals, checkpoints, non-MLflow artifacts |
| SLURM logs | `GRAPHIDS_SLURM_LOG_DIR`, `.env`, or fallback `{lake_root}/slurm`; current jobs use `/fs/ess/PAS1266/graphids/slurm_logs` | sbatch scripts, stdout, stderr |
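
The SLURM log row lists three sources. Below is a minimal sketch of that lookup. It assumes this precedence: process environment first, then a `KEY=VALUE` `.env` file, then `{lake_root}/slurm`. The helper names are hypothetical. For the authoritative behavior, read `graphids/exp/slurm.py`.

```python
from pathlib import Path
from typing import Mapping


def read_dotenv(path: Path) -> dict[str, str]:
    """Minimal KEY=VALUE parser; skips blank lines and # comments (assumed format)."""
    values: dict[str, str] = {}
    if not path.is_file():
        return values
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_slurm_log_dir(lake_root: Path, env: Mapping[str, str], dotenv: Path) -> Path:
    """Hypothetical precedence: process env, then .env, then {lake_root}/slurm."""
    from_env = env.get("GRAPHIDS_SLURM_LOG_DIR")
    if from_env:
        return Path(from_env)
    from_file = read_dotenv(dotenv).get("GRAPHIDS_SLURM_LOG_DIR")
    if from_file:
        return Path(from_file)
    return lake_root / "slurm"
```

Typical usage would be `resolve_slurm_log_dir(Path(os.environ["GRAPHIDS_LAKE_ROOT"]), os.environ, Path(".env"))`. Under the assumed precedence, current jobs land in `slurm_logs/` rather than the `slurm/` fallback because one of the first two sources is set.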

Relevant code:

graphids/paths.py: lake, cache, run, and catalog helpers.
graphids/exp/config.py: OutputConfig and run-directory resolution.
graphids/exp/slurm.py: SLURM script/log path resolution.
graphids/_mlflow.py: MLflow tracking URI.

Filesystem Layout

$GRAPHIDS_LAKE_ROOT/
  raw/{dataset}/
  cache/v{PREPROCESSING_VERSION}/{dataset}/
  mlflow.db
  mlartifacts/{dataset}/{stage}/
  slurm_logs/
    scripts/{experiment_name}.sbatch
    {experiment_name}_{job_id}.out
    {experiment_name}_{job_id}.err

~/ray_results/{dataset}/{experiment_name}/{run_name}/
  .graphids/
    manifest.json
    events.jsonl
  checkpoints/
    best_model.ckpt
    best_model.ckpt.sha256
    last.ckpt
    last.ckpt.sha256
  artifacts/

Checkpoint files exist only when the experiment config enables checkpointing and includes a checkpoint callback.
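
Each checkpoint has a `.sha256` sidecar. Below is a sketch of writing and verifying one. It assumes the sidecar stores the hex digest as its first whitespace-separated token, as `sha256sum` output does. The actual sidecar format and the helper names are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large checkpoints are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sidecar_for(ckpt: Path) -> Path:
    # best_model.ckpt -> best_model.ckpt.sha256
    return ckpt.with_name(ckpt.name + ".sha256")


def write_sidecar(ckpt: Path) -> Path:
    sidecar = sidecar_for(ckpt)
    # Assumed format: "<hexdigest>  <filename>", as produced by `sha256sum`.
    sidecar.write_text(f"{sha256_file(ckpt)}  {ckpt.name}\n")
    return sidecar


def verify_checkpoint(ckpt: Path) -> bool:
    """True only if a sidecar exists and its digest matches the checkpoint bytes."""
    sidecar = sidecar_for(ckpt)
    if not sidecar.is_file():
        return False
    tokens = sidecar.read_text().split()
    return bool(tokens) and tokens[0] == sha256_file(ckpt)
```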

Cache Paths

Graph caches are versioned by graphids.paths.PREPROCESSING_VERSION.

Snapshot and snapshot-sequence graph caches currently live under:

{lake_root}/cache/v{PREPROCESSING_VERSION}/{dataset}/{representation_kind}_{representation_digest}_voc_{scope}/
  processed/
    data_train.pt
    data_test_<split>.pt
    .complete
  cache_metadata.json

The representation digest is part of the cache path and cache key, so changing representation settings creates a distinct cache.
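
To show why changing representation settings yields a distinct cache, here is a sketch of a digest-keyed cache path. The digest scheme is a hypothetical stand-in: SHA-256 over sorted-key JSON, truncated to 12 hex characters. The function names and the example `kind`/`scope` values are also illustrative, not graphids APIs.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Mapping


def representation_digest(settings: Mapping[str, Any], length: int = 12) -> str:
    """Hypothetical digest: sha256 over canonical JSON (settings must be JSON-serializable)."""
    canonical = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


def cache_dir(lake_root: Path, version: Any, dataset: str, kind: str,
              settings: Mapping[str, Any], scope: str) -> Path:
    # Mirrors {lake_root}/cache/v{VERSION}/{dataset}/{kind}_{digest}_voc_{scope}/
    digest = representation_digest(settings)
    return lake_root / "cache" / f"v{version}" / dataset / f"{kind}_{digest}_voc_{scope}"


def is_cache_complete(directory: Path) -> bool:
    # Assumption: .complete is written last, so its presence means the
    # processed/*.pt files were fully written.
    return (directory / "processed" / ".complete").is_file()
```

Sorting keys makes the digest independent of dict ordering, so only an actual settings change moves the cache.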

Run Journals

graphids.exp.runtime.launch_run writes these files under the run directory:

| File | Purpose |
|---|---|
| `.graphids/manifest.json` | resolved run identity, config, outputs, status, failure details |
| `.graphids/events.jsonl` | launch, finish, and failure events |

gx exp status <run_dir> reads these files.
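
Below is a sketch of the kind of summary a status reader can build from the journal. It assumes the manifest has a top-level `"status"` key and that each event line has an `"event"` key. Those field names and `read_journal` are hypothetical, not the real schema or the real `gx exp status` code.

```python
import json
from pathlib import Path
from typing import Any


def read_journal(run_dir: Path) -> dict[str, Any]:
    """Summarize a run from its .graphids journal (field names are assumptions)."""
    journal = run_dir / ".graphids"
    manifest = json.loads((journal / "manifest.json").read_text())
    events = []
    events_path = journal / "events.jsonl"
    if events_path.is_file():
        for line in events_path.read_text().splitlines():
            if line.strip():  # JSON Lines: one JSON object per non-empty line
                events.append(json.loads(line))
    return {
        "status": manifest.get("status"),
        "last_event": events[-1].get("event") if events else None,
        "n_events": len(events),
    }
```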

MLflow

graphids._mlflow.configure_tracking_uri() defaults MLflow to:

sqlite:///{lake_root}/mlflow.db
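
Because the lake root is an absolute path, the resulting URI has four slashes after `sqlite:`. That is SQLAlchemy's form for an absolute database file. The helper below is a hypothetical illustration, not `configure_tracking_uri` itself. Its result is the kind of string one would pass to `mlflow.set_tracking_uri`.

```python
from pathlib import Path


def sqlite_tracking_uri(lake_root: Path) -> str:
    """Build sqlite:///{lake_root}/mlflow.db from an (ideally absolute) lake root."""
    # resolve() makes a relative root absolute before building the URI.
    db_path = Path(lake_root).expanduser().resolve() / "mlflow.db"
    # An absolute POSIX path starts with "/", giving "sqlite:////..." overall.
    return f"sqlite:///{db_path.as_posix()}"
```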

graphids.exp.runtime.launch_run creates an MLflow logger, logs run hyperparameters/tags, and marks the run FINISHED or FAILED. For Lightning fit and test, final callback metrics are explicitly logged after the trainer returns.
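
The FINISHED/FAILED bookkeeping follows a common pattern: run the stage, record the terminal status, and re-raise on failure so the caller still sees the error. The wrapper below is a hypothetical sketch of that pattern, not the `launch_run` code. With MLflow, one plausible `set_status` would be `lambda s: client.set_terminated(run_id, status=s)` using `MlflowClient`.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_status(stage: Callable[[], T], set_status: Callable[[str], None]) -> T:
    """Hypothetical wrapper: mark FINISHED on success, FAILED on any exception."""
    try:
        result = stage()
    except BaseException:  # includes KeyboardInterrupt
        set_status("FAILED")
        raise  # propagate so the caller and process exit code reflect the failure
    set_status("FINISHED")
    return result
```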

MLflow artifact roots are explicit lake paths:

{lake_root}/mlartifacts/{dataset}/{stage}/
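
An explicit artifact root can be expressed as a `file://` URI. `mlflow.create_experiment(name, artifact_location=...)` accepts a value of this kind. Whether graphids passes a URI or a plain path is not stated here. The helper and the example `stage` value are hypothetical.

```python
from pathlib import Path


def artifact_location(lake_root: Path, dataset: str, stage: str) -> str:
    """Per-(dataset, stage) artifact root under the lake, as a file:// URI."""
    path = Path(lake_root).expanduser().resolve() / "mlartifacts" / dataset / stage
    # Path.as_uri() requires an absolute path, hence resolve() above.
    return path.as_uri()
```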

MLflow system metrics are sampled by MLflowSystemMetricsCallback for Lightning-created runs.

SLURM

gx exp submit <yaml> writes one script per experiment name:

{slurm_log_dir}/scripts/{experiment_name}.sbatch

The script runs:

cd /users/PAS2022/rf15/graphids
source scripts/slurm/_preamble.sh
python -m graphids exp launch /abs/path/to/config.yml
source scripts/slurm/_epilog.sh

Stdout/stderr go to:

{slurm_log_dir}/{experiment_name}_%j.out
{slurm_log_dir}/{experiment_name}_%j.err
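
Below is a sketch of rendering and writing such a script. This version routes stdout/stderr through `#SBATCH --output`/`--error` directives, where sbatch expands `%j` to the job ID. Whether the real script sets them this way or on the command line, and which resource directives it adds, are not shown here. `render_sbatch` and `write_sbatch` are hypothetical names.

```python
import shlex
from pathlib import Path


def render_sbatch(experiment_name: str, config: Path, slurm_log_dir: Path,
                  repo: Path = Path("/users/PAS2022/rf15/graphids")) -> str:
    """Sketch of the per-experiment sbatch script; resource directives omitted."""
    out = slurm_log_dir / f"{experiment_name}_%j.out"  # sbatch expands %j to the job ID
    err = slurm_log_dir / f"{experiment_name}_%j.err"
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={experiment_name}",
        f"#SBATCH --output={out}",
        f"#SBATCH --error={err}",
        f"cd {shlex.quote(str(repo))}",
        "source scripts/slurm/_preamble.sh",
        f"python -m graphids exp launch {shlex.quote(str(config.resolve()))}",
        # No `set -e` in this sketch, so the epilog still runs if launch fails.
        "source scripts/slurm/_epilog.sh",
    ]
    return "\n".join(lines) + "\n"


def write_sbatch(experiment_name: str, config: Path, slurm_log_dir: Path) -> Path:
    # One script per experiment name: {slurm_log_dir}/scripts/{experiment_name}.sbatch
    script = slurm_log_dir / "scripts" / f"{experiment_name}.sbatch"
    script.parent.mkdir(parents=True, exist_ok=True)
    script.write_text(render_sbatch(experiment_name, config, slurm_log_dir))
    return script
```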

Execution Order

gx exp submit <yaml>
  -> ExperimentConfig.from_yaml
  -> build RunConfig for validation
  -> write sbatch script
  -> sbatch

compute node:
  -> scripts/slurm/_preamble.sh
  -> python -m graphids exp launch <yaml>
  -> ExperimentConfig.from_yaml
  -> launch_run
  -> run_stage
  -> manifest/events + MLflow
  -> scripts/slurm/_epilog.sh
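
After the `sbatch` step, the job ID links the submission to its `{experiment_name}_{job_id}.out/.err` logs. The helpers below are hypothetical and are not how `gx exp submit` is known to work. They accept both sbatch's default output (`Submitted batch job <id>`) and `--parsable` output (`<id>` or `<id>;<cluster>`).

```python
import re
import subprocess
from pathlib import Path


def parse_job_id(sbatch_stdout: str) -> str:
    """Extract the job ID from default or --parsable sbatch output."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match:
        return match.group(1)
    job_id = sbatch_stdout.strip().split(";")[0]
    if not job_id.isdigit():
        raise ValueError(f"unexpected sbatch output: {sbatch_stdout!r}")
    return job_id


def submit(script: Path) -> str:
    # Requires a SLURM client on PATH; raises CalledProcessError if sbatch fails.
    proc = subprocess.run(["sbatch", "--parsable", str(script)],
                          capture_output=True, text=True, check=True)
    return parse_job_id(proc.stdout)


def log_paths(slurm_log_dir: Path, experiment_name: str, job_id: str) -> tuple[Path, Path]:
    # Same names as the %j patterns once SLURM substitutes the job ID.
    stem = f"{experiment_name}_{job_id}"
    return slurm_log_dir / f"{stem}.out", slurm_log_dir / f"{stem}.err"
```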