Config Architecture

Jsonnet composition -> Pydantic validation -> direct instantiation. For file layout, stage conventions, and running examples, see .claude/rules/config-system.md.


1. CLI Routes

One training route + operational commands:

Route A: Train a preset

```
python -m graphids fit \
    --tla 'dataset="hcrl_ch"' \
    --tla 'scale="small"' \
    --config configs/ablations/unsupervised/vgae.jsonnet \
    --set model.init_args.lr=0.01
  -> __main__.py
  -> cli.training (Typer @app.command)
  -> render(jsonnet_path, tla)
  -> apply_overrides(rendered, --set ...)
  -> ResolvedConfig.from_rendered(rendered)    # validates + pulls run_dir
  -> build(resolved)  ->  train(artifacts, resolved, resume_from=--ckpt-path)
```
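The same flow as a minimal Python sketch. Import paths are inferred from the Key Files table; signatures are simplified, and apply_overrides below is a stand-in for the real --set handler, not its actual implementation:

```python
# Minimal sketch of the Route A flow; signatures are simplified assumptions,
# and apply_overrides is a stand-in for the real --set handler.
from graphids.config.jsonnet import render             # render(path, tla)
from graphids.orchestrate.config import ResolvedConfig
from graphids.orchestrate import stage                  # build / train primitives

def apply_overrides(rendered: dict, overrides: list[str]) -> dict:
    """Stand-in: apply --set 'a.b.c=value' patches by dotted-path assignment
    (values stay strings in this sketch)."""
    for item in overrides:
        path, _, value = item.partition("=")
        node = rendered
        *parents, leaf = path.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return rendered

def fit(jsonnet_path: str, tla: dict, overrides: list[str], ckpt_path: str | None = None):
    rendered = render(jsonnet_path, tla)                # Jsonnet composition -> plain dict
    rendered = apply_overrides(rendered, overrides)     # --set key=value patches
    resolved = ResolvedConfig.from_rendered(rendered)   # validates + pulls run_dir
    artifacts = stage.build(resolved)                   # importlib + filter_kwargs + callbacks
    return stage.train(artifacts, resolved, resume_from=ckpt_path)
```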

Every ablation preset under configs/ablations/*.jsonnet computes its own run_dir from (lake_root, dataset, seed) via _paths.libsonnet. The SLURM wrapper (scripts/run) just forwards TLAs.

Multi-stage chains (e.g. autoencoder → supervised → fusion) are handled by a Python DAG driver (graphids.slurm.dag; CLI: python -m graphids launch-ablation). The topology is a declarative OFAT_DAG tuple of FitNode/ExtractStatesNode entries; the executor walks it in topological order, submitting each node via scripts/run with SBATCH_DEP=afterok derived from the in-memory job-id (jid) map. There is no in-process pipeline driver.
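A rough sketch of what this declarative topology and walk could look like. The node fields, the fusion preset path, and submit_node() are assumptions for illustration, not the verified graphids.slurm.dag API:

```python
# Illustrative sketch of an OFAT_DAG tuple and its topological walk.
# Node fields, the fusion preset path, and submit_node() are assumptions.
import os
import subprocess
from dataclasses import dataclass

@dataclass(frozen=True)
class FitNode:
    name: str
    preset: str                       # ablation .jsonnet to train
    needs: tuple[str, ...] = ()       # upstream node names

@dataclass(frozen=True)
class ExtractStatesNode:
    name: str
    needs: tuple[str, ...] = ()

OFAT_DAG = (                          # declared in topological order
    FitNode("autoencoder", "configs/ablations/unsupervised/vgae.jsonnet"),
    ExtractStatesNode("extract-states", needs=("autoencoder",)),
    FitNode("fusion", "configs/ablations/fusion/late.jsonnet",  # placeholder path
            needs=("extract-states",)),
)

def submit_node(node, dep: str) -> str:
    """Hypothetical wrapper around scripts/run; returns the SLURM job id it prints."""
    env = dict(os.environ)
    if dep:
        env["SBATCH_DEP"] = f"afterok:{dep}"
    cmd = ["scripts/run", getattr(node, "preset", node.name)]
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True).stdout.strip()

def launch(dag=OFAT_DAG) -> dict[str, str]:
    jid: dict[str, str] = {}                        # node name -> SLURM job id
    for node in dag:                                # tuple order == topological order
        dep = ":".join(jid[n] for n in node.needs)
        jid[node.name] = submit_node(node, dep)
    return jid
```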

Route B: Operational commands (no training)

```
python -m graphids {analyze|rebuild-caches|extract-fusion-states|compare}
  -> __main__.py imports cli submodules
  -> Typer @app.command() dispatch per submodule
```
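A rough sketch of this Typer dispatch pattern. Whether graphids registers commands on one shared app or via sub-apps, and the exact command signatures, are assumptions:

```python
# Rough sketch of the Typer registration pattern described above; whether
# graphids uses one shared app or per-submodule sub-apps is an assumption.
import typer

app = typer.Typer()

@app.command()
def analyze(run_dir: str) -> None:
    """Operational command (no training). Heavy imports stay inside the body
    so the CLI itself stays torch-free until a command actually runs."""
    from graphids.core.analysis.runner import run_single_analysis  # per Key Files table
    run_single_analysis(run_dir)  # call signature is an assumption

@app.command(name="rebuild-caches")
def rebuild_caches() -> None:
    ...

if __name__ == "__main__":
    app()
```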

2. Pydantic Validation Layer

graphids/config/schemas.py::validate_config(rendered) -> ValidatedConfig runs immediately after render on every path. Torch-free, deterministic.

Schema tree

```
ValidatedConfig (extra="forbid")
+-- seed_everything: int
+-- trainer: TrainerSection    (extra="allow" -- TrainerConfig dataclass kwargs flow through)
+-- data: ClassPathBlock       (extra="forbid"; class_path required)
+-- model: ClassPathBlock      (extra="forbid"; class_path required)
+-- checkpoint: CheckpointSection  (mode: Literal["min","max"])
+-- early_stopping: EarlyStoppingSection  (mode: Literal["min","max"])
+-- ckpt_path: str | None      (auto-resume passthrough)
```
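A condensed Pydantic sketch matching the tree above. Field names come from the tree; defaults, the init_args typing, and any fields not shown are assumptions:

```python
# Condensed sketch of the schema tree above; fields beyond those shown,
# and their defaults, are assumptions about the real schemas.py.
from typing import Literal
from pydantic import BaseModel, ConfigDict

class ClassPathBlock(BaseModel):
    model_config = ConfigDict(extra="forbid")
    class_path: str                              # required
    init_args: dict = {}

class TrainerSection(BaseModel):
    model_config = ConfigDict(extra="allow")     # TrainerConfig dataclass kwargs flow through

class CheckpointSection(BaseModel):
    monitor: str
    mode: Literal["min", "max"]

class EarlyStoppingSection(BaseModel):
    monitor: str
    mode: Literal["min", "max"]

class ValidatedConfig(BaseModel):
    model_config = ConfigDict(extra="forbid")
    seed_everything: int
    trainer: TrainerSection
    data: ClassPathBlock
    model: ClassPathBlock
    checkpoint: CheckpointSection
    early_stopping: EarlyStoppingSection
    ckpt_path: str | None = None                 # auto-resume passthrough
```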

Model validators

| Validator | Rule | Why it exists |
| --- | --- | --- |
| `_no_null_list_fields` | `model.init_args.{pool_aggrs, hidden_dims, auxiliaries}` must not be null | Instantiation rejects null lists with a cryptic error |
| `_monitor_pair_consistent` | `checkpoint.monitor/mode` == `early_stopping.monitor/mode` | Divergent monitors = typo in the stage libsonnet |
| `_lr_monitor_requires_logger` | `LearningRateMonitor` callback needs `trainer.logger != False` | LR monitor is silently disabled without a logger |
| `_class_paths_namespaced` | `data.class_path` and `model.class_path` must start with `graphids.` | Catches relative imports and stray modules |
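For illustration, the monitor-consistency rule could be expressed as a Pydantic v2 model validator along these lines, added to the ValidatedConfig sketch above; this is not the actual code in schemas.py:

```python
# Illustrative only: one way to express _monitor_pair_consistent.
# Add this method to the ValidatedConfig sketch above; the real
# implementation in graphids/config/schemas.py may differ.
from pydantic import model_validator

@model_validator(mode="after")
def _monitor_pair_consistent(self):
    ckpt, es = self.checkpoint, self.early_stopping
    if (ckpt.monitor, ckpt.mode) != (es.monitor, es.mode):
        raise ValueError(
            "checkpoint and early_stopping must agree on monitor/mode: "
            f"{ckpt.monitor}/{ckpt.mode} vs {es.monitor}/{es.mode}"
        )
    return self
```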

3. Forced Callbacks + Direct Instantiation

Critical callbacks are constructed explicitly by instantiate._build_callbacks(). Stage-level trainer.callbacks entries are appended as user callbacks after the forced set; they can add to it but never drop it.

Forced callbacks (from defaults.libsonnet): ModelCheckpoint, EarlyStopping, MLflowTrainingCallback, CurriculumEpochCallback. No trainer logger (MLflow callback handles metrics).
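A sketch of this pattern, assuming stage-level callbacks arrive as class_path/init_args blocks; the real _build_callbacks and the forced constructors' arguments may differ:

```python
# Sketch of the forced-callback pattern: the forced set is always constructed,
# and stage-level trainer.callbacks entries are appended, never substituted.
# Constructor arguments and the class_path/init_args spec format are assumptions.
import importlib
from lightning.pytorch.callbacks import EarlyStopping, ModelCheckpoint

def _build_callbacks(rendered: dict) -> list:
    ckpt, es = rendered["checkpoint"], rendered["early_stopping"]
    forced = [
        ModelCheckpoint(monitor=ckpt["monitor"], mode=ckpt["mode"]),
        EarlyStopping(monitor=es["monitor"], mode=es["mode"]),
        # MLflowTrainingCallback() and CurriculumEpochCallback() are constructed
        # here too; their import paths are omitted in this sketch.
    ]
    user = []
    for spec in rendered.get("trainer", {}).get("callbacks", []) or []:
        module_name, _, class_name = spec["class_path"].rpartition(".")
        klass = getattr(importlib.import_module(module_name), class_name)
        user.append(klass(**spec.get("init_args", {})))
    return forced + user              # user callbacks append; the forced set stays
```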

build_run() responsibilities

graphids.orchestrate.instantiate.build_run(rendered, validated=None):

| Step | How |
| --- | --- |
| Class-path import | `importlib.import_module` + `getattr` |
| Signature-filtered kwargs | `filter_kwargs(klass, init_args)` |
| Callbacks / logger | `build_callbacks(rendered)` / `build_loggers(rendered)` — explicit construction |
| KD loss injection | `inject_loss_fn` pops `distillation_config`, builds loss via `build_loss()` |
| seed_everything | explicit `seed_everything(rendered["seed_everything"])` |
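To make the first two rows concrete, class-path import plus signature filtering might look like this; the real instantiate.py helpers likely handle more edge cases:

```python
# Sketch of class-path import + signature-filtered kwargs (first two rows above).
# The real helpers may handle more cases (e.g. **kwargs, aliases, nested blocks).
import importlib
import inspect

def import_class(class_path: str):
    """Resolve 'pkg.module.ClassName' to the class object."""
    module_name, _, class_name = class_path.rpartition(".")
    return getattr(importlib.import_module(module_name), class_name)

def filter_kwargs(klass, init_args: dict) -> dict:
    """Drop init_args that the class constructor does not accept."""
    params = inspect.signature(klass.__init__).parameters
    return {k: v for k, v in init_args.items() if k in params}

# Usage: klass = import_class(rendered["model"]["class_path"])
#        model = klass(**filter_kwargs(klass, rendered["model"]["init_args"]))
```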

4. Key Files

| File | Role | Torch? |
| --- | --- | --- |
| `cli/training.py` | fit / test — renders preset, builds + runs | Lazy |
| `instantiate.py` | `build_run(rendered) -> InstantiatedRun` — importlib, filter_kwargs, callback wiring | Yes |
| `__main__.py` | Imports `cli/` submodules to register Typer commands | Lazy |
| `config/jsonnet.py` | `render(path, tla)` via `_jsonnet` C bindings | No |
| `config/schemas.py` | `ValidatedConfig`, `validate_config`, `ConfigValidationError` | No |
| `config/topology.py` | Stage-file existence check, dataset catalog, path helpers | No |
| `orchestrate/config.py` | `ResolvedConfig`, `InstantiatedRun` | No |
| `orchestrate/stage.py` | `build`, `train`, `evaluate` primitives | Yes |
| `core/analysis/runner.py` | `run_single_analysis` — invoked by `graphids analyze` CLI | Yes |
| `core/monitoring.py` | `SlurmResourceDetector` (OTel resource attrs) | No |
| `core/mlflow_callback.py` | `MLflowTrainingCallback` (per-epoch metrics + finalize) | Yes |
| `_mlflow.py` | `start_training_run`, `log_epoch_metrics`, `log_test_run`, lifecycle | Lazy |
| `_otel.py` | `init_providers`, `wire_file_exporters` | No |