Architecture Decision Log

Consolidated 2026-04-09. Full rationale for each decision is in git history (search by ADR number). Entries are permanent: if a decision is reversed, add a note here rather than deleting the entry.

Decisions

0001 — Reject Hydra/OmegaConf. Jsonnet replaced YAML chains for config composition. Hydra's defaults lists offer no advantage over jsonnet's native imports, and the impedance mismatch between OmegaConf's DictConfig and the dict jsonnet renders creates a dual-merge anti-pattern: config would be merged once in jsonnet and again in OmegaConf.

0002 — Forced callbacks via explicit construction. Stage configs that set trainer.callbacks silently dropped ModelCheckpoint/EarlyStopping (jsonnet list replacement). Critical callbacks now live in top-level namespaces (checkpoint.*, early_stopping.*) and are constructed by instantiate._build_callbacks(), immune to stage overrides.
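
The failure mode and the fix in 0002 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's code: `merge` imitates a deep merge in which objects combine but lists are replaced (the behaviour the entry describes), `Checkpoint` and `EarlyStop` are hypothetical stand-ins for ModelCheckpoint and EarlyStopping, and the body of `build_callbacks` is a guess at the shape of instantiate._build_callbacks(), whose real implementation is not shown here.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:  # hypothetical stand-in for ModelCheckpoint
    monitor: str


@dataclass
class EarlyStop:  # hypothetical stand-in for EarlyStopping
    monitor: str
    patience: int


def merge(base: dict, override: dict) -> dict:
    """Deep-merge two config dicts: objects combine, lists are replaced wholesale."""
    out = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)
        else:
            out[key] = value  # a list in the override replaces the base list
    return out


def build_callbacks(cfg: dict) -> list:
    """Build critical callbacks from top-level namespaces, then append any extras."""
    forced = [
        Checkpoint(**cfg["checkpoint"]),
        EarlyStop(**cfg["early_stopping"]),
    ]
    return forced + list(cfg.get("trainer", {}).get("callbacks", []))


base = {
    "checkpoint": {"monitor": "val_loss"},
    "early_stopping": {"monitor": "val_loss", "patience": 5},
    "trainer": {"max_epochs": 10, "callbacks": ["lr_monitor"]},
}
# A stage config that sets trainer.callbacks replaces the base list entirely.
stage = {"trainer": {"callbacks": ["progress_bar"]}}

cfg = merge(base, stage)
callbacks = build_callbacks(cfg)
```

Because the critical callbacks are read from `checkpoint` and `early_stopping` rather than from `trainer.callbacks`, the stage override can only change the extras at the end of the list.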

0003 — Consolidate train/test/analyze into one SLURM job. (superseded 2026-04-15: pipeline deleted entirely — each ablation preset trains+evals in one job; analysis is a separate graphids analyze invocation.) Original context: separate SLURM jobs per phase caused analysis to run on CPU dagster workers and introduced process-boundary failures.

0004 — Keep custom VRAM probe, reject Lightning profilers. The VRAM probe must run before DataLoader construction to size NodeBudgetBatchSampler. All Lightning profilers/callbacks run after the DataLoader is built — lifecycle mismatch makes them unusable for batch sizing.
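
The ordering constraint in 0004 (probe first, then sampler, then DataLoader) can be sketched as follows. Everything here is an assumption made for illustration: `node_budget`, `node_budget_batches`, the safety factor, and the greedy packing rule are hypothetical and are not taken from NodeBudgetBatchSampler. The probe result is passed in as plain numbers so the sketch runs without a GPU.

```python
from typing import Iterator, List, Sequence


def node_budget(free_bytes: int, bytes_per_node: float, safety: float = 0.8) -> int:
    """Turn a probed free-VRAM figure into a per-batch node budget."""
    if bytes_per_node <= 0:
        raise ValueError("bytes_per_node must be positive")
    return int(free_bytes * safety / bytes_per_node)


def node_budget_batches(node_counts: Sequence[int], budget: int) -> Iterator[List[int]]:
    """Greedily pack graph indices so each batch stays within the node budget."""
    batch: List[int] = []
    used = 0
    for idx, n in enumerate(node_counts):
        if batch and used + n > budget:
            yield batch
            batch, used = [], 0
        batch.append(idx)  # an oversized graph still gets a batch of its own
        used += n
    if batch:
        yield batch


# Step 1: the probe runs first (the numbers are made up for this sketch).
budget = node_budget(free_bytes=1_000_000, bytes_per_node=1_000, safety=0.5)
# Step 2: only now can the sampler, and after it the DataLoader, be constructed.
batches = list(node_budget_batches([200, 200, 200, 450, 100], budget))
```

A profiler that only starts once the DataLoader exists would deliver the `free_bytes` figure too late to influence `budget`, which is the lifecycle mismatch the entry refers to.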

0005 — Wandb removed as direct dependency; OTel replaces it. WandbLogger/CSVLogger replaced by OpenTelemetry (OTelTrainingCallback + OTelTrainingLogger). Wandb Weave receives traces optionally via OTLP when WANDB_API_KEY is set. Model Registry and Data Artifacts were rejected (quota limits, no file:// support).

0006 — Dagster removed; pipeline driver deleted. (updated 2026-04-15: the Python pipeline driver was itself removed — multi-stage chains are a bash loop over scripts/run <preset.jsonnet> with SBATCH_DEP=afterok:<jid>.) Dagster's multi-job model caused queue-wait overhead between stages; the Python in-process driver that replaced it duplicated declarations the jsonnet presets already made (run_dir, identity hash, stage DAG, upstream family mapping). Dropping the driver collapsed the two routes into one.

0007 — Config system: independent axes + typed contract. (simplified 2026-04-15: TrainingRunConfig / StageConfig / ResolvedConfig.resolve deleted with the pipeline. Each ablation preset is now a self-contained jsonnet function; validation is Pydantic ValidatedConfig on the rendered dict only.) Original context: config combinatorial explosion (scale x model in one file) and parallel topology declarations caused silent drift. Fix: independent config axes in jsonnet, Pydantic extra="forbid" validator on the rendered dict. Don't adopt Hydra, don't mirror every __init__ signature.
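
A minimal sketch of what an extra="forbid" validator on the rendered dict buys, assuming Pydantic v2. The class name ValidatedConfig comes from the entry above; the fields (`scale`, `arch`, `trainer.max_epochs`) and the `TrainerCfg` model are hypothetical and chosen only to show the mechanism.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class TrainerCfg(BaseModel):  # hypothetical nested section
    model_config = ConfigDict(extra="forbid")
    max_epochs: int


class ValidatedConfig(BaseModel):  # fields are illustrative, not the real schema
    model_config = ConfigDict(extra="forbid")
    scale: str
    arch: str
    trainer: TrainerCfg


# A well-formed rendered dict validates into typed attributes.
rendered = {"scale": "small", "arch": "gat", "trainer": {"max_epochs": 10}}
cfg = ValidatedConfig.model_validate(rendered)

# A misspelled key is rejected instead of being silently ignored.
typo = {"scale": "small", "arch": "gat", "trainer": {"max_epoch": 10}}
try:
    ValidatedConfig.model_validate(typo)
    rejected = False
    error_types = set()
except ValidationError as err:
    rejected = True
    error_types = {e["type"] for e in err.errors()}
```

The contract covers only the keys the rendered dict is expected to carry, which is why it does not need to mirror every `__init__` signature.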

0008 — No custom collation; prebatched path supersedes both. Custom _FastCollate was 1.6x slower than warm Batch.from_data_list() over full training (warm cache via persistent_workers=True). Both paths are now moot — prebatching collates all batches once at setup with num_workers=0.

0009 — Collapse override handoff chain. (superseded 2026-04-15: the remaining two-step handoff (ResolvedConfig.resolve → instantiate) collapsed to one: ResolvedConfig.from_rendered → build_run.) Original context: a 9-step handoff stringified override dicts across process boundaries, with validation only inside the SLURM job. Collapsed iteratively; the final path is render → apply_overrides → ResolvedConfig.from_rendered → build_run.
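
The final path in 0009 (render → apply_overrides → ResolvedConfig.from_rendered → build_run) can be sketched with stand-ins. The function and class names come from the entry; every body below, the dotted-key override format, and the fields are assumptions made for illustration. The point of the sketch is that overrides stay typed Python values applied to a dict, rather than being stringified and re-parsed.

```python
from copy import deepcopy
from dataclasses import dataclass


def apply_overrides(rendered: dict, overrides: dict) -> dict:
    """Apply dotted-key overrides to a copy of the rendered dict."""
    out = deepcopy(rendered)
    for dotted, value in overrides.items():
        *parents, leaf = dotted.split(".")
        node = out
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value  # the value keeps its Python type, no string round-trip
    return out


@dataclass(frozen=True)
class ResolvedConfig:  # hypothetical fields
    lr: float
    max_epochs: int

    @classmethod
    def from_rendered(cls, rendered: dict) -> "ResolvedConfig":
        trainer = rendered["trainer"]
        if trainer["max_epochs"] <= 0:  # hypothetical validation rule
            raise ValueError("trainer.max_epochs must be positive")
        return cls(lr=float(rendered["optimizer"]["lr"]),
                   max_epochs=int(trainer["max_epochs"]))


def build_run(cfg: ResolvedConfig) -> str:
    """Stand-in for the real build step; returns a description of the run."""
    return f"run(lr={cfg.lr}, max_epochs={cfg.max_epochs})"


# This dict stands in for the output of the render step.
rendered = {"optimizer": {"lr": 0.001}, "trainer": {"max_epochs": 10}}
resolved = ResolvedConfig.from_rendered(
    apply_overrides(rendered, {"trainer.max_epochs": 3})
)
summary = build_run(resolved)
```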

0010 — Use go-jsonnet binary, not the jsonnet PyPI package. go-jsonnet is 10-100x faster than libjsonnet, requires no C++ compile step on OSC, and installs as a single static binary to ~/.local/bin/jsonnet. Python access via subprocess.run in graphids/config/jsonnet.py (~5ms per render, not a hot path).
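
A minimal sketch of the subprocess route in 0010. The helper names `jsonnet_cmd` and `render`, the preset path, and the `seed` argument are hypothetical, and the real graphids/config/jsonnet.py may differ. The `--tla-str` flag is the jsonnet CLI's option for passing a top-level argument as a string.

```python
import json
import subprocess
from typing import Dict, List, Optional


def jsonnet_cmd(path: str, tla: Optional[Dict[str, str]] = None,
                binary: str = "jsonnet") -> List[str]:
    """Build the argv for rendering one jsonnet file with string top-level args."""
    cmd = [binary]
    for key, value in (tla or {}).items():
        cmd += ["--tla-str", f"{key}={value}"]
    cmd.append(path)
    return cmd


def render(path: str, tla: Optional[Dict[str, str]] = None) -> dict:
    """Render a jsonnet file to a dict by shelling out to the binary."""
    proc = subprocess.run(
        jsonnet_cmd(path, tla),
        capture_output=True,
        text=True,
        check=True,  # a jsonnet error raises CalledProcessError with stderr attached
    )
    return json.loads(proc.stdout)


# Building the command needs no binary; calling render() requires jsonnet on PATH.
cmd = jsonnet_cmd("preset.jsonnet", {"seed": "0"})
```

Keeping the argv construction separate from the call makes the wrapper testable on machines where the binary is not installed.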

Library Evaluations (don't re-investigate)

tach (module boundary enforcement) — Strong fit for enforcing config/ never imports torch and orchestrate never imports core at definition time. Not yet adopted; revisit when adding CI. Full report in git history.

icontract (Design by Contract) — Marginal benefit. Overlap with Pydantic for config validation is near-total. Only use case: SLOW-gated tensor shape/NaN contracts during development. Not adopted.

PySlurm (Cython SLURM bindings) — Technically viable on OSC (libslurmfull.so at /usr/lib64/slurm/, PySlurm 25.5.0 matches SLURM 25.05). Not adopted: tight version coupling (every SLURM upgrade requires rebuild), GPL-2.0 license, replaces only ~330 lines of sacct subprocess calls.

simple_slurm (subprocess sbatch wrapper) — Installs trivially, no version coupling, clean script-generation API. But solves a problem we don't have (sbatch generation — submit.sh exists), lacks sacct parsing, AGPL-3.0 license, squeue broken on array jobs (issue #44), hijacks root logger at import time (issue #42). Net code savings: ~0. Not adopted.

pyslurmutils (SLURM REST executor) — Blocked on OSC: requires slurmrestd daemon which is not available. Not adopted.

slurm-pipeline (shell-script pipeline DAGs) — Pipeline model doesn't fit (expects TASK: stdout protocol). Heavy deps (pandas+plotly). Stale (no updates since Oct 2024). Not adopted.

Parsl (parallel workflow library) — Strong SLURM support, auto-scaling pilot jobs, active maintenance (UChicago/Argonne, weekly releases). Not needed for current fixed 3-stage pipeline (the in-process loop is sufficient). Worth revisiting if large hyperparameter sweeps outgrow the explicit configs/ablations/ tree.

Globus Compute / funcX (federated FaaS) — Cloud-routed task dispatch. Adds complexity without benefit for single-cluster use. Not adopted.

Garden AI, Cascade, ProxyStore, Colmena, GlassBox — Domain-specific or wrong phase. Not relevant.

python-fire (CLI generator) — Zero-boilerplate CLI from introspection. No benefit over Typer: loses type validation, repeatable --tla flags, and structured help. Not adopted.

pydantic-settings — Adopted (session 39). GraphIDSSettings(BaseSettings) with env_prefix="GRAPHIDS_" replaced 19 scattered os.environ.get() calls.

OpenTelemetry — Adopted (session 39). Replaced wandb + RunRecordCallback + ResourceProfileCallback + CSVLogger + custom logging with unified OTel stack. See docs/reference/observability.md.