Core: Trainer¶
Pure-PyTorch training loop; Lightning was removed. Single-GPU only (the project targets 1× V100). Handles AMP via GradScaler, gradient clipping, AMP-safe scheduler skipping on batches where scale warm-up hits inf/NaN gradients, and the callback lifecycle, using the same hook names as Lightning so the OTel and curriculum callbacks ported over without change.
graphids.core.trainer¶
trainer ¶
Pure-PyTorch training loop for GraphIDS.
Single-GPU only (project uses 1× V100). Handles (see the sketch after this list):
- AMP via torch.amp.autocast + GradScaler(enabled=...) (no-op when disabled)
- Gradient clipping via clip_grad_norm_
- automatic_optimization=False for RL fusion models
- Metric accumulation and logger dispatch
- Callback lifecycle (same hook names as Lightning)
- Checkpoint resume
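A minimal sketch of how these pieces typically fit together in one optimization step, assuming a CUDA device; the helper signature and names such as `use_amp`, `clip_val`, and `training_step` are illustrative, not the trainer's actual API:

```python
import torch

def amp_train_step(model, batch, optimizer, scheduler, scaler, use_amp, clip_val):
    """One training step with AMP, clipping, and scheduler skipping (illustrative)."""
    optimizer.zero_grad(set_to_none=True)
    # autocast and GradScaler are created with enabled=use_amp, so the same code
    # path becomes a plain fp32 pass-through when AMP is disabled.
    with torch.amp.autocast("cuda", enabled=use_amp):
        loss = model.training_step(batch)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale before clipping so the norm is measured in fp32 units
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_val)
    scale_before = scaler.get_scale()
    scaler.step(optimizer)      # silently skipped if gradients contain inf/nan
    scaler.update()
    # Only advance the LR schedule if the optimizer step actually ran; a reduced
    # scale after update() means the step was skipped during scale warm-up.
    if scaler.get_scale() >= scale_before:
        scheduler.step()
    return loss.detach()
```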
MetricAccumulator ¶
Dynamic-keyed batch-weighted mean.
Plain dict[str, (sum, count)] — NOT an nn.Module. These are
transient per-phase accumulators; storing them in a ModuleDict
both pollutes the parent's state_dict and rejects keys with
"." (add_module's attribute-name check), breaking metric names
like "test/precision@0.95recall".
NaN detection hard-fails the run — under precision: 16-mixed a
silent NaN in callback_metrics fools EarlyStopping
(NaN < inf is False) and wastes the full patience window.
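A hedged re-creation of that structure; the real class lives in graphids/core/trainer.py and its method names may differ:

```python
import math

class MetricAccumulator:
    """Batch-weighted running mean keyed by arbitrary metric names."""

    def __init__(self) -> None:
        # Plain dict, not an nn.ModuleDict: keys such as "test/precision@0.95recall"
        # contain "." and would fail add_module's attribute-name check, and the
        # values are transient per-phase state that should stay out of state_dict.
        self._store: dict[str, tuple[float, int]] = {}

    def update(self, name: str, value: float, batch_size: int = 1) -> None:
        if math.isnan(value):
            # Hard-fail: a silent NaN would fool EarlyStopping (NaN < inf is False).
            raise FloatingPointError(f"NaN metric {name!r}")
        total, count = self._store.get(name, (0.0, 0))
        self._store[name] = (total + value * batch_size, count + batch_size)

    def compute(self) -> dict[str, float]:
        return {name: total / count for name, (total, count) in self._store.items()}
```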
Trainer ¶
Single-GPU training loop with AMP, gradient clipping, and callbacks.
fit ¶
Fit the model.
Wires datamodule setup → device selection → model.setup → .to(device), then runs
the train/val loop up to max_epochs or until a callback flips
trainer.should_stop. ckpt_path resumes weights +
optimizer + scheduler + AMP scaler state; on_exception fires
on any raise so callbacks can close MLflow runs cleanly before
re-raising.
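A sketch of the resume and exception path; the checkpoint keys and trainer attributes used here (start_epoch, train_epoch, val_epoch, callbacks) are assumptions, not confirmed by the source:

```python
import torch

def fit(trainer, model, datamodule, ckpt_path=None):
    """Illustrative outline of the fit flow; names and checkpoint keys are assumed."""
    if ckpt_path:
        ckpt = torch.load(ckpt_path, map_location=trainer.device)
        model.load_state_dict(ckpt["state_dict"])
        trainer.optimizer.load_state_dict(ckpt["optimizer"])
        trainer.scheduler.load_state_dict(ckpt["scheduler"])
        trainer.scaler.load_state_dict(ckpt["scaler"])   # AMP scale state
    try:
        for epoch in range(trainer.start_epoch, trainer.max_epochs):
            trainer.train_epoch(model, datamodule, epoch)
            trainer.val_epoch(model, datamodule, epoch)
            if trainer.should_stop:                      # flipped by a callback
                break
    except BaseException as err:
        for cb in trainer.callbacks:
            cb.on_exception(trainer, model, err)         # e.g. close the MLflow run
        raise
```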
predict ¶
Run predict_step over every test loader and return the
concatenated list. Sets up with stage "predict" so datamodules can
swap in a predict-specific loader.
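Illustrative only, assuming a Lightning-style datamodule interface (setup(stage), test_dataloader()):

```python
import torch

@torch.no_grad()
def predict(model, datamodule):
    """Concatenate predict_step outputs from every test loader (sketch)."""
    datamodule.setup("predict")            # lets the datamodule swap in a predict loader
    model.eval()
    loaders = datamodule.test_dataloader()
    if not isinstance(loaders, (list, tuple)):
        loaders = [loaders]
    outputs = []
    for loader in loaders:
        for batch_idx, batch in enumerate(loader):
            outputs.append(model.predict_step(batch, batch_idx))
    return outputs
```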
predict_on ¶
Run predict_step over a single loader. Assumes the model and datamodule are already set up.
test ¶
Evaluate on all test dataloaders, return aggregated metrics.
Multiple test loaders (e.g. one per attack subdir) are dispatched
with a dataloader_idx so test_step can name metrics per
subdir.
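A sketch of that dispatch, reusing an accumulator like the MetricAccumulator sketch above; the test_step contract shown here (receiving dataloader_idx and returning a dict of metric values) is an assumption:

```python
def run_test_loaders(model, loaders, metrics):
    """Route each test loader through test_step with its dataloader_idx (illustrative)."""
    for dataloader_idx, loader in enumerate(loaders):
        for batch_idx, batch in enumerate(loader):
            # test_step can fold dataloader_idx (or the subdir name looked up from it)
            # into the metric key, e.g. "test/ddos/precision".
            results = model.test_step(batch, batch_idx, dataloader_idx=dataloader_idx)
            for name, value in results.items():
                metrics.update(name, float(value))
    return metrics.compute()
```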
validate ¶
Run one validation pass, return aggregated metrics.
Sets up with stage "fit" (not "validate") because the val
loader is built there; the train loader is allocated but never
iterated.
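A corresponding sketch, again assuming a Lightning-style datamodule and a validation_step that returns a dict of metrics:

```python
import torch

@torch.no_grad()
def validate(model, datamodule):
    """One validation pass (sketch); names here are illustrative."""
    datamodule.setup("fit")        # the val loader is only built under the "fit" stage
    model.eval()
    metrics = MetricAccumulator()  # see the sketch above
    for batch_idx, batch in enumerate(datamodule.val_dataloader()):
        for name, value in model.validation_step(batch, batch_idx).items():
            metrics.update(name, float(value))
    return metrics.compute()
```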
TrainerConfig dataclass ¶
TrainerConfig(max_epochs: int = 300, precision: str = '16-mixed', gradient_clip_val: float = 1.0, log_every_n_steps: int = 50, accelerator: str = 'auto', devices: str | int = 'auto', default_root_dir: str = '')
Flat config matching the jsonnet trainer section keys.
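For example, the config can be populated by hand or from a parsed jsonnet trainer block; the field names come from the signature above, while the default_root_dir value and the `config["trainer"]` dict are only illustrative:

```python
from graphids.core.trainer import TrainerConfig

cfg = TrainerConfig(
    max_epochs=300,
    precision="16-mixed",
    gradient_clip_val=1.0,
    log_every_n_steps=50,
    accelerator="auto",
    devices="auto",
    default_root_dir="runs/graphids",   # illustrative path
)
# or, if config["trainer"] holds the parsed jsonnet section with matching keys:
# cfg = TrainerConfig(**config["trainer"])
```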
seed_everything ¶
Seed Python, NumPy, and PyTorch RNGs. torch.manual_seed covers CPU + CUDA.
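A minimal sketch matching that description; the real helper may differ in detail:

```python
import random

import numpy as np
import torch

def seed_everything(seed: int) -> None:
    """Seed Python, NumPy, and PyTorch RNGs (sketch)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)   # also seeds all CUDA devices
```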