Core: Callbacks¶
Pure-Python callback protocol (CallbackBase) with concrete no-op
defaults so subclasses override only what they need. The three
first-party callbacks:
- `ModelCheckpoint` — atomic best + last ckpt persistence, owns the `checkpoints/` subdir convention
- `EarlyStopping` — flips `trainer.should_stop` at the epoch boundary
- `VRAMDriftCallback` — one-shot warn when free VRAM drifts past threshold (co-resident process / activation-leak detector)
DGI's OCGIN centroid is fit fresh at Trainer.test start (not
persisted in state_dict), so it needs no callback.
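A minimal construction sketch, assuming the three callbacks are collected into a list for the trainer; only ModelCheckpoint's constructor arguments are documented on this page, so the other two are built with defaults.

```python
# Usage sketch: how the trainer consumes this list is not shown on this page.
from graphids.core.callbacks import EarlyStopping, ModelCheckpoint, VRAMDriftCallback

callbacks = [
    ModelCheckpoint(monitor="val_loss", mode="min", save_last=True),
    EarlyStopping(),       # flips trainer.should_stop at the epoch boundary
    VRAMDriftCallback(),   # one-shot warn on free-VRAM drift
]
```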
graphids.core.callbacks¶
callbacks ¶
Training callback and logger protocols — pure-Python replacements for Lightning.
Callback lifecycle mirrors Lightning's hook names so existing OTel and curriculum callbacks need minimal changes.
CallbackBase ¶
Concrete base with no-op defaults so subclasses only override what they need.
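A sketch of the no-op-default pattern, assuming a Lightning-style hook set (`on_fit_start`, `on_train_epoch_end`, ...); the real class may expose more hooks than shown here.

```python
# Sketch only: the exact hook names are assumed from "mirrors Lightning's
# hook names" above, not copied from the source.
class CallbackBaseSketch:
    """Concrete no-op defaults: every hook is safe to leave unoverridden."""

    def on_fit_start(self, trainer, model) -> None: ...
    def on_train_epoch_end(self, trainer, model) -> None: ...
    def on_validation_epoch_end(self, trainer, model) -> None: ...
    def on_fit_end(self, trainer, model) -> None: ...


class PrintEpochCallback(CallbackBaseSketch):
    # Subclasses override only the hooks they need.
    def on_train_epoch_end(self, trainer, model) -> None:
        print("train epoch finished")
```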
EarlyStopping dataclass ¶
Bases: CallbackBase
Stop training when monitored metric stops improving.
Flips trainer.should_stop at the epoch boundary — doesn't raise.
The fit loop observes the flag after the scheduler step so the
current epoch's metrics are logged before exit.
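A hedged sketch of that flag-flip behaviour; the `patience` field and the metric lookup via `trainer.logged_metrics` are assumptions for illustration, while `monitor`, `mode`, and `trainer.should_stop` come from this page.

```python
# Sketch of the flip-don't-raise pattern; patience and logged_metrics are
# assumed names, not the documented interface.
from dataclasses import dataclass
from typing import Optional

from graphids.core.callbacks import CallbackBase


@dataclass
class EarlyStoppingSketch(CallbackBase):
    monitor: str = "val_loss"
    mode: str = "min"
    patience: int = 5
    _best: Optional[float] = None
    _bad_epochs: int = 0

    def on_validation_epoch_end(self, trainer, model) -> None:
        value = trainer.logged_metrics[self.monitor]  # assumed metric source
        improved = (
            self._best is None
            or (self.mode == "min" and value < self._best)
            or (self.mode == "max" and value > self._best)
        )
        if improved:
            self._best, self._bad_epochs = value, 0
        else:
            self._bad_epochs += 1
        if self._bad_epochs >= self.patience:
            trainer.should_stop = True  # flag only; the fit loop exits after logging
```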
ModelCheckpoint dataclass ¶
ModelCheckpoint(monitor: str = 'val_loss', mode: str = 'min', save_top_k: int = 1, save_last: bool = True, filename: str = 'best_model', dirpath: str = '', best_model_path: str = '')
Bases: CallbackBase
Save best + last checkpoints based on a monitored metric.
Writes to {trainer.default_root_dir}/checkpoints/ unless an
explicit dirpath is set. The /checkpoints subdir convention
is owned here so neither jsonnet nor the instantiator has to wire
it from the trainer's run_dir.
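A sketch of the two conventions described above: dirpath resolution with the checkpoints/ fallback, and an atomic write via a temp-file rename. The page only says the persistence is atomic, so `torch.save` plus `os.replace` is an assumed mechanism.

```python
# Sketch: the checkpoints/ fallback is from this page; temp-file-plus-rename
# as the atomic-write mechanism is an assumption.
import os

import torch


def resolve_dirpath(trainer, dirpath: str = "") -> str:
    # The checkpoints/ subdir convention is owned here, not by jsonnet or the
    # instantiator.
    return dirpath or os.path.join(trainer.default_root_dir, "checkpoints")


def atomic_save(state: dict, path: str) -> None:
    os.makedirs(os.path.dirname(path), exist_ok=True)
    tmp = path + ".tmp"
    torch.save(state, tmp)
    os.replace(tmp, path)  # rename is atomic on POSIX; readers never see a partial file
```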
VRAMDriftCallback dataclass ¶
Bases: CallbackBase
Warn when free VRAM shrinks past threshold between epochs.
The node-budget probe captures free VRAM once at build time. Over a
long run the actual free pool drifts — co-resident CUDA processes on
shared nodes, activation checkpoint leaks in PyG, growing OTel
exporter caches. We capture a baseline at on_fit_start and
compare at each epoch boundary. Epoch boundaries deliberately avoid
transient allocations (teacher params are moved on/off GPU per-step
in KD; checking between epochs catches persistent leaks only).
Log-and-warn: re-probing mid-run would race optimizer state, so the
researcher decides whether to abort.
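A hedged sketch of the baseline-and-compare logic; the threshold field name, the epoch-end hook, and the use of `torch.cuda.mem_get_info` are assumptions, while the baseline at `on_fit_start`, the epoch-boundary check, and the one-shot warn-only behaviour come from this page.

```python
# Sketch only: drift_threshold_gib and mem_get_info are assumed details.
import warnings
from dataclasses import dataclass

import torch

from graphids.core.callbacks import CallbackBase


@dataclass
class VRAMDriftSketch(CallbackBase):
    drift_threshold_gib: float = 2.0  # assumed field name
    _baseline_free: int = 0
    _warned: bool = False

    def on_fit_start(self, trainer, model) -> None:
        if torch.cuda.is_available():
            self._baseline_free, _ = torch.cuda.mem_get_info()

    def on_train_epoch_end(self, trainer, model) -> None:
        if self._warned or not torch.cuda.is_available():
            return
        free, _ = torch.cuda.mem_get_info()
        drift_gib = (self._baseline_free - free) / 1024**3
        if drift_gib > self.drift_threshold_gib:
            warnings.warn(f"Free VRAM shrank by {drift_gib:.1f} GiB since on_fit_start")
            self._warned = True  # one-shot: the researcher decides whether to abort
```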