Core: Callbacks

Pure-Python callback protocol (CallbackBase) with concrete no-op defaults so subclasses override only what they need. The three first-party callbacks:

  • ModelCheckpoint — atomic best + last checkpoint persistence, owns the checkpoints/ subdir convention
  • EarlyStopping — flips trainer.should_stop at the epoch boundary
  • VRAMDriftCallback — one-shot warn when free VRAM drifts past threshold (co-resident process / activation-leak detector)

DGI's OCGIN centroid is fit fresh at Trainer.test start (not persisted in state_dict), so it needs no callback.
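
Constructing the three first-party callbacks above with their documented defaults looks like the sketch below; how they are handed to the trainer is left as a comment, since the Trainer constructor is not documented on this page.

```python
from graphids.core.callbacks import EarlyStopping, ModelCheckpoint, VRAMDriftCallback

callbacks = [
    ModelCheckpoint(monitor="val_loss", mode="min", save_last=True),
    EarlyStopping(monitor="val_loss", mode="min", patience=100),
    VRAMDriftCallback(threshold=0.2),
]
# These would then be passed to the trainer's callback list; the exact
# constructor parameter name is not documented here.
```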

graphids.core.callbacks

callbacks

Training callback and logger protocols — pure-Python replacements for Lightning.

Callback lifecycle mirrors Lightning's hook names so existing OTel and curriculum callbacks need minimal changes.

CallbackBase

Concrete base with no-op defaults so subclasses only override what they need.
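
A minimal sketch of the base, assuming Lightning-style hook names; the exact hook set and signatures in graphids.core.callbacks are not listed on this page, so the four shown here are assumptions.

```python
class CallbackBase:
    """Concrete no-op defaults: subclasses override only the hooks they need.

    Hook names mirror Lightning's; the exact set and signatures below are assumptions.
    """

    def on_fit_start(self, trainer, module) -> None:
        pass

    def on_train_epoch_end(self, trainer, module) -> None:
        pass

    def on_validation_epoch_end(self, trainer, module, metrics: dict) -> None:
        pass

    def on_fit_end(self, trainer, module) -> None:
        pass
```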

EarlyStopping dataclass

EarlyStopping(monitor: str = 'val_loss', mode: str = 'min', patience: int = 100)

Bases: CallbackBase

Stop training when monitored metric stops improving.

Flips trainer.should_stop at the epoch boundary — it doesn't raise. The fit loop observes the flag after the scheduler step, so the current epoch's metrics are logged before exit.
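
A minimal sketch of the stopping check, assuming the validation hook receives the epoch's metrics dict and that best/wait counters are internal state; those names are assumptions, only the documented fields and the should_stop flag come from this page.

```python
import math
from dataclasses import dataclass, field

from graphids.core.callbacks import CallbackBase


@dataclass
class EarlyStoppingSketch(CallbackBase):
    monitor: str = "val_loss"
    mode: str = "min"
    patience: int = 100
    # Internal state; these field names are assumptions, not documented API.
    best: float = field(init=False)
    wait: int = field(default=0, init=False)

    def __post_init__(self) -> None:
        self.best = math.inf if self.mode == "min" else -math.inf

    def on_validation_epoch_end(self, trainer, module, metrics: dict) -> None:
        current = metrics.get(self.monitor)
        if current is None:
            return  # monitored metric not produced this epoch
        improved = current < self.best if self.mode == "min" else current > self.best
        if improved:
            self.best, self.wait = current, 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                # Flag only; the fit loop checks this after the scheduler step,
                # so the current epoch's metrics still get logged before exit.
                trainer.should_stop = True
```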

ModelCheckpoint dataclass

ModelCheckpoint(monitor: str = 'val_loss', mode: str = 'min', save_top_k: int = 1, save_last: bool = True, filename: str = 'best_model', dirpath: str = '', best_model_path: str = '')

Bases: CallbackBase

Save best + last checkpoints based on a monitored metric.

Writes to {trainer.default_root_dir}/checkpoints/ unless an explicit dirpath is set. The checkpoints/ subdir convention is owned here, so neither the jsonnet config nor the instantiator has to wire it from the trainer's run_dir.
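
A sketch of the path resolution and an atomic best + last write. trainer.default_root_dir and the documented fields come from this page; the hook signature, the state_dict payload, and the internal best tracking are assumptions (save_top_k is omitted for brevity).

```python
import math
import os
import tempfile
from dataclasses import dataclass, field

import torch

from graphids.core.callbacks import CallbackBase


def _atomic_save(payload: dict, path: str) -> None:
    # Write to a temp file in the same directory, then os.replace (atomic on
    # POSIX), so a crash mid-write never leaves a truncated checkpoint behind.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path), suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            torch.save(payload, f)
        os.replace(tmp, path)
    finally:
        if os.path.exists(tmp):
            os.unlink(tmp)


@dataclass
class ModelCheckpointSketch(CallbackBase):
    monitor: str = "val_loss"
    mode: str = "min"
    save_last: bool = True
    filename: str = "best_model"
    dirpath: str = ""
    best_model_path: str = ""
    _best: float = field(init=False)  # internal best tracking is an assumption

    def __post_init__(self) -> None:
        self._best = math.inf if self.mode == "min" else -math.inf

    def on_validation_epoch_end(self, trainer, module, metrics: dict) -> None:
        # The checkpoints/ convention is resolved here, not in jsonnet or the
        # instantiator: fall back to {default_root_dir}/checkpoints/.
        root = self.dirpath or os.path.join(trainer.default_root_dir, "checkpoints")
        os.makedirs(root, exist_ok=True)

        current = metrics.get(self.monitor)
        improved = current is not None and (
            current < self._best if self.mode == "min" else current > self._best
        )
        if improved:
            self._best = current
            self.best_model_path = os.path.join(root, f"{self.filename}.ckpt")
            _atomic_save(module.state_dict(), self.best_model_path)
        if self.save_last:
            _atomic_save(module.state_dict(), os.path.join(root, "last.ckpt"))
```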

VRAMDriftCallback dataclass

VRAMDriftCallback(threshold: float = 0.2)

Bases: CallbackBase

Warn when free VRAM shrinks past threshold between epochs.

The node-budget probe captures free VRAM once at build time, but over a long run the actual free pool drifts — co-resident CUDA processes on shared nodes, activation-checkpoint leaks in PyG, growing OTel exporter caches. This callback captures a baseline at on_fit_start and compares against it at each epoch boundary. Epoch boundaries deliberately sidestep transient allocations (teacher params are moved on/off the GPU per step in KD), so the check catches persistent leaks only. It logs and warns rather than acting: re-probing mid-run would race optimizer state, so the researcher decides whether to abort.
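
A sketch of the baseline-and-compare logic. torch.cuda.mem_get_info is used here as the free-VRAM probe and on_train_epoch_end as the epoch-boundary hook; both choices are assumptions, as is the one-shot warning mechanism.

```python
import warnings
from dataclasses import dataclass, field

import torch

from graphids.core.callbacks import CallbackBase


@dataclass
class VRAMDriftSketch(CallbackBase):
    threshold: float = 0.2
    _baseline_free: int = field(default=0, init=False)
    _warned: bool = field(default=False, init=False)

    def on_fit_start(self, trainer, module) -> None:
        # Baseline free VRAM on the current device, captured once per fit.
        if torch.cuda.is_available():
            self._baseline_free, _ = torch.cuda.mem_get_info()

    def on_train_epoch_end(self, trainer, module) -> None:
        # Epoch boundary: per-step transients (e.g. the KD teacher's params)
        # are off the GPU, so drift seen here is persistent, not transient.
        if self._warned or not self._baseline_free:
            return
        free_now, _ = torch.cuda.mem_get_info()
        drift = (self._baseline_free - free_now) / self._baseline_free
        if drift > self.threshold:
            self._warned = True  # one-shot: warn and let the researcher decide
            warnings.warn(
                f"Free VRAM is {drift:.0%} below the on_fit_start baseline; "
                "check for co-resident processes or activation leaks."
            )
```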