Data: Budget¶
VRAM budget probe — sizes max_nodes and max_edges for
NodeBudgetBatchSampler before DataLoader
construction. Two-point linear fit of peak VRAM vs. batch size
isolates the scaling slope (bpn_node) from fixed overhead (cuDNN
workspaces, optimizer state, KD teacher). The slope-only estimate is
safer at high VRAM than the single-point probe it replaced, which
charged small batches with fixed cost and capped packs at ~20% of
actual VRAM.
GPS models use a quadratic probe (peak = α·V² + β·V + γ) to
capture attention's O(V²) blowup.
See .claude/rules/critical-constraints.md for the two-point-probe
invariant and the GRAPHIDS_BUDGET_SAFETY_MARGIN=0.95 rationale.
graphids.core.data.budget¶
budget ¶
VRAM budget → batch size → worker count.
Measures forward/backward memory via the CUDA allocator high-water mark
(torch.cuda.max_memory_allocated) and timing via wall-clock around
torch.cuda.synchronize. The probe is two-point: runs forward +
backward on a small batch and a larger one, then takes the slope
(peak_big - peak_small) / (nodes_big - nodes_small) as bpn_node.
The y-intercept of that line absorbs every fixed cost (model params,
optimizer state, cuDNN workspaces, allocator baseline, KD teacher) —
no resident-subtract heuristic. Single-point estimates systematically
over-charge small batches with the fixed costs, inflating bpn_node by
~3-4× and capping packed batches well below the real hardware limit.
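The over-charge is easy to see with a toy model. A minimal sketch with made-up numbers (`fixed`, `true_bpn`, and both batch sizes are assumptions for illustration, not measurements from the probe):

```python
# Hypothetical numbers: why the single-point probe inflated bpn_node
# while the two-point slope recovers the true per-node cost.
fixed = 3.0e9        # assumed fixed overhead: params, optimizer, cuDNN, KD teacher
true_bpn = 4_000.0   # assumed true bytes per node

def peak(nodes: int) -> float:
    """Idealized allocator high-water mark: fixed cost + linear batch scaling."""
    return fixed + true_bpn * nodes

# Single-point: the whole fixed cost is charged to one small batch.
n_small = 250_000
single_point_bpn = peak(n_small) / n_small   # 16_000 bytes/node, 4x too high

# Two-point: the intercept cancels, leaving only the scaling slope.
n_big = 1_000_000
slope_bpn = (peak(n_big) - peak(n_small)) / (n_big - n_small)  # 4_000 bytes/node
```

With these numbers the single-point estimate is exactly 4× the true slope, which is the same order of inflation the probe replacement was fixing.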
GPS uses a three-point quadratic probe instead (probe_quadratic):
global attention memory scales as O(V²·heads), so a linear fit is wrong.
The quadratic coefficients (α, β, γ) are solved for via least-squares,
and the node budget is the positive root of α·V² + β·V + γ = free·safety.
Edge budget is derived from the empirical edges-per-node ratio on the
probe batches themselves — not a hardcoded multiplier.
Workers sized by ceil((t_io + t_collation) / t_gpu).
BudgetResult dataclass ¶
BudgetResult(budget: int, edge_budget: int | None = None, binding: BudgetBinding = 'measured', backward_multiplier: float | None = None, t_fwd: float = 0.0, target_bytes: int = 0)
Output of node_budget — the sampler's sizing contract.
autosize_workers ¶
autosize_workers(model, dataset, result: BudgetResult, *, default_prefetch: int = 2) -> tuple[int, int]
ceil((t_io + t_collation) / t_gpu) → (num_workers, prefetch_factor).
Worker time has two components: dataset __getitem__ (real I/O) plus
Batch.from_data_list (collation, CPU-bound).
Source code in graphids/core/data/budget.py
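The sizing rule can be sketched in a few lines. This is a hedged re-derivation under stated assumptions, not the library function — the name and the bare timing arguments here are hypothetical:

```python
import math

def autosize_workers_sketch(t_io: float, t_collation: float, t_gpu: float,
                            default_prefetch: int = 2) -> tuple[int, int]:
    """Size workers so CPU-side batch prep keeps pace with the GPU step:
    each batch costs (t_io + t_collation) of worker time per t_gpu consumed."""
    num_workers = math.ceil((t_io + t_collation) / t_gpu)
    return max(1, num_workers), default_prefetch

# e.g. 30 ms of __getitem__ I/O + 20 ms of collation vs a 12 ms GPU step:
workers, prefetch = autosize_workers_sketch(0.030, 0.020, 0.012)
```

When worker time per batch is below one GPU step, the ceiling still yields at least one worker, so the loader never degrades to synchronous loading.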
collect_batch ¶
Collect graphs until reaching target_nodes total. No DataLoader overhead.
Source code in graphids/core/data/budget.py
node_budget ¶
node_budget(dataset: str, *, conv_type: str = 'gatv2', heads: int = 4, model=None, train_dataset=None) -> BudgetResult
Pack budget: free × safety / bpn per dimension.
free from mem_get_info already excludes resident allocation, and
bpn from the probe is purely batch-scaling — so one multiply gives
the max batch that fits without double-counting.
Source code in graphids/core/data/budget.py
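The sizing itself is one multiply and one divide. A sketch with assumed numbers (the function name is hypothetical; the 0.95 default mirrors GRAPHIDS_BUDGET_SAFETY_MARGIN):

```python
def pack_budget_sketch(free_bytes: int, bpn: float, safety: float = 0.95) -> int:
    """free_bytes already excludes resident allocation and bpn is pure
    batch scaling, so one multiply gives the max pack that fits
    with no double-counting of fixed overhead."""
    return int(free_bytes * safety / bpn)

# e.g. 20 GiB free at an assumed 4096 bytes/node -> ~4.98M nodes per pack
node_cap = pack_budget_sketch(20 * 1024**3, 4096.0)
```

The same formula is applied per dimension: once with bpn_node for the node budget and once with bpn_edge for the edge budget.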
probe ¶
Two-point linear fit of VRAM vs. batch size.
Runs a warmup on batch_small (to trigger lazy CUDA init, cuDNN
autotuning, kernel JIT, and the allocator baseline), then full fwd+bwd passes
on both batches. bpn_node is the slope of bwd_peak vs. nodes
across the two points; the intercept (fixed overhead) drops out.
Same for bpn_edge.
Returns (bpn_node, bpn_edge, bwd_mult, t_fwd_seconds). t_fwd
is from the larger batch, which is more representative of real
training steps than the warmup-sized one.
Caller owns batch lifecycles.
Source code in graphids/core/data/budget.py
probe_quadratic ¶
Three-point quadratic fit of peak_bwd = α·V² + β·V + γ for GPS.
Runs warmup on the smallest batch (cuDNN autotuning, allocator
baseline), then full fwd+bwd on every batch and collects (V, peak)
pairs. Fits the quadratic via least-squares in float64 — the intercept
γ absorbs every fixed cost, α isolates the O(V²) attention term, and
β picks up O(V) linear contributions (MPNN branch + feature
projections).
Returns (alpha, beta, gamma, t_fwd_last). Caller solves the quadratic
against the free-VRAM target and handles degenerate fits.
Caller owns batch lifecycles; batches should be ordered ascending by
num_nodes so the smallest is used for warmup.
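The fit-and-solve step the caller performs can be sketched with NumPy as follows. This is a hedged sketch under assumptions — the function name and return shape are hypothetical, and the real module's degenerate-fit handling may differ:

```python
import numpy as np

def fit_and_solve(Vs, peaks, target_bytes):
    """Least-squares fit of peak = a*V^2 + b*V + c in float64, then the
    positive root of a*V^2 + b*V + (c - target_bytes) = 0 as the node budget."""
    V = np.asarray(Vs, dtype=np.float64)
    A = np.stack([V**2, V, np.ones_like(V)], axis=1)
    a, b, c = np.linalg.lstsq(A, np.asarray(peaks, dtype=np.float64), rcond=None)[0]
    disc = b * b - 4.0 * a * (c - target_bytes)
    if a <= 0.0 or disc < 0.0:
        # Non-convex or rootless fit: the caller must fall back.
        raise ValueError("degenerate quadratic fit")
    return a, b, c, (-b + np.sqrt(disc)) / (2.0 * a)

# Synthetic check: peaks generated from a known quadratic recover its root.
Vs = [100, 200, 400]
peaks = [2.0 * v**2 + 100.0 * v + 1e6 for v in Vs]
target = 2.0 * 300**2 + 100.0 * 300 + 1e6   # peak predicted at V = 300
a, b, c, budget = fit_and_solve(Vs, peaks, target)
```

With exactly three probe points the least-squares fit is an exact interpolation, so the recovered budget matches the V at which the synthetic quadratic meets the target.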