Data: Sampler¶
Dual-budget bin-packing sampler: a batch closes when adding a graph
would exceed either the node budget or the edge budget. A single-axis
node-only budget allowed edge-heavy batches to OOM; see
.claude/rules/critical-constraints.md.
Two paths:

- `NodeBudgetBatchSampler`: live sampler, bucket-shuffled, fresh each epoch. Used when `shuffle=True`.
- `pack_offline`: first-fit-decreasing packing used by the prebatch path at setup. ~10-20% tighter than sequential packing; no epoch-to-epoch randomness to preserve.
graphids.core.data.sampler¶
sampler ¶
Node-budget batch sampler for variable-size graphs.
Bin-packing sampler that yields index batches honoring a node budget, and optionally an edge budget as a dual constraint. The dual constraint matters when per-batch memory is dominated by message-passing activations (∝ edges) rather than node features (∝ nodes) — the edge budget prevents rare dense-edge graphs from OOMing even when the node budget would admit them.
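The close-on-either-budget rule can be sketched in a few lines. This is a minimal, self-contained illustration of the greedy dual-budget packing described above, not the module's actual implementation; the function name and plain-list inputs are illustrative (the real sampler works on tensors and handles shuffling and warnings).

```python
def pack_dual_budget(node_counts, edge_counts, max_nodes, max_edges):
    """Greedy sequential packing sketch: close the current batch when
    adding the next graph would exceed EITHER the node or edge budget."""
    batches, cur, n, e = [], [], 0, 0
    for i, (nn, ne) in enumerate(zip(node_counts, edge_counts)):
        if nn > max_nodes or ne > max_edges:
            continue  # graph exceeds a budget on its own: skip it
        if cur and (n + nn > max_nodes or e + ne > max_edges):
            batches.append(cur)          # close the batch before it overflows
            cur, n, e = [], 0, 0
        cur.append(i)
        n += nn
        e += ne
    if cur:
        batches.append(cur)
    return batches
```

Note how a dense-edge graph can close a batch (or be skipped entirely) even when the node budget alone would admit it.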
NodeBudgetBatchSampler ¶
NodeBudgetBatchSampler(sizes: Tensor, max_num: int, *, edge_sizes: Tensor | None = None, max_edges: int | None = None, shuffle: bool = True, num_buckets: int = 20, indices: Tensor | list[int] | None = None)
Bases: Sampler[list[int]]
Bin-packing sampler with optional dual node/edge budget.
- `sizes` / `max_num`: per-graph node counts, max nodes per batch.
- `edge_sizes` / `max_edges` (optional): per-graph edge counts, max edges per batch.

A batch closes when adding a graph would exceed EITHER budget. A graph exceeding either budget on its own is skipped (with a one-line per-epoch summary warning).

Bucket-shuffle keeps batch-to-batch size variance low. `indices`
maps local positions to dataset-global indices (for curriculum subsets).
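The bucket-shuffle idea is: sort indices by size, split into contiguous buckets, shuffle within each bucket, then shuffle the bucket order. Neighboring graphs stay similar in size, so greedy packing produces well-filled batches while the epoch order still varies. A minimal sketch, assuming plain-list inputs (the function name is illustrative, not the class's actual internals):

```python
import random

def bucket_shuffle(sizes, num_buckets=20, seed=0):
    """Bucket-shuffle sketch: sorted-by-size indices are split into
    contiguous buckets; shuffling happens within buckets and across
    bucket order, so adjacent graphs remain similar in size."""
    rng = random.Random(seed)
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])
    bucket_len = max(1, len(order) // num_buckets)
    buckets = [order[i:i + bucket_len] for i in range(0, len(order), bucket_len)]
    for b in buckets:
        rng.shuffle(b)   # randomize within each size bucket
    rng.shuffle(buckets)  # randomize bucket order across the epoch
    return [i for b in buckets for i in b]
```

Re-running with a fresh seed each epoch gives the "fresh each epoch" behavior described above.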
Source code in graphids/core/data/sampler.py
pack_offline ¶
pack_offline(sizes: Tensor, max_num: int, *, edge_sizes: Tensor | None = None, max_edges: int | None = None) -> list[list[int]]
First-fit-decreasing packing for the prebatch path.
The sampler's live packing walks indices sequentially (or bucket-shuffled) and closes each batch greedily: roughly 11/9 × OPT at best, and significantly worse when the dataset order isn't size-sorted. FFD sorts graphs by size descending, then places each into the first batch where it fits. For variable-size graphs this yields ~10-20% better node-budget utilization than sequential packing, with no epoch-to-epoch randomness to preserve.
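A first-fit-decreasing sketch with the same dual budget, for intuition; the function name and list inputs are illustrative rather than the module's real signature:

```python
def pack_ffd(node_counts, edge_counts, max_nodes, max_edges):
    """FFD sketch: sort graphs by node count descending, then place each
    into the FIRST open batch where both budgets still hold; open a new
    batch only when none fits."""
    order = sorted(range(len(node_counts)),
                   key=lambda i: node_counts[i], reverse=True)
    batches = []  # batch index lists
    loads = []    # (node_load, edge_load) per batch
    for i in order:
        nn, ne = node_counts[i], edge_counts[i]
        if nn > max_nodes or ne > max_edges:
            continue  # unpackable graph: skipped, as in the live sampler
        for b, (n, e) in enumerate(loads):
            if n + nn <= max_nodes and e + ne <= max_edges:
                batches[b].append(i)
                loads[b] = (n + nn, e + ne)
                break
        else:
            batches.append([i])
            loads.append((nn, ne))
    return batches
```

For node counts `[70, 50, 30, 30]` under a 100-node budget, FFD packs two full batches where sequential greedy packing needs three, which is where the utilization gain comes from.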
Returns a list of batch index lists (dataset-global indices; no
shuffle). Used by GraphDataModule._prebatch — the class sampler
is still used for live training where shuffle=True re-buckets
per epoch.