# Pre-batch Timing Rationale — hcrl_sa (probe 2026-04-07)
Justifies the choice of prebatched training + workers=0 over per-step collation
with workers. Numbers are from a single probe run; see data-flow.md for current
architecture.
## Dataset and hardware
- hcrl_sa: 19,085 graphs, mean 38.3 nodes, mean 98 edges, ~10.8 KB/graph
- V100 16GB, PCIe 3.0 x16 (~12 GB/s), pin_memory ~20 GB/s
## Probe numbers (optimizer+compile warmup included)
| Model | Budget (nodes) | Graphs/batch | Batch MB | T_collation (old) | T_gpu | H2D (est) | Pin (est) |
|---|---|---|---|---|---|---|---|
| VGAE small | 404,718 | 10,557 | 113.9 | 380.7 ms | 154.9 ms | 9.5 ms | 5.7 ms |
| VGAE large | 343,522 | 8,959 | 96.7 | 327.1 ms | 286.8 ms | 8.1 ms | 4.8 ms |
| GAT small | 233,339 | 6,086 | 65.7 | 218.5 ms | 261.5 ms | 5.5 ms | 3.3 ms |
| GAT large | 62,439 | 1,629 | 17.6 | 58.9 ms | 121.6 ms | 1.5 ms | 0.9 ms |
T_collation = Batch.from_data_list() cost on the old per-step path only.
H2D = batch_MB / 12 GB/s. Pin = batch_MB / 20 GB/s.
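The two estimate columns are pure unit arithmetic; conveniently, MB divided by GB/s comes out directly in milliseconds. A minimal sketch:

```python
def transfer_ms(batch_mb: float, bandwidth_gb_s: float) -> float:
    """Estimated copy time: MB / (GB/s) yields milliseconds directly."""
    return batch_mb / bandwidth_gb_s

# VGAE small, 113.9 MB batch:
print(transfer_ms(113.9, 12.0))  # H2D over PCIe 3.0 x16 -> ~9.5 ms
print(transfer_ms(113.9, 20.0))  # pinning at ~20 GB/s   -> ~5.7 ms
```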
## Why prebatch wins
Old path (per-step collation, 3 workers): T_step = max(T_collation/workers, T_gpu) + T_H2D.
For VGAE small: max(381/3, 155) + 10 = max(127, 155) + 10 = 165 ms/step, GPU util ~60-83%.
Workers are needed on this path because T_collation (381 ms) ≫ T_gpu (155 ms).
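A sketch of the old-path model with the VGAE-small probe numbers plugged in:

```python
def old_step_ms(t_collation: float, t_gpu: float, t_h2d: float, workers: int) -> float:
    """Old per-step path: collation is amortized across workers; the slower
    of (amortized collation, GPU compute) gates the step, then H2D is paid."""
    return max(t_collation / workers, t_gpu) + t_h2d

print(old_step_ms(380.7, 154.9, 9.5, workers=3))  # -> ~164 ms/step (the ~165 above)
```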
New path (prebatch at setup, workers=0): _prebatched_train_dataloader calls
NodeBudgetBatchSampler once at setup to plan batches, then Batch.from_data_list()
once per batch (not per step, not per epoch). The training loop iterates over a
list[Batch] with num_workers=0 wrapped by PrefetchLoader for async H2D.
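A minimal sketch of that path, assuming NodeBudgetBatchSampler yields lists of dataset indices; the real helpers live in graphids/core/data/datamodule/graph.py and additionally clone each batch per step (see the load-bearing notes below).

```python
from torch.utils.data import DataLoader
from torch_geometric.data import Batch
from torch_geometric.loader import PrefetchLoader

def prebatched_train_dataloader(dataset, batch_sampler, device):
    # Pay Batch.from_data_list() once per batch at setup time...
    batches = [Batch.from_data_list([dataset[i] for i in idxs])
               for idxs in batch_sampler]
    # ...then iterate the prebuilt list. batch_size=None disables collation,
    # so each fetch is an O(1) list lookup; workers would only add IPC cost.
    loader = DataLoader(batches, batch_size=None, shuffle=True, num_workers=0)
    # PrefetchLoader pins and copies the next batch H2D asynchronously on its
    # own CUDA stream while the GPU computes the current step.
    return PrefetchLoader(loader, device=device)
```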
Per-step CPU work: pin (5.7 ms) + H2D (9.5 ms) = 15.2 ms. GPU step = 154.9 ms. The CPU finishes preparing the next batch in 15.2 ms, then waits ~140 ms; the GPU never idles.
Worst case across all model/scale combos: VGAE small at 15.2/154.9 = 9.8% overhead.
| | OLD (3 workers) | NEW (prebatched) |
|---|---|---|
| T_step | ~165 ms | ~155 ms |
| GPU util | ~60-83% | ~100% |
| CPUs | 5 | 1-2 |
Workers add IPC serialization overhead (~2-5 ms) for zero benefit when each
`__getitem__` is an O(1) list lookup. See graphids/core/data/datamodule/graph.py:328.
## Generalization and limits — the probe is dataset- and scale-conditional
The "~100% GPU util" line above is true for the regime it was measured in (hcrl_sa, ~400k-node budgets, V100). It does not generalize to all combinations. A counter-measurement from set_01 makes the pattern explicit.
### set_01, VGAE small, V100 (job 47045030, 2026-04-22)
| Metric | Value | Source |
|---|---|---|
| GPU 0 utilization — mean | 27.4% | MLflow system/gpu_0_utilization_percentage, 5s sampling, 643 points |
| GPU 0 utilization — median | 16.0% | same |
| Fraction of samples < 80% | 86.0% | same |
| GPU VRAM usage — median | 93.3% | system/gpu_0_memory_usage_percentage |
| Wall-clock, 197 epochs | 54:31 | sacct |
VRAM is packed correctly — the two-point probe is sized right. But GPU compute idles ~86% of the time.
### Why the probe doesn't predict this
Per-step util obeys GPU_util ≈ T_gpu / (T_gpu + T_cpu_step). Both
terms scale with the workload:
| Term | hcrl_sa VGAE-small (Apr 7) | set_01 VGAE-small (Apr 22) | Scaling |
|---|---|---|---|
| T_gpu | ~155 ms | ~5 ms (est.) | ∝ batch_nodes × model_FLOPs |
| T_cpu_step | ~15 ms (pin+H2D) | ~20-30 ms (clone+dispatch) | relatively flat |
Under the April probe, T_gpu ≫ T_cpu — prebatch fully hides the CPU step. Under a smaller model or tighter node budget, T_gpu shrinks faster than T_cpu does, and util drops mechanically even with the same pipeline. This is a workload characteristic, not a regression.
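The same two-term model with the table's values plugged in (a sketch; the set_01 inputs are the estimates above, not measurements):

```python
def gpu_util(t_gpu_ms: float, t_cpu_ms: float) -> float:
    return t_gpu_ms / (t_gpu_ms + t_cpu_ms)

# hcrl_sa VGAE-small: ~91% floor; prefetch overlap lifts the measured ~100%.
print(f"{gpu_util(155, 15):.0%}")
# set_01 VGAE-small: ~17%, in line with the measured 16% median / 27% mean.
print(f"{gpu_util(5, 25):.0%}")
```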
## What's load-bearing in the pipeline (don't "fix" these)
- `pin_memory=device is None` at `_spawn_loader` (graph.py:56), and the absence of `pin_memory` from `_prebatched_loader` (graph.py:71-81), are deliberate. PyG's `PrefetchLoader` (called from `_prefetch`, graph.py:23-27) owns pinning + async H2D on its own CUDA stream; enabling `pin_memory=True` at the DataLoader level would double-pin.
- `_clone_collate` at graph.py:38-40 runs every step because PyG `Data.to(device)` is in-place (see critical-constraints.md) and the pre-built batches are shared state. This is the last main-process CPU cost in the prebatch path.
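A minimal sketch of the `_clone_collate` idea (the real body at graph.py:38-40 may differ):

```python
from torch_geometric.data import Batch

def clone_collate(batch: Batch) -> Batch:
    # Hand each step a private deep copy so the in-place Batch.to(device)
    # cannot leave CUDA tensors inside the shared prebuilt batch list.
    return batch.clone()
```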
## Configuration guidance
- Smoke runs (`--smoke` → gpudebug, 1h wall). Expect low GPU util on small models / small datasets. Measure correctness (no NaN, no IndexError, loss curve shape), not throughput.
- Production fits. Use `--length long` (gpu partition, 4h+) and `scale="medium"` or `"large"` so T_gpu dominates. For set_01, move to `--cluster cardinal` (H100): the compute headroom lets you push node budgets high enough to saturate. The April hcrl_sa probe numbers are representative of this regime.
- Don't chase GPU util on smoke runs. A compute-tiny model on a V100 will show 20-40% util no matter how well the data pipeline is tuned; that's hardware + model, not pipeline.
## Diagnostic hierarchy
- MLflow `system/gpu_*` metrics (5s sampler, already on): aggregate util, memory, power. Queryable across runs; see the sketch below.
- OTel `ml.batch.duration_s` in `traces.jsonl`: per-step wall time.
- Explicit `nvidia-smi dmon -s u` during a live run: real-time NVML counters.
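A sketch of the query behind the set_01 utilization table (run_id is a placeholder):

```python
from statistics import mean, median
from mlflow.tracking import MlflowClient

run_id = "<run-id>"  # placeholder for the run under inspection
hist = MlflowClient().get_metric_history(
    run_id, "system/gpu_0_utilization_percentage")
vals = [m.value for m in hist]
print(f"mean {mean(vals):.1f}%  median {median(vals):.1f}%  "
      f"frac<80 {sum(v < 80 for v in vals) / len(vals):.1%}")
```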
Never reason about throughput from wall-clock alone — matches the
feedback_device_metrics_first lesson.