Data: Vocab¶
Shared-vocabulary construction for datasets with categorical node IDs
(CAN arbitration IDs, sensor names). Every split — train, val, every
test subdir — uses the same arb_id → index map so an embedding
table sized for train doesn't overflow when a test subdir contains
attack-injected IDs. Index 0 is reserved for UNK; a SHA256 digest over
the (id, index) pairs is the cache invariant.
Stage 1 of the OOV handling plan (~/plans/oov-embedding-handling.md).
graphids.core.data.vocab¶
vocab ¶
Shared-vocabulary construction for datasets with categorical node IDs.
Every GraphIDS dataset that uses nn.Embedding(num_ids, ...) over a
per-node categorical identity (CAN arbitration IDs, sensor names, etc.)
MUST build its vocab once across all splits (train + val + every test
subdir) and pass the result to every split at construction time.
Rationale: a per-split vocab drifts the index → physical-id mapping across splits and leaves the model's embedding table under-sized at test time, because test subdirs can contain attack-injected IDs absent from train. Index 0 is reserved for UNK (out-of-vocabulary); real IDs start at 1.
Research basis: ~/plans/oov-embedding-handling.md (Stage 1).
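The contract above can be sketched in a few lines. This is a hypothetical illustration, not the module's actual implementation: `build_vocab` is an assumed helper name, and the sample IDs are made up. The essential points it shows are that IDs are gathered across all splits before indexing, and that index 0 stays reserved for UNK.

```python
UNK_INDEX = 0  # reserved for out-of-vocabulary IDs

def build_vocab(ids):
    # Map sorted unique IDs to indices starting at 1;
    # index 0 stays free for UNK, per the module contract.
    return {arb_id: i for i, arb_id in enumerate(sorted(set(ids)), start=1)}

# IDs gathered across train + val + every test subdir, not per split.
all_ids = [0x123, 0x2C0, 0x123, 0x4B1]
vocab = build_vocab(all_ids)

# The embedding table is sized once, from the shared vocab.
num_embeddings = len(vocab) + 1  # +1 for the UNK slot at index 0
```

Because every split receives this same map, an attack-injected ID seen only at test time falls through to UNK instead of indexing past the end of the embedding table.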
load_vocab ¶
Read a persisted vocab; return (entries, digest).
Keys are stringified at persist time (JSON constraint), so reloaded
entries is always str → int even if the original in-memory
vocab was int → int. Callers that pipe the result into polars
replace_strict against a numeric column must cast keys back to
int first; otherwise the match silently fails and every row routes to UNK.
Source code in graphids/core/data/vocab.py
persist_vocab ¶
Atomically write vocab as JSON under path; return its digest.
Source code in graphids/core/data/vocab.py
scan_arb_ids ¶
Return sorted unique arb_id values across every CSV under every source_dir.
Tolerates both the HCRL arbitration_id and the in-schema
arb_id column names. Only the id column is materialized.
Source code in graphids/core/data/vocab.py
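A stdlib sketch of the scan (the real function reads the id column lazily via polars; here `csv.DictReader` stands in, and `scan_arb_ids_sketch` is a hypothetical name). It accepts either column name and returns sorted unique values:

```python
import csv
import os

ID_COLUMNS = ("arb_id", "arbitration_id")  # in-schema name, HCRL name

def scan_arb_ids_sketch(source_dirs):
    ids = set()
    for d in source_dirs:
        for root, _dirs, files in os.walk(d):
            for name in files:
                if not name.endswith(".csv"):
                    continue
                with open(os.path.join(root, name), newline="") as f:
                    reader = csv.DictReader(f)
                    # Pick whichever id column this CSV uses.
                    col = next(
                        (c for c in ID_COLUMNS if c in (reader.fieldnames or [])),
                        None,
                    )
                    if col is None:
                        continue
                    ids.update(row[col] for row in reader)
    return sorted(ids)
```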
vocab_digest ¶
Stable SHA256 digest over the vocab's (id, index) pairs.
Used as a cache invariant — any vocab change forces rebuild. Sorted by index so the digest is insensitive to dict iteration order but sensitive to any (id, index) change.
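The sorting-by-index property can be sketched as follows (hypothetical helper name; the module's canonical serialization may differ):

```python
import hashlib
import json

def vocab_digest_sketch(vocab):
    # Sort pairs by index so dict iteration order never changes the
    # digest, while any change to an (id, index) pair does.
    pairs = sorted(vocab.items(), key=lambda kv: kv[1])
    return hashlib.sha256(json.dumps(pairs).encode()).hexdigest()
```

Two vocabs built in different insertion orders hash identically; remapping a single index yields a new digest and therefore forces a cache rebuild.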