Data Architecture
This is the current GraphIDS data layout after the preprocessing refactor.
1. Raw storage
Source of truth: immutable CAN/CPS rows.
Typical fields:
- vehicle_id
- timestamp
- arb_id
- payload
- attack
- attack_type
- provenance fields from the source
Code surface:
- graphids/core/data/datasets/can_bus.py
- graphids/core/data/datasets/_base.py
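As a minimal sketch of what "immutable rows" can look like in practice, the raw fields listed in this section could be modeled as a frozen dataclass. The class name `RawCanRow`, the field types, the timestamp unit, and the single `source_file` provenance field are all assumptions made for illustration; they are not taken from `can_bus.py` or `_base.py`.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Optional


@dataclass(frozen=True)
class RawCanRow:
    """Hypothetical immutable raw row; field names follow the list above."""
    vehicle_id: str
    timestamp: float                    # assumed to be seconds
    arb_id: int                         # CAN arbitration ID
    payload: bytes                      # raw data bytes, left undecoded
    attack: bool                        # ground-truth label
    attack_type: Optional[str] = None   # None for benign traffic (assumption)
    source_file: Optional[str] = None   # stand-in for the provenance fields


# Illustrative values only, not real capture data.
row = RawCanRow("veh_01", 12.5, 0x1A0, bytes.fromhex("00ff10"), False)

try:
    row.attack = True   # frozen dataclasses reject attribute assignment
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Freezing the record means every downstream view has to be derived as a new object, which matches the idea of raw storage as the source of truth.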
2. Representations
The primary public representation kinds are:
- snapshot
- snapshot_sequence
- multi_scale
- temporal
- entity
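One way to keep these kind names from drifting is to validate them in a single place. The sketch below is illustrative only: `RepresentationKind` and `parse_kind` are hypothetical names, and the only thing taken from this page is the five kind strings.

```python
from enum import Enum


class RepresentationKind(str, Enum):
    """The five public kinds named above; the enum itself is hypothetical."""
    SNAPSHOT = "snapshot"
    SNAPSHOT_SEQUENCE = "snapshot_sequence"
    MULTI_SCALE = "multi_scale"
    TEMPORAL = "temporal"
    ENTITY = "entity"


def parse_kind(name: str) -> RepresentationKind:
    """Validate a user-supplied kind string and list the allowed values on failure."""
    try:
        return RepresentationKind(name)
    except ValueError:
        allowed = ", ".join(k.value for k in RepresentationKind)
        raise ValueError(
            f"unknown representation kind {name!r}; expected one of: {allowed}"
        ) from None


kind = parse_kind("multi_scale")
```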
Representation configs live in:
graphids/core/data/preprocessing/representations.py
They bridge to:
- view configs
- segment configs
- temporal stream specs
3. Materialized views
Training-facing materializations are derived from raw storage through the selected representation.
Examples:
- snapshot graphs
- snapshot sequences
- multi-scale views
- temporal streams
- entity-centric views
Code surface:
- graphids/core/data/preprocessing/views.py
- graphids/core/data/preprocessing/segments.py
- graphids/core/data/preprocessing/materialization.py
- graphids/core/data/preprocessing/pyg.py
- graphids/core/data/preprocessing/temporal.py
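The idea that the selected representation decides which view gets built can be sketched as a small registry keyed by representation kind. Everything here is an assumption for illustration: the `register` decorator, the `materialize` function, and the placeholder bodies are not the contents of `materialization.py`, and real materializers would build graphs or streams rather than plain lists.

```python
from typing import Any, Callable, Dict, Sequence

Row = Dict[str, Any]                          # stand-in for one raw row
Materializer = Callable[[Sequence[Row]], list]

_MATERIALIZERS: Dict[str, Materializer] = {}


def register(kind: str) -> Callable[[Materializer], Materializer]:
    """Register a materializer under a representation kind."""
    def decorator(fn: Materializer) -> Materializer:
        _MATERIALIZERS[kind] = fn
        return fn
    return decorator


@register("snapshot")
def snapshot_view(rows: Sequence[Row]) -> list:
    # Placeholder: a real implementation would build one graph per window.
    return [list(rows)]


@register("temporal")
def temporal_view(rows: Sequence[Row]) -> list:
    # Placeholder: a single stream ordered by timestamp.
    return sorted(rows, key=lambda r: r["timestamp"])


def materialize(kind: str, rows: Sequence[Row]) -> list:
    """Derive the training-facing view for the selected representation."""
    fn = _MATERIALIZERS.get(kind)
    if fn is None:
        raise KeyError(f"no materializer registered for {kind!r}")
    return fn(rows)


# Illustrative rows only.
rows = [{"timestamp": 2.0, "arb_id": 0x1A0}, {"timestamp": 1.0, "arb_id": 0x2B0}]
stream = materialize("temporal", rows)
```

Both placeholders return new lists, so the raw rows are never reordered in place.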
4. Discovery and hypotheses
This layer stores signal profiles and provisional canonical mappings. It is where hidden-DBC cross-vehicle alignment lives.
Typical records:
- raw signal profile tables
- canonical hypotheses
- confidence
- evidence
- provenance
Code surface:
- graphids/core/data/discovery/hypotheses.py
- graphids/core/data/discovery/canonical.py
- graphids/core/data/discovery/layout.py
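To make the record types above concrete, a provisional mapping could be stored as one row per hypothesis, with the highest-confidence hypothesis chosen per raw signal. This is a sketch under stated assumptions: `CanonicalHypothesis`, `signal_slot`, `best_per_signal`, and the [0, 1] confidence range are hypothetical and are not taken from `hypotheses.py` or `canonical.py`.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

SignalKey = Tuple[str, int, str]   # (vehicle_id, arb_id, signal_slot)


@dataclass(frozen=True)
class CanonicalHypothesis:
    """Hypothetical record: one provisional raw-signal-to-canonical mapping."""
    vehicle_id: str
    arb_id: int
    signal_slot: str                 # assumed label for a byte or bit range
    canonical_name: str              # proposed cross-vehicle signal name
    confidence: float                # assumed to lie in [0, 1]
    evidence: Tuple[str, ...] = ()   # short notes supporting the mapping
    provenance: str = ""             # which discovery run produced it


def best_per_signal(
    hypotheses: Iterable[CanonicalHypothesis],
) -> Dict[SignalKey, CanonicalHypothesis]:
    """Keep the highest-confidence hypothesis for each raw signal."""
    best: Dict[SignalKey, CanonicalHypothesis] = {}
    for h in hypotheses:
        key = (h.vehicle_id, h.arb_id, h.signal_slot)
        if key not in best or h.confidence > best[key].confidence:
            best[key] = h
    return best


# Illustrative values only, not real discovery output.
hyps = [
    CanonicalHypothesis("veh_01", 0x1A0, "bytes0-1", "wheel_speed", 0.62),
    CanonicalHypothesis("veh_01", 0x1A0, "bytes0-1", "engine_rpm", 0.81),
    CanonicalHypothesis("veh_02", 0x2B0, "bytes2-3", "engine_rpm", 0.74),
]
chosen = best_per_signal(hyps)
```

Keeping the losing hypotheses in the table, rather than overwriting them, preserves the provisional nature of the mapping.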
5. Selection rule
The primary user-facing control surface is now:
representation_cfg
Window sizes and strides are resolved from the representation config at the pipeline boundary, which derives an explicit segment config before materialization.
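A minimal sketch of that resolution step, assuming the representation config carries `window_size` and `stride` directly. The class names, the `resolve_segment_config` and `window_bounds` helpers, and the choice to drop a trailing partial window are all assumptions for illustration, not the behavior of `segments.py`.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass(frozen=True)
class RepresentationConfig:
    """Hypothetical user-facing config."""
    kind: str
    window_size: int
    stride: int


@dataclass(frozen=True)
class SegmentConfig:
    """Hypothetical explicit config derived at the pipeline boundary."""
    window_size: int
    stride: int


def resolve_segment_config(rep: RepresentationConfig) -> SegmentConfig:
    """Derive the segment config once, before any materialization runs."""
    if rep.window_size <= 0 or rep.stride <= 0:
        raise ValueError("window_size and stride must be positive")
    return SegmentConfig(window_size=rep.window_size, stride=rep.stride)


def window_bounds(n_rows: int, seg: SegmentConfig) -> Iterator[Tuple[int, int]]:
    """Yield [start, end) row ranges; a trailing partial window is dropped."""
    for start in range(0, n_rows - seg.window_size + 1, seg.stride):
        yield start, start + seg.window_size


rep_cfg = RepresentationConfig("snapshot", window_size=4, stride=2)
seg_cfg = resolve_segment_config(rep_cfg)
bounds = list(window_bounds(10, seg_cfg))
# bounds == [(0, 4), (2, 6), (4, 8), (6, 10)]
```

Resolving once at the boundary means downstream code only ever sees the explicit segment config and never needs to re-read the representation config.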
6. Training flow
Recommended read order:
- raw storage
- representation selection
- materialized views
- hypothesis annotations
The training path should consume the materialized view that matches the selected representation. The discovery path writes the signal profile and hypothesis tables alongside the cache.