Adaptive Fusion of Graph-Based Ensembles for Automotive Intrusion Detection

Controller Area Network (CAN) intrusion detection in modern vehicles must operate under diverse attack types, severe class imbalance, and strict hardware constraints. We propose a multi-stage, multi-expert ensemble graph framework that models CAN traffic using structural, temporal, and distributional cues. A Variational Graph Autoencoder (VGAE) learns normal graph structure and guides targeted training of a Graph Attention Network (GAT) classifier. A bandit-based policy adaptively fuses experts per sample. To enable deployment, a resource-aware intelligent knowledge-distillation (KD) pipeline compresses the ensemble into lightweight students, while curriculum imbalance training enhances rare-attack detection.

This paper presents a three-stage framework for robust intrusion detection on the Controller Area Network (CAN) bus. Subsequent sections detail the background, methodology, experimental setup, results, and ablation studies that validate our approach across six publicly available CAN intrusion datasets.

Introduction

Modern vehicles rely on networks of electronic control units (ECUs) to manage everything from engine functions to advanced driver assistance systems (ADAS). Communication between ECUs is typically handled by the Controller Area Network (CAN) protocol, valued for its reliability and cost-effectiveness in in-vehicle networks (IVNs). However, CAN lacks built-in security mechanisms like encryption and authentication, as it was designed under the assumption of a closed, isolated network. With the introduction of on-board diagnostics (OBD) ports and wireless connectivity (e.g., Wi-Fi, cellular, V2X), access to the CAN bus has expanded significantly, opening new attack surfaces. Attacks may now originate from both physical interfaces (OBD-II, USB) and remote channels (Bluetooth, mobile networks), allowing adversaries to inject malicious messages and potentially disrupt or take control of safety-critical vehicle systems.

To counter these threats, intrusion detection systems (IDS) for CAN have become an area of active research. Traditional IDS approaches fall into two main categories: packet-based and window-based methods. Packet-based IDSs analyze individual CAN messages for quick detection, but cannot capture context or correlations across packets, limiting their effectiveness against complex attacks such as spoofing or replay. Window-based IDSs consider sequences of packets, enabling better detection of such attack patterns, but often face challenges with detection delays and performance under low-volume or replay attacks. Recent efforts address these limitations with statistical approaches using graph models, advanced machine learning techniques such as deep convolutional neural networks (DCNNs), and lightweight classifiers. Other studies leverage temporal or dynamic graph features for high-accuracy detection of diverse attack types. Despite strong results—for example, graph neural network (GNN) and variational autoencoder (VAE)-based systems achieving over 97% accuracy—key challenges remain that prevent real-world deployment.

Motivation: The Deployment Gap

CAN intrusion detection reveals a fundamental tension in adversarial learning: high accuracy on known attack types often correlates with brittle generalization to diverse, imbalanced, and resource-constrained settings. We identify three core challenges that motivate our work:

Challenge 1: No Single Model Captures All Attack Patterns. Different attacks exploit distinct vulnerabilities requiring different detection mechanisms. Structural anomalies (e.g., message flooding) require relational awareness, where graph-based approaches excel, but can miss isolated point anomalies. Distributional anomalies (e.g., signal spoofing) require learning normal signal distributions, where autoencoders succeed, but struggle with coordinated attacks. Moreover, CAN traffic is heavily class-imbalanced, with malicious frames occurring rarely (ratios of 36:1 to 927:1 across datasets), leading to biased models and poorly calibrated predictions. Single models cannot overcome this without excessive overfitting; heterogeneous ensembles with complementary inductive biases naturally handle rare events better.

Challenge 2: Models Must Fit on Embedded Devices. Automotive gateways operate under strict resource constraints: typically ARM Cortex-A7/A53 processors with 256–512 MB RAM, power budgets of ${\sim}100$ mW allocated to IDS, and latency requirements of 50–100ms for real-time response. Academic research operates at GPU scale with models exceeding millions of parameters, but practical deployment requires architectures orders of magnitude smaller. This resource-efficiency challenge is often treated as secondary in the research literature, but represents a critical barrier to real-world adoption.

Challenge 3: Black-Box Models Reduce Trust and Adoption. Highly accurate models face systematic rejection in safety-critical systems because operators cannot understand or verify decisions. ISO 26262 automotive functional safety mandates verification and validation of safety-critical functions, where IDS functions typically receive ASIL C–D classification. Black-box AI models alone cannot satisfy this requirement. Beyond regulation, industry adoption faces a trust paradox: organizations systematically choose less accurate but interpretable models over superior black-box alternatives.

These three challenges are often addressed independently. This work takes the position that these challenges are interdependent: an ensemble that adaptively fuses complementary experts can be more robust (through diverse inductive biases), more efficient (through knowledge distillation scaled to hardware constraints), and more interpretable (through learned weighting patterns and component-level analysis) than a single monolithic model.

Technical Approach

To address these challenges, we propose a multi-stage graph neural network (GNN)-based framework that combines a Variational Graph Autoencoder (VGAE) for unsupervised anomaly detection with a Graph Attention Network (GAT) for supervised attack classification. A Deep Q-Network (DQN) learns to adaptively weight these experts on a per-sample basis, selecting the most informative representation for each message context. The ensemble is distilled into a lightweight student model suitable for embedded deployment via knowledge distillation, while a curriculum learning training strategy improves robustness under severe class imbalance.

Key design decisions reflect this framing:

Complementary Experts: VGAE excels at detecting structural deviations and out-of-distribution anomalies (robustness to unknown attacks), while GAT excels at learning message-level relationships and fine-grained classification (high accuracy on known attacks). Their combination mitigates the single-model brittleness problem.
Sample-Specific Fusion: Rather than fixed static fusion (e.g., averaging), the DQN learns when each expert is most reliable. This adaptive weighting improves accuracy on imbalanced datasets and provides interpretability: the learned policy reveals which expert dominates for each attack type, enabling operators to understand model behavior.
Hardware-Aware Knowledge Distillation: The ensemble is distilled into a student model using logit-level and latent-space KD, achieving a ${\sim}20\times$ parameter reduction (designed from automotive hardware constraints) while retaining detection performance. This principled compression bridges the gap between high-accuracy models and resource-constrained automotive gateways.
Curriculum Learning for Imbalance: Progressive curriculum transitions from balanced to imbalanced sampling, improving minority-class recall without sacrificing overall performance—critical for rare-attack detection in practice.

Contributions

The main contributions of this research are as follows:

Robust Multi-Expert Ensemble: We propose a two-stage framework combining VGAE and GAT with complementary strengths. VGAE performs unsupervised representation learning and anomaly scoring, while GAT refines attack classification. This combination demonstrates superior performance on class-imbalanced datasets compared to single-model or simple averaging approaches.
Adaptive Decision-Level Fusion via DQN: Unlike static fusion strategies, we introduce a DQN-based policy that learns sample-specific weights for VGAE and GAT, enabling graceful degradation and principled model selection. The learned policy provides interpretability through visualization of weighting patterns across attack types and model inputs.
Hardware-Aware Knowledge Distillation: We develop a resource-aware KD pipeline scaled to automotive hardware constraints (ARM Cortex-A7/A53, 256–512MB RAM, 100mW power budget), achieving ${\sim}20\times$ parameter reduction while retaining strong detection performance. This principled approach to model compression bridges the research-to-practice deployment gap.
Curriculum Learning for Class Imbalance: We design a curriculum that progressively increases class imbalance during training, improving recall on minority attack classes without sacrificing overall accuracy. Experiments demonstrate particular gains on highly imbalanced datasets (927:1 benign-to-attack ratios).
Comprehensive Cross-Dataset Evaluation: We conduct extensive experiments on six publicly available CAN intrusion datasets, including the newly released can-train-and-test benchmark. Our results demonstrate consistent improvements over prior graph-based methods and strong generalization across diverse vehicle platforms and attack types .

Intrusion detection systems (IDS) for in-vehicle CAN networks can be classified by detection scope, data type, and underlying detection paradigm. We organize prior work along these dimensions to highlight the unique contributions of our approach.

CAN IDS by Detection Scope

Packet-Based Approaches. Packet-based IDSs analyze individual CAN frames for fast, lightweight detection, but cannot capture dependencies across messages, limiting effectiveness against sophisticated attacks such as spoofing or replay. For example, applied deep neural networks to individual CAN messages using simulated data, and exploited traffic periodicity using Bloom filters. However, these methods are ineffective for aperiodic frames , which are common in real-world CAN buses.

Window-Based Approaches. Window-based IDSs analyze sequences of CAN frames, enabling better temporal correlation analysis. developed timing models for real-time detection without relying on protocol specifications, but still struggled with aperiodic messages and repeated CAN IDs. Frequency and Hamming distance-based methods are similarly limited against aperiodic attacks . combined graph features with statistical tests for anomaly and replay detection, but at the cost of increased latency due to requiring larger message batches.

Graph-Based Approaches. Graph-based IDSs better capture ECU communication patterns by modeling message relationships. proposed G-IDCS, which combines an interpretable threshold-based stage with a learnable classifier leveraging message correlation. applied graph attention networks directly to CAN message graphs, demonstrating GAT’s effectiveness for capturing ECU interaction patterns, but operates as a single model without fusion. GUARD-CAN combines graph understanding with recurrent layers for temporal modeling, while CGTS integrates a Graph Transformer with SVDD for one-class anomaly detection. However, these approaches typically rely on a single detection paradigm and do not adaptively combine multiple expert models.

Deep Learning and Ensemble Methods

A recent survey catalogues the rapid adoption of GNN architectures for network IDS, identifying fusion strategies and multi-dataset evaluation as key open challenges. Most recent CAN IDS approaches are anomaly-based, learning normal behavior and flagging deviations. proposed a CNN-LSTM-attention hybrid achieving over 98% accuracy by capturing both local and temporal patterns. Graph neural network approaches, such as GIDS , exploit graph convolutional networks (GCNs) to model message relationships, improving detection of structural and contextual anomalies. showed that robust feature engineering and data balancing significantly enhance supervised ML-based IDS, achieving up to 97.7% accuracy.

Ensemble methods for automotive IDS typically employ homogeneous models or sequential fusion. The BEPCD framework uses an ensemble of tree-based models (XGBoost, Random Forest) with adaptive voting. HyDL-IDS sequentially fuses CNN and LSTM modules to capture spatiotemporal features, but operates as a single fixed pipeline rather than a dynamic multi-expert system. Meta-IDS uses meta-learning to adapt detection across heterogeneous CAN configurations, representing a competing approach to learned adaptation, though it operates as a single adaptive model rather than fusing multiple complementary experts.

Knowledge Distillation for Deployment

Knowledge distillation (KD) addresses the deployment gap between high-capacity models and resource-constrained automotive hardware. A comprehensive survey of KD methods for GNNs identifies response-based, feature-based, and relation-based distillation strategies, but notes limited application to safety-critical domains. KD-GAT demonstrated that a distilled GAT student can closely match its teacher’s classification accuracy while reducing parameters by over 90%, though the resulting system still struggles with severe class imbalance. LSF-IDM is the closest prior work combining distillation with fusion for automotive IDS, distilling a BERT teacher into a BiLSTM student with semantic feature integration. However, LSF-IDM operates on sequential NLP-style representations rather than graph structures, limiting its ability to model ECU communication topology. A survey of deep reinforcement learning for IDS reveals growing interest in RL-based detection but sparse application to CAN bus domains, motivating our DQN-based fusion agent.

In broader cyber-physical systems (CPS) contexts, frameworks such as DGI-RBM and GDN integrate physical features with GNNs for SCADA systems. However, these approaches rely on feature-level fusion, concatenating physical statistics directly into node embeddings. This static integration is brittle when modalities are missing or corrupted. Multi-view causal inference approaches validate the multi-view fusion concept, but lack adaptive, reinforcement learning-based weighting strategies.

Positioning Our Contribution

The key innovation distinguishing our work is adaptive decision-level fusion via reinforcement learning. Rather than static fusion strategies (voting, concatenation, or fixed weighting), we treat VGAE and GAT as independent experts with complementary strengths and use a DQN policy to learn sample-specific weights that adaptively select the most informative representation for each message context. Unlike LSF-IDM’s NLP-based distillation or Meta-IDS’s single-model adaptation, our approach combines heterogeneous graph experts with learned decision-level fusion and hardware-aware distillation. This enables graceful degradation when one expert is unreliable and provides interpretability through learned weighting patterns.

Additionally, our hardware-aware knowledge distillation pipeline is principally scaled to automotive constraints (ARM Cortex-A7/A53, 256–512 MB RAM, 100 mW power budget), curriculum learning for class imbalance directly addresses the severe data imbalance (up to 927:1 ratios), and multi-dataset evaluation across six publicly available benchmarks demonstrates strong generalization and transferability.

Table 1: Table 1: Comparison of IDS frameworks. Columns highlight detection paradigm, evaluation breadth, and whether fusion is learned. Our approach is the only framework combining heterogeneous graph experts with adaptive fusion across six datasets.

Background

This section covers fundamental concepts of the CAN protocol, GNNs, VGAE, DQN, and knowledge distillation.

CAN Bus Protocol

The CAN is a robust serial protocol enabling real-time communication between ECUs in vehicles. In a CAN bus, nodes broadcast messages, while receivers filter and process relevant ones. Each CAN data frame includes a Start-of-Frame, Arbitration, Control, Data, CRC, Acknowledgment, and End-of-Frame field.

Figure 1: Structure of a standard CAN 2.0B data frame. The 29-bit arbitration field determines message priority, while the 0--8 byte data field carries the payload used for feature extraction in our graph construction pipeline.

Graph Neural Networks

A graph is a data structure consisting of a set of nodes $V$ and a set of edges $E$ that connect pairs of nodes. A graph can be defined as $G = (V,E)$, where $V = {v_1, v_2, …, v_n}$ is a node set with $n$ nodes, and $E = {e_1, e_2, …, e_m}$ is an edge set with $m$ edges.

Given this graph structure, a GNN looks to find meaningful relationships and insights of the graph. The most common way to accomplish this is through the message passing framework , where at each iteration, every node aggregates information from its local neighborhood. Across iterations, node embeddings contain information from further parts of the graph. This update rule can be explained through the following equation:

$\begin{equation}\label{eq-message-passing} \mathbf{h}_v^{(k)} = \phi\big(\mathbf{h}_v^{(k-1)},\ \oplus_{u \in \mathcal{N}(v)} \psi(\mathbf{h}_v^{(k-1)}, \mathbf{h}_u^{(k-1)}, \mathbf{e}_{vu})\big) \end{equation}$

where $\mathbf{h}$ is the node feature embedding, $\phi$ is the node update function, $\psi$ the message function, $\mathbf{e}_{vu}$ the edge feature, $\oplus$ an aggregation (sum/mean), and $\mathcal{N}(v)$ the neighbors of $v$.

GAT builds upon GNNs by introducing an attention mechanism. This allows each node in the message passing framework to dynamically assign weight contributions to their neighbors. For node $v$, the attention coefficient $\alpha_{vu}$ for neighbor $u$ is computed as:

$\begin{equation}\label{eq-gat-attention} \alpha_{vu} = \mathrm{softmax}\left( \mathrm{LeakyReLU}\left( \mathbf{a}^\top \left[ \mathbf{W}\mathbf{h}_v \| \mathbf{W}\mathbf{h}_u \right] \right) \right) \end{equation}$

where $\mathbf{a}$ is the learnable attention parameter vector, $\mathbf{W}$ is a shared weight matrix, and $|$ denotes concatenation of the projected node feature vectors.

The attention function computes a scalar weight for each neighbor of node $v_i$, denoted by $\alpha_{ij}$, which reflects the importance or relevance of node $v_j$ for node $v_i$.

$\begin{equation}\label{eq-gat-update} \mathbf{h}_v^{(k)} = \sigma\left( \sum_{u \in \mathcal{N}(v)} \alpha_{vu} \mathbf{W} \mathbf{h}_u^{(k-1)} \right) \end{equation}$

where $\sigma$ is the activation function, normally ELU or ReLU. GATv2 later addressed a static-attention limitation of this formulation; our architecture adopts GATv2 as detailed in the Methodology.

The Jumping Knowledge (JK) module enhances GATs by aggregating intermediate layer representations. In this work, we adopt LSTM-based JK aggregation, where a bidirectional LSTM with attention processes the sequence of per-layer embeddings and produces a single adaptive combination per node. Let $\mathbf{h}_v^{(l)}$ denote the representation of node $v$ at layer $l \in {1, \dots, L}$. The LSTM reads the layer sequence and outputs a weighted combination:

$\begin{equation}\label{eq-jk-lstm} \mathbf{h}_v^{\text{final}} = \text{LSTM-Attn}\!\left( \mathbf{h}_v^{(1)}, \mathbf{h}_v^{(2)}, \dots, \mathbf{h}_v^{(L)} \right) \end{equation}$

Unlike concatenation-mode JK, which applies the same linear combination to all nodes and increases the output dimension to $L \times d$, LSTM-mode JK learns a per-node adaptive combination. This allows each CAN node (ECU) to draw information from the most informative depth while keeping the output dimension at $d$, reducing parameters in the downstream classifier.

Variational Graph Autoencoder

The Variational Graph Autoencoder (VGAE) is a probabilistic model designed for unsupervised learning on graphs. Given a graph $G=(V, E)$ with adjacency matrix $A$ and node features $X$, VGAE approximates the posterior distribution of latent variables $Z$ using a multi-layer graph convolutional network (GCN) encoder.

The encoder approximates the posterior distribution over the latent variables $Z = {z_1, \ldots, z_n}$ by assuming a Gaussian distribution for each node:

$\begin{equation}\label{eq-vgae-encoder} q(Z|X, A) = \prod_{i=1}^{n} \mathcal{N}(z_i|\mu_i, \mathrm{diag}(\sigma_i^2)) \end{equation}$

where $\mu_i \in \mathbb{R}^d$ and $\sigma_i \in \mathbb{R}^d$ are the mean and standard deviation vectors for node $i$. These are parameterized by two separate GCN layers:

$\begin{equation}\label{eq-vgae-gcn-params} \mu = \mathrm{GCN}_\mu(X, A), \quad \log \sigma = \mathrm{GCN}_\sigma(X, A) \end{equation}$

which capture both local topology and node features. The outputs of these GCNs define the variational posterior $q(Z

X, A)$.

The decoder attempts to reconstruct the graph structure by computing the probability of edge existence between any two nodes $i$ and $j$ as:

$\begin{equation}\label{eq-vgae-decoder} p(A|Z) = \prod_{i=1}^{n} \prod_{j=1}^{n} \sigma(z_i^\top z_j) \end{equation}$

where $\sigma(\cdot)$ here denotes the sigmoid function (distinct from the activation in Eq. \eqref{eq-gat-update}) and $z_i^\top z_j$ measures similarity in latent space. This inner product decoder encourages connected nodes to have similar embeddings.

The training objective is to maximize the variational evidence lower bound (ELBO), which consists of a reconstruction term and a regularization term:

$\begin{equation}\label{eq-elbo} \mathcal{L} = \mathbb{E}_{q(Z|X, A)}[\log p(A|Z)] - \mathrm{KL}[q(Z|X, A) \| p(Z)] \end{equation}$

where the first term encourages accurate reconstruction of the observed adjacency matrix, and the second term is the Kullback-Leibler divergence between the approximate posterior and the prior $p(Z) = \prod_{i=1}^{n} \mathcal{N}(z_i

0, I)$, promoting regularization and disentangled latent representations.

While VGAE effectively captures global graph structure, its full-graph decoding may be suboptimal for detecting localized anomalies, especially in sparse or noisy graphs. To address this, introduced GAD-NR, which replaces full adjacency reconstruction with localized neighborhood prediction. This modification enhances sensitivity to topological deviations at the node-level, making it suitable for intrusion detection in systems like CAN networks. Inspired by this, our architecture adopts neighborhood-level reconstruction via masked decoding over the graph of each CAN window.

Deep Q-Network

Deep Q-Networks (DQNs) combine Q-learning with neural networks to handle high-dimensional state spaces . In traditional Q-learning, an agent learns a Q-table mapping each (state, action) pair to an expected reward. DQNs replace the Q-table with a neural network that approximates Q-values, enabling learning in continuous or high-dimensional state spaces. The network is trained by minimizing the temporal difference error using the Bellman equation:

$\begin{equation}\label{eq-bellman} L(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right] \end{equation}$

where $\mathcal{D}$ is the experience replay buffer, $\theta$ represents the current network weights, $\theta^-$ are the target network weights, $\gamma$ is the discount factor, and $r$ is the observed reward. Stabilization techniques include experience replay (sampling uniformly from past transitions) and a periodically updated target network $\theta^-$ (Double DQN).

Knowledge Distillation

Knowledge Distillation (KD), popularized by , is a widely adopted model compression technique where a small, efficient student model is trained to reproduce the behavior of a large, accurate teacher model. The soft target probabilities output by a teacher model encode rich relational information between classes that is often not captured by hard labels alone. Training a student model to match these softened outputs enables it to learn a more informative function approximation than training with one-hot labels alone.

Concretely, given an input $x$, the teacher produces a vector of logits $s^t(x)$, which are converted into a softened distribution $\tilde{p}^t_k(x)$ via temperature scaling $\tau$:

$\begin{equation}\label{eq-temperature-scaling} \tilde{p}^t_k(x) = \frac{\exp(s^t_k(x)/\tau)}{\sum_j \exp(s^t_j(x)/\tau)} \end{equation}$

The student is trained to match these probabilities by minimizing the Kullback-Leibler divergence between teacher and student distributions (distillation loss), alongside the standard supervised classification loss:

$\begin{equation}\label{eq-kd-total-loss} \mathcal{L}_{\text{total}} = (1 - \lambda) \cdot \mathcal{L}_{\text{hard}} + \lambda \cdot \mathcal{L}_{\text{KD}} \end{equation}$

where $\lambda$ balances the contribution of teacher supervision ($\mathcal{L}{\text{KD}}$) and ground truth ($\mathcal{L}{\text{hard}}$). Higher $\lambda$ places more weight on the soft targets from the teacher.

Methodology Overview

The proposed framework employs a three-stage pipeline for robust intrusion detection in Controller Area Network (CAN) bus systems. Stage 1 uses a Variational Graph Autoencoder (VGAE) to identify hard examples; Stage 2 trains a Graph Attention Network (GAT) with curriculum learning on filtered samples; Stage 3 leverages a Deep Q-Network (DQN) to learn dynamic fusion weights combining VGAE and GAT predictions. The workflow supports both training (sequential stages) and inference (parallel GAT/VGAE outputs fused by DQN).

Graph Construction

CAN messages are broadcast by Electronic Control Units (ECUs); CAN IDs identify message types and are not unique per packet—multiple ECUs can transmit the same ID, and any ECU can receive all messages. This broadcast model underpins the graph representation, capturing sequential dependencies within the CAN stream.

Algorithm 1: Graph Construction from CAN Stream


	Input: CAN stream $M = {m_t = (\text{ID}_t, \text{payload}_t)}$, window size $W$
	Output: Graphs $\mathcal{G} = {G_t = (V_t, E_t, X_t, y_t)}$
.	for $t = W$ to $\lvert M \rvert$ do
.	$\quad W_t \leftarrow M[t!-!W!+!1 : t]$	extract window
.	$\quad \text{source} \leftarrow W_t[:, -3]$; $\text{target} \leftarrow W_t[:, -2]$	CAN IDs
.	$\quad \text{edges} \leftarrow \text{stack}(\text{source}, \text{target})$
.	$\quad (\text{unique_edges}, \text{counts}) \leftarrow \text{unique}(\text{edges})$	transitions
.	$\quad V_t \leftarrow \text{unique}(\text{source} \cup \text{target})$	unique nodes
.	$\quad \text{node_map} \leftarrow {v \mapsto \text{idx} \mid v \in V_t}$	node indexing
.	$\quad E_t \leftarrow [(\text{node_map}[\text{src}], \text{node_map}[\text{tgt}])$ for $(\text{src}, \text{tgt}) \in \text{unique_edges}]$
.	$\quad$ Compute node features: $X_t \in \mathbb{R}^{\lvert V_t \rvert \times 35}$
.	$\quad$ Compute edge features: $F_t \in \mathbb{R}^{\lvert E_t \rvert \times 11}$
.	$\quad y_t \leftarrow 1$ if any attack ID $\in W_t$ else 0	label
.	end for

Node features (35 dimensions) are computed via Polars group-by aggregation over each node’s message occurrences within the window. They comprise: per-byte statistics (mean, standard deviation, and range for each of 8 payload bytes; 24 features), temporal and statistical summaries (message count, mean Shannon entropy, skewness, kurtosis; 4 features), graph-structural properties (clustering coefficient, split-half ratio, change rate; 3 features), inter-arrival time statistics (mean and standard deviation; 2 features), and degree (in-degree, out-degree; 2 features). Edge features (11 dimensions) comprise: inter-arrival time between the source and target messages, per-byte absolute differences across 8 payload bytes, a bidirectionality flag indicating whether the reverse edge exists, and edge frequency (transition count). Window size $W=100$ balances temporal context and computational efficiency. Graphs are directed and weighted by transition counts; self-loops occur for consecutive identical IDs.

Feature Standardization on Benign Training Rows

The continuous components of the 35-dimensional node tensor and the 11-dimensional edge tensor are z-score standardized before entering any of the three stages. We fit the per-feature mean $\mu$ and standard deviation $\sigma$ on benign training rows only ($y_t = 0$) rather than on the full benign+attack training mixture. Standardization at inference replays the cached $(\mu, \sigma)$ produced at cache-build time, so the inference and training coordinate frames coincide.

This is not the field convention. Deep-learning pipelines on NSL-KDD, CIC-IDS-2017, and UNSW-NB15 routinely fit the input scaler on the full labelled training set, treating the standardization step as a generic preprocessing transform rather than a stage that interacts with the threat model. The same pattern shows up in general anomaly-detection benchmarks and toolkits . We are not aware of a published study that isolates the AUC delta of joint-fit versus benign-only-fit as a controlled experimental variable; the joint-fit choice is convention, not principle.

The conceptual argument for benign-only fitting is direct: a model whose training objective is to detect deviations from normal traffic should have its input frame defined by normal traffic alone. Joint fitting embeds the training-attack distribution into the per-feature $(\mu, \sigma)$ used to standardize every subsequent input — including benign and novel-attack inputs at inference. This violates the one-class promise that the unsupervised stage relies on, and it propagates into the supervised and fusion stages because all three consume the same cached scaler. The closest published precedent for an explicit benign-only filter is Donut , where the rationale was missing-data robustness rather than distributional leakage; the principle generalizes. Deep one-class classifiers and one-class adversarial autoencoders obtain the same property by construction — their training partitions are normal-only — but the resulting coordinate frame is the same one we adopt explicitly.

The deployment-side argument adds force. The four-quadrant intrusion-detection deployment matrix—known-vehicle/known-attack, known-vehicle/unknown-attack (zero-day), unknown-vehicle/known-attack, and unknown-vehicle/unknown-attack—makes novel-attack detection the dominant operational risk for an IDS, since cataloguing every future attack is, by definition, impossible. The closed-world critique for supervised IDS models has been on the record for fifteen years and re-emerges in recent generalization studies . Joint scaling attenuates the discriminative axes the model needs at deployment: when attack-class variance exceeds benign variance (typical for fuzzing and DoS attacks on CAN, which inflate per-byte variance and inter-arrival-time spread), the pooled $\sigma$ used by joint fitting compresses the signal exactly along the features that distinguish novel attacks from benign traffic. The complementary literature on test-time normalization—AdaBN , TENT , the covariate-shift adaptation of , and the time-series RevIN analog —consistently uses target-domain or normal-domain statistics, never source-domain mixtures, when the test distribution is expected to differ from the training mixture.

We measured the empirical magnitude of the contamination on this paper’s largest training partition (set_01 of can-train-and-test ). The training tensor contains 151,129 graph windows, of which 88.8% are benign and 11.2% are attack (3.7 million nodes, 11.6% inside attack windows). Across the 35-dimensional node feature vector, switching from the joint-fit scaler to the benign-only-fit scaler shifts the per-feature mean by a median of 2.4% and the standard deviation by a median of 1.5% (relative to the benign-only values). Three features exceed a 5% mean shift; one feature exceeds 20% (a 37% mean shift, accompanied by an 11% standard-deviation shift). That feature corresponds to a graph-structural statistic whose distribution differs sharply between benign windows (typically near-zero for short observation horizons) and attack windows (where injection patterns produce concentrated repeated-edge motifs). For the worst-affected features the joint scaler is materially distorting the standardized representation; for the modal feature the effect is small but in a consistent direction across an unsupervised-then-supervised cascade.

This decision does not address cross-vehicle generalization (the “unknown-vehicle / known-attack” quadrant). When the deployment vehicle’s benign distribution differs from the training vehicle’s benign distribution, the right intervention is test-time adaptation of normalization statistics on deployment-vehicle benign traffic , not a different choice of train-time scaler. We treat the two axes as orthogonal: train-time benign-only fitting handles the novel-attack axis; test-time adaptation, when target-domain benign data is available, would handle the cross-vehicle axis. The current pipeline implements the former; the latter is reserved for future work and would not require revisiting the cached $(\mu, \sigma)$ produced under the present scheme.

A robust-statistics variant (median + interquartile range computed on the same benign training rows) is retained in the codebase as a sensitivity ablation: CAN benigns can be heavy-tailed (entropy spikes during legitimate diagnostic broadcasts; bursts on power-cycle events) and the question of whether mean+std or median+IQR better tracks the bulk of benign activity is empirical, not principled. We report results for both estimators in the ablation study. Independent of the estimator family, the fitting population—benign training rows—is fixed by the argument above. We do not refit any preprocessing component on test data, including its benign fraction; documents AUC inflation of two to five points from this kind of test-time refit on CIC-IDS pipelines.

Handling Out-of-Vocabulary Arbitration IDs

Each node in the graph is identified by its 11- or 29-bit CAN arbitration ID, which we treat as a categorical feature with a learnable embedding concatenated to the 35-dimensional statistical node vector. Naive use of a per-split identity-lookup embedding is unsafe in the CAN IDS setting: the dominant threat models in every public benchmark—injection, fuzzing, and spoofing—inject previously unseen arbitration IDs as the core attack signal . The ROAD dataset specifically includes fuzzing attacks that transmit random IDs chosen to disrupt CAN operation ; CAN-MIRGU contains physically verified attacks on 13 previously unseen IDs recorded from a moving vehicle ; and can-train-and-test partitions attacks into seen/unseen splits precisely to benchmark generalization to novel IDs . A vocabulary constructed from the training split alone therefore under-sizes the embedding table at inference time and raises an IndexError on the first attack-injected ID.

To close this hole we (i) construct a single shared vocabulary from the union of all source directories across train, validation, and every test subdirectory at cache-construction time, persist it as an invariant alongside the cache metadata, and (ii) reserve index 0 of the embedding table as a learnable UNK slot that absorbs any genuinely unseen ID encountered at deployment. The three-option design space—informed by 2021–2026 industrial recsys practice for sparse categorical features—trades off gradient coverage against collision control:

Lookup + untrained UNK (baseline). A shared vocabulary with a designated but never-activated UNK row. Routes attack-injected IDs through a zero-gradient embedding; documents the silent-failure mode common in prior CAN IDS work .
Lookup + stochastic UNK-drop (design-decision control). During training, each node’s arbitration ID is remapped to the UNK slot with probability $p$, forcing gradient through the OOV row on benign traffic so that unseen IDs at inference land in a trained slot. The mechanism mirrors Monolith’s low-frequency-ID filtering, which routes rare IDs to shared buckets in ByteDance’s production recommender .
$k$-probe hash embedding (primary, evidence-backed). Every arbitration ID—seen or unseen—is deterministically mapped to $k$ rows of a bucketed table by complementary hash functions and the per-bucket vectors are combined by a learned aggregation, following the binary-code hash embedding of and the multi-probe “unified embedding” formulation of . Because all IDs, including attack-injected novel IDs, hit trained buckets by construction, no special OOV token is needed.

The recsys literature has converged on hashed or multiplexed embedding tables as the default treatment for large or dynamic vocabularies; two independent 2023 surveys taxonomize the design space across hash, compositional, and learned-hash families and frame pure identity-lookup embeddings as a legacy choice that does not survive dynamic vocabularies . To our knowledge this recipe has not been transferred into the CAN IDS literature: neither CAN-BERT , CANShield , nor the GAT-based CAN IDS of trains a learnable OOV embedding or uses hash-based ID encoding, despite every modern CAN-attack benchmark requiring novel-ID robustness. Explicitly adopting one of the three options above—and ablating them on ROAD, CAN-MIRGU, and can-train-and-test split by “ID seen at training” versus “ID unseen at training”—is a methodological contribution of this work.

Training Paradigm

Stage 1: VGAE Training and Hard Sample Selection

Algorithm 2: VGAE-Based Hard Sample Selection


	Input: Trained VGAE model on normal graphs
	Output: Hard-selected training dataset for Stage 2
.	Train VGAE on normal graphs until convergence
.	for each normal graph $G_i$ do
.	$\quad R_i \leftarrow \lVert \mathbf{A}_i - \hat{\mathbf{A}}_i \rVert_F^2 / \lvert V_i \rvert^2$	reconstruction error
.	end for
.	Rank by $R_i$ descending; select top-$k$ as hard negatives
.	Combine hard normal samples with all attack samples

High reconstruction error indicates ambiguous or boundary-proximate normal samples. Selective undersampling preserves discriminative hard examples while reducing majority class dominance.

Architectural Enhancements to VGAE

GATv2 Attention. All graph convolution layers use GATv2Conv rather than the original GATConv. GATv2 reorders the attention computation to apply the nonlinearity before the attention parameter, resolving the static attention limitation where GATv1 produces the same attention ranking regardless of the query node. This is critical for CAN bus anomaly detection: adversarial messages inject noise into graph structure, and dynamic attention can adaptively down-weight noisy edges while static attention degrades uniformly. GraphNorm. Normalization layers use GraphNorm instead of BatchNorm. BatchNorm normalizes across all nodes in a batch, mixing statistics from different graphs and introducing batch-dependent noise. GraphNorm normalizes per-graph with a learnable shift parameter, preserving graph-level distributional information that is essential for anomaly scoring.

Stage 2: GAT Training with Curriculum Learning

Curriculum Learning: Momentum-based scheduler transitions from balanced to imbalanced sampling:

$\begin{equation}\label{eq-curriculum-momentum} p_t = 1 - \exp(-t / \tau) \end{equation}$

Batch composition blends three sources:

$\begin{equation}\label{eq-batch-composition} B_t = (1 - p_t) B_{\text{bal}} + p_t B_{\text{nat}} + \alpha_{\text{buf}} B_{\text{hard}} \end{equation}$

where $B_{\text{bal}}$ is class-balanced, $B_{\text{nat}}$ reflects natural imbalance, $B_{\text{hard}}$ contains highest-error samples from VGAE buffer ($\alpha_{\text{buf}} = 0.2$), and buffer is refreshed every 100 steps. This prevents premature majority bias while maintaining natural distribution awareness.

Knowledge Distillation: Student GAT mimics the pre-trained teacher via logit-level distillation (Eq. \eqref{eq-kd-total-loss}), using temperature-scaled soft targets (Eq. \eqref{eq-temperature-scaling}) with $T=4$ and mixing coefficient $\lambda=0.7$. No intermediate feature distillation is applied.

Architectural Enhancements to GAT

GATv2 Attention. As with the VGAE encoder, all GAT convolution layers use GATv2Conv with dynamic attention. GATv2Conv additionally accepts edge features via the edge_dim parameter, incorporating the 11-dimensional edge attributes (inter-arrival time, per-byte differences, bidirectionality, edge frequency) into the attention computation. This enables attention-weighted message passing that is conditioned on both node and edge information. LSTM Jumping Knowledge. Layer outputs are aggregated via LSTM-based Jumping Knowledge (Eq. \eqref{eq-jk-lstm}) rather than concatenation, enabling per-node adaptive depth selection while keeping the output dimension at $d$. GPS Graph Transformer (Ablation). As an ablation, the local GATv2Conv layers can be replaced with GPS layers , which combine local message passing with global multi-head self-attention and a feed-forward network in each layer. For CAN bus graphs (20–50 nodes), the global attention component is computationally inexpensive and captures long-range message dependencies that multi-hop local attention may miss. GPS layers are selectable via conv_type="gps" in the pipeline configuration.

Stage 3: Adaptive Fusion

After Stages 1–2, a fusion agent learns optimal fusion weights combining VGAE and GAT predictions. Training uses ground truth labels to compute reward signals. We evaluate two fusion formulations—a Deep Q-Network (DQN) and a Neural-LinUCB contextual bandit—that share the same state space, action space, reward function, and MLP backbone architecture, differing only in their exploration and update mechanisms.

State Space: 15-dimensional feature vector aggregating VGAE and GAT outputs: VGAE reconstruction errors (node, neighbor, CAN ID levels), latent space statistics (mean, std, max, min), VGAE confidence; GAT class probabilities (class 0, class 1), embedding statistics (mean, std, max, min), GAT confidence. All features normalized and clipped to $[0,1]$.

Action Space: $K=21$ discrete fusion weights linearly spaced in $[0,1]$. Policy semantics: $\alpha = 0.5$ (equal weighting), $\alpha < 0.5$ (favor VGAE), $\alpha > 0.5$ (favor GAT). Fused anomaly score: $\sigma = (1 - \alpha) \cdot \text{VGAE}{\text{anomaly}} + \alpha \cdot \text{GAT}{\text{prob}}$; final prediction $\hat{y} = \mathbb{1}[\sigma > 0.5]$.

Reward Function: Directly tied to classification accuracy using ground truth labels:

$\begin{equation}\label{eq-reward} R(\hat{y}, y_{\text{true}}, \mathbf{s}, \alpha) = \begin{cases} +3.0 + r_{\text{agree}} + r_{\text{conf}} & \text{if } \hat{y} = y_{\text{true}} \\ -3.0 + r_{\text{disagree}} + r_{\text{overconf}} & \text{if } \hat{y} \neq y_{\text{true}} \end{cases} \end{equation}$

where $r_{\text{agree}}$ measures alignment between VGAE and GAT (model agreement bonus), $r_{\text{conf}}$ rewards high confidence on correct predictions, $r_{\text{disagree}}$ penalizes misalignment on errors, $r_{\text{overconf}}$ penalizes overconfidence on incorrect predictions, and an implicit balance bonus discourages extreme $\alpha$ values. Both DQN and bandit use this identical reward.

Shared Backbone: Both agents use an MLP backbone $f_\theta: \mathbb{R}^{15} \to \mathbb{R}^d$ (3 hidden layers, 128 units each, LayerNorm + ReLU + Dropout) that transforms the normalized state vector into a learned representation $\mathbf{z} = f_\theta(\mathbf{s})$.

DQN Fusion

The DQN extends the backbone with a linear output layer producing $K$ Q-values, one per discrete fusion weight. Because each CAN window graph is classified independently—the fusion decision for one window does not affect the next—the discount factor is set to $\gamma = 0$. This reduces the Bellman target (Equation \eqref{eq-bellman}) to pure reward maximization:

$\begin{equation}\label{eq-dqn-loss} \mathcal{L}_{\text{DQN}}(\theta) = \mathbb{E}_{(s,a,r) \sim \mathcal{D}} \left[ \text{SmoothL1}\!\left( Q(s, a; \theta),\; r \right) \right] \end{equation}$

where $\mathcal{D}$ is an experience replay buffer (capacity 50K) and SmoothL1 loss provides robustness to reward outliers.

Exploration: Epsilon-greedy with decaying exploration rate:

$\begin{equation}\label{eq-epsilon-greedy} a_t = \begin{cases} \arg\max_{a} Q(s_t, a; \theta) & \text{with probability } 1 - \epsilon \\ \text{Uniform}(\{1, \ldots, K\}) & \text{with probability } \epsilon \end{cases} \end{equation}$

where $\epsilon$ decays as $\epsilon \leftarrow \max(\epsilon_{\min},\; \epsilon \cdot \delta)$ after each episode ($\epsilon_0 = 0.2$, $\delta = 0.995$, $\epsilon_{\min} = 0.01$). Because each graph is treated as an independent episode and $\gamma = 0$ removes bootstrapping, the targets reduce to observed rewards and no separate target network is needed.

Neural-LinUCB Contextual Bandit Fusion

Because the fusion decision for each graph is independent, the sequential MDP assumption underlying DQN is unnecessary. Neural-LinUCB decomposes the fusion problem into deep representation learning (the shared backbone) and shallow exploration (per-arm ridge regression with UCB). This replaces gradient-based Q-learning and the experience replay buffer with closed-form per-arm updates over a fixed shared representation.

UCB Arm Selection: Given the backbone representation $\mathbf{z} = f_\theta(\mathbf{s})$, each arm’s score combines a reward estimate with an uncertainty bonus:

$\begin{equation}\label{eq-bandit-ucb} a^* = \arg\max_{a \in \{1, \ldots, K\}} \left( \boldsymbol{\theta}_a^\top \mathbf{z} + \beta \sqrt{\mathbf{z}^\top \mathbf{A}_a^{-1} \mathbf{z}} \right) \end{equation}$

where $\boldsymbol{\theta}a = \mathbf{A}_a^{-1} \mathbf{b}_a$ are the per-arm weight vectors, $\mathbf{A}_a = \sum{t: a_t = a} \mathbf{z}t \mathbf{z}_t^\top + \lambda \mathbf{I}$ is the regularized design matrix, $\mathbf{b}_a = \sum{t: a_t = a} r_t \mathbf{z}_t$ accumulates observed rewards, and $\beta$ controls exploration magnitude.

Closed-Form Linear Update: After observing reward $r_t$ for action $a_t$, the linear model is updated without gradient computation. The reward accumulator and precision matrix are updated jointly:

$\begin{equation}\label{eq-bandit-accum} \mathbf{b}_{a_t} \leftarrow \mathbf{b}_{a_t} + r_t \, \mathbf{z}_t, \qquad \boldsymbol{\theta}_{a_t} \leftarrow \mathbf{A}_{a_t}^{-1} \mathbf{b}_{a_t} \end{equation}$

The precision matrix inverse $\mathbf{A}_a^{-1}$ is maintained incrementally via the Sherman-Morrison formula:

$\begin{equation}\label{eq-sherman-morrison} \mathbf{A}_{a_t}^{-1} \leftarrow \mathbf{A}_{a_t}^{-1} - \frac{(\mathbf{A}_{a_t}^{-1} \mathbf{z}_t)(\mathbf{A}_{a_t}^{-1} \mathbf{z}_t)^\top}{1 + \mathbf{z}_t^\top \mathbf{A}_{a_t}^{-1} \mathbf{z}_t} \end{equation}$

This runs in $O(d^2)$ per sample with no gradient computation, making online updates substantially cheaper than DQN’s minibatch SGD.

Backbone Retraining: Periodically (every $N = 50$ episodes), the backbone parameters $\theta$ are updated via gradient descent on the replay buffer to improve the learned representation. After retraining, the linear models ($\mathbf{A}_a^{-1}$, $\mathbf{b}_a$, $\boldsymbol{\theta}_a$) are reset since the representation space has shifted.

Theoretical Motivation: Unlike epsilon-greedy exploration (which explores uniformly at random), the UCB term provides directed exploration—arms with high uncertainty receive higher scores, and this uncertainty shrinks as $O(1/\sqrt{n_a})$ with the number of times arm $a$ is pulled. Neural-LinUCB achieves $\tilde{O}(\sqrt{T})$ cumulative regret , matching full NeuralUCB at a fraction of the computational cost since exploration is confined to the last layer. Empirical comparisons between bandit, DQN, and supervised baselines (MLP, weighted average) determine whether principled exploration provides genuine benefit over simpler approaches.

Inference Pipeline

At inference, VGAE and GAT execute in parallel to minimize latency. Their outputs are concatenated into the 15D state vector and passed to the fusion agent for dynamic weight determination and final prediction. Parallelization of VGAE and GAT ensures both models evaluate concurrently, while the fusion agent adds minimal overhead (single forward pass through a small fully-connected network). This design enables real-time deployment in resource-constrained CAN bus environments with sub-millisecond inference latency. Temporal extensions that introduce inter-window state transitions are discussed as future work.

Experiments

This section presents the experimental setup, evaluation metrics, and insights into the datasets used in this study.

Evaluation Metrics

The performance of the model is evaluated using accuracy and F1-score. All experiments were conducted using PyTorch and PyTorch Geometric on GPU clusters provided by the Ohio Supercomputer Center .

Datasets

Our proposed method has been evaluated on three publicly available automotive CAN intrusion detection datasets, each offering distinct characteristics and challenges for comprehensive IDS evaluation.

HCRL Car-Hacking

This dataset contains CAN traffic from a Hyundai YF Sonata with four attack types: DoS, fuzzing, RPM spoofing, and gear spoofing. All attacks were conducted on a real vehicle, with data logged via the OBD-II port. The dataset includes 988,872 attack-free samples and approximately 16.6 million total samples across all attack types .

HCRL Survival Analysis

Collected from three vehicles (Chevrolet Spark, Hyundai YF Sonata, Kia Soul), this dataset enables scenario-based evaluation with three attack types: flooding (DoS), fuzzing, and malfunction (spoofing). The dataset is structured with 627,264 training samples and four testing subsets designed to evaluate IDS performance across known/unknown vehicles and known/unknown attacks .

can-train-and-test

The largest dataset, containing CAN traffic from four vehicles across two manufacturers (GM and Subaru). It provides nine distinct attack scenarios including DoS, fuzzing, systematic, various spoofing attacks, standstill, and interval attacks. The dataset is organized into four vehicle sets (set_01 to set_04) with over 192 million total samples. This dataset exhibits extreme class imbalance with attack-free to attack sample ratios ranging from 36:1 to 927:1 across different subsets. Each set contains one training subset and four testing subsets following the known/unknown vehicle and attack paradigm . This work limits evaluation to the known vehicle and attack testing set.

Results and Discussion

Test Set Performance

Table 1 and Table 2 summarize the test set performance across six datasets. We compare against four GNN-based baselines: KD-GAT , A&D GAT , G-IDCS , and GUARD-CAN . KD-GAT serves as the primary baseline since it is the only method evaluated on the comprehensive can-train-and-test dataset .

Our approach demonstrates consistent improvements across all datasets, with particularly significant gains on highly imbalanced datasets. Compared to KD-GAT, we achieve an average improvement of 2.09% in accuracy and 16.22% in F1-score. The most substantial improvements occur on challenging datasets S02 and S04, where F1-scores improve by 55.25% and 30.64% respectively, indicating superior handling of severe class imbalance.

Table 1: Table 1: Test Performance on HCRL Car-Hacking Dataset

Model	Accuracy	Precision	Recall	F1	AUC	citation_key	specificity	mcc
baselines
BEPCD	0.9800	0.9700	0.9800	0.9700	0.9900	jichici2024bepcd	None	None
HyDL-IDS	0.9910	0.9900	0.9910	0.9900	0.9970	lo2022hydlids	None	None
ours
FUSION	0.9995	1.0000	0.9960	0.9980	0.9996	None	1.0	0.9976984755238361
GAT	0.9995	1.0000	0.9960	0.9980	1.0000	None	1.0	0.9976984755238361
VGAE	0.8153	0.4144	0.9163	0.5707	0.9385	None	0.7996300863131935	0.5341333907330748

Figure 1: Interactive results table. Click column headers to sort; use the filter to search models.

Table 2: Table 2: Cross-Scenario Test Results

Model	Scenario	Accuracy	F1	Precision	Recall
GAT	test_01_known_vehicle_known_attack	0.9990	0.9990	1.0000	0.9979
GAT	test_02_unknown_vehicle_known_attack	0.4052	0.5767	0.4052	1.0000
GAT	test_03_known_vehicle_unknown_attack	0.8186	0.6852	1.0000	0.5211
GAT	test_04_unknown_vehicle_unknown_attack	0.3915	0.5627	0.3915	1.0000
VGAE	test_01_known_vehicle_known_attack	0.6768	0.7492	0.5989	1.0000
VGAE	test_02_unknown_vehicle_known_attack	0.4052	0.5767	0.4052	1.0000
VGAE	test_03_known_vehicle_unknown_attack	0.6059	0.6578	0.4901	1.0000
VGAE	test_04_unknown_vehicle_unknown_attack	0.3915	0.5627	0.3915	1.0000
FUSION	test_01_known_vehicle_known_attack	0.9990	0.9990	1.0000	0.9979
FUSION	test_02_unknown_vehicle_known_attack	0.4052	0.5767	0.4052	1.0000
FUSION	test_03_known_vehicle_unknown_attack	0.8235	0.6963	1.0000	0.5341
FUSION	test_04_unknown_vehicle_unknown_attack	0.3915	0.5627	0.3915	1.0000

Discussion

Class Imbalance Handling: Our multi-stage approach demonstrates superior performance on imbalanced datasets compared to single-stage methods. The VGAE component effectively captures structural anomalies even with limited attack samples, while the GAT classifier benefits from the refined feature representations. This combination proves particularly effective on datasets S02 and S04, where traditional methods struggle with extreme class ratios.

Generalization Capability: The consistent performance across diverse datasets (CarH, CarS, and can-train-and-test subsets) demonstrates strong generalization. Unlike previous methods that show significant performance degradation on unseen test data, our approach maintains robust detection capabilities across different attack types and network conditions.

Ablation Study

To assess the contribution of different model configurations, we perform ablation experiments investigating three key variables: knowledge distillation, supervised learning training strategies (curriculum learning and hard sample mining), and fusion effectiveness, comparing standalone and fused architectures.

Experimental Design

We adopt a one-factor-at-a-time (OFAT) design : each ablation axis varies across its candidate values while every other axis is held at a fixed reference condition. The reference condition is conv_type=gatv2 , loss_fn=focal, sampler=default (non-curriculum), and the supervised baseline for fusion uses VGAE pretraining with the focal-GAT downstream model. This isolates each axis’s marginal effect over a single, consistent baseline rather than a moving target that drifts as earlier axes’ winners propagate forward.

OFAT is efficient under a constrained GPU budget: across five axes the screening consumes $5 \times ( \text{axis} ) \approx 16$ variants per seed, compared to $3 \cdot 3 \cdot 3 \cdot 3 \cdot 4 = 324$ for a full factorial design. The trade-off is that OFAT cannot detect interaction effects between axes —for example, if conv_type=gps and loss_fn=weighted_ce happen to combine super-additively, that joint effect is invisible to our screening. We report interaction follow-ups, when motivated, as a targeted factorial over the top-2 candidates of each axis rather than a full grid expansion.

Seed Variance and Statistical Framing

We run each ablation variant across $N = 3$ seeds (42, 123, 777). This is a deliberately screening-stage budget: show that detecting moderate effects in ML benchmarks under $\gamma = 0.75$ and $\alpha = \beta = 0.05$ requires $N \approx 29$ seeds per condition, which is not feasible for a 16-variant sweep on our allocation. We therefore frame screening results as effect estimates with uncertainty, not null-hypothesis tests.

For each axis, we report:

Cohen’s $d$ of the variant against the reference condition, computed on the test-phase metric (F1 for classifier variants, AUROC for reconstruction-flavor variants).
95% bootstrap confidence interval on Cohen’s $d$ over the 3 seeds, following ’s recommendation against $p$-values at small $N$.
Expected-max performance across seeds, which captures the “best-of-$N$” bias that seed-tuned hyperparameter comparisons quietly rely on.

Variants whose 95% CI on $d$ excludes zero and whose expected-max gap relative to the reference exceeds a pre-registered threshold are promoted to a confirmatory run with additional seeds. This two-stage protocol is standard practice for ablations under compute constraints and makes the transition from screening to confirmation explicit.

Knowledge Distillation Effects

Awaiting data export for Knowledge Distillation ablation table.

To further characterize how knowledge transfers between teacher and student networks, we compute Centered Kernel Alignment (CKA) between all pairs of teacher and student layers. High CKA values indicate that corresponding layers learn similar representations despite the $20\times$ parameter reduction.

Figure 1: CKA similarity between teacher and student GAT layers. Hover for exact values.

GAT Training Strategy

Awaiting data export for GAT training strategy ablation table.

Bandit Fusion vs. Baseline Strategies

Table 1: Table 1: VGAE Anomaly Detection Threshold Analysis

Metric	Value
optimal_threshold	0.043536
youden_j	0.719949
f1	0.570720
auc	0.938542

Explainability

UMAP Analysis

To understand the representations learned by our model, we perform UMAP-based feature analysis using both raw input statistics and learned graph embeddings. We sample 10% of graphs from the HCRL Car-Hacking dataset.

Raw CAN-graph data projected via UMAP shows loose clustering that indicates limited separability between normal and attack types. In contrast, UMAP projections of graph-level embeddings from the trained GAT classifier’s penultimate layer reveal well-separated clusters.

Despite binary supervision (attack vs. normal), the learned embedding space forms well-separated clusters aligned with specific attack types (DoS, Fuzzy, Gear, RPM). This emergent multi-class structure demonstrates that our model captures high-level semantic patterns in CAN traffic and generalizes across attack categories without explicit multi-class labels. The clear cluster separation in embedding space, absent in raw features, validates the GAT’s ability to learn discriminative representations from graph-structured temporal data.

Figure 1: UMAP projections of GAT embeddings (10% sample). Toggle attack types to explore cluster separation.

Composite VGAE Reconstruction Error

To assess the overall reconstruction quality of the VGAE, we combine three types of reconstruction errors: node feature reconstruction error ($E_{\text{node}}$), neighborhood reconstruction error ($E_{\text{neighbor}}$), and CAN ID prediction error ($E_{\text{CAN\,ID}}$). Each error captures a different aspect of the graph structure and message semantics. We compute a single composite score as a weighted sum:

$\begin{equation}\label{eq-composite-error} \mathrm{Composite\_Error} = \alpha\, E_{\text{node}} + \beta\, E_{\text{neighbor}} + \gamma\, E_{\text{CAN\,ID}} \end{equation}$

where $\alpha$, $\beta$, and $\gamma$ are empirically chosen weights that regulate each term’s influence. In our experiments, we use $\alpha = 1.0$, $\beta = 20.0$, and $\gamma = 0.3$.

This approach enables the detection of subtle anomalies by jointly evaluating node content, CAN identifier semantics, and local neighborhood structure.

Figure 2: VGAE reconstruction error decomposition. Top: per-component distributions (normal vs attack). Middle: error heatmap sorted by composite score. Bottom: per-component ROC curves.

DQN-Fusion Analysis

The learned DQN fusion policy exhibits interpretable, context-specific weighting that validates adaptive expert selection. Analysis reveals a strong correlation between VGAE anomaly scores and fusion weights: low VGAE scores cluster at $\alpha \approx 0$ (favoring the robust expert), while higher scores transition to intermediate weights, demonstrating the policy learned to default to VGAE’s out-of-distribution detection while conditionally leveraging GAT’s strength on known attacks. The multimodal distribution with peaks at $\alpha \approx 0, 0.2, 0.4, 0.6, 0.8$ indicates the DQN discovered distinct attack-type-specific strategies rather than learning fixed averaging ($\alpha = 0.5$). Critically, the divergence between normal and attack distributions validates meaningful anomaly detection logic.

Figure 3: DQN fusion weight distribution by attack type. Peaks at distinct $\alpha$ values indicate learned attack-type-specific strategies.

GAT Attention Weights

To understand which CAN message relationships the GAT deems important, we visualize the learned attention weights on selected graphs. Edge width and opacity are proportional to the mean attention across heads for a given layer.

Figure 4: GAT attention weights on selected CAN bus graphs. Select different graphs and layers to compare normal vs attack attention patterns.

Limitations and Future Work

Temporal Independence Assumption. The current framework treats each CAN window graph independently—the fusion decision for one window does not consider the temporal context of preceding or subsequent windows. However, CAN bus attacks exhibit strong temporal structure: attacks span multiple windows, have characteristic onset and offset patterns, and evolve over time. Recent work on spatial-temporal graph methods for CAN intrusion detection has demonstrated the value of modeling these dependencies. CGTS uses a Graph Transformer with SVDD to capture temporal relationships between CAN message sequences, while GCN-2-Former combines GCN with Transformer layers using sliding-window dynamic graph construction for explicit spatial-temporal modeling. Both approaches achieve strong detection performance by exploiting inter-window dependencies.

Incorporating temporal context would transform the fusion problem from a contextual bandit (where each decision is independent) into a genuine sequential decision process, justifying the DQN agent’s discount factor and target network. We identify this as the highest-impact architectural extension.

TransformerConv for Edge Feature Utilization. While GATv2Conv incorporates edge features into the attention computation, it does not condition the message values on edge attributes. TransformerConv uses edge features in both the attention and value projections, potentially enabling richer utilization of the 12-dimensional CAN edge features (frequency counts, inter-arrival statistics, bidirectionality indicators, degree products). An ablation comparing GATv2Conv and TransformerConv across datasets would quantify the benefit of fuller edge feature integration.

Scaled Cosine Reconstruction. GraphMAE proposes scaled cosine error as a replacement for MSE in reconstruction-based objectives, arguing it is more robust to high-dimensional features with varying magnitudes. Given that our node features (31-D) include statistical moments that can span several orders of magnitude, scaled cosine error may improve anomaly sensitivity relative to MSE. We leave this comparison to future work.

Conclusion

We introduced a novel multi-stage CAN intrusion detection framework combining Variational Graph Autoencoder and Graph Attention Network modules for robust anomaly detection and classification. Knowledge distillation enables a compact student model achieving 95% parameter reduction while maintaining strong performance. Extensive experiments across six benchmark datasets demonstrate significant improvements over existing methods, with average F1-score gains of 16.22% and exceptional performance on class-imbalanced scenarios. Our ablation study reveals that the standalone GAT classifier achieves comparable performance to fusion approaches with greater computational efficiency, making it ideal for resource-constrained automotive environments. These results highlight the promise of graph-based, multi-stage deep learning combined with knowledge distillation for practical automotive cybersecurity deployment.

Experimental Setup

Implementation Details

80% of the dataset was utilized for training, 20% for validation, and a distinct test set was compiled by the dataset providers. All experiments were conducted using PyTorch and PyTorch Geometric. Model training and evaluation were performed on GPU clusters provided by the Ohio Supercomputer Center (OSC) . Each CAN message carries a CAN ID and 8 data bytes; the graph construction pipeline (Algorithm 1) aggregates windowed messages into 35-dimensional node features and 11-dimensional edge features per graph.

Model Sizing for Cascading Knowledge Distillation Ensemble

This section provides specific parameter budgets for the three-model student ensemble (GAT classifier, VGAE autoencoder, DQN fusion) derived from the CAN bus latency constraints, with unequal allocation reflecting architectural complexity and inference cost trade-offs.

Total Parameter Budget

From the CAN bus latency constraint (7 ms hard limit ), the total onboard parameter budget is:

\[N_{\text{onboard, total}} = 173\text{ K parameters (FP32)}\]

Using the empirical distillation scaling law with target compression ratio $\kappa \approx 20$:

\[N_{t,\text{model}} \approx 20 \times N_{s,\text{model}} \quad \text{for each model}\]

Heterogeneous Model Allocation

Student ensemble members are not equally sized. The GAT classifier and VGAE autoencoder perform primary detection tasks and receive larger parameter budgets, while the DQN fusion model aggregates their outputs and receives reduced allocation:

Table 1: Table 1: Parameter Budget Allocation Across Student and Teacher Ensembles

| Model | Student | Teacher | Compression | | ——————- | ——— | ———– | ——————- | | GAT Classifier | 55 K | 1.100 M | $20\times$ | | VGAE Autoencoder | 86 K | 1.710 M | ${\approx}20\times$ | | Fusion Agent | 32 K | 687 K | ${\approx}21\times$ | | Total (Onboard) | 173 K | — | — | | Total (Offline) | — | 3.497 M | — |

Model Architecture Details

The student-teacher pairs share each model’s architectural family; the teacher differs by depth, width, or attention-head count rather than kind. Specific hyperparameters (channel widths, embedding dimensions, dropout, exploration rates) are tracked in the configs alongside the training code rather than reproduced here, since they are tuned per ablation.

GAT Classifier (55 K Student, 1.100 M Teacher)

Graph attention network over 35-dimensional node features and 11-dimensional edge attributes, built from GATv2Conv layers. Both student and teacher use LSTM-based Jumping Knowledge aggregation, which learns a per-node adaptive combination of layer representations and keeps the output dimension at $d$ (hidden $\times$ heads) rather than $L \times d$. The student is shallow — 2 GATv2Conv layers feeding a 2-layer FC classification head; the teacher expands to 3 GATv2Conv layers and a 4-layer FC head, providing higher representational capacity and softer knowledge targets during distillation.

VGAE Autoencoder (86 K Student, 1.710 M Teacher)

Variational graph autoencoder for unsupervised anomaly detection, built from GATv2Conv encoder layers and a symmetric MLP decoder. The student progressively compresses 35-dimensional node features through a 3-layer single-head encoder ($[80 \to 40 \to 16]$) into a 16-dimensional latent space; the teacher widens to a 4-head encoder ($[480 \to 240 \to 64]$) and a 64-dimensional latent space for richer representation learning. Both employ the variational reparameterization trick; the CAN ID is separately classified from the latent representation rather than passed through the same reconstruction objective as the continuous features.

Fusion Agent (32 K Student, 687 K Teacher)

Fusion agent for multi-model state aggregation. The state vector concatenates VGAE outputs (3 reconstruction error components — node, neighbor, CAN ID; 4 latent statistics — mean, std, max, min of $\mathbf{z}$; 1 confidence score) and GAT outputs (2 class probabilities; 4 embedding statistics — mean, std, max, min; 1 confidence score) into a combined 15-dimensional input. Both DQN and Neural-LinUCB variants share an MLP backbone — 3 hidden layers, 128 units in the student and 256 in the teacher, with LayerNorm and ReLU. The DQN variant trains with bootstrap-free TD targets ($\gamma = 0$, epsilon-greedy exploration); each graph is an independent episode, so no target network is needed. The bandit variant replaces the Q-network with per-arm ridge regression and UCB exploration. The action space $

= 21$ corresponds to uniformly spaced fusion weights $\alpha \in [0, 1]$.

Distillation Training

The GAT and VGAE students train using knowledge distillation from their respective teachers. The distillation loss is a weighted combination of the task loss and the KD loss:

$\begin{equation}\label{eq-appendix-kd-loss} L_{\text{total}} = \alpha \, L_{\text{KD}} + (1 - \alpha) \, L_{\text{task}} \end{equation}$

where $\alpha = 0.7$ and $L_{\text{KD}}$ employs temperature-scaled softmax with temperature $T = 4.0$:

\[L_{\text{KD}} = T^{2} \cdot \text{KL}\!\left( \frac{\log \hat{y}_{s}}{T} \;\middle\|\; \frac{\hat{y}_{t}}{T} \right)\]

For the VGAE, distillation combines latent-space alignment and reconstruction matching:

\[L_{\text{KD}}^{\text{VGAE}} = 0.5 \, L_{\text{latent}} + 0.5 \, L_{\text{recon}}\]

where $L_{\text{latent}}$ is the MSE between student and teacher latent representations (with a learned projection when dimensions differ) and $L_{\text{recon}}$ is the MSE between continuous output reconstructions.

Inference Cost

Combined student ensemble inference (all three models):

\[\text{FLOPs}_{\text{inference}} = (55\text{ K} + 86\text{ K} + 32\text{ K}) \times 2 = 346\text{ K FLOPs}\] \[\text{Latency}_{\text{inference}} = \frac{346 \text{ K FLOPs}}{50 \text{ MFLOP/s}} \times 0.7 \text{ (sparsity factor)} = 4.8\text{ ms}\]

This provides $\approx 2.2$ ms safety margin within the 7 ms CAN message cycle (100 Hz), accounting for context switches, cache misses, and interrupt handling.

Reproducibility

The reported parameter budgets are FP32 for the deployed students. Training uses mixed precision (16-mixed) for the GAT and VGAE stages, and FP32 for the fusion stage, whose compute footprint is small enough that mixed precision yields no benefit. INT8 quantization on student models (providing $\approx 2.1\times$ speedup on ARM Cortex-A7) would enable approximately $2.5\times$ parameter expansion while maintaining the same 4.8 ms inference latency.

Diagrams

Figure 1: CAN bus graph representation. Nodes are arbitration IDs, edges are temporal co-occurrence.

Figure 2: GAT attention layer with multi-head structure. Click a head to inspect its internal mechanism.

Figure 3: GAT classifier architecture.

Figure 4: VGAE anomaly detector.

Figure 5: VGAE knowledge distillation.

Figure 6: GAT knowledge distillation.

Figure 7: Full KD-GAT pipeline: input graph, teacher ensemble, student distillation, bandit fusion.

Data Figures

Figure 8: UMAP embedding of GAT graph-level representations.

Figure 9: VGAE reconstruction error decomposition.

Figure 10: Bandit fusion weight distribution.

Figure 11: GAT attention graph visualization.

Figure 12: CKA similarity between teacher and student layers.