Controller Area Network (CAN) intrusion detection in modern vehicles must operate under diverse attack types, severe class imbalance, and strict hardware constraints. We propose a multi-stage, multi-expert ensemble graph framework that models CAN traffic using structural, temporal, and distributional cues. A Variational Graph Autoencoder (VGAE) learns normal graph structure and guides targeted training of a Graph Attention Network (GAT) classifier. A bandit-based policy adaptively fuses experts per sample. To enable deployment, a resource-aware intelligent knowledge-distillation (KD) pipeline compresses the ensemble into lightweight students, while curriculum imbalance training enhances rare-attack detection.
This paper presents a three-stage framework for robust intrusion detection on the Controller Area Network (CAN) bus. Subsequent sections detail the background, methodology, experimental setup, results, and ablation studies that validate our approach across six publicly available CAN intrusion datasets.
Modern vehicles rely on networks of electronic control units (ECUs) to manage everything from engine functions to advanced driver assistance systems (ADAS). Communication between ECUs is typically handled by the Controller Area Network (CAN) protocol, valued for its reliability and cost-effectiveness in in-vehicle networks (IVNs). However, CAN lacks built-in security mechanisms like encryption and authentication, as it was designed under the assumption of a closed, isolated network. With the introduction of on-board diagnostics (OBD) ports and wireless connectivity (e.g., Wi-Fi, cellular, V2X), access to the CAN bus has expanded significantly, opening new attack surfaces. Attacks may now originate from both physical interfaces (OBD-II, USB) and remote channels (Bluetooth, mobile networks), allowing adversaries to inject malicious messages and potentially disrupt or take control of safety-critical vehicle systems.
To counter these threats, intrusion detection systems (IDS) for CAN have become an area of active research. Traditional IDS approaches fall into two main categories: packet-based and window-based methods. Packet-based IDSs analyze individual CAN messages for quick detection, but cannot capture context or correlations across packets, limiting their effectiveness against complex attacks such as spoofing or replay. Window-based IDSs consider sequences of packets, enabling better detection of such attack patterns, but often face challenges with detection delays and performance under low-volume or replay attacks. Recent efforts address these limitations with statistical approaches using graph models, advanced machine learning techniques such as deep convolutional neural networks (DCNNs), and lightweight classifiers. Other studies leverage temporal or dynamic graph features for high-accuracy detection of diverse attack types. Despite strong results—for example, graph neural network (GNN) and variational autoencoder (VAE)-based systems achieving over 97% accuracy—key challenges remain that prevent real-world deployment.
CAN intrusion detection reveals a fundamental tension in adversarial learning: high accuracy on known attack types often correlates with brittle generalization to diverse, imbalanced, and resource-constrained settings. We identify three core challenges that motivate our work:
Challenge 1: No Single Model Captures All Attack Patterns. Different attacks exploit distinct vulnerabilities requiring different detection mechanisms. Structural anomalies (e.g., message flooding) require relational awareness, where graph-based approaches excel, but can miss isolated point anomalies. Distributional anomalies (e.g., signal spoofing) require learning normal signal distributions, where autoencoders succeed, but struggle with coordinated attacks. Moreover, CAN traffic is heavily class-imbalanced, with malicious frames occurring rarely (ratios of 36:1 to 927:1 across datasets), leading to biased models and poorly calibrated predictions. Single models cannot overcome this without excessive overfitting; heterogeneous ensembles with complementary inductive biases naturally handle rare events better.
Challenge 2: Models Must Fit on Embedded Devices. Automotive gateways operate under strict resource constraints: typically ARM Cortex-A7/A53 processors with 256–512 MB RAM, power budgets of ${\sim}100$ mW allocated to IDS, and latency requirements of 50–100 ms for real-time response. Academic research operates at GPU scale with models exceeding millions of parameters, but practical deployment requires architectures orders of magnitude smaller. This resource-efficiency challenge is often treated as secondary in the research literature, but represents a critical barrier to real-world adoption.
Challenge 3: Black-Box Models Reduce Trust and Adoption. Highly accurate models face systematic rejection in safety-critical systems because operators cannot understand or verify decisions. ISO 26262 automotive functional safety mandates verification and validation of safety-critical functions, where IDS functions typically receive ASIL C–D classification. Black-box AI models alone cannot satisfy this requirement. Beyond regulation, industry adoption faces a trust paradox: organizations systematically choose less accurate but interpretable models over superior black-box alternatives.
These three challenges are often addressed independently. This work takes the position that these challenges are interdependent: an ensemble that adaptively fuses complementary experts can be more robust (through diverse inductive biases), more efficient (through knowledge distillation scaled to hardware constraints), and more interpretable (through learned weighting patterns and component-level analysis) than a single monolithic model.
To address these challenges, we propose a multi-stage graph neural network (GNN)-based framework that combines a Variational Graph Autoencoder (VGAE) for unsupervised anomaly detection with a Graph Attention Network (GAT) for supervised attack classification. A Deep Q-Network (DQN) learns to adaptively weight these experts on a per-sample basis, selecting the most informative representation for each message context. The ensemble is distilled into a lightweight student model suitable for embedded deployment via knowledge distillation, while a curriculum learning training strategy improves robustness under severe class imbalance.
Key design decisions reflect this framing:
The main contributions of this research are as follows:
Intrusion detection systems (IDS) for in-vehicle CAN networks can be classified by detection scope, data type, and underlying detection paradigm. We organize prior work along these dimensions to highlight the unique contributions of our approach.
Packet-Based Approaches. Packet-based IDSs analyze individual CAN frames for fast, lightweight detection, but cannot capture dependencies across messages, limiting effectiveness against sophisticated attacks such as spoofing or replay. For example,
Window-Based Approaches. Window-based IDSs analyze sequences of CAN frames, enabling better temporal correlation analysis.
Graph-Based Approaches. Graph-based IDSs better capture ECU communication patterns by modeling message relationships.
A recent survey
Ensemble methods for automotive IDS typically employ homogeneous models or sequential fusion. The BEPCD framework
Knowledge distillation (KD) addresses the deployment gap between high-capacity models and resource-constrained automotive hardware. A comprehensive survey of KD methods for GNNs
In broader cyber-physical systems (CPS) contexts, frameworks such as DGI-RBM
The key innovation distinguishing our work is adaptive decision-level fusion via reinforcement learning. Rather than static fusion strategies (voting, concatenation, or fixed weighting), we treat VGAE and GAT as independent experts with complementary strengths and use a DQN policy to learn sample-specific weights that adaptively select the most informative representation for each message context. Unlike LSF-IDM’s NLP-based distillation or Meta-IDS’s single-model adaptation, our approach combines heterogeneous graph experts with learned decision-level fusion and hardware-aware distillation. This enables graceful degradation when one expert is unreliable and provides interpretability through learned weighting patterns.
Additionally, our hardware-aware knowledge distillation pipeline is explicitly scaled to automotive constraints (ARM Cortex-A7/A53, 256–512 MB RAM, 100 mW power budget), curriculum learning for class imbalance directly addresses the severe data imbalance (up to 927:1 ratios), and multi-dataset evaluation across six publicly available benchmarks demonstrates strong generalization and transferability.
| Framework | Domain | Model | Detection | Fusion | Datasets | Adaptive | Key Gap |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BEPCD
This section covers fundamental concepts of the CAN protocol, GNNs, VGAE, DQN, and knowledge distillation.
The CAN is a robust serial protocol enabling real-time communication between ECUs in vehicles. In a CAN bus, nodes broadcast messages, while receivers filter and process relevant ones. Each CAN data frame includes a Start-of-Frame, Arbitration, Control, Data, CRC, Acknowledgment, and End-of-Frame field.
A graph is a data structure consisting of a set of nodes $V$ and a set of edges $E$ that connect pairs of nodes. A graph can be defined as $G = (V,E)$, where $V = \{v_1, v_2, \ldots, v_n\}$ is a node set with $n$ nodes, and $E = \{e_1, e_2, \ldots, e_m\}$ is an edge set with $m$ edges.
Given this graph structure, a GNN looks to find meaningful relationships and insights of the graph. The most common way to accomplish this is through the message passing framework
\(\begin{equation}\label{eq-message-passing} \mathbf{h}_v^{(k)} = \phi\big(\mathbf{h}_v^{(k-1)},\ \oplus_{u \in \mathcal{N}(v)} \psi(\mathbf{h}_v^{(k-1)}, \mathbf{h}_u^{(k-1)}, \mathbf{e}_{vu})\big) \end{equation}\)
where $\mathbf{h}$ is the node feature embedding, $\phi$ is the node update function, $\psi$ the message function, $\mathbf{e}_{vu}$ the edge feature, $\oplus$ an aggregation (sum/mean), and $\mathcal{N}(v)$ the neighbors of $v$.
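A minimal pure-Python sketch of one round of Eq. \eqref{eq-message-passing} on a toy graph may help fix the notation. Here $\psi$ simply forwards the neighbor's feature, $\oplus$ is a sum, and $\phi$ averages the previous state with the aggregate; all of these choices, and the scalar features, are illustrative simplifications.

```python
def message_passing_round(h, edges):
    """One message-passing round with scalar node features.
    psi(h_u) = h_u, aggregation (+) is a sum, phi is a convex update."""
    agg = {v: 0.0 for v in h}
    for u, v in edges:            # edge u -> v delivers the message psi(h_u)
        agg[v] += h[u]            # sum aggregation over N(v)
    return {v: 0.5 * h[v] + 0.5 * agg[v] for v in h}   # phi: node update

h0 = {"A": 1.0, "B": 2.0, "C": 4.0}
edges = [("A", "B"), ("C", "B"), ("B", "C")]
h1 = message_passing_round(h0, edges)
# B aggregates A and C: 0.5 * 2.0 + 0.5 * (1.0 + 4.0) = 3.5
```

In a real GNN, $\phi$ and $\psi$ are learned functions over vector features, but the data flow is identical.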
GAT
\(\begin{equation}\label{eq-gat-attention} \alpha_{vu} = \mathrm{softmax}\left( \mathrm{LeakyReLU}\left( \mathbf{a}^\top \left[ \mathbf{W}\mathbf{h}_v \| \mathbf{W}\mathbf{h}_u \right] \right) \right) \end{equation}\)
where $\mathbf{a}$ is the learnable attention parameter vector, $\mathbf{W}$ is a shared weight matrix, and $\|$ denotes concatenation of the projected node feature vectors.
The attention function computes a scalar weight $\alpha_{vu}$ for each neighbor $u \in \mathcal{N}(v)$, which reflects the importance or relevance of node $u$'s features to node $v$.
\(\begin{equation}\label{eq-gat-update} \mathbf{h}_v^{(k)} = \sigma\left( \sum_{u \in \mathcal{N}(v)} \alpha_{vu} \mathbf{W} \mathbf{h}_u^{(k-1)} \right) \end{equation}\)
where $\sigma$ is the activation function, normally ELU or ReLU. GATv2
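The attention computation of Eq. \eqref{eq-gat-attention} can be sketched with scalar features: raw scores pass through LeakyReLU and are softmax-normalized over the neighborhood. The toy parameters `a` and `W` below stand in for the learned $\mathbf{a}$ and $\mathbf{W}$ and are assumptions for illustration only.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x >= 0 else slope * x

def gat_attention(h_v, neighbor_feats, a=(1.0, 1.0), W=1.0):
    """Attention weights alpha_vu over the neighbors of v (scalar features)."""
    # e_vu = LeakyReLU(a^T [W h_v || W h_u])
    e = [leaky_relu(a[0] * W * h_v + a[1] * W * h_u) for h_u in neighbor_feats]
    m = max(e)
    exps = [math.exp(x - m) for x in e]   # numerically stable softmax
    s = sum(exps)
    return [x / s for x in exps]

alpha = gat_attention(1.0, [1.0, 3.0])
# the weights form a probability distribution and favor the higher-scoring neighbor
```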
The Jumping Knowledge (JK) module
\(\begin{equation}\label{eq-jk-lstm} \mathbf{h}_v^{\text{final}} = \text{LSTM-Attn}\!\left( \mathbf{h}_v^{(1)}, \mathbf{h}_v^{(2)}, \dots, \mathbf{h}_v^{(L)} \right) \end{equation}\)
Unlike concatenation-mode JK, which applies the same linear combination to all nodes and increases the output dimension to $L \times d$, LSTM-mode JK learns a per-node adaptive combination. This allows each CAN node (ECU) to draw information from the most informative depth while keeping the output dimension at $d$, reducing parameters in the downstream classifier.
The Variational Graph Autoencoder (VGAE)
The encoder approximates the posterior distribution over the latent variables $Z = \{z_1, \ldots, z_n\}$ by assuming a Gaussian distribution for each node:
\(\begin{equation}\label{eq-vgae-encoder} q(Z|X, A) = \prod_{i=1}^{n} \mathcal{N}(z_i|\mu_i, \mathrm{diag}(\sigma_i^2)) \end{equation}\)
where $\mu_i \in \mathbb{R}^d$ and $\sigma_i \in \mathbb{R}^d$ are the mean and standard deviation vectors for node $i$. These are parameterized by two separate GCN layers:
\(\begin{equation}\label{eq-vgae-gcn-params} \mu = \mathrm{GCN}_\mu(X, A), \quad \log \sigma = \mathrm{GCN}_\sigma(X, A) \end{equation}\)
which capture both local topology and node features. The outputs of these GCNs define the variational posterior $q(Z|X, A)$.
The decoder attempts to reconstruct the graph structure by computing the probability of edge existence between any two nodes $i$ and $j$ as:
\(\begin{equation}\label{eq-vgae-decoder} p(A|Z) = \prod_{i=1}^{n} \prod_{j=1}^{n} \sigma(z_i^\top z_j) \end{equation}\)
where $\sigma(\cdot)$ here denotes the sigmoid function (distinct from the activation in Eq. \eqref{eq-gat-update}) and $z_i^\top z_j$ measures similarity in latent space. This inner product decoder encourages connected nodes to have similar embeddings.
The training objective is to maximize the variational evidence lower bound (ELBO), which consists of a reconstruction term and a regularization term:
\(\begin{equation}\label{eq-elbo} \mathcal{L} = \mathbb{E}_{q(Z|X, A)}[\log p(A|Z)] - \mathrm{KL}[q(Z|X, A) \| p(Z)] \end{equation}\)
where the first term encourages accurate reconstruction of the observed adjacency matrix, and the second term is the Kullback-Leibler divergence between the approximate posterior and the prior $p(Z) = \prod_{i=1}^{n} \mathcal{N}(z_i|0, I)$, promoting regularization and disentangled latent representations.
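Both ELBO components can be checked numerically. The sketch below (pure Python, illustrative names) implements the inner-product decoder of Eq. \eqref{eq-vgae-decoder} and the closed-form KL divergence between a diagonal Gaussian and the standard-normal prior:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def edge_prob(z_i, z_j):
    # Inner-product decoder: p(A_ij = 1 | Z) = sigmoid(z_i . z_j)
    return sigmoid(sum(a * b for a, b in zip(z_i, z_j)))

def kl_to_std_normal(mu, log_sigma):
    # Closed form of KL( N(mu, diag(sigma^2)) || N(0, I) ) for one node
    return sum(0.5 * (m * m + math.exp(2 * s) - 1.0) - s
               for m, s in zip(mu, log_sigma))

p = edge_prob([1.0, 0.0], [1.0, 0.0])          # similar embeddings -> likely edge
kl = kl_to_std_normal([0.0, 0.0], [0.0, 0.0])  # posterior equals prior -> KL = 0
```

The KL term vanishing exactly when $\mu = 0$, $\sigma = 1$ is the sanity check that the regularizer pulls the posterior toward the prior.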
While VGAE effectively captures global graph structure, its full-graph decoding may be suboptimal for detecting localized anomalies, especially in sparse or noisy graphs. To address this,
Deep Q-Networks (DQNs) combine Q-learning with neural networks to handle high-dimensional state spaces
\(\begin{equation}\label{eq-bellman} L(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right] \end{equation}\)
where $\mathcal{D}$ is the experience replay buffer, $\theta$ represents the current network weights, $\theta^-$ are the target network weights, $\gamma$ is the discount factor, and $r$ is the observed reward. Stabilization techniques include experience replay (sampling uniformly from past transitions) and a periodically updated target network $\theta^-$.
Knowledge Distillation (KD), popularized by
Concretely, given an input $x$, the teacher produces a vector of logits $s^t(x)$, which are converted into a softened distribution $\tilde{p}^t_k(x)$ via temperature scaling $\tau$:
\(\begin{equation}\label{eq-temperature-scaling} \tilde{p}^t_k(x) = \frac{\exp(s^t_k(x)/\tau)}{\sum_j \exp(s^t_j(x)/\tau)} \end{equation}\)
The student is trained to match these probabilities by minimizing the Kullback-Leibler divergence between teacher and student distributions (distillation loss), alongside the standard supervised classification loss:
\(\begin{equation}\label{eq-kd-total-loss} \mathcal{L}_{\text{total}} = (1 - \lambda) \cdot \mathcal{L}_{\text{hard}} + \lambda \cdot \mathcal{L}_{\text{KD}} \end{equation}\)
where $\lambda$ balances the contribution of teacher supervision ($\mathcal{L}_{\text{KD}}$) and ground truth ($\mathcal{L}_{\text{hard}}$). Higher $\lambda$ places more weight on the soft targets from the teacher.
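A compact sketch of Eqs. \eqref{eq-temperature-scaling}–\eqref{eq-kd-total-loss} follows. The $\tau^2$ rescaling of the soft term is the standard gradient-scale correction in temperature-based KD; the function names and two-class example are assumptions for illustration.

```python
import math

def softmax_t(logits, tau):
    """Temperature-scaled softmax (Eq. temperature-scaling)."""
    m = max(l / tau for l in logits)
    exps = [math.exp(l / tau - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, true_idx, tau=4.0, lam=0.7):
    p_t = softmax_t(teacher_logits, tau)   # softened teacher targets
    p_s = softmax_t(student_logits, tau)   # softened student predictions
    # soft term: KL(teacher || student), rescaled by tau^2
    l_kd = (tau ** 2) * sum(t * math.log(t / s) for t, s in zip(p_t, p_s))
    # hard term: cross-entropy against the ground-truth label at tau = 1
    l_hard = -math.log(softmax_t(student_logits, 1.0)[true_idx])
    return (1 - lam) * l_hard + lam * l_kd
```

When the student reproduces the teacher's logits exactly, the KD term is zero and only the hard-label term remains.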
The proposed framework employs a three-stage pipeline for robust intrusion detection in Controller Area Network (CAN) bus systems. Stage 1 uses a Variational Graph Autoencoder (VGAE) to identify hard examples; Stage 2 trains a Graph Attention Network (GAT) with curriculum learning on filtered samples; Stage 3 leverages a Deep Q-Network (DQN) to learn dynamic fusion weights combining VGAE and GAT predictions. The workflow supports both training (sequential stages) and inference (parallel GAT/VGAE outputs fused by DQN).
CAN messages are broadcast by Electronic Control Units (ECUs); CAN IDs identify message types and are not unique per packet—multiple ECUs can transmit the same ID, and any ECU can receive all messages. This broadcast model underpins the graph representation, capturing sequential dependencies within the CAN stream.
Algorithm 1: Graph Construction from CAN Stream
Input: CAN stream $M = \{m_t = (\text{ID}_t, \text{payload}_t)\}$, window size $W$
Output: Graphs $\mathcal{G} = \{G_t = (V_t, E_t, X_t, y_t)\}$
1. for $t = W$ to $\lvert M \rvert$ do
2. $\quad W_t \leftarrow M[t-W+1 : t]$ // extract window
3. $\quad \text{source} \leftarrow W_t[:, -3]$; $\text{target} \leftarrow W_t[:, -2]$ // CAN IDs
4. $\quad \text{edges} \leftarrow \text{stack}(\text{source}, \text{target})$
5. $\quad (\text{unique\_edges}, \text{counts}) \leftarrow \text{unique}(\text{edges})$ // transitions
6. $\quad V_t \leftarrow \text{unique}(\text{source} \cup \text{target})$ // unique nodes
7. $\quad \text{node\_map} \leftarrow \{v \mapsto \text{idx} \mid v \in V_t\}$ // node indexing
8. $\quad E_t \leftarrow [(\text{node\_map}[\text{src}], \text{node\_map}[\text{tgt}])$ for $(\text{src}, \text{tgt}) \in \text{unique\_edges}]$
9. $\quad$ Compute node features: $X_t \in \mathbb{R}^{\lvert V_t \rvert \times 35}$
10. $\quad$ Compute edge features: $F_t \in \mathbb{R}^{\lvert E_t \rvert \times 11}$
11. $\quad y_t \leftarrow 1$ if any attack ID $\in W_t$ else $0$ // label
12. end for
Node features (35 dimensions) are computed via Polars group-by aggregation over each node’s message occurrences within the window. They comprise: per-byte statistics (mean, standard deviation, and range for each of 8 payload bytes; 24 features), temporal and statistical summaries (message count, mean Shannon entropy, skewness, kurtosis; 4 features), graph-structural properties (clustering coefficient, split-half ratio, change rate; 3 features), inter-arrival time statistics (mean and standard deviation; 2 features), and degree (in-degree, out-degree; 2 features). Edge features (11 dimensions) comprise: inter-arrival time between the source and target messages, per-byte absolute differences across 8 payload bytes, a bidirectionality flag indicating whether the reverse edge exists, and edge frequency (transition count). Window size $W=100$ balances temporal context and computational efficiency. Graphs are directed and weighted by transition counts; self-loops occur for consecutive identical IDs.
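The core of the windowed graph construction can be sketched in a few lines. This is a deliberate simplification: it links consecutive CAN IDs rather than reading the explicit source/target columns of Algorithm 1, and it omits the feature computation; the self-loop behavior for repeated IDs matches the description above.

```python
from collections import Counter

def window_graph(ids, W):
    """Directed, count-weighted transition graph over the last W CAN IDs."""
    win = ids[-W:]
    edges = Counter(zip(win, win[1:]))          # transition -> count (edge weight)
    nodes = sorted(set(win))                    # unique nodes in the window
    node_map = {v: i for i, v in enumerate(nodes)}
    edge_index = [(node_map[s], node_map[t]) for s, t in edges]
    weights = list(edges.values())
    return nodes, edge_index, weights

nodes, ei, w = window_graph(["0x101", "0x1A0", "0x101", "0x101"], W=4)
# the repeated trailing 0x101 produces a self-loop edge
```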
The continuous components of the 35-dimensional node tensor and the 11-dimensional edge tensor are z-score standardized before entering any of the three stages. We fit the per-feature mean $\mu$ and standard deviation $\sigma$ on benign training rows only ($y_t = 0$) rather than on the full benign+attack training mixture. Standardization at inference replays the cached $(\mu, \sigma)$ produced at cache-build time, so the inference and training coordinate frames coincide.
This is not the field convention. Deep-learning pipelines on NSL-KDD, CIC-IDS-2017, and UNSW-NB15 routinely fit the input scaler on the full labelled training set, treating the standardization step as a generic preprocessing transform rather than a stage that interacts with the threat model. The same pattern shows up in general anomaly-detection benchmarks and toolkits
The conceptual argument for benign-only fitting is direct: a model whose training objective is to detect deviations from normal traffic should have its input frame defined by normal traffic alone. Joint fitting embeds the training-attack distribution into the per-feature $(\mu, \sigma)$ used to standardize every subsequent input — including benign and novel-attack inputs at inference. This violates the one-class promise that the unsupervised stage relies on, and it propagates into the supervised and fusion stages because all three consume the same cached scaler. The closest published precedent for an explicit benign-only filter is Donut
The deployment-side argument adds force. The four-quadrant intrusion-detection deployment matrix—known-vehicle/known-attack, known-vehicle/unknown-attack (zero-day), unknown-vehicle/known-attack, and unknown-vehicle/unknown-attack—makes novel-attack detection the dominant operational risk for an IDS, since cataloguing every future attack is, by definition, impossible. The closed-world critique for supervised IDS models has been on the record for fifteen years
We measured the empirical magnitude of the contamination on this paper’s largest training partition (set_01 of can-train-and-test
This decision does not address cross-vehicle generalization (the “unknown-vehicle / known-attack” quadrant). When the deployment vehicle’s benign distribution differs from the training vehicle’s benign distribution, the right intervention is test-time adaptation of normalization statistics on deployment-vehicle benign traffic
A robust-statistics variant (median + interquartile range computed on the same benign training rows) is retained in the codebase as a sensitivity ablation: CAN benigns can be heavy-tailed (entropy spikes during legitimate diagnostic broadcasts; bursts on power-cycle events) and the question of whether mean+std or median+IQR better tracks the bulk of benign activity is empirical, not principled. We report results for both estimators in the ablation study. Independent of the estimator family, the fitting population—benign training rows—is fixed by the argument above. We do not refit any preprocessing component on test data, including its benign fraction;
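The benign-only fitting policy reduces to a small amount of code. The sketch below (illustrative names, mean/std estimator only) fits $(\mu, \sigma)$ on rows with $y_t = 0$ and replays the cached statistics at inference, exactly the discipline argued for above:

```python
import math

def fit_scaler(rows, labels):
    """Fit per-feature (mu, sigma) on benign training rows only (label == 0)."""
    benign = [r for r, y in zip(rows, labels) if y == 0]
    d = len(benign[0])
    mu = [sum(r[j] for r in benign) / len(benign) for j in range(d)]
    var = [sum((r[j] - mu[j]) ** 2 for r in benign) / len(benign)
           for j in range(d)]
    sigma = [math.sqrt(v) if v > 0 else 1.0 for v in var]  # guard constant features
    return mu, sigma

def transform(row, mu, sigma):
    # Inference replays the cached (mu, sigma); nothing is refit on test data
    return [(x - m) / s for x, m, s in zip(row, mu, sigma)]

rows = [[1.0], [3.0], [100.0]]        # last row is an attack outlier
labels = [0, 0, 1]
mu, sigma = fit_scaler(rows, labels)  # attack row excluded from the fit
z = transform([100.0], mu, sigma)     # outlier stays far outside the benign frame
```

Had the attack row been included in the fit, the inflated $\mu$ and $\sigma$ would have pulled the outlier toward the benign region, which is precisely the contamination effect described above.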
Each node in the graph is identified by its 11- or 29-bit CAN arbitration ID, which we treat as a categorical feature with a learnable embedding concatenated to the 35-dimensional statistical node vector. Naive use of a per-split identity-lookup embedding is unsafe in the CAN IDS setting: the dominant threat models in every public benchmark—injection, fuzzing, and spoofing—inject previously unseen arbitration IDs as the core attack signal, so a per-split lookup table raises an IndexError on the first attack-injected ID.
To close this hole we (i) construct a single shared vocabulary from the union of all source directories across train, validation, and every test subdirectory at cache-construction time, persist it as an invariant alongside the cache metadata, and (ii) reserve index 0 of the embedding table as a learnable UNK slot that absorbs any genuinely unseen ID encountered at deployment. The three-option design space—informed by 2021–2026 industrial recsys practice for sparse categorical features—trades off gradient coverage against collision control:
The recsys literature has converged on hashed or multiplexed embedding tables as the default treatment for large or dynamic vocabularies; two independent 2023 surveys taxonomize the design space across hash, compositional, and learned-hash families and frame pure identity-lookup embeddings as a legacy choice that does not survive dynamic vocabularies
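The shared-vocabulary-plus-UNK mechanism described in (i) and (ii) can be sketched as follows; the helper names and hex IDs are illustrative, not taken from the codebase:

```python
def build_vocab(all_ids):
    """Shared CAN-ID vocabulary over the union of all splits.
    Index 0 is reserved as a learnable UNK slot for unseen IDs."""
    return {cid: i + 1 for i, cid in enumerate(sorted(set(all_ids)))}

def lookup(vocab, cid):
    return vocab.get(cid, 0)   # unseen arbitration IDs map to UNK, never crash

vocab = build_vocab(["0x101", "0x1A0", "0x2F0"])
idx_known = lookup(vocab, "0x1A0")
idx_unseen = lookup(vocab, "0x7FF")   # e.g. a fuzzing-injected ID at deployment
```

Because index 0 is never assigned to a known ID, the UNK embedding row receives gradient only from genuinely out-of-vocabulary inputs.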
Algorithm 2: VGAE-Based Hard Sample Selection
Input: Trained VGAE model on normal graphs
Output: Hard-selected training dataset for Stage 2
1. Train VGAE on normal graphs until convergence
2. for each normal graph $G_i$ do
3. $\quad R_i \leftarrow \lVert \mathbf{A}_i - \hat{\mathbf{A}}_i \rVert_F^2 / \lvert V_i \rvert^2$ // reconstruction error
4. end for
5. Rank by $R_i$ descending; select top-$k$ as hard negatives
6. Combine hard normal samples with all attack samples
High reconstruction error indicates ambiguous or boundary-proximate normal samples. Selective undersampling preserves discriminative hard examples while reducing majority class dominance.
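The ranking step of Algorithm 2 is a one-liner in practice; a minimal sketch with toy error values:

```python
def select_hard_negatives(recon_errors, k):
    """Indices of the k normal graphs with the highest reconstruction error R_i."""
    order = sorted(range(len(recon_errors)),
                   key=lambda i: recon_errors[i], reverse=True)
    return order[:k]

recon = [0.01, 0.90, 0.05, 0.40]     # per-graph normalized errors (illustrative)
hard_idx = select_hard_negatives(recon, k=2)
# the two most boundary-proximate normals are kept as hard negatives
```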
GATv2 Attention. All graph convolution layers use GATv2Conv
Curriculum Learning: Momentum-based scheduler transitions from balanced to imbalanced sampling:
\(\begin{equation}\label{eq-curriculum-momentum} p_t = 1 - \exp(-t / \tau) \end{equation}\)
Batch composition blends three sources:
\(\begin{equation}\label{eq-batch-composition} B_t = (1 - p_t) B_{\text{bal}} + p_t B_{\text{nat}} + \alpha_{\text{buf}} B_{\text{hard}} \end{equation}\)
where $B_{\text{bal}}$ is class-balanced, $B_{\text{nat}}$ reflects natural imbalance, $B_{\text{hard}}$ contains highest-error samples from VGAE buffer ($\alpha_{\text{buf}} = 0.2$), and buffer is refreshed every 100 steps. This prevents premature majority bias while maintaining natural distribution awareness.
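A schematic rendering of Eqs. \eqref{eq-curriculum-momentum}–\eqref{eq-batch-composition} as integer per-batch quotas follows. The time constant $\tau = 200$ and the rounding scheme are assumptions; the paper fixes only $\alpha_{\text{buf}} = 0.2$.

```python
import math

def curriculum_p(t, tau=200.0):
    # p_t = 1 - exp(-t / tau): 0 at the start, approaching 1 over training
    return 1.0 - math.exp(-t / tau)

def batch_quota(batch_size, t, alpha_buf=0.2, tau=200.0):
    """Split one batch into balanced / natural / hard-buffer portions."""
    p = curriculum_p(t, tau)
    n_hard = int(round(alpha_buf * batch_size))   # fixed VGAE hard-buffer share
    rest = batch_size - n_hard
    n_nat = int(round(p * rest))                  # grows toward natural imbalance
    return {"balanced": rest - n_nat, "natural": n_nat, "hard": n_hard}

q_early = batch_quota(100, t=0)       # start: mostly class-balanced sampling
q_late = batch_quota(100, t=10_000)   # end: mostly natural (imbalanced) sampling
```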
Knowledge Distillation: Student GAT mimics the pre-trained teacher via logit-level distillation (Eq. \eqref{eq-kd-total-loss}), using temperature-scaled soft targets (Eq. \eqref{eq-temperature-scaling}) with $\tau=4$ and mixing coefficient $\lambda=0.7$. No intermediate feature distillation is applied.
GATv2 Attention. As with the VGAE encoder, all GAT convolution layers use GATv2Conv with the edge_dim parameter, incorporating the 11-dimensional edge attributes (inter-arrival time, per-byte differences, bidirectionality, edge frequency) into the attention computation. This enables attention-weighted message passing conditioned on both node and edge information.
LSTM Jumping Knowledge. Layer outputs are aggregated via LSTM-based Jumping Knowledge (Eq. \eqref{eq-jk-lstm}) rather than concatenation, enabling per-node adaptive depth selection while keeping the output dimension at $d$.
GPS Graph Transformer (Ablation). The local GATv2Conv layers can be replaced with GPS layers by setting conv_type="gps" in the pipeline configuration.
After Stages 1–2, a fusion agent learns optimal fusion weights combining VGAE and GAT predictions. Training uses ground truth labels to compute reward signals. We evaluate two fusion formulations—a Deep Q-Network (DQN) and a Neural-LinUCB contextual bandit—that share the same state space, action space, reward function, and MLP backbone architecture, differing only in their exploration and update mechanisms.
State Space: 15-dimensional feature vector aggregating VGAE and GAT outputs: VGAE reconstruction errors (node, neighbor, CAN ID levels), latent space statistics (mean, std, max, min), VGAE confidence; GAT class probabilities (class 0, class 1), embedding statistics (mean, std, max, min), GAT confidence. All features normalized and clipped to $[0,1]$.
Action Space: $K=21$ discrete fusion weights linearly spaced in $[0,1]$. Policy semantics: $\alpha = 0.5$ (equal weighting), $\alpha < 0.5$ (favor VGAE), $\alpha > 0.5$ (favor GAT). Fused anomaly score: $s = (1 - \alpha) \cdot \text{VGAE}_{\text{anomaly}} + \alpha \cdot \text{GAT}_{\text{prob}}$; final prediction $\hat{y} = \mathbb{1}[s > 0.5]$.
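The action space and fusion rule can be made concrete in a few lines (illustrative values for the two expert scores):

```python
def fuse(alpha, vgae_anomaly, gat_prob):
    """Fused score (1 - alpha) * VGAE + alpha * GAT, thresholded at 0.5."""
    score = (1.0 - alpha) * vgae_anomaly + alpha * gat_prob
    return score, int(score > 0.5)

actions = [k / 20 for k in range(21)]   # K = 21 weights linearly spaced in [0, 1]
score, y_hat = fuse(actions[15], vgae_anomaly=0.2, gat_prob=0.9)  # alpha = 0.75
# alpha > 0.5 leans on the GAT expert, pushing the fused score past threshold
```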
Reward Function: Directly tied to classification accuracy using ground truth labels:
\(\begin{equation}\label{eq-reward} R(\hat{y}, y_{\text{true}}, \mathbf{s}, \alpha) = \begin{cases} +3.0 + r_{\text{agree}} + r_{\text{conf}} & \text{if } \hat{y} = y_{\text{true}} \\ -3.0 + r_{\text{disagree}} + r_{\text{overconf}} & \text{if } \hat{y} \neq y_{\text{true}} \end{cases} \end{equation}\)
where $r_{\text{agree}}$ measures alignment between VGAE and GAT (model agreement bonus), $r_{\text{conf}}$ rewards high confidence on correct predictions, $r_{\text{disagree}}$ penalizes misalignment on errors, $r_{\text{overconf}}$ penalizes overconfidence on incorrect predictions, and an implicit balance bonus discourages extreme $\alpha$ values. Both DQN and bandit use this identical reward.
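An illustrative instance of Eq. \eqref{eq-reward} is sketched below. The $\pm 3.0$ base terms follow the equation, but the $0.5$ bonus coefficients and the agreement measure are assumptions, not the paper's exact shaping terms.

```python
def fusion_reward(y_hat, y_true, vgae_score, gat_prob, conf):
    """Accuracy-tied reward with agreement/confidence shaping (schematic)."""
    agree = 1.0 - abs(vgae_score - gat_prob)   # expert agreement in [0, 1]
    if y_hat == y_true:
        return 3.0 + 0.5 * agree + 0.5 * conf          # r_agree + r_conf
    return -3.0 - 0.5 * (1.0 - agree) - 0.5 * conf     # r_disagree + r_overconf

r_correct = fusion_reward(1, 1, 0.8, 0.9, conf=0.9)
r_wrong = fusion_reward(1, 0, 0.8, 0.9, conf=0.9)   # confident mistake penalized
```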
Shared Backbone: Both agents use an MLP backbone $f_\theta: \mathbb{R}^{15} \to \mathbb{R}^d$ (3 hidden layers, 128 units each, LayerNorm + ReLU + Dropout) that transforms the normalized state vector into a learned representation $\mathbf{z} = f_\theta(\mathbf{s})$.
The DQN extends the backbone with a linear output layer producing $K$ Q-values, one per discrete fusion weight. Because each CAN window graph is classified independently—the fusion decision for one window does not affect the next—the discount factor is set to $\gamma = 0$. This reduces the Bellman target (Equation \eqref{eq-bellman}) to pure reward maximization:
\(\begin{equation}\label{eq-dqn-loss} \mathcal{L}_{\text{DQN}}(\theta) = \mathbb{E}_{(s,a,r) \sim \mathcal{D}} \left[ \text{SmoothL1}\!\left( Q(s, a; \theta),\; r \right) \right] \end{equation}\)
where $\mathcal{D}$ is an experience replay buffer (capacity 50K) and SmoothL1 loss provides robustness to reward outliers.
Exploration: Epsilon-greedy with decaying exploration rate:
\(\begin{equation}\label{eq-epsilon-greedy} a_t = \begin{cases} \arg\max_{a} Q(s_t, a; \theta) & \text{with probability } 1 - \epsilon \\ \text{Uniform}(\{1, \ldots, K\}) & \text{with probability } \epsilon \end{cases} \end{equation}\)
where $\epsilon$ decays as $\epsilon \leftarrow \max(\epsilon_{\min},\; \epsilon \cdot \delta)$ after each episode ($\epsilon_0 = 0.2$, $\delta = 0.995$, $\epsilon_{\min} = 0.01$). Because each graph is treated as an independent episode and $\gamma = 0$ removes bootstrapping, the targets reduce to observed rewards and no separate target network is needed.
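With the stated hyperparameters, the exploration schedule of Eq. \eqref{eq-epsilon-greedy} can be traced directly:

```python
def epsilon_schedule(eps0=0.2, delta=0.995, eps_min=0.01, episodes=2000):
    """Trace of the decaying exploration rate across training episodes."""
    eps, trace = eps0, []
    for _ in range(episodes):
        trace.append(eps)
        eps = max(eps_min, eps * delta)   # geometric decay with a floor
    return trace

trace = epsilon_schedule()
# exploration decays geometrically, then flattens at eps_min
```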
Because the fusion decision for each graph is independent, the sequential MDP assumption underlying DQN is unnecessary. Neural-LinUCB
UCB Arm Selection: Given the backbone representation $\mathbf{z} = f_\theta(\mathbf{s})$, each arm’s score combines a reward estimate with an uncertainty bonus:
\(\begin{equation}\label{eq-bandit-ucb} a^* = \arg\max_{a \in \{1, \ldots, K\}} \left( \boldsymbol{\theta}_a^\top \mathbf{z} + \beta \sqrt{\mathbf{z}^\top \mathbf{A}_a^{-1} \mathbf{z}} \right) \end{equation}\)
where $\boldsymbol{\theta}_a = \mathbf{A}_a^{-1} \mathbf{b}_a$ are the per-arm weight vectors, $\mathbf{A}_a = \sum_{t: a_t = a} \mathbf{z}_t \mathbf{z}_t^\top + \lambda \mathbf{I}$ is the regularized design matrix, $\mathbf{b}_a = \sum_{t: a_t = a} r_t \mathbf{z}_t$ accumulates observed rewards, and $\beta$ controls exploration magnitude.
Closed-Form Linear Update: After observing reward $r_t$ for action $a_t$, the linear model is updated without gradient computation. The reward accumulator and precision matrix are updated jointly:
\(\begin{equation}\label{eq-bandit-accum} \mathbf{b}_{a_t} \leftarrow \mathbf{b}_{a_t} + r_t \, \mathbf{z}_t, \qquad \boldsymbol{\theta}_{a_t} \leftarrow \mathbf{A}_{a_t}^{-1} \mathbf{b}_{a_t} \end{equation}\)
The precision matrix inverse $\mathbf{A}_a^{-1}$ is maintained incrementally via the Sherman-Morrison formula:
\(\begin{equation}\label{eq-sherman-morrison} \mathbf{A}_{a_t}^{-1} \leftarrow \mathbf{A}_{a_t}^{-1} - \frac{(\mathbf{A}_{a_t}^{-1} \mathbf{z}_t)(\mathbf{A}_{a_t}^{-1} \mathbf{z}_t)^\top}{1 + \mathbf{z}_t^\top \mathbf{A}_{a_t}^{-1} \mathbf{z}_t} \end{equation}\)
This runs in $O(d^2)$ per sample with no gradient computation, making online updates substantially cheaper than DQN’s minibatch SGD.
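The rank-1 update of Eq. \eqref{eq-sherman-morrison} can be verified on a $2 \times 2$ example in pure Python (the design matrices here are symmetric, which is what lets the update use $(\mathbf{A}^{-1}\mathbf{z})$ on both sides):

```python
def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def sherman_morrison(A_inv, z):
    """Given A^{-1} (symmetric), return (A + z z^T)^{-1} in O(d^2)."""
    Az = mat_vec(A_inv, z)
    denom = 1.0 + sum(zi * ai for zi, ai in zip(z, Az))
    d = len(z)
    return [[A_inv[i][j] - Az[i] * Az[j] / denom for j in range(d)]
            for i in range(d)]

# Start from A = I (so A_inv = I) and observe context z = (1, 0):
A_inv_new = sherman_morrison([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
# New A = I + z z^T = diag(2, 1), so the inverse is diag(0.5, 1.0)
```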
Backbone Retraining: Periodically (every $N = 50$ episodes), the backbone parameters $\theta$ are updated via gradient descent on the replay buffer to improve the learned representation. After retraining, the linear models ($\mathbf{A}_a^{-1}$, $\mathbf{b}_a$, $\boldsymbol{\theta}_a$) are reset since the representation space has shifted.
Theoretical Motivation: Unlike epsilon-greedy exploration (which explores uniformly at random), the UCB term provides directed exploration—arms with high uncertainty receive higher scores, and this uncertainty shrinks as $O(1/\sqrt{n_a})$ with the number of times arm $a$ is pulled. Neural-LinUCB achieves $\tilde{O}(\sqrt{T})$ cumulative regret
At inference, VGAE and GAT execute in parallel to minimize latency. Their outputs are concatenated into the 15D state vector and passed to the fusion agent for dynamic weight determination and final prediction. Parallelization of VGAE and GAT ensures both models evaluate concurrently, while the fusion agent adds minimal overhead (single forward pass through a small fully-connected network). This design enables real-time deployment in resource-constrained CAN bus environments with sub-millisecond inference latency. Temporal extensions that introduce inter-window state transitions are discussed as future work.
This section presents the experimental setup, evaluation metrics, and insights into the datasets used in this study.
The performance of the model is evaluated using accuracy and F1-score. All experiments were conducted using PyTorch and PyTorch Geometric on GPU clusters provided by the Ohio Supercomputer Center (OSC).
Our proposed method is evaluated on three publicly available automotive CAN intrusion detection datasets, which together provide the six evaluation sets reported below; each offers distinct characteristics and challenges for comprehensive IDS evaluation.
This dataset contains CAN traffic from a Hyundai YF Sonata with four attack types: DoS, fuzzing, RPM spoofing, and gear spoofing. All attacks were conducted on a real vehicle, with data logged via the OBD-II port. The dataset includes 988,872 attack-free samples and approximately 16.6 million total samples across all attack types.
Collected from three vehicles (Chevrolet Spark, Hyundai YF Sonata, Kia Soul), this dataset enables scenario-based evaluation with three attack types: flooding (DoS), fuzzing, and malfunction (spoofing). The dataset is structured with 627,264 training samples and four testing subsets designed to evaluate IDS performance across known/unknown vehicles and known/unknown attacks.
The largest dataset, containing CAN traffic from four vehicles across two manufacturers (GM and Subaru). It provides nine distinct attack scenarios including DoS, fuzzing, systematic, various spoofing attacks, standstill, and interval attacks. The dataset is organized into four vehicle sets (set_01 to set_04) with over 192 million total samples. This dataset exhibits extreme class imbalance, with attack-free to attack sample ratios ranging from 36:1 to 927:1 across different subsets. Each set contains one training subset and four testing subsets following the known/unknown vehicle and attack paradigm.
Table 1 and Table 2 summarize the test set performance across the six datasets. We compare against four baselines, including the GNN-based KD-GAT as well as BEPCD and HyDL-IDS.
Our approach demonstrates consistent improvements across all datasets, with particularly significant gains on highly imbalanced datasets. Compared to KD-GAT, we achieve an average improvement of 2.09% in accuracy and 16.22% in F1-score. The most substantial improvements occur on challenging datasets S02 and S04, where F1-scores improve by 55.25% and 30.64% respectively, indicating superior handling of severe class imbalance.
| Model | Accuracy | Precision | Recall | F1 | AUC | Specificity | MCC |
|---|---|---|---|---|---|---|---|
| **Baselines** | | | | | | | |
| BEPCD | 0.9800 | 0.9700 | 0.9800 | 0.9700 | 0.9900 | – | – |
| HyDL-IDS | 0.9910 | 0.9900 | 0.9910 | 0.9900 | 0.9970 | – | – |
| **Ours** | | | | | | | |
| FUSION | 0.9995 | 1.0000 | 0.9960 | 0.9980 | 0.9996 | 1.0000 | 0.9977 |
| GAT | 0.9995 | 1.0000 | 0.9960 | 0.9980 | 1.0000 | 1.0000 | 0.9977 |
| VGAE | 0.8153 | 0.4144 | 0.9163 | 0.5707 | 0.9385 | 0.7996 | 0.5341 |
| Model | Scenario | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|
| GAT | test_01_known_vehicle_known_attack | 0.9990 | 0.9990 | 1.0000 | 0.9979 |
| GAT | test_02_unknown_vehicle_known_attack | 0.4052 | 0.5767 | 0.4052 | 1.0000 |
| GAT | test_03_known_vehicle_unknown_attack | 0.8186 | 0.6852 | 1.0000 | 0.5211 |
| GAT | test_04_unknown_vehicle_unknown_attack | 0.3915 | 0.5627 | 0.3915 | 1.0000 |
| VGAE | test_01_known_vehicle_known_attack | 0.6768 | 0.7492 | 0.5989 | 1.0000 |
| VGAE | test_02_unknown_vehicle_known_attack | 0.4052 | 0.5767 | 0.4052 | 1.0000 |
| VGAE | test_03_known_vehicle_unknown_attack | 0.6059 | 0.6578 | 0.4901 | 1.0000 |
| VGAE | test_04_unknown_vehicle_unknown_attack | 0.3915 | 0.5627 | 0.3915 | 1.0000 |
| FUSION | test_01_known_vehicle_known_attack | 0.9990 | 0.9990 | 1.0000 | 0.9979 |
| FUSION | test_02_unknown_vehicle_known_attack | 0.4052 | 0.5767 | 0.4052 | 1.0000 |
| FUSION | test_03_known_vehicle_unknown_attack | 0.8235 | 0.6963 | 1.0000 | 0.5341 |
| FUSION | test_04_unknown_vehicle_unknown_attack | 0.3915 | 0.5627 | 0.3915 | 1.0000 |
Class Imbalance Handling: Our multi-stage approach demonstrates superior performance on imbalanced datasets compared to single-stage methods. The VGAE component effectively captures structural anomalies even with limited attack samples, while the GAT classifier benefits from the refined feature representations. This combination proves particularly effective on datasets S02 and S04, where traditional methods struggle with extreme class ratios.
Generalization Capability: The consistent performance across diverse datasets (CarH, CarS, and can-train-and-test subsets) demonstrates strong generalization. Unlike previous methods that show significant performance degradation on unseen test data, our approach maintains robust detection capabilities across different attack types and network conditions.
To assess the contribution of different model configurations, we perform ablation experiments investigating three key variables: knowledge distillation, supervised learning training strategies (curriculum learning and hard sample mining), and fusion effectiveness, comparing standalone and fused architectures.
We adopt a one-factor-at-a-time (OFAT) design. The reference configuration fixes conv_type=gatv2, loss_fn=focal, and sampler=default (non-curriculum), and the supervised baseline for fusion uses VGAE pretraining with the focal-GAT downstream model. This isolates each axis's marginal effect over a single, consistent baseline rather than a moving target that drifts as earlier axes' winners propagate forward.
OFAT is efficient under a constrained GPU budget: across the five axes, the screening consumes roughly 16 variants per seed, compared to $3 \cdot 3 \cdot 3 \cdot 3 \cdot 4 = 324$ for a full factorial design. The trade-off is that OFAT cannot detect interaction effects between axes: if, say, conv_type=gps and loss_fn=weighted_ce happen to combine super-additively, that joint effect is invisible to our screening. We report interaction follow-ups, when motivated, as a targeted factorial over the top-2 candidates of each axis rather than a full grid expansion.
We run each ablation variant across $N = 3$ seeds (42, 123, 777); this is a deliberate screening-stage budget. For each axis, we report each variant's effect size $d$ relative to the reference together with its 95% confidence interval. Variants whose 95% CI on $d$ excludes zero and whose expected-max gap relative to the reference exceeds a pre-registered threshold are promoted to a confirmatory run with additional seeds. This two-stage protocol is standard practice for ablations under compute constraints and makes the transition from screening to confirmation explicit.
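A minimal sketch of this promotion rule. The threshold value is illustrative, and the expected-max criterion is simplified here to a minimum mean gap; we use a Student-t interval on the per-seed deltas since $N = 3$:

```python
import math

def screen(deltas, t_crit=4.303, min_gap=0.005):
    """Promote a variant if the 95% CI on its mean per-seed delta excludes
    zero AND the mean gap clears a pre-registered threshold.

    t_crit = 4.303 is the two-sided 97.5% Student-t quantile for df = 2
    (i.e., N = 3 seeds).
    """
    n = len(deltas)
    mean = sum(deltas) / n
    var = sum((x - mean) ** 2 for x in deltas) / (n - 1)
    se = math.sqrt(var / n)
    lo, hi = mean - t_crit * se, mean + t_crit * se
    ci_excludes_zero = (lo > 0) or (hi < 0)
    return ci_excludes_zero and abs(mean) > min_gap

# Consistent gains across 3 seeds -> promoted to a confirmatory run.
promote = screen([0.021, 0.018, 0.025])
```

Variants with high-variance or tiny gains fail one of the two conditions and stay at the screening stage.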
Awaiting data export for Knowledge Distillation ablation table.
To further characterize how knowledge transfers between teacher and student networks, we compute Centered Kernel Alignment (CKA) between all pairs of teacher and student layers. High CKA values indicate that corresponding layers learn similar representations despite the $20\times$ parameter reduction.
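Linear CKA has a closed form on centered activation matrices; a short sketch (shapes are illustrative):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activation matrices X (n x d1) and Y (n x d2):
    ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) after column-centering."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den
```

CKA is invariant to orthogonal transformations and isotropic scaling of the representations, which is what makes it suitable for comparing teacher and student layers of different widths.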
Awaiting data export for GAT training strategy ablation table.
| Metric | Value |
|---|---|
| Optimal threshold | 0.0435 |
| Youden's J | 0.7199 |
| F1 | 0.5707 |
| AUC | 0.9385 |
To understand the representations learned by our model, we perform UMAP-based feature analysis using both raw input statistics and learned graph embeddings. We sample 10% of graphs from the HCRL Car-Hacking dataset.
Raw CAN-graph data projected via UMAP shows loose clustering that indicates limited separability between normal and attack types. In contrast, UMAP projections of graph-level embeddings from the trained GAT classifier’s penultimate layer reveal well-separated clusters.
Despite binary supervision (attack vs. normal), the learned embedding space forms well-separated clusters aligned with specific attack types (DoS, Fuzzy, Gear, RPM). This emergent multi-class structure demonstrates that our model captures high-level semantic patterns in CAN traffic and generalizes across attack categories without explicit multi-class labels. The clear cluster separation in embedding space, absent in raw features, validates the GAT’s ability to learn discriminative representations from graph-structured temporal data.
To assess the overall reconstruction quality of the VGAE, we combine three types of reconstruction errors: node feature reconstruction error ($E_{\text{node}}$), neighborhood reconstruction error ($E_{\text{neighbor}}$), and CAN ID prediction error ($E_{\text{CAN\,ID}}$). Each error captures a different aspect of the graph structure and message semantics. We compute a single composite score as a weighted sum:
\(\begin{equation}\label{eq-composite-error} \mathrm{Composite\_Error} = \alpha\, E_{\text{node}} + \beta\, E_{\text{neighbor}} + \gamma\, E_{\text{CAN\,ID}} \end{equation}\)
where $\alpha$, $\beta$, and $\gamma$ are empirically chosen weights that regulate each term’s influence. In our experiments, we use $\alpha = 1.0$, $\beta = 20.0$, and $\gamma = 0.3$.
This approach enables the detection of subtle anomalies by jointly evaluating node content, CAN identifier semantics, and local neighborhood structure.
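With the weights above, the composite score reduces to a one-liner; the error values in the example are fabricated for illustration:

```python
def composite_error(e_node, e_neighbor, e_can_id,
                    alpha=1.0, beta=20.0, gamma=0.3):
    """Weighted sum of the three VGAE reconstruction errors (Eq. above)."""
    return alpha * e_node + beta * e_neighbor + gamma * e_can_id

# Illustrative values; a window is flagged anomalous when the score
# exceeds a threshold chosen on validation data (e.g., maximizing Youden's J).
score = composite_error(e_node=0.12, e_neighbor=0.004, e_can_id=0.30)
```

The large $\beta$ reflects that the neighborhood term is numerically small but highly diagnostic, so it must be up-weighted to influence the composite score.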
The learned DQN fusion policy exhibits interpretable, context-specific weighting that validates adaptive expert selection. Analysis reveals a strong correlation between VGAE anomaly scores and fusion weights: low VGAE scores cluster at $\alpha \approx 0$ (favoring the robust expert), while higher scores transition to intermediate weights, demonstrating the policy learned to default to VGAE’s out-of-distribution detection while conditionally leveraging GAT’s strength on known attacks. The multimodal distribution with peaks at $\alpha \approx 0, 0.2, 0.4, 0.6, 0.8$ indicates the DQN discovered distinct attack-type-specific strategies rather than learning fixed averaging ($\alpha = 0.5$). Critically, the divergence between normal and attack distributions validates meaningful anomaly detection logic.
To understand which CAN message relationships the GAT deems important, we visualize the learned attention weights on selected graphs. Edge width and opacity are proportional to the mean attention across heads for a given layer.
Temporal Independence Assumption. The current framework treats each CAN window graph independently: the fusion decision for one window does not consider the temporal context of preceding or subsequent windows. However, CAN bus attacks exhibit strong temporal structure: attacks span multiple windows, have characteristic onset and offset patterns, and evolve over time. Recent work on spatial-temporal graph methods for CAN intrusion detection, such as CGTS, has demonstrated the value of modeling these dependencies.
Incorporating temporal context would transform the fusion problem from a contextual bandit (where each decision is independent) into a genuine sequential decision process, justifying the DQN agent’s discount factor and target network. We identify this as the highest-impact architectural extension.
TransformerConv for Edge Feature Utilization. While GATv2Conv incorporates edge features into the attention computation, it does not condition the message values on edge attributes. TransformerConv conditions both the attention coefficients and the transmitted messages on edge attributes, making it a natural candidate for fuller use of our 11-dimensional edge features.
Scaled Cosine Reconstruction. GraphMAE shows that replacing MSE feature reconstruction with a scaled cosine error, which re-weights the loss toward hard-to-reconstruct dimensions, substantially improves graph autoencoding; adopting it for the VGAE's node-feature term is a promising refinement.
We introduced a novel multi-stage CAN intrusion detection framework combining Variational Graph Autoencoder and Graph Attention Network modules for robust anomaly detection and classification. Knowledge distillation enables a compact student model achieving 95% parameter reduction while maintaining strong performance. Extensive experiments across six benchmark datasets demonstrate significant improvements over existing methods, with average F1-score gains of 16.22% and exceptional performance on class-imbalanced scenarios. Our ablation study reveals that the standalone GAT classifier achieves comparable performance to fusion approaches with greater computational efficiency, making it ideal for resource-constrained automotive environments. These results highlight the promise of graph-based, multi-stage deep learning combined with knowledge distillation for practical automotive cybersecurity deployment.
80% of the dataset was utilized for training, 20% for validation, and a distinct test set was compiled by the dataset providers. All experiments were conducted using PyTorch and PyTorch Geometric. Model training and evaluation were performed on GPU clusters provided by the Ohio Supercomputer Center (OSC).
This section provides specific parameter budgets for the three-model student ensemble (GAT classifier, VGAE autoencoder, DQN fusion) derived from the CAN bus latency constraints, with unequal allocation reflecting architectural complexity and inference cost trade-offs.
From the CAN bus latency constraint (7 ms hard limit within the 100 Hz message cycle), we derive the total onboard parameter budget and allocate it across the three student models as follows.
Using the empirical distillation scaling law with target compression ratio $\kappa \approx 20$:
\[N_{t,\text{model}} \approx 20 \times N_{s,\text{model}} \quad \text{for each model}\]

Student ensemble members are not equally sized. The GAT classifier and VGAE autoencoder perform primary detection tasks and receive larger parameter budgets, while the DQN fusion model aggregates their outputs and receives a reduced allocation:
| Model | Student | Teacher | Compression |
|---|---|---|---|
| GAT Classifier | 55 K | 1.100 M | $20\times$ |
| VGAE Autoencoder | 86 K | 1.710 M | ${\approx}20\times$ |
| Fusion Agent | 32 K | 687 K | ${\approx}21\times$ |
| Total (Onboard) | 173 K | – | – |
| Total (Offline) | – | 3.497 M | – |
The student-teacher pairs share each model’s architectural family; the teacher differs by depth, width, or attention-head count rather than kind. Specific hyperparameters (channel widths, embedding dimensions, dropout, exploration rates) are tracked in the configs alongside the training code rather than reproduced here, since they are tuned per ablation.
Graph attention network over 35-dimensional node features and 11-dimensional edge attributes, built from GATv2Conv layers. Both student and teacher use LSTM-based Jumping Knowledge aggregation, which learns a per-node adaptive combination of layer representations and keeps the output dimension at $d$ (hidden $\times$ heads) rather than $L \times d$. The student is shallow — 2 GATv2Conv layers feeding a 2-layer FC classification head; the teacher expands to 3 GATv2Conv layers and a 4-layer FC head, providing higher representational capacity and softer knowledge targets during distillation.
Variational graph autoencoder for unsupervised anomaly detection, built from GATv2Conv encoder layers and a symmetric MLP decoder. The student progressively compresses 35-dimensional node features through a 3-layer single-head encoder ($[80 \to 40 \to 16]$) into a 16-dimensional latent space; the teacher widens to a 4-head encoder ($[480 \to 240 \to 64]$) and a 64-dimensional latent space for richer representation learning. Both employ the variational reparameterization trick; the CAN ID is separately classified from the latent representation rather than passed through the same reconstruction objective as the continuous features.
Fusion agent for multi-model state aggregation. The state vector concatenates VGAE outputs (3 reconstruction error components: node, neighbor, CAN ID; 4 latent statistics: mean, std, max, min of $\mathbf{z}$; 1 confidence score) and GAT outputs (2 class probabilities; 4 embedding statistics: mean, std, max, min; 1 confidence score) into a combined 15-dimensional input. Both DQN and Neural-LinUCB variants share an MLP backbone of 3 hidden layers, with 128 units in the student and 256 in the teacher, using LayerNorm and ReLU. The DQN variant trains with bootstrap-free TD targets ($\gamma = 0$, epsilon-greedy exploration); each graph is an independent episode, so no target network is needed. The bandit variant replaces the Q-network with per-arm ridge regression and UCB exploration. The action space $|A| = 21$ corresponds to uniformly spaced fusion weights $\alpha \in [0, 1]$.
The GAT and VGAE students train using knowledge distillation from their respective teachers. The distillation loss is a weighted combination of the task loss and the KD loss:
\(\begin{equation}\label{eq-appendix-kd-loss} L_{\text{total}} = \alpha \, L_{\text{KD}} + (1 - \alpha) \, L_{\text{task}} \end{equation}\)
where $\alpha = 0.7$ and $L_{\text{KD}}$ employs temperature-scaled softmax with temperature $T = 4.0$:
\[L_{\text{KD}} = T^{2} \cdot \mathrm{KL}\!\left( \mathrm{softmax}\!\left(\frac{\mathbf{z}_{t}}{T}\right) \;\middle\|\; \mathrm{softmax}\!\left(\frac{\mathbf{z}_{s}}{T}\right) \right)\]

where $\mathbf{z}_{s}$ and $\mathbf{z}_{t}$ denote student and teacher logits; in practice the KL term consumes the student's temperature-scaled log-probabilities, following the standard distillation convention. For the VGAE, distillation combines latent-space alignment and reconstruction matching:

\[L_{\text{KD}}^{\text{VGAE}} = 0.5 \, L_{\text{latent}} + 0.5 \, L_{\text{recon}}\]

where $L_{\text{latent}}$ is the MSE between student and teacher latent representations (with a learned projection when dimensions differ) and $L_{\text{recon}}$ is the MSE between continuous output reconstructions.
Combined student ensemble inference (all three models):
\[\text{FLOPs}_{\text{inference}} = (55\text{ K} + 86\text{ K} + 32\text{ K}) \times 2 = 346\text{ K FLOPs}\]

\[\text{Latency}_{\text{inference}} = \frac{346\text{ K FLOPs}}{50\text{ MFLOP/s}} \times 0.7 \text{ (sparsity factor)} \approx 4.8\text{ ms}\]

This provides an $\approx 2.2$ ms safety margin within the 7 ms CAN message cycle (100 Hz), accounting for context switches, cache misses, and interrupt handling.
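The arithmetic checks out directly, assuming the usual 2-FLOPs-per-parameter (one multiply-accumulate) approximation for a dense forward pass:

```python
# Latency estimate from the text: student parameter counts, a 50 MFLOP/s
# compute budget, and a 0.7 effective-sparsity factor.
params_k = 55 + 86 + 32                      # student ensemble params (thousands)
flops_k = params_k * 2                       # ~346 K FLOPs per inference
latency_ms = flops_k / 50_000 * 0.7 * 1000   # 50 MFLOP/s = 50,000 KFLOP/s
margin_ms = 7.0 - latency_ms                 # slack within the 7 ms cycle
```

This yields roughly 4.8 ms of latency and 2.2 ms of slack per 100 Hz message cycle.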
The reported parameter budgets are FP32 for the deployed students. Training uses mixed precision (16-mixed) for the GAT and VGAE stages, and FP32 for the fusion stage, whose compute footprint is small enough that mixed precision yields no benefit. INT8 quantization on student models (providing $\approx 2.1\times$ speedup on ARM Cortex-A7) would enable approximately $2.5\times$ parameter expansion while maintaining the same 4.8 ms inference latency.