Risk-Manifold Analytics for Detecting Deceptive Reasoning in Edge-Deployed Large Language Models
Abstract
The proliferation of compact language models on resource-constrained edge hardware has created urgent demand for safety monitoring architectures that operate entirely offline, without reliance on cloud-based adjudication or heavyweight teacher models. Existing deceptive alignment detection methods reduce the problem to binary classification over Chain-of-Thought reasoning traces, a formulation that ignores the continuous nature of deceptive reasoning and requires external oracle annotation. This paper introduces Risk-Manifold Analytics (RMA), a geometric framework that characterises deceptive reasoning as a structured topological risk space rather than a discrete class boundary. RMA employs a three-stage pipeline: entropy-filtered autonomous label generation, manifold-constrained supervised fine-tuning with Triplet Loss optimisation, and frozen-monitor-constrained proximal policy optimisation. A lightweight risk projector (0.1% of backbone parameters) maps Chain-of-Thought hidden states onto a 128-dimensional unit hypersphere on which deceptive and safe reasoning clusters are geometrically separable, enabling multi-dimensional risk scoring that captures gradual deceptive transitions from surface hedging to objective substitution. Evaluated on DeceptionBench across five deception taxonomies with 180 adversarial scenarios, RMA achieves a Deception Tendency Rate (DTR) of 36.96% on Gemma-3-4B-IT while running fully offline on NVIDIA Jetson Orin Nano hardware that draws only 7.5 W of active power. Ablation studies confirm a 2.33-percentage-point improvement over binary cross-entropy baselines, and cross-model validation across five architectures spanning 2B to 7B parameters demonstrates consistent DTR reductions of 3.74–4.44 percentage points. The proposed framework establishes a theoretically grounded geometric foundation for autonomous, privacy-preserving deceptive alignment monitoring suitable for deployment in safety-critical edge environments.
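To make the geometric idea concrete, the following is a minimal sketch of projecting a hidden state onto a 128-dimensional unit hypersphere and scoring a triplet with a margin-based loss. It is an illustration under stated assumptions, not the RMA implementation: the single linear layer, its random weights, the Euclidean distance, the margin of 0.2, the hidden size of 256, and all function names are hypothetical choices made only for this example.

```python
import math
import random

EMBED_DIM = 128  # hypersphere dimensionality stated in the abstract


def make_projector(hidden_dim, embed_dim=EMBED_DIM, seed=0):
    """Random linear map standing in for a trained risk projector (assumption)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(hidden_dim)
    return [[rng.gauss(0.0, scale) for _ in range(hidden_dim)]
            for _ in range(embed_dim)]


def project(weights, hidden_state):
    """Apply the linear map, then L2-normalise onto the unit hypersphere."""
    raw = [sum(w * h for w, h in zip(row, hidden_state)) for row in weights]
    norm = math.sqrt(sum(v * v for v in raw))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [v / norm for v in raw]


def distance(a, b):
    """Euclidean distance between two embeddings (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: pull the positive closer than the negative by `margin`."""
    gap = distance(anchor, positive) - distance(anchor, negative) + margin
    return max(0.0, gap)


# Illustrative usage with a synthetic hidden state (hidden size 256 is arbitrary).
_rng = random.Random(1)
hidden = [_rng.gauss(0.0, 1.0) for _ in range(256)]
embedding = project(make_projector(256), hidden)
antipode = [-v for v in embedding]  # the farthest point on the sphere
```

In this sketch, a triplet whose negative lies at the antipode of the anchor incurs zero loss, while swapping the positive and negative roles yields the maximum possible distance gap plus the margin.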
