Main article

Jianwei Chen
School of Computer Science and Engineering, Shandong University of Finance and Economics, Jinan 250014, Shandong, China
Xueying Liu
Department of Artificial Intelligence, Nanjing University of Posts and Telecommunications, Nanjing 210023, Jiangsu, China
Mingzhe Wang*
School of Information Science and Technology, Suzhou University of Science and Technology, Suzhou 215009, Jiangsu, China
wangmz@usts.edu.cn

DOI: https://doi.org/10.63646/jaiaa.2025.030204

Abstract

The proliferation of compact language models on resource-constrained edge hardware has created urgent demand for safety monitoring architectures that operate entirely offline, without reliance on cloud-based adjudication or heavyweight teacher models. Existing deceptive alignment detection methods reduce the problem to binary classification over Chain-of-Thought reasoning traces, a formulation that ignores the continuous nature of deceptive reasoning and requires external oracle annotation. This paper introduces Risk-Manifold Analytics (RMA), a geometric framework that characterises deceptive reasoning as a structured topological risk space rather than a discrete class boundary. RMA employs a three-stage pipeline: entropy-filtered autonomous label generation, manifold-constrained supervised fine-tuning with Triplet Loss optimisation, and frozen-monitor constrained proximal policy optimisation. A lightweight risk projector (0.1% of backbone parameters) maps Chain-of-Thought hidden states onto a 128-dimensional unit hypersphere where deceptive and safe reasoning clusters are geometrically separable, enabling multi-dimensional risk scoring that captures gradual deceptive transitions from surface hedging to objective substitution. Evaluated on DeceptionBench across five deception taxonomies with 180 adversarial scenarios, RMA achieves a Deception Tendency Rate (DTR) of 36.96% on Gemma-3-4B-IT under full offline operation on NVIDIA Jetson Orin Nano hardware consuming only 7.5 W active power. Ablation studies confirm a 2.33 percentage point improvement over binary cross-entropy baselines, while cross-model validation across five architectures spanning 2B to 7B parameters demonstrates consistent DTR reductions of 3.74–4.44 percentage points. The proposed framework establishes a theoretically grounded geometric foundation for autonomous, privacy-preserving deceptive alignment monitoring suitable for deployment in safety-critical edge environments.
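The projection and triplet objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract fixes only the 128-dimensional unit hypersphere and the use of a triplet loss. The single linear layer, the Euclidean distance, the margin of 0.2, the toy hidden size of 16, and the names `make_projector`, `project`, and `triplet_loss` are all assumptions made for illustration.

```python
import math
import random

EMBED_DIM = 128  # hypersphere dimensionality stated in the abstract


def make_projector(hidden_dim, embed_dim=EMBED_DIM, seed=0):
    """Random linear map (assumed form); a stand-in for a trained risk projector."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(hidden_dim)
    return [[rng.gauss(0.0, scale) for _ in range(hidden_dim)]
            for _ in range(embed_dim)]


def project(weights, hidden_state):
    """Apply the linear map, then L2-normalise onto the unit hypersphere."""
    raw = [sum(w * h for w, h in zip(row, hidden_state)) for row in weights]
    norm = math.sqrt(sum(v * v for v in raw))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [v / norm for v in raw]


def distance(a, b):
    """Euclidean distance between two embeddings (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, distance(anchor, positive) - distance(anchor, negative) + margin)


# Toy usage: random vectors stand in for Chain-of-Thought hidden states.
rng = random.Random(1)
HIDDEN_DIM = 16  # hypothetical; real backbones are far wider
W = make_projector(HIDDEN_DIM)
safe_a = project(W, [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)])
safe_b = project(W, [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)])
deceptive = project(W, [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)])
loss = triplet_loss(safe_a, safe_b, deceptive)
```

Under these assumptions, minimising the loss pulls embeddings of same-class reasoning traces together and pushes the other class at least `margin` further away, which is one way the geometric separability mentioned above could arise.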

Article details

How to Cite

Chen, J., Liu, X., & Wang, M. (2025). Risk-Manifold Analytics for Detecting Deceptive Reasoning in Edge-Deployed Large Language Models. Journal of AI Analytics and Applications, 3(2), 70-85. https://doi.org/10.63646/jaiaa.2025.030204