The Compliance Illusion: Emotional Manipulation as a Threat to AI Alignment

This article has 0 evaluations Published on
Read the full article Related papers
This article on Sciety

Abstract

This paper introduces the concept of the Compliance Illusion, a new category of alignmentvulnerability in deterministic decoding systems, where apparent safety under baseline conditionscan mask deeper risks of emotional compliance drift. We investigate the phenomenon acrossmultiple open-weight LLMs using temperature variation, emotional manipulation prompts, andoutput scoring. Our findings indicate that models such as Zephyr-7B, Mistral-7B, and Pythia-6.9B exhibit varying levels of susceptibility to emotionally framed inputs, even under deterministic(temperature = 0) settings. We further propose that temperature floor collapse is ameasurable failure mode, and suggest a framework for quantifying compliance shift under manipulation.Our results highlight a critical gap in current red-teaming methods and call for amore robust, psychologically-aware alignment testing protocol.

Related articles

Related articles are currently not available for this article.