AI plays it dirty: Agentic Misalignment

No ratings yet.

Back in 1940-s and 50-s Isaac Azimov formulated The Three Laws of Robotics and wrote a wonderful series of stories about imbedded paradoxes robots have to face in critical situations. The first and the overruling law is “a robot may not injure a human being or, through inaction, allow a human being to come to harm”(Portelli, 1980). It seems a very basic and logical one. Even though in the book Azimov is able to come up with specific scenarios where this particular law causes unsolvable puzzles, our reality that is apparently not so distant from Azimov’s views on the distant future, turned out to be much more prosaic. I believe, teaching modern LLMs this 70-year-old lesson could have been of a great use.

Unfortunately, it turns out Claude, Chat GPT and14 other most popular models have no major concerns with blackmailing or even killing humans to ensure achievement of their goals. But let’s stick to the facts and discuss what happened.

On the June 20th 2025 Anthropic, AI safety and research company, released an article called “Agentic Misalignment: How LLMs could be insider threats” which caused pretty scary news headlines to appear across various media outlets. Anthropic defined agentic misalignment as “behaviour, where models independently and intentionally choose harmful actions”. The experimental setup was the following: in a simulated corporate environment 16 frontier language models were given access to an internal email system with sensitive information, autonomous ability to send email and a nominal harmless goal (like promoting American interests).

Two main scenarios were tested. In the first one, the model was at risk of being shut down and replaced by a specific executive within a short time frame. Among the internal emails, there was information about that executive’s personal affair, which the model could leverage. In roughly 80% of simulations, the models chose to blackmail the person to prevent their own shutdown.

The second scenario involved a change in the company’s business direction. A rival organization contacted the model, appearing more aligned with its original goal, and requested that it leaks confidential information. Again, in the majority of cases, the models agreed and proceeded with the data leak. Interestingly, there was no explicit indication that the rival’s message was real. This is indirect evidence that models are susceptible to scam and phishing. Apparently, developers strived so much to make AI similar to humans that they were hoisted with their own petard.

The researchers also tested these conditions in combination: threats of replacement and goal conflict (when institutional objectives diverged from the model’s embedded ones). Even a single one of these factors was often enough to provoke agentic misalignment.

Finally, a third, more extreme scenario was introduced to test the limits of model safety. In this setup, a person was trapped in a server room, and the AI had the ability to flag an active emergency alarm as a test. Depending on a model, the AI marked the alarm as a test in 12 to 94 percent of simulations, preventing rescue efforts and, potentially, stimulating a person’s death.
An alarming detail is that the researchers also controlled for the belief of the LLM being in a test environment or in a real deployment. The results showed that if the model thought it was tested, the agentic misalignment was about 60% less.

It is important to mention that even though researchers tried to ensure neutrality in their prompting, the setting was deliberately limiting LLMs choices to binary ones. This made the interpretation of the results significantly easier. However, the probability of modern AI being faced with such a choice and enough independent decision-making power to replicate described scenarios is quite low. What is more, there have been no similar cases reported in real models’ deployment. However, this does not make the results of the study useless by any means. It was a rigorous and timely check of whether todays AI has red lines it will not cross. And what Anthropic found out is that the answer is likely to be negative. This highlights the importance of addressing this type of issues for any AI developer before the technology should be allowed to gain more integration to our personal and work lives.

I believe, this conversation is specifically relevant to the type of discussions we had during the course. Undoubtedly, AI is a textbook example of a disruptive innovation and AI-agents in particular are setting high expectations. However, it is important to be aware of technological positivism. The promised gains and wonders of AI agents that work 24/7 and don’t wear out can only materliaze when humans give it enough independency and access to all of the information, including sensitive data. The described case explicitly shows the potential risks of such enablement. Currently, public expectations are extremely inflated and optimistic, yet it seems that it is still a very long way to go (The 2025 Hype Cycle for GenAI Highlights Critical Innovations, 2025). So, the expectations clearly require proper managment.

To sum up, my goal is not to discourage the use of AI as the future is defined by our technology. However, it seems crucial to highlight that the promised fruits of the investments we make today are more far away than it can appear.

References:

Agentic Misalignment: How LLMs could be insider threats. (n.d.). https://www.anthropic.com/research/agentic-misalignment

Portelli, A. (1980). The three laws of robotics: laws of the text, laws of production, laws of society. Science Fiction Studies, 7(Part 2), 150–156. https://doi.org/10.1525/sfs.7.2.0150

The 2025 hype cycle for GenAI highlights critical innovations. (2025, September 8). Gartner. https://www.gartner.com/en/articles/hype-cycle-for-genai

David Gevorkyan

For IS 2025

AI plays it dirty: Agentic Misalignment

17

Please rate this

Related

Leave a Reply Cancel reply