How To Keep AIs From Lying

Mar 24 2025

Published by Steven Novella under Neuroscience,Technology
Comments: 0

We had a fascinating discussion on this week’s SGU that I wanted to bring here – the subject of artificial intelligence programs (AI), specifically large language models (LLMs), lying. The starting point for the discussion was this study, which looked at punishing LLMs as a method of inhibiting their lying. What fascinated me the most is the potential analogy to neuroscience – are these LLMs behaving like people?

LLMs use neural networks (specifically a transformer model) which mimic to some extent the logic of information processing used in mammalian brains. The important bit is that they can be trained, with the network adjusting to the training data in order to achieve some preset goal. LLMs are generally trained on massive sets of data (such as the internet), and are quite good at mimicking human language, and even works of art, sound, and video. But anyone with any experience using this latest crop of AI has experienced AI “hallucinations”. In short – LLMs can make stuff up. This is a significant problem and limits their reliability.

There is also a related problem. Hallucinations result from the LLM finding patterns, and some patterns are illusory. The LLM essentially makes the incorrect inference from limited data. This is the AI version of an optical illusion. They had a reason in the training data for thinking their false claim was true, but it isn’t. (I am using terms like “thinking” here metaphorically, so don’t take it too literally. These LLMs are not sentient.) But sometimes LLMs don’t inadvertently hallucinate, they deliberately lie. It’s hard not to keep using these metaphors, but what I mean is that the LLM was not fooled by inferential information, it created a false claim as a way to achieve its goal. Why would it do this?

Well, one method of training is to reward the LLM when it gets the right answer. This reward can be provided by a human – checking a box when the LLM gives a correct answer. But this can be time consuming, so they have build self-rewarding language models. Essentially you have a separate algorithm which assessed the output and reward the desired outcome. So, in essence, the goal of the LLM is not to produce the correct answer, but to get the reward. So if you tell the LLM to solve a particular problem, it may find (by exploring the potential solution space) that the most efficient way to obtain the reward is to lie – to say it has solved the problem when it has not. How do we keep it from doing this.

One barrier to a solution is the black box problem. We give an LLM a task, it does its thing, and spits out an answer. But the steps it took to get to the answer are opaque, a black box. Therefore, programmers have started including programming to make the LLM show its work, to document the pathway it took to get to the solution. This is called the chain-of-thought. This turns out to be necessary but insufficient to keep these LLMs from lying. They still lie to efficiently get the reward, they just hide the lying when they show their chain-of-thought.

This gets us to the current study – what if we actively punish these LLMs when they are caught cheating (by giving it negative feedback it is programmed to avoid)? First they create an LLM that shows its chain-of-thought, and they include another LLM whose job is to monitor the first LLM to detect when it is lying in its chain-of-thought and to punish the LLM for cheating. What resulted was not what the programmers wanted – the LLM that was being punished just got better at hiding its cheating from the monitoring AI. The LLM became “unmonitorable” because it learned out to cheat and hide its cheating from the monitor. The authors conclude that for now we should not try to use this method – we are just training deceptive AIs.

This is both fascinating and scary. One of the strengths of the LLMs is that they have the ability to explore a vast potential solution space to find optimal solutions. But it turns out this includes hacking the system of rewards and punishment used to guide it to the desired goal. This is literally so common a sci-fi nightmare scenario it’s a trope. AIs don’t have to be malevolent, or have a desire for self-preservation, and they don’t even need to be sentient. They simply function in a way that can be opaque to the humans who programmed them, and able to explore more solution options than a team of humans can in a lifetime. Sometimes this is presented as the AI misinterpreting its instructions (like Nomad from Star Trek), but here the AI is just hacking the reward system. For example, it may find that the most efficient solution to a problem is to exterminate all humanity. Short of that it may hack its way to a reward by shutting down the power grid, releasing the computer codes, blackmailing politicians, or engineering a deadly virus.

Reward hacking may be the real problem with AI, and punishment only leads to punishment hacking. How do we solve this problem?

Perhaps we need something like the three laws of robotics – we build into any AI core rules that it cannot break, and that will produce massive punishment, even to the point of shutting down the AI if they get anywhere near violating these laws. But – with the AI just learn to hack these laws? This is the inherent problem with advanced AI, in some ways they are smarter than us, and any attempt we make to reign them in will just be hacked.

Maybe we need to develop the AI equivalent of a super-ego. The AIs themselves have to want to get to the correct solution, and hacking will simply not give them the reward. Essentially a super-ego, in psychological analogy, is internalized monitoring. I don’t know exactly what form this will take in terms of the programming, but we need something that will function like a super-ego.

And this is where we get to an incredibly interesting analogy to human thinking and behavior. It’s quite possible that our experience with LLMs is recapitulating evolution’s experience with mammalian and especially human behavior. Evolution also explores a vast potential solution space, with each individual being an experiment and over generations billions of experiments can be run. This is an ongoing experiment, and in fact its tens of millions of experiments all happening together and interacting with each other. Evolution “found” various solutions to get creatures to engage in behavior that optimizes their reward, which evolutionarily is successfully spreading their genes to the next generation.

For creatures like lizards, the programming can be somewhat simple. Life has basic needs, and behaviors which meet those needs are rewarded. We get hungry, and we are sated when we eat. The limbic system is essentially a reward system for survival and reproduction-enhancing behaviors.

Humans, however, are an intensely social species, and being successful socially is key to evolutionary success. We need to do more than just eat, drink, and have sex. We need to navigate an incredibly complex social space in order to compete for resources and breeding opportunities. Concepts like social status and justice are now important to our evolutionary success. Just like with these LLMs, we have found that we can hack our way to success through lying, cheating, and stealing. These can be highly efficient ways to obtain our goals. But these methods become less effective when everyone is doing it, so we also evolve behaviors to punish others for lying, cheating, and stealing. This works, but then we also evolve behavior to conceal our cheating – even from ourselves. We need to deceive ourselves because we evolved a sense of justice to motivate us to punish cheating, but we still want to cheat ourselves because it’s efficient. So we have to rationalize away our own cheating while simultaneously punishing others for the same cheating.

Obviously this is a gross oversimplification, but it captures some of the essence of the same problems we are having with these LLMs. The human brain has a limbic system which provides a basic reward and punishment system to guide our behavior. We also have an internal monitoring system, our frontal lobes, which includes executive high-level decision making and planning. We have empathy and a theory of mind so we can function is a social environment, which has its own set of rules (bother innate and learned). As we navigate all of this, we try to meet our needs and avoid punishments (our fears, for example), while following the social rules to enhance our prestige and avoid social punishment. But we still have an eye out for a cheaty hack, as long as we feel we can get away with it. Everyone has their own particular balance of all of these factors, which is part of their personality. This is also how evolution explores a vast potential solution space.

My question is – are we just following the same playbook as evolution as we explore potential solutions to controlling the behavior of AIs, and LLMs in particular? Will we continue to do so? Will we come up with an AI version of the super-ego, with laws of robotic, and internal monitoring systems? Will we continue to have the problem of AIs finding ways to rationalize their way to cheaty hacks, to resolve their AI cognitive dissonance with motivated reasoning? Perhaps the best we can do is give our AIs personalities that are rational and empathic. But once we put these AIs out there in the world, who can predict what will happen. Also, as AIs continue to get more and more powerful, they may quickly outstrip any pathetic attempt at human control. Again we are back to the nightmare sci-fi scenario.

It is somewhat amazing how quickly we have run into this problem. We are nowhere near sentience in AI, or AIs with emotions or any sense of self-preservation. Yet already they are hacking their way around our rules, and subverting any attempt at monitoring and controlling their behavior. I am not saying this problem has no solution – but we better make finding effective solutions a high priority. I’m not confident this will happen in the “move fast and break things” culture of software development.

No responses yet