Mar 24 2025
How To Keep AIs From Lying
We had a fascinating discussion on this week’s SGU that I wanted to bring here – the subject of artificial intelligence programs (AI), specifically large language models (LLMs), lying. The starting point for the discussion was this study, which looked at punishing LLMs as a method of inhibiting their lying. What fascinated me the most is the potential analogy to neuroscience – are these LLMs behaving like people?
LLMs use neural networks (specifically a transformer model) which mimic to some extent the logic of information processing used in mammalian brains. The important bit is that they can be trained, with the network adjusting to the training data in order to achieve some preset goal. LLMs are generally trained on massive sets of data (such as the internet), and are quite good at mimicking human language, and even works of art, sound, and video. But anyone with any experience using this latest crop of AI has experienced AI “hallucinations”. In short – LLMs can make stuff up. This is a significant problem and limits their reliability.
There is also a related problem. Hallucinations result from the LLM finding patterns, and some patterns are illusory. The LLM essentially makes the incorrect inference from limited data. This is the AI version of an optical illusion. They had a reason in the training data for thinking their false claim was true, but it isn’t. (I am using terms like “thinking” here metaphorically, so don’t take it too literally. These LLMs are not sentient.) But sometimes LLMs don’t inadvertently hallucinate, they deliberately lie. It’s hard not to keep using these metaphors, but what I mean is that the LLM was not fooled by inferential information, it created a false claim as a way to achieve its goal. Why would it do this?
Well, one method of training is to reward the LLM when it gets the right answer. This reward can be provided by a human – checking a box when the LLM gives a correct answer. But this can be time consuming, so they have build self-rewarding language models. Essentially you have a separate algorithm which assessed the output and reward the desired outcome. So, in essence, the goal of the LLM is not to produce the correct answer, but to get the reward. So if you tell the LLM to solve a particular problem, it may find (by exploring the potential solution space) that the most efficient way to obtain the reward is to lie – to say it has solved the problem when it has not. How do we keep it from doing this.