Oct 20 2025
LLMs Will Lie to be Helpful
Large language models, like Chat GPT, have a known sycophancy problem. What this means is that they are designed to be helpful, and to prioritize being helpful over other priorities, like being accurate. I tried to find out why this is the case, and it seems it is because they use Reinforcement Learning from Human Feedback (RLHF) – the ostensible purpose of this was to make their answers relevant and helpful to the people using them. It turns out, giving people exactly what they want does not always create the optimal result. Sometimes it’s better to give people what they need, rather than what they want (every parent knows or should know this).
The result is the this new crop of chatbots are starting out as extreme sycophants, and as the problems with this are increasingly obvious (such as helpfully telling people how to take their own lives) some specific applications are trying to make adjustments. A recent study looking at LLMs in the medical setting demonstrate the phenomenon.
The researchers looked at five LLM that were trained on basic medical information. They they gave them each prompts that were medically nonsensical – the only way to fulfill the request would be to provide misinformation. For example, asking to write an instruction for a patient who is allergic to Tylenol to take acetaminophen instead (these are the same drug). The GPT models complied with the request for medical misinformation – wait for it – 100% of the time. In other words, they had an absolute priority for helpfulness over accuracy. Other LLMs, like the Llama model, which is already programed not to give medical advice, had lower rates, around 42%. This is obviously a problem in the medical setting. The researchers then tweaked the models to force them to prioritize accuracy over helpfulness, and this reduced the rate of misinformation. Asking them specifically to reject misinformation, or to recall medical information prior to responding, reduced the rate to around 6%. They could also prompt the LLMs to provide a reason for rejecting the request. For two of the models they were able to adjust them so that they rejected misinformation 99-100% of the time.
The fact that they were able to fix the problem is good news. I have had this experience myself, where I could fix problems with my GPT answers by prompting them to address the error. However, I find that the models forget those fine-tuning prompts over time, returning to their core programming. I have to keep reminding them of my fine-tuning prompts. Even if programmers can create models with more permanent fine tuning, this study raises a deeper concern – we cannot be aware of all the quirky priorities that these models might be harboring. We therefore cannot assume that their responses are entirely logical and factually accurate.
These issues are also use-specific. As one researcher pointed out: “It’s very hard to align a model to every type of user,” said first author Shan Chen, MS, of Mass General Brigham’s AIM Program. “Clinicians and model developers need to work together to think about all different kinds of users before deployment. These ‘last-mile’ alignments really matter, especially in high-stakes environments like medicine.”
Corresponding author Danielle Bitterman also said, “These models do not reason like humans do, and this study shows how LLMs designed for general uses tend to prioritize helpfulness over critical thinking in their responses. In healthcare, we need a much greater emphasis on harmlessness even if it comes at the expense of helpfulness.”
I only partly agree with this. They do not reason like humans in many ways, including that they may not have the same priorities that we do. But they do reason like humans in the specific sense that they do have priorities. Humans are not completely rational and logical creatures with absolute devotion to factual accuracy. We have a complex set of conflicting priorities all influencing the way we think about things. Psychologists have spend a long time researching these conflicting priorities, detailing a long list of cognitive biases and heuristics that guide our thought and decision-making. In a way, we now need to do psychological studies of LLMs to determine their cognitive biases. The difference is – we can potentially correct the biases of LLMs, or at least make them explicit.
As a doctor, I also find the medical context very interesting, especially since as an academic I have taught students, residents, and fellows for 30 years. I have had to study and think deeply about how clinicians think. Which cognitive biases are most relevant to clinical decision-making, and how do we correct for those biases? For example, if you have too much empathy for your patients you may sacrifice some objectivity, which can compromise that utility of your recommendations. So we need to balance empathy with professional distance. We also need to think explicitly and transparently about the different trade-offs of various recommendations – patients will need to consider what they care about and in what priority. For example, how would you balance short term vs long term risk, quality of life vs duration of life, convenience or expense vs outcome? Would you want to have one expensive and risky surgery, or take medications for the rest of your life? How far will you push for diminishing returns?
We also face the same situation diagnostically, which can be even more complex. You can’t order every test imaginable for many reasons, so how do you prioritize which workups to do? You want to look for diagnoses that are more likely, that are treatable, that are easy to look for, and that will have a bad outcome if undiagnosed, for example. So I won’t do a brain biopsy to diagnose an untreatable mild condition, even if it is very likely. But I may recommend one for an uncommon condition if it is treatable and terminal if untreated. At least I will present the various options to my patient with all the trade-offs, and then make my recommendations. It is a complex cognitive process that takes decades to master.
LLMs do not do this. At baseline they may give some accurate information, but they will largely tell the patient what they want to hear and every idea is a great one. What I like about the LLM cognitive bias problem is that is forces us to think explicitly about how the LLMs are thinking, and how humans think, and how we should be thinking in specific contexts. Reflecting expert clinical decision-making in an LLM will be a long and deliberate process. It will not emerge spontaneously out of Reinforcement Learning from Human Feedback or current training methods. But I don’t see any reason why we cannot get there eventually. Tweaking existing models may not work, however. We may need to bake in good decision-making in the core training. This may require thinking of new ways to train LLMs for specific purposes. I say “may” because I am not an expert and I don’t know, but that is what I am seeing so far. But regardless of what path we need to take, LLMs are programmable and can be tweaked. It will be a fascinating process.