Feb 12 2018

Significant but Irrelevant – Study on Correcting False Information

A study from a few months ago is making the rounds due to recent write-ups in the media, including Scientific American. The SA title reads: “Cognitive Ability and Vulnerability to Fake News: Researchers identify a major risk factor for pernicious effects of misinformation.”

The study itself is: ‘Fake news’: Incorrect, but hard to correct. The role of cognitive ability on the impact of false information on social impressions. In the paper the authors conclude:

“The current study shows that the influence of incorrect information cannot simply be undone by pointing out that this information was incorrect, and that the nature of its lingering influence is dependent on an individual’s level of cognitive ability.”

So it is understandable that reporters took that away as the bottom line. In fact, in an interview for another news report on the finding, one of the authors is quoted:

“Our study suggests that, for individuals with lower levels of cognitive ability, the influence of fake news cannot simply be undone by pointing out that this news was fake,” De keersmaecker said. “This finding leads to the question, can the impact of fake news can be undone at all, and what would be the best strategy?”

The problem is, I don’t think this is what the study is saying at all. I think this is a great example of confusing statistically significant with clinically significant. It is another manifestation of over-reliance on p-values, and insufficient weight given to effect size.

Here’s the study – the researchers gave subjects a description of a woman. In the control group they asked them to rate her probable qualities, how much they liked her, how trustworthy she was, etc. In the intervention group they also gave the subjects false information about her – that she sold drugs and shoplifted in order to support her shopping habit. They also gave a cognitive ability test to all subjects, and divided them into high cognitive ability (one standard deviation above average) and low cognitive ability (one standard deviation below average).

Above is the graph of the results (high definition version here). As you can see, in the control group both high and low cognitive subjects rated the fictitious person as an 82/100 in quality. The intervention group, after being given the false negative information, rated her much lower – 24 and 26 in the low and high cognitive ability groups respectively. Then the false negative information was corrected – the subjects were told, never mind, this information is not true; now how would you rate the person? The high cognitive ability group then rated her as an 85, while the low cognitive ability group rated her as a 78. The difference between the 85 and the 78 was statistically significant.
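
To make those numbers concrete, here is a minimal sketch in Python (the group means are the ones reported above; the per-subject data are not available, and the framing in terms of “percent of the drop undone” is my own illustration):

    # Mean ratings (0-100) as reported above; per-subject data are not available.
    ratings = {
        "control":          {"low_cog": 82, "high_cog": 82},
        "after_fake_news":  {"low_cog": 24, "high_cog": 26},
        "after_correction": {"low_cog": 78, "high_cog": 85},
    }

    for group in ("low_cog", "high_cog"):
        drop     = ratings["control"][group] - ratings["after_fake_news"][group]
        residual = ratings["control"][group] - ratings["after_correction"][group]
        undone   = 100 * (1 - residual / drop)
        print(f"{group}: fake news dropped the rating {drop} points; "
              f"{residual} points remain after correction ({undone:.0f}% undone)")

    # low_cog: fake news dropped the rating 58 points; 4 points remain after correction (93% undone)
    # high_cog: fake news dropped the rating 56 points; -3 points remain after correction (105% undone)

Framed that way, the “lingering” effect in the low cognitive ability group is about 4 points on a 100-point scale, and the high cognitive ability group slightly over-corrected.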

I look at this data and what I see is that in both cognitive ability groups, the false information had about the same negative effect, and the correction had about the same corrective effect, and in fact essentially completely erased the effect of the false information. I am not impressed by a tiny but statistically significant effect.

The small effect size has two major implications. The first is that it calls into question whether the effect is real at all. Essentially, the study has a low signal-to-noise ratio, and this is problematic, especially when dealing with human subjects. It’s just too easy to miss confounding factors, or to inadvertently p-hack your way to a significant result. That probably has something to do with the replication problem in psychological research.
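
To illustrate the point (this is a generic simulation, not a re-analysis of this study, and the sample size and rating spread are assumed numbers): if you start with pure noise and allow yourself a few post-hoc subgroup cutoffs, “significant” differences turn up far more often than the nominal 5%.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, sd, runs = 200, 15.0, 2000          # assumed group size and rating SD
    false_positives = 0

    for _ in range(runs):
        ratings = rng.normal(78, sd, n)    # no true effect of cognition at all
        cognition = rng.normal(0, 1, n)    # unrelated standardized covariate
        # Try several subgroup cutoffs and keep the most favorable p-value.
        p_values = [
            stats.ttest_ind(ratings[cognition < -cut], ratings[cognition > cut]).pvalue
            for cut in (0.5, 1.0, 1.5)
        ]
        if min(p_values) < 0.05:
            false_positives += 1

    print(f"'Significant' subgroup difference in {100 * false_positives / runs:.0f}% of null datasets")

None of this shows that the authors actually did that – only that small effects carved out of noisy subgroup comparisons are exactly the kind of result that tends not to replicate.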

But even if you think the study is rock-solid (and I don’t), and the results are entirely reliable, the effect size is still really small. I don’t know how you can look at this data and conclude that the bottom line message is that people with lower cognitive ability (as defined in this study in a very limited way, but that’s another issue) had difficulty correcting false information. They corrected from a rating of 24 up to 78, with a control group rating of 82. Are you really making a big deal out of the difference between 78 and 82? Is that even detectable in the real world? What does it mean if you think someone’s personal qualities rank 78/100 vs 82/100?

With any kind of subjective evaluation like this, how much of a difference is detectable or meaningful is a very important consideration. I would bet that a difference between 78 and 82 has absolutely no practical implication at all. That means that effectively, even the people in the lower cognitive group, for all practical purposes, corrected the effect of the false information.

I am also not impressed by the difference between the low and high cognitive groups – 78 vs 85. That is still not a difference likely to have any practical significance. Even this small effect size may be an artifact of the fact that the high cognitive group over-corrected (compared to controls) by a small amount (which was also not significant).

And yet the Scientific American headline writer wrote that with this research, “Researchers identify a major risk factor for pernicious effects of misinformation.” “Major” risk factor? I don’t think so. Minor and dubious is more like it. What this research does show is that subjects easily and almost completely corrected for misinformation when it was pointed out to them.

But even that conclusion is questionable from this data, because the research conditions were so artificial – conditions the researchers themselves characterize as ideal. The false information was quickly and unequivocally reversed. This is not always the case in the real world.

Unfortunately I see this phenomenon a lot. Social psychology studies take a tiny but statistically significant outcome, and then over-interpret the results as if it is a large and important factor. Then journalists report the tiny effect as if it is the bottom line result, leaving the public with a false impression, often the opposite of what the study actually shows.

We see this in medicine all the time as well. This is the frequentist fallacy – the notion that if you can get that p-value down to 0.05, then the effect is both real and important. However, oftentimes a statistically significant result is clinically insignificant. Further, small effect sizes are much more likely to be spurious – an illusion of noisy data or less than pristinely rigorous research methods.

This is why some journals are shifting their emphasis from p-values to other measures of outcome, like effect sizes. You have to look at the data thoroughly, not just the statistical significance.
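
Here is a quick illustration of the point (the numbers are invented for the example and have nothing to do with this particular study): with a large enough sample, a difference far too small to matter clinically still sails under p = 0.05, which is exactly why the effect size needs to be reported alongside the p-value.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n = 5000                               # large samples make small p-values cheap
    control = rng.normal(50.0, 10.0, n)    # hypothetical 0-100 outcome
    treated = rng.normal(50.7, 10.0, n)    # true difference of only 0.7 points

    p = stats.ttest_ind(control, treated).pvalue
    pooled_sd = np.sqrt((control.var(ddof=1) + treated.var(ddof=1)) / 2)
    cohens_d = (treated.mean() - control.mean()) / pooled_sd

    print(f"p = {p:.4f}, Cohen's d = {cohens_d:.2f}")
    # With 5000 per arm the 0.7-point shift is "significant" (p well below 0.05),
    # but d comes out around 0.07 – well under the ~0.2 conventionally called "small".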

Further, researchers have to be careful not to oversell their results, and science journalists need to put research results into a proper context.

 


19 thoughts on “Significant but Irrelevant – Study on Correcting False Information”

  1. Michael Finfer, MD says:

    I find it really amazing that, in the control group, both subsets of individuals averaged 82/100. That strikes me as rather unlikely.

    Could there be something more to this?

  2. bend says:

    Fatalism is trendy. But fatalism isn’t helping to make anything better. It’s popular to believe that people can’t be educated, that you can’t fix stupid. Well, not with an attitude like that, you can’t. Imma keep correcting anti-vaxxers and creationists on facebook.

  3. Kabbor says:

    bend,
    “Imma keep correcting anti-vaxxers and creationists on facebook.”

    That is a tough line of work, and I wish you the best. I don’t have the stomach for that kind of thing myself so I stay clear of most social media. Making occasional comments here is as close as I get, and I try to keep about half of them particularly serious.

    As to the study, this is a great example of the media turning a non-event into a news story by assuming real-world significance of statistically significant information. It seems to me that journalism schools need to teach the value of writing articles about research that has NOT shifted our way of looking at a topic, with emphasis on why it does not move the needle. As it stands, for something to be news it has to be new and exciting (or in many cases new and upsetting) before it gets addressed in the mainstream.

    It is one of the things I very much enjoy about this blog, it does not sensationalize, and yet remains optimistic about innovation and research.

  4. RC says:

    “I am also not impressed by the difference between the low and high cognitive groups – 78 to 85. That is still not a difference with any likely practical difference.”

    Sure, a 7 percent change in your opinion of some random stranger probably doesn’t make a difference, but a 7% shift in a bunch of independents’ opinions on a political candidate is literally the sort of thing that changes the world.

    What I’d want to see in further study is whether repeated attempts at fake-newsing are more successful (because of the subject’s already lowered opinion of the target), or whether fatigue sets in. I’d guess it’s the former.

    Also, I’d guess in politics the effects are stronger for people in the opposition party and weaker in your own party.

  5. Average Joe says:

    Something is bugging me in regards to moving the goal post.
    On one hand, statistical significance implies a difference. On the other, clinical significance implies that even though there is statistical significance, in context the difference is too small to be meaningful. Is this or is it not a form of moving the goal post? … (Furthermore, moving the goal post has a negative connotation, so saying as much introduces bias.)

  6. BenE says:

    The most interesting thing about the study is that the low-cognitive-abilities group undercorrected and the high-cognitive-abilities group overcorrected. Almost by the same amount. The obvious question is why.

    And the obvious answer is that this study is mostly showing noise, as Steven points out.

    But there’s a not-so-obvious hypothesis for this result, too. That hypothesis is that the high-cognitive group builds inhibiting circuits into their understanding that bias them against believing the kinds of information they once believed but had to discard … whereas the low-cognitive group remains more susceptible to the kind of information they once believed but had to discard. And this low group can be hammered into a new belief over time, just by exposure, and in the face of information to the contrary.

    I’d like to see more research.

  7. RC – I don’t think you can take this study and apply it to voting. A 7 point difference on a 100 point rating scale does not directly translate to a 7% difference in election outcome. That would be a good and interesting question, which would get at a real-world effect. It’s possible that this 7 point difference would translate into a 0% change in voting. The two can’t be equated.

    AJ – How is it moving the goalpost? I guess it would be if someone held the position that reaching a statistically significant p-value means a phenomenon is real, and then, in response to such a study, said, “Well, it’s not clinically significant.” But that is not what I am doing.

    I have been consistent over years, and in fact this is a central point of SBM, that p-value statistical significance does not translate into a phenomenon being real. You have to consider prior probability, root out p-hacking, look at replicability, etc. You also have to look at effect sizes, with small effect sizes making it more difficult to interpret the results as being real.

    And that is just the question of whether or not the science establishes the phenomenon. In applied science you have the separate question of how to apply the science in the real world. I may scientifically prove a phenomenon whose effect nevertheless cannot be exploited in the real world because it is too tiny or ephemeral. Two separate questions – no goalpost is being moved.

  8. BenE says:

    Average Joe –
    You wrote “Is this or is it not a form of moving the goal post?”

    Not really. It’s like you’re trying to kick a field goal from the 20 yard line, and the kick goes 20 feet wide right.

    The question is whether the wind is responsible. We’ve laid the field full of all kinds of sensors and we are 99.99999% certain (very significant) that a 1mph wind was blowing left to right.

    But there’s no way a 1mph wind caused a 37-yard field goal to sail that far to the right. So the wind wasn’t the cause, regardless of its super high significance.

  9. RC says:

    @SN

    “RC – I don’t think you can take this study and apply it to voting. A 7 point difference on a 100 point rating scale does not directly translate to a 7% difference in election outcome”

    I didn’t say it did. I said it would translate into a 7 percent difference in independent voters’ opinions of the candidate. That’s a very different thing – it would only affect voters who are somewhere around neutral. It would be a much smaller effect – maybe 1 or 2 percent – but that wins national elections.

    Unless the effects are cumulative – if you put out enough negative fake news on someone, it really doesn’t matter how much it gets corrected; you’ve simply poisoned people’s impressions too much.

    I.e., if you can bring someone’s opinion of a candidate down 5 points with a fake news story, what happens when you put out several fake news stories over the course of a campaign?

  10. Average Joe says:

    Thank you SN for the clear response. Thumbs up

  11. Beamup says:

    I think there’s an even more fundamental problem here – they didn’t do the correct comparison. Surely the threshold question is whether there remained any effect after giving, then correcting, the “fake” information? That is to say, whether either cognitive group was statistically significantly different from the control post-correction? If neither was, then the only conclusion you can draw is that there was no detectable effect on anyone and so you can’t compare what the effect was between the groups.

    I find this particularly striking in light of the fact that the supposed headline result doesn’t involve the control group at all. If that’s the statistical analysis they were planning to do all along, why bother with a control group? They could have done the same study with fewer participants by omitting it and treating it as a straight comparison.

    It smells to me like the study design was to compare the two cognitive groups to control. Then they failed to get a statistically significant result that way, so changed the statistical strategy post hoc in order to generate a publishable result.
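
    Just to sketch the comparison I mean (the means are the ones from the post, but the standard deviations and group sizes are invented, since the post doesn’t quote them – which is exactly the missing piece):

        from scipy import stats

        # Hypothetical summary statistics: means from the post, SDs and ns invented.
        control        = dict(mean=82, std=15, nobs=100)
        low_corrected  = dict(mean=78, std=15, nobs=100)
        high_corrected = dict(mean=85, std=15, nobs=100)

        for label, grp in (("low vs control", low_corrected), ("high vs control", high_corrected)):
            result = stats.ttest_ind_from_stats(
                grp["mean"], grp["std"], grp["nobs"],
                control["mean"], control["std"], control["nobs"],
            )
            print(label, f"p = {result.pvalue:.3f}")

    Whether either of those comes out significant depends entirely on the unreported variances, but that is the threshold comparison – and the headline analysis never makes it.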

  12. BillyJoe7 says:

    Michael Finfer: “I find it really amazing that, in the control group, both subsets of individuals averaged 82/100. That strikes me as rather unlikely”

    Apparently there were 400 subjects, 200 controls and 200 in the test group. They don’t give a breakdown of the low and high cognitive subgroups but, if the split was even, there would be 100 in each subgroup. Usually 50 subjects is regarded as sufficient to provide reliable results. But it does seem surprising that the average of both subsets was identical.
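
    Out of curiosity I ran a quick simulation of that coincidence (the subgroup size and rating spread are guesses, since neither is reported here):

        import numpy as np

        rng = np.random.default_rng(2)
        n, sd, runs = 100, 15.0, 10_000    # assumed subgroup size and rating SD
        matches = sum(
            round(rng.normal(82, sd, n).mean()) == round(rng.normal(82, sd, n).mean())
            for _ in range(runs)
        )
        print(f"Rounded means coincide in about {100 * matches / runs:.0f}% of simulations")

    Under those assumptions it happens a fair fraction of the time, so identical rounded averages are a coincidence, but perhaps not an outrageous one – though that does depend on the spread, which we don’t know.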

  13. BillyJoe7 says:

    Kabbor: “this is a great example of media turning a non-event into a news story by assuming real world significance of statistically significant information”

    It’s also a great example of the media – and the authors themselves! – misinterpreting the data.
    In nobody’s language can an increase from 24 to 78 in one subgroup, and from 26 to 85 in another subgroup, be interpreted as “fake news: hard to correct”.

    Of course, as SN pointed out, in real life, where fake news is often repeated over an extended period of time before it is corrected, and where the correction, as well as not being immediate, is usually much less emphatic than in this trial, the corrective effect is very likely to be considerably smaller.

  14. BillyJoe7 says:

    RC,

    I think what SN is saying is that the 7 percentage points difference between the low and high cognitive subgroups of the test group is unlikely to be real. There may be no difference at all.

    Firstly, we have the curious result that the average for both the low and high cognitive subgroups of the control group was identical. Secondly, we have the fact that the low cognitive subgroup of the test group undervalued the woman by 4 percentage points compared with the control group, and the high cognitive subgroup of the test group overvalued the woman by 3 percentage points compared with the control group. If the control group’s assessment is taken as true, then both the low and high cognitive subgroups of the test group made roughly the same degree of error. In a situation where 82% is the probability of something being true, underestimating the probability as 78% and overestimating it as 85% are both off target by almost the same amount.

    And then there are other factors, such as unrecognized non-independent variables, p-hacking, and random chance.

  15. adrian111 says:

    What draws my attention is the fact that the subjects were told that the information was false.

    In real life the person spreading the fake news will not tell the targeted audience that they are providing false info, as that would defeat its purpose.

    From my point of view, the ability to recognize fake news has to be measured without providing any details about the quality of the news, as similar as it can be to a real life situation.

    It should be similar in principle to testing mathematical or other technical aptitudes, as I assume the goal would be to test logic, reasoning, comprehension, etc., in relation to some given info, and then correlate that with the subjects’ cognitive capabilities.

    Or, as it is in this study, it will just measure the subjects’ willingness to revise their stance after being told by an authority figure that they were supplied false info. I may consider the study in question relevant in some ways, like how much we tend to follow authority without checking facts, but not to our topic, as per your well reasoned article.

  16. jt512 says:

    I don’t believe the finding that subjects’ cognitive ability was related to how much they changed their minds after the negative information was “corrected.” The p-value for this finding is only .03, which, even in the absence of p-hacking, would be weak evidence. But I’m suspicious that the result was p-hacked. The statistical method used to obtain this result is not clearly explained (itself a bad sign), but the authors appear to be saying that they compared subjects with cognitive ability scores below –1 SD with subjects whose scores were above +1 SD. If so, then they have, rather arbitrarily, excluded the middle third of subjects in the experimental group from the analysis.

    Since they had a continuous measure of cognitive ability and a continuous measure of change in attitude after being given the correcting information, the natural thing to do would be to determine the correlation between the two measures. Or, if they insist on categorizing, to split the entire experimental group into high- and low-cognitive-ability groups based on some criterion determined in advance. However, there is no indication that the study was pre-registered, and no justification is given for the definition of high and low cognitive ability used (or for the apparent exclusion of the middle third). I have to wonder how many different analyses the authors tried (or would have tried) to obtain a p-value (barely) below .05.
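
    In sketch form, the contrast I have in mind looks like this (simulated data, obviously, since I don’t have theirs; the effect size and noise level are arbitrary):

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(3)
        n = 200                                         # simulated experimental group
        cognitive = rng.normal(0, 1, n)                 # standardized cognitive ability
        attitude_change = 50 + 3 * cognitive + rng.normal(0, 15, n)   # weak true relation

        # The natural analysis: use every subject and report a correlation.
        r, p_all = stats.pearsonr(cognitive, attitude_change)

        # The extreme-groups analysis: keep only subjects beyond +/- 1 SD.
        low = attitude_change[cognitive < -1]
        high = attitude_change[cognitive > 1]
        p_extreme = stats.ttest_ind(low, high).pvalue

        print(f"all {n} subjects: r = {r:.2f}, p = {p_all:.3f}")
        print(f"extreme groups only (n = {len(low) + len(high)}): p = {p_extreme:.3f}")

    The point is not which p-value comes out smaller in any one simulated dataset, but that the cutoff is a free parameter – move it around and you get a different subset, a different test, and another chance at a p-value under .05.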

  17. orthodoxcaveman says:

    Alternative conclusion: people with higher cognitive ability overcompensate in their opinion of somebody’s qualities after being presented with new information that corrects false information about negative characteristics of the person.

    It is also interesting to read the authors’ discussion of future research – they do not suggest replication or possible methods of strengthening the study design, only studies to extend the research to demonstrate the same thing.

  18. captainlechuck says:

    I appreciate that there may be significant issues with this study’s design and conclusions, but I don’t think small effect sizes are inherently meaningless or incapable of making a ‘practical difference’.

    My understanding is that large effect sizes are needed to draw conclusions at the level of the individual, but if the target population is sufficiently large then a small effect can still have a meaningful impact (public health strategies are a good example of this).

  19. You have to consider what you mean by effect size. That could mean a small number of people affected, or the average effect on each person was small.

    Public health usually deals with the former – let’s say death is the outcome. This is a huge (some would say the ultimate) health outcome. Reducing death by a small amount is still useful (if the effect is real).

    But let’s say that pain is the outcome, and an intervention reduces pain on a 1-10 scale on average from 6 to 5. In this case the effect itself is small, as most people will not even perceive a 1 point difference in pain.

    This gets to the problem of clinical significance. However, small effect sizes also make it more difficult to conclude the effect is real, even if statistically significant, because small methodological errors can cause small but significant effects, and they are hard to eliminate from studies. Even a little bit of p-hacking can produce them, for example.
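
    To put rough numbers on the distinction (all of these are invented for illustration, not taken from any study):

        # 1. Small probability shift, huge outcome: the benefit scales with the population.
        baseline_mortality = 0.020      # hypothetical 2% yearly risk without treatment
        treated_mortality  = 0.018      # hypothetical 0.2 percentage-point reduction
        population         = 1_000_000
        deaths_prevented   = population * (baseline_mortality - treated_mortality)
        print(f"Deaths prevented per million treated: {deaths_prevented:.0f}")   # 2000

        # 2. Small per-person change on a subjective scale: may fall below what anyone notices.
        mean_pain_before, mean_pain_after = 6.0, 5.0   # 0-10 pain scale
        minimal_detectable_change = 1.5                # assumed perceptibility threshold
        perceptible = (mean_pain_before - mean_pain_after) >= minimal_detectable_change
        print(f"Is the average pain reduction perceptible? {perceptible}")       # False

    Both are “small effects,” but the first kind can still matter at scale, while the second may not matter to any individual patient no matter how many people you treat.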
