Publishing False Positives

Jan 05 2012

Published by Steven Novella under General Science
Comments: 18

Recently researchers published a paper in which their data show, with statistical significance, that listening to a song about old age (When I’m 64) actually made people younger – not just feel younger, but literally younger in age. Of course, the claim lacks plausibility, and that was the point. Simmons, Nelson, and Simonsohn deliberately chose an impossible hypothesis in order to show how easy it is to manipulate data to generate a false positive result.

In their paper Simmons et al. describe in detail what skeptical scientists have known and been saying for years, and what other research has also demonstrated: that researcher bias can have a profound influence on the outcome of a study. They look specifically at how data are collected and analyzed, and show that the choices researchers make can influence the outcome. They refer to these choices as “researcher degrees of freedom”: choices, for example, about which variables to include, when to stop collecting data, which comparisons to make, and which statistical analyses to use.

Each of these choices may be innocent and reasonable, and the researchers can easily justify the choices they make. But when added together these degrees of freedom allow researchers to extract statistical significance out of almost any data set. Simmons and his colleagues, in fact, found that combining four common decisions about data (using two dependent variables, adding 10 more observations, controlling for gender, and dropping a condition from the test) would allow for false positive statistical significance at the p<0.05 level 60% of the time, and at the p<0.01 level 21% of the time.
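To make the inflation concrete, here is a minimal simulation sketch in Python. It is not the authors' code. It assumes two groups with no real difference, two dependent variables correlated at 0.5, a starting sample of 20 per group, and a simple z-test with known variance standing in for their analysis. All function names and parameter values are illustrative choices. It uses only two of the four decisions: testing two dependent variables (plus their average), and adding 10 more observations per group if nothing is significant the first time.

```python
import random
from statistics import NormalDist, fmean

ALPHA = 0.05
RHO = 0.5  # assumed correlation between the two dependent variables
STD_NORMAL = NormalDist()


def draw_subject(rng):
    """Return two correlated unit-variance measurements with no true effect."""
    a = rng.gauss(0, 1)
    b = RHO * a + (1 - RHO ** 2) ** 0.5 * rng.gauss(0, 1)
    return a, b


def p_value(x, y, variance):
    """Two-sided z-test for a difference in means, with known variance."""
    se = (variance / len(x) + variance / len(y)) ** 0.5
    z = (fmean(x) - fmean(y)) / se
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def any_significant(g1, g2):
    """Test DV1, DV2, and their average; report whether any test passes."""
    p_values = [
        p_value([s[dv] for s in g1], [s[dv] for s in g2], 1.0) for dv in (0, 1)
    ]
    avg_variance = (1 + RHO) / 2  # variance of the mean of the two DVs
    p_values.append(
        p_value([(s[0] + s[1]) / 2 for s in g1],
                [(s[0] + s[1]) / 2 for s in g2],
                avg_variance)
    )
    return min(p_values) < ALPHA


def false_positive_rate(n_sims, flexible, seed=0):
    """Fraction of null experiments that end up 'statistically significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        g1 = [draw_subject(rng) for _ in range(20)]
        g2 = [draw_subject(rng) for _ in range(20)]
        if not flexible:
            # One pre-planned test on the first dependent variable only.
            hits += p_value([s[0] for s in g1], [s[0] for s in g2], 1.0) < ALPHA
            continue
        found = any_significant(g1, g2)
        if not found:
            # Nothing yet: collect 10 more per group and look again.
            g1 += [draw_subject(rng) for _ in range(10)]
            g2 += [draw_subject(rng) for _ in range(10)]
            found = any_significant(g1, g2)
        hits += found
    return hits / n_sims


if __name__ == "__main__":
    print("pre-planned analysis:", false_positive_rate(4000, flexible=False))
    print("flexible analysis:   ", false_positive_rate(4000, flexible=True))
```

Under these assumptions the pre-planned analysis should come out near the nominal 5%, while the flexible one should come out noticeably higher, even though every individual step looks defensible.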

This means that any paper published with a statistical significance of p<0.05 could be more likely to be a false positive than a true positive.
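Whether a significant result is in fact more likely false than true also depends on how many tested hypotheses are true to begin with. The arithmetic can be sketched as follows. The 80% power figure and the prior fractions are assumptions chosen for illustration, not numbers from the paper, and holding power fixed while the false positive rate rises is a simplification.

```python
def positive_predictive_value(prior, power, false_positive_rate):
    """Probability that a 'significant' finding reflects a real effect.

    prior: assumed fraction of tested hypotheses that are actually true
    power: assumed chance a real effect reaches significance
    false_positive_rate: chance a null effect reaches significance
    """
    true_positives = prior * power
    false_positives = (1 - prior) * false_positive_rate
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    for prior in (0.5, 0.3, 0.1):
        nominal = positive_predictive_value(prior, 0.8, 0.05)
        inflated = positive_predictive_value(prior, 0.8, 0.60)
        print(f"prior={prior:.1f}  nominal 5%: {nominal:.2f}  inflated 60%: {inflated:.2f}")
```

With a 60% false positive rate and these assumed inputs, a significant result is more likely false than true once fewer than about 43% of the hypotheses being tested are real, which is one more reason plausibility matters.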

Worse – this effect is not really researcher fraud. In most cases researchers could be honestly making necessary choices about data collection and analysis, and they could really believe they are making the correct choices, or at least reasonable choices. But their bias will influence those choices in ways that researchers may not be aware of. Further, researchers may simply be using the techniques that “work” – meaning they give the results the researcher wants.

Worse still – it is not necessary to disclose the information necessary to detect the effect of these choices on the outcome. All of these choices about the data can be excluded from the published study. There is therefore no way for a reviewer or reader of the article to know all the “degrees of freedom” the researchers had, what analyses they tried and rejected, how they decided when to stop collecting data, etc.

This is exactly why skeptics are not impressed when, for example, ESP researchers publish papers with statistically significant but small ESP effects, such as the recent Bem papers in which he purports to show a retroactive or precognitive effect. This is as impossible as music rejuvenating listeners, and skeptics properly treated it the same way – the result of subtle data manipulation until proven otherwise. Researcher bias is one of the reasons that plausibility needs to be considered in interpreting research.

Simmons, Nelson, and Simonsohn do not just describe and document the problem; they also discuss possible solutions. They list six things researchers can do, and four things journal editors can do, to reduce this problem. These steps mainly involve transparency – disclosing all the data collected (including any data excluded from the final analysis), making decisions about end points prior to any analysis, and showing the robustness of the results by showing what the results would have been had other data analysis decisions been made. Reviewers essentially make sure this was all done.

They also discuss other options that they feel would not be effective or practical. Disclosing all the raw data is certainly a good idea, but readers are unlikely to analyze the raw data on their own. They also don’t like replacing p-value analysis with a Bayesian analysis because they feel this would just increase the degrees of freedom. I am not sure I agree with them there – for example, they argue that a Bayesian analysis requires judgments about the prior probability, but it doesn’t. You can simply calculate how much the new data should change the prior probability (essentially what a Bayesian approach is), without deciding what the prior probability was. It seems to me that Bayesian and p-value analyses both have the same problems of bias, so I agree it’s not a solution, but I don’t feel it would be worse.
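The point about reporting the change rather than the prior can be sketched in a few lines of Python. The ratio used here (how much more likely the data are if the effect is real than if it is not, often called a Bayes factor) is a hypothetical number, and the function name is my own. Each reader supplies their own prior.

```python
def update(prior, bayes_factor):
    """Apply a reported likelihood ratio to a reader-supplied prior probability."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)


if __name__ == "__main__":
    reported_factor = 3.0  # hypothetical value a study might report
    for prior in (0.5, 0.1, 0.001):
        posterior = update(prior, reported_factor)
        print(f"prior {prior} -> posterior {posterior:.4f}")
```

The same reported factor moves an even-odds claim to 75%, but leaves a highly implausible claim still highly implausible. Calculating the factor itself still involves analytic choices, which is where bias can enter.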

They also discuss the problem with replications. An exact replication would partially fix the problem, because then all of the decisions about data collection have already been made. But, they point out, prestigious journals rarely publish exact replications, and so there is little incentive for researchers to do this. Richard Wiseman encountered this problem when he tried to publish exact replications of Bem’s psi research.

Conclusion

Science is not only a self-corrective process; the methods of science themselves are self-corrective. (So it’s self-corrective in its self-correctedness.) Simmons and his colleagues have done a great service in this article, highlighting the problem of subtle researcher bias in handling data, and also being very specific in quantifying the effects of specific data decisions, and offering reasonable remedies. I essentially agree with their conclusions, and their discussion about the implications of this problem.

They hit the nail on the head when they write that the goal of science is to “discover and disseminate truth.” We want to find out what is really true, not just verify our biases and desires. That is the skeptical outlook, and it is why we are so critical of papers purporting to demonstrate highly implausible claims with flimsy data. We require high levels of statistical significance, reasonable effect sizes, transparency in the data and statistical methods, and independent replication before we would conclude that a new phenomenon is likely to be true. This is the reasonable position, historically justified, in my opinion, because of the many false positives that were prematurely accepted in the past (and continue to be today).

Science works, but it’s hard. There are many ways in which errors and bias can creep into research and so researchers have to be vigilant, journal reviewers and editors have to be vigilant, and the scientific community needs to continue to self-examine and look for ways to make the process of science more reliable. Those institutions and professions that lack this rigorous self-critical and self-corrective culture and process are simply not truly scientific.
