Feb 14 2017

Science – We Have a Reproducibility Problem

John Ioannidis has published an interesting commentary in JAMA about the current reproducibility crisis in basic and clinical scientific research. Ioannidis has built his career on examining the medical literature for overall patterns of quality. He is perhaps most famous for his study showing why most published research findings are wrong.

The goal here is to improve the science of science itself (or “metascience,” like “metacognition”). As science has progressed a few things have happened. The questions are getting deeper, more complex, and more subtle. Research methods have to be more rigorous in order to deal with these more subtle questions.

The institutions of science have also grown. Science is big business, which means that there are “market forces” which push institutions, scientists, and publishers into pathways of least resistance and maximal return. These pathways may not be optimal for quality research, however.

The stakes are also getting higher. We now have professions and regulatory schemes that are supposed to be science-based. If we take medical products, for example, the public is best served if products are safe and effective and truthful in the claims made for them. We need scientific research to tell us this, and we need to know where to set the bar. How much scientific evidence is enough? We can only answer this critical question if we know how reliable different kinds of scientific evidence are.

Ioannidis summarizes many of the challenges being faced by modern scientific institutions, all of which have been discussed here many times before. First he lays out the issue in the context of reproducibility. Whether or not the findings of a study can be replicated by independent researchers is the ultimate test of the quality of that research. There is probably something wrong with findings that cannot be reproduced. He writes:

Empirical efforts of reproducibility checks performed by industry investigators on a number of top-cited publications from leading academic institutions have shown reproducibility rates of 11% to 25%. Critics pointed out that these empirical assessments did not fully adhere to advocated principles of open, reproducible research (eg, full data sharing), so the lack of reproducibility may have occurred because of the inability to exactly reproduce the experiment. However, in the newest reproducibility efforts, all raw data become available, full protocols become public before any experiments start, and article reports are preregistered. Moreover, extensive efforts try to ensure the quality of the materials used and the rigor of the experimental designs (eg, randomization). In addition, there is extensive communication with the original authors to capture even minute details of experimentation. Results published recently on 5 cancer biology topics are now available for scrutiny, and they show problems. In brief, in 3 of them, the reproducibility efforts could not find any signal of the effect shown in the original study, and differences go beyond chance; eg, a hazard ratio for tumor-free survival of 9.37 (95% CI, 1.96-44.71) in the original study vs 0.71 (95% CI, 0.25-2.05) in the replication effort. However, in 2 of these 3 topics, the experiments could not be executed in full as planned because of unanticipated findings; eg, tumors growing too rapidly or regressing spontaneously in the controls. In the other 2 reproducibility efforts, the detected signal for some outcomes was in the same direction but apparently smaller in effect size than originally reported.
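To get a feel for what "differences go beyond chance" means for the hazard ratios quoted above, here is a rough back-of-the-envelope check. It is only a sketch under stated assumptions, not the method used in the commentary or in the replication project: it assumes the two estimates are independent, that the log hazard ratio is approximately normally distributed, and that each standard error can be recovered from the width of the reported 95% confidence interval. The function names are made up for illustration.

```python
import math

def log_hr_and_se(hr, ci_low, ci_high, z_crit=1.96):
    """Return the log hazard ratio and an approximate standard error.

    Assumption: the 95% CI is symmetric on the log scale, so its width
    is 2 * z_crit * SE.
    """
    log_hr = math.log(hr)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    return log_hr, se

def compare_hazard_ratios(original, replication):
    """Two-sided z-test for the difference between two log hazard ratios.

    Assumption: the two studies are independent, so the variances add.
    Each argument is a tuple (hr, ci_low, ci_high).
    """
    log1, se1 = log_hr_and_se(*original)
    log2, se2 = log_hr_and_se(*replication)
    z = (log1 - log2) / math.sqrt(se1 ** 2 + se2 ** 2)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p

# Figures taken from the quoted passage.
z, p = compare_hazard_ratios((9.37, 1.96, 44.71), (0.71, 0.25, 2.05))
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

Under these assumptions the two estimates differ by roughly 2.7 standard errors, which corresponds to a two-sided p-value below 0.01. That is consistent with the statement that the gap between the original and the replication is larger than chance alone would comfortably explain.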

Basically, formal attempts at assessing the reproducibility of major scientific findings show disappointingly low rates. He goes on to caution that these results are also preliminary and limited. Further, we never know if the failure to replicate is a problem with the original study, the replication, or both.

Acknowledging the uncertainty, there is enough evidence to demonstrate that there is a problem worthy of attention. Essentially the scientific community is cranking out too many poor-quality studies. They are also producing high-quality research, make no mistake. There is still clear progress in science, but it is floating on a sea of mediocre research.

This is a problem because it wastes limited resources. It also becomes a challenge to keep up with all the research, especially while most of it is unreliable. It may also not be obvious from the paper itself when research is unreliable. The flaws may not be evident. Meanwhile, real world decisions may hinge on the outcomes.

What specific factors are causing the problem?

Sample sizes are generally smaller, statistical literacy is often limited, there is limited external oversight or regulation, and investigator conflicts to publish significant results (“publish or perish”) are probably as potent as investigator and sponsor conflicts in clinical research.

Doing rigorous research is hard, expensive, and time-consuming. I frequently hear researchers describe the study they would like to run, and then the study they are actually going to run because of limited resources. The choice often comes down to publishing one quality paper, or 3-4 mediocre ones. The pressure to publish pushes them toward the 3-4 mediocre studies. The incentive structure rewards that approach.

There is also pressure to publish positive and interesting research, including original research with surprising findings. Boring replications are less rewarded. Keeping a lab and a career going for 1-2 years before having enough data to publish a single paper can be difficult.

If we want to fix the replication problem we need to change the incentives. Researchers should be incentivized (even required) to keep all raw data, meticulously document procedures, make their protocol open-access, and even to get pre-approval for study designs before collecting data.

There is also a problem with statistical literacy. Researchers routinely engage in p-hacking without even realizing it, or perhaps they know they are “cutting corners” but do not fully realize how much it invalidates their research. Part of the problem is the culture of research. There is an overreliance on a simplistic frequentist approach, where getting to a p-value of 0.05 by any means is the holy grail. This problem can be fixed by better educating researchers, better oversight, and higher standards at journals for publication.
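One common form of p-hacking is measuring many outcomes and reporting whichever one crosses 0.05. The simulation below is a minimal sketch of why that matters, not a model of any particular study. It assumes there is no real effect at all, that outcomes are independent and normally distributed with a known variance of 1 (so a simple z-test applies), and the group size, number of simulated studies, and function names are all arbitrary choices for illustration.

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def simulate_study(rng, n_outcomes, n_per_group=20):
    """Return one p-value per outcome for a study with NO true effect.

    Both groups are drawn from the same normal distribution, so every
    "significant" result is a false positive.
    """
    p_values = []
    se = math.sqrt(2 / n_per_group)  # SE of a mean difference, unit variance
    for _ in range(n_outcomes):
        treated = [rng.gauss(0, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = sum(treated) / n_per_group - sum(control) / n_per_group
        p_values.append(two_sided_p(diff / se))
    return p_values

def false_positive_rate(n_outcomes, n_studies=2000, seed=0):
    """Fraction of null studies whose BEST outcome reaches p < 0.05."""
    rng = random.Random(seed)
    hits = sum(
        min(simulate_study(rng, n_outcomes)) < 0.05 for _ in range(n_studies)
    )
    return hits / n_studies

rate_one = false_positive_rate(n_outcomes=1)
rate_ten = false_positive_rate(n_outcomes=10)
print(f"1 outcome tested:   {rate_one:.3f}")
print(f"best of 10 outcomes: {rate_ten:.3f}")
```

With a single pre-specified outcome, the false positive rate should land near the nominal 5%. When the best of 10 independent outcomes is reported, theory predicts 1 - 0.95^10, or about 40%, and the simulation should come out close to that. The researcher has not faked anything, but the reported p-value no longer means what it claims to mean.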


Science still works, and advances are made by the most rigorous research which is independently replicated. It takes years, often even decades, for a research program to mature to this level, however. The conclusion, therefore, is not that science is wrong, but that science takes longer than you think. There is a long build up of low-quality preliminary research that is highly unreliable. Eventually, in some cases, we get to the kind of rigor and replication that is reliable.

Along the way, however, confusion reigns. The public is informed, often in breathless terms, about preliminary findings that are likely wrong. Regulations and standards of care may even be based upon unreliable preliminary findings. Industries have emerged on the wave of unreproducible evidence (such as the supplement industry and the alternative medicine industry).

For now we definitely need to raise the bar of how much scientific evidence is convincing.

Meanwhile we need to find ways to reduce the number of unreproducible studies and increase the percentage of reliable studies. We are simply wasting too many resources on worthless research.

I agree with all of Ioannidis’s recommendations. Scientists should be publishing fewer, more rigorous studies. Standards for research need to be higher, and statistical literacy needs to be much higher. We should have the goal of eliminating p-hacking entirely from published research. Academic institutions need to change their reward structure, and journal editors need to change their priorities.

Ioannidis also points out that we need to do this without imposing crushing regulations on scientists themselves. That can be counterproductive. We need to be smart, not heavy-handed. This will be challenging, but scientists generally are pretty smart people.

