Nov 04 2016

Reproducibility is Critical

Part of expertise comes from doing something for so long, so many times, that you see patterns that might not be immediately apparent. I have been doing the skeptical thing for over 20 years, and one of the things I really enjoy about being an all-purpose skeptic, as I like to call it, is that I can see patterns of thought, argument, and behavior among disparate beliefs.

The anomaly hunting of ghost hunters is the same as the anomaly hunting of 9/11 truthers.

One very dominant pattern that I see on a frequent basis is the tendency to cite preliminary evidence as if it is rock-solid and confirmed. People will worry about the health effects of GMOs or vaccines because of a few flawed studies. They will promote the health benefits of a supplement based upon a preclinical study that is many steps removed from actual clinical claims. They will accept a new phenomenon as real based on studies that have never been replicated.

To put this into its broadest context, we need to think explicitly about the relationship between levels of scientific evidence and how much we should accept those results. When do we conclude that a scientific finding is probably real? For applied sciences like medicine this has a very practical form – when do we recommend a treatment?

In fact this is an absolutely core aspect of skepticism – calibrating your sense of when scientific evidence is reliable. Where is the threshold of acceptance? True believers (to use the term broadly) have too low a threshold. Deniers have too high a threshold, or conclude that science is incapable of reaching a threshold of reliability, either in general or with respect to their specific area of interest.

The Role of Reproducibility

A recent article in the Naturejobs blog discusses the importance of reproducibility in science. The author, Andy Tay, examines it at the personal, institutional, and societal levels, writing:

At the societal level, ensuring reproducibility in science helps to maintain the public’s faith in science and to ensure continual support for scientists who wish to utilize science to elevate societal well-being.

The economic loss in supporting irreproducible science is obvious.

As one example of the economic and opportunity loss, Tay cites a review of preclinical cancer research which found that such studies could not be replicated 80-90% of the time. The review also found that preclinical studies that were not reproduced were cited more often overall than studies that were successfully replicated.

The authors were investigating this question because of the very low rate at which possible cancer treatments end up working in clinical trials, which is lower than for medical research in general. They conclude that reliance on preclinical studies that have not been replicated is a major factor.

This is just one piece of what is being called the reproducibility crisis in science. I think that’s a bit dramatic, but it is true that we need to emphasize the importance of reproducibility and perhaps make structural changes to the funding and reporting of science to increase the role of replication studies.

Reproducibility is so important because science is hard and the world is complex. It is expensive and tedious to do an extremely rigorous study. Often scientists don’t even know how to design a truly rigorous study until they have completed preliminary studies to learn more about the phenomenon. So by necessity there are many preliminary studies out there – studies which follow some scientific methodology but are not strong enough to warrant confident conclusions. Their results should guide further research, but nothing else.

There is also the problem of researcher bias. Rigorous methods are designed to control for bias, but researcher bias can be very subtle.

Performing independent replications of other people’s research is the best way to determine if a phenomenon is real or just an illusion of bias, flawed methodology, or quirky results.
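
To make that concrete, here is a minimal simulation sketch (my own illustration, not anything from Tay’s article): run many small studies on a purely null effect and on a modest real effect, counting a study as “positive” if its sample mean clears roughly two standard errors. The cutoff, sample size, and effect size are all arbitrary assumptions.

```python
# A minimal simulation (illustrative numbers, not real data): compare how often
# a null effect and a modest real effect pass one small study vs. two in a row.
import random
import statistics

def positive_study(effect: float, n: int = 20, sd: float = 1.0) -> bool:
    """One small study: call it positive if the sample mean clears ~2 standard errors."""
    sample = [random.gauss(effect, sd) for _ in range(n)]
    standard_error = sd / n ** 0.5
    return statistics.mean(sample) > 2 * standard_error

random.seed(1)
trials = 10_000
null_once = sum(positive_study(0.0) for _ in range(trials))
real_once = sum(positive_study(0.5) for _ in range(trials))
# Requiring an independent replication: both studies must come up positive.
null_twice = sum(positive_study(0.0) and positive_study(0.0) for _ in range(trials))
real_twice = sum(positive_study(0.5) and positive_study(0.5) for _ in range(trials))

print(f"null effect: {null_once} positives, {null_twice} survive replication")
print(f"real effect: {real_once} positives, {real_twice} survive replication")
```

The exact numbers depend on those assumptions, but the asymmetry does not: a single study lets a steady trickle of flukes through, while requiring one independent replication crushes the false positives and only modestly thins out the real effects.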

How much replication is necessary? That is a judgment call specific to each individual question. There are many variables to consider.

My overall threshold is this: Before we conclude that a phenomenon or scientific finding is probably true, we need to see all of the following simultaneously:

  • Studies that are sufficiently rigorous
  • Results that are unambiguous (statistically significant, with good effect sizes)
  • A reasonable signal-to-noise ratio (the effect size has to be large compared to the background noise)
  • Sufficient independent replication
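
To make the checklist concrete, here is a toy sketch of those four criteria as code. The specific cutoffs – p < 0.01, an effect of at least half a standard deviation of the noise, three independent replications – are illustrative assumptions, not standards from any literature.

```python
# A toy encoding of the four criteria above (illustrative cutoffs, not standards).
from dataclasses import dataclass

@dataclass
class Study:
    rigorous: bool      # adequate blinding, controls, statistical power, etc.
    p_value: float      # significance of the primary outcome
    mean_effect: float  # observed effect, in raw units
    noise_sd: float     # background variability, in the same units

def effect_size(s: Study) -> float:
    """Standardized effect: the signal measured against the background noise."""
    return s.mean_effect / s.noise_sd

def probably_real(studies: list[Study], min_replications: int = 3) -> bool:
    """All four criteria must hold at the same time, across independent studies."""
    if len(studies) < min_replications:   # sufficient independent replication
        return False
    return all(
        s.rigorous                        # sufficiently rigorous studies
        and s.p_value < 0.01              # unambiguous results
        and effect_size(s) >= 0.5         # signal well above the noise
        for s in studies
    )
```

The cutoffs are debatable; the structure is the point. The criteria are conjunctive – a pile of studies that each satisfy only one or two of them never adds up to acceptance.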

We never see this, for example, with ESP research. There is no research paradigm demonstrating a psi phenomenon that meets all these criteria. The same is true of homeopathy, acupuncture, pyramid power, cold fusion, dowsing, or any of the other countless alleged phenomena about which skeptics are skeptical.

How much is “sufficient”? That is where prior probability comes in. How plausible are the claims to begin with? The more extraordinary the claims, the higher the threshold of evidence. (But to be clear, the phenomena I listed above do not meet even minimal scientific thresholds, let alone an extraordinary threshold commensurate with their low plausibility.)
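
Here is a minimal Bayesian sketch of why prior plausibility moves the threshold. The numbers – 80% study power and a 5% false-positive rate – are illustrative assumptions, but the pattern holds for any reasonable values.

```python
def posterior(prior: float, power: float = 0.8, alpha: float = 0.05) -> float:
    """P(phenomenon is real | one positive study), via Bayes' theorem."""
    true_positive = power * prior          # a real effect, and the study detects it
    false_positive = alpha * (1 - prior)   # no effect, but the study comes up positive anyway
    return true_positive / (true_positive + false_positive)

print(round(posterior(0.5), 3))    # plausible claim (even odds going in): ~0.941
print(round(posterior(0.001), 3))  # extraordinary claim (1-in-1000 prior): ~0.016
```

For a plausible claim, one good positive study is fairly persuasive; for an extraordinary one, the same study leaves it overwhelmingly likely that the result is a false positive. That is “extraordinary claims require extraordinary evidence” in arithmetic form.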

As Tay points out in his commentary, the stakes can be very high. We waste tremendous resources chasing preliminary results that have not been replicated. With respect to cancer research, for example, the authors of the review suggest that progress is slowed and resources wasted when clinical studies are based on preclinical studies that have not been replicated.

Part of the solution, it seems, must be to shift more resources toward performing replications of promising studies. This means giving replications more academic credit, and more priority for publication and funding.

It is not enough to just say that replications are important – the scientific community has to put its money where its mouth is. As the old saying goes, “Don’t tell me your priorities, show me your budget.”

Researchers need incentives to perform replications. Right now the emphasis is on new and sexy findings, to maximize the impact factor of journals. But those are the findings that are least likely to replicate, and most likely to be cited.

This is potentially a massive inefficiency in the entire scientific infrastructure.

I am not one of those people who will argue that, because of structural problems like this, “science is broken.” I disagree with that conclusion. Replications eventually get done, and bad results are weeded out. Keep in mind – the cancer researchers identified the problem because the preclinical studies were not working when tested clinically.

We should see this more as a problem of efficiency. As a society we waste tremendous precious scientific resources when we accept preliminary findings too soon, prior to adequate replication.

Part of the problem is that such findings are communicated to the public, often in the form of explicit claims. Preliminary and unreliable findings tend to dominate controversial areas, like the safety of GMOs or vaccines, the effectiveness of public health measures or educational paradigms, and pretty much the entire supplement, alternative medicine, dieting, and self-help industries.

We have identified the problem and the fixes. We need to keep up the pressure until real systemic changes are made. We need to educate the public to minimize the damage of over-reporting preliminary findings. We need to put pressure on journalists to raise their standards of science reporting.
