Aug 31 2015

The Reproducibility Problem

A recent massive study attempting to replicate 100 published studies in psychology has been getting a lot of attention, deservedly so. Much of the coverage has been fairly good, actually – probably because the results are rather wonky. Many have been quick to point out that “science isn’t broken” while others ask, “is science broken?”

While many, including the authors, express surprise at the results of the study, I was not surprised at all. The results support what I have been saying in this blog and at SBM for years – we need to take replication more seriously.

Here are the results of the study:

We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. Replication effects (Mr = .197, SD = .257) were half the magnitude of original effects (Mr = .403, SD = .188), representing a substantial decline. Ninety-seven percent of original studies had significant results (p < .05). Thirty-six percent of replications had significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and, if no bias in original results is assumed, combining original and replication results left 68% with significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.

Let’s unpack this. The big result is that while 97% of the original 100 studies had statistically significant results, only 36% of the replications did. If you look at effect sizes rather than significance, 47% of the original effect sizes fell within the 95% confidence interval of the replication effect size. If you combine these two, the authors concluded that 39% of the replication studies confirmed the results of the original studies.

Using a threshold for statistical significance of p=0.05, you might think that there should be a 95% chance that the results are “real” and therefore most of the replications should also be positive. This is a misinterpretation of p-values, however.
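To see why, consider a back-of-the-envelope calculation. This is a sketch of my own with illustrative numbers, not figures from the replication study: the chance that a significant result reflects a real effect depends on how many of the tested hypotheses are true in the first place, and on the power of the studies.

```python
# A minimal sketch (not from the study) of why p < 0.05 does not mean a 95%
# chance the effect is real. The prior (fraction of tested hypotheses that are
# true) and the statistical power are illustrative assumptions.

def positive_predictive_value(prior, power, alpha=0.05):
    """Fraction of 'significant' results that reflect a real effect."""
    true_positives = prior * power          # real effects correctly detected
    false_positives = (1 - prior) * alpha   # null effects crossing p < alpha by chance
    return true_positives / (true_positives + false_positives)

if __name__ == "__main__":
    for prior in (0.5, 0.1, 0.01):
        ppv = positive_predictive_value(prior, power=0.8)
        print(f"prior = {prior:>4}: chance a significant result is real = {ppv:.0%}")
```

With these assumed numbers, a significant result only approaches a 95% chance of being real when about half of the tested hypotheses are true to begin with; for long-shot hypotheses, the chance is far lower.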

I have discussed many reasons why a single study, even with high levels of statistical significance, may not be reliable. One is that p-values themselves don’t reliably replicate. Watch this video, the dance of the p-values, to see what I mean. Even when you use a computer program with a fixed effect size and then generate random data, the resulting p-values are all over the place. P-values were never meant to be a solitary indication of whether the results of an experiment are real – only a first approximation of whether or not the results are interesting.
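If you want to see the dance for yourself, here is a minimal simulation in the same spirit (my own sketch, not the code behind the video). The true effect is fixed and real in every run, yet the p-values bounce all over the place; the sample size and effect size are arbitrary choices for illustration.

```python
# Rough simulation in the spirit of the "dance of the p-values": repeatedly
# sample from a population with a fixed, real effect and watch how widely the
# resulting p-values scatter.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
effect_size = 0.5   # true standardized difference between groups
n = 30              # participants per group

p_values = []
for _ in range(20):
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(effect_size, 1.0, n)
    _, p = stats.ttest_ind(treatment, control)
    p_values.append(p)

# Even though the effect is real and identical every time, the p-values range
# from clearly "significant" to clearly not.
print([round(p, 3) for p in sorted(p_values)])
```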

So, even with iron-clad experimental design and execution, we would not expect 95% of studies with a p-value of 0.05 to replicate. But most studies do not have iron-clad experimental design and execution. I have discussed before researcher degrees of freedom and p-hacking. Essentially, this is the practice (whether done intentionally, innocently, or with a bit of a wink – cutting corners while thinking it doesn’t really matter) of tweaking the execution of an experiment as one looks at the data in order to get across the magical line of statistical significance. You could, for example, keep collecting data until the drunken walk wanders over the line of significance, then stop and publish.
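To make the optional-stopping version of p-hacking concrete, here is a small simulation of my own (the batch size and subject cap are arbitrary assumptions). There is no real effect at all, yet checking the p-value after every batch of subjects and stopping as soon as it dips below 0.05 produces far more than the nominal 5% rate of “positive” studies.

```python
# Minimal sketch of "keep collecting data until you cross the line": both
# groups are drawn from the same null distribution, but peeking after every
# batch and stopping at the first p < 0.05 inflates the false-positive rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def optional_stopping_is_significant(max_n=200, batch=10, alpha=0.05):
    a, b = [], []
    while len(a) < max_n:
        a.extend(rng.normal(0, 1, batch))   # no real difference between
        b.extend(rng.normal(0, 1, batch))   # the two groups
        _, p = stats.ttest_ind(a, b)
        if p < alpha:
            return True                     # stop and "publish"
    return False

false_positives = sum(optional_stopping_is_significant() for _ in range(1000))
print(f"False-positive rate with optional stopping: {false_positives / 1000:.1%}")
```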

Exact replications take away researcher degrees of freedom – and with them p-hacking – and therefore expose many false-positive studies. Another way to reduce degrees of freedom, which I also discussed recently, is registering trials prior to execution. Simply doing this, in a recent analysis, reduced positive studies from 57% to 8%.

This new study supports another remedy to the apparent abundance of false-positive studies in the scientific literature – valuing replications more. The Journal of Personality and Social Psychology infamously refused to publish an exact replication by Richard Wiseman of Bem’s ESP research. The journal simply said that they do not publish exact replications as a matter of policy.

The reason for this is that replications are boring while publishing new exciting research increases a journal’s prestige and impact factor. Of course, new exciting research is more likely to be wrong, precisely because it is new and because of what makes it exciting – it goes against the grain.

One aspect of this study often overlooked in popular reporting is the impact of effect sizes. First, the authors found that effect sizes in the replication studies were about half those of the original studies. This is a well-documented phenomenon known as the decline effect – the tendency of effect sizes to decrease as new studies of the same question are published. Sometimes effect sizes decline to zero; sometimes they settle at a positive but diminished result.

ESP researchers, grappling with their own decline effect (consistently to zero), actually proposed that ESP as a phenomenon tends to decline over time as researchers pay attention to it. This is not only absurd, it is completely unnecessary. The far simpler explanation is that as researchers address a scientific question they get better and better at designing studies, learning from the previous studies. As study design and execution get more rigorous over time, researcher bias is constrained, and effect sizes diminish.

What do we do now?

Every discipline of science has its own culture, journals, and practices, and some are more rigorous than others. But across the board it seems we need to shift the emphasis toward replications. Everywhere along the line of research, replications need to carry more academic and scientific value. If you talk to researchers, they know the value of replications – that is how we know what is real and what isn’t. But the incentives all favor doing new and exciting research, and the rewards for doing exact replications are slight.

Journal editors bear a large share of the blame. They need to start publishing more replications. There is probably an optimal balance in there somewhere – the perfect mix of new exploratory science with replications and confirmatory science. This is like getting the proper mix of fuel and oxygen in an engine: if the mix is not right, the engine runs inefficiently.

All of these problems with science do not mean science is broken. They mean it is inefficient. In the long run, replications get done, and the science sorts itself out – only real effects persist over time. But I don’t think we have the mix right. Perverse incentives have pushed the system too far in the exploratory direction, resulting in a flood of false-positive studies and a deficit of replications to tell which ones are real.

We know what the problem is and how to fix it.

Putting it All Together

This study also has implications for how the average person should evaluate scientific studies and scientific knowledge. The question is: which scientific results are compelling, and where should we set the threshold for accepting that a claim is probably true?

From my experience it seems that many people believe that if a single study shows something, then they can treat the results as real – especially if it confirms what they want to believe. I am often challenged with single studies, as if they support a position against which I am arguing.

I have laid out exactly what kind of evidence I find compelling. Research evidence is compelling when it has all of the following features simultaneously:

1 – Rigorous study design

2 – Statistically significant results

3 – Effect sizes that are substantially above noise levels

4 – Independent replication

Many people focus only on criterion #2 – if the results are significant, then the phenomenon is real. But #2 is perhaps the least important of these four. The current study of replications showed that effect sizes were a better predictor of replicability than statistical significance.

For example, I am not convinced of the reality of ESP, homeopathy, acupuncture, or astrology, because in these disciplines you never see a specific effect that shows a significant and large effect size in a rigorous study design and is reliably and independently replicated. Either the effect sizes are razor thin, or the study design is loose, or the result is a one-off that cannot be replicated.

All of this is also only looking at the evidence itself. In addition, you have to consider scientific plausibility. There is always the subjective question: how significant, how large an effect size, how many times does it have to be replicated? The answer is – it depends on how plausible or implausible the alleged effect is. The threshold for something like homeopathy is very high, because its plausibility is close to zero. (But to be clear, the evidence for homeopathy does not even reach the minimal threshold for a highly plausible effect, let alone the magic that is homeopathy.)
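One way to make this concrete is a rough Bayesian sketch of my own, with assumed numbers: treat the strength of a study’s evidence as a Bayes factor and see how far the same evidence moves claims of different prior plausibility.

```python
# Back-of-the-envelope illustration of "the threshold depends on plausibility":
# the same strength of evidence (an assumed, illustrative Bayes factor) moves a
# plausible claim to near certainty but barely budges a wildly implausible one.

def posterior_probability(prior, bayes_factor):
    """Posterior probability that the claim is true, given prior and Bayes factor."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)

bf = 20  # fairly strong evidence from a single positive study (illustrative)
for label, prior in [("plausible drug effect", 0.5),
                     ("long-shot hypothesis", 0.01),
                     ("homeopathy-level implausibility", 1e-6)]:
    post = posterior_probability(prior, bf)
    print(f"{label:<32} prior {prior:>8}: posterior = {post:.4f}")
```

With these assumed numbers, evidence that would make a plausible claim about 95% likely leaves a homeopathy-level claim still vanishingly unlikely – which is why the bar for evidence has to rise as plausibility falls.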

Keep all this in mind the next time a new exciting study is being shared around social media. Put the study through the filter I outlined above.
