May 08 2015

A Reproducibility Experiment

I have been writing quite a bit here and on Science-Based Medicine about metascience – the study of science itself. When you think about it, science is perhaps the most critical and broadly applicable technology of our modern civilization. It is the one endeavor from which all other technologies derive.

It is therefore very important that we understand how the institutions and processes of science are functioning. Any inefficiencies or biases in the system can cause great harm, wasting resources and slowing progress.

A recent metascience project, The Reproducibility Project: Psychology, asks a very important question:

Do normative scientific practices and incentive structures produce a biased body of research evidence?

The project focuses on one specific aspect of this question – how reproducible are published psychological studies? They looked at 100 generally accepted studies with positive results published in prominent psychology journals. They then crowdsourced replications of these studies, with 270 authors participating.

The results have yet to be formally published, but they have been made public (not surprising, since the project is being led by the Center for Open Science). Here are the basic results: of the 100 replications, 39 met predetermined criteria for replicating the results of the original study, while 61 did not. Of the 61 non-replications, however, 24 had results similar to the original study but did not meet the criteria for replication – generally meaning a statistical trend in the right direction without statistical significance.

Taken at face value, this could mean that about 60% of published positive original psychological studies are false positives (it would take more replications to know for sure). This is intriguing because that 60% figure keeps cropping up in metascience research.
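To see why a replication rate in the high 30s is at least consistent with that figure, here is a minimal sketch. The replication power (90% for real effects) and false positive rate (5% for null effects) are assumptions for illustration, not values taken from the project:

```python
# A rough sketch (illustrative numbers, not from the project) of how the
# replication rate relates to the fraction of original positive findings that
# are real: true effects replicate at roughly the power of the replication,
# while false positives "replicate" only at the alpha level.

def expected_replication_rate(frac_true, power=0.9, alpha=0.05):
    """Expected fraction of successful replications, given the fraction of
    original positive findings that reflect real effects."""
    return frac_true * power + (1 - frac_true) * alpha

for frac_true in (0.2, 0.4, 0.6, 0.8):
    rate = expected_replication_rate(frac_true)
    print(f"{frac_true:.0%} real effects -> expected replication rate {rate:.0%}")
```

Under these assumptions a replication rate near 39% lines up with roughly 40% of the original findings being real – in other words, about 60% false positives – though different assumptions about replication power would shift that estimate.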

In 2011 Simmons et al. demonstrated that, by exploiting four researcher degrees of freedom, you could manufacture a positive result at the 0.05 significance threshold from completely null data 60% of the time.
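Their point is easy to reproduce in miniature. The sketch below is not their exact procedure; it combines just two such degrees of freedom – testing a few related outcome measures and peeking at the data before deciding whether to collect more subjects – on purely null data:

```python
# A minimal p-hacking sketch: two groups drawn from the SAME distribution,
# analyzed with a couple of flexible choices, and counted as "positive" if
# any analysis reaches p < 0.05.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def any_significant(a, b):
    """Test outcome 1, outcome 2, and their average; True if any p < 0.05."""
    candidates = [(a[:, 0], b[:, 0]), (a[:, 1], b[:, 1]),
                  (a.mean(axis=1), b.mean(axis=1))]
    return any(stats.ttest_ind(x, y).pvalue < 0.05 for x, y in candidates)

def one_flexible_experiment(n=20, extra=10, r=0.5):
    """Two groups with no real difference, plus flexible analysis choices."""
    cov = [[1.0, r], [r, 1.0]]              # two correlated outcome measures
    a = rng.multivariate_normal([0, 0], cov, n)
    b = rng.multivariate_normal([0, 0], cov, n)
    if any_significant(a, b):
        return True
    # Optional stopping: not significant yet, so collect a few more subjects.
    a = np.vstack([a, rng.multivariate_normal([0, 0], cov, extra)])
    b = np.vstack([b, rng.multivariate_normal([0, 0], cov, extra)])
    return any_significant(a, b)

trials = 5000
false_positives = sum(one_flexible_experiment() for _ in range(trials))
print(f"nominal alpha = 5%, observed false positive rate = {false_positives / trials:.1%}")
```

Even this stripped-down version pushes the false positive rate well above the nominal 5%; stacking all four degrees of freedom together is what drove the rate in Simmons et al.’s simulations toward 60%.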

In 2013 Prasad et al. published a review in which they found that 62% of studies looking at currently accepted practices in medicine failed to support the practice (40.2% were reversals and 21.8% were inconclusive).

Ioannidis also famously published an analysis in which he argues that most published research findings are false. The exact percentage depends on a number of variables, such as the prior probability of the idea being tested, the number of scientists investigating the question, and the inherent bias toward one outcome (often revealed in the funding source).
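The core of Ioannidis’s argument is an application of Bayes’ theorem. The sketch below leaves out his additional corrections for bias and for multiple teams chasing the same question, and the power and alpha values are typical assumptions rather than his exact figures:

```python
# How the prior probability of a hypothesis determines the chance that a
# statistically significant result reflects a real effect (the positive
# predictive value), ignoring bias and multiple-team effects.

def positive_predictive_value(prior, power=0.8, alpha=0.05):
    """P(effect is real | the study came out positive), by Bayes' theorem."""
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

for prior in (0.5, 0.1, 0.01, 0.001):
    ppv = positive_predictive_value(prior)
    print(f"prior {prior:>6}: a positive result is real about {ppv:.0%} of the time")
```

For well-motivated hypotheses a positive result is probably real; for long-shot hypotheses most positive results are false, and adding bias or low power only makes the numbers worse.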

This is not to say that 60% of medical research findings are wrong – in fact, there is evidence to suggest otherwise. A 2013 study estimated the false positive rate in top medical journals at 14%, and a 2005 JAMA study looking at reported intervention effects found that 16% were refuted by later research.

Conclusion

It is difficult to say how much published research comes to what is ultimately the wrong conclusion, and different ways of addressing the question yield different answers. We can, however, draw some clear conclusions about the scientific literature.

One conclusion is that no individual study is definitive. An individual study can be wrong for many reasons: a fluke of the data, researcher bias (or p-hacking), or poor methodology. Various forms of publication bias may also favor false positive studies, so they are overrepresented in the published literature. Basing conclusions on one or a few studies is therefore unreliable.

It is also clear that small and preliminary (meaning less rigorous) studies are more likely to be wrong, with a bias in the false positive direction.
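This exaggeration is easy to demonstrate with a simulation. The numbers below are assumptions chosen for illustration, not taken from any of the studies discussed above:

```python
# Why small studies filtered for statistical significance overstate effects:
# only the lucky overestimates clear the p < 0.05 bar, so the significant
# small studies exaggerate the true effect far more than the large ones do.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
TRUE_EFFECT = 0.2        # modest true group difference, in standard deviations

def significant_estimates(n_per_group, trials=5000):
    """Effect size estimates from the simulated studies that reached p < 0.05."""
    estimates = []
    for _ in range(trials):
        a = rng.normal(TRUE_EFFECT, 1, n_per_group)
        b = rng.normal(0, 1, n_per_group)
        if stats.ttest_ind(a, b).pvalue < 0.05:
            estimates.append(a.mean() - b.mean())
    return estimates

for n in (20, 200):
    est = significant_estimates(n)
    print(f"n={n:>3} per group: {len(est)} significant studies, "
          f"mean estimated effect {np.mean(est):.2f} (true effect {TRUE_EFFECT})")
```

In runs of this sketch the significant small studies overstate the true effect severalfold, while the larger studies come much closer.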

What this means is that study findings are only compelling once they have been sufficiently and independently replicated. Most preliminary findings will probably not be replicated, depending on their prior probability. Researchers and journals need to value replication more, shifting the bias away from new and provocative research and toward replications, which are actually the workhorse of the scientific literature.

When looking at the scientific literature to answer a question (is something real, does a treatment work, etc.) I follow this basic approach:

First I determine the prior probability. This is a gestalt assessment based on everything we know from every relevant discipline of science: how likely is it that the claim is true? This judgment does not have to be very detailed or precise to be useful – even dividing claims into “very likely,” “likely,” “neutral,” “unlikely,” and “hell no” helps.

I then look at all the scientific literature, relying upon systematic reviews to get a good overview and, when necessary, looking more closely at the most important individual studies. I have several questions:

– How many studies are there?
– How rigorous are those studies?
– What are the results of the most rigorous studies?
– What are the effect sizes?
– How statistically robust are the results?
– Are the results generally (and independently) replicated?
– Are the results fairly consistent, or are they heterogeneous?

I tend to accept claims when rigorous published evidence shows a consistent, robust result with reasonable effect sizes, and when the amount of evidence is in proportion to the plausibility of the claim. What I tend to find is that for highly implausible claims you never get all of these things at the same time. You may get statistically robust results, but with razor-thin effect sizes that don’t replicate. Or perhaps only one research team is able to generate positive results. Or the results are all over the place with no consistency.
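For the statistical side of those questions, here is a crude sketch of how the pieces fit together, using made-up study results purely to illustrate the calculation: pool the effect estimates, check whether the pooled result is statistically robust, and check how consistent the individual studies are.

```python
# A toy fixed-effect meta-analysis: inverse-variance pooling, a z-score for
# the pooled effect (statistical robustness), and Cochran's Q / I^2 for
# heterogeneity (consistency across studies). Study values are hypothetical.

import math

# (effect estimate, standard error) for a handful of hypothetical studies
studies = [(0.30, 0.12), (0.22, 0.15), (0.35, 0.10), (0.18, 0.20), (0.28, 0.09)]

weights = [1 / se ** 2 for _, se in studies]
pooled = sum(w * e for (e, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
z = pooled / pooled_se

# Cochran's Q and I^2: how much between-study variation exceeds chance?
q = sum(w * (e - pooled) ** 2 for (e, _), w in zip(studies, weights))
df = len(studies) - 1
i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0

print(f"pooled effect {pooled:.2f} (SE {pooled_se:.2f}, z = {z:.1f})")
print(f"heterogeneity: Q = {q:.1f} on {df} df, I^2 = {i_squared:.0%}")
```

A robust pooled effect with low heterogeneity is what the consistent pattern described above looks like in practice; razor-thin pooled effects or wildly inconsistent studies are the warning signs.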

When looking properly at the entire scientific literature relating to a question, you can come to highly reliable conclusions. However, that literature needs to be of the highest quality possible, composed of studies that are rigorous and free from bias or fraud. That is what we are trying to improve by examining the inherent biases and flaws in the literature.
