Aug 31 2015

## The Reproducibility Problem

A recent massive study attempting to replicate 100 published studies in psychology has been getting a lot of attention, deservedly so. Much of the coverage has been fairly good, actually – probably because the results are rather wonky. Many have been quick to point out that “science isn’t broken” while others ask, “is science broken?”

While many, including the authors, express surprise at the results of the study, I was not surprised at all. The results support what I have been saying in this blog and at SBM for years – we need to take replication more seriously.

Here are the results of the study:

We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. Replication effects (Mr = .197, SD = .257) were half the magnitude of original effects (Mr = .403, SD = .188), representing a substantial decline. Ninety-seven percent of original studies had significant results (p < .05). Thirty-six percent of replications had significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and, if no bias in original results is assumed, combining original and replication results left 68% with significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.

Let’s unpack this. The big result is that while 97% of the original 100 studies had statistically significant results, only 36% of the replications did. If you look at effect sizes rather than significance, 47% of the original effect sizes fell within the 95% confidence interval of the replication effect size. And by the authors’ own subjective rating, 39% of the replication studies confirmed the results of the original studies.

Using a threshold for statistical significance of p=0.05, you might think that there should be a 95% chance that the results are “real” and therefore most of the replications should also be positive. This is a misinterpretation of p-values, however.

I have discussed many reasons why a single study, even with high levels of statistical significance, may not be reliable. One is that p-values themselves don’t reliably replicate. Watch this video, dance of the p-values, to see what I mean. Even when you use a computer program with a fixed effect size and then generate random data, the resulting p-values are all over the place. P-values were never meant to be a solitary indication of whether the results of an experiment are real – only a first approximation of whether or not the results are interesting.
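To make the point concrete, here is a minimal sketch of that kind of simulation. It is not the program from the video; the group size of 32, the true effect of 0.5 standard deviations, and the function name `simulated_p_value` are all assumptions chosen for illustration. Every run uses exactly the same true effect, and only the random sample changes.

```python
import random
from statistics import NormalDist, mean


def simulated_p_value(rng, n=32, true_effect=0.5):
    """Run one simulated two-group experiment and return its two-sided p-value.

    Both groups are drawn with a known standard deviation of 1, so a simple
    z-test applies. n and true_effect are illustrative assumptions.
    """
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    standard_error = (2.0 / n) ** 0.5  # SE of a difference of two means, sd = 1
    z = (mean(treated) - mean(control)) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Twenty "replications" of the identical experiment with the identical true effect.
rng = random.Random(2015)
p_values = [simulated_p_value(rng) for _ in range(20)]
for p in p_values:
    print(f"p = {p:.4f}")
```

With these settings the experiment has roughly a coin-flip chance of reaching p < .05 on any given run, so the printed values typically scatter widely even though nothing about the underlying effect changes.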

So, even with iron-clad experimental design and execution, we would not expect 95% of studies with a p-value of 0.05 to replicate. But most studies do not have iron-clad experimental design and execution. I have previously discussed researcher degrees of freedom and p-hacking. Essentially, this is the practice (whether done intentionally, innocently, or with a bit of a wink – cutting corners but thinking it doesn’t really matter) of tweaking the execution of an experiment as one looks at the data in order to get across the magical line of statistical significance. You could, for example, keep collecting data until the drunken walk wanders over the line of significance, then stop and publish.
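The "keep collecting until significant" strategy is easy to simulate. The sketch below is an illustration under assumed settings, not a model of any particular study: there is no real effect at all, the hypothetical researcher starts with 10 observations, adds 5 at a time, and tests after every batch up to a maximum of 100. The function name and all parameters are made up for the example.

```python
import random
from statistics import NormalDist


def significant_with_peeking(rng, start=10, step=5, max_n=100, alpha=0.05):
    """Simulate one study in which the null is TRUE (mean 0, sd 1).

    The data are tested after every batch, and collection stops at the first
    p < alpha. Returns True if the study ends up "significant".
    """
    data = [rng.gauss(0.0, 1.0) for _ in range(start)]
    while True:
        n = len(data)
        z = sum(data) / n ** 0.5  # sample mean divided by its standard error
        p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
        if p < alpha:
            return True
        if n >= max_n:
            return False
        data.extend(rng.gauss(0.0, 1.0) for _ in range(step))


rng = random.Random(42)
runs = 2000
peeking_rate = sum(significant_with_peeking(rng) for _ in range(runs)) / runs
# Setting start == max_n means the data are tested exactly once, at n = 100.
fixed_rate = sum(significant_with_peeking(rng, start=100) for _ in range(runs)) / runs
print(f"false positives with peeking:      {peeking_rate:.1%}")
print(f"false positives with a fixed n:    {fixed_rate:.1%}")
```

With a single planned test the false-positive rate stays near the nominal 5%, while repeated peeking pushes it well above that, even though every individual test uses the same p < .05 threshold.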

Exact replications take away researcher degrees of freedom, and therefore p-hacking, and therefore expose many false positive studies. Another way to reduce degrees of freedom, which I also discussed recently, is registering trials prior to execution. Simply doing this, in a recent analysis, reduced positive studies from 57% to 8%.

This new study supports another remedy to the apparent abundance of false-positive studies in the scientific literature – valuing replications more. The Journal of Personality and Social Psychology infamously refused to publish an exact replication by Richard Wiseman of Bem’s ESP research. The journal simply said that they do not publish exact replications as a matter of policy.

The reason for this is that replications are boring while publishing new exciting research increases a journal’s prestige and impact factor. Of course, new exciting research is more likely to be wrong, precisely because it is new and because of what makes it exciting – it goes against the grain.

One aspect of this study often overlooked in popular reporting is the impact of effect sizes. First, the authors found that effect sizes in the replicated studies were about half those of the original studies. This is a well-known phenomenon called the decline effect – the tendency of effect sizes to decrease as new studies of the same question are published. Sometimes effect sizes decline to zero, sometimes to a positive but diminished result.

ESP researchers, grappling with their own decline effect (consistently to zero), actually proposed that ESP as a phenomenon tends to decline over time as researchers pay attention to it. This is not only absurd, it is completely unnecessary. The far simpler explanation is that as researchers address a scientific question they get better and better at designing studies, learning from the previous studies. As study design and execution get more rigorous over time, researcher bias is constrained, and effect sizes diminish.

**What do we do now?**

Every discipline of science has its own culture, journals, and practice, and some are more rigorous than others. But across the board it seems we need to shift the emphasis towards replications. Everywhere along the line of research, replications need to have more academic and scientific value. If you talk to researchers, they know the value of replications. That is how we know what is real and what isn’t. But the incentives are all toward doing new and exciting research, and the rewards for doing exact replications are slight.

Journal editors hold a large amount of the blame. They need to start publishing more replications. There is probably an optimal balance in there somewhere – the perfect mix of new exploratory science with replications and confirmatory science. This is like getting the proper amount of oxygen and fuel in an engine. If the mix is not right, the engine is inefficient.

None of these problems means that science is broken. It means it is inefficient. In the long run, replications are done, and the science sorts itself out. Only real effects will persist over time. But I don’t think we have the mix right. Perverse incentives have pushed the system too far in the exploratory direction, resulting in a flood of false positive studies and a deficit of replications to tell which ones are real.

We know what the problem is and how to fix it.

**Putting it All Together**

This study also has implications for how the average person should evaluate scientific studies and scientific knowledge. The question is – what scientific results are compelling, and where should we set the threshold for accepting that a claim is probably true?

From my experience it seems that many people believe that if a single study shows something, then they can treat the results as real – especially if it confirms what they want to believe. I am often challenged with single studies, as if they support a position against which I am arguing.

I have laid out exactly what kind of evidence I find compelling. Research evidence is compelling when it has all of the following features simultaneously:

1 – Rigorous study design

2 – Statistically significant results

3 – Effect sizes that are substantially above noise levels

4 – Independent replication

Many people focus only on criterion #2 – if the results are significant, then the phenomenon is real. But #2 is perhaps the least important of these four. The current study of replications showed that effect sizes were a better predictor of replicability than statistical significance.

For example, I am not convinced of the reality of ESP, homeopathy, acupuncture, or astrology because with these disciplines you never see a specific effect that shows a significant and large effect size with a rigorous study design that is reliably independently replicated. Either effect sizes are razor thin, or the study design is loose, or a one-off study cannot be replicated.

All of this is also only looking at the evidence itself. In addition, you have to consider scientific plausibility. There is always the subjective question of – how significant, how large an effect size, how many times does it have to be replicated? The answer to these questions is – it depends on how plausible or implausible the alleged effect is. The threshold for something like homeopathy is very high, because the plausibility is close to zero. (But to be clear, the evidence for homeopathy does not even reach the minimal threshold for a highly plausible effect, let alone the magic that is homeopathy.)

Keep all this in mind the next time a new exciting study is being shared around social media. Put the study through the filter I outlined above.

## 31 thoughts on “The Reproducibility Problem”

I think it makes sense for researchers to consult a statistician prior to IRB submission. This is not just for making sure the study is designed to have the appropriate statistical power to say something useful, but to also be sure the researcher has in mind an appropriate question that is well defined.

Interestingly, the editors of one psychology journal — Basic and Applied Social Psychology — announced this year that they are no longer allowing p values to be used in reporting of research results. The editors argue, as you do here, that the p < .05 bar is too easy to achieve and this allows poorly designed studies to get published (see http://www.nature.com/news/psychology-journal-bans-p-values-1.17001).

Just one minor correction: it was the Journal of Personality & Social Psychology — which published the original Bem study — and not Psychology Today that famously refused to publish the attempt at replication. Psychology Today does not publish original research.

Ack – faulty memory. Correction made, thanks.

Extraordinary claims demand extraordinary evidence. ESP, homeopathy, acupuncture, astrology etc. make claims which challenge the foundations of science. To convince the world’s scientists that they all need to rethink everything they thought they knew from scratch, you have to have truly, extraordinarily, rock-solid evidence. So yes, you need 1,2,3,4 in spades dressed up in a bow. And yes, there is far too much focus on 2 – by those who don’t know too much statistics.

@Sherrington

Statistical significance of p < .05 is fair enough for an initial point of principle trial. From there on a higher and higher bar ought to be set.

toots wrote:

Such a sweeping generalization cannot be justified. Experimental results with p-values between .01 and .05 often provide little-to-no evidence that the null hypothesis is false, and, in fact, often support the null hypothesis over the alternative. Such results are at best useless, and at worst misleading, for justifying follow-up studies on a hypothesis.

One of the real benefits to me personally from spending time reading this blog, the SBM blog, and following SGU–plus a lot of independent reading–is to realize the limits of a headline reading: “A new study has found…”. I’ve been skeptically oriented my whole life, family and friends might call it a personality defect, but to have a clearer picture regarding the nuts and bolts of scientific research has been a boon for me.

Thanks to all of you folks who blog here for helping me become better educated.

Well, I was happy that the “free will” study wasn’t replicated.

jt512 wrote:

“Such a sweeping generalization cannot be justified. Experimental results with p-values between .01 and .05 often provide little-to-no evidence that the null hypothesis is false, and, in fact, often support the null hypothesis over the alternative. Such results are at best useless, and at worst misleading, for justifying follow-up studies on a hypothesis.”

I’m not certain I agree. By definition, a p-value of .05 indicates that there’s a 5% chance we are wrong about the alternative hypothesis. An independent study would give us a composite probability closer to the truth. I’m in agreement with Toots.

rdgroves wrote:

You’ve stated the common misconception about p-values that Steve mentioned. Specifically, you are misinterpreting the p-value as the (posterior) probability that the alternative hypothesis is false (or, equivalently, that the null is true). What the p-value actually is, is the probability, *assuming that the null hypothesis is true*, of observing a result at least as extreme as the result actually obtained. In other words, it is the probability of observing the data (or more extreme data), given the null hypothesis; not the probability of the null hypothesis, given the data.

The probability that the null hypothesis is true, given that the p-value is .05, depends on the prior probability of the null hypothesis. Berger and Sellke (1987) (see Table 6) showed that if the prior probability of the null hypothesis is 0.5 and the p-value is 0.05, then the (posterior) probability of the null hypothesis must be at least 0.29. And it is often considerably greater than 0.5. Berger has an informative applet at his web site that computes by simulation the posterior probabilities of the null and alternative hypotheses given a prior probability, range of p-values, and other settings. (To simulate a precise p-value (eg, .05), use a small range (eg, .049–.050).)
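The gap between a p-value and the posterior probability of the null can be illustrated with a short calculation. The sketch below is not Berger and Sellke's analysis and does not reproduce their 0.29 figure, which comes from their own class of priors. As a simplifying assumption it puts a zero-centred normal prior on the effect under the alternative, with variance `k` times the sampling variance of the estimate; `posterior_null` and `k` are hypothetical names introduced for the example. Under that assumption the observed z-score is distributed N(0, 1) under the null and N(0, 1 + k) under the alternative, and the ratio of the two densities is the Bayes factor.

```python
from math import exp, sqrt
from statistics import NormalDist


def posterior_null(p_value, prior_null=0.5, k=1.0):
    """Posterior probability of a point null given a two-sided p-value.

    ASSUMPTION (not from the comment above): under the alternative, the true
    effect has a normal prior centred on zero whose variance is k times the
    sampling variance of the estimate.
    """
    z = NormalDist().inv_cdf(1.0 - p_value / 2.0)  # z-score matching the p-value
    spread = 1.0 + k  # variance of z under the alternative
    # Bayes factor in favour of the null: density of z under H0 over density under H1.
    bayes_factor_null = sqrt(spread) * exp(-0.5 * z * z * (1.0 - 1.0 / spread))
    posterior_odds = (prior_null / (1.0 - prior_null)) * bayes_factor_null
    return posterior_odds / (1.0 + posterior_odds)


# Search over prior widths for the one that is least favourable to the null.
lowest = min(posterior_null(0.05, k=step / 10) for step in range(1, 1000))
print(f"smallest posterior P(null | p = .05) over normal priors: {lowest:.3f}")
```

Even with the prior width chosen to be as unfavourable to the null as possible within this family, the posterior probability of the null at p = .05 stays at roughly 0.32, far above the 0.05 that the common misreading of the p-value would suggest.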

Thanks jt512. I need to study this one more. School never ends!

Jt512 said

“In other words, it is the probability of observing the data (or more extreme data), given the null hypothesis; not the probability of the null hypothesis, given the data.”

I now understand this. Thank you. But Steve mentioned that the #2 requirement is getting statistically significant results. So without knowing the prior probability of the null being true in many cases, how would we know how many replications to do to achieve #2? Would it be when the effect size stabilized?

Rex

I guess I meant to ask, does a stabilization of effect size within a certain range coupled with a low p-value for each replication tell us that we’ve achieved “truth” about a hypothesis?

rdgroves: “without knowing the prior probability of the null being true in many cases, how would we know how many replications to do to achieve #2? (statistical significance)”

I think the concept of Power might get at your question? Power is the probability of rejecting the null, assuming the null is false (roughly, the probability of finding a statistically significant outcome when a true effect is there to be found).

Power is determined by: significance level (alpha often .05), the sample size, and the suspected effect size of the treatment. The effect size can be estimated based on previous related research, or a practical effect size can be chosen (say, we are only interested in an effect if it is X large or perhaps greater). Using the concept of power, we can then determine how many samples to take (in some cases, this is the number of repetitions of data points we gather in our sample).

So, power doesn’t tell you how many study replications to run, but it can give you an idea of how many samples in a given study are needed to find an estimated (or practical) effect size if there really is an effect there waiting to be found. If a high-powered study fails to find a significant effect, often this is good evidence for the null (or at least suggests the effect size is smaller than previously thought, and may even be impractically small to be meaningful).
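The relationship between significance level, effect size, sample size, and power described above can be sketched with the standard normal-approximation formula for comparing two group means. This is an illustration only: it assumes a two-sided test, equal group sizes, and an effect size expressed in standard-deviation units, and the function names are made up for the example. A t-test-based calculation would give slightly larger numbers.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-group comparison of means.

    effect_size is the assumed true difference in standard-deviation units.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2.0 * ((z_alpha + z_power) / effect_size) ** 2)


def achieved_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power for a given n, ignoring the negligible opposite tail."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    shift = effect_size * sqrt(n_per_group / 2.0)
    return 1.0 - NormalDist().cdf(z_alpha - shift)


for d in (0.8, 0.5, 0.2):
    n = sample_size_per_group(d)
    print(f"effect size {d}: about {n} per group "
          f"(power {achieved_power(d, n):.2f})")
```

Under this approximation, an assumed effect of 0.5 standard deviations needs about 63 subjects per group for 80% power at alpha = .05, and the required sample grows quickly as the assumed effect shrinks.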

Stats guys: am I explaining this correctly? This is my understanding of power and its interplay with sample sizes and effect sizes.

Good intro paper on these concepts: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3444174/

Excellent quotes from paper to keep in mind:

“you must determine what number of subjects in the study will be sufficient to ensure (to a particular degree of certainty) that the study has acceptable power to support the null hypothesis. That is, if no difference is found between the groups, then this is a true finding.”

“With a sufficiently large sample, a statistical test will almost always demonstrate a significant difference, unless there is no effect whatsoever, that is, when the effect size is exactly zero; yet very small differences, even if significant, are often meaningless.”

“Sometimes a statistically significant result means only that a huge sample size was used.”

Though not specifically addressed in the original post, a major contributor to the observation of smaller effect size in replications is simple regression to the mean.

Published studies are not random selections of research data. Studies that demonstrate larger effect sizes and smaller P values are more likely to be submitted to scientific journals. Among those submitted, those with larger effect sizes and smaller P values are more likely to be published.

So there is a powerful bias favoring more extreme results entering this literature. This is a prototypical situation where regression to the mean is a certainty.
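A small simulation shows how this selection alone can produce a decline in effect size on replication. All the settings here are assumptions for illustration: a modest true effect of 0.2 standard deviations, 20 subjects per group, and a crude "journal" that publishes only significant results in the expected direction. Each published study is then replicated once with the same design.

```python
import random
from statistics import NormalDist, mean


def run_study(rng, n=20, true_effect=0.2):
    """Simulate one two-group study; return (estimated effect, two-sided p-value)."""
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    estimate = mean(treated) - mean(control)
    z = estimate / (2.0 / n) ** 0.5  # known sd of 1 in both groups
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return estimate, p


rng = random.Random(7)
studies = [run_study(rng) for _ in range(5000)]

# Publication filter: only significant results in the predicted direction get through.
published = [estimate for estimate, p in studies if p < 0.05 and estimate > 0]

# Replicate every published study once, with no filter on the outcome.
replications = [run_study(rng)[0] for _ in published]

print("true effect:                   0.20")
print(f"mean published effect:         {mean(published):.2f}")
print(f"mean effect in replications:   {mean(replications):.2f}")
```

The published estimates overstate the true effect because only the lucky draws clear the significance bar at this sample size, and the unfiltered replications fall back toward the true value.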

Rdgroves wrote:

There is no “the” prior probability to know. The prior probability of a hypothesis is a quantification of your belief about the plausibility of that hypothesis; it’s your opinion. So it’s not the prior probability; it’s *your* prior probability. And in terms of when to believe that a hypothesis is true, there is no reason to actually quantify your prior probability. A rational thinker will naturally require more evidence to be convinced that a hypothesis is true if they think the hypothesis is implausible than if they think the hypothesis is plausible.

I don’t think there is a surefire algorithm that tells us when we’ve achieved truth about a hypothesis. It’s always going to be a judgment call. But, surely, multiple consistent independent replications of an effect size using well-controlled, well-powered, pre-registered studies would increase the plausibility of a hypothesis, and often would be convincing.

The Other John Mc wrote:

Your explanation of power and its relation to sample size and effect size is basically correct. However, I think you’ve misinterpreted Rdgrove’s question. He wasn’t asking about the relationship between power and replication, but between prior probability and replication.

Those authors don’t know what they are talking about. What they are saying to do, above, is almost never done. The authors don’t even do it in their own illustration of a sample size calculation in the very same paper (see the box in the paper).

That paper was published in the *Journal of Graduate Medical Education*, of which the article’s first author is Editor-in-Chief. Apparently, those who can, do; those who can’t, teach. Those who can’t teach, teach education. Those who can’t teach education, teach *medical* education; and those who can’t even do that, become Editors of their journals.

Ha! OK thanks for clarifying

Thanks to both of you “The Other John” and “jt512”. I basically get it, but it’s still going to require a couple of beers. Wait..I’m here in Colorado…maybe something else…

rdgroves wrote:

Like a nice long walk in the mountains? 😉

I find it extremely difficult to understand various study designs and the meaning of p-values. Some of the above comments have made me realize that I don’t seem to understand what is meant by statements along the lines: the null hypothesis is true/false. I hope that someone will be kind enough to help me understand it using a practical example that I am able to understand, such as the following…

My hypothesis is that light switches (the cause) switch lamps on and off (the effect). I’ve chosen this example because the effect size is large although the effect is less than 100% reliable/guaranteed. When we close the switch, the lamp will not illuminate if the lamp has failed, the switch has failed, or the electricity supply has failed. Similarly, when we open the switch, the lamp will stay illuminated if the switch contacts have stuck together.

Obviously, if the sample size of my trial is too small, my hypothesis would be demonstrated to be false if I happen to get a batch of faulty switches or lamps; although the p-value would be very close to zero suggesting that the data are statistically valid.

If the sample size is very high and the study is independently replicated, it will not only confirm my hypothesis, it will also provide a reasonable indication of the percentage of faulty switches and lamps.

If the sample size of each trial is low (underpowered), but the study is replicated multiple times, I think the combined data could be used to give a reliable indication of my hypothesis and the percentage of faulty switches and lamps. I’m not sure if a systematic review of the multiple studies would provide a reliable result because I think the acceptance/rejection criteria might bias the results in favour of the non-faulty batches of lamps and switches.

Sorry that was so long. As usual, I don’t know what it is that I don’t know therefore I’m unable to ask succinct questions.

@Pete A:

Your example is rather difficult for purposes of understanding what a p-value is and what it means for null hypothesis to be true or false, but I’ll try to go with it anyway.

Assume that your conceptual hypothesis is that switches change the state of lights. That hypothesis is not directly testable statistically, because it is qualitative, not quantitative. In order to test your hypothesis, you have to make a quantitative prediction from it. Let’s say, then, that your quantitative prediction is that if light switches change the state of lights, then if you flip a switch, the probability that the light will change state is greater than 0.1 (10%). So, that’s your research, or “alternative,” hypothesis. Your null hypothesis would then be the complementary hypothesis, namely, that the probability that the light changes state when a switch is flipped is no greater than 0.1.

Now, you conduct an experiment to test your quantitative hypothesis. You flip 5 (independent) switches, and you find that 4 times the light changes state. The p-value is the probability, under the assumption that the null hypothesis is true, of observing data at least as extreme as that actually observed. We can calculate this p-value. We assume the null hypothesis is true and that the true probability of the light changing state when we flip the switch is 0.1. We performed 5 trials and observed 4 “successes.” The only more extreme case would be if we had observed 5 successes. Thus the p-value for our results is the sum of the probabilities of observing 4 and 5 successes if the true probability of a success is 0.1. Those probabilities are, respectively, 0.00045 and 0.00001. Thus our p-value is 0.00045 + 0.00001 = 0.00046. This is a small number and shows that our result (or a more extreme one) is unlikely if the null hypothesis is true.
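The arithmetic in this example can be checked in a few lines. The sketch below simply sums the binomial probabilities for the observed count and everything more extreme; the function name `binomial_upper_tail` is made up for the illustration.

```python
from math import comb


def binomial_upper_tail(successes, trials, prob):
    """P(X >= successes) when X ~ Binomial(trials, prob)."""
    return sum(
        comb(trials, k) * prob**k * (1 - prob) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# 4 or more state changes in 5 flips, if the true probability is 0.1.
p_value = binomial_upper_tail(4, 5, 0.1)
print(f"p-value = {p_value:.5f}")  # 0.00045 + 0.00001 = 0.00046
```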

@Pete A

Let me try to make your hypothesis a little more relevant to the types of studies being addressed here. Your hypothesis is that putting a switch in a circuit and closing the switch turns on the light more frequently than chance.

We set up 2 conditions: We create 10 circuits with a switch and a lamp, and 10 “circuits” with the switch not wired to the lamp (control group). We then flip each switch and note how often the lamp turns on. The null hypothesis would be that the lamp lights with equal frequency under both conditions.

This example is not very instructive unless we assume that there is a baseline rate of activation of the lamp independent of the switch.

After we do our study, we find that when the switch is in the circuit, the lamp lights 8 out of 10 times, and if the switch is not in the circuit, the lamp lights 2 out of 10 times. The P value for such a result is 0.02. What this P value tells us is that if the null hypothesis were true, this result, or a more extreme one, would be expected in 2% of studies of this design.

If the sample size is half of the above, the results would be 4 of 5 positive results in the “wired” group and 1 of 5 in the control group. This gives a P value of 0.2. This result or a more extreme one would be achieved 20% of the time if the null hypothesis were true.
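The comment does not say which test produced these P values. As an assumption, the sketch below uses a two-sided Fisher's exact test on the 2×2 table of lit versus unlit lamps, which gives values that round to the figures quoted. The function name is made up for the example, and the calculation uses only the standard library.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are the two groups; columns are "lamp lit" and "lamp not lit".
    Sums the probability of every table (with the same margins) that is no
    more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x "lit" outcomes in the first group.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    return sum(
        prob(x) for x in range(low, high + 1)
        if prob(x) <= observed * (1 + 1e-9)  # tolerance for floating-point ties
    )


p_large = fisher_exact_two_sided(8, 2, 2, 8)  # 8/10 wired vs 2/10 control
p_small = fisher_exact_two_sided(4, 1, 1, 4)  # 4/5 wired vs 1/5 control
print(f"10 per group: P = {p_large:.3f}")
print(f" 5 per group: P = {p_small:.3f}")
```

The same proportions (80% versus 20%) give a P value near 0.02 with 10 circuits per group but near 0.2 with 5 per group, which is the point of the comparison: halving the sample turns a "significant" result into a non-significant one without any change in the observed effect.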

Thanks very much for your explanation, David. In my field of work, our hypotheses were always tested by third parties who had no vested interest in our work, and we were never told how the testing was performed. IIRC the sample size they used was a minimum of 250.

@jt512, Many thanks for your interesting and informative reply.

You and David Weinberg have very kindly provided me with enough information to start learning much more about this difficult-to-understand subject. I’ve added a link to your replies in my “stats folder”. I have many questions to ask, but rather than be lazy by posting them here, it’s better for me to learn slowly and steadily by using online resources — which will make a lot more sense now that I’ve read your replies.

Best wishes,

Pete

Pete, you’re welcome. I have subscribed to the RSS feed for comments to this article, so if you post another question here, I’ll be notified of it.

jt512, I’ve been trying to figure out the essential difference between qualitative and quantitative since you explained that my light switch hypothesis “is not directly testable statistically, because it is qualitative, not quantitative.” Yep, I think I now clearly understand that my example was far more about testing the quality of light switches (and lamps) rather than testing my ‘hypothesis’ that light switches actually control lamps.

I’ve reread the Wikipedia article “Null hypothesis” and was particularly struck by this paragraph:

“Caution: A statistical significance test is intended to test a hypothesis. If the hypothesis summarizes a set of data, there is no value in testing the hypothesis on that set of data. Example: If a study of last year’s weather reports indicates that rain in a region falls primarily on weekends, it is only valid to test that null hypothesis on weather reports from any other year. Testing hypotheses suggested by the data is circular reasoning that proves nothing; It is a special limitation on the choice of the null hypothesis.”

https://en.wikipedia.org/wiki/Null_hypothesis#Choice_of_the_null_hypothesis

It now seems to me that my light switch ‘hypothesis’ is similar to the Wiki example of studying last year’s weather reports — the studies likely contain loads of accurate data, but they cannot have any predictive power, unless and until, they are statistically tested against alternative data collected using a different/alternative hypothesis (or hypotheses).

Am I on the right track, or just getting even more confused?

@Pete A:

I think you’re just confusing yourself more. Your switch–light hypothesis was qualitative simply because it didn’t predict a number. Statistics is a quantitative science; it works by performing calculations on quantitative data. So, if you have a qualitative hypothesis, you have to make a quantitative prediction from it, such as, If switches change the state of lights, then if I flip a switch 10 times, I predict the light will change state at least, say, 3 times.

Rather than worry about light switches, let’s use an easier example to deal with: coin flips. You want to test whether a coin is “biased.” That’s a qualitative statement, but it suggests an obvious quantitative prediction: if the coin is biased, the probability of it landing heads will not equal 1/2. This is the hypothesis you are trying to prove. Significance testing works by assuming that this hypothesis, confusingly called the “alternative hypothesis,” is false, and that its complement, called the null hypothesis, is true. The null hypothesis in this case is that the probability that the coin lands heads is exactly 1/2.

To test your hypothesis, you decide to flip a coin 100 times and count the number of times that heads appears. Let’s say that the result of your experiment is that heads appears 58 times. This is more than the expected number of heads, 50, if the null hypothesis is true, but is it enough more to justify rejecting the null hypothesis in favor of the alternative hypothesis and concluding that the coin is biased? To decide, we calculate the p-value, which is the probability of observing a result at least as extreme as the result actually observed if the null hypothesis is true. To compute the p-value we add up all the probabilities for the result we observed and all more extreme results that we could have observed, under the assumption that the true probability of the coin landing heads is 1/2. The set of at-least-as-extreme possible numbers of heads is 58–100 heads and 0–42 heads; those are all possible outcomes that are at least 8 heads away from the expected number 50. The inclusion of 0–42 heads might seem unintuitive, but our (alternative) hypothesis did not specify in which direction the coin was biased, and so had it come up heads 42 or fewer times the result would be just as extreme (and we would be just as surprised) as if it had come up heads 58 or more times.

So we need to add up the probabilities of each of these extreme possibilities to find out just how unusual such results would be if the true probability of heads is 1/2. If you want to understand the details of the calculation, google “binomial test,” but the result, or p-value, is 0.13. So there was a 13% chance of observing a result at least as extreme as the one we observed if the null hypothesis were true. A 13% chance is generally considered too likely to reject the null hypothesis, and so we conclude that we do not have enough evidence to reject the null hypothesis that the true probability of heads is 1/2. We don’t have good evidence that the coin is biased.
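For anyone who would rather see the calculation than look it up, the two-sided binomial test for a fair coin can be sketched in a few lines. The function name `two_sided_binomial_p` is made up for the example; it counts every outcome at least as far from the expected number of heads as the observed one, exactly as described above.

```python
from math import comb


def two_sided_binomial_p(heads, flips):
    """Two-sided p-value for a fair-coin null (probability of heads = 1/2).

    Adds up the probability of every outcome at least as far from the
    expected count as the observed one, in either direction.
    """
    expected = flips / 2
    distance = abs(heads - expected)
    # Under a fair coin every sequence is equally likely, so count sequences.
    extreme_sequences = sum(
        comb(flips, k) for k in range(flips + 1)
        if abs(k - expected) >= distance
    )
    return extreme_sequences / 2**flips


p_value = two_sided_binomial_p(58, 100)  # covers 0-42 and 58-100 heads
print(f"p-value = {p_value:.2f}")
```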

Regarding the quote from Wikipedia, it is just a warning not to use the same data to test your hypothesis that you used to come up with the hypothesis in the first place. With respect to coin flipping, if you happen to notice a coin come up heads 7 times out of 10, and it makes you wonder if that coin might be biased, don’t use those 10 flips to test the hypothesis. Collect fresh data.

@jt512,

Thank you very much indeed for your feedback. Your biased coin test example makes perfect sense to me, and you’ve enabled me to finally understand this subject (after many years of trying really hard to grasp it). If you have written (or ever write) a book about statistics I shall order a copy from my local bookshop.

Kindest regards,

Pete

@Pete A,

Glad to help.

J