Feb 13 2014
I love learning new terms that precisely capture important concepts. A recent article in Nature magazine by Regina Nuzzo reviews all the current woes with statistical analysis in scientific papers. I have covered most of the topics here over the years, but the Nature article is an excellent review. It also taught me a new term – P-hacking, which is essentially working the data until you reach the goal of a P-value of 0.05.
In a word, the big problem with the way statistical analysis is often done today is the dreaded P-value. The P-value is just one way to look at scientific data. It first assumes a specific null-hypothesis (such as, there is no correlation between these two variables) and then asks, what is the probability that the data would be at least as extreme as it is if the null-hypothesis were true? A P-value of 0.05 (a typical threshold for being considered “significant”) indicates a 5% probability that the data is due to chance, rather than a real effect.
Except – that’s not actually true. That is how most people interpret the P-value, but that is not what it says. The reason is that P-values do not consider many other important variables, like prior probability, effect size, confidence intervals, and alternative hypotheses. For example, if we ask – what is the probability of a new fresh set of data replicating the results of a study with a P-value of 0.05, we get a very different answer. Nuzzo reports:
These are sticky concepts, but some statisticians have tried to provide general rule-of-thumb conversions (see ‘Probable cause’). According to one widely used calculation, a P value of 0.01 corresponds to a false-alarm probability of at least 11%, depending on the underlying probability that there is a true effect; a P value of 0.05 raises that chance to at least 29%. So Motyl’s finding had a greater than one in ten chance of being a false alarm. Likewise, the probability of replicating his original result was not 99%, as most would assume, but something closer to 73% — or only 50%, if he wanted another ‘very significant’ result. In other words, his inability to replicate the result was about as surprising as if he had called heads on a coin toss and it had come up tails.
Let me restate that – a study with a P-value of 0.01 may only have a 50% chance in an exact replication of producing another P-value of 0.01 (not the 99% chance that most people assume).
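To see where figures like these can come from, here is a minimal sketch in Python. It rests on assumptions of my own that the Nature article does not spell out: the true effect is exactly the size the original study observed, the replication uses the same sample size, and significance is judged with a two-sided z-test. The function name `replication_probability` is purely illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def replication_probability(p_original: float, alpha: float) -> float:
    """Chance that an exact replication reaches two-sided p < alpha.

    Assumption (for illustration only): the true effect equals the effect
    observed in the original study, and both studies use a z-test with
    the same sample size.
    """
    # z-score that corresponds to the original two-sided P-value
    z_observed = STD_NORMAL.inv_cdf(1 - p_original / 2)
    # z-score the replication must exceed to count as "significant"
    z_critical = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Under the assumption above, the replication's z-score is distributed
    # as Normal(z_observed, 1). Only results in the original direction count.
    return 1 - STD_NORMAL.cdf(z_critical - z_observed)


if __name__ == "__main__":
    print(f"{replication_probability(0.01, 0.05):.2f}")  # about 0.73
    print(f"{replication_probability(0.01, 0.01):.2f}")  # 0.50
```

Under these assumptions the replication clears the 0.05 bar about 73% of the time and repeats the original 0.01 result only half the time. If the original study overestimated the effect, which is common, the real odds are worse.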
The problem goes deeper, however, and this is where “P-hacking” comes in. I discussed previously a paper by Simmons et al. that reveals the effects of exploiting “researcher degrees of freedom.” This means choosing when to stop recording data, what variables to follow, which comparisons to make, and which statistical methods to use – all decisions that researchers have to make about every study. If, however, they monitor the data or the outcomes in any way while making these decisions, they can “consciously or unconsciously” exploit their “degrees of freedom” to reach the magic P-value of 0.05. In fact, Simmons showed you can do this 60% of the time out of completely negative data.
The process of exploiting these degrees of freedom is called “P-hacking.” Simonsohn, a co-author on the Simmons paper, points out that P-values in published papers cluster suspiciously around the 0.05 level – implying that researchers were engaging in P-hacking until they reached this minimal publishable threshold.
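A small simulation shows how just one of these degrees of freedom, deciding when to stop collecting data, inflates false positives. This is a sketch under my own assumptions, not a reconstruction of the Simmons analysis: both groups are pure noise drawn from the same normal distribution, the test is a simple z-test with known variance, and the sample sizes and function names (`two_sided_p`, `false_positive_rate`) are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def two_sided_p(group_a, group_b):
    """P-value from a z-test on the difference in means (known variance 1)."""
    n = len(group_a)  # both groups always have the same size here
    z = (fmean(group_a) - fmean(group_b)) / sqrt(2 / n)
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def false_positive_rate(peek, trials=2000, start_n=20, step=10, max_n=100, seed=1):
    """Fraction of pure-noise studies that end up with p < 0.05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Both groups come from the same distribution: there is no real effect.
        a = [rng.gauss(0, 1) for _ in range(max_n)]
        b = [rng.gauss(0, 1) for _ in range(max_n)]
        if peek:
            # Test after 20, 30, ..., 100 subjects per group and stop at the
            # first "significant" look.
            looks = range(start_n, max_n + 1, step)
        else:
            # Sample size fixed in advance: one test on the full data set.
            looks = [max_n]
        if any(two_sided_p(a[:n], b[:n]) < 0.05 for n in looks):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print("fixed sample size:", false_positive_rate(peek=False))
    print("stop when p < 0.05:", false_positive_rate(peek=True))
```

With the sample size fixed in advance, the false-positive rate should land near the nominal 5%. With peeking it comes out well above that, and this is only one degree of freedom. Combining it with flexible outcome measures and comparisons pushes the rate higher still.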
The problems with overusing the P-value and P-hacking are all fixable. As I discussed in my previous post on the Simmons paper, one important fix is exact replications. Exact replications remove all the degrees of freedom. In fact, exact replications can be done by researchers prior to publishing. Statistician Andrew Gelman of Columbia University suggests that researchers should do research in two steps. First, collect preliminary data. If it looks promising, then design a replication where all the decisions about data collection are pre-determined – then register the study methods before collecting any data. Then collect a fresh set of data according to the published methods. At least then we will have honest P-values and eliminate P-hacking.
Researchers should not rely only on P-values. They should also report effect sizes and confidence intervals, which are a more thorough way of looking at the data. Tiny effect sizes, no matter how significant, are always dubious because subtle but systematic biases, errors, or unknown factors can influence the results.
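As a rough illustration of why the P-value alone is not enough, the sketch below reports a P-value, Cohen's d (a standard effect-size measure), and a confidence interval for the same data. The data are hypothetical and simulated: a very large sample with a true difference of only 0.05 standard deviations. The helper names `summarize` and `simulate_tiny_effect` are my own, and the interval uses a large-sample normal approximation.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, variance


def summarize(group_a, group_b, confidence=0.95):
    """Return the P-value, Cohen's d, and a CI for the difference in means."""
    std_normal = NormalDist()
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = variance(group_a), variance(group_b)
    diff = fmean(group_a) - fmean(group_b)

    # Standard error of the difference (large-sample normal approximation).
    std_error = sqrt(var_a / n_a + var_b / n_b)
    p_value = 2 * (1 - std_normal.cdf(abs(diff / std_error)))

    # Cohen's d: the difference expressed in pooled standard deviations.
    pooled_sd = sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    cohens_d = diff / pooled_sd

    z_crit = std_normal.inv_cdf(0.5 + confidence / 2)
    ci = (diff - z_crit * std_error, diff + z_crit * std_error)
    return {"p_value": p_value, "cohens_d": cohens_d, "ci": ci}


def simulate_tiny_effect(n=50_000, true_shift=0.05, seed=7):
    """Hypothetical data: huge sample, true difference of 0.05 SD."""
    rng = random.Random(seed)
    treated = [rng.gauss(true_shift, 1) for _ in range(n)]
    control = [rng.gauss(0, 1) for _ in range(n)]
    return summarize(treated, control)


if __name__ == "__main__":
    print(simulate_tiny_effect())
```

With 50,000 subjects per group the P-value comes out far below 0.05, yet the effect size is a small fraction of a standard deviation. A bias of that magnitude from imperfect blinding or measurement would be enough to produce the entire result.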
Simonsohn advocates researchers disclosing everything they do – all decisions about data collection and analysis. This way at least they cannot hide their P-hacking, and the disclosure will discourage the practice.
Nuzzo also brings up one of my favorite solutions – Bayesian analysis. The Bayesian approach asks the right question – what is the probability that this effect is real? In order to make this statement you not only look at the new data, you consider the plausibility of the phenomenon itself – the prior probability.
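The role of prior probability can be sketched with a back-of-the-envelope application of Bayes' theorem. This is a deliberate simplification, not a full Bayesian analysis: it treats a study as a yes/no test that comes up “significant” with probability equal to its power when the effect is real, and with probability alpha when it is not. The function name and the specific priors and power used here are illustrative assumptions.

```python
def probability_effect_is_real(prior, power=0.8, alpha=0.05):
    """Posterior probability of a real effect after one 'significant' result.

    prior: plausibility of the effect before seeing the data (0 to 1)
    power: chance the study detects the effect if it is real (assumed)
    alpha: significance threshold, i.e. the false-positive rate under the null
    """
    true_positive = power * prior         # effect is real and the study detects it
    false_positive = alpha * (1 - prior)  # no effect, but p < alpha by chance
    return true_positive / (true_positive + false_positive)


if __name__ == "__main__":
    # A plausible hypothesis (50/50 going in) versus a highly implausible one.
    print(f"{probability_effect_is_real(0.50):.3f}")  # 0.941
    print(f"{probability_effect_is_real(0.01):.3f}")  # 0.139
```

The same “significant” result leaves a plausible claim about 94% likely to be real, but a claim with a 1% prior only about 14% likely. That is the sense in which evidence has to be proportional to plausibility.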
Science is not broken, but there are definitely problems that need to be addressed. What all of this means for the average science enthusiast, or for the science-based practitioner, is that you have to look beyond the P-values when evaluating any new scientific study or claim.
We can still get to a high degree of confidence that a phenomenon is real. This is what it takes for the research to be convincing:
1 – Rigorous studies that appear to minimize the effect of bias or unrelated variables
2 – Results that are not only statistically significant, but are significant in effect size as well (reasonable signal to noise ratio).
3 – A pattern of independent replication consistent with a real phenomenon
4 – Evidence proportional to the plausibility of the claim.
What we often see from proponents of various pseudoscience or dubious claims is trumpeting one or two of these features at once, but never all four (or even the first three). They showcase the impressive P-values, but ignore the tiny effect sizes, or the lack of replication, for example.
Homeopathy, acupuncture, and ESP research are all plagued by these deficiencies. They have not produced research results anywhere near the threshold of acceptance. Their studies reek of P-hacking, generally have tiny effect sizes, and there is no consistent pattern of replication (just chasing different quirky results).
But there is not a clean dichotomy between science and pseudoscience. Yes, there are those claims that are far toward the pseudoscience end of the spectrum. All of these problems, however, plague mainstream science as well.
The problems are fairly clear, as are the necessary fixes. All that is needed is widespread understanding and the will to change entrenched culture.