Feb 13 2014

## P-Hacking and Other Statistical Sins

I love learning new terms that precisely capture important concepts. A recent article in Nature magazine by Regina Nuzzo reviews all the current woes with statistical analysis in scientific papers. I have covered most of the topics here over the years, but the Nature article is an excellent review. It also taught me a new term – P-hacking, which is essentially working the data until you reach the goal of a P-value of 0.05.

**The Problem**

In a word, the big problem with the way statistical analysis is often done today is the dreaded P-value. The P-value is just one way to look at scientific data. It first assumes a specific null-hypothesis (such as, there is no correlation between these two variables) and then asks, what is the probability that the data would be at least as extreme as it is if the null-hypothesis were true? A P-value of 0.05 (a typical threshold for being considered “significant”) indicates a 5% probability that the data is due to chance, rather than a real effect.

Except – that’s not actually true. That is how most people interpret the P-value, but that is not what it says. The reason is that P-values do not consider many other important variables, like prior probability, effect size, confidence intervals, and alternative hypotheses. For example, if we ask – what is the probability of a new fresh set of data replicating the results of a study with a P-value of 0.05, we get a very different answer. Nuzzo reports:

These are sticky concepts, but some statisticians have tried to provide general rule-of-thumb conversions (see ‘Probable cause’). According to one widely used calculation, a P value of 0.01 corresponds to a false-alarm probability of at least 11%, depending on the underlying probability that there is a true effect; a P value of 0.05 raises that chance to at least 29%. So Motyl’s finding had a greater than one in ten chance of being a false alarm. Likewise, the probability of replicating his original result was not 99%, as most would assume, but something closer to 73% — or only 50%, if he wanted another ‘very significant’ result. In other words, his inability to replicate the result was about as surprising as if he had called heads on a coin toss and it had come up tails.

Let me restate that – a study with a P-value of 0.01 may only have a 50% chance in an exact replication of producing another P-value of 0.01 (not the 99% chance that most people assume).
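The "at least 11%" and "at least 29%" figures quoted above come from a widely used bound (often attributed to Sellke, Bayarri and Berger) relating a p-value to the minimum probability that a "significant" result is a false alarm, assuming 50/50 prior odds of a real effect. A minimal sketch of that calculation:

```python
import math

def min_false_alarm_prob(p, prior_real=0.5):
    """Lower bound on the probability that a 'significant' result is a
    false alarm, using the -e*p*ln(p) bound on the Bayes factor
    (valid for p < 1/e)."""
    bf_bound = -1.0 / (math.e * p * math.log(p))  # max evidence against the null
    prior_odds = prior_real / (1.0 - prior_real)
    posterior_odds = bf_bound * prior_odds        # odds that the effect is real
    return 1.0 / (1.0 + posterior_odds)           # P(false alarm)

print(round(min_false_alarm_prob(0.01), 2))  # 0.11
print(round(min_false_alarm_prob(0.05), 2))  # 0.29
```

Note that this is a best case: with a less favorable prior, the false-alarm probability is higher still.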

The problem goes deeper, however, and this is where “P-hacking” comes in. I discussed previously a paper by Simmons et al that reveals the effects of exploiting “researcher degrees of freedom.” This means choosing when to stop recording data, what variables to follow, which comparisons to make, and which statistical methods to use – all decisions that researchers have to make about every study. If, however, they monitor the data or the outcomes in any way while making these decisions, they can “consciously or unconsciously” exploit their “degrees of freedom” to reach the magic P-value of 0.05. In fact, Simmons showed you can achieve this 60% of the time with completely negative data.
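A toy simulation (my own sketch, not the exact Simmons et al. protocol) shows how just two such degrees of freedom – tracking two correlated outcome measures and peeking at the data while deciding when to stop – inflate the false-positive rate on pure noise well beyond the nominal 5%:

```python
import numpy as np

rng = np.random.default_rng(0)

def hacked_experiment(rng, n_start=20, n_max=50, step=10, crit=1.96):
    """One null experiment with two researcher degrees of freedom:
    two correlated outcome measures, and optional stopping.
    Returns True if any peek at either measure looks 'significant'."""
    # Two outcomes correlated at ~0.5; no true group difference at all.
    a = rng.multivariate_normal([0, 0], [[1, .5], [.5, 1]], size=n_max)
    b = rng.multivariate_normal([0, 0], [[1, .5], [.5, 1]], size=n_max)
    for n_cur in range(n_start, n_max + 1, step):  # peek as data accumulate
        for dv in (0, 1):                          # try both outcome measures
            x, y = a[:n_cur, dv], b[:n_cur, dv]
            t = (x.mean() - y.mean()) / np.sqrt(
                x.var(ddof=1) / n_cur + y.var(ddof=1) / n_cur)
            if abs(t) > crit:  # normal approximation to the t cutoff
                return True
    return False

false_pos = sum(hacked_experiment(rng) for _ in range(2000)) / 2000
print(false_pos)  # well above the nominal 0.05
```

The exact rate depends on the arbitrary choices above (how often you peek, how correlated the outcomes are); the point is only that it is never 5%.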

The process of exploiting these degrees of freedom is called “P-hacking.” Simonsohn, a co-author on the Simmons paper, points out that P-values in published papers cluster suspiciously around the 0.05 level – implying that researchers were engaging in P-hacking until they reached this minimal publishable threshold.

**The Fix**

The problems with overusing the P-value and P-hacking are all fixable. As I discussed in my previous post on the Simmons paper, one important fix is exact replications. Exact replication removes all the degrees of freedom. In fact, exact replications can be done by researchers prior to publishing. Statistician Andrew Gelman of Columbia University suggests that researchers should do research in two steps. First, collect preliminary data. If it looks promising, then design a replication where all the decisions about data collection are pre-determined, and register the study methods before collecting any data. Then collect a fresh set of data according to the published methods. At least then we will have honest P-values and eliminate P-hacking.

Researchers should not only rely on P-values. They should also report effect sizes and confidence intervals, which are a more thorough way of looking at the data. Tiny effect sizes, no matter how significant, are always dubious because subtle but systematic biases, errors, or unknown factors can influence the results.
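As an illustration of reporting an effect size with a confidence interval rather than a bare P-value, here is a sketch using simulated data and a standard large-sample approximation for the CI of Cohen's d (the data and numbers are made up for the example):

```python
import math
import random

random.seed(1)
# Hypothetical data: treatment vs. control scores, small true difference.
treat = [random.gauss(0.3, 1) for _ in range(50)]
ctrl = [random.gauss(0.0, 1) for _ in range(50)]

def cohens_d(x, y):
    """Standardized mean difference with an approximate 95% CI."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    sp = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))  # pooled SD
    d = (mx - my) / sp
    # Large-sample approximation to the standard error of d:
    se = math.sqrt((nx + ny) / (nx * ny) + d ** 2 / (2 * (nx + ny)))
    return d, (d - 1.96 * se, d + 1.96 * se)

d, ci = cohens_d(treat, ctrl)
print(f"d = {d:.2f}, 95% CI ({ci[0]:.2f}, {ci[1]:.2f})")
```

A wide interval straddling zero tells you far more about how fragile a result is than "P < 0.05" ever could.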

Simonsohn advocates researchers disclosing everything they do – all decisions about data collection and analysis. This way at least they cannot hide their P-hacking, and the disclosure will discourage the practice.

Nuzzo also brings up one of my favorite solutions – Bayesian analysis. The Bayesian approach asks the right question – what is the probability that this effect is real? In order to make this statement you not only look at the new data, you consider the plausibility of the phenomenon itself – the prior probability.
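A toy Bayes' theorem calculation (with assumed, purely illustrative numbers: statistical power of 0.8, significance threshold of 0.05) shows how the same "significant" result yields very different posterior probabilities depending on the prior plausibility of the hypothesis:

```python
def posterior_prob_real(prior, power=0.8, alpha=0.05):
    """P(effect is real | statistically significant result), by Bayes' theorem.
    power = P(significant | real effect); alpha = P(significant | no effect)."""
    return (prior * power) / (prior * power + (1 - prior) * alpha)

# A plausible hypothesis (prior 50%) vs. a highly implausible one (prior 1%):
print(round(posterior_prob_real(0.50), 2))  # 0.94
print(round(posterior_prob_real(0.01), 2))  # 0.14
```

Same data, same P-value, but the implausible claim remains probably false even after a positive study – which is exactly why prior probability cannot be ignored.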

**Conclusion**

Science is not broken, but there are definitely problems that need to be addressed. What all of this means for the average science enthusiast, or for the science-based practitioner, is that you have to look beyond the P-values when evaluating any new scientific study or claim.

We can still get to a high degree of confidence that a phenomenon is real. This is what it takes for the research to be convincing:

1 – Rigorous studies that appear to minimize the effect of bias or unrelated variables

2 – Results that are not only statistically significant, but are significant in effect size as well (reasonable signal to noise ratio).

3 – A pattern of independent replication consistent with a real phenomenon

4 – Evidence proportional to the plausibility of the claim.

What we often see is proponents of various pseudoscientific or dubious claims trumpeting one or two of these features at once, but never all four (or even the first three). They showcase the impressive P-values, but ignore the tiny effect sizes, or the lack of replication, for example.

Homeopathy, acupuncture, and ESP research are all plagued by these deficiencies. They have not produced research results anywhere near the threshold of acceptance. Their studies reek of P-hacking, generally have tiny effect sizes, and there is no consistent pattern of replication (just chasing different quirky results).

But there is not a clean dichotomy between science and pseudoscience. Yes, there are those claims that are far toward the pseudoscience end of the spectrum. All of these problems, however, plague mainstream science as well.

The problems are fairly clear, as are the necessary fixes. All that is needed is widespread understanding and the will to change entrenched culture.

## 62 thoughts on “P-Hacking and Other Statistical Sins”


The P value is calculated using effect size, sample size and variance. It does not consider prior probability, but that is subjective and depends on your perspective.

“The Bayesian approach asks the right question – what is the probability that this effect is real? In order to make this statement you not only look at the new data, you consider the plausibility of the phenomenon itself – the prior probability.”

This makes it possible to reject any finding you don’t agree with — just set the prior probability low enough. Science cannot function with bendable rules.

It is misleading to say that the p value does not measure the probability that data would be as observed under the null hypothesis. Anyone with rudimentary stats knows that the p value takes things like sample size, variance, etc., into account. E.g., we correct for multiple comparisons by changing our criterion for significance (e.g., Bonferroni correction).
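The multiple-comparisons point is easy to make concrete: with m independent tests the familywise error rate inflates as 1 − (1 − α)^m, and the Bonferroni correction divides α by m to compensate. A sketch (the 20-test example is my own illustration):

```python
def familywise_error(alpha, m):
    """P(at least one false positive) across m independent null tests."""
    return 1 - (1 - alpha) ** m

print(round(familywise_error(0.05, 20), 2))      # ~0.64 with no correction
print(round(familywise_error(0.05 / 20, 20), 3)) # Bonferroni: back under 0.05
```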

Bayesian analysis is no panacea, and frankly seems to have its biggest enthusiasts among people that do not do experiments. Those in the Cult of Bayes tend to underestimate the tools at the disposal of traditional statisticians.

If your effect is strong, it won’t matter if you use Bayesian, Fisherian, or whatever statistical measures. It is when effects are subtle and small that there will always be room for fudging, in any framework. As Dr Novella pointed out, the best antidote is replication. And honesty/rigour on the part of the person doing the analysis.

I’d like to see an uncontroversial example of Bayesian analysis in the natural sciences that told us something we couldn’t have learned from more traditional Fisherian statistics.

Note also that those committed to full-blown Bayesian subjectivism, where probabilities are a measure of subjective confidence, are never going to be taken seriously among a majority of natural scientists. This coin is either fair or not; it doesn’t matter if I think it is biased to come up heads. Might as well go team up with Deepak Chopra, change your subjective belief that people will give you money, and start peddling The Secret merchandise. Then maybe you will become rich.

There are so many problems with the way statistics are done in medicine and (worse) the social sciences that it’s hard to know where to start. Besides the pervasive problem of significance searching (ie, p-hacking), there is, as Steve says, the nearly universal mistake of interpreting a p-value as the posterior probability of the null hypothesis. P-values between .01 and .05 are usually weak evidence against the null, and sometimes even evidence in favor of it.

If it weren’t bad enough that our intuitions about what a p-value means tend to mislead us, our intuition about the effect of sample size does too: given a p-value, the larger the sample size, the less the evidence the p-value provides against the null. When the sample size is really big, as in a large-scale prospective cohort study or RCT, p-values between, say, .01 and .05 tell us nothing about the probability of the null hypothesis, yet because the sample size is large, such results are the ones most likely to be published in high-profile journals and interpreted as definitive, or nearly so. It is no mystery why so many high-profile results in medicine (never mind the social sciences) can’t be replicated.

As to the “criticism” that Bayesian analysis is subjective, give me a break! Scientific judgment is subjective! The scientific process considers prior probability whether individual scientists consciously do or not. That is why we believe that there is a Higgs boson, but don’t believe that bacteria can use arsenic in place of phosphorus in their DNA, neutrinos can travel faster than the speed of light, or people can influence the output of a computer random number generator.

What Bayesian analysis would do would be to force scientists to lay their cards on the table about their prior probabilities, rather than hide them in a supposedly objective frequentist significance test. Such a test has a built in prior distribution: that the probability under the alternative hypothesis is 1 for the exact result obtained, and 0 for every other result that could have been obtained, but wasn’t. Viewed this way, the idea that decisions about hypotheses cannot be made from the data alone is revealed to be preposterous.

Just so this doesn’t get lost in the kerfuffle:

“I’d like to see an uncontroversial example of Bayesian analysis in the natural sciences that told us something we couldn’t have learned from more traditional Fisherian statistics.”

Lots of strong rhetoric is bound to be thrown around by both sides, but this is really the key. Where’s the beef?

“given a p-value, the larger the sample size, the less the evidence the p-value provides against the null.”

That is wrong. A larger sample is closer to the reality you are testing than a smaller sample. In a larger sample, illusory effects are more likely to be washed out, while real effects are more likely to show up.

Some of the people who have been criticizing traditional statistics (T tests, etc.) obviously misunderstand the fundamental basic concepts.

A small P value is no guarantee you found a real effect. That is why it is P, for probability. Any research that uses statistics involves probabilities, not certainties.

A small P value means that given the variance within groups, the variance between groups is more likely to be meaningful, not just random.

Although the basic concepts are simple, they must be easy to misunderstand if you have not actually used them in research.

Effect size means nothing outside of a context. It only has meaning relative to the sample size, and the inter and intra group variance.

There might be times when it might be justified to include a priori probabilities in your analysis. But I think most of the time it would not be. Only if you had a solid mathematical way of estimating the a priori probability. Otherwise, you are just inserting your own bias into the interpretation. Which, as someone already said here, undermines the whole purpose of statistical research.

“Just so this doesn’t get lost in the kerfuffle:

I’d like to see an uncontroversial example of Bayesian analysis in the natural sciences that told us something we couldn’t have learned from more traditional Fisherian statistics.

Lots of strong rhetoric is bound to be thrown around by both sides, but this is really the key. Where’s the beef?”

Endamame,

I think the point is more that Fisherian statistics are more likely to give confidence to statistical noise; that is, while Bayesian analysis will not show more positives or tell us anything new about real effects, it does a better job of weeding out false results.

“That is why we believe that there is a Higgs boson, but don’t believe that bacteria can use arsenic in place of phosphorus in their DNA, neutrinos can travel faster than the speed of light, or people can influence the output of a computer random number generator”

Your last assertion is completely false, as any cryptographer could tell you.

I made a very unfortunate typo in my first post. I wrote:

I mean to say “can,” there; not “cannot.”

Hardnose wrote:

What I wrote is correct and has been explained repeatedly in the theoretical and applied statistical literature. For a freely available, easy-to-understand explanation, see Wagenmakers (2007) (pdf), and note in particular Figure 6, which depicts, for a p-value of .05, the posterior probability of the null hypothesis vs. the sample size for various priors* that would be reasonable in practice. The plot assumes that the null and alternative hypotheses were equally likely a priori. The plot reveals two important facts. First, regardless of the prior, a p-value of .05 almost always implies that the data favor the null hypothesis over the alternative—exactly the opposite of what we are supposed to believe. Secondly, regardless of the prior, as the sample size increases, the posterior probability of the null hypothesis increases and, in fact, approaches a limit of 1.

*Here, by “prior,” I mean the prior distribution of the effect size if the alternative hypothesis is true. Don’t confuse this with the prior probability that the alternative hypothesis, itself, is true, which is assumed to be .5 in the figure.

When using Bayes you don’t have to assume a prior probability. You can calculate the change in probability based on the new data. What it reveals is that, even with a significant P-value, the change in the probability of the hypothesis being true can be very small. It is an antidote to the false impression generated by using P-values alone.

Of course, you can use proper statistical analysis, control for multiple comparisons, etc, and if you get a large effect size and very significant result, those results are meaningful and believable, especially if they can be replicated.

But damn, read the literature – it’s littered with improper statistical analysis, small effect sizes, etc. But the magical 0.05 P-value was achieved, so the knee-jerk is to reject the null hypothesis.

Regarding increasing sample size – this can result in “overpowering” a study, which means that tiny effect sizes can achieve statistical significance, and these tiny effects are more likely to be spurious because the signal to noise ratio is low. Larger studies are still great for their statistical power – but you have to be suspicious of any small effects that emerge, and the value of statistical significance is ironically less.
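This "overpowering" effect is easy to quantify for correlations. Using the standard Fisher z approximation for the significance of a correlation coefficient (an approximation, not the exact t-based test), the smallest |r| that reaches p < .05 shrinks toward zero as N grows:

```python
import math

def min_significant_r(n, crit=1.96):
    """Smallest |r| reaching two-sided p < .05, via the Fisher
    z-transform approximation: z = atanh(r) * sqrt(n - 3)."""
    return math.tanh(crit / math.sqrt(n - 3))

for n in (30, 100, 1000, 100_000):
    print(n, round(min_significant_r(n), 3))
```

At N = 100,000 a correlation under 0.01 is "statistically significant" – exactly the kind of tiny effect that a subtle systematic bias could easily produce.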

[Regarding increasing sample size – this can result in “overpowering” a study, which means that tiny effect sizes can achieve statistical significance, and these tiny effects are more likely to be spurious because the signal to noise ratio is low.]

Yes, tiny effect sizes can achieve statistical significance — but what is “tiny” outside any context? A large sample means that spurious effects are LESS likely to show up as significant, NOT more likely.

A phenomenon might be real, yet have a “tiny” effect size (depending, of course, on how you happen to define “tiny”). With a small sample, it might not reach statistical significance because intra-group variance is high relative to N. If you increase N, and intra- and inter-group variances don’t change much, then the effect may reach significance.

ANY effect at all, whatever its “size” (and as I said, size is only meaningful within some context), needs a big enough sample to reach statistical significance. High intra-group variance requires a bigger sample.

All very simple and straightforward, and any other interpretation is an attempt to confuse and distort reality.

And yes, the key to all of this is replication. But there are still problems because different meta-analyses can reach different conclusions. Selecting which studies to include or exclude inevitably reflects bias.

hardnose – what you are saying is essentially correct but you are missing one important point. You seem to be looking at this entirely from a statistical point of view, but are ignoring the complexities of gathering data from subjective and noisy systems.

Increasing sample size does reduce random noise by increasing the probability that it will average out. However, it does not reduce any systematic biases in the study. A subtle bias or influence by an unknown variable or effect can creep in. Relatively small such effects can achieve statistical significance in large studies – meaning that the bias can be very subtle if the effect size is very small. (The smaller the effect size, the more subtle biases need to be ruled out, and the more difficult that becomes.)

The larger the study, the more difficult it is to eliminate any bias that can achieve statistical significance.

The main point here is that statistical significance is not everything. This is worth pointing out because (at least in my field of medicine) many people overvalue statistical significance and underappreciate other factors. This is partly, I think, from an unconscious desire to simplify things by boiling them down to a single number.

Yes – effect size is a judgment call that requires context. In medicine we use the basic concept of clinical significance. For example, is the amount of pain reduction being reported something that a person can actually notice? It also requires the context of – how objective is the outcome measure, how quantitative, and how does this relate to “noise” in the system.

Math is a core discipline, and while it’s satisfying to say, “numbers don’t lie,” they most certainly can. You have to take into account the methods that provided those numbers. If the test protocols are sloppy and allow confounding factors in or provide degrees of freedom for researchers to subconsciously manipulate, you’re almost certainly going to have some kind of bias. That bias undermines your conclusion, no matter how confident your statistical calculations say you should be.

Science isn’t easy.

“Except – that’s not actually true.”

I was very concerned about the quality of this post before I read this line, but it turns out you were describing the common misconceptions of “p values” in the second paragraph.

A general comment about this topic. Articles and comments often focus on the particular statistical methods (and fighting about that), when the issue of “researcher degrees of freedom” has more to do with not sticking to a rigorous process to avoid the insertion of biases (intended or not).

Isn’t this a major purpose of having a research protocol to begin with – to lay everything out beforehand to avoid many of these problems? At what point is the process failing? Is it just this focus on getting interesting results (e.g. significant p value) that makes researchers stray from the process?

I just want a way to more specifically characterize the issues, because it is hard to have a productive conversation when the problems described are so vague.

“The main point here is that statistical significance is not everything.”

That is right. Statistics are not a substitute for thinking. Defective research gives defective results, and statistically significant does not mean “true.”

@hardnose:

That sentence (like most you have written) is ambiguous. It is unclear what you are conditioning on. My best guess is you are claiming that if the null hypothesis is true, then the larger the sample size, the smaller the probability that the result will be statistically significant. That claim is wrong for two reasons. One, Steve already pointed out. Results from even the best-designed studies have small systematic error. And, if the null hypothesis is true, then the larger the sample size, the more likely it will be that the test will be significant, with the detected effect being the systematic error.

The second reason your claim is wrong is that the probability that the result will be significant due to random error if the null hypothesis is true is, by definition, the probability of committing a Type I error. If, as convention usually dictates, alpha is set at .05, then the probability of making a Type I error—the probability of rejecting the null hypothesis if the null hypothesis is true—will be .05 regardless of the sample size.

Thus, if the null hypothesis is true, the probability of rejecting it due to systematic error increases with the sample size, and the probability of rejecting it due to random error is independent of sample size. Therefore, your claim is wrong for two reasons.
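The commenter's point that random-error false positives hold steady at alpha regardless of sample size can be checked by simulation (a sketch, using a normal approximation to the t cutoff):

```python
import numpy as np

rng = np.random.default_rng(42)

def false_positive_rate(n, trials=2000, crit=1.96):
    """Fraction of pure-noise two-group comparisons called 'significant'."""
    x = rng.standard_normal((trials, n))  # group 1, no true effect
    y = rng.standard_normal((trials, n))  # group 2, no true effect
    t = (x.mean(1) - y.mean(1)) / np.sqrt(
        x.var(1, ddof=1) / n + y.var(1, ddof=1) / n)
    return np.mean(np.abs(t) > crit)

for n in (20, 200, 2000):
    print(n, false_positive_rate(n))  # hovers near 0.05 at every sample size
```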

@Steven Novella:

I’m not sure whether you were commenting on my comment or not, but I want to clarify one thing (that you may already know). To calculate the change in the probability of the hypothesis brought about by the data—that is, to calculate the Bayes factor—we need to specify a prior probability distribution on the effect size under the alternative hypothesis. This “prior” is distinct from the prior probability of the alternative hypothesis itself. The Bayes factor can then be applied to any prior probability of the alternative hypothesis. To put it another way, the Bayes factor is independent of the prior probability that the alternative hypothesis is true, but the Bayes factor depends on what we believe the probability distribution of possible true effect sizes is, if the alternative hypothesis is true. A convenient feature of the Bayes factor is that it equals the posterior odds of the hypothesis if the prior odds are 1:1, or equivalently, if the prior probability is .5, which is often a reasonable assumption for confirmatory studies.

Perhaps one day we will all come to our senses and realize how dumb it is to test a null hypothesis of an exactly zero effect size, since in every field except parapsychology the true effect size under study is never exactly zero. One alternative would be to switch the emphasis in our analyses to Bayesian estimation of effect size, which has the benefit over Bayes factor hypothesis testing of being insensitive to the choice of prior distribution, provided that the sample size is reasonably large, as it ordinarily is in practice.

Hardnose:

Classical / frequentist stats have problems with very large sample sizes – this is pretty well known, and a paper’s been offered that you should read.

Here’s another idea. Play with this:

http://www.danielsoper.com/statcalc3/calc.aspx?id=44

It’s a simple calculator that gives the p that an r value was arrived at via sampling error given N. Note the very tiny r values that become significant when you ramp up the N. If you were doing an epidem study with thousands of people, would you trust a significant correlation of .02? No, you’d realize that you were overpowered and this is meaningless.

I’ll never understand why those who object to Bayesian stats insist that priors have to be pulled from nether-regions. T’aint so.

I’ve linked to this before, but for those who haven’t seen it, it’s a riot to watch the “Dance of the P-values” video:

https://www.youtube.com/watch?v=ez4DgdurRPg

I’d never seen that video. It is very well done. It’s not hard to see how a scientific field that routinely uses underpowered studies, and publishes only statistically significant results will end up publishing mostly false positive results.

…or exaggerated effect sizes.

I thought I recognised the accent… that video was made by statisticians at La Trobe University, only a half-hour drive from where I live (in Mooroolbark, Australia).

“the true effect size under study is never exactly zero”

If a treatment makes no difference, the true effect size is exactly zero. Because of intra-group variance, which results from factors not currently of interest, there will usually be a difference between group means.

If the means differ, but P is above whatever cut-off is being used, then we assume it may have been the result of factors other than the treatment being studied.

The true effect size can certainly be zero, and it often is.

@Hardnose:

But in real-life experiments (outside of parapsychology), the true effect size is never exactly 0. It may be minute, but never exactly 0. Even if you were just testing two placebos against each other, since the placebos are chemically different, they would be expected to have at least some tiny difference in effect.

As Cohen (1994) (pdf) wrote,

Hardnose continued:

If you know that the true effect size is often 0, then you should have no trouble giving, oh, three examples of peer reviewed studies (say from the natural sciences, medicine, or the social sciences) where the true effect size is 0. For each example, please explain what evidence you used to conclude that the true effect size is 0.

jt512,

If you are comparing two different drugs, for example, and both are expected to cause some improvement in patients, then there might often be a real difference between treatment and control groups. The difference might be tiny, yet statistically significant. In that case you would have to decide if it’s clinically significant (a subjective judgment). Or there might be a difference that does not reach statistical significance, according to your agreed-on cutoff. In that case, you can provisionally accept null, until your results are confirmed by other experiments.

But if, for example, you are comparing a treatment to no treatment, there could very well be no effect at all. If the treatment has absolutely no effect on the disease, the true effect size is 0.

Hardnose, I don’t think that it is physically possible for a treatment to have absolutely no effect on an outcome, though it certainly could have too small an effect to be detected statistically with a practicable sample size.

It is certainly possible for a treatment to have no effect whatsoever on the dependent variable (what the experimenter has decided to measure).

hardnose,

“The true effect size can certainly be zero, and it often is”

“It is certainly possible for a treatment to have no effect whatsoever”

Is that a retraction?

Because the first statement is false, and the second statement is at least logically true.

The problem is that it rarely, if ever, actually happens.

Otherwise I think you would have been able to supply an example as requested a couple of days ago.

If a result is not statistically significant, there is probably no true effect. That is the very foundation of statistical research. If most experiments had a positive true effect, there would be no such thing as science.

Hardnose wrote:

It is trivially easy to show that Hardnose’s statement, above, begs the question of whether a true effect size can ever be exactly 0. From Bayes’ Theorem, the odds that the true effect is exactly 0, given that a statistical test is not significant, is equal to

[ (1–alpha) / beta ] × [ P(H0) / P(H1) ] ,

where alpha is the significance level of the test, beta is 1 minus the power of the test, P(H0) is the prior probability that the true effect size is exactly 0, and P(H1) is 1–P(H0). Note that P(H0), the prior probability that the true effect size is exactly 0, is the probability that the true effect size is exactly 0 before we conduct the test. This quantity must be 0 if the true effect size can never be exactly 0, which is the question we have been arguing about. Since the odds that the true effect is 0, given that a statistical test is not significant, depend on the quantity P(H0), Hardnose’s statement just begs the question of whether a true effect size can ever be 0 in the first place.

By “true effect size” we must mean what the effect size would be if we were able to perform the test on the entire population, not just a sample.
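Plugging illustrative numbers into the commenter's formula (alpha = .05, power = .8, 1:1 prior odds; the numbers are hypothetical, chosen only for illustration) shows that a nonsignificant result shifts the odds toward the null, but far from conclusively:

```python
def post_odds_null_given_ns(alpha, power, prior_odds_null=1.0):
    """Posterior odds of the null after a nonsignificant test:
    likelihood ratio (1 - alpha) / beta, times the prior odds."""
    beta = 1 - power  # P(nonsignificant | alternative is true)
    return ((1 - alpha) / beta) * prior_odds_null

odds = post_odds_null_given_ns(alpha=0.05, power=0.8)
print(round(odds, 2), round(odds / (1 + odds), 2))  # 4.75 0.83
```

So even a well-powered negative study leaves roughly a one-in-six chance that the effect is real, under these assumptions.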

So, for example, you might want to test whether taking vitamin C will cure cancer. So you get two groups of cancer patients, 20 in each group, and give one group vitamin C, and you give the other group pills that look and taste like vitamin C but are fake.

After a predetermined amount of time, you take a predetermined measurement that you think indicates whether the patient has improved or not.

You find that the mean scores of the two groups are slightly different, and the vitamin C group shows slight improvement according to your measurement. Get ready to collect your Nobel prize! Wait, not so fast, gotta do some statistical tests first.

The statistical tests show that your P value is .90. Uh oh, that means there is a 90% chance that the difference in the two means has nothing to do with the vitamin C treatment. There is a 90% chance that the true effect size is exactly zero.

There is also a 10% chance that vitamin C really does cure cancer, but your experiment didn’t have enough power.

So keep trying and better luck next time. But eventually, after enough failed experiments, you might give up and admit that vitamin C does not cure cancer.

According to your reasoning, almost every hypothesis must be true. Well there would be a lot more Nobel prize winners if that were the case.

I don’t understand what you guys are arguing about.

Of course a true effect size can be 0. This simply means that what was manipulated had no effect. But the odds are low that it will come out as exactly zero because of sampling error.

@hardnose: “The statistical tests show that your P value is .90. Uh oh, that means there is a 90% chance that the difference in the two means has nothing to do with the vitamin C treatment.”

It is remarkable that you could make such a statement in the comments section of a blog post whose main point was that that is not the correct interpretation of a p-value.

And, once again, that statement, besides being pure nonsense, is still just begging the question.

@steve12

Did you read the material I quoted earlier from statisticians Cohen and Tukey? They disagree with you, as do I. At best, the conventional point null hypothesis is an approximation to the physical truth: that the treatment effect is in a small neighborhood of 0.

“I don’t understand what you guys are arguing about.

Of course a true effect size can be 0. This simply means that what was manipulated had no effect. But the odds are low that it will come out as exactly zero because of sampling error.”

Actually I don’t understand what we’re arguing about either. The “true effect size” is not what you measure, it is what you are inferring from your measurements and the statistical tests.

Of course the “true effect size” can be zero. This is obviously and absolutely a fact, no matter what anybody says.

Does taking a love potion make people fall in love with you? Do the experiment and find out. Maybe you will get a low T and a high P, causing you to infer that the true effect size is zero.

The statistical tests used in scientific research are called “inferential statistics.” That means they help you infer what is true in the real world, even though you only looked at a sample. Very often, what you imagine might be true in the real world is not true at all. The true effect size is zero.

Just because you got a low p value doesn’t mean it’s time to break out the champagne. It’s just a probability. Well duh, we knew that already or should have.

It appears that you missed the lesson in elementary statistics that a non-significant result does not imply that the null hypothesis is true. If you don’t even understand that much, there is no point in trying to explain subtler concepts to you.
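The distinction is easy to demonstrate with a simulation (a sketch assuming a small true effect of 0.3 standard deviations and a rough |t| > 2 cutoff for p < .05): the null hypothesis is false by construction, yet most studies come out non-significant, simply because power is low.

```python
import random
import statistics

random.seed(2)

def significant(n=20, true_effect=0.3, crit=2.0):
    """Simulate one two-group study in which the null hypothesis is
    false, and report whether |t| clears a rough p < .05 cutoff."""
    a = [random.gauss(0.0, 1.0) for _ in range(n)]
    b = [random.gauss(true_effect, 1.0) for _ in range(n)]
    se = (statistics.variance(a) / n + statistics.variance(b) / n) ** 0.5
    t = abs(statistics.mean(b) - statistics.mean(a)) / se
    return t > crit

hits = sum(significant() for _ in range(2000))
print(hits / 2000)  # well under 0.5: most studies of this real effect "fail"
```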

@Steven Novella

“It also taught be [me] a new term – P-hacking, which is essentially working the data until you reach the goal of a P-value of 0.05.”

Seriously Steve, you’ve *just* learned a new term called ‘p-hacking’ ?

It’s actually called a (statistical) ‘fishing expedition’ and that phrasing has been around forever. I’m citing this link from a text written in 1974 because it happens to be the first link that came up in google –

http://www.jstor.org/discover/10.2307/3234273?uid=3737536&uid=2&uid=4&sid=21103456633327

“Science is not broken… What all of this means for the average science enthusiast, or for the science-based practitioner, is that you have to look beyond the P-values when evaluating any new scientific study or claim.”

So if the majority of published studies can’t be replicated, or make claims that turn out to be falsified later, you don’t consider the process broken? Given the fact that the perverse emphasis now is on publishing as many papers as possible, even if they are mainly of garbage quality, then you believe things can be fixed by admonishing researchers to (a) actually think intelligently about what they study and (b) do better quality statistical work? Neither (a) nor (b) will happen because it conflicts with publishing quantity.

You’re hardly a sceptic if your ‘solution’ is to defend and lecture researchers to do what they should know in the first place. The problem is, sceptics aren’t apologists for junk science. Unfortunately, activist sceptics are. I’ve often wondered why this is the case. I suspect it’s because activist sceptics must maintain a ‘core’ belief system and part of that belief system is that scientific practice is incorruptible. This is somewhat similar to a priesthood in a Church. The Church hierarchy may be corrupt, but *The* Church is inviolate. In a like manner, Jesus is the best and kindest. The ‘core’ is always defended, even if a defence is absurd.

jt512:

“Did you read the material I quoted earlier from statisticians Cohen and Tukey? They disagree with you, as do I.”

This is really more of musing than anything mathematically provable.

I would agree that in most actual experiments, this is probably true. But since this is a theoretical discussion, I would ask you to imagine a truly insipid experiment where there is no reasonable relationship between DV and IV. Maybe a psi experiment or some such. There COULD indeed be no effect size. I maintain that it is possible.

“Seriously Steve, you’ve *just* learned a new term called ‘p-hacking’ ?”

Everyone’s heard of the term fishing expedition, and strictly speaking it’s not the same thing as p-hacking. It has a more specific meaning.

I mean are you *just* learning this!?! That p-hacking and a fishing expedition are the same thing!?! Duhhhh!!!!!!

What is your deal? You show up every few days to make some absurd derisive comment with no clear point. You’re some type of anti-sciencer. My bet’s on postmodernist rather than religious nut. Just a theory.

Steve12, in my first post on the subject, I stated that psi hypotheses were the exception. I think the question comes down to physics, not math. Psi isn’t real. But any real physical intervention must have an effect, and no two interventions can be so similar that they could have exactly the same effect. Just as you can’t manufacture a coin so perfectly that it would have exactly a 1/2 probability of landing either heads or tails, I don’t think you can design two interventions that would be exactly alike, and would hence have the same physical effect, if measured to sufficiently high precision. I can’t prove that, but my girlfriend, a theoretical physical chemist, agrees.

All that said, it’s really unimportant to the practice of statistics. What is important is what you said: that in actual experiments, there is almost surely at least a tiny non-zero effect. Any difference between treatments worth testing is almost surely non-zero; and so, if it is a foregone conclusion that the null hypothesis is false, what is the point of conducting a statistical hypothesis test? Any non-significant result must almost surely be false. It seems, then, that hypothesis testing is pointless, and we should focus our efforts, instead, on interval estimation.

OK, gotcha. I didn’t see that you had specifically exempted insipid things like psi above – my bad.

And I agree with your point – that in any real situation there will be at least some non-zero effect size to some umpteenth decimal point.

“It appears that you missed the lesson in elementary statistics that a non-significant result does not imply that the null hypothesis is true.”

I already said in other comments that you can’t accept the null hypothesis based on one experiment. You can’t make any strong inferences either way without replications. I already said, repeatedly, that this is all probability, not certainty.

” Any difference between treatments worth testing is almost surely non-zero; and so, if it is a foregone conclusion that the null hyothesis is false, what is the point of conducting a statistical hypothesis test? Any non-significant result must almost sure be false. It seems, then, that hypothesis testing is pointless, and we should focus our efforts, instead, on interval estimation.”

You are assuming that all researchers are comparing drugs that are already known to have an effect.

Perhaps someone can help me with a problem I have with the use of Bayes.

As new evidence comes in, we can update our belief system in a mathematically correct way using Bayes assuming the data conforms to what the mathematics applies to. I think this is normal behavior for brains– what fires together, wires together, is a Bayesian approach to information. (The more often ‘x’ appears, the more certain we become ‘x’ will appear again).

But where are we getting the evidence?

It seems that if we narrow our search for evidence based on a Bayesian analysis, we are in effect ‘cherry picking’ our sources of information on a subject.

Start with a strong ‘prior’, look to the sources most likely to agree with that prior, and this is likely to produce a stronger and stronger certainty the prior is correct.

It seems we misapply Bayes as soon as we start to narrow our search based on the analysis. But narrowing the search is how it is applied.

Is that a problem?

@sonic

You seem to be confusing confirmation bias with Bayesian inference. Obviously, if you only “look to the sources most likely to agree with that prior,” then all you are doing is cherry picking the available data in order to confirm your prior belief. To correctly use Bayes you have to use all the available data, or at least an unbiased sampling of it. Unlike frequentist hypothesis testing, which does not allow accumulation of evidence in favor of the null, but only against it, Bayes allows both. Consequently, if your prior probability was strongly in favor of H1, but subsequent studies favor H0, then your probability for H1 will be reduced, and your probability for H0 increased. And if enough evidence for H0 accumulates, Bayesian inference says you eventually have to change your mind.
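That accumulation can be sketched numerically (all the numbers here are hypothetical): posterior odds are just prior odds multiplied by each study’s likelihood ratio, so a run of studies favoring H0 drags down even a strong prior for H1.

```python
# Sketch of Bayesian updating with made-up numbers: prior odds are
# multiplied by each study's Bayes factor (likelihood ratio).
prior_h1 = 0.9                    # hypothetical prior probability of H1
odds = prior_h1 / (1 - prior_h1)  # prior odds of 9:1 in favor of H1

# Hypothetical Bayes factors P(data | H1) / P(data | H0) for five
# studies, each mildly favoring H0 (values below 1).
bayes_factors = [0.5, 0.4, 0.6, 0.3, 0.5]

for bf in bayes_factors:
    odds *= bf                    # accumulate the evidence study by study

posterior_h1 = odds / (1 + odds)
print(round(posterior_h1, 3))     # prints 0.139: the strong prior for H1 has collapsed
```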

jt512-

Thanks.

I’m not confusing the two as much as wondering if people using Bayes don’t fall into this.

For example- I know people who think that anyone who has any question about AGW is a paid hack and must be ignored or (hopefully) censored.

I note that some ‘pro- GMO’ types run into this as well. (see rezistnzisfutl)

My concern is that the use of Bayes leads to this behavior- when in fact for Bayes to work one would search the information base at random.

My problem is not with the math- it is the way I see it being used.

Am I seeing things?

sonic, that is a concern with every technique.

I don’t see any advantage in including past evidence in a current analysis.

A meta-analysis will, hopefully, combine all the evidence afterwards.

Including prior probabilities is just one more way for subjectivity to enter the picture.

@sonic,

I don’t see any reason that Bayesian inference would promote biased, confirmation-seeking behavior. On the contrary, if done correctly, Bayes is the mathematically correct way to update one’s probability of a hypothesis, given new data. If you sit down and force yourself to do the math—to calculate the relative likelihood of the results of each relevant study under each hypothesis—then Bayes forces you to give each study its proper weight, permitting more accurate assessment of the probability of a hypothesis than would any informal, ad hoc method of synthesizing the evidence.

Then, again, my experience with Bayes is as a professional statistician rather than as an active skeptic.

“I don’t see any advantage in including past evidence in a current analysis.”

All things should be treated as a priori equiprobable? I’m not a statistician; I’m a scientist who uses stats, and I haven’t used Bayesian analyses, honestly because it’s a PITA to get stuff published with it as a junior guy (not the best reason, but there it is). The more I’ve investigated, the more I’m convinced that priors are as subjective as you make them, and no one will accept them if they are.

“A meta-analysis will, hopefully, combine all the evidence afterwards.”

But if meta-analyses are performed with studies suffering from the problems discussed above, this will not help determine a good estimate of effect size – GIGO. Bayesian approaches have the advantage of addressing the actual problems.

Of course, there are non-Bayesian ways of addressing these problems, but meta-analyses aren’t one of them.

jt512-

I’ve done the math.

Allow me to quote myself-

“As new evidence comes in, we can update our belief system in a mathematically correct way using Bayes assuming the data conforms to what the mathematics applies to.”

But doesn’t the mathematics assume that each piece of new information is found at random (the bag-of-balls analogy)?

Perhaps I’m wrong about this– but it seems that one thing people use Bayes for is to narrow the area of investigations– they will only pick from the ‘left side’ of the bag if you will.

If I hear the ‘pro’ GMO people are paid shills 8 times, then I’m likely to restrict my search of information to people who know this-

I might be wrong about this behavior I think I see.

But I have done the math–

Sonic, I don’t see how what you are describing can be considered a Bayesian analysis.

jt512-

Agreed- I’m not describing a correct Bayesian analysis.

I’m describing a human behavior that comes from what I think is a misapplication of Bayes.

@sonic:

The sun has exploded (p<.05)
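To spell the joke out (the 1/36 is the cartoon’s two-dice lie probability; the prior is a made-up, suitably tiny number): Bayes’ rule says the “significant” detector reading barely budges the posterior.

```python
# The detector says "yes" whenever the sun has exploded, and also lies
# with probability 1/36 (double sixes) when it hasn't -- note 1/36 < .05.
prior = 1e-9             # hypothetical tiny prior that the sun just exploded
p_yes_given_true = 1.0
p_yes_given_false = 1 / 36

posterior = (prior * p_yes_given_true) / (
    prior * p_yes_given_true + (1 - prior) * p_yes_given_false
)
print(posterior)  # roughly 3.6e-8: still astronomically unlikely, so bet the $50
```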

jt512-

Why only $50?

You can’t lose the bet after all.

hardnose (who you quote) is being a bit extreme– perhaps we should look at each new piece of information with the willingness to throw out whatever priors we might have.

Michelson – Morley comes to mind.

Do you consider yourself a Bayesian?

@sonic, oops, sorry, that last post should have been tagged “@hardnose.”

Maybe the best compromise is to use both. T-tests and ANOVAs are easy for everyone to understand, even old-fashioned guys like me. Then you can add the fancier stuff if you want, and everyone will be happy.

I am wary and skeptical about the possible subjectivity of estimating prior probabilities. Yes meta-analyses have the same kind of problems. But we should be able to easily see the naked p values, if we want.

It is natural to be biased, but bias is the enemy of science, and we have to constantly watch out for it.

This is just a symptom of the larger problems with over emphasis of quantity of publications rather than quality.

hardnose,

There was once a tribe of people isolated from the rest of the world. When they first saw white men, they thought they were the ghosts of their dead relatives, and when they first saw horses they thought they were seeing giant pigs. I don’t think scientists would do very well to put themselves in the position of these isolated tribespeople.