Jan 05 2012

## Publishing False Positives

### 18 Responses to “Publishing False Positives”

on 05 Jan 2012 at 9:10 am

I am no Bayesian expert, but I am pretty sure you can’t do Bayesian updating, i.e. calculate a change from prior to posterior, without making some assumption about the prior. This is the whole appeal of Bayesian methods over frequentist methods, no? If something has an extremely low or high prior, then additional data to the contrary will not update the posterior very much. So the amount a particular data set will update the posterior probability depends necessarily on the prior probability.

Most of the time this is resolved by assuming an “uninformative prior”, i.e. all outcomes are equally likely, and then deriving a posterior from that. But you are still making an assumption about the prior.

on 05 Jan 2012 at 9:34 am

mlegower – I had that same question. I specifically asked Wagenmakers about it; he was referenced in the current article as supporting a Bayesian approach. He said you can do a Bayesian analysis without making a judgement about the prior probability. Essentially you can make statements about how much the prior probability will change, and perhaps this can be adjusted to whatever you take to be the prior probability.

I will ask him what he thinks about the comments in this article.

on 05 Jan 2012 at 9:40 am

Steve, when you wrote P<0.5, did you mean P<0.05?

on 05 Jan 2012 at 10:12 am

Yes, P<0.05 is what is meant.

This is an interesting study and is summed up rather neatly here:

http://xkcd.com/882/

Having worked in market research for a few years, this idea of fishing for significance is all too prevalent, generally without the analysts themselves realising or understanding the likelihood/existence of false positives.

The idea of sharing the data is probably the best way to address this for public research – it not only helps more technical readers when trying to understand the nightmare of expressing p values, relative risks and sample sizes in prose, but also in enabling other researchers to use the data for their own research when trying to replicate or build on the findings.

It should also be said that more accurate and descriptive method sections would often make these effects fairly clear to even an outsider.
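The "fishing for significance" problem in the comment above can be put in numbers. The following is a minimal sketch, assuming the tests are independent and that every null hypothesis is actually true; the function name `familywise_error_rate` is made up for illustration.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive among n_tests independent
    tests at threshold alpha, assuming every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


# One test keeps the nominal 5% rate; twenty tests (as in the xkcd
# jelly bean comic) push the chance of a spurious "hit" above 60%.
for n in (1, 5, 20):
    print(n, round(familywise_error_rate(0.05, n), 3))
```

Real analyses rarely have fully independent tests, so this is an illustration of the trend rather than an exact figure for any particular study.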

on 05 Jan 2012 at 10:13 am

Wow, this takes me back a long way, back to the days when I was learning about statistics and studied some ESP experiments! The paper linked to is a really nice piece of work. A few observations. First, Steve, I think you meant to write p < 0.05 rather than p = 0.5 (two places). Second, with regard to degrees of freedom, we are used to dividing by sqrt(N-1) rather than sqrt(N) when computing the sample standard deviation, where N is the sample size, which increases the computed value. This comes from "fixing" one degree of freedom, namely the mean. But the general divisor is sqrt(N – m), where m is the number of degrees of freedom "used up". Thus, if one applies enough conditions to the data so that m = N – 1 (you can't have it larger), then the correct sample standard deviation to report would be sqrt(N) times as large as the uncorrected value.

Since p is so nonlinear in the S.D., this alone would greatly increase the proper p values for a lot of research.

Third, the reported p values are nearly always based on Gaussian distributions. But it's almost never established that the distributions in question are actually near-Gaussian. For an arbitrary distribution, there is a bound relating p to the number of S.D.s from the mean, and it's given by the Chebychev inequality. For a p value of 0.05, you have to be about 4.5 S.D. out from the mean, instead of about 2 as with the Gaussian distribution.

In my opinion, if a researcher can't demonstrate that his distribution is Gaussian, he should use Chebychev's inequality instead. That would kick a lot of results out of "statistical significance".
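The comparison in the comment above can be checked directly. This is a small sketch, assuming a two-sided test and a distribution with finite variance; the function names are illustrative. Chebychev's inequality says P(|X − μ| ≥ kσ) ≤ 1/k², so a guaranteed tail probability of p needs k = 1/sqrt(p).

```python
import math
from statistics import NormalDist


def chebyshev_k(p: float) -> float:
    """Number of standard deviations at which Chebychev's inequality
    guarantees a two-sided tail probability of at most p."""
    return 1.0 / math.sqrt(p)


def gaussian_k(p: float) -> float:
    """Number of standard deviations giving a two-sided tail
    probability of exactly p under a Gaussian distribution."""
    return NormalDist().inv_cdf(1.0 - p / 2.0)


p = 0.05
print(f"Chebychev: {chebyshev_k(p):.2f} S.D.")  # roughly 4.47
print(f"Gaussian:  {gaussian_k(p):.2f} S.D.")   # roughly 1.96
```

Chebychev's bound is a worst case over all distributions, so it is very conservative for data that are even moderately well behaved.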

on 05 Jan 2012 at 10:54 am

I will be interested to see what he says. But my understanding from my admittedly limited graduate education in Econometrics, both Bayesian and otherwise, is that every Bayesian analysis begins at least implicitly with a prior distribution of the parameter values of the data generating process and ends with posterior confidence intervals (and point estimates if you like) of those same parameters which are implied by the prior and the data.

Obviously the shift in the point estimate and the width of the interval around that point estimate indicate the degree to which the posterior changed from the prior, in the same sense that a frequentist analysis would produce a test statistic that indicates the ex ante probability that the data came from the parameters specified in the null hypothesis.

In any event, the same criticism that applies to the Bayesians (freedom to choose the prior) applies to the frequentists (freedom to choose the null hypothesis); it’s just that frequentists almost always choose the same straw man null (some parameter(s) = 0) which is identical in most cases to a Bayesian having a uniform (or uninformative) prior. Nothing stops a frequentist from testing the null hypothesis that a given parameter is different from 2, 1000, Pi, or any other number and thereby getting a “statistically significant” hypothesis test. But in every case you have to be clear on what hypothesis is being tested and on what prior distribution is assumed.

on 05 Jan 2012 at 11:02 am

mlegower – my understanding is that Bayes theorem says that P(H|E) = P(H) * (P(E|H)/P(E))

The posterior probability P(H|E) is the product of the prior probability P(H) and a “Bayes Factor” K = P(E|H)/P(E) which does not directly depend on P(H). If K can be computed, it can be reported separately from P(H) and allow anyone to plug in their own prior probability to compute their own posterior probability.

K isn’t completely independent of P(H), as P(E) = P(H)P(E|H) + P(not H)P(E|not H). K is monotonic in P(H), and goes to 1 as P(H) goes to 1, so one can use P(E|H)/P(E|not H) to give a limit on K.

If both P(E|H) and P(E|not H) were reported, the reader could calculate their Bayes Factor and posterior probability themselves, without the researcher having to assume any given prior themselves.
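The "plug in your own prior" idea from this comment can be sketched as follows. The two likelihoods below are made-up numbers for illustration only, not values from any study, and the function name `posterior` is an assumption.

```python
def posterior(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """P(H|E) from Bayes' theorem, expanding P(E) by total probability:
    P(E) = P(H)P(E|H) + P(not H)P(E|not H)."""
    p_e = prior * p_e_given_h + (1.0 - prior) * p_e_given_not_h
    return prior * p_e_given_h / p_e


# Hypothetical reported likelihoods: P(E|H) = 0.8, P(E|not H) = 0.05.
# Each reader supplies a prior and gets a different posterior.
for prior in (0.5, 0.1, 0.01):
    print(prior, round(posterior(prior, 0.8, 0.05), 3))
```

The same evidence moves a sceptical reader (prior 0.01) far less than an undecided one (prior 0.5), which is the dependence on the prior discussed earlier in the thread.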

on 05 Jan 2012 at 11:30 am

It’s depressing that this wasn’t found and subsequently fixed 50 years ago. Science is pathetically slow in fixing and improving meta-science; it still uses a centuries-old journal system. Also, this applies to all scientific research, and not just ESP research.

For example, there is a wide body of research showing humans detect pheromones, and that this has all kinds of interesting effects. But all this research lacks basic scientific plausibility in that adult humans don’t have a functional vomeronasal organ.

This kind of researcher bias and others will be magnified in fields or areas that have a motivational component or incentive. Medicine, evolutionary psychology, economics, and so on.

on 05 Jan 2012 at 11:59 am

I really hope he meant “0.5”. I’m just about to publish the killer results of my coin-toss experiment.

on 05 Jan 2012 at 1:58 pm

Sorry – didn’t have time to edit until this afternoon. Yes 0.05 – now fixed.

on 05 Jan 2012 at 2:05 pm

“If a researcher can’t demonstrate that his distribution is Gaussian, he should use Chebychev’s inequality instead. That would kick a lot of results out of ‘statistical significance’.”

Indeed. Color me Chebychev.

At 2 sigma, 1 - (1/k^2) = 0.75; at 3 sigma it’s 0.89.

LOL. Any time I see the word “Gaussian,” particularly in conjunction with purported assessment of some non-physical attribute, my hand slides reflexively over my wallet.

“Gaussian Copula” – well, now, THAT worked out really, REALLY swell, didn’t it?

on 05 Jan 2012 at 3:01 pm

blaisepascal – “If both P(E|H) and P(E|not H) were reported, the reader could calculate their Bayes Factor and posterior probability themselves, without the researcher having to assume any given prior themselves.”

[Presented entirely absent of hostility and in the interest of mutual education]

But the formula for the Bayes’ factor is K = P(E|H)/[P(H)P(E|H) + P(not H)P(E|not H)], which means that you have to assume something about P(H) to calculate it, right? Which means that you can report the probabilities of observing the data given each regime (P(E|H) and P(E|~H)), but you can’t go on to infer anything about the probabilities of each regime given the data unless you establish a prior over the regimes, correct? But if you are only reporting the probabilities of observing the data given the regime, then you are back to what is essentially a frequentist approach I would imagine.

Certainly, given the data and the methods, you can establish the posterior for any prior you might have. And maybe the best route is to report simply P(E|H), P(E|~H), and Bayes Rule so that the interested observer can plug in their own prior. You can even test the sensitivity of the posterior to the choice of prior. But it seems like that is the nature of the criticism above.

on 05 Jan 2012 at 3:33 pm

Steven,

Bayesian hypothesis tests are based on the odds form of Bayes theorem:

(posterior odds) = (Bayes factor) * (prior odds), where

(Bayes factor) = P(D|H1) / P(D|H0).

The Bayes factor, above, is the amount by which the data change your degree of belief in the alternative (versus the null) hypothesis. As is evident from the odds form of Bayes theorem, the Bayes factor is independent of the prior odds, your degree of belief in the hypothesis before seeing the data. Different people will have different prior odds of the hypothesis, and the Bayes factor is independent of those subjective judgments.

However, in order to calculate the Bayes factor, itself, the statistician must specify a prior distribution on the alternative hypothesis. This distribution is needed to calculate P(D|H1), the numerator of the Bayes factor. The choice of distribution on the alternative hypothesis will affect the Bayes factor, and so, some subjectivity in a Bayesian hypothesis test is unavoidable.

However, that does not mean that anything goes. Some distributions on the alternative are more reasonable than others, and whatever distribution is chosen, it needs to be disclosed and justified. Furthermore, a sensitivity analysis can be conducted to investigate how different choices of reasonable distributions affect the Bayes factor.

Jay
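The odds form described in this comment can be written out as a short sketch. The Bayes factor is taken here as a given number; the value 16 is hypothetical, and computing a real Bayes factor for a composite alternative would need the prior distribution on the alternative that the comment mentions. The function name is illustrative.

```python
def posterior_from_bayes_factor(bayes_factor: float, prior_prob: float) -> float:
    """Apply (posterior odds) = (Bayes factor) * (prior odds) and
    convert the resulting odds for H1 back to a probability."""
    prior_odds = prior_prob / (1.0 - prior_prob)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# The same reported Bayes factor serves readers with different priors.
bayes_factor = 16.0  # hypothetical value, P(D|H1) / P(D|H0)
for prior_prob in (0.5, 0.1, 0.01):
    print(prior_prob, round(posterior_from_bayes_factor(bayes_factor, prior_prob), 3))
```

Reporting the Bayes factor separately from the prior odds is what lets each reader do this last step for themselves.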

on 05 Jan 2012 at 3:45 pm

Blaisepascal,

Your understanding of the Bayes factor is wrong. The Bayes factor does not appear in the form of Bayes theorem that you have presented. It only appears in the odds form of Bayes theorem, which I gave in my previous post. From that equation, it is evident that the Bayes factor does not depend on the prior odds; however, it does depend on the statistician’s choice of distribution for the alternative hypothesis, as I explained, above.

Jay

on 06 Jan 2012 at 10:55 am

Brilliant idea for a study.

Reading it, I found this: “We used father’s age to control for variation in baseline age across participants.” Can someone explain what this means?

on 06 Jan 2012 at 12:47 pm

jt512 is correct about how the Bayes factor works (you can assume a prior probability and/or probability distribution, but you must make the assumption).

Further, for Bayes to be valid, the information must be coming in randomly; that is, the next piece of information must come randomly from all possible sources of information about the topic. (One can’t look into the bag before picking a ball.)

If a researcher decides to do a study based on his/her understanding of a situation, then it is questionable whether Bayes applies. (He looked into the bag and made decisions about how to pick the next ball.)

And I do apologize for the strained analogy.

I’m pretty sure the misuse of statistics is not new. Back in the late 1970s and early 1980s, computer programs were developed for statistical analysis. These were then used by people who have no idea about the limitations of the mathematics.

For example, a statistical analysis is valid for a population that has been randomly sampled.

What is the population that is randomly sampled in the case of a study done on college sophomores who got paid to do the study?

The answer is NONE. It is not a random sample of any population.

So to make any conclusions about any population (other than the actual participants) using this method to get subjects is not valid according to the math.

A small study of non-randomly selected people can be a means of doing a study, the results of which might be interesting enough to justify a real study (costly, time-consuming).

This is one reason replication is important, but who gets paid for that? Heck, it seems the magazine wouldn’t publish attempts (both successful and not) to replicate Bem’s work.

The demand for novel findings seems higher than the demand for careful analysis and testing of said results right now.

on 06 Jan 2012 at 8:01 pm

banyan wrote:

The reason you don’t understand that sentence is that it is utter nonsense. “Baseline” refers to the starting time of a longitudinal study, that is, a study where repeated measures are taken on subjects over time. “Baseline age” would then be the ages of the subjects at the beginning of the study.

However, the study (Study 2) in the paper was not longitudinal; it was cross-sectional: only a single measurement was taken on each subject. Therefore, “baseline,” and hence “baseline age,” have no meaning. Furthermore, even if the study were longitudinal, you could not use the subjects’ fathers’ ages to adjust for differences in the subjects’ ages between experimental groups. If you wanted to make such an adjustment, you’d put the subjects’ own baseline ages in the models, not their fathers’.

What the “researchers” in this “study” (Study 2) appear to have done was to divide subjects into groups who listened to one of two songs. There was no significant difference in the mean ages of the subjects between the groups. The researchers then tried statistically adjusting the subjects’ ages by using a number of nonsensical factors until they found one that produced a statistically significant difference in the mean subjects’ ages between groups. That factor happened to be the subjects’ fathers’ ages. They then dreamt up some science-y sounding rationale (“adjusting for baseline age”) to give the procedure the appearance of legitimacy. They then made the ridiculous claim that one of the songs caused a regression in age for the subjects who listened to it.

It was a silly exercise, because a difference in ages between the groups, whether due to nonsensical statistical modeling or not, does not imply regression in age.

Jay

on 03 May 2015 at 4:17 pm

Under weak assumptions it’s possible to show that, if you claim to have made a discovery when you observe P = 0.047, you have at least a 30% chance of being wrong (and a lot worse if it’s an implausible hypothesis).

P values do exactly what’s claimed of them. The problem is that what they tell you isn’t what you want to know. What you want to know is the false discovery rate, i.e. the probability that a “significant” result is wrong. A lot of people think that’s what the P value tells you. It isn’t.

There are simple explanations on my blog and on YouTube:

http://www.dcscience.net/2014/03/24/on-the-hazards-of-significance-testing-part-2-the-false-discovery-rate-or-how-not-to-make-a-fool-of-yourself-with-p-values/

https://www.youtube.com/watch?v=tRZMD1cYX_c

There is a proper description in Royal Society Open Science

http://rsos.royalsocietypublishing.org/content/1/3/140216
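
The difference between a P value and a false discovery rate, as raised in this comment, can be illustrated with a simple screening-style calculation. This is a rough sketch and not the commenter's own calculation for an observed P = 0.047: it treats every result with P below the threshold as "significant", and the power (0.8) and the fraction of tested hypotheses that are really true (0.1 or 0.5) are assumptions chosen only for illustration.

```python
def false_discovery_rate(alpha: float, power: float, prior_real: float) -> float:
    """Fraction of 'significant' results that are false positives, given
    the threshold alpha, the test's power, and the fraction of tested
    hypotheses for which a real effect exists (prior_real)."""
    false_positives = alpha * (1.0 - prior_real)  # nulls wrongly rejected
    true_positives = power * prior_real           # real effects detected
    return false_positives / (false_positives + true_positives)


# With alpha = 0.05 the false discovery rate is far from 5% when real
# effects are rare among the hypotheses being tested.
for prior_real in (0.5, 0.1):
    print(prior_real, round(false_discovery_rate(0.05, 0.8, prior_real), 3))
```

Under these assumed inputs, the less plausible the hypotheses being tested, the larger the share of "significant" findings that are wrong, even though the threshold stays at 0.05.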