Mar 19 2010

Statistics in Science

Tom Siegfried wrote an article on the use of statistics in science that is simultaneously excellent and frustrating. It is an excellent review of common errors in using and thinking about statistics in science. But it is frustrating because Siegfried frames his article as a problem with “science” – as if his criticisms are criticisms of science itself, rather than the failings of individuals. At times he also writes as if the problems with statistics that he points out are fatal flaws, or issues that only “some” researchers are just now starting to take into consideration.

Rather, from my perspective statistics is a very complex and challenging field. Most scientists I know have a moderately sophisticated understanding of statistics, and many know far more statistics than I do, although some apparently know less. Moreover, most large studies consult with statisticians who are experts in statistical analysis. The primary difficulty is with interpreting studies once they are published (even when the paper itself gets the statistics correct).

But the specific problems Siegfried points out have been widely known for years, and those researchers with a better understanding of statistics have been taking them into account for as long as I can remember. Most of them I learned about in medical school at the hands of researchers and experts in public health.

But all that aside – Siegfried does provide an excellent review of common mistakes interpreting statistics in research, especially medical research. The article is worth a thorough read, but I will give some highlights.

The first and most common error is the misinterpretation of the meaning of statistical significance, often stated mathematically as the p-value. Often the p-value is described as the probability that the results of the study were due to chance alone, so a p-value of 0.05 means that there is only a 5% chance that the study results are a false positive. But this is not an accurate description of p-value.

Rather, the p-value tells us: if we assume the null hypothesis (no effect), what is the probability of getting results at least as extreme as those observed? This may seem like a subtle difference (and it is) but it’s important. This is not the same thing as saying there is a 95% chance that positive results reflect a real effect.
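The distinction can be made concrete with a simulation. Below is a minimal sketch (the function name and parameters are my own, purely for illustration): we run many simulated experiments in which the null hypothesis is true by construction – both groups are drawn from the same distribution – and count how often the difference in means crosses the conventional significance threshold anyway. The answer hovers around 5%, which is exactly what a p-value threshold of 0.05 means: the false-positive rate *given* the null, not the probability that a given positive result is false.

```python
import random
import statistics

random.seed(42)

def two_group_pvalue_sim(n_per_group=50, n_experiments=10_000):
    """Simulate experiments where the null hypothesis is TRUE
    (both groups drawn from the same normal distribution) and
    count how often the difference in means looks 'significant'."""
    false_positives = 0
    for _ in range(n_experiments):
        a = [random.gauss(0, 1) for _ in range(n_per_group)]
        b = [random.gauss(0, 1) for _ in range(n_per_group)]
        # z statistic for the difference in means (known variance = 1,
        # so the standard error of the difference is sqrt(2/n))
        z = (statistics.mean(a) - statistics.mean(b)) / (2 / n_per_group) ** 0.5
        if abs(z) > 1.96:  # two-sided test at alpha = 0.05
            false_positives += 1
    return false_positives / n_experiments

rate = two_group_pvalue_sim()
print(f"false positive rate under the null: {rate:.3f}")  # close to 0.05
```

Note what the simulation does *not* tell you: how likely any particular significant result is to be real. That depends on how many of the hypotheses being tested are true to begin with, which is where the Bayesian reasoning below comes in.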

First, statistical significance does not account for the rigor and quality of the study. It assumes no bias or flaws, and it does not account for other statistical flukes that could alter the results (such as poor randomization – see below).

But most importantly, the p-value of an individual study was never meant to be the final arbiter of what is true in science. Kimball Atwood has already written an excellent review of this question over at Science-Based Medicine. You should give that a read as well – but quickly, the point is that a Bayesian analysis is more appropriate. In other words, we begin with a prior probability of a claim being true based upon all existing research. We can then add to that the results of the current study to arrive at a posterior probability. So a study significant at the 95% level may still only increase the probability of a treatment working from a 5% prior probability to a 10% posterior probability.
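Bayes’ theorem makes this update explicit. Here is a minimal sketch (the function name and the specific numbers are illustrative assumptions, not taken from Siegfried or Atwood): to update a prior you need the probability of a positive result if the treatment works (the study’s power) and the probability of a positive result if it doesn’t (the false-positive rate, e.g. 0.05). For an implausible claim tested by an underpowered study, a “significant” result moves the needle surprisingly little.

```python
def posterior_probability(prior, p_positive_given_real, p_positive_given_null):
    """Bayes' theorem: probability the effect is real, given a
    'positive' (statistically significant) study result."""
    numer = p_positive_given_real * prior
    denom = numer + p_positive_given_null * (1 - prior)
    return numer / denom

# Illustrative: implausible claim (5% prior), underpowered study
# (20% power), conventional alpha of 0.05
post = posterior_probability(prior=0.05,
                             p_positive_given_real=0.20,
                             p_positive_given_null=0.05)
print(f"posterior after one positive study: {post:.2f}")  # about 0.17
```

A well-powered study of a plausible hypothesis behaves very differently – plug in a 50% prior and 80% power and the posterior exceeds 90% – which is the Bayesian way of saying that prior plausibility matters.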

Another way to look at this is that you cannot interpret a single study in terms of whether or not a treatment works, and Siegfried makes this point as well. You have to put it into the context of prior research (sound familiar?).

There are other common problems in statistics as well. Siegfried points out the common problem of multiple comparisons – if you look at 20 variables, one of them will achieve a p-value of 0.05 on average even if we assume the null hypothesis. But this is an old problem, long ago solved by using statistics designed to account for multiple comparisons. In fact readers of this blog and SBM have likely encountered this before as a criticism of uncritical analysis of some studies. The take-home lesson is – always ask yourself, how many different comparisons did the researchers do (different variables, different points in time, different outcome measures) and did they cherry-pick those that were positive.
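The arithmetic behind this is simple enough to write out directly. The snippet below (my own illustration) computes the chance of at least one false positive across 20 independent null comparisons, and then shows one standard fix – the Bonferroni correction, which simply tests each comparison at a stricter threshold of alpha divided by the number of comparisons:

```python
alpha = 0.05   # significance threshold per test
m = 20         # number of independent comparisons, all with no real effect

# Chance of at least one false positive with NO correction:
p_any = 1 - (1 - alpha) ** m
print(f"uncorrected: {p_any:.2f}")           # roughly 0.64 -- a coin flip's worth

# Bonferroni correction: test each comparison at alpha / m instead
p_any_corrected = 1 - (1 - alpha / m) ** m
print(f"Bonferroni-corrected: {p_any_corrected:.3f}")  # back near 0.05
```

So an uncorrected fishing expedition across 20 variables has about a 64% chance of producing at least one “significant” result by chance alone, which is why the cherry-picking question above is worth asking of every study.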

Next up is randomization – this is an important aspect of clinical trials. Randomization means that people were assigned at random to either the treatment group or the control group. The purpose of this is to avoid selection bias, but also to average out as many variables as possible. So you want to get equal numbers of people with red hair in each group, and randomization should take care of that.

However, Siegfried points out that there is no guarantee that randomization will do this – it may be unlikely, but you can still flip 10 heads in a row. If you think about all the unknown variables, chances are some of them will be unequal in the two groups.

This is exactly why we are so concerned with the size of trials – how many subjects were enrolled – because randomization gets more and more effective with larger and larger numbers (you may flip heads 10 times in a row, but not a thousand times). Multiple trials also help – chances are random flukes won’t be the same across multiple trials.
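This effect of trial size is easy to demonstrate. The sketch below (function name and parameters are my own, for illustration) randomizes subjects into two groups many times and measures the average imbalance in some incidental trait – say, one present in 10% of the population – between the groups. With 20 subjects per group the groups routinely differ by several percentage points on that trait; with 2,000 per group the imbalance all but vanishes:

```python
import random

random.seed(1)

def avg_imbalance(n_per_group, trait_prevalence=0.1, n_trials=2000):
    """Average absolute difference in the proportion of subjects with
    some incidental trait between two randomly assigned groups,
    over many simulated randomizations."""
    total = 0.0
    for _ in range(n_trials):
        a = sum(random.random() < trait_prevalence for _ in range(n_per_group))
        b = sum(random.random() < trait_prevalence for _ in range(n_per_group))
        total += abs(a - b) / n_per_group
    return total / n_trials

print(f"imbalance, n=20 per group:   {avg_imbalance(20):.3f}")
print(f"imbalance, n=2000 per group: {avg_imbalance(2000):.3f}")
```

The imbalance shrinks roughly with the square root of the sample size – the same reason a thousand coin flips land much closer to 50/50, proportionally, than ten flips do.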

Further, there is a process called stratification – with known variables, like age, sex, and race, you can make sure equal numbers get into each treatment group and not rely upon randomization. (But we have to rely on randomization for unknown variables.)
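One simple way to implement stratification is to randomize separately within each stratum. The sketch below (names and structure are my own illustration, not a description of any particular trial’s software) shuffles the subjects within each stratum – say, each sex – and then alternates assignments, so the arms can never differ by more than one subject within any stratum; unknown variables are still left to the shuffle:

```python
import random

random.seed(0)

def stratified_assign(subject_ids, stratum_of):
    """Randomly assign subjects to 'treatment'/'control' while forcing
    near-equal counts within each stratum (e.g. each sex or age band).
    stratum_of maps each subject id to its stratum label."""
    strata = {}
    for sid in subject_ids:
        strata.setdefault(stratum_of[sid], []).append(sid)
    assignment = {}
    for members in strata.values():
        random.shuffle(members)  # random order within the stratum...
        for i, sid in enumerate(members):
            # ...then alternate arms, so counts differ by at most 1
            assignment[sid] = "treatment" if i % 2 == 0 else "control"
    return assignment
```

Real trials typically use more elaborate schemes (permuted blocks, for example, to keep assignments unpredictable), but the principle is the same: balance the known variables by design, and trust randomization only with the unknowns.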


Statistical analysis is just another tool of modern science – it is part of the technology of science. And like all things, there is a wide variation in quality and understanding across individuals, and what filters down to the public is generally oversimplified to the point of being wrong.

So I applaud efforts to educate the public about the proper use of statistics, and to educate scientists and professionals for quality control. But Siegfried could have framed his article more as – here are some common mistakes to avoid and how to fix them, rather than – science is broken.

I admit this can be challenging. I lecture on how to interpret the medical literature, where I cover many of these points, and often I get questions from the audience such as – “So you’re saying that most of science is wrong?” When actually what I am saying is that many individual studies are wrong, and studies are often misinterpreted. And further you have to base conclusions on the literature, not individual studies.

But when the technology of scientific studies is used properly, they work just fine.
