May 29 2015

A Chocolate Science Sting

John Bohannon is at it again. In 2013 he published the results of a sting operation in which he submitted terrible papers with fake credentials to 304 open access journals. Over half of the journals accepted the paper for publication. He published his results in Science magazine, and it caused a bit of a stir, although arguably not as much as it should have.

Bohannon was asked to repeat this feat, this time to expose the schlocky science of the diet industry, for a documentary film that will be released shortly. He has already published his reveal; you can read his full account for the details, but here is the quick summary.

He collaborated with others to perform a real (although crappy) scientific study. His researchers recruited 16 people; after one dropped out, the remaining 15 were divided into three groups: a low-carb diet for three weeks, a low-carb diet plus daily chocolate for three weeks, and no change in diet. The results were not surprising in that the two diet groups lost 5 pounds on average, while the no-diet group did not. However, they also found that the chocolate group lost 10% more weight. He explains:

Here’s a dirty little science secret: If you measure a large number of things about a small number of people, you are almost guaranteed to get a “statistically significant” result. Our study included 18 different measurements—weight, cholesterol, sodium, blood protein levels, sleep quality, well-being, etc.—from 15 people. (One subject was dropped.) That study design is a recipe for false positives.

Think of the measurements as lottery tickets. Each one has a small chance of paying off in the form of a “significant” result that we can spin a story around and sell to the media. The more tickets you buy, the more likely you are to win. We didn’t know exactly what would pan out—the headline could have been that chocolate improves sleep or lowers blood pressure—but we knew our chances of getting at least one “statistically significant” result were pretty good.
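The lottery-ticket arithmetic is easy to check. Here is a minimal simulation sketch: the 18 outcomes match Bohannon's study, but treating them as independent is a simplifying assumption (real measurements are correlated, which changes the exact number, not the qualitative point). Under the null hypothesis a p-value is uniformly distributed, so the chance that at least one of 18 tests falls below 0.05 by luck alone is 1 − 0.95^18, roughly 60%.

```python
import random

random.seed(42)

N_OUTCOMES = 18   # measurements per study, Bohannon's number
ALPHA = 0.05
TRIALS = 100_000

# Under the null hypothesis (no real effect anywhere), each p-value is
# uniform on [0, 1]. Each outcome is one "lottery ticket".
hits = 0
for _ in range(TRIALS):
    p_values = [random.random() for _ in range(N_OUTCOMES)]
    if min(p_values) < ALPHA:
        hits += 1

simulated = hits / TRIALS
analytic = 1 - (1 - ALPHA) ** N_OUTCOMES   # 1 - 0.95**18, about 0.60

print(f"chance of at least one false positive: "
      f"analytic {analytic:.3f}, simulated {simulated:.3f}")
```

So even with nothing real going on, a study like this "wins" more often than it loses.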

Yes, our old friend p-hacking. Bohannon is simply demonstrating effects that we have been discussing here and on Science-Based Medicine for years – published small studies are likely to be false positives because they are more susceptible to quirky results and manipulation through p-hacking. Throw in a little publication bias and we have a literature flooded with worthless positive studies.

I have argued that the press should not even be reporting such preliminary studies. They are nothing but noise in the background of science. At best they serve as a way to guide future research (by giving some indication of which questions are worthwhile and helping design more rigorous studies). They are useful for scientists, but really should not be presented to the public as if their results were reliable.

Here Bohannon is using one common method of p-hacking – multiple analysis. He looked at 18 variables, but did not adjust the statistics to reflect this. You can make an adjustment for multiple analysis so that the p-values reflect the “multiple lottery tickets.” At least then the p-value is legitimate. Otherwise the p-value is meaningless.
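The simplest such adjustment is the Bonferroni correction: multiply each raw p-value by the number of tests (equivalently, lower the significance threshold to 0.05/18, about 0.0028, for 18 tests). A sketch with made-up p-values:

```python
def bonferroni(p_values):
    """Bonferroni adjustment: scale each p-value by the number of tests,
    capped at 1. Conservative, but it prices in the extra lottery tickets."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical raw p-values from four tests; purely illustrative numbers.
raw = [0.004, 0.030, 0.200, 0.650]
adjusted = bonferroni(raw)
print(adjusted)
# The 0.030 result looked "significant" on its own, but its adjusted
# value (0.12) no longer clears the 0.05 threshold.
```

More powerful corrections exist (Holm, false discovery rate methods), but the principle is the same: the more tests you run, the stronger each individual result has to be.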

Even when legitimate, p-values are problematic. Some scientific journals are discouraging or even banning their use. The essence of the problem is that p-values are being over-used as a single measure of the reliability of scientific results. The p-value was never meant to be used this way. It is really only a quick assessment of how seriously to take the data, or the signal-to-noise ratio in the data. But because the p-value became the one measure of a study’s results, that led to methods that essentially amount to p-hacking – tweaking the methods and massaging the data until you get across the magical 0.05 p-value.

Multiple analysis is just one of those methods (sometimes called researcher degrees of freedom). Other methods include collecting data until you get the result you want, making multiple comparisons among the variables, and trying multiple types of statistical analysis and then using the one that works the “best.” These tricks are often done innocently, and sometimes not-so-innocently.
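The first of those tricks, collecting data until the result pops out (often called optional stopping), is easy to demonstrate. A minimal sketch, using a two-sided z-test with known variance as an illustrative simplification: a simulated researcher adds one subject at a time and declares victory the moment p drops below 0.05, even though no real effect exists.

```python
import math
import random

random.seed(1)

def z_test_p(sample):
    """Two-sided p-value for H0: mean = 0, assuming known sd = 1."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

def peeking_study(min_n=5, max_n=50):
    """Add one subject at a time; stop early as soon as p < 0.05."""
    data = []
    for _ in range(max_n):
        data.append(random.gauss(0, 1))   # the null is true: no effect
        if len(data) >= min_n and z_test_p(data) < 0.05:
            return True                    # "significant", stop and publish
    return False

trials = 5_000
rate = sum(peeking_study() for _ in range(trials)) / trials
print(f"false-positive rate with peeking: {rate:.2f} (nominal rate is 0.05)")
```

Because the researcher gets to stop whenever the noise happens to look like a signal, the false-positive rate ends up far above the nominal 5%, without a single number being faked.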

It is great that more attention is being paid to these problems with published scientific studies, and Bohannon’s efforts should be applauded.

This is a good time to point out, however, that these problems do not mean that science is broken or that no results can be trusted. It just means you have to understand the structure of the scientific literature and how to interpret results. Science progresses through eventually performing high quality studies with clear results that can be independently replicated. Until we get to this level of evidence I am suspicious of any claims.

Preliminary, small, soft, one-off studies are just noise. I pay no attention to them (although I do pay attention to the sloppy press they often receive). More robust results start to get interesting, and then you have to pay attention to plausibility. There is no sharp demarcation line, but at some point the combination of plausibility and direct evidence is sufficient to conclude that a phenomenon is more likely to be true than untrue. Of course, scientific conclusions are always tentative and subject to revision, but probabilities can reach so high that it is reasonable to treat certain conclusions as if they were facts. At the very least, overturning them would require a mountain of evidence equal to the mountain of evidence that establishes them as true.

I can’t give you a formula for this. It takes scientific knowledge and judgment. That is why we further rely on the consensus of scientific opinion to indicate when a claim has crossed the threshold and should be generally accepted. The judgment of any individual scientist can be quirky or mistaken, but the consensus of many scientists has a greater chance of being valid because individual quirkiness should average out. It is like crowdsourcing, but within an expert population.

In any case, this is the best we can do. It has proven very effective overall, as the amazing progress in science and technology attests. So we are clearly doing something right.

But we can do better. I look at this in terms of efficiency. Science is grinding forward, and eventually most claims work themselves out. Bad ideas do not get long-term traction within science, while good ideas eventually do. The real question is how rapid and efficient this progress is. Sloppy research techniques and poor journal filters slow progress by making the system inefficient.

I would argue that sloppy science journalism does as well by contributing to the scientific illiteracy of the population. The elaborate and expensive institutions of science depend upon public support. Public beliefs also tend to drive funding and therefore research, and sometimes scientists have to waste resources addressing popular, but not very scientifically valid, ideas. Think of the money wasted researching highly implausible alternative medicine treatments, or proving yet again that vaccines do not cause autism.

How much more rapid would our progress be if these inefficiencies were worked out of the system, or at least minimized?
