Jan 07 2011

Bem’s Psi Research

The Journal of Personality and Social Psychology plans to publish a series of 9 experiments by Daryl Bem that purport to show evidence for precognition. This has sparked a heated discussion among psychologists and other scientists, centered on two questions: is it appropriate to publish such studies in a respected peer-reviewed journal, and what are the true implications of these studies? I actually think the discussion can be very constructive, but publication also entails the risk of the encroachment of pseudoscience into genuine science.


Before I delve into these 9 studies and what I think about them, let me explore one of the key controversies – should these studies be published in a peer-reviewed journal? This question comes up frequently, and there are always two camps: The case in favor of publication states that it is necessary to provoke discussion and exploration. Put the data out there, and then let the community tear it apart.

The other side argues that the peer-reviewed literature holds a special place that should be reserved only for studies that meet certain rigorous criteria, and that the entire enterprise is diminished if nonsense is allowed through the gates. Once the debate is over, the controversial paper will forever be part of the scientific record and can be referenced by those looking to build a case for a nonsensical and false idea.

Both points have merit, and I tend to be conflicted. I agree that it is good to get controversial topics out there for discussion. In this case, as we will see, I think the discussion has been particularly fruitful. But there is also significant harm in publishing studies just to provoke discussion. Peer-review is an implied endorsement. Journals can mitigate this by publishing an accompanying editorial (which will be done in this case), and that is a reasonable measure. But seriously flawed studies that are published for this reason still end up as part of the official record and will cause mischief for years, even decades.

John Maddox, then editor-in-chief of Nature, fell prey to this fallacy. He published a highly flawed review that was favorable to homeopathic research. He did it to spark critical discussion. What he found was that mainstream scientists ignored it, and homeopaths used it as propaganda to promote quackery. He later commented that it was his worst editorial decision. One can argue that the cautionary principle is more important in medical and biological journals, but this is a minor point, in my opinion. Even the basic scientific literature can have immense practical implications.

We also have to recognize that we live in the age of mass democratized information. The gatekeepers can no longer control the dissemination of information. So when studies get published in the peer-reviewed literature, that information is now out there and will be used and exploited by every sector. The mass media will use it for sensational headlines. Charlatans will use it to sell their goods. And believers in nonsense will use it as propaganda to promote their beliefs. After the controversy has died down and perhaps even been forgotten, the studies will be there as an enduring source of confusion.

Essentially, at this time my position is that individual decisions need to be made regarding specific papers, and that the accompanying editorial is a good way to mitigate the controversy. But even this is not adequate. What I propose is that peer-reviewed journals that wish to publish controversial studies for the sake of discussion should have a special section in the journal for doing so. This section will be outside the bounds of peer-review, and can be used explicitly for the purpose of getting controversial studies out there for scientific discussion. It will be clear that publication in this section is not in any way an endorsement of quality, and articles published there will not be included in the official scientific literature. You can call this section “Discussion and Controversies”, or something similar, and I imagine such sections would be popular in scientific journals. Also, such a section would free editors from having to make agonizing decisions about publishing controversial papers like Bem’s.

Bem’s Psi Studies

Bem’s approach to these 9 studies, which he has been conducting over the last 10 years, is interesting. He took standard social psychology protocols, and then reversed them to see if there was influence back in time. For example, researchers have had subjects practice words, and then later perform memory tests using the practiced words and new words. Not surprisingly, words that were previously practiced are easier to remember than novel words. Bem conducted this study in reverse – he had subjects perform memory tests, and then later had them practice with some of the words. He found that subjects tended to perform better with words they would later practice.

Of course, if this result is real and not due to an artifact of statistics or trial execution, that would imply that the future can influence the past – a reversal of the arrow of cause and effect. This is, by everything we currently know about physics and the way the universe works, impossible. It is, at least, as close to impossible as we can get in science. It is so massively implausible that only the most solid and reproducible evidence should motivate a rational scientist to even entertain the idea that it could be real.

Previously I have argued, along the lines of “extraordinary claims require extraordinary evidence,” that any claims for a new phenomenon (not just psi or paranormal, but anything new to science), in order to be accepted as probably true, should meet several criteria. The studies showing evidence for this new phenomenon should show:

1- A statistically significant effect

2- The effect size should also be significant, meaning that it is well beyond the range of statistical and methodological “noise” that studies in that field are likely to generate. (This differs by field – electrons are more predictable and quantifiable than the subjective experiences of people, for example.)

3- The results should be reproducible. Once you develop a study design, anyone who accurately reproduces the study protocol should get similar results.

The above is a minimum – it’s enough to be taken seriously and to justify further research, but also is no guarantee of being correct. It’s also nice if there are plausible theories to explain the new phenomenon, and if these theories are compatible with existing theories and knowledge about how the world works. Such theories should have implications that go beyond the initial phenomenon, and should be no more complex than is necessary to explain all data.

How do Bem’s results stack up to the above criteria? Not well. It is important to add that in order to be taken seriously, experimental results should meet all three basic criteria simultaneously. Bem’s results only meet the first criterion – statistical significance (which I will discuss more below). The effect sizes are tiny. For example, in the word test described above subjects were correct 53% of the time, when 50% is predicted by chance.

That is a small fluctuation, and for a social psychology study, in my opinion, does not deserve to be taken seriously. Even subtle problems with the execution of the study (and one or more such problems are almost always found when study protocols are investigated first hand) can result in such small effect sizes. Essentially, that is within the experimental noise of social psychology studies.
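To make the gap between statistical significance and effect size concrete, here is a minimal Python sketch. It computes an exact one-sided binomial p-value for a 53% hit rate when chance is 50%. The trial counts and the function name one_sided_p are hypothetical choices for illustration; they are not Bem’s actual sample sizes or his analysis.

```python
from math import comb


def one_sided_p(hits: int, trials: int) -> float:
    """Exact P(X >= hits) for X ~ Binomial(trials, 1/2).

    Integer arithmetic is used for the tail sum so that large trial
    counts do not underflow to zero before the final division.
    """
    tail = sum(comb(trials, k) for k in range(hits, trials + 1))
    return tail / 2**trials


if __name__ == "__main__":
    # Hypothetical trial counts, each with the same 53% hit rate.
    for trials in (100, 1000, 5000):
        hits = trials * 53 // 100
        print(trials, hits, one_sided_p(hits, trials))
```

With the hit rate held at 53%, the 100-trial case is nowhere near the conventional 0.05 threshold, while the 1,000-trial case falls below it. The effect is equally tiny in both; only the sample size has changed.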

You can also look at it this way – there are hundreds of ways to bias such studies and skew the results away from statistical neutrality. When you have effect sizes that are 20-30% or more, such biases should be easy to spot and eliminate from the protocol. But as the effect sizes get smaller and smaller, you get diminishing returns in terms of locating and eliminating all sources of bias. When you get down to only a few percent difference, it is essentially impossible to be confident that every source of subtle bias has been eliminated.
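A rough back-of-the-envelope sketch shows how little bias it takes. Assume, purely hypothetically, that on some fraction of trials a flaw in the protocol hands the subject the correct answer, and that every other trial is pure chance. The mechanism and the function names below are invented for illustration and say nothing about what actually happened in Bem’s studies.

```python
def expected_hit_rate(compromised: float, chance: float = 0.5) -> float:
    """Expected hit rate if a fraction of trials is always answered
    correctly (hypothetical flaw) and the rest are answered at chance."""
    return compromised + (1.0 - compromised) * chance


def compromised_fraction_needed(observed: float, chance: float = 0.5) -> float:
    """Fraction of flawed trials that would fully explain an observed
    hit rate, under the same hypothetical model."""
    return (observed - chance) / (1.0 - chance)


if __name__ == "__main__":
    # How many compromised trials would account for 53% vs 50%?
    print(compromised_fraction_needed(0.53))
```

Under this toy model, a flaw affecting about 6 trials in 100 is enough to turn a 50% chance rate into a 53% hit rate.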

That is the reason for the third criterion, replication. It is less likely that the same sources of bias would be present when different researchers perform the same protocol independently. Therefore, the more a protocol has been replicated with similar results, the smaller an effect size we would take seriously. There is no magic formula either – but we can reasonably accept somewhat smaller effect sizes with replication. Even then, 1-3% is hard to ever take seriously in a psychology experiment. There can still be sources of bias inherent to the protocol, and history has shown these can be subtle and evade detection for years. And, it should be noted, even with large effect sizes, we still wait for replication before granting tentative probability to a new phenomenon.

Bem’s research so far has failed the replication criterion. There have been three completed attempts to replicate part of Bem’s research – all negative so far. Other studies are ongoing.

So at this time we have a series of studies with tiny effect sizes that have not been replicated, and in fact with negative replication so far. Regardless of the nature of the phenomenon under study, this is not impressive. It is preliminary at best and very far from the kind of evidence needed to conclude that a new phenomenon is probably real and deserves further research. If we add that the new phenomenon is also probably impossible, that puts the research into an even more critical context.

Evidence-Based vs Science-Based

Perhaps the best thing to come out of Bem’s research is an editorial to be printed with the studies – Why Psychologists Must Change the Way They Analyze Their Data: The Case of Psi by Eric-Jan Wagenmakers, Ruud Wetzels, Denny Borsboom, & Han van der Maas from the University of Amsterdam. I urge you to read this paper in its entirety, and I am definitely adding this to my filing cabinet of seminal papers. They hit the nail absolutely on the head with their analysis.

Their primary point is this – when research finds positive results for an apparently impossible phenomenon, this is probably not telling us something new about the universe, but rather is probably telling us something very important about the limitations of our research methods.

This is a core bit of skeptical wisdom. It is supreme naivete and hubris to imagine that our research methods are so airtight that a tiny apparent effect in studies involving things as complex as people should result in rewriting major portions of our science textbooks. It is far more likely (and history bears this out) that there is something wrong with the research methods or the analysis of the data.

Wagenmakers and his coauthors explain this in technical detail. They write:

Here we discuss several limitations of Bem’s experiments on psi; in particular, we show that the data analysis was partly exploratory, and that one-sided p-values may overstate the statistical evidence against the null hypothesis. We reanalyze Bem’s data using a default Bayesian t-test and show that the evidence for psi is weak to nonexistent.

This sounds remarkably similar to what we have been saying over at Science-Based Medicine – and is related to the difference between evidence-based medicine and science-based medicine. In fact, this paper, if applied to medicine, would be a perfect SBM editorial. They expand their primary point, writing:

The most important flaws in the Bem experiments, discussed below in detail, are the following: (1) confusion between exploratory and confirmatory studies; (2) insufficient attention to the fact that the probability of the data given the hypothesis does not equal the probability of the hypothesis given the data (i.e., the fallacy of the transposed conditional); (3) application of a test that overstates the evidence against the null hypothesis, an unfortunate tendency that is exacerbated as the number of participants grows large.

The first point is critical, and one you probably have heard me make many times. Preliminary research is preliminary, and most of it turns out to be wrong. But entire fields have been based upon unreliable preliminary research.

The second point is a technical one regarding p-values, which are often misinterpreted as the chance that the phenomenon is real. This is not the case – a p-value is just the probability of getting data at least as extreme as what was observed, assuming the null hypothesis is true. Relying on p-values tends to favor rejection of the null hypothesis (concluding the phenomenon is real). It is more appropriate, as we have explicitly argued on SBM, to use a Bayesian analysis – what is the probability of a phenomenon being true given this new data? This type of analysis takes into account the prior probability of a phenomenon being true, which means the more unlikely it is, the better the new data has to be in order to significantly change the likelihood of rejecting the null hypothesis. In other words – extraordinary claims require extraordinary evidence.
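The role of the prior can be sketched in a few lines of Python. In odds form, Bayes’ theorem says that posterior odds equal prior odds multiplied by the Bayes factor (how much more likely the data are if the phenomenon is real than if it is not). The priors and the Bayes factor used below are made-up numbers for illustration, not estimates from Bem’s data, and the function name is hypothetical.

```python
def posterior_probability(prior: float, bayes_factor: float) -> float:
    """Update a prior probability (strictly between 0 and 1) using a
    Bayes factor in favor of the phenomenon being real."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    # Same hypothetical evidence (Bayes factor of 10), different priors.
    for prior in (0.5, 0.01, 1e-6):
        print(prior, posterior_probability(prior, 10.0))
```

The same evidence that makes a coin-flip hypothesis quite likely barely moves a hypothesis that started out at one in a million.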

The third point is a technical one about statistical analysis. It is noteworthy that none of the peer-reviewers on the Bem studies were statisticians – a practice that perhaps journals need to correct.
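The idea that a p-value overstates the evidence more and more as the sample grows can be illustrated with a simplified sketch. This is not the default Bayesian t-test that Wagenmakers and colleagues used. It swaps in a much simpler binomial model, comparing a null hit rate of exactly 50% against an alternative with a uniform prior on the hit rate, chosen here only because its Bayes factor has a closed form. The function names and trial counts are hypothetical.

```python
from math import comb


def critical_hits(trials: int, alpha_den: int = 20) -> int:
    """Smallest hit count whose exact one-sided p-value (chance = 1/2)
    is below alpha = 1/alpha_den. Pure integer arithmetic."""
    total = 2**trials
    term = 1   # C(trials, trials)
    tail = 0   # sum of C(trials, j) for j > k
    for k in range(trials, -1, -1):
        # Would counting k hits as "extreme" push the p-value to alpha or more?
        if (tail + term) * alpha_den >= total:
            return k + 1
        tail += term
        term = term * k // (trials - k + 1)  # C(trials, k-1) from C(trials, k)
    return 0


def bf01_uniform(hits: int, trials: int) -> float:
    """Bayes factor in favor of the null (rate = 1/2) against a uniform
    prior on the rate. The null likelihood is C(n,k)/2^n and the marginal
    likelihood under the uniform prior is 1/(n+1)."""
    return comb(trials, hits) * (trials + 1) / 2**trials


if __name__ == "__main__":
    # Hypothetical trial counts; each result is only just "significant".
    for trials in (100, 1000, 10000):
        hits = critical_hits(trials)
        print(trials, hits, bf01_uniform(hits, trials))
```

In each case the data are only just significant at the one-sided 0.05 level, yet under this model the Bayes factor in favor of the null grows with the number of trials. A result that clears the p-value threshold in a large sample can therefore be evidence for the null rather than against it.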


In the final analysis, this new data from Bem is not convincing at all. It shows very small effect sizes, within the range of noise, and has not been replicated. Further, the statistical analysis used was biased in favor of finding significance, even for questionable data.

But examining Bem’s studies is a very useful exercise, and I am glad that the conversation has taken the direction it has. I am particularly happy with the Wagenmakers editorial, which is an endorsement of exactly what we have been arguing for in skeptical circles in general, and at Science-Based Medicine in particular.

It further demonstrates the utility for science and scientists in addressing fringe claims. The lessons to be learned from Bem’s research about the methodological limitations of research and how to interpret results apply to all of science, especially the social sciences and medicine (which also deal with people as the primary subjects of research). We can and should apply these lessons when dealing with acupuncture, EMDR therapy, and homeopathy, and even with mainstream practices within medicine and psychology.

Bem has unwittingly performed a useful set of experiments. By conducting careful research into claims that we can confidently conclude are impossible, he has exposed many of the limitations of such research. He has also sparked a discussion of the purpose and effectiveness of peer-review.
