Oct 20 2020

Daryl Bem, Psi Research, and Fixing Science

In 2011 Daryl Bem published a series of ten studies which he claimed demonstrated psi phenomena – that people could “feel the future”. He took standard psychological study methods and simply reversed the order of events, so that the effect was measured prior to the stimulus. Bem claimed to find significant results – therefore psi is real. Skeptics and psychologists were not impressed, for various reasons. At the time, I wrote this:

Perhaps the best thing to come out of Bem’s research is an editorial to be printed with the studies – Why Psychologists Must Change the Way They Analyze Their Data: The Case of Psi by Eric-Jan Wagenmakers, Ruud Wetzels, Denny Borsboom, & Han van der Maas from the University of Amsterdam. I urge you to read this paper in its entirety, and I am definitely adding this to my filing cabinet of seminal papers. They hit the nail absolutely on the head with their analysis.

Their primary point is this – when research finds positive results for an apparently impossible phenomenon, this is probably not telling us something new about the universe, but rather is probably telling us something very important about the limitations of our research methods.

I interviewed Wagenmakers for the SGU, and he added some juicy tidbits. For example, Bem had previously authored a chapter in a textbook on research methodology in which he essentially advocated for p-hacking. This refers to a set of bad research practices that give researchers enough wiggle room to fudge their results – enough to make negative data appear statistically significant. It can be as seemingly innocent as deciding when to stop collecting data after you have already peeked at some of the results.
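To see why that seemingly innocent choice matters, here is a minimal simulation sketch in Python (the one-sample t-test, the peeking schedule, and the sample sizes are my own illustrative assumptions, not Bem’s actual procedure). It repeatedly “runs” a study of a true null effect, peeking at the data every ten subjects and stopping as soon as the result is significant:

```python
# Illustrative sketch: how "peek and stop when significant" inflates false positives.
# Assumptions (mine, not Bem's actual procedure): a one-sample t-test on pure noise,
# peeking after every 10 subjects, stopping early if p < .05, maximum 200 subjects.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def optional_stopping_run(max_n=200, peek_every=10, alpha=0.05):
    """Simulate one study of a true null effect, stopping as soon as p < alpha."""
    data = rng.standard_normal(max_n)  # the true effect is exactly zero
    for n in range(peek_every, max_n + 1, peek_every):
        if stats.ttest_1samp(data[:n], 0.0).pvalue < alpha:
            return True  # the researcher stops here and reports a "significant" result
    return False

n_studies = 2000
false_positives = sum(optional_stopping_run() for _ in range(n_studies))
print(f"False-positive rate with optional stopping: {false_positives / n_studies:.1%}")
# Typically lands somewhere around 20%, several times the nominal 5% of a fixed-N test.
```

Nothing in that procedure is fraudulent in any obvious way, which is exactly the problem: the stopping rule alone turns a 5% false-positive rate into something several times higher.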

Richard Wiseman, who was one of the first psychologists to try to replicate Bem’s research and came up with negative results, recently published a paper discussing this very issue. In his blog post about the article he credits Bem’s research with being a significant motivator for improving research rigor in psychology:

Several researchers noted that the criticisms aimed at Bem’s work also applied to many studies from mainstream psychology. Many of the problems surrounded researchers changing their statistics and hypotheses after they had looked at their data, and so commentators urged researchers to submit a detailed description of their plans prior to running their studies. In 2013, psychologist Chris Chambers played a key role in getting the academic journal Cortex to adopt the procedure (known as a Registered Report), and many other journals quickly followed suit.

The paper itself reviews the history of pre-registration of study protocols in psychology. Wiseman and his coauthors found that the first historical case was, ironically, in the parapsychological literature.

Although this approach is usually seen as a relatively recent development, we note that a prototype of this publishing model was initiated in the mid-1970s by parapsychologist Martin Johnson in the European Journal of Parapsychology (EJP).

Further:

To find out if this was the case, I recently teamed up with Caroline Watt and Diana Kornbrot to examine the studies. The results were as expected – around 8% of the analyses from the studies that had been registered in advance were positive, compared to around 28% from the other papers.

What are the lessons we can take from all of this? In my opinion this is a perfect case study for the integration of more scientific skepticism into mainstream science and academia (and Richard Wiseman is also a major proponent and practitioner of this). It shows the positive results that can come from mainstream scientists studying pseudoscience. In my experience academics often live in a bubble, based upon a false dichotomy – the notion that there is mainstream science on one hand, and then there are the “kooks” who can comfortably be ignored. On this view, the kooks should be ignored, because giving them any attention is counterproductive and sullies one’s academic reputation. When their work does get published, it is often because some editor dropped their guard and gullibly promoted nonsense.

I have often argued, however, that studying pseudoscience is like a physician studying disease. We don’t just study healthy subjects – we study disease precisely so that we can learn how best to maintain health. Studying exactly what goes wrong with pseudoscience should provide lessons and insight that can be applied to mainstream science, which may suffer more subtle versions of the problems that plague pseudoscience. In fact, science vs. pseudoscience is a false dichotomy (the difficulty of drawing a sharp line between them is known as the demarcation problem). There is, rather, a continuum, so when experts study “pseudoscience” they are actually studying science.

As I stated in 2011 – Bem published a series of studies in the peer-reviewed literature using standard research techniques that claim to demonstrate a phenomenon that should be impossible. It is far less likely that we need to rewrite all the books on physics and neuroscience than that there is something wrong with standard research techniques. In fact, we already knew what was wrong, and Bem’s research was a prime example of it. When I say “we” I mean scientific skeptics, and those scientists sufficiently familiar with these principles.

Bem published studies with razor-thin effect sizes and without any internal replications. He also did not pre-register his methods, and given his prior advocacy for dubious “p-hacking” methods this is a huge red flag. Later attempts to replicate his findings have largely failed. But it is important to recognize that the errors Bem committed are actually very common in psychological and other areas of research. Parapsychology in general is a great case study in how not to conduct research. Parapsychologists are famous for things like “optional starting and stopping”, which is nothing but a method for cherry picking evidence after the fact.

It’s hard to know how many of these fixes would have been put in place without the stimulus of Bem’s research. Some journals now require the reporting of effect sizes, or ban p-values altogether. As Wiseman points out, perhaps the best fix is to simply preregister your methods, which should eliminate p-hacking. Requiring internal replications is another solution, as is using a variety of statistical methods, such as Bayesian analysis. Bem, in addition to many other examples of pseudoscience, helped show the cracks in the system. Now some of those cracks are being filled, but there is still a lot of work to do.
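As a rough illustration of why a Bayesian analysis helps, here is a sketch using the BIC approximation to the Bayes factor for a one-sample t-test (the numbers n = 100 and t = 2.0 are hypothetical, chosen to resemble the kind of barely significant result at issue, and the approximation is a simplification of a full Bayesian analysis):

```python
# Illustrative sketch: a p-value just under .05 can correspond to essentially no
# Bayesian evidence against the null. Uses the BIC approximation to the Bayes
# factor for a one-sample t-test; n = 100 and t = 2.0 are hypothetical values
# chosen to mimic a "barely significant" result, not figures from Bem's studies.
import numpy as np
from scipy import stats

def bf01_bic(t, n):
    """Approximate Bayes factor BF01 (evidence for H0 over H1) from a t statistic."""
    return np.sqrt(n) * (1.0 + t**2 / (n - 1)) ** (-n / 2.0)

n, t = 100, 2.0
p = 2 * stats.t.sf(abs(t), df=n - 1)   # two-sided p-value, about 0.048
bf01 = bf01_bic(t, n)
print(f"p = {p:.3f}")                  # "significant" by the usual p < .05 criterion
print(f"BF01 (approx.) = {bf01:.2f}")  # about 1.4: the data, if anything, slightly favor the null
```

This is essentially the point Wagenmakers and his colleagues made about Bem’s data: results that clear the p < .05 bar can still amount to very weak evidence once a proper measure of evidence, and prior plausibility, are taken into account.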

I think (although I am admittedly massively biased) that science-based medicine is another example of applying knowledge gained through skepticism to mainstream science. Evidence-based medicine has all the cracks that psychological research has – problems with small effect sizes, overinterpreting preliminary results, p-hacking, lack of replications, publication bias, citation bias, and failure to properly consider prior plausibility. SBM is the fix.

I would like to end with one final example of the power of improving the rigor of research methodology. Bem, to his credit, was open to replicating his research, and collaborated with other psychologists on follow-up studies of high rigor, including preregistration, to definitively answer the question of the validity of his research. I suspect he thought this would prove he was right, but in any case, he did it. The result:

They presented their results last summer (2016), at the most recent annual meeting of the Parapsychological Association. According to their pre-registered analysis, there was no evidence at all for ESP, nor was there any correlation between the attitudes of the experimenters—whether they were believers or skeptics when it came to psi—and the outcomes of the study. In summary, their large-scale, multisite, pre-registered replication ended in a failure.

Pre-registration made Bem’s spurious results disappear, which is pretty convincing evidence that they were the product of p-hacking in the first place. How did Bem respond to these results?

In their conference abstract, though, Bem and his co-authors found a way to wring some droplets of confirmation from the data. After adding in a set of new statistical tests, ex post facto, they concluded that the evidence for ESP was indeed “highly significant.”

Ack! He reintroduced p-hacking to manufacture positive results. He learned nothing. But hopefully the broader psychological and scientific communities did.
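For a sense of how adding statistical tests after the fact can “rescue” a null result, here is one last minimal sketch (the setup, twenty candidate analyses of pure noise with only the best one reported, is purely hypothetical and not a description of Bem’s actual reanalysis):

```python
# Illustrative sketch: "pick the best test" p-hacking. Hypothetical setup: a null
# data set is analyzed 20 different ways (different subgroups, outcome measures,
# or tests), and only the smallest p-value is reported as the finding.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_experiments = 2000
n_tests_per_experiment = 20  # assumed number of post-hoc analyses tried
hits = 0
for _ in range(n_experiments):
    # Each "analysis" here is an independent test of a true null hypothesis.
    p_values = [stats.ttest_1samp(rng.standard_normal(50), 0.0).pvalue
                for _ in range(n_tests_per_experiment)]
    if min(p_values) < 0.05:
        hits += 1

print(f"Chance of at least one 'significant' analysis: {hits / n_experiments:.1%}")
# Roughly 1 - 0.95**20, about 64%, when the tests are independent - versus the nominal 5%.
```

Real analyses are rarely fully independent, so the inflation is usually smaller than this worst case, but the direction of the bias is the same: try enough tests after seeing the data and something will look “highly significant”.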
