Feb 26 2007

Data Mining – Adventures in Pattern Recognition

Correlations everywhere, now I must stop to think.

Science is often tricky business, requiring both detailed knowledge and sophisticated understanding. Pseudoscientists often earn the label of “pseudo” because they just don’t get the details right – often persistently in one direction, with profound implications for their conclusions. To be a good scholar in any field, and in science in particular, it is a prerequisite to have at least a basic understanding of logic – of how to think. This includes knowledge of the common pitfalls of human thinking and how to avoid them. (Logic, as the name of my blog implies, is a theme I will return to frequently.)

The problem of data mining is common, yet subtle and easily missed. It is rooted in both the nature of human brain function and a common logical fallacy. The former is that of pattern recognition and the latter the fallacy of confusing correlation with causation.

That people have pattern seeking mental habits is often pointed out by skeptics, since we frequently engage in debunking fake patterns. Also, this psychological observation is being increasingly reinforced by neurological advances in our knowledge of how the brain functions. The human brain, it turns out, processes and remembers information primarily through pattern recognition.

It is further critical to recognize that the world contains many appearances of patterns that do not reflect underlying reality. We are confronted with a tremendous amount of information about the world around us. There exist true patterns in this information, such as the recurrence of the seasons, but much of it is random. Random information, further, is likely to contain patterns by chance alone. As Sagan eloquently pointed out, randomness is clumpy. We tend to notice clumps. They stick out as potential patterns. Apparently, although we evolved to be highly skilled at noticing clumps, we did not evolve to be very discriminating about which clumps are real and which are not – it seems there was a greater evolutionary advantage to simply assuming that any noticed clump or pattern is probably real, and then acting accordingly.

This is why we need science. Science is partly the task of separating those patterns that are real from those that are accidents of random clumpiness.

Data mining refers to the process of actively searching large sets of data for any patterns (correlations), often in the context of a statistical analysis of compiled data – whether gathered in a study or pulled from databases of biographical or other information. But since random data is clumpy, we should expect to see accidental correlations even when there is no real underlying phenomenon causing them.
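A small simulation makes this concrete. This is purely my illustration, not data from any study: twenty variables of pure random noise, with every pairwise correlation checked. Nothing real connects any of the variables, yet some pairs will correlate “significantly” by chance alone.

```python
# Illustration only: "mining" pure noise for correlations.
# 20 variables of random noise, 50 observations each -- no real
# relationships exist, yet some pairs correlate by chance.
import math
import random

random.seed(42)  # arbitrary seed, for reproducibility

n_vars, n_obs = 20, 50
data = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# |r| > 0.28 corresponds roughly to p < 0.05 for 50 observations
threshold = 0.28
hits = [(i, j)
        for i in range(n_vars) for j in range(i + 1, n_vars)
        if abs(pearson(data[i], data[j])) > threshold]

n_pairs = n_vars * (n_vars - 1) // 2  # 190 pairs searched
print(f"searched {n_pairs} pairs, found {len(hits)} 'significant' correlations")
```

With a 5% false-positive rate per pair, searching 190 pairs of pure noise is expected to turn up around nine or ten “significant” correlations – every one of them meaningless.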

The key feature of data mining is that potential correlations are not determined ahead of time – so any correlation counts as a hit. Functionally this is the same as just noticing a pattern even though you were not actively looking. For example, a doctor may notice that they have seen a cluster of a particular rare disease recently. Or someone may notice that they typically have a bad day at work on Tuesdays. We are all, in effect, mining the data of the world around us every day, subconsciously looking for patterns.

Such patterns may be real – may reflect an actual underlying cause – but they are more likely to be random clumpiness. The reason for this is the vast number of potential correlations that could happen by chance. There are so many that we should be confronted with patterns, whether actively or passively mining the data, on a regular basis.

From a statistical point of view, you cannot simply calculate the odds of the particular correlation occurring by chance alone. That particular correlation may be incredibly unlikely – with odds of thousands or even millions to one against its occurrence by chance. This may seem compelling, but it is misleading because that is asking the wrong question. The real question is – what are the odds of ANY correlation occurring in this data set?
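The arithmetic behind that “wrong question” is simple. If each of m independent comparisons has probability p of producing a chance hit, the probability of at least one hit somewhere is 1 − (1 − p)^m, which climbs toward certainty very quickly (a sketch of mine, using the conventional 5% rate):

```python
# The odds of ANY chance correlation across m comparisons:
# each comparison has probability p of a false hit, so the chance
# of at least one hit among m comparisons is 1 - (1 - p)**m.
p = 0.05  # conventional per-comparison false-positive rate
for m in (1, 20, 100, 190):
    p_any = 1 - (1 - p) ** m
    print(f"{m:4d} comparisons -> P(at least one chance hit) = {p_any:.4f}")
```

By 190 comparisons (the pairings of just twenty variables), a chance “correlation” is all but guaranteed – so finding one tells you almost nothing.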

Therefore any pattern or correlation that is the result of searching a large data set (again, whether actively or passively) for any potential correlation should be viewed as only a possible correlation, requiring further validation. Such correlations can serve as the beginning of meaningful research, not its conclusion. Legitimate scientific procedure is then to ask – is this correlation real? The doctor in the example above should ask – is this a real outbreak of this rare disease, or a random cluster? To confirm a correlation you must specify it ahead of time, then test a new, fresh set of data to see if the correlation holds up. Since you are now looking for a specific, pre-determined correlation, you can legitimately apply statistics and ask what the probability is of that correlation occurring by chance.
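This two-stage procedure can be sketched in miniature (again my illustration, on artificial noise): mine one data set for its single strongest correlation, then test that one pre-specified pair on a completely independent sample and watch the “finding” shrink back toward zero.

```python
# Two-stage check: mine noise for its strongest correlation, then
# retest that ONE pre-specified pair on fresh, independent noise.
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def noise_table(n_vars=20, n_obs=50):
    """A table of pure random noise: no real correlations exist."""
    return [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

random.seed(7)  # arbitrary seed, for reproducibility

# Stage 1: mine one data set for the single strongest correlation.
mined = noise_table()
pairs = [(i, j) for i in range(20) for j in range(i + 1, 20)]
best = max(pairs, key=lambda pr: abs(pearson(mined[pr[0]], mined[pr[1]])))
r_mined = pearson(mined[best[0]], mined[best[1]])

# Stage 2: test that pre-specified pair on entirely independent data.
fresh = noise_table()
r_fresh = pearson(fresh[best[0]], fresh[best[1]])

print(f"mined r = {r_mined:+.2f}, fresh-data r = {r_fresh:+.2f}")
```

The mined correlation looks impressive precisely because it was selected as the best of 190 candidates; on independent data, with the hypothesis fixed in advance, there is nothing left to find.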

There is still another statistical pitfall to avoid, however. When looking at the new data set, you cannot include the original data that made you notice the correlation in the first place. The new data has to be entirely independent. This is to avoid simply carrying forward a random correlation that was mined out of a large set of data.

Data mining errors occur in science all the time. They are most prominent in epidemiological studies – studies that comb large data sets for correlations. To be fair, most of the time such findings are presented properly – as preliminary data that needs verification. Such correlations are run through the mill of science and either hold up or they don’t. But this process may take years. Meanwhile, the media often report such preliminary correlations as if they were conclusions, without putting them into their proper scientific context. So the public is treated to an endless parade of possible correlations coming from scientists, and is largely unaware of what role such data plays in the overall process of science. I also think that scientists and the institutions that support them are often to blame, for example by sending out press releases to announce an interesting new health correlation before it has been verified.

Data mining is a nuisance in mainstream science (competent scientists and statisticians should be well aware of it and know how to avoid it), but it is endemic in pseudoscience. It is (in my personal experience – I don’t have any controlled data) one of the most common errors made by those on the fringe. Astrology is an excellent example. Studies alleged to support astrology are built almost entirely on data mining – and the correlations always evaporate when valid statistics are applied or independent tests are run.

It is also a huge factor in our everyday interaction with the world around us. We are all pattern seeking, data mining, creatures. We are compelled by the patterns we see – they speak to us. Our “common sense” often fails to properly guide us, apparently being shaped by evolution to err hugely on the side of accepting whatever patterns we see. The only way to navigate through the sea of patterns is with the systematic methods of logic and testing that collectively are known as science.
