Dec 17 2010
Google Books Ngram Viewer
This is my new favorite internet toy – the Google Books Ngram Viewer. You may already know that Google Books is a project to digitize as many books as possible. Many of the more recent books are still under copyright, so they cannot be made available online for free (authors and publishers have to eat too). But the words can be made available. The Ngram Viewer is a great example of the power of computers and the internet to facilitate research and human knowledge.
Google currently has 5,195,769 books digitized – a massive storehouse of knowledge covering about 400 years of human culture. The Ngram Viewer lets you search on words and then plots a graph of how often those words appear, year by year, in the digitized books. This allows you to see trends over time. Obviously this can be used to track word usage, and is a boon to etymologists, but those words also have meanings, and the ideas behind them can be tracked as well. There are multiple variables involved, of course, but this is still a powerful window into the reflection of human culture in the written word.
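The kind of counting behind such a graph can be sketched in a few lines of Python. This is only an illustration of the idea, not Google's actual pipeline: the function name, the whitespace tokenizer, and the tiny corpus are all made up for the example.

```python
from collections import Counter

def yearly_relative_frequency(corpus, word):
    """Return {year: share of all words published that year that equal `word`}."""
    hits = Counter()    # occurrences of `word` per year
    totals = Counter()  # all words per year
    for year, text in corpus:
        tokens = text.split()  # naive whitespace tokenizer (an assumption)
        totals[year] += len(tokens)
        hits[year] += sum(1 for t in tokens if t == word)  # case-sensitive match
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}

# Invented toy corpus of (publication year, text) pairs, for illustration only.
toy_corpus = [
    (1850, "faith and reason and faith"),
    (1850, "the age of faith"),
    (1950, "science and reason"),
    (1950, "science of science and faith"),
]

print(yearly_relative_frequency(toy_corpus, "faith"))
print(yearly_relative_frequency(toy_corpus, "science"))
```

Dividing by the total number of words per year is what makes different years comparable, since far more books were published in 1950 than in 1850.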
This is already the source of a great deal of research – which goes beyond simply searching on words to comparing multiple searches looking for trends. For example, one researcher compared the use of names of famous people to track their fame. He found that over time people tend to become famous more quickly and at a younger age, and that their fame fades faster as well.
I decided to do a little quick research myself. Here’s what I found:
First – how popular is skepticism in the last two hundred years?
It seems that people are becoming more skeptical over time, with a huge uptick since 1900. Skeptics emerge in 1820 and then remain fairly steady. Meanwhile, references to the paranormal and pseudoscience make their appearance around 1940 and steadily increase, but are still dwarfed by references to being skeptical. What does this actually mean? That’s the hard part – who knows. That would take much more triangulation and investigation. But these raw numbers are still interesting.
How have our favorite pseudosciences fared over the last 200 years? Let’s see:
Not surprisingly, references to UFOs appear around 1950 and steadily increase to the present time. I would have thought there would be more ups and downs, but it appears publishing books about UFOs continues to be an increasing market. Around the same time Bigfoot comes into popular culture. Interest in Bigfoot is much lower than interest in UFOs, and seems to have leveled off.
ESP has a blip around 1900 but interest does not take off until around 1930. Interestingly, the popularity of ESP seems to peak in the late 1970s and has dropped off considerably since then.
Next up – evolution vs creationism:
Not surprisingly, reference to “evolution” takes off after the publication of Charles Darwin’s On the Origin of Species in 1859. Writing about evolution dipped perceptibly after the Scopes “Monkey trial” in 1925, but then recovered in the 1940s. Today it continues to increase steadily. By contrast, references to “Creation” have been flat, and losing to “evolution” since 1859. The search is case-sensitive. I also searched on “creation” with a small “c”, but obviously this word can refer to much more than the origin of life, so that result is not really meaningful. I also searched on “creationism” and “intelligent design”, and these terms barely register in the last 30 years.
Finally, I ran the ultimate cultural death match – science vs faith.
Interest in science has slowly but steadily increased over the last two hundred years. Faith, on the other hand, has plummeted since around 1845. I don’t have the background to put that into any scholarly context – but it is interesting. The streams cross around 1925 (ironically, also the date of the Scopes trial – not sure of the significance), and science has been beating faith for the last 85 years – go science!
Obviously my quickie searches are of limited value as actual research, because there is no context or statistical analysis. But I did this in about 5 minutes, just to show the power (and fun) of this new tool. In the hands of actual researchers who have hours to spend on thoughtful analysis, imagine what this research tool can be used for.
This new tool represents the power of the digital information age – increasingly, the drudgery and the cost in time and other resources of doing research are being lowered, freeing individuals to utilize their pure creativity. Anyone can use this site to do research – all you need is an idea.
21 Responses to “Google Books Ngram Viewer”
Hello Mr Novella,
this Ngram viewer really rocks. I just stumbled over it reading about it on the “Not Exactly Rocket Science” blog.
About the “science” vs “faith” chart I have to ask if there’s a good reason for limiting the time axis to the year 2000 – except for the fact that it’s the default?
Because if you enter 2010 (it jumps to 2008 because that’s probably the last available year) the chart makes an unpleasant turn back to the “dark side” – so actually it doesn’t look so bright.
Greetings from snowy Southern Germany and thanks for some great years of delightful blogs and podcasts,
Joerg
I used 2000 because it was the default. When I played around with it, the 2001-2008 numbers always seemed to jump one way or the other, so I wondered if that data is not as reliable, and I stuck with the default.
Faith does appear to make a comeback since 2000, but the overall long term trend is still much in favor of science. You pretty much get the same curve with “religion” as you do with “faith”, BTW.
I can’t stop fiddling with this darn thing.
A few things I’ve found:
1.) The percentage use of the word ‘the’ has declined over the past 200 years about a half percent while ‘a’ remains relatively constant.
2.) It’s interesting to find words that correlate. Some are obvious, like “war, peace” over the past 200 years. You can see how they rise and fall around WWII, WWI, etc. Or there is “war, peace, gun”, which all follow a similar pattern.
3.) There are some words that seem to correlate, but I have no idea why. Like “ghost, turkey” from 1940 to the present. What the hell is going on there?
4.) At the beginning of the 19th century we were all about “Sparta”; now is the age of “Athens”.
I’m sure if I work hard enough, and put in the right combination of words I’ll find proof that 9-11 was an inside job.
Speaking of job. I suppose I should get back to doing mine.
Uh oh. “Ngram” looks just like “engram”. I wonder how long it will be before Scientology sues Google for stealing their word for made up nonsense and applying it to something actually useful?
Google’s “Books Ngram Viewer” is another free online tool that allows us to visualize information in new ways. I explore how to use the tool in the classroom to help students better understand the research method in my blog post – “How To Quantify Culture? Explore 500 Billion Published Words With Google’s Ngram Viewer” http://bit.ly/gcKJdp
PS – It includes a Rickrolling Easter Egg – Search for “never gonna give you up” and see what pops up!
Good news, if you compare faith and reason, reason clearly dominates, and shows a similar increase to faith after the year 2000. So post-2000 trends aren’t as bad as it would seem.
Sadly though science was in decline, at least on the graph.
Homeopathy, acupuncture, chiropractic
PZ Myers is winning:
nonsense, bullshit
http://ngrams.googlelabs.com/graph?content=nonsense%2Cbullshit&year_start=1950&year_end=2008&corpus=5&smoothing=3
There is a gradual decline in the use of “nonsense” as the use of “bullshit” increases.
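The graph links in this thread all follow the same query-string pattern, so one can be put together programmatically. A minimal sketch, assuming only what is visible in those links: the host and the parameter names (`content`, `year_start`, `year_end`, `corpus`, `smoothing`) are copied from the URLs here, while the helper name `ngram_url` and its default values are hypothetical. What the `corpus` and `smoothing` numbers mean is not explained in this thread, so they are passed through as-is.

```python
from urllib.parse import urlencode

# Host exactly as it appears in the links in this thread; it may not resolve forever.
BASE_URL = "http://ngrams.googlelabs.com/graph"

def ngram_url(terms, year_start=1800, year_end=2000, corpus=0, smoothing=3):
    """Build a graph link in the same shape as the ones pasted in the comments.

    The defaults are assumptions chosen for illustration, not documented values.
    """
    query = urlencode({
        "content": ",".join(terms),  # comma-separated terms; urlencode escapes the commas
        "year_start": year_start,
        "year_end": year_end,
        "corpus": corpus,
        "smoothing": smoothing,
    })
    return f"{BASE_URL}?{query}"

# Rebuilds the “nonsense, bullshit” link from the comment above.
print(ngram_url(["nonsense", "bullshit"], year_start=1950, year_end=2008, corpus=5))
```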
Get ready…
Fight!
birds, monkeys
Pwnd.
“America” peaks in the mid seventies.
http://ngrams.googlelabs.com/graph?content=america%2Cengland%2Caustralia&year_start=1950&year_end=2008&corpus=5&smoothing=3
Interestingly, the Vietnam War ended in 1973
Sex neutral language:
“spokesman” and “spokesperson”
http://ngrams.googlelabs.com/graph?content=spokesman%2Cspokesperson&year_start=1950&year_end=2008&corpus=5&smoothing=3
Simplicity vs Complexity:
http://ngrams.googlelabs.com/graph?content=simplicity%2Ccomplexity&year_start=1950&year_end=2008&corpus=5&smoothing=0
Richard Dawkins,Daniel Dennett,Christopher Hitchens,Victor Stenger
Early Microsoft vs. Linux
http://ngrams.googlelabs.com/graph?content=Microsoft%2C+Linux&year_start=1860&year_end=1920&corpus=0&smoothing=3
Here’s a link to one of the references for “Microsoft” in 1900.
You can see it is in the upper left on the top line. The word on the page is actually “Microscope”, not “Microsoft”. It is weird that Microsoft and Linux have similar patterns, even though Google was reading the term “Microsoft” incorrectly.
The flatness of “skeptic” in the first graph isn’t quite so clear when you add the hits for “sceptic” as well.
Skeptic versus Sceptic
Penis vs Vagina (I dunno, I just did, ok?)
http://ngrams.googlelabs.com/graph?content=Vagina,+Penis&year_start=1800&year_end=2000&corpus=0&smoothing=3
There is a HUGE spike from about 1810-1830, can anyone fill me in on what that would be caused by?
Check out the effects of the enlightenment.
One more interesting chart showing a negative correlation between the word “control” and “God”. Makes me think of the studies showing a correlation between loss of control and elevated superstition.
“seems that people are becoming more skeptical over time”
Oh really? Are you really hinting at this conclusion from this data? So let’s see, the word “buccaneer” peaked in the 1930’s, therefore it seems more people went into piracy during that time. Unfortunately the profession has been declining since then, thus increasing global warming.
I found almost all the early data unreliable – partly because of the poor OCR software, which makes some basic errors. For example, before 1800 the long s (ſ) was commonly printed at the start and in the middle of words, and the OCR usually renders it as an f – so if you type fuck in, it makes the early folks seem quite raunchy, proportionately. But when you go and look at the texts that the program is using, you see how the OCR thinks that the bee fucks up the honey from the flower.
When you look at the original texts, some of them appear to consist entirely of OCR errors! So don’t trust the data much, and don’t trust anything before 1800.
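One rough way to spot this kind of misreading is to check whether a suspicious word turns into a known word once some of its f's are swapped back to s. The sketch below is only an illustration of that check: the function name `long_s_readings` and the tiny vocabulary are made up, it handles lowercase input only, and a real test would need a proper period dictionary.

```python
from itertools import product

def long_s_readings(token, vocabulary):
    """Return vocabulary words reachable by turning one or more 'f' in `token` into 's'."""
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    readings = set()
    # Try every combination of keeping 'f' or swapping in 's' at each 'f' position.
    for choice in product("fs", repeat=len(positions)):
        chars = list(token)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        candidate = "".join(chars)
        if candidate != token and candidate in vocabulary:
            readings.add(candidate)
    return sorted(readings)

# Invented mini-vocabulary, for illustration only.
toy_vocabulary = {"suck", "himself", "first", "flower"}

for word in ["fuck", "himfelf", "firft", "flower"]:
    print(word, long_s_readings(word, toy_vocabulary))
```

A word that comes back with a plausible s-reading, in a text printed before 1800, is a good candidate for a long-s error rather than a genuine hit.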