Friday, December 17, 2010

Google Books Ngrams and the number of words for "snow"

As I mentioned yesterday, Google has put out a big data set (downloadable) and a handy interface for tracking the incidence of words and phrases. As many have pointed out, one can do a lot more with the raw data set than with the handy, handy online tool, but it's the latter that the New York Times called "a diversion that can quickly become as addictive as the habit-forming game Angry Birds."
(I've never heard of Angry Birds, but that's the kind of thing I'm likely to be out of the loop on, so okay.)

I said yesterday that Google Books Ngrams was a lot more sophisticated than Googlefight, and it is. But I'm troubled by the model of cheap history that's presented in the NYT article--as if to suggest that if you want to do cultural studies now, all you need to do is Google (Books Ngram) it:
With a click you can see that “women,” in comparison with “men,” is rarely mentioned until the early 1970s, when feminism gained a foothold. The lines eventually cross paths about 1986.

You can also learn that Mickey Mouse and Marilyn Monroe don’t get nearly as much attention in print as Jimmy Carter; compare the many more references in English than in Chinese to “Tiananmen Square” after 1989; or follow the ascent of “grilling” from the late 1990s until it outpaced “roasting” and “frying” in 2004.

“The goal is to give an 8-year-old the ability to browse cultural trends throughout history, as recorded in books,” said Erez Lieberman Aiden, a junior fellow at the Society of Fellows at Harvard.
I will concede that newspaper articles are necessarily glib, but it's easy to see how the fallacy that this article promotes would be broadly accepted. The first quoted paragraph above correlates the incidence of words with known historical events; the second moves on to suggest the ngrams' predictive capacity. There's a narrative implicit in each statement of "just the facts"; it's only that the assumptions that go into them are effaced.

Let's look at the first of these reports: "With a click you can see that “women,” in comparison with “men,” is rarely mentioned until the early 1970s, when feminism gained a foothold."

The implicit narrative is that nobody even bothered to talk about women until second-wave feminism came along. In fact, if you go by the incidence of the words "men" and "women" in the Google Books Ngrams data set, sure, you might be tempted to really believe that the 1970s was the time "when feminism gained a foothold." I can imagine the suffragists, who fought for and won the franchise that I as a woman can enjoy annually, asking, "What are we, chopped liver?"

What distinguishes the feminist movements of the 1970s, for the purposes of this data set, is their renewed attention to language. The suffragists wanted a policy change: they wanted the vote (and the freedoms that the vote could give them). The second-wave feminists wanted policy changes too (still working on that wage gap, people!) but they also wanted a deeper change: they wanted to change the way we thought about women and--here's the kicker--spoke about women. The 1970s is when it became broadly recognized as problematic to treat "man" as a synonym for "person," and I suspect that a significant percentage of the uses of "men" were and remain the "universal" usage. That's a nuance that the online Ngrams tool can't give you ("with a click").
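For the programmatically inclined, here is a minimal sketch of that point, not a description of how Google's tool works. The four sentences and their sense labels ("generic" for the universal usage, "gendered" for men as opposed to women) are invented for illustration, and the function names are hypothetical; the Ngrams data carries no such labels, which is exactly the problem.

```python
from collections import Counter

# Hypothetical, hand-labeled toy data: (sentence, sense in which "men" is used).
# The labels are made up for this illustration; a frequency chart never sees them.
SENTENCES = [
    ("all men are created equal", "generic"),
    ("the rights of men and nations", "generic"),
    ("men and women voted in the election", "gendered"),
    ("the women organized a march", None),
]


def raw_count(word):
    """Count a word the way a frequency chart does: by spelling alone."""
    counts = Counter()
    for sentence, _sense in SENTENCES:
        counts.update(sentence.split())
    return counts[word]


def count_by_sense(word):
    """Split a word's occurrences by the hand-assigned sense label."""
    senses = Counter()
    for sentence, sense in SENTENCES:
        occurrences = sentence.split().count(word)
        if occurrences:
            senses[sense] += occurrences
    return dict(senses)


if __name__ == "__main__":
    print(raw_count("men"), raw_count("women"))
    print(count_by_sense("men"))
```

On this toy input, the raw tally puts "men" ahead of "women" three to two, but only one of those three uses is about men as opposed to women. Telling the senses apart took a reader, not a click.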

Likewise, if you got your understanding of history through Google Books Ngrams, you wouldn't expect to hear this from 1929:
Have you any notion of how many books are written about women in the course of one year? Have you any notion how many are written by men? Are you aware that you are, perhaps, the most discussed animal in the universe? Here had I come with a notebook and a pencil proposing to spend a morning reading, supposing that at the end of the morning I should have transferred the truth to my notebook. But I should need to be a herd of elephants, I thought, and a wilderness of spiders, desperately referring to the animals that are reputed longest lived and most multitudinously eyed, to cope with all this. I should need claws of steel and beak of brass even to penetrate the husk. How shall I ever find the grains of truth embedded in all this mass of paper, I asked myself, and in despair began running my eye up and down the long list of titles. Even the names of the books gave me food for thought. Sex and its nature might well attract doctors and biologists; but what was surprising and difficult of explanation was the fact that sex--woman, that is to say--also attracts agreeable essayists, light-fingered novelists, young men who have taken the M.A. degree; men who have taken no degree; men who have no apparent qualification save that they are not women. (27)
That's Virginia Woolf, of course, giving a fictionalized, subjective encounter with the British Library. Yes, it's a bit longer than a sentence, and you have to read it; you can't just click! But it gives you much more women's history than does the Google Books Ngrams example cited by the NYT.

Google Books Ngrams is a fun tool (as everyone keeps pointing out) and, if you download the data set, even a useful one. But it can only get you so far, and uncontextualized, it encourages assumptions that it does not announce. I mention the number of words for "snow" in my title above because it's a famous fallacy--the notion that Inuit has [insert high number here] words for snow, always with the implicit suggestion that having a lot of words for something means that something is extremely important to the culture. Language Log uses this as their go-to example of stupid assertions about language widely believed by the public; it's a cheap Whorfism, claiming broad cultural significance for something incidental. We have a widely accepted term for a magical being that flies by night and runs a clandestine cash-for-baby-teeth operation. That doesn't make it central to American culture. ("Mom, is the Tooth Fairy real?" "Yes! Check Google Books Ngrams if you don't believe me!")

There's a certain Words For Snowism in the online Google Books Ngrams tool, the suggestion that the more frequently a word is used, the more important it is in a collective unconscious of which the Google Books data set serves as a convenient index. This importance is not the same thing as significance, in the sense of significant digits or statistical significance; it's not the difference that makes a difference, but rather a psychologized importance--attachment, cathexis. Which is really kind of garbage.

The web interface is, as my friend Will says, a toy. For the serious scholar, there's much more to be done with ngrams, and one can be careful as well as lazy with the conclusions one draws. But the toy has a "boom! proven with statistics!" quality, a reality-effect that's enormously pleasurable, even, as Patricia Cohen writes for the NYT, "addictive." (That's the point of toys, isn't it?) That's why I'm inclined to agree with Jen Howard, who writes that her "skepticism is mostly directed at how people will use it and what kinds of conclusions they will jump to on dubious evidence." That sort of jumping is practically built into the ngrams tool.

Woolf, Virginia. A Room of One's Own. Annot. and introd. Susan Gubar. 1929; Orlando: Harcourt, 2005. Print.

Previously on text-mining:
Dec. 16, 2010
Dec. 14, 2010
Google's automatic writing and the gendering of birds


SEB said...

Oh man. My first reaction was that the Chinese word for the Tian'anmen Square incident of 1989 doesn't include the Chinese words "Tian'anmen Square."

Natalia said...

Fans of things that make sense will also be irritated to learn that GBN reads the long S as an F.

The more you know.

skg said...

Angry Birds: casual game played on smartphones. Not very exciting.

Natalia said...

Heh. Thanks, S.

Sean Jacobs said...

My five-year-old plays Angry Birds incessantly on my iPhone.

Natalia said...

If you show your kid the Google Books Ngrams Viewer, maybe you can get your phone back.