Sunday, December 19, 2010

A Supposedly Fun Thing: Text-Mining and the Amusement/Knowledge System; or, the Epistemological Sentimentalists

If we could text-mine the internets of the last few days for the correlation between the words "n-gram" and "fun," I'm sure we'd get a nontrivial number. One of the most striking things about the reception of the Google Books Ngrams, largely in the form of the web tool, is the giddy delight with which people have announced how much fun it is. Exhibit A is the bit I quoted yesterday from Patricia Cohen at the New York Times:
The intended audience is scholarly, but a simple online tool allows anyone with a computer to plug in a string of up to five words and see a graph that charts the phrase’s use over time — a diversion that can quickly become as addictive as the habit-forming game Angry Birds.
But that's just one example--the fun of the Google Books Ngrams tool is almost universally noted. See, for instance, "Fun With Google's Ngram Viewer" (Mother Jones), "Fun with Google NGram Viewer" (WSJ), and "BRB, Can't Stop NGraming" (The Awl). And Dorothea Salo tweets,
What I like about the GBooks n-grams is seeing all kinds of people playing with it. Just playing. THAT, friends, is how one learns.

The prevalence of this language of play raises two questions.

1. What rhetorical work is this move (calling Google Books Ngrams a fun toy) doing?

2. What experiential dimension of Google Books Ngrams does this rhetorical move describe, and what does it tell us about the tool's epistemic significance?

Answering the first question feeds into answering the second. To call the Google Books Ngrams web tool (henceforth "GBN") a fun toy is to hedge one's bets, to express approval without necessarily venturing into the higher-stakes terrain of approving it as a research method. Any assessment of the tool's epistemic value is channeled through an expression of pleasure (or, as Patricia Cohen and The Awl's Choire Sicha rather interestingly suggest, compulsion). Play can of course be a form of learning, and a very important one--that's what Dorothea Salo's tweet indicates. But play is a good learning environment precisely because the stakes are low and mistakes can be made safely, as a comment by Bill Flesch suggests: "I played around with it for about half an hour. Now I'm bored." New toy, please! With respect to knowledge, the language of play is deeply ambivalent.

As I read it, the universal declaration of fun that has surrounded the release of GBN is as much about guilt as about pleasure. Those who are compulsively "ngraming," as Sicha so amusingly puts it, are often all too aware of GBN's limitations, which have been blogged about extensively, all the way down to what, as Natalie Binder points out in her much-retweeted post, has to underlie the whole operation: inevitably imperfect OCR.*

Why does the GBN web tool even exist? Not to advance knowledge, I think, or at least not directly, but because it's fun, and because that fun directs interest toward the more substantive element of the project: the downloadable data set that relatively few people are actually going to download.

There are huge problems with using GBN (and throughout I'm referring to the web tool/toy that everybody is saying is so much fun) as any sort of meaningful index of culture, and everyone knows it. And yet.

I would argue that the universal declaration of fun is a form of confession: I am deriving epistemological satisfaction from this unsound tool, with its built-in Words for Snowism. It's a guilty pleasure, epistemic candy: the sensation of knowledge, lacking in any nutritional value.

But the guilt goes rather deeper than the simple tension between GBN's unreliability for actual research and the "gee whiz!" quality of the graphs: GBN is fun because it is so limited.

That great scholar of nineteenth-century culture, Walter Benjamin, described a mode of writing that he called "information."
Villemessant, the founder of Le Figaro, characterized the nature of information in a famous formulation. 'To my readers,' he used to say, 'an attic fire in the Latin Quarter [Paris] is more important than a revolution in Madrid.' This makes strikingly clear that what gets the readiest hearing is no longer intelligence coming from afar, but the information which supplies a handle for what is nearest. Intelligence that came from afar--whether over spatial distance (from foreign countries) or temporal (from tradition)--possessed an authority which gave it validity, even when it was not subject to verification. Information, however, lays claim to prompt verifiability. The prime requirement is that it appear 'understandable in itself.' (147, emphasis added)
What GBN delivers is information in this sense. It is near at hand, easy to use, and puts out a nice visualization that appears "understandable in itself." It's easy to deliver, in that way, not unlike a pizza. It's no good to point out, as Mark Davies does, that the Corpus of Historical American English (COHA) allows one to look at specific syntactic forms, or include related words, or track usages by the genre of the source. Such capacities only raise anxieties. (For example, what gets tagged as "nonfiction"? Where, for instance, do autobiographies go? I once, to my astonishment, saw The Autobiography of Alice B. Toklas in the nonfiction section of a book store--along with Three Lives! But I digress.)

As soon as we raise such questions, the graph stops being "understandable in itself," stops being information. Conversely, when you aren't given the choice to sort by genre, how genres are defined necessarily stops being a question. It's the very fact that the toy is a black box and a blunt instrument that makes it feel immediate and incontrovertible and, in that very satisfying way, obvious. We get the epistemic satisfaction of information, and the thing that gives it to us is precisely that information's lack of nuance.

Yesterday I used the word "cheap" to describe the kind of historical narratives GBN suggests. There is indeed a kind of economic dimension to the satisfaction that GBN delivers. Of Oscar Wilde's many quotable lines, I am reminded of this one:
The fact is that you were, and are I suppose still, a typical sentimentalist. For a sentimentalist is simply one who desires to have the luxury of an emotion without paying for it. (768)
Feeling, Wilde suggests, has to be earned.** Bracketing the question of whether this is a good description of sentimentalism, it's a good analogue for the epistemic candy of GBN. One receives the apparent solidity of research--the nice graph that summarizes and visualizes what might otherwise be years of labor in the making--without having actually done any research. This is only a cheap thrill, "fun," when it is actually cheap--that is, when we don't inquire into how the corpus was prepared, or what effects GBN's case-sensitivity is having on our results.

The analogy to sentimentalism is useful not only because it gives us a model for understanding the economy of feeling here, but also because it allows us to recognize that there is an element of feeling in the way that we encounter information. We are likely to find it ethically reprehensible when our emotions or what we believe we know are manipulated. And yet there are times when we want the cheap thrill. Most people I know will freely cop to liking a good emotionally manipulative movie or novel, whether a thriller or a romance or one of those movies where the dog dies. As the fun of ngrams demonstrates, we like a little intellectual manipulation too.

(I know, I know, it doesn't tell you anything conclusively, but...try Foucault versus Habermas!)

What does it mean, this liking it?

I mentioned Bill Brown's term, the "amusement/knowledge system," in my title above because it's another, perhaps more explicit way of describing the close interweaving of knowledge and fun at the end of the nineteenth century that so fascinated Benjamin (208). In my own work I have tried to make a case for taking seriously both the knowledge and the amusement in that system, notably in naturalist fiction, because it's often in such liminal places that the terms of what counts as knowledge are most at stake. Part of the reason experimental literature seems to be here to stay is that the amusement/knowledge system is, too.

The point is not to condemn fun as something that has no place in knowledge--far from it. Fun is central to how we vet knowledge--just think of how important it is that research be "interesting"! It is our highest (and also most common) praise.*** Indeed, play lies at the heart of our most cherished models of intellectual inquiry--a nonutilitarian curiosity to "see what happens." As I quoted Dorothea Salo at the beginning of this post: "THAT, friends, is how one learns."

So condemning fun is not at all on my agenda. Rather, I want to draw attention to the emotional content of the way we talk about knowledge, and to the ambivalence that intellectual "fun" signifies. Ours is an age of "news junkies" (again with the pleasure bordering on unpleasurable compulsion, à la the "addictive" ngrams) and "armchair policy wonks" and people who read voraciously, but only in the proverbial dubiously defined "nonfiction" category. Nate Silver and the Freakonomics dudes are minor celebrities. Lies, damned lies, and statistics are our idea of fun, as powerfully as a Victorian melodrama was ever considered fun. Which means we need to think much more about how fun operates, and why, and what that means for knowledge. And just as crucially: what knowledge means for pleasure.

*In fairness, Ben Schmidt argues that GBN's OCR is pretty accurate, given the state of the field, and also that "No one is in a position to be holier-than-thou about metadata. We all live in a sub-development of glass houses." But there's a big difference between "this is really good, for OCR" and "this degree of accuracy is good enough for supplying evidence for X kinds of claims."

**Taken out of context, Wilde appears here to be describing sentimentalism through an economic metaphor. In fact, it's rather the reverse, or at the very least something more confused than that: most of the surrounding text is taken up with Wilde chastising Douglas for his financial mooching.

***As Sianne Ngai points out, the "interesting," like the language of play, has a hedging quality, bridging epistemological and aesthetic domains.

Benjamin, Walter. "The Storyteller: Observations on the Works of Nikolai Leskov." Trans. Harry Zohn. Selected Writings: Volume 3, 1935-1938. Ed. Howard Eiland and Michael W. Jennings. Cambridge, Mass.: Belknap-Harvard UP, 2002. Print.

Brown, Bill. The Material Unconscious: American Amusement, Stephen Crane, and the Economies of Play. Cambridge, Mass.: Harvard UP, 1996. Print.

Ngai, Sianne. "Merely Interesting." Critical Inquiry 34.4 (Summer 2008): 777-817. Print.

Wilde, Oscar. "To Alfred Douglas." Jan.-Mar. 1897. The Complete Letters of Oscar Wilde. Ed. Merlin Holland and Rupert Hart-Davis. New York: Henry Holt, 2000. Print.

Previously on text-mining:
Google Books Ngrams and the number of words for "snow"
Dec. 16, 2010
Dec. 14, 2010
Google's automatic writing and the gendering of birds


Natalie Binder said...

I've enjoyed "playing" with ngrams too, but the more I work with it, the less I see it as "information pizza." I think people are finding it difficult to interpret the graphs. There are no titles, instructions, or labels on the graphs, not even on the axes. And it's very unclear what (if anything) these graphs are supposed to measure. At some point, the frustration outweighs the fun.

The Harvard-Google team was very bold when they titled their paper "Quantitative Analysis of Culture." I find the whole idea slightly creepy, since no humanities scholars were involved in the project.

Natalia said...

I probably should have said up front that this post builds on a previous one. It's very difficult to interpret these data in a way that would feel rigorous, yes. But there are various unsatisfactory interpretations that float to the surface immediately, always implicit in the tool itself. This is why information is "shot through with explanations," as Benjamin puts it: it relies on implicit warrants, usually beliefs we are already inclined to hold.

Most saliently, in this particular case, there's the hidden proposition that a high word frequency amounts to cultural importance, what I called in my last post "Words for Snowism." It's that implicit narrative that's leading people to link graphs and say, "isn't this interesting," before backing off and admitting, "well, of course it doesn't mean...." It's, you know. Fun.

But yes, if you're actually trying to do research and not stage a fake Foucault-Habermas deathmatch, it's very frustrating, too.