Saturday, February 14, 2009

Wordle Word Clouds

There is a cool site called Wordle. Wordle makes word clouds out of text. The more often a word appears in the text, the larger it is in the cloud. I've made some interesting ones with large source texts.
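For the curious, here is a rough sketch of the basic idea: count the words, then scale each count to a font size. I don't know how Wordle actually computes its sizes, so the linear scaling, the function name `word_sizes`, and the size limits below are all my own assumptions, not Wordle's method.

```python
import re
from collections import Counter

def word_sizes(text, min_size=10, max_size=72, top_n=50):
    """Map the most frequent words in `text` to font sizes (hypothetical helper).

    Assumption: sizes grow linearly with word count. Wordle's real
    scaling is not documented in this post.
    """
    # Lowercase the text and pull out runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common(top_n)
    if not counts:
        return {}
    highest = counts[0][1]
    lowest = counts[-1][1]
    # Avoid dividing by zero when every word appears equally often.
    span = max(highest - lowest, 1)
    return {
        word: min_size + (count - lowest) * (max_size - min_size) / span
        for word, count in counts
    }

sizes = word_sizes("the cat and the dog and the bird")
```

In this toy sentence, "the" appears most often and gets the largest size, while the words that appear once get the smallest.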

The first exercise was to compare word use by various presidents. I input collections of speeches from George Washington (11 speeches), Thomas Jefferson (12 speeches), Abraham Lincoln (9 speeches), Richard Nixon (10 speeches), George W. Bush (12 speeches), and Barack Obama (10 speeches). They are mostly State of the Union addresses, inaugural addresses, and nomination acceptance speeches. For Obama, I used a lot of campaign speeches, so keep that in mind when you look at the word cloud. (This explains, for instance, why "McCain" is one of the larger words.)

I would be interested to hear your observations. Here are some of mine. Earlier presidents called the country "United States," while more recent presidents call it "America." Jefferson used neither. Washington, Jefferson, and Lincoln were fond of the word "may," whereas Nixon used "let," and Bush and Obama use "must." The word "government" also appears more often in the pre-Bush speeches. Here are the word clouds (click for larger):



Next, I wanted to see how large a text the system could handle. I tried the entire Bible, and couldn't get it to work. After I broke it down into the Old and New Testaments, it found those chunks digestible. The program is supposed to ignore common words, such as "the," "a," "but," and so on, but the common words in King James-style English (an old-fashioned style of speech even when written, by the way) are different, so we see a lot of "thy" and "hast." (click for larger)
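To see why the archaic words slip through, here is a small sketch of common-word filtering. The two word lists and the helper `top_words` are made up for illustration (I don't have Wordle's actual list), and the sample sentence is my own imitation of the style, not a quotation.

```python
import re
from collections import Counter

# Hypothetical stop lists; Wordle's real list is not given in this post.
MODERN_STOP_WORDS = {"the", "a", "an", "and", "but", "of", "to", "in", "is", "that"}
ARCHAIC_STOP_WORDS = {"thy", "thou", "thee", "hast", "hath", "unto", "ye"}

def top_words(text, stop_words, n=3):
    """Count words in `text`, skipping anything in `stop_words`."""
    words = re.findall(r"[a-z']+", text.lower())
    kept = [w for w in words if w not in stop_words]
    return Counter(kept).most_common(n)

# Invented sample in a King James-like style (not an actual verse).
sample = "Thou hast the word, and thy word is truth; thou hast thy reward."

# A modern-only list lets "thou", "hast", and "thy" through.
modern_only = top_words(sample, MODERN_STOP_WORDS, n=10)

# Adding the archaic words leaves only the content words.
with_archaic = top_words(sample, MODERN_STOP_WORDS | ARCHAIC_STOP_WORDS, n=10)
```

With only the modern list, "thou," "hast," and "thy" each count as much as "word" does. Once the archaic list is added, "word" stands alone at the top.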


I tried a novel next, Mark Twain's "Huckleberry Finn." I picked Finn because it is noted for being written in a vernacular style, the opposite of the Bible. What words was Twain fond of? (What word were Twain fond of?) (click for larger)


I next did Charles Darwin's "Origin of Species," in honor of his 200th birthday. (click for larger)

I showed my results to some friends, and one of them replied that he'd imagine a word cloud of Noam Chomsky talks would consist mainly of

"East Timor"
"Uncontroversial"
hand motions
"I hate America"

To test this hypothesis, I input 28 Chomsky talks, totaling over 200 pages (found at chomsky.info), and got this result (click for larger):


So word clouds are a neat way to display information.

