“I cannot live without books” – Thomas Jefferson
One of the great things about Project Gutenberg, Bartleby, and other sites that collect great works of literature and provide them on the internet for free is that they make it wonderfully easy to derive data sets.
I know, I know, I should be pilloried for not saying how grateful I am that the wisdom of Marcus Aurelius Antoninus is now available to me at the click of a mouse. I should cover myself in shame that I’m not trumpeting the availability of the beauty of lines from Wordsworth. But the truth of the matter is that ever since we started arguing about the authorship of the Federalist Papers and Shakespeare plays, literature has been a great way to learn about the power of statistics.
So let’s try something easy, because that’s how you build confidence.
One of the first things that we want to do with data is graph it to see if it contains things we don’t want. If we copy I Wandered Lonely as a Cloud straight out of Bartleby, we can paste it into Minitab without delimiters. Then we calculate the number of characters in each line.
The very first thing you’ll see is that there are 3 lines with no characters, and 2 lines with over 60 characters. The zeroes are clearly line breaks, but what about the long ones? Maybe they’re where Wordsworth had a lot to say? It turns out they’re not; they’re the lines where the line numbers are written. If we’re really interested in the characters per line, it probably makes sense to clean up this data.
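If you don’t have Minitab handy, the same check can be sketched in Python. This is only a rough illustration, not the Minitab steps themselves. The sample below is a few lines of the poem, and the padded "5" at the end of one line is my assumption about how a line-number column might come along in a paste; the real layout on Bartleby may differ. The names line_lengths and clean are made up for this example.

```python
import re

# Hypothetical paste: a few lines of the poem. The spaces and "5" after the
# fifth line are an assumed stand-in for a pasted line-number column.
sample = [
    "I wandered lonely as a cloud",
    "That floats on high o'er vales and hills,",
    "When all at once I saw a crowd,",
    "A host, of golden daffodils;",
    "Beside the lake, beneath the trees," + " " * 30 + "5",
    "Fluttering and dancing in the breeze.",
    "",
    "Continuous as the stars that shine",
]


def line_lengths(lines):
    """Return the number of characters in each line."""
    return [len(line) for line in lines]


def clean(lines):
    """Strip a trailing line number (two or more spaces, then digits)
    and drop lines that are empty after stripping."""
    stripped = [re.sub(r"\s{2,}\d+\s*$", "", line) for line in lines]
    return [line for line in stripped if line.strip()]


if __name__ == "__main__":
    # A crude text "graph": one bar per line, so oddities stand out.
    for n in line_lengths(sample):
        print(f"{n:3d} {'#' * n}")

    print()
    for n in line_lengths(clean(sample)):
        print(f"{n:3d} {'#' * n}")
```

In the first plot the blank line and the padded line jump out right away; in the second, after cleaning, they are gone and what is left is the thing we actually wanted to measure.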
That’s the lesson for the day too: look at the data, and make sure that it represents what you really want to know. Graphing is an easy way to do that. And literature is an easy way to collect some data to teach yourself that lesson.