So far, we’ve had three different ideas for figuring out the author of an unknown paper (top n word ordering in Part 1 and stop word frequency / 4-grams in Part 2). Here’s something interesting though from the comments on the Programming Praxis post:
July 19, 2013 at 12:29 PM
Patrick Juola has a guest post on Language Log describing the approach he took.
If you read through it, though, there are some interesting points. Essentially, they worked with only a single known novel from JK Rowling (which I don’t have): The Casual Vacancy. Theoretically that will help, since it’s more likely to be similar in style than the Harry Potter books, although that seems to defeat the idea of an author having a universal writing style. Other than that, they analyzed three other British crime authors. So the works they used are completely different from the ones I used.
Another change is that, rather than analyzing each document as a whole, they broke each text into 1000-word chunks. This should theoretically help with outliers and somewhat offset the fact that they had a significantly smaller library. On the other hand, as the author talks about more and more features, I can’t help but feel that they chose features specifically to out JK Rowling as the author… In their results, they have a 6/5 split favoring JK Rowling for one test, a 4/4/3 favoring her for another, and an 8/3 against JK Rowling in the third. So it’s still not a particularly solid conclusion…
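To make the chunking idea concrete, here is a minimal sketch of splitting a text into 1000-word chunks. It is only an illustration under my own assumptions, not the code used in the Language Log analysis: `read-words` and `chunk-words` are hypothetical helper names, and a "word" is simply any whitespace-separated token.

```racket
#lang racket

; Read every whitespace-separated word from an input port (hypothetical helper)
(define (read-words [in (current-input-port)])
  (for*/list ([line (in-lines in)]
              [word (in-list (string-split line))])
    word))

; Split a list of words into consecutive chunks of at most `size` words
; (hypothetical helper); the final chunk may be shorter than `size`
(define (chunk-words words [size 1000])
  (cond
    [(null? words) '()]
    [(<= (length words) size) (list words)]
    [else
     (let-values ([(head tail) (split-at words size)])
       (cons head (chunk-words tail size)))]))
```

Each chunk could then be scored separately, for example with `(chunk-words (with-input-from-file "Cuckoo's Calling.txt" read-words))`, and the per-chunk results tallied into a vote.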
In any case, I still do have one more test to run based on their article (if you’ve read the source on GitHub you’ve probably already seen this): word length. That’s actually the most successful of their tests, with the 6/5 split, so theoretically it might work out better?
At this point, the code is almost trivial to write. Basically, we’ll take the previous top n word code and cut out most of it:
    ; Count how many words of each length appear in a text
    (define (word-lengths [in (current-input-port)])
      ; Store the length counts
      (define counts (make-hash))

      ; Count the length of every whitespace-separated word in the text
      (for* ([line (in-lines in)]
             [word (in-list (string-split line))])
        (define len (string-length word))
        (hash-set! counts len (add1 (hash-ref counts len 0))))

      counts)
And that’s all there is to it. Yes, it does treat punctuation as part of word length. That’s perfectly expected (if not entirely optimal). Let’s see it in action:
    > (define cc-hash (with-input-from-file "Cuckoo's Calling.txt" word-lengths))
    > (for/list ([i (in-range 1 10)]) (hash-ref cc-hash i 0))
    '(0.031 0.134 0.222 0.173 0.122 0.103 0.083 0.053 0.033)
    > (define dh-hash (with-input-from-file "Deathly Hallows.txt" word-lengths))
    > (for/list ([i (in-range 1 10)]) (hash-ref dh-hash i 0))
    '(0.045 0.138 0.211 0.174 0.128 0.100 0.078 0.051 0.034)
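One caveat: the numbers above are relative frequencies, while `word-lengths` as listed returns raw counts. A normalization step along the following lines would bridge that gap. This is a sketch; `normalize-counts` is a hypothetical name rather than something taken from the original source, and it does not reproduce the rounding to three decimals shown above.

```racket
#lang racket

; Convert a hash of raw counts into a hash of relative frequencies
; (hypothetical helper, not part of the original code)
(define (normalize-counts counts)
  ; Total number of words counted across all lengths
  (define total (apply + (hash-values counts)))
  ; Divide each count by the total; an empty hash simply stays empty
  (for/hash ([(len count) (in-hash counts)])
    (values len (exact->inexact (/ count total)))))
```

It accepts the mutable hash that `word-lengths` builds and returns an immutable one, so `hash-ref` works on the result the same way.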
To the naked eye, those look pretty dang similar. In fact, they are:
    > (cosine-similarity cc-hash dh-hash)
    0.964
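`cosine-similarity` isn’t defined in this post (presumably it carries over from the earlier parts). For reference, here is a minimal sketch of how such a function could work on hashes like these. The name `cosine-similarity/sketch` and the details are my assumptions for illustration and may differ from the version on GitHub.

```racket
#lang racket

; Cosine similarity between two hashes that map keys to numbers
; (illustrative sketch); keys missing from one hash count as 0
(define (cosine-similarity/sketch a b)
  ; Dot product over the keys of a; keys only in b contribute 0
  (define dot
    (for/sum ([(key value) (in-hash a)])
      (* value (hash-ref b key 0))))
  ; Euclidean length of a hash's values
  (define (magnitude h)
    (sqrt (for/sum ([value (in-hash-values h)])
            (* value value))))
  (define denom (* (magnitude a) (magnitude b)))
  ; Avoid dividing by zero when either hash is empty or all zeros
  (if (zero? denom) 0 (/ dot denom)))
```

Because cosine similarity ignores overall scale, raw counts and relative frequencies give the same score, which is also why books of very different lengths can be compared directly.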
That looks good, but does it work overall? Here are the by-book results:
| Rank | Similarity | Author | Title |
|---|---|---|---|
| 1 | 0.979 | Jordan, Robert | The Path of Daggers |
| 2 | 0.975 | Jordan, Robert | Knife of Dreams |
| 3 | 0.973 | Jordan, Robert | Crossroads of Twilight |
| 4 | 0.972 | Croggon, Alison | The Gift |
| 5 | 0.971 | Pratchett, Terry | Equal Rites |
| 6 | 0.970 | Butcher, Jim | Furies of Calderon |
| 7 | 0.970 | Jordan, Robert | A Crown of Swords |
| 8 | 0.969 | Jordan, Robert | Winter’s Heart |
| 9 | 0.968 | Jordan, Robert | The Gathering Storm |
| 10 | 0.967 | Jordan, Robert | A Memory of Light |
Not so good… It turns out that these numbers are similar across pretty much all English language texts. The lowest score for any book I have is still 0.902. Perhaps the by-author results will do better:
| Rank | Similarity | Author |
|---|---|---|
| 3 | 0.959 | Robinson, Kim Stanley |
That’s not so bad at least. There are a fair number of authors further down the list (I don’t know if I’ve mentioned this, but I have 35 authors and over 200 books in my sample set). But it’s still not perfect. It’s still pretty interesting to see though.
Well, I think that about wraps up this series. It’s about how I found it when doing my undergraduate research: interesting, and you can get some neat results, but ultimately you have to do a fair bit of tuning before they mean much. I hope you found it as interesting as I did.
As always, if you’d like to see the full source code for this post, you can find it on GitHub: authorship attribution