Authorship attribution: Part 2

Last time, we used word rank to try to figure out who could possibly have written Cuckoo’s Calling. It didn’t work out so well, but we at least have a nice framework in place. So perhaps we can try a few more ways of turning entire novels into a few numbers.

Rather than word rank, how about stop word frequency? Stop words are small words, such as articles and prepositions, that don’t carry much weight for a sentence’s meaning. On the other hand, those are exactly the words that appear most often, so perhaps their frequencies will tell us something more.

The code is actually rather similar. To start out with, we want to load in a set of stop words. There are dozens of lists out there; any of them will work.

; Load the stop words
(define stop-words
  (with-input-from-file "stop-words.txt"
    (lambda ()
      (for/set ([line (in-lines)]) (fix-word line)))))
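
The fix-word helper isn’t defined in this post; presumably it comes along with the framework from last time. As a rough idea of what such a normalizer might do, here is a minimal sketch. The name fix-word-sketch and the exact rules (lowercase, keep only letters) are my assumptions, not necessarily what the real helper does:

```racket
#lang racket

; Hypothetical stand-in for fix-word (an assumption, not the original helper):
; lowercase the word and keep only its alphabetic characters
(define (fix-word-sketch word)
  (list->string
   (for/list ([c (in-string (string-downcase word))]
              #:when (char-alphabetic? c))
     c)))

; Punctuation and case no longer matter when matching against the stop word set
(fix-word-sketch "The,")
```

Whatever the real rules are, the important part is that the same normalization is applied to both the stop word list and the text, so the set lookup compares like with like.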

With that, we just need to count the occurrences of each stop word and then normalize by the total number of stop words seen. Normalizing keeps the numbers comparable when some books have more or fewer stop words overall.

; Calculate the relative frequencies of stop words in a text
(define (stop-word-frequency [in (current-input-port)])
  ; Store the frequency and word count
  (define counts (make-hash))
  (define total (make-parameter 0.0))

  ; Loop across the input text
  (for* ([line (in-lines in)]
         [word (in-list (string-split line))])
    (define fixed (fix-word word))

    ; Count under the normalized form so the lookup below finds it
    (when (set-member? stop-words fixed)
      (total (+ (total) 1))
      (hash-set! counts fixed (add1 (hash-ref counts fixed 0)))))

  ; Normalize and return frequencies
  ; Iterating over the same stop word set keeps the order consistent between texts
  (for/vector ([word (in-set stop-words)])
    (/ (hash-ref counts word 0)
       (total))))

Using Cuckoo’s Calling and a particular list with 50 words, we have:

> (with-input-from-file "../target.txt" stop-word-frequency)
'#(0.043 0.003 0.000 0.002 0.026
   0.001 0.000 0.000 0.000 0.000)

Well, that doesn’t mean much to us. Hopefully it means more to the computer. 😄
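
What the computer does with a vector like that is compare it against the vector for another book. The cosine-similarity function used below comes from the earlier code and isn’t repeated here, so this is only a minimal sketch of the idea, assuming both inputs are numeric vectors of the same length. The name cosine-similarity-sketch is mine:

```racket
#lang racket

; Sketch of cosine similarity between two equal-length numeric vectors
; (assumed to be roughly what cosine-similarity computes; not the original code)
(define (cosine-similarity-sketch a b)
  ; Dot product of the two vectors
  (define dot (for/sum ([x (in-vector a)] [y (in-vector b)]) (* x y)))
  ; Euclidean length of a vector
  (define (norm v) (sqrt (for/sum ([x (in-vector v)]) (* x x))))
  (define denom (* (norm a) (norm b)))
  ; Treat an all-zero vector as having no similarity to anything
  (if (zero? denom) 0.0 (/ dot denom)))
```

Two vectors pointing in the same direction score close to 1 no matter how long they are, and two with nothing in common score 0.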

So how similar does this one say Cuckoo’s Calling and the Deathly Hallows are?

> (let ([a (with-input-from-file "Cuckoo's Calling.txt" stop-word-frequency)]
        [b (with-input-from-file "Deathly Hallows.txt" stop-word-frequency)])
    (cosine-similarity a b))

That’s a lot higher! Unfortunately, that doesn’t really mean that they’re more similar than the other test. For all we know, everything could be more similar. So let’s try the entire library again:

Rank  Similarity  Author            Title
1     0.896       King, Stephen     Wizard and Glass
2     0.896       Rowling, J.K.     Harry Potter and the Order of the Phoenix
3     0.895       Riordan, Rick     The Mark of Athena
4     0.891       Jordan, Robert    Knife of Dreams
5     0.891       Riordan, Rick     The Lost Hero
6     0.888       Jordan, Robert    A Crown of Swords
7     0.888       Riordan, Rick     The Son of Neptune
8     0.887       Croggon, Alison   The Singing
9     0.887       King, Stephen     The Drawing of the Three
10    0.884       Jordan, Robert    Crossroads of Twilight

Well, that’s good and bad. It’s unfortunate that it’s not first, but we actually have a Harry Potter book in the top 10! The rest aren’t that low down either, mostly appearing in the top 25. That should help with the author averages:

Rank  Similarity  Author
1     0.876       Jordan, Robert
2     0.876       Rowling, J.K.
3     0.873       King, Stephen
4     0.865       Martin, George R. R.
5     0.851       Riordan, Rick
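
For reference, here is one way the per-author numbers could be computed from the per-book scores. This is a sketch under my own assumptions: that the author score is a plain mean of that author’s book scores, and that the scores arrive as a list of (author title similarity) entries. The function name and the toy data are made up for illustration:

```racket
#lang racket

; Average per-book similarity scores by author.
; scores: list of (list author title similarity); this shape is an assumption
(define (author-averages scores)
  (define by-author (make-hash))
  ; Collect every similarity score under its author
  (for ([entry (in-list scores)])
    (match-define (list author _ similarity) entry)
    (hash-update! by-author author (lambda (old) (cons similarity old)) '()))
  ; Take the mean per author, then sort from most to least similar
  (sort
   (for/list ([(author sims) (in-hash by-author)])
     (list author (/ (apply + sims) (length sims))))
   > #:key second))

; Toy data, not real results
(author-averages
 '(("Author A" "Book 1" 0.5)
   ("Author A" "Book 2" 1.0)
   ("Author B" "Book 3" 0.25)))
```

A plain mean treats every book equally, so an author with one unusually close book and many distant ones can still end up low on the list.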

None too shabby! It’s a bit surprising that Robert Jordan is up at the top, but if we only consider authors that were actually around to write Cuckoo’s Calling, JK Rowling is actually at the top of the list.

Still, can we do better?

Here’s another idea (that I used in my <a href="/category/programming/anngram/">previous work</a>): n-grams. Essentially, take fixed-size slices of the text one character at a time, completely ignoring where words begin and end. So if you were dealing with the text ‘THE DUCK QUACKS’ and 4-grams, you would have these:

'THE '  'HE D'  'E DU'  ' DUC'  'DUCK'
'UCK '  'CK Q'  'K QU'  ' QUA'  'QUAC'
'UACK'  'ACKS'
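
To make the slicing concrete, here is a minimal sketch that lists the n-grams of a short string. The function name n-grams-sketch is just for illustration, and it works on a whole string in memory rather than on a port like the real code does:

```racket
#lang racket

; List every n-character slice of a string, in order
; (yields an empty list when the text is shorter than n)
(define (n-grams-sketch text n)
  (for/list ([i (in-range (add1 (- (string-length text) n)))])
    (substring text i (+ i n))))

(n-grams-sketch "THE DUCK" 4)
; → '("THE " "HE D" "E DU" " DUC" "DUCK")
```

A text with k characters gives k - n + 1 slices, and each slice overlaps the next by n - 1 characters.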

How does this help us? Well, in addition to keeping track of the most common words, n-grams will capture the relationships between words. Theoretically, this extra information might help out. So how do we measure it?

; Calculate n gram frequencies
(define (n-gram-frequency [in (current-input-port)] #:n [n 4] #:limit [limit 100])

  ; Store counts and total to do frequency later
  (define counts (make-hash))

  ; Slide a window of the last n characters across the text, read char by char
  (define n-gram (make-string n #\nul))
  (for ([c (in-port read-char in)])
    (set! n-gram (substring (string-append n-gram (string c)) 1))
    (hash-set! counts n-gram (add1 (hash-ref counts n-gram 0))))

  ; Sort by count, highest first, and keep the top limit many values
  ; (take comes from racket/list, which #lang racket already provides)
  (define sorted
    (sort
     (for/list ([(key val) (in-hash counts)])
       (list val key))
     (lambda (a b) (> (car a) (car b)))))
  (define top-n (take sorted (min limit (length sorted))))

  ; Calculate the frequency of just those
  (define total (* 1.0 (for/sum ([vk (in-list top-n)]) (car vk))))
  (for/hash ([vk (in-list top-n)])
    (values (cadr vk) (/ (car vk) total))))

It’s pretty much the same as the previous code. The only difference is the code to read the n-grams rather than the words, but that should be pretty straightforward. It’s certainly not the most efficient, but it’s fast enough. It can churn through a few hundred books in a few minutes. Good enough for me.
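
One difference worth noting: n-gram-frequency returns a hash keyed by n-gram rather than a vector, and two books won’t necessarily share the same top n-grams. The comparison therefore has to line things up by key. The post doesn’t show that part, so the following is only a sketch of one way to do it, assuming an n-gram missing from either book counts as frequency 0. The name hash-cosine-similarity is hypothetical:

```racket
#lang racket

; Cosine similarity for frequency hashes keyed by n-gram (a sketch, not the
; original code). A key missing from either hash counts as frequency 0.
(define (hash-cosine-similarity a b)
  ; Only keys present in both hashes contribute to the dot product
  (define dot
    (for/sum ([(key val) (in-hash a)])
      (* val (hash-ref b key 0))))
  ; Euclidean length over all values in a hash
  (define (norm h) (sqrt (for/sum ([val (in-hash-values h)]) (* val val))))
  (define denom (* (norm a) (norm b)))
  (if (zero? denom) 0.0 (/ dot denom)))
```

With this approach, two books whose top n-grams barely overlap will score low even if the n-grams they do share have similar frequencies.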

How does it perform though?

Rank  Similarity  Author             Title
1     0.777       King, Stephen      Wizard and Glass
2     0.764       Jordan, Robert     The Gathering Storm
3     0.757       Card, Orson Scott  Heart Fire
4     0.757       Card, Orson Scott  Children of the Mind
5     0.756       King, Stephen      Song of Susannah
6     0.756       King, Stephen      The Dark Tower
7     0.753       Butcher, Jim       White Night
8     0.751       Butcher, Jim       Turn Coat
9     0.746       Butcher, Jim       Captain’s Fury
10    0.746       Butcher, Jim       Side Jobs

That’s not so good. How about the averages?

Rank  Similarity  Author
1     0.731       King, Stephen
2     0.724       Martin, George R. R.
3     0.715       Jordan, Robert
4     0.708       Butcher, Jim
5     0.698       Robinson, Kim Stanley

It turns out that JK Rowling is actually second from the bottom. Honestly, I’m not sure what this says. Did I mess up the algorithm? If so, then why are Stephen King, Robert Jordan, and Jim Butcher still so high up?

I still have a few more ideas though. Next week it is!