Authorship attribution: Part 1

About two weeks ago, the new crime fiction novel Cuckoo’s Calling was revealed to have actually been written by J.K. Rowling under the pseudonym Robert Galbraith. What’s interesting is exactly how they came to that conclusion. Here’s a quote from Time magazine (via Programming Praxis):

As one part of his work, Juola uses a program to pull out the hundred most frequent words across an author’s vocabulary. This step eliminates rare words, character names and plot points, leaving him with words like of and but, ranked by usage. Those words might seem inconsequential, but they leave an authorial fingerprint on any work.

“Propositions and articles and similar little function words are actually very individual,” Juola says. “It’s actually very, very hard to change them because they’re so subconscious.”

It’s actually pretty similar to what I did a few years ago for my undergraduate thesis: AnnGram. In that case, I used a similar technique to what they described above, n-grams, and self organizing maps to classify works by author. It’s been awhile, but let’s take a crack at re-implementing some of these techniques.

(If you’d like to follow along, you can see the full code here: authorship attribution on github)

First, we’ll use the technique described above. The idea is to take the most common words throughout a book and rank them. Theoretically, this will give us a unique fingerprint for each author that should be able to identify them even under a pseudonym.

Let’s start by cleaning up the words. For the time being, we want only alphabetic characters and only in lowercase. That way we should avoid position in sentences and the like. This should be an easy enough way to do that:

; Remove non word characters
(define (fix-word word)
  (list->string 
   (for/list ([c (in-string word)] 
              #:when (char-alphabetic? c)) 
     (char-downcase c))))

Easy enough. So let’s actually count the words. To start, we’ll keep a hash of counts. They’re easy enough to work with in Racket, albeit not quite so easy as say Python. With that, we only need to loop through the words in the text:

; Store the word counts
(define counts (make-hash))

; Count all of the base words in the text
(for* ([line (in-lines in)]
       [word (in-list (string-split line))])
  (define fixed (fix-word word))
  (hash-set! counts fixed (add1 (hash-ref counts fixed 0))))

Using the three argument form of hash-ref allows us to specify a default. That way the hash is effectively acting like Python’s defaultdict (a particular favorite data structure of mine).

After we’ve done that, we can find the top-n most common words:

; Extract the top limit words
(define top-n
  (map first
       (take
        (sort 
         (for/list ([(word count) (in-hash counts)])
           (list word count))
         (lambda (a b) (> (second a) (second b))))
        limit)))

Finally, we want to replace the count with the ordering. Later, we’ll try using the relative frequencies but at the moment the ordering will do well enough. Since we’re going to later use a default value of 0 which should be near to a low rank, we’ll count down.

; Add an order to each, descending
(for/hash ([i (in-range limit 0 -1)]
           [word (in-list top-n)])
  (values word i)))

All together, this can take a text file (as input port) and return the most common words. For example, using Cuckoo’s Calling:

> (with-input-from-file "Cuckoo's Calling.txt" word-rank)
'#hash(("the" . 10)  ("to" . 9)   ("and" . 8)
       ("a" . 7)     ("of" . 6)   ("he" . 5)
       ("was" . 4)   ("she" . 3)  ("in" . 2)
       ("her" . 1))

If the post was correct (and they did identify JK Rowling after all), then this should be a similar ordering for any book written by her while other authors will be slightly different. Let’s take for example the text of the 7th Harry Potter book:

> (with-input-from-file "Deathly Hallows.txt" word-rank)
'#hash(("the" . 10)  ("and" . 9)    ("" . 8)
       ("to" . 7)    ("of" . 6)     ("he" . 5)
       ("a" . 4)     ("harry" . 3)  ("was" . 2)
       ("it" . 1))

It seems that and has moved up, a and she have swapped, and harry is there–It’s pretty impressive that’s the 7th most common word in the entire book but rather unlikely to appear in Cuckoo’s Calling. But overall, it’s pretty similar. So let’s try to compare it to a few more books.

We do need one more peace first though. We need to be able to tell how similar two books are. In this case, we’ll use the idea of cosine similarity. Essentially, given two vectors we can calculate the angle between them. The more similar two vectors are, the closer to zero the result will be.

One problem is that we have hashes instead of vectors. We can’t even guarantee that the same words will appear in two different lists. So first, we’ll unify the keys. Add zeros for missing words, put them in the same order, and we have vectors we can measure:

; Calculate the similarity between two vectors
; If inputs are hashes, merge them before calculating similarity
(define (cosine-similarity a b)
  (cond
    [(and (hash? a) (hash? b))
     (define keys
       (set->list (set-union (list->set (hash-keys a))
                             (list->set (hash-keys b)))))
     (cosine-similarity
      (for/vector ([k (in-list keys)]) (hash-ref a k 0))
      (for/vector ([k (in-list keys)]) (hash-ref b k 0)))]
    [else
     (define cossim (acos (/ (dot-product a b) (* (magnitude a) (magnitude b)))))
     (- 1.0 (/ (abs cossim) (/ pi 2)))]))

The last line normalizes it to the range [0, 1.0] where the higher the number, the better match. This isn’t strictly necessary, but I think it looks nicer. 😄

Finally, we can calculate the similarity between two books. So how similar are Cuckoo’s Calling and the Deathly Hallows?

> (let ([a (with-input-from-file "Cuckoo's Calling.txt" word-rank)]
        [b (with-input-from-file "Deathly Hallows.txt" word-rank)])
    (cosine-similarity a b))
0.6965

About 70% (not that the numbers mean particularly much). So let’s try a few more.

Unfortunately, I don’t have much in the way of crime fiction–I’m more interested in science fiction and fantasy. But that should work well enough. Using a bit of framework (linky), we can measure this easily enough.

So, who among the author I have could have written Cuckoo’s Calling? Here are the most similar books:

10.740Butcher, JimStorm Front
20.739Butcher, JimSide Jobs
30.738Butcher, JimTurn Coat
40.736Butcher, JimSmall Favor
50.735Butcher, JimWhite Night
60.734Butcher, JimCold Days
70.731Butcher, JimProven Guilty
80.729Butcher, JimGhost Story
90.728Stirling, S. M. & Meier, ShirleyShadow’s Son
100.728Stephen, KingWizard and Glass
110.728Lovegrove, JamesThe Age of Zeus
120.726Butcher, JimDead Beat
130.726Duncan, GlenLast Werewolf, The
140.724Butcher, JimFool Moon
150.723Stephen, KingThe Drawing of the Three
160.723Adams, DouglasSo Long, and Thanks for All the Fish
170.722Stephen, KingThe Dark Tower
180.718Lovegrove, JamesThe Age of Odin
190.718Butcher, JimChanges
200.715Chima, Cinda WilliamsThe Wizard Heir

Perhaps it’s not surprising that Jim Butcher’s books are at the top of the list. After all, it’s about the closest thing that I have to crime fiction. Still, it doesn’t look good that absolutely none of JK Rowling’s books are in the top 20. In fact, we have to go all of the way down to 43 to find Harry Potter and the Half-Blood Prince, with a score of 0.704.

What if we average each author’s books? Perhaps JK Rowling is more consistently matched against Cuckoo’s Calling?

10.714Stephen, King
20.709Butcher, Jim
30.704Briggs, Patricia
40.704Benson, Amber
50.698Robinson, Kim Stanley
60.694Colfer, Eoin
70.693Jordan, Robert
80.692Rowling, J.K.
90.687Steele, Allen
100.687Orwell, George
110.682Croggon, Alison
120.681Adams, Douglas
130.680Riordan, Rick
140.679Card, Orson Scott
150.671Brin, David

Not so much better, that. I have a few ideas though. Perhaps in a few days, we’ll see what we can do.

If you’d like to see the full source, you can do so here: authorship attribution on github