Authorship attribution: Part 1

About two weeks ago, the new crime fiction novel Cuckoo’s Calling was revealed to have actually been written by J.K. Rowling under the pseudonym Robert Galbraith. What’s interesting is exactly how they came to that conclusion. Here’s a quote from Time magazine (via Programming Praxis):

As one part of his work, Juola uses a program to pull out the hundred most frequent words across an author’s vocabulary. This step eliminates rare words, character names and plot points, leaving him with words like of and but, ranked by usage. Those words might seem inconsequential, but they leave an authorial fingerprint on any work.

“Prepositions and articles and similar little function words are actually very individual,” Juola says. “It’s actually very, very hard to change them because they’re so subconscious.”

It’s pretty similar to what I did a few years ago for my undergraduate thesis, AnnGram. There, I used a technique like the one described above, along with n-grams and self-organizing maps, to classify works by author. It’s been a while, but let’s take a crack at re-implementing some of these techniques.

(If you’d like to follow along, you can see the full code here: authorship attribution on github)

First, we’ll use the technique described above. The idea is to take the most common words throughout a book and rank them. Theoretically, this will give us a unique fingerprint for each author that should be able to identify them even under a pseudonym.

Let’s start by cleaning up the words. For the time being, we want only alphabetic characters, all in lowercase. That way a word counts the same whether it is capitalized at the start of a sentence or has punctuation attached. This should be an easy enough way to do that:

; Remove non word characters
(define (fix-word word)
  (list->string 
   (for/list ([c (in-string word)] 
              #:when (char-alphabetic? c)) 
     (char-downcase c))))
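
To get a feel for what this does to individual tokens, here is a small check. The definition is repeated only so the snippet runs on its own, and the sample words are made up for illustration:

```racket
#lang racket

; Same definition as above, repeated so this snippet is self-contained
(define (fix-word word)
  (list->string
   (for/list ([c (in-string word)]
              #:when (char-alphabetic? c))
     (char-downcase c))))

(fix-word "Hello,")  ; => "hello" (punctuation dropped, lowercased)
(fix-word "don't")   ; => "dont"  (the apostrophe is removed as well)
(fix-word "1997")    ; => ""      (no letters at all)
```

Note the last case: a token with no letters turns into the empty string rather than disappearing.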

Easy enough. So let’s actually count the words. To start, we’ll keep a hash of counts. Hashes are easy enough to work with in Racket, albeit not quite so easy as in, say, Python. With that, we only need to loop through the words in the text:

; Store the word counts
(define counts (make-hash))

; Count all of the base words in the text
(for* ([line (in-lines in)]
       [word (in-list (string-split line))])
  (define fixed (fix-word word))
  (hash-set! counts fixed (add1 (hash-ref counts fixed 0))))

Using the three-argument form of hash-ref allows us to specify a default. That way the hash is effectively acting like Python’s defaultdict (a particular favorite data structure of mine).
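
For comparison, the same default-on-missing idea can probably be written with hash-update!, which takes the default as its last argument. This is only a sketch; the words list is made up for illustration:

```racket
#lang racket

; Hypothetical input: a handful of already-cleaned words
(define words '("the" "cat" "and" "the" "hat"))

(define counts (make-hash))

; hash-update! applies add1 to the current value,
; starting from 0 when the key is not in the hash yet
(for ([word (in-list words)])
  (hash-update! counts word add1 0))

(hash-ref counts "the")    ; => 2
(hash-ref counts "dog" 0)  ; => 0, the default for a missing key
```

Either form works; the hash-set!/hash-ref pair just makes the default a little more visible.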

After we’ve done that, we can find the top-n most common words:

; Extract the top limit words
(define top-n
  (map first
       (take
        (sort 
         (for/list ([(word count) (in-hash counts)])
           (list word count))
         (lambda (a b) (> (second a) (second b))))
        limit)))
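
Here is a small, self-contained sketch of the same step on made-up counts. It uses sort’s #:key argument instead of a custom comparison, and it guards take so it never asks for more words than exist; both are my own variations rather than what the full source does:

```racket
#lang racket

; Hypothetical counts, with no ties so the order is unambiguous
(define counts
  (make-hash '(("the" . 5) ("and" . 3) ("to" . 2) ("harry" . 1))))
(define limit 3)

; Sort (word . count) pairs by count, largest first
(define sorted (sort (hash->list counts) > #:key cdr))

; Never ask take for more elements than the list has
(define top-n
  (map car (take sorted (min limit (length sorted)))))

top-n  ; => '("the" "and" "to")
```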

Finally, we want to replace the count with the ordering. Later, we’ll try using the relative frequencies, but at the moment the ordering will do well enough. Since a missing word will later default to 0, and that default should sit next to the lowest ranks, we’ll count down: the most common word gets the highest number.

; Add an order to each, descending
(for/hash ([i (in-range limit 0 -1)]
           [word (in-list top-n)])
  (values word i)))
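
Assembled into one function, the snippets so far might look roughly like this. The name word-rank matches the call used in this post, but the optional in argument and the #:limit keyword are assumptions for illustration, not necessarily the signature in the repository:

```racket
#lang racket

; Remove non-word characters and lowercase the rest
(define (fix-word word)
  (list->string
   (for/list ([c (in-string word)]
              #:when (char-alphabetic? c))
     (char-downcase c))))

; Rank the `limit` most common words read from the port `in`.
; Assumption: the optional port and the #:limit keyword are illustrative.
(define (word-rank [in (current-input-port)] #:limit [limit 10])
  (define counts (make-hash))
  (for* ([line (in-lines in)]
         [word (in-list (string-split line))])
    (define fixed (fix-word word))
    (hash-set! counts fixed (add1 (hash-ref counts fixed 0))))
  (define sorted (sort (hash->list counts) > #:key cdr))
  (define top-n (map car (take sorted (min limit (length sorted)))))
  ; Count down so the most common word gets the highest number
  (for/hash ([i (in-range limit 0 -1)]
             [word (in-list top-n)])
    (values word i)))

; A string port stands in for a book
(define ranks
  (word-rank (open-input-string "the cat and the hat and the bat")
             #:limit 2))

ranks  ; "the" maps to 2 and "and" maps to 1
```

Because the port defaults to the current input port, a function shaped like this can also be handed directly to with-input-from-file.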

All together, this can take a text file (as an input port) and return the most common words. For example, using Cuckoo’s Calling:

> (with-input-from-file "Cuckoo's Calling.txt" word-rank)
'#hash(("the" . 10)  ("to" . 9)   ("and" . 8)
       ("a" . 7)     ("of" . 6)   ("he" . 5)
       ("was" . 4)   ("she" . 3)  ("in" . 2)
       ("her" . 1))

If the article is right (and they did identify JK Rowling after all), then any book written by her should give a similar ordering, while other authors’ books should differ at least slightly. Take, for example, the text of the 7th Harry Potter book:

> (with-input-from-file "Deathly Hallows.txt" word-rank)
'#hash(("the" . 10)  ("and" . 9)    ("" . 8)
       ("to" . 7)    ("of" . 6)     ("he" . 5)
       ("a" . 4)     ("harry" . 3)  ("was" . 2)
       ("it" . 1))

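It seems that and has moved up, a has dropped below of and he, and she, in, and her have fallen out of the top ten. And harry is there: it’s pretty impressive that it’s the 7th most common actual word in the entire book, but it’s rather unlikely to appear in Cuckoo’s Calling. (The empty string in third place is an artifact: fix-word returns "" for any token with no letters, and those get counted too.) But overall, it’s pretty similar. So let’s try to compare it to a few more books.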

We do need one more piece first though: a way to tell how similar two books are. In this case, we’ll use the idea of cosine similarity. Essentially, given two vectors we can calculate the angle between them. The more similar two vectors are, the closer to zero that angle will be.
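
For reference, the angle comes from the standard dot-product identity. The symbols a, b, and θ here are just notation for the two vectors and the angle between them:

```latex
\theta = \arccos\left(
  \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
\right),
\qquad
\mathbf{a} \cdot \mathbf{b} = \sum_i a_i b_i,
\qquad
\lVert \mathbf{a} \rVert = \sqrt{\sum_i a_i^2}
```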

One problem is that we have hashes instead of vectors. We can’t even guarantee that the same words will appear in two different lists. So first, we’ll unify the keys. Add zeros for missing words, put them in the same order, and we have vectors we can measure:
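
As a sketch of that unification on its own, here are two made-up rank hashes turned into aligned vectors. Sorting the keys is my addition so the output is easy to read; any order works as long as both vectors use the same one:

```racket
#lang racket

; Hypothetical rank hashes for two short texts
(define a (hash "the" 3 "and" 2 "to" 1))
(define b (hash "the" 3 "of" 2 "and" 1))

; Every word that appears in either hash, in one fixed (sorted) order
(define keys
  (sort (set->list (set-union (list->set (hash-keys a))
                              (list->set (hash-keys b))))
        string<?))

; Look each word up in both hashes, using 0 when it is missing
(define va (for/vector ([k (in-list keys)]) (hash-ref a k 0)))
(define vb (for/vector ([k (in-list keys)]) (hash-ref b k 0)))

keys  ; => '("and" "of" "the" "to")
va    ; => '#(2 0 3 1)
vb    ; => '#(1 2 3 0)
```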

; Calculate the similarity between two vectors
; If inputs are hashes, merge them before calculating similarity
(define (cosine-similarity a b)
  (cond
    [(and (hash? a) (hash? b))
     (define keys
       (set->list (set-union (list->set (hash-keys a))
                             (list->set (hash-keys b)))))
     (cosine-similarity
      (for/vector ([k (in-list keys)]) (hash-ref a k 0))
      (for/vector ([k (in-list keys)]) (hash-ref b k 0)))]
    [else
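     ; The angle (in radians) between the two vectors
     (define theta (acos (/ (dot-product a b) (* (magnitude a) (magnitude b)))))
     ; Rescale so an angle of 0 scores 1.0 and a right angle scores 0.0
     (- 1.0 (/ (abs theta) (/ pi 2)))]))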

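The last line rescales the angle to the range [0, 1.0], where a higher number means a better match. (That works because the rank vectors have no negative entries, so the angle never exceeds a right angle.) This isn’t strictly necessary, but I think it looks nicer. 😄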
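
The code above relies on dot-product and a vector version of magnitude. A minimal sketch of how such helpers could look is below; these are my own stand-ins, not necessarily what the full source defines. I’ve named the second one vector-magnitude to avoid shadowing Racket’s built-in magnitude (which works on numbers), and added a clamp so floating point rounding can’t push the ratio outside the range where acos returns a real number:

```racket
#lang racket

; Assumption: these helpers stand in for whatever the full source defines
(define (dot-product a b)
  (for/sum ([x (in-vector a)] [y (in-vector b)])
    (* x y)))

(define (vector-magnitude a)
  (sqrt (dot-product a a)))

; Map the angle between a and b onto [0, 1], where 1 means "same direction"
(define (angle-similarity a b)
  (define ratio (/ (dot-product a b)
                   (* (vector-magnitude a) (vector-magnitude b))))
  ; Clamp so rounding error cannot push the ratio outside [-1, 1]
  (define theta (acos (max -1.0 (min 1.0 ratio))))
  (- 1.0 (/ (abs theta) (/ pi 2))))

(angle-similarity (vector 1 0) (vector 0 1))  ; orthogonal vectors score 0.0
(angle-similarity (vector 1 2) (vector 2 4))  ; parallel vectors score close to 1.0
```

Note that only the direction of the vectors matters here, not their length, which is why (1, 2) and (2, 4) count as a near-perfect match.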

Finally, we can calculate the similarity between two books. So how similar are Cuckoo’s Calling and the Deathly Hallows?

> (let ([a (with-input-from-file "Cuckoo's Calling.txt" word-rank)]
        [b (with-input-from-file "Deathly Hallows.txt" word-rank)])
    (cosine-similarity a b))
0.6965

About 70% (not that the numbers mean particularly much). So let’s try a few more.

Unfortunately, I don’t have much in the way of crime fiction; I’m more interested in science fiction and fantasy. But that should work well enough. Using a bit of framework (linky), we can measure this easily enough.

So, who among the authors I have could have written Cuckoo’s Calling? Here are the most similar books:

Rank	Score	Author	Title
1	0.740	Butcher, Jim	Storm Front
2	0.739	Butcher, Jim	Side Jobs
3	0.738	Butcher, Jim	Turn Coat
4	0.736	Butcher, Jim	Small Favor
5	0.735	Butcher, Jim	White Night
6	0.734	Butcher, Jim	Cold Days
7	0.731	Butcher, Jim	Proven Guilty
8	0.729	Butcher, Jim	Ghost Story
9	0.728	Stirling, S. M. & Meier, Shirley	Shadow’s Son
10	0.728	Stephen, King	Wizard and Glass
11	0.728	Lovegrove, James	The Age of Zeus
12	0.726	Butcher, Jim	Dead Beat
13	0.726	Duncan, Glen	Last Werewolf, The
14	0.724	Butcher, Jim	Fool Moon
15	0.723	Stephen, King	The Drawing of the Three
16	0.723	Adams, Douglas	So Long, and Thanks for All the Fish
17	0.722	Stephen, King	The Dark Tower
18	0.718	Lovegrove, James	The Age of Odin
19	0.718	Butcher, Jim	Changes
20	0.715	Chima, Cinda Williams	The Wizard Heir

Perhaps it’s not surprising that Jim Butcher’s books are at the top of the list. After all, they’re about the closest thing that I have to crime fiction. Still, it doesn’t look good that absolutely none of JK Rowling’s books are in the top 20. In fact, we have to go all of the way down to number 43 to find Harry Potter and the Half-Blood Prince, with a score of 0.704.

What if we average each author’s books? Perhaps JK Rowling is more consistently matched against Cuckoo’s Calling?

Rank	Average score	Author
1	0.714	Stephen, King
2	0.709	Butcher, Jim
3	0.704	Briggs, Patricia
4	0.704	Benson, Amber
5	0.698	Robinson, Kim Stanley
6	0.694	Colfer, Eoin
7	0.693	Jordan, Robert
8	0.692	Rowling, J.K.
9	0.687	Steele, Allen
10	0.687	Orwell, George
11	0.682	Croggon, Alison
12	0.681	Adams, Douglas
13	0.680	Riordan, Rick
14	0.679	Card, Orson Scott
15	0.671	Brin, David

Not so much better, that. I have a few ideas though. Perhaps in a few days, we’ll see what we can do.

If you’d like to see the full source, you can do so here: authorship attribution on github