Last time, we used word rank to try to figure out who could possibly have written Cuckoo’s Calling. It didn’t work out so well, but we at least have a nice framework in place. So perhaps we can try a few more ways of turning entire novels into a few numbers.

Rather than word rank, how about stop word frequency? Stop words are small words, such as articles and prepositions, that don’t carry much of a sentence’s meaning on their own. On the other hand, they are exactly the words that appear most often, so perhaps their frequencies will tell us something more.

The code is actually rather similar. To start out with, we want to load in a set of stop words. There are dozens of lists out there; any of them will work.

; Load the stop words
(define stop-words
  (with-input-from-file "stop-words.txt"
    (lambda ()
      (for/set ([line (in-lines)]) (fix-word line)))))
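
`fix-word` comes from the previous post and isn’t shown here. As a rough stand-in, here is a minimal sketch that assumes it only lowercases a word and drops anything that isn’t a letter; the real version may well do more:

```racket
#lang racket

; Hypothetical stand-in for fix-word from the previous post:
; lowercase the word and keep only alphabetic characters
(define (fix-word word)
  (list->string
   (for/list ([c (in-string (string-downcase word))]
              #:when (char-alphabetic? c))
     c)))

(fix-word "The,")   ; "the"
(fix-word "Don't")  ; "dont"
```

Whatever the exact rules are, the important part is that the stop word list and the book text go through the same function, so that "The," in a novel matches "the" in the list.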


With that, we just need to count the occurrences of each stop word and then normalize by the total number of stop words seen. Normalizing keeps the numbers comparable when some books have more or fewer stop words overall.

; Calculate the relative frequencies of stop words in a text
(define (stop-word-frequency [in (current-input-port)])
  ; Store the frequency and word count
  (define counts (make-hash))
  (define total (make-parameter 0.0))

  ; Loop across the input text
  (for* ([line (in-lines in)]
         [word (in-list (string-split line))])
    (define fixed (fix-word word))

    ; Key the counts on the fixed word so they match the lookup below
    (when (set-member? stop-words fixed)
      (total (+ (total) 1))
      (hash-set! counts fixed (add1 (hash-ref counts fixed 0)))))

  ; Normalize and return frequencies
  ; Iterating the same set gives every text the same word order
  (for/vector ([word (in-set stop-words)])
    (/ (hash-ref counts word 0)
       (total))))
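
To make the normalization concrete, here is a tiny worked example. It is only a sketch: the three-word stop list and the name `tiny-frequencies` are made up for illustration, and it uses exact fractions so the arithmetic is easy to check by hand.

```racket
#lang racket

; Hypothetical three-word stop list, kept as a list so the order is fixed
(define tiny-stop-words '("the" "and" "of"))

; Count each stop word in a string, then divide by the total stop word count
(define (tiny-frequencies text)
  (define words (string-split (string-downcase text)))
  (define counts
    (for/list ([stop (in-list tiny-stop-words)])
      (count (lambda (w) (equal? w stop)) words)))
  (define total (apply + counts))
  (for/vector ([c (in-list counts)])
    (if (zero? total) 0 (/ c total))))

; "the" appears twice, "and" once, "of" never: 3 stop words in total,
; so the frequencies are 2/3, 1/3 and 0
(tiny-frequencies "The cat and the dog")
```

Note that the frequencies are relative to the number of stop words, not to the length of the text, so they always sum to 1 when at least one stop word is present.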


Using Cuckoo’s Calling and a particular list with 50 words, we have:

> (with-input-from-file "../target.txt" stop-word-frequency)
'#(0.043 0.003 0.000 0.002 0.026
...
0.001 0.000 0.000 0.000 0.000)


Well that doesn’t mean much to us. Hopefully it means more to the computer. 😄

So how similar does this one say Cuckoo’s Calling and the Deathly Hallows are?

> (let ([a (with-input-from-file "Cuckoo's Calling.txt" stop-word-frequency)]
        [b (with-input-from-file "Deathly Hallows.txt" stop-word-frequency)])
    (cosine-similarity a b))
0.877
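
`cosine-similarity` is the function from the previous post. For reference, a minimal sketch of the standard formula (dot product divided by the product of the lengths) for two equal-length vectors might look like this; the name `cosine-similarity-sketch` is mine, and the original implementation may differ:

```racket
#lang racket

; Sketch of cosine similarity for two equal-length numeric vectors.
; Assumes neither vector is all zeros (otherwise the division is undefined).
(define (cosine-similarity-sketch a b)
  (define dot (for/sum ([x (in-vector a)] [y (in-vector b)]) (* x y)))
  (define (norm v) (sqrt (for/sum ([x (in-vector v)]) (* x x))))
  (/ dot (* (norm a) (norm b))))

(cosine-similarity-sketch (vector 1.0 0.0) (vector 1.0 0.0))  ; same direction
(cosine-similarity-sketch (vector 1.0 0.0) (vector 0.0 1.0))  ; nothing in common
```

Because frequencies are never negative, the result always falls between 0 and 1. And since every English novel leans heavily on the same handful of stop words, it is reasonable to expect scores from this measure to bunch up near the high end.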


That’s a lot higher! Unfortunately, a higher score doesn’t mean this test finds the two books more similar than the last one did. For all we know, every pair of books scores higher with this measure. So let’s try the entire library again:

1 0.896 Stephen, King Wizard and Glass
2 0.896 Rowling, J.K. Harry Potter and the Order of the Phoenix
3 0.895 Riordan, Rick The Mark of Athena
4 0.891 Jordan, Robert Knife of Dreams
5 0.891 Riordan, Rick The Lost Hero
6 0.888 Jordan, Robert A Crown of Swords
7 0.888 Riordan, Rick The Son of Neptune
8 0.887 Croggon, Alison The Singing
9 0.887 Stephen, King The Drawing of the Three
10 0.884 Jordan, Robert Crossroads of Twilight

Well, that’s good and bad. It’s unfortunate that a Rowling book isn’t first, but we actually have a Harry Potter book in the top 10! The rest of her books aren’t that far down either, mostly appearing in the top 25. That should help with the author averages:

1 0.876 Jordan, Robert
2 0.876 Rowling, J.K.
3 0.873 Stephen, King
4 0.865 Martin, George R. R.
5 0.851 Riordan, Rick

None too shabby! It’s a bit surprising that Robert Jordan is up at the top, but if we only consider authors who were actually around to write Cuckoo’s Calling, J.K. Rowling is at the top of the list.
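
The author averages are just the per-book scores grouped by author. One plausible way to compute them, sketched here with made-up authors and scores (the function name and the `(author title score)` layout are assumptions, not the code that produced the table):

```racket
#lang racket

; Average per-book similarity scores by author.
; Input: a list of (list author title score); returns (list average author)
; entries sorted from highest to lowest average.
(define (author-averages results)
  (define by-author (make-hash))
  (for ([r (in-list results)])
    (hash-update! by-author (first r)
                  (lambda (scores) (cons (third r) scores))
                  '()))
  (sort
   (for/list ([(author scores) (in-hash by-author)])
     (list (/ (apply + scores) (length scores)) author))
   >
   #:key first))

; Made-up data: Author A averages about 0.8, Author B exactly 0.6
(author-averages
 '(("Author A" "Book 1" 0.9)
   ("Author A" "Book 2" 0.7)
   ("Author B" "Book 3" 0.6)))
```

A plain mean like this treats an author with two books the same as one with twenty, so a single unusually close book can move a small author a long way.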

Still, can we do better?

Here’s another idea (that I used in my <a href="//blog.jverkamp.com/category/programming/anngram/">previous work</a>): n-grams. Essentially, take fixed-size slices of the text without regard to what they contain, word boundaries included. So if you were dealing with the text ‘THE DUCK QUACKS’ and 4-grams, you would have these:

'THE '  'HE D'  'E DU'  ' DUC'  'DUCK'
'UCK '  'CK Q'  'K QU'  ' QUA'  'QUAC'
'UACK'  'ACKS'
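
For a string that is already in memory, the slicing can be sketched in a few lines. The helper name `string-n-grams` is made up for this example:

```racket
#lang racket

; List every n-character slice of a string, in order.
; A string of length L has L - n + 1 slices (none if it is shorter than n).
(define (string-n-grams text n)
  (for/list ([i (in-range (add1 (- (string-length text) n)))])
    (substring text i (+ i n))))

; 15 characters, so 15 - 4 + 1 = 12 slices, from "THE " through "ACKS"
(string-n-grams "THE DUCK QUACKS" 4)
```

Spaces count as characters like any other, which is how slices such as "E DU" end up straddling two words.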


How does this help us? Well, in addition to keeping track of the most common words, n-grams will capture the relationships between words. Theoretically, this extra information might help out. So how do we measure it?

; Calculate n-gram frequencies
(define (n-gram-frequency [in (current-input-port)] #:n [n 4] #:limit [limit 100])

  ; Store counts; the total is computed later over the top n-grams only
  (define counts (make-hash))

  ; Keep a sliding buffer of the last n characters, read char by char
  (define n-gram (make-string n #\nul))
  (for ([c (in-port read-char in)])
    (set! n-gram (substring (string-append n-gram (string c)) 1))
    (hash-set! counts n-gram (add1 (hash-ref counts n-gram 0))))

  ; Find the top limit many values
  (define top-n
    (take
     (sort
      (for/list ([(key val) (in-hash counts)])
        (list val key))
      (lambda (a b) (> (car a) (car b))))
     limit))

  ; Calculate the frequency of just those
  (define total (* 1.0 (for/sum ([vk (in-list top-n)]) (car vk))))
  (for/hash ([vk (in-list top-n)])
    (values (cadr vk) (/ (car vk) total))))


It’s pretty much the same as the previous code. The only difference is the code to read the n-grams rather than the words, but that should be pretty straightforward. It’s certainly not the most efficient, but it’s fast enough: it can churn through a few hundred books in a few minutes. Good enough for me.
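
One wrinkle is that this function returns a hash rather than a vector, and two books will not share exactly the same top n-grams. A sketch of how the comparison could handle that, treating a missing n-gram as frequency 0 (the name `hash-cosine-similarity` is an assumption; the comparison code actually used isn’t shown in this post):

```racket
#lang racket

; Sketch: cosine similarity for two frequency hashes (n-gram -> frequency).
; Keys missing from one hash are treated as frequency 0.
(define (hash-cosine-similarity a b)
  (define dot
    (for/sum ([(key freq) (in-hash a)])
      (* freq (hash-ref b key 0))))
  (define (norm h)
    (sqrt (for/sum ([freq (in-hash-values h)]) (* freq freq))))
  (/ dot (* (norm a) (norm b))))

; Made-up frequencies: the two hashes share only one of their two n-grams
(hash-cosine-similarity (hash "the " 0.5 " and" 0.5)
                        (hash "the " 0.5 "ing " 0.5))
```

Under this scheme an n-gram that appears in only one book’s top list adds nothing to the dot product but still counts toward that book’s length, so partial overlap between the lists pulls the score down.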

How does it perform though?

1 0.777 Stephen, Wizard and Glass
2 0.764 Jordan, Robert The Gathering Storm
3 0.757 Card, Orson Scott Heart Fire
4 0.757 Card, Orson Scott Children of the Mind
5 0.756 Stephen, Song of Susannah
6 0.756 Stephen, The Dark Tower
7 0.753 Butcher, Jim White Night
8 0.751 Butcher, Jim Turn Coat
9 0.746 Butcher, Jim Captain’s Fury
10 0.746 Butcher, Jim Side Jobs

That’s not so good. How about the averages?

1 0.731 Stephen, King,
2 0.724 Martin, George R. R.
3 0.715 Jordan, Robert
4 0.708 Butcher, Jim
5 0.698 Robinson, Kim Stanley

It turns out that J.K. Rowling is actually second from the bottom. Honestly, I’m not sure what this says. Did I mess up the algorithm? If so, why are Stephen King, Robert Jordan, and Jim Butcher still so high up?

I still have a few more ideas though. Next week it is!