So far, we’ve had three different ideas for figuring out the author of an unknown paper (top n word ordering in Part 1 and stop word frequency / 4-grams in Part 2). Here’s something interesting though from the comments on the Programming Praxis post:

> Globules said (July 19, 2013 at 12:29 PM): Patrick Juola has a guest post on Language Log describing the approach he took.

Last time, we used word rank to try to figure out who could possibly have written Cuckoo's Calling. It didn't work out so well, but we at least have a nice framework in place. So perhaps we can try a few more ways of turning entire novels into a few numbers.
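To make "word rank" concrete, here is a minimal sketch of the idea: pull the most frequent words from a text, then score how differently two texts order them. The function names and the rank-distance scoring are my own illustration, not the code used in these posts:

```python
from collections import Counter
import re

def top_word_ranks(text, n=10):
    """Return the n most frequent words in a text, most common first."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w, _ in Counter(words).most_common(n)]

def rank_distance(ranks_a, ranks_b):
    """Sum of rank differences for words the two lists share;
    a word missing from the other list costs the full list length."""
    dist = 0
    for i, w in enumerate(ranks_a):
        dist += abs(i - ranks_b.index(w)) if w in ranks_b else len(ranks_a)
    return dist
```

Two texts by the same author should, in theory, rank their common words similarly, giving a small distance.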

About two weeks ago, the new crime fiction novel Cuckoo’s Calling was revealed to have actually been written by J.K. Rowling under the pseudonym Robert Galbraith. What’s interesting is exactly how they came to that conclusion. Here’s a quote from Time magazine (via Programming Praxis):

# FLAIRS 2010 - Augmenting n-gram Based Authorship Attribution With Neural Networks

Co-authors: Michael Wollowski and Maki Hirotani

Abstract: While using statistical methods to determine authorship attribution is not a new idea and neural networks have been applied to a number of statistical problems, the two have not often been used together. We show that the use of artificial neural networks, specifically self-organizing maps, combined with n-grams provides a success rate on the order of previous work with purely statistical methods. Using a collection of documents including the works of Shakespeare, William Blake, and the King James Version of the Bible, we were able to demonstrate classification of documents into individual groups.
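The abstract's core mechanism, training a self-organizing map on document vectors, can be sketched in a few lines of NumPy. This is a generic toy SOM, not the paper's implementation; the grid size, decay schedules, and function names are all illustrative assumptions:

```python
import numpy as np

def train_som(data, grid=(4, 4), epochs=200, lr0=0.5, seed=0):
    """Train a tiny self-organizing map: each grid cell holds a weight
    vector; the best-matching cell and its neighbours are pulled toward
    each input sample, with learning rate and radius decaying over time."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    w = rng.random((rows, cols, data.shape[1]))
    coords = np.indices(grid).transpose(1, 2, 0)  # (row, col) of each cell
    sigma0 = max(grid) / 2
    for t in range(epochs):
        lr = lr0 * np.exp(-t / epochs)
        sigma = sigma0 * np.exp(-t / epochs)
        for x in data[rng.permutation(len(data))]:
            # best-matching unit = cell whose weights are closest to x
            d = np.linalg.norm(w - x, axis=2)
            bmu = np.unravel_index(d.argmin(), d.shape)
            # Gaussian neighbourhood centred on the BMU
            g = np.exp(-np.sum((coords - bmu) ** 2, axis=2) / (2 * sigma ** 2))
            w += lr * g[..., None] * (x - w)
    return w

def bmu_of(w, x):
    """Grid position of the cell whose weights best match x."""
    d = np.linalg.norm(w - x, axis=2)
    return np.unravel_index(d.argmin(), d.shape)
```

With n-gram frequency vectors as `data`, documents by the same author should land on the same or neighbouring cells of the trained map.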

# AnnGram - nGrams vs Words

Overview

For another comparison, I've been looking for a different way of turning a document into a vector, one I could swap in for the nGrams. Using word frequency instead of nGrams, I've run a number of tests to see how the accuracy and speed of the algorithm compare between the two.
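To make the two vectorizations concrete, here is a sketch of both document-to-vector schemes, plus a cosine similarity for comparing the resulting vectors. The function names are illustrative, not taken from the AnnGram code:

```python
from collections import Counter
import math
import re

def ngram_vector(text, n=4):
    """Character n-gram counts (the AnnGram-style document vector)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def word_vector(text):
    """Word-frequency counts (the alternative document vector)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Word vectors are much smaller and faster to build, while character n-grams capture spelling and punctuation habits that word counts miss, which is the trade-off the tests are measuring.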

nGrams

I still intend to look into why the Tragedy of Macbeth does not stay with the rest of Shakespeare's plays.  I still believe that it is because portions of it were possibly written by another author.

# AnnGram vs k-means

Overview

As a set of benchmarks to test whether or not the new AnnGram algorithm is actually working correctly, I've been trying to come up with different yet similar methods to compare it to. Primarily, there are two possibilities:

- Replace the nGram vectors with another form
- Process the nGrams using something other than Self-Organizing Maps

I'm still looking through the related literature to decide if there is some way to use something other than the nGrams to feed into the SOM; however, I haven't been having any luck.
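The second possibility, processing the same nGram vectors with something other than a SOM, could be benchmarked with plain k-means clustering. This is a minimal sketch of standard k-means, not the actual benchmark code; the function name and defaults are my own:

```python
import numpy as np

def kmeans(data, k, iters=100, seed=0):
    """Plain k-means: assign each vector to its nearest centroid,
    recompute centroids, and repeat until assignments stabilize."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # initialize centroids from k distinct data points
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    labels = np.zeros(len(data), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(data[:, None] - centroids[None], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
    return labels, centroids
```

If k-means over the same nGram vectors groups documents by author about as well as the SOM does, that would suggest the vectors, not the map, are doing most of the work.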

# AnnGram - Self-Organizing Map GUI

They say a picture is worth a thousand words:

One Thousand Words

# AnnGram - New GUI

The old GUI framework just wasn’t working out (so far as adding new features went).  So, long story short, I’ve switched GUI layout.

# AnnGram - Neural Network Progress

As expected, I’ve decided to change libraries. The poor results with the original tests may have been a direct result of a misunderstanding of the code base.  I think that the layers were not being hooked up correctly, resulting in low/random values.