AnnGram - Cosine Distance

Overview

The first algorithm that I’ve chosen to implement is a simple cosine distance between the n-gram vectors.  This was the first method used in several of the papers that I’ve read, and it seems like a good benchmark.

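Essentially, this method gives the similarity of two n-gram vectors (each representing either a Document or an Author) as an angle ranging from 0 (identical n-gram distributions) to \pi/2 (no n-grams in common).  The upper bound is \pi/2 rather than \pi because n-gram counts are never negative, so the dot product cannot be negative either.  Documents written by the same author should have the lowest values.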

Equation

\theta = \arccos \left( \frac{a \cdot b}{|a| \, |b|} \right)
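
To make the equation concrete, here is a minimal Python sketch of how it could be computed.  This is not the AnnGram code itself: the helper names `ngram_counts` and `cosine_angle` are hypothetical, and the sketch assumes character n-grams stored as raw counts, which may differ from how the real vectors are built.

```python
import math
from collections import Counter


def ngram_counts(text, n):
    """Count every overlapping character n-gram in text (assumed representation)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_angle(a, b):
    """Angle in radians between two sparse n-gram count vectors."""
    # Only n-grams present in both vectors contribute to the dot product.
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare an empty n-gram vector")
    # Clamp so floating-point rounding cannot push the ratio outside [0, 1].
    cosine = max(0.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)
```

Calling `cosine_angle(ngram_counts(text_a, 3), ngram_counts(text_b, 3))` gives the angle for n = 3.  Two texts with no trigram in common come out at \pi/2, and a text compared with itself comes out at (approximately) 0.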

Example values

Comparing all of the works of Shakespeare with the Book of Genesis:

  • n = 3, θ = 0.132
  • n = 4, θ = 0.387
  • n = 5, θ = 0.453
  • n = 6, θ = 0.527
  • average θ = 0.375

Comparing Shakespeare with one of his plays (As You Like It):

  • n = 3, θ = 0.083
  • n = 4, θ = 0.095
  • n = 5, θ = 0.096
  • n = 6, θ = 0.083
  • average θ = 0.090

As you can see, the basic premise holds up in this example.  Shakespeare is very similar to his own plays (θ stays below 0.1 for every n) and much less so to the Book of Genesis (average θ = 0.375).