AnnGram - Cosine Distance

Overview

The first algorithm that I’ve chosen to implement is a simple cosine distance between the n-gram vectors.  It was the first method used in several of the papers that I’ve read, and it seems like a good benchmark.

Essentially, this method gives the similarity of two n-gram vectors (built from either Documents or Authors) as an angle ranging from 0 (identical documents) to \pi/2 (completely different documents, with no n-grams in common).  Because n-gram counts are never negative, the angle cannot exceed \pi/2.  Documents written by the same author should have the lowest values.

Equation

\theta = \arccos \left( \frac{a \cdot b}{\|a\| \|b\|} \right)
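
To make the equation concrete, here is a minimal Python sketch of the calculation. It is not the AnnGram implementation: the function names `ngram_counts` and `cosine_angle` are made up for illustration, and it assumes character n-grams stored as sparse count vectors, which the post does not specify.

```python
import math
from collections import Counter


def ngram_counts(text, n):
    """Count overlapping character n-grams (assumption: character-level)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_angle(a, b):
    """Return the angle in radians between two sparse n-gram count vectors."""
    # Only n-grams present in both vectors contribute to the dot product.
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare an empty n-gram vector")
    # Clamp so floating-point rounding cannot push the ratio outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)
```

For example, `cosine_angle(ngram_counts(text_a, 3), ngram_counts(text_b, 3))` gives the n = 3 angle for two strings `text_a` and `text_b`. Two texts that share no trigrams come out at \pi/2, and a text compared with itself comes out at (approximately) 0.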

Example values

Comparing all of the works of Shakespeare with the Book of Genesis:

  • n = 3, θ = 0.132
  • n = 4, θ = 0.387
  • n = 5, θ = 0.453
  • n = 6, θ = 0.527
  • average θ = 0.375

Comparing Shakespeare with one of his plays (As You Like It):

  • n = 3, θ = 0.083
  • n = 4, θ = 0.095
  • n = 5, θ = 0.096
  • n = 6, θ = 0.083
  • average θ = 0.090

As you can see, the basic premise holds for this example.  Shakespeare is very similar to his own plays (average θ = 0.090) and much less so to the Book of Genesis (average θ = 0.375).
