AnnGram - Cosine Distance

Overview

The first algorithm that I’ve chosen to implement is a simple cosine distance between the n-gram vectors.  It was the first method used in several of the papers that I’ve read, and it seems like a good benchmark.

Essentially, this method gives the similarity of two n-gram vectors (built from either Documents or Authors) as an angle ranging from 0 (identical n-gram profiles) to \pi/2 (no n-grams in common).  The upper bound is \pi/2 rather than \pi because n-gram counts are never negative, so the cosine cannot drop below zero.  Documents written by the same author should have the lowest values.

Equation

\theta = \arccos \left( \frac{a \cdot b}{\|a\| \, \|b\|} \right)
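
To make the equation concrete, here is a minimal Python sketch of how the angle could be computed.  It is not the AnnGram implementation: the function names (ngram_counts, angle, average_angle) are made up for illustration, and it assumes character-level n-grams stored as sparse count dictionaries, which may differ from how the real vectors are built.

```python
import math
from collections import Counter


def ngram_counts(text, n):
    """Count overlapping character n-grams (character-level is an assumption)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def angle(a, b):
    """Angle in radians between two sparse n-gram count vectors."""
    # Only n-grams present in both vectors contribute to the dot product.
    dot = sum(count * b.get(gram, 0) for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare an empty n-gram vector")
    # Clamp so floating-point rounding cannot push the cosine outside [0, 1].
    cosine = max(0.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)


def average_angle(text_a, text_b, sizes=(3, 4, 5, 6)):
    """Mean angle over several n-gram sizes (hypothetical helper)."""
    angles = [angle(ngram_counts(text_a, n), ngram_counts(text_b, n)) for n in sizes]
    return sum(angles) / len(angles)
```

Calling average_angle on two texts returns a value between 0 and \pi/2, with the default sizes matching n = 3 through n = 6.  Storing only the n-grams that actually occur keeps the vectors small, since most possible n-grams never appear in a given text.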

Example values

Comparing all of the works of Shakespeare with the Book of Genesis:

• n = 3, θ = 0.132
• n = 4, θ = 0.387
• n = 5, θ = 0.453
• n = 6, θ = 0.527
• average θ = 0.375

Comparing Shakespeare with one of his plays (As You Like It):

• n = 3, θ = 0.083
• n = 4, θ = 0.095
• n = 5, θ = 0.096
• n = 6, θ = 0.083
• average θ = 0.090

As you can see, these values support the basic premise.  Shakespeare is far closer to one of his own plays (average θ = 0.090) than to the Book of Genesis (average θ = 0.375).