Dictionary tries in Racket


For the next few posts, we're going to need a way to represent a dictionary. You could go with just a flat list containing all of the words in the dictionary, but the runtime doesn't seem optimal. Instead, we want a data structure that lets you easily get all possible words that start with a given prefix. We want a trie.

. Source: Wikipedia, public domain

Essentially a trie is a misspelled tree. :) More seriously, a trie is basically a tree where any node can have any number of children. In this case, the root of our trie will represent the start of every word in the dictionary. Then at each level, each branch will be either a single character or the symbol word. If it's a character, the branch represents continuing down the word. If 'word is set, that means that this is a valid ending of a word.

For example, here is such a dictionary containing just the words CAT, COT, COG and their plurals:

So how do we build such a structure?

First, we'll start with a structure that basically just wraps the nested hashtables in case I print one without meaning to (they're kind of large).

; make the structure opaque in case I try to print them
(define-struct :dictionary: (value))

; make a dictionary
(define (make-dictionary)
  (make-:dictionary: (make-hasheq)))

After that, we need a function that can add a word to an already existing dictionary:

; add a word to a dictionary (internally a trie)
; keys are either characters => nested trie or 'word => #t/#f
(define (add-word! dict word)
  (let loop ([cs (string->list (string-upcase word))]
             [node (:dictionary:-value dict)])
    (cond
      [(null? cs)
       (hash-set! node 'word #t)]
      [else
       (when (not (hash-ref node (car cs) #f))
         (hash-set! node (car cs) (make-hasheq)))
       (loop (cdr cs) (hash-ref node (car cs)))])))

Basically, we use the loop to recur down the nested structure, adding new branches to the tree as needed with the (when ...) part. When we get to the end, we set that hash's word-ness to #t.

Easy enough. So what if we wanted to reverse that and check if a word is in the dictionary? (Not strictly necessary for the given problem, but useful for testing.)

; helper to do a partial lookup in a dictionary
; returns a node in the trie if given a valid prefix/word
; returns #f otherwise
(define (get-node dict word)
  (let loop ([cs (string->list (string-upcase word))]
             [node (:dictionary:-value dict)])
    (or (and (null? cs)
             node)
        (and (not (null? cs))
             node
             (loop (cdr cs)
                   (hash-ref node (car cs) #f))))))

; check if a word is in the dictionary
(define (contains? dict word)
  (let ([node (get-node dict word)])
    (and node (hash-ref node 'word #f))))

get-node abstracts out the first part of that by handling the main part of the recursion for us. It will go down a tree until it gets the node that we want or #f if the specified prefix/word doesn't exist. All contains? has to do then is to make sure that we did actually get a hash and that it represents the end of a word.

After that, the next step is to write a function that will load a dictionary file into this trie structure. Basically, I can use yesterday's code with only minor modifications:

; load a dictionary from a file
; one word per line, case doesn't matter
; ignores any words that have non-alphabetic characters
(define (load-dictionary filename)
  (let ([dict (make-dictionary)])
    (with-input-from-file filename
      (? ()
        (let loop ([line (read-line)])
          (unless (eof-object? line)
            (let ([word (string-trim line)])
              (add-word! dict word))
            (loop (read-line))))))
    dict))

The string-trim call is necessary to remove the trailing linefeed that read-line returns.

Finally, a few more helper functions. First, a lookup function to make going one level down a tree easier.

; given a dictionary (or it's node (or #f)) and a letter
; return the nested value or #f it there isn't one
(define (lookup dict letter)
  (let ([node (if (:dictionary:? dict) (:dictionary:-value dict) dict)])
    (and node
         (hash-ref node letter #f))))

And then we want a function that will return all words that start with a given prefix. This is a little more complicated, but hopefully not unreasonably so:

; get a list of all words in the dictionary with a given prefix
(define (words-by-prefix dict prefix)
  (let ([prefix (string-upcase prefix)])
    ; once we have the suffixes, add our prefix
    (map
     (? (ls) (string-append prefix (list->string ls)))
     ; loop from the node to all leaves
     (let loop ([node (get-node dict prefix)])
       (if node
           (let ([r (apply
                     append
                     ; add the current letter to each recursions
                     (for/list ([(k v) node]
                                #:when (and (not (eq? k 'word)) v))
                       (map (? (ls) (cons k ls))
                            (loop v))))])
             ; include the prefix if it's a word
             (if (hash-ref node 'word #f)
                 (cons '() r)
                 r))
           '())))))

The core of that deserves some special attention:

; add the current letter to each recursions
(for/list ([(k v) node]
           #:when (and (not (eq? k 'word)) v))
  (map (? (ls) (cons k ls))
       (loop v)))

Basically, I love Racket's for family of macros. This one says for each key/value pair k/v in the current level (ignoring word endings at this point), recur into that branch. Then we append each of those sublists together. Finally, if the node we're currently at ends a word add that to the recursion. And that's it. Pretty simple, yes? Let's try it.

> (words-by-prefix dict "DRAGON")
'("DRAGON" "DRAGONET" "DRAGONFLY" "DRAGONFLIES" "DRAGONHEAD" "DRAGONS")

If you would like the source code for this module, you can download it here: dictionary source code

    comments powered by Disqus