An optimal alphabetizing cipher

Here is today’s /r/dailyprogramming challenge: Generate a simple substitution cipher such that the maximum number of words in a given dictionary of six letter words (there are 7,260 of them) are encoded as strings in alphabetical order.

For example, if you have the key jfbqpwcvuamozhilgrxtkndesy, a would may to j, b to f and so on. Thus the word bakery would become fjmprs (which is alphabetical). It turns out that of the given 7,260 word list, exactly 60 would evaluate to alphabetical orderings. So the score for that cipher key could be thought of as 60. The goal? Find the highest scoring cipher key.

If you’d like to follow along, you can download the entire code from GitHub.

To start with, we want a few utility functions to generate the ciphers from a key and to score them. This is one place where a functional language shines as you can write a function to generate ciphers that returns another function that does the encoding:

; create an encoding function from a key
(define (make-cipher key)
  (lambda (plaintext)
     (string-length plaintext)
     (lambda (i)
       (string-ref key (- (char->integer (string-ref plaintext i)) 97))))))

Likewise, it’s very straight forward to read in the entire word list, to test if a string is sorted, and to score a cipher over and entire word list.

; load a word list to score from
(define fn "word-list.txt")
(define word-list
  (with-input-from-file fn
    (lambda () (for/list ([word (in-lines)]) word))))

; test if a word is sorted
(define (word-sorted? word)
   (lambda (i) (char<=? (string-ref word i) (string-ref word (+ i 1))))
   (range (- (string-length word) 1))))

; score a cipher
(define (score-cipher cipher)
  (for/sum ([word (in-list word-list)])
    (if (word-sorted? (cipher word)) 1 0)))

To make sure that things are working well, we can test the provided sample cipher:

> (define test-encode (make-cipher "jfbqpwcvuamozhilgrxtkndesy"))
> (test-encode "bakery")
> (score-cipher test-encode)

Everything is as it should be.

Next, we start getting into solutions. As of the writing, several people on the original thread have found solutions with scores of 474 (over 100 unique solutions), so it seems that’s the score to beat (if it’s even possible; there’s not actually a known solution to the problem). So I’m justing going to throw a few things out there and see what sticks.

My first idea was to just start with some random solution and repeatedly mutate it (by swapping two random letters, a sort of Monte Carlo algorithm). If the new solution is better, keep it. If not, try again. It turns out that the code for that is pretty straight forward, although I did add some code that after a given number of seconds have passed without a new solution we’ll just random start over. Basically, I was seeing a lot of dead ends that it couldn’t get out of even with the maximum number of swaps (10).

; solve via random swapping
(define (solve/monte-carlo [guess (random-key)]
                           [score -inf.0]
                           [timeout 10]
                           [timer (current-seconds)])

  ; generate a new guess by swapping up to 10 pairs
  (define new-guess (string-copy guess))
  (for ([i (in-range (+ (random 10) 1))])
    (string-swap! new-guess (random 26) (random 26)))
  (define new-score (score-key new-guess))

  ; if we have a new best, report it
    [(> (- (current-seconds) timer) timeout)
     (printf "timeout\n")
     (solve/monte-carlo (random-key) -inf.0 timeout (current-seconds))]
    [(> new-score score)
     (printf "~a (~a)\n" new-guess new-score)
     (solve/monte-carlo new-guess new-score timeout (current-seconds))]
     (solve/monte-carlo guess score timeout timer)]))

And here’s a sample run of that (with default arguments):

> (solve/monte-carlo)
svpakuqixjtmngdbrywozehlcf (9)
svkapgyxijtmqudbrnwozehlcf (27)
svbkopyxinteqldajrwgzmhucf (48)
... 23 lines cut ...
kacyxdwgtbvelufspjzminhrqo (353)
kacyxbwgtdvelufspjzminhrqo (354)
kacyxbwftdvelugspjzminhrqo (369)
vknreomstqpwihjfgxydczlaub (14)
vkngtomreqpwicjfzxydbslauh (45)
deogtnvrmqpwickxafyjbslzuh (53)
... 35 lines cut ...
jbdvtapenirfgohczuyqmslwxk (402)
jbdvtapenirfgohczuyqkslwxm (404)
jbdvtapenhrfgoiczuyqkslwxm (411)
rsfykdbjaepmczwnoihxtgulvq (4)
ysfrkdtjaepmczwnoihxbgulvq (5)
ysfmkctjaeprdzwnoihxbgulvq (7)
... 39 lines cut ...
madxvesgqfuhbrickwztnolpyj (430)
kadxvesgqfuhbricmwztnolpyj (435)
kadxvesgqfuhbricmwztnojpyl (451)

Basically, it’s finding some decently good solutions pretty quickly. (Since the timeout is 10 seconds, each step cannot take longer than that. In actuality, they’re only taking a few seconds each to get stuck.) Leave it running long enough and you can find some pretty nice solutions:

; found solutions: 

; solve/monte-carlo
; iacxvdreofugkphbtwzsjlmqyn (440)
; hcawtqpemyrkbnfduvzsilgjxo (385)
; hbrxucsenftdmogavwzqipjlyk (425)
; idpwuamrkbqfelgczvyojthnxs (448)

But still, nothing reaching even the known optimal case yet.

My second idea was to directly implement a hill-climbing solution, always greedily choosing the best single step (if there is one). Basically, given a single solution try all 351 (choose one of 26 positions for i and a position after that for j) possible swaps. If there’s a better one, recur on that. If not, shake up the current solution using a method similar to the one used in the solve/monte-carlo method.

; solve via direct hill climbing
(define (solve/hill-climbing [guess (random-key)]
                             [score -inf.0]
                             [overall-guess guess]
                             [overall-score score])

  ; try every possible single swap
  (define-values (new-guess new-score)
    (for*/fold ([guess guess]
                [score score])
               ([i (in-range 26)]
                [j (in-range i 26)])
      (define new-guess (string-swap guess i j))
      (define new-score (score-key new-guess))
      (if (> new-score score)
          (values new-guess new-score)
          (values guess score))))

  ; update the overall best (will actually print next round)
  (define-values (new-overall-guess new-overall-score)
    (if (>= new-score overall-score)
        (values new-guess new-score)
        (values overall-guess overall-score)))

  ; print out local best values and best overall
    [(equal? guess new-guess)
     (printf "local maximum, shuffling\n")
     (for ([i (in-range (+ (random 6) 4))])
       (string-swap! guess (random 26) (random 26)))
     (define new-score (score-key new-guess))
     (printf "~a (~a)  \toverall: ~a (~a)\n" new-guess new-score overall-guess overall-score) 
     (solve/hill-climbing new-guess new-score new-overall-guess new-overall-score)]
     (printf "~a (~a)  \toverall: ~a (~a)\n" new-guess new-score overall-guess overall-score) 
     (solve/hill-climbing new-guess new-score new-overall-guess new-overall-score)]))

It turns out that there are a lot of local maximums for this sort of solution. Just about every time, the new solution will go 2-4 rounds and find a position that can’t be improved with a single swap. Here’s an example run (yes, the overall output does lag by one, but that doesn’t really matter in the long run):

> (solve/hill-climbing)
dbawvnpclotefmgisxzqhkjuyr (322)        overall: wlrqujtcmeifhdgpykzaxbsvno (-inf.0)
hdsxvauflcteiogbmwzqnkjpyr (413)        overall: dbawvnpclotefmgisxzqhkjuyr (322)
jdcxvauflgtesohbmwzqnpikyr (432)        overall: hdsxvauflcteiogbmwzqnkjpyr (413)
jdcxvauflgtepohbmwzqnsikyr (433)        overall: jdcxvauflgtesohbmwzqnpikyr (432)
jdcxvauflgteophbmwzqnsikyr (435)        overall: jdcxvauflgtepohbmwzqnsikyr (433)
local maximum, shuffling
ueixvqmflgtdophbjzwkrscayn (43)         overall: ueixvqmflgtdophbjzwkrscayn (435)
idswvfeuoctgaphbjxzqlnkmyr (432)        overall: ueixvqmflgtdophbjzwkrscayn (435)
kdsxvcueoftgaphbjwzqlnimyr (443)        overall: ueixvqmflgtdophbjzwkrscayn (435)
jdsxvcueoftgaphbkwzqlnimyr (447)        overall: kdsxvcueoftgaphbjwzqlnimyr (443)
local maximum, shuffling
jpsbocuegfdvatqxzwnhlkimyr (12)         overall: jpsbocuegfdvatqxzwnhlkimyr (447)
qnarubtdpexcjsfkgvzwhoimyl (252)        overall: jpsbocuegfdvatqxzwnhlkimyr (447)
fnaxubtdrpqcjsekgvzwhoimyl (328)        overall: jpsbocuegfdvatqxzwnhlkimyr (447)
local maximum, shuffling

So this time we haven’t gotten an ‘optimal solution’ but in other runs I have:

; found solutions: 

; solve/hill-climbing
; jdrxvasuobtegphcfwzqknimyl (470)
; ibsxvaruoetfdpgclwzqmnhkyj (470)
; kjpxvuqioesdfngcawzrtmhlyb (473)
; iarzkcfjotudepgbswxqvmhlyn (474)

It also works much more quickly. In general, I’ve gotten to a 474 solution in only a few minutes, although it’s never yet gone past that.

Finally, I have a partial solution. I was wondering if you could optimize the problem specifically for the given 7,260 words rather than trying to solve it in general (as both of the current algorithms will do). To do this, I constructed a 26x26 matrix that shows how often one letter follows another in a word. What we would want to do with that is solve a maximization problem, essentially finding a maximum path through this matrix. It’s a bit more complicated than that though as we aren’t directly dealing with the words, we’re dealing with their encoded forms. I didn’t actually make it beyond that point (and all of the solutions that I found with what I did have were terrible, with scores hovering around 30 or lower). Still, I think the idea has some merit.

Here’s what I have to construct the matrix if anyone feels like working from here:

; solve using matrix math
; TODO: unfinished
(define (solve/matrix-math)
  ; count how many times each letter should be after each letter
  (define m
    (for/vector ([i (in-range 26)])
      (for/vector ([i (in-range 26)])

  ; increment a cell in that matrix
  (define (@ i j)
    (vector-ref (vector-ref m i) j))
  (define (++ i j)
    (vector-set! (vector-ref m i) j (+ 1 (@ i j))))

  ; add all of the words
  (for* ([word (in-list word-list)]
         [i (in-range (string-length word))]
         [j (in-range i (string-length word))])
    (++ (- (char->integer (string-ref word i)) 97)
        (- (char->integer (string-ref word j)) 97)))

  ; reset diagonal
  (for ([i (in-range 26)])
    (vector-set! (vector-ref m i) i 0))


If you convert that into an HTML table, you’d see values something like this:


(Yes it’s tiny.)

Basically, they’re pretty chaotic at first glance, but there are definite pockets of both higher and lower values.

If I get some more time, I’ll see what more I can do with this solution, but for now, that’s what I have.

Well that’s all I have for now. Like always, the entire code (which will also be updated if I work more on this) is availble on GitHub:

Edit 28 Jan 2013:

An idea I’ve been working on is to generate a graph of pairwise compatible words (any word that doesn’t have a before b in one word and b before a in the other). If we can find that, the maximum clique should (theoretically) be the exact set of words generated by an optimal cipher.

Unfortunately, finding the maximum clique of a graph is NP-hard. And with a 7260 node dense graph with quite a lot of maximal yet non-maximum cliques, it takes a really long time to run. I’ve been running the Bron-Kerbosch algorithm (with pivot) on it since last night, but the largest clique so far is only 44 items. It still hasn’t gone beyond cliques that have to include the first three words.

Does anyone have any thoughts? Is there anything that would have to be added to the definition of compatible? words to make it work? Any quicker way to find maximum cliques?

(Here’s a copy of the graph in the binary DIMACS format if it would be helpful: graph.b.clq, 3.1 mb)

Edit 31 Jan 2013:

Still running, although the best clique so far is only 49 nodes. I doubt this will complete while I still have access to the machine it’s running on, but I guess there’s no harm in letting it go.