An optimal alphabetizing cipher

Here is today’s /r/dailyprogramming challenge: Generate a simple substitution cipher such that the maximum number of words in a given dictionary of six letter words (there are 7,260 of them) are encoded as strings in alphabetical order.

For example, if you have the key jfbqpwcvuamozhilgrxtkndesy, a would may to j, b to f and so on. Thus the word bakery would become fjmprs (which is alphabetical). It turns out that of the given 7,260 word list, exactly 60 would evaluate to alphabetical orderings. So the score for that cipher key could be thought of as 60. The goal? Find the highest scoring cipher key.

If you’d like to follow along, you can download the entire code from GitHub.

To start with, we want a few utility functions to generate the ciphers from a key and to score them. This is one place where a functional language shines as you can write a function to generate ciphers that returns another function that does the encoding:

; create an encoding function from a key
(define (make-cipher key)
  (lambda (plaintext)
    (build-string
     (string-length plaintext)
     (lambda (i)
       (string-ref key (- (char->integer (string-ref plaintext i)) 97))))))

Likewise, it’s very straight forward to read in the entire word list, to test if a string is sorted, and to score a cipher over and entire word list.

; load a word list to score from
(define fn "word-list.txt")
(define word-list
  (with-input-from-file fn
    (lambda () (for/list ([word (in-lines)]) word))))

; test if a word is sorted
(define (word-sorted? word)
  (andmap 
   (lambda (i) (char<=? (string-ref word i) (string-ref word (+ i 1))))
   (range (- (string-length word) 1))))

; score a cipher
(define (score-cipher cipher)
  (for/sum ([word (in-list word-list)])
    (if (word-sorted? (cipher word)) 1 0)))

To make sure that things are working well, we can test the provided sample cipher:

> (define test-encode (make-cipher "jfbqpwcvuamozhilgrxtkndesy"))
> (test-encode "bakery")
"fjmprs"
> (score-cipher test-encode)
60

Everything is as it should be.

Next, we start getting into solutions. As of the writing, several people on the original thread have found solutions with scores of 474 (over 100 unique solutions), so it seems that’s the score to beat (if it’s even possible; there’s not actually a known solution to the problem). So I’m justing going to throw a few things out there and see what sticks.


My first idea was to just start with some random solution and repeatedly mutate it (by swapping two random letters, a sort of Monte Carlo algorithm). If the new solution is better, keep it. If not, try again. It turns out that the code for that is pretty straight forward, although I did add some code that after a given number of seconds have passed without a new solution we’ll just random start over. Basically, I was seeing a lot of dead ends that it couldn’t get out of even with the maximum number of swaps (10).

; solve via random swapping
(define (solve/monte-carlo [guess (random-key)]
                           [score -inf.0]
                           [timeout 10]
                           [timer (current-seconds)])

  ; generate a new guess by swapping up to 10 pairs
  (define new-guess (string-copy guess))
  (for ([i (in-range (+ (random 10) 1))])
    (string-swap! new-guess (random 26) (random 26)))
  (define new-score (score-key new-guess))

  ; if we have a new best, report it
  (cond
    [(> (- (current-seconds) timer) timeout)
     (printf "timeout\n")
     (solve/monte-carlo (random-key) -inf.0 timeout (current-seconds))]
    [(> new-score score)
     (printf "~a (~a)\n" new-guess new-score)
     (solve/monte-carlo new-guess new-score timeout (current-seconds))]
    [else
     (solve/monte-carlo guess score timeout timer)]))

And here’s a sample run of that (with default arguments):

> (solve/monte-carlo)
svpakuqixjtmngdbrywozehlcf (9)
svkapgyxijtmqudbrnwozehlcf (27)
svbkopyxinteqldajrwgzmhucf (48)
... 23 lines cut ...
kacyxdwgtbvelufspjzminhrqo (353)
kacyxbwgtdvelufspjzminhrqo (354)
kacyxbwftdvelugspjzminhrqo (369)
timeout
vknreomstqpwihjfgxydczlaub (14)
vkngtomreqpwicjfzxydbslauh (45)
deogtnvrmqpwickxafyjbslzuh (53)
... 35 lines cut ...
jbdvtapenirfgohczuyqmslwxk (402)
jbdvtapenirfgohczuyqkslwxm (404)
jbdvtapenhrfgoiczuyqkslwxm (411)
timeout
rsfykdbjaepmczwnoihxtgulvq (4)
ysfrkdtjaepmczwnoihxbgulvq (5)
ysfmkctjaeprdzwnoihxbgulvq (7)
... 39 lines cut ...
madxvesgqfuhbrickwztnolpyj (430)
kadxvesgqfuhbricmwztnolpyj (435)
kadxvesgqfuhbricmwztnojpyl (451)
timeout
...

Basically, it’s finding some decently good solutions pretty quickly. (Since the timeout is 10 seconds, each step cannot take longer than that. In actuality, they’re only taking a few seconds each to get stuck.) Leave it running long enough and you can find some pretty nice solutions:

; found solutions: 

; solve/monte-carlo
; iacxvdreofugkphbtwzsjlmqyn (440)
; hcawtqpemyrkbnfduvzsilgjxo (385)
; hbrxucsenftdmogavwzqipjlyk (425)
; idpwuamrkbqfelgczvyojthnxs (448)

But still, nothing reaching even the known optimal case yet.


My second idea was to directly implement a hill-climbing solution, always greedily choosing the best single step (if there is one). Basically, given a single solution try all 351 (choose one of 26 positions for i and a position after that for j) possible swaps. If there’s a better one, recur on that. If not, shake up the current solution using a method similar to the one used in the solve/monte-carlo method.

; solve via direct hill climbing
(define (solve/hill-climbing [guess (random-key)]
                             [score -inf.0]
                             [overall-guess guess]
                             [overall-score score])

  ; try every possible single swap
  (define-values (new-guess new-score)
    (for*/fold ([guess guess]
                [score score])
               ([i (in-range 26)]
                [j (in-range i 26)])
      (define new-guess (string-swap guess i j))
      (define new-score (score-key new-guess))
      (if (> new-score score)
          (values new-guess new-score)
          (values guess score))))

  ; update the overall best (will actually print next round)
  (define-values (new-overall-guess new-overall-score)
    (if (>= new-score overall-score)
        (values new-guess new-score)
        (values overall-guess overall-score)))

  ; print out local best values and best overall
  (cond
    [(equal? guess new-guess)
     (printf "local maximum, shuffling\n")
     (for ([i (in-range (+ (random 6) 4))])
       (string-swap! guess (random 26) (random 26)))
     (define new-score (score-key new-guess))
     (printf "~a (~a)  \toverall: ~a (~a)\n" new-guess new-score overall-guess overall-score) 
     (solve/hill-climbing new-guess new-score new-overall-guess new-overall-score)]
    [else
     (printf "~a (~a)  \toverall: ~a (~a)\n" new-guess new-score overall-guess overall-score) 
     (solve/hill-climbing new-guess new-score new-overall-guess new-overall-score)]))

It turns out that there are a lot of local maximums for this sort of solution. Just about every time, the new solution will go 2-4 rounds and find a position that can’t be improved with a single swap. Here’s an example run (yes, the overall output does lag by one, but that doesn’t really matter in the long run):

> (solve/hill-climbing)
dbawvnpclotefmgisxzqhkjuyr (322)        overall: wlrqujtcmeifhdgpykzaxbsvno (-inf.0)
hdsxvauflcteiogbmwzqnkjpyr (413)        overall: dbawvnpclotefmgisxzqhkjuyr (322)
jdcxvauflgtesohbmwzqnpikyr (432)        overall: hdsxvauflcteiogbmwzqnkjpyr (413)
jdcxvauflgtepohbmwzqnsikyr (433)        overall: jdcxvauflgtesohbmwzqnpikyr (432)
jdcxvauflgteophbmwzqnsikyr (435)        overall: jdcxvauflgtepohbmwzqnsikyr (433)
local maximum, shuffling
ueixvqmflgtdophbjzwkrscayn (43)         overall: ueixvqmflgtdophbjzwkrscayn (435)
idswvfeuoctgaphbjxzqlnkmyr (432)        overall: ueixvqmflgtdophbjzwkrscayn (435)
kdsxvcueoftgaphbjwzqlnimyr (443)        overall: ueixvqmflgtdophbjzwkrscayn (435)
jdsxvcueoftgaphbkwzqlnimyr (447)        overall: kdsxvcueoftgaphbjwzqlnimyr (443)
local maximum, shuffling
jpsbocuegfdvatqxzwnhlkimyr (12)         overall: jpsbocuegfdvatqxzwnhlkimyr (447)
qnarubtdpexcjsfkgvzwhoimyl (252)        overall: jpsbocuegfdvatqxzwnhlkimyr (447)
fnaxubtdrpqcjsekgvzwhoimyl (328)        overall: jpsbocuegfdvatqxzwnhlkimyr (447)
local maximum, shuffling
...

So this time we haven’t gotten an ‘optimal solution’ but in other runs I have:

; found solutions: 

; solve/hill-climbing
; jdrxvasuobtegphcfwzqknimyl (470)
; ibsxvaruoetfdpgclwzqmnhkyj (470)
; kjpxvuqioesdfngcawzrtmhlyb (473)
; iarzkcfjotudepgbswxqvmhlyn (474)

It also works much more quickly. In general, I’ve gotten to a 474 solution in only a few minutes, although it’s never yet gone past that.


Finally, I have a partial solution. I was wondering if you could optimize the problem specifically for the given 7,260 words rather than trying to solve it in general (as both of the current algorithms will do). To do this, I constructed a 26x26 matrix that shows how often one letter follows another in a word. What we would want to do with that is solve a maximization problem, essentially finding a maximum path through this matrix. It’s a bit more complicated than that though as we aren’t directly dealing with the words, we’re dealing with their encoded forms. I didn’t actually make it beyond that point (and all of the solutions that I found with what I did have were terrible, with scores hovering around 30 or lower). Still, I think the idea has some merit.

Here’s what I have to construct the matrix if anyone feels like working from here:

; solve using matrix math
; TODO: unfinished
(define (solve/matrix-math)
  ; count how many times each letter should be after each letter
  (define m
    (for/vector ([i (in-range 26)])
      (for/vector ([i (in-range 26)])
        0)))

  ; increment a cell in that matrix
  (define (@ i j)
    (vector-ref (vector-ref m i) j))
  (define (++ i j)
    (vector-set! (vector-ref m i) j (+ 1 (@ i j))))

  ; add all of the words
  (for* ([word (in-list word-list)]
         [i (in-range (string-length word))]
         [j (in-range i (string-length word))])
    (++ (- (char->integer (string-ref word i)) 97)
        (- (char->integer (string-ref word j)) 97)))

  ; reset diagonal
  (for ([i (in-range 26)])
    (vector-set! (vector-ref m i) i 0))

  m)

If you convert that into an HTML table, you’d see values something like this:

0141335603139681333227761142647353258753972001989912226633151451146136568
3280811975362110576278773300602092662653403951852069511413124
452530220765257433030922352921302654181301243754229621949381517816
2323581051525120412805461608919121248127942116812330351012811
5107619789805413712642756045718255429611861052130157016037556432319
189186112031106138198537243661151535321622216213811161910714
248293313951320010520717206791782033702543941811332323110818
33829541935521760028458219611420827257031338815911227361116315
3626334645711947644518901016450423399726112814512104053514173174224032
57101632892227610253615485514047633659363171
72991324059383213730821910447190137206633611241143
47235143371107536157994681126016636631681835170535324770433036125
3501321041815771378713477602420273240145623247419216415121414313
268301913547863845197272815324991021946431069636813830362020719
4031071984431043432461525348138449211661014710660992461372871675424134
340201212096071889135387368280522932950536642227715322272116112
3807656038400108319131026343110222083
626113205423103060208171677614036522549844416090107342425577752433532
3984017929280833108285389312631113029627717017409052317955692424013
3624314522178321106257407188258103288333883501619015220522425523
3051071893327913119014137861453822224591401862463724442030322117524
10052149306232101540487149092801381927041079394
172728132335167193204046119331585736019519891232204926
231121356113255012243625702560309100200
941236671712553411401984541321014619419063326171002
38711251113189671528104647403973199100330

(Yes it’s tiny.)

Basically, they’re pretty chaotic at first glance, but there are definite pockets of both higher and lower values.

If I get some more time, I’ll see what more I can do with this solution, but for now, that’s what I have.

Well that’s all I have for now. Like always, the entire code (which will also be updated if I work more on this) is availble on GitHub:

Edit 28 Jan 2013:

An idea I’ve been working on is to generate a graph of pairwise compatible words (any word that doesn’t have a before b in one word and b before a in the other). If we can find that, the maximum clique should (theoretically) be the exact set of words generated by an optimal cipher.

Unfortunately, finding the maximum clique of a graph is NP-hard. And with a 7260 node dense graph with quite a lot of maximal yet non-maximum cliques, it takes a really long time to run. I’ve been running the Bron-Kerbosch algorithm (with pivot) on it since last night, but the largest clique so far is only 44 items. It still hasn’t gone beyond cliques that have to include the first three words.

Does anyone have any thoughts? Is there anything that would have to be added to the definition of compatible? words to make it work? Any quicker way to find maximum cliques?

(Here’s a copy of the graph in the binary DIMACS format if it would be helpful: graph.b.clq, 3.1 mb)

Edit 31 Jan 2013:

Still running, although the best clique so far is only 49 nodes. I doubt this will complete while I still have access to the machine it’s running on, but I guess there’s no harm in letting it go.