An optimal alphabetizing cipher

Here is today's /r/dailyprogramming challenge: Generate a simple substitution cipher such that the maximum number of words in a given dictionary of six letter words (there are 7,260 of them) are encoded as strings in alphabetical order.

For example, if you have the key jfbqpwcvuamozhilgrxtkndesy, a would may to j, b to f and so on. Thus the word bakery would become fjmprs (which is alphabetical). It turns out that of the given 7,260 word list, exactly 60 would evaluate to alphabetical orderings. So the score for that cipher key could be thought of as 60. The goal? Find the highest scoring cipher key.

To start with, we want a few utility functions to generate the ciphers from a key and to score them. This is one place where a functional language shines as you can write a function to generate ciphers that returns another function that does the encoding:

; create an encoding function from a key
(define (make-cipher key)
(lambda (plaintext)
(build-string
(string-length plaintext)
(lambda (i)
(string-ref key (- (char->integer (string-ref plaintext i)) 97))))))

Likewise, it's very straight forward to read in the entire word list, to test if a string is sorted, and to score a cipher over and entire word list.

; load a word list to score from
(define fn "word-list.txt")
(define word-list
(with-input-from-file fn
(lambda () (for/list ([word (in-lines)]) word))))

; test if a word is sorted
(define (word-sorted? word)
(andmap
(lambda (i) (char<=? (string-ref word i) (string-ref word (+ i 1))))
(range (- (string-length word) 1))))

; score a cipher
(define (score-cipher cipher)
(for/sum ([word (in-list word-list)])
(if (word-sorted? (cipher word)) 1 0)))

To make sure that things are working well, we can test the provided sample cipher:

> (define test-encode (make-cipher "jfbqpwcvuamozhilgrxtkndesy"))
> (test-encode "bakery")
"fjmprs"
> (score-cipher test-encode)
60

Everything is as it should be.

Next, we start getting into solutions. As of the writing, several people on the original thread have found solutions with scores of 474 (over 100 unique solutions), so it seems that's the score to beat (if it's even possible; there's not actually a known solution to the problem). So I'm justing going to throw a few things out there and see what sticks.

My first idea was to just start with some random solution and repeatedly mutate it (by swapping two random letters, a sort of Monte Carlo algorithm). If the new solution is better, keep it. If not, try again. It turns out that the code for that is pretty straight forward, although I did add some code that after a given number of seconds have passed without a new solution we'll just random start over. Basically, I was seeing a lot of dead ends that it couldn't get out of even with the maximum number of swaps (10).

; solve via random swapping
(define (solve/monte-carlo [guess (random-key)]
[score -inf.0]
[timeout 10]
[timer (current-seconds)])

; generate a new guess by swapping up to 10 pairs
(define new-guess (string-copy guess))
(for ([i (in-range (+ (random 10) 1))])
(string-swap! new-guess (random 26) (random 26)))
(define new-score (score-key new-guess))

; if we have a new best, report it
(cond
[(> (- (current-seconds) timer) timeout)
(printf "timeout\n")
(solve/monte-carlo (random-key) -inf.0 timeout (current-seconds))]
[(> new-score score)
(printf "~a (~a)\n" new-guess new-score)
(solve/monte-carlo new-guess new-score timeout (current-seconds))]
[else
(solve/monte-carlo guess score timeout timer)]))

And here's a sample run of that (with default arguments):

> (solve/monte-carlo)
svpakuqixjtmngdbrywozehlcf (9)
svkapgyxijtmqudbrnwozehlcf (27)
svbkopyxinteqldajrwgzmhucf (48)
... 23 lines cut ...
kacyxdwgtbvelufspjzminhrqo (353)
kacyxbwgtdvelufspjzminhrqo (354)
kacyxbwftdvelugspjzminhrqo (369)
timeout
vknreomstqpwihjfgxydczlaub (14)
vkngtomreqpwicjfzxydbslauh (45)
deogtnvrmqpwickxafyjbslzuh (53)
... 35 lines cut ...
jbdvtapenirfgohczuyqmslwxk (402)
jbdvtapenirfgohczuyqkslwxm (404)
jbdvtapenhrfgoiczuyqkslwxm (411)
timeout
rsfykdbjaepmczwnoihxtgulvq (4)
ysfrkdtjaepmczwnoihxbgulvq (5)
ysfmkctjaeprdzwnoihxbgulvq (7)
... 39 lines cut ...
timeout
...

Basically, it's finding some decently good solutions pretty quickly. (Since the timeout is 10 seconds, each step cannot take longer than that. In actuality, they're only taking a few seconds each to get stuck.) Leave it running long enough and you can find some pretty nice solutions:

; found solutions:

; solve/monte-carlo
; iacxvdreofugkphbtwzsjlmqyn (440)
; hcawtqpemyrkbnfduvzsilgjxo (385)
; hbrxucsenftdmogavwzqipjlyk (425)
; idpwuamrkbqfelgczvyojthnxs (448)

But still, nothing reaching even the known optimal case yet.

My second idea was to directly implement a Hill climbing, always greedily choosing the best single step (if there is one). Basically, given a single solution try all 351 (choose one of 26 positions for i and a position after that for j) possible swaps. If there's a better one, recur on that. If not, shake up the current solution using a method similar to the one used in the solve/monte-carlo method.

; solve via direct hill climbing
(define (solve/hill-climbing [guess (random-key)]
[score -inf.0]
[overall-guess guess]
[overall-score score])

; try every possible single swap
(define-values (new-guess new-score)
(for*/fold ([guess guess]
[score score])
([i (in-range 26)]
[j (in-range i 26)])
(define new-guess (string-swap guess i j))
(define new-score (score-key new-guess))
(if (> new-score score)
(values new-guess new-score)
(values guess score))))

; update the overall best (will actually print next round)
(define-values (new-overall-guess new-overall-score)
(if (>= new-score overall-score)
(values new-guess new-score)
(values overall-guess overall-score)))

; print out local best values and best overall
(cond
[(equal? guess new-guess)
(printf "local maximum, shuffling\n")
(for ([i (in-range (+ (random 6) 4))])
(string-swap! guess (random 26) (random 26)))
(define new-score (score-key new-guess))
(printf "~a (~a)  \toverall: ~a (~a)\n" new-guess new-score overall-guess overall-score)
(solve/hill-climbing new-guess new-score new-overall-guess new-overall-score)]
[else
(printf "~a (~a)  \toverall: ~a (~a)\n" new-guess new-score overall-guess overall-score)
(solve/hill-climbing new-guess new-score new-overall-guess new-overall-score)]))

It turns out that there are a lot of local maximums for this sort of solution. Just about every time, the new solution will go 2-4 rounds and find a position that can't be improved with a single swap. Here's an example run (yes, the overall output does lag by one, but that doesn't really matter in the long run):

> (solve/hill-climbing)
dbawvnpclotefmgisxzqhkjuyr (322)        overall: wlrqujtcmeifhdgpykzaxbsvno (-inf.0)
hdsxvauflcteiogbmwzqnkjpyr (413)        overall: dbawvnpclotefmgisxzqhkjuyr (322)
jdcxvauflgtesohbmwzqnpikyr (432)        overall: hdsxvauflcteiogbmwzqnkjpyr (413)
jdcxvauflgtepohbmwzqnsikyr (433)        overall: jdcxvauflgtesohbmwzqnpikyr (432)
jdcxvauflgteophbmwzqnsikyr (435)        overall: jdcxvauflgtepohbmwzqnsikyr (433)
local maximum, shuffling
ueixvqmflgtdophbjzwkrscayn (43)         overall: ueixvqmflgtdophbjzwkrscayn (435)
idswvfeuoctgaphbjxzqlnkmyr (432)        overall: ueixvqmflgtdophbjzwkrscayn (435)
kdsxvcueoftgaphbjwzqlnimyr (443)        overall: ueixvqmflgtdophbjzwkrscayn (435)
jdsxvcueoftgaphbkwzqlnimyr (447)        overall: kdsxvcueoftgaphbjwzqlnimyr (443)
local maximum, shuffling
jpsbocuegfdvatqxzwnhlkimyr (12)         overall: jpsbocuegfdvatqxzwnhlkimyr (447)
qnarubtdpexcjsfkgvzwhoimyl (252)        overall: jpsbocuegfdvatqxzwnhlkimyr (447)
fnaxubtdrpqcjsekgvzwhoimyl (328)        overall: jpsbocuegfdvatqxzwnhlkimyr (447)
local maximum, shuffling
...

So this time we haven't gotten an 'optimal solution' but in other runs I have:

; found solutions:

; solve/hill-climbing
; jdrxvasuobtegphcfwzqknimyl (470)
; ibsxvaruoetfdpgclwzqmnhkyj (470)
; kjpxvuqioesdfngcawzrtmhlyb (473)
; iarzkcfjotudepgbswxqvmhlyn (474)

It also works much more quickly. In general, I've gotten to a 474 solution in only a few minutes, although it's never yet gone past that.

Finally, I have a partial solution. I was wondering if you could optimize the problem specifically for the given 7,260 words rather than trying to solve it in general (as both of the current algorithms will do). To do this, I constructed a 26x26 matrix that shows how often one letter follows another in a word. What we would want to do with that is solve a maximization problem, essentially finding a maximum path through this matrix. It's a bit more complicated than that though as we aren't directly dealing with the words, we're dealing with their encoded forms. I didn't actually make it beyond that point (and all of the solutions that I found with what I did have were terrible, with scores hovering around 30 or lower). Still, I think the idea has some merit.

Here's what I have to construct the matrix if anyone feels like working from here:

; solve using matrix math
; TODO: unfinished
(define (solve/matrix-math)
; count how many times each letter should be after each letter
(define m
(for/vector ([i (in-range 26)])
(for/vector ([i (in-range 26)])
0)))

; increment a cell in that matrix
(define (@"@" i j)
(vector-ref (vector-ref m i) j))
(define (++ i j)
(vector-set! (vector-ref m i) j (+ 1 (@"@" i j))))

; add all of the words
(for* ([word (in-list word-list)]
[i (in-range (string-length word))]
[j (in-range i (string-length word))])
(++ (- (char->integer (string-ref word i)) 97)
(- (char->integer (string-ref word j)) 97)))

; reset diagonal
(for ([i (in-range 26)])
(vector-set! (vector-ref m i) i 0))

m)

If you convert that into an HTML table, you'd see values something like this:

 0 141 335 603 1396 81 333 227 761 14 264 735 325 875 397 200 19 899 1222 663 315 145 114 61 365 68 328 0 81 197 536 21 105 76 278 7 73 300 60 209 266 26 5 340 395 185 206 9 51 14 131 24 452 53 0 220 765 25 74 330 309 2 235 292 130 265 418 130 12 437 542 296 219 49 38 15 178 16 232 35 81 0 515 25 120 41 280 5 46 160 89 191 212 48 1 279 421 168 123 30 35 10 128 11 510 76 197 898 0 54 137 126 427 5 60 457 182 554 296 118 6 1052 1301 570 160 37 55 64 323 19 189 18 61 120 311 0 61 38 198 5 37 243 66 115 153 5 3 216 222 162 138 11 16 19 107 14 248 29 33 139 513 20 0 105 207 1 7 206 79 178 203 37 0 254 394 181 133 23 23 1 108 18 338 29 54 193 552 17 60 0 284 5 82 196 114 208 272 57 0 313 388 159 112 27 36 11 163 15 362 63 346 457 1194 76 445 189 0 10 164 504 233 997 261 128 14 512 1040 535 141 73 17 42 240 32 57 10 16 32 89 2 22 7 61 0 25 36 15 48 55 14 0 47 63 36 59 3 6 3 17 1 72 9 9 132 405 9 38 32 137 3 0 82 19 104 47 19 0 137 206 63 36 11 2 4 114 3 472 35 143 371 1075 36 157 99 468 1 126 0 166 366 316 81 8 351 705 353 247 70 43 30 361 25 350 132 104 181 577 13 78 71 347 7 60 242 0 273 240 145 6 232 474 192 164 15 12 14 143 13 268 30 191 354 786 38 451 97 272 8 153 249 91 0 219 46 4 310 696 368 138 30 36 20 207 19 403 107 198 443 1043 43 246 152 534 8 138 449 211 661 0 147 10 660 992 461 372 87 167 54 241 34 340 20 121 209 607 18 89 135 387 3 68 280 52 293 295 0 5 366 422 277 153 22 27 21 161 12 38 0 7 6 56 0 3 8 40 0 10 8 3 19 13 1 0 26 34 31 102 2 2 0 8 3 626 113 205 423 1030 60 208 171 677 6 140 365 225 498 444 160 9 0 1073 424 255 77 75 24 335 32 398 40 179 292 808 33 108 285 389 3 126 311 130 296 277 170 17 409 0 523 179 55 69 24 240 13 362 43 145 221 783 21 106 257 407 1 88 258 103 288 333 88 3 501 619 0 152 20 52 24 255 23 305 107 189 332 791 31 190 141 378 6 145 382 222 459 140 186 2 463 724 442 0 30 32 21 175 24 100 5 21 49 306 2 32 10 154 0 4 87 14 90 92 8 0 138 192 70 41 0 7 9 39 4 172 7 28 132 335 16 71 93 204 0 46 119 33 158 57 36 0 195 198 91 23 22 0 4 92 6 23 1 12 13 56 1 13 2 55 0 1 22 4 36 25 7 0 25 60 30 9 1 0 0 20 0 94 12 36 67 171 2 55 34 114 0 19 84 54 132 101 46 1 94 190 63 32 6 17 10 0 2 38 7 11 25 111 3 18 9 67 1 5 28 10 46 47 4 0 39 73 19 9 1 0 0 33 0

(Yes it's tiny.)

Basically, they're pretty chaotic at first glance, but there are definite pockets of both higher and lower values.

If I get some more time, I'll see what more I can do with this solution, but for now, that's what I have.

Well that's all I have for now. Like always, the entire code (which will also be updated if I work more on this) is availble on GitHub: - alphabetizing cipher source

Edit 28 Jan 2013:

An idea I've been working on is to generate a graph of pairwise compatible words (any word that doesn't have a before b in one word and b before a in the other). If we can find that, the Maximal clique should (theoretically) be the exact set of words generated by an optimal cipher.

Unfortunately, finding the maximum clique of a graph is NP-hard. And with a 7260 node dense graph with quite a lot of maximal yet non-maximum cliques, it takes a really long time to run. I've been running the Bron-Kerbosch algorithm (with pivot) on it since last night, but the largest clique so far is only 44 items. It still hasn't gone beyond cliques that have to include the first three words.

Does anyone have any thoughts? Is there anything that would have to be added to the definition of compatible? words to make it work? Any quicker way to find maximum cliques?

(Here's a copy of the graph in the binary DIMACS format if it would be helpful: graph.b.clq, 3.1 mb)

Edit 31 Jan 2013:

Still running, although the best clique so far is only 49 nodes. I doubt this will complete while I still have access to the machine it's running on, but I guess there's no harm in letting it go.