An optimal alphabetizing cipher

Here is today’s /r/dailyprogramming challenge: Generate a simple substitution cipher such that the maximum number of words in a given dictionary of six letter words (there are 7,260 of them) are encoded as strings in alphabetical order.

For example, if you have the key jfbqpwcvuamozhilgrxtkndesy, a would may to j, b to f and so on. Thus the word bakery would become fjmprs (which is alphabetical). It turns out that of the given 7,260 word list, exactly 60 would evaluate to alphabetical orderings. So the score for that cipher key could be thought of as 60. The goal? Find the highest scoring cipher key.

If you’d like to follow along, you can download the entire code from GitHub.

To start with, we want a few utility functions to generate the ciphers from a key and to score them. This is one place where a functional language shines as you can write a function to generate ciphers that returns another function that does the encoding:

; create an encoding function from a key
(define (make-cipher key)
  (lambda (plaintext)
    (build-string
     (string-length plaintext)
     (lambda (i)
       (string-ref key (- (char->integer (string-ref plaintext i)) 97))))))

Likewise, it’s very straight forward to read in the entire word list, to test if a string is sorted, and to score a cipher over and entire word list.

; load a word list to score from
(define fn "word-list.txt")
(define word-list
  (with-input-from-file fn
    (lambda () (for/list ([word (in-lines)]) word))))

; test if a word is sorted
(define (word-sorted? word)
  (andmap 
   (lambda (i) (char<=? (string-ref word i) (string-ref word (+ i 1))))
   (range (- (string-length word) 1))))

; score a cipher
(define (score-cipher cipher)
  (for/sum ([word (in-list word-list)])
    (if (word-sorted? (cipher word)) 1 0)))

To make sure that things are working well, we can test the provided sample cipher:

> (define test-encode (make-cipher "jfbqpwcvuamozhilgrxtkndesy"))
> (test-encode "bakery")
"fjmprs"
> (score-cipher test-encode)
60

Everything is as it should be.

Next, we start getting into solutions. As of the writing, several people on the original thread have found solutions with scores of 474 (over 100 unique solutions), so it seems that’s the score to beat (if it’s even possible; there’s not actually a known solution to the problem). So I’m justing going to throw a few things out there and see what sticks.

My first idea was to just start with some random solution and repeatedly mutate it (by swapping two random letters, a sort of Monte Carlo algorithm). If the new solution is better, keep it. If not, try again. It turns out that the code for that is pretty straight forward, although I did add some code that after a given number of seconds have passed without a new solution we’ll just random start over. Basically, I was seeing a lot of dead ends that it couldn’t get out of even with the maximum number of swaps (10).

; solve via random swapping
(define (solve/monte-carlo [guess (random-key)]
                           [score -inf.0]
                           [timeout 10]
                           [timer (current-seconds)])

  ; generate a new guess by swapping up to 10 pairs
  (define new-guess (string-copy guess))
  (for ([i (in-range (+ (random 10) 1))])
    (string-swap! new-guess (random 26) (random 26)))
  (define new-score (score-key new-guess))

  ; if we have a new best, report it
  (cond
    [(> (- (current-seconds) timer) timeout)
     (printf "timeout\n")
     (solve/monte-carlo (random-key) -inf.0 timeout (current-seconds))]
    [(> new-score score)
     (printf "~a (~a)\n" new-guess new-score)
     (solve/monte-carlo new-guess new-score timeout (current-seconds))]
    [else
     (solve/monte-carlo guess score timeout timer)]))

And here’s a sample run of that (with default arguments):

> (solve/monte-carlo)
svpakuqixjtmngdbrywozehlcf (9)
svkapgyxijtmqudbrnwozehlcf (27)
svbkopyxinteqldajrwgzmhucf (48)
... 23 lines cut ...
kacyxdwgtbvelufspjzminhrqo (353)
kacyxbwgtdvelufspjzminhrqo (354)
kacyxbwftdvelugspjzminhrqo (369)
timeout
vknreomstqpwihjfgxydczlaub (14)
vkngtomreqpwicjfzxydbslauh (45)
deogtnvrmqpwickxafyjbslzuh (53)
... 35 lines cut ...
jbdvtapenirfgohczuyqmslwxk (402)
jbdvtapenirfgohczuyqkslwxm (404)
jbdvtapenhrfgoiczuyqkslwxm (411)
timeout
rsfykdbjaepmczwnoihxtgulvq (4)
ysfrkdtjaepmczwnoihxbgulvq (5)
ysfmkctjaeprdzwnoihxbgulvq (7)
... 39 lines cut ...
madxvesgqfuhbrickwztnolpyj (430)
kadxvesgqfuhbricmwztnolpyj (435)
kadxvesgqfuhbricmwztnojpyl (451)
timeout
...

Basically, it’s finding some decently good solutions pretty quickly. (Since the timeout is 10 seconds, each step cannot take longer than that. In actuality, they’re only taking a few seconds each to get stuck.) Leave it running long enough and you can find some pretty nice solutions:

; found solutions: 

; solve/monte-carlo
; iacxvdreofugkphbtwzsjlmqyn (440)
; hcawtqpemyrkbnfduvzsilgjxo (385)
; hbrxucsenftdmogavwzqipjlyk (425)
; idpwuamrkbqfelgczvyojthnxs (448)

But still, nothing reaching even the known optimal case yet.

My second idea was to directly implement a hill-climbing solution, always greedily choosing the best single step (if there is one). Basically, given a single solution try all 351 (choose one of 26 positions for i and a position after that for j) possible swaps. If there’s a better one, recur on that. If not, shake up the current solution using a method similar to the one used in the solve/monte-carlo method.

; solve via direct hill climbing
(define (solve/hill-climbing [guess (random-key)]
                             [score -inf.0]
                             [overall-guess guess]
                             [overall-score score])

  ; try every possible single swap
  (define-values (new-guess new-score)
    (for*/fold ([guess guess]
                [score score])
               ([i (in-range 26)]
                [j (in-range i 26)])
      (define new-guess (string-swap guess i j))
      (define new-score (score-key new-guess))
      (if (> new-score score)
          (values new-guess new-score)
          (values guess score))))

  ; update the overall best (will actually print next round)
  (define-values (new-overall-guess new-overall-score)
    (if (>= new-score overall-score)
        (values new-guess new-score)
        (values overall-guess overall-score)))

  ; print out local best values and best overall
  (cond
    [(equal? guess new-guess)
     (printf "local maximum, shuffling\n")
     (for ([i (in-range (+ (random 6) 4))])
       (string-swap! guess (random 26) (random 26)))
     (define new-score (score-key new-guess))
     (printf "~a (~a)  \toverall: ~a (~a)\n" new-guess new-score overall-guess overall-score) 
     (solve/hill-climbing new-guess new-score new-overall-guess new-overall-score)]
    [else
     (printf "~a (~a)  \toverall: ~a (~a)\n" new-guess new-score overall-guess overall-score) 
     (solve/hill-climbing new-guess new-score new-overall-guess new-overall-score)]))

It turns out that there are a lot of local maximums for this sort of solution. Just about every time, the new solution will go 2-4 rounds and find a position that can’t be improved with a single swap. Here’s an example run (yes, the overall output does lag by one, but that doesn’t really matter in the long run):

> (solve/hill-climbing)
dbawvnpclotefmgisxzqhkjuyr (322)        overall: wlrqujtcmeifhdgpykzaxbsvno (-inf.0)
hdsxvauflcteiogbmwzqnkjpyr (413)        overall: dbawvnpclotefmgisxzqhkjuyr (322)
jdcxvauflgtesohbmwzqnpikyr (432)        overall: hdsxvauflcteiogbmwzqnkjpyr (413)
jdcxvauflgtepohbmwzqnsikyr (433)        overall: jdcxvauflgtesohbmwzqnpikyr (432)
jdcxvauflgteophbmwzqnsikyr (435)        overall: jdcxvauflgtepohbmwzqnsikyr (433)
local maximum, shuffling
ueixvqmflgtdophbjzwkrscayn (43)         overall: ueixvqmflgtdophbjzwkrscayn (435)
idswvfeuoctgaphbjxzqlnkmyr (432)        overall: ueixvqmflgtdophbjzwkrscayn (435)
kdsxvcueoftgaphbjwzqlnimyr (443)        overall: ueixvqmflgtdophbjzwkrscayn (435)
jdsxvcueoftgaphbkwzqlnimyr (447)        overall: kdsxvcueoftgaphbjwzqlnimyr (443)
local maximum, shuffling
jpsbocuegfdvatqxzwnhlkimyr (12)         overall: jpsbocuegfdvatqxzwnhlkimyr (447)
qnarubtdpexcjsfkgvzwhoimyl (252)        overall: jpsbocuegfdvatqxzwnhlkimyr (447)
fnaxubtdrpqcjsekgvzwhoimyl (328)        overall: jpsbocuegfdvatqxzwnhlkimyr (447)
local maximum, shuffling
...

So this time we haven’t gotten an ‘optimal solution’ but in other runs I have:

; found solutions: 

; solve/hill-climbing
; jdrxvasuobtegphcfwzqknimyl (470)
; ibsxvaruoetfdpgclwzqmnhkyj (470)
; kjpxvuqioesdfngcawzrtmhlyb (473)
; iarzkcfjotudepgbswxqvmhlyn (474)

It also works much more quickly. In general, I’ve gotten to a 474 solution in only a few minutes, although it’s never yet gone past that.

Finally, I have a partial solution. I was wondering if you could optimize the problem specifically for the given 7,260 words rather than trying to solve it in general (as both of the current algorithms will do). To do this, I constructed a 26x26 matrix that shows how often one letter follows another in a word. What we would want to do with that is solve a maximization problem, essentially finding a maximum path through this matrix. It’s a bit more complicated than that though as we aren’t directly dealing with the words, we’re dealing with their encoded forms. I didn’t actually make it beyond that point (and all of the solutions that I found with what I did have were terrible, with scores hovering around 30 or lower). Still, I think the idea has some merit.

Here’s what I have to construct the matrix if anyone feels like working from here:

; solve using matrix math
; TODO: unfinished
(define (solve/matrix-math)
  ; count how many times each letter should be after each letter
  (define m
    (for/vector ([i (in-range 26)])
      (for/vector ([i (in-range 26)])
        0)))

  ; increment a cell in that matrix
  (define (@ i j)
    (vector-ref (vector-ref m i) j))
  (define (++ i j)
    (vector-set! (vector-ref m i) j (+ 1 (@ i j))))

  ; add all of the words
  (for* ([word (in-list word-list)]
         [i (in-range (string-length word))]
         [j (in-range i (string-length word))])
    (++ (- (char->integer (string-ref word i)) 97)
        (- (char->integer (string-ref word j)) 97)))

  ; reset diagonal
  (for ([i (in-range 26)])
    (vector-set! (vector-ref m i) i 0))

  m)

If you convert that into an HTML table, you’d see values something like this:

0	141	335	603	1396	81	333	227	761	14	264	735	325	875	397	200	19	899	1222	663	315	145	114	61	365	68
328	0	81	197	536	21	105	76	278	7	73	300	60	209	266	26	5	340	395	185	206	9	51	14	131	24
452	53	0	220	765	25	74	330	309	2	235	292	130	265	418	130	12	437	542	296	219	49	38	15	178	16
232	35	81	0	515	25	120	41	280	5	46	160	89	191	212	48	1	279	421	168	123	30	35	10	128	11
510	76	197	898	0	54	137	126	427	5	60	457	182	554	296	118	6	1052	1301	570	160	37	55	64	323	19
189	18	61	120	311	0	61	38	198	5	37	243	66	115	153	5	3	216	222	162	138	11	16	19	107	14
248	29	33	139	513	20	0	105	207	1	7	206	79	178	203	37	0	254	394	181	133	23	23	1	108	18
338	29	54	193	552	17	60	0	284	5	82	196	114	208	272	57	0	313	388	159	112	27	36	11	163	15
362	63	346	457	1194	76	445	189	0	10	164	504	233	997	261	128	14	512	1040	535	141	73	17	42	240	32
57	10	16	32	89	2	22	7	61	0	25	36	15	48	55	14	0	47	63	36	59	3	6	3	17	1
72	9	9	132	405	9	38	32	137	3	0	82	19	104	47	19	0	137	206	63	36	11	2	4	114	3
472	35	143	371	1075	36	157	99	468	1	126	0	166	366	316	81	8	351	705	353	247	70	43	30	361	25
350	132	104	181	577	13	78	71	347	7	60	242	0	273	240	145	6	232	474	192	164	15	12	14	143	13
268	30	191	354	786	38	451	97	272	8	153	249	91	0	219	46	4	310	696	368	138	30	36	20	207	19
403	107	198	443	1043	43	246	152	534	8	138	449	211	661	0	147	10	660	992	461	372	87	167	54	241	34
340	20	121	209	607	18	89	135	387	3	68	280	52	293	295	0	5	366	422	277	153	22	27	21	161	12
38	0	7	6	56	0	3	8	40	0	10	8	3	19	13	1	0	26	34	31	102	2	2	0	8	3
626	113	205	423	1030	60	208	171	677	6	140	365	225	498	444	160	9	0	1073	424	255	77	75	24	335	32
398	40	179	292	808	33	108	285	389	3	126	311	130	296	277	170	17	409	0	523	179	55	69	24	240	13
362	43	145	221	783	21	106	257	407	1	88	258	103	288	333	88	3	501	619	0	152	20	52	24	255	23
305	107	189	332	791	31	190	141	378	6	145	382	222	459	140	186	2	463	724	442	0	30	32	21	175	24
100	5	21	49	306	2	32	10	154	0	4	87	14	90	92	8	0	138	192	70	41	0	7	9	39	4
172	7	28	132	335	16	71	93	204	0	46	119	33	158	57	36	0	195	198	91	23	22	0	4	92	6
23	1	12	13	56	1	13	2	55	0	1	22	4	36	25	7	0	25	60	30	9	1	0	0	20	0
94	12	36	67	171	2	55	34	114	0	19	84	54	132	101	46	1	94	190	63	32	6	17	10	0	2
38	7	11	25	111	3	18	9	67	1	5	28	10	46	47	4	0	39	73	19	9	1	0	0	33	0

(Yes it’s tiny.)

Basically, they’re pretty chaotic at first glance, but there are definite pockets of both higher and lower values.

If I get some more time, I’ll see what more I can do with this solution, but for now, that’s what I have.

Well that’s all I have for now. Like always, the entire code (which will also be updated if I work more on this) is availble on GitHub:

alphabetizing cipher source

Edit 28 Jan 2013:

An idea I’ve been working on is to generate a graph of pairwise compatible words (any word that doesn’t have a before b in one word and b before a in the other). If we can find that, the maximum clique should (theoretically) be the exact set of words generated by an optimal cipher.

Unfortunately, finding the maximum clique of a graph is NP-hard. And with a 7260 node dense graph with quite a lot of maximal yet non-maximum cliques, it takes a really long time to run. I’ve been running the Bron-Kerbosch algorithm (with pivot) on it since last night, but the largest clique so far is only 44 items. It still hasn’t gone beyond cliques that have to include the first three words.

Does anyone have any thoughts? Is there anything that would have to be added to the definition of compatible? words to make it work? Any quicker way to find maximum cliques?

(Here’s a copy of the graph in the binary DIMACS format if it would be helpful: graph.b.clq, 3.1 mb)

Edit 31 Jan 2013:

Still running, although the best clique so far is only 49 nodes. I doubt this will complete while I still have access to the machine it’s running on, but I guess there’s no harm in letting it go.