Generating non-repeating strings

Based on this post from Programming Praxis, today’s goal is to write an algorithm that, given a number N and an alphabet A, will generate all strings of length N made of letters from A with no adjacent substrings that repeat.

So for example, given N = 5 and A = {a, b, c} the string abcba will be allowed, but none of abcbc, ababc, nor even aabcb will be allowed (the bc, ab, and a repeat).

It’s a little more general even than the version Programming Praxis specifies (they limit the alphabet to exactly *A = {1, 2, 3} *and more more general still than their original source which requires only one possible string, but I think it’s worth the extra complications.

To start out with, let’s just brute force the problem. Generate all possible sequences of characters in turn. For each, loop over every possible splitting point and every possible length at that splitting point. If any of those result in a duplicate substring, throw it out. We’re doing more work than we have to, generating strings where we can already tell after just the first two characters that they repeat, but that’s for future optimizations, yes?

So we’ll start with code that given a length and an alphabet will give us all possible strings. This is a perfect use of recursion and generators (in the Python version).

Python:

def combinations(N, A):
	'''Generate all strings from A of length N.'''
	if N == 0:
		yield ''
	else:
		for c in A:
			for s in combinations(N - 1, A):
				yield c + s

Racket:

; generate all strings from A of length N
(define (combinations n a)
  (map list->string
       (let loop ([n n])
         (cond
           [(zero? n) '(())]
           [else
            (for*/list ([c (in-string a)]
                        [r (in-list (loop (- n 1)))])
              (cons c r))]))))

In the case of Racket, I nest the helper loop just so that I can map list->string at the end. If I were just looking for lists or if I had used string-append rather than cons that wouldn’t have been necessary.

Next, we want a function that can check if there are any duplicate substrings. As aforementioned, we’ll generate each index in the string (ignoring the very first and last) and check each length substring before and after that index.

Python:

def has_repeat(s):
	'''Check if a string has a non-empty repeating substring.'''

	N = len(s)
	for i in xrange(1, N):
		for j in xrange(1, min(i, N - i) + 1):
			if s[i-j:i] == s[i:i+j]:
				return True

	return False

Racket:

; check if a string has a non-empty repeating substring
(define (repeats? s)
  (call/cc
   (lambda (return)
     (define N (string-length s))
     (for* ([i (in-range 1 N)]
            [j (in-range 1 (+ 1 (min i (- N i))))])
       (when (equal? (substring s (- i j) i)
                     (substring s i (+ i j)))
         (return #t)))
     (return #f))))

Using call/cc like this is a trick to bail out of the loop early. Otherwise, we could have used andmap, but I much prefer being able to use for*.

And now that the hard part is out of the way, we can put it all together:

Python:

def non_repeating_direct(N, A):
	'''
	Generate all strings of length N from the alphabet A
	such that no adjacent substrings are repeated.
	'''

	return [s for s in combinations(N, A) if not has_repeat(s)]

Racket:

; generate all strings of length N from alphabet A with no repeating substrings
(define (non-repeating-direct N A)
  (for/list ([s (in-list (combinations N A))]
             #:when (not (repeats? s)))
    s))

So how does it work out?

> (time (non-repeating-direct 5 "abc"))
cpu time: 0 real time: 2 gc time: 0
'("abaca" "abacb" "abcab" "abcac" "abcba"
  "acaba" "acabc" "acbab" "acbac" "acbca"
  "babca" "babcb" "bacab" "bacba" "bacbc"
  "bcaba" "bcabc" "bcacb" "bcbab" "bcbac"
  "cabac" "cabca" "cabcb" "cacba" "cacbc"
  "cbabc" "cbaca" "cbacb" "cbcab" "cbcac"))

Looks good so far.

As an interesting aside, Ben over in the Programming Praxis comments pointed out that the number of such words corresponds with the number of ternary squarefree words of length n:

> (for/list ([n (in-range 10)]) (length (non-repeating-direct n "abc")))
'(1 3 6 12 18 30 42 60 78 108)

And actually once you parse through all of the jargon, it makes perfect sense. Ternary is base three (thus the alphabet A = {a, b, c}), words are what we are generating of length n, and a squarefree word is one without any adjacent subwords.

The full source code (along with unit tests) is available here: