## Eastman maximal comma-free codes in Haskell

This past January I watched a video of Don Knuth’s most recent annual Christmas lecture. Typically his Christmas lectures have been about trees, but breaking with tradition, he gave this lecture about comma-free codes, and presented an implementation of an interesting algorithm due to Willard Eastman. Of course his implementation was written in CWEB, and during the course of the lecture he noted that his algorithm was iterative, and he didn’t know of a nice way to write it recursively (or something like that). Naturally, that sounded like a challenge, so I implemented it in Haskell, and I think it came out quite nicely. (It still uses “iteration” in the sense of the `iterate` function, but of course that uses recursion, so…?) Unfortunately, that was in January, it is now July, and I don’t really remember how it works. So I decided I had better write about it now, before I forget even more about how it works.

A comma-free code is a set $C$ of strings such that if you concatenate any two strings in $C$, the result does not contain any elements of $C$ as internal substrings. The term “comma-free” refers to the fact that sequences of strings from $C$ can be unambiguously concatenated, without the need for separators like commas. Even if you start reading somewhere in the middle of a message, you can unambiguously figure out how to partition the message into codewords. For example, {`bear`, `like`} is a comma-free code, but {`bear`, `like`, `earl`} is not, since `bearlike` contains `earl` as a substring. A comma-free code also obviously cannot contain any periodic strings (that is, strings which consist of repeated copies of some shorter string), like `abcabc`, since concatenating such a string with itself produces a string containing the same string as an internal prefix.

Given a fixed alphabet and codeword length, one is naturally led to ask how large a comma-free code can possibly be. Eastman solved this problem for odd codeword lengths, by showing how to construct a maximal commafree code. To understand Eastman’s solution, consider the set $S$ of all aperiodic strings of length $n$ over an alphabet $\Sigma$ (we have already seen that periodic strings cannot be part of a comma-free code). Consider two strings equivalent if they are rotations of each other. For example, `bear`, `earb`, `arbe`, and `rbea` are all equivalent. This is an equivalence relation on strings, and so it defines a partition of $S$ into classes of equivalent strings. Note that we can never have two equivalent strings as part of the same comma-free code, since if we concatenate a string with itself, the result contains all other equivalent strings as substrings. For example, `bearbear` contains `earb`, `arbe`, and `rbea`. So at most a comma-free code could contain one string from each equivalence class.

In fact, Eastman shows that for odd $n$ there are comma-free codes that contain exactly one string from each equivalence class! What’s more, his proof is constructive: he shows how to pick a particular, canonical representative from each equivalence class such that the collection of all such canonical representatives is a comma-free code. This is what the program below does: given an odd-length string, it outputs the canonical rotation of that string which is part of a maximal comma-free code.

So, without further ado, let’s see the implementation! Again, I really don’t remember much about the details of how (much less why) this works. For that, I recommend watching Knuth’s lecture or reading the explanations in his code (you’ll probably want to compile it into LaTeX first).

First, some imports and such. Look, ma, no LANGUAGE extensions!

``````> module Commafree where
>
> import           Control.Arrow      (first)
> import           Data.List          (findIndex, intercalate)
> import           Data.List.Split
> import           Data.Maybe         (catMaybes)
> import           Data.Monoid        ((<>))
>
> import           System.Environment
> import           System.Exit
> import           Text.Printf
``````

Here’s the main Eastman algorithm, which actually works for any list of things with a total order (unlike Knuth’s, which only works for lists of nonnegative integers, although that is obviously just a cosmetic difference, since any finite set with a total order is isomorphic to a set of nonnegative integers). We turn each item into a singleton “block”, then iterate the `eastmanRound` function, which partitions the blocks into subsequences of blocks, which we coalesce into blocks again. So each iteration makes the partition coarser, i.e. the blocks get bigger. We keep iterating until there is only one block left, which contains the rotation that we seek.

``````> eastman :: Ord a => [a] -> [a]
> eastman
>   . dropWhile ((>1) . length)
>   . iterate (map mconcat . eastmanRound)
>   . map mkBlock
``````

Some code for dealing with blocks. A block is just a list that keeps track of its length for efficiency. The important point about blocks is that they are ordered first by length, then lexicographically (see the `Ord` instance below). The `Monoid` instance is straightforward.

``````> data Block a = Block { blockLen :: !Int, blockContent :: [a] }
>   deriving (Show, Eq)
>
> instance Ord a => Ord (Block a) where
>   compare (Block m as) (Block n bs)
>     = compare m n <> compare as bs
>
> instance Monoid (Block a) where
>   mempty = Block 0 []
>   mappend (Block m as) (Block n bs) = Block (m+n) (as ++ bs)
>
> mkBlock :: a -> Block a
> mkBlock a = Block 1 [a]
``````

One round of the algorithm works as follows: we duplicate the list, partition it after each “dip” (`chop splitDip`, to be explained below), possibly drop some of the leading parts and coalesce other parts based on size parity (`pickOdds`), and then keep only a total amount of stuff equal to the length of the original list (`takeTotal`). This last part with `takeTotal` ensures that we will end up with something which is a rotation of the original input (though partitioned). In an implementation with random-access arrays, one would just wrap the indices around using mod; in this context it’s easier to first duplicate the input list so we can deal with all rotations at once, determine which rotation we want by dropping some stuff from the beginning, then drop any excess stuff at the end.

``````> eastmanRound :: Ord a => [a] -> [[a]]
> eastmanRound as
>   = takeTotal (length as)
>   . pickOdds
>   . chop splitDip
>   \$ (as ++ as)
``````

It’s interesting to note that in `eastmanRound` the type `a` is actually going to be instantiated with `Block b` for some type `b`. In the first round, all the blocks are singletons, so this is no different than just taking a list of `b`. But in subsequent rounds the distinction is nontrivial.

A “dip” is a decreasing substring followed by a single increase, for example, 976325. (Though again, remember that we are actually dealing with sequences of blocks, not integers, so a dip is essentially a sequence of blocks of decreasing length followed by a longer one, with the requisite caveat about blocks of the same length.) `splitDip` looks for the first place in the list that looks like $a > b < c$ and breaks the list right after it. This is used with the `chop` function to split the list into a sequence of dips.

``````> splitDip :: Ord a => [a] -> ([a],[a])
> splitDip (a:b:cs)
>   | a < b     = ([a,b], cs)
>   | otherwise = first (a:) (splitDip (b:cs))
> splitDip as = (as,[])
``````

`pickOdds` does something like the following: it looks for maximal sequences of dips where the first dip has odd length and the rest have even length, and merges such sequences into one long partition. It also drops everything prior to the first odd dip. Something like that at least; my memory on this is a bit fuzzy.

``````> pickOdds :: [[a]] -> [[a]]
> pickOdds
>   = map concat
>   . dropWhile (even . length . head)
>   . drop 1
>   . splitAtOdds
>
> splitAtOdds :: [[a]] -> [[[a]]]
> splitAtOdds = chop \$
>   \(x:xs) -> let (ys,zs) = break (odd.length) xs
>              in  (x:ys, zs)
``````

Finally, `takeTotal` just takes lists until their total length matches the given total.

``````> takeTotal :: Int -> [[a]] -> [[a]]
> takeTotal _ [] = []
> takeTotal n _ | n <= 0 = []
> takeTotal n (xs:xss) = xs : takeTotal (n - length xs) xss
``````

And that’s it! I also put together a `main` which more or less emulates what Knuth’s C program does. My program and Knuth’s give the same output on every example I have tried (except that Knuth’s prints out some intermediate information about each iteration step; mine just prints the final answer).

``````> main :: IO ()
> main = do
>   progName <- getProgName
>   args <- getArgs
>   let n = length args
>   when (n < 3) \$ do
>     printf "Usage: %s x1 x2 ... xn\n" progName
>     exitWith (ExitFailure 1)
>   when (even n) \$ do
>     printf "The number of items, n, should be odd, not `%d'!\n" n
>     exitWith (ExitFailure 2)
>   let ns :: [Maybe Int]
>       ns = map readMaybe args
>   case findIndex (maybe True (<0) . snd) (zip [1..] ns) of
>     Just i -> do
>       printf "Argument %d should be a nonnegative integer, not `%s'!\n"
>                         i                             (args !! (i-1))
>       exitWith (ExitFailure 3)
>     Nothing ->
>       putStrLn .
>       (' ' :) . intercalate " " . map show .
>       eastman . catMaybes \$ ns
`````` 