Today I’d like to talk about generating random trees. First, some imports and such (this post is literate Haskell).

```
> {-# LANGUAGE GeneralizedNewtypeDeriving #-}
>
> module BoltzmannTrees where
>
> import Control.Applicative
> import Control.Arrow ((&&&))
> import Control.Lens ((??))
> import Control.Monad.Random
> import Control.Monad.Reader
> import Control.Monad.State
> import Control.Monad.Trans.Maybe
> import Data.List (sort)
> import Data.Maybe (fromJust)
> import System.Environment (getArgs)
```

So here’s a simple type of binary tree shapes, containing no data:

```
> data Tree = Leaf | Branch Tree Tree
> deriving Show
```

We’ll count each constructor (`Leaf`

or `Branch`

) as having a size of 1:

```
> size :: Tree -> Int
> size Leaf = 1
> size (Branch l r) = 1 + size l + size r
```

Now, suppose we want to randomly generate these trees. This is an entirely reasonable and useful thing to do: perhaps we want to, say, randomly test properties of functions over `Tree`

using `QuickCheck`

. Here’s the simplest, most naïve way to do it:

```
> randomTree :: (Applicative m, MonadRandom m) => m Tree
> randomTree = do
> r <- getRandom
> if r < (1/2 :: Double)
> then return Leaf
> else Branch <$> randomTree <*> randomTree
```

We choose each of the constructors with probability , and recurse in the `Branch`

case.

Now, as is well-known, this works rather poorly. Why is that? Let’s generate 100 random trees and print out their sizes in descending order:

```
ghci> reverse . sort . map size <$> replicateM 100 randomTree
[118331,7753,2783,763,237,203,195,163,159,73,65,63,49,41,39,29,29,23,23,21,19,19,15,11,9,9,9,9,7,7,7,5,5,5,5,5,5,5,5,5,3,3,3,3,3,3,3,3,3,3,3,3,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
```

As you can see, this is a really weird distribution of sizes. For one thing, we get lots of trees that are very small—in fact, it’s easy to see that we expect about 50 of them to be single leaf nodes. The other weird thing, however, is that we also get some really humongous trees. The above output gets randomly regenerated every time I process this post—so I don’t know exactly what sizes you’ll end up seeing—but it’s a good bet that there is at least one tree with a size greater than . To get an intuitive idea of why this happens, imagine generating the tree in a breadth-first manner. At each new level we have a collection of “active” nodes corresponding to pending recursive calls to `randomTree`

. Each active node generates zero or two new active nodes on the next level with equal probability, so *on average* the number of active nodes remains the same from level to level. So if we happen to make a lot of `Branch`

choices right off the bat, it may take a long time before the tree “thins out” again. And if this distribution didn’t seem weird enough already, it turns out (though it is far from obvious how to prove this) that the *expected* size of the generated trees is *infinite*!

The usual solution with `QuickCheck`

is to use the `sized`

combinator to limit the size of generated structures, but this does not help with the problem of having too many very small trees.

Here’s a (seemingly!) stupid idea. Suppose we want to generate trees of size approximately 100 (say, within 10%). Let’s simply use the above algorithm, but with the following modifications:

- If we generate a tree of size < 90, throw it away and start over.
- If we generate a tree of size > 110, throw it away and start over. As an optimization, however, we will stop
*as soon as the size goes over 110*; that is, we will keep track of the current size while generating and stop early if the size gets too big.

Here’s some code. First, a monad onion:

```
> newtype GenM a = GenM
> { unGenM :: ReaderT (Int,Int) (StateT Int (MaybeT (Rand StdGen))) a }
> deriving (Functor, Applicative, Monad, MonadPlus, MonadRandom,
> MonadState Int, MonadReader (Int,Int))
```

The `ReaderT`

holds the min and max allowed sizes; the `StateT`

holds the current size; the `MaybeT`

allows for possible failure (if the tree gets too big or ends up too small), and the `Rand StdGen`

is, of course, for generating random numbers. To run a computation in this monad we take a target size and a tolerance and use them to compute minimum and maximum sizes. (The `(??)`

in the code below is an infix version of `flip`

, defined in the `lens`

package.)

```
> runGenM :: Int -> Double -> GenM a -> IO (Maybe a)
> runGenM targetSize eps m = do
> let wiggle = floor $ fromIntegral targetSize * eps
> minSize = targetSize - wiggle
> maxSize = targetSize + wiggle
> g <- newStdGen
> return . (evalRand ?? g) . runMaybeT . (evalStateT ?? 0)
> . (runReaderT ?? (minSize, maxSize)) . unGenM
> $ m
```

Here’s the code to try generating a tree: we call the `atom`

function to record the increase in size, and choose between the two constructors with equal probability. `atom`

, in turn, handles failing early if the size gets too big.

```
> genTreeUB :: GenM Tree
> genTreeUB = do
> r <- getRandom
> atom
> if r <= (1/2 :: Double)
> then return Leaf
> else Branch <$> genTreeUB <*> genTreeUB
>
> atom :: GenM ()
> atom = do
> (_, maxSize) <- ask
> curSize <- get
> when (curSize >= maxSize) mzero
> put (curSize + 1)
```

`genTreeLB`

calls `genTreeUB`

and then performs the lower bound check on the size.

```
> genTreeLB :: GenM Tree
> genTreeLB = do
> put 0
> t <- genTreeUB
> tSize <- get
> (minSize, _) <- ask
> guard $ tSize >= minSize
> return t
```

Finally, `genTree`

just calls `genTreeLB`

repeatedly until it succeeds.

```
> genTree :: GenM Tree
> genTree = genTreeLB `mplus` genTree
```

Let’s make sure it works:

```
ghci> map size . fromJust <$> runGenM 100 0.1 (replicateM 30 genTree)
[105,91,105,103,107,101,105,93,93,93,95,91,103,91,91,107,105,103,97,95,105,107,93,97,93,103,91,103,101,95]
```

Neat! Okay, but surely this is really, really slow, right? We spend a bunch of time just throwing away trees of the wrong size. Before reading on, would you care to guess the asymptotic time complexity to generate a tree of size using this algorithm?

And while you think about that, here is a random binary tree of size approximately 1000.

And the answer is… it is *linear*! That is, it takes time to generate a tree of size . This is astounding—it’s the best we could possibly hope for, because of course it takes at least time to generate an object of size . If you don’t believe me, I invite you to run some experiments with this code yourself. I did, and it sure looks linear:

```
main = do
[sz] <- getArgs
Just ts <- runGenM (read sz) 0.1 $ replicateM 1000 genTree
print . (/fromIntegral n) . fromIntegral . sum . map size $ ts
archimedes :: research/species/boltzmann » time ./GenTree 50
49.682
./GenTree 50 1.37s user 0.01s system 99% cpu 1.387 total
archimedes :: research/species/boltzmann » time ./GenTree 100
99.474
./GenTree 100 3.11s user 0.02s system 99% cpu 3.152 total
archimedes :: research/species/boltzmann » time ./GenTree 200
198.494
./GenTree 200 6.82s user 0.04s system 99% cpu 6.876 total
archimedes :: research/species/boltzmann » time ./GenTree 400
398.798
./GenTree 400 13.08s user 0.08s system 99% cpu 13.208 total
archimedes :: research/species/boltzmann » time ./GenTree 800
795.798
./GenTree 800 25.99s user 0.16s system 99% cpu 26.228 total
```

The proof of this astounding fact uses some *complex analysis* which I do not understand; I wish I was joking. Of course, the constant factor can be big, depending on how small you set the “epsilon” allowing for wiggle room around the target size.^{1} But it is still quite feasible to generate rather large trees (with, say, nodes).

There is much, much more to say on this topic. I just wanted to start out with a simple example before jumping into more of the technical details and generalizations, which I plan to write about in future posts. I also hope to package this and a bunch of other stuff into a library. In the meantime, you can read Duchon *et. al*^{2} if you want the details.