## Random binary trees with a size-limited critical Boltzmann sampler

Today I’d like to talk about generating random trees. First, some imports and such (this post is literate Haskell).

``````> {-# LANGUAGE GeneralizedNewtypeDeriving #-}
>
> module BoltzmannTrees where
>
> import           Control.Applicative
> import           Control.Arrow                  ((&&&))
> import           Control.Lens                   ((??))
> import           Data.List                      (sort)
> import           Data.Maybe                     (fromJust)
> import           System.Environment             (getArgs)
``````

So here’s a simple type of binary tree shapes, containing no data:

``````> data Tree = Leaf | Branch Tree Tree
>   deriving Show
``````

We’ll count each constructor (`Leaf` or `Branch`) as having a size of 1:

``````> size :: Tree -> Int
> size Leaf = 1
> size (Branch l r) = 1 + size l + size r
``````

Now, suppose we want to randomly generate these trees. This is an entirely reasonable and useful thing to do: perhaps we want to, say, randomly test properties of functions over `Tree` using `QuickCheck`. Here’s the simplest, most naïve way to do it:

``````> randomTree :: (Applicative m, MonadRandom m) => m Tree
> randomTree = do
>   r <- getRandom
>   if r < (1/2 :: Double)
>     then return Leaf
>     else Branch <\$> randomTree <*> randomTree
``````

We choose each of the constructors with probability $1/2$, and recurse in the `Branch` case.

Now, as is well-known, this works rather poorly. Why is that? Let’s generate 100 random trees and print out their sizes in descending order:

``````ghci> reverse . sort . map size <\$> replicateM 100 randomTree
[118331,7753,2783,763,237,203,195,163,159,73,65,63,49,41,39,29,29,23,23,21,19,19,15,11,9,9,9,9,7,7,7,5,5,5,5,5,5,5,5,5,3,3,3,3,3,3,3,3,3,3,3,3,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
``````

As you can see, this is a really weird distribution of sizes. For one thing, we get lots of trees that are very small—in fact, it’s easy to see that we expect about 50 of them to be single leaf nodes. The other weird thing, however, is that we also get some really humongous trees. The above output gets randomly regenerated every time I process this post—so I don’t know exactly what sizes you’ll end up seeing—but it’s a good bet that there is at least one tree with a size greater than $10^4$. To get an intuitive idea of why this happens, imagine generating the tree in a breadth-first manner. At each new level we have a collection of “active” nodes corresponding to pending recursive calls to `randomTree`. Each active node generates zero or two new active nodes on the next level with equal probability, so on average the number of active nodes remains the same from level to level. So if we happen to make a lot of `Branch` choices right off the bat, it may take a long time before the tree “thins out” again. And if this distribution didn’t seem weird enough already, it turns out (though it is far from obvious how to prove this) that the expected size of the generated trees is infinite!

The usual solution with `QuickCheck` is to use the `sized` combinator to limit the size of generated structures, but this does not help with the problem of having too many very small trees.

Here’s a (seemingly!) stupid idea. Suppose we want to generate trees of size approximately 100 (say, within 10%). Let’s simply use the above algorithm, but with the following modifications:

1. If we generate a tree of size < 90, throw it away and start over.
2. If we generate a tree of size > 110, throw it away and start over. As an optimization, however, we will stop as soon as the size goes over 110; that is, we will keep track of the current size while generating and stop early if the size gets too big.

Here’s some code. First, a monad onion:

``````> newtype GenM a = GenM
>     { unGenM :: ReaderT (Int,Int) (StateT Int (MaybeT (Rand StdGen))) a }
``````

The `ReaderT` holds the min and max allowed sizes; the `StateT` holds the current size; the `MaybeT` allows for possible failure (if the tree gets too big or ends up too small), and the `Rand StdGen` is, of course, for generating random numbers. To run a computation in this monad we take a target size and a tolerance and use them to compute minimum and maximum sizes. (The `(??)` in the code below is an infix version of `flip`, defined in the `lens` package.)

``````> runGenM :: Int -> Double -> GenM a -> IO (Maybe a)
> runGenM targetSize eps m = do
>   let wiggle  = floor \$ fromIntegral targetSize * eps
>       minSize = targetSize - wiggle
>       maxSize = targetSize + wiggle
>   g <- newStdGen
>   return . (evalRand ?? g) . runMaybeT . (evalStateT ?? 0)
>          . (runReaderT ?? (minSize, maxSize)) . unGenM
>          \$ m
``````

Here’s the code to try generating a tree: we call the `atom` function to record the increase in size, and choose between the two constructors with equal probability. `atom`, in turn, handles failing early if the size gets too big.

``````> genTreeUB :: GenM Tree
> genTreeUB = do
>   r <- getRandom
>   atom
>   if r <= (1/2 :: Double)
>     then return Leaf
>     else Branch <\$> genTreeUB <*> genTreeUB
>
> atom :: GenM ()
> atom = do
>   curSize <- get
>   when (curSize >= maxSize) mzero
>   put (curSize + 1)
``````

`genTreeLB` calls `genTreeUB` and then performs the lower bound check on the size.

``````> genTreeLB :: GenM Tree
> genTreeLB = do
>   put 0
>   t <- genTreeUB
>   tSize <- get
>   guard \$ tSize >= minSize
>   return t
``````

Finally, `genTree` just calls `genTreeLB` repeatedly until it succeeds.

``````> genTree :: GenM Tree
> genTree = genTreeLB `mplus` genTree
``````

Let’s make sure it works:

``````ghci> map size . fromJust <\$> runGenM 100 0.1 (replicateM 30 genTree)
[105,91,105,103,107,101,105,93,93,93,95,91,103,91,91,107,105,103,97,95,105,107,93,97,93,103,91,103,101,95]
``````

Neat! Okay, but surely this is really, really slow, right? We spend a bunch of time just throwing away trees of the wrong size. Before reading on, would you care to guess the asymptotic time complexity to generate a tree of size $n$ using this algorithm?

And while you think about that, here is a random binary tree of size approximately 1000.

And the answer is… it is linear! That is, it takes $O(n)$ time to generate a tree of size $n$. This is astounding—it’s the best we could possibly hope for, because of course it takes at least $O(n)$ time to generate an object of size $O(n)$. If you don’t believe me, I invite you to run some experiments with this code yourself. I did, and it sure looks linear:

``````main = do
[sz] <- getArgs
Just ts <- runGenM (read sz) 0.1 \$ replicateM 1000 genTree
print . (/fromIntegral n) . fromIntegral . sum . map size \$ ts

archimedes :: research/species/boltzmann » time ./GenTree 50
49.682
./GenTree 50  1.37s user 0.01s system 99% cpu 1.387 total
archimedes :: research/species/boltzmann » time ./GenTree 100
99.474
./GenTree 100  3.11s user 0.02s system 99% cpu 3.152 total
archimedes :: research/species/boltzmann » time ./GenTree 200
198.494
./GenTree 200  6.82s user 0.04s system 99% cpu 6.876 total
archimedes :: research/species/boltzmann » time ./GenTree 400
398.798
./GenTree 400  13.08s user 0.08s system 99% cpu 13.208 total
archimedes :: research/species/boltzmann » time ./GenTree 800
795.798
./GenTree 800  25.99s user 0.16s system 99% cpu 26.228 total``````

The proof of this astounding fact uses some complex analysis which I do not understand; I wish I was joking. Of course, the constant factor can be big, depending on how small you set the “epsilon” allowing for wiggle room around the target size.1 But it is still quite feasible to generate rather large trees (with, say, $10^5$ nodes).

There is much, much more to say on this topic. I just wanted to start out with a simple example before jumping into more of the technical details and generalizations, which I plan to write about in future posts. I also hope to package this and a bunch of other stuff into a library. In the meantime, you can read Duchon et. al2 if you want the details.

1. Actually, if you set epsilon to zero, the asymptotic complexity jumps to $O(n^2)$.

2. Duchon, Philippe, et al. “Boltzmann samplers for the random generation of combinatorial structures.” Combinatorics Probability and Computing 13.4-5 (2004): 577-625.

Assistant Professor of Computer Science at Hendrix College. Functional programmer, mathematician, teacher, pianist, follower of Jesus.
This entry was posted in combinatorics, haskell, math, species and tagged , , , , , . Bookmark the permalink.

### 13 Responses to Random binary trees with a size-limited critical Boltzmann sampler

1. Neat! That’s an interesting paper you’ve linked there. I recently had the problem of generating (truly) random Rose Trees — asking a quetion on SO only pointed me towards using sized to determine the tree’s size, but in order to generate a truly random *structure* some more thought would have to be put into it.

Also, I’ve already learned the word “boustrophedonic” from that article, and I haven’t read past the intro yet!

2. ocharles says:

Great post Brent – while not really the focus of the post, I found it very interesting to see how you implemented a solution for the problem with the composition of a bunch of monads. I’ve only recently started seeing a problem in this light, and it’s extremely powerful – as evident by how naturally the solution emerged.

3. “it turns out (though it is far from obvious how to prove this) that the expected size of the generated trees is infinite”

I think it’s quite straightforward to show that the expected size is infinite. We can define “size” in a straightforward way:

size : Tree -> Int
size Leaf = 1
size (Branch l r) = 1 + size l + size r

Now, let’s split up the recursion to get the following:

children : Tree -> [Tree]
children Leaf = []
children (Branch l r) = [l, r]

size : Tree -> Int
size x = 1 + map size (children x)

It’s difficult to deal with something that’s half-Leaf/half-Branch, so let’s turn things into numbers to make it easier:

childNumber Leaf = 0
childNumber (Branch _ _) = 2

averageChildSize : Tree -> Int
averageChildSize t = sum (map size (children t)) / (childNumber t)

size x = 1 + (averageChildSize x) * (childNumber x)

Now that we’ve got numbers, let’s replace all of our Trees with half-Leaf/half-Branch probabilities. Let’s call such Trees “randTree”, and importantly we don’t need to keep track of sub-trees; all randTrees are equivalent. First we get the following for childNumber:

childNumber randTree = 1/2 * 0 + 1/2 * 2
childNumber randTree = 0 + 1
childNumber randTree = 1

Next we look at averageChildSize:

averageChildSize randTree = sum (map size (children randTree)) / (childNumber randTree)
averageChildSize randTree = sum (map size (children randTree)) / 1
averageChildSize randTree = sum (map size (children randTree))
averageChildSize randTree = 1/2 * sum [] + 1/2 * sum [size randTree, size randTree]
averageChildSize randTree = 1/2 * 0 + 1/2 * 2 * (size randTree)
averageChildSize randTree = size randTree

Now we can plug these into size:

size randTree = 1 + (averageChildSize randTree) * (childNumber randTree)
size randTree = 1 + (size randTree) * 1
size randTree = 1 + (size randTree)

The solution will be a fixed point of this recursive equation, but the “1 +” shows this is clearly infinite:

size randTree = 1 + (1 + (1 + …))

QED

• Oops!

size x = 1 + map size (children x)

Should be:

size x = 1 + sum (map size (children x))

4. Jonas Duregård says:

Nice post Brent!

One comment: wouldn’t it be nicer to keep the simple tree generator and just use a lazy size function and lazy size comparison to sieve out the large (and small) trees? For instance size could return a peano number or you could just make a specialized function smallerThan :: Tree -> Int -> Bool.

One potential pitfall is that the simple generator is not as lazy as it could be (can be fixed using a splitting random number generator).

• Brent says:

Hi Jonas, good idea! That had not occurred to me. I will give it a try. You’re right that splitting would be essential to make this work — and while people have doubts about the theoretical properties of splitting that probably only matters for cryptographic applications.

• Did you ever get around to the splitting method?

• Brent says:

Nope, but this is definitely on my list of things to look at again at some point. Perhaps I can find a student interested in packaging some of this up as a nice Haskell library.