Random binary trees with a size-limited critical Boltzmann sampler

Today I’d like to talk about generating random trees. First, some imports and such (this post is literate Haskell).

> {-# LANGUAGE GeneralizedNewtypeDeriving #-}
> module BoltzmannTrees where
> import           Control.Applicative
> import           Control.Arrow                  ((&&&))
> import           Control.Lens                   ((??))
> import           Control.Monad.Random
> import           Control.Monad.Reader
> import           Control.Monad.State
> import           Control.Monad.Trans.Maybe
> import           Data.List                      (sort)
> import           Data.Maybe                     (fromJust)
> import           System.Environment             (getArgs)

So here’s a simple type of binary tree shapes, containing no data:

> data Tree = Leaf | Branch Tree Tree
>   deriving Show

We’ll count each constructor (Leaf or Branch) as having a size of 1:

> size :: Tree -> Int
> size Leaf = 1
> size (Branch l r) = 1 + size l + size r

Now, suppose we want to randomly generate these trees. This is an entirely reasonable and useful thing to do: perhaps we want to, say, randomly test properties of functions over Tree using QuickCheck. Here’s the simplest, most naïve way to do it:

> randomTree :: (Applicative m, MonadRandom m) => m Tree
> randomTree = do
>   r <- getRandom
>   if r < (1/2 :: Double)
>     then return Leaf
>     else Branch <$> randomTree <*> randomTree

We choose each of the constructors with probability 1/2, and recurse in the Branch case.

Now, as is well-known, this works rather poorly. Why is that? Let’s generate 100 random trees and print out their sizes in descending order:

ghci> reverse . sort . map size <$> replicateM 100 randomTree

As you can see, this is a really weird distribution of sizes. For one thing, we get lots of trees that are very small—in fact, it’s easy to see that we expect about 50 of them to be single leaf nodes. The other weird thing, however, is that we also get some really humongous trees. The above output gets randomly regenerated every time I process this post—so I don’t know exactly what sizes you’ll end up seeing—but it’s a good bet that there is at least one tree with a size greater than 10^4. To get an intuitive idea of why this happens, imagine generating the tree in a breadth-first manner. At each new level we have a collection of “active” nodes corresponding to pending recursive calls to randomTree. Each active node generates zero or two new active nodes on the next level with equal probability, so on average the number of active nodes remains the same from level to level. So if we happen to make a lot of Branch choices right off the bat, it may take a long time before the tree “thins out” again. And if this distribution didn’t seem weird enough already, it turns out (though it is far from obvious how to prove this) that the expected size of the generated trees is infinite!

The usual solution with QuickCheck is to use the sized combinator to limit the size of generated structures, but this does not help with the problem of having too many very small trees.

Here’s a (seemingly!) stupid idea. Suppose we want to generate trees of size approximately 100 (say, within 10%). Let’s simply use the above algorithm, but with the following modifications:

  1. If we generate a tree of size < 90, throw it away and start over.
  2. If we generate a tree of size > 110, throw it away and start over. As an optimization, however, we will stop as soon as the size goes over 110; that is, we will keep track of the current size while generating and stop early if the size gets too big.

Here’s some code. First, a monad onion:

> newtype GenM a = GenM 
>     { unGenM :: ReaderT (Int,Int) (StateT Int (MaybeT (Rand StdGen))) a }
>   deriving (Functor, Applicative, Monad, MonadPlus, MonadRandom,
>             MonadState Int, MonadReader (Int,Int))

The ReaderT holds the min and max allowed sizes; the StateT holds the current size; the MaybeT allows for possible failure (if the tree gets too big or ends up too small), and the Rand StdGen is, of course, for generating random numbers. To run a computation in this monad we take a target size and a tolerance and use them to compute minimum and maximum sizes. (The (??) in the code below is an infix version of flip, defined in the lens package.)

> runGenM :: Int -> Double -> GenM a -> IO (Maybe a)
> runGenM targetSize eps m = do
>   let wiggle  = floor $ fromIntegral targetSize * eps
>       minSize = targetSize - wiggle
>       maxSize = targetSize + wiggle
>   g <- newStdGen
>   return . (evalRand ?? g) . runMaybeT . (evalStateT ?? 0)
>          . (runReaderT ?? (minSize, maxSize)) . unGenM
>          $ m

Here’s the code to try generating a tree: we call the atom function to record the increase in size, and choose between the two constructors with equal probability. atom, in turn, handles failing early if the size gets too big.

> genTreeUB :: GenM Tree
> genTreeUB = do
>   r <- getRandom
>   atom
>   if r <= (1/2 :: Double)
>     then return Leaf
>     else Branch <$> genTreeUB <*> genTreeUB
> atom :: GenM ()
> atom = do
>   (_, maxSize) <- ask
>   curSize <- get
>   when (curSize >= maxSize) mzero
>   put (curSize + 1)

genTreeLB calls genTreeUB and then performs the lower bound check on the size.

> genTreeLB :: GenM Tree
> genTreeLB = do
>   put 0
>   t <- genTreeUB
>   tSize <- get
>   (minSize, _) <- ask
>   guard $ tSize >= minSize
>   return t

Finally, genTree just calls genTreeLB repeatedly until it succeeds.

> genTree :: GenM Tree
> genTree = genTreeLB `mplus` genTree

Let’s make sure it works:

ghci> map size . fromJust <$> runGenM 100 0.1 (replicateM 30 genTree)

Neat! Okay, but surely this is really, really slow, right? We spend a bunch of time just throwing away trees of the wrong size. Before reading on, would you care to guess the asymptotic time complexity to generate a tree of size n using this algorithm?

And while you think about that, here is a random binary tree of size approximately 1000.

And the answer is… it is linear! That is, it takes O(n) time to generate a tree of size n. This is astounding—it’s the best we could possibly hope for, because of course it takes at least O(n) time to generate an object of size O(n). If you don’t believe me, I invite you to run some experiments with this code yourself. I did, and it sure looks linear:

main = do
  [sz] <- getArgs
  Just ts <- runGenM (read sz) 0.1 $ replicateM 1000 genTree
  print . (/fromIntegral n) . fromIntegral . sum . map size $ ts

archimedes :: research/species/boltzmann » time ./GenTree 50
./GenTree 50  1.37s user 0.01s system 99% cpu 1.387 total
archimedes :: research/species/boltzmann » time ./GenTree 100
./GenTree 100  3.11s user 0.02s system 99% cpu 3.152 total
archimedes :: research/species/boltzmann » time ./GenTree 200
./GenTree 200  6.82s user 0.04s system 99% cpu 6.876 total
archimedes :: research/species/boltzmann » time ./GenTree 400
./GenTree 400  13.08s user 0.08s system 99% cpu 13.208 total
archimedes :: research/species/boltzmann » time ./GenTree 800
./GenTree 800  25.99s user 0.16s system 99% cpu 26.228 total

The proof of this astounding fact uses some complex analysis which I do not understand; I wish I was joking. Of course, the constant factor can be big, depending on how small you set the “epsilon” allowing for wiggle room around the target size.1 But it is still quite feasible to generate rather large trees (with, say, 10^5 nodes).

There is much, much more to say on this topic. I just wanted to start out with a simple example before jumping into more of the technical details and generalizations, which I plan to write about in future posts. I also hope to package this and a bunch of other stuff into a library. In the meantime, you can read Duchon et. al2 if you want the details.

  1. Actually, if you set epsilon to zero, the asymptotic complexity jumps to O(n^2).

  2. Duchon, Philippe, et al. “Boltzmann samplers for the random generation of combinatorial structures.” Combinatorics Probability and Computing 13.4-5 (2004): 577-625.

About Brent

Assistant Professor of Computer Science at Hendrix College. Functional programmer, mathematician, teacher, pianist, follower of Jesus.
This entry was posted in combinatorics, haskell, math, species and tagged , , , , , . Bookmark the permalink.

14 Responses to Random binary trees with a size-limited critical Boltzmann sampler

  1. Neat! That’s an interesting paper you’ve linked there. I recently had the problem of generating (truly) random Rose Trees — asking a quetion on SO only pointed me towards using sized to determine the tree’s size, but in order to generate a truly random *structure* some more thought would have to be put into it.

    Also, I’ve already learned the word “boustrophedonic” from that article, and I haven’t read past the intro yet!

  2. ocharles says:

    Great post Brent – while not really the focus of the post, I found it very interesting to see how you implemented a solution for the problem with the composition of a bunch of monads. I’ve only recently started seeing a problem in this light, and it’s extremely powerful – as evident by how naturally the solution emerged.

  3. “it turns out (though it is far from obvious how to prove this) that the expected size of the generated trees is infinite”

    I think it’s quite straightforward to show that the expected size is infinite. We can define “size” in a straightforward way:

    size : Tree -> Int
    size Leaf = 1
    size (Branch l r) = 1 + size l + size r

    Now, let’s split up the recursion to get the following:

    children : Tree -> [Tree]
    children Leaf = []
    children (Branch l r) = [l, r]

    size : Tree -> Int
    size x = 1 + map size (children x)

    It’s difficult to deal with something that’s half-Leaf/half-Branch, so let’s turn things into numbers to make it easier:

    childNumber Leaf = 0
    childNumber (Branch _ _) = 2

    averageChildSize : Tree -> Int
    averageChildSize t = sum (map size (children t)) / (childNumber t)

    size x = 1 + (averageChildSize x) * (childNumber x)

    Now that we’ve got numbers, let’s replace all of our Trees with half-Leaf/half-Branch probabilities. Let’s call such Trees “randTree”, and importantly we don’t need to keep track of sub-trees; all randTrees are equivalent. First we get the following for childNumber:

    childNumber randTree = 1/2 * 0 + 1/2 * 2
    childNumber randTree = 0 + 1
    childNumber randTree = 1

    Next we look at averageChildSize:

    averageChildSize randTree = sum (map size (children randTree)) / (childNumber randTree)
    averageChildSize randTree = sum (map size (children randTree)) / 1
    averageChildSize randTree = sum (map size (children randTree))
    averageChildSize randTree = 1/2 * sum [] + 1/2 * sum [size randTree, size randTree]
    averageChildSize randTree = 1/2 * 0 + 1/2 * 2 * (size randTree)
    averageChildSize randTree = size randTree

    Now we can plug these into size:

    size randTree = 1 + (averageChildSize randTree) * (childNumber randTree)
    size randTree = 1 + (size randTree) * 1
    size randTree = 1 + (size randTree)

    The solution will be a fixed point of this recursive equation, but the “1 +” shows this is clearly infinite:

    size randTree = 1 + (1 + (1 + …))


  4. Pingback: Monad transformers: a cautionary tale | blog :: Brent -> [String]

  5. Jonas Duregård says:

    Nice post Brent!

    One comment: wouldn’t it be nicer to keep the simple tree generator and just use a lazy size function and lazy size comparison to sieve out the large (and small) trees? For instance size could return a peano number or you could just make a specialized function smallerThan :: Tree -> Int -> Bool.

    One potential pitfall is that the simple generator is not as lazy as it could be (can be fixed using a splitting random number generator).

    • Brent says:

      Hi Jonas, good idea! That had not occurred to me. I will give it a try. You’re right that splitting would be essential to make this work — and while people have doubts about the theoretical properties of splitting that probably only matters for cryptographic applications.

  6. Pingback: Boltzmann sampling for generic Arbitrary instances | blog :: Brent -> [String]

  7. Pingback: The generic-random library, part 1: simple generic Arbitrary instances | blog :: Brent -> [String]

  8. Naren Sundar says:

    Just a quick note. I suppose a way to get an idea why the simple method generates very large trees is to ask “if a tree has n active nodes, what is the probability that the tree will grow in the next iteration?”

    – with n=1, this is 0.5
    – with n=2, a tree grows if both or either of the nodes grow which happens with probability 1 – 0.5*0.5 = 0.75
    – with n=N the probability that this tree will grow is 1 – 0.5^N, which rapidly goes to 1

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.