## Boltzmann sampling for generic Arbitrary instances

tl;dr: I know how to generate random instances of data types in a generic way, and even have some old code that already does all the hard work, but won’t have time to polish and package it until this summer. If you’re interested in helping, let me know!

This morning Kenny Foner pointed out to me this tweet by Gabriel Gonzales, asking why there isn’t a default `Arbitrary` instance for types implementing `Generic`. It reminded me that I’ve been meaning for a while now (years, in fact!) to get around to packaging up some code that does this.

As several pointed out on Twitter, this seems obvious, but it isn’t. It’s easy to write a generic `Arbitrary` instance, but hard to write one that generates a good distribution of values. The basic idea is clear: randomly pick a constructor, and then recursively generate random subtrees. The problem is that this is very likely to either blow up and generate gigantic (even infinite) trees, or to generate almost all tiny trees, or both. I wrote a post about this three years ago which illustrates the problem. It also explains half of the solution: generate random trees with a target size in mind, and throw out any which are not within some epsilon of the target size (crucially, stopping the generation early as soon as the tree being generated gets too big).

However, I never got around to explaining the other half of the solution: it’s crucially important to use the right probabilities when picking a constructor. With the wrong probabilities, you will spend too much time generating trees that are either too small or too big. The surprising thing is that with exactly the right probabilities, you can expect to wait only $O(n)$ time before generating a tree of size (approximately1) $n$.2

So, how does one pick the right probabilities? Essentially, you turn the generic description of your data type into a mutually recursive system of generating functions, and (numerically) find their radii of convergence, when thought of as functions in the complex plane. Using these values it is straightforward to compute the right probabilities to use. For the intrepid, this is explained in Duchon et. al3.

I have some old Haskell code from Alexis Darrasse which already does a bunch of the work. It would have to be updated a bit to work with modern libraries and with `GHC.Generics`, and packaged up to go on Hackage. I won’t really have time to work on this until the summer—but if anyone else is interested in working on this, let me know! I’d be happy to send you the code and provide some guidance in figuring it out.

1. The constant factor depends on how approximate you are willing to be.

2. I wanted to put an exclamation point at the end of that sentence, because this is really surprising. But it looked like $n$ factorial. So, here is the exclamation point: !

3. Duchon, Philippe, et al. “Boltzmann samplers for the random generation of combinatorial structures.” Combinatorics Probability and Computing 13.4-5 (2004): 577-625.

Assistant Professor of Computer Science at Hendrix College. Functional programmer, mathematician, teacher, pianist, follower of Jesus.
This entry was posted in combinatorics, haskell, math, species and tagged , , , , , . Bookmark the permalink.

### 14 Responses to Boltzmann sampling for generic Arbitrary instances

1. Eitan says:

I’d love to take a look at the code, though I don’t know if I’d be able to complete it.

2. Tom says:

Interesting stuff. Would you mind publishing what you have on github with a large warning sign and a link to this blog post?

3. Mithrandir says:

Hi,

I think I might give it a try, if there’s no one else willing to do it. I have weekends free and I’d like to get back on working on Haskell code and contributing to the environment.

Plus, I am also interested in the code for putting it to real use in another of my projects.

4. Hi Brent,
I have been touching this subject in the (distant) past while doing my PhD (on tests generation…). I do not have the skills to do the hard math stuff but would be interested in lending a hand on the coding part and use that for our testing.

5. ryanglscott says:

Interesting stuff! Are you envisioning making this a separate package, or merging this into QuickCheck itself? Either way, I’d be eager to work on something like this. Being able to create Arbitrary instances generically is something I’ve wanted for a long time.

6. lyxia says:

I’m surprised this still doesn’t exist. This is a very nice problem and quite a low hanging fruit to practice generic programming in Haskell on a project with high visibility.

A small issue is that Boltzmann+filtering is not lazy, since checking that a value has the right size will force it.

For that we could look for another generation strategy, probably with a non-uniform distribution, which we might still control to some extent. A slightly biased distribution seems like an acceptable loss for a lazy “default” Arbitrary instance.

One more point for laziness is that it would be consistent with instances for types whose values are all infinite (`data Forever a = Forever a (Forever a)`), which must be lazy if one intends to support them.

7. Brent says:

Hi all, thanks for the responses! The code is now up at

https://github.com/byorgey/boltzmann

I have started working a bit on cleaning it up and adding some comments. I may also try to outline some ideas as to what needs doing, but I welcome input on that front as well. Take a look, ask questions, make pull requests. As is my usual policy I’ll give push access to anyone who makes a reasonable pull request.

To address Ryan’s question, I think this should be a separate library, not merged into QuickCheck. Parts of it may be more generally useful/applicable than just for QuickCheck. @lyxia, nice observation re: laziness. I agree with you about supporting multiple generation strategies with different tradeoffs.

8. Alexandre says:

Hello,

I’d like to help out with this if possible.

9. lyxia says:

Hello, has there been any progress on this?
In your repo’s Oracle.Newton module, the iter function does a multiplication by the Jacobian, it should be its inverse.

• Brent says:

In that case please open an issue or pull request on the repository.