Update, 7/17/2017: this now exists; see https://byorgey.wordpress.com/2016/09/20/the-generic-random-library-part-1-simple-generic-arbitrary-instances/ .
tl;dr: I know how to generate random instances of data types in a generic way, and even have some old code that already does all the hard work, but won’t have time to polish and package it until this summer. If you’re interested in helping, let me know!
This morning Kenny Foner pointed out to me this tweet by Gabriel Gonzales, asking why there isn’t a default Arbitrary instance for types implementing Generic. It reminded me that I’ve been meaning for a while now (years, in fact!) to get around to packaging up some code that does this.
As several pointed out on Twitter, this seems obvious, but it isn’t. It’s easy to write a generic Arbitrary instance, but hard to write one that generates a good distribution of values. The basic idea is clear: randomly pick a constructor, and then recursively generate random subtrees. The problem is that this is very likely to either blow up and generate gigantic (even infinite) trees, or to generate almost all tiny trees, or both. I wrote a post about this three years ago which illustrates the problem. It also explains half of the solution: generate random trees with a target size in mind, and throw out any which are not within some epsilon of the target size (crucially, stopping the generation early as soon as the tree being generated gets too big).
However, I never got around to explaining the other half of the solution: it’s crucially important to use the right probabilities when picking a constructor. With the wrong probabilities, you will spend too much time generating trees that are either too small or too big. The surprising thing is that with exactly the right probabilities, you can expect to wait only time before generating a tree of size (approximately1)
.2
So, how does one pick the right probabilities? Essentially, you turn the generic description of your data type into a mutually recursive system of generating functions, and (numerically) find their radii of convergence, when thought of as functions in the complex plane. Using these values it is straightforward to compute the right probabilities to use. For the intrepid, this is explained in Duchon et. al3.
I have some old Haskell code from Alexis Darrasse which already does a bunch of the work. It would have to be updated a bit to work with modern libraries and with GHC.Generics, and packaged up to go on Hackage. I won’t really have time to work on this until the summer—but if anyone else is interested in working on this, let me know! I’d be happy to send you the code and provide some guidance in figuring it out.
- The constant factor depends on how approximate you are willing to be.↩
- I wanted to put an exclamation point at the end of that sentence, because this is really surprising. But it looked like
factorial. So, here is the exclamation point: !↩
- Duchon, Philippe, et al. “Boltzmann samplers for the random generation of combinatorial structures.” Combinatorics Probability and Computing 13.4-5 (2004): 577-625.↩
I’d love to take a look at the code, though I don’t know if I’d be able to complete it.
Interesting stuff. Would you mind publishing what you have on github with a large warning sign and a link to this blog post?
Hi,
I think I might give it a try, if there’s no one else willing to do it. I have weekends free and I’d like to get back on working on Haskell code and contributing to the environment.
Plus, I am also interested in the code for putting it to real use in another of my projects.
Hi Brent,
I have been touching this subject in the (distant) past while doing my PhD (on tests generation…). I do not have the skills to do the hard math stuff but would be interested in lending a hand on the coding part and use that for our testing.
Interesting stuff! Are you envisioning making this a separate package, or merging this into QuickCheck itself? Either way, I’d be eager to work on something like this. Being able to create Arbitrary instances generically is something I’ve wanted for a long time.
I’m surprised this still doesn’t exist. This is a very nice problem and quite a low hanging fruit to practice generic programming in Haskell on a project with high visibility.
A small issue is that Boltzmann+filtering is not lazy, since checking that a value has the right size will force it.
For that we could look for another generation strategy, probably with a non-uniform distribution, which we might still control to some extent. A slightly biased distribution seems like an acceptable loss for a lazy “default” Arbitrary instance.
One more point for laziness is that it would be consistent with instances for types whose values are all infinite (`data Forever a = Forever a (Forever a)`), which must be lazy if one intends to support them.
Hi all, thanks for the responses! The code is now up at
https://github.com/byorgey/boltzmann
I have started working a bit on cleaning it up and adding some comments. I may also try to outline some ideas as to what needs doing, but I welcome input on that front as well. Take a look, ask questions, make pull requests. As is my usual policy I’ll give push access to anyone who makes a reasonable pull request.
To address Ryan’s question, I think this should be a separate library, not merged into QuickCheck. Parts of it may be more generally useful/applicable than just for QuickCheck. @lyxia, nice observation re: laziness. I agree with you about supporting multiple generation strategies with different tradeoffs.
Hi Brent!
I was just made aware of this post and wanted to let you know that I had independently put my code online as-is but with a proper license at https://github.com/lip6-apr/syb_qc. I haven’t done anything related since, but I want to take a closer look at the following:
– Li-yao Xia’s implementation at https://github.com/Lysxia/generic-random
– the paper on Luck https://arxiv.org/abs/1607.05443, a DSL for writing generators
– the GenCheck framework by Uszkay & Carette http://hackage.haskell.org/package/gencheck
– Jonas Duregård’s work, e.g. http://publications.lib.chalmers.se/publication/225344
Hi Αλέξη! It’s great to hear from you, I hope you are well. Li-yao Xia’s implementation is very cool — quite thorough and several orders of magnitude more sophisticated than what I had in mind or could have implemented myself. I’m familiar with GenCheck and with Jonas Duregård’s work, but somehow I hadn’t yet heard about Luck, thanks for the link!
Hello,
I’d like to help out with this if possible.
Hello, has there been any progress on this?
In your repo’s Oracle.Newton module, the iter function does a multiplication by the Jacobian, it should be its inverse.
In that case please open an issue or pull request on the repository.
Pingback: Any clues about this Newton iteration formula with Jacobian matrix? | blog :: Brent -> [String]
Pingback: The generic-random library, part 1: simple generic Arbitrary instances | blog :: Brent -> [String]