In my previous post I challenged you to find a way to keep track of a sequence of slots in such a way that we can quickly (in O(log n) time or better) either mark any empty slot as full, or find the rightmost empty slot prior to a given index. When I posted it, I had two solutions in mind; thanks to all the excellent comments, I now know of many more!

There were quite a few answers which in one way or another boiled down to using a balanced tree structure.

Joachim Breitner suggested using something like Data.IntMap. Roman Cheplyaka suggested using Data.Set. An anonymous commenter suggested using some kind of self-balancing tree, such as a red-black tree.

Commenter Albert suggested keeping two balanced trees, one containing all the full slots and one containing all the empty slots (though it is not clear to me what the benefit of having both would be).
In all these cases, the idea is that the tree stores the sorted indices of all empty slots. We can mark a slot full or empty by deleting it from or inserting it into the tree, and we can find the rightmost empty slot prior to a given index by searching for the index and returning the highest value in the tree which is less than it. It is well-known that all these operations can be done in O(log n) time.
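As a concrete illustration of this sorted-indices idea, here is a minimal Python sketch. The names (EmptySlots, mark_full, and so on) are my own invention, and a plain sorted list stands in for a balanced tree, so insertion and deletion cost O(n) here rather than the O(log n) a real Data.Set-style structure achieves:

```python
import bisect

class EmptySlots:
    """Track empty slots 0..n-1 as a sorted list of indices."""

    def __init__(self, n):
        self.empty = list(range(n))  # sorted indices of empty slots

    def mark_full(self, i):
        # delete i from the sorted structure, if present
        j = bisect.bisect_left(self.empty, i)
        if j < len(self.empty) and self.empty[j] == i:
            self.empty.pop(j)

    def mark_empty(self, i):
        # insert i into the sorted structure, if absent
        j = bisect.bisect_left(self.empty, i)
        if j == len(self.empty) or self.empty[j] != i:
            self.empty.insert(j, i)

    def rightmost_empty_before(self, i):
        # largest empty index strictly less than i, or None if none exists
        j = bisect.bisect_left(self.empty, i)
        return self.empty[j - 1] if j > 0 else None
```

Swapping the list for an actual balanced search tree (or something like sortedcontainers.SortedList) gives the same interface with every operation in O(log n).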


David Barbour suggested something along similar lines, but somewhat more general: keep a finger tree with cached monoidal annotations representing both the total number of elements of each subtree and the number of elements satisfying some proposition (such as the number of empty slots). This also allows performing the operations in O(log n) time. This is in some sense similar to the previous suggestions, but it generalizes much more readily, since we can use this scheme to track any kind of monoidal annotation. I had thought of using a segment tree where slot i stores the value −∞ when it is full, and i when it is empty, and each node caches the max value contained in its subtree, allowing updates and queries to happen in O(log n) time. This could also track arbitrary monoidal annotations, but using a finger tree is strictly more expressive, since it also supports insertion and deletion (although that is not required for my original formulation of the problem).
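The segment-tree variant can be sketched as follows. This is an illustrative Python version under my own naming, using the standard iterative array-based layout; the query is over the inclusive prefix [0, i] and returns −∞ when no empty slot exists in that range:

```python
NEG_INF = float("-inf")

class SegTree:
    # Leaf i holds i when slot i is empty and -inf when it is full;
    # each internal node caches the max of its two children.
    def __init__(self, n):
        self.n = n
        self.tree = [NEG_INF] * (2 * n)
        for i in range(n):
            self.tree[n + i] = i  # all slots start empty
        for i in range(n - 1, 0, -1):
            self.tree[i] = max(self.tree[2 * i], self.tree[2 * i + 1])

    def set_slot(self, i, empty):
        # point update in O(log n): rewrite the leaf, then fix ancestors
        j = self.n + i
        self.tree[j] = i if empty else NEG_INF
        j //= 2
        while j:
            self.tree[j] = max(self.tree[2 * j], self.tree[2 * j + 1])
            j //= 2

    def rightmost_empty_upto(self, i):
        # max over leaves [0, i] in O(log n); -inf means no empty slot
        lo, hi, best = self.n, self.n + i + 1, NEG_INF
        while lo < hi:
            if lo & 1:
                best = max(best, self.tree[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                best = max(best, self.tree[hi])
            lo //= 2
            hi //= 2
        return best
```

Because empty slots contribute their own index and full slots contribute −∞, the prefix max is exactly the rightmost empty slot not exceeding the query index.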

Albert also suggested using a van Emde Boas tree to achieve O(log log n) performance. Van Emde Boas trees directly support a “predecessor” operation which finds the largest key smaller than a given value.

Roman Cheplyaka suggested using some sort of dynamic rank/select: if we think of the sequence of slots as a bit vector, and represent empty slots by 1s and full slots by 0s, we can find the rightmost empty slot up to a given index by first finding the rank of the index, then doing a select operation on that rank. (I smell some kind of adjunction here: the composition of rank then select is a sort of idempotent closure operator that “rounds down” to the index of the rightmost preceding 1 bit. Maybe one of my readers can elaborate?) The tricky part, apparently, is doing this in such a way that we can dynamically update bits; apparently it can be done so the operations are still O(log n) (Roman linked to this paper), but it seems complicated.
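Here is a naive illustration of the rank-then-select “rounding down” (with 1 = empty), assuming the convention that select(i) returns the position of the i-th 1 bit, 1-indexed. Each operation is O(n) here; the whole point of the dynamic structures is to support the same queries, plus bit updates, in logarithmic time:

```python
# Naive rank/select on a Python list of bits (1 = empty slot, 0 = full).
def rank(bits, j):
    return sum(bits[:j])  # number of 1s in positions [0, j)

def select(bits, i):
    # position of the i-th 1, 1-indexed; None if fewer than i 1s exist
    count = 0
    for k, b in enumerate(bits):
        count += b
        if b and count == i:
            return k
    return None

def rightmost_empty_upto(bits, j):
    # "round down" to the rightmost 1 at position <= j
    r = rank(bits, j + 1)
    return select(bits, r) if r > 0 else None
```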

One of my favorite solutions (which I also independently came up with) was suggested by Julian Beaumont: use a disjoint-set data structure (aka union-find) which stores a set for each contiguous (possibly empty) block of full slots together with its one empty predecessor (we can create a dummy slot on the left end to act as the “empty predecessor” for the first block of full slots). Each set keeps track of the index of its leftmost (empty) slot, which is easy to do: any time two sets are unioned, we simply take the minimum of their empty slot indices (more generally, we can annotate the sets of a disjoint-set structure with values from any arbitrary monoid). To mark an empty slot as full, we simply union its set with the set of the slot to its left. To find the rightmost empty slot left of a given index, just look up the stored leftmost index corresponding to the set of the given index. Both these operations can thus be implemented in amortized O(α(n)) time (where α is the inverse Ackermann function, hence essentially constant). Intuitively, I doubt it is possible to do any better than this. Curiously, however, unlike the other solutions, this solution depends crucially on the fact that we can never revert a full slot to empty!
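A minimal Python sketch of this union-find solution, with hypothetical names. Slot 0 plays the role of the dummy always-empty slot guarding the left end, so real slots are 1..n; for brevity I use path halving and skip union by size:

```python
class Slots:
    def __init__(self, n):
        # slot 0 is the dummy "always empty" slot; real slots are 1..n
        self.parent = list(range(n + 1))
        self.leftmost = list(range(n + 1))  # leftmost (empty) slot per set

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def mark_full(self, i):
        # union i's set with the set of the slot to its left,
        # keeping the minimum empty-slot annotation on the new root
        a, b = self.find(i), self.find(i - 1)
        self.parent[a] = b
        self.leftmost[b] = min(self.leftmost[a], self.leftmost[b])

    def rightmost_empty_upto(self, i):
        # the one empty slot in i's set; 0 means only the dummy remains
        return self.leftmost[self.find(i)]
```

Note that nothing here supports reverting a slot to empty: unions can never be undone, which is exactly the limitation mentioned above.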

Apparently, this is known as the predecessor problem, and can also be solved with something called fusion trees, which I had never heard of before.
For rank/select, Ed Kmett has mentioned an adjoint triple, e.g. https://twitter.com/kmett/status/846564035990605825
Here’s a more in-depth discussion. I’ll use a slightly different definition of rank (and thus select) because it avoids corner cases.
First, fix a bit vector B : [0,N) → {0,1}, where all intervals are intervals of natural numbers. Define W = sum_i B(i). (I chose W because this is often called the weight of a bit vector.) I’ll define three functions with the following types: rank : [0,N] → [0,W], and select, coselect : [0,W] → [0,N].
rank(j) = sum_{k in [0,j)} B(k)
The tweak is that I use a half-open interval here instead of a closed interval. Incidentally, W = rank(N). Some key facts about rank which are easy to prove from its definition are: 1) it’s monotonic, 2) it’s surjective, 3) it reflects *strict* inequalities, i.e. rank(i) < rank(j) implies i < j.
select(i) = min{k in [0,N] | rank(k) = i}
coselect(i) = max{k in [0,N] | rank(k) = i}
These are well-defined because rank is surjective (and, unlike the usual definition, k is chosen from a closed interval). Using the key facts from above and properties of min/max, you can show that select ⊣ rank ⊣ coselect, i.e. for all i in [0,W] and j in [0,N], select(i) <= j iff i <= rank(j), and similarly for coselect. We clearly have rank(select(i)) = i = rank(coselect(i)).
Since rank is surjective, it partitions its domain, [0,N]. Since it is monotonic, the equivalence classes (or fibers) are subintervals. select and coselect just pick the smallest/largest elements of the specified fibers. This connects to the union-find-based implementation you describe. You can almost define select in terms of coselect (or vice versa), because the maximum element of the fiber corresponding to i is one below the minimum element of the fiber corresponding to i+1. This is the basis of the definition of coselect in Ed's tweet. Of course, we have an issue for i = W.
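The two adjunctions are easy to check exhaustively on a small example. A quick Python sketch using the definitions above (half-open rank, min/max over fibers):

```python
# Exhaustive check of select -| rank -| coselect on a small bit vector.
B = [1, 0, 1, 1, 0, 0, 1]
N = len(B)
W = sum(B)

def rank(j):
    return sum(B[:j])  # number of 1s in positions [0, j)

def select(i):
    return min(k for k in range(N + 1) if rank(k) == i)

def coselect(i):
    return max(k for k in range(N + 1) if rank(k) == i)

for i in range(W + 1):
    for j in range(N + 1):
        # select left adjoint to rank: select(i) <= j  iff  i <= rank(j)
        assert (select(i) <= j) == (i <= rank(j))
        # rank left adjoint to coselect: rank(j) <= i  iff  j <= coselect(i)
        assert (rank(j) <= i) == (j <= coselect(i))
```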
That makes a lot of sense, thanks!
Pingback: Data structure challenge: application | blog :: Brent -> [String]