Data structure challenge: solutions

In my previous post I challenged you to find a way to keep track of a sequence of slots in such a way that we can quickly (in O(\lg n) or better) either mark any empty slot as full, or find the rightmost empty slot prior to a given index. When I posted it, I had two solutions in mind; thanks to all the excellent comments I now know of many more!

  • There were quite a few answers which in some way or another boiled down to using a balanced tree structure.

    In all these cases, the idea is that the tree stores the sorted indices of all empty slots. We can mark a slot full or empty by deleting or inserting from the tree structure, and we can find the rightmost empty slot not exceeding a given index by searching for the index and returning highest value in the tree which is less than the index being searched. It is well-known that all these operations can be done in O(\lg n) time.

  • David Barbour suggested something along similar lines, but somewhat more general: keep a finger tree with cached monoidal annotations representing both the total number of elements of each subtree, as well as the number of elements satisfying some proposition (such as the number of empty slots). This also allows performing the operations in O(\lg n) time. This is in some sense similar to the previous suggestions, but it generalizes much more readily, since we can use this scheme to track any kind of monoidal annotation. I had thought of using a segment tree where slot i stores the value i when it is full, and 0 when it is empty, and each node caches the max value contained in its subtree, allowing updates and queries to happen in O(\lg n) time. This could also track arbitrary monoidal annotations, but using a finger tree is strictly more expressive since it also supports insertion and deletion (although that is not required for my original formulation of the problem).

  • Albert also suggested using a van Emde Boas tree to achieve O(\lg \lg n) performance. Van Emde Boas trees directly support a “predecessor” operation which finds the largest key smaller than a given value.

  • Roman Cheplyaka suggested using some sort of dynamic rank/select: if we think of the sequence of slots as a bit vector, and represent empty slots by 1s and full slots by 0s, we can find the rightmost empty slot up to a given index by first finding the rank of the index, then doing a select operation on that rank. (I smell some kind of adjunction here: the composition of rank then select is a sort of idempotent closure operator that “rounds down” to the index of the rightmost preceding 1 bit. Maybe one of my readers can elaborate?) The tricky part, apparently, is doing this in such a way that we can dynamically update bits; apparently it can be done so the operations are still O(\lg n) (Roman linked to this paper) but it seems complicated.

  • One of my favorite solutions (which I also independently came up with) was suggested by Julian Beaumont: use a disjoint-set data structure (aka union-find) which stores a set for each contiguous (possibly empty) block of full slots together with its one empty predecessor (we can create a dummy slot on the left end to act as the “empty predecessor” for the first block of full slots). Each set keeps track of the index of its leftmost, empty, slot, which is easy to do: any time two sets are unioned we simply take the minimum of their empty slot indices (more generally, we can annotate the sets of a disjoint-set structure with values from any arbitrary monoid). To mark an empty slot as full, we simply union its set with the set of the slot to the left. To find the rightmost empty slot left of a given index, just look up the stored leftmost index corresponding to the set of the given index. Both these operations can thus be implemented in amortized O(\alpha(n)) time (where \alpha is the inverse Ackermann function, hence essentially constant). Intuitively, I doubt it is possible to do any better than this. Curiously, however, unlike the other solutions, this solution depends crucially on the fact that we can never revert a full slot to empty!

  • Apparently, this is known as the predecessor problem, and can also be solved with something called fusion trees which I had never heard of before.


About Brent

Associate Professor of Computer Science at Hendrix College. Functional programmer, mathematician, teacher, pianist, follower of Jesus.
This entry was posted in data structures and tagged , , , , , , , , , , , , . Bookmark the permalink.

3 Responses to Data structure challenge: solutions

  1. Derek Elkins says:

    For rank/select, Ed Kmett has mentioned an adjoint triple in this regard, e.g.

    Here’s a more in-depth discussion. I’ll use a slightly different definition of rank (and thus select) because it avoids corner cases.

    First, fix a bitvector B : [0,N) -> {0,1} where all intervals are intervals of natural numbers. Define W = sum_i B(i). (I chose W because this is often called the weight of a bitvector.) I’ll define three functions with the following types: rank : [0,N] -> [0,W], and select, coselect : [0,W] -> [0, N].

    rank(j) = sum_{k in [0,j)} B(k)

    The tweak is that I use a half-open interval here instead of a closed interval. Incidentally, W = rank(N). Some key facts about rank which are easy to prove from its definition are: 1) it’s monotonic, 2) it’s surjective, 3) it reflects *strict* inequalities, i.e. rank(i) < rank(j) implies i < j.

    select(i) = min{k in [0,N] | rank(k) = i}
    coselect(i) = max{k in [0,N] | rank(k) = i}

    These are well-defined because rank is surjective (and, unlike the usual definition, k is chosen from a closed interval). Using the key facts from above and properties of min/max, you can show that select -| rank -| coselect, i.e. for all i in [0, W], j in [0, N], select(i) <= j iff i <= rank(j). Similarly for coselect. We clearly have rank(select(i)) = i = rank(coselect(i)).

    Since rank is surjective, it partitions its domain, [0,N]. Since it is monotonic, these partitions (or fibers) are subintervals. select and coselect just pick the smallest/largest elements of the specified fibers. This connects to the union-find based implementation you describe. You can almost define select in terms of coselect (or vice versa) because the maximum element of the fiber corresponding to i is one below the minimum element of the fiber corresponding to i+1. This is the basis of the definition of coselect in Ed's tweet. Of course, we have an issue for i = N.

  2. Pingback: Data structure challenge: application | blog :: Brent -> [String]

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.