Counting inversions via rank queries

In a post from about a year ago, I explained an algorithm for counting the number of inversions of a sequence in O(n \lg n) time. As a reminder, given a sequence a_1, a_2, \dots, a_n, an inversion is a pair of positions i, j such that a_i and a_j are in the “wrong order”, that is, i < j but a_i > a_j. There can be up to n(n-1)/2 inversions in the worst case, so we cannot hope to count them any faster than quadratic time by simply incrementing a counter once per inversion. Instead, the algorithm in my previous post counts inversions in O(n \lg n) time using a variant of merge sort.

I recently learned of an entirely different algorithm for achieving the same result. (In fact, I learned of it when I gave this problem on an exam and a student came up with an unexpected solution!) This solution does not use a divide-and-conquer approach at all, but hinges on a clever data structure.

Suppose we have a bag of values (i.e. a collection where duplicates are allowed) on which we can perform the following two operations:

  1. Insert a new value into the bag.
  2. Count how many values in the bag are strictly greater than a given value.

We’ll call the second operation a rank query because it really amounts to finding the rank or index of a given value in the bag—how many values are greater than it (and thus how many are less than or equal to it).

If we can do these two operations in logarithmic time (i.e. logarithmic in the number of values in the bag), then we can count inversions in O(n \lg n) time. Can you see how before reading on? You might also like to think about how we could actually implement a data structure that supports these operations.

Counting inversions with bags and rank queries

So, let’s see how to use a bag with logarithmic insertion and rank queries to count inversions. Start with an empty bag. For each element in the sequence, see how many things in the bag are strictly greater than it, and add this count to a running total; then insert the element into the bag, and repeat with the next element. That is, for each element we compute the number of inversions of which it is the right end, by counting how many elements that came before it (and are hence in the bag already) are strictly greater than it. It’s easy to see that this will count every inversion exactly once. It’s also easy to see that it will take O(n \lg n) time: for each of the n elements, we do two O(\lg n) operations (one rank query and one insertion).
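
For example, consider the sequence 3, 5, 1, 4, 2. When we reach the 1, the bag contains 3 and 5, both strictly greater, so we add 2 to the running total; when we reach the 4, only the 5 in the bag is greater, so we add 1; and when we reach the 2, three of the values in the bag (3, 5, and 4) are greater, so we add 3. No other element has anything greater before it, so there are 2 + 1 + 3 = 6 inversions in total.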

In fact, we can do a lot more with this data structure than just count inversions; it sometimes comes in handy for competitive programming problems. More in a future post, perhaps!

So how do we implement this magical data structure? First of all, we can use a balanced binary search tree to store the values in the bag; clearly this will allow us to insert in logarithmic time. However, a plain binary search tree wouldn’t allow us to quickly count the number of values strictly greater than a given query value. The trick is to augment the tree so that each node also caches the size of the subtree rooted at that node, being careful to maintain these counts while inserting and balancing.

Augmented red-black trees in Haskell

Let’s see some code! In Haskell, probably the easiest type of balanced BST to implement is a red-black tree. (If I were implementing this in an imperative language I might use splay trees instead, but as far as I know they are super annoying to implement in Haskell. I will definitely take you out for a social beverage of your choice if you can show me an elegant Haskell implementation of splay trees! This one is cool but somehow feels too complex.) However, this isn’t going to be some fancy, type-indexed, correct-by-construction implementation of red-black trees, although that is certainly fun. I am actually going to implement left-leaning red-black trees, mostly following Sedgewick; see those slides for more explanation and proofs. This is one of the simplest ways I know to implement red-black trees (though it’s not necessarily the most efficient).

First, a red-black tree is either empty, or a node with a color (which we imagine as the color of the incoming edge), a cached size, a value, and two subtrees.

> {-# LANGUAGE PatternSynonyms #-}
> 
> data Color = R | B
>   deriving Show
> 
> otherColor :: Color -> Color
> otherColor R = B
> otherColor B = R
> 
> data RBTree a
>   = Empty
>   | Node Color Int (RBTree a) a (RBTree a)
>   deriving Show

To make some of the tree manipulation code easier to read, we define some convenient patterns for matching on the structure of a tree when we don’t care about the values or cached sizes: ANY matches any node, binding its two subtrees, while RED and BLACK additionally require the node to have the appropriate color. We also make a function to extract the cached size of a subtree.

> pattern ANY   l r <- Node _ _ l _ r
> pattern RED   l r <- Node R _ l _ r
> pattern BLACK l r <- Node B _ l _ r
> 
> size :: RBTree a -> Int
> size Empty            = 0
> size (Node _ n _ _ _) = n

The next thing to implement is the workhorse of most balanced binary tree implementations: rotations. The fiddliest bit here is managing the cached sizes appropriately. When rotating, the size of the root node remains unchanged, but the new child node, as compared to the original, has lost one subtree and gained another. Note also that we will only ever rotate around red edges, so we pattern-match on the color as a sanity check, although this is not strictly necessary. The error cases below should never happen.

> rotateL :: RBTree a -> RBTree a
> rotateL (Node c n t1 x (Node R m t2 y t3))
>   = Node c n (Node R (m + size t1 - size t3) t1 x t2) y t3
> rotateL _ = error "rotateL on non-rotatable tree!"
> 
> rotateR :: RBTree a -> RBTree a
> rotateR (Node c n (Node R m t1 x t2) y t3)
>   = Node c n t1 x (Node R (m - size t1 + size t3) t2 y t3)
> rotateR _ = error "rotateR on non-rotatable tree!"

To recolor a node, we just flip its color. We can then split a tree with two red subtrees by recoloring all three nodes. (The “split” terminology comes from the isomorphism between red-black trees and 2-3-4 trees; red edges can be thought of as “gluing” nodes together into a larger node, and this recoloring operation corresponds to splitting a 4-node into three 2-nodes.)

> recolor :: RBTree a -> RBTree a
> recolor Empty            = Empty
> recolor (Node c n l x r) = Node (otherColor c) n l x r
> 
> split :: RBTree a -> RBTree a
> split (Node c n l@(RED _ _) x r@(RED _ _))
>   = (Node (otherColor c) n (recolor l) x (recolor r))
> split _ = error "split on non-splittable tree!"

Finally, we implement a function to “fix up” the invariants by doing rotations as necessary: if we have two red subtrees we don’t touch them; if we have only one right red subtree we rotate it to the left (this is where the name “left-leaning” comes from), and if we have a left red child which itself has a left red child, we rotate right. (This function probably seems quite mysterious on its own; see Sedgewick for some nice pictures which explain it very well!)

> fixup :: RBTree a -> RBTree a
> fixup t@(ANY (RED _ _) (RED _ _)) = t
> fixup t@(ANY _         (RED _ _)) = rotateL t
> fixup t@(ANY (RED (RED _ _) _) _) = rotateR t
> fixup t = t

We can finally implement insertion. First, to insert into an empty tree, we create a red node with size 1.

> insert :: Ord a => a -> RBTree a -> RBTree a
> insert a Empty = Node R 1 Empty a Empty

If we encounter a node with two red children, we perform a split before continuing. This may violate the red-black invariants above us, but we will fix it up later on our way back up the tree.

> insert a t@(ANY (RED _ _) (RED _ _)) = insert a (split t)

Otherwise, we compare the element to be inserted with the root, insert on the left or right as appropriate, increment the cached size, and fixup the result. Notice that we don’t stop recursing upon encountering a value that is equal to the value to be inserted, because our goal is to implement a bag rather than a set. Here I have chosen to put values equal to the root in the left subtree, but it really doesn’t matter.

> insert a (Node c n l x r)
>   | a <= x    = fixup (Node c (n+1) (insert a l) x r)
>   | otherwise = fixup (Node c (n+1) l x (insert a r))

Implementing rank queries

Now, thanks to the cached sizes, we can count the values greater than a query value.

> numGT :: Ord a => RBTree a -> a -> Int

The empty tree contains 0 values strictly greater than anything.

> numGT Empty _ = 0

For a non-empty tree, we distinguish two cases:

> numGT (Node _ n l x r) q

If the query value q is less than the root, then we know that the root along with everything in the right subtree is strictly greater than q, so we can just add 1 + size r without recursing into the right subtree. We also recurse into the left subtree to count any values greater than q it contains.

>   | q < x     = numGT l q + 1 + size r

Otherwise, if q is greater than or equal to the root, any values strictly greater than q must be in the right subtree, so we recurse to count them.

>   | otherwise = numGT r q

By inspection we can see that numGT calls itself at most once, moving one level down the tree with each recursive call, so it makes a logarithmic number of calls, with only a constant amount of work at each call—thanks to the fact that size takes only constant time to look up a cached value.

Counting inversions

Finally, we can put together the pieces to count inversions. The code is quite simple: recurse through the list with an accumulating red-black tree, doing a rank query on each value, and sum the results.

> inversions :: Ord a => [a] -> Int
> inversions = go Empty
>   where
>     go _ []     = 0
>     go t (a:as) = numGT t a + go (insert a t) as

Let’s try it out!

λ> inversions [3,5,1,4,2]
6
λ> inversions [2,2,2,2,2,1]
5
λ> :set +s
λ> inversions [3000, 2999 .. 1]
4498500
(0.19 secs, 96,898,384 bytes)

It seems to work, and is reasonably fast!

Exercises

  1. Further augment each node with a counter representing the number of copies of the given value which are contained in the bag, and maintain the invariant that each distinct value occurs in only a single node.

  2. Rewrite inversions without a recursive helper function, using a scan, a zip, and a fold.

  3. It should be possible to implement bags with rank queries using fingertrees instead of building our own custom balanced tree type (though it seems kind of overkill).

  4. My intuition tells me that it is not possible to count inversions faster than n \lg n. Prove it.


Computing Eulerian paths is harder than you think

Everyone who has studied any graph theory at all knows the celebrated story of the Seven Bridges of Königsberg, and how Euler gave birth to modern graph theory while solving the problem.

Euler’s proof is clever, incisive, not hard to understand, and a great introduction to the kind of abstract reasoning we can do about graphs. There’s little wonder that it is often used as one of the first nontrivial graph theory results students are introduced to, e.g. in a discrete mathematics course. (Indeed, I will be teaching discrete mathematics in the spring and certainly plan to talk about Eulerian paths!)

Euler’s 1735 solution was not constructive, and in fact he really only established one direction of the “if and only if”:

If a graph has an Eulerian path, then it has exactly zero or two vertices with odd degree.

This can be used to rule out the existence of Eulerian paths in graphs without the right vertex degrees, which was Euler’s specific motivation. However, one suspects that Euler knew it was an if and only if, and didn’t write about the other direction (if a graph has exactly zero or two vertices with odd degree, then it has an Eulerian path) because he thought it was trivial.1

The first person to publish a full proof of both directions, including an actual algorithm for finding an Eulerian path, seems to be Carl Hierholzer, whose friend published a posthumous paper in Hierholzer’s name after his untimely death in 1871, a few weeks before his 31st birthday.2 (Notice that this was almost 150 years after Euler’s original paper!) If the vertex degrees cooperate, finding an Eulerian path is almost embarrassingly easy according to Hierholzer’s algorithm: starting at one of the odd-degree vertices (or anywhere you like if there are none), just start walking through the graph—any which way you please, it doesn’t matter!—visiting each edge at most once, until you get stuck. Then pick another part of the graph you haven’t visited, walk through it randomly, and splice that path into your original path. Repeat until you’ve explored the whole graph. And generalizing all of this to directed graphs isn’t much more complicated.

So, in summary, this is a well-studied problem, solved hundreds of years ago, that we present to students as a first example of a nontrivial yet still simple-to-understand graph proof and algorithm. So it should be pretty easy to code, right?

So what’s the problem?

Recently I came across the eulerianpath problem on Open Kattis, and I realized that although I have understood this algorithm on a theoretical level for almost two decades (I almost certainly learned it as a young undergraduate), I have never actually implemented it! So I set out to solve it.

Right away the difficulty rating of 5.7 tells us that something strange is going on. “Easy” problems—the kind of problems you can give to an undergraduate at the point in their education when they might first be presented with the problem of finding Eulerian paths—typically have a difficulty rating below 3. As I dove into trying to implement it, I quickly realized two things. First of all, given an arbitrary graph, there’s a lot of somewhat finicky work that has to be done to check whether the graph even has an Eulerian path, before running the algorithm proper:

  1. Calculate the degree of all graph vertices (e.g. by iterating through all the edges and incrementing appropriate counters for the endpoints of each edge).
  2. Check whether the degrees satisfy Euler’s criteria for the existence of a solution: iterate through all the vertices, counting how many have odd degree, and make sure that count is either zero or two. Along the way, if we see an odd-degree vertex, remember it so we can be sure to start the path there.
  3. If all vertices have even degree, pick an arbitrary node as the start vertex.
  4. Ensure the graph is connected (e.g. by doing a depth-first search)—Euler kind of took this for granted, but this technically has to be part of a correct statement of the theorem. If we have a disconnected graph, each component could have an Eulerian path or cycle without the entire graph having one.

And if the graph is directed—as it is in the eulerianpath problem on Kattis—then the above steps get even more finicky. In step 1, we have to count the in- and outdegree of each vertex separately; in step 2, we have to check that the in- and outdegrees of all vertices are equal, except for possibly two vertices, one of which has exactly one more outgoing than incoming edge (this must be the start vertex) and the other of which has exactly one more incoming than outgoing edge; and in step 4, we have to make sure to start the DFS from the chosen start vertex, because the graph need not be strongly connected: it’s enough for the entire graph to be reachable from the start vertex.
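
To illustrate how finicky this checking is, here is a minimal Haskell sketch of the degree check for the directed case (my own illustration with hypothetical names, not code from any particular solution; it assumes the edge list is nonempty and leaves the connectivity check to a separate DFS):

import qualified Data.Map.Strict as M

-- Net balance (outdegree minus indegree) of every vertex, given the
-- edges of a directed graph as (from, to) pairs.
balance :: [(Int, Int)] -> M.Map Int Int
balance = foldr (\(u, v) -> M.insertWith (+) u 1 . M.insertWith (+) v (-1)) M.empty

-- Pick a start vertex if the degree criteria hold: either every vertex
-- is balanced (start anywhere), or exactly one vertex has one more
-- outgoing than incoming edge (the start) and one has the opposite.
startVertex :: [(Int, Int)] -> Maybe Int
startVertex es
  | M.null nonzero = Just (fst (head es))    -- all balanced: start anywhere
  | [s] <- M.keys (M.filter (== 1) b), M.size nonzero == 2 = Just s
  | otherwise = Nothing
  where
    b       = balance es
    nonzero = M.filter (/= 0) b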

The second thing I realized is that Hierholzer’s algorithm proper—walk around until getting stuck, then repeatedly explore unexplored parts of the graph and splice them into the path being built—is still rather vague, and it’s nontrivial to figure out how to do it, and what data structures to use, so that everything runs in time linear in the number of edges. For example, we don’t want to iterate over the whole graph—or even just the whole path built so far—to find the next unexplored part of the graph every time we get stuck. We also need to be able to do the path splicing in constant time; so, for example, we can’t just store the path in a list or array, since then splicing in a new path segment would require copying the entire path after that point to make space. I finally found a clever solution that pushes the nodes being explored on a stack; when we get stuck, we start popping nodes, placing them into an array which will hold the final path (starting from the end), and keep popping until we find a node with an unexplored outgoing edge, then switch back into exploration mode, pushing things on the stack until we get stuck again, and so on. But this is also nontrivial to code correctly since there are many lurking off-by-one errors and so on. And I haven’t even talked about how we keep track of which edges have been explored and quickly find the next unexplored edge from a vertex.
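
To make the stack-based idea concrete, here is a sketch of that approach for directed graphs (again my own illustration, and not actually linear-time: for clarity it uses an immutable Map of unexplored out-edges, which costs an extra logarithmic factor, and a list instead of an array for the final path):

import qualified Data.Map.Strict as M

-- Each vertex maps to its list of remaining unexplored out-edges.
type Graph = M.Map Int [Int]

-- Assumes the degree and connectivity checks have already succeeded
-- and that start is a valid starting vertex.
hierholzer :: Int -> Graph -> [Int]
hierholzer start g0 = go g0 [start] []
  where
    go _ []          path = path
    go g (v : stack) path = case M.findWithDefault [] v g of
      -- stuck at v: pop it onto the path being built (from the end)
      []       -> go g stack (v : path)
      -- otherwise, consume the unexplored edge v -> u and keep walking
      (u : us) -> go (M.insert v us g) (u : v : stack) path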

I think it’s worth writing another blog post or two with more details of how the implementation works, both in an imperative language and in a pure functional language, and I may very well do just that. But in any case, what is it about this problem that results in such a large gap between the ease of understanding its solution theoretically, and the difficulty of actually implementing it?


  1. Actually, the way I have stated the other direction of the if and only if is technically false!—can you spot the reason why?

  2. Though apparently someone named Listing published the basic idea of the proof, with some details omitted, some decades earlier. I’ve gotten all this from Herbert Fleischner, Eulerian Graphs and Related Topics, Annals of Discrete Mathematics 45, Elsevier 1990. Fleischner reproduces Euler’s original paper as well as Hierholzer’s, together with English translations.


Competitive Programming in Haskell: reading large inputs with ByteString

In my last post in this series, we looked at building a small Scanner combinator library for lightweight input parsing. It uses String everywhere, and usually this is fine, but occasionally it’s not.

A good example is the Kattis problem Army Strength (Hard). There are a number of separate test cases; each test case consists of two lines of positive integers which record the strengths of monsters in two different armies. Supposedly the armies will have a sequence of battles, where the weakest monster dies each time, with some complex-sounding rules about how to break ties. It sounds way more complicated than it really is, though: a bit of thought reveals that to find out who wins we really just need to see which army’s maximum-strength monster is strongest.

So our strategy for each test case is to read in the two lists of integers, find the maximum of each list, and compare. Seems pretty straightforward, right? Something like this:

import           Control.Arrow
import           Data.List.Split

main = interact $
  lines >>> drop 1 >>> chunksOf 4 >>>
  map (drop 2 >>> map (words >>> map read) >>> solve) >>>
  unlines

solve :: [[Int]] -> String
solve [gz, mgz] = case compare (maximum gz) (maximum mgz) of
  LT -> "MechaGodzilla"
  _  -> "Godzilla"

Note I didn’t actually use the Scanner abstraction here, though I could have; it’s actually easier to just ignore the numbers telling us how many test cases there are and the length of each line, and just split up the input by lines and go from there.

This seems straightforward enough, but sadly, it results in a Time Limit Exceeded (TLE) error on the third of three test cases. Apparently this program takes longer than the allowed 1 second. What’s going on?

If we look carefully at the limits for the problem, we see that there could be up to 50 test cases, each test case could have two lists of length 10^5, and the numbers in the lists can be up to 10^9. If all those are maxed out (as they probably are in the third, secret test case), we are looking at an input file many megabytes in size. At this point the time to simply read the input is a big factor. Reading the input as a String has a lot of overhead: each character gets its own cons cell; breaking the input into lines and words requires traversing over these cons cells one by one. We need a representation with less overhead.

Now, if this were a real application, we would reach for Text, which is made for representing textual information and can correctly handle unicode encodings and all that good stuff. However, this isn’t a real application: competitive programming problems always limit the input and output strictly to ASCII, so characters are synonymous with bytes. Therefore we will commit a “double no-no”: not only are we going to use ByteString to represent text, we’re going to use Data.ByteString.Lazy.Char8 which simply assumes that each 8 bits is one character. As explained in a previous post, however, I think this is one of those things that is usually a no-no but is completely justified in this context.

Let’s start by just replacing some of our string manipulation with corresponding ByteString versions:

import           Control.Arrow
import qualified Data.ByteString.Lazy.Char8 as C
import           Data.List.Split

main = C.interact $
  C.lines >>> drop 1 >>> chunksOf 4 >>>
  map (drop 2 >>> map (C.words >>> map (C.unpack >>> read)) >>> solve) >>>
  C.unlines

solve :: [[Int]] -> C.ByteString
solve [gz, mgz] = case compare (maximum gz) (maximum mgz) of
  LT -> C.pack "MechaGodzilla"
  _  -> C.pack "Godzilla"

This already helps a lot: this version is actually accepted, taking 0.66 seconds. (Note there’s no way to find out how long our first solution would take if allowed to run to completion: once it goes over the time limit Kattis just kills the process. So we really don’t know how much of an improvement this is, but hey, it’s accepted!)

But we can do even better: it turns out that read also has a lot of overhead, and if we are specifically reading Int values we can do something much better. The ByteString module comes with a function

readInt :: C.ByteString -> Maybe (Int, C.ByteString)

Since, in this context, we know we will always get an integer with nothing left over, we can replace C.unpack >>> read with C.readInt >>> fromJust >>> fst. Let’s try it:

import           Control.Arrow
import qualified Data.ByteString.Lazy.Char8 as C
import           Data.List.Split
import           Data.Maybe (fromJust)

main = C.interact $
  C.lines >>> drop 1 >>> chunksOf 4 >>>
  map (drop 2 >>> map (C.words >>> map readInt) >>> solve) >>>
  C.unlines

  where
    readInt = C.readInt >>> fromJust >>> fst

solve :: [[Int]] -> C.ByteString
solve [gz, mgz] = case compare (maximum gz) (maximum mgz) of
  LT -> C.pack "MechaGodzilla"
  _  -> C.pack "Godzilla"

Now we’re talking — this version completes in a blazing 0.04 seconds!

We can take these principles and use them to make a variant of the Scanner module from last time which uses (lazy, ASCII) ByteString instead of String, including the use of the readInt functions to read Int values quickly. You can find it here.
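
In case you’re curious, here is a rough sketch of how such a module might begin (my own simplified version, not necessarily identical to the linked code):

import           Control.Arrow ((>>>))
import           Control.Monad.State
import qualified Data.ByteString.Lazy.Char8 as C
import           Data.Maybe (fromJust)

type Scanner = State [C.ByteString]

runScanner :: Scanner a -> C.ByteString -> a
runScanner s = evalState s . C.words

str :: Scanner C.ByteString
str = get >>= \(s:ss) -> put ss >> return s

int :: Scanner Int
int = (C.readInt >>> fromJust >>> fst) <$> str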


Lightweight invertible enumerations in Haskell

In a previous post I introduced a new Haskell library for enumerations (now on Hackage as simple-enumeration). The Data.Enumeration module defines a type Enumeration a, represented simply by a function Integer -> a which picks out the value of type a at a given index. This representation has a number of advantages, including the ability to quickly index into very large enumerations, and the convenience that comes from having Functor, Applicative, and Alternative instances for Enumeration.

I’ve just uploaded version 0.2 of the package, which adds a new Data.Enumeration.Invertible module with a new type, IEnumeration a, representing invertible enumerations. Whereas a normal enumeration is just a function from index to value, an invertible enumeration is a bijection between indices and values. In particular, alongside the Integer -> a function for picking out the value at an index, an invertible enumeration also stores an inverse function a -> Integer (called locate) for finding the index of a given value.
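
Conceptually, an IEnumeration is therefore something like the following record (a sketch of the idea, not necessarily the library’s exact definition):

data Cardinality = Finite Integer | Infinite

data IEnumeration a = IEnumeration
  { card   :: Cardinality    -- how many values there are
  , select :: Integer -> a   -- the value at a given index
  , locate :: a -> Integer   -- the index of a given value
  }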

On the one hand, this comes at a cost: because the type parameter a now occurs both co- and contravariantly, IEnumeration is no longer an instance of Functor, Applicative, or Alternative. There is a mapE combinator provided for mapping an IEnumeration a to an IEnumeration b, but in order to work it needs both an a -> b function and an inverse b -> a.

On the other hand, we also gain something: of course the ability to look up the index of a value is nifty, and beyond that we also get a combinator

functionOf :: IEnumeration a -> IEnumeration b -> IEnumeration (a -> b)

which works as long as the IEnumeration a is finite. This is not possible to implement with normal, non-invertible enumerations: we have to take an index and turn it into a function a -> b, but that function has to take an a as input and decide what to do with it. There’s nothing we can possibly do with a value of type a unless we have a way to connect it back to the IEnumeration a it came from.
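
To see how this can work, here is a standalone sketch of the idea (not the library’s actual code): identify the values of a and b with their indices, say 0 through k-1 for the domain and 0 through m-1 for the codomain. Then an index n < m^k encodes a function via its base-m digits, and locating a function just means reading those digits back off:

-- The a-th base-m digit of n gives the function’s output on input a.
selectFun :: Integer -> Integer -> (Integer -> Integer)
selectFun m n = \a -> (n `div` m ^ a) `mod` m

-- Recover the index of a function on domain [0 .. k-1] by summing its
-- digits back up; this is the inverse of selectFun.
locateFun :: Integer -> Integer -> (Integer -> Integer) -> Integer
locateFun m k f = sum [ f a * m ^ a | a <- [0 .. k - 1] ]

(The real library presumably differs in details such as digit order, but the principle is the same.)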

Here’s a simple example of using the functionOf combinator to enumerate all Bool -> Bool functions, and then locating the index of not:

>>> bbs = functionOf (boundedEnum @Bool) (boundedEnum @Bool)
>>> card bbs
Finite 4
>>> locate bbs not
2
>>> map (select bbs 2) [False, True]
[True,False]

And here’s an example of enumerating recursive trees, which is parallel to an example given in my previous post. Note, however, how we can no longer use combinators like <$>, <*>, and <|>, but must explicitly use <+> (disjoint sum of enumerations) and >< (enumeration product) in combination with mapE. In return, though, we can find the index of any given tree in addition to selecting trees by index.

data Tree = L | B Tree Tree
  deriving Show

toTree :: Either () (Tree, Tree) -> Tree
toTree = either (const L) (uncurry B)

fromTree :: Tree -> Either () (Tree, Tree)
fromTree L       = Left ()
fromTree (B l r) = Right (l,r)

trees :: IEnumeration Tree
trees = infinite $ mapE toTree fromTree (unit <+> (trees >< trees))

>>> locate trees (B (B L (B L L)) (B (B L (B L L)) (B L (B L L))))
123
>>> select trees 123
B (B L (B L L)) (B (B L (B L L)) (B L (B L L)))

Of course, the original Data.Enumeration module remains available; there is clearly an inherent tradeoff to invertibility, and you are free to choose either style depending on your needs. Other than the tradeoffs outlined above and a couple other minor exceptions, the two modules export largely identical APIs.


Competitive Programming in Haskell: Scanner

In my previous post I explored solving a simple competitive programming problem in Haskell. The input of the problem just consisted of a bunch of lines containing specific data, so that we could parse it using lines and words. There is another common class of problems, however, which follow this pattern:

The first line of the input consists of an integer T. Each of the next T lines consists of…

That is, the input contains integers which are not input data per se but just tell you how many things are to follow. This is really easy to process in an imperative language like Java or C++. For example, in Java we might write code like this:

Scanner in = new Scanner(System.in);
int T = in.nextInt();
for (int i = 0; i < T; i++) {
   // process each line
}

Occasionally, we can get away with completely ignoring the extra information in Haskell. For example, if the input consists of a number T followed by T lines, each of which contains a number n followed by a list of n numbers, we can just write

main = interact $
  lines >>> drop 1 >>> map (words >>> drop 1 >>> map read) >>> ...

That is, we can ignore the first line containing T since the end-of-file will tell us how many lines there are; and we can ignore the n at the beginning of each line, since the newline character tells us when the list on that line is done.

Sometimes, however, this isn’t possible, especially when there are multiple test cases, or when a single test case has multiple parts, each of which can have a variable length. For example, consider Popular Vote, which describes its input as follows:

The first line of input contains a single positive integer T \leq 500 indicating the number of test cases. The first line of each test case also contains a single positive integer n indicating the number of candidates in the election. This is followed by n lines, with the ith line containing a single nonnegative integer indicating the number of votes candidate i received.

How would we parse this? We could still ignore T—just keep reading until the end of the file—but there’s no way we can ignore the n values. Since the values for each test case are all on separate lines instead of on one line, there’s otherwise no way to know when one test case ends and the next begins.

Once upon a time, I would have done this using splitAt and explicit recursion, like so:

type Election = [Int]

readInput :: String -> [Election]
readInput = lines >>> drop 1 {- ignore T -} >>> map read >>> go
  where
    go :: [Int] -> [Election]
    go []     = []
    go (n:xs) = votes : go rest
      where (votes,rest) = splitAt n xs

However, this is really annoying to write and easy to get wrong. There are way too many variable names to keep track of (n, xs, votes, rest, go) and for more complex inputs it becomes simply unmanageable. You might think we should switch to using a real parser combinator library—parsec is indeed installed in the environment Kattis uses to run Haskell solutions—and although sometimes a full-blown parser combinator library is needed, in this case it’s quite a bit more heavyweight than we would like. I can never remember which modules I have to import to get parsec set up; there’s a bunch of boilerplate needed to set up a lexer; and so on. Using parsec is only worth it if we’re parsing something really complex.

Scanner

The heart of the issue is that we want to be able to specify a high-level description of the sequence of things we expect to see in the input, without worrying about managing the stream of tokens explicitly. Another key insight is that 99% of the time, we don’t need the ability to deal with parse failure or the ability to parse multiple alternatives. With these insights in mind, we can create a very simple Scanner abstraction, which is just a State computation over a list of tokens:

import Control.Monad.State

type Scanner = State [String]

runScanner :: Scanner a -> String -> a
runScanner s = evalState s . words

To run a scanner, we just feed it the entire input as a String, which gets chopped into tokens using words. (Of course in some scenarios we might want to use lines instead of words, or even do more complex tokenization.)

Note since Scanner is just a type synonym for State [String], it is automatically an instance of Functor, Applicative, and Monad (but not Alternative).

So let’s develop a little Scanner DSL. The most fundamental thing we can do is read the next token.

str :: Scanner String
str = get >>= \case { s:ss -> put ss >> return s }

(This uses the LambdaCase extension, though we could easily rewrite it without.) str gets the current list of tokens, puts it back without the first token, and returns the first token. Note that I purposely didn’t include a case for the empty list. You might think we want to include a case for the empty token list and have it return the empty string or something like that. But since the input will always be properly formatted, if this scenario ever happens it means my program has a bug—e.g. perhaps I misunderstood the description of the input format. In this scenario I want it to crash loudly, as soon as possible, rather than continuing on with some bogus data.

We can now add some scanners for reading specific token types other than String, simply by mapping the read function over the output of str:

int :: Scanner Int
int = read <$> str

integer :: Scanner Integer
integer = read <$> str

double :: Scanner Double
double = read <$> str

Again, these will crash if they see a token in an unexpected format, and that is a very deliberate choice.

Now, as I explained earlier, a very common pattern is to have an integer n followed by n copies of something. So let’s make a combinator to encapsulate that pattern:

numberOf :: Scanner a -> Scanner [a]
numberOf s = int >>= flip replicateM s

numberOf s expects to first see an Int value n, and then it runs the provided scanner n times, returning a list of the results.

It’s also sometimes useful to have a way to repeat a Scanner some unknown number of times until encountering EOF (for example, the input for some problems doesn’t specify the number of test cases up front the way that Popular Vote does). This is similar to the many combinator from Alternative.

many :: Scanner a -> Scanner [a]
many s = get >>= \case { [] -> return []; _ -> (:) <$> s <*> many s }

many s repeats the scanner s as many times as it can, returning a list of the results. In particular it first peeks at the current token list to see if it is empty. If so, it returns the empty list of results; if there are more tokens, it runs s once and then recursively calls many s, consing the results together.

Finally, it’s quite common to want to parse a specific small number of something, e.g. two double values representing a 2D coordinate pair. We could just write replicateM 2 double, but this is common enough that I find it helpful to define dedicated combinators with short names:

two, three, four :: Scanner a -> Scanner [a]
[two, three, four] = map replicateM [2..4]

The complete file can be found on GitHub. As I continue this series I’ll be putting more code into that repository. Note I do not intend to make this into a Hackage package, since that wouldn’t be useful: you can’t tell Kattis to go download a package from Hackage before running your submission. However, it is possible to submit multiple files at once, so you can include Scanner.hs in your submission and just import Scanner at the top of your main module.

Examples

So what have we gained? Writing the parser for Popular Vote is now almost trivial:

type Election = [Int]

main = interact $ runScanner elections >>> ...

elections :: Scanner [Election]
elections = numberOf (numberOf int)

In practice I would probably just inline the definition of elections directly: interact $ runScanner (numberOf (numberOf int)) >>> ...

As a slightly more involved example, chosen almost at random, consider Board Wrapping:

On the first line of input there is one integer, N \leq 50, giving the number of test cases (moulds) in the input. After this line, N test cases follow. Each test case starts with a line containing one integer n, 1 \leq n \leq 600, which is the number of boards in the mould. Then n lines follow, each with five floating point numbers x,y,w,h,v where 0 \leq x,y,w,h \leq 10000 and -90^{\circ} < v \leq 90^{\circ}. The x and y are the coordinates of the center of the board and w and h are the width and height of the board, respectively. v is the angle between the height axis of the board to the y-axis in degrees, positive clockwise.

Here’s how I would set up the input, using Scanner and a custom data type to represent boards.

import Scanner

type V = [Double]     -- 2D vectors/points
newtype A = A Double  -- angle (radians)
                      -- newtype helps avoid conversion errors

fromDeg :: Double -> A
fromDeg d = A (d * pi / 180)

data Board = Board { boardLoc :: V, boardDims :: V, boardAngle :: A }

board :: Scanner Board
board = Board
  <$> two double
  <*> two double
  <*> ((fromDeg . negate) <$> double)

main = interact $
  runScanner (numberOf (numberOf board)) >>> ...

Lightweight, efficiently sampleable enumerations in Haskell

For another project I’m working on, I needed a way to enumerate and randomly sample values from various potentially infinite collections. There are plenty of packages in this space, but none of them quite fit my needs:

  • universe (and related packages) is very nice, but it’s focused on enumerating values of Haskell data types, not arbitrary sets: since it uses type classes, you have to make a new Haskell type for each thing you want to enumerate. It also uses actual Haskell lists of values, which doesn’t play nicely with sampling.
  • enumerable has not been updated in a long time and seems to be superseded by universe.
  • enumerate is likewise focused on generating values of Haskell data types, with accompanying generic deriving machinery.
  • size-based is used as the basis for the venerable testing-feat library, but these are again focused on generating values of Haskell data types. I’m also not sure I need the added complexity of size-indexed enumerations.
  • enumeration looks super interesting, and I might be able to use it for what I want, but (a) I’m not sure whether it’s maintained anymore, and (b) it seems rather more complex than I need.

I really want something like Racket’s nice data/enumerate package, but nothing like that seems to exist in Haskell. So, of course, I made my own! For now you can find it on GitHub.1 Here’s the package in a nutshell:

  • Enumerations are represented by the parameterized type Enumeration, which is an instance of Functor, Applicative, and Alternative (but not Monad).
  • Enumerations keep track of their cardinality, which could be either countably infinite or a specific natural number.
  • Enumerations are represented as functions from index to value, so they can be efficiently indexed (which also enables efficient random sampling).
  • The provided combinators will always do something sensible so that every value in the resulting enumeration occurs at a finite index. For example, if you take the disjoint union of two infinite enumerations, the resulting enumeration will alternate between values from the two inputs (see the sketch just below).
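
To give a flavor of that last point, here is a sketch (my own illustration, not the library’s actual code) of how the disjoint union of two infinite enumerations can interleave them under the index-to-value representation:

-- Even indices come from the first enumeration, odd indices from the
-- second, so every value of both still occurs at a finite index.
interleave :: (Integer -> a) -> (Integer -> a) -> (Integer -> a)
interleave f g i
  | even i    = f (i `div` 2)
  | otherwise = g (i `div` 2)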

I wrote about something similar a few years ago. The main difference is that in that post I limited myself to only finite enumerations. There’s a lot more I could say but for now I think I will just show some examples:

>>> enumerate empty
[]
>>> enumerate unit
[()]
>>> enumerate $ empty <|> unit <|> unit
[(),()]

>>> enumerate $ finite 4 >< finiteList [27,84,17]
[(0,27),(0,84),(0,17),(1,27),(1,84),(1,17),(2,27),(2,84),(2,17),(3,27),(3,84),(3,17)]

>>> select (finite 4000000000000 >< finite 123456789) 0
(0,0)
>>> select (finite 4000000000000 >< finite 123456789) 196598723084073
(1592449,82897812)
>>> card (finite 4000000000000 >< finite 123456789)
Finite 493827156000000000000

>>> :set -XTypeApplications
>>> enumerate $ takeE 26 . dropE 65 $ boundedEnum @Char
"ABCDEFGHIJKLMNOPQRSTUVWXYZ"

>>> take 10 . enumerate $ nat >< nat
[(0,0),(0,1),(1,0),(0,2),(1,1),(2,0),(0,3),(1,2),(2,1),(3,0)]
>>> take 10 . enumerate $ cw
[1 % 1,1 % 2,2 % 1,1 % 3,3 % 2,2 % 3,3 % 1,1 % 4,4 % 3,3 % 5]

>>> take 15 . enumerate $ listOf nat
[[],[0],[0,0],[1],[0,0,0],[1,0],[2],[0,1],[1,0,0],[2,0],[3],[0,0,0,0],[1,1],[2,0,0],[3,0]]

data Tree = L | B Tree Tree
  deriving (Eq, Show)

trees :: Enumeration Tree
trees = infinite $ singleton L <|> B <$> trees <*> trees

>>> take 3 . enumerate $ trees
[L,B L L,B L (B L L)]
>>> select trees 87239862967296
B (B (B (B (B L L) (B (B (B L L) L) L)) (B L (B L (B L L)))) (B (B (B L (B L (B L L))) (B (B L L) (B L L))) (B (B L (B L (B L L))) L))) (B (B L (B (B (B L (B L L)) (B L L)) L)) (B (B (B L (B L L)) L) L))

treesOfDepthUpTo :: Int -> Enumeration Tree
treesOfDepthUpTo 0 = singleton L
treesOfDepthUpTo n = singleton L <|> B <$> t' <*> t'
  where t' = treesOfDepthUpTo (n-1)

>>> card (treesOfDepthUpTo 0)
Finite 1
>>> card (treesOfDepthUpTo 1)
Finite 2
>>> card (treesOfDepthUpTo 3)
Finite 26
>>> card (treesOfDepthUpTo 10)
Finite 14378219780015246281818710879551167697596193767663736497089725524386087657390556152293078723153293423353330879856663164406809615688082297859526620035327291442156498380795040822304677
>>> select (treesOfDepthUpTo 10) (2^50)
B L (B L (B L (B (B L (B (B L (B (B L L) L)) (B (B (B (B L L) (B L L)) (B L (B L L))) (B (B (B L L) L) (B (B L L) L))))) (B (B (B (B (B (B L L) L) (B (B L L) L)) (B L L)) (B (B (B (B L L) L) (B L (B L L))) (B (B (B L L) (B L L)) L))) (B (B (B (B L L) (B L L)) (B (B (B L L) L) L)) (B (B L L) (B (B (B L L) L) (B (B L L) L))))))))

Comments, questions, suggestions for additional features, etc. are all very welcome!


  1. I chose the name enumeration before I realized there was already a package of that name on Hackage! So now I have to come up with another name that’s not already taken. Suggestions welcome…


Code style and moral absolutes

In my previous post about my basic setup for solving competitive programming problems with Haskell, I (somewhat provocatively) used lists to represent pairs, and wrote a partial function to process them. Commenter Yom responded with a proposed alternative that was (less) partial. I was glad for the comment, because it gave me a good opportunity to think more about why I wrote the code in the way I did, and how it fits into larger issues of good coding practices and the reasons behind them.

Good code style as moral behavior

What is good code style? You probably have some opinions about this. In fact, I’m willing to bet you might even have some very strong opinions about this; I know I do. Whether consciously or not, we tend to frame good coding practices as a moral issue. Following good coding practices makes us feel virtuous; ignoring them makes us feel guilty. I would guess that this is why Yom said “I don’t think I could bring myself to be satisfied with partial functions” [emphasis added]. And this is why we say “good code style”, not “optimal” or “rational” or “best practice” code style.

Why is this? Partly, it is just human: we like to have right and wrong ways to do everything (load the dishwasher, enforce grammar “rules”, use a text editor, etc.), and we naturally create and enforce community standards via subtle and not-so-subtle social cues. In the case of coding practices, I think we also sometimes do it consciously and explicitly, because the benefits can be unintuitive or only manifest in the long term. So the only way to get our students—or ourselves—to follow practices that are in our rational self-interest is by framing them in moral terms; rational arguments do not work in and of themselves. For example, I cannot get my students to write good comments by explaining to them how it will be beneficial to them in the future. It seems obvious to them that they will remember perfectly how their code works in the future, so any argument claiming the opposite falls on deaf ears. The only way to get them to write comments is to make it a moral issue: they should feel bad (i.e. lose points, lose respect, feel like they are “taking shortcuts”) if they don’t. Of course I do this “for their own good”: I trust that in the future they will come to appreciate this ingrained behavior on its own merits.

The problem is that things framed in moral terms become absolutes, and it is then difficult for us to assess them rationally. My students will never be able to write a [function without comments, partial function, goto statement, …] without feeling bad about it, and they probably won’t stop to think about why.

Good code style as rational behavior

I ask again: what is good code style—and why? I have identified a few reasons for various “good” coding practices. Ultimately, we want our code to have properties such as:

  • Robustness: it should handle unexpected or invalid inputs gracefully.
  • Readability: it should be easy for others (or us in the future) to read and understand the program.
  • Maintainability: it should be easy to modify the program as requirements change.
  • Efficiency: in general, programs should not do anything obviously redundant, or use data structures with a lot of overhead when faster ones are available (e.g. String vs Text or ByteString).

Even in scenarios where one might initially think these properties are not needed (e.g. writing a one-off script for some sysadmin or data processing task), they often end up being important anyway (e.g. that one-off script gets copied and mutated until it becomes a key piece of some production system). And this is exactly one of the reasons for framing good coding style in moral terms! I won’t write comments or use good function decomposition in my one-off script just because I know, rationally, that it might end up in a production system someday. (I “know” that this particular script really is just a one-off script!) But I just might follow good coding practices anyway if I feel bad about not doing it (e.g. I would feel ashamed if other people saw it).

It seems to me that most things we would typically think of as good code style are geared towards producing code with some or all of the above properties (and perhaps some other properties as well), and most scenarios in which code is being written really do benefit from these properties.

“Good” code style is context-dependent

But what if there was a scenario where these properties are actually, concretely of no benefit? As you can probably guess, I would argue that competitive programming is one such scenario:

  • Robustness: we do not care what our program does when given unexpected or invalid inputs, since we are absolutely, 100% guaranteed that our program will only ever be run on inputs that exactly follow the given specification.
  • Maintainability: the requirements for our program will never change.
  • Efficiency: if you haven’t done much competitive programming you might be surprised to learn that we often don’t care about efficiency either. That is, although we certainly do care about asymptotic efficiency, i.e. choosing a good algorithm, problem time limits are typically set in such a way that constant factors don’t matter very much. A program that runs within 5x-10x of the optimal speed will often fit comfortably within the time limit.

So what do we care about?

  • Readability: the one thing from my previous list that we do care about is readability. Debugging becomes quite difficult if you can’t read and understand the code you wrote (this becomes even more important if you’re working on a team). And insofar as a solution represents particular insights or techniques, you may want to be able to read it much later in order to remember or share what you learned.
  • Programmer time: programmer time is always valuable, of course, but with competitive programming this is taken to an extreme: it is almost always done under time pressure, so the whole point is to write a program to solve a given problem as fast as possible.

The combination of optimizing for speed and not caring about things like robustness, maintainability, and efficiency leads to a number of “best practices” for competitive programming that fly in the face of typical standards. For example:

  • Adding code to deal gracefully with inputs that don’t follow the specification would just be a waste of time (a cardinal sin in this context!). My Haskell solutions are full of calls to partial functions like read, head, tail, fromJust, and so on, even though I would almost never use these functions in other contexts. This is also why I used a partial function that was only defined on lists of length two in my previous post (though as I argue in a comment, perhaps it’s not so much that the function is partial as that its type is too big).
  • I often just use String for text processing, even though something like Text or ByteString (depending on the scenario) would be faster or more robust. (The exception is problems with a large amount of I/O, when the overhead of String really does become a problem; more on this in a future post.)
  • Other than the simplest uses of foldr, foldl', and scanl, I don’t bother with generic recursion schemes; I tend to just write lots of explicit recursion, which I find quicker to write and easier to debug.

There are similar things I do in Java as well. It has taken me quite a while to become comfortable with these things and stop feeling bad about them, and I think I finally understand why.

I’m not sure I really have a main point, other than to encourage you to consider your coding practices, and why you consider certain practices to be good or bad (and whether it depends on the context!).

Next time, back to your regularly scheduled competitive programming tips!
