In my previous post I explored solving a simple competitive programming problem in Haskell. The input of the problem just consisted of a bunch of lines containing specific data, so that we could parse it using lines
and words
. There is another common class of problems, however, which follow this pattern:
The first line of the input consists of an integer . Each of the next
lines consists of…
That is, the input contains integers which are not input data per se but just tell you how many things are to follow. This is really easy to process in an imperative language like Java or C++. For example, in Java we might write code like this:
Scanner in = new Scanner(System.in);
int T = in.nextInt();
for (int i = 0; i < T; i++) {
// process each line
}
Occasionally, we can get away with completely ignoring the extra information in Haskell. For example, if the input consists of a number followed by
lines, each of which contains a number
followed by a list of
numbers, we can just write
main = interact $
lines >>> drop 1 >>> map (words >>> drop 1 >>> map read) >>> ...
That is, we can ignore the first line containing since the end-of-file will tell us how many lines there are; and we can ignore the
at the beginning of each line, since the newline character tells us when the list on that line is done.
Sometimes, however, this isn’t possible, especially when there are multiple test cases, or when a single test case has multiple parts, each of which can have a variable length. For example, consider Popular Vote, which describes its input as follows:
The first line of input contains a single positive integer
indicating the number of test cases. The first line of each test case also contains a single positive integer
indicating the number of candidates in the election. This is followed by
lines, with the
th line containing a single nonnegative integer indicating the number of votes candidate
received.
How would we parse this? We could still ignore —just keep reading until the end of the file—but there’s no way we can ignore the
values. Since the values for each test case are all on separate lines instead of on one line, there’s otherwise no way to know when one test case ends and the next begins.
Once upon a time, I would have done this using splitAt
and explicit recursion, like so:
type Election = [Int]
readInput :: String -> [Election]
readInput = lines >>> drop 1 {- ignore T -} >>> map read >>> go
where
go :: [Int] -> [Election]
go [] = []
go (n:xs) = votes : go rest
where (votes,rest) = splitAt n xs
However, this is really annoying to write and easy to get wrong. There are way too many variable names to keep track of (n
, xs
, votes
, rest
, go
) and for more complex inputs it becomes simply unmanageable. You might think we should switch to using a real parser combinator library—parsec
is indeed installed in the environment Kattis uses to run Haskell solutions—and although sometimes a full-blown parser combinator library is needed, in this case it’s quite a bit more heavyweight than we would like. I can never remember which modules I have to import to get parsec
set up; there’s a bunch of boilerplate needed to set up a lexer; and so on. Using parsec
is only worth it if we’re parsing something really complex.
Scanner
The heart of the issue is that we want to be able to specify a high-level description of the sequence of things we expect to see in the input, without worrying about managing the stream of tokens explicitly. Another key insight is that 99% of the time, we don’t need the ability to deal with parse failure or the ability to parse multiple alternatives. With these insights in mind, we can create a very simple Scanner
abstraction, which is just a State
ful computation over a list of tokens:
type Scanner = State [String]
runScanner :: Scanner a -> String -> a
runScanner s = evalState s . words
To run a scanner, we just feed it the entire input as a String
, which gets chopped into tokens using words
. (Of course in some scenarios we might want to use lines
instead of words
, or even do more complex tokenization.)
Note since Scanner
is just a type synonym for State [String]
, it is automatically an instance of Functor
, Applicative
, and Monad
(but not Alternative
).
So let’s develop a little Scanner
DSL. The most fundamental thing we can do is read the next token.
str :: Scanner String
str = get >>= \case { s:ss -> put ss >> return s }
(This uses the LambdaCase
extension, though we could easily rewrite it without.) str
gets the current list of tokens, puts it back without the first token, and returns the first token. Note that I purposely didn’t include a case for the empty list. You might think we want to include a case for the empty token list and have it return the empty string or something like that. But since the input will always be properly formatted, if this scenario ever happens it means my program has a bug—e.g. perhaps I misunderstood the description of the input format. In this scenario I want it to crash loudly, as soon as possible, rather than continuing on with some bogus data.
We can now add some scanners for reading specific token types other than String
, simply by mapping the read
function over the output of str
:
int :: Scanner Int
int = read <$> str
integer :: Scanner Integer
integer = read <$> str
double :: Scanner Double
double = read <$> str
Again, these will crash if they see a token in an unexpected format, and that is a very deliberate choice.
Now, as I explained earlier, a very common pattern is to have an integer followed by
copies of something. So let’s make a combinator to encapsulate that pattern:
numberOf :: Scanner a -> Scanner [a]
numberOf s = int >>= flip replicateM s
numberOf s
expects to first see an Int
value , and then it runs the provided scanner
times, returning a list of the results.
It’s also sometimes useful to have a way to repeat a Scanner
some unknown number of times until encountering EOF (for example, the input for some problems doesn’t specify the number of test cases up front the way that Popular Vote does). This is similar to the many
combinator from Alternative
.
many :: Scanner a -> Scanner [a]
many s = get >>= \case { [] -> return []; _ -> (:) <$> s <*> many s }
many s
repeats the scanner s
as many times as it can, returning a list of the results. In particular it first peeks at the current token list to see if it is empty. If so, it returns the empty list of results; if there are more tokens, it runs s
once and then recursively calls many s
, consing the results together.
Finally, it’s quite common to want to parse a specific small number of something, e.g. two double values representing a 2D coordinate pair. We could just write replicateM 2 double
, but this is common enough that I find it helpful to define dedicated combinators with short names:
two, three, four :: Scanner a -> Scanner [a]
[two, three, four] = map replicateM [2..4]
The complete file can be found on GitHub. As I continue this series I’ll be putting more code into that repository. Note I do not intend to make this into a Hackage package, since that wouldn’t be useful: you can’t tell Kattis to go download a package from Hackage before running your submission. However, it is possible to submit multiple files at once, so you can include Scanner.hs
in your submission and just import Scanner
at the top of your main module.
Examples
So what have we gained? Writing the parser for Popular Vote is now almost trivial:
type Election = [Int]
main = interact $ runScanner elections >>> ...
elections :: Scanner [Election]
elections = numberOf (numberOf int)
In practice I would probably just inline the definition of elections
directly: interact $ runScanner (numberOf (numberOf int)) >>> ...
As a slightly more involved example, chosen almost at random, consider Board Wrapping:
On the first line of input there is one integer,
, giving the number of test cases (moulds) in the input. After this line,
test cases follow. Each test case starts with a line containing one integer
, which is the number of boards in the mould. Then
lines follow, each with five floating point numbers
where
and
. The
and
are the coordinates of the center of the board and
and
are the width and height of the board, respectively.
is the angle between the height axis of the board to the
-axis in degrees, positive clockwise.
Here’s how I would set up the input, using Scanner
and a custom data type to represent boards.
import Scanner
type V = [Double] -- 2D vectors/points
newtype A = A Double -- angle (radians)
-- newtype helps avoid conversion errors
fromDeg :: Double -> A
fromDeg d = A (d * pi / 180)
data Board = Board { boardLoc :: V, boardDims :: V, boardAngle :: A }
board :: Scanner Board
board = Board
<$> two double
<*> two double
<*> ((fromDeg . negate) <$> double)
main = interact $
runScanner (numberOf (numberOf board)) >>> ...
I’ve been using http://hackage.haskell.org/package/base-4.12.0.0/docs/Text-ParserCombinators-ReadP.html for a bunch of advent-of-code solutions, and with a few helper functions that has proven to a quite simple and quick way of handling most semi-difficult parses.
Yes, that’s a nice idea, though I think it’s still overkill for a lot of things. Advent of Code is a bit different though, it would be interesting to compare the kind of parsing necessary for solving those and for solving ICPC-style problems (as on Kattis).
it maybe very helpful to implement “scanf”
Can you give an example of how you think it would be helpful? How do you envision scanf working in a Haskell context?
Personally I think it popular in Competitive Programming (if C used) and to my knowledge , there is already a ‘scanf’ in Haskell which seemingly works well .
The thing about “scanf” is that it’s not a combinator. If you look at the definition of “board” above, you’ll note that it’s a clear description of the input format in what’s essentially a domain-specific language. This description is built up from smaller descriptions, and it can easily be embedded in further descriptions (as with “(numberOf (numberOf board))” ). If you need to add a new format, say for a complex number, that’s easy to do and clear.
“scanf” doesn’t have this flexability or clarity, so no, it’s not terribly useful to implement it.
Thanks for the nice post! I wonder if this seemingly well-behaved (linear?) pattern of scanning tokens — every token is retrieved at most once, and in that case, is immediately discarded from the state — can be encoded in a more restricted monad than that of State. I guess this would not play well with your aim to encode the scanner quickly and using only base facilities, but let’s forget about this for a moment.
This imaginary more-restricted monad would have only one method `pop`, which would liberate you from doing (in `str` and then duplicated in `many`): `get >>= \case { s:ss -> put ss >> {- process s -}` replacing it with `pop >>= \s -> {- process s -}`. (Could be `\(Just s) -> {- process s -}` to account for the fact that the stack might be empty already.)
Actually, I was able to Google a very similar thing:
http://hackage.haskell.org/package/hexpr-0.0.0.0/docs/Control-Monad-Stack.html
There are several issues there, though: it’s basically dead, it’s pop and push simultaneously, it hardcodes list as the implementation of this behavior while many other container-like structures might do as well or better.
Also, minor: I prefer `liftA2 (:) s (many s)` over `(:) <$> s <*> many s` because I find the latter too noisy.
Pingback: Resumen de lecturas compartidas durante mayo de 2019 | Vestigium
Pingback: Competitive Programming in Haskell: reading large inputs with ByteString | blog :: Brent -> [String]
Pingback: Competitive programming in Haskell: summer series | blog :: Brent -> [String]
Pingback: Competitive programming in Haskell: building unordered trees | blog :: Brent -> [String]
Pingback: Competitive programming in Haskell: data representation and optimization, with cake | blog :: Brent -> [String]
Pingback: Competitive programming in Haskell: topsort via laziness | blog :: Brent -> [String]