Many parser combinator libraries in Haskell (such as parsec) have both a
Monad interface as well as an
Applicative interface. (Actually, to be really useful, you also need
MonadPlus along with
Alternative along with
Applicative, in order to encode choice; from now on when I talk about
Applicative note that I really have
Alternative in mind as well.) The
Applicative interface is often nicer, but it is less powerful than the
Monad interface: in particular, using
Applicative you can only parse context-free languages, whereas
Monad lets you parse arbitrary context-sensitive languages. Intuitively, this is because the structure of
Applicative computations cannot depend on intermediate results, whereas
Monad computations allow you to choose which computation to run next based on intermediate results.
This is a good bit of intuition, with only one minor caveat: it isn’t true! I believe it was two years ago, during the second Hac phi, when I first learned from Edward Kmett how
Applicative (by which I mean, of course,
Alternative) can be used to parse arbitrary context-sensitive languages. The question just came up again in the
#haskell IRC channel, and I figured it would be useful to have this all written down somewhere. In particular, Reid Barton gave a nice sketch which I decided to turn into some working code.
Here’s the key insight: normally, grammars are defined as finite objects: a finite set of terminals, a finite set of nonterminals, and a finite set of productions. However, Haskell’s general recursion means that we can write down a "grammar" with an infinite set of production rules. This is what lets us get away with parsing context-sensitive languages with
Applicative: we just make a different production rule for every possible input!
First, some imports. Notice that I do not import
> import Text.Parsec
> import Text.Parsec.String
> import Control.Arrow ((&&&))
> import Control.Applicative hiding ((<|>))
> import Data.List (group)
guard function is for
MonadPlus, but we can make something equivalent for
> guard' :: Alternative f => Bool -> f ()
> guard' True = pure ()
> guard' False = empty
And now for the meat of the example.
parseArbitrary takes an arbitrary predicate on
Strings built from lowercase letters and turns it into a parser. The created parser will accept
Strings for which the predicate evaluates to
()) and fail for any other string.
> parseArbitrary :: (String -> Bool) -> Parser ()
> parseArbitrary p =
If we encounter
eof, we simply ensure that the predicate holds of the empty string.
> (eof <* guard' (p ))
Otherwise, we choose between 26 alternatives based on the next character in the input. If the character
c is encountered, we make a recursive call to
parseArbitrary (p . (c:)). The remainder of the input must satisfy
(p . (c:)), that is, it must consist of some
s such that
(c:s) satisfies the predicate
> <|> foldr (<|>) parserZero
> (map (\c -> char c *>
> parseArbitrary (p . (c:))
For any given predicate
p, you can think of
parseArbitrary p as an infinite tree with a 26-way branch at each node. Each node "remembers" the path taken to reach it from the root of the tree, in the form of prepend functions composed with the original predicate. We have constructed an infinite grammar: each node in the tree corresponds to a production, one for every possible input prefix.
Let’s try it out. Here’s a function which only accepts
Strings of the form
"aaabbbccc", with an equal number of a’s, b’s, and c’s. This is a well-known example of a language which is not context-free (easily shown using the pumping lemma for context-free languages).
> f :: String -> Bool
> f s
> | [('a',na), ('b',nb), ('c',nc)]
> <- map (head &&& length). group $ s
> = na == nb && nb == nc
> | otherwise = False
Now we make
f into a parser and test it on some example inputs:
> p = parseArbitrary f
> main = do
> parseTest p "aaa"
> parseTest p "aaabbbcc"
> parseTest p "aaaabcccc"
> parseTest p "aaaaabbbbbccccc"
The last test succeeds by returning
(). For the first three, we get error messages like this:
parse error at (line 1, column 4):
unexpected end of input
expecting "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y" or "z"
Obviously, these are not very helpful. But what were you expecting?
After all, this is one of those things that is interesting in theory, but in practice amounts to an awful hack that no one would ever want to use in real code.
In the end, it’s still true that
Applicative can only parse context-free languages as long as we restrict ourselves to finite grammars—which any sensible programmer would do anyway.
[ETA: it looks like using infinite grammars is not as impractical as I thought—see the comments below!]