Parsing context-sensitive languages with Applicative

Many parser combinator libraries in Haskell (such as parsec) have both a Monad interface as well as an Applicative interface. (Actually, to be really useful, you also need MonadPlus along with Monad, or Alternative along with Applicative, in order to encode choice; from now on when I talk about Monad and Applicative note that I really have MonadPlus or Alternative in mind as well.) The Applicative interface is often nicer, but it is less powerful than the Monad interface: in particular, using Applicative you can only parse context-free languages, whereas Monad lets you parse arbitrary context-sensitive languages. Intuitively, this is because the structure of Applicative computations cannot depend on intermediate results, whereas Monad computations allow you to choose which computation to run next based on intermediate results.

This is a good bit of intuition, with only one minor caveat: it isn’t true! I believe it was two years ago, during the second Hac phi, when I first learned from Edward Kmett how Applicative (by which I mean, of course, Alternative) can be used to parse arbitrary context-sensitive languages. The question just came up again in the #haskell IRC channel, and I figured it would be useful to have this all written down somewhere. In particular, Reid Barton gave a nice sketch which I decided to turn into some working code.

Here’s the key insight: normally, grammars are defined as finite objects: a finite set of terminals, a finite set of nonterminals, and a finite set of productions. However, Haskell’s general recursion means that we can write down a "grammar" with an infinite set of production rules. This is what lets us get away with parsing context-sensitive languages with Applicative: we just make a different production rule for every possible input!

First, some imports. Notice that I do not import Control.Monad.

> import Text.Parsec
> import Text.Parsec.String
> import Control.Arrow ((&&&))
> import Control.Applicative hiding ((<|>))
> import Data.List (group)

The usual guard function is for MonadPlus, but we can make something equivalent for Alternative.

> guard' :: Alternative f => Bool -> f ()
> guard' True  = pure ()
> guard' False = empty

And now for the meat of the example. parseArbitrary takes an arbitrary predicate on Strings built from lowercase letters and turns it into a parser. The created parser will accept Strings for which the predicate evaluates to True (returning ()) and fail for any other string.

> parseArbitrary :: (String -> Bool) -> Parser ()
> parseArbitrary p =

If we encounter eof, we simply ensure that the predicate holds of the empty string.

>       (eof <* guard' (p [])) 

Otherwise, we choose between 26 alternatives based on the next character in the input. If the character c is encountered, we make a recursive call to parseArbitrary (p . (c:)). The remainder of the input must satisfy (p . (c:)), that is, it must consist of some String s such that (c:s) satisfies the predicate p.

>   <|> foldr (<|>) parserZero 
>         (map (\c -> char c *> 
>                     parseArbitrary (p . (c:))
>              ) 
>              ['a'..'z']
>         )

For any given predicate p, you can think of parseArbitrary p as an infinite tree with a 26-way branch at each node. Each node "remembers" the path taken to reach it from the root of the tree, in the form of prepend functions composed with the original predicate. We have constructed an infinite grammar: each node in the tree corresponds to a production, one for every possible input prefix.

Let’s try it out. Here’s a function which only accepts Strings of the form "aaabbbccc", with an equal number of a’s, b’s, and c’s. This is a well-known example of a language which is not context-free (easily shown using the pumping lemma for context-free languages).

> f :: String -> Bool
> f s 
>   | [('a',na), ('b',nb), ('c',nc)] 
>     <- map (head &&& length). group $ s
>     = na == nb && nb == nc
>   | otherwise = False

Now we make f into a parser and test it on some example inputs:

> p = parseArbitrary f
> main = do
>   parseTest p "aaa"
>   parseTest p "aaabbbcc"
>   parseTest p "aaaabcccc"
>   parseTest p "aaaaabbbbbccccc"

The last test succeeds by returning (). For the first three, we get error messages like this:

parse error at (line 1, column 4):
unexpected end of input
expecting "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y" or "z"

Obviously, these are not very helpful. But what were you expecting? After all, this is one of those things that is interesting in theory, but in practice amounts to an awful hack that no one would ever want to use in real code.

In the end, it’s still true that Applicative can only parse context-free languages as long as we restrict ourselves to finite grammars—which any sensible programmer would do anyway.

[ETA: it looks like using infinite grammars is not as impractical as I thought—see the comments below!]


About Brent

Associate Professor of Computer Science at Hendrix College. Functional programmer, mathematician, teacher, pianist, follower of Jesus.
This entry was posted in haskell and tagged , , , , , , . Bookmark the permalink.

11 Responses to Parsing context-sensitive languages with Applicative

  1. ezyang says:

    I’m not sure, but does this imply that there is a principled way to embed arbitrary context-sensitive grammars *inside* context-free grammars?

    • Brent says:

      I’m not sure I understand what you’re asking here.

      • Ryan Ingram says:

        Well, consider this implementation of a^n b^n c^n:

        abcCount n = (char ‘a’ *> abcCount (n+1)) string (replicate n ‘b’ ++ replicate n ‘c’)

        p = abcCount 0

        This grammar still has an infinite number of productions, but they are done in a principled way– a family of productions abcCount_n indexed by a natural number. I think you will find you get much better error messages with this implementation.

  2. Andrea Vezzosi says:

    Haven’t you shown here that it handles all the recursively enumerable languages too? or those decidable at least if we assume the (String -> Bool) predicate to be total?

    Also I thought Edward Kmett had some reasonable use for infinite grammars, but they seem to be handy in Agda anyhow:

    • Brent says:

      Yes, you’re right. I was using “context-sensitive” as if it meant “not context-free” but I see now it is narrower than that. And thanks for the link.

  3. Infinite grammars are perfectly useful, provided that we acknowledge the structure they have. This is just like the reason why PDAs are useful: technically push-down automata are infinite-state machines, but by encoding them as a finite-state machine plus a stack we make explicit the subregularities of the infinite-state machine. With PDAs we’re saying that infinite graphs (i.e., infinite DFAs) are helpful, because we can finitize them; with your example in mind, we’re saying that infinite hypergraphs (i.e., infinite CFGs) are helpful— which should intuitively follow from the facts that hypergraphs are helpful and that infinite graphs are helpful.

    This is the same intuition behind why Ryan Ingram’s implementation gives better error messages: his version makes the subregularities defining the a^n b^n c^n language explicit; that is, he makes the finitization of the infinite hypergraph explicit. We can do this for a large collection of languages simply by adding indices/parameters to the nonterminals of a CFG —which is akin to giving a stack to a DFA— and this is what is done in many grammar formalisms more powerful than context-free. It’s the explicit finitization which enables giving decent error messages.

    Though your example is quite nice because it shows how we can take an arbitrary characteristic function and convert it into an Applicative parser. I’m loathe to call a characteristic function a “grammar” for precisely the same reason why you get awful error messages— namely, that characteristic functions have no structure, whereas I’d define grammars as being a means of explaining the structure behind (particular classes of) characteristic functions.

  4. Dan Doel says:

    To pile on a bit more…. (sorry)

    When we translate context free grammars into the typical combinator libraries, we arguably make use of infinite grammars all the time. For instance, to translate:

    S ::= a S b | e

    We write

    s = char ‘a’ *> s <* char 'b' return ()

    or something of the sort. This looks just like the context free grammar, but it’s actually self-referential at the Haskell expression level, meaning that it’s just as infinite as:

    naturals = 0 : map (+1) naturals

    We just don’t typically think of it that way (although if you write parser combinator libraries in, say, Scala, you’ll bang up against this, as if you don’t make your combinators sufficiently lazy, definitions like the above will simply blow up), because it looks so similar to the CFG. But most libraries don’t deal in terminals and non-terminals, they deal in infinite trees of terminals glued together with concatenation and alternation.

    So infinite grammars are actually ubiquitous in the typical parser combinator approach.

  5. Pingback: How do you identify monadic design patterns? | PHP Developer Resource

  6. Nils Anders Danielsson says:

    If the set of tokens is infinite, then monadic combinators can, at least in some cases, be more expressive. For instance, consider the language { nn | n ∈ ℕ }. The approach outlined in your post does not work for this language. I discuss this in more detail in the paper mentioned above (Total Parser Combinators).

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.