Parsing context-sensitive languages with Applicative

January 5, 2012

Many parser combinator libraries in Haskell (such as parsec) have both a Monad interface as well as an Applicative interface. (Actually, to be really useful, you also need MonadPlus along with Monad, or Alternative along with Applicative, in order to encode choice; from now on when I talk about Monad and Applicative note that I really have MonadPlus or Alternative in mind as well.) The Applicative interface is often nicer, but it is less powerful than the Monad interface: in particular, using Applicative you can only parse context-free languages, whereas Monad lets you parse arbitrary context-sensitive languages. Intuitively, this is because the structure of Applicative computations cannot depend on intermediate results, whereas Monad computations allow you to choose which computation to run next based on intermediate results.

This is a good bit of intuition, with only one minor caveat: it isn’t true! I believe it was two years ago, during the second Hac phi, when I first learned from Edward Kmett how Applicative (by which I mean, of course, Alternative) can be used to parse arbitrary context-sensitive languages. The question just came up again in the #haskell IRC channel, and I figured it would be useful to have this all written down somewhere. In particular, Reid Barton gave a nice sketch which I decided to turn into some working code.

Here’s the key insight: normally, grammars are defined as finite objects: a finite set of terminals, a finite set of nonterminals, and a finite set of productions. However, Haskell’s general recursion means that we can write down a "grammar" with an infinite set of production rules. This is what lets us get away with parsing context-sensitive languages with Applicative: we just make a different production rule for every possible input!

First, some imports. Notice that I do not import Control.Monad.

> import Text.Parsec
> import Text.Parsec.String
> import Control.Arrow ((&&&))
> import Control.Applicative hiding ((<|>))
> import Data.List (group)

The usual guard function is for MonadPlus, but we can make something equivalent for Alternative.

> guard' :: Alternative f => Bool -> f ()
> guard' True  = pure ()
> guard' False = empty

And now for the meat of the example. parseArbitrary takes an arbitrary predicate on Strings built from lowercase letters and turns it into a parser. The created parser will accept Strings for which the predicate evaluates to True (returning ()) and fail for any other string.

> parseArbitrary :: (String -> Bool) -> Parser ()
> parseArbitrary p =

If we encounter eof, we simply ensure that the predicate holds of the empty string.

>       (eof <* guard' (p [])) 

Otherwise, we choose between 26 alternatives based on the next character in the input. If the character c is encountered, we make a recursive call to parseArbitrary (p . (c:)). The remainder of the input must satisfy (p . (c:)), that is, it must consist of some String s such that (c:s) satisfies the predicate p.

>   <|> foldr (<|>) parserZero 
>         (map (\c -> char c *> 
>                     parseArbitrary (p . (c:))
>              ) 
>              ['a'..'z']
>         )

For any given predicate p, you can think of parseArbitrary p as an infinite tree with a 26-way branch at each node. Each node "remembers" the path taken to reach it from the root of the tree, in the form of prepend functions composed with the original predicate. We have constructed an infinite grammar: each node in the tree corresponds to a production, one for every possible input prefix.

Let’s try it out. Here’s a function which only accepts Strings of the form "aaabbbccc", with an equal number of a’s, b’s, and c’s. This is a well-known example of a language which is not context-free (easily shown using the pumping lemma for context-free languages).

> f :: String -> Bool
> f s 
>   | [('a',na), ('b',nb), ('c',nc)] 
>     <- map (head &&& length). group $ s
> 
>     = na == nb && nb == nc
> 
>   | otherwise = False

Now we make f into a parser and test it on some example inputs:

> p = parseArbitrary f
> 
> main = do
>   parseTest p "aaa"
>   parseTest p "aaabbbcc"
>   parseTest p "aaaabcccc"
>   parseTest p "aaaaabbbbbccccc"

The last test succeeds by returning (). For the first three, we get error messages like this:

parse error at (line 1, column 4):
unexpected end of input
expecting "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y" or "z"

Obviously, these are not very helpful. But what were you expecting? After all, this is one of those things that is interesting in theory, but in practice amounts to an awful hack that no one would ever want to use in real code.

In the end, it’s still true that Applicative can only parse context-free languages as long as we restrict ourselves to finite grammars—which any sensible programmer would do anyway.

[ETA: it looks like using infinite grammars is not as impractical as I thought---see the comments below!]


trapd

May 11, 2011


Themes on Streams

May 9, 2011
> {-# LANGUAGE DeriveFunctor, FlexibleInstances #-}

Recall that a stream is a countably infinite sequence of values:

> data Stream a = a :> Stream a
>   deriving (Functor, Show)
> 
> sHead (a :> _) = a
> sTail (_ :> s) = s

Streams are lovely things (especially in a lazy language) with many nice properties.

> type Theme = String
> type Consciousness = Stream Theme

The remainder of this blog post (in two parts) will be of type Consciousness.

Theme 1: Stream is a monad

The other day I read Jeremy Gibbons’s blog post about the stream monad, proving the monad laws for the version of join that diagonalizes nested streams:

> instance Monad Stream where
>   return a       = a :> return a
>   s >>= f        = sJoin (fmap f s)
> 
> sJoin (s :> ss) = sHead s :> sJoin (fmap sTail ss)

sJoin takes a stream of streams, and outputs a stream with the first element of the first stream, the second element of the second stream, … the nth element of the nth stream.

I recommend reading his post, it’s a very cool example of using universal properties and equational reasoning.

Theme 2: Streams are (isomorphic to) functions

In a comment on Jeremy’s post, Patai Gergely noted that insight into this issue can be gained by observing that Stream a is isomorphic to Nat -> a: there is one item in a stream at every natural number index.

Of course, (->) Nat is a monad: in fact, (->) e is a monad (the "reader monad") for any type e. And what does join for the (->) e monad do? Why, it duplicates an argument:

> rJoin :: (e -> (e -> a)) -> (e -> a)
> rJoin s e = s e e

A little thought shows that this corresponds exactly to the diagonalizing join on Streams: rJoin s e = s e e can be read as "the eth element of rJoin s is the eth element of the eth stream in the stream of streams s." See?

So I told my officemate Daniel about this, and it occurred to us that there’s nothing special here about Nat at all! rJoin is polymorphic in e; R -> a is a monad for any type R.

Functors which are isomorphic to (->) R for some concrete type R are called representable; so what this means is that all representable functors are monads. (Not that we were the first to think of this.)

Theme 3: Streams are comonads

Hmm, but Stream is a comonad too, right?

> class Comonad w where
>   extract   :: w a -> a
>   duplicate :: w a -> w (w a)
> 
> instance Comonad Stream where
>   extract                = sHead
>   duplicate s@(hd :> tl) = s :> duplicate tl

And since Stream is isomorphic to (->) Nat, that type must be a comonad too. What is the corresponding comonad instance for (->) Nat?

> type Nat = Integer   -- just pretend, OK?
> 
> instance Comonad ((->) Nat) where

Extracting from a stream just returns its head element, that is, the element at index 0. So extracting from a Nat -> a applies it to 0.

>   extract = ($0)

Duplicating a stream gives us a stream of streams, where the nth output stream contains consecutive elements from the original stream beginning with the nth. Put another way, the jth element from the ith stream in the output is the (i+j)th element of the original stream:

>   duplicate s i j = s (i + j)

Neat. Unlike the implementation of join for the monad instance, however, this is definitely making use of the particular structure of Nat. So this throws some cold water on our hopes of similarly generalizing this to all representable functors.

…or does it?

Theme 4: Comonads for representable functors

In the previous paragraph I said "this is definitely making use of the particular structure of Nat", but I was being deliberately obtuse. Nat has lots and lots of structure, surely we weren’t using all of it! In fact, we only mentioned the particular natural number 0 and the addition operation. Hmm… zero and addition… what’s interesting about them? Well, zero is the identity for addition, of course, and addition is associative — that is, the natural numbers form a monoid under addition with zero as the identity. So just as a wild guess, perhaps it’s the monoid structure of Nat which makes the comonad instance possible?

In fact… yes! It turns out that comonad structures on R -> a are in one-to-one correspondence with monoid structures on R. To prove this, we can… oh, darn, looks like we’re out of time! (Read: this blog post is getting too long and if I don’t publish something soon I never will.) I’ll continue in another post, but in the meantime you might fancy trying to prove this yourself.


Follow

Get every new post delivered to your Inbox.