(λblog. blog blog) (λblog. blog blog)

diverging since 2008

Attribute Grammars, GADTs and MonadFix (part 1)

2008-06-15 23:00

Prelude

I attended a compilers course this spring. The final excercise was to implement from scratch a compiler for what was basically a very restricted subset of C. The compiler was to generate LLVM assembly that could then be optimized and compiled into native binaries with the (very high quality) LLVM tools. I recommend having a look at the various LLVM benchmarks: it's going to be a major player in programming language implementation in the future.

All this however is but a backdrop to what I'm going to write about. I implemented the compiler in Haskell and bumped into very interesting design questions. First of all, I decided to implement the AST (abstract syntax tree, generated by the parser) as a generalized algebraic datatype (GADT) because I had been itching to find some real use for GADTs and ASTs were the canonical example of their use. The following is a condensed version of what the finished compiler uses (preceded by some boilerplate to make this post work as literate haskell):

> {-# LANGUAGE KindSignatures, EmptyDataDecls, MultiParamTypeClasses,
>              FlexibleInstances, GADTs #-}
>
> import Control.Monad
> import Control.Monad.Fix
> import Control.Monad.Error
>
> data Expr
> data Stmt
> data Location
>
> data NoAnn a = NoAnn
> type Identifier = String
> type Op = String
>
> data AST (ann :: * -> *) t where
>     If      :: ann Stmt -> AST ann Expr -> AST ann Stmt -> AST ann Stmt
>     Block   :: ann Stmt -> [AST ann Stmt] -> AST ann Stmt
>     Assign  :: ann Stmt -> AST ann Location -> AST ann Expr -> AST ann Stmt
>
>     Var     :: ann Location -> Identifier -> AST ann Location
>     Ref     :: ann Expr -> AST ann Location -> AST ann Expr
>
>     Binary  :: ann Expr -> Op -> AST ann Expr -> AST ann Expr -> AST ann Expr
>     NumLit  :: ann Expr -> Integer -> AST ann Expr

In addition to this there were declarations, array types and more control structures, but I'm trying to keep things simple. As you can see the GADT is parametrised over an annotation type ann and a phantom type that indicates the type of the node (statement, expression, location). The annotation type, however, is of kind *->*: it takes the node type as an argument. This is useful because different node types need to have different annotations, for example expressions have a type while statements do not. All this could also be formulated as vanilla Haskell types as follows (with a simple annotation type for simplicity):

> data Stmt' ann = If' ann (Expr' ann) (Stmt' ann)
>                | Block' ann [Stmt' ann]
>                | Assign' ann (Location' ann) (Expr' ann)
>
> data Location' ann = Var' ann Identifier
>
> data Expr' ann = Binary' ann Op (Expr' ann) (Expr' ann)
>                | Ref' ann (Location' ann)
>                | NumLit' ann Integer

However getting the different annotation types to flow as smoothly would be a bit bothersome. (The primes are added in order to keep this blog post a working .lhs file.)

Attribute grammars and first solution

I had decided to implement the compiler with attribute grammars, a very elegant and powerful formalism for tree traversals (see Why attribute grammars matter for an introduction). There exist attribute grammar systems that generate Haskell code from a description of the grammar (UUAG being the most noteworthy), but I decided to go on with a hand-coded approach. UUAG compiles the attribute grammar into a bunch of haskell functions that mutually recurse to calculate the attributes. This however mixes up all the attributes into one traversal and is very obfuscated. A more straightforward way would be to actually annotate the tree with attribute fields.

More specifically, syntetic attributes (ones which can be computed from a node's childs' attributes) would be stored in the node's annotation field while inherited attributes (ones which propagate from a node down to it's leaves) would be just passed down as arguments to the traversal function. The basic structure of the traversal would be something like

> class Traverse1 a inherited synthetic where
>     traverse1 :: AST NoAnn a -> inherited -> AST synthetic a
> -- and instances for Stmt, Expr and Location

This solution however has the problem that makes the traversal monolithic. I'd like to have the different phases of the compiler (e.g typecheck, code generation) define their own simple traversals and then simply compose these in the driver. One other solution I toyed with was:

> type M = Either String   -- for example, actually it was RWST on top of Error
>
> -- an action that we can mfix over (as long as M is an instance of MonadFix)
> type Result a = a -> M a 
>
> data AG a = AG { 
>       block   :: M [a Stmt]                     -> Result (a Stmt),
>       ifthen  :: M (a Expr) -> M (a Stmt)       -> Result (a Stmt),
>       assign  :: M (a Location) -> M (a Expr)   -> Result (a Stmt),
>       
>       var     :: Identifier                     -> Result (a Location),
> 
>       ref     :: M (a Location)                 -> Result (a Expr),
>       binary  :: Op -> M (a Expr) -> M (a Expr) -> Result (a Expr)
> }
>
> compose :: AG a -> AG a -> AG a
> compose (AG b' i' a' v' r' bin')
>         (AG b  i  a  v  r  bin ) =
>     AG (f1 b' b) (f2 i' i) (f2 a' a) (f1 v' v) (f1 r' r) (f3 bin' bin)
>     where f1 m n = \x -> m x >=> n x
>           f2 m n = \x y -> m x y >=> n x y
>           f3 m n = \x y z -> m x y z >=> n x y z
> 
> class TraverseFix attr t where
>     traverseFix :: AG attr -> AST NoAnn t -> M (attr t)
>
> -- All we do in these instances is define the recursion scheme
> instance TraverseFix attr Stmt where
>     traverseFix ag (Block _ ss)   = mfix $ (block ag) (mapM (traverseFix ag) ss)
>     traverseFix ag (If _ e s)     = mfix $
>         (ifthen  ag) (traverseFix ag e) (traverseFix ag s)
>     traverseFix ag (Assign _ l e) = mfix $
>         (assign  ag) (traverseFix ag l) (traverseFix ag e)
> instance TraverseFix attr Location where
>     traverseFix ag (Var _ i) = mfix $ (var ag) i
> instance TraverseFix attr Expr where
>     traverseFix ag (Ref _ l)          = mfix $ (ref ag) (traverseFix ag l)
>     traverseFix ag (Binary _ o e1 e2) = mfix $
>         (binary ag) o (traverseFix ag e1) (traverseFix ag e2)

This lets one compose multiple attribute grammars (AGs) and take their fixpoint over the AST: this makes the order of composition irrelevant and actually allows for more expressive AGs. Also, it evades actually storing the intermediate attributes in the tree quite elegantly. All this is very true to the spirit of attribute grammars but the requirement of knowing all attributes beforehand violates the modularity I needed.

Stand by for part 2...