(λblog. blog blog) (λblog. blog blog)

diverging since 2008

Entries tagged "programming"

2008-06-12 01:30

Testing HsColour. Here you go: Fibonacci with iso-recursive types

> data Mu f = Mu {unMu :: f (Mu f)}
> type List a = Mu ((,) a)
> tl = snd . unMu
> hd = fst . unMu
> cns = (Mu .) . (,)
> ones = Mu (1,ones)
> fibs :: List Integer
> fibs = cns 1 . cns 1 $ go fibs (tl fibs)
>        where go (Mu (x,xs)) (Mu (y,ys)) = cns (x+y) $ go xs ys

(NB. planet readers, you can't see the colours)
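As a standalone sanity check (not in the original post), here are the definitions again in plain form plus a hypothetical helper, takeL, that unfolds a prefix of the Mu-encoded list into an ordinary one:

```haskell
-- The iso-recursive list from the post, made self-contained.
data Mu f = Mu { unMu :: f (Mu f) }
type List a = Mu ((,) a)

hd :: List a -> a
hd = fst . unMu

tl :: List a -> List a
tl = snd . unMu

cns :: a -> List a -> List a
cns = (Mu .) . (,)

fibs :: List Integer
fibs = cns 1 . cns 1 $ go fibs (tl fibs)
  where go (Mu (x, xs)) (Mu (y, ys)) = cns (x + y) $ go xs ys

-- hypothetical helper, not from the post: take the first n elements
takeL :: Int -> List a -> [a]
takeL 0 _  = []
takeL n xs = hd xs : takeL (n - 1) (tl xs)
```

With this, takeL 8 fibs unfolds to [1,1,2,3,5,8,13,21].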

2008-06-15 23:00


I attended a compilers course this spring. The final exercise was to implement, from scratch, a compiler for what was basically a very restricted subset of C. The compiler was to generate LLVM assembly that could then be optimized and compiled into native binaries with the (very high quality) LLVM tools. I recommend having a look at the various LLVM benchmarks: it's going to be a major player in programming language implementation in the future.

All this however is but a backdrop to what I'm going to write about. I implemented the compiler in Haskell and bumped into very interesting design questions. First of all, I decided to implement the AST (abstract syntax tree, generated by the parser) as a generalized algebraic datatype (GADT), because I had been itching to find some real use for GADTs and ASTs are the canonical example of their use. The following is a condensed version of what the finished compiler uses (preceded by some boilerplate to make this post work as literate Haskell):

> {-# LANGUAGE KindSignatures, EmptyDataDecls, MultiParamTypeClasses,
>              FlexibleInstances, GADTs #-}
> import Control.Monad
> import Control.Monad.Fix
> import Control.Monad.Error
> data Expr
> data Stmt
> data Location
> data NoAnn a = NoAnn
> type Identifier = String
> type Op = String
> data AST (ann :: * -> *) t where
>     If      :: ann Stmt -> AST ann Expr -> AST ann Stmt -> AST ann Stmt
>     Block   :: ann Stmt -> [AST ann Stmt] -> AST ann Stmt
>     Assign  :: ann Stmt -> AST ann Location -> AST ann Expr -> AST ann Stmt
>     Var     :: ann Location -> Identifier -> AST ann Location
>     Ref     :: ann Expr -> AST ann Location -> AST ann Expr
>     Binary  :: ann Expr -> Op -> AST ann Expr -> AST ann Expr -> AST ann Expr
>     NumLit  :: ann Expr -> Integer -> AST ann Expr

In addition to this there were declarations, array types and more control structures, but I'm trying to keep things simple. As you can see, the GADT is parametrised over an annotation type ann and a phantom type that indicates the type of the node (statement, expression, location). The annotation type, however, is of kind * -> *: it takes the node type as an argument. This is useful because different node types need different annotations; for example, expressions have a type while statements do not. All this could also be formulated as vanilla Haskell types as follows (with a single annotation type, for simplicity):

> data Stmt' ann = If' ann (Expr' ann) (Stmt' ann)
>                | Block' ann [Stmt' ann]
>                | Assign' ann (Location' ann) (Expr' ann)
> data Location' ann = Var' ann Identifier
> data Expr' ann = Binary' ann Op (Expr' ann) (Expr' ann)
>                | Ref' ann (Location' ann)
>                | NumLit' ann Integer

However, getting the different annotation types to flow as smoothly would be a bit bothersome. (The primes are added in order to keep this blog post a working .lhs file.)
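To make the payoff of the GADT concrete, here is a self-contained sketch (condensed from the definitions above, not in the original post): a single pretty-printer that traverses all three node types at once, which in the vanilla encoding would take three mutually recursive functions.

```haskell
{-# LANGUAGE GADTs, KindSignatures, EmptyDataDecls #-}

data Expr
data Stmt
data Location
data NoAnn a = NoAnn
type Identifier = String
type Op = String

data AST (ann :: * -> *) t where
    If     :: ann Stmt -> AST ann Expr -> AST ann Stmt -> AST ann Stmt
    Block  :: ann Stmt -> [AST ann Stmt] -> AST ann Stmt
    Assign :: ann Stmt -> AST ann Location -> AST ann Expr -> AST ann Stmt
    Var    :: ann Location -> Identifier -> AST ann Location
    Ref    :: ann Expr -> AST ann Location -> AST ann Expr
    Binary :: ann Expr -> Op -> AST ann Expr -> AST ann Expr -> AST ann Expr
    NumLit :: ann Expr -> Integer -> AST ann Expr

-- one function over statements, locations and expressions alike;
-- each clause refines the phantom type t
pretty :: AST ann t -> String
pretty (If _ c s)       = "if " ++ pretty c ++ " then " ++ pretty s
pretty (Block _ ss)     = "{ " ++ concatMap (\s -> pretty s ++ "; ") ss ++ "}"
pretty (Assign _ l e)   = pretty l ++ " := " ++ pretty e
pretty (Var _ i)        = i
pretty (Ref _ l)        = pretty l
pretty (Binary _ o a b) = "(" ++ pretty a ++ " " ++ o ++ " " ++ pretty b ++ ")"
pretty (NumLit _ n)     = show n

example :: AST NoAnn Stmt
example = If NoAnn
             (Binary NoAnn "<" (Ref NoAnn (Var NoAnn "x")) (NumLit NoAnn 10))
             (Assign NoAnn (Var NoAnn "x") (NumLit NoAnn 0))
```

Here pretty example renders as "if (x < 10) then x := 0".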

Attribute grammars and first solution

I had decided to implement the compiler with attribute grammars, a very elegant and powerful formalism for tree traversals (see Why attribute grammars matter for an introduction). There exist attribute grammar systems that generate Haskell code from a description of the grammar (UUAG being the most noteworthy), but I decided to go on with a hand-coded approach. UUAG compiles the attribute grammar into a bunch of Haskell functions that recurse mutually to calculate the attributes. This, however, mixes all the attributes into one traversal, and the generated code is very obfuscated. A more straightforward way would be to actually annotate the tree with attribute fields.

More specifically, synthetic attributes (ones which can be computed from a node's children's attributes) would be stored in the node's annotation field, while inherited attributes (ones which propagate from a node down to its leaves) would be just passed down as arguments to the traversal function. The basic structure of the traversal would be something like

> class Traverse1 a inherited synthetic where
>     traverse1 :: AST NoAnn a -> inherited -> AST synthetic a
> -- and instances for Stmt, Expr and Location

This solution, however, has the problem that it makes the traversal monolithic. I'd like to have the different phases of the compiler (e.g. typechecking, code generation) define their own simple traversals and then simply compose these in the driver. One other solution I toyed with was:

> type M = Either String   -- for example, actually it was RWST on top of Error
> -- an action that we can mfix over (as long as M is an instance of MonadFix)
> type Result a = a -> M a 
> data AG a = AG { 
>       block   :: M [a Stmt]                     -> Result (a Stmt),
>       ifthen  :: M (a Expr) -> M (a Stmt)       -> Result (a Stmt),
>       assign  :: M (a Location) -> M (a Expr)   -> Result (a Stmt),
>       var     :: Identifier                     -> Result (a Location),
>       ref     :: M (a Location)                 -> Result (a Expr),
>       binary  :: Op -> M (a Expr) -> M (a Expr) -> Result (a Expr)
> }
> compose :: AG a -> AG a -> AG a
> compose (AG b' i' a' v' r' bin')
>         (AG b  i  a  v  r  bin ) =
>     AG (f1 b' b) (f2 i' i) (f2 a' a) (f1 v' v) (f1 r' r) (f3 bin' bin)
>     where f1 m n = \x -> m x >=> n x
>           f2 m n = \x y -> m x y >=> n x y
>           f3 m n = \x y z -> m x y z >=> n x y z
> class TraverseFix attr t where
>     traverseFix :: AG attr -> AST NoAnn t -> M (attr t)
> -- All we do in these instances is define the recursion scheme
> instance TraverseFix attr Stmt where
>     traverseFix ag (Block _ ss)   = mfix $ (block ag) (mapM (traverseFix ag) ss)
>     traverseFix ag (If _ e s)     = mfix $
>         (ifthen  ag) (traverseFix ag e) (traverseFix ag s)
>     traverseFix ag (Assign _ l e) = mfix $
>         (assign  ag) (traverseFix ag l) (traverseFix ag e)
> instance TraverseFix attr Location where
>     traverseFix ag (Var _ i) = mfix $ (var ag) i
> instance TraverseFix attr Expr where
>     traverseFix ag (Ref _ l)          = mfix $ (ref ag) (traverseFix ag l)
>     traverseFix ag (Binary _ o e1 e2) = mfix $
>         (binary ag) o (traverseFix ag e1) (traverseFix ag e2)

This lets one compose multiple attribute grammars (AGs) and take their fixpoint over the AST: this makes the order of composition irrelevant and actually allows for more expressive AGs. Also, it quite elegantly avoids actually storing the intermediate attributes in the tree. All this is very true to the spirit of attribute grammars, but the requirement of knowing all attributes beforehand violates the modularity I needed.
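To see the fixpoint traversal in action, here is a self-contained sketch (condensed from the definitions above; Size and sizeAG are my own illustrative names, not from the compiler) of a node-counting attribute grammar run through traverseFix:

```haskell
{-# LANGUAGE GADTs, KindSignatures, EmptyDataDecls,
             MultiParamTypeClasses, FlexibleInstances #-}
import Control.Monad.Fix (mfix)

data Expr
data Stmt
data Location
data NoAnn a = NoAnn
type Identifier = String
type Op = String

data AST (ann :: * -> *) t where
    If     :: ann Stmt -> AST ann Expr -> AST ann Stmt -> AST ann Stmt
    Block  :: ann Stmt -> [AST ann Stmt] -> AST ann Stmt
    Assign :: ann Stmt -> AST ann Location -> AST ann Expr -> AST ann Stmt
    Var    :: ann Location -> Identifier -> AST ann Location
    Ref    :: ann Expr -> AST ann Location -> AST ann Expr
    Binary :: ann Expr -> Op -> AST ann Expr -> AST ann Expr -> AST ann Expr

type M = Either String
type Result a = a -> M a

data AG a = AG {
      block  :: M [a Stmt]                     -> Result (a Stmt),
      ifthen :: M (a Expr) -> M (a Stmt)       -> Result (a Stmt),
      assign :: M (a Location) -> M (a Expr)   -> Result (a Stmt),
      var    :: Identifier                     -> Result (a Location),
      ref    :: M (a Location)                 -> Result (a Expr),
      binary :: Op -> M (a Expr) -> M (a Expr) -> Result (a Expr)
}

class TraverseFix attr t where
    traverseFix :: AG attr -> AST NoAnn t -> M (attr t)
instance TraverseFix attr Stmt where
    traverseFix ag (Block _ ss)   = mfix $ block ag (mapM (traverseFix ag) ss)
    traverseFix ag (If _ e s)     = mfix $ ifthen ag (traverseFix ag e) (traverseFix ag s)
    traverseFix ag (Assign _ l e) = mfix $ assign ag (traverseFix ag l) (traverseFix ag e)
instance TraverseFix attr Location where
    traverseFix ag (Var _ i) = mfix $ var ag i
instance TraverseFix attr Expr where
    traverseFix ag (Ref _ l)          = mfix $ ref ag (traverseFix ag l)
    traverseFix ag (Binary _ o e1 e2) = mfix $ binary ag (traverseFix ag e1) (traverseFix ag e2)

-- the attribute: a node count (phantom-indexed to fit AG's kind)
newtype Size t = Size Int deriving Show

-- this AG ignores the fixpoint argument, so mfix never bottoms out
sizeAG :: AG Size
sizeAG = AG {
      block  = \mss _     -> do ss <- mss
                                return (Size (1 + sum [ n | Size n <- ss ])),
      ifthen = \me ms _   -> do { Size e <- me; Size s <- ms; return (Size (1 + e + s)) },
      assign = \ml me _   -> do { Size l <- ml; Size e <- me; return (Size (1 + l + e)) },
      var    = \_ _       -> return (Size 1),
      ref    = \ml _      -> do { Size l <- ml; return (Size (1 + l)) },
      binary = \_ m1 m2 _ -> do { Size a <- m1; Size b <- m2; return (Size (1 + a + b)) }
}

example :: AST NoAnn Stmt
example = If NoAnn (Binary NoAnn "<" (Ref NoAnn (Var NoAnn "x"))
                                     (Ref NoAnn (Var NoAnn "y")))
                   (Assign NoAnn (Var NoAnn "x") (Ref NoAnn (Var NoAnn "y")))
```

Running traverseFix sizeAG example counts the ten nodes of the example statement.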

Stand by for part 2...

2008-06-23 14:23

A handy (and quite well-known) trick in Haskell is to turn list concatenation (an O(n) operation in the length of its left operand) into function composition, which avoids copying and is O(1):

> list1 = [1,2,3]
> cat1  = list1 ++ list1            -- (++) copies list1's spine
> list2 = (1:) . (2:) . (3:)
> cat2  = list2 . list2 $ []        -- ta-dah! no copying

I benchmarked this ages ago when I first heard about the trick, but morrow publishing a library that uses it made me rerun and publish my observations. Here's the test code, the classical quicksort:

> module Main
>     where
> import Random
> import Control.Monad
> import Data.List
> qsort1 :: [Int] -> [Int]
> qsort1 [] = []
> qsort1 (x:xs) = qsort1 [a | a<-xs, a<x] ++ [x] ++ qsort1 [a | a<-xs, a>=x]
> qsort2 :: [Int] -> [Int]
> qsort2 xs = qsort xs []
>     where qsort [] = id
>           qsort (x:xs) = qsort [a | a<-xs, a<x] . (x:) . qsort [a | a<-xs, a>=x]
> main = do
>   l <- replicateM 500000 $ randomRIO (1,500000)
>   let l1 = qsort1 l
>   let l2 = qsort2 l
>   print (and $ zipWith (==) l1 l2)

Results with ghc -O0:

% for i in 1 2 3; do time ./qsort +RTS -P -K12000000; done
./qsort +RTS -P -K12000000  13.09s user 0.28s system 99% cpu 13.420 total
./qsort +RTS -P -K12000000  12.81s user 0.28s system 99% cpu 13.135 total
./qsort +RTS -P -K12000000  13.38s user 0.31s system 99% cpu 13.709 total

COST CENTRE                    MODULE               %time %alloc  ticks     bytes
qsort1                         Main                  35.4   46.2    109 143631183
qsort2                         Main                  32.5   34.3    100 106768654
main                           Main                  26.9   17.2     83  53500026
CAF                            Main                   5.2    2.3     16   7001237

and with ghc -O3:

% for i in 1 2 3; do time ./qsort +RTS -P -K12000000; done
./qsort +RTS -P -K12000000  9.40s user 0.30s system 99% cpu 9.720 total
./qsort +RTS -P -K12000000  10.58s user 0.40s system 99% cpu 11.021 total
./qsort +RTS -P -K12000000  10.10s user 0.36s system 99% cpu 10.501 total

COST CENTRE                    MODULE               %time %alloc  ticks     bytes
qsort1                         Main                  35.1   46.8     74 127559904
main                           Main                  33.6   19.1     71  52001182
qsort2                         Main                  31.3   34.1     66  93115220

So it seems the composition-based version is a tad faster and notably less memory-hungry. Also, GHC optimizes it more effectively, presumably because (.) enables all sorts of rewrites that (++) doesn't.
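The same trick can be packaged up as a newtype, which is roughly what morrow's library does (a sketch; I don't recall the library's exact API, so the names fromList, toList and append here are my own):

```haskell
-- a "difference list": a list represented as a prepend function
newtype DList a = DList ([a] -> [a])

fromList :: [a] -> DList a
fromList xs = DList (xs ++)

toList :: DList a -> [a]
toList (DList f) = f []

-- O(1) append, regardless of operand length: just compose the functions
append :: DList a -> DList a -> DList a
append (DList f) (DList g) = DList (f . g)
```

The copying is deferred until toList finally applies the composed function to [].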

As a sidenote, strangely enough replacing the list comprehensions in the above code with something like

> let (small,big) = partition (<x) xs in qsort small ++ [x] ++ qsort big

resulted in a performance decrease of some 3 s with -O3 (I didn't test without optimizations)...

Update: As folks on #haskell pointed out I should mention that this isn't a real quicksort. Quicksort refers to the exact in-place partitioning algorithm that the imperative version uses. This is of course immaterial to the benchmark.

Update 15:01: quicksilver noted that the profiling overheads for the two implementations might be different. I ran a quick test that showed the overheads to be roughly equal: unoptimized qsort2 was about 3% faster than unoptimized qsort1 both with and without profiling.

2008-06-26 18:05

Today's small Haskell satori: the least fixpoint operator

> fix f = let x = f x in x

(which seemed like dark magic when I first saw it) could be written a bit more intuitively as

> fix' f = let f' = f . f' in f' undefined

This captures the intuition that fix just applies a function iteratively to undefined (i.e. bottom, _|_). Actually we could use any value in the place of undefined but that would restrict the type of fix'. You see, the only reason f' really needs to be applied to anything is to appease the typechecker.

Of course the original (Haskell98) definition is more elegant and probably has better sharing characteristics too.
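A quick check (the factorial example is my illustration, not from the post) that the two definitions agree:

```haskell
-- the Haskell98-style definition: x is shared with itself
fix :: (a -> a) -> a
fix f = let x = f x in x

-- the "iterate f on bottom" formulation; undefined is never forced
fix' :: (a -> a) -> a
fix' f = let f' = f . f' in f' undefined

-- open-recursive factorial: rec is the "recursive call" to be tied off
fact :: (Integer -> Integer) -> Integer -> Integer
fact rec n = if n <= 1 then 1 else n * rec (n - 1)
```

Both fix fact 5 and fix' fact 5 evaluate to 120.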

Update 2008-07-11: found some related musings on haskell-cafe. See also the reply

2008-08-24 23:30

Sorry, I somehow managed to forget about writing part 2 for multiple weeks.

Okay, I'll pick up where part one ended. I've added booleans and declarations in order to make type checking non-trivial ;)

> {-# LANGUAGE KindSignatures, EmptyDataDecls, MultiParamTypeClasses,
>              FlexibleInstances, GADTs #-}
> import Control.Monad
> import Control.Monad.Error
> import Control.Monad.RWS
> import Control.Monad.Reader
> data Expr
> data Stmt
> data Location
> data Decl
> type Identifier = String
> type Op = String
> type Symbol = String
> data Type = BoolType | IntegerType
> -- the empty annotation
> data NoAnn a = NoAnn
> data AST (ann :: * -> *) t where
>     If      :: ann Stmt -> AST ann Expr -> AST ann Stmt -> AST ann Stmt
>     -- note the added declaration list
>     Block   :: ann Stmt -> [AST ann Decl] -> [AST ann Stmt] -> AST ann Stmt
>     Assign  :: ann Stmt -> AST ann Location -> AST ann Expr -> AST ann Stmt
>     Var     :: ann Location -> Identifier -> AST ann Location
>     Ref     :: ann Expr -> AST ann Location -> AST ann Expr
>     Binary  :: ann Expr -> Op -> AST ann Expr -> AST ann Expr -> AST ann Expr
>     NumLit  :: ann Expr -> Integer -> AST ann Expr
>     -- these are new
>     BoolLit :: ann Expr -> Bool -> AST ann Expr
>     Decl    :: Identifier -> Type -> AST ann Decl

Descending from the Tower o' Ivory

I decided to partly break the attribute grammar abstraction by performing different phases inside different suitable monads. This increased coding pleasure and clarity in comparison to the generalized versions presented in the previous post.

> type M = Either String    -- aka Error, used for inter-phase communication
> type TypeM = ReaderT TEnv (Either String)   -- typecheck monad
> type CodeM = RWST SEnv () SState (Either String) -- codegen monad

The Reader monad is used for passing down symbol tables, and, in the case of code generation, the label of the current loop for use in breaking. These are the inherited attributes of the respective attribute grammars.

> data SEnv = SEnv { symbols :: [Symbol], loopend :: Maybe String}
> data TEnv = TEnv { decls :: [AST NoAnn Decl]}

Additionally, code generation needs to keep counters for disambiguating variables and scopes (labels). The State monad corresponds to an attribute that flows top-down and left-to-right.

> data SState = SState { var :: Integer, scope :: Integer }

Synthetic attributes are embedded in the form of the recursive traversal itself; no fancy tricks are needed to represent them.

As another step down the abstraction ladder I differentiated the type signatures of different phases to enforce and clarify the flow of information between the phases.

Typechecking generates an annotated AST from an unannotated one:

> -- expressions and locations are annotated with their Type
> data Attr a where
>     EAttr :: { etyp :: Type } -> Attr Expr
>     LAttr :: { ltyp :: Type } -> Attr Location
>     Empty :: Attr a
> class Typecheck a where
>    typecheck :: AST NoAnn a -> TypeM (AST Attr a)
> do_typecheck :: AST NoAnn Stmt -> M (AST Attr Stmt)
> do_typecheck ast = runReaderT (typecheck ast) (TEnv [])
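The instances are left out above, so here is a self-contained sketch of what a Typecheck instance for expressions might look like (condensed from the definitions above; lookupVar, locType and exprType are my own helper names, and I use Control.Monad.Except, which replaces Control.Monad.Error on modern GHC):

```haskell
{-# LANGUAGE GADTs, KindSignatures, EmptyDataDecls #-}
import Control.Monad (unless)
import Control.Monad.Reader
import Control.Monad.Except

data Expr
data Stmt
data Location
data Decl
type Identifier = String
type Op = String
data Type = BoolType | IntegerType deriving (Eq, Show)
data NoAnn a = NoAnn

data AST (ann :: * -> *) t where
    Var    :: ann Location -> Identifier -> AST ann Location
    Ref    :: ann Expr -> AST ann Location -> AST ann Expr
    Binary :: ann Expr -> Op -> AST ann Expr -> AST ann Expr -> AST ann Expr
    NumLit :: ann Expr -> Integer -> AST ann Expr
    Decl   :: Identifier -> Type -> AST ann Decl

data Attr a where
    EAttr :: { etyp :: Type } -> Attr Expr
    LAttr :: { ltyp :: Type } -> Attr Location

data TEnv = TEnv { decls :: [AST NoAnn Decl] }
type TypeM = ReaderT TEnv (Either String)

-- read back the annotation of an already-checked subtree
locType :: AST Attr Location -> Type
locType (Var a _) = ltyp a

exprType :: AST Attr Expr -> Type
exprType (Ref a _)        = etyp a
exprType (Binary a _ _ _) = etyp a
exprType (NumLit a _)     = etyp a

-- inherited attribute: the symbol table, read from the Reader environment
lookupVar :: Identifier -> TypeM Type
lookupVar i = do
    ds <- asks decls
    case [ t | Decl i' t <- ds, i' == i ] of
      (t:_) -> return t
      []    -> throwError ("undeclared variable: " ++ i)

class Typecheck a where
    typecheck :: AST NoAnn a -> TypeM (AST Attr a)

instance Typecheck Location where
    typecheck (Var _ i) = do t <- lookupVar i
                             return (Var (LAttr t) i)

instance Typecheck Expr where
    typecheck (NumLit _ n) = return (NumLit (EAttr IntegerType) n)
    typecheck (Ref _ l)    = do l' <- typecheck l
                                return (Ref (EAttr (locType l')) l')
    typecheck (Binary _ o e1 e2) = do
        e1' <- typecheck e1
        e2' <- typecheck e2
        unless (exprType e1' == IntegerType && exprType e2' == IntegerType) $
            throwError ("type error at operator " ++ o)
        let t = if o `elem` ["<", ">", "=="] then BoolType else IntegerType
        return (Binary (EAttr t) o e1' e2')
```

Running it with runReaderT (typecheck e) (TEnv [Decl "x" IntegerType]) annotates e bottom-up, or fails in the Either monad on an undeclared variable.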

Code generation takes an annotated AST and transforms it into code:

> data Code a where
>     -- code is the body, out is the register the result is in
>     ECode :: { code :: String, out :: String } -> Code Expr
>     -- lcode fetches the reference, which is then available in ptr
>     LCode :: { lcode :: String, ptr :: String } -> Code Location
>     -- statement code is just a body
>     SCode :: String -> Code Stmt
> class Codegen a where
>    codegen :: AST Attr a -> CodeM (Code a)
> do_codegen :: AST Attr Stmt -> Either String String
> do_codegen ast = do 
>   (SCode code,_) <- evalRWST (codegen ast) (SEnv [] Nothing) (SState 0 0)
>   return code

Now our glorious two phase compiler (still missing the instances ;) could be written as \s -> parser s >>= do_typecheck >>= do_codegen. The do_ functions are restricted to statements on purpose: we know that the root node of the AST will be a statement.

Notice how the Code and Attr types are parametrized with the same phantom types as nodes in the AST. This is what we had in mind when designing the GADT.

Now we have managed to present the different phases modularly while not straying too far from the original attribute grammar approach.

Okay, this has been sitting in my draft directory for long enough; better to "release early, release often".

Coming next: the implementation.

Update 23:35

Update 2008-08-25: The do_ functions

2008-08-26 17:25

The two final courses for a candidate degree here at Univ Helsinki CS are Ohjelmistotuotantoprojekti (Software Engineering Project) and Tieteellinen kirjoittaminen (Scientific Writing). The former is a programming project for groups of about eight, with make-believe real software engineering processes. The projects also have proper clients, though these tend to be other departments or research projects.

Like most universities nowadays, Helsinki is a Java school. You can probably feel the dread I had for executing a largeish project in Java with a bunch of Java programmers. Thus my exuberance on finding out that one of this year's projects would be extending a research project coded in Lisp. Earlier today our group tutor mailed us a link to Practical Common Lisp. Things are looking up :)

In other news, the move to Pasila went ok. The interview with zendroid is coming up.

Tags: life, programming.
2009-06-17 01:30

Trying to overcome my writer's block by jotting down some random thoughts.


My life is a cyberpunk novel. I rise tired after minimal sleep, slam stimulants into my system and run through a DDR-esque concrete ghetto to the tram stop. I only wake up properly at work where I juggle abstractions and try to teach robot hands to pick up cubes in a virtual reality.


The Haskell project course is chugging along nicely. People are encountering frustrating and sometimes even interesting real-world problems like FFI, installing libraries and handling exceptions in pure code. I plan to blog on the exception issue soon.

Also, we've been thinking about the future teaching of functional programming at Univ Helsinki. There's a course on formal type theory (with Coq!) coming up this fall.


At work we've moved to Roundup, a pretty pleasant issue tracker that is easy to customize. The default templates could be a bit cleaner, though. (E.g., there's support for storing different queries in the db, but the menu uses hard-coded GET parameters.)

I've been working on compiling our stream-computation DSL to JVM bytecode that just loops over a bunch of arrays. It's been sweet so far.


Foundations for Programming Languages is a great book. Tackles some type theory but mainly other cool stuff such as algebras (as featured in algebraic datatypes), different models of lambda calculus and logical relations. A good complement for TaPL.


I have some travelling to do this summer:

  • The IMC competition in Bulgaria
  • A trip to Austria, including hiking in Hohe Tauern national park and some Vienna
  • Cottage and boating trips with parents & grandparents
