Posts tagged: compilers

MSR internship and some retrospection

I feel I can finally write about it: I got accepted for a three-month internship at Microsoft Research Cambridge! This means I will be developing GHC and, hopefully, doing some serious research on functional programming and compiler implementation. My internship starts tomorrow, on 1st July. I’m not yet 100% certain about the exact topic of my research, so I’ll refrain from going into any technical details for now and focus instead on my personal experience with functional programming. This feels like a good moment to summarize the past year and a half: I learned about functional programming at the very beginning of 2012 and since then I have progressed from knowing nothing at all to being in Cambridge – something I would not have imagined 18 months ago.

Somewhere around July 2011 I finished writing my PhD thesis. I still had to deal with many formalities – which in the end took 8 months – but the most important part of my work was done and I only continued research on a few minor subjects that I had run into while writing the thesis. Somewhere in October I decided I needed a break from all my current research topics – I finally wanted some time to pursue subjects that had interested me all along and for which I never had time. Compiler construction and the theory of automata were the two main topics I had in mind. That was the plan, but it wasn’t meant to work out, at least not yet. Somewhere around December 2011 I stumbled upon the book “Seven Languages in Seven Weeks”, which was my first contact with functional programming. I didn’t follow the book exactly. I read the chapters about Ruby, Io, Prolog (so much fun!), Scala and Erlang, but instead of the chapter about Clojure I went for Scheme – I read the R5RS language specification and The Little Schemer – and when I reached the chapter about Haskell I decided to read Learn You A Haskell instead. At that point I already knew that Haskell is the functional programming language, and I think this was the moment I started having serious plans around functional programming. At the same time I was figuring out how to learn about compilers. In April Stanford University announced their two online courses on Compilers and Automata – these were a real godsend. The Compilers course ended in late June. That concluded my first six months of contact with FP, and those months were extremely intense: I learned the theoretical and practical foundations of compilers, a new programming paradigm and some new languages designed in that paradigm. I also started reading research papers on functional programming, with a focus on the implementation of GHC. At that point I didn’t even try to work on the source code, but I was trying to understand how the compiler is designed.

The next six months, from July to December, were not as fruitful. I took an interest in data-parallel computations in Haskell, as this seemed to be an active research topic and was also related to my PhD work. I made a failed attempt at an efficient parallel implementation of a wavelet transform. Although I wasn’t successful, my time was not wasted: I learned how to write, test and benchmark libraries in Haskell and read a lot of papers on FP. I also got in touch with Ben Lippmeier, who pointed me to a problem in GHC he needed fixed. This was somewhere in January 2013. I had already started reading the source code of GHC in December, but now I finally had a particular problem to solve. It was time to start working on GHC, and that is mostly what I did during the last six months, although I also managed to spend some time on theory (more papers and a book on combinatory logic).

As for the internship, I decided to apply in February. I polished my CV and cover letter (many thanks go to my friend Marek for his help) and sent my application at the beginning of March. After an interview with Geoffrey Mainland and Simon Peyton Jones, I got the acceptance notification at the beginning of April. And here I am in Cambridge, over 1300 km from home, waiting for my first day at Microsoft Research.

Getting friendly with STG

I’ve been spending the last few months developing GHC. No rocket science so far, just a bit of hacking here and there. The biggest thing I am working on is ticket #6135, which is about changing some of the existing PrimOps to return an unboxed Int# instead of a Bool. This means that the result of comparing two unboxed values will be either an unboxed 0# or an unboxed 1#, instead of a tagged pointer to a statically allocated object representing True or False. This modification will make it possible to write branchless algorithms in Haskell. I promise to write about this one day, but today I want to blog about a different topic.
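To give a rough idea of what this enables, here is a minimal sketch of a branchless maximum. It assumes the PrimOp types proposed in #6135, where (>=#) has type Int# -> Int# -> Int# and returns 1# or 0#; the function names are mine, for illustration only:

{-# LANGUAGE MagicHash #-}
module Branchless where

import GHC.Exts

-- Assumes the post-#6135 type (>=#) :: Int# -> Int# -> Int#.
-- The larger value is selected arithmetically, with no conditional jump.
branchlessMax :: Int# -> Int# -> Int#
branchlessMax a b =
  let gt = a >=# b                    -- 1# if a >= b, 0# otherwise
  in (gt *# a) +# ((1# -# gt) *# b)   -- picks a when gt is 1#, b when 0#

-- Boxed wrapper, convenient for testing in GHCi.
maxInt :: Int -> Int -> Int
maxInt (I# a) (I# b) = I# (branchlessMax a b)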

It so happens that the things I’ve been doing in GHC require me to make changes in the code generator. This is a bit challenging for me, because the code generator is something that didn’t interest me much when I started to learn about compilers. Probably the main reason is that code generation means dealing with assembly. I’ve been programming for about 16 years and only two languages caused me problems when I tried to learn them. Assembly is one of them¹. I studied it for a year during my studies and, although I had no problem understanding the ideas behind assembly and writing short snippets of code, writing a larger piece of code always ended in a headache.

It looks like the time has come to overcome my fear. Over the last few months I’ve been reading a lot of assembly generated by GHC and I’ve even made some attempts at writing assembly code myself (well, using intrinsics, but I guess that counts). But between the Haskell source code and the generated executable there are many intermediate steps. From my observations it seems that many Haskellers have a basic knowledge of Core, GHC’s intermediate language. Most have also heard of the other two intermediate representations used by GHC – STG and Cmm – but it seems that few people know them unless they hack the compiler. And since I’m hacking the compiler, I should probably know more about these two representations, right?
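A good way to start getting acquainted with all of these representations is to make GHC print them: the -ddump-simpl, -ddump-stg, -ddump-cmm and -ddump-asm flags dump Core, STG, Cmm and the final assembly, respectively. For example, for a toy module like this one:

-- Sum.hs: a small module for inspecting GHC's intermediate representations.
-- Compile with:
--   ghc -O2 -ddump-simpl -ddump-stg -ddump-cmm -ddump-asm Sum.hs
module Sum where

sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go acc i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)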

There’s a classic paper by Simon Peyton Jones, “Implementing lazy functional languages on stock hardware: the Spineless Tagless G-machine”. It is quite long – 87 pages in total – and, having been published in 1992, it is mostly out of date. These two things kept me from reading it, although I think that being out of date was only a pretext for avoiding almost 90 pages of text. But since I need to learn about STG, I finally decided to give it a shot. Reading the paper took me four days. The paper is very well written and in general an easy read. I was afraid I might not understand the formal description of STG’s operational semantics, but it turned out to be well explained, so I had no problem with that. The real difficulty was the sheer amount of information I had to absorb while reading, which left me with an incomplete understanding of the last sections – not because they are more difficult than the initial ones, but because I didn’t fully remember all the details discussed earlier.

An important question is which information is out of date. I’m not yet familiar with the existing implementation, but it seems that many things have changed: the Spineless Tagless G-machine is not tagless any more since the introduction of pointer tagging; curried functions are now applied using the eval/apply convention, while the paper describes push/enter; the paper discusses only compilation to C, while nowadays the C back-end is being deprecated in favour of the native code generator and LLVM; and finally, the layout of closures is now slightly different from the one presented in the paper. I am almost certain that garbage collection is also performed differently. These are just the differences I noticed, which means that a lot has really changed in the 20 years since publication. Surprisingly, this doesn’t seem like a big problem: the most important thing is that the paper conveys the ideas behind how STG works, and the changes above are mostly details.

So, now that I have a basic idea of how STG works, what comes next? There are a few follow-up papers:

  • “The STG runtime system (revised)” – an updated description of STG, written in 1999 by Simon Peyton Jones and Simon Marlow. I guess it’s also outdated by now, but still probably worth reading. It has only 65 pages :)
  • “Making a Fast Curry: Push/Enter vs. Eval/Apply for Higher-order Languages” – this describes the eval/apply and push/enter conventions mentioned above. I’ve already read this one.
  • “Faster Laziness Using Dynamic Pointer Tagging” – this one explains why STG is no longer tagless. I’ve read it as well.

And once I’ve dealt with STG, I’ll have to learn about Cmm.

  1. In case you’re interested, the other one is Io.

Benchmarking GHC HEAD with Criterion

So you’re developing GHC. You make some changes that affect the performance of compiled programs, but how do you check whether performance has really improved? Well, if you’re making general optimisations – a new Core-to-Core transformation, perhaps – then you can use the NoFib benchmark suite, which is the commonly accepted method of measuring GHC performance. But what if you’re developing a very specific optimisation that is unlikely to be exercised by NoFib? What if you extended the compiler so that one can write faster code in a way that was previously impossible, leaving NoFib no way to measure the improvement? Sounds like writing some criterion benchmarks would be a Good Thing. There’s a problem though: installing criterion with GHC HEAD. Criterion has lots of dependencies, but you cannot install them automatically with cabal-install, because cabal-install usually doesn’t work with GHC HEAD (although the Cabal library is one of GHC’s boot libraries). On the other hand, installing the dependencies manually is a pain. Besides, many libraries will not compile with GHC HEAD. So how do you write criterion benchmarks for HEAD? I faced this problem some time ago and found a solution which, although not perfect, works fine for me.
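For reference, here is the kind of benchmark I want to be able to build – a minimal sketch using criterion’s standard defaultMain, bgroup, bench and whnf entry points, with a stand-in fib function in place of whatever code exercises the new compiler feature:

module Main where

import Criterion.Main

-- A stand-in for code that exercises the new compiler feature.
fib :: Int -> Integer
fib n = go n 0 1
  where
    go 0 a _ = a
    go k a b = go (k - 1) b (a + b)

main :: IO ()
main = defaultMain
  [ bgroup "fib"
      [ bench "fib 1000" (whnf fib 1000)
      , bench "fib 5000" (whnf fib 5000)
      ]
  ]

Compiling this, however, requires criterion and its whole dependency tree to be built with HEAD first.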

In principle my idea is nothing fancy:

  1. download all the required dependencies from Hackage and extract them in a single directory,
  2. determine the order in which they need to be installed,
  3. build each library with GHC HEAD, resolving build errors if necessary,
  4. register each library with GHC HEAD (see Appendix below).

Doing these things for the first time was very tedious and took me about 2-3 hours, with determining the package dependencies probably the most time-consuming part. Resolving the build errors wasn’t that bad, though there were a couple of difficulties. It turned out that many packages put an upper bound on the version of the base package, and relaxing that constraint (e.g. loosening a hypothetical build-depends: base >= 4.5 && < 4.7 to just base) was often the only change required to build a package.

The key to my solution is that once you figure out the order in which the packages should be installed and fix the build errors, you can write a shell script that builds and installs the packages automatically. This means that after installing GHC HEAD in a sandbox (see Appendix below) you can run the script to build and install all the packages, giving you a fully working GHC installation in which you can write criterion benchmarks for the new features you implemented in the compiler. Here’s what the script looks like (full version available here):

#!/bin/bash

# Packages are listed in dependency order: each package appears after
# all of the packages it depends on.
PKGS="\
primitive-0.5.0.1 \
vector-0.10.0.1 \
dlist-0.5 \
vector-algorithms-0.5.4.2 \
..." # more packages in this list

if [[ $# -gt 1 ]]; then
    echo "Too many parameters"
    exit 1
elif [[ $# -eq 1 ]]; then
    if [[ $1 == "clean" ]]; then
        # remove build artefacts from every package directory
        echo -n "Cleaning"
        for i in $PKGS
        do
            echo -n "."
            cd $i
            rm -rf dist
            rm -f Setup Setup.o Setup.hi
            cd ..
        done
        echo "done"
    else
        echo "Invalid parameter: $1"
        exit 1
    fi
else
    for i in $PKGS
    do
        echo "Installing package $i"
        cd $i
        # compile Setup.(l)hs, then configure, build and register the
        # package, aborting on the first failure
        ((if [[ -f Setup.lhs ]]; then ghc Setup.lhs; else ghc Setup.hs; fi) && \
            ./Setup configure --user --enable-shared \
            && ./Setup build && ./Setup install) \
            || exit
        cd ..
    done
fi

The script is nothing elaborate. Running it without any parameters will build and install all packages on the list. If you run it with the “clean” parameter, it will remove build artefacts from the package directories. If the script fails for some reason – e.g. one of the libraries fails to build – you can comment out the already-installed libraries so that the script resumes from the point where it previously stopped.

Summary

Using the approach described above I can finally write criterion benchmarks for GHC HEAD. There are a couple of considerations though:

  • things are likely to break as HEAD gets updated. Be prepared to add new libraries as dependencies, change compilation parameters or fix new build errors,
  • for some time now you have needed to pass the --enable-shared flag to cabal configure so that shared libraries are built. This causes every library to be compiled twice. I don’t know if there’s anything one can do about that,
  • you need to manually download new versions of libraries,
  • fixing build errors manually may not be easy,
  • rerunning the script when something fails may be tedious,
  • changes in HEAD might cause performance problems in the libraries you are using. If this goes unnoticed, the benchmarking results might be invalid (though I think this problem is hypothetical).

You can download my script and the source code for all the modified packages here. I can’t guarantee it will work for you, since HEAD changes all the time. It’s also quite possible that you don’t need some of the libraries I use, for example Repa.

Appendix: Sandboxing GHC

For the above method to work effectively you need a sandboxed installation of GHC. There are tools designed for sandboxing GHC (e.g. hsenv), but I use a method described here – it’s perfectly suited to my needs. I like having full manual control when needed, but I also use the following shell script to automate switching between sandboxes:

#!/bin/bash

# Switch between sandboxed GHC installations by repointing the "active"
# symlink in the sandbox directory.
SANDBOX_DIR="/path/to/ghc-sandbox/"
ACTIVE_SYMLINK="${SANDBOX_DIR}active"
STARTCOLOR="\e[32m";
ENDCOLOR="\e[0m";

active_link_name=`readlink ${ACTIVE_SYMLINK}`
active_name=`basename ${active_link_name}`

# with no parameters, list all sandboxes, highlighting the active one
if [[ $# -lt 1 ]]; then
  for i in `ls ${SANDBOX_DIR}`; do
    if [[ $i != "active" ]]; then
      if [[ $i == $active_name ]]; then
        echo -e "* $STARTCOLOR$i$ENDCOLOR"
      else
        echo "  $i"
      fi
    fi
  done
  exit
fi

# otherwise repoint the "active" symlink at the requested sandbox
for i in `ls ${SANDBOX_DIR}`; do
  if [[ $i == $1 ]]; then
    cd $SANDBOX_DIR
    rm ${ACTIVE_SYMLINK}
    ln -s $1 ${ACTIVE_SYMLINK}
    exit
  fi
done

echo "Sandbox $1 not found"
exit 1

It displays a list of sandboxes when run without parameters (the active sandbox is displayed in green and marked with an asterisk) and switches the active sandbox when given one as a command-line parameter. Together with bash’s autocompletion, switching between different GHC versions is a matter of seconds.
