MSR internship and some retrospection

I feel I can finally write about: I got accepted for a three-month internship at Microsoft Research Cambridge! This means I will be developing GHC and, hopefully, doing some serious research on the subject of functional programming and compiler implementation. My internship starts tomorrow, on 1st July. I’m not yet 100% certain about the exact topic of my research, so I’ll refrain from going into any kind of technical details for now and I will focus on my personal experience with functional programming. I feel this is really a good moment to summarize the past 1,5 year. I learned about functional programming at the very beginning of 2012 and since then I progressed from knowing completely nothing to being in Cambridge – something I would have not imagined 18 months ago.

Somewhere around July 2011 I finished writing my PhD. I had yet to deal with many formalities – which in the end took 8 months – but the most important part of my work was done and I only continued research on a few minor subjects that I ran into while writing a PhD. Somewhere in October I decided I need a break from all my current research topic – I finally wanted some time to pursue topics that interested me all along and for which I never had time. Compiler construction and theory of automata were two main topics I had in mind. That was the plan, but it wasn’t meant to work out, at least not yet. Somewhere around December 2012 I stumbled upon a book “Seven languages in seven weeks”, which was my first contact with functional programming. I didn’t follow the book exactly. I read chapters about Ruby, Io, Prolog (so much fun!), Scala and Erlang, but instead of reading chapter about Clojure I went for Scheme. I read R5RS language specification and The Little Schemer and when I reached the chapter about Haskell I decided to read Learn You A Haskell instead. At that point I already knew that Haskell is the functional programming language and I think that this was the moment I started having some serious plans about functional programming. But at the same time I was figuring out how to learn about compilers. It was April when Stanford University announced their two online courses on Compilers and Automata – these were really godsend. The Compilers course ended in late June. This concludes my first six months of contact with FP and I think that these months were extremely intense. I learned theoretical and practical foundations of compilers, a new programming paradigm and some new languages designed in that paradigm. I also started reading research papers on functional programming, with a focus on implementation of GHC. At that point I didn’t even try to work on the source code, but I was trying to understand how the compiler is designed.

The next six months, from July to December, were not as fruitful. I picked up interest in doing data-parallel computations in Haskell, as this seemed to be an active topic of research and also related to my PhD work. I made a failed attempt of an efficient parallel implementation of a wavelet transform. Although I wasn’t successful, my time was not wasted: I learned how to write, test and benchmark libraries in Haskell and also read a lot of papers on FP. I also got in touch with Ben Lippmeier, who pointed me to one problem with GHC he needed fixed. This was somewhere in January 2013. I already started reading the source code of GHC in December, but now I finally had a particular problem to solve. It was the time to start working on GHC. That is mostly what I did during the last six months, although I also managed to spend some time on theory (more papers and a book on combinatory logic).

As for the internship, I decided to apply for it in February. I polished my CV and cover letter (many thanks go to my friend Marek for his help) and sent my application at the beginning of March. After an interview with Geoffrey Mainland and Simon Peyton Jones I got acceptance notification at the beginning of April. And here I am in Cambridge, over 1300km from home, waiting for my first day at Microsoft Research.

Getting friendly with STG

I’ve been spending last months on developing GHC. No rocket science so far, just a bit of hacking here and there. The biggest thing I am working on is ticket #6135, which is about changing some of the existing PrimOps to return unboxed Int# instead of Bool. This means that the result of comparing two unboxed values will be either an unboxed 0# or unboxed 1#, instead of a tagged pointer to statically allocated object representing True or False. This modification will allow to write branchless algorithms in Haskell. I promise to write about this one day, but today I want to blog about a different topic.

It so happens that things I’ve been doing in GHC require me to make changes in the code generator. This is a bit challenging for me, because the code generator is something that didn’t interest me much when I started to learn about compilers. Probably the main reason for this is that code generation means dealing with assembly. I’ve been programming for about 16 years and only two languages caused me problems when I tried to learn them. Assembly is one of them1. I have been learning it for one year during my studies and, although I had no problems with understanding the idea behind assembly and writing short snippets of code, writing a larger piece of code always ended up in a headache.

It looks that the time has come to overcome my fear. During last months I’ve been reading a lot of assembly generated by GHC and I even made some attempts at writing assembly code by myself (well, using intrinsics, but I guess that counts). But between Haskell source code and the generated executable there are many intermediate steps. From my observations it seems that many Haskellers have basic knowledge of Core – GHC’s intermediate language. Most have also heard about other two intermediate representations used by GHC – STG and Cmm – but it seems that few people know them, unless they hack the compiler. And since I’m hacking the compiler I should probably have more knowledge about these two representations, right?

There’s a classic paper by Simon Peyton-Jones “Implementing lazy functional languages on stock hardware: the Spineless Tagless G-machine”. It is quite long – 87 pages total – and, being published in 1992, it is mostly out of date. These two things kept me from reading it, although I think that being out of date was only a pretext for me to avoid reading almost 90 pages of text. But, since I need to learn about STG, I finally decided to give it a shot. Reading the paper took my four days. Paper is very well written and in general is an easy read. I was afraid that I might not understand formal description of operational semantics of STG, but it turned out to be well explained so I had no problem with that. The major problem turned out to be the amount of knowledge I had to learn while reading. This resulted in problems with fully understanding last sections of the paper. Not because they are more difficult than the initial ones, but because I didn’t fully remember all the details that were discussed earlier. An important question is which information is not up to date. I’m not yet familiar with the existing implementation, but it seems that many things have changed: the Spineless Tagless G-machine is not tagless any more since the introduction of pointer tagging; curried function are now evaluated using eval/apply convention, while the paper describes push/enter; the paper discusses only compilation to C, while currently C back-end is becoming deprecated in favour of native code generator and LLVM; and finally the layout of closures is now slightly different than the one presented in the paper. I am almost certain that garbage collection is also performed differently. These are the differences that I noticed, which means that really a lot has changed since the publication over 20 years ago. Surprisingly, this doesn’t seem like a big problem, because the most important thing is that the paper presents an idea of how STG works, while the mentioned changes are only not so important details.

So, now that I have a basic idea of how STG works, what comes next? There are a few follow up papers:

  • “The STG runtime system (revised)” – an updated description of STG written in 1999 by Simon Peyton Jones and Simon Marlow. I guess it’s also outdated, but still probably worth reading. It has only 65 pages :)
  • “Making a Fast Curry. Push-Enter vs. Eval-Apply for Higher-order Languages” – this described the mentioned eval/apply and push/enter strategies. Already read this one.
  • “Faster Laziness Using Dynamic Pointer Tagging” – this will tell you why STG is not tagless. Read this one also.

And once I’ll deal with STG I’ll have to learn about Cmm.

  1. In case you’re interested, the other one is Io []

Staypressed theme by Themocracy