## Benchmarking GHC HEAD with Criterion

So you’re developing GHC. You make some changes that affect the performance of compiled programs, but how do you check whether the performance has really improved? Well, if you’re making some general optimisation – a new Core-to-Core transformation, perhaps – then you can use the NoFib benchmark suite, which is a commonly accepted method of measuring GHC performance. But what if you’re developing some very specific optimisation that is unlikely to be benchmarked by NoFib? What if you extended the compiler in a way that allows you to write faster code that was previously impossible, so there is no way for NoFib to measure your improvements? Sounds like writing some criterion benchmarks would be a Good Thing.

There’s a problem though: installing criterion with GHC HEAD. Criterion has lots of dependencies, but you cannot install them automatically with cabal-install, because cabal-install usually doesn’t work with GHC HEAD (although the Cabal library is one of GHC’s boot libraries). On the other hand, installing dependencies manually is a pain. Besides, many libraries will not compile with GHC HEAD. So how do you write criterion benchmarks for HEAD? I faced this problem some time ago and found a solution which, although not perfect, works fine for me.

In principle my idea is nothing fancy:

1. download all the required dependencies from hackage to the disk and extract them in a single directory,
2. determine the order in which they need to be installed,
3. build each library with GHC HEAD, resolving the build errors if necessary,
4. register each library with GHC HEAD (see Appendix below).

Doing these things for the first time was very tedious and took me about 2-3 hours. Determining package dependencies was probably the most time-consuming part. Resolving build errors wasn’t that bad, though there were a couple of difficulties. It turned out that many packages put an upper bound on the version of the base package, and removing this dependency restriction is often the only change required to build the package.

The key to my solution is that once you figure out in what order packages should be installed and remove the build errors, you can write a shell script that builds and installs packages automatically. This means that after installing GHC HEAD in a sandbox (see Appendix below) you can run the script to build and install all the packages. This will give you a fully working GHC installation in which you can write Criterion benchmarks for new features that you implemented in the compiler. Here’s what the script looks like (full version available here):

```bash
#!/bin/bash

PKGS="\
 primitive-0.5.0.1 \
 vector-0.10.0.1 \
 dlist-0.5 \
 vector-algorithms-0.5.4.2 \
 ..." # more packages in this list

if [[ $# -gt 1 ]]; then
    echo "Too many parameters"
    exit
elif [[ $# -eq 1 ]]; then
    if [[ $1 == "clean" ]]; then
        echo -n "Cleaning"
        for i in $PKGS
        do
            echo -n "."
            cd $i
            rm -rf dist
            rm -f Setup Setup.o Setup.hi
            cd ..
        done
        echo "done"
    else
        echo "Invalid parameter: $1"
        exit
    fi
else
    for i in $PKGS
    do
        echo "Installing package $i"
        cd $i
        ( (if [[ -f Setup.lhs ]]; then ghc Setup.lhs; else ghc Setup.hs; fi) && \
            ./Setup configure --user --enable-shared \
            && ./Setup build && ./Setup install ) \
            || exit
        cd ..
    done
fi
```

The script is nothing elaborate. Running it without any parameters builds and installs all the packages on the list. Running it with the “clean” parameter removes build artefacts from the package directories. If for some reason the script fails – e.g. one of the libraries fails to build – you can comment out the already-installed libraries so that the script resumes from the point where it previously stopped.

# Summary

Using the approach described above I can finally write criterion benchmarks for GHC HEAD. There are a couple of considerations though:

- things are likely to break as HEAD gets updated. Be prepared to add new libraries as dependencies, change compilation parameters or fix new build errors,
- for some time now you need to pass the --enable-shared flag to cabal configure when building a shared library. This causes every library to be compiled twice. I don’t know if there’s anything one can do about that,
- you need to download new versions of libraries manually,
- fixing build errors manually may not be easy,
- rerunning the script when something fails may be tedious,
- changes in HEAD might cause performance problems in the libraries you are using. If this goes unnoticed, the benchmarking results might be invalid (I think this problem is hypothetical).
You can download my script and the source code for all the modified packages here. I’m not giving you any guarantee that it will work for you, since HEAD changes all the time. It’s also quite possible that you don’t need some of the libraries I’m using, for example Repa.

# Appendix: Sandboxing GHC

For the above method to work effectively you need a sandboxed installation of GHC. There are tools designed for sandboxing GHC (e.g. hsenv) but I use a method described here. It’s perfectly suited to my needs. I like to have full manual control when needed, but I also have this shell script to automate switching between sandboxes:

```bash
#!/bin/bash

SANDBOX_DIR="/path/to/ghc-sandbox/"
ACTIVE_SYMLINK="${SANDBOX_DIR}active"
STARTCOLOR="\e[32m"
ENDCOLOR="\e[0m"

active_link_name=$(readlink ${ACTIVE_SYMLINK})
active_name=$(basename ${active_link_name})

if [[ $# -lt 1 ]]; then
    for i in $(ls ${SANDBOX_DIR}); do
        if [[ $i != "active" ]]; then
            if [[ $i == $active_name ]]; then
                echo -e "* $STARTCOLOR$i$ENDCOLOR"
            else
                echo "  $i"
            fi
        fi
    done
    exit
fi

for i in $(ls ${SANDBOX_DIR}); do
    if [[ $i == $1 ]]; then
        cd $SANDBOX_DIR
        rm ${ACTIVE_SYMLINK}
        ln -s $1 ${ACTIVE_SYMLINK}
        exit
    fi
done

echo "Sandbox $1 not found"
```

It displays a list of sandboxes when run without any parameters (the active sandbox is displayed in green and marked with an asterisk) and switches the active sandbox when given a command-line parameter. Together with bash’s auto-completion, switching between different GHC versions is a matter of seconds.

## The Lambda Papers

I’m reading about functional programming as much as I can. Aside from staying up to date with the latest publications, I also spend some time reading older papers. Among the classics are true pearls: John Backus’ “Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs” took hours to read but was an eye-opener for me (I think it was the first paper on FP I ever read), and Phil Wadler’s “Comprehending monads” and “The essence of functional programming” are a beautiful demonstration of the power and elegance of monads. But today I want to write about The Lambda Papers – a series of approximately ten papers (mostly AI Lab Memos) written by Guy Steele and Gerald Sussman between 1975 and 1980. I originally heard about them while browsing Lambda The Ultimate (which, by the way, takes its name from these papers). The Lambda Papers revolve around the programming language Scheme, a simple dialect of Lisp developed by Steele and Sussman at MIT in the seventies.

So far I have read two of these papers: “Lambda: The ultimate imperative” and “Lambda: The ultimate declarative”. The first discusses the implementation in Scheme of programming constructs known from imperative languages – for example GOTOs, blocks, assignments, loops or exceptions. The authors demonstrate that lambda expressions, that is anonymous functions, suffice to implement all these imperative constructs. Moreover, they show that using lambdas we can easily implement different scoping strategies (dynamic vs. static) and calling conventions (call by name vs. call by value vs. call by need).

“Lambda: The ultimate declarative” continues the discussion of GOTO expressions started in “Lambda: The ultimate imperative”. The main focus is placed on the relation between GOTOs, function calls and operations that modify the environment (a function call is one of them; assignments are another). The paper provides some really deep insight into the mechanics of compiling function calls. For example, the authors demonstrate that in compiled Scheme code it is not necessary to make any function calls – GOTOs (unconditional jumps) are all that is needed. There’s also an interesting discussion of defining data types using closures, and of named variables vs. temporaries (unnamed variables generated by the compiler). The paper concludes with the thought that the expressive power of Lisp makes it a good candidate for an intermediate language used in compilers.

As I already mentioned, I consider these papers to be among the best I have read. I am truly impressed by the amount of knowledge squeezed into these 90 pages. Probably the most surprising thing is that, almost 40 years after their original publication, the Lambda Papers remain relevant. It is interesting to see that some ideas mentioned only briefly in these papers have since evolved into really important concepts. For example, in “Lambda: The ultimate declarative” Steele mentions the idea of annotating functions with information about possible side effects, which is what Haskell’s type system does to separate pure and impure code. I have also found it interesting to learn a bit about the history of computer science and to trace the origin of concepts like thunks and continuations.

That’s only an outline of the Lambda Papers. They provide a lot more insightful information, and I strongly encourage anyone interested in programming languages to read these two publications. They are not too difficult – if you know Scheme you’ll be able to grasp them.

## Haskell as fast as C: A case study

Once in a while someone, most likely new to Haskell, asks how Haskell performance compares with C. In fact, when I was beginning with Haskell, I asked exactly the same question. During the last couple of days I’ve been playing with squeezing some performance out of a very simple piece of Haskell code. It turned out that the results I got are comparable with C, so I thought I might share this. This will be a short case study, so I don’t intend to cover the whole subject of Haskell vs. C performance. A lot has been written on this already, so I encourage you to search through the Haskell-cafe archives as well as some other blogs.
Most of all I suggest reading this and this post on Don Stewart’s blog.

Here is my simple piece of code:

```haskell
sumSqrL :: [Int] -> Int
sumSqrL = sum . map (^2) . filter odd
```

It takes a list of Ints, removes all even numbers from it, squares the remaining odd numbers and computes the sum. This is idiomatic Haskell code: it uses built-in list-processing functions from the standard Prelude and relies on function composition to get code that is both readable and modular. So how can we make it faster? The simplest thing to do is to switch to a more efficient data structure, namely an unboxed Vector:

```haskell
import Data.Vector.Unboxed as U

sumSqrV :: U.Vector Int -> Int
sumSqrV = U.sum . U.map (^2) . U.filter odd
```

The code practically does not change, except for the type signature and the namespace prefix to avoid clashes with names from the Prelude. As you will see in a moment, this code is approximately three times faster than the one working on lists. Can we do better than that? Yes, we can. The code below is three times faster than the one using Vector, but there is a price to pay: we need to sacrifice the modularity and elegance of the code:

```haskell
{-# LANGUAGE MagicHash, BangPatterns #-}
import Control.Monad.ST ( runST )
import GHC.Exts

sumSqrPOp :: U.Vector Int -> Int
sumSqrPOp vec = runST $ do
  let add a x = do
        let !(I# v#) = x
            odd#     = v# `andI#` 1#
        return $ a + I# (odd# *# v# *# v#)
  U.foldM' add 0 vec
```

This code works on an unboxed vector. The add function, used to fold the vector, takes an accumulator a (initialised to 0 in the call to foldM') and an element of the vector. To check the parity of the element, the function unboxes it and zeroes all its bits except the least significant one. If the vector element is even then odd# will contain 0; if the element is odd then odd# will contain 1. By multiplying the square of the vector element by odd# we avoid a conditional branch instruction, at the expense of possibly performing an unnecessary multiplication and addition for even elements.
Let’s see how these functions compile to the Core intermediate language. sumSqrV looks like this:

```
$wa =
  \ vec ->
    case vec of _ { Vector vecAddressBase vecLength vecData ->
    letrec {
      workerLoop =
        \ index acc ->
          case >=# index vecLength of _ {
            False ->
              case indexIntArray# vecData (+# vecAddressBase index)
              of element { __DEFAULT ->
              case remInt# element 2 of _ {
                __DEFAULT ->
                  workerLoop (+# index 1) (+# acc (*# element element));
                0 -> workerLoop (+# index 1) acc
              }
              };
            True -> acc
          }; } in
    workerLoop 0 0
    }
```

while sumSqrPOp compiles to:

```
wsumSqrPrimOp =
  \ vec ->
    runSTRep
      (\ @ s_X1rU ->
        case vec of _ { Vector vecAddressBase vecLength vecData ->
        (\ w1_s37C ->
           letrec {
             workerLoop =
               \ state index acc ->
                 case >=# index vecLength of _ {
                   False ->
                     case indexIntArray# vecData (+# vecAddressBase index)
                     of element { __DEFAULT ->
                     workerLoop state (+# index 1)
                       (+# acc (*# (*# (andI# element 1) element) element))
                     };
                   True -> (# state, I# acc #)
                 }; } in
           workerLoop w1_s37C 0 0)
        })
```

I cleaned up the code a bit to make it easier to read. In the second version there is some noise from the ST monad, but aside from that both pieces of code are very similar. They differ in how the worker loop is called inside the innermost case expression: the first version chooses conditionally between two possible calls to workerLoop, whereas the second version makes an unconditional call. This may not seem like much, but it turns out to make the difference between code that is comparable in performance with C and code that is three times slower. Let’s take a look at the assembly generated by the LLVM backend. The main loop of sumSqrV compiles to:

```
.LBB1_4:
        imulq   %rdx, %rdx
        addq    %rdx, %rbx
.LBB1_1:
        leaq    (%r8,%rsi), %rdx
        leaq    (%rcx,%rdx,8), %rdi
        .align  16, 0x90
.LBB1_2:
        cmpq    %rax, %rsi
        jge     .LBB1_5
        incq    %rsi
        movq    (%rdi), %rdx
        addq    $8, %rdi
        testb   $1, %dl
        je      .LBB1_2
        jmp     .LBB1_4
```

While the main loop of sumSqrPOp compiles to:

```
.LBB0_4:
        movq    (%rsi), %rbx
        movq    %rbx, %rax
        imulq   %rax, %rax
        andq    $1, %rbx
        negq    %rbx
        andq    %rax, %rbx
        addq    %rbx, %rcx
        addq    $8, %rsi
        decq    %rdi
        jne     .LBB0_4
```

No need to be an assembly expert to see that the second version is much more dense.

I promised you a comparison with C. Here’s the code:

```c
long int c_sumSqrC( long int* xs, long int xn ) {
    long int index   = 0;
    long int result  = 0;
    long int element = 0;

 Loop:
    if (index == xn) goto Return;
    element = xs[index];
    index++;
    if ((0x1L & element) == 0) goto Loop;
    result += element * element;
    goto Loop;

 Return:
    return result;
}
```

You’re probably wondering why the hell I used gotos. The reason is that the whole idea of this sum-square-of-odds function was taken from the paper “Automatic transformation of series expressions into loops” by Richard Waters, and I intended to closely mimic the solution produced by his fusion framework.

I used criterion to compare the performance of the four implementations presented: based on lists, based on Vector, based on Vector using foldM' and primops, and C. I used the FFI to call the C implementation from Haskell so that I could benchmark it with criterion as well. Here are the results for a list/vector containing one million elements:

The C version is still faster than the one based on primops, by about 8%. I think this is a very good result, given that the version based on the Vector library is three times slower.

# A few words of summary

The vector library uses stream fusion under the hood to optimise code working on vectors. In the blog posts I mentioned at the beginning, Don Stewart talks a bit about stream fusion, but if you want to learn more you’ll probably be interested in two papers: “Stream Fusion. From Lists to Streams to Nothing at All” and “Haskell Beats C Using Generalized Stream Fusion”. My sumSqrPOp function, although as fast as C, is in fact pretty ugly, and I wouldn’t recommend that anyone write Haskell code this way. You might have noticed that while the efficiency of sumSqrPOp comes from avoiding the conditional instruction within the loop, the C version does in fact use a conditional instruction within the loop to determine the parity of the vector element. The interesting thing is that this conditional is eliminated by gcc during compilation.

As you can see, it is possible to write Haskell code that is as fast as C. The bad news is that to get efficient code you might be forced to sacrifice the elegance and abstraction of functional programming. I hope that one day Haskell will have a fusion framework capable of more optimisations than today’s frameworks, and that we will be able to have both elegant code and high performance. After all, if gcc is able to get rid of unnecessary conditional instructions, it should be possible to make GHC do the same.

# A short appendix

To dump the Core produced by GHC, use the -ddump-simpl flag during compilation. I also recommend the -dsuppress-all flag, which suppresses all information about types – this makes the Core much easier to read.

To dump the assembly produced by GHC use -ddump-asm flag. When compiling with LLVM backend you need to use -keep-s-files flag instead.

To disassemble compiled object files (e.g. compiled C files) use the objdump -d command.

# Update – discussion on Reddit

Mikhail Glushenkov pointed out that the following Haskell code produces the same result as my sumSqrPOp function:

```haskell
import Data.Bits ((.&.))

sumSqrB :: U.Vector Int -> Int
sumSqrB = U.sum . U.map (\x -> (x .&. 1) * x * x)
```

I admit I didn’t notice this simple solution, and could have come up with a better example in which such a solution would not be possible.

There was a request to compare performance with idiomatic C code, because the C code I showed is clearly not idiomatic. So here’s the most idiomatic C code I can come up with (not necessarily the fastest):

```c
long int c_sumSqrC( long int* xs, long int xn ) {
    long int result = 0;
    long int i      = 0;
    long int e;
    for ( ; i < xn; i++ ) {
        e = xs[ i ];
        if ( e % 2 != 0 ) {
            result += e * e;
        }
    }
    return result;
}
```

The performance turns out to be the same as before (“Bits” represents Mikhail Glushenkov’s solution, “C” now represents the new C code):

There was a suggestion to use the following C code:

```c
for (int i = 0; i < xn; i++) {
    result += xs[i] * xs[i] * (xs[i] & 1);
}
```

The author claims that this code is faster than the version I proposed, but I cannot confirm that on my machine – I get results that are noticeably slower (2.7 ms vs. 1.7 ms for vectors of one million elements). Perhaps this comes from me using GCC 4.5, while the latest available version is 4.8.

Finally, there were questions about the overhead added by calling C code via the FFI. I was concerned about this too when I first wanted to benchmark my C code via the FFI. After some experiments it turned out that this overhead is so small that it can be ignored. For more information see this post.
