## Haskell as fast as C: A case study

Here is my simple piece of code:

sumSqrL :: [Int] -> Int sumSqrL = sum . map (^2) . filter odd

It takes a list of Ints, removes all even numbers from it, squares the remaining odd numbers and computes the sum. This is idiomatic Haskell code: it uses built-in list processing functions from the standard Prelude and relies on function composition to get code that is both readable and modular. So how can we make that faster? The simplest thing to do is to switch to a more efficient data structure, namely an unboxed Vector:

import Data.Vector.Unboxed as U   sumSqrV :: U.Vector Int -> Int sumSqrV = U.sum . U.map (^2) . U.filter odd

The code practically does not change, except for the type signature and namespace prefix to avoid clashing with the names from Prelude. As you will see in a moment this code is approximately three times faster than the one working on lists.

Can we do better than that? Yes, we can. The code below is three times faster than the one using Vector, but there is a price to pay. We need to sacrifice modularity and elegance of the code:

sumSqrPOp :: U.Vector Int -> Int sumSqrPOp vec = runST $do let add a x = do let !(I# v#) = x odd# = v# andI# 1# return$ a + I# (odd# *# v# *# v#)   foldM add 0 vec -- replace  with ' here

This code works on an unboxed vector. The add function, used to fold the vector, takes an accumulator a (initiated to 0 in the call to foldM') and an element of the vector. To check parity of the element the function unboxes it and zeros all its bits except the least significant one. If the vector element is even then odd# will contain 0, if the element is odd then odd# will contain 1. By multiplying square of the vector element by odd# we avoid a conditional branch instruction at the expense of possibly performing unnecessary multiplication and addition for even elements.

Let’s see how these functions compile into Core intermediate language. The sumSqrV looks like this:

$wa = \vec > case vec of _ { Vector vecAddressBase vecLength vecData -> letrec { workerLoop = \index acc -> case >=# index vecLength of _ { False -> case indexIntArray# vecData (+# vecAddressBase index) of element { __DEFAULT -> case remInt# element 2 of _ { __DEFAULT -> workerLoop (+# index 1) (+# acc (*# element element)); 0 -> workerLoop (+# index 1) acc } }; True -> acc }; } in workerLoop 0 0 } while sumSqrPOp compiles to: $wsumSqrPrimOp =   \ vec ->     runSTRep       ( (\ @ s_X1rU ->           case vec of _ { Vector vecAddressBase vecLength vecData ->           (\ w1_s37C ->              letrec {                workerLoop =                  \ state index acc ->                    case >=# index vecLength of _ {                      False ->                        case indexIntArray# vecData (+# vecAddressBase index)                        of element { __DEFAULT ->                        workerLoop                          state                          (+# index 1)                          (+# acc (*# (*# (andI# element 1) element) element))                        };                      True -> (# state, I# acc #)                    }; } in              workerLoop w1_s37C 0 0)           })        )

I cleaned up the code a bit to make it easier to read. In the second version there is some noise from the ST monad, but aside from that both pieces of code are very similar. They differ in how the worker loop is called inside the most nested case expression. First version does a conditional call of one of the two possible calls to workerLoop, whereas the second version does an unconditional call. This may seem not much, but it turns out that this makes the difference between the code that is comparable in performance with C and code that is three times slower.

Let’s take a look at the assembly generated by the LLVM backend. The main loop of sumSqrV compiles to:

LBB1_4:     imulq    %rdx, %rdx     addq    %rdx, %rbx .LBB1_1:     leaq    (%r8,%rsi), %rdx     leaq    (%rcx,%rdx,8), %rdi     .align  16, 0x90 .LBB1_2:     cmpq    %rax, %rsi     jge     .LBB1_5     incq    %rsi     movq    (%rdi), %rdx     addq    $8, %rdi testb$1, %dl     je     .LBB1_2     jmp     .LBB1_4

While the main loop of sumSqrPOp compiles to:

.LBB0_4:     movq    (%rsi), %rbx     movq    %rbx, %rax     imulq   %rax, %rax     andq    $1, %rbx negq %rbx andq %rax, %rbx addq %rbx, %rcx addq$8, %rsi     decq    %rdi     jne     .LBB0_4

No need to be an assembly expert to see that the second version is much more dense.

I promised you comparison with C. Here’s the code:

long int c_sumSqrC( long int* xs, long int xn ) {   long int index   = 0;   long int result  = 0;   long int element = 0;  Loop:   if (index == xn) goto Return;   element = xs[index];   index++;   if ((0x1L & element) == 0) goto Loop;   result += element * element;   goto Loop;  Return:   return result; }

You’re probably wondering why the hell did I use gotos. The reason is that the whole idea of this sum-square-of-odds function was taken from the paper “Automatic transformation of series expressions into loops” by Richard Waters and I intended to closely mimic the solution produced by his fusion framework.

I used criterion to compare the performance of four presented implementations: based on list, base on vector, based on vector using foldM+primops and C. I used FFI to call C implementation from Haskell so that I can benchmark it with criterion as well. Here are the results for a list/vector containing one million elements: C version is still faster than the one based on primops by about 8%. I think this is a very good achievement given that the version based on Vector library is three times slower.

# A few words of summary

The vector library uses stream fusion under the hood to optimize the code working on vectors. In the blog posts I mentioned in the beginning Don Stewart talks a bit about stream fusion, but if you want to learn more you’ll probably be interested in two papers: Stream Fusion. From Lists to Streams to Nothing at All and  Haskell Beats C Using Generalized Stream Fusion. My sumSqrPOp function, although as fast as C, is in fact pretty ugly and I wouldn’t recommend anyone to write Haskell code in such a way. You might have realized that while efficiency of sumSqrPOp comes from avoiding the conditional instruction within the loop, the C version does in fact use the conditional instruction within the loop to determine the parity of the vector element. The interesting thing is that this conditional is eliminated by gcc during the compilation.

As you can see it might be possible to write Haskell code that is as fast as C. The bad thing is that to get efficient code you might be forced to sacrifice the elegance and abstraction of functional programming. I hope that one day Haskell will have a fusion framework capable of doing more optimisations than the frameworks existing today and that we will be able to have both the elegance of code and high performance. After all, if gcc is able to get rid of unnecessary conditional instructions then it should be possible to make GHC do the same.

# A short appendix

To dump Core produced by GHC use -ddump-simpl flag during compilation. I also recommend using -dsuppress-all flag, which suppresses all information about types – this makes the Core much easier to read.

To dump the assembly produced by GHC use -ddump-asm flag. When compiling with LLVM backend you need to use -keep-s-files flag instead.

To disassemble compiled object files (e.g. compiled C files) use the objdump -d command.

# Update – discussion on Reddit

There was some discussion about this post on reddit and I’d like to address some of the objections that were raised there and in the comments below.

Mikhail Glushenkov pointed out that the following Haskell code produces the same result as my sumSqrPOp function:

sumSqrB :: U.Vector Int -> Int sumSqrB = U.sum . U.map (\x -> (x .&. 1) * x * x)

I admit I didn’t notice this simple solution and could have come with a better example were such a solution would not be possible.

There was a request to compare performance with idiomatic C code, because the C code I have shown clearly is not idiomatic. So here’s the most idiomatic C code I can come up with (not necessarily the fastest one):

long int c_sumSqrC( long int* xs, long int xn ) { long int result = 0; long int i = 0; long int e; for ( ; i < xn; i++ ) { e = xs[ i ]; if ( e % 2 != 0 ) { result += e * e; } } return result; }

The performance turns out to be the same as before (“Bits” represents Mikhail Glushenkov’s solution, “C” now represents the new C code): There was a suggestion to use the following C code:

for(int i = 0; i < xn; i++) { result += xs[i] * xs[i] * (xs[i] & 1); }

Author claims that this code is faster than the version I proposed, but I cannot confirm that on my machine – I get results that are noticeably slower (2.7ms vs 1.7ms for vectors of 1 million elements). Perhaps this comes from me using GCC 4.5, while the latest available version is 4.8.

Finally, there were questions about overhead added by calling C code via FFI. I was concerned with this also when I first wanted to benchmark my C code via FFI. After making some experiments it turned out that this overhead is so small that it can be ignored. For more information see this post.

## Strange benchmarking results for FFI bindings

It looks like I am getting pretty good at getting hit by Haskell bugs. My previous post described behaviour that turned out to be a bug in GHC (thanks to Joachim Breitner for pointing this out). Now I found problems with benchmarking FFI bindings using method described a month ago.

I work on a project in which the same algorithm is implemented using different data structures – one implementation is done in C, another using Vector library and yet another using Repa. Everything is benchmarked with Criterion and C implementation is the fastest one (look at first value after mean – this is mean time of running a function):

benchmarking DWT/C1 mean: 87.26403 us, lb 86.50825 us, ub 90.05830 us, ci 0.950 std dev: 6.501161 us, lb 1.597160 us, ub 14.81257 us, ci 0.950   benchmarking DWT/Vector1 mean: 209.4814 us, lb 208.8169 us, ub 210.5628 us, ci 0.950 std dev: 4.270757 us, lb 2.978532 us, ub 6.790762 us, ci 0.950

This algorithm uses a simpler lattice function that is repeated a couple of times. I wrote benchmarks that measure time needed by a single invocation of lattice:

benchmarking C1/Lattice Seq mean: 58.36111 us, lb 58.14981 us, ub 58.65387 us, ci 0.950 std dev: 1.260742 us, lb 978.6512 ns, ub 1.617153 us, ci 0.950   benchmarking Vector1/Lattice Seq mean: 34.97816 us, lb 34.87454 us, ub 35.14377 us, ci 0.950 std dev: 661.5554 ns, lb 455.7412 ns, ub 1.013466 us, ci 0.950

Hey, what’s this!? Vector implementation is suddenly faster than C? Not possible given that DWT in C is faster than DWT using Vector. After some investigation it turned out that the first C benchmark runs correctly while subsequent benchmarks of C functions take performance hit. I managed to create a simple code that demonstrates the problem in as few lines as possible. I implemented a copy function in C that takes an array and copies it to another array. Here’s copy.c:

#include #include "copy.h"   double* c_copy( double* inArr, int arrLen ) { double* outArr = malloc( arrLen * sizeof( double ) );   for ( int i = 0; i < arrLen; i++ ) { outArr[ i ] = inArr[ i ]; }   return outArr; }

and copy.h:

#ifndef _COPY_H_ #define _COPY_H_   double* c_copy( double*, int );   #endif

I wrote a simple binding for that function and benchmarked it multiple times in a row:

module Main where   import Criterion.Main import Data.Vector.Storable hiding (copy) import Control.Monad (liftM) import Foreign hiding (unsafePerformIO) import Foreign.C import System.IO.Unsafe (unsafePerformIO)   foreign import ccall unsafe "copy.h" c_copy :: Ptr CDouble -> CInt -> IO (Ptr CDouble)   signal :: Vector Double signal = fromList [1.0 .. 16384.0]   copy :: Vector Double -> Vector Double copy sig = unsafePerformIO $do let (fpSig, _, lenSig) = unsafeToForeignPtr sig pLattice <- liftM castPtr$ withForeignPtr fpSig $\ptrSig -> c_copy (castPtr ptrSig) (fromIntegral lenSig) fpLattice <- newForeignPtr finalizerFree pLattice return$ unsafeFromForeignPtr0 fpLattice lenSig     main :: IO () main = defaultMain [ bgroup "FFI" [ bench "C binding" $whnf copy signal , bench "C binding"$ whnf copy signal , bench "C binding" $whnf copy signal , bench "C binding"$ whnf copy signal , bench "C binding" $whnf copy signal , bench "C binding"$ whnf copy signal , bench "C binding" $whnf copy signal , bench "C binding"$ whnf copy signal , bench "C binding" $whnf copy signal ] ] Compiling and running this benchmark with: $ ghc -O2 -Wall -optc -std=c99 ffi_crit.hs copy.c $./ffi_crit -g gave me this results: benchmarking FFI/C binding mean: 17.44777 us, lb 16.82549 us, ub 19.84387 us, ci 0.950 std dev: 5.627304 us, lb 968.1911 ns, ub 13.18222 us, ci 0.950 benchmarking FFI/C binding mean: 45.46269 us, lb 45.17545 us, ub 46.01435 us, ci 0.950 std dev: 1.950915 us, lb 1.169448 us, ub 3.201935 us, ci 0.950 benchmarking FFI/C binding mean: 45.79727 us, lb 45.55681 us, ub 46.26911 us, ci 0.950 std dev: 1.669191 us, lb 1.029116 us, ub 3.098384 us, ci 0.950 The first run takes about 17?s, later runs take about 45?s. I found this result repeatable across different runs, although in about 10-20% of runs all benchmarks – including the first one – took about 45?s. I obtained this results on GHC 7.4.1, openSUSE 64-bit linux with 2.6.37 kernel, Intel Core i7 M 620 CPU. I posted this on Haskell-cafe and #haskell. Surprisingly nobody could replicate the result! I was confused so I gave it a try on my second machine: Debian Squeeze, 64-bit, GHC 7.4.2, 2.6.32 kernel, Intel Core 2 Due T8300 CPU. At first the problem did not appear: benchmarking FFI/C binding mean: 107.3837 us, lb 107.2013 us, ub 107.5862 us, ci 0.950 std dev: 983.6046 ns, lb 822.6750 ns, ub 1.292724 us, ci 0.950 benchmarking FFI/C binding mean: 108.1152 us, lb 107.9457 us, ub 108.3052 us, ci 0.950 std dev: 916.2469 ns, lb 793.1004 ns, ub 1.122127 us, ci 0.950 All benchmarks took about 107?s. Now watch what happens when I increase size of the copied vector from 16K elements to 32K: benchmarking FFI/C binding mean: 38.50100 us, lb 36.71525 us, ub 46.87665 us, ci 0.950 std dev: 16.93131 us, lb 1.033678 us, ub 40.23900 us, ci 0.950 benchmarking FFI/C binding mean: 209.9733 us, lb 209.5316 us, ub 210.4680 us, ci 0.950 std dev: 2.401398 us, lb 2.052981 us, ub 2.889688 us, ci 0.950 This first run is 2.5 time faster (!), while all other runs are two times slower. While the latter could be expected, the former certainly is not. So what exactly is going on? I tried analysing eventlog of the program but I wasn’t able to figure out the cause of the problem. I noticed that if I comment out the loop in C function so that it only allocates memory and returns an empty vector then the problem disappears. Someone on Haskell-cafe suggested that these are cache effects, but I am sceptical about this explanation. If this is caused by cache then why did the first benchmark sped up when size of the vector was increased? And why does this effect occur for 16K length vectors on a machine with 4MB cache, while machine with 3MB cache needs twice longer vector for the problem to occur? So if anyone has a clue what causes this strange behaviour please let me know. I would be happy to resolve that since now result of my benchmarks are distorted (perhaps yours are too only you didn’t notice). ## Benchmarking C functions using Foreign Function Interface I am currently working on implementing Discrete Wavelet Transform (DWT) in Haskell. I want to make use of Haskell’s parallel programing capabilities to implement an algorithm that can take advantage of multiple CPU cores. My previous posts on testing and benchmarking were by-products of this project, as I needed to ensure reliability of my implementation and to measure its performance. The key question that is in my head all the time is “can I write Haskell code that outperforms C when given more CPU cores?”. To answer this question I needed a way to benchmark performance of algorithm written in C and I must admit that this problem was giving me a real headache. One obvious solution was to implement the algorithm in C and measure its running time. This didn’t seem acceptable. I use Criterion for benchmarking and it does lots of fancy stuff like measuring clock resolution and calculating kernel density estimation. So unless I implemented this features in C (read: re-implement the whole library) the results of measurements would not be comparable. Luckily for me there is a better solution: Foreign Function Interface (FFI). This is an extension of Haskell 98 standard – and part of Haskell 2010 – that allows to call functions written in C1. This means that I could write my function in C, wrap it in a pure Haskell function and benchmark that wrapper with Criterion. The results would be comparable with Haskell implementation, but I was afraid that overheads related to data copying would affect the performance measurements. As it turned out I was wrong. I started with chapter 17 of Real World Haskell. It presents a real world example – I guess that title of the book is very adequate – of creating bindings for an already existing library. Sadly, after reading it I felt very confused. I had a general idea of what should be done but I didn’t understand many of the details. I had serious doubts about proper usage of Ptr and ForeignPtr data types and these are in fact very important when working with FFI. Someone on #haskell advised me to read the official specification of FFI and this was a spot-on. This is actually one of the few official specifications that are a real pleasure to read (if you read R5RS then you know what I mean). It is concise (30 pages) and provides a comprehensive overview of all data types and functions used for making foreign calls. After reading the specification it was rather straightforward to write my own bindings to C. Here’s a prototype of called C function, located in dwt.h header file: double* c_dwt(double* ls, int ln, double* xs, int xn); The corresponding dwt.c source file contains: double* c_dwt( double* ls, int ln, double* xs, int xn ) { double* ds = malloc( xn * sizeof( double ) ); // fill ds array with result return ds; } The important thing is that C function mallocates new memory which we will later manage using Haskell’s garbage collector. Haskell binding for such a function looks like this: foreign import ccall unsafe "dwt.h" c_dwt :: Ptr CDouble -> CInt -> Ptr CDouble -> CInt -> IO (Ptr CDouble) Here’s what it does: ccall denotes C calling convention, unsafe improves performance of the call at the cost of safety2 and "dwt.h" points to a header file. Finally, I define the name of the function and it’s type. This name is the same as the name of original C function, but if it were different I would have to specify name of C function in the string that specifies name of the header file. You probably already noticed that type int from C is represented by CInt in Haskell and double by CDouble. You can convert between Int and CInt with fromIntegral and between Double and CDouble with realToFrac. Pointers from C became Ptr, so double* from C is represented as Ptr Double in Haskell binding. What might be surprising about this type signature is that the result is in the IO monad, that is our function from C is denoted as impure. The reason for this is that every time we run c_dwt function a different memory address will be allocated by malloc, so indeed the function will return different results given the same input. In my function however the array addressed by that pointer will always contain exactly the same values (for the same input data), so in fact my function is pure. The problem is that Haskell doesn’t know that and we will have to fix that problem using the infamous unsafePerformIO. For that we have to create a wrapper function that has pure interface: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 import Control.Monad (liftM) import Data.Vector.Storable import Foreign hiding (unsafePerformIO) import Foreign.C import System.IO.Unsafe dwt :: Vector Double -> Vector Double -> Vector Double dwt ls sig = unsafePerformIO$ do     let (fpLs , _, lenLs ) = unsafeToForeignPtr ls         (fpSig, _, lenSig) = unsafeToForeignPtr sig     pDwt <- liftM castPtr $withForeignPtr fpLs$ \ptrLs ->             withForeignPtr fpSig $\ptrSig -> c_dwt (castPtr ptrLs ) (fromIntegral lenLs ) (castPtr ptrSig) (fromIntegral lenSig) fpDwt <- newForeignPtr finalizerFree pDwt return$ unsafeFromForeignPtr0 fpDwt lenSig

Our wrapper function takes two Vectors as input and returns a new Vector. To interface with C we need to use storable vectors, which store data that can be written to raw memory (that’s what the C function is doing). I wasn’t able to figure out what is the difference between storable and unboxed vectors. It seems that both store primitive values in continuous memory block and therefore both offer similar performance (assumed, not verified). First thing to do is to get ForeignPtrs out of input vectors. ForeignPtr is a Ptr with a finalizer attached. Finalizer is a function called when the object is no longer in use and needs to be garbage collected. In this case we need a function that will free memory allocated with malloc. This is a common task, so FFI implementation already provides a finalizerFree function for that. The actual call to foreign function is made on lines 11-14. We can operate on Ptr values stored in ForeignPtr using withForeignPtr function. However, since we have vectors of Doubles as input, we also have Ptr Double, not Ptr CDouble that c_dwt function expects. There are two possible solutions to that problem. One would be to copy memory, converting every value in a vector using realToFrac. I did not try that assuming this would kill performance. Instead I used castPtr which casts pointer of one type to a pointer of another type. This is potentially dangerous and relies on the fact that Double and CDouble have the same internal structure. This is in fact expected, but by no means it is guaranteed by any specification! I wouldn’t be surprised it that didn’t work on some sort of exotic hardware architecture. Anyway, I written tests to make sure that this cast does work the way I want it to. This little trick allows to avoid copying the input data. The output pointer has to be cast from Ptr CDouble to Ptr Double and since the result is in the IO monad the castPtr has to be lifted with liftM. After getting the result as Ptr Double we wrap it in a ForeignPtr with a memory-freeing finalizer (line 15) and use that foreign pointer to construct the resulting vector of Doubles.

# Summary

I had two concerns when writing this binding. First was the possible performance overhead. Thanks to using pointer casts it was possible to avoid any sort of data copying and that makes this binding real quick. Measuring execution time with criterion shows that calling C function that does only memory allocation (as shown in this post) takes about 250?s. After adding the rest of C code that actually does computation the execution time jumps to about 55ms, so the FFI calling overhead does not skew the performance tests. Big thanks go to Mikhail Glushenkov who convinced me with his answer on StackOverflow to use FFI. My second concern was the necessity to use many functions with the word “unsafe”, especially the unsafePerformIO. I googled a bit and it seems that this is a normal thing when working with FFI and I guess there is no reason to worry, provided that the binding is thoroughly tested. So in the end I am very happy with the result. It is fast, Haskell manages garbage collection of memory allocated with C and most importantly I can benchmark C code using Criterion.

1. Specification mentions also the calling conventions for other languages and platforms (Java VM, .Net and C++) but I think that currently there is no implementation of these. []
2. Calls need to be safe only when called C code calls Haskell code, which I think is rare []

Staypressed theme by Themocracy