Strange benchmarking results for FFI bindings

It looks like I am getting pretty good at getting hit by Haskell bugs. My previous post described behaviour that turned out to be a bug in GHC (thanks to Joachim Breitner for pointing this out). Now I have found problems with benchmarking FFI bindings using the method I described a month ago.

I work on a project in which the same algorithm is implemented using different data structures – one implementation is done in C, another uses the Vector library and yet another uses Repa. Everything is benchmarked with Criterion, and the C implementation is the fastest one (look at the first value after "mean" – this is the mean time of running the function):

benchmarking DWT/C1
mean: 87.26403 us, lb 86.50825 us, ub 90.05830 us, ci 0.950
std dev: 6.501161 us, lb 1.597160 us, ub 14.81257 us, ci 0.950
 
benchmarking DWT/Vector1
mean: 209.4814 us, lb 208.8169 us, ub 210.5628 us, ci 0.950
std dev: 4.270757 us, lb 2.978532 us, ub 6.790762 us, ci 0.950

The algorithm is built from a simpler lattice function that is applied a couple of times. I wrote benchmarks that measure the time needed by a single invocation of lattice:

benchmarking C1/Lattice Seq
mean: 58.36111 us, lb 58.14981 us, ub 58.65387 us, ci 0.950
std dev: 1.260742 us, lb 978.6512 ns, ub 1.617153 us, ci 0.950
 
benchmarking Vector1/Lattice Seq
mean: 34.97816 us, lb 34.87454 us, ub 35.14377 us, ci 0.950
std dev: 661.5554 ns, lb 455.7412 ns, ub 1.013466 us, ci 0.950

Hey, what’s this!? The Vector implementation is suddenly faster than C? That’s not possible, given that DWT in C is faster than DWT using Vector. After some investigation it turned out that the first C benchmark runs correctly, while subsequent benchmarks of C functions take a performance hit. I managed to create a simple program that demonstrates the problem in as few lines as possible. I implemented a copy function in C that takes an array and copies it to another array. Here’s copy.c:

#include <stdlib.h>
#include "copy.h"
 
double* c_copy( double* inArr, int arrLen ) {
  double* outArr = malloc( arrLen * sizeof( double ) );
 
  for ( int i = 0; i < arrLen; i++ ) {
    outArr[ i ] = inArr[ i ];
  }
 
  return outArr;
}

and copy.h:

#ifndef _COPY_H_
#define _COPY_H_
 
double* c_copy( double*, int );
 
#endif

I wrote a simple binding for that function and benchmarked it multiple times in a row:

module Main where
 
import Criterion.Main
import Data.Vector.Storable hiding (copy)
import Control.Monad (liftM)
import Foreign hiding (unsafePerformIO)
import Foreign.C
import System.IO.Unsafe (unsafePerformIO)
 
foreign import ccall unsafe "copy.h"
  c_copy :: Ptr CDouble -> CInt -> IO (Ptr CDouble)
 
signal :: Vector Double
signal = fromList [1.0 .. 16384.0]
 
copy :: Vector Double -> Vector Double
copy sig = unsafePerformIO $ do
    let (fpSig, _, lenSig) = unsafeToForeignPtr sig
    pLattice <- liftM castPtr $ withForeignPtr fpSig $ \ptrSig ->
                c_copy (castPtr ptrSig) (fromIntegral lenSig)
    fpLattice <- newForeignPtr finalizerFree pLattice
    return $ unsafeFromForeignPtr0 fpLattice lenSig
 
 
main :: IO ()
main = defaultMain [
         bgroup "FFI" [
           bench "C binding" $ whnf copy signal
         , bench "C binding" $ whnf copy signal
         , bench "C binding" $ whnf copy signal
         , bench "C binding" $ whnf copy signal
         , bench "C binding" $ whnf copy signal
         , bench "C binding" $ whnf copy signal
         , bench "C binding" $ whnf copy signal
         , bench "C binding" $ whnf copy signal
         , bench "C binding" $ whnf copy signal
         ]
       ]

Compiling and running this benchmark with:

$ ghc -O2 -Wall -optc -std=c99 ffi_crit.hs copy.c
$ ./ffi_crit -g

gave me these results:

benchmarking FFI/C binding
mean: 17.44777 us, lb 16.82549 us, ub 19.84387 us, ci 0.950
std dev: 5.627304 us, lb 968.1911 ns, ub 13.18222 us, ci 0.950
 
benchmarking FFI/C binding
mean: 45.46269 us, lb 45.17545 us, ub 46.01435 us, ci 0.950
std dev: 1.950915 us, lb 1.169448 us, ub 3.201935 us, ci 0.950
 
benchmarking FFI/C binding
mean: 45.79727 us, lb 45.55681 us, ub 46.26911 us, ci 0.950
std dev: 1.669191 us, lb 1.029116 us, ub 3.098384 us, ci 0.950

The first run takes about 17μs, later runs take about 45μs. I found this result repeatable across different runs, although in about 10-20% of runs all benchmarks – including the first one – took about 45μs. I obtained these results on GHC 7.4.1, openSUSE 64-bit Linux with a 2.6.37 kernel and an Intel Core i7 M 620 CPU. I posted this on Haskell-cafe and #haskell. Surprisingly, nobody could replicate the result! I was confused, so I gave it a try on my second machine: Debian Squeeze, 64-bit, GHC 7.4.2, 2.6.32 kernel, Intel Core 2 Duo T8300 CPU. At first the problem did not appear:

benchmarking FFI/C binding
mean: 107.3837 us, lb 107.2013 us, ub 107.5862 us, ci 0.950
std dev: 983.6046 ns, lb 822.6750 ns, ub 1.292724 us, ci 0.950
 
benchmarking FFI/C binding
mean: 108.1152 us, lb 107.9457 us, ub 108.3052 us, ci 0.950
std dev: 916.2469 ns, lb 793.1004 ns, ub 1.122127 us, ci 0.950

All benchmarks took about 107μs. Now watch what happens when I increase the size of the copied vector from 16K elements to 32K:

benchmarking FFI/C binding
mean: 38.50100 us, lb 36.71525 us, ub 46.87665 us, ci 0.950
std dev: 16.93131 us, lb 1.033678 us, ub 40.23900 us, ci 0.950
 
benchmarking FFI/C binding
mean: 209.9733 us, lb 209.5316 us, ub 210.4680 us, ci 0.950
std dev: 2.401398 us, lb 2.052981 us, ub 2.889688 us, ci 0.950

The first run is 2.5 times faster (!), while all the other runs are two times slower. While the latter could be expected, the former certainly is not.

So what exactly is going on? I tried analysing the eventlog of the program, but I wasn’t able to figure out the cause of the problem. I noticed that if I comment out the loop in the C function, so that it only allocates memory and returns an empty vector, then the problem disappears. Someone on Haskell-cafe suggested that these are cache effects, but I am sceptical about this explanation. If this is caused by the cache, then why did the first benchmark speed up when the size of the vector was increased? And why does this effect occur for 16K-element vectors on a machine with a 4MB cache, while a machine with a 3MB cache needs a vector twice as long for the problem to occur? So if anyone has a clue what causes this strange behaviour, please let me know. I would be happy to resolve it, since right now the results of my benchmarks are distorted (perhaps yours are too, only you didn’t notice).

Waiting for garbage collection can kill parallelism?

I am reposting my mail from Haskell-cafe, since I got no replies in over a week and I think it is an interesting case. I was reading “Parallel Performance Tuning for Haskell” by Jones, Marlow and Singh and wanted to replicate the results for their first case study. The code goes like this:

module Main where
 
import Control.Parallel
 
main :: IO ()
main = print . parSumFibEuler 38 $ 5300
 
fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n - 1) + fib (n - 2)
 
mkList :: Int -> [Int]
mkList n = [1..n-1]
 
relprime :: Int -> Int -> Bool
relprime x y = gcd x y == 1
 
euler :: Int -> Int
euler n = length (filter (relprime n) (mkList n))
 
sumEuler :: Int -> Int
sumEuler = sum . (map euler) . mkList
 
sumFibEuler :: Int -> Int -> Int
sumFibEuler a b = fib a + sumEuler b
 
parSumFibEuler :: Int -> Int -> Int
parSumFibEuler a b = f `par` (e `pseq` (e + f))
    where f = fib a
          e = sumEuler b

In the paper the authors show that this code performs the computations of fib and sumEuler in parallel and that a good speed-up is achieved:

To make the parallelism more robust, we need to be explicit about the evaluation order we intend. The way to do this is to use pseq in combination with par, the idea being to ensure that the main thread works on sumEuler while the sparked thread works on fib. (…) This version does not make any assumptions about the evaluation order of +, but relies only on the evaluation order of pseq, which is guaranteed to be stable.

These results were obtained on an older GHC version1. However, compiling the program with:

$ ghc -O2 -Wall -threaded -rtsopts -fforce-recomp -eventlog parallel.hs

and then running on GHC 7.4.1 using:

$ ./parallel +RTS -qa -g1 -s -ls -N2

yields a completely different result. These are statistics for a parallel run on two cores:

SPARKS: 1 (1 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
 
INIT    time    0.00s  (  0.00s elapsed)
MUT     time    2.52s  (  2.51s elapsed)
GC      time    0.03s  (  0.05s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time    2.55s  (  2.56s elapsed)

Running the same code on one core results in:

SPARKS: 1 (0 converted, 0 overflowed, 0 dud, 1 GC'd, 0 fizzled)
 
INIT    time    0.00s  (  0.00s elapsed)
MUT     time    2.51s  (  2.53s elapsed)
GC      time    0.03s  (  0.05s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time    2.55s  (  2.58s elapsed)

Looking at MUT (mutator time), it looks like there is no speed-up at all. Investigating the eventlog using ThreadScope sheds some light on the execution of the program:

Both threads start the computation, but HEC 1 soon blocks and only resumes when HEC 0 finishes its computation. Zooming in, it looks like HEC 1 stops because it requests garbage collection, but HEC 0 does not respond to that request, so GC begins only when HEC 0 is done with its computation:

Why does this happen? I am no expert on GHC’s garbage collection; my only knowledge of it comes from section 6 of “Runtime Support for Multicore Haskell”. If I understood correctly, this should not happen – it certainly didn’t happen when the paper was published. Do we have a regression, or am I misunderstanding something?

  1. The paper does not mention which version exactly. I believe it was 6.10, since “Runtime Support for Multicore Haskell” by the same authors, released at the same time, uses GHC 6.10

How to shoot yourself in the foot with Haskell

Haskell is “advertised” as a safe language that does all type checking upfront, making sure that you don’t experience runtime type errors, null pointers and all that kind of stuff. It also gives you ways to bypass some of the safety mechanisms, so a conscious programmer can use unsafe functions to get a boost in performance (e.g. by not performing bounds checking when indexing a vector).

I’ve written some very ugly Haskell code that creates a vector using destructive updates. It is in fact an imperative algorithm, not a functional one. When the initialization is over, the vector is frozen using unsafeFreeze. I wrote my code using the read and write functions, tested it using QuickCheck, and when the tests passed I switched to unsafeRead and unsafeWrite to make my program faster. Some time later I started getting random segfaults when running my tests. This had never happened before in any of my Haskell programs, so I almost panicked. At first I didn’t have the slightest idea how to even approach this problem. I suspected that this might even be a bug in GHC. Then I started disabling groups of tests, trying to track down the bug, and finally managed to locate a single test that was causing the problem. Guess what – it was the test of my vector initialization with the unsafe functions. What happened is that after switching to unsafeRead/unsafeWrite I refactored the function and its test. I made a mistake in the testing code and passed incorrect data, which resulted in an attempt to write an element at index -1. Finding this bug took me a little over an hour. A factor that made debugging harder was that disabling the tests that seemed to cause the segfault resulted in problems appearing in a completely different part of the program – or so I thought, looking at the output of my program. Looks like I completely forgot about lazy evaluation and unspecified evaluation order!

The second bug I encountered was even trickier. I wrote functions that perform cyclic shifts of a signal by any value. For example, shifting [1,2,3,4] left by 1 yields [2,3,4,1]. Note that shifting by 5, 9, 13 and so on gives exactly the same result – the shift function is periodic. You might recall that I used shift functions to demonstrate code testing. This time, however, I wrote the shifts using the Repa library. Then I created a QuickCheck property stating that shifting any signal left and then right by the same value yields the original signal. This is a pretty obvious property, and the tests passed without any problems. Later, when writing some other tests, I ended up with one of them failing. Using ghci I checked that this error should not be happening at all, but after careful debugging it turned out that during the actual tests some of the values in the results became zeros. After two hours of debugging I realized that the actual bug was in the shifting functions – they worked only for the basic period, that is, for shift values from 0 to the signal length. Why didn’t QuickCheck manage to falsify my property of left/right shift composition? Repa is a smart (and tricky) library that attempts to fuse many operations on an array into one. And it fused the application of a left shift followed by a right shift into the identity transform! Well, this is great. After all, this is the kind of optimization we would like to have in our programs. But it turns out that it can also affect tests! After realizing what was going on it was actually a matter of 5 minutes to fix the bug, but finding it was not a trivial task.
