## Waiting for garbage collection can kill parallelism?

I am reposting my mail from Haskell-cafe, since I got no replies in over a week and I think it is an interesting case. I was reading “Parallel Performance Tuning for Haskell” by Jones, Marlow and Singh and wanted to replicate the results for their first case study. The code goes like this:

module Main where import Control.Parallel main :: IO () main = print . parSumFibEuler 38 $ 5300 fib :: Int -> Int fib 0 = 0 fib 1 = 1 fib n = fib (n - 1) + fib (n - 2) mkList :: Int -> [Int] mkList n = [1..n-1] relprime :: Int -> Int -> Bool relprime x y = gcd x y == 1 euler :: Int -> Int euler n = length (filter (relprime n) (mkList n)) sumEuler :: Int -> Int sumEuler = sum . (map euler) . mkList sumFibEuler :: Int -> Int -> Int sumFibEuler a b = fib a + sumEuler b parSumFibEuler :: Int -> Int -> Int parSumFibEuler a b = f `par` (e `pseq` (e + f)) where f = fib a e = sumEuler b |

In the paper authors show that this code performs computation of `fib`

ans `sumEuler`

in parallel and that good speed-up is achieved:

To make the parallelism more robust, we need to be explicit about the evaluation order we intend. The way to do this is to use

`pseq`

in combination with`par`

, the idea being to ensure that the main thread works on`sumEuler`

while the sparked thread works on`fib`

. (…) This version does not make any assumptions about the evaluation order of`+`

, but relies only on the evaluation order of`pseq`

, which is guaranteed to be stable.

These results were obtained on older GHC version^{1}. However, compiling program with:

$ ghc -O2 -Wall -threaded -rtsopts -fforce-recomp -eventlog parallel.hs |

and then running on GHC 7.4.1 using:

$ ./parallel +RTS -qa -g1 -s -ls -N2 |

yields a completely different result. These are statistics for a parallel run on two cores:

SPARKS: 1 (1 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled) INIT time 0.00s ( 0.00s elapsed) MUT time 2.52s ( 2.51s elapsed) GC time 0.03s ( 0.05s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 2.55s ( 2.56s elapsed) |

Running the same code on one core results in:

SPARKS: 1 (0 converted, 0 overflowed, 0 dud, 1 GC'd, 0 fizzled) INIT time 0.00s ( 0.00s elapsed) MUT time 2.51s ( 2.53s elapsed) GC time 0.03s ( 0.05s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 2.55s ( 2.58s elapsed) |

Looking and `MUT`

(mutator time) it looks that there is no speed-up at all. Investigating eventlog using ThreadScope sheds some light on execution of a program:

Both threads start computation, but HEC 1 soon blocks and only resumes when HEC 0 finishes computation. Zooming in it looks that HEC 1 stops because it requests garbage collection, but HEC 0 does not respond to that request so GC begins only when HEC 0 is done with its computation:

Why does this happen? I am no expert on GHC’s garbage collection, my only knowledge of that comes from section 6 of “Runtime Support for Multicore Haskell“. If I understood correctly this should not happen – it certainly didn’t happen when the paper was published. Do we have a regression or am I misunderstanding something?

- Paper does not mention which version exactly. I believe it was 6.10, since “Runtime Support for Multicore Haskell” by the same authors released at the same time uses GHC 6.10 [↩]