## Waiting for garbage collection can kill parallelism?

I am reposting my mail from Haskell-cafe, since I got no replies in over a week and I think it is an interesting case. I was reading “Parallel Performance Tuning for Haskell” by Jones, Marlow and Singh and wanted to replicate the results for their first case study. The code goes like this:

```module Main where   import Control.Parallel   main :: IO () main = print . parSumFibEuler 38 \$ 5300   fib :: Int -> Int fib 0 = 0 fib 1 = 1 fib n = fib (n - 1) + fib (n - 2)   mkList :: Int -> [Int] mkList n = [1..n-1]   relprime :: Int -> Int -> Bool relprime x y = gcd x y == 1   euler :: Int -> Int euler n = length (filter (relprime n) (mkList n))   sumEuler :: Int -> Int sumEuler = sum . (map euler) . mkList   sumFibEuler :: Int -> Int -> Int sumFibEuler a b = fib a + sumEuler b   parSumFibEuler :: Int -> Int -> Int parSumFibEuler a b = f `par` (e `pseq` (e + f)) where f = fib a e = sumEuler b```

In the paper authors show that this code performs computation of `fib` ans `sumEuler` in parallel and that good speed-up is achieved:

To make the parallelism more robust, we need to be explicit about the evaluation order we intend. The way to do this is to use `pseq` in combination with `par`, the idea being to ensure that the main thread works on `sumEuler` while the sparked thread works on `fib`. (…) This version does not make any assumptions about the evaluation order of `+`, but relies only on the evaluation order of `pseq`, which is guaranteed to be stable.

These results were obtained on older GHC version1. However, compiling program with:

`\$ ghc -O2 -Wall -threaded -rtsopts -fforce-recomp -eventlog parallel.hs`

and then running on GHC 7.4.1 using:

`\$ ./parallel +RTS -qa -g1 -s -ls -N2`

yields a completely different result. These are statistics for a parallel run on two cores:

```SPARKS: 1 (1 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)   INIT    time    0.00s  (  0.00s elapsed) MUT     time    2.52s  (  2.51s elapsed) GC      time    0.03s  (  0.05s elapsed) EXIT    time    0.00s  (  0.00s elapsed) Total   time    2.55s  (  2.56s elapsed)```

Running the same code on one core results in:

```SPARKS: 1 (0 converted, 0 overflowed, 0 dud, 1 GC'd, 0 fizzled)   INIT    time    0.00s  (  0.00s elapsed) MUT     time    2.51s  (  2.53s elapsed) GC      time    0.03s  (  0.05s elapsed) EXIT    time    0.00s  (  0.00s elapsed) Total   time    2.55s  (  2.58s elapsed)```

Looking and `MUT` (mutator time) it looks that there is no speed-up at all. Investigating eventlog using ThreadScope sheds some light on execution of a program:

Both threads start computation, but HEC 1 soon blocks and only resumes when HEC 0 finishes computation. Zooming in it looks that HEC 1 stops because it requests garbage collection, but HEC 0 does not respond to that request so GC begins only when HEC 0 is done with its computation:

Why does this happen? I am no expert on GHC’s garbage collection, my only knowledge of that comes from section 6 of “Runtime Support for Multicore Haskell“. If I understood correctly this should not happen – it certainly didn’t happen when the paper was published. Do we have a regression or am I misunderstanding something?

1. Paper does not mention which version exactly. I believe it was 6.10, since “Runtime Support for Multicore Haskell” by the same authors released at the same time uses GHC 6.10 []

### 5 Responses to “Waiting for garbage collection can kill parallelism?”

1. ceii says:

Your analysis about HEC 0 waiting for HEC 1 to participate in GC is correct. What seems to be happening is that fib gets compiled to a “perfect” loop that performs no allocation at all. Since GHC threads can only give control back to the runtime when allocating memory, this means the fib thread does its whole job in a single stride and only notices the request for GC once it’s done. Contrast with what happens when you compile the program without optimizations, and fib allocates memory on each step.

This idea of only preempting threads on allocation seems idiotic at first, but it has clear advantages from a performance point of view. Besides, a normal functional program allocates memory so often that this model is usually indistinguishable from true preemption. The caveat of this model is that threads that actually perform no allocation become unresponsive; this is well-understood, but people have chosen to live with it.

Getting back to the question, the behavior of this example has certainly changed since GHC 6.10. It’s hard to call to call this a regression though, since this is the result of GHC optimizing fib *better*, only with sad side-effects.

At any rate, you can avoid this phenomenon by preventing fib from being optimized too much. Moving the definition of fib to a separate module with {-# OPTIONS_GHC -fno-strictness #-} cuts my running by 25% compared to the single-threaded, single-module version; this is probably bout as good as it can get considering that the tasks are very unbalanced.

Keep in mind that the above was about purposedly making fib _slower_. Still, sometimes (like in this case) responsiveness matters more that single-threaded performance.

(As for the choice of -fno-strictness, I figured that disabling strictness analysis would force GHC to re-box/re-unbox the parameter and return value of fib at each recursive call. There might be other flags that result in the same effect.)

2. ceii says:

Oops, I swapped HEC 0 and 1 in my first paragraph. HEC 1 needs a GC, and HEC 0 can’t stop fib to participate in it.

3. This is http://hackage.haskell.org/trac/ghc/ticket/367 which is is partly fixed in GHC HEAD thanks to work by Edward Z. Yang: There will be an optional flag that prevents such loops, but it has a considerable space penalty.

4. Not related to the actual technical content of your post, but may I ask how you get Haskell syntax highlighting in your code examples?

5. Jan Stolarek says:

I use WP-Syntax plugin for WordPress.

Staypressed theme by Themocracy