Code benchmarking in Haskell – some thoughts about random data generation

In my last post I showed you how to use the criterion library to write benchmarks for Haskell code. In the tutorial project that I created to demonstrate my ideas I decided to generate random data for benchmarking. Bryan O’Sullivan commented on my approach, saying that “the code (…) that generates random inputs on every run would be a good antipattern for performance testing.” After giving some thought to his words I think I see his point.

The code that Bryan refers to looks like this:

main :: IO ()
main = newStdGen >>= defaultMainWith benchConfig (return ()) . benchmarks
 
benchmarks :: RandomGen g => g -> [Benchmark]
benchmarks = ...

Each time the benchmark suite is run, a different random number generator is created with newStdGen. This generator is then used by the benchmarks function to create the values used for benchmarking. When I designed this I assumed that the values of the data don’t influence the flow of computations. I believe this holds for the shift functions I benchmarked in my tutorial: it doesn’t really matter what values are in the shifted list. As long as the lists have the same length on different runs of the benchmark, the results are comparable. If you want the same random values to be generated on each run, you can instead create a StdGen from a seed that you supply. The modified main function would look like this:

main = return (mkStdGen 123456) >>= defaultMainWith benchConfig (return ()) . benchmarks

What happens, however, when data values do influence the flow of computation? In that case you definitely don’t want newStdGen, as it would make the results of a benchmark incomparable between different runs: you wouldn’t know whether a speed-up is caused by changes in the code or in the data. It is also very likely that you don’t want to use mkStdGen. Why? You would certainly get results that are comparable between runs, but you wouldn’t know the characteristics of the data used for that particular benchmark. For example, let’s assume that your algorithm executes faster when the data it processes contains many zeros. You benchmark the algorithm with random values created from a fixed StdGen and get a very good result. But how many zeros were in the data used for benchmarking? Perhaps 50% of the input was zeros? You don’t know. In this case you definitely want to prepare your own input data sets (e.g. one with many zeros and one without any) to measure how the performance of your code depends on the input it receives. I guess Bryan is right here – careless use of random data generation for benchmarking can be a shot in the foot.
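To make this concrete, here is a minimal sketch of what such hand-prepared inputs might look like. Everything below is made up for illustration: processSignal stands in for a hypothetical algorithm whose running time depends on how many zeros it encounters, and the sizes are arbitrary.

import Criterion.Main (Benchmark, bench, nf)

-- A trivial placeholder for a real, data-sensitive algorithm.
processSignal :: [Double] -> [Double]
processSignal = filter (/= 0)

-- Deterministic inputs with known characteristics instead of random data.
sigManyZeros, sigNoZeros :: [Double]
sigManyZeros = concat (replicate 4096 [0, 1])   -- 50% zeros
sigNoZeros   = [1 .. 8192]                      -- no zeros at all

zeroBenchmarks :: [Benchmark]
zeroBenchmarks =
  [ bench "Many zeros" $ nf processSignal sigManyZeros
  , bench "No zeros"   $ nf processSignal sigNoZeros
  ]

With inputs like these you know exactly what the benchmark measured, and the numbers remain comparable between runs.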

Code benchmarking in Haskell

Three weeks ago I wrote about code testing in Haskell. Today I will discuss how to benchmark code written in Haskell and how to integrate benchmarks into a project. For demonstration purposes I extended my sample project on GitHub, so it now shows how to create both tests and benchmarks.

Overview of Criterion benchmarking library

While there were plenty of testing libraries to choose from, benchmarking seems to be dominated by a single library – Bryan O’Sullivan’s criterion. To get started with it you should read this post on Bryan’s blog. I will present some of the basics in today’s post and also mention a couple of things that are undocumented.

Writing benchmarks for a functional language like Haskell is a bit tricky. There are no side effects in pure functional code, which means that after the value of a function has been computed once it could be memoized and reused later without being recomputed. This is of course not what we want during benchmarking. Criterion takes care of that, but requires that benchmarks be written in a special way. Let’s look at an example benchmark for our shift function1:

bench "Left shift" $ nf (cyclicShiftLeft 2) [1..8192]

The nf function is the key here. It takes two parameters: the first is the benchmarked function applied to all but its last argument; the second is that last argument. The type of nf is:

ghci> :t nf
nf :: Control.DeepSeq.NFData b => (a -> b) -> a -> Pure

When the benchmark is run, nf applies the argument to the function and evaluates the result to normal form, which means it gets fully evaluated. This is needed to ensure that laziness doesn’t distort the outcome of a benchmark.

The code shown above will work perfectly well, but I find this way of creating benchmarks inconvenient for four reasons:

  • presence of magic numbers
  • problems with creating more complex input data
  • verbosity of benchmarks when the benchmarked function takes many parameters
  • difficulty in keeping inputs consistent between benchmarks that should receive the same data (it wouldn’t make sense to benchmark the left and right shift functions on signals of different lengths)

To deal with these problems I decided to write my benchmarks using wrappers:

bench "Left shift"  $ nf benchCyclicShiftLeft paramCyclicShift

The benchCyclicShiftLeft function takes a tuple containing all the data needed by the benchmarked function:

{-# INLINE benchCyclicShiftLeft #-}
benchCyclicShiftLeft :: (Int, [Double]) -> [Double]
benchCyclicShiftLeft (n, sig) = cyclicShiftLeft n sig

The INLINE pragma is used to make sure that the wrapper doesn’t add unnecessary call overhead. As you have probably guessed, paramCyclicShift takes care of creating the tuple. In my code paramCyclicShift is actually a wrapper around a function like this:

dataShift :: RandomGen g => g -> Int -> Int -> (Int, [Double])
dataShift gen n sigSize = (n, take sigSize $ randoms gen)

To keep the benchmarking code easily manageable I organize it similarly to the tests. The project root contains a bench directory with a structure identical to the src and tests directories. A file containing benchmarks for a module is named after that module, with “Bench” appended before the file extension. For example, the benchCyclicShiftLeft and dataShift functions needed to benchmark code in src/Signal/Utils.hs are placed in bench/Signal/UtilsBench.hs. Just like tests, benchmarks are assembled into one suite in the bench/MainBenchmarkSuite.hs file:

import           Criterion.Config
import           Criterion.Main
import           System.Random

import qualified BenchParam        as P
import qualified Signal.UtilsBench as U
 
main :: IO ()
main = newStdGen >>= defaultMainWith benchConfig (return ()) . benchmarks
 
benchmarks :: RandomGen g => g -> [Benchmark]
benchmarks gen =
    let paramCyclicShift = U.dataShift gen P.shiftSize P.sigSize
    in [
      bgroup "Signal shifts"
      [
        bench "Left shift"  $ nf U.benchCyclicShiftLeft  paramCyclicShift
      , bench "Right shift" $ nf U.benchCyclicShiftRight paramCyclicShift
      ]
    ]
 
benchConfig :: Config
benchConfig = defaultConfig {
             cfgPerformGC = ljust True
           }

The most important part is the benchmarks function, which takes a random number generator and assembles all benchmarks into one suite (UPDATE (24/10/2012): read a follow-up post on random data generation). Just as with tests, we can create logical groups and assign names. A cool thing is the bcompare function: it takes a list of benchmarks, treats the first one as the reference and reports the relative speed of the other functions. In my code I use let to introduce paramCyclicShift as a wrapper around the dataShift function. This allows the same input data to be used for both benchmarks. Of course the let is not necessary, but it helps avoid code repetition. I also use the shiftSize and sigSize functions from the BenchParam module. These are defined as constant values and ensure that there is a single source of configuration. Using a separate module for this is a disputable choice – you might as well define shiftSize and sigSize in the same let binding as paramCyclicShift.

The main function creates a random generator, uses bind to pass it to the benchmarks function and finally runs the created suite. I use a custom configuration created with the benchConfig function to enable garbage collection between benchmarks2. I noticed that enabling GC is generally a good idea, because otherwise it will kick in during benchmarking and distort the results.
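For completeness, here is a minimal sketch of what such a BenchParam module could look like – the concrete values are only an illustration (they match the sizes used earlier in this post), not necessarily the ones from my project:

module BenchParam where

-- Single source of benchmark configuration: every benchmark that needs the
-- shift amount or the signal length imports these constants.
shiftSize :: Int
shiftSize = 2

sigSize :: Int
sigSize = 8192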

The good thing about this approach is that benchmarks are concise and follow the structure of the project, magic numbers can be eliminated, and it is easy to ensure that benchmarked functions get the same data when needed.

Automating benchmarks using cabal

Guess what – cabal has built-in support for benchmarks! All we need to do is add one more entry to the project’s cabal file:

benchmark signal-bench
  type:             exitcode-stdio-1.0
  hs-source-dirs:   src, bench
  main-is:          MainBenchmarkSuite.hs
  build-depends:    base,
                    criterion,
                    random
  ghc-options:      -Wall
                    -O2

The structure of this entry is identical to the one for tests, so I will skip the discussion. To run the benchmarks, issue these three commands:

cabal configure --enable-benchmarks
cabal build
cabal bench

This couldn’t be easier. Criterion will produce quite detailed output on the console. As I already said, it’s all explained on Bryan’s blog, but some of the features mentioned there are no longer present in criterion. Most importantly, you cannot display results in a window or save them to a PNG file. Luckily, there is an even fancier feature instead. Run the benchmarks like this:

cabal bench --benchmark-options="-o report.html"

And criterion will produce a nice HTML report with interactive graphs.

A few more comments on benchmarking

All of this looks very easy and straightforward, but I actually spent about three days trying to figure out whether my code was being benchmarked correctly. The problem is laziness. When I call my dataShift function, the random data isn’t created until it is demanded by the cyclic shift function. This could mean that the time needed to actually create the random data gets incorporated into the benchmark. It turns out that criterion is smart enough not to do so. The first run of each benchmark forces the evaluation of the lazily created data, but its run time is discarded and not included in the final results. The evaluated data is then reused in the subsequent runs of the benchmark. You can test this easily by doing something like this:

import Control.Concurrent (threadDelay)
import System.IO.Unsafe   (unsafePerformIO)

dataShift gen n sigSize = unsafePerformIO $ do
    threadDelay 1000000  -- pause for one second on every evaluation
    return (n, take sigSize $ randoms gen)

This causes a delay of one second each time the dataShift function is evaluated. When you run the benchmark you will notice that criterion estimates the time needed to run the benchmarks to be over a hundred seconds (this information is not displayed when the estimated time is short), but it will finish much faster and there should be no difference in the performance of the benchmarked functions. This works even if you create your benchmark like this:

bench "Left shift" $ nf U.benchCyclicShiftLeft 
                     (U.dataShift gen P.shiftSize P.sigSize)

Another problem I stumbled upon quite quickly was the type constraint on the nf function: it requires that the return value belong to NFData. This type class represents data that can be evaluated to normal form (i.e. fully evaluated). Most standard Haskell data types belong to it, but this is not the case for data containers like Vector or Repa arrays. For such cases there is a whnf function that doesn’t have that constraint, but it only evaluates the result to weak head normal form (i.e. to the first data constructor or lambda). Luckily, for unboxed arrays containing primitive types weak head normal form is the same as normal form, so the problem with Vectors is solved.
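As an illustration (the function and sizes below are invented, not taken from my project), a benchmark for a function returning an unboxed Vector can simply use whnf:

import qualified Data.Vector.Unboxed as V

-- whnf is enough here: an unboxed Vector of Doubles is fully evaluated
-- as soon as its outermost constructor is.
bench "Square signal" $ whnf (V.map (^ 2)) (V.enumFromN 0 8192 :: V.Vector Double)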

I also quickly realised that benchmarking results are not as repeatable as I would like them to be. This is of course something to expect on a multitasking operating system. I guess the ultimate solution would be booting into single-user mode and shutting down background services like cron and networking.

I am also experimenting with tuning the options of the runtime system, as they too can influence performance considerably. In a project I am currently working on I benchmark parallel code, and I got better performance results by setting the thread affinity option (the -qa command line switch) and disabling parallel garbage collection (the -g1 switch). I also found ThreadScope to be extremely useful for inspecting the events occurring during a program’s runtime. It works great even for a single-threaded application, as it shows when garbage collection happens, and that alone is very useful information.
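As a rough sketch of how these switches could be wired into the cabal setup from this post (treat it only as an illustration – the exact set of RTS flags depends on your GHC version, and parallel code typically also needs -N to use more than one core), the benchmark executable has to be built with the threaded runtime and with RTS options enabled, after which the flags can be passed at run time:

  ghc-options:      -Wall
                    -O2
                    -threaded
                    -rtsopts

cabal bench --benchmark-options="+RTS -qa -g1 -RTS"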

Summary

Up till now I had never written repeatable benchmarks for my code and relied on simple methods like measuring the wall time of a single run of my application. Criterion seems like a big step forward and, thanks to the instant feedback it provides, I have already managed to speed up my parallel code by a factor of 2. I guess that most of all I like the way cabal allows tests and benchmarks to be seamlessly integrated into a project – this speeds up development considerably.

Remember that all the source code used in this post is available as a project on GitHub. There are also some additional comments in the code.

  1. See my post on testing if you don’t know what shift I’m talking about.
  2. This can also be enabled with the -g command line switch, but setting it in code ensures that it is always turned on.

Some thoughts about Coursera

After completing the Compilers and Automata courses on Coursera I am participating in yet another course: Writing in the Sciences. I guess the name is pretty self-explanatory, but just to make it clear: the course is about writing scientific articles. This is something I definitely need as a scientist for my work. I never considered writing to be my strong skill, so I look forward to completing this course. Writing in the Sciences is led by Professor Kristin Sainani. It is a well-known fact that the fun of learning depends not on the subject itself, but mostly on the lecturer’s ability to present information in an enjoyable and interesting way. Professor Sainani’s lectures are a great example of that, in contrast to Professor Ullman’s lectures on automata, which I personally found a bit dry.

That said, I guess I have a few reasons to complain about the course and about Coursera as a whole. While the lectures given by Professor Sainani are interesting, it is sad that she practically does not participate in the forums. I know that it probably takes a lot of time to prepare the lectures, but I’d still expect the lecturer to spend an hour on the forums once every 5-6 days. While the forums are there for the students to help each other with learning, there are some issues that must be resolved by the course staff. A good example is a problem with downloading one of this week’s lectures. It was reported on Monday and after two days the problem remains unfixed despite numerous complaints on the forum (to make things worse, this lecture is about an assignment that is due at the end of the week). Another thing that makes participation in the course more difficult is publishing new material on Mondays. I have a lot of time on Sundays, so I could easily watch new material on the weekend and do the homework during the week. That’s how it was done in the Compilers course. With videos released at the beginning of the week I have problems finding time to watch them and even more problems doing the homework. Again, there was a request on the forums to release videos one day earlier, but no response from the course staff.

I am also very sceptical about the grading methods. With compilers or automata it was easy to design tests. In Writing in the Sciences it is not that simple. During the weekly tests students are asked to rewrite fragments of scientific articles, but this cannot be graded automatically. I personally find these tests difficult. I usually come up with answers that I consider great, but after comparing them to the model answers I am mostly thinking “why didn’t I figure this out”. Oh well, I guess that’s what learning is all about. I also wonder how the upcoming peer review will work out. The plan is to have students review each other’s homework assignments (a short review of a scientific paper). I’m afraid that this may be a failure. One unfavourable factor is the time needed to review five different papers. Another is that people come from different cultural backgrounds, know English at a moderate level and, to put it bluntly, often seem to misinterpret the intentions of the writer and the message he/she is trying to convey (I’m judging from the forums and not trying to offend anyone).

While not getting feedback from Professor Sainani is a problem of one particular course, I noticed that a general problem of Coursera is the lack of any kind of technical support. Sure, there is an address for technical support, but I have never received any kind of reply despite reporting a couple of problems. I am aware that Coursera is dealing with over a million participants and it is impossible to respond to every mail, but it wouldn’t hurt them to send at least an automatic “we will look into your problem” response. Since I never got any response and I don’t see the reported problems fixed, I feel as though my mails were being sent to /dev/null.

Another reason to complain is Coursera’s lax approach to announced time schedules. I have already learned that despite being announced, courses get delayed and even cancelled. I understand that a one-week slip is possible, but how can it happen that a course planned for early July hasn’t started yet? It seems that whoever planned the course didn’t take the job seriously.

With all that said I still think that Coursera is a revolutionary approach to education. All it needs is just a bit more work and a bit more involvement.
