Configuring Emacs is a nightmare

When I started using Emacs a few months ago I went through some effort to configure it to my liking. There were a couple of issues I wasn’t able to resolve back then, and I decided it was time to get them fixed. That’s how I wasted two days doing something that should have taken no longer than an hour.

About 90% of my time went to configuring Emacs running under Tmux. While Emacs runs fine under a terminal emulator (I’m using Konsole), it turned out that running it under Tmux causes numerous problems: some key combinations don’t work and, even worse, the colours get screwed up. I always use Tmux when working in the console, and at first I wanted to blame it for all the problems. Soon I realized that all other applications work correctly and only Emacs is affected. The problems with key combinations were solved with the xterm-extras add-on. I am not exactly sure what causes the colour issues, but they are related to the TERM environment variable. Within Tmux it is set to screen, while my terminal emulator sets it to xterm. If I understand correctly, Tmux takes care of translating between these two modes, but it looks like Emacs is still affected by the difference. Of course I tried changing the value of TERM in all possible ways, but things only got worse. I tried improving Emacs’ colours with color-theme, but under my terminal emulator it breaks completely. I enabled 256 colours (by default there are only 8!), but I still mostly got garbage when applying different themes. Oh well, I can live with 8 colours just fine. This was day one.
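If you want to try the 256-colour route yourself, the usual knob is Tmux’s default-terminal option in ~/.tmux.conf (shown here as a sketch; your terminal may need a different value):

```
# ~/.tmux.conf – advertise a 256-colour screen terminal to programs inside Tmux
set -g default-terminal "screen-256color"
```

This keeps TERM set to a screen variant (which Tmux expects) instead of overriding TERM by hand inside the session.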

The next day, when things were more or less working, I decided to upgrade Emacs to version 24.2. I was tempted by Emacs’ built-in package management system and the promise of better compatibility with terminal emulators. This was a mistake. I spent another day trying to figure out why a configuration that worked fine under 23.2 breaks under 24.2.

I acknowledge that if I had arcane knowledge of Elisp, all my issues would be much easier to solve. Without that knowledge I have to resort to googling and voodoo programming. Well, I am just an ordinary user who wants to use a text editor. I need a text-based editor so I can do almost all of my work in the console. Emacs meets that requirement, which makes me more productive, but I seriously doubt that this increase in productivity justifies the time needed to configure the editor to my needs. I guess I would be proud of myself had I managed to solve all the problems. I didn’t, and there are still unresolved issues:

  • How can I manage keyboard shortcuts in a sane way? I install an extension only to realize it is not working because its keyboard shortcuts collide with those from a different extension. I can remap the keys, but then I run the risk of colliding with yet other shortcuts.
  • Can I have a block mode in Emacs similar to the one offered by Kate (under KDE)? I found that rect-mark allows me to mark a rectangular area and cut it, but pasting doesn’t work the way it should.
  • Emacs’ undo is clumsy. I managed to improve it with undo-tree, but it still has rough edges: every single-character deletion is recorded as a separate operation. If I hold backspace and delete 30 characters, Emacs sees that as 30 separate operations! Undoing that is a real pain.

Solving these issues would make Emacs a very decent editor for working with Haskell (though not as good as Eclipse is for Java). Perhaps one day I’ll find the patience and energy to resolve the remaining problems. Until then I’ll steel myself and use Emacs the way it is.

Benchmarking C functions using Foreign Function Interface

I am currently working on implementing the Discrete Wavelet Transform (DWT) in Haskell. I want to make use of Haskell’s parallel programming capabilities to implement an algorithm that can take advantage of multiple CPU cores. My previous posts on testing and benchmarking were by-products of this project, as I needed to ensure the reliability of my implementation and to measure its performance. The key question in my head all the time is “can I write Haskell code that outperforms C when given more CPU cores?”. To answer it I needed a way to benchmark the performance of the algorithm written in C, and I must admit that this problem was giving me a real headache. One obvious solution was to implement the algorithm in C and measure its running time. This didn’t seem acceptable. I use Criterion for benchmarking, and it does lots of fancy stuff like measuring clock resolution and calculating kernel density estimates. So unless I implemented these features in C (read: re-implemented the whole library), the results of the measurements would not be comparable.

Luckily for me there is a better solution: the Foreign Function Interface (FFI). This is an extension of the Haskell 98 standard – and part of Haskell 2010 – that allows calling functions written in C[1]. This means that I could write my function in C, wrap it in a pure Haskell function and benchmark that wrapper with Criterion. The results would be comparable with the Haskell implementation, but I was afraid that the overhead of data copying would affect the performance measurements. As it turned out, I was wrong.

I started with chapter 17 of Real World Haskell. It presents a real-world example – I guess the title of the book is very adequate – of creating bindings for an already existing library. Sadly, after reading it I felt very confused. I had a general idea of what should be done, but I didn’t understand many of the details. I had serious doubts about the proper usage of the Ptr and ForeignPtr data types, and these are in fact very important when working with the FFI. Someone on #haskell advised me to read the official FFI specification, and this was spot-on. It is actually one of the few official specifications that are a real pleasure to read (if you’ve read R5RS then you know what I mean). It is concise (30 pages) and provides a comprehensive overview of all the data types and functions used for making foreign calls.

After reading the specification it was rather straightforward to write my own binding to C. Here’s the prototype of the called C function, located in the dwt.h header file:

double* c_dwt(double* ls, int ln, double* xs, int xn);

The corresponding dwt.c source file contains:

#include <stdlib.h>

double* c_dwt( double* ls, int ln, double* xs, int xn ) {
  double* ds = malloc( xn * sizeof( double ) );

  // fill ds array with result

  return ds;
}

The important thing is that the C function allocates new memory with malloc, which we will later manage using Haskell’s garbage collector. The Haskell binding for such a function looks like this:

foreign import ccall unsafe "dwt.h"
  c_dwt :: Ptr CDouble -> CInt -> Ptr CDouble -> CInt -> IO (Ptr CDouble)

Here’s what it does: ccall denotes the C calling convention, unsafe improves performance of the call at the cost of safety[2] and "dwt.h" points to the header file. Finally, I define the name of the function and its type. This name is the same as the name of the original C function, but if it were different I would have to specify the C function’s name in the string that names the header file. You probably already noticed that the type int from C is represented by CInt in Haskell and double by CDouble. You can convert between Int and CInt with fromIntegral and between Double and CDouble with realToFrac. Pointers from C become Ptr, so double* from C is represented as Ptr CDouble in the Haskell binding. What might be surprising about this type signature is that the result is in the IO monad, that is, our C function is marked as impure. The reason is that every time we run c_dwt a different memory address will be allocated by malloc, so indeed the function returns different results given the same input. In my case, however, the array addressed by that pointer will always contain exactly the same values (for the same input data), so in fact my function is pure. The problem is that Haskell doesn’t know that, and we will have to fix it using the infamous unsafePerformIO. For that we have to create a wrapper function with a pure interface:
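As a quick sanity check, the conversion functions mentioned above can be exercised in a tiny standalone program (nothing here beyond the standard Foreign.C.Types module):

```haskell
import Foreign.C.Types (CDouble, CInt)

-- Round-trip Haskell's Int/Double to the C-side CInt/CDouble.
main :: IO ()
main = do
  print (fromIntegral (42 :: Int) :: CInt)       -- 42
  print (realToFrac (3.5 :: Double) :: CDouble)  -- 3.5
```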

import Control.Monad (liftM)
import Data.Vector.Storable
import Foreign hiding (unsafePerformIO)
import Foreign.C
import System.IO.Unsafe
 
dwt :: Vector Double -> Vector Double -> Vector Double
dwt ls sig = unsafePerformIO $ do
    let (fpLs , _, lenLs ) = unsafeToForeignPtr ls
        (fpSig, _, lenSig) = unsafeToForeignPtr sig
    pDwt <- liftM castPtr $ withForeignPtr fpLs $ \ptrLs ->
            withForeignPtr fpSig $ \ptrSig ->
                c_dwt (castPtr ptrLs ) (fromIntegral lenLs )
                      (castPtr ptrSig) (fromIntegral lenSig)
    fpDwt <- newForeignPtr finalizerFree pDwt
    return $ unsafeFromForeignPtr0 fpDwt lenSig

Our wrapper function takes two Vectors as input and returns a new Vector. To interface with C we need to use storable vectors, which store data that can be written to raw memory (which is what the C function operates on). I wasn’t able to figure out the exact difference between storable and unboxed vectors. It seems that both store primitive values in a contiguous memory block and therefore both should offer similar performance (assumed, not verified).

The first thing to do is to get ForeignPtrs out of the input vectors. A ForeignPtr is a Ptr with a finalizer attached. A finalizer is a function called when the object is no longer in use and needs to be garbage collected. In this case we need a function that will free memory allocated with malloc. This is a common task, so the FFI implementation already provides the finalizerFree finalizer. The actual call to the foreign function is made inside the two nested withForeignPtr calls, which let us operate on the Ptr values stored in the ForeignPtrs. However, since we have vectors of Doubles as input, we have Ptr Double, not the Ptr CDouble that c_dwt expects. There are two possible solutions to that problem. One would be to copy the memory, converting every value in a vector with realToFrac. I did not try that, assuming it would kill performance. Instead I used castPtr, which casts a pointer of one type to a pointer of another type. This is potentially dangerous and relies on the fact that Double and CDouble have the same internal representation. That is in fact expected, but by no means guaranteed by any specification! I wouldn’t be surprised if it didn’t work on some exotic hardware architecture. Anyway, I wrote tests to make sure that the cast works the way I want it to. This little trick lets us avoid copying the input data. The output pointer has to be cast from Ptr CDouble back to Ptr Double, and since the result is in the IO monad, castPtr has to be lifted with liftM.

After getting the result as a Ptr Double, we wrap it in a ForeignPtr with the memory-freeing finalizer (newForeignPtr finalizerFree) and use that foreign pointer to construct the resulting vector of Doubles.
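To make the ForeignPtr life cycle concrete, here is a self-contained sketch using only functions from base (no C code involved); it mimics what happens to the buffer returned by c_dwt:

```haskell
import Foreign (Ptr, mallocArray, pokeArray, peekArray,
                newForeignPtr, finalizerFree, withForeignPtr)

main :: IO ()
main = do
  -- Allocate memory with C's malloc, as c_dwt does on the C side.
  p <- mallocArray 3 :: IO (Ptr Double)
  pokeArray p [1, 2, 3]
  -- Attach finalizerFree: the GC will call free when fp becomes unreachable.
  fp <- newForeignPtr finalizerFree p
  xs <- withForeignPtr fp (peekArray 3)
  print xs  -- [1.0,2.0,3.0]
```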

Summary

I had two concerns when writing this binding. The first was possible performance overhead. Thanks to the pointer casts it was possible to avoid any sort of data copying, which makes this binding really quick. Measuring execution time with Criterion shows that calling the C function that does only memory allocation (as shown in this post) takes about 250 µs. After adding the rest of the C code that actually does the computation, the execution time jumps to about 55 ms, so the FFI calling overhead does not skew the performance tests. Big thanks go to Mikhail Glushenkov, who convinced me with his answer on StackOverflow to use the FFI. My second concern was the necessity of using many functions with the word “unsafe”, especially unsafePerformIO. I googled a bit, and it seems that this is normal when working with the FFI, so I guess there is no reason to worry, provided that the binding is thoroughly tested. In the end I am very happy with the result. It is fast, Haskell manages garbage collection of memory allocated in C and, most importantly, I can benchmark C code using Criterion.

  1. The specification also mentions calling conventions for other languages and platforms (Java VM, .Net and C++), but I think there are currently no implementations of these.
  2. Calls need to be safe only when the called C code calls back into Haskell, which I think is rare.

Code benchmarking in Haskell – some thoughts about random data generation

In my last post I showed you how to use the criterion library to write benchmarks for Haskell code. In the tutorial project that I created to demonstrate my ideas, I decided to generate random data for benchmarking. Bryan O’Sullivan commented on my approach that “the code (…) that generates random inputs on every run would be a good antipattern for performance testing.” After giving some thought to his words, I think I see his point.

The code that Bryan refers to looks like this:

main :: IO ()
main = newStdGen >>= defaultMainWith benchConfig (return ()) . benchmarks
 
benchmarks :: RandomGen g => g -> [Benchmark]
benchmarks = ...

Each time the benchmark suite is run, a different random number generator is created with newStdGen. This generator is then used by the benchmarks function to create the values used for benchmarking. When I designed this I assumed that the data values don’t influence the flow of computation. I believe this holds for the shift functions I benchmarked in my tutorial: it doesn’t really matter what values are in the shifted list. As long as the lists have the same length on different runs of the benchmark, the results are comparable. But if you want the same random values generated on each run, you can create a StdGen from a seed that you supply. The modified main function would look like this:

main = return (mkStdGen 123456) >>= defaultMainWith benchConfig (return ()) . benchmarks
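The point of a fixed seed is reproducibility. This small sketch (using the random package) shows that the same seed yields the same values on every run:

```haskell
import System.Random (mkStdGen, randoms)

main :: IO ()
main = do
  let xs = take 5 (randoms (mkStdGen 123456)) :: [Int]
      ys = take 5 (randoms (mkStdGen 123456)) :: [Int]
  -- Two generators built from the same seed produce identical streams.
  print (xs == ys)  -- True
```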

What happens, however, when data values do influence the flow of computation? In that case you definitely don’t want newStdGen, as it would make the results of the benchmark incomparable between runs: you wouldn’t know if a speed-up was caused by changes in the code or in the data. It is also very likely that you don’t want mkStdGen. Why? You would certainly get results comparable between runs. The problem is that you wouldn’t know the characteristics of the data used for that particular benchmark. For example, let’s assume that your algorithm executes faster when the data it processes contains many zeros. You benchmark the algorithm with random values created from a fixed StdGen and get a very good result. But how many zeros were in the benchmark data? Perhaps 50% of the input? You don’t know. In this case you definitely want to prepare your own input data sets (e.g. one with many zeros and one without any) to measure how your code performs on the input it actually receives. I guess Bryan is right here – careless use of random data generation for benchmarking can be a shot in the foot.
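One way to get known data characteristics is to construct the inputs by hand instead of randomly. A minimal sketch (the names and proportions here are hypothetical, just to illustrate the idea):

```haskell
-- Inputs with known characteristics for benchmarking.
manyZeros :: [Int]
manyZeros = concat (replicate 500 [0, 0, 0, 1])  -- exactly 75% zeros

noZeros :: [Int]
noZeros = [1 .. 2000]

main :: IO ()
main = print (length (filter (== 0) manyZeros))  -- 1500
```

Benchmarking both inputs separately tells you how performance depends on the data, which a single random sample cannot.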
