The idea behind injective type families is to infer the arguments of a type family from the result. For example, given a definition:

    type family F a = r | r -> a where
        F Char = Bool
        F Bool = Char
        F a    = a

if we know `(F a ~ Bool)`^{1} then we want to infer `(a ~ Char)`. And if we know `(F a ~ Double)` then we want to infer `(a ~ Double)`. Going one step further from this, if we know `(F a ~ F b)` then – knowing that `F` is injective – we want to infer `(a ~ b)`.
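To see this kind of inference at work, here is a minimal sketch, assuming GHC 8.0 or later with the `TypeFamilyDependencies` extension. The class `Named`, its method `nameOf` and the two test values are hypothetical names invented for this illustration; they do not come from the paper. The type variable `a` appears only underneath `F` in the method's type, so the only way GHC can pick an instance is to work backwards from `F a` to `a`.

```haskell
{-# LANGUAGE TypeFamilyDependencies #-}

-- The injective type family from the text.
type family F a = r | r -> a where
  F Char = Bool
  F Bool = Char
  F a    = a

-- Hypothetical class: `a` occurs only as an argument of F, so instance
-- selection depends on inferring `a` from the result of the family.
class Named a where
  nameOf :: F a -> String

instance Named Char   where nameOf _ = "Char"
instance Named Bool   where nameOf _ = "Bool"
instance Named Double where nameOf _ = "Double"

-- The argument is a Bool, so (F a ~ Bool), which should give (a ~ Char).
fromBool :: String
fromBool = nameOf True

-- The argument is a Double, so (F a ~ Double), which should give (a ~ Double).
fromDouble :: String
fromDouble = nameOf (3.5 :: Double)
```

Without the `| r -> a` annotation I would expect GHC to reject the signature of `nameOf` as ambiguous, because nothing would then determine `a`.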

Notice that in order to declare `F` as injective I used new syntax. Firstly, I used “`= r`” to introduce a name for the result returned by the type family. Secondly, I used syntax borrowed from functional dependencies to declare injectivity. For multi-argument type families this syntax allows declaring injectivity in only some of the arguments, e.g.:

    type family G a b c = r | r -> a c

Actually, you can even have kind injectivity, assuming that type arguments have polymorphic kinds.

Obviously, to make use of injectivity declared by the user GHC needs to check that the injectivity annotation is true. And that’s the really tricky part that the paper focuses on. Here’s an example:

    type family T a = r | r -> a where
        T [a] = a

This type family returns the type of elements stored in a list. It certainly looks injective. Surprisingly, it is not. Say we have `(T [T Int])`. By the only equation of `T` this gives us `(T [T Int] ~ T Int)`. And by injectivity we have `([T Int] ~ Int)`. We just proved that lists and integers are equal, which is a disaster.

The above is only a short teaser. The paper covers much more: more corner cases, our algorithm for verifying the user’s injectivity annotations, details of exploiting knowledge of injectivity inside the compiler and the relationship of injective type families to functional dependencies. The extended version of the paper also comes with proofs of soundness and completeness of our algorithm.

- `~` means unification. Think of “`~`” as “having a proof that two types are equal”.

Find the type error in the following Haskell expression:

`if null xs then tail xs else xs`

You can’t, of course: this program is obviously nonsense unless you’re a typechecker. The trouble is that only certain computations make sense if the `null xs` test is `True`, whilst others make sense if it is `False`. However, as far as the type system is concerned, the type of the then branch is the type of the else branch is the type of the entire conditional. Statically, the test is irrelevant. Which is odd, because if the test really were irrelevant, we wouldn’t do it. Of course, `tail []` doesn’t go wrong – well-typed programs don’t go wrong – so we’d better pick a different word for the way they do go.

The above quote is the opening paragraph of Conor McBride’s paper “Epigram: Practical Programming with Dependent Types”. As always, Conor makes a good point: this test is completely irrelevant to the typechecker although it is very relevant at run time. Clearly the type system fails to accurately approximate the runtime behaviour of our program. In this short post I will show how to fix this in Haskell using dependent types.

The problem is that the types used in this short program carry no information about the manipulated data. This is true both for the `Bool` returned by `null xs`, which contains no evidence of the result, and for lists, which store no information about their length. As some of you probably realize, the latter is easily fixed by using vectors, i.e. length-indexed lists:

    data N = Z | S N  -- natural numbers

    data Vec a (n :: N) where
        Nil  :: Vec a Z
        Cons :: a -> Vec a n -> Vec a (S n)

The type of a vector encodes its length, which means that the type checker can now be aware whether it is dealing with an empty vector. Now let’s write `null` and `tail` functions that work on vectors:

    vecNull :: Vec a n -> Bool
    vecNull Nil        = True
    vecNull (Cons _ _) = False

    vecTail :: Vec a (S n) -> Vec a n
    vecTail (Cons _ tl) = tl

`vecNull` is nothing surprising: it returns `True` for an empty vector and `False` for a non-empty one. But the tail function for vectors differs from its implementation for lists. `tail` from Haskell’s standard prelude is not defined for an empty list, so calling `tail []` results in an exception (that would be the case in Conor’s example). But the type signature of `vecTail` requires that the input vector is non-empty. As a result we can rule out the `Nil` case. That also means that Conor’s example will no longer typecheck^{1}. But how can we write a correct version of this example, one that removes the first element of a vector only when it is non-empty? Here’s an attempt:

    shorten :: Vec a n -> Vec a m
    shorten xs = case vecNull xs of
                   True  -> xs
                   False -> vecTail xs

That however won’t compile: now that we have written a type-safe tail function, the typechecker requires a proof that the vector passed to it as an argument is non-empty. The weak link in this code is the `vecNull` function. It tests whether a vector is empty but delivers no type-level proof of the result. In other words we need:

    vecNull' :: Vec a n -> IsNull n

i.e. a function whose result type carries information about the length of the vector. This data type will have a runtime representation isomorphic to `Bool`, i.e. it will be an enumeration with two constructors, and the type index will correspond to the length of a vector:

    data IsNull (n :: N) where
        Null    :: IsNull Z
        NotNull :: IsNull (S n)

`Null` represents empty vectors, `NotNull` represents non-empty ones. We can now implement a version of `vecNull` that carries proof of the result at the type level:

    vecNull' :: Vec a n -> IsNull n
    vecNull' Nil        = Null
    vecNull' (Cons _ _) = NotNull

The type signature of `vecNull'` says that the return type must have the same index as the input vector. Pattern matching on the `Nil` case provides the type checker with the information that the `n` index of `Vec` is `Z`. This means that the return value in this case must be `Null` – the `NotNull` constructor is indexed with `S` and that obviously does not match `Z`. Similarly, in the `Cons` case the return value must be `NotNull`. However, replacing `vecNull` in the definition of `shorten` with our new `vecNull'` will again result in a type error. The problem comes from the type signature of `shorten`:

    shorten :: Vec a n -> Vec a m

By indexing input and output vectors with different length indices – `n` and `m` – we tell the typechecker that these are completely unrelated. But that is not true! Knowing the input length `n` we know exactly what the result should be: if the input vector is empty the result vector is also empty; if the input vector is not empty it should be shortened by one. Since we need to express this at the type level we will use a type family:

    type family Pred (n :: N) :: N where
        Pred Z     = Z
        Pred (S n) = n

(In a fully-fledged dependently-typed language we would write a normal function and then apply it at the type level.) Now we can finally write:

    shorten :: Vec a n -> Vec a (Pred n)
    shorten xs = case vecNull' xs of
                   Null    -> xs
                   NotNull -> vecTail xs

This definition should not go wrong. Trying to swap the expressions in the branches will result in a type error.
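For readers who want to try this out, the pieces can be assembled into one module roughly as follows. This is a sketch: the list of `LANGUAGE` pragmas is my assumption about what a reasonably recent GHC needs, and `toList` is a hypothetical helper added only so that results can be inspected.

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures, TypeFamilies #-}

data N = Z | S N  -- natural numbers, promoted to a kind by DataKinds

-- Length-indexed lists.
data Vec a (n :: N) where
  Nil  :: Vec a Z
  Cons :: a -> Vec a n -> Vec a (S n)

-- A Bool-like answer that also records what was learned about the index.
data IsNull (n :: N) where
  Null    :: IsNull Z
  NotNull :: IsNull (S n)

-- Type-level predecessor, with Pred Z = Z.
type family Pred (n :: N) :: N where
  Pred Z     = Z
  Pred (S n) = n

vecNull' :: Vec a n -> IsNull n
vecNull' Nil        = Null
vecNull' (Cons _ _) = NotNull

vecTail :: Vec a (S n) -> Vec a n
vecTail (Cons _ tl) = tl

-- Matching on Null or NotNull refines n, which makes each branch typecheck.
shorten :: Vec a n -> Vec a (Pred n)
shorten xs = case vecNull' xs of
  Null    -> xs
  NotNull -> vecTail xs

-- Hypothetical helper: forget the length index.
toList :: Vec a n -> [a]
toList Nil         = []
toList (Cons x xs) = x : toList xs
```

Loaded into GHCi, `toList (shorten (Cons 1 (Cons 2 Nil)))` drops the first element, while `shorten Nil` simply returns the empty vector.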

- Assuming we don’t abuse Haskell’s unsoundness as a logic, e.g. by using `undefined`.

Let’s begin by looking at Haskell because it is a good example of a language that does not formalize coinduction in any way. Two features of Haskell are of interest to us. The first one is laziness. Thanks to Haskell being lazy we can write definitions like these (in GHCi):

    ghci> let ones = 1 : ones
    ghci> let fib = zipWith (+) (1:fib) (1:1:fib)

`ones` is – as the name implies – an infinite sequence (list) of ones. `fib` is a sequence of Fibonacci numbers. Both these definitions produce infinite lists but we can use these definitions safely because laziness allows us to force a finite number of elements in the sequence:

    ghci> take 5 ones
    [1,1,1,1,1]
    ghci> take 10 fib
    [2,3,5,8,13,21,34,55,89,144]

Now consider this definition:

    ghci> let inf = 1 + inf

No matter how hard we try there is no way to use the definition of `inf` in a safe way. It always causes an infinite loop:

    ghci> (0 /= inf)
    *** Exception: <<loop>>

The difference between the definitions of `ones` or `fib` and the definition of `inf` is that the former use something called *guarded recursion*. The term *guarded* comes from the fact that the recursive reference to self is hidden under a datatype constructor (or: guarded by a constructor). The way lazy evaluation is implemented gives a guarantee that we can stop the recursion by not evaluating the recursive constructor argument. This kind of infinite recursion can also be called *productive recursion*, which means that although the recursion is infinite each recursive call is guaranteed to produce something (in my examples either a 1 or the next Fibonacci number). By contrast, recursion in the definition of `inf` is not guarded or productive in any way.

Haskell happily accepts the definition of `inf` even though it is completely useless. When we write Haskell programs we of course don’t want them to fall into silly infinite loops but the only tool we have to prevent us from writing such code is our intelligence. The situation changes when it comes to dependently-typed languages.

These languages care deeply about termination. By “termination” I mean ensuring that a program written by the user is guaranteed to terminate for any input. I am aware of two reasons why these languages care about termination. The first reason is theoretical: without termination the resulting language is inconsistent as a logic. This happens because a non-terminating term can prove any proposition. Consider this non-terminating Coq definition:

    Fixpoint evil (A : Prop) : A := evil A.

If that definition were accepted we could use it to prove any proposition. Recall that when it comes to viewing types as propositions and programs as proofs, “proving a proposition” means constructing a term of a given type. `evil` would allow us to construct a term inhabiting any type `A`. (`Prop` is a *kind* of logical propositions so `A` is a type.) Since dependently-typed languages aim to be consistent logics they must reject non-terminating programs. The second reason for checking termination is practical: dependently typed languages admit functions in type signatures. If we allowed non-terminating functions then typechecking would also become non-terminating and again this is something we don’t want. (Note that Haskell gives you `UndecidableInstances`, which can cause typechecking to fall into an infinite loop.)

Now, if you paid attention in your Theoretical Computer Science classes all of this should ring a bell: the halting problem! The halting problem says that the problem of determining whether a given Turing machine (read: a given computer program) will ever terminate is undecidable. So how is it possible that languages like Agda, Coq or Idris can answer that question? That’s simple: they are not Turing-complete (or at least their terminating subsets are not Turing-complete). They prohibit the user from using some constructs, probably the most important one being *general recursion*. Think of general recursion as any kind of recursion imaginable. Dependently typed languages require structural recursion on subterms of the arguments. That means that if a function receives an argument of an inductive data type (think: algebraic data type/generalized algebraic data type) then you can only make recursive calls on terms that are syntactic subcomponents of the argument. Consider this definition of `map` in Idris:

    map : (a -> b) -> List a -> List b
    map f []      = []
    map f (x::xs) = f x :: map f xs

In the second equation we use pattern matching to deconstruct the list argument. The recursive call is made on `xs`, which is structurally smaller than the original argument. This guarantees that any call to `map` will terminate. There is a silent assumption here that the `List a` argument passed to `map` is finite, but with the rules given so far it is not possible to construct an infinite list.
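The contrast between structural and general recursion can be sketched in Haskell, which accepts both without complaint. The function names below are hypothetical and chosen only for this illustration. The first function has the same shape as the Idris `map`; the second recurses on values computed from its argument, which is the kind of definition a purely structural check has no way to justify.

```haskell
-- Structural recursion: the recursive call is made on xs, a syntactic
-- subterm of the argument (x : xs).
mapList :: (a -> b) -> [a] -> [b]
mapList _ []       = []
mapList f (x : xs) = f x : mapList f xs

-- General recursion: the recursive calls are made on n `div` 2 and
-- 3 * n + 1, neither of which is a subterm of n. Whether this function
-- terminates for every positive input is the open Collatz problem.
collatzSteps :: Integer -> Int
collatzSteps n
  | n <= 1    = 0
  | even n    = 1 + collatzSteps (n `div` 2)
  | otherwise = 1 + collatzSteps (3 * n + 1)
```

For example, `collatzSteps 6` follows the chain 6, 3, 10, 5, 16, 8, 4, 2, 1 and so counts eight steps.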

So we just eliminated non-termination by limiting what can be done with recursion. This means that our Haskell definitions of `ones` and `fib` would not be accepted in a dependently-typed language because they don’t recurse on an argument that gets smaller and as a result they construct an infinite data structure. Does that mean we are stuck with having only finite data structures? Luckily, no.

Coinduction provides a way of defining and operating on infinite data structures as long as we can prove that our operations are safe, that is they are guarded and productive. In what follows I will use Coq because it seems that it has better support for coinduction than Agda or Idris (and if I’m wrong here please correct me).

Coq, Agda and Idris all require that a datatype that can contain infinite values has a special declaration. Coq uses the `CoInductive` keyword instead of the `Inductive` keyword used for standard inductive data types. In a similar fashion Idris uses `codata` instead of `data`, while Agda requires an ∞ annotation on a coinductive constructor argument.

Let’s define a type of infinite `nat` streams in Coq:

    CoInductive stream : Set :=
      | Cons : nat -> stream -> stream.

I could have defined a polymorphic stream but for the purpose of this post a stream of nats will do. I could have also defined a `Nil` constructor to allow finite coinductive streams – declaring data as coinductive means it *can* have infinite values, not that it *must* have infinite values.

Now that we have infinite streams let’s revisit our examples from Haskell: `ones` and `fib`. `ones` is simple:

    CoFixpoint ones : stream := Cons 1 ones.

We just had to use the `CoFixpoint` keyword to tell Coq that our definition will be corecursive and it is happily accepted even though a similar recursive definition (i.e. using the `Fixpoint` keyword) would be rejected. Allow me to quote directly from CPDT:

whereas recursive definitions were necessary to *use* values of recursive inductive types effectively, here we find that we need *co-recursive definitions* to *build* values of co-inductive types effectively.

That one sentence pins down an important difference between induction and coinduction.
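For comparison with the earlier GHCi session, the same stream can be written in Haskell as a data type with no `Nil` constructor. This is only an analogy: Haskell performs no guardedness check, and `takeS` is a hypothetical helper that observes a finite prefix of the stream.

```haskell
-- A stream of integers with a single constructor, mirroring the Coq
-- `stream` type: every value has a head and a tail.
data Stream = Cons Integer Stream

-- The corecursive reference sits directly under the Cons constructor.
ones :: Stream
ones = Cons 1 ones

-- Observe the first n elements. Laziness ensures that the rest of the
-- stream is never forced.
takeS :: Int -> Stream -> [Integer]
takeS n (Cons x rest)
  | n <= 0    = []
  | otherwise = x : takeS (n - 1) rest
```

Here `takeS 5 ones` evaluates to `[1,1,1,1,1]`: building the value is corecursive, while consuming a finite part of it is ordinary recursion on `n`.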

Now let’s define `zipWith` and try our second example `fib`:

    CoFixpoint zipWith (f : nat -> nat -> nat) (a : stream) (b : stream) : stream :=
      match a, b with
        | Cons x xs, Cons y ys => Cons (f x y) (zipWith f xs ys)
      end.

    CoFixpoint fib : stream :=
      zipWith plus (Cons 1 fib) (Cons 1 (Cons 1 fib)).

Unfortunately this definition is rejected by Coq due to “unguarded recursive call”. What exactly goes wrong? Coq requires that all recursive calls in a corecursive definition are:

- direct arguments to a data constructor
- not inside function arguments

Our definition of `fib` violates the second condition – both recursive calls to `fib` are hidden inside arguments to the `zipWith` function. Why does Coq enforce such a restriction? Consider this simple example:

    Definition tl (s : stream) : stream :=
      match s with
        | Cons _ tl' => tl'
      end.

    CoFixpoint bad : stream := tl (Cons 1 bad).

`tl` is a standard tail function that discards the first element of a stream and returns its tail. Just like our definition of `fib`, the definition of `bad` places the corecursive call inside a function argument. I hope it is easy to see that accepting the definition of `bad` would lead to non-termination – inlining the definition of `tl` and simplifying it leads us to:

    CoFixpoint bad : stream := bad.

and that is bad. You might be thinking that the definition of `bad` really has no chance of working whereas our definition of `fib` could in fact be run safely without the risk of non-termination. So how do we persuade Coq that our corecursive definition of `fib` is in fact valid? Unfortunately there seems to be no simple answer. What was meant to be a simple exercise in coinduction turned out to be a real research problem. This past Monday I spent well over an hour with my friend staring at the code and trying to come up with a solution. We didn’t find one but instead we found a really nice paper, “Using Structural Recursion for Corecursion” by Yves Bertot and Ekaterina Komendantskaya. The paper presents a way of converting definitions like `fib` to a guarded and productive form accepted by Coq. Unfortunately the converted definition loses the linear computational complexity of the original definition so the conversion method is far from perfect. I encourage you to read the paper. It is not long and is written in a very accessible way. Another set of possible solutions is given in chapter 7 of CPDT but I am very far from labelling them as “accessible”.
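A different route, which abandons the `zipWith` formulation rather than repairing it, is to carry the two most recent values as accumulators so that the corecursive call is a direct argument of the constructor. The sketch below is in Haskell and uses hypothetical names (`fibFrom`, `takeS`). My assumption, not checked here, is that the corresponding `CoFixpoint` would satisfy Coq's guardedness condition for that reason. It is not the conversion described by Bertot and Komendantskaya.

```haskell
-- Integer streams with a single constructor, as in the Coq `stream` type.
data Stream = Cons Integer Stream

-- The corecursive call is a direct argument of Cons and is not wrapped
-- in any other function application.
fibFrom :: Integer -> Integer -> Stream
fibFrom a b = Cons a (fibFrom b (a + b))

-- Starts at 2, 3 so that it matches the `fib` sequence shown earlier.
fib :: Stream
fib = fibFrom 2 3

-- Hypothetical helper for observing a finite prefix.
takeS :: Int -> Stream -> [Integer]
takeS n (Cons x rest)
  | n <= 0    = []
  | otherwise = x : takeS (n - 1) rest
```

With these definitions `takeS 10 fib` produces the same ten numbers as `take 10 fib` in the GHCi session above, and each element costs one addition.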

I hope this post demonstrates that the basic ideas behind coinduction are actually quite simple. For me this whole subject of coinduction looks really fascinating and I plan to dive deeper into it. I already have my eyes set on several research papers about coinduction so there’s a good chance that I’ll write more about it in future posts.

(Note: In what follows I will compare Coq to Agda and Idris but you have to be aware that despite similarity in features of these languages they don’t aim to be the same. Coq is a proof assistant with an extra feature of code extraction that allows you to turn your proofs into code – if you ever heard about “programs as proofs” this is it. Idris is a programming language with extra features that allow you to prove your code correct. I’m not really sure how to classify Agda. It is definitely on the programming-language-end of the spectrum – it allows you to prove your code correct but does not provide any extra built-in proof support. At the same time turning Agda code into working programs is non-trivial.)

Let me start off by saying that I don’t have any real-life Coq project on the horizon, so my learning is not motivated by a need to solve any practical problem. My main driving force for learning Coq is purely interest in programming languages and seeing how Coq compares to Agda and Idris. A common thing with dependently-typed languages is that the types can get too complicated for the programmer to comprehend and thus a language requires an interactive mode to provide the programmer with compiler feedback about the types. This is true for Agda, Idris and Coq.

Agda offers great support for holes: the programmer can insert question marks into the program code and once it is re-compiled in the editor (read: Emacs) the question marks become holes, i.e. places where the Agda compiler provides the user with feedback about expected types and the bindings available in the hole context, as well as some nice inference features that can automatically fill in the contents of holes. So in Agda one proves by constructing terms of appropriate types.

Coq is different, as it relies on a mechanism called “tactics”. Once the user writes down a type (theorem) he is presented with a set of goals to prove. Applying a tactic transforms the current goal into a different goal (or several goals). Conducting consecutive steps of the proof (i.e. applying several tactics) should lead to some trivial goal that follows from definition and ends the proof. To work with Coq I decided to use Proof General, an Emacs extension for working with proofs (many other proof assistants are supported besides Coq)^{1}. It launches a Coq process in the background and essentially integrates writing code with proving. With Proof General I can easily step through my proofs to see how the goals are transformed by usage of tactics.

Idris falls somewhere between Agda and Coq. As stated earlier it is mostly a programming language but it also provides tactic-based proving. So for example when I write a definition that requires an explicit proof to typecheck, idris-mode launches an interactive REPL in which I can conduct a proof in a fashion similar to Proof General and once I’m finished the proof is inserted into the source code. The result looks something like this:

    par : (n : Nat) -> Parity n
    par Z = even {n=Z}
    par (S Z) = odd {n=Z}
    par (S (S k)) with (par k)
      par (S (S (j + j)))     | even ?= even {n = S j}
      par (S (S (S (j + j)))) | odd  ?= odd {n = S j}

    ---------- Proofs ----------

    Basics.par_lemma_2 = proof
      intros
      rewrite sym (plusSuccRightSucc j j)
      trivial

    Basics.par_lemma_1 = proof
      intros
      rewrite sym (plusSuccRightSucc j j)
      trivial

The last time I checked, once the proof was completed and added to the source code Idris did not let me step through it back and forth to see how the goals are transformed. (Things might have changed since I last checked.)

So far I’ve been doing rather basic stuff with Coq so I haven’t seen much that wouldn’t be also possible in Agda or Idris. The biggest difference is that Coq feels like a much more grown-up language than either of the other two. One totally new thing I learned so far is co-induction, but I’m only starting with it and won’t go into details, rather leaving it for a separate post. (Agda also supports co-induction.)

As for the CPDT book I have to say it is a challenging read: it’s very dense and focuses on more advanced Coq techniques without going into details of basics. As such it is a great demonstration of what can be done in Coq but not a good explanation of how it can be done. Depending on what you are expecting this book might or might not be what you want. As stated earlier I don’t plan on applying Coq in any project but rather want to see a demo of Coq features and possibly pick up some interesting theoretical concepts. For that purpose CPDT works quite well for me although I am not entirely happy with not being able to fully understand some of the demonstrated techniques. CPDT is definitely not a self-contained read, which I believe was a conscious decision on the author’s side. Discussions with people on the #coq IRC channel and various posts on the internet lead me to the conclusion that CPDT is a great book for people who have been using Coq for some time and want to take their skills to a new level. The main theme of the book is proof automation, that is replacing tactic-based sequential proofs with automated decision procedures adjusted to the problem at hand that can construct proofs automatically. Indeed tactic-based proofs are difficult to understand and maintain. Consider this proof of a simple property that `n + 0 = n`:

    Theorem n_plus_O : forall (n : nat), plus n O = n.
      intros.
      induction n.
      reflexivity.
      simpl.
      rewrite IHn.
      reflexivity.
    Qed.

To understand that proof one has to step through it to see how goals are transformed or have enough knowledge of Coq to know that without the need of stepping through. Throughout the book Adam Chlipala demonstrates the power of proof automation by using his tactic called *crush*, which feels like a magic wand since it usually ends the proof immediately (sometimes it requires some minimal guidance before it ends the proof immediately). I admit this is a bit frustrating as I don’t feel I learn anything by seeing *crush* applied to magically finish a proof. Like I said, a good demo of what can be done but without an explanation. The worst thing is that *crush* does not seem to be explained anywhere in the book so readers wanting to understand it are left on their own (well, almost on their own).

What about those of you who want to learn Coq starting from the basics? It seems like Software Foundations is the introductory book about Coq and – given that the main author is Benjamin Pierce – it looks like you can’t go wrong with this book. I am not yet sure whether I’ll dive into SF but most likely not as this would mean giving up on CPDT and for me it’s more important to get a general coverage of more advanced topics rather than in-depth treatment of basics.

- Other choices of interactive mode are available for Coq, for example CoqIDE shipped by default with Coq installation

My overall impression is that porting from Agda to Haskell turned out to be fairly straightforward. It was definitely not a complete rewrite. More like syntax adjustments here and there. There were of course some surprises and bumps along the way but nothing too problematic. More precise details are given in the code comments.

When it comes to programming with dependent types Agda, being a fully-fledged dependently-typed language, beats Haskell in many aspects:

- Agda has the same language for terms and types. Haskell separates these languages, which means that if I want to have addition for natural numbers then I need to have two separate definitions for terms and types. Moreover, to tie types and terms together I need singleton types. And once I have singleton types I need to write a third definition of addition that works on singletons. All of this is troublesome to write and use. (This tedious process can be automated by using the `singletons` package.)
- The interactive agda-mode for Emacs makes writing code much simpler in Agda. Here I was porting code that was already written so having an interactive Emacs mode for Haskell was not at all important. But if I were to write all that dependently-typed code from scratch in Haskell this would be painful. We definitely need better tools for dependently-typed programming in Haskell.
- Agda admits Unicode identifiers. This allows me to have type constructors like `≥` or variables like `p≥b`. In Haskell I have `GEq` and `pgeb`, respectively. I find that less readable. (This is very subjective.)
- Agda has implicit arguments that can be deduced from types. Haskell does not, which makes some function calls more difficult. Surprisingly that was not as huge a problem as I initially thought it would be.
- Agda is total, while Haskell is not. Since there are bottoms in Haskell it is not sound as a logic. In other words we can prove false, e.g. by using `undefined`.
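The first point is easiest to appreciate with the three definitions written side by side. The following is a hand-written sketch, not code from the port: the names `plus`, `Plus`, `SNat`, `sPlus` and `fromSNat` are hypothetical, and the pragma list is my assumption about what GHC requires.

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures, TypeFamilies #-}

data Nat = Zero | Succ Nat
  deriving (Eq, Show)

-- 1. Term-level addition on ordinary values.
plus :: Nat -> Nat -> Nat
plus Zero     m = m
plus (Succ n) m = Succ (plus n m)

-- 2. Type-level addition on the promoted kind Nat.
type family Plus (n :: Nat) (m :: Nat) :: Nat where
  Plus Zero     m = m
  Plus (Succ n) m = Succ (Plus n m)

-- Singleton type: each index n has exactly one (non-bottom) value.
data SNat (n :: Nat) where
  SZero :: SNat Zero
  SSucc :: SNat n -> SNat (Succ n)

-- 3. Addition on singletons; its result type is computed by Plus.
sPlus :: SNat n -> SNat m -> SNat (Plus n m)
sPlus SZero     m = m
sPlus (SSucc n) m = SSucc (sPlus n m)

-- Forget the index and recover an ordinary Nat.
fromSNat :: SNat n -> Nat
fromSNat SZero     = Zero
fromSNat (SSucc n) = Succ (fromSNat n)
```

All three definitions have the same two equations, which is exactly the duplication that Agda avoids by using a single language for terms and types.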

The list is noticeably shorter:

- Haskell has much better term-level syntax. In many places this resulted in significantly shorter code than in Agda.
- Haskell is not total. As stated earlier this has its drawbacks but it also has a good side: we don’t need to struggle with convincing the termination checker that our code does actually terminate. This was painful in Agda since it required using sized types.
- Haskell’s `gcastWith` function is much better than Agda’s `subst`. Both these functions allow type-safe casts given the proof that the cast is safe. The difference is that Agda’s `subst` requires more explicit arguments (as I noted earlier the opposite is usually the case) and restricts the cast to the last type parameter (Haskell allows a cast for any type parameter).
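As a small illustration of `gcastWith` from `Data.Type.Equality`, here is a sketch that proves `n + 0 = n` for singleton naturals and then uses the proof for a cast. Everything except `gcastWith`, `:~:` and `Refl` is a hypothetical name introduced for this example, and none of it is taken from the ported code.

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures, TypeFamilies, TypeOperators #-}

import Data.Type.Equality ((:~:) (..), gcastWith)

data Nat = Zero | Succ Nat

type family Plus (n :: Nat) (m :: Nat) :: Nat where
  Plus Zero     m = m
  Plus (Succ n) m = Succ (Plus n m)

data SNat (n :: Nat) where
  SZero :: SNat Zero
  SSucc :: SNat n -> SNat (Succ n)

-- Proof by induction that n + 0 = n. In the inductive step gcastWith
-- brings the induction hypothesis into scope so that Refl typechecks.
plusZero :: SNat n -> Plus n Zero :~: n
plusZero SZero     = Refl
plusZero (SSucc n) = gcastWith (plusZero n) Refl

-- Use the proof to cast: only the proof and the value are passed.
dropZero :: SNat n -> SNat (Plus n Zero) -> SNat n
dropZero n x = gcastWith (plusZero n) x

-- Forget the index, for inspection.
toInt :: SNat n -> Int
toInt SZero     = 0
toInt (SSucc n) = 1 + toInt n
```

In `dropZero` the call does not say which type parameter is being rewritten; GHC works that out from the equality that `gcastWith` puts in scope.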

While the list of wins is longer for Agda than it is for Haskell I’m actually very happy with Haskell’s performance in this task. The verification in Haskell is as powerful as it is in Agda. No compromises required.

It’s worth remarking that my implementation works with GHC 7.6, so you don’t need the latest fancy type-level features like closed type families. The really essential part are the promoted data types.

`singletons` library. This work was finished in June and last Friday I gave a talk about our research at Haskell Symposium 2014. This was the first time I’ve been to ICFP and the Haskell Symposium. It was pretty cool to finally meet all these people I know only from IRC. I also admit that the atmosphere of the conference quite surprised me as it often felt like some sort of fan convention rather than the biggest event in the field of functional programming.

The paper Richard and I published is titled “Promoting Functions to Type Families in Haskell”. This work is based on Richard’s earlier paper “Dependently typed programming with singletons” presented two years ago at the Haskell Symposium. Back then Richard presented the `singletons` library that uses Template Haskell to generate singleton types and functions that operate on them. Singleton types are types that have only one value (aside from bottom), which allows reasoning about runtime values during compilation (some introduction to singletons can be found in this post on Richard’s blog). This smart encoding allows simulating some of the features of dependent types in Haskell. In our current work we extended the promotion capabilities of the library. Promotion is only concerned with generating type-level definitions from term-level ones. The type-level language in GHC has become quite expressive during the last couple of years but it is still missing many features available in the term-level language. Richard and I have found ways to encode almost all of these missing features using the already existing type-level language features. What this means is that you can write a normal term-level definition and then our library will automatically generate an equivalent type family. You’re only forbidden from using infinite terms, the `do`-notation, and decomposing `String` literals to `Char`s. Numeric literals are also very problematic and the support is very limited but some of the issues can be worked around. What is really cool is that our library allows you to have partial application at the type level, which GHC normally prohibits.

You can learn more by watching my talk on YouTube, reading the paper or the `singletons` documentation. Here I’d like to add a few details that are not present in the paper. First of all, the paper was concerned only with promotion and didn’t say anything about singletonization. But as we enabled more and more language constructs to be promoted we also made them singletonizable. So almost everything that can be promoted can also be singletonized. The most notable exception to this rule is type classes, which are not yet implemented.

An interesting issue was raised by Adam Gundry in a question after the talk: what about the difference between lazy term-level semantics and strict type-level semantics? You can listen to my answer in the video but I’ll elaborate some more on this here. At one point during our work we were wondering about this issue and decided to demonstrate an example of an algorithm that crucially relies on laziness to work, i.e. fails to work with strict semantics. I think it’s not straightforward to come up with such an algorithm but luckily I recalled the backwards state monad from Philip Wadler’s paper “The essence of functional programming”^{1}. The bind operator of that monad looks like this (definition copied from the paper):

    m `bindS` k = \s2 -> let (a,s0) = m s1
                             (b,s1) = k a s2
                         in  (b,s0)

The tricky part here is that the output of the call to `m` becomes the input to the call to `k`, while the output of the call to `k` becomes the input of `m`. Implementing this in a strict language does not at all look straightforward. So I promoted that definition expecting it to fail spectacularly but to my surprise it worked perfectly fine. After some investigation I understood what’s going on. Type-level computations performed by GHC are about constraint solving. It turns out that GHC is able to figure out in which order to solve these constraints and get the result. It’s exactly analogous to what happens with the term-level version at runtime: we have an order of dependencies between the closures and there is a way in which we can run these closures to get the final result.
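The term-level behaviour can be tried out with a small self-contained sketch. Only `bindS` follows the definition quoted from the paper; the type synonym `RState` and the helpers `unitS`, `getS`, `putS` and `example` are names I am assuming for illustration.

```haskell
-- A state transformer in which the state is threaded backwards.
type RState s a = s -> (a, s)

unitS :: a -> RState s a
unitS a = \s -> (a, s)

-- The two let bindings refer to each other: m receives the state
-- produced by k, and k receives the value produced by m. This is only
-- well-defined because let-bound patterns are matched lazily.
bindS :: RState s a -> (a -> RState s b) -> RState s b
m `bindS` k = \s2 ->
  let (a, s0) = m s1
      (b, s1) = k a s2
  in  (b, s0)

getS :: RState s s
getS = \s -> (s, s)

putS :: s -> RState s ()
putS x = \_ -> ((), x)

-- getS comes first in the program text but observes the state written
-- by the putS that follows it.
example :: RState Int Int
example = getS `bindS` \x -> putS 10 `bindS` \_ -> unitS x
```

Running `example 0` yields `(10, 10)`: the initial state `0` enters at the end of the chain, is overwritten by `putS 10`, and only then reaches `getS`.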

All of this work is a small part of a larger endeavour to push Haskell’s type system towards dependent types. With singletons you can write type-level functions easily by writing their definitions using the term-level language and then promoting these definitions. And then you can singletonize your functions to work on singleton types. There were two other talks about dependent types during the conference: Stephanie Weirich’s “Depending on Types” keynote lecture during ICFP and Richard’s “Dependent Haskell” talk during the Haskell Implementors Workshop. I encourage everyone interested in Haskell’s type system to watch both of these talks.

- The awful truth is that this monad does not really work with the released version of `singletons`. I only realized that when I was writing this post. See issue #94 on the `singletons` bug tracker.

I was wrong and I failed. After about 6 months I abandoned the project. Despite my initial optimism, upon closer examination the algorithm turned out not to be embarrassingly parallel. I could not find a good way of parallelizing it and doing things in a functional setting made things even more difficult. I don’t think I will ever get back to this project so I’m putting the code on GitHub. In this post I will give a brief overview of the algorithm, discuss the parallelization strategies I came up with and the state of the implementation. I hope that someone will pick it up and solve the problems I was unable to solve. Consider this a challenge problem in parallel programming in Haskell. I think that if a solution is found it might be worth a paper (unless it is something obvious that escaped me). In any case, please let me know if you’re interested in continuing my work.

The algorithm I wanted to parallelize is called the “lattice structure”. It is used to compute a Discrete Wavelet Transform (DWT) of a signal^{1}. I will describe how it works but will not go into details of why it works the way it does (if you’re interested in the gory details take a look at this paper).

Let’s begin by defining a two-point base operation:

This operation takes two floating-point values x and y as input and returns two new values x’ and y’ created by performing a simple matrix multiplication. In other words:

Here the matrix is determined by a single real parameter. The base operation is visualised like this:

(The idea behind the base operation is almost identical to that of the butterfly diagram used in Fast Fourier Transforms.)
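As a rough illustration, the base operation can be sketched in a few lines of Haskell. The exact matrix is not reproduced in the text, so this sketch assumes an orthogonal 2x2 matrix built from the sine and cosine of the parameter; the names `baseOp` and `alpha` are likewise assumptions.

```haskell
-- ASSUMPTION: the 2x2 matrix of the base operation is
--   [ cos a   sin a ]
--   [ sin a  -cos a ]
-- where a is the real parameter. The text above does not show the matrix.
baseOp :: Double -> (Double, Double) -> (Double, Double)
baseOp alpha (x, y) = (x * c + y * s, x * s - y * c)
  where
    c = cos alpha
    s = sin alpha
```

With this choice `baseOp 0 (1, 2)` evaluates to `(1.0,-2.0)`. For any parameter the operation preserves the sum of squares of its inputs, which is what one would expect from an orthogonal transform.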

The lattice structure accepts input of even length, sends it through a series of layers and outputs a transformed signal of the same length as the input. The lattice structure is organised into layers of base operations connected like this:

The number of layers may be arbitrary; the number of base operations depends on the length of the input signal. Within each layer all base operations are identical, i.e. they share the same value of the parameter. Each layer is shifted by one relative to its preceding layer. At the end of the signal there is a cyclic wrap-around, as denoted by the arrows in the diagram. This has to do with the edge effects. By edge effects I mean the question of what to do at the ends of a signal, where we might have fewer samples than required to actually perform our signal transformation (because the signal ends and the samples are missing). There are various approaches to this problem. Cyclic wrap-around performed by this structure means that a finite-length signal is in fact treated as if it were an infinite, cyclic signal. This approach does not give the best results, but it is very easy to implement. I decided to use it and focus on more important issues.
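The structure described above can be sketched on plain lists. This is a minimal sequential sketch, not the code from the repository: the names `layer`, `shiftLeft` and `lattice` are made up, the matrix used in `baseOp` is an assumption (the text does not show it), and the wrap-around is modelled by rotating the signal by one element between layers.

```haskell
-- ASSUMPTION: orthogonal base operation [[cos a, sin a], [sin a, -cos a]].
baseOp :: Double -> (Double, Double) -> (Double, Double)
baseOp alpha (x, y) = (x * c + y * s, x * s - y * c)
  where
    c = cos alpha
    s = sin alpha

-- One layer: the same base operation applied to consecutive pairs.
-- A trailing unpaired element (odd length) is left untouched.
layer :: Double -> [Double] -> [Double]
layer alpha (x : y : rest) = x' : y' : layer alpha rest
  where
    (x', y') = baseOp alpha (x, y)
layer _ rest = rest

-- Cyclic shift: the first element becomes the last one.
shiftLeft :: [Double] -> [Double]
shiftLeft []       = []
shiftLeft (x : xs) = xs ++ [x]

-- Send the signal through all layers, one parameter per layer,
-- shifting by one between layers (but not after the last one).
lattice :: [Double] -> [Double] -> [Double]
lattice []         sig = sig
lattice [a]        sig = layer a sig
lattice (a : rest) sig = lattice rest (shiftLeft (layer a sig))
```

For example, `lattice [0, 0] [1, 2, 3, 4]` evaluates to `[-2.0,-3.0,-4.0,-1.0]`. In this sketch the output is rotated relative to the input positions because no shift back is performed at the end.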

Note that if we don’t need to keep the original signal the lattice structure could operate in place. This allows for a memory-efficient implementation in languages that have destructive updates. If we want to keep the original signal it is enough that the first layer copies data from old array to a new one. All other layers can operate in place on the new array.

One look at the lattice structure and you see that it is parallel – base operations within a single layer are independent of each other and can easily be processed in parallel. This approach seems very appropriate for the CUDA architecture. But since I am not familiar with GPU programming I decided to begin by exploring parallelism opportunities on a standard CPU.

For CPU computations you can divide the input signal into chunks containing many base operations and distribute these chunks to threads running on different cores. The Repa library uses this parallelization strategy under the hood. The major problem here is that after each layer has been computed we need to synchronize threads to assemble the result. The question is whether the gains from parallelism are larger than this cost.

After some thought I came up with another parallelization strategy. Instead of synchronizing after each layer I would give each thread its own chunk of signal to propagate through all the layers and then merge the result at the end. This approach requires that each thread is given an input chunk that is slightly larger than the expected output. This results from the fact that here we will not perform cyclic wrap-around but instead we will narrow down the signal. This idea is shown in the image below:

This example assumes dividing the signal between two threads. Each thread receives an input signal of length 8 and produces output of length 4. A couple of issues arise with this approach. As you can see there is some overlap of computations between neighbouring threads, which means we will compute some base operations twice. I derived a formula to estimate the amount of duplicate computations, with the conclusion that in practice this issue can be completely neglected. Another issue is that the original signal has to be enlarged, because we don’t perform a wrap-around but instead expect the wrapped signal components to be part of the signal (these extra operations are marked in grey on the image above). This means that we need to create an input vector that is longer than the original one and fill it with appropriate data. We then need to slice that input into chunks, pass each chunk to a separate thread and, once all threads are finished, assemble the result. Chunking the input signal and assembling the results at the end are extra costs, but they allow us to avoid synchronizing threads between layers. Again, this approach might be implemented with Repa.
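The chunking idea can be sketched on lists as follows. Everything here is an illustrative assumption rather than the author's code: the matrix in `baseOp`, the helper names, and the requirement that the chunk size `m` is even and divides the signal length. The chunks are processed sequentially with `concatMap`; in a real implementation each chunk would go to its own thread.

```haskell
-- ASSUMPTION: orthogonal base operation [[cos a, sin a], [sin a, -cos a]].
baseOp :: Double -> (Double, Double) -> (Double, Double)
baseOp alpha (x, y) = (x * c + y * s, x * s - y * c)
  where
    c = cos alpha
    s = sin alpha

-- One layer: the same base operation on consecutive pairs.
layer :: Double -> [Double] -> [Double]
layer alpha (x : y : rest) = x' : y' : layer alpha rest
  where
    (x', y') = baseOp alpha (x, y)
layer _ rest = rest

-- Drop the first and the last element. This replaces the wrap-around:
-- the next layer is shifted by one and the signal gets narrower.
trim :: [Double] -> [Double]
trim xs = take (length xs - 2) (drop 1 xs)

-- Push one chunk through all layers, narrowing it between layers.
narrow :: [Double] -> [Double] -> [Double]
narrow []         sig = sig
narrow [a]        sig = layer a sig
narrow (a : rest) sig = narrow rest (trim (layer a sig))

-- Enlarge a non-empty signal with the elements that would otherwise
-- wrap around: 2 * (number of layers - 1) extra elements.
extend :: Int -> [Double] -> [Double]
extend nLayers sig = sig ++ take (2 * (nLayers - 1)) (cycle sig)

-- Overlapping chunks: each one yields m outputs and needs
-- m + 2 * (number of layers - 1) inputs.
chunks :: Int -> Int -> [Double] -> [[Double]]
chunks nLayers m ext =
  [ take (m + overlap) (drop start ext) | start <- [0, m .. n - 1] ]
  where
    overlap = 2 * (nLayers - 1)
    n       = length ext - overlap

-- Process every chunk independently and glue the results together.
latticeChunked :: Int -> [Double] -> [Double] -> [Double]
latticeChunked m angles sig =
  concatMap (narrow angles) (chunks nLayers m (extend nLayers sig))
  where
    nLayers = length angles
```

With three layers and `m = 4` each chunk has 4 + 2 * 2 = 8 input elements and produces 4 output elements. Since the chunks share no state, the result does not depend on the chunk size: `latticeChunked 2 [0, 0] [1, 2, 3, 4]` and `latticeChunked 4 [0, 0] [1, 2, 3, 4]` both evaluate to `[-2.0,-3.0,-4.0,-1.0]`.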

A third approach I came up with was a form of nested parallelism: distribute overlapping chunks to separate threads and have each thread compute base operations in parallel, e.g. by using SIMD instructions.

My plan was to implement various versions of the above parallelization strategies and compare their performance. When I worked in Matlab I used its profiling capabilities to get precise execution times for my code. So one of the first questions I had to answer was “how do I measure performance of my code in Haskell?” After some googling I quickly came across the criterion benchmarking library. Criterion is really convenient to use because it automatically runs the benchmarked function multiple times and performs statistical analysis of the results. It also plots the results in a very accessible form.

While criterion offered me a lot of features I needed, it also raised many questions and issues. One question was whether the forcing of lazily generated benchmark input data distorts the benchmark results. It took me several days to come up with experiments that answered this question. Another issue was that of the reliability of the results. For example I observed that results can differ significantly across runs. This is of course to be expected in a multi-tasking environment. I tried to eliminate the problem by switching my Linux to single-user mode where I could disable all background services. Still, it happened that some results differed significantly across multiple runs, which definitely points out that running benchmarks is not a good way to precisely answer the question “which of the implementations is the most efficient?”. Another observation I made about criterion was that results of benchmarking functions that use FFI depend on their placement in the benchmark suite. I was not able to solve that problem and it undermined my trust in the criterion results. Later during my work I decided to benchmark not only the functions performing the Discrete Wavelet Transform but also all the smaller components that comprise them. Some of the results were impossible for me to interpret in a meaningful way. I ended up not really trusting results from criterion.

Another tool I used for measuring parallel performance was Threadscope. This nifty program visualizes CPU load during program execution as well as garbage collection and some other events like activating threads or putting them to sleep. Threadscope provided me with some insight into what is going on when I run my program. Information from it was very valuable, although I couldn’t use it to get the most important information I needed for multi-threaded code: “how much time does the OS need to start multiple threads and synchronize them later?”.

As already mentioned, one of my goals for this project was to learn various parallelization techniques and libraries. This resulted in implementing algorithms described above in a couple of ways. First of all I used three different approaches to handle cyclic wrap-around of the signal between the layers:

- **cyclic shift** – after computing one layer perform a cyclic shift of the intermediate transformed signal. First element of the signal becomes the last, all other elements are shifted by one to the front. This is rather inefficient, especially for lists.
- **signal extension** – instead of doing cyclic shift extend the initial signal and then shorten it after each layer (this approach is required for the second parallelization strategy but it can be used in the first one as well). Constructing the extended signal is time consuming but once lattice structure computations are started the transition between layers becomes much faster for lists. For other data structures, like vectors, it is time consuming because my implementation creates a new, shorter signal and copies data from the existing vector to a new one. Since vectors provide constant-time indexing it would be possible to avoid copying by using smarter indexing. I don’t remember why I didn’t implement that.
- **smart indexing** – the most efficient way of implementing cyclic wrap-around is using indexing that shifts the base operations by one on the odd layers. Obviously, to be efficient it requires a data structure that provides constant-time indexing. It requires no copying or any other modification of output data from a layer. Thus it carries no memory and execution overhead.
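The smart indexing variant can be sketched with an immutable array from `Data.Array`. This is only an illustration of the index arithmetic: the matrix in `baseOp` is an assumption, the names `layerIx` and `latticeIx` are made up, and unlike an in-place implementation this sketch builds a fresh array for every layer and evaluates each base operation twice.

```haskell
import Data.Array (Array, bounds, elems, listArray, (!))

-- ASSUMPTION: orthogonal base operation [[cos a, sin a], [sin a, -cos a]].
baseOp :: Double -> (Double, Double) -> (Double, Double)
baseOp alpha (x, y) = (x * c + y * s, x * s - y * c)
  where
    c = cos alpha
    s = sin alpha

-- One layer with smart indexing. `off` is 0 on even layers and 1 on odd
-- layers: pairs start at positions off, off + 2, ... and the last pair of
-- an odd layer wraps around to position 0. Arrays are indexed from 0.
layerIx :: Int -> Double -> Array Int Double -> Array Int Double
layerIx off alpha sig = listArray (0, n - 1) (map out [0 .. n - 1])
  where
    n = snd (bounds sig) + 1
    out i
      -- i is the first element of its pair
      | even (i - off) = fst (baseOp alpha (sig ! i, sig ! ((i + 1) `mod` n)))
      -- i is the second element of its pair
      | otherwise      = snd (baseOp alpha (sig ! ((i - 1) `mod` n), sig ! i))

-- Run all layers; the signal itself is never shifted or extended.
latticeIx :: [Double] -> [Double] -> [Double]
latticeIx angles sig = elems (foldl step start (zip [0 ..] angles))
  where
    start               = listArray (0, length sig - 1) sig
    step arr (k, alpha) = layerIx (k `mod` 2) alpha arr
```

For example, `latticeIx [0, 0] [1, 2, 3, 4]` evaluates to `[-1.0,-2.0,-3.0,-4.0]`. Every output element stays at the position of the corresponding input element, so compared with a shift-based version the result differs only by a rotation.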

Now that we know how to implement cyclic wrap-around let’s focus on the actual implementations of the lattice structure. I only implemented the first parallelization strategy, i.e. the one that requires thread synchronization after each layer. I admit I don’t remember the exact reasons why I didn’t implement the signal-chunking strategy. I think I did some preliminary measurements and concluded that the overhead of chunking the signal is way too big. Obviously, the strategy that was supposed to use nested parallelism was also not implemented because it relied on the chunking strategy. So all of the code uses parallelism within a single layer and synchronizes threads after each layer.

Below is an alphabetic list of what you will find in my source code in the Signal.Wavelet.* modules:

- **Signal.Wavelet.C1** – I wanted to at least match the performance of C, so I made a sequential implementation in C (see cbits/dwt.c) and linked it into Haskell using FFI bindings. I was seriously worried that the overhead of calling C via FFI might distort the results, but luckily it turned out that it does not – see this post. This implementation uses smart indexing to perform cyclic wrap-around. It also operates in place (except for the first layer, as described earlier).
- **Signal.Wavelet.Eval1** – this implementation uses lists and the Eval monad. It uses cyclic shift of the input signal between layers. This implementation was not actually a serious effort. I don’t expect anything that operates on lazy lists to have decent performance in numerical computations. Surprisingly though, adding Eval turned out to be a performance killer compared to the sequential implementation on lists. I never investigated why this happens.
- **Signal.Wavelet.Eval2** – same as Eval1, but uses signal extension instead of cyclic shift. Performance is also very poor.
- **Signal.Wavelet.List1** – sequential implementation on lazy lists with cyclic shift of the signal between the layers. Written as a reference implementation to test other implementations with QuickCheck.
- **Signal.Wavelet.List2** – same as previous, but uses signal extension. I wrote it because it was only about 10 lines of code.
- **Signal.Wavelet.Repa1** – parallel and sequential implementation using Repa with cyclic shift between layers. Uses unsafe Repa operations (unsafe = no bounds checking when indexing), forces each layer after it is computed and is as strict as possible.
- **Signal.Wavelet.Repa2** – same as previous, but uses signal extension.
- **Signal.Wavelet.Repa3** – this implementation uses internals of the Repa library. To make it run you need to install a modified version of Repa that exposes its internal modules. In this implementation I created a new type of Repa array that represents a lattice structure. With this implementation I wanted to see if I can get better performance from Repa if I place the lattice computations inside the array representation. This implementation uses smart indexing.
- **Signal.Wavelet.Vector1** – this implementation is a Haskell rewrite of the C algorithm that was supposed to be my baseline. It uses mutable vectors and lots of unsafe operations. The code is ugly – it is in fact an imperative algorithm written in a functional language.

In most of the above implementations I tried to write my code in a way that is idiomatic to functional languages. After all this is what the Haskell propaganda advertised – parallelism (almost) for free! The exceptions are Repa3 and Vector1 implementations.

Criterion tests each of the above implementations by feeding it a vector containing 16384 elements and then performing a 6-layer transformation. Each implementation is benchmarked 100 times. Based on these 100 runs criterion computes average runtime, standard deviation, influence of outlying results on the average and a few more things like plotting the results. Below are the benchmarking results on an Intel i7 M620 CPU using two cores:

“DWT” prefix of all the benchmarks denotes the forward DWT. There is also the IDWT (inverse DWT) but the results are similar so I elided them. “Seq” suffix denotes sequential implementation, “Par” suffix denotes parallel implementation. As you can see there are no results for the Eval* implementations. The reason is that they are so slow that differences between other implementations become invisible on the bar chart.

The results are interesting. First of all the C implementation is really fast. The only Haskell implementation that comes close to it is Vector1. Too bad the code of Vector1 relies on tons of unsafe operations and isn’t written in functional style at all. All Repa implementations are noticeably slower. The interesting part is that for Repa1 and Repa2 using parallelism slows down execution by a factor of 2. For some reason this is not the case for Repa3, where parallelism improves performance. Sadly, Repa3 is as slow as the implementation that uses lazy lists.

The detailed results, which I’m not presenting here because there’s a lot of them, raise more questions. For example in one of the benchmarks run on a slower machine most of the running times for the Repa1 implementation were around 3.25ms. But there was one that was only around 1ms. What to make of such a result? Were all the runs, except for this one, slowed down by some background process? Is it some mysterious caching effect? Or is it just some criterion glitch? There were many such questions where I wasn’t able to figure out the answer by looking at the criterion results.

There are more benchmarks in the sources – see the benchmark suite file.

In retrospect I can identify several mistakes that eventually led to the failure of this project. Firstly, I think that focusing on CPU implementations instead of GPU was wrong. My plan was to quickly deal with the CPU implementations, which I thought I knew how to do, and then figure out how to implement these algorithms on a GPU. However, the CPU implementation turned out to be much slower than I expected and I spent a lot of time trying to actually make my CPU code faster. In the end I never even attempted a GPU implementation.

An important theoretical issue that I should have addressed early in the project is how big an input signal I need to benefit from parallelism. Parallelism based on multiple threads comes with a cost of launching and synchronizing threads. Given that the Repa implementations do that for each layer I really pay a lot of extra cost. As you’ve seen my benchmarks use vectors with 16K elements. The problem is that this seems not enough to benefit from parallelism and at the same time it is much more than encountered in typical real-world applications of DWT. So perhaps there is no point in parallelizing the lattice structure, other than using SIMD instructions?

I think the main reason why this project failed is that I did not have sufficient knowledge of parallelism. I had read several papers on Repa and DPH and thought that I knew enough to implement a parallel version of an algorithm I am familiar with. I struggled to understand the benchmark results that I got from criterion but in hindsight I think this was not a good approach. The right thing to do was to look at the generated assembly, something that I did not know how to do at that time. I should also have had a deeper understanding of hardware and thread handling by the operating system. As a side note, I think this shows that parallelism is not really for free and still requires some arcane knowledge from the programmer. I guess there is a lot to do in the research on parallelism in Haskell.

I undertook a project that seemed like a relatively simple task but it ended up as a failure. This was not the first and probably not the last time in my career – it’s just the way science is. I think the major factor that contributed to the failure was me not realizing that I had insufficient knowledge. But I don’t consider my work on this to be a wasted effort. I learned how to use FFI and how to benchmark and test my code. This in turn led to many posts on this blog.

What remains is an unanswered question: how to implement an efficient, parallel lattice structure in Haskell? I hope that thanks to this post and putting my code on GitHub someone will answer this question.

During my work on this project I contacted Ben Lippmeier, author of the Repa library. Ben helped me realize some things that I had missed in my work. That sped up my decision to abandon this project and I thank Ben for that.

**UPDATE (18/06/2014)**, supersedes previous update from (28/05/2014)

One of the comments below suggests it would be interesting to see the performance of a parallel implementation in C++ or Rust. Thus I have improved the C implementation to use SSE3 SIMD instructions. There’s not too much parallelism there: the main idea is that we can pack both input parameters of the base operation into a single XMM register. This reduces the number of multiplications by two. Also, addition and subtraction are now done by a single instruction. This work is now merged into master. SSE3 support is controlled via the `-sse3` flag in the cabal file. From the benchmarks it seems that in practice the performance gain is small.

A possible next step I’d like to undertake one day is implementing the lattice structure using AVX instructions. With 256-bit registers this will make it possible to compute two base operations in a single loop iteration. Sadly, at the moment I don’t have access to an AVX-enabled CPU.

- Orthogonal transform, to be more precise. It is possible to construct lattice structures for biorthogonal wavelets, but that is well beyond the scope of this post.

`--show-options` flag that lists all command-line flags. This feature can be used to auto-complete command-line flags in shells that support completion. To enable auto-completion in Bash add this code snippet to your ~/.bashrc file:

    # Autocomplete GHC commands
    _ghc()
    {
        local envs=`ghc --show-options`
        # get the word currently being completed
        local cur=${COMP_WORDS[$COMP_CWORD]}

        # the resulting completions should be put into this array
        COMPREPLY=( $( compgen -W "$envs" -- $cur ) )
    }
    complete -F _ghc -o default ghc

From my experience the first completion is a bit slow but once the flags are cached things work fast.

- Please ignore the 7.8.1 release. It shipped with a bug that caused rejection of some valid programs.

`tasty-hunit-adapter` lets you import existing HUnit tests into tasty (hackage, github):

    module Main where

    import Test.HUnit               ( (~:), (@=?)            )
    import Test.Tasty               ( defaultMain, testGroup )
    import Test.Tasty.HUnit.Adapter ( hUnitTestToTestTree    )

    main :: IO ()
    main = defaultMain $ testGroup "Migrated from HUnit" $
           hUnitTestToTestTree ("HUnit test" ~: 2 + 2 @=? 4)

`tasty-program` lets you run an external program and test whether it terminates successfully (hackage, github):

    module Main (
      main
     ) where

    import Test.Tasty
    import Test.Tasty.Program

    main :: IO ()
    main = defaultMain $ testGroup "Compilation with GHC" $ [
        testProgram "Foo" "ghc" ["-fforce-recomp", "foo.hs"] Nothing
      ]

This package has only this basic functionality at the moment. A missing feature is the possibility of logging stdout and stderr to a file so that it can later be inspected or perhaps used by a golden test (but for the latter tasty needs test dependencies).