I was wrong and I failed. After about six months I abandoned the project. Despite my initial optimism, on closer examination the algorithm turned out not to be embarrassingly parallel: I could not find a good way of parallelizing it, and working in a functional setting made things even more difficult. I don’t think I will ever get back to this project, so I’m putting the code on GitHub. In this post I give a brief overview of the algorithm, discuss the parallelization strategies I came up with, and describe the state of the implementation. I hope that someone will pick it up and solve the problems I was unable to solve. Consider this a challenge problem in parallel programming in Haskell. I think that if a solution is found it might be worthy of a paper (unless it is something obvious that escaped me). In any case, please let me know if you’re interested in continuing my work.

The algorithm I wanted to parallelize is called the “lattice structure”. It is used to compute a Discrete Wavelet Transform (DWT) of a signal^{1}. I will describe how it works but will not go into details of why it works the way it does (if you’re interested in the gory details take a look at this paper).

Let’s begin by defining a two-point base operation:

This operation takes two floating-point values x and y as input and returns two new values x’ and y’ obtained by a simple matrix multiplication. In other words:

where the matrix entries are determined by a single real parameter. The base operation is visualised like this:

(The idea behind the base operation is almost identical to that of the butterfly diagram used in Fast Fourier Transforms.)
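To make this concrete, here is a minimal Haskell sketch of a two-point base operation. The rotation-style matrix and the name `baseOp` are illustrative assumptions; the actual coefficients come from the wavelet being implemented:

```haskell
-- A sketch of the two-point base operation. The rotation-style matrix
-- below is an illustrative assumption, not necessarily the exact matrix
-- from the paper; the angle 'a' plays the role of the real parameter.
baseOp :: Double -> (Double, Double) -> (Double, Double)
baseOp a (x, y) = ( x * cos a + y * sin a
                  , x * sin a - y * cos a )
```

For `a = 0` this reduces to `(x, -y)`, which makes it easy to sanity-check an implementation.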

The lattice structure accepts an input of even length, sends it through a series of layers, and outputs a transformed signal of the same length as the input. It is organised into layers of base operations connected like this:

The number of layers may be arbitrary; the number of base operations depends on the length of the input signal. Within each layer all base operations are identical, i.e. they share the same value of the parameter. Each layer is shifted by one relative to the preceding layer. At the end of the signal there is a cyclic wrap-around, as denoted by the arrows. This has to do with edge effects: the question of what to do at the ends of a signal, where we may have fewer samples than required to perform the transformation (because the signal ends and the samples are missing). There are various approaches to this problem. The cyclic wrap-around performed by this structure means that a finite-length signal is in fact treated as if it were an infinite, cyclic signal. This approach does not give the best results, but it is very easy to implement, so I decided to use it and focus on more important issues.

Note that if we don’t need to keep the original signal, the lattice structure can operate in place. This allows for a memory-efficient implementation in languages that have destructive updates. If we want to keep the original signal, it is enough for the first layer to copy data from the old array to a new one; all other layers can operate in place on the new array.
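The copy-then-mutate idea can be sketched with the vector package. This is a hedged illustration, not code from the project: the base-operation coefficients are placeholders, and only the first layer is shown (subsequent layers would keep mutating the same buffer):

```haskell
import Control.Monad (forM_)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as M

-- Sketch: the first layer reads from the immutable input and writes into
-- a fresh mutable vector; later layers could update 'out' in place.
-- The base-operation coefficients below are illustrative placeholders.
firstLayer :: Double -> V.Vector Double -> V.Vector Double
firstLayer a input = V.create $ do
  out <- M.new n
  forM_ [0, 2 .. n - 2] $ \i -> do
    let x = input V.! i
        y = input V.! (i + 1)
    M.write out i       (x * cos a + y * sin a)
    M.write out (i + 1) (x * sin a - y * cos a)
  return out
  where n = V.length input
```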

One look at the lattice structure and you can see that it is parallel – base operations within a single layer are independent of each other and can easily be processed in parallel. This seems a perfect fit for the CUDA architecture, but since I am not familiar with GPU programming I decided to begin by exploring parallelism on a standard CPU.

For CPU computations you can divide the input signal into chunks containing many base operations and distribute these chunks to threads running on different cores. The Repa library uses this parallelization strategy under the hood. The major problem here is that after each layer has been computed we need to synchronize the threads to assemble the result. The question is whether the gains from parallelism outweigh this cost.
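A sketch of this first strategy, using the `parallel` package's `Control.Parallel.Strategies` rather than Repa, with a placeholder base operation. Chunks of base operations are evaluated in parallel, and the per-layer synchronisation happens implicitly when the whole layer is demanded:

```haskell
import Control.Parallel.Strategies (parListChunk, rdeepseq, withStrategy)

-- Sketch of the first strategy: pair adjacent samples, apply a placeholder
-- base operation to chunks of pairs in parallel, then flatten. All chunks
-- must finish before the next layer can start, which is where the
-- per-layer synchronisation cost shows up.
parLayer :: Double -> [Double] -> [Double]
parLayer a = concat
           . withStrategy (parListChunk 512 rdeepseq)
           . map op
           . pairs
  where
    pairs (x : y : rest) = (x, y) : pairs rest
    pairs _              = []
    op (x, y) = [x * cos a + y * sin a, x * sin a - y * cos a]
```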

After some thought I came up with another parallelization strategy. Instead of synchronizing after each layer, I would give each thread its own chunk of the signal to propagate through all the layers and merge the results at the end. This approach requires that each thread is given an input chunk slightly larger than the expected output, because here we do not perform cyclic wrap-around but instead narrow down the signal. The idea is shown in the image below:

This example assumes dividing the signal between two threads. Each thread receives an input signal of length 8 and produces an output of length 4. A couple of issues arise with this approach. As you can see, there is some overlap of computations between neighbouring threads, which means we compute some base operations twice. I derived a formula to estimate the amount of duplicated computation and concluded that in practice this issue can be completely neglected. Another issue is that the original signal has to be enlarged, because we don’t perform a wrap-around but instead expect the wrapped signal components to be part of the signal (these extra operations are marked in grey in the image above). This means that we need to create an input vector that is longer than the original one and fill it with appropriate data. We then need to slice that input into chunks, pass each chunk to a separate thread, and once all threads are finished assemble the result. Chunking the input signal and assembling the results at the end are extra costs, but they allow us to avoid synchronizing threads between layers. Again, this approach might be implemented with Repa.
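The chunking step might look something like the following sketch. The helper name and the chunk arithmetic are hypothetical; the exact overlap depends on the number of layers:

```haskell
-- Hypothetical helper: slice the (already extended) signal into chunks
-- that share 'overlap' samples with their successor, so each thread can
-- propagate its chunk through all layers without talking to neighbours.
-- Assumes chunkLen > overlap, otherwise the recursion never terminates.
overlappingChunks :: Int -> Int -> [a] -> [[a]]
overlappingChunks chunkLen overlap xs
  | length xs <= chunkLen = [xs]
  | otherwise             = take chunkLen xs
                          : overlappingChunks chunkLen overlap
                              (drop (chunkLen - overlap) xs)
```

For example, with a chunk length of 8 and an overlap of 4, a 12-element signal is split into two chunks of 8 that share their middle 4 elements.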

A third approach I came up with was a form of nested parallelism: distribute overlapping chunks to separate threads and have each thread compute base operations in parallel, e.g. by using SIMD instructions.

My plan was to implement various versions of the above parallelization strategies and compare their performance. When I worked in Matlab I used its profiling capabilities to get precise execution times for my code, so one of the first questions I had to answer was: how do I measure the performance of my code in Haskell? After some googling I quickly came across the criterion benchmarking library. Criterion is really convenient to use because it automatically runs the benchmarked function multiple times and performs a statistical analysis of the results. It also plots the results in a very accessible form.
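A minimal criterion benchmark suite looks roughly like this; the benchmarked function here is a stand-in, not the actual DWT code:

```haskell
import Criterion.Main (bench, bgroup, defaultMain, nf)

-- Skeleton of a criterion benchmark suite. 'nf' forces the result to
-- normal form so that laziness does not hide the real cost of the
-- computation being measured.
main :: IO ()
main = defaultMain
  [ bgroup "dwt"
      [ bench "placeholder" $ nf (map (* 2)) [1 .. 16384 :: Double]
      ]
  ]
```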

While criterion offered a lot of the features I needed, it also raised many questions and issues. One question was whether forcing lazily generated benchmark input data distorts the benchmark results; it took me several days to come up with experiments that answered it. Another issue was the reliability of the results. For example, I observed that results can differ significantly across runs. This is of course to be expected in a multi-tasking environment. I tried to eliminate the problem by switching my Linux machine to single-user mode, where I could disable all background services. Still, some results differed significantly across multiple runs, which shows that running benchmarks is not a good way to precisely answer the question “which of the implementations is the most efficient?”. Another observation I made was that the results of benchmarking functions that use the FFI depend on their placement in the benchmark suite. I was not able to solve that problem and it undermined my trust in criterion’s results. Later during my work I decided to benchmark not only the functions performing the Discrete Wavelet Transform but also all the smaller components that comprise them. Some of the results were impossible for me to interpret in a meaningful way. I ended up not really trusting results from criterion.

Another tool I used for measuring parallel performance was Threadscope. This nifty program visualizes CPU load during program execution as well as garbage collection and some other events like activating threads or putting them to sleep. Threadscope provided me with some insight into what is going on when I run my program. Information from it was very valuable although I couldn’t use it to get the most important information I needed for a multi-threaded code: “how much time does the OS need to start multiple threads and synchronize them later?”.

As already mentioned, one of my goals for this project was to learn various parallelization techniques and libraries. This resulted in implementing algorithms described above in a couple of ways. First of all I used three different approaches to handle cyclic wrap-around of the signal between the layers:

- **cyclic shift** – after computing one layer, perform a cyclic shift of the intermediate transformed signal: the first element of the signal becomes the last, all other elements are shifted by one towards the front. This is rather inefficient, especially for lists.
- **signal extension** – instead of doing a cyclic shift, extend the initial signal and then shorten it after each layer (this approach is required for the second parallelization strategy, but it can be used in the first one as well). Constructing the extended signal is time-consuming, but once the lattice structure computations are started the transition between layers becomes much faster for lists. For other data structures, like vectors, it is time-consuming because my implementation creates a new, shorter signal and copies data from the existing vector to a new one. Since vectors provide constant-time indexing it would be possible to avoid copying by using smarter indexing. I don’t remember why I didn’t implement that.
- **smart indexing** – the most efficient way of implementing cyclic wrap-around is using indexing that shifts the base operations by one on the odd layers. Obviously, to be efficient this requires a data structure that provides constant-time indexing. It requires no copying or any other modification of a layer’s output, so it carries no memory or execution overhead.
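The cyclic-shift variant on lists is only a couple of lines; this sketch matches the description above:

```haskell
-- Cyclic shift used between layers: the first element becomes the last,
-- everything else moves one position towards the front. O(n) on lists,
-- since (++) must rebuild the whole spine.
cyclicShift :: [a] -> [a]
cyclicShift []       = []
cyclicShift (x : xs) = xs ++ [x]
```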

Now that we know how to implement cyclic wrap-around, let’s focus on the actual implementations of the lattice structure. I only implemented the first parallelization strategy, i.e. the one that requires thread synchronization after each layer. I admit I don’t remember the exact reasons why I didn’t implement the signal-chunking strategy; I think I did some preliminary measurements and concluded that the overhead of chunking the signal was way too big. Obviously, the strategy that was supposed to use nested parallelism was also not implemented, because it relied on the chunking strategy. So all of the code uses parallelism within a single layer and synchronizes threads after each layer.

Below is an alphabetic list of what you will find in my source code in the Signal.Wavelet.* modules:

- **Signal.Wavelet.C1** – I wanted to at least match the performance of C, so I made a sequential implementation in C (see cbits/dwt.c) and linked it into Haskell using FFI bindings. I had serious doubts whether the overhead of calling C via the FFI might distort the results, but luckily it turned out that it does not – see this post. This implementation uses smart indexing to perform cyclic wrap-around. It also operates in place (except for the first layer, as described earlier).
- **Signal.Wavelet.Eval1** – this implementation uses lists and the Eval monad, with a cyclic shift of the input signal between layers. It was not actually a serious effort – I don’t expect anything that operates on lazy lists to have decent performance in numerical computations. Surprisingly though, adding Eval turned out to be a performance killer compared to the sequential implementation on lists. I never investigated why this happens.
- **Signal.Wavelet.Eval2** – same as Eval1, but uses signal extension instead of cyclic shift. Performance is also very poor.
- **Signal.Wavelet.List1** – sequential implementation on lazy lists with a cyclic shift of the signal between layers. Written as a reference implementation to test other implementations with QuickCheck.
- **Signal.Wavelet.List2** – same as the previous one, but uses signal extension. I wrote it because it was only about 10 lines of code.
- **Signal.Wavelet.Repa1** – parallel and sequential implementations using Repa with a cyclic shift between layers. Uses unsafe Repa operations (unsafe = no bounds checking when indexing), forces each layer after it is computed and is as strict as possible.
- **Signal.Wavelet.Repa2** – same as the previous one, but uses signal extension.
- **Signal.Wavelet.Repa3** – this implementation uses the internals of the Repa library. To make it run you need to install a modified version of Repa that exposes its internal modules. In this implementation I created a new type of Repa array that represents a lattice structure. With it I wanted to see if I could get better performance from Repa by placing the lattice computations inside the array representation. This implementation uses smart indexing.
- **Signal.Wavelet.Vector1** – this implementation is a Haskell rewrite of the C algorithm that was supposed to be my baseline. It uses mutable vectors and lots of unsafe operations. The code is ugly – it is in fact an imperative algorithm written in a functional language.

In most of the above implementations I tried to write my code in a way that is idiomatic to functional languages. After all, this is what the Haskell propaganda advertised – parallelism (almost) for free! The exceptions are the Repa3 and Vector1 implementations.

Criterion tests each of the above implementations by feeding it a vector containing 16384 elements and then performing a 6-layer transformation. Each implementation is benchmarked 100 times. Based on these 100 runs criterion computes the average runtime, the standard deviation, the influence of outlying results on the average, and a few more things, and plots the results. Below are the benchmarking results on an Intel i7 M620 CPU using two cores:

“DWT” prefix of all the benchmarks denotes the forward DWT. There is also the IDWT (inverse DWT) but the results are similar so I elided them. “Seq” suffix denotes sequential implementation, “Par” suffix denotes parallel implementation. As you can see there are no results for the Eval* implementations. The reason is that they are so slow that differences between other implementations become invisible on the bar chart.

The results are interesting. First of all, the C implementation is really fast. The only Haskell implementation that comes close to it is Vector1. Too bad the code of Vector1 relies on tons of unsafe operations and isn’t written in a functional style at all. All the Repa implementations are noticeably slower. The interesting part is that for Repa1 and Repa2 using parallelism slows down execution by a factor of 2. For some reason this is not the case for Repa3, where parallelism improves performance. Sadly, Repa3 is as slow as the implementations that use lazy lists.

The detailed results, which I’m not presenting here because there’s a lot of them, raise more questions. For example in one of the benchmarks run on a slower machine most of the running times for the Repa1 implementation were around 3.25ms. But there was one that was only around 1ms. What to make of such a result? Were all the runs, except for this one, slowed down by some background process? Is it some mysterious caching effect? Or is it just some criterion glitch? There were many such questions where I wasn’t able to figure out the answer by looking at the criterion results.

There are more benchmarks in the sources – see the benchmark suite file.

Looking back, I can identify several mistakes that eventually led to the failure of this project. Firstly, I think that focusing on CPU implementations instead of the GPU was wrong. My plan was to quickly deal with the CPU implementations, which I thought I knew how to do, and then figure out how to implement these algorithms on a GPU. However, the CPU implementations turned out to be much slower than I expected and I spent a lot of time trying to make my CPU code faster. In the end I never even attempted a GPU implementation.

An important theoretical issue that I should have addressed early in the project is how big an input signal needs to be to benefit from parallelism. Parallelism based on multiple threads comes with the cost of launching and synchronizing threads, and given that the Repa implementations pay that cost for each layer, the total overhead is significant. As you’ve seen, my benchmarks use vectors with 16K elements. The problem is that this seems not enough to benefit from parallelism, and at the same time it is much more than is encountered in typical real-world applications of the DWT. So perhaps there is no point in parallelizing the lattice structure, other than using SIMD instructions?

I think the main reason this project failed is that I did not have sufficient knowledge of parallelism. I had read several papers on Repa and DPH and thought that I knew enough to implement a parallel version of an algorithm I was familiar with. I struggled to understand the benchmark results I got from criterion, but in hindsight staring at those numbers was not a good approach: the right thing to do was to look at the generated assembly, something I did not know how to do at the time. I should also have had a deeper understanding of the hardware and of thread handling by the operating system. As a side note, I think this shows that parallelism is not really for free and still requires some arcane knowledge from the programmer. I guess there is a lot left to do in research on parallelism in Haskell.

I undertook a project that seemed like a relatively simple task, but it ended up a failure. This was not the first and probably not the last time in my career – it’s just the way science is. I think the major factor that contributed to the failure was not realizing that my knowledge was insufficient. But I don’t consider this work a wasted effort: I learned how to use the FFI and how to benchmark and test my code, which in turn led to many posts on this blog.

What remains is an unanswered question: how does one implement an efficient, parallel lattice structure in Haskell? I hope that thanks to this post and to putting my code on GitHub someone will answer it.

During my work on this project I contacted Ben Lippmeier, author of the Repa library. Ben helped me realize some things that I have missed in my work. That sped up my decision to abandon this project and I thank Ben for that.

**UPDATE (18/06/2014)**, supersedes the previous update from 28/05/2014

One of the comments below suggests it would be interesting to see the performance of a parallel implementation in C++ or Rust. Thus I have improved the C implementation to use SSE3 SIMD instructions. There’s not too much parallelism there: the main idea is that both input parameters of the base operation can be packed into a single XMM register, which halves the number of multiplication instructions; the addition and the subtraction are also now done by a single instruction. This work is now merged into master. SSE3 support is controlled via the `-sse3` flag in the cabal file. From the benchmarks it seems that in practice the performance gain is small.

A possible next step I’d like to undertake one day is implementing the lattice structure using AVX instructions. With 256-bit registers this would allow computing two base operations in a single loop iteration. Sadly, at the moment I don’t have access to an AVX-enabled CPU.

- An orthogonal transform, to be more precise. It is possible to construct lattice structures for biorthogonal wavelets, but that is well beyond the scope of this post.

`--show-options`

flag that lists all command-line flags. This feature can be used to auto-complete command-line flags in shells that support this feature. To enable auto-completion in Bash add this code snippet to your ~/.bashrc file:
```bash
# Autocomplete GHC commands
_ghc() {
    local envs=`ghc --show-options`
    # get the word currently being completed
    local cur=${COMP_WORDS[$COMP_CWORD]}
    # the resulting completions should be put into this array
    COMPREPLY=( $( compgen -W "$envs" -- $cur ) )
}
complete -F _ghc -o default ghc
```

From my experience the first completion is a bit slow but once the flags are cached things work fast.

- Please ignore the 7.8.1 release. It shipped with a bug that caused rejection of some valid programs.

`tasty-hunit-adapter` allows importing existing HUnit tests into tasty (hackage, github):

```haskell
module Main where

import Test.HUnit ( (~:), (@=?) )
import Test.Tasty ( defaultMain, testGroup )
import Test.Tasty.HUnit.Adapter ( hUnitTestToTestTree )

main :: IO ()
main = defaultMain $ testGroup "Migrated from HUnit" $
       hUnitTestToTestTree ("HUnit test" ~: 2 + 2 @=? 4)
```

`tasty-program` allows running an external program and testing whether it terminates successfully (hackage, github):

```haskell
module Main ( main ) where

import Test.Tasty
import Test.Tasty.Program

main :: IO ()
main = defaultMain $ testGroup "Compilation with GHC" $
       [ testProgram "Foo" "ghc" ["-fforce-recomp", "foo.hs"] Nothing ]
```

This package has only this basic functionality at the moment. A missing feature is the possibility of logging stdout and stderr to a file so that it can later be inspected or perhaps used by a golden test (but for the latter tasty needs test dependencies).

As a response to the test-framework package being unmaintained, Roman Cheplyaka has released tasty (original announcement here). Since its release in August 2013 tasty has received packages supporting integration with QuickCheck, HUnit, SmallCheck, and hspec, as well as support for golden testing and a few others. I decided to give tasty a try and use it in my haskell-testing-stub project. Tasty turned out to be almost a drop-in replacement for test-framework. I had to update the cabal file (quite obviously), change imports to point to tasty rather than test-framework, and replace usage of the `[Test]` type with `TestTree`. The only problem I encountered was adapting tests from HUnit. It turns out that the tasty-hunit package does not have a function that allows using an existing suite of HUnit tests. That feature was present in test-framework-hunit as the `hUnitTestToTests` function. I mailed Roman about this and his reply was that this was intentional, as he does not “believe it adds anything useful to the API (i.e. the way to *write* code).” That’s not a big issue though, as it was easy to adapt the missing function (although I think I’ll just put it in a separate package and release it so others don’t have to reinvent the wheel).

I admit that at this point I am not sure whether switching from test-framework to tasty is a good move. The fact that tasty is actively developed is a huge plus although test-framework has reached a mature state so perhaps active development is no longer of key importance. Also, test-framework still has more supporting libraries than tasty. Migrating them should be easy but up till now no one has done it. So I’m not arguing heavily for tasty. This is more like an experiment to see how it works.

- Fun with type functions (2011) – Simon PJ’s presentation of the tutorial paper with the same title. Covers associated data types and type families (see “Associated Types With Class” for an in-depth presentation) + some stuff found in Data Parallel Haskell (read “Data Parallel Haskell: a status report” for more details). The whole presentation feels like a teaser as it ends quite quickly and skips some really interesting examples found in the paper.
- Types a la Milner (2012) by Benjamin C. Pierce (he’s the author of the book about types “Types and Programming Languages”). The talk covers a bit of programming languages history, type systems in general (“well-typed programs don’t go wrong”), type inference in the presence of polymorphism and using types to manage security of personal information. I found the type inference and historical parts very interesting.
- The trouble with types (2013) by Martin Odersky (creator of Scala). Talk covers the role of types in programming, presents the spectrum of static type systems and then focuses on innovations in the type system of Scala.
- I also found an interesting blog hosted on GitHub. Despite having only 10 posts, the blog has lots of stuff on practical type-level programming in Haskell. Highly recommended.

A draft version of the paper can be downloaded here. It comes with companion source code that contains a thorough discussion of the concepts presented in the paper, as well as others that didn’t make it into the publication due to space limitations. The companion code is available on GitHub (the tag “blog-post-draft-release” points to today’s version). The paper is mostly finished; it should only receive small corrections and spelling fixes. However, if you have any suggestions or comments please share them with me – the submission deadline is in three weeks, so there is still time to include them.

First and foremost, I am using Linux on all of my machines. Debian is my distro of choice, but any *nix-based system will do. That said, I believe the things I describe below can’t be done on Windows, unless you’re using Cygwin. But then again, if you work under Cygwin then maybe it’s time to switch to Linux instead of faking it?

One thing I quickly learned is that it is useful to have access to different versions of GHC and – if you’re working on the backend – LLVM. It is also useful to be able to install the latest GHC HEAD as your system-wide GHC installation. I know there are tools designed to automate sandboxing, like hsenv, but I decided to use the sandboxing method described by Edsko. This method is essentially based on setting your path to point to certain symlinks and then switching these symlinks to point to different GHC installations. Since I’ve been using this heavily, I wrote a script that manages sandboxes in a neat way. When run without parameters it displays the list of sandboxes in a fashion identical to the `git branch` command. When given a sandbox name it makes that sandbox active. It can also add new and remove existing sandboxes. It is even smart enough to prevent removal of the default sandbox. Finally, I’ve set up my `.bashrc` file to provide auto-completion of sandbox names. Here’s how it looks in practice:

This is probably obvious to anyone working under Linux: script as much as you can. If you find yourself doing something for the second or third time then this particular activity should be scripted. I know how hard it is to convince yourself to dedicate 10 or 15 minutes to write a script when you can do the task in 1 minute, but this effort will quickly pay off. I have scripts for pulling the GHC source repositories (even though I do it really seldom), resetting the GHC build tree, starting tmux sessions and a couple of other things.

In the beginning I wrote my scripts in an ad-hoc way, with all the paths hardcoded. This turned out to be a pain when I decided to reorganize my directory structure. The moral is: define paths to commonly used directories as environment variables in your shell’s configuration file (`~/.bashrc` in the case of bash) and make your scripts depend on those variables. This will save you a lot of work when you decide to move your directories around. I’ve also defined some assertion functions in my `.bashrc` file. I use them to check whether the required variables are set; if not, the script fails gracefully.

Bash has built-in auto-completion support. It allows you to get auto-completion of parameters for commonly used commands. I have auto-completion for cabal and for my sandbox management scripts. When GHC 7.8 comes out it will support auto-completion as well.

I use Emacs for development despite my initial scepticism. Since configuring Emacs is a nightmare I started a page on GHC wiki to gather useful tips, tricks and configurations in one place so that others can benefit from them. Whatever editor you are using make sure that you take as much advantage of its features as possible.

GHC wiki describes how to set up Firefox to quickly find tickets by number. Use that to your benefit.

Geoffrey Mainland managed to convince me to use `make`, and I thank him for that. Makefiles are a great help if you’re debugging GHC and need to repeatedly recompile a test case and possibly analyse some Core or Cmm dumps. Writing the first Makefile is probably the biggest pain, but later you can reuse it as a template. See here for some example Makefiles I used for debugging.

The goal of this post was to convince you that spending time on configuring and scripting your GHC development environment is an investment. It will pay off and allow you to focus on the important things that really require your attention. Remember that most of the configuration and scripts described in this post are available on GitHub.

A type system can be regarded as calculating a kind of static approximation to the run-time behaviours of the terms in a program.

So if a type system is a static approximation of a program’s run-time behaviour, a natural question to ask is: “how accurate can this approximation be?” It turns out it can be very accurate.

Let’s assume that we have the following definition of natural numbers^{1}:

```agda
data Nat : Set where
  zero : Nat
  suc  : Nat → Nat
```

The first constructor – `zero` – says that zero is a natural number. The second – `suc` – says that the successor of any natural number is also a natural number. This representation encodes `0` as `zero`, `1` as `suc zero`, `2` as `suc (suc zero)` and so on^{2}. Let’s also define a type of booleans to represent logical true and false:

```agda
data Bool : Set where
  false : Bool
  true  : Bool
```

We can now define a `≥` operator that returns `true` if its arguments are in the greater-than-or-equal relation and `false` if they are not:

```agda
_≥_ : Nat → Nat → Bool
m     ≥ zero  = true
zero  ≥ suc n = false
suc m ≥ suc n = m ≥ n
```

This definition has three cases. The first says that any natural number is greater than or equal to zero. The second says that zero is not greater than or equal to any successor. The final case says that two non-zero natural numbers are in the ≥ relation if their predecessors are also in that relation. What happens if we replace `false` with `true` in our definition?

```agda
_≥_ : Nat → Nat → Bool
m     ≥ zero  = true
zero  ≥ suc n = true
suc m ≥ suc n = m ≥ n
```

Well… nothing. We get a function with nonsense semantics, but other than that it is well-typed, and the type system won’t catch this mistake. The reason is that our function returns a result but doesn’t say *why* that result is true. And since `≥` doesn’t give us any evidence that the result is correct, there is no way of statically checking whether the implementation is correct or not.

But it turns out that we can do better using dependent types: we can write a comparison function that proves its result correct. Let’s forget our definition of the `≥` function and instead define a datatype called `≥`:

```agda
data _≥_ : Nat → Nat → Set where
  ge0 : {y : Nat} → y ≥ zero
  geS : {x y : Nat} → x ≥ y → suc x ≥ suc y
```

This type has two `Nat` indices that parametrize it. For example, `5 ≥ 3` and `2 ≥ 0` are two distinct types. Notice that each constructor can only be used to construct values of specific types: `ge0` constructs a value that belongs to types like `0 ≥ 0`, `1 ≥ 0`, `3 ≥ 0` and so on, while `geS`, given a value of type `x ≥ y`, constructs a value of type `suc x ≥ suc y`.

There are a few interesting properties of the `≥` datatype. Notice that not only can `ge0` construct values of types `y ≥ 0`, it is also the only possible value of such types. In other words, the only value of `0 ≥ 0`, `1 ≥ 0` or `3 ≥ 0` is `ge0`. Types like `5 ≥ 3` also have only one value (in the case of `5 ≥ 3` it is `geS (geS (geS ge0))`). That's why we call `≥` a *singleton type*. Note also that there is no way to construct values of types like `0 ≥ 3` or `2 ≥ 5` – there are no constructors that we could use to get a value of such a type. We will thus say that the `≥` datatype is a witness (or evidence): if we can construct a value for given two indices, then this value is a witness that the relation represented by the `≥` datatype holds. For example, `geS (geS ge0)` is a witness that the relations `2 ≥ 2` and `5 ≥ 2` hold, but there is no way to provide evidence that `0 ≥ 1` holds. Notice that the previous definition of the `≥` function had three cases: one base case for `true`, one base case for `false` and one inductive case. The `≥` datatype has only two cases: one being the equivalent of `true` and one inductive. Because a value of `≥` exists if and only if its two indices are in the ≥ relation, there is no need to represent `false` explicitly.
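To make the witness idea concrete, here is a small sketch of my own (not from the original text; it assumes the `_≥_` datatype above and follows the post's convention of writing numerals for `suc`/`zero`):

```agda
-- The unique witness of 5 ≥ 3 mentioned above: peel three sucs off
-- both indices with geS, then close with ge0 (used here at type 2 ≥ 0).
5≥3 : 5 ≥ 3
5≥3 = geS (geS (geS ge0))

-- The impossible direction cannot even be written down:
-- 0≥1 : 0 ≥ 1
-- 0≥1 = ?   -- no constructor can produce a value of this type
```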

We now have a way to express a proof that one value is greater than or equal to another. Let's construct a datatype that can say whether one value is greater than or equal to another and supply us with a proof of that fact:

```agda
data Order : Nat → Nat → Set where
  ge : {x : Nat} {y : Nat} → x ≥ y → Order x y
  le : {x : Nat} {y : Nat} → y ≥ x → Order x y
```

`Order` is indexed by two natural numbers. These numbers can be anything – there is no restriction on either of the constructors. We can construct values of `Order` using one of two constructors: `ge` and `le`. Constructing a value of `Order` using the `ge` constructor requires a value of type `x ≥ y`; in other words, it requires a proof that `x` is greater than or equal to `y`. Constructing a value of `Order` using the `le` constructor requires the opposite proof – that `y ≥ x`. The `Order` datatype is the equivalent of `Bool`, except that it is specialized to one particular relation instead of being a general statement of truth or falsity. It also carries a proof of the fact that it states.
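As an illustration (my own example, assuming the definitions above), here is how concrete `Order` values are built, with the required `≥` proof passed to the constructor:

```agda
-- x = 5, y = 3: the ge constructor demands a proof of 5 ≥ 3.
ord1 : Order 5 3
ord1 = ge (geS (geS (geS ge0)))

-- x = 1, y = 4: this time only le is possible, with a proof of 4 ≥ 1.
ord2 : Order 1 4
ord2 = le (geS ge0)
```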

Now we can write a function that compares two natural numbers and returns a result that says whether the first number is greater than or equal to the second one^{3}:

```agda
order : (x : Nat) → (y : Nat) → Order x y
order x       zero    = ge ge0
order zero    (suc b) = le ge0
order (suc a) (suc b) with order a b
order (suc a) (suc b) | ge a≥b = ge (geS a≥b)
order (suc a) (suc b) | le b≥a = le (geS b≥a)
```

In this implementation `ge` plays the role of `true` and `le` plays the role of `false`. But if we try to replace `le` with `ge` the way we previously replaced `false` with `true`, the result will not be well-typed:

```agda
order : (x : Nat) → (y : Nat) → Order x y
order x       zero    = ge ge0
order zero    (suc b) = ge ge0 -- TYPE ERROR
order (suc a) (suc b) with order a b
order (suc a) (suc b) | ge a≥b = ge (geS a≥b)
order (suc a) (suc b) | le b≥a = le (geS b≥a)
```

Why? It is a direct result of the definitions that we used. In the second equation of `order`, `x` is `zero` and `y` is `suc b`. To construct a value of `Order x y` using the `ge` constructor we must provide a proof that `x ≥ y`. In this case we would have to prove that `zero ≥ suc b`, but as discussed previously there is no constructor of `≥` that could construct a value of this type. Thus the whole expression is ill-typed, and the incorrectness of our definition is caught at compile time.
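To show how such a proof-carrying comparison gets used, here is a hypothetical sketch of my own (not from the original post) of a `max` function defined with `order`; the proofs are simply discarded here, but a client such as merge sort could inspect them:

```agda
-- Pick the larger of two numbers by pattern matching on the Order value.
max : Nat → Nat → Nat
max x y with order x y
max x y | ge _ = x   -- we hold a witness of x ≥ y, so x is the answer
max x y | le _ = y   -- we hold a witness of y ≥ x, so y is the answer
```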

The idea that types can represent logical propositions and values can be viewed as proofs of those propositions is not new – it is known as the Curry-Howard correspondence (or isomorphism), and I bet many of you have heard that name. The example presented here is taken from “Why Dependent Types Matter”. See this recent post for a few more words about that paper.

- All code in this post is in Agda.
- For the sake of readability I will write Nats as numerals, not as applications of suc and zero. So remember that whenever I write 2 I mean `suc (suc zero)`.

- Note that in Agda `a≥b` is a valid identifier, not an application of `≥`.

- The standard library in Idris feels friendlier than Agda's. It is bundled with the compiler and doesn't require additional installation (unlike Agda's). The Prelude is imported by default into every module, so the programmer can use Nat, Bool, lists and so on out of the box. There are also some similarities with the Haskell Prelude. All in all, the standard library in Idris is much less daunting than Agda's.
- Idris is really a programming language, i.e. one can write programs that actually run. Agda feels more like a proof assistant. According to one of the tutorials I've read you can run programs written in Agda, but it is not as straightforward as in Idris. I personally haven't run a single Agda program – I'm perfectly happy that they typecheck.
- Compared to Agda, Idris has limited Unicode support. I've never felt the need to use Unicode in my source code until I started programming in Agda – after just a few weeks it feels like an essential thing. I think Idris allows Unicode only in identifiers, but doesn't allow it in operators, which means I have to use awkward operators like `<!=` instead of ≤. I recall seeing some discussions about Unicode on the #idris channel, so I wouldn't be surprised if that changed soon.
- One of the biggest differences between Agda and Idris is the approach to proofs. In Agda a proof is part of the function's code. The programmer is assisted by agda-mode (in Emacs), which guides code writing according to types (a common feature in dependently typed languages). Over the past few weeks I've come to appreciate the convenience offered by agda-mode: automatic generation of case analysis, refinement of holes, and autocompletion of code based on types, to name a few. Idris-mode for Emacs doesn't support interactive development. One has to use the interactive proof mode provided in the Idris REPL – this means switching between terminal windows, which might be a bit inconvenient. Proofs in Idris can be separated from the code they are proving, which allows one to write code that is much clearer. In proof mode one can use tactics, which are methods used to convert proof terms in order to reach a certain goal. The generated proof can then be added to the source file. It is hard for me to decide which method I prefer. The final result is more readable in Idris, but using tactics is not always straightforward. I also like the interactive development offered by Agda. Tough choice.
- Both languages are poorly documented. That said, Idris has much less documentation (mostly papers and presentations by Edwin Brady). I expect this to change, as the Idris community seems to be growing (slowly, but still).
- One thing I didn't like in Idris is the visibility qualifiers used to define how functions and datatypes are exported from a module. There are three available: public (export the name and implementation), private (don't export anything) and abstract (export the type signature, but don't export the implementation). This is slightly different than in Haskell – I think the difference comes from the properties of dependent types. What I didn't like are the rules and syntax used to define export visibility. Visibility for a function or datatype can be defined by annotating it with one of the three keywords: public, private or abstract. If no definitions in a module are annotated, then everything is public. But if there is at least one annotation, everything without an annotation is private – unless you changed the default visibility, in which case everything without an annotation can be abstract! In other words, if you see a definition without an annotation it means that: a) it can be public, but you have to check whether all other definitions are without annotations; b) it can be private, if at least one other definition is annotated – again, you have to check the whole file; c) but it can be abstract as well – you need to check the file to see if the default export level was set. The only way to be sure – except for nuking the entire site from orbit – is annotating every function with an export modifier, but that feels very verbose. I prefer Haskell's syntax for defining what is exported and what is not, and I think it could easily be extended to support three levels of export visibility.
- Unlike Agda, Idris has case expressions. They have some limitations, however. I'm not sure whether these limitations come from the properties of dependently typed languages or whether they are just simplifications in the Idris implementation that could theoretically be avoided.
- Idris has lots of other cool features. Idiom brackets are syntactic sugar for applicative style: you can write `[| f a b c |]` instead of `pure f <*> a <*> b <*> c`. Idris has syntax extensions designed to support development of EDSLs. Moreover, tuples are available out of the box, there's do-notation for monadic expressions, and there are list comprehensions and a Foreign Function Interface.
- One feature that I'm a bit sceptical about is “implicit conversions”, which allow one to define implicit casts between arguments and write expressions like `"Number " ++ x`, where `x` is an `Int`. I can imagine this could be a misfeature.
- Idris has “using” notation that allows one to introduce definitions that are visible throughout a block of code. The most common use seems to be in definitions of data types. Agda does this better IMO by introducing type parameters into the scope of data constructors.
- Idris seems to be developed more actively. The repos are stored on GitHub so anyone can easily contribute. This is not the case with Agda, which has Darcs repos, and the whole process feels closed (in the sense of “not open to the community”). On the other hand, the mailing list for Idris is set up on Google lists, which is a blocker for me.

All in all, programming in Idris is also fun, although it is a slightly different kind of fun than in Agda. I must say that I miss two features from Agda: interactive development in Emacs and Unicode support. Given how actively Idris is developed, I imagine it could soon become more popular than Agda. Perhaps these “missing” features will also be added one day?

As an exercise I rewrote the code from the “Why Dependent Types Matter” paper from Agda (see my previous post) to Idris. The code is available on GitHub.

Recently I decided to solidify my knowledge of the basics of dependent types by reading “Why Dependent Types Matter”. This unpublished paper was written by Thorsten Altenkirch, Conor McBride and James McKinna somewhere around 2006, I believe. It gives a great overview of dependent types and various design decisions related to their usage. But most of all, this paper shows how to write a provably correct merge-sort algorithm. Proving correctness of algorithms is something I find very interesting, so this paper was a must-read for me.

There is only one catch with “Why Dependent Types Matter”. All the code is written in Epigram, a dependently typed functional language designed by Conor McBride and James McKinna. The problem is that Epigram's webpage has been offline for a few months now^{1} and the language basically seems dead. Anyway, since my dependently typed language of choice is Agda (for the moment at least – I'm thinking a lot about Idris recently), I decided to rewrite all the code in the paper in Agda. For the most part this was a straightforward task, once I learned how to read Epigram's unusual syntax. There were, however, a few bumps along the way. One problem I encountered early on was Agda's termination checker complaining about some functions. Luckily, the Agda community is as helpful as Haskell's, and within a day I was given a detailed explanation of what goes wrong. A slightly larger problem was that the paper elides the details of some proofs. If I wanted to have working Agda code, I had to fill in these details. Since I didn't know how to do that, I had to pause for one day and go through the online materials for Thorsten Altenkirch's course on Computer Aided Formal Reasoning. In the end I managed to fill in all the missing gaps. My code is available on GitHub. Now I feel ready to prove the correctness of a few more algorithms on my own.

Conor will be giving his course on “Dependently typed metaprogramming” in November and December at University of Edinburgh. See here for details. Be sure not to miss it if you have a chance to attend. Code repository for the course is available here.

Unofficial mirror of Epigram’s sources is available on github.

- I recall Conor mentioning that the Nottingham people, who were hosting Epigram's web page on their servers, sent him the hard drive with said web page.