Cmm pipeline woes

For the last four months I have been working on improving compilation of the intermediate Cmm language used by GHC. Recently I have noticed some recurring problems in the design of the Cmm pipeline.

Cmm

I already mentioned in my earlier post that Cmm is a low-level language, something between C and assembly. Cmm is produced from another intermediate language, STG. A single Cmm procedure is represented as a directed graph. Each node in the graph is a Cmm Block of low-level instructions. Exactly one Cmm Block in a graph is the entry point – the block from which execution starts when the procedure represented by the graph is called. Most Cmm Blocks in a graph have at least one successor, that is, a node (or nodes) to which control flows from the given Cmm Block. A Cmm Block may have no successor if it ends in a call to another procedure, i.e. it passes the flow of control to another Cmm graph. Each Cmm Block consists of a linear list of Cmm Nodes. A Cmm Node represents a single Cmm instruction, such as a store to a register, a conditional jump, or a call to a function.
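To make this concrete, here is a minimal sketch of such a representation in Haskell. The names (`CmmNode`, `Block`, `Graph`) and the instruction set are simplified stand-ins of my own, not GHC's actual types:

```haskell
import qualified Data.Map as M

type BlockId = Int

-- Simplified stand-ins for Cmm instructions.
data CmmNode
  = Assign String Int          -- store a value in a register
  | CondBranch BlockId BlockId -- conditional jump: two successors
  | Branch BlockId             -- unconditional jump: one successor
  | Call String                -- tail call to another procedure: no successors
  deriving Show

-- A block is a linear list of nodes; the final node determines the successors.
data Block = Block { blockId :: BlockId, nodes :: [CmmNode] } deriving Show

successors :: Block -> [BlockId]
successors (Block _ ns) = case reverse ns of
  (CondBranch t f : _) -> [t, f]
  (Branch t : _)       -> [t]
  _                    -> []   -- e.g. a call: control leaves this graph

-- A procedure: its entry point plus a map from ids to blocks.
data Graph = Graph { entry :: BlockId, blocks :: M.Map BlockId Block }
```

Note how a block ending in a `Call` has no successors at all, matching the description above of a block that passes control to another Cmm graph.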

Pipeline

The Cmm representation produced by the STG -> Cmm pass is incomplete. For example, operations on the stack are represented in an abstract way. It is also far from optimal, as it may contain lots of empty blocks or control-flow paths that will never be taken. That’s why we have the Cmm pipeline. It takes the Cmm representation produced by the code generator1 and applies a series of transformations to it. Some of them are mandatory (like stack layout), while others perform optimisations and are completely optional. Here’s a rough overview of the pipeline:

  1. Control Flow Optimisations. Optimises the structure of a graph by concatenating blocks and removing empty blocks.
  2. Common Block Elimination (optional). Eliminates duplicate blocks.
  3. Minimal Proc-point set. Determines a minimal set of proc-points2.
  4. Stack Layout. Turns abstract stack representation into explicit stack pointer references. Requires proc-point information computed in step 3. Creates stack maps.
  5. Sinking pass (optional). Moves variable declarations closer to their usage sites. Inlines some literals and registers in a way similar to constant propagation.
  6. CAF analysis. Analyses constant applicative forms (top-level declarations that don’t take any parameters). CAF information is returned by the Cmm pipeline together with the optimised Cmm graph.
  7. Proc-point analysis and proc-point splitting (optional). Here the pipeline splits into two alternative flows. They are identical except for the fact that one branch begins by performing proc-point splitting. This means that blocks that were determined to be proc-points are now turned into separate procedures. Requires proc-point information computed in step 3.
  8. Info Tables. Populates info tables of each Cmm function with stack usage information. Uses stack maps created by the stack layout (step 4).
  9. Control Flow Optimisations (optional). Repeats the control flow optimisations (step 1), but this time the pass is optional.
  10. Unreachable Blocks Removal. Eliminates blocks that don’t have a predecessor in the Cmm graph.
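The overall shape of the pipeline – a sequence of passes, some mandatory and some run only when optimising – can be sketched like this. This is a toy model of my own, not GHC's actual code; here a "graph" is just a list of block labels and the two passes crudely stand in for steps 1 and 2:

```haskell
import Data.List (nub)

-- A toy stand-in for a Cmm graph; the real pipeline threads a CmmGraph
-- (plus proc-point and stack-map information) through each pass.
type Graph = [String]          -- pretend a graph is just a list of block labels

type Pass = Graph -> Graph

-- A pass together with a flag saying whether it must always run.
data Step = Step { stepName :: String, mandatory :: Bool, runStep :: Pass }

-- Two toy passes standing in for steps 1 and 2 of the pipeline above.
pipeline :: [Step]
pipeline =
  [ Step "control-flow optimisation" True  (filter (/= "empty")) -- drop empty blocks
  , Step "common block elimination"  False nub                   -- drop duplicates
  ]

-- Mandatory passes always run; optional ones only when optimising.
runPipeline :: Bool -> Graph -> Graph
runPipeline optimise g =
  foldl (\acc s -> if mandatory s || optimise then runStep s acc else acc)
        g pipeline
```

The point of the sketch is the control structure: each pass consumes the result of the previous one, and the optional passes are simply skipped when optimisation is off.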

As an example consider this Cmm produced by the code generator:

c4wk:
    goto c4wi;
c4wi:
    goto c4wl;
c4wl:
    goto c4wm;
c4wm:
    if ((old + 0) - <highSp> < SpLim) goto c4x9; else goto c4xa;
c4x9:
    R2 = _s2Rv::P64;
    R1 = _s2Ru::P64;
    call (stg_gc_fun)(R2, R1) args: 8, res: 0, upd: 8;

After going through the Cmm pipeline the empty blocks are eliminated and the abstract stack representation ((old + 0) and <highSp>) is turned into explicit stack usage:

c4wi:
    if (Sp - 88 < SpLim) goto c4x9; else goto c4xa;
c4x9:
    R2 = _s2Rv::P64;
    R1 = _s2Ru::P64;
    call (stg_gc_fun)(R2, R1) args: 8, res: 0, upd: 8;

Woes

During my work on the Cmm pipeline I encountered the following bugs and design issues:

  • In some cases the Stack Layout phase (step 4) can invalidate the minimal set of proc-points by removing a block that was earlier determined to be a proc-point. However, the minimal proc-point information is later used by the proc-point analysis, and GHC would panic if a proc-point block had been removed. This was bug #8205. I solved it by adding an additional check in the proc-point analysis that ensures that all proc-points actually exist in the graph. This was a simple, one-line solution, but it doesn’t feel right: it accepts the fact that our data is in an inconsistent state and places the responsibility of dealing with that state on the algorithms relying on it. In other words, any algorithm that relies on the minimal proc-point set after stack layout must check whether the given proc-points still exist in the graph.
  • I observed that in some circumstances the control flow optimisation pass may lead to unexpected duplication of blocks (see #8456). After some investigation it turned out that this pass began by computing the set of predecessors for each block and then modified the graph based on this information. The problem was that the list of predecessors was not updated as the graph was modified. This problem is the same as the previous one: we compute a fact about our data structure and start modifying the data based on that fact, but we don’t update the fact as we go.
  • The control flow optimisation pass may produce unreachable blocks. They remain in the data structure representing the graph, but are not reachable from any block in it. The common block elimination pass will remove the unreachable blocks, but until it does, the data is in an inconsistent state.
  • The same thing happens later: the stack layout pass may create unreachable blocks and relies on later passes to remove them.
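The defensive check from the first bullet has roughly this shape. The function name is mine and the graph is simplified to a map from block ids to blocks, but the idea matches the #8205 fix described above – before relying on a previously computed proc-point set, discard any proc-points whose blocks no longer exist:

```haskell
import qualified Data.Map as M
import qualified Data.Set as S

type BlockId = Int
type Graph = M.Map BlockId String   -- simplified: blocks keyed by their ids

-- Roughly the shape of the #8205 fix: before using the proc-point set
-- computed earlier, drop any proc-points whose blocks were since removed.
pruneProcPoints :: S.Set BlockId -> Graph -> S.Set BlockId
pruneProcPoints procPoints g = S.filter (`M.member` g) procPoints
```

This is exactly the kind of patch-up the post complains about: it repairs a stale fact at the point of use instead of preventing the fact from going stale in the first place.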

I listed only the problems I am aware of, but I believe there are more and that they will manifest themselves one day. I have spent the last couple of days thinking about how to solve these issues. Of course, fixing each of them separately is a relatively simple task, but I’m trying to come up with a general design that would prevent us from introducing inconsistencies into our data structures.

  1. That’s how we often refer to the STG -> Cmm pass
  2. I want to avoid going into the details of what proc-points are and why we need them. For the purpose of this post it is sufficient to think of a proc-point as a property that might be assigned to a block based on its predecessors

Let no escape!

I have observed that the number of people familiar with Core and using it for debugging purposes is relatively large compared to the number of people familiar with the other two intermediate representations used further down GHC’s pipeline: STG and Cmm. Some seem to treat these two a bit like black magic :-) I have to admit that I used to treat them like this before I came to MSR to work on the GHC backend. Now I am a bit more familiar with the code generator and I’d like to share some of my knowledge. Today I will present an optimisation called “let-no-escape” (LNE for short). I believe it is almost unknown amongst Haskellers and GHC hackers. The main reasons for that are: a) it is performed in the code generator (the STG-to-Cmm pass) and, as I said, people tend to stay away from that part of the compiler; b) it was not described in any paper (“We didn’t consider it to be worth a paper” – Simon Marlow), nor is it described on the GHC wiki – the only way to learn about it was to find it in the source code. My plan is to explain what LNE does and why.

Motivation

Consider this imperative code in C:

if (y > 0) {
  x = 3;
} else {
  x = 4;
}
...x...// use x

Earlier I mentioned the Cmm intermediate language used by GHC. It is a low-level imperative language, something between C and assembly. If we were to rewrite the above C program into Cmm we would get something like this:

     if (y > 0) goto L1; else goto L2;
L1 : x = 3;
     goto L3;
L2 : x = 4;
     goto L3;
L3 : ....x... // use x

In both the C and the Cmm code we would expect the generated assembly to check the value of y, perform a conditional jump to one of the branches, set the value of x to 3 or 4, and finally perform an unconditional jump from the branch to the code that follows the if construct (i.e. the L3 label in the Cmm version).

All of this seems obvious and looks very simple, but it gets more complicated when we start thinking in a functional setting. How do we implement this logic in Haskell? Well, we could do something like this:

let j x = ...x... -- use x
in case (y > 0) of
      True  -> j 3
      False -> j 4

We introduced a binding j, which corresponds to the code after the conditional (again, L3 in the Cmm example). The case expression corresponds to the if statement. In the branches we just call j and pass the desired value of x as a parameter. There’s a problem here, though. If we simply compile this code then j will be compiled as a function: a closure with two entry points (slow and fast) and possibly a stack check and a heap check at the very beginning. So where C had an unconditional jump, here we get a function call with all its overhead. This is very bad and we want something better. One easy way to avoid the call to j would be to inline it in every case alternative, but that duplicates code, so it is no good either. We really want to produce code that is as fast as that produced from C, without unnecessary duplication. The let-no-escape optimisation allows us to achieve exactly that.

Let-no-escape

When we compile Core to STG we distinguish between two kinds of let bindings: normal ones and let-no-escape ones. A binding is a let-no-escape binding if it is non-updatable (so it must have some parameters), it is called in a tail position with exactly the right number of arguments, and it is “guaranteed to be entered before the stack retreats – i.e. it is not enclosed in a heap-allocated closure or passed as an argument to something”1. Normal bindings – i.e. bindings for which at least one of these conditions does not hold – are compiled as closures and follow the calling convention, i.e. they expect their parameters to be passed in particular registers or on the stack. Let-no-escape bindings are compiled in a special way: they are not closures and do not follow the calling convention. Instead they are compiled as a normal block of Cmm code that expects its entry parameters to be located in particular local variables (which will later be mapped to hardware registers or spilled to the stack). To “call” an LNE binding we place the parameters in the correct local variables and jump to the block of Cmm code.
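The conditions above can be written down as a predicate over a toy model of a binding and its occurrences. The record types and names here are mine, invented for illustration – GHC’s real analysis works on the STG syntax tree – but the logic mirrors the criteria just listed:

```haskell
-- Toy records modelling the LNE conditions; the names are mine, not GHC's.
data Occurrence = Occurrence
  { inTailPosition :: Bool  -- is the call in a tail position?
  , argsSupplied   :: Int   -- number of arguments the call site passes
  , escapes        :: Bool  -- captured in a closure or passed as an argument?
  }

data Binding = Binding
  { arity       :: Int      -- number of parameters
  , updatable   :: Bool     -- an updatable thunk cannot be let-no-escape
  , occurrences :: [Occurrence]
  }

-- A binding may be compiled let-no-escape only if it is non-updatable and
-- every occurrence is a saturated tail call that does not escape.
isLetNoEscape :: Binding -> Bool
isLetNoEscape b =
  not (updatable b) && arity b > 0 && all ok (occurrences b)
  where
    ok o = inTailPosition o && argsSupplied o == arity b && not (escapes o)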

Let’s look once again at our example program:

let j x = ...x... -- use x
in case (y > 0) of
      True  -> j 3
      False -> j 4

The binding for j is a let-no-escape one: it is non-updatable, it is called in a tail position, all call sites pass exactly the required number of arguments, and it is not passed as a parameter. This means that we can compile j as a let-no-escape binding – it will not be a closure and will expect its parameter in a local variable. The call sites for j – i.e. the True and False alternatives of the case expression – need to place the value of the argument x where j expects it. This brings us to the Cmm code that I presented at the top of this post:

     if (y > 0) goto L1; else goto L2;
L1 : x = 3;
     goto L3;
L2 : x = 4;
     goto L3;
L3 : ....x... // use x

The conditional if implements the case expression that selects one of the alternatives. L1 is the True branch, while L2 is the False branch. We compiled j so that it expects its parameter to be passed in the local variable x, so every call site needs to initialise x with the value it wants to pass to j as an argument. After initialising the argument we perform an unconditional jump to the code that implements the binding (instead of making a call). In this example, L3 implements the j let-no-escape binding. It is not a callable closure, but a normal block of Cmm code which expects its argument in the local variable x. In this way we have compiled a functional program into efficient low-level code comparable with C.

Summary

This was a conceptual explanation of what let-no-escape bindings are and why we use them. A separate question is how this optimisation is implemented in GHC. The relevant code is located in a couple of places, mostly in the codeGen/ directory (the STG -> Cmm pass). If you want to explore the implementation details, a good starting point might be the getCallMethod function and the CallMethod data type in the StgCmmClosure module, and cgIdApp in the StgCmmExpr module.

  1. Quote from the source file compiler/stgSyn/CoreToStg.lhs
