There's more than one way to map a cat

There are lots of data structures out there, ranging from primitive ones to ones so sophisticated that only a single person in the world understands them (these are, of course, mostly useless). The choice of data structure is mostly specific to the problem; still, some data structures are clearly more generally useful than others.

I'd say that the most generally useful data structure in the imperative world is the array. Arrays are as good as you can get if you do not need fast searching over large datasets: they have a simple, efficient memory layout (unless you need a multi-gigabyte array), which leads to fast iteration; they carry minimal metadata (the per-element overhead is zero); they can be indexed in guaranteed constant time; array transformations are trivially parallelizable; they generalize to multiple dimensions easily; etc.

However, in the functional world, from my limited experience, the favorite data structure seems to be the consed list, better known as a singly-linked list. The building block is a type called a cons cell, which is simply a pair. A list consists of cons cells linked together: the first element of a cons cell holds the data, and the second one points to the next cell, or to the special object nil, which represents an empty list:

(1 2 3 4) == (cons 1 (cons 2 (cons 3 (cons 4 nil))))
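
The same list in F# notation (F# is the language all the code below is written in; :: is the cons operator and [] is nil):

[1; 2; 3; 4] = 1 :: 2 :: 3 :: 4 :: []   (* evaluates to true *)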

For some reason, a linked list is also the favorite data structure of our game logic programmers, though this has nothing to do with functional programming.

Consed lists are present in lots of functional programming languages; in some, they're the fundamental structure (Lisp/Scheme rely heavily on consed lists - even the code is composed of s-expressions, which are stored in consed-list form; Haskell strings are consed lists of characters; etc.). However, for a programmer who spends half of his work day in a profiler, consed lists are a rather bad structure.

The benefits of consed lists are:

- prepending an element is O(1): allocate one cons cell and point it at the old head;
- since cells are immutable, lists can share structure - two lists can safely reuse a common tail;
- the structure is about as simple as it gets.

Not much, is it? The drawbacks are:

- there is no random access - getting the i-th element is O(i);
- the list is not symmetrical: prepending is O(1), but appending requires walking the whole list, so it's O(N);
- each element costs a separately allocated cell with an extra pointer, so there is per-element memory overhead, and the cells end up scattered across the heap - the opposite of the simple, cache-friendly memory pattern of an array.

There is a function called map (mapcar in Common Lisp) which takes a consed list and a one-argument function, and returns a new list produced by applying the function to each element of the source list. This function obviously has O(N) time complexity and O(N) space requirements (after all, it generates a new list!). Looking at various possible implementations of this function gives some insight into the problems of consed lists in particular and of functional programming style in general.

My language of choice for today is F#, a multi-paradigm language (with both functional and imperative object-oriented elements) based on the .NET platform. Of course, F# has built-in consed lists, so we'll start with them.
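
For example, here is what the built-in List.map does with a squaring function (this is the behavior we'll be reimplementing below):

List.map (fun x -> x * x) [1; 2; 3; 4]   (* = [1; 4; 9; 16] *)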

Naive recursive approach

When an imperative programmer has to implement a map function, his first natural reaction is to write a for loop. However, for loops usually require a mutable iteration variable, and thus are either discouraged or not present at all in functional languages. Instead, functional programmers use recursion. Indeed, a recursive implementation of map is straightforward, once you get used to recursive functions:

let rec mapcat1 pred lst =
    match lst with
    | car :: cdr -> (pred car) :: (mapcat1 pred cdr)
    | [] -> []

Aside from some syntactic weirdness (:: is used to make a cons cell, function arguments are specified without commas or parentheses, and I use pattern matching instead of ifs), the code should be self-explanatory. Mapping an empty list produces an empty list; mapping anything else can be done recursively by making a cons cell whose first element is the transformed first element of the original list, and whose rest is the result of the recursive map.

This function works, but it has a tiny problem - it’s recursive. Wait, what?

Tail-recursive approaches

You see, as you've likely been taught, recursion is bad. Each call to a recursive function pushes the function arguments and the return address onto the stack, creating a stack frame; a recursive function like map creates N stack frames for a list of N elements, so it requires O(N) temporary memory. What's worse, in some functional languages, like F#, the size of the stack is bounded, so processing a large list is going to generate a StackOverflowException (Haskell, on the contrary, is happy to grow the stack until no address space is left in the process).
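
To see where the frames come from, here is how mapcat1 unfolds on a three-element list; every pending :: keeps its caller's stack frame alive until the recursion reaches the end of the list:

mapcat1 pred [1; 2; 3]
= pred 1 :: mapcat1 pred [2; 3]            (* this frame waits for the inner result *)
= pred 1 :: (pred 2 :: mapcat1 pred [3])   (* so does this one *)
= pred 1 :: (pred 2 :: (pred 3 :: mapcat1 pred []))
= pred 1 :: (pred 2 :: (pred 3 :: []))     (* only now can the frames unwind *)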

To solve this problem without loops, the concept of tail recursion is introduced. If a function call is the very last thing the function does, there is no need for additional memory: the old stack frame is not needed after the call, so the new frame can replace the old one. For some languages tail-call elimination is an optional optimization (for example, some C++ compilers do it); for others it's a spec requirement (i.e. a guaranteed feature).

There is a common transformation pattern for recursive functions - accumulator passing - which is almost never done automatically by the compiler, but is usually easy to do by hand. Our map function can be transformed into this one:

let mapcat2 pred lst =
    let rec loop rest acc =
        match rest with
        | car :: cdr -> loop cdr (pred car :: acc)
        | [] -> acc
    loop lst []

There is an inner recursive function, conveniently named loop (which, I guess, shows my imperative background); the part of the result that has already been built is stored in the acc(umulator) argument, which is extended with a new element at each call.

Note that loop cdr (pred car :: acc) is a tail call, but (pred car) :: (mapcat1 pred cdr) is not - there is an extra cons operation to perform after the call returns.

The new version of the function works without stack overflows even on large inputs. However, due to the change in code structure, it produces a list with the elements in reverse order: we walk through the list and prepend each element to the result, so the first element of the original list becomes the last element of the new one.
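
For example:

mapcat2 (fun x -> x * 2) [1; 2; 3]   (* = [6; 4; 2] instead of [2; 4; 6] *)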

Well, it’s easy - instead of prepending, we’ll append!

let mapcat3 pred lst =
    let rec loop rest acc =
        match rest with
        | car :: cdr -> loop cdr (acc @ [pred car])
        | [] -> acc
    loop lst []

I've changed :: to @, which is the append operator. Victory!

Wait, why is the program still running?

Remember I said that consed lists are not symmetrical? You can easily prepend an element, but you don't know where the last element of the list is, so append has to walk through its entire left argument, making the new function quadratic. Oops. Note that this wouldn't be a problem with a doubly-linked list, but, unfortunately, append was not a priority 50 years ago, when Lisp was accidentally created.
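
Here is a sketch of what @ has to do (this is not the actual library code, but the asymptotics are the same) - since cells are immutable, every cell of the left list has to be copied:

let rec append left right =
    match left with
    | car :: cdr -> car :: append cdr right   (* rebuilds the entire left list *)
    | [] -> right

Since acc grows by one element per iteration, mapcat3 ends up doing 1 + 2 + ... + (N - 1) cons operations, which is O(N^2).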

Ok, it seems we'll just have to reverse the result. A relatively common interview question is to code a reverse function for a singly-linked list. An in-place version is usually required, where the old list gets destroyed in the reversing process; we can't do that here, because cons cells, like all other objects in a pure functional world, are immutable - you can't change them once they're created. So we have to make an additional copy of the list.
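
Coding a non-destructive reverse is the accumulator pattern yet again; here is a minimal sketch (the built-in List.rev does the same job):

let reverse lst =
    let rec loop rest acc =
        match rest with
        | car :: cdr -> loop cdr (car :: acc)   (* prepending while walking reverses the list *)
        | [] -> acc
    loop lst []

Alternatively, we can use the same map function for this, because it conveniently returns a transformed reversed list. Both variants follow: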

let mapcat4 pred lst =
    mapcat2 pred lst |> reverse

let mapcat5 pred lst =
    mapcat2 pred lst |> mapcat2 (fun x -> x)

|> is the F# pipeline operator; you can read it as "the result of the expression on the left is passed as the last argument to the function on the right".

Now we finally have a correct solution; it is tail recursive, so it consumes O(1) stack frames, but it creates an additional copy of the list, so it still needs O(N) temporary memory. However, it works in F# for long lists.

Continuation-passing style

There is another workaround for the stack frame problem, one which is very functional in spirit. To build our list in the right order, we need to traverse the list, remembering the chain of nodes, and then unwind the chain starting from the last node. This is essentially what we do in the recursive approach; however, we can stay tail-recursive if the unwinding chain is formed from continuations.

Think of it this way: we have to make a function which, given a tail of the result, prepends another element to it, forming the next tail. If we have N such functions, and each one calls the next, then the unwind chain is formed from function calls which conveniently happen to be tail calls.

This is an example implementation (fun x -> expr is a way to create an anonymous function with argument x which evaluates expr as its result):

let mapcat6 pred lst =
    let rec loop rest cont =
        match rest with
        | car :: cdr -> loop cdr (fun acc -> cont (pred car :: acc))
        | [] -> cont []
    loop lst (fun x -> x)

We start with an identity function, which, given an argument x, returns it back. Then we form a function chain, which for a list [1; 2; 3] will look like this (in the order of creation):

fun x -> x
fun acc1 -> (fun x -> x) (pred 1 :: acc1) (* this is equal to fun acc1 -> pred 1 :: acc1 *)
fun acc2 -> (fun acc1 -> pred 1 :: acc1) (pred 2 :: acc2) (* this is equal to fun acc2 -> (pred 1) :: (pred 2) :: acc2 *)
fun acc3 -> (fun acc2 -> (pred 1) :: (pred 2) :: acc2) (pred 3 :: acc3) (* this is equal to fun acc3 -> (pred 1) :: (pred 2) :: (pred 3) :: acc3 *)

And finally the result is called with an empty list, which results in (pred 1) :: (pred 2) :: (pred 3) :: [], which is what we want.

We've successfully traded N stack frames for N closure instances, hurray! (Amazingly, this is even faster than the original recursive version; see below for timings.)

Imperative world

Timing the standard List.map function, which does the same thing, shows that it's way faster than the fastest correct version above. The only way to optimize further, as far as I understand, is to introduce mutable data structures, which means introducing a special structure instead of the built-in cons cell (Scheme-aware readers will immediately recognize that Scheme has a built-in set-cdr! function, which is what we'll need here).

The code is very much imperative, apart from the tail-recursion-instead-of-loops, so I’ll leave it without explanations:

type Cell =
    val car: int            (* the payload; fixed to int for simplicity *)
    val mutable cdr: Cell   (* mutable link to the next cell *)

    new (car_, cdr_) = { car = car_; cdr = cdr_; }

let nil: Cell = Unchecked.defaultof<Cell>   (* a null reference plays the role of nil *)
let cons car cdr = Cell(car, cdr)

let mapcat2_mut pred lst =
    (* walk the source list, linking a freshly mapped cell after `last` *)
    let rec loop (rest: Cell) (last: Cell) =
        if System.Object.ReferenceEquals(rest, null) then
            rest
        else
            let cell = cons (pred rest.car) nil
            last.cdr <- cell    (* destructive update, a la set-cdr! *)
            loop rest.cdr cell

    let head = Cell(0, nil)     (* stub head cell, thrown away at the end *)
    loop lst head |> ignore
    head.cdr

Note that I always create a stub head object which is thrown away afterwards; that's because I'm lazy.
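
To actually run this version you need a Cell chain to feed it; here is a small test helper I'm adding (it is not part of the original program, and it is deliberately naive):

(* copies an F# list into a Cell chain, for testing *)
let rec ofList l =
    match l with
    | x :: rest -> cons x (ofList rest)
    | [] -> nil

let doubled = mapcat2_mut (fun x -> x * 2) (ofList [1; 2; 3])   (* 2 -> 4 -> 6 *)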

Results

Now let’s stuff everything in a single program, add some timing code, and look at the results (the complete F# source is available here).

The program, when run, outputs the following:

"recursive" took 1.656725 ms
"tail-recursive (cons)" took 0.341175 ms
"tail-recursive (cons) + reverse" took 0.922769 ms
"tail-recursive (cons) + reverse (via map)" took 0.943074 ms
"tail-recursive (continuations)" took 1.110428 ms
"standard" took 0.496710 ms
"mutable recursive" took 1.430596 ms
"mutable tail-recursive" took 0.563062 ms
"mutable loop" took 0.577617 ms

So, as we can see, the fastest way is the List.map function, which is closely followed by our mutable variants (there are both tail-recursive and loop versions here - F# has native support for loops); the next best are the functions which construct two lists, followed by the continuation version (amazing!), and, finally, by the recursive version. The first tail-recursive variant is the fastest of them all, but it's incorrect.

How did they do that? Why is List.map as fast as (well, even 10% faster than) our mutable version, given that the F# list node is immutable? I studied the F# assembly using ildasm, and found out that…

… they mutate the resulting list. List.map creates a head node from the first element of the list, and then calls mapToFreshConsTail, which creates the rest, modifying the tail (cdr) of the cells in the process - in other words, the same trick as our mutable version, hidden behind an immutable facade.

Conclusion: when purity and performance collide, performance usually wins.

Oh, and using arrays here results in 0.1 ms runtime, which is 5x faster than the fastest list-based solution. Just saying.