Having recently finished reading Clojure: High Performance JVM Programming, I found in it quite a lot of useful information for increasing the performance of Clojure (and more generally JVM-based) applications.
The book contains many examples of different constructs we can use in Clojure to increase the performance of arithmetic operations. Many of these examples feature the classic fibonacci computation.
Some of these implementations come with associated measurements in the book, but others are missing these figures. It is like watching an action movie where some of the action scenes are replaced by blanks. Pretty frustrating.
In this post, we will fill in the missing blanks and go through the different ways to compute a fibonacci sequence, some of which come from the book, plus some fancier ones I tested, and measure their performance using the benchmarking library criterium.
We will conclude with some bonus measurements on Haskell implementations of the same algorithm, and discuss the subject of performance and micro-benchmarks.
The fibonacci sequence is the sequence of numbers such that each number is the sum of the two preceding ones. We seed this sequence with two starting numbers: we will choose to seed it with 0 and 1.
There are many ways to compute this sequence, some more efficient than others. The fastest one I know computes the Nth fibonacci number in O(log N) steps, using matrix exponentiation and the exponentiation-by-squaring algorithm.
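For the curious, a sketch of this logarithmic approach in Clojure (not from the book; the names and details here are my own) could look like this:

```clojure
;; 2x2 matrices represented as flat vectors [a b c d] of BigInts.
(defn mat-mul [[a b c d] [e f g h]]
  [(+' (*' a e) (*' b g)) (+' (*' a f) (*' b h))
   (+' (*' c e) (*' d g)) (+' (*' c f) (*' d h))])

;; Exponentiation by squaring: O(log n) matrix multiplications.
(defn mat-pow [m n]
  (cond
    (zero? n) [1N 0N 0N 1N]                          ; identity matrix
    (even? n) (let [h (mat-pow m (quot n 2))] (mat-mul h h))
    :else     (mat-mul m (mat-pow m (dec n)))))

(defn fibo-fast [n]
  ;; [[1 1] [1 0]]^n holds fib(n) in its top-right cell.
  (second (mat-pow [1N 1N 1N 0N] n)))
```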
The book however uses the linear algorithm, probably because it admits many more possible implementations. We will thus stick to this algorithm as well, and measure how its different implementations perform.
In this section, we will cover some of the classic functional approaches used to compute a sequence of values: iterate, lazy sequence, recur and trampoline.
Iterate creates a lazy sequence of values from an initial seed and a function that computes the next element. Indexing the Nth element gives us our Nth fibonacci number.
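The book's listing is not reproduced here, but an iterate-based version might look like the following sketch (the name fibs-iterate follows the text; the exact code may differ from the book's):

```clojure
;; Each element of the lazy sequence is a pair [fib(i) fib(i+1)];
;; +' auto-promotes to BigInt on overflow.
(defn fibs-iterate [n]
  (first (nth (iterate (fn [[a b]] [b (+' a b)]) [0N 1N]) n)))
```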
Mean execution time: 113 microseconds.
Since iterate creates a lazy sequence, we can try to build the lazy sequence ourselves without using iterate. Doing so gives us more control over how the sequence is created. In particular, we can avoid instantiating the intermediary pairs that fibs-iterate did.
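A hand-rolled lazy sequence along these lines might look like this sketch (the names are my own; the book's version may differ):

```clojure
;; The recursion carries the two accumulators directly,
;; so no intermediary pair is allocated per element.
(defn fibs-lazy []
  (letfn [(fibs [a b] (lazy-seq (cons a (fibs b (+' a b)))))]
    (fibs 0N 1N)))

(defn fibo-lazy [n]
  (nth (fibs-lazy) n))
```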
Mean execution time: 97 microseconds.
As we could expect, no creation of pairs means fewer dynamic allocations, which in turn gives us better performance.
We can use the low-level construct recur to write a tail-recursive implementation of the algorithm.
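A tail-recursive version might look like this sketch (the name fibo-recur follows the text; details may differ from the book's listing):

```clojure
;; Plain loop/recur with two BigInt accumulators and a countdown.
(defn fibo-recur [n]
  (loop [a 0N, b 1N, i n]
    (if (zero? i)
      a
      (recur b (+' a b) (dec i)))))
```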
Mean execution time: 56 microseconds.
Recur is usually used to avoid boxing in math computations, for better performance. In our case, the boxing is needed since we manipulate BigInts. Nevertheless, using recur improves the performance quite substantially.
The twin cousin of recur is trampoline, which allows us to implement mutual recursion without consuming stack space. We do not need mutual recursion here, but it is useful to see how it compares to recur.
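A trampoline-based variant could be sketched as follows (hypothetical name fibo-trampoline; each step returns a thunk instead of recurring directly, and trampoline keeps invoking the returned thunks until it gets a non-function value):

```clojure
(defn fibo-trampoline [n]
  (letfn [(step [a b i]
            (if (zero? i)
              a
              #(step b (+' a b) (dec i))))]   ; return a thunk, not a call
    (trampoline step 0N 1N n)))
```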
Mean execution time: 65 microseconds.
As we could suspect, trampoline is a bit more expensive than plain tail recursion via recur. The difference is not huge, but trampoline should be reserved for true mutual recursion.
In this section, we will cover some of the imperative approaches, based on assignments, that we could use to implement our algorithm.
Clojure allows us to create local vars using with-local-vars. These local vars have lexical scoping, and can be read and modified using var-get (or @) and var-set respectively.
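An implementation along these lines might look like this sketch (hypothetical name fibo-local-vars):

```clojure
;; Thread-local vars created by with-local-vars, read with @ (var-get)
;; and assigned with var-set.
(defn fibo-local-vars [n]
  (with-local-vars [a 0N, b 1N]
    (dotimes [_ n]
      (let [nxt (+' @a @b)]
        (var-set a @b)
        (var-set b nxt)))
    @a))
```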
Mean execution time: 467 microseconds.
As we can see, this approach is both more imperative and less performant than our functional implementations. It turns out that vars are pretty expensive to manipulate, and that we can switch to faster “refs”.
Since Clojure 1.7, we have access to volatile references, which are the fastest “refs” available in Clojure. We replace our vars with volatile! below.
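The volatile-based variant might be sketched as (hypothetical name fibo-volatile):

```clojure
;; volatile! creates a volatile reference; @ reads it, vreset! writes it.
(defn fibo-volatile [n]
  (let [a (volatile! 0N)
        b (volatile! 1N)]
    (dotimes [_ n]
      (let [nxt (+' @a @b)]
        (vreset! a @b)
        (vreset! b nxt)))
    @a))
```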
Mean execution time: 78 microseconds.
This implementation goes much faster than the one based on vars, and beats the lazy-seq implementation, but is still significantly slower than the functional approach based on tail recursion.
Getting rid of volatile
In Java, volatile variables have implications on the memory model as defined by the JVM. The relationships they impose (such as happens-before) constrain the kinds of optimizations we can get. Let us try to get rid of volatile to go faster.
Clojure does not allow mutable non-volatile local variables. But we can cheat: deftype allows us to declare :unsynchronized-mutable fields, which are plain Java fields without the volatile modifier. So we can declare a type just to get rid of our volatiles. This is already quite ugly, but it is about to get worse.
Since Clojure does not allow public access to mutable fields, we will declare a protocol for advancing through the fibonacci sequence, and have our deftype implement it by mutating its fields. To keep a bit of dignity, we wrap this horror inside a function fibo-with-type:
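A sketch of this approach (the name fibo-with-type follows the text; the protocol and field names are my own):

```clojure
(defprotocol IFibo
  (next-fibo! [this])
  (current-fibo [this]))

;; Plain mutable Java fields, no volatile semantics.
(deftype FiboState [^:unsynchronized-mutable a
                    ^:unsynchronized-mutable b]
  IFibo
  (next-fibo! [_]
    (let [nxt (+' a b)]
      (set! a b)
      (set! b nxt)))
  (current-fibo [_] a))

(defn fibo-with-type [n]
  (let [state (FiboState. 0N 1N)]
    (dotimes [_ n] (next-fibo! state))
    (current-fibo state)))
```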
Mean execution time: 60 microseconds.
This implementation goes slightly slower than our tail recursive implementation. It is therefore both ugly and not worth it: let us not do that again.
Java implementation (baseline)
Of course, the ultimate imperative solution would be a Java implementation wrapped inside some Clojure code.
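Such a Java baseline might look like this sketch (the class and method names are my own, not necessarily the book's):

```java
import java.math.BigInteger;

public class Fibo {
    // Straightforward imperative loop over java.math.BigInteger.
    public static BigInteger fibo(int n) {
        BigInteger a = BigInteger.ZERO;
        BigInteger b = BigInteger.ONE;
        for (int i = 0; i < n; i++) {
            BigInteger next = a.add(b);
            a = b;
            b = next;
        }
        return a;
    }
}
```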
Mean execution time: 41 microseconds.
This implementation goes significantly faster than our best Clojure implementation so far (56 microseconds). But notice that the implementation makes use of java.math.BigInteger instead of clojure.lang.BigInt.
Clojure with BigInteger
We can rewrite our fastest Clojure implementation, fibo-recur, to use Java’s BigInteger instead of Clojure’s BigInts, and check if it makes any difference.
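The BigInteger-based variant of fibo-recur might be sketched as (hypothetical name fibo-recur-biginteger; the type hint avoids reflection on the .add call):

```clojure
(import 'java.math.BigInteger)

(defn fibo-recur-biginteger [n]
  (loop [a BigInteger/ZERO, b BigInteger/ONE, i (long n)]
    (if (zero? i)
      a
      (recur b (.add ^BigInteger a b) (dec i)))))
```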
Mean execution time: 41 microseconds.
As we can see, the Clojure implementation based on tail recursion and BigInteger is on par with a purely imperative Java implementation.
You can find below the traces of a full benchmark of all the implementations listed above. These benchmarks were performed on the same machine, all at the same time, using the following benchmark code, and based on the criterium library.
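The full benchmark code is not reproduced here; with criterium, the skeleton of such a benchmark looks like this (the argument 1000 is an arbitrary example, not necessarily the N used in the measurements above):

```clojure
(require '[criterium.core :as crit])

;; criterium warms up the JIT and reports mean time with variance.
(crit/bench (fibo-recur 1000))
```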
We went over different implementations of the fibonacci sequence (based on the linear algorithm, not the best one) and filled in the missing measurements for the many implementations shown in Clojure: High Performance JVM Programming.
We measured that recur can offer performance very close to that of a Java implementation. The functional implementations are overall pretty efficient, those based on lazy sequences being a bit slower than the others.
We also saw that Clojure’s BigInts are not as fast (in this use case at least) as Java’s BigIntegers. We went from 56 microseconds down to 41 microseconds by switching to Java’s BigInteger, which is quite a significant gain.
Bonus and side notes
I implemented the same algorithm in Haskell, on the same laptop, using two different implementations: one based on iterate and one based on tail recursion.
The Haskell tail recursive implementation is basically equivalent to the fibo-recur implementation of Clojure.
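That tail-recursive Haskell version might be sketched as (the name fiboRecur is my own):

```haskell
fiboRecur :: Int -> Integer
fiboRecur = go 0 1
  where
    -- Two accumulators and a countdown, just like fibo-recur.
    go a _ 0 = a
    go a b i = go b (a + b) (i - 1)
```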
Mean execution time: 30 microseconds.
This is faster than what we achieved with the Clojure or Java implementations, a more than 25% improvement. This could be due to many factors: Haskell being compiled to native code, the implementation of Integer (the equivalent of Java’s BigInteger), or specific optimizations triggered by the GHC compiler.
The second Haskell implementation is based on the iterate equivalent of Clojure, also named iterate in Haskell.
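The iterate-based Haskell version might look like this (the name fiboIterate is my own):

```haskell
fiboIterate :: Int -> Integer
fiboIterate n = fst (iterate (\(a, b) -> (b, a + b)) (0, 1) !! n)
```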
Mean execution time: 4.5 microseconds.
Haskell is renowned for being very efficient at optimizing purely functional code, with powerful loop fusions and rewrite rules. But this result looks kind of surprising. It is almost one order of magnitude faster than Java!
As a reference, an imperative Python implementation takes around 150 microseconds. So it looks like Haskell is doing some magic there, or is cheating: the Haskell tail recursive implementation should also be faster than iterate.
Unable to find an appropriate explanation for this, I posted the question on reddit/r/haskell. You can have a look at the discussion, but the short story is that GHC is memoizing (caching) the results, effectively trading space for time.
All the measurements were made using the Criterion benchmark library of Haskell, which is very close to the criterium benchmark library of Clojure.
The kinds of optimizations done by the GHC compiler are quite stunning. I tried some other test cases, and in most of them Haskell and Clojure perform in the same ballpark, indicating that this kind of memoization does not always kick in.
In any case, it shows how hard a subject performance is, and how much it depends on many subtle factors. If anything, it shows that we should be suspicious about concluding anything from micro-benchmarks like these.
You can follow me on Twitter at @quduval.