Having recently finished reading the Clojure programming book named Clojure: High Performance JVM Programming, I found in it quite a bunch of useful information to increase the performance of Clojure (and more generally JVM-based) applications.

The book contains many examples of different constructs we can use in Clojure to increase the performance of arithmetic operations. Many of these examples feature the classic fibonacci computation.

Some of these implementations come with a associated measurements inside the book, but some implementations are missing these figures. It is like watching an action movie where some of the action scenes are replaced by blanks. Pretty frustrating.

In this post, we will fill the missing blanks, and go through the different ways to compute a fibonacci sequence, some of which comes from the book, and some fancier ones I tested, and give test their performance using the benchmarking library criterium.

We will conclude with some bonus measurements on Haskell implementations of the same algorithm, and discuss the subject of performance and micro-benchmarks.

### Fibonacci Sequence

The fibonacci sequence is the sequence of number such that each number is the sum of the two preceding ones. We seed this sequence with the two starting numbers: we will choose to seed it with 0 and 1.

There are many ways to compute this sequences, some more efficient that other. The fastest one I know computes the Nth fibonacci number in O(log N) steps, using matrix exponentiation and the exponentiation by squaring algorithm.

The book however makes use of the linear algorithm, probably because it has much more implementation possible. We will thus stick to this algorithm as well, and measure how different implementations of it perform.

### Functional approaches

In this section, we will cover some of the classic functional approaches used to compute a sequence of values: iterate, lazy sequence, recur and trampoline.

###### Iterate

Iterate creates a lazy sequence of values, from an initial seed, and a function to compute then next element. Indexing the Nth element gives us our Nth fibonacci number.

 (defn fibo-iterate [^long n] (let [next-fib (fn [[a b]] [b (+ a b)]) fibs (iterate next-fib [0N 1N])] (first (nth fibs n))))

Mean execution time: 113 microseconds.

###### Lazy sequence

Since iterate creates a lazy sequence, we can try to build the lazy sequence ourselves without using iterate. Doing so gives us more power on how we can create the sequence. In particular, we can avoid instantiating intermediary pairs as fibs-iterate did.

 (defn fibo-lazy-seq [^long n] (letfn [(fibs [a b] (cons a (lazy-seq (fibs b (+ a b)))))] (nth (fibs 0N 1N) n)))

Mean execution time: 97 microseconds.

As we could expect, no creation of pairs means less dynamic allocations, which in terns gives us better performance.

###### Recur

We can use the low level construct recur to implement the algorithm using a tail recursive implementation.

 (defn fibo-recur [^long n] (loop [curr 0N next 1N n n] (if-not (zero? n) (recur next (+ curr next) (dec n)) curr)))
view raw fibo-recur.clj hosted with ❤ by GitHub

Mean execution time: 56 microseconds.

Recur is usually used to avoid boxing in math computations, for better performance. In our case, the boxing is needed since we manipulate BigInts. Nevertheless, using recur improves the performance quite substantially.

###### Trampoline

The twin cousin of recur is trampoline, which allows to implement mutual recursion without consuming stack. We do not need mutual recursion there, but it is useful to see how it compares to recur.

 (defn fibo-trampoline [^long n] (letfn [(fibs [curr next ^long n] (if-not (zero? n) #(fibs next (+ curr next) (dec n)) curr))] (trampoline (fibs 0N 1N n))))

Mean execution time: 65 microseconds.

As we could suspect, trampoline is a bit more expensive than using plain tail recursion via recur. It is not much, but it should be reserved from true mutual recursion.

### Imperative approaches

In this section, we will cover some of the imperative approaches, based on assignments, that we could use to implement our algorithm.

###### Local vars

Clojure allows us to create local vars using with-local-vars. These local vars have lexical scoping, and can be read and modified using var-get (or @) and var-set respectively.

 (defn fibo-local-vars [^long n] (with-local-vars [curr 0N next 1N iter n] (while (> @iter 0) (let [nnext (+ @curr @next)] (var-set curr @next) (var-set next nnext) (var-set iter (dec @iter)))) @curr))

Mean execution time: 467 microseconds.

As we can see, this approach is both more imperative and less performant than our functional implementations. It turns out that vars are pretty expensive to manipulate, and that we can switch to faster “refs”.

###### Volatile bindings

Since Clojure 1.7, we have access to volatile references, which are the fastest “refs” available in Clojure. We replace our vars with volatile! below.

 (defn fibo-volatile [^long n] (let [curr (volatile! 0N) next (volatile! 1N) iter (volatile! n)] (while (> @iter 0) (let [nnext (+ @curr @next)] (vreset! curr @next) (vreset! next nnext) (vswap! iter dec))) @curr))

Mean execution time: 78 microseconds.

This implementation goes much faster than the one based on vars, and beats the lazy-seq implementation, but is still significantly slower than the functional approach based on tail recursion.

###### Getting rid of volatile

In Java, volatile variables have implications on the memory model as defined by the JVM. The relationship they impose (such as happens-before) constrain the kind of optimization we can get. Let us try to get rid of volatile to go faster.

Clojure does not allow mutable non-volatile local variables. But we can cheat. Deftype allows to declare :unsynchronized-mutable variables, which are plain Java variables without the volatile. So we can declare a type just to get rid of our volatiles. This is already quite ugly, but this is about to get worse.

Since Clojure does not allow public access to mutable fields, we will declare a protocol to advance in the fibonacci sequence value and have our deftype implement it to mutate its fields. To keep a bit of dignity, we wrap this horror inside a function fibo-with-type:

 (defprotocol Advance (advance [this n])) (deftype FiboType [^:unsynchronized-mutable curr ^:unsynchronized-mutable next] Advance (advance [_ n] (loop [^long n n] (if-not (zero? n) (let [nnext (+ curr next)] (set! curr next) (set! next nnext) (recur (dec n))))) curr)) (defn fibo-with-type [^long n] (advance (FiboType. 0N 1N) n))

Mean execution time: 60 microseconds.

This implementation goes slightly slower than our tail recursive implementation. It is therefore both ugly and not worth it: let us not do that again.

###### Java implementation (baseline)

Of course, the ultimate imperative solution would be a Java implementation wrapped inside some Clojure code.

 public class PerfFiboJava { static public BigInteger fibs(int n) { BigInteger curr = BigInteger.valueOf(0); BigInteger next = BigInteger.valueOf(1); for (; n > 0; --n) { BigInteger nnext = curr.add(next); curr = next; next = nnext; } return curr; } } (defn fibo-with-java [^long n] (PerfFiboJava/fibs n))

Mean execution time: 41 microseconds.

This implementation goes significantly faster than our best Clojure implementation so far (56 microseconds). But notice that the implementation makes use of java.math.BigInteger instead of clojure.lang.BigInt.

###### Clojure with BigInteger

We can rewrite our fastest Clojure implementation, fibo-recur, to use Java’s BigInteger instead of Clojure’s BigInts, and check if it makes any difference.

 (defn fibo-recur-java-bigint [^long n] (loop [curr (BigInteger/valueOf 0) next (BigInteger/valueOf 1) n n] (if-not (zero? n) (recur next (.add curr next) (dec n)) curr)))

Mean execution time: 41 microseconds.

As we can see, the Clojure implementation based on tail recursion and BigInteger is on par with a purely imperative Java implementation.

### All measurements

You can find below the traces of a full benchmark of all the implementations listed above. These benchmarks were performed on the same machine, all at the same time, using the following benchmark code, and based on the criterium library.

 fibo-iterate : ------------------------------ Evaluation count : 539580 in 60 samples of 8993 calls. Execution time mean : 113,354044 µs Execution time std-deviation : 3,455371 µs Execution time lower quantile : 107,950936 µs ( 2,5%) Execution time upper quantile : 119,014339 µs (97,5%) Overhead used : 1,633507 ns fibo-lazy-seq : ------------------------------ Evaluation count : 638160 in 60 samples of 10636 calls. Execution time mean : 96,818549 µs Execution time std-deviation : 2,511137 µs Execution time lower quantile : 92,324320 µs ( 2,5%) Execution time upper quantile : 101,180065 µs (97,5%) Overhead used : 1,633507 ns fibo-recur : ------------------------------ Evaluation count : 1017960 in 60 samples of 16966 calls. Execution time mean : 56,285983 µs Execution time std-deviation : 1,681193 µs Execution time lower quantile : 54,120987 µs ( 2,5%) Execution time upper quantile : 59,522576 µs (97,5%) Overhead used : 1,633507 ns fibo-trampoline : ------------------------------ Evaluation count : 956160 in 60 samples of 15936 calls. Execution time mean : 65,309869 µs Execution time std-deviation : 2,272909 µs Execution time lower quantile : 62,521979 µs ( 2,5%) Execution time upper quantile : 69,748727 µs (97,5%) Overhead used : 1,633507 ns fibo-local-vars : ------------------------------ Evaluation count : 130320 in 60 samples of 2172 calls. Execution time mean : 467,227202 µs Execution time std-deviation : 13,190264 µs Execution time lower quantile : 446,724420 µs ( 2,5%) Execution time upper quantile : 488,447763 µs (97,5%) Overhead used : 1,633507 ns fibo-volatile : ------------------------------ Evaluation count : 798360 in 60 samples of 13306 calls. Execution time mean : 77,973893 µs Execution time std-deviation : 2,420353 µs Execution time lower quantile : 75,057724 µs ( 2,5%) Execution time upper quantile : 82,666311 µs (97,5%) Overhead used : 1,633507 ns fibo-with-type : ------------------------------ Evaluation count : 976080 in 60 samples of 16268 calls. Execution time mean : 59,948189 µs Execution time std-deviation : 1,970876 µs Execution time lower quantile : 57,145222 µs ( 2,5%) Execution time upper quantile : 63,384722 µs (97,5%) Overhead used : 1,633507 ns fibo-with-java : ------------------------------ Evaluation count : 1501800 in 60 samples of 25030 calls. Execution time mean : 41,292173 µs Execution time std-deviation : 1,295403 µs Execution time lower quantile : 39,329754 µs ( 2,5%) Execution time upper quantile : 43,271634 µs (97,5%) Overhead used : 1,633507 ns fibo-recur-java-bigint : ------------------------------ Evaluation count : 1495620 in 60 samples of 24927 calls. Execution time mean : 41,463051 µs Execution time std-deviation : 1,454860 µs Execution time lower quantile : 39,384545 µs ( 2,5%) Execution time upper quantile : 44,186069 µs (97,5%) Overhead used : 1,633507 ns

### Conclusion

We went over different implementation for the fibonacci sequence (based on the linear algorithm, not the best one) and completed the relative measures of the many implementation shown in the Clojure: High Performance JVM Programming.

We measured that recur can offer performance very close to the performance of Java implementations. The functional implementations are overall pretty efficient, those based on lazy sequence being a bit slower than the others.

We also saw that Clojure’s BigInts are not as fast (in this use case at least) as Java’s BigIntegers. We went from 56 microseconds down to 41 microseconds by switching to the Java’s BigInteger, which is a quite significant gain.

### Bonus and side notes

I implemented the same algorithm in Haskell, on the same laptop, using two different implementations: one based on iterate and one based on tail recursion.

The Haskell tail recursive implementation is basically equivalent to the fibo-recur implementation of Clojure.

 fiboRecur :: Int -> Integer fiboRecur = loop 0 1 where loop !curr !next n | n == 0 = curr | otherwise = loop next (curr + next) (n - 1)
view raw fiboRecur.hs hosted with ❤ by GitHub

Mean execution time: 30 microseconds.

This is faster than what we achieved with the Clojure or Java implementations, more than 25% improvement. This can be due to many factors: Haskell being compiled to native code, the implementation of Integer (equivalent of Java’s BigInteger), or specific optimizations triggers by the GHC compiler.

The second Haskell implementation is based on the iterate equivalent of Clojure, also named iterate in Haskell.

 fiboIterate :: Int -> Integer fiboIterate n = fst (iterate next (0, 1) !! n) where next (a, b) = (b, a + b)
view raw fiboIterate.hs hosted with ❤ by GitHub

Mean execution time: 4.5 microseconds.

Haskell is renowned to be very efficient at optimizing purely functional code, with powerful loop fusions and rewrite rules. But the result looks kind of surprising. It is almost one order of magnitude faster than Java!

As a reference, an imperative Python implementation takes around 150 microseconds. So it looks like Haskell is doing some magic there, or is cheating: the Haskell tail recursive implementation should also be faster than iterate.

Unable to find an appropriate explanation for this, I posted the question on reddit/r/haskell. You can have a look at the discussion, but the short story is that GHC is memoizing (caching) the results, effectively trading space for time.

###### Benchmark results

All the measurements were made using the Criterion benchmark library of Haskell, which is very close to the criterium benchmark library of Clojure.

 benchmarking Fibo/Iterate1000 time 4.467 μs (4.430 μs .. 4.512 μs) 0.999 R² (0.999 R² .. 1.000 R²) mean 4.488 μs (4.447 μs .. 4.540 μs) std dev 153.8 ns (124.8 ns .. 199.5 ns) variance introduced by outliers: 44% (moderately inflated) benchmarking Fibo/Recur1000 time 30.31 μs (29.65 μs .. 30.99 μs) 0.998 R² (0.997 R² .. 1.000 R²) mean 29.81 μs (29.63 μs .. 30.11 μs) std dev 827.4 ns (465.3 ns .. 1.217 μs) variance introduced by outliers: 29% (moderately inflated)

###### Last thoughts

The kind of optimizations done by the Haskell GHC compiler are quite stunning. I tried some other test cases and in most cases, Haskell and Clojure perform in the same ballpark, indicating that this kind of memoization does not always kicks in.

In any case, it shows how hard of a subject is performance, and how much it depends on many subtle factors. If anything, it show that we should be suspicious about concluding anything from micro-benchmarks like these.