Monday, July 22, 2013

[abfsttwf] Optimizing "iterate"

The logistic map becomes interestingly chaotic for multipliers greater than 3.56995.  However, for our purposes, it is just an example of an endomorphic function.

logistic_chaos :: Double -> Double;
logistic_chaos x = 3.57 * x * (1-x);

As an example of a computationally expensive task, we seek the value of the billionth iteration of this function initialized at 0.5.  This post will explore various implementations of iterateN.  Compiler was ghc 7.6.3 on amd64 with compilation flag -O2.

result :: Double;
result = iterateN 1000000000 logistic_chaos 0.5;

iterateN :: Int -> (a -> a) -> a -> a;

The "model" implementation fails due to a space leak:

iterateN_1 n f start = (iterate f start) !! n;

Hopefully, someday the GHC strictness analyzer will be able to prevent the space leak. It seems unfortunate that the obvious implementation of a simple and common task fails so badly.

Making the evaluated function strict still fails with a space leak:

{-# LANGUAGE BangPatterns #-}
strict_chaos :: Double -> Double;
strict_chaos !x = 3.57 * x * (1-x);

We go back to the non-strict version. This one also fails due to a space leak:

iterateN_2 n f = last . (take (1+n)) . (iterate f);

We reimplement the Prelude function "iterate" by inserting "seq".  There were two ways of doing it, so I tried both.  Putting the "seq" in the tail was slower, completing in 17.0 seconds.

iterateN_3 n f start = let {
seq_iterate :: (a -> a) -> a -> [a];
seq_iterate f start = start : (seq start (seq_iterate f (f start)))
} in (seq_iterate f start) !! n;

Putting the "seq" around the whole list was faster, completing in 14.5 seconds:

iterateN_4 n f start = let {
seq_iterate :: (a -> a) -> a -> [a];
seq_iterate f start = seq start (start : (seq_iterate f (f start)))
} in (seq_iterate f start) !! n;

Not passing "f" into the inner function but instead relying on scope cut the runtime to 10.9 and 10.8 seconds. However, this is an undesirable solution as we would like to make seq_iterate a global library function.

{-# LANGUAGE ScopedTypeVariables #-}
iterateN_5 :: forall a . Int -> (a -> a) -> a -> a;
iterateN_5 n f start = let {
seq_iterate :: a -> [a];
seq_iterate start = start : (seq start (seq_iterate (f start)))
} in (seq_iterate start) !! n;

iterateN_6 :: forall a . Int -> (a -> a) -> a -> a;
iterateN_6 n f start = let {
seq_iterate :: a -> [a];
seq_iterate start = seq start (start : (seq_iterate (f start)))
} in (seq_iterate start) !! n;

Manually maintaining a counter runs in 9.7 seconds:

iterateN_7 n f start = case (compare n 0) of {
  LT -> error "cannot iterate less than zero";
  EQ -> start;
  GT -> seq start (iterateN_7 (pred n) f (f start))

Using an Integer instead of an Int counter cost an additional 1.14*10^-8 seconds per iteration, so this ran in 21.1 seconds.

iterateN_Integer :: Integer -> (a -> a) -> a -> a;
iterateN_Integer n f start = case (compare n 0) of {
  LT -> error "cannot iterate less than zero";
  EQ -> start;
  GT -> seq start (iterateN_Integer (pred n) f (f start))

A pure C++ version (gcc version 4.6.3 compiled with -O2) runs in 4.1 seconds:

#include <iostream>
int main(){
  double x=0.5;
  for(int i=0;i<1000000000;++i){

Complete Haskell source code.


Anonymous said...

Try this function from Tokano Akio on Haskell-Cafe, which evaluates a list in the order you need:

headStrict :: [a] -> [a]
headStrict = foldr (\x y -> x `seq` (x:y)) []

iterateN_1 n f start = headStrict (iterate f start) !! n

How long does it take?

Josef said...

Another problem with your initial version (apart from the space leak) is that GHC doesn't fuse `iterate` and `!!`. If you wanted to still have a compositional way of writing your program I would suggest you wrote a strict version of `iterate` and used stream fusion from the stream-fusion package. Then GHC would be able to fuse the two functions and you would get both compositionality and efficiency. My bet is that a version based on stream fusion would be about as fast as `iterateN_7`.

Unknown said...

I would find it interesting if you could try an imperative version implemented in Haskell (in ST or IO monad) and compare its execution time to the others...

Johannes Röhl said...

I have written a Haskell variant, which is as fast as the C++ code. The main difference to "iterateN_7" is, that "f" is not recursively passed around as a parameter. This allows GHC to inline "f" in place. That gives approximately a speedup of factor two.

I learned the trick by looking on the implementation of "map" and Co in the "Base" library.

Best regards

mrjbq7 said...
This comment has been removed by the author.
mrjbq7 said...

Nice post!

I made an implementation in Factor to see how it compares to C, and it does pretty well:

winterkoninkje said...

You can further improve on Johannes Röhl's version by using (<= 0) which is more optimal than using (==0). On my machine the former generates:

_c1XM: cmpl $0,0(%ebp); jle _c1XP; ... jmp _c1XM; _c1XP: ... jmp *0(%ebp)

whereas the latter generates:

_c1Yv: movl 0(%ebp),%eax; testl %eax,%eax; jne _c1Yz; ... jmp *0(%ebp); _c1Yz: ... jmp _c1Yv

The cmpl/jle is more efficient than movl/testl/jne, and you get marginally better locality as well. This is a general phenomenon when dealing with ints. If GHC focused more on microptimizations, then it could transform the (==0) version arising from case analysis into the (<= 0) version; but so far it does not.

winterkoninkje said...

I.e., the following code:

iterateN_ :: Int -> (a->a) -> a -> a
iterateN_ n f x = assert (n>=0) (go n x)
go !n !x
| n <= 0 = x
| otherwise = go (n-1) (f x)