
Thursday, May 28, 2015

Prime Number Sieve: A Study in Elementary Haskell Optimization

Recently I ran across the following implementation of the Sieve of Eratosthenes (which calculates all the prime numbers) in a post on the Code & Co. blog:

{- Some names slightly changed -}
import qualified Data.IntMap.Strict as IM
import Data.List (foldl')

go (n:ns) m = case IM.lookup n m of
    Nothing -> n : go ns (IM.insertWith (++) (2 * n) [n] m)
    Just ps -> go ns $ foldl' insertPrime (IM.delete n m) ps
  where
    insertPrime m p = IM.insertWith (++) (n + p) [p] m

sieve :: [Int]
sieve = go [2..] IM.empty

This code is simple, elegant, accurate, and—in contrast to a lot of other code so labeled—it really is a sieve, rather than a fancy trial division. In fact, it has everything going for it, except performance. To see that, let's run a simple benchmark to calculate the sum of all primes up to 10⁷.

main = print $ sum $ takeWhile (< 10^7) sieve

On my workstation (Intel Core i7-970, 24 GBytes of RAM) running GHC 7.8.3 and compiling with optimizations (-O), we get the following performance statistics (as reported by the +RTS -s runtime option):

  25,989,111,904 bytes allocated in the heap
   8,494,002,744 bytes copied during GC
      68,898,312 bytes maximum residency (124 sample(s))
       1,305,240 bytes maximum slop
             198 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     49632 colls,     0 par    8.30s    7.96s     0.0002s    0.0003s
  Gen  1       124 colls,     0 par    3.44s    3.39s     0.0273s    0.0989s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   10.12s  ( 10.57s elapsed)
  GC      time   11.73s  ( 11.35s elapsed)
  EXIT    time    0.00s  (  0.02s elapsed)
  Total   time   21.86s  ( 21.95s elapsed)

  %GC     time      53.7%  (51.7% elapsed)

  Alloc rate    2,566,825,867 bytes per MUT second

  Productivity  46.3% of total user, 46.1% of total elapsed

That is a lot of memory traffic: 26 GBytes allocated on the heap, 8.5 GBytes copied during GC, and a maximum residency of almost 70 MBytes. As a comparison, I ran the same benchmark using Sebastian Fischer's implementation from Hackage. That code also implements the Sieve of Eratosthenes cleverly using only simple lazy lists of Ints. For the same benchmark, its performance is:

   2,061,038,888 bytes allocated in the heap
     179,552,920 bytes copied during GC
         300,448 bytes maximum residency (105 sample(s))
          40,112 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      3847 colls,     0 par    0.33s    0.43s     0.0001s    0.0003s
  Gen  1       105 colls,     0 par    0.08s    0.06s     0.0006s    0.0025s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    1.94s  (  1.88s elapsed)
  GC      time    0.41s  (  0.49s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    2.34s  (  2.37s elapsed)

  %GC     time      17.3%  (20.7% elapsed)

  Alloc rate    1,063,762,006 bytes per MUT second

  Productivity  82.7% of total user, 81.8% of total elapsed
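For reference, that comparison benchmark amounts to little more than swapping in the library's list of primes. The sketch below assumes the implementation in question is the primes package with its Data.Numbers.Primes module; that identification is my reading, not something stated in the post.

import Data.Numbers.Primes (primes)  -- assumed package/module, see note above

main :: IO ()
main = print $ sum $ takeWhile (< 10^7) (primes :: [Int])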

So why is our sieve, which implements the same algorithm, slower by a factor of almost 10, while copying almost 50 times as much data during GC and requiring more than 200 times as much memory? I set out to find out how fast we can make our basic algorithm using only elementary Haskell techniques (i.e., no explicit parallelism or Monad mongering).

Optimization 1

Our code seems to be performing a lot of memory accesses. They are not caused by the input or output lists; those are probably optimized away by GHC anyway. So it must be the IntMap.

The standard Haskell IntMap is a very efficient data structure which performs updates and lookups in logarithmic time. But this is one of those instances where Big O notation can be deceptive. Every lookup has to chase a nested data structure through memory, which involves many cache misses when the IntMap is large, each of which costs many cycles. Functional updates are even worse because, in addition to the lookup cost, they may require re-creation of a substantial part of the structure, using up heap space and spreading our data across memory. So let's try to minimize IntMap lookups and updates.

Our original code keeps a list of all prime factors for future Ints in the table. That is somewhat wasteful. To determine whether a number is prime, we do not need the list of all its prime factors, only the fact that it has at least one. Keeping the full lists also requires us to update multiple map entries every time we encounter a composite number.

Let's do away with all of that: rather than storing a list of prime factors in the map, we store only the largest prime factor. That way we only need to update the map once for each composite.

One slight complication in this case is that the map entry we want to update may already have been taken by another prime (which is by necessity larger, as it reached the map earlier). In that case we cannot overwrite that larger prime (because then it would cease to be part of the sieve), nor can we just drop our largest prime factor (because it too has to remain part of the sieve). Instead we have to keep searching the map until we find the next empty slot:

go (n:ns) m
  | Just p <- IM.lookup n m = go ns (insertPrime n p (IM.delete n m))
  | otherwise               = n : go ns (IM.insert (2*n) n m)
      -- No need to go through insertPrime; the map slot is by necessity empty
  where
    insertPrime s p w
      | Just r <- IM.lookup t w = insertPrime t p w
      | otherwise               = IM.insert t p w
      where t = s + p

The performance of this code is not great, but not bad either. We reduced heap allocation and memory usage substantially and gained about 20% in speed:

  12,204,052,304 bytes allocated in the heap
   6,571,745,472 bytes copied during GC
      52,657,376 bytes maximum residency (124 sample(s))
         897,592 bytes maximum slop
             152 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     23249 colls,     0 par    5.77s    5.81s     0.0002s    0.0005s
  Gen  1       124 colls,     0 par    2.47s    2.49s     0.0201s    0.0664s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    9.61s  (  9.57s elapsed)
  GC      time    8.23s  (  8.29s elapsed)
  EXIT    time    0.00s  (  0.02s elapsed)
  Total   time   17.84s  ( 17.89s elapsed)

  %GC     time      46.1%  (46.4% elapsed)

  Alloc rate    1,270,015,199 bytes per MUT second

  Productivity  53.9% of total user, 53.7% of total elapsed

Optimization 1'

In the previous optimization, we decided to store the largest prime factor in the map. Let's try the other obvious choice: the smallest prime factor. One additional complication here is that, when we try to reinsert our smallest prime factor into the map, we may find that another factor already occupies that spot. If the other factor is smaller, we just move on to the next spot in the map. If it is larger, we have to displace it and reinsert it at a higher spot.

go (n:ns) m
  | Just p <- IM.lookup n m = go ns (insertPrime n p (IM.delete n m))
  | otherwise               = n : go ns (IM.insert (n*n) n m)
  where
    insertPrime s p w
      | Just r <- IM.lookup t w, r < p = insertPrime t p w
      | Just r <- IM.lookup t w        = insertPrime t r (IM.insert t p w)
      | otherwise                      = IM.insert t p w
      where t = s + p

This turns out to be a pessimization. While slightly faster than the original code, it allocates far more on the heap and remains slower than Optimization 1.

  44,218,867,768 bytes allocated in the heap
   1,758,926,456 bytes copied during GC
      48,770,072 bytes maximum residency (23 sample(s))
       1,041,048 bytes maximum slop
             135 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     84605 colls,     0 par    1.75s    1.96s     0.0000s    0.0000s
  Gen  1        23 colls,     0 par    0.39s    0.42s     0.0184s    0.0507s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   17.78s  ( 17.68s elapsed)
  GC      time    2.14s  (  2.38s elapsed)
  EXIT    time    0.02s  (  0.02s elapsed)
  Total   time   19.94s  ( 20.09s elapsed)

  %GC     time      10.7%  (11.9% elapsed)

  Alloc rate    2,486,825,603 bytes per MUT second

  Productivity  89.3% of total user, 88.6% of total elapsed

Optimization 2

To understand why our code is still so slow, let's think about what it actually spends most of its time doing: shuffling small primes out of the map and then reinserting them at a slightly higher position (e.g., once every second integer for prime factor 2!). Given that our IntMap grows to over 660,000 elements (π(10⁷) = 664,579), that is a large map that we continuously create and discard. If we could just deal with these small primes separately without updating the map, we could have a large performance gain.

And—because integer division is actually pretty fast on modern processors—we can! Just throw out all integers with small prime factors through trial division. When we do that, we'll also have to update our insertPrime function so that it will not insert one of our larger primes into a map slot that will be skipped later. First we define a smallFactor function:

import Data.Maybe

smallFactor :: Int -> Maybe Int
{-# INLINE smallFactor #-}
smallFactor n
  {- We use 'rem' rather than the more mathematically correct 'mod' because
     'rem' compiles to a single machine instruction on many architectures
     where 'mod' does not, but yields the same results for positive
     parameters. -}
  | n `rem` 2  == 0 = Just 2
  | n `rem` 3  == 0 = Just 3
  | n `rem` 5  == 0 = Just 5
  | n `rem` 7  == 0 = Just 7
  | n `rem` 11 == 0 = Just 11
  | n `rem` 13 == 0 = Just 13
  | n `rem` 17 == 0 = Just 17
  | n `rem` 19 == 0 = Just 19
  | n `rem` 23 == 0 = Just 23
  | n `rem` 29 == 0 = Just 29
  | otherwise       = Nothing

largePrime  = head $ filter (isNothing . smallFactor) [2..]
smallPrimes = filter (\p -> smallFactor p == Just p) [2..largePrime]
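With the ten trial divisors above, largePrime and smallPrimes work out to fixed values; evaluating them in GHCi, for instance, gives:

ghci> largePrime
31
ghci> smallPrimes
[2,3,5,7,11,13,17,19,23,29]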

Next we adapt the code from Optimization 1 to just skip integers with small prime factors:

go (n:ns) m
  | Just p <- smallFactor n = go ns m
  | Just p <- IM.lookup n m = go ns (insertPrime n p (IM.delete n m))
  | otherwise               = n : go ns (insertPrime n n m)
  where
    insertPrime s p w
      | Just r <- smallFactor t = insertPrime t p w
      | Just r <- IM.lookup t w = insertPrime t p w
      | otherwise               = IM.insert t p w
      where t = s + 2*p  -- We can skip s+p because that is necessarily a multiple of 2

sieve = smallPrimes ++ go [largePrime..] IM.empty

The performance of this code is much better! By skipping integers with small prime factors, we improved our performance five-fold and reduced heap allocation by an even larger multiple:

   1,704,922,016 bytes allocated in the heap
     754,137,952 bytes copied during GC
      49,154,632 bytes maximum residency (17 sample(s))
         729,144 bytes maximum slop
             128 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      3248 colls,     0 par    0.72s    0.62s     0.0002s    0.0002s
  Gen  1        17 colls,     0 par    0.23s    0.29s     0.0169s    0.0528s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    2.19s  (  2.23s elapsed)
  GC      time    0.95s  (  0.91s elapsed)
  EXIT    time    0.00s  (  0.01s elapsed)
  Total   time    3.16s  (  3.16s elapsed)

  %GC     time      30.2%  (28.9% elapsed)

  Alloc rate    779,392,921 bytes per MUT second

  Productivity  69.8% of total user, 69.8% of total elapsed

Optimization 2'

We can try the same trick with our smallest-prime-factor-in-map code from Optimization 1':

go (n:ns) m
  | Just p <- smallFactor n = go ns m
  | Just p <- IM.lookup n m = go ns (insertPrime n p (IM.delete n m))
  | otherwise               = n : go ns (IM.insert (n*n) n m)
  where
    insertPrime s p w
      | Just r <- smallFactor t        = insertPrime t p w
      | Just r <- IM.lookup t w, r < p = insertPrime t p w
      | Just r <- IM.lookup t w        = insertPrime t r (IM.insert t p w)
      | otherwise                      = IM.insert t p w
      where t = s + 2*p

The small-factor optimization has turned the ugly duckling into a beautiful swan. While we still use much more memory than the sieve from Hackage, this one is actually faster! Compared to the original code, we increased performance ten-fold.

   3,131,728,368 bytes allocated in the heap
     305,585,008 bytes copied during GC
      37,811,952 bytes maximum residency (9 sample(s))
         661,520 bytes maximum slop
              82 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      5986 colls,     0 par    0.28s    0.27s     0.0000s    0.0002s
  Gen  1         9 colls,     0 par    0.09s    0.11s     0.0125s    0.0425s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    1.81s  (  1.85s elapsed)
  GC      time    0.38s  (  0.38s elapsed)
  EXIT    time    0.02s  (  0.01s elapsed)
  Total   time    2.20s  (  2.24s elapsed)

  %GC     time      17.0%  (17.0% elapsed)

  Alloc rate    1,727,850,134 bytes per MUT second

  Productivity  83.0% of total user, 81.5% of total elapsed
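For reference, here is how the fragments of Optimization 2' fit together into a single self-contained program. This is only a restatement of the code above: the explicit type signature for go, the sieve definition (taken unchanged from Optimization 2), and the main function are my additions.

import qualified Data.IntMap.Strict as IM
import Data.Maybe (isNothing)

-- Trial division by the first ten primes; 'rem' instead of 'mod' as discussed above.
smallFactor :: Int -> Maybe Int
{-# INLINE smallFactor #-}
smallFactor n
  | n `rem` 2  == 0 = Just 2
  | n `rem` 3  == 0 = Just 3
  | n `rem` 5  == 0 = Just 5
  | n `rem` 7  == 0 = Just 7
  | n `rem` 11 == 0 = Just 11
  | n `rem` 13 == 0 = Just 13
  | n `rem` 17 == 0 = Just 17
  | n `rem` 19 == 0 = Just 19
  | n `rem` 23 == 0 = Just 23
  | n `rem` 29 == 0 = Just 29
  | otherwise       = Nothing

largePrime :: Int                 -- 31, the first prime not handled by smallFactor
largePrime = head $ filter (isNothing . smallFactor) [2..]

smallPrimes :: [Int]              -- [2,3,5,7,11,13,17,19,23,29]
smallPrimes = filter (\p -> smallFactor p == Just p) [2..largePrime]

-- The map schedules, for each upcoming composite, one of its prime factors;
-- slot conflicts are resolved in favor of the smaller prime.
go :: [Int] -> IM.IntMap Int -> [Int]
go (n:ns) m
  | Just _ <- smallFactor n = go ns m
  | Just p <- IM.lookup n m = go ns (insertPrime n p (IM.delete n m))
  | otherwise               = n : go ns (IM.insert (n*n) n m)
  where
    insertPrime s p w
      | Just _ <- smallFactor t        = insertPrime t p w
      | Just r <- IM.lookup t w, r < p = insertPrime t p w
      | Just r <- IM.lookup t w        = insertPrime t r (IM.insert t p w)
      | otherwise                      = IM.insert t p w
      where t = s + 2*p

sieve :: [Int]
sieve = smallPrimes ++ go [largePrime..] IM.empty

main :: IO ()
main = print $ sum $ takeWhile (< 10^7) sieve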

Optimization 3

For my next optimization, I tried to reduce the number of lookups in the map. As we go through the integers, we keep checking the map for every one. But that is not really necessary. We could just ask the map for the smallest key. All numbers between our current position and that number must be either (1) multiples of a small prime factor or (2) new primes. They cannot be multiples of previously discovered large primes. Adapting the code from Optimization 2' yields:

go (n:ns) m =
    let ((nnext, pnext), m2) = IM.deleteFindMin m
        cont (n:ns) m
          | Just p <- smallFactor n = cont ns m  -- p :
          | n < nnext               = n : cont ns (IM.insert (n*n) n m)
          | otherwise               = go ns (insertPrime nnext pnext m)
    in cont ns m2
  where
    insertPrime s p w  -- same as before
      | Just r <- smallFactor t        = insertPrime t p w
      | Just r <- IM.lookup t w, r < p = insertPrime t p w
      | Just r <- IM.lookup t w        = insertPrime t r (IM.insert t p w)
      | otherwise                      = IM.insert t p w
      where t = s + 2*p

sieve = smallPrimes ++ [largePrime] ++ go [largePrime..] (IM.singleton (largePrime^2) largePrime)

I would have thought that eliminating so many lookups and merging the map lookup with the delete would improve performance at least slightly. I was wrong. One would need to look at the code produced by the optimizer (GHC's Core output, e.g., via -ddump-simpl) to find out why.

   4,657,497,608 bytes allocated in the heap
     375,697,488 bytes copied during GC
      47,478,232 bytes maximum residency (10 sample(s))
         682,720 bytes maximum slop
             108 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      8914 colls,     0 par    0.25s    0.33s     0.0000s    0.0001s
  Gen  1        10 colls,     0 par    0.14s    0.16s     0.0156s    0.0540s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    1.91s  (  1.85s elapsed)
  GC      time    0.39s  (  0.48s elapsed)
  EXIT    time    0.00s  (  0.01s elapsed)
  Total   time    2.30s  (  2.35s elapsed)

  %GC     time      17.0%  (20.6% elapsed)

  Alloc rate    2,443,277,433 bytes per MUT second

  Productivity  83.0% of total user, 81.1% of total elapsed

Conclusion

For much more on the sieve in Haskell and even better optimizations, see https://wiki.haskell.org/Prime_numbers. If you have a suggestion on how to improve the approach here, please comment!