Saturday, June 28, 2008

More Fibonacci

Note: you may want to read the extensive comment chain as it has much more information than the original post.

So I went looking around to see what other people had done to calculate Fibonacci numbers exactly. I saw plenty of naive implementations (F(n) = F(n-1) + F(n-2)). Wikipedia has plenty of sample code, but none seemed interesting the first time I went through the page (although it does have valuable things, see below). Wolfram (Mathematica) didn't say much (or I missed it if it did). Then I ran into this page, and Dijkstra's recursion:

  • F(2n) = (2 F(n-1) + F(n)) * F(n)
  • F(2n-1) = F(n-1)^2 + F(n)^2
Ok Dijkstra, you're on now. Which is faster: Dijkstra's recursion, or the cached split recursion calculator based on F(a+b)?

As it turns out, Dijkstra's recursion is extremely fast. However, it is not as fast as the cached calculator can be. And in fact, the presence of the cache is the difference: by design, Dijkstra's recursion is difficult to cache because the indices that would be cached change with every F(k). On the other hand, by using F(a+b), one can choose which indices will be cached in advance, and therefore it is possible to use the same cache for all F(k).
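
For reference, here is what Dijkstra's recursion looks like as a minimal Python sketch (an illustration only, not the Smalltalk code measured here; returning the pair (F(n), F(n-1)) keeps the bookkeeping simple):

def fib_pair(n):
    # returns (F(n), F(n-1)) using the two identities above
    if n < 2:
        return (n, 1 - n)                # (F(1), F(0)) or (F(0), F(-1)), since F(-1) = 1
    fm, fm1 = fib_pair(n // 2)           # F(m), F(m-1) with m = n // 2
    f2m = (2 * fm1 + fm) * fm            # F(2m)   = (2 F(m-1) + F(m)) * F(m)
    f2m1 = fm1 * fm1 + fm * fm           # F(2m-1) = F(m-1)^2 + F(m)^2
    if n % 2 == 0:
        return (f2m, f2m1)
    return (f2m + f2m1, f2m)             # F(2m+1) = F(2m) + F(2m-1)

fib_pair(15625)[0]                       # F(15625)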

Measurements show that F(15625) is calculated by the cached calculator about 2x faster than Dijkstra's method --- but only after the cache is filled. With F(390625), this advantage is reduced to a factor of about 1.4x.

How about F(1953125)? Dijkstra's method crunches this monster of a number, complete with its 408179 decimal digits, in 16.5 seconds. The cached calculator, after the cache is filled, does the same in 14.7 seconds. At first sight it appears that as the Fibonacci index increases, the two methods become more and more similar. Maybe there is an asymptotic advantage for Dijkstra's method (for example, multiplication is most efficient when squaring), although it's hard to tell.

Where the cached calculator does leave Dijkstra's recursion in the dust is when calculating F(k) mod m. Here the caching proves decisive: for F(15625), the runtime difference is 12x.
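
For a rough idea of the shape of such a calculator, here is a Python sketch (illustration only; the split rule, class name, and modulus are made up, and this is not the code posted earlier):

class CachedFibMod:
    def __init__(self, m):
        self.m = m
        self.cache = {0: 0, 1: 1 % m, 2: 1 % m}

    def fib(self, k):
        if k in self.cache:
            return self.cache[k]
        a = 1 << (k.bit_length() - 1)    # split near the largest power of two <= k
        if k - a < 2:                    # k = 2^j or 2^j + 1: split in half instead
            a = k >> 1
        b = k - a
        # F(a+b) = F(a+1) * F(b) + F(a) * F(b-1), every product reduced mod m
        r = (self.fib(a + 1) * self.fib(b) + self.fib(a) * self.fib(b - 1)) % self.m
        self.cache[k] = r
        return r

calc = CachedFibMod(10 ** 9 + 7)         # made-up modulus, just for illustration
calc.fib(15625)

The point is that the same cache entries get reused for every subsequent F(k) mod m, since the split indices are chosen in advance.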

Fun, isn't it? :)

PS1: at first I thought Wikipedia's alternate expressions for Dijkstra's recursion (under J, double recursion) would be faster, but measurements reveal that they are slightly slower than Dijkstra's original recursion.

PS2: unhappy with a couple of things, I made a few changes to the cached calculator so its execution context is more aware of what is going on. This results in a speedup of a few percent compared to the code I posted earlier.

35 comments:

Lukas Renggli said...

What about using a generating function?

The obvious drawback here is that floating-point arithmetic is used. I assume that for big Fibonacci numbers it is faster than any recursive approach? The closed form has no additional memory requirements (cache) and the formula is simple to derive too.

Andres said...

Lukas,

Unfortunately, calculations that involve things like phi ((1 + 5 sqrt) / 2) will, at most, have 53 significant bits. This is insufficient for even a modestly large Fibonacci number such as F(100), which requires 69 bits to express in binary.

One could go and calculate phi with a lot of precision as required by the F(k) to be calculated. However, I'd expect those calculations to run significantly slower than integer arithmetic and a powerful recursion (such as Dijkstra's).
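
For instance, a quick Python check (illustration only; fib_exact is just a plain iterative reference) shows the 53 bits running out well before F(100):

from math import sqrt

def fib_exact(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

phi = (1 + sqrt(5)) / 2
approx = round(phi ** 100 / sqrt(5))     # Binet's formula with 53-bit floats
print(approx == fib_exact(100))          # False: the trailing digits are already wrong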

Andres.

nicolas cellier said...

An interesting alternative is to use Dijkstra to generate the cache.
You then get an algorithm combining the best features of each.
The code (not factored so well as yours) would be:

privateFibonacci: anInteger
    "optimize with Dijkstra algorithm in case of power of 2 or immediate neighbour"

    | half fn n21 firstHalf secondHalf |
    anInteger highBit > (anInteger - 1) highBit
        ifTrue: [half := anInteger bitShift: -1.
            fn := self fibonacci: half.
            ^cache at: anInteger put: ((self fibonacci: half - 1) bitShift: 1) + fn * fn].
    ((n21 := anInteger + 1) highBit > anInteger highBit or: [anInteger highBit > (anInteger - 2) highBit])
        ifTrue: [half := n21 bitShift: -1.
            ^cache at: anInteger put: ((self fibonacci: half - 1) squared) + ((self fibonacci: half) squared)].

    firstHalf := self firstHalfIndexFor: anInteger.
    secondHalf := anInteger - firstHalf.
    ^(self fibonacci: firstHalf + 1) * (self fibonacci: secondHalf)
        + ((self fibonacci: firstHalf) * (self fibonacci: secondHalf - 1)).

Tom said...

I don't see why you can't use a cache with Dijkstra to get the best of both worlds. True, the cache may not be fully populated, but is that really a problem? It will be filled on an as-needed basis.

Sample code is at pastebin; I got under 16 seconds for the large computation.

A Nony Mouse said...

The simple observation that f(2n+1) = (f(n)+f(n-1))^2 + f(n)^2 means that you can avoid the branching, making it iterative and saving a bunch of calcs. It might speed things up a bit.

Andres said...

Tom,

Dijkstra is hard to cache because each index to be cached changes with each F(k). For example, with F(100) you'd cache F(50), F(25) and so on. Now if you try F(104), you need F(52) and F(26), and so the previous work doesn't necessarily help you.

Andres.

Suspect said...

Are these times for one calculation? After reading your post, I wrote a naive (by which I mean it uses the F(a+b) identity, but there are NO optimizations), done-in-5-minutes python script that seems to be faster.

0.22 seconds to find fib(390625), which has 81636 decimal digits (I calculated that just to verify that it is, indeed, the correct answer.)

2.88 seconds to find fib(1953125), which is 408179 digits (matching your result.)

If the results are correct, then the speed difference is most likely the caching scheme, and/or the structure of the cache itself. Naively, my script puts every value of F(n) calculated in a hash table.

Interestingly, finding the number of digits takes an order of magnitude more time than actually calculating the function value :) Either that or it's (binary digits) * ln(2)/ln(10) ...
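
Something like this quick sketch, just for illustration (the function name is made up):

from math import log10

def digit_count_estimate(n):
    # exact or at most one too high; no base conversion of the huge number needed
    return int(n.bit_length() * log10(2)) + 1

Compare that to len(str(n)), which has to convert the whole number to decimal first.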

Andres said...

Suspect,

The cache structure used in the F(a+b) cached calculator is also a hash table. The time difference, however, leads me to think that we're really comparing the implementations of arbitrary precision arithmetic. If I remember correctly, the VisualWorks VM does not have primitives designed with this kind of huge number in mind. Do you know what multiplication algorithms Python is using?

Side note: so the VisualWorks VM is not very efficient for large integer arithmetic... perhaps one of these days I will fish out Knuth's TAOCP and change that :).

Andres.

Andres said...

Suspect,

Also, I checked out the time to evaluate this:

largeF := 1953125 fibonacci.
Time millisecondsToRun: [largeF printString size]

Here it took 28 seconds to do that. Interestingly, that would match your observation about the digit count calculation taking one order of magnitude more than your calculation (and about 2x longer than the calculation here).

I'll play a bit with the time profiler and see if something can be done about this.

Andres.

Andres said...

Hey Suspect, also... if you are into math, have you seen anything that makes it possible to find solutions to this type of problem?

ax + b = k^2

All variables are integers. On the left, a and b are fixed. On the right, k is a free variable. Also, ax + b is tuned so that it only produces quadratic residues mod many primes. The variable x has a range of possible values, starting at 1. The issue is to determine if any x satisfies the equation above, and if so which (although a deterministic yes/no answer is also valuable).

Andres.

Andres said...

Nony mouse,

I am sorry, I am not seeing why the branch wouldn't be necessary. Would you mind clarifying a bit?

Also, keep in mind that the execution time is going to be strongly dominated by large integer arithmetic.

Andres.

Suspect said...

That's very possible. Hmm. I'm just itching to have a go at it with pure C to see where this can go.

No, I have no idea what algos Python uses to handle arbitrary precision numeric types.

Ah, Knuth. The best way to make programmers learn calculus... He always manages to thoroughly discourage me with his godly math-fu :)

Side note: Isn't it fitting that TAOCP has Tao in its acronym? :)

Andres said...

Suspect,

A final clarification... (I wish I could edit my own comments!)... the numbers a and b are way beyond the threshold past which continued fractions are no longer useful.

Andres.

Suspect said...

Ouch, number theory :(

I haven't seen anything like that, and I'm not too good with discrete math, but I'd say work mod a, and then, ah --

-- nope. Nothing leaps, as my math teacher would put it.

Suspect said...

Yes, continued fractions would have helped, but that "recipe" doesn't leave much room for any extra conditions (eg. quadratic residues here) and you end up starting from Fermat's Little and Co. and working on up.

Forgive me, but I'd take a diffy q any day ;)

grfgguvf said...

Strange argument there...

You dismiss the formula based on the golden section but you use big-integers for the recursive version.

The formula with big-floats would be much faster.

Maybe Smalltalk doesn't have big floats, but Mathematica does, for example.

Andres said...

Suspect,

> That's very possible. Hmm. I'm just itching to have a go at it with pure C to see where this can go.

Yeah, that's always an option... I used to do that kind of stuff with x86 assembler :).

> Ah, Knuth. The best way to make programmers learn calculus... He always manages to thoroughly discourage me with his godly math-fu :)

Oohhh, actually... get a copy of Concrete Mathematics. Seriously, I do not think it's possible to have more fun with math than that.

> Side note: Isn't it fitting that TAOCP has Tao in its acronym? :)

Haha :)...

Andres.

Andres said...

Suspect,

Dude, you're 17 and you're talking continued fractions and Fermat's theorem and such as if they were your friends! Seriously, grab a copy of Concrete Mathematics. It even has differential equation stuff in it!!!

Andres.

Andres said...

Grfgguv,

> You dismiss the formula based on the golden section but you use big-integers for the recursive version.

I do not see what's strange about this?

> The formula with big-floats would be much faster. Maybe Smalltalk doesn't have big floats, but Mathematica does, for example.

I implemented my own arbitrary (fixed) precision and arbitrary base floating point numbers in Smalltalk. Actually it's quite easy because you treat them as fractions and then it's all integer arithmetic behind the scenes.

Perhaps I am missing something, but my reluctance to use floating point numbers is that using them would force me to do a precision analysis on the whole expression to determine how precise the calculations should be.

For instance, when one raises things like phi to powers, how precise does phi have to be for the final result to be exact? How much precision can you tolerate losing in each step? At this point, I just think "well I can do it in integers and be done with it".

But... maybe I'm wrong (as I frequently am). Would you mind running some experiments in Mathematica and see what you get? I do not have it myself.

Andres.

Andres said...

Suspect,

At one point I considered mod b, but the fact that k^2 is a free variable means that I'd have to determine whether ax is a quadratic residue mod b. Since unfortunately b is rather large, calculating all the quadratic residues mod b and putting them in a hash table (in Smalltalk, a Set) for fast lookup is out of the question.

I do remember that at one point I had derived a check to see if something was a quadratic residue mod something else...

... hey... maybe there is something in there after all. All quadratic residues are of the form (1 + 3 + ... + (2k-1)) mod b, i.e., a partial sum of the odd numbers reduced mod b, and that has to be equal to ax mod b...

I'll have to think more about that.

Andres.

Andres said...

Nicolas,

> The code (not factored so well as yours) would be:

Well well... thank you :).

Now seriously, I think the ability to write factored code coupled with the inability to write unfactored code is a very important skill to hone. This is why I am writing the fundamentals book: so I can put down what I learned after 12 years of Smalltalk.

Andres.

Tom said...

Andres,

I understood that the caching mechanism may or may not help in all cases using Dijkstra. As you stated, you may not hit the cache that often if you are searching for numbers whose binary representations are very different. (Since we are roughly dividing by 2 each time, there should be a relation between the bits set and the values that need to hit the cache.)

Having said that, I don't see what you lose by adding in the cache. Any hits you do have should help save some effort, and over time the cache should get populated as well.

Tom said...

Addendum to last comment.

As a quick test, I copied my earlier code, using exactly the same algorithm but without memoization/caching. Granted, this is just one implementation, in Python, but the numbers show a significant difference.

Time in ms:

fib(15625) --- 0.0
fib(390625) --- 125.0
fib(1953125) --- 1531.00013733
------------
fib2(15625) --- 266.000032425
fib2(390625) --- 18389.9998665
fib2(1953125) --- 119743.999958

nicolas cellier said...

I followed the Wikipedia links, implemented the matrix iteration [0 1; 1 1]^n, and tried to understand why it was slower than the split algorithm.

Well, the log(n) algorithm like raisedToInteger: is naive...

Given that the inverse matrix is also made of integers ([-1 1; 1 0]), there is a possibly faster way to compute some powers...
For example (M raisedTo: 126) is not as efficient as (M raisedTo: 128) * (M raisedTo: -2).

Based on this, I have a version that
- caches M^(2^p) (as a triplet)
- is written with an iteration rather than a recursion
- counts the bits of n and, if that is more than half of highBit, uses a decomposition of (2^highBit) - n via the inverse matrix...

Performance is interesting.
Here is my Squeak version.
Style still to be cleaned up...

Object subclass: #FibonacciCachedMatrixIterationCalculator
instanceVariableNames: 'cache bitCounter'
classVariableNames: ''
poolDictionaries: ''
category: 'Fibonacci'


initialize
super initialize.
cache := OrderedCollection with: #(0 1 1).
self initializeBitCounter


initializeBitCounter
"answer a table to count bits in a byte"

| bitCount |
bitCount := #(0).
8 timesRepeat: [bitCount := bitCount , (bitCount collect: [:e | e + 1])].
^bitCounter := bitCount


getCacheOfRank: anInteger
| abc a2 b2 c2 |
[anInteger > cache size]
whileTrue: [abc := cache last.
a2 := (abc at: 1) squared.
b2 := (abc at: 2) squared.
c2 := (abc at: 3) squared.
cache
add: (Array
with: a2 + b2
with: c2 - a2
with: b2 + c2)].
^ cache at: anInteger


getInverseCacheOfRank: anInteger
| abc |
abc := self getCacheOfRank: anInteger.
^anInteger = 1
ifTrue: [Array
with: abc last negated with: abc second with: abc first negated]
ifFalse: [Array
with: abc last with: abc second negated with: abc first]


countBitsOf: anInteger
"count bit set to 1"

| bitCount |
bitCount := 0.
1 to: anInteger digitLength do: [:i |
bitCount := bitCount + (bitCounter at: 1 + (anInteger digitAt: i))].
^bitCount


fibonacci: anInteger
| h n abc abc1 a2 b2 c2 nBits |
h := anInteger highBit.
nBits := self countBitsOf: anInteger.
nBits > (h bitShift: -1)
ifTrue:
[abc := (self getCacheOfRank: h + 1) copy.
n := (1 bitShift: h) - anInteger.
[h := n highBit.
abc1 := self getInverseCacheOfRank: h.
a2 := abc first * abc1 first.
b2 := abc second * abc1 second.
c2 := abc last * abc1 last.
abc at: 1 put: a2 + b2;
at: 2 put: c2 - a2;
at: 3 put: b2 + c2.
(n := n bitXor: (1 bitShift: h - 1)) > 0]
" (n := n - (1 bitShift: h - 1)) > 0]"
whileTrue]
ifFalse:
[abc := (self getCacheOfRank: h) copy.
n := anInteger.
" [(n := n - (1 bitShift: h - 1)) > 0]"
[(n := n bitXor: (1 bitShift: h - 1)) > 0]
whileTrue: [h := n highBit.
abc1 := self getCacheOfRank: h.
a2 := abc first * abc1 first.
b2 := abc second * abc1 second.
c2 := abc last * abc1 last.
abc at: 1 put: a2 + b2;
at: 2 put: c2 - a2;
at: 3 put: b2 + c2]].
^ abc second

Andres said...

Tom,

To a point, I just feel that using a cache will be most effective when the cache hits can be predicted in advance. This is why I do not feel too enthusiastic about caching Dijkstra.

On the other hand, nothing beats an experiment. I will try that and compare both implementations.

Andres.

Andres said...

Nicolas,

Watch out because I think digitAt: gives you a byte as opposed to a bit...

Andres.

nicolas cellier said...

> Watch out because I think digitAt: gives you a byte as opposed to a bit...

Yes, I know; I count bits in a byte fast, then iterate over the bytes.

grfgguvf said...

Phi^k approaches an integer. So big "floats" are needed, but not arbitrary precision reals.

Phi^k + 1/Phi^k = integer (or - for odd k)

In fact, the programmer in me would just round at the k>100 magnitudes.

Or what am I missing?

Andres said...

Grfgguvf,

I think the issue is how to determine how big the floats should be, given the k in the F(k) you want to calculate.

Let's say we agree that for practical purposes 100 bits are enough. But is that sufficient for k = 15625? So let's say we now say 1000 bits are enough. But is that sufficient for k = 390625?

Without a proof via an error propagation analysis, how do we know how many bits the big floats should have?
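
For instance, even something like the following (a Python sketch using the decimal module; the ten guard digits are a guess, not a proven bound) just assumes an answer to that question:

from decimal import Decimal, getcontext
from math import log10, sqrt

def fib_binet(k, guard_digits=10):
    # F(k) has about k * log10(phi) decimal digits; guard_digits is a fudge factor
    getcontext().prec = int(k * log10((1 + sqrt(5)) / 2)) + guard_digits
    sqrt5 = Decimal(5).sqrt()
    phi = (1 + sqrt5) / 2
    return int((phi ** k / sqrt5).to_integral_value())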

Andres.

Andres said...

Tom,

I ran the experiment caching every index Dijkstra sees. Here are the times.

F(390625)

F(a+b) first time: 985 ms.
F(a+b) second time: 570 ms.

Dijkstra first time: 635 ms.
Dijkstra second time: 0 ms (because it remembers everything).

At this point, I tried again with F(1953125).

F(a+b) first time: 22.5s.
F(a+b) second time: 14.6s.

Dijkstra first time: 14.2s.
Dijkstra second time: 0 ms (because it remembers everything).

At this point, the cache sizes are:

F(a+b): 95 entries.
Dijkstra: 105 entries.

Ok, so now I am tempted to do another experiment: allow the F(a+b) cache to cache everything, just like Dijkstra's. So here we go.

F(390625):

Greedy F(a+b) first time: 1s.
Greedy F(a+b) second time: 0 ms.

F(1953125):

Greedy F(a+b) first time: 21.7s.
Greedy F(a+b) second time: 0 ms.

Clearly this shows that Dijkstra, when allowed to cache everything, will be significantly faster than the F(a+b) recursion. The growth of the cache size over time, however, may be a concern depending on the application. Even then, it might be a good idea to just blow the cache away and let it be rebuilt.

Andres.

Andres said...

Tom,

Further examination shows that after the tests, letting F(a+b) cache everything will result in 277 cache entries compared to Dijkstra's 105. As such, it seems that the interesting cases are either to allow Dijkstra to cache everything in the interest of speed, or to allow F(a+b) to cache triplets only.

Andres.

nicolas cellier said...

Well, the MatrixIteration algorithm I posted is very sensitive to bit patterns:

very efficient with 2r1111111111000000
less efficient with 2r1000000111111111

I'll have to review the condition for branching on inverseMatrix...

Maybe I'll have to check for and iterate over bit groups...

nicolas cellier said...

Yes, of course, the MatrixIteration branch test does not always take the most efficient branch because I confused bitInvert and two's complement! Tsss...

I will correct and publish an update.

A Nony Mouse said...

I was guessing at your algorithm from the times and the link you posted.

I knocked up some quick Python to show what I was thinking. I think that the recursive one is more understandable, but the iterative one might be a wee bit faster. It should always be faster than F(a+b) uncached, and perhaps stand a chance against the cached version when n = 2^i - 1.

# returns (f(n), f(n-1))
def dijkstra(n):
  if n<2: return (n,0)
  (a,b)=dijkstra(n//2)           # (f(n//2), f(n//2 - 1))
  c=a+b
  if n&1: return (c*c+a*a,(b+c)*a)
  return ((b+c)*a,a*a+b*b)

def dijkstra2(n):
  (a,b,i)=(1,0,1<<32)            # assumes n < 2**33
  while i and not n&i:i>>=1      # find the highest set bit of n
  while i>1:
    i>>=1
    c=a+b
    if n&i: (a,b)= (c*c+a*a,(b+c)*a)
    else:(a,b)= ((b+c)*a,a*a+b*b)
  return a

Andres said...

A Nony Mouse,

I tried this with F(1048575). F(a+b), cold cache, 7.1 seconds. Dijkstra, 4.8 seconds.

However, note that the cached, non-greedy F(a+b) calculator will cache triplets around powers of two, and so it will cache indices like 2^k - 1. Therefore, the second time one does F(1048575) with the F(a+b) calculator, the run time is (essentially) zero.

Andres.