Fast modular Fibonacci

Basics

As a reminder, the Fibonacci sequence is defined in the following way:

\[ F_n = \begin{cases} 0 & \textrm{ if } n = 0,\\ 1 & \textrm{ if } n = 1,\\ F_{n - 1} + F_{n - 2} & \textrm{ otherwise} \end{cases} \]

So the sequence goes 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, … .

If we were to compute a term of this sequence in the straightforward way, we would obtain the following code (in Haskell):

fib 0 = 0
fib 1 = 1
fib n = fib (n - 1) + fib (n - 2)

Recursion tree

The program above clearly computes what we want, but the time it takes is awful. As an experiment, computing the 32nd term of the sequence, \(F_{32} = 2178309\), takes about 9.37 seconds on my computer.

The reason for this low speed is best explained by showing a tree of “which computations call which other computations”. Here’s such a tree for computing \(F_5\):
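In text form, the recursion tree for \(F_5\) looks like this (each node is one call to fib):

fib 5
├── fib 4
│   ├── fib 3
│   │   ├── fib 2
│   │   │   ├── fib 1
│   │   │   └── fib 0
│   │   └── fib 1
│   └── fib 2
│       ├── fib 1
│       └── fib 0
└── fib 3
    ├── fib 2
    │   ├── fib 1
    │   └── fib 0
    └── fib 1

Counting, there are 15 nodes, and indeed \(15 = 2F_6 - 1\).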

We can clearly see that this tree gets quite big. In fact, when computing \(F_n\), this tree has \(2F_{n + 1} - 1\) nodes:

This is a simple proof by induction on \(n\). If \(n = 0\) or \(n = 1\), we have a tree with a single node. We also have \(2F_{0 + 1} - 1 = 1\) and \(2F_{1 + 1} - 1 = 1\), so the proposition is true for \(0 \le n \le 1\).

Let’s assume it’s valid for every number smaller than \(n\), and we’ll prove it for \(n\). If we look at the tree \(T_n\), we see that \(T_n\) has a root node \(F_n\) and two subtrees, \(T_{n - 1}\) and \(T_{n - 2}\). By the inductive hypothesis, the first one has \(2F_n - 1\) nodes and the second one has \(2F_{n-1} - 1\) nodes, so when we add all of those nodes up, we get \(2F_n - 1 + 2F_{n-1} - 1 = 2(F_n + F_{n-1}) - 2 = 2F_{n+1} - 2\) nodes. When we add the root node of \(T_n\), we get that the number of nodes in \(T_n\) is \(2F_{n+1} - 1\), which was to be shown.

Due to Binet’s formula, we know \(F_n\) grows exponentially, so this solution quickly becomes impractical.

Pruning the tree

Notice that, for example, \(F_3\) is being computed twice. In general, when you try to compute \(F_n\), you’ll compute things you’ve already computed but discarded, many, many times. In fact, the number of times you will compute \(F_k\), with \(k \ge 1\), when computing \(F_n\) is \(F_{n - k + 1}\) (in the tree for \(F_5\) above, for instance, \(F_3\) appears \(F_{5 - 3 + 1} = F_3 = 2\) times):

Again we will use induction. We want to prove that for every \(n \ge 0\), for every \(1 \le k \le n\), the number of times we will compute \(F_k\) while computing \(F_n\) is \(F_{n - k + 1}\).

For \(n = 0\), there are no \(1 \le k \le 0\), so the proposition is trivially true. For \(n = 1\), we can only pick \(k = 1\). It is true that \(F_1\) appears \(1 = F_{1 - 1 + 1} = F_1\) times.

Now we assume that our proposition is valid for all numbers smaller than \(n\), and we’ll prove it for \(n\). Let \(1 \le k \le n\). Again we picture the recursion tree for \(F_n\), which we call \(T_n\).

If \(k = n\), then clearly \(F_n\) only shows up \(1\) time in \(T_n\), so the result holds. Likewise, if \(k = n - 1\), the only place we will see \(F_{n - 1}\) is as the root of our left subtree, the larger one. So the result holds in that case as well, since \(F_{n - (n - 1) + 1} = F_2 = 1\).

Let us then assume that \(k \le n - 2\), which allows us to use the inductive hypothesis on the left and right subtrees of \(T_n\).

By induction, the number of times \(F_k\) appears in the left subtree is \(F_{n - 1 - k + 1} = F_{n - k}\), and the number of times it appears in the right subtree is \(F_{n - 2 - k + 1} = F_{n - k - 1}\). Then the number of times it appears in \(T_n\) is the sum of both, so \(F_{n - k} + F_{n - k - 1} = F_{n - k + 1}\), which was to be shown.

For \(k = 0\), the property that holds is that the number of times we will compute \(F_0\) while computing \(F_n\) is \(F_{n - 1}\).

So one thing we could do is compute \(F_k\) once and store it, so that when we later need that number, we just fetch it from memory, as opposed to re-computing it. This means that the right subtrees in our recursion tree above never have to be expanded, because each right subtree’s value will already have been computed as part of computing the left subtree!

Graphically, what’s happening is this:
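In text form, with the pruned right branches marked:

fib 5
├── fib 4
│   ├── fib 3
│   │   ├── fib 2
│   │   │   ├── fib 1
│   │   │   └── fib 0
│   │   └── fib 1   (already stored)
│   └── fib 2   (already stored)
└── fib 3   (already stored)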

The marked nodes, which are right branches, will never be computed, because they were already computed and stored while we took care of the left branch of their parent.

Our recursion then collapses to a single path down the left spine, computing each value exactly once.

In Haskell, we’d get something like the following code:

import Data.Array

-- A lazily built array of the first 101 Fibonacci numbers: each entry
-- refers back to the two previous entries, so each F_i is computed once.
arr = listArray (0, 100) (0 : 1 : [arr ! (i - 1) + arr ! (i - 2) | i <- [2..]])
fib n = arr ! n

This makes computing \(F_{32}\) a matter of 0.01 seconds.

In C/C++, this would translate to the following:

/* arr[i] will hold F_i; F_93 is the largest Fibonacci number that fits
   in an unsigned long long. */
unsigned long long arr[94] = {0, 1};

void preprocess() {
  int i;
  for (i = 2; i <= 93; ++i) {
    arr[i] = arr[i - 1] + arr[i - 2];
  }
}

unsigned long long fib(int i) {
  return arr[i];
}

where I’m limiting myself to \(n = 93\) so as not to overflow unsigned long long. In Haskell, these are all arbitrary precision, GMP-based Integers.

Optimizing space

In the code above, you see we had to give a dimension to the array. This places an upper bound on the maximum index of the Fibonacci sequence we can request: we cannot compute \(F_{101}\) with the above code. In general, we need space for \(n\) integers in order to compute \(F_n\). We can do better when we notice that to compute \(F_n\), the only two things we really need are \(F_{n-1}\) and \(F_{n - 2}\). So we only store the last two entries, and we get constant space usage (in asymptotic terms, \(O(1)\)), while retaining linear time (\(O(n)\)) to compute \(F_n\).

The procedure, then, would be that at every moment, we maintain a tuple \((a, b)\), containing \((a, b) = (F_{n - 1}, F_{n - 2})\). On the next iteration, we replace this tuple by \((a + b, a) = (F_{n-1} + F_{n - 2}, F_{n - 1}) = (F_n, F_{n - 1})\).

The code would look something like this:

fib 0 = 0
fib n = fst (fib' n)
  where
    fib' 1 = (1, 0)
    fib' n = (a + b, a)
      where
        (a, b) = fib' (n - 1)

This code runs in time similar to the previous one, but it has two improvements: it uses only constant space instead of a linear-size array, and it places no a priori upper bound on the \(n\) we can ask for.

For those wondering about the stack space due to recursion, we note that the algorithm can be written without recursion in, say, C, in the following way:

int fib(int n) {
  if (!n) return 0;
  int a = 1, b = 0, tmp;
  while (n > 1) {
    tmp = a;
    a = a + b;
    b = tmp;
    --n;
  }
  return a;
}

In Haskell we can also use constant stack space, with the slightly more complicated:

{-# LANGUAGE BangPatterns #-}

fib 0 = 0
fib !n = fib' n 1 0
  where
    fib' 1 !a !b = a
    fib' n !a !b = fib' (n - 1) (a + b) a

(See Bang Patterns for the use of “!”)

Using linear algebra

Things get a bit more fun if we notice that the procedure above was repeatedly mapping \((a, b) \to (a + b, a)\). This operation can be neatly described in terms of linear algebra!

If we take \(Q\) to be the following matrix:

\[ Q = \begin{pmatrix} 1 & 1\\ 1 & 0 \end{pmatrix} \]

Then we see that if we have a vector \(v = (a, b)\), then \(Q v = (a + b, a)\). Since our previous way of computing \(F_n\) required doing this operation \(n - 1\) times, starting with \((1, 0)\), this is equivalent to computing \(Q Q Q \dots Q (1, 0)\), where there are \(n - 1\) \(Q\)s. In other words, it’s the same as computing \(Q^{n - 1} (1, 0)\). In fact, we can prove that powers of \(Q\) always have a certain form:

\[ Q^n = \begin{pmatrix} F_{n + 1} & F_n \\ F_ n & F_{n - 1} \end{pmatrix} \]

for all \(n \ge 1\). This is an easy proof, again by induction, left as an exercise to the reader.

How does this help us? It would just seem to be a notational difference. Well, this becomes useful once we realize that

\[ Q^n = \begin{cases} \left(Q^m\right)^2 & \textrm{ if } n \equiv 0 \pmod {2},\\ \left(Q^m\right)^2 Q & \textrm { if } n \equiv 1 \pmod{2} \end{cases} \]

with \(m = \left\lfloor \frac{n}{2} \right\rfloor\).

This leads us to a divide-and-conquer algorithm to compute \(Q^{n - 1}\), and once we have that, we need only check the top left entry in this matrix to obtain \(F_n\).

The algorithm would be something like this:

import Data.Array

-- 2x2 matrix multiplication; matrices are Data.Array arrays indexed by (row, column).
a .*. b = listArray ((0, 0), (1, 1)) [a ! (0, 0) * b ! (0, 0) + a ! (0, 1) * b ! (1, 0),
                                      a ! (0, 0) * b ! (0, 1) + a ! (0, 1) * b ! (1, 1),
                                      a ! (1, 0) * b ! (0, 0) + a ! (1, 1) * b ! (1, 0),
                                      a ! (1, 0) * b ! (0, 1) + a ! (1, 1) * b ! (1, 1)]

identity = listArray ((0, 0), (1, 1)) [1, 0, 0, 1]

-- Exponentiation by repeated squaring, following the recurrence above.
a .^. 0 = identity
a .^. n = b
  where
    m = n `div` 2
    a' = a .^. m
    b = a' .*. a' .*. (if odd n then a else identity)

q = listArray ((0, 0), (1, 1)) [1, 1, 1, 0]

fib 0 = 0
fib n = (q .^. (n - 1)) ! (0, 0)

In computing \(F_{500000}\), the tuned, linear time Haskell version we wrote before takes about 2.17 seconds on this machine, while this implementation takes 0.02 seconds.

In fact, the reason for this dramatic timing difference is that the exponentiation of \(Q\) takes only \(O(\log n)\) matrix multiplications, instead of the \(O(n)\) steps of the previous algorithm.

Constant optimization and modular arithmetic

We can improve the above program by using the matrices only as an abstraction, with the code itself doing something slightly more efficient. Indeed, since we know the shape \(Q^n\) has, we need only store two numbers, \(F_n\) and \(F_{n - 1}\), to compute with it. Suppose we had \(Q^m\) and we wanted to compute \(Q^{2m}\); then we have:

\[ \begin{aligned} Q^m Q^m &= \begin{pmatrix} F_{m + 1} & F_m\\ F_m & F_{m - 1} \end{pmatrix} \begin{pmatrix} F_{m + 1} & F_m\\ F_m & F_{m - 1} \end{pmatrix}\\ &= \begin{pmatrix} \left(F_{m + 1}\right)^2 + \left(F_m\right)^2 & F_{m + 1} F_m + F_m F_{m - 1}\\ F_{m + 1} F_m + F_m F_{m - 1} & \left(F_m\right)^2 + \left(F_{m - 1}\right)^2 \end{pmatrix}\\ &= Q^{2m} \end{aligned} \]

So we could denote the algorithm’s representation of \(Q^m\) simply by \(R_{Q^m} = (F_m, F_{m - 1}) = (a, b)\), and we would have

\[ \begin{aligned} R_{Q^{2m}} &= (F_{m + 1} F_m + F_m F_{m - 1}, \left(F_m\right)^2 + \left(F_{m - 1}\right)^2)\\ &= ((F_m + F_{m-1}) F_m + F_m F_{m - 1}, \left(F_m\right)^2 + \left(F_{m - 1}\right)^2)\\ &= ((a + b) a + ab, a^2 + b^2)\\ &= (a^2 + 2ab, a^2 + b^2) \end{aligned} \]
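To make this concrete, here is a minimal standalone sketch of this doubling step in Haskell (often called “fast doubling”; fibPair is an illustrative name, and the odd case just multiplies one extra \(Q\) into the result):

-- fibPair n computes R_{Q^n} = (F_n, F_{n - 1}) by repeated doubling.
fibPair :: Integer -> (Integer, Integer)
fibPair 0 = (0, 1)              -- R_{Q^0} = R_I = (F_0, F_{-1}) = (0, 1)
fibPair n
  | odd n     = (a' + b', a')   -- one extra Q: (F_{2m + 1}, F_{2m})
  | otherwise = (a', b')
  where
    (a, b) = fibPair (n `div` 2)
    a' = a * a + 2 * a * b      -- F_{2m}     = a^2 + 2ab
    b' = a * a + b * b          -- F_{2m - 1} = a^2 + b^2

fib :: Integer -> Integer
fib = fst . fibPair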

We also note that we can always write \(n\) in base 2, as \[ n = \sum_{i = 0}^k b_i \cdot 2^i \]

for some \(k \in \mathbb{N}\). And thus, we have

\[ \begin{aligned} Q^n &= Q^{\sum_{i = 0}^k b_i \cdot 2^i}\\ &= \prod_{i = 0}^k Q^{b_i \cdot 2^i} \end{aligned} \]
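For instance, \(11 = 2^3 + 2^1 + 2^0\), so

\[ Q^{11} = Q^{2^3} Q^{2^1} Q^{2^0} = Q^8 Q^2 Q \]

and three cached matrices suffice.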

So if we are willing to cache \(k\) of these matrices’ representations, we can compute up to \(F_{2^k - 1}\) in \(O(k)\) steps. And remember, our representation of these matrices is just two numbers per each matrix, so this isn’t much memory (it’s also logarithmic, instead of linear, in the maximum \(n\) we want to compute).

Lastly, if we are computing all of this modulo some number \(m \in \mathbb{N}\), it suffices to store all these numbers modulo \(m\), so the memory requirements are even smaller.

As an example, to compute \(F_n \pmod {1000000007}\) for all \(0 \le n \le 2^{32} - 1\), the code looks like this:

import Data.Int (Int64)
import Data.Array.Unboxed

a_ = listArray (0, 31) [1, 1,3,21,987,2178309,209783453,700290221,745621702,330235873,754854590,961030536,124518512,198180511,314263529,787294282,269256820,462470690,139070082,932065818,580737058,545755243,798765759,810968381,635507288,251205278,445892347,804101592,699722224,900885101,912848361,90779811] :: UArray Int Int64
b_ = listArray (0, 31) [0, 1,2,13,610,1346269,470273943,761203726,615649431,839169006,754277743,658059650,586214508,706310066,506099537,526860082,193045466,291309054,960270309,464225376,216587333,661389053,543236045,806163189,127429331,559468297,410375445,401520518,977336158,380127118,350132037,822963728] :: UArray Int Int64

m :: Int64
m = 1000000007

f :: Int -> Int64
-- (a, b) is the running product of the chosen cached powers Q^(2^i),
-- stored as the pair (F_j, F_(j - 1)); (0, 1) represents the identity.
f n = go n 0 0 1
  where
    go :: Int -> Int -> Int64 -> Int64 -> Int64
    go 0 _ a b = a `rem` m
    go n i a b = go (n `div` 2) (i + 1) a' b'
      where
        -- Multiply in the cached representation of Q^(2^i)
        -- whenever bit i of n is set.
        (a', b') = if odd n
                      then (((a + b) * (a_ ! i) + a * (b_ ! i)) `rem` m, (a * (a_ ! i) + b * (b_ ! i)) `rem` m)
                      else (a, b)

Note that this only achieves significant performance improvements for very large \(n\). For instance, for \(n = 2^{1024}\), the previous algorithm takes 0.09 seconds, while this one takes 0.02 seconds. If you really wanted to eke every last drop of performance out of it, you might use Data.Array.Base.unsafeAt instead of (!), and use bang patterns on the parameters of the recursive function, as in the sketch below.
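Here is a sketch of what those two tweaks might look like (go' is a hypothetical drop-in replacement for the go above; since the arrays are indexed from 0, unsafeAt i returns the same element as ! i, just without the bounds check):

{-# LANGUAGE BangPatterns #-}

import Data.Array.Base (unsafeAt)

go' :: Int -> Int -> Int64 -> Int64 -> Int64
go' 0 _ !a _  = a `rem` m
go' n i !a !b = go' (n `div` 2) (i + 1) a' b'
  where
    -- Same arithmetic as before, with bounds checks elided.
    (a', b') = if odd n
                  then (((a + b) * (a_ `unsafeAt` i) + a * (b_ `unsafeAt` i)) `rem` m,
                        (a * (a_ `unsafeAt` i) + b * (b_ `unsafeAt` i)) `rem` m)
                  else (a, b)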

There’s one final trick up our sleeves, however…

Pisano period

Since we are considering the Fibonacci sequence \(\pmod{m}\), we can exploit a simple fact: the sequence must start to repeat itself no later than \(n = m^2\). This is simply because there are at most \(m^2\) possible tuples \((F_{n - 2}, F_{n - 1}) \pmod{m}\), and each one determines the sequence forwards and backwards \(\pmod{m}\). For each \(m\), the period with which the Fibonacci sequence repeats itself modulo \(m\) is called the Pisano period of \(m\). For example, modulo \(3\) the sequence goes \(0, 1, 1, 2, 0, 2, 2, 1, 0, 1, \dots\), so the Pisano period of \(3\) is \(8\).
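A brute-force way to find this period for small \(m \ge 2\), directly from the definition (pisano is an illustrative name):

-- Walk the pairs (F_n, F_(n + 1)) mod m until (0, 1) recurs;
-- that n is the Pisano period of m.
pisano :: Integer -> Integer
pisano m = go 1 1 1
  where
    go n a b
      | a == 0 && b == 1 = n
      | otherwise        = go (n + 1) b ((a + b) `mod` m)

For example, pisano 3 == 8 and pisano 10 == 60.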

The reason we are interested in this is that even if we have a huge \(n\), with \(n \gg m\), if we are computing this \(\pmod{m}\), then if \(k(m)\) is the Pisano period of \(m\), we have

\[ F_n \equiv F_{n \bmod k(m)} \pmod{m} \]

which could drastically decrease the number of computations we must do.

In general, to find the Pisano period of \(m\), we can use the Chinese Remainder Theorem: if we denote by \(k(m)\) the Pisano period of \(m\), and \(m = \displaystyle \prod_i p_i^{k_i}\) for some primes \(p_i\), then \(k(m) = \operatorname{lcm}\{k(p_i^{k_i})\}\). For example, \(k(10) = \operatorname{lcm}(k(2), k(5)) = \operatorname{lcm}(3, 20) = 60\).

There’s no known easy way, in general, to find the Pisano period mod \(p^k\). For some results on it, you can see Marc Renault’s thesis.

As an example of how one could do this, we consider the case \(k_i = 1\). There is a conjecture dating back to Wall (1960) that \(k(p^e) = p^{e-1} k(p)\), so if we solved the case \(k_i = 1\), we would be pretty set.

If we are considering \(F_n \pmod{p}\), with \(p\) prime, then we can consider the \((\mathbb{Z}/p\mathbb{Z})\)-algebra \(V\) generated by the powers of \(Q\). Since the minimal polynomial of \(Q\) is \[ m(x) = x^2 - x - 1 \]

We see that \(V\) must be isomorphic to the quotient \((\mathbb{Z}/p\mathbb{Z})[x]/\langle x^2 - x - 1 \rangle\). Then, since \(m(x)\) has degree \(2\), when \(p \ne 5\) this quotient has to be one of two things:

\[ V \cong \begin{cases} \mathbb{F}_{p^2} & \textrm{ if } m(x) \textrm{ is irreducible over } \mathbb{Z}/p\mathbb{Z},\\ \left(\mathbb{Z}/p\mathbb{Z}\right)^2 & \textrm{ if } m(x) \textrm{ splits into distinct factors} \end{cases} \]

The reason we single out \(p = 5\) is that in \(\mathbb{Z}/5\mathbb{Z}\), the minimal polynomial of \(Q\) is a perfect square, \((x + 2)^2\). In this case, the quotient is of neither of the above forms, and this is the only such case:

\(\Rightarrow\)). Let \(p = 5\). Then we see that \((x+2)^2 = x^2 + 4x + 4 \equiv x^2 - x - 1 \pmod{5}\). Thus, the minimal polynomial is a perfect square, which was to be shown.

\(\Leftarrow\)). Let \(x^2 - x - 1\) factor as \((x-a)^2\) over \(\mathbb{Z}/p\mathbb{Z}\). Then we have \((x-a)^2 = x^2 - 2ax + a^2 \equiv x^2 - x - 1 \pmod{p}\). We can either use highbrow abstract algebra, viewing \(\mathbb{Z}/p\mathbb{Z}[x]\) as a vector space, or simply evaluate at \(0\) and \(1\), to see that this implies the following system of modular equations:

\[ \begin{aligned} a^2 &\equiv -1 \pmod{p}\\ -2a &\equiv -1 \pmod{p} \end{aligned} \]

From this, we know that \(p \mid a^2 + 1\) and \(p \mid 2a - 1\). Thus it divides \(a\) times the second minus twice the first: \(p \mid 2a^2 - a - 2a^2 - 2 = -a - 2\), so \(-a - 2 \equiv 0 \pmod{p}\), which just means \(-a \equiv 2 \pmod{p}\). By multiplying both sides by \(-2\), we obtain \(2a \equiv -4 \pmod{p}\), but we knew \(2a \equiv 1 \pmod{p}\) from the original system of equations. Thus \(-4 \equiv 1 \pmod{p}\). So \(p \mid 5\), and \(p\) is prime, so \(p = 5\), which was to be shown.

Let’s first look at the case where the minimal polynomial is irreducible over \(\mathbb{Z}/p\mathbb{Z}\). Using the fact that the group of units \(U\) of \(\mathbb{F}_{p^2}\) is both cyclic and of order \(p^2 - 1\), and that the order of \(Q\) must divide \(|U|\) by Lagrange’s theorem, we can check the divisors of \(|U|\) to find the smallest divisor \(d\) such that \(Q^d = I\). Incidentally, this is the case when \(p \equiv 2, 3 \pmod{5}\):

We need only use quadratic reciprocity. The equation \(x^2 - x - 1 \equiv 0 \pmod{p}\) will have solutions exactly when \(x^2 \equiv 5 \pmod{p}\) has solutions, by completing the square (the quadratic formula). Since \(5 \equiv 1 \pmod{4}\), quadratic reciprocity tells us this has a solution exactly when \(x^2 \equiv p \pmod{5}\) has a solution, which is true exactly when \(p \equiv 0, 1, 4 \pmod{5}\).

Thus, the polynomial is only irreducible when \(p \equiv 2, 3 \pmod{5}\).

For the case where the polynomial splits into distinct linear factors, things become more interesting. First, we’ll need a small lemma:

\[ Q^n = F_n Q + F_{n-1} I \]

Clearly this is valid for \(n = 1\), since we have \(Q^1 = Q\). We can also see that \(Q^2 = Q + I\), so it is valid for \(n = 2\).

In the inductive step, we assume the proposition holds for \(n\). Thus, \[ \begin{aligned} Q^{n+1} &= Q(Q^n)\\ &= Q(F_n Q + F_{n-1} I)\\ &=F_n Q^2 + F_{n-1} Q\\ &= F_n (Q + I) + F_{n-1} Q\\ &= F_n Q + F_{n-1} Q + F_n I\\ &= F_{n+1} Q + F_n I \end{aligned} \] which was to be shown.

With that done, let \(\phi\) be either of the two roots of \(x^2 - x - 1 \pmod{p}\). Then we have that

\[ F_n = \frac{\phi^n - (1 - \phi)^n}{2\phi - 1} \]

Then, if we consider \(n = p-1\), we know \(\phi^{p-1} \equiv (1 - \phi)^{p-1} \equiv 1 \pmod{p}\) by Fermat’s little theorem (both roots are nonzero, since their product is \(-1\)). So \(F_{p-1} \equiv 0 \pmod{p}\). We also have \(\phi^{p-2} \phi \equiv 1 \Rightarrow \phi^{p-2} \equiv \phi^{-1} \equiv \phi - 1 \pmod{p}\), and \((1-\phi)^{p-2} (1 - \phi) = (1 - \phi)^{p-1} \equiv 1 \Rightarrow (1-\phi)^{p-2} \equiv (1-\phi)^{-1} \equiv -\phi \pmod{p}\), since \(\phi(\phi - 1) \equiv 1 \pmod{p}\) and \((1 - \phi)(-\phi) = \phi^2 - \phi \equiv 1 \pmod{p}\). Thus the numerator is \(\phi^{-1} + \phi \equiv \phi - 1 + \phi = 2 \phi - 1\), and thus \(F_{p-2} \equiv 1 \pmod{p}\).

So we have \(F_{p-1} \equiv 0\) and \(F_{p-2} \equiv 1 \pmod{p}\). Since we know \(Q^n = F_n Q + F_{n-1} I\), in particular \(Q^{p-1} = F_{p-1} Q + F_{p-2} I\); looking at that equation \(\pmod{p}\), we see that \(Q^{p-1} = F_{p-1} Q + F_{p-2} I \equiv I \pmod{p}\), and thus the order of \(Q\) must divide \(p-1\). One could also do this without the matrix argument, by seeing that \(F_p = F_{p-1} + F_{p-2} \equiv 0 + 1 = 1 \pmod{p}\), so we have \(F_{p-1} \equiv 0 \pmod{p}\) and \(F_p \equiv 1 \pmod{p}\), so the period divides \(p-1\). As a quick sanity check with \(p = 11 \equiv 1 \pmod{5}\): \(F_{10} = 55 \equiv 0\) and \(F_9 = 34 \equiv 1 \pmod{11}\), and indeed \(k(11) = 10\) divides \(p - 1 = 10\).

As an example, suppose we are using \(p = 1000000007\). Then, since the minimal polynomial is irreducible over \(\mathbb{Z}/p\mathbb{Z}\), we know that \(k(p)\) must divide \(p^2 - 1 = 1000000014000000048\), which factors as \(2^4 \cdot 3^2 \cdot 7 \cdot 109^2 \cdot 167 \cdot 500000003\). The number of divisors is thus \(5 \cdot 3 \cdot 2 \cdot 3 \cdot 2 \cdot 2 = 360\). We can check each one of those by simply combining the powers of the prime divisors and seeing if \(Q\) raised to that power is the identity \(\pmod{p}\), and we get that the smallest such divisor is \(k(p) = 2000000016\). Thus, if we want to compute \(F_{2^{1024}}\), we need only compute \(n' = 2^{1024} \bmod 2000000016\), and then return \(F_{n'}\).
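As a sketch of how everything ties together (fibHuge is an illustrative name; f is the cached-powers function from the previous section, and \(2000000016\) is the \(k(p)\) we just computed):

-- Reduce the (possibly huge) index by the Pisano period of 10^9 + 7,
-- then use the fast modular Fibonacci from before.
fibHuge :: Integer -> Int64
fibHuge n = f (fromIntegral (n `mod` 2000000016))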

On Binet’s formula

One might wonder why we did not discuss Binet’s formula for computing \(F_n\) in, seemingly, constant time. Binet’s formula tells us that

\[ F_n = \frac{\varphi^n - \overline{\varphi}^n}{\sqrt{5}} \]

Where \(\varphi\) and \(\overline{\varphi}\) are the roots of the equation \(x^2 - x - 1 = 0\) (you may remember this as the minimal polynomial of \(Q\)), and \(\varphi\) is also called the Golden Ratio.

While this is a correct equation, precision issues limit its usefulness. For instance, at \(n = 72\), even if we disregarded (truncated) everything after the decimal point, using that formula with an x86 extended-precision, 80-bit long double to compute \(F_n\) yields an integer part of \(498454011879265\), which is already off by 1. It only gets worse from there on.
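For the curious, here is a small sketch exhibiting the problem, using ordinary Double (a 53-bit mantissa, so it breaks down even sooner than the 80-bit long double above); comparing binetFib n against any of the exact implementations shows where they first disagree:

-- Binet's formula in floating point; correct only while the rounding
-- error in phi ^ n stays well below 1.
binetFib :: Int -> Integer
binetFib n = truncate ((phi ^ n - psi ^ n) / sqrt 5)
  where
    phi = (1 + sqrt 5) / 2 :: Double
    psi = (1 - sqrt 5) / 2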

Generalizations and parting remarks

Much of what’s said here can be said for any linear recurrence with constant coefficients, it’s only the matrix \(Q\) that changes. As an interesting exercise, try to prove that one can obtain the \(n\)th element of a recurrence of the form \(a_n = a_{n - 1} + a_{n - 2} + \dots + a_{n - k}\) in order \(O(k^2 \log n)\), and not just \(O(k^3 \log n)\) as would be concluded from the naïve modular matrix exponentiation algorithm. You can then generalize this to any linear recurrence with constant coefficients.

Finally, if you know any interesting properties or generalizations of the topics above, or know of any optimizations to the algorithms described above, I’d love to hear about them in the comments :)