Why is the iterative Fibonacci slower than the recursive version (with memoization)? - python

I'm comparing two versions of the Fibonacci routine in Python 3:
import functools

@functools.lru_cache()
def fibonacci_rec(target: int) -> int:
    if target < 2:
        return target
    res = fibonacci_rec(target - 1) + fibonacci_rec(target - 2)
    return res
def fibonacci_it(target: int) -> int:
    if target < 2:
        return target
    n_1 = 2
    n_2 = 1
    for n in range(3, target):
        new = n_2 + n_1
        n_2 = n_1
        n_1 = new
    return n_1
The first version is recursive, with memoization (thanks to lru_cache). The second is simply iterative.
I then benchmarked the two versions and I'm slightly surprised by the results:
In [5]: %timeit fibonacci_rec(1000)
82.7 ns ± 2.94 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [6]: %timeit fibonacci_it(1000)
67.5 µs ± 2.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The iterative version is way slower than the recursive one. Of course the first run of the recursive version takes a lot of time (to cache all the results), and the recursive version uses more memory (to store all the calls). But I wasn't expecting such a difference in runtime. Shouldn't I incur some overhead by calling a function, compared to just iterating over numbers and swapping variables?

As you can see, timeit invokes the function many times, to get a reliable measurement. The LRU cache of the recursive version is not being cleared between invocations, so after the first run, fibonacci_rec(1000) is just returned from the cache immediately without doing any computation.
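One way to see this (not part of either answer, just an illustration) is to look at the statistics that functools.lru_cache exposes via cache_info():
fibonacci_rec.cache_clear()
fibonacci_rec(30)                     # small argument, well below the recursion limit
print(fibonacci_rec.cache_info())     # many misses: the cache is being filled
fibonacci_rec(30)
print(fibonacci_rec.cache_info())     # one extra hit, no new misses: served from the cache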

As explained by @Thomas, the cache isn't cleared between invocations of fibonacci_rec (so the result of fibonacci_rec(1000) will be cached and re-used). Here is a better benchmark:
def wrapper_rec(target: int) -> int:
    res = fibonacci_rec(target)
    fibonacci_rec.cache_clear()
    return res

def wrapper_it(target: int) -> int:
    res = fibonacci_it(target)
    # Just to make sure the comparison will be consistent
    fibonacci_rec.cache_clear()
    return res
And the results:
In [9]: %timeit wrapper_rec(1000)
445 µs ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [10]: %timeit wrapper_it(1000)
67.5 µs ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Related

Efficiently test if an item is in a sorted list of strings

Suppose I have a list of short lowercase [a-z] strings (max length 8):
L = ['cat', 'cod', 'dog', 'cab', ...]
How to efficiently determine if a string s is in this list?
I know I can do if s in L: but I could presort L and binary-tree search.
I could even build my own tree, letter by letter, so that with s='cat', T[ ord(s[0])-ord('a') ] gives the subtree leading to 'cat' and 'cab', etc. But eek, messy!
I could also make my own hashfunc, as L is static.
def hash_(item):
    w = [127**i * (ord(j) - ord('0')) for i, j in enumerate(item)]
    return sum(w) % 123456
... and just fiddle the numbers until I don't get duplicates. Again, ugly.
Is there anything out-of-the-box I can use, or must I roll my own?
There are of course going to be solutions everywhere along the complexity/optimisation curve, so my apologies in advance if this question is too open ended.
I'm hunting for something that gives decent performance gain in exchange for a low LoC cost.
The builtin Python set is almost certainly going to be the most efficient device you can use. (Sure, you could roll out cute things such as a DAG of your "vocabulary", but this is going to be much, much slower).
So, convert your list into a set (preferably built once if multiple tests are to be made) and test for membership:
s in set(L)
Or, for multiple tests:
set_values = set(L)
# ...
if s in set_values:
    # ...
Here is a simple example to illustrate the performance:
from string import ascii_lowercase
import random
n = 1_000_000
L = [''.join(random.choices(ascii_lowercase, k=6)) for _ in range(n)]
Time to build the set:
%timeit set(L)
# 99.9 ms ± 49.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Time to query against the set:
set_values = set(L)
# non-existent string
%timeit 'foo' in set_values
# 45.1 ns ± 0.0418 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# existing value
s = L[-1]
a = %timeit -o s in set_values
# 45 ns ± 0.0286 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Contrast that to testing directly against the list:
b = %timeit -o s in L
# 16.5 ms ± 24.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
b.average / a.average
# 359141.74
When's the last time you made a 350,000x speedup ;-) ?
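For completeness, the "presort and binary search" idea from the question can be done with the standard bisect module; a minimal sketch (reusing the list L generated above), typically much faster than a linear scan but still slower than a set lookup:
import bisect

sorted_L = sorted(L)   # presort once

def in_sorted(s, sorted_list):
    # binary search: find the insertion point and check the element there
    i = bisect.bisect_left(sorted_list, s)
    return i < len(sorted_list) and sorted_list[i] == s

print(in_sorted('foo', sorted_L))    # False: all generated strings have length 6
print(in_sorted(L[-1], sorted_L))    # True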

check if large number of directories exist in Python

Is there a trick to quickly check whether 10000 Windows directories exist in Python? I currently store them in a list, and I wonder how to do the check quickly.
You can iterate through the list and call os.path.exists() (for files and directories), os.path.isfile() (for files) or os.path.isdir() (for directories) in order to know whether or not these directories exist:
import os

dir_list = [...]
for dir_entry in dir_list:
    if not os.path.isdir(dir_entry):
        ...  # do something if the dir does not exist
    else:
        ...  # do something if the dir exists
If you just want to check the existence of the path without checking whether it actually is a directory, then there are additional options that may be faster; see Johnny's answer for details.
If simply iterating through the list is not fast enough, you can use a ThreadPoolExecutor to iterate over the list in parallel threads (assigning chunks of that list, e.g. 1000 directories, to each worker), but I doubt that it would speed things up much, and handling the return values (if needed) gets complicated:
from concurrent.futures import ThreadPoolExecutor

WORKER_COUNT = 10
CHUNK_SIZE = 1000

def process_dir_list(dir_list):
    # implementation according to the snippet above
    ...

future_list = []
with ThreadPoolExecutor(max_workers=WORKER_COUNT) as executor:
    for dir_index in range(0, len(dir_list), CHUNK_SIZE):
        future_list.append(executor.submit(process_dir_list, dir_list[dir_index:dir_index + CHUNK_SIZE]))

# wait for each future to finish and collect its result
for current_future in future_list:
    result = current_future.result()
    # do something with the result, if desired
You can also use os.access (and combine it with multiprocessing if needed):
os.access("/file/path/foo.txt", os.F_OK)

# check that the path exists
os.F_OK
# check that the path is readable
os.R_OK
# check that the path is writable
os.W_OK
# check that the path is executable
os.X_OK
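As a small illustration (the paths below are made up), os.F_OK is enough to filter a list down to the entries that do not exist; note that it checks existence only, not whether the path is a directory:
import os

dir_list = [r"C:\Windows", r"C:\probably_not_there"]   # hypothetical sample paths
missing = [d for d in dir_list if not os.access(d, os.F_OK)]
print(missing)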
Here is a simple test:
In [1]: import os, pathlib
In [2]: p = "/home/lpc/gitlab/config/test"
In [3]: %timeit pathlib.Path(p).exists()
5.85 µs ± 32.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [4]: %timeit os.path.exists(p)
1.03 µs ± 4.69 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit os.access(p, os.F_OK)
526 ns ± 2.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [6]: def check(p):
   ...:     try:
   ...:         f = open(p)
   ...:         f.close()
   ...:         return True
   ...:     except OSError:
   ...:         return False
In [7]: %timeit check(p)
1.52 µs ± 4.41 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [8]: %timeit os.path.isdir(p)
1.05 µs ± 4.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Fibonacci: Time Complexity and Big O Notation

I have seen several Fibonacci solutions on different tutorial sites, and one thing I noticed is that they all solve the problem with a recursive function. I tested the recursive function and it takes 77 seconds to get the 40th item in the list, so I tried to write a function that avoids recursion by using a for loop, and it takes less than a second. Did I do it right? What is the O notation of my function?
from time import time

def my_fibo(n):
    temp = [0, 1]
    for _ in range(n):
        temp.append(temp[-1] + temp[-2])
    return temp[n]

start = time()
print(my_fibo(40), f'Time: {time() - start}')
# 102334155 Time: 0.0
vs
from time import time

def recur_fibo(n):
    if n <= 1:
        return n
    else:
        return recur_fibo(n - 1) + recur_fibo(n - 2)

start = time()
print(recur_fibo(40), f'Time: {time() - start}')
# 102334155 Time: 77.78924512863159
What you have done is an example of the time-space tradeoff.
In the first (iterative) example, you have an O(n) time algorithm that also takes O(n) space. In this example, you store values so that you do not need to recompute them.
In the second (recursive) example, you have an O(2^n) time (See Computational complexity of Fibonacci Sequence for further details) algorithm that takes up significant space on the stack as well.
In practice, the latter recursive example is a 'naive' approach to computing the Fibonacci sequence, and the version where the previous values are stored is significantly faster.
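A middle ground (not shown in the question) is to keep the recursive formulation but memoize it, for example with functools.lru_cache; this brings the time down to O(n) at the cost of O(n) cache space:
import functools

@functools.lru_cache(maxsize=None)
def memo_fibo(n):
    if n <= 1:
        return n
    return memo_fibo(n - 1) + memo_fibo(n - 2)

print(memo_fibo(40))   # 102334155, well under a second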
The big-O answer is above: in the classical recursive implementation that you showed, the function calls itself twice in each pass.
In the example below, I wrote a recursive function that calls itself only once per pass, so it is also O(n):
def recur_fibo3(n, curr=1, prev=0):
    if n > 2:  # the sequence starts with 2 elements (prev and curr)
        newPrev = curr
        curr += prev
        return recur_fibo3(n - 1, curr, newPrev)  # recursive call
    else:
        return curr
It performs linearly with increases in n, but it is slower than a normal loop.
Also, note that both recursive functions (the classical one and the one above) do not store the whole sequence in order to return it. Your loop function does, but if you just want to retrieve the n-th value in the series, you can write a faster function somewhat like this:
def my_fibo2(n):
    prev1 = 0
    prev2 = 1
    for _ in range(n - 2):
        curr = prev1 + prev2
        prev2 = prev1
        prev1 = curr
    return curr
Using %timeit to measure execution times, we can see which one is faster. But all of them are fast enough in normal conditions, because you just need to calculate a long series once and store the results for later uses... :)
Time to return the 100th element in Fibonacci series
my_fibo(100)
10.4 µs ± 990 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
my_fibo2(100)
5.13 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
recur_fibo3(100)
14.3 µs ± 2.51 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
============================================
Time to return the 1000th element in Fibonacci series
my_fibo(1000)
122 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
my_fibo2(1000)
82.4 µs ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
recur_fibo3(1000)
207 µs ± 19.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

why is numpy.linalg.norm slow when called many times for small size data?

import numpy as np
from datetime import datetime
import math

def norm(l):
    s = 0
    for i in l:
        s += i**2
    return math.sqrt(s)

def foo(a, b, f):
    l = range(a)
    s = datetime.now()
    for i in range(b):
        f(l)
    e = datetime.now()
    return e - s

foo(10**4, 10**5, norm)
foo(10**4, 10**5, np.linalg.norm)
foo(10**2, 10**7, norm)
foo(10**2, 10**7, np.linalg.norm)
I got the following output:
0:00:43.156278
0:00:23.923239
0:00:44.184835
0:01:00.343875
It seems like when np.linalg.norm is called many times for small-sized data, it runs slower than my norm function.
What is the cause of that?
First of all: datetime.now() isn't appropriate for measuring performance; it measures wall-clock time, and you may just pick a bad moment (for your computer) when a high-priority process runs or Python's GC kicks in, ...
There are dedicated timing functions/modules available in Python: the built-in timeit module, %timeit in IPython/Jupyter, and several other external modules (like perf, ...)
Let's see what happens if I use these on your data:
import numpy as np
import math

def norm(l):
    s = 0
    for i in l:
        s += i**2
    return math.sqrt(s)
r1 = range(10**4)
r2 = range(10**2)
%timeit norm(r1)
3.34 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.linalg.norm(r1)
1.05 ms ± 3.92 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit norm(r2)
30.8 µs ± 1.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.linalg.norm(r2)
14.2 µs ± 313 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
It isn't slower for short iterables; it's still faster. However, note that the real advantage of NumPy functions comes when you already have NumPy arrays:
a1 = np.arange(10**4)
a2 = np.arange(10**2)
%timeit np.linalg.norm(a1)
18.7 µs ± 539 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.linalg.norm(a2)
4.03 µs ± 157 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Yeah, it's quite a lot faster now: 18.7 µs vs. 1.05 ms, more than 50 times faster for 10000 elements. That means most of the time of np.linalg.norm in your examples was spent converting the range to an np.array.
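You can confirm that by timing the conversion on its own (a sketch; the actual numbers depend on your machine, so none are shown here):
import numpy as np

r1 = range(10**4)
a1 = np.arange(10**4)

%timeit np.asarray(r1)        # cost of turning the range into an ndarray
%timeit np.linalg.norm(a1)    # cost of the norm itself on an existing array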
You are on the right track.
np.linalg.norm has quite a high overhead on small arrays. On large arrays both the JIT-compiled function and np.linalg.norm run into a memory bottleneck, which is expected for a function that does simple multiplications most of the time.
If the jitted function is called from another jitted function it might get inlined, which can lead to a much larger advantage over the NumPy norm function.
Example
import numba as nb
import numpy as np

@nb.njit(fastmath=True)
def norm(l):
    s = 0.
    for i in range(l.shape[0]):
        s += l[i]**2
    return np.sqrt(s)
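To illustrate the point about calling from another jitted function, here is a hypothetical caller (not part of the original answer) that stays entirely in compiled code, so Numba may inline the inner call:
@nb.njit(fastmath=True)
def normalize(l):
    # norm() is the @nb.njit function defined above; the call stays in
    # compiled code and may be inlined by Numba.
    n = norm(l)
    out = np.empty(l.shape[0], dtype=np.float64)
    for i in range(l.shape[0]):
        out[i] = l[i] / n
    return out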
Performance
r1 = np.array(np.arange(10**2), dtype=np.int32)
Numba: 0.42 µs
linalg: 4.46 µs
r2 = np.array(np.arange(10**4), dtype=np.int32)
Numba: 8.9 µs
linalg: 13.4 µs
r1 = np.array(np.arange(10**2), dtype=np.float64)
Numba: 0.35 µs
linalg: 3.71 µs
r2 = np.array(np.arange(10**4), dtype=np.float64)
Numba: 1.4 µs
linalg: 5.6 µs
Measuring Performance
Call the jit-compiled function once before the measurement (there is a compilation overhead on the first call).
Make sure the measurement is valid: small arrays stay in the processor cache, which can give overly optimistic results that exceed your RAM throughput on realistic examples.
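In code, the warm-up point looks like this (a sketch; timings are machine-dependent and therefore omitted):
r = np.array(np.arange(10**4), dtype=np.float64)
norm(r)          # first call triggers JIT compilation; keep it out of the measurement
%timeit norm(r)  # subsequent calls time only the compiled code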

What is the fastest way in Cython to create a new array from an existing array and a variable

Suppose I have an array
from array import array
myarr = array('l', [1, 2, 3])
and a variable:
myvar = 4
what is the fastest way to create a new array:
newarray = array('l', [1, 2, 3, 4])
You can assume all elements are of 'long' type
I have tried creating a new array and using array.append(), but I'm not sure it is the fastest. I was thinking of using a memoryview, like:
malloc(4*sizeof(long))
but I don't know how to copy the shorter array into part of the memoryview and then insert the last element into the last position.
I am fairly new to Cython. Thanks for any help!
Update:
I compared the following three methods:
Cython: [100000 loops, best of 3: 5.94 µs per loop]
from libc.stdlib cimport malloc

def cappend(long[:] arr, long var, size_t N):
    cdef long[:] result = <long[:(N+1)]>malloc((N+1)*sizeof(long))
    result.base[:N] = arr
    result.base[N] = var
    return result
array: [1000000 loops, best of 3: 1.21 µs per loop]
from array import array
import copy

def pyappend(arr, x):
    result = copy.copy(arr)
    result.append(x)
    return result
list append: [1000000 loops, best of 3: 480 ns per loop]
def pylistappend(lst, x):
    result = lst[:]
    result.append(x)
    return result
is there hope to improve the cython part and beat the array one?
Cython gives us more access to the internals of array.array than "normal" Python, so we can use it to speed up the code:
almost by a factor of 7 for your small example (by eliminating most of the overhead), and
by a factor of 2 for larger inputs (by eliminating an unnecessary array copy).
Read on for more details.
It's a little bit unusual to try to optimize a function for such small input, but not without (at least theoretical) interest.
So let's start with your functions as baseline:
a=array('l', [1,2,3])
%timeit pyappend(a, 8)
1.03 µs ± 10.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
lst=[1,2,3]
%timeit pylistappend(lst, 8)
279 ns ± 6.03 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
We must be aware that what we are measuring is not the cost of copying but the cost of overhead (the Python interpreter, calling functions, and so on); for example, there is no difference whether a has 3 or 5 elements:
a=array('l', range(5))
%timeit pyappend(a, 8)
1.03 µs ± 6.76 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In the array version we have more overhead because of the indirection via the copy module; we can try to eliminate that:
def pyappend2(arr, x):
    result = array('l', arr)
    result.append(x)
    return result
%timeit pyappend2(a, 8)
496 ns ± 5.04 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
That is faster. Now let's use cython - this would eliminate the interpreter costs:
%%cython
def cylistappend(lst, x):
    result = lst[:]
    result.append(x)
    return result
%%cython
from cpython cimport array

def cyappend(array.array arr, long long int x):
    cdef array.array res = array.array('l', arr)
    res.append(x)
    return res
%timeit cylistappend(lst, 8)
193 ns ± 12.4 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit cyappend(a, 8)
421 ns ± 8.08 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The Cython versions are about 33% faster for the list and about 10% faster for the array. The constructor array.array() expects an iterable, but we already have an array.array, so we can use the functionality from cpython to get access to the internals of the array.array object and improve the situation a little:
%%cython
from cpython cimport array

def cyappend2(array.array arr, long long int x):
    cdef array.array res = array.copy(arr)
    res.append(x)
    return res
%timeit cyappend2(a, 8)
305 ns ± 7.25 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
For the next step we need to know how array.array appends elements: normally it over-allocates, so append() has amortized cost O(1); however, after array.copy the new array has exactly the needed number of elements, and the next append triggers a reallocation. We need to change that (see here for the description of the used functions):
%%cython
from cpython cimport array
from libc.string cimport memcpy

def cyappend3(array.array arr, long long int x):
    cdef Py_ssize_t n = len(arr)
    cdef array.array res = array.clone(arr, n+1, False)
    memcpy(res.data.as_voidptr, arr.data.as_voidptr, 8*n)  # that is pretty sloppy..
    res.data.as_longlongs[n] = x
    return res
%timeit cyappend3(a, 8)
154 ns ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Because we allocate the final size (n+1) up front and write the last element directly instead of appending, no resize()/reallocation is needed. Now we are faster than the list version and almost 7 times faster than the original Python version.
Let's compare timings for bigger array sizes (a=array('l', range(1000)), lst=list(range(1000))), where copying of the data makes up most of the running time:
pyappend 1.84 µs #copy-module is slow!
pyappend2 1.02 µs
cyappend 0.94 µs #cython no big help - we are copying twice
cyappend2 0.90 µs #still copying twice
cyappend3 0.43 µs #copying only once -> twice as fast!
pylistappend 4.09 µs # needs to increment refs of integers
cylistappend 3.85 µs # the same as above
Now, eliminating the unnecessary copy for array.array gives us the expected factor 2.
For even bigger arrays (10000 elements), we see the following:
pyappend 6.9 µs #copy-module is slow!
pyappend2 4.8 µs
cyappend2 4.4 µs
cyappend3 4.4 µs
There is no longer a difference between the versions (if one discards the slow copy module). The reason for this is the changed behavior of array.array for such a large number of elements: when copying, it over-allocates, thus avoiding the reallocation after the first append().
We can easily check it:
b = array('l', array('l', range(10**3)))  # emulate our functions
b.buffer_info()
# (94481422849232, 1000)
b.append(1)
b.buffer_info()
# (94481422860352, 1001)   another pointer address -> reallocated
...
b = array('l', array('l', range(10**4)))
b.buffer_info()
# (94481426290064, 10000)
b.append(33)
b.buffer_info()
# (94481426290064, 10001)  the same pointer address -> no reallocation!
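As a further (hypothetical) check, you can watch the buffer address while appending in a loop; it only changes when CPython actually reallocates the underlying buffer:
from array import array

b = array('l', range(10))
addr = b.buffer_info()[0]
for i in range(1000):
    b.append(i)
    new_addr, length = b.buffer_info()
    if new_addr != addr:
        print(f"reallocated at length {length}")
        addr = new_addr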
