I have seen several Fibonacci solutions on different tutorial sites, and one thing I noticed is that they all solve the problem with a recursive function. I tested the recursive function and it takes 77 seconds to get the 40th item in the list, so I tried writing a function that avoids recursion by using a for loop instead, and it takes less than a second. Did I do it right? What is the O notation of my function?
from time import time

def my_fibo(n):
    temp = [0, 1]
    for _ in range(n):
        temp.append(temp[-1] + temp[-2])
    return temp[n]

start = time()
print(my_fibo(40), f'Time: {time() - start}')
# 102334155 Time: 0.0
vs
from time import time

def recur_fibo(n):
    if n <= 1:
        return n
    else:
        return recur_fibo(n - 1) + recur_fibo(n - 2)

start = time()
print(recur_fibo(40), f'Time: {time() - start}')
# 102334155 Time: 77.78924512863159
What you have done is an example of the time-space tradeoff.
In the first (iterative) example, you have an O(n) time algorithm that also takes O(n) space. In this example, you store values so that you do not need to recompute them.
In the second (recursive) example, you have an O(2^n)-time algorithm (see "Computational complexity of Fibonacci Sequence" for further details) that also takes up significant space on the stack.
In practice, the latter recursive example is a 'naive' approach to handling the Fibonacci sequence, and the version where the previous values are stored is significantly faster.
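For illustration, a sketch (not from the original post) of the usual middle ground: keep the recursive definition but memoize it with functools.lru_cache, which brings the running time down to O(n) while still using O(n) space for the cache.

from functools import lru_cache
from time import time

@lru_cache(maxsize=None)
def memo_fibo(n):
    # same recursive definition as recur_fibo, but every computed value is cached,
    # so each memo_fibo(k) is evaluated only once
    if n <= 1:
        return n
    return memo_fibo(n - 1) + memo_fibo(n - 2)

start = time()
print(memo_fibo(40), f'Time: {time() - start}')
# 102334155, in well under a millisecond on typical hardware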
The big-O answer is given above: in the classical recursive implementation you showed, the function calls itself twice on each pass.
In the example below, I wrote a recursive function that calls itself only once per pass, so it also runs in O(n):
def recur_fibo3(n, curr=1, prev=0):
    if n > 2:  # the sequence starts with 2 elements (prev and curr)
        newPrev = curr
        curr += prev
        return recur_fibo3(n - 1, curr, newPrev)  # recursive call
    else:
        return curr
It scales linearly with n, but it is slower than a plain loop.
Also, note that neither recursive function (the classical one or the one above) stores the whole sequence for you to return. Your loop function does, but if you just want to retrieve the n-th value in the series, you can write a faster function somewhat like this:
def my_fibo2(n):
    prev1 = 1  # most recent value (the 2nd element of the series)
    prev2 = 0  # the value before it (the 1st element)
    curr = prev1
    for _ in range(n - 2):
        curr = prev1 + prev2
        prev2 = prev1
        prev1 = curr
    return curr
Using %timeit to measure execution times, we can see which one is faster. But all of them are fast enough in normal conditions, because you just need to calculate a long series once and store the results for later uses... :)
Time to return the 100th element in Fibonacci series
my_fibo(100)
10.4 µs ± 990 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
my_fibo2(100)
5.13 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
recur_fibo3(100)
14.3 µs ± 2.51 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
============================================
Time to return the 1000th element in Fibonacci series
my_fibo(1000)
122 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
my_fibo2(1000)
82.4 µs ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
recur_fibo3(1000)
207 µs ± 19.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I have an array, x, which I want to make monotonic. Specifically, I want to do
y = x.copy()
for i in range(1, len(x)):
    y[i] = np.max(x[:i])
This is extremely slow for large arrays, but it feels like there should be a more efficient way of doing this. How can this operation be sped up?
The OP implementation is very inefficient because it does not use the information acquired on the previous iteration, resulting in O(n²) complexity.
def max_acc_OP(arr):
    result = np.empty_like(arr)
    for i in range(len(arr)):
        result[i] = np.max(arr[:i + 1])
    return result
Note that I fixed the OP code (which was otherwise throwing a ValueError: zero-size array to reduction operation maximum which has no identity) by taking the largest value among those up to and including position i.
It is easy to adapt that so that values at position i are excluded, but it leaves the first value of the result undefined, and it would never use the last value of the input. The first value of the result can be taken to be equal to the first value of the input, e.g.:
def max_acc2_OP(arr):
    result = np.empty_like(arr)
    result[0] = arr[0]  # uses first value of input
    for i in range(1, len(arr)):
        result[i] = np.max(arr[:i])
    return result
It is equally easy to have similar adaptations for the code below, and I do not think it is particularly relevant to cover both cases of the value at position i included and excluded. Henceforth, only the "included" case is covered.
Back to the efficiency of the solution: if you keep track of the current maximum and use it to fill your output array, instead of re-computing the maximum of all values up to i at each iteration, you can easily get to O(n) complexity:
def max_acc(arr):
    result = np.empty_like(arr)
    curr_max = arr[0]
    for i, x in enumerate(arr):
        if x > curr_max:
            curr_max = x
        result[i] = curr_max
    return result
However, this is still relatively slow because of the explicit looping.
Luckily, one can either rewrite this in vectorized form combining np.fmax() (or np.maximum() -- depending on how you need NaNs to be handled) and np.ufunc.accumulate():
np.fmax.accumulate(arr)
# or
np.maximum.accumulate(arr)
or, accelerating the solution above with Numba:
import numba as nb

max_acc_nb = nb.njit(max_acc)
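As an aside on the NaN handling mentioned above, a quick sketch on a toy input (not part of the original timings) shows the difference between the two ufuncs:

a = np.array([1.0, np.nan, 3.0, 2.0])
np.fmax.accumulate(a)     # array([1., 1., 3., 3.])  -- fmax ignores NaNs
np.maximum.accumulate(a)  # array([ 1., nan, nan, nan])  -- maximum propagates them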
Some timings on relatively large inputs are provided below:
n = 10000
arr = np.random.randint(0, n, n)
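max_acc_nb(arr)  # warm-up call (added here as a sketch): the first call triggers JIT compilation, so keep it out of the timings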
%timeit -n 4 -r 4 max_acc_OP(arr)
# 97.5 ms ± 14.2 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
%timeit -n 4 -r 4 np.fmax.accumulate(arr)
# 112 µs ± 134 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
%timeit -n 4 -r 4 np.maximum.accumulate(arr)
# 88.4 µs ± 107 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
%timeit -n 4 -r 4 max_acc(arr)
# 2.32 ms ± 146 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
%timeit -n 4 -r 4 max_acc_nb(arr)
# 9.11 µs ± 3.01 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
indicating that max_acc() is already much faster than max_acc_OP(), but np.maximum.accumulate() / np.fmax.accumulate() is even faster, and max_acc_nb() comes out as the fastest. As always, it is important to take these kinds of numbers with a grain of salt.
I think it will work faster to just keep track of the maximum, rather than recalculating it for each sub-array:
y = x.copy()
_max = y[0]
for i in range(1, len(x)):
    y[i] = _max
    _max = max(x[i], _max)
You can use a list comprehension for this, but with the OP's slice x[:i] you need to start the loop from 1, not from 0 (the slice x[:0] is empty). If you want to loop from 0, you can write it like this:
y = [np.max(x[:i + 1]) for i in range(len(x))]
or, starting from 1, like this:
y = [np.max(x[:i]) for i in range(1, len(x) + 1)]
Suppose I have a list of short lowercase [a-z] strings (max length 8):
L = ['cat', 'cod', 'dog', 'cab', ...]
How to efficiently determine if a string s is in this list?
I know I can do if s in L:, but I could also presort L and do a binary search.
I could even build my own tree, letter by letter, so that with s = 'cat', T[ord(s[0]) - ord('a')] gives the subtree leading to 'cat' and 'cab', etc. But eek, messy!
I could also make my own hashfunc, as L is static.
def hash_(item):
    w = [127**i * (ord(j) - ord('0')) for i, j in enumerate(item)]
    return sum(w) % 123456
... and just fiddle the numbers until I don't get duplicates. Again, ugly.
Is there anything out-of-the-box I can use, or must I roll my own?
There are of course going to be solutions everywhere along the complexity/optimisation curve, so my apologies in advance if this question is too open ended.
I'm hunting for something that gives decent performance gain in exchange for a low LoC cost.
The builtin Python set is almost certainly going to be the most efficient device you can use. (Sure, you could roll out cute things such as a DAG of your "vocabulary", but this is going to be much, much slower).
So, convert your list into a set (preferably built once if multiple tests are to be made) and test for membership:
s in set(L)
Or, for multiple tests:
set_values = set(L)
# ...
if s in set_values:
    # ...
Here is a simple example to illustrate the performance:
from string import ascii_lowercase
import random
n = 1_000_000
L = [''.join(random.choices(ascii_lowercase, k=6)) for _ in range(n)]
Time to build the set:
%timeit set(L)
# 99.9 ms ± 49.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Time to query against the set:
set_values = set(L)
# non-existent string
%timeit 'foo' in set_values
# 45.1 ns ± 0.0418 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# existing value
s = L[-1]
a = %timeit -o s in set_values
# 45 ns ± 0.0286 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Contrast that to testing directly against the list:
b = %timeit -o s in L
# 16.5 ms ± 24.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
b.average / a.average
# 359141.74
When's the last time you made a 350,000x speedup ;-) ?
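For completeness, the presort-and-binary-search idea from the question can be sketched with the stdlib bisect module (a hypothetical comparison, not part of the original answer; each lookup is still O(log n), so it will not beat the set's average O(1) hash lookup):

import bisect

sorted_L = sorted(L)  # one-off preprocessing, analogous to building the set

def contains(sorted_list, s):
    # binary search for s in a sorted list
    i = bisect.bisect_left(sorted_list, s)
    return i < len(sorted_list) and sorted_list[i] == s

contains(sorted_L, 'foo')  # False (almost surely, for the random strings above)
contains(sorted_L, L[-1])  # True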
I'm comparing two versions of the Fibonacci routine in Python 3:
import functools

@functools.lru_cache()
def fibonacci_rec(target: int) -> int:
    if target < 2:
        return target
    res = fibonacci_rec(target - 1) + fibonacci_rec(target - 2)
    return res

def fibonacci_it(target: int) -> int:
    if target < 2:
        return target
    n_1 = 2
    n_2 = 1
    for n in range(3, target):
        new = n_2 + n_1
        n_2 = n_1
        n_1 = new
    return n_1
The first version is recursive, with memoization (thanks to lru_cache). The second is simply iterative.
I then benchmarked the two versions and I'm slightly surprised by the results:
In [5]: %timeit fibonacci_rec(1000)
82.7 ns ± 2.94 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [6]: %timeit fibonacci_it(1000)
67.5 µs ± 2.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The iterative version is waaaaay slower than the recursive one. Of course the first run of the recursive version will take lots of time (to cache all the results), and the recursive version takes more memory space (to store all the calls). But I wasn't expecting such difference on the runtime. Don't I get some overhead by calling a function, compared to just iterating over numbers and swapping variables?
As you can see, timeit invokes the function many times, to get a reliable measurement. The LRU cache of the recursive version is not being cleared between invocations, so after the first run, fibonacci_rec(1000) is just returned from the cache immediately without doing any computation.
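One way to confirm this (a sketch, using the cache_info() helper that lru_cache attaches to the wrapped function):

print(fibonacci_rec.cache_info())
# after repeated timing runs, the hit count keeps growing while the miss count
# stays put, i.e. the later calls are answered straight from the cache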
As explained by @Thomas, the cache isn't cleared between invocations of fibonacci_rec (so the result of fibonacci_rec(1000) will be cached and re-used). Here is a better benchmark:
def wrapper_rec(target: int) -> int:
    res = fibonacci_rec(target)
    fibonacci_rec.cache_clear()
    return res

def wrapper_it(target: int) -> int:
    res = fibonacci_it(target)
    # Just to make sure the comparison will be consistent
    fibonacci_rec.cache_clear()
    return res
And the results:
In [9]: %timeit wrapper_rec(1000)
445 µs ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [10]: %timeit wrapper_it(1000)
67.5 µs ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
There is a drastic performance hit when using a keyfunc in heapq.nlargest:
>>> from random import random
>>> from heapq import nlargest
>>> data = [random() for _ in range(1234567)]
>>> %timeit nlargest(10, data)
30.2 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit nlargest(10, data, key=lambda n: n)
159 ms ± 6.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I expected a small extra cost, perhaps something like 30% - not 400%. This degradation seems to be reproducible over a few different data sizes. You can see in the source code that there is special-case handling for key is None, but otherwise the implementation looks more or less the same.
Why is performance so degraded by using a key function? Is it only due to the extra function call overhead, or is the algorithm fundamentally changed somehow by using a keyfunc?
For comparison, sorted takes about a 30% hit with the same data and lambda.
Say your iterable has N elements. Whether sorting or doing nlargest, the key function will be called N times. When sorting, that overhead is largely buried under roughly N * log2(N) other operations. But when doing nlargest of k items, there are only roughly N * log2(k) other operations, which is much smaller when k is much smaller than N.
In your example, N = 1234567 and k = 10, and so the ratio of other operations, sorting over nlargest, is roughly:
>>> log2(1234567) / log2(10)
6.0915146640862625
That this is close to 6 is purely coincidence ;-) It's the qualitative point that matters: the overhead of using a key function is much more significant for nlargest than for sorting randomly ordered data, provided k is much smaller than N.
In fact, that greatly understates the relative burden for nlargest, because the O(log2(k)) heapreplace is called in the latter only when the next element is larger than the k'th largest seen so far. Most of the time it isn't, and so the loop on such an iteration is nearly pure overhead, calling a Python-level key function just to discover that the result isn't interesting.
Quantifying that is beyond me, though; for example, on my Win10 box under Python 3.6.5, I only see a timing difference in your code a bit less than a factor of 3. That doesn't surprise me - calling a Python-level function is much more expensive than poking a list iterator and doing an integer compare (both "at C speed").
The extra overhead of calling lambda n: n so many times is really just that expensive.
In [17]: key = lambda n: n
In [18]: x = [random() for _ in range(1234567)]
In [19]: %timeit nlargest(10, x)
33.1 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [20]: %timeit nlargest(10, x, key=key)
133 ms ± 3.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [21]: %%timeit
...: for i in x:
...: key(i)
...:
93.2 ms ± 978 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [22]: %%timeit
...: for i in x:
...: pass
...:
10.1 ms ± 298 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As you can see, the cost of calling key on all the elements accounts for almost the entirety of the overhead.
Key evaluations are equally expensive for sorted, but because the total work of sorting is more expensive, the overhead of key calls is a smaller percentage of the total. You should have compared the absolute overhead of using a key with nlargest or sorted, rather than the overhead as a percentage of the base.
In [23]: %timeit sorted(x)
542 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [24]: %timeit sorted(x, key=key)
683 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As you can see, the cost of key calls accounts for about half the overhead of using this key with sorted on this input, the rest of the overhead probably coming from the work of shuffling more data around in the sort itself.
You might wonder how nlargest manages to do so little work per element. For the no-key case, most iteration happens in the following loop:
for elem in it:
    if top < elem:
        _heapreplace(result, (elem, order))
        top = result[0][0]
        order -= 1
or for the case with a key:
for elem in it:
    k = key(elem)
    if top < k:
        _heapreplace(result, (k, order, elem))
        top = result[0][0]
        order -= 1
The crucial realization is that the top < elem and top < k branches are almost never taken. Once the algorithm has found 10 fairly large elements, most of the remaining elements are going to be smaller than the 10 current candidates. On the rare occasions where a heap element needs to be replaced, that just makes it even harder for further elements to pass the bar needed to call heapreplace.
On a random input, the number of heapreplace calls nlargest makes is expected to be logarithmic in the size of the input. Specifically, for nlargest(10, x), aside from the first 10 elements of x, element x[i] has a 10/(i+1) probability of being in the top 10 elements of x[:i+1], which is the condition necessary for a heapreplace call. By linearity of expectation, the expected number of heapreplace calls is the sum of these probabilities, and that sum is O(log(len(x))). (This analysis holds with 10 replaced by any constant, but a slightly more sophisticated analysis is needed for a variable n in nlargest(n, x).)
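To make that concrete, here is a small simulation (a sketch, not part of the original answer) that counts how often a new element would displace one of the current top 10 on random data:

import random

random.seed(0)
data = [random.random() for _ in range(1234567)]

top = sorted(data[:10])        # current top-10 candidates, smallest first
replacements = 0
for x in data[10:]:
    if x > top[0]:             # the same test that guards heapreplace in nlargest
        top[0] = x
        top.sort()             # a real heap does this step in O(log k)
        replacements += 1

print(replacements)            # expectation is ~ 10 * ln(N / 10), i.e. roughly 120 here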
The performance story would be very different for a sorted input, where every element would pass the if check:
In [25]: sorted_x = sorted(x)
In [26]: %timeit nlargest(10, sorted_x)
463 ms ± 26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Over 10 times as expensive as the unsorted case!
import numpy as np
from datetime import datetime
import math

def norm(l):
    s = 0
    for i in l:
        s += i**2
    return math.sqrt(s)

def foo(a, b, f):
    l = range(a)
    s = datetime.now()
    for i in range(b):
        f(l)
    e = datetime.now()
    return e - s

foo(10**4, 10**5, norm)
foo(10**4, 10**5, np.linalg.norm)
foo(10**2, 10**7, norm)
foo(10**2, 10**7, np.linalg.norm)
I got the following output:
0:00:43.156278
0:00:23.923239
0:00:44.184835
0:01:00.343875
It seems like when np.linalg.norm is called many times for small-sized data, it runs slower than my norm function.
What is the cause of that?
First of all: datetime.now() isn't appropriate for measuring performance; it measures wall-clock time, and you may just pick a bad moment (for your computer) when a high-priority process runs or Python's GC kicks in, ...
There are dedicated timing functions/modules available in Python: the built-in timeit module or %timeit in IPython/Jupyter and several other external modules (like perf, ...)
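Outside IPython, the plain timeit module gives the same kind of measurement; a minimal sketch (assuming the question's norm() is already defined in the current session):

import timeit

l = range(10**4)
best = min(timeit.repeat('norm(l)', globals=globals(), number=100, repeat=5))
print(f'{best / 100 * 1e6:.1f} µs per call')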
Let's see what happens if I use these on your data:
import numpy as np
import math

def norm(l):
    s = 0
    for i in l:
        s += i**2
    return math.sqrt(s)
r1 = range(10**4)
r2 = range(10**2)
%timeit norm(r1)
3.34 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.linalg.norm(r1)
1.05 ms ± 3.92 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit norm(r2)
30.8 µs ± 1.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.linalg.norm(r2)
14.2 µs ± 313 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
It isn't slower for short iterables; it's still faster. However, note that the real advantage of NumPy functions comes when you already have NumPy arrays:
a1 = np.arange(10**4)
a2 = np.arange(10**2)
%timeit np.linalg.norm(a1)
18.7 µs ± 539 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.linalg.norm(a2)
4.03 µs ± 157 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Yeah, it's quite a lot faster now: 18.7 µs vs. 1.05 ms, roughly 50 times faster for 10000 elements. That means most of the time of np.linalg.norm in your examples was spent converting the range to an np.array.
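That claim is easy to sanity-check (a sketch): time the conversion on its own and compare it with the full call above:

%timeit np.array(r1)
# the conversion of the 10**4-element range alone typically accounts for a large
# share of the ~1 ms measured for np.linalg.norm(r1) above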
You are on the right track.
np.linalg.norm has quite a high overhead on small arrays. On large arrays both the jit-compiled function and np.linalg.norm run into a memory bottleneck, which is expected for a function that does simple multiplications most of the time.
If the jitted function is called from another jitted function it might get inlined, which can lead to a much larger advantage over the NumPy norm function.
Example
import numba as nb
import numpy as np

@nb.njit(fastmath=True)
def norm(l):
    s = 0.
    for i in range(l.shape[0]):
        s += l[i]**2
    return np.sqrt(s)
Performance
r1 = np.array(np.arange(10**2), dtype=np.int32)
Numba:  0.42 µs
linalg: 4.46 µs

r1 = np.array(np.arange(10**4), dtype=np.int32)
Numba:  8.9 µs
linalg: 13.4 µs

r1 = np.array(np.arange(10**2), dtype=np.float64)
Numba:  0.35 µs
linalg: 3.71 µs

r2 = np.array(np.arange(10**4), dtype=np.float64)
Numba:  1.4 µs
linalg: 5.6 µs
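To illustrate the inlining point made above, here is a hedged sketch of calling the jitted norm() from another jitted function (normalize() is a made-up example, not part of the original answer):

@nb.njit(fastmath=True)
def normalize(l):
    # norm() is called from jitted code here, so Numba can inline it and the
    # Python-level call overhead disappears entirely
    n = norm(l)
    out = np.empty(l.shape[0])
    for i in range(l.shape[0]):
        out[i] = l[i] / n
    return out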
Measuring Performance
Call the jit-compiled function once before the measurement (there is a compilation overhead on the first call).
Check whether the measurement is valid: small arrays stay in the processor cache, which can give overly optimistic results compared to realistic examples that exceed your RAM throughput.
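A minimal sketch of both points, assuming the jitted norm() from the example above:

import timeit

r = np.array(np.arange(10**4), dtype=np.float64)
norm(r)  # warm-up: the first call includes compilation time
best = min(timeit.repeat(lambda: norm(r), number=10000, repeat=5))
print(f'{best / 10000 * 1e6:.2f} µs per call')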