I know this subject is well discussed, but I've come across a case I don't really understand: how is the recursive method "slower" than a method using reduce, lambda, and xrange?
def factorial2(x, rest=1):
    if x <= 1:
        return rest
    else:
        return factorial2(x-1, rest*x)

def factorial3(x):
    if x <= 1:
        return 1
    return reduce(lambda a, b: a*b, xrange(1, x+1))
I know Python doesn't optimize tail recursion, so the question isn't about that. To my understanding, xrange should still generate n numbers using the +1 operator, so technically fact(n) adds a number n times, just like the recursive version. The lambda passed to reduce is called n times, just like the recursive method. Since we don't have tail-call optimization in either case, stack frames are created, destroyed, and returned from n times, and a check inside the iterator has to decide when to raise StopIteration.
This makes me wonder why the recursive method is still slower than the other one, since the recursive one uses simple arithmetic and no generators.
In a test, I replaced rest*x with x in the recursive method, and the time spent dropped to be on par with the reduce-based method.
Here are my timings for fact(400), run 1000 times:
factorial3 : 1.22370505333
factorial2 : 1.79896998405
Edit:
Making the method count from 1 up to n instead of from n down to 1 doesn't help either, so the overhead isn't from the -1.
Also, can we make the recursive method faster? I tried multiple things: global variables that I can change; a mutable context, placing variables in cells (like an array) that I can modify, so the recursive method takes no parameters; passing the function used for recursion as a parameter so we don't have to dereference it in our scope... But nothing makes it faster.
I'll point out that I have a version of the factorial that uses a for loop and is much faster than both of these methods, so there is clearly room for improvement, but I wouldn't expect anything faster than the for loop.
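For reference, the for-loop version alluded to above might look roughly like this (a sketch, not the exact code from my tests; the original timings used Python 2's xrange):

```python
def factorial_loop(x):
    # plain iterative product: no per-step function-call overhead,
    # no stack growth, just one running big-integer multiply
    result = 1
    for i in range(2, x + 1):
        result *= i
    return result

print(factorial_loop(5))  # 120
```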
The slowness of the recursive version comes from the need to resolve the named argument variables on each call. I have provided a different recursive implementation that has only one argument, and it works slightly faster.
$ cat fact.py
def factorial_recursive1(x):
    if x <= 1:
        return 1
    else:
        return factorial_recursive1(x-1)*x

def factorial_recursive2(x, rest=1):
    if x <= 1:
        return rest
    else:
        return factorial_recursive2(x-1, rest*x)

def factorial_reduce(x):
    if x <= 1:
        return 1
    return reduce(lambda a, b: a*b, xrange(1, x+1))

# Ignore the rest of the code for now, we'll get back to it later in the answer

def range_prod(a, b):
    if a + 1 < b:
        c = (a+b)//2
        return range_prod(a, c) * range_prod(c, b)
    else:
        return a

def factorial_divide_and_conquer(n):
    return 1 if n <= 1 else range_prod(1, n+1)
$ ipython -i fact.py
In [1]: %timeit factorial_recursive1(400)
10000 loops, best of 3: 79.3 µs per loop
In [2]: %timeit factorial_recursive2(400)
10000 loops, best of 3: 90.9 µs per loop
In [3]: %timeit factorial_reduce(400)
10000 loops, best of 3: 61 µs per loop
Since very large numbers are involved in your example, I initially suspected that the performance difference might be due to the order of multiplication. Multiplying a large partial product by the next number on every iteration takes time proportional to the number of digits/bits in the product, so the time complexity of such a method is O(n²), where n is the number of bits in the final product. It is better to use a divide-and-conquer technique, where the final result is obtained as the product of two approximately equally long values, each of which is computed recursively in the same manner. So I implemented that version too (see factorial_divide_and_conquer(n) in the code above). As you can see below, it still loses to the reduce()-based version for small arguments (due to the same problem with named parameters) but outperforms it for large arguments.
In [4]: %timeit factorial_divide_and_conquer(400)
10000 loops, best of 3: 90.5 µs per loop
In [5]: %timeit factorial_divide_and_conquer(4000)
1000 loops, best of 3: 1.46 ms per loop
In [6]: %timeit factorial_reduce(4000)
100 loops, best of 3: 3.09 ms per loop
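As a quick sanity check that the divide-and-conquer ordering still computes the same product (the two functions from fact.py are restated here so the snippet stands alone):

```python
import math

def range_prod(a, b):
    # product of the integers in [a, b), split recursively so the two
    # halves are multiplied as similarly-sized big integers
    if a + 1 < b:
        c = (a + b) // 2
        return range_prod(a, c) * range_prod(c, b)
    return a

def factorial_divide_and_conquer(n):
    return 1 if n <= 1 else range_prod(1, n + 1)

print(factorial_divide_and_conquer(10) == math.factorial(10))  # True
```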
UPDATE
Trying to run the factorial_recursive?() versions with x=4000 hits the default recursion limit, so the limit must be increased:
In [7]: sys.setrecursionlimit(4100)
In [8]: %timeit factorial_recursive1(4000)
100 loops, best of 3: 3.36 ms per loop
In [9]: %timeit factorial_recursive2(4000)
100 loops, best of 3: 7.02 ms per loop
Related
I want to determine whether or not my list (actually a numpy.ndarray) contains duplicates in the fastest possible execution time. Note that I don't care about removing the duplicates, I simply want to know if there are any.
Note: I'd be extremely surprised if this is not a duplicate, but I've tried my best and can't find one. Closest are this question and this question, both of which are requesting that the unique list be returned.
Here are the four ways I thought of doing it.
TL;DR: if you expect very few (less than 1/1000) duplicates:
def contains_duplicates(X):
    return len(np.unique(X)) != len(X)
If you expect frequent (more than 1/1000) duplicates:
def contains_duplicates(X):
    seen = set()
    seen_add = seen.add
    for x in X:
        if x in seen or seen_add(x):
            return True
    return False
The first method is an early exit from this answer (which returns the unique values), and the second is the same idea applied to this answer.
>>> import numpy as np
>>> X = np.random.normal(0,1,[10000])
>>> def terhorst_early_exit(X):
...:     elems = set()
...:     for i in X:
...:         if i in elems:
...:             return True
...:         elems.add(i)
...:     return False
>>> %timeit terhorst_early_exit(X)
100 loops, best of 3: 10.6 ms per loop
>>> def peterbe_early_exit(X):
...:     seen = set()
...:     seen_add = seen.add
...:     for x in X:
...:         if x in seen or seen_add(x):
...:             return True
...:     return False
>>> %timeit peterbe_early_exit(X)
100 loops, best of 3: 9.35 ms per loop
>>> %timeit len(set(X)) != len(X)
100 loops, best of 3: 4.54 ms per loop
>>> %timeit len(np.unique(X)) != len(X)
1000 loops, best of 3: 967 µs per loop
Do things change if you start with an ordinary Python list, and not a numpy.ndarray?
>>> X = X.tolist()
>>> %timeit terhorst_early_exit(X)
100 loops, best of 3: 9.34 ms per loop
>>> %timeit peterbe_early_exit(X)
100 loops, best of 3: 8.07 ms per loop
>>> %timeit len(set(X)) != len(X)
100 loops, best of 3: 3.09 ms per loop
>>> %timeit len(np.unique(X)) != len(X)
1000 loops, best of 3: 1.83 ms per loop
Edit: what if we have a prior expectation of the number of duplicates?
The above comparison is functioning under the assumption that a) there are likely to be no duplicates, or b) we're more worried about the worst case than the average case.
>>> X = np.random.normal(0, 1, [10000])
>>> for n_duplicates in [1, 10, 100]:
...:     print("{} duplicates".format(n_duplicates))
...:     duplicate_idx = np.random.choice(len(X), n_duplicates, replace=False)
...:     X[duplicate_idx] = 0
...:     print("terhost_early_exit")
...:     %timeit terhorst_early_exit(X)
...:     print("peterbe_early_exit")
...:     %timeit peterbe_early_exit(X)
...:     print("set length")
...:     %timeit len(set(X)) != len(X)
...:     print("numpy unique length")
...:     %timeit len(np.unique(X)) != len(X)
1 duplicates
terhost_early_exit
100 loops, best of 3: 12.3 ms per loop
peterbe_early_exit
100 loops, best of 3: 9.55 ms per loop
set length
100 loops, best of 3: 4.71 ms per loop
numpy unique length
1000 loops, best of 3: 1.31 ms per loop
10 duplicates
terhost_early_exit
1000 loops, best of 3: 1.81 ms per loop
peterbe_early_exit
1000 loops, best of 3: 1.47 ms per loop
set length
100 loops, best of 3: 5.44 ms per loop
numpy unique length
1000 loops, best of 3: 1.37 ms per loop
100 duplicates
terhost_early_exit
10000 loops, best of 3: 111 µs per loop
peterbe_early_exit
10000 loops, best of 3: 99 µs per loop
set length
100 loops, best of 3: 5.16 ms per loop
numpy unique length
1000 loops, best of 3: 1.19 ms per loop
So if you expect very few duplicates, the numpy.unique function is the way to go. As the number of expected duplicates increases, the early exit methods dominate.
Depending on how large your array is, and how likely duplicates are, the answer will be different.
For example, if you expect the average array to have around 3 duplicates, early exit will cut your average-case time (and space) by 2/3rds; if you expect only 1 in 1000 arrays to have any duplicates at all, it will just add a bit of complexity without improving anything.
Meanwhile, if the arrays are big enough that building a temporary set as large as the array is likely to be expensive, sticking a probabilistic test like a bloom filter in front of it will probably speed things up dramatically, but if not, it's again just wasted effort.
Finally, you want to stay within numpy if at all possible. Looping over an array of floats (or whatever) and boxing each one into a Python object is going to take almost as much time as hashing and checking the values, and of course storing things in a Python set instead of optimized numpy storage is wasteful as well. But you have to trade that off against the other issues: you can't do early exit with numpy, and there may be nice C-optimized bloom filter implementations a pip install away, but maybe none that are numpy-friendly.
So, there's no one best solution for all possible scenarios.
Just to give an idea of how easy it is to write a bloom filter, here's one I hacked together in a couple minutes:
from bitarray import bitarray  # pip3 install bitarray

def dupcheck(X):
    # Hardcoded values to give about 5% false positives for 10000 elements
    size = 62352
    hashcount = 4
    bits = bitarray(size)
    bits.setall(0)
    def check(x, hash=hash):  # TODO: default-value bits, hashcount, size?
        for i in range(hashcount):
            if not bits[hash((x, i)) % size]:
                return False
        return True
    def add(x):
        for i in range(hashcount):
            bits[hash((x, i)) % size] = True
    seen = set()
    seen_add = seen.add
    for x in X:
        if check(x) or add(x):
            if x in seen or seen_add(x):
                return True
    return False
This only uses 12KB (a 62352-bit bitarray plus a 500-float set) instead of 80KB (a 10000-float set or np.array). Which doesn't matter when you're only dealing with 10K elements, but with, say, 10B elements that use up more than half of your physical RAM, it would be a different story.
Of course it's almost certainly going to be an order of magnitude or so slower than using np.unique, or maybe even set, because we're doing all that slow looping in Python. But if this turns out to be worth doing, it should be a breeze to rewrite in Cython (and to directly access the numpy array without boxing and unboxing).
My timing tests differ from Scott's for small lists. Using Python 3.7.3, set() is much faster than np.unique for a small numpy array from randint (length 8), but slower for a larger array (length 1000).
Length 8
Timing test iterations: 10000
Function Min Avg Sec Conclusion p-value
---------- --------- ----------- ------------ ---------
set_len 0 7.73486e-06 Baseline
unique_len 9.644e-06 2.55573e-05 Slower 0
Length 1000
Timing test iterations: 10000
Function Min Avg Sec Conclusion p-value
---------- ---------- ----------- ------------ ---------
set_len 0.00011066 0.000270466 Baseline
unique_len 4.3684e-05 8.95608e-05 Faster 0
Then I tried my own implementation, but I think it would require optimized C code to beat set:
def check_items(key_rand, **kwargs):
    for i, vali in enumerate(key_rand):
        for j in range(i+1, len(key_rand)):
            valj = key_rand[j]
            if vali == valj:
                break
Length 8
Timing test iterations: 10000
Function Min Avg Sec Conclusion p-value
----------- ---------- ----------- ------------ ---------
set_len 0 6.74221e-06 Baseline
unique_len 0 2.14604e-05 Slower 0
check_items 1.1138e-05 2.16369e-05 Slower 0
(using my randomized compare_time() function from easyinfo)
I am trying to write code for a super-fast factorial function. I have experimented a little and have come up with the following three candidates (apart from math.factorial):
def f1():
    return reduce(lambda x, y: x * y, xrange(1, 31))

def f2():
    result = 1
    result *= 2
    result *= 3
    result *= 4
    result *= 5
    result *= 6
    result *= 7
    result *= 8
    # and so on...
    result *= 28
    result *= 29
    result *= 30
    return result

def f3():
    return 1*2*3*4*5*6*7*8*9*10*11*12*13*14*15*16*17*18*19*20*21*22*23*24*25*26*27*28*29*30
I have timed these functions. These are the results:
In [109]: timeit f1()
100000 loops, best of 3: 11.9 µs per loop
In [110]: timeit f2()
100000 loops, best of 3: 5.05 µs per loop
In [111]: timeit f3()
10000000 loops, best of 3: 143 ns per loop
In [112]: timeit math.factorial(30)
1000000 loops, best of 3: 2.11 µs per loop
Clearly, f3() takes the cake. I have tried generalizing it. To be specific, I tried writing code that generates a string like "1*2*3*4*5*6*7*8*9*10*11*12*13*14..." and then uses eval to evaluate that string (acknowledging that eval is evil). However, this method gave me no gains in time at all; in fact, it took nearly 150 microseconds to finish.
Please advise on how to generalize f3().
f3 is only fast because it isn't actually computing anything when you call it. The whole computation gets optimized out at compile time and replaced with the final value, so all you're timing is function call overhead.
This is particularly obvious if we disassemble the function with the dis module:
>>> import dis
>>> dis.dis(f3)
2 0 LOAD_CONST 59 (265252859812191058636308480000000L)
3 RETURN_VALUE
It is impossible to generalize this speedup to a function that takes an argument and returns its factorial.
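One way to see the difference directly is to inspect the compiled constants: a constant expression gets folded into the code object at compile time, while anything depending on an argument cannot be (a minimal sketch):

```python
def folded():
    return 2 * 3  # constant expression: folded at compile time

def not_folded(n):
    return 2 * n  # depends on the argument: computed on every call

# the folded result 6 is baked into the code object's constants
print(6 in folded.__code__.co_consts)      # True
print(6 in not_folded.__code__.co_consts)  # False
```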
f3() takes the cake because when the function is def'ed, Python just optimizes the chain of multiplications down to the final result, and the effective definition of f3() becomes:
def f3():
    return 265252859812191058636308480000000
which, because no computation need occur when the function is called, runs really fast!
One way to reproduce the effect of placing a * operator between a list of numbers is to use reduce from the functools module. Is this something like what you're looking for?
from functools import reduce

def fact(x):
    # initializer 1 so that fact(0) == 1 instead of raising on an empty range
    return reduce((lambda x, y: x * y), range(1, x+1), 1)
I would argue that none of these is a good factorial function, since none takes the number as a parameter. The reason the last one works well is that it minimizes the number of interpreter steps, but that's still not a good answer: all of them have the same complexity (linear in the value). We can do better: O(1).
import math

def factorial(x):
    return math.gamma(x+1)
This runs in constant time with respect to the input value, at the sacrifice of some accuracy. Still, it's way better when performance matters.
We can do a quick benchmark:
import math

def factorial_gamma(x):
    return math.gamma(x+1)

def factorial_linear(x):
    if x == 0 or x == 1:
        return 1
    return x * factorial_linear(x-1)
In [10]: factorial_linear(50)
Out[10]: 30414093201713378043612608166064768844377641568960512000000000000
In [11]: factorial_gamma(50)
Out[11]: 3.0414093201713376e+64
In [12]: %timeit factorial_gamma(50)
537 ns ± 6.84 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [13]: %timeit factorial_linear(50)
17.2 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
A roughly 30-fold speedup for the factorial of 50. Not bad.
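To make the accuracy trade-off concrete: math.gamma returns a float, so results are only approximate once the exact factorial exceeds what a double can represent, and the shortcut stops working entirely somewhere past 170! (a small sketch):

```python
import math

print(math.gamma(6))   # ≈ 120.0, i.e. 5!
print(math.gamma(51))  # a float approximation of 50!

# doubles top out around 1.8e308, so 171! overflows:
try:
    math.gamma(172)
except OverflowError:
    print("OverflowError")
```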
https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.misc.factorial.html
As others have stated, f3() isn't actually computing anything, which is why you get such fast results. You can't achieve the same once the function takes an argument.
Also, you may be wondering why math.factorial() is so fast: it's because the math module's functions are implemented in C:
This module is always available. It provides access to the mathematical functions defined by the C standard
By using an efficient algorithm in C, you get such fast results.
Here your best bet would be the function below, but using math.factorial is what I prefer if you're purely in need of performance:
def f3(x):
    ans = 1
    for i in range(1, x+1):
        ans = ans * i
    return ans

print(f3(30))
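As a quick check, the loop agrees with the C implementation (restated here so the snippet runs on its own):

```python
import math

def f3(x):
    ans = 1
    for i in range(1, x + 1):
        ans = ans * i
    return ans

print(f3(30) == math.factorial(30))  # True
```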
I'm trying to make my code faster by removing some for loops and using arrays. The slowest step right now is the generation of the random lists.
Context: I have a number of mutations in a chromosome, and I want to generate 1000 random "chromosomes" with the same length and the same number of mutations, but with randomized positions.
Here is what I'm currently running to generate these randomized mutation positions:
iterations = 1000
Chr_size = 1000000
num_mut = 500

randbps = []
for k in range(iterations):
    listed = np.random.choice(range(Chr_size), num_mut, replace=False)
    randbps.append(listed)
I want to do something similar to what they cover in this question
np.random.choice(range(Chr_size),size=(num_mut,iterations),replace=False)
however, replace=False there applies to the draw as a whole, not independently to each column.
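A quick illustration of that behavior: with a 2D size, a single np.random.choice call with replace=False draws distinct values across the entire output array (a sketch with smaller numbers):

```python
import numpy as np

# one call, 2D output: all 50 drawn values are distinct across the whole array,
# not just within each row or column
sample = np.random.choice(1000, size=(10, 5), replace=False)
print(len(np.unique(sample)) == sample.size)  # True
```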
Further context: later in the script I go through each randomized chromosome and count the number of mutations in a given window:
for l in range(len(randbps)):
    arr = np.asarray(randbps[l])
    for i in range(chr_last_window[f])[::step]:
        counter = ((i < arr) & (arr < i+window)).sum()
I don't know how np.random.choice is implemented, but I am guessing it is optimized for the general case. Your numbers, on the other hand, are unlikely to produce repeated values, so sets may be more efficient for this case, building from scratch:
import random
import numpy as np

def gen_2d(iterations, Chr_size, num_mut):
    randbps = set()
    while len(randbps) < iterations:
        listed = set()
        while len(listed) < num_mut:
            listed.add(random.choice(range(Chr_size)))
        randbps.add(tuple(sorted(listed)))
    return np.array(list(randbps))
This function starts with an empty set, generates a single number in range(Chr_size), and adds it to the set. Because of the properties of sets, it cannot add the same number again. It does the same thing with randbps, so each element of randbps is also unique.
For only one iteration of np.random.choice vs gen_2d:
iterations=1000
Chr_size=1000000
num_mut=500
%timeit np.random.choice(range(Chr_size),num_mut,replace=False)
10 loops, best of 3: 141 ms per loop
%timeit gen_2d(1, Chr_size, num_mut)
1000 loops, best of 3: 647 µs per loop
Based on the trick used in this solution, here's an approach that uses argsort/argpartition on an array of random elements to simulate numpy.random.choice without replacement to give us randbps as a 2D array -
np.random.rand(iterations,Chr_size).argpartition(num_mut)[:,:num_mut]
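A quick way to convince yourself that the argpartition trick draws without replacement within each row (a scaled-down sketch):

```python
import numpy as np

iterations, Chr_size, num_mut = 100, 1000, 50

# argpartition returns, per row, a permutation of the column indices,
# so the first num_mut entries of each row are distinct positions
randbps = np.random.rand(iterations, Chr_size).argpartition(num_mut)[:, :num_mut]

print(all(len(np.unique(row)) == num_mut for row in randbps))  # True
```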
Runtime test -
In [2]: def original_app(iterations,Chr_size,num_mut):
   ...:     randbps=[]
   ...:     for k in range(iterations):
   ...:         listed=np.random.choice(range(Chr_size),num_mut,replace=False)
   ...:         randbps.append(listed)
   ...:     return randbps
   ...:
In [3]: # Input params (scaled down version of params listed in question)
...: iterations=100
...: Chr_size=100000
...: num=50
...:
In [4]: %timeit original_app(iterations,Chr_size,num)
1 loops, best of 3: 1.53 s per loop
In [5]: %timeit np.random.rand(iterations,Chr_size).argpartition(num)[:,:num]
1 loops, best of 3: 424 ms per loop
I have a Python function that takes a list and returns a generator yielding 2-tuples of each adjacent pair, e.g.
>>> list(pairs([1, 2, 3, 4]))
[(1, 2), (2, 3), (3, 4)]
I've considered an implementation using 2 slices:
def pairs(xs):
    for p in zip(xs[:-1], xs[1:]):
        yield p
and one written in a more procedural style:
def pairs(xs):
    last = object()
    dummy = last
    for x in xs:
        if last is not dummy:
            yield last, x
        last = x
Testing using range(2 ** 15) as input yields the following times (you can find my testing code and output here):
2 slices: 100 loops, best of 3: 4.23 msec per loop
0 slices: 100 loops, best of 3: 5.68 msec per loop
Part of the performance hit for the sliceless implementation is the comparison in the loop (if last is not dummy). Removing that (making the output incorrect) improves its performance, but it's still slower than the zip-a-pair-of-slices implementation:
2 slices: 100 loops, best of 3: 4.48 msec per loop
0 slices: 100 loops, best of 3: 5.2 msec per loop
So, I'm stumped. Why is zipping together 2 slices, effectively iterating over the list twice in parallel, faster than iterating once, updating last and x as you go?
EDIT
Dan Lenski proposed a third implementation:
def pairs(xs):
    for ii in range(1, len(xs)):
        yield xs[ii-1], xs[ii]
Here's its comparison to the other implementations:
2 slices: 100 loops, best of 3: 4.37 msec per loop
0 slices: 100 loops, best of 3: 5.61 msec per loop
Lenski's: 100 loops, best of 3: 6.43 msec per loop
It's even slower! Which is baffling to me.
EDIT 2:
ssm suggested using itertools.izip instead of zip, and it's even faster than zip:
2 slices, izip: 100 loops, best of 3: 3.68 msec per loop
So, izip is the winner so far! But still for difficult-to-inspect reasons.
Lots of interesting discussion elsewhere in this thread. Basically, we started out comparing two versions of this function, which I'm going to describe with the following dumb names:
The "zip-py" version:
def pairs(xs):
    for p in zip(xs[:-1], xs[1:]):
        yield p
The "loopy" version:
def pairs(xs):
    last = object()
    dummy = last
    for x in xs:
        if last is not dummy:
            yield last, x
        last = x
So why does the loopy version turn out to be slower? Basically, I think it comes down to a couple things:
The loopy version explicitly does extra work: it compares two objects' identities (if last is not dummy: ...) on every pair-generating iteration of the inner loop. @mambocab's edit shows that skipping this comparison does make the loopy version slightly faster, but doesn't fully close the gap.
The zippy version does in compiled C code work that the loopy version does in Python code:
Combining two objects into a tuple. The loopy version does yield last,x, while in the zippy version the tuple p comes straight from zip, so it just does yield p.
Binding variable names to object: the loopy version does this twice in every loop, assigning x in the for loop and last=x. The zippy version does this just once, in the for loop.
Interestingly, there is one way in which the zippy version is actually doing more work: it uses two listiterators, iter(xs[:-1]) and iter(xs[1:]), which get passed to zip. The loopy version only uses one listiterator (for x in xs).
Creating a listiterator object (the output of iter([])) is likely a very highly optimized operation since Python programmers use it so frequently.
Iterating over list slices, xs[:-1] and xs[1:], is a very lightweight operation which adds almost no overhead compared to iterating over the whole list. Essentially, it just means moving the starting or ending point of the iterator, but not changing what happens on each iteration.
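For completeness, the classic itertools pairwise recipe sidesteps both the slices and the sentinel comparison by teeing a single iterator; it also works on arbitrary iterables, not just lists (not one of the versions timed above, just a related sketch):

```python
from itertools import tee

def pairwise(iterable):
    # two views of the same iterator; advance the second by one, then zip
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

print(list(pairwise([1, 2, 3, 4])))  # [(1, 2), (2, 3), (3, 4)]
```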
This is the result for izip, which is actually closest to your implementation. It looks like what you would expect. The zip version creates the entire list in memory within the function, so it is the fastest. The loop version just loops through the list, so it is a little slower. izip is the closest in resemblance to your code, but I am guessing there is some memory-management backend processing that increases the execution time.
In [11]: %timeit pairsLoop([1,2,3,4,5])
1000000 loops, best of 3: 651 ns per loop
In [12]: %timeit pairsZip([1,2,3,4,5])
1000000 loops, best of 3: 637 ns per loop
In [13]: %timeit pairsIzip([1,2,3,4,5])
1000000 loops, best of 3: 655 ns per loop
The version of code is shown below as requested:
from itertools import izip

def pairsIzip(xs):
    for p in izip(xs[:-1], xs[1:]):
        yield p

def pairsZip(xs):
    for p in zip(xs[:-1], xs[1:]):
        yield p

def pairsLoop(xs):
    last = object()
    dummy = last
    for x in xs:
        if last is not dummy:
            yield last, x
        last = x
I suspect the main reason that the second version is slower is because it does a comparison operation for every single pair that it yields:
# pair-generating loop
for x in xs:
    if last is not dummy:
        yield last, x
    last = x
The first version does not do anything but spit out values. With the loop variables renamed, it's equivalent to this:
# pair-generating loop
for last, x in zip(xs[:-1], xs[1:]):
    yield last, x
It's not especially pretty or Pythonic, but you could write a procedural version without a comparison in the inner loop. How fast does this one run?
def pairs(xs):
    for ii in range(1, len(xs)):
        yield xs[ii-1], xs[ii]
I'm really confused by function call speed in Python. The first and second cases are nothing unexpected:
%timeit reduce(lambda res, x: res+x, range(1000))
10000 loops, best of 3: 150 µs per loop
def my_add(res, x):
    return res + x
%timeit reduce(my_add, range(1000))
10000 loops, best of 3: 148 µs per loop
But the third case looks strange to me:
from operator import add
%timeit reduce(add, range(1000))
10000 loops, best of 3: 80.1 µs per loop
At the same time:
%timeit add(10, 100)
%timeit 10 + 100
10000000 loops, best of 3: 94.3 ns per loop
100000000 loops, best of 3: 14.7 ns per loop
So why does the third case give a speedup of about 50%?
add is implemented in C.
>>> from operator import add
>>> add
<built-in function add>
>>> def my_add(res, x):
...     return res + x
...
>>> my_add
<function my_add at 0x18358c0>
The reason that a straight + is faster is that add still has to call the Python VM's BINARY_ADD instruction as well as perform some other work due to it being a function, while + is only a BINARY_ADD instruction.
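You can see the extra layer with dis: the Python-level function still contains the VM's add instruction in its body, on top of the call and frame overhead every Python function pays (note that newer CPythons spell the opcode BINARY_OP rather than BINARY_ADD):

```python
import dis

def my_add(res, x):
    return res + x

# the function body still has to execute the VM's binary-add instruction,
# in addition to the cost of the call itself
dis.dis(my_add)
```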
The operator module exports a set of efficient functions corresponding to the intrinsic operators of Python. For example, operator.add(x, y) is equivalent to the expression x+y. The function names are those used for special class methods; variants without leading and trailing __ are also provided for convenience.
From Python docs (emphasis mine)
The operator module is an efficient (native, I'd assume) implementation. Calling a native implementation should be quicker than calling a Python function.
You could also try running the interpreter with -O or -OO to enable bytecode optimizations and check the timing again.