Python function call speed

I'm really confused by function call speed in Python. The first and second cases are nothing unexpected:
%timeit reduce(lambda res, x: res+x, range(1000))
10000 loops, best of 3: 150 µs per loop
def my_add(res, x):
    return res + x
%timeit reduce(my_add, range(1000))
10000 loops, best of 3: 148 µs per loop
But the third case looks strange to me:
from operator import add
%timeit reduce(add, range(1000))
10000 loops, best of 3: 80.1 µs per loop
At the same time:
%timeit add(10, 100)
10000000 loops, best of 3: 94.3 ns per loop
%timeit 10 + 100
100000000 loops, best of 3: 14.7 ns per loop
So why does the third case give a speedup of about 50%?

add is implemented in C.
>>> from operator import add
>>> add
<built-in function add>
>>> def my_add(res, x):
...     return res + x
...
>>> my_add
<function my_add at 0x18358c0>
The reason a bare + is faster still is that calling add goes through the function-call machinery (argument packing, dispatch) before the addition happens, while an inline + compiles down to a single BINARY_ADD instruction.
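A quick way to see this for yourself is the dis module (a sketch of mine; the exact opcode names vary by CPython version, newer ones show BINARY_OP instead of BINARY_ADD):
import dis
from operator import add

def my_add(res, x):
    return res + x

dis.dis(my_add)   # needs a full Python frame per call: LOAD_FAST, LOAD_FAST, BINARY_ADD, RETURN_VALUE
print(add)        # <built-in function add> -- implemented in C, no Python frame to set up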

The operator module exports a set of efficient functions corresponding to the intrinsic operators of Python. For example, operator.add(x, y) is equivalent to the expression x+y. The function names are those used for special class methods; variants without leading and trailing __ are also provided for convenience.
From Python docs (emphasis mine)

The operator module is an efficient (native, I'd assume) implementation. IMHO, calling a native implementation should be quicker than calling a Python function.
You could try running the interpreter with -O or -OO and checking the timings again.
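If you want to try that, here is a minimal sketch (mine, not from the answer) using the timeit module; run it under plain python, python -O and python -OO and compare:
import timeit

setup = (
    "from operator import add\n"
    "from functools import reduce\n"
    "def my_add(res, x): return res + x\n"
)
for stmt in ("reduce(lambda res, x: res + x, range(1000))",
             "reduce(my_add, range(1000))",
             "reduce(add, range(1000))"):
    print(stmt, min(timeit.repeat(stmt, setup=setup, number=1000)))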

Related

Python -- Fast Factorial function

I am trying to write code for a super-fast factorial function. I have experimented a little and have come up with the following three candidates (apart from math.factorial):
def f1():
    return reduce(lambda x, y: x * y, xrange(1, 31))

def f2():
    result = 1
    result *= 2
    result *= 3
    result *= 4
    result *= 5
    result *= 6
    result *= 7
    result *= 8
    # and so on...
    result *= 28
    result *= 29
    result *= 30
    return result

def f3():
    return 1*2*3*4*5*6*7*8*9*10*11*12*13*14*15*16*17*18*19*20*21*22*23*24*25*26*27*28*29*30
I have timed these functions. These are the results:
In [109]: timeit f1()
100000 loops, best of 3: 11.9 µs per loop
In [110]: timeit f2()
100000 loops, best of 3: 5.05 µs per loop
In [111]: timeit f3()
10000000 loops, best of 3: 143 ns per loop
In [112]: timeit math.factorial(30)
1000000 loops, best of 3: 2.11 µs per loop
Clearly, f3() takes the cake. I have tried to generalize this. To be specific, I have tried writing code that generates a string like "1*2*3*4*5*6*7*8*9*10*11*12*13*14........" and then uses eval to evaluate it (acknowledging that eval is evil). However, this method gave me no gains in time at all; in fact, it took nearly 150 microseconds to finish.
Please advise on how to generalize f3().
f3 is only fast because it isn't actually computing anything when you call it. The whole computation gets optimized out at compile time and replaced with the final value, so all you're timing is function call overhead.
This is particularly obvious if we disassemble the function with the dis module:
>>> import dis
>>> dis.dis(f3)
2 0 LOAD_CONST 59 (265252859812191058636308480000000L)
3 RETURN_VALUE
It is impossible to generalize this speedup to a function that takes an argument and returns its factorial.
f3() takes the cake because when the function is def'ed, Python just folds the chain of constant multiplications down to the final result, so the effective definition of f3() becomes:
def f3():
    return 265252859812191058636308480000000
which, because no computation needs to occur when the function is called, runs really fast!
One way to produce the effect of placing a * operator between a list of numbers is to use reduce from the functools module. Is this something like what you're looking for?
from functools import reduce

def fact(x):
    return reduce((lambda x, y: x * y), range(1, x+1))
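For example (my check, not part of the original answer), this reproduces the constant that the disassembly of f3 showed earlier:
print(fact(30))   # 265252859812191058636308480000000, i.e. 30!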
I would argue that none of these are good factorial functions, since none of them takes an argument. The reason the last one works well is that it minimizes the number of interpreter steps, but that's still not a good answer: all of them have the same complexity (linear in the size of the value). We can do better: O(1).
import math

def factorial(x):
    return math.gamma(x+1)
This runs in constant time regardless of the input value, at the cost of some accuracy. Still, it's way better when performance matters.
We can do a quick benchmark:
import math

def factorial_gamma(x):
    return math.gamma(x+1)

def factorial_linear(x):
    if x == 0 or x == 1:
        return 1
    return x * factorial_linear(x-1)
In [10]: factorial_linear(50)
Out[10]: 30414093201713378043612608166064768844377641568960512000000000000
In [11]: factorial_gamma(50)
Out[11]: 3.0414093201713376e+64
In [12]: %timeit factorial_gamma(50)
537 ns ± 6.84 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [13]: %timeit factorial_linear(50)
17.2 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
A 30-fold speedup for the factorial of 50. Not bad.
https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.misc.factorial.html
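One caveat worth adding (my note, not from the answer): because math.gamma returns a float, the result is approximate and overflows fairly early, unlike the exact integer versions:
import math
print(math.gamma(170 + 1))   # ~7.2574e306, still fits in a double
try:
    math.gamma(171 + 1)      # 171! is larger than the largest double
except OverflowError:
    print("overflows")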
As others have stated, f3() isn't actually computing anything; that's why you get such fast results. You can't achieve the same thing by handing the work to a function that takes an argument.
Also, if you are wondering why math.factorial() is so fast, it's because
the math module's functions are implemented in C:
This module is always available. It provides access to the mathematical functions defined by the C standard
By using an efficient algorithm in C, you get such fast results.
Here your best bet would be to use the function below, but using math.factorial is what I prefer if you're purely in need of performance:
def f3(x):
    ans = 1
    for i in range(1, x+1):
        ans = ans*i
    return ans

print(f3(30))

Slow recursion in python

I know this subject is well discussed, but I've come across a case where I don't really understand why the recursive method is "slower" than a method using reduce/lambda/xrange.
def factorial2(x, rest=1):
    if x <= 1:
        return rest
    else:
        return factorial2(x-1, rest*x)

def factorial3(x):
    if x <= 1:
        return 1
    return reduce(lambda a, b: a*b, xrange(1, x+1))
I know Python doesn't optimize tail recursion, so the question isn't about that. To my understanding, the generator should still produce n numbers using the +1 operator, so technically fact(n) should add a number n times, just like the recursive one. The lambda in the reduce will be called n times, just like the recursive method... So since we don't have tail-call optimization in either case, stack frames will be created/destroyed and returned n times, and an if in the generator has to check when to raise StopIteration.
This makes me wonder why the recursive method is still slower than the other one, since the recursive one uses simple arithmetic and doesn't use generators.
In a test, I replaced rest*x by x in the recursive method, and the time spent dropped to be on par with the reduce-based method.
Here are my timings for fact(400), 1000 times
factorial3 : 1.22370505333
factorial2 : 1.79896998405
Edit:
Making the method count from 1 up to n instead of from n down to 1 doesn't help either, so it's not overhead from the -1.
Also, can we make the recursive method faster? I tried multiple things, like global variables that I can change... using a mutable context by placing variables in cells that I can modify (like an array) and keeping the recursive method without parameters, and passing the method used for recursion as a parameter so we don't have to dereference it in our scope... but nothing makes it faster.
I'll point out that I have a version of the factorial that uses a for loop and is much faster than both of these methods, so there is clearly room for improvement, but I wouldn't expect anything faster than the for loop.
The slowness of the recursive version comes from the need to resolve the named (argument) variables on each call. I have provided a different recursive implementation that has only one argument, and it works slightly faster.
$ cat fact.py
def factorial_recursive1(x):
    if x <= 1:
        return 1
    else:
        return factorial_recursive1(x-1)*x

def factorial_recursive2(x, rest=1):
    if x <= 1:
        return rest
    else:
        return factorial_recursive2(x-1, rest*x)

def factorial_reduce(x):
    if x <= 1:
        return 1
    return reduce(lambda a, b: a*b, xrange(1, x+1))

# Ignore the rest of the code for now, we'll get back to it later in the answer
def range_prod(a, b):
    if a + 1 < b:
        c = (a+b)//2
        return range_prod(a, c) * range_prod(c, b)
    else:
        return a

def factorial_divide_and_conquer(n):
    return 1 if n <= 1 else range_prod(1, n+1)
$ ipython -i fact.py
In [1]: %timeit factorial_recursive1(400)
10000 loops, best of 3: 79.3 µs per loop
In [2]: %timeit factorial_recursive2(400)
10000 loops, best of 3: 90.9 µs per loop
In [3]: %timeit factorial_reduce(400)
10000 loops, best of 3: 61 µs per loop
Since in your example very large numbers are involved, initially I suspected that the performance difference might be due to the order of multiplication. Multiplying a large partial product by the next number on every iteration has a cost proportional to the number of digits/bits in the partial product, so the time complexity of such a method is O(n²), where n is the number of bits in the final product. Instead it is better to use a divide-and-conquer technique, where the final result is obtained as a product of two approximately equally long values, each of which is computed recursively in the same manner. So I implemented that version too (see factorial_divide_and_conquer(n) in the above code). As you can see below, it still loses to the reduce()-based version for small arguments (due to the same problem with named parameters) but outperforms it for large arguments.
In [4]: %timeit factorial_divide_and_conquer(400)
10000 loops, best of 3: 90.5 µs per loop
In [5]: %timeit factorial_divide_and_conquer(4000)
1000 loops, best of 3: 1.46 ms per loop
In [6]: %timeit factorial_reduce(4000)
100 loops, best of 3: 3.09 ms per loop
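As a quick sanity check (mine, not part of the original answer), the divide-and-conquer version agrees with math.factorial:
import math
assert all(factorial_divide_and_conquer(n) == math.factorial(n)
           for n in (0, 1, 2, 10, 400))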
UPDATE
Trying to run the factorial_recursive?() versions with x=4000 hits the default recursion limit, so the limit must be increased:
In [7]: sys.setrecursionlimit(4100)
In [8]: %timeit factorial_recursive1(4000)
100 loops, best of 3: 3.36 ms per loop
In [9]: %timeit factorial_recursive2(4000)
100 loops, best of 3: 7.02 ms per loop

Wrapping np.arrays __pow__ method

I was just revisiting some of my code to improve the performance and stumbled over something strange:
a = np.linspace(10,1000,1000000).reshape(1000,1000)
%timeit np.square(a)
100 loops, best of 3: 8.07 ms per loop
%timeit a*a
100 loops, best of 3: 8.18 ms per loop
%timeit a**2
100 loops, best of 3: 8.32 ms per loop
OK, it seems there is some overhead when using the power operator (**), but otherwise they seem identical (I guess NumPy is doing that optimization itself), but then it got strange:
In [46]: %timeit np.power(a, 2)
10 loops, best of 3: 121 ms per loop
So there is no real problem, but it seems a bit inconsistent to have a fallback for the magic __pow__ but not for the ufunc. But then I got interested, since I use third powers a lot:
%timeit a*a*a
100 loops, best of 3: 18.1 ms per loop
%timeit a**3
10 loops, best of 3: 121 ms per loop
%timeit np.power(a, 3)
10 loops, best of 3: 121 ms per loop
There seems to be no "shortcut" for the third power, and the ufunc and the magic __pow__ perform the same (at least in terms of performance).
But that's not so good, since I want a consistent way of using powers in my code, and I'm not quite sure how to wrap numpy's __pow__.
So to get to the point, my question is :
Is there a way I can wrap numpy's __pow__ method? Because I want a consistent way of writing powers in my script, not a**2 in one place and power(a, 3) in another. Simply writing a**3 and having it redirected to my power function would be preferred (but for that I would need to somehow wrap the ndarray's __pow__, right?).
Currently I am using a shortcut, but that's not that beautiful (I even have to special-case exponent == 2, since np.power performs suboptimally there):
def power(array, exponent):
    if exponent == 2:  # catch this, or it calls the slow np.power(array, exponent)
        return np.square(array)
    if exponent == 3:
        return array * array * array
    # As soon as np.cbrt is available, catch the exponent 4/3 here too
    return np.power(array, exponent)
%timeit power(a, 3)
100 loops, best of 3: 17.8 ms per loop
%timeit a**3
10 loops, best of 3: 121 ms per loop
I am using NumPy v1.9.3 and I do not want to subclass np.ndarray just for wrapping the __pow__ method. :-)
EDIT: I rewrote the part where I get to my question. To clarify it: I am not asking about why NumPy does it the way it does - that is just to explain why I ask the question.
This is a good catch. I too wonder why that is the behavior. But to be short and concise in answering the question, I would just do:
def mypower(array, exponent):
    return reduce(lambda x, y: x*y, [array for _ in range(exponent)])
%timeit mypower(a,2)
100 loops, best of 3: 3.68 ms per loop
%timeit mypower(a,3)
100 loops, best of 3: 8.09 ms per loop
%timeit mypower(a,4)
100 loops, best of 3: 12.6 ms per loop
Obviously the overhead increases with the exponent, but for low exponents it still beats a**n and np.power by roughly 10x.
Note that this is different from the original NumPy implementation, which is not specific to a numeric exponent and supports an array of exponents as the second argument (check it out here).
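For instance (a small illustration of mine, not from the linked docs), np.power broadcasts the exponent, which the reduce-based mypower above cannot do:
import numpy as np

base = np.array([2.0, 3.0, 4.0])
exps = np.array([1, 2, 3])
print(np.power(base, exps))   # [ 2.  9. 64.]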
Overloading the operator
The way to do what you want is to subclass ndarray and use views. See the following example:
import numexpr
import numpy as np

class MyArray(np.ndarray):
    def __pow__(self, other):
        return reduce(lambda x, y: x*y, [self for _ in range(other)])

class NumExprArray(np.ndarray):
    def __pow__(self, other):
        return numexpr.evaluate("self**%f" % other)
        # This implies extra overhead, is as much as 4x slower:
        # return numexpr.evaluate("self**other")

a = np.linspace(10,1000,1000000).reshape(1000,1000).view(MyArray)
na = np.linspace(10,1000,1000000).reshape(1000,1000).view(NumExprArray)
%timeit a**2
1000 loops, best of 3: 1.2 ms per loop
%timeit na**2
1000 loops, best of 3: 1.14 ms per loop
%timeit a**3
100 loops, best of 3: 4.69 ms per loop
%timeit na**3
100 loops, best of 3: 2.36 ms per loop
%timeit a**4
100 loops, best of 3: 6.59 ms per loop
%timeit na**4
100 loops, best of 3: 2.4 ms per loop
For more information on this method, please follow this link. Another way would be to use a custom infix operator, but for readability purposes it is not so good. As one can see, numexpr should be the way to go.
If I read the source correctly, when NumPy performs power, it checks whether the numerical value of the exponent is one of the special cases (-0.5, 0, 0.5, 1, and 2). If so, the operation is done using special routines. All other numerical values of the exponent are considered "general" and are fed into the generic power function, which may be slow (especially if the exponent is promoted to floating-point type, but I'm not sure whether that is the case with a ** 3).
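A quick way to probe those special cases yourself is a sketch like the following (mine; which exponents are special-cased depends on the NumPy version, so treat the list above as approximate):
import timeit

setup = "import numpy as np; a = np.linspace(10, 1000, 1000000).reshape(1000, 1000)"
for expr in ("a**2", "a**2.0", "a**3", "np.power(a, 2)", "np.power(a, 3)"):
    print(expr, min(timeit.repeat(expr, setup=setup, number=10)))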

Python: Vectorizing evaluations of arrays of lambda functions

How would you vectorize the evaluation of arrays of lambda functions?
Here's an example to understand what I'm talking about. (And even though I'm using numpy arrays, I'm not limiting myself to only using numpy.)
Let's say I have the following numpy arrays.
array1 = np.array(["hello", 9])
array2 = np.array([lambda s: s == "hello", lambda num: num < 10])
(You could store these kinds of objects in numpy without throwing an error, believe it or not.) What I want is something akin to the following.
array2 * array1
# Return np.array([True, True]). PS: An explanation of how to `AND` all of
# booleans together quickly would be nice too.
Of course, this seems impractical for arrays of size 2, but for arrays of arbitrary sizes, I'll assume this would yield a performance boost because of all of the low level optimizations.
So, anyone know how to write this weird kind of python code?
The simple answer, of course, is that you can't easily do this with numpy (or with standard Python, for that matter). Numpy doesn't actually vectorize most operations itself, to my knowledge: it uses libraries like BLAS/ATLAS/etc that do for certain situations. Even if it did, it would do so in C for specific situations: it certainly can't vectorize Python function execution.
If you want to involve multiprocessing in this, it is possible, but it depends on your situation. Are your individual function applications time-consuming, making them feasible to send out one-by-one, or do you need a very large number of fast function executions, in which case you'd probably want to send batches of them to each process?
In general, because of what could be argued as poor fundamental design (eg, the Global Interpreter Lock), it's very difficult with standard Python to have lightweight parallelization as you're hoping for here. There are significantly heavier methods, like the multiprocessing module or Ipython.parallel, but these require some work to use.
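If you do go the multiprocessing route, a minimal sketch of the "send batches to each process" idea could look like this (my sketch; note that lambdas aren't picklable, so the callables would have to be module-level functions, or you'd need something like dill):
from multiprocessing import Pool

def apply_pair(pair):
    func, val = pair
    return func(val)

def evaluate_parallel(funcs, vals, processes=4, chunksize=256):
    # chunksize controls how many (func, val) pairs each worker receives per batch
    pool = Pool(processes)
    try:
        return pool.map(apply_pair, zip(funcs, vals), chunksize=chunksize)
    finally:
        pool.close()
        pool.join()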
Alright guys, I have an answer: numpy's vectorize.
Please read the edited section below, though. You'll discover that Python actually optimizes this code for you, which defeats the purpose of using numpy arrays in this case. (Using numpy arrays does not decrease the performance, though.)
What the last test really shows is that Python lists are about as efficient as they can be, so this vectorization procedure is unnecessary. This is why I didn't mark this as the accepted answer.
Setup code:
def factory(i): return lambda num: num==i
array1 = list()
for i in range(10000): array1.append(factory(i))
array1 = np.array(array1)
array2 = np.array(xrange(10000))
The "unvectorized" version:
def evaluate(array1, array2):
    return [func(val) for func, val in zip(array1, array2)]
%timeit evaluate(array1, array2)
# 100 loops, best of 3: 10 ms per loop
The vectorized version
def evaluate2(func, b): return func(b)
vec_evaluate = np.vectorize(evaluate2)
vec_evaluate(array1, array2)
# 100 loops, best of 3: 2.65 ms per loop
EDIT
Okay, I just wanted to paste more benchmarks that I received using the above tests, except with different test cases.
I made a third edit showing what happens if you simply use Python lists. Long story short: you won't regret it much. That test case is at the very bottom.
Test cases only involving integers
In summary, if n is small, then the unvectorized version is better. Otherwise, vectorized is the way to go.
With n = 30
%timeit evaluate(array1, array2)
# 10000 loops, best of 3: 35.7 µs per loop
%timeit vec_evaluate(array1, array2)
# 10000 loops, best of 3: 27.6 µs per loop
With n = 7
%timeit evaluate(array1, array2)
100000 loops, best of 3: 9.93 µs per loop
%timeit vec_evaluate(array1, array2)
10000 loops, best of 3: 21.6 µs per loop
Test cases involving strings
Vectorization wins.
Setup code:
def factory(i): return lambda num: str(num)==str(i)

array1 = list()
for i in range(7):
    array1.append(factory(i))
array1 = np.array(array1)
array2 = np.array(xrange(7))
With n = 10000
%timeit evaluate(array1, array2)
10 loops, best of 3: 36.7 ms per loop
%timeit vec_evaluate(array1, array2)
100 loops, best of 3: 6.57 ms per loop
With n = 7
%timeit evaluate(array1, array2)
10000 loops, best of 3: 28.3 µs per loop
%timeit vec_evaluate(array1, array2)
10000 loops, best of 3: 27.5 µs per loop
Random tests
Just to see how branch prediction played a role. From what I'm seeing, it didn't really change much. Vectorization still usually wins.
Setup code.
from random import random

def factory(i):
    if random() < 0.5:
        return lambda num: str(num) == str(i)
    return lambda num: num == i
When n = 10000
%timeit evaluate(array1, array2)
10 loops, best of 3: 25.7 ms per loop
%timeit vec_evaluate(array1, array2)
100 loops, best of 3: 4.67 ms per loop
When n = 7
%timeit evaluate(array1, array2)
10000 loops, best of 3: 23.1 µs per loop
%timeit vec_evaluate(array1, array2)
10000 loops, best of 3: 23.1 µs per loop
Using python lists instead of numpy arrays
I ran this test to see what happened when I chose not to use the "optimized" numpy arrays, and I received some very surprising results.
The setup code is almost the same, except I'm choosing not to use numpy arrays. I'm also doing this test for only the "random" case.
def factory(i):
    if random() < 0.5:
        return lambda num: str(num) == str(i)
    return lambda num: num == i

array1 = list()
for i in range(10000): array1.append(factory(i))
array2 = range(10000)
And the "unvectorized" version:
%timeit evaluate(array1, array2)
100 loops, best of 3: 4.93 ms per loop
You can see this is actually pretty surprising, because it is almost the same benchmark I was getting with my random test case involving the vectorized evaluate.
%timeit vec_evaluate(array1, array2)
10 loops, best of 3: 19.8 ms per loop
Likewise, if you change these into numpy arrays before using vec_evaluate, you get the same 4.5 ms benchmark.

Check for multidimensional list in Python

I have some data which is either 1 or 2 dimensional. I want to iterate through every pattern in the data set and perform foo() on it. If the data is 1D then add this value to a list, if it's 2D then take the mean of the inner list and append this value.
I saw this question and decided to implement it by checking whether each row is an instance of a list. I can't use numpy for this application.
outputs = []
for row in data:
    if isinstance(row, list):
        vals = [foo(window) for window in row]
        outputs.append(sum(vals)/float(len(vals)))
    else:
        outputs.append(foo(row))
Is there a neater way of doing this? On each run, every pattern will have the same dimensionality, so I could make a separate class for 1D/2D but that will add a lot of classes to my code. The datasets can get quite large so a quick solution is preferable.
Your code is already almost as neat and fast as it can be. The only slight improvement is replacing [foo(window) for window in row] with map(foo, row), which can be seen by the benchmarks:
> python -m timeit "foo = lambda x: x+1; list(map(foo, range(1000)))"
10000 loops, best of 3: 132 usec per loop
> python -m timeit "foo = lambda x: x+1; [foo(a) for a in range(1000)]"
10000 loops, best of 3: 140 usec per loop
isinstance() already seems faster than its counterparts hasattr() and type() ==:
> python -m timeit "[isinstance(i, int) for i in range(1000)]"
10000 loops, best of 3: 117 usec per loop
> python -m timeit "[hasattr(i, '__iter__') for i in range(1000)]"
1000 loops, best of 3: 470 usec per loop
> python -m timeit "[type(i) == int for i in range(1000)]"
10000 loops, best of 3: 130 usec per loop
However, if you count short as neat, you can also simplify your code (after the map replacement above) to:
mean = lambda x: sum(x)/float(len(x)) #or `from statistics import mean` in python3.4
output = [foo(r) if isinstance(r, int) else mean(map(foo, r)) for r in data]
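One small portability note (mine): under Python 3, map returns an iterator, so the lambda-based mean above would need a list() around the map call (or use statistics.mean, which accepts any iterable). For example:
foo = lambda x: x + 1   # hypothetical foo, just for illustration
data = [1, [2, 3, 4], 5]
mean = lambda x: sum(x) / float(len(x))
output = [foo(r) if isinstance(r, int) else mean(list(map(foo, r))) for r in data]
print(output)   # [2, 4.0, 6]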
