Speeding up user-defined functions in Python

I have a simulation in which the end user can provide arbitrarily many functions, which then get called in the innermost loop. Something like:
class Simulation:
    def __init__(self):
        self.rates = []
        self.amount = 1

    def add(self, rate):
        self.rates.append(rate)

    def run(self, maxtime):
        for t in range(0, maxtime):
            for rate in self.rates:
                self.amount *= rate(t)

def rate(t):
    return t**2

simulation = Simulation()
simulation.add(rate)
simulation.run(100000)
Being a Python loop, this is very slow, but I can't get my usual approaches to speeding up loops to work here.
Because the functions are user-defined, I can't "numpyfy" the innermost call (rewriting it so that the innermost work is done by optimized numpy code).
I first tried numba, but numba doesn't allow passing functions into other functions, even if those functions are also numba-compiled. It can use closures, but because I don't know up front how many functions there will be, I don't think I can use them. Closing over a list of functions fails:
import numba

@numba.jit(nopython=True)
def a():
    return 1

@numba.jit(nopython=True)
def b():
    return 2

fs = [a, b]

@numba.jit(nopython=True)
def c():
    total = 0
    for f in fs:
        total += f()
    return total

c()
This fails with an error:
[...]
File "/home/syrn/.local/lib/python3.6/site-packages/numba/types/containers.py", line 348, in is_precise
return self.dtype.is_precise()
numba.errors.InternalError: 'NoneType' object has no attribute 'is_precise'
[1] During: typing of intrinsic-call at <stdin> (4)
I can't find the source, but I think the numba documentation states somewhere that this is not a bug, just not expected to work.
Something like the following would probably work around calling functions from a list, but it seems like a bad idea:
def run(self, maxtime):
    len_rates = len(self.rates)
    f1 = self.rates[0]
    if len_rates >= 2:
        f2 = self.rates[1]
    if len_rates >= 3:
        f3 = self.rates[2]
    # [... repeat up to some arbitrary limit]

    @numba.jit(nopython=True)
    def inner(amount):
        for t in range(0, maxtime):
            amount *= f1(t)
            if len_rates >= 2:
                amount *= f2(t)
            if len_rates >= 3:
                amount *= f3(t)
            # [... repeat up to the same arbitrary limit]
        return amount

    self.amount = inner(self.amount)
I guess it would also be possible to do some bytecode hacking: compile the functions with numba, pass a list of strings containing the function names into inner, write something like call(func_name), and then rewrite the bytecode so that it becomes func_name(t).
With Cython, just compiling the loop and the multiplications will probably give some speedup, but if the user-defined functions are still plain Python, calling them will probably still be slow (although I haven't profiled that yet). I didn't find much information on "dynamically compiling" functions with Cython, but I guess I would need to somehow add type information to the user-provided functions, which seems... hard.
Is there any good way to speed up loops over user-defined functions without parsing them and generating code from them?

I don't think you can speed up the user's functions; in the end it is the user's responsibility to write efficient code. What you can do is make it possible to interact with your program efficiently, without paying the overhead.
You can use Cython, and if the user is also game for using Cython, together you can get speedups of around 100x compared to the pure-Python solution.
As a baseline, I changed your example a little bit: the function rate does more work.
class Simulation:
    def __init__(self, rates):
        self.rates = list(rates)
        self.amount = 1

    def run(self, maxtime):
        for t in range(0, maxtime):
            for rate in self.rates:
                self.amount += rate(t)

def rate(t):
    return t*t*t + 2*t
Yields:
>>> simulation=Simulation([rate])
>>> %timeit simulation.run(10**5)
43.3 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
We can use Cython to speed things up; first, your run function:
%%cython
cdef class Simulation:
    cdef int amount
    cdef list rates
    def __init__(self, rates):
        self.rates = list(rates)
        self.amount = 1
    def run(self, int maxtime):
        cdef int t
        for t in range(maxtime):
            for rate in self.rates:
                self.amount *= rate(t)
This gives us almost a factor of 2:
>>> %timeit simulation.run(10**5)
23.2 ms ± 158 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The user could also use Cython to speed up their calculation:
%%cython
def rate(int t):
    return t*t*t + 2*t
>>> %timeit simulation.run(10**5)
7.08 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using Cython already gave us a speed-up of about 6x. What is the bottleneck now? We are still using Python for polymorphism/dispatch, and this is pretty costly, because Python objects (i.e. Python integers here) must be created for every call. Can we do better with Cython? Yes, if we define an interface for the functions we pass to run at compile time:
%%cython
cdef class FunInterface:
    cpdef int calc(self, int t):
        pass

cdef class Simulation:
    cdef int amount
    cdef list rates
    def __init__(self, rates):
        self.rates = list(rates)
        self.amount = 1
    def run(self, int maxtime):
        cdef int t
        cdef FunInterface f
        for t in range(maxtime):
            for f in self.rates:
                self.amount *= f.calc(t)

cdef class Rate(FunInterface):
    cpdef int calc(self, int t):
        return t*t*t + 2*t
This yields an additional speed-up of about 7x:
>>> simulation = Simulation([Rate()])
>>> %timeit simulation.run(10**5)
1.03 ms ± 20.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The most important part of the code above is this line:
self.amount *= f.calc(t)
which no longer needs Python for dispatch, but uses machinery quite similar to virtual functions in C++. This C++-style approach has only the very small overhead of one indirection/look-up. It also means that neither the result of the function nor its arguments must be converted to Python objects. For this to work, calc must be a cpdef function; take a look at the Cython documentation for the gory details of how inheritance works for cpdef functions.
The bottleneck now is the line for f in self.rates, because we still have to do a lot of Python interaction in every step. Here is an example of what would be possible if we could improve on this:
%%cython
.....
cdef class Simulation:
    cdef int amount
    cdef FunInterface f   # just one function, no list
    def __init__(self, fun):
        self.f = fun
        self.amount = 1
    def run(self, int maxtime):
        cdef int t
        for t in range(maxtime):
            self.amount *= self.f.calc(t)
...
>>> simulation=Simulation(Rate())
>>> %timeit simulation.run(10**5)
408 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Another factor of 2, but you can decide whether the more complicated code that would be needed to store a list of FunInterface objects without Python interaction is really worth it.

Related

Change Python list (mutable object in general) values with something like map, but without returning a new object

Python's map function works on both mutables and immutables; it returns a new iterable object (for Python 3.x).
However, I would like to change only the values of a mutable object (e.g. a list or np.ndarray), in a nice and consistent manner similar to map().
Given the following function:
import numpy as np
from functools import partial

def perturbation_bin(x, prob):
    '''
    Perturbation operation for binary representation. Decide independently for each bit if it will be inverted or not.
    :param x: binary chromosome
    :param prob: bit inversion probability
    :return: None
    '''
    assert x.__class__ == np.ndarray, '\nBinary chromosome must be np.ndarray.'
    assert x.dtype == np.int8, '\nBinary chromosome np.dtype must be np.int8.'
    x[np.where(np.random.rand(len(x)) <= prob)] += 1
    map(partial(np.mod, x2=2), x)
This code randomly adds +1 (or not) to each 'bit' (np.int8), changing the numpy ndarray object passed to the function in place.
However, in the next step I would like to apply modulo 2 to all elements of this np.ndarray.
As the code is written now, it only creates a new map object inside the function.
Does such a map-like function exist in Python, one that does not return a new object but alters the values of the mutable object instead?
Thank you
No, it does not. But you could easily make one, as long as the object you are willing to modify is a Sequence (specifically, you would need __len__, __getitem__ and __setitem__ to be defined):
def apply(obj, func):
    n = len(obj)
    for i in range(n):
        obj[i] = func(obj[i])
    return obj  # or None
This is not going to be the fastest execution around, but it may be accelerated with Numba (provided that you work with suitable objects, e.g. Numpy arrays).
As for your specific problem you can replace:
map(partial(np.mod, x2=2), x)
with:
apply(x, lambda x: x % 2)
or use a much more performant Numba-based version:
import numba as nb

@nb.njit
def apply_nb(obj, func):
    n = len(obj)
    for i in range(n):
        obj[i] = func(obj[i])
    return obj

@nb.njit
def mod2_nb(x):
    return x % 2

apply_nb(x, mod2_nb)
Note that many NumPy functions, like np.mod(), also support an out parameter:
np.mod(x, 2, out=x)
which will perform the operation in-place.
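Applied to the function from the question, the whole perturbation can then run in place without building a map object at all. A minimal sketch (my rewrite, assuming x is the np.int8 array described above):
import numpy as np

def perturbation_bin(x, prob):
    # Invert each bit independently with probability prob (boolean mask instead of np.where).
    x[np.random.rand(len(x)) <= prob] += 1
    # Reduce modulo 2 in place; no new array and no map object is created.
    np.mod(x, 2, out=x)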
Timewise, the Numba-based operation seems to have an edge:
import numpy as np
x = np.random.randint(0, 1000, 1000)
%timeit X = x.copy(); np.mod(X, 2, out=X)
# 9.47 µs ± 22.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit X = x.copy(); apply_nb(X, mod2_nb)
# 7.51 µs ± 173 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit X = x.copy(); apply(X, lambda x: x % 2)
# 350 µs ± 4.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Why is numpy.where much faster than the alternatives?

I'm trying to speed up the following code:
import time
import numpy as np

np.random.seed(10)
b = np.random.rand(10000, 1000)

def f(a=1):
    tott = 0
    for _ in range(a):
        q = np.array(b)
        t1 = time.time()
        for i in range(len(q)):
            for j in range(len(q[0])):
                if q[i][j] > 0.5:
                    q[i][j] = 1
                else:
                    q[i][j] = -1
        t2 = time.time()
        tott += t2 - t1
    print(tott/a)
As you can see, the function mainly iterates over the array in a double loop. I've tried using np.nditer, np.vectorize and map instead. They give some speedup (around 4-5x, except for np.nditer), but with np.where(q>0.5,1,-1) the speedup is almost 100x.
How can I iterate over numpy arrays as fast as np.where does? And why is it so much faster?
It's because the core of numpy is implemented in C. You're basically comparing the speed of C with the speed of Python.
If you want to use the speed advantage of numpy, you should make as few calls as possible in your Python code. If you use a Python loop, you have already lost, even if you only use numpy functions inside that loop. Use the higher-level functions provided by numpy (that's why it ships so many special functions); internally, they use much more efficient C loops.
You can implement a function in C (with loops) yourself and call that from Python. That should give comparable speeds.
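For this particular threshold operation, that means the whole double loop collapses into the single np.where call the question already mentions, for example:
import numpy as np

np.random.seed(10)
b = np.random.rand(10000, 1000)
q = np.where(b > 0.5, 1.0, -1.0)  # one C-level pass instead of a Python double loop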
To answer this question, you can gain the same speed (100x acceleration) by using the numba library:
from numba import njit
import numpy as np

def f(b):
    q = np.zeros_like(b)
    for i in range(b.shape[0]):
        for j in range(b.shape[1]):
            if b[i][j] > 0.5:   # threshold on the input array
                q[i][j] = 1
            else:
                q[i][j] = -1
    return q

@njit
def f_jit(b):
    q = np.zeros_like(b)
    for i in range(b.shape[0]):
        for j in range(b.shape[1]):
            if b[i][j] > 0.5:   # threshold on the input array
                q[i][j] = 1
            else:
                q[i][j] = -1
    return q
Compare the speed:
Plain Python
%timeit f(b)
592 ms ± 5.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Numba (just-in-time compiled using LLVM ~ C speed)
%timeit f_jit(b)
5.97 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
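One practical note (my addition, not from the original answer): the first call to f_jit includes the LLVM compilation time, so warm it up before timing; otherwise the measurement mixes compilation and execution:
f_jit(b)          # first call triggers JIT compilation
%timeit f_jit(b)  # subsequent calls measure only execution time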

Speeding up itertools combinations with Cython

I have the following code to generate all possible combinations in a specified range using itertools, but I can't get any speed improvement from using it with Cython. The original code is this:
from itertools import *

def x(e, f, g):
    a = []
    for c in combinations(range(e, f), g):
        d = list(c)
        a.append(d)
and after declaring types for cython:
from itertools import *

cpdef x(int e, int f, int g):
    cpdef tuple c
    cpdef list a
    cpdef list d
    a = []
    for c in combinations(range(e, f), g):
        d = list(c)
        a.append(d)
I saved the latter as test_cy.pyx and compiled using cythonize -a -i test_cy.pyx
After compiling, I created a new script with the following code and ran it:
import test_cy
test_cy.x(1,45,6)
I didn't get any significant speed improvement; it still took about the same time as the original script, about 10.8 sec.
Is there anything I did wrong, or is itertools already so optimised that there can't be any bigger improvement in speed?
As already pointed out in the comments, you should not expect Cython to speed up your code, because most of the time is spent inside itertools and in the creation of the lists.
Because I'm curious to see how itertools' generic implementation fares against old-school tricks, let's take a look at this Cython implementation of "all subsets of size k out of n":
%%cython
ctypedef unsigned long long ull

cdef ull next_subset(ull subset):
    cdef ull smallest, ripple, ones
    smallest = subset & -subset
    ripple = subset + smallest
    ones = subset ^ ripple
    ones = (ones >> 2) // smallest
    subset = ripple | ones
    return subset

cdef subset2list(ull subset, int offset, int cnt):
    cdef list lst = [0]*cnt  # pre-allocate
    cdef int current = 0
    cdef int index = 0
    while subset > 0:
        if (subset & 1) != 0:
            lst[index] = offset + current
            index += 1
        subset >>= 1
        current += 1
    return lst

def all_k_subsets(int start, int end, int k):
    cdef int n = end - start
    cdef ull MAX = 1L << n
    cdef ull subset = (1L << k) - 1L
    lst = []
    while MAX > subset:
        lst.append(subset2list(subset, start, k))
        subset = next_subset(subset)
    return lst
This implementation uses some well-known bit tricks and has the limitation that it only works for at most 64 elements.
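As a quick sanity check (my addition, assuming the %%cython cell above has been compiled), both approaches enumerate exactly the same subsets:
from itertools import combinations

# small case: all 3-element subsets of {1, ..., 7}
assert sorted(map(tuple, all_k_subsets(1, 8, 3))) == sorted(combinations(range(1, 8), 3))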
If we compare both approaches:
>>> %timeit x(1,45,6)
2.52 s ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit all_k_subsets(1,45,6)
1.29 s ± 5.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A speed-up factor of 2 is quite disappointing.
However, the bottleneck is the creation of the lists and not the calculation itself; it is easy to check that without the list creation the calculation would take about 0.1 seconds.
My takeaway: if you are serious about speed, you should not create so many lists but process the subsets on the fly (best in Cython); a speed-up of more than 10x is possible. If you must have all the subsets as lists, you cannot expect a huge speed-up.
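For illustration only (the code and names below are mine, not from the answer), "processing the subsets on the fly" can look like this even in plain Python, using the same Gosper bit trick but consuming each bitmask directly instead of materializing lists:
def k_subset_masks(n, k):
    # Yield every bitmask with exactly k of the low n bits set (Gosper's hack).
    subset = (1 << k) - 1
    limit = 1 << n
    while subset < limit:
        yield subset
        smallest = subset & -subset
        ripple = subset + smallest
        subset = (((subset ^ ripple) >> 2) // smallest) | ripple

# Example: count all 4-element subsets of a 10-element set without building any lists.
assert sum(1 for _ in k_subset_masks(10, 4)) == 210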

Python -- Fast Factorial function

I am trying to write code for a super-fast factorial function. I have experimented a little and have come up with the following three candidates (apart from math.factorial):
def f1():
    return reduce(lambda x, y: x * y, xrange(1, 31))

def f2():
    result = 1
    result *= 2
    result *= 3
    result *= 4
    result *= 5
    result *= 6
    result *= 7
    result *= 8
    # and so on...
    result *= 28
    result *= 29
    result *= 30
    return result

def f3():
    return 1*2*3*4*5*6*7*8*9*10*11*12*13*14*15*16*17*18*19*20*21*22*23*24*25*26*27*28*29*30
I have timed these functions. These are the results:
In [109]: timeit f1()
100000 loops, best of 3: 11.9 µs per loop
In [110]: timeit f2()
100000 loops, best of 3: 5.05 µs per loop
In [111]: timeit f3()
10000000 loops, best of 3: 143 ns per loop
In [112]: timeit math.factorial(30)
1000000 loops, best of 3: 2.11 µs per loop
Clearly, f3() takes the cake. I have tried implementing this. To be verbose, I have tried writing code that generates a string like this:
"1*2*3*4*5*6*7*8*9*10*11*12*13*14........" and then using eval to evaluate this string. (Acknowledging that 'eval' is evil). However, this method gave me no gains in time, AT ALL. In fact, it took me nearly 150 microseconds to finish.
Please advise on how to generalize f3().
f3 is only fast because it isn't actually computing anything when you call it. The whole computation gets optimized out at compile time and replaced with the final value, so all you're timing is function call overhead.
This is particularly obvious if we disassemble the function with the dis module:
>>> import dis
>>> dis.dis(f3)
2 0 LOAD_CONST 59 (265252859812191058636308480000000L)
3 RETURN_VALUE
It is impossible to generalize this speedup to a function that takes an argument and returns its factorial.
f3() takes the cake because when the function is compiled, Python folds the chain of multiplications down to the final constant, and the effective definition of f3() becomes:
def f3():
    return 265252859812191058636308480000000
which, because no computation needs to occur when the function is called, runs really fast!
One way to produce the effect of placing a * operator between a list of numbers is to use reduce from the functools module. Is this something like what you're looking for?
from functools import reduce

def fact(x):
    return reduce((lambda x, y: x * y), range(1, x+1))
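A quick sanity check (my addition), assuming fact from above has been defined:
>>> import math
>>> fact(30) == math.factorial(30)
True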
I would argue that none of these are good factorial functions, since none of them takes a parameter. The reason the last one works well is that it minimizes the number of interpreter steps, but that's still not a good answer: all of them have the same complexity (linear in the value). We can do better: O(1).
import math

def factorial(x):
    return math.gamma(x+1)
This scales constantly with the input value, at the sacrifice of some accuracy. Still, way better when performance matters.
We can do a quick benchmark:
import math

def factorial_gamma(x):
    return math.gamma(x+1)

def factorial_linear(x):
    if x == 0 or x == 1:
        return 1
    return x * factorial_linear(x-1)
In [10]: factorial_linear(50)
Out[10]: 30414093201713378043612608166064768844377641568960512000000000000
In [11]: factorial_gamma(50)
Out[11]: 3.0414093201713376e+64
In [12]: %timeit factorial_gamma(50)
537 ns ± 6.84 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [13]: %timeit factorial_linear(50)
17.2 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
A 30-fold speedup for a factorial of 50. Not bad.
https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.misc.factorial.html
As others have stated, f3() isn't actually computing anything; that's why you get such fast results. You can't achieve the same thing with a function that takes an argument.
Also, you may be wondering why math.factorial() is so fast. It's because the math module's functions are implemented in C:
"This module is always available. It provides access to the mathematical functions defined by the C standard."
By using an efficient algorithm implemented in C, you get such fast results.
Here your best bet would be to use the function below, but math.factorial is what I prefer if you're purely after performance:
def f3(x):
    ans = 1
    for i in range(1, x+1):
        ans = ans*i
    return ans

print(f3(30))

Why is Cython Decorator version slower than Cython Pyx version?

I am trying all sorts of ways to write the factorial function in Cython. First I tried the .pyx file version in an IPython Notebook.
%%file pyxfact.pyx
cdef long pyxfact(long n):
    if n <= 0:
        return 1
    else:
        return n * pyxfact(n-1)

def fact(long n):
    return pyxfact(n)
Then I tried the same thing, at least I think it's the same, with Cython's decorators, like this:
%%file cydecofact.py
import cython

@cython.cfunc  # equivalent to cdef, while @cython.ccall is equivalent to cpdef
@cython.returns(cython.long)
@cython.locals(n=cython.long)
def deco_fact(n):
    if n <= 0:
        return 1
    else:
        return n * deco_fact(n-1)

@cython.locals(n=cython.long)
def fact(n):
    return deco_fact(n)
To my surprise, the two versions have a huge run time difference:
%timeit -n 10000 pyxfact.fact(10)
%timeit -n 10000 cydecofact.fact(10)
10000 loops, best of 3: 219 ns per loop
10000 loops, best of 3: 2 µs per loop
You need @cython.compile to actually compile the code. However, it looks like neither cython.cfunc nor recursion is supported for @cython.compile.
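For what it's worth (my addition, a hint rather than a tested recipe): run as-is, cydecofact.py is ordinary Python and the @cython decorators are effectively no-ops, which matches the timing gap above. To actually compile the pure-Python-mode file, you would build it the same way the question built the .pyx, for example cythonize -a -i cydecofact.py, and then import the resulting extension module instead of the plain .py file.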
