I have the following code to generate all possible combinations in a specified range using itertools, but I can't get any speed improvement from using the code with Cython. The original code is this:
from itertools import *

def x(e, f, g):
    a = []
    for c in combinations(range(e, f), g):
        d = list(c)
        a.append(d)
and after declaring types for Cython:
from itertools import *

cpdef x(int e, int f, int g):
    cdef tuple c
    cdef list a
    cdef list d
    a = []
    for c in combinations(range(e, f), g):
        d = list(c)
        a.append(d)
I saved the latter as test_cy.pyx and compiled using cythonize -a -i test_cy.pyx
After compiling, I created a new script with the following code and ran it:
import test_cy
test_cy.x(1,45,6)
I didn't get any significant speed improvement; it still took about the same time as the original script, about 10.8 s.
Is there anything I did wrong, or is itertools already so optimised that there can't be any bigger improvement in its speed?
As already pointed out in the comments, you should not expect Cython to speed up your code, because most of the time is spent inside itertools and in the creation of the lists.
Because I'm curious to see how itertools' generic implementation fares against old-school tricks, let's take a look at this Cython implementation of "all subsets of size k out of n":
%%cython

ctypedef unsigned long long ull

cdef ull next_subset(ull subset):
    cdef ull smallest, ripple, ones
    smallest = subset & -subset
    ripple = subset + smallest
    ones = subset ^ ripple
    ones = (ones >> 2) // smallest
    subset = ripple | ones
    return subset

cdef subset2list(ull subset, int offset, int cnt):
    cdef list lst = [0] * cnt  # pre-allocate
    cdef int current = 0
    cdef int index = 0
    while subset > 0:
        if (subset & 1) != 0:
            lst[index] = offset + current
            index += 1
        subset >>= 1
        current += 1
    return lst

def all_k_subsets(int start, int end, int k):
    cdef int n = end - start
    cdef ull MAX = (<ull>1) << n
    cdef ull subset = ((<ull>1) << k) - 1
    lst = []
    while MAX > subset:
        lst.append(subset2list(subset, start, k))
        subset = next_subset(subset)
    return lst
This implementation uses some well-known bit tricks and has the limitation that it only works for at most 64 elements.
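To see what next_subset (Gosper's hack) actually does, here is a quick pure-Python trace of the same update rule:

def next_subset_py(subset):
    smallest = subset & -subset      # lowest set bit
    ripple = subset + smallest       # carry propagated through the low block of ones
    ones = subset ^ ripple           # bits that changed
    ones = (ones >> 2) // smallest   # leftover ones, right-adjusted
    return ripple | ones

s = 0b0111                           # the subset {0, 1, 2}
for _ in range(4):
    print(bin(s))
    s = next_subset_py(s)
# prints 0b111, 0b1011, 0b1101, 0b1110 - the 3-element subsets in increasing order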
If we compare both approaches:
>>> %timeit x(1,45,6)
2.52 s ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit all_k_subsets(1,45,6)
1.29 s ± 5.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A speed-up of only a factor of 2 is quite disappointing.
However, the bottleneck is the creation of the lists and not the calculation itself; it is easy to check that without the list creation the calculation would take only about 0.1 seconds.
My takeaway: if you are serious about speed, you should not create so many lists but process the subsets on the fly (best in Cython), along the lines of the sketch below; a speed-up of more than 10x is possible. If you must have all subsets as lists, you cannot expect a huge speed-up.
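For illustration, a rough sketch of "processing on the fly" (assuming it is placed in the same %%cython cell as next_subset above); it just sums the elements of every subset instead of materializing a list per subset:

def sum_all_k_subsets(int start, int end, int k):
    cdef int n = end - start
    cdef ull MAX = (<ull>1) << n
    cdef ull subset = ((<ull>1) << k) - 1
    cdef ull total = 0
    cdef ull s
    cdef int pos
    while MAX > subset:
        s = subset
        pos = 0
        while s > 0:                 # consume the subset directly, no list is allocated
            if s & 1:
                total += start + pos
            s >>= 1
            pos += 1
        subset = next_subset(subset)
    return total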
Related
I'm trying to speed up the following code:
import time
import numpy as np

np.random.seed(10)
b = np.random.rand(10000, 1000)

def f(a=1):
    tott = 0
    for _ in range(a):
        q = np.array(b)
        t1 = time.time()
        for i in range(len(q)):
            for j in range(len(q[0])):
                if q[i][j] > 0.5:
                    q[i][j] = 1
                else:
                    q[i][j] = -1
        t2 = time.time()
        tott += t2 - t1
    print(tott / a)
As you can see, the function mainly iterates in a double loop. I've tried using np.nditer, np.vectorize and map instead; they give some speedup (about 4-5x, except np.nditer), but with np.where(q > 0.5, 1, -1) the speedup is almost 100x.
How can I iterate over numpy arrays as fast as np.where does? And why is it so much faster?
It's because the core of numpy is implemented in C. You're basically comparing the speed of C with Python.
If you want to use the speed advantage of numpy, you should make as few calls as possible in your Python code. If you use a Python loop, you have already lost, even if you only use numpy functions in that loop. Use the higher-level functions provided by numpy (that's why they ship so many special functions). Internally, numpy will use a much more efficient (C-)loop.
You can implement a function in C (with loops) yourself and call that from Python. That should give comparable speeds.
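For example, a minimal Cython sketch of such a compiled loop; it thresholds in place rather than allocating a new array like np.where does, but the inner loop is comparable:

%%cython
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def threshold(double[:, :] q):
    # in-place equivalent of q = np.where(q > 0.5, 1, -1), written as an explicit C loop
    cdef Py_ssize_t i, j
    for i in range(q.shape[0]):
        for j in range(q.shape[1]):
            q[i, j] = 1.0 if q[i, j] > 0.5 else -1.0

Calling threshold(q) on the array from the question then does the same work as the double Python loop, but at C speed.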
To answer the question: you can gain the same speed (about 100x acceleration) by using the numba library:
import numpy as np
from numba import njit

def f(b):
    q = np.zeros_like(b)
    for i in range(b.shape[0]):
        for j in range(b.shape[1]):
            if b[i, j] > 0.5:
                q[i, j] = 1
            else:
                q[i, j] = -1
    return q

@njit
def f_jit(b):
    q = np.zeros_like(b)
    for i in range(b.shape[0]):
        for j in range(b.shape[1]):
            if b[i, j] > 0.5:
                q[i, j] = 1
            else:
                q[i, j] = -1
    return q
Compare the speed:
Plain Python
%timeit f(b)
592 ms ± 5.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Numba (just-in-time compiled using LLVM ~ C speed)
%timeit f_jit(b)
5.97 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
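One caveat: the very first call to f_jit also pays the JIT-compilation cost, so trigger the compilation once before timing (%timeit's repeated runs largely hide this, a single timed call would not):

f_jit(b)           # warm-up call: triggers the LLVM compilation
%timeit f_jit(b)   # subsequent calls run the already-compiled machine code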
I have a simulation in which the end user can provide arbitrarily many functions which then get called in the innermost loop. Something like:
class Simulation:
    def __init__(self):
        self.rates = []
        self.amount = 1

    def add(self, rate):
        self.rates.append(rate)

    def run(self, maxtime):
        for t in range(0, maxtime):
            for rate in self.rates:
                self.amount *= rate(t)

def rate(t):
    return t**2

simulation = Simulation()
simulation.add(rate)
simulation.run(100000)
Being a Python loop this is very slow, but I can't get my usual approaches for speeding up loops to work here.
Because the functions are user-defined, I can't "numpyfy" the innermost call (rewriting it so that the innermost work is done by optimized numpy code).
I first tried numba, but numba doesn't allow passing functions into other functions, even if these functions are also numba-compiled. It can use closures, but because I don't know in advance how many functions there will be, I don't think I can use them. Closing over a list of functions fails:
import numba

@numba.jit(nopython=True)
def a():
    return 1

@numba.jit(nopython=True)
def b():
    return 2

fs = [a, b]

@numba.jit(nopython=True)
def c():
    total = 0
    for f in fs:
        total += f()
    return total

c()
This fails with an error:
[...]
File "/home/syrn/.local/lib/python3.6/site-packages/numba/types/containers.py", line 348, in is_precise
return self.dtype.is_precise()
numba.errors.InternalError: 'NoneType' object has no attribute 'is_precise'
[1] During: typing of intrinsic-call at <stdin> (4)
I can't find the source any more, but I think the numba documentation stated somewhere that this is not a bug, it is just not expected to work.
Something like the following would probably work around calling functions from a list, but it seems like a bad idea:
def run(self, maxtime):
    rates = self.rates
    len_rates = len(rates)
    f1 = rates[0]
    if len_rates >= 2:
        f2 = rates[1]
    if len_rates >= 3:
        f3 = rates[2]
    # [... repeat until some arbitrary limit]

    @numba.jit(nopython=True)
    def inner(amount):
        for t in range(0, maxtime):
            amount *= f1(t)
            if len_rates >= 2:
                amount *= f2(t)
            if len_rates >= 3:
                amount *= f3(t)
            # [... repeat until the same arbitrary limit]
        return amount

    self.amount = inner(self.amount)
I guess it would also be possible to do some bytecode hacking: compile the functions with numba, pass a list of strings with the names of the functions into inner, do something like call(func_name), and then rewrite the bytecode so that it becomes func_name(t).
For Cython, just compiling the loop and the multiplications will probably give a bit of a speedup, but if the user-defined functions are still Python, calling them will probably still be slow (although I haven't profiled that yet). I didn't really find much information on "dynamically compiling" functions with Cython, but I guess I would need to somehow add type information to the user-provided functions, which seems hard.
Is there any good way to speed up loops with user-defined functions without parsing them and generating code from them?
I don't think you can speed up the user's function; in the end it is the user's responsibility to write efficient code. What you can do is give them the possibility to interact with your program in an efficient manner, without having to pay for overhead.
You can use Cython, and if the user is also game for using Cython, you can together achieve speed-ups of around 100x compared to the pure Python solution.
As a baseline, I changed your example a little bit: the function rate does more work.
class Simulation:
    def __init__(self, rates):
        self.rates = list(rates)
        self.amount = 1

    def run(self, maxtime):
        for t in range(0, maxtime):
            for rate in self.rates:
                self.amount += rate(t)

def rate(t):
    return t*t*t + 2*t
Yields:
>>> simulation=Simulation([rate])
>>> %timeit simulation.run(10**5)
43.3 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
We can use Cython to speed things up; first, your run function:
%%cython
cdef class Simulation:
    cdef int amount
    cdef list rates

    def __init__(self, rates):
        self.rates = list(rates)
        self.amount = 1

    def run(self, int maxtime):
        cdef int t
        for t in range(maxtime):
            for rate in self.rates:
                self.amount *= rate(t)
This gives us almost a factor of 2:
>>> %timeit simulation.run(10**5)
23.2 ms ± 158 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The user could also use Cython to speed up their calculation:
%%cython
def rate(int t):
    return t*t*t + 2*t
>>> %timeit simulation.run(10**5)
7.08 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using Cython already gave us a speed-up of 6x. What is the bottleneck now? We are still using Python for polymorphism/dispatch, and this is pretty costly, because in order to use it, Python objects (i.e. Python integers here) must be created. Can we do better with Cython? Yes, if we define, at compile time, an interface for the functions we pass to run:
%%cython
cdef class FunInterface:
    cpdef int calc(self, int t):
        pass

cdef class Simulation:
    cdef int amount
    cdef list rates

    def __init__(self, rates):
        self.rates = list(rates)
        self.amount = 1

    def run(self, int maxtime):
        cdef int t
        cdef FunInterface f
        for t in range(maxtime):
            for f in self.rates:
                self.amount *= f.calc(t)

cdef class Rate(FunInterface):
    cpdef int calc(self, int t):
        return t*t*t + 2*t
This yields an additional speed-up of 7x:
simulation=Simulation([Rate()])
>>>%timeit simulation.run(10**5)
1.03 ms ± 20.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The most important part of the code above is the line:
self.amount *= f.calc(t)
which no longer needs Python for dispatch, but uses machinery quite similar to virtual functions in C++. This C++-like approach has only the very small overhead of one indirection/look-up. It also means that neither the result of the function nor its arguments must be converted to Python objects. For this to work, calc must be a cpdef function; you can take a look here for more gory details on how inheritance works for cpdef functions.
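In practice, the user would plug in their own function by subclassing FunInterface in their own Cython code; for example (a sketch which assumes the interface above is compiled as a module called sim with a sim.pxd that declares FunInterface and its cpdef method; the module name is just a placeholder):

%%cython
from sim cimport FunInterface
from sim import Simulation

cdef class MyRate(FunInterface):
    cpdef int calc(self, int t):
        # any user-defined integer-valued rate, evaluated without Python overhead
        return 3*t + 1

simulation = Simulation([MyRate()])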
The bottleneck now is the line for f in self.rates, because we still have to do a lot of Python interaction in every step. Here is an example of what would be possible if we could improve on this:
%%cython
.....
cdef class Simulation:
    cdef int amount
    cdef FunInterface f  # just one function, no list

    def __init__(self, fun):
        self.f = fun
        self.amount = 1

    def run(self, int maxtime):
        cdef int t
        for t in range(maxtime):
            self.amount *= self.f.calc(t)
...
>>> simulation=Simulation(Rate())
>>> %timeit simulation.run(10**5)
408 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Another factor of 2, but you can decide whether the more complicated code that would be needed to store a list of FunInterface objects without Python interaction is really worth it; one possible middle ground is sketched below.
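The middle ground could look like this rough, untested sketch: store a small, fixed number of typed FunInterface slots instead of a Python list, which keeps the dispatch free of Python interaction at the price of an arbitrary upper limit (FunInterface as defined above, in the same %%cython cell):

%%cython
# FunInterface (and Rate) defined as above
cdef class SimulationSlots:
    cdef int amount
    cdef int n
    cdef FunInterface f0
    cdef FunInterface f1  # add more slots if a higher limit is needed

    def __init__(self, funs):
        funs = list(funs)
        self.n = len(funs)
        if self.n > 0:
            self.f0 = funs[0]
        if self.n > 1:
            self.f1 = funs[1]
        self.amount = 1

    def run(self, int maxtime):
        cdef int t
        for t in range(maxtime):
            if self.n > 0:
                self.amount *= self.f0.calc(t)
            if self.n > 1:
                self.amount *= self.f1.calc(t)

Whether the extra bookkeeping is worth another factor of ~2 depends on how hot this loop really is.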
Numpy's string functions are all very slow and are less performant than pure python lists. I am looking to optimize all the normal string functions using Cython.
For instance, let's take a numpy array of 100,000 unicode strings with data type either unicode or object and lowercase each one.
import numpy as np

alist = ['JsDated', 'УКРАЇНА'] * 50000
arr_unicode = np.array(alist)
arr_object = np.array(alist, dtype='object')
%timeit np.char.lower(arr_unicode)
51.6 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using a list comprehension is just as fast
%timeit [a.lower() for a in arr_unicode]
44.7 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
For the object data type, we cannot use np.char. The list comprehension is 3x as fast.
%timeit [a.lower() for a in arr_object]
16.1 ms ± 147 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The only way I know how to do this in Cython is to create an empty object array and call the Python string method lower on each iteration.
import numpy as np
cimport numpy as np
from numpy cimport ndarray

def lower(ndarray[object] arr):
    cdef int i
    cdef int n = len(arr)
    cdef ndarray[object] result = np.empty(n, dtype='object')
    for i in range(n):
        result[i] = arr[i].lower()
    return result
This yields a modest improvement
%timeit lower(arr_object)
11.3 ms ± 383 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I have tried accessing the memory directly with the data ndarray attribute like this:
def lower_fast(ndarray[object] arr):
    cdef int n = len(arr)
    cdef int i
    cdef char* data = arr.data
    cdef int itemsize = arr.itemsize
    for i in range(n):
        # no idea here
I believe data is one contiguous piece of memory holding all the raw bytes one after the other. Accessing these bytes is extremely fast and it seems converting these raw bytes would increase performance by 2 orders of magnitude. I found a tolower c++ function that might work, but I don't know how to hook it in with Cython.
Update with fastest method (doesn't work for unicode)
Here is the fastest method I found by far, from another SO post. This lowercases all the ascii characters by accessing the numpy memoryview via the data attribute. I think it will mangle other unicode characters that have bytes between 65 and 90 as well. But the speed is very good.
cdef int f(char *a, int itemsize, int shape):
    cdef int i
    cdef int num
    for i in range(shape * itemsize):
        num = a[i]
        if 65 <= num <= 90:
            a[i] += 32

def lower_fast(ndarray arr):
    cdef char *inp
    inp = arr.data
    f(inp, arr.itemsize, arr.shape[0])
    return arr
This is 100x faster than the others and what I am looking for.
%timeit lower_fast(arr)
103 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This was only slightly faster than the list comprehension for me on my machine, but if you want unicode support this might be the fastest way of doing it. You'll need to apt-get install libunistring-dev or whatever is appropriate for your OS / package manager.
In some C file, say, _lower.c, have
#include <stdlib.h>
#include <string.h>
#include <unistr.h>
#include <unicase.h>

void _c_tolower(uint8_t **s, uint32_t total_len) {
    size_t lower_len, s_len;
    uint8_t *s_ptr = *s, *lowered;
    while (s_ptr - *s < total_len) {
        s_len = u8_strlen(s_ptr);
        if (s_len == 0) {
            s_ptr += 1;
            continue;
        }
        lowered = u8_tolower(s_ptr, s_len, NULL, NULL, NULL, &lower_len);
        memcpy(s_ptr, lowered, lower_len);
        free(lowered);
        s_ptr += s_len;
    }
}
Then, in lower.pxd you do
cdef extern from "_lower.c":
    cdef void _c_tolower(unsigned char **s, unsigned int total_len)
And finally, in lower.pyx:
from numpy cimport ndarray

cpdef void lower(ndarray arr):
    cdef unsigned char * _arr
    _arr = <unsigned char *> arr.data
    _c_tolower(&_arr, arr.shape[0] * arr.itemsize)
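For completeness, a minimal setup.py sketch for building this; the file names follow the ones above and libunistring is assumed to be installed system-wide (untested, adapt as needed):

from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy

ext = Extension(
    "lower",
    sources=["lower.pyx"],              # _lower.c is pulled in via the cdef extern block
    include_dirs=[numpy.get_include()],
    libraries=["unistring"],            # link against libunistring
)
setup(ext_modules=cythonize(ext))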
On my laptop, I got 46ms for the list comprehension you had above and 37ms for this method (and 0.8ms for your lower_fast), so it's probably not worth it, but I figured I'd type it out in case you wanted an example of how to hook such a thing into Cython.
There are a few points of improvement that I don't know will make much of a difference:
arr.data is something like a rectangular block of memory, I guess (I don't know, I don't use numpy for anything), and the ends of the shorter strings are padded with \x00s. I was too lazy to figure out how to get u8_tolower to look past the zeros, so I just manually fast-forward past them (that's what the if (s_len == 0) clause is doing). I suspect that one call to u8_tolower would be significantly faster than doing it thousands of times.
I'm doing a lot of freeing/memcpying. You can probably avoid that if you're clever.
I think it's the case that every lowercase unicode character is at most as wide as its uppercase variant, so this should not run into any segfaults or buffer overwrites or just overlapping substring issues, but don't take my word for that.
Not really an answer, but hope it helps your further investigations!
PS You'll notice that this does the lowering in-place, so the usage would be like this:
>>> alist = ['JsDated', 'УКРАЇНА', '道德經', 'Ну И йЕшШо'] * 2
>>> arr_unicode = np.array(alist)
>>> lower(arr_unicode)
>>> for x in arr_unicode:
...     print(x)
...
jsdated
україна
道德經
ну и йешшо
jsdated
україна
道德經
ну и йешшо
>>> alist = ['JsDated', 'УКРАЇНА'] * 50000
>>> arr_unicode = np.array(alist)
>>> ct = time(); x = [a.lower() for a in arr_unicode]; time() - ct;
0.046072959899902344
>>> arr_unicode = np.array(alist)
>>> ct = time(); lower(arr_unicode); time() - ct
0.037489891052246094
EDIT
DUH, you can modify the C function to look like this:
void _c_tolower(uint8_t **s, uint32_t total_len) {
    size_t lower_len;
    uint8_t *lowered;
    lowered = u8_tolower(*s, total_len, NULL, NULL, NULL, &lower_len);
    memcpy(*s, lowered, lower_len);
    free(lowered);
}
and then it does it all in one go. It looks more dangerous in terms of possibly having something from the old data left over if lower_len is shorter than the original string; in short, this code is TOTALLY EXPERIMENTAL AND FOR ILLUSTRATIVE PURPOSES ONLY, DO NOT USE THIS IN PRODUCTION, IT WILL PROBABLY BREAK.
Anyway, ~40% faster this way:
>>> alist = ['JsDated', 'УКРАЇНА'] * 50000
>>> arr_unicode = np.array(alist)
>>> ct = time(); lower(arr_unicode); time() - ct
0.022463043975830078
I am trying to understand why filling a 64 bit array is slower than filling a 32 bit array.
Here are the examples:
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def test32(long long int size):
    cdef np.ndarray[np.int32_t, ndim=1] index = np.zeros(size, np.int32)
    cdef Py_ssize_t i
    for i in range(size):
        index[i] = i
    return index

@cython.boundscheck(False)
@cython.wraparound(False)
def test64(long long int size):
    cdef np.ndarray[np.int64_t, ndim=1] index = np.zeros(size, np.int64)
    cdef Py_ssize_t i
    for i in range(size):
        index[i] = i
    return index
The timings:
In [4]: %timeit test32(1000000)
1000 loops, best of 3: 1.13 ms per loop
In [5]: %timeit test64(1000000)
100 loops, best of 3: 2.04 ms per loop
I am using a 64 bit computer and the Visual C++ for Python (9.0) compiler.
Edit:
Initializing a 64-bit array and a 32-bit array seems to take the same amount of time, which means the time difference is due to the filling process.
In [8]: %timeit np.zeros(1000000,'int32')
100000 loops, best of 3: 2.49 μs per loop
In [9]: %timeit np.zeros(1000000,'int64')
100000 loops, best of 3: 2.49 μs per loop
Edit2:
As DavidW points out, this behavior can be replicated with np.arange, which means that this is expected:
In [7]: %timeit np.arange(1000000,dtype='int32')
10 loops, best of 3: 1.22 ms per loop
In [8]: %timeit np.arange(1000000,dtype='int64')
10 loops, best of 3: 2.03 ms per loop
A quick set of measurements (which I initially posted in the comments) suggests that you see very similar behaviour (64 bit taking twice as long as 32 bit) for np.full, np.ones and np.arange, with all three showing similar times to each other (arange being about 10% slower than the other two for me).
I believe this suggests that this is expected behaviour and is just the time it takes to fill the memory (a 64-bit element obviously occupies twice as much memory as a 32-bit element).
The interesting question is then why np.zeros is so uniform (and fast). A complete answer is probably given in this C-based question, but the basic summary is that allocating zeros (as done by the C function calloc) can be done lazily, i.e. it doesn't actually allocate or fill much until you actually try to write to the memory.
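To illustrate that laziness: the np.zeros call itself returns almost immediately, and the cost only appears once the memory is actually written:

import numpy as np

a = np.zeros(10**8, dtype=np.int64)  # returns almost immediately: zeroed pages are handed out lazily
a[:] = 1                             # the real page-allocation/fill cost shows up on this first write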
I need to implement a function for summing the elements of an array with a variable section length.
So,
a = np.arange(10)
section_lengths = np.array([3, 2, 5])
out = accumulate(a, section_lengths)
print out
array([ 3., 7., 35.])
I attempted an implementation in cython here:
https://gist.github.com/2784725
For performance I am comparing to the pure numpy solution for the case where the section_lengths are all the same:
LEN = 10000
b = np.ones(LEN, dtype=np.int) * 2000
a = np.arange(np.sum(b), dtype=np.double)
out = np.zeros(LEN, dtype=np.double)
%timeit np.sum(a.reshape(-1,2000), axis=1)
10 loops, best of 3: 25.1 ms per loop
%timeit accumulate.accumulate(a, b, out)
10 loops, best of 3: 64.6 ms per loop
Would you have any suggestions for improving performance?
You might try some of the following:
In addition to the @cython.boundscheck(False) compiler directive, also try adding @cython.wraparound(False)
In your setup.py script, try adding in some optimization flags:
ext_modules = [Extension("accumulate", ["accumulate.pyx"], extra_compile_args=["-O3",])]
Take a look at the .html file generated by cython -a accumulate.pyx to see if there are sections that are missing static typing or relying heavily on Python C-API calls:
http://docs.cython.org/src/quickstart/cythonize.html#determining-where-to-add-types
Add a return statement at the end of the method. Currently it is doing a bunch of unnecessary error checking in your tight loop at i_el += 1.
Not sure if it will make a difference, but I tend to make loop counters cdef unsigned int rather than just int
You also might compare your code to numpy when section_lengths are unequal, since that will probably require a bit more than just a simple sum; one pure-numpy possibility is sketched below.
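For reference, one pure-numpy way to handle unequal section lengths, using a and section_lengths from the question:

import numpy as np

# start offset of each section, derived from the section lengths
starts = np.concatenate(([0], np.cumsum(section_lengths)[:-1]))
out = np.add.reduceat(a, starts)  # sums a[starts[i]:starts[i+1]] for every section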
Updating out[i_bas] inside the nested for loop is slow; you can accumulate into a temporary variable and write out[i_bas] once the nested for loop has finished. The following code will be as fast as the numpy version:
import numpy as np
cimport numpy as np
cimport cython

ctypedef np.int_t DTYPE_int_t
ctypedef np.double_t DTYPE_double_t

@cython.boundscheck(False)
@cython.wraparound(False)
def accumulate(
        np.ndarray[DTYPE_double_t, ndim=1] a not None,
        np.ndarray[DTYPE_int_t, ndim=1] section_lengths not None,
        np.ndarray[DTYPE_double_t, ndim=1] out not None,
        ):
    cdef int i_el, i_bas, sec_length, lenout
    cdef double tmp
    lenout = out.shape[0]
    i_el = 0
    for i_bas in range(lenout):
        tmp = 0
        for sec_length in range(section_lengths[i_bas]):
            tmp += a[i_el]
            i_el += 1
        out[i_bas] = tmp