I need to implement a function for summing the elements of an array with a variable section length.
So,
a = np.arange(10)
section_lengths = np.array([3, 2, 5])
out = accumulate(a, section_lengths)
print(out)
array([ 3., 7., 35.])
I attempted an implementation in cython here:
https://gist.github.com/2784725
For performance, I am comparing against the pure numpy solution for the case where all the section_lengths are equal:
LEN = 10000
b = np.ones(LEN, dtype=np.int) * 2000
a = np.arange(np.sum(b), dtype=np.double)
out = np.zeros(LEN, dtype=np.double)
%timeit np.sum(a.reshape(-1,2000), axis=1)
10 loops, best of 3: 25.1 ms per loop
%timeit accumulate.accumulate(a, b, out)
10 loops, best of 3: 64.6 ms per loop
Would you have any suggestions for improving performance?
You might try some of the following:
In addition to the @cython.boundscheck(False) compiler directive, also try adding @cython.wraparound(False)
In your setup.py script, try adding in some optimization flags:
ext_modules = [Extension("accumulate", ["accumulate.pyx"], extra_compile_args=["-O3",])]
Take a look at the .html file generated by cython -a accumulate.pyx to see if there are sections that are missing static typing or relying heavily on Python C-API calls:
http://docs.cython.org/src/quickstart/cythonize.html#determining-where-to-add-types
Add a return statement at the end of the method. Currently it is doing a bunch of unnecessary error checking in your tight loop at i_el += 1.
Not sure if it will make a difference, but I tend to make loop counters cdef unsigned int rather than just int.
You also might compare your code to numpy when the section_lengths are unequal, since that case requires a bit more than just a simple sum; one possible pure-numpy baseline is sketched below.
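For reference, a pure-numpy baseline for unequal section lengths is possible with np.add.reduceat; here is a minimal sketch using the example arrays from the question (reduceat sums each slice a[offsets[i]:offsets[i+1]], and runs through the end of the array for the last offset):

import numpy as np

a = np.arange(10, dtype=np.double)
section_lengths = np.array([3, 2, 5])
# offsets are the start index of each section: [0, 3, 5]
offsets = np.concatenate(([0], np.cumsum(section_lengths)[:-1]))
out = np.add.reduceat(a, offsets)
print(out)  # [  3.   7.  35.]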
In the nested for loop, updating out[i_bas] on every iteration is slow; you can accumulate into a temporary variable and write out[i_bas] once the inner loop has finished. The following code is as fast as the numpy version:
import numpy as np
cimport numpy as np
ctypedef np.int_t DTYPE_int_t
ctypedef np.double_t DTYPE_double_t
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def accumulate(
        np.ndarray[DTYPE_double_t, ndim=1] a not None,
        np.ndarray[DTYPE_int_t, ndim=1] section_lengths not None,
        np.ndarray[DTYPE_double_t, ndim=1] out not None,
        ):
    cdef int i_el, i_bas, sec_length, lenout
    cdef double tmp
    lenout = out.shape[0]
    i_el = 0
    for i_bas in range(lenout):
        tmp = 0
        for sec_length in range(section_lengths[i_bas]):
            tmp += a[i_el]
            i_el += 1
        out[i_bas] = tmp
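A usage sketch, assuming the code above is compiled as the accumulate module (the section lengths here exactly cover the input array):

import numpy as np
import accumulate

a = np.arange(10, dtype=np.double)
sections = np.array([3, 2, 5], dtype=np.int)
out = np.zeros(len(sections), dtype=np.double)
accumulate.accumulate(a, sections, out)
print(out)  # [  3.   7.  35.]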
I have the following code to generate all possible combinations in a specified range using itertools, but I can't get any speed improvement from using the code with Cython. The original code is this:
from itertools import *

def x(e, f, g):
    a = []
    for c in combinations(range(e, f), g):
        d = list(c)
        a.append(d)
and after declaring types for cython:
from itertools import *

cpdef x(int e, int f, int g):
    cpdef tuple c
    cpdef list a
    cpdef list d
    a = []
    for c in combinations(range(e, f), g):
        d = list(c)
        a.append(d)
I saved the latter as test_cy.pyx and compiled using cythonize -a -i test_cy.pyx
After compiling, I created a new script with the following code and ran it:
import test_cy
test_cy.x(1,45,6)
I didn't get any significant speed improvement; it still took about the same time as the original script, about 10.8 s.
Is there anything I did wrong, or is itertools already so optimised that there can't be any bigger improvement to its speed?
As already pointed out in the comments, you should not expect Cython to speed up your code, because most of the time is spent inside itertools and in the creation of the lists.
Because I'm curious to see how itertools's generic implementation fares against old-school tricks, let's take a look at this Cython implementation of "all subsets of size k out of n":
%%cython
ctypedef unsigned long long ull

cdef ull next_subset(ull subset):
    # Gosper's hack: next larger integer with the same number of set bits
    cdef ull smallest, ripple, ones
    smallest = subset & -subset
    ripple = subset + smallest
    ones = subset ^ ripple
    ones = (ones >> 2) // smallest
    subset = ripple | ones
    return subset

cdef subset2list(ull subset, int offset, int cnt):
    cdef list lst = [0] * cnt  # pre-allocate
    cdef int current = 0
    cdef int index = 0
    while subset > 0:
        if (subset & 1) != 0:
            lst[index] = offset + current
            index += 1
        subset >>= 1
        current += 1
    return lst

def all_k_subsets(int start, int end, int k):
    cdef int n = end - start
    cdef ull MAX = (<ull>1) << n
    cdef ull subset = ((<ull>1) << k) - 1
    lst = []
    while MAX > subset:
        lst.append(subset2list(subset, start, k))
        subset = next_subset(subset)
    return lst
This implementation uses some well-known bit tricks and has the limitation that it only works for at most 64 elements.
If we compare both approaches:
>>> %timeit x(1,45,6)
2.52 s ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit all_k_subsets(1,45,6)
1.29 s ± 5.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The speed-up of factor 2 is quite disappointing.
However, the bottleneck is the creation of the lists and not the calculation itself; it is easy to check that without the list creation the calculation takes about 0.1 seconds.
My takeaway: if you are serious about speed, you should not create so many lists but process the subsets on the fly (best in Cython, as sketched below); a speed-up of more than 10x is possible. If you must have all subsets as lists, then you cannot expect a huge speed-up.
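As a minimal sketch of what "processing on the fly" could look like (the aggregation below merely counts the subsets; any per-subset computation could take its place, and next_subset is repeated so the cell is self-contained):

%%cython
ctypedef unsigned long long ull

cdef ull next_subset(ull subset):
    # same Gosper's hack as above
    cdef ull smallest, ripple, ones
    smallest = subset & -subset
    ripple = subset + smallest
    ones = subset ^ ripple
    ones = (ones >> 2) // smallest
    return ripple | ones

def count_k_subsets(int start, int end, int k):
    # consume each subset immediately instead of materializing a list
    cdef int n = end - start
    cdef ull MAX = (<ull>1) << n
    cdef ull subset = ((<ull>1) << k) - 1
    cdef ull cnt = 0
    while MAX > subset:
        cnt += 1  # replace with any on-the-fly aggregation
        subset = next_subset(subset)
    return cnt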
I wrote a Python script to perform box covering on a graph, but it takes more than a minute when I run it on small graphs (100 nodes).
Today someone recommended Cython to improve its efficiency, so I followed this guide to adapt the code that I had.
Running the Python code, the results were:
In [6]: %timeit test.test()
1000 loops, best of 3: 1.88 ms per loop
After following the guide the results were:
In [7]: %timeit c_test.test()
1000 loops, best of 3: 1.05 ms per loop
The performance was better, but I am sure a lot more can be improved. Given that I just met Cython today, I want to ask you how I can improve this code:
import random as rnd
import numpy as np
cimport cython
cimport numpy as np

DTYPE = np.int
ctypedef np.int_t DTYPE_t

def choose_color(not_valid_colors, valid_colors):
    possible_values = list(valid_colors - not_valid_colors)
    if possible_values:
        return rnd.choice(possible_values)
    else:
        return max(valid_colors.union(not_valid_colors)) + 1

@cython.boundscheck(False)
cdef np.ndarray[DTYPE_t, ndim=2] greedy_coloring(np.ndarray[DTYPE_t, ndim=2] distances, int num_nodes, int diameter):
    cdef int i, lb, j
    cdef np.ndarray[DTYPE_t, ndim=2] c = np.empty((num_nodes+1, diameter+2), dtype=DTYPE)
    c.fill(-1)
    # Matrix C will not use the 0 column and 0 row to
    # let the algorithm look very similar to the paper
    # pseudo-code
    nodes = list(range(1, num_nodes+1))
    rnd.shuffle(nodes)
    c[nodes[0], :] = 0
    # Algorithm
    for i in nodes[1:]:
        for lb in range(2, diameter+1):
            not_valid_colors = set()
            valid_colors = set()
            for j in nodes[:i]:
                if distances[i-1, j-1] >= lb:
                    not_valid_colors.add(c[j, lb])
                else:
                    valid_colors.add(c[j, lb])
            c[i, lb] = choose_color(not_valid_colors, valid_colors)
    return c

def test():
    distances = np.matrix('0 3 2 4 1 1; \
                           3 0 1 1 3 2; \
                           2 1 0 2 2 1; \
                           4 1 2 0 4 3; \
                           1 3 2 4 0 1; \
                           1 2 1 3 1 0')
    c = greedy_coloring(distances, 6, 4)
In Cython, you get more speed as you remove more Python calls from inside your Cython functions.
For example, skimming through your code, you are calling choose_color() inside the nested loop in greedy_coloring(). That function should be typed as well, along with the variables defined inside it; since it is called repeatedly, it brings a lot of overhead (a sketch follows below).
You can run cython with the -a option (e.g., cython -a file.pyx) to generate an annotated HTML file which shows visually which parts of your code make Python calls (yellow lines). This will help you a lot in improving your Cython code.
I'm sorry for lack of specific pointers - hope this is helpful.
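As a rough sketch (untested, and assuming it lives in the same .pyx file so that import random as rnd is in scope), a typed replacement for choose_color() might look like this; the set arithmetic still goes through the Python C-API, but the cdef declaration and typed return value remove the Python function-call overhead at each invocation:

cdef int choose_color(set not_valid_colors, set valid_colors):
    cdef list possible_values = list(valid_colors - not_valid_colors)
    if possible_values:
        return rnd.choice(possible_values)
    else:
        return max(valid_colors.union(not_valid_colors)) + 1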
I encountered a pretty weird result from a benchmark.
Those are all different flavors of a bubblesort implementation, and the fastest approach at n=10^4 converts a Python list to a C array internally. In contrast, the yellow line corresponds to code where I am using NumPy arrays with a memoryview. I expected the results to be the other way around. I (and colleagues) repeated the benchmark a couple of times and always got the same results. Maybe someone has an idea of what is going on here...
The black line in the plot would correspond to the code:
%%cython
cimport cython
from libc.stdlib cimport malloc, free

def cython_bubblesort_clist(a_list):
    """
    The Cython implementation of bubble sort with internal
    conversion between Python list objects and C arrays.
    """
    cdef int *c_list
    c_list = <int *>malloc(len(a_list) * cython.sizeof(int))
    cdef int count, i, j  # static type declarations
    count = len(a_list)
    # convert Python list to C array
    for i in range(count):
        c_list[i] = a_list[i]
    for i in range(count):
        for j in range(1, count):
            if c_list[j] < c_list[j-1]:
                c_list[j-1], c_list[j] = c_list[j], c_list[j-1]
    # convert C array back to Python list
    for i in range(count):
        a_list[i] = c_list[i]
    free(c_list)
    return a_list
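A quick correctness check in the same notebook session might look like this (a sketch; note the function also sorts the passed-in list in place):

import random
data = [random.randrange(1000) for _ in range(100)]
assert cython_bubblesort_clist(list(data)) == sorted(data)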
and the pink line to this code:
%%cython
import numpy as np
cimport numpy as np
cimport cython

def cython_bubblesort_numpy(long[:] np_ary):
    """
    The Cython implementation of bubble sort with NumPy memoryview.
    """
    cdef int count, i, j  # static type declarations
    count = np_ary.shape[0]
    for i in range(count):
        for j in range(1, count):
            if np_ary[j] < np_ary[j-1]:
                np_ary[j-1], np_ary[j] = np_ary[j], np_ary[j-1]
    return np.asarray(np_ary)
As suggested in the comments above, I added the decorators
%%cython
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef cython_bubblesort_numpy(long[:] np_ary):
    """
    The Cython implementation of bubble sort with NumPy memoryview.
    """
    cdef int count, i, j  # static type declarations
    count = np_ary.shape[0]
    for i in range(count):
        for j in range(1, count):
            if np_ary[j] < np_ary[j-1]:
                np_ary[j-1], np_ary[j] = np_ary[j], np_ary[j-1]
    return np.asarray(np_ary)
and the results are more what I expected now :)
It is worth making one trivial change to your code to see if it improves things further:
cpdef cython_bubblesort_numpy(long[::1] np_ary):
# ...
This tells Cython that np_ary is a C-contiguous array, and the generated code for the nested for loops can be further optimized with this information.
This code won't accept non-contiguous arrays as arguments, but that is fairly trivial to handle with numpy.ascontiguousarray(), as sketched below.
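For example, a thin wrapper (a sketch; the wrapper name is made up) could normalize any input before calling the typed function:

import numpy as np

def bubblesort_any(ary):
    # ascontiguousarray is a no-op for arrays that are already C contiguous
    # and copies otherwise, so the long[::1] signature is always satisfied
    return cython_bubblesort_numpy(np.ascontiguousarray(ary))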
I have two numpy boolean arrays (a and b). I need to find how many of their elements are equal. Currently, I do len(a) - (a ^ b).sum(), but the xor operation creates an entirely new numpy array, as I understand it. How do I efficiently implement this desired behavior without creating the unnecessary temporary array?
I've tried using numexpr, but I can't quite get it to work right. It doesn't support the notion that True is 1 and False is 0, so I have to use ne.evaluate("sum(where(a==b, 1, 0))"), which takes about twice as long.
Edit: I forgot to mention that one of these arrays is actually a view into another array of a different size, and both arrays should be considered immutable. Both arrays are 2-dimensional and tend to be somewhere around 25x40 in size.
Yes, this is the bottleneck of my program and is worth optimizing.
On my machine this is faster:
(a == b).sum()
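A variant that may also be worth timing (it still allocates the boolean temporary, but np.count_nonzero reduces the comparison result directly):

import numpy as np

a = np.random.rand(25, 40) < .5  # stand-ins for the arrays from the question
b = np.random.rand(25, 40) < .5
n_equal = np.count_nonzero(a == b)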
If you don't want to use any extra storage, then I would suggest using numba.
I'm not too familiar with it, but this seems to work well.
I ran into some trouble getting Cython to take a boolean NumPy array.
from numba import autojit
from numpy.random import rand  # rand is used below but was never imported

def pysumeq(a, b):
    tot = 0
    for i in xrange(a.shape[0]):
        for j in xrange(a.shape[1]):
            if a[i, j] == b[i, j]:
                tot += 1
    return tot

# make numba version
nbsumeq = autojit(pysumeq)

A = (rand(10, 10) < .5)
B = (rand(10, 10) < .5)
# do a simple dry run to get it to compile
# for this specific use case
nbsumeq(A, B)
If you don't have numba, I would suggest using the answer by @user2357112.
Edit: Just got a Cython version working, here's the .pyx file. I'd go with this.
from numpy cimport ndarray as ar
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cysumeq(ar[np.uint8_t, ndim=2, cast=True] a, ar[np.uint8_t, ndim=2, cast=True] b):
    cdef int i, j, h = a.shape[0], w = a.shape[1], tot = 0
    for i in xrange(h):
        for j in xrange(w):
            if a[i, j] == b[i, j]:
                tot += 1
    return tot
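A usage sketch, assuming the .pyx file above is compiled into a module named cysumeq (the module name is an assumption):

import numpy as np
from cysumeq import cysumeq

A = np.random.rand(25, 40) < .5
B = np.random.rand(25, 40) < .5
print(cysumeq(A, B))  # number of positions where A and B agree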
To start with, you can fold the subtraction into a single expression by negating the xor:
>>> a
array([ True, False, True, False, True], dtype=bool)
>>> b
array([False, True, True, False, True], dtype=bool)
>>> np.sum(~(a^b))
3
If you do not mind destroying array a or b, I am not sure you will get faster than this:
>>> a ^= b  # in-place xor operator
>>> np.sum(~a)
3
If the problem is allocation and deallocation, maintain a single output array and tell numpy to put the results there every time:
out = np.empty_like(a) # Allocate this outside a loop and use it every iteration
num_eq = np.equal(a, b, out).sum()
This'll only work if the inputs are always the same dimensions, though. If the inputs have varying sizes, you may be able to make one big array and slice out a part that's the size you need for each call, but I'm not sure how much that slows you down; a sketch follows.
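A sketch of the big-buffer idea (the buffer size and helper name are made up):

import numpy as np

scratch = np.empty((64, 64), dtype=bool)  # sized for the largest expected input

def count_equal(a, b):
    out = scratch[:a.shape[0], :a.shape[1]]  # a view: no new allocation per call
    return np.equal(a, b, out).sum()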
Improving upon IanH's answer, it's also possible to get access to the underlying C array of a numpy array from within Cython, by supplying mode="c" to ndarray:
from numpy cimport ndarray as ar
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cdef int cy_sum_eq(ar[np.uint8_t, ndim=2, cast=True, mode="c"] a, ar[np.uint8_t, ndim=2, cast=True, mode="c"] b):
    cdef int i, j, h = a.shape[0], w = a.shape[1], tot = 0
    cdef np.uint8_t* adata = &a[0, 0]
    cdef np.uint8_t* bdata = &b[0, 0]
    for i in xrange(h):
        for j in xrange(w):
            if adata[j] == bdata[j]:
                tot += 1
        adata += w
        bdata += w
    return tot
This is about 40% faster on my machine than IanH's Cython version, and I've found that rearranging the loop contents doesn't make much of a difference at this point, probably due to compiler optimizations. At that point, one could link to a C function optimized with SSE and the like to perform the operation, passing adata and bdata as uint8_t*s; a sketch of the wiring is shown below.
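A sketch of how that hand-off could be wired up ("fast_eq.h" and fast_count_eq() are hypothetical placeholders for the SSE-optimized C routine you would write and link against):

from numpy cimport ndarray as ar
cimport numpy as np

cdef extern from "fast_eq.h":
    # hypothetical SIMD-optimized counter implemented in plain C
    int fast_count_eq(const np.uint8_t* a, const np.uint8_t* b, int n) nogil

def cy_sum_eq_sse(ar[np.uint8_t, ndim=2, cast=True, mode="c"] a,
                  ar[np.uint8_t, ndim=2, cast=True, mode="c"] b):
    # contiguous 2D data can be handed over as one flat block
    return fast_count_eq(&a[0, 0], &b[0, 0], a.shape[0] * a.shape[1])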
I have this cython code just for testing:
cimport cython

cpdef loop(int k):
    return real_loop(k)

@cython.cdivision(True)
cdef real_loop(int k):
    cdef int i
    cdef float a
    for i in xrange(k):
        a = i
        a = a**2 / (a + 1)
    return a
And I test the speed difference between this Cython code and the same code in pure Python with a script like this:
import mymodule
print(mymodule.loop(100000))
I get an 80x speed-up. But if I remove the two return statements from the Cython code, I get an 800-900x speed-up. Why?
Another thing: if I run this code (with the returns) on my old Acer Aspire One notebook, I get a 700x speed-up, and on a new desktop i7 PC at home, 80x.
Does somebody know why?
I tested your problem with the following code:
# cython: wraparound=False
# cython: boundscheck=False
# cython: cdivision=True
# cython: nonecheck=False
# cython: profile=False

def loop(int k):
    return real_loop(k)

def loop2(int k):
    cdef float a
    real_loop2(k, &a)
    return a

def loop3(int k):
    real_loop3(k)
    return None

def loop4(int k):
    return real_loop4(k)

def loop5(int k):
    cdef float a
    real_loop5(k, &a)
    return a

cdef float real_loop(int k):
    cdef int i
    cdef float a
    a = 0.
    for i in range(k):
        a += a**2 / (a + 1)
    return a

cdef void real_loop2(int k, float *a):
    cdef int i
    a[0] = 0.
    for i in range(k):
        a[0] += a[0]**2 / (a[0] + 1)

cdef void real_loop3(int k):
    cdef int i
    cdef float a
    a = 0.
    for i in range(k):
        a += a**2 / (a + 1)

cdef float real_loop4(int k):
    cdef int i
    cdef float a
    a = 0.
    for i in range(k):
        a += a*a / (a + 1)
    return a

cdef void real_loop5(int k, float *a):
    cdef int i
    a[0] = 0.
    for i in range(k):
        a[0] += a[0]*a[0] / (a[0] + 1)
where real_loop() is close to your function, with a modified formula for a since the original one seemed strange.
Function real_loop2() does not return any value; it just updates a by reference.
Function real_loop3() is not returning any value.
Checking the generated C code for real_loop3(), one can see that the loop is there and the code is called... but I had the same conclusion as @dmytro: changing k won't change the timing significantly. So there must be a point that I am missing here; most likely the C compiler optimizes the loop away as dead code, since a is never used.
From the timings below we can say that return is not the bottleneck, since real_loop2() and real_loop5() do not return any value, and their performance is the same as real_loop() and real_loop4(), respectively.
In [2]: timeit _stack.loop(100000)
1000 loops, best of 3: 1.71 ms per loop
In [3]: timeit _stack.loop2(100000)
1000 loops, best of 3: 1.69 ms per loop
In [4]: timeit _stack.loop3(100000)
10000000 loops, best of 3: 78.5 ns per loop
In [5]: timeit _stack.loop4(100000)
1000 loops, best of 3: 913 µs per loop
In [6]: timeit _stack.loop5(100000)
1000 loops, best of 3: 979 µs per loop
Note the ~2x speed-up from changing a**2 to a*a, since a**2 requires a function call, powf(), inside the loop.