I have this Cython code, just for testing:
cimport cython

cpdef loop(int k):
    return real_loop(k)

@cython.cdivision(True)
cdef real_loop(int k):
    cdef int i
    cdef float a
    for i in xrange(k):
        a = i
        a = a**2 / (a + 1)
    return a
I compare the speed of this Cython code against the same code in pure Python with a script like this:
import mymodule
print(mymodule.loop(100000))
The Cython version is about 80 times faster. But if I remove the two return statements from the Cython code, it is 800-900 times faster. Why?
Another thing: if I run this code (with the return statements) on my old Acer Aspire One notebook, it is 700 times faster, while on a new desktop i7 PC at home it is only 80 times faster.
Does anybody know why?
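The pure-Python baseline is not shown in the question. A minimal sketch of what it presumably looks like, assuming it mirrors the Cython function line for line (the name loop_py and the initial value of a are illustrative, not taken from the question):

```python
import timeit

def loop_py(k):
    """Pure-Python counterpart of real_loop(); hypothetical baseline."""
    a = 0.0  # assumption: gives a defined result for k == 0
    for i in range(k):
        a = float(i)
        a = a**2 / (a + 1)
    return a

if __name__ == "__main__":
    # Average wall-clock time of one call, measured over 10 calls.
    seconds = timeit.timeit(lambda: loop_py(100000), number=10) / 10
    print(f"{seconds * 1e3:.2f} ms per call")
```

Note that a is overwritten on every iteration, so the result depends only on the last value of i. An optimizing C compiler may exploit that, which makes this loop a fragile benchmark.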
I tested your problem with the following code:
#cython: wraparound=False
#cython: boundscheck=False
#cython: cdivision=True
#cython: nonecheck=False
#cython: profile=False

def loop(int k):
    return real_loop(k)

def loop2(int k):
    cdef float a
    real_loop2(k, &a)
    return a

def loop3(int k):
    real_loop3(k)
    return None

def loop4(int k):
    return real_loop4(k)

def loop5(int k):
    cdef float a
    real_loop5(k, &a)
    return a

cdef float real_loop(int k):
    cdef int i
    cdef float a
    a = 0.
    for i in range(k):
        a += a**2 / (a + 1)
    return a

cdef void real_loop2(int k, float *a):
    cdef int i
    a[0] = 0.
    for i in range(k):
        a[0] += a[0]**2 / (a[0] + 1)

cdef void real_loop3(int k):
    cdef int i
    cdef float a
    a = 0.
    for i in range(k):
        a += a**2 / (a + 1)

cdef float real_loop4(int k):
    cdef int i
    cdef float a
    a = 0.
    for i in range(k):
        a += a*a / (a + 1)
    return a

cdef void real_loop5(int k, float *a):
    cdef int i
    a[0] = 0.
    for i in range(k):
        a[0] += a[0]*a[0] / (a[0] + 1)
where real_loop() is close to your function, with a modified formula for a since the original one seemed strange.
Function real_loop2() is not returning any value, just updating a by reference.
Function real_loop3() is not returning any value.
Checking the generated C code for real_loop3(), one can see that the loop is there and the code is called... but I came to the same conclusion as @dmytro: changing k does not change the timing significantly, so there must be a point that I am missing here.
From the timings below we can say that return is not the bottleneck, since real_loop2() and real_loop5() do not return any value, and their performance is the same as that of real_loop() and real_loop4(), respectively.
In [2]: timeit _stack.loop(100000)
1000 loops, best of 3: 1.71 ms per loop
In [3]: timeit _stack.loop2(100000)
1000 loops, best of 3: 1.69 ms per loop
In [4]: timeit _stack.loop3(100000)
10000000 loops, best of 3: 78.5 ns per loop
In [5]: timeit _stack.loop4(100000)
1000 loops, best of 3: 913 µs per loop
In [6]: timeit _stack.loop5(100000)
1000 loops, best of 3: 979 µs per loop
Note the ~2X speed-up from replacing a**2 with a*a, since a**2 requires a call to powf() inside the loop.
Related
I have the following code to generate all possible combinations in a specified range using itertools, but I can't get any speed improvement from compiling it with Cython. The original code is this:
from itertools import *

def x(e,f,g):
    a=[]
    for c in combinations(range(e, f),g):
        d = list((c))
        a.append(d)
and after declaring types for Cython:
from itertools import *

cpdef x(int e,int f,int g):
    cpdef tuple c
    cpdef list a
    cpdef list d
    a=[]
    for c in combinations(range(e, f),g):
        d = list((c))
        a.append(d)
I saved the latter as test_cy.pyx and compiled it using cythonize -a -i test_cy.pyx
After compiling, I created a new script with the following code and ran it:
import test_cy
test_cy.x(1,45,6)
I didn't get any significant speed improvement; it still took about the same time as the original script, about 10.8 sec.
Is there anything I did wrong, or is itertools already so optimised that there can't be any bigger improvements to its speed?
As already pointed out in the comments, you should not expect Cython to speed up your code, because the algorithm spends most of its time in itertools and in the creation of lists.
Because I'm curious to see how itertools's generic implementation fares against old-school tricks, let's take a look at this Cython implementation of "all subsets k out of n":
%%cython
ctypedef unsigned long long ull

cdef ull next_subset(ull subset):
    cdef ull smallest, ripple, ones
    smallest = subset& -subset
    ripple = subset + smallest
    ones = subset ^ ripple
    ones = (ones >> 2)//smallest
    subset= ripple | ones
    return subset

cdef subset2list(ull subset, int offset, int cnt):
    cdef list lst=[0]*cnt #pre-allocate
    cdef int current=0;
    cdef int index=0
    while subset>0:
        if((subset&1)!=0):
            lst[index]=offset+current
            index+=1
        subset>>=1
        current+=1
    return lst

def all_k_subsets(int start, int end, int k):
    cdef int n=end-start
    cdef ull MAX=1L<<n;
    cdef ull subset=(1L<<k)-1L;
    lst=[]
    while(MAX>subset):
        lst.append(subset2list(subset, start, k))
        subset=next_subset(subset)
    return lst
This implementation uses some well-known bit-tricks and has the limitation, that it only works for at most 64 elements.
If we compare both approaches:
>>> %timeit x(1,45,6)
2.52 s ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit all_k_subsets(1,45,6)
1.29 s ± 5.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The speed-up of a factor of 2 is quite disappointing.
However, the bottleneck is the creation of the lists and not the calculation itself: it is easy to check that without list creation the calculation would take about 0.1 seconds.
My takeaway from it: if you are serious about speed, you should not create so many lists but process the subsets on the fly (best in Cython); a speed-up of more than 10 is possible. If you must have all subsets as lists, you cannot expect a huge speed-up.
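To illustrate what processing the subsets on the fly could look like, here is a plain-Python sketch of the same bit trick (commonly known as Gosper's hack) wrapped in a generator. The name iter_k_subsets and the early return for k == 0 are my own additions, not part of the Cython code above, and the pure-Python version is of course not expected to be fast.

```python
from itertools import combinations

def next_subset(subset):
    """Gosper's hack: next larger integer with the same number of set bits."""
    smallest = subset & -subset      # lowest set bit
    ripple = subset + smallest       # carry ripples through the low run of ones
    ones = subset ^ ripple           # bits that changed
    ones = (ones >> 2) // smallest   # right-adjust the leftover ones
    return ripple | ones

def iter_k_subsets(start, end, k):
    """Yield each k-subset of range(start, end) as a tuple, one at a time."""
    n = end - start
    if k == 0:                       # guard: next_subset(0) would divide by zero
        yield ()
        return
    limit = 1 << n
    subset = (1 << k) - 1
    while subset < limit:
        yield tuple(start + i for i in range(n) if subset >> i & 1)
        subset = next_subset(subset)

# Count the subsets without ever storing them in a list.
count = sum(1 for _ in iter_k_subsets(1, 11, 3))
print(count)  # 120, i.e. C(10, 3)

# Same subsets as itertools, only in a different order.
print(sorted(iter_k_subsets(0, 5, 2)) == sorted(combinations(range(5), 2)))  # True
```

The subsets come out in increasing order of their bitmask, not in the lexicographic order that itertools.combinations uses.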
I have successfully used Cython for the first time to significantly speed up packing nibbles from one list of integers (bytes) into another (see Faster bit-level data packing), e.g. packing the two sequential bytes 0x0A and 0x0B into 0xAB.
def pack(it):
    """Cythonize python nibble packing loop, typed"""
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    return [ (it[i*2]//16)<<4 | it[i*2+1]//16 for i in range(n) ]
While the resulting speed is satisfactory, I am curious whether this can be taken further by making better use of the input and output lists.
cython3 -a pack.cyx generates a very "cythonic" HTML report that I unfortunately am not experienced enough to draw any useful conclusions from.
From a C point of view the loop should "simply" access two unsigned int arrays. Possibly, using a wider data type (16/32 bit) could further speed this up proportionally.
The question is: (how) can Python [binary/immutable] sequence types be typed as unsigned int array for Cython?
Using array as suggested in How to convert python array to cython array? does not seem to make it faster (and the array needs to be created from bytes object beforehand), nor does typing the parameter as list instead of object (same as no type) or using for loop instead of list comprehension:
def packx(list it):
    """Cythonize python nibble packing loop, typed"""
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    cdef list r = [0]*n
    for i in range(n):
        r[i] = (it[i*2]//16)<<4 | it[i*2+1]//16
    return r
I think my earlier test just specified an array.array as input, but following the comments I've now just tried
from cpython cimport array
import array

def packa(array.array a):
    """Cythonize python nibble packing loop, typed"""
    cdef unsigned int n = len(a)//2
    cdef unsigned int i
    cdef unsigned int b[256*64/2]
    for i in range(n):
        b[i] = (a[i*2]//16)<<4 | a[i*2+1]//16;
    cdef array.array c = array.array("B", b)
    return c
which compiles but
ima = array.array("B", imd) # unsigned char (1 Byte)
pa = packa(ima)
packed = pa.tolist()
segfaults.
I find the documentation a bit sparse, so any hints on what the problem is here and how to allocate the array for output data are appreciated.
Taking @ead's first approach, plus combining division and shifting (seems to save a microsecond):
#cython: boundscheck=False, wraparound=False

def packa(char[::1] a):
    """Cythonize python nibble packing loop, typed with array"""
    cdef unsigned int n = len(a)//2
    cdef unsigned int i
    # cdef unsigned int b[256*64/2]
    cdef array.array res = array.array('B', [])
    array.resize(res, n)
    for i in range(n):
        res.data.as_chars[i] = ( a[i*2] & 0xF0 ) | (a[i*2+1] >> 4);
    return res
compiles much longer, but runs much faster:
python3 -m timeit -s 'from pack import packa; import array; data = array.array("B", bytes([0]*256*64))' 'packa(data)'
1000 loops, best of 3: 236 usec per loop
Amazing! But, with the additional bytes-to-array and array-to-list conversion
ima = array.array("B", imd) # unsigned char (1 Byte)
pa = packa(ima)
packed = pa.tolist() # bytes would probably also do
it now only takes about 1.7 ms - very cool!
Down to 150 us timed or approx. 0.4 ms actual:
from cython cimport boundscheck, wraparound
from cpython cimport array
import array

@boundscheck(False)
@wraparound(False)
def pack(const unsigned char[::1] di):
    cdef:
        unsigned int i, n = len(di)
        unsigned char h, l, r
        array.array do = array.array('B')

    array.resize(do, n>>1)

    for i in range(0, n, 2):
        h = di[i] & 0xF0
        l = di[i+1] >> 4
        r = h | l
        do.data.as_uchars[i>>1] = r

    return do
I'm not converting the result array to a list anymore; this is done automatically by py-spidev when writing, and the total time is about the same: 10 ms (@ 10 MHz).
If you want to be as fast as C, you should not use a list with Python integers inside but an array.array. It is possible to get a speed-up of around 140 over your Python+list code by using Cython+array.array.
Here are some ideas on how to make your code faster with Cython. As a benchmark I chose a list with 1000 elements (big enough, and cache misses have no effect yet):
import random
l=[random.randint(0,15) for _ in range(1000)]
As baseline, your python-implementation with list:
def packx(it):
    n = len(it)//2
    r = [0]*n
    for i in range(n):
        r[i] = (it[i*2]%16)<<4 | it[i*2+1]%16
    return r
%timeit packx(l)
143 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
By the way, I use % instead of //, which is probably what you want, otherwise you would get only 0s as result (only lower bits have data in your description).
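A small plain-Python check makes the difference between the two readings concrete. This is only an illustrative sketch; the function names pack_low_nibbles and pack_high_nibbles are made up here, and which variant applies depends on whether the data sits in the low or the high four bits of each input byte.

```python
def pack_low_nibbles(data):
    """Combine the low 4 bits of each pair of values into one byte."""
    return [((data[i] % 16) << 4) | (data[i + 1] % 16)
            for i in range(0, len(data) - 1, 2)]

def pack_high_nibbles(data):
    """Combine the high 4 bits of each pair of byte values into one byte."""
    return [(data[i] & 0xF0) | (data[i + 1] >> 4)
            for i in range(0, len(data) - 1, 2)]

print([hex(b) for b in pack_low_nibbles([0x0A, 0x0B])])   # ['0xab']
print([hex(b) for b in pack_high_nibbles([0xA0, 0xB0])])  # ['0xab']

# With // 16, inputs that only use the low nibble collapse to zero:
print([(a // 16) << 4 | b // 16 for a, b in [(0x0A, 0x0B)]])  # [0]
```

A trailing unpaired element is silently dropped in both helpers, matching the len(it)//2 loop bound used in the thread.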
After cythonizing the same function (with %%cython-magic) we get a speed-up of around 2:
%timeit packx(l)
77.6 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Let's look at the html produced by option -a, we see the following for the line corresponding to the for-loop:
.....
__pyx_t_2 = PyNumber_Multiply(__pyx_v_i, __pyx_int_2); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 6, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_2);
__pyx_t_5 = PyObject_GetItem(__pyx_v_it, __pyx_t_2); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 6, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_5);
__Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
__pyx_t_2 = __Pyx_PyInt_RemainderObjC(__pyx_t_5, __pyx_int_16, 16, 0); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 6, __pyx_L1_error)
...
PyNumber_Multiply means that we use slow Python multiplication, and __Pyx_DECREF shows that all temporaries are slow Python objects. We need to change that!
Let's pass not a list but an array.array of bytes to our function and return an array.array of bytes back. Lists have full-fledged Python objects inside, while array.array holds lowly raw C data, which is faster:
%%cython
from cpython cimport array

def cy_apackx(char[::1] it):
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    cdef array.array res = array.array('b', [])
    array.resize(res, n)
    for i in range(n):
        res[i] = (it[i*2]%16)<<4 | it[i*2+1]%16
    return res
import array
a=array.array('B', l)
%timeit cy_apackx(a)
19.2 µs ± 316 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Better, but let's take a look at the generated html; there is still some slow Python code:
__pyx_t_2 = __Pyx_PyInt_From_long(((__Pyx_mod_long((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_7)) ))), 16) << 4) | __Pyx_mod_long((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_8)) ))), 16))); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 9, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_2);
if (unlikely(__Pyx_SetItemInt(((PyObject *)__pyx_v_res), __pyx_v_i, __pyx_t_2, unsigned int, 0, __Pyx_PyInt_From_unsigned_int, 0, 0, 1) < 0)) __PYX_ERR(0, 9, __pyx_L1_error)
__Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
We still use a Python setter for the array (__Pyx_SetItemInt), and for this a Python object __pyx_t_2 is needed. To avoid this we use array.data.as_chars:
%%cython
from cpython cimport array

def cy_apackx(char[::1] it):
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    cdef array.array res = array.array('B', [])
    array.resize(res, n)
    for i in range(n):
        res.data.as_chars[i] = (it[i*2]%16)<<4 | it[i*2+1]%16 ##HERE!
    return res
%timeit cy_apackx(a)
1.86 µs ± 30.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Much better, but let's take a look at html again, and we see some calls to __Pyx_RaiseBufferIndexError - this safety costs some time, so let's switch it off:
%%cython
from cpython cimport array
import cython

@cython.boundscheck(False) # switch off safety checks
@cython.wraparound(False) # switch off safety checks
def cy_apackx(char[::1] it):
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    cdef array.array res = array.array('B', [])
    array.resize(res, n)
    for i in range(n):
        res.data.as_chars[i] = (it[i*2]%16)<<4 | it[i*2+1]%16 ##HERE!
    return res
%timeit cy_apackx(a)
1.53 µs ± 11.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
When we look at the generated html, we see:
__pyx_t_7 = (__pyx_v_i * 2);
__pyx_t_8 = ((__pyx_v_i * 2) + 1);
(__pyx_v_res->data.as_chars[__pyx_v_i]) = ((__Pyx_mod_long((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_7)) ))), 16) << 4) | __Pyx_mod_long((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_8)) ))), 16));
No python-stuff! Good so far. However, I'm not sure about __Pyx_mod_long, its definition is:
static CYTHON_INLINE long __Pyx_mod_long(long a, long b) {
    long r = a % b;
    r += ((r != 0) & ((r ^ b) < 0)) * b;
    return r;
}
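To see what the correction term does, the function can be transcribed into Python. This is a sketch for illustration only: c_rem and pyx_mod are hypothetical helper names, with c_rem emulating C's truncating % operator, which Python does not have built in for integers.

```python
def c_rem(a, b):
    """Remainder with C semantics: the quotient is truncated toward zero."""
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    return a - b * q

def pyx_mod(a, b):
    """Python-level transcription of the correction in __Pyx_mod_long."""
    r = c_rem(a, b)
    r += ((r != 0) & ((r ^ b) < 0)) * b  # add b when r and b differ in sign
    return r

print(c_rem(-1, 16), pyx_mod(-1, 16), -1 % 16)        # -1 15 15
print(all((x % 16) == (x & 15) for x in range(256)))  # True
```

For non-negative operands the correction never fires, which is why x % 16 can be replaced by x & 15 for byte values.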
So C and Python differ in how they compute mod for negative numbers, and that must be taken into account. This function definition, albeit inlined, will prevent the C compiler from optimizing a%16 into a&15. We have only positive numbers, so there is no need to care about negative ones; thus we need to do the a&15 trick ourselves:
%%cython
from cpython cimport array
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_apackx(char[::1] it):
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    cdef array.array res = array.array('B', [])
    array.resize(res, n)
    for i in range(n):
        res.data.as_chars[i] = (it[i*2]&15)<<4 | (it[i*2+1]&15)
    return res
%timeit cy_apackx(a)
1.02 µs ± 8.63 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
I'm also satisfied with the resulting C code/html (only one line):
(__pyx_v_res->data.as_chars[__pyx_v_i]) = ((((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_7)) ))) & 15) << 4) | ((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_8)) ))) & 15));
Conclusion: in sum, that means a speed-up of 140 (140 µs vs 1.02 µs), not bad! Another interesting point: the calculation itself takes about 2 µs (and that includes less-than-optimal bounds checking and division), while 138 µs go into creating, registering and deleting temporary Python objects.
If you need the upper bits and can assume that the lower bits are clean (otherwise &240 can help), you can use:
from cpython cimport array
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_apackx(char[::1] it):
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    cdef array.array res = array.array('B', [])
    array.resize(res, n)
    for i in range(n):
        res.data.as_chars[i] = it[i*2] | (it[i*2+1]>>4)
    return res
%timeit cy_apackx(a)
819 ns ± 8.24 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Another interesting question is what the operations cost if a list is used. If we start with the "improved" version:
%%cython
def cy_packx(it):
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    res=[0]*n
    for i in range(n):
        res[i] = it[i*2] | (it[i*2+1]>>4)
    return res
%timeit cy_packx(l)
20.7 µs ± 450 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
we see that reducing the number of integer operations leads to a big speed-up. That is because Python integers are immutable and every operation creates a new temporary object, which is costly. Eliminating operations also eliminates costly temporaries.
However, it[i*2] | (it[i*2+1]>>4) is still done with Python integers; as the next step we turn it into cdef operations:
%%cython
def cy_packx(it):
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    cdef unsigned char a,b
    res=[0]*n
    for i in range(n):
        a=it[i*2]
        b=it[i*2+1] # ensures next operations are fast
        res[i]= a | (b>>4)
    return res
%timeit cy_packx(l)
7.3 µs ± 880 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I don't know how it can be improved further, so we have 7.3 µs for list vs. 1 µs for array.array.
Last question: what is the cost breakdown of the list version? In order to avoid the loop being optimized away by the C compiler, we use a slightly different baseline function:
%%cython
def cy_packx(it):
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    cdef unsigned char a,b
    cdef unsigned char s = 0
    res=[0]*n
    for i in range(n):
        a=it[i*2]
        b=it[i*2+1] # ensures next operations are fast
        s+=a | (b>>4)
        res[i]= s
    return res
In [79]: %timeit cy_packx(l)
7.67 µs ± 106 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Because the s variable is used, the loop does not get optimized away in the second version:
%%cython
def cy_packx(it):
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    cdef unsigned char a,b
    cdef unsigned char s = 0
    res=[0]*n
    for i in range(n):
        a=it[i*2]
        b=it[i*2+1] # ensures next operations are fast
        s+=a | (b>>4)
    res[0]=s
    return res
In [81]: %timeit cy_packx(l)
5.46 µs ± 72.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
About 2 µs, or about 30%, is the cost of creating new integer objects. What is the cost of the memory allocation?
%%cython
def cy_packx(it):
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    cdef unsigned char a,b
    cdef unsigned char s = 0
    for i in range(n):
        a=it[i*2]
        b=it[i*2+1] # ensures next operations are fast
        s+=a | (b>>4)
    return s
In [84]: %timeit cy_packx(l)
3.84 µs ± 43.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
That leads to the following performance breakdown of the list version:
                    Time (µs)   Percentage (%)
all                 7.7         100
calculation         1           12
alloc memory        1.6         21
create ints         2.2         29
access data/cast    2.6         38
I must confess, I expected create ints to play a bigger role and didn't think accessing the data in the list and casting it to chars would cost that much.
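The breakdown follows from subtracting the successive timings quoted above. A short sketch of that arithmetic, assuming the 1 µs for the calculation itself is taken from the array.array version (that attribution is my reading, it is not stated explicitly):

```python
# Timings in µs as quoted in the text above.
full_version = 7.67   # stores a new int in res[i] on every iteration
single_store = 5.46   # only res[0] = s after the loop
no_list = 3.84        # no result list at all
calculation = 1.0     # assumption: cost of the array.array version

breakdown = {
    "create ints": full_version - single_store,
    "alloc memory": single_store - no_list,
    "access data/cast": no_list - calculation,
    "calculation": calculation,
}
for name, t in breakdown.items():
    print(f"{name:18s}{t:5.2f} µs{100 * t / full_version:6.1f} %")
```

Computed this way, the residual for accessing and casting the data comes out at about 2.8 µs, slightly above the 2.6 µs listed, so the table entries should be read as rounded estimates.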
I have been trying to work with Cython and I encountered the following peculiar scenario where a sum function over an array takes 3 times the amount of time that the average of an array takes.
Here are my three functions
cpdef FLOAT_t cython_sum(cnp.ndarray[FLOAT_t, ndim=1] A):
    cdef double [:] x = A
    cdef double sum = 0
    cdef unsigned int N = A.shape[0]
    for i in xrange(N):
        sum += x[i]
    return sum

cpdef FLOAT_t cython_avg(cnp.ndarray[FLOAT_t, ndim=1] A):
    cdef double [:] x = A
    cdef double sum = 0
    cdef unsigned int N = A.shape[0]
    for i in xrange(N):
        sum += x[i]
    return sum/N

cpdef FLOAT_t cython_silly_avg(cnp.ndarray[FLOAT_t, ndim=1] A):
    cdef unsigned int N = A.shape[0]
    return cython_avg(A)*N
Here are the run times in ipython
In [7]: A = np.random.random(1000000)
In [8]: %timeit np.sum(A)
1000 loops, best of 3: 906 us per loop
In [9]: %timeit np.mean(A)
1000 loops, best of 3: 919 us per loop
In [10]: %timeit cython_avg(A)
1000 loops, best of 3: 896 us per loop
In [11]: %timeit cython_sum(A)
100 loops, best of 3: 2.72 ms per loop
In [12]: %timeit cython_silly_avg(A)
1000 loops, best of 3: 862 us per loop
I am unable to account for the jump in run time of the simple cython_sum. Is it because of some memory allocation? These are random numbers from 0 to 1, so the sum is around 500K.
Since line_profiler doesn't work with Cython, I was unable to profile my code.
It seems like the results from @nbren12 are the definitive answer: these results cannot be reproduced.
The evidence (and logic) indicate that both methods have the same run time.
The Cython documentation on typed memory views list three ways of assigning to a typed memory view:
from a raw C pointer,
from a np.ndarray and
from a cython.view.array.
Assume that I don't have data passed in to my cython function from outside but instead want to allocate memory and return it as a np.ndarray, which of those options do I chose? Also assume that the size of that buffer is not a compile-time constant i.e. I can't allocate on the stack, but would need to malloc for option 1.
The 3 options would therefore look something like this:
from libc.stdlib cimport malloc, free
cimport numpy as np
from cython cimport view

np.import_array()

def memview_malloc(int N):
    cdef int * m = <int *>malloc(N * sizeof(int))
    cdef int[::1] b = <int[:N]>m
    free(<void *>m)

def memview_ndarray(int N):
    cdef int[::1] b = np.empty(N, dtype=np.int32)

def memview_cyarray(int N):
    cdef int[::1] b = view.array(shape=(N,), itemsize=sizeof(int), format="i")
What is surprising to me is that in all three cases, Cython generates quite a lot of code for the memory allocation, in particular a call to __Pyx_PyObject_to_MemoryviewSlice_dc_int. This suggests (and I might be wrong here, my insight into the inner workings of Cython are very limited) that it first creates a Python object and then "casts" it into a memory view, which seems unnecessary overhead.
A simple benchmark doesn't reveal much difference between the three methods, with 2. being the fastest by a thin margin.
Which of the three methods is recommended? Or is there a different, better option?
Follow-up question: I want to finally return the result as a np.ndarray, after having worked with that memory view in the function. Is a typed memory view the best choice or would I rather just use the old buffer interface as below to create an ndarray in the first place?
cdef np.ndarray[DTYPE_t, ndim=1] b = np.empty(N, dtype=np.int32)
Look here for an answer.
The basic idea is that you want cpython.array.array and cpython.array.clone (not cython.array.*):
from cpython.array cimport array, clone
# This type is what you want and can be cast to things of
# the "double[:]" syntax, so no problems there
cdef array[double] armv, templatemv
templatemv = array('d')
# This is fast
armv = clone(templatemv, L, False)
EDIT
It turns out that the benchmarks in that thread were rubbish. Here's my set, with my timings:
# cython: language_level=3
# cython: boundscheck=False
# cython: wraparound=False

import time
import sys

from cpython.array cimport array, clone
from cython.view cimport array as cvarray
from libc.stdlib cimport malloc, free
import numpy as numpy
cimport numpy as numpy

cdef int loops

def timefunc(name):
    def timedecorator(f):
        cdef int L, i

        print("Running", name)
        for L in [1, 10, 100, 1000, 10000, 100000, 1000000]:
            # time.clock() was removed in Python 3.8; perf_counter() replaces it
            start = time.perf_counter()
            f(L)
            end = time.perf_counter()
            print(format((end-start) / loops * 1e6, "2f"), end=" ")
            sys.stdout.flush()

        print("μs")
    return timedecorator
print()
print("INITIALISATIONS")
loops = 100000

@timefunc("cpython.array buffer")
def _(int L):
    cdef int i
    cdef array[double] arr, template = array('d')

    for i in range(loops):
        arr = clone(template, L, False)

    # Prevents dead code elimination
    str(arr[0])

@timefunc("cpython.array memoryview")
def _(int L):
    cdef int i
    cdef double[::1] arr
    cdef array template = array('d')

    for i in range(loops):
        arr = clone(template, L, False)

    # Prevents dead code elimination
    str(arr[0])

@timefunc("cpython.array raw C type")
def _(int L):
    cdef int i
    cdef array arr, template = array('d')

    for i in range(loops):
        arr = clone(template, L, False)

    # Prevents dead code elimination
    str(arr[0])

@timefunc("numpy.empty_like memoryview")
def _(int L):
    cdef int i
    cdef double[::1] arr
    template = numpy.empty((L,), dtype='double')

    for i in range(loops):
        arr = numpy.empty_like(template)

    # Prevents dead code elimination
    str(arr[0])

@timefunc("malloc")
def _(int L):
    cdef int i
    cdef double* arrptr

    for i in range(loops):
        arrptr = <double*> malloc(sizeof(double) * L)
        free(arrptr)

    # Prevents dead code elimination
    str(arrptr[0])

@timefunc("malloc memoryview")
def _(int L):
    cdef int i
    cdef double* arrptr
    cdef double[::1] arr

    for i in range(loops):
        arrptr = <double*> malloc(sizeof(double) * L)
        arr = <double[:L]>arrptr
        free(arrptr)

    # Prevents dead code elimination
    str(arr[0])

@timefunc("cvarray memoryview")
def _(int L):
    cdef int i
    cdef double[::1] arr

    for i in range(loops):
        arr = cvarray((L,),sizeof(double),'d')

    # Prevents dead code elimination
    str(arr[0])
print()
print("ITERATING")
loops = 1000

@timefunc("cpython.array buffer")
def _(int L):
    cdef int i
    cdef array[double] arr = clone(array('d'), L, False)
    cdef double d

    for i in range(loops):
        for i in range(L):
            d = arr[i]

    # Prevents dead-code elimination
    str(d)

@timefunc("cpython.array memoryview")
def _(int L):
    cdef int i
    cdef double[::1] arr = clone(array('d'), L, False)
    cdef double d

    for i in range(loops):
        for i in range(L):
            d = arr[i]

    # Prevents dead-code elimination
    str(d)

@timefunc("cpython.array raw C type")
def _(int L):
    cdef int i
    cdef array arr = clone(array('d'), L, False)
    cdef double d

    for i in range(loops):
        for i in range(L):
            d = arr[i]

    # Prevents dead-code elimination
    str(d)

@timefunc("numpy.empty_like memoryview")
def _(int L):
    cdef int i
    cdef double[::1] arr = numpy.empty((L,), dtype='double')
    cdef double d

    for i in range(loops):
        for i in range(L):
            d = arr[i]

    # Prevents dead-code elimination
    str(d)

@timefunc("malloc")
def _(int L):
    cdef int i
    cdef double* arrptr = <double*> malloc(sizeof(double) * L)
    cdef double d

    for i in range(loops):
        for i in range(L):
            d = arrptr[i]

    free(arrptr)

    # Prevents dead-code elimination
    str(d)

@timefunc("malloc memoryview")
def _(int L):
    cdef int i
    cdef double* arrptr = <double*> malloc(sizeof(double) * L)
    cdef double[::1] arr = <double[:L]>arrptr
    cdef double d

    for i in range(loops):
        for i in range(L):
            d = arr[i]

    free(arrptr)

    # Prevents dead-code elimination
    str(d)

@timefunc("cvarray memoryview")
def _(int L):
    cdef int i
    cdef double[::1] arr = cvarray((L,),sizeof(double),'d')
    cdef double d

    for i in range(loops):
        for i in range(L):
            d = arr[i]

    # Prevents dead-code elimination
    str(d)
Output:
INITIALISATIONS
Running cpython.array buffer
0.100040 0.097140 0.133110 0.121820 0.131630 0.108420 0.112160 μs
Running cpython.array memoryview
0.339480 0.333240 0.378790 0.445720 0.449800 0.414280 0.414060 μs
Running cpython.array raw C type
0.048270 0.049250 0.069770 0.074140 0.076300 0.060980 0.060270 μs
Running numpy.empty_like memoryview
1.006200 1.012160 1.128540 1.212350 1.250270 1.235710 1.241050 μs
Running malloc
0.021850 0.022430 0.037240 0.046260 0.039570 0.043690 0.030720 μs
Running malloc memoryview
1.640200 1.648000 1.681310 1.769610 1.755540 1.804950 1.758150 μs
Running cvarray memoryview
1.332330 1.353910 1.358160 1.481150 1.517690 1.485600 1.490790 μs
ITERATING
Running cpython.array buffer
0.010000 0.027000 0.091000 0.669000 6.314000 64.389000 635.171000 μs
Running cpython.array memoryview
0.013000 0.015000 0.058000 0.354000 3.186000 33.062000 338.300000 μs
Running cpython.array raw C type
0.014000 0.146000 0.979000 9.501000 94.160000 916.073000 9287.079000 μs
Running numpy.empty_like memoryview
0.042000 0.020000 0.057000 0.352000 3.193000 34.474000 333.089000 μs
Running malloc
0.002000 0.004000 0.064000 0.367000 3.599000 32.712000 323.858000 μs
Running malloc memoryview
0.019000 0.032000 0.070000 0.356000 3.194000 32.100000 327.929000 μs
Running cvarray memoryview
0.014000 0.026000 0.063000 0.351000 3.209000 32.013000 327.890000 μs
(The reason for the "iterations" benchmark is that some methods have surprisingly different characteristics in this respect.)
In order of initialisation speed:
malloc: This is a harsh world, but it's fast. If you need to allocate a lot of things and have unhindered iteration and indexing performance, this has to be it. But normally you're a good bet for...
cpython.array raw C type: Well damn, it's fast. And it's safe. Unfortunately it goes through Python to access its data fields. You can avoid that by using a wonderful trick:
arr.data.as_doubles[i]
which brings it up to the standard speed while removing safety! This makes it a wonderful replacement for malloc, being basically a pretty reference-counted version!
cpython.array buffer: Coming in at only three to four times the setup time of malloc, this looks like a wonderful bet. Unfortunately it has significant overhead (albeit small compared to the boundscheck and wraparound directives). That means it only really competes against full-safety variants, but it is the fastest of those to initialise. Your choice.
cpython.array memoryview: This is now an order of magnitude slower than malloc to initialise. That's a shame, but it iterates just as fast. This is the standard solution that I would suggest unless boundscheck or wraparound are on (in which case cpython.array buffer might be a more compelling tradeoff).
The rest. The only one worth anything is numpy's, due to the many fun methods attached to the objects. That's it, though.
As a follow-up to Veedrac's answer: be aware that using the memoryview support of cpython.array with Python 2.7 currently appears to lead to memory leaks. This seems to be a long-standing issue, as it is mentioned on the cython-users mailing list here in a post from November 2012. Running Veedrac's benchmark script with Cython version 0.22 under both Python 2.7.6 and Python 2.7.9 leads to a large memory leak when initialising a cpython.array using either the buffer or the memoryview interface. No memory leaks occur when running the script with Python 3.4. I've filed a bug report on this to the Cython developers mailing list.
I need to implement a function for summing the elements of an array with a variable section length.
So,
a = np.arange(10)
section_lengths = np.array([3, 2, 4])
out = accumulate(a, section_lengths)
print out
array([ 3., 7., 35.])
I attempted an implementation in cython here:
https://gist.github.com/2784725
for performance I am comparing to the pure numpy solution for the case where the section_lengths are all the same:
LEN = 10000
b = np.ones(LEN, dtype=np.int) * 2000
a = np.arange(np.sum(b), dtype=np.double)
out = np.zeros(LEN, dtype=np.double)
%timeit np.sum(a.reshape(-1,2000), axis=1)
10 loops, best of 3: 25.1 ms per loop
%timeit accumulate.accumulate(a, b, out)
10 loops, best of 3: 64.6 ms per loop
would you have any suggestion for improving performance?
You might try some of the following:
In addition to the @cython.boundscheck(False) compiler directive, also try adding @cython.wraparound(False)
In your setup.py script, try adding in some optimization flags:
ext_modules = [Extension("accumulate", ["accumulate.pyx"], extra_compile_args=["-O3",])]
Take a look at the .html file generated by cython -a accumulate.pyx to see if there are sections that are missing static typing or relying heavily on Python C-API calls:
http://docs.cython.org/src/quickstart/cythonize.html#determining-where-to-add-types
Add a return statement at the end of the method. Currently it is doing a bunch of unnecessary error checking in your tight loop at i_el += 1.
Not sure if it will make a difference but I tend to make loop counters cdef unsigned int rather than just int
You also might compare your code to numpy when section_lengths are unequal, since it will probably require a bit more than just a simple sum.
Updating out[i_bas] inside the nested for loop is slow. You can create a temporary variable to do the accumulation, and update out[i_bas] once the nested for loop has finished. The following code will be as fast as the numpy version:
import numpy as np
cimport numpy as np
ctypedef np.int_t DTYPE_int_t
ctypedef np.double_t DTYPE_double_t
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def accumulate(
        np.ndarray[DTYPE_double_t, ndim=1] a not None,
        np.ndarray[DTYPE_int_t, ndim=1] section_lengths not None,
        np.ndarray[DTYPE_double_t, ndim=1] out not None,
        ):
    cdef int i_el, i_bas, sec_length, lenout
    cdef double tmp
    lenout = out.shape[0]
    i_el = 0
    for i_bas in range(lenout):
        tmp = 0
        for sec_length in range(section_lengths[i_bas]):
            tmp += a[i_el]
            i_el+=1
        out[i_bas] = tmp
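For checking the compiled function on unequal section lengths, a slow but obviously correct plain-Python reference can help. This is only a sketch; the name accumulate_reference and the example lengths [3, 2, 5] are my own and chosen so that the sections cover the whole input.

```python
def accumulate_reference(a, section_lengths):
    """Sum consecutive sections of a; section i has section_lengths[i] items."""
    out = []
    start = 0
    for length in section_lengths:
        out.append(float(sum(a[start:start + length])))
        start += length
    return out

print(accumulate_reference(list(range(10)), [3, 2, 5]))  # [3.0, 7.0, 35.0]
```

Comparing its output with the values the Cython version writes into out is a cheap sanity test before benchmarking.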