Related
I have successfully used Cython for the first time to significantly speed up packing nibbles from one list of integers (bytes) into another (see Faster bit-level data packing), e.g. packing the two sequential bytes 0x0A and 0x0B into 0xAB.
def pack(it):
"""Cythonize python nibble packing loop, typed"""
cdef unsigned int n = len(it)//2
cdef unsigned int i
return [ (it[i*2]//16)<<4 | it[i*2+1]//16 for i in range(n) ]
While the resulting speed is satisfactory, I am curious whether this can be taken further by making better use of the input and output lists.
cython3 -a pack.cyx generates a very "cythonic" HTML report that I unfortunately am not experienced enough to draw any useful conclusions from.
From a C point of view the loop should "simply" access two unsigned int arrays. Possibly, using a wider data type (16/32 bit) could further speed this up proportionally.
The question is: (how) can Python [binary/immutable] sequence types be typed as unsigned int array for Cython?
Using array as suggested in How to convert python array to cython array? does not seem to make it faster (and the array needs to be created from bytes object beforehand), nor does typing the parameter as list instead of object (same as no type) or using for loop instead of list comprehension:
def packx(list it):
"""Cythonize python nibble packing loop, typed"""
cdef unsigned int n = len(it)//2
cdef unsigned int i
cdef list r = [0]*n
for i in range(n):
r[i] = (it[i*2]//16)<<4 | it[i*2+1]//16
return r
I think my earlier test just specified an array.array as input, but following the comments I've now just tried
from cpython cimport array
import array
def packa(array.array a):
"""Cythonize python nibble packing loop, typed"""
cdef unsigned int n = len(a)//2
cdef unsigned int i
cdef unsigned int b[256*64/2]
for i in range(n):
b[i] = (a[i*2]//16)<<4 | a[i*2+1]//16;
cdef array.array c = array.array("B", b)
return c
which compiles but
ima = array.array("B", imd) # unsigned char (1 Byte)
pa = packa(ima)
packed = pa.tolist()
segfaults.
I find the documentation a bit sparse, so any hints on what the problem is here and how to allocate the array for output data are appreciated.
Taking #ead's first approach, plus combining division and shifting (seems to save a microsecond:
#cython: boundscheck=False, wraparound=False
def packa(char[::1] a):
"""Cythonize python nibble packing loop, typed with array"""
cdef unsigned int n = len(a)//2
cdef unsigned int i
# cdef unsigned int b[256*64/2]
cdef array.array res = array.array('B', [])
array.resize(res, n)
for i in range(n):
res.data.as_chars[i] = ( a[i*2] & 0xF0 ) | (a[i*2+1] >> 4);
return res
compiles much longer, but runs much faster:
python3 -m timeit -s 'from pack import packa; import array; data = array.array("B", bytes([0]*256*64))' 'packa(data)'
1000 loops, best of 3: 236 usec per loop
Amazing! But, with the additional bytes-to-array and array-to-list conversion
ima = array.array("B", imd) # unsigned char (1 Byte)
pa = packa(ima)
packed = pa.tolist() # bytes would probably also do
it now only takes about 1.7 ms - very cool!
Down to 150 us timed or approx. 0.4 ms actual:
from cython cimport boundscheck, wraparound
from cpython cimport array
import array
#boundscheck(False)
#wraparound(False)
def pack(const unsigned char[::1] di):
cdef:
unsigned int i, n = len(di)
unsigned char h, l, r
array.array do = array.array('B')
array.resize(do, n>>1)
for i in range(0, n, 2):
h = di[i] & 0xF0
l = di[i+1] >> 4
r = h | l
do.data.as_uchars[i>>1] = r
return do
I'm not converting the result array to a list anymore, this is done automatically by py-spidev when writing, and the total time is about the same: 10 ms (# 10 MHz).
If you wanna to be as fast as C you should not use list with python-integers inside but an array.array. It is possible to get a speed-up of around 140 for your python+list code by using cython+array.array.
Here are some ideas how to make your code faster with cython. As benchmark I choose a list with 1000 elements (big enough and cache-misses have no effects yet):
import random
l=[random.randint(0,15) for _ in range(1000)]
As baseline, your python-implementation with list:
def packx(it):
n = len(it)//2
r = [0]*n
for i in range(n):
r[i] = (it[i*2]%16)<<4 | it[i*2+1]%16
return r
%timeit packx(l)
143 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
By the way, I use % instead of //, which is probably what you want, otherwise you would get only 0s as result (only lower bits have data in your description).
After cythonizing the same function (with %%cython-magic) we get a speed-up of around 2:
%timeit packx(l)
77.6 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Let's look at the html produced by option -a, we see the following for the line corresponding to the for-loop:
.....
__pyx_t_2 = PyNumber_Multiply(__pyx_v_i, __pyx_int_2); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 6, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_2);
__pyx_t_5 = PyObject_GetItem(__pyx_v_it, __pyx_t_2); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 6, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_5);
__Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
__pyx_t_2 = __Pyx_PyInt_RemainderObjC(__pyx_t_5, __pyx_int_16, 16, 0); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 6, __pyx_L1_error)
...
Py_NumberMultiply means that we use slow python-multiplication, Pyx_DECREF- all temporaries are slow python-objects. We need to change that!
Let's pass not a list but an array.array of bytes to our function and return an array.array of bytes back. Lists have full fledged python objects inside, array.arraythe lowly raw c-data which is faster:
%%cython
from cpython cimport array
def cy_apackx(char[::1] it):
cdef unsigned int n = len(it)//2
cdef unsigned int i
cdef array.array res = array.array('b', [])
array.resize(res, n)
for i in range(n):
res.data.as_chars[i] = (it[i*2]%16)<<4 | it[i*2+1]%16
return res
import array
a=array.array('B', l)
%timeit cy_apackx(a)
19.2 µs ± 316 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Better, but let take a look at the generated html, there is still some slow python-code:
__pyx_t_2 = __Pyx_PyInt_From_long(((__Pyx_mod_long((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_7)) ))), 16) << 4) | __Pyx_mod_long((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_8)) ))), 16))); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 9, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_2);
if (unlikely(__Pyx_SetItemInt(((PyObject *)__pyx_v_res), __pyx_v_i, __pyx_t_2, unsigned int, 0, __Pyx_PyInt_From_unsigned_int, 0, 0, 1) < 0)) __PYX_ERR(0, 9, __pyx_L1_error)
__Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
We still use a python-setter for array (__Pax_SetItemInt) and for this a python objecct __pyx_t_2 is needed, to avoid this we use array.data.as_chars:
%%cython
from cpython cimport array
def cy_apackx(char[::1] it):
cdef unsigned int n = len(it)//2
cdef unsigned int i
cdef array.array res = array.array('B', [])
array.resize(res, n)
for i in range(n):
res.data.as_chars[i] = (it[i*2]%16)<<4 | it[i*2+1]%16 ##HERE!
return res
%timeit cy_apackx(a)
1.86 µs ± 30.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Much better, but let's take a look at html again, and we see some calls to __Pyx_RaiseBufferIndexError - this safety costs some time, so let's switch it off:
%%cython
from cpython cimport array
import cython
#cython.boundscheck(False) # switch of safety-checks
#cython.wraparound(False) # switch of safety-checks
def cy_apackx(char[::1] it):
cdef unsigned int n = len(it)//2
cdef unsigned int i
cdef array.array res = array.array('B', [])
array.resize(res, n)
for i in range(n):
res.data.as_chars[i] = (it[i*2]%16)<<4 | it[i*2+1]%16 ##HERE!
return res
%timeit cy_apackx(a)
1.53 µs ± 11.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
When we look at the generated html, we see:
__pyx_t_7 = (__pyx_v_i * 2);
__pyx_t_8 = ((__pyx_v_i * 2) + 1);
(__pyx_v_res->data.as_chars[__pyx_v_i]) = ((__Pyx_mod_long((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_7)) ))), 16) << 4) | __Pyx_mod_long((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_8)) ))), 16));
No python-stuff! Good so far. However, I'm not sure about __Pyx_mod_long, its definition is:
static CYTHON_INLINE long __Pyx_mod_long(long a, long b) {
long r = a % b;
r += ((r != 0) & ((r ^ b) < 0)) * b;
return r;
}
So C and Python have differences for mod of negative numbers and it must be taken into account. This function-definition, albeit inlined, will prevent the C-compiler from optimizing a%16 as a&15. We have only positive numbers, so no need to care about them, thus we need to do the a&15-trick by ourselves:
%%cython
from cpython cimport array
import cython
#cython.boundscheck(False)
#cython.wraparound(False)
def cy_apackx(char[::1] it):
cdef unsigned int n = len(it)//2
cdef unsigned int i
cdef array.array res = array.array('B', [])
array.resize(res, n)
for i in range(n):
res.data.as_chars[i] = (it[i*2]&15)<<4 | (it[i*2+1]&15)
return res
%timeit cy_apackx(a)
1.02 µs ± 8.63 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
I'm also satified with the resulting C-code/html (only one line):
(__pyx_v_res->data.as_chars[__pyx_v_i]) = ((((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_7)) ))) & 15) << 4) | ((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_8)) ))) & 15));
Conclusion: In the sum that means speed up of 140 (140 µs vs 1.02 µs)- not bad! Another interesting point: the calculation itself takes about 2 µs (and that comprises less than optimal bound checking and division) - 138 µs are for creating, registering and deleting temporary python objects.
If you need the upper bits and can assume that lower bits are without dirt (otherwise &250 can help), you can use:
from cpython cimport array
import cython
#cython.boundscheck(False)
#cython.wraparound(False)
def cy_apackx(char[::1] it):
cdef unsigned int n = len(it)//2
cdef unsigned int i
cdef array.array res = array.array('B', [])
array.resize(res, n)
for i in range(n):
res.data.as_chars[i] = it[i*2] | (it[i*2+1]>>4)
return res
%timeit cy_apackx(a)
819 ns ± 8.24 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Another interesting question is which costs have the operations if list is used. If we start with the "improved" version:
%%cython
def cy_packx(it):
cdef unsigned int n = len(it)//2
cdef unsigned int i
res=[0]*n
for i in range(n):
res[i] = it[i*2] | (it[i*2+1]>>4))
return res
%timeit cy_packx(l)
20.7 µs ± 450 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
we see, that reducing the number of integer operation leads to a big speed-up. That is due to the fact, that python-integers are immutable and every operation creates a new temporary object, which is costly. Eliminating operations means also eliminating costly temporaries.
However, it[i*2] | (it[i*2+1]>>4) is done with python-integer, as next step we make it cdef-operations:
%%cython
def cy_packx(it):
cdef unsigned int n = len(it)//2
cdef unsigned int i
cdef unsigned char a,b
res=[0]*n
for i in range(n):
a=it[i*2]
b=it[i*2+1] # ensures next operations are fast
res[i]= a | (b>>4)
return res
%timeit cy_packx(l)
7.3 µs ± 880 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I don't know how it can be improved further, thus we have 7.3 µs for list vs. 1 µs for array.array.
Last question, what is the costs break down of the list version? I order to avoid being optimized away by the C-compiler, we use a slightly different baseline function:
%%cython
def cy_packx(it):
cdef unsigned int n = len(it)//2
cdef unsigned int i
cdef unsigned char a,b
cdef unsigned char s = 0
res=[0]*n
for i in range(n):
a=it[i*2]
b=it[i*2+1] # ensures next operations are fast
s+=a | (b>>4)
res[i]= s
return res
%timeit cy_packx(l)
In [79]: %timeit cy_packx(l)
7.67 µs ± 106 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The usage of the s variable means, it does not get optimized away in the second version:
%%cython
def cy_packx(it):
cdef unsigned int n = len(it)//2
cdef unsigned int i
cdef unsigned char a,b
cdef unsigned char s = 0
res=[0]*n
for i in range(n):
a=it[i*2]
b=it[i*2+1] # ensures next operations are fast
s+=a | (b>>4)
res[0]=s
return res
In [81]: %timeit cy_packx(l)
5.46 µs ± 72.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
About 2 µs or about 30% are the costs for creating new integer objects. What are the costs of the memory allocation?
%%cython
def cy_packx(it):
cdef unsigned int n = len(it)//2
cdef unsigned int i
cdef unsigned char a,b
cdef unsigned char s = 0
for i in range(n):
a=it[i*2]
b=it[i*2+1] # ensures next operations are fast
s+=a | (b>>4)
return s
In [84]: %timeit cy_packx(l)
3.84 µs ± 43.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
That leads to the following performance break down of the list-version:
Time(in µs) Percentage(in %)
all 7.7 100
calculation 1 12
alloc memory 1.6 21
create ints 2.2 29
access data/cast 2.6 38
I must confess, I expected create ints to play a bigger role and didn't thing accessing the data in the list and casting it to chars will cost that much.
As a follow up of this question here (thanks MSeifert for your help) I came up with the problem that I have to mask a numpy array new_values with an index array new_vals_idx before passing the masked array to update val_dict.
To the proposed solutions in answer of MSeifert in the old post I tried to apply the array masking, but the performance is not satisfying.
The arrays and dicts I used for the following examples are:
import numpy as np
val_dict = {'a': 5.0, 'b': 18.8, 'c': -55/2}
for i in range(200):
val_dict[str(i)] = i
val_dict[i] = i**2
keys = ('b', 123, '89', 'c') # dict keys to update
new_values = np.arange(1, 51, 1) / 1.0 # array with new values which has to be masked
new_vals_idx = np.array((0, 3, 5, -1)) # masking array
valarr = np.zeros((new_vals_idx.shape[0])) # preallocation for masked array
length = new_vals_idx.shape[0]
To make my code-snippets easier to compare with my old question, I'll stick to the function naming of MSeifert's answer. These are my tries to get the best performance out of python/cython (the other answers were left out because of too poor performance):
def old_for(val_dict, keys, new_values, new_vals_idx, length):
for i in range(length):
val_dict[keys[i]] = new_values[new_vals_idx[i]]
%timeit old_for(val_dict, keys, new_values, new_vals_idx, length)
# 1000000 loops, best of 3: 1.6 µs per loop
def old_for_w_valarr(val_dict, keys, new_values, valarr, new_vals_idx, length):
valarr = new_values[new_vals_idx]
for i in range(length):
val_dict[keys[i]] = valarr[i]
%timeit old_for_w_valarr(val_dict, keys, new_values, valarr, new_vals_idx, length)
# 100000 loops, best of 3: 2.33 µs per loop
def new2_w_valarr(val_dict, keys, new_values, valarr, new_vals_idx, length):
valarr = new_values[new_vals_idx].tolist()
for key, val in zip(keys, valarr):
val_dict[key] = val
%timeit new2_w_valarr(val_dict, keys, new_values, valarr, new_vals_idx, length)
# 100000 loops, best of 3: 2.01 µs per loop
Cython functions:
%load_ext cython
%%cython
import numpy as np
cimport numpy as np
cpdef new3_cy(dict val_dict, tuple keys, double[:] new_values, int[:] new_vals_idx, Py_ssize_t length):
cdef Py_ssize_t i
cdef double val # this gives about 10 µs speed boost compared to directly assigning it to val_dict
for i in range(length):
val = new_values[new_vals_idx[i]]
val_dict[keys[i]] = val
%timeit new3_cy(val_dict, keys, new_values, new_vals_idx, length)
# 1000000 loops, best of 3: 1.38 µs per loop
cpdef new3_cy_mview(dict val_dict, tuple keys, double[:] new_values, int[:] new_vals_idx, Py_ssize_t length):
cdef Py_ssize_t i
cdef int[:] mview_idx = new_vals_idx
cdef double [:] mview_vals = new_values
for i in range(length):
val_dict[keys[i]] = mview_vals[mview_idx[i]]
%timeit new3_cy_mview(val_dict, keys, new_values, new_vals_idx, length)
# 1000000 loops, best of 3: 1.38 µs per loop
# NOT WORKING:
cpdef new2_cy_mview(dict val_dict, tuple keys, double[:] new_values, int[:] new_vals_idx, Py_ssize_t length):
cdef double [new_vals_idx] masked_vals = new_values
for key, val in zip(keys, masked_vals.tolist()):
val_dict[key] = val
cpdef new2_cy_mask(dict val_dict, tuple keys, double[:] new_values, valarr, int[:] new_vals_idx, Py_ssize_t length):
valarr = new_values[new_vals_idx]
for key, val in zip(keys, valarr.tolist()):
val_dict[key] = val
The Cython functions new3_cy and new3_cy_mview do not seem to be considerably faster than old_for. Passing valarr to avoid array construction inside the function (as it is going to be called several million times) even seems to slow it down.
Masking in new2_cy_mask with the new_vals_idx array in Cython gives me the error: 'Invalid index for memoryview specified, type int[:]'. Is there any type like Py_ssize_t for arrays of indexes?
Trying to create a masked memoryview in new2_cy_mview gives me the error 'Cannot assign type 'double[:]' to 'double [__pyx_v_new_vals_idx]''. Is there even something like masked memoryviews? I wasn't able to find information on this topic...
Comparing the timing results with those from my old question I guess that the array masking is the process taking up most of the time. And as it is most likely already highly optimized in numpy, there is probably not much to do. But the slow-down is so huge, that there must be (hopefully) a better way to do it.
Any help is appreciated! Thanks in advance!
One thing you can do in the current construction is turn off bounds checking (if it's safe to!). Won't make a huge difference, but some incremental performance.
%%cython
import numpy as np
cimport numpy as np
cimport cython
#cython.boundscheck(False)
#cython.wraparound(False)
cpdef new4_cy(dict val_dict, tuple keys, double[:] new_values, int[:] new_vals_idx, Py_ssize_t length):
cdef Py_ssize_t i
cdef double val # this gives about 10 µs speed boost compared to directly assigning it to val_dict
for i in range(length):
val = new_values[new_vals_idx[i]]
val_dict[keys[i]] = val
In [36]: %timeit new3_cy(val_dict, keys, new_values, new_vals_idx, length)
1.76 µs ± 209 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [37]: %timeit new4_cy(val_dict, keys, new_values, new_vals_idx, length)
1.45 µs ± 31.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
I am trying to understand why filling a 64 bit array is slower than filling a 32 bit array.
Here are the examples:
#cython.boundscheck(False)
#cython.wraparound(False)
def test32(long long int size):
cdef np.ndarray[np.int32_t,ndim=1] index = np.zeros(size, np.int32)
cdef Py_ssize_t i
for i in range(size):
index[i] = i
return indx
#cython.boundscheck(False)
#cython.wraparound(False)
def test64(long long int size):
cdef np.ndarray[np.int64_t,ndim=1] index = np.zeros(size, np.int64)
cdef Py_ssize_t i
for i in range(size):
index[i] = i
return indx
The timings:
In [4]: %timeit test32(1000000)
1000 loops, best of 3: 1.13 ms over loop
In [5]: %timeit test64(1000000)
100 loops, best of 3: 2.04 ms per loop
I am using a 64 bit computer and the Visual C++ for Python (9.0) compiler.
Edit:
Initializing a 64bit array and a 32bit array seems to be taking the same amount of time, which means the time difference is due to the filling process.
In [8]: %timeit np.zeros(1000000,'int32')
100000 loops, best of 3: 2.49 μs per loop
In [9]: %timeit np.zeros(1000000,'int64')
100000 loops, best of 3: 2.49 μs per loop
Edit2:
As DavidW points out, this behavior can be replicated with np.arange, which means that this is expected:
In [7]: %timeit np.arange(1000000,dtype='int32')
10 loops, best of 3: 1.22 ms per loop
In [8]: %timeit np.arange(1000000,dtype='int64')
10 loops, best of 3: 2.03 ms per loop
A quick set of measurements (which I initially posted in the comments) suggests that you see very similar behaviour (64 bits taking twice as long as 32 bits) for np.full, np.ones and np.arange, with all three showing similar times to each other (arange being about 10% slower than the other 2 for me).
I believe this suggests that this is expected behaviour and just the time it takes to fill the memory. (64 bits obviously has twice as much memory as 32 bits).
The interesting question is then why np.zeros is so uniform (and fast) - a complete answer is probably given in this C-based question but the basic summary is that allocating zeros (as done by the C function calloc) can be done lazily - i.e. it doesn't actually allocate or fill much until you actually try to write to the memory.
TLDR: in cython, why (or when?) is iterating over a numpy array faster than iterating over a python list?
Generally:
I've used Cython before and was able to get tremendous speed ups over naive python impl',
However, figuring out what exactly needs to be done seems non-trivial.
Consider the following 3 implementations of a sum() function.
They reside in a cython file called 'cy' (obviously, there's np.sum(), but that's besides my point..)
Naive python:
def sum_naive(A):
s = 0
for a in A:
s += a
return s
Cython with a function that expects a python list:
def sum_list(A):
cdef unsigned long s = 0
for a in A:
s += a
return s
Cython with a function that expects a numpy array.
def sum_np(np.ndarray[np.int64_t, ndim=1] A):
cdef unsigned long s = 0
for a in A:
s += a
return s
I would expect that in terms of running time, sum_np < sum_list < sum_naive, however, the following script demonstrates to the contrary (for completeness, I added np.sum() )
N = 1000000
v_np = np.array(range(N))
v_list = range(N)
%timeit cy.sum_naive(v_list)
%timeit cy.sum_naive(v_np)
%timeit cy.sum_list(v_list)
%timeit cy.sum_np(v_np)
%timeit v_np.sum()
with results:
In [18]: %timeit cyMatching.sum_naive(v_list)
100 loops, best of 3: 18.7 ms per loop
In [19]: %timeit cyMatching.sum_naive(v_np)
1 loops, best of 3: 389 ms per loop
In [20]: %timeit cyMatching.sum_list(v_list)
10 loops, best of 3: 82.9 ms per loop
In [21]: %timeit cyMatching.sum_np(v_np)
1 loops, best of 3: 1.14 s per loop
In [22]: %timeit v_np.sum()
1000 loops, best of 3: 659 us per loop
What's going on?
Why is cython+numpy slow?
P.S.
I do use
#cython: boundscheck=False
#cython: wraparound=False
There is a better way to implement this in cython that at least on my machine beats np.sum because it avoids type checking and other things that numpy normally has to do when dealing with an arbitrary array:
#cython.wraparound=False
#cython.boundscheck=False
cimport numpy as np
def sum_np(np.ndarray[np.int64_t, ndim=1] A):
cdef unsigned long s = 0
for a in A:
s += a
return s
def sum_np2(np.int64_t[::1] A):
cdef:
unsigned long s = 0
size_t k
for k in range(A.shape[0]):
s += A[k]
return s
And then the timings:
N = 1000000
v_np = np.array(range(N))
v_list = range(N)
%timeit sum(v_list)
%timeit sum_naive(v_list)
%timeit np.sum(v_np)
%timeit sum_np(v_np)
%timeit sum_np2(v_np)
10 loops, best of 3: 19.5 ms per loop
10 loops, best of 3: 64.9 ms per loop
1000 loops, best of 3: 1.62 ms per loop
1 loops, best of 3: 1.7 s per loop
1000 loops, best of 3: 1.42 ms per loop
You don't want to iterate over the numpy array via the Python style, but rather access elements using indexing as that it can be translated into pure C, rather than relying on the Python API.
a is untyped and thus there will be lots of conversions from Python to C types and back. These can be slow.
JoshAdel correctly pointed out that instead of iterating though, you should iterate over a range. Cython will convert the indexing to C, which is fast.
Using cython -a myfile.pyx will highlight these sorts of things for you; you want all of your loop logic to be white for maximum speed.
PS: Note that np.ndarray[np.int64_t, ndim=1] is outdated and has been deprecated in favour of the faster and more general long[:].
I have this cython code just for testing:
cimport cython
cpdef loop(int k):
return real_loop(k)
#cython.cdivision
cdef real_loop(int k):
cdef int i
cdef float a
for i in xrange(k):
a = i
a = a**2 / (a + 1)
return a
And I test speed diffence between this cython code and the same code in pure python with a script like that:
import mymodule
print(mymodule.loop(100000))
I get 80times faster. But if I remove the two return statement in cython code , I get 800-900times faster. Why ?
Another things is if I run this code ( with return ) on my old ACER Aspire ONE notebook I get 700times faster and on a new desktop i7 PC at home , 80times faster.
Somebody know why ?
I tested you problem with the following code:
#cython: wraparound=False
#cython: boundscheck=False
#cython: cdivision=True
#cython: nonecheck=False
#cython: profile=False
def loop(int k):
return real_loop(k)
def loop2(int k):
cdef float a
real_loop2(k, &a)
return a
def loop3(int k):
real_loop3(k)
return None
def loop4(int k):
return real_loop4(k)
def loop5(int k):
cdef float a
real_loop5(k, &a)
return a
cdef float real_loop(int k):
cdef int i
cdef float a
a = 0.
for i in range(k):
a += a**2 / (a + 1)
return a
cdef void real_loop2(int k, float *a):
cdef int i
a[0] = 0.
for i in range(k):
a[0] += a[0]**2 / (a[0] + 1)
cdef void real_loop3(int k):
cdef int i
cdef float a
a = 0.
for i in range(k):
a += a**2 / (a + 1)
cdef float real_loop4(int k):
cdef int i
cdef float a
a = 0.
for i in range(k):
a += a*a / (a + 1)
return a
cdef void real_loop5(int k, float *a):
cdef int i
a[0] = 0.
for i in range(k):
a[0] += a[0]*a[0] / (a[0] + 1)
where real_loop() is close to your function, with a modified formula for a since the original one seemed strange.
Function real_loop2() is not returning any value, just updating a by reference.
Function real_loop3() is not returning any value.
Checking the generated C code for real_loop3() one can see that the loop is there, and the code is called... but I had the same conclusion as #dmytro, changing k won't change the timing significantly... so there must be a point that I am missing here.
From the timings below we can say that return is not the bottleneck since real_loop2() and real_loop5() have do not return any value, and their performance is the same as real_loop() and real_loop4(), respectively.
In [2]: timeit _stack.loop(100000)
1000 loops, best of 3: 1.71 ms per loop
In [3]: timeit _stack.loop2(100000)
1000 loops, best of 3: 1.69 ms per loop
In [4]: timeit _stack.loop3(100000)
10000000 loops, best of 3: 78.5 ns per loop
In [5]: timeit _stack.loop4(100000)
1000 loops, best of 3: 913 µs per loop
In [6]: timeit _stack.loop5(100000)
1000 loops, best of 3: 979 µs per loop
Note the ~2X speed-up changing a**2 by a*a, since a**2 requires a function call powf() inside the loop.