Cython actually slowing me down

Cython actually slowing me down - python

I am trying to 'cythonize' my code however while the following code does work, it is not adding any speed to my code (in fact its a shade slower). I am wondering if anyone knows what I am doing wrong if anything at all. Note I am passing a numpy array and its data type should be float16. Never used cython before and I am using Jupyter notebook right now. In the cell above I have Cython loaded.
%%cython
import numpy as np
cimport numpy as np
DTYPE = np.float16
ctypedef np.int_t DTYPE_t
def arrange_waveforms(np.ndarray arr,dim,mintomaxEventslistrange):
import timeit
start_time = timeit.default_timer()
cdef int key
#mintomaxEventslistrange is basically range(2000000)
dictlist = dict((key, [[] for _ in xrange(1536)]) for key in mintomaxEventslistrange)
# A seperate one to hold the timing info (did this to minimize memory of dictlist)
window = dict((key, [[] for _ in xrange(1536)]) for key in mintomaxEventslistrange)
#arrange waveforms
cdef np.ndarray pixel=arr[:,0].astype(int)
cdef int i
cdef int lines = dim[0]
for i in range(lines):
dictlist[arr[i,1]][pixel[i]].extend(arr[i,9:])
window[arr[i,1]][pixel[i]].append(arr[i,6])
elapsed = timeit.default_timer() - start_time
print elapsed
return dictlist,window

Related

Issue converting a python program to cython for speedup

I am trying to write a fast algorithm in Cython. The algorithm is fairly straightforward and is described here (https://arxiv.org/pdf/1411.4357.pdf - the paragraph before Theorem 8 on page 17). The idea is relatively straightforward: assuming that the data is sparse it can be given as input in the form (i,j,A_ij) for a matrix A which has n rows and d columns. Now to take advantage of the sparsity we need two functions h which maps the n rows into s buckets uniformly at random (which is a parameter to the algorithm) and a sign function s which is just ±1 with equal probability. Then for every triple (i,j,A_ij) the algorithm must output (h(i),j, s(h(i))*A_ij) and return these as arrays to be given as input to another sparse.
The problem is that I can't get a speedup. This should run extremely fast and outperform matrix multiplication of realising S and multiplying by A as illustrated in the above reference.
Python approach (roughly 11ms):
import numpy as np
import numpy.random as npr
from scipy.sparse import coo_matrix
def countSketch_faster(input_matrix, sketch_size, seed=None):
'''
input_matrix: sparse - coo_matrix type
sketch_size: int
seed=None : random seed
'''
observed_rows = set([])
sign_hashes = []
hash_table = {}
sketched_data = []
hashed_rows = []
for row_id, col_id, Aij in zip(input_matrix.row, input_matrix.col, input_matrix.data):
if row_id in observed_rows:
sketched_data.append(hash_table[row_id][1]*Aij)
hashed_rows.append(hash_row_val)
else:
hash_row_val, sign_val = np.random.choice(sketch_size), np.random.choice([-1.0,1.0])
hash_table[row_id] = (hash_row_val, sign_val) #hash-sign pair
sketched_data.append(sign_val*Aij)
hashed_rows.append(hash_row_val)
observed_rows.add(row_id)
hashed_rows = np.asarray(hashed_rows)
sketched_data = np.asarray(sketched_data)
row_hashes = np.asarray(row_hashes)
S = coo_matrix((sketched_data, (hashed_rows, input_matrix.col))).tocsr()
return S
Converting to Cython:
Analysing the annotated output highlights that the most python-heavy lines are those which call numpy. I have tried to cdef all variables and specify dtype for any call to numpy. Also, I have removed the zip command and tried to keep the loop simpler and more C-like. All of this has actually had an adverse effect though and it runs very slowly, and I'm not really sure why. It seems like quite a simple algorithm to implement so if someone could help me get the runtime down to something very small I would be extremely grateful.
%%cython --annotate
cimport cython
import numpy.random as npr
import numpy as np
from scipy.sparse import coo_matrix
def countSketch_cython(input_matrix, sketch_size, seed=None):
'''
input_matrix: sparse - coo_matrix type
sketch_size: int
seed=None : random seed
'''
cdef Py_ssize_t idx
cdef int row_id
cdef float data_id
cdef float sketch_data_val
cdef int hash_row_value
cdef float sign_value
hash_table = {}
sketched_data = np.zeros_like(input_matrix.data,dtype=np.float64)
hashed_rows = np.zeros_like(sketched_data,dtype=np.float64)
observed_rows = set([])
for idx in range(input_matrix.row.shape[0]):
row_id = input_matrix.row[idx]
data_id = input_matrix.data[idx]
if row_id in observed_rows:
hash_row_value = hash_table[row_id][0]
sign_value = hash_table[row_id][1]
sketched_data[row_id] = sign_value*data_id
hashed_rows[idx] = hash_row_value
else:
hash_row_val = np.random.randint(low=0,high=sketch_size+1)
sign_val = np.random.choice([-1.0,1.0])
hash_table[row_id] = (hash_row_val, sign_val) #hash-sign pair
sketched_data[idx] = sign_val*data_id
hashed_rows[idx] = hash_row_value
observed_rows.add(row_id)
S = coo_matrix((sketched_data, (hashed_rows, input_matrix.col)))
return S`
UPDATE: I have managed to speed the code up by removing some of the slower lines. These were calls to np.random for which the C random number generator was faster, giving the length in the input so this did not need to be calculated, and not converting to a sparse matrix before returning (for the purpose of this experiment I am interested in how quickly the transform can be done as opposed to details surrounding the conversion for downstream use).
%%cython --annotate
cimport cython
import numpy.random as npr
import numpy as np
from libc.stdlib cimport rand
##cython.boundscheck(False) # these don't contribute much in this
example
##cython.wraparound(False)
def countSketch(input_matrix, input_matrix_length, sketch_size, seed=None):
'''
input_matrix: sparse - coo_matrix type
input_matrix_length - number of rows in input matrix
sketch_size: int
seed=None : random seed
'''
cdef Py_ssize_t idx
cdef int row_id
cdef float data_id
cdef float sketch_data_val
cdef int hash_row_value
cdef float sign_value
cdef int arr_lengths = input_matrix_length
# These two lines are still annotated most boldly
cdef double[:,:] sketched_data =
np.zeros((arr_lengths,1),dtype=np.float64)
cdef double[:,:] hashed_rows = np.zeros((arr_lengths,1))
hash_table = {}
observed_rows = set([])
for idx in range(arr_lengths):
row_id = input_matrix.row[idx]
data_id = input_matrix.data[idx]
if row_id in observed_rows:
hash_row_value = hash_table[row_id][0]
sign_value = hash_table[row_id][1]
sketched_data[row_id] = sign_value*data_id
hashed_rows[idx] = hash_row_value
else:
hash_row_val = rand()%(sketch_size)
#np.random.randint(low=0,high=sketch_size+1)
sign_val = 2*rand()%(2) - 1
#np.random.choice([-1.0,1.0])
hash_table[row_id] = (hash_row_val, sign_val) #hash-sign pair
sketched_data[idx] = sign_val*data_id
hashed_rows[idx] = hash_row_value
observed_rows.add(row_id)
#S = coo_matrix((sketched_data, (hashed_rows, input_matrix.col)), dtype=np.float64)
return hashed_rows, sketched_data
On a random sparse matrix A = scipy.sparse.random(1000, 50, density=0.1) this now achieves 508 µs ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) using %timeit countSketch(A, A.shape[0],100). I should imagine there are still gains to be made as I don't know if I have set the arrays in the best way.

Cython/numpy vs pure numpy for least square fitting

A T.A at school showed me this code as an example of a least square fitting algorithm.
import numpy as np
#return the coefficients (a0,..aN) of the fit y=a0+a1*x+..an*x^n
#with associated sigma dy
#x,y,dy are all np.arrays with dtype= np.float64
def fit_poly(x,y,dy,n):
V = np.asmatrix(np.diag(dy**2))
M = []
for k in range(n+1):
M.append(x**k)
M = np.asmatrix(M).T
theta = (M.T*V.I*M).I*M.T*V.I*np.asmatrix(y).T
cov_t = (M.T*V.I*M).I
return np.asarray(theta.T)[0], np.asarray(cov_t)
Im trying to optimize his codes using cython. i got this code
cimport numpy as np
import numpy as np
cimport cython
#cython.boundscheck(False)
#cython.wraparound(False)
cpdef poly_c(np.ndarray[np.float64_t, ndim=1] x ,
np.ndarray[np.float64_t, ndim=1] y np.ndarray[np.float64_t,ndim=1]dy , np.int n):
cdef np.ndarray[np.float64_t, ndim=2] V, M
V=np.asmatrix(np.diag(dy**2),dtype=np.float64)
M=np.asmatrix([x**k for k in range(n+1)],dtype=np.float64).T
return ((M.T*V.I*M).I*M.T*V.I*(np.asmatrix(y).T))[0],(M.T*V.I*M).I
But the runtime seems to be the same for both programs,i did used an 'assert' to make sure the outputs where the same. What am i missing/doing wrong?
Thank you for your time and hopefully you can help me.
ps: this is the code im profiling with(not sure if i can call this profiling but w/e)
import numpy as np
from polyC import poly_c
from time import time
from pancho_fit import fit_poly
#pancho's the T.A,sup pancho
x=np.arange(1,1000)
x=np.asarray(x,dtype=np.float64)
y=3*x+np.random.random(999)
y=np.asarray(y,dtype=np.float64)
dy=np.array([y.std() for i in range(1,1000)],dtype=np.float64)
t0=time()
a,b=poly_c(x,y,dy,4)
#a,b=fit_poly(x,y,dy,4)
print("time={}s".format(time()-t0))

Except for [x**k for k in range(n+1)] I don't see any iterations for cython to improve. Most of the action is in matrix products. Those are already done with compiled code (with np.dot for ndarrays).
And n is only 4, not many iterations.
But why iterate this?
In [24]: x=np.arange(1,1000.)
In [25]: M1=x[:,None]**np.arange(5)
# np.matrix(M1)
does the same thing.
So no, this does not look like a good cython candidate - not unless you are prepared to write out all those matrix products in compilable detail.
I'd skip also the asmatrix stuff and use regular dot, # and einsum, but that's more a matter of style than speed.

Cython fails to recognize overloaded constructor

I'm trying to compile the cyrand project using cython, but am running into a bizarre compile error when testing overloaded constructors. See this gist for the files in question.
From the gist, I can compile and run example.pyx just fine, which uses the default constructor:
import numpy as np
cimport numpy as np
cimport cython
include "random.pyx"
#cython.boundscheck(False)
def example(n):
cdef int N = n
cdef rng r
cdef rng_sampler[double] * rng_p = new rng_sampler[double](r)
cdef rng_sampler[double] rng = deref(rng_p)
cdef np.ndarray[np.double_t, ndim=1] result = np.empty(N, dtype=np.double)
for i in range(N):
result[i] = rng.normal(0.0, 2.0)
print result
return result
^ this works and runs fine. An example run produces the following output:
$ python test_example.py
[ 0.47237842 3.153744849 3.6854932057 ]
Yet when I try to compile and run the test which used a constructor that takes an unsigned long as argument:
import numpy as np
cimport numpy as np
cimport cython
include "random.pyx"
#cython.boundscheck(False)
def example_seed(n, seed):
cdef int N = n
cdef unsigned long Seed = seed
cdef rng r
cdef rng_sampler[double] * rng_p = new rng_sampler[double](Seed)
cdef rng_sampler[double] rng = deref(rng_p)
cdef np.ndarray[np.double_t, ndim=1] result = np.empty(N, dtype=np.double)
for i in range(N):
result[i] = rng.normal(0.0, 2.0)
print result
return result
I get the following cython compiler error:
Error compiling Cython file:
-----------------------------------------------------------
...
cdef int N = n
cdef unsigned long Seed = seed
cdef rng_sampler[double] * rng_p = new rng_sampler[double](Seed)
----------------------------------------------------------
example/example_seed.pyx:15:67 Cannot assign type 'unsigned long' to 'mt19937'
I interpret this message, along with the fact that example.pyx compiles and produces a working example.so file, that cython cannot find (or manage) the rng_sampler constructor that takes an unsigned long as input. I've not used cython before, and my cpp is middling at best. Can anyone shed light on how to fix this simple problem?
python: 2.7.10 (Anaconda 2.0.1)
cython: 0.22.1

I resolved the error, it was to do with how boost is installed. I had installed boost via apt-get. After downloading/untarring and changing the pointers to boost in setup.py, it works.

Why does numba have worse optimization than Cython in this code?

I am trying to optimize some code with numba. The problem is that a simple Cython optimization (just specifying data types) is six times faster than using autojit, so I don't know if I'm doing something wrong.
The function to optimize is:
from numba import autojit
#autojit(nopython=True)
def get_energy(system, i,j,m):
#system is an array, (i,j) some indices and m the size of the array
up=i-1; down=i+1; left=j-1; right=j+1
if up<0: total=system[m,j]
else: total=system[up,j]
if down>m: total+=system[0,j]
else: total+=system[down,j]
if left<0: total+=system[i,m]
else: total+=system[i,left]
if right>m: total+=system[i,0]
else: total+=system[i,right]
return 2*system[i,j]*total
A simple run would be something like this:
import numpy as np
x=np.random.rand(50,50)
get_energy(x, 3, 5, 50)
I've understood that numba is good at loops but may not optimize other things very well. Anyhow, I would expect a similar performance to Cython, is numba slower accessing arrays or at conditional statements?
The .pyx file in Cython is:
import numpy as np
cimport cython
cimport numpy as np
def get_energy(np.ndarray[np.float64_t, ndim=2] system, int i,int j,unsigned int m):
cdef int up
cdef int down
cdef int left
cdef int right
cdef np.float64_t total
up=i-1; down=i+1; left=j-1; right=j+1
if up<0: total=system[m,j]
else: total=system[up,j]
if down>m: total+=system[0,j]
else: total+=system[down,j]
if left<0: total+=system[i,m]
else: total+=system[i,left]
if right>m: total+=system[i,0]
else: total+=system[i,right]
return 2*system[i,j]*total
Please comment if I need to give further information.

Force NumPy ndarray to take ownership of its memory in Cython

Following this answer to "Can I force a numpy ndarray to take ownership of its memory?" I attempted to use the Python C API function PyArray_ENABLEFLAGS through Cython's NumPy wrapper and found it is not exposed.
The following attempt to expose it manually (this is just a minimum example reproducing the failure)
from libc.stdlib cimport malloc
import numpy as np
cimport numpy as np
np.import_array()
ctypedef np.int32_t DTYPE_t
cdef extern from "numpy/ndarraytypes.h":
void PyArray_ENABLEFLAGS(np.PyArrayObject *arr, int flags)
def test():
cdef int N = 1000
cdef DTYPE_t *data = <DTYPE_t *>malloc(N * sizeof(DTYPE_t))
cdef np.ndarray[DTYPE_t, ndim=1] arr = np.PyArray_SimpleNewFromData(1, &N, np.NPY_INT32, data)
PyArray_ENABLEFLAGS(arr, np.NPY_ARRAY_OWNDATA)
fails with a compile error:
Error compiling Cython file:
------------------------------------------------------------
...
def test():
cdef int N = 1000
cdef DTYPE_t *data = <DTYPE_t *>malloc(N * sizeof(DTYPE_t))
cdef np.ndarray[DTYPE_t, ndim=1] arr = np.PyArray_SimpleNewFromData(1, &N, np.NPY_INT32, data)
PyArray_ENABLEFLAGS(arr, np.NPY_ARRAY_OWNDATA)
^
------------------------------------------------------------
/tmp/test.pyx:19:27: Cannot convert Python object to 'PyArrayObject *'
My question: Is this the right approach to take in this case? If so, what am I doing wrong? If not, how do I force NumPy to take ownership in Cython, without going down to a C extension module?

You just have some minor errors in the interface definition. The following worked for me:
from libc.stdlib cimport malloc
import numpy as np
cimport numpy as np
np.import_array()
ctypedef np.int32_t DTYPE_t
cdef extern from "numpy/arrayobject.h":
void PyArray_ENABLEFLAGS(np.ndarray arr, int flags)
cdef data_to_numpy_array_with_spec(void * ptr, np.npy_intp N, int t):
cdef np.ndarray[DTYPE_t, ndim=1] arr = np.PyArray_SimpleNewFromData(1, &N, t, ptr)
PyArray_ENABLEFLAGS(arr, np.NPY_OWNDATA)
return arr
def test():
N = 1000
cdef DTYPE_t *data = <DTYPE_t *>malloc(N * sizeof(DTYPE_t))
arr = data_to_numpy_array_with_spec(data, N, np.NPY_INT32)
return arr
This is my setup.py file:
from distutils.core import setup, Extension
from Cython.Distutils import build_ext
ext_modules = [Extension("_owndata", ["owndata.pyx"])]
setup(cmdclass={'build_ext': build_ext}, ext_modules=ext_modules)
Build with python setup.py build_ext --inplace. Then verify that the data is actually owned:
import _owndata
arr = _owndata.test()
print arr.flags
Among others, you should see OWNDATA : True.
And yes, this is definitely the right way to deal with this, since numpy.pxd does exactly the same thing to export all the other functions to Cython.

#Stefan's solution works for most scenarios, but is somewhat fragile. Numpy uses PyDataMem_NEW/PyDataMem_FREE for memory-management and it is an implementation detail, that these calls are mapped to the usual malloc/free + some memory tracing (I don't know which effect Stefan's solution has on the memory tracing, at least it seems not to crash).
There are also more esoteric cases possible, in which free from numpy-library doesn't use the same memory-allocator as malloc in the cython code (linked against different run-times for example as in this github-issue or this SO-post).
The right tool to pass/manage the ownership of the data is PyArray_SetBaseObject.
First we need a python-object, which is responsible for freeing the memory. I'm using a self-made cdef-class here (mostly because of logging/demostration), but there are obviously other possiblities as well:
%%cython
from libc.stdlib cimport free
cdef class MemoryNanny:
cdef void* ptr # set to NULL by "constructor"
def __dealloc__(self):
print("freeing ptr=", <unsigned long long>(self.ptr)) #just for debugging
free(self.ptr)
#staticmethod
cdef create(void* ptr):
cdef MemoryNanny result = MemoryNanny()
result.ptr = ptr
print("nanny for ptr=", <unsigned long long>(result.ptr)) #just for debugging
return result
...
Now, we use a MemoryNanny-object as sentinel for the memory, which gets freed as soon as the parent-numpy-array gets destroyed. The code is a little bit awkward, because PyArray_SetBaseObject steals the reference, which is not handled by Cython automatically:
%%cython
...
from cpython.object cimport PyObject
from cpython.ref cimport Py_INCREF
cimport numpy as np
#needed to initialize PyArray_API in order to be able to use it
np.import_array()
cdef extern from "numpy/arrayobject.h":
# a little bit awkward: the reference to obj will be stolen
# using PyObject* to signal that Cython cannot handle it automatically
int PyArray_SetBaseObject(np.ndarray arr, PyObject *obj) except -1 # -1 means there was an error
cdef array_from_ptr(void * ptr, np.npy_intp N, int np_type):
cdef np.ndarray arr = np.PyArray_SimpleNewFromData(1, &N, np_type, ptr)
nanny = MemoryNanny.create(ptr)
Py_INCREF(nanny) # a reference will get stolen, so prepare nanny
PyArray_SetBaseObject(arr, <PyObject*>nanny)
return arr
...
And here is an example, how this functionality can be called:
%%cython
...
from libc.stdlib cimport malloc
def create():
cdef double *ptr=<double*>malloc(sizeof(double)*8);
ptr[0]=42.0
return array_from_ptr(ptr, 8, np.NPY_FLOAT64)
which can be used as follows:
>>> m = create()
nanny for ptr= 94339864945184
>>> m.flags
...
OWNDATA : False
...
>>> m[0]
42.0
>>> del m
freeing ptr= 94339864945184
with results/output as expected.
Note: the resulting arrays doesn't really own the data (i.e. flags return OWNDATA : False), because the memory is owned be the memory-nanny, but the result is the same: the memory gets freed as soon as array is deleted (because nobody holds a reference to the nanny anymore).
MemoryNanny doesn't have to guard a raw C-pointer. It can be anything else, for example also a std::vector:
%%cython -+
from libcpp.vector cimport vector
cdef class VectorNanny:
#automatically default initialized/destructed by Cython:
cdef vector[double] vec
#staticmethod
cdef create(vector[double]& vec):
cdef VectorNanny result = VectorNanny()
result.vec.swap(vec) # swap and not copy
return result
# for testing:
def create_vector(int N):
cdef vector[double] vec;
vec.resize(N, 2.0)
return VectorNanny.create(vec)
The following test shows, that the nanny works:
nanny=create_vector(10**8) # top shows additional 800MB memory are used
del nanny # top shows, this additional memory is no longer used.

The latest Cython version allows you to do with with minimal syntax, albeit slightly more overhead than the lower-level solutions suggested.
numpy_array = np.asarray(<np.int32_t[:10, :10]> my_pointer)
https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html#coercion-to-numpy
This alone does not pass ownership.
Notably, a Cython array is generated with this call, via array_cwrapper.
This generates a cython.array, without allocating memory. The cython.array uses the stdlib.h malloc and free by default, so it would be expected that you use the default malloc, as well, instead of any special CPython/Numpy allocators.
free is only called if ownership is set for this cython.array, which it is by default only if it allocates data. For our case, we can manually set it via:
my_cyarr.free_data = True
So to return a 1D array, it would be as simple as:
from cython.view cimport array as cvarray
# ...
cdef cvarray cvarr = <np.int32_t[:N]> data
cvarr.free_data = True
return np.asarray(cvarr)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Cython actually slowing me down - python

Related

Issue converting a python program to cython for speedup

Cython/numpy vs pure numpy for least square fitting

Cython fails to recognize overloaded constructor

Why does numba have worse optimization than Cython in this code?

Force NumPy ndarray to take ownership of its memory in Cython

Categories

Resources