I have parallelised some Cython code with OpenMP. Once in a while, the code computes wrong results.
I created a nearly minimal working example of my problem. "Nearly", because the frequency of the wrong results seems to depend on even the tiniest changes in the code; that is why, for example, I kept the function pointers in.
The Cython code is
#cython: language_level=3, boundscheck=False, wraparound=False, cdivision=True
# distutils: language = c++
import numpy as np
cimport cython
from cython.parallel import prange, parallel
from libcpp.vector cimport vector
cimport numpy as np

cdef inline double estimator_matheron(const double f_diff) nogil:
    return f_diff * f_diff

ctypedef double (*_estimator_func)(const double) nogil

cdef inline void normalization_matheron(
    vector[double]& variogram,
    vector[long]& counts,
    const int variogram_len
):
    cdef int i
    for i in range(variogram_len):
        if counts[i] == 0:
            counts[i] = 1
        variogram[i] /= (2. * counts[i])

ctypedef void (*_normalization_func)(vector[double]&, vector[long]&, const int)

def test(const double[:] f):
    cdef _estimator_func estimator_func = estimator_matheron
    cdef _normalization_func normalization_func = normalization_matheron
    cdef int i_max = f.shape[0] - 1
    cdef int j_max = i_max + 1
    cdef vector[double] variogram_local, variogram
    cdef vector[long] counts_local, counts
    cdef int i, j
    with nogil, parallel():
        variogram_local.resize(j_max, 0.0)
        counts_local.resize(j_max, 0)
        for i in range(i_max):
            for j in range(1, j_max-i):
                counts_local[j] += 1
                variogram_local[j] += estimator_func(f[i] - f[i+j])
    normalization_func(variogram_local, counts_local, j_max)
    return np.asarray(variogram_local)
To test the code, I used this script:
import numpy as np
from cython_parallel import test
z = np.array(
(41.2, 40.2, 39.7, 39.2, 40.1, 38.3, 39.1, 40.0, 41.1, 40.3),
dtype=np.double,
)
print(test(z))
The result should be
[0. 0.49166667 0.7625 1.09071429 0.90166667 1.336
0.9525 0.435 0.005 0.405 ]
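For cross-checking the compiled version, a serial pure-Python reference can be handy. This is only a sketch of the same Matheron sum over all pairs (i, i+j); the function name matheron_variogram is made up for illustration and is not part of the code in question.

```python
def matheron_variogram(f):
    """Serial reference: one bin per integer lag j, built from pairs (i, i + j)."""
    n = len(f)
    variogram = [0.0] * n
    for j in range(1, n):
        # squared differences of all pairs separated by exactly j samples
        squares = [(f[i] - f[i + j]) ** 2 for i in range(n - j)]
        variogram[j] = sum(squares) / (2.0 * len(squares))
    return variogram


z = [41.2, 40.2, 39.7, 39.2, 40.1, 38.3, 39.1, 40.0, 41.1, 40.3]
reference = matheron_variogram(z)
print([round(v, 8) for v in reference])
```

Because it runs in a single thread, any deviation of the OpenMP build from this reference points at the parallel section rather than at the arithmetic.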
This is what a wrong result typically looks like
[0. 0.44319444 0.75483871 1.09053571 0.90166667 1.336
0.9525 0.435 0.005 0.405 ]
This code mainly sums up numbers into the vector variogram_local. Most of the time it works, but wrong results are produced maybe every 30th run (I have not collected proper statistics). It always works if I change the line with nogil, parallel(): to with nogil:. It also always works if I don't use the function pointers at all, like this:
    with nogil, parallel():
        variogram_local.resize(j_max, 0.0)
        counts_local.resize(j_max, 0)
        for i in range(i_max):
            for j in range(1, j_max-i):
                counts_local[j] += 1
                variogram_local[j] += (f[i] - f[i+j]) * (f[i] - f[i+j])
    for j in range(j_max):
        if counts_local[j] == 0:
            counts_local[j] = 1
        variogram_local[j] /= (2. * counts_local[j])
    return np.asarray(variogram_local)
The full code is tested on different platforms, and these problems mainly occur on macOS with clang, e.g.:
https://ci.appveyor.com/project/conda-forge/staged-recipes/builds/29018878
EDIT
Thanks to your input, I modified the code, and with num_threads=2 it works. But as soon as num_threads>2, I get wrong results again. Do you think that, if Cython's OpenMP support were perfect, my new code should work, or am I still getting something wrong?
If this turns out to be on Cython's side, I guess I will indeed implement the code in pure C++.
def test(const double[:] f):
    cdef int i_max = f.shape[0] - 1
    cdef int j_max = i_max + 1
    cdef vector[double] variogram_local, variogram
    cdef vector[long] counts_local, counts
    cdef int i, j, k
    variogram.resize(j_max, 0.0)
    counts.resize(j_max, 0)
    with nogil, parallel(num_threads=2):
        variogram_local = vector[double](j_max, 0.0)
        counts_local = vector[long](j_max, 0)
        for i in prange(i_max):
            for j in range(1, j_max-i):
                counts_local[j] += 1
                variogram_local[j] += (f[i] - f[i+j]) * (f[i] - f[i+j])
        for k in range(j_max):
            counts[k] += counts_local[k]
            variogram[k] += variogram_local[k]
    for i in range(j_max):
        if counts[i] == 0:
            counts[i] = 1
        variogram[i] /= (2. * counts[i])
    return np.asarray(variogram)
Contrary to their name, variogram_local and counts_local are not actually local. They are shared and all threads mess around with them in parallel, hence the undefined result.
Note that you don't actually share any work. It's just all threads doing the same thing - the whole serial task.
A somewhat sensible parallel version would look more like this:
    variogram.resize(j_max, 0.0)
    counts.resize(j_max, 0)
    with nogil, parallel():
        for i in range(i_max):
            for j in prange(1, j_max-i):
                counts[j] += 1
                variogram[j] += estimator_func(f[i] - f[i+j])
The shared arrays are initialized outside and then the threads share the inner j-loop. Since no two threads will ever work on the same j, this is safe to do.
Now it may not be ideal to parallelize the inner loop. If you were to actually parallelize the outer loop, you would have to in fact make actual local variables and merge/reduce them afterwards.
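To make the "local variables, then merge/reduce" idea concrete, here is a rough pure-Python sketch of the pattern. It uses concurrent.futures instead of OpenMP, so it will not run faster under the GIL; the names partial_sums and variogram_outer_parallel are invented for this illustration. Each task owns its accumulators, and the merge happens in one thread only.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sums(f, i_values):
    """Accumulate into buffers that belong to this task only."""
    n = len(f)
    counts_local = [0] * n
    variogram_local = [0.0] * n
    for i in i_values:
        for j in range(1, n - i):
            counts_local[j] += 1
            variogram_local[j] += (f[i] - f[i + j]) ** 2
    return counts_local, variogram_local


def variogram_outer_parallel(f, num_workers=4):
    n = len(f)
    # Deal the outer-loop indices 0 .. n-2 out round-robin, one chunk per worker.
    chunks = [range(w, n - 1, num_workers) for w in range(num_workers)]
    counts = [0] * n
    variogram = [0.0] * n
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = pool.map(lambda chunk: partial_sums(f, chunk), chunks)
        # The reduction runs in the calling thread, so it needs no lock.
        for counts_local, variogram_local in partials:
            for k in range(n):
                counts[k] += counts_local[k]
                variogram[k] += variogram_local[k]
    # Empty bins (lag 0) keep the value 0 by dividing by 1 instead of 0.
    return [v / (2.0 * max(c, 1)) for v, c in zip(variogram, counts)]


z = [41.2, 40.2, 39.7, 39.2, 40.1, 38.3, 39.1, 40.0, 41.1, 40.3]
result = variogram_outer_parallel(z)
print(result)
```

The summation order differs from the serial loop, so the results agree only up to floating-point rounding.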
The problem with your modified code is that you have a race condition in the section which adds up counts_local and variogram_local. You want this in the parallel block (so that you still have access to the thread-local variables) but you only want one thread at a time to be working on it. The easiest way is to put it in a with gil: block so that Python enforces the "one thread at a time":
with gil:
    for k in range(j_max):
        counts[k] += counts_local[k]
        variogram[k] += variogram_local[k]
This bit should hopefully be a quick task at the end, so shouldn't take too long.
If it were in C/C++ you'd probably use #pragma omp atomic or #pragma omp critical instead for the block. It's difficult to do this in Cython since its OpenMP support is quite basic, but you probably could abuse wrapped C macros to make the addition atomic.
Cython's OpenMP support is really geared around simple loops and scalar reductions. If you're doing more than that then it doesn't have the syntax to give you fine control of OpenMP and for this reason I'd tend to recommend writing your critical OpenMP functions in C or C++ (whichever you're more comfortable with).
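The "one thread at a time" merge can be sketched in plain Python, where a threading.Lock plays the role that with gil: (or an OpenMP critical section) plays in the compiled code. This is an analogy under that assumption, not Cython; variogram_locked_merge is a made-up name.

```python
import threading


def variogram_locked_merge(f, num_threads=4):
    n = len(f)
    counts = [0] * n          # shared totals
    variogram = [0.0] * n     # shared totals
    merge_lock = threading.Lock()

    def worker(thread_id):
        # Private accumulators: no other thread ever sees these lists.
        counts_local = [0] * n
        variogram_local = [0.0] * n
        for i in range(thread_id, n - 1, num_threads):
            for j in range(1, n - i):
                counts_local[j] += 1
                variogram_local[j] += (f[i] - f[i + j]) ** 2
        # Critical section: only one thread at a time updates the shared totals.
        with merge_lock:
            for k in range(n):
                counts[k] += counts_local[k]
                variogram[k] += variogram_local[k]

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [v / (2.0 * max(c, 1)) for v, c in zip(variogram, counts)]


z = [41.2, 40.2, 39.7, 39.2, 40.1, 38.3, 39.1, 40.0, 41.1, 40.3]
locked = variogram_locked_merge(z)
print(locked)
```

The lock is held only for the short merge at the end, so the expensive double loop still runs without any synchronisation.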
Related
I want to write a pure C-style function that takes an array (a pointer) as an argument and does something with it. But I cannot figure out how to declare an array argument for a cdef function. Here is some toy code I have made.
cdef void test(double[] array ) except? -2:
    cdef int i,n
    i = 0
    n = len(array)
    for i in range(0,n):
        array[i] = array[i]+1.0

def ctest(a):
    n = len(a)
    # Make a C array on the heap.
    cdef double *v
    v = <double *>malloc(n*sizeof(double))
    # Copy in the Python array
    for i in range(n):
        v[i] = float(a[i])
    # Call the C function which does something with the array
    test(v)
    # Put the changed C array back into Python
    for i in range(n):
        a[i] = v[i]
    free(v)
    return a
The code will not compile. I have searched for how to define C arrays in Cython, but have not found how to do it. The double[] array clearly does not work. I have also tried:
cdef void test(double* array ) except? -2:
I can manage to do the same in pure C, but not in Cython :(
D:\cython-test\ python setup.py build_ext --inplace
Compiling ctest.pyx because it changed.
[1/1] Cythonizing ctest.pyx
Error compiling Cython file:
------------------------------------------------------------
...
from libc.stdlib cimport malloc, free
cdef void test(double[] array):
cdef int i,n
n = len(array)
^
------------------------------------------------------------
ctest.pyx:5:17: Cannot convert 'double *' to Python object
Error compiling Cython file:
------------------------------------------------------------
...
from libc.stdlib cimport malloc, free
cdef void test(double[] array):
cdef int i,n
n = len(array)
for i in range(0,len(array)):
^
------------------------------------------------------------
ctest.pyx:6:30: Cannot convert 'double *' to Python object
Traceback (most recent call last):
  File "setup.py", line 10, in <module>
    ext_modules = cythonize("ctest.pyx"),
  File "C:\Anaconda\lib\site-packages\Cython\Build\Dependencies.py", line 877, in cythonize
    cythonize_one(*args)
  File "C:\Anaconda\lib\site-packages\Cython\Build\Dependencies.py", line 997, in cythonize_one
    raise CompileError(None, pyx_file)
Cython.Compiler.Errors.CompileError: ctest.pyx
E:\GD\UD\Software\BendStiffener\curvmom>
UPDATE:
I have updated my code following the advice, and it compiles now :) But my array still does not get updated. I expect 5.0 to be added to every entry, but it is not:
from libc.stdlib cimport malloc, free

cdef void test(double[] array):
    cdef int i,n
    n = sizeof(array)/sizeof(double)
    for i in range(0,n):
        array[i] = array[i]+5.0

def ctest(a):
    n = len(a)
    # Make a C array on the heap.
    cdef double* v
    v = <double*>malloc(n*sizeof(double))
    # Copy in the Python array
    for i in range(n):
        v[i] = float(a[i])
    # Call the C function which does something with the array
    test(v)
    # Put the changed C array back into Python
    for i in range(n):
        a[i] = v[i]
    free(v)
    for x in a:
        print x
    return a
Here is a Python test program for my code:
import ctest
a = [0,0,0]
ctest.ctest(a)
So there is still something I am doing wrong. Any suggestions?
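len() is a Python function that works only on Python objects. This is why it won't compile.
For a true fixed-size C array that is declared in the same scope, you could replace n = len(array) by n = sizeof(array) / sizeof(double). For a function parameter declared as double[] or double*, however, sizeof(array) is only the size of a pointer, so the length has to be passed to the function as an extra argument.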
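The difference between an array and a pointer in a sizeof expression can be checked from Python with ctypes, which mirrors the C type sizes of the platform. A small sketch, not Cython code:

```python
import ctypes

double_size = ctypes.sizeof(ctypes.c_double)

# A fixed-size C array type knows its own extent ...
array_type = ctypes.c_double * 3
n_from_array = ctypes.sizeof(array_type) // double_size

# ... but a double* is just an address, whatever it points to.
pointer_type = ctypes.POINTER(ctypes.c_double)
n_from_pointer = ctypes.sizeof(pointer_type) // double_size

print(n_from_array)    # 3
print(n_from_pointer)  # pointer size divided by 8, unrelated to the real length
```

On a typical 64-bit platform the second value is 1, which would explain a loop that touches only the first entry of the array.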
You might want to take a look at typed memoryviews and the buffer interface. These provide a nice interface to array-like data structures such as those underlying NumPy arrays, but can also be used to work with C arrays. From the documentation:
For example, they can handle C arrays and the Cython array type (Cython arrays).
In your case this might help:
cdef test(double[:] array) except? -2:
    ...
The double[:] allows all 1d double arrays to be passed to the function. Those can then be modified. As the [:] defines a memoryview, all changes will be made in the array you created the memoryview on (the variable you passed as the parameter to test).
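The in-place behaviour can be tried without compiling anything, using Python's built-in memoryview as a stand-in for Cython's typed memoryview (an analogy only; add_five is a hypothetical helper). The view carries its own length, so no sizeof arithmetic is needed.

```python
from array import array


def add_five(view):
    """Modify the caller's buffer in place through a memoryview."""
    for i in range(len(view)):
        view[i] = view[i] + 5.0


data = array("d", [0.0, 0.0, 0.0])
add_five(memoryview(data))
print(data)  # array('d', [5.0, 5.0, 5.0])
```

No copy is made: the writes through the view land directly in data.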
%%cython -f -c=-O3 -c=-fopenmp --link-args=-fopenmp
from cython.parallel import parallel, prange
from libc.stdlib cimport abort, malloc, free

cdef int idx, i, n = 100
cdef int k
cdef int * local_buf
cdef int size = 10

cdef void func(int* lb) nogil:
    cdef int j
    for j in xrange(size):
        lb[j] += -1*j

local_buf = <int *> malloc(sizeof(int) * size)
with nogil, parallel():
    if local_buf == NULL:
        abort()
    # populate our local buffer in a sequential loop
    for i in xrange(size):
        local_buf[i] = i * 2
    # share the work using the thread-local buffer(s)
    for k in prange(n, schedule='guided'):
        func(local_buf)
for i in xrange(size):
    print local_buf[i]
free(local_buf)
0
-98
-196
-294
-392
-490
-588
-686
-784
-882
edit:
The above block shows the output after one run, but the contents of local_buf seem to change every other re-run or so. What's going on?
The result there seems reasonable for the code given; do you actually get a different result on each run?
This should be the regular Python equivalent:
size = 10
n = 100
lst = [i*2 for i in range(size)]
for i in range(n):
    for j in range(size):
        lst[j] += -1*j
print(lst)
#[0, -98, -196, -294, -392, -490, -588, -686, -784, -882]
I'd like to call my C function from Python, in order to manipulate some NumPy arrays. The function is like this:
void c_func(int *in_array, int n, int *out_array);
where the results are supplied in out_array, whose size I know in advance (not my function, actually). In the corresponding .pyx file I tried the following, in order to be able to pass the input to the function from a NumPy array and store the result in a NumPy array:
def pyfunc(np.ndarray[np.int32_t, ndim=1] in_array):
    n = len(in_array)
    out_array = np.zeros((512,), dtype = np.int32)
    mymodule.c_func(<int *> in_array.data, n, <int *> out_array.data)
    return out_array
But I get a "Python objects cannot be cast to pointers of primitive types" error for the output assignment. How do I accomplish this?
If I require that the Python caller allocates the proper output array, then I can do
def pyfunc(np.ndarray[np.int32_t, ndim=1] in_array, np.ndarray[np.int32_t, ndim=1] out_array):
    n = len(in_array)
    mymodule.c_func(<int *> in_array.data, n, <int*> out_array.data)
But can I do this in a way that the caller doesn't have to pre-allocate the appropriately sized output array?
You should add cdef np.ndarray before the out_array assignment:
def pyfunc(np.ndarray[np.int32_t, ndim=1] in_array):
    cdef np.ndarray out_array = np.zeros((512,), dtype = np.int32)
    n = len(in_array)
    mymodule.c_func(<int *> in_array.data, n, <int *> out_array.data)
    return out_array
Here is an example of how to manipulate NumPy arrays using code written in C/C++ through ctypes.
I wrote a small function in C, taking the square of numbers from a first array and writing the result to a second array. The number of elements is given by a third parameter. This code is compiled as a shared object.
square.c compiled to square.so:
void square(double* pin, double* pout, int n) {
    for (int i=0; i<n; ++i) {
        pout[i] = pin[i] * pin[i];
    }
}
In python, you just load the library using ctypes and call the function. The array pointers are obtained from the NumPy ctypes interface.
import numpy as np
import ctypes
n = 5
a = np.arange(n, dtype=np.double)
b = np.zeros(n, dtype=np.double)
square = ctypes.cdll.LoadLibrary("./square.so")
aptr = a.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
bptr = b.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
square.square(aptr, bptr, n)
print(b)
This will work for any c-library, you just have to know which argument types to pass, possibly rebuilding c-structs in python using ctypes.
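The same approach also covers the case where the caller should not have to pre-allocate the output: the wrapper allocates the array and hands its pointer to C. The sketch below assumes NumPy is installed and uses ctypes.memmove purely as a stand-in for a real C routine with an (in, n, out) style signature; copy_via_c is a hypothetical name.

```python
import ctypes

import numpy as np


def copy_via_c(in_array):
    """Allocate the output here, then let a C routine fill it through raw pointers."""
    in_array = np.ascontiguousarray(in_array, dtype=np.int32)
    out_array = np.zeros(in_array.shape, dtype=np.int32)  # allocated by the wrapper
    in_ptr = in_array.ctypes.data_as(ctypes.POINTER(ctypes.c_int32))
    out_ptr = out_array.ctypes.data_as(ctypes.POINTER(ctypes.c_int32))
    # Stand-in for the real C function: copies nbytes from in_ptr to out_ptr.
    ctypes.memmove(out_ptr, in_ptr, in_array.nbytes)
    return out_array


print(copy_via_c([1, 2, 3]))
```

With a real shared library, the memmove line would be replaced by the library call, ideally after setting its argtypes so that ctypes checks the pointer types.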
I am trying to speed up some python code with cython, and I'm making use of cython's -a option to see where I can improve things. My understanding is that in the generated html file, the highlighted lines are ones where python functions are called - is that correct?
In the following trivial function, I have declared the numpy array argument arr using the buffer syntax. I thought that this allows indexing operations to take place purely in C without having to call python functions. However, cython -a (version 0.15) highlights the line where I set the value of an element of arr, though not the one where I read one of its elements. Why does this happen? Is there a more efficient way of accessing numpy array elements?
import numpy
cimport numpy
def foo(numpy.ndarray[double, ndim=1] arr not None):
    cdef int i
    cdef double elem
    for i in xrange(10):
        elem = arr[i] #not highlighted
        arr[i] = 1.0 + elem #highlighted
EDIT: Also, how does the mode buffer argument interact with numpy? Assuming I haven't changed the order argument of numpy.array from the default, is it always safe to use mode='c'? Does this actually make a difference to performance?
EDIT after delnan's comment: arr[i] += 1 also gets highlighted (that is why I split it up in the first place, to see which part of the operation was causing the issue). If I turn off bounds checking to simplify things (this makes no difference to what gets highlighted), the generated c code is:
/* "ct.pyx":11
* cdef int i
* cdef double elem
* for i in xrange(10): # <<<<<<<<<<<<<<
* elem = arr[i]
* arr[i] = 1.0 + elem
*/
for (__pyx_t_1 = 0; __pyx_t_1 < 10; __pyx_t_1+=1) {
__pyx_v_i = __pyx_t_1;
/* "ct.pyx":12
* cdef double elem
* for i in xrange(10):
* elem = arr[i] # <<<<<<<<<<<<<<
* arr[i] = 1.0 + elem
*/
__pyx_t_2 = __pyx_v_i;
__pyx_v_elem = (*__Pyx_BufPtrStrided1d(double *, __pyx_bstruct_arr.buf, __pyx_t_2, __pyx_bstride_0_arr));
/* "ct.pyx":13
* for i in xrange(10):
* elem = arr[i]
* arr[i] = 1.0 + elem # <<<<<<<<<<<<<<
*/
__pyx_t_3 = __pyx_v_i;
*__Pyx_BufPtrStrided1d(double *, __pyx_bstruct_arr.buf, __pyx_t_3, __pyx_bstride_0_arr) = (1.0 + __pyx_v_elem);
}
The answer is that the highlighter fools the reader.
I compiled your code, and the instructions generated under the highlight are those needed to handle the error cases and the return value; they are not related to the array assignment.
Indeed, if you change the code to read:
def foo(numpy.ndarray[double, ndim=1] arr not None):
    cdef int i
    cdef double elem
    for i in xrange(10):
        elem = arr[i]
        arr[i] = 1.0 + elem
    return # + add this
The highlight would then be on the last line and no longer on the assignment.
You can further speed up your code by using the @cython.boundscheck decorator:
import numpy
cimport numpy
cimport cython

@cython.boundscheck(False)
def foo(numpy.ndarray[double, ndim=1] arr not None):
    cdef int i
    cdef double elem
    for i in xrange(10):
        elem = arr[i]
        arr[i] = 1.0 + elem
    return
I am trying to speed up my Numpy code and decided that I wanted to implement one particular function where my code spent most of the time in C.
I'm actually a rookie in C, but I managed to write the function which normalizes every row in a matrix to sum to 1. I can compile it and I tested it with some data (in C) and it does what I want. At that point I was very proud of myself.
Now I'm trying to call my glorious function from Python where it should accept a 2d-Numpy array.
The various things I've tried are
SWIG
SWIG + numpy.i
ctypes
My function has the prototype
void normalize_logspace_matrix(size_t nrow, size_t ncol, double mat[nrow][ncol]);
So it takes a pointer to a variable-length array and modifies it in place.
I tried the following pure SWIG interface file:
%module c_utils
%{
extern void normalize_logspace_matrix(size_t, size_t, double mat[*][*]);
%}
extern void normalize_logspace_matrix(size_t, size_t, double** mat);
Then I would do (on Mac OS X 64bit):
> swig -python c-utils.i
> gcc -fPIC c-utils_wrap.c -o c-utils_wrap.o \
-I/Library/Frameworks/Python.framework/Versions/6.2/include/python2.6/ \
-L/Library/Frameworks/Python.framework/Versions/6.2/lib/python2.6/ -c
c-utils_wrap.c: In function ‘_wrap_normalize_logspace_matrix’:
c-utils_wrap.c:2867: warning: passing argument 3 of ‘normalize_logspace_matrix’ from incompatible pointer type
> g++ -dynamiclib c-utils.o -o _c_utils.so
In Python I then get the following error on importing my module:
>>> import c_utils
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: dynamic module does not define init function (initc_utils)
Next I tried this approach using SWIG + numpy.i:
%module c_utils
%{
#define SWIG_FILE_WITH_INIT
#include "c-utils.h"
%}
%include "numpy.i"
%init %{
import_array();
%}
%apply ( int DIM1, int DIM2, DATA_TYPE* INPLACE_ARRAY2 )
{(size_t nrow, size_t ncol, double* mat)};
%include "c-utils.h"
However, I don't get any further than this:
> swig -python c-utils.i
c-utils.i:13: Warning 453: Can't apply (int DIM1,int DIM2,DATA_TYPE *INPLACE_ARRAY2). No typemaps are defined.
SWIG doesn't seem to find the typemaps defined in numpy.i, but I don't understand why, because numpy.i is in the same directory and SWIG doesn't complain that it can't find it.
With ctypes I didn't get very far, but got lost in the docs pretty quickly since I couldn't figure out how to pass it a 2d-array and then get the result back.
So could somebody show me the magic trick how to make my function available in Python/Numpy?
Unless you have a really good reason not to, you should use cython to interface C and python. (We are starting to use cython instead of raw C inside numpy/scipy themselves).
You can see a simple example in my scikits talkbox (since cython has improved quite a bit since then, I think you could write it better today).
def cslfilter(c_np.ndarray b, c_np.ndarray a, c_np.ndarray x):
    """Fast version of slfilter for a set of frames and filter coefficients.
    More precisely, given rank 2 arrays for coefficients and input, this
    computes:

    for i in range(x.shape[0]):
        y[i] = lfilter(b[i], a[i], x[i])

    This is mostly useful for processing on a set of windows with variable
    filters, e.g. to compute LPC residual from a signal chopped into a set of
    windows.

    Parameters
    ----------
    b: array
        recursive coefficients
    a: array
        non-recursive coefficients
    x: array
        signal to filter

    Note
    ----
    This is a specialized function, and does not handle other types than
    double, nor initial conditions."""
    cdef int na, nb, nfr, i, nx
    cdef double *raw_x, *raw_a, *raw_b, *raw_y
    cdef c_np.ndarray[double, ndim=2] tb
    cdef c_np.ndarray[double, ndim=2] ta
    cdef c_np.ndarray[double, ndim=2] tx
    cdef c_np.ndarray[double, ndim=2] ty

    dt = np.common_type(a, b, x)
    if not dt == np.float64:
        raise ValueError("Only float64 supported for now")
    if not x.ndim == 2:
        raise ValueError("Only input of rank 2 support")
    if not b.ndim == 2:
        raise ValueError("Only b of rank 2 support")
    if not a.ndim == 2:
        raise ValueError("Only a of rank 2 support")

    nfr = a.shape[0]
    if not nfr == b.shape[0]:
        raise ValueError("Number of filters should be the same")
    if not nfr == x.shape[0]:
        raise ValueError, \
            "Number of filters and number of frames should be the same"

    tx = np.ascontiguousarray(x, dtype=dt)
    ty = np.ones((x.shape[0], x.shape[1]), dt)
    na = a.shape[1]
    nb = b.shape[1]
    nx = x.shape[1]
    ta = np.ascontiguousarray(np.copy(a), dtype=dt)
    tb = np.ascontiguousarray(np.copy(b), dtype=dt)

    raw_x = <double*>tx.data
    raw_b = <double*>tb.data
    raw_a = <double*>ta.data
    raw_y = <double*>ty.data
    for i in range(nfr):
        filter_double(raw_b, nb, raw_a, na, raw_x, nx, raw_y)
        raw_b += nb
        raw_a += na
        raw_x += nx
        raw_y += nx
    return ty
As you can see, besides the usual argument checking you would do in python, it is almost the same thing (filter_double is a function which can be written in pure C in a separate library if you want to). Of course, since it is compiled code, failing to check your argument will crash your interpreter instead of raising exception (there are several levels of safety vs speed tradeoffs available with recent cython, though).
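One detail in that function deserves a second look: the pointer arithmetic (raw_x += nx and so on) is only valid for row-major contiguous memory, which is what the np.ascontiguousarray calls guarantee. A small NumPy sketch of the difference, with made-up array contents:

```python
import numpy as np

x = np.arange(12, dtype=np.float64).reshape(3, 4)
view = x[:, ::2]                     # every other column: same memory, but strided
packed = np.ascontiguousarray(view)  # fresh row-major copy of the same values

print(view.flags["C_CONTIGUOUS"], view.strides)
print(packed.flags["C_CONTIGUOUS"], packed.strides)
```

Stepping a raw double* by the row length would walk through the skipped columns of view, whereas on packed it lands exactly on the start of each row.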
To answer the real question: SWIG doesn't tell you it can't find any typemaps. It tells you it can't apply the typemap (int DIM1,int DIM2,DATA_TYPE *INPLACE_ARRAY2), which is because there is no typemap defined for DATA_TYPE *. You need to tell it you want to apply it to a double*:
%apply ( int DIM1, int DIM2, double* INPLACE_ARRAY2 )
{(size_t nrow, size_t ncol, double* mat)};
First, are you sure that you were writing the fastest possible NumPy code? If by normalise you mean dividing each row by its sum, then you can write fast vectorised code which looks something like this:
matrix /= matrix.sum(axis=1, keepdims=True)
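The axis is easy to mix up, so here is a tiny check, assuming a plain (not log-space) normalisation and made-up data: summing over axis=1 gives one total per row, and keepdims=True keeps it as a column so that it broadcasts across that row.

```python
import numpy as np

matrix = np.array([[1.0, 3.0],
                   [2.0, 2.0],
                   [4.0, 1.0]])

row_sums = matrix.sum(axis=1, keepdims=True)  # shape (3, 1): one total per row
matrix /= row_sums                            # in place, broadcast along each row

print(matrix)
print(matrix.sum(axis=1))
```

Summing over axis=0 instead would give one total per column, and the division would then normalise columns rather than rows.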
If this is not what you had in mind and you are still sure that you need a fast C extension, I would strongly recommend you write it in cython instead of C. This will save you all the overhead and difficulties in wrapping code, and allow you to write something which looks like python code but which can be made to run as fast as C in most circumstances.
I agree with others that a little Cython is well worth learning.
But if you must write C or C++, use a 1d array which overlays the 2d, like this:
// sum1rows.cpp: 2d A as 1d A1
// Unfortunately
//     void f( int m, int n, double a[m][n] ) { ... }
// is valid c but not c++ .
// See also
// http://stackoverflow.com/questions/3959457/high-performance-c-multi-dimensional-arrays
// http://stackoverflow.com/questions/tagged/multidimensional-array c++

#include <stdio.h>

void sum1( int n, double x[] )  // x /= sum(x)
{
    double sum = 0;  // double, not float, to match the precision of x
    for( int j = 0; j < n; j ++ )
        sum += x[j];
    for( int j = 0; j < n; j ++ )
        x[j] /= sum;
}

void sum1rows( int nrow, int ncol, double A1[] )  // 1d A1 == 2d A[nrow][ncol]
{
    for( int j = 0; j < nrow*ncol; j += ncol )
        sum1( ncol, &A1[j] );
}

int main( int argc, char** argv )
{
    int nrow = 100, ncol = 10;
    double A[nrow][ncol];
    for( int j = 0; j < nrow; j ++ )
        for( int k = 0; k < ncol; k ++ )
            A[j][k] = (j+1) * k;

    double* A1 = &A[0][0];  // A as 1d array -- bad practice
    sum1rows( nrow, ncol, A1 );

    for( int j = 0; j < 2; j ++ ){
        for( int k = 0; k < ncol; k ++ ){
            printf( "%.2g ", A[j][k] );
        }
        printf( "\n" );
    }
}
Added 8 Nov: as you probably know, numpy.reshape can overlay a numpy 2d array with a 1d view to pass to sum1rows, like this:
import numpy as np
A = np.arange(10).reshape((2,5))
A1 = A.reshape(A.size)  # a 1d view of A, not a copy
# sum1rows( 2, 5, A1 )
A[1,1] += 10
print("A:", A)
print("A1:", A1)
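For readers who want to see the index arithmetic of the 1d overlay without compiling anything, here is a pure-Python mirror of sum1rows working on a flat list. It is only an illustration of the layout (element [i][j] lives at position i*ncol + j), not a replacement for the C++ version.

```python
def sum1rows(nrow, ncol, a1):
    """Normalise each row of a row-major nrow x ncol matrix stored flat in a1."""
    for start in range(0, nrow * ncol, ncol):
        total = sum(a1[start:start + ncol])  # row i occupies a1[i*ncol : (i+1)*ncol]
        for k in range(start, start + ncol):
            a1[k] /= total


nrow, ncol = 2, 3
flat = [1.0, 1.0, 2.0,   # row 0
        2.0, 3.0, 5.0]   # row 1
sum1rows(nrow, ncol, flat)
print(flat)
```

Each row is normalised independently, so the two rows end up summing to 1 even though their original totals differ.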
SciPy has an extension tutorial with example code for arrays.
http://docs.scipy.org/doc/numpy/user/c-info.how-to-extend.html