Cython Memoryview Seg Fault - python

I am running into a segmentation fault when trying to use Cython's memoryview. This is my code:
def fock_build_init_with_inputs(tei_ints):
    # set the number of orbitals
    norb = tei_ints.shape[0]
    # get memory view of TEIs
    cdef double[:,:,:,::1] tei_memview = tei_ints
    # get index pairs
    prep_ipss_serial(norb, &tei_memview[0,0,0,0])

void prep_ipss_serial(const int n, const double * const tei) {
    int p, q, r, s, np;
    double maxval;
    const double thresh = 1.0e-9;
    // first we count the number of index pairs with above-threshold integrals
    np = 0;
    for (q = 0; q < n; q++)
        for (p = q; p < n; p++) {
            maxval = 0.0;
            for (s = 0; s < n; s++)
                for (r = s; r < n; r++) {
                    maxval = fmax( maxval, fabs( tei[ r + n*s + n*n*p + n*n*n*q ] ) );
                }
            if ( maxval > thresh )
                np++;
        }
    ipss_np = np;
}
When I run the code by calling the first function with an input of numpy.zeros([n,n,n,n]), I get a segmentation fault once n exceeds a certain number (212). Does anyone know what is causing this problem and how to resolve it?
Thanks,
Luning

This looks like a 32-bit integer overflow: the flat index grows like n^4, and 213*213*213*213 = 2,058,346,161 is already right at the edge of the signed 32-bit range (maximum 2,147,483,647), which is exceeded outright from n = 216 onwards. You should use 64-bit integers for your indexes (long, or more explicitly int64_t, since long is only 32 bits on some platforms such as 64-bit Windows).
Why are you converting your memoryview to a pointer, though? You won't have gained much speed, you'll have lost any information about the shape (for example, you are assuming that all the dimensions are the same), and you could let Cython handle the multi-dimensional indexing for you. It would be much better to make the tei argument a memoryview rather than a pointer.
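To see where a signed 32-bit index runs out for this particular loop, the largest flat index can be computed directly. This is only a sketch of the arithmetic; the helper names largest_flat_index and first_overflowing_n are made up for illustration and are not part of the question's code.

```python
INT32_MAX = 2**31 - 1  # largest signed 32-bit integer


def largest_flat_index(n):
    """Largest value of r + n*s + n*n*p + n*n*n*q with all indices in [0, n)."""
    m = n - 1
    return m + n * m + n * n * m + n * n * n * m  # algebraically n**4 - 1


def first_overflowing_n():
    """Smallest n whose largest flat index no longer fits in a signed 32-bit int."""
    n = 1
    while largest_flat_index(n) <= INT32_MAX:
        n += 1
    return n


print(largest_flat_index(213))  # 2058346160
print(first_overflowing_n())    # 216
```

Python integers never overflow, so this only reproduces the arithmetic. In the C function the same expression is evaluated in int, and it stops being representable once n reaches the second printed value.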

Related

Efficient way to read a set of 3 channel images from Python into a two dimensional array to be used in C

I am working on a project involving object detection through deep learning, with the underlying detection code written in C. Due to the requirements of the project, this code has a Python wrapper around it, which interfaces with the required C functions through ctypes. Images are read from Python, and then transferred into C to be processed as a batch.
In its current state, the code is very unoptimized: the images (640x360x3 each) are read using cv2.imread then stacked into a numpy array. For example, for a batch size of 16, the dimensions of this array are (16,360,640,3). Once this is done, a pointer to this array is passed through ctypes into C where the array is parsed, pixel values are normalized and rearranged into a 2D array. The dimensions of the 2D array are 16x691200 (16x(640*360*3)), arranged as follows.
row [0]: Image 0: (B)r0(B)r1(B)r2.... (G)r0(G)r1(G)r2.... (R)r0(R)r1(R)r2....
row [1]: Image 1: (B)r0(B)r1(B)r2.... (G)r0(G)r1(G)r2.... (R)r0(R)r1(R)r2....
.
.
row [15]: Image 15: (B)r0(B)r1(B)r2.... (G)r0(G)r1(G)r2.... (R)r0(R)r1(R)r2....
The C code for doing this currently looks like this, where the pixel values are accessed through strides and arranged sequentially per image. nb is the total number of images in the batch (usually 16); h, w, c are 360,640 and 3 respectively.
matrix ndarray_to_matrix(unsigned char* src, long* shape, long* strides)
{
int nb = shape[0];
int h = shape[1];
int w = shape[2];
int c = shape[3];
matrix X = make_matrix(nb, h*w*c);
int step_b = strides[0];
int step_h = strides[1];
int step_w = strides[2];
int step_c = strides[3];
int b, i, j, k;
int index1, index2 = 0;
for(b = 0; b < nb ; ++b) {
for(i = 0; i < h; ++i) {
for(k= 0; k < c; ++k) {
for(j = 0; j < w; ++j) {
index1 = k*w*h + i*w + j;
index2 = step_b*b + step_h*i + step_w*j + step_c*k;
X.vals[b][index1] = src[index2]/255.;
}
}
}
}
return X;
}
And the corresponding Python code that calls this function: (array is the original numpy array)
for i in range(start, end):
    imgName = imgDir + '/' + allImageName[i]
    img = cv2.imread(imgName, 1)
    batchImageData[i-start,:,:] = img[:,:]
data = batchImageData.ctypes.data_as(POINTER(c_ubyte))
resmatrix = self.ndarray_to_matrix(data, batchImageData.ctypes.shape, batchImageData.ctypes.strides)
As of now, this ctypes implementation takes about 35 ms for a batch of 16 images. I'm working on a very FPS critical image processing pipeline, so is there a more efficient way of doing these operations? Specifically:
Can I read the image directly as a 'strided' one dimensional array in Python from disk, thus avoiding the iterative access and copying?
I have looked into numpy operations such as:
np.ascontiguousarray(img.transpose(2,0,1).flat, dtype=float)/255. which should achieve something similar, but this is actually taking more time possibly because of it being called in Python.
Would Cython help anywhere during the read operation?
Regarding the ascontiguousarray method, I assume it is slow because Python has to do some memory work to return a C-contiguous array.
EDIT 1:
I saw this answer; apparently OpenCV's imread function should already return a contiguous array.
I am not very familiar with ctypes, but I happen to use the PyBind library and can only recommend it. It implements Python's buffer protocol, allowing you to interact with Python data with almost no overhead.
I've answered a question explaining how to pass a numpy array from Python to C/C++, do something dummy to it in C++ and return a dynamically created array back to Python.
EDIT 2: I've added a simple example that receives a NumPy array, sends it to C and prints it from C. You can find it here. Hope it helps!
EDIT 3 :
To answer your last comment, yes you can definitely do that.
You could modify your code to (1) instantiate a 2D numpy array in C++, (2) pass its reference to the data to your C function that will modify it instead of declaring a Matrix and (3) return that instance to Python by reference.
Your function would become:
void ndarray_to_matrix(unsigned char* src, double * x, long* shape, long* strides)
{
int nb = shape[0];
int h = shape[1];
int w = shape[2];
int c = shape[3];
int step_b = strides[0];
int step_h = strides[1];
int step_w = strides[2];
int step_c = strides[3];
int b, i, j, k;
int index1, index2 = 0;
for(b = 0; b < nb ; ++b) {
for(i = 0; i < h; ++i) {
for(k= 0; k < c; ++k) {
for(j = 0; j < w; ++j) {
index1 = k*w*h + i*w + j;
index2 = step_b*b + step_h*i + step_w*j + step_c*k;
x[b*h*w*c + index1] = src[index2]/255.;
}
}
}
}
}
And you'd add, in your C++ wrapper code
// Instantiate the output array with its final 2D shape, assuming we know b, h, c, w
py::array_t<double> x({b, h*c*w});
py::buffer_info bufx = x.request();
double *ptrx = (double *) bufx.ptr;
// Call your C function with ptrx as the output buffer
ndarray_to_matrix(src, ptrx, shape, strides);
Do not forget to modify the prototype of the C++ wrapper function to return a numpy array like:
py::array_t<double> read_matrix(...){}...
This should work, I didn't test it though :)
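As a point of comparison for the C loop above, the same channel-major layout can be written as a single NumPy transpose and reshape over the whole batch. This is only a sketch: batch_to_matrix and batch_to_matrix_loops are names made up here, the batch is assumed to be a uint8 array of shape (nb, h, w, c), and whether this is faster than the ctypes path would have to be measured.

```python
import numpy as np


def batch_to_matrix(batch):
    """Return an (nb, c*h*w) float64 array, channel-major per image, scaled by 1/255."""
    nb = batch.shape[0]
    # (nb, h, w, c) -> (nb, c, h, w); the reshape then copies into row-major order
    chw = batch.transpose(0, 3, 1, 2)
    return chw.reshape(nb, -1).astype(np.float64) / 255.0


def batch_to_matrix_loops(batch):
    """Direct Python transcription of the C loop, used only as a reference."""
    nb, h, w, c = batch.shape
    out = np.empty((nb, h * w * c))
    for b in range(nb):
        for i in range(h):
            for k in range(c):
                for j in range(w):
                    out[b, k * w * h + i * w + j] = batch[b, i, j, k] / 255.0
    return out


# Small random batch standing in for the 16 x 360 x 640 x 3 images
rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(2, 4, 5, 3), dtype=np.uint8)
assert np.allclose(batch_to_matrix(batch), batch_to_matrix_loops(batch))
```

The loop version exists only to confirm that the transpose produces the same element order as index1 = k*w*h + i*w + j.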

How to make python of exponential growth to code equivalent to c++?

In this code I am computing a numerical approximation of the solution of an ODE u'(tk)=u(tk)=uk and storing all the uk and tk values as shown below.
Code:
def compute_u(u0,T,n):
    t = linspace(0,T,n+1)
    t[0] = 0
    u=zeros(n+1)
    u[0]= u0
    dt = T/float(n)
    for k in range(0, n, 1):
        u[k+1] = (1+dt)*u[k]
        t[k+1] = t[k] + dt
    return u, t
I am now trying to implement this code in C++ and I am hitting a few rocks along the way. I am relatively new to C++, and I was wondering if anyone in this forum could point me in the right direction, since Python has functions that C++ does not, such as linspace or zeros. Any input will be helpful.
Here you have linspace:
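std::vector< float > linspace(float a, float b, uint32_t n)
{
    std::vector< float > result(n);
    if (n == 0) return result;                     // nothing to fill
    if (n == 1) { result[0] = a; return result; }  // avoid dividing by n - 1 == 0
    float step = (b - a) / (float) (n - 1);
    for (uint32_t i = 0; i + 1 < n; i++) {
        result[i] = a + (float) i * step;
    }
    result.back() = b;
    return result;
}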
Try writing zeros yourself (a std::vector< float >(n) is already zero-initialized).
Or a better solution: use Eigen, which has both functions.
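When porting, it helps to have reference values to compare the C++ output against. The loop in the question is a forward Euler step for u' = u, so it has the closed form u[k] = u0*(1+dt)**k and t[k] = k*dt. The sketch below checks that in Python; compute_u_closed_form is a name made up for this illustration.

```python
import numpy as np


def compute_u(u0, T, n):
    """Forward Euler for u' = u, following the loop in the question."""
    dt = T / float(n)
    u = np.zeros(n + 1)
    t = np.zeros(n + 1)
    u[0] = u0
    for k in range(n):
        u[k + 1] = (1 + dt) * u[k]
        t[k + 1] = t[k] + dt
    return u, t


def compute_u_closed_form(u0, T, n):
    """Same values without the loop: u[k] = u0 * (1 + dt)**k, t[k] = k * dt."""
    dt = T / float(n)
    k = np.arange(n + 1)
    return u0 * (1 + dt) ** k, k * dt


u, t = compute_u(1.0, 2.0, 50)
u_ref, t_ref = compute_u_closed_form(1.0, 2.0, 50)
assert np.allclose(u, u_ref) and np.allclose(t, t_ref)
```

A C++ version can print its u and t vectors and be compared against either function for the same u0, T and n.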

OpenMP, Python, C Extension, Memory Access and the evil GIL

I am currently trying to compute something like A**b for a 2D ndarray A and a double b in parallel from Python. I would like to do it with a C extension using OpenMP (yes, I know there is Cython etc., but at some point I always ran into trouble with those 'high-level' approaches...).
So here is the gaussian.c code for my gaussian.so:
void scale(const double *A, double *out, int n) {
int i, j, ind1, ind2;
double power, denom;
power = 10.0 / M_PI;
denom = sqrt(M_PI);
#pragma omp parallel for
for (i = 0; i < n; i++) {
for (j = i; j < n; j++) {
ind1 = i*n + j;
ind2 = j*n + i;
out[ind1] = pow(A[ind1], power) / denom;
out[ind2] = out[ind1];
}
}
}
(A is a square double Matrix, out has the same shape and n is the number of rows/columns) So the point is to update some symmetric distance matrix - ind2 is the transposed index of ind1.
I compile it using gcc -shared -fopenmp -o gaussian.so -lm gaussian.c. I access the function directly via ctypes in Python:
test = c_gaussian.scale
test.restype = None
test.argtypes = [ndpointer(ctypes.c_double,
                           ndim=2,
                           flags='C_CONTIGUOUS'),  # array of samples
                 ndpointer(ctypes.c_double,
                           ndim=2,
                           flags='C_CONTIGUOUS'),  # array of samples
                 ctypes.c_int                      # number of samples
                 ]
The function 'test' works smoothly as long as I comment out the #pragma line; otherwise it ends with exit code 139 (a segmentation fault).
A = np.random.rand(1000, 1000) + 2.0
out = np.empty((1000, 1000))
test(A, out, 1000)
When I change the inner loop to just print ind1 and ind2, it runs smoothly in parallel. It also works when I only access the ind1 location and leave ind2 alone (even in parallel)! Where do I screw up the memory access? How can I fix this?
thank you!
Update: Well I guess this is running into the GIL, but I am not yet sure...
Update: Okay, I am pretty sure now, that it is evil GIL killing me here, so I altered the example:
I now have gil.c:
#include <Python.h>
#define _USE_MATH_DEFINES
#include <math.h>
void scale(const double *A, double *out, int n) {
int i, j, ind1, ind2;
double power, denom;
power = 10.0 / M_PI;
denom = sqrt(M_PI);
Py_BEGIN_ALLOW_THREADS
#pragma omp parallel for
for (i = 0; i < n; i++) {
for (j = i; j < n; j++) {
ind1 = i*n + j;
ind2 = j*n + i;
out[ind1] = pow(A[ind1], power) / denom;
out[ind2] = out[ind1];
}
}
Py_END_ALLOW_THREADS
}
which is compiled using gcc -shared -fopenmp -o gil.so -lm gil.c -I /usr/include/python2.7 -L /usr/lib/python2.7/ -lpython2.7 and the corresponding Python file:
import ctypes
import numpy as np
from numpy.ctypeslib import ndpointer
import pylab as pl
path = '../src/gil.so'
c_gil = ctypes.cdll.LoadLibrary(path)
test = c_gil.scale
test.restype = None
test.argtypes = [ndpointer(ctypes.c_double,
ndim=2,
flags='C_CONTIGUOUS'),
ndpointer(ctypes.c_double,
ndim=2,
flags='C_CONTIGUOUS'),
ctypes.c_int
]
n = 100
A = np.random.rand(n, n) + 2.0
out = np.empty((n,n))
test(A, out, n)
This gives me
Fatal Python error: PyEval_SaveThread: NULL tstate
Process finished with exit code 134
Now somehow it seems unable to save the current thread, but the API doc does not go into detail here. I was hoping that I could ignore Python when writing my C function, but this seems to be quite messy :( Any ideas? I found this very helpful: GIL
Your problem is much simpler than you think and does not involve the GIL in any way. You are running into an out-of-bounds access to out[] when you access it via ind2, since j can easily become larger than n. The reason is that you have not applied any data-sharing clause to your parallel region, so all variables except i remain shared (the OpenMP default) and are therefore subject to data races, in this case multiple simultaneous increments of j by different threads. A j that is too large is less of a problem for ind1, but it is for ind2, since there the overly large value is multiplied by n and the index lands far outside the array.
Simply make j, ind1 and ind2 private, as they should be:
#pragma omp parallel for private(j,ind1,ind2)
for (i = 0; i < n; i++) {
for (j = i; j < n; j++) {
ind1 = i*n + j;
ind2 = j*n + i;
out[ind1] = pow(A[ind1], power) / denom;
out[ind2] = out[ind1];
}
}
Even better, declare them inside the scope where they are being used. That automatically makes them private:
#pragma omp parallel for
for (i = 0; i < n; i++) {
int j;
for (j = i; j < n; j++) {
int ind1 = i*n + j;
int ind2 = j*n + i;
out[ind1] = pow(A[ind1], power) / denom;
out[ind2] = out[ind1];
}
}
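To check the output of the fixed C function, a NumPy reference of the same computation is handy. The sketch below mirrors what the loop does (upper triangle computed from A, lower triangle copied from it); scale_reference is a name made up for this illustration.

```python
import numpy as np


def scale_reference(A):
    """NumPy version of scale(): upper triangle from A, mirrored into the lower one."""
    power = 10.0 / np.pi
    denom = np.sqrt(np.pi)
    upper = np.triu(A ** power / denom)     # entries with j >= i
    return upper + np.triu(upper, k=1).T    # copy the strict upper part below the diagonal


A = np.random.rand(6, 6) + 2.0
out = scale_reference(A)
assert np.allclose(out, out.T)  # the result is symmetric by construction
```

With the ctypes wrapper from the question, np.allclose(out_from_c, scale_reference(A)) should hold for both the serial and the parallel build.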

Returning C arrays into python scope from scipy's weave.inline

I am using scipy's weave.inline to perform computationally expensive tasks. I have problems returning a one-dimensional array back into the Python scope. weave.inline uses a special argument called "return_val" for the purpose of returning values back into the Python scope.
The following example returning an integer value works well:
>>> from scipy.weave import inline
>>> print inline(r'''int N = 10; return_val = N;''')
10
However, the following example, which compiles without any error, does not return the array I would expect:
>>> from scipy.weave import inline
>>> code =\
r'''
int* pairs;
int lenght = 0;
for (int i=0;i<N;i++){
lenght += 1;
pairs = (int *)malloc(sizeof(int)*lenght);
pairs[i] = i;
std::cout << pairs[i] << std::endl;
}
return_val = pairs;
'''
>>> N = 5
>>> R = inline(code,['N'])
>>> print "RETURN_VAL:",R
0
1
2
3
4
RETURN_VAL: 1
I need to reallocate the size of the array "pairs" dynamically which is why I can't pass a numpy.array or python list per se.
All you need to do is use the raw python c-api calls, or if you're looking for something a bit more convenient, the built in scipy weave wrappers.
No guarantees about leaks or efficiency, but it should look something a bit like this:
from scipy.weave import inline
code = r'''
py::list ret;
for(int i = 0; i < N; i++) {
py::list item;
for(int j = 0; j < i; j++) {
item.append(j);
}
ret.append(item);
}
return_val = ret;
'''
N = 5
R = inline(code,['N'])
print R
If you absolutely don't know the size of the output array in advance, you must create it in your inline code. I'm pretty sure that your array allocated by using malloc will result in leaked memory since you have no way of controlling when this memory is to be freed.
The solution is to create a numpy array, fill it with your function's results and return it.
import scipy.weave
code = r"""
npy_intp dims[1] = {n};
PyObject* out_array = PyArray_SimpleNew(1, dims, NPY_DOUBLE);
double* data = (double*) ((PyArrayObject*) out_array)->data;
for (int i=0; i<n; ++i) data[i] = i;
return_val = out_array;
Py_XDECREF(out_array);
"""
n = 5
out_array = scipy.weave.inline(code, ["n"])
print "Array:", out_array

Extending Numpy with C function

I am trying to speed up my Numpy code and decided that I wanted to implement one particular function where my code spent most of the time in C.
I'm actually a rookie in C, but I managed to write the function which normalizes every row in a matrix to sum to 1. I can compile it and I tested it with some data (in C) and it does what I want. At that point I was very proud of myself.
Now I'm trying to call my glorious function from Python where it should accept a 2d-Numpy array.
The various things I've tried are
SWIG
SWIG + numpy.i
ctypes
My function has the prototype
void normalize_logspace_matrix(size_t nrow, size_t ncol, double mat[nrow][ncol]);
So it takes a pointer to a variable-length array and modifies it in place.
I tried the following pure SWIG interface file:
%module c_utils
%{
extern void normalize_logspace_matrix(size_t, size_t, double mat[*][*]);
%}
extern void normalize_logspace_matrix(size_t, size_t, double** mat);
Then I would do (on Mac OS X 64bit):
> swig -python c-utils.i
> gcc -fPIC c-utils_wrap.c -o c-utils_wrap.o \
-I/Library/Frameworks/Python.framework/Versions/6.2/include/python2.6/ \
-L/Library/Frameworks/Python.framework/Versions/6.2/lib/python2.6/ -c
c-utils_wrap.c: In function ‘_wrap_normalize_logspace_matrix’:
c-utils_wrap.c:2867: warning: passing argument 3 of ‘normalize_logspace_matrix’ from incompatible pointer type
> g++ -dynamiclib c-utils.o -o _c_utils.so
In Python I then get the following error on importing my module:
>>> import c_utils
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: dynamic module does not define init function (initc_utils)
Next I tried this approach using SWIG + numpy.i:
%module c_utils
%{
#define SWIG_FILE_WITH_INIT
#include "c-utils.h"
%}
%include "numpy.i"
%init %{
import_array();
%}
%apply ( int DIM1, int DIM2, DATA_TYPE* INPLACE_ARRAY2 )
{(size_t nrow, size_t ncol, double* mat)};
%include "c-utils.h"
However, I don't get any further than this:
> swig -python c-utils.i
c-utils.i:13: Warning 453: Can't apply (int DIM1,int DIM2,DATA_TYPE *INPLACE_ARRAY2). No typemaps are defined.
SWIG doesn't seem to find the typemaps defined in numpy.i, but I don't understand why, because numpy.i is in the same directory and SWIG doesn't complain that it can't find it.
With ctypes I didn't get very far, but got lost in the docs pretty quickly since I couldn't figure out how to pass it a 2d-array and then get the result back.
So could somebody show me the magic trick how to make my function available in Python/Numpy?
Unless you have a really good reason not to, you should use cython to interface C and python. (We are starting to use cython instead of raw C inside numpy/scipy themselves).
You can see a simple example in my scikits talkbox (since cython has improved quite a bit since then, I think you could write it better today).
def cslfilter(c_np.ndarray b, c_np.ndarray a, c_np.ndarray x):
    """Fast version of slfilter for a set of frames and filter coefficients.
    More precisely, given rank 2 arrays for coefficients and input, this
    computes:

    for i in range(x.shape[0]):
        y[i] = lfilter(b[i], a[i], x[i])

    This is mostly useful for processing on a set of windows with variable
    filters, e.g. to compute LPC residual from a signal chopped into a set of
    windows.

    Parameters
    ----------
    b: array
        recursive coefficients
    a: array
        non-recursive coefficients
    x: array
        signal to filter

    Note
    ----
    This is a specialized function, and does not handle other types than
    double, nor initial conditions."""
    cdef int na, nb, nfr, i, nx
    cdef double *raw_x, *raw_a, *raw_b, *raw_y
    cdef c_np.ndarray[double, ndim=2] tb
    cdef c_np.ndarray[double, ndim=2] ta
    cdef c_np.ndarray[double, ndim=2] tx
    cdef c_np.ndarray[double, ndim=2] ty

    dt = np.common_type(a, b, x)
    if not dt == np.float64:
        raise ValueError("Only float64 supported for now")
    if not x.ndim == 2:
        raise ValueError("Only input of rank 2 support")
    if not b.ndim == 2:
        raise ValueError("Only b of rank 2 support")
    if not a.ndim == 2:
        raise ValueError("Only a of rank 2 support")

    nfr = a.shape[0]
    if not nfr == b.shape[0]:
        raise ValueError("Number of filters should be the same")
    if not nfr == x.shape[0]:
        raise ValueError, \
              "Number of filters and number of frames should be the same"

    tx = np.ascontiguousarray(x, dtype=dt)
    ty = np.ones((x.shape[0], x.shape[1]), dt)
    na = a.shape[1]
    nb = b.shape[1]
    nx = x.shape[1]
    ta = np.ascontiguousarray(np.copy(a), dtype=dt)
    tb = np.ascontiguousarray(np.copy(b), dtype=dt)

    raw_x = <double*>tx.data
    raw_b = <double*>tb.data
    raw_a = <double*>ta.data
    raw_y = <double*>ty.data

    for i in range(nfr):
        filter_double(raw_b, nb, raw_a, na, raw_x, nx, raw_y)
        raw_b += nb
        raw_a += na
        raw_x += nx
        raw_y += nx

    return ty
As you can see, besides the usual argument checking you would do in Python, it is almost the same thing (filter_double is a function which can be written in pure C in a separate library if you want to). Of course, since it is compiled code, failing to check your arguments will crash your interpreter instead of raising an exception (there are several levels of safety vs. speed tradeoffs available with recent Cython, though).
To answer the real question: SWIG doesn't tell you it can't find any typemaps. It tells you it can't apply the typemap (int DIM1,int DIM2,DATA_TYPE *INPLACE_ARRAY2), which is because there is no typemap defined for DATA_TYPE *. You need to tell it you want to apply it to a double*:
%apply ( int DIM1, int DIM2, double* INPLACE_ARRAY2 )
{(size_t nrow, size_t ncol, double* mat)};
First, are you sure that you were writing the fastest possible numpy code? If by normalise you mean divide the whole row by its sum, then you can write fast vectorised code which looks something like this:
matrix /= matrix.sum(axis=1, keepdims=True)
If this is not what you had in mind and you are still sure that you need a fast C extension, I would strongly recommend you write it in cython instead of C. This will save you all the overhead and difficulties in wrapping code, and allow you to write something which looks like python code but which can be made to run as fast as C in most circumstances.
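Dividing each row by its own sum means summing along axis=1 and keeping the reduced axis so that broadcasting lines the sums up with the rows. A minimal sketch, with normalize_rows as a made-up helper name and a small float matrix as example input:

```python
import numpy as np


def normalize_rows(matrix):
    """Return a copy of matrix in which every row sums to 1."""
    row_sums = matrix.sum(axis=1, keepdims=True)  # shape (nrow, 1)
    return matrix / row_sums


m = np.array([[1.0, 3.0],
              [2.0, 2.0],
              [0.5, 1.5]])
normalized = normalize_rows(m)
print(normalized.sum(axis=1))  # [1. 1. 1.]
```

Summing along axis=0 instead would produce column sums, which normalizes columns rather than rows. Note that this sketch works on plain values and does not attempt the log-space handling that the C function's name suggests.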
I agree with others that a little Cython is well worth learning.
But if you must write C or C++, use a 1d array which overlays the 2d, like this:
// sum1rows.cpp: 2d A as 1d A1
// Unfortunately
// void f( int m, int n, double a[m][n] ) { ... }
// is valid c but not c++ .
// See also
// http://stackoverflow.com/questions/3959457/high-performance-c-multi-dimensional-arrays
// http://stackoverflow.com/questions/tagged/multidimensional-array c++
#include <stdio.h>
void sum1( int n, double x[] ) // x /= sum(x)
{
double sum = 0;
for( int j = 0; j < n; j ++ )
sum += x[j];
for( int j = 0; j < n; j ++ )
x[j] /= sum;
}
void sum1rows( int nrow, int ncol, double A1[] ) // 1d A1 == 2d A[nrow][ncol]
{
for( int j = 0; j < nrow*ncol; j += ncol )
sum1( ncol, &A1[j] );
}
int main( int argc, char** argv )
{
const int nrow = 100, ncol = 10;  // constants, so A below is a fixed-size array
double A[nrow][ncol];
for( int j = 0; j < nrow; j ++ )
for( int k = 0; k < ncol; k ++ )
A[j][k] = (j+1) * k;
double* A1 = &A[0][0]; // A as 1d array -- bad practice
sum1rows( nrow, ncol, A1 );
for( int j = 0; j < 2; j ++ ){
for( int k = 0; k < ncol; k ++ ){
printf( "%.2g ", A[j][k] );
}
printf( "\n" );
}
}
Added 8 Nov: as you probably know, numpy.reshape can overlay a numpy 2d array with a 1d view to pass to sum1rows, like this:
import numpy as np
A = np.arange(10).reshape((2,5))
A1 = A.reshape(A.size) # a 1d view of A, not a copy
# sum1rows( 2, 5, A1 )
A[1,1] += 10
print "A:", A
print "A1:", A1
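Whether reshape gives a view or a copy can be checked rather than assumed. The sketch below (Python 3 syntax, using np.shares_memory) confirms the view for a C-contiguous array and shows that a transposed array gets copied instead.

```python
import numpy as np

A = np.arange(10, dtype=np.float64).reshape((2, 5))
A1 = A.reshape(A.size)           # 1d view of the contiguous 2d array
assert np.shares_memory(A, A1)   # same buffer, no copy

A[1, 1] += 10                    # a write through A ...
assert A1[6] == 16.0             # ... is visible through A1 (row 1, col 1 -> flat index 6)

# A transposed array is not C-contiguous, so reshape has to copy
At = A.T
At1 = At.reshape(At.size)
assert not np.shares_memory(At, At1)
```

Before handing a flat pointer to C, it is worth making sure the array is C-contiguous (for example with np.ascontiguousarray), otherwise the C code ends up modifying a temporary copy.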
SciPy has an extension tutorial with example code for arrays.
http://docs.scipy.org/doc/numpy/user/c-info.how-to-extend.html
