I use Cython and I need to store the data as shown below. Earlier I used for loops to copy the data from pus_image[0] into a 3D array, but when running over n frames this created a performance bottleneck. I therefore switched to PyArray_NewFromDescr, which removes that bottleneck, but the displayed images now look different from those produced by the previous method, because I am no longer able to do the increment _puc_image += aoiStride. Could anyone please help me solve this issue?
Code 1 :
def LiveAquisition(self, nframes, np.ndarray[np.uint16_t, ndim=3, mode='c'] data):
    cdef:
        int available
        AT_64 sizeInBytes
        AT_64 aoiStride
        AT_WC string[20]
        AT_WC string1[20]
        AT_WC string2[20]
        AT_WC string3[20]
        unsigned char * pBuf
        unsigned char * _puc_image
        int BufSize
        unsigned int i, j, k, l = 0

    for i in range(nframes):
        pBuf = <unsigned char *>calloc(sizeInBytes, sizeof(unsigned char))
        AT_QueueBuffer(<AT_H>self.cameraHandle, pBuf, sizeInBytes)
        print "Frame number is :",
        print i
        response_code = AT_WaitBuffer(<AT_H>self.cameraHandle, &pBuf, &BufSize, 500)

        _puc_image = pBuf
        pus_image = <unsigned short*>pBuf

        for j in range(self.aoiWidth/self.hbin):
            pus_image = <unsigned short*>(_puc_image)
            for k in range(self.aoiHeight/self.vbin):
                data[l][j][k] = pus_image[0]
                pus_image += 1
            _puc_image += aoiStride

    free(pBuf)
    return data
Code 2 : Using PyArray_NewFromDescr
Prior to this, the following is defined:
from cpython.ref cimport PyTypeObject
from python_ref cimport Py_INCREF

cdef extern from "<numpy/arrayobject.h>":
    object PyArray_NewFromDescr(PyTypeObject *subtype, np.dtype descr, int nd, np.npy_intp* dims, np.npy_intp* strides, void* data, int flags, object obj)
def LiveAquisition(self, nframes, np.ndarray[np.uint16_t, ndim=3, mode='c'] data):
    cdef:
        int available
        AT_64 sizeInBytes
        AT_64 aoiStride
        AT_WC string[20]
        AT_WC string1[20]
        AT_WC string2[20]
        AT_WC string3[20]
        unsigned char * pBuf
        unsigned char * _puc_image
        int BufSize
        unsigned int i, j, k, l = 0
        np.npy_intp dims[2]
        np.dtype dtype = np.dtype('<B')

    for i in range(nframes):
        pBuf = <unsigned char *>calloc(sizeInBytes, sizeof(unsigned char))
        AT_QueueBuffer(<AT_H>self.cameraHandle, pBuf, sizeInBytes)
        print "Frame number is :",
        print i
        response_code = AT_WaitBuffer(<AT_H>self.cameraHandle, &pBuf, &BufSize, 500)

        Py_INCREF(dtype)
        dims[0] = self.aoiWidth
        dims[1] = self.aoiHeight
        data[i,:,:] = PyArray_NewFromDescr(<PyTypeObject *> np.ndarray, np.dtype('<B'), 2, dims, NULL, pBuf, np.NPY_C_CONTIGUOUS, None)

    free(pBuf)
    return data
There are a few large errors in the way you're doing this. However, what you're doing is totally unnecessary, and there's a much simpler approach.
You can simply allocate the data using Numpy, and get the address of the first element of that array:
# earlier
cdef unsigned char[:,::1] p

# in loop
p = np.empty((self.aoiWidth, self.aoiHeight), dtype=np.uint8)  # numpy allocates (and owns) the memory
pbuf = &p[0,0]  # address of first element of p
# code goes here
data[i,:,:] = p
Errors in what you're doing:
pBuf = <unsigned char *>calloc(sizeInBytes, sizeof(unsigned char))
Here, sizeInBytes is uninitialized, and therefore the size you allocate will be arbitrary.
PyArray_NewFromDescr steals a reference to the descr argument. This means that it does not increment the reference count of the argument. The line
PyArray_NewFromDescr(<PyTypeObject *> np.ndarray, np.dtype('<B'), ...)
will be translated by Cython into something like
temp_dtype = np.dtype('<B') # refcount 1
PyArray_NewFromDescr(<PyTypeObject *> np.ndarray, temp_dtype, ...)
# temp_dtype refcount is still 1
Py_DECREF(temp_dtype) # Cython's own cleanup
# temp_dtype has now been destroyed, but is still being used by your array
It looks like you copied some code that dealt with this correctly (Py_INCREF(dtype), which was then passed to PyArray_NewFromDescr), but chose to ignore that and create your own temporary object.
An array created with PyArray_NewFromDescr does not own its data. Therefore you are responsible for deallocating it once it has been used (and only when you're sure it's no longer needed). You only do one free, after the loop, so you are leaking almost all the memory you allocated. Either put the free in the loop, or set the OWNDATA flag to give your new array ownership of the data.
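If you do go the OWNDATA route, here is a rough sketch (untested, and the name wrap_frame is made up). It reuses the PyArray_NewFromDescr declaration and the PyTypeObject import from the question, and it assumes the buffer really came from malloc/calloc, since numpy will eventually release it with free:
cimport numpy as np
from cpython.ref cimport Py_INCREF

np.import_array()

cdef extern from "numpy/arrayobject.h":
    void PyArray_ENABLEFLAGS(np.ndarray arr, int flags)

cdef np.ndarray wrap_frame(unsigned char* pBuf, np.npy_intp* dims, np.dtype dtype):
    cdef np.ndarray arr
    Py_INCREF(dtype)  # PyArray_NewFromDescr steals this reference
    arr = PyArray_NewFromDescr(<PyTypeObject *> np.ndarray, dtype, 2, dims,
                               NULL, pBuf, np.NPY_C_CONTIGUOUS, None)
    PyArray_ENABLEFLAGS(arr, np.NPY_OWNDATA)  # the array now frees pBuf when it is collected
    return arr
With ownership transferred like this, you must not call free(pBuf) yourself afterwards.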
In summary, unless you have a good understanding of the Python C API, I recommend not using PyArray_NewFromDescr and allocating your data with numpy arrays instead.
Related
I am running into a segmentation fault when trying to use Cython's memoryviews. This is my code:
def fock_build_init_with_inputs(tei_ints):
    # set the number of orbitals
    norb = tei_ints.shape[0]

    # get memory view of TEIs
    cdef double[:,:,:,::1] tei_memview = tei_ints

    # get index pairs
    prep_ipss_serial(norb, &tei_memview[0,0,0,0])
void prep_ipss_serial(const int n, const double * const tei) {

    int p, q, r, s, np;
    double maxval;
    const double thresh = 1.0e-9;

    // first we count the number of index pairs with above-threshold integrals
    np = 0;
    for (q = 0; q < n; q++)
        for (p = q; p < n; p++) {
            maxval = 0.0;
            for (s = 0; s < n; s++)
                for (r = s; r < n; r++) {
                    maxval = fmax( maxval, fabs( tei[ r + n*s + n*n*p + n*n*n*q ] ) );
                }
            if ( maxval > thresh )
                np++;
        }

    ipss_np = np;
}
When I run the code by calling the first function with an input of numpy.zeros([n,n,n,n]), I run into a segmentation fault when n exceeds a certain number (212). Does anyone know what is causing this problem and how to resolve it?
Thanks,
Luning
This looks like a 32-bit integer overflow - i.e. 213*213*213*213 is greater than the maximum 32-bit integer. You should use 64-bit integers as your indexes (long, or more explicitly int64_t).
Why are you converting your memoryview to a pointer though? You won't have gained much speed, and you'll have lost any information about the shape (for example, you have an assumption that all the dimensions are the same), when you could let Cython handle the multi-dimensional indexing for you. It would be much better to make the tei argument a memoryview rather than a pointer.
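As an illustration of that, here is a minimal sketch (untested, and count_index_pairs is a made-up name) of the counting loop written directly against the memoryview, with Py_ssize_t indexes so the index arithmetic is done in 64 bits:
cimport cython
from libc.math cimport fabs

def count_index_pairs(double[:, :, :, ::1] tei, double thresh=1.0e-9):
    # same traversal as prep_ipss_serial, but Cython does the 4-d indexing;
    # tei[q, p, s, r] corresponds to the flat index r + n*s + n*n*p + n*n*n*q
    # of a C-contiguous (n, n, n, n) array
    cdef Py_ssize_t n = tei.shape[0]
    cdef Py_ssize_t p, q, r, s, np_count = 0
    cdef double maxval, val
    for q in range(n):
        for p in range(q, n):
            maxval = 0.0
            for s in range(n):
                for r in range(s, n):
                    val = fabs(tei[q, p, s, r])
                    if val > maxval:
                        maxval = val
            if maxval > thresh:
                np_count += 1
    return np_count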
I am working on a project involving object detection through deep learning, with the underlying detection code written in C. Due to the requirements of the project, this code has a Python wrapper around it, which interfaces with the required C functions through ctypes. Images are read from Python, and then transferred into C to be processed as a batch.
In its current state, the code is very unoptimized: the images (640x360x3 each) are read using cv2.imread then stacked into a numpy array. For example, for a batch size of 16, the dimensions of this array are (16,360,640,3). Once this is done, a pointer to this array is passed through ctypes into C where the array is parsed, pixel values are normalized and rearranged into a 2D array. The dimensions of the 2D array are 16x691200 (16x(640*360*3)), arranged as follows.
row [0]: Image 0: (B)r0(B)r1(B)r2.... (G)r0(G)r1(G)r2.... (R)r0(R)r1(R)r2....
row [1]: Image 1: (B)r0(B)r1(B)r2.... (G)r0(G)r1(G)r2.... (R)r0(R)r1(R)r2....
.
.
row [15]: Image 15: (B)r0(B)r1(B)r2.... (G)r0(G)r1(G)r2.... (R)r0(R)r1(R)r2....
The C code for doing this currently looks like this, where the pixel values are accessed through strides and arranged sequentially per image. nb is the total number of images in the batch (usually 16); h, w, c are 360,640 and 3 respectively.
matrix ndarray_to_matrix(unsigned char* src, long* shape, long* strides)
{
    int nb = shape[0];
    int h = shape[1];
    int w = shape[2];
    int c = shape[3];

    matrix X = make_matrix(nb, h*w*c);

    int step_b = strides[0];
    int step_h = strides[1];
    int step_w = strides[2];
    int step_c = strides[3];

    int b, i, j, k;
    int index1, index2 = 0;

    for(b = 0; b < nb ; ++b) {
        for(i = 0; i < h; ++i) {
            for(k = 0; k < c; ++k) {
                for(j = 0; j < w; ++j) {
                    index1 = k*w*h + i*w + j;
                    index2 = step_b*b + step_h*i + step_w*j + step_c*k;
                    X.vals[b][index1] = src[index2]/255.;
                }
            }
        }
    }
    return X;
}
And the corresponding Python code that calls this function: (array is the original numpy array)
for i in range(start, end):
    imgName = imgDir + '/' + allImageName[i]
    img = cv2.imread(imgName, 1)
    batchImageData[i-start,:,:] = img[:,:]

data = batchImageData.ctypes.data_as(POINTER(c_ubyte))
resmatrix = self.ndarray_to_matrix(data, batchImageData.ctypes.shape, batchImageData.ctypes.strides)
As of now, this ctypes implementation takes about 35 ms for a batch of 16 images. I'm working on a very FPS critical image processing pipeline, so is there a more efficient way of doing these operations? Specifically:
Can I read the image directly as a 'strided' one dimensional array in Python from disk, thus avoiding the iterative access and copying?
I have looked into numpy operations such as:
np.ascontiguousarray(img.transpose(2,0,1).flat, dtype=float)/255., which should achieve something similar, but it actually takes more time, possibly because it is called from Python.
Would Cython help anywhere during the read operation?
Regarding the ascontiguousarray method, I'm assuming it's pretty slow because Python has to do some memory work to return a C-contiguous array.
EDIT 1:
I saw this answer; apparently OpenCV's imread function already returns a contiguous array.
I am not very familiar with ctypes, but I happen to use the PyBind library and can only recommend it. It implements Python's buffer protocol, hence allowing you to interact with Python data with almost no overhead.
I've answered a question explaining how to pass a numpy array from Python to C/C++, do something dummy to it in C++ and return a dynamically created array back to Python.
EDIT 2: I've added a simple example that receives a Numpy array, sends it to C and prints it from C. You can find it here. Hope it helps!
EDIT 3 :
To answer your last comment, yes you can definitely do that.
You could modify your code to (1) instantiate a 2D numpy array in C++, (2) pass a reference to its data to your C function, which fills it instead of declaring its own matrix, and (3) return that instance to Python by reference.
Your function would become:
void ndarray_to_matrix(unsigned char* src, double * x, long* shape, long* strides)
{
    int nb = shape[0];
    int h = shape[1];
    int w = shape[2];
    int c = shape[3];

    int step_b = strides[0];
    int step_h = strides[1];
    int step_w = strides[2];
    int step_c = strides[3];

    int b, i, j, k;
    int index1, index2 = 0;

    for(b = 0; b < nb ; ++b) {
        for(i = 0; i < h; ++i) {
            for(k = 0; k < c; ++k) {
                for(j = 0; j < w; ++j) {
                    index1 = k*w*h + i*w + j;
                    index2 = step_b*b + step_h*i + step_w*j + step_c*k;
                    x[b*h*w*c + index1] = src[index2]/255.;   /* write into the flat output buffer */
                }
            }
        }
    }
}
And you'd add, in your C++ wrapper code
// Instantiate the output array, assuming we know b, h, c,w
py::array_t<double> x = py::array_t<double>(b*h*c*w);
py::buffer_info bufx = x.request();
double*ptrx = (double *) bufx.ptr;
// Call to your C function with ptrx as input
ndarray_to_matrix(src, ptrx, shape, strides);
// now reshape x
x.reshape({b, h*c*w});
Do not forget to modify the prototype of the C++ wrapper function to return a numpy array like:
py::array_t<double> read_matrix(...){}...
This should work, I didn't test it though :)
Is there a way to use AES-NI instructions within Cython code?
Closest I could find is how someone accessed SIMD instructions:
https://groups.google.com/forum/#!msg/cython-users/nTnyI7A6sMc/a6_GnOOsLuQJ
AES-NI in Python thread was not answered:
Python support for AES-NI
You should be able to just define the intrinsics as if they're normal C functions in Cython. Something like
cdef extern from "emmintrin.h": # I'm going off the microsoft documentation for where the headers are
# define the datatype as an opaque type
ctypedef struct __m128i:
pass
__m128i _mm_set_epi32 (int i3, int i2, int i1, int i0)
cdef extern from "wmmintrin.h":
__m128i _mm_aesdec_si128(__m128i v,__m128i rkey)
# then in some Cython function
def f():
cdef __m128i v = _mm_set_epi32(1,2,3,4)
cdef __m128i key = _mm_set_epi32(5,6,7,8)
cdef __m128i result = _mm_aesdec_si128(v,key)
The question "how do I apply this over a bytes array"? First, you get a char* of the bytes array. Then just iterate over it with range (being careful not to run off the end).
# assuming you already have an __m128i key
cdef __m128i v, result
cdef char* array = python_bytes_array  # auto conversion
cdef int i, j

# you NEED to ensure that the byte array has a length divisible by
# 16, otherwise you'll probably get a segmentation fault.
for i in range(0, len(python_bytes_array), 16):
    # go over it in chunks of 16
    v = _mm_set_epi8(array[i+15], array[i+14], array[i+13],
                     # etc... fill in the rest
                     array[i+1], array[i])
    result = _mm_aesdec_si128(v, key)
    # write back to the same place?
    for j in range(16):
        array[i+j] = _mm_extract_epi8(result, j)
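One practical note: the AES intrinsics are only emitted if the compiler is told to target AES-NI. A minimal setup.py sketch (the module name aesni.pyx is made up) that adds the flag for gcc/clang:
from setuptools import Extension, setup
from Cython.Build import cythonize

# -maes makes the AES-NI intrinsics from wmmintrin.h available to gcc/clang
ext = Extension("aesni", sources=["aesni.pyx"], extra_compile_args=["-maes"])
setup(ext_modules=cythonize([ext]))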
Well, this seems easy, but I can't find a single reference on the web. In C we can create a char array of n null-characters as follows:
char arr[n] = "";
But when I try to do the same in Cython with
cdef char arr[n] = ""
I get this compilation error:
Error compiling Cython file:
------------------------------------------------------------
...
cdef char a[n] = ""
^
------------------------------------------------------------
Syntax error in C variable declaration
Obviously Cython doesn't allow to declare arrays this way, but is there an alternative? I don't want to manually set each item in the array, that is I'm not looking for something like this
cdef char a[10]
for i in range(0, 10, 1):
    a[i] = b"\0"
You don't have to set each element to make a length-zero C string. It is sufficient to just zero the first element:
cdef char arr[n]
arr[0] = 0
Next, if you want to zero the whole char array, use memset
from libc.string cimport memset
cdef char arr[n]
memset(arr, 0, n)
And if C purists complain about the 0 instead of '\0', note that the '\0' is a Python string (unicode in Python 3) in Cython. '\0' is not a C char in Cython! memset expects an integer value for its second argument, not a Python string.
If you really want to know the int value of a C '\0' in Cython, you must write a helper function in C:
/* zerochar.h */
static int zerochar()
{
return '\0';
}
And now:
cdef extern from "zerochar.h":
int zerochar()
cdef char arr[n]
arr[0] = zerochar()
or
cdef extern from "zerochar.h":
int zerochar()
from libc.string cimport memset
cdef char arr[n]
memset(arr, zerochar(), n)
In C, '' is used for a char and "" for a string. But an 'empty char' does not really make sense; what you probably want is '\0', or just 0.
Maybe:
import cython
from libc.stdlib cimport malloc, free

cdef char * test():
    cdef int i, n = 10
    cdef char *arr = <char *>malloc(n * sizeof(char))
    for i in range(n):
        arr[i] = 0   # '\0' is a Python string in Cython, so assign the integer 0
    return arr
Edit:
void *calloc(size_t count, size_t size);
does the zero-filling for you.
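For illustration, a small sketch of the calloc version (same idea, but the buffer comes back already zero-filled):
from libc.stdlib cimport calloc, free

cdef int n = 10
cdef char *arr = <char *>calloc(n, sizeof(char))  # every byte starts out as 0 ('\0')
# ... use arr as an empty C string / zero-filled buffer ...
free(arr)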
How about building a Python bytes object of n null bytes and pointing at it:
cdef bytes zeros = b'\0' * n
cdef char *arr = zeros  # valid only while zeros is kept alive
I am trying to speed up my Numpy code and decided to implement in C one particular function where my code spends most of its time.
I'm actually a rookie in C, but I managed to write the function which normalizes every row in a matrix to sum to 1. I can compile it and I tested it with some data (in C) and it does what I want. At that point I was very proud of myself.
Now I'm trying to call my glorious function from Python where it should accept a 2d-Numpy array.
The various things I've tried are
SWIG
SWIG + numpy.i
ctypes
My function has the prototype
void normalize_logspace_matrix(size_t nrow, size_t ncol, double mat[nrow][ncol]);
So it takes a pointer to a variable-length array and modifies it in place.
I tried the following pure SWIG interface file:
%module c_utils
%{
extern void normalize_logspace_matrix(size_t, size_t, double mat[*][*]);
%}
extern void normalize_logspace_matrix(size_t, size_t, double** mat);
Then I would do (on Mac OS X 64bit):
> swig -python c-utils.i
> gcc -fPIC c-utils_wrap.c -o c-utils_wrap.o \
-I/Library/Frameworks/Python.framework/Versions/6.2/include/python2.6/ \
-L/Library/Frameworks/Python.framework/Versions/6.2/lib/python2.6/ -c
c-utils_wrap.c: In function ‘_wrap_normalize_logspace_matrix’:
c-utils_wrap.c:2867: warning: passing argument 3 of ‘normalize_logspace_matrix’ from incompatible pointer type
> g++ -dynamiclib c-utils.o -o _c_utils.so
In Python I then get the following error on importing my module:
>>> import c_utils
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: dynamic module does not define init function (initc_utils)
Next I tried this approach using SWIG + numpy.i:
%module c_utils
%{
#define SWIG_FILE_WITH_INIT
#include "c-utils.h"
%}
%include "numpy.i"
%init %{
import_array();
%}
%apply ( int DIM1, int DIM2, DATA_TYPE* INPLACE_ARRAY2 )
{(size_t nrow, size_t ncol, double* mat)};
%include "c-utils.h"
However, I don't get any further than this:
> swig -python c-utils.i
c-utils.i:13: Warning 453: Can't apply (int DIM1,int DIM2,DATA_TYPE *INPLACE_ARRAY2). No typemaps are defined.
SWIG doesn't seem to find the typemaps defined in numpy.i, but I don't understand why, because numpy.i is in the same directory and SWIG doesn't complain that it can't find it.
With ctypes I didn't get very far, but got lost in the docs pretty quickly since I couldn't figure out how to pass it a 2d-array and then get the result back.
So could somebody show me the magic trick how to make my function available in Python/Numpy?
Unless you have a really good reason not to, you should use cython to interface C and python. (We are starting to use cython instead of raw C inside numpy/scipy themselves).
You can see a simple example in my scikits talkbox (since cython has improved quite a bit since then, I think you could write it better today).
def cslfilter(c_np.ndarray b, c_np.ndarray a, c_np.ndarray x):
    """Fast version of slfilter for a set of frames and filter coefficients.
    More precisely, given rank 2 arrays for coefficients and input, this
    computes:

        for i in range(x.shape[0]):
            y[i] = lfilter(b[i], a[i], x[i])

    This is mostly useful for processing on a set of windows with variable
    filters, e.g. to compute LPC residual from a signal chopped into a set of
    windows.

    Parameters
    ----------
    b: array
        recursive coefficients
    a: array
        non-recursive coefficients
    x: array
        signal to filter

    Note
    ----
    This is a specialized function, and does not handle other types than
    double, nor initial conditions."""
    cdef int na, nb, nfr, i, nx
    cdef double *raw_x, *raw_a, *raw_b, *raw_y
    cdef c_np.ndarray[double, ndim=2] tb
    cdef c_np.ndarray[double, ndim=2] ta
    cdef c_np.ndarray[double, ndim=2] tx
    cdef c_np.ndarray[double, ndim=2] ty

    dt = np.common_type(a, b, x)
    if not dt == np.float64:
        raise ValueError("Only float64 supported for now")
    if not x.ndim == 2:
        raise ValueError("Only input of rank 2 support")
    if not b.ndim == 2:
        raise ValueError("Only b of rank 2 support")
    if not a.ndim == 2:
        raise ValueError("Only a of rank 2 support")

    nfr = a.shape[0]
    if not nfr == b.shape[0]:
        raise ValueError("Number of filters should be the same")
    if not nfr == x.shape[0]:
        raise ValueError, \
              "Number of filters and number of frames should be the same"

    tx = np.ascontiguousarray(x, dtype=dt)
    ty = np.ones((x.shape[0], x.shape[1]), dt)

    na = a.shape[1]
    nb = b.shape[1]
    nx = x.shape[1]

    ta = np.ascontiguousarray(np.copy(a), dtype=dt)
    tb = np.ascontiguousarray(np.copy(b), dtype=dt)

    raw_x = <double*>tx.data
    raw_b = <double*>tb.data
    raw_a = <double*>ta.data
    raw_y = <double*>ty.data

    for i in range(nfr):
        filter_double(raw_b, nb, raw_a, na, raw_x, nx, raw_y)
        raw_b += nb
        raw_a += na
        raw_x += nx
        raw_y += nx

    return ty
As you can see, besides the usual argument checking you would do in Python, it is almost the same thing (filter_double is a function which can be written in pure C in a separate library if you want). Of course, since it is compiled code, failing to check your arguments will crash your interpreter instead of raising an exception (there are several levels of safety vs speed tradeoffs available with recent cython, though).
To answer the real question: SWIG doesn't tell you it can't find any typemaps. It tells you it can't apply the typemap (int DIM1,int DIM2,DATA_TYPE *INPLACE_ARRAY2), which is because there is no typemap defined for DATA_TYPE *. You need to tell it you want to apply it to a double*:
%apply ( int DIM1, int DIM2, double* INPLACE_ARRAY2 )
{(size_t nrow, size_t ncol, double* mat)};
First, are you sure that you were writing the fastest possible numpy code? If by normalise you mean divide the whole row by its sum, then you can write fast vectorised code which looks something like this:
matrix /= matrix.sum(axis=1)[:, np.newaxis]
If this is not what you had in mind and you are still sure that you need a fast C extension, I would strongly recommend you write it in cython instead of C. This will save you all the overhead and difficulties in wrapping code, and allow you to write something which looks like python code but which can be made to run as fast as C in most circumstances.
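For a sense of scale, here is a minimal Cython sketch (untested; normalize_rows is a made-up name, and it assumes "normalise" means divide each row by its sum, as in the one-liner above):
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def normalize_rows(double[:, ::1] mat):
    # divide each row of a C-contiguous float64 array by its sum, in place
    cdef Py_ssize_t i, j
    cdef double row_sum
    for i in range(mat.shape[0]):
        row_sum = 0.0
        for j in range(mat.shape[1]):
            row_sum += mat[i, j]
        if row_sum != 0.0:
            for j in range(mat.shape[1]):
                mat[i, j] /= row_sum
Once compiled (e.g. with cythonize), this can be called directly on a 2D float64 numpy array with no wrapper code at all.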
I agree with others that a little Cython is well worth learning.
But if you must write C or C++, use a 1d array which overlays the 2d, like this:
// sum1rows.cpp: 2d A as 1d A1
// Unfortunately
// void f( int m, int n, double a[m][n] ) { ... }
// is valid c but not c++ .
// See also
// http://stackoverflow.com/questions/3959457/high-performance-c-multi-dimensional-arrays
// http://stackoverflow.com/questions/tagged/multidimensional-array c++
#include <stdio.h>
void sum1( int n, double x[] ) // x /= sum(x)
{
float sum = 0;
for( int j = 0; j < n; j ++ )
sum += x[j];
for( int j = 0; j < n; j ++ )
x[j] /= sum;
}
void sum1rows( int nrow, int ncol, double A1[] ) // 1d A1 == 2d A[nrow][ncol]
{
for( int j = 0; j < nrow*ncol; j += ncol )
sum1( ncol, &A1[j] );
}
int main( int argc, char** argv )
{
int nrow = 100, ncol = 10;
double A[nrow][ncol];
for( int j = 0; j < nrow; j ++ )
for( int k = 0; k < ncol; k ++ )
A[j][k] = (j+1) * k;
double* A1 = &A[0][0]; // A as 1d array -- bad practice
sum1rows( nrow, ncol, A1 );
for( int j = 0; j < 2; j ++ ){
for( int k = 0; k < ncol; k ++ ){
printf( "%.2g ", A[j][k] );
}
printf( "\n" );
}
}
Added 8 Nov: as you probably know, numpy.reshape can overlay a numpy 2d array with a 1d view to pass to sum1rows, like this:
import numpy as np
A = np.arange(10).reshape((2,5))
A1 = A.reshape(A.size) # a 1d view of A, not a copy
# sum1rows( 2, 5, A1 )
A[1,1] += 10
print "A:", A
print "A1:", A1
SciPy has an extension tutorial with example code for arrays.
http://docs.scipy.org/doc/numpy/user/c-info.how-to-extend.html