I want to reduce a memory copy step during my data processing pipeline.
I want to do the following:
Generate some data from a custom C library
Feed the generated data into an MXNet model running on the GPU.
For now, my pipeline does the following:
Create a C-contiguous numpy array via np.empty(...).
Get the pointer to numpy array via np.ndarray.__array_interface__
Call the C library from python (via ctypes) to fill the numpy array.
Convert the numpy array into an MXNet NDArray; this copies the underlying memory buffer.
Pack NDArrays into a mx.io.DataBatch instance, then feed into model.
Please note, before being fed into model, all arrays stay in CPU memory.
I noticed that mx.io.DataBatch only accepts a list of mx.ndarray.NDArray objects as its data and label parameters, not numpy arrays. Passing numpy arrays appears to work until the batch is fed into a model. On the other hand, I have a C library that can write directly to a C-contiguous array.
I would like to avoid the memory copy in the numpy-to-NDArray conversion step. One possible way is to somehow get a raw pointer to the NDArray's buffer and bypass numpy entirely, but anything that works is fine.
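For reference, the first steps of the current pipeline can be sketched without the real library. This is only an illustration: ctypes.memset stands in for the custom C library's fill function (an assumption, since the real function is not shown), and the MXNet conversion is left as a comment.

```python
import ctypes
import numpy as np

# Allocate a C-contiguous buffer that a C library can write into.
arr = np.empty((4, 4), dtype=np.float32)

# __array_interface__["data"] is a (address, read_only) pair.
addr, read_only = arr.__array_interface__["data"]

# Stand-in for the C library call (hypothetical): zero the buffer in place
# through its raw address, without going through numpy.
ctypes.memset(addr, 0, arr.nbytes)

# The next step would be mx.nd.array(arr); that conversion is where the
# copy discussed in the question happens.
print(arr.sum())
```

The same address is also available as arr.ctypes.data, which is often more convenient when calling through ctypes.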
I figured out a hacky way to achieve this. Here's a small example.
from ctypes import *
import numpy as np
import mxnet as mx
m = mx.ndarray.zeros((4,4))
m.wait_to_read() # make sure the data is allocated
c_uint64_p = POINTER(c_uint64)
handle = cast(m.handle, c_uint64_p) # NDArray*
ptr_ = cast(handle[0], c_uint64_p) # shared_ptr<Chunk>
dptr = cast(ptr_[0], POINTER(c_float)) # shandle.dptr
n = np.ctypeslib.as_array(dptr, shape=(4,4)) # m and n will share buffer
I derived the above code by reading the MXNet C++ source code. Some explanation:
First, note the NDArray.handle attribute. It is a c_void_p. Reading the Python source shows that it is an NDArrayHandle, and in src/c_api/c_api_ndarray.cc it is reinterpreted as NDArray*.
In the source tree, go to include/mxnet/ndarray.h and find the NDArray class. The first field is:
/*! \brief internal data of NDArray */
std::shared_ptr<Chunk> ptr_{nullptr};
Checking Chunk, which is a struct defined inside NDArray, we see:
/*! \brief the real data chunk that backs NDArray */
// shandle is used to store the actual values in the NDArray
// aux_handles store the aux data(such as indices) if it's needed by non-default storage.
struct Chunk {
/*! \brief storage handle from storage engine.
for non-default storage, shandle stores the data(value) array.
*/
Storage::Handle shandle;
Finally, shandle is defined in include/mxnet/storage.h:
struct Handle {
/*!
* \brief Pointer to the data.
*/
void* dptr{nullptr};
Writing a small program shows that sizeof(shared_ptr<some_type>) is 16. Based on this question, we can guess that a shared_ptr is composed of two pointers, and it is not hard to work out that the first one is the pointer to the data. Putting everything together, all that is needed is two pointer dereferences.
On the downside, this method should not be used in a production environment or in large projects. It could break in a future release, or introduce hard-to-find bugs and security holes.
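The last two lines of the hack rely on ctypes turning an integer address into a typed pointer and on np.ctypeslib.as_array wrapping that pointer without copying. A minimal sketch of just that part, assuming a plain ctypes float array as a stand-in for the NDArray's storage (no MXNet involved; the names backing and slot are made up for illustration):

```python
import ctypes
import numpy as np

# Stand-in for the storage that shandle.dptr would point to.
backing = (ctypes.c_float * 16)()

# Stand-in for a struct whose first field holds the data pointer.
slot = ctypes.c_uint64(ctypes.addressof(backing))

# Read the first 8 bytes of the "struct" to recover the address,
# then reinterpret that address as float*.
ptr_ = ctypes.cast(ctypes.pointer(slot), ctypes.POINTER(ctypes.c_uint64))
dptr = ctypes.cast(ptr_[0], ctypes.POINTER(ctypes.c_float))

# Wrap the raw pointer; view and backing now share one buffer.
view = np.ctypeslib.as_array(dptr, shape=(4, 4))
view[1, 2] = 7.0      # write through numpy
backing[0] = 3.5      # write through ctypes
print(backing[6], view[0, 0])
```

The array returned by as_array does not own the memory, so the backing storage has to outlive every view created this way.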
I would like to use something like a structarray in cython, and I would like this structarray as easily accessible in python as in cython.
Based on a whim I used a recarray using a dtype that looks like the struct that I would like to use. Curiously, it just works and allows me to use a c structarray that, over the hood ;), is a numpy recarray for the python user.
Here is my example
# This is a "structarray in cython with numpy recarrays" testfile
import numpy as np
cimport numpy as np
# My structarray has nodes with fields x and y
# This also works without packed, but I have seen packed used in other places where people asked similar questions
# I assume that for two doubles that is equivalent but is necessary for int8s in between
cdef packed struct node:
    double x
    double y
# I suppose that would be the equivalent numpy dtype?
# Note: During compilation it warns me about double to float downcasts, but I do not see where
nodetype = [('x' , np.float64),('y', np.float64)]
def fun():
    # Make 10 element recarray
    # (Just looked it up. A point where 1-based indexing would save a look in the docs)
    mynode1 = np.recarray(10,dtype=nodetype)
    # Recarray with cdef struct
    mynode1 = np.recarray(10,dtype=nodetype)
    # Fill it with non-garbage somewhere
    mynode1[2].x=1.0
    mynode1[2].y=2.0
    # Brave: give recarray element to a c function assuming it's equivalent to the struct
    ny = cfuny(mynode1[2])
    assert ny==2.0 # works!
    # Test memoryview, assuming type node
    cdef node [:] nview = mynode1
    ny = cfunyv(nview,2)
    assert ny==2.0 # works!
    # This sets the numpy recarray value with a c function that gets a memoryview
    cfunyv_set(nview,5,9.0)
    assert mynode1[5].y==9.0 # also works!
    return 0
# return node element y from c struct node
cdef double cfuny(node n):
    return n.y
# give element i from memoryview of recarray to c function expecting a c struct
cdef double cfunyv(node [:] n, int i):
    return cfuny(n[i])
# write into recarray with a function expecting a memoryview with type node
cdef int cfunyv_set(node [:] n,int i,double val):
    n[i].y = val
    return 0
Of course I am not the first to try this.
Here for example the same thing is done, and it even states that this usage would be part of the manual here, but I cannot find this on the page. I suspect it was there at some point. There are also several discussions involving the use of strings in such a custom type (e.g. here), and from the answers I gather that the possibility of casting a recarray on a cstruct is intended behaviour, as the discussion talks about incorporating a regression test about the given example and having fixed the string error at some point.
My question
I could not find any documentation that states that this should work besides forum answers. Can someone show me where that is documented?
And, for some additional curiosity
Will this likely break at any point during the development of numpy or cython?
From the other forum entries on the subject it seems that packed is necessary for this to work once more interesting datatypes are part of the struct. I am not a compiler expert and have never used structure packing myself, but I suspect that whether a structure gets packed or not depends on the compiler settings. Does that mean that someone who compiles numpy without packing structures needs to compile this cython code without the packed?
This doesn't seem to be directly documented. Best reference I can give you is the typed memoryview docs here.
Rather than specific cython support for numpy structured dtypes this instead seems a consequence of support for the PEP 3118 buffer protocol. numpy exposes a Py_buffer struct for its arrays, and cython knows how to cast those into structs.
The packing is necessary. My understanding is that on x86 a C struct aligns each member on a multiple of its item size, whereas a numpy structured dtype is packed into the minimum space possible. This is probably clearest by example:
%%cython
import numpy as np
cdef struct Thing:
    char a
    # 7 bytes padding, double must be 8 byte aligned
    double b
thing_dtype = np.dtype([('a', np.byte), ('b', np.double)])
print('dtype size: ', thing_dtype.itemsize)
print('unpacked struct size', sizeof(Thing))
dtype size: 9
unpacked struct size 16
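The same comparison can be made without compiling anything. The sketch below is a pure-Python analogue, not the Cython code above: it assumes ctypes.Structure as a stand-in for the unpacked C struct, and uses numpy's align=True option to ask for C-style padding in the dtype.

```python
import ctypes
import numpy as np

fields = [("a", np.byte), ("b", np.double)]

# Default structured dtype: fields are laid out back to back.
packed = np.dtype(fields)

# align=True asks numpy to pad the fields the way a C compiler would.
aligned = np.dtype(fields, align=True)

# ctypes follows the platform's C layout rules for an unpacked struct.
class Thing(ctypes.Structure):
    _fields_ = [("a", ctypes.c_char), ("b", ctypes.c_double)]

print("packed dtype size:  ", packed.itemsize)        # 1 + 8 = 9
print("offset of b (packed):", packed.fields["b"][1])  # 1
print("aligned dtype size: ", aligned.itemsize)
print("C struct size:      ", ctypes.sizeof(Thing))
```

On a given platform the aligned dtype and the ctypes struct should report the same size, while the default dtype stays at 9 bytes, which is why the Cython struct has to be declared packed to match it.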
Just answering the final sub-question:
From the other forum entries on the subject it seems that packed is necessary for this to work once more interesting datatypes are part of the struct. I am not a compiler expert and have never used structure packing myself, but I suspect that whether a structure gets packed or not depends on the compiler settings. Does that mean that someone who compiles numpy without packing structures needs to compile this cython code without the packed?
Numpy's behaviour is decided at runtime rather than compile-time. It will calculate the minimum amount of space a structure can need and allocate blocks of that. It won't be changed by any compiler settings so should be reliable.
cdef packed struct is therefore always needed to match numpy. However, it does not generate standards compliant C code. Instead, it uses extensions to GCC, MSVC (and others). Therefore it works fine on the major C compilers that currently exist, but in principle might fail on a future compiler. It looks like it should be possible to use the C11 standard alignas to achieve the same thing in a standards compliant way, so Cython could hopefully be modified to do that if needed.
The title could be confusing. Here I will state my question more clearly.
I want to create a website based on Python (there are many existing frameworks, such as Flask and CherryPy) together with a C++ computation engine for the sake of processing speed. So I need an interface that lets Python call C++ functions. Fortunately, Boost.Python can do the job. However, every time I send data, say a matrix, from Python to C++, I have to use a Python list, which means I have to convert the matrix data into a list and then, on the C++ side, convert the list into an internal matrix object. As a result, a lot of data copying occurs, which is neither an intelligent nor an efficient approach. So my question is: if, given the complexity, we don't map the C++ matrix class to a Python class through Boost.Python, is there a better way to do similar work with no copies, or only a small number of them?
However, every time I send a data from python, say a matrix, to C++, I
have to use python list, which means I have to transform the matrix
data into list and in C++ context transform the list into an internal
matrix object.
No, you don't have to use a Python list. You can use a numpy array, which allocates its data as a C-contiguous segment that can be passed down to C++ without copying and viewed as a matrix through a matrix wrapper class.
In Python, allocate a 2D array using numpy:
>>> y=np.empty((2,5), dtype=np.int16)
>>> y
array([[ 12, 27750, 26465, 2675, 0],
[ 0, 0, 0, 2601, 0]], dtype=int16)
>>> y.flags['C_CONTIGUOUS']
True
>>> foo(y,2,5)
Pass the matrix data to C++ using the function below, exposed to Python:
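void foo(python::object obj, size_t size1, size_t size2)
{
    PyObject* pobj = obj.ptr(); // borrowed reference, so it must not be DECREF'ed here
    Py_buffer pybuf;
    // PyObject_GetBuffer returns -1 and sets a Python exception on failure
    if (PyObject_GetBuffer(pobj, &pybuf, PyBUF_SIMPLE) == -1)
        python::throw_error_already_set();
    int16_t *p = static_cast<int16_t*>(pybuf.buf);
    MyMatrixWrapper matrix(p, size1, size2);
    // ....
    PyBuffer_Release(&pybuf); // release the buffer once the matrix view is no longer used
}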
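The no-copy claim can be checked from the Python side. In this sketch a ctypes pointer plays the role of the int16_t* p obtained inside foo (an assumption for illustration; no C++ is compiled here), and a write through that pointer shows up in the numpy array:

```python
import ctypes
import numpy as np

# Zero-initialised so the result is deterministic (np.empty would also work).
y = np.zeros((2, 5), dtype=np.int16)
assert y.flags["C_CONTIGUOUS"]

# Same address the buffer protocol would hand to C++ as pybuf.buf.
p = ctypes.cast(y.ctypes.data, ctypes.POINTER(ctypes.c_int16))

# Row-major indexing, as a matrix wrapper over (p, size1, size2) would do.
rows, cols = y.shape
p[1 * cols + 3] = 42

print(y[1, 3])
```

Because the wrapper only borrows the memory, the numpy array has to stay alive for as long as the C++ side uses the pointer.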
There's a project called ndarray with this as their exact aim: https://github.com/ndarray/ndarray see also https://github.com/ndarray/Boost.NumPy .
There is some compatibility with other C++ matrix libraries (eg. Eigen) which should help with the computations.
I've created a shared library, and I'm using it like this:
class CAudioRecoveryStrategy(AbstractAudioRecoveryStrategy):
    def __init__(self):
        array_1d_double = npct.ndpointer(dtype=numpy.double, ndim=1, flags='CONTIGUOUS')
        self.lib = npct.load_library("libhello", ".")
        self.lib.demodulate.argtypes = [array_1d_double, array_1d_double, ctypes.c_int]
    def demodulate(self, input):
        output = numpy.empty_like(input)
        self.lib.demodulate(input, output, input.size)
        return output
Right now I have a problem: in the C++ code I only have a pointer to the array of output data, not the array itself. So I can't return the array unless I manually copy it.
What is the right way to do it? It must be efficient (like aligned memory etc.)
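As a side note on what the ndpointer argument type in the snippet above actually enforces: ctypes calls from_param on each argtypes entry, and the ndpointer class rejects arrays that do not match the declared dtype, dimensionality, or flags. A small sketch that exercises this check directly, with no shared library involved (the helper name accepted is made up for illustration):

```python
import numpy as np
import numpy.ctypeslib as npct

array_1d_double = npct.ndpointer(dtype=np.double, ndim=1, flags="C_CONTIGUOUS")

def accepted(arr):
    """Return True if ctypes would accept arr for an argument of this type."""
    try:
        array_1d_double.from_param(arr)
        return True
    except TypeError:
        return False

good = np.zeros(8)                           # float64, 1-D, contiguous
wrong_dtype = np.zeros(8, dtype=np.float32)  # wrong element type
strided = np.zeros(16)[::2]                  # right dtype, not contiguous

print(accepted(good), accepted(wrong_dtype), accepted(strided))
```

An array that fails the check can usually be fixed up with np.ascontiguousarray(arr, dtype=np.double) before the call, at the cost of a copy.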
Numpy arrays implement the buffer protocol, see
https://docs.python.org/2/c-api/buffer.html. In particular,
parse the input object to a PyObject* (conversion O if you're
using PyArg_ParseTuple or PyArg_ParseTupleAndKeywords), then
do PyObject_CheckBuffer, to ensure that the type supports the
protocol (numpy arrays do), then PyObject_GetBuffer to fill in
a Py_buffer struct with the physical addresses, dimensions,
etc. of the underlying memory block. To return a numpy buffer
is more complicated; in general, I've found it sufficient to
create objects of my own type which also support the buffer
protocol (set tp_as_buffer to non null in the PyTypeObject).
Otherwise (but I've not actually tried this), you'll have to
import the numpy module, get its array attribute, call it with
the correct arguments, and then use the buffer protocol above on
the object you thus construct.
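The information that PyObject_GetBuffer fills into the Py_buffer struct can be inspected from Python through memoryview, which uses the same protocol. A minimal sketch, assuming a small float64 array as the input:

```python
import numpy as np

a = np.arange(6, dtype=np.float64).reshape(2, 3)

# memoryview requests a buffer from the array, like PyObject_GetBuffer in C.
buf = memoryview(a)

info = {
    "format": buf.format,      # struct-style type code of one element
    "itemsize": buf.itemsize,  # bytes per element
    "ndim": buf.ndim,
    "shape": buf.shape,
    "strides": buf.strides,    # byte steps per dimension
    "readonly": buf.readonly,
}
for name, value in info.items():
    print(name, value)
```

The fields mirror the members of Py_buffer with the same names, so the values seen here are what the C code would read after a successful PyObject_GetBuffer call with matching flags.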
Is there a good way to do this? I am trying to do something like
pydata = numpy.PyArray_SimpleNew(2, <numpy.npy_intp*> shape, numpy.NPY_OBJECT) # (1)
# then do some sort of memcpy
or
pydata = PyArray_SimpleNewFromData(2, <numpy.npy_intp*> shape, numpy.NPY_OBJECT, data) # (2)
But I am getting segfaults even on the plain creation line (1). I think I am going about something the wrong way. The other approach is to create a Python (numpy) object in C++, but I was hoping to keep the interaction with the Python API within the Cython code.
I have a C function that mallocs() and populates a 2D array of floats. It "returns" that address and the size of the array. The signature is
int get_array_c(float** addr, int* nrows, int* ncols);
I want to call it from Python, so I use ctypes.
import ctypes
mylib = ctypes.cdll.LoadLibrary('mylib.so')
get_array_c = mylib.get_array_c
I never figured out how to specify argument types with ctypes. I tend to just write a Python wrapper for each C function I'm using, and make sure I get the types right in the wrapper. The array of floats is a matrix in column-major order, and I'd like to get it as a numpy.ndarray. But it's pretty big, so I want to use the memory allocated by the C function, not copy it. (I just found this PyBuffer_FromMemory stuff in this StackOverflow answer: https://stackoverflow.com/a/4355701/3691)
buffer_from_memory = ctypes.pythonapi.PyBuffer_FromMemory
buffer_from_memory.restype = ctypes.py_object
import numpy
def get_array_py():
    nrows = ctypes.c_int()
    ncols = ctypes.c_int()
    addr_ptr = ctypes.POINTER(ctypes.c_float)()
    get_array_c(ctypes.byref(addr_ptr), ctypes.byref(nrows), ctypes.byref(ncols))
    # c_int objects do not support arithmetic; .value gives plain Python ints
    buf = buffer_from_memory(addr_ptr, 4 * nrows.value * ncols.value)
    return numpy.ndarray((nrows.value, ncols.value), dtype=numpy.float32, order='F',
                         buffer=buf)
This seems to give me an array with the right values. But I'm pretty sure it's a memory leak.
>>> a = get_array_py()
>>> a.flags.owndata
False
The array doesn't own the memory. Fair enough; by default, when the array is created from a buffer, it shouldn't. But in this case it should. When the numpy array is deleted, I'd really like python to free the buffer memory for me. It seems like if I could force owndata to True, that should do it, but owndata isn't settable.
Unsatisfactory solutions:
Make the caller of get_array_py() responsible for freeing the memory. That's super annoying; the caller should be able to treat this numpy array just like any other numpy array.
Copy the original array into a new numpy array (with its own, separate memory) in get_array_py, delete the first array, and free the memory inside get_array_py(). Return the copy instead of the original array. This is annoying because it's an ought-to-be unnecessary memory copy.
Is there a way to do what I want? I can't modify the C function itself, although I could add another C function to the library if that's helpful.
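The ownership behaviour described above can be reproduced without the C library. In this sketch a ctypes float array stands in for the malloc'ed block (an assumption for illustration), and the array is built over it in column-major order just as in get_array_py:

```python
import ctypes
import numpy as np

# Stand-in for the memory returned by the C function: 6 floats, 0.0 .. 5.0.
raw = (ctypes.c_float * 6)(*range(6))

# Wrap the existing memory as a 2x3 column-major matrix, without copying.
a = np.ndarray((2, 3), dtype=np.float32, order="F", buffer=raw)

owns = a.flags.owndata  # the array only borrows the buffer

# Trying to flip the flag from Python is rejected.
try:
    a.flags.owndata = True
    settable = True
except (AttributeError, ValueError, TypeError):
    settable = False

print(owns, settable, a[1, 0])
```

Since numpy never frees memory it does not own, the block behind such an array stays allocated until something else releases it.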
I just stumbled upon this question, which is still an issue in August 2013. Numpy is really picky about the OWNDATA flag: There is no way it can be modified on the Python level, so ctypes will most likely not be able to do this. On the numpy C-API level - and now we are talking about a completely different way of making Python extension modules - one has to explicitly set the flag with:
PyArray_ENABLEFLAGS(arr, NPY_ARRAY_OWNDATA);
On numpy < 1.7, one had to be even more explicit:
((PyArrayObject*)arr)->flags |= NPY_OWNDATA;
If one has any control over the underlying C function/library, the best solution is to pass it an empty numpy array of the appropriate size from Python to store the result in. The basic principle is that memory allocation should always be done on the highest level possible, in this case on the level of the Python interpreter.
As kynan commented below, if you use Cython, you have to expose the function PyArray_ENABLEFLAGS manually, see this post Force NumPy ndarray to take ownership of its memory in Cython.
The relevant documentation is here and here.
I would tend to have two functions exported from my C library:
int get_array_c_nomalloc(float* addr, int nrows, int ncols); /* Pass addr as argument */
int get_array_c(float **addr, int nrows, int ncols); /* Calls function above */
I would then write my Python wrapper[1] of get_array_c to allocate the array, then call get_array_c_nomalloc. Then Python does own the memory. You could integrate this wrapper into your library so your user never has to be aware of get_array_c_nomalloc's existence.
[1] This isn't really a wrapper anymore, but instead is an adapter.
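A rough sketch of that adapter in Python, with the allocation done by numpy so that the resulting array owns its memory. The function get_array_c_nomalloc_stub below is a hypothetical pure-Python stand-in for the C function get_array_c_nomalloc, used only so the example runs without the library:

```python
import ctypes
import numpy as np

def get_array_c_nomalloc_stub(addr, nrows, ncols):
    """Hypothetical stand-in for the C function: fill addr[0 .. nrows*ncols)."""
    for i in range(nrows * ncols):
        addr[i] = float(i)
    return 0

def get_array_py(nrows, ncols):
    # Python allocates the column-major buffer, so numpy owns and frees it.
    out = np.empty((nrows, ncols), dtype=np.float32, order="F")
    addr = out.ctypes.data_as(ctypes.POINTER(ctypes.c_float))
    status = get_array_c_nomalloc_stub(addr, nrows, ncols)
    if status != 0:
        raise RuntimeError("fill function reported failure: %d" % status)
    return out

a = get_array_py(2, 3)
print(a.flags.owndata, a.flags.f_contiguous)
```

With the real library, the stub would be replaced by the ctypes function object for get_array_c_nomalloc, and the caller can treat the result like any other numpy array.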