I'm relatively new to Python and this is my first attempt at writing a C extension.
Background
In my Python 3.X project I need to load and parse large binary files (10-100MB) to extract data for further processing. The binary content is organized in frames: headers followed by a variable amount of data. Because pure Python was too slow for this, I decided to write a C extension to speed up the loading part.
The standalone C code outperforms Python by a factor of 20x to 500x, so I am pretty satisfied with it.
The problem: the memory keeps growing when I invoke the function from my C-extension multiple times within the same Python module.
my_c_ext.c
#include <Python.h>
#include <numpy/arrayobject.h>
#include "my_c_ext.h"
static unsigned short *X, *Y;
static PyObject* c_load(PyObject* self, PyObject* args)
{
    char *filename;
    if(!PyArg_ParseTuple(args, "s", &filename))
        return NULL;
    PyObject *PyX, *PyY;
    __load(filename);
    npy_intp dims[1] = {n_events};
    PyX = PyArray_SimpleNewFromData(1, dims, NPY_UINT16, X);
    PyArray_ENABLEFLAGS((PyArrayObject*)PyX, NPY_ARRAY_OWNDATA);
    PyY = PyArray_SimpleNewFromData(1, dims, NPY_UINT16, Y);
    PyArray_ENABLEFLAGS((PyArrayObject*)PyY, NPY_ARRAY_OWNDATA);
    PyObject *xy = Py_BuildValue("NN", PyX, PyY);
    return xy;
}
...
//More Python C-extension boilerplate (methods, etc..)
...
void __load(char *) {
    // open file, extract frame header and compute new_size
    X = realloc(X, new_size * sizeof(*X));
    Y = realloc(Y, new_size * sizeof(*Y));
    X[i] = ...
    Y[i] = ...
    return;
}
test.py
import my_c_ext as ce
binary_files = ['file1.bin',...,'fileN.bin']
for f in binary_files:
    x,y = ce.c_load(f)
    del x,y
Here I am deleting the returned objects in hope of lowering memory usage.
After reading several posts (e.g. this, this and this), I am still stuck.
I tried adding and removing the PyArray_ENABLEFLAGS call that sets the NPY_ARRAY_OWNDATA flag, without seeing any difference. It is not yet clear to me whether NPY_ARRAY_OWNDATA implies a free(X) in C. If I explicitly free the arrays in C, I run into a segfault when trying to load the second file in the for loop in test.py.
Any idea what I am doing wrong?
This looks like a memory management disaster. NPY_ARRAY_OWNDATA should cause it to call free on the data (or at least PyArray_free which isn't necessarily the same thing...).
However once this is done you still have the global variables X and Y pointing to a now-invalid area of memory. You then call realloc on those invalid pointers. At this point you're well into undefined behaviour and so anything could happen.
If it's a global variable then the memory needs to be managed globally, not by Numpy. If the memory is managed by the Numpy array then you need to ensure that you store no other way to access it except through that Numpy array. Anything else is going to cause you problems.
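The single-owner rule can be illustrated in pure Python, which enforces it where C does not. This is only an analogy, not the extension code: the bytearray stands in for the malloc'd block, the memoryview for the NumPy array, and extend() for realloc. Python refuses to resize a buffer while another object still views it; C performs the realloc anyway and leaves the other reference dangling.

```python
def resize_while_viewed():
    """Try to grow a buffer that still has an outstanding view."""
    storage = bytearray(8)        # plays the role of the malloc'd block
    view = memoryview(storage)    # plays the role of the NumPy array
    try:
        storage.extend(b"\x00" * 8)   # plays the role of realloc()
    except BufferError as exc:
        # Python notices the second reference and refuses to move the memory.
        return "refused: " + type(exc).__name__
    finally:
        view.release()
    return "resized"


def resize_after_release():
    """Once the view is gone there is a single owner, so resizing is safe."""
    storage = bytearray(8)
    view = memoryview(storage)
    view.release()
    storage.extend(b"\x00" * 8)
    return len(storage)
```

In the extension, the equivalent discipline would be to allocate fresh buffers on every call and keep no global pointer once a NumPy array owns them.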
C++ code:
#include <cstdio>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <sys/time.h>
using namespace std;
#define FILE_MODE (S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)
int main() {
    timeval tv1, tv2, tve;
    gettimeofday(&tv1, 0);
    int size = 0x1000000;
    int fd = open("data", O_RDWR | O_CREAT | O_TRUNC, FILE_MODE);
    ftruncate(fd, size);
    char *data = (char *) mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    for(int i = 0; i < size; i++) {
        data[i] = 'S';
    }
    munmap(data, size);
    close(fd);
    gettimeofday(&tv2, 0);
    timersub(&tv2, &tv1, &tve);
    printf("Time elapsed: %ld.%06lds\n", (long int) tve.tv_sec, (long int) tve.tv_usec);
}
Python code:
import mmap
import time
t1 = time.time()
size = 0x1000000
f = open('data/data', 'w+')
f.truncate(size)
f.close()
file = open('data/data', 'r+b')
buffer = mmap.mmap(file.fileno(), 0)
for i in xrange(size):
    buffer[i] = 'S'
buffer.close()
file.close()
t2 = time.time()
print "Time elapsed: %.3fs" % (t2 - t1)
I think these two programs are essentially the same, since C++ and Python call the same system call (mmap).
But the Python version is much slower than the C++ one:
Python: Time elapsed: 1.981s
C++: Time elapsed: 0.062143s
Could anyone please explain why the Python mmap version is much slower than the C++ one?
Environment:
C++:
$ c++ --version
Apple LLVM version 7.3.0 (clang-703.0.31)
Target: x86_64-apple-darwin15.5.0
Python:
$ python --version
Python 2.7.11 :: Anaconda 4.0.0 (x86_64)
It is not mmap that is slower, but the filling of an array with values. Python is known to be slow at primitive operations. Use higher-level operations:
buffer[:] = 'S' * size
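To see the difference for yourself, here is a minimal sketch that times both fills. It assumes Python 3 (the question uses Python 2), an anonymous mapping instead of a file, and a smaller 1 MiB buffer so it runs quickly. The function names are made up for this illustration.

```python
import mmap
import time

SIZE = 0x100000  # 1 MiB; smaller than the question's 16 MiB to keep the demo quick


def fill_per_byte(buf, size):
    # One interpreter-level assignment per byte, as in the question.
    for i in range(size):
        buf[i] = 0x53  # ord('S'); in Python 3, mmap items are ints


def fill_slice(buf, size):
    # A single slice assignment; the copy happens inside the mmap module.
    buf[:] = b"S" * size


def timed(fill):
    """Run one fill strategy on a fresh anonymous mapping and time it."""
    buf = mmap.mmap(-1, SIZE)
    start = time.perf_counter()
    fill(buf, SIZE)
    elapsed = time.perf_counter() - start
    ok = buf[:] == b"S" * SIZE  # check that the buffer really was filled
    buf.close()
    return elapsed, ok


if __name__ == "__main__":
    for fill in (fill_per_byte, fill_slice):
        elapsed, ok = timed(fill)
        print("%-14s %.6fs filled=%s" % (fill.__name__, elapsed, ok))
```

Both strategies leave the same bytes in the mapping; only the number of interpreter-level operations differs.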
To elaborate on what @Daniel said: any Python operation has more overhead (in some cases way more, like orders of magnitude) than the comparable amount of code implementing a solution in C++.
The loop filling the buffer is indeed the culprit, but the mmap module itself also has a lot more housekeeping to do than you might think, even though it offers an interface whose semantics are, misleadingly, verrrry closely aligned with POSIX mmap(). You know how POSIX mmap() just tosses you a void* (which you just have to clean up with munmap() at some point)? Python’s mmap has to allocate a PyObject structure to babysit the void*: making it conform to Python’s buffer protocol by furnishing metadata and callbacks to the runtime, propagating and queueing reads and writes, maintaining GIL state, cleaning up its allocations no matter what errors occur…
All of that stuff takes time and memory, too. I personally don’t ever find myself using the mmap module, as it doesn’t give you a clear-cut advantage on any I/O problem out of the box: you can just as easily use mmap to make things slower as you might make them faster.
Contrastingly I often *do* find that using POSIX mmap() can be VERY advantageous when doing I/O from within a Python C/C++ extension (provided you’re minding the GIL state), precisely because coding around mmap() avoids all that Python internal-infrastructure stuff in the first place.
I have made a module in Python using SimpleITK, which I tried to speed up by reimplementing in C++. It turns out to be quite a lot slower.
The bottleneck is the usage of the DisplacementFieldJacobianDeterminantFilter.
These two snippets give an example of the usage of the filters.
1000 generations: C++ = 55s, python = 8s
Should I expect the c++ to be faster?
import sys
import SimpleITK as sitk

def test_DJD(label_path, ngen):
    im = sitk.ReadImage(label_path)
    for i in range(ngen):
        jacobian = sitk.DisplacementFieldJacobianDeterminant(im)

if __name__ == '__main__':
    label = sys.argv[1]
    ngen = int(sys.argv[2])
    test_DJD(label, ngen)
And the c++ code
typedef itk::Vector<float, 3> VectorType;
typedef itk::Image<VectorType, 3> VectorImageType;
typedef itk::DisplacementFieldJacobianDeterminantFilter<VectorImageType > JacFilterType;
typedef itk::Image<float, 3> FloatImageType;
int main(int argc, char** argv) {
    std::string idealJacPath = argv[1];
    std::string numGensString = argv[2];
    int numGens;
    std::istringstream ( numGensString ) >> numGens;
    typedef itk::ImageFileReader<VectorImageType> VectorReaderType;
    VectorReaderType::Pointer reader=VectorReaderType::New();
    reader->SetFileName(idealJacPath);
    reader->Update();
    VectorImageType::Pointer vectorImage=reader->GetOutput();
    JacFilterType::Pointer jacFilter = JacFilterType::New();
    FloatImageType::Pointer generatedJac = FloatImageType::New();
    for (int i =0; i < numGens; i++){
        jacFilter->SetInput(vectorImage);
        jacFilter->Update();
        jacFilter->Modified();
        generatedJac = jacFilter->GetOutput();
    }
    return 0;
}
I'm using C++ ITK 4.8.2, compiled in 'release' mode on Ubuntu 15.4, and Python SimpleITK v 9.0.
You seem to be benchmarking using loops. Benchmarking with loops is not good practice, because compilers and interpreters apply a lot of optimizations to them.
I believe that here
for i in range(ngen):
    jacobian = sitk.DisplacementFieldJacobianDeterminant(im)
the Python interpreter most probably realized that you are only using the last value assigned to the jacobian variable, and therefore executed only ONE iteration of the loop. This is a very common loop optimization.
On the other hand, since you call a couple of dynamic methods in the C++ version (jacFilter->Update();), it is possible that the compiler could not infer that the other calls are unused, making your C++ version slower since all the invocations of the DisplacementFieldJacobianDeterminant::update method are actually made.
Another possible cause is that the ITK pipeline in Python is not being forced to update: you explicitly call jacFilter->Modified() in C++, but there is no explicit equivalent in the Python version.
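Whether the interpreter skips iterations can be tested rather than assumed. Below is a minimal sketch with a hypothetical stand-in, fake_filter, in place of the SimpleITK call; it counts how often the function body actually runs when only the last result is kept.

```python
def count_calls(n_iterations):
    """Run a loop that discards all but the last result and count the calls."""
    calls = 0

    def fake_filter(image):
        # Hypothetical stand-in for sitk.DisplacementFieldJacobianDeterminant.
        nonlocal calls
        calls += 1
        return image * 2

    result = None
    for _ in range(n_iterations):
        result = fake_filter(21)  # only the last assignment is ever used
    return calls, result


if __name__ == "__main__":
    calls, result = count_calls(1000)
    print("calls made:", calls, "last result:", result)
```

CPython executes every call, because it cannot know that a call has no side effects. A surprisingly fast Python loop is therefore more likely explained by how much work each call does (for example, whether the pipeline really recomputes) than by skipped iterations.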
I am creating a small GUI system and I would like to do my rendering with Python and the cairo and pystacia libraries. For C++/Python interaction I am using Boost.Python, but I am having trouble with pointers.
I have seen this kind of question asked a few times but didn't quite understand how to solve it.
If I have a struct/class with only a pointer to the image data:
struct ImageData{
    unsigned char* data;
    void setData(unsigned char* data) { this->data = data; } // let's assume there is more code here that manages memory
    unsigned char* getData() { return data; }
};
how can I make this available for python to do this (C++):
ImageData myimage;
myimage.data = some_image_data;
global["myimage"] = python::ptr(&myimage);
and in python:
import mymodule
from mymodule import ImageData
myimagedata = myimage.GetData()
#manipulate with image data and those manipulations can be read from C++ data pointer that is passed
My code works for basic method calls on a pointer passed to a class. This is probably a basic use case, but I haven't been able to make it work. I tried shared_ptr but failed. Should it be solved using shared_ptr, a proxy class or some other way?
You have a problem here with the locality of your variable: myimage will get deleted when it goes out of scope. To fix this, you can create it with dynamic memory allocation and then pass the pointer to Python:
ImageData * myimage = new ImageData();
myimage->data = new ImageRawData(some_image_data); // assuming copy of the buffer
global["myimage"] = boost::python::ptr(myimage);
Please note that this does not take care of memory handling on the Python side. You should use boost::python::handle<> to state correctly that you are transferring memory management to Python.
I want to extend a large C project with some new functionality, but I really want to write it in Python. Basically, I want to call Python code from C code. However, Python->C wrappers like SWIG allow for the OPPOSITE, that is writing C modules and calling C from Python.
I'm considering an approach involving IPC or RPC (I don't mind having multiple processes); that is, having my pure-Python component run in a separate process (on the same machine) and having my C project communicate with it by writing to and reading from a socket (or Unix pipe). My Python component can read from and write to the socket to communicate. Is that a reasonable approach? Is there something better, like some special RPC mechanism?
Thanks for the answers so far. However, I'd like to focus on IPC-based approaches, since I want to have my Python program in a separate process from my C program. I don't want to embed a Python interpreter. Thanks!
I recommend the approaches detailed here. It starts by explaining how to execute strings of Python code, then from there details how to set up a Python environment to interact with your C program, call Python functions from your C code, manipulate Python objects from your C code, etc.
EDIT: If you really want to go the route of IPC, then you'll want to use the struct module or better yet, protlib. Most communication between a Python and C process revolves around passing structs back and forth, either over a socket or through shared memory.
I recommend creating a Command struct with fields and codes to represent commands and their arguments. I can't give much more specific advice without knowing more about what you want to accomplish, but in general I recommend the protlib library, since it's what I use to communicate between C and Python programs (disclaimer: I am the author of protlib).
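As a rough sketch of the Command struct idea using only the standard struct module (not protlib), assuming a made-up wire layout: a 16-bit command code, a 16-bit argument count and two 32-bit signed arguments, all in network byte order. The names COMMAND, pack_command and unpack_command are hypothetical.

```python
import struct

# Hypothetical layout, matching a C struct sent field by field with
# htons()/htonl():
#   uint16_t code; uint16_t argc; int32_t args[2];
# "!" selects network byte order and disables padding.
COMMAND = struct.Struct("!HHii")
MAX_ARGS = 2


def pack_command(code, args):
    """Serialize a command code and up to MAX_ARGS integer arguments."""
    if len(args) > MAX_ARGS:
        raise ValueError("at most %d arguments are supported" % MAX_ARGS)
    padded = list(args) + [0] * (MAX_ARGS - len(args))
    return COMMAND.pack(code, len(args), *padded)


def unpack_command(payload):
    """Inverse of pack_command: return (code, tuple_of_arguments)."""
    code, argc, first, second = COMMAND.unpack(payload)
    return code, (first, second)[:argc]
```

Every message then has a fixed size (COMMAND.size bytes), so the receiving side can read exactly that many bytes from the socket before decoding.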
Have you considered just wrapping your python application in a shell script and invoking it from within your C application?
Not the most elegant solution, but it is very simple.
See the relevant chapter in the manual: http://docs.python.org/extending/
Essentially you'll have to embed the python interpreter into your program.
I haven't used an IPC approach for Python<->C communication but it should work pretty well. I would have the C program do a standard fork-exec and use redirected stdin and stdout in the child process for the communication. A nice text-based communication will make it very easy to develop and test the Python program.
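The Python side of such a text protocol could look roughly like this. The line format ("ADD x y" answered by "OK result" or "ERR reason") is invented for this sketch; the point is only that one line in produces one line out, which is easy to test by hand.

```python
import io


def handle_line(line):
    """Turn one request line into one response line (hypothetical protocol)."""
    parts = line.split()
    if not parts:
        return "ERR empty"
    if parts[0] == "ADD" and len(parts) == 3:
        try:
            return "OK %g" % (float(parts[1]) + float(parts[2]))
        except ValueError:
            return "ERR bad number"
    return "ERR unknown command"


def serve(infile, outfile):
    """Answer requests line by line until the input is closed."""
    for line in infile:
        outfile.write(handle_line(line) + "\n")
        outfile.flush()  # the parent process is blocked on the pipe until we flush


def demo():
    """Exercise the protocol in-process, without any pipes."""
    out = io.StringIO()
    serve(io.StringIO("ADD 1 2\nNOPE\n"), out)
    return out.getvalue()
```

In the real child process the entry point would be serve(sys.stdin, sys.stdout), with the C parent writing requests to the child's stdin and reading replies from its stdout.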
If I had decided to go with IPC, I'd probably splurge with XML-RPC -- cross-platform, lets you easily put the Python server project on a different node later if you want, has many excellent implementations (see here for many, including C and Python ones, and here for the simple XML-RPC server that's part of the Python standard library -- not as highly scalable as other approaches but probably fine and convenient for your use case).
It may not be a perfect IPC approach for all cases (or even a perfect RPC one, by all means!), but the convenience, flexibility, robustness, and broad range of implementations outweigh a lot of minor defects, in my opinion.
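For reference, a minimal sketch of the Python side using the standard library's XML-RPC server, assuming Python 3 (where the module is xmlrpc.server). The function add_numbers, the helper names, and the host and port are placeholders chosen for this example.

```python
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer


def add_numbers(x, y):
    """Example function exposed to remote callers."""
    return x + y


def register_api(dispatcher):
    """Register every exposed function on a server (or bare dispatcher)."""
    dispatcher.register_function(add_numbers, "add_numbers")
    return dispatcher


def serve(host="127.0.0.1", port=8000):
    """Start the server; this call blocks until the process is stopped."""
    server = SimpleXMLRPCServer((host, port), allow_none=True)
    register_api(server)
    server.serve_forever()


def call_add(url, x, y):
    """What a Python client would do; a C client would use a C XML-RPC library."""
    proxy = xmlrpc.client.ServerProxy(url)
    return proxy.add_numbers(x, y)
```

The Python process would run serve(), and the C program would send XML-RPC requests for add_numbers to the same host and port.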
This seems quite nice http://thrift.apache.org/, there is even a book about it.
Details:
The Apache Thrift software framework, for scalable cross-language
services development, combines a software stack with a code generation
engine to build services that work efficiently and seamlessly between
C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa,
JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages.
I've used the "standard" approach of Embedding Python in Another Application. But it's complicated/tedious. Each new function in Python is painful to implement.
I saw an example of Calling PyPy from C. It uses CFFI to simplify the interface but it requires PyPy, not Python. Read and understand this example first, at least at a high level.
I modified the C/PyPy example to work with Python. Here's how to call Python from C using CFFI.
My example is more complicated because I implemented three functions in Python instead of one. I wanted to cover additional aspects of passing data back and forth.
The complicated part is now isolated to passing the address of api to Python. That only has to be implemented once. After that it's easy to add new functions in Python.
interface.h
// These are the three functions that I implemented in Python.
// Any additional function would be added here.
struct API {
    double (*add_numbers)(double x, double y);
    char* (*dump_buffer)(char *buffer, int buffer_size);
    int (*release_object)(char *obj);
};
test_cffi.c
//
// Calling Python from C.
// Based on Calling PyPy from C:
// http://doc.pypy.org/en/latest/embedding.html#more-complete-example
//
#include <stdio.h>
#include <assert.h>
#include "Python.h"
#include "interface.h"
struct API api; /* global var */
int main(int argc, char *argv[])
{
    int rc;
    // Start Python interpreter and initialize "api" in interface.py using
    // old style "Embedding Python in Another Application":
    // https://docs.python.org/2/extending/embedding.html#embedding-python-in-another-application
    PyObject *pName, *pModule, *py_results;
    PyObject *fill_api;
#define PYVERIFY(exp) if ((exp) == 0) { fprintf(stderr, "%s[%d]: ", __FILE__, __LINE__); PyErr_Print(); exit(1); }
    Py_SetProgramName(argv[0]); /* optional but recommended */
    Py_Initialize();
    PyRun_SimpleString(
        "import sys;"
        "sys.path.insert(0, '.')" );
    PYVERIFY( pName = PyString_FromString("interface") )
    PYVERIFY( pModule = PyImport_Import(pName) )
    Py_DECREF(pName);
    PYVERIFY( fill_api = PyObject_GetAttrString(pModule, "fill_api") )
    // "k" = [unsigned long],
    // see https://docs.python.org/2/c-api/arg.html#c.Py_BuildValue
    PYVERIFY( py_results = PyObject_CallFunction(fill_api, "k", &api) )
    assert(py_results == Py_None);
    // Call Python function from C using cffi.
    printf("sum: %f\n", api.add_numbers(12.3, 45.6));
    // More complex example.
    char buffer[20];
    char * result = api.dump_buffer(buffer, sizeof buffer);
    assert(result != 0);
    printf("buffer: %s\n", result);
    // Let Python perform garbage collection on result now.
    rc = api.release_object(result);
    assert(rc == 0);
    // Close Python interpreter.
    Py_Finalize();
    return 0;
}
interface.py
import cffi
import sys
import traceback
ffi = cffi.FFI()
ffi.cdef(file('interface.h').read())
# Hold references to objects to prevent garbage collection.
noGCDict = {}
# Add two numbers.
# This function was copied from the PyPy example.
@ffi.callback("double (double, double)")
def add_numbers(x, y):
    return x + y

# Convert input buffer to repr(buffer).
@ffi.callback("char *(char*, int)")
def dump_buffer(buffer, buffer_len):
    try:
        # First attempt to access data in buffer.
        # Using the ffi/lib objects:
        # http://cffi.readthedocs.org/en/latest/using.html#using-the-ffi-lib-objects
        # One char at a time, looks inefficient.
        #data = ''.join([buffer[i] for i in xrange(buffer_len)])
        # Second attempt.
        # FFI Interface:
        # http://cffi.readthedocs.org/en/latest/using.html#ffi-interface
        # Works but doc says "str() gives inconsistent results".
        #data = str( ffi.buffer(buffer, buffer_len) )
        # Convert C buffer to Python str.
        # Doc says [:] is recommended instead of str().
        data = ffi.buffer(buffer, buffer_len)[:]
        # The goal is to return repr(data)
        # but it has to be converted to a C buffer.
        result = ffi.new('char []', repr(data))
        # Save reference to data so it's not freed until released by C program.
        noGCDict[ffi.addressof(result)] = result
        return result
    except:
        print >>sys.stderr, traceback.format_exc()
        return ffi.NULL

# Release object so that Python can reclaim the memory.
@ffi.callback("int (char*)")
def release_object(ptr):
    try:
        del noGCDict[ptr]
        return 0
    except:
        print >>sys.stderr, traceback.format_exc()
        return 1

def fill_api(ptr):
    global api
    api = ffi.cast("struct API*", ptr)
    api.add_numbers = add_numbers
    api.dump_buffer = dump_buffer
    api.release_object = release_object
Compile:
gcc -o test_cffi test_cffi.c -I/home/jmudd/pgsql-native/Python-2.7.10.install/include/python2.7 -L/home/jmudd/pgsql-native/Python-2.7.10.install/lib -lpython2.7
Execute:
$ test_cffi
sum: 57.900000
buffer: 'T\x9e\x04\x08\xa8\x93\xff\xbf]\x86\x04\x08\x00\x00\x00\x00\x00\x00\x00\x00'
$
A few tips for binding it with Python 3:
file() is not supported; use open():
ffi.cdef(open('interface.h').read())
PyObject* PyStr_FromString(const char *u)
Create a PyStr from a UTF-8 encoded null-terminated character buffer.
Python 2: PyString_FromString
Python 3: PyUnicode_FromString
Change to: PYVERIFY( pName = PyUnicode_FromString("interface") )
Program name (Py_SetProgramName takes a wchar_t* in Python 3):
wchar_t *name = Py_DecodeLocale(argv[0], NULL);
Py_SetProgramName(name);
For compiling:
gcc cc.c -o cc -I/usr/include/python3.6m -I/usr/include/x86_64-linux-gnu/python3.6m -lpython3.6m
I butchered the dump_buffer def; maybe it will give some ideas:
def get_prediction(buffer, buffer_len):
    try:
        data = ffi.buffer(buffer, buffer_len)[:]
        result = ffi.new('char []', data)
        print('\n I am doing something here here........', data)
        resultA = ffi.new('char []', b"Failed") ### New message
        ##noGCDict[ffi.addressof(resultA)] = resultA
        return resultA
    except:
        print(traceback.format_exc(), file=sys.stderr)
        return ffi.NULL
Hopefully it will help and save you some time.
Apparently Python would need to be able to compile to a Win32 DLL; that would solve the problem.
In the same way, converting C# code to Win32 DLLs would make it usable by any development tool.