pybind11 somehow slows c++ function

pybind11 somehow slows c++ function - python

All stackoverflow/github issues I've seen were about speeding up functions calls from Python in case of marshalling objects.
But my problem is about the working time of pure c++ function inside pybind11 C++ module function.
I have the training function that loads dataset and calls train method from the native C++ library class:
void runSvm()
// read smvlight file into required sparse representation
auto problem = read_problem( "random4000x20.train.svml" );
CSvmBinaryClassifierBuilder::CParams params( CSvmKernel::KT_Linear );
params.Degree = 3;
params.Gamma = 1/20;
params.Coeff0 = 0;
// measure time here
using namespace std::chrono;
system_clock::time_point startTime = high_resolution_clock::now();
CSvmBinaryClassifierBuilder( params ).Train( *problem ); // measure time only for this line
nanoseconds delay = duration_cast<nanoseconds>( high_resolution_clock::now() - startTime );
std::cout << setprecision(3) << delay / 1e6 << std::endl;
}
I bound this function to Python via pybind11:
PYBIND11_MODULE(PythonWrapper, m) {
m.def( "runSvm", &runSvm );
}
Then compiled pybind11 module library and called it from Python. The timer value was over 3000ms.
But when I call this function from pure C++, the timer value is around 800ms.
Of course, I was expecting some overhead but not in this place and not so much.
I ran it in one thread and both cases it 100% loaded one core.
Where the issue can be? Who faced the same and how did you handle this?

When I was working on a reproducible example, I found out that I compare different svm kernels in C++ example (it was based on libsvm params proved 'rbf') and in pybind11 lib (it was hardcoded 'linear'). After fixing it and comparing the same algorithms there was no difference in time.

Related

How can I make animated colormap plot of C++ arrays? Export to Python and use \matplotlib?

I have a C++ code which creates an array at each time. (C++ is needed because of the speed - if the code was built directly in Python, it would take 8 days to execute.) The structure of the code looks something like:
const int StopTime = 10000;
int main(void){
int Time = 0;
while(Time<StopTime){
//Create a 2D array of dimensions 10x10 filled with integers
}
}
The final result shall be an animated gif figure, which would display the colormap plots of the 10000 arrays subsequently generated. I am quite familiar with Python's \matplotlib package, so I hope to save all of these 10000 arrays to some file which would then be imported to Python's code a plotted with \matplotlib. How could I do it? Or is there some C++ library to help without the need to export the arrays? The easier to understand and learn, the better. (I am much more confortable working with Python.)
I have already had a look on https://github.com/lava/matplotlib-cpp but I cannot even install it correctly.
Can someone more experienced help me please? Thank you so much!
PS: I am a beginner in both C++ and Python, and mathematician by training.

The quick and dirty approach to such a problem is often to generate python code from your primary program. Here your C++ program. For example, the string [0, 3, 5, 1] is a valid python fragment defining an array of four values.
So, if you have your std::array<T,N> or your T[N] or std::vector<T> with your data, you can serialise them fairly easily to something that can be used as input to the python interpreter:
template <typename Os, typename It>
void serialise(Os& os, It begin, It end) {
os << '[';
std::string sep;
for (auto it = begin; it != end; ++it) {
os << sep << *it;
sep = ", ";
}
os << " ]";
}
Which you can call with
std::cout << "data = ";
serialise(std::cout,std::begin(data),std::end(data));
std::cout << "\n";
You can easily abstract this for multi-dimensional data structures to print something like [ [1,2,3], [4,5,6] ].
You can then wrap your program into a small script (using bash, python, whatever) and have the output of your C++ program be prepended to your matplotlib python script, which then accesses the variable data.
Consider the following a very rough sketch how you could do it with bash:
# generate data array
./main > run.py
# add actual python code that accesses data
cat matplot.tmpl.py >> run.py
# run generated python program
python3 run.py

C++ Python Not Always Executing Python Script

I currently have a piece of hardware connected to C++ code using the MFC (Windows programming) framework. Basically the hardware is passing in image frames to my C++ code. In my C++ code, I am then calling a Python script using the CPython (Python embedding in C++) API to execute a model on that image. I've been noticing some weird behavior with the images though.
My C++ code is executing my Python script perfectly until some frame in the range of 80-90. After that point, my C++ code, for some reason, just stops executing the Python script. Despite that, the C++ code is still running normally - EXCEPT for the fact (which I just stated) that it's not executing the Python script.
Something to note: my Python script takes 5 seconds to execute the FIRST time, but then only 0.02 seconds to execute each frame after that first frame (I think due to the model getting set up).
At first, I thought it was a problem with the speed, so I replaced all my Python code with just a "time.sleep()" call with varying time, and, even if I sleep 5 seconds each C++ call to Python still always gets executed. As a result, I don't think it's a matter of the total time. For instance, if I do "time.sleep(1)" which sleeps for a second (which is longer than my Python script execution time AFTER the first frame), my Python script still always gets executed.
Does anyone have any idea why this might be happening? Could it be because of the uneven running times? Since it's taking 5 seconds to run the first frame and then significantly faster for each frame after that. Could it be that the Python is somehow unable to catch up after that time period?
This is my first time executing C++/Python on hardware, so I'm also new to this. Any helps would be greatly appreciated!
To give some idea of my code, here is a snippet:
if (pFuncFrame && PyCallable_Check(pFuncFrame)) {
PyObject* pArgs = PyTuple_New(1);
PyTuple_SetItem(pArgs, 0, PyUnicode_FromString("img.bmp"));
PyObject_CallObject(pFuncFrame, pArgs);
std::cout << "Called the frame function";
}
else {
std::cout << "Did not get the frame function";
}

I'm willing to bet that the first execution ends in a Python exception which isn't cleared until you execute some new Python statement in the second iteration, which therefore fails immediately. I recommend fixing the memory leaks and adding some error handling code to get some diagnostics (which will be useful either which way). For example (haven't tried, since you didn't provide a compilable example, but the following shouldn't be too far off):
if (pFuncFrame && PyCallable_Check(pFuncFrame)) {
PyObject* pArgs = PyTuple_New(1);
PyTuple_SetItem(pArgs, 0, PyUnicode_FromString("img.bmp"));
PyObject* res = PyObject_CallObject(pFuncFrame, pArgs);
if (!res) {
if (PyErr_Occurred()) PyErr_Print();
else std::cerr << "Python exception without error set\n";
} else {
Py_DECREF(res);
std::cout << "Called the frame function";
}
Py_DECREF(pArgs);
}
else {
std::cout << "Did not get the frame function";
}

ITK Filter slower in C++ than Python

I have made a module in Python using SimpleITK, which I tried to speed up by reimplementing in C++. It turns out to be quite a lot slower.
The bottleneck is the usage of the DisplacementFieldJacobianDeterminantFilter.
These two snippets give an example of the usage of the filters.
1000 generations: C++ = 55s, python = 8s
Should I expect the c++ to be faster?
def test_DJD(label_path, ngen):
im = sitk.ReadImage(label_path)
for i in range(ngen):
jacobian = sitk.DisplacementFieldJacobianDeterminant(im)
if __name__ == '__main__':
label = sys.argv[1]
ngen = int(sys.argv[2])
test_DJD(label, ngen)
And the c++ code
typedef itk::Vector<float, 3> VectorType;
typedef itk::Image<VectorType, 3> VectorImageType;
typedef itk::DisplacementFieldJacobianDeterminantFilter<VectorImageType > JacFilterType;
typedef itk::Image<float, 3> FloatImageType;
int main(int argc, char** argv) {
std::string idealJacPath = argv[1];
std::string numGensString = argv[2];
int numGens;
istringstream ( numGensString ) >> numGens;
typedef itk::ImageFileReader<VectorImageType> VectorReaderType;
VectorReaderType::Pointer reader=VectorReaderType::New();
reader->SetFileName(idealJacPath);
reader->Update();
VectorImageType::Pointer vectorImage=reader->GetOutput();
JacFilterType::Pointer jacFilter = JacFilterType::New();
FloatImageType::Pointer generatedJac = FloatImageType::New();
for (int i =0; i < numGens; i++){
jacFilter->SetInput(vectorImage);
jacFilter->Update();
jacFilter->Modified();
generatedJac = jacFilter->GetOutput();
}
return 0;
}
I'm using the c++ ITK 4.8.2 and compiled in 'release' mode on Ubuntu 15.4. And the python SimpleITK v 9.0

You seem to be benchmarking using loops. Using loops for benchmarking is not a good practice, because the compilers and interpreters does a lot of optimizations to them.
I believe that in here
for i in range(ngen):
jacobian = sitk.DisplacementFieldJacobianDeterminant(im)
The python interpreter most probably realized that you are only using the last value assigned to the jacobian variable, therefore executing only ONE iteration of the loop. This is a very common loop optimization.
On the other hand, since you call a couple of dynamic method in the C++ version (jacFilter->Update();), is possible that the compiler could not infer that the other calls are not being used, making your C++ version slower since all the invocations to the DisplacementFieldJacobianDeterminant::update method are actually made.
Another possible cause is that the ITK pipeline in Python is not being forced to update, as you call explicitly the jacFilter->Modified() in C++ but this is not explicit in the Python version.

Equivalent to GetTickCount() on Linux

I'm looking for an equivalent to GetTickCount() on Linux.
Presently I am using Python's time.time() which presumably calls through to gettimeofday(). My concern is that the time returned (the unix epoch), may change erratically if the clock is messed with, such as by NTP. A simple process or system wall time, that only increases positively at a constant rate would suffice.
Does any such time function in C or Python exist?

You can use CLOCK_MONOTONIC e.g. in C:
struct timespec ts;
if(clock_gettime(CLOCK_MONOTONIC,&ts) != 0) {
//error
}
See this question for a Python way - How do I get monotonic time durations in python?

This seems to work:
#include <unistd.h>
#include <time.h>
uint32_t getTick() {
struct timespec ts;
unsigned theTick = 0U;
clock_gettime( CLOCK_REALTIME, &ts );
theTick = ts.tv_nsec / 1000000;
theTick += ts.tv_sec * 1000;
return theTick;
}
yes, get_tick()
Is the backbone of my applications.
Consisting of one state machine for each 'task'
eg, can multi-task without using threads and Inter Process Communication
Can implement non-blocking delays.

You should use: clock_gettime(CLOCK_MONOTONIC, &tp);. This call is not affected by the adjustment of the system time just like GetTickCount() on Windows.

Yes, the kernel has high-resolution timers but it is differently. I would recommend that you look at the sources of any odd project that wraps this in a portable manner.
From C/C++ I usually #ifdef this and use gettimeofday() on Linux which gives me microsecond resolution. I often add this as a fraction to the seconds since epoch I also receive giving me a double.

How do you call Python code from C code?

I want to extend a large C project with some new functionality, but I really want to write it in Python. Basically, I want to call Python code from C code. However, Python->C wrappers like SWIG allow for the OPPOSITE, that is writing C modules and calling C from Python.
I'm considering an approach involving IPC or RPC (I don't mind having multiple processes); that is, having my pure-Python component run in a separate process (on the same machine) and having my C project communicate with it by writing/reading from a socket (or unix pipe). my python component can read/write to socket to communicate. Is that a reasonable approach? Is there something better? Like some special RPC mechanism?
Thanks for the answer so far - however, i'd like to focus on IPC-based approaches since I want to have my Python program in a separate process as my C program. I don't want to embed a Python interpreter. Thanks!

I recommend the approaches detailed here. It starts by explaining how to execute strings of Python code, then from there details how to set up a Python environment to interact with your C program, call Python functions from your C code, manipulate Python objects from your C code, etc.
EDIT: If you really want to go the route of IPC, then you'll want to use the struct module or better yet, protlib. Most communication between a Python and C process revolves around passing structs back and forth, either over a socket or through shared memory.
I recommend creating a Command struct with fields and codes to represent commands and their arguments. I can't give much more specific advice without knowing more about what you want to accomplish, but in general I recommend the protlib library, since it's what I use to communicate between C and Python programs (disclaimer: I am the author of protlib).

Have you considered just wrapping your python application in a shell script and invoking it from within your C application?
Not the most elegant solution, but it is very simple.

See the relevant chapter in the manual: http://docs.python.org/extending/
Essentially you'll have to embed the python interpreter into your program.

I haven't used an IPC approach for Python<->C communication but it should work pretty well. I would have the C program do a standard fork-exec and use redirected stdin and stdout in the child process for the communication. A nice text-based communication will make it very easy to develop and test the Python program.

If I had decided to go with IPC, I'd probably splurge with XML-RPC -- cross-platform, lets you easily put the Python server project on a different node later if you want, has many excellent implementations (see here for many, including C and Python ones, and here for the simple XML-RPC server that's part the Python standard library -- not as highly scalable as other approaches but probably fine and convenient for your use case).
It may not be a perfect IPC approach for all cases (or even a perfect RPC one, by all means!), but the convenience, flexibility, robustness, and broad range of implementations outweigh a lot of minor defects, in my opinion.

This seems quite nice http://thrift.apache.org/, there is even a book about it.
Details:
The Apache Thrift software framework, for scalable cross-language
services development, combines a software stack with a code generation
engine to build services that work efficiently and seamlessly between
C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa,
JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages.

I've used the "standard" approach of Embedding Python in Another Application. But it's complicated/tedious. Each new function in Python is painful to implement.
I saw an example of Calling PyPy from C. It uses CFFI to simplify the interface but it requires PyPy, not Python. Read and understand this example first, at least at a high level.
I modified the C/PyPy example to work with Python. Here's how to call Python from C using CFFI.
My example is more complicated because I implemented three functions in Python instead of one. I wanted to cover additional aspects of passing data back and forth.
The complicated part is now isolated to passing the address of api to Python. That only has to be implemented once. After that it's easy to add new functions in Python.
interface.h
// These are the three functions that I implemented in Python.
// Any additional function would be added here.
struct API {
double (*add_numbers)(double x, double y);
char* (*dump_buffer)(char *buffer, int buffer_size);
int (*release_object)(char *obj);
};
test_cffi.c
//
// Calling Python from C.
// Based on Calling PyPy from C:
// http://doc.pypy.org/en/latest/embedding.html#more-complete-example
//
#include <stdio.h>
#include <assert.h>
#include "Python.h"
#include "interface.h"
struct API api; /* global var */
int main(int argc, char *argv[])
{
int rc;
// Start Python interpreter and initialize "api" in interface.py using
// old style "Embedding Python in Another Application":
// https://docs.python.org/2/extending/embedding.html#embedding-python-in-another-application
PyObject *pName, *pModule, *py_results;
PyObject *fill_api;
#define PYVERIFY(exp) if ((exp) == 0) { fprintf(stderr, "%s[%d]: ", __FILE__, __LINE__); PyErr_Print(); exit(1); }
Py_SetProgramName(argv[0]); /* optional but recommended */
Py_Initialize();
PyRun_SimpleString(
"import sys;"
"sys.path.insert(0, '.')" );
PYVERIFY( pName = PyString_FromString("interface") )
PYVERIFY( pModule = PyImport_Import(pName) )
Py_DECREF(pName);
PYVERIFY( fill_api = PyObject_GetAttrString(pModule, "fill_api") )
// "k" = [unsigned long],
// see https://docs.python.org/2/c-api/arg.html#c.Py_BuildValue
PYVERIFY( py_results = PyObject_CallFunction(fill_api, "k", &api) )
assert(py_results == Py_None);
// Call Python function from C using cffi.
printf("sum: %f\n", api.add_numbers(12.3, 45.6));
// More complex example.
char buffer[20];
char * result = api.dump_buffer(buffer, sizeof buffer);
assert(result != 0);
printf("buffer: %s\n", result);
// Let Python perform garbage collection on result now.
rc = api.release_object(result);
assert(rc == 0);
// Close Python interpreter.
Py_Finalize();
return 0;
}
interface.py
import cffi
import sys
import traceback
ffi = cffi.FFI()
ffi.cdef(file('interface.h').read())
# Hold references to objects to prevent garbage collection.
noGCDict = {}
# Add two numbers.
# This function was copied from the PyPy example.
#ffi.callback("double (double, double)")
def add_numbers(x, y):
return x + y
# Convert input buffer to repr(buffer).
#ffi.callback("char *(char*, int)")
def dump_buffer(buffer, buffer_len):
try:
# First attempt to access data in buffer.
# Using the ffi/lib objects:
# http://cffi.readthedocs.org/en/latest/using.html#using-the-ffi-lib-objects
# One char at time, Looks inefficient.
#data = ''.join([buffer[i] for i in xrange(buffer_len)])
# Second attempt.
# FFI Interface:
# http://cffi.readthedocs.org/en/latest/using.html#ffi-interface
# Works but doc says "str() gives inconsistent results".
#data = str( ffi.buffer(buffer, buffer_len) )
# Convert C buffer to Python str.
# Doc says [:] is recommended instead of str().
data = ffi.buffer(buffer, buffer_len)[:]
# The goal is to return repr(data)
# but it has to be converted to a C buffer.
result = ffi.new('char []', repr(data))
# Save reference to data so it's not freed until released by C program.
noGCDict[ffi.addressof(result)] = result
return result
except:
print >>sys.stderr, traceback.format_exc()
return ffi.NULL
# Release object so that Python can reclaim the memory.
#ffi.callback("int (char*)")
def release_object(ptr):
try:
del noGCDict[ptr]
return 0
except:
print >>sys.stderr, traceback.format_exc()
return 1
def fill_api(ptr):
global api
api = ffi.cast("struct API*", ptr)
api.add_numbers = add_numbers
api.dump_buffer = dump_buffer
api.release_object = release_object
Compile:
gcc -o test_cffi test_cffi.c -I/home/jmudd/pgsql-native/Python-2.7.10.install/include/python2.7 -L/home/jmudd/pgsql-native/Python-2.7.10.install/lib -lpython2.7
Execute:
$ test_cffi
sum: 57.900000
buffer: 'T\x9e\x04\x08\xa8\x93\xff\xbf]\x86\x04\x08\x00\x00\x00\x00\x00\x00\x00\x00'
$

Few tips for binding it with Python 3
file() not supported, use open()
ffi.cdef(open('interface.h').read())
PyObject* PyStr_FromString(const char *u)
Create a PyStr from a UTF-8 encoded null-terminated character buffer.
Python 2: PyString_FromString
Python 3: PyUnicode_FromString
Change to: PYVERIFY( pName = PyUnicode_FromString("interface") )
Program name
wchar_t *name = Py_DecodeLocale(argv[0], NULL);
Py_SetProgramName(name);
for compiling
gcc cc.c -o cc -I/usr/include/python3.6m -I/usr/include/x86_64-linux-gnu/python3.6m -lpython3.6m
I butchered dump def .. maybe it will give some ideas
def get_prediction(buffer, buffer_len):
try:
data = ffi.buffer(buffer, buffer_len)[:]
result = ffi.new('char []', data)
print('\n I am doing something here here........',data )
resultA = ffi.new('char []', b"Failed") ### New message
##noGCDict[ffi.addressof(resultA)] = resultA
return resultA
except:
print >>sys.stderr, traceback.format_exc()
return ffi.NULL
}
Hopefully it will help and save you some time

apparently Python need to be able to compile to win32 dll, it will solve the problem
In such a way that converting c# code to win32 dlls will make it usable by any development tool

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

pybind11 somehow slows c++ function - python

When I was working on a reproducible example, I found out that I compare different svm kernels in C++ example (it was based on libsvm params proved 'rbf') and in pybind11 lib (it was hardcoded 'linear'). After fixing it and comparing the same algorithms there was no difference in time.

Related

How can I make animated colormap plot of C++ arrays? Export to Python and use \matplotlib?

C++ Python Not Always Executing Python Script

ITK Filter slower in C++ than Python

Equivalent to GetTickCount() on Linux

How do you call Python code from C code?

Categories

Resources