Calling PyObject_Call from thread causes stack overflow - python

I am trying to use Jedi (https://github.com/davidhalter/jedi) to build a custom Python editor in C++. It works perfectly, but it is a bit slow and stalls for a short while, so I am calling those functions from inside a C++ thread. Doing so, I sometimes get a stack overflow error.
Here is my code:
//~ Create Script Instance class
PyObject* pScript = PyObject_Call(
    PyObject_GetAttrString(GetJediModule(), "Script"),
    PyTuple_Pack(1, PyString_FromString(TCHAR_TO_UTF8(*Source))),
    NULL);
if (pScript == NULL)
{
    UE_LOG(LogTemp, Verbose, TEXT("unable to get Script Class from Jedi Module"));
    ClearPython();
    return;
}
//~ Call complete method from Script Class
PyObject* Result = PyObject_Call(
    PyObject_GetAttrString(pScript, "complete"),
    PyTuple_Pack(2, Py_BuildValue("i", Line), Py_BuildValue("i", Offset)),
    NULL);
if (Result == NULL)
{
    UE_LOG(LogTemp, Verbose, TEXT("unable to call complete method from Script class"));
    ClearPython();
    return;
}
The error happens when I call PyObject_Call. I assume it is because of the thread, since everything works perfectly when I call the function from the main thread, but the call stack is not telling me anything useful, just an error inside python.dll.

Well, I found the answer just by luck. It is possible to choose the stack size when launching a thread in UE, and I was using a tiny value of 1024 bytes. After increasing it, I have been testing for 3 hours without any crashes, so I guess it is safe to assume it is working now.
Here is how I set up the stack size; the third argument is the stack size in bytes:
Thread = FRunnableThread::Create(this, TEXT("FAutoCompleteWorker"), 8 * 8 * 4096, TPri_Normal);
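One more thing worth noting, independent of the stack size: any non-main thread that calls into the CPython API must also hold the GIL while doing so. Below is a minimal sketch of wrapping the first call above, assuming the interpreter was initialized on the main thread and the GIL has been released there (PyEval_SaveThread or equivalent); the local names are just for illustration:
// Sketch only: acquire the GIL on the worker thread before touching Python.
PyGILState_STATE GilState = PyGILState_Ensure();

PyObject* ScriptClass = PyObject_GetAttrString(GetJediModule(), "Script");
PyObject* SourceStr   = PyString_FromString(TCHAR_TO_UTF8(*Source));
PyObject* Args        = PyTuple_Pack(1, SourceStr);   // all of these are new references
PyObject* pScript     = PyObject_Call(ScriptClass, Args, NULL);

Py_XDECREF(Args);
Py_XDECREF(SourceStr);
Py_XDECREF(ScriptClass);

// ... call "complete" on pScript as before, check for NULL, Py_XDECREF temporaries ...

PyGILState_Release(GilState);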

Related

Embed Python/C API in a multi-threading C++ program

I am trying to embed Python in a C++ multi-threading program using the Python/C API (version 3.7.3) on a quad-core ARM 64 bit architecture. A dedicated thread-safe class "PyHandler" takes care of all the Python API calls:
class PyHandler
{
public:
    PyHandler();
    ~PyHandler();
    bool run_fun();
    // ...
private:
    PyGILState_STATE _gstate;
    std::mutex _mutex;
};
In the constructor I initialize the Python interpreter:
PyHandler::PyHandler()
{
    Py_Initialize();
    //PyEval_SaveThread(); // UNCOMMENT TO MAKE EVERYTHING WORK !
}
And in the destructor I undo all initializations:
PyHandler::~PyHandler()
{
    _gstate = PyGILState_Ensure();
    if (Py_IsInitialized()) // finalize python interpreter
        Py_Finalize();
}
Now, in order to make run_fun() callable by one thread at a time, I use the mutex variable _mutex (see below). On top of this, I call PyGILState_Ensure() to make sure the current thread holds the python GIL, and call PyGILState_Release() at the end to release it. All the remaining python calls happen within these two calls:
bool PyHandler::run_fun()
{
    std::lock_guard<std::mutex> lockGuard(_mutex);
    _gstate = PyGILState_Ensure(); // give the current thread the Python GIL
    // Python calls...
    PyGILState_Release(_gstate); // release the Python GIL till now assigned to the current thread
    return true;
}
Here is how the main() looks like:
int main()
{
    PyHandler py; // constructor is called !
    int n_threads = 10;
    std::vector<std::thread> threads;
    for (int i = 0; i < n_threads; i++)
        threads.push_back(std::thread([&py]() { py.run_fun(); }));
    for (int i = 0; i < n_threads; i++)
        if (threads[i].joinable())
            threads[i].join();
}
Despite all these precautions, the program always deadlocks at the PyGILState_Ensure() line in run_fun() on the very first attempt. BUT when I uncomment the line with PyEval_SaveThread() in the constructor, everything magically works. Why is that?
Notice that I am not calling PyEval_RestoreThread() anywhere. Am I supposed to use the macros Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS instead? I thought these macros and PyEval_SaveThread() were only used when dealing with Python threads and NOT with non-Python threads, as in my case! Am I missing something?
The documentation for my case only mentions the use of PyGILState_Ensure() and PyGILState_Release(). Any help is highly appreciated.
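What the experiment points to is how Py_Initialize() interacts with the GIL: the thread that calls Py_Initialize() ends up holding the GIL, and unless that thread releases it (which is exactly what PyEval_SaveThread() does), every PyGILState_Ensure() in the other threads blocks forever. A minimal sketch of the resulting pattern, keeping run_fun() exactly as posted and storing the thread state so it can be restored before finalization (a sketch, not a drop-in replacement):
class PyHandler
{
public:
    PyHandler()
    {
        Py_Initialize();
        _mainState = PyEval_SaveThread(); // release the GIL held by the constructor's thread
    }
    ~PyHandler()
    {
        PyEval_RestoreThread(_mainState); // take the GIL back on this thread
        if (Py_IsInitialized())
            Py_Finalize();
    }
    bool run_fun(); // unchanged: PyGILState_Ensure()/PyGILState_Release() around the Python calls
private:
    PyThreadState* _mainState = nullptr;
    std::mutex _mutex;
};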

C++ Python Not Always Executing Python Script

I currently have a piece of hardware connected to C++ code using the MFC (Windows programming) framework. Basically the hardware is passing in image frames to my C++ code. In my C++ code, I am then calling a Python script using the CPython (Python embedding in C++) API to execute a model on that image. I've been noticing some weird behavior with the images though.
My C++ code is executing my Python script perfectly until some frame in the range of 80-90. After that point, my C++ code, for some reason, just stops executing the Python script. Despite that, the C++ code is still running normally - EXCEPT for the fact (which I just stated) that it's not executing the Python script.
Something to note: my Python script takes 5 seconds to execute the FIRST time, but then only 0.02 seconds to execute each frame after that first frame (I think due to the model getting set up).
At first, I thought it was a problem with speed, so I replaced all my Python code with just a "time.sleep()" call of varying duration, and even if I sleep for 5 seconds, each C++ call to Python still always gets executed. As a result, I don't think it's a matter of total time. For instance, if I do "time.sleep(1)", which sleeps for one second (longer than my Python script's execution time AFTER the first frame), my Python script still always gets executed.
Does anyone have any idea why this might be happening? Could it be because of the uneven running times, since it takes 5 seconds to run the first frame and then runs significantly faster for each frame after that? Could it be that Python is somehow unable to catch up after that time period?
This is my first time executing C++/Python on hardware, so I'm also new to this. Any help would be greatly appreciated!
To give some idea of my code, here is a snippet:
if (pFuncFrame && PyCallable_Check(pFuncFrame)) {
    PyObject* pArgs = PyTuple_New(1);
    PyTuple_SetItem(pArgs, 0, PyUnicode_FromString("img.bmp"));
    PyObject_CallObject(pFuncFrame, pArgs);
    std::cout << "Called the frame function";
}
else {
    std::cout << "Did not get the frame function";
}
I'm willing to bet that the first execution ends in a Python exception which isn't cleared until you execute some new Python statement in the second iteration, which therefore fails immediately. I recommend fixing the memory leaks and adding some error handling code to get some diagnostics (which will be useful either way). For example (I haven't tried this, since you didn't provide a compilable example, but the following shouldn't be too far off):
if (pFuncFrame && PyCallable_Check(pFuncFrame)) {
    PyObject* pArgs = PyTuple_New(1);
    PyTuple_SetItem(pArgs, 0, PyUnicode_FromString("img.bmp"));
    PyObject* res = PyObject_CallObject(pFuncFrame, pArgs);
    if (!res) {
        if (PyErr_Occurred()) PyErr_Print();
        else std::cerr << "Python exception without error set\n";
    } else {
        Py_DECREF(res);
        std::cout << "Called the frame function";
    }
    Py_DECREF(pArgs);
}
else {
    std::cout << "Did not get the frame function";
}

Returning new PyObject * from C++ to Python eventually segfaults

I am writing the C++ and Python side of a library that exposes some functionality in our software written in C++ to Python scripts. I'm compiling some source files of interest and a wrapper file that looks like below into a shared library and loading that library using ctypes.
extern "C" {
PyObject *py_get_cxx_set_EXAMPLE(void)
{
std::set<long> cset = get_cxx_set_for_python();
PyGILState_STATE gstate = PyGILState_Ensure();
PyObject *pyset = PySet_New(NULL);
for (long c_long: cset)
PySet_Add(pyset, PyLong_FromLong(c_long));
PyGILState_Release(gstate);
return pyset;
}
}
and on the python side:
example_lib.py_get_cxx_set_EXAMPLE.restype = ctypes.py_object
for i in range(0, 1000):
    ret = example_lib.py_get_cxx_set_EXAMPLE()
I find that the first few calls would be successful, but the C++ code would segfault in the middle of the loop. Upon GDB'ing, I would find the end of the callstack like this:
#0 0x000055555563244a in PyErr_Occurred ()
#1 0x000055555562a387 in _PyObject_GC_Malloc ()
#2 0x0000555555629ebd in _PyObject_GC_New ()
#3 0x000055555562b23c in PyDict_New ()
#4 0x00007ffff66df9be in python::to_python_object<db::pmbus_diagnostics> (t=...) at python_wrapper/python.hpp:101
It looks like the Python runtime refuses to make more Python objects (in this case, a dict) for me...!
What did I do wrong in the C++ code?
EDIT:
Updated, see answer.
OK, I forgot to add the code to acquire and release the global interpreter lock for some class of functions. Sorry for the silly question.
Have faith in Python, kids.
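For reference, the same guard has to be applied to every function exported to ctypes that creates Python objects, including the converter that shows up in the backtrace. A hypothetical illustration of the pattern (the function name and contents below are made up, not the original to_python_object code):
#include <Python.h>

// Hypothetical sketch only: any other entry point called from Python via ctypes
// that creates Python objects needs the same PyGILState_Ensure/Release guard
// as py_get_cxx_set_EXAMPLE above.
extern "C" {
PyObject *py_get_cxx_dict_EXAMPLE(void)
{
    PyGILState_STATE gstate = PyGILState_Ensure();
    PyObject *pydict = PyDict_New();
    PyObject *value = PyLong_FromLong(42);
    PyDict_SetItemString(pydict, "example_key", value); // does not steal the reference
    Py_DECREF(value);
    PyGILState_Release(gstate);
    return pydict;
}
}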

Memory leak when running python script from C++

The following minimal example of calling a python function from C++ has a memory leak on my system:
script.py:
import tensorflow
def foo(param):
return "something"
main.cpp:
#include "python3.5/Python.h"
#include <iostream>
#include <string>
int main()
{
Py_Initialize();
PyRun_SimpleString("import sys");
PyRun_SimpleString("if not hasattr(sys,'argv'): sys.argv = ['']");
PyRun_SimpleString("sys.path.append('./')");
PyObject* moduleName = PyUnicode_FromString("script");
PyObject* pModule = PyImport_Import(moduleName);
PyObject* fooFunc = PyObject_GetAttrString(pModule, "foo");
PyObject* param = PyUnicode_FromString("dummy");
PyObject* args = PyTuple_Pack(1, param);
PyObject* result = PyObject_CallObject(fooFunc, args);
Py_CLEAR(result);
Py_CLEAR(args);
Py_CLEAR(param);
Py_CLEAR(fooFunc);
Py_CLEAR(pModule);
Py_CLEAR(moduleName);
Py_Finalize();
}
compiled with
g++ -std=c++11 main.cpp $(python3-config --cflags) $(python3-config --ldflags) -o main
and run with valgrind
valgrind --leak-check=yes ./main
produces the following summary
LEAK SUMMARY:
==24155== definitely lost: 161,840 bytes in 103 blocks
==24155== indirectly lost: 33 bytes in 2 blocks
==24155== possibly lost: 184,791 bytes in 132 blocks
==24155== still reachable: 14,067,324 bytes in 130,118 blocks
==24155== of which reachable via heuristic:
==24155== stdstring : 2,273,096 bytes in 43,865 blocks
==24155== suppressed: 0 bytes in 0 blocks
I'm using Linux Mint 18.2 Sonya, g++ 5.4.0, Python 3.5.2 and TensorFlow 1.4.1.
Removing import tensorflow makes the leak disappear. Is this a bug in TensorFlow or did I do something wrong? (I expect the latter to be true.)
Additionally when I create a Keras layer in Python
#script.py
from keras.layers import Input
def foo(param):
    a = Input(shape=(32,))
    return "str"
and run the call to Python from C++ repeatedly
//main.cpp
#include "python3.5/Python.h"
#include <iostream>
#include <string>
int main()
{
Py_Initialize();
PyRun_SimpleString("import sys");
PyRun_SimpleString("if not hasattr(sys,'argv'): sys.argv = ['']");
PyRun_SimpleString("sys.path.append('./')");
PyObject* moduleName = PyUnicode_FromString("script");
PyObject* pModule = PyImport_Import(moduleName);
for (int i = 0; i < 10000000; ++i)
{
std::cout << i << std::endl;
PyObject* fooFunc = PyObject_GetAttrString(pModule, "foo");
PyObject* param = PyUnicode_FromString("dummy");
PyObject* args = PyTuple_Pack(1, param);
PyObject* result = PyObject_CallObject(fooFunc, args);
Py_CLEAR(result);
Py_CLEAR(args);
Py_CLEAR(param);
Py_CLEAR(fooFunc);
}
Py_CLEAR(pModule);
Py_CLEAR(moduleName);
Py_Finalize();
}
the memory consumption of the application continuously grows ad infinitum during runtime.
So I guess there is something fundamentally wrong with the way I call the python function from C++, but what is it?
There are two different types of "memory leaks" in your question.
Valgrind is telling you about the first type. However, it is pretty usual for Python modules to "leak" memory - it is mostly some globals which are allocated/initialized when the module is loaded. And because the module is loaded only once in Python, it's not a big problem.
A well-known example is numpy's PyArray_API: it must be initialized via _import_array, is then never deleted, and stays in memory until the Python interpreter is shut down.
So it is a "memory leak" by design; you can argue whether it is a good design or not, but at the end of the day there is nothing you can do about it.
I don't have enough insight into the tensorflow module to pinpoint the places where such memory leaks happen, but I'm pretty sure it's nothing you should worry about.
The second "memory leak" is more subtle.
You can get a lead by comparing the valgrind output for 10^4 and 10^5 iterations of the loop - there will be almost no difference! There is, however, a difference in the peak memory consumption.
Unlike C++, Python has a garbage collector - so you cannot know exactly when an object is destructed. CPython uses reference counting, so when a reference count reaches 0, the object is destroyed. However, when there is a cycle of references (e.g. object A holds a reference to object B and object B holds a reference to object A), it is not so simple: the garbage collector needs to iterate through all objects to find such no-longer-used cycles.
One could think that keras.layers.Input has such a cycle somewhere (and this is true), but this is not the reason for this "memory leak", which can also be observed in pure Python.
We can use the objgraph package to inspect the references. Let's run the following Python script:
#pure.py
from keras.layers import Input
import gc
import sys
import objgraph

def foo(param):
    a = Input(shape=(1280,))
    return "str"

### MAIN :
print("Counts at the beginning:")
objgraph.show_most_common_types()
objgraph.show_growth(limit=7)

for i in range(int(sys.argv[1])):
    foo(" ")

gc.collect()  # just to be sure

print("\n\n\n Counts at the end")
objgraph.show_most_common_types()
objgraph.show_growth(limit=7)

import random
objgraph.show_chain(
    objgraph.find_backref_chain(
        random.choice(objgraph.by_type('Tensor')),  # take some random tensor
        objgraph.is_proper_module),
    filename='chain.png')
and run it:
>>> python pure.py 1000
We can see the following: at the end there are exactly 1000 Tensors, which means that none of our created objects got disposed of!
If we take a look at the chain that keeps a tensor object alive (created with objgraph.show_chain), we see that there is a tensorflow Graph object where all tensors are registered, and they stay there until the session is closed.
So far the theory. However, neither:
#close session and free resources:
import keras
keras.backend.get_session().close()#free all resources
print("\n\n\n Counts after session.close():")
objgraph.show_most_common_types()
nor the solution proposed here:
with tf.Graph().as_default(), tf.Session() as sess:
    for step in range(int(sys.argv[1])):
        foo(" ")
works for the current tensorflow version, which is probably a bug.
In a nutshell: you are doing nothing wrong in your C++ code, and there are no memory leaks you are responsible for. In fact, you would see exactly the same memory consumption if you called the function foo from a pure Python script over and over again.
All created Tensors are registered in a Graph object and aren't automatically released; you must release them by closing the backend session - which, however, doesn't work due to a bug in the current tensorflow version 1.4.0.

Stop embedded python

I'm embedding Python in a C++ plug-in. The plug-in calls a Python algorithm dozens of times during each session, each time sending the algorithm different data. So far so good.
But now I have a problem:
The algorithm sometimes takes minutes to solve and return a solution, and during that time the conditions often change, making that solution irrelevant. So, what I want is to be able to stop the algorithm at any moment and run it again right after with another set of data.
Here's the C++ code for embedding Python that I have so far:
void py_embed(void* data){
    counter_thread = false;
    PyObject *pName, *pModule, *pDict, *pFunc;

    //To inform the interpreter about paths to Python run-time libraries
    Py_SetProgramName(arg->argv[0]);

    if(!gil_init){
        gil_init = 1;
        PyEval_InitThreads();
        PyEval_SaveThread();
    }
    PyGILState_STATE gstate = PyGILState_Ensure();

    // Build the name object
    pName = PyString_FromString(arg->argv[1]);
    if( !pName ){
        textfile3<<"Can't build the object "<<endl;
    }

    // Load the module object
    pModule = PyImport_Import(pName);
    if( !pModule ){
        textfile3<<"Can't import the module "<<endl;
    }

    // pDict is a borrowed reference
    pDict = PyModule_GetDict(pModule);
    if( !pDict ){
        textfile3<<"Can't get the dict"<<endl;
    }

    // pFunc is also a borrowed reference
    pFunc = PyDict_GetItemString(pDict, arg->argv[2]);
    if( !pFunc || !PyCallable_Check(pFunc) ){
        textfile3<<"Can't get the function"<<endl;
    }

    /*Call the algorithm and treat the data that is returned from it
    ...
    ...
    */

    // Clean up
    Py_XDECREF(pArgs2);
    Py_XDECREF(pValue2);
    Py_DECREF(pModule);
    Py_DECREF(pName);

    PyGILState_Release(gstate);
    counter_thread = true;
    _endthread();
};
Edit: The Python algorithm is not my work and I shouldn't change it.
This is based on a cursory knowledge of Python and a quick read of the Python docs.
PyThreadState_SetAsyncExc lets you inject an exception into a running python thread.
Run your Python interpreter in some thread. From another thread, grab the GIL state (PyGILState_Ensure) and then use PyThreadState_SetAsyncExc to inject an exception into the thread that is running the script. (This may require some precursor work to teach the Python interpreter about the second thread.)
Unless the Python code you are running is full of "catch-alls", this should cause it to terminate execution.
You can also look into the code to create python sub-interpreters, which would let you start up a new script while the old one shuts down.
Py_AddPendingCall is also tempting to use, but there are enough warnings around it that it is probably best avoided.
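A hedged sketch of the PyThreadState_SetAsyncExc idea above, assuming the id of the thread running the algorithm was captured beforehand (for example with PyThread_get_thread_ident() inside py_embed while it held the GIL); the function and parameter names here are made up:
#include <Python.h>

// Sketch only: ask the interpreter to raise KeyboardInterrupt in the thread
// identified by worker_tid (the id type is unsigned long on Python 3.7+).
void stop_python_algorithm(long worker_tid)
{
    PyGILState_STATE gstate = PyGILState_Ensure();
    int affected = PyThreadState_SetAsyncExc(worker_tid, PyExc_KeyboardInterrupt);
    if (affected == 0)
    {
        // no thread with this id is known to the interpreter
    }
    else if (affected > 1)
    {
        // more than one thread matched: revert the pending exception
        PyThreadState_SetAsyncExc(worker_tid, NULL);
    }
    PyGILState_Release(gstate);
}
Note that the exception is only delivered once the target thread executes Python bytecode again, so it will not interrupt a long-running blocking C call, and a bare except in the algorithm can swallow it.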
Sorry, but your choices are limited. You can either change the Python code (OK, it's a plugin - not an option) or run it in another PROCESS (with some nice IPC in between). Then you can use the system API to kill it.
So, I finally thought of a solution (more of a workaround really).
Instead of terminating the thread that is running the algorithm - let's call it T1 - I create another one - T2 - with the set of data that is relevant at that time.
In every thread I do this:
thread_counter+=1; //global variable
int thisthread=thread_counter;
and after the solution from Python is given, I just check which one is the most "recent", the one from T1 or the one from T2:
if(thisthread == thread_counter){
    /*save the solution and treat it */
}
In terms of computational effort this is obviously not the best solution, but it serves my purposes.
Thank you for the help, guys.
I've been thinking about this problem, and I agree that sub-interpreters may provide one possible solution: https://docs.python.org/2/c-api/init.html#sub-interpreter-support. The API supports calls for creating new interpreters and ending existing ones. The bugs & caveats section describes some issues that, depending on your architecture, may or may not present a problem.
Another possible solution is to use the Python multiprocessing module and, within your worker thread, test a global variable (something like time_to_die). Then, from the parent, you grab the GIL, set the variable, release the GIL, and wait for the child to finish.
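A minimal sketch of the flag-setting half of that idea for the in-process (threaded) case; the module and variable names (algo_module, time_to_die) are made up, and the algorithm's Python code would have to poll the flag itself - which conflicts with the "can't change the algorithm" constraint - so treat it purely as an illustration:
// Sketch only: set a hypothetical module-level flag while holding the GIL.
PyGILState_STATE gstate = PyGILState_Ensure();
PyRun_SimpleString("import algo_module\nalgo_module.time_to_die = True");
PyGILState_Release(gstate);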
But then another idea occurred to me. Why not just use fork(), initialize your Python interpreter in the child, and, when the parent decides it's time for the Python computation to end, just kill it? Something like this:
void process() {
    int pid = fork();
    if (pid) {
        // in parent
        sleep(60);
        kill(pid, 9);
    }
    else {
        // in child
        Py_Initialize();
        PyRun_SimpleString("# insert long running python calculation");
    }
}
(This example assumes *nix; if you're on Windows, substitute CreateProcess()/TerminateProcess().)
