Pybind11 and std::vector -- How to free data using capsules?

I have a C++ function that returns a std::vector and, using Pybind11, I would like to return the contents of that vector as a Numpy array without having to copy the underlying data of the vector into a raw data array.
Current Attempt
In this well-written SO answer the author demonstrates how to ensure that a raw data array created in C++ is appropriately freed when the Numpy array has zero reference count. I tried to write a version of this using std::vector instead:
// aside - I made a templated version of the wrapper with which
// I create specific instances of in the PYBIND11_MODULE definitions:
//
// m.def("my_func", &wrapper<int>, ...)
// m.def("my_func", &wrapper<float>, ...)
//
template <typename T>
py::array_t<T> wrapper(py::array_t<T> input) {
    auto proxy = input.template unchecked<1>();
    std::vector<T> result = compute_something_returns_vector(proxy);
    // give memory cleanup responsibility to the Numpy array
    py::capsule free_when_done(result.data(), [](void *f) {
        auto foo = reinterpret_cast<T *>(f);
        delete[] foo;
    });
    return py::array_t<T>({result.size()},  // shape
                          {sizeof(T)},      // stride
                          result.data(),    // data pointer
                          free_when_done);
}
Observed Issues
However, if I call this from Python I observe two things: (1) the data in the output array is garbage and (2) when I manually delete the Numpy array I receive the following error (SIGABRT):
python3(91198,0x7fff9f2c73c0) malloc: *** error for object 0x7f8816561550: pointer being freed was not allocated
My guess is that this issue has to do with the line "delete[] foo", which presumably is being called with foo set to result.data(). This is not the way to deallocate a std::vector.
Possible Solutions
One possible solution is to create a T *ptr = new T[result.size()] and copy the contents of result to this raw data array. However, I have cases where the results might be large and I want to avoid taking all of that time to allocate and copy. (But perhaps it's not as long as I think it would be.)
Also, I don't know much about std::allocator but perhaps there is a way to allocate the raw data array needed by the output vector outside the compute_something_returns_vector() function call and then discard the std::vector afterwards, retaining the underlying raw data array?
The final option is to rewrite compute_something_returns_vector.

After an offline discussion with a colleague I resolved my problem. I do not want to commit an SO faux pas so I won't accept my own answer. However, for the sake of using SO as a catalog of information I want to provide the answer here for others.
The problem was simple: result was stack-allocated and needed to be heap-allocated so that free_when_done can take ownership. Below is an example fix:
{
    // ... snip ...
    std::vector<T> *result = new std::vector<T>(compute_something_returns_vector(proxy));
    py::capsule free_when_done(result, [](void *f) {
        auto foo = reinterpret_cast<std::vector<T> *>(f);
        delete foo;
    });
    return py::array_t<T>({result->size()},  // shape
                          {sizeof(T)},       // stride
                          result->data(),    // data pointer
                          free_when_done);
}
I was also able to implement a solution using std::unique_ptr that doesn't require the use of a free_when_done function. However, I wasn't able to run Valgrind with either solution so I'm not 100% sure that the memory held by the vector was appropriately freed. (Valgrind + Python is a mystery to me.) For completeness, below is the std::unique_ptr approach:
{
    // ... snip ...
    std::unique_ptr<std::vector<T>> result =
        std::make_unique<std::vector<T>>(compute_something_returns_vector(proxy));
    return py::array_t<T>({result->size()},  // shape
                          {sizeof(T)},       // stride
                          result->data());   // data pointer
}
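I was, however, able to inspect the addresses of the vectors allocated in both the Python and C++ code and confirmed that no copies of the output of compute_something_returns_vector() were made. One caveat worth checking: as far as I understand pybind11, when no base object (such as a capsule) is passed to the py::array_t constructor, the constructor copies the data into a buffer owned by the new Numpy array, so the std::unique_ptr variant may be safe only because of that copy.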

Related

Is there any way to know which c++ core function is called in tensorflow?

I heard that TensorFlow is a Python wrapper around core functions implemented in C++.
I wonder which core C++ function runs when a given piece of Python code is called. Is there a way to find out? The TensorFlow profiler only provides information about the Python functions. Thank you.
There are 3 levels of depth that you need to go through to get to C++ code: Python implementation, wrappers, and C++.
In case it's an Op (like Conv / MatMul / ...)
First you need to trace what gets called by the Python implementation. If you're using high-level library utilities like Keras, this may be quite hard. It's much easier if you're calling math operations in TF directly (like nn.conv2d).
Most ops are implemented in tensorflow/python/ops. For example, the function nn.conv2d is implemented in tensorflow/python/ops/nn_ops.py.
As you can see, this op (like most ops) delegates the work to gen_nn_ops.conv2d. The gen_ files are auto-generated during the build, so unless you're willing to inspect the Bazel files and build from source, you can't view this file.
Fortunately, it seems to me that there is a direct mapping between the functions available in the gen_ files and the ops defined in .cc files.
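The first step above (finding where a Python-level function is defined) can be sketched with the standard inspect module. This is only a sketch under assumptions: json.dumps stands in for a TensorFlow function such as nn.conv2d so that the example runs without TensorFlow installed, and locate_python_impl is a hypothetical helper name, not part of any library.

```python
import inspect
import json


def locate_python_impl(func):
    """Return (source file, first line number) of a Python-level function."""
    # Follow any __wrapped__ chain left behind by functools.wraps-style decorators.
    target = inspect.unwrap(func)
    path = inspect.getsourcefile(target)
    _, first_line = inspect.getsourcelines(target)
    return path, first_line


# json.dumps is a stand-in (assumption); with TensorFlow installed you could
# pass the function you are tracing instead.
path, first_line = locate_python_impl(json.dumps)
print(f"{path}:{first_line}")
```

Opening the reported file at that line shows which lower-level function the Python wrapper delegates to, which is the name to search for in the next step.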
By investigating tensorflow/core/ops/nn_ops.cc you can find the registration of the Conv2D op:
REGISTER_OP("Conv2D")
.Input("input: T")
.Input("filter: T")
.Output("output: T")
.Attr("T: {half, bfloat16, float, double}")
.Attr("strides: list(int)")
.Attr("use_cudnn_on_gpu: bool = true")
.Attr(GetPaddingAttrStringWithExplicit())
.Attr(GetExplicitPaddingsAttrString())
.Attr(GetConvnetDataFormatAttrString())
.Attr("dilations: list(int) = [1, 1, 1, 1]")
.SetShapeFn(shape_inference::Conv2DShapeWithExplicitPadding);
Unfortunately, this macro only tells TensorFlow that there is such an operation as Conv2D; it doesn't say anything about how it should run.
In TensorFlow, an Op specifies what needs to be done, but a Kernel is the one that actually does the job. You can find the kernels that can run a given op by looking for the REGISTER_KERNEL_BUILDER macro, which is responsible for matching a kernel to an Op.
For conv2d, you can find one in tensorflow/core/kernels/conv_ops.cc
#define REGISTER_CPU(T) \
REGISTER_KERNEL_BUILDER( \
Name("Conv2D").Device(DEVICE_CPU).TypeConstraint<T>("T"), \
Conv2DOp<CPUDevice, T>);
// If we're using the alternative GEMM-based implementation of Conv2D for the
// CPU implementation, don't register this EigenTensor-based version.
#if !defined(USE_GEMM_FOR_CONV)
TF_CALL_half(REGISTER_CPU);
TF_CALL_float(REGISTER_CPU);
TF_CALL_double(REGISTER_CPU);
#endif // USE_GEMM_FOR_CONV
This finally brings us to what we were looking for. Kernels have a Compute method, so we're interested in Conv2DOp<CPUDevice, T>::Compute.
Here it is (defined in the same file):
void Compute(OpKernelContext* context) override {
  // Input tensor is of the following dimensions:
  // [ batch, in_rows, in_cols, in_depth ]
  const Tensor& input = context->input(0);
  // Input filter is of the following dimensions:
  // [ filter_rows, filter_cols, in_depth, out_depth]
  const Tensor& filter = context->input(1);
  Conv2DDimensions dimensions;
  OP_REQUIRES_OK(context,
                 ComputeConv2DDimension(params_, input, filter, &dimensions));
  TensorShape out_shape = ShapeFromFormat(
      params_.data_format, dimensions.batch, dimensions.out_rows,
      dimensions.out_cols, dimensions.out_depth);
  // Output tensor is of the following dimensions:
  // [ in_batch, out_rows, out_cols, out_depth ]
  Tensor* output = nullptr;
  OP_REQUIRES_OK(context, context->allocate_output(0, out_shape, &output));
  ... // Skipped for clarity
  if (params_.padding != EXPLICIT &&
      LaunchDeepConvOp<Device, T>::Run(
          context, input, filter, dimensions.batch, dimensions.input_rows,
          dimensions.input_cols, dimensions.in_depth, dimensions.filter_rows,
          dimensions.filter_cols, dimensions.pad_rows_before,
          dimensions.pad_cols_before, dimensions.out_rows,
          dimensions.out_cols, dimensions.out_depth, dimensions.dilation_rows,
          dimensions.dilation_cols, dimensions.stride_rows,
          dimensions.stride_cols, output, params_.data_format)) {
    return;
  }
  ...
This is the end of the journey. Some ops have their actual implementation here. Conv2D is not very satisfying: it turns out it delegates the work to LaunchDeepConvOp. You can dig deeper if you need to.
In case it's not an Op
Ops are quite special in TF. Other code is linked to Python by means of the C API.
The C API is available as c_api.cc and c_api.h. The header file declares the C functions available to Python. The source file (.cc) is a bridge between C and C++: it defines C functions (or, to be more precise, functions with C linkage) that call the corresponding C++ functions. If you know the C function, it's pretty easy to trace which C++ function was called.
From Python, it usually looks like this:
# Import
from tensorflow.python import pywrap_tensorflow as c_api
...
# Usage
def get_all_registered_kernels():
    """Returns a KernelList proto of all registered kernels.
    """
    buf = c_api.TF_GetAllRegisteredKernels()
As you can see, the names match. The implementation of this wrapper is generated, so don't look for it.

How does one deal with various errors in statically typed languages (or when typing in general)

For context, my primary language is Python, and I'm just beginning to use annotations. This is in preparation for learning C++ (and because, intuitively, it feels better).
I have something like this:
from models import UserLocation
from typing import Optional
import cluster_module
import db
def get_user_location(user_id: int, data: list) -> Optional[UserLocation]:
    loc = UserLocation.query.filter_by(user_id=user_id).one_or_none()
    if loc:
        return loc
    try:
        clusters = cluster_module.cluster(data)
    except ValueError:
        return None # cluster throws an error if there is not enough data to cluster
    if list(clusters.keys()) == [-1]:
        return None # If there is enough data to cluster, the cluster with an index of -1 represents all data that didn't fit into a cluster. It's possible for NO data to fit into a cluster.
    loc = UserLocation(user_id=user_id, location=clusters[0].center)
    db.session.add(loc)
    db.session.commit()
    return loc
So, I use typing.Optional to ensure that I can return None in case there's an error (if I understand correctly, the static-typing-language equivalent of this would be to return a null pointer of the appropriate type). Though, how does one distinguish between the two errors? What I'd like to do, for example, is return -1 if there's not enough data to cluster and -2 if there's data, but none of them fit into a cluster (or some similar thing). In Python, this is easy enough (because it isn't statically typed). Even with mypy, I can say something like typing.Union[UserLocation, int].
But, how does one do this in, say, C++ or Java? Would a Java programmer need to do something like set the function to return int, and return the ID of UserLocation instead of the object itself (then, whatever code uses the get_user_location function would itself do the lookup)? Is there runtime benefit to doing this, or is it just restructuring the code to fit the fact that a language is statically typed?
I believe I understand most of the obvious benefits of static typing w.r.t. code readability, compile-time, and efficiency at runtime—but I'm not sure what to make of this particular issue.
In a nutshell: How does one deal with functions (which return a non-basic type) indicating they ran into different errors in statically typed languages?
The direct C++ equivalent to the Python solution would be std::variant<T, U>, where T is the expected return value and U the error code type. You can then check which of the types the variant contains and go from there. For example:
#include <cstdlib>
#include <iostream>
#include <string>
#include <variant>
using t_error_code = int;
// Might return either `std::string` OR `t_error_code`
std::variant<std::string, t_error_code> foo()
{
    // This would cause a `t_error_code` to be returned
    //return 10;
    // This causes an `std::string` to be returned
    return "Hello, World!";
}
int main()
{
    auto result = foo();
    // Similar to the Python `if isinstance(result, t_error_code)`
    if (std::holds_alternative<t_error_code>(result))
    {
        const auto error_code = std::get<t_error_code>(result);
        std::cout << "error " << error_code << std::endl;
        return EXIT_FAILURE;
    }
    std::cout << std::get<std::string>(result) << std::endl;
}
However this isn't often seen in practice. If a function is expected to fail, then a single failed return value like a nullptr or end iterator suffices. Such failures are expected and aren't errors. If failure is unexpected, exceptions are preferred which also eliminates the problem you describe here. It's unusual to both expect failure and care about the details of why the failure occurred.
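For comparison, here is a rough Python sketch of the same "value or error code" idea, using an Enum instead of bare integers like -1 and -2. Everything in it is hypothetical and only for illustration: the names ClusterError, Location and locate, and the toy "clustering" rule (keep the points within 1.0 of the mean) are assumptions, not the asker's real cluster_module.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Union


class ClusterError(Enum):
    """Hypothetical error codes replacing the bare -1 / -2 from the question."""
    NOT_ENOUGH_DATA = -1
    NOTHING_CLUSTERED = -2


@dataclass
class Location:
    user_id: int
    center: float


def locate(user_id: int, data: List[float]) -> Union[Location, ClusterError]:
    # Toy rule (assumption): at least three points are needed to cluster.
    if len(data) < 3:
        return ClusterError.NOT_ENOUGH_DATA
    mean = sum(data) / len(data)
    # Toy rule (assumption): a point is "in the cluster" if within 1.0 of the mean.
    members = [x for x in data if abs(x - mean) <= 1.0]
    if not members:
        return ClusterError.NOTHING_CLUSTERED
    return Location(user_id, sum(members) / len(members))


result = locate(7, [1.0, 2.0, 3.0])
# The isinstance check plays the role of std::holds_alternative.
if isinstance(result, ClusterError):
    print("failed:", result.name)
else:
    print("center:", result.center)
```

A type checker such as mypy narrows result to Location in the else branch, which is roughly what std::get does after the std::holds_alternative check.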

How with pybind11 to bind a function that takes as argument a numpy.array() with, for instance, a shape (10, 10, 3)?

I would like to write a function that can take a multidimensional numpy array, not just 2D.
void compute(Eigen::Ref<Eigen::MatrixXd> array3d) {
// change the array in-place
// ...
}
or
Eigen::MatrixXd &compute() {
// create array
// ...
// and return it
}
I am using Eigen here just to depict the goal, I believe Eigen does not support 3D, or more dimensions, arrays.
I appreciate your feedback and patience since I am not familiar with Pybind11 nor Eigen.
From the pybind information, you can extract the dimension information.
For instance, this is what I do inside Audio ToolKit with m the current Python module you want to build:
py::class_<MyClass>(m,"name")
    .def("set_pointer", [](MyClass& instance, const py::array_t<DataType>& array)
    {
        gsl::index channels = 1;
        gsl::index size = array.shape(0);
        if(array.ndim() == 2)
        {
            channels = array.shape(0);
            size = array.shape(1);
        }
        // Call using array.data() and possibly add more dimension information, this is specific to my use case
        instance.set_pointer(array.data(), channels, size);
    });
From this, you can create the Eigen::Map call instead to create an Eigen-like matrix that you can use in your templated code.
Basically, pybind11 lets you write a lambda that acts as the wrapper for your use case. The same works for return values: take the Eigen object, create a pybind11 array, and populate it with the Eigen data.
Eigen has the Tensor class that you can use as well.
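To make the "data pointer plus dimension information" idea concrete, here is a small pure-Python sketch of the index arithmetic the C++ side would have to do for a (10, 10, 3) array. It assumes the array is C-contiguous (row-major) and counts strides in elements rather than bytes; c_strides and flat_offset are hypothetical helper names, not pybind11 or Numpy functions.

```python
from typing import Sequence, Tuple


def c_strides(shape: Sequence[int]) -> Tuple[int, ...]:
    """Element strides of a C-contiguous (row-major) array with this shape."""
    strides = []
    step = 1
    # The last axis varies fastest, so walk the shape from the back.
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))


def flat_offset(index: Sequence[int], shape: Sequence[int]) -> int:
    """Offset of element `index` from the start of the flat data buffer."""
    if len(index) != len(shape):
        raise ValueError("index and shape must have the same number of axes")
    for i, extent in zip(index, shape):
        if not 0 <= i < extent:
            raise IndexError("index out of bounds")
    return sum(i * s for i, s in zip(index, c_strides(shape)))


shape = (10, 10, 3)
print(c_strides(shape))               # (30, 3, 1)
print(flat_offset((2, 3, 1), shape))  # 70
```

On the C++ side the same offset would be applied to the pointer returned by array.data(), with the extents read from array.shape(i) for each of the array.ndim() axes.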

Bounding box definition for OpenCV object tracking

How is the bounding box object that OpenCV's tracker.init() function takes defined?
Is it (xcenter,ycenter,boxwidth,boxheight)
or (xmin,ymin,xmax,ymax)
or (ymin,xmin,ymax,xmax)
or something completely different?
I am using Python and OpenCV 3.3, and I basically do the following for each object I want to track, for each frame of a video:
tracker = cv2.TrackerKCF_create()
ok = tracker.init(previous_frame, bbox)
ok, bbox = tracker.update(current_frame)
The Answer is: (xmin,ymin,boxwidth,boxheight)
The other post states the answer as a fact, so let's look at how to figure it out on your own.
The Python version of OpenCV is a wrapper around the main C++ API, so when in doubt, it's always useful to consult either the main documentation, or even the source code. There is a short tutorial providing some basic information about the Python bindings.
First, let's look at cv::TrackerKCF. The init member takes the bounding box as an instance of cv::Rect2d (i.e. a variant of cv::Rect_ which represents the parameters using double values):
bool cv::Tracker::init(InputArray image, const Rect2d& boundingBox)
Now, the question is, how is a cv::Rect2d (or in general, the variants of cv::Rect_) represented in Python? I haven't found any part of documentation that states this clearly (although I think it's hinted at in the tutorials), but there is some useful information in the bindings tutorial mentioned earlier:
... But there may be some basic OpenCV datatypes like Mat, Vec4i,
Size. They need to be extended manually. For example, a Mat type
should be extended to Numpy array, Size should be extended to a tuple
of two integers etc. ... All such manual wrapper functions are placed
in modules/python/src2/cv2.cpp.
Not much, so let's look at the code they point us at. Lines 941-954 are what we're after:
template<>
bool pyopencv_to(PyObject* obj, Rect2d& r, const char* name)
{
    (void)name;
    if(!obj || obj == Py_None)
        return true;
    return PyArg_ParseTuple(obj, "dddd", &r.x, &r.y, &r.width, &r.height) > 0;
}
template<>
PyObject* pyopencv_from(const Rect2d& r)
{
    return Py_BuildValue("(dddd)", r.x, r.y, r.width, r.height);
}
The PyArg_ParseTuple in the first function is quite self-explanatory. A 4-tuple of double (floating point) values, in the order x, y, width and height.
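If your boxes come in one of the other layouts listed in the question, a small conversion step before tracker.init() is enough. The helpers below are a sketch with hypothetical names (corners_to_xywh, center_to_xywh); they use only plain Python and return the 4-tuple of floats in the (x, y, width, height) order described above.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def corners_to_xywh(xmin: float, ymin: float, xmax: float, ymax: float) -> Box:
    """Convert (xmin, ymin, xmax, ymax) to (x, y, width, height)."""
    if xmax < xmin or ymax < ymin:
        raise ValueError("max corner must not be smaller than min corner")
    return (float(xmin), float(ymin), float(xmax - xmin), float(ymax - ymin))


def center_to_xywh(xcenter: float, ycenter: float, width: float, height: float) -> Box:
    """Convert (xcenter, ycenter, width, height) to (x, y, width, height)."""
    return (float(xcenter - width / 2), float(ycenter - height / 2),
            float(width), float(height))


# Both describe the same 100x50 box whose top-left corner is at (10, 20).
print(corners_to_xywh(10, 20, 110, 70))
print(center_to_xywh(60, 45, 100, 50))
```

The resulting tuple is what would be passed as the bounding box argument of tracker.init().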

Assignment into Python 3.x Buffers with itemsize > 1

I am trying to expose a buffer of image pixel information (32 bit RGBA) through the Python 3.x buffer interface. After quite a bit of playing around, I was able to get this working like so:
int Image_get_buffer(PyObject* self, Py_buffer* view, int flags)
{
    int img_len;
    void* img_bytes;
    // Do my image fetch magic
    get_image_pixel_data(self, &img_bytes, &img_len);
    // Let python fill my buffer; propagate its status (0 on success, -1 on error)
    return PyBuffer_FillInfo(view, self, img_bytes, img_len, 0, flags);
}
And in python I can play with it like so:
mv = memoryview(image)
print(mv[0]) # prints b'\x00'
mv[0] = b'\xFF' # set the first pixels red component to full
mv[0:4] = b'\xFF\xFF\xFF\xFF' # set the first pixel to white
And that works splendidly. However, it would be great if I could work with the full pixel value (int, 4 byte) instead of individual bytes, so I modified the buffer fetch like so:
int Image_get_buffer(PyObject* self, Py_buffer* view, int flags)
{
    int img_len;
    void* img_bytes;
    // Do my image fetch magic
    get_image_pixel_data(self, &img_bytes, &img_len);
    // Fill my buffer manually (derived from the PyBuffer_FillInfo source)
    Py_INCREF(self);
    view->readonly = 0;
    view->obj = self;
    view->buf = img_bytes;
    view->itemsize = 4;
    view->ndim = 1;
    view->len = img_len;
    view->suboffsets = NULL;
    view->format = NULL;
    if ((flags & PyBUF_FORMAT) == PyBUF_FORMAT)
        view->format = "I";
    view->shape = NULL;
    if ((flags & PyBUF_ND) == PyBUF_ND)
    {
        Py_ssize_t shape[] = { (int)(img_len/4) };
        view->shape = shape;
    }
    view->strides = NULL;
    if((flags & PyBUF_STRIDED) == PyBUF_STRIDED)
    {
        Py_ssize_t strides[] = { 4 };
        view->strides = strides;
    }
    return 0;
}
This actually returns the data and I can read it correctly, but any attempt to assign a value into it now fails!
mv = memoryview(image)
print(mv[0]) # prints b'\x00\x00\x00\x00'
mv[0] = 0xFFFFFFFF # ERROR (1)
mv[0] = b'\xFF\xFF\xFF\xFF' # ERROR! (2)
mv[0] = mv[0] # ERROR?!? (3)
In case 1 the error informs me that 'int' does not support the buffer interface, which is a shame and a bit confusing (I did specify that the buffer format was "I" after all), but I can deal with that. In cases 2 and 3 things get really weird, though: both give me a TypeError reading mismatching item sizes for "my.Image" and "bytes" (where my.Image is, obviously, my image type).
This is very confusing to me, since the data I'm passing in is obviously the same size as what I get out of that element. It seems as though buffers simply stop allowing assignment if the itemsize is greater than 1. Of course, the documentation for this interface is really sparse and perusing the Python code doesn't really give any usage examples, so I'm fairly stuck. Am I missing some snippet of documentation that states "buffers become essentially useless when itemsize > 1", am I doing something wrong that I can't see, or is this a bug in Python? (Testing against 3.1.1)
Thanks for any insight you can give on this (admittedly advanced) issue!
I found this in the python code (in memoryobject.c in Objects) in the function memory_ass_sub:
/* XXX should we allow assignment of different item sizes
   as long as the byte length is the same?
   (e.g. assign 2 shorts to a 4-byte slice) */
if (srcview.itemsize != view->itemsize) {
    PyErr_Format(PyExc_TypeError,
        "mismatching item sizes for \"%.200s\" and \"%.200s\"",
        view->obj->ob_type->tp_name, srcview.obj->ob_type->tp_name);
    goto _error;
}
That's the source of the latter two errors. It looks like the itemsize for even mv[0] is still not equal to itself.
Update
Here's what I think is going on. When you try to assign something in mv, it calls memory_ass_sub in Objects/memoryobject.c, but that function takes only a PyObject as input. This object is then changed into a buffer inside using the PyObject_GetBuffer function even though in the case of mv[0] it is already a buffer (and the buffer you want!). My guess is that this function takes the object and makes it into a simple buffer of itemsize=1 regardless of whether it is already a buffer or not. That is why you get the mismatching item sizes even for
mv[0] = mv[0]
The problem with the first assignment,
mv[0] = 0xFFFFFFFF
stems (I think) from checking if the int is able to be used as a buffer, which currently it isn't set-up for from what I understand.
In other words, the buffer system isn't currently able to handle item sizes bigger than 1. It doesn't look like it is so far off, but it would take a bit more work on your end. If you do get it working, you should probably submit the changes back to the main Python distribution.
Another Update
The error from your first attempt at assigning mv[0] stems from the int failing PyObject_CheckBuffer. Apparently the system only handles copies from bufferable objects. This seems like it should be changed too.
Conclusion
Currently the Python buffer system can't handle items with itemsize > 1 as you guessed. Also, it can't handle assignments to a buffer from non-bufferable objects such as ints.
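Note that the conclusion above describes the Python 3.1 series the question was tested against. From Python 3.3 onwards, memoryview supports item assignment for native single-character formats such as "I". The sketch below is not the asker's Image type: it uses a plain bytearray as a stand-in pixel buffer (an assumption) and memoryview.cast to view it as unsigned ints.

```python
import struct

# Stand-in for the image data (assumption): two pixels, one native unsigned
# int each (4 bytes on mainstream platforms), initially all zero.
item_size = struct.calcsize("I")
pixels = bytearray(2 * item_size)

# View the same memory as unsigned ints instead of single bytes.
mv = memoryview(pixels).cast("I")

mv[0] = 2 ** (8 * item_size) - 1  # all bits set, i.e. 0xFFFFFFFF for 4-byte items
mv[1] = mv[0]                     # item-to-item assignment works as well

print(mv.itemsize == item_size)   # True
print(mv.tolist())
```

For a custom C type, the equivalent would be to export the buffer with format "I" and itemsize 4 as in the question; whether that then behaves like this sketch depends on the exported shape and strides being valid for the lifetime of the view.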
