I've got a large chunk of generated data (A[i,j,k]) on the device, but I only need one 'slice' of A[i,:,:], and in regular CUDA this could be easily accomplished with some pointer arithmetic.
Can the same thing be done within PyCUDA? Something like:
cuda.memcpy_dtoh(h_iA,d_A+(i*stride))
Obviously this is completely wrong since there's no size information (unless inferred from the dest shape), but hopefully you get the idea.
The PyCUDA GPUArray class supports slicing of 1D arrays, but not of higher dimensions that require a stride (although it is coming). You can, however, get access to the underlying pointer of a multidimensional GPUArray from its gpudata member, which is a pycuda.driver.DeviceAllocation, and the element size from the GPUArray's dtype.itemsize member. You can then do the same sort of pointer arithmetic you had in mind to get something that the driver memcpy functions will accept.
It isn't very pythonic, but it does work (or at least it did when I was doing a lot of pyCUDA + MPI hacking last year).
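A minimal sketch of the offset arithmetic described above, assuming A is stored C-contiguously (row-major), so that A[i,:,:] is one contiguous block of bytes. The helper name slice_byte_range is hypothetical, and the PyCUDA call in the trailing comment is the usage I would expect, not something verified here since it needs a GPU.

```python
import numpy as np

def slice_byte_range(shape, dtype, i):
    """Return (byte_offset, byte_count) of A[i, :, :] in a C-contiguous A."""
    ni, nj, nk = shape
    if not 0 <= i < ni:
        raise IndexError("slice index out of range")
    itemsize = np.dtype(dtype).itemsize
    nbytes = nj * nk * itemsize          # size of one (nj, nk) slice
    return i * nbytes, nbytes

shape, dtype, i = (4, 3, 5), np.float32, 2
offset, nbytes = slice_byte_range(shape, dtype, i)

# Host-side check that the byte range really is A[i, :, :].
A = np.arange(np.prod(shape), dtype=dtype).reshape(shape)
raw = A.tobytes()[offset:offset + nbytes]
h_slice = np.frombuffer(raw, dtype=dtype).reshape(shape[1:])

# Assumed PyCUDA usage (d_A is a pycuda.gpuarray.GPUArray holding A):
#   h_iA = np.empty(shape[1:], dtype=dtype)
#   cuda.memcpy_dtoh(h_iA, int(d_A.gpudata) + offset)
# The copy size is taken from the host destination array.
```

If the slice you need is not along the first axis, the bytes are no longer contiguous and a plain memcpy with an offset is not enough.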
It is unlikely that this is implemented in PyCUDA.
I can think of the following solutions:
Copy the entire array A to host memory and make a numpy array from the slice of interest.
Create a kernel that reads the matrix and writes out the desired slice.
Rearrange the generated data so that you can read one slice at a time using pointer arithmetic.
Related
Say there is a C++ class in which we would like to define a function to be called in python. On the python side the goal is being able to call this function with:
Input: of type 2D numpy-array(float32), or list of lists, or other suggestions
Output: of type 2D numpy-array(float32), or list of lists, or other suggestions
and if it helps latency/simplicity 1D array is also ok.
One would for example define a function in the header with:
bool func(const std::string& name);
which has string type as input and bool as output.
What can be a good choice with the above requirements to write in the header?
And finally, after the header file, what should be written in the pyx/pyd file for Cython?
Cython Input
The most natural Cython type to use for the input interface between Python and Cython would be a 2D typed memoryview. This will take a 2D numpy array, as well as any other 2D array type that exports the buffer interface (there aren't too many other types since Numpy is pretty ubiquitous, but some image-handling libraries have some alternatives).
I'd avoid using list-of-lists as an interface - the length of the second dimension is poorly defined. However, Numpy arrays are easily created from list-of-lists.
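As a small illustration of the point about list-of-lists, here is a sketch of converting one to a float32 array on the Python side before handing it to the Cython function. The variable names are made up for the example.

```python
import numpy as np

rows = [[1, 2, 3], [4, 5, 6]]

# A rectangular list-of-lists converts cleanly to a 2D float32 array.
arr = np.ascontiguousarray(rows, dtype=np.float32)

# A ragged list-of-lists has no well-defined second dimension,
# so the conversion to a numeric 2D array fails.
ragged = [[1, 2], [3]]
try:
    np.asarray(ragged, dtype=np.float32)
    ragged_ok = True
except ValueError:
    ragged_ok = False
```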
Cython Output
For output you can return either a Cython memoryview or a Numpy array (easily created from a memoryview with np.asarray(memview)). I'd probably return a Numpy array, but make the decision based on whether you want to make Numpy a hard dependency.
C++ interface
This is very difficult to answer without knowing about your code. If you have existing code you should just use the type that's most natural to that if at all possible.
You can get a pointer from your memoryview with &memview[0,0], and access its attributes .shape and .strides to get information about how the data is stored. (If you make the memoryview contiguous then you know the strides from the shape, so it's simpler.) You then need to decide whether to copy the data, or just use a pointer to the Python-owned data (if C++ only keeps the data for the duration of the function call then using a pointer is good).
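The remark about strides following from shape can be checked from the Python side with plain numpy. This is only a sketch; c_strides is a hypothetical helper, and the same numbers are what a Cython memoryview would report through .shape and .strides.

```python
import numpy as np

def c_strides(shape, itemsize):
    """Byte strides of a C-contiguous array with the given shape."""
    strides = []
    step = itemsize
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))

a = np.zeros((3, 4), dtype=np.float32)   # C-contiguous, 4 bytes per element
expected = c_strides(a.shape, a.itemsize)

# A transposed view shares the data but is no longer C-contiguous,
# so its strides cannot be derived from its shape in the same way.
t = a.T
```

For a non-contiguous input, np.ascontiguousarray makes a contiguous copy that is safe to pass as a bare pointer plus shape.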
Similar considerations apply to the output data, but it's hard to know without knowing what you're trying to do in C++.
I'm currently embedding Python in my C++ program using boost/python in order to use matplotlib. Now I'm stuck at a point where I have to construct a large data structure, let's say a dense 10000x10000 matrix of doubles. I want to plot columns of that matrix and I figured that I have multiple options to do so:
Iterating and copying every value into a numpy array --> I don't want to do that for an obvious reason which is doubled memory consumption
Iterating and exporting every value into a file, then importing it in Python --> I could do that completely without boost/python and I don't think this is a nice way
Allocate and store the matrix in Python and just update the values from C++ --> But as stated here it's not a good idea to switch back and forth between the Python interpreter and my C++ program
Somehow expose the matrix to python without having to copy it --> All I can find on that matter is about extending Python with C++ classes and not embedding
Which of these is the best option concerning performance and, of course, memory consumption? Or is there an even better way of doing that kind of task?
To prevent copying in Boost.Python, one can either:
Use policies to return internal references
Allocate on the free store and use policies to have Python manage the object
Allocate the Python object then extract a reference to the array within C++
Use a smart pointer to share ownership between C++ and Python
If the matrix has a C-style contiguous memory layout, then consider using the Numpy C-API. The PyArray_SimpleNewFromData() function can be used to create an ndarray object that wraps memory that has been allocated elsewhere. This would allow one to expose the data to Python without requiring copying or transferring each element between the languages. The how to extend documentation is a great resource for dealing with the Numpy C-API:
Sometimes, you want to wrap memory allocated elsewhere into an ndarray object for downstream use. This routine makes it straightforward to do that. [...] A new reference to an ndarray is returned, but the ndarray will not own its data. When this ndarray is deallocated, the pointer will not be freed.
[...]
If you want the memory to be freed as soon as the ndarray is deallocated then simply set the OWNDATA flag on the returned ndarray.
Also, while the plotting function may create copies of the array, it can do so within the C-API, allowing it to take advantage of the memory layout.
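The C-API call itself has to be made from C or C++, but the idea of an ndarray that wraps memory it does not own can be sketched in pure Python. Here a ctypes buffer stands in for the matrix allocated by the C++ program (an assumption made only for the example), and np.frombuffer plays the role of the wrapping step.

```python
import ctypes
import numpy as np

rows, cols = 3, 4

# Stand-in for memory owned by the C++ side: a flat block of doubles.
buf = (ctypes.c_double * (rows * cols))()

# Wrap the existing memory; no element is copied.
m = np.frombuffer(buf, dtype=np.float64).reshape(rows, cols)

# A write through the original buffer is visible through the array.
buf[5] = 42.0            # flat index 5 is row 1, column 1

# Taking a column is a strided view, again without a copy.
col = m[:, 1]
```

As with the C-API version, the wrapped memory must stay alive for as long as the array is in use.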
If performance is a concern, it may be worth considering the plotting itself:
taking a sample of the data and plotting it may be sufficient depending on the data distribution
using a raster-based backend, such as Agg, will often outperform vector-based backends on large datasets
benchmarking other tools that are designed for large data, such as Vispy
Although Tanner's answer brought me a big step forward, I ended up using Boost.NumPy, an unofficial extension to Boost.Python that can easily be added. It wraps around the NumPy C API and makes it safer and easier to use.
Just a short question that I can't find the answer to before I head off for the day.
When I do something like this:
v1 = float_list_python = ... # <some list of floats>
v2 = float_array_NumPy = ... # <some numpy.ndarray of floats>
# I guess they don't have to be floats -
# but some object that also has a native
# object in C, so that numpy can just use
# that
If I want to multiply these vectors by a scalar, my understanding has always been that the Python list is a list of object references, and so looping through the list to do the multiplication must fetch the locations of all the floats, and then must get the floats in order to do it, which is one of the reasons it's slow.
If I do the same thing in NumPy, then, well, I'm not sure what happens. There are a number of things I imagine could happen:
It splits the multiplication up across the cores.
It vectorises the multiplications (as well?)
The documentation I've found suggests that many of the primitives in numpy take advantage of the first option there whenever they can (I don't have a computer on hand at the moment that I can test it on). And my intuition tells me that number 2 should happen whenever it's possible.
So my question is, if I create a NumPy array of Python objects, will it still at least perform operations on the list in parallel? I know that if you create an array of objects that have native C types, then it will actually create a contiguous array in memory of the actual objects, and that if you create a numpy array of Python objects it will create an array of references, but I don't see why this would rule out parallel operations on said list, and I cannot find anywhere that explicitly states that.
EDIT: I feel there's a bit of confusion over what I'm asking. I understand what vectorisation is, I understand that it is a compiler optimisation, and not something you necessarily program in (though aligning the data such that it's contiguous in memory is important). On the grounds of vectorisation, all I wanted to know was whether or not numpy uses it. If I do something like np_array1 * np_array2, does the underlying library call use vectorisation (presuming that dtype is a compatible type)?
For the splitting up over the cores, all I mean there is: if I again do something like np_array1 * np_array2, but this time with dtype=object, would it divide that work up amongst the cores?
numpy is fast because it performs numeric operations like this in fast compiled C code. In contrast the list operation operates at the interpreted Python level (streamlined as much as possible with Python bytecodes etc).
A numpy array of numeric type stores those numbers in a data buffer. At least in the simple cases this is just a block of bytes that C code can step through efficiently. The array also has shape and strides information that allows multidimensional access.
When you multiply the array by a scalar, it, in effect, calls a C function titled something like 'multiply_array_by_scalar', which does the multiplication in fast compiled code. So this kind of numpy operation is fast (compared to Python list code) regardless of the number of cores or other multi-processing/threading enhancements.
Arrays of objects do not have any special speed advantage (compared to lists), at least not at this time.
Look at my answer to a question about creating an array of arrays, https://stackoverflow.com/a/28284526/901925
I had to use iteration to initialize the values.
Have you done any time experiments? For example, construct an array, say (1000,2). Use tolist() to create an equivalent list of lists. And make a similar array of objects, with each object being a (2,) array or list (how much work did that take?). Now do something simple like len(x) for each of those sub lists.
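A sketch of the kind of timing experiment suggested here, applied to the scalar multiplication from the question. The sizes and repeat counts are arbitrary choices, and the printed timings will differ between machines, so no numbers are quoted.

```python
import timeit
import numpy as np

n = 10_000
values = [float(i) for i in range(n)]
arr = np.array(values)                 # contiguous float64 buffer
obj = np.array(values, dtype=object)   # array of references to Python floats

def scale_list():
    return [v * 2.0 for v in values]

def scale_array():
    return arr * 2.0

def scale_object_array():
    return obj * 2.0

for name, fn in [("list", scale_list),
                 ("float64 array", scale_array),
                 ("object array", scale_object_array)]:
    seconds = timeit.timeit(fn, number=50)
    print(f"{name:>14}: {seconds:.4f} s for 50 runs")
```

All three produce the same values; the point of the experiment is only to compare how long each takes on your own machine.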
@hpaulj provided a good answer to your question. In general, from reading your question it occurred to me that you do not actually understand what "vectorization" does under the hood. This writeup is a pretty decent explanation of vectorization and how it enables faster computations - http://quantess.net/2013/09/30/vectorization-magic-for-your-computations/
With regard to point 1, distributing computations across multiple cores: this is not always the case with Numpy. However, there are libraries like numexpr that enable multithreaded, highly efficient Numpy array computations with support for several basic logical and arithmetic operators. Numexpr can be used to turbocharge critical computations when used in conjunction with Numpy, as it avoids replicating large arrays in memory for vectorization routines (as is the case for Numpy) and can use all cores on your system to perform computations.
I am using quite a lot of fortran libraries to do some mathematical computation. So all the arrays in numpy need to be Fortran-contiguous.
Currently I accomplish this with numpy.asfortranarray().
My questions are:
Is this a fast way of telling numpy that the array should be stored in fortran style or is there a faster one?
Is there the possibility to set some numpy flag, so that every array that is created is in fortran style?
Use the optional argument order='F' (default 'C') when creating numpy arrays. This is the way I do it; it probably does the same thing that you are doing. About question 2, I am not aware of a way to set the default order, but it's easy enough to include the order argument when creating arrays.
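A short sketch of both approaches, creating arrays with order='F' up front versus converting afterwards with numpy.asfortranarray(). The array names are only for illustration.

```python
import numpy as np

# Allocate in Fortran (column-major) order from the start.
a = np.zeros((3, 4), order='F')

# A C-ordered array has to be copied to become Fortran-contiguous.
b = np.arange(12.0).reshape(3, 4)
f = np.asfortranarray(b)

# Converting an array that is already Fortran-contiguous needs no copy.
g = np.asfortranarray(a)

# The transpose of a C-ordered array is a Fortran-contiguous view.
t = b.T
```

The flags attribute (for example a.flags.f_contiguous) shows which layout an array currently has.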
Regarding question 2: you may be concerned about retaining Fortran ordering after performing array transformations and operations. I had a similar issue with endianness. I loaded a big-endian raw array from file, but when I applied a log transformation, the resultant array would be little-endian. I got around the problem by first allocating a second big-endian array, then performing an in-place log:
b=np.zeros(a.shape,dtype=a.dtype)
np.log10(1+100*a,b)
In your case you would allocate b with Fortran ordering.
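Adapted to the Fortran-ordering case, the same trick might look like the sketch below. It assumes the goal is only to control the layout of the result; the output array is preallocated with order='F' and passed through the ufunc's out argument.

```python
import numpy as np

a = np.asfortranarray(np.arange(6.0).reshape(2, 3))

# Preallocate the result with the layout the Fortran library expects.
b = np.zeros(a.shape, dtype=a.dtype, order='F')

# Write the result into b instead of letting numpy allocate a new array.
np.log10(1 + 100 * a, out=b)
```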
I'm porting a C++ scientific application to Python, and as I'm new to Python, some questions come to mind:
1) I'm defining a class that will contain the coordinates (x,y). These values will be accessed several times, but they will only be read after the class is instantiated. Is it better to use a tuple or a numpy array, in terms of both memory and access time?
2) In some cases, these coordinates will be used to build a complex number, which is passed to a complex function, and the real part of the result will be used. Assuming that there is no way to separate the real and imaginary parts of this function, and that the real part has to be used in the end, is it perhaps better to store (x,y) directly as complex numbers? How bad is the overhead of the transformation from complex to real in Python? The C++ code does a lot of these transformations, and they are a big slowdown in that code.
3) Some coordinate transformations will also have to be performed: the x and y values will be accessed separately, the transformation done, and the result returned. The coordinate transformations are defined in the complex plane, so is it still faster to use the components x and y directly than to rely on the complex variables?
Thank you
In terms of memory consumption, numpy arrays are more compact than Python tuples.
A numpy array uses a single contiguous block of memory. All elements of the numpy array must be of a declared type (e.g. 32-bit or 64-bit float.) A Python tuple does not necessarily use a contiguous block of memory, and the elements of the tuple can be arbitrary Python objects, which generally consume more memory than numpy numeric types.
So this issue is a hands-down win for numpy (assuming the elements of the array can be stored as a numpy numeric type).
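A rough way to see the difference, assuming CPython and many values stored together. sys.getsizeof on a tuple counts only the tuple's own slots, so the float objects are added separately.

```python
import sys
import numpy as np

n = 1000
t = tuple(float(i) for i in range(n))
a = np.array(t, dtype=np.float64)

# Tuple: the container plus one Python float object per element.
tuple_bytes = sys.getsizeof(t) + sum(sys.getsizeof(v) for v in t)

# Array: 8 bytes per float64 in one contiguous buffer.
array_bytes = a.nbytes

print(tuple_bytes, array_bytes)
```

Each numpy array also carries a fixed header, so for a single (x, y) pair that overhead can dominate; the saving shows up when many coordinates live in one array.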
On the issue of speed, I think the choice boils down to the question, "Can you vectorize your code?"
That is, can you express your calculations as operations done on entire arrays element-wise.
If the code can be vectorized, then numpy will most likely be faster than Python tuples. (The only case I could imagine where it might not be is if you had many very small tuples. In this case the overhead of forming the numpy arrays and the one-time cost of importing numpy might drown out the benefit of vectorization.)
An example of code that could not be vectorized would be if your calculation involved looking at, say, the first complex number in an array z, doing a calculation which produces an integer index idx, then retrieving z[idx], doing a calculation on that number, which produces the next index idx2, then retrieving z[idx2], etc. This type of calculation might not be vectorizable. In this case, you might as well use Python tuples, since you won't be able to leverage numpy's strength.
I wouldn't worry about the speed of accessing the real/imaginary parts of a complex number. My guess is the issue of vectorization will most likely determine which method is faster. (Though, by the way, numpy can transform an array of complex numbers to their real parts simply by striding over the complex array, skipping every other float, and viewing the result as floats. Moreover, the syntax is dead simple: If z is a complex numpy array, then z.real is the real parts as a float numpy array. This should be far faster than the pure Python approach of using a list comprehension of attribute lookups: [z.real for z in zlist].)
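The striding behaviour of z.real can be checked directly. This is a small sketch with made-up values; it shows that the real part is a view into the complex array, not a copy.

```python
import numpy as np

z = np.array([1 + 2j, 3 + 4j, 5 + 6j])   # complex128, 16 bytes per element

re = z.real                               # float64 view, no copy

# The view steps 16 bytes at a time, skipping each imaginary part.
view_strides = re.strides

# Pure Python equivalent for comparison.
zlist = list(z)
re_list = [c.real for c in zlist]

# Writing through the view changes the original complex array.
re[0] = 10.0
```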
Just out of curiosity, what is your reason for porting the C++ code to Python?
A numpy array with an extra dimension is tighter in memory use, and at least as fast, as a numpy array of tuples; complex numbers are at least as good or even better, including for your third question. BTW, you may have noticed that, while questions asked later than yours were getting answers aplenty, yours was lying fallow: part of the reason is no doubt that asking three questions within a question turns responders off. Why not just ask one question per question? It's not as if you get charged for questions or anything, you know...!-)
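To make the two storage choices concrete, here is a sketch that keeps the coordinates in an (N, 2) float array and reinterprets them as complex numbers for a transformation in the complex plane. The 90-degree rotation is just an example transformation chosen for illustration.

```python
import numpy as np

# N points stored with an extra dimension: columns are x and y.
xy = np.array([[1.0, 0.0],
               [0.0, 2.0]])

# Reinterpret each (x, y) pair as one complex number, without copying.
z = xy.view(np.complex128)[:, 0]

# Rotate every point by 90 degrees: multiply by i.
w = z * 1j

# Split the result back into separate coordinate arrays.
x_new, y_new = w.real, w.imag
```

The view requires the (N, 2) array to be C-contiguous float64, which is the case for an array created as above.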