I have been working on improving the performance of existing Python applications lately. What I found is that arranging data in arrays of basic data types (as a struct of arrays, SoA) instead of an array of structs/classes can increase performance. Storing data in contiguous memory also makes it easier to offload heavy calculations to the GPU. My goal is to provide our users with a way to make use of SoA without having to know about numpy, numba, SIMD, etc.
Intel provides a template library with containers that can generate SIMD-friendly data layouts from a struct/class.
https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/libraries/introduction-to-the-simd-data-layout-templates.html
This already gives an idea of how something similar could be done in Python.
So, instead of reading objects that may be scattered all over memory,
class A:
    def __init__(self, x):
        self.x = x

a_s = [A(1) for _ in range(10)]
for a in a_s:
    a.x = 2
I would like to have the data accessible both as a numpy array AND as an instance of class A, so that the data can be accessed something like this:
sdlt_container = generate_SDLT(A(1), 10)
a = sdlt_container[2]      # returns an instance of class A
a.x = 2                    # writes through a view into the numpy array, i.e. sets x[2] = 2
sdlt_container.x[0:5] = 3  # change x in several "instances of A" at once
Accessing the data as an instance of class A might involve creating a new instance of A, but the attributes of that object should "point to" the correct index in the numpy array. I understand that optimizations like those the Intel compiler performs in a for loop are not possible in Python (interpreted vs. compiled).
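To make the idea concrete, here is a rough sketch of the direction I have in mind: one contiguous numpy array per attribute, plus a small proxy object that forwards attribute access to an index. The SDLTContainer and _Proxy names are only illustrative, and only scalar attributes discovered from a template instance are handled:

import numpy as np

class _Proxy:
    """Illustrative view object: attribute access forwards to one index of the container."""
    def __init__(self, container, index):
        object.__setattr__(self, '_container', container)
        object.__setattr__(self, '_index', index)

    def __getattr__(self, name):
        return getattr(self._container, name)[self._index]

    def __setattr__(self, name, value):
        getattr(self._container, name)[self._index] = value

class SDLTContainer:
    def __init__(self, template, n):
        # one contiguous numpy array per scalar attribute of the template instance
        for name, value in vars(template).items():
            setattr(self, name, np.full(n, value, dtype=type(value)))

    def __getitem__(self, index):
        return _Proxy(self, index)

def generate_SDLT(template, n):
    return SDLTContainer(template, n)

sdlt_container = generate_SDLT(A(1), 10)  # class A as defined above
a = sdlt_container[2]
a.x = 2                    # writes x[2] = 2 in the underlying array
sdlt_container.x[0:5] = 3  # vectorized update across "instances"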
Thanks for any ideas!
I am looking to parallelise a function which takes multiple 1-dimensional ranges (of the form np.linspace(x, y, t)) of numerical input values (the number is variable, but let's say it takes five), creates a mesh out of these ranges, and then evaluates some (5-dimensional) cost function over this mesh. In its current form it looks something like this:
import itertools
import numpy as np

def func_5d(a, b, c, d, e):
    return a + b + c + d + e

def range_search(a_range, b_range, c_range, d_range, e_range):
    mesh = itertools.product(a_range, b_range, c_range, d_range, e_range)
    func_eval = map(lambda x: (func_5d(*x), x), mesh)  # unpack each 5-tuple into the five arguments
    return func_eval
So, here I would be looking to parallelise the function range_search using dask. Ideally, this would be done by creating a dask mesh, which could then be chunked, and then mapped through to our cost function using either multi-threading or multi-core processing.

Looking through the dask documentation, it does not appear that dask.array contains any suitable mechanism to achieve this. There is a dask.array.meshgrid function, extended from the numpy library, but this does not support chunking. Additionally, dask.array does not seem to contain a parallelised map function. There is one in dask.bag, but the documentation seems to suggest that dask.bag is used only as a module to carry out preliminary processing of raw data (in formats such as CSV, JSON, etc.). Dask.bag objects do also have a method called product() which seems to imitate itertools.product; however, it only takes one other dask.bag object as an argument, so meshing 5 arrays requires this method call to be stacked (4 times), which aside from being hideously ugly is also inefficient when the number of inputs is variable.
From here, I don't really know where to go. I have worked through the Jupyter notebooks that the dask developers have put together, but they do not seem to hold an answer to my question. Any suggestions on the best approach to parallelising functions of the above form would be much appreciated.
I would use NumPy-style broadcasting for this (each None inserts a new axis):
a[:, None, None] + b[None, :, None] + c[None, None, :]
You will want to make sure that your input vectors are chunked finely enough that the products of them will still fit comfortably in memory.
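As a rough sketch of what that looks like with dask.array (the array lengths and chunk sizes below are placeholders to adjust for your problem):

import numpy as np
import dask.array as da

# chunked 1-D input ranges
a = da.from_array(np.linspace(0.0, 1.0, 100), chunks=25)
b = da.from_array(np.linspace(0.0, 1.0, 100), chunks=25)
c = da.from_array(np.linspace(0.0, 1.0, 100), chunks=25)

# broadcasting builds the (100, 100, 100) mesh lazily, chunk by chunk
cost = a[:, None, None] + b[None, :, None] + c[None, None, :]

# reductions then run in parallel over the chunks
best = cost.min().compute()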
I have a class with a method that is called many times during execution. This method uses a numpy array as a temporary buffer, and I don't need to keep the values in the buffer between calls. Should I make the array a member of the instance to avoid the overhead of memory allocation on every call? I know that local variables are generally preferred, but is Python smart enough to allocate the memory for the array only once?
import numpy

class MyClass:
    def __init__(self, n):
        self.temp = numpy.zeros(n)

    def method(self):
        ...  # do some stuff using self.temp
Or
class MyClass:
    def __init__(self, n):
        self.n = n

    def method(self):
        temp = numpy.zeros(self.n)
        ...  # do some stuff using temp
Update: replaced empty with zeros
Numpy arrays are fast, once created. However, creating an array is pretty expensive - much more so than, say, creating a Python list.
In a case such as yours, where you create a new array again and again (in a for loop?), I would ALWAYS pre-allocate the array structure and then reuse it.
I can't comment on whether Python is smart enough to optimize this, but I would guess it's not :)
How big is your array and how frequent are calls to this method?
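A quick way to measure this for your own sizes is a micro-benchmark along these lines (the array size and repeat count are placeholders):

import timeit

setup = "import numpy; buf = numpy.zeros(10000)"

# allocate a fresh buffer on every call
alloc = timeit.timeit("numpy.zeros(10000)", setup=setup, number=10000)

# reuse a preallocated buffer, only resetting its contents
reuse = timeit.timeit("buf.fill(0)", setup=setup, number=10000)

print(alloc, reuse)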
Yes, you need to preallocate large arrays. But whether this is efficient depends on how you then use these arrays.
This will cause several new allocations for intermediate results of computation:
self.temp = a * b + c
This will not (if self.x is preallocated):
numpy.multiply(a, b, out=self.x)
numpy.add(c, self.x, out=self.temp)
But for such cases (when you work with large arrays in non-trivial formulae) it is better to use numexpr, or einsum for matrix calculations.
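For example, a sketch with numexpr, which compiles the whole expression, evaluates it in one pass without intermediate temporaries, and can write into a preallocated output buffer (array sizes are placeholders):

import numexpr as ne
import numpy as np

a = np.random.rand(1000000)
b = np.random.rand(1000000)
c = np.random.rand(1000000)
out = np.empty_like(a)

# one pass over the data, result written into the preallocated buffer
ne.evaluate("a * b + c", out=out)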
I'm currently writing code that can be heavily parallelized using GPUs. My code structure essentially looks like this:
Create two arrays, let's call them A and B of length N. (CPU)
Perform NxN calculations that eventually return a scalar. These calculations only depend on A and B and can therefore be parallelized. (GPU)
Gather all these scalars in a list and take the smallest one. (CPU)
Modify A and B with this scalar (CPU)
Go back to step 2 and repeat until a certain condition is met.
Most examples are very illustrative but they all seem to work like this: Execute the major part of the code on the CPU and only perform intermediate matrix multiplications etc. on the GPU. In particular the host usually knows all the variables the kernel is going to use.
For me it's exactly the other way around: I want to perform the major part of the code on the GPU and only a very small number of steps on the CPU itself. My host knows literally nothing about what's going on inside my individual threads; it only manages the list of scalars as well as my arrays A and B.
My questions are therefore:
How do I properly define variables inside a kernel? In particular, how do I define and initialize arrays/lists?
How do I write a device function that returns an array? (See MatrixMultiVector below, which doesn't work.)
Why can I not use numpy and other libraries inside CUDA Kernels? What alternatives do I have?
An example of what I currently have looks like this:
from __future__ import division
import numpy as np
from numbapro import *

# Device Functions
#----------------------------------------------------------------------
# Works and can be called correctly from TestKernelScalar
@cuda.jit('float32(float32, float32)', device=True)
def myfuncScalar(a, b):
    return a + b

# Works and can be called correctly from TestKernelArray
@cuda.jit('float32[:](float32[:])', device=True)
def myfuncArray(A):
    for k in xrange(4):
        A[k] += 2 * k
    return A

# Takes matrix A and vector v, multiplies them and returns a vector of shape v.
# Does not even compile:
#   Failed at nopython (nopython frontend), Only accept returning of array passed into the function as argument
# But v is passed to the function as an argument...
@cuda.jit('float32[:](float32[:,:], float32[:])', device=True)
def MatrixMultiVector(A, v):
    tmp = cuda.local.array(shape=4, dtype=float32)  # is that thing even empty? It could technically be anything, right?
    for i in xrange(A[0].size):
        for j in xrange(A[1].size):
            tmp[i] += A[i, j] * v[j]
    v = tmp
    return v

# Kernels
#----------------------------------------------------------------------
# TestKernelScalar - works
@cuda.jit(void(float32[:,:]))
def TestKernelScalar(InputArray):
    i = cuda.grid(1)
    for j in xrange(InputArray[1].size):
        InputArray[i, j] = myfuncScalar(5, 7)

# TestKernelArray
@cuda.jit(void(float32[:,:]))
def TestKernelArray(InputArray):
    # Defining arrays this way seems super tedious, there has to be a better way.
    M = cuda.local.array(shape=4, dtype=float32)
    M[0] = 1; M[1] = 0; M[2] = 0; M[3] = 0
    tmp = myfuncArray(M)
    # tmp = MatrixMultiVector(A, M)  -> we still have to define a 4x4 matrix for that.
    i = cuda.grid(1)
    for j in xrange(InputArray[1].size):
        InputArray[i, j] += tmp[j]

#----------------------------------------------------------------------
# Main
#----------------------------------------------------------------------
N = 4
C = np.zeros((N, N), dtype=np.float32)
TestKernelArray[1, N](C)
print(C)
The short answer is you can't define dynamic lists or arrays in CUDA Python. You can have statically defined local or shared memory arrays (see cuda.local.array() and cuda.shared.array() in the documentation), but those have thread or block scope and can't be reused after their associated thread or block is retired. That is about all that is supported. You can pass externally defined arrays to kernels, but their attributes are read-only.
As your myfuncArray shows, you can return an externally defined array. You can't return a dynamically defined array, because dynamically defined arrays (or any objects, for that matter) are not supported in kernels.
You can read the CUDA Python specification for yourself, but the really short answer is that CUDA Python is a superset of Numba's nopython mode, and while there are elementary scalar functions available, there is no Python object model support. That excludes much Python functionality, including objects and numpy.
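To illustrate the pattern that does work (writing results into caller-provided arrays and using statically sized local scratch space), here is a sketch using the current numba.cuda API; numbapro has since been deprecated. The matrix size and launch configuration are placeholders, and it needs a CUDA-capable GPU to run:

import numpy as np
from numba import cuda, float32

@cuda.jit(device=True)
def matvec4(A, v, out):
    # write the result into a caller-provided array instead of returning a new one
    for i in range(4):
        acc = float32(0.0)
        for j in range(4):
            acc += A[i, j] * v[j]
        out[i] = acc

@cuda.jit
def kernel(A, v, result):
    i = cuda.grid(1)
    if i < result.shape[0]:
        # statically sized scratch space, private to this thread
        tmp = cuda.local.array(shape=4, dtype=float32)
        matvec4(A, v, tmp)
        acc = float32(0.0)
        for k in range(4):
            acc += tmp[k]
        result[i] = acc

A = np.arange(16, dtype=np.float32).reshape(4, 4)
v = np.ones(4, dtype=np.float32)
result = np.zeros(8, dtype=np.float32)
kernel[1, 8](A, v, result)
print(result)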
I'm quite new at Python.
After using MATLAB for many, many years, I recently started studying numpy/scipy.
The most basic element of numpy seems to be ndarray.
In ndarray, there are following attributes:
ndarray.ndim
ndarray.shape
ndarray.size
...etc
I'm quite familiar with C++/JAVA classes, but I'm a novice at Python OOP.
Q1: My first question is what is the identity of the above attributes?
At first, I assumed that the above attributes might be public member variables. But soon I found that a.ndim = 10 doesn't work (assuming a is an ndarray object), so it seems they are not public member variables.
Next, I guessed that they might be public methods, similar to getter methods in C++. However, when I tried a.ndim() with parentheses, it doesn't work either, so it seems they are not public methods.
The other possibility might be that they are private member variables, but print a.ndim works, so they cannot be private data members.
So, I cannot figure out what is the true identity of the above attributes.
Q2: Where can I find the Python code implementation of ndarray? Since I installed numpy/scipy on my local PC, I guess there might be some way to look at the source code; then I think everything might become clear.
Could you give some advice on this?
numpy is implemented as a mix of C code and Python code. The source is available for browsing on GitHub, and can be downloaded as a git repository. But digging your way into the C source takes some work. A lot of the files are marked as .c.src, which means they pass through one or more layers of preprocessing before compiling.
And Python is written in a mix of C and Python as well. So don't try to force things into C++ terms.
It's probably better to draw on your MATLAB experience, with adjustments to allow for Python. And numpy has a number of quirks that go beyond Python. It is using Python syntax, but because it has its own C code, it isn't simply a Python class.
I use IPython as my usual working environment. With that I can use foo? to see the documentation for foo (same as the Python help(foo)), and foo?? to see the code, if it is written in Python (like the MATLAB/Octave type(foo)).
Python objects have attributes and methods. They also have properties, which look like attributes but actually use methods to get/set values. Usually you don't need to be aware of the difference between attributes and properties.
x.ndim # as noted, has a get, but no set; see also np.ndim(x)
x.shape # has a get, but can also be set; see also np.shape(x)
x.<tab> in IPython shows me all the completions for an ndarray. There are 4*18 of them. Some are methods, some attributes. x._<tab> shows a bunch more that start with __. These are 'private' - not meant for public consumption, but that's just semantics. You can look at them and use them if needed.
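A plain-Python equivalent of that tab-completion listing (my own example):

import numpy as np

x = np.zeros(50)
public = [name for name in dir(x) if not name.startswith('_')]
print(len(public))  # the completions mentioned above (count varies by numpy version)
print(public[:8])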
Offhand, x.shape is the only ndarray property that I set, and even with that I usually use reshape(...) instead. Read their docs to see the difference. ndim is the number of dimensions, and it doesn't make sense to change that directly; it is len(x.shape), so change the shape to change ndim. Likewise x.size shouldn't be something you change directly.
Some of these properties are accessible via functions, e.g. np.shape(x) == x.shape, similar to MATLAB's size(x). (MATLAB doesn't have the . attribute syntax.)
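A quick sketch (my own example) of the difference between setting shape and calling reshape:

import numpy as np

x = np.arange(12)
x.shape = (3, 4)         # sets the shape of x in place
y = x.reshape(4, 3)      # returns a reshaped view; x itself is unchanged
print(x.shape, y.shape)  # (3, 4) (4, 3)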
x.__array_interface__ is a handy property that gives a dictionary with a number of the attributes:
In [391]: x.__array_interface__
Out[391]:
{'descr': [('', '<f8')],
'version': 3,
'shape': (50,),
'typestr': '<f8',
'strides': None,
'data': (165646680, False)}
The docs for ndarray(shape, dtype=float, buffer=None, offset=0, strides=None, order=None), i.e. the __new__ method, list these attributes:
Attributes
----------
T : ndarray
Transpose of the array.
data : buffer
The array's elements, in memory.
dtype : dtype object
Describes the format of the elements in the array.
flags : dict
Dictionary containing information related to memory use, e.g.,
'C_CONTIGUOUS', 'OWNDATA', 'WRITEABLE', etc.
flat : numpy.flatiter object
Flattened version of the array as an iterator. The iterator
allows assignments, e.g., ``x.flat = 3`` (See `ndarray.flat` for
assignment examples; TODO).
imag : ndarray
Imaginary part of the array.
real : ndarray
Real part of the array.
size : int
Number of elements in the array.
itemsize : int
The memory use of each array element in bytes.
nbytes : int
The total number of bytes required to store the array data,
i.e., ``itemsize * size``.
ndim : int
The array's number of dimensions.
shape : tuple of ints
Shape of the array.
strides : tuple of ints
The step-size required to move from one element to the next in
memory. For example, a contiguous ``(3, 4)`` array of type
``int16`` in C-order has strides ``(8, 2)``. This implies that
to move from element to element in memory requires jumps of 2 bytes.
To move from row-to-row, one needs to jump 8 bytes at a time
(``2 * 4``).
ctypes : ctypes object
Class containing properties of the array needed for interaction
with ctypes.
base : ndarray
If the array is a view into another array, that array is its `base`
(unless that array is also a view). The `base` array is where the
array data is actually stored.
All of these should be treated as properties, though I don't think numpy actually uses the property mechanism. In general they should be considered read-only. Besides shape, the only ones I recall changing are data (the pointer to the data buffer) and strides.
Regarding your first question, Python has syntactic sugar for properties, including fine-grained control of getting, setting, deleting them, as well as restricting any of the above.
So, for example, if you have
class Foo(object):
    @property
    def shmip(self):
        return 3
then you can write Foo().shmip to obtain 3, but, if that is the class definition, you've disabled setting Foo().shmip = 4.
In other words, those are read-only properties.
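For contrast, a sketch (my own example) of the fine-grained control mentioned above: adding a setter makes the property writable, optionally with validation:

class Foo(object):
    def __init__(self):
        self._shmip = 3

    @property
    def shmip(self):
        return self._shmip

    @shmip.setter
    def shmip(self, value):
        if value < 0:
            raise ValueError("shmip must be non-negative")
        self._shmip = value

f = Foo()
f.shmip = 4     # now allowed, and validated by the setter
print(f.shmip)  # 4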
Question 1
The list you mention contains attributes of a NumPy array.
For example:
import numpy as np

a = np.array([1, 2, 3])
print(type(a))
# <class 'numpy.ndarray'>
Since a is a numpy.ndarray, you're able to use those attributes to find out more about it (e.g. a.size will return 3). To get information about what each one does, visit SciPy's documentation about the attributes.
Question 2
You can start here to get yourself familiar with some of the basic tools of Numpy, as well as the Reference Manual, assuming you're using v1.9. For information specific to the Numpy array you can go to Array Objects.
Their documentation is very extensive and very helpful, with examples provided throughout.
This question already has answers here:
Dictionary vs Object - which is more efficient and why?
(8 answers)
Closed 9 years ago.
Refer to the following code as an example:
import numpy as np

N = 200
some_prop = np.random.randint(0, 100, [N, N, N])

# option 1
class ObjectThing:
    def __init__(self, some_prop):
        self.some_prop = some_prop

object_thing = ObjectThing(some_prop)

# option 2
pseudo_thing = {'some_prop': some_prop}
I like the structure that option 1 provides; it makes the behaviour of an application more rigid and whatnot. However, I'm wondering if there are other, more tangible benefits that I'm not aware of.
The obvious advantage of using objects is that you can extend their functionality beyond simply storing data. You could, for instance, have two attributes and define an __eq__ method that uses both attributes in some way other than simply comparing both of them and returning False unless both match, as sketched below.
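For instance, a sketch (the Interval class is my own example) where equality means "overlapping" rather than member-by-member comparison:

class Interval:
    def __init__(self, lo, hi):
        self.lo = lo
        self.hi = hi

    def __eq__(self, other):
        # custom semantics: two intervals are "equal" if they overlap,
        # not only when both attributes match exactly
        return self.lo <= other.hi and other.lo <= self.hi

print(Interval(0, 5) == Interval(3, 8))  # True, they overlap
print(Interval(0, 2) == Interval(3, 8))  # False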
Also, once you've got a class defined, you can easily define new instances of that class that will share the structure of the original, but with a dictionary, you'd either have to redefine that structure or make a copy of some sort of the original and then change each element to match the values you want the new pseudo-object to have.
The primary advantages of dictionaries are that they come with a variety of pre-defined methods (such as .items()), can easily be iterated over using in, can be conveniently created using a dict comprehension, and allow for easy access of data "members" using a string variable (though really, the getattr function achieves the same thing with objects, as the sketch below shows).
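A small comparison sketch, reusing pseudo_thing and object_thing from the question's code:

key = 'some_prop'

# dictionary: data access through a string key
value_from_dict = pseudo_thing[key]

# object: the same dynamic access via getattr
value_from_obj = getattr(object_thing, key)

# iteration comes for free with dicts
for name, value in pseudo_thing.items():
    print(name, value.shape)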
If you're using an implementation of Python that includes a JIT compiler (e.g. PyPy), using actual objects can improve the compiler's ability to optimize your code (because it's easier for the compiler to reason about how members of an object are utilized, unlike a plain dictionary).
Using objects also allows for subclassing, which can save some redundant implementation.