Arrays in CUDA Kernels using Python with NumbaPro

I'm currently writing code that can be heavily parallelized using GPUs. My code structure essentially looks like this:
1. Create two arrays, let's call them A and B, of length N. (CPU)
2. Perform NxN calculations that eventually return a scalar. These calculations only depend on A and B and can therefore be parallelized. (GPU)
3. Gather all these scalars in a list and take the smallest one. (CPU)
4. Modify A and B with this scalar. (CPU)
5. Go back to step 2 and repeat until a certain condition is met.
Most examples are very illustrative, but they all seem to work like this: execute the major part of the code on the CPU and only perform intermediate steps such as matrix multiplications on the GPU. In particular, the host usually knows all the variables the kernel is going to use.
For me it's exactly the other way around: I want to perform the major part of the code on the GPU and only a very small number of steps on the CPU itself. My host knows literally nothing about what's going on inside my individual threads; it only manages the list of scalars as well as my arrays A and B.
My questions are therefore:
How do I properly define variables inside a kernel? In particular, how do I define and initialize arrays/lists?
How do I write a device function that returns an array? (s. below MatrixMultiVector doesn't work)
Why can I not use numpy and other libraries inside CUDA Kernels? What alternatives do I have?
An example of what I currently have looks like this:
from __future__ import division
import numpy as np
from numbapro import *

# Device Functions
#----------------------------------------------------------------------

# Works and can be called correctly from TestKernelScalar
@cuda.jit('float32(float32, float32)', device=True)
def myfuncScalar(a, b):
    return a + b

# Works and can be called correctly from TestKernelArray
@cuda.jit('float32[:](float32[:])', device=True)
def myfuncArray(A):
    for k in xrange(4):
        A[k] += 2*k
    return A

# Takes matrix A and vector v, multiplies them and returns a vector of
# shape v. Does not even compile:
# "Failed at nopython (nopython frontend), Only accept returning of
# array passed into the function as argument"
# But v is passed to the function as an argument...
@cuda.jit('float32[:](float32[:,:], float32[:])', device=True)
def MatrixMultiVector(A, v):
    tmp = cuda.local.array(shape=4, dtype=float32)  # is that thing even empty? It could technically be anything, right?
    for i in xrange(A[0].size):
        for j in xrange(A[1].size):
            tmp[i] += A[i][j]*v[j]
    v = tmp
    return v

# Kernels
#----------------------------------------------------------------------

# TestKernelScalar - works
@cuda.jit(void(float32[:,:]))
def TestKernelScalar(InputArray):
    i = cuda.grid(1)
    for j in xrange(InputArray[1].size):
        InputArray[i, j] = myfuncScalar(5, 7)

# TestKernelArray
@cuda.jit(void(float32[:,:]))
def TestKernelArray(InputArray):
    # Defining arrays this way seems super tedious, there has to be a better way.
    M = cuda.local.array(shape=4, dtype=float32)
    M[0] = 1; M[1] = 0; M[2] = 0; M[3] = 0
    tmp = myfuncArray(M)
    # tmp = MatrixMultiVector(A, M)  -> we still have to define a 4x4 matrix for that.
    i = cuda.grid(1)
    for j in xrange(InputArray[1].size):
        InputArray[i, j] += tmp[j]

#----------------------------------------------------------------------
# Main
#----------------------------------------------------------------------
N = 4
C = np.zeros((N, N), dtype=np.float32)
TestKernelArray[1, N](C)
print(C)

The short answer is that you can't define dynamic lists or arrays in CUDA Python. You can have statically defined local or shared memory arrays (see cuda.local.array() and cuda.shared.array() in the documentation), but those have thread or block scope and can't be reused after their associated thread or block is retired. That is about all that is supported. You can pass externally defined arrays to kernels, but their attributes are read-only.
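For reference, a minimal sketch of a statically sized shared memory array, written against the current numba.cuda API (NumbaPro has since been deprecated); the shape must be a compile-time constant:

import numpy as np
from numba import cuda, float32

@cuda.jit
def shared_demo(out):
    # Block-scoped shared array; the shape must be a compile-time constant.
    buf = cuda.shared.array(shape=32, dtype=float32)
    tid = cuda.threadIdx.x
    buf[tid] = tid * 2.0
    cuda.syncthreads()          # wait until the whole block has filled buf
    i = cuda.grid(1)
    if i < out.shape[0]:
        out[i] = buf[tid]

out = np.zeros(32, dtype=np.float32)
shared_demo[1, 32](out)         # one block of 32 threads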
As per your myfuncArray, you can return an externally defined array from a device function. You can't return a dynamically defined array, because dynamically defined arrays (or any objects, for that matter) are not supported in kernels.
You can read the CUDA Python specification for yourself, but the really short answer is that CUDA Python is a superset of Numba's nopython mode, and while there are elementary scalar functions available, there is no Python object model support. That excludes much Python functionality, including objects and numpy.
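Given those constraints, the usual pattern for the workflow in the question is to preallocate the NxN output array on the host, have each thread write its scalar into it, and do the reduction (taking the minimum) back on the CPU. A minimal sketch, again using the modern numba.cuda API; the A[i] * B[j] body is only a placeholder for the real per-pair calculation:

import numpy as np
from numba import cuda

@cuda.jit
def pairwise_kernel(A, B, out):
    # One thread per (i, j) pair; each thread writes one scalar into out.
    i, j = cuda.grid(2)
    if i < out.shape[0] and j < out.shape[1]:
        out[i, j] = A[i] * B[j]   # placeholder for the real NxN calculation

N = 4
A = np.arange(N, dtype=np.float32)
B = np.arange(N, dtype=np.float32)
out = np.empty((N, N), dtype=np.float32)
pairwise_kernel[(1, 1), (N, N)](A, B, out)
smallest = float(out.min())       # the reduction (step 3) happens on the host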

Related

Is there a Python package that generates SOA data structure from AOS?

Lately I have been working on increasing the performance of existing Python applications. What I found is that arranging data in arrays of basic data types (as a struct of arrays) instead of an array of structs/classes can increase performance. Keeping data in contiguous memory also makes offloading heavy calculations to the GPU easier. My goal is to provide our users with a way to make use of SOA without having to know about numpy, numba, simd, etc.
Intel provides a template library and containers that can generate simd friendly data layouts from a struct/class.
https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/libraries/introduction-to-the-simd-data-layout-templates.html
This already gives an idea how something similar could be done in Python.
So, instead of reading objects that can be scattered anywhere in memory,
class A:
    def __init__(self, x):
        self.x = x

a_s = [A(1) for _ in range(10)]
for a in a_s:
    a.x = 2
I would like to have the data accessible both as a numpy array AND as instances of class A, so that data can be accessed something like this:
sdlt_container = generate_SDLT(A(), 10)
a = sdlt_container[2]        # returns an instance of class A
a.x = 2                      # a view into the numpy array: sets x[2] = 2
sdlt_container.x[0:5] = 3    # change x in several "instances of A"
Accessing data as an instance of class A might involve creating a new instance of A but the variables in this object should "point to" the correct index in the numpy array. I understand that optimizations like the Intel compiler does in a for loop are not possible in Python (interpreted vs compiled).
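To make the idea concrete, here is a rough sketch of what I imagine, built on a numpy structured array (generate_SDLT, SDLTContainer, and _ElementView are hypothetical names, and I've simplified the signature to take field names instead of an instance of A):

import numpy as np

def generate_SDLT(fields, n):
    # Hypothetical helper: builds an SOA container from a list of field names.
    return SDLTContainer(fields, n)

class SDLTContainer:
    def __init__(self, fields, n):
        # One contiguous numpy column per field (struct of arrays).
        self._data = np.zeros(n, dtype=[(name, np.float64) for name in fields])

    def __getattr__(self, name):
        # sdlt_container.x -> the whole "x" column as a numpy view
        return self._data[name]

    def __getitem__(self, i):
        return _ElementView(self._data, i)

class _ElementView:
    # Acts like an instance of the user's class, but reads and writes
    # go straight into row i of the structured array.
    def __init__(self, data, i):
        object.__setattr__(self, '_data', data)
        object.__setattr__(self, '_i', i)

    def __getattr__(self, name):
        return self._data[name][self._i]

    def __setattr__(self, name, value):
        self._data[name][self._i] = value

c = generate_SDLT(['x'], 10)
a = c[2]
a.x = 2.0          # writes into c.x[2]
c.x[0:5] = 3.0     # bulk update through the numpy view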
Thanks for any ideas!

Parallelizing a numba loop

Previously, I asked a question about a relatively simple loop that Numba was failing to parallelize. The solution turned out to be making all the loops explicit.
Now I need to do a simpler version of the same task: I have arrays alpha and beta of shape (m,n) and (b,m,n) respectively, and I want to compute the Frobenius product of alpha with each 2D slice of beta and find the slice of beta that maximizes this product. Previously there was an additional, large first dimension of alpha, so it was over this dimension that I parallelized; now I want to parallelize over the first dimension of beta, as the calculation becomes expensive when b > 1000.
If I naively modify the code that worked for the previous problem, I obtain:
@njit(parallel=True)
def parallel_value_numba(alpha, beta):
    dot = np.zeros(beta.shape[0])
    for i in prange(beta.shape[0]):
        for j in prange(beta.shape[1]):
            for k in prange(beta.shape[2]):
                dot[i] += alpha[j, k]*beta[i, j, k]
    index = np.argmax(dot)
    value = dot[index]
    return value, index
But Numba doesn't like this for some reason and complains:
numba.core.errors.LoweringError: Failed in nopython mode pipeline (step: nopython mode backend)
scalar type memoryview(float64, 2d, C) given for non scalar argument #3
So instead, I tried
@njit(parallel=True)
def parallel_value_numba_2(alpha, beta):
    product = np.multiply(alpha, beta)
    dot1 = np.sum(product, axis=2)
    dot2 = np.sum(dot1, axis=1)
    index = np.argmax(dot2)
    value = dot2[index]
    return value, index
This compiles as long as you broadcast alpha to beta.shape before passing it to the function, and in principle Numba is capable of parallelizing the numpy operations. But it runs painfully slowly, much slower than the serial, pure Python code:
def einsum_value(alpha, beta):
    dot = np.einsum('kl,jkl->j', alpha, beta)
    index = np.argmax(dot)
    value = dot[index]
    return value, index
So, my current working code uses this last implementation, but this function is still bottlenecking the runtime and I'd like to speed it up. Can anyone convince Numba to parallelize this function with an appreciable speedup?
This is not exactly an answer with a solution, but formatting comments is harder.
Numba generates different code depending on the arguments passed to the function. For example, your code works with the following example:
>>> alpha = np.random.random((5, 4))
>>> beta = np.random.random((3, 5, 4))
>>> parallel_value_numba(alpha, beta)
(5.89447648574048, 0)
In order to diagnose the problem, it's necessary to have an example of the specific argument values causing the problem.
Reading the error message, it seems you are passing a memoryview object, but Numba may not have full support for it.
As a side comment, you don't need to use prange in every loop. It's normally enough to use it in the outer loop, as long as the expected number of iterations is larger than the number of cores in your machine.
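Applied to the function above, that suggestion looks something like the following sketch (parallel_value_numba_3 is just an illustrative name): prange only on the outer loop, with a scalar accumulator so each parallel iteration writes dot[i] once:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def parallel_value_numba_3(alpha, beta):
    dot = np.zeros(beta.shape[0])
    for i in prange(beta.shape[0]):        # parallel over the first axis of beta
        acc = 0.0
        for j in range(beta.shape[1]):     # inner loops stay serial
            for k in range(beta.shape[2]):
                acc += alpha[j, k] * beta[i, j, k]
        dot[i] = acc
    index = np.argmax(dot)
    return dot[index], index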

Cython - Declare List from Input

I am trying to convert a list of objects (GeoJSON) to shapely objects using Cython, but I am running into an error.
This piece of code seems to be the issue: cdef object result[N]. How do I declare a list/array from a given list?
Here is my current code:
def convert_geoms(list array):
    cdef int i, N = len(array)
    cdef double x, s = 0.0
    cdef object result[N]  # ERROR HERE
    for i in range(N):
        g = build_geometry_objects2(array[i])
        result[i] = g
    return result
There are two issues with cdef object result[N]:
1. It creates a C array of Python objects. This doesn't really work, because C arrays aren't easily integrated with Python reference counting (and you'd need to copy the whole array to something else when you returned it anyway, since it's a local variable scoped to the function).
2. For C arrays of the form sometype result[N], N must be known at compile time. Here N is different for each function call, so the variable definition is invalid anyway.
There are multiple solutions. Most of them involve just accepting that you're using Python objects, so not worrying about specifying the types and just writing valid Python code. I'd probably write it as a list comprehension; I suspect Cython will do surprisingly well at producing optimized code for that:
return [build_geometry_objects2(array[i]) for i in range(len(array))]
# or
return [build_geometry_objects2(a) for a in array]
The second version is probably better, but if it matters you can time it.
If the performance really matters, you can use Python C API calls, which you can cimport from cpython.list. See "Cythonize list of all splits of a string" for an example of something similar where list creation is optimized this way. The advantage of PyList_New is that it creates an appropriately sized list up front, filled with NULL pointers, which you can then fill in.
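A sketch of that approach, assuming build_geometry_objects2 is defined elsewhere; note that PyList_SET_ITEM steals a reference, so the Py_INCREF is required:

from cpython.list cimport PyList_New, PyList_SET_ITEM
from cpython.ref cimport Py_INCREF

def convert_geoms(list array):
    cdef Py_ssize_t i, N = len(array)
    cdef list result = PyList_New(N)   # list of N NULL slots, never resized
    for i in range(N):
        g = build_geometry_objects2(array[i])
        Py_INCREF(g)                   # PyList_SET_ITEM steals this reference
        PyList_SET_ITEM(result, i, g)
    return result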

Compiling njit nopython version of function fails due to data types

I'm writing a function with njit to speed up a very slow reservoir-operations optimization code. The function returns the maximum value for spill releases based on the reservoir level and gate availability. I am passing in a parameter size that specifies the number of flows to calculate (in some calls it's one and in others it's many). I'm also passing in a numpy.zeros array that I can then fill with the function output. A simplified version of the function is written as follows:
import numpy as np
from numba import njit

@njit(cache=True)
def fncMaxFlow(elev, flag, size, MaxQ):
    if flag == 1:  # SPOG2 running
        if size == 0:
            if elev > 367.28:
                return 861.1
            else:
                return 0
        else:
            for i in range(size):
                if (elev[i] > 367.28) & (elev[i] < 385):
                    MaxQ[i] = 861.1
            return MaxQ
    else:
        if size == 0:
            return 0
        else:
            return MaxQ

fncMaxFlow(np.random.randint(368, 380, 3), 1, 3, np.zeros(3))
The error I'm getting:
Can't unify return type from the following types: array(float64, 1d, C), float64, int32
What is the reason for this? Is there a workaround or some step I'm missing so I can use numba to speed things up? This function and others like it are being called millions of times, so they are a major factor in the computational efficiency. Any advice would help; I'm pretty new to Python.
A variable within a numba function must have a consistent type, and that includes the return value. In your code you can return MaxQ (an array), 861.1 (a float), or 0 (an int).
You need to refactor this code so that it always returns a consistent type regardless of the code path.
Also note that in several places you compare a numpy array to a scalar (elev > 367.28); what you get back is an array of booleans, which is going to cause you issues. Your example function doesn't even run as pure Python (dropping the numba decorator) because of this.
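One possible refactor along those lines (just a sketch, and the bounds checks may need adjusting to your real logic): always fill and return the MaxQ array so that every code path has the same return type, and let callers pass a one-element array for the scalar case:

import numpy as np
from numba import njit

@njit(cache=True)
def fncMaxFlow(elev, flag, size, MaxQ):
    # Every path returns the same type: the float64 array MaxQ.
    if flag == 1:
        for i in range(size):
            if elev[i] > 367.28 and elev[i] < 385.0:
                MaxQ[i] = 861.1
    return MaxQ

# The scalar case becomes a one-element array:
maxq = fncMaxFlow(np.array([370.0]), 1, 1, np.zeros(1))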

How can I improve python code performance using numpy

I have read this blog post, which shows how an algorithm got a 250x speed-up by using numpy. I have tried to improve the following code by using numpy, but I couldn't make it work:
for i in nodes[1:]:
    for lb in range(2, diameter+1):
        not_valid_colors = set()
        valid_colors = set()
        for j in nodes:
            if j == i:
                break
            if distances[i-1, j-1] >= lb:
                not_valid_colors.add(c[j, lb])
            else:
                valid_colors.add(c[j, lb])
        c[i, lb] = choose_color(not_valid_colors, valid_colors)
return c
Explanation
The code above is part of an algorithm used to calculate the self-similar dimension of a graph. It works basically by constructing dual graphs G', where a node is connected to each other node if the distance between them is greater than or equal to a given value (Lb), and then computing the graph coloring on those dual networks.
The algorithm description is the following:
1. Assign a unique id from 1 to N to all network nodes, without assigning any colors yet.
2. For all Lb values, assign a color value 0 to the node with id = 1, i.e. C[1][Lb] = 0.
3. Set the id value i = 2. Repeat the following until i = N:
   a) Calculate the distance l_ij from i to all the nodes in the network with id j less than i.
   b) Set Lb = 1.
   c) Select one of the unused colors C[j][l_ij] from all nodes j < i for which l_ij >= Lb. This is the color C[i][Lb] of node i for the given Lb value.
   d) Increase Lb by one and repeat (c) until Lb = Lb_max.
   e) Increase i by 1.
I wrote it in Python, but it takes more than a minute when I try to use it with small networks of 100 nodes and p = 0.9.
As I'm still new to Python and numpy, I have not found a way to improve its efficiency.
Is it possible to remove the loops by using numpy.where to find where the paths are longer than the given Lb? I tried to implement it but it didn't work...
Vectorized operations with numpy arrays are fast, since the actual calculations are done by underlying libraries such as BLAS and LAPACK without Python overhead. With loop-intensive operations you will not see those benefits.
You usually have to figure out a way to vectorize operations (usually possible with a smart use of array slicing). Some operations are inherently loop-intensive, however, and sometimes it is not easy to vectorize them (which seems to be the case for your code).
In those cases, you can first try Numba, which generates optimized machine code from a Python function without any modifications. (You just annotate the function and it will automatically do it for you). I do not have a lot of experience with it, and have not tried using this for complicated functions.
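For illustration, a minimal sketch of annotating a loop-heavy helper with Numba (count_far_pairs is an illustrative function, not part of the algorithm above):

import numpy as np
from numba import njit

@njit
def count_far_pairs(distances, lb):
    # Plain nested loops; numba compiles them to machine code.
    n = distances.shape[0]
    total = 0
    for i in range(n):
        for j in range(i):
            if distances[i, j] >= lb:
                total += 1
    return total

distances = np.random.random((100, 100))
print(count_far_pairs(distances, 0.5))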
If this does not work, then you can use Cython, which automatically converts Python-like code (with typed variables) into efficient C code and generates a Python extension module that you can import and use in Python. That will usually give you at least an order of magnitude (often two orders of magnitude) speedup for loop-intensive operations. I generally find Cython easy to use since, unlike pure C, you can access your numpy arrays directly in Cython code.
I recommend using the Anaconda Python distribution, since you will be able to install these packages easily. I'm sorry I don't have a specific answer for your code.
If you want to go to numpy, you can just change the lists into arrays; for example, distances[i-1][j-1] becomes distances[i-1, j-1] after you declare distances as a numpy array, and the same goes for c[i][lb]. About valid_colors and not_valid_colors you should think a bit more, because you cannot append to numpy arrays: they have a fixed length, so you would have to fix a maximum size in advance. Another idea is that once you have everything in numpy, you can cythonize your code (http://docs.cython.org/src/tutorial/cython_tutorial.html), which makes all your loops very fast. In any case, even if you don't want Cython, if you look at the blog you will see that distances is declared as an array in main().
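As a concrete illustration of the masking idea, here is a sketch of one (i, lb) iteration with made-up sizes, assuming distances and c are numpy arrays and nodes is sorted by id:

import numpy as np

n = 6
nodes = np.arange(1, n + 1)                  # ids 1..n, sorted
distances = np.random.randint(1, 5, (n, n))
c = np.zeros((n + 1, 4), dtype=np.int64)     # colors indexed as c[j, lb]

i, lb = 4, 2                                 # one iteration of the outer loops
js = nodes[nodes < i]                        # nodes with id less than i
far = distances[i - 1, js - 1] >= lb         # boolean mask, replaces the j loop
not_valid_colors = set(c[js[far], lb])
valid_colors = set(c[js[~far], lb])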
