I am new to NumbaPro in Python. I have the following code which I want to parallelize in x,y grid in CUDA (Anaconda Accelerate), however everytime I run this it gives a NotImplementedError at the Decorator line, I am not sure what is wrong, can someone please help me? Many Thanks:
#cuda.jit(argtypes=(float64[:,:], float64[:,:,:], float64, float64, float64), device=True)
def computeflow(PMapping2, Array_hist2, Num_f1, p_depth1, image_width1):
x, y = cuda.grid(2);
for y in xrange(0,p_depth1):
for x in xrange(0,image_width1):
Array_H, bin_edges = numpy.histogram(Array_hist2[y,x,:], bins=Num_f1, range=None, normed=False, weights=None, density=None);
Array_H = (numpy.imag(numpy.fft.ifft(Array_H,n=1024)));
Array_H1 = Array_H[0:len(Array_H)/2];
Array_H1[20:1024] = 0;
PMapping2[y,x] = numpy.sum(Array_H1);
Mapping1=cuda.to_device(PMapping);
Array_hist1=cuda.to_device(Array_hist);
computeflow[(3,3),(3,3)](PMapping, Array_hist, Num_f, p_depth, image_width);
PMapping1.to_host();
NotImplementedError: offset=203 opcode=2b opname=STORE_SLICE+3
This mean that the slice operation a[i:j] = b is not implemented yet. ref
And looking at the function you're trying to use cuda on, it looks like you do not fully understand how cuda works. I suggest you look up some general guides on for example cuda/pycuda or opencl/pyopencl, to get a quick grasp on how function for parallalizing for gpu's need to be designed. It a too big of a topic to go through here. The doc's for these kinds of things is sadly pretty bad on continiums pages. Probably because there is still a lot of development going on.
Related
I am a beginner in fenics and I am trying to resolve Poisson equation with a boundary condition which is a Perlin noise generated by opensimplex, a Python library.
I'm trying to define f, the boundary condition by Expression().
I tried Expression('function(x[0],x[1],x[2])') where function (x,y,z)=opensimplex.tmp.noise3d(x,y,z)). However, as this opensimplex function is not managed by C++, I got a compilation error; Compilation failed!.
Is there any solution to overcome this error ?
I had a similar problem when starting to work with transient flows in FEniCS.
Defining a subclass for UserExpression, before defining your variational form should enable the compilation.
from dolfin import *
parameters["reorder_dofs_serial"] = True
### (Here you add your domain generation and FunctionSpace definition)
class Expression(SubDomain):
def inside(self,a,on_boundary):
return (x[0]) and (x[1]) and (x[2]) and on_boundary
f=MyExpression(2.0)
print(assemble(f*dx(domain=UnitIntervalMesh(1))))
If this still doesn't enable compilation, please attach the relevant portions of your code, and we can try to work through them.
If you have a fixed dimension order (e.g. 2-D), you might also have to add this after reordering dofs:
parameters["form_compiler"]["quadrature_degree"]=2
Good luck!
I'm trying to implement an transform which at some stage in it has a lookup table < 1K in size. This seems to me like it shouldn't pose a problem to a modern graphics card.
But the code below is failing with an unknown error:
from numba import cuda, vectorize
import numpy as np
tmp = np.random.uniform( 0, 100, 1000000 ).astype(np.int16)
tmp_device = cuda.to_device( tmp )
lut = np.arange(100).astype(np.float32) * 2.5
lut_device = cuda.to_device(lut)
#cuda.jit(device=True)
def lookup(x):
return lut[x]
#vectorize("float32(int16)", target="cuda")
def test_lookup(x):
return lookup(x)
test_lookup(tmp_device).copy_to_host() # <-- fails with cuMemAlloc returning UNKNOWN_CUDA_ERROR
What am I doing against the spirit of numba.cuda?
Even replacing lookup with the following simplified code results in the same error:
#cuda.jit(device=True)
def lookup(x):
return x + lut[1]
Once this error occurs, I am essentially no longer able to utilize the cuda context at all. For instance, allocating a new array via cuda.to_device results in a:
numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuMemAlloc results in UNKNOWN_CUDA_ERROR
Running on: 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04)
Driver Version: 390.25
numba: 0.33.0
The above code is fixed by modifying the part in bold:
#cuda.jit(device=True)
def lookup(x):
lut_device = cuda.const.array_like(lut)
return lut_device[x]
I ran multiple variations of the code including simply touching the lookup table from within this kernel, but not using its output. This combined with #talonmies' assertion that UNKNOWN_CUDA_ERROR usually occurs with invalid instructions, I thought that perhaps there was a shared memory constraint that was causing the issue.
The above code makes the whole thing work. However, I still don't understand why in a profound way.
If anyone knows and understands why, please feel free to contribute to this answer.
I am trying to speed up a python function using numba, however I cannot seem to make it compile.
The input for my function is a 27x4 array of type np.int32.
My function is:
#nb.jit(nopython=True)
def edge_profile(input):
pos = input[:,:3]
val = input[:,3]
centre = np.mean(pos,axis=0).astype(np.int32)
diff = np.absolute(pos-centre).sum(axis=1)
cell_edge = np.zeros(3)
for i in range(3):
idx = np.where(diff==i+1)[0]
idy = np.where(val[idx]==1)[0]
cell_edge[i] = len(idy)
return cell_edge.astype(np.int32)
However this produces an extremely large error message which I have unable to use to diagnose the problem. I have tried specifying the input types as follows:
#nb.jit(nb.int32[:](nb.int32[:,:]))
def ...
however this produces an equally large error message.
I fell that I am probably using some function/feature that is not supported in numba, but I do not know enough about it to identify the problem. Any help would be greatly appreciated.
Numba should work ok so long as you stick to basic lists and arrays in the function you want to speed up. It appears that you are already using functions from numpy that are probably already well optimized. So its unlikely you will see a speed up even if you did get it to work. You haven't mentioned what your OS is. Under ubuntu 14.04 you can get it to work through some steps outlined here.
I have a piece of code that is supposed to calculate a simple
matrix product, in python (using theano). The matrix that I intend to multiply with is a shared variable.
The example is the smallest example that demonstrates my problem.
I have made use of two helper-functions. floatX converts its input to something of type theano.config.floatX
init_weights generates a random matrix (in type floatX), of given dimensions.
The last line causes the code to crash. In fact, this forces so much output on the commandline that I can't even scroll to the top of it anymore.
So, can anyone tell me what I'm doing wrong?
def floatX(x):
return numpy.asarray(x,dtype=theano.config.floatX)
def init_weights(shape):
return floatX(numpy.random.randn(*shape))
a = init_weights([3,3])
b = theano.shared(value=a,name="b")
x = T.matrix()
y = T.dot(x,b)
f = theano.function([x],y)
This work for me. So my guess is that you have a problem with your blas installation. Make sure to use Theano development version:
http://deeplearning.net/software/theano/install.html#bleeding-edge-install-instructions
It have better default for some configuration. If that do not fix the problem, look at the error message. There is main part that is after the code dump. After the stack trace. This is what is the most useful normally.
You can disable direct linking by Theano to blas with this Theano flag: blas.ldflags=
This can cause slowdown. But it is a quick check to confirm the problem is blas.
If you want more help, dump the error message to a text file and put it on the web and link to it from here.
I am working on translating a model from MATLAB to Python. The crux of the model lies in MATLAB's ode15s. In the MATLAB execution, the ode15s has standard options:
options = odeset()
[t P] = ode15s(#MODELfun, tspan, y0, options, params)
For reference, y0 is a vector (of size 98) as is MODELfun.
My Python attempt at an equivalent is as follows:
ode15s = scipy.integrate.ode(Model.fun)
ode15s.set_integrator('vode', method = 'bdf', order = 15)
ode15s.set_initial_value(y0).set_f_params(params)
dt = 1
while ode15s.successful() and ode15s.t < duration:
ode15s.integrate(ode15s.t+dt)
This though, does not seem to be working. Any suggestions, or an alternative?
Edit:
After looking at the output, the result I'm getting from the Python is either no change in some elements of y0 over time, or a constant change at each step for the rest of the y0. Any experience with something like this?
According to the SciPy wiki for Matlab Users, the right way for using the ode15s is
scipy.integrate.ode(f).set_integrator('vode', method='bdf', order=15)
One point to make clear is that, unlike Matlab's ode15s, the scipy integrator 'vode' does not support models with a mass matrix. So any recommendation should include this caveat.