Optimising python function with numba

Optimising python function with numba - python

I am trying to speed up a python function using numba, however I cannot seem to make it compile.
The input for my function is a 27x4 array of type np.int32.
My function is:
#nb.jit(nopython=True)
def edge_profile(input):
pos = input[:,:3]
val = input[:,3]
centre = np.mean(pos,axis=0).astype(np.int32)
diff = np.absolute(pos-centre).sum(axis=1)
cell_edge = np.zeros(3)
for i in range(3):
idx = np.where(diff==i+1)[0]
idy = np.where(val[idx]==1)[0]
cell_edge[i] = len(idy)
return cell_edge.astype(np.int32)
However this produces an extremely large error message which I have unable to use to diagnose the problem. I have tried specifying the input types as follows:
#nb.jit(nb.int32[:](nb.int32[:,:]))
def ...
however this produces an equally large error message.
I fell that I am probably using some function/feature that is not supported in numba, but I do not know enough about it to identify the problem. Any help would be greatly appreciated.

Numba should work ok so long as you stick to basic lists and arrays in the function you want to speed up. It appears that you are already using functions from numpy that are probably already well optimized. So its unlikely you will see a speed up even if you did get it to work. You haven't mentioned what your OS is. Under ubuntu 14.04 you can get it to work through some steps outlined here.

Related

Scipy minimize iterating past bounds

I am trying to minimize a function of 3 input variables using scipy. The function reads like so-
def myfunc(x):
x[0] = a
x[1] = b
x[2] = c
n = f(a,b,c)
return n
bound1 = (80,100)
bound2 = (10,20)
bound3 = (312,740)
guess = [a0,b0,c0]
bds = (bound1,bound2,bound3)
result = minimize(myfunc, guess,method='L-BFGS-B',bounds=bds)
The function I am trying to currently run reaches a minimum at a=100,b=10,c=740, which is at the end of the bounds.
The minimize function keeps trying to iterate past the end of bound 3 (gets to c0 value of 740.0000000149012 on its last iteration.
Is there any way to stop this from happening? i.e. stop the iteration at the actual end of my bound?

This happens due to numerical-differentiation, which itself is not only needed to infer the step-direction and size, but also to reason about termination.
In general you can't do much without being very careful in regards to whatever solver (and there are many backend-solvers) being used. The basic idea is to replace the automatic numerical-differentiation with one provided by you: this one then respects those bounds and must be careful about the solvers-internals, e.g. "how to reason about termination at this end".
Fix A:
Your problem should vanish automatically when using: Pull-request #10673, which touches your configuration: L-BFGS-B.
It seems, this PR is not part of the current release SciPy 1.4.1 (as this was 2 months before the PR).
See also #6026, where a milestone of 1.5.0 is mentioned in regards to some changes including respecting bounds in num-diff.
For above PR, you will need to install scipy from the sources, which is:
quite doable on linux (and maybe os x)
not something you should try on windows!
trust me...
See the documentation if needed.
Fix B:
Apart from that, as you are doing unconstrained-optimization (with variable-bounds) where more solver-backends are available (compared to constrained-optimization), you might try another solver, trust-constr, which has explicit support for this, see #9098.
Be careful to recognize, that you need to signal this explicitly when setting up the bounds!

Why is this numba.cuda lookup table implementation failing?

I'm trying to implement an transform which at some stage in it has a lookup table < 1K in size. This seems to me like it shouldn't pose a problem to a modern graphics card.
But the code below is failing with an unknown error:
from numba import cuda, vectorize
import numpy as np
tmp = np.random.uniform( 0, 100, 1000000 ).astype(np.int16)
tmp_device = cuda.to_device( tmp )
lut = np.arange(100).astype(np.float32) * 2.5
lut_device = cuda.to_device(lut)
#cuda.jit(device=True)
def lookup(x):
return lut[x]
#vectorize("float32(int16)", target="cuda")
def test_lookup(x):
return lookup(x)
test_lookup(tmp_device).copy_to_host() # <-- fails with cuMemAlloc returning UNKNOWN_CUDA_ERROR
What am I doing against the spirit of numba.cuda?
Even replacing lookup with the following simplified code results in the same error:
#cuda.jit(device=True)
def lookup(x):
return x + lut[1]
Once this error occurs, I am essentially no longer able to utilize the cuda context at all. For instance, allocating a new array via cuda.to_device results in a:
numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuMemAlloc results in UNKNOWN_CUDA_ERROR
Running on: 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04)
Driver Version: 390.25
numba: 0.33.0

The above code is fixed by modifying the part in bold:
#cuda.jit(device=True)
def lookup(x):
lut_device = cuda.const.array_like(lut)
return lut_device[x]
I ran multiple variations of the code including simply touching the lookup table from within this kernel, but not using its output. This combined with #talonmies' assertion that UNKNOWN_CUDA_ERROR usually occurs with invalid instructions, I thought that perhaps there was a shared memory constraint that was causing the issue.
The above code makes the whole thing work. However, I still don't understand why in a profound way.
If anyone knows and understands why, please feel free to contribute to this answer.

Theano matrix multiplication

I have a piece of code that is supposed to calculate a simple
matrix product, in python (using theano). The matrix that I intend to multiply with is a shared variable.
The example is the smallest example that demonstrates my problem.
I have made use of two helper-functions. floatX converts its input to something of type theano.config.floatX
init_weights generates a random matrix (in type floatX), of given dimensions.
The last line causes the code to crash. In fact, this forces so much output on the commandline that I can't even scroll to the top of it anymore.
So, can anyone tell me what I'm doing wrong?
def floatX(x):
return numpy.asarray(x,dtype=theano.config.floatX)
def init_weights(shape):
return floatX(numpy.random.randn(*shape))
a = init_weights([3,3])
b = theano.shared(value=a,name="b")
x = T.matrix()
y = T.dot(x,b)
f = theano.function([x],y)

This work for me. So my guess is that you have a problem with your blas installation. Make sure to use Theano development version:
http://deeplearning.net/software/theano/install.html#bleeding-edge-install-instructions
It have better default for some configuration. If that do not fix the problem, look at the error message. There is main part that is after the code dump. After the stack trace. This is what is the most useful normally.
You can disable direct linking by Theano to blas with this Theano flag: blas.ldflags=
This can cause slowdown. But it is a quick check to confirm the problem is blas.
If you want more help, dump the error message to a text file and put it on the web and link to it from here.

NotImplementedError at Decorator of NumbaPro (Python)

I am new to NumbaPro in Python. I have the following code which I want to parallelize in x,y grid in CUDA (Anaconda Accelerate), however everytime I run this it gives a NotImplementedError at the Decorator line, I am not sure what is wrong, can someone please help me? Many Thanks:
#cuda.jit(argtypes=(float64[:,:], float64[:,:,:], float64, float64, float64), device=True)
def computeflow(PMapping2, Array_hist2, Num_f1, p_depth1, image_width1):
x, y = cuda.grid(2);
for y in xrange(0,p_depth1):
for x in xrange(0,image_width1):
Array_H, bin_edges = numpy.histogram(Array_hist2[y,x,:], bins=Num_f1, range=None, normed=False, weights=None, density=None);
Array_H = (numpy.imag(numpy.fft.ifft(Array_H,n=1024)));
Array_H1 = Array_H[0:len(Array_H)/2];
Array_H1[20:1024] = 0;
PMapping2[y,x] = numpy.sum(Array_H1);
Mapping1=cuda.to_device(PMapping);
Array_hist1=cuda.to_device(Array_hist);
computeflow[(3,3),(3,3)](PMapping, Array_hist, Num_f, p_depth, image_width);
PMapping1.to_host();

NotImplementedError: offset=203 opcode=2b opname=STORE_SLICE+3
This mean that the slice operation a[i:j] = b is not implemented yet. ref
And looking at the function you're trying to use cuda on, it looks like you do not fully understand how cuda works. I suggest you look up some general guides on for example cuda/pycuda or opencl/pyopencl, to get a quick grasp on how function for parallalizing for gpu's need to be designed. It a too big of a topic to go through here. The doc's for these kinds of things is sadly pretty bad on continiums pages. Probably because there is still a lot of development going on.

Creating an R data.frame in python with low level rpy2

I am using the rpy2 package to bring some R functionality to python.
The functions I'm using in R need a data.frame object, and by using rlike.TaggedList and then robjects.DataFrame I am able to make this work.
However I'm having performance issues, when comparing to the exact same R functions with the exact same data, which led me to try and use the rpy2 low level interface as mentioned here - http://rpy.sourceforge.net/rpy2/doc-2.3/html/performances.html
So far I have tried:
Using TaggedList with FloatSexpVector objects instead of numpy arrays, and the DataFrame object.
Dumping the TaggedList and DataFrame classes by using a dictionary like this:
d = dict((var_name, var_sexp_vector) for ...)
dataframe = robjects.r('data.frame')(**d)
Both did not get me any noticeable speedup.
I have noticed that DataFrame objects can get a rinterface.SexpVector in their constructor , so I have thought of creating a such a named vector, but I have no idea on how to put in the names (in R I know its just names(vec) = c('a','b'...)).
How do I do that? Is there another way?
And is there an easy way to profile rpy itself, so I could know where the bottleneck is?
EDIT:
The following code seem to work great (x4 faster) on newer rpy (2.2.3)
data = ro.r('list')([ri.FloatSexpVector(x) for x in vectors])[0]
data.names = ri.StrSexpVector(vector_names)
However it doesn't on version 2.0.8 (last one supported by windows), since R cant seem to be able to use the names: "Error in eval(expr, envir, enclos) : object 'y' not found"
Ideas?
EDIT #2:
Someone did the fine job of building a rpy2.3 binary for windows (python 2.7), the mentioned works great with it (almost x6 faster for my code)
link: https://bitbucket.org/breisfeld/rpy2_w32_fix/issue/1/binary-installer-for-win32

Python can be several times faster than R (even byte-compiled R), and I managed to perform operations on R data structures with rpy2 faster than R would have. Sharing the relevant R and rpy2 code would help make more specific advice (and improve rpy2 if needed).
In the meantime, SexpVector might not be what you want; it is little more than an abstract class for all R vectors (see class diagram for rpy2.rinterface). ListSexpVector might be more appropriate:
import rpy2.rinterface as ri
ri.initr()
l = ri.ListSexpVector([ri.IntSexpVector((1,2,3)),
ri.StrSexpVector(("a","b","c")),])
An important detail is that R lists are recursive data structures, and R avoids a catch 22-type of situation by having the operator "[[" (in addittion to "["). Python does not have that, and I have not (yet ?) implemented "[[" as a method at the low-level.
Profiling in Python can be done with the module stdlib module cProfile, for example.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Optimising python function with numba - python

Related

Scipy minimize iterating past bounds

Why is this numba.cuda lookup table implementation failing?

Theano matrix multiplication

NotImplementedError at Decorator of NumbaPro (Python)

Creating an R data.frame in python with low level rpy2

Categories

Resources