optimizing python quad using ctypes - python

I need to use ctypes functions to reduce the running time of quad in Python. Here is my original question, but now I know what path I need to follow: I need to follow the same steps as in this similar problem.
However in my case the function that will be handled in the numerical integration is calling another python function. Like this:
from scipy.integrate import quad
from sklearn.neighbors import KernelDensity
import numpy as np

funcA = lambda x: np.exp(kde_bad.score_samples([[x]]))
quad(funcA, 0, cut_off)
where cut_off is just a scalar that I decide in my code, and kde_bad is the kernel object created using KernelDensity.
So my question is: how do I need to specify the function in C? The equivalent of this:
//testlib.c
double f(int n, double args[n])
{
    return args[0] - args[1] * args[2]; // corresponds to x0 - x1 * x2
}
Any input is appreciated!

You can do this using ctypes's callback function facilities.
That said, it's debatable whether or not you'll actually achieve any speed gains if your function calls something from Python. There are essentially two reasons that ctypes speeds up integration: (1) the integrand function itself is faster as compiled C than as Python bytecode, and (2) it avoids calling back to Python from the compiled (Fortran!) QUADPACK routines. What you're proposing completely eliminates the second of these sources of performance gains, and might even increase the penalty if you make such a call more than once. If, however, the large bulk of the execution time of your integrand is in its own code, rather than in these other Python functions that you need to call, then you might see some benefit.
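For reference, here is a minimal sketch of wiring such a callback up with scipy.LowLevelCallable (the KDE fitting line and cut_off value are stand-ins for however you build yours). Because the callback still runs Python code, it mainly demonstrates the mechanism rather than a speedup:

import ctypes
import numpy as np
from scipy import LowLevelCallable
from scipy.integrate import quad
from sklearn.neighbors import KernelDensity

# Stand-ins for your fitted estimator and cut-off.
kde_bad = KernelDensity(bandwidth=0.5).fit(np.random.randn(100, 1))
cut_off = 3.0

# ctypes prototype matching one of quad's supported C signatures:
#   double f(int n, double *xx)
proto = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_int,
                         ctypes.POINTER(ctypes.c_double))

def integrand(n, xx):
    # xx[0] is the integration variable; this still calls back into Python,
    # so the second source of speedup described above is lost.
    return float(np.exp(kde_bad.score_samples([[xx[0]]]))[0])

c_integrand = proto(integrand)
result, error = quad(LowLevelCallable(c_integrand), 0.0, cut_off)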

As answered in the other question, quadpy is here to save the day with its vectorized computation capabilities.
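A rough sketch of that vectorized route (assuming a quadpy version that exposes the scipy-style quadpy.quad; the entry points have shifted between releases). The key point is that the integrand receives a whole array of evaluation points, so score_samples is called once per batch instead of once per point:

import numpy as np
import quadpy
from sklearn.neighbors import KernelDensity

kde_bad = KernelDensity(bandwidth=0.5).fit(np.random.randn(100, 1))  # stand-in
cut_off = 3.0

def funcA_vec(x):
    # x arrives as an array of evaluation points; query the KDE in one call
    return np.exp(kde_bad.score_samples(np.asarray(x).reshape(-1, 1)))

value, error = quadpy.quad(funcA_vec, 0.0, cut_off)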

Related

Parallelizing a numba loop

Previously, I asked a question about a relatively simple loop that Numba was failing to parallelize. The solution turned out to be making all the loops explicit.
Now I need to do a simpler version of the same task: I have arrays alpha and beta of shape (m,n) and (b,m,n) respectively, and I want to compute the Frobenius product of 2D slices of the arguments and find the slice of beta that maximizes this product. Previously there was an additional, large first dimension of alpha, so it was over this dimension that I parallelized; now I want to parallelize over the first dimension of beta, as the calculation becomes expensive when b > 1000.
If I naively modify the code that worked for the previous problem, I obtain:
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def parallel_value_numba(alpha, beta):
    dot = np.zeros(beta.shape[0])
    for i in prange(beta.shape[0]):
        for j in prange(beta.shape[1]):
            for k in prange(beta.shape[2]):
                dot[i] += alpha[j, k] * beta[i, j, k]
    index = np.argmax(dot)
    value = dot[index]
    return value, index
But Numba doesn't like this for some reason and complains:
numba.core.errors.LoweringError: Failed in nopython mode pipeline (step: nopython mode backend)
scalar type memoryview(float64, 2d, C) given for non scalar argument #3
So instead, I tried
@njit(parallel=True)
def parallel_value_numba_2(alpha, beta):
    product = np.multiply(alpha, beta)
    dot1 = np.sum(product, axis=2)
    dot2 = np.sum(dot1, axis=1)
    index = np.argmax(dot2)
    value = dot2[index]
    return value, index
This compiles as long as you broadcast alpha to beta.shape before passing it to the function, and in principle Numba is capable of parallelizing the NumPy operations. But it runs painfully slowly, much slower than the serial, pure Python code
def einsum_value(alpha, beta):
    dot = np.einsum('kl,jkl->j', alpha, beta)
    index = np.argmax(dot)
    value = dot[index]
    return value, index
So, my current working code uses this last implementation, but this function is still bottlenecking the runtime and I'd like to speed it up. Can anyone convince Numba to parallelize this function with an appreciable speedup?
This is not exactly an answer with a solution, but it is easier to format code here than in a comment.
Numba generates different code depending on the arguments passed to the function. For example, your code works with the following example:
>>> alpha = np.random.random((5, 4))
>>> beta = np.random.random((3, 5, 4))
>>> parallel_value_numba(alpha, beta)
(5.89447648574048, 0)
In order to diagnose the problem, it's necessary to have an example of the specific argument values causing the problem.
Reading the error message, it seems you are passing a memoryview object, but Numba may not have full support for it.
As a side comment, you don't need to use prange in every loop. It's normally enough to use it in the outer loop, as long as the number of expected iterations is larger than the number of cores in your machine.
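To illustrate that last point, here is a sketch of the original kernel with prange only on the outermost loop; the inner loops stay as plain range and each thread accumulates into its own slot of dot:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def parallel_value_outer_only(alpha, beta):
    dot = np.zeros(beta.shape[0])
    for i in prange(beta.shape[0]):      # parallelize only this dimension
        acc = 0.0
        for j in range(beta.shape[1]):
            for k in range(beta.shape[2]):
                acc += alpha[j, k] * beta[i, j, k]
        dot[i] = acc
    index = np.argmax(dot)
    return dot[index], index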

I have issues understanding the use of Numba's vectorize decorator in Python

I am currently looking into using Numba to speed up my Python software. I am entirely new to the concept and currently trying to learn the absolute basics. What I am stuck on for now is:
I don't understand what the big benefit of the vectorize decorator is.
The documentation explains that the decorator is used to turn a normal Python function into a NumPy ufunc. From what I understand, the benefit of a ufunc is that it can take NumPy arrays (instead of scalars) and provide features such as broadcasting.
But all the examples I can find online can be solved just as easily without this decorator.
Take, for instance, this example from the Numba documentation.
from numba import vectorize, float64

@vectorize([float64(float64, float64)])
def f(x, y):
    return x + y
They claim that the function now works like a NumPy ufunc. But doesn't it anyway, even without the decorator? If I were to just run the following code:
import numpy as np

def f(x, y):
    return x + y

x = np.arange(10)
y = np.arange(10)
print(f(x, y))
That works just fine. The function already accepts arguments other than scalars.
What am I misunderstanding here?
Just read the docs a few lines below:
You might ask yourself, “why would I go through this instead of compiling a simple iteration loop using the @jit decorator?”. The answer is that NumPy ufuncs automatically get other features such as reduction, accumulation or broadcasting.
For example, f.reduce(arr) will sum all the elements of arr at C speed, which the plain Python f cannot do.
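To make that concrete, a small sketch of the ufunc machinery you get for free once the function is compiled with vectorize (the values in the comments assume arr = np.arange(10.0)):

import numpy as np
from numba import vectorize, float64

@vectorize([float64(float64, float64)])
def f(x, y):
    return x + y

arr = np.arange(10.0)
f.reduce(arr)      # 45.0 -- like np.add.reduce
f.accumulate(arr)  # running sums, like np.cumsum
f.outer(arr, arr)  # 10x10 table of pairwise sums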

Can I decorate an explicit function call such as np.sqrt()

I understand a bit about Python function decorators. I think the answer to my question is no, but I want to make sure. With a decorator and a NumPy array x = np.array([1,2,3]), I can override x.sqrt() and change its behavior. Is there some way I can override np.sqrt(x) in Python?
Use case: I am working on the quantities package. I would like to be able to take the square root of uncertain quantities without changing a code base that currently uses np.sqrt().
Edit:
I'd like to modify np.sqrt in the quantities package so that the following code works (all three calls should print identical results; note the 0 uncertainty when using np.sqrt()). I hope not to require end users to modify their code, but instead to properly wrap/decorate np.sqrt() inside the quantities package. Currently many NumPy functions are decorated (see https://github.com/python-quantities/python-quantities/blob/ca87253a5529c0a6bee37a9f7d576f1b693c0ddd/quantities/quantity.py), but they seem to work only when x.func() is called, not numpy.func(x).
import numpy as np
import quantities as pq
x = pq.UncertainQuantity(2, pq.m, 2)
print x.sqrt()
>>> 1.41421356237 m**0.5 +/- 0.707106781187 m**0.5 (1 sigma)
print x**0.5
>>> 1.41421356237 m**0.5 +/- 0.707106781187 m**0.5 (1 sigma)
print np.sqrt(x)
>>> 1.41421356237 m**0.5 +/- 0.0 dimensionless (1 sigma)
Monkeypatching
If I understand your situation correctly, your use case is not really about decoration (modifying a function you write, in a standard manner)
but rather about monkey patching:
Modifying a function somebody else wrote without actually changing that function's definition's source code.
The idiom for what you then need is something like
import numpy as np  # provide local access to the numpy module object

original_np_sqrt = np.sqrt

def my_improved_np_sqrt(x):
    # do whatever you please, including:
    # - contemplating the UncertainQuantity-ness of x and
    # - calling original_np_sqrt as needed
    return original_np_sqrt(x)  # placeholder so the sketch runs

np.sqrt = my_improved_np_sqrt
Of course, this can change only the future meaning of numpy.sqrt,
not the past one.
So if anybody has imported numpy before the above and has already used numpy.sqrt in a way you would have liked to influence, you lose.
(And the name to which they map numpy does not matter.)
But after the above code was executed, the meaning of numpy.sqrt in all
modules (whether they imported numpy before it or after it)
will be that of my_improved_np_sqrt, whether the creators of those modules
like it or not (and of course unless some more monkeypatching of numpy.sqrt
is going on elsewhere).
Note that
When you do weird things, Python can become a weird platform!
When you do weird things, Python can become a weird platform!
When you do weird things, Python can become a weird platform!
This is why monkey patching is not normally considered good design style.
So if you take that route, make sure you announce it very prominently
in all relevant documentation.
Oh, and if you do not want to modify other code than that which is
directly or indirectly executed from your own methods, you could
introduce a decorator that performs monkeypatching before the call
and un-monkeypatching (reassigning original_np_sqrt)
after the call and apply that decorator to
all your functions in question.
Make sure you handle exceptions in that decorator then, so that
the un-monkeypatching is really executed in all cases.
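A sketch of such a scoped patch, using try/finally so the original np.sqrt is restored even when the wrapped function raises (my_improved_np_sqrt is the function defined above):

import functools
import numpy as np

def with_patched_sqrt(func):
    # Patch np.sqrt only for the duration of the call, then restore it.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        original = np.sqrt
        np.sqrt = my_improved_np_sqrt
        try:
            return func(*args, **kwargs)
        finally:
            np.sqrt = original  # runs even if func raises
    return wrapper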
Maybe, as BrenBarn stated, all you need is
np.sqrt = decorator(np.sqrt)
because a decorator is just a callable that takes an object and returns a modified object.
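For instance, a hedged sketch of such a decorator, which routes UncertainQuantity arguments to their own sqrt() method and delegates everything else to the original:

import numpy as np
import quantities as pq

original_np_sqrt = np.sqrt

def uncertainty_aware(sqrt_func):
    # Special-case UncertainQuantity, delegate otherwise.
    def wrapped(x, *args, **kwargs):
        if isinstance(x, pq.UncertainQuantity):
            return x.sqrt()  # keeps the uncertainty, as in the question
        return sqrt_func(x, *args, **kwargs)
    return wrapped

np.sqrt = uncertainty_aware(original_np_sqrt)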

Python and Numba for vectorized functions

Good day. I'm writing a Python module for some numeric work. Since there's a lot going on, I've been spending the last few days optimizing code to improve calculation times.
However, I have a question concerning Numba.
Basically, I have a class with some fields which are numpy arrays, which I initialize in the following way:
def __init__(self):
    a = numpy.arange(0, self.max_i, 1)
    self.vibr_energy = self.calculate_vibr_energy(a)

def calculate_vibr_energy(self, i):
    return numpy.exp(-self.harmonic * i - self.anharmonic * (i ** 2))
So, the code is vectorized, and using Numba's JIT results in some improvement. However, sometimes I need to access the calculate_vibr_energy function from outside the class, and pass a single integer instead of an array in place of i.
As far as I understand, if I use Numba's JIT on the calculate_vibr_energy, it will have to always take an array as an argument.
So, which of the following options is better:
1) Create a new function calculate_vibr_energy_single(i), which will only take a single integer number, and use Numba on it too
2) Replace all usages of the function that are similar to this one:
myclass.calculate_vibr_energy(1)
with this:
tmp = np.array([1])
myclass.calculate_vibr_energy(tmp)[0]
Or are there other, more efficient (or at least, more Python-ic) ways of doing that?
I have only played a little with Numba so far, so I may be mistaken, but as far as I understand it, using the "autojit" decorator should give you functions that can take arguments of any type.
See e.g. http://numba.pydata.org/numba-doc/dev/pythonstuff.html
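In current Numba, autojit has been folded into the lazy form of jit (no explicit signature), which compiles a separate specialization per argument type. A sketch, with the physics moved into a free function so it can take either a scalar or an array (the constants here are made-up placeholders):

import numpy as np
from numba import jit

@jit(nopython=True)
def calculate_vibr_energy(harmonic, anharmonic, i):
    return np.exp(-harmonic * i - anharmonic * (i ** 2))

print(calculate_vibr_energy(0.1, 0.01, 3))               # scalar argument
print(calculate_vibr_energy(0.1, 0.01, np.arange(5.0)))  # array argument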

Python: is the iteration of the multidimensional array super slow?

I have to iterate over all items in a two-dimensional array of integers and change each value (according to some rule, not important here).
I'm surprised how significant the difference in performance is between the Python runtime and the C# or Java runtime. Did I write totally wrong Python code (v2.7.2)?
import numpy
a = numpy.ndarray((5000, 5000), dtype=numpy.int32)
for x in numpy.nditer(a.T):
    x = 123
>python -m timeit -n 2 -r 2 -s "import numpy; a = numpy.ndarray((5000,5000), dtype=numpy.int32)" "for x in numpy.nditer(a.T):" " x = 123"
2 loops, best of 2: 4.34 sec per loop
For example, the C# code below takes only 50 ms, i.e. Python is almost 100 times slower! (Assume the matrix variable is already initialized.)
for (y = 0; y < 5000; y++)
    for (x = 0; x < 5000; x++)
        matrix[y][x] = 123;
Yep! Iterating through numpy arrays in python is slow. (Slower than iterating through a python list, as well.)
Typically, you avoid iterating through them directly.
If you can give us an example of the rule you're changing things based on, there's a good chance that it's easy to vectorize.
As a toy example:
import numpy as np
x = np.linspace(0, 8*np.pi, 100)
y = np.cos(x)
x[y > 0] = 100
However, in many cases you have to iterate, either due to the algorithm (e.g. finite difference methods) or to lessen the memory cost of temporary arrays.
In that case, have a look at Cython, Weave, or something similar.
The example you gave was presumably meant to set all items of a two-dimensional NumPy array to 123. This can be done efficiently like this:
a.fill(123)
or
a[:] = 123
Python is a much more dynamic language than C or C#. The main reason why the loop is so slow is that on every pass, the CPython interpreter is doing some extra work that wastes time: specifically, it is binding the name x with the next object from the iterator, then when it evaluates the assignment it has to look up the name x again.
As @Sven Marnach noted, you can call the ndarray.fill() method and it is fast. That method is compiled C (or maybe Fortran), and it simply loops over the addresses of the numpy.array data structure and fills in the values. Much less dynamic than Python, which is good for this simple case.
But now consider PyPy. Once you run your program under PyPy, a JIT analyzes what your code is actually doing. In this example, it notes that the name x isn't used for anything but the assignment, and it can optimize away binding the name. This example should be one that PyPy speeds up tremendously; likely PyPy will be ten times faster than plain Python (so only one-tenth as fast as C, rather than 1/100 as fast).
http://pypy.org
As I understand it, PyPy won't be working with Numpy for a while yet, so you can't just run your existing Numpy code under PyPy yet. But the day is coming.
I'm excited about PyPy. It offers the hope that we can write in a very high-level language (Python) and yet get nearly the performance of writing things in "portable assembly language" (C). For examples like this one, the Numpy might even beat the performance of naive C code, by using SIMD instructions from the CPU (SSE2, NEON, or whatever). For this example, with SIMD, you could set four integers to 123 with each loop, and that would be faster than a plain C loop. (Unless the C compiler used a SIMD optimization also! Which, come to think of it, is likely for this case. So we are back to "nearly the speed of C" rather than faster for this example. But we can imagine trickier cases that the C compiler isn't smart enough to optimize, where a future PyPy might.)
But never mind PyPy for now. If you will be working with NumPy, it is a good idea to learn all the functions and methods like ndarray.fill() that are there to speed up your code.
C++ emphasizes machine time over programmer time.
Python emphasizes programmer time over machine time.
PyPy is a Python written in Python, and it has the beginnings of NumPy support; you might try that. PyPy has a nice JIT that makes things quite fast.
You could also try Cython, which lets you translate a dialect of Python to C and compile the C to a Python C extension module; this allows you to continue using CPython for most of your code while still getting a bit of a speedup. However, in the one microbenchmark I've tried comparing PyPy and Cython, PyPy was quite a bit faster than Cython.
Cython uses a highly Python-like syntax, but it allows you to intermix Python data types with C data types fairly freely. If you redo your hotspots with C data types, it should be pretty fast. Continuing to use Python data types is sped up by Cython too, but not as much.
The nditer code does not assign a value to the elements of a. This doesn't affect the timings issue, but I mention it because it should not be taken as a good use of nditer.
A correct version is:
for i in np.nditer(a, op_flags=['readwrite']):
    i[...] = 123
The [...] is needed to retain the reference to the loop value, which is an array of shape ().
There's no point in using a.T, since it's the values of the base a that get changed.
I agree that the proper way of doing this assignment is a[:] = 123.
If you need to do operations on a multidimensional array that depend on the values of the array but not on the positions inside the array, then .itemset() is about 5 times faster than nditer for me.
So instead of doing something like
image = np.random.random_sample((200, 200, 3))
with np.nditer(image, op_flags=['readwrite']) as it:
    for x in it:
        x[...] = x * 4.5 if x < 0.2 else x
You can do this
image2 = np.random.random_sample((200, 200, 3))
for i in range(0, image2.size):
    x = image2.item(i)
    image2.itemset(i, x * 4.5 if x < 0.2 else x)
