Can I get a numpy vectorized function to use a buffer object as the result as opposed to creating a new array that is returned by that object?
I'd like to do something like this:
fun = numpy.vectorize(lambda x: x + 1)
a = numpy.zeros((1, 10))
buf = numpy.zeros((1, 10))
fun(a, buf_obj=buf)
as opposed to
fun = numpy.vectorize(lambda x: x + 1)
a = numpy.zeros((1, 10))
buf = fun(a)
Not for vectorize, but most numpy functions take an out argument that does exactly what you want.
What function are you trying to use numpy.vectorize with? vectorize is almost always the wrong solution when you're trying to "vectorize" a calculation.
In your example above, if you wanted to do the operation in-place, you could accomplish it with:
a = numpy.zeros((1, 10))
a += 1
Or, if you wanted to be a bit verbose, but do exactly what your example would do:
a = numpy.zeros((1, 10))
buf = numpy.empty_like(a)
numpy.add(a, 1, out=buf)
numpy.vectorize has to call a python function for every element in the array, so it has additional overhead compared to numpy functions that operate on the entire array at once. Usually, when people refer to "vectorizing" an expression to get a speedup, they mean building the expression out of basic numpy building blocks, rather than using vectorize (the naming is certainly confusing...).
Edit: Based on your comment, vectorize really does fit your use case! (Writing a "raster calculator" is a pretty perfect use case for it, beyond security/sandboxing issues.)
On the other hand, numexpr is probably an even better fit if you don't mind an additional dependency.
It's faster and takes an out parameter.
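A minimal sketch of that numexpr route (the buffer names are just for illustration):
import numpy as np
import numexpr as ne

a = np.zeros((1, 10))
buf = np.empty_like(a)

# numexpr evaluates the string expression and writes the result into buf.
ne.evaluate("a + 1", out=buf)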
Some NumPy functions (e.g. argmax or cumsum) can take an array as an optional out parameter and store the result in that array. Please excuse my less than perfect grasp of the terminology here (which is what prevents me from googling for an answer), but it seems that these functions somehow act on variables that are beyond their scope.
How would I transform this simple function so that it can take an out parameter as the functions mentioned?
import numpy as np

def add_two(a):
    return a + 2

a = np.arange(5)
a = add_two(a)
From my understanding, a rewritten version of add_two() would allow for the last line above to be replaced with
add_two(a, out=a)
In my opinion, the best and most explicit approach is what you are already doing. Python passes object references by value to function parameters, so a function can only make changes visible to the caller by modifying mutable objects.
One way would be to do:
import numpy as np

def add_two(a, out):
    out[:] = a + 2

a = np.arange(5)
add_two(a, out=a)
a
Output:
array([2, 3, 4, 5, 6])
NB: unlike your current solution, this requires that the object passed as the out parameter already exists and is an array of a compatible shape.
The naive solution would be to fill in the buffer of the output array with the result of your computation:
def add_two(a, out=None):
    result = a + 2
    if out is None:
        out = result
    else:
        out[:] = result
    return out
The problem (if you can call it that) is that you are still generating the intermediate array, which effectively bypasses the benefit of pre-allocating the result in the first place. A more nuanced approach is to use the out parameters of the functions in your numpy pipeline:
def add_two(a, out=None):
    return np.add(a, 2, out=out)
Unfortunately, as with vectorization in general, this can only be done on a case-by-case basis, depending on the desired set of operations.
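For completeness, a quick usage sketch of the np.add-based version above (the buffer name buf is just for illustration):
a = np.arange(5)
buf = np.empty_like(a)
add_two(a, out=buf)   # result written into the preallocated buffer, no extra allocation
add_two(a, out=a)     # or in place: a is now array([2, 3, 4, 5, 6])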
As an aside, this has nothing to do with scope. Python objects themselves are not confined to any particular namespace (even if their names are). If a mutable argument is modified inside a function, the change will always be visible outside the function. See, for example, "Least Astonishment" and the Mutable Default Argument.
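A tiny illustration of that last point (the helper name appender is hypothetical):
def appender(lst):
    lst.append(1)      # mutates the object the caller passed in

data = []
appender(data)
print(data)            # [1] -- the change is visible outside the function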
I want to generate symmetric matrices with a zero diagonal. The symmetric part works, but when I use fill_diagonal from numpy, the result I get is "None". My code is below. Thank you for reading.
import numpy as np
matrix_size = int(input("Size of the matrix \n"))
random_matrix = np.random.random_integers(-4,4,size=(matrix_size,matrix_size))
symmetric_matrix = (random_matrix + random_matrix.T)/2
print(symmetric_matrix)
zero_diogonal_matrix = np.fill_diagonal(symmetric_matrix,0)
print(zero_diogonal_matrix)
np.fill_diagonal(), like many other methods across python/numpy, works in-place (see, for example, Why does "return list.sort()" return None, not the list?). That is, it directly alters the object in memory and does not create a new object; the return value from such functions is None. Therefore, change:
zero_diogonal_matrix = np.fill_diagonal(symmetric_matrix,0)
To just:
np.fill_diagonal(symmetric_matrix,0)
You will then see the change reflected in symmetric_matrix.
It's probably overkill, but in case you want to preserve the tenet of minimising surprise, you could wrap this (and other functions like it) in a function that takes care of preserving the original array:
def fill_diagonal(source_array, diagonal):
    copy = source_array.copy()
    np.fill_diagonal(copy, diagonal)
    return copy
But the question then becomes "who exactly is going to be least surprised by doing it this way?"
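A quick usage sketch of that wrapper (the example matrix is just for illustration):
symmetric_matrix = np.array([[1, 2],
                             [2, 1]])
zeroed = fill_diagonal(symmetric_matrix, 0)
print(zeroed)             # [[0 2] [2 0]] -- diagonal cleared in the returned copy
print(symmetric_matrix)   # [[1 2] [2 1]] -- the original is left untouched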
I have come across the following issue when multiplying numpy arrays. In the example below (which is slightly simplified from the real version I am dealing with), I start with a nearly empty array A and a full array C. I then use a recursive algorithm to fill in A.
Below, I perform this algorithm in two different ways. The first method involves the operations
n_array = np.arange(0,c-1)
temp_vec= C[c-n_array] * A[n_array]
A[c] += temp_vec.sum(axis=0)
while the second method involves the for loop
for m in range(0, c - 1):
    B[c] += C[c-m] * B[m]
Note that the arrays A and B are identical, but they are filled in using the two different methods.
In the example below I time how long it takes to perform the computation using each method. I find that, for example, with n_pix=2 and max_counts = 400, the first method is much faster than the second (that is, time_np is much smaller than time_for). However, when I then switch to, for example, n_pix=1000 and max_counts = 400, instead I find method 2 is much faster (time_for is much smaller than time_np). I would have thought that method 1 would always be faster since method 2 explicitly runs over a loop while method 1 uses np.multiply.
So, I have two questions:
Why does the timing behave this way as a function of n_pix for a fixed max_counts?
What is the optimal method for writing this code so that it behaves quickly for all n_pix?
That is, can anyone suggest a method 3? In my project, it is very important for this piece of code to perform quickly over a range of large and small n_pix.
import numpy as np
import time

def return_timing(n_pix, max_counts):
    A = np.zeros((max_counts+1, n_pix))
    A[0] = np.random.random(n_pix)*1.8
    A[1] = np.random.random(n_pix)*2.3

    B = np.zeros((max_counts+1, n_pix))
    B[0] = A[0]
    B[1] = A[1]

    C = np.outer(np.random.random(max_counts+1), np.random.random(n_pix))*3.24

    time_np = 0
    time_for = 0
    for c in range(2, max_counts + 1):
        t0 = time.time()
        n_array = np.arange(0, c-1)
        temp_vec = C[c-n_array] * A[n_array]
        A[c] += temp_vec.sum(axis=0)
        time_np += time.time() - t0

        t0 = time.time()
        for m in range(0, c - 1):
            B[c] += C[c-m] * B[m]
        time_for += time.time() - t0

    return time_np, time_for
First of all, you can easily replace:
n_array = np.arange(0,c-1)
temp_vec= C[c-n_array] * A[n_array]
A[c] += temp_vec.sum(axis=0)
with:
A[c] += (C[c:1:-1] * A[:c-1]).sum(0)
This is much faster because indexing with an array is much slower than slicing. But the temp_vec is still hidden in there, created before summing is done. This leads to the idea of using einsum, which is the fastest because it doesn't make the temp array.
A[c] = np.einsum('ij,ij->j', C[c:1:-1], A[:c-1])
Timing. For small arrays:
>>> return_timing(10,10)
numpy OP 0.000525951385498
loop OP 0.000250101089478
numpy slice 0.000246047973633
einsum 0.000170946121216
For large:
>>> return_timing(1000,100)
numpy OP 0.185983896255
loop OP 0.0458009243011
numpy slice 0.038364648819
einsum 0.0167834758759
It is probably because your numpy-only version requires the creation/allocation of new ndarrays (temp_vec and n_array), while your other method does not.
Creation of new ndarrays is very slow, and if you can modify your code so that it no longer has to continuously create them, I would expect that you could get better performance out of that method.
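As a rough sketch of that idea (building on the slice-based expression from the other answer, with a scratch buffer temp allocated once outside the loop):
temp = np.empty_like(A)                        # scratch buffer, allocated once
for c in range(2, max_counts + 1):
    # Write the element-wise products into the preallocated buffer
    # instead of building a fresh temp_vec on every iteration.
    np.multiply(C[c:1:-1], A[:c-1], out=temp[:c-1])
    A[c] += temp[:c-1].sum(axis=0)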
I have the following nested loop, but it is inefficient time-wise, so using a generator would be much better. Do you know how to do that?
x_sph[:] = [r*sin_t*cos_p for cos_p in cos_phi for sin_t in sin_theta for r in p]
It seems like some of you are of the opinion (looking at the comments) that using a generator is not helpful in this case. I am under the impression that using generators avoids assigning variables to memory, and thus saves memory and time. Am I wrong?
Judging from your code snippet you want to do something numerical and you want to do it fast. A generator won't help much in this respect. But using the numpy module will. Do it like so:
import numpy
# Change your p into an array, you'll see why.
r = numpy.array(p)                   # if p is a list this turns it into a 1-D vector
sin_theta = numpy.array(sin_theta)   # same with the rest
cos_phi = numpy.array(cos_phi)
# Broadcast to get every combination r*sin_t*cos_p (same ordering as the
# comprehension: cos_p varies slowest, r fastest), then flatten:
x_sph = (cos_phi[:, None, None] * sin_theta[None, :, None] * r[None, None, :]).ravel()
In fact I'd use numpy even earlier, by doing:
phi = numpy.array(phi) # I don't know how you calculate this but you can start here with a phi list.
theta = numpy.array(theta)
sin_theta = numpy.sin(theta)
cos_phi = numpy.cos(phi)
You could even skip the intermediate sin_theta and cos_phi assignments and just put all the stuff in one line. It'll be long and complicated so I'll omit it but I do numpy-maths like that sometimes.
And numpy is fast, it'll make a huge difference. At least a noticeable one.
A comprehension in square brackets [...] creates a list, while one in parentheses (...) creates a generator:
generator = (r*sin_t*cos_p for cos_p in cos_phi for sin_t in sin_theta for r in p)
for value in generator:
    # Do something
    ...
To turn a loop into a generator, you can make it a function and yield:
def x_sph(p, cos_phi, sin_theta):
    # Same nesting order as the original comprehension: cos_p varies slowest, r fastest.
    for cos_p in cos_phi:
        for sin_t in sin_theta:
            for r in p:
                yield r * sin_t * cos_p
However, note that the advantages of generators are generally only realised if you don't need to calculate all values and can break at some point, or if you don't want to store all the values (the latter is a space rather than time advantage). If you end up calling this:
lst = list(x_sph(p, cos_phi, sin_theta))
then you won't see any gain.
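For instance, if all you needed were an aggregate, you could consume the generator lazily without ever materialising the full list:
# Sums the values one at a time; no intermediate list is built.
total = sum(x_sph(p, cos_phi, sin_theta))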
I have already written the following piece of code, which does exactly what I want, but it is way too slow. I am certain that there is a way to make it faster, but I can't seem to find how it should be done. The first part of the code is just to show what is of which shape.
two images of measurements (VV1 and HH1)
precomputed values, VV simulated and HH simulated, which both depend on 3 parameters (precomputed for (101, 31, 11) values)
the index 2 is just to put the VV and HH images in the same ndarray, instead of making two 3darrays
VV1 = numpy.ndarray((54, 43)).flatten()
HH1 = numpy.ndarray((54, 43)).flatten()
precomp = numpy.ndarray((101, 31, 11, 2))
two of the three parameters we let vary
comp = numpy.zeros((len(parameter1), len(parameter2)))
for i, (vv, hh) in enumerate(zip(VV1, HH1)):
    comp0 = numpy.zeros((len(parameter1), len(parameter2)))
    for j in range(len(parameter1)):
        for jj in range(len(parameter2)):
            comp0[j, jj] = numpy.min((vv - precomp[j, jj, :, 0])**2 + (hh - precomp[j, jj, :, 1])**2)
    comp += comp0
The obvious thing I know I should do is get rid of as many for-loops as I can, but I don't know how to make numpy.min behave properly when working with more dimensions.
A second thing I noticed (less important if it can get vectorized, but still interesting) is that the code mostly takes CPU time rather than RAM. I have searched for a long time, but I can't find a way to write something like MATLAB's "parfor" instead of "for" (is it possible to make an @parallel decorator if I just put the for-loop in a separate method?).
edit: in reply to Janne Karila: yeah, that definitely improves it a lot,
for (vv, hh) in zip(VV1, HH1):
    comp += numpy.min((vv-precomp[...,0])**2 + (hh-precomp[...,1])**2, axis=2)
is definitely a lot faster, but is there any possibility to remove the outer for-loop too? And is there a way to make a for-loop parallel, with an @parallel decorator or something?
This can replace the inner loops, j and jj:
comp0 = numpy.min((vv-precomp[...,0])**2+(hh-precomp[...,1])**2, axis=2)
This may be a replacement for the whole loop, though all this indexing is stretching my mind a bit (note that it creates a large intermediate array):
comp = numpy.sum(
    numpy.min((VV1.reshape(-1, 1, 1, 1) - precomp[numpy.newaxis, ..., 0])**2
              + (HH1.reshape(-1, 1, 1, 1) - precomp[numpy.newaxis, ..., 1])**2,
              axis=-1),   # minimum over the last (length-11) parameter axis
    axis=0)               # sum over the pixels
One way to parallelize the loop is to construct it in such a way as to use map. In that case, you can then use multiprocessing.Pool to use a parallel map.
I would change this:
for (vv, hh) in zip(VV1, HH1):
    comp += numpy.min((vv-precomp[...,0])**2 + (hh-precomp[...,1])**2, axis=2)
To something like this:
def buildcomp(vvhh):
    vv, hh = vvhh
    return numpy.min((vv - precomp[..., 0])**2 + (hh - precomp[..., 1])**2, axis=2)

if __name__ == '__main__':
    from multiprocessing import Pool
    nthreads = 2
    p = Pool(nthreads)
    complist = p.map(buildcomp, numpy.column_stack((VV1, HH1)))
    comp = numpy.dstack(complist).sum(-1)
Note that the dstack assumes that each comp.ndim is 2, because it will add a third axis, and sum along it. This will slow it down a bit because you have to build the list, stack it, then sum it, but these are all either parallel or numpy operations.
I also changed the zip to a numpy operation, numpy.column_stack, since zip is much slower for long arrays, assuming they're already 1d arrays (which they are in your example).
I can't easily test this so if there's a problem, feel free to let me know.
In computer science, there is the concept of Big O notation, used for getting an approximation of how much work is required to do something. To make a program fast, do as little as possible.
This is why Janne's answer is so much faster: you do fewer calculations. Taking this principle further, we can apply the concept of memoization, because you are CPU bound instead of RAM bound. You can use the memory library if it needs to be more complex than the following example.
class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value
memo = AutoVivification()
def memoize(n, arr, end):
    key = id(arr)  # ndarrays are not hashable, so key on the array's identity
    if end not in memo[n][key]:
        memo[n][key][end] = (n - arr[..., end])**2
    return memo[n][key][end]
for (vv, hh) in zip(VV1, HH1):
    first = memoize(vv, precomp, 0)
    second = memoize(hh, precomp, 1)
    comp += numpy.min(first + second, axis=2)
Anything that has already been computed gets saved to memory in the dictionary, and we can look it up later instead of recomputing it. You can even break down the math being done into smaller steps that are each memoized if necessary.
The AutoVivification dictionary is just to make it easier to save the results inside of memoize, because I'm lazy. Again, you can memoize any of the math you do, so if numpy.min is slow, memoize it too.