I am a Python beginner and I am trying to average two NumPy 2D arrays of shape (1024, 1024). Doing it like this is quite fast:
newImage = (image1 + image2) / 2
But now the images have a "mask" that invalidates certain elements by setting them to zero. That means if one of the elements is zero, the resulting element should also be zero. My trivial solution is:
newImage = numpy.zeros( (1024,1024) , dtype=numpy.int16 )
for y in xrange(newImage.shape[0]):
    for x in xrange(newImage.shape[1]):
        val1 = image1[y][x]
        val2 = image2[y][x]
        if val1 != 0 and val2 != 0:
            newImage[y][x] = (val1 + val2) / 2
But this is really slow. I did not time it, but it seems to be slower by a factor of 100.
I also tried using a lambda with map(), but that does not return a NumPy array.
Try this:
newImage = numpy.where(numpy.logical_and(image1, image2), (image1 + image2) / 2, 0)
Where neither image1 nor image2 is zero, take their mean; otherwise use zero.
Looping with native Python code is generally much slower than using
built-in tools that use fast C loops. I'm not familiar with NumPy; can
you use map() to do a transformation from your two input arrays to
the output? If so, that should be faster.
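For what it's worth, a sketch of what such a map()-based version might look like (it still calls a Python function per element pair, so it will not come close to the vectorized NumPy solutions below):
import numpy as np

# Hypothetical map()-based version: applies the masked average element by
# element over the flattened images, then rebuilds a (1024, 1024) array.
avg_or_zero = lambda a, b: (a + b) // 2 if a != 0 and b != 0 else 0
newImage = np.fromiter(
    map(avg_or_zero, image1.ravel(), image2.ravel()),
    dtype=np.int16,
    count=image1.size,
).reshape(image1.shape)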
Explicit for loops are very inefficient in Python in general, not only for numpy operations. Fortunately, there are faster ways to solve our problem. If memory is not an issue, this solution is quite good:
import numpy as np
new_image = np.zeros((1024, 1024), dtype=np.int16)
valid = (image1 != 0) & (image2 != 0)
new_image[valid] = (image1 + image2)[valid] // 2
Another solution uses masked arrays, which do not create copies of the data (they are views of the original image1/image2):
m1 = np.ma.masked_equal(image1, 0)
m2 = np.ma.masked_equal(image2, 0)
new_image = ((m1 + m2) // 2).filled(0)
Update: The first solution seems to be 3 times faster than the second for arrays with about 1000 non-zero entries.
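For reference, a minimal timing sketch for reproducing the comparison (with hypothetical random test images, so the number of non-zero entries differs from the case measured above):
import timeit
import numpy as np

# Hypothetical test images; adjust the amount of zeros to match your data.
rng = np.random.default_rng(0)
image1 = rng.integers(0, 100, (1024, 1024), dtype=np.int16)
image2 = rng.integers(0, 100, (1024, 1024), dtype=np.int16)

def boolean_mask_version():
    new_image = np.zeros((1024, 1024), dtype=np.int16)
    valid = (image1 != 0) & (image2 != 0)
    new_image[valid] = (image1 + image2)[valid] // 2
    return new_image

def masked_array_version():
    m1 = np.ma.masked_equal(image1, 0)
    m2 = np.ma.masked_equal(image2, 0)
    return ((m1 + m2) // 2).filled(0)

print(timeit.timeit(boolean_mask_version, number=10))
print(timeit.timeit(masked_array_version, number=10))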
Element-wise access to a numpy array seems slow; I can't see any obvious reason for it, but you can see it clearly by constructing a simple example:
import time
import numpy

# numpy version
def at(s, n):
    t1 = time.time()
    a = numpy.zeros(s, dtype=numpy.int32)
    for i in range(n):
        a[i % s] = n
    t2 = time.time()
    return t2 - t1

# native list version
def an(s, n):
    t1 = time.time()
    a = [i for i in range(s)]
    for i in range(n):
        a[i % s] = n
    t2 = time.time()
    return t2 - t1
# test
[at(100000,1000000),an(100000,1000000)]
Result: [0.21972250938415527, 0.15950298309326172]
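For comparison, a vectorized fill in the spirit the other answers recommend (an illustrative sketch, not part of the benchmark above; it produces the same array because every write stores the same value n):
# vectorized version: one fancy-indexed assignment instead of a Python loop
def av(s, n):
    t1 = time.time()
    a = numpy.zeros(s, dtype=numpy.int32)
    a[numpy.arange(n) % s] = n
    t2 = time.time()
    return t2 - t1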
I wrote the following function, which takes as inputs three 1D arrays (namely int_array, x, and y) and a number lim. The output is a number as well.
def integrate_to_lim(int_array, x, y, lim):
    if lim >= np.max(x):
        res = 0.0
    elif lim <= np.min(x):
        res = int_array[0]
    else:
        index = np.argmax(x > lim)  # find the first element of x larger than lim
        partial = int_array[index]
        slope = (y[index - 1] - y[index]) / (x[index - 1] - x[index])
        rest = (x[index] - lim) * (y[index] + (lim - x[index]) * slope / 2.0)
        res = partial + rest
    return res
Basically, outside of the limit cases lim >= np.max(x) and lim <= np.min(x), the idea is that the function finds the index of the first value of x larger than lim and then uses it for some simple calculations.
In my case, however, lim can also be a fairly big 2D array (shape of roughly 2000 by 1000 elements).
I would like to rewrite it such that it makes the same calculations for the case that lim is a 2D array.
Obviously, the output should also be a 2D array of the same shape of lim.
I am having a real struggle figuring out how to vectorize it.
I would like to stick only to the numpy package.
PS: I want to vectorize my function because efficiency is important, and as I understand it, using for loops is not a good choice in this regard.
Edit: my attempt
I was not aware of the function np.take, which made the task way easier.
Here is my brutal attempt that seems to work (suggestions on how to clean up or to make the code faster are more than welcome).
def integrate_to_lim_vect(int_array, x, y, lim_mat):
    lim_mat = np.asarray(lim_mat)  # make sure that it is an array
    shape_3d = list(lim_mat.shape) + [1]
    x_3d = np.ones(shape_3d) * x  # 3-dimensional version of x
    lim_3d = np.expand_dims(lim_mat, axis=2) * np.ones(x_3d.shape)  # also 3d
    # I use np.argmax on the 3d matrices (is there a simpler way?)
    index_mat = np.argmax(x_3d > lim_3d, axis=2)
    # Silly calculations
    partial = np.take(int_array, index_mat)
    y1_mat = np.take(y, index_mat)
    y2_mat = np.take(y, index_mat - 1)
    x1_mat = np.take(x, index_mat)
    x2_mat = np.take(x, index_mat - 1)
    slope = (y1_mat - y2_mat) / (x1_mat - x2_mat)
    rest = (x1_mat - lim_mat) * (y1_mat + (lim_mat - x1_mat) * slope / 2.0)
    res = partial + rest
    # Handle the limit cases with np.select
    condlist = [lim_mat >= np.max(x), lim_mat <= np.min(x)]
    choicelist = [0.0, int_array[0]]  # should these options be 2D matrices?
    output = np.select(condlist, choicelist, default=res)
    return output
I am aware that if the limit is larger than the maximum value in the array np.argmax returns the index zero (leading to wrong results). This is why I used np.select to check and correct for these cases.
Is it necessary to define the three-dimensional matrices x_3d and lim_3d, or is there a simpler way to find the 2D matrix of indices index_mat?
Suggestions, especially to improve the way I expanded the dimension of the arrays, are welcome.
I think you can solve this using two tricks. First, a 2d array can be easily flattened to a 1d array, and then your answers can be converted back into a 2d array with reshape.
Next, your use of argmax suggests that your array is sorted. Then you can find your full set of indices using digitize. Thus instead of a single index, you will get a complete array of indices. All the calculations you are doing are intrinsically supported as array operations in numpy, so that should not cause any problems.
You will have to specifically look at the limiting cases. If those are rare enough, then it might be okay to let the answers be derived by the default formula (they will be garbage values), and then replace them with the actual values you desire.
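A minimal sketch of that approach, assuming x is sorted in ascending order and using a hypothetical function name:
import numpy as np

def integrate_to_lim_digitize(int_array, x, y, lim_mat):
    lim_flat = np.ravel(lim_mat)  # flatten the 2D limits to 1D
    # For each limit, np.digitize returns the index of the first element of x
    # strictly larger than it -- the same index np.argmax(x > lim) gives.
    index = np.digitize(lim_flat, x)
    # Clip so the out-of-range cases still index safely; their (garbage)
    # values are overwritten with the correct limiting values below.
    idx = np.clip(index, 1, len(x) - 1)
    partial = int_array[idx]
    slope = (y[idx - 1] - y[idx]) / (x[idx - 1] - x[idx])
    rest = (x[idx] - lim_flat) * (y[idx] + (lim_flat - x[idx]) * slope / 2.0)
    res = partial + rest
    # Replace the limiting cases with the values the scalar function returns.
    res[lim_flat >= x[-1]] = 0.0
    res[lim_flat <= x[0]] = int_array[0]
    return res.reshape(np.shape(lim_mat))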
I have got a problem to solve and cannot come up with a good solution.
To simplify it, I have a 10x10 array and I want to slice out "little arrays" of 3x3. Right now I do this the following way:
array = np.arange(100).reshape((10, 10))
patch = np.array(array[:3, :3])
for n in range(3, 10, 3):
    for m in range(3, 10, 3):
        patch = np.append(patch, array[n:n+3, m:m+3])
I basically create the numpy array patch from the first slice and append all the other slices afterwards. The problem is that this is horribly slow and does not make good use of numpy's slicing capabilities. I need to do this for a large number of much bigger arrays.
Can anyone give me any advice on how to make this more efficient?
1000 thanks!
Your problem is entirely down to using numpy.append. append creates a new array each time you use it. As your patch array gets bigger this will take progressively longer.
Instead, use a presized array (you already know the final size of the patch array), and avoid making intermediary copies of any data.
# setup
x, y = 999, 999
array = np.arange(x * y)
array.shape = x, y
little_array_size = 3

# create an array of "little arrays"
patch = np.empty(array.size, dtype=int)
patch.shape = -1, little_array_size, little_array_size

i = 0
for n in range(0, array.shape[0], little_array_size):
    for m in range(0, array.shape[1], little_array_size):
        # slicing returns a view, so data is copied straight from array to patch
        patch[i, :] = array[n:n+little_array_size, m:m+little_array_size]
        i += 1

patch.shape = -1  # flattens the patch array
The above takes about a third of a second on my computer (two orders of magnitude faster than using numpy.append, which takes 20+ seconds).
I am trying to vectorize an operation with numpy. I profiled the Python script that uses it and found this operation to be the bottleneck, so it needs to be optimized since I will run it many times.
The operation is on a data set of two parts. First, a large set (n) of 1D vectors of different lengths (with maximum length, Lmax) whose elements are integers from 1 to maxvalue. The set of vectors is arranged in a 2D array, data, of size (num_samples,Lmax) with trailing elements in each row zeroed. The second part is a set of scalar floats, one associated with each vector, that I have a computed and which depend on its length and the integer-value at each position. The set of scalars is made into a 1D array, Y, of size num_samples.
The desired operation is to form the average of Y over the n samples, as a function of (value,position along length,length).
This entire operation can be vectorized in MATLAB using the accumarray function, with three 2D arrays of the same size as data whose elements are the corresponding value, position, and length indices of the desired final array:
sz_Y = num_samples;
sz_len = Lmax;
sz_pos = Lmax;
sz_val = maxvalue;
ind_len = repmat( 1:sz_len, 1, num_samples );
ind_pos = repmat( 1:sz_pos, num_samples, 1 );
ind_val = data;
ind_Y = repmat( (1:sz_Y)', 1, Lmax );
copiedY = Y(ind_Y);
mask = data > 0;
finalarr = accumarray( {ind_val(mask), ind_pos(mask), ind_len(mask)}, copiedY(mask), [sz_val sz_pos sz_len] ) / sz_val;
I was hoping to emulate this implementation with np.bincount. However, np.bincount differs from accumarray in two relevant ways:
both arguments must be of same 1D size, and
there is no option to choose the shape of the output array.
In the above usage of accumarray, the list of indices, {ind_val(mask), ind_pos(mask), ind_len(mask)}, is a 1D cell array of 1x3 arrays used as index tuples, while in np.bincount it must be a 1D array of scalars, as far as I understand. I expect np.ravel may be useful, but I am not sure how to use it here to do what I want. I am coming to Python from MATLAB and some things do not translate directly, e.g. the colon operator, which flattens in the opposite (column-major) order to numpy's ravel. So my question is: how might I use np.bincount, or any other numpy method, to achieve an efficient Python implementation of this operation?
EDIT: To avoid wasting time: for these multi-dimensional index problems with complicated index manipulation, is the recommended route to just use Cython to implement the loops explicitly?
EDIT2: Alternative Python implementation I just came up with.
Here is a heavy-RAM solution:
First precalculate:
Using index units for length (i.e., length 1 = index 0), make a 4D bool array of size (num_samples, Lmax+1, Lmax+1, maxvalue+1), holding where the conditions are satisfied for each value in Y.
ALLcond = np.zeros((num_samples, Lmax+1, Lmax+1, maxvalue+1), dtype='bool')
for l in range(Lmax+1):
    for i in range(Lmax+1):
        for v in range(maxvalue+1):
            ALLcond[:, l, i, v] = (data[:, i] == v) & (Lvec == l)
Where Lvec=[len(row) for row in data]. Then get the indices for these using np.where and initialize a 4D float array into which you will assign the values of Y:
[ind_Y, ind_len, ind_pos, ind_val] = np.where(ALLcond)
Yval = np.zeros(np.shape(ALLcond), dtype='float')
Now in the loop in which I have to perform the operation, I compute it with the two lines:
Yval[ind_Y, ind_len, ind_pos, ind_val] = Y[ind_Y]
Y_avg = np.sum(Yval, axis=0) / num_samples
This gives a factor of 4 or so speed-up over the direct loop implementation. I was expecting more. Perhaps this is a more tangible implementation for Python heads to digest. Any faster suggestions are welcome :)
One way is to convert the 3 "indices" to a linear index and then apply bincount. Numpy's ravel_multi_index is essentially the same as MATLAB's sub2ind. So the ported code could be something like:
shape = (Lmax+1, Lmax+1, maxvalue+1)
posvec = np.arange(1, Lmax+1)
ind_len = np.tile(Lvec[:,None], [1, Lmax])
ind_pos = np.tile(posvec, [n, 1])
ind_val = data
Y_copied = np.tile(Y[:,None], [1, Lmax])
mask = posvec <= Lvec[:,None] # fill-value independent
lin_idx = np.ravel_multi_index((ind_len[mask], ind_pos[mask], ind_val[mask]), shape)
Y_avg = np.bincount(lin_idx, weights=Y_copied[mask], minlength=np.prod(shape)) / n
Y_avg.shape = shape
This assumes data has shape (n, Lmax), Lvec is a NumPy array, etc. You may need to adapt the code a little to get rid of off-by-one errors.
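For concreteness, a hypothetical toy setup matching the layout the snippet assumes (rows zero-padded to Lmax, values in 1..maxvalue, Lvec holding the true lengths):
import numpy as np

# Hypothetical toy data; only the layout matters here.
n, Lmax, maxvalue = 4, 5, 3
Lvec = np.array([5, 3, 4, 2])                 # true length of each sample
data = np.zeros((n, Lmax), dtype=int)         # zero-padded value array
for row, length in enumerate(Lvec):
    data[row, :length] = np.random.randint(1, maxvalue + 1, size=length)
Y = np.random.rand(n)                         # one scalar per sample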
One could argue that the tile operations are not very efficient and not very "numpythonic". Something with broadcast_arrays could be nice, but I think I prefer this way:
shape = (Lmax+1, Lmax+1, maxvalue+1)
posvec = np.arange(1, Lmax+1)
mask = posvec <= Lvec[:,None]  # fill-value independent
len_idx = np.repeat(Lvec, Lvec)
pos_idx = np.broadcast_to(posvec, data.shape)[mask]
val_idx = data[mask]
Y_copied = np.repeat(Y, Lvec)
lin_idx = np.ravel_multi_index((len_idx, pos_idx, val_idx), shape)
Y_avg = np.bincount(lin_idx, weights=Y_copied, minlength=np.prod(shape)) / n
Y_avg.shape = shape
Note broadcast_to was added in Numpy 1.10.0.
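On older versions, np.broadcast_arrays can stand in for it, e.g.:
# Equivalent to np.broadcast_to(posvec, data.shape)[mask] on NumPy < 1.10:
pos_idx = np.broadcast_arrays(posvec, data)[0][mask]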
I have a 2D numpy array X of shape (xrow, xcol), and I want to take the dot product of every pair of rows to obtain another array P of shape (xrow, xrow).
The code looks like the following:
P = np.zeros((xrow, xrow))
for i in range(xrow):
    for j in range(xrow):
        P[i, j] = np.dot(X[i], X[j])
which works well if the array X is small but takes a lot of time for huge X. Is there any way to make it faster or do it more pythonically so that it is fast?
That is obtained by doing result = X.dot(X.T)
When the array becomes large, it can be done by blocks, but depending on your numpy backend this should already parallelize threadwise as much as possible. It seems that this is what you are looking for.
If for some reason you don't want to rely on that, and finally do resort to multiprocessing, you can try something along the lines of
import numpy as np
from joblib import Parallel, delayed

X = np.random.randn(1000, 100000)
block_size = 10000

products = Parallel(n_jobs=10)(
    delayed(np.dot)(X[:, pos:pos + block_size], X.T[pos:pos + block_size])
    for pos in range(0, X.shape[1], block_size)
)
product = np.sum(products, axis=0)
I don't think this is useful for relatively small arrays. And threading can sometimes take care of this better as well.
This is 10% faster on my machine as it avoids loops:
numpy.matrix(X) * numpy.matrix(X.T)
but there is still 50% redundancy, since the result is symmetric and both halves are computed.
As part of a complex task, I need to compute matrix cofactors. I did this in a straightforward way using this nice code for computing matrix minors. Here is my code:
def matrix_cofactor(matrix):
    C = np.zeros(matrix.shape)
    nrows, ncols = C.shape
    for row in xrange(nrows):
        for col in xrange(ncols):
            minor = matrix[np.array(range(row) + range(row+1, nrows))[:, np.newaxis],
                           np.array(range(col) + range(col+1, ncols))]
            C[row, col] = (-1)**(row+col) * np.linalg.det(minor)
    return C
It turns out that this matrix cofactor code is the bottleneck, and I would like to optimize the code snippet above. Any ideas as to how to do this?
If your matrix is invertible, the cofactor is related to the inverse:
def matrix_cofactor(matrix):
return np.linalg.inv(matrix).T * np.linalg.det(matrix)
This gives large speedups (~ 1000x for 50x50 matrices). The main reason is fundamental: this is an O(n^3) algorithm, whereas the minor-det-based one is O(n^5).
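A quick sanity check of the identity on a small random matrix (a hypothetical snippet, comparing against cofactors computed directly from minors):
import numpy as np

A = np.random.rand(5, 5)

# Inverse/determinant-based cofactors.
C_fast = np.linalg.inv(A).T * np.linalg.det(A)

# Direct definition: delete row i and column j, take the determinant.
C_slow = np.zeros_like(A)
for i in range(5):
    for j in range(5):
        minor = np.delete(np.delete(A, i, axis=0), j, axis=1)
        C_slow[i, j] = (-1) ** (i + j) * np.linalg.det(minor)

print(np.allclose(C_fast, C_slow))  # expected: True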
This probably means that also for non-invertible matrices there is some clever way to calculate the cofactor (i.e., not use the mathematical formula that you use above, but some other equivalent definition).
If you stick with the det-based approach, what you can do is the following:
The majority of the time seems to be spent inside det. (Check out line_profiler to find this out yourself.) You can try to speed that part up by linking Numpy with the Intel MKL, but other than that, there is not much that can be done.
You can speed up the other part of the code like this:
minor = np.zeros([nrows-1, ncols-1])
for row in xrange(nrows):
    for col in xrange(ncols):
        minor[:row, :col] = matrix[:row, :col]
        minor[row:, :col] = matrix[row+1:, :col]
        minor[:row, col:] = matrix[:row, col+1:]
        minor[row:, col:] = matrix[row+1:, col+1:]
        ...
This saves some 10-50% of the total runtime, depending on the size of your matrices. The original code uses Python range and list manipulations, which are slower than direct slice indexing. You could also try to be more clever and copy only the parts of the minor that actually change; however, already after the above change, close to 100% of the time is spent inside numpy.linalg.det, so further optimization of the other parts does not make much sense.
The calculation of np.array(range(row)+range(row+1,nrows))[:,np.newaxis] does not depend on col, so you can move that outside the inner loop and cache the value. Depending on the number of columns you have, this might give a small optimization, as sketched below.
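A sketch of that caching inside the question's matrix_cofactor loops (keeping the original Python 2 style):
for row in xrange(nrows):
    # computed once per row instead of once per (row, col) pair
    row_idx = np.array(range(row) + range(row+1, nrows))[:, np.newaxis]
    for col in xrange(ncols):
        minor = matrix[row_idx, np.array(range(col) + range(col+1, ncols))]
        C[row, col] = (-1)**(row+col) * np.linalg.det(minor)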
Instead of using the inverse and determinant, I'd suggest using the SVD
def cofactors(A):
    U, sigma, Vt = np.linalg.svd(A)
    N = len(sigma)
    g = np.tile(sigma, N)
    g[::(N+1)] = 1
    G = np.diag(-(-1)**N * np.prod(np.reshape(g, (N, N)), 1))
    return U @ G @ Vt
from sympy import *
A = Matrix([[1,2,0],[0,3,0],[0,7,1]])
A.adjugate().T
And the output (which is the cofactor matrix) is:
Matrix([
[ 3, 0, 0],
[-2, 1, -7],
[ 0, 0, 3]])