Iterating operation with two arrays using numpy - python

I'm working with two different arrays (75x4), and I'm applying a shortest distance algorithm between the two arrays.
So I want to:
perform an operation between one row of the first array and every individual row of the second array, iterating to obtain 75 values
find the minimum of those values, and store that in a new array
repeat this with the second row of the first array, once again iterating the operation over all the rows of the second array, and again storing the minimum difference in the new array
How would I go about doing this with numpy?
Essentially I want to perform an operation between one row of array 1 and every row of array 2, find the minimum value, and store it in a new array; then do the same for the 2nd row of array 1, and so on for all 75 rows of array 1.
Here is the code for the formula I'm using. What I get here is just the distance between corresponding rows of array 1 (training data) and array 2 (testing data). What I'm looking for is to take one row of array 1, compute the distance to every row of array 2, store the minimum value in a new array, and then do the same for the next row of array 1, and so on.
arr_attributedifference = (arr_trainingdata - arr_testingdata)**2
arr_distance = np.sqrt(arr_attributedifference.sum(axis=1))

Here are two methods one using einsum, the other KDTree:
einsum does essentially what we could also achieve via broadcasting, for example np.einsum('ik,jk', A, B) is roughly equivalent to (A[:, None, :] * B[None, :, :]).sum(axis=2). The advantage of einsum is that it does the summing straight away, so it avoids creating an mxmxn intermediate array.
KDTree is more sophisticated. We have to invest upfront into generating the tree but afterwards querying nearest neighbors is very efficient.
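As a quick, illustrative sanity check of the einsum/broadcasting equivalence mentioned above (small random arrays, separate from the benchmark below):
import numpy as np

A = np.random.randn(75, 4)
B = np.random.randn(75, 4)

gram = np.einsum('ik,jk', A, B)                                # (75, 75) matrix of row dot products
gram_broadcast = (A[:, None, :] * B[None, :, :]).sum(axis=2)   # same result via an explicit 75x75x4 intermediate
assert np.allclose(gram, gram_broadcast)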
import numpy as np
from scipy.spatial import cKDTree as KDTree

def f_einsum(A, B):
    # ||B_j||^2 / 2 - A_i . B_j has the same argmin over j as ||A_i - B_j||^2
    B2AB = np.einsum('ij,ij->i', B, B) / 2 - np.einsum('ik,jk', A, B)
    idx = B2AB.argmin(axis=1)        # index of the nearest row of B for each row of A
    D = A - B[idx]
    return np.sqrt(np.einsum('ij,ij->i', D, D)), idx

def f_KDTree(A, B):
    T = KDTree(B)
    return T.query(A, 1)

m, n = 75, 4
A, B = np.random.randn(2, m, n)

de, ie = f_einsum(A, B)
dt, it = f_KDTree(A, B)
assert np.all(ie == it) and np.allclose(de, dt)

from timeit import timeit

for m, n in [(75, 4), (500, 4)]:
    A, B = np.random.randn(2, m, n)
    print(m, n)
    print('einsum:', timeit("f_einsum(A, B)", globals=globals(), number=1000))
    print('KDTree:', timeit("f_KDTree(A, B)", globals=globals(), number=1000))
Sample run:
75 4
einsum: 0.067826496087946
KDTree: 0.12196151306852698
500 4
einsum: 3.1056990439537913
KDTree: 0.85108971898444
We can see that at small problem sizes the direct method (einsum) is faster, while at larger problem sizes KDTree wins.
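For completeness, the question's own two lines can be turned into the full pairwise computation with plain broadcasting; this is a sketch using the question's variable names on stand-in random data:
import numpy as np

# stand-ins for the question's (75, 4) training and testing arrays
arr_trainingdata = np.random.rand(75, 4)
arr_testingdata = np.random.rand(75, 4)

# (75, 75, 4): squared difference of every training row against every testing row
arr_attributedifference = (arr_trainingdata[:, None, :] - arr_testingdata[None, :, :])**2
# (75, 75): Euclidean distance between every pair of rows
arr_distance = np.sqrt(arr_attributedifference.sum(axis=2))
# (75,): smallest distance from each training row to any testing row
arr_mindistance = arr_distance.min(axis=1)
This is the most direct fix, but note that it materializes the full m x m x n intermediate that the einsum version above deliberately avoids.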

Related

Applying mathematical operation between rows of two numpy arrays

Let's assume we have two numpy arrays A (n1xm) and B (n2xm) and I want to apply a certain mathematical operation between the rows of both tables.
For example, let's say that we want to calculate the Euclidean distance between each row of A and each row of B and store it at a new numpy table C (n1xn2).
The simple for-loop approach would be something like the following:
C = np.zeros((A.shape[0], B.shape[0]))
for i in range(A.shape[0]):
    for j in range(B.shape[0]):
        C[i, j] = np.linalg.norm(A[i] - B[j])
However, the above implementation is not the most efficient. How could I write this differently using vectorization to speed up the implementation?
You can broadcast over a new axis:
# n1 x m x n2
diff = A[:, :, None] - B[:, :, None].T
# n1 x n2 after summing across m
dists = np.sqrt((diff * diff).sum(1))
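One way to sanity-check that vectorized version is to compare it against scipy.spatial.distance.cdist, which computes the same pairwise Euclidean distances; a small sketch:
import numpy as np
from scipy.spatial.distance import cdist

n1, n2, m = 6, 4, 3
A = np.random.rand(n1, m)
B = np.random.rand(n2, m)

diff = A[:, :, None] - B[:, :, None].T        # n1 x m x n2
dists = np.sqrt((diff * diff).sum(1))         # n1 x n2

assert np.allclose(dists, cdist(A, B))        # cdist defaults to the Euclidean metric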

Replace rows in a numpy 2d array with rows from another 2d array

I have two 2d arrays, let's call them img (m * n) and means (k * n), and a list, let's call this clusters (m, ). I'd like to replace rows in the img array with rows from the means array where I select the row from the means array based on the value in the clusters list. For example:
Suppose, img
img = np.array([[0.40784314, 0.48627451, 0.52549022],
                [0.05490196, 0.1254902, 0.2]])  # This will be a (m * n) array
And, means
means = np.array([[0.80551694, 0.69010299, 0.17438512],
                  [0.33569541, 0.45309059, 0.52275014]])  # (k * n) array
And, the clusters list
clusters = [1, 0]  # list of length m
The desired output is
[[0.33569541 0.45309059 0.52275014]
 [0.80551694 0.69010299 0.17438512]]  # This should be a (m * n) array, the same shape as img above
Notice that the first row has been replaced with the second row from the means array because clusters[0] == 1, and the second row has been replaced with the first row from the means array because clusters[1] == 0, and so on and so forth.
I am able to do this using the following line of code, but I was wondering whether there is a faster way of doing it.
np.array([means[i] for i in clusters])
What you're looking for is called advanced indexing:
>>> means[clusters]
array([[0.33569541, 0.45309059, 0.52275014],
[0.80551694, 0.69010299, 0.17438512]])
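Putting it together with the arrays from the question (the plain Python list works directly as the index; NumPy converts it to an integer array):
import numpy as np

img = np.array([[0.40784314, 0.48627451, 0.52549022],
                [0.05490196, 0.1254902, 0.2]])
means = np.array([[0.80551694, 0.69010299, 0.17438512],
                  [0.33569541, 0.45309059, 0.52275014]])
clusters = [1, 0]

new_img = means[clusters]                      # rows of means selected per cluster label
assert new_img.shape == img.shape
assert np.array_equal(new_img, np.array([means[i] for i in clusters]))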

Numpy: Vectorize np.argwhere

I have the following data structures in numpy:
import numpy as np
a = np.random.rand(267, 173) # dense img matrix
b = np.random.rand(199) # array of probability samples
My goal is to take each entry i in b, find the x,y coordinates/index positions of all values in a that are <= i, then randomly select one of the values in that subset:
from random import randint
for i in b:
l = np.argwhere(a <= i) # list of img coordinates where pixel <= i
sample = l[randint(0, len(l)-1)] # random selection from `l`
This "works", but I'd like to vectorize the sampling operation (i.e. replace the for loop with apply_along_axis or similar). Does anyone know how this can be done? Any suggestions would be greatly appreciated!
You can't exactly vectorize np.argwhere because you have a random subset size every time. What you can do, though, is speed up the computation pretty dramatically with sorting. Sorting the image once creates a single allocation, while masking the image at every step creates a temporary array for the mask and another for the extracted elements. With a sorted image, you can just apply np.searchsorted to get the sizes:
a_sorted = np.sort(a.ravel())
indices = np.searchsorted(a_sorted, b, side='right')
You still need a loop to do the sampling, but you can do something like
samples = np.array([a_sorted[np.random.randint(i)] for i in indices])
Getting x-y coordinates instead of sample values is a bit more complicated with this system. You can use np.unravel_index to get the indices, but first you must convert from the reference frame of a_sorted to a.ravel(). If you sort using np.argsort instead of np.sort, you can get the indices in the original array. Fortunately, np.searchsorted supports this exact scenario with the sorter parameter:
a_ind = np.argsort(a, axis=None)
indices = np.searchsorted(a.ravel(), b, side='right', sorter=a_ind)
r, c = np.unravel_index(a_ind[[np.random.randint(i) for i in indices]], a.shape)
r and c are the same size as b, and correspond to the row and column indices in a of each selection based on b. The index conversion depends on the strides in your array, so we'll assume that you're using C order, as 90% of arrays will do by default.
Complexity
Let's say b has size M and a has size N.
Your current algorithm does a linear search through each element of a for each element of b. At each iteration, it allocates a mask for the matching elements (N/2 on average), and then a buffer of the same size to hold the masked choices. This means that the time complexity is on the order of O(M * N) and the space complexity is the same.
My algorithm sorts a first, which is O(N log N). Then it searches for M insertion points, which is O(M log N). Finally, it selects M samples. The space it allocates is one sorted copy of the image and two arrays of size M. It is therefore of O((M + N) log N) time complexity and O(M + N) in space.
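A minimal end-to-end check of the coordinate version above (a sketch; it assumes every value in b is at least as large as the smallest pixel, since np.random.randint(0) would raise an error):
import numpy as np

a = np.random.rand(267, 173)                   # dense img matrix
b = np.random.rand(199)                        # array of probability samples

a_ind = np.argsort(a, axis=None)
indices = np.searchsorted(a.ravel(), b, side='right', sorter=a_ind)
r, c = np.unravel_index(a_ind[[np.random.randint(i) for i in indices]], a.shape)

assert np.all(a[r, c] <= b)                    # each sampled pixel respects its threshold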
Here is an alternative approach argsorting b instead and then binning a accordingly using np.digitize and this post:
import numpy as np
from scipy import sparse
from timeit import timeit
import math
def h_digitize(a, bs, right=False):
    # tweaked replacement for np.digitize: pre-bin into fine uniform bins,
    # falling back to np.digitize only for the ambiguous entries
    mx, mn = a.max(), a.min()
    asz = mx - mn
    bsz = bs[-1] - bs[0]
    nbins = int(bs.size * math.sqrt(bs.size) * asz / bsz)
    bbs = np.concatenate([[0], ((nbins-1)*(bs-mn)/asz).astype(int).clip(0, nbins), [nbins]])
    bins = np.repeat(np.arange(bs.size + 1), np.diff(bbs))
    bbs = bbs[:bbs.searchsorted(nbins)]
    bins[bbs] = -1
    aidx = bins[((nbins-1)*(a-mn)/asz).astype(int)]
    ambig = aidx == -1
    aa = a[ambig]
    if aa.size:
        aidx[ambig] = np.digitize(aa, bs, right)
    return aidx

def f_pp():
    bo = b.argsort()
    bs = b[bo]
    aidx = h_digitize(a, bs, right=True).ravel()
    aux = sparse.csr_matrix((aidx, aidx, np.arange(aidx.size + 1)),
                            (aidx.size, b.size + 1)).tocsc()
    ridx = np.empty(b.size, int)
    ridx[bo] = aux.indices[np.fromiter(map(np.random.randint, aux.indptr[1:-1].tolist()),
                                       int, b.size)]
    return np.unravel_index(ridx, a.shape)

def f_mp():
    a_ind = np.argsort(a, axis=None)
    indices = np.searchsorted(a.ravel(), b, sorter=a_ind, side='right')
    return np.unravel_index(a_ind[[np.random.randint(i) for i in indices]], a.shape)

a = np.random.rand(267, 173)  # dense img matrix
b = np.random.rand(199)       # array of probability samples

# round to test whether equality is handled correctly
a = np.round(a, 3)
b = np.round(b, 3)

print('pp', timeit(f_pp, number=1000), 'ms')
print('mp', timeit(f_mp, number=1000), 'ms')

# sanity checks
S = np.max([a[f_pp()] for _ in range(1000)], axis=0)
T = np.max([a[f_mp()] for _ in range(1000)], axis=0)
print(f"inequality satisfied: pp {(S<=b).all()} mp {(T<=b).all()}")
print(f"largest smallest distance to boundary: pp {(b-S).max()} mp {(b-T).max()}")
print(f"equality done right: pp {not (b-S).all()} mp {not (b-T).all()}")
Using a tweaked digitize I'm a bit faster, but this may vary with problem size. Also, #MadPhysicist's solution is much less convoluted. With the standard digitize we are about equal.
pp 2.620121960993856 ms
mp 3.301037881989032 ms
inequality satisfied: pp True mp True
largest smallest distance to boundary: pp 0.0040000000000000036 mp 0.006000000000000005
equality done right: pp True mp True
A slight improvement on #MadPhysicist 's algorithm to make it more vectorized:
%%timeit
a_ind = np.argsort(a, axis=None)
indices = np.searchsorted(a.ravel(), b, sorter=a_ind)
r, c = np.unravel_index(a_ind[[np.random.randint(i) for i in indices]], a.shape)
100 loops, best of 3: 6.32 ms per loop
%%timeit
a_ind = np.argsort(a, axis=None)
indices = np.searchsorted(a.ravel(), b, sorter=a_ind)
r, c = np.unravel_index(a_ind[(np.random.rand(indices.size) * indices).astype(int)], a.shape)
100 loops, best of 3: 4.16 ms per loop
#PaulPanzer 's solution still rules the field, though I'm not sure what it's caching:
%timeit f_pp()
The slowest run took 14.79 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 1.88 ms per loop
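As a further aside (not from the original answers), NumPy's newer Generator API can vectorize that per-element draw directly, because Generator.integers broadcasts over an array of upper bounds; this sketch assumes every entry of indices is at least 1:
import numpy as np

rng = np.random.default_rng()

a = np.random.rand(267, 173)
b = np.random.rand(199)

a_ind = np.argsort(a, axis=None)
indices = np.searchsorted(a.ravel(), b, sorter=a_ind)
picks = rng.integers(0, indices)               # one draw in [0, indices[k]) per element, no Python loop
r, c = np.unravel_index(a_ind[picks], a.shape)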

Permute rows in "slices" of 3d array to match each other

I have a series of 2d arrays where the rows are points in some space. Many similar points occur across all arrays but in different row order. I want to sort the rows so they have the most similar order. Also the points are too different for clustering with K-means or DBSCAN. The problem can also be cast like this. If I stack the arrays into a 3d array, how do I permute the rows to minimize the average standard deviation (SD) along the 2nd axis? What's a good sorting algorithm for this problem?
I've tried the following approaches.
Create a reference 2d array and sort the rows in each array to minimize the mean Euclidean distances to that reference array. This, I am afraid, gives biased results.
Sort rows in arrays pairwise, then pairs of pair-medians, then pairs of that, etc... This doesn't really work and I'm not sure why.
A third approach could be just brute force optimization but I try to avoid that since I have multiple sets of arrays to perform the procedure on.
This is my code for the 2nd approach (Python):
import numpy as np

def reorder_to(A, B):
    """Reorder rows in A to best match rows in B.
    Input
    -----
    A : N x M numpy.array
    B : N x M numpy.array
    Output
    ------
    perm_order : permutation order
    """
    if A.shape != B.shape:
        print("A and B must have the same shape")
        return None
    N = A.shape[0]
    # Create a matrix of distances between rows in A and rows in B
    distance_matrix = np.ones((N, N)) * np.inf
    for i, a in enumerate(A):
        for ii, b in enumerate(B):
            ba = (b - a)
            distance_matrix[i, ii] = np.sqrt(np.dot(ba, ba))
    # Choose permutation order by smallest distances first
    perm_order = [[] for _ in range(N)]
    for _ in range(N):
        ind = np.argmin(distance_matrix)
        i, ii = ind // N, ind % N
        perm_order[ii] = i
        distance_matrix[i, :] = np.inf
        distance_matrix[:, ii] = np.inf
    return perm_order

def permute_tensor_rows(A):
    """Permute 1d rows in 3d array along the 0th axis to minimize average SD along 2nd axis.
    Input
    -----
    A : numpy.3darray
        Each "slice" in the 2nd direction is an independent array whose rows can be permuted
        to decrease the average SD in the 2nd direction.
    Output
    ------
    A : numpy.3darray
        A with sorted rows in each "slice".
    """
    step = 2
    while step <= A.shape[2]:
        for k in range(0, A.shape[2], step):
            # If last, reorder to previous
            if k + step > A.shape[2]:
                A_kk = A[:, :, k:(k + step)]
                kk_order = reorder_to(np.median(A_kk, axis=2), np.median(A_k, axis=2))
                A[:, :, k:(k + step)] = A[kk_order, :, k:(k + step)]
                continue
            k_0, k_1 = k, k + step // 2
            kk_0, kk_1 = k + step // 2, k + step
            A_k = A[:, :, k_0:k_1]
            A_kk = A[:, :, kk_0:kk_1]
            order = reorder_to(np.median(A_k, axis=2), np.median(A_kk, axis=2))
            A[:, :, k_0:k_1] = A[order, :, k_0:k_1]
        print("Step:", step, "\t ... Average SD:", np.mean(np.std(A, axis=2)))
        step *= 2
    return A
Sorry I should have looked at your code sample; that was very informative.
Seems like this here gives an out-of-the-box solution to your problem:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linear_sum_assignment.html#scipy.optimize.linear_sum_assignment
Only really feasible for a few 100 points at most though, in my experience.
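A minimal sketch of how scipy.optimize.linear_sum_assignment could stand in for reorder_to above (reorder_to_lsa is a hypothetical name; it returns the same kind of permutation, an A-row index for each row of B, using Euclidean row distances as the assignment cost):
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def reorder_to_lsa(A, B):
    """Return, for each row of B, the index of the A row assigned to it,
    minimizing the total Euclidean distance between paired rows."""
    cost = cdist(A, B)                          # N x N matrix of row-to-row distances
    row_ind, col_ind = linear_sum_assignment(cost)
    perm = np.empty(len(col_ind), dtype=int)
    perm[col_ind] = row_ind                     # invert: perm[j] = A row matched to B row j
    return perm

A = np.random.rand(5, 3)
B = A[np.random.permutation(5)]                 # B is a row-shuffled copy of A
order = reorder_to_lsa(A, B)
assert np.allclose(A[order], B)                 # rows line up again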

How to vectorize 3D Numpy arrays

I have a 3D numpy array like a = np.zeros((100,100, 20)). I want to perform an operation over every x,y position that involves all the elements over the z axis and the result is stored in an array like b = np.zeros((100,100)) on the same corresponding x,y position.
Now i'm doing it using a for loop:
d_n = np.array([...])  # a parameter with the same shape as b
for (x, y), v in np.ndenumerate(b):
    C = a[x, y, :]
    ### calculate some_value using C
    minv = sys.maxint
    depth = -1
    C = a[x, y, :]
    for d in range(len(C)):
        e = 2.5 * float(math.pow(d_n[x, y] - d, 2)) + C[d] * 0.05
        if e < minv:
            minv = e
            depth = d
    some_value = depth
    if depth == -1:
        some_value = len(C) - 1
    ###
    b[x, y] = some_value
The problem is that this operation is much slower than ones done the pythonic way, e.g. c = b * b. (I profiled this function and it's around two orders of magnitude slower than operations using numpy built-in and vectorized functions, over a similar number of elements.)
How can I improve the performance of such kind of functions mapping a 3D array to a 2D one?
What is usually done in 3D images is to swap the Z axis to the first index:
>>> a = a.transpose((2,0,1))
>>> a.shape
(20, 100, 100)
And now you can easily iterate over the Z axis:
>>> for slice in a:
...     # do something with each 100x100 slice
The slice here will be each of your 100x100 fractions of your 3D matrix. Additionally, transposing allows you to access each of the 2D slices directly by indexing the first axis. For example a[10] will give you the 11th 2D 100x100 slice.
Bonus: If you store the data contiguously in that layout (or convert it with a = np.ascontiguousarray(a.transpose((2,0,1)))), access to your 2D slices will be faster since they are mapped contiguously in memory.
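A small sketch of that bonus point, checking the contiguity flags of the slices:
import numpy as np

a = np.zeros((100, 100, 20))
view = a.transpose((2, 0, 1))                   # just a view: 2D slices are strided, not contiguous
copy = np.ascontiguousarray(view)               # real copy: each 100x100 slice is one contiguous block

print(view[10].flags['C_CONTIGUOUS'])           # False
print(copy[10].flags['C_CONTIGUOUS'])           # True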
Obviously you want to get rid of the explicit for loop, but I think whether this is possible depends on what calculation you are doing with C. As a simple example,
a = np.zeros((100,100, 20))
a[:,:] = np.linspace(1,20,20) # example data: 1,2,3,.., 20 as "z" for every "x","y"
b = np.sum(a[:,:]**2, axis=2)
will fill the 100 by 100 array b with the sum of the squared "z" values of a, that is 1+4+9+...+400 = 2870.
If your inner calculation is sufficiently complex, and not amenable to vectorization, then your iteration structure is good, and does not contribute significantly to the calculation time
for (x, y), v in np.ndenumerate(b):
    C = a[x, y, :]
    ...
    for d in range(len(C)):
        ...  # complex, not vectorizable calc
    ...
    b[x, y] = some_value
There doesn't appear to be a special structure in the 1st 2 dimensions, so you could just as well think of it as 2D mapping on to 1D, e.g. mapping a (N,20) array onto a (N,) array. That doesn't speed up anything, but may help highlight the essential structure of the problem.
One step is to focus on speeding up that C to some_value calculation. There are functions like cumsum and cumprod that help you do sequential calculations on a vector. cython is also a good tool.
A different approach is to see if you can perform that internal calculation over the N values all at once. In other words, if you must iterate, it is better to do so over the smallest dimension.
In a sense this a non-answer. But without full knowledge of how you get some_value from C and d_n I don't think we can do more.
It looks like e can be calculated for all points at once:
e = 2.5 * float(math.pow(d_n[x,y] - d, 2)) + C[d] * 0.05
E = 2.5 * (d_n[...,None] - np.arange(a.shape[-1]))**2 + a * 0.05 # (100,100,20)
E.min(axis=-1) # smallest value along the last dimension
E.argmin(axis=-1) # index of where that min occurs
At first glance it looks like this E.argmin is the b value that you want (tweaked for some boundary conditions if needed).
I don't have realistic a and d_n arrays, but with simple test ones, this E.argmin(-1) matches your b, with a 66x speedup.
How can I improve the performance of such kind of functions mapping a 3D array to a 2D one?
Many functions in Numpy are "reduction" functions*, for example sum, any, std, etc. If you supply an axis argument other than None to such a function it will reduce the dimension of the array over that axis. For your code you can use the argmin function, if you first calculate e in a vectorized way:
d = np.arange(a.shape[2])
e = 2.5 * (d_n[...,None] - d)**2 + a*0.05
b = np.argmin(e, axis=2)
The indexing with [...,None] is used to engage broadcasting. The values in e are floating point values, so it's a bit strange to compare to sys.maxint but there you go:
I, J = np.indices(b.shape)
b[e[I,J,b] >= sys.maxint] = a.shape[2] - 1
* Strictly speaking, a reduction function is of the form reduce(operator, sequence), so technically std and argmin are not reductions
