Applying a mathematical operation between rows of two numpy arrays - python

Let's assume we have two numpy arrays A (n1xm) and B (n2xm), and I want to apply a certain mathematical operation between the rows of both arrays.
For example, let's say that we want to calculate the Euclidean distance between each row of A and each row of B and store it in a new numpy array C (n1xn2).
The simple for-loop approach would be something like the following:
C = np.zeros((A.shape[0], B.shape[0]))
for i in range(A.shape[0]):
    for j in range(B.shape[0]):
        C[i, j] = np.linalg.norm(A[i] - B[j])
However, the above implementation is not the most efficient. How could I write this differently, using vectorization to speed up the implementation?

You can broadcast over a new axis:
# n1 x m x n2
diff = A[:, :, None] - B[:, :, None].T
# n1 x n2 after summing across m
dists = np.sqrt((diff * diff).sum(1))
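As a sanity check (an illustrative sketch, not part of the original answer), the broadcast result can be compared against the loop and against scipy.spatial.distance.cdist, assuming small random inputs:
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # n1 x m
B = rng.standard_normal((7, 3))   # n2 x m

# broadcast version from the answer above
diff = A[:, :, None] - B[:, :, None].T   # n1 x m x n2
dists = np.sqrt((diff * diff).sum(1))    # n1 x n2

# reference loop
C = np.array([[np.linalg.norm(a - b) for b in B] for a in A])

assert np.allclose(dists, C)
assert np.allclose(dists, cdist(A, B))   # scipy equivalent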

Related

3D tensor of diagonal matrices

I have a matrix A with m rows and n columns. I want a 3D tensor of dimension n*m*m consisting of n diagonal matrices, one formed from each column of A. In other words, every column of A should be converted into a diagonal matrix, and all those matrices together should form a 3D tensor.
This is quite easy to do with a for loop, but I want to do it without one to improve speed.
I came up with an inefficient way which works, but I hope someone can help me find a better way that scales to large A matrices.
import numpy as np
n = A.shape[0]  # A is an n*k matrix
k = A.shape[1]
holding_matrix = np.repeat(np.identity(k), repeats=n, axis=1)  # k rows with n*k columns
identity_stack = np.tile(np.identity(n), k)  # k nxn identity matrices stacked together
B = np.array((A @ holding_matrix) * identity_stack)
B = np.array(np.hsplit(B, k))  # desired result: k n*n diagonal matrices in a tensor
One vectorized answer fills the flattened diagonals directly:
n = A.shape[0]  # A.shape == (n, k)
k = A.shape[1]
B = np.zeros_like(A, shape=(k, n*n))  # preserves dtype and memory order of A
B[:, ::(n+1)] = A.T  # every (n+1)-th entry of a flattened n x n matrix lies on the diagonal
B = B.reshape(k, n, n)
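For illustration (not part of the original answer), a minimal sketch, assuming a small random A, that checks this stride trick against a plain loop:
import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 3
A = rng.standard_normal((n, k))

# vectorized version from the answer above
B = np.zeros_like(A, shape=(k, n*n))
B[:, ::(n+1)] = A.T
B = B.reshape(k, n, n)

# loop reference: one diagonal matrix per column of A
B_loop = np.array([np.diag(A[:, j]) for j in range(k)])

assert np.allclose(B, B_loop)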

Want to define an ndarray in numpy elementwise

I have two 2d numpy arrays, A with shape (i, j) and B with shape (i, k), where j >> k. I want to define a new 3d array C such that each slice C[x] is the broadcast element-wise product of column x of A with the whole matrix B. In other words, as a normal Python loop I would do it like this:
C = np.empty((j, i, k))
for x in range(j):
    C[x] = A[:, x, None] * B  # column x of A (as a column vector) times B
However, j is very large in this case, and it would benefit me a lot if I could use NumPy's functionality to define the ndarray C like in my loop above, but without the explicit loop.
Thank you for your help
You can use broadcasting like this:
a.T[:, :, None] * b
Example:
import numpy as np
np.random.seed(444)
i, j, k = 2, 10, 3
a = np.random.randn(i, j)
b = np.random.randn(i, k)
c = a.T[:, :, None] * b
print(c.shape)
# (10, 2, 3)
The transpose stems from the fact that you want to operate on each column of a, and [:, :, None] adds a trailing axis to enable broadcasting, as explained in NumPy's broadcasting rules.
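To connect this back to the loop in the question, continuing the example above (an illustrative check, not part of the original answer):
# each slice c[x] equals column x of a multiplied element-wise with b
for x in range(j):
    assert np.allclose(c[x], a[:, x, None] * b)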

Iterating operation with two arrays using numpy

I'm working with two different arrays (75x4), and I'm applying a shortest distance algorithm between the two arrays.
So I want to:
1. perform an operation between one row of the first array and every individual row of the second array, iterating to obtain 75 values
2. find the minimum value, and store that in a new array
3. repeat this with the second row of the first array, once again iterating the operation over all the rows of the second array, and again storing the minimum difference in the new array
How would I go about doing this with numpy?
Essentially I want to perform an operation between one row of array 1 and every row of array 2, find the minimum value, and store that in a new array. Then do the very same thing for the 2nd row of array 1, and so on for all 75 rows of array 1.
Here is the code for the formula I'm using. What I get here is just the distance between each row of array 1 (training data) and the corresponding row of array 2 (testing data). But what I'm looking for is to take one row of array 1, iterate over all rows of array 2, store the minimum value in a new array, then do the same for the next row of array 1, and so on.
arr_attributedifference = (arr_trainingdata - arr_testingdata)**2
arr_distance = np.sqrt(arr_attributedifference.sum(axis=1))
Here are two methods, one using einsum, the other a KDTree:
einsum does essentially what we could also achieve via broadcasting; for example, np.einsum('ik,jk', A, B) is roughly equivalent to (A[:, None, :] * B[None, :, :]).sum(axis=2). The advantage of einsum is that it does the summing straight away, so it avoids creating an m x m x n intermediate array.
KDTree is more sophisticated. We have to invest upfront in building the tree, but afterwards querying nearest neighbors is very efficient.
import numpy as np
from scipy.spatial import cKDTree as KDTree

def f_einsum(A, B):
    # ||B_j||^2 / 2 - A_i . B_j has the same argmin over j as ||A_i - B_j||^2
    B2AB = np.einsum('ij,ij->i', B, B) / 2 - np.einsum('ik,jk', A, B)
    idx = B2AB.argmin(axis=1)
    D = A - B[idx]
    return np.sqrt(np.einsum('ij,ij->i', D, D)), idx

def f_KDTree(A, B):
    T = KDTree(B)
    return T.query(A, 1)

m, n = 75, 4
A, B = np.random.randn(2, m, n)

de, ie = f_einsum(A, B)
dt, it = f_KDTree(A, B)
assert np.all(ie == it) and np.allclose(de, dt)

from timeit import timeit

for m, n in [(75, 4), (500, 4)]:
    A, B = np.random.randn(2, m, n)
    print(m, n)
    print('einsum:', timeit("f_einsum(A, B)", globals=globals(), number=1000))
    print('KDTree:', timeit("f_KDTree(A, B)", globals=globals(), number=1000))
Sample run:
75 4
einsum: 0.067826496087946
KDTree: 0.12196151306852698
500 4
einsum: 3.1056990439537913
KDTree: 0.85108971898444
We can see that at small problem sizes the direct method (einsum) is faster, while for larger problem sizes the KDTree wins.
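As a side note (an illustrative sketch, not from the original answer), the equivalence claimed above between einsum and explicit broadcasting can be checked directly:
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((75, 4))
B = rng.standard_normal((75, 4))

via_einsum = np.einsum('ik,jk', A, B)
via_broadcast = (A[:, None, :] * B[None, :, :]).sum(axis=2)  # builds a 75 x 75 x 4 intermediate
assert np.allclose(via_einsum, via_broadcast)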

Normalize 2d arrays

Consider a square matrix containing positive numbers, given as a 2d numpy array A of shape (m, m). I would like to build a new array B that has the same shape, with entries
B[i,j] = A[i,j] / (np.sqrt(A[i,i]) * np.sqrt(A[j,j]))
An obvious solution is to loop over all (i,j) but I'm wondering if there is a faster way.
Here are two approaches leveraging broadcasting.
Approach #1 :
d = np.sqrt(np.diag(A))
B = A/d[:,None]
B /= d
Approach #2 :
B = A/(d[:,None]*d) # d same as used in Approach #1
Approach #1 has less memory overhead and as such should be faster.
You can normalize each row of your array by the square root of the main diagonal, leveraging broadcasting, using
b = np.sqrt(np.diag(a))
a / b[:, None]
Also, you can normalize each column using
a / b[None, :]
To do both, as your question seems to ask, use
a / (b[:, None] * b[None, :])
If you want to prevent the creation of intermediate arrays and do the operation in place, you can use
a /= b[:, None]
a /= b[None, :]
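For completeness (an illustrative check, not part of either answer), a minimal sketch comparing the broadcast version against the explicit formula from the question, assuming a small random matrix with positive entries:
import numpy as np

rng = np.random.default_rng(2)
m = 5
a = rng.random((m, m)) + m * np.eye(m)  # positive entries, positive diagonal

b = np.sqrt(np.diag(a))
B_vec = a / (b[:, None] * b[None, :])

# loop reference: B[i, j] = A[i, j] / (sqrt(A[i, i]) * sqrt(A[j, j]))
B_loop = np.array([[a[i, j] / (np.sqrt(a[i, i]) * np.sqrt(a[j, j]))
                    for j in range(m)] for i in range(m)])

assert np.allclose(B_vec, B_loop)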

Permute rows in "slices" of 3d array to match each other

I have a series of 2d arrays where the rows are points in some space. Many similar points occur across all arrays, but in different row order. I want to sort the rows so they have the most similar order across arrays. The points are too different for clustering with K-means or DBSCAN. The problem can also be cast like this: if I stack the arrays into a 3d array, how do I permute the rows to minimize the average standard deviation (SD) along the 2nd axis? What's a good sorting algorithm for this problem?
I've tried the following approaches:
1. Create a reference 2d array and sort the rows in each array to minimize the mean Euclidean distances to that reference. I'm afraid this gives biased results.
2. Sort the rows in the arrays pairwise, then pairs of pair-medians, then pairs of those, etc. This doesn't really work and I'm not sure why.
A third approach could be brute-force optimization, but I try to avoid that since I have multiple sets of arrays to run the procedure on.
This is my code for the 2nd approach (Python):
import numpy as np

def reorder_to(A, B):
    """Reorder rows in A to best match rows in B.

    Input
    -----
    A : N x M numpy.array
    B : N x M numpy.array

    Output
    ------
    perm_order : permutation order
    """
    if A.shape != B.shape:
        print("A and B must have the same shape")
        return None
    N = A.shape[0]
    # Create a matrix of distances between rows in A and rows in B
    distance_matrix = np.ones((N, N)) * np.inf
    for i, a in enumerate(A):
        for ii, b in enumerate(B):
            ba = b - a
            distance_matrix[i, ii] = np.sqrt(np.dot(ba, ba))
    # Choose the permutation order greedily, smallest distances first
    perm_order = [None] * N
    for _ in range(N):
        ind = np.argmin(distance_matrix)
        i, ii = ind // N, ind % N  # row in A, row in B
        perm_order[ii] = i
        distance_matrix[i, :] = np.inf
        distance_matrix[:, ii] = np.inf
    return perm_order

def permute_tensor_rows(A):
    """Permute 1d rows in a 3d array along the 0th axis to minimize average SD along the 2nd axis.

    Input
    -----
    A : numpy.3darray
        Each "slice" in the 2nd direction is an independent array whose rows can be permuted
        to decrease the average SD in the 2nd direction.

    Output
    ------
    A : numpy.3darray
        A with sorted rows in each "slice".
    """
    step = 2
    while step <= A.shape[2]:
        for k in range(0, A.shape[2], step):
            # If this is the last (incomplete) block, reorder it to the previous block
            if k + step > A.shape[2]:
                A_kk = A[:, :, k:(k+step)]
                kk_order = reorder_to(np.median(A_kk, axis=2), np.median(A_k, axis=2))
                A[:, :, k:(k+step)] = A[kk_order, :, k:(k+step)]
                continue
            k_0, k_1 = k, k + step // 2
            kk_0, kk_1 = k + step // 2, k + step
            A_k = A[:, :, k_0:k_1]
            A_kk = A[:, :, kk_0:kk_1]
            order = reorder_to(np.median(A_k, axis=2), np.median(A_kk, axis=2))
            A[:, :, k_0:k_1] = A[order, :, k_0:k_1]
        print("Step:", step, "\t ... Average SD:", np.mean(np.std(A, axis=2)))
        step *= 2
    return A
Sorry, I should have looked at your code sample earlier; it was very informative.
It seems this gives an out-of-the-box solution to your problem:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linear_sum_assignment.html#scipy.optimize.linear_sum_assignment
It's only really feasible for a few hundred points at most though, in my experience.
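A minimal sketch of that suggestion (my illustration, not the answerer's code; reorder_to_lsa is a made-up name playing the role of reorder_to above): it runs scipy.optimize.linear_sum_assignment on the pairwise distance matrix to find the optimal one-to-one row matching.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def reorder_to_lsa(A, B):
    """Return row indices of A that best match the rows of B, one-to-one."""
    cost = cdist(A, B)  # pairwise Euclidean distances
    row_ind, col_ind = linear_sum_assignment(cost)
    # col_ind[i] is the row of B matched to row row_ind[i] of A;
    # invert to get, for each row of B, the matching row of A
    perm = np.empty(len(B), dtype=int)
    perm[col_ind] = row_ind
    return perm

# usage: reorder A so its rows line up with B
rng = np.random.default_rng(3)
B = rng.standard_normal((6, 4))
A = B[rng.permutation(6)] + 0.01 * rng.standard_normal((6, 4))
perm = reorder_to_lsa(A, B)
assert np.allclose(A[perm], B, atol=0.1)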
