Numpy - Speed up iteration comparison? - python

The following use case:
I have a Numpy matrix/array with a few thousand 2d points. Call it A.
Eg:
[1 2]
[300 400]
..
[123 242]
I also have another Numpy matrix with a few 2d points as above. Call it B.
Basically, I want to iterate through A, then iterate through B and compute the distance between A[i] and B[j]. Then assign that back to another array. I could do it like this:
for i, (x0, x1) in enumerate(zip(A[:,0],A[:,1])):
weight_distance = 0
for j, (p0, p1) in enumerate(zip(A[:,0],A[:,1])):
weight_distance = weight_distance + distance((p0,p1),(x0,x1))
weight_array[i] = weight_distance
But this is too slow. What might be a Numpy way to approach this?

What you're probably looking for is the code in scipy.spatial.distance, particularly the cdist function. This can efficiently compute the pairwise distances between arrays of points for a wide variety of metrics.
import numpy as np
from scipy.spatial.distance import cdist
A = np.random.random((1000, 2))
B = np.random.random((100, 2))
D = cdist(A, B, metric='euclidean')
print(D.shape) # (1000, 100)
weights = D.sum(1)
print(weights.shape) # (1000,)
Here euclidean is the standard root-sum-square distance that you're probably used to, and D[i, j] holds the distance between A[i] and B[j], and so summing along axis 1 gives the desired weights.
There are ways to do this via broadcasting directly in numpy, but that approach would use several large temporary arrays, and will in general be slower than the scipy cdist approach.
Edit:
I thought I may as well add a note on the NumPy-only approach. It looks like this:
D2 = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
weights2 = D2.sum(1)
np.allclose(weights, weights2) # True
Let's break it down:
A[:, None, :] adds a new dimension to A, so its shape is now [1000, 1, 2]. Similar for B[None, :, :], which becomes [1, 100, 2]
A[:, None, :] - B[None, :, :] is a broadcasting operation which results in an array of differences, with shape [1000, 100, 2]
We square every element of this result.
the sum(-1) method on this result sums across the last dimension, resulting in an array of shape [1000, 100]
we take the square root of the result, which gives the distance matrix
we sum along axis 1 to get the weights
Notice that this broadcasting approach creates not one, but two temporary arrays of size 1000 * 100 * 2 along the way, which is why it is less efficient than a purpose-built compiled function like cdist.

Related

Applying mathematical operation between rows of two numpy arrays

Let's assume we have two numpy arrays A (n1xm) and B (n2xm) and I want to apply a certain mathematical operation between the rows of both tables.
For example, let's say that we want to calculate the Euclidean distance between each row of A and each row of B and store it at a new numpy table C (n1xn2).
The simple for-loop approach would be something like the following:
C = np.zeros((A.shape[0],B.shape[0]))
for i in range(A.shape[0]):
for j in range(B.shape[0]):
C[i,j] = np.linalg.norm(A[i]-B[j])
However, the above implementation is not the most efficient. How could I write this differently by using vectorization to speed up the implementation ?
You can broadcast over a new axis:
# n1 x m x n2
diff = A[:, :, None] - B[:, :, None].T
# n1 x n2 after summing across m
dists = np.sqrt((diff * diff).sum(1))

Correlating an array row-wise with a vector

I have an array X with dimension mxn, for every row m I want to get a correlation with a vector y with dimension n.
In Matlab this would be possible with the corr function corr(X,y). For Python however this does not seem possible with the np.corrcoef function:
import numpy as np
X = np.random.random([1000, 10])
y = np.random.random(10)
np.corrcoef(X,y).shape
Which results in shape (1001, 1001). But this will fail when the dimension of X is large. In my case, there is an error:
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 5.93 TiB for an array with shape (902630, 902630) and data type float64
Since the X.shape[0] dimension is 902630.
My question is, how can I only get the row wise correlations with the vector resulting in shape (1000,) of all correlations?
Of course this could be done via a list comprehension:
np.array([np.corrcoef(X[i, :], y)[0,1] for i in range(X.shape[0])])
Currently I am therefore using numba with a for loop running through the >900000 elemens. But I think there could be a much more efficient matrix operation function for this problem.
EDIT:
Pandas provides with the corrwith function also a method for this problem:
X_df = pd.DataFrame(X)
y_s = pd.Series(y)
X_df.corrwith(y_s)
The implementation allows for different correlation type calculations, but does not seem to be implemmented as a matrix operation and is therefore really slow. Probably there is a more efficient implementation.
This should work to compute the correlation coefficient for each row with a specified y in a vectorized manner.
X = np.random.random([1000, 10])
y = np.random.random(10)
r = (len(y) * np.sum(X * y[None, :], axis=-1) - (np.sum(X, axis=-1) * np.sum(y))) / (np.sqrt((len(y) * np.sum(X**2, axis=-1) - np.sum(X, axis=-1) ** 2) * (len(y) * np.sum(y**2) - np.sum(y)**2)))
print(r[0], np.corrcoef(X[0], y))
0.4243951, 0.4243951

Repmat operation in python

I want to calculate the mean of a 3D array along two axes and subtract this mean from the array.
In Matlab I use the repmat function to achieve this as follows
% A is an array of size 100x50x100
mean_A = mean(mean(A,3),1); % mean_A is 1D of length 50
Am = repmat(mean_A,[100,1,100]) % Am is 3D 100x50x100
flc_A = A - Am % flc_A is 3D 100x50x100
Now, I am trying to do the same with python.
mean_A = numpy.mean(numpy.mean(A,axis=2),axis=0);
gives me the 1D array. However, I cannot find a way to copy this to form a 3D array using numpy.tile().
Am I missing something or is there another way to do this in python?
You could set keepdims to True in both cases so the resulting shape is broadcastable and use np.broadcast_to to broadcast to the shape of A:
np.broadcast_to(np.mean(np.mean(A,2,keepdims=True),axis=0,keepdims=True), A.shape)
Note that you can also specify a tuple of axes along which to take the successive means:
np.broadcast_to(np.mean(A,axis=tuple([2,0]), keepdims=True), A.shape)
numpy.tile is not the same with Matlab repmat. You could refer to this question. However, there is an easy way to repeat the work you have done in Matlab. And you don't really have to understand how numpy.tile works in Python.
import numpy as np
A = np.random.rand(100, 50, 100)
# keep the dims of the array when calculating mean values
B = np.mean(A, axis=2, keepdims=True)
C = np.mean(B, axis=0, keepdims=True) # now the shape of C is (1, 50, 1)
# then simply duplicate C in the first and the third dimensions
D = np.repeat(C, 100, axis=0)
D = np.repeat(D, 100, axis=2)
D is the 3D array you want.

Want to define an ndarray in numpy elementwise

I have 2 2d numpy arrays, A with shape (i,j) and B (i, k) where j >> k. I want to define a new 3d array C such that each element in C is the broadcasted element wise product of each column in A with the whole matrix B. In other words as a normal python loop I would do it like this
for x in range(j):
C[x] = A[:,x]*B
However j is very large in this case and it would benefit me a lot if I am able to use Numpy's functionality to maybe define an ndarray C elementwise like in my loop above.
Thank you for your help
You can use broadcasting like this:
a.T[:, :, None] * b
Example:
import numpy as np
np.random.seed(444)
i, j, k = 2, 10, 3
a = np.random.randn(i, j)
b = np.random.randn(i, k)
c = a.T[:, :, None] * b
print(c.shape)
# (10, 2, 3)
Transposing stems from the fact that you want to internally operate for each column in a, and [:, :, None] expands the dimensionality to enable broadcasting, as explained in NumPy's broadcasting rules.

Normalize 2d arrays

Consider a square matrix containing positive numbers, given as a 2d numpy array A of shape ((m,m)). I would like to build a new array B that has the same shape with entries
B[i,j] = A[i,j] / (np.sqrt(A[i,i]) * np.sqrt(A[j,j]))
An obvious solution is to loop over all (i,j) but I'm wondering if there is a faster way.
Two approaches leveraging broadcasting could be suggested.
Approach #1 :
d = np.sqrt(np.diag(A))
B = A/d[:,None]
B /= d
Approach #2 :
B = A/(d[:,None]*d) # d same as used in Approach #1
Approach #1 has lesser memory overhead and as such I think would be faster.
You can normalize each row of your array by the main diagonal leveraging broadcasting using
b = np.sqrt(np.diag(a))
a / b[:, None]
Also, you can normalize each column using
a / b[None, :]
To do both, as your question seems to ask, using
a / (b[:, None] * b[None, :])
If you want to prevent the creation of intermediate arrays and do the operation in place, you can use
a /= b[:, None]
a /= b[None, :]

Categories

Resources