Vectorizing Numpy for loops - python

I'm currently trying to vectorize a few operations in NumPy. s is a very large number (10000) and X represents a numpy array with around 1200000
for element1 in range(1,s+1):
    d = np.zeros(s)
    for element2 in range(1,s+1):
        d[element2-1] = norm(np.subtract(X[0:n,element1],X[0:n,element2]))
I'm trying to rewrite this without using for loops but I can't think of a way. One method of trying involves using zip and np.tile, but that yields wrong results.

Those are basically Euclidean distances between the columns of a slice of the input array -
from scipy.spatial.distance import cdist, pdist, squareform
X_slice = X[0:n,1:s+1]
d_all = squareform(pdist(X_slice.T))
Thus, inside the first loop, it would be just a slice from the output d_all that could be re-used as d, like so -
for element1 in range(1,s+1):
    d = d_all[element1-1, :]
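One caveat: for s = 10000 the full d_all matrix holds 10000 x 10000 float64 values, roughly 800 MB. If that is too much, the same distances can be computed one block of rows at a time. The following is only a minimal NumPy sketch under that assumption; the helper name pairwise_dists_chunked and the chunk parameter are made up for illustration, and the sizes are kept tiny so the result can be compared against the original double loop.

```python
import numpy as np

def pairwise_dists_chunked(cols, chunk=256):
    """Euclidean distances between the columns of `cols` (shape (n, s)).

    Hypothetical helper: works on `chunk` rows of the output at a time so the
    temporary difference array stays at chunk * s * n elements.
    """
    pts = cols.T                      # (s, n): one point per row
    s = pts.shape[0]
    out = np.empty((s, s))
    for start in range(0, s, chunk):
        block = pts[start:start + chunk]              # (c, n)
        diff = block[:, None, :] - pts[None, :, :]    # (c, s, n)
        out[start:start + chunk] = np.sqrt((diff ** 2).sum(axis=2))
    return out

# Small stand-in for the real data (assumed sizes, for checking only)
rng = np.random.default_rng(0)
n, s = 4, 50
X = rng.random((n + 2, s + 1))

X_slice = X[0:n, 1:s + 1]
d_all = pairwise_dists_chunked(X_slice, chunk=16)

# Reference: the original double loop, storing every row of d
d_loop = np.zeros((s, s))
for element1 in range(1, s + 1):
    for element2 in range(1, s + 1):
        d_loop[element1 - 1, element2 - 1] = np.linalg.norm(
            X[0:n, element1] - X[0:n, element2])

print(np.allclose(d_all, d_loop))
```

A smaller chunk lowers peak memory at the cost of more Python-level iterations.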

Related

Are there any limitations of np.dot() function in numpy library?

I have two vectors (arrays) with one million elements each (all are positive). I want to find their dot product. When I use Python lists I get a big 20 - 30 digit answer. When I use NumPy arrays with the np.dot() function I get a negative answer. The code is shown below. Kindly explain your solution.
Code:
import numpy as np
# Python lists
arr1 = list(range(1000000))
arr2 = list(range(1000000, 2000000))
# Numpy arrays
arr1_np = np.array(arr1)
arr2_np = np.array(arr2)
# Dot product using lists
result = 0
for x1,x2 in zip(arr1,arr2):
    result+=x1*x2
print(result)
# Dot product using numpy built in function np.dot()
print(np.dot(arr1_np,arr2_np))
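A likely explanation, assuming the arrays ended up with a 32-bit integer dtype (the default integer on some platforms and NumPy versions): NumPy integers are fixed-width and wrap around silently on overflow, while Python integers have arbitrary precision. The sketch below forces the dtypes explicitly to show the difference; the variable names are only for illustration.

```python
import numpy as np

arr1 = list(range(1000000))
arr2 = list(range(1000000, 2000000))

# Exact result with arbitrary-precision Python integers
exact = sum(x1 * x2 for x1, x2 in zip(arr1, arr2))

# 32-bit integers: products and the running sum wrap around silently
a32 = np.array(arr1, dtype=np.int32)
b32 = np.array(arr2, dtype=np.int32)
wrapped = np.dot(a32, b32)

# 64-bit integers: this particular result still fits
a64 = np.array(arr1, dtype=np.int64)
b64 = np.array(arr2, dtype=np.int64)
wide = np.dot(a64, b64)

print(exact)
print(wrapped, wrapped.dtype)
print(wide, wide.dtype)
```

Checking arr1_np.dtype is the quickest way to confirm whether this is what happens on a given machine. For results that do not fit in int64 either, a fixed-width dtype cannot hold the answer at all.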

Euclidean distance between the two points using vectorized approach

I have two large NumPy arrays for which I want to calculate the Euclidean distance using sklearn. The following MRE produces the result I want, but since my real-life usage is large, I really want a vectorized solution rather than a for loop.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
n = 3
sample_size = 5
X = np.random.randint(0, 10, size=(sample_size, n))
Y = np.random.randint(0, 10, size=(sample_size, n))
lst = []
for f in range(0, sample_size):
    ed = euclidean_distances([X[f]], [Y[f]])
    lst.append(ed[0][0])
print(lst)
euclidean_distances computes the distance for each combination of X,Y points; this will grow large in memory and is totally unnecessary if you just want the distance between each respective row. Sklearn includes a different function called paired_distances that does what you want:
from sklearn.metrics.pairwise import paired_distances
d = paired_distances(X,Y)
# array([5.83095189, 9.94987437, 7.34846923, 5.47722558, 4. ])
If you need the full pairwise distances, you can get the same result from the diagonal (as pointed out in the comments):
d = euclidean_distances(X,Y).diagonal()
Lastly: arrays are a NumPy type, so it is useful to know the NumPy API itself (probably what sklearn calls under the hood). Here are two examples:
d = np.linalg.norm(X-Y, axis=1)
d = np.sqrt(np.sum((X-Y)**2, axis=1))
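To make the difference between row-wise and full pairwise distances concrete, here is a small NumPy-only check. It is a sketch with made-up random data: it builds the full distance matrix by broadcasting purely for comparison, which is exactly the memory cost the row-wise version avoids.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sample_size = 3, 5
X = rng.integers(0, 10, size=(sample_size, n))
Y = rng.integers(0, 10, size=(sample_size, n))

# Row-wise distances: one value per (X[f], Y[f]) pair, O(sample_size) memory
row_wise = np.linalg.norm(X - Y, axis=1)

# Full pairwise matrix: every X row against every Y row, O(sample_size**2) memory
full = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)

# The original loop, for reference
looped = np.array([np.linalg.norm(X[f] - Y[f]) for f in range(sample_size)])

print(np.allclose(row_wise, looped))
print(np.allclose(full.diagonal(), row_wise))
```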

How to get chunks of submatrices faster?

I have a really big matrix (nxn) for which I would like to build the overlapping tiles (submatrices) with dimensions mxm. There will be an offset of step between contiguous submatrices. Here is an example for n=8, m=4, step=2:
import numpy as np
matrix=np.random.randn(8,8)
n=matrix.shape[0]
m=4
step=2
This will store all the corner indices (x,y) from which we will take a 4x4 matrix: (x:x+4, y:y+4)
a={(i,j) for i in range(0,n-m+1,step) for j in range(0,n-m+1,step)}
The submatrices will be extracted like this:
sub_matrices = np.zeros([m,m,len(a)])
for i,ind in enumerate(a):
    x,y=ind
    sub_matrices[:,:,i]=matrix[x:x+m, y:y+m]
Is there a faster way to do this submatrices initialization?
We can use scikit-image's view_as_windows, which is built on np.lib.stride_tricks.as_strided, to get sliding windows. More info on the use of as_strided-based view_as_windows.
from skimage.util.shape import view_as_windows
# Get indices as array
ar = np.array(list(a))
# Get all sliding windows
w = view_as_windows(matrix,(m,m))
# Get selective ones by indexing with ar
selected_windows = np.moveaxis(w[ar[:,0],ar[:,1]],0,2)
Alternatively, we can extract the row and col indices with a list comprehension and then index with those, like so -
R = [i[0] for i in a]
C = [i[1] for i in a]
selected_windows = np.moveaxis(w[R,C],0,2)
Optimizing from the start, we can skip creating the set of corner indices, a, and simply use the step argument of view_as_windows, like so -
view_as_windows(matrix,(m,m),step=2)
This would give us a 4D array and indexing into the first two axes of it would have all the mxm shaped windows. These windows are simply views into input and hence no extra memory overhead plus virtually free runtime!
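If scikit-image is not available, a similar strided view can be obtained with NumPy alone. The following is a sketch assuming NumPy 1.20 or newer, where np.lib.stride_tricks.sliding_window_view exists; the corner list is ordered row by row here, unlike the unordered set a in the question.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

matrix = np.arange(64, dtype=float).reshape(8, 8)
n = matrix.shape[0]
m = 4
step = 2

# All m x m windows, then keep every `step`-th one along both axes.
# The result is a read-only view of shape (3, 3, 4, 4) for these sizes.
windows = sliding_window_view(matrix, (m, m))[::step, ::step]

# Same (m, m, count) layout as sub_matrices in the question (this makes a copy)
sub_matrices = np.moveaxis(windows.reshape(-1, m, m), 0, 2)

# Corner indices in the matching row-major order
corners = [(i, j) for i in range(0, n - m + 1, step)
                  for j in range(0, n - m + 1, step)]

print(windows.shape, sub_matrices.shape)
```

windows[i, j] is then the tile whose top-left corner is (i*step, j*step) in matrix.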
import numpy as np
a = np.random.randn(n, n)
b = a[0:m*step:step, 0:m*step:step]
If you have a one-dimensional array, you can get its sub-array with the following code:
c = a[start:end:step]
If the array has two or more dimensions, add a comma between the slices for each dimension.
d = a[start1:end1:step1, start2:end2:step2]

Broadcasting with reduction or extension in Numpy

In the following code we calculate magnitudes of vectors between all pairs of given points. To speed up this operation in NumPy we can use broadcasting
import numpy as np
points = np.random.rand(10,3)
pair_vectors = points[:,np.newaxis,:] - points[np.newaxis,:,:]
pair_dists = np.linalg.norm(pair_vectors,axis=2)
or outer product iteration
it = np.nditer([points,points,None], flags=['external_loop'], op_axes=[[0,-1,1],[-1,0,1],None])
for a,b,c in it:
    c[...] = b - a
pair_vectors = it.operands[2]
pair_dists = np.linalg.norm(pair_vectors,axis=2)
My question is how could one use broadcasting or outer product iteration to create an array with the form 10x10x6 where the last axis contains the coordinates of both points in a pair (extension). And in a related way, is it possible to calculate pair distances using broadcasting or outer product iteration directly, i.e. produce a matrix of form 10x10 without first calculating the difference vectors (reduction).
To clarify, the following code creates the desired matrices using slow looping.
pair_coords = np.zeros((10,10,6))
pair_dists = np.zeros((10,10))
for i in range(10):
    for j in range(10):
        pair_coords[i,j,0:3] = points[i,:]
        pair_coords[i,j,3:6] = points[j,:]
        pair_dists[i,j] = np.linalg.norm(points[i,:]-points[j,:])
This is a failed attempt to calculate distances (or apply any other function that takes the 6 coordinates of both points in a pair and produces a scalar) using outer product iteration.
res = np.zeros((10,10))
it = np.nditer([points,points,res], flags=['reduce_ok','external_loop'], op_axes=[[0,-1,1],[-1,0,1],None])
for a,b,c in it: c[...] = np.linalg.norm(b-a)
pair_dists = it.operands[2]
Here's an approach to produce those arrays in vectorized ways -
from itertools import product
from scipy.spatial.distance import pdist, squareform
N = points.shape[0]
# Get indices for selecting rows off points array and stacking them
idx = np.array(list(product(range(N),repeat=2)))
p_coords = np.column_stack((points[idx[:,0]],points[idx[:,1]])).reshape(N,N,6)
# Get the distances for upper triangular elements.
# Then create a symmetric one for the final dists array.
p_dists = squareform(pdist(points))
A few other vectorized approaches are discussed in this post, so have a look there too!
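Since the question asks specifically about broadcasting, here is one possible sketch of both parts using plain NumPy. It is not the approach from the answer above: the extension uses np.broadcast_to plus a concatenate, and the reduction uses the identity |p - q|^2 = |p|^2 + |q|^2 - 2 p.q, which avoids the N x N x 3 difference array but can lose a little precision for nearly identical points.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((10, 3))
N = points.shape[0]

# Extension: pair_coords[i, j] = (points[i], points[j]) -> shape (N, N, 6)
first = np.broadcast_to(points[:, np.newaxis, :], (N, N, 3))
second = np.broadcast_to(points[np.newaxis, :, :], (N, N, 3))
pair_coords = np.concatenate((first, second), axis=2)

# Reduction: N x N distances without forming the difference vectors
sq_norms = (points ** 2).sum(axis=1)
sq_dists = sq_norms[:, np.newaxis] + sq_norms[np.newaxis, :] - 2.0 * (points @ points.T)
# Rounding can push tiny values slightly below zero, so clip before the sqrt
pair_dists = np.sqrt(np.maximum(sq_dists, 0.0))

print(pair_coords.shape, pair_dists.shape)
```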

Vectorizing a numpy array call of varying indices

I have a 2D numpy array and a list of lists of indices for which I wish to compute the sum of the corresponding 1D vectors from the numpy array. This can be easily done through a for loop or via list comprehension, but I wonder if it's possible to vectorize it. With similar code I gain about 40x speedups from the vectorization.
Here's sample code:
import numpy as np
indices = [[1,2],[1,3],[2,0,3],[1]]
array_2d = np.array([[0.5, 1.5],[1.5,2.5],[2.5,3.5],[3.5,4.5]])
soln = [np.sum(array_2d[x], axis=-1) for x in indices]
(edit): Note that the indices are not (x,y) coordinates for array_2d, instead indices[0] = [1,2] represents the first and second vectors (rows) in array_2d. The number of elements of each list in indices can be variable.
This is what I would hope to be able to do:
vectorized_soln = np.sum(array_2d[indices[:]], axis=-1)
Does anybody know if there are any ways of achieving this?
First of all, I think you have a typo in the third element of indices...
The easy way to do that is building a sub_array with two arrays of indices:
i = np.array([1,1,2])
j = np.array([2,3,?])
sub_arr2d = array_2d[i,j]
and finally, you can take the sum of sub_arr2d...
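For index lists of varying length, another option is to flatten the lists and sum over segments with np.add.reduceat. This is only a sketch under an assumption: that the goal is the sum of the selected rows for each index list (axis=0), which differs from the axis=-1 used in the question's list comprehension. It also assumes no index list is empty, since reduceat does not return zero for an empty segment.

```python
import numpy as np

indices = [[1, 2], [1, 3], [2, 0, 3], [1]]
array_2d = np.array([[0.5, 1.5], [1.5, 2.5], [2.5, 3.5], [3.5, 4.5]])

# Flatten the ragged index lists and record where each list starts
flat = np.concatenate(indices)                      # [1 2 1 3 2 0 3 1]
lengths = np.array([len(x) for x in indices])       # [2 2 3 1]
offsets = np.concatenate(([0], np.cumsum(lengths)[:-1]))  # [0 2 4 7]

# Gather all selected rows once, then sum each segment of rows
sums = np.add.reduceat(array_2d[flat], offsets, axis=0)

print(sums)
```

Each row of sums corresponds to one entry of indices, so the result has shape (len(indices), array_2d.shape[1]).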
