Broadcasting with reduction or extension in Numpy - python

In the following code we calculate the magnitudes of the vectors between all pairs of given points. To speed up this operation in NumPy we can use broadcasting:
import numpy as np
points = np.random.rand(10,3)
pair_vectors = points[:,np.newaxis,:] - points[np.newaxis,:,:]
pair_dists = np.linalg.norm(pair_vectors,axis=2)
or outer product iteration:
it = np.nditer([points,points,None], flags=['external_loop'], op_axes=[[0,-1,1],[-1,0,1],None])
for a,b,c in it:
    c[...] = b - a
pair_vectors = it.operands[2]
pair_dists = np.linalg.norm(pair_vectors,axis=2)
My question is how could one use broadcasting or outer product iteration to create an array with the form 10x10x6 where the last axis contains the coordinates of both points in a pair (extension). And in a related way, is it possible to calculate pair distances using broadcasting or outer product iteration directly, i.e. produce a matrix of form 10x10 without first calculating the difference vectors (reduction).
To clarify, the following code creates the desired matrices using slow looping.
pair_coords = np.zeros((10,10,6))
pair_dists = np.zeros((10,10))
for i in range(10):
    for j in range(10):
        pair_coords[i,j,0:3] = points[i,:]
        pair_coords[i,j,3:6] = points[j,:]
        pair_dists[i,j] = np.linalg.norm(points[i,:]-points[j,:])
This is a failed attempt to calculate distances (or apply any other function that takes the 6 coordinates of both points in a pair and produces a scalar) using outer product iteration.
res = np.zeros((10,10))
it = np.nditer([points,points,res], flags=['reduce_ok','external_loop'], op_axes=[[0,-1,1],[-1,0,1],None])
for a,b,c in it: c[...] = np.linalg.norm(b-a)
pair_dists = it.operands[2]

Here's an approach to produce those arrays in vectorized ways -
from itertools import product
from scipy.spatial.distance import pdist, squareform
N = points.shape[0]
# Get indices for selecting rows off points array and stacking them
idx = np.array(list(product(range(N),repeat=2)))
p_coords = np.column_stack((points[idx[:,0]],points[idx[:,1]])).reshape(N,N,6)
# Get the distances for upper triangular elements.
# Then create a symmetric one for the final dists array.
p_dists = squareform(pdist(points))
A few other vectorized approaches are discussed in this post, so have a look there too!
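Since the question asks specifically about broadcasting, here is a minimal NumPy-only sketch of both parts. It is not taken from the answer above: the names first, second, sq_norms and sq_dists are made up for illustration, and the distance part assumes the identity |a - b|^2 = |a|^2 + |b|^2 - 2 a.b, which avoids building the 10x10x3 difference array at the cost of some round-off.

```python
import numpy as np

points = np.random.rand(10, 3)
n, d = points.shape

# Extension: broadcast both "views" of the points to (n, n, d) and join them.
first = np.broadcast_to(points[:, np.newaxis, :], (n, n, d))   # point i along axis 0
second = np.broadcast_to(points[np.newaxis, :, :], (n, n, d))  # point j along axis 1
pair_coords = np.concatenate((first, second), axis=2)          # shape (n, n, 2*d)

# Reduction: |a - b|^2 = |a|^2 + |b|^2 - 2 a.b, so only (n, n) arrays are created.
sq_norms = np.einsum('ij,ij->i', points, points)
sq_dists = sq_norms[:, np.newaxis] + sq_norms[np.newaxis, :] - 2.0 * (points @ points.T)
pair_dists = np.sqrt(np.maximum(sq_dists, 0.0))  # clip tiny negatives caused by round-off
```

Because of the round-off, the diagonal of pair_dists may come out as very small positive numbers rather than exact zeros.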

Related

Are there any limitations of np.dot() function in numpy library?

I have two vectors or arrays with one million elements each (all positive). I want to find their dot product. When I use Python lists, I get a big answer of some 20 - 30 digits. When I use NumPy arrays with the np.dot() function, I get a negative answer. The code is shown below. Kindly explain your solution.
Code:
import numpy as np
# Python lists
arr1 = list(range(1000000))
arr2 = list(range(1000000, 2000000))
# Numpy arrays
arr1_np = np.array(arr1)
arr2_np = np.array(arr2)
# Dot product using lists
result = 0
for x1,x2 in zip(arr1,arr2):
    result+=x1*x2
print(result)
# Dot product using numpy built in function np.dot()
print(np.dot(arr1_np,arr2_np))
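One possible explanation, offered as an assumption rather than a diagnosis of the asker's machine: NumPy integers have a fixed width, and on some platforms the default integer dtype is 32-bit, in which case the sum wraps around silently. The sketch below forces the dtypes explicitly to show the difference; the names a64, b64, dot64 and dot32 are made up for illustration.

```python
import numpy as np

n = 1_000_000
a64 = np.arange(n, dtype=np.int64)
b64 = np.arange(n, 2 * n, dtype=np.int64)

# Exact reference using Python's arbitrary-precision integers.
exact = sum(x * y for x, y in zip(range(n), range(n, 2 * n)))

# With 64-bit integers the result fits for this particular input.
dot64 = np.dot(a64, b64)

# With 32-bit integers the accumulation wraps around without any warning.
dot32 = np.dot(a64.astype(np.int32), b64.astype(np.int32))

print(exact, dot64, dot32)
```

If the arrays do turn out to be 32-bit, converting them with astype(np.int64) before calling np.dot() should be enough for inputs of this size.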

Euclidean distance between the two points using vectorized approach

I have two large numpy arrays for which I want to calculate the Euclidean distance using sklearn. The following MRE achieves what I want in the final result, but since my real-life data is large, I really want a vectorized solution as opposed to using a for loop.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
n = 3
sample_size = 5
X = np.random.randint(0, 10, size=(sample_size, n))
Y = np.random.randint(0, 10, size=(sample_size, n))
lst = []
for f in range(0, sample_size):
    ed = euclidean_distances([X[f]], [Y[f]])
    lst.append(ed[0][0])
print(lst)
euclidean_distances computes the distance for each combination of X,Y points; this will grow large in memory and is totally unnecessary if you just want the distance between each respective row. Sklearn includes a different function called paired_distances that does what you want:
from sklearn.metrics.pairwise import paired_distances
d = paired_distances(X,Y)
# array([5.83095189, 9.94987437, 7.34846923, 5.47722558, 4. ])
If you need the full pairwise distances, you can get the same result from the diagonal (as pointed out in the comments):
d = euclidean_distances(X,Y).diagonal()
Lastly: arrays are a numpy type, so it is useful to know the numpy API itself (which is probably what sklearn calls under the hood). Here are two examples:
d = np.linalg.norm(X-Y, axis=1)
d = np.sqrt(np.sum((X-Y)**2, axis=1))
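As a quick sanity check, the NumPy-only variants can be compared against an explicit row-by-row loop. This is a small sketch with made-up names (loop, d_norm, d_einsum) and a seeded generator in place of the question's np.random.randint calls; the einsum line is an extra variant that is not part of the answer above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(5, 3))
Y = rng.integers(0, 10, size=(5, 3))

# Reference: one distance per pair of corresponding rows, computed in a loop.
loop = np.array([np.sqrt(((x - y) ** 2).sum()) for x, y in zip(X, Y)])

# Vectorized: norm of the row-wise differences.
d_norm = np.linalg.norm(X - Y, axis=1)

# Vectorized: row-wise sum of squares via einsum, then a square root.
diff = X - Y
d_einsum = np.sqrt(np.einsum('ij,ij->i', diff, diff))

print(np.allclose(loop, d_norm), np.allclose(loop, d_einsum))
```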

How to get chunks of submatrices faster?

I have a really big matrix (n x n) from which I would like to build overlapping tiles (submatrices) of dimensions m x m. There is an offset of step between contiguous submatrices. Here is an example for n=8, m=4, step=2:
import numpy as np
matrix=np.random.randn(8,8)
n=matrix.shape[0]
m=4
step=2
This will store all the corner indices (x,y) from which we will take a 4x4 matrix: (x:x+4, y:y+4)
a={(i,j) for i in range(0,n-m+1,step) for j in range(0,n-m+1,step)}
The submatrices will be extracted like that
sub_matrices = np.zeros([m,m,len(a)])
for i,ind in enumerate(a):
    x,y=ind
    sub_matrices[:,:,i]=matrix[x:x+m, y:y+m]
Is there a faster way to do this submatrices initialization?
We can use scikit-image's view_as_windows, which is based on np.lib.stride_tricks.as_strided, to get sliding windows.
from skimage.util.shape import view_as_windows
# Get indices as array
ar = np.array(list(a))
# Get all sliding windows
w = view_as_windows(matrix,(m,m))
# Get selective ones by indexing with ar
selected_windows = np.moveaxis(w[ar[:,0],ar[:,1]],0,2)
Alternatively, we can extract the row and col indices with a list comprehension and then index with those, like so -
R = [i[0] for i in a]
C = [i[1] for i in a]
selected_windows = np.moveaxis(w[R,C],0,2)
Optimizing from the start, we can skip creating the set of corner indices, a, and simply pass the step argument to view_as_windows, like so -
view_as_windows(matrix,(m,m),step=2)
This would give us a 4D array and indexing into the first two axes of it would have all the mxm shaped windows. These windows are simply views into input and hence no extra memory overhead plus virtually free runtime!
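If scikit-image is not available, a similar view can be sketched with NumPy alone. This assumes NumPy 1.20 or newer, where numpy.lib.stride_tricks.sliding_window_view exists; the names windows and tiles are made up for illustration, and the tiles come out in row-major corner order rather than in the arbitrary order of the set a.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

n, m, step = 8, 4, 2
matrix = np.random.randn(n, n)

# All m x m windows as a view, then keep every `step`-th corner along both axes.
windows = sliding_window_view(matrix, (m, m))[::step, ::step]  # shape (3, 3, m, m)

# Flatten the corner grid and move the tile index last to get an (m, m, k) layout.
# The reshape copies the data, so this last step is no longer a view.
tiles = np.moveaxis(windows.reshape(-1, m, m), 0, 2)
print(windows.shape, tiles.shape)
```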
import numpy as np
a = np.random.randn(n, n)
b = a[0:m*step:step, 0:m*step:step]
If you have a one-dimensional array, you can get its subarray with the following code:
c = a[start:end:step]
If there are two or more dimensions, separate the slices for each dimension with commas.
d = a[start1:end1:step1, start2:end2:step2]

Why does my matrix vector multiplication in NumPy yield a two dimensional array instead of a one dimensional vector?

I have a matrix called inverseJ, which is a 2x2 matrix ([[0.07908312, 0.03071918], [-0.12699082, -0.0296126]]), and a one dimensional vector deltaT of length two ([-31.44630082, -16.9922145]). In NumPy, multiplying these should yield a one dimensional vector again, as in this example. However, when I multiply these using inverseJ.dot(deltaT), I get a two dimensional array ([[-3.00885838, 4.49657509]]) with the only element being the vector I am actually looking for. Does anyone know why I am not simply getting a vector? Any help is greatly appreciated!
Whole script for reference
from __future__ import division
import sys
import io
import os
from math import *
import numpy as np
if __name__ == "__main__":
    # Fingertip position
    x = float(sys.argv[1])
    y = float(sys.argv[2])
    # Initial guesses
    q = np.array([0., 0.])
    q[0] = float(sys.argv[3])
    q[1] = float(sys.argv[4])
    error = 0.01
    while(error > 0.001):
        # Configuration matrix
        T = np.array([17.3*cos(q[0] + (5/3)*q[1])+25.7*cos(q[0] + q[1])+41.4*cos(q[0]),
                      17.3*sin(q[0] + (5/3)*q[1])+25.7*sin(q[0] + q[1])+41.4*sin(q[0])])
        # Deviation
        deltaT = np.subtract(np.array([x,y]), T)
        error = deltaT[0]**2 + deltaT[1]**2
        # Jacobian
        J = np.matrix([ [-25.7*sin(q[0]+q[1])-17.3*sin(q[0]+(5/3)*q[1])-41.4*sin(q[0]), -25.7*sin(q[0]+q[1])-28.8333*sin(q[0]+(5/3)*q[1])],
                        [25.7*cos(q[0]+q[1])+17.3*cos(q[0]+(5/3)*q[1])+41.4*cos(q[0]), 25.7*cos(q[0]+q[1])+28.8333*cos(q[0]+(5/3)*q[1])]])
        #Inverse of the Jacobian
        det = J.item((0,0))*J.item((1,1)) - J.item((0,1))*J.item((1,0))
        inverseJ = 1/det * np.matrix([ [J.item((1,1)), -J.item((0,1))],
                                       [-J.item((1,0)), J.item((0,0))]])
        ### THE PROBLEMATIC MATRIX VECTOR MULTIPLICATION IN QUESTION
        q = q + inverseJ.dot(deltaT)
When an np.matrix is involved in an operation, the output is another np.matrix. np.matrix objects are matrices in the strict linear algebra sense. They are always 2D, even if they have only one element.
On the contrary, the example you mention uses arrays, not matrices. Arrays are more "loosely behaved". One of the differences is that "useless" dimensions are removed, yielding a 1D vector in this example.
This simply seems to be the way numpy.dot() functions. It does a simple array multiplication which, since one of the parameters is two dimensional, returns a two dimensional array. dot() is not a smart method, it just does what it's told without sanity checks from what I can gather in the documentation here. Note that this is not an error in your code, but you will have to extract the inner list yourself.
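The difference between the two types can be seen with the numbers from the question. This is a small sketch rather than a patch for the script: the names inverseJ_arr, as_matrix, as_array and flattened are made up for illustration, and np.matrix is used only to reproduce the behaviour.

```python
import numpy as np

inverseJ_arr = np.array([[0.07908312, 0.03071918],
                         [-0.12699082, -0.0296126]])
deltaT = np.array([-31.44630082, -16.9922145])

# With np.matrix the result is wrapped back into a 2D matrix.
as_matrix = np.matrix(inverseJ_arr).dot(deltaT)

# With a plain ndarray the result is a 1D vector.
as_array = inverseJ_arr.dot(deltaT)

# An existing matrix result can be flattened back to 1D.
flattened = np.asarray(as_matrix).ravel()

print(as_matrix.shape, as_array.shape, flattened.shape)
```

Building J and inverseJ with np.array instead of np.matrix would therefore keep q one-dimensional inside the loop.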

Cosine distance of vector to matrix

In python, is there a vectorized efficient way to calculate the cosine distance of a sparse array u to a sparse matrix v, resulting in an array of elements [1, 2, ..., n] corresponding to cosine(u,v[0]), cosine(u,v[1]), ..., cosine(u, v[n])?
Not natively. You can however use the library scipy that can compute the cosine distance between two vectors for you: http://docs.scipy.org/doc/scipy-0.17.0/reference/generated/scipy.spatial.distance.cosine.html. You can build a version that takes a matrix using this as a stepping stone.
Add the vector onto the end of the matrix, calculate a pairwise distance matrix using sklearn.metrics.pairwise_distances() and then extract the relevant column/row.
So for vector v (with shape (D,)) and matrix m (with shape (N,D)) do:
import numpy as np
from sklearn.metrics import pairwise_distances
new_m = np.concatenate([m,v[None,:]], axis=0)
distance_matrix = pairwise_distances(new_m, metric="cosine")
distances = distance_matrix[-1,:-1]
Not ideal, but better than iterating!
This method can be extended if you are querying more than one vector. To do this, a list of vectors can be concatenated instead.
I think there is a way using the definition and the numpy library:
Definition:
import numpy as np
#just creating random data
u = np.random.random(100)
v = np.random.random((100,100))
#dot product: for every row in v, multiply u and sum the elements
u_dot_v = np.sum(u*v,axis = 1)
#find the norm of u and each row of v
mod_u = np.sqrt(np.sum(u*u))
mod_v = np.sqrt(np.sum(v*v,axis = 1))
#just apply the definition
final = 1 - u_dot_v/(mod_u*mod_v)
#verify with the cosine function from scipy
from scipy.spatial.distance import cosine
final2 = np.array([cosine(u,i) for i in v])
I found the definition of cosine distance here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html#scipy.spatial.distance.cosine
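The question mentions sparse inputs, and the same definition can be applied to scipy.sparse matrices without densifying them. The following is a sketch under assumptions: u is stored as a 1 x D CSR matrix, v as an N x D CSR matrix, no row is entirely zero, and the helper random_sparse is made up purely to generate test data.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

def random_sparse(shape, density=0.3):
    # Hypothetical helper: dense random values with most entries zeroed out.
    dense = rng.random(shape)
    dense[rng.random(shape) > density] = 0.0
    return sparse.csr_matrix(dense)

u = random_sparse((1, 50))    # the query vector as a 1 x D sparse matrix
v = random_sparse((20, 50))   # N x D sparse matrix, one vector per row

# Dot product of u with every row of v, as a sparse matrix product.
dots = (v @ u.T).toarray().ravel()

# Euclidean norms computed from the stored nonzeros only.
u_norm = np.sqrt(u.multiply(u).sum())
v_norms = np.sqrt(np.asarray(v.multiply(v).sum(axis=1)).ravel())

# Cosine distance, one value per row of v.
cos_dist = 1.0 - dots / (u_norm * v_norms)
```

An all-zero row would lead to a division by zero here, so real data may need a guard for that case.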
In scipy.spatial.distance.cosine()
http://docs.scipy.org/doc/scipy-0.17.0/reference/generated/scipy.spatial.distance.cosine.html
The code below worked for me; np.vectorize has to be given the correct signature.
import numpy as np
from scipy.spatial.distance import cosine
def cosine_distances(embedding_matrix, extracted_embedding):
    return cosine(embedding_matrix, extracted_embedding)
cosine_distances = np.vectorize(cosine_distances, signature='(m),(d)->()')
cosine_distances(corpus_embeddings, extracted_embedding)
In my case
corpus_embeddings is a (10000,128) matrix
extracted_embedding is a 128-dimensional vector
