Tensorflow efficient pairwise inner product

In TensorFlow (Python), given a matrix X of shape (n x d), where each row is a data point, I would like to compute the pairwise inner products of these n data points, i.e., the upper triangle of XX'.
Of course I could compute the whole XX' and fetch its upper triangle, but then every off-diagonal element would be computed twice. How can I compute these efficiently in TensorFlow (Python), evaluating the inner product only once per pair?

With numpy, you can do this:
import numpy as np
A = np.random.randn(5, 3)
inds = np.triu_indices(5) # upper triangle indices
# expensive way to do it
ipu1 = np.dot(A, A.T)[inds]
# possibly less expensive way to do it.
ipu2 = np.einsum('ij,ij->i', A[inds[0]], A[inds[1]])
print(np.allclose(ipu1, ipu2))
This outputs True. TensorFlow does not have the triu_indices function built in, but it is not hard to write one if needed by looking at the NumPy code. It does have einsum.
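The remark about writing a triu_indices replacement can be sketched as follows. This is a NumPy illustration built from a broadcast comparison. The helper name upper_triangle_indices is made up for this example, and porting it to TensorFlow (for instance with a boolean mask and tf.where) is an untested assumption.

```python
import numpy as np

def upper_triangle_indices(n):
    """Return (rows, cols) of the upper triangle of an n x n matrix, diagonal included."""
    r = np.arange(n)
    mask = r[:, None] <= r[None, :]   # True where row index <= column index
    return np.nonzero(mask)           # indices come out in row-major order

A = np.random.randn(5, 3)
rows, cols = upper_triangle_indices(5)
# One inner product per pair (i, j) with i <= j
pairwise = np.einsum('ij,ij->i', A[rows], A[cols])
```

For n points this evaluates n(n+1)/2 inner products instead of n^2, at the cost of gathering two (n(n+1)/2, d) arrays, so whether it is actually faster than one full matrix product depends on n, d and the backend.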

Related

Fastest way to calculate cosine similarity between two 2D arrays

I have one array A containing 64000 embeddings and another array B containing 12000 embeddings (each embedding is 1024 floats long).
Now I want to calculate the cosine similarity for all pairs between array A and array B (Cartesian product).
To perform that (using pandas), I merge array A with array B using .merge(how="cross").
It gives me 768 000 000 pairs.
Now I am looking for the fastest way of calculating the cosine similarity. For now I use something like this with NumPy:
from numpy import dot
from numpy.linalg import norm
def compute_cosine_sim(a, b):
    return dot(a, b) / (norm(a) * norm(b))
np.vectorize(compute_cosine_sim)(df.embedding_A.to_numpy(), df.embedding_B.to_numpy())
To keep the RAM at reasonable level, I use pandas Dataframe chunking.
The problem is that my method is not fast enough, and I was wondering whether something could be changed here, especially regarding the efficiency of the NumPy function I use.
To give some details, I reach 130000 iter/sec with this function. Is that normal?
Also, could this kind of operation be run on a GPU easily?
Thanks for the help

You could matrix multiply embedding A with the transpose of embedding B to get the dot products of all pairs, and then divide the output along the columns by the norm of the vectors in A, and the rows by the norm of the vectors in B:
import numpy as np
a = np.random.randn(10, 4)
b = np.random.randn(8, 4)
out = a @ b.T / np.linalg.norm(a, axis=1, keepdims=True) / np.linalg.norm(b.T, axis=0, keepdims=True)
# out has shape (10, 8)
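Since the question mentions chunking to keep RAM usage reasonable, the same matrix-product idea can be applied block by block. The following is a minimal sketch; the function name cosine_sim_chunked and the chunk_size parameter are hypothetical, and it assumes no embedding has zero norm.

```python
import numpy as np

def cosine_sim_chunked(a, b, chunk_size=1024):
    """Yield (start, block); block holds the cosine similarities of
    a[start:start + chunk_size] against every row of b."""
    # Normalise B once (assumes no zero-norm rows)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    for start in range(0, a.shape[0], chunk_size):
        a_chunk = a[start:start + chunk_size]
        a_unit = a_chunk / np.linalg.norm(a_chunk, axis=1, keepdims=True)
        # Dot products of unit vectors are cosine similarities
        yield start, a_unit @ b_unit.T

rng = np.random.default_rng(0)
a = rng.standard_normal((10, 4))
b = rng.standard_normal((8, 4))
out = np.vstack([block for _, block in cosine_sim_chunked(a, b, chunk_size=3)])
```

Here the blocks are stacked only to compare against the one-shot version. With 64000 x 12000 pairs, each block would more likely be reduced or written to disk before the next one is computed.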

How to efficiently iterate through rows in a large matrix with too many columns?

I'm working on document clustering where I first build a distance matrix from the tf-idf results. I use the below code to get my tf-idf matrix:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words={'english'})
X = vectorizer.fit_transform(models)
This results in a matrix of (9069, 22210). Now I want to build a distance matrix from this (9069*9069). I'm using the following code for that:
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix
from scipy.spatial import distance
arrX = X.toarray()
rowSize = X.shape[0]
distMatrix = np.zeros(shape=(rowSize, rowSize))
#build distance matrix
for i, x in enumerate(arrX):
    for j, y in enumerate(arrX):
        distMatrix[i][j] = distance.braycurtis(x, y)
np.savetxt("dist.csv", distMatrix, delimiter=",")
The problem with this code is that it's extremely slow for this matrix size. Is there a faster way of doing this?

The biggest issue is that the algorithm runs in O(n^2 * m) time for n rows and m columns: distance.braycurtis is called 9069*9069 times, and each call processes two arrays of 22210 elements. That is roughly 1.8 * 10^12 scalar operations, which is huge. The complexity of the algorithm probably cannot be improved, but there are several ways to speed it up:
The first thing to do is not to compute each distance twice. This distance is symmetric, so distMatrix[i][j] == distMatrix[j][i]. You can compute the upper triangular part and then copy it to the lower triangular part.
Another optimization is simply not to use distance.braycurtis because it is slow: it takes about 10 us/call on my machine. This is mainly because it creates several temporary arrays, is mostly memory-bound because of NumPy operations, and also because np.sum is not very fast (mainly because it uses a fairly precise algorithm that is hard to optimize). Moreover, it is sequential, while nearly all mainstream processors have multiple cores nowadays. We can use Numba to massively speed up this operation:
import numba as nb
import numpy as np
@nb.njit(['float32(float32[::1], float32[::1])', 'float64(float64[::1], float64[::1])'], fastmath=True)
def fastBrayCurtis(arr1, arr2):
    assert arr1.size == arr2.size
    assert arr1.size > 0
    zero = arr1[0] * 0 # Trick to give `zero` the same type as the items of `arr1`
    df, sm = zero, zero
    for k in range(arr1.size):
        df += np.abs(arr1[k] - arr2[k])
        sm += np.abs(arr1[k] + arr2[k])
    return df / sm
# The signature of the function is provided so that the function is compiled eagerly
# for both 32-bit and 64-bit floating-point 2D contiguous arrays.
@nb.njit(['float32[:,::1](float32[:,::1])', 'float64[:,::1](float64[:,::1])'], fastmath=True, parallel=True)
def brayCurtisDistMatrix(arr):
    n = arr.shape[0]
    distance = np.empty((n, n), dtype=arr.dtype)
    # Compute the distance matrix in parallel while balancing the work between threads
    for i in nb.prange((n+1)//2):
        # Top of the upper triangular part (many items)
        for j in range(i, n):
            distance[j, i] = distance[i, j] = fastBrayCurtis(arr[i], arr[j])
        # Bottom of the upper triangular part (few items)
        for j in range(n-1-i, n):
            distance[j, n-1-i] = distance[n-1-i, j] = fastBrayCurtis(arr[n-1-i], arr[j])
    return distance
This code is about 440 times faster than the initial one on my 6-core i5-9600KF processor. A quick theoretical analysis combined with profiling results shows that the algorithm is close to optimal (>75% of the computing power of my processor is used)! If this is not enough, you should consider using the single-precision implementation. If this is still not enough, you should then also consider writing optimized GPU code for that (or simply reconsider the need to compute such a huge distance matrix).
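For readers who cannot install Numba, the "compute only the upper triangle, then mirror it" idea can also be sketched in plain NumPy by vectorizing over one row at a time. The function name bray_curtis_matrix is hypothetical, the code assumes no pair of rows sums to zero, and it will likely be slower than the compiled version above.

```python
import numpy as np

def bray_curtis_matrix(arr):
    """Bray-Curtis distance matrix, computing each unordered pair once."""
    n = arr.shape[0]
    dist = np.zeros((n, n), dtype=arr.dtype)
    for i in range(n):
        rest = arr[i:]                            # rows i..n-1, i.e. row i of the upper triangle
        num = np.abs(arr[i] - rest).sum(axis=1)   # sum |u - v| for each pair
        den = np.abs(arr[i] + rest).sum(axis=1)   # sum |u + v| for each pair (assumed non-zero)
        dist[i, i:] = num / den
        dist[i:, i] = dist[i, i:]                 # mirror into the lower triangle
    return dist

rng = np.random.default_rng(0)
X = rng.random((6, 5))   # small non-negative stand-in for the tf-idf matrix
D = bray_curtis_matrix(X)
```

Each iteration allocates temporaries of shape (n - i, m), so for the 9069 x 22210 case memory use per step is far from negligible.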

The individual elements of the multidimensional NumPy array you pass as input can be laid out in memory in one of two orders:
ROW MAJOR
COLUMN MAJOR
Each has its advantages and disadvantages.
You can even control the way it is stored.
I hope you find this helpful
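As a small illustration of the two layouts, NumPy exposes the memory order through an array's flags and strides. The array names below are made up for the example.

```python
import numpy as np

# NumPy arrays are row-major (C order) by default
c_arr = np.arange(6, dtype=np.int64).reshape(2, 3)
# Same values, copied into column-major (Fortran order) storage
f_arr = np.asfortranarray(c_arr)

print(c_arr.flags['C_CONTIGUOUS'], c_arr.strides)
print(f_arr.flags['F_CONTIGUOUS'], f_arr.strides)
```

The strides give the number of bytes to step in memory to move by one index along each axis. Iterating over rows is cache-friendly when the last axis has the smallest stride, which is the case for the row-major array.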

Pytorch: Efficiently compute unbiased estimator of mean to the power of four

Let w, x, y, z be torch tensors of shape (m, n) and we wish to compute the following unbiased estimator row-wise efficiently (without for loops), where I want to compute for every row 1, ..., m:
In the case of only the unbiased estimator of the square of the mean, i.e., for $\frac{1}{n(n-1)} \sum_{i \neq j} x_i y_j$ in each row,
this is possible, e.g., using torch.einsum:
batch_outer = torch.einsum('bi, bj -> bij', x, y)
zero_diag = 1-torch.eye(batch_outer.shape[1])
return (batch_outer * zero_diag).sum(dim=2).sum(dim=1) / (n * (n-1))
However, for the power-of-four case this is not so easily doable, mostly because these are not square tensors and, in particular, because zeroing out the diagonals becomes very tedious.
My questions:
1.) How can this be implemented efficiently, omitting any for loops?
2.) Which time and memory complexity would that solution have in big O notation?
3.) Can this solution also be used to do it with four 3D tensors of shape (m, k, n), where again we only want to do the computations along the axes of length n (dim=2)?
4.) If I want to do it in log-space for numerical stability, i.e., to use logsumexp for summations and sums for multiplications (because log(xy) = log(x) + log(y)), any solution with einsum wouldn't work anymore. How could that computation then be done in log space?

1.) This implementation seems to work, provided I didn't mix up the diagonal dimensions.
import math
import numpy as np
import torch as th
x = np.array([1,4,5,3])
y = np.array([5,2,4,5])[np.newaxis]
z = np.array([5,7,4,5])[np.newaxis][np.newaxis]
w = np.array([3,9,5,1])[np.newaxis][np.newaxis][np.newaxis]
xth = th.Tensor(x)
yth = th.Tensor(y)
zth = th.Tensor(z)
wth = th.Tensor(w)
tensor = xth*th.transpose(yth, 0, 1)*th.transpose(zth,0,2)*th.transpose(wth,0,3)
diag = th.diagonal(tensor, dim1 = -2, dim2 = -1)
result = th.sum(tensor) - th.sum(diag)
result /= math.factorial(len(x)) # np.math is no longer available in recent NumPy versions
print(result)
2.) The code above materializes the full n x n x n x n broadcast product, so both time and memory are O(n^4) per row.
3.) I don't see why not, just choose the dimensions to transpose properly and take the diagonal.
4.) I don't see why this solution wouldn't work in log-space.
PS: my knowledge of PyTorch is quite limited, but I'm sure you can define x, y, z, w in a more elegant way.
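For the two-tensor case from the question, the masked outer product can be avoided altogether, because the sum over i != j equals the product of the row sums minus the sum of the elementwise products. The sketch below is written in NumPy so that it can be checked against the einsum formulation; the function names are hypothetical, and a PyTorch version would presumably use the corresponding tensor sums along dim=1. It does not cover the power-of-four case.

```python
import numpy as np

def offdiag_mean_sums(x, y):
    """Row-wise 1/(n(n-1)) * sum over i != j of x[b, i] * y[b, j], using O(m n) memory."""
    n = x.shape[1]
    total = x.sum(axis=1) * y.sum(axis=1)   # all pairs (i, j), including i == j
    diag = (x * y).sum(axis=1)              # the pairs with i == j
    return (total - diag) / (n * (n - 1))

def offdiag_mean_einsum(x, y):
    """Reference version that materialises the (m, n, n) batch of outer products."""
    n = x.shape[1]
    outer = np.einsum('bi,bj->bij', x, y)
    return (outer * (1 - np.eye(n))).sum(axis=(1, 2)) / (n * (n - 1))

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 5))
y = rng.standard_normal((3, 5))
fast = offdiag_mean_sums(x, y)
slow = offdiag_mean_einsum(x, y)
```

The first version needs O(m n) time and memory, compared with O(m n^2) for the masked outer product.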

Vectorized Portfolio Risk

I have N pairs of portfolio weights stored in a NumPy array and would like to calculate portfolio risk, which is w * E * w_T, where w_T is the transpose of the weights. The way I came up with is to loop through each weight pair and apply the matrix multiplication. Is there a vectorized approach such that, given a weight pair (or, if possible, N weights that all sum to 1), I apply a single covariance matrix to each row to get the risk (i.e., without a loop)?
import numpy as np
w = np.array([[0.2,0.8],[0.5,0.5]])
covar = np.array([0.000046,0.000017,0.000017,0.000032]).reshape([2,2])
w1 = w[0].reshape([1,2]) # each row in w
#portfolio risk
np.dot(np.dot(w1,covar),w1.T)

@Adam's answer is valid, but for big arrays it can create very large temporary arrays (NxN) and perform unnecessary computations (computing the off-diagonal elements).
Here's a similar, yet much more efficient solution:
(I added another weight pair, to distinguish between the different dimensions of the problem.)
w = np.array([[0.2,0.8],[0.5,0.5], [0.33, 0.67]])
covar = np.array([0.000046,0.000017,0.000017,0.000032]).reshape([2,2])
(np.dot(w, covar) * w).sum(axis=-1)
=> array([ 2.77600000e-05, 2.80000000e-05, 2.68916000e-05])
By using plain elementwise multiplication in the second step, I'm avoiding the unnecessary computation of the off-diagonal elements.
EDIT: explaining the temporary arrays
# first multiplication (in both solutions)
np.dot(w, covar).shape
(3, 2)
# second, my solution
(np.dot(w, covar) * w).shape
(3, 2)
# second, Adam's solution
np.dot(np.dot(w,covar),w.T).shape
(3, 3)
Now, if you have N sets of weights you want to compute risk for (in this example N=3), and M instruments in your portfolio (here M=2), and N>>M, you get a much bigger array with Adam's solution (NxN). Not only does it consume more memory, but the computation that populates the off-diagonal elements (a matrix multiplication) is expensive and unnecessary.
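The same row-wise quadratic form can also be written as a single np.einsum call, which never forms the NxN product. This is offered as an equivalent formulation, not as a claim about which variant is faster.

```python
import numpy as np

w = np.array([[0.2, 0.8], [0.5, 0.5], [0.33, 0.67]])
covar = np.array([[0.000046, 0.000017], [0.000017, 0.000032]])

# risk[n] = sum over j, k of w[n, j] * covar[j, k] * w[n, k]
risk = np.einsum('nj,jk,nk->n', w, covar, w)
```

The subscript string makes the contraction explicit: n indexes the weight sets, while j and k run over the instruments.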

It seems like your code is already set up for a vectorized approach, but you are only dealing with one row at a time. Grabbing the diagonals from the result when using your full weight matrix should give you what you want.
# portfolio risk
np.diagonal(np.dot(np.dot(w,covar),w.T))

Reverse sort and argsort in python

I'm trying to write a function in Python (still a noob!) which returns indices and scores of documents ordered by the inner products of their tfidf scores. The procedure is:
Compute vector of inner products between doc idx and all other documents
Sort in descending order
Return the "scores" and indices from the second one to the end (i.e. not itself)
The code I have at the moment is:
import h5py
import numpy as np
def get_related(tfidf, idx):
    ''' return the top documents '''
    # calculate inner product
    v = np.inner(tfidf, tfidf[idx].transpose())
    # sort
    vs = np.sort(v.toarray(), axis=0)[::-1]
    scores = vs[1:,]
    # sort indices
    vi = np.argsort(v.toarray(), axis=0)[::-1]
    idxs = vi[1:,]
    return (scores, idxs)
where tfidf is a sparse matrix of type '<type 'numpy.float64'>'.
This seems inefficient, as the sort is performed twice (sort() then argsort()), and the results have to then be reversed.
Can this be done more efficiently?
Can this be done without converting the sparse matrix using toarray()?

I don't think there's any real need to skip the toarray. The v array will be only n_docs long, which is dwarfed by the size of the n_docs × n_terms tf-idf matrix in practical situations. Also, it will be quite dense since any term shared by two documents will give them a non-zero similarity. Sparse matrix representations only pay off when the matrix you're storing is very sparse (I've seen >80% figures for Matlab and assume that Scipy will be similar, though I don't have an exact figure).
The double sort can be skipped by doing
v = v.toarray()
vi = np.argsort(v, axis=0)[::-1]
vs = v[vi]
Btw., your use of np.inner on sparse matrices is not going to work with the latest versions of NumPy; the safe way of taking an inner product of two sparse matrices is
v = (tfidf * tfidf[idx, :]).transpose()
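Putting the single-argsort idea into a small function could look like the sketch below. It works on an already dense score vector; the name top_related and the sample scores are made up for illustration.

```python
import numpy as np

def top_related(scores, idx):
    """Return (sorted_scores, sorted_indices) in descending order, excluding document idx."""
    scores = np.asarray(scores).ravel()
    order = np.argsort(scores)[::-1]   # single sort, reversed for descending order
    order = order[order != idx]        # drop the query document by index
    return scores[order], order

v = np.array([0.1, 0.9, 0.4, 1.0, 0.0])   # made-up inner products; document 3 is the query
related_scores, related_idx = top_related(v, idx=3)
```

The query document is removed by index instead of by slicing off the first entry, because with unnormalized vectors another document could score higher than the query does against itself.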
