Numpy, avoid loop in 3d array difference nested summation - python

I have a simple problem in Numpy: I have 3D coordinates and I want to compute the overlap between two distinct configurations with the following function:
import pylab as pl  # pl.norm is numpy.linalg.norm re-exported by pylab

def Overlap(rt, r0, a):
    s = 0
    for i in range(len(rt)):
        # count how many vectors of rt lie within distance a of r0[i]
        s += ((pl.norm(r0[i] - rt, axis=1) <= a).astype('int')).sum()
    return s
where rt and r0 are two m-by-3 arrays holding the configurations.
In practice, it computes the distance between each vector in the first configuration and every vector in the second, checks it against a threshold value a, and returns the total count after looping over all positions. Is there a smart way to avoid the explicit for loop? I have the feeling the complexity cannot really be reduced, but perhaps there is a way to avoid the slowness of the native Python for construct.

How about the following:
from scipy.spatial.distance import cdist
import numpy as np
overlap = np.sum(cdist(rt, r0) <= a)
When m is 1000, this is about 9x faster on my machine; the relative speedup is even larger for small arrays.
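A minimal runnable sketch of the same idea, using random stand-in arrays (the names rt, r0 and the threshold a follow the question):
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
rt = rng.random((1000, 3))   # stand-in for the first configuration
r0 = rng.random((1000, 3))   # stand-in for the second configuration
a = 0.1

# cdist builds the full (m, m) matrix of pairwise Euclidean distances;
# thresholding and summing then counts every overlapping pair at once.
overlap = np.sum(cdist(rt, r0) <= a)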

Related

How to efficiently iterate through rows in a large matrix with too many columns?

I'm working on document clustering where I first build a distance matrix from the tf-idf results. I use the below code to get my tf-idf matrix:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words={'english'})
X = vectorizer.fit_transform(models)
This results in a matrix of shape (9069, 22210). Now I want to build a (9069, 9069) distance matrix from it. I'm using the following code for that:
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix
from scipy.spatial import distance
arrX = X.toarray()
rowSize = X.shape[0]
distMatrix = np.zeros(shape=(rowSize, rowSize))
# build the distance matrix
for i, x in enumerate(arrX):
    for j, y in enumerate(arrX):
        distMatrix[i][j] = distance.braycurtis(x, y)
np.savetxt("dist.csv", distMatrix, delimiter=",")
The problem with this code is that it's extremely slow for this matrix size. Is there a faster way of doing this?
The biggest issue is that the algorithm runs in O(n² · d) time: each call to distance.braycurtis works on two vectors of length 22210, and the call is made 9069 × 9069 times. That amounts to several trillion scalar operations, which is huge. The complexity itself probably cannot be improved, but there are several ways to speed this up:
The first thing to do is not to compute each distance twice. Indeed, this distance is symmetric, so distMatrix[i][j] == distMatrix[j][i]: you can compute only the upper triangular part and then copy it to the lower triangular part.
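To illustrate that first point, here is a minimal sketch (on a small random stand-in for the tf-idf matrix) that evaluates each pair only once and mirrors the result:
import numpy as np
from scipy.spatial import distance

arrX = np.random.rand(100, 50)   # small stand-in for X.toarray()
n = arrX.shape[0]
distMatrix = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):                        # upper triangular part only
        d = distance.braycurtis(arrX[i], arrX[j])
        distMatrix[i, j] = distMatrix[j, i] = d      # mirror into the lower part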
Another optimization is simply not to use distance.braycurtis, because it is slow: it takes about 10 µs per call on my machine. This is mainly because it creates several temporary arrays, is mostly memory-bound due to the Numpy operations, and because np.sum is not very fast (it uses a fairly precise summation algorithm that is hard to optimize). Moreover, it is sequential, while nearly all mainstream processors have multiple cores nowadays. We can use Numba to massively speed up this operation:
import numpy as np
import numba as nb

@nb.njit(['float32(float32[::1], float32[::1])', 'float64(float64[::1], float64[::1])'], fastmath=True)
def fastBrayCurtis(arr1, arr2):
    assert arr1.size == arr2.size
    assert arr1.size > 0
    zero = arr1[0] * 0  # Trick to give `zero` the same dtype as `arr1`
    df, sm = zero, zero
    for k in range(arr1.size):
        df += np.abs(arr1[k] - arr2[k])
        sm += np.abs(arr1[k] + arr2[k])
    return df / sm

# The signatures are provided so the functions are compiled eagerly
# for both 32-bit and 64-bit floating-point contiguous arrays.
@nb.njit(['float32[:,::1](float32[:,::1])', 'float64[:,::1](float64[:,::1])'], fastmath=True, parallel=True)
def brayCurtisDistMatrix(arr):
    n = arr.shape[0]
    distance = np.empty((n, n), dtype=arr.dtype)
    # Compute the distance matrix in parallel while balancing the work between threads
    for i in nb.prange((n + 1) // 2):
        # Top of the upper triangular part (many items)
        for j in range(i, n):
            distance[j, i] = distance[i, j] = fastBrayCurtis(arr[i], arr[j])
        # Bottom of the upper triangular part (few items)
        for j in range(n - 1 - i, n):
            distance[j, n - 1 - i] = distance[n - 1 - i, j] = fastBrayCurtis(arr[n - 1 - i], arr[j])
    return distance
This code is about 440 times faster than the initial one on my 6-core i5-9600KF processor. In fact, a quick theoretical analysis combined with profiling results shows that the algorithm is close to optimal (more than 75% of my processor's computing power is used)! If this is not enough, you should consider using the single-precision implementation. If that is still not enough, you should also consider writing optimized GPU code for this (or simply reconsider the need to compute such a huge distance matrix).
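A brief usage sketch, assuming X is the tf-idf matrix from the question (casting to single precision is optional but halves the memory footprint):
arrX = X.toarray().astype(np.float32)
distMatrix = brayCurtisDistMatrix(arrX)
np.savetxt("dist.csv", distMatrix, delimiter=",")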
You see, the individual elements of the NumPy multidimensional array you pass in can be stored in memory in two ways:
ROW MAJOR (C order)
COLUMN MAJOR (Fortran order)
Each has its advantages and disadvantages, since the layout determines whether rows or columns are contiguous in memory.
You can even control the way an array is stored.
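For instance, a short sketch showing how the layout can be chosen and inspected:
import numpy as np

a = np.arange(6).reshape(2, 3)        # row major (C order) by default
f = np.asfortranarray(a)              # same values, column-major (Fortran) storage
print(a.flags['C_CONTIGUOUS'])        # True
print(f.flags['F_CONTIGUOUS'])        # True

z = np.zeros((4, 4), order='F')       # the order can also be requested at creation time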
I hope you find this helpful

Efficiently compute cosine similarity

I have a bank of about 100k strings and when I get a new string, I want to match it to the most similar string.
My thoughts were to use tf-idf (makes sense as keywords are quite important), then match using the cosine distance. Is there an efficient way to do this using pandas/scikit-learn/scipy etc? I'm currently doing this:
df['cosine_distance'] = df.apply(lambda x: cosine_distances(x["tf-idf"], x["new_string"]), axis=1)
which is obviously quite slow. I was thinking of maybe a KD-tree, but it takes a lot of memory as the tf-idf vectors have a dimension of 2000.
Consider using vectorized computations rather than looping over DataFrame rows (which is very slow and should be avoided).
I'm not sure how the arrays are represented in the dataframe, so make sure you're starting out with two arrays of the same shape.
from numpy import einsum
from numpy.linalg import norm

# Note: if each cell holds its own vector, stack them first (e.g. with numpy.stack)
# so both arrays are 2D with shape (n_rows, n_features).
arr_a = df["tf_idf"].values
arr_b = df["new_string"].values

cos_sim = einsum('ij,ij->i', arr_a, arr_b) / (norm(arr_a, axis=1) * norm(arr_b, axis=1))
df["cosine_distance"] = 1 - cos_sim
This code directly calculates the cosine distance using vector operations (einsum reference) and will run orders of magnitude faster than the DataFrame.apply() method.
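A self-contained version of the same computation on random stand-in data (shapes chosen to mimic 2000-dimensional tf-idf vectors):
import numpy as np

rng = np.random.default_rng(0)
arr_a = rng.random((5, 2000))    # stand-in for the stacked tf-idf vectors
arr_b = rng.random((5, 2000))    # stand-in for the stacked new-string vectors

cos_sim = np.einsum('ij,ij->i', arr_a, arr_b) / (
    np.linalg.norm(arr_a, axis=1) * np.linalg.norm(arr_b, axis=1))
cosine_distance = 1 - cos_sim    # one distance per row, no Python-level loop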

Fastest way to generate and sum arrays

I am generating a series of Gaussian arrays given an x vector of length 1400 and arrays for the sigma, center, and amplitude (amp), all of length 100. I thought the best way to speed this up would be to use numpy and a list comprehension:
g = np.sum([(amp[i]*np.exp(-0.5*(x - (center[i]))**2/(sigma[i])**2)) for i in range(len(center))],axis=0)
Each row is a gaussian along a vector x, and then I sum the columns into a single array of length x.
But this doesn't seem to speed things up at all. I think there is a faster way to do this while avoiding the for loop but I can't quite figure out how.
You should use vectorized computation instead of a comprehension, so the loops are all performed at C speed.
To do so, you have to reshape x into a column vector. For example, you could do x = x.reshape((1400, 1)).
Then you can operate directly on the arrays, like this:
v = amp * np.exp(-0.5 * (x - center)**2 / sigma**2)
This gives an array of shape (1400, 100), which you can collapse into a vector with np.sum(v, axis=1).
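Putting the reshape and the broadcasted expression together, a minimal runnable sketch with random stand-in data:
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1400)
amp, center, sigma = rng.random((3, 100)) + 0.1          # +0.1 keeps sigma away from zero

xc = x.reshape(1400, 1)                                  # column vector
v = amp * np.exp(-0.5 * (xc - center)**2 / sigma**2)     # broadcasts to shape (1400, 100)
g = np.sum(v, axis=1)                                    # one value per x, shape (1400,)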
You should try to vectorize all the operations. IMHO the most efficient approach is to first convert your input data to numpy arrays (if they are plain Python lists) and then let numpy handle the computations:
np_amp = np.array(amp)
np_center = np.array(center)
np_sigma = np.array(sigma)
# x[:, None] broadcasts the (1400,) grid against the 100 parameter values
g = np.sum(np_amp * np.exp(-0.5 * (x[:, None] - np_center)**2 / np_sigma**2), axis=1)

Python numpy : "Array is too big"

import numpy
from scipy.spatial.distance import pdist
X = numpy.zeros(50000,25)
C = pdist(X, 'euclidian')
I want to find the pairwise distances between all the rows, but numpy gives the error: Array is too big.
I think the problem is the size of C: pdist cannot create a (50000, 50000) array. I don't know why numpy imposes this restriction; I can run the same code in Matlab. How can I make this code run?
I also found possible duplicates, but their matrix sizes are even bigger:
Is it possible to create a 1million x 1 million matrix using numpy?
Very large matrices using Python and NumPy
First off, there are a couple of typos in your code. It should be:
X = numpy.zeros((50000, 25))  # the shape goes in as a tuple
C = pdist(X, 'euclidean')     # euclidean with an e
Of course, that does not matter for the question itself.
The Euclidean pdist is essentially a call to numpy.linalg.norm (http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.norm.html), which is a very general function. If it does not work in your case due to memory constraints, you can always build something yourself. Two rows of X take hardly any memory, and a single pairwise comparison can be done like this:
np.sqrt(np.sum(np.square(X[0] - X[1])))
And then you only need to loop through the whole thing.
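As a rough sketch of that loop (an illustration under the question's shapes), you can compute the distances from one row to all the others at a time, so only a 50000-element vector is ever held in memory:
import numpy as np

X = np.random.rand(50000, 25)    # stand-in for the data

def distances_to_row(X, i):
    # Euclidean distances from row i to every row of X; only an (n,) vector is allocated
    return np.sqrt(np.sum((X - X[i])**2, axis=1))

d0 = distances_to_row(X, 0)      # shape (50000,)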
Hope it helps,
P

Numpy, all pairwise correlations of a 3d array

I have an array of shape (l,m,n). I'm trying to calculate a distance matrix of shape (l,m,n) where entry (i,j,k) is the coefficient between vectors (i,j,:) and (i,:,k). I haven't found anything in numpy or scipy that fits the bill.
I tried using a for loop and iterating along axis 0, then feeding that to scipy.spatial.distance.pdist, but that takes a long time as pdist itself uses a nested for loop. In essence, what I would like to do would be to perform pdist down axis 0, but ideally make it so pdist doesn't use for loops either....
Any thoughts?
I would personally write a little Cython function to do this ( http://cython.org). Write and test an iterative pure Python version (with for loops), move it to a .pyx Cython file, add type declarations and follow the NumPy integration guide:
http://docs.cython.org/src/tutorial/numpy.html
Might seem like work but if you're doing computing in Python, some basic Cython skills are well worth cultivating as it makes writing C extensions much easier.
Any thoughts?
First thought is that you cannot compute such distances as long as m != n
Second thought is that the internal loops of pdist should not bother you, since they are written in C; the likely bottleneck is not the implementation but the sheer amount of computation needed.
Final thought is that your problem may be solved by numpy.einsum and linear algebra:
Code (which I assume to be optimal):
# assuming `a` is the (l, m, n) input array (with m == n, as noted above)
products = numpy.einsum('ijl, ilk -> ijk', a, a)
distances = numpy.einsum('ijj -> ij', products)
distances = distances[:, :, None] + distances[:, None, :] - 2 * products
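A quick shape check on a toy array (the name a is assumed here for the (l, m, n) input, with m == n):
import numpy as np

a = np.random.rand(4, 5, 5)                           # l = 4, m = n = 5
products = np.einsum('ijl,ilk->ijk', a, a)
distances = np.einsum('ijj->ij', products)
distances = distances[:, :, None] + distances[:, None, :] - 2 * products
print(distances.shape)                                # (4, 5, 5)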
