I am trying to cluster points in a test dataset based on their similarity to a sample dataset, using Euclidean distance. The test dataset has 500 points, each an N-dimensional vector (N=1024). The training dataset has around 10000 points, each also a 1024-dimensional vector. The goal is to compute the L2 distance between each test point and all the sample points in order to find the closest sample (without using any Python distance functions). Since the test array and training array have different sizes, I tried using broadcasting:
import numpy as np
dist = np.sqrt(np.sum( (test[:,np.newaxis] - train)**2, axis=2))
where test is an array of shape (500, 1024) and train is an array of shape (10000, 1024). I am getting a MemoryError, although the same code works for smaller arrays. For example:
test= np.array([[1,2],[3,4]])
train=np.array([[1,0],[0,1],[1,1]])
Is there a more memory-efficient way to do the above computation without loops? Based on posts online, the L2 distance can be implemented using matrix multiplication: dist(X, Y) = sqrt(X·X - 2·X·Y + Y·Y). So I tried the following:
x2 = np.dot(test, test.T)
y2 = np.dot(train,train.T)
xy = 2* np.dot(test,train.T)
dist = np.sqrt(x2 - xy + y2)
Since the matrices have different shapes, when I tried to broadcast there is a dimension mismatch, and I am not sure what the right way to broadcast is (I don't have much experience with Python broadcasting). I would like to know the right way to implement the L2 distance computation as a matrix multiplication in Python when the matrices have different shapes. The resulting distance matrix should have dist[i, j] = Euclidean distance between test point i and sample point j.
thanks
The broadcast difference test[:, np.newaxis] - train materializes a (500, 10000, 1024) intermediate (roughly 40 GB at float64), hence the MemoryError. Here is the matrix-multiplication formula with broadcasting, with the shapes of the intermediates made explicit:
m = x.shape[0] # x has shape (m, d)
n = y.shape[0] # y has shape (n, d)
x2 = np.sum(x**2, axis=1).reshape((m, 1))
y2 = np.sum(y**2, axis=1).reshape((1, n))
xy = x.dot(y.T) # shape is (m, n)
dists = np.sqrt(x2 + y2 - 2*xy) # shape is (m, n)
The documentation on broadcasting has some pretty good examples.
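A quick sanity check of this formula against the naive broadcasting approach, on shapes small enough that the (m, n, d) intermediate actually fits in memory:
import numpy as np
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))     # plays the role of test
y = rng.normal(size=(7, 8))     # plays the role of train
m, n = x.shape[0], y.shape[0]
x2 = np.sum(x**2, axis=1).reshape((m, 1))
y2 = np.sum(y**2, axis=1).reshape((1, n))
dists = np.sqrt(x2 + y2 - 2 * x.dot(y.T))
naive = np.sqrt(((x[:, None, :] - y[None, :, :])**2).sum(axis=2))
print(np.allclose(dists, naive))    # should print True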
I think what you are asking for already exists in scipy in the form of the cdist function.
from scipy.spatial.distance import cdist
res = cdist(test, train, metric='euclidean')
Simplified and working version from this answer:
x, y = test, train
x2 = np.sum(x**2, axis=1, keepdims=True)
y2 = np.sum(y**2, axis=1)
xy = np.dot(x, y.T)
dist = np.sqrt(x2 - 2*xy + y2)
So the approach you have in mind is correct, but you need to be careful how you apply it.
To make your life easier, consider using the tested and proven functions from scipy or scikit-learn.
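And since the end goal is the closest training sample for each test point, that is just an argmin over the distance matrix; a small sketch, assuming test and train are the arrays from the question:
import numpy as np
from scipy.spatial.distance import cdist
res = cdist(test, train, metric='euclidean')       # shape (500, 10000)
nearest_idx = np.argmin(res, axis=1)               # index of the closest training point per test point
nearest_dist = res[np.arange(res.shape[0]), nearest_idx]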
Related
Let w, x, y, z be torch tensors of shape (m, n). We wish to compute the following unbiased estimator row-wise efficiently (without for loops), i.e., for every row b = 1, ..., m:
1/(n(n-1)(n-2)(n-3)) * sum over pairwise-distinct i, j, k, l of w[b,i] * x[b,j] * y[b,k] * z[b,l]
In the case of only the unbiased estimator of the square of means, i.e., for
1/(n(n-1)) * sum over i != j of x[b,i] * y[b,j],
this is possible, e.g., using torch.einsum:
batch_outer = torch.einsum('bi, bj -> bij', x, y)              # (m, n, n): x[b,i] * y[b,j]
zero_diag = 1 - torch.eye(batch_outer.shape[1])                # zeros on the i == j diagonal
result = (batch_outer * zero_diag).sum(dim=2).sum(dim=1) / (n * (n - 1))
However, for the power-of-four case this is not so easily doable, mostly because the intermediate tensors are no longer square matrices and, in particular, because zeroing out the diagonals becomes very tedious.
My questions:
1.) How can this be implemented efficiently, omitting any for loops?
2.) What time and memory complexity would that solution have, in big-O notation?
3.) Can this solution also be used to do it with four 3D tensors of shape (m, k, n), where again we only want to do the computations along the axes of length n (dim=2)?
4.) If I want to do it in log-space for numerical stability, i.e., use logsumexp for summations and plain sums for multiplications (because log(xy) = log(x) + log(y)), any solution with einsum wouldn't work anymore. How could the computation then be done in log-space?
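For reference, a brute-force loop version of the power-of-four estimator that any vectorized solution can be checked against on small inputs (this assumes the sum runs over pairwise-distinct index tuples, as written above):
import itertools
import torch

def row_estimator_bruteforce(w, x, y, z):
    # O(n^4) per row; only meant as a correctness reference for small n
    m, n = x.shape
    out = torch.zeros(m)
    for b in range(m):
        s = 0.0
        for i, j, k, l in itertools.permutations(range(n), 4):
            s += w[b, i] * x[b, j] * y[b, k] * z[b, l]
        out[b] = s / (n * (n - 1) * (n - 2) * (n - 3))
    return out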
This implementation seems to work, if I didn't make a mess with the diagonal dimensions.
import math
import numpy as np
import torch as th

# Toy example with n = 4; the added axes let broadcasting build the full 4D product tensor.
x = np.array([1, 4, 5, 3])
y = np.array([5, 2, 4, 5])[np.newaxis]
z = np.array([5, 7, 4, 5])[np.newaxis][np.newaxis]
w = np.array([3, 9, 5, 1])[np.newaxis][np.newaxis][np.newaxis]
xth = th.Tensor(x)
yth = th.Tensor(y)
zth = th.Tensor(z)
wth = th.Tensor(w)
# Full outer product: tensor[a, b, c, d] = w[a] * z[b] * y[c] * x[d]
tensor = xth * th.transpose(yth, 0, 1) * th.transpose(zth, 0, 2) * th.transpose(wth, 0, 3)
# Drop the diagonal of the last two dimensions before summing
diag = th.diagonal(tensor, dim1=-2, dim2=-1)
result = th.sum(tensor) - th.sum(diag)
result /= math.factorial(len(x))   # note: np.math is deprecated, use the math module
print(result)
The order is between O(n^2.37..) and O(n^3), depending on the PyTorch implementation of the matrix multiplication.
I don't see why not; just choose the dimensions to transpose properly and take the diagonal.
I don't see why this solution wouldn't work in log-space.
PS: my knowledge of PyTorch is quite limited, but I'm sure you can define x, y, z, w in a more elegant way.
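For the log-space question, here is a minimal sketch for the two-factor case, assuming all entries of x and y are strictly positive so their logs are defined (the i == j terms are excluded by filling them with -inf before the logsumexp):
import math
import torch

m, n = 3, 5
x = torch.rand(m, n) + 0.1
y = torch.rand(m, n) + 0.1

log_pair = x.log()[:, :, None] + y.log()[:, None, :]   # log(x[b,i] * y[b,j]), shape (m, n, n)
diag = torch.eye(n, dtype=torch.bool)
log_sum = torch.logsumexp(log_pair.masked_fill(diag, float('-inf')), dim=(1, 2))
log_est = log_sum - math.log(n * (n - 1))

# agrees with the plain-space estimator up to floating point error
plain = (x[:, :, None] * y[:, None, :]).masked_fill(diag, 0).sum(dim=(1, 2)) / (n * (n - 1))
print(torch.allclose(log_est.exp(), plain))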
I have a 2D matrix of means of size n*m, where n is the number of samples and m is the dimension of the data.
I also have n matrices of size m*m; that is, sigma is my covariance array of shape n*m*m.
I wish to draw n samples from the distributions above, such that x_i ~ N(mean[i], sigma[i]).
Is there any way to do that in numpy or any other standard lib without a for loop?
The only option I thought of was using np.random.multivariate_normal() by flattening the means matrix into one vector and flattening the 3D sigma into a 2D block-diagonal matrix, then reshaping afterwards. But that means sampling with a sigma of shape (n*m)*(n*m), which can easily be ridiculously huge, and just computing and allocating that matrix (if it is possible at all) can take longer than running a for loop.
In my specific task, Sigma is currently the same matrix for all samples, so I could express it as a single m*m matrix shared by all n points. But I am interested in a general solution.
Appreciate your help.
Difficult to tell without testable code, but this should be close:
import numpy as np

A = np.linalg.cholesky(sigma)                  # shape (n, m, m), same as sigma
Z = np.random.normal(size=(n, m))              # shape (n, m)
X = np.einsum('ijk, ik -> ij', A, Z) + mean    # shape (n, m)
What's going on:
We're manually sampling multivariate normal distributions according to the standard Cholesky decomposition method outlined here. A is built such that A @ A.T = sigma for each of the n matrices. Then X (the multivariate normal) is formed as the dot product of A with a vector Z of standard normal N(0, 1) samples, plus the mean.
You keep the extraneous first axis (index 0, 'i' in the einsum) throughout the calculation, while contracting the last ('k') axis, which forms the dot product.
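A usage sketch with made-up sizes; n, m and the way sigma is constructed here are just placeholders to check that the shapes line up:
import numpy as np

n, m = 1000, 4
mean = np.random.normal(size=(n, m))
B = np.random.normal(size=(n, m, m))
sigma = B @ np.transpose(B, (0, 2, 1)) + 1e-6 * np.eye(m)   # n symmetric positive-definite matrices

A = np.linalg.cholesky(sigma)                  # (n, m, m)
Z = np.random.normal(size=(n, m))              # (n, m)
X = np.einsum('ijk, ik -> ij', A, Z) + mean    # (n, m), X[i] ~ N(mean[i], sigma[i])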
I am working with Trimesh and trying to compute some statistics on the meshes. One of the possible statistics (and the one I am using to illustrate the question) is a histogram of the areas of triangles formed by 3 random vertices of the mesh. Currently I am doing the following, but I would like to know if there's any way to avoid using a loop.
def CalcArea(self, p):
    return 0.5 * np.linalg.norm(np.cross(p[1] - p[0], p[2] - p[0]))

v_c = self.mesh.vertices.copy()
np.random.shuffle(v_c)
areas = [self.CalcArea(v_c[i:i+3]) for i in range(len(v_c[:-2]))]
The numpy documentation is your friend :-).
np.cross and np.linalg.norm work on arrays of vectors as well. And they support the powerful keyworded argument axis.
I'm assuming your v_c has the shape (N, 3), where N is your number of vertices. Let's assume it's a multiple of three for simplicity, then:
N = 30
v_c = np.random.random((N, 3))
v1 = v_c[N//3:2*N//3, :] - v_c[:N//3, :]
v2 = v_c[2*N//3:, :] - v_c[:N//3, :]
area = 0.5*np.linalg.norm(np.cross(v1, v2), axis=1)
Note that this involves the creation of temporary arrays so maybe keep an eye out for very large N.
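If the sliding-window behaviour of the original list comprehension isn't essential, an alternative sketch is to group the shuffled vertices into consecutive, non-overlapping triples with reshape (leftover vertices are simply dropped; v_c here is a stand-in for the shuffled mesh vertices):
import numpy as np

v_c = np.random.random((31, 3))          # stand-in for the shuffled mesh vertices
K = (len(v_c) // 3) * 3                  # trim to a multiple of 3
tri = v_c[:K].reshape(-1, 3, 3)          # (num_triangles, 3 vertices, 3 coords)
areas = 0.5 * np.linalg.norm(np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]), axis=1)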
I would like to write a function for the formula 5.5.1 on this page:
Electric potential of point charges
I found scipy.spatial.distance
It has the Euclidean distance with a weight function, which I could use, but there are also squareform and sqeuclidean.
I am new to scipy, so I either have to test this out or rely on someone with experience.
Are these three versions basically equivalent for my use case, or should I choose one over the others? In the long run I want to calculate a large number of points (in the range of millions), so performance and memory usage will matter here.
Numpy would be sufficient for this problem. The formula can easily be vectorized, so you don't need Python loops to compute it, e.g.
import numpy as np

def get_electric_potential(x, qi, ri):
    """
    x  = (x1, x2, x3) -- numpy array of shape (3, 1)
    qi = (q1, ..., qN) -- numpy array of shape (N,)
    ri -- numpy array of shape (3, N)
    """
    eps0 = 0.00001  # Put your value here!!!
    C = 1 / 4 / np.pi / eps0
    C = 1  # Comment this line out once eps0 is set
    return C * ((qi / np.diag((x - ri).T @ (x - ri))) * (x - ri)).sum(axis=1)
So, x is a vector at which you want to compute the value of the potential;
ri is a matrix of shape 3xN, each column of which represents the location of an electric charge; qi is the corresponding vector of electric charges.
Example:
x = np.array([1,2,3])[:, np.newaxis]
ri = np.arange(30).reshape(3, 10)
qi = np.arange(10)
get_electric_potential(x, qi, ri)
array([-0.28622007, -0.83010791, -1.37399575])
Note: to use this formula you need to define the constant C, i.e., define eps0.
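As a side note, np.diag((x - ri).T @ (x - ri)) builds an N x N matrix only to read off its diagonal (the squared distances), which will hurt once N reaches the millions mentioned in the question. A sketch of the same computation that skips the Gram matrix (the function name is just illustrative):
import numpy as np

def get_electric_potential_direct(x, qi, ri, eps0=0.00001):   # eps0 placeholder, as above
    C = 1 / 4 / np.pi / eps0
    d2 = np.sum((x - ri)**2, axis=0)    # shape (N,): squared distances |x - ri|^2
    return C * ((qi / d2) * (x - ri)).sum(axis=1)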
I am interested in computing the power spectrum of a system of particles (~100,000) in 3D space with Python. What I have found so far is a group of functions in Numpy (fft,fftn,..) which compute the discrete Fourier transform, of which the square of the absolute value is the power spectrum. My question is a matter of how my data are being represented - and truthfully may be fairly simple to answer.
The data structure I have is an array of shape (n, 3), n being the number of particles I have, and each column representing the x, y, or z coordinate of the n particles. The function I believe I should be using is the fftn() function, which takes the discrete Fourier transform of an n-dimensional array, but it says nothing about the format. How should the data be represented as a data structure to be fed into fftn?
Here is what I've tried so far to test the function:
import numpy as np
import random
import matplotlib.pyplot as plt

DATA = np.zeros((100, 3))
for i in range(len(DATA)):
    DATA[i, 0] = random.uniform(-1, 1)
    DATA[i, 1] = random.uniform(-1, 1)
    DATA[i, 2] = random.uniform(-1, 1)

FFT = np.fft.fftn(DATA)
PS = abs(FFT)**2
plt.plot(PS)
plt.show()
The array entitled DATA is a mock array; ultimately it will be 100,000 by 3 in shape. The output of the code gives me what looks like three 1D power spectra (one for each column of my data), but really I'd like a power spectrum as a function of radius.
Does anybody have any advice or know of alternative methods/packages to compute the power spectrum? (I'd even settle for the two-point autocorrelation function.)
It doesn't quite work the way you are setting it out...
You need a function, let's call it f(x, y, z), that describes the density of mass in space. In your case, you can consider the galaxies as point masses, so you will have a delta function centered at the location of each galaxy. It is for this function that you can calculate the three-dimensional autocorrelation, from which you can calculate the power spectrum.
If you want to use numpy to do that for you, you are first going to have to discretize your function. A possible mock example would be:
import numpy as np
import matplotlib.pyplot as plt
space = np.zeros((100, 100, 100), dtype=np.uint8)
x, y, z = np.random.randint(100, size=(3, 1000))
space[x, y, z] += 1
space_ps = np.abs(np.fft.fftn(space))
space_ps *= space_ps
space_ac = np.fft.ifftn(space_ps).real.round()
space_ac /= space_ac[0, 0, 0]
And now space_ac holds the three-dimensional autocorrelation function for the data set. This is not quite what you are after, and to get your one-dimensional correlation function you would have to average the values on spherical shells around the origin:
dist = np.minimum(np.arange(100), np.arange(100, 0, -1))
dist *= dist
dist_3d = np.sqrt(dist[:, None, None] + dist[:, None] + dist)
distances, idx = np.unique(dist_3d, return_inverse=True)
idx = idx.ravel()   # flatten in case the inverse comes back with the input's shape
values = np.bincount(idx, weights=space_ac.ravel()) / np.bincount(idx)
plt.plot(distances[1:], values[1:])
There is another issue with doing things yourself this way: when you compute the power spectrum as above, mathematically it is as if your three-dimensional array wrapped around at the borders, i.e. point [99, y, z] is a neighbour of [0, y, z]. So your autocorrelation could show two very distant galaxies as close neighbours. The simplest way to deal with this is to make your array twice as large along every dimension, pad with extra zeros, and then discard the extra data, as in the sketch below.
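A minimal sketch of that zero-padding, reusing the mock space array built above (the 100-per-side grid is just the mock size):
import numpy as np

padded = np.zeros((200, 200, 200), dtype=np.uint8)
padded[:100, :100, :100] = space              # data sits in one corner, zeros elsewhere
padded_ps = np.abs(np.fft.fftn(padded))**2
padded_ac = np.fft.ifftn(padded_ps).real      # autocorrelation free of wrap-around mixing
padded_ac = padded_ac[:100, :100, :100]       # discard the lags that belong to the padding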
Alternatively you could use scipy.ndimage.correlate with mode='constant' to do all the dirty work for you.