I'm currently learning about the Mahalanobis distance and I find it quite difficult. To get a better feel for it, I generated two sets of random values (x and y) and a random point, all with mean = 0 and standard deviation = 1. How can I calculate the Mahalanobis distance between them? Please find my Python code below.
Many thanks for your help!
import random
import numpy as np
from numpy import cov
from scipy.spatial import distance

# generate 20 random values where mean = 0 and standard deviation = 1,
# assign one set to x and one to y
x = [random.normalvariate(0, 1) for i in range(20)]
y = [random.normalvariate(0, 1) for i in range(20)]
r_point = [random.normalvariate(0, 1)]  # that's my random point

sigma = cov(x, y)
print(sigma)
print("random point =", r_point)

# use the covariance to calculate the Mahalanobis distance from a random point
Here is an example that shows how to compute the Mahalanobis distance from a point r_point to some data. The Mahalanobis distance takes into account the variance and correlation of the data you are measuring the distance to (using the inverse of its covariance matrix). Here, the Mahalanobis distance and the Euclidean distance should be very close because of the distribution of the data (0-mean and standard deviation of 1). For other data, they will be different. Since the data here have mean approximately 0, the distance is measured from the origin; for general data you would subtract the sample mean of the data from r_point first.
import numpy as np

N = 5000
mean = 0.0
stdDev = 1.0
data = np.random.normal(mean, stdDev, (2, N))  # 2D random points
r_point = np.random.randn(2)

cov = np.cov(data)

# Mahalanobis distance of r_point from the (approximately zero) mean of the data
mahalanobis_dist = np.sqrt(r_point.T @ np.linalg.inv(cov) @ r_point)
print("Mahalanobis distance = ", mahalanobis_dist)

euclidean_dist = np.sqrt(r_point.T @ r_point)
print("Euclidean distance = ", euclidean_dist)
Related
The task is to find a point with coordinates (x, 0) such that the Euclidean distance from it to the most distant point of the original set is minimal.
My idea is to minimize a function that computes the Euclidean distances, like this:
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from scipy.optimize import minimize

def function_3(points_x, points_y):
    dots = np.array([points_x, points_y])
    ans = minimize(cdist(dots, points1), x0=0)
    return ans
But it seems like I'm doing something wrong... Can somebody give me some advice?
Solution
Here's a complete working example for fitting points of the form (x, 0):
import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import minimize

# set up a test set of 100 points to fit against
n = 100
xyTestset = np.random.rand(n, 2) * 10

def fun(x, xycomp):
    # x is a vector, assumed to be of size 1
    # cdist expects a 2D array, so we reshape xy into a 1x2 array
    xy = np.array((x[0], 0)).reshape(1, -1)
    return cdist(xy, xycomp).max()

fit = minimize(fun, x0=0, args=xyTestset)
print(fit.x)
which outputs:
[5.06807808]
This means, roughly speaking, that the minimization is landing near the middle of the random test points (for points uniformly scattered on [0, 10], that is close to their centroid), as expected. If you wanted to do a 2D fit to points of the form (x, y) instead, you can do:
import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import minimize

# set up a test set of 100 points to fit against
n = 100
xyTestset = np.random.rand(n, 2) * 10

def fun(x, xycomp):
    # x is a vector, assumed to be of size 2
    return cdist(x.reshape(1, -1), xycomp).max()

fit = minimize(fun, x0=(0, 0), args=xyTestset)
print(fit.x)
which outputs:
[5.21292828 5.01491085]
which, again, is roughly the center of the cloud of 100 random points in xyTestset, as you'd expect.
Complete explanation
The problem that you're running into is that scipy.optimize.minimize has very specific expectations about the form of its first argument fun. fun is supposed to be a function that takes x as its first argument, where x is a 1D vector of the values to be minimized over. fun can also take additional arguments. These have to be passed into minimize via the args parameter, and their values are constant (i.e., they won't change over the course of the minimization).
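As a minimal sketch of that calling pattern (the function fun, the parameter name offset, and the constant 3.0 below are just placeholders for illustration):

import numpy as np
from scipy.optimize import minimize

def fun(x, offset):
    # x: 1D array of the values being optimized; offset: a constant passed in via args
    return (x[0] - offset) ** 2

fit = minimize(fun, x0=np.array([0.0]), args=(3.0,))
print(fit.x)  # approximately [3.]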
Also, you should be aware that your case of fitting (x, 0) can be simplified. It's effectively a 1D problem, so all you need to do is calculate the x distances between the points. You can completely ignore the y distances and still get the same results.
Additionally, you don't strictly need a minimization to solve the problem you stated. The point that minimizes the distance to the farthest point is the center of the smallest enclosing ball of the points, which for uniformly scattered data lies close to their centroid. The coordinates of the centroid are the means of each coordinate in your set of points, so if your points are stored in an Nx2 array xydata you can calculate it by just doing:
xydata.mean(axis=0)
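A tiny self-contained sketch of that computation, using a made-up point set:

import numpy as np

xydata = np.random.rand(100, 2) * 10   # illustrative Nx2 point set
centroid = xydata.mean(axis=0)          # mean of each coordinate column
print(centroid)                         # roughly [5, 5] for points uniform on [0, 10]^2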
I have a set of values D:
[[  6.83822474   3.54843586]
 [ 12.45778114   4.42755159]
 [ 10.27710359   9.47337879]
 ...,
 [ 46.55259568  64.73755611]
 [ 51.50842754  44.60132979]]
Given a Multivariate Gaussian distribution with mean M and covariance V:
What is the equivalent multivariate case of a univariate point being within 2 standard deviations of the mean? i.e., assuming I have a univariate distribution with mean A and standard deviation B, I can say a point x_i is within 2 standard deviations of the mean if |x_i - A| < 2B. What would be the equivalent of this in the multivariate case?
How would I compute all the points in D that are within 2 standard deviations (or the multivariate equivalent) of the mean M?
It sounds like the generalization that you want is the Mahalanobis distance. A Mahalanobis distance of 1 from the mean is the generalization of one standard deviation from the mean of a univariate Gaussian.
You can compute the Mahalanobis distance using functions in the module scipy.spatial.distance. (There is almost certainly code for this distance in some form in scikit-learn, and possibly statsmodels, but I haven't checked.)
For computing a single distance, there is scipy.spatial.distance.mahalanobis, and for computing distances among or between collections of points, you can use pdist and cdist, respectively (also from scipy.spatial.distance).
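As a quick illustration of the single-pair function (the vectors and covariance matrix below are made up), scipy.spatial.distance.mahalanobis(u, v, VI) takes the inverse covariance matrix VI:

import numpy as np
from scipy.spatial.distance import mahalanobis

V = np.array([[4.0, 1.0],
              [1.0, 3.0]])       # example covariance matrix
VI = np.linalg.inv(V)            # mahalanobis() wants the *inverse* covariance

u = np.array([10.0, 7.0])
v = np.array([12.0, 8.5])
print(mahalanobis(u, v, VI))     # sqrt((u - v)^T VI (u - v))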
Here's a script that uses cdist. In the plot, the points circled in red are within a Mahalanobis distance of 2 from the mean.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
# Mean
M = [10, 7]
# Covariance matrix
V = np.array([[ 9, -2],
              [-2,  2]])
VI = np.linalg.inv(V)
# Generate a sample from the multivariate normal distribution
# with mean M and covariance matrix V.
rng = np.random.default_rng()
x = rng.multivariate_normal(M, V, size=250)
# Compute the Mahalanobis distance of each point in the sample.
mdist = cdist(x, [M], metric='mahalanobis', VI=VI)[:,0]
# Find where the Mahalanobis distance is less than 2.
d2_mask = mdist < 2
x2 = x[d2_mask]
plt.plot(x2[:,0], x2[:,1], 'o',
         markeredgecolor='r', markerfacecolor='w', markersize=6, alpha=0.6)
plt.plot(x[:,0], x[:,1], 'k.', markersize=5, alpha=0.5)
plt.grid(alpha=0.3)
plt.axis('equal')
plt.show()
The correct way to define distance for the multivariate case is the Mahalanobis distance, i.e.

    d(x) = sqrt((x - mean)^T * cov^-1 * (x - mean))
An example of doing this would be:
import numpy as np
vals = np.array([[ 6.83822474,  3.54843586],
                 [12.45778114,  4.42755159],
                 [10.27710359,  9.47337879],
                 [46.55259568, 64.73755611],
                 [51.50842754, 44.60132979]])
# Compute covariance matrix and its inverse
cov = np.cov(vals.T)
cov_inverse = np.linalg.inv(cov)
# Mean center the values
mean = np.mean(vals, axis=0)
centered_vals = vals - mean
# Compute Mahalanobis distance
dist = np.sqrt(np.sum(centered_vals * cov_inverse.dot(centered_vals.T).T, axis=1))
# Find points that are "far away" from the mean
indices = dist > 2
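To sanity-check the result, here is a small sketch (reusing vals, mean, cov_inverse, and dist from above) that compares against SciPy's cdist with the mahalanobis metric:

from scipy.spatial.distance import cdist

# distances of each row of vals from the mean, using the same inverse covariance
scipy_dist = cdist(vals, mean.reshape(1, -1), metric='mahalanobis', VI=cov_inverse)[:, 0]
print(np.allclose(dist, scipy_dist))  # expected: True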
More specifically, given a natural number d, how can I generate random vectors in R^d such that each vector x has Euclidean norm <= 1?
Generating random vectors via numpy.random.rand(1, d) is no problem, but the likelihood of such a random vector having norm <= 1 is predictably bad for even not-small d. For example, even for d = 10 only about 0.2% of such random vectors have an appropriately small norm. So that seems like a silly solution.
EDIT: Re: Walter's comment, yes, I'm looking for a uniform distribution over vectors in the unit ball in R^d.
Based on the Wolfram Mathworld article on hypersphere point picking and Nate Eldredge's answer to a similar question on math.stackexchange.com, you can generate such a vector by generating a vector of d independent Gaussian random variables and a random number U uniformly distributed over the closed interval [0, 1], then normalizing the vector to norm U^(1/d).
Based on the answer by user2357112, you need something like this:
import numpy as np
...
inv_d = 1.0 / d
for ...:
    gauss = np.random.normal(size=d)
    length = np.linalg.norm(gauss)
    if length == 0.0:
        x = gauss
    else:
        r = np.random.rand() ** inv_d
        x = np.multiply(gauss, r / length)
        # conceptually: divide by length, then multiply by r
    # do something with x
(this is my second Python program, so don't shoot at me...)
The tricks are that
the combination of d independent Gaussian variables with the same σ is a Gaussian distribution in d dimensions, which, remarkably, has spherical symmetry,
the Gaussian distribution in d dimensions can be projected onto the unit sphere by dividing by the norm, and
the uniform distribution in a d-dimensional unit ball has cumulative radial distribution r^d (which is what you need to invert: r = U^(1/d); a vectorized sketch of the whole recipe follows this list).
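Putting those pieces together, here is a vectorized sketch of the recipe (the function name and the sample sizes are just illustrative; the chance of drawing an exactly zero-length Gaussian vector is negligible and is not handled here):

import numpy as np

def sample_unit_ball(n, d, rng=None):
    # Draw n points uniformly from the unit ball in R^d:
    # Gaussian vectors give isotropic directions, U^(1/d) gives the radii.
    rng = np.random.default_rng() if rng is None else rng
    gauss = rng.normal(size=(n, d))                        # isotropic directions
    lengths = np.linalg.norm(gauss, axis=1, keepdims=True)
    radii = rng.uniform(size=(n, 1)) ** (1.0 / d)          # invert the CDF r^d
    return gauss / lengths * radii

pts = sample_unit_ball(1000, 10)
print(np.linalg.norm(pts, axis=1).max())  # should be <= 1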
This is the Python / NumPy code I am using. Since it does not use loops, it is much faster:
import numpy as np

n_vectors = 1000
d = 2
rnd_vec = np.random.uniform(-1, 1, size=(n_vectors, d))  # the initial random vectors
unif = np.random.uniform(size=n_vectors)  # a second array of random numbers
scale_f = np.expand_dims(np.linalg.norm(rnd_vec, axis=1) / unif, axis=1)  # the scaling factors
rnd_vec = rnd_vec / scale_f  # the random vectors in R^d
The second array of random numbers (unif) is needed as a second scaling factor because otherwise all the vectors would have Euclidean norm equal to one. Note that scaling by a uniformly distributed radius differs from the r = U^(1/d) scaling above, so the resulting points are not exactly uniform over the ball (they are denser near the center, and the directions inherit a slight bias from the cube).
import numpy as np
from scipy.spatial import distance
d1 = np.random.randint(0, 255, size=(50))*0.9
d2 = np.random.randint(0, 255, size=(50))*0.7
vi = np.linalg.inv(np.cov(d1,d2, rowvar=0))
res = distance.mahalanobis(d1,d2,vi)
print(res)
This raises:
ValueError: shapes (50,) and (2,2) not aligned: 50 (dim 0) != 2 (dim 0)
The Mahalanobis distance computes the distance between two D-dimensional vectors in reference to a D x D covariance matrix, which in some senses "defines the space" in which the distance is calculated. The matrix encodes how various combinations of coordinates should be weighted in computing the distance.
It seems that you've computed the 2x2 sample covariance for your points, which is not the right type of covariance matrix to use in a Mahalanobis distance.
If you don't already have a well-justified 50x50 covariance matrix which defines your Mahalanobis metric, the Mahalanobis distance is probably not the right choice for your application. Without more detail it's hard to give a better recommendation.
As mentioned in jakevdp's answer, your inverse covariance matrix must be of DxD dimensions, where D is the number of elements in your vectors. So, your code should be:
import numpy as np
from scipy.spatial import distance

d1 = np.random.randint(0, 255, size=50) * 0.9
d2 = np.random.randint(0, 255, size=50) * 0.7

m = np.array(list(zip(d1, d2)))  # 50x2: each row pairs one element of d1 with one of d2
v = np.cov(m)                    # 50x50 covariance (rows treated as variables)

try:
    vi = np.linalg.inv(v)
except np.linalg.LinAlgError:
    vi = np.linalg.pinv(v)  # just in case the produced matrix cannot be inverted

res = distance.mahalanobis(d1, d2, vi)
print(res)
I have two lists (of different lengths) of numbers.
Using Python, I want to calculate histograms with, say, 10 bins.
Then I want to smooth these two histograms with a standard kernel (a Gaussian kernel with mean = 0, sigma = 1).
Then I want to calculate the KL distance between these two smoothed histograms.
I found some code for histogram calculation, but I'm not sure how to apply a standard kernel for smoothing and then how to calculate the KL distance.
Please help.
For calculating histograms you can use numpy.histogram(), and for Gaussian smoothing scipy.ndimage.gaussian_filter(). An implementation of the Kullback-Leibler divergence is included below.
A method to do the required calculation would look something like this:
import numpy as np
from scipy.ndimage import gaussian_filter

def kl(p, q):
    # Kullback-Leibler divergence D(P || Q) for discrete distributions p and q
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(np.where(p != 0, p * np.log(p / q), 0))

def smoothed_hist_kl_distance(a, b, nbins=10, sigma=1):
    # bin each list, smooth the counts with a Gaussian kernel, then compare
    ahist, bhist = (np.histogram(a, bins=nbins)[0],
                    np.histogram(b, bins=nbins)[0])
    asmooth, bsmooth = (gaussian_filter(ahist, sigma),
                        gaussian_filter(bhist, sigma))
    return kl(asmooth, bsmooth)
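A minimal usage sketch with made-up input lists; note that this compares raw (unnormalized) counts and that the two histograms are built over each list's own range, so depending on your application you may want to normalize them and use shared bin edges:

import numpy as np

a = np.random.normal(0.0, 1.0, 1000)   # first list of numbers (illustrative)
b = np.random.normal(0.5, 1.2, 800)    # second list, different length

# if any smoothed bin of the second histogram is zero where the first is not,
# the result can be infinite
print(smoothed_hist_kl_distance(a, b, nbins=10, sigma=1))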