Unable to calculate mahalanobis distance - python

import numpy as np
from scipy.spatial import distance

d1 = np.random.randint(0, 255, size=50) * 0.9
d2 = np.random.randint(0, 255, size=50) * 0.7
vi = np.linalg.inv(np.cov(d1, d2, rowvar=0))
res = distance.mahalanobis(d1, d2, vi)
print(res)

ValueError: shapes (50,) and (2,2) not aligned: 50 (dim 0) != 2 (dim 0)

The Mahalanobis distance computes the distance between two D-dimensional vectors in reference to a D x D covariance matrix, which in some sense "defines the space" in which the distance is calculated. The matrix encodes how various combinations of coordinates should be weighted in computing the distance.
It seems that you've computed the 2x2 sample covariance of your points, which is not the right type of covariance matrix to use in a Mahalanobis distance.
If you don't already have a well-justified 50x50 covariance matrix that defines your Mahalanobis metric, the Mahalanobis distance is probably not the right choice for your application. Without more detail it's hard to give a better recommendation.
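For illustration, here is a minimal sketch (mine, not part of the original answer) of the usual setup: estimate the 50x50 covariance from many 50-dimensional observations, then measure distances in that metric. The sample data here is made up.

import numpy as np
from scipy.spatial import distance

# Hypothetical setup: 1000 observations of a 50-dimensional vector
rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 50))

cov = np.cov(samples, rowvar=False)  # shape (50, 50); one row of `samples` per observation
vi = np.linalg.inv(cov)              # the inverse covariance defines the metric

d1, d2 = samples[0], samples[1]
print(distance.mahalanobis(d1, d2, vi))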

As mentioned in jakevdp's answer, your inverse covariance matrix must be of D x D dimensions, where D is the number of elements in your vectors. So, your code should be:
import numpy as np
from scipy.spatial import distance

d1 = np.random.randint(0, 255, size=50) * 0.9
d2 = np.random.randint(0, 255, size=50) * 0.7

m = np.stack((d1, d2), axis=1)  # shape (50, 2), equivalent to list(zip(d1, d2))
v = np.cov(m)                   # 50x50 covariance (rowvar=True by default)
try:
    vi = np.linalg.inv(v)
except np.linalg.LinAlgError:
    # a 50x50 covariance estimated from just two vectors is singular,
    # so fall back to the pseudo-inverse
    vi = np.linalg.pinv(v)
res = distance.mahalanobis(d1, d2, vi)
print(res)


Given A = S*D*S.T (D is a diagonal matrix, S an arbitrary n x n matrix), shouldn't the eigenvalues of A correspond to the diagonal entries of D?

I was asked to write a function that generates a random symmetric positive definite 2D matrix.
Here is my attempt:
import numpy as np
from numpy import linalg as la

def random_spd(n):
    """Generates a random 2D SPD matrix (symmetric positive definite)."""
    while True:
        S = np.random.rand(n, n)
        if la.matrix_rank(S) == n:  # make sure that S has full rank
            break
    D = np.diag(np.random.randint(0, 10, size=n))
    print(f"S:\n{S}\n\nD:\n{D}\n")  # only for debugging
    return S @ D @ S.T

A = random_spd(2)
print(f"A:\n{A}\n")
ei_vals, ei_vecs = la.eig(A)
print(f"Eigenvalues:\n{ei_vals}\n\nEigenvectors:\n{ei_vecs}")
Output:
D:
[[6 0]
[0 5]]
A:
[[1.97478191 1.71620628]
[1.71620628 2.37372465]]
Eigenvalues:
[0.4464938 3.90201276]
Eigenvectors:
[[-0.74681018 -0.66503726]
[ 0.66503726 -0.74681018]]
As far as I know, the function works.
Now, if I try to calculate the eigenvalues of a randomly generated matrix, shouldn't they be the same as
the diagonal entries of the matrix D?
Can someone help me understand my misconception or mistake?
Thank you very much!
Best regards, Max :)
What you are applying is a congruence transform; it preserves definiteness.
A positive definite matrix P is one for which, for any non-null vector x of shape (N, 1), x.T @ P @ x > 0.
Now if you substitute x = S @ y into the above condition, you get y.T @ (S.T @ P @ S) @ y > 0; comparing the two, you conclude that S.T @ P @ S is positive definite as well (positive semidefinite if S is not full-rank).
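A quick numerical check of that claim (my own sketch, not part of the original answer):

import numpy as np

rng = np.random.default_rng(0)
P = np.eye(3)                 # trivially positive definite
S = rng.random((3, 3))        # almost surely full-rank
Q = S.T @ P @ S               # congruence transform of P
print(np.linalg.eigvalsh(Q))  # all eigenvalues positive when S has full rank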
Similarly, the eigenvalues are defined by the equation
A @ v = lambda * v
If you substitute v = S @ u, the equation becomes
A @ S @ u = lambda * S @ u
To put this in the same form as the eigenvalue equation, left-multiply by inv(S):
(inv(S) @ A @ S) @ u = lambda * u
We say that the matrix inv(S) @ A @ S obtained this way is similar to A, and we call this a similarity transformation. It is similarity, not congruence, that preserves eigenvalues, which is why the eigenvalues of S @ D @ S.T need not match the diagonal entries of D.
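A small numerical illustration (my addition) that a similarity transform preserves eigenvalues:

import numpy as np

rng = np.random.default_rng(1)
A = rng.random((3, 3))
S = rng.random((3, 3))        # assumed invertible (almost surely for random S)
B = np.linalg.inv(S) @ A @ S  # similar to A
print(np.sort_complex(np.linalg.eigvals(A)))
print(np.sort_complex(np.linalg.eigvals(B)))  # same eigenvalues, up to rounding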
There are simpler ways to create a positive definite matrix. One simple way:
S = np.random.rand(n, n)
A = S.T @ S + eps * np.eye(n)
S.T @ S can be seen as a congruence transform of the identity matrix and is thus positive semidefinite; adding eps * np.eye(n) ensures that all eigenvalues are at least eps. No matrix inversions, no eigendecomposition.
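A short usage sketch (mine, with arbitrary n and eps) that checks the construction:

import numpy as np

n, eps = 4, 1e-3
S = np.random.rand(n, n)
A = S.T @ S + eps * np.eye(n)
print(np.allclose(A, A.T))           # symmetric by construction
print(np.linalg.eigvalsh(A).min())   # at least eps (up to rounding)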

How to calculate Mahalanobis Distance between randomly generated values?

I'm currently learning about the Mahalanobis distance and I find it quite difficult. To understand the idea better, I generated 2 sets of random values (x and y) and a random point, all with mean = 0 and standard deviation = 1. How can I calculate the Mahalanobis distance between them? Please find my Python code below.
Many thanks for your help!
import random

import numpy as np
from numpy import cov
from scipy.spatial import distance

# generate 20 random values where mean = 0 and standard deviation = 1,
# assign one set to x and one to y
x = [random.normalvariate(0, 1) for i in range(20)]
y = [random.normalvariate(0, 1) for i in range(20)]
r_point = [random.normalvariate(0, 1)]  # that's my random point

sigma = cov(x, y)
print(sigma)
print("random point =", r_point)
# use the covariance to calculate the Mahalanobis distance from a random point
Here is an example that shows how to compute the Mahalanobis distance of a point r_point to some data. The Mahalanobis distance takes into account the variance and correlation of the data you are measuring the distance to (via the inverse of its covariance matrix). Here the Mahalanobis and Euclidean distances should be very close because of the distribution of the data (zero mean and unit standard deviation); for other data they will differ.
import numpy as np

N = 5000
mean = 0.0
stdDev = 1.0
data = np.random.normal(mean, stdDev, (2, N))  # 2D random points
r_point = np.random.randn(2)

cov = np.cov(data)
# distance of r_point to the data mean (the origin here, since mean = 0)
mahalanobis_dist = np.sqrt(r_point.T @ np.linalg.inv(cov) @ r_point)
print("Mahalanobis distance = ", mahalanobis_dist)

euclidean_dist = np.sqrt(r_point.T @ r_point)
print("Euclidean distance = ", euclidean_dist)

Calculating Mean Squared Error through Matrix Arithmetic on Numpy Matrices of Binary Images

I have 2 binary images, one is a ground truth, and one is an image segmentation that I produced.
I am trying to calculate the mean squared distance between the two images using the following algorithm.
Let G = {g1, g2, . . . , gN} be the points in the ground truth boundary.
Let B = {b1, b2, . . . , bM} be the points in the segmented boundary.
Define d(p, p0) to be a measure of distance between points p and p0 (e.g. Euclidean, city block, etc.); the MSD is then (1/M) * sum over i of min_j d^2(b_i, g_j).
def MSD(A, G):
    '''
    Takes a thresholded binary image and a ground truth img (binary),
    and computes the mean squared absolute difference
    :param A: The thresholded binary image
    :param G: The ground truth img
    :return:
    '''
    sim = np.bitwise_xor(A, G)
    sum = 0
    for i in range(0, sim.shape[0]):
        for j in range(0, sim.shape[1]):
            if sim[i, j] == True:
                min = 9999999
                for k in range(0, sim.shape[0]):
                    for l in range(0, sim.shape[1]):
                        if sim[k, l] == True:
                            e = abs(i - k) + abs(j - l)
                            if e < min:
                                min = e
                                mink = k
                                minl = l
                sum += min
    return sum / (sim.shape[0] * sim.shape[1])
This algorithm is too slow though and never completes.
This example and this example (Answer 3) might show a method for getting the mean squared error using matrix arithmetic, but I do not understand how these examples make sense or why they work.
So if I understand your formula and code correctly, you have one (binary) image B and a (ground truth) image G. "Points" are defined by the pixel positions where either image has a True (or at least nonzero) value. From your bitwise_xor I deduce that both images have the same shape (M,N).
So the quantity d^2(b, g) is at worst an (M*N, M*N)-sized array, relating each pixel of B to each pixel of G. It's actually better: we only need shape (m, n) if there are m nonzeros in B and n nonzeros in G. Unless your images are huge, we can get away with keeping track of this large quantity. This costs memory but wins a lot of CPU time through vectorization. We then only have to find, for each of the m pixels of B, the minimum of this distance over the n pixels of G, and sum up the minima. Note that the solution below uses extreme vectorization, and it can easily eat up your memory if the images are large.
Assuming Manhattan distance (with the square in d^2 which seems to be missing from your code):
import numpy as np

# generate dummy data
M, N = 100, 100
B = np.random.rand(M, N) > 0.5
G = np.random.rand(M, N) > 0.5

def MSD(B, G):
    # get indices of nonzero pixels
    nnz_B = B.nonzero()  # (x_inds, y_inds) tuple, x_inds and y_inds are shape (m,)
    nnz_G = G.nonzero()  # (x_inds', y_inds') each with shape (n,)
    # np.array(nnz_B) has shape (2, m)
    # compute squared Manhattan distance
    dist2 = abs(np.array(nnz_B)[..., None] - np.array(nnz_G)[:, None, :]).sum(axis=0)**2  # shape (m, n)
    # alternatively, Euclidean for comparison:
    #dist2 = ((np.array(nnz_B)[..., None] - np.array(nnz_G)[:, None, :])**2).sum(axis=0)
    mindist2 = dist2.min(axis=-1)  # shape (m,) of minimum square distances
    return mindist2.mean()  # sum divided by m, i.e. the MSD itself

print(MSD(B, G))
If the above uses too much memory we can introduce a loop over the elements of nnz_B, and only vectorize in the elements of nnz_G. This will take more CPU power and less memory. This trade-off is typical for vectorization.
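A rough sketch of that lower-memory variant (my addition, not from the original answer):

import numpy as np

def MSD_loop(B, G):
    bx, by = B.nonzero()
    gx, gy = G.nonzero()
    total = 0.0
    for x, y in zip(bx, by):
        # vectorized over G's nonzeros only: O(n) memory per pixel of B
        total += (np.abs(gx - x) + np.abs(gy - y)).min() ** 2
    return total / len(bx)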
An efficient method for calculating this distance is using the Distance Transform. SciPy has an implementation in the ndimage package: scipy.ndimage.morphology.distance_transform_edt.
The idea is to compute a distance transform for the background of the ground-truth image G. This produces a new image D that is 0 at each pixel that is nonzero in G; at each zero pixel of G it holds the distance to the nearest nonzero pixel of G.
Next, for each nonzero pixel in B (or A in the code that you posted), you look at the corresponding pixel in D. This is the distance to G for that pixel. So, simply average all the values in D for which B is nonzero to obtain your result.
import numpy as np
import scipy.ndimage as nd
import matplotlib.pyplot as pp

# Create some test data
img = pp.imread('erika.tif')  # a random image
G = img > 120  # the ground truth
img = img + np.random.normal(0, 20, img.shape)
B = img > 120  # the other image

D = nd.morphology.distance_transform_edt(~G)  # distance to the nearest nonzero pixel of G
msd = np.mean(D[B]**2)  # average squared distance over the nonzero pixels of B
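Since 'erika.tif' is just whatever image the answerer had on hand, here is a self-contained variant (my sketch) with synthetic data:

import numpy as np
import scipy.ndimage as nd

rng = np.random.default_rng(0)
G = rng.random((100, 100)) > 0.5   # stand-in ground truth
B = rng.random((100, 100)) > 0.5   # stand-in segmentation
D = nd.distance_transform_edt(~G)  # per-pixel distance to the nearest True pixel of G
print(np.mean(D[B] ** 2))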

Confusion about homography matrix

I'm trying to get a homography matrix that describes the transformation from one image to another. I tried doing this by using an eigendecomposition and taking the eigenvector that corresponds to the smallest eigenvalue. Apparently I have to reshape it into a 3x3 matrix, but the numpy linalg function returns an eigenvector array of shape (9, 9). (I'm trying to compute the homography from 4 point correspondences.)
A is an 8x9 matrix. pts1 and pts2 are arrays of 4 points in the source image and the target image respectively.
The code starts from the matrix A (size 8x9) built for the homography calculation:
A_t = A.transpose()
sym_mat = np.dot(A_t, A)  # 9x9
eig_val, eig_vec = np.linalg.eig(sym_mat)

# sort according to eigenvalue, ascending
idx = np.argsort(eig_val)
eig_val = eig_val[idx]
eig_vec = eig_vec[:, idx]

# The eigenvector corresponding to the smallest eigenvalue is the first
# column after sorting; reshape it as a 3x3 matrix.
smallest = eig_vec[:, 0]
H = np.reshape(smallest, (3, 3))

Get indices of results from scipy.pdist(myArray,metric="jaccard") to map back to original array?

I am trying to calculate jaccard similarity
y= 1 - scipy.spatial.distance.pdist(X,metric="jaccard")
X is an m x n matrix, and I get a 1-D array of size (m choose 2) as a result of this function. How would I map the similarity values back to a symmetric m x m array (or a non-symmetric array, either way is fine), so I can tell which two rows of X (each row in X is a boolean vector) generated a particular Jaccard similarity value in y?
You can use scipy.spatial.distance.squareform to convert between the condensed 1-D form and the full m x m distance matrix:
import numpy as np
from scipy.spatial import distance

m = 100
n = 200
X = np.random.randn(m, n)
d = distance.pdist(X, metric='jaccard')
print(d.shape)
# (4950,)
D = distance.squareform(d)
print(D.shape)
# (100, 100)
There is a function, scipy.spatial.distance.squareform(y), that converts the condensed 1-D form obtained from scipy.spatial.distance.pdist(X, metric='jaccard') into a symmetric square matrix, so it is relatively straightforward to obtain indices from there.
So we could do the following:
y = 1 - scipy.spatial.distance.pdist(X, metric='jaccard')
z = scipy.spatial.distance.squareform(y)
Here X is an m x n input matrix, y is a 1-D condensed array of length (m choose 2), and z is an m x m square symmetric matrix.
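As an illustration (my addition, with made-up data) of mapping a value in y back to row indices of X:

import numpy as np
from scipy.spatial import distance

X = np.random.rand(10, 5) > 0.5                 # 10 boolean row vectors
y = 1 - distance.pdist(X, metric='jaccard')
Z = distance.squareform(y)                      # Z[i, j] is the similarity of rows i and j
i, j = np.unravel_index(np.argmax(Z), Z.shape)  # squareform leaves zeros on the diagonal
print(f"most similar rows: {i} and {j}, similarity {Z[i, j]:.3f}")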
