Use Distance Matrix in scipy.cluster.hierarchy.linkage()? - python

I have an n*n distance matrix M, where M_ij is the distance between object_i and object_j. As expected, it takes the following form:
/ 0    M_01 M_02 ... M_0n \
| M_10 0    M_12 ... M_1n |
| M_20 M_21 0    ... M_2n |
| ...                     |
\ M_n0 M_n1 M_n2 ... 0    /
Now I wish to cluster these n objects with hierarchical clustering. Python has an implementation of this called scipy.cluster.hierarchy.linkage(y, method='single', metric='euclidean').
Its documentation says:
y must be an (n choose 2) sized vector where n is the number of
original observations paired in the distance matrix.
y : ndarray
A condensed or redundant distance matrix. A condensed
distance matrix is a flat array containing the upper triangular of the
distance matrix. This is the form that pdist returns. Alternatively, a
collection of m observation vectors in n dimensions may be passed as
an m by n array.
I am confused by this description of y. Can I directly feed my M in as the input y?
Update
#hongbo-zhu-cn has raised this issue on GitHub. This is exactly what I am concerned about. However, as a newbie to GitHub, I don't know how it works and therefore have no idea how this issue is being dealt with.

It seems that indeed we cannot directly pass the redundant square matrix in, although the documentation claims we can do so.
To benefit anyone who faces the same problem in the future, I am writing my solution as an additional answer here, so that anyone who just wants to copy and paste can proceed directly with the clustering.
Use the following snippet to condense the matrix and happily proceed.
import scipy.spatial.distance as ssd
# convert the redundant n*n square matrix form into a condensed nC2 array
distArray = ssd.squareform(distMatrix) # distArray[{n choose 2}-{n-i choose 2} + (j-i-1)] is the distance between points i and j
Please correct me if I am wrong.
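For example, a minimal end-to-end sketch (the distance values here are made up just for illustration):
import numpy as np
import scipy.spatial.distance as ssd
from scipy.cluster.hierarchy import linkage, fcluster

# toy symmetric distance matrix with a zero diagonal (stand-in for your M)
distMatrix = np.array([[0.0, 2.0, 4.0],
                       [2.0, 0.0, 6.0],
                       [4.0, 6.0, 0.0]])

distArray = ssd.squareform(distMatrix)           # condensed form, length n*(n-1)/2
Z = linkage(distArray, method='single')          # hierarchical clustering on the condensed matrix
labels = fcluster(Z, t=2, criterion='maxclust')  # e.g. cut the dendrogram into 2 flat clusters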

For now you should pass in the 'condensed distance matrix', i.e. just the upper triangle of the distance matrix in vector form:
y = M[np.triu_indices(n,1)]
From the discussion of #hongbo-zhu-cn's pull request it looks as though the solution will be to add an extra keyword argument to the linkage function that will allow the user to explicitly specify that they are passing in an n x n distance matrix rather than an m x n observation matrix.
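As a quick sanity check (a small sketch with a random symmetric M), this is the same condensed ordering that squareform produces:
import numpy as np
import scipy.spatial.distance as ssd

n = 4
rng = np.random.default_rng(0)
M = rng.random((n, n))
M = (M + M.T) / 2          # make it symmetric
np.fill_diagonal(M, 0.0)   # zero diagonal, as a distance matrix requires

y = M[np.triu_indices(n, 1)]               # condensed vector from the upper triangle
assert np.allclose(y, ssd.squareform(M))   # identical to squareform's condensed output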

Related

Getting a series of Normal Distributions from a Least Squares regression

I am not particularly good at math. I would like to get some breadcrumbs about how to solve the following formula using python code.
Assume an [m, n] matrix M and an [m, 1] vector y.
Solve the least-squares problem using scipy.linalg.lstsq(M, y).
The output will be an [n, 1] vector of coefficients a in the equation Ma = y.
As per this question, any vector of solutions like a in a regression is basically a series of single points, each taken from a normal distribution that describes the error of every point on the regression. In effect, every single entry in the solution vector a is the mean of a normal distribution of errors centred on zero.
I would like to find those normal distributions rather than just the scalar value for every single point in the solution. Apologies for the poor description of the mathy bits; I was never trained in math at uni.
Here is a hint. Let me know if you want more.
scipy.linalg.lstsq(M, y) returns four things:
x : (N,) or (N, K) ndarray
Least-squares solution.
residues : (K,) ndarray or float
Square of the 2-norm for each column in b - a x, if M > N and ndim(A) == n
(returns a scalar if b is 1-D). Otherwise a (0,)-shaped array is returned.
rank : int
Effective rank of a.
s : (min(M, N),) ndarray or None
Singular values of a. The condition number of a is s[0] / s[-1].
residues is going to be of interest to you!
https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.lstsq.html
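To make that concrete, here is a minimal sketch (with made-up data, and assuming the usual least-squares model of independent, equal-variance Gaussian errors) of how the residual sum of squares turns into a normal distribution for each coefficient:
import numpy as np
from scipy.linalg import lstsq

rng = np.random.default_rng(0)
m, n = 50, 3
M = rng.normal(size=(m, n))                      # [m, n] design matrix
true_a = np.array([1.5, -2.0, 0.5])
y = M @ true_a + rng.normal(scale=0.3, size=m)   # noisy observations

a_hat, residues, rank, sv = lstsq(M, y)

# Residual sum of squares (the same quantity as the returned `residues`
# for a full-rank overdetermined system), used to estimate the error variance.
rss = np.sum((y - M @ a_hat) ** 2)
sigma2 = rss / (m - n)

# Covariance of the coefficient estimates under the usual OLS assumptions:
# sigma^2 * (M^T M)^-1; its diagonal gives each coefficient's variance.
cov_a = sigma2 * np.linalg.inv(M.T @ M)
std_a = np.sqrt(np.diag(cov_a))
# Each coefficient is then approximately Normal(a_hat[i], std_a[i] ** 2).
The std_a values are the standard errors: the normal distributions you are after are centred on the entries of a_hat with those standard deviations.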

In which scenario would one use another matrix than the identity matrix for finding eigenvalues?

The scipy.linalg.eigh function can take two matrices as arguments: first the matrix a, whose eigenvalues and eigenvectors we want, and optionally the matrix b, which defaults to the identity matrix when it is omitted.
In what scenario would someone like to use this b matrix?
Some more context: I am trying to use xdawn covariances from the pyRiemann package. This uses the scipy.linalg.eigh function with a covariance matrix a and a baseline covariance matrix b. You can find the implementation here. This yields an error, as the b matrix in my case is not positive definite and thus not usable in the scipy.linalg.eigh function. Removing this matrix and just using the identity matrix, however, solves this problem and yields relatively nice results... The problem is that I do not really understand what I changed, and maybe I am doing something I should not be doing.
This is the code from the pyRiemann package I am using (modified to avoid using functions defined in other parts of the package):
# X are samples (EEG data), y are labels
# shape of X is (1000, 64, 2459)
# shape of y is (1000,)
import numpy as np
import sklearn.covariance
from scipy.linalg import eigh

Ne, Ns, Nt = X.shape
tmp = X.transpose((1, 2, 0))
b = np.matrix(sklearn.covariance.empirical_covariance(tmp.reshape(Ne, Ns * Nt).T))
for c in self.classes_:
    # Prototyped response for each class
    P = np.mean(X[y == c, :, :], axis=0)
    # Covariance matrix of the prototyped response & signal
    a = np.matrix(sklearn.covariance.empirical_covariance(P.T))
    # Spatial filters
    evals, evecs = eigh(a, b)
    # and I am now using the following, disregarding the b matrix:
    # evals, evecs = eigh(a)
If A and B are both symmetric matrices, that does not necessarily imply that inv(B)*A is a symmetric matrix. So if I had to solve a generalised eigenvalue problem Ax = lambda Bx, I would use eig(A, B) rather than eig(inv(B)*A), so that the symmetry isn't lost.
One practical application is finding the natural frequencies of a dynamic mechanical system from differential equations of the form M (d²x/dt²) + Kx = 0, where M is a positive definite matrix known as the mass matrix, K is the stiffness matrix, x is the displacement vector, and d²x/dt² is the acceleration vector, i.e. the second derivative of the displacement vector. To find the natural frequencies, x can be substituted with x0 sin(ωt), where ω is a natural frequency. The equation then reduces to Kx = ω²Mx. One could use eig(inv(M)*K), but that might break the symmetry of the resultant matrix, so I would use eig(K, M) instead.
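As a small illustration (a sketch with random positive definite matrices, not tied to the EEG example), eigh(K, M) solves the generalised problem directly:
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n = 5
# random symmetric positive definite "stiffness" and "mass" matrices
K = rng.normal(size=(n, n)); K = K @ K.T + n * np.eye(n)
M = rng.normal(size=(n, n)); M = M @ M.T + n * np.eye(n)

# generalised problem K x = w^2 M x, solved without losing symmetry
w2, modes = eigh(K, M)

# the equivalent standard problem; inv(M) @ K is generally not symmetric
w2_std = np.linalg.eigvals(np.linalg.inv(M) @ K)
# np.sort(w2) and np.sort(w2_std.real) should agree to numerical precision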
In the generalised problem (A - lambda B)x = 0, the eigenvectors x are no longer expressed in the same basis as for the plain covariance matrix. If B is not positive definite, it means there are vectors whose direction can be flipped by your B.
I hope this is helpful.

Soft cosine distance between two vectors (Python)

I am wondering if there is a good way to calculate the soft cosine distance between two vectors of numbers. So far, I have only seen solutions for sentences, which unfortunately did not help me.
Say I have two vectors like this:
a = [0,.25,.25,0,.5]
b = [.5,.0,.0,0.25,.25]
Now, I know that the features in the vectors exhibit some degree of similarity among them. This is described via:
s = [[0,   .67, .25, .78, .53],
     [.53, 0,   .33, .25, .25],
     [.45, .33, 0,   .25, .25],
     [.85, .04, .11, 0,   .25],
     [.95, .33, .44, .25, 0]]
So a and b are 1x5 vectors, and s is a 5x5 matrix, describing how similar the features in a and b are.
Now, I would like to calculate the soft cosine distance between a and b, but accounting for between-feature similarity. I found this formula, which should calculate what I need:
soft cosine similarity: sim(a, b) = (a·s·b) / ( sqrt(a·s·a) * sqrt(b·s·b) ), with the distance given by 1 - sim(a, b)
I already tried implementing it using numpy:
import numpy as np
soft_cosine = 1 - (np.dot(a,np.dot(s,b)) / (np.sqrt(np.dot(a,np.dot(s,b))) * np.sqrt(np.dot(a,np.dot(s,b)))))
It is supposed to produce a number between 0 and 1, with a higher number indicating a higher distance between a and b. However, I am running this on a larger dataframe with multiple vectors a and b, and for some it produces negative values. Clearly, I am doing something wrong.
Any help is greatly appreciated, and I am happy to clarify whatever needs clarification!
Best,
Johannes
From what I see it may just be a formula error. Could you please try mine?
soft_cosine = a @ (s @ b) / np.sqrt( (a @ (s @ a)) * (b @ (s @ b)) )
I use the @ operator (which is shorthand for np.matmul on ndarrays), as I find it cleaner to write: it is just matrix multiplication, no matter if 1D or 2D. It is a simple way to compute a dot product between two 1D arrays, with less code than the usual np.dot function.
soft_cosine = 1 - (np.dot(a,np.dot(s,b)) / (np.sqrt(np.dot(a,np.dot(s,b))) * np.sqrt(np.dot(a,np.dot(s,b)))))
I think you need to change the denominator: it should contain both an a-with-a term and a b-with-b term.
soft_cosine = 1 - (np.dot(a, np.dot(s, b)) / (np.sqrt(np.dot(a, np.dot(s, a))) * np.sqrt(np.dot(b, np.dot(s, b)))))
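Putting it together with the vectors from the question (a quick sketch):
import numpy as np

a = np.array([0, .25, .25, 0, .5])
b = np.array([.5, 0, 0, .25, .25])
s = np.array([[0,   .67, .25, .78, .53],
              [.53, 0,   .33, .25, .25],
              [.45, .33, 0,   .25, .25],
              [.85, .04, .11, 0,   .25],
              [.95, .33, .44, .25, 0]])

similarity = (a @ s @ b) / (np.sqrt(a @ s @ a) * np.sqrt(b @ s @ b))
distance = 1 - similarity
Note that the distance is only guaranteed to be non-negative when s is symmetric positive semi-definite (e.g. ones on the diagonal); the s above is not symmetric, so values outside [0, 1] can still occur for some inputs.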

Lanczos algorithm for finding top eigenvalues of a matrix sum

I am trying to find the top k leading eigenvalues of a numpy matrix (using Python's @ matrix-product notation)
L@L + a*Y@Y.T, where L and Y are a symmetric n x n and an n x d matrix, respectively.
According to the text below from this paper, I should be able to calculate these leading eigenvalues using only matrix-vector products of the form L@(L@v) + a*Y@(Y.T@v), where I guess v is an arbitrary vector. The Lanczos paper they cite is here.
I'm not quite sure where to start. I know that scipy has scipy.sparse.linalg.eigsh here, and from the notes it looks like it uses the Lanczos algorithm - but I am at a loss as to whether it's possible to use sparse.linalg.eigsh for my specific use case. I googled around and didn't find a Python implementation for this very quickly -- does anybody know if I can use sparse.linalg.eigsh to calculate this somehow? I definitely don't want to write this algorithm myself.
I also wasn't sure whether to post this in math.stackexchange or here, since it's a question about the Python implementation of a very mathy thing.
You could check scipy.sparse.linalg.eigsh.
import numpy as np
from scipy.sparse.linalg import eigsh
from numpy.linalg import eigh

a = 1.4
n = 20
d = 7

# random symmetric n x n matrix
L = np.random.randn(n, n)
L = L + L.T

# random n x d matrix
Y = np.random.randn(n, d)

A = L @ L.T + a * Y @ Y.T  # your equation
A must be symmetric (Hermitian) to use eigsh, which holds here since both L @ L.T and Y @ Y.T are symmetric. With a > 0 the sum is also positive semi-definite, so the largest-magnitude eigenvalues that eigsh returns by default are also the largest eigenvalues.
You could check the four largest eigenvalues as follows:
eigsh(A, 4)[0]
For reference, you can compare against numpy.linalg.eigh, which computes all the eigenvalues: sort them and take the last four elements of the sorted array; the results should be close.
np.sort(eigh(A)[0])[-4:]
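If you would rather not form A explicitly and only supply the matrix-vector product L@(L@v) + a*Y@(Y.T@v) from the quoted text, one option (just a sketch) is to wrap that product in a scipy.sparse.linalg.LinearOperator and pass it to eigsh, which then only ever calls the product:
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

a, n, d = 1.4, 20, 7
L = np.random.randn(n, n); L = L + L.T   # symmetric n x n
Y = np.random.randn(n, d)                # n x d

def matvec(v):
    # computes (L @ L + a * Y @ Y.T) @ v without ever forming the n x n product
    return L @ (L @ v) + a * (Y @ (Y.T @ v))

A_op = LinearOperator((n, n), matvec=matvec, dtype=np.float64)
top4 = eigsh(A_op, k=4, which='LA', return_eigenvectors=False)  # four largest eigenvalues
eigsh accepts a LinearOperator in place of an explicit array, so the Lanczos iteration only ever sees the matrix-vector product.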

Calculating the Angle Between Vectors by using a vector as a reference point:

I have been trying to find a fast algorithm for calculating all the angles between n vectors that are of length x. For example, if x=3 and n=4, my data would look something like this:
A: [1,2,3]
B: [2,3,4]
C: [...]
D: [...]
I was wondering: is it acceptable to find the angle between each of the vectors (A, B, C, D) and some fixed vector (e.g. X = [100, 100, 100]), and then subtract those angles to find the angle between any two of them? I want to do this because I would only have to compute n angles, and then I can subtract angles to find the difference between any pair of my vectors. In short, is it safe to make this assumption?
angle_between(A,B) == angle_between(A,X) - angle_between(B,X)
where the angle_between function is based on the cosine similarity.
That approach will only work for 2-D vectors. In higher dimensions any two vectors define a plane, and your approach only works if the third (reference) vector also lies within that plane. Unfortunately, instead of calculating only n angles and subtracting, you have to calculate all n choose 2 angles to determine the angle between each pair of vectors.
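For example, a quick numerical check (a sketch) of why the shortcut fails in 3-D, followed by the direct n-choose-2 computation:
import numpy as np

def angle_between(u, v):
    # angle in radians, derived from the cosine similarity
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

# counterexample: the reference vector X is not in the plane of A and B
A = np.array([1.0, 0.0, 0.0])
B = np.array([0.0, 1.0, 0.0])
X = np.array([0.0, 0.0, 1.0])
print(angle_between(A, B))                        # pi/2
print(angle_between(A, X) - angle_between(B, X))  # 0.0 -- the shortcut gives the wrong answer

# computing all n choose 2 angles directly
vectors = np.array([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
n = len(vectors)
angles = {(i, j): angle_between(vectors[i], vectors[j])
          for i in range(n) for j in range(i + 1, n)}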
