Recovering the transformation matrix from the metric-learn LMNN algorithm - Python

I am using the LMNN module from the scikit-learn-contrib package metric-learn (http://contrib.scikit-learn.org/metric-learn/index.html), and I am attempting to recover the linear transformation matrix (L.T) from the learned Mahalanobis matrix (M).
The reason I am trying to recover this linear transformation is that I fit my dataset on cloud compute but test it on a local machine. This means I cannot save or reload the LMNN model itself after fitting on cloud compute, but I can save the learned M matrix and use a decomposition to recover the learned linear transformation, which I can then apply to my test sets on the local machine.
The problem is that I can't seem to reconcile the results of the LMNN module's built-in transformation with the learned linear transformation obtained by decomposing the M matrix. Here's an example:
import numpy as np
from metric_learn import LMNN
from sklearn.datasets import load_iris
iris_data = load_iris()
X = iris_data['data']
Y = iris_data['target']
lmnn = LMNN(k=5, learn_rate=1e-6)
X_transformed = lmnn.fit_transform(X, Y)
M_matrix = lmnn.get_mahalanobis_matrix()
array([[ 2.47937397, 0.36313715, -0.41243858, -0.78715282],
[ 0.36313715, 1.69818843, -0.90042673, -0.0740197 ],
[-0.41243858, -0.90042673, 2.37024271, 2.18292864],
[-0.78715282, -0.0740197 , 2.18292864, 2.9531315 ]])
# decompose M into L^T L via its eigendecomposition (a symmetric matrix square root)
eigvalues, eigcolvectors = np.linalg.eig(M_matrix)
eigvalues_diag = np.diag(eigvalues)
eigvalues_diag_sqrt = np.sqrt(eigvalues_diag)
L = eigcolvectors.dot(eigvalues_diag_sqrt.dot(np.linalg.inv(eigcolvectors)))
L_transpose = np.transpose(L)
L_transpose.dot(L) # check that this reproduces M_matrix
array([[ 2.47937397, 0.36313715, -0.41243858, -0.78715282],
[ 0.36313715, 1.69818843, -0.90042673, -0.0740197 ],
[-0.41243858, -0.90042673, 2.37024271, 2.18292864],
[-0.78715282, -0.0740197 , 2.18292864, 2.9531315 ]])
# test fit_transform() vs. transform() using LMNN functions
lmnn.transform(X[0:4, :])
array([[8.2487 , 4.41337015, 0.14988465, 0.52629361],
[7.87314906, 3.77220291, 0.36015873, 0.525688 ],
[7.59410008, 4.03369392, 0.17339877, 0.51350962],
[7.41676205, 3.82012155, 0.47312948, 0.68515535]])
X_transformed[0:4, :]
array([[8.2487 , 4.41337015, 0.14988465, 0.52629361],
[7.87314906, 3.77220291, 0.36015873, 0.525688 ],
[7.59410008, 4.03369392, 0.17339877, 0.51350962],
[7.41676205, 3.82012155, 0.47312948, 0.68515535]])
# test manual transform of X[0:4, :]
X[0:4, :].dot(L_transpose)
array([[8.22608756, 4.45271327, 0.24690081, 0.51206068],
[7.85071271, 3.81054846, 0.45442718, 0.51144826],
[7.57310259, 4.06981377, 0.26240745, 0.50067674],
[7.39356544, 3.85511015, 0.55776916, 0.67615584]])
As seen above, the first four rows of the original dataset, X[0:4, :], give different results when transformed by the LMNN module (using either fit_transform(X, Y) or transform(X[0:4, :])) than when transformed manually with the decomposed matrix.
I believe my decomposition of the M matrix is correct, since I can reproduce the M matrix via L_transpose.dot(L).
The learned linear transformation is L.T as per the github code: https://github.com/scikit-learn-contrib/metric-learn/blob/master/metric_learn/base_metric.py
class MetricTransformer(six.with_metaclass(ABCMeta)):

    @abstractmethod
    def transform(self, X):
        """Applies the metric transformation.

        Parameters
        ----------
        X : (n x d) matrix
            Data to transform.

        Returns
        -------
        transformed : (n x d) matrix
            Input data transformed to the metric space by :math:`XL^{\\top}`
        """


class MahalanobisMixin(six.with_metaclass(ABCMeta, BaseMetricLearner,
                                          MetricTransformer)):
    r"""Mahalanobis metric learning algorithms.

    Algorithm that learns a Mahalanobis (pseudo) distance :math:`d_M(x, x')`,
    defined between two column vectors :math:`x` and :math:`x'` by: :math:`d_M(x,
    x') = \sqrt{(x-x')^T M (x-x')}`, where :math:`M` is a learned symmetric
    positive semi-definite (PSD) matrix. The metric between points can then be
    expressed as the euclidean distance between points embedded in a new space
    through a linear transformation. Indeed, the above matrix can be decomposed
    into the product of two transpose matrices (through SVD or Cholesky
    decomposition): :math:`d_M(x, x')^2 = (x-x')^T M (x-x') = (x-x')^T L^T L
    (x-x') = (L x - L x')^T (L x - L x')`
    """
What am I missing here?
Thanks!

metric-learn contributor here. @BeginnersMindTruly, you're right: for LMNN we indeed learn the L matrix directly during training and compute M from it at the end, so computing the transformation L back from M may lead to numerical differences.
As for your particular use case of accessing the learned matrix L directly, you should be able to do that through the components_ attribute of your metric learner at the end of training.
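For illustration, here is a minimal sketch of that save-and-reload workflow. It assumes a metric-learn version that exposes the learned transformation as components_ (older releases name the same attribute transformer_ or L_), and reuses the constructor arguments from the question:
import numpy as np
from metric_learn import LMNN
from sklearn.datasets import load_iris

X, Y = load_iris(return_X_y=True)

# --- on the cloud machine ---
lmnn = LMNN(k=5, learn_rate=1e-6)
lmnn.fit(X, Y)
L = lmnn.components_                  # learned linear transformation (attribute name varies by version)
np.save('lmnn_components.npy', L)

# --- on the local machine ---
L = np.load('lmnn_components.npy')
X_test_transformed = X[0:4, :].dot(L.T)   # matches lmnn.transform(X[0:4, :])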

Per contributors from the team who built the module (http://contrib.scikit-learn.org/metric-learn/index.html), this discrepancy is due to floating-point precision errors. The LMNN module learns the linear transformation L first and then computes the M matrix as L.T.dot(L). So any attempt to recover the original transformation from M loses precision, both when M is computed and again when M is factored back into L.
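As a side note (not from the original posts): any factor L satisfying L.T.dot(L) == M yields the same pairwise distances as the built-in transform, even when the transformed coordinates differ, because both embeddings realise the same Mahalanobis metric quoted in the docstring above. A minimal sanity check, continuing from the question's snippet:
import numpy as np
from scipy.spatial.distance import pdist

# X, X_transformed and L_transpose are the objects from the question's code.
d_builtin = pdist(X_transformed)        # Euclidean distances in LMNN's learned space
d_manual = pdist(X.dot(L_transpose))    # Euclidean distances after the manual transform

# The coordinates can differ (the factorisation of M is only unique up to a
# rotation), but the pairwise distances should agree up to floating-point error.
print(np.allclose(d_builtin, d_manual, atol=1e-6))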

Related

In which scenario would one use another matrix than the identity matrix for finding eigenvalues?

The scipy.linalg.eigh function can take two matrices as arguments: first the matrix a, whose eigenvalues and eigenvectors we want, and optionally the matrix b, which defaults to the identity matrix when omitted.
In what scenario would someone like to use this b matrix?
Some more context: I am trying to use xdawn covariances from the pyRiemann package. This uses the scipy.linalg.eigh function with a covariance matrix a and a baseline covariance matrix b. You can find the implementation here. This yields an error, because the b matrix in my case is not positive definite and therefore not usable in the scipy.linalg.eigh function. Removing this matrix and just using the identity matrix, however, solves this problem and yields relatively nice results... The problem is that I do not really understand what I changed, and maybe I am doing something I should not be doing.
This is the code from the pyRiemann package I am using (modified to avoid using functions defined in other parts of the package):
# X are samples (EEG data), y are labels
# shape of X is (1000, 64, 2459)
# shape of y is (1000,)
import numpy as np
import sklearn.covariance
from scipy.linalg import eigh

Ne, Ns, Nt = X.shape
tmp = X.transpose((1, 2, 0))
b = np.matrix(sklearn.covariance.empirical_covariance(tmp.reshape(Ne, Ns * Nt).T))
for c in self.classes_:
    # Prototyped response for each class
    P = np.mean(X[y == c, :, :], axis=0)
    # Covariance matrix of the prototyped response & signal
    a = np.matrix(sklearn.covariance.empirical_covariance(P.T))
    # Spatial filters
    evals, evecs = eigh(a, b)
    # and I am now using the following, disregarding the b matrix:
    # evals, evecs = eigh(a)
If A and B are both symmetric matrices, that does not necessarily imply that inv(A)*B is a symmetric matrix. So if I had to solve a generalised eigenvalue problem Ax = λBx, I would use eig(A, B) rather than eig(inv(A)*B), so that the symmetry isn't lost.
One practical application is finding the natural frequencies of a dynamic mechanical system from differential equations of the form M(d²x/dt²) + Kx = 0, where M is a positive definite matrix known as the mass matrix, K is the stiffness matrix, x is the displacement vector and d²x/dt² is the acceleration vector (the second derivative of the displacement). To find the natural frequencies, x can be substituted with x₀ sin(ωt), where ω is a natural frequency, and the equation reduces to Kx₀ = ω²Mx₀. One could use eig(inv(K)*M), but that might break the symmetry of the resulting matrix, so I would use eig(K, M) instead.
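A minimal sketch of that generalised form with scipy.linalg.eigh; K and M below are made-up toy matrices for a two-degree-of-freedom system, not taken from either answer:
import numpy as np
from scipy.linalg import eigh

# Toy 2-DOF system: symmetric stiffness matrix K and positive definite mass matrix M.
K = np.array([[ 400.0, -200.0],
              [-200.0,  200.0]])
M = np.array([[2.0, 0.0],
              [0.0, 1.0]])

# Generalised symmetric eigenproblem K v = w M v; eigh exploits the symmetry
# and returns M-orthogonal eigenvectors.
w, v = eigh(K, M)
natural_frequencies = np.sqrt(w)     # rad/s, since w = omega^2
print(natural_frequencies)

# Compare with a reduced, non-symmetric formulation (eigenvalues of inv(M) @ K
# also equal omega^2, but the symmetry is not exploited):
w2 = np.linalg.eigvals(np.linalg.inv(M) @ K)
print(np.sort(w2))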
In the generalised problem (A − λB)x = 0, it means that x is no longer expressed in the same basis as the covariance matrix.
If B is not positive definite, it means that there are vectors whose orientation can be flipped by B.
I hope this was helpful.

PCA from scratch and Sklearn PCA giving different output

I am trying to implement PCA from scratch. Following is the code:
import numpy as np
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()                                  # standardization
X_new = sc.fit_transform(X)
Z = np.divide(np.dot(X_new.T, X_new), X_new.shape[0])  # covariance matrix
eig_values, eig_vectors = np.linalg.eig(Z)             # eigen decomposition
ev_index = np.argsort(eig_values)[::-1]                # indices of eigenvalues, descending
pc = eig_vectors[:, ev_index]                          # eigenvectors sorted by eigenvalue
W = pc[:, 0:2]                                         # extracting 2 components
print(W)
and getting the following components:
[[ 0.52237162 -0.37231836]
[-0.26335492 -0.92555649]
[ 0.58125401 -0.02109478]
[ 0.56561105 -0.06541577]]
When I use sklearn's PCA I get the following two components:
array([[ 0.52237162, -0.26335492, 0.58125401, 0.56561105],
[ 0.37231836, 0.92555649, 0.02109478, 0.06541577]])
Projection onto the new feature space gives different figures.
Where am I going wrong, and what can be done to resolve the problem?
The result of a PCA is technically not n vectors but a subspace of dimension n. This subspace is represented by n vectors that span it.
In your case, while the vectors are different, the spanned subspace is the same, so the result of the PCA is the same.
If you want to align your solution exactly with the sklearn solution, you need to normalise your solution in the same way. Apparently sklearn prefers positive values over negative values? You'd need to dig into their documentation.
edit:
Yes, of course, what I wrote is wrong. The algorithm itself returns ordered orthonormal basis vectors, i.e. vectors of length one that are orthogonal to each other, ordered by their 'importance' to the dataset. So it carries much more information than just the subspace.
However, if v, w, u are a solution of the PCA, then so are ±v, ±w, ±u.
edit: It seems that np.linalg.eig has no mechanism to guarantee it will always return the same set of eigenvectors representing an eigenspace; see also:
NumPy linalg.eig
So a new version of numpy, or just how the stars are aligned today, can change your result, although for a PCA it should only vary in sign (+/-).
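For illustration, a minimal sketch of one possible sign convention: flip each eigenvector so that its largest-magnitude entry is positive. This mimics, but is not guaranteed to be identical to, the sign handling sklearn applies internally:
import numpy as np

def normalize_signs(W):
    """Flip each column of W so its largest-magnitude entry is positive."""
    signs = np.sign(W[np.argmax(np.abs(W), axis=0), np.arange(W.shape[1])])
    return W * signs

# Applied to the hand-rolled components (columns of W) and to sklearn's
# components_ (rows, hence the transpose), both should then agree up to
# floating-point error:
# np.allclose(normalize_signs(W), normalize_signs(pca.components_.T))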

How do you find the basis expansion matrix in Python

I'm doing nonparametric regression, and need a function to expand the design matrix X into the basis matrix. Is there a package that can do this?
For example, if X is 200*10 (200 obs and 10 features), using a B-spline basis expansion with 5 bases will yield a 200*50 basis matrix.
I tried scipy.interpolate.BSpline, but it seems it does not return the basis matrix.
The patsy library in Python is useful here.
from patsy import dmatrix

# df and degree are illustrative values; the key in the data dict must
# match the variable name used in the formula
transformed_x = dmatrix(
    "bs(x, df=5, degree=3, include_intercept=False)",
    {"x": x}, return_type='matrix')
This will return the basis expansion for a vector x. If you have a data matrix, do this for each column.
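A minimal sketch of that per-column loop, under the question's assumption of 5 basis functions per feature (the df=5 value and the random data are illustrative):
import numpy as np
from patsy import dmatrix

X = np.random.rand(200, 10)          # 200 observations, 10 features

expanded_cols = []
for j in range(X.shape[1]):
    # '- 1' drops the automatic intercept column so each feature contributes
    # exactly df = 5 spline columns
    basis = dmatrix("bs(x, df=5, include_intercept=False) - 1",
                    {"x": X[:, j]}, return_type='matrix')
    expanded_cols.append(np.asarray(basis))

X_expanded = np.hstack(expanded_cols)
print(X_expanded.shape)              # (200, 50): 10 features x 5 bases each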

How to use Robust PCA output as principal-component (eigen)vectors from traditional PCA

I am using PCA to reduce the dimensionality of a N-dimensional dataset, but I want to build in robustness to large outliers, so I've been looking into Robust PCA codes.
For traditional PCA, I'm using python's sklearn.decomposition.PCA which nicely returns the principal components as vectors, onto which I can then project my data (to be clear, I've also coded my own versions using SVD so I know how the method works). I found a few pre-coded RPCA python codes out there (like https://github.com/dganguli/robust-pca and https://github.com/jkarnows/rpcaADMM).
The 1st code is based on the Candes et al. (2009) method and returns low-rank L and sparse S matrices for a dataset D. The 2nd code uses the ADMM method of matrix decomposition (Parikh, N., & Boyd, S. 2013) and returns X_1, X_2, X_3 matrices. I must admit, I'm having a very hard time figuring out how to connect these to the principal axes returned by a standard PCA algorithm. Can anyone provide any guidance?
Specifically, in one dataset X, I have a cloud of N 3-D points. I run it through PCA:
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca.fit(X)
comps = pca.components_
and these 3 components are 3-D vectors that define the new basis onto which I project all my points. With Robust PCA, I get matrices L+S=X. Does one then run pca.fit(L)? I would have thought that RPCA would give me back the eigenvectors but have internal steps to throw out outliers as part of building the covariance matrix or performing SVD.
Maybe what I think of as "Robust PCA" isn't how other people are using/coding it?
The robust-pca code factors the data matrix D into two matrices, L and S which are "low-rank" and "sparse" matrices (see the paper for details). L is what's mostly constant between the various observations, while S is what varies. Figures 2 and 3 in the paper give a really nice example from a couple of security cameras, picking out the static background (L) and variability such as passing people (S).
If you just want the eigenvectors, treat the S as junk (the "large outliers" you're wanting to clip out) and do an eigenanalysis on the L matrix.
Here's an example using the robust-pca code:
L, S = RPCA(data).fit()
rcomp, revals, revecs = pca(L)
print("Normalised robust eigenvalues: %s" % (revals/np.sum(revals),))
Here, the pca function is:
import numpy as np

def pca(data, numComponents=None):
    """Principal Components Analysis

    From: http://stackoverflow.com/a/13224592/834250

    Parameters
    ----------
    data : `numpy.ndarray`
        numpy array of data to analyse
    numComponents : `int`
        number of principal components to use

    Returns
    -------
    comps : `numpy.ndarray`
        Principal components
    evals : `numpy.ndarray`
        Eigenvalues
    evecs : `numpy.ndarray`
        Eigenvectors
    """
    m, n = data.shape
    data -= data.mean(axis=0)
    R = np.cov(data, rowvar=False)
    # use 'eigh' rather than 'eig' since R is symmetric,
    # the performance gain is substantial
    evals, evecs = np.linalg.eigh(R)
    idx = np.argsort(evals)[::-1]
    evecs = evecs[:, idx]
    evals = evals[idx]
    if numComponents is not None:
        evecs = evecs[:, :numComponents]
    # carry out the transformation on the data using eigenvectors
    # and return the re-scaled data, eigenvalues, and eigenvectors
    return np.dot(evecs.T, data.T).T, evals, evecs

scikit-learn: Finding the features that contribute to each KMeans cluster

Say you have 10 features you are using to create 3 clusters. Is there a way to see the level of contribution each of the features have for each of the clusters?
What I want to be able to say is that for cluster k1, features 1,4,6 were the primary features where as cluster k2's primary features were 2,5,7.
This is the basic setup of what I am using:
from sklearn.cluster import KMeans

k_means = KMeans(init='k-means++', n_clusters=3, n_init=10)
k_means.fit(data_features)
k_means_labels = k_means.labels_
You can use
Principal Component Analysis (PCA)
PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or singular value decomposition of a data matrix, usually after mean centering (and normalizing or using Z-scores) the data matrix for each attribute. The results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score).
Some essential points:
the eigenvalues reflect the portion of variance explained by the corresponding component. Say we have 4 features with eigenvalues 1, 4, 1, 2. These are the variances explained by the corresponding vectors. The second value belongs to the first principal component, as it explains 50 % of the overall variance, and the last value belongs to the second principal component, explaining 25 % of the overall variance.
the eigenvectors are the components' linear combinations of the original features. They give the weights of the features, so you can see which features have high or low impact.
use PCA based on the correlation matrix instead of the empirical covariance matrix if the eigenvalues differ strongly in magnitude.
Sample approach
do PCA on the entire dataset (that's what the function below does):
take the matrix of observations and features
center it to its average (the average of feature values over all observations)
compute the empirical covariance matrix (e.g. np.cov) or the correlation matrix (see above)
perform the eigendecomposition
sort eigenvalues and eigenvectors by the eigenvalues to get the components with the highest impact
use the components on the original data
examine the clusters in the transformed dataset; by checking their location on each component, you can derive the features with high and low impact on the distribution/variance
Sample function
You need numpy imported as np and scipy imported as sp; the function uses sp.linalg.eigh for the decomposition (make sure scipy.linalg has been imported so it is available under sp.linalg). You might also want to check the scikit-learn decomposition module.
PCA is performed on a data matrix with observations (objects) in rows and features in columns.
def dim_red_pca(X, d=0, corr=False):
    r"""
    Performs principal component analysis.

    Parameters
    ----------
    X : array, (n, d)
        Original observations (n observations, d features)
    d : int
        Number of principal components (default is ``0`` => all components).
    corr : bool
        If true, the PCA is performed based on the correlation matrix.

    Notes
    -----
    Always all eigenvalues and eigenvectors are returned,
    independently of the desired number of components ``d``.

    Returns
    -------
    Xred : array, (n, m or d)
        Reduced data matrix
    e_values : array, (m)
        The eigenvalues, sorted in descending manner.
    e_vectors : array, (n, m)
        The eigenvectors, sorted corresponding to eigenvalues.
    """
    # Center to average
    X_ = X - X.mean(0)
    # Compute correlation / covariance matrix
    if corr:
        CO = np.corrcoef(X_.T)
    else:
        CO = np.cov(X_.T)
    # Compute eigenvalues and eigenvectors
    e_values, e_vectors = sp.linalg.eigh(CO)
    # Sort the eigenvalues and the eigenvectors descending
    idx = np.argsort(e_values)[::-1]
    e_vectors = e_vectors[:, idx]
    e_values = e_values[idx]
    # Get the number of desired dimensions
    d_e_vecs = e_vectors
    if d > 0:
        d_e_vecs = e_vectors[:, :d]
    else:
        d = None
    # Map principal components to original data
    LIN = np.dot(d_e_vecs, np.dot(d_e_vecs.T, X_.T)).T
    return LIN[:, :d], e_values, e_vectors
Sample usage
Here's a sample script which makes use of the given function and uses scipy.cluster.vq.kmeans2 for clustering. Note that the results vary with each run, because the starting clusters are initialized randomly.
import numpy as np
import scipy as sp
from scipy.cluster.vq import kmeans2
import matplotlib.pyplot as plt
SN = np.array([ [1.325, 1.000, 1.825, 1.750],
[2.000, 1.250, 2.675, 1.750],
[3.000, 3.250, 3.000, 2.750],
[1.075, 2.000, 1.675, 1.000],
[3.425, 2.000, 3.250, 2.750],
[1.900, 2.000, 2.400, 2.750],
[3.325, 2.500, 3.000, 2.000],
[3.000, 2.750, 3.075, 2.250],
[2.075, 1.250, 2.000, 2.250],
[2.500, 3.250, 3.075, 2.250],
[1.675, 2.500, 2.675, 1.250],
[2.075, 1.750, 1.900, 1.500],
[1.750, 2.000, 1.150, 1.250],
[2.500, 2.250, 2.425, 2.500],
[1.675, 2.750, 2.000, 1.250],
[3.675, 3.000, 3.325, 2.500],
[1.250, 1.500, 1.150, 1.000]], dtype=float)
clust,labels_ = kmeans2(SN,3) # cluster with 3 random initial clusters
# PCA on orig. dataset
# Xred will have only 2 columns, the first two princ. comps.
# evals has shape (4,) and evecs (4,4). We need all eigenvalues
# to determine the portion of variance
Xred, evals, evecs = dim_red_pca(SN,2)
xlab = '1. PC - ExpVar = {:.2f} %'.format(evals[0]/sum(evals)*100) # determine variance portion
ylab = '2. PC - ExpVar = {:.2f} %'.format(evals[1]/sum(evals)*100)
# plot the clusters, each set separately
plt.figure()
ax = plt.gca()
scatterHs = []
clr = ['r', 'b', 'k']
for cluster in set(labels_):
    scatterHs.append(ax.scatter(Xred[labels_ == cluster, 0], Xred[labels_ == cluster, 1],
                                color=clr[cluster], label='Cluster {}'.format(cluster)))
plt.legend(handles=scatterHs,loc=4)
plt.setp(ax, title='First and Second Principle Components', xlabel=xlab, ylabel=ylab)
# plot also the eigenvectors for deriving the influence of each feature
fig, ax = plt.subplots(2,1)
ax[0].bar([1, 2, 3, 4], evecs[:, 0])  # weights of the four features in the first PC
plt.setp(ax[0], title="First and Second Component's Eigenvectors", ylabel='Weight')
ax[1].bar([1, 2, 3, 4], evecs[:, 1])  # weights of the four features in the second PC
plt.setp(ax[1], xlabel='Features', ylabel='Weight')
Output
The eigenvectors show the weighting of each feature for the component
Short Interpretation
Let's just have a look at cluster zero, the red one. We'll be mostly interested in the first component, as it explains about 3/4 of the distribution. The red cluster is in the upper area of the first component: all of its observations yield rather high values. What does that mean? Looking at the linear combination of the first component, we see at first glance that the second feature is rather unimportant (for this component). The first and fourth features are the highest weighted, and the third one has a negative score. This means that, since all red vertices have a rather high score on the first PC, these vertices will have high values in the first and last features while at the same time having low values for the third feature.
Concerning the second feature, we can have a look at the second PC. However, note that the overall impact is far smaller, as this component explains only roughly 16 % of the variance compared to the ~74 % of the first PC.
You can do it this way:
>>> import numpy as np
>>> import sklearn.cluster as cl
>>> data = np.array([99,1,2,103,44,63,56,110,89,7,12,37])
>>> k_means = cl.KMeans(init='k-means++', n_clusters=3, n_init=10)
>>> k_means.fit(data[:,np.newaxis]) # [:,np.newaxis] converts data from 1D to 2D
>>> k_means_labels = k_means.labels_
>>> k1,k2,k3 = [data[np.where(k_means_labels==i)] for i in range(3)] # range(3) because 3 clusters
>>> k1
array([44, 63, 56, 37])
>>> k2
array([ 99, 103, 110, 89])
>>> k3
array([ 1, 2, 7, 12])
Try this:
estimator = KMeans()
estimator.fit(X)
res = estimator.__dict__
print(res['cluster_centers_'])
You will get a matrix of cluster centers and feature weights; from that you can conclude that the features with more weight contribute the most to a cluster.
I assume that by saying "a primary feature" you mean it had the biggest impact on the class. A nice exploration you can do is to look at the coordinates of the cluster centers. For example, plot for each feature its coordinate in each of the K centers.
Of course, any features that are on a large scale will have a much larger effect on the distance between the observations, so make sure your data is well scaled before performing any analysis.
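A minimal sketch of that plot, assuming a fitted estimator named k_means as in the question's snippet:
import numpy as np
import matplotlib.pyplot as plt

centers = k_means.cluster_centers_            # shape: (n_clusters, n_features)
n_clusters, n_features = centers.shape

plt.figure()
for i in range(n_clusters):
    # one line per cluster: the center's coordinate on every feature
    plt.plot(np.arange(n_features), centers[i], marker='o', label='Cluster {}'.format(i))
plt.xlabel('Feature index')
plt.ylabel('Center coordinate')
plt.legend()
plt.show()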
A method I came up with is calculating the standard deviation of each feature in relation to its range, basically measuring how the data is spread across each feature.
The smaller the spread, the more characteristic the feature is for that cluster, basically:
1 - (std(x) / (max(x) - min(x)))
I wrote an article and a class to maintain it:
https://github.com/GuyLou/python-stuff/blob/main/pluster.py
https://medium.com/@guylouzon/creating-clustering-feature-importance-c97ba8133c37
It might be difficult to talk about feature importance separately for each cluster. Rather, it could be better to talk globally about which features are most important for separating the different clusters.
For this goal, a very simple method is as follows. Note that the squared Euclidean distance between two cluster centers is a sum of squared differences between the individual features. We can then just use the squared difference as the weight for each feature.
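A minimal sketch of that weighting, assuming a fitted KMeans estimator named k_means as in the question:
import numpy as np
from itertools import combinations

centers = k_means.cluster_centers_            # shape: (n_clusters, n_features)

# Sum, over all pairs of cluster centers, of the per-feature squared
# differences; larger values mean the feature separates the clusters more.
weights = np.zeros(centers.shape[1])
for i, j in combinations(range(centers.shape[0]), 2):
    weights += (centers[i] - centers[j]) ** 2

ranking = np.argsort(weights)[::-1]
print("Features ranked by contribution to cluster separation:", ranking)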
