My problem is this: I have a GMM model with K multivariate Gaussians and N samples.
I want to create an N*K numpy matrix whose [i, k] cell holds the pdf of the k'th Gaussian evaluated at the i'th sample, i.e. the value N(x_i | mu_k, cov_k).
In short, I'm interested in the matrix Q with Q[i, k] = N(x_i | mu_k, cov_k).
This is what I have now (I'm working with Python):
Q = np.array([scipy.stats.multivariate_normal(mu_t[k], cov_t[k]).pdf(X) for k in range(self.K)]).T
X in the code is a matrix whose rows are my samples.
It works fine on a small, low-dimensional toy dataset, but the dataset I'm working with is 10,000 28*28 pictures, and on that this line runs extremely slowly...
I want to find a solution that doesn't involve loops, only vector/matrix operations (i.e. vectorization). As far as I understand, scipy's 'multivariate_normal' cannot take the parameters of more than one Gaussian (although its 'pdf' method can evaluate multiple samples at once).
Does someone have an idea?
I am afraid that the main speed killer in your problem is the inversion and determinant calculation for the cov_t matrices. If you somehow manage to precalculate these, you can unroll the calculation: use broadcasting to get all the differences x_i - mu_k, and then use np.einsum to evaluate the full formula of the normal density in one shot.
Try
S = X[:, None, :] - mu_t[None, :, :]   # all differences x_i - mu_k, shape (N, K, d)
cov_t_inv = ??                         # precalculated inverse covariance matrices, shape (K, d, d)
cov_t_inv_det = ??                     # precalculated determinants of those inverses, shape (K,)
d = X.shape[1]
Q = (cov_t_inv_det / (2*np.pi)**d)**0.5 * np.exp(-0.5*np.einsum('ikr,krs,iks->ik', S, cov_t_inv, S))
Where you insert precalculated arrays cov_t_inv for the inverse covariance matrices and cov_t_inv_det for their determinants.
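For completeness, here is a minimal self-contained sketch of the whole computation; the shapes and random inputs are made up for illustration, and only the names X, mu_t and cov_t correspond to the question. Note that in 28*28 = 784 dimensions the raw pdf values will usually underflow, so in practice you probably want to keep the log-densities (log_Q below) rather than Q itself.
import numpy as np

N, d, K = 1000, 64, 5                      # made-up sizes, purely for illustration
X = np.random.rand(N, d)                   # samples, one per row
mu_t = np.random.rand(K, d)                # component means, one per row
cov_t = np.stack([np.eye(d)] * K)          # component covariances, shape (K, d, d)

cov_t_inv = np.linalg.inv(cov_t)           # all K inverses at once
_, logdet = np.linalg.slogdet(cov_t)       # log-determinants, safer than np.linalg.det

S = X[:, None, :] - mu_t[None, :, :]       # (N, K, d) array of differences x_i - mu_k
maha = np.einsum('ikr,krs,iks->ik', S, cov_t_inv, S)   # squared Mahalanobis distances
log_Q = -0.5 * (maha + logdet + d * np.log(2 * np.pi))
Q = np.exp(log_Q)                          # the desired (N, K) pdf matrix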
I'm trying to fit a gaussian to this set of data:
It is a 2D matrix with values (probability distribution). If I plot it in 3D it looks like:
As far as I understood from this other question (https://mathematica.stackexchange.com/questions/27642/fitting-a-two-dimensional-gaussian-to-a-set-of-2d-pixels) I need to compute the mean and the covariance matrix of my data and the Gaussian that I need will be exactly the one defined by that mean and covariance matrix.
However, I cannot properly understand the code in that other question (as it is Mathematica) and I am pretty stuck with the statistics.
How would I compute, in Python (NumPy, PyTorch...), the mean and the covariance matrix of the Gaussian?
I'm trying to avoid all these optimization frameworks (LSQ, KDE), as I think the solution is much simpler, and computational cost is something I have to take into account...
Thanks!
Let's call your data matrix D with shape d x n where d is the data dimension and n is the number of samples. I will assume that in your example, d=5 and n=6, although you will need to determine for yourself which is the data dimension and which is the sample dimension. In that case, we can find the mean and covariance using the following code:
import numpy as np

n = 6
d = 5
D = np.random.random([d, n])   # placeholder data: d rows (dimensions) and n columns (samples)
mean = D.mean(axis=1)          # average over the sample axis, shape (d,)
covariance = np.cov(D)         # np.cov treats each row as a variable, giving a (d, d) matrix
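If you then want the Gaussian itself, e.g. to evaluate its density at some points, you can hand that mean and covariance to scipy. This bit is my addition, not part of the code above; pass allow_singular=True if your covariance turns out to be singular.
from scipy.stats import multivariate_normal

rv = multivariate_normal(mean=mean, cov=covariance)
pdf_values = rv.pdf(D.T)   # density of the fitted Gaussian at each sample (one column of D)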
I am using the LMNN module from metric-learn (scikit-learn-contrib, http://contrib.scikit-learn.org/metric-learn/index.html), and I am attempting to recover the linear transformation matrix (L.T) from the learned Mahalanobis matrix (M).
The reason I am trying to recover this linear transformation is that I fit my dataset on cloud compute but test it on a local machine. This means I cannot save or reload the LMNN model after fitting on cloud compute, but I can save the learned M matrix, use a decomposition to find the learned linear transformation, and then apply that linear transformation to my test sets on the local machine.
The problem is that I can't seem to reconcile the results from the LMNN module's built in transformation with the learned linear transformation from the decomposed M matrix. Here's an example:
import numpy as np
from metric_learn import LMNN
from sklearn.datasets import load_iris
iris_data = load_iris()
X = iris_data['data']
Y = iris_data['target']
lmnn = LMNN(k=5, learn_rate=1e-6)
X_transformed = lmnn.fit_transform(X, Y)
M_matrix = lmnn.get_mahalanobis_matrix()
M_matrix
array([[ 2.47937397, 0.36313715, -0.41243858, -0.78715282],
[ 0.36313715, 1.69818843, -0.90042673, -0.0740197 ],
[-0.41243858, -0.90042673, 2.37024271, 2.18292864],
[-0.78715282, -0.0740197 , 2.18292864, 2.9531315 ]])
# decompose M_matrix as L.T.dot(L) (matrix square root via eigendecomposition)
eigvalues, eigcolvectors = np.linalg.eig(M_matrix)
eigvalues_diag = np.diag(eigvalues)
eigvalues_diag_sqrt = np.sqrt(eigvalues_diag)
L = eigcolvectors.dot(eigvalues_diag_sqrt.dot(np.linalg.inv(eigcolvectors)))
L_transpose = np.transpose(L)
L_transpose.dot(L) # check to confirm that matches M_matrix
array([[ 2.47937397, 0.36313715, -0.41243858, -0.78715282],
[ 0.36313715, 1.69818843, -0.90042673, -0.0740197 ],
[-0.41243858, -0.90042673, 2.37024271, 2.18292864],
[-0.78715282, -0.0740197 , 2.18292864, 2.9531315 ]])
# test fit_transform() vs. transform() using LMNN functions
lmnn.transform(X[0:4, :])
array([[8.2487 , 4.41337015, 0.14988465, 0.52629361],
[7.87314906, 3.77220291, 0.36015873, 0.525688 ],
[7.59410008, 4.03369392, 0.17339877, 0.51350962],
[7.41676205, 3.82012155, 0.47312948, 0.68515535]])
X_transformed[0:4, :]
array([[8.2487 , 4.41337015, 0.14988465, 0.52629361],
[7.87314906, 3.77220291, 0.36015873, 0.525688 ],
[7.59410008, 4.03369392, 0.17339877, 0.51350962],
[7.41676205, 3.82012155, 0.47312948, 0.68515535]])
# test manual transform of X[0:4, :]
X[0:4, :].dot(L_transpose)
array([[8.22608756, 4.45271327, 0.24690081, 0.51206068],
[7.85071271, 3.81054846, 0.45442718, 0.51144826],
[7.57310259, 4.06981377, 0.26240745, 0.50067674],
[7.39356544, 3.85511015, 0.55776916, 0.67615584]])
As seen above, the first four rows of the original dataset X[0:4, :], when transformed by the LMNN module (using either fit_transform(X, Y) or transform(X[0:4, :])), give results different from the manual transformation.
I believe my decomposition of the M matrix is correct as I can replicate the M matrix using L.T.dot(L).
The learned linear transformation is L.T, as per the GitHub code: https://github.com/scikit-learn-contrib/metric-learn/blob/master/metric_learn/base_metric.py
class MetricTransformer(six.with_metaclass(ABCMeta)):

    @abstractmethod
    def transform(self, X):
        """Applies the metric transformation.

        Parameters
        ----------
        X : (n x d) matrix
            Data to transform.

        Returns
        -------
        transformed : (n x d) matrix
            Input data transformed to the metric space by :math:`XL^{\\top}`
        """

class MahalanobisMixin(six.with_metaclass(ABCMeta, BaseMetricLearner,
                                          MetricTransformer)):
    r"""Mahalanobis metric learning algorithms.

    Algorithm that learns a Mahalanobis (pseudo) distance :math:`d_M(x, x')`,
    defined between two column vectors :math:`x` and :math:`x'` by: :math:`d_M(x,
    x') = \sqrt{(x-x')^T M (x-x')}`, where :math:`M` is a learned symmetric
    positive semi-definite (PSD) matrix. The metric between points can then be
    expressed as the euclidean distance between points embedded in a new space
    through a linear transformation. Indeed, the above matrix can be decomposed
    into the product of two transpose matrices (through SVD or Cholesky
    decomposition): :math:`d_M(x, x')^2 = (x-x')^T M (x-x') = (x-x')^T L^T L
    (x-x') = (L x - L x')^T (L x- L x')`
What am I missing here?
Thanks!
Metric-learn contributor here. @BeginnersMindTruly, you're right: for LMNN we indeed learn the L matrix directly during training, and M is only computed from it at the end, so computing the transformation L back from M may lead to numerical differences.
As for your particular use case of accessing directly the learned matrix L, you should be able to do that using the components_ attribute of your metric learner, at the end of training.
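For example, a minimal sketch; the save/load round trip via np.save is my own illustration of moving the matrix between machines, and X_test stands for whatever test set you have locally:
import numpy as np

# on the cloud machine, after lmnn.fit(X, Y):
L_learned = lmnn.components_                   # the learned linear transformation L
np.save('lmnn_components.npy', L_learned)

# on the local machine:
L_learned = np.load('lmnn_components.npy')
X_test_transformed = X_test.dot(L_learned.T)   # same mapping as lmnn.transform(X_test)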
Per the contributors who built the module (http://contrib.scikit-learn.org/metric-learn/index.html), this discrepancy is due to floating point precision: the LMNN module first computes the linear transformation L.T and then computes the M matrix as L.T.dot(L), so any attempt to recover the original transformation from M loses precision both in the computation of M and in the subsequent re-factorization of M.
I have two N x N co-occurrence matrices (484x484 and 1060x1060) that I have to analyze. The matrices are symmetrical along the diagonal and contain lots of zero values. The non-zero values are integers.
I want to group together the positions that are non-zero. In other words, what I want to do is the algorithm at this link: when 'order by cluster' is selected, the matrix gets rearranged in rows and columns so that the non-zero values are grouped together.
Since I am using Python for this task, I looked into SciPy's sparse linear algebra module, but couldn't find what I am looking for.
Any help is much appreciated. Thanks in advance.
If you have a matrix dist with pairwise distances between objects, then you can find the order in which to rearrange the matrix by applying a clustering algorithm to this matrix (http://scikit-learn.org/stable/modules/clustering.html). For example it might be something like:
from sklearn import cluster
import numpy as np

# the default linkage ("ward") only supports Euclidean distances,
# so pick another linkage when passing a precomputed distance matrix
model = cluster.AgglomerativeClustering(n_clusters=20, affinity="precomputed", linkage="average").fit(dist)
new_order = np.argsort(model.labels_)
ordered_dist = dist[new_order]   # can be your original matrix instead of dist
ordered_dist = ordered_dist[:, new_order]
The order is given by the variable model.labels_, which has the number of the cluster to which each sample belongs. A few observations:
You have to use a clustering algorithm that accepts a distance matrix as input. AgglomerativeClustering is such an algorithm (notice the affinity="precomputed" option telling it that we are passing precomputed distances, and a linkage other than the default "ward", which only works with Euclidean distances).
What you have seems to be a pairwise similarity matrix, in which case you first need to transform it into a distance matrix (e.g. dist = 1 - data/data.max()).
In the example I assumed 20 clusters; you may have to play with this variable a bit. Alternatively, you might try to find the best one-dimensional representation of your data (using e.g. MDS) to describe the optimal ordering of samples, as sketched below.
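A rough sketch of that MDS alternative (the choices of n_components=1 and random_state are illustrative assumptions):
from sklearn.manifold import MDS
import numpy as np

# embed the samples on a line using the precomputed distance matrix,
# then order rows/columns of the matrix by that 1-D coordinate
mds = MDS(n_components=1, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dist)           # shape (n_samples, 1)
new_order = np.argsort(coords[:, 0])
ordered_dist = dist[new_order][:, new_order]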
Because your data is sparse, treat it as a graph, not a matrix.
Then try the various graph clustering methods; for example, cliques are interesting on such data.
Note that not everything may cluster.
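For instance, here is a rough sketch with networkx; using modularity-based communities to produce an ordering (rather than cliques) is my own illustrative choice, and cooc stands for your co-occurrence matrix:
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

A = np.asarray(cooc)                  # your co-occurrence matrix, e.g. 484x484
G = nx.from_numpy_array(A)            # weighted undirected graph; zero entries become non-edges

# group densely connected nodes together; weakly connected nodes may not cluster nicely
communities = greedy_modularity_communities(G, weight="weight")
new_order = [node for community in communities for node in community]
ordered = A[new_order][:, new_order]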
I am using PCA to reduce the dimensionality of a N-dimensional dataset, but I want to build in robustness to large outliers, so I've been looking into Robust PCA codes.
For traditional PCA, I'm using python's sklearn.decomposition.PCA which nicely returns the principal components as vectors, onto which I can then project my data (to be clear, I've also coded my own versions using SVD so I know how the method works). I found a few pre-coded RPCA python codes out there (like https://github.com/dganguli/robust-pca and https://github.com/jkarnows/rpcaADMM).
The 1st code is based on the Candes et al. (2009) method, and returns low-rank L and sparse S matrices for a dataset D. The 2nd code uses the ADMM method of matrix decomposition (Parikh, N., & Boyd, S. 2013) and returns X_1, X_2, X_3 matrices. I must admit, I'm having a very hard time figuring out how to connect these to the principal axes that are returned by a standard PCA algorithm. Can anyone provide any guidance?
Specifically, in one dataset X, I have a cloud of N 3-D points. I run it through PCA:
pca = sklearn.decomposition.PCA(n_components=3)
pca.fit(X)
comps=pca.components_
and these 3 components are 3-D vectors that define the new basis onto which I project all my points. With Robust PCA, I get matrices L + S = X. Does one then run pca.fit(L)? I would have thought that RPCA would give me back the eigenvectors, but with internal steps to throw out outliers while building the covariance matrix or performing the SVD.
Maybe what I think of as "Robust PCA" isn't how other people are using/coding it?
The robust-pca code factors the data matrix D into two matrices, L and S which are "low-rank" and "sparse" matrices (see the paper for details). L is what's mostly constant between the various observations, while S is what varies. Figures 2 and 3 in the paper give a really nice example from a couple of security cameras, picking out the static background (L) and variability such as passing people (S).
If you just want the eigenvectors, treat the S as junk (the "large outliers" you're wanting to clip out) and do an eigenanalysis on the L matrix.
Here's an example using the robust-pca code:
L, S = RPCA(data).fit()
rcomp, revals, revecs = pca(L)
print("Normalised robust eigenvalues: %s" % (revals/np.sum(revals),))
Here, the pca function is:
import numpy as np

def pca(data, numComponents=None):
    """Principal Components Analysis

    From: http://stackoverflow.com/a/13224592/834250

    Parameters
    ----------
    data : `numpy.ndarray`
        numpy array of data to analyse
    numComponents : `int`
        number of principal components to use

    Returns
    -------
    comps : `numpy.ndarray`
        Principal components
    evals : `numpy.ndarray`
        Eigenvalues
    evecs : `numpy.ndarray`
        Eigenvectors
    """
    m, n = data.shape
    data = data - data.mean(axis=0)  # centre the data (as a copy, so the caller's array is untouched)
    R = np.cov(data, rowvar=False)
    # use 'eigh' rather than 'eig' since R is symmetric,
    # the performance gain is substantial
    evals, evecs = np.linalg.eigh(R)
    idx = np.argsort(evals)[::-1]
    evecs = evecs[:, idx]
    evals = evals[idx]
    if numComponents is not None:
        evecs = evecs[:, :numComponents]
    # carry out the transformation on the data using eigenvectors
    # and return the re-scaled data, eigenvalues, and eigenvectors
    return np.dot(evecs.T, data.T).T, evals, evecs
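To connect this back to the question: yes, after the robust decomposition you essentially run ordinary PCA on L. If you also want the coordinates of the original points (outliers included) in that robust basis, here is a small sketch (centring on the mean of L is an assumption on my part):
# project the original data onto the robust principal axes obtained from L
centered = data - L.mean(axis=0)
robust_scores = centered.dot(revecs)   # one row of robust-basis coordinates per original point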