I am calculating the eigenvalues of a covariance matrix, which is real and symmetric positive semi-definite. Therefore, the eigenvalues and eigenvectors should all be real, however numpy.linalg.eig() is returning complex values with (almost) zero imaginary components.
The covariance matrix is too large to post here, but the eigenvalues come out as
[1.38174e01+00j, 9.00153e00+00j, ....]
with the largest imaginary component in the vector being negligible at -9.7557e-16j.
I think there is some machine precision issue here, as clearly the imaginary components are negligible (and given that my covariance matrix is real pos semi-def).
Is there a way to suppress returning the imaginary component using numpy eig (or scipy)? I'm trying to avoid an if statement that checks if the eigenvalue object is complex and then sets it to the real components only, if possible.
I think the best solution for this specific case is to use #PaulPanzer's suggestion, that is np.linalg.eigh. This works directly for Hermitian matrices, and thus will have only real Eigen values, exactly this specific use case.
In general, to retrieve the real part of numbers in an array is as easy as:
>>> np.real(np.array([1+1j,2+1j]))
array([ 1., 2.])
numpy.real returns the real part of your numbers.
Related
I am trying to estimate the the density of a data set at certain points, using scipy.
from scipy.stats import gaussian_kde
import numpy as np
I have a dataset A of 3D points (this is just a minimal example. My actual data has many more dimensions and many more samples)
A = np.array([[0.078377 , 0.76737392, 0.45038174],
[0.65990129, 0.13154658, 0.30770917],
[0.46068406, 0.22751313, 0.28122463]])
and the points at which I want to estimate the density
B = np.array([[0.40209377, 0.21063273, 0.75885516],
[0.91709997, 0.79303252, 0.65156937]])
But I can't seem to be able to use the gaussian_kde function, as
result = gaussian_kde(A.T)(B.T)
returns
LinAlgError: Matrix is not positive definite
How do I fix this error? How do I get the density of my sample?
TL; DR:
You have highly correlated features in your data which leads to a numerical error. There are several possible ways to address this, each with pros and cons. A drop-in replacement class for gaussian_kde is proposed below.
Diagnostic
Your dataset (i.e. the matrix that you feed when creating the gaussian_kde object, not when using it), likely contains highly dependent features. This fact (possibly combined with having low numerical resolution like float32 and "too many" datapoints) causes the covariance matrix to have near-zero eigenvalues, which due to numerical error can get below zero. This in turn will break the code that uses the Cholesky decomposition on the dataset covariance matrix (see explanation for specific details).
Assuming your dataset has shape (dims, N) you can test if this is your problem via np.linalg.eigh(np.cov(dataset))[0] <= 0. If any of the outputs comes out True, let me be the first welcoming you to the club.
Treatments
The idea is to get all eigenvalues above zero.
Increasing numerical resolution to the highest float that is practical can be an easy fix and worth trying, but may not be enough.
Given the fact that this is caused by correlated features, removing datapoints doesn't help much a priori. There is a slim hope that having less numbers to crush will propagate less error, and keep the eigenvalues above zero. It's easy to implement but it discards data points.
A more involved fix is to identify the highly correlated features and merge them or ignore the "superfluous" ones. This is tricky especially if the correlations among dimensions vary from instance to instance.
Probably the cleanest way is to leave the data untouched, and lift the problematic eigenvalues to positive values. This is usually done in 2 ways:
SVD addresses the problem directly at its core: Decompose the covariance matrix and replace the negative eigenvalues with a small positive epsilon. This will put your matrix back to PD domain introducing minimal error.
If the SVD is too expensive to compute, an alternative numerical hack is to add epsilon * np.eye(D) to the covariance matrix. This has the effect of adding epsilon to each one of the eigenvalues. Again, if epsilon is small enough, this doesn't introduce that much of an error.
If you go for the last approach you'll need to tell gaussian_kde to modify its covariance matrix. This is a relatively clean way I found to do that: simply add this class to your codebase and replace gaussian_kde with GaussianKde (tested on my end, seems to work fine).
class GaussianKde(gaussian_kde):
"""
Drop-in replacement for gaussian_kde that adds the class attribute EPSILON
to the covmat eigenvalues, to prevent exceptions due to numerical error.
"""
EPSILON = 1e-10 # adjust this at will
def _compute_covariance(self):
"""Computes the covariance matrix for each Gaussian kernel using
covariance_factor().
"""
self.factor = self.covariance_factor()
# Cache covariance and inverse covariance of the data
if not hasattr(self, '_data_inv_cov'):
self._data_covariance = np.atleast_2d(np.cov(self.dataset, rowvar=1,
bias=False,
aweights=self.weights))
# we're going the easy way here
self._data_covariance += self.EPSILON * np.eye(
len(self._data_covariance))
self._data_inv_cov = np.linalg.inv(self._data_covariance)
self.covariance = self._data_covariance * self.factor**2
self.inv_cov = self._data_inv_cov / self.factor**2
L = np.linalg.cholesky(self.covariance * 2 * np.pi)
self._norm_factor = 2*np.log(np.diag(L)).sum() # needed for scipy 1.5.2
self.log_det = 2*np.log(np.diag(L)).sum() # changed var name on 1.6.2
Explanation
In case your error is similar, but not quite that, or anyone feels curious, here is the process I followed, hopefully it helps.
The exception stack specified that the error was triggered during a Cholesky decomposition. In my case, this was this line inside the _compute_covariance method.
Indeed, after the exception, checking the Eigenvalues for the covariance and inv_cov attributes via np.eigh showed that covariance had a close-to-zero negative eigenvalue, and hence its inverse had a huge one. Since Cholesky expects PD matrices, this triggered an error.
At this point we can heavily suspect that the tiny, negative eigenvalue is a numerical error, since covariance matrices are PSD. Indeed, the error source comes when the covariance matrix is originally computed from the dataset that has been passed to the constructor, here. In my case, the covariance matrix yielded at least 2 almost identical columns, which led to the residual negative eigenvalue due to numerical error.
When will your dataset lead to a quasi-singular covariance matrix? That question is perfectly addressed in this other SE post. The bottom line is: If some variable is an exact linear combination of the other variables, with constant term allowed, the correlation and covariance matrices of the variables will be singular. The dependency observed in such matrix between its columns is actually that same dependency as the dependency between the variables in the data observed after the variables have been centered (their means brought to 0) or standardized (if we mean correlation rather than covariance matrix) (Kudos and +1 to ttnphns for the amazing work).
I want to get the eigenvectors of a matrix, but I do not want them to be normalized.
By default numpy.linalg.eig returns normalized eigenvectors. But, rather than multiplying this result by the norm (which will introduce an unnecessary additional numerical errors), I want it to just return the eigenvectors not normalized, but as they are originally.
Is there a smooth way to do this in either numpy or scipy? I am more familiar with numpy, so a solution there is best, but scipy would also be acceptable.
I would like to get all of the eigenvalues and eigenvectors for a particular real symmetric matrix. This is obviously possible with numpy.linalg.eigh, however, this matrix has a particular sparse structure which allows linear scaling dot-product with a vector. For this reason, I would like to use scipy.sparse.linalg.eigsh, which allows for a LinearOperator in place of the input array, and use of the implicitly restarted Lanczos method.
My problem is that scipy.sparse.linalg.eigsh does not allow calculation of all eigenvalues and eigenvectors (i.e. k=n), and the rank of my input matrix is typically equal to n. Is there any way to get around this, or does any other function allow similar functionality?
Assume that I have a square matrix M. Assume that I would like to invert the matrix M.
I am trying to use the the fractions mpq class within gmpy2 as members of my matrix M. If you are not familiar with these fractions, they are functionally similar to python's built-in package fractions. The only problem is, there are no packages that will invert my matrix unless I take them out of fraction form. I require the numbers and the answers in fraction form. So I will have to write my own function to invert M.
There are known algorithms that I could program, such as gaussian elimination. However, performance is an issue, so my question is as follows:
Is there a computationally fast algorithm that I could use to calculate the inverse of a matrix M?
Is there anything else you know about these matrices? For example, for symmetric positive definite matrices, Cholesky decomposition allows you to invert faster than the standard Gauss-Jordan method you mentioned.
For general matrix inversions, the Strassen algorithm will give you a faster result than Gauss-Jordan but slower than Cholesky.
It seems like you want exact results, but if you're fine with approximate inversions, then there are algorithms which approximate the inverse much faster than the previously mentioned algorithms.
However, you might want to ask yourself if you need the entire matrix inverse for your specific application. Depending on what you are doing it might be faster to use another matrix property. In my experience computing the matrix inverse is an unnecessary step.
I hope that helps!
I am working with a large (complex) Hermitian matrix and I am trying to diagonalize it efficiently using Python/Scipy.
Using the eigh function from scipy.linalgit takes about 3s to generate and diagonalize a roughly 800x800 matrix and compute all the eigenvalues and eigenvectors.
The eigenvalues in my problem are symmetrically distributed around 0 and range from roughly -4 to 4. I only need the eigenvectors corresponding to the negative eigenvalues, though, which turns the range I am looking to calculate into [-4,0).
My matrix is sparse, so it's natural to use the scipy.sparsepackage and its functions to calculate the eigenvectors via eigsh, since it uses much less memory to store the matrix.
Also I can tell the program to only calculate the negative eigenvalues via which='SA'. The problem with this method is, that it takes now roughly 40s to compute half the eigenvalues/eigenvectors. I know, that the ARPACK algorithm is very inefficient when computing small eigenvalues, but I can't think of any other way to compute all the eigenvectors that I need.
Is there any way, to speed up the calculation? Maybe with using the shift-invert mode? I will have to do many, many diagonalizations and eventually increase the size of the matrix as well, so I am a bit lost at the moment.
I would really appreciate any help!
This question is probably better to ask on http://scicomp.stackexchange.com as it's more of a general math question, rather than specific to Scipy or related to programming.
If you need all eigenvectors, it does not make very much sense to use ARPACK. Since you need N/2 eigenvectors, your memory requirement is at least N*N/2 floats; and probably in practice more. Using eigh requires N*N+3*N floats. eigh is then within a factor of 2 from the minimum requirement, so the easiest solution is to stick with it.
If you can process the eigenvectors "on-line" so that you can throw the previous one away before processing the next, there are other approaches; look at the answers to similar questions on scicomp.