I am trying to estimate the density of a data set at certain points, using scipy.
from scipy.stats import gaussian_kde
import numpy as np
I have a dataset A of 3D points (this is just a minimal example; my actual data has many more dimensions and many more samples):
A = np.array([[0.078377 , 0.76737392, 0.45038174],
[0.65990129, 0.13154658, 0.30770917],
[0.46068406, 0.22751313, 0.28122463]])
and the points at which I want to estimate the density
B = np.array([[0.40209377, 0.21063273, 0.75885516],
[0.91709997, 0.79303252, 0.65156937]])
But I can't seem to use the gaussian_kde function, as
result = gaussian_kde(A.T)(B.T)
returns
LinAlgError: Matrix is not positive definite
How do I fix this error? How do I get the density of my sample?
TL;DR:
You have highly correlated features in your data which leads to a numerical error. There are several possible ways to address this, each with pros and cons. A drop-in replacement class for gaussian_kde is proposed below.
Diagnostic
Your dataset (i.e. the matrix that you feed when creating the gaussian_kde object, not when using it) likely contains highly dependent features. This fact (possibly combined with low numerical resolution like float32 and "too many" datapoints) causes the covariance matrix to have near-zero eigenvalues, which due to numerical error can dip below zero. This in turn breaks the code that applies the Cholesky decomposition to the dataset covariance matrix (see the Explanation below for specific details).
Assuming your dataset has shape (dims, N), you can test whether this is your problem via np.linalg.eigh(np.cov(dataset))[0] <= 0. If any of the outputs come out True, let me be the first to welcome you to the club.
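For example, a minimal sketch of that check using the question's A (note that A has shape (N, dims), while gaussian_kde expects (dims, N)):
eigenvalues = np.linalg.eigh(np.cov(A.T))[0]
print(eigenvalues)
print((eigenvalues <= 0).any())  # True means you are likely hitting this issue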
Treatments
The idea is to get all eigenvalues above zero.
Increasing numerical resolution to the highest float that is practical can be an easy fix and worth trying, but may not be enough.
Given that this is caused by correlated features, removing datapoints doesn't help much a priori. There is a slim hope that having fewer numbers to crunch will propagate less error and keep the eigenvalues above zero. It's easy to implement, but it discards data points.
A more involved fix is to identify the highly correlated features and merge them or ignore the "superfluous" ones. This is tricky especially if the correlations among dimensions vary from instance to instance.
Probably the cleanest way is to leave the data untouched, and lift the problematic eigenvalues to positive values. This is usually done in 2 ways:
SVD addresses the problem directly at its core: decompose the covariance matrix and replace the negative eigenvalues with a small positive epsilon. This will put your matrix back into the PD domain while introducing minimal error (a sketch is given after this list).
If the SVD is too expensive to compute, an alternative numerical hack is to add epsilon * np.eye(D) to the covariance matrix. This has the effect of adding epsilon to each one of the eigenvalues. Again, if epsilon is small enough, this doesn't introduce that much of an error.
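A rough sketch of the first option; here cov is a placeholder for the offending covariance matrix, and since a covariance matrix is symmetric, an eigendecomposition is used in place of a full SVD:
# Lift any non-positive eigenvalues of cov to a small epsilon, then reassemble.
eps = 1e-10
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals = np.clip(eigvals, eps, None)
cov_pd = (eigvecs * eigvals) @ eigvecs.T  # reassembled, now positive definite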
If you go for the epsilon-identity approach, you'll need to tell gaussian_kde to modify its covariance matrix. This is a relatively clean way I found to do that: simply add this class to your codebase and replace gaussian_kde with GaussianKde (tested on my end, seems to work fine).
class GaussianKde(gaussian_kde):
    """
    Drop-in replacement for gaussian_kde that adds the class attribute EPSILON
    to the covmat eigenvalues, to prevent exceptions due to numerical error.
    """
    EPSILON = 1e-10  # adjust this at will

    def _compute_covariance(self):
        """Computes the covariance matrix for each Gaussian kernel using
        covariance_factor().
        """
        self.factor = self.covariance_factor()
        # Cache covariance and inverse covariance of the data
        if not hasattr(self, '_data_inv_cov'):
            self._data_covariance = np.atleast_2d(np.cov(self.dataset, rowvar=1,
                                                          bias=False,
                                                          aweights=self.weights))
            # we're going the easy way here
            self._data_covariance += self.EPSILON * np.eye(
                len(self._data_covariance))
            self._data_inv_cov = np.linalg.inv(self._data_covariance)
        self.covariance = self._data_covariance * self.factor**2
        self.inv_cov = self._data_inv_cov / self.factor**2
        L = np.linalg.cholesky(self.covariance * 2 * np.pi)
        self._norm_factor = 2 * np.log(np.diag(L)).sum()  # needed for scipy 1.5.2
        self.log_det = 2 * np.log(np.diag(L)).sum()  # changed var name on 1.6.2
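Usage is then the same as in the question's snippet, with the class swapped in (this is just a sketch; behaviour beyond the scipy versions noted in the comments above isn't guaranteed):
kde = GaussianKde(A.T)   # build the KDE from the dataset
densities = kde(B.T)     # evaluate the density at the query points
print(densities)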
Explanation
In case your error is similar but not quite the same, or anyone feels curious, here is the process I followed; hopefully it helps.
The exception stack specified that the error was triggered during a Cholesky decomposition. In my case, this was this line inside the _compute_covariance method.
Indeed, after the exception, checking the eigenvalues of the covariance and inv_cov attributes via np.linalg.eigh showed that covariance had a close-to-zero negative eigenvalue, and hence its inverse had a huge one. Since Cholesky expects PD matrices, this triggered the error.
At this point we can strongly suspect that the tiny negative eigenvalue is a numerical error, since covariance matrices are PSD. Indeed, the error originates when the covariance matrix is first computed from the dataset passed to the constructor, here. In my case, that covariance matrix had at least 2 almost identical columns, which led to the residual negative eigenvalue due to numerical error.
When will your dataset lead to a quasi-singular covariance matrix? That question is perfectly addressed in this other SE post. The bottom line is: If some variable is an exact linear combination of the other variables, with constant term allowed, the correlation and covariance matrices of the variables will be singular. The dependency observed in such matrix between its columns is actually that same dependency as the dependency between the variables in the data observed after the variables have been centered (their means brought to 0) or standardized (if we mean correlation rather than covariance matrix) (Kudos and +1 to ttnphns for the amazing work).
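A tiny, purely illustrative example of that situation, where one feature is an exact linear combination of another and the covariance matrix therefore becomes singular:
x = np.random.rand(1000)
data = np.vstack([x, 2.0 * x + 1.0])   # second row depends linearly on the first
print(np.linalg.eigvalsh(np.cov(data)))  # one eigenvalue is zero, up to roundoff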
Related
I'm attempting to write a simple implementation of the Newton-Raphson method, in Python. I've already done so using the SymPy library, however what I'm working on now will (ultimately) end up running in an environment where only Numpy is available.
For those unfamiliar with the algorithm, it works (in my case) as follows:
I have a system of symbolic equations, which I "stack" to form a matrix F. The unknowns are X,Y,Z,T (which I wish to determine). Some additional values are initially unknown, until passed to my solver, which substitutes these known values for variables in the symbolic expressions.
Now, the Jacobian matrix (J) of F is computed. This, too, is a matrix of symbolic expressions.
Now, I iterate over some range (max_iter). With each iteration, I form a matrix A by substituting the current estimates (starting with some initial values) for the unknowns X,Y,Z,T in F. Similarly, I form a matrix b by substituting the current estimates for X,Y,Z,T.
I then obtain new estimates by solving the matrix equation Ax = b for x. This vector x holds dT, dX, dY, dZ. I then add these to current estimates for T,X,Y,Z, and iterate again.
Thus far, I've found my largest issue to be computing the Jacobian matrix. I only need to do this once; however, it will be different depending upon the coefficients fed to the solver (these aren't unknowns, but they are only known once fed to the solver, so I can't simply hard-code the Jacobian).
While I'm not terribly familiar with Numpy, I know that it offers numpy.gradient. I'm not sure, however, that this is the same as SymPy's .jacobian.
How can the Jacobian matrix be found, either in "pure" Python, or with Numpy?
EDIT:
Should it be useful to you, more information on the problem can be found [here]. It can be formulated a few different ways; however, (as of now) I'm writing it as 4 equations of the form:
\sqrt{(X-x_i)^2+(Y-y_i)^2+(Z-z_i)^2 }= c * (t_i-T)
Where X,Y,Z and T are unknown.
This describes the solution to a localization problem, where we know (a) the location of n >= 4 observers in a 3-dimensional space, (b) the time at which each observer "saw" some signal, and (c) the velocity of the signal. The goal is to determine the coordinates of the signal source X,Y,Z (and, as a side effect, the time of emission, T).
Notice that I've tried (many) other approaches to solving this problem, and all leads point toward a combination of Newton-Raphson with regression.
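For reference, here is a minimal NumPy-only sketch of the iteration described above, using a finite-difference Jacobian; the function F, the step h, and the starting guess are illustrative placeholders, not part of the original code.
import numpy as np

def numerical_jacobian(F, x, h=1e-6):
    """Finite-difference Jacobian of a vector-valued function F at x."""
    fx = np.asarray(F(x), dtype=float)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        x_step = x.astype(float)
        x_step[j] += h
        J[:, j] = (np.asarray(F(x_step), dtype=float) - fx) / h
    return J

def newton_raphson(F, x0, max_iter=50, tol=1e-10):
    """Solve F(x) = 0 by repeatedly solving J dx = -F(x) and updating x."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        fx = np.asarray(F(x), dtype=float)
        J = numerical_jacobian(F, x)
        dx = np.linalg.solve(J, -fx)   # use np.linalg.lstsq if overdetermined
        x = x + dx
        if np.linalg.norm(dx) < tol:
            break
    return x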
NB: this problem actually happens inside tensorflow, and results in samples not exactly from the true pdf. The principle is, however, the same in numpy, and my aim is to understand the following warning.
Namely, I am trying to sample from a multivariate normal in python. That is
np.random.multivariate_normal(mean = some_mean_vector, cov = some_cov_matrix)
Of course, any valid covariance matrix must be positive semi-definite. However, some covariance matrices used for sampling (that pass every test for positive semi-definiteness) give the following warning:
/usr/local/lib/python3.6/site-packages/ipykernel/__main__.py:1: RuntimeWarning: covariance is not positive-semidefinite.
One such matrix is
A = np.array([[ 1.00000359e-01, -3.66802835e+00],
              [-3.66802859e+00,  1.34643845e+02]], dtype=np.float32)
for which I can find both the Cholesky decomposition and the eigenvalues without warning (the smallest eigenvalue is 7.42144039e-05).
Can anyone help and tell me why this might be happening?
(In tensorflow, I just feed the cholesky decomposition of the above matrix, and receive inexact samples, which messes up everything I'm trying to do).
The discussion of this pull request has some information about what triggers this warning. According to the source:
# Also check that cov is positive-semidefinite. If so, the u.T and v
# matrices should be equal up to roundoff error if cov is
# symmetrical and the singular value of the corresponding row is
# not zero. We continue to use the SVD rather than Cholesky in
# order to preserve current outputs. Note that symmetry has not
# been checked.
It seems that in your case the tests based on SVD and Cholesky give different results.
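To see where the two tests can disagree, one can mirror that check by hand on the matrix from the question (this is only a sketch of the idea, not numpy's exact code; the tolerances numpy uses internally may differ):
import numpy as np

A = np.array([[ 1.00000359e-01, -3.66802835e+00],
              [-3.66802859e+00,  1.34643845e+02]], dtype=np.float32)

# The Cholesky-style tests succeed: both calls run without complaint.
np.linalg.cholesky(A)
print(np.linalg.eigvalsh(A))

# The SVD-based check compares a symmetric reconstruction against the input;
# with float32 entries that are not exactly symmetric (-3.66802835 vs
# -3.66802859) and a near-zero eigenvalue, this comparison can fail.
u, s, v = np.linalg.svd(A)
print(np.allclose(np.dot(v.T * s, v), A))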
I am trying to implement least squares:
I have: $y = \theta\omega$
The least squares solution is $\omega = (\theta^{T}\theta)^{-1}\theta^{T}y$
I tried:
import numpy as np

def least_squares1(y, tx):
    """calculate the least squares solution."""
    w = np.dot(np.linalg.inv(np.dot(tx.T, tx)), np.dot(tx.T, y))
    return w
The problem is that this method quickly becomes unstable (for small problems it's okay).
I realized this when I compared the result to this least squares calculation:
import numpy as np

def least_squares2(y, tx):
    """calculate the least squares solution."""
    a = tx.T.dot(tx)
    b = tx.T.dot(y)
    return np.linalg.solve(a, b)
Compare both methods:
I tried to fit data with a polynomial of degree 12: [1, x, x^2, x^3, x^4, ..., x^12]
First method: (plot omitted)
Second method: (plot omitted)
Do you know why the first method diverges for large polynomials?
P.S. I only added "import numpy as np" for your convenience, if you want to test the functions.
There are three points here:
One is that it is generally better (faster, more accurate) to solve linear equations rather than to compute inverses.
The second is that it's always a good idea to use what you know about a system of equations (e.g. that the coefficient matrix is positive definite) when computing a solution; in this case, you should use numpy.linalg.lstsq.
The third is more specifically about polynomials. When using monomials as a basis, you can end up with a very poorly conditioned coefficient matrix, and this will mean that numerical errors tend to be large. This is because, for example, the vectors x->pow(x,11) and x->pow(x,12) are very nearly parallel. You would get a more accurate fit, and be able to use higher degrees, if you were to use a basis of orthogonal polynomials, for example https://en.wikipedia.org/wiki/Chebyshev_polynomials or https://en.wikipedia.org/wiki/Legendre_polynomials
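For instance, a sketch of both suggestions (the degree-12 design matrix construction is only an example; x stands for your sample points):
import numpy as np

# Point 2: solve the least squares problem directly with lstsq,
# without forming the normal equations tx.T @ tx.
def least_squares3(y, tx):
    """calculate the least squares solution via numpy.linalg.lstsq."""
    w, residuals, rank, singular_values = np.linalg.lstsq(tx, y, rcond=None)
    return w

# Point 3: build a better-conditioned design matrix from Chebyshev
# polynomials instead of raw monomials [1, x, x**2, ..., x**12], e.g.:
# tx_cheb = np.polynomial.chebyshev.chebvander(x, 12)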
I am going to improve on what was said before. I answered this yesterday.
The problem with higher order polynomials is something called Runge's phenomenon. The reason why the person resorted to orthogonal polynomials, such as Hermite polynomials, is that they attempt to get rid of the Gibbs phenomenon, an adverse oscillatory effect that appears when Fourier series methods are applied to non-periodic signals.
You can sometimes improve the conditioning by resorting to regularization methods if the matrix is low rank, as I did in the other post. Other parts may be due to smoothness properties of the vector.
I have this array of data and I need to calculate the area under the curve, so I use the Numpy and Scipy libraries, which provide trapz (in Numpy) and integrate.simps (in Scipy) for numerical integration; both gave me a really nice result.
The problem now is that I need the error for each one, or at least the error for the trapezoidal rule. The thing is that the formula for that asks for a function, which obviously I don't have. I have been researching a way to obtain the error but always return to the same point...
Here are the pages for scipy.integrate (http://docs.scipy.org/doc/scipy/reference/integrate.html) and for trapz in Numpy (http://docs.scipy.org/doc/numpy/reference/generated/numpy.trapz.html). I have tried and looked at a lot of code for numerical integration, but I prefer to use the existing functions...
Any ideas please?
While cel is right that you cannot determine an integration error if you don't know the function, there is something you can do.
You can use curve fitting to fit a function through the available data points. You can then use that function for error estimation.
If you expect the data to fit a certain kind of function like a sine, log or exponential it is good to use that as a basis for curve fitting.
For instance, if you are measuring the drag on a moving car, it is known that this is mostly proportional to the velocity squared because of air resistance.
However, if you do not have any knowledge about the applicable function, then, assuming you have N data points, there is a polynomial of degree N-1 that fits exactly through all those data points. Determining such a polynomial from the data amounts to solving a system of linear equations; see e.g. polynomial interpolation. You could use this polynomial as an estimate of the unknown real function. Note however that outside the range of data points this polynomial might be wildly inaccurate.
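As a sketch of the idea (the data and the cubic fit below are purely illustrative; replace them with your own array and a model you trust):
import numpy as np

# Illustrative data; replace with your own samples.
x = np.linspace(0.0, 2.0, 9)
y = np.sin(x)

# Trapezoidal estimate from the raw samples.
trapz_estimate = np.trapz(y, x)

# Fit a smooth model through the points (a cubic polynomial here, as an example)
# and integrate that model on a much finer grid.
coeffs = np.polyfit(x, y, 3)
x_fine = np.linspace(x[0], x[-1], 1000)
fit_estimate = np.trapz(np.polyval(coeffs, x_fine), x_fine)

# The difference gives a rough indication of the discretization error.
print(trapz_estimate, fit_estimate, abs(trapz_estimate - fit_estimate))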
I am frequently using scipy.optimize.leastsq() for my Ph.D. thesis; however, I have no idea how I can get an estimate of the Jacobian from the data that leastsq() returns. I need the estimate of the Jacobian that is used in the minimization, to compare it with the finite difference approximation at the minimum.
Does anyone have a formula for how to get it?
This can be a bit tricky, e.g. when you check how the covariance matrix is calculated inside leastsq().
I think the answer is that nobody ever managed to recover the Jacobian from optimize.leastsq.
The latest thread on this: http://mail.scipy.org/pipermail/scipy-user/2011-August/030320.html
From some examples of the covariance matrix returned by leastsq, I think there are many cases where the precision with default settings is not very high.
To check how good it is, you can compare the returned covariance matrix with the outer product of your Jacobian.
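A rough sketch of such a comparison (the model, data and starting values below are purely illustrative):
import numpy as np
from scipy.optimize import leastsq

# Illustrative model and data: fit y = a * exp(b * x).
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(0.5 * x)

def residuals(p):
    a, b = p
    return y - a * np.exp(b * x)

p_opt, cov_x, infodict, mesg, ier = leastsq(residuals, [1.0, 1.0], full_output=True)

# Finite-difference Jacobian of the residuals at the solution.
h = 1e-8
J = np.empty((x.size, p_opt.size))
for j in range(p_opt.size):
    dp = np.zeros_like(p_opt)
    dp[j] = h
    J[:, j] = (residuals(p_opt + dp) - residuals(p_opt)) / h

# cov_x from leastsq should roughly match the inverse of J^T J built
# from this finite-difference Jacobian.
print(cov_x)
print(np.linalg.inv(J.T @ J))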