I am currently working through different Machine Learning methods at a low level. Since the polynomial kernel
K(x, z) = (1 + x^T z)^d
is one of the more commonly used kernels, I was under the assumption that this kernel must yield a positive (semi-)definite matrix for any fixed set of data {x1,...,xn}.
However, in my implementation, this seems to not be the case, as the following example shows:
import numpy as np
np.random.seed(93)
x=np.random.uniform(0,1,5)
#Assuming d=1
kernel=(1+x[:,None]*x[None,:])
np.linalg.eigvals(kernel)
The output would be
[ 6.9463439e+00 1.6070016e-01 9.5388039e-08 -1.5821310e-08 -3.7724433e-08]
I'm also getting negative eigenvalues for d>2.
Am I totally misunderstanding something here? Or is the polynomial kernel simply not PSD?
EDIT: In a previous version, I used x=np.float32(np.random.uniform(0,1,5)) to reduce computing time, which led to a greater number of negative eigenvalues (I believe due to numerical instabilities, as #user2357112 mentioned). I guess this is a good example that precision does matter. Since negative eigenvalues still occur even with float64 precision, the follow-up question is: how can such numerical instabilities be avoided?
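One common workaround, shown below as a minimal sketch rather than an authoritative fix, is to judge positive semi-definiteness only up to a round-off tolerance, or to add a tiny multiple of the identity ("jitter") before using the matrix, which lifts every eigenvalue by that amount:
import numpy as np

np.random.seed(93)
x = np.random.uniform(0, 1, 5)
kernel = 1 + x[:, None] * x[None, :]

# Option 1: test PSD-ness up to a tolerance instead of exactly at zero.
eigvals = np.linalg.eigvalsh(kernel)  # eigvalsh, since the kernel matrix is symmetric
tol = np.finfo(kernel.dtype).eps * np.abs(eigvals).max() * len(kernel)
print(np.all(eigvals >= -tol))  # True: PSD up to round-off

# Option 2: add a small "jitter" term, which shifts every eigenvalue up by eps.
eps = 1e-10
kernel_jittered = kernel + eps * np.eye(len(kernel))
The jitter value is a trade-off: large enough to swamp the round-off error, small enough not to distort the kernel.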
Related
I am trying to estimate the density of a data set at certain points, using scipy.
from scipy.stats import gaussian_kde
import numpy as np
I have a dataset A of 3D points (this is just a minimal example. My actual data has many more dimensions and many more samples)
A = np.array([[0.078377 , 0.76737392, 0.45038174],
[0.65990129, 0.13154658, 0.30770917],
[0.46068406, 0.22751313, 0.28122463]])
and the points at which I want to estimate the density
B = np.array([[0.40209377, 0.21063273, 0.75885516],
[0.91709997, 0.79303252, 0.65156937]])
But I can't seem to use the gaussian_kde function, as
result = gaussian_kde(A.T)(B.T)
returns
LinAlgError: Matrix is not positive definite
How do I fix this error? How do I get the density of my sample?
TL;DR:
You have highly correlated features in your data which leads to a numerical error. There are several possible ways to address this, each with pros and cons. A drop-in replacement class for gaussian_kde is proposed below.
Diagnostic
Your dataset (i.e. the matrix that you feed when creating the gaussian_kde object, not when using it), likely contains highly dependent features. This fact (possibly combined with having low numerical resolution like float32 and "too many" datapoints) causes the covariance matrix to have near-zero eigenvalues, which due to numerical error can get below zero. This in turn will break the code that uses the Cholesky decomposition on the dataset covariance matrix (see explanation for specific details).
Assuming your dataset has shape (dims, N), you can test whether this is your problem via np.linalg.eigh(np.cov(dataset))[0] <= 0. If any of the outputs come out True, let me be the first to welcome you to the club.
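For illustration, here is a deliberately degenerate toy dataset (hypothetical, just to show the test firing): the third feature is an exact linear combination of the first two, so the covariance matrix is singular up to round-off:
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(size=1000)
b = rng.uniform(size=1000)
dataset = np.vstack([a, b, 2 * a - 3 * b])  # shape (dims, N); third row is dependent

eigvals = np.linalg.eigh(np.cov(dataset))[0]
print(eigvals)               # the smallest eigenvalue is ~0, possibly slightly negative
print(np.any(eigvals <= 0))  # the diagnostic from above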
Treatments
The idea is to get all eigenvalues above zero.
Increasing numerical resolution to the highest float that is practical can be an easy fix and worth trying, but may not be enough.
Given that this is caused by correlated features, removing datapoints doesn't help much a priori. There is a slim hope that having fewer numbers to crunch will propagate less error and keep the eigenvalues above zero. It's easy to implement, but it discards data points.
A more involved fix is to identify the highly correlated features and merge them, or ignore the "superfluous" ones. This is tricky, especially if the correlations among dimensions vary from instance to instance.
Probably the cleanest way is to leave the data untouched and lift the problematic eigenvalues to positive values. This is usually done in one of two ways:
SVD addresses the problem directly at its core: decompose the covariance matrix and replace the negative eigenvalues with a small positive epsilon. This puts your matrix back into the PD domain while introducing minimal error (a minimal sketch of this route follows below).
If the SVD is too expensive to compute, an alternative numerical hack is to add epsilon * np.eye(D) to the covariance matrix. This has the effect of adding epsilon to each of the eigenvalues. Again, if epsilon is small enough, this doesn't introduce much of an error.
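A minimal sketch of the first route, assuming you already have a symmetric (dims, dims) covariance matrix cov (the function name and the eps value are illustrative):
import numpy as np

def lift_to_psd(cov, eps=1e-10):
    # Decompose (eigh, since a covariance matrix is symmetric),
    # clip eigenvalues that dipped below eps, and reassemble.
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.clip(eigvals, eps, None)
    return eigvecs @ np.diag(eigvals) @ eigvecs.T

# e.g. cov_fixed = lift_to_psd(np.cov(dataset))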
If you go for the last approach you'll need to tell gaussian_kde to modify its covariance matrix. This is a relatively clean way I found to do that: simply add this class to your codebase and replace gaussian_kde with GaussianKde (tested on my end, seems to work fine).
import numpy as np
from scipy.stats import gaussian_kde


class GaussianKde(gaussian_kde):
    """
    Drop-in replacement for gaussian_kde that adds the class attribute EPSILON
    to the covmat eigenvalues, to prevent exceptions due to numerical error.
    """

    EPSILON = 1e-10  # adjust this at will

    def _compute_covariance(self):
        """Computes the covariance matrix for each Gaussian kernel using
        covariance_factor().
        """
        self.factor = self.covariance_factor()
        # Cache covariance and inverse covariance of the data
        if not hasattr(self, '_data_inv_cov'):
            self._data_covariance = np.atleast_2d(np.cov(self.dataset, rowvar=1,
                                                         bias=False,
                                                         aweights=self.weights))
            # we're going the easy way here
            self._data_covariance += self.EPSILON * np.eye(
                len(self._data_covariance))
            self._data_inv_cov = np.linalg.inv(self._data_covariance)
        self.covariance = self._data_covariance * self.factor**2
        self.inv_cov = self._data_inv_cov / self.factor**2
        L = np.linalg.cholesky(self.covariance * 2 * np.pi)
        self._norm_factor = 2 * np.log(np.diag(L)).sum()  # needed for scipy 1.5.2
        self.log_det = 2 * np.log(np.diag(L)).sum()  # changed var name on 1.6.2
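With the arrays A and B from the question, usage is then a direct drop-in (sketch):
kde = GaussianKde(A.T)  # build the KDE from the (dims, N) dataset
density = kde(B.T)      # evaluate the density at the query points
print(density)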
Explanation
In case your error is similar, but not quite that, or anyone feels curious, here is the process I followed, hopefully it helps.
The exception stack specified that the error was triggered during a Cholesky decomposition. In my case, it was the np.linalg.cholesky call inside the _compute_covariance method.
Indeed, after the exception, checking the eigenvalues of the covariance and inv_cov attributes via np.linalg.eigh showed that covariance had a close-to-zero negative eigenvalue, and hence its inverse had a huge one. Since Cholesky expects PD matrices, this triggered the error.
At this point we can heavily suspect that the tiny negative eigenvalue is a numerical error, since covariance matrices are PSD. Indeed, the error originates when the covariance matrix is first computed from the dataset passed to the constructor. In my case, that covariance matrix had at least two almost identical columns, which led to the residual negative eigenvalue due to numerical error.
When will your dataset lead to a quasi-singular covariance matrix? That question is perfectly addressed in this other SE post. The bottom line is: "If some variable is an exact linear combination of the other variables, with constant term allowed, the correlation and covariance matrices of the variables will be singular. The dependency observed in such matrix between its columns is actually that same dependency as the dependency between the variables in the data observed after the variables have been centered (their means brought to 0) or standardized (if we mean correlation rather than covariance matrix)." (Kudos and +1 to ttnphns for the amazing work.)
I am having trouble understanding Numpy's behavior regarding the Nyquist frequency. Consider the following example:
import numpy as np
x=np.linspace(0, 2*np.pi, 21)[:-1]
k=np.fft.rfftfreq(len(x), d=x[1]-x[0])
FFT=np.fft.rfft(x)
x1=np.fft.irfft(1j*k*FFT)
FFT[-1]+=1e5
x2=np.fft.irfft(1j*k*FFT)
print(np.allclose(x1,x2))
Prints True. So apparently it doesn't matter what I do with the Nyquist frequency in FFT; the result is always the same and the change is ignored. Curiously, this does not happen when trying to just recover the function (no differentiation):
x1=np.fft.irfft(FFT)
FFT[-1]+=1e5
x2=np.fft.irfft(FFT)
print(np.allclose(x1,x2))
prints False.
I may be misunderstanding what the Nyquist frequency is here (Wikipedia and other sources weren't very helpful), but aren't both results supposed to be affected by a change in the Nyquist frequency? The closest explanation I can find is that the Nyquist component is supposed to be a real number, but that still doesn't seem to explain both behaviors.
The reason I'm asking this is because I'm trying to reproduce results that I know are correct from a Fortran code that does do some stuff with the Nyquist frequency when differentiating. My results are always about 1% off, and I'm guessing this is the culprit.
The r in np.fft.rfft() indicates that you are using the DFT on real input. But if that assumption does not hold, you will get unexpected behaviors like this one. Just use the plain fft functions for complex values. As a side note, always try to inspect your data.
EDIT (additional explanation):
In particular, when you calculate the "DFT for real inputs" you are enforcing certain properties on your data: the (D)FT of a real-valued function is Hermitian-symmetric, so the negative-frequency (D)FT coefficients are redundant, and rfft and later irfft are optimized for computation under this assumption.
See their documentations np.fft.rfft() and np.fft.irfft() for more information.
Briefly, because of this expected symmetry, half of your coefficients (the negative-frequency ones) are not computed by np.fft.rfft(), and, by the same symmetry, the first component (DC) is purely real by definition, and for even-length input the last component (Nyquist) is purely real as well.
Because of the 1j multiplication, whatever was purely real is now purely imaginary (and vice versa) in the subsequent irfft calls.
Since irfft() ignores the imaginary part of the first and last components, your modification of FFT[-1] does not affect its result.
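A small check of that explanation (a sketch that replays the question's example, isolating a real versus a purely imaginary change at the Nyquist bin):
import numpy as np

x = np.linspace(0, 2 * np.pi, 21)[:-1]
FFT = np.fft.rfft(x)
base = np.fft.irfft(FFT)

# A purely imaginary perturbation of the Nyquist bin is discarded by irfft...
FFT_imag = FFT.copy()
FFT_imag[-1] += 1e5j
print(np.allclose(base, np.fft.irfft(FFT_imag)))  # expected: True

# ...whereas a real perturbation of the same bin changes the result.
FFT_real = FFT.copy()
FFT_real[-1] += 1e5
print(np.allclose(base, np.fft.irfft(FFT_real)))  # expected: False
The derivative case in the question corresponds to the first situation: after multiplying by 1j*k, the change you made at FFT[-1] lands entirely in the imaginary part of the Nyquist bin.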
I am attempting to calculate integrals between two limits using python/scipy.
I am using online calculators to double check my results (http://www.wolframalpha.com/widgets/view.jsp?id=8c7e046ce6f4d030f0b386ea5c17b16a, http://www.integral-calculator.com/), and my results disagree when I have certain limits set.
The code used is:
import scipy as sp
import scipy.integrate  # needed so that sp.integrate is available
import numpy as np

def integrand(x):
    return np.exp(-0.5*x**2)

def int_test(a, b):
    # a and b are the lower and upper bounds of the integration
    return sp.integrate.quad(integrand, a, b)
When setting the limits (a, b) to (-np.inf, 1) I get an answer that agrees with the online calculators (2.10894...); however, if I set (-np.inf, 300) I get an answer of zero.
On further investigation using:
for i in range(50):
    print(i, int_test(-np.inf, i))
I can see that the result goes wrong at i=36.
I was wondering if there was a way to avoid this?
Thanks,
Matt
I am guessing this has to do with the infinite bounds. scipy.integrate.quad is a wrapper around quadpack routines.
https://people.sc.fsu.edu/~jburkardt/f_src/quadpack/quadpack.html
In the end, these routines choose suitable subintervals and try to get the value of the integral through function evaluations and numerical integration. This works fine for finite integrals, assuming you know roughly how fine the steps of the function evaluation can be made.
For infinite integrals, it depends on how well the algorithms choose the respective subintervals and how accurately they are computed.
My advice: do NOT use numerical integration software AT ALL if you are interested in accurate values for infinite integrals.
If your problem can be solved analytically, try that or confine yourself to certain bounds.
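Following that advice, here is a sketch of both routes for this particular integrand: the closed form via the error function, and a confined version that splits the range at 0 so the peak sits on the boundary of the finite piece rather than getting lost in a huge interval:
import numpy as np
from scipy import integrate, special

def integrand(x):
    return np.exp(-0.5 * x**2)

# Analytic route: the integral of exp(-x^2/2) from -inf to b is
# sqrt(pi/2) * (1 + erf(b / sqrt(2))).
def analytic(b):
    return np.sqrt(np.pi / 2) * (1 + special.erf(b / np.sqrt(2)))

# Confined route: keep the well-behaved infinite tail separate from a finite
# piece whose left endpoint lies at the peak of the integrand.
def split_quad(b):
    left, _ = integrate.quad(integrand, -np.inf, 0)
    right, _ = integrate.quad(integrand, 0, b)
    return left + right

print(analytic(300))    # ~2.5066, i.e. sqrt(2*pi)
print(split_quad(300))  # agrees, unlike quad(integrand, -np.inf, 300)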
I just checked numpy's sine function. Apparently, it produce highly inaccurate results around pi.
In [26]: import numpy as np
In [27]: np.sin(np.pi)
Out[27]: 1.2246467991473532e-16
The expected result is 0. Why is numpy so inaccurate there?
To some extent, I feel uncertain whether it is acceptable to regard the calculated result as inaccurate: its absolute error is within one machine epsilon (for binary64), whereas the relative error is +inf, which is why I feel somewhat confused. Any ideas?
[Edit] I fully understand that floating-point calculations can be inaccurate. But most floating-point libraries manage to deliver results within a small range of error. Here, the relative error is +inf, which seems unacceptable. Just imagine that we want to calculate
1/(1e-16 + sin(pi))
The results would be disastrously wrong if we use numpy's implementation.
The main problem here is that np.pi is not exactly π, it's a finite binary floating point number that is close to the true irrational real number π but still off by ~1e-16. np.sin(np.pi) is actually returning a value closer to the true infinite-precision result for sin(np.pi) (i.e. the ideal mathematical sin() function being given the approximated np.pi value) than 0 would be.
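One way to see this concretely (a sketch, assuming the third-party mpmath package is available) is to compare np.pi against π at higher precision and then evaluate sin exactly at that float argument:
import numpy as np
from mpmath import mp, mpf, sin, pi

mp.dps = 50  # work with 50 decimal digits

# np.pi is the float64 closest to pi; the difference is about 1.2246e-16.
print(pi - mpf(np.pi))

# sin evaluated at that exact float argument, in high precision, is also
# about 1.2246e-16 (sin(pi - delta) ≈ delta), matching np.sin(np.pi).
print(sin(mpf(np.pi)))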
The value is dependent upon the algorithm used to compute it. A typical implementation will use some quickly-converging infinite series, carried out until it converges within one machine epsilon. Many modern chips (starting with the Intel 960, I think) had such functions in the instruction set.
To get 0 returned for this, we would need something that recognizes special cases: detect (the float closest to) a multiple of π and return the exact value. A notably more accurate algorithm or extra-precision arithmetic alone would not help, because np.sin is already returning (nearly) the closest-match result for the argument it was actually given.
I was trying to port some code from Python to MATLAB, but I encountered an inconsistency between numpy's fft2 and MATLAB's fft2:
peak =
4.377491037053e-223 3.029446976068e-216 ...
1.271610790463e-209 3.237410810582e-203 ...
(Large data can't be list directly, it can be accessed here:https://drive.google.com/file/d/0Bz1-hopez9CGTFdzU0t3RDAyaHc/edit?usp=sharing)
Matlab:
fft2(peak) --(sample result)
12.5663706143590 -12.4458341615690
-12.4458341615690 12.3264538927637
Python:
np.fft.fft2(peak) --(sample result)
12.56637061 +0.00000000e+00j -12.44583416 +3.42948517e-15j
-12.44583416 +3.35525358e-15j 12.32645389 -6.78073635e-15j
Please help me understand why, and suggest how to fix it.
The Fourier transform of a real, even function is real and even (ref). Therefore, it appears that your FFT should be real? Numpy is probably just struggling with the numerics while MATLAB may outright check for symmetry and force the solution to be real.
MATLAB uses FFTW3, while my research indicates NumPy uses a library called FFTPACK. FFTW is one of the standards for FFT performance and uses a number of tricks to work quickly and perform calculations to the best precision possible. Your data contains incredibly tiny numbers, and this presents numerical challenges that any library will be hard pressed to resolve.
You might consider executing the Python code against an FFTW3 wrapper like pyFFTW3 and see if you get similar results.
It appears that your input data is Gaussian: real and even, in which case we do expect the FFT2 of the signal to be real and even. If all your inputs are this way, you could just take the real part, or round to a certain precision. I would trust MATLAB's FFTW code over the Python code.
Or you could just ignore it. The differences are quite small and a value of 3e-15i is effectively zero for most applications. If you have automated the comparison, consider calling them equivalent if the mean square error of all the entries is less than some threshold (say 1e-8 or 1e-15 or 1e-20).
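A sketch of those two options (here peak is the array from the question, and F_matlab is a placeholder name for the imported MATLAB result):
import numpy as np

F = np.fft.fft2(peak)

# Option 1: drop imaginary parts that are pure round-off noise.
F_real = np.real_if_close(F, tol=1000)  # tol is in multiples of machine epsilon

# Option 2: automated comparison against the MATLAB output, calling the two
# results equivalent when the mean square error is below a threshold.
# mse = np.mean(np.abs(F - F_matlab) ** 2)
# equivalent = mse < 1e-15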