I have to calculate the covariance between two parameters from a fit function. I found a Python package called iminuit that does a good fit and also calculates the covariance matrix of the parameters. I tested the package on a simple function. This is the code:
from iminuit import Minuit, describe, Struct

def func(x, y):
    f = x**2 + y**2
    return f

m = Minuit(func, pedantic=False, print_level=0)
m.migrad()
print("Covariance:")
print(m.matrix())
and this is the output:
Covariance:
((1.0, 0.0),
(0.0, 1.0))
However, if I replace x^2 + y^2 with (x - y)^2, I obtain:
Covariance:
((250.24975024975475, 249.75024975025426),
(249.75024975025426, 250.24975024975475))
I am confused about why I get a covariance bigger than 1 (I am not good at statistics, but from what I understood it has to be between -1 and 1), so can someone who knows iminuit help me? Also, in the first case, what does the matrix mean? Why is there 0 correlation between x and y, and what does the 1 on the diagonal mean?
You are confusing covariance with correlation. Correlation is the normalised version of the covariance, which is indeed always between -1 and 1.
To obtain the correlation from the covariance matrix, calculate:
correlation = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
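For reference, here is a minimal sketch (assuming NumPy and the covariance values printed above for the (x - y)^2 fit) showing how that normalisation turns a large covariance into a correlation close to 1:

import numpy as np

# covariance matrix printed by m.matrix() for the (x - y)^2 fit above
cov = np.array([[250.24975024975475, 249.75024975025426],
                [249.75024975025426, 250.24975024975475]])

correlation = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
print(correlation)  # ~0.998: x and y are almost perfectly correlated for this function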
When I apply statsmodels.multivariate.pca.PCA to some data, I am finding that the sum of the produced eigenvalues does not equal to the total variance of the data. I am using the following code
import numpy as np
import statsmodels.api as sm
corr_matrix = np.array([
[1, 0.8, 0.4],
[0.8, 1, 0.6],
[0.4, 0.6, 1]])
Z = np.random.multivariate_normal([0, 0, 0], corr_matrix, 1000)
pc = sm.PCA(Z, standardize=False, demean=False, normalize=False)
pc.eigenvals.sum()
and the result (for a given random sample) is 2994.51488403581, while I was expecting it to add up to 3.
What am I missing?
EDIT 1
It seems that when the PCA is performed on the raw data X (i.e. using the matrix X^T X), the relationship between the sum of variances and the sum of eigenvalues no longer holds; it is only when the PCA is performed on the covariance matrix (i.e. on X^T X / n) that the sum of eigenvalues equals the sum of variances, i.e. trace(X^T X / n) = sum(eigenvalues). I wish this were more clearly stated in all the posts one finds on PCA.
The eigenvalues are not the variance of the data. The eigenvalues are the variances of the data in specific directions, defined by the eigenvectors. The variance of the data is the average squared distance of the points from the mean of the data. The PCs characterise the data and show how it is spread out in space along specific directions. You should not confuse the variance of the data with an eigenvalue (which gives the variance in the direction of the corresponding eigenvector).
Quick answer by reverse engineering (I don't remember the details):
>>> pc = PCA(Z, standardize=False, demean=True, normalize=False)
>>> pc.eigenvals.sum() / 1000
2.7550787264061087
>>> Z.var(0).sum()
2.7550787264061087
In the computation of the variance, the data is demeaned. If we don't demean, then we only get an uncentered quadratic product.
>>> pc = PCA(Z, standardize=False, demean=False, normalize=False)
>>> pc.eigenvals.sum(), pc.eigenvals.sum() / Z.shape[0]
(2756.1915877060546, 2.7561915877060548)
>>> (Z**2).mean(0).sum()
2.7561915877060548
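To make the trace identity explicit, here is a minimal sketch (my own check, assuming the same Z as above) showing that the eigenvalues of the sample covariance matrix sum to its trace, i.e. to the sum of the per-column variances:

import numpy as np

cov = np.cov(Z, rowvar=False, bias=True)   # covariance matrix X^T X / n of the demeaned data
eigvals = np.linalg.eigvalsh(cov)          # eigenvalues of the covariance matrix

print(eigvals.sum())                       # sum of eigenvalues
print(np.trace(cov))                       # trace of the covariance matrix
print(Z.var(axis=0).sum())                 # sum of per-column variances: all three agree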
I don't understand why curve_fit isn't able to estimate the covariance of the parameter, thus raising the OptimizeWarning below. The following MCVE explains my problem:
MCVE Python snippet
from scipy.optimize import curve_fit
func = lambda x, a: a * x
popt, pcov = curve_fit(f = func, xdata = [1], ydata = [1])
print(popt, pcov)
Output
\python-3.4.4\lib\site-packages\scipy\optimize\minpack.py:715:
OptimizeWarning: Covariance of the parameters could not be estimated
category=OptimizeWarning)
[ 1.] [[ inf]]
For a = 1 the function fits xdata and ydata exactly. Why isn't the error/variance 0, or something close to 0, but inf instead?
There is this quote from the curve_fit SciPy Reference Guide:
If the Jacobian matrix at the solution doesn’t have a full rank, then ‘lm’ method returns a matrix filled with np.inf, on the other hand ‘trf’ and ‘dogbox’ methods use Moore-Penrose pseudoinverse to compute the covariance matrix.
So, what's the underlying problem? Why doesn't the Jacobian matrix at the solution have a full rank?
The formula for the covariance of the parameters (Wikipedia) has the number of degrees of freedom in the denominator. The degrees of freedom are computed as (number of data points) - (number of parameters), which is 1 - 1 = 0 in your example. And this is where SciPy checks the number of degrees of freedom before dividing by it.
With xdata = [1, 2], ydata = [1, 2] you would get zero covariance (note that the model still fits exactly: exact fit is not the problem).
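For illustration, a minimal sketch of that two-point case (my own check, not from the original post):

from scipy.optimize import curve_fit

func = lambda x, a: a * x
# two data points -> one degree of freedom, so the covariance is defined
# (and zero here, because the fit is exact and the residual variance is zero)
popt, pcov = curve_fit(f=func, xdata=[1, 2], ydata=[1, 2])
print(popt, pcov)  # expect popt ~ [1.] and pcov == [[0.]]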
This is the same sort of issue as the sample variance being undefined when the sample size N is 1 (the formula for the sample variance has (N - 1) in the denominator). If we take only a sample of size 1 from the population, we don't estimate the variance to be zero; we simply know nothing about the variance.
I use scipy.odr to make a fit with uncertainties on both x and y, following this question: Correct fitting with scipy curve_fit including errors in x?
After the fit I would like to compute the uncertainties on the parameters. Thus I look at the square roots of the diagonal elements of the covariance matrix. I get:
>>> print(np.sqrt(np.diag(output.cov_beta)))
[ 0.17516591 0.33020487 0.27856021]
But in the Output there is also output.sd_beta, which according to the odr doc is
Standard errors of the estimated parameters, of shape (p,).
But it does not give me the same results:
>>> print(output.sd_beta)
[ 0.19705029 0.37145907 0.31336217]
EDIT
This is an example on a notebook : https://nbviewer.jupyter.org/github/gvallverdu/cookbook/blob/master/fit_odr.ipynb
With least square
stop reason: ['Sum of squares convergence']
params: [ -1.94792946 11.03369235 -5.43265555]
info: 1
sd_beta: [ 0.26176284 0.49877962 0.35510071]
sqrt(diag(cov): [ 0.25066236 0.47762805 0.34004208]
With ODR
stop reason: ['Sum of squares convergence']
params: [-1.93538595 6.141885 -3.80784384]
info: 1
sd_beta: [ 0.6941821 0.88909997 0.17292514]
sqrt(diag(cov): [ 0.01093697 0.01400794 0.00272447]
The reason for the discrepancy is that sd_beta is scaled by the residual variance, whereas cov_beta isn't.
scipy.odr is an interface for the ODRPACK FORTRAN library, which is thinly wrapped in __odrpack.c. sd_beta and cov_beta are recovered by indexing into the work vector that's used internally by the FORTRAN routine. The indices of their first elements in work are variables named sd and vcv (see here).
From the ODRPACK documentation (p.85):
WORK(SDI) is the first element of a p × 1 array SD containing the standard deviations σ̂_βK of the function parameters β, i.e., the square roots of the diagonal entries of the covariance matrix, where
WORK(SDI-1+K) = SD(K) = V̂_β^(1/2)(K, K) = σ̂_βK
for K = 1, ..., p.
WORK(VCVI) is the first element of a p × p array VCV containing the values of the covariance matrix of the parameters β prior to scaling by the residual variance, where
WORK(VCVI-1+I+(J-1)*(NP)) = VCV(I, J) = σ̂_ε⁻² V̂_β(I, J)
for I = 1, ..., p and J = 1, ..., p.
In other words, np.sqrt(np.diag(output.cov_beta * output.res_var)) will give you the same result as output.sd_beta.
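For a self-contained check, here is a minimal sketch (my own example with made-up linear data, not the notebook above) that fits a model with scipy.odr and confirms that relationship:

import numpy as np
from scipy import odr

def linear(beta, x):
    return beta[0] * x + beta[1]

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

data = odr.RealData(x, y, sx=0.1, sy=0.5)
model = odr.Model(linear)
output = odr.ODR(data, model, beta0=[1.0, 0.0]).run()

# sd_beta should equal the square roots of the diagonal of the
# residual-variance-scaled covariance matrix
print(output.sd_beta)
print(np.sqrt(np.diag(output.cov_beta * output.res_var)))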
I've opened a bug report here.
I wonder if anyone can help me with my interpretation of the algorithm for computing the bivariate skewness, which is a single-number measure of skewness for bivariate data.
The bivariate skewness is defined as described in this paper:
http://www.jstor.org/discover/10.2307/2346576?sid=21105063910471&uid=3737592&uid=4&uid=2
I considered p = 2 (bivariate) and used the same other assumptions as described in the paper.
Here is the function I wrote to compute b1,p (the skewness algorithm in the paper) in Python:
import numpy as np

def multiSkew(x1, x2):  # bivariate function, e.g. use two columns of a dataframe (same len)
    covariance_x1_x2 = np.cov(x1, x2)  # compute the covariance matrix
    inv_covariance_x1_x2 = np.linalg.inv(covariance_x1_x2)  # inverse of covariance matrix
    x1_x2_mean = np.mean(x1), np.mean(x2)  # mean value of each variable

    mk = []
    for x_i in x1:
        for y_i in x2:
            x_diff = x_i - x1_x2_mean[0]  # from the equation (see link): (xi - xbar)
            y_diff = y_i - x1_x2_mean[1]  # (xj - xbar)
            yj = np.dot(np.dot(np.transpose(x_diff), inv_covariance_x1_x2), y_diff)
            mk.append(yj**3)

    skew = (1.0 / (len(x1)**2)) * sum(mk)
    return skew
This is what I get when I test it. My skewness is far too big; it should normally be around zero, if I am right:
In: multiSkew(x1,x2)
Out[15]: 2809276168.079186
Can anyone more advanced in programming help me, please? I must have made an error somewhere in the summing part.
I don't think there is a Python module that can help me compute the skewness of a multivariate data set. By the way, the skew function in scipy.stats only handles univariate data.
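For reference, here is a minimal sketch of the b1,p formula read as a double sum over paired observations (x1[i], x2[i]) rather than over all combinations of x1 and x2 values separately; this pairing is an assumed reading of the paper, not a verified fix:

import numpy as np

def mardia_skew(x1, x2):
    # Sketch of b1,p for p = 2: each term in the double sum uses full
    # observation vectors (x1[i], x2[i]), not separate scalar components.
    data = np.column_stack([x1, x2])              # n x 2 matrix of observations
    n = data.shape[0]
    diffs = data - data.mean(axis=0)              # centred observations (xi - xbar)
    cov = np.cov(data, rowvar=False, bias=True)   # covariance matrix S
    inv_cov = np.linalg.inv(cov)
    g = diffs @ inv_cov @ diffs.T                 # n x n matrix of (xi - xbar)' S^-1 (xj - xbar)
    return (g**3).sum() / n**2                    # b1,p = (1/n^2) * sum of cubes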
I am trying to fit a polynomial to a set of data. Sometimes the covariance matrix returned by numpy.polyfit consists only of inf, although the fit seems to be useful. There are no numpy.inf or numpy.nan values in the data!
Example:
import numpy as np
# sample data; it does not really show x**2-like behaviour,
# but that should be visible in the fit results
x = [-449., -454., -459., -464., -469.]
y = [ 0.9677024, 0.97341953, 0.97724978, 0.98215678, 0.9876293]
fit, cov = np.polyfit(x, y, 2, cov=True)
print 'fit: ', fit
print 'cov: ', cov
Result:
fit: [ 1.67867158e-06 5.69199547e-04 8.85146009e-01]
cov: [[ inf inf inf]
[ inf inf inf]
[ inf inf inf]]
np.cov(x,y) gives
[[ 6.25000000e+01 -6.07388099e-02]
[ -6.07388099e-02 5.92268942e-05]]
So np.cov is not the same as the covariance returned by np.polyfit. Does anybody have an idea what's going on?
EDIT:
I now get the point that numpy.cov is not what I want. I need the variances of the polynomial coefficients, but I don't get them if (len(x) - order - 2.0) == 0. Is there another way to get the variances of the fitted polynomial coefficients?
As rustil's answer says, this is caused by the bias correction applied to the denominator of the covariance equation, which results in a zero divide for this input. The reasoning behind this correction is similar to that behind Bessel's Correction. This is really a sign that there are too few datapoints to estimate covariance in a well-defined way.
How to skirt this problem? Well, this version of polyfit accepts weights. You could add another datapoint but weight it at epsilon. This is equivalent to reducing the 2.0 in this formula to a 1.0.
import sys
import numpy as np

x = [-449., -454., -459., -464., -469.]
y = [ 0.9677024, 0.97341953, 0.97724978, 0.98215678, 0.9876293]
x_extra = x + x[-1:]
y_extra = y + y[-1:]
weights = [1.0, 1.0, 1.0, 1.0, 1.0, sys.float_info.epsilon]
fit, cov = np.polyfit(x, y, 2, cov=True)
fit_extra, cov_extra = np.polyfit(x_extra, y_extra, 2, w=weights, cov=True)
print fit == fit_extra
print cov_extra
The output. Note that the fit values are identical:
>>> print fit == fit_extra
[ True True True]
>>> print cov_extra
[[ 8.84481850e-11 8.11954338e-08 1.86299297e-05]
[ 8.11954338e-08 7.45405039e-05 1.71036963e-02]
[ 1.86299297e-05 1.71036963e-02 3.92469307e+00]]
I am very uncertain that this will be especially meaningful, but it's a way to work around the problem. It's a bit of a kludge though. For something more robust, you could modify polyfit to accept its own ddof parameter, perhaps in lieu of the boolean that cov currently accepts. (I just opened an issue to suggest as much.)
A quick final note about the calculation of cov: If you look at the wikipedia page on least squares regression, you'll see that the simplified formula for the covariance of the coefficients is inv(dot(dot(X, W), X)), which has a corresponding line in the numpy code -- at least roughly speaking. In this case, X is the Vandermonde matrix, and the weights have already been multiplied in. The numpy code also does some scaling (which I understand; it's part of a strategy to minimize numerical error) and multiplies the result by the norm of the residuals (which I don't understand; I can only guess that it's part of another version of the covariance formula).
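As a rough illustration of that textbook formula, here is my own sketch using the unweighted data from the question; the column scaling mirrors what polyfit does internally, but the extra -2.0 in polyfit's degrees of freedom is replaced by the textbook n - p, and with such an ill-conditioned design matrix the numbers are only indicative:

import numpy as np

x = np.array([-449., -454., -459., -464., -469.])
y = np.array([0.9677024, 0.97341953, 0.97724978, 0.98215678, 0.9876293])
deg = 2

coef, resids, rank, sv, rcond = np.polyfit(x, y, deg, full=True)
V = np.vander(x, deg + 1)                      # design (Vandermonde) matrix of the fit
scale = np.sqrt((V * V).sum(axis=0))           # column scaling, as polyfit does internally
Vs = V / scale
base = np.linalg.inv(np.dot(Vs.T, Vs)) / np.outer(scale, scale)  # inv(X^T X), unscaled covariance
dof = len(x) - (deg + 1)                       # textbook degrees of freedom: n - p = 2
print(base * resids[0] / dof)                  # sigma^2 * inv(X^T X): finite, unlike polyfit's output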
The difference should be in the degrees of freedom. The polyfit method already takes into account that your degree is 2, thus causing:
RuntimeWarning: divide by zero encountered in true_divide
fac = resids / (len(x) - order - 2.0)
You can pass np.cov a ddof= keyword (ddof = delta degrees of freedom) and you'll run into the same problem.
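A minimal sketch of that np.cov behaviour (my own illustration, pushing ddof up to the sample size so that the N - ddof denominator hits zero):

import numpy as np

x = np.array([-449., -454., -459., -464., -469.])
y = np.array([0.9677024, 0.97341953, 0.97724978, 0.98215678, 0.9876293])

# With ddof equal to the number of samples, the (N - ddof) denominator is zero,
# so np.cov warns about the division and the result fills with inf (and possibly nan),
# just like polyfit's (len(x) - order - 2.0) does in the question.
print(np.cov(x, y, ddof=len(x)))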