numpy.polyfit gives useful fit, but infinite covariance matrix - python

I am trying to fit a polynomial to a set of data. Sometimes the covariance matrix returned by numpy.polyfit consists only of inf, although the fit itself seems to be useful. There are no numpy.inf or numpy.nan values in the data!
Example:
import numpy as np
# sample data, does not contain really x**2-like behaviour,
# but that should be visible in the fit results
x = [-449., -454., -459., -464., -469.]
y = [ 0.9677024, 0.97341953, 0.97724978, 0.98215678, 0.9876293]
fit, cov = np.polyfit(x, y, 2, cov=True)
print 'fit: ', fit
print 'cov: ', cov
Result:
fit: [ 1.67867158e-06 5.69199547e-04 8.85146009e-01]
cov: [[ inf inf inf]
[ inf inf inf]
[ inf inf inf]]
np.cov(x,y) gives
[[ 6.25000000e+01 -6.07388099e-02]
[ -6.07388099e-02 5.92268942e-05]]
So np.cov is not the same as the covariance returned from np.polyfit. Does anybody have an idea what's going on?
EDIT:
I now understand that numpy.cov is not what I want. I need the variances of the polynomial coefficients, but I don't get them if (len(x) - order - 2.0) == 0. Is there another way to get the variances of the fitted polynomial coefficients?

As rustil's answer says, this is caused by the bias correction applied to the denominator of the covariance equation, which results in a zero divide for this input. The reasoning behind this correction is similar to that behind Bessel's Correction. This is really a sign that there are too few datapoints to estimate covariance in a well-defined way.
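To make the zero divide concrete, here is a minimal sketch of the denominator arithmetic. It uses the expression quoted in the question, len(x) - order - 2.0, and assumes order is the number of coefficients, deg + 1. The helper name polyfit_cov_denominator is made up for illustration and is not part of NumPy.

```python
def polyfit_cov_denominator(n_points, deg, correction=2.0):
    """Denominator used to scale the covariance, as quoted in the question."""
    order = deg + 1  # number of polynomial coefficients
    return n_points - order - correction


# Five points, quadratic fit: 5 - 3 - 2 = 0, hence the division by zero.
print(polyfit_cov_denominator(5, 2))
# A sixth point, or a correction of 1.0 instead of 2.0, gives a nonzero value.
print(polyfit_cov_denominator(6, 2))
print(polyfit_cov_denominator(5, 2, correction=1.0))
```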
How to skirt this problem? Well, this version of polyfit accepts weights. You could add another datapoint but weight it at epsilon. This is equivalent to reducing the 2.0 in this formula to a 1.0.
import sys
import numpy as np
x = [-449., -454., -459., -464., -469.]
y = [ 0.9677024, 0.97341953, 0.97724978, 0.98215678, 0.9876293]
x_extra = x + x[-1:]
y_extra = y + y[-1:]
weights = [1.0, 1.0, 1.0, 1.0, 1.0, sys.float_info.epsilon]
fit, cov = np.polyfit(x, y, 2, cov=True)
fit_extra, cov_extra = np.polyfit(x_extra, y_extra, 2, w=weights, cov=True)
print fit == fit_extra
print cov_extra
The output. Note that the fit values are identical:
>>> print fit == fit_extra
[ True True True]
>>> print cov_extra
[[ 8.84481850e-11 8.11954338e-08 1.86299297e-05]
[ 8.11954338e-08 7.45405039e-05 1.71036963e-02]
[ 1.86299297e-05 1.71036963e-02 3.92469307e+00]]
I am very uncertain that this will be especially meaningful, but it's a way to work around the problem. It's a bit of a kludge though. For something more robust, you could modify polyfit to accept its own ddof parameter, perhaps in lieu of the boolean that cov currently accepts. (I just opened an issue to suggest as much.)
A quick final note about the calculation of cov: If you look at the wikipedia page on least squares regression, you'll see that the simplified formula for the covariance of the coefficients is inv(dot(dot(X.T, W), X)), which has a corresponding line in the numpy code -- at least roughly speaking. In this case, X is the Vandermonde matrix, and the weights have already been multiplied in. The numpy code also does some scaling (which I understand; it's part of a strategy to minimize numerical error) and multiplies the result by the norm of the residuals (which I don't understand; I can only guess that it's part of another version of the covariance formula).
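As a rough illustration of that formula, here is a sketch that computes the coefficient covariance by hand with an explicit ddof. It follows the textbook estimate sigma^2 * inv(X.T X) with sigma^2 = (sum of squared residuals) / (n - ddof), and it is not claimed to reproduce NumPy's internal scaling for any particular version. The function name polyfit_cov and its ddof parameter are hypothetical.

```python
import numpy as np


def polyfit_cov(x, y, deg, ddof=None):
    """Unweighted polynomial fit plus coefficient covariance (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n_params = deg + 1
    if ddof is None:
        ddof = n_params              # assumption: one degree of freedom per coefficient
    dof = len(x) - ddof
    if dof <= 0:
        raise ValueError("need more data points than ddof")
    X = np.vander(x, n_params)       # columns x**deg, ..., x, 1
    scale = np.sqrt((X * X).sum(axis=0))
    Xs = X / scale                   # column scaling to limit numerical error
    coef_scaled = np.linalg.lstsq(Xs, y, rcond=None)[0]
    coef = coef_scaled / scale
    resid = y - X @ coef
    sigma2 = resid @ resid / dof     # residual variance estimate
    cov = np.linalg.inv(Xs.T @ Xs) / np.outer(scale, scale) * sigma2
    return coef, cov


x = [-449., -454., -459., -464., -469.]
y = [0.9677024, 0.97341953, 0.97724978, 0.98215678, 0.9876293]
coef, cov = polyfit_cov(x, y, 2)
print(coef)
print(np.diag(cov))                  # variances of the three coefficients
```

With five points and three coefficients this leaves two degrees of freedom, so the estimate is finite but still rests on very little data.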

The difference should be in the degrees of freedom. The polyfit method already takes into account that your degree is 2, which causes:
RuntimeWarning: divide by zero encountered in true_divide
fac = resids / (len(x) - order - 2.0)
You can pass np.cov a ddof= keyword (ddof = delta degrees of freedom) and you'll run into the same problem.

Related

scipy curve_fit incorrect for large X values

To determine trends over time, I use scipy curve_fit with X values from time.time(), for example 1663847528.7147126 (1.6 billion).
Doing a linear interpolation sometimes creates erroneous results, and providing approximate initial p0 values doesn't help. I found the magnitude of X to be a crucial element for this error and I wonder why?
Here is a simple snippet that shows working and non-working X offset:
import scipy.optimize
def fit_func(x, a, b):
    return a + b * x
y = list(range(5))
x = [1e8 + a for a in range(5)]
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0]))
# Result is correct:
# (array([-1.e+08, 1.e+00]), array([[ 0., -0.],
# [-0., 0.]]))
x = [1e9 + a for a in range(5)]
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0.0]))
# Result is not correct:
# OptimizeWarning: Covariance of the parameters could not be estimated
# warnings.warn('Covariance of the parameters could not be estimated',
# (array([-4.53788811e+08, 4.53788812e-01]), array([[inf, inf],
# [inf, inf]]))
An almost perfect p0 for b removes the warning, but curve_fit still doesn't work:
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0.99]))
# Result is not correct:
# (array([-7.60846335e+10, 7.60846334e+01]), array([[-1.97051972e+19, 1.97051970e+10],
# [ 1.97051970e+10, -1.97051968e+01]]))
# ...but perfect p0 works
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 1.0]))
#(array([-1.e+09, 1.e+00]), array([[inf, inf],
# [inf, inf]]))
As a side question, perhaps there's a more efficient method for a linear fit? Sometimes I want to find the second-order polynomial fit, though.
Tested with Python 3.9.6 and SciPy 1.7.1 under Windows 10.
Root cause
You are facing two problems:
Fitting procedures are scale sensitive. The units chosen for a specific variable (e.g. µA instead of kA) can artificially prevent an algorithm from converging properly (e.g. one variable is several orders of magnitude bigger than another and dominates the regression);
Floating-point arithmetic error. When switching from 1e8 to 1e9 you hit the magnitude at which this kind of error becomes predominant.
The second one is very important to realize. Say you are limited to a representation with 8 significant digits; then 1 000 000 000 and 1 000 000 001 are the same number, because both are written as 1.0000000e9 and we cannot accurately represent 1.0000000_e9, which requires one more digit (_). This is why your second example fails.
Additionally, you are using a nonlinear least squares algorithm to solve a linear least squares problem, and this is also somewhat related to your problem.
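The limited-digits thought experiment can be reproduced with a narrower float type. In this small sketch float32 (roughly 7 significant decimal digits) is only a stand-in for the "8 significant digits" machine described above; the data in the question is float64, which still tells the two values apart.

```python
import numpy as np

a, b = 1e9, 1e9 + 1

# float32 carries roughly 7 significant decimal digits: the two values collide.
print(np.float32(a) == np.float32(b))
# float64 carries roughly 15 to 16 digits: the two values stay distinct.
print(np.float64(a) == np.float64(b))
# Gap between adjacent representable numbers near 1e9 in each format.
print(np.spacing(np.float32(a)), np.spacing(np.float64(a)))
```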
You have three solutions:
Normalize;
Normalize and change the methodology/algorithm;
Increase the machine precision.
I'll choose the first one as it is more generic; the second one has been proposed by @blunova and totally makes sense; the last one is probably an inherent limitation.
Normalization
To mitigate both problems, a common solution is normalization. In your case a simple standardization is enough:
import numpy as np
import scipy.optimize
y = np.arange(5)
x = 1e9 + y
def fit_func(x, a, b):
    return a + b * x
xm = np.mean(x) # 1000000002.0
xs = np.std(x) # 1.4142135623730951
result = scipy.optimize.curve_fit(fit_func, (x - xm)/xs, y)
# (array([2. , 1.41421356]),
# array([[0., 0.],
# [0., 0.]]))
# Back transformation (here a is the slope and b the intercept of y = x*a + b):
a = result[0][1]/xs # 1.0
b = result[0][0] - xm*result[0][1]/xs # -1000000000.0
Or the same result using sklearn interface:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LinearRegression
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression())
])
pipe.fit(x.reshape(-1, 1), y)
pipe.named_steps["scaler"].mean_ # array([1.e+09])
pipe.named_steps["scaler"].scale_ # array([1.41421356])
pipe.named_steps["regressor"].coef_ # array([1.41421356])
pipe.named_steps["regressor"].intercept_ # 2.0
Back transformation
Indeed, when normalizing, the fit result is expressed in terms of the normalized variable. To get the required fit parameters, you just need a bit of math to convert the regressed parameters back to the original variable scale.
Simply write down and solve the transformation:
y = x'*a' + b'
x' = (x - m)/s
y = x*a + b
Which gives you the following solution:
a = a'/s
b = b' - m/s*a'
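The back transformation can be wrapped in a small helper. This is a sketch under the same conventions as the formulas above (a' and b' are the slope and intercept fitted on x' = (x - m)/s); the name unscale_linear is invented here, and np.polyfit is swapped in for curve_fit purely to keep the example short.

```python
import numpy as np


def unscale_linear(a_scaled, b_scaled, m, s):
    """Map slope/intercept fitted on x' = (x - m)/s back to the original x."""
    a = a_scaled / s
    b = b_scaled - m / s * a_scaled
    return a, b


y = np.arange(5.0)
x = 1e9 + y
m, s = x.mean(), x.std()
# Degree-1 polyfit returns (slope, intercept) for the standardized variable.
a_scaled, b_scaled = np.polyfit((x - m) / s, y, 1)
a, b = unscale_linear(a_scaled, b_scaled, m, s)
print(a, b)
```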
Precision addendum
NumPy's default float type is float64, as you expected, and it has about 15 significant digits:
x.dtype # dtype('float64')
np.finfo(np.float64).precision # 15
But scipy.optimize.curve_fit relies on least-squares solvers such as scipy.optimize.least_squares, which use a squared metric to drive the optimization.
Without digging into the details, I suspect this is where the problem happens: when dealing with values that are all close to 1e9, you reach the threshold where floating-point arithmetic error becomes predominant.
So the 1e9 threshold you have hit is not related to distinguishing the values of your variable x (float64 has sufficient precision to tell them apart) but to how they are used when solving:
minimize F(x) = 0.5 * sum(rho(f_i(x)**2), i = 0, ..., m - 1)
subject to lb <= x <= ub
You can also check that in its signature, tolerances are about 8 decades wide:
scipy.optimize.least_squares(fun, x0, jac='2-point', bounds=(- inf, inf),
method='trf', ftol=1e-08, xtol=1e-08, gtol=1e-08, x_scale=1.0,
loss='linear', f_scale=1.0, diff_step=None, tr_solver=None,
tr_options={}, jac_sparsity=None, max_nfev=None, verbose=0,
args=(), kwargs={})
This may let you tweak the algorithm to take extra steps before convergence is reached (if it converges at all), but that will not replace or beat the usefulness of normalization.
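To illustrate the kind of rounding suspected here, the sketch below squares a value near 1e9 in float64. It only shows that products of such values land near 1e18, where adjacent floats are far apart; it is not a diagnosis of what curve_fit does internally.

```python
import numpy as np

x = 1e9 + 1.0
# Near 1e9, adjacent float64 values are about 1.2e-7 apart: x keeps its "+ 1".
print(x - 1e9)
# Near 1e18, adjacent float64 values are 128 apart.
print(np.spacing(1e18))
# The exact square is 1e18 + 2e9 + 1; the trailing 1 is rounded away.
print(x * x == 1e18 + 2e9)
```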
Methods comparison
What is interesting about the scipy.stats.linregress method is its scale tolerance, which is handled by design. The method uses variable normalization, pure linear algebra and a numerical stability trick (see the TINY variable) to solve the LS problem even in problematic conditions.
This of course contrasts with the scipy.optimize.curve_fit method, which is an NLLS solver implemented as an optimized gradient descent algorithm (see the Levenberg–Marquardt algorithm).
If you stick with linear least squares problems (linear in terms of parameters, not variables, so a second-order polynomial is LLS), then LLS is probably a simpler option to choose, as it handles normalization for you.
If you just need to compute a linear fit, I believe curve_fit is not necessary and I would just use the linregress function from SciPy instead:
>>> from scipy import stats
>>> y = list(range(5))
>>> x = [1e8 + a for a in range(5)]
>>> stats.linregress(x, y)
LinregressResult(slope=1.0, intercept=-100000000.0, rvalue=1.0, pvalue=1.2004217548761408e-30, stderr=0.0, intercept_stderr=0.0)
>>> x2 = [1e9 + a for a in range(5)]
>>> stats.linregress(x2, y)
LinregressResult(slope=1.0, intercept=-1000000000.0, rvalue=1.0, pvalue=1.2004217548761408e-30, stderr=0.0, intercept_stderr=0.0)
In general, if you need a polynomial fit I would use NumPy polyfit.
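For the second-order case mentioned in the question, one option is numpy.polynomial.Polynomial.fit, which by default maps the x values into the window [-1, 1] before solving and so takes care of the scaling. The data below is made up for illustration; evaluating the fitted object p directly avoids converting the coefficients back to the raw 1e9 scale.

```python
import numpy as np
from numpy.polynomial import Polynomial

t = np.arange(5.0)
x = 1e9 + t                      # large offsets, as in the question
y = 3.0 + 2.0 * t + 0.5 * t**2   # made-up quadratic data

# The fit is carried out on x mapped from p.domain into [-1, 1].
p = Polynomial.fit(x, y, deg=2)
print(p.domain)
# Largest absolute deviation between the fitted polynomial and the data.
print(np.max(np.abs(p(x) - y)))
```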

Minimizing scipy.stats.multivariate_normal.logpdf with respect to covariance

I have a python script where I compute the value of a normal log-likelihood function for a sample of bivariate data using scipy's multivariate_normal.logpdf. I am assuming the values of the sample means and variances, leaving only the sample correlation between the variables as the unknown,
from scipy.stats import multivariate_normal
from scipy.optimize import minimize
VAR_X = 0.4
VAR_Y = 0.32
MEAN_X = 1
MEAN_Y = 1.2
def log_likelihood_function(x, data):
    log_likelihood = 0
    sigma = [ [VAR_X, x[0]], [x[0], VAR_Y] ]
    mu = [ MEAN_X, MEAN_Y ]
    for point in data:
        log_likelihood += multivariate_normal.logpdf(x=point, mean=mu, cov=sigma)
    return log_likelihood
if __name__ == "__main__":
    some_data = [ [1.1, 2.0], [1.2, 1.9], [0.8, 0.2], [0.7, 1.3] ]
    guess = [ 0 ]
    # maximize log-likelihood by minimizing the negative
    likelihood = lambda x: (-1)*log_likelihood_function(x, some_data)
    result = minimize(fun = likelihood, x0 = guess, options = {'disp': True}, method="SLSQP")
    print(result)
No matter what I set as my guess, this script reliably throws a ValueError,
ValueError: the input matrix must be positive semidefinite
Now, the problem, by my estimation, seems to be scipy.optimize.minimize is guessing values that create a covariance matrix that is not positive definite. So I need a way to make sure the minimization algorithm throws away values that are outside the domain of the problem. I thought to add a constraint to the minimize call,
## make the determinant always positive
def positive_definite_constraint(x):
    return VAR_X*VAR_Y - x*x
Which is basically Sylvester's criterion for the covariance matrix and would ensure the matrix is positive definite (since we know the variance is always positive, that condition doesn't need to be checked). But it seems like scipy.optimize.minimize evaluates the objective function before it determines if the constraints are satisfied (which seems like a design flaw; wouldn't it be faster to search for a solution in a restricted domain, instead of searching all possible solutions and then determining if the constraints are satisfied? I might be mistaken about the order of evaluation, though.)
I am not sure how to proceed. I realize I am stretching the purpose of scipy.optimize here a bit by parameterizing the covariance matrix and then minimizing with respect to that parameterization, and I know there are better ways to calculate the correlation for a normal sample, but I am interested in this problem because of its generalization to distributions that are not normal.
Any suggestions? Is there a better way to solve this problem?
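You are on the right track. Note that your definiteness constraint reduces to a simple bound on the optimization variable: VAR_X*VAR_Y - x[0]**2 > 0 is the same as |x[0]| < sqrt(VAR_X*VAR_Y). Variable bounds are better handled internally than the more general constraints. The run below uses the one-sided bound -∞ <= x[0] <= VAR_X*VAR_Y, which is tighter than necessary on the upper side and leaves the lower side open, but it was enough for SLSQP to succeed on this data: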
bounds = [(None, VAR_X*VAR_Y)]
res = minimize(fun = likelihood, x0 = guess, bounds=bounds, options = {'disp': True}, method="SLSQP")
This gives me:
fun: 6.610504611834715
jac: array([-0.0063166])
message: 'Optimization terminated successfully'
nfev: 9
nit: 4
njev: 4
status: 0
success: True
x: array([0.12090069])
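As a sanity check on the admissible range for x[0], the sketch below tests positive definiteness numerically with NumPy's eigvalsh for a few off-diagonal values. The helper is_positive_definite and the symmetric_bounds list are illustrative assumptions rather than part of the answer; since the matrix is singular exactly at the limits, shrinking the interval slightly may be safer in practice.

```python
import numpy as np

VAR_X = 0.4
VAR_Y = 0.32


def is_positive_definite(c):
    """True if [[VAR_X, c], [c, VAR_Y]] has only positive eigenvalues."""
    sigma = np.array([[VAR_X, c], [c, VAR_Y]])
    return bool(np.all(np.linalg.eigvalsh(sigma) > 0))


limit = np.sqrt(VAR_X * VAR_Y)             # about 0.358
print(is_positive_definite(0.12))          # close to the reported optimum
print(is_positive_definite(0.9 * limit))   # inside the interval
print(is_positive_definite(-1.1 * limit))  # outside, on the lower side
# Hypothetical symmetric alternative for the `bounds` argument of minimize.
symmetric_bounds = [(-limit, limit)]
```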

Why isn't `curve_fit` able to estimate the covariance of the parameter if the parameter fits exactly?

I don't understand why curve_fit isn't able to estimate the covariance of the parameter, thus raising the OptimizeWarning below. The following MCVE explains my problem:
MCVE python snippet
from scipy.optimize import curve_fit
func = lambda x, a: a * x
popt, pcov = curve_fit(f = func, xdata = [1], ydata = [1])
print(popt, pcov)
Output
\python-3.4.4\lib\site-packages\scipy\optimize\minpack.py:715:
OptimizeWarning: Covariance of the parameters could not be estimated
category=OptimizeWarning)
[ 1.] [[ inf]]
For a = 1 the function fits xdata and ydata exactly. Why isn't the error/variance 0, or something close to 0, but inf instead?
There is this quote from the curve_fit SciPy Reference Guide:
If the Jacobian matrix at the solution doesn’t have a full rank, then ‘lm’ method returns a matrix filled with np.inf, on the other hand ‘trf’ and ‘dogbox’ methods use Moore-Penrose pseudoinverse to compute the covariance matrix.
So, what's the underlying problem? Why doesn't the Jacobian matrix at the solution have a full rank?
The formula for the covariance of the parameters (Wikipedia) has the number of degrees of freedom in the denominator. The degrees of freedom are computed as (number of data points) - (number of parameters), which is 1 - 1 = 0 in your example. And this is where SciPy checks the number of degrees of freedom before dividing by it.
With xdata = [1, 2], ydata = [1, 2] you would get zero covariance (note that the model still fits exactly: exact fit is not the problem).
This is the same sort of issue as the sample variance being undefined if the sample size N is 1 (the formula for sample variance has (N-1) in the denominator). If we only took a sample of size 1 from the population, we don't estimate the variance as zero; we know nothing about the variance.
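The analogy can be checked with the standard library alone. In this small sketch, degrees_of_freedom and sample_variance_or_none are made-up helper names; statistics.variance is the (N-1) sample variance and refuses a single data point instead of returning zero.

```python
import statistics


def degrees_of_freedom(n_points, n_params):
    """(number of data points) - (number of parameters)."""
    return n_points - n_params


def sample_variance_or_none(sample):
    """Sample variance with (N-1) in the denominator, or None if undefined."""
    try:
        return statistics.variance(sample)
    except statistics.StatisticsError:
        return None


print(degrees_of_freedom(1, 1))             # the MCVE: one point, one parameter
print(degrees_of_freedom(2, 1))             # xdata = [1, 2]: one degree left
print(sample_variance_or_none([1.0]))       # undefined for N = 1
print(sample_variance_or_none([1.0, 2.0]))
```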

Covariance in Python with iminuit

I have to calculate the covariance between 2 parameters from a fit function. I found a Python package called iminuit that does a good fit and also calculates the covariance matrix of the parameters. I tested the package on a simple function. This is the code:
from iminuit import Minuit, describe, Struct
def func(x,y):
    f=x**2+y**2
    return f
m = Minuit(func,pedantic=False,print_level=0)
m.migrad()
print("Covariance:")
print(m.matrix())
and this is the output:
Covariance:
((1.0, 0.0),
(0.0, 1.0))
However, if I replace x^2+y^2 with (x-y)^2 I obtain
Covariance:
((250.24975024975475, 249.75024975025426),
(249.75024975025426, 250.24975024975475))
I am confused about why I get a covariance bigger than 1 (I am not good at statistics, but from what I understood it has to be between -1 and 1). Can someone who knows iminuit help me? Also, in the first case, what does the matrix mean? Why is there 0 correlation between x and y, and what does the 1 on the diagonal mean?
You are confusing covariance with correlation. Correlation is the normalised version of the covariance, which is indeed always between -1 and 1.
To obtain the correlation from the covariance matrix, calculate:
correlation = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
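The same normalisation can be applied to the whole matrix at once. The following sketch uses the covariance values posted in the question; cov_to_corr is a helper name introduced here, not an iminuit function.

```python
import numpy as np


def cov_to_corr(cov):
    """Divide each covariance entry by the product of the two standard deviations."""
    cov = np.asarray(cov, dtype=float)
    sigma = np.sqrt(np.diag(cov))
    return cov / np.outer(sigma, sigma)


# Covariance matrix reported in the question for (x - y)**2.
cov = [[250.24975024975475, 249.75024975025426],
       [249.75024975025426, 250.24975024975475]]
corr = cov_to_corr(cov)
print(corr)
```

The off-diagonal entry comes out just below 1, meaning the two parameters are almost perfectly correlated, even though the covariance itself is far larger than 1.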

Need Python polynomial fit function that returns covariance

I want to do least-squares polynomial fits on data sets (X,Y,Yerr) and obtain the covariance matrices of the fit parameters. Also, since I have many data sets, CPU-time is an issue, so I'm seeking an analytical (=fast) solution. I found the following (non-ideal) options:
numpy.polyfit does the fit, but doesn't take into account the errors Yerr, nor does it return the covariance;
numpy.polynomial.polynomial.polyfit does accept Yerr as an input (in the form of weights), but doesn't return covariance either;
scipy.optimize.curve_fit and scipy.optimize.leastsq can be tailored to fit polynomials and return the covariance matrix, but - being iterative methods - these are much slower than the polyfit routines (which yield an analytical solution);
Does Python provide an analytical polynomial fit routine that returns the covariance of the fit parameters (or do I have to write one myself :-) ?
Update:
It appears that in NumPy 1.7.0, numpy.polyfit not only accepts weights, but also returns the covariance matrix of the coefficients ... So, issue resolved! :-)
You want a fast weighted least squares model that returns the covariance matrix without additional overhead? In general, the right covariance matrix will depend on the data generating process (DGP), because different DGPs (say, heteroscedasticity of errors) imply different distributions of parameter estimates (think White vs. OLS standard errors). But if you can assume WLS is the right way to do it, I believe you would use the asymptotic variance estimate for beta for WLS, (1/n X'V^-1X)^-1, where V is the weighting matrix created from Yerr. That's a pretty simple formula if numpy.polynomial.polynomial.polyfit is working for you.
I looked for an online reference but couldn't find one. But see Fumio Hayashi's Econometrics, 2000, Princeton University Press, pp. 133-137, for a derivation and discussion.
Update 12/4/12:
There is another stack overflow question that comes close:
numpy.polyfit has no keyword 'cov' that has a nice explanation (with code) of how to use scikits.statsmodels to do what you want. I believe you'll want to replace the line:
result = sm.OLS(Y,reg_x_data).fit()
with
result = sm.WLS(Y,reg_x_data, weights).fit()
Where you define weights as a function of Yerr as before with numpy.polynomial.polynomial.polyfit. More details on using statsmodels with WLS over at the statsmodels website.
Here it is using scipy.linalg.lstsq
import numpy as np,numpy.random, scipy.linalg
#generate the test data
N = 100
xs = np.random.uniform(size=N)
errs = np.random.uniform(0, 0.1, size=N) # errors
ys = 1 + 2 * xs + 3 * xs ** 2 + errs * np.random.normal(size=N)
# do the fit
polydeg = 2
A = np.vstack([1 / errs] + [xs ** _ / errs for _ in range(1, polydeg + 1)]).T
result = scipy.linalg.lstsq(A, (ys / errs))[0]
covar = np.matrix(np.dot(A.T, A)).I
print result, '\n', covar
>> [ 0.99991811 2.00009834 3.00195187]
[[ 4.82718910e-07 -2.82097554e-06 3.80331414e-06]
[ -2.82097554e-06 1.77361434e-05 -2.60150367e-05]
[ 3.80331414e-06 -2.60150367e-05 4.22541049e-05]]
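For comparison, numpy.polyfit can produce the same kind of covariance directly, as the update to the question notes. This is a sketch that assumes a NumPy version supporting cov="unscaled" (meant for weights w = 1/sigma with trusted sigma) and np.random.default_rng; the test data is made up. Note that polyfit orders coefficients from highest power to lowest, the reverse of the design matrix used in the lstsq answer.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
xs = rng.uniform(size=N)
errs = rng.uniform(0.05, 0.1, size=N)           # known 1-sigma errors on y
ys = 1 + 2 * xs + 3 * xs**2 + errs * rng.normal(size=N)

# Weighted fit; "unscaled" keeps the covariance as inv(A.T A) without rescaling.
coef, cov = np.polyfit(xs, ys, 2, w=1 / errs, cov="unscaled")

# Manual cross-check with the error-weighted Vandermonde matrix.
A = np.vander(xs, 3) / errs[:, None]
cov_manual = np.linalg.inv(A.T @ A)
print(coef)
print(np.allclose(cov, cov_manual))
```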
