scipy curve_fit incorrect for large X values - python

To determine trends over time, I use scipy curve_fit with X values from time.time(), for example 1663847528.7147126 (1.6 billion).
Doing a linear fit sometimes produces erroneous results, and providing approximate initial p0 values doesn't help. I found the magnitude of X to be a crucial element in this error, and I wonder why.
Here is a simple snippet that shows working and non-working X offset:
import scipy.optimize
def fit_func(x, a, b):
    return a + b * x
y = list(range(5))
x = [1e8 + a for a in range(5)]
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0]))
# Result is correct:
# (array([-1.e+08, 1.e+00]), array([[ 0., -0.],
# [-0., 0.]]))
x = [1e9 + a for a in range(5)]
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0.0]))
# Result is not correct:
# OptimizeWarning: Covariance of the parameters could not be estimated
# warnings.warn('Covariance of the parameters could not be estimated',
# (array([-4.53788811e+08, 4.53788812e-01]), array([[inf, inf],
# [inf, inf]]))
An almost perfect p0 for b removes the warning, but curve_fit still doesn't work:
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0.99]))
# Result is not correct:
# (array([-7.60846335e+10, 7.60846334e+01]), array([[-1.97051972e+19, 1.97051970e+10],
# [ 1.97051970e+10, -1.97051968e+01]]))
# ...but perfect p0 works
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 1.0]))
#(array([-1.e+09, 1.e+00]), array([[inf, inf],
# [inf, inf]]))
As a side question, perhaps there's a more efficient method for a linear fit? Sometimes I want to find the second-order polynomial fit, though.
Tested with Python 3.9.6 and SciPy 1.7.1 under Windows 10.

Root cause
You are facing two problems:
Fitting procedures are scale sensitive. This means that the units chosen for a given variable (e.g. µA instead of kA) can artificially prevent an algorithm from converging properly (e.g. one variable is several orders of magnitude bigger than another and dominates the regression);
Float arithmetic error. When switching from 1e8 to 1e9 you simply hit the magnitude where this kind of error becomes predominant.
The second one is very important to realize. Let's say you are limited to an 8-significant-digit representation; then 1 000 000 000 and 1 000 000 001 are the same number, as both collapse to 1.0000000e9, and we cannot accurately represent 1.000000001e9, which requires more digits. This is why your second example fails.
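You can see the effect of a limited number of significant digits directly with float32, which carries only about 7 of them (a small illustration; your data is float64, where the same kind of loss shows up at a larger magnitude):
import numpy as np

# float32 keeps ~7 significant digits: adding 1 to 1e9 is lost entirely
print(np.float32(1e9) + np.float32(1) == np.float32(1e9))   # True
# float64 keeps ~15-16 digits, so 1e9 + 1 is still represented exactly
print(np.float64(1e9) + np.float64(1) == np.float64(1e9))   # False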
Additionally, you are using a Non-Linear Least Squares algorithm to solve a Linear Least Squares problem, which is also related to your problem.
You have three solutions:
Normalize;
Normalize and change the methodology/algorithm;
Increase the machine precision.
I'll develop the first one as it is more generic; the second one has been proposed by @blunova and totally makes sense; the latter probably just pushes against an inherent limitation.
Normalization
To mitigate both problems, a common solution is normalization. In your case a simple standardization is enough:
import numpy as np
import scipy.optimize
y = np.arange(5)
x = 1e9 + y
def fit_func(x, a, b):
    return a + b * x
xm = np.mean(x) # 1000000002.0
xs = np.std(x) # 1.4142135623730951
result = scipy.optimize.curve_fit(fit_func, (x - xm)/xs, y)
# (array([2. , 1.41421356]),
# array([[0., 0.],
# [0., 0.]]))
# Back transformation (a = slope, b = intercept in the original x scale):
a = result[0][1]/xs                     # 1.0
b = result[0][0] - xm*result[0][1]/xs   # -1000000000.0
Or the same result using sklearn interface:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression())
])
pipe.fit(x.reshape(-1, 1), y)
pipe.named_steps["scaler"].mean_ # array([1.e+09])
pipe.named_steps["scaler"].scale_ # array([1.41421356])
pipe.named_steps["regressor"].coef_ # array([1.41421356])
pipe.named_steps["regressor"].intercept_ # 2.0
Back transformation
Indeed, when normalizing, the fit result is expressed in terms of the normalized variable. To get the required fit parameters, you just need a bit of math to convert the regressed parameters back to the original variable scale.
Simply write down and solve the transformation:
y = x'*a' + b'
x' = (x - m)/s
y = x*a + b
Which gives you the following solution:
a = a'/s
b = b' - m/s*a'
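Applied to the sklearn pipeline above, the same formulas recover the parameters in the original scale (a small sketch reusing the fitted pipe):
m = pipe.named_steps["scaler"].mean_[0]
s = pipe.named_steps["scaler"].scale_[0]
a_prime = pipe.named_steps["regressor"].coef_[0]       # slope on the scaled x
b_prime = pipe.named_steps["regressor"].intercept_     # intercept on the scaled x

a = a_prime / s        # 1.0            -> slope in the original x scale
b = b_prime - m * a    # -1000000000.0  -> intercept in the original x scale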
Precision addendum
NumPy's default float type is float64, as you expected, and it has about 15 significant digits:
x.dtype # dtype('float64')
np.finfo(np.float64).precision # 15
But scipy.optimize.curve_fit relies on scipy.optimize.least_squares, which uses a squared metric to drive the optimization.
Without digging into the details, I suspect this is where the problem happens: when dealing with values that are all close to 1e9 you reach the threshold where float arithmetic error becomes predominant.
So the 1e9 threshold you have hit is not about distinguishing the numbers in your variable x (float64 has enough precision to keep them exactly distinct) but about how they are used when solving:
minimize F(x) = 0.5 * sum(rho(f_i(x)**2), i = 0, ..., m - 1)
subject to lb <= x <= ub
You can also check in its signature that the default tolerances are on the order of 1e-8:
scipy.optimize.least_squares(fun, x0, jac='2-point', bounds=(- inf, inf),
method='trf', ftol=1e-08, xtol=1e-08, gtol=1e-08, x_scale=1.0,
loss='linear', f_scale=1.0, diff_step=None, tr_solver=None,
tr_options={}, jac_sparsity=None, max_nfev=None, verbose=0,
args=(), kwargs={})
These may let you tweak the algorithm so it takes extra steps before convergence is declared (if it converges at all), but that will not replace or beat the usefulness of normalization.
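For example, curve_fit forwards extra keyword arguments to the underlying solver, so the tolerances can be tightened (a small sketch; in this example it may still not recover the exact parameters, which is precisely why normalization is preferred):
import scipy.optimize

# Extra keyword arguments are forwarded to the underlying least-squares solver
popt, pcov = scipy.optimize.curve_fit(
    fit_func, x, y, p0=[-x[0], 0.0],
    ftol=1e-15, xtol=1e-15, gtol=1e-15,
)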
Methods comparison
What is interesting with the scipy.stats.linregress method is that scale tolerance is handled by design. The method uses variable normalization and pure linear algebra, plus a numerical-stability trick (see the TINY variable in its source), to solve the LS problem even in problematic conditions.
This of course contrasts with scipy.optimize.curve_fit, which is an NLLS solver implemented as an iterative optimization algorithm (see the Levenberg–Marquardt algorithm).
If you stick with linear least squares problems (linear in terms of the parameters, not the variables, so a second-order polynomial is still LLS), then LLS is probably the simpler option to choose, as methods such as linregress handle normalization for you.

If you just need to compute a linear fit, I believe curve_fit is not necessary and I would just use SciPy's linregress function instead:
>>> from scipy import stats
>>> y = list(range(5))
>>> x = [1e8 + a for a in range(5)]
>>> stats.linregress(x, y)
LinregressResult(slope=1.0, intercept=-100000000.0, rvalue=1.0, pvalue=1.2004217548761408e-30, stderr=0.0, intercept_stderr=0.0)
>>> x2 = [1e9 + a for a in range(5)]
>>> stats.linregress(x2, y)
LinregressResult(slope=1.0, intercept=-1000000000.0, rvalue=1.0, pvalue=1.2004217548761408e-30, stderr=0.0, intercept_stderr=0.0)
In general, if you need a polynomial fit I would use NumPy polyfit.
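For instance, here is a minimal sketch of a second-order fit with np.polyfit; centering x first keeps the problem well conditioned when x is a large timestamp:
import numpy as np

x = np.array([1e9 + a for a in range(5)], dtype=float)
y = np.arange(5, dtype=float)

xm = x.mean()
coeffs = np.polyfit(x - xm, y, deg=2)   # coefficients, highest power first
y_hat = np.polyval(coeffs, x - xm)      # evaluate the fit at the original points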

Related

Constraining OLS (or WLS) coefficients using statsmodels

I have a regression of the form model = sm.GLM(y, X, w = weight), which ends up being a simple weighted OLS. (Note that specifying w as the error-weights array actually works in sm.GLM identically to sm.WLS, despite it not being in the documentation.)
I'm using GLM because this allows me to fit with some additional constraints using fit_constrained(). My X consists of 6 independent variables, 2 of which I want to constrain so that the resulting coefficients are positive. But I cannot seem to figure out the syntax to get fit_constrained() to work. The documentation is extremely bare and I cannot find any good examples anywhere. All I really need is the correct syntax for imposing these constraints. Thanks!
The function you mention is meant for linear constraints, that is, combinations of your coefficients that must fulfil some linear equalities; it is not meant for defining bounds.
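To illustrate the distinction, here is a hypothetical sketch of the kind of linear-equality syntax fit_constrained expects (the constraint string and the default column names 'x1'..'x6' are assumptions for illustration, not your model):
import numpy as np
import statsmodels.api as sm

np.random.seed(0)
X = sm.add_constant(np.random.uniform(0, 1, (30, 6)))   # columns named const, x1..x6
y = np.random.normal(0, 2, 30)

model = sm.GLM(y, X)
# A linear *equality* constraint, e.g. forcing two coefficients to sum to 1;
# a bound such as "x2 >= 0" cannot be expressed this way.
res = model.fit_constrained("x2 + x3 = 1")
print(res.params)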
The closest you can get is to use SciPy's least_squares and define the bounds. For example, we set up a dataset with 6 coefficients:
from scipy.optimize import least_squares
import numpy as np
np.random.seed(100)
x = np.random.uniform(0,1,(30,6))
y = np.random.normal(0,2,30)
The function basically does the matrix multiplication and returns the error:
def fun(b, x, y):
    return b[0] + np.matmul(x, b[1:]) - y
The first coefficient is the intercept. Let's say we require the 2nd and the 7th parameters (b[1] and b[6]) to always be non-negative:
res_lsq = least_squares(fun, [1, 1, 1, 1, 1, 1, 1], args=(x, y),
                        bounds=([-np.inf, 0, -np.inf, -np.inf, -np.inf, -np.inf, 0], +np.inf))
And we check the result:
res_lsq.x
array([-1.74342242e-01, 2.09521327e+00, -2.02132481e-01, 2.06247855e+00,
-3.65963504e+00, 6.52264332e-01, 5.33657765e-20])

Closed Form Ridge Regression

I am having trouble understanding the output of my function to implement multiple ridge regression. I am doing this from scratch in Python for the closed form of the method, which is shown below:
w = (X^T X + lambda * I)^(-1) X^T y
I have a training set X that is 100 rows x 10 columns and a vector y that is 100x1.
My attempt is as follows:
import numpy as np

def ridgeRegression(xMatrix, yVector, lambdaRange):
    wList = []

    for i in range(1, lambdaRange+1):
        lambVal = i

        # compute the inner values (X.T X + lambda I)
        xTranspose = np.transpose(xMatrix)
        xTx = xTranspose @ xMatrix
        lamb_I = lambVal * np.eye(xTx.shape[0])

        # invert inner, e.g. (inner)**(-1)
        inner_matInv = np.linalg.inv(xTx + lamb_I)

        # compute outer (X.T y)
        outer_xTy = np.dot(xTranspose, yVector)

        # multiply together
        w = inner_matInv @ outer_xTy
        wList.append(w)

    print(wList)
For testing, I am running it with the first 5 lambda values.
wList becomes 5 numpy.arrays each of length 10 (I'm assuming for the 10 coefficients).
Here is the first of those 5 arrays:
array([ 0.29686755, 1.48420319, 0.36388528, 0.70324668, -0.51604451,
2.39045735, 1.45295857, 2.21437745, 0.98222546, 0.86124358])
My question, and clarification:
Shouldn't there be 11 coefficients (1 for the y-intercept + 10 slopes)?
How do I get the Minimum Square Error from this computation?
What comes next if I wanted to plot this line?
I think I am just really confused as to what I'm looking at, since I'm still working on my linear algebra.
Thanks!
First, I would modify your ridge regression to look like the following:
import numpy as np

def ridgeRegression(X, y, lambdaRange):
    wList = []

    # Get normal form of `X`
    A = X.T @ X

    # Get Identity matrix
    I = np.eye(A.shape[0])

    # Get right hand side
    c = X.T @ y

    for lambVal in range(1, lambdaRange+1):
        # Set up equations Bw = c
        lamb_I = lambVal * I
        B = A + lamb_I

        # Solve for w
        w = np.linalg.solve(B, c)
        wList.append(w)

    return wList
Notice that I replaced your inv call, which computes the matrix inverse explicitly, with an implicit solve. This is much more numerically stable, which is an especially important consideration for these types of problems.
I've also taken the A = X.T @ X computation, the identity matrix I generation, and the right-hand side vector c = X.T @ y computation out of the loop; these don't change within the loop and are relatively expensive to compute.
As was pointed out by @qwr, the number of columns of X will determine the number of coefficients you have. You have not described your model, so it's not clear how the underlying domain, x, is structured into X.
Traditionally, one might use polynomial regression, in which case X is the Vandermonde Matrix. In that case, the first coefficient would be associated with the y-intercept. However, based on the context of your question, you seem to be interested in multivariate linear regression. In any case, the model needs to be clearly defined. Once it is, then the returned weights may be used to further analyze your data.
Typically, to make the notation more compact, the matrix X contains a column of ones for the intercept, so if you have p predictors, the matrix has dimensions n by p+1. See the Wikipedia article on linear regression for an example.
To compute in-sample MSE, use the definition for MSE: the average of squared residuals. To compute generalization error, you need cross-validation.
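For the in-sample MSE, a minimal sketch (assuming X and y are your training data and wList = ridgeRegression(X, y, 5) from the answer above):
import numpy as np

w = wList[0]                    # weights for the first lambda value
residuals = y - X @ w           # in-sample residuals
mse = np.mean(residuals ** 2)   # average of the squared residuals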
Also, you shouldn't take lambVal as an integer. It can be small (close to 0) if the aim is just to avoid numerical error when xTx is ill-conditioned.
I would advise you to use a logarithmic range instead of a linear one, starting from 0.001 and going up to 100 or more if you want. For instance, you can change your code to this:
powerMin = -3
powerMax = 3

for i in range(powerMin, powerMax):
    lambVal = 10**i
    print(lambVal)
Then you can try a smaller or a linear range once you have figured out the correct order of magnitude for lambVal from cross-validation on your data.
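Equivalently, np.logspace builds such a grid in one call (a small sketch; the endpoints are just an example):
import numpy as np

lambdas = np.logspace(-3, 2, num=6)   # 1e-3, 1e-2, ..., 1e2
for lambVal in lambdas:
    print(lambVal)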

Sigmoid function returns 1 for large positive inputs

I wrote the following function in Python to calculate the sigmoid function of a scalar, vector or matrix:
import numpy as np

def sigmoid(z):
    sig = 1.0 / (1.0 + np.exp(-z))
    return sig
For relatively large positive values of z, e^-z returns a value very close to zero, and hence the value of sig is rounded to 1. My final objective is to determine the cost function of the logistic regression algorithm. Since sigmoid returns exactly 1, log(1-1) returns 'nan'. How can I solve the problem in such a way that my sigmoid() function returns the proper value and does not round e^-z to 0?
When I normalized the input features using mean and standard deviation, it worked fine. But is there a way to make it work with larger values of z?
Also, I tried the same on Matlab and it worked fine without normalization.
>>> Z = np.array([[60, 100],[20, 80]])
>>> Z
array([[ 60, 100],
[ 20, 80]])
>>> np.exp(-Z)
array([[8.75651076e-27, 3.72007598e-44],
[2.06115362e-09, 1.80485139e-35]])
>>> 1.0/(1.0 + np.exp(-Z))
array([[1., 1.],
[1., 1.]])
I overcame this issue by wrapping the sigmoid function with np.minimum & np.maximum:
def sigmoid(x):
    sig = 1 / (1 + np.exp(-x))     # Define sigmoid function
    sig = np.minimum(sig, 0.9999)  # Set upper bound
    sig = np.maximum(sig, 0.0001)  # Set lower bound
    return sig
As a result my losses started looking like this (loss plot not reproduced here), but both training and test losses converged well and I got ~90% accuracy on a tiny dataset using logistic regression.
As already mentioned by jdehesa, your issue is with precision limits. You can read more here:
https://docs.python.org/2/tutorial/floatingpoint.html
You could try using the Decimal class, which lets you compute with higher (configurable) precision:
from decimal import Decimal
import numpy as np
import math
def sigmoid(z):
    sig = Decimal(1.0) / (Decimal(1.0) + Decimal(np.exp(-z)))
    return sig

math.log(Decimal(1) - sigmoid(60))
# -59.97257293350302
but this will only work up to a point (I see it already fails for 80). I did some reading and if you really need more precision, you can change the precision of the Decimal object by increasing the value of the following from the default of 28:
from decimal import *
getcontext().prec = 28
More detail here:
https://docs.python.org/2/library/decimal.html
For most cases though, the suggestion of manually handling small results with a logic check will probably be best.
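For example, here is a minimal sketch of that kind of logic check applied inside the cost function itself, so the sigmoid stays untouched and only the probabilities fed to the log are clipped:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(y, z, eps=1e-15):
    # Clip probabilities away from exactly 0 and 1 before taking logs
    p = np.clip(sigmoid(z), eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))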

Fitting a vector function with curve_fit in Scipy

I want to fit a function with vector output using Scipy's curve_fit (or something more appropriate if available). For example, consider the following function:
import numpy as np
def fmodel(x, a, b):
    return np.vstack([a*np.sin(b*x), a*x**2 - b*x, a*np.exp(b/x)])
Each component is a different function but they share the parameters I wish to fit. Ideally, I would do something like this:
x = np.linspace(1, 20, 50)
a = 0.1
b = 0.5
y = fmodel(x, a, b)
y_noisy = y + 0.2 * np.random.normal(size=y.shape)
from scipy.optimize import curve_fit
popt, pcov = curve_fit(f=fmodel, xdata=x, ydata=y_noisy, p0=[0.3, 0.1])
But curve_fit does not work with functions with vector output, and the error "Result from function call is not a proper array of floats." is thrown. What I did instead is flatten the output like this:
def fmodel_flat(x, a, b):
    return fmodel(x[0:len(x) // 3], a, b).flatten()
popt, pcov = curve_fit(f=fmodel_flat, xdata=np.tile(x, 3),
                       ydata=y_noisy.flatten(), p0=[0.3, 0.1])
and this works. If instead of a vector function I am actually fitting several functions with different inputs as well but which share model parameters, I can concatenate both input and output.
Is there a more appropriate way to fit vector function with Scipy or perhaps some additional module? A main consideration for me is efficiency - the actual functions to fit are much more complex and fitting can take some time, so if this use of curve_fit is mangled and is leading to excessive runtimes I would like to know what I should do instead.
If I can be so blunt as to recommend my own package symfit, I think it does precisely what you need. An example on fitting with shared parameters can be found in the docs.
Your specific problem stated above would become:
import numpy as np
from symfit import variables, parameters, Model, Fit, sin, exp

x, y_1, y_2, y_3 = variables('x, y_1, y_2, y_3')
a, b = parameters('a, b')
a.value = 0.3
b.value = 0.1

model = Model({
    y_1: a * sin(b * x),
    y_2: a * x**2 - b * x,
    y_3: a * exp(b / x),
})

xdata = np.linspace(1, 20, 50)
ydata = model(x=xdata, a=0.1, b=0.5)
y_noisy = ydata + 0.2 * np.random.normal(size=(len(model), len(xdata)))

fit = Fit(model, x=xdata, y_1=y_noisy[0], y_2=y_noisy[1], y_3=y_noisy[2])
fit_result = fit.execute()
Check out the docs for more!
I think what you're doing is perfectly fine from an efficiency stand point. I'll try to look at the implementation and come up with something more quantitative, but for the time being here is my reasoning.
What you're doing during curve fitting is optimizing the parameters (a,b) such that
res = sum_i |f(x_i; a,b)-y_i|^2
is minimal. By this I mean that you have data points (x_i,y_i) of arbitrary dimensionality, two parameters (a,b) and a fitting model that approximates the data at query points x_i.
The curve fitting algorithm starts from a starting (a,b) pair, puts this into a black box that computes the above square error, and tries to come up with a new (a',b') pair that produces a smaller error. My point is that the error above is really a black box for the fitting algorithm: the configurational space of the fitting is defined merely by the (a,b) parameters. If you imagine how you'd implement a simple curve fitting function, you could imagine that you try to do, say, a gradient descent, with the square error as cost function.
Now, it should be irrelevant to the fitting procedure how the black box computes the error. It's easy to see that the dimensionality of x_i is really irrelevant for scalar functions, since it doesn't matter if you have 1000 1d query points to fit for, or a 10x10x10 grid in 3d space. What matters is that you have 1000 points x_i for which you need to compute f(x_i) ~ y_i from the model.
The only subtlety that should further be noted is that in case of a vector-valued function, the calculation of the error is not trivial. In my opinion, it's fine to define the error at each x_i point using the 2-norm of the vector-valued function. But hey: in this case, the square error at point x_i is
|f(x_i; a,b)-y_i|^2 == sum_k (f(x_i; a,b)[k]-y_i[k])^2
which implies that the squared error for each component is accumulated. This means that what you're doing right now is exactly right: by replicating your x_i points and taking each component of the function into account individually, your squared error will contain exactly the 2-norm of the error at each point.
So my point is what you're doing is mathematically correct, and I don't expect any behaviour of the fitting procedure to depend on the way how multivariate/vector-valued functions are handled.
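If you end up doing this for several models, the flattening trick can be wrapped in a small generic helper (a sketch, not part of SciPy; fit_vector_model is a made-up name):
import numpy as np
from scipy.optimize import curve_fit

def fit_vector_model(fmodel, x, y, p0):
    """Fit a model returning an array of shape (n_components, len(x)) with curve_fit."""
    n_components = y.shape[0]

    def flat_model(x_tiled, *params):
        # curve_fit passes the tiled x; the model only needs one copy of it
        return fmodel(x_tiled[:len(x_tiled) // n_components], *params).ravel()

    return curve_fit(flat_model, np.tile(x, n_components), y.ravel(), p0=p0)

# Usage with the fmodel defined above:
# popt, pcov = fit_vector_model(fmodel, x, y_noisy, p0=[0.3, 0.1])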

Numpy and R give non-zero intercept in linear regression when x = y

I was testing some code which, among other things, runs a linear regression of the form y = m * x + b on some data. To keep things simple, I set my x and y data equal to each other, expecting the model to return one for the slope and zero for the intercept. However, that's not what I saw. Here's a super boiled-down example, taken mostly from the numpy docs:
>>> y = np.arange(5)
>>> x = np.arange(5)
>>> A = np.vstack([x, np.ones(5)]).T
>>> np.linalg.lstsq(A, y)
(array([ 1.00000000e+00, -8.51331872e-16]), array([ 7.50403936e-31]), 2, array([ 5.78859314, 1.22155205]))
>>> # ^slope ^intercept ^residuals ^rank ^singular values
Numpy finds the exact slope of the true line of best fit (one), but reports an intercept that, while very very small, is not zero. Additionally, even though the data can be perfectly modeled by a linear equation y = 1 * x + 0, because this exact equation is not found, numpy reports a tiny but non-zero residual value.
As a sanity check, I tried this out in R (my "native" language), and observed similar results:
> x <- c(0 : 4)
> y <- c(0 : 4)
> lm(y ~ x)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x
 -3.972e-16    1.000e+00
My question is, why and under what circumstances does this happen? Is it an artifact of looking for a model with a perfect fit, or is there always a tiny bit of noise added to regression output that we usually just don't see? In this case, the answer is almost certainly close enough to zero, so I'm mainly driven by academic curiosity. However, I also wonder if there are cases where this effect could be magnified to be nontrivial relative to the data.
I've probably revealed this by now, but I have basically no understanding of lower-level programming languages, and while I once had a cursory understanding of how to do this sort of linear algebra "by hand", it has long ago faded from my mind.
It looks like numerical error; the y-intercept is extremely small.
Python, NumPy included, uses double-precision floating-point numbers by default. These numbers have a 52-bit significand (see this for a floating-point explanation, and this for a scientific-notation explanation of the "base").
In your case, you found a y-intercept of ~4e-16. As it turns out, a 52-bit significand gives roughly 2e-16 relative accuracy. Basically, in the regression you subtracted a number on the order of 1 from something closely resembling itself, and hit the numerical precision of double-precision floating point.
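You can check that scale directly; the intercepts reported above (~-8.5e-16 from NumPy, ~-4e-16 from R) are just a few machine epsilons away from zero:
import numpy as np

# Relative spacing of double-precision floats around 1.0
print(np.finfo(np.float64).eps)   # 2.220446049250313e-16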
