I am trying to extrapolate future data points from a data set that contains one continuous value per day for almost 600 days. I am currently fitting a first-order polynomial to the data using numpy.polyfit and numpy.poly1d. In the graph below you can see the curve (blue) and the first-order fit (green). The x-axis is days since the beginning. I am looking for an effective way to model this curve in Python in order to extrapolate future data points as accurately as possible. A linear regression isn't accurate enough, and I'm unaware of any nonlinear regression methods that could work in this instance.
This solution isn't accurate enough. For example, this is what I feed in:
x = dfnew["days_since"]
y = dfnew["nonbrand"]
z = numpy.polyfit(x,y,1)
f = numpy.poly1d(z)
x_new = future_days
y_new = f(x_new)
plt.plot(x,y, '.', x_new, y_new, '-')
EDIT:
I have now tried scipy's curve_fit with a logarithmic function, since the curve and data behaviour seem to conform to it:
from scipy.optimize import curve_fit

def func(x, a, b):
    return a * numpy.log(x) + b

x = dfnew["days_since"]
y = dfnew["nonbrand"]
popt, pcov = curve_fit(func, x, y)
plt.plot(future_days, func(future_days, *popt), '-')
However when I plot it, my Y-values are way off:
The very general rule of thumb is that if your fitting function is not fitting your actual data well enough, then either:
You are using the function wrong, e.g. you are using 1st-order polynomials - so if you are convinced that it is a polynomial, then try higher-order polynomials.
You are using the wrong function. It is always worth taking a look at:
your data curve, and
what you know about the process that is generating the data
to come up with some speculation/theory/guesses about what sort of model might fit better.
Might your process be a logarithmic one, a saturating one, etc.? Try them (see the sketch after this list)!
Finally, if you are not getting a consistent long-term trend, then you might be able to justify using cubic splines.
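For example, a minimal sketch of trying a saturating model with scipy's curve_fit - the functional form here is only a guess, and x, y and future_days stand in for the arrays from the question:

import numpy as np
from scipy.optimize import curve_fit

def saturating(x, a, b, c):
    # rises towards c + a as x grows, with characteristic scale b
    return a * (1.0 - np.exp(-x / b)) + c

# x, y: observed days_since / value arrays; future_days: days to extrapolate to
popt, pcov = curve_fit(saturating, x, y, p0=(y.max() - y.min(), 100.0, y.min()))
y_future = saturating(future_days, *popt)

The same pattern works for the logarithmic and other candidate forms; whichever one tracks the data best is the one to extrapolate with.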
Related
I have an experimental dataset which plots intensity as a function of energy. These are arrays of 1800 datapoints.
I have been trying to fit a model to this data, given by the equation below:
Imodel = I0 * ((math.cos(phi) + beta*f1)**2 + (math.sin(phi) + beta*f2)**2) + Ioff
I have two other datasets, of f1 vs. energy and f2 vs. energy. These are arrays of 700 datapoints, albeit over the same energy range as the first dataset.
I want to use this model function together with the f1 and f2 data to find optimal values of the other 4 parameters (I0, phi, beta, Ioff) where this model function fits the experimental dataset exactly.
I have been looking into curve_fit and least_squares from the scipy.optimize package, as well as fitting packages such as lmfit and scikit-learn, but to no avail.
Can anyone help? Thanks.
Presently I have no representative data from Ayrtonb1 with which to test the method proposed below. The method seems sound on a theoretical basis, but one cannot be sure that it will be satisfactory with the OP's data.
Nevertheless, a preliminary test was carried out with "toy" data (shown below).
I suppose that the screen copy below is sufficient to understand the method and to reproduce the calculation with real data.
The result of this preliminary test is rather good:
LRMSE < 2 for a range up to 600 (Least Root Mean Square Error).
LRMSRE < 2% (Least Root Mean Square Relative Error).
Your data and formula look suspiciously like resonant (or anomalous) X-ray diffraction data, with measurements of scattered intensity going across the Zn K-edge. Although you do not say this, the discussion here will assume that. You say you have 1800 measurements, presumably as a function of X-ray energy in eV. The resonant scattering factors (f1, f2) you show seem to be idealized and possibly "typical", and perhaps not specifically for the Zn K-edge -- at the very least the energy scale shown is not the same as your data.
You will want to treat the data and model the intensity as a function of X-ray energy. And you will want realistic values for f1 and f2 for the element of interest, and at the actual energy points for your data. I recommend using xraydb (full disclosure: I am the lead author) [pip install xraydb], so that you can do
import numpy as np
import xraydb
#edata, idata = function_to_extract_your_data()
# or perhaps testing with
edata = np.linspace(9500, 10500, 501)
f1 = xraydb.f1_chantler('Zn', edata)
f2 = xraydb.f2_chantler('Zn', edata)
As written, your intensity function does not directly depend on energy, though it might at a later date, say to make that offset be linear in energy, not just a constant. You might write a function like:
def intensity(en, phi, beta, scale=1, slope=0, offset=0, f1=-1, f2=1):
    costerm = np.cos(phi) + beta*f1
    sinterm = np.sin(phi) + beta*f2
    return scale * (costerm**2 + sinterm**2) + slope*en + offset
with that you can start just trying out some values to get a feel for the function and how it compares to your data:
import matplotlib.pyplot as plt

beta = 0.025  # Wild Guess!
for phi in np.pi*np.arange(20)/10:
    plt.plot(edata, intensity(edata, phi, beta, f1=f1, f2=f2), label='%.1f' % phi)
plt.legend()
plt.show()
It kind of looks like your value for phi would be around 5.5 to 6 (or -0.8 to -0.3). You could also try different values of beta and plot that with your actual data.
With that model for intensity and a feel for what the range of parameters is, you could then try to fit your data. To do that, I would recommend using lmfit (full disclosure: I am the lead author) [pip install lmfit], where you can create a model from your intensity model function - this will use the names of the function arguments to name the fitting parameters.
from lmfit import Model
imodel = Model(intensity, independent_vars=['en', 'f1', 'f2'])
params = imodel.make_params(scale=1, offset=0, slope=0, beta=0.1, phi=5.5)
That independent_vars will tell Model to not make fitting Parameters for the function arguments f1 and f2 and to expect them to be passed into any evaluation or fit. The other function arguments (scale, phi, etc) will be expected to become fitting variables. You do have to create a "Parameters" object and must give initial values for all parameters. You can put bounds on these or fix them so they are not altered in the fit, as with
params['scale'].min = 0 # force scale to be positive
params['slope'].vary = False # slope will be fixed at 0.
You can then evaluate the model with
init_value = imodel.eval(params, en=edata, f1=f1, f2=f2)
And then eventually do a fit with
result = imodel.fit(idata, params, en=edata, f1=f1, f2=f2)
print(result.fit_report())
plt.plot(edata, idata, label='data')
plt.plot(edata, init_value, label='initial fit')
plt.plot(edata, result.best_fit, label='best fit')
plt.legend()
plt.show()
Finally, for analysis of X-ray resonant scattering be sure to consider including absorption corrections in that intensity calculation. As you go across the Zn K edge, the absorption depth of the sample may change dramatically if the Zn concentration is high.
I attempted to fit a best-fit line to my data points using the scipy.optimize.curve_fit function:
import numpy as np
import scipy.optimize
import matplotlib.pyplot as plt

x = np.array([0, 2246, 2600, 3465, 4392])
y = [-0.763, 0.052, 0.081, 0.266, 0.179]
yerror = [0.201, 0.113, 0.139, 0.162, 0.204]
plt.errorbar(x, y, yerr=yerror, xerr=None, fmt='o')

def func(x, a, b, c):  # the best-fit function
    return a + (b * x)**c

popt, pcov = scipy.optimize.curve_fit(func, x, y)
x_fit = np.linspace(0, np.max(x), 1000)  # x-values for the curve of best fit
plt.plot(x_fit, func(x_fit, *popt), "b")
My popt value is: array([-7.63283206e-01, 2.23580046e-04, 2.63164486e-01])
where the first value, -7.63283206e-01, is the intercept that I want the graph to show.
The data points and best fit plotted here with the code above give a logarithmic-looking curve, but I want the line of best fit to pass through the y-axis instead, to illustrate a straighter curve.
Thanks in advance!
"Best fit" means nothing until the criteria of fitting be specified (Least mean square error or least mean square relative error or least mean absolute error or etc.). The "best fit" is different for each one.
Sine there is no criteria of fitting specified, why not choosing the simplest method, without iterative process and without need of guessed initial values of parameters.
The method below, from https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales gives :
If you specify a particular criteria of fitting then an iterative method with guessed initial values are required. You can use the above values a,b,c as initial values for a robust and fast convergence.
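For instance, a minimal sketch of feeding such values into scipy's curve_fit for a weighted least-squares refinement - the a0, b0, c0 numbers below are only illustrative placeholders for the values produced by the non-iterative method:

import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b, c):
    return a + (b * x)**c

# a0, b0, c0: placeholder initial values (take them from the non-iterative method)
a0, b0, c0 = -0.76, 2.2e-4, 0.26
popt, pcov = curve_fit(func, x, y, p0=(a0, b0, c0),
                       sigma=yerror, absolute_sigma=True)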
I'm trying to fit a nonlinear model using a Generalized Additive Model (GAM). How do I determine the number of splines to use? Is there a specific way to choose it? I have used 3rd-order (cubic) spline fitting. Below is the code.
import matplotlib.pyplot as plt
from pygam import LinearGAM
from pygam.utils import generate_X_grid

# Curve fitting using a GAM model - penalised spline curve.
def modeltrain(time, value):
    return LinearGAM(n_splines=58, spline_order=3).gridsearch(time, value)

model = modeltrain(t1, x1)

# samples random x-values for prediction
XX = generate_X_grid(model)

# plots for visualisation
plt.plot(XX, model.predict(XX), 'r--')
plt.plot(XX, model.prediction_intervals(XX, width=0.25), color='b', ls='--')
plt.scatter(t1, x1)
plt.show()
(Figures: the expected result, and the original data scatter plot.)
If the number of splines is not chosen correctly, I get an incorrect fit.
I would appreciate suggestions for a method to choose the number of splines accurately.
Typically for splines you choose a fairly high number of splines (~25) and you let the lambda smoothing parameter do the work of reducing the flexibility of the model.
For your use-case I would choose the default n_splines=25 and then do a gridsearch over the lambda parameter lam to find the best amount of smoothing:
def modeltrain(time, value):
    return LinearGAM(n_splines=25, spline_order=3).gridsearch(time, value, lam=np.logspace(-3, 3, 11))
This will try 11 models from lam = 1e-3 to 1e3.
I think your choice of n_splines=58 is too high because it looks like it produces one spline per data-point.
If you really want to do a search over n_splines then you could do:
LinearGAM(n_splines=25,spline_order=3).gridsearch(time, value, n_splines=np.arange(50))
Note: the function generate_X_grid does NOT do random sampling for prediction, it actually just makes a dense linear-spacing of your X-values (time). The reason for this is to visualize how the learned model will interpolate.
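If you prefer to build that prediction grid yourself, a minimal sketch of the equivalent (the 500-point count is just an illustrative choice, not pygam's exact default):

import numpy as np
# a dense, evenly spaced grid over the observed time range, like generate_X_grid
XX = np.linspace(t1.min(), t1.max(), 500).reshape(-1, 1)
plt.plot(XX, model.predict(XX), 'r--')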
I know that there are some similar questions, but since none of them brought me any further, I decided to ask one of my own.
I am sorry, if the answer to my problem is already somewhere out there, but I really couldn't find it.
I tried fitting f(x) = a*x**b to rather linear data using curve_fit. It runs without errors, but the result is way off, as shown below:
The thing is that I don't really know what I am doing; on the other hand, fitting is always more of an art than a science, and there was at least one general bug in scipy.optimize.
My data looks like this:
x-values:
[16.8, 2.97, 0.157, 0.0394, 14.000000000000002, 8.03, 0.378, 0.192, 0.0428, 0.029799999999999997, 0.000781, 0.0007890000000000001]
y-values:
[14561.766666666666, 7154.7950000000001, 661.53750000000002, 104.51446666666668, 40307.949999999997, 15993.933333333332, 1798.1166666666666, 1015.0476666666667, 194.93800000000002, 136.82833333333332, 9.9531566666666684, 12.073133333333333]
That's my code (using a really nice example in the last answer to that question):
import numpy as np
from scipy.optimize import curve_fit

# xvalues and yvalues are numpy arrays built from the lists above
def func(x, p0, p1):  # HERE WE DEFINE A FUNCTION THAT WE THINK WILL FOLLOW THE DATA DISTRIBUTION
    return p0*(x**p1)

# Here you give the initial parameters for p0 and p1 which Python then iterates over to find the best fit
popt, pcov = curve_fit(func, xvalues, yvalues, p0=(1.0, 1.0))  # p0=(3107, 0.944) THESE PARAMETERS ARE USER DEFINED
print(popt)  # This contains your two best-fit parameters

# Performing sum of squares
p0 = popt[0]
p1 = popt[1]
residuals = yvalues - func(xvalues, p0, p1)
fres = sum(residuals**2)
print('chi-square')
print(fres)  # THIS IS YOUR CHI-SQUARE VALUE!

xaxis = np.linspace(5e-4, 20)  # we can plot with xdata, but the fit will not look good
curve_y = func(xaxis, p0, p1)
The starting values are from a fit with gnuplot; that fit is plausible, but I need to cross-check.
This is printed output (first fitted p0, p1, then chi-square):
[ 4.67885857e+03 6.24149549e-01]
chi-square
424707043.407
I guess this is a difficult question, therefore much thanks in advance!
When fitting, curve_fit minimizes the sum of (data - model)^2 / (error)^2.
If you don't pass in errors (as you are doing here), curve_fit assumes that all of the points have an error of 1.
In this case, as your data span many orders of magnitude, the points with the largest y values dominate the objective function, and this causes curve_fit to attempt to fit them at the expense of the others.
The best way of fixing this would be to include the errors on your yvalues in the fit (and it looks like you have them, since there are error bars in the plot you made!). You can do this by passing them in as the sigma parameter of curve_fit.
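A minimal sketch of that call - yerrors is a placeholder name for your array of 1-sigma uncertainties:

# pass the measurement uncertainties so large-y points no longer dominate
popt, pcov = curve_fit(func, xvalues, yvalues, p0=(1.0, 1.0),
                       sigma=yerrors, absolute_sigma=True)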
I would rethink the experimental part. Two datapoints are questionable:
The image you showed us looks pretty good because you took the log:
You could do a linear fit on log(x) and log(y); in this way you might limit the impact of the largest residuals (a sketch follows below). Another approach would be robust regression (RANSAC from sklearn, or least_squares from scipy with a robust loss).
Nevertheless you should either gather more datapoints or repeat the measurements.
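A minimal sketch of the log-log fit mentioned above, assuming xvalues and yvalues are the (all-positive) arrays from the question:

import numpy as np

logx, logy = np.log(xvalues), np.log(yvalues)
p1_est, logp0_est = np.polyfit(logx, logy, 1)  # log(y) = p1*log(x) + log(p0)
p0_est = np.exp(logp0_est)                     # back-transform to y = p0 * x**p1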
I have to draw a plot using the least squares method in Python 3. I have lists of x and y values:
y = [186,273,308,484]
x = [2.25,2.34,2.47,2.56]
There are many more values for x and for y; this is only an excerpt. Now, I know that f(x) = y should be a linear function. I can get the coefficients a and b of this function by calculating:
delta_x = x[-1] - x[0] and delta_y = y[-1] - y[0]
and so on, using the tangent. I know how to do that.
But there are also uncertainties in y, about 2 percent of y, so I have a y_errors list which contains all the uncertainties of y.
But what now? How can I do a least-squares fit?
Of course I have used Google; I saw docs.scipy.org/doc/scipy/reference/tutorial/optimize.html#least-square-fitting-leastsq, but there are some problems.
I tried to adapt the example from scipy.org to my own purpose, so I edited the x, y, y_meas variables, putting in my own lists. But now I don't know what the p0 variable in this example is, and what I should edit to make my example work.
Of course, I can also edit the residuals function. It must take only one variable - y_true. In addition to this, I don't understand the arguments of the leastsq function.
Sorry for my English and for asking such a newbie question, but I don't understand this method. Thank you in advance.
I believe you are trying to fit a set of {x, y} values (and possibly sigma_y, the uncertainties in y) to a linear expression. This is known as linear regression, and for linear regression (or indeed, for regression of any polynomial) you can use numpy's polyfit. The uncertainties can be used for the weights:
weight = 1/sigma_y
where sigma_y is the standard deviation in y.
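A minimal sketch of such a weighted fit with the values from the question (the 2-percent uncertainties are computed as described there):

import numpy as np

x = np.array([2.25, 2.34, 2.47, 2.56])
y = np.array([186, 273, 308, 484])
y_errors = 0.02 * y                         # about 2 percent of y, as stated
a, b = np.polyfit(x, y, 1, w=1.0/y_errors)  # weighted first-order fit: y ≈ a*x + b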
The least-squares routines in scipy.optimize let you fit a non-linear function to data, but you have to write the function that computes the "residual" (data - model) in terms of the variables that are to be adjusted in order to minimize that residual, as sketched below.
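For reference, a minimal sketch of such a residual function with scipy.optimize.least_squares, weighting by the uncertainties; the parameter layout is illustrative:

from scipy.optimize import least_squares

def residual(params, x, y, y_errors):
    a, b = params
    return (y - (a * x + b)) / y_errors   # weighted residual vector to minimize

fit = least_squares(residual, x0=[100.0, 0.0], args=(x, y, y_errors))
print(fit.x)   # best-fit a, b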