Fitting data from scatterplot - python

I have a DataFrame with two columns which I scatter-plotted, getting something like the following picture:
I would like to know if there is a way to find a distribution curve that best fits it, since the tutorials I've found focus on the distribution of a single variable only (e.g. this case). I'm looking for something like this:
Does anyone have any directions or sample code for this case?

You can try fitting polynomials of different degrees using numpy.polyfit. It takes x, y and the degree of the fitting polynomial as inputs.
You can write a loop that iterates over the degrees, say from 1 to 5. For each degree, plot f(x) using the coefficients returned by the function.
import numpy as np
import matplotlib.pyplot as plt

for d in range(1, 6):                         # degrees 1 through 5
    coeffs = np.polyfit(x, y, d)              # fit a polynomial of degree d
    f = np.poly1d(coeffs)                     # build f(x) from the returned coefficients
    plt.plot(x, f(x), label="degree %d" % d)  # optionally plot f(x) for degree d
    sse = np.sum((y - f(x)) ** 2)             # sum of squares (y_i - f(x_i))^2
Note that the sum of squares is only an indication of the error - in general it will go down as the degree increases, but plotting the fits will show you whether you are overfitting the data.
This is just one of the ways to go about solving the problem.

Related

Problems with curve_fit from scipy.optimize

I know that there are some similar questions, but since none of them brought me any further, I decided to ask one of my own.
I am sorry if the answer to my problem is already somewhere out there, but I really couldn't find it.
I tried fitting f(x) = a*x**b to rather linear data using curve_fit. It runs without errors, but the result is way off, as shown below:
The thing is that I don't really know what I am doing, but on the other hand fitting is always more of an art than a science, and there has been at least one general bug in scipy.optimize.
My data looks like this:
x-values:
[16.8, 2.97, 0.157, 0.0394, 14.000000000000002, 8.03, 0.378, 0.192, 0.0428, 0.029799999999999997, 0.000781, 0.0007890000000000001]
y-values:
[14561.766666666666, 7154.7950000000001, 661.53750000000002, 104.51446666666668, 40307.949999999997, 15993.933333333332, 1798.1166666666666, 1015.0476666666667, 194.93800000000002, 136.82833333333332, 9.9531566666666684, 12.073133333333333]
That's my code (using a really nice example in the last answer to that question):
import numpy as np
from scipy.optimize import curve_fit

xvalues = np.array(xvalues)  # the x data listed above, as a numpy array
yvalues = np.array(yvalues)  # the y data listed above, as a numpy array

def func(x, p0, p1):  # here we define a function that we think will follow the data distribution
    return p0 * (x ** p1)

# Here you give the initial parameters for p0 and p1, which Python then iterates over to find the best fit
popt, pcov = curve_fit(func, xvalues, yvalues, p0=(1.0, 1.0))  # alternatively p0=(3107, 0.944); these parameters are user defined
print(popt)  # this contains your two best-fit parameters

# Performing sum of squares
p0 = popt[0]
p1 = popt[1]
residuals = yvalues - func(xvalues, p0, p1)
fres = sum(residuals ** 2)
print('chi-square')
print(fres)  # this is your chi-square value
xaxis = np.linspace(5e-4, 20)  # we can plot with xdata, but the fit will not look good
curve_y = func(xaxis, p0, p1)
The starting values are from a gnuplot fit that looks plausible, but I need to cross-check it.
This is the printed output (first the fitted p0 and p1, then the chi-square value):
[ 4.67885857e+03 6.24149549e-01]
chi-square
424707043.407
I guess this is a difficult question, so many thanks in advance!
When fitting, curve_fit minimizes the sum of (data - model)^2 / (error)^2.
If you don't pass in errors (as is the case here), curve_fit assumes that all of the points have an error of 1.
In this case, as your data span many orders of magnitude, the points with the largest y values dominate the objective function, which causes curve_fit to fit them at the expense of the others.
The best way to fix this is to include the errors on your yvalues in the fit (it looks like you have them, since there are error bars in the plot you made). You can do this by passing them in as the sigma parameter of curve_fit.
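For example, a minimal sketch, assuming the uncertainties are stored in an array y_errors of the same length as yvalues (a name not used in the original post):
# y_errors: the measurement uncertainties on yvalues (hypothetical name)
popt, pcov = curve_fit(func, xvalues, yvalues,
                       p0=(1.0, 1.0),
                       sigma=y_errors,
                       absolute_sigma=True)  # treat sigma as absolute errors, not relative weights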
I would rethink the experimental part. Two data points are questionable:
The image you showed us looks pretty good because you took the log:
You could do a linear fit on log(x) and log(y); in this way you might limit the impact of the largest residuals (a sketch follows below). Another approach would be robust regression (RANSAC from sklearn or least_squares from scipy).
Nevertheless you should either gather more datapoints or repeat the measurements.
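A minimal sketch of the log-log linear fit mentioned above, assuming xvalues and yvalues are numpy arrays of the data shown in the question:
import numpy as np

log_x = np.log10(xvalues)
log_y = np.log10(yvalues)
p1_fit, log_p0 = np.polyfit(log_x, log_y, 1)  # straight line in log-log space

# back-transform to the power law y = p0 * x**p1
p0_fit = 10 ** log_p0
print(p0_fit, p1_fit)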

Extrapolating data from a curve using Python

I am trying to extrapolate future data points from a data set that contains one continuous value per day for almost 600 days. I am currently fitting a 1st-order function to the data using numpy.polyfit and numpy.poly1d. In the graph below you can see the curve (blue) and the 1st-order function (green). The x-axis is days since the beginning. I am looking for an effective way to model this curve in Python in order to extrapolate future data points as accurately as possible. A linear regression isn't accurate enough and I'm unaware of any methods of nonlinear regression that could work in this instance.
This solution isn't accurate enough. This is what I feed in:
import numpy
import matplotlib.pyplot as plt

x = dfnew["days_since"]
y = dfnew["nonbrand"]
z = numpy.polyfit(x, y, 1)   # 1st-order (linear) fit
f = numpy.poly1d(z)          # polynomial function built from the coefficients
x_new = future_days
y_new = f(x_new)
plt.plot(x, y, '.', x_new, y_new, '-')
EDIT:
I have now tried curve_fit using a logarithmic function, since the curve and the data behaviour seem to conform to one:
from scipy.optimize import curve_fit

def func(x, a, b):
    return a * numpy.log(x) + b

x = dfnew["days_since"]
y = dfnew["nonbrand"]
popt, pcov = curve_fit(func, x, y)
plt.plot(future_days, func(future_days, *popt), '-')
However when I plot it, my Y-values are way off:
The very general rule of thumb is that if your fitting function is not fitting your actual data well enough, then either:
You are using the function wrongly, e.g. you are using 1st-order polynomials - so if you are convinced that it is a polynomial, try higher-order polynomials.
You are using the wrong function; it is always worth taking a look at:
your data curve &
what you know about the process that is generating the data
to come up with some speculation/theories/guesses about what sort of model might fit better.
Might your process be a logarithmic one, a saturating one, etc.? Try them!
Finally, if you are not getting a consistent long-term trend, then you might be able to justify using cubic splines (a sketch follows below).
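If you do go the spline route, a minimal sketch using scipy, assuming the dfnew columns and future_days from the question and that days_since is strictly increasing:
import numpy as np
from scipy.interpolate import CubicSpline

x = dfnew["days_since"].to_numpy()
y = dfnew["nonbrand"].to_numpy()

cs = CubicSpline(x, y)        # cubic spline through the observed points
y_future = cs(future_days)    # extrapolates beyond the last knot; treat with caution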

PyMC Robust Linear Regression with Measured Uncertainties

I use least-squares regression of data with measured errors in both x and y, and use the reduced chi-square (mean square weighted deviation: MSWD) as a measure of the fit. However, some of the assumptions for using reduced chi-square are likely not met, and I'd like to move towards an MCMC/Bayesian approach using PyMC. I've searched the web but can't quite seem to find what I'm looking for; most examples assume the data uncertainty is Gaussian in y only, but here I have measured uncertainties in both x and y.
It seems like I should be able to do this in PyMC2 or PyMC3 with glm.
Here's a typical dataset plotted up:
And the data to go with it:
# Data in Columns, Observations in Rows
# Measured values x versus y,
# Measured standard deviations sx and sy.
x sx y sy
0.3779397 0.001889699 0.5130084 2.748546e-05
0.3659092 0.001829546 0.5129624 2.721838e-05
0.3430834 0.001715417 0.5129023 2.720073e-05
0.4121606 0.002060803 0.5130235 2.755231e-05
0.3075815 0.001537908 0.5128739 2.776967e-05
0.3794471 0.001897236 0.5129950 2.842079e-05
0.1447394 0.000723697 0.5126784 2.816200e-05
I'm looking for any examples and references where people have done this. Thanks in advance.
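One common way to express measured uncertainties in both x and y in PyMC3 is to treat the true x values as latent parameters. A minimal sketch, assuming x, sx, y and sy are numpy arrays holding the table columns above, with purely illustrative priors and sampler settings:
import pymc3 as pm

with pm.Model() as model:
    a = pm.Normal("slope", mu=0.0, sd=10.0)      # weakly informative priors (illustrative)
    b = pm.Normal("intercept", mu=0.0, sd=10.0)

    # latent "true" x values, constrained by the measured x and its standard deviations
    x_true = pm.Normal("x_true", mu=x, sd=sx, shape=len(x))

    # the measured y is the straight line in x_true, blurred by the y uncertainties
    pm.Normal("y_obs", mu=a * x_true + b, sd=sy, observed=y)

    trace = pm.sample(2000, tune=1000)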

Python LeastSquares plot

I have to draw plot using least squares method in Python 3. I have list of x and y values:
y = [186,273,308,484]
x = [2.25,2.34,2.47,2.56]
There are many more values of x and y; this is only an excerpt. Now, I know that f(x) = y should be a linear function. I can get the coefficients „a” and „b” of this function by calculating:
delta_x = x[-1] - x[0] and delta_y = y[-1] - y[0]
etc., using the tangent function. I know how to do that.
But there are also uncertainties in y, about 2 percent of y, so I have a y_errors list which contains all the uncertainties of y.
But what now? How can I do the least-squares fit and draw it?
Of course I have used Google; I saw docs.scipy.org/doc/scipy/reference/tutorial/optimize.html#least-square-fitting-leastsq, but there are some problems.
I tried to adapt the example from scipy.org to my own purpose, so I edited the x, y and y_meas variables, putting in my own lists. But now I don't know what the p0 variable in this example is, and what I should edit to make my example work.
Of course I can also edit the residuals function. It must take only one variable - y_true. In addition, I don't understand the arguments of the leastsq function.
Sorry for my English and for asking such a newbie question, but I don't understand this method. Thank you in advance.
I believe you are trying to fit a set of {x, y} values (and possibly sigma_y, the uncertainties in y) to a linear expression. This is known as linear regression. For linear regression (or indeed, for regression with any polynomial) you can use numpy's polyfit. The uncertainties can be used for the weights:
weight = 1/sigma_y
where sigma_y is the standard deviation in y.
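For example, a minimal sketch using the data above, assuming the uncertainties are ~2 percent of y as described in the question:
import numpy as np

x = np.array([2.25, 2.34, 2.47, 2.56])
y = np.array([186, 273, 308, 484])
y_errors = 0.02 * y                           # ~2 percent uncertainties, as described

a, b = np.polyfit(x, y, 1, w=1.0 / y_errors)  # weighted linear fit, f(x) = a*x + b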
The least-squares routines in scipy.optimize allow you to fit a non-linear function to data, but you have to write the function that computes the "residual" (data - model) in terms of variables that are to be adjusted in order to minimize the calculated residual.
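As an illustration of that pattern, a minimal sketch using scipy.optimize.least_squares (the function and parameter names here are illustrative, not from the question):
from scipy.optimize import least_squares

def residuals(params, x, y, y_errors):
    a, b = params
    return (y - (a * x + b)) / y_errors  # weighted residuals: (data - model) / sigma

fit = least_squares(residuals, x0=[1.0, 1.0], args=(x, y, y_errors))
print(fit.x)  # best-fit a and b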

How can I produce some data with a specific shape?

I have a couple of images of graphs for which I'd like to synthesise the original (x,y) data. I don't have the original data, just the graphs.
Ideally I'd like to be able to approximate the shape of the curves using mathematical functions, so that I can vary the functions to produce slightly different output, and so that I can have simple reproducibility.
The first image shows a set of curves: temperature anomalies from the mean over some recent period, stretching back to 20,000 years before present. The second image shows a step function with a change at 10,000 years before present (log scale). (You'll also notice they have opposite x-axis directions).
For each of these, the eventual data I want to produce is a text file, with a temperature anomaly value for every 10 or 100 years.
Any solution is welcome.
I am not sure I fully understand your question, but what you could try is to digitize the data (WinDig works for Windows and Engauge for Linux) and then do some interpolation between the points.
The trivial interpolation, which works almost all the time, is just a straight line between two consecutive points. A more sophisticated approach is a cubic spline (for instance B-splines, http://en.wikipedia.org/wiki/B-spline), which keeps the second derivative continuous.
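A minimal sketch of that approach, assuming you have already digitized a handful of points (the values below are placeholders, not read from your graphs):
import numpy as np
from scipy.interpolate import interp1d

# placeholder digitized points: years before present vs. temperature anomaly
t = np.array([0.0, 5000.0, 10000.0, 15000.0, 20000.0])
anomaly = np.array([0.0, -0.5, -3.0, -4.0, -4.5])

linear = interp1d(t, anomaly)               # straight lines between consecutive points
cubic = interp1d(t, anomaly, kind="cubic")  # cubic-spline interpolation

t_new = np.arange(0.0, 20001.0, 100.0)      # one value every 100 years
np.savetxt("anomaly.txt", np.column_stack([t_new, cubic(t_new)]))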
I've decided to answer with some details about a way to generate a curve using algebra.
For a periodic type curve, you'd use either sine or cosine, and fiddle with amplitude and frequency to match your particular situation. For example, y = A sin(2x), where the amplitude is A, and the period is related to the inner function of x (that bit inside the brackets). Try this in gnuplot:
A=2
f(x) = A*sin(2*x)
set xrange[-pi:pi]
plot f(x), sin(x), cos(x)
To change the amplitude as x changes just add a power term or exponential term of the x values to the amplitude:
f(x) = A*exp(0.5*x)*sin(2*x)
set xrange[-2*pi:2*pi]
plot f(x), sin(x)
# add an initial value (offset)
f(x) = 5+A*exp(0.5*x)*sin(2*x)
plot f(x), sin(x)
And so on.
