I have the following dataset from a mechanical indentation test:
https://www.dropbox.com/s/jovjl55sjjyph3r/Test%20dataset.csv?dl=0
The graph shows the displacement of a spherical probe vs. the recorded force. I need to fit these data with specific equations (the DMT model, if you are familiar with it).
I have produced the code below, but I am unable to get a good fitting result. The code runs without errors or warnings, so I don't know whether the problem is in the plotting or in the actual fitting.
Did I write the fitting code correctly? Did I pass the variables Fad and R correctly into the function? Is the code that plots the fitting curve correct?
Also, in the code you will notice two different fitting functions. Function1 is based on two equations:
a = ((R/K)*(x+Fad))^(1/3)
y = ((a^2)/R)
Function2 is the same as function1, but the two equations are combined into a single equation. The funny thing is that they give two different plots!
Importantly, I'd like to use the two-equation method because there are other, more complex models that I will need to fit to the same dataset, and in those models the equations cannot be combined as easily as here.
Any help from the community to solve this problem would be very much appreciated.
import pandas
from matplotlib import pyplot as plt
from scipy import optimize
import numpy as np
df = pandas.read_table("Test dataset.csv", sep = ',', header=0)
df = df.astype(float) #Change data from object to float
print(df.shape)
print(df)
df_Fad = -(df.iloc[0, 0])
print("Adhesion force = {} N".format(df_Fad))
R = 280*1e-6
print("Probe radius = {} m".format(df_Fad))
df_x = df["Corr Force A [N]"].to_list()
df_y = df["Corr Displacement [m]"].to_list()
#Define fitting function1
def DMT(x, R, K, Fad):
    a = ((R/K)*(x+Fad))**(1/3)
    return ((a**2)/R)
custom_DMT = lambda x, K: DMT(x, R, K, df_Fad) #Fix Fad value
pars_DMT, cov_DMT = optimize.curve_fit(f=custom_DMT, xdata=df_x, ydata=df_y)
print("K = ", round(pars_DMT[0],2))
print ("E = ", round(pars_DMT[0]*(4/3),2))
ax0 = df.plot(kind='scatter', x="Corr Force A [N]", y="Corr Displacement [m]", color='lightblue')
plt.plot(df_x, DMT(np.array(df_y), pars_DMT[0], R, df_Fad), "--", color='black')
ax0.set_title("DMT fitting")
ax0.set_xlabel("Force / N")
ax0.set_ylabel("Displacement / m")
ax0.legend(['Dataset'])
plt.tight_layout()
#Define fitting function2 => function2 = function1 in one line
def DMT2(x, Fad, R, K):
    return ((x+Fad)**(2/3))/((R**(1/3))*(K**(2/3)))
custom_DMT2 = lambda x, K: DMT2(x, df_Fad, R, K) #Fix Fad value
pars_DMT2, cov_DMT2 = optimize.curve_fit(f=custom_DMT2, xdata=df_x, ydata=df_y)
print("K = ", round(pars_DMT2[0],2))
print ("E = ", round(pars_DMT2[0]*(4/3),2))
ax1 = df.plot(kind='scatter', x="Corr Force A [N]", y="Corr Displacement [m]", color='lightblue')
plt.plot( df_x, DMT2(np.array(df_y), pars_DMT2[0], df_Fad, R), "--", color='black')
ax1.set_title("DMT fitting")
ax1.set_xlabel("Force / N")
ax1.set_ylabel("Displacement / m")
ax1.legend(['Dataset'])
plt.tight_layout()
plt.show()
After further attempts and research I have solved the problem, even though a doubt about one line of the code above remains. I decided to post my solution here, hoping it may be helpful to others.
I could see that the code:
def DMT(x, R, K, Fad):
    a = ((R/K)*(x+Fad))**(1/3)
    return ((a**2)/R)
works well in fitting the experimental data, meaning that several equations can easily be used and combined in Python, which is great. It is important, though, that the arguments (x, R, K, Fad) are passed in the same order in which they appear in the function definition; failing to do so gives meaningless results.
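As a side note, here is a small sketch (with made-up parameter values, purely for illustration) showing that passing the fixed values by keyword makes the mapping explicit, whatever the positional order:
import numpy as np
def DMT(x, R, K, Fad):
    a = ((R/K)*(x+Fad))**(1/3)
    return ((a**2)/R)
R, K_fit, Fad = 280e-6, 1e7, 1e-6 #hypothetical values, for illustration only
forces = np.linspace(0, 1e-3, 5)
depths = DMT(forces, R=R, K=K_fit, Fad=Fad) #keywords cannot be silently mismatched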
The problem lies in this line of code:
plt.plot(df_x, DMT(np.array(df_y), pars_DMT[0], R, df_Fad), "--", color='black')
Initially I thought that the order of the parameters (x, R, K, Fad) was wrong. I tried:
plt.plot(df_x, DMT(np.array(df_y), R, pars_DMT[0], df_Fad), "--", color='black')
but this didn't solve the problem. Can anyone tell me what's wrong with this line?
Anyway, my way around this problem was to calculate the y data of the fitting line directly from the fitted parameter K (R and df_Fad are fixed), using the following code:
df["Fitting Displacement [m]"] = ((df["Corr Force A [N]"]+df_Fad)**(2/3))/((R**(1/3))*(pars_DMT[0]**(2/3)))
This was a necessary step anyway, in order to save the fitting results.
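For reference, a minimal sketch of what the plotting line itself would presumably need to look like: evaluate the fitted function on the force data df_x (not the displacement data df_y), and pass the arguments in the order of the signature (x, R, K, Fad):
plt.plot(df_x, DMT(np.array(df_x), R, pars_DMT[0], df_Fad), "--", color='black')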
I have a problem: I have two distinct equations, one linear and one exponential, but they should not both be valid at the same time; there are two distinct regimes.
Equation 1 (x < a): E*x
Equation 2 (x >=a): a+b*x+c*(1-np.exp(-d*np.array(x)))
That is, the first part of the data should be fit with the linear equation and the rest with the aforementioned equation 2.
The data I'm trying to fit looks like this (I have also added some sample data, if people want to have a go):
I have tried several things already, from just defining one fit function with a Heaviside function:
def fit_fun(x,a,b,c,d,E):
    funktion1=E*np.array(x)
    funktion2=a+b*x+c*(1-np.exp(-d*np.array(x)))
    return np.heaviside(x+a,0)*funktion2+(1-np.heaviside(x+a,0))*funktion1
defining a piecewise function:
def fit_fun(x,a,b,c,d,E):
    return np.piecewise(x, [x <= a, x > a], [lambda x: E*np.array(x), lambda x: a+b*x+c*(1-np.exp(-d*np.array(x)))])
to lastly this (which unfortunately yields some form of function error):
def plast_fun(x,a,b,c,d,E):
    out = E*x
    out[np.where(x >= a)] = a+b*x+c*(1-np.exp(-d+x))
    return out
Don't get me wrong, I do get "some" fits, but they seem to either take one or the other equation and not really use both. I have also tried several bounds and initial guesses, but it never changes.
Any input would be greatly appreciated!
Data:
0.000000 -1.570670
0.000434 83.292677
0.000867 108.909402
0.001301 124.121676
0.001734 138.187659
0.002168 151.278839
0.002601 163.160478
0.003035 174.255626
0.003468 185.035092
0.003902 195.629820
0.004336 205.887161
0.004769 215.611995
0.005203 224.752083
0.005636 233.436680
0.006070 241.897851
0.006503 250.352697
0.006937 258.915168
0.007370 267.569337
0.007804 276.199005
0.008237 284.646778
0.008671 292.772349
0.009105 300.489611
0.009538 307.776858
0.009972 314.666291
0.010405 321.224211
0.010839 327.531594
0.011272 333.669261
0.011706 339.706420
0.012139 345.689265
0.012573 351.628362
0.013007 357.488150
0.013440 363.185771
0.013874 368.606298
0.014307 373.635696
0.014741 378.203192
0.015174 382.315634
0.015608 386.064126
0.016041 389.592120
0.016475 393.033854
0.016908 396.454226
0.017342 399.831519
0.017776 403.107084
0.018209 406.277016
0.018643 409.441119
0.019076 412.710982
0.019510 415.987331
0.019943 418.873140
0.020377 421.178098
0.020810 423.756827
So far I have found these two questions, but I couldn't figure it out:
Fit of two different functions with boarder as fit parameter
Fit a curve for data made up of two distinct regimes
I suspect you are making a mistake in the second equation, where you do a+b*x+c*(1-np.exp(-d+x)) and where a is the value of x at which you change from one curve to the other. I think you should use the value of y instead, which is a*E. It is also very important to define initial parameters for the fit. I ran the following code with your data in a .txt file, and the fit seems pretty good, as you can see below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import optimize, stats
def fit_fun(x,a,b,c,d,E):
    return np.piecewise(x, [x <= a, x > a], [lambda x: E*x, lambda x: a*E+b*x+c*(1-np.exp(-d*x))])
df = pd.read_csv('teste.txt', delimiter=r'\s+', header=None)
df.columns = ['x','y']
xdata = df['x']
ydata = df['y']
p0 = [0.001,1,1,1,100000]
popt, pcov = optimize.curve_fit(fit_fun, xdata.values, ydata.values, p0=p0, maxfev=10000, absolute_sigma=True, method='trf')
print(popt)
plt.plot(xdata, ydata,'*')
plt.plot(xdata, fit_fun(xdata.values, *popt), 'r')
plt.show()
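As a small follow-up sketch (names as in the code above), the fitted breakpoint and linear slope can be read directly out of popt, whose order matches the fit_fun signature:
a_fit, b_fit, c_fit, d_fit, E_fit = popt
print("breakpoint a = {:.5f}".format(a_fit))
print("linear slope E = {:.1f}".format(E_fit))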
I'm trying to plot the exact solution to a differential equation (a radioactive leak model) in Python 2.7 with matplotlib. When plotting the graph with Euler methods or SciPy I get the expected results, but with the exact solution the output is a straight-line graph (it should be a logarithmic curve).
Here is my code:
import math
import numpy as np
import matplotlib.pyplot as plt
#define parameters
r = 1
beta = 0.0864
x0 = 0
maxt = 100.0
tstep = 1.0
#Make arrays for time and radioactivity
t = np.zeros(1)
#Implementing model with Exact solution where Xact = Exact Solution
Xact = np.zeros(1)
e = math.exp(-(beta/t))
while (t[-1]<maxt):
    t = np.append(t,t[-1]+tstep)
    Xact = np.append(Xact,Xact[-1] + ((r/beta)+(x0-r/beta)*e))
#plot results
plt.plot(t, Xact,color="green")
I realise that my problem may be due to an incorrect equation, but I'd be very grateful if someone could point out an error in my code. Cheers.
You probably want e to depend on t, as in
def e(t): return np.exp(-t/beta)
and then use
Xact = np.append(Xact, (r/beta)+(x0-r/beta)*e(t[-1]))
But you can write all of that more compactly as
t = np.arange(0, maxt+tstep/2, tstep)
plt.plot(t, (r/beta)+(x0-r/beta)*np.exp(-t/beta), color="green" )
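Put together as a self-contained sketch (parameter values copied from the question):
import numpy as np
import matplotlib.pyplot as plt
r, beta, x0 = 1, 0.0864, 0
maxt, tstep = 100.0, 1.0
t = np.arange(0, maxt + tstep/2, tstep)
Xact = (r/beta) + (x0 - r/beta)*np.exp(-t/beta) #exact solution on the whole time grid
plt.plot(t, Xact, color="green")
plt.show()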
I want to fit a curve to my data:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
x=np.array([24,25,28,37,58,104,200,235,235])
y=np.array([340,350,370,400,430,460,490,520,550])
xerr=np.array([1.1,1,0.8,1.4,1.4,2.6,3.8,2,2])
def fit_fc(x, a, b, c):
    return a*x**b+c
popt, pcov=curve_fit(fit_fc,x,y,maxfev=5000)
plt.plot(x,fit_fc(x,popt[0],popt[1],popt[2]))
plt.errorbar(x,y,xerr=xerr,fmt='-o')
but I want to put some constraints on a, b and c. For example, I want them to be in some range, let's say between 0 and 20. How can I achieve that? I'm new to Python, so any help would be appreciated.
You could use lmfit to constrain your parameters. For the following plot, I constrained your parameters a and b to the range [0, 20] (which you mentioned in your post) and c to the range [0, 400]. The parameters you get are:
a: 19.9999991
b: 0.46769173
c: 274.074071
and the corresponding plot looks as follows:
As you can see, the model reproduces the data reasonable well and the parameters are in the given ranges.
Here is the code that reproduces the results with additional comments:
from lmfit import minimize, Parameters, Parameter, report_fit
import numpy as np
x=[24,25,28,37,58,104,200,235,235]
y=[340,350,370,400,430,460,490,520,550]
def fit_fc(params, x, data):
    a = params['a'].value
    b = params['b'].value
    c = params['c'].value
    model = np.power(x,b)*a + c
    return model - data #that's what you want to minimize
# create a set of Parameters
#'value' is the initial condition
#'min' and 'max' define your boundaries
params = Parameters()
params.add('a', value= 2, min=0, max=20)
params.add('b', value= 0.5, min=0, max=20)
params.add('c', value= 300.0, min=0, max=400)
# do fit, here with leastsq model
result = minimize(fit_fc, params, args=(x, y))
# calculate final result
final = y + result.residual
# write error report
report_fit(params)
#plot results
try:
    import pylab
    pylab.plot(x, y, 'k+')
    pylab.plot(x, final, 'r')
    pylab.show()
except:
    pass
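As a usage note: after minimize() returns, the fitted values can be read back from the Parameters object with the same .value attribute used inside fit_fc above, e.g.:
print(params['a'].value, params['b'].value, params['c'].value)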
If you constrain all of your parameters to the range [0,20], the plot looks rather bad:
It depends on what you want to happen if the variables are out of range. You can use a simple if statement (in this case the program exit()s):
x = 21
if (x not in range(0, 20)):
    print("var x is out of range")
    exit()
Another way is to assert that the variable must be in the range. In this case, it's wrapped in a try/except block that handles the problem gracefully, and also exit()s like above:
try:
    assert(x in range(0, 20))
except AssertionError:
    print("variable x is out of range")
    exit()
Scipy uses unconstrained least squares in order to fit curve parameters, so it won't be that straightforward: https://github.com/scipy/scipy/blob/v0.16.0/scipy/optimize/minpack.py#L454
What you'd probably like to do is solve a constrained (non-linear, given what you're trying to fit) least-squares problem. For instance, take a look at these discussions:
Constrained least-squares estimation in Python ( leastsq_bounds: https://gist.github.com/denis-bz/65da931bdbf92c49e4d0 )
scipy.optimize.leastsq with bound constraints
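As a side note: SciPy 0.17 and later added box constraints directly to curve_fit via its bounds argument, so on newer versions no extra library is needed. A minimal sketch with the [0, 20] range from the question (as the lmfit answer above shows, forcing c into [0, 20] gives a poor fit for this data, so this only illustrates the API):
import numpy as np
from scipy.optimize import curve_fit
def fit_fc(x, a, b, c):
    return a*x**b+c
x = np.array([24,25,28,37,58,104,200,235,235], dtype=float)
y = np.array([340,350,370,400,430,460,490,520,550], dtype=float)
#bounds=(0, 20) broadcasts the same [0, 20] box to a, b and c
popt, pcov = curve_fit(fit_fc, x, y, bounds=(0, 20))
print(popt)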
I'm having trouble getting scipy.interpolate.UnivariateSpline to use any smoothing when interpolating. Based on the function's page as well as some previous posts, I believe it should provide smoothing with the s parameter.
Here is my code:
# Imports
import scipy.interpolate
import pylab
# Set up and plot actual data
x = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542, 9879.9717114947653, 13738.60563208926, 15113.277958924193]
y = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966, 6056.3852496014106, 7895.2332350173638, 9154.2956175610598]
pylab.plot(x, y, "o", label="Actual")
# Plot estimates using splines with a range of degrees
for k in range(1, 4):
    mySpline = scipy.interpolate.UnivariateSpline(x=x, y=y, k=k, s=2)
    xi = range(0, 15100, 20)
    yi = mySpline(xi)
    pylab.plot(xi, yi, label="Predicted k=%d" % k)
# Show the plot
pylab.grid(True)
pylab.xticks(rotation=45)
pylab.legend( loc="lower right" )
pylab.show()
Here is the result:
I have tried this with a range of s values (0.01, 0.1, 1, 2, 5, 50), as well as explicit weights, set to either the same thing (1.0) or randomized. I still can't get any smoothing, and the number of knots is always the same as the number of data points. In particular, I'm looking for outliers like that 4th point (7990.4664106277542, 5851.6866463790966) to be smoothed over.
Is it because I don't have enough data? If so, is there a similar spline function or cluster technique I can apply to achieve smoothing with this few datapoints?
Short answer: you need to choose the value for s more carefully.
The documentation for UnivariateSpline states that:
Positive smoothing factor used to choose the number of knots. Number of
knots will be increased until the smoothing condition is satisfied:
sum((w[i]*(y[i]-s(x[i])))**2,axis=0) <= s
From this one can deduce that "reasonable" values for smoothing, if you don't pass in explicit weights, are around s = m * v where m is the number of data points and v the variance of the data. In this case, s_good ~ 5e7.
EDIT: sensible values for s depend, of course, also on the noise level in the data. The docs seem to recommend choosing s in the range (m - sqrt(2*m)) * std**2 <= s <= (m + sqrt(2*m)) * std**2, where std is the standard deviation associated with the "noise" you want to smooth over.
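To make the heuristic concrete, here is a minimal sketch computing the suggested smoothing factor for the data in the question:
import numpy as np
import scipy.interpolate
x = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542, 9879.9717114947653, 13738.60563208926, 15113.277958924193]
y = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966, 6056.3852496014106, 7895.2332350173638, 9154.2956175610598]
s_good = len(y) * np.var(y) #~5e7 for this data, matching the estimate above
mySpline = scipy.interpolate.UnivariateSpline(x=x, y=y, k=3, s=s_good)
print(mySpline.get_knots()) #far fewer knots than data points now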
@Zhenya's answer of manually setting knots in between datapoints was too rough to deliver good results in noisy data without being selective about how this technique is applied. However, inspired by his/her suggestion, I have had success with Mean-Shift clustering from the scikit-learn package. It performs auto-determination of the cluster count and seems to do a fairly good smoothing job (very smooth, in fact).
# Imports
import numpy
import pylab
import scipy.interpolate
import sklearn.cluster
# Set up original data - note that it's monotonically increasing by X value!
data = {}
data['original'] = {}
data['original']['x'] = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542, 9879.9717114947653, 13738.60563208926, 15113.277958924193]
data['original']['y'] = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966, 6056.3852496014106, 7895.2332350173638, 9154.2956175610598]
# Cluster data, sort it and save
inputNumpy = numpy.array([[data['original']['x'][i], data['original']['y'][i]] for i in range(0, len(data['original']['x']))])
meanShift = sklearn.cluster.MeanShift()
meanShift.fit(inputNumpy)
clusteredData = [[pair[0], pair[1]] for pair in meanShift.cluster_centers_]
clusteredData.sort(lambda pair1, pair2: cmp(pair1[0],pair2[0]))
data['clustered'] = {}
data['clustered']['x'] = [pair[0] for pair in clusteredData]
data['clustered']['y'] = [pair[1] for pair in clusteredData]
# Build a spline using the clustered data and predict
mySpline = scipy.interpolate.UnivariateSpline(x=data['clustered']['x'], y=data['clustered']['y'], k=1)
xi = range(0, round(max(data['original']['x']), -3) + 3000, 20)
yi = mySpline(xi)
# Plot the datapoints
pylab.plot(data['clustered']['x'], data['clustered']['y'], "D", label="Datapoints (%s)" % 'clustered')
pylab.plot(xi, yi, label="Predicted (%s)" % 'clustered')
pylab.plot(data['original']['x'], data['original']['y'], "o", label="Datapoints (%s)" % 'original')
# Show the plot
pylab.grid(True)
pylab.xticks(rotation=45)
pylab.legend( loc="lower right" )
pylab.show()
While I'm not aware of any library which will do it for you off-hand, I'd try a slightly more DIY approach: I'd start by making a spline with knots in between the raw data points, in both x and y. In your particular example, having a single knot between the 4th and 5th points should do the trick, since it would remove the huge derivative at around x=8000.
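For concreteness, a minimal sketch of this manual-knot idea using LSQUnivariateSpline, which takes interior knots explicitly (placing the knot halfway between the 4th and 5th points is my assumption, following the suggestion above):
import scipy.interpolate
x = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542, 9879.9717114947653, 13738.60563208926, 15113.277958924193]
y = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966, 6056.3852496014106, 7895.2332350173638, 9154.2956175610598]
t = [(x[3] + x[4]) / 2.0] #single interior knot between the 4th and 5th points
mySpline = scipy.interpolate.LSQUnivariateSpline(x, y, t, k=3)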
I had trouble getting BigChef's answer running; here is a variation that works on Python 3.6:
# Imports
import pylab
import scipy.interpolate
import sklearn.cluster
# Set up original data - note that it's monotonically increasing by X value!
data = {}
data['original'] = {}
data['original']['x'] = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542, 9879.9717114947653, 13738.60563208926, 15113.277958924193]
data['original']['y'] = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966, 6056.3852496014106, 7895.2332350173638, 9154.2956175610598]
# Cluster data, sort it and save
import numpy
inputNumpy = numpy.array([[data['original']['x'][i], data['original']['y'][i]] for i in range(0, len(data['original']['x']))])
meanShift = sklearn.cluster.MeanShift()
meanShift.fit(inputNumpy)
clusteredData = [[pair[0], pair[1]] for pair in meanShift.cluster_centers_]
clusteredData.sort(key=lambda li: li[0])
data['clustered'] = {}
data['clustered']['x'] = [pair[0] for pair in clusteredData]
data['clustered']['y'] = [pair[1] for pair in clusteredData]
# Build a spline using the clustered data and predict
mySpline = scipy.interpolate.UnivariateSpline(x=data['clustered']['x'], y=data['clustered']['y'], k=1)
xi = range(0, int(round(max(data['original']['x']), -3)) + 3000, 20)
yi = mySpline(xi)
# Plot the datapoints
pylab.plot(data['clustered']['x'], data['clustered']['y'], "D", label="Datapoints (%s)" % 'clustered')
pylab.plot(xi, yi, label="Predicted (%s)" % 'clustered')
pylab.plot(data['original']['x'], data['original']['y'], "o", label="Datapoints (%s)" % 'original')
# Show the plot
pylab.grid(True)
pylab.xticks(rotation=45)
pylab.show()
I'm trying to port some MATLAB code over to SciPy, and I've tried two different functions from scipy.interpolate, interp1d and UnivariateSpline. The interp1d results match those of MATLAB's interp1 function, but the UnivariateSpline numbers come out different, and in some cases very different.
f = interp1d(row1,row2,kind='cubic',bounds_error=False,fill_value=numpy.max(row2))
return f(interp)
f = UnivariateSpline(row1,row2,k=3,s=0)
return f(interp)
Could anyone offer any insight? My x vals aren't equally spaced, although I'm not sure why that would matter.
I just ran into the same issue.
Short answer
Use InterpolatedUnivariateSpline instead:
f = InterpolatedUnivariateSpline(row1, row2)
return f(interp)
Long answer
UnivariateSpline is a 'one-dimensional smoothing spline fit to a given set of data points' whereas InterpolatedUnivariateSpline is a 'one-dimensional interpolating spline for a given set of data points'. The former smooths the data, whereas the latter is a more conventional interpolation method and reproduces the results expected from interp1d. The figure below illustrates the difference.
The code to reproduce the figure is shown below.
import scipy.interpolate as ip
from numpy import linspace, pi, sin
from matplotlib.pyplot import plot, subplot, ylim, ylabel, xlabel, legend, tight_layout
#Define independent variable
sparse = linspace(0, 2 * pi, num = 20)
dense = linspace(0, 2 * pi, num = 200)
#Define function and calculate dependent variable
f = lambda x: sin(x) + 2
fsparse = f(sparse)
fdense = f(dense)
ax = subplot(2, 1, 1)
#Plot the sparse samples and the true function
plot(sparse, fsparse, label = 'Sparse samples', linestyle = 'None', marker = 'o')
plot(dense, fdense, label = 'True function')
#Plot the different interpolation results
interpolate = ip.InterpolatedUnivariateSpline(sparse, fsparse)
plot(dense, interpolate(dense), label = 'InterpolatedUnivariateSpline', linewidth = 2)
smoothing = ip.UnivariateSpline(sparse, fsparse)
plot(dense, smoothing(dense), label = 'UnivariateSpline', color = 'k', linewidth = 2)
ip1d = ip.interp1d(sparse, fsparse, kind = 'cubic')
plot(dense, ip1d(dense), label = 'interp1d')
ylim(.9, 3.3)
legend(loc = 'upper right', frameon = False)
ylabel('f(x)')
#Plot the fractional error
subplot(2, 1, 2, sharex = ax)
plot(dense, smoothing(dense) / fdense - 1, label = 'UnivariateSpline')
plot(dense, interpolate(dense) / fdense - 1, label = 'InterpolatedUnivariateSpline')
plot(dense, ip1d(dense) / fdense - 1, label = 'interp1d')
ylabel('Fractional error')
xlabel('x')
ylim(-.1,.15)
legend(loc = 'upper left', frameon = False)
tight_layout()
The reason why the results are different (but both likely correct) is that the interpolation routines used by UnivariateSpline and interp1d are different.
interp1d constructs a smooth B-spline using the x-points you gave to it as knots
UnivariateSpline is based on FITPACK, which also constructs a smooth B-spline. However, FITPACK tries to choose new knots for the spline, to fit the data better (probably to minimize chi^2 plus some penalty for curvature, or something similar). You can find out what knot points it used via g.get_knots().
So the reason you get different results is that the interpolation algorithms are different. If you want B-splines with knots at the data points, use interp1d or splmake. If you want what FITPACK does, use UnivariateSpline. In the limit of dense data, both methods give the same results, but when the data is sparse, you may get different results.
(How do I know all this: I read the code :-)
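To see the knot-placement difference directly, here is a quick sketch with synthetic data:
import numpy as np
from scipy.interpolate import UnivariateSpline, InterpolatedUnivariateSpline
x = np.linspace(0, 10, 20)
y = np.sin(x)
g = UnivariateSpline(x, y, k=3) #FITPACK chooses its own knots
f = InterpolatedUnivariateSpline(x, y, k=3) #knots at the data points
print(g.get_knots()) #typically far fewer interior knots
print(f.get_knots())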
Works for me,
from numpy import allclose, linspace
from scipy.interpolate import interp1d, UnivariateSpline
from numpy.random import normal
from pylab import plot, show
n = 2**5
x = linspace(0,3,n)
y = (2*x**2 + 3*x + 1) + normal(0.0,2.0,n)
i = interp1d(x,y,kind=3)
u = UnivariateSpline(x,y,k=3,s=0)
m = 2**4
t = linspace(1,2,m)
plot(x,y,'r,')
plot(t,i(t),'b')
plot(t,u(t),'g')
print(allclose(i(t),u(t))) # evaluates to True
show()
This gives me,
UnivariateSpline: A more recent wrapper of the FITPACK routines.
This might explain the slightly different values? (I also experienced that UnivariateSpline is much faster than interp1d.)