I need to determine the values of the coefficients in my equation. For that I decided to use the least squares method. The equation is presented below:
The equation relates stress to the time to failure of a tested product at different temperature levels. The data I've used is made up, but it has the same structure as the actual data that I will use later on.
For better understanding I have also included a plot of the relationship:
I am fairly new to Python and didn't know that there are so many functions available for this method, so I decided to try out a few:
Input data
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from lmfit import minimize, Parameters, fit_report
# data
temp =np.array([650, 700, 750, 720, 680]) # temperature
xdata = np.array([500, 525, 540, 534, 490]) # time
ydata = np.array([330, 332, 315, 325, 335]) # stress
T = temp[0]
plt.plot(xdata,ydata,'*')
plt.xlabel('xdata')
plt.ylabel('ydata')
1. Using the curve_fit function
def func(logS, a_0, a_1, a_2, T_a, logt_a):
    return logt_a + (T - T_a) * (a_0 + a_1 * logS + a_2 * logS**2)
popt, pcov = curve_fit(func, xdata, ydata, p0=(1, 1, 1, 1, 1))
popt
zapis = 'a_0: {0:1.5e}\na_1: {1:1.5e}\na_2: {2:1.5e}\nT_a: {3:1.5e}\nlogt_a: {4:1.5e}'.format(popt[0], popt[1], popt[2], popt[3], popt[4])
print(zapis)
a_0 = popt[0]
a_1 = popt[1]
a_2 = popt[2]
T_a = popt[3]
logt_a = popt[4]
residuals = ydata - func(xdata, a_0, a_1, a_2, T_a, logt_a)  # logS was undefined; the fit used xdata
fres = sum(residuals**2)
print(fres)
curvex = np.linspace(np.min(xdata) - np.min(xdata) / 10, np.max(xdata) + 50, int(np.max(xdata) / 10))  # the number of points must be an integer
curvey=func(curvex, a_0, a_1, a_2, T_a, logt_a)
plt.plot(xdata,ydata,'*')
plt.plot(curvex,curvey, 'r')
plt.xlabel('xdata')
plt.ylabel('ydata')
2. Using the leastsq function
from scipy.optimize import leastsq
def function(parameters, logS):
    a_0, a_1, a_2, T_a, logt_a = parameters
    model = logt_a + (T - T_a) * (a_0 + a_1 * logS + a_2 * logS**2)
    return model
def objective(pars, t_r, logS):
    err = t_r - function(pars, logS)
    return err
x0 = [ 1.0, 1.0, 1.0, 1.0, 1.0 ] #initial guess of parameters
plsq = leastsq(objective, x0, args=(ydata, xdata))
print('Fitted parameters = {0}'.format(plsq[0]))
plt.plot(xdata, ydata, 'ro')
#plot the fitted curve on top
x = np.linspace(min(xdata), max(xdata), 50)
y = function(plsq[0], x)
plt.plot(x, y, 'k-')
plt.xlabel('x')
plt.ylabel('y')
In both cases I got these results:
a_0: -5.95683e+02
a_1: 2.65405e-02
a_2: -2.63017e-05
T_a: 1.21502e+02
logt_a: 3.11614e+05
Question 1: What is the best way of determining the initial values of the coefficients being fitted?
Question 2: Which of the least-squares-based methods in Python is best suited to equations like mine?
Question 3: Is there a way to make the process of setting up the coefficients as parameters more automated? I will also have to try higher-order polynomials, which will lead to more coefficients (a_3, a_4, a_5,...). The idea would be to specify the order of the polynomial and have everything else built and calculated automatically.
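One way the automation asked about in Question 3 could look is sketched below. It is only a sketch under stated assumptions: make_model and fit_order are hypothetical helper names, and stacking logS and T into a two-row array (so that every data point carries its own temperature) is a choice made here, not something taken from the code above. Because the model takes a variable number of coefficients, curve_fit cannot count the parameters itself, so the length of p0 sets the polynomial order.

```python
import numpy as np
from numpy.polynomial import polynomial as P
from scipy.optimize import curve_fit


def make_model(order):
    """Build a curve_fit-compatible model with a polynomial of the given order."""
    def model(X, T_a, logt_a, *a):
        if len(a) != order + 1:
            raise ValueError("expected %d polynomial coefficients" % (order + 1))
        logS, T = X                      # row 0: log stress, row 1: temperature
        # polyval evaluates a[0] + a[1]*logS + a[2]*logS**2 + ...
        return logt_a + (T - T_a) * P.polyval(logS, a)
    return model


def fit_order(order, logS, T, logt, p0=None):
    """Fit the model of the given order; the length of p0 fixes the parameter count."""
    model = make_model(order)
    if p0 is None:
        p0 = np.ones(order + 3)          # T_a, logt_a, a_0 ... a_order
    X = np.vstack([logS, T])
    popt, pcov = curve_fit(model, X, logt, p0=p0)
    return model, popt, pcov
```

Calling fit_order(3, logS, T, logt) would then fit a cubic without any change to the model code. The default starting vector of ones is only a placeholder; a starting point closer to physically plausible values is likely to matter for convergence.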
Related
I'm doing a curve fit in python using scipy.curve_fit, and the fit itself looks great, however the parameters that are generated don't make sense.
The equation is (ax)^b + cx, but with the params python finds a = -c and b = 1, so the whole equation just equals 0 for every value of x.
here is the plot: https://i.stack.imgur.com/fBfg7.png
here is the experimental raw data I used: https://pastebin.com/CR2BCJji
xdata = cfu_u
ydata = OD_u
min_cfu = 0.1
max_cfu = 9.1
x_vec = pow(10,np.arange(min_cfu,max_cfu,0.1))
def func(x,a, b, c):
    return (a*x)**b + c*x
popt, pcov = curve_fit(func, xdata, ydata)
plt.plot(x_vec, func(x_vec, *popt), label = 'curve fit',color='slateblue',linewidth = 2.2)
plt.plot(cfu_u,OD_u,'-',label = 'experimental data',marker='.',markersize=8,color='deepskyblue',linewidth = 1.4)
plt.legend(loc='upper left',fontsize=12)
plt.ylabel("Y",fontsize=12)
plt.xlabel("X",fontsize=12)
plt.xscale("log")
plt.gcf().set_size_inches(7, 5)
plt.show()
print(popt)
[ 1.44930871e+03 1.00000000e+00 -1.44930871e+03]
I used the curve_fit function from scipy to fit an exponential curve to some data. The fit looks very good, so that part was a success.
However, the parameters output by the curve_fit function do not make sense, and solving f(x) with them results in f(x)=0 for every value of x, which is clearly not what is happening in the curve.
Modify your model to show what's actually happening:
def func(x: np.ndarray, a: float, b: float, c: float) -> np.ndarray:
    return (a*x)**(1 - b) + (c - a)*x
producing optimized parameters
[3.49003332e-04 6.60420171e-06 3.13366557e-08]
This is likely to be numerically unstable. Try optimizing in the log domain instead.
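A minimal sketch of what fitting in the log domain could look like is given below. It makes several assumptions that are not in the question: the data is made up, x and y are strictly positive (the logarithm is undefined otherwise), and a and c are parametrised by their logarithms so that they stay positive during the search. log_model is a hypothetical helper name.

```python
import numpy as np
from scipy.optimize import curve_fit


def func(x, a, b, c):
    return (a * x) ** b + c * x


def log_model(x, log_a, b, log_c):
    """log of func, with a and c given by their logarithms (so both stay positive)."""
    return np.log(func(x, np.exp(log_a), b, np.exp(log_c)))


# Made-up, strictly positive data spanning several decades
rng = np.random.default_rng(0)
x = np.geomspace(1e1, 1e6, 40)
true_params = (2e-3, 0.5, 1e-4)
y = func(x, *true_params) * np.exp(rng.normal(0.0, 0.02, x.size))

# Fit log(y) so that residuals are relative rather than absolute errors
p0 = (np.log(1e-3), 0.4, np.log(2e-4))
popt, pcov = curve_fit(log_model, x, np.log(y), p0=p0)
a_fit, b_fit, c_fit = np.exp(popt[0]), popt[1], np.exp(popt[2])
print(a_fit, b_fit, c_fit)
```

Fitting log(y) weights every decade of the data roughly equally, which is often what is wanted when the plot is drawn on logarithmic axes.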
When I run your example (after adding imports, etc.), I get NaNs for popt, and I eventually realized you were allowing general, real b with negative x. If I fit to the positive x only, I get a popt of [1.89176133e+01 5.66689997e+00 1.29380532e+08]. The fit isn't too bad (see below), but perhaps you need to restrict b to be an integer to fit the whole set. I'm not sure how to do that in Scipy (I assume you need mixed integer-real optimization, and I haven't investigated if Scipy supports that.)
Code:
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
cfu_u, OD_u = np.loadtxt('data.txt', skiprows=1).T
# fit to positive x only
posmask = cfu_u > 0
xdata = cfu_u[posmask]
ydata = OD_u[posmask]
def func(x, a, b, c):
    return (a*x)**b + c*x
popt, pcov = curve_fit(func, xdata, ydata, p0=[1000,2,1])
x_vec = np.geomspace(xdata.min(), xdata.max())
plt.plot(x_vec, func(x_vec, *popt), label = 'curve fit',color='slateblue',linewidth = 2.2)
plt.plot(cfu_u,OD_u,'-',label = 'experimental data', marker='x',markersize=8,color='deepskyblue',linewidth = 1.4)
plt.legend(loc='upper left',fontsize=12)
plt.ylabel("Y",fontsize=12)
plt.xlabel("X",fontsize=12)
plt.yscale("log")
plt.xscale("symlog")
plt.show()
print(popt)
# [1.89176133e+01 5.66689997e+00 1.29380532e+08]
I'd like to fit a Gaussian to some data that is roughly Gaussian in shape. I'd like the peak height (A), center position (mu), and standard deviation (sigma), along with 95% confidence intervals for these values.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.stats import norm
# gaussian function
def gaussian_func(x, A, mu, sigma):
    return A * np.exp( - (x - mu)**2 / (2 * sigma**2))
# generate toy data
x = np.arange(50)
y = [ 97.04421053, 96.53052632, 96.85684211, 96.33894737, 96.85052632,
96.30526316, 96.87789474, 96.75157895, 97.05052632, 96.73473684,
96.46736842, 96.23368421, 96.22526316, 96.11789474, 96.41263158,
96.32631579, 96.33684211, 96.44421053, 96.48421053, 96.49894737,
97.30105263, 98.58315789, 100.07368421, 101.43578947, 101.92210526,
102.26736842, 101.80421053, 101.91157895, 102.07368421, 102.02105263,
101.35578947, 99.83578947, 98.28, 96.98315789, 96.61473684,
96.82947368, 97.09263158, 96.82105263, 96.24210526, 95.95578947,
95.84210526, 95.67157895, 95.83157895, 95.37894737, 95.25473684,
95.32842105, 95.45684211, 95.31578947, 95.42526316, 95.30526316]
plt.scatter(x,y)
# initial_guess_of_parameters
# these values should be found with a solver or similar
parameter_initial = np.array([652, 2.9, 1.3])
# estimate optimal parameter & parameter covariance
popt, pcov = curve_fit(gaussian_func, x, y, p0=parameter_initial)
# plot result
xd = np.arange(x.min(), x.max(), 0.01)
estimated_curve = gaussian_func(xd, popt[0], popt[1], popt[2])
plt.plot(xd, estimated_curve, label="Estimated curve", color="r")
plt.legend()
plt.savefig("gaussian_fitting.png")
plt.show()
# estimate standard Error
StdE = np.sqrt(np.diag(pcov))
# estimate 95% confidence interval
alpha=0.025
lwCI = popt + norm.ppf(q=alpha)*StdE
upCI = popt + norm.ppf(q=1-alpha)*StdE
# print result
mat = np.vstack((popt,StdE, lwCI, upCI)).T
df=pd.DataFrame(mat,index=("A", "mu", "sigma"),
columns=("Estimate", "Std. Error", "lwCI", "upCI"))
print(df)
Data Plot with Fitted Curve
The data peak and center position seems correct, but the standard deviation is off. Any input is greatly appreciated.
Your scatter does look similar to a Gaussian distribution, but it is not centered around zero. Given the specifics of the Gaussian function, it will therefore be hard to fit a Gaussian nicely to the data in the form you gave us. I would therefore propose starting by demeaning the x series:
x = np.arange(0, 50) - 24.5
Next I would add one additional parameter to your gaussian function, the offset. Since the regular Gaussian function will always have its tails close to zero it is impossible to otherwise nicely fit your scatterplot:
def gaussian_function(x, A, mu, sigma, offset):
    return A * np.exp(-np.power((x - mu)/sigma, 2.)/2.) + offset
Next you should define an error_loss_function to minimise:
def error_loss_function(params):
    gaussian = gaussian_function(x, params[0], params[1], params[2], params[3])
    errors = gaussian - y
    return sum(np.power(errors, 2)) # You can also pick a different error loss function!
All that remains is fitting our curve now:
import scipy.optimize
fit = scipy.optimize.minimize(fun=error_loss_function, x0=[2, 0, 0.2, 97])
params = fit.x # A: 6.57592661, mu: 1.95248855, sigma: 3.93230503, offset: 96.12570778
xd = np.arange(x.min(), x.max(), 0.01)
estimated_curve = gaussian_function(xd, params[0], params[1], params[2], params[3])
plt.plot(xd, estimated_curve, label="Estimated curve", color="b")
plt.legend()
plt.show(block=False)
Hopefully this helps. Looks like a fun project, let me know if my answer is not clear.
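The offset idea can also be kept within curve_fit, which returns the covariance matrix needed for the 95% confidence intervals the question asks about. The sketch below uses made-up data rather than the data from the question, and gaussian_offset is a hypothetical name; the starting values are simply read off the data, which is an assumption about what counts as a reasonable guess.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm


def gaussian_offset(x, A, mu, sigma, offset):
    return A * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) + offset


# Made-up data: a peak of height 6 at x = 25 sitting on a baseline of 96
rng = np.random.default_rng(1)
x = np.arange(50, dtype=float)
y = gaussian_offset(x, 6.0, 25.0, 4.0, 96.0) + rng.normal(0.0, 0.2, x.size)

# Rough starting values taken from the data itself
p0 = [y.max() - y.min(), x[np.argmax(y)], (x.max() - x.min()) / 10.0, y.min()]
popt, pcov = curve_fit(gaussian_offset, x, y, p0=p0)

# Normal-approximation 95% intervals from the parameter covariance
std_err = np.sqrt(np.diag(pcov))
z = norm.ppf(0.975)
lower, upper = popt - z * std_err, popt + z * std_err
for name, est, lo, hi in zip(("A", "mu", "sigma", "offset"), popt, lower, upper):
    print(f"{name:>6}: {est:9.4f}  [{lo:9.4f}, {hi:9.4f}]")
```

With the baseline absorbed by offset, sigma describes only the width of the peak, which is probably why it looked off in the three-parameter fit.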
I would like to use the curve_fit function from the scipy.optimize module to determine amplitudes, frequencies, phases of sum of sine functions (and one y0). It's easy to do when I know a number of sines to use. For example when I know two frequencies from the DFT (Discrete Fourier Transform): 1.152 and 0.432 I can define a function:
def func(x, amp1, amp2, freq1 , freq2, phase1, phase2, y0):
return amp1*np.sin(freq1*x + phase1) + amp2*np.sin(freq2*x + phase2) + y0
Then, using the curve_fit and constraining intervals of frequencies I can find a good fitting:
param, _ = curve_fit(func, t, data, bounds=([-np.inf, -np.inf, 1.14, 0.43, -np.inf, -np.inf, -np.inf], [np.inf, np.inf, 1.16, 0.44, np.inf, np.inf, np.inf]))
It looks great:
But in this case I prepared the data myself, so I knew the number of frequencies. Do you know how to define func only once so that it handles all cases (for example, five sine functions)? I've tried putting the parameters into lists, e.g. amp = [amp1, amp2, ... ], and iterating over their length, but then it is a problem to define bounds for the parameter lists. bounds is very important for keeping the model realistic.
The solution does not have to be based on curve_fit.
Assuming you know the frequencies beforehand the problem is simple. You can set the lower bound to 0 and set the upper bound to 2 * pi * freq for frequency. For amps, set any number (or np.inf if you want no boundary).
You can formulate the function in the form lambda x, amp1, phase1, amp2, phase2... : y, curve_fit can accept a function of undefined number of arguments as long as you supply a proper initial guess.
A sample code for five frequencies:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
x = np.linspace(0,10,60)
w = [1,2,3,4,5]
a = [1,4,2,3,0.1]
x0 = [0,1,0,1,0.5]
y = sum(a_i * np.sin(w_i * x - x0_i) for w_i, a_i, x0_i in zip(w,a, x0)) #base_data (built-in sum: np.sum over a generator is deprecated)
yr = y + np.random.normal(0,0.5, size=x.size) #noisy data
def func(x, *args):
    """ function of the form lambda x, amp1, phase1, amp2, phase2...."""
    # built-in sum, because np.sum over a generator is deprecated
    return sum(a_i * np.sin(w_i * (x-x0)) for w_i, a_i, x0
               in zip(w,args[::2], args[1::2]))
ubounds = np.zeros(len(w) * 2)
ubounds[::2] = 10 #setting amp max value to 10 (arbitrary)
ubounds[1::2] = np.asarray(w) * 2 * np.pi
p0 = [0] * 10 # note p0 size
popt, pcov = curve_fit(func, x, yr, p0, bounds=(0, ubounds))
amps, phases = popt[::2], popt[1::2]
plt.plot(x,func(x, *popt))
plt.plot(x,yr, 'go')
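If the frequencies should be fitted as well, each one constrained to a narrow interval around its DFT estimate as in the question, the bounds can be built with array slicing instead of being written out by hand. The sketch below is one possible layout, not the only one: the parameter order (amp, freq, phase per sine, then y0), the helper names sum_of_sines and make_bounds, and the test signal are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit


def sum_of_sines(x, *p):
    """p = (amp_1, freq_1, phase_1, ..., amp_n, freq_n, phase_n, y0)."""
    y0 = p[-1]
    amps, freqs, phases = p[0:-1:3], p[1:-1:3], p[2:-1:3]
    return y0 + sum(a * np.sin(f * x + ph) for a, f, ph in zip(amps, freqs, phases))


def make_bounds(freq_guesses, tol):
    """Each frequency within +/- tol of its guess, every other parameter free."""
    n = len(freq_guesses)
    lower = np.full(3 * n + 1, -np.inf)
    upper = np.full(3 * n + 1, np.inf)
    lower[1:-1:3] = np.asarray(freq_guesses) - tol
    upper[1:-1:3] = np.asarray(freq_guesses) + tol
    return lower, upper


freq_guesses = [1.152, 0.432]            # e.g. peak positions from a DFT
t = np.linspace(0.0, 60.0, 600)
rng = np.random.default_rng(2)
data = sum_of_sines(t, 2.0, 1.15, 0.3, 1.0, 0.435, 1.0, 0.5) + rng.normal(0, 0.1, t.size)

# p0 has 3 entries per sine plus y0; its length tells curve_fit how many parameters to fit
p0 = np.ones(3 * len(freq_guesses) + 1)
p0[1:-1:3] = freq_guesses
p0[2:-1:3] = 0.0
lower, upper = make_bounds(freq_guesses, tol=0.01)
popt, _ = curve_fit(sum_of_sines, t, data, p0=p0, bounds=(lower, upper))
```

Going from two sines to five then only means passing a longer freq_guesses list.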
Say I want to fit a sine function using scipy.optimize.curve_fit. I don't know any parameters of the function. To get the frequency, I do Fourier transform and guess all the other parameters - amplitude, phase, and offset. When running my program, I do get a fit but it does not make sense. What is the problem? Any help will be appreciated.
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
ampl = 1
freq = 24.5
phase = np.pi/2
offset = 0.05
t = np.arange(0,10,0.001)
func = np.sin(2*np.pi*t*freq + phase) + offset
fastfft = np.fft.fft(func)
freq_array = np.fft.fftfreq(len(t),t[0]-t[1])
max_value_index = np.argmax(abs(fastfft))
frequency = abs(freq_array[max_value_index])
def fit(a, f, p, o, t):
    return a * np.sin(2*np.pi*t*f + p) + o
guess = (0.9, frequency, np.pi/4, 0.1)
params, fit = sp.optimize.curve_fit(fit, t, func, p0=guess)
a, f, p, o = params
fitfunc = lambda t: a * np.sin(2*np.pi*t*f + p) + o
plt.plot(t, func, 'r-', t, fitfunc(t), 'b-')
The main problem in your program was a misunderstanding of how scipy.optimize.curve_fit is designed and what it assumes about the fit function:
ydata = f(xdata, *params) + eps
This means that the fit function has to have the array for the x values as the first parameter followed by the function parameters in no particular order and must return an array for the y values. Here is an example, how to do this:
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize
#t has to be the first parameter of the fit function
def fit(t, a, f, p, o):
    return a * np.sin(2*np.pi*t*f + p) + o
ampl = 1
freq = 2
phase = np.pi/2
offset = 0.5
t = np.arange(0,10,0.01)
#is the same as fit(t, ampl, freq, phase, offset)
func = np.sin(2*np.pi*t*freq + phase) + offset
fastfft = np.fft.fft(func)
freq_array = np.fft.fftfreq(len(t),t[0]-t[1])
max_value_index = np.argmax(abs(fastfft))
frequency = abs(freq_array[max_value_index])
guess = (0.9, frequency, np.pi/4, 0.1)
#renamed the covariance matrix
params, pcov = scipy.optimize.curve_fit(fit, t, func, p0=guess)
a, f, p, o = params
#calculate the fit plot using the fit function
plt.plot(t, func, 'r-', t, fit(t, *params), 'b-')
plt.show()
As you can see, I have also changed the way the fitted curve for the plot is calculated. You don't need another function; just call the fit function with the parameter list that the fit procedure returns.
The other problem was that you named the covariance array fit, overwriting the previously defined function fit. I fixed that as well.
P.S.: Of course now you only see one curve, because the perfect fit covers your data points.
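The frequency guess taken from the Fourier transform can be wrapped in a small helper. The sketch below is one way to do it, with estimate_frequency as a hypothetical name. It differs from the code above in two assumed details: it uses the real-input transform, so only non-negative frequencies appear, and it subtracts the mean first so that a large offset cannot be mistaken for the dominant peak.

```python
import numpy as np


def estimate_frequency(t, y):
    """Return the frequency (cycles per unit of t) of the largest non-DC peak."""
    dt = t[1] - t[0]                      # assumes evenly spaced samples
    spectrum = np.abs(np.fft.rfft(y - np.mean(y)))
    freqs = np.fft.rfftfreq(len(t), d=dt)
    return freqs[np.argmax(spectrum)]


t = np.arange(0, 10, 0.01)
y = np.sin(2 * np.pi * 2.0 * t + np.pi / 2) + 0.5
print(estimate_frequency(t, y))
```

The resolution of this estimate is limited to one frequency bin, 1/(len(t)*dt), so it is a starting value for curve_fit rather than a final answer.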
My knowledge of maths is limited, which is probably why I am stuck. I have a spectrum to which I am trying to fit two Gaussian peaks. I can fit the largest peak, but I cannot fit the smallest peak. I understand that I need to sum the Gaussian functions for the two peaks, but I do not know where I have gone wrong. An image of my current output is shown:
The blue line is my data and the green line is my current fit. There is a shoulder to the left of the main peak in my data which I am currently trying to fit, using the following code:
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import leastsq
from pylab import *
time = []
counts = []
for i in open('/some/folder/to/file.txt', 'r'):
    segs = i.split()
    time.append(float(segs[0]))
    counts.append(segs[1])
time_array = arange(len(time), dtype=float)
counts_array = arange(len(counts))
time_array[0:] = time
counts_array[0:] = counts
def model(time_array0, coeffs0):
    a = coeffs0[0] + coeffs0[1] * np.exp( - ((time_array0-coeffs0[2])/coeffs0[3])**2 )
    b = coeffs0[4] + coeffs0[5] * np.exp( - ((time_array0-coeffs0[6])/coeffs0[7])**2 )
    c = a+b
    return c
def residuals(coeffs, counts_array, time_array):
    return counts_array - model(time_array, coeffs)
# 0 = baseline, 1 = amplitude, 2 = centre, 3 = width
peak1 = np.array([0,6337,16.2,4.47,0,2300,13.5,2], dtype=float)
#peak2 = np.array([0,2300,13.5,2], dtype=float)
x, flag = leastsq(residuals, peak1, args=(counts_array, time_array))
#z, flag = leastsq(residuals, peak2, args=(counts_array, time_array))
plt.plot(time_array, counts_array)
plt.plot(time_array, model(time_array, x), color = 'g')
#plt.plot(time_array, model(time_array, z), color = 'r')
plt.show()
This code worked for me, provided that you are only fitting a function that is a combination of two Gaussian distributions.
I just made a residuals function that adds two Gaussian functions and then subtracts their sum from the real data.
The parameters (p) that I passed to SciPy's leastsq function are: the mean of the first Gaussian (m), the difference between the means of the first and second Gaussians (dm, i.e. the horizontal shift), the standard deviation of the first (sd1), and the standard deviation of the second (sd2).
import numpy as np
from scipy.optimize import leastsq
import matplotlib.pyplot as plt
######################################
# Setting up test data
def norm(x, mean, sd):
    norm = []
    for i in range(x.size):
        norm += [1.0/(sd*np.sqrt(2*np.pi))*np.exp(-(x[i] - mean)**2/(2*sd**2))]
    return np.array(norm)
mean1, mean2 = 0, -2
std1, std2 = 0.5, 1
x = np.linspace(-20, 20, 500)
y_real = norm(x, mean1, std1) + norm(x, mean2, std2)
######################################
# Solving
m, dm, sd1, sd2 = [5, 10, 1, 1]
p = [m, dm, sd1, sd2] # Initial guesses for leastsq
y_init = norm(x, m, sd1) + norm(x, m + dm, sd2) # For final comparison plot
def res(p, y, x):
    m, dm, sd1, sd2 = p
    m1 = m
    m2 = m1 + dm
    y_fit = norm(x, m1, sd1) + norm(x, m2, sd2)
    err = y - y_fit
    return err
plsq = leastsq(res, p, args = (y_real, x))
y_est = norm(x, plsq[0][0], plsq[0][2]) + norm(x, plsq[0][0] + plsq[0][1], plsq[0][3])
plt.plot(x, y_real, label='Real Data')
plt.plot(x, y_init, 'r.', label='Starting Guess')
plt.plot(x, y_est, 'g.', label='Fitted')
plt.legend()
plt.show()
You can use Gaussian mixture models from scikit-learn:
from sklearn import mixture
from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np
# Updated for newer releases: GaussianMixture replaces the removed GMM class,
# norm.pdf replaces matplotlib.mlab.normpdf, and density replaces normed.
# yourdata must have shape (n_samples, 1).
clf = mixture.GaussianMixture(n_components=2, covariance_type='full')
clf.fit(yourdata)
m1, m2 = clf.means_.ravel()
w1, w2 = clf.weights_
c1, c2 = clf.covariances_.ravel()
histdist = plt.hist(np.ravel(yourdata), 100, density=True)
plotgauss1 = lambda x: plt.plot(x, w1*norm.pdf(x, m1, np.sqrt(c1)), linewidth=3)
plotgauss2 = lambda x: plt.plot(x, w2*norm.pdf(x, m2, np.sqrt(c2)), linewidth=3)
plotgauss1(histdist[1])
plotgauss2(histdist[1])
You can also use the function below to fit as many Gaussians as you want, set with the ncomp parameter:
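from sklearn import mixture
from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np
def fit_mixture(data, ncomp=2, doplot=False):
    # data must have shape (n_samples, 1); GaussianMixture replaces the removed GMM class
    clf = mixture.GaussianMixture(n_components=ncomp, covariance_type='full')
    clf.fit(data)
    ml = clf.means_
    wl = clf.weights_
    cl = clf.covariances_
    ms = [m[0] for m in ml]
    cs = [np.sqrt(c[0][0]) for c in cl]  # standard deviations
    ws = [w for w in wl]
    if doplot:
        histo = plt.hist(np.ravel(data), 200, density=True)
        for w, m, c in zip(ws, ms, cs):
            # c is already a standard deviation, so no second square root here
            plt.plot(histo[1], w*norm.pdf(histo[1], m, c), linewidth=3)
    return ms, cs, ws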
coeffs 0 and 4 are degenerate - there is absolutely nothing in the data that can decide between them. you should use a single zero level parameter instead of two (ie remove one of them from your code). this is probably what is stopping your fit (ignore the comments here saying this is not possible - there are clearly at least two peaks in that data and you should certainly be able to fit to that).
(it may not be clear why i am suggesting this, but what is happening is that coeffs 0 and 4 can cancel each other out. they can both be zero, or one could be 100 and the other -100 - either way, the fit is just as good. this "confuses" the fitting routine, which spends its time trying to work out what they should be, when there is no single right answer, because whatever value one is, the other can just be the negative of that, and the fit will be the same).
in fact, from the plot, it looks like there may be no need for a zero level at all. i would try dropping both of those and seeing how the fit looks.
also, there is no need to fit coeffs 1 and 5 (or the zero point) in the least squares. instead, because the model is linear in those you could calculate their values each loop. this will make things faster, but is not critical. i just noticed you say your maths is not so good, so probably ignore this one.
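a rough sketch of that last suggestion, combined with the single zero level from above, is given here. it is written on made-up data with hypothetical helper names (design_matrix, residuals), uses the same width convention as the model in the question, and swaps in scipy's least_squares for leastsq, so treat it as an illustration rather than a drop-in replacement. the search only runs over centres and widths; the baseline and the two amplitudes are solved by linear least squares inside each residual evaluation.

```python
import numpy as np
from scipy.optimize import least_squares


def design_matrix(t, centres, widths):
    """Columns: one constant baseline, then one unit-height Gaussian per peak."""
    cols = [np.ones_like(t)]
    for c, w in zip(centres, widths):
        cols.append(np.exp(-((t - c) / w) ** 2))
    return np.column_stack(cols)


def residuals(nonlinear, t, counts):
    """Residuals after solving baseline and amplitudes by linear least squares."""
    n = len(nonlinear) // 2              # first half: centres, second half: widths
    M = design_matrix(t, nonlinear[:n], nonlinear[n:])
    linear, *_ = np.linalg.lstsq(M, counts, rcond=None)
    return counts - M @ linear


# Made-up spectrum: main peak with a shoulder on its left, plus a small baseline
t = np.linspace(0.0, 30.0, 300)
counts = design_matrix(t, [16.2, 13.5], [1.5, 1.0]) @ np.array([10.0, 6000.0, 2000.0])

start = np.array([16.0, 13.0, 2.0, 1.5])  # centres, then widths
fit = least_squares(residuals, start, args=(t, counts))

# Recover the linear parameters at the fitted centres and widths
M = design_matrix(t, fit.x[:2], fit.x[2:])
baseline, amp1, amp2 = np.linalg.lstsq(M, counts, rcond=None)[0]
print(fit.x, baseline, amp1, amp2)
```

the optimiser now searches four parameters instead of eight, and the degenerate pair of zero levels is gone because there is only one baseline column.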