I need to fit an experimental histogram with a simulated one (to determine the parameters of the simulated histogram with which it fits best). I've tried curve_fit from scipy.optimize, but it does not work in this case: it returns the error "... is not a python function". Is it possible to do this automatically in scipy or some other Python module? If not, could you please give some links to suitable algorithms so I can implement them myself?
From what you have said, I think the following should help; it seems you're trying to use curve_fit in the wrong way:
You need to define the distribution you are trying to fit. For example, if I have some data that looks normally distributed and I want to know how well, and with what parameters, a normal distribution fits it, I would do the following:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
# Create fake data and run it through `histogram` to get the experimental distribution
experimental = np.random.normal(10.0, 0.4, size=10000)
n, bins = np.histogram(experimental, bins=100, density=True)
# This just gives the mid points of the bins, many different (and better) ways exist to
# do this I'm sure
bins_mid_points = (0.5*(bins + np.roll(bins, 1)))[1:]
# Define the normal distribution as a function
def normal(x, sigma, mu):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2*np.pi))
# Fit the experimental data,
popt, pcov = curve_fit(normal, xdata=bins_mid_points, ydata=n)
# Plot both
plt.bar(bins[:-1], n, width=np.diff(bins))
plt.plot(bins_mid_points, normal(bins_mid_points, *popt), color="r", lw=3)
The red line shows our simulated fit; if required, you could also plot this as a histogram. The output popt gives an array [sigma, mu] of the parameters that best fit the data, while pcov can be used to determine how good the fit is.
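As a quick sketch of how pcov can be used (assuming the fit above): the square roots of its diagonal give one-sigma uncertainties for the fitted parameters.
perr = np.sqrt(np.diag(pcov))  # one-sigma uncertainties for [sigma, mu]
print("sigma = %.4f +/- %.4f" % (popt[0], perr[0]))
print("mu    = %.4f +/- %.4f" % (popt[1], perr[1]))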
Note that I normalised the data in histogram; this is because the function I defined is the normal distribution (a probability density), so the histogram must be normalised to be comparable.
You need to think carefully about what distribution you expect and what statistic you're looking to get from it.
Related
I have an experimental dataset which gives intensity as a function of energy. These are arrays of 1800 data points.
I have been trying to fit a model to this data, given by the equation below:
Imodel = I0 * ((math.cos(phi) + beta*f1)**2 + (math.sin(phi) + beta*f2)**2) + Ioff
I have 2 other datasets of f1 vs. energy and f2 vs. energy. These are arrays of 700 data points, albeit over the same energy range as the first dataset.
I want to use this model function together with the f1 and f2 data to find the optimal values of the other 4 parameters (I0, phi, beta, Ioff) with which this model function best fits the experimental dataset.
I have been looking into curve_fit and least_squares from the scipy.optimize package, as well as linear regression packages such as lmfit and scikit, but to no avail.
Can anyone help? Thanks.
Presently I have no representative data from Ayrtonb1 with which to test the method proposed below. The method seems sound on theoretical grounds, but one cannot be sure that it will be satisfactory with the OP's data.
Nevertheless, a preliminary test was carried out with "toy" data (shown below).
I suppose that the screen copy below is sufficient to understand the method and to reproduce the calculation with real data.
The result of this preliminary test is rather good:
LRMSE < 2 for a range up to 600 (least root mean square error).
LRMSRE < 2% (least root mean square relative error).
Your data and formula look suspiciously like resonant (or anomalous) X-ray diffraction data, with measurements of scattered intensity going across the Zn K-edge. Although you do not say this, the discussion here will assume that. You say you have 1800 measurements, presumably as a function of X-ray energy in eV. The resonant scattering factors (f1, f2) you show seem to be idealized and possibly "typical", and perhaps not specifically for the Zn K-edge -- at the very least the energy scale shown is not the same as your data.
You will want to treat the data and model the intensity as a function of X-ray energy. And you will want realistic values for f1 and f2 for the element of interest, and at the actual energy points for your data. I recommend using xraydb (full disclosure: I am the lead author) [pip install xraydb], so that you can do
import numpy as np
import xraydb
#edata, idata = function_to_extract_your_data()
# or perhaps testing with
edata = np.linspace(9500, 10500, 501)
f1 = xraydb.f1_chantler('Zn', edata)
f2 = xraydb.f2_chantler('Zn', edata)
As written, your intensity function does not directly depend on energy, though it might at a later date, say to make that offset be linear in energy, not just a constant. You might write a function like:
def intensity(en, phi, beta, scale=1, slope=0, offset=0, f1=-1, f2=1):
    costerm = np.cos(phi) + beta*f1
    sinterm = np.sin(phi) + beta*f2
    return scale * (costerm**2 + sinterm**2) + slope*en + offset
With that you can start just trying out some values to get a feel for the function and how it compares to your data:
import matplotlib.pyplot as plt
beta = 0.025 # Wild Guess!
for phi in np.pi*np.arange(20)/10:
    plt.plot(edata, intensity(edata, phi, beta, f1=f1, f2=f2), label='%.1f' % phi)
plt.legend()
plt.show()
It kind of looks like your value for phi would be around 5.5 to 6 (or -0.8 to -0.3). You could also try different values of beta and plot that with your actual data.
With that model for intensity and a feel for what the range of parameters is, you could then try to fit your data. To do that, I would recommend using lmfit (full disclosure: I am the lead author) [pip install lmfit], where you can create a model from your intensity model function - this will use the names of the function arguments to name the fitting parameters.
from lmfit import Model
imodel = Model(intensity, independent_vars=['en', 'f1', 'f2'])
params = imodel.make_params(scale=1, offset=0, slope=0, beta=0.1, phi=5.5)
That independent_vars will tell Model to not make fitting Parameters for the function arguments f1 and f2 and to expect them to be passed into any evaluation or fit. The other function arguments (scale, phi, etc) will be expected to become fitting variables. You do have to create a "Parameters" object and must give initial values for all parameters. You can put bounds on these or fix them so they are not altered in the fit, as with
params['scale'].min = 0 # force scale to be positive
params['slope'].vary = False # slope will be fixed at 0.
You can then evaluate the model with
init_value = imodel.eval(params, en=edata, f1=f1, f2=f2)
And then eventually do a fit with
result = imodel.fit(idata, params, en=edata, f1=f1, f2=f2)
print(result.fit_report())
plt.plot(edata, idata, label='data')
plt.plot(edata, init_value, label='initial fit')
plt.plot(edata, result.best_fit, label='best fit')
plt.legend()
plt.show()
Finally, for analysis of X-ray resonant scattering be sure to consider including absorption corrections in that intensity calculation. As you go across the Zn K edge, the absorption depth of the sample may change dramatically if the Zn concentration is high.
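As a rough illustration only (this is not part of the original analysis, and the sample, density, and thickness below are invented for the example), xraydb can also provide absorption coefficients, from which you could estimate an energy-dependent transmission factor:
import numpy as np
import xraydb

edata = np.linspace(9500, 10500, 501)
mu_mass = xraydb.mu_elam('Zn', edata)   # mass attenuation coefficient for Zn, cm^2/g
mu_lin = mu_mass * 7.14                 # linear coefficient, 1/cm, using the density of Zn (~7.14 g/cm^3)
thickness_cm = 5.0e-4                   # hypothetical 5 micron path length
transmission = np.exp(-mu_lin * thickness_cm)
# A crude correction would scale the modelled intensity by this transmission
# before comparing it with the measured data.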
I am having a hard time trying to understand why my Gaussian fit to a set of data (ydata) does not work well if I shift the interval of x-values corresponding to that data (xdata1 to xdata2). The Gaussian is written as
G(x) = A * exp(-(x - mean)**2 / (2*sigma**2)) / (sigma*sqrt(2*pi))
where A is just an amplitude factor. Changing some of the values of the data, it is easy to make it work for both cases, but one can also easily find cases in which it does not work well for xdata1, and also cases in which the covariance of the parameters is not estimated.
I am using scipy.optimize.curve_fit in Spyder with Python 3.7.1 on Windows 7.
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
xdata1 = np.linspace(-9,4,20, endpoint=True) # works fine
xdata2 = xdata1+2
ydata = np.array([8,9,15,12,14,20,24,40,54,94,160,290,400,420,300,130,40,10,8,4])
def gaussian(x, amp, mean, sigma):
    return amp*np.exp(-(((x-mean)**2)/(2*sigma**2)))/(sigma*np.sqrt(2*np.pi))
popt1, pcov1 = curve_fit(gaussian, xdata1, ydata)
popt2, pcov2 = curve_fit(gaussian, xdata2, ydata)
fig, ([ax1, ax2]) = plt.subplots(nrows=1, ncols=2,figsize=(9, 4))
ax1.plot(xdata1, ydata, 'b+:', label='xdata1')
ax1.plot(xdata1, gaussian(xdata1, *popt1), 'r-', label='fit')
ax1.legend()
ax2.plot(xdata2, ydata, 'b+:', label='xdata2')
ax2.plot(xdata2, gaussian(xdata2, *popt2), 'r-', label='fit')
ax2.legend()
The problem is that your second attempt at fitting a Gaussian is getting stuck in a local minimum while searching parameter space: curve_fit is a wrapper around SciPy's least-squares solvers (Levenberg-Marquardt by default), which perform local, gradient-based minimisation of the cost function, and this is liable to get stuck in local minima.
You should try providing reasonable starting parameters (by using the p0 argument of curve_fit) to avoid this:
# ... your code
y_max = np.max(ydata)
max_pos = xdata2[ydata == y_max][0]   # x position of the peak
initial_guess = [y_max, max_pos, 1]   # amplitude, mean, std
popt2, pcov2 = curve_fit(gaussian, xdata2, ydata, p0=initial_guess)
Which as you can see provides a reasonable fit:
You should write a function which can provide reasonable estimates of the starting parameters. Here I just found the maximum y value and the x position where it occurs, and used these to determine the initial parameters. I've found this works well for fitting normal distributions, but you could consider other methods.
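For example, a minimal sketch of such a helper (not from the original answer) that uses the peak position plus a moment-based width estimate might look like this:
def estimate_initial_guess(x, y):
    # Peak height -> amplitude guess, peak position -> mean guess,
    # weighted standard deviation about the peak -> rough sigma guess.
    amp_guess = np.max(y)
    mean_guess = x[np.argmax(y)]
    sigma_guess = np.sqrt(np.sum(y * (x - mean_guess)**2) / np.sum(y))
    return [amp_guess, mean_guess, sigma_guess]

popt2, pcov2 = curve_fit(gaussian, xdata2, ydata, p0=estimate_initial_guess(xdata2, ydata))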
Edit:
You can also solve the problem by scaling the amplitude: the amplitude is so large that the parameter space is distorted, and the optimiser simply follows the direction of greatest change in the amplitude while effectively ignoring sigma. Consider the following plot of parameter space (colour is the sum of the squared residuals of the fit for the given parameters, and the white cross shows the optimal solution):
Make sure to note the different scales for the x and y axes.
One needs to make a large number of 'unit'-sized steps in y (amplitude) to get to the minimum from the point (x, y) = (0, 0), whereas less than one 'unit'-sized step is needed to reach the minimum in x (sigma). The algorithm simply takes steps in amplitude, as this is the steepest gradient. When it reaches the amplitude which minimises the cost function, the algorithm simply stops, as it appears to have converged, and makes little or no change to the sigma parameter.
One way to fix this is to scale your ydata to un-distort the parameter space: divide your ydata by 100 and you will see your fit works without providing any starting parameters!
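A minimal sketch of that rescaling, as suggested above (remember to scale the fitted amplitude back up afterwards; mean and sigma are unaffected):
scale = 100.0
popt2s, pcov2s = curve_fit(gaussian, xdata2, ydata / scale)
amp_fit, mean_fit, sigma_fit = popt2s
amp_fit *= scale  # undo the scaling of the amplitude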
I am relatively new to coding and have recently been asked by my research professor to create a log-regression script to fit functions to experimental data. The input data always consists of small values (<10 for x and y) and contains no more than 5 data points per experiment.
I have managed to write a script using scipy's curve_fit function, but the results I receive don't look correct. Would anyone be able to point out my errors in this simple script?
I have done my own research on the use of the curve_fit function, but it seems everyone's case is very specific, and any solution that I have found doesn't work with mine for some reason...
This is my code with a small bit of sample data:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy import log as log
def func(x, a, b, c):
    return a * log(b * x) + c
x = np.array([0, 1.1029, 1.6148])
y = np.array([-8.5067, -6.8924, -6.713])
popt, pcov = curve_fit (func, x, y)
plt.figure()
plt.plot(x, y, 'k.', label = 'Raw Data')
plt.plot(x, func(x, *popt), 'k-', label = 'Fitted Curve')
plt.xlabel('ln(x)')
plt.ylabel('y')
plt.legend()
plt.show()
This is the graph output that I get;
Graphical Output
Also note after line 12 is executed (the line with the curve_fit function), the following error appears;
RuntimeError: Optimal parameters not found: Number of calls to function has reached maxfev = 800.
What does this error mean?
Also, as a bonus question, does anyone know what the best way would be to modify this into a multiple-parameter regression? I understand that if you add more variables you may be able to adjust the function in lines 6-7 accordingly? (A general sketch of that pattern is shown below, after this question.)
Thank you to whoever offers their expertise, it's much appreciated!
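On the bonus question (this is a general curve_fit pattern, not specific to the log model above): curve_fit can handle several independent variables if you pass them together and unpack them inside the model function. A hypothetical sketch with invented data:
# Hypothetical two-variable example, values invented for illustration
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y2 = 2.0*np.log(x1) + 3.0*x2 + 1.0

def func2(X, a, b, c):
    x1, x2 = X
    return a*np.log(x1) + b*x2 + c

popt2, pcov2 = curve_fit(func2, (x1, x2), y2)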
I have a data set which, in theory, is described by a polynomial of the second degree. I would like to fit this data, and I have used numpy.polyfit to do this. However, the downside is that the error on the returned coefficients is not available. Therefore I decided to also fit the data using scipy.odr. The weird thing was that the coefficients for the polynomial deviated from each other.
I do not understand this, and therefore decided to test both fitting routines on a set of data that I produce myself:
import numpy
import scipy.odr
import matplotlib.pyplot as plt
x = numpy.arange(-20, 20, 0.1)
y = 1.8 * x**2 -2.1 * x + 0.6 + numpy.random.normal(scale = 100, size = len(x))
#Define function for scipy.odr
def fit_func(p, t):
    return p[0] * t**2 + p[1] * t + p[2]
#Fit the data using numpy.polyfit
fit_np = numpy.polyfit(x, y, 2)
#Fit the data using scipy.odr
Model = scipy.odr.Model(fit_func)
Data = scipy.odr.RealData(x, y)
Odr = scipy.odr.ODR(Data, Model, [1.5, -2, 1], maxit = 10000)
output = Odr.run()
#output.pprint()
beta = output.beta
betastd = output.sd_beta
print "poly", fit_np
print "ODR", beta
plt.plot(x, y, "bo")
plt.plot(x, numpy.polyval(fit_np, x), "r--", lw = 2)
plt.plot(x, fit_func(beta, x), "g--", lw = 2)
plt.tight_layout()
plt.show()
An example of an outcome is as follows:
poly [ 1.77992643 -2.42753714 3.86331152]
ODR [ 3.8161735 -23.08952492 -146.76214989]
In the included image, the solution of numpy.polyfit (red dashed line) corresponds pretty well. The solution of scipy.odr (green dashed line) is basically completely off. I do have to note that the difference between numpy.polyfit and scipy.odr was smaller in the actual data set I wanted to fit. However, I do not understand where the difference between the two comes from, why in my own test example the difference is so large, and which fitting routine is better.
I hope you can provide answers that help me better understand the two fitting routines and, in the process, answer the questions I have.
In the way you are using ODR it does a full orthogonal distance regression. To have it do a normal nonlinear least-squares fit instead, add
Odr.set_job(fit_type=2)
before starting the optimization and you will get what you expected.
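Putting that together with the code from the question, a minimal sketch of the least-squares variant looks like:
Model = scipy.odr.Model(fit_func)
Data = scipy.odr.RealData(x, y)
Odr = scipy.odr.ODR(Data, Model, [1.5, -2, 1], maxit=10000)
Odr.set_job(fit_type=2)   # ordinary nonlinear least squares instead of full ODR
output = Odr.run()
print("ODR (fit_type=2):", output.beta)
print("standard deviations:", output.sd_beta)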
The reason that the full ODR fails so badly is that no weights/standard deviations were specified. It is obviously hard to interpret that point cloud under the assumption of equal weights for x and y. If you provide estimated standard deviations, odr will yield a good (though of course different) result, too.
Data = scipy.odr.RealData(x, y, sx=0.1, sy=10)
The actual problem is that the odr output has the beta coefficients in the opposite order than numpy.polyfit has. So the green curve is not calculated correctly. To plot it, use instead
plt.plot(x, fit_func(beta[::-1], x), "g--", lw = 2)
I did some work and finally got data whose shape looks like a sinc function. I tried to find out how to fit a graph to a sinc function using numpy, and I found this:
Fitting a variable Sinc function in python
It's good that I found it, but it seems quite complicated to me.
Can you give me a more friendly way of fitting a graph that gives me a curve like a sinc function?
Well, to perform the fitting, the answer provided in the link you have given is good enough. But since you say you find it difficult, here is some example code with data in the form of a sine curve and a user-defined function that fits the data.
Here is the code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import math
xdata = np.array([2.65, 2.80, 2.96, 3.80, 3.90, 4.60, 4.80, 4.90, 5.65, 5.92])
ydata = np.sin(xdata)
def func(x, p1, p2, p3): # HERE WE DEFINE A SIN FUNCTION THAT WE THINK WILL FOLLOW THE DATA DISTRIBUTION
    return p1*np.sin(x*p2 + p3)
# Here you give the initial parameters for p0 which Python then iterates over
# to find the best fit
popt, pcov = curve_fit(func,xdata,ydata,p0=(1.0,1.0,1.0)) #THESE PARAMETERS ARE USER DEFINED
print(popt) # This contains your three best-fit parameters
# Performing sum of squares
p1 = popt[0]
p2 = popt[1]
p3 = popt[2]
residuals = ydata - func(xdata,p1,p2,p3)
fres = sum(residuals**2)
print(fres) # sum of squared residuals (the chi-square value if all uncertainties are 1)
xaxis = np.linspace(1,7,100) # we can plot with xdata, but fit will not look good
curve_y = func(xaxis,p1,p2,p3)
plt.plot(xdata,ydata,'*')
plt.plot(xaxis,curve_y,'-')
plt.show()
You can also visit this website and learn step by step how curve_fit works.
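If your data really is sinc-shaped rather than sine-shaped, the same recipe applies; here is a sketch with synthetic sinc data (the parameter values and noise level are made up for illustration):
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# Synthetic sinc-like data, purely for illustration
x = np.linspace(-10, 10, 200)
y = 3.0*np.sinc(0.8*(x - 1.5)) + 0.2 + np.random.normal(0, 0.05, x.size)

def sinc_model(x, a, b, x0, c):
    # np.sinc(t) = sin(pi*t)/(pi*t)
    return a*np.sinc(b*(x - x0)) + c

popt, pcov = curve_fit(sinc_model, x, y, p0=(3.0, 1.0, 1.0, 0.0))
print(popt)

plt.plot(x, y, '*')
plt.plot(x, sinc_model(x, *popt), '-')
plt.show()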