I have the long-term aim of creating a module that for a specific data set, fits segmented regressions up to an arbitrary number of breakpoints, as well as a standard polynomial and linear curve fit, and then evaluates which of the fits are the most appropriate for the data (likely using AIC or BIC).
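For the comparison step I have in mind the least-squares forms of AIC and BIC; a minimal sketch of the helper I am picturing (the function and its name are illustrative only):

import numpy as np

def aic_bic(residuals, n_params):
    # Least-squares AIC/BIC, up to an additive constant; lower is better
    n = len(residuals)
    rss = np.sum(np.square(residuals))
    aic = n * np.log(rss / n) + 2 * n_params
    bic = n * np.log(rss / n) + n_params * np.log(n)
    return aic, bic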
I have a function that uses differential evolution to use segmented regression on an x and y dataset assuming 1 breakpoint:
import warnings
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import differential_evolution, curve_fit

def segReg_one(xData, yData):
    def func(xVals, model_break, slopeA, slopeB, offsetA, offsetB):  # Initialization of the piecewise function
        returnArray = []
        for x in xVals:
            if x > model_break:
                returnArray.append(slopeA * x + offsetA)
            else:
                returnArray.append(slopeB * x + offsetB)
        return returnArray

    def sumSquaredError(parametersTuple):  # Definition of an error function to minimize
        modely = func(xData, *parametersTuple)
        warnings.filterwarnings("ignore")  # Ignore warnings by genetic algorithm
        return np.sum((yData - modely) ** 2.0)

    def generate_genetic_Parameters():
        initial_parameters = []
        x_max = np.max(xData)
        x_min = np.min(xData)
        y_max = np.max(yData)
        y_min = np.min(yData)
        slope = 10 * (y_max - y_min) / (x_max - x_min)
        initial_parameters.append([x_min, x_max])   # Bounds for model break point
        initial_parameters.append([-slope, slope])  # Bounds for slopeA
        initial_parameters.append([-slope, slope])  # Bounds for slopeB
        initial_parameters.append([y_min, y_max])   # Bounds for offsetA
        initial_parameters.append([y_min, y_max])   # Bounds for offsetB
        result = differential_evolution(sumSquaredError, initial_parameters, seed=3)
        return result.x

    geneticParameters = generate_genetic_Parameters()  # Generates genetic parameters
    fittedParameters, pcov = curve_fit(func, xData, yData, geneticParameters)  # Fits the data

    print('Parameters:', fittedParameters)
    print('Model break at: ', fittedParameters[0])
    print('Slope of line where x > model break: ', fittedParameters[1])
    print('Slope of line where x < model break: ', fittedParameters[2])
    print('Offset of line where x > model break: ', fittedParameters[3])
    print('Offset of line where x < model break: ', fittedParameters[4])

    model = func(xData, *fittedParameters)
    absError = model - yData
    SE = np.square(absError)
    MSE = np.mean(SE)
    RMSE = np.sqrt(MSE)
    Rsquared = 1.0 - (np.var(absError) / np.var(yData))
    print()
    print('RMSE:', RMSE)
    print('R-squared:', Rsquared)

    def ModelAndScatterPlot(graphWidth, graphHeight):
        f = plt.figure(figsize=(graphWidth / 100.0, graphHeight / 100.0), dpi=100)
        axes = f.add_subplot(111)
        axes.plot(xData, yData, 'D')  # Data points
        xModel = np.linspace(min(xData), max(xData))
        yModel = func(xModel, *fittedParameters)
        axes.plot(xModel, yModel)  # Fitted curve
        axes.set_xlabel('X Data')  # X axis data label
        axes.set_ylabel('Y Data')  # Y axis data label
        plt.show()
        plt.close('all')

    graphWidth = 800
    graphHeight = 600
    return ModelAndScatterPlot(graphWidth, graphHeight)
Which runs fine. However, I tried to expand the model to allow for more than 1 breakpoint:
def segReg_two(xData, yData):
    def func(xData, break1, break2, slope1, slope_mid, slope2, offset1, offset_mid, offset2):
        returnArray = []
        for x in xData:
            if x < break1:
                returnArray.append(slope1 * x + offset1)
            if (x < break2 and x > break1):
                returnArray.append(slope_mid * x + offset_mid)
            else:
                returnArray.append(slope2 * x + offset2)

    def sumSquaredError(parametersTuple):  # Definition of an error function to minimize
        modely = func(xData, *parametersTuple)
        warnings.filterwarnings("ignore")  # Ignore warnings by genetic algorithm
        return np.sum((yData - modely) ** 2.0)

    def generate_genetic_Parameters():
        initial_parameters = []
        x_max = np.max(xData)
        x_min = np.min(xData)
        y_max = np.max(yData)
        y_min = np.min(yData)
        slope = 10 * (y_max - y_min) / (x_max - x_min)
        initial_parameters.append([x_min, x_max])   # Bounds for break1
        initial_parameters.append([x_min, x_max])   # Bounds for break2
        initial_parameters.append([-slope, slope])  # Bounds for slope1
        initial_parameters.append([-slope, slope])  # Bounds for slope_mid
        initial_parameters.append([-slope, slope])  # Bounds for slope2
        initial_parameters.append([y_min, y_max])   # Bounds for offset1
        initial_parameters.append([y_min, y_max])   # Bounds for offset_mid
        initial_parameters.append([y_min, y_max])   # Bounds for offset2
        result = differential_evolution(sumSquaredError, initial_parameters, seed=3)
        return result.x

    geneticParameters = generate_genetic_Parameters()  # Generates genetic parameters
    fittedParameters, pcov = curve_fit(func, xData, yData, geneticParameters)  # Fits the data

    print('Parameters:', fittedParameters)
    print('Model break at: ', fittedParameters[0])
    print('Slope of line where x < model break: ', fittedParameters[1])
    print('Slope of line where x > model break: ', fittedParameters[2])
    print('Offset of line where x < model break: ', fittedParameters[3])
    print('Offset of line where x > model break: ', fittedParameters[4])

    model = func(xData, *fittedParameters)
    absError = model - yData
    SE = np.square(absError)
    MSE = np.mean(SE)
    RMSE = np.sqrt(MSE)
    Rsquared = 1.0 - (np.var(absError) / np.var(yData))
    print()
    print('RMSE:', RMSE)
    print('R-squared:', Rsquared)

    def ModelAndScatterPlot(graphWidth, graphHeight):
        f = plt.figure(figsize=(graphWidth / 100.0, graphHeight / 100.0), dpi=100)
        axes = f.add_subplot(111)
        axes.plot(xData, yData, 'D')  # Data points
        xModel = np.linspace(min(xData), max(xData))
        yModel = func(xModel, *fittedParameters)
        axes.plot(xModel, yModel)  # Fitted curve
        axes.set_xlabel('X Data')  # X axis data label
        axes.set_ylabel('Y Data')  # Y axis data label
        plt.show()
        plt.close('all')

    graphWidth = 800
    graphHeight = 600
    return ModelAndScatterPlot(graphWidth, graphHeight)
And this code runs into problems when I run segReg_two(x,y), stopping at the differential_evolution bit:
TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'
During handling of the above exception, another exception occurred:
RuntimeError: The map-like callable must be of the form f(func, iterable), returning a sequence of numbers the same length as 'iterable'
I didn't have this problem with segReg_one, so I don't see why it's happening here. I am assuming (and I may be incorrect in this assumption) that the argument iterable must have dimensions compatible with my error function. However, I'm not exactly sure how those two arguments relate, other than the fact that I'm finding the breakpoints, slopes, and offsets that minimize the sum of squared errors given the bounds I have.
Also, my plan of attack seems extremely long-winded and brutish. Is there a better way to tackle this?
I think perhaps it is treating my piecewise function as None-type: printing the function with some random values returned simply "None". However, the piecewise function in segReg_one prints the same thing, and it still worked out fine.
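(For what it's worth, the immediate cause of the NoneType error is visible in segReg_two: its func never returns returnArray, so calling it yields None, and the middle branch is a separate if rather than an elif, so points below break1 get appended twice. A corrected sketch of just that function:)

def func(xData, break1, break2, slope1, slope_mid, slope2, offset1, offset_mid, offset2):
    returnArray = []
    for x in xData:
        if x < break1:
            returnArray.append(slope1 * x + offset1)
        elif x < break2:
            returnArray.append(slope_mid * x + offset_mid)
        else:
            returnArray.append(slope2 * x + offset2)
    return returnArray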
If you are not tied to using differential evolution, the piecewise-regression package fits segmented models to data using an iterative algorithm, and it has a model comparison tool based on the Bayesian Information Criterion (BIC).
Here is some data generated from a model with 2 breakpoints:
x = [0.0, 0.3, 0.5, 0.8, 1.0, 1.3, 1.5, 1.8, 2.0, 2.3, 2.5, 2.8, 3.0, 3.3, 3.5, 3.8, 4.0, 4.3, 4.5, 4.8, 5.1, 5.3, 5.6, 5.8, 6.1, 6.3, 6.6, 6.8, 7.1, 7.3, 7.6, 7.8, 8.1, 8.3, 8.6, 8.8, 9.1, 9.3, 9.6, 9.8, 10.1, 10.4, 10.6, 10.9, 11.1, 11.4, 11.6, 11.9, 12.1, 12.4, 12.6, 12.9, 13.1, 13.4, 13.6, 13.9, 14.1, 14.4, 14.6, 14.9, 15.2, 15.4, 15.7, 15.9, 16.2, 16.4, 16.7, 16.9, 17.2, 17.4, 17.7, 17.9, 18.2, 18.4, 18.7, 18.9, 19.2, 19.4, 19.7, 19.9, 20.2, 20.5, 20.7, 21.0, 21.2, 21.5, 21.7, 22.0, 22.2, 22.5, 22.7, 23.0, 23.2, 23.5, 23.7, 24.0, 24.2, 24.5, 24.7, 25.0]
y = [16.2, -5.5, -4.0, -8.8, 11.2, -19.9, 21.2, -3.2, 8.2, 3.2, 20.9, -13.7, 4.4, 4.4, 20.2, -1.5, 8.4, 2.0, 11.8, 17.8, 1.6, 24.7, 22.9, 19.5, 24.7, 11.9, 20.6, 15.5, 25.2, 36.2, 27.0, 33.0, 33.1, 34.5, 39.3, 48.9, 40.9, 57.5, 74.7, 68.6, 62.3, 58.4, 62.8, 90.2, 76.8, 73.0, 84.3, 106.4, 89.7, 97.7, 97.5, 94.0, 89.2, 100.1, 104.5, 115.5, 121.1, 125.0, 121.6, 130.6, 115.8, 136.3, 129.4, 121.8, 130.2, 125.1, 137.6, 142.0, 149.2, 113.9, 113.9, 123.8, 131.0, 138.6, 133.5, 110.7, 128.3, 140.2, 134.7, 140.5, 131.2, 131.9, 136.3, 139.0, 137.4, 137.1, 129.7, 140.7, 138.7, 149.2, 150.4, 140.8, 135.7, 133.6, 144.7, 141.8, 138.0, 142.4, 136.3, 150.0]
using piecewise-regression's model comparison tool:
import piecewise_regression
piecewise_regression.ModelSelection(x, y)
That suggests a model with 2 breakpoints, based on the BIC.
We can also plot a fit with 2 breakpoints:
pw_fit = piecewise_regression.Fit(x, y, n_breakpoints=2)
pw_fit.plot()
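The fitted estimates can also be printed; a short follow-up, assuming the package's summary API:

pw_fit.summary()  # prints breakpoint estimates, segment slopes, and confidence intervals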
Related
I am trying to fit Residence Time Distribution (RTD) data. RTD is typically a skewed distribution. I have built a simple code that takes this non-equally-spaced time data set from the RTD.
Data set:
timeArray = [0.0, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 12.0, 14.0]
concArray = [0.0, 0.6, 1.4, 5.0, 8.0, 10.0, 8.0, 6.0, 4.0, 3.0, 2.2, 1.5, 0.6, 0.0]
To fit the data I have been using Python's curve_fit function:
parameters, covariance = curve_fit(nCSTR, time, conc, p0=guess)
with different sets of models (e.g. CSTR, Sine, Gauss). However, no success so far.
The RTD data that I have corresponds to a CSTR, and there is an equation that models this type of behavior very accurately:
# Generalized nCSTR model
y = ((np.power(x/tau, n-1) * np.power(n, n)) / (tau * math.gamma(n))) * np.exp(-n*x/tau)
As a separate note: in the generalized nCSTR model I am using the gamma function instead of (n-1)! factorial terms, because of the complexity of dealing with decimal (non-integer) values of n in factorial terms.
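For integer n the substitution is exact, since Gamma(n) = (n-1)!; a quick check:

import math
print(math.gamma(5), math.factorial(4))  # 24.0 24, i.e. Gamma(n) == (n-1)!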
This CSTR model should fit the data without problems, but for some reason it is not able to do so. Here is my code:
timeArray = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5, 13.0, 13.5, 14.0]
concArray = [0.0, 0.6, 1.4, 2.6, 5.0, 6.5, 8.0, 9.0, 10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.5, 3.0, 2.5, 2.2, 1.8, 1.5, 1.2, 1.0, 0.8, 0.6, 0.5, 0.3, 0.1, 0.0]
import math
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# Recast time and conc into numpy arrays
time = np.asarray(timeArray)
conc = np.asarray(concArray)
plt.plot(time, conc, 'o')

def nCSTR(x, tau, n):
    y = ((np.power(x/tau, n-1) * np.power(n, n)) / (tau * math.gamma(n))) * np.exp(-n*x/tau)
    return y

guess = [1, 12]
parameters, covariance = curve_fit(nCSTR, time, conc, p0=guess)
tau = parameters[0]
n = parameters[1]

y = np.arange(0.0, len(time), 1.0)
for i in range(len(timeArray)):
    y[i] = ((np.power(time[i]/tau, n-1) * np.power(n, n)) / (tau * math.gamma(n))) * np.exp(-n*time[i]/tau)
plt.plot(time, y)
The outcome after executing the code is this plot: Fitting Output (image)
I know I am missing something, and any help will be much appreciated. The model has been well known for decades, so the issue should not be the equation. I made some dummy data to confirm that the equation is written correctly, and the output was the same type of profile that I am looking for. In the end, the equation is fine:
import math
import numpy as np
import matplotlib.pyplot as plt

t = np.arange(0.0, 10.5, 0.5)
tau = 2
n = 5
y = np.arange(0.0, len(t), 1.0)
for i in range(len(t)):
    y[i] = ((np.power(t[i]/tau, n-1) * np.power(n, n)) / (tau * math.gamma(n))) * np.exp(-n*t[i]/tau)
print(y)
plt.plot(t, y)
CSTR profile with Dummy Data (image)
If anyone is interested in the theory behind it, I recommend reading about Tanks in Series models (specifically the CSTR); Fogler has a great book on this topic.
I think that the main problem is that your model does not allow for an overall scale factor or that your data may not be normalized as you expect.
If you'll permit me to convert your curve-fitting program to use lmfit (I am a lead author), you might do:
import numpy as np
from scipy.special import gamma
import matplotlib.pyplot as plt
from lmfit import Model
timeArray = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5, 13.0, 13.5, 14.0]
concArray = [0.0, 0.6, 1.4, 2.6, 5.0, 6.5, 8.0, 9.0, 10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.5, 3.0, 2.5, 2.2, 1.8, 1.5, 1.2, 1.0, 0.8, 0.6, 0.5, 0.3, 0.1, 0.0]
#Recast time and conc into numpy arrays
time = np.asarray(timeArray)
conc = np.asarray(concArray)
plt.plot(time, conc, 'o', label='data')
def nCSTR(x, scale, tau, n):
    """scaled CSTR model"""
    z = n*x/tau
    return scale * np.exp(-z) * z**(n-1) * (n/(tau*gamma(n)))
# create a Model for your model function
cmodel = Model(nCSTR)
# now create a set of Parameters for your model (note that parameters
# are named using your function arguments), and give initial values
params = cmodel.make_params(tau=3, scale=10, n=10)
# since you have `xxx**(n-1)`, setting a lower bound of 1 on `n`
# is wise, otherwise you would have to handle complex values
params['n'].min = 1
# now fit the model to your `conc` data with those parameters
# (and also passing in independent variables using `x`: the argument
# name from the signature of the model function)
result = cmodel.fit(conc, params, x=time)
# print out a report of the results
print(result.fit_report())
# you do not need to construct the best fit yourself, it is in `result`:
plt.plot(time, result.best_fit, label='fit')
plt.legend()
plt.show()
This will print out a report that includes statistics and uncertainties:
[[Model]]
Model(nCSTR)
[[Fit Statistics]]
# fitting method = leastsq
# function evals = 29
# data points = 29
# variables = 3
chi-square = 2.84348862
reduced chi-square = 0.10936495
Akaike info crit = -61.3456602
Bayesian info crit = -57.2437727
R-squared = 0.98989860
[[Variables]]
scale: 49.7615649 +/- 0.81616118 (1.64%) (init = 10)
tau: 5.06327482 +/- 0.05267918 (1.04%) (init = 3)
n: 4.33771512 +/- 0.14012112 (3.23%) (init = 10)
[[Correlations]] (unreported correlations are < 0.100)
C(scale, n) = -0.521
C(scale, tau) = 0.477
C(tau, n) = -0.406
and generate a plot of the data and the best fit (image).
I'm trying to fit the curve of a graph based on a model. The problem is that the function has to fit both the real and the imaginary parts of the solution.
I have tried with curve_fit from scipy but the results are not a proper fit to the curve.
This is the code
(the data to fit is invented, but it should work as an example):
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import math
def long_function(fre, e_inf, e_s, alfa, beta, tau):
    return ((e_s - e_inf) * ((1 + 1j*2*np.pi*fre*tau)**(1 - alfa))**(-beta)) + e_inf

def funcBoth(x, e_inf, e_s, alfa, beta, tau):
    N = len(x)
    x_real = x[:N//2]
    x_imag = x[N//2:]
    y_real = np.real(long_function(x_real, e_inf, e_s, alfa, beta, tau))
    y_imag = np.imag(long_function(x_imag, e_inf, e_s, alfa, beta, tau))
    return np.hstack([y_real, y_imag])

def plot_graph(poptBoth, fre, yReal, yImag):
    # Compute the best-fit solution
    yFit = long_function(fre, *poptBoth)
    print("alfa: {0:.2f}".format(poptBoth[2]))
    print("beta: {0:.2f}".format(poptBoth[3]))
    print("epsilon_infinita: {0:.2f}".format(poptBoth[0]))
    print("epsilon_s: {0:.2f}".format(poptBoth[1]))
    print("tau: ", poptBoth[4])
    # Plot the results
    plt.figure(figsize=(9, 4))
    plt.subplot(121)
    plt.plot(fre, np.real(yFit), label="Best fit")
    plt.plot(fre, yReal, "k.", label="Noisy y")
    plt.ylabel("Real part of y")
    plt.xlabel("x")
    plt.legend()
    plt.subplot(122)
    plt.plot(fre, np.imag(yFit), label="Best fit")
    plt.plot(fre, yImag, "k.", label="Noisy y")
    plt.ylabel("Imaginary part of y")
    plt.xlabel("x")
    plt.tight_layout()
    plt.legend(loc='best')
    plt.show()

def curve_fitter(fre, yReal, yImag):
    yBoth = np.hstack([yReal, yImag])
    poptBoth, pcovBoth = curve_fit(funcBoth, np.hstack([fre, fre]), yBoth, maxfev=500000)  # method='lm', p0=guess
    plot_graph(poptBoth, fre, yReal, yImag)
yReal = [70.0, 68.0, 60.0, 50.0, 42.0, 38.0, 36.0, 35.4, 34.0, 33.0, 32.0, 30.0, 29.1, 28.8, 28.6, 28.4, 28.3, 28.2, 28.2, 28.1, 28.0]
yImag = [17.0, 21.0, 22.5, 23.0, 22.5, 21.0, 19.0, 18.0, 17.3, 16.9, 16.4, 16.3, 16.2, 16.0, 15.7, 15.2, 14.8, 14.7, 14.7, 14.6, 14.5]
fre = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
yReal = np.array(yReal)
yImag = np.array(yImag)
fre = np.array(fre)
curve_fitter(fre, yReal, yImag)
And the result that I get is the following (image):
As you can see, it is not fitting correctly.
I have also tried the minimize() function, but I am not getting results.
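(One thing that may be going wrong: with no p0 supplied, curve_fit starts every parameter at 1.0, which is far from the scale of this data. A hedged sketch of passing a guess inside curve_fitter; the specific values here are assumptions read off the data, not fitted results:)

# Hypothetical initial guess: e_inf, e_s, alfa, beta, tau
guess = [28.0, 70.0, 0.1, 1.0, 0.05]
poptBoth, pcovBoth = curve_fit(funcBoth, np.hstack([fre, fre]), yBoth,
                               p0=guess, maxfev=500000)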
I'm very new to Python and I've looked around on the internet, but couldn't find anything logical that could help me with my problem.
I have precipitation values in a graph, and now I need to fit a GEV distribution to these values. Each value equals the maximum value of a year, from 1974 to 2017 (so there are 43 values in total).
These are the values:
max_precip = [9.4, 38.0, 12.5, 35.3, 17.6, 12.9, 12.4, 19.6, 15.0, 13.2, 12.3, 16.9, 16.9, 29.4, 13.6, 11.1, 8.0, 16.6, 12.0, 13.1, 9.1, 9.7, 21.0, 11.2, 14.4, 18.8, 14.0, 19.9, 12.4, 10.8, 21.6, 15.4, 17.4, 14.8, 22.7, 11.5, 10.5, 11.8, 12.4, 16.6, 11.7, 12.9, 17.8]
I found that I need to use gev.fit, so I tried the following:
t = np.linspace(1,43,43)
fit = gev.fit(max_precip,loc=3)
pdf = gev.pdf(t, *fit)
plt.plot(t,pdf)
plt.plot(t, max_precip, "o")
But this only plots the points of max_precip in a graph and not the GEV distribution.
Can someone help me? Sorry if this question has already been asked; I couldn't find anything like it.
I used these imports:
import csv
import matplotlib.pyplot as plt
import numpy as np
from dateutil.rrule import rrule, YEARLY
import datetime
from matplotlib.dates import DateFormatter
from scipy.stats import genextreme as gev
from scipy.stats import genpareto as gpd
from scipy.optimize import minimize
I've tried to fit your data:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import genextreme as gev
def main(rvs):
    shape, loc, scale = gev.fit(rvs)
    return shape, loc, scale

if __name__ == '__main__':
    rvs = [9.4, 38.0, 12.5, 35.3, 17.6, 12.9, 12.4, 19.6, 15.0, 13.2, 12.3, 16.9, 16.9, 29.4, 13.6, 11.1, 8.0, 16.6, 12.0, 13.1, 9.1, 9.7, 21.0, 11.2, 14.4, 18.8, 14.0, 19.9, 12.4, 10.8, 21.6, 15.4, 17.4, 14.8, 22.7, 11.5, 10.5, 11.8, 12.4, 16.6, 11.7, 12.9, 17.8]
    shape, loc, scale = main(rvs)
    print(shape)
    print(loc)
    print(scale)

    l = loc + scale / shape
    xx = np.linspace(l + 0.00001, l + 0.00001 + 35, num=71)
    yy = gev.pdf(xx, shape, loc, scale)

    hist, bins = np.histogram(rvs, bins=12, range=(-0.5, 23.5), density=True)
    plt.bar(bins[:-1], hist, width=2, align='edge')
    plt.plot(xx, yy, 'ro')
    plt.show()
but what I got back was
-0.21989526255575445
12.749780017954315
3.449061347316184
for shape, loc, and scale. If you look at the GEV distribution as defined in scipy, when the shape is negative the valid interval is [loc + scale/shape, +infinity). I've computed the latter value, and it is equal to
-2.935417290135696
should work...
Python3, Anaconda, scipy 1.1, Windows 10 64bit
UPDATE
OK, I've updated the code and added plotting; it looks somewhat reasonable. Is this what you are looking for? Basically, the trick is to histogram the data and plot the density bins overlapping with the PDF.
Out of curiosity, I tried the GeneralizedExtremeValueFactory (GEV) available in OpenTURNS:
import openturns as ot
sample = ot.Sample([[p] for p in max_precip])
gev = ot.GeneralizedExtremeValueFactory().buildAsGeneralizedExtremeValue(sample)
print (gev)
>>> GeneralizedExtremeValue(mu=12.7497, sigma=3.44903, xi=0.219894)
I can confirm it gives the same result; note that OpenTURNS uses the standard sign convention for the shape parameter xi, which is the opposite of scipy's, hence xi = 0.219894 versus scipy's -0.21990.
I am trying to simply find the best fit for Malus's law:
I_measured = I_0 * (cos(theta))^2
When I scatter-plot the data it obviously works, but with the form() function defined below I get the error given further down.
I googled the problem, and it seems that this is not the correct way to curve-fit a cosine function.
The given data is:
x_data (x1 in the code below):
[ 0.0, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0,
60.0, 65.0, 70.0, 75.0, 80.0, 85.0, 90.0, 95.0, 100.0, 105.0, 110.0, 115.0,
120.0, 125.0, 130.0, 135.0, 140.0, 145.0, 150.0, 155.0, 160.0, 165.0,
170.0, 175.0, 180.0, 185.0, 190.0, 195.0, 200.0, 205.0, 210.0, 215.0,
220.0, 225.0, 230.0, 235.0, 240.0, 245.0, 250.0, 255.0, 260.0, 265.0,
270.0, 275.0, 280.0, 285.0, 290.0, 295.0, 300.0, 305.0, 310.0, 315.0,
320.0, 325.0, 330.0, 335.0, 340.0, 345.0, 350.0, 355.0, 360.0]
y_data (x2 in the code below):
[ 1.69000000e-05 2.80000000e-05 4.14000000e-05 5.89000000e-05
7.97000000e-05 9.79000000e-05 1.23000000e-04 1.47500000e-04
1.69800000e-04 1.94000000e-04 2.17400000e-04 2.40200000e-04
2.55400000e-04 2.70500000e-04 2.81900000e-04 2.87600000e-04
2.91500000e-04 2.90300000e-04 2.83500000e-04 2.76200000e-04
2.62100000e-04 2.41800000e-04 2.24200000e-04 1.99500000e-04
1.74100000e-04 1.49300000e-04 1.35600000e-04 1.11500000e-04
9.00000000e-05 6.87000000e-05 4.98000000e-05 3.19000000e-05
2.07000000e-05 1.31000000e-05 9.90000000e-06 1.03000000e-05
1.49000000e-05 2.34000000e-05 3.65000000e-05 5.58000000e-05
7.56000000e-05 9.65000000e-05 1.19400000e-04 1.46900000e-04
1.73000000e-04 1.99200000e-04 2.24600000e-04 2.38700000e-04
2.60700000e-04 2.74800000e-04 2.84000000e-04 2.91200000e-04
2.93400000e-04 2.90300000e-04 2.86400000e-04 2.77900000e-04
2.63600000e-04 2.45900000e-04 2.25500000e-04 2.03900000e-04
1.79100000e-04 1.51800000e-04 1.32400000e-04 1.07000000e-04
8.39000000e-05 6.20000000e-05 4.41000000e-05 3.01000000e-05
1.93000000e-05 1.24000000e-05 1.00000000e-05 1.13000000e-05
1.77000000e-05]
The code:
I_0=291,5*10**-6/(pi*0.35**2) # print(I_0) gives (291, 1.2992240252399621e-05)??

def form(theta, I_0):
    return (I_0*(np.abs(np.cos(theta)))**2) # theta is x_data

param=I_0
parame,covariance= optimize.curve_fit(form,x1,x2,I_0)
test=parame*I_0
#print(parame)
#plt.scatter(x1,x2,label='data')
plt.ylim(10**-5,3*10**-4)
plt.plot(x1,form(x1,*parame),'b--',label='fitcurve')
The error I get is:
TypeError: form() takes 2 positional arguments but 3 were given
I started again with another code, shown below:
x1=np.radians(np.array(x1))
x2=np.array(x2)*10**-6
print(x1,x2)

def form(theta, I_0, theta0, offset):
    return I_0 * np.cos(np.radians(theta - theta0)) ** 2 + offset

param, covariance = optimize.curve_fit(form, x1, x2)
plt.scatter(x1, x2, label='data')
plt.ylim(0, 3e-4)
plt.xlim(0, 360)
plt.plot(x1, form(x1, *param), 'b-')
plt.ticklabel_format(style='sci', axis='y', scilimits=(0,0))
plt.axes().xaxis.set_major_locator(ticker.MultipleLocator(45))
plt.show()
In the new code I multiplied the input array by a number; basically it is still the y_data from the first code. When I plot this, I see that the function does not fit at all once I add the line x1 = np.radians(np.array(x1)).
Comma
I guess your I_0=291,5*10**-6/(pi*0.35**2) is supposed to be the initial guess for the fit. I don't know why this is expressed in such a complicated way. Using , as a decimal separator is the wrong syntax in Python; use . instead. Also, instead of something like 123.4 * 10 ** -5 you can write 123.4e-5 (scientific notation).
Anyway, it turns out you don't even need to specify the initial guess if you do the fit correctly.
Model function
In your model function, I_measured = I_0 * cos(theta)**2, theta is in radians (0 to 2π), but your x values are in degrees (0 to 360).
Your model function doesn't account for any offset in the x or y values. You should include such parameters in the function.
An improved model function would look like this:
def form(theta, I_0, theta0, offset):
return I_0 * np.cos(np.radians(theta - theta0)) ** 2 + offset
(Credits to Martin Evans for pointing out the np.radians function.)
Result
Now the curve_fit function is able to derive values for I_0, theta0, and offset that best fit the model function to your measured data:
>>> param, covariance = optimize.curve_fit(form, x, y)
>>> print('I_0: {0:e} / theta_0: {1} degrees / offset: {2:e}'.format(*param))
I_0: -2.827996e-04 / theta_0: -9.17118424279 degrees / offset: 2.926534e-04
The plot looks decent, too:
import matplotlib.ticker as ticker
plt.scatter(x, y, label='data')
plt.ylim(0, 3e-4)
plt.xlim(0, 360)
plt.plot(x, form(x, *param), 'b-')
plt.ticklabel_format(style='sci', axis='y', scilimits=(0,0))
plt.axes().xaxis.set_major_locator(ticker.MultipleLocator(45))
plt.show()
(Your x values are from 0 to 360; I don't know why you set the plot limits to 370. Also, I spaced the ticks at 45-degree intervals.)
Update: The fit results in a negative amplitude I_0 and an offset of about 3e-4, close to the maximum y values. You can guide the fit to a positive amplitude and offset close to zero ("flip it around") by providing a 90 degree initial phase offset:
>>> param, covariance = optimize.curve_fit(form, x, y, [3e-4, 90, 0])
>>> print('I_0: {0:e} / theta_0: {1} degrees / offset: {2:e}'.format(*param))
I_0: 2.827996e-04 / theta_0: 80.8288157578 degrees / offset: 9.853833e-06
Here's the complete code.
The comma in your formula is creating a two-object tuple; it does not indicate thousands. As such, you should remove it, giving you:
I_O = 0.00757447606715
The aim here is to provide a function that can be adapted to fit your data. Your original function only provided one parameter, which was not enough to enable curve_fit() to get a good fit.
In order to get a better fit, you need to create more variables for your func() to give the curve fitter more flexibility. In this case, for the cosine wave, it provides I_O for the amplitude, theta0 for the phase, and offset for the y-offset.
So the code would be:
import matplotlib.pyplot as plt
from math import pi
from scipy import optimize
import numpy as np
x1 = [ 0.0, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0,
60.0, 65.0, 70.0, 75.0, 80.0, 85.0, 90.0, 95.0, 100.0, 105.0, 110.0, 115.0,
120.0, 125.0, 130.0, 135.0, 140.0, 145.0, 150.0, 155.0, 160.0, 165.0,
170.0, 175.0, 180.0, 185.0, 190.0, 195.0, 200.0, 205.0, 210.0, 215.0,
220.0, 225.0, 230.0, 235.0, 240.0, 245.0, 250.0, 255.0, 260.0, 265.0,
270.0, 275.0, 280.0, 285.0, 290.0, 295.0, 300.0, 305.0, 310.0, 315.0,
320.0, 325.0, 330.0, 335.0, 340.0, 345.0, 350.0, 355.0, 360.0]
x2 = [ 1.69000000e-05, 2.80000000e-05, 4.14000000e-05, 5.89000000e-05,
7.97000000e-05, 9.79000000e-05, 1.23000000e-04, 1.47500000e-04,
1.69800000e-04, 1.94000000e-04, 2.17400000e-04, 2.40200000e-04,
2.55400000e-04, 2.70500000e-04, 2.81900000e-04, 2.87600000e-04,
2.91500000e-04, 2.90300000e-04, 2.83500000e-04, 2.76200000e-04,
2.62100000e-04, 2.41800000e-04, 2.24200000e-04, 1.99500000e-04,
1.74100000e-04, 1.49300000e-04, 1.35600000e-04, 1.11500000e-04,
9.00000000e-05, 6.87000000e-05, 4.98000000e-05, 3.19000000e-05,
2.07000000e-05, 1.31000000e-05, 9.90000000e-06, 1.03000000e-05,
1.49000000e-05, 2.34000000e-05, 3.65000000e-05, 5.58000000e-05,
7.56000000e-05, 9.65000000e-05, 1.19400000e-04, 1.46900000e-04,
1.73000000e-04, 1.99200000e-04, 2.24600000e-04, 2.38700000e-04,
2.60700000e-04, 2.74800000e-04, 2.84000000e-04, 2.91200000e-04,
2.93400000e-04, 2.90300000e-04, 2.86400000e-04, 2.77900000e-04,
2.63600000e-04, 2.45900000e-04, 2.25500000e-04, 2.03900000e-04,
1.79100000e-04, 1.51800000e-04, 1.32400000e-04, 1.07000000e-04,
8.39000000e-05, 6.20000000e-05, 4.41000000e-05, 3.01000000e-05,
1.93000000e-05, 1.24000000e-05, 1.00000000e-05, 1.13000000e-05,
1.77000000e-05]
x1 = np.radians(np.array(x1))
x2 = np.array(x2)
def form(theta, I_0, theta0, offset):
    return I_0 * np.cos(theta - theta0) ** 2 + offset
param, covariance = optimize.curve_fit(form, x1, x2)
plt.scatter(x1, x2, label='data')
plt.ylim(x2.min(), x2.max())
plt.plot(x1, form(x1, *param), 'b-')
plt.show()
Giving you an output of: (image)
The maths libraries work in radians, so your data would need to be converted to radians at some point (where 2pi == 360 degrees). You can either convert your data to radians, or carry out the conversion within your function.
Thanks also to mkrieger1 for the extra parameters.
I have a hypothetical y function of x and am trying to find/fit a lognormal distribution curve that best shapes over the data. I am using the curve_fit function and was able to fit a normal distribution, but the curve does not look optimized.
Below are the give y and x data points where y = f(x).
y_axis = [0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05]
The y-axis values are probabilities of an event occurring in the x-axis time bins:
x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]
I was able to get a better fit on my data using Excel and a lognormal approach. When I attempt to use a lognormal in Python, the fit does not work, and I am doing something wrong.
Below is the code I have for fitting a normal distribution, which seems to be the only one that I can fit in Python (hard to believe):
#fitting distribution on top of savitzky-golay
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import scipy
import scipy.stats
import numpy as np
from scipy.stats import norm, gamma, lognorm, halflogistic, foldcauchy
from scipy.optimize import curve_fit

matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
matplotlib.style.use('ggplot')

# results from savgol
x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]
y_axis = [0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05]

## y_axis values must be normalised
sum_ys = sum(y_axis)
# normalize to 1
y_axis = [_/sum_ys for _ in y_axis]

# def gamma_f(x, a, loc, scale):
#     return gamma.pdf(x, a, loc, scale)

def norm_f(x, loc, scale):
    # print('loc: ', loc, 'scale: ', scale, "\n")
    return norm.pdf(x, loc, scale)

fitting = norm_f

# param_bounds = ([-np.inf,0,-np.inf],[np.inf,2,np.inf])
result = curve_fit(fitting, x_axis, y_axis)
result_mod = result
# mod scale
# results_adj = [result_mod[0][0]*.75, result_mod[0][1]*.85]

plt.plot(x_axis, y_axis, 'ro')
plt.bar(x_axis, y_axis, 1, alpha=0.75)
plt.plot(x_axis, [fitting(_, *result[0]) for _ in x_axis], 'b-')
plt.axis([0, 35, 0, .1])

# convert back into probability
y_norm_fit = [fitting(_, *result[0]) for _ in x_axis]
y_fit = [_*sum_ys for _ in y_norm_fit]
print(list(y_fit))

plt.show()
I am trying to get answers to two questions:
Is this the best fit I will get from a normal distribution curve? How can I improve the fit?
Normal distribution result: (image)
How can I fit a lognormal distribution to this data, or is there a better distribution that I can use?
I was playing around with a lognormal distribution curve, adjusting mu and sigma, and it looks like a better fit is possible. I don't understand what I am doing wrong to get similar results in Python.
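For the lognormal itself, a minimal sketch in the same style as the norm_f wrapper above; the initial guess is an assumption on my part (without p0, curve_fit starts every parameter at 1.0, which tends to stall for this shape):

def lognorm_f(x, s, loc, scale):
    # scipy's lognormal PDF: shape s, location loc, scale = exp(mu)
    return lognorm.pdf(x, s, loc, scale)

# p0 = [shape, loc, scale] is a rough assumed starting point
result = curve_fit(lognorm_f, x_axis, y_axis, p0=[0.5, 0.0, 10.0])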
Actually, a Gamma distribution might be a good fit, as @Glen_b proposed. I'm using the second definition, with alpha and beta.
NB: the trick I use for a quick fit is to compute the mean and variance; for a typical two-parameter distribution that is enough to recover the parameters and get a quick idea of whether it is a good fit or not.
Code
import math
import matplotlib.pyplot as plt
y_axis = [0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05]
x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]
## y_axis values must be normalised
sum_ys = sum(y_axis)
# normalize to 1
y_axis = [_/sum_ys for _ in y_axis]
m = 0.0
for k in range(0, len(x_axis)):
    m += y_axis[k] * x_axis[k]

v = 0.0
for k in range(0, len(x_axis)):
    t = (x_axis[k] - m)
    v += y_axis[k] * t * t
print(m, v)

b = m/v
a = m * b
print(a, b)

z = []
for k in range(0, len(x_axis)):
    q = b**a * x_axis[k]**(a-1.0) * math.exp(-b * x_axis[k]) / math.gamma(a)
    z.append(q)

plt.plot(x_axis, y_axis, 'ro')
plt.plot(x_axis, z, 'b*')
plt.axis([0, 35, 0, .1])
plt.show()
A discrete distribution might look even better; your x values are all integers, after all. You have a distribution with variance about 3 times higher than the mean, and asymmetric, so most likely something like a Negative Binomial might work quite well. Here is a quick fit.
r comes out a bit above 6, so you might want to move to a distribution with real-valued r, the Polya distribution.
Code
from scipy.special import comb  # comb lives in scipy.special in current SciPy
import matplotlib.pyplot as plt
y_axis = [0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05]
x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]
## y_axis values must be normalised
sum_ys = sum(y_axis)
# normalize to 1
y_axis = [_/sum_ys for _ in y_axis]
s = 1.0  # shift by 1 to have them all at 0
m = 0.0
for k in range(0, len(x_axis)):
    m += y_axis[k] * (x_axis[k] - s)

v = 0.0
for k in range(0, len(x_axis)):
    t = (x_axis[k] - s - m)
    v += y_axis[k] * t * t
print(m, v)

p = 1.0 - m/v
r = int(m*(1.0 - p) / p)
print(p, r)

z = []
for k in range(0, len(x_axis)):
    q = comb(k + r - 1, k) * (1.0 - p)**r * p**k
    z.append(q)

plt.plot(x_axis, y_axis, 'ro')
plt.plot(x_axis, z, 'b*')
plt.axis([0, 35, 0, .1])
plt.show()
Note that if a lognormal curve is correct and you take logs of both variables, you should have a quadratic relationship; even if that's not a suitable scale for a final model (because of variance effects -- if your variance is near constant on the original scale it will overweight the small values) it should at least give a good starting point for a nonlinear fit.
Indeed, aside from the first two points this looks fairly good (image):
A quadratic fit to the solid points would describe that data quite well and should give suitable starting values if you then want to do a nonlinear fit.
(If error in x is at all possible, the lack of fit at the lowest x may be as much an issue of error in x as of error in y.)
Incidentally, that plot seems to hint that a gamma curve may fit a little better overall than a lognormal one (in particular if you don't want to reduce the impact of those first two points relative to points 4-6). A good initial fit for that can be had by regressing log(y) on x and log(x):
The scaled gamma density is g = c * x^(a-1) * exp(-b*x); taking logs, you get log(g) = log(c) + (a-1)*log(x) - b*x = b0 + b1*log(x) + b2*x, so supplying log(x) and x to a linear regression routine will fit that. The same caveats about variance effects apply (so it might be best as a starting point for a nonlinear least squares fit if your relative error in y isn't nearly constant).
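A minimal numpy sketch of that starting-value regression (the variable names are mine):

import numpy as np

x = np.asarray(x_axis)
y = np.asarray(y_axis)
# Design matrix with columns 1, log(x), x; fit log(y) by linear least squares
A = np.column_stack([np.ones_like(x), np.log(x), x])
b0, b1, b2 = np.linalg.lstsq(A, np.log(y), rcond=None)[0]
a_start = b1 + 1.0   # gamma shape, since b1 = a - 1
b_start = -b2        # gamma rate, since b2 = -b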
In Python, I explained a trick here of how to fit a LogNormal very simply using the OpenTURNS library:
import numpy as np
import openturns as ot

N = 100000  # assumed total count of synthetic samples; any suitably large value works
n_times = [int(y_axis[i] * N) for i in range(len(y_axis))]
S = np.repeat(x_axis, n_times)
sample = ot.Sample([[p] for p in S])
fitdist = ot.LogNormalFactory().buildAsLogNormal(sample)
That's it!
print(fitdist) will show you >>> LogNormal(muLog = 2.92142, sigmaLog = 0.305, gamma = -6.24996)
and the fitting seems good:
import matplotlib.pyplot as plt
plt.hist(S, density=True, color='grey', bins=34, alpha=0.5)
plt.scatter(x_axis, y_axis, color= 'red')
plt.plot(x_axis, fitdist.computePDF(ot.Sample([[p] for p in x_axis])), color = 'black')
plt.show()