How to run non-linear regression in Python

I have the following data (a dataframe) in Python:
product baskets scaling_factor
12345 475 95.5
12345 108 57.7
12345 2 1.4
12345 38 21.9
12345 320 88.8
and I want to run the following non-linear regression and estimate the parameters a, b and c.
The equation I want to fit is:
scaling_factor = a - (b*np.exp(c*baskets))
In SAS we usually run the following model (it uses the Gauss-Newton method):
proc nlin data=scaling_factors;
parms a=100 b=100 c=-0.09;
model scaling_factor = a - (b * (exp(c*baskets)));
output out=scaling_equation_parms
parms=a b c;
Is there a similar way to estimate the parameters in Python using non-linear regression? And how can I see the plot in Python?

For problems like these I always use scipy.optimize.minimize with my own least squares function. The optimization algorithms don't handle large differences between the various inputs well, so it is a good idea to scale the parameters in your function so that the parameters exposed to scipy are all on the order of 1 as I've done below.
import numpy as np
import scipy.optimize

baskets = np.array([475, 108, 2, 38, 320])
scaling_factor = np.array([95.5, 57.7, 1.4, 21.9, 88.8])

def lsq(arg):
    # rescale the search variables so that scipy works with numbers of order 1
    a = arg[0]*100
    b = arg[1]*100
    c = arg[2]*0.1
    now = a - (b*np.exp(c * baskets)) - scaling_factor
    return np.sum(now**2)

guesses = [1, 1, -0.9]
res = scipy.optimize.minimize(lsq, guesses)
print(res.message)
# 'Optimization terminated successfully.'
print(res.x)
# [ 0.97336709 0.98685365 -0.07998282]
print([lsq(guesses), lsq(res.x)])
# [7761.0093358076601, 13.055053196410928]
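To recover the fitted parameters on their original scale, undo the scaling applied inside lsq (a small sketch; the factors must match the ones used in the function):
a_fit = res.x[0]*100   # same scale factors as inside lsq
b_fit = res.x[1]*100
c_fit = res.x[2]*0.1
print(a_fit, b_fit, c_fit)
# approximately 97.3, 98.7, -0.008 for this data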
Of course, as with all minimization problems it is important to use good initial guesses since all of the algorithms can get trapped in a local minimum. The optimization method can be changed by using the method keyword; some of the possibilities are
‘Nelder-Mead’
‘Powell’
‘CG’
‘BFGS’
‘Newton-CG’
The default is BFGS according to the documentation.
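For example, a minimal sketch of switching to the derivative-free Nelder-Mead method:
res = scipy.optimize.minimize(lsq, guesses, method='Nelder-Mead')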

Agreeing with Chris Mueller, I'd also use scipy but scipy.optimize.curve_fit.
The code looks like:
###the top two lines are required on my linux machine
import matplotlib
matplotlib.use('Qt4Agg')
import matplotlib.pyplot as plt
from matplotlib.pyplot import cm
import numpy as np
from scipy.optimize import curve_fit #we could import more, but this is what we need
###defining your fitfunction
def func(x, a, b, c):
    return a - b * np.exp(c * x)
###OP's data
baskets = np.array([475, 108, 2, 38, 320])
scaling_factor = np.array([95.5, 57.7, 1.4, 21.9, 88.8])
###let us guess some start values
initialGuess=[100, 100,-.01]
guessedFactors=[func(x,*initialGuess ) for x in baskets]
###making the actual fit
popt,pcov = curve_fit(func, baskets, scaling_factor,initialGuess)
#one may want to
print(popt)
print(pcov)
###preparing data for showing the fit
basketCont=np.linspace(min(baskets),max(baskets),50)
fittedData=[func(x, *popt) for x in basketCont]
###preparing the figure
fig1 = plt.figure(1)
ax=fig1.add_subplot(1,1,1)
###the three sets of data to plot
ax.plot(baskets,scaling_factor,linestyle='',marker='o', color='r',label="data")
ax.plot(baskets,guessedFactors,linestyle='',marker='^', color='b',label="initial guess")
ax.plot(basketCont,fittedData,linestyle='-', color='#900000',label="fit with ({0:0.2g},{1:0.2g},{2:0.2g})".format(*popt))
###beautification
ax.legend(loc=0, title="graphs", fontsize=12)
ax.set_ylabel("factor")
ax.set_xlabel("baskets")
ax.grid()
ax.set_title(r"$\mathrm{curve}_\mathrm{fit}$")
###putting the covariance matrix nicely
tab= [['{:.2g}'.format(j) for j in i] for i in pcov]
the_table = plt.table(cellText=tab,
                      colWidths=[0.2]*3,
                      loc='upper right', bbox=[0.483, 0.35, 0.5, 0.25])
plt.text(250,65,'covariance:',size=12)
###putting the plot
plt.show()
###done
Eventually, giving you: [plot of the data, the initial guess, and the fitted curve]

Related

Fitting data in python using curve_fit

I'm trying to fit data in Python to obtain the coefficients for the best fit.
The equation I need to fit is:
Vs = a*(qt^b)*(fs^c)*(ov^d)
where I have the data for qt, fs and ov and need to obtain the values for a, b, c and d.
The code I'm using is:
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
qt = [10867.073, 8074.986, 2208.366, 3066.566, 2945.326, 4795.766, 2249.813, 2018.3]
fs = [229.6, 17.4, 5.3, 0.1, 0.1, 0.1, 0.1, 0.1]
ov = [19.159, 29.054, 37.620, 44.854, 51.721, 58.755, 65.622, 72.492]
Vs = [149.787, 125.3962, 133.927, 110.047, 149.787, 137.809, 201.506, 154.925]
d = [1.018, 1.518, 2.0179, 2.517, 3.017, 3.517, 4.018, 4.52]
def func(a, b, c, d):
    return a*qt**b*fs**c*ov**d
popt, pcov = curve_fit(func, Vs, d)
print(popt[0], popt[1], popt[2])
plt.plot(Vs, d, 'ro',label="Original Data")
plt.plot(Vs, func(Vs,*popt), label="Fitted Curve")
plt.gca().invert_yaxis()
plt.show()
Which produces the following output (Significant figures cut by me):
-0.333528 -0.1413381 -0.3553966
I was hoping to get something more like the plot below, where the data is fitted, even if not perfectly (note that the one below is just an example and is not correct).
The main hitch is the graphical representation of the result. What is drawn has no significance: even with perfect data and a perfect fit, the points will appear very scattered. This is misleading.
Better to draw (Vs from computation) divided by (Vs from data) and compare it to 1, which would be the exact value if the fit were perfect.
Note that your problem is a simple linear regression in logarithmic scale:
ln(Vs) = A + b * ln(qt) + c * ln(fs) + d * ln(ov)
Linear regression straight away gives A, b, c, d and a = exp(A).
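A minimal sketch of that log-space linear regression with the question's data, using np.linalg.lstsq (variable names here are illustrative):
import numpy as np

qt = np.array([10867.073, 8074.986, 2208.366, 3066.566, 2945.326, 4795.766, 2249.813, 2018.3])
fs = np.array([229.6, 17.4, 5.3, 0.1, 0.1, 0.1, 0.1, 0.1])
ov = np.array([19.159, 29.054, 37.620, 44.854, 51.721, 58.755, 65.622, 72.492])
Vs = np.array([149.787, 125.3962, 133.927, 110.047, 149.787, 137.809, 201.506, 154.925])

# design matrix for ln(Vs) = A + b*ln(qt) + c*ln(fs) + d*ln(ov)
M = np.column_stack([np.ones_like(qt), np.log(qt), np.log(fs), np.log(ov)])
coeffs, *_ = np.linalg.lstsq(M, np.log(Vs), rcond=None)
A, b, c, d = coeffs
a = np.exp(A)

# check the ratio of predicted to observed Vs; it should be close to 1 everywhere
Vs_pred = a * qt**b * fs**c * ov**d
print(a, b, c, d)
print(Vs_pred / Vs)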

Curve_Fit not accurate

I am trying to fit very fluctuating data over time as well as possible. So first I smoothed the data, which works fine. The smoothed data should then be represented by a fit to bring out more of the peaks. As you see in the code, I want to use a log-tanh function to fit the data. I am well aware that this problem has come up in some threads already, but I have tried those suggestions, and the data is also neither very small nor very big, which I know can also cause problems.
The polynomial fit I tried also works pretty well, as you see, but it does not eliminate all the wavy values. They cause problems for the following derivative, which is very bad.
import tkinter as tk
from tkinter import filedialog
import numpy as np
import scipy.signal
from scipy.optimize import curve_fit
from numpy import diff
import matplotlib.pyplot as plt
from lmfit.models import StepModel, LinearModel
def loghypfunc(x, A, B, C, D, E):
    return A*np.log(1+x) + B*np.tanh(C*x) + D*x + E

def expfunc(t, c0, c1, c2, c3):
    return c0 + c1*t - c2*np.exp(-c3*t)

def expdecay(x, a, b, c):
    return a * np.exp(-b * x) + c
path="C:/Users/Sammy/Documents/Masterarbeit WT/CSM und Kriechdaten/Kriechen/Creep_10mN_00008_LC_20210406_2121_DYN.txt"
dataFile = np.loadtxt(path, delimiter='\t', skiprows=2, usecols=(0, 1, 2, 3, 29, 30), dtype=float)
num_rows, num_cols = dataFile.shape
# time column
time = dataFile[:, [0]].transpose()
time = time.flatten()
refTime = time[0] # get first time in column (reference)
# zeroed test time
timeNull = time - refTime
print("time", time)
flatTimeNull = timeNull.flatten() # now a 1D array (one row)
##################################################################################
# indent displacement column
indentDis = dataFile[:, [4]].transpose()
indentDis = indentDis.flatten()
indentDis = indentDis - indentDis[0]
# the indent data has to be smoothed so there is not such a big fluctuation
indentSmooth = scipy.signal.savgol_filter(indentDis, 2001, 3)
# null the indent Smooth data
indentSmooth_Null = indentSmooth - indentSmooth[0]
hind_Smooth_flat = indentSmooth_Null.flatten() # now a 1D array
print('indent smooth', indentSmooth)
######################################################################
p0 = [100, 0.1, 100, 0.1]
c, cov = curve_fit(expfunc, time, indentSmooth, p0)
y_indent = expfunc(indentSmooth, *c)
p0 = [70, 0.5, 50, 0.1, 100]
popt, pcov = curve_fit(loghypfunc, time, indentSmooth, p0, maxfev = 10000)
y_indentTan = loghypfunc(indentSmooth, *popt)
modelh_t = np.poly1d(np.polyfit(time, indentSmooth, 8))
plt.plot(time, indentSmooth, 'r', label="Data smoothed")
plt.scatter(time, modelh_t(time), s=0.1, label="Polyfit")
plt.plot(time, y_indentTan, label="Curve fit Tangens function")
plt.plot(time, y_indent, label="Curve fit exp function")
plt.legend(loc="lower right")
plt.xlabel("time")
plt.ylabel("indent")
plt.show()
These are the two arrays i get the data from
time [ 6.299596 6.349592 6.399589 ... 608.0109 608.060897 608.110894]
indent smooth [120.81411822 121.07093706 121.32748184 ... 476.78825661 476.89357473 476.99915287]
Here are the plots: [plot images]
The question for me now is how to fix this. Is it because the optimized fit parameters are wrong? But I would guess Python should handle that automatically and sufficiently well. My second guess was that the data is too compact along this axis, as the array is about 12000 values long. Could this be a reason?
I would be very grateful for any kind of advice regarding the fits.
Regards,
Hndrx

scipy curve_fit returns initial estimates

To fit a hyperbolic function I am trying to use the following code:
import numpy as np
from scipy.optimize import curve_fit
def hyperbola(x, s_1, s_2, o_x, o_y, c):
    # x   > input x values
    # s_1 > slope of line 1
    # s_2 > slope of line 2
    # o_x > x offset of crossing of asymptotes
    # o_y > y offset of crossing of asymptotes
    # c   > curvature of hyperbola
    b_2 = (s_1 + s_2) / 2
    b_1 = (s_2 - s_1) / 2
    return o_y + b_1 * (x - o_x) + b_2 * np.sqrt((x - o_x) ** 2 + c ** 2 / 4)
min_fit = np.array([-3.0, 0.0, -2.0, -10.0, 0.0])
max_fit = np.array([0.0, 3.0, 3.0, 0.0, 10.0])
guess = np.array([-2.5/3.0, 4/3.0, 1.0, -4.0, 0.5])
vars, covariance = curve_fit(f=hyperbola, xdata=n_step, ydata=n_mean, p0=guess, bounds=(min_fit, max_fit))
Where n_step and n_mean are measurement values generated earlier on. The code runs fine and gives no error message, but it only returns the initial guess with a very small change. Also, the covariance matrix contains only zeros. I tried to do the same fit with a better initial guess, but that does not have any influence.
Further, I plotted the exact same function with the initial guess as input, and that indeed gives me a function which is close to the real values. Does anyone know where I am making a mistake here? Or am I using the wrong function to make my fit?
The issue must lie with n_step and n_mean (which are not given in the question as currently stated); when trying to reproduce the issue with some arbitrarily chosen set of input parameters, the optimization works as expected. Let's try it out.
First, let's define some arbitrarily chosen input parameters in the given parameter space by
params = [-0.1, 2.95, -1, -5, 5]
Let's see what that looks like:
import matplotlib.pyplot as plt
xs = np.linspace(-30, 30, 100)
plt.plot(xs, hyperbola(xs, *params))
Based on this, let us define some rather crude inputs for xdata and ydata by
xdata = np.linspace(-30, 30, 10)
ydata = hyperbola(xdata, *params)
With these, let us run the optimization and see if we match our given parameters:
vars, covariance = curve_fit(f=hyperbola, xdata=xdata, ydata=ydata, p0=guess, bounds=(min_fit, max_fit))
print(vars) # [-0.1 2.95 -1. -5. 5. ]
That is, the fit is perfect even though our params are rather different from our guess. In other words, if we are free to choose n_step and n_mean, then the method works as expected.
In order to try to challenge the optimization slightly, we could also try to add a bit of noise:
np.random.seed(42)
xdata = np.linspace(-30, 30, 10)
ydata = hyperbola(xdata, *params) + np.random.normal(0, 10, size=len(xdata))
vars, covariance = curve_fit(f=hyperbola, xdata=xdata, ydata=ydata, p0=guess, bounds=(min_fit, max_fit))
print(vars) # [ -1.18173287e-01 2.84522636e+00 -1.57023215e+00 -6.90851334e-12 6.14480856e-08]
plt.plot(xdata, ydata, '.')
plt.plot(xs, hyperbola(xs, *vars))
Here we note that the optimum ends up being different from both our provided params and the guess, but still within the bounds provided by min_fit and max_fit, and it still provides a good fit.

Fitting a curve to a power-law distribution with curve_fit does not work

I am trying to find a curve fitting my data that visually seem to have a power law distribution.
I hoped to utilize scipy.optimize.curve_fit, but no matter what function or data normalization I try, I am getting either a RuntimeError (parameters not found or overflow) or a curve that does not fit my data even remotely. Please help me to figure out what I am doing wrong here.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
df = pd.DataFrame({
'x': [ 1000, 3250, 5500, 10000, 32500, 55000, 77500, 100000, 200000 ],
'y': [ 1100, 500, 288, 200, 113, 67, 52, 44, 5 ]
})
df.plot(x='x', y='y', kind='line', style='--ro', figsize=(10, 5))
def func_powerlaw(x, m, c, c0):
    return c0 + x**m * c
target_func = func_powerlaw
X = df['x']
y = df['y']
popt, pcov = curve_fit(target_func, X, y)
plt.figure(figsize=(10, 5))
plt.plot(X, target_func(X, *popt), '--')
plt.plot(X, y, 'ro')
plt.legend()
plt.show()
Output
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-243-17421b6b0c14> in <module>()
18 y = df['y']
19
---> 20 popt, pcov = curve_fit(target_func, X, y)
21
22 plt.figure(figsize=(10, 5))
/Users/evgenyp/.virtualenvs/kindle-dev/lib/python2.7/site-packages/scipy/optimize/minpack.pyc in curve_fit(f, xdata, ydata, p0, sigma, absolute_sigma, check_finite, bounds, method, **kwargs)
653 cost = np.sum(infodict['fvec'] ** 2)
654 if ier not in [1, 2, 3, 4]:
--> 655 raise RuntimeError("Optimal parameters not found: " + errmsg)
656 else:
657 res = least_squares(func, p0, args=args, bounds=bounds, method=method,
RuntimeError: Optimal parameters not found: Number of calls to function has reached maxfev = 800.
As the traceback states, the maximum number of function evaluations was reached without finding a stationary point (to terminate the algorithm). You can increase the maximum number using the option maxfev. For this example, setting maxfev=2000 is large enough to successfully terminate the algorithm.
However, the solution is not satisfactory. This is due to the algorithm choosing a (default) initial estimate for the variables, which, for this example, is not good (the large number of iterations required is an indicator of this). Providing another initialization point (found by simple trial and error) results in a good fit, without the need to increase maxfev.
The two fits and a visual comparison with the data are shown below.
x = np.asarray([ 1000, 3250, 5500, 10000, 32500, 55000, 77500, 100000, 200000 ])
y = np.asarray([ 1100, 500, 288, 200, 113, 67, 52, 44, 5 ])
sol1 = curve_fit(func_powerlaw, x, y, maxfev=2000 )
sol2 = curve_fit(func_powerlaw, x, y, p0 = np.asarray([-1,10**5,0]))
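A minimal plotting sketch to compare the two fits with the data (sol1[0] and sol2[0] hold the fitted parameters returned by curve_fit):
import matplotlib.pyplot as plt

plt.plot(x, y, 'ro', label='data')
plt.plot(x, func_powerlaw(x, *sol1[0]), '--', label='default p0, maxfev=2000')
plt.plot(x, func_powerlaw(x, *sol2[0]), '-', label='p0 = [-1, 1e5, 0]')
plt.legend()
plt.show()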
Your func_powerlaw is not strictly a power law, as it has an additive constant.
Generally speaking, if you want a quick visual appraisal of a power law relation, you would
plot(log(x),log(y))
or
loglog(x,y)
Both of them should give a straight line, although there are subtle differences between them (in particular, regarding curve fitting).
All of this is without the additive constant, which messes up the power-law relation.
If you want to fit a power law that weighs data according to the log-log scale (typically desirable), you can use code below.
import numpy as np
from scipy.optimize import curve_fit
def powlaw(x, a, b):
    return a * np.power(x, b)

def linlaw(x, a, b):
    return a + x * b

def curve_fit_log(xdata, ydata):
    """Fit data to a power law with weights according to a log scale"""
    # Weights according to a log scale
    # Apply fscalex
    xdata_log = np.log10(xdata)
    # Apply fscaley
    ydata_log = np.log10(ydata)
    # Fit linear
    popt_log, pcov_log = curve_fit(linlaw, xdata_log, ydata_log)
    # print(popt_log, pcov_log)
    # Apply fscaley^-1 to fitted data
    ydatafit_log = np.power(10, linlaw(xdata_log, *popt_log))
    # There is no need to apply fscalex^-1 as original data is already available
    return (popt_log, pcov_log, ydatafit_log)
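A quick usage sketch with the question's data; since linlaw is fitted in log10 space, the power-law amplitude is 10**popt_log[0] and the exponent is popt_log[1]:
x = np.asarray([1000, 3250, 5500, 10000, 32500, 55000, 77500, 100000, 200000])
y = np.asarray([1100, 500, 288, 200, 113, 67, 52, 44, 5])

popt_log, pcov_log, yfit = curve_fit_log(x, y)
a = 10 ** popt_log[0]   # amplitude of a * x**b
b = popt_log[1]         # exponent
print(a, b)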

Fitting negative binomial in python

In scipy there is no support for fitting a negative binomial distribution using data
(maybe due to the fact that the negative binomial in scipy is only discrete).
For a normal distribution I would just do:
from scipy.stats import norm
param = norm.fit(samp)
Is there something similar 'ready to use' function in any other library?
Statsmodels has discrete.discrete_model.NegativeBinomial.fit(), see here:
https://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.NegativeBinomial.fit.html#statsmodels.discrete.discrete_model.NegativeBinomial.fit
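A minimal sketch of fitting that model with an intercept-only design (assuming the default log link, so the fitted mean is the exponential of the constant):
import numpy as np
import statsmodels.api as sm
from scipy.stats import nbinom

samp = nbinom.rvs(n=10, p=0.3, size=5000)       # example data
exog = np.ones((len(samp), 1))                  # intercept-only design matrix
res = sm.NegativeBinomial(samp, exog).fit(disp=0)
mu = np.exp(res.params[0])                      # fitted mean
alpha = res.params[1]                           # dispersion parameter
print(mu, alpha)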
It is not only because it is discrete; a maximum-likelihood fit of the negative binomial can also be quite involved, especially with an additional location parameter. That may be the reason why a .fit() method is not provided for it (and other discrete distributions in SciPy). Here is an example:
In [163]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as ss
import scipy.optimize as so
In [164]:
#define a likelihood function
def likelihood_f(P, x, neg=1):
    n = np.round(P[0])  #by definition, it should be an integer
    p = P[1]
    loc = np.round(P[2])
    return neg*(np.log(ss.nbinom.pmf(x, n, p, loc))).sum()
In [165]:
#generate a random variable
X=ss.nbinom.rvs(n=100, p=0.4, loc=0, size=1000)
In [166]:
#The likelihood
likelihood_f([100,0.4,0], X)
Out[166]:
-4400.3696690513316
In [167]:
#A simple fit, the fit is not good and the parameter estimate is way off
result=so.fmin(likelihood_f, [50, 1, 1], args=(X,-1), full_output=True, disp=False)
P1=result[0]
(result[1], result[0])
Out[167]:
(4418.599495886474, array([ 59.61196161, 0.28650831, 1.15141838]))
In [168]:
#Try a different set of start parameters, the fit is still not good and the parameter estimate is still way off
result=so.fmin(likelihood_f, [50, 0.5, 0], args=(X,-1), full_output=True, disp=False)
P1=result[0]
(result[1], result[0])
Out[168]:
(4417.1495981801972,
array([ 6.24809397e+01, 2.91877405e-01, 6.63343536e-04]))
In [169]:
#In this case we need a loop to get it right
result=[]
for i in range(40, 120): #in fact (80, 120) should probably be enough
    _ = so.fmin(likelihood_f, [i, 0.5, 0], args=(X,-1), full_output=True, disp=False)
    result.append((_[1], _[0]))
In [170]:
#get the MLE
P2=sorted(result, key=lambda x: x[0])[0][1]
sorted(result, key=lambda x: x[0])[0]
Out[170]:
(4399.780263084549,
array([ 9.37289361e+01, 3.84587087e-01, 3.36856705e-04]))
In [171]:
#Which one is visually better?
plt.hist(X, bins=20, density=True)  # 'normed' was removed in newer matplotlib
plt.plot(range(260), ss.nbinom.pmf(range(260), np.round(P1[0]), P1[1], np.round(P1[2])), 'g-')
plt.plot(range(260), ss.nbinom.pmf(range(260), np.round(P2[0]), P2[1], np.round(P2[2])), 'r-')
Out[171]:
[<matplotlib.lines.Line2D at 0x109776c10>]
I know this thread is quite old, but current readers may want to look at this repo which is made for this purpose: https://github.com/gokceneraslan/fit_nbinom
There's also an implementation here, though part of a larger package: https://github.com/ernstlab/ChromTime/blob/master/optimize.py
I stumbled across this thread, and found an answer for anyone else wondering.
If you simply need the n, p parameterisation used by scipy.stats.nbinom you can convert the mean and variance estimates:
mu = np.mean(sample)
sigma_sqr = np.var(sample)
n = mu**2 / (sigma_sqr - mu)
p = mu / sigma_sqr
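As a quick sanity check of those moment estimates (a sketch with synthetic data, since sample is not given in the question):
import numpy as np
from scipy.stats import nbinom

sample = nbinom.rvs(n=8, p=0.25, size=100000)
mu = np.mean(sample)
sigma_sqr = np.var(sample)
n = mu**2 / (sigma_sqr - mu)
p = mu / sigma_sqr
print(n, p)   # should be close to 8 and 0.25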
If you need the dispersion parameter, you can use a negative binomial regression model from statsmodels with just an intercept term. This will find the dispersion parameter alpha using MLE.
# Data processing
import pandas as pd
import numpy as np
# Analysis models
import statsmodels.formula.api as smf
from scipy.stats import nbinom
def convert_params(mu, alpha):
    """
    Convert mean/dispersion parameterization of a negative binomial to the ones scipy supports

    Parameters
    ----------
    mu : float
       Mean of NB distribution.
    alpha : float
       Overdispersion parameter used for variance calculation.

    See https://en.wikipedia.org/wiki/Negative_binomial_distribution#Alternative_formulations
    """
    var = mu + alpha * mu ** 2
    p = mu / var
    r = mu ** 2 / (var - mu)
    return r, p
# Generate sample data
n = 2
p = 0.9
sample = nbinom.rvs(n=n, p=p, size=10000)
# Estimate parameters
## Mean estimates expectation parameter for negative binomial distribution
mu = np.mean(sample)
## Dispersion parameter from NB model with only an intercept term
nbfit = smf.negativebinomial("nbdata ~ 1", data=pd.DataFrame({"nbdata": sample})).fit()
alpha = nbfit.params[1] # Dispersion parameter
# Convert parameters to n, p parameterization
n_est, p_est = convert_params(mu, alpha)
# Check that estimates are close to the true values:
print("""
{:<3} {:<3}
True parameters: {:<3} {:<3}
Estimates : {:<3} {:<3}""".format('n', 'p', n, p,
np.round(n_est, 2), np.round(p_est, 2)))
