scipy curve_fit returns initial estimates - python

To fit a hyperbolic function I am trying to use the following code:
import numpy as np
from scipy.optimize import curve_fit
def hyperbola(x, s_1, s_2, o_x, o_y, c):
    # x   > input x values
    # s_1 > slope of line 1
    # s_2 > slope of line 2
    # o_x > x offset of crossing of asymptotes
    # o_y > y offset of crossing of asymptotes
    # c   > curvature of hyperbola
    b_2 = (s_1 + s_2) / 2
    b_1 = (s_2 - s_1) / 2
    return o_y + b_1 * (x - o_x) + b_2 * np.sqrt((x - o_x) ** 2 + c ** 2 / 4)
min_fit = np.array([-3.0, 0.0, -2.0, -10.0, 0.0])
max_fit = np.array([0.0, 3.0, 3.0, 0.0, 10.0])
guess = np.array([-2.5/3.0, 4/3.0, 1.0, -4.0, 0.5])
vars, covariance = curve_fit(f=hyperbola, xdata=n_step, ydata=n_mean, p0=guess, bounds=(min_fit, max_fit))
Where n_step and n_mean are measurement values generated earlier on. The code runs fine and gives no error message, but it only returns the initial guess with a very small change. Also, the covariance matrix contains only zeros. I tried to do the same fit with a better initial guess, but that does not have any influence.
Further, I plotted the exact same function with the initial guess as input, and that indeed gives me a function close to the real values. Does anyone know where I am making a mistake here? Or am I using the wrong function for my fit?

The issue must lie with n_step and n_mean (which are not given in the question as currently stated); when trying to reproduce the issue with some arbitrarily chosen set of input parameters, the optimization works as expected. Let's try it out.
First, let's define some arbitrarily chosen input parameters in the given parameter space by
params = [-0.1, 2.95, -1, -5, 5]
Let's see what that looks like:
import matplotlib.pyplot as plt
xs = np.linspace(-30, 30, 100)
plt.plot(xs, hyperbola(xs, *params))
Based on this, let us define some rather crude inputs for xdata and ydata by
xdata = np.linspace(-30, 30, 10)
ydata = hyperbola(xdata, *params)
With these, let us run the optimization and see if we match our given parameters:
vars, covariance = curve_fit(f=hyperbola, xdata=xdata, ydata=ydata, p0=guess, bounds=(min_fit, max_fit))
print(vars) # [-0.1 2.95 -1. -5. 5. ]
That is, the fit is perfect even though our params are rather different from our guess. In other words, if we are free to choose n_step and n_mean, then the method works as expected.
In order to try to challenge the optimization slightly, we could also try to add a bit of noise:
np.random.seed(42)
xdata = np.linspace(-30, 30, 10)
ydata = hyperbola(xdata, *params) + np.random.normal(0, 10, size=len(xdata))
vars, covariance = curve_fit(f=hyperbola, xdata=xdata, ydata=ydata, p0=guess, bounds=(min_fit, max_fit))
print(vars) # [ -1.18173287e-01 2.84522636e+00 -1.57023215e+00 -6.90851334e-12 6.14480856e-08]
plt.plot(xdata, ydata, '.')
plt.plot(xs, hyperbola(xs, *vars))
Here we note that the optimum ends up different from both our provided params and the guess, stays within the bounds given by min_fit and max_fit, and still provides a good fit.
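So a sensible next step is to sanity-check the actual n_step and n_mean arrays before calling curve_fit. A minimal diagnostic sketch (assuming n_step and n_mean are the measurement arrays from the question; NaNs or mismatched shapes are common reasons a fit never moves away from p0):
import numpy as np
n_step = np.asarray(n_step, dtype=float)   # make sure we have float arrays
n_mean = np.asarray(n_mean, dtype=float)
print(n_step.shape, n_mean.shape)          # should be 1-D and of equal length
print(np.isnan(n_step).any(), np.isnan(n_mean).any())  # NaNs poison the least-squares objective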

Scipy method SLSQP with vector returns in constraints and objective function

Recently I have been working with the Nelson Siegel Svensson yield curve model, but I ran into a situation: searching for the best-fit parameters of the model.
Given the above, I use a simple dataset represented by Period (the const vector) and Yield Value (the y_real vector) to calibrate the parameters b0, b1, b2, b3, t1 and t2 of the original function (referred to in the model objective function as x[0], x[1], x[2], x[3], x[4] and x[5] respectively), so as to minimise the difference between the original Yield Values and the Yield Values estimated with the NSS model, and in the end compute interpolated and extrapolated Yield Values for specific Periods.
Note: The constraint function is defined with the premise fun(x) == 0 (type 'eq') to search for a minimal difference between y_real and the result of the model (in vector form). For example:
If y_real = [1.1, 1.4, 1.3] and the result of the model function is [0.2, 1.7, 3.3], then the difference is [0.9, -0.3, -2], and it is necessary to iterate again until the difference is approximately the zero vector, for example [1.0e-22, 1.0e-21, 1.0e-25].
I developed a trial solution with SciPy's least_squares (Calibrate parameters of Yield Curve Nelson Siegel Svensson), but it is a very simple approach and I need more accuracy; for this, some people recommended the SLSQP method of scipy.optimize.minimize.
This is my code to search for the calibrated parameters that are then used in an NSS yield curve:
from numpy import array, append
from scipy.optimize import minimize
from math import exp as EXP
const = [30,90,180,270,365,730,1095,1460,1825,2190,2555,2920,3285,3650,4015,4380,4745,5110,5475,5840,6205,6570,6935,7300]
y_real = [3.11826,3.71463,3.74677,3.83900,4.00049,4.40666,4.52346,4.64026,4.75706,4.87386,4.99066,5.10746,5.22426,
5.34106,5.44522,5.54669,5.64816,5.74963,5.85110,5.88607,5.91162,5.93717,5.96272,5.98827]
def model(x, const, y_real):
    arr = array([])
    for val in const:
        arr = append(arr,(x[0])+(x[1]*((1-EXP(-val/x[4]))/(val/x[4])))+(x[2]*((((1-EXP(-val/x[4]))/(val/x[4])))-(EXP(-val/x[4]))))+x[3]*((((1-EXP(-val/x[5]))/(val/x[5])))-(EXP(-val/x[5]))))
    return array(y_real) - arr
def fun(x, const, y_real):
    eval = model(x, const, y_real)
    leval = array([0 if val < 1.0e-20 else 1 for val in eval])
    return leval
con = {'type': 'eq', 'fun': fun, 'args' : (const, y_real)}
x0 = array([0.001, 0.001, 0.001, 0.001, 1.0e-10, 1.0e-10])
#bounds_ = [(0.001,8),(0.001,8),(0.001,8),(0.001,8),(1.0e-15,3),(1.0e-15,3)] bounds=bounds_
res = minimize(model, x0, method='SLSQP', constraints=[con] , args=(const, y_real))
print(res)
but I get the following error:
in _minimize_slsqp
w = zeros(len_w)
ValueError: negative dimensions are not allowed
How can I reach a solution with the SLSQP method (or another, better option) without this error?
Thanks in advance.
First of all, your problem is probably that your model doesn't return a scalar value.
It's highly recommended to make yourself familiar with numpy. For instance, given const and y_real are numpy arrays, you don't need loops and append to implement your model function. It can be written as follows (note that it returns the sum of the squared differences):
import numpy as np
const = np.array([ # your values here ])
y_real = np.array([ # your values here ])
def model(x, const, y_real):
    arr = (
        x[0] + (x[1] * ((1 - np.exp(-const / x[4])) / (const / x[4])))
        + (x[2] * ((((1 - np.exp(-const / x[4])) / (const / x[4])))
                   - (np.exp(-const / x[4]))))
        + x[3] * ((((1 - np.exp(-const / x[5])) / (const / x[5])))
                  - (np.exp(-const / x[5])))
    )
    return np.sum((y_real - arr)**2)
In addition, there are a few more things that do not make sense and it's still not 100% clear to me what you are trying to achieve:
Subtracting a vector of zeros from the evaluated model inside fun
Subtracting a vector of zeros from leval inside fun
Adding the constraint fun(x) >= 0. What's the purpose of this constraint? The function returns a vector consisting of zeros or ones, so this constraint is always fulfilled.
By ignoring the constraint fun (which, by the way, is not differentiable and contradicts the mathematical assumptions of the SLSQP algorithm), you can write:
from scipy.optimize import minimize
x0 = np.array([0.001, 0.001, 0.001, 0.001, 1.0e-10, 1.0e-10])
res = minimize(lambda x: model(x, const, y_real), x0, method="SLSQP")
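As an aside, if the real goal is to drive the vector of differences y_real - model towards zero, a plain least-squares formulation may be more natural than an equality constraint. A rough sketch of that alternative (not the method above; it assumes const and y_real are the NumPy arrays defined earlier and x0 is the start vector):
from scipy.optimize import least_squares

def residuals(x, const, y_real):
    # raw (unsquared, unsummed) differences between the data and the NSS curve
    t1 = (1 - np.exp(-const / x[4])) / (const / x[4])
    t2 = t1 - np.exp(-const / x[4])
    t3 = (1 - np.exp(-const / x[5])) / (const / x[5]) - np.exp(-const / x[5])
    return y_real - (x[0] + x[1] * t1 + x[2] * t2 + x[3] * t3)

res_ls = least_squares(residuals, x0, args=(const, y_real))
print(res_ls.x)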
For this case, I worked out part of a functional solution using the perspective and some good tips from @joni in the previous answer (very kind for that). It is only a partial solution because it works approximately for the first third of the dataset; for the rest of the dataset you can use the code from the previous case (Calibrate parameters of Yield Curve Nelson Siegel Svensson).
I share it with you so you can reuse it and get the idea of one possible two-part solution; please discuss if you have a more sophisticated solution.
from scipy.optimize import minimize
import numpy as np
from math import trunc
from nelson_siegel_svensson import NelsonSiegelSvenssonCurve
import pandas as pd
import matplotlib.pyplot as plt
# days
const = np.array([30,90,270,548,913,1278,1643,2008,2373,2738,3103,3468,3833,4198,4563,4928,5293,5658,6023,6388,6935,7300,7665,8030])/365
# empty values represented with 0
y_real = np.array([3.33156,3.44928,3.62778,3.74313,3.96015,4.384,4.4705,4.55701,4.63817,4.69949,4.76081,4.82213,4.87285,4.8681,4.86336,4.85861,4.85387,4.84912,4.87039,4.89833,4.94286,4.98739,5.03192,5.07645])
def model(x, const, y_real):
    arr = np.array([])
    for val in const:
        arr = np.append(arr,(x[0])+(x[1]*((1-np.exp(-val/x[4]))/(val/x[4])))+(x[2]*((((1-np.exp(-val/x[4]))/(val/x[4])))-(np.exp(-val/x[4]))))+x[3]*((((1-np.exp(-val/x[5]))/(val/x[5])))-(np.exp(-val/x[5]))))
    return np.sum((y_real - arr)**2)
# initial coefficients of Nelson Siegel Svensson
x0 = np.array([0.001, 0.001, 0.001, 0.001, 1.0e-10, 1.0e-10])
res = minimize(model, x0, method='SLSQP', args=(const, y_real),
               options={'maxiter': 10000, 'ftol': 1e-190})
print(res.x)
X_fix = np.linspace(start=const[0], stop=const[-1], num=(const.size*20))
NSS = NelsonSiegelSvenssonCurve(beta0=res.x[0], beta1=res.x[1], beta2=res.x[2], beta3=res.x[3], tau1=res.x[4], tau2=res.x[5])
pd_interpolation = pd.DataFrame(columns=['Period','Value'])
font = {'family': 'serif',
        'color': '#1F618D',
        'weight': 'bold',
        'size': 14,
        }
font_x_y = {'family': 'serif',
            'color': '#C70039',
            'weight': 'bold',
            'size': 13,
            'style': 'oblique'
            }
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10, 6))
minx = -const[1]*3
config_manager = plt.get_current_fig_manager()
config_manager.set_window_title("Visualización " )
screen_x, screen_y = config_manager.window.wm_maxsize()
anchura = str(trunc(screen_x/12))
altura = str(trunc(screen_y/8))
middle_window = "+" + anchura + "+" + altura
config_manager.window.wm_geometry(middle_window)
plt.title('Interpolación ', fontdict=font, loc='center')
plt.xlabel('Periodo', fontdict=font_x_y)
plt.ylabel('Aproximación', fontdict=font_x_y, rotation=90, labelpad=10)
ax.set_ylim(ymin=0, ymax=(np.amax(y_real)*1.1))
ax.set_xlim(xmin=minx, xmax=(np.amax(const)*1.03))
ax.plot(const, y_real, 'ro', label='Dato real')
ax.plot(X_fix, NSS(X_fix),'--', label='Dato interpolado')
ax.legend(loc='lower right', frameon=False)
plt.show()
And the resulting graph is this:

How to run non-linear regression in python

I have the following information (dataframe) in Python:
product baskets scaling_factor
12345 475 95.5
12345 108 57.7
12345 2 1.4
12345 38 21.9
12345 320 88.8
and I want to run the following non-linear regression and estimate the parameters a, b and c.
Equation that I want to fit:
scaling_factor = a - (b*np.exp(c*baskets))
In SAS we usually run the following model (it uses the Gauss-Newton method):
proc nlin data=scaling_factors;
parms a=100 b=100 c=-0.09;
model scaling_factor = a - (b * (exp(c*baskets)));
output out=scaling_equation_parms
parms=a b c;
Is there a similar way to estimate the parameters in Python using non-linear regression? And how can I see the plot in Python?
For problems like these I always use scipy.optimize.minimize with my own least squares function. The optimization algorithms don't handle large differences between the various inputs well, so it is a good idea to scale the parameters in your function so that the parameters exposed to scipy are all on the order of 1 as I've done below.
import numpy as np
import scipy.optimize

baskets = np.array([475, 108, 2, 38, 320])
scaling_factor = np.array([95.5, 57.7, 1.4, 21.9, 88.8])

def lsq(arg):
    a = arg[0]*100
    b = arg[1]*100
    c = arg[2]*0.1
    now = a - (b*np.exp(c * baskets)) - scaling_factor
    return np.sum(now**2)
guesses = [1, 1, -0.9]
res = scipy.optimize.minimize(lsq, guesses)
print(res.message)
# 'Optimization terminated successfully.'
print(res.x)
# [ 0.97336709 0.98685365 -0.07998282]
print([lsq(guesses), lsq(res.x)])
# [7761.0093358076601, 13.055053196410928]
Of course, as with all minimization problems it is important to use good initial guesses since all of the algorithms can get trapped in a local minimum. The optimization method can be changed by using the method keyword; some of the possibilities are
‘Nelder-Mead’
‘Powell’
‘CG’
‘BFGS’
‘Newton-CG’
The default is BFGS according to the documentation.
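For example, switching to the gradient-free Nelder-Mead simplex is just a keyword change (illustrative sketch, reusing lsq and guesses from above):
res_nm = scipy.optimize.minimize(lsq, guesses, method='Nelder-Mead')
print(res_nm.x)  # compare with the BFGS result above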
Agreeing with Chris Mueller, I'd also use scipy but scipy.optimize.curve_fit.
The code looks like:
###the top two lines are required on my linux machine
import matplotlib
matplotlib.use('Qt4Agg')
import matplotlib.pyplot as plt
from matplotlib.pyplot import cm
import numpy as np
from scipy.optimize import curve_fit #we could import more, but this is what we need
###defining your fitfunction
def func(x, a, b, c):
    return a - b* np.exp(c * x)
###OP's data
baskets = np.array([475, 108, 2, 38, 320])
scaling_factor = np.array([95.5, 57.7, 1.4, 21.9, 88.8])
###let us guess some start values
initialGuess=[100, 100,-.01]
guessedFactors=[func(x,*initialGuess ) for x in baskets]
###making the actual fit
popt,pcov = curve_fit(func, baskets, scaling_factor,initialGuess)
#one may want to
print(popt)
print(pcov)
###preparing data for showing the fit
basketCont=np.linspace(min(baskets),max(baskets),50)
fittedData=[func(x, *popt) for x in basketCont]
###preparing the figure
fig1 = plt.figure(1)
ax=fig1.add_subplot(1,1,1)
###the three sets of data to plot
ax.plot(baskets,scaling_factor,linestyle='',marker='o', color='r',label="data")
ax.plot(baskets,guessedFactors,linestyle='',marker='^', color='b',label="initial guess")
ax.plot(basketCont,fittedData,linestyle='-', color='#900000',label="fit with ({0:0.2g},{1:0.2g},{2:0.2g})".format(*popt))
###beautification
ax.legend(loc=0, title="graphs", fontsize=12)
ax.set_ylabel("factor")
ax.set_xlabel("baskets")
ax.grid()
ax.set_title(r"$\mathrm{curve}_\mathrm{fit}$")
###putting the covariance matrix nicely
tab= [['{:.2g}'.format(j) for j in i] for i in pcov]
the_table = plt.table(cellText=tab,
colWidths = [0.2]*3,
loc='upper right', bbox=[0.483, 0.35, 0.5, 0.25] )
plt.text(250,65,'covariance:',size=12)
###putting the plot
plt.show()
###done
Eventually, giving you:

Linear regression with leastsq() and global minimum not found

In Python scipy.optimize.leastsq() is normally used for non-linear regression. However, leastsq() should in principle be expected to work with linear fitting functions also. Here appears to be a simple linear regression problem that leastsq() apparently fails to solve properly. Data is fitted with the line y=mx.
Code sample is at the bottom of the post. When plot_real_data = False, then 100 points of linearly correlated data are generated randomly. Here leastsq() can effectively find the minimum of the sum-squared error function:
Graph of correct solution
However, when plot_real_data = True, then 100 data points are taken from a real data set. Here, leastsq() cannot, for some unknown reason, find the minimum of the sum-squared error function:
Graph of incorrect solution
leastsq() consistently reports an optimal gradient parameter m=1.082, regardless of the initial guess of the gradient. However m=1.082 is not the global minimum. The proper value is closer to m=1.25:
print(sum(errorfunc([1.0], x, y)))   # 3.9511006207
print(sum(errorfunc([1.08], x, y)))  # 3.59052114948
print(sum(errorfunc([1.25], x, y)))  # 3.37109033259 (near the minimum)
print(sum(errorfunc([1.4], x, y)))   # 3.79503789072
This is puzzling behaviour. In this case, the sum squared error function is a simple quadratic and there is no risk of local minima.
I know that direct methods exist for linear regression, but any ideas on this issue with leastsq()?
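(For reference, the direct solution for a zero-intercept line minimizing the same sum-squared error is m = sum(x*y) / sum(x*x); a quick check, using the x and y lists from the code below:)
import numpy as np
m_direct = np.dot(x, y) / np.dot(x, x)   # closed-form least-squares slope for y = m*x
print(m_direct)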
Python 2.7.11 :: Anaconda 4.0.0 (64-bit)
Scipy version 0.17.0
CODE:
from __future__ import division
import matplotlib.pyplot as plt
import numpy
import random
from scipy.optimize import leastsq
def errorfunc(params, x_data, y_data) :
    """
    Return error at each x point, to a straight line of gradient m
    This 1-parameter error function has a clearly defined minimum
    """
    squared_errors = []
    for i, lm in enumerate(x_data) :
        predicted_um = lm * params[0]
        squared_errors.append((y_data[i] - predicted_um)**2)
    return squared_errors
plt.figure()
###################################################################
# STEP 1: make a scatter plot of the data
plot_real_data = True
###################################################################
if plot_real_data :
    # 100 points of real data
    x = [0.85772, 0.17135, 0.03401, 0.17227, 0.17595, 0.1742, 0.22454, 0.32792, 0.19036, 0.17109, 0.16936, 0.17357, 0.6841, 0.24588, 0.22913, 0.28291, 0.19845, 0.3324, 0.66254, 0.1766, 0.47927, 0.47999, 0.50301, 0.16035, 0.65964, 0.0, 0.14308, 0.11648, 0.10936, 0.1983, 0.13352, 0.12471, 0.29475, 0.25212, 0.08334, 0.07697, 0.82263, 0.28078, 0.24192, 0.25383, 0.26707, 0.26457, 0.0, 0.24843, 0.26504, 0.24486, 0.0, 0.23914, 0.76646, 0.66567, 0.62966, 0.61771, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.79157, 0.06889, 0.07669, 0.1372, 0.11681, 0.11103, 0.13577, 0.07543, 0.10636, 0.09176, 0.10941, 0.08327, 1.19903, 0.20987, 0.21103, 0.21354, 0.26011, 0.28862, 0.28441, 0.2424, 0.29196, 0.20248, 0.1887, 0.20045, 1.2041, 0.20687, 0.22448, 0.23296, 0.25434, 0.25832, 0.25722, 0.24378, 0.24035, 0.17912, 0.18058, 0.13556, 0.97535, 0.25504, 0.20418, 0.22241]
    y = [1.13085, 0.19213, 0.01827, 0.20984, 0.21898, 0.12174, 0.38204, 0.31002, 0.26701, 0.2759, 0.26018, 0.24712, 1.18352, 0.29847, 0.30622, 0.5195, 0.30406, 0.30653, 1.13126, 0.24761, 0.81852, 0.79863, 0.89171, 0.19251, 1.33257, 0.0, 0.19127, 0.13966, 0.15877, 0.19266, 0.12997, 0.13133, 0.25609, 0.43468, 0.09598, 0.08923, 1.49033, 0.27278, 0.3515, 0.38368, 0.35134, 0.37048, 0.0, 0.3566, 0.36296, 0.35054, 0.0, 0.32712, 1.23759, 1.02589, 1.02413, 0.9863, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.19224, 0.12192, 0.12815, 0.2672, 0.21856, 0.14736, 0.20143, 0.1452, 0.15965, 0.14342, 0.15828, 0.12247, 0.5728, 0.10603, 0.08939, 0.09194, 0.1145, 0.10313, 0.13377, 0.09734, 0.12124, 0.11429, 0.09536, 0.11457, 0.76803, 0.10173, 0.10005, 0.10541, 0.13734, 0.12192, 0.12619, 0.11325, 0.1092, 0.11844, 0.11373, 0.07865, 1.28568, 0.25871, 0.22843, 0.26608]
else :
    # 100 points of test data with noise added
    x_clean = numpy.linspace(0,1.2,100)
    y_clean = [ i * 1.38 for i in x_clean ]
    x = [ i + random.uniform(-1 * random.uniform(0, 0.1), random.uniform(0, 0.1)) for i in x_clean ]
    y = [ i + random.uniform(-1 * random.uniform(0, 0.5), random.uniform(0, 0.5)) for i in y_clean ]
plt.subplot(2,1,1)
plt.scatter(x,y); plt.xlabel('x'); plt.ylabel('y')
# STEP 2: vary gradient m of a y = mx fitting line
# plot sum squared error with respect to gradient m
# here you can see by eye, the optimal gradient of the fitting line
plt.subplot(2,1,2)
try_m = numpy.linspace(0.1,4,200)
sse = [ sum(errorfunc([m], x, y)) for m in try_m ]
plt.plot(try_m,sse); plt.xlabel('line gradient, m'); plt.ylabel('sum-squared error')
# STEP 3: use leastsq() to find optimal gradient m
params = [2] # start with initial guess of 2 for gradient
params_fitted, cov, infodict, mesg, ier = leastsq(errorfunc, params[:], args=(x, y), full_output=1)
optimal_m = params_fitted[0]
print(optimal_m)
# optimal gradient m should be the minimum of the error function
plt.subplot(2,1,2)
plt.plot([optimal_m,optimal_m],[0,100], 'r')
# optimal gradient m should give best fit straight line
plt.subplot(2,1,1)
plt.plot([0, 1.2],[0, 1.2 * optimal_m],'r')
plt.show()

Fit a non-linear function to data/observations with pyMCMC/pyMC

I am trying to fit some data with a Gaussian (and more complex) function(s). I have created a small example below.
My first question is, am I doing it right?
My second question is, how do I add an error in the x-direction, i.e. in the x-position of the observations/data?
It is very hard to find nice guides on how to do this kind of regression in pyMC, perhaps because it is easier to use some least-squares or similar approach. However, I have many parameters in the end and need to see how well we can constrain them and compare different models, so pyMC seemed like a good choice for that.
import pymc
import numpy as np
import matplotlib.pyplot as plt; plt.ion()
x = np.arange(5,400,10)*1e3
# Parameters for gaussian
amp_true = 0.2
size_true = 1.8
ps_true = 0.1
# Gaussian function
gauss = lambda x,amp,size,ps: amp*np.exp(-1*(np.pi**2/(3600.*180.)*size*x)**2/(4.*np.log(2.)))+ps
f_true = gauss(x=x,amp=amp_true, size=size_true, ps=ps_true )
# add noise to the data points
noise = np.random.normal(size=len(x)) * .02
f = f_true + noise
f_error = np.ones_like(f_true)*0.05*f.max()
# define the model/function to be fitted.
def model(x, f):
    amp = pymc.Uniform('amp', 0.05, 0.4, value=0.15)
    size = pymc.Uniform('size', 0.5, 2.5, value=1.0)
    ps = pymc.Normal('ps', 0.13, 40, value=0.15)

    @pymc.deterministic(plot=False)
    def gauss(x=x, amp=amp, size=size, ps=ps):
        e = -1*(np.pi**2*size*x/(3600.*180.))**2/(4.*np.log(2.))
        return amp*np.exp(e)+ps

    y = pymc.Normal('y', mu=gauss, tau=1.0/f_error**2, value=f, observed=True)
    return locals()
MDL = pymc.MCMC(model(x,f))
MDL.sample(1e4)
# extract and plot results
y_min = MDL.stats()['gauss']['quantiles'][2.5]
y_max = MDL.stats()['gauss']['quantiles'][97.5]
y_fit = MDL.stats()['gauss']['mean']
plt.plot(x,f_true,'b', marker='None', ls='-', lw=1, label='True')
plt.errorbar(x,f,yerr=f_error, color='r', marker='.', ls='None', label='Observed')
plt.plot(x,y_fit,'k', marker='+', ls='None', ms=5, mew=2, label='Fit')
plt.fill_between(x, y_min, y_max, color='0.5', alpha=0.5)
plt.legend()
I realize that I might have to run more iterations, and use burn-in and thinning in the end. The figure plotting the data and the fit is seen here below.
The pymc.Matplot.plot(MDL) figures look like this, showing nicely peaked distributions. This is good, right?
My first question is, am I doing it right?
Yes! You need to include a burn-in period, which you know. I like to throw out the first half of my samples. You don't need to do any thinning, but sometimes it will make your post-MCMC work faster to process and smaller to store.
The only other thing I advise is to set a random seed, so that your results are "reproducible": np.random.seed(12345) will do the trick.
Oh, and if I was really giving too much advice, I'd say import seaborn to make the matplotlib results a little more beautiful.
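A minimal sketch of what that looks like with the PyMC2 sampler (illustrative numbers; model(x, f) is the function from your code above):
np.random.seed(12345)                       # reproducibility
MDL = pymc.MCMC(model(x, f))
MDL.sample(iter=20000, burn=10000, thin=2)  # throw away the first half, light thinning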
My second question is, how do I add an error in the x-direction, i.e. in the x-position of the observations/data?
One way is to include a latent variable for each error. This works in your example, but will not be feasible if you have many more observations. I'll give a little example to get you started down this road:
# add noise to observed x values
x_obs = pm.rnormal(mu=x, tau=(1e4)**-2)
# define the model/function to be fitted.
def model(x_obs, f):
    amp = pm.Uniform('amp', 0.05, 0.4, value=0.15)
    size = pm.Uniform('size', 0.5, 2.5, value=1.0)
    ps = pm.Normal('ps', 0.13, 40, value=0.15)
    x_pred = pm.Normal('x', mu=x_obs, tau=(1e4)**-2) # this allows error in x_obs

    @pm.deterministic(plot=False)
    def gauss(x=x_pred, amp=amp, size=size, ps=ps):
        e = -1*(np.pi**2*size*x/(3600.*180.))**2/(4.*np.log(2.))
        return amp*np.exp(e)+ps

    y = pm.Normal('y', mu=gauss, tau=1.0/f_error**2, value=f, observed=True)
    return locals()
MDL = pm.MCMC(model(x_obs, f))
MDL.use_step_method(pm.AdaptiveMetropolis, MDL.x_pred) # use AdaptiveMetropolis to "learn" how to step
MDL.sample(200000, 100000, 10) # run chain longer since there are more dimensions
It looks like it may be hard to get good answers if you have noise in x and y:
Here is a notebook collecting this all up.
EDIT: Important note
This has been bothering me for a while now. The answers given by myself and Abraham here are correct in the sense that they add variability to x. HOWEVER: Note that you cannot simply add uncertainty in this way to cancel out the errors you have in your x-values, so that you regress against "true x". The methods in this answer can show you how adding errors to x affects your regression if you have the true x. If you have a mismeasured x, these answers will not help you. Having errors in the x-values is a very tricky problem to solve, as it leads to "attenuation" and an "errors-in-variables effect". The short version is: having unbiased, random errors in x leads to bias in your regression estimates. If you have this problem, check out Carroll, R.J., Ruppert, D., Crainiceanu, C.M. and Stefanski, L.A., 2006. Measurement error in nonlinear models: a modern perspective. Chapman and Hall/CRC., or for a Bayesian approach, Gustafson, P., 2003. Measurement error and misclassification in statistics and epidemiology: impacts and Bayesian adjustments. CRC Press. I ended up solving my specific problem using Carroll et al.'s SIMEX method along with PyMC3. The details are in Carstens, H., Xia, X. and Yadavalli, S., 2017. Low-cost energy meter calibration method for measurement and verification. Applied energy, 188, pp.563-575. It is also available on ArXiv
I converted Abraham Flaxman's answer above into PyMC3, in case someone needs it. The changes are very minor, but can be confusing nevertheless.
The first is that the deterministic decorator @deterministic is replaced by a distribution-like call, var = pymc3.Deterministic(). Second, when generating a vector of normally distributed random variables,
rvs = pymc2.rnormal(mu=mu, tau=tau)
is replaced by
rvs = pymc3.Normal('var_name', mu=mu, tau=tau,shape=size(var)).random()
The complete code is as follows:
import numpy as np
from pymc3 import *
import matplotlib.pyplot as plt
# set random seed for reproducibility
np.random.seed(12345)
x = np.arange(5,400,10)*1e3
# Parameters for gaussian
amp_true = 0.2
size_true = 1.8
ps_true = 0.1
#Gaussian function
gauss = lambda x,amp,size,ps: amp*np.exp(-1*(np.pi**2/(3600.*180.)*size*x)**2/(4.*np.log(2.)))+ps
f_true = gauss(x=x,amp=amp_true, size=size_true, ps=ps_true )
# add noise to the data points
noise = np.random.normal(size=len(x)) * .02
f = f_true + noise
f_error = np.ones_like(f_true)*0.05*f.max()
with Model() as model3:
    amp = Uniform('amp', 0.05, 0.4, testval=0.15)
    size = Uniform('size', 0.5, 2.5, testval=1.0)
    ps = Normal('ps', 0.13, 40, testval=0.15)
    gauss = Deterministic('gauss', amp*np.exp(-1*(np.pi**2*size*x/(3600.*180.))**2/(4.*np.log(2.)))+ps)
    y = Normal('y', mu=gauss, tau=1.0/f_error**2, observed=f)
    start = find_MAP()
    step = NUTS()
    trace = sample(2000, start=start)
# extract and plot results
y_min = np.percentile(trace.gauss,2.5,axis=0)
y_max = np.percentile(trace.gauss,97.5,axis=0)
y_fit = np.percentile(trace.gauss,50,axis=0)
plt.plot(x,f_true,'b', marker='None', ls='-', lw=1, label='True')
plt.errorbar(x,f,yerr=f_error, color='r', marker='.', ls='None', label='Observed')
plt.plot(x,y_fit,'k', marker='+', ls='None', ms=5, mew=1, label='Fit')
plt.fill_between(x, y_min, y_max, color='0.5', alpha=0.5)
plt.legend()
Which results in
y_error
For errors in x (note the 'x' suffix to variables):
# define the model/function to be fitted in PyMC3:
with Model() as modelx:
    x_obsx = Normal('x_obsx', mu=x, tau=(1e4)**-2, shape=40)
    ampx = Uniform('ampx', 0.05, 0.4, testval=0.15)
    sizex = Uniform('sizex', 0.5, 2.5, testval=1.0)
    psx = Normal('psx', 0.13, 40, testval=0.15)
    x_pred = Normal('x_pred', mu=x_obsx, tau=(1e4)**-2*np.ones_like(x_obsx), testval=5*np.ones_like(x_obsx), shape=40)  # this allows error in x_obs
    gauss = Deterministic('gauss', ampx*np.exp(-1*(np.pi**2*sizex*x_pred/(3600.*180.))**2/(4.*np.log(2.)))+psx)
    y = Normal('y', mu=gauss, tau=1.0/f_error**2, observed=f)
    start = find_MAP()
    step = NUTS()
    tracex = sample(20000, start=start)
Which results in:
x_error_graph
The last observation is that when doing
traceplot(tracex[100:])
plt.tight_layout();
(result not shown), we can see that sizex seems to be suffering from 'attenuation' or 'regression dilution' due to the error in the measurement of x.

Fitting negative binomial in python

In scipy there is no support for fitting a negative binomial distribution using data
(maybe due to the fact that the negative binomial in scipy is only discrete).
For a normal distribution I would just do:
from scipy.stats import norm
param = norm.fit(samp)
Is there something similar 'ready to use' function in any other library?
Statsmodels has discrete.discrete_model.NegativeBinomial.fit(), see here:
https://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.NegativeBinomial.fit.html#statsmodels.discrete.discrete_model.NegativeBinomial.fit
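A minimal sketch of how that could be used on a raw sample of counts (assuming samp is the 1-D array from the question; the intercept-only design matrix is just a column of ones):
import numpy as np
import statsmodels.api as sm

res = sm.NegativeBinomial(samp, np.ones(len(samp))).fit()
print(res.params)  # [const, alpha]; with the default log link, exp(const) is the fitted mean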
Not only because it is discrete, but also because a maximum likelihood fit to the negative binomial can be quite involved, especially with an additional location parameter. That would be the reason why the .fit() method is not provided for it (and other discrete distributions in SciPy). Here is an example:
In [163]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as ss
import scipy.optimize as so
In [164]:
#define a likelihood function
def likelihood_f(P, x, neg=1):
    n=np.round(P[0]) #by definition, it should be an integer
    p=P[1]
    loc=np.round(P[2])
    return neg*(np.log(ss.nbinom.pmf(x, n, p, loc))).sum()
In [165]:
#generate a random variable
X=ss.nbinom.rvs(n=100, p=0.4, loc=0, size=1000)
In [166]:
#The likelihood
likelihood_f([100,0.4,0], X)
Out[166]:
-4400.3696690513316
In [167]:
#A simple fit, the fit is not good and the parameter estimate is way off
result=so.fmin(likelihood_f, [50, 1, 1], args=(X,-1), full_output=True, disp=False)
P1=result[0]
(result[1], result[0])
Out[167]:
(4418.599495886474, array([ 59.61196161, 0.28650831, 1.15141838]))
In [168]:
#Try a different set of start parameters, the fit is still not good and the parameter estimate is still way off
result=so.fmin(likelihood_f, [50, 0.5, 0], args=(X,-1), full_output=True, disp=False)
P1=result[0]
(result[1], result[0])
Out[168]:
(4417.1495981801972,
array([ 6.24809397e+01, 2.91877405e-01, 6.63343536e-04]))
In [169]:
#In this case we need a loop to get it right
result=[]
for i in range(40, 120): #in fact (80, 120) should probably be enough
    _=so.fmin(likelihood_f, [i, 0.5, 0], args=(X,-1), full_output=True, disp=False)
    result.append((_[1], _[0]))
In [170]:
#get the MLE
P2=sorted(result, key=lambda x: x[0])[0][1]
sorted(result, key=lambda x: x[0])[0]
Out[170]:
(4399.780263084549,
array([ 9.37289361e+01, 3.84587087e-01, 3.36856705e-04]))
In [171]:
#Which one is visually better?
plt.hist(X, bins=20, normed=True)
plt.plot(range(260), ss.nbinom.pmf(range(260), np.round(P1[0]), P1[1], np.round(P1[2])), 'g-')
plt.plot(range(260), ss.nbinom.pmf(range(260), np.round(P2[0]), P2[1], np.round(P2[2])), 'r-')
Out[171]:
[<matplotlib.lines.Line2D at 0x109776c10>]
I know this thread is quite old, but current readers may want to look at this repo which is made for this purpose: https://github.com/gokceneraslan/fit_nbinom
There's also an implementation here, though part of a larger package: https://github.com/ernstlab/ChromTime/blob/master/optimize.py
I stumbled across this thread, and found an answer for anyone else wondering.
If you simply need the n, p parameterisation used by scipy.stats.nbinom you can convert the mean and variance estimates:
mu = np.mean(sample)
sigma_sqr = np.var(sample)
n = mu**2 / (sigma_sqr - mu)
p = mu / sigma_sqr
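A quick sanity check of that moment matching, on simulated data (illustrative values):
import numpy as np
from scipy.stats import nbinom

sample = nbinom.rvs(n=10, p=0.5, size=100000, random_state=0)
mu = np.mean(sample)
sigma_sqr = np.var(sample)
print(mu**2 / (sigma_sqr - mu), mu / sigma_sqr)  # should come out near n=10, p=0.5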
If you want the dispersion parameter you can use a negative binomial regression model from statsmodels with just an intercept term. This will find the dispersion parameter alpha using MLE.
# Data processing
import pandas as pd
import numpy as np
# Analysis models
import statsmodels.formula.api as smf
from scipy.stats import nbinom
def convert_params(mu, alpha):
    """
    Convert mean/dispersion parameterization of a negative binomial to the ones scipy supports

    Parameters
    ----------
    mu : float
        Mean of NB distribution.
    alpha : float
        Overdispersion parameter used for variance calculation.

    See https://en.wikipedia.org/wiki/Negative_binomial_distribution#Alternative_formulations
    """
    var = mu + alpha * mu ** 2
    p = mu / var
    r = mu ** 2 / (var - mu)
    return r, p
# Generate sample data
n = 2
p = 0.9
sample = nbinom.rvs(n=n, p=p, size=10000)
# Estimate parameters
## Mean estimates expectation parameter for negative binomial distribution
mu = np.mean(sample)
## Dispersion parameter from NB model with only an intercept term
nbfit = smf.negativebinomial("nbdata ~ 1", data=pd.DataFrame({"nbdata": sample})).fit()
alpha = nbfit.params[1] # Dispersion parameter
# Convert parameters to n, p parameterization
n_est, p_est = convert_params(mu, alpha)
# Check that estimates are close to the true values:
print("""
{:<3} {:<3}
True parameters: {:<3} {:<3}
Estimates : {:<3} {:<3}""".format('n', 'p', n, p,
np.round(n_est, 2), np.round(p_est, 2)))
