I would like to fit a curve with curve_fit and prevent it from becoming negative. Unfortunately, the code below does not work. Any hints? Thanks a lot!
# Imports
from scipy.optimize import curve_fit
import numpy as np
import matplotlib.pyplot as plt
xData = [0.0009824379203203417, 0.0011014182912933933, 0.0012433979929054324, 0.0014147106052612918, 0.0016240300315499524, 0.0018834904507916608, 0.002210485320720769, 0.002630660216394964, 0.0031830988618379067, 0.003929751681281367, 0.0049735919716217296, 0.0064961201261998095, 0.008841941282883075, 0.012732395447351627, 0.019894367886486918, 0.0353677651315323, 0.07957747154594767, 0.3183098861837907]
yData = [99.61973156923796, 91.79478510744039, 92.79302188621314, 84.32927272723863, 77.75060981602016, 75.62801782349504, 70.48026800610839, 72.21240551953743, 68.14019252499526, 55.23015406920851, 57.212682880377464, 50.777016257727176, 44.871140881319626, 40.544138806850846, 32.489105158795525, 25.65367127756607, 19.894206907130403, 13.057996247388862]
def func(x,m,c,d):
'''
Fitting Function
I put d as an absolute number to prevent negative values for d?
'''
return x**m * c + abs(d)
p0 = [-1, 1, 1]
coeff, _ = curve_fit(func, xData, yData, p0) # Fit curve
m, c, d = coeff[0], coeff[1], coeff[2]
print("d: " + str(d)) # Why is it negative!!
Your model actually works fine as the following plot shows. I used your code and plotted the original data and the data you obtain with the fitted parameters:
As you can see, the data can nicely be reproduced but you indeed obtain a negative value for d (which must not be a bad thing depending on the context of the model). If you want to avoid it, I recommend to use lmfit where you can constrain your parameters to certain ranges. The next plot shows the outcome.
As you can see, it also reproduces the data well and you obtain a positive value for d as desired.
namely:
m: -0.35199747
c: 8.48813181
d: 0.05775745
Here is the entire code that reproduces the figures:
# Imports
from scipy.optimize import curve_fit
import numpy as np
import matplotlib.pyplot as plt
#additional import
from lmfit import minimize, Parameters, Parameter, report_fit
xData = [0.0009824379203203417, 0.0011014182912933933, 0.0012433979929054324, 0.0014147106052612918, 0.0016240300315499524, 0.0018834904507916608, 0.002210485320720769, 0.002630660216394964, 0.0031830988618379067, 0.003929751681281367, 0.0049735919716217296, 0.0064961201261998095, 0.008841941282883075, 0.012732395447351627, 0.019894367886486918, 0.0353677651315323, 0.07957747154594767, 0.3183098861837907]
yData = [99.61973156923796, 91.79478510744039, 92.79302188621314, 84.32927272723863, 77.75060981602016, 75.62801782349504, 70.48026800610839, 72.21240551953743, 68.14019252499526, 55.23015406920851, 57.212682880377464, 50.777016257727176, 44.871140881319626, 40.544138806850846, 32.489105158795525, 25.65367127756607, 19.894206907130403, 13.057996247388862]
def func(x,m,c,d):
'''
Fitting Function
I put d as an absolute number to prevent negative values for d?
'''
print m,c,d
return np.power(x,m)*c + d
p0 = [-1, 1, 1]
coeff, _ = curve_fit(func, xData, yData, p0) # Fit curve
m, c, d = coeff[0], coeff[1], coeff[2]
print("d: " + str(d)) # Why is it negative!!
plt.scatter(xData, yData, s=30, marker = "v",label='P')
plt.scatter(xData, func(xData, *coeff), s=30, marker = "v",color="red",label='curvefit')
plt.show()
#####the new approach starts here
def func2(params, x, data):
m = params['m'].value
c = params['c'].value
d = params['d'].value
model = np.power(x,m)*c + d
return model - data #that's what you want to minimize
# create a set of Parameters
params = Parameters()
params.add('m', value= -2) #value is the initial condition
params.add('c', value= 8.)
params.add('d', value= 10.0, min=0) #min=0 prevents that d becomes negative
# do fit, here with leastsq model
result = minimize(func2, params, args=(xData, yData))
# calculate final result
final = yData + result.residual
# write error report
report_fit(params)
try:
import pylab
pylab.plot(xData, yData, 'k+')
pylab.plot(xData, final, 'r')
pylab.show()
except:
pass
You could use the scipy.optimize.curve_fit method's bounds option to specify the maximum bound and the minimum bound.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
Bounds is a two tuple array. In your case, you just need to specify the lower bound for d. You could use,
bounds=([-np.inf, -np.inf, 0], np.inf)
Note: If you provide a scalar as a parameter (eg:- as the second variable above), it automatically applies as the upper bound for all three coefficients.
You just need to add one little argument to constrain your parameters. That is:
curve_fit(func, xData, yData, p0, bounds=([m1,c1,d1],[m2,c2,d2]))
where m1,c1,d1 are the lower bounds of the parameters (in your case they should be 0) and
m2,c2,d2 are the upper bounds.
If u want all m,c,d to be positive, the code should goes like the following:
curve_fit(func, xData, yData, p0, bounds=(0,numpy.inf))
where all the parameters have a lower bound of 0 and an upper bound of infinity(no bound)
Related
The acquisition channel of scipy and the same version are used.
The result of least_squares is different depending on the environment.
Differences in the environment, the PC is different.
version:1.9.1 py39h316f440_0
channel:conda-forge
environment:windows
I've attached the source code I ran.
If the conditions are the same except for the environment, I would like to get the same results.
Why different causes? How can I do that?
thank you.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint
from scipy.optimize import least_squares
import random
random.seed(134)
import numpy as np
np.random.seed(134)
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint
from scipy.optimize import least_squares
def report_params(fit_params_values, fit_param_names):
for each in range(len(fit_param_names)):
print(fit_param_names[each], 'is', fit_params_values[each])
# define your modules
def pCon1():
# This is the module for a specific insubstatiation of a constituitive promoter
# the input is nothing
# the output is a protein production amount per time unit
pCon1_production_rate = 100
return pCon1_production_rate
def pLux1(LuxR, AHL):
# This is the module for a specific insubstatiation of a lux promoter
# the input is a LuxR amount and an AHL amount
# the output is a protein production amount per time unit
# For every promoter there is some function that determines what the promoter's
# maximal and basal expression are based on the amount of transcriptional factor
# is floating around in the cell. These numbers are empircally determined, and
# for demonstration purposes are fictionally and arbitrarily filled in here.
# These functions take the form of hill functions.
basal_n = 2
basal_basal = 2
basal_max = 2
basal_kd = 2
basal_expression_rate = basal_basal + (basal_max * (LuxR**basal_n / (LuxR**basal_n + basal_kd)))
max_n = 2
max_max = 2
max_kd = 2
maximal_expression_rate = (LuxR**max_n / (LuxR**max_n + max_kd))
pLux1_n = 2
pLux1_kd = 10
pLux1_production_rate = basal_expression_rate + maximal_expression_rate*(AHL**pLux1_n / (pLux1_kd + AHL**pLux1_n))
return pLux1_production_rate
def simulation_set_of_equations(y, t, *args):
# Args are strictly for parameters we want to eventually estimate.
# Everything else must be hardcoded below. Sorry for the convience.
# Unpack your parameters
k_pCon_express = args[0] # A summation of transcription and translation from a pCon promoter
k_pLux_express = args[1] # A summation of transcription and translation from a pLux promoter
k_loss = args[2] # A summation of dilution and degredation
# Unpack your current amount of each species
LuxR, GFP, AHL = y
# Determine the change in each species
dLuxR = pCon1() - k_loss*LuxR
dGFP = pLux1(LuxR, AHL)*k_pLux_express - k_loss*GFP
dAHL = 0 # for now we're assuming AHL was added exogenously and never degrades
# Return the change in each species; make sure same order as your init values
# scipy.odeint will take these values and apply them to the current value of each species in the next time step for you
return [dLuxR, dGFP, dAHL]
# Parameters
k_pCon_express = 101
k_pLux_express = 50
k_loss = 0.1
params = (k_pCon_express, k_pLux_express, k_loss)
param_names = ['k_pCon_express', 'k_pLux_express', 'k_loss'] # somehow this is honestly necessary in Python?!
# Initial Conditions
# LuxR, GFP, AHL
init_P = [1000, 0, 11]
# Timesteps
n_steps = 500
t = np.linspace(0, 30, n_steps)
num_P = odeint(simulation_set_of_equations, init_P, t, args = (params))
plt.plot(t, num_P[:,0], c='b', label = 'LuxR')
plt.plot(t, num_P[:,1], c='g', label = 'GFP')
plt.plot(t, num_P[:,2], c='r', label = 'AHL')
plt.xlabel('Time')
plt.ylabel('Concentration')
plt.legend(loc = 'best')
plt.grid()
plt.yscale('log')
plt.show()
noise = np.random.normal(0, 10, num_P.shape)
exp_P = num_P + noise
exp_t = t[::10]
exp_P = exp_P[::10]
# Create experimental data. Just take the regular simulation data and add some gaussian noise to it.
def residuals(params):
params = tuple(params)
sim_P = odeint(simulation_set_of_equations, init_P, exp_t, args = params)
res = sim_P - exp_P
return res.flatten()
initial_guess = (100, 100, 100)
low_bounds = [0, 0, 0]
up_bounds = [1000, 1000, 1000]
fitted_params = least_squares(residuals, initial_guess, bounds=(low_bounds, up_bounds)).x
# small reminder: .x is the fitted parameters attribute of the least_squares output
# With least_squares function, unlike, say, curve_fit, it does not compute the covariance matrix for you
# TODO calculate standard deviation of parameter estimation
# (will this ever be used other than sanity checking?)
print(params)
report_params(fitted_params, param_names)
(101, 50, 0.1)
k_pCon_express is 100.0
k_pLux_express is 49.9942246627
k_loss is 0.100037839987
plt.plot(t, odeint(simulation_set_of_equations, init_P, t, args = tuple(params))[:,1], c='r', label='GFP - Given Param Simulation')
plt.scatter(exp_t, exp_P[:,1], c='b', label='GFP - Fake Experimental Data')
plt.plot(t, odeint(simulation_set_of_equations, init_P, t, args = tuple(fitted_params))[:,1], c='g', label='GFP - Fitted Param Simlulation')
plt.legend(loc = 'best')
plt.xlabel('Time')
plt.ylabel('Concentration')
plt.grid()
plt.yscale('log')
plt.show()
Attempting to fit a model to observational data. The code uses data in the range of 0.5 to 1.0 for the independent variable with scipy curve_fit and numerical integration. The function to be integrated also includes an unknown parameter, then subjecting the integrand to evaluation using the trig function sinh(integrand).
After applying curve_fit I get an error message of "loop of ufunc does not support argument 0 of type function which has no callable sinh method". Have I hit a dead end with Python 3? Hope not.
This evaluation code is
#O_m, Hu are unknown parameters to be estimated with model, data
def integr(x,O_m):
return intg.quad(lambda x: 1/x(np.sqrt((O_m/x) + (1-O_m))) , x, 1, args=(0.02))[0]
O_m = 0.02 #Guess for value of O_m, which shall lie between 0.01 and 1.0
def funcX(x,O_m):
result = np.asarray([integr(xx,O_m) for xx in x]) * np.sqrt(abs(1-O_m))
return result
litsped=299793 #the constant speed of light in a vacuum (m/s)
def funcY(x,Hu,O_m):
return (litsped/(x * Hu * np.sqrt(abs(1-O_m))))*np.sinh(funcX)
init_guess = [65,0.02]
bnds=([50,0.001],[80,1.0])
params, pcov = curve_fit(funcY, xdata, ydata, p0 = init_guess, bounds = bnds, sigma = error, absolute_sigma = True)
ans_Hu, ans_O_m = params
perr = np.sqrt(np.diag(pcov))
##################################
Complete code below - as far as I have gotten with this curve_fit.
import numpy as np
import csv
import matplotlib.pylab as plt
from scipy.optimize import curve_fit
from scipy import integrate as intg
with open("Riess_1998_D_L.csv",'r') as i: #SNe Ia data file
rawdata = list(csv.reader(i,delimiter=",")) #make a data list
exmdata = np.array(rawdata[1:],dtype=float) #convert to data array
xdata = exmdata[:,1]
ydata = exmdata[:,2]
error = exmdata[:,3]
#plot of imported data
plt.title("Observed SNe Ia Data")
plt.figure(1,dpi=120)
plt.xlabel("Expansion factor")
plt.ylabel("Distance (Mpc)")
plt.plot(xdata,ydata,label = "Observed SNe Ia data")
plt.xlim(0.5,1)
plt.ylim(0.0,9000)
plt.xscale("linear")
plt.yscale("linear")
plt.errorbar(xdata, ydata, yerr=error, fmt='.k', capsize = 4)
# O_m and Hu are the unknown parameters which shall be estimated using the model and observational data
def integr(x,O_m):
return intg.quad(lambda x: 1/x(np.sqrt((O_m/x) + (1-O_m))) , x, 1, args=(0.02))[0]
O_m = 0.02 # Guess for value of O_m, which are between 0.01 and 1.0
def funcX(x,O_m):
result = np.asarray([integr(xx,O_m) for xx in x])* np.sqrt(abs(1-O_m))
return result
litsped=299793 #the constant speed of light in a vacuum (m/s)
def funcY(x,Hu,O_m):
return (litsped/(x*Hu*np.sqrt(abs(1-O_m))))*np.sinh(funcX)
init_guess = [65,0.02]
bnds=([50,0.001],[80,1.0])
params, pcov = curve_fit(funcY, xdata, ydata, p0 = init_guess, bounds = bnds, sigma = error, absolute_sigma = True)
ans_b, ans_c = params
perr = np.sqrt(np.diag(pcov))
TotalInt = intg.trapz(ydata,xdata) #Compute numerical integral to check data import
print("The total area is: ", TotalInt)
########################
Some more information would be useful, e.g. what is your xdata/ydata? Could you rewrite your code as a minimal reproducable example?
P.S. you can format things on stackoverflow as code by writing ``` before and after the code for better readability ;)
Your problem has nothing to do with the fitting procedure. It was a bit hard for me to understand the code. IF I understood correctly, I recommend sth like this:
import numpy as np
import csv
import matplotlib.pylab as plt
from scipy.optimize import curve_fit
from scipy import integrate as intg
exmdata = np.array(np.random.random((10,4)),dtype=float) #convert to data array
xdata = exmdata[:,1]
ydata = exmdata[:,2]
error = exmdata[:,3]
#plot of imported data
plt.title("Observed SNe Ia Data")
plt.figure(1,dpi=120)
plt.xlabel("Expansion factor")
plt.ylabel("Distance (Mpc)")
plt.plot(xdata,ydata,label = "Observed SNe Ia data")
plt.xlim(0.5,1)
plt.ylim(0.0,9000)
plt.xscale("linear")
plt.yscale("linear")
plt.errorbar(xdata, ydata, yerr=error, fmt='.k', capsize = 4)
# O_m and Hu are the unknown parameters which shall be estimated using the model and observational data
def integr(x,O_m):
return 5*x+3*O_m#Some analytical form
O_m = 0.02 # Guess for value of O_m, which are between 0.01 and 1.0
def funcX(x,O_m):
result = integr(x,O_m)* np.sqrt(abs(1-O_m))
return result
litsped=299793 #the constant speed of light in a vacuum (m/s)
def funcY(x,Hu,O_m):
return (litsped/(x*Hu*np.sqrt(abs(1-O_m))))*np.sinh(funcX(x,O_m))
init_guess = np.array([65,0.02])
bnds=([50,0.001],[80,1.0])
params, pcov = curve_fit(funcY, xdata, ydata, p0 = init_guess, bounds = bnds, sigma = error, absolute_sigma = True)
Where you still need to put in an analytical form of the integral in intgr and replace my random arrays with your CSV file data. The error you referred to earlier was indeed due to you passing the whole function instead of the function evaluated at some point. Please first try to implement these steps and make sure that you can call your three functions independently without errors. It is quite hard to search for bugs, if you tackle the whole program immediately. Try to make the individual parts work first ;). If you still need help after you have implemented these changes, say for the actual fitting procedure, just ask me again ;).
I have been trying to fit a data file with unknown fit parameter "ga" and "MA". What I want to do is set a range withing which the value of "MA" will reside and fit the data, for example I want the fitted value of MA in the range [0.5,0.8] and want to keep "ga" as an arbitrary fit paramter. I am not sure how to do it. I am copying the python code here:
#!/usr/bin/env python3
# to the data in "data_file", each line of which contains the data for one point, x_i, y_i, sigma_i.
import numpy as np
from pylab import *
from scipy.optimize import curve_fit
from scipy.stats import chi2
fname = sys.argv[1] if len(sys.argv) > 1000 else 'data.txt'
x, y, err = np.loadtxt(fname, unpack = True)
n = len(x)
p0 = [-1,1]
f = lambda x, ga, MA: ga/((1+x/(MA*MA))*(1+x/(MA*MA)))
p, covm = curve_fit(f, x, y, p0, err)
ga, MA = p
chisq = sum(((f(x, ga, MA) -y)/err)**2)
ndf = n -len(p)
Q = 1. -chi2.cdf(chisq, ndf)
chisq = chisq / ndf
gaerr, MAerr = sqrt(diag(covm)/chisq) # correct the error bars
print 'ga = %10.4f +/- %7.4f' % (ga, gaerr)
print 'MA = %10.4f +/- %7.4f' % (MA, MAerr)
print 'chi squared / NDF = %7.4lf' % chisq
print (covm)
You might consider using lmfit (https://lmfit.github.io/lmfit-py) for this problem. Lmfit provides a higher-level interface to optimization and curve fitting, including treating Parameters as python objects that have bounds.
Your script might be translated to use lmfit as
import numpy as np
from lmfit import Model
fname = sys.argv[1] if len(sys.argv) > 1000 else 'data.txt'
x, y, err = np.loadtxt(fname, unpack = True)
# define the fitting model function, similar to your `f`:
def f(x, ga, ma):
return ga/((1+x/(ma*ma))*(1+x/(ma*ma)))
# turn this model function into a Model:
mymodel = Model(f)
# now create parameters for this model, giving initial values
# note that the parameters will be *named* from the arguments of your model function:
params = mymodel.make_params(ga=-1, ma=1)
# params is now an orderded dict with parameter names ('ga', 'ma') as keys.
# you can set min/max values for any parameter:
params['ma'].min = 0.5
params['ma'].max = 2.0
# you can fix the value to not be varied in the fit:
# params['ga'].vary = False
# you can also constrain it to be a simple mathematical expression of other parameters
# now do the fit to your `y` data with `params` and your `x` data
# note that you pass in weights for the residual, so 1/err:
result = mymodel.fit(y, params, x=x, weights=1./err)
# print out fit report with fit statistics and best fit values
# and uncertainties and correlations for variables:
print(result.fit_report())
You can get access to the best-fit parameters as result.params; the initial params will not be changed by the fit. There are also routines to plot the best-fit result and/or residual.
I'm trying to fit a set of data with a function (see the example below) using scipy.optimize.curvefit,
but when I use bounds (documentation) the fit fails and I simply get
the initial guess parameters as output.
As soon as I substitute -np.inf ad np.inf as bounds for the second parameter
(dt in the function), the fit works.
What am I doing wrong?
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as opt
#Generate data
crc=np.array([-1.4e-14, 7.3e-14, 1.9e-13, 3.9e-13, 6.e-13, 8.0e-13, 9.2e-13, 9.9e-13,
1.e-12, 1.e-12, 1.e-12, 1.0e-12, 1.1e-12, 1.1e-12, 1.1e-12, 1.0e-12, 1.1e-12])
time=np.array([0., 368., 648., 960., 1520.,1864., 2248., 2655., 3031.,
3384., 3688., 4048., 4680., 5343., 6055., 6928., 8120.])
#Define the function for the fit
def testcurve(x, Dp, dt):
k = -Dp*(x+dt)*2e11
curve = 1e-12 * (1+2*(-np.exp(k) + np.exp(4*k) - np.exp(9*k) + np.exp(16*k)))
curve[0]= 0
return curve
#Set fit bounds
dtmax=time[2]
param_bounds = ((-np.inf, -dtmax),(np.inf, dtmax))
#Perform fit
(par, par_cov) = opt.curve_fit(testcurve, time, crc, p0 = (5e-15, 0), bounds = param_bounds)
#Print and plot output
print(par)
plt.plot(time, crc, 'o')
plt.plot(time, testcurve(time, par[0], par[1]), 'r-')
plt.show()
I encountered the same behavior today in a different fitting problem. After some searching online, I found this link quite helpful: Why does scipy.optimize.curve_fit not fit to the data?
The short answer is that: using extremely small (or large) numbers in numerical fitting is not robust and scale them leads to a much better fitting.
In your case, both crc and Dp are extremely small numbers which could be scaled up. You could play with the scale factors and within certain range the fitting looks quite robust. Full example:
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as opt
#Generate data
crc=np.array([-1.4e-14, 7.3e-14, 1.9e-13, 3.9e-13, 6.e-13, 8.0e-13, 9.2e-13, 9.9e-13,
1.e-12, 1.e-12, 1.e-12, 1.0e-12, 1.1e-12, 1.1e-12, 1.1e-12, 1.0e-12, 1.1e-12])
time=np.array([0., 368., 648., 960., 1520.,1864., 2248., 2655., 3031.,
3384., 3688., 4048., 4680., 5343., 6055., 6928., 8120.])
# add scale factors to the data as well as the fitting parameter
scale_factor_1 = 1e12 # 1./np.mean(crc) also works if you don't want to set the scale factor manually
scale_factor_2 = 1./2e11
#Define the function for the fit
def testcurve(x, Dp, dt):
k = -Dp*(x+dt)*2e11 * scale_factor_2
curve = 1e-12 * (1+2*(-np.exp(k) + np.exp(4*k) - np.exp(9*k) + np.exp(16*k))) * scale_factor_1
curve[0]= 0
return curve
#Set fit bounds
dtmax=time[2]
param_bounds = ((-np.inf, -dtmax),(np.inf, dtmax))
#Perform fit
(par, par_cov) = opt.curve_fit(testcurve, time, crc*scale_factor_1, p0 = (5e-15/scale_factor_2, 0), bounds = param_bounds)
#Print and plot output
print(par[0]*scale_factor_2, par[1])
plt.plot(time, crc*scale_factor_1, 'o')
plt.plot(time, testcurve(time, par[0], par[1]), 'r-')
plt.show()
Fitting results: [6.273102923176595e-15, -21.12202697564494], which gives a reasonable fitting and also is very close to the result without any bounds: [6.27312512e-15, -2.11307470e+01]
In scipy there is no support for fitting a negative binomial distribution using data
(maybe due to the fact that the negative binomial in scipy is only discrete).
For a normal distribution I would just do:
from scipy.stats import norm
param = norm.fit(samp)
Is there something similar 'ready to use' function in any other library?
Statsmodels has discrete.discrete_model.NegativeBinomial.fit(), see here:
https://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.NegativeBinomial.fit.html#statsmodels.discrete.discrete_model.NegativeBinomial.fit
Not only because it is discrete, also because maximum likelihood fit to negative binomial can be quite involving, especially with an additional location parameter. That would be the reason why .fit() method is not provided for it (and other discrete distributions in Scipy), here is an example:
In [163]:
import scipy.stats as ss
import scipy.optimize as so
In [164]:
#define a likelihood function
def likelihood_f(P, x, neg=1):
n=np.round(P[0]) #by definition, it should be an integer
p=P[1]
loc=np.round(P[2])
return neg*(np.log(ss.nbinom.pmf(x, n, p, loc))).sum()
In [165]:
#generate a random variable
X=ss.nbinom.rvs(n=100, p=0.4, loc=0, size=1000)
In [166]:
#The likelihood
likelihood_f([100,0.4,0], X)
Out[166]:
-4400.3696690513316
In [167]:
#A simple fit, the fit is not good and the parameter estimate is way off
result=so.fmin(likelihood_f, [50, 1, 1], args=(X,-1), full_output=True, disp=False)
P1=result[0]
(result[1], result[0])
Out[167]:
(4418.599495886474, array([ 59.61196161, 0.28650831, 1.15141838]))
In [168]:
#Try a different set of start paramters, the fit is still not good and the parameter estimate is still way off
result=so.fmin(likelihood_f, [50, 0.5, 0], args=(X,-1), full_output=True, disp=False)
P1=result[0]
(result[1], result[0])
Out[168]:
(4417.1495981801972,
array([ 6.24809397e+01, 2.91877405e-01, 6.63343536e-04]))
In [169]:
#In this case we need a loop to get it right
result=[]
for i in range(40, 120): #in fact (80, 120) should probably be enough
_=so.fmin(likelihood_f, [i, 0.5, 0], args=(X,-1), full_output=True, disp=False)
result.append((_[1], _[0]))
In [170]:
#get the MLE
P2=sorted(result, key=lambda x: x[0])[0][1]
sorted(result, key=lambda x: x[0])[0]
Out[170]:
(4399.780263084549,
array([ 9.37289361e+01, 3.84587087e-01, 3.36856705e-04]))
In [171]:
#Which one is visually better?
plt.hist(X, bins=20, normed=True)
plt.plot(range(260), ss.nbinom.pmf(range(260), np.round(P1[0]), P1[1], np.round(P1[2])), 'g-')
plt.plot(range(260), ss.nbinom.pmf(range(260), np.round(P2[0]), P2[1], np.round(P2[2])), 'r-')
Out[171]:
[<matplotlib.lines.Line2D at 0x109776c10>]
I know this thread is quite old, but current readers may want to look at this repo which is made for this purpose: https://github.com/gokceneraslan/fit_nbinom
There's also an implementation here, though part of a larger package: https://github.com/ernstlab/ChromTime/blob/master/optimize.py
I stumbled across this thread, and found an answer for anyone else wondering.
If you simply need the n, p parameterisation used by scipy.stats.nbinom you can convert the mean and variance estimates:
mu = np.mean(sample)
sigma_sqr = np.var(sample)
n = mu**2 / (sigma_sqr - mu)
p = mu / sigma_sqr
If you the dispersionparameter you can use a negative binomial regression model from statsmodels with just an interaction term. This will find the dispersionparameter alpha using MLE.
# Data processing
import pandas as pd
import numpy as np
# Analysis models
import statsmodels.formula.api as smf
from scipy.stats import nbinom
def convert_params(mu, alpha):
"""
Convert mean/dispersion parameterization of a negative binomial to the ones scipy supports
Parameters
----------
mu : float
Mean of NB distribution.
alpha : float
Overdispersion parameter used for variance calculation.
See https://en.wikipedia.org/wiki/Negative_binomial_distribution#Alternative_formulations
"""
var = mu + alpha * mu ** 2
p = mu / var
r = mu ** 2 / (var - mu)
return r, p
# Generate sample data
n = 2
p = 0.9
sample = nbinom.rvs(n=n, p=p, size=10000)
# Estimate parameters
## Mean estimates expectation parameter for negative binomial distribution
mu = np.mean(sample)
## Dispersion parameter from nb model with only interaction term
nbfit = smf.negativebinomial("nbdata ~ 1", data=pd.DataFrame({"nbdata": sample})).fit()
alpha = nbfit.params[1] # Dispersion parameter
# Convert parameters to n, p parameterization
n_est, p_est = convert_params(mu, alpha)
# Check that estimates are close to the true values:
print("""
{:<3} {:<3}
True parameters: {:<3} {:<3}
Estimates : {:<3} {:<3}""".format('n', 'p', n, p,
np.round(n_est, 2), np.round(p_est, 2)))