Maximum Likelihood Estimator in Python

Suppose I have data points distributed over time like a decreasing exponential function, but including zero-mean Gaussian noise with a variance of, say, 20. How would I determine the likelihood function and find MLEs for the parameters? So all I have is the following data:
I have fit an exponential curve using Python. My attempt:
from scipy.optimize import curve_fit

def func(x, A, B, C):
    return A * (x**B) + C

params, cov = curve_fit(func, time, X)
A, B, C = params
LH_function = A * (x**B)
So I am not sure how to determine the likelihood function given just a dataset. Do I need to assume a distribution for the data (beyond the zero-mean noise)?
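Since the noise is stated to be zero-mean Gaussian with known variance (20), the likelihood follows directly from that assumption: each observation is normal with mean given by the model curve and variance 20. A minimal sketch, assuming the model A*x**B + C from the attempt above and placeholder initial guesses:
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, x, y, var=20.0):
    A, B, C = params
    resid = y - (A * x**B + C)
    # Gaussian log-likelihood with known variance; minimizing this is
    # equivalent to ordinary least squares on the residuals
    return 0.5 * len(y) * np.log(2 * np.pi * var) + np.sum(resid**2) / (2 * var)

# res = minimize(neg_log_likelihood, x0=[1.0, -1.0, 0.0], args=(time, X),
#                method='Nelder-Mead')
# A_mle, B_mle, C_mle = res.x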


How to use a maximum likelihood estimator to find the slope of a power law distribution

I have a discrete power spectrum f(x) which is a power law with a specific exponent b: f(x) = A*x^(b). I want to estimate that exponent when I have f(x) and x. I'm interested in using a maximum likelihood estimator (MLE) to find it. Why? Well, there's a paper that says it's the method with the least bias for such distributions: Clauset et al. 2007.
I have already tried many different methods. For instance, I took the commonly known route of taking the log of both power and frequency and then applying a least-squares fit to the log-log spectrum, but the slope (the sought exponent) from that procedure is sometimes biased. There is already a Python implementation of the mentioned paper, the powerlaw package, but it does MLE based on the PDF or CDF, not on a (x, f(x)) curve that you provide. I tried editing it, but it's complicated. I also tried using pytorch to do the optimization, without success, since I'm not experienced with it.
How can I implement such an MLE in Python for my problem?
Edit: I have found someone asking a similar question, but in R. Someone answered it here, but again it's in R, not Python.
Edit: added code
from astroML.time_series import generate_power_law
import numpy as np
import scipy.fft

N = 72              # number of points
dt = 5*60           # time resolution
exponent = -2       # power law exponent
rand_seed = 1
time_max = 6*6*60   # observation time
x = np.linspace(0, time_max, N)   # time array
y = generate_power_law(N, dt, -exponent, random_state=rand_seed)
# amplitudes with power law array

# Get the spectrum by FFT, for example
sample_freq = scipy.fft.rfftfreq(N, d=dt)[1:]   # sampling frequencies
sig_fft = scipy.fft.rfft(y, N)[1:]              # FFT amplitudes
psd = (np.abs(sig_fft)**2) * 2 * dt / N         # power spectral density

# do least squares fitting
logA = np.log(sample_freq)
logB = np.log(psd)
slope, intercept = np.polyfit(logA, logB, 1)
Another attempt at MLE, based on a comment linking to this article:
import numpy as np
from scipy.optimize import minimize

def func_powerlaw(x, params):
    slope = params[0]
    coefficient = params[1]
    return (x**slope) * coefficient

params = [1, 1]

# For MLE, minimize the negative log-likelihood
# (Whittle likelihood: the periodogram values are exponentially
# distributed about the model spectrum)
def neglnlike(params, x, y):
    model = func_powerlaw(x, params)
    output = np.sum(np.log(model) + y/model)
    # Check that this is valid, returning a large number if not
    if not np.isfinite(output):
        return 1.0e30
    return output

res = minimize(neglnlike, params, args=(sample_freq, psd),
               method='Nelder-Mead')
See the library named powerlaw:
https://github.com/jeffalstott/powerlaw
powerlaw is a toolbox using the statistical methods developed in Clauset et al. 2007 and Klaus et al. 2011 to determine if a probability distribution fits a power law.
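Basic usage looks something like the sketch below (names taken from the project's README). Note that powerlaw.Fit expects raw samples drawn from the distribution, not a (frequency, power) curve, which is exactly the limitation mentioned in the question above:
import powerlaw

# `samples` stands in for raw observations from the candidate power law
fit = powerlaw.Fit(samples)
print(fit.power_law.alpha)   # estimated exponent
print(fit.power_law.xmin)    # lower cutoff selected by the fitting procedure
# compare against an alternative heavy-tailed model
R, p = fit.distribution_compare('power_law', 'lognormal')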

How do I use the Monte Carlo method to find the uncertainties of a value?

I am trying to solve a physics equation using a Monte Carlo simulation, which I know is a very long-winded way to do it (I just need to use it to learn about the method).
I have around 5 values, one of which is time, and I have the random uncertainties (errors) for each of these values. So, for example, the mass is (10 +- 0.1) kg, where the error is 0.1 kg.
How do I actually find the distribution of measurements if I performed this experiment, say, 5,000 times?
I know I could make two arrays of errors and maybe put them in a function. But what am I supposed to do then? Do I perturb each value by its error, put the perturbed values in the equation, append the answer to an array, and repeat this a thousand times? Or do I calculate the real value and add that to the array?
Please can you help me understand this.
Edit:
The problem I have is basically a sphere of density ds that falls a distance l in time t through a liquid of density dl; this goes into an equation for viscosity, and I need to find the distribution of viscosity measurements.
The equation shouldn't matter at all: whatever equation I have, I should be able to use a method like this to find the distribution of measurements, whether I'm dropping a ball out of a window or whatever.
Basic Monte Carlo is very straightforward. The following might get you started:
import random, statistics, math

# The following function generates a random observation of f(x), where
# x is a vector of independent normal variables whose means are given by
# the vector mus and whose standard deviations are given by sigmas
def sample(f, mus, sigmas):
    x = (random.gauss(m, s) for m, s in zip(mus, sigmas))
    return f(*x)

# do n times, returning the sample mean and standard deviation:
def monte_carlo(f, mus, sigmas, n):
    samples = [sample(f, mus, sigmas) for _ in range(n)]
    return (statistics.mean(samples), statistics.stdev(samples))

# for testing purposes:
def V(r, h):
    return math.pi * r**2 * h

print(monte_carlo(V, (2, 4), (0.02, 0.01), 1000))
With output:
(50.2497301631037, 1.0215188736786902)
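The same helper can be pointed at the viscosity problem from the question. A sketch, assuming the standard falling-sphere relation eta = g*d**2*(ds - dl)*t/(18*l); the question does not give its actual equation, so this formula is only an illustration:
def viscosity(d, ds, dl, t, l, g=9.81):
    # falling-sphere viscometer: d = sphere diameter, ds and dl = sphere
    # and liquid densities, t = fall time over distance l
    return g * d**2 * (ds - dl) * t / (18.0 * l)

# means = (d, ds, dl, t, l) measured values; sigmas = their uncertainties
# print(monte_carlo(viscosity, means, sigmas, 5000))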
OK, let's try a simple example: you have an air gun which shoots balls with mass m and velocity v, and you have to measure the kinetic energy
E = m*v**2 / 2
There is a distribution of velocities: Gaussian, with a mean value of 10 and standard deviation of 1.
There is a distribution of masses as well, but we cannot make it Gaussian; let's assume it is a truncated normal with a lower limit of 1, so that there are no negative values, with loc equal to 5 and scale equal to 3.
So what we will do is: sample a velocity, sample a mass, use them to find a kinetic energy, do it many times, build the energy distribution, get its mean value, get its standard deviation, draw graphs, etc.
Some simple Python code
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import truncnorm

def sampleMass(n, low, high, mu, sigma):
    """
    Sample n mass values from a truncated normal.
    Note: truncnorm takes its limits in units of sigma relative
    to loc, so convert the physical limits first.
    """
    a = (low - mu) / sigma
    b = (high - mu) / sigma
    tn = truncnorm(a, b, loc=mu, scale=sigma)
    return tn.rvs(n)

def sampleVelocity(n, mu, sigma):
    return np.random.normal(loc=mu, scale=sigma, size=n)

mass_low = 1.
mass_high = 1000.
mass_mu = 5.
mass_sigma = 3.0
vel_mu = 10.0
vel_sigma = 1.0

nof_trials = 100000
mass = sampleMass(nof_trials, mass_low, mass_high, mass_mu, mass_sigma)  # samples of mass
vel = sampleVelocity(nof_trials, vel_mu, vel_sigma)                      # samples of velocity
kinenergy = 0.5 * mass * vel * vel   # distribution of kinetic energy

print("Mean value and stddev of the final distribution")
print(np.mean(kinenergy))
print(np.std(kinenergy))
print("Min/max values of the final distribution")
print(np.min(kinenergy))
print(np.max(kinenergy))

# plot a histogram of the distribution
n, bins, patches = plt.hist(kinenergy, 100, density=True, facecolor='green', alpha=0.75)
plt.xlabel('Energy')
plt.ylabel('Probability')
plt.title('Kinetic energy distribution')
plt.grid(True)
plt.show()
with output reporting the mean, standard deviation, and min/max of the sampled energy distribution (the exact values vary from run to run).

Gaussian fit returning negative sigma

One of my algorithms performs automatic peak detection based on a Gaussian function, and then determines the edges based either on a multiplier (user setting) of the sigma or on the full width at half maximum. In the scenario where a user specifies that he/she wants the peak limited at 2 sigma, the algorithm takes -/+ 2*sigma from the peak center (mu). However, I noticed that the sigma returned by curve_fit can be negative, something that has been noticed before, as can be seen here. Since I determine the borders by doing -/+, this can make the algorithm 'fail' (due to a minus-minus scenario), as can be seen in the following code.
MVCE
#! /usr/bin/env python
from scipy.optimize import curve_fit
import bisect
import numpy as np
X = [16.4697402328,16.4701402404,16.4705402481,16.4709402557,16.4713402633,16.4717402709,16.4721402785,16.4725402862,16.4729402938,16.4733403014,16.473740309,16.4741403166,16.4745403243,16.4749403319,16.4753403395,16.4757403471,16.4761403547,16.4765403623,16.47694037,16.4773403776,16.4777403852,16.4781403928,16.4785404004,16.4789404081,16.4793404157,16.4797404233,16.4801404309,16.4805404385,16.4809404462,16.4813404538,16.4817404614,16.482140469,16.4825404766,16.4829404843,16.4833404919,16.4837404995,16.4841405071,16.4845405147,16.4849405224,16.48534053,16.4857405376,16.4861405452,16.4865405528,16.4869405604,16.4873405681,16.4877405757,16.4881405833,16.4885405909,16.4889405985,16.4893406062,16.4897406138,16.4901406214,16.490540629,16.4909406366,16.4913406443,16.4917406519,16.4921406595,16.4925406671,16.4929406747,16.4933406824,16.49374069,16.4941406976,16.4945407052,16.4949407128,16.4953407205,16.4957407281,16.4961407357,16.4965407433,16.4969407509,16.4973407585,16.4977407662,16.4981407738,16.4985407814,16.498940789,16.4993407966,16.4997408043,16.5001408119,16.5005408195,16.5009408271,16.5013408347,16.5017408424,16.50214085,16.5025408576,16.5029408652,16.5033408728,16.5037408805,16.5041408881,16.5045408957,16.5049409033,16.5053409109,16.5057409186,16.5061409262,16.5065409338,16.5069409414,16.507340949,16.5077409566,16.5081409643,16.5085409719,16.5089409795,16.5093409871,16.5097409947,16.5101410024,16.51054101,16.5109410176,16.5113410252,16.5117410328,16.5121410405,16.5125410481,16.5129410557,16.5133410633,16.5137410709,16.5141410786,16.5145410862,16.5149410938,16.5153411014,16.515741109,16.5161411166,16.5165411243,16.5169411319,16.5173411395,16.5177411471,16.5181411547,16.5185411624,16.51894117,16.5193411776,16.5197411852,16.5201411928,16.5205412005,16.5209412081,16.5213412157,16.5217412233,16.5221412309,16.5225412386,16.5229412462,16.5233412538,16.5237412614,16.524141269,16.5245412767,16.5249412843,16.5253412919,16.5257412995,16.5261413071,16.5265413147,16.5269413224,16.52734133,16.5277413376,16.5281413452,16.5285413528,16.5289413605,16.5293413681,16.5297413757,16.5301413833,16.5305413909,16.5309413986,16.5313414062,16.5317414138,16.5321414214,16.532541429,16.5329414367,16.5333414443,16.5337414519,16.5341414595,16.5345414671,16.5349414748,16.5353414824,16.53574149,16.5361414976,16.5365415052,16.5369415128,16.5373415205,16.5377415281,16.5381415357,16.5385415433,16.5389415509,16.5393415586,16.5397415662,16.5401415738,16.5405415814,16.540941589,16.5413415967,16.5417416043,16.5421416119,16.5425416195,16.5429416271,16.5433416348,16.5437416424,16.54414165,16.5445416576,16.5449416652,16.5453416729,16.5457416805,16.5461416881,16.5465416957,16.5469417033,16.5473417109,16.5477417186,16.5481417262,16.5485417338,16.5489417414,16.549341749,16.5497417567,16.5501417643,16.5505417719,16.5509417795,16.5513417871,16.5517417948,16.5521418024,16.55254181,16.5529418176,16.5533418252,16.5537418329,16.5541418405,16.5545418481,16.5549418557,16.5553418633,16.5557418709,16.5561418786,16.5565418862,16.5569418938,16.5573419014,16.557741909,16.5581419167,16.5585419243,16.5589419319,16.5593419395,16.5597419471,16.5601419548,16.5605419624,16.56094197,16.5613419776,16.5617419852,16.5621419929,16.5625420005,16.5629420081,16.5633420157,16.5637420233,16.564142031]
Y = [11579127.8554,11671781.7263,11764419.0191,11857026.0444,11949589.1124,12042094.5338,12134528.6188,12226877.6781,12319128.0219,12411265.9609,12503277.8053,12595149.8657,12686868.4525,12778419.8762,12869790.334,12960965.209,13051929.5278,13142668.3154,13233166.5969,13323409.3973,13413381.7417,13503068.6552,13592455.1627,13681526.2894,13770267.0602,13858662.5004,13946697.6348,14034357.4886,14121627.0868,14208491.4544,14294935.6166,14380944.5984,14466503.4248,14551597.1208,14636210.7116,14720329.3102,14803938.4081,14887023.5981,14969570.4732,15051564.6263,15132991.6503,15213837.1383,15294086.683,15373725.8775,15452740.3147,15531115.5875,15608837.2888,15685891.0116,15762262.3488,15837936.8934,15912900.2382,15987137.9762,16060635.7004,16133379.0036,16205353.4789,16276544.72,16346938.7731,16416522.8674,16485284.4226,16553210.8587,16620289.5956,16686508.0531,16751853.6511,16816313.8096,16879875.9485,16942527.4876,17004255.8468,17065048.446,17124892.7052,17183776.0442,17241685.8829,17298609.6412,17354534.739,17409448.5962,17463338.6327,17516192.2683,17567996.9463,17618741.7702,17668418.588,17717019.5043,17764536.6238,17810962.0514,17856287.8916,17900506.2493,17943609.2292,17985588.936,18026437.4744,18066146.9493,18104709.4653,18142117.1271,18178362.0396,18213436.3074,18247332.0352,18280041.3279,18311556.2901,18341869.0265,18370971.642,18398856.332,18425517.6188,18450952.493,18475158.064,18498131.4412,18519869.7341,18540370.0523,18559629.505,18577645.202,18594414.2525,18609933.7661,18624200.8523,18637212.6205,18648966.1802,18659458.6408,18668687.1119,18676648.7029,18683340.5233,18688759.6825,18692903.29,18695768.4553,18697352.5327,18697655.9558,18696681.2608,18694431.0245,18690907.8241,18686114.2363,18680052.838,18672726.2063,18664136.918,18654287.5501,18643180.6795,18630818.883,18617204.7377,18602340.8204,18586229.7081,18568873.9777,18550276.2061,18530438.9703,18509364.8471,18487056.4135,18463516.2464,18438747.4526,18412756.9228,18385553.1936,18357144.808,18327540.3094,18296748.2409,18264777.1456,18231635.5669,18197332.0479,18161875.1318,18125273.3619,18087535.2812,18048669.4331,18008684.3606,17967588.6071,17925390.7158,17882099.2297,17837722.6922,17792269.6464,17745748.6355,17698168.2027,17649537.512,17599868.3744,17549173.3069,17497464.8262,17444755.4492,17391057.6927,17336384.0736,17280747.1087,17224159.3148,17166633.2088,17108181.3075,17048816.1277,16988550.1864,16927396.0002,16865366.0862,16802472.961,16738729.1416,16674147.1447,16608739.4873,16542518.6861,16475497.2591,16407688.2541,16339106.0951,16269765.4262,16199680.8916,16128867.1358,16057338.8029,15985110.5372,15912196.9829,15838612.7844,15764372.5859,15689491.0316,15613982.7659,15537862.4329,15461144.6771,15383844.1425,15305975.4735,15227553.3143,15148592.3093,15069107.1026,14989112.3386,14908622.6595,14827652.5673,14746216.3337,14664328.209,14582002.4435,14499253.2874,14416094.9911,14332541.8049,14248607.9791,14164307.764,14079655.4098,13994665.1668,13909351.2855,13823728.016,13737809.6086,13651610.3137,13565144.3816,13478426.0625,13391469.6068,13304289.2646,13216899.2865,13129313.8865,13041546.3657,12953609.0623,12865514.2686,12777274.277,12688901.3798,12600407.8693,12511806.0378,12423108.1777,12334326.5812,12245473.5407,12156561.3486,12067602.297,11978608.6785,11889592.7852]
def gaussFunction(x, *p):
    """Define and return a Gaussian function.

    This function returns the value of a Gaussian function, using the
    A, mu and sigma value that is provided as *p.

    Keyword arguments:
    x -- number
    p -- A, mu and sigma numbers
    """
    A, mu, sigma = p
    return A*np.exp(-(x-mu)**2/(2.*sigma**2))

newGaussX = np.linspace(10, 25, int(2500*(X[-1]-X[0])))
p0 = [np.max(Y), X[np.argmax(Y)], 0.1]
coeff, var_matrix = curve_fit(gaussFunction, X, Y, p0)
newGaussY = gaussFunction(newGaussX, *coeff)
print("Sigma is " + str(coeff[2]))

# Original
low = bisect.bisect_left(newGaussX, coeff[1]-2*coeff[2])
high = bisect.bisect_right(newGaussX, coeff[1]+2*coeff[2])
print(newGaussX[low], newGaussX[high])

# Absolute
low = bisect.bisect_left(newGaussX, coeff[1]-2*abs(coeff[2]))
high = bisect.bisect_right(newGaussX, coeff[1]+2*abs(coeff[2]))
print(newGaussX[low], newGaussX[high])
Bottom line: is taking the abs() of the sigma 'correct', or should this problem be solved in a different way?
You are fitting a function gaussFunction that does not care whether sigma is positive or negative, so whether you get a positive or negative result is mostly a matter of luck, and taking the absolute value of the returned sigma is fine. Also consider two other possibilities:
(Suggested by Thomas Kühn): modify the model function so that it cares about the sign of sigma. Bringing it closer to the normalized Gaussian form would be enough: the formula A/np.sqrt(sigma)*np.exp(-(x-mu)**2/(2.*sigma**2)) ensures that you get positive sigma only, because np.sqrt returns NaN for negative arguments, which steers the optimizer away from them. A possible, mild downside is that the function takes a bit longer to compute.
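In code, that suggestion amounts to something like this sketch (same signature as gaussFunction above):
def gaussFunction(x, *p):
    A, mu, sigma = p
    # np.sqrt(sigma) is NaN for negative sigma, which steers the
    # optimizer away from negative values
    return A/np.sqrt(sigma)*np.exp(-(x-mu)**2/(2.*sigma**2))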
Use the variance, sigma_squared, as the parameter:
def gaussFunction(x, *p):
    A, mu, sigma_squared = p
    return A*np.exp(-(x-mu)**2/(2.*sigma_squared))
This is probably easiest in terms of keeping the model equation simple. You will need to square your initial guess for that parameter, and take the square root when you need sigma itself.
Aside: you hardcoded 0.1 as the guess for the standard deviation. That guess should probably be based on the data, like this:
peak = X[Y > np.exp(-0.5)*Y.max()]
guess_sigma = 0.5*(peak.max() - peak.min())
The idea is that within one standard deviation of the mean, the values of the Gaussian are greater than np.exp(-0.5) times its maximum value. So the first line locates this "peak" and the second takes half of its width as the guess for sigma.
For the above to work, X and Y should already be converted to NumPy arrays, e.g., X = np.array([16.4697402328,16.4701402404,..... This is a good idea in general: otherwise, every NumPy method that receives X or Y has to perform this conversion again.
You might find lmfit (http://lmfit.github.io/lmfit-py/) useful for this. It includes a GaussianModel for curve fitting that does normalize the Gaussian, and it restricts sigma to be positive using a parameter transformation that is gentler than abs(sigma). Your example would look like this:
from lmfit.models import GaussianModel
xdat = np.array(X)
ydat = np.array(Y)
model = GaussianModel()
params = model.guess(ydat, x=xdat)
result = model.fit(ydat, params, x=xdat)
print(result.fit_report())
which will print a report with best-fit values and estimated uncertainties for all the parameters, including the FWHM:
[[Model]]
Model(gaussian)
[[Fit Statistics]]
# function evals = 31
# data points = 237
# variables = 3
chi-square = 95927408861.607
reduced chi-square = 409946191.716
Akaike info crit = 4703.055
Bayesian info crit = 4713.459
[[Variables]]
sigma: 0.04880178 +/- 1.57e-05 (0.03%) (init= 0.0314006)
center: 16.5174203 +/- 8.01e-06 (0.00%) (init= 16.51754)
amplitude: 2.2859e+06 +/- 586.4103 (0.03%) (init= 670578.1)
fwhm: 0.11491942 +/- 3.51e-05 (0.03%) == '2.3548200*sigma'
height: 1.8687e+07 +/- 910.0152 (0.00%) == '0.3989423*amplitude/max(1.e-15, sigma)'
[[Correlations]] (unreported correlations are < 0.100)
C(sigma, amplitude) = 0.949
The values for center +/- 2*sigma would be found with
xlo = result.params['center'].value - 2 * result.params['sigma'].value
xhi = result.params['center'].value + 2 * result.params['sigma'].value
You can use the result to evaluate the model with fitted parameters and different X values:
newGaussX = np.linspace(10, 25, int(2500*(X[-1]-X[0])))
newGaussY = result.eval(x=newGaussX)
I would also recommend using numpy.where to find the location of center+/-2*sigma instead of bisect:
low = np.where(newGaussX > xlo)[0][0] # replace bisect_left
high = np.where(newGaussX <= xhi)[0][-1] + 1 # replace bisect_right
I had the same problem, and I came up with a trivial but effective solution: use the variance in the Gaussian function definition instead of the standard deviation, since the variance is always positive. You then get the standard deviation by taking the square root of the variance, which is always a positive value. So, problem solved easily ;)
I mean, create the function this way:
def gaussian(x, height, mean, variance):
    return height * np.exp(-(x - mean)**2 / (2 * variance))
Instead of:
def gaussian(x, height, mean, std_dev):
    return height * np.exp(-(x - mean)**2 / (2 * std_dev**2))
And then do the fit as usual.
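For instance, a sketch of the whole procedure with the variance parameterization, assuming X and Y are NumPy arrays and using placeholder initial guesses:
popt, pcov = curve_fit(gaussian, X, Y, p0=[Y.max(), X[np.argmax(Y)], 0.01])
std_dev = np.sqrt(popt[2])   # variance -> standard deviation, always positive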

Producing an MLE for a pair of distributions in python

OK, so my current curve-fitting code has a step that uses scipy.stats to determine the right distribution based on the data:
import scipy.stats as st

distributions = [st.laplace, st.norm, st.expon, st.dweibull, st.invweibull, st.lognorm, st.uniform]

mles = []
for distribution in distributions:
    pars = distribution.fit(data)
    mle = distribution.nnlf(pars, data)
    mles.append(mle)

results = [(distribution.name, mle) for distribution, mle in zip(distributions, mles)]

for dist in sorted(zip(distributions, mles), key=lambda d: d[1]):
    print(dist)

best_fit = sorted(zip(distributions, mles), key=lambda d: d[1])[0]
print('Best fit reached using {}, MLE value: {}'.format(best_fit[0].name, best_fit[1]))
print([mod[0].name for mod in sorted(zip(distributions, mles), key=lambda d: d[1])])
where data is a list of numeric values. This is working great so far for fitting unimodal distributions, confirmed in a script that randomly generates values from random distributions and uses curve_fit to re-determine the parameters.
Now I would like to make the code able to handle bimodal distributions, like the example below:
Is it possible to get an MLE for a pair of models from scipy.stats in order to determine whether a particular pair of distributions is a good fit for the data? Something like:
distributions = [st.laplace, st.norm, st.expon, st.dweibull, st.invweibull, st.lognorm, st.uniform]
distributionPairs = [[modelA.name, modelB.name] for modelA in distributions for modelB in distributions]
and then use those pairs to get an MLE value for that pair of distributions fitting the data?
It's not a complete answer, but it may help you solve your problem. Let's say you know that your data were generated by two densities.
A solution would be to use a k-means or EM algorithm.
Initialization.
You initialize the algorithm by assigning every observation to one density or the other, and you initialize the two densities themselves (you initialize the parameters of each density, and one of the "parameters" in your case is the family: "gaussian", "laplace", and so on).
Iteration.
Then, iteratively, you run the two following steps:
Step 1.
Optimize the parameters assuming that the assignment of every point is right. You can use any optimization solver here. This step provides you with an estimation of the best two densities (with given parameters) that fit your data.
Step 2.
Reassign every observation to one density or the other according to the greater likelihood.
You repeat these until convergence.
This is very well explained on this web page:
https://people.duke.edu/~ccc14/sta-663/EMAlgorithm.html
If you do not know how many densities generated your data, the problem is more difficult: you have to work with a penalized classification problem, which is a bit harder.
Here is a coding example for an easy case: you know that your data come from two different Gaussians (but you don't know how many observations were generated by each density). In your case, you can adapt this code to loop over every possible pair of densities (computationally longer, but it would empirically work, I presume); see the sketch after the code.
import scipy.stats as st
import numpy as np

# hard-coded data generation
data = np.random.normal(-3, 1, size=1000)
data[600:] = np.random.normal(loc=3, scale=2, size=400)

# initialization
mu1 = -1
sigma1 = 1
mu2 = 1
sigma2 = 1

# criterion to stop iterating
epsilon = 0.1
stop = False

while not stop:
    # classification step: assign each point to the more likely density
    classification = np.zeros(len(data))
    classification[st.norm.pdf(data, mu1, sigma1) > st.norm.pdf(data, mu2, sigma2)] = 1
    mu1_old, mu2_old, sigma1_old, sigma2_old = mu1, mu2, sigma1, sigma2
    # fitting step: re-estimate each density from its assigned points
    pars1 = st.norm.fit(data[classification == 1])
    mu1, sigma1 = pars1
    pars2 = st.norm.fit(data[classification == 0])
    mu2, sigma2 = pars2
    # stopping criterion
    stop = ((mu1_old - mu1)**2 + (mu2_old - mu2)**2
            + (sigma1_old - sigma1)**2 + (sigma2_old - sigma2)**2) < epsilon

# result
print("The first density is gaussian:", mu1, sigma1)
print("The second density is gaussian:", mu2, sigma2)
print("A rate of", np.mean(classification), "is classified in the first density")
Hope it helps.
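To adapt this to the pairs in the question, the same alternating loop can be wrapped in a function that takes any two scipy.stats families, so you can rank every pair by its total negative log-likelihood. A rough sketch (not guarded against degenerate assignments where one class ends up empty):
import numpy as np
import scipy.stats as st

def fit_pair(data, distA, distB, n_iter=50):
    # crude initialization: split at the median
    labels = (data > np.median(data)).astype(int)
    for _ in range(n_iter):
        parsA = distA.fit(data[labels == 0])
        parsB = distB.fit(data[labels == 1])
        new_labels = (distB.pdf(data, *parsB) > distA.pdf(data, *parsA)).astype(int)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    # total negative log-likelihood of the final assignment, for ranking pairs
    nll = -np.sum(np.log(distA.pdf(data[labels == 0], *parsA)))
    nll -= np.sum(np.log(distB.pdf(data[labels == 1], *parsB)))
    return nll, parsA, parsB

# e.g. try every pair and keep the best:
# scores = {(a.name, b.name): fit_pair(data, a, b)[0]
#           for a in distributions for b in distributions}
# best = min(scores, key=scores.get)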

statsmodels - plotting the fitted distribution

The following code fits an oversimplified generalized linear model using statsmodels:
import statsmodels.api as sm
import statsmodels.formula.api as smf

model = smf.glm('Y ~ 1', family=sm.families.NegativeBinomial(), data=df)
results = model.fit()
This gives the coefficient and a stderr:
              coef    stderr
Intercept   2.9471     0.120
Now I want to graphically compare the real distribution of the variable Y (histogram) with the distribution that comes from the model.
But I need the two parameters r and p to evaluate stats.nbinom(r, p) and plot it.
Is there a way to retrieve the parameters from the results of the fitting?
How can I plot the PMF?
Generalized linear models (GLM) in statsmodels currently do not estimate the extra parameter of the negative binomial distribution; the negative binomial belongs to the exponential family of distributions only for a fixed shape parameter.
However, statsmodels also has the negative binomial as a maximum likelihood model in discrete_model, and that version estimates all parameters.
The parameterization of the negative binomial for count regression is in terms of the mean or expected value, which is different from the parameterization in scipy.stats.nbinom. There are actually two commonly used parameterizations for negative binomial count regression, usually called nb1 and nb2.
Here is a quickly written script that recovers the scipy.stats.nbinom parameters, n=size and p=prob, from the estimated regression parameters. Once you have the parameters for the scipy.stats distribution, you can use all the available methods: rvs, pmf, and so on.
Something like this should be made available in statsmodels.
In a few example runs, I got results like this
data generating parameters 50 0.25
estimated params 51.7167511571 0.256814610633
estimated params 50.0985814878 0.249989725917
Aside: because of the underlying exponential reparameterization, the scipy optimizers sometimes have problems converging. In those cases, either providing better starting values or using Nelder-Mead as the optimization method usually helps.
import numpy as np
from scipy import stats
import statsmodels.api as sm

# generate some data to check
nobs = 1000
n, p = 50, 0.25
dist0 = stats.nbinom(n, p)
y = dist0.rvs(size=nobs)
x = np.ones(nobs)

loglike_method = 'nb1'   # or use 'nb2'
res = sm.NegativeBinomial(y, x, loglike_method=loglike_method).fit(start_params=[0.1, 0.1])

print(dist0.mean())
print(res.params)

mu = res.predict()            # use this for the mean if it is not constant
mu = np.exp(res.params[0])    # shortcut, since we just regress on a constant
alpha = res.params[1]

if loglike_method == 'nb1':
    Q = 1
elif loglike_method == 'nb2':
    Q = 0

size = 1. / alpha * mu**Q
prob = size / (size + mu)

print('data generating parameters', n, p)
print('estimated params          ', size, prob)

# estimated distribution
dist_est = stats.nbinom(size, prob)
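To answer the plotting part of the question: once dist_est exists, you can draw its PMF over a normalized histogram of the data, along the lines of this sketch:
import matplotlib.pyplot as plt

xgrid = np.arange(y.min(), y.max() + 1)
plt.hist(y, bins=xgrid, density=True, alpha=0.5, label='observed counts')
plt.plot(xgrid, dist_est.pmf(xgrid), 'r.', label='fitted nbinom PMF')
plt.xlabel('count')
plt.legend()
plt.show()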
BTW: I ran into this before but didn't have time to look at it
https://github.com/statsmodels/statsmodels/issues/106
