When minimizing a function using iminuit:
m = Minuit(func, alpha=1, beta=1)
m.errordef = 1.0
m = m.migrad()
I can get the final parameter values like so:
final_alpha, final_beta = m.values["alpha"], m.values["beta"]
I would like to plot the path the minimizer takes to reach the minimum.
How can I get the intermediate values of alpha and beta?
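One common approach, sketched below, is simply to wrap the cost function so that every evaluation is recorded; this is not a dedicated iminuit feature, and the quadratic func used here is only an illustrative stand-in for the real cost function.
from iminuit import Minuit

def func(alpha, beta):
    # illustrative cost function standing in for the real one
    return (alpha - 2)**2 + (beta + 1)**2

path = []  # (alpha, beta) at every call of the cost function

def traced_func(alpha, beta):
    path.append((alpha, beta))
    return func(alpha, beta)

m = Minuit(traced_func, alpha=1, beta=1)
m.errordef = 1.0
m.migrad()

# 'path' now holds every point Migrad evaluated (including points used for
# numerical derivatives), which approximates the path to the minimum:
# import matplotlib.pyplot as plt; plt.plot(*zip(*path)); plt.show()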
Suppose I have data points distributed over time like a decreasing exponential function, but with zero-mean Gaussian noise of variance, say, 20. How would I determine the likelihood function and find MLEs for the parameters? All I have is the following data:
I have fit an exponential curve using Python. My attempt:
from scipy.optimize import curve_fit

def func(x, a, b, c):
    return a*(x**b) + c

params, cov = curve_fit(func, time, X)
A = params[0]
B = params[1]
C = params[2]
LH_function = A*(x**B)
So I am not sure how to determine the likelihood function given just a dataset. Do I need to assume what distribution the data is in? (with 0 mean noise).
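For reference, here is a minimal sketch of one way such a likelihood could be written down and maximised, assuming an exponential-decay model y = A*exp(-B*t) plus zero-mean Gaussian noise with the stated variance of 20 (the model form and the synthetic data below are assumptions for illustration only):
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 50)                                   # illustrative time points
y = 30*np.exp(-0.5*t) + rng.normal(0, np.sqrt(20), t.size)   # illustrative noisy data

def neg_log_likelihood(theta, t, y):
    A, B = theta
    mu = A*np.exp(-B*t)  # model prediction
    # the likelihood is the product of N(mu, 20) densities; minimise its negative log
    return -np.sum(stats.norm.logpdf(y, loc=mu, scale=np.sqrt(20.0)))

res = optimize.minimize(neg_log_likelihood, x0=[10.0, 0.1], args=(t, y))
A_hat, B_hat = res.x  # maximum likelihood estimates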
I am trying to use
scipy.optimize.curve_fit(func,xdata,ydata)
to determine the parameters of the exponentiated Weibull distribution:
# define the exponentiated Weibull pdf
def expweib(x, k, lamda, alpha):
    return alpha*(k/lamda)*((x/lamda)**(k-1))*((1 - np.exp(-(x/lamda)**k))**(alpha-1))*np.exp(-(x/lamda)**k)
#First generate random sample of exponentiated weibull distribution using stats.exponweib.rvs
data = stats.exponweib.rvs(a = 1, c = 82.243021128368554, loc = 0,scale = 989.7422, size = 1000 )
#Then use the sample data to draw a histogram
entries_Test, bin_edges_Test, patches_Test = plt.hist(data, bins=50, range=[909.5, 1010.5], density=True)
#calculate bin middles of the histogram
bin_middles_Test = 0.5*(bin_edges_Test[1:] + bin_edges_Test[:-1])
#use bin_middles_Test as xdata and entries_Test as ydata, with the previously defined expweib as func, and call curve_fit:
params, pcov = curve_fit(expweib, bin_middles_Test, entries_Test)
Then this warning occurs:
OptimizeWarning: Covariance of the parameters could not be estimated
I cannot identify which step has the issue, could anyone help?
Thank you
Reading through the documentation for the curve_fit method here, https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html: for the method argument, it mentions that the default 'lm' method won't work if the number of observations is less than the number of variables, in which case you should use either the 'trf' or the 'dogbox' method.
Also, reading about pcov in the Returns section, it mentions that the entries will be inf if the Jacobian matrix at the solution does not have full rank.
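For reference, a minimal sketch of passing an alternative solver (using the expweib function and histogram arrays defined in the question; the starting values in p0 are illustrative guesses based on the generating parameters):
params, pcov = curve_fit(expweib, bin_middles_Test, entries_Test,
                         p0=[82.0, 990.0, 1.0], method='trf')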
I tried your code with both trf and dogbox and got a pcov array full of zeros.
I have generated random data using:
bkg= 240-140*np.random.power(3.5,50000)
I plotted the points in a histogram using
h_all = plt.hist(bkg, bins=binedges, histtype='step')
My question is: provided that I know the pdf (in this case called "bkg"), can I generate a curve using scipy.optimize that fits the generated points perfectly, and what is the equation for that curve?
First of all, remark that your bkg is NOT a probability density function (pdf). Rather, it is a list of observations from a pdf. By calling matplotlib.pyplot.hist on this list of observations, you get to see a curve that approximates the (offset and scaled version of the) probability density function. If you are given this curve, it is possible to get a good estimation of the parameters needed to model this, provided you've been given the parameterized model a priori.
For example:
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit
offset, scale, a, nsamples = 240, -140, 3.5, 500000
bkg = offset + scale*np.random.power(a, nsamples) # values range between (offset, offset+scale), which map to 0 and 1
nbins = 100
count, bins, ignored = plt.hist(bkg, bins=nbins, histtype='stepfilled', edgecolor='none')
If now you are given the centers of these bins and the counts,
xdata = .5*(bins[1:]+bins[:-1])
ydata = count
and you are asked to find the parameters of the power distribution function that fits to this data (-> someone told you this, you trust that source), then you could go about in the following manner.
First, observe that the power distribution function P(x, a) is a monotonically increasing function (i.e. P(x1, a) < P(x2, a) when 0 <= x1 < x2 <= 1). That means that the dataset given above has been flipped left-to-right, or that it represents factor*P(x, a) with factor < 0.
Next, notice that the given data is not spread over the interval [0, 1], as is typical for a probability density function. That means that you should rescale the given xdata to the [0, 1] interval before attempting to fit the power distribution to it. Just by observing the graph, you can figure out that 0 and 1 map to 240 and 100 respectively. However, this is just luck here, because matplotlib chose a sensible range for plotting. When you don't actually know the limits that 0 and 1 map to, you could make the slightly less optimal (but still very good) choice of xdata[0] - binwidth/2 and xdata[-1] + binwidth/2, or the (slightly worse) choice of xdata[0] and xdata[-1]. From the previous paragraph, you know that 1 maps to xdata[0] - binwidth/2 (call it a) and 0 maps to xdata[-1] + binwidth/2 (call it b). The linear map that does this is lambda x: (a - b)*x + b (simple algebra).
If you pass this [0, 1]-mapped version of the xdata to curve_fit, it'll give you a good guess for the exponent.
def get_model(nobservations, binwidth, scale, offset):
    def model(bin_centers, exponent):
        x = (bin_centers - offset)/scale
        y = exponent*x**(exponent - 1)
        normed_y = nobservations * binwidth * y / np.abs(scale)
        return normed_y
    return model
binwidth = np.diff(xdata)[0]
p0, _ = curve_fit(get_model(nsamples, binwidth, scale=-xdata.ptp() - binwidth, offset=xdata[-1] + binwidth/2), xdata, ydata)
print(p0) # prints e.g.: 3.37117679
plt.plot(xdata, get_model(nsamples, binwidth, scale=-xdata.ptp() - binwidth, offset=xdata[-1] + binwidth/2)(xdata, *p0))
At this moment, you have found a rather accurate description of the distribution
that was used to generate the observations of bkg:
f(x) = offset + scale*(exponent * x**(exponent - 1))
= (xdata[-1] + binwidth/2) + (-xdata.ptp() - binwidth)*(p0[0] * x**(p0[0] - 1))
~ 234.85 - 134.85*(3.37 * x**(3.37 - 1))
By the way, I'd like to point out that replicating bkg (the observations from the distribution)
as a perfect copy is something you can only do if you know the exact parameters of the distribution (240, -140 and 3.5) AND set the seed for the random number generation equal to the seed that was in effect prior to the initial call to np.random.power.
If you'd like to fit a curve to the histogram using splines, you should retrieve the knots and coefficients from the generated spline and pass those into the bspleval function, as shown here. The topic of writing out those equations is a long one, however, and there are numerous resources on the internet that you can check to understand how it's done. Needless to say, that bspleval function is what you'll need in case you want to go that route. If it were me, I'd go the route of curve fitting shown above.
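If you did want the spline route, a minimal sketch (assuming the xdata and ydata arrays from above; the smoothing factor s is an illustrative choice) could look like this:
from scipy import interpolate

tck = interpolate.splrep(xdata, ydata, s=len(xdata))   # fit a smoothing spline
knots, coeffs, degree = tck                            # the pieces you would feed to bspleval
smooth_y = interpolate.splev(xdata, tck)               # evaluate the spline at the bin centers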
I am running a regression as follows (df is a pandas dataframe):
import statsmodels.api as sm
est = sm.OLS(df['p'], df[['e', 'varA', 'meanM', 'varM', 'covAM']]).fit()
est.summary()
Which gave me, among others, an R-squared of 0.942. So then I wanted to plot the original y-values and the fitted values. For this, I sorted the original values:
import numpy as np
import matplotlib.pyplot as plt

orig = df['p'].values
fitted = est.fittedvalues.values
resid = est.resid.values  # residuals, needed for the plot below
args = np.argsort(orig)

plt.plot(orig[args], 'bo')
plt.plot(orig[args] - resid[args], 'ro')
plt.show()
This, however, gave me a graph where the values were completely off. Nothing that would suggest an R-squared of 0.9. Therefore, I tried to calculate it manually myself:
yBar = df['p'].mean()
SSTot = df['p'].apply(lambda x: (x-yBar)**2).sum()
SSReg = ((est.fittedvalues - yBar)**2).sum()
1 - SSReg/SSTot
Out[79]: 0.2618159806908984
Am I doing something wrong? Or is there a reason why my computation is so far off what statsmodels is getting? SSTot, SSReg have values of 48084, 35495.
If you do not include an intercept (constant explanatory variable) in your model, statsmodels computes R-squared based on the un-centred total sum of squares, i.e.
tss = (ys ** 2).sum() # un-centred total sum of squares
as opposed to
tss = ((ys - ys.mean())**2).sum() # centred total sum of squares
As a result, R-squared will be much higher.
This is mathematically correct, because R-squared should indicate how much of the variation is explained by the full model compared to the reduced model. If you define your model as:
ys = beta1 . xs + beta0 + noise
then the reduced model can be: ys = beta0 + noise, where the estimate for beta0 is the sample average, thus we have: noise = ys - ys.mean(). That is where de-meaning comes from in a model with intercept.
But from a model like:
ys = beta . xs + noise
you may only reduce to: ys = noise. Since noise is assumed zero-mean, you may not de-mean ys. Therefore, unexplained variation in the reduced model is the un-centred total sum of squares.
This is documented here under the rsquared item. Set yBar equal to zero, and I would expect you to get the same number.
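As an illustrative check (using df and est from the question; this assumes the model was fitted without a constant), the un-centred version can be computed directly:
ss_res = (est.resid ** 2).sum()        # residual sum of squares
tss_uncentred = (df['p'] ** 2).sum()   # un-centred total sum of squares
print(1 - ss_res / tss_uncentred)      # should be close to est.rsquared (~0.942)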
If your model is:
a = <yourmodel>.fit()
Then, to compute fitted values:
a.fittedvalues
and to compute R squared:
a.rsquared
The following code fits an oversimplified generalized linear model using statsmodels:
import statsmodels.api as sm
import statsmodels.formula.api as smf

model = smf.glm('Y ~ 1', family=sm.families.NegativeBinomial(), data=df)
results = model.fit()
This gives the coefficient and a stderr:
             coef    stderr
Intercept    2.9471   0.120
Now I want to graphically compare the real distribution of the variable Y (histogram) with the distribution that comes from the model.
But I need two parameters r and p to evaluate the stats.nbinom(r,p) and plot it.
Is there a way to retrieve the parameters from the results of the fitting?
How can I plot the PMF?
Generalized linear models (GLM) in statsmodels currently do not estimate the extra parameter of the Negative Binomial distribution. The Negative Binomial belongs to the exponential family of distributions only for a fixed shape parameter.
However, statsmodels also has Negative Binomial as a Maximum Likelihood Model in discrete_model which estimates all parameters.
The parameterization of the Negative Binomial for count regression is in terms of the mean or expected value, which is different from the parameterization in scipy.stats.nbinom. Actually, there are two different commonly used parameterizations for the Negative Binomial count regression, usually called nb1 and nb2.
Here is a quickly written script that recovers the scipy.stats.nbinom parameters, n=size and p=prob, from the estimated parameters. Once you have the parameters for the scipy.stats distribution you can use all the available methods, rvs, pmf, and so on.
Something like this should be made available in statsmodels.
In a few example runs, I got results like this
data generating parameters 50 0.25
estimated params 51.7167511571 0.256814610633
estimated params 50.0985814878 0.249989725917
Aside: because of the underlying exponential reparameterization, the scipy optimizers sometimes have problems converging. In those cases, either providing better starting values or using Nelder-Mead as the optimization method usually helps.
import numpy as np
from scipy import stats
import statsmodels.api as sm
# generate some data to check
nobs = 1000
n, p = 50, 0.25
dist0 = stats.nbinom(n, p)
y = dist0.rvs(size=nobs)
x = np.ones(nobs)
loglike_method = 'nb1' # or use 'nb2'
res = sm.NegativeBinomial(y, x, loglike_method=loglike_method).fit(start_params=[0.1, 0.1])
print(dist0.mean())
print(res.params)
mu = res.predict() # use this for mean if not constant
mu = np.exp(res.params[0]) # shortcut, we just regress on a constant
alpha = res.params[1]
if loglike_method == 'nb1':
    Q = 1
elif loglike_method == 'nb2':
    Q = 0
size = 1. / alpha * mu**Q
prob = size / (size + mu)
print('data generating parameters', n, p)
print('estimated params', size, prob)
#estimated distribution
dist_est = stats.nbinom(size, prob)
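A minimal sketch of plotting the fitted PMF against the data (using y and dist_est from the script above; the bin and style choices are illustrative):
import matplotlib.pyplot as plt

support = np.arange(y.min(), y.max() + 1)
plt.hist(y, bins=50, density=True, alpha=0.5, label='data')
plt.plot(support, dist_est.pmf(support), 'r-', label='fitted nbinom pmf')
plt.legend()
plt.show()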
BTW: I ran into this before but didn't have time to look at it
https://github.com/statsmodels/statsmodels/issues/106