curve_fit fails on exponentiated Weibull distribution - Python

I am trying to use
scipy.optimize.curve_fit(func, xdata, ydata)
to determine the parameters of the exponentiated Weibull distribution:
# define the exponentiated Weibull density
def expweib(x, k, lamda, alpha):
    return alpha*(k/lamda)*((x/lamda)**(k-1))*((1-np.exp(-(x/lamda)*k))**(alpha-1))*np.exp(-(x/lamda)*k)
#First generate random sample of exponentiated weibull distribution using stats.exponweib.rvs
data = stats.exponweib.rvs(a = 1, c = 82.243021128368554, loc = 0,scale = 989.7422, size = 1000 )
#Then use the sample data to draw a histogram
entries_Test, bin_edges_Test, patches_Test = plt.hist(data, bins=50, range=[909.5,1010.5], normed=True)
#calculate bin middles of the histogram
bin_middles_Test = 0.5*(bin_edges_Test[1:] + bin_edges_Test[:-1])
#use bin_middles_Test as xdata, entries_Test as ydata, and the previously defined expweib as func, then call curve_fit:
params, pcov = curve_fit(expweib, bin_middles_Test, entries_Test)
Then this warning occurs:
OptimizeWarning: Covariance of the parameters could not be estimated
I cannot identify which step has the issue; could anyone help?
Thank you.

Reading through the documentation for the curve_fit method here, https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html, the description of the method argument mentions that the default 'lm' method won't work if the number of observations is less than the number of variables; in that case you should use either the 'trf' or the 'dogbox' method.
Also, reading about pcov in the Returns section, it says the entries will be inf if the Jacobian matrix at the solution does not have full rank.
I tried your code with both trf and dogbox and got a pcov array full of zeros.
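For reference, here is a minimal, self-contained sketch of how the method argument and a starting guess can be passed to curve_fit. The starting values in p0 are my own illustrative choices based on the parameters used to generate the data, and the density below writes the exponent as (x/lamda)**k (the standard exponentiated Weibull form); treat both as assumptions rather than a definitive fix.
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

# exponentiated Weibull density; note the (x/lamda)**k inside the exponentials
def expweib(x, k, lamda, alpha):
    z = (x/lamda)**k
    return alpha*(k/lamda)*((x/lamda)**(k-1))*((1 - np.exp(-z))**(alpha-1))*np.exp(-z)

# same simulated data as in the question
data = stats.exponweib.rvs(a=1, c=82.243021128368554, loc=0, scale=989.7422, size=1000)
entries, edges = np.histogram(data, bins=50, range=(909.5, 1010.5), density=True)
mids = 0.5*(edges[1:] + edges[:-1])

# illustrative starting values near the data-generating parameters; without p0,
# every parameter starts at 1.0, which is far from this data
params, pcov = curve_fit(expweib, mids, entries,
                         p0=[80.0, 990.0, 1.0],
                         method='trf', bounds=(1e-8, np.inf))
print(params)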

Related

scipy.stats cdf greater than 1

I'm using scipy.stats and I need the CDF up to a given value x for some distributions. I know a PDF can be greater than 1 because it is a density, not a probability: it only needs to integrate to 1, even if individual values exceed 1. A CDF, however, should never be greater than 1, yet when I call the cdf function in scipy.stats I sometimes get values like 2.89. I'm completely sure I'm using cdf and not pdf (that was my first guess). This is breaking my results and my algorithm, because I need accumulated probabilities. Why is scipy.stats cdf returning values greater than 1, and how should I fix it?
Code for reproducing the issue with a sample distribution and parameters (but it happens with others too):
from scipy import stats
distribution = stats.gausshyper
params = [9.482986347673158, 16.65813644507513, -38.11083665959626, 16.08698932118982, -13.387170754433273, 18.352117022674125]
test_val = [-0.512720,1,1]
arg = params[:-2]
loc = params[-2]
scale = params[-1]
print("cdf:",distribution.cdf(test_val,*arg, loc=loc,scale=scale))
print("pdf:",distribution.pdf(test_val,*arg, loc=loc,scale=scale))
cdf: [2.68047481 7.2027761 7.2027761 ]
pdf: [2.76857133 2.23996739 2.23996739]
The problem lies in the parameters you have specified for the Gauss hypergeometric (HG) distribution, specifically in the third element of params, which is the parameter beta of the HG distribution (see equation 2 in this paper for the definition of the Gauss hypergeometric density). This parameter has to be positive for HG to have a valid density. Otherwise the density won't integrate to 1, which is exactly what is happening in your example: with a negative beta, it is not a valid probability distribution.
You can also find the requirement that beta (denoted as b) has to be positive in the scipy documentation here.
Changing beta to a positive parameter immediately solves your problem:
from scipy import stats
distribution = stats.gausshyper
params = [9.482986347673158, 16.65813644507513, 38.11083665959626, 16.08698932118982, -13.387170754433273, 18.352117022674125]
test_val = [-0.512720,1,1]
arg = params[:-2]
loc = params[-2]
scale = params[-1]
print("cdf:",distribution.cdf(test_val,*arg, loc=loc,scale=scale))
print("pdf:",distribution.pdf(test_val,*arg, loc=loc,scale=scale))
Output:
cdf: [1. 1. 1.]
pdf: [3.83898392e-32 1.25685346e-35 1.25685346e-35]
where all CDF values are now at most 1, as desired. Also note that your x has to be between 0 and 1, as described in the scipy documentation here.
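As a quick sanity check (a sketch of my own, not part of the original answer), you can numerically integrate the gausshyper density with the corrected positive beta over its support, which after shifting and scaling by loc and scale is [loc, loc + scale], and confirm that it integrates to 1:
from scipy import stats
from scipy.integrate import quad

a, b, c, z = 9.482986347673158, 16.65813644507513, 38.11083665959626, 16.08698932118982
loc, scale = -13.387170754433273, 18.352117022674125

# integrate the density over its full (shifted and scaled) support [loc, loc + scale]
total, abserr = quad(lambda x: stats.gausshyper.pdf(x, a, b, c, z, loc=loc, scale=scale),
                     loc, loc + scale)
print(total)  # should be very close to 1.0 for a valid density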

Maximum Likelihood Estimator Python

Suppose I have data points distributed over time like a decreasing exponential function, but they include zero-mean Gaussian noise with a variance of, say, 20. How would I determine the likelihood function and find MLEs for the parameters? All I have is the data itself.
I have fit an exponential curve using Python. My attempt:
def func(x, A, B, C):
    return A*(x**B) + C

params, cov = curve_fit(func, time, X)
A = params[0]
B = params[1]
C = params[2]
LH_function = A*(x**B)
So I am not sure how to determine the likelihood function given just a dataset. Do I need to assume a distribution for the data (given the zero-mean noise)?
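Not a full answer, but a minimal sketch of the standard reasoning: if the noise is assumed to be zero-mean Gaussian with a known variance (20 in the question), the log-likelihood of a model y = A*x**B + C is, up to a constant, -sum((y - model)^2) / (2*sigma^2), so maximizing it reduces to least squares. The data below (time, X) are synthetic stand-ins, since the actual dataset is not shown.
import numpy as np
from scipy.optimize import minimize

sigma2 = 20.0  # noise variance assumed known, as in the question

# synthetic stand-in data; the question's real data are not shown
rng = np.random.default_rng(0)
time = np.linspace(1, 10, 50)
X = 50.0 * time**-0.7 + 5.0 + rng.normal(0, np.sqrt(sigma2), size=time.size)

def neg_log_likelihood(theta, t, y):
    A, B, C = theta
    resid = y - (A * t**B + C)
    # Gaussian log-likelihood: -n/2 * log(2*pi*sigma^2) - sum(resid^2) / (2*sigma^2)
    ll = -0.5 * len(y) * np.log(2 * np.pi * sigma2) - np.sum(resid**2) / (2 * sigma2)
    return -ll

# the starting point is an arbitrary illustrative guess
result = minimize(neg_log_likelihood, x0=[10.0, -0.5, 1.0], args=(time, X),
                  method='Nelder-Mead')
print(result.x)  # MLE for A, B, C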

Using scipy to fit CDF with real data, but CDF start not from 0

Here are my samples and my code for fitting a CDF.
import numpy as np
import pandas as pd
import scipy.stats as st
samples = [2,3,10,7,9,6,1,3,7,2,5,4,6,3,4,1,4,6,3,10,3,7,5,6,6,5,4,2,2,5,4,5,6,4,4,6,3,3,3,2,2,2,4,2,6,2,7,4,3,2,2,1,4,2,2,5,3,9,6,8,3,6,6,3,9,2,3,3,3,5,4,4,5,4,1,8,5,8,6,6,7,6,3,2,4,2,16,6,2,3,4,2,2,9,9,5,5,5,1,5,2,8,5,3,5,8,11,4,7,4,11,3,7,3,6,6,1,4,2,1,1,1,9,4,15,2,1,3,4,9,3,3,4,3,6,3,3,5,5,6,3,3,4,8,4,4,2,5,6,7,3,5,5,2,5,9,7,6,1,3,4,9,3,2,4,8,5,8,4,4,5,6,5,8,6,1,3,7,9,6,7,12,4,1,4,5,5,7,1,7,1,15,3,3,2,3,7,7,15,6,5,1,7,4,2,10,1,3,3,8,3,8,1,5,4,7,4,2,9,2,1,3,6,1,6,10,6,3,4,7,5,7,3,3,7,4,4,3,5,3,5,2,2,1,2,3,1,1,2,1,1,2,3,10,7,3,2,6,5,6,5,11,1,7,5,2,9,5,12,6,3,9,9,4,3,4,6,4,10,4,8,6,1,7,2,5,8,3,1,3,1,1,3,3,2,2,6,3,3,2,6,6,6,4,2,4,1,10,5,3,5,6,3,4,1,1,7,6,6,5,7,6,3,4,6,6,5,3,2,3,2,1,2,4,1,1,1,3,7,1,6,3,4,3,3,6,7,3,7,4,1,1,7,1,4,4,3,4,2,4,2,6,6,2,2,6,5,4,6,5,6,3,5,1,5,3,3,2,2,2,2,3,3,3,2,2,1,4,2,3,5,7,2,5,1,2,2,5,6,5,2,1,2,4,5,2,3,2,4,9,3,5,2,2,5,4,2,3,4,2,3,1,3,6,7,2,6,3,5,4,2,2,2,2,1,2,5,2,2,3,4,2,5,2,2,3,5,3,2,4,3,2,5,4,1,4,8,6,8,2,2,3,1,2,3,8,2,3,4,3,3,2,1,1,1,3,3,4,3,4,1,2,8,2,2,7,3,1,2,3,3,2,3,1,2,1,1,1,3,2,2,2,4,7,2,1,2,3,1,3,1,1,6,2,1,1,3,1,4,4,1,3,1,1,4,1,1,2,4,4,3,2,3,2,1,2,1,4,2,5,3,4,2,1,1,1,3,1,2,1,1,4,2,1,3,2,1,3,2,1,1,1,2,1,1,1,1,2,1,1,1,1,1,1,1]
bins=np.arange(1, 18, 0.1)
# Because min(samples) = 1, I start the bins from 1.
y, x = np.histogram(samples, bins=bins, density=True)
params = st.lognorm.fit(samples)
# Separate parts of parameters
arg = params[:-2]
loc = params[-2]
scale = params[-1]
ccdf = st.lognorm.cdf(x, loc=loc, scale=scale, *arg)
cdf = pd.Series(ccdf, x)
#cdf[1.0] is not 0... That is the issue...
When I print out the first value, cdf[1.0], it does not equal 0. According to theory, it should be 0. As the picture below shows, the first CDF value is not 0. I have checked my code again and again, but I cannot fix the problem. I would appreciate any suggestions.
In your code you are building a bar chart (histogram) from your sample. That is fine, but what ends up on your graph is not the histogram; it is the fitted distribution function of the sample.
The code does not match the picture.
Here is the PDF curve plotted over the histogram.
Code for graph above:
import matplotlib.pyplot as plt
# ... insert your sample and calculate the lognorm parameters (already in your code)
x = np.linspace(min(samples), max(samples), 100)
pdf = st.lognorm.pdf(x, *arg, loc=loc, scale=scale)
plt.plot(x, pdf)
plt.hist(samples, bins=max(samples)-min(samples), density=True, alpha=0.75)
plt.show()
Your code also computes the CDF, and SciPy does find it; what you draw on the second graph is exactly that fitted CDF. The point of confusion is that the CDF of the fitted distribution is not zero at the minimum value of the sample.
Keep in mind that the fit function only brings the fitted curve close to your sample; it does not produce a curve that exactly reproduces the empirical distribution function.
SciPy simply allows for the possibility that the population could contain values less than one, even though no such values appear in your sample. Likewise, the fitted PDF says that values greater than 14 are extremely unlikely, yet your sample contains values as large as 15 and 16.
As a result, the CDF should not be expected to be zero at cdf[1.0].
P.S. The CDF will still be essentially zero at x = 0 if you pass that point to it.
Code for graph above:
# ... insert your sample and calculate the lognorm parameters (already in your code)
x = np.linspace(0, max(samples), 100)
cdf = st.lognorm.cdf(x, *arg, loc=loc, scale=scale)
plt.plot(x, cdf)
plt.show()
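To make the point concrete, here is a short sketch of my own (not part of the original answer) that overlays the empirical CDF of the sample with the fitted lognormal CDF; it assumes the samples list from the question is already defined, and shows directly that the fitted curve is already above zero at x = 1 while the empirical CDF only starts there:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st

# 'samples' is the list from the question
data = np.sort(np.asarray(samples))
ecdf = np.arange(1, len(data) + 1) / len(data)   # empirical CDF

params = st.lognorm.fit(samples)
arg, loc, scale = params[:-2], params[-2], params[-1]

xs = np.linspace(0, data.max(), 200)
plt.step(data, ecdf, where='post', label='empirical CDF')
plt.plot(xs, st.lognorm.cdf(xs, *arg, loc=loc, scale=scale), label='fitted lognorm CDF')
plt.legend()
plt.show()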

How to get the mode of distribution in scipy.stats

The scipy.stats library has functions to find the mean and median of a fitted distribution, but not the mode.
If I have the parameters of a distribution after fitting to data, how can I find the mode of the fitted distribution?
If I understand you correctly, you want to find the mode of a fitted distribution rather than the mode of the data itself. Basically, we can do it in the following three steps.
Step 1: generate a dataset from a distribution
from scipy import stats
from scipy.optimize import minimize
# generate normally distributed data with mean 0 and standard deviation 1
data = stats.norm.rvs(loc=0, scale=1, size=100)
data[0:5]
Output:
array([1.76405235, 0.40015721, 0.97873798, 2.2408932 , 1.86755799])
Step 2: fit the parameters
# fit the parameters of norm distribution
params = stats.norm.fit(data)
params
Output:
(0.059808015534485, 1.0078822447165796)
Note that there are 2 parameters for stats.norm, i.e. loc and scale. The parameters differ between distributions in scipy.stats, so it is convenient to store them in a tuple and unpack them in the next step.
Step 3: get the mode (the maximum of the density function) of the fitted distribution
# continuous case: the mode maximizes the density, so minimize its negative
def your_density(x):
    return -stats.norm.pdf(x, *params)

minimize(your_density, 0).x
Output:
0.05980794
Note that the mode of a normal distribution equals its mean, so the agreement with the fitted loc here is particular to this example and will not hold in general.
One more thing: scipy treats continuous and discrete distributions differently (they have different parent classes), so for discrete distributions you can do the same thing with the following code.
## discrete case, example for a Poisson distribution
import numpy as np
x = np.arange(0, 100)  # the range of x should be specified
x[stats.poisson.pmf(x, mu=2).argmax()]  # find the x value that maximizes the pmf
Out:
1
You can try it with your own data and distributions!
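To illustrate that the mode only coincides with the fitted mean for symmetric distributions like the normal, here is a short sketch of my own applying the same minimize approach to a fitted lognormal, where the mode ends up well below the mean:
import numpy as np
from scipy import stats
from scipy.optimize import minimize

# simulate skewed data and fit a lognormal to it
data = stats.lognorm.rvs(s=0.9, scale=2.0, size=1000, random_state=0)
params = stats.lognorm.fit(data, floc=0)

def neg_density(x):
    return -stats.lognorm.pdf(x, *params)

mode = minimize(neg_density, x0=1.0).x[0]
mean = stats.lognorm.mean(*params)
print(mode, mean)  # the mode is noticeably smaller than the mean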

statsmodels - plotting the fitted distribution

The following code fits an oversimplified generalized linear model using statsmodels:
model = smf.glm('Y ~ 1', family=sm.families.NegativeBinomial(), data=df)
results = model.fit()
This gives the coefficient and a stderr:
            coef    std err
Intercept  2.9471     0.120
Now I want to graphically compare the real distribution of the variable Y (histogram) with the distribution that comes from the model.
But I need the two parameters r and p to evaluate stats.nbinom(r, p) and plot it.
Is there a way to retrieve the parameters from the results of the fitting?
How can I plot the PMF?
Generalized linear models (GLM) in statsmodels currently do not estimate the extra parameter of the negative binomial distribution; the negative binomial belongs to the exponential family of distributions only for a fixed shape parameter.
However, statsmodels also has Negative Binomial as a Maximum Likelihood Model in discrete_model which estimates all parameters.
The parameterization of the negative binomial for count regression is in terms of the mean or expected value, which is different from the parameterization in scipy.stats.nbinom. There are actually two commonly used parameterizations for negative binomial count regression, usually called nb1 and nb2.
Here is a quickly written script that recovers the scipy.stats.nbinom parameters, n=size and p=prob, from the estimated parameters. Once you have the parameters for the scipy.stats distribution, you can use all of its methods, rvs, pmf, and so on.
Something like this should be made available in statsmodels.
In a few example runs, I got results like this:
data generating parameters 50 0.25
estimated params 51.7167511571 0.256814610633
estimated params 50.0985814878 0.249989725917
As an aside: because of the underlying exponential reparameterization, the scipy optimizers sometimes have trouble converging. In those cases, either providing better starting values or using Nelder-Mead as the optimization method usually helps.
import numpy as np
from scipy import stats
import statsmodels.api as sm
# generate some data to check
nobs = 1000
n, p = 50, 0.25
dist0 = stats.nbinom(n, p)
y = dist0.rvs(size=nobs)
x = np.ones(nobs)
loglike_method = 'nb1' # or use 'nb2'
res = sm.NegativeBinomial(y, x, loglike_method=loglike_method).fit(start_params=[0.1, 0.1])
print(dist0.mean())
print(res.params)
mu = res.predict() # use this for mean if not constant
mu = np.exp(res.params[0]) # shortcut, we just regress on a constant
alpha = res.params[1]
if loglike_method == 'nb1':
    Q = 1
elif loglike_method == 'nb2':
    Q = 0
size = 1. / alpha * mu**Q
prob = size / (size + mu)
print('data generating parameters', n, p)
print('estimated params          ', size, prob)
#estimated distribution
dist_est = stats.nbinom(size, prob)
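To answer the plotting part of the question, here is a short follow-up sketch of my own that builds on the script above (it reuses y and dist_est) and compares a normalized histogram of the observed counts with the PMF of the estimated nbinom distribution:
import numpy as np
import matplotlib.pyplot as plt

# compare observed counts with the PMF of the estimated distribution
support = np.arange(y.min(), y.max() + 1)
plt.hist(y, bins=np.arange(y.min() - 0.5, y.max() + 1.5), density=True,
         alpha=0.5, label='observed Y')
plt.plot(support, dist_est.pmf(support), 'o-', label='estimated nbinom PMF')
plt.legend()
plt.show()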
BTW: I ran into this before but didn't have time to look at it
https://github.com/statsmodels/statsmodels/issues/106
