Different values for the Weibull PDF - Python

I was wondering why the values of the Weibull PDF computed with the prebuilt function dweibull.pdf are roughly half of what they should be.
I ran a test. For the same x I created the Weibull PDF for A=10 and K=2 twice: once by writing the formula myself and once with the prebuilt dweibull function.
import numpy as np
from scipy.stats import exponweib, dweibull
import matplotlib.pyplot as plt

K = 2.0
A = 10.0
x = np.arange(0., 20., 1)

# own function
def weib(data, a, k):
    return (k / a) * (data / a)**(k - 1) * np.exp(-(data / a)**k)

pdf1 = weib(x, A, K)
print(sum(pdf1))

# prebuilt function
dist = dweibull(K, 1, A)
pdf2 = dist.pdf(x)
print(sum(pdf2))

f = plt.figure()
suba = f.add_subplot(121)
suba.plot(x, pdf1)
suba.set_title('pdf own function')
subb = f.add_subplot(122)
subb.plot(x, pdf2)
subb.set_title('pdf dweibull')
f.show()
It seems that with dweibull the PDF values are about half of what they should be, which looks wrong: the values should sum to roughly 1, not to around 0.5 as they do with dweibull. With my own formula the sum is around 1.

scipy.stats.dweibull implements the double Weibull distribution. Its support is the real line. Your function weib corresponds to the PDF of scipy's weibull_min distribution.
Compare your function weib to weibull_min.pdf:
In [128]: from scipy.stats import weibull_min
In [129]: x = np.arange(0, 20, 1.0)
In [130]: K = 2.0
In [131]: A = 10.0
Your implementation:
In [132]: weib(x, A, K)
Out[132]:
array([ 0. , 0.019801 , 0.03843158, 0.05483587, 0.0681715 ,
0.07788008, 0.08372116, 0.0857677 , 0.08436679, 0.08007445,
0.07357589, 0.0656034 , 0.05686266, 0.04797508, 0.03944036,
0.03161977, 0.02473752, 0.01889591, 0.014099 , 0.0102797 ])
scipy.stats.weibull_min.pdf:
In [133]: weibull_min.pdf(x, K, scale=A)
Out[133]:
array([ 0. , 0.019801 , 0.03843158, 0.05483587, 0.0681715 ,
0.07788008, 0.08372116, 0.0857677 , 0.08436679, 0.08007445,
0.07357589, 0.0656034 , 0.05686266, 0.04797508, 0.03944036,
0.03161977, 0.02473752, 0.01889591, 0.014099 , 0.0102797 ])
By the way, there is a mistake in this line of your code:
dist=dweibull(K,1,A)
The order of the parameters is shape, location, scale, so you are setting the location parameter to 1. That's why the values in your second plot are shifted by one. That line should have been
dist = dweibull(K, 0, A)
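As a quick check (a sketch, not part of the original answer): because the double Weibull spreads half of its probability mass over the negative axis, its PDF on the positive axis is exactly half of weibull_min's, which is why the values summed to roughly 0.5.
import numpy as np
from scipy.stats import dweibull, weibull_min

K, A = 2.0, 10.0
x = np.arange(0.0, 20.0, 1.0)

# on x >= 0 the double Weibull PDF is half the weibull_min PDF
print(np.allclose(dweibull.pdf(x, K, scale=A),
                  0.5 * weibull_min.pdf(x, K, scale=A)))   # True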

Related

Area under the histogram is not 1 when using density in plt.hist

Consider the following dataset with random data:
test_dataset = np.array([ -2.09601881, -4.26602684, 1.09105452, -4.59559669,
1.05865251, -0.93076762, -14.70398945, -18.01937129,
4.64126152, -10.34178822, -9.46058493, -5.66864965,
-3.17562022, 15.7030379 , 10.59675205, -5.80882413,
-24.00604149, -4.81518663, -1.94333927, 1.18142171,
12.72030312, 3.84917581, -0.4468796 , 11.91828567,
-17.99171774, 9.35108712, -5.57233376, 5.77547128,
5.49296099, -10.96132844, -18.75174336, 5.27843303,
25.73548956, -21.58043021, -14.24734733, 12.57886018,
-22.10002076, 1.72207555, -6.0411867 , -3.63568527,
7.26542117, -0.21449529, -6.64974714, -0.94574606,
-4.23339431, 16.76199734, -12.42195793, 18.965854 ,
-23.85336123, -15.55104466, 6.17215868, 7.34993316,
8.62461351, -16.30482638, -16.35601099, 1.96857833,
18.74440399, -22.48374434, -10.895831 , -10.14393648,
-17.62768751, 4.83388855, 20.1578181 , 6.04299626,
0.97198296, -3.40889754, -10.62734293, 1.70240472,
20.4203839 , 10.26751364, 15.47859675, -10.97940064,
1.82728251, 4.22894717, 8.31502887, -5.48502811,
-1.09244874, -11.32072796, -24.88520436, -7.42108403,
19.4200716 , 4.82704045, -12.46290135, -15.18466755,
6.37714692, -11.06825059, 5.10898588, -9.07485484,
1.63946084, -12.2270078 , 12.63776832, -25.03916909,
2.42972082, -14.22890171, 18.2199446 , 6.9819771 ,
-12.07795089, 2.59948596, -16.90206575, 6.35192719,
7.33823106, -23.69653447, -11.66091871, -19.40251179,
-12.64863792, 11.04004231, 13.7247356 , -16.36107329,
20.43227515, 17.97334692, 16.92675175, -5.62051239,
-8.66304184, -8.40848514, -23.20919855, 0.96808137,
-5.03287253, -3.13212582, 18.81155666, -8.27988284,
3.85708447, 12.43039322, 17.98003878, 18.11009997,
-3.74294421, -16.62276121, 9.4446743 , 2.2060981 ,
8.34853736, 14.79144713, -1.91113975, -5.17061419,
4.53451746, 8.19090358, 7.98343201, 11.44592322,
-16.9132677 , -25.92554857, 10.10638432, -8.09236786,
20.8878207 , 19.52368296, 0.85858125, 2.61760415,
9.21360649, -8.1192651 , -6.94829273, 2.73562447,
13.40981323, -9.05018331, -17.77563166, -21.03927199,
4.10415845, -1.31550732, 5.68284828, 15.08670773,
-19.78675315, 12.94697869, -11.51797637, 1.91485992,
16.69417993, -16.04271622, -1.14028558, 9.79830109,
-18.58386093, -7.52963269, -10.10059878, -25.2194216 ,
-0.10598426, -15.77641532, -14.15999125, 14.35011271,
11.15178588, -14.43856266, 15.84015226, -3.41221883,
11.90724469, 0.57782081, 18.82127466, -6.01068727,
-19.83684476, 2.20091942, -1.38707755, -8.62821053,
-11.89000913, -11.69539815, 5.70242019, -3.83781841,
5.35894135, -0.30995954, 21.76661212, 8.52974329,
-9.13065082, -11.06209 , -12.00654618, 2.769838 ,
-12.21579496, -27.2686534 , -4.58538197, -6.94388425])
I'd like to plot a normalized histogram of it, so in the plt.hist options I choose density=True:
import numpy as np
import matplotlib.pyplot as plt
data1, bins, _ = plt.hist(test_dataset, density=True);
print(np.trapz(data1))
print(sum(data1))
which produces a histogram (figure not shown) and prints the following values:
0.18206124014272715
0.18866449755723017
From matplotlib documentation:
The density parameter, which normalizes bin heights so that the integral of the histogram is 1. The resulting histogram is an approximation of the probability density function.
But my example clearly shows that the integral of the histogram is NOT 1 and that it depends strongly on the number of bins: if I set it to 40, for example, the sum increases:
data1, bins, _ = plt.hist(test_dataset, density=True, bins=40);
print(np.trapz(data1))
print(sum(data1))
0.7508847002777762
0.7546579902289207
Is the description in the documentation incorrect, or am I misunderstanding something here?
You are not computing the area. The area is the sum of the bin heights times the bin widths; in your example:
sum(data1 * np.diff(bins)) == 1
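For example (a short sketch, reusing test_dataset from the question): once the bin widths are taken into account, the area stays at 1 for any number of bins, even though the plain sum of the heights changes.
import numpy as np
import matplotlib.pyplot as plt

# test_dataset is the array defined in the question above
for nbins in (10, 40):
    heights, bins, _ = plt.hist(test_dataset, bins=nbins, density=True)
    print(nbins, np.sum(heights * np.diff(bins)))   # ~1.0 in both cases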

Inverse problem with scipy.optimize.differential_evolution

I've been writing code to fit y = ax + b given values for x and y, so I want to determine a and b.
For example, I have y=2x+1 so
xexp =[ 0., 0.1111111, 0.2222222, 0.3333333, 0.4444444, 0.5555556, 0.6666667, 0.7777778, 0.8888889, 1.] and
yexp=[1., 1.2222222, 1.4444444, 1.6666667, 1.8888889, 2.1111111, 2.3333333, 2.5555556, 2.7777778, 3.]
I need a loop in my code that calculates every ycalc[i] and returns, as my objective function, the mean error over all the given data.
from scipy.optimize import differential_evolution
import numpy as np
from matplotlib import pyplot
from scipy.optimize import LinearConstraint
from scipy.optimize import NonlinearConstraint

# 2*x+1 // a*x+b --> have xexp and yexp, need to calc a and b when xcalc = xexp
errototal = []
ycalc = []
erro = []

# experimental values
xexp = [0., 0.1111111, 0.2222222, 0.3333333, 0.4444444, 0.5555556, 0.6666667, 0.7777778, 0.8888889, 1.]
yexp = [1., 1.2222222, 1.4444444, 1.6666667, 1.8888889, 2.1111111, 2.3333333, 2.5555556, 2.7777778, 3.]

# obj function
def func(xcalc):
    return np.mean(sum(errototal))

bounds = [(-5, 5), (-5, 5)]
plot1 = []

def condition(xk, convergence):
    for i in range(0, 9):
        ycalc.append(xk[0]*xexp[i] + xk[1])
        # ycalc.append(xk[0]*xk[2]+xk[1])
        erro.append(abs(yexp[i] - ycalc[i]))
        errototal.append(erro[i])
    plot1.append(func(sum(errototal)))
    print(xk, errototal, ycalc, yexp)
    print(sum(errototal))

result = differential_evolution(func, bounds, disp=True, callback=condition)

# line plot of best objective function values
pyplot.plot(plot1, '.-')
pyplot.xlabel('Improvement Number')
pyplot.ylabel('Evaluation f(x)')
pyplot.show()
My code stops in the first step. Any tips?
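A minimal sketch of one way to set this up (assumptions: the same y = 2x + 1 data as above, and the mean absolute error as the objective). The error is computed inside the objective function itself from the candidate parameters (a, b), so differential_evolution can score each candidate directly instead of relying on the callback:
import numpy as np
from scipy.optimize import differential_evolution

xexp = np.linspace(0.0, 1.0, 10)   # same grid as in the question
yexp = 2.0 * xexp + 1.0            # y = 2x + 1

def func(params):
    a, b = params
    ycalc = a * xexp + b
    return np.mean(np.abs(yexp - ycalc))   # mean absolute error

bounds = [(-5, 5), (-5, 5)]
result = differential_evolution(func, bounds, disp=True)
print(result.x)   # should be close to [2, 1]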

Non-evenly spaced np.array with more points near the boundary

I have an interval, say (0, 9), and I have to generate points within it such that they are denser at both boundaries. I know the number of points, say n_x. alpha decides the "denseness" of the system, such that the points are evenly spaced if alpha = 1.
The cross product of n_x and n_y is supposed to look like this (figure omitted):
So far the closest I've come to this is by using np.geomspace, but it is only dense near the left-hand side of the domain:
In [55]: np.geomspace(1,10,15) - 1
Out[55]:
array([0. , 0.17876863, 0.38949549, 0.63789371, 0.93069773,
1.27584593, 1.6826958 , 2.16227766, 2.72759372, 3.39397056,
4.17947468, 5.1054023 , 6.19685673, 7.48342898, 9. ])
I also tried dividing the domain into two parts, (0, 4) and (5, 10), but that did not help either (since geomspace only adds more points at the left-hand side of the domain).
In [29]: np.geomspace(5,10, 15)
Out[29]:
array([ 5. , 5.25378319, 5.52044757, 5.80064693, 6.09506827,
6.40443345, 6.72950096, 7.07106781, 7.42997145, 7.80709182,
8.20335356, 8.61972821, 9.05723664, 9.51695153, 10. ])
Apart from that, I am a bit confused about which mathematical function I can use to generate such an array.
You can use the cumulative beta distribution (the beta CDF) and map it to your range.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta
def denseboundspace(size=30, start=0, end=9, alpha=.5):
    x = np.linspace(0, 1, size)
    return start + beta.cdf(x, 2. - alpha, 2. - alpha) * (end - start)
n_x = denseboundspace()
#[0. 0.09681662 0.27092155 0.49228501 0.74944966 1.03538131
# 1.34503326 1.67445822 2.02038968 2.38001283 2.75082572 3.13054817
# 3.51705806 3.9083439 4.30246751 4.69753249 5.0916561 5.48294194
# 5.86945183 6.24917428 6.61998717 6.97961032 7.32554178 7.65496674
# 7.96461869 8.25055034 8.50771499 8.72907845 8.90318338 9. ]
plt.vlines(n_x, 0,2);
n_x = denseboundspace(size=13, start=1.2, end=7.8, alpha=1.0)
#[1.2 1.75 2.3 2.85 3.4 3.95 4.5 5.05 5.6 6.15 6.7 7.25 7.8 ]
plt.vlines(n_x, 0,2);
The spread is continuously controlled by the alpha parameter.

Obtain masked arrays through corresponding values

I have a series of observations y_obs and p, taken at times t. I calculate a model that fits the data as an integral function of both y_obs and p over dt, $\int f(y_{\rm obs}, p)\,dt$. To evaluate the model outside the defined range of observations, I create x_fit = np.linspace(min(times)-10., max(times)+10., 10000). I now want to create corresponding arrays y_obs_modified and p_modified that are zero at the x_fit times that are not in the original t (say, where the difference between the x_fit element and t is larger than 1), but keep their original values at the corresponding times (say, where the difference between the x_fit element and t is less than 1). How do I do that?
import numpy as np
import matplotlib.pyplot as plt
import scipy.integrate as it
t=np.asarray([8418, 8422, 8424, 8426, 8428, 8430, 8540, 8542, 8650, 8654, 8656, 8658, 8660, 8662, 8664, 8666, 8668, 8670, 8672, 8674, 8768, 8770, 8772, 8774, 8776, 8778, 8780, 8782, 8784, 8786, 8788, 8790, 8792, 8794, 8883, 8884, 8886, 8888, 8890, 8890, 8892, 8894, 8896, 8898, 8904])
y_obs =np.asarray([ 0.00393986,0.00522288,0.00820794,0.01102782,0.00411525,0.00297762, 0.00463183,0.00602662,0.0114886, 0.00176694,0.01241464,0.01316199, 0.01108201, 0.01056611, 0.0107585, 0.00723887,0.0082614, 0.01239229, 0.00148118,0.00407329,0.00626722,0.01026926,0.01408419,0.02638901, 0.02284189, 0.02142943, 0.02274845, 0.01315814, 0.01155898, 0.00985705, 0.00476936,0.00130343,0.00350376,0.00463576, 0.00610933, 0.00286234, 0.00845177,0.00849791,0.0151215, 0.0151215, 0.00967625,0.00802465, 0.00291534, 0.00819779,0.00366089])
y_obs_err = np.asarray([6.12189334e-05, 6.07487598e-05, 4.66365096e-05, 4.48781264e-05, 5.55250430e-05, 6.18699105e-05, 6.35339947e-05, 6.21108524e-05, 5.55636135e-05, 7.66087180e-05, 4.34256323e-05, 3.61131000e-05, 3.30783270e-05, 2.41312040e-05, 2.85080015e-05, 2.96644612e-05, 4.58662869e-05, 5.19419065e-05, 6.00479888e-05, 6.62586953e-05, 3.64830945e-05, 2.58120956e-05, 1.83249104e-05, 1.59433858e-05, 1.33375408e-05, 1.29714326e-05, 1.26025166e-05, 1.47293107e-05, 2.17933175e-05, 2.21611713e-05, 2.42946630e-05, 3.61296843e-05, 4.23009806e-05, 7.23405476e-05, 5.59390368e-05, 4.68144974e-05, 3.44773949e-05, 2.32907036e-05, 2.23491451e-05, 2.23491451e-05, 2.92956472e-05, 3.28665479e-05, 4.41214301e-05, 4.88142073e-05, 7.19116984e-05])
p= np.asarray([ 2.82890497,3.75014266,5.89347542,7.91821558,2.95484056,2.13799544, 3.32575733,4.32724456,8.2490644, 1.26870083,8.91397925,9.45059128, 7.95712563, 7.58669608, 7.72483557,5.19766853,5.93186433,8.89793105, 1.06351782,2.92471065,4.49999613,7.37354766, 10.11275281, 18.94787684, 16.40097363, 15.38679306, 16.33387783, 9.44782842, 8.29959664,7.07757293, 3.42450524,0.93588962,2.515773,3.32857547,7.180216, 2.05522399, 6.06855409,6.1016838,10.8575614,10.8575614, 6.94775991,5.76187014, 2.09327787, 5.88619335,2.62859611])
Following the OP's suggestion does not lead to the desired result:
print(y_obs_modified)  # [8418. 0. 0. ... 0. 0. 8904.]
print(y_obs_modified[y_obs_modified > 0])  # [8418. 8904.]
Use np.where and np.isin.
You need to use them like this:
y_obs_modified = np.where(np.isin(x_fit, t), x_fit, 0)
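Note that np.isin requires exact matches, so on a float x_fit grid it may return very few True values. A tolerance-based sketch (assumptions: "close" means within 1, and it reuses t, y_obs and p from the question; np.interp is used here only to carry the original values onto the x_fit grid):
import numpy as np

x_fit = np.linspace(t.min() - 10., t.max() + 10., 10000)

# True where an x_fit value lies within 1 of at least one observation time
close = (np.abs(x_fit[:, None] - t[None, :]) < 1).any(axis=1)

y_obs_modified = np.where(close, np.interp(x_fit, t, y_obs), 0.)
p_modified = np.where(close, np.interp(x_fit, t, p), 0.)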

Fitting negative binomial in python

In scipy there is no support for fitting a negative binomial distribution using data
(maybe due to the fact that the negative binomial in scipy is only discrete).
For a normal distribution I would just do:
from scipy.stats import norm
param = norm.fit(samp)
Is there a similar 'ready to use' function in any other library?
Statsmodels has discrete.discrete_model.NegativeBinomial.fit(), see here:
https://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.NegativeBinomial.fit.html#statsmodels.discrete.discrete_model.NegativeBinomial.fit
It is not only because it is discrete; a maximum likelihood fit of the negative binomial can also be quite involved, especially with an additional location parameter. That is probably why a .fit() method is not provided for it (and for the other discrete distributions in SciPy). Here is an example:
In [163]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as ss
import scipy.optimize as so
In [164]:
#define a likelihood function
def likelihood_f(P, x, neg=1):
    n = np.round(P[0])    #by definition, it should be an integer
    p = P[1]
    loc = np.round(P[2])
    return neg*(np.log(ss.nbinom.pmf(x, n, p, loc))).sum()
In [165]:
#generate a random variable
X=ss.nbinom.rvs(n=100, p=0.4, loc=0, size=1000)
In [166]:
#The likelihood
likelihood_f([100,0.4,0], X)
Out[166]:
-4400.3696690513316
In [167]:
#A simple fit, the fit is not good and the parameter estimate is way off
result=so.fmin(likelihood_f, [50, 1, 1], args=(X,-1), full_output=True, disp=False)
P1=result[0]
(result[1], result[0])
Out[167]:
(4418.599495886474, array([ 59.61196161, 0.28650831, 1.15141838]))
In [168]:
#Try a different set of start parameters; the fit is still not good and the parameter estimate is still way off
result=so.fmin(likelihood_f, [50, 0.5, 0], args=(X,-1), full_output=True, disp=False)
P1=result[0]
(result[1], result[0])
Out[168]:
(4417.1495981801972,
array([ 6.24809397e+01, 2.91877405e-01, 6.63343536e-04]))
In [169]:
#In this case we need a loop to get it right
result=[]
for i in range(40, 120): #in fact (80, 120) should probably be enough
_=so.fmin(likelihood_f, [i, 0.5, 0], args=(X,-1), full_output=True, disp=False)
result.append((_[1], _[0]))
In [170]:
#get the MLE
P2=sorted(result, key=lambda x: x[0])[0][1]
sorted(result, key=lambda x: x[0])[0]
Out[170]:
(4399.780263084549,
array([ 9.37289361e+01, 3.84587087e-01, 3.36856705e-04]))
In [171]:
#Which one is visually better?
plt.hist(X, bins=20, density=True)  # density=True replaces the removed 'normed' argument
plt.plot(range(260), ss.nbinom.pmf(range(260), np.round(P1[0]), P1[1], np.round(P1[2])), 'g-')
plt.plot(range(260), ss.nbinom.pmf(range(260), np.round(P2[0]), P2[1], np.round(P2[2])), 'r-')
Out[171]:
[<matplotlib.lines.Line2D at 0x109776c10>]
I know this thread is quite old, but current readers may want to look at this repo which is made for this purpose: https://github.com/gokceneraslan/fit_nbinom
There's also an implementation here, though part of a larger package: https://github.com/ernstlab/ChromTime/blob/master/optimize.py
I stumbled across this thread, and found an answer for anyone else wondering.
If you simply need the n, p parameterisation used by scipy.stats.nbinom you can convert the mean and variance estimates:
mu = np.mean(sample)
sigma_sqr = np.var(sample)
n = mu**2 / (sigma_sqr - mu)
p = mu / sigma_sqr
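A quick sanity check of this moment-based conversion (a sketch with a hypothetical sample; the true n and p should be recovered approximately):
import numpy as np
from scipy.stats import nbinom

sample = nbinom.rvs(n=10, p=0.3, size=100000, random_state=0)

mu = np.mean(sample)
sigma_sqr = np.var(sample)
n_est = mu**2 / (sigma_sqr - mu)   # ~10
p_est = mu / sigma_sqr             # ~0.3
print(n_est, p_est)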
If you need the dispersion parameter, you can use a negative binomial regression model from statsmodels with just an intercept term. This will find the dispersion parameter alpha using MLE.
# Data processing
import pandas as pd
import numpy as np
# Analysis models
import statsmodels.formula.api as smf
from scipy.stats import nbinom
def convert_params(mu, alpha):
    """
    Convert mean/dispersion parameterization of a negative binomial
    to the one scipy supports.

    Parameters
    ----------
    mu : float
        Mean of NB distribution.
    alpha : float
        Overdispersion parameter used for variance calculation.

    See https://en.wikipedia.org/wiki/Negative_binomial_distribution#Alternative_formulations
    """
    var = mu + alpha * mu ** 2
    p = mu / var
    r = mu ** 2 / (var - mu)
    return r, p
# Generate sample data
n = 2
p = 0.9
sample = nbinom.rvs(n=n, p=p, size=10000)
# Estimate parameters
## Mean estimates expectation parameter for negative binomial distribution
mu = np.mean(sample)
## Dispersion parameter from NB model with only an intercept term
nbfit = smf.negativebinomial("nbdata ~ 1", data=pd.DataFrame({"nbdata": sample})).fit()
alpha = nbfit.params[1] # Dispersion parameter
# Convert parameters to n, p parameterization
n_est, p_est = convert_params(mu, alpha)
# Check that estimates are close to the true values:
print("""
{:<3} {:<3}
True parameters: {:<3} {:<3}
Estimates : {:<3} {:<3}""".format('n', 'p', n, p,
np.round(n_est, 2), np.round(p_est, 2)))
