How to get the mode of distribution in scipy.stats - python

The scipy.stats library has functions to find the mean and median of a fitted distribution, but not the mode.
If I have the parameters of a distribution after fitting to data, how can I find the mode of the fitted distribution?

If I understand you correctly, you want to find the mode of a fitted distribution rather than the mode of the data itself. Basically, we can do it in the following 3 steps.
Step 1: generate a dataset from a distribution
from scipy import stats
from scipy.optimize import minimize
# generate normal data with mean 0 and standard deviation 1
data = stats.norm.rvs(loc=0, scale=1, size=100)
data[0:5]
Output:
array([1.76405235, 0.40015721, 0.97873798, 2.2408932 , 1.86755799])
Step 2: fit the parameters
# fit the parameters of norm distribution
params = stats.norm.fit(data)
params
Output:
(0.059808015534485, 1.0078822447165796)
Note that there are 2 parameters for stats.norm, i.e. loc and scale. Different distributions in scipy.stats have different parameters. It's convenient to store the parameters in a tuple and then unpack them in the next step.
Step 3: get the mode (the maximum of the density function) of the fitted distribution
# continuous case
def your_density(x):
    return -stats.norm.pdf(x, *params)

minimize(your_density, 0).x
Output:
0.05980794
Note that a normal distribution's mode equals its mean, so it is expected that the result matches the fitted loc parameter in this example.
One more thing: scipy treats continuous and discrete distributions differently (they have different parent classes). You can do the same thing for a discrete distribution with the following code.
## discrete dist, example for poisson
import numpy as np
x = np.arange(0, 100)  # the range of x should be specified
x[stats.poisson.pmf(x, mu=2).argmax()]  # find the x value that maximizes the pmf
Out:
1
You can try it with your own data and distributions!
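For example (a minimal sketch along the same lines, not part of the original answer), running the same three steps on a skewed distribution such as the gamma shows a mode that clearly differs from the mean:
# fit a gamma distribution and locate its mode; for shape a=2 and scale=1
# the true mode is (a-1)*scale = 1 while the mean is a*scale = 2
gamma_data = stats.gamma.rvs(a=2, scale=1, size=1000)
gamma_params = stats.gamma.fit(gamma_data)  # returns (a, loc, scale)
def gamma_density(x):
    return -stats.gamma.pdf(x, *gamma_params)
minimize(gamma_density, 1).x  # close to 1, well below the sample mean of about 2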

Related

scipy.stats cdf greater than 1

I'm using scipy.stats and I need the CDF up to a given value x for some distributions. I know PDFs can be greater than 1 because they are densities, not probabilities: they only need to integrate to 1, even though individual values may exceed 1. But a CDF should never be greater than 1, and when running the cdf function in scipy.stats I sometimes get values like 2.89. I'm completely sure I'm using cdf and not pdf (that was my first guess). This is messing up my results and algorithm because I need accumulated probabilities. Why is scipy.stats cdf returning values greater than 1, and/or how should I proceed to fix it?
Code for reproducing the issue with a sample distribution and parameters (but it happens with others too):
from scipy import stats
distribution = stats.gausshyper
params = [9.482986347673158, 16.65813644507513, -38.11083665959626, 16.08698932118982, -13.387170754433273, 18.352117022674125]
test_val = [-0.512720,1,1]
arg = params[:-2]
loc = params[-2]
scale = params[-1]
print("cdf:",distribution.cdf(test_val,*arg, loc=loc,scale=scale))
print("pdf:",distribution.pdf(test_val,*arg, loc=loc,scale=scale))
cdf: [2.68047481 7.2027761 7.2027761 ]
pdf: [2.76857133 2.23996739 2.23996739]
The problem lies in the parameters that you have specified for the Gaussian hypergeometric (HG) distribution, specifically in the third element of params, which is the parameter beta in the HG distribution (see equation 2 in this paper for the definition of the density of the Gauss hypergeometric distribution). This parameter has to be positive for HG to have a valid density. Otherwise, the density won't integrate to 1, which is exactly what is happening in your example. With a negative beta, the distribution is not a valid probability distribution.
You can also find the requirement that beta (denoted as b) has to be positive in the scipy documentation here.
Changing beta to a positive parameter immediately solves your problem:
from scipy import stats
distribution = stats.gausshyper
params = [9.482986347673158, 16.65813644507513, 38.11083665959626, 16.08698932118982, -13.387170754433273, 18.352117022674125]
test_val = [-0.512720,1,1]
arg = params[:-2]
loc = params[-2]
scale = params[-1]
print("cdf:",distribution.cdf(test_val,*arg, loc=loc,scale=scale))
print("pdf:",distribution.pdf(test_val,*arg, loc=loc,scale=scale))
Output:
cdf: [1. 1. 1.]
pdf: [3.83898392e-32 1.25685346e-35 1.25685346e-35]
where all CDF values are now bounded by 1, as desired. Also note that your x has to be between 0 and 1, as described in the scipy documentation here.
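As a quick sanity check (a sketch with arbitrary positive shape values chosen for illustration, not the values from the question), a valid parameter set gives a CDF that never exceeds 1 and reaches exactly 1 at the upper end of the support:
from scipy import stats
a, b, c, z = 13.8, 3.12, 2.51, 5.18  # illustrative positive shape parameters
print(stats.gausshyper.cdf(0.5, a, b, c, z))  # strictly between 0 and 1
print(stats.gausshyper.cdf(1.0, a, b, c, z))  # 1.0 at the upper bound of the support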

How to sample from a custom distribution when parameters are known?

The goal is to draw samples from a distribution whose parameters are known.
For example, the self-defined distribution is p(X|theta), where theta is the parameter vector of K dimensions and X is the random vector of N dimensions.
Now: (1) theta is known; (2) p(X|theta) is NOT known in closed form, but I know p(X|theta) ∝ f(X, theta), where f is a known function.
Can pymc3 do such sampling from p(X|theta), and how?
The purpose is not to sample from the posterior distribution of the parameters, but to sample from a self-defined distribution.
Starting from a simple example of sampling from a Bernoulli distribution. I did the following:
import pymc3 as pm
import numpy as np
import scipy.stats as stats
import pandas as pd
import theano.tensor as tt
with pm.Model() as model1:
    p = 0.3
    density = pm.DensityDist('density',
        lambda x1: tt.switch(x1, tt.log(p), tt.log(1 - p)),
    )  # tt.switch(x1, tt.log(p), tt.log(1 - p)) is the log likelihood from the pymc3 source code

with model1:
    step = pm.Metropolis()
    samples = pm.sample(1000, step=step)
I expect the result to be 1000 binary digits, with the proportion of 1s being about 0.3. However, I got strange results where very large numbers occur in the output.
I know something is wrong. Please help with how to correctly write pymc3 code for such non-posterior MCMC sampling problems.
Prior predictive sampling (for which you should be using pm.sample_prior_predictive()) involves only using the RNGs provided by the RandomVariable objects in your compute graph. By default, DensityDist does not implement a RNG, but does provide the random parameter for this purpose, so you'll need to use that. The log-likelihood is only evaluated with respect to observables, so it plays no role here.
A simple way to generate a valid RNG for an arbitrary distribution is to use inverse transform sampling. In this case, one samples a uniform distribution on the unit interval and then transforms it through the inverse CDF of the desired distribution. For the Bernoulli case, the inverse CDF partitions the unit line based on the probability of success, assigning 0 to one part and 1 to the other.
Here is a factory-like implementation that creates a Bernoulli RNG compatible with pm.DensityDist's random parameter (i.e., accepts point and size kwargs).
def get_bernoulli_rng(p=0.5):
    def _rng(point=None, size=1):
        # Bernoulli inverse CDF, given p (prob of success)
        _icdf = lambda q: np.uint8(q < p)
        return _icdf(pm.Uniform.dist().random(point=point, size=size))
    return _rng
So, to fill out the example, it would go something like
with pm.Model() as m:
    p = 0.3
    y = pm.DensityDist('y', lambda x: tt.switch(x, tt.log(p), tt.log(1-p)),
                       random=get_bernoulli_rng(p))
    prior = pm.sample_prior_predictive(random_seed=2019)

prior['y'].mean()  # 0.306
Obviously, this could equally be done with random=pm.Bernoulli.dist(p).random, but the above illustrates generically how one could do this with arbitrary distributions, given their inverse CDF, i.e., you only need to modify _icdf and the parameters.
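As an illustration of that point (a hypothetical sketch, not from the original answer), here is the same factory pattern for a continuous distribution: the exponential inverse CDF is -ln(1 - q)/lam, so only _icdf and the parameter change.
import numpy as np

def get_exponential_rng(lam=1.0):
    def _rng(point=None, size=1):
        # exponential inverse CDF applied to uniform draws on [0, 1)
        q = np.random.uniform(size=size)
        return -np.log(1.0 - q) / lam
    return _rng

draws = get_exponential_rng(lam=2.0)(size=10000)
print(draws.mean())  # should be close to 1 / lam = 0.5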

How to implement a KS-Test in Python

scipy.stats.kstest(rvs, cdf, N) can perform a KS test on a dataset rvs. It tests whether the dataset follows a probability distribution whose CDF is specified in the parameters of this method.
Consider now a dataset of N=4800 samples. I have performed a KDE on this data and, therefore, have an estimated PDF. This PDF looks an awful lot like a bimodal distribution. When plotting the estimated PDF and curve_fitting a bimodal distribution to it, these two plots are pretty much identical. The parameters of the fitted bimodal distribution are (scale1, mean1, stdv1, scale2, mean2, stdv2):
[0.6, 0.036, 0.52, 0.23, 1.25, 0.4]
How can I apply scipy.stats.kstest to test if my estimated PDF is bimodal distributed?
As my null hypothesis, I state that the estimated PDF equals the following PDF:
hypoDist = 0.6*norm(loc=0, scale=0.2).pdf(x_grid) + 0.3*norm(loc=1, scale=0.2).pdf(x_grid)
hypoCdf = np.cumsum(hypoDist)/len(x_grid)
x_grid is just a vector that contains the x-values at which I evaluate my estimated PDF. So each entry of pdf has a corresponding value of x_grid. It might be that my computation of hypoCdf is incorrect. Maybe instead of dividing by len(x_grid), should I divide by np.sum(hypoDist) ?
Challenge: cdf parameter of kstest cannot be specified as bimodal. Neither can I specify it to be hypoDist.
If I wanted to test whether my dataset was Gaussian distributed, I would write:
KS_result = kstest(measurementError, norm(loc=mean(pdf), scale=np.std(pdf)).cdf)
print(KS_result)
measurementError is the dataset that I have performed the KDE on. This returns:
statistic=0.459, pvalue=0.0
To me, it is a little irritating that the pvalue is 0.0
The cdf argument to kstest can be a callable that implements the cumulative distribution function of the distribution against which you want to test your data. To use it, you have to implement the CDF of your bimodal distribution. You want the distribution to be a mixture of two normal distributions. You can implement the CDF for this distribution by computing the weighted sum of the CDFs of the two normal distributions that make up the mixture.
Here's a script that shows how you can do this. To demonstrate how kstest is used, the script runs kstest twice. First it uses a sample that is not from the distribution. As expected, kstest computes a very small p-value for this first sample. It then generates a sample that is drawn from the mixture. For this sample, the p-value is not small.
import numpy as np
from scipy import stats
def bimodal_cdf(x, weight1, mean1, stdv1, mean2, stdv2):
    """
    CDF of a mixture of two normal distributions.
    """
    return (weight1*stats.norm.cdf(x, mean1, stdv1) +
            (1 - weight1)*stats.norm.cdf(x, mean2, stdv2))
# We only need weight1, since weight2 = 1 - weight1.
weight1 = 0.6
mean1 = 0.036
stdv1 = 0.52
mean2 = 1.25
stdv2 = 0.4
n = 200
# Create a sample from a regular normal distribution that has parameters
# similar to the bimodal distribution.
sample1 = stats.norm.rvs(0.5*(mean1 + mean2), 0.5, size=n)
# The result of kstest should show that sample1 is not from the bimodal
# distribution (i.e. the p-value should be very small).
stat1, pvalue1 = stats.kstest(sample1, cdf=bimodal_cdf,
                              args=(weight1, mean1, stdv1, mean2, stdv2))
print("sample1 p-value =", pvalue1)
# Create a sample from the bimodal distribution. This sample is the
# concatenation of samples from the two normal distributions that make
# up the bimodal distribution. The number of samples to take from the
# first distribution is determined by a binomial distribution of n
# samples with probability weight1.
n1 = np.random.binomial(n, p=weight1)
sample2 = np.concatenate((stats.norm.rvs(mean1, stdv1, size=n1),
                          stats.norm.rvs(mean2, stdv2, size=n - n1)))
# Most of the time, the p-value returned by kstest with sample2 will not
# be small. Under the null hypothesis the p-value is uniformly distributed on
# [0, 1], so in general it will not be very small.
stat2, pvalue2 = stats.kstest(sample2, cdf=bimodal_cdf,
                              args=(weight1, mean1, stdv1, mean2, stdv2))
print("sample2 p-value =", pvalue2)
Typical output (the numbers will be different each time the script is run):
sample1 p-value = 2.8395166853884146e-11
sample2 p-value = 0.3289374831186403
You might find that, for your problem, this test does not work well. You have 4800 samples, but in your code you have parameters whose numerical values have just one or two significant digits. Unless you have good reason to believe that your sample is drawn from a distribution with exactly those parameters, it is likely that kstest will return a very small p-value.
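If you still want to run the test on your own data (a sketch, assuming measurementError is the raw sample the KDE was built from, not the KDE itself), you can normalize your two fitted scales into a single mixture weight and pass your parameters through args:
# the question's fitted parameters (scale1, mean1, stdv1, scale2, mean2, stdv2)
scale1, mean1_q, stdv1_q = 0.6, 0.036, 0.52
scale2, mean2_q, stdv2_q = 0.23, 1.25, 0.4
weight1_q = scale1 / (scale1 + scale2)  # the two scales do not sum to 1
stat, pvalue = stats.kstest(measurementError, cdf=bimodal_cdf,
                            args=(weight1_q, mean1_q, stdv1_q, mean2_q, stdv2_q))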

Curve_fit fails on Exponentiated Weibull distribution

I am trying to use
scipy.optimize.curve_fit(func, xdata, ydata)
to determine the parameters of the exponentiated Weibull distribution:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.optimize import curve_fit

# define the exponentiated Weibull density (note the exponent is (x/lamda)**k)
def expweib(x, k, lamda, alpha):
    return alpha*(k/lamda)*((x/lamda)**(k-1))*((1 - np.exp(-(x/lamda)**k))**(alpha-1))*np.exp(-(x/lamda)**k)

# first, generate a random sample from the exponentiated Weibull distribution using stats.exponweib.rvs
data = stats.exponweib.rvs(a=1, c=82.243021128368554, loc=0, scale=989.7422, size=1000)

# then use the sample data to draw a (normalized) histogram
entries_Test, bin_edges_Test, patches_Test = plt.hist(data, bins=50, range=[909.5, 1010.5], density=True)

# calculate the bin middles of the histogram
bin_middles_Test = 0.5*(bin_edges_Test[1:] + bin_edges_Test[:-1])

# use bin_middles_Test as xdata, entries_Test as ydata, the previously defined expweib as func, and call curve_fit:
params, pcov = curve_fit(expweib, bin_middles_Test, entries_Test)
Then the following warning occurs:
OptimizeWarning: Covariance of the parameters could not be estimated
I cannot identify which step has the issue, could anyone help?
Thank you
Reading through the documentation for the curve_fit method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html), the description of the method argument mentions that the default 'lm' method won't work if the number of observations is less than the number of variables; in that case you should use either the 'trf' or 'dogbox' method.
Also, reading about pcov in the Returns section, it is mentioned that the entries will be inf if the Jacobian matrix at the solution does not have full rank.
I tried your code with both 'trf' and 'dogbox' and got a pcov array full of zeros.
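Two things that often help in this situation (a sketch, not part of the original answer, reusing data, expweib, bin_middles_Test and entries_Test from the question): give curve_fit a starting guess p0 near the data-generating values, or skip the histogram step entirely and fit the distribution directly by maximum likelihood.
# starting guesses for (k, lamda, alpha) roughly matching the generating parameters
params, pcov = curve_fit(expweib, bin_middles_Test, entries_Test,
                         p0=[80, 990, 1], method='trf')

# alternative: fit the exponentiated Weibull directly to the raw sample
a_hat, c_hat, loc_hat, scale_hat = stats.exponweib.fit(data, floc=0)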

statsmodels - plotting the fitted distribution

The following code fits an oversimplified generalized linear model using statsmodels:
model = smf.glm('Y ~ 1', family=sm.families.NegativeBinomial(), data=df)
results = model.fit()
This gives the coefficient and a stderr:
              coef   stderr
Intercept   2.9471    0.120
Now I want to graphically compare the real distribution of the variable Y (histogram) with the distribution that comes from the model.
But I need two parameters r and p to evaluate the stats.nbinom(r,p) and plot it.
Is there a way to retrieve the parameters from the results of the fitting?
How can I plot the PMF?
Generalized linear models (GLM) in statsmodels currently do not estimate the extra parameter of the Negative Binomial distribution; the Negative Binomial belongs to the exponential family of distributions only for a fixed shape parameter.
However, statsmodels also has Negative Binomial as a Maximum Likelihood Model in discrete_model which estimates all parameters.
The parameterization of the Negative Binomial for count regression is in terms of the mean or expected value, which is different from the parameterization in scipy.stats.nbinom. Actually, there are two commonly used parameterizations for Negative Binomial count regression, usually called nb1 and nb2.
Here is a quickly written script that recovers the scipy.stats.nbinom parameters, n=size and p=prob, from the estimated parameters. Once you have the parameters for the scipy.stats distribution, you can use all the available methods: rvs, pmf, and so on.
Something like this should be made available in statsmodels.
In a few example runs, I got results like this
data generating parameters 50 0.25
estimated params 51.7167511571 0.256814610633
estimated params 50.0985814878 0.249989725917
As an aside, because of the underlying exponential reparameterization, the scipy optimizers sometimes have problems converging. In those cases, either providing better starting values or using Nelder-Mead as the optimization method usually helps.
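For example (an illustrative variant, not in the original script, using the y, x and loglike_method defined in the script below), switching the optimizer to Nelder-Mead looks like this:
res = sm.NegativeBinomial(y, x, loglike_method=loglike_method).fit(
    start_params=[0.1, 0.1], method='nm', maxiter=2000)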
import numpy as np
from scipy import stats
import statsmodels.api as sm

# generate some data to check
nobs = 1000
n, p = 50, 0.25
dist0 = stats.nbinom(n, p)
y = dist0.rvs(size=nobs)
x = np.ones(nobs)

loglike_method = 'nb1'  # or use 'nb2'
res = sm.NegativeBinomial(y, x, loglike_method=loglike_method).fit(start_params=[0.1, 0.1])

print(dist0.mean())
print(res.params)

mu = res.predict()            # use this for the mean if it is not constant
mu = np.exp(res.params[0])    # shortcut, we just regress on a constant
alpha = res.params[1]

if loglike_method == 'nb1':
    Q = 1
elif loglike_method == 'nb2':
    Q = 0

size = 1. / alpha * mu**Q
prob = size / (size + mu)

print('data generating parameters', n, p)
print('estimated params          ', size, prob)

# estimated distribution
dist_est = stats.nbinom(size, prob)
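To answer the plotting part of the question, one possible sketch (assuming matplotlib is available) is to overlay the PMF of dist_est on a normalized histogram of y:
import matplotlib.pyplot as plt

k = np.arange(y.min(), y.max() + 1)
plt.hist(y, bins=k, density=True, alpha=0.5, label='data')
plt.plot(k, dist_est.pmf(k), 'r.-', label='fitted nbinom pmf')
plt.legend()
plt.show()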
BTW: I ran into this before but didn't have time to look at it
https://github.com/statsmodels/statsmodels/issues/106
