scipy.stats cdf greater than 1 - python

I'm using scipy.stats and I need the CDF up to a given value x for some distributions. I know PDFs can be greater than 1 because they are densities, not probabilities: they only need to integrate to 1, even if individual values exceed it. A CDF, however, should never be greater than 1, yet when I run the cdf function in scipy.stats I sometimes get values like 2.89. I'm completely sure I'm using cdf and not pdf (that was my first guess). This is breaking my results and my algorithm because I need accumulated probabilities. Why is the scipy.stats cdf returning values greater than 1, and how should I proceed to fix it?
Code for reproducing the issue with a sample distribution and parameters (but it happens with others too):
from scipy import stats
distribution = stats.gausshyper
params = [9.482986347673158, 16.65813644507513, -38.11083665959626, 16.08698932118982, -13.387170754433273, 18.352117022674125]
test_val = [-0.512720, 1, 1]
arg = params[:-2]    # shape parameters a, b, c, z
loc = params[-2]
scale = params[-1]
print("cdf:", distribution.cdf(test_val, *arg, loc=loc, scale=scale))
print("pdf:", distribution.pdf(test_val, *arg, loc=loc, scale=scale))
cdf: [2.68047481 7.2027761 7.2027761 ]
pdf: [2.76857133 2.23996739 2.23996739]

The problem lies in the parameters that you have specified for the Gauss hypergeometric (HG) distribution, specifically in the third element of params, which is the parameter beta in the HG distribution (see equation 2 in this paper for the definition of the density of the Gauss hypergeometric distribution). This parameter has to be positive for HG to have a valid density. Otherwise, the density won't integrate to 1, which is exactly what is happening in your example: with a negative beta, the distribution is not a valid probability distribution, so its "CDF" is not bounded by 1.
You can also find the requirement that beta (denoted as b) has to be positive in the scipy documentation here.
Changing beta to a positive parameter immediately solves your problem:
from scipy import stats
distribution = stats.gausshyper
params = [9.482986347673158, 16.65813644507513, 38.11083665959626, 16.08698932118982, -13.387170754433273, 18.352117022674125]
test_val = [-0.512720, 1, 1]
arg = params[:-2]
loc = params[-2]
scale = params[-1]
print("cdf:", distribution.cdf(test_val, *arg, loc=loc, scale=scale))
print("pdf:", distribution.pdf(test_val, *arg, loc=loc, scale=scale))
Output:
cdf: [1. 1. 1.]
pdf: [3.83898392e-32 1.25685346e-35 1.25685346e-35]
where all CDF values now lie within [0, 1], as desired. Also note that your x has to be between 0 and 1 (in standardized form, i.e. after the loc/scale transformation), as described in the scipy documentation here.
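If your algorithm needs to guard against this class of bug, one cheap sanity check is to probe the would-be CDF on a grid over the support and verify that it stays within [0, 1] and is non-decreasing. A minimal sketch (looks_like_valid_cdf is a made-up helper, and the grid deliberately avoids the endpoints, where the invalid density blows up):
import numpy as np
from scipy import stats

def looks_like_valid_cdf(dist, shapes, loc, scale, n=101, tol=1e-6):
    # probe the interior of gausshyper's support, [loc, loc + scale]
    grid = loc + scale * np.linspace(0.01, 0.99, n)
    vals = dist.cdf(grid, *shapes, loc=loc, scale=scale)
    in_range = np.all((vals >= -tol) & (vals <= 1 + tol))
    monotone = np.all(np.diff(vals) >= -tol)
    return bool(in_range and monotone)

bad_shapes = [9.482986347673158, 16.65813644507513, -38.11083665959626, 16.08698932118982]
print(looks_like_valid_cdf(stats.gausshyper, bad_shapes,
                           -13.387170754433273, 18.352117022674125))  # False: the "cdf" exceeds 1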


How to get the mode of distribution in scipy.stats

The scipy.stats library has functions to find the mean and median of a fitted distribution but not the mode.
If I have the parameters of a distribution after fitting to data, how can I find the mode of the fitted distribution?
If I understand you correctly, you want to find the mode of fitted distributions rather than the mode of the data itself. Basically, we can do it with the following three steps.
Step 1: generate a dataset from a distribution
import numpy as np
from scipy import stats
from scipy.optimize import minimize
np.random.seed(0)  # seed used to produce the output shown below
# generate normal data with mean 0 and variance 1
data = stats.norm.rvs(loc=0, scale=1, size=100)
data[0:5]
Output:
array([1.76405235, 0.40015721, 0.97873798, 2.2408932 , 1.86755799])
Step 2: fit the parameters
# fit the parameters of norm distribution
params = stats.norm.fit(data)
params
Output:
(0.059808015534485, 1.0078822447165796)
Note that there are two parameters for stats.norm, i.e. loc and scale. Different distributions in scipy.stats take different parameters, so it's convenient to store them in a tuple and unpack it in the next step.
Step 3: get the mode (the maximum of the density function) of the fitted distribution
# continuous case: the mode maximizes the pdf, so minimize the negative pdf
def your_density(x):
    return -stats.norm.pdf(x, *params)

minimize(your_density, 0).x
Output:
0.05980794
Note that for a normal distribution the mode equals the mean, so their agreement here is a property of this particular example, not a general rule.
One more thing: scipy treats continuous and discrete distributions differently (they have different parent classes). You can do the same thing on discrete distributions with the following code.
## discrete case, example for poisson
x = np.arange(0, 100)  # the range of x should be specified
x[stats.poisson.pmf(x, mu=2).argmax()]  # find the x value that maximizes the pmf
Out:
1
You can try it with your own data and distributions!
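To see the recipe working where the mode genuinely differs from the mean, here is the same three-step procedure applied to a gamma distribution (a sketch with made-up parameters; the gamma's analytical mode, (a - 1) * scale, provides a cross-check):
import numpy as np
from scipy import stats
from scipy.optimize import minimize

# steps 1 and 2: generate gamma data and fit the parameters
data = stats.gamma.rvs(a=2.0, scale=1.5, size=1000, random_state=0)
params = stats.gamma.fit(data)

# step 3: the mode maximizes the pdf, so minimize the negative pdf
neg_pdf = lambda x: -stats.gamma.pdf(x, *params)
mode = minimize(neg_pdf, x0=np.median(data)).x[0]
print(mode)  # should land near the analytical mode (2.0 - 1) * 1.5 = 1.5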

How to sample from a custom distribution when parameters are known?

The target is to get samples from a distribution whose parameters are known.
For example, the self-defined distribution is p(X|theta), where theta is the parameter vector of K dimensions and X is the random vector of N dimensions.
Now we know that (1) theta is known; (2) p(X|theta) is NOT known, but I know p(X|theta) ∝ f(X,theta), and f is a known function.
Can pymc3 do such sampling from p(X|theta), and how?
The purpose is not to sample from the posterior distribution of the parameters, but to sample from a self-defined distribution.
Starting from a simple example of sampling from a Bernoulli distribution. I did the following:
import pymc3 as pm
import numpy as np
import scipy.stats as stats
import pandas as pd
import theano.tensor as tt

with pm.Model() as model1:
    p = 0.3
    density = pm.DensityDist(
        'density',
        lambda x1: tt.switch(x1, tt.log(p), tt.log(1 - p)),
    )  # tt.switch(x1, tt.log(p), tt.log(1 - p)) is the log likelihood from the pymc3 source code

with model1:
    step = pm.Metropolis()
    samples = pm.sample(1000, step=step)
I expect the result to be 1000 binary digits, with the proportion of 1s close to 0.3. However, I got strange results where very large numbers occur in the output.
I know something is wrong. Please help me understand how to correctly write pymc3 code for such non-posterior MCMC sampling questions.
Prior predictive sampling (for which you should be using pm.sample_prior_predictive()) involves only using the RNGs provided by the RandomVariable objects in your compute graph. By default, DensityDist does not implement a RNG, but does provide the random parameter for this purpose, so you'll need to use that. The log-likelihood is only evaluated with respect to observables, so it plays no role here.
A simple way to generate a valid RNG for an arbitrary distribution is to use inverse transform sampling. In this case, one samples a uniform distribution on the unit interval and then transforms it through the inverse CDF of the desired function. For the Bernoulli case, the inverse CDF partitions the unit line based on the probability of success, assigning 0 to one part and 1 to the other.
Here is a factory-like implementation that creates a Bernoulli RNG compatible with pm.DensityDist's random parameter (i.e., accepts point and size kwargs).
def get_bernoulli_rng(p=0.5):
    def _rng(point=None, size=1):
        # Bernoulli inverse CDF, given p (prob of success)
        _icdf = lambda q: np.uint8(q < p)
        return _icdf(pm.Uniform.dist().random(point=point, size=size))
    return _rng
So, to fill out the example, it would go something like
with pm.Model() as m:
    p = 0.3
    y = pm.DensityDist('y', lambda x: tt.switch(x, tt.log(p), tt.log(1 - p)),
                       random=get_bernoulli_rng(p))
    prior = pm.sample_prior_predictive(random_seed=2019)

prior['y'].mean()  # 0.306
Obviously, this could equally be done with random=pm.Bernoulli.dist(p).random, but the above illustrates generically how one could do this with arbitrary distributions, given their inverse CDF, i.e., you only need to modify _icdf and the parameters.
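The inverse-transform idea is not tied to pymc3 at all. Here is a plain-numpy sketch of the same technique for an arbitrary continuous distribution, using the standard exponential inverse CDF as the example (the helper name is mine):
import numpy as np

def inverse_transform_sample(icdf, size, seed=None):
    # push uniform draws through the inverse CDF of the target distribution
    rng = np.random.default_rng(seed)
    return icdf(rng.uniform(size=size))

# exponential with rate lam: the inverse CDF is -log(1 - u) / lam
lam = 2.0
samples = inverse_transform_sample(lambda u: -np.log(1 - u) / lam, size=10_000)
print(samples.mean())  # should be close to 1 / lam = 0.5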

Multivariate normal CDF in Python

I am looking for a function to compute the CDF for a multivariate normal distribution. I have found that scipy.stats.multivariate_normal has only a method to compute the PDF (for a sample x), multivariate_normal.pdf(x, mean=mean, cov=cov), but not the CDF.
I am looking for something like multivariate_normal.cdf(x, mean=mean, cov=cov), but unfortunately multivariate_normal doesn't have a cdf method.
The only thing that I found is this: Multivariate Normal CDF in Python using scipy
but the presented method scipy.stats.mvn.mvnun(lower, upper, means, covar) doesn't take a sample x as a parameter, so I don't really see how to use it to have something similar to what I said above.
This is just a clarification of the points that @sascha made above in the comments: the relevant function is mvnormcdf from statsmodels.sandbox.distributions.extras.
As an example, in a multivariate normal distribution with diagonal covariance, the CDF evaluated at the mean should give (1/4) * total area = 0.25, since each component independently falls below its mean with probability 1/2. The following example will allow you to play with it:
import numpy as np
from statsmodels.sandbox.distributions.extras import mvnormcdf

for i in range(1, 20, 2):
    cov_example = np.array(((i, 0), (0, i)))
    mean_example = np.array((0, 0))
    upper = mean_example  # evaluate the CDF at the mean
    print(mvnormcdf(upper=upper, mu=mean_example, cov=cov_example))
The output of this is 0.25, 0.25, 0.25, 0.25...
The CDF of some distribution is actually an integral over the PDF of that distribution. That being so, you need to provide the function with the boundaries of the integral.
What most people mean when they ask for a p_value of some point in relation to some distribution is:
what is the chance of getting these values or higher given this distribution?
This is not the density at a single point, but rather an integral from some point onwards (the upper tail area).
Accordingly, you need to set your point as the lower boundary, +inf (or some arbitrarily high enough value) as the upper boundary and provide the means and covariance matrix you already have:
from sys import maxsize
import numpy as np
from scipy.stats import mvn

def mvn_p_value(x, mu, cov_matrix):
    upper_bounds = np.array([maxsize] * x.size)  # make an upper bound the size of your vector
    # mvnun returns (value, inform), so take the first element
    p_value = mvn.mvnun(x, upper_bounds, mu, cov_matrix)[0]
    if 0.5 < p_value:  # this inversion is used for two-sided statistical testing
        p_value = 1 - p_value
    return p_value
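As an aside, newer SciPy releases (1.0 and later, if I remember the version correctly) expose a cdf method on multivariate_normal directly, so the mvnun workaround may not be needed at all:
import numpy as np
from scipy.stats import multivariate_normal

mean = np.zeros(2)
cov = np.eye(2)

# CDF at the mean of a centered bivariate normal: ~0.25,
# matching the diagonal-covariance example above
print(multivariate_normal.cdf(mean, mean=mean, cov=cov))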

P-value from Chi sq test statistic in Python

I have computed a test statistic that is distributed as a chi square with 1 degree of freedom, and want to find out what P-value this corresponds to using python.
I'm a Python and maths/stats newbie, so I think what I want here is the probability density function for the chi2 distribution from SciPy. However, when I use this like so:
from scipy import stats
stats.chi2.pdf(3.84, 1)
0.029846
However, some googling and talking to colleagues who know maths but not Python suggests the answer should be 0.05.
Any ideas?
Cheers,
Davy
Quick refresher here:
Probability Density Function: think of it as a point value; how dense is the probability at a given point?
Cumulative Distribution Function: this is the mass of probability of the function up to a given point; what percentage of the distribution lies on one side of this point?
In your case, you took the PDF, and got the correct value for the PDF. What you actually want is 1 - CDF:
>>> 1 - stats.chi2.cdf(3.84, 1)
0.050043521248705147
To calculate the probability under the null hypothesis given a chi-squared sum and the degrees of freedom, you can also call chisqprob:
>>> from scipy.stats import chisqprob
>>> chisqprob(3.84, 1)
0.050043521248705189
Notice:
chisqprob is deprecated! stats.chisqprob is deprecated in scipy 0.17.0; use stats.distributions.chi2.sf instead
Update: as noted, chisqprob() is deprecated from scipy version 0.17.0 onwards. High-accuracy chi-square p-values can now be obtained via scipy.stats.distributions.chi2.sf(), for example:
>>> from scipy.stats.distributions import chi2
>>> chi2.sf(3.84, 1)
0.050043521248705189
>>> chi2.sf(1424, 1)
1.2799986253099803e-311
While stats.chisqprob() and 1 - stats.chi2.cdf() appear comparable for small chi-square values, the former is preferable for large chi-square values. The latter cannot provide a p-value smaller than machine epsilon, and it gives very inaccurate answers close to machine epsilon. As shown by others, the two methods agree for small chi-square values:
>>> from scipy.stats import chisqprob, chi2
>>> chisqprob(3.84, 1)
0.050043521248705189
>>> 1 - chi2.cdf(3.84, 1)
0.050043521248705147
Using 1 - chi2.cdf() breaks down here:
>>> 1 - chi2.cdf(67, 1)
2.2204460492503131e-16
>>> 1 - chi2.cdf(68, 1)
1.1102230246251565e-16
>>> 1 - chi2.cdf(69, 1)
1.1102230246251565e-16
>>> 1 - chi2.cdf(70, 1)
0.0
Whereas chisqprob() gives you accurate results for a much larger range of chi-square values, producing p-values nearly as small as the smallest float greater than zero, until it too underflows:
>>> chisqprob(67, 1)
2.7150713219425247e-16
>>> chisqprob(68, 1)
1.6349553217245471e-16
>>> chisqprob(69, 1)
9.8463440314253303e-17
>>> chisqprob(70, 1)
5.9304458500824782e-17
>>> chisqprob(500, 1)
9.505397766554137e-111
>>> chisqprob(1000, 1)
1.7958327848007363e-219
>>> chisqprob(1424, 1)
1.2799986253099803e-311
>>> chisqprob(1425, 1)
0.0
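If you need to go even further than chisqprob()'s floor, you can stay in log space. For df = 1, a chi-square variate is the square of a standard normal, so sf(x) = 2 * norm.sf(sqrt(x)), and scipy's norm.logsf remains accurate far past the point where the p-value itself underflows. A minimal sketch of that identity:
import numpy as np
from scipy.stats import norm

def log_chi2_sf_df1(x):
    # chi2 with df=1 is the square of a standard normal:
    # sf(x) = 2 * P(Z > sqrt(x)), evaluated accurately in log space
    return np.log(2) + norm.logsf(np.sqrt(x))

print(np.exp(log_chi2_sf_df1(3.84)))  # ~0.0500435, matching the examples above
print(log_chi2_sf_df1(1425))          # finite (roughly -716), although the p-value itself underflows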
You meant to do:
>>> 1 - stats.chi2.cdf(3.84, 1)
0.050043521248705147
Some of the other solutions are deprecated. Use the scipy.stats.chi2 survival function instead, which is the same as 1 - cdf(chi_statistic, df):
Example:
from scipy.stats import chi2
p_value = chi2.sf(chi_statistic, df)
If you want to understand the math, the p-value of a sample, x (fixed), is
P[P(X) <= P(x)] = P[m(X) >= m(x)] = 1 - G(m(x)^2)
where,
P is the probability density of a (say k-variate) normal distribution with known covariance (cov) and mean,
X is a random variable from that normal distribution,
m(x) is the Mahalanobis distance = sqrt(<cov^{-1}(x - mean), x - mean>). Note that in 1-d this is just the absolute value of the z-score.
G is the CDF of the chi^2 distribution with k degrees of freedom.
So if you're computing the p-value of a fixed observation, x, then you compute m(x) (generalized z-score), and 1-G(m(x)^2).
For example, it's well known that if x is sampled from a univariate (k = 1) normal distribution and has z-score = 2 (it's 2 standard deviations from the mean), then the p-value is about 0.046 (see a z-score table):
In [7]: from scipy.stats import chi2
In [8]: k = 1
In [9]: z = 2
In [10]: 1-chi2.cdf(z**2, k)
Out[10]: 0.045500263896358528
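For k > 1 the same computation goes through the Mahalanobis distance. A small sketch with a made-up mean, covariance and observation, using scipy.spatial.distance.mahalanobis for the generalized z-score:
import numpy as np
from scipy.stats import chi2
from scipy.spatial.distance import mahalanobis

mean = np.array([0.0, 0.0])
cov = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([1.5, -0.5])

m = mahalanobis(x, mean, np.linalg.inv(cov))  # generalized z-score m(x)
p_value = 1 - chi2.cdf(m**2, df=2)            # equivalently chi2.sf(m**2, 2)
print(p_value)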
For ultra-high precision, when scipy's chi2.sf() isn't enough, bring out the big guns:
>>> import numpy as np
>>> from rpy2.robjects import r
>>> np.exp(np.longdouble(r.pchisq(19000, 2, lower_tail=False, log_p=True)[0]))
1.5937563168532229629e-4126
Update by another user (WestCoastProjects): when using the values from the OP we get:
np.exp(np.longdouble(r.pchisq(3.84,1, lower_tail=False, log_p=True)[0]))
Out[5]: 0.050043521248705198928
So there's that 0.05 you were looking for.

How do I get a lognormal distribution in Python with Mu and Sigma?

I have been trying to get the result of a lognormal distribution using SciPy. I already have the mu and sigma, so I don't need to do any other prep work. If I need to be more specific (and I am trying to be, with my limited knowledge of stats), I would say that I am looking for the cumulative function (cdf under SciPy). The problem is that I can't figure out how to do this with just the mean and standard deviation, on a scale of 0-1 (i.e. the answer returned should be something from 0-1). I'm also not sure which method from dist I should be using to get the answer. I've tried reading the documentation and looking through SO, but the relevant questions (like this and this) didn't seem to provide the answers I was looking for.
Here is a code sample of what I am working with. Thanks.
from scipy.stats import lognorm
stddev = 0.859455801705594
mean = 0.418749176686875
total = 37
dist = lognorm.cdf(total,mean,stddev)
UPDATE:
So after a bit of work and a little research, I got a little further. But I'm still getting the wrong answer. The new code is below. According to R and Excel, the result should be .7434, but that's clearly not what is happening. Is there a logic flaw I am missing?
dist = lognorm([1.744],loc=2.0785)
dist.cdf(25) # yields=0.96374596, expected=0.7434
UPDATE 2:
Working lognorm implementation which yields the correct 0.7434 result.
import math

def lognorm(x, mu=0, sigma=1):
    a = (math.log(x) - mu) / math.sqrt(2 * sigma**2)
    p = 0.5 + 0.5 * math.erf(a)
    return p

lognorm(25, 2.0785, 1.744)
> 0.7434
I know this is a bit late (almost one year!) but I've been doing some research on the lognorm function in scipy.stats. A lot of folks seem confused about the input parameters, so I hope to help them out. The example above is almost correct, but I found it strange to set the mean as the location ("loc") parameter: that shifts the distribution so the cdf and pdf don't 'take off' until the value exceeds the mean. Also, the scale and shape arguments should be exp(mu) and sigma respectively, where mu and sigma are the mean and standard deviation of the log of the variate.
Simply put, the arguments are (x, shape, loc, scale), with the parameter definitions below:
loc - No equivalent, this gets subtracted from your data so that 0 becomes the infimum of the range of the data.
scale - exp μ, where μ is the mean of the log of the variate. (When fitting, typically you'd use the sample mean of the log of the data.)
shape - the standard deviation of the log of the variate.
I went through the same frustration as most people with this function, so I'm sharing my solution. Just be careful because the explanations aren't very clear without a compendium of resources.
For more information, I found these sources helpful:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.lognorm.html#scipy.stats.lognorm
https://stats.stackexchange.com/questions/33036/fitting-log-normal-distribution-in-r-vs-scipy
And here is an example, taken from @serv-inc's answer further down this page:
import math
from scipy import stats
# standard deviation of normal distribution
sigma = 0.859455801705594
# mean of normal distribution
mu = 0.418749176686875
# hopefully, total is the value where you need the cdf
total = 37
frozen_lognorm = stats.lognorm(s=sigma, scale=math.exp(mu))
frozen_lognorm.cdf(total) # use whatever function and value you need here
It sounds like you want to instantiate a "frozen" distribution from known parameters. In your example, you could do something like:
from scipy.stats import lognorm
stddev = 0.859455801705594
mean = 0.418749176686875
dist = lognorm([stddev], loc=mean)
which will give you a lognorm distribution object with the mean and standard deviation you specify. You can then get the pdf or cdf like this:
import numpy as np
import pylab as pl
x = np.linspace(0, 6, 200)
pl.plot(x, dist.pdf(x))
pl.plot(x, dist.cdf(x))
Is this what you had in mind?
from math import exp
from scipy import stats

def lognorm_cdf(x, mu, sigma):
    shape = sigma
    loc = 0
    scale = exp(mu)
    return stats.lognorm.cdf(x, shape, loc, scale)

x = 25
mu = 2.0785
sigma = 1.744
p = lognorm_cdf(x, mu, sigma)  # yields the expected 0.74341
Similar to Excel and R, the lognorm_cdf function above parameterizes the CDF of the log-normal distribution using mu and sigma.
Although SciPy uses shape, loc and scale parameters to characterize its probability distributions, for the log-normal distribution I find it slightly easier to think of these parameters at the variable level rather than at the distribution level. Here's what I mean...
A log-normal variable X is related to a normal variable Z as follows:
X = exp(mu + sigma * Z) #Equation 1
which is the same as:
X = exp(mu) * exp(Z)**sigma #Equation 2
This can be sneakily re-written as follows:
X = exp(mu) * exp(Z-Z0)**sigma #Equation 3
where Z0 = 0. This equation is of the form:
f(x) = a * ( (x-x0) ** b ) #Equation 4
If you can visualize equations in your head it should be clear that the scale, shape and location parameters in Equation 4 are: a, b and x0, respectively. This means that in Equation 3 the scale, shape and location parameters are: exp(mu), sigma and zero, respectively.
If you can't visualize that very clearly, let's rewrite Equation 2 as a function:
f(Z) = exp(mu) * exp(Z)**sigma #(same as Equation 2)
and then look at the effects of mu and sigma on f(Z). First hold sigma constant and vary mu: you should see that mu vertically scales f(Z). However, it does so in a nonlinear manner; the effect of changing mu from 0 to 1 is smaller than the effect of changing mu from 1 to 2. From Equation 2 we see that exp(mu) is actually the linear scaling factor. Hence SciPy's "scale" is exp(mu).
Next hold mu constant and vary sigma: you should see that the shape of f(Z) changes. That is, f(Z) has a constant value when Z=0, and sigma affects how quickly f(Z) curves away from the horizontal axis. Hence SciPy's "shape" is sigma.
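A quick numerical check of this parameterization, reusing the mu, sigma and x from the earlier updates and comparing SciPy against the closed-form log-normal CDF:
import math
from scipy import stats

mu, sigma, x = 2.0785, 1.744, 25.0

via_scipy = stats.lognorm.cdf(x, s=sigma, loc=0, scale=math.exp(mu))
via_erf = 0.5 + 0.5 * math.erf((math.log(x) - mu) / (sigma * math.sqrt(2)))
print(via_scipy, via_erf)  # both ~0.7434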
Even later to the party, but in case it's helpful to anyone else: I found that Excel's
LOGNORM.DIST(x,Ln(mean),standard_dev,TRUE)
provides the same results as python's
from scipy.stats import lognorm
lognorm.cdf(x, sigma, 0, mean)
Likewise, Excel's
LOGNORM.DIST(x,Ln(mean),standard_dev,FALSE)
seems equivalent to Python's
from scipy.stats import lognorm
lognorm.pdf(x, sigma, 0, mean)
@lucas' answer has the usage down pat. As a code example, you could use
import math
from scipy import stats
# standard deviation of normal distribution
sigma = 0.859455801705594
# mean of normal distribution
mu = 0.418749176686875
# hopefully, total is the value where you need the cdf
total = 37
frozen_lognorm = stats.lognorm(s=sigma, scale=math.exp(mu))
frozen_lognorm.cdf(total) # use whatever function and value you need here
Known mean and stddev of the lognormal distribution
In case someone is looking for it, here is a solution for getting the scipy.stats.lognorm distribution if the mean mu and standard deviation sigma of the lognormal distribution are known. In this case we have to calculate the stats.lognorm parameters from the known mu and sigma like so:
import numpy as np
from scipy import stats
mu = 10
sigma = 3
a = 1 + (sigma / mu) ** 2
s = np.sqrt(np.log(a))
scale = mu / np.sqrt(a)
This was obtained by looking into the implementation of the variance and mean calculations in the stats.lognorm.stats method and essentially reversing it (solving for the input).
Then we can initialize the frozen distribution instance
distr = stats.lognorm(s, 0, scale)
# generate some randomvals
randomvals = distr.rvs(1_000_000)
# calculate mean and variance using the dedicated method
mu_stats, var_stats = distr.stats("mv")
Compare means and stddevs from input, randomvals and analytical solution from distr.stats:
print(f"""
Mean Std
----------------------------
Input: {mu:6.2f} {sigma:6.2f}
Randomvals: {randomvals.mean():6.2f} {randomvals.std():6.2f}
lognorm.stats: {mu_stats:6.2f} {np.sqrt(var_stats):6.2f}
""")
Mean Std
----------------------------
Input: 10.00 3.00
Randomvals: 10.00 3.00
lognorm.stats: 10.00 3.00
Plot PDF from stats.lognorm and histogram of the random values:
import holoviews as hv
hv.extension('bokeh')
x = np.linspace(0, 30, 301)
counts, _ = np.histogram(randomvals, bins=x)
counts = counts / counts.sum() / (x[1] - x[0])
(hv.Histogram((counts, x))
* hv.Curve((x, distr.pdf(x))).opts(color="r").opts(width=900))
If you read this and just want a function with behaviour similar to rlnorm in R: well, then relieve yourself from violent anger and use numpy's numpy.random.lognormal.
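For instance (a tiny sketch; mean and sigma here are the parameters of the underlying normal, matching R's meanlog and sdlog arguments):
import numpy as np

rng = np.random.default_rng(0)
samples = rng.lognormal(mean=2.0785, sigma=1.744, size=10_000)
print(samples.mean())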
