I have computed a test statistic that is distributed as a chi square with 1 degree of freedom, and want to find out what P-value this corresponds to using python.
I'm a Python and maths/stats newbie, so I think what I want here is the probability density function for the chi2 distribution from SciPy. However, when I use this like so:
from scipy import stats
stats.chi2.pdf(3.84 , 1)
0.029846
However, some googling and some colleagues who know maths but not Python suggest the answer should be 0.05.
Any ideas?
Cheers,
Davy
Quick refresher here:
Probability Density Function: think of it as a point value; how dense is the probability at a given point?
Cumulative Distribution Function: this is the mass of probability of the function up to a given point; what percentage of the distribution lies on one side of this point?
In your case, you took the PDF, for which you got the correct answer. If you try 1 - CDF:
>>> 1 - stats.chi2.cdf(3.84, 1)
0.050043521248705147
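To make the distinction concrete, here is a small sketch that evaluates the density, the lower-tail probability and the upper-tail probability at the same point. The variable names are illustrative only; `chi2.sf` is SciPy's survival function, i.e. 1 - CDF.

```python
from scipy.stats import chi2

stat, df = 3.84, 1  # test statistic and degrees of freedom from the question

density = chi2.pdf(stat, df)     # height of the density curve at 3.84
lower_tail = chi2.cdf(stat, df)  # P(X <= 3.84)
upper_tail = chi2.sf(stat, df)   # P(X > 3.84), which is the p-value

print(density, lower_tail, upper_tail)
```

Only the last of the three numbers is the p-value; the first is the 0.0298 from the question.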
To calculate the p-value from the chi-squared statistic and the degrees of freedom, you can also call chisqprob:
>>> from scipy.stats import chisqprob
>>> chisqprob(3.84, 1)
0.050043521248705189
Notice:
chisqprob is deprecated! stats.chisqprob is deprecated in scipy 0.17.0; use stats.distributions.chi2.sf instead
Update: as noted, chisqprob() is deprecated for scipy version 0.17.0 onwards. High accuracy chi-square values can now be obtained via scipy.stats.distributions.chi2.sf(), for example:
>>> from scipy.stats.distributions import chi2
>>> chi2.sf(3.84, 1)
0.050043521248705189
>>> chi2.sf(1424, 1)
1.2799986253099803e-311
While stats.chisqprob() and 1-stats.chi2.cdf() appear comparable for small chi-square values, for large chi-square values the former is preferable. The latter cannot provide a p-value smaller than machine epsilon, and will give very inaccurate answers close to machine epsilon. As shown by others, the two methods give comparable values for small chi-square values:
>>> from scipy.stats import chisqprob, chi2
>>> chisqprob(3.84, 1)
0.050043521248705189
>>> 1 - chi2.cdf(3.84, 1)
0.050043521248705147
Using 1-chi2.cdf() breaks down here:
>>> 1 - chi2.cdf(67, 1)
2.2204460492503131e-16
>>> 1 - chi2.cdf(68, 1)
1.1102230246251565e-16
>>> 1 - chi2.cdf(69, 1)
1.1102230246251565e-16
>>> 1 - chi2.cdf(70, 1)
0.0
Whereas chisqprob() gives you accurate results for a much larger range of chi-square values, producing p-values nearly as small as the smallest float greater than zero, until it too underflows:
>>> chisqprob(67, 1)
2.7150713219425247e-16
>>> chisqprob(68, 1)
1.6349553217245471e-16
>>> chisqprob(69, 1)
9.8463440314253303e-17
>>> chisqprob(70, 1)
5.9304458500824782e-17
>>> chisqprob(500, 1)
9.505397766554137e-111
>>> chisqprob(1000, 1)
1.7958327848007363e-219
>>> chisqprob(1424, 1)
1.2799986253099803e-311
>>> chisqprob(1425, 1)
0.0
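Because chisqprob is deprecated, the same comparison can be reproduced with chi2.sf. This is only a sketch of the idea; the loop values and the `results` dictionary are chosen here for illustration.

```python
from scipy.stats import chi2

results = {}
for stat in (3.84, 70.0, 500.0):
    naive = 1 - chi2.cdf(stat, 1)  # loses all precision once cdf rounds to 1.0
    direct = chi2.sf(stat, 1)      # computes the upper tail directly
    results[stat] = (naive, direct)
    print(stat, naive, direct)
```

For the small statistic both columns agree; for the large ones only the sf column still carries information.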
You meant to do:
>>> 1 - stats.chi2.cdf(3.84, 1)
0.050043521248705147
Some of the other solutions are deprecated. Use the survival function of scipy.stats.chi2, which is the same as 1 - cdf(chi_statistic, df).
Example:
from scipy.stats import chi2
p_value = chi2.sf(chi_statistic, df)
If you want to understand the math, the p-value of a sample, x (fixed), is
P[P(X) <= P(x)] = P[m(X) >= m(x)] = 1 - G(m(x)^2)
where,
P is the probability of a (say k-variate) normal distribution w/ known covariance (cov) and mean,
X is a random variable from that normal distribution,
m(x) is the Mahalanobis distance = sqrt(< cov^{-1} (x-mean), x-mean >). Note that in 1-d this is just the absolute value of the z-score.
G is the CDF of the chi^2 distribution w/ k degrees of freedom.
So if you're computing the p-value of a fixed observation, x, then you compute m(x) (generalized z-score), and 1-G(m(x)^2).
For example, it's well known that if x is sampled from a univariate (k = 1) normal distribution and has z-score = 2 (it's 2 standard deviations from the mean), then the p-value is about .046 (see a z-score table):
In [7]: from scipy.stats import chi2
In [8]: k = 1
In [9]: z = 2
In [10]: 1-chi2.cdf(z**2, k)
Out[10]: 0.045500263896358528
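The same recipe carries over to k > 1. The helper below is a minimal sketch, assuming the mean and covariance are known; the function name `mahalanobis_p_value` is made up for this illustration and is not part of SciPy.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_p_value(x, mean, cov):
    """Return 1 - G(m(x)^2) for an observation x under N(mean, cov)."""
    diff = np.atleast_1d(np.asarray(x, dtype=float) - np.asarray(mean, dtype=float))
    cov = np.atleast_2d(cov)
    # squared Mahalanobis distance <cov^{-1} (x - mean), x - mean>
    m_squared = diff @ np.linalg.solve(cov, diff)
    # sf is 1 - cdf, with k = number of dimensions as degrees of freedom
    return chi2.sf(m_squared, df=diff.size)

p_1d = mahalanobis_p_value(2.0, 0.0, 1.0)                      # z-score of 2
p_2d = mahalanobis_p_value([1.0, 1.0], [0.0, 0.0], np.eye(2))  # bivariate case
print(p_1d, p_2d)
```

The first call reproduces the univariate result above; the second uses 2 degrees of freedom because the observation has two components.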
For ultra-high precision, when scipy's chi2.sf() isn't enough, bring out the big guns:
>>> import numpy as np
>>> from rpy2.robjects import r
>>> np.exp(np.longdouble(r.pchisq(19000, 2, lower_tail=False, log_p=True)[0]))
1.5937563168532229629e-4126
Update by another user (WestCoastProjects): using the values from the OP, we get:
np.exp(np.longdouble(r.pchisq(3.84,1, lower_tail=False, log_p=True)[0]))
Out[5]: 0.050043521248705198928
So there's that 0.05 you were looking for.
I want to sample from the binomial distribution B(n,p) but with the additional constraint that the sampled value lies in the range [a,b] (instead of the normal 0 to n range). In other words, I have to sample a value from the binomial distribution given that it lies in the range [a,b]. Mathematically, I can write the pmf of this distribution (f(x)) in terms of the pmf of the binomial distribution bin(x) = [(nCx)*(p)^x*(1-p)^(n-x)] as
sum = 0
for i in range(a,b+1):
    sum += bin(i)
f(x) = bin(x)/sum
One way of sampling from this distribution is to sample a uniformly distributed number and apply the inverse of the CDF(obtained using the pmf). However, I don't think this is a good idea as the pmf calculation would easily get very time-consuming.
The values of n,x,a,b are quite large in my case and this way of computing pmf and then using a uniform random variable to generate the sample seems extremely inefficient due to the factorial terms in nCx.
What's a nice/efficient way to achieve this?
This is a way to collect all the values of bin in a pretty short time:
from scipy.special import comb
import numpy as np
def distribution(n, p=0.5):
    x = np.arange(n+1)
    return comb(n, x, exact=False) * p ** x * (1 - p) ** (n - x)
It can be done in a quarter of microsecond for n=1000.
Sample run:
>>> distribution(4)
array([0.0625, 0.25 , 0.375 , 0.25 , 0.0625])
You can sum specific parts of this array like so:
>>> np.sum(distribution(4)[2:4])
0.625
Remark: for n>1000 the middle values of this distribution involve multiplying extremely large numbers by extremely small ones, so a RuntimeWarning is raised.
Bugfix
You can use scipy.stats.binom equivalently:
import numpy as np
from scipy.stats import binom
def distribution(n, p):
    return binom.pmf(np.arange(n+1), n, p)
This does the same as the method mentioned above, and quite efficiently (n=1000000 in a third of a second). Alternatively, you can use binom.cdf(np.arange(n+1), n, p), which calculates the cumulative sum of binom.pmf. Subtracting the item at index a-1 from the item at index b then gives the total probability of the range [a, b], which is the normalising sum you need.
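Putting these pieces together, one possible way to draw from the truncated distribution is to normalise the pmf over [a, b] and sample from the resulting weights. This is a sketch only: the function name `sample_truncated_binomial` and its parameters are invented for illustration, and it assumes the range [a, b] carries enough probability mass that the pmf values do not all underflow to zero.

```python
import numpy as np
from scipy.stats import binom

def sample_truncated_binomial(n, p, a, b, size=1, rng=None):
    """Draw from B(n, p) conditioned on the value lying in [a, b]."""
    rng = np.random.default_rng() if rng is None else rng
    support = np.arange(a, b + 1)       # allowed outcomes a, a+1, ..., b
    weights = binom.pmf(support, n, p)  # unnormalised bin(x) on the range
    weights = weights / weights.sum()   # f(x) = bin(x) / sum
    return rng.choice(support, size=size, p=weights)

samples = sample_truncated_binomial(1000, 0.5, 520, 560, size=10_000,
                                    rng=np.random.default_rng(0))
print(samples.min(), samples.max())
```

The cost is proportional to b - a + 1 rather than to n, since only the pmf values inside the range are evaluated.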
Another way would be to use the CDF and its inverse, something like:
from scipy import stats
dist = stats.binom(100, 0.5)
# limit ourselves to [60, 100]; cdf(59) is the lower edge so that 60 itself can be drawn
lo, hi = dist.cdf([59, 100])
# draw a sample
x = dist.ppf(stats.uniform(lo, hi-lo).rvs())
This should give us values in the range. Note that due to floating point precision, this might give you values outside of what you want. It gets worse above the mean of the distribution.
Note that for large values you might as well use the normal approximation.
I would like to calculate a one sided tolerance bound based on the normal distribution given a data set with known N (sample size), standard deviation, and mean.
If the interval were two sided I would do the following:
conf_int = stats.norm.interval(alpha, loc=mean, scale=sigma)
In my situation, I am bootstrapping samples, but if I weren't I would refer to this post on stackoverflow: Correct way to obtain confidence interval with scipy and use the following: conf_int = stats.norm.interval(0.68, loc=mean, scale=sigma / np.sqrt(len(a)))
How would you do the same thing, but to calculate this as a one sided bound (95% of values are above or below x<--bound)?
I assume that you are interested in computing one-side tolerance bound using the normal distribution (based on the fact you mention the scipy.stats.norm.interval function as the two-sided equivalent of your need).
Then the good news is that, based on the tolerance interval Wikipedia page:
One-sided normal tolerance intervals have an exact solution in terms of the sample mean and sample variance based on the noncentral t-distribution.
(FYI: Unfortunately, this is not the case for the two-sided setting)
This assertion is based on this paper. In addition, paragraph 4.8 (page 23) provides the formulas.
The bad news is that I do not think there is a ready-to-use scipy function that you can safely tweak and use for your purpose.
But you can easily calculate it yourself. Several GitHub repositories contain such a calculator and can serve as inspiration; the following illustrative example is built from one of them:
import numpy as np
from scipy.stats import norm, nct
# sample size
n=1000
# Percentile for the TI to estimate
p=0.9
# confidence level
g = 0.95
# a demo sample
x = np.array([np.random.normal(100) for k in range(n)])
# mean estimate based on the sample
mu_est = x.mean()
# standard deviation estimated based on the sample
sigma_est = x.std(ddof=1)
# (100*p)th percentile of the standard normal distribution
zp = norm.ppf(p)
# gth quantile of a non-central t distribution
# with n-1 degrees of freedom and non-centrality parameter np.sqrt(n)*zp
t = nct.ppf(g, df=n-1., nc=np.sqrt(n)*zp)
# k factor from Young et al paper
k = t / np.sqrt(n)
# One-sided tolerance upper bound
conf_upper_bound = mu_est + (k*sigma_est)
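The same computation can be wrapped in a small helper that also returns the lower bound. This is a sketch under the same assumptions as the example above (normally distributed data, k factor from the noncentral t-distribution); the function names are hypothetical.

```python
import numpy as np
from scipy.stats import norm, nct

def one_sided_tolerance_factor(n, p, g):
    """k factor covering a proportion p with confidence g, for sample size n."""
    zp = norm.ppf(p)  # (100*p)th percentile of the standard normal
    return nct.ppf(g, df=n - 1, nc=np.sqrt(n) * zp) / np.sqrt(n)

def one_sided_tolerance_bounds(sample, p=0.9, g=0.95):
    """Return (lower, upper); each is meant to be used on its own as a one-sided bound."""
    sample = np.asarray(sample, dtype=float)
    k = one_sided_tolerance_factor(sample.size, p, g)
    mean, std = sample.mean(), sample.std(ddof=1)
    return mean - k * std, mean + k * std

rng = np.random.default_rng(0)
lower, upper = one_sided_tolerance_bounds(rng.normal(100, 1, size=20))
k20 = one_sided_tolerance_factor(20, 0.9, 0.95)
print(k20, lower, upper)
```

Note that the two values do not jointly form a two-sided tolerance interval; pick the one matching the direction you care about.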
Here is a one-line solution with the openturns library, assuming your data is a numpy array named sample.
import openturns as ot
ot.NormalFactory().build(sample.reshape(-1, 1)).computeQuantile(0.95)
Let us unpack this. NormalFactory is a class designed to fit the parameters of a Normal distribution (mu and sigma) on a given sample: NormalFactory() creates an instance of this class.
The method build does the actual fitting and returns an object of the class Normal which represents the normal distribution with parameters mu and sigma estimated from the sample.
The sample reshape is there to make sure that OpenTURNS understands that the input sample is a collection of one-dimension points, not a single multi-dimensional point.
The class Normal then provides the method computeQuantile to compute any quantile of the distribution (the 95-th percentile in this example).
This solution does not compute the exact tolerance bound because it uses a quantile from a Normal distribution instead of a Student t-distribution. Effectively, that means that it ignores the estimation error on mu and sigma. In practice, this is only an issue for really small sample sizes.
To illustrate this, here is a comparison between the PDF of the standard normal N(0,1) distribution and the PDF of the Student t-distribution with 19 degrees of freedom (this means a sample size of 20). They can barely be distinguished.
deg_freedom = 19
graph = ot.Normal().drawPDF()
student = ot.Student(deg_freedom).drawPDF().getDrawable(0)
student.setColor('blue')
graph.add(student)
graph.setLegends(['Normal(0,1)', 't-dist k={}'.format(deg_freedom)])
graph
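If OpenTURNS is not at hand, a rough numeric check of the same comparison can be done with SciPy. This sketch only compares the 95% quantiles of the two distributions; it says nothing about the noncentral t correction used for exact tolerance bounds.

```python
from scipy.stats import norm, t

normal_q = norm.ppf(0.95)        # 95% quantile of N(0, 1)
student_q = t.ppf(0.95, df=19)   # 95% quantile of Student t, 19 degrees of freedom

print(normal_q, student_q, student_q / normal_q)
```

The ratio printed at the end shows how much wider the t-based quantile is for a sample size of 20.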
Good day!
I have two gamma distributions, and want to find distribution of their difference.
I use np.random.gamma to generate samples from the given parameters, but the fitted distribution varies a lot from run to run.
Code:
import numpy as np
from scipy.stats import gamma
for i in range(0, 10):
    s1 = np.random.gamma(1.242619972, 0.062172619, 2000) + 0.479719122
    s2 = np.random.gamma(456.1387112, 0.002811328, 2000) - 0.586076723
    r_a, r_loc, r_scale = gamma.fit(s1 - s2)
    print(1 - gamma.cdf(0.0, r_a, r_loc, r_scale))
Result:
0.4795655021157602
0.07061938039031612
0.06960741675590854
0.4957568913729331
0.4889900326940878
0.07381963810128422
0.0690800784280835
0.07198551429809896
0.07659274505827551
0.06967441935502583
I get two quite different values for the probability above 0: roughly 0.48 and 0.07. What can be the problem?
You're fitting a gamma distribution to the difference between two other gamma distributions. A gamma distribution can only be positive, so that makes no sense and you can't expect to get a consistent answer. If you print the mean difference you get consistent results.
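One way to avoid the fit altogether is to estimate the quantities of interest directly from the simulated differences. The following is a sketch using the parameters from the question; the sample size and the seed are arbitrary choices made here.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible
n = 200_000

s1 = rng.gamma(1.242619972, 0.062172619, n) + 0.479719122
s2 = rng.gamma(456.1387112, 0.002811328, n) - 0.586076723
diff = s1 - s2

mean_diff = diff.mean()           # empirical mean of the difference
prob_positive = np.mean(diff > 0)  # empirical P(s1 - s2 > 0), no fit needed

print(mean_diff, prob_positive)
```

Because nothing is fitted, repeated runs with different seeds should differ only by sampling noise.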
Is it possible to do a t-test using scipy.stats.ttest_1samp where the input is a statistic rather than an array? For example, with difference in means you have two options: ttest_ind() and ttest_ind_from_stats().
import numpy as np
import scipy.stats as stats
from scipy.stats import norm
mean1=35.6
std1=11.3
nobs1=84
mean2=44.7
std2=8.9
nobs2=84
print(stats.ttest_ind_from_stats(mean1, std1, nobs1, mean2, std2, nobs2, equal_var=False))
# alternatively, you can pass 2 arrays
print(stats.ttest_ind(
stats.norm.rvs(loc=mean1, scale=std1, size=84),
stats.norm.rvs(loc=mean2, scale=std2, size=84),
equal_var=False)
)
Is there an equivalent function with a one-sample t-test?
Thank you for your help.
TL;DR
There is no such function for the one sample test, but you can use the two sample function.
In short, to perform a one sample t-test do this:
from scipy import stats
stats.ttest_ind_from_stats(mean1=sample_mean,
                           std1=sample_std,
                           nobs1=n_samples,
                           mean2=population_mean,
                           std2=0,
                           nobs2=2,
                           equal_var=False)
Note that the result is completely independent from nobs2 (as it should be, since there is no n2 in the one sample test). Just make sure to pass in a value >1 to avoid a division by zero.
How does it work?
Check out the Wikipedia page about the different types of t-test.
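The one sample t-test uses the statistic
$$ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} $$
with n - 1 degrees of freedom.
The ttest_ind_from_stats function can do Welch's t-test (unequal sample size, unequal variance), which is defined as
$$ t = \frac{\bar{x}_1 - \bar{x}_2}{s_{\bar{\Delta}}} $$
with
$$ s_{\bar{\Delta}} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} $$
and degrees of freedom:
$$ \nu = \frac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2}{\frac{(s_1^2 / n_1)^2}{n_1 - 1} + \frac{(s_2^2 / n_2)^2}{n_2 - 1}} $$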
We can transform the definition of Welch's t-test into the one sample t-test. If we set mean2 to the population mean and std2 to 0, the equations for the t-statistic are the same, and the degrees of freedom reduce to n - 1.
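As a cross-check, the one sample statistic can also be computed by hand from summary statistics and compared with ttest_1samp on raw data. This is a sketch; `ttest_1samp_from_stats` is a name invented here, not a SciPy function, and the demo data are simulated.

```python
import numpy as np
from scipy import stats

def ttest_1samp_from_stats(mean, std, nobs, popmean):
    """Two-sided one sample t-test from the sample mean, sample std (ddof=1) and size."""
    t_stat = (mean - popmean) / (std / np.sqrt(nobs))
    p_value = 2 * stats.t.sf(abs(t_stat), df=nobs - 1)  # two-sided tail area
    return t_stat, p_value

rng = np.random.default_rng(0)
data = rng.normal(35.6, 11.3, size=84)

from_stats = ttest_1samp_from_stats(data.mean(), data.std(ddof=1), data.size, 33.0)
from_array = stats.ttest_1samp(data, 33.0)
via_welch = stats.ttest_ind_from_stats(data.mean(), data.std(ddof=1), data.size,
                                       33.0, 0, 2, equal_var=False)
print(from_stats, from_array, via_welch)
```

All three routes should print the same statistic and p-value up to floating point noise.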
I am looking for a function to compute the CDF for a multivariate normal distribution. I have found that scipy.stats.multivariate_normal has only a method to compute the PDF (for a sample x) but not the CDF: multivariate_normal.pdf(x, mean=mean, cov=cov)
I am looking for the same thing but to compute the cdf, something like: multivariate_normal.cdf(x, mean=mean, cov=cov), but unfortunately multivariate_normal doesn't have a cdf method.
The only thing that I found is this: Multivariate Normal CDF in Python using scipy
but the presented method scipy.stats.mvn.mvnun(lower, upper, means, covar) doesn't take a sample x as a parameter, so I don't really see how to use it to have something similar to what I said above.
This is just a clarification of the points that @sascha made above in the comments for the answer. The relevant function can be found here:
As an example, in a multivariate normal distribution with diagonal covariance the CDF evaluated at the mean should give (1/4) * total area = 0.25 (look at the scatterplot below if you don't understand why). The following example will allow you to play with it:
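import numpy as np
from statsmodels.sandbox.distributions.extras import mvnormcdf
# upper was not defined in the original snippet; the mean is assumed here,
# so the integral covers one quadrant and should come out as 0.25
upper = np.array((0, 0))
for i in range(1, 20, 2):
    cov_example = np.array(((i, 0), (0, i)))
    mean_example = np.array((0, 0))
    print(mvnormcdf(upper=upper, mu=mean_example, cov=cov_example))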
The output of this is 0.25, 0.25, 0.25, 0.25...
The CDF of some distribution is actually an integral over the PDF of that distribution. That being so, you need to provide the function with the boundaries of the integral.
What most people mean when they ask for a p_value of some point in relation to some distribution is:
what is the chance of getting these values or higher given this distribution?
Note the area marked in red - it is not a point, but rather an integral from some point onwards:
Accordingly, you need to set your point as the lower boundary, +inf (or some arbitrarily high enough value) as the upper boundary and provide the means and covariance matrix you already have:
from sys import maxsize
import numpy as np
import scipy.stats
def mvn_p_value(x, mu, cov_matrix):
    upper_bounds = np.array([maxsize] * x.size)  # make an upper bound the size of your vector
    # mvnun returns (value, inform); the integral is the first element
    p_value = scipy.stats.mvn.mvnun(x, upper_bounds, mu, cov_matrix)[0]
    if 0.5 < p_value:  # this inversion is used for two-sided statistical testing
        p_value = 1 - p_value
    return p_value
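Recent SciPy releases do provide a cdf method on multivariate_normal, so, assuming such a version is installed, the integral can be obtained without the mvn module. The sketch below also shows one way to get the upper-orthant probability P(X >= x): since -X is normal with mean -mu and the same covariance, it equals the CDF of that reflected distribution at -x. The example values are arbitrary.

```python
import numpy as np
from scipy.stats import multivariate_normal

mean = np.array([0.0, 0.0])
cov = np.array([[3.0, 0.0], [0.0, 3.0]])
x = np.array([0.0, 0.0])

# P(X1 <= x1, X2 <= x2): the usual multivariate CDF
lower_orthant = multivariate_normal(mean=mean, cov=cov).cdf(x)

# P(X1 >= x1, X2 >= x2), using the reflection -X ~ N(-mean, cov)
upper_orthant = multivariate_normal(mean=-mean, cov=cov).cdf(-x)

print(lower_orthant, upper_orthant)
```

With x at the mean and a diagonal covariance, both values should be close to the 0.25 discussed above. In more than one dimension the upper-orthant probability is not 1 - CDF, which is why the reflection is used.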