p-value from ks_2samp is unexpected - python

Why is the p-value of ks_2samp between array x and array y less than 0.05? As you can see, they are actually drawn from the same distribution (a standard normal). I cannot find the reason and I'm very confused. Thank you in advance!
import scipy.stats as st
import numpy as np
np.random.seed(12)
x = np.random.normal(0,1,size=1000)
y = np.random.normal(0,1,size=1000)
st.ks_2samp(x,y)
Out[9]: KstestResult(statistic=0.066, pvalue=0.025633868930359294)

This is correct. A low p-value means you have grounds to reject the null hypothesis, which says that these two samples came from the same distribution. But even when the null hypothesis is true, a p-value below 0.05 occurs about 5% of the time purely by chance (a Type I error), and with this particular seed you simply landed in that 5%. Also, rejecting the null hypothesis is not the same as proving that the two samples came from different distributions; it just means this particular pair of samples looks unlikely under the assumption that they came from the same one.
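As a quick illustration (a minimal sketch, not part of the original answer), you can repeat the experiment across many independent draws and count how often the p-value falls below 0.05; even though both samples always come from the same normal distribution, the rejection rate should hover around 5%:

import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
n_trials = 1000
rejections = 0
for _ in range(n_trials):
    x = rng.normal(0, 1, size=1000)
    y = rng.normal(0, 1, size=1000)
    if st.ks_2samp(x, y).pvalue < 0.05:
        rejections += 1

# Roughly 5% of trials reject, even though x and y share the same distribution.
print(rejections / n_trials)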

Related

2 samples hypothesis testing in Python

I am endeavouring to perform a two sample hypothesis test in Python, having been given the original code in R.
The code in R is:-
prop.test(x=c(10,16), n=c(100,100))
# The p-value is 0.2931, which is greater than alpha = 0.05,
# so we fail to reject the null hypothesis
I have tried to perform the same test in both scipy and statsmodels libraries.
The code using scipy is:-
import numpy as np
import scipy.stats as stats
hats = np.array([[100,10], [100, 16]])
print("Hats scipy: ", stats.chi2_contingency(hats))`
#The p-value .368767 is greater than alpha=0.5, so we fail to reject the null hypothesis
The code using statsmodels is:-
import numpy as np
import statsmodels.stats.proportion as proportion
hat_a = 10
hat_b = 16
sample_a = 100
sample_b = 100
hats = np.array([hat_a, hat_b])
samples = np.array([sample_a,sample_b])
chisq, pvalue, table = proportion.proportions_chisquare(hats, samples)
print('Results are ','chisq =%.3f, pvalue = %.3f'%(chisq, pvalue))
# The p-value 0.207 is greater than alpha = 0.05, so we fail to reject the null hypothesis
I have searched online for the correct way to perform a two-sample hypothesis test and have found a variety of ways to code it using both scipy and statsmodels.
My question is:-
Is there a hypothesis test I can perform in statsmodels or scipy that will give me the same result I achieved using R, namely a p-value of 0.2931?
I am new to statistics and probabilities, so any advice would be greatly appreciated in advance.
R's prop.test uses the Yates continuity correction by default (it can be turned off with correct=F).
Therefore, to replicate it in Python, you need to apply that Yates continuity correction. This can be done with stats.chi2_contingency(). However, your array of observed values needs to be adjusted so that it contains the count in each cell of the RxC table (here 2x2), i.e. both the failures and the successes for each group.
hats = np.array([[90, 10], [84, 16]])  # each row: [100 - successes, successes] for one group
stats.chi2_contingency(hats, correction=True)[1]
Output:
0.2931241173597945
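Incidentally (a small follow-up sketch, not part of the original answer), turning the correction off on the same corrected table reproduces R's prop.test(..., correct=F) and matches the 0.207 value from the statsmodels attempt, which suggests the original discrepancy came from how the table was built rather than from the libraries:

import numpy as np
import scipy.stats as stats

hats = np.array([[90, 10], [84, 16]])
# With the Yates correction (R's default): p is about 0.2931
print(stats.chi2_contingency(hats, correction=True)[1])
# Without the correction: p is about 0.207, matching proportions_chisquare
print(stats.chi2_contingency(hats, correction=False)[1])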

Kolmogorov-Smirnov test for uniformity giving unexpected results

I'm trying to check whether my random variables come from a uniform distribution or not. In my case, though, I have arrays like [0, 0, 0.44, 0, ... 0] that I need to check. This vector looks somewhat uniform to me, but the KS test doesn't seem to agree.
Here's what I do:
import numpy as np
from scipy.stats import kstest

x = np.zeros(30)
x[3] = 0.44  # an array of the kind I'm talking about: a single non-zero entry
ks_statistic, p_value = kstest(x, 'uniform')
print(ks_statistic, p_value)
And here's what I get:
0.9666666666666667 9.713871499237658e-45
So the verdict is 'not uniform'. As far as I understand the math behind the KS test, it may be that my mostly-zero array is far from what stats.uniform would generate for comparison, and therefore the distance between the distributions is huge. But there is also a good chance that I've got something wrong.
What do I do in order to test my variables for uniformity correctly?
Update: I checked out the Chi-square test and I got something a lot more like what I expected.
from scipy.stats import chisquare
stat, p_value = chisquare(x)
print(stat, p_value)
>>> 12.760000000000002 0.9960743669884126
So my question is why are these results so different and what am I missing?
I see where I'm wrong, that was easy.
What I claimed to look like uniformly distributed variables looks nothing like it. The whole point of a continuous uniform distribution is that you will essentially never see two identical samples from it, and I have zeros, zeros, zeros all over the place.
I still don't get the difference between the p-value for KS and for Chi^2 though. But that's probably another question.
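For reference, a small sketch (not from the original thread) showing what kstest reports when the data really are uniform on [0, 1], plus a note on why chisquare behaved so differently:

import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
u = rng.uniform(0, 1, size=30)   # genuinely uniform draws on [0, 1]
print(kstest(u, 'uniform'))      # typically a large p-value: no evidence against uniformity

# Note: chisquare(x) treats x as observed category counts and compares them with
# equal expected counts, so it answers a different question than "are the values
# of x uniformly distributed on [0, 1]?".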

Get statistical difference of correlation coefficient in python

To get the correlation between two arrays in python, I am using:
from scipy.stats import pearsonr
x, y = [1,2,3], [1,5,7]
cor, p = pearsonr(x, y)
However, as stated in the docs, the p-value returned from pearsonr() is only meaningful with datasets larger than 500. So how can I get a p-value that is reasonable for small datasets?
My temporary solution:
After reading up on linear regression, I have come up with my own small script, which basically uses the Fisher transformation to get the z-score, from which the p-value is calculated:
import numpy as np
from scipy.stats import zprob  # note: zprob has been removed from modern SciPy

n = len(x)  # x and cor come from the pearsonr snippet above
z = np.log((1 + cor) / (1 - cor)) * 0.5 * np.sqrt(n - 3)  # Fisher z-transform scaled by sqrt(n-3)
p = zprob(-z)
It works. However, I am not sure whether it is more reasonable than the p-value given by pearsonr(). Is there a Python module which already has this functionality? I have not been able to find it in SciPy or Statsmodels.
Edit to clarify:
The dataset in my example is simplified. My real dataset is two arrays of 10-50 values.
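For reference, here is a minimal sketch of the same Fisher-transformation idea using scipy.stats.norm instead of the removed zprob (two-sided p-value; the sample data are made up purely for illustration):

import numpy as np
from scipy.stats import pearsonr, norm

x, y = [1, 2, 3, 4, 5], [1, 5, 7, 6, 9]   # hypothetical small sample
cor, _ = pearsonr(x, y)

n = len(x)
z = np.arctanh(cor) * np.sqrt(n - 3)      # arctanh(r) == 0.5 * log((1 + r) / (1 - r))
p_two_sided = 2 * norm.sf(abs(z))
print(cor, p_two_sided)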

wald distribution and inverse gaussian distribution in scipy.stats

I am using scipy.stats to fit my data.
scipy.stats.invgauss.fit(my_data_array)
scipy.stats.wald.fit(my_data_array)
The wiki page http://en.wikipedia.org/wiki/Inverse_Gaussian_distribution says that the Wald distribution is just another name for the inverse Gaussian, but using the two functions above gives me different fitting parameters: scipy.stats.invgauss.fit gives me three parameters and scipy.stats.wald.fit gives two.
What is the difference between these two distributons in scipy.stats?
I was trying to find the answer here, http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wald.html and http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.invgauss.html, but I really have no clue.
The link to the scipy.stats wald distribution has the answer to your question.
wald is a special case of invgauss with mu == 1.
So the following should produce the same answer:
import scipy.stats as st
# Draw positive-valued data, since wald/invgauss are supported on x > 0
my_data = st.wald.rvs(size=1000, random_state=0)
wald_params = st.wald.fit(my_data)
invgauss_params = st.invgauss.fit(my_data, f0=1)
wald_params and invgauss_params are the same, except that invgauss has an extra leading 1: that is the shape parameter mu, which the docs say is fixed at 1 for the wald distribution (I fixed it here with the argument f0=1).
Hope that helps.
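A quick way to see the relationship directly (a small sketch, not from the original answer): the wald pdf coincides with the invgauss pdf once mu is fixed at 1:

import numpy as np
import scipy.stats as st

x = np.linspace(0.1, 5, 50)
# True: wald.pdf(x) == invgauss.pdf(x, mu=1) pointwise
print(np.allclose(st.wald.pdf(x), st.invgauss.pdf(x, 1)))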

P-value from Chi sq test statistic in Python

I have computed a test statistic that is distributed as a chi square with 1 degree of freedom, and want to find out what P-value this corresponds to using python.
I'm a Python and maths/stats newbie, so I think what I want here is the probability density function for the chi2 distribution from SciPy. However, when I use it like so:
from scipy import stats
stats.chi2.pdf(3.84, 1)
0.029846
However, some googling and talking to colleagues who know maths but not Python suggests the answer should be 0.05.
Any ideas?
Cheers,
Davy
Quick refresher here:
Probability Density Function: think of it as a point value; how dense is the probability at a given point?
Cumulative Distribution Function: this is the probability mass accumulated up to a given point; what fraction of the distribution lies at or below this point?
In your case, you evaluated the PDF, which correctly gives the density at 3.84, but what you want is the upper-tail probability. If you try 1 - CDF:
>>> 1 - stats.chi2.cdf(3.84, 1)
0.050043521248705147
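Putting the pieces side by side (a quick illustrative sketch using the numbers from the question):

from scipy import stats

stat, df = 3.84, 1
print(stats.chi2.pdf(stat, df))       # ~0.0298: the density at 3.84, not a p-value
print(1 - stats.chi2.cdf(stat, df))   # ~0.0500: the upper-tail probability
print(stats.chi2.sf(stat, df))        # the same tail probability via the survival function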
To calculate the p-value for a given chi-squared statistic and degrees of freedom, you can also call chisqprob:
>>> from scipy.stats import chisqprob
>>> chisqprob(3.84, 1)
0.050043521248705189
Notice:
chisqprob is deprecated! stats.chisqprob is deprecated in scipy 0.17.0; use stats.distributions.chi2.sf instead
Update: as noted, chisqprob() is deprecated for scipy version 0.17.0 onwards. High-accuracy chi-squared p-values can now be obtained via scipy.stats.distributions.chi2.sf(), for example:
>>> from scipy.stats.distributions import chi2
>>> chi2.sf(3.84, 1)
0.050043521248705189
>>> chi2.sf(1424, 1)
1.2799986253099803e-311
While stats.chisqprob() and 1 - stats.chi2.cdf() appear comparable for small chi-squared values, the former is preferable for large ones. The latter cannot produce a p-value smaller than machine epsilon, and gives very inaccurate answers close to machine epsilon. As others have shown, the two methods give comparable values for small chi-squared statistics:
>>> from scipy.stats import chisqprob, chi2
>>> chisqprob(3.84, 1)
0.050043521248705189
>>> 1 - chi2.cdf(3.84, 1)
0.050043521248705147
Using 1 - chi2.cdf() breaks down here:
>>> 1 - chi2.cdf(67, 1)
2.2204460492503131e-16
>>> 1 - chi2.cdf(68, 1)
1.1102230246251565e-16
>>> 1 - chi2.cdf(69, 1)
1.1102230246251565e-16
>>> 1 - chi2.cdf(70, 1)
0.0
chisqprob(), by contrast, gives accurate results over a much larger range of chi-squared values, producing p-values nearly as small as the smallest float greater than zero, until it too underflows:
>>> chisqprob(67, 1)
2.7150713219425247e-16
>>> chisqprob(68, 1)
1.6349553217245471e-16
>>> chisqprob(69, 1)
9.8463440314253303e-17
>>> chisqprob(70, 1)
5.9304458500824782e-17
>>> chisqprob(500, 1)
9.505397766554137e-111
>>> chisqprob(1000, 1)
1.7958327848007363e-219
>>> chisqprob(1424, 1)
1.2799986253099803e-311
>>> chisqprob(1425, 1)
0.0
You meant to do:
>>> 1 - stats.chi2.cdf(3.84, 1)
0.050043521248705147
Some of the other solutions are deprecated. Use the scipy.stats.chi2 survival function instead, which is the same as 1 - cdf(chi_statistic, df):
Example:
from scipy.stats import chi2
p_value = chi2.sf(chi_statistic, df)
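For instance, with the statistic from the question:

from scipy.stats import chi2

p_value = chi2.sf(3.84, 1)   # ~0.0500, the same as 1 - chi2.cdf(3.84, 1)
print(p_value)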
If you want to understand the math, the p-value of a fixed sample x is
Pr[p(X) <= p(x)] = Pr[m(X) >= m(x)] = 1 - G(m(x)^2)
where
p is the density of a (say k-variate) normal distribution with known covariance (cov) and mean,
X is a random variable from that normal distribution,
m(x) is the Mahalanobis distance sqrt(<cov^{-1}(x - mean), x - mean>). Note that in 1-d this is just the absolute value of the z-score.
G is the CDF of the chi^2 distribution with k degrees of freedom.
So if you're computing the p-value of a fixed observation x, you compute m(x) (the generalized z-score) and then 1 - G(m(x)^2).
For example, it's well known that if x is sampled from a univariate (k = 1) normal distribution and has z-score = 2 (it's 2 standard deviations from the mean), then the p-value is about 0.046 (see a z-score table):
In [7]: from scipy.stats import chi2
In [8]: k = 1
In [9]: z = 2
In [10]: 1-chi2.cdf(z**2, k)
Out[10]: 0.045500263896358528
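To make the k-variate case concrete, here is a small sketch (the mean, covariance, and observation are made up purely for illustration):

import numpy as np
from scipy.stats import chi2

mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.3],
                [0.3, 2.0]])
x = np.array([1.5, -2.0])

diff = x - mean
m2 = diff @ np.linalg.inv(cov) @ diff   # squared Mahalanobis distance m(x)^2
p_value = chi2.sf(m2, df=2)             # 1 - G(m(x)^2) with k = 2
print(m2, p_value)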
For ultra-high precision, when scipy's chi2.sf() isn't enough, bring out the big guns:
>>> import numpy as np
>>> from rpy2.robjects import r
>>> np.exp(np.longdouble(r.pchisq(19000, 2, lower_tail=False, log_p=True)[0]))
1.5937563168532229629e-4126
Update by another user (WestCoastProjects): when using the values from the OP we get:
np.exp(np.longdouble(r.pchisq(3.84,1, lower_tail=False, log_p=True)[0]))
Out[5]: 0.050043521248705198928
So there's that 0.05 you were looking for.
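If you'd rather avoid calling into R, an arbitrary-precision library such as mpmath can serve the same purpose (a sketch, using the fact that the chi-squared survival function with k degrees of freedom is the regularized upper incomplete gamma Q(k/2, x/2)):

import mpmath as mp

mp.mp.dps = 30                              # working precision in decimal digits
p = mp.gammainc(1, 9500, regularized=True)  # Q(k/2, x/2) with k=2, x=19000
print(p)                                    # about 1.59e-4126, matching the rpy2 result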
