I am endeavouring to perform a two sample hypothesis test in Python, having been given the original code in R.
The code in R is:-
prop.test(x=c(10,16), n=c(100,100))
#The p-value is 0.2931, which is greater than alpha=0.05,
#so we fail to reject the null hypothesis
I have tried to perform the same test with both the scipy and statsmodels libraries.
The code using scipy is:-
import numpy as np
import scipy.stats as stats
hats = np.array([[100,10], [100, 16]])
print("Hats scipy: ", stats.chi2_contingency(hats))`
#The p-value .368767 is greater than alpha=0.5, so we fail to reject the null hypothesis
The code using statsmodels is:-
import numpy as np
import statsmodels.stats.proportion as proportion
hat_a = 10
hat_b = 16
sample_a = 100
sample_b = 100
hats = np.array([hat_a, hat_b])
samples = np.array([sample_a,sample_b])
chisq, pvalue, table = proportion.proportions_chisquare(hats, samples)
print('Results are ','chisq =%.3f, pvalue = %.3f'%(chisq, pvalue))
#The p-value 0.207 is greater than alpha=0.05, so we fail to reject the null hypothesis
I have searched online for the correct way to perform a two-sample hypothesis test and have found a variety of ways to code it using both scipy and statsmodels.
My question is:-
Is there a hypothesis test I can perform in statsmodels or scipy that will give me the same result I achieved using R, namely a p-value of 0.2931?
I am new to statistics and probabilities, so any advice would be greatly appreciated in advance.
R's prop.test uses the Yates continuity correction by default (it can be turned off with correct=FALSE).
Therefore, to replicate it in Python, you need to apply that Yates continuity correction. This can be done with stats.chi2_contingency(). However, your array of observed values needs to be adjusted so that each cell of the RxC table (in this case, 2x2) holds a count: successes and failures for each group, rather than sample sizes and successes.
hats = np.array([[90,10], [84, 16]])
stats.chi2_contingency(hats, correction=True)[1]
Output:
0.2931241173597945
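For reference, a minimal sketch (variable names assumed) that builds that 2x2 table directly from the successes and sample sizes you pass to prop.test:
import numpy as np
import scipy.stats as stats
successes = np.array([10, 16])
samples = np.array([100, 100])
# rows are the two groups; columns are (successes, failures)
table = np.column_stack((successes, samples - successes))
# chi-square test with Yates continuity correction, as in R's prop.test
print(stats.chi2_contingency(table, correction=True)[1])  # ~0.2931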
I am working in Python and I have some performance data for a set of actions:
DailyReturn = [0.325, -0.287, ...]
I've been trying to fit a normal distribution and a Student's t-distribution to the density histogram of that data to use as a PDF. I would like to get the fitted parameters, the standard errors of the parameters, and the value of the log-likelihood by the method of MLE (maximum likelihood). But I have run into some issues. At the moment I have this idea:
import numpy as np
import math
import scipy.optimize as optimize
import statistics
def llnorm(par, data):
    # negative log-likelihood of a normal distribution with parameters par = (mu, sigma)
    mu, sigma = par
    ll = -np.sum(-0.5*math.log(2*math.pi*(sigma**2)) - ((data-mu)**2)/(2*(sigma**2)))
    return ll
data = np.asarray(DailyReturn)
result = optimize.minimize(llnorm, [statistics.mean(data), statistics.stdev(data)], args=(data,))
But I'm not sure, and I'm lost with the Student's t-distribution; is there an easier way to do it?
In scipy.stats you will find several distributions, among them Student's t and the normal.
These distribution objects have a fit method; you can see an example here for the normal distribution.
Your approach seems correct for the normal distribution, but there is little point in this case, since the MLE solution is essentially the mean and standard deviation you are already passing as starting values.
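A minimal sketch of that approach (synthetic data stands in for DailyReturn):
import numpy as np
from scipy import stats
data = np.random.standard_t(df=5, size=1000) * 0.3  # placeholder for DailyReturn
# MLE fit of a normal: returns (mu, sigma)
mu, sigma = stats.norm.fit(data)
# MLE fit of Student's t: returns (df, loc, scale)
df, loc, scale = stats.t.fit(data)
# log-likelihood of each fit, summed over the observations
ll_norm = np.sum(stats.norm.logpdf(data, mu, sigma))
ll_t = np.sum(stats.t.logpdf(data, df, loc, scale))
Note that fit returns point estimates only; standard errors would still have to come from the Hessian of the negative log-likelihood or from a bootstrap.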
Why is the p-value of the KS test between array x and array y less than 0.05? As you can see, they are actually drawn from the same distribution (a standard normal). I cannot find the reason and I'm very confused. Thanks in advance!
import scipy.stats as st
import numpy as np
np.random.seed(12)
x = np.random.normal(0,1,size=1000)
y = np.random.normal(0,1,size=1000)
st.ks_2samp(x,y)
Out[9]: KstestResult(statistic=0.066, pvalue=0.025633868930359294)
This is expected. Remember that a low p-value means you have grounds to reject the null hypothesis, which says that the two samples came from the same distribution; with alpha = 0.05, roughly 5% of sample pairs drawn from the same distribution will still produce p < 0.05 purely by chance (a Type I error), and this seed happens to be one of those cases. Also, rejecting the null hypothesis is not the same as affirming that the two samples came from different distributions; it just means you cannot conclude that they came from the same one.
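A quick illustrative sketch (not from the original post) of that false-positive rate: repeat the same experiment many times and count how often the test rejects at alpha = 0.05 even though both samples come from the same standard normal.
import scipy.stats as st
import numpy as np
np.random.seed(0)
n_trials = 1000
rejections = 0
for k in range(n_trials):
    x = np.random.normal(0, 1, size=1000)
    y = np.random.normal(0, 1, size=1000)
    if st.ks_2samp(x, y).pvalue < 0.05:
        rejections += 1
print(rejections / n_trials)  # close to 0.05, the nominal false-positive rate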
I generated two distributions using the following code:
rand_num1 = 2*np.random.randn(10000) + 1
rand_num2 = 2*np.random.randn(10000) + 1
stats.ks_2samp(rand_num1, rand_num2)
My question is: why do these two samples not test as coming from the same distribution based on the KS test and the chi-square test?
When I run a kstest on the 2 distributions I get:
Ks_2sampResult(statistic=0.019899999999999973, pvalue=0.037606196570126725)
which implies that the two distributions are statistically different. I use the following code to plot the CDF of the two distributions:
count1, bins = np.histogram(rand_num1, bins = 100)
count2, _ = np.histogram(rand_num2, bins = bins)
plt.plot(np.cumsum(count1), 'g-')
plt.plot(np.cumsum(count2), 'b.')
This is how the CDFs of the two distributions look.
When I run a chisquare test I get the following:
stats.chisquare(count1, count2) # Gives an nan output
stats.chisquare(count1+1, count2+1) # Outputs "Power_divergenceResult(statistic=180.59294741316694, pvalue=1.0484033143507713e-06)"
I have 3 questions below:
Even though the CDFs look the same and the data come from the same distribution, why do the KS test and the chi-square test both reject the hypothesis that the distributions are the same? Is there an underlying assumption that I am missing here?
Some counts are 0, and hence the first chisquare() call gives nan. Is it an accepted practice to just add a non-zero number to all counts to get a usable estimate?
Is there a KS test to test against non-standard distributions, say a normal with a non-zero mean and std != 1?
The CDF, in my humble opinion, is not a good curve to look at. It hides a lot of detail due to the fact that it is an integral: an outlier that sits well below the bulk can be compensated by another outlier well above it.
OK, let's take a look at the distribution of K-S results. I've run the test 100 times and plotted the statistic vs the p-value, and, as expected, in some cases there are (small p, large statistic) points.
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
np.random.seed(12345)
x = []
y = []
for k in range(0, 100):
    rand_num1 = 2.0*np.random.randn(10000) + 1.0
    rand_num2 = 2.0*np.random.randn(10000) + 1.0
    q = stats.ks_2samp(rand_num1, rand_num2)
    x.append(q.statistic)
    y.append(q.pvalue)
plt.scatter(x, y, alpha=0.1)
plt.show()
(Graph: K-S statistic vs p-value for the 100 runs)
UPDATE
In reality, if I run a test and see the test vs control distribution of my metric as shown in my plot, then I would want to be able to say that they are the same - are there any statistics or parameters around these tests that can tell me how close these distributions are?
Of course there are - and you're using one of those tests! K-S is the most general but also the weakest test. And as with any test you would use, there are ALWAYS cases where the test will say the samples come from different distributions even though you deliberately sampled them from the same routine. It is just the nature of these things: you'll get a yes or no with some confidence, but not much more. Look at the graph again for an illustration.
Concerning your exercises with chi2, I was sceptical from the beginning about using chi2 for such a task. To me, given the problem of making a decision about two samples, the test used should be explicitly symmetric. K-S is OK, but looking at the definition of chi2, it is NOT symmetric. A simple modification of your code
count1, bins = np.histogram(rand_num1, bins = 40, range=(-2.,2.))
count2, _ = np.histogram(rand_num2, bins = bins, range=(-2.,2.))
q = stats.chisquare(count2, count1)
print(q)
q = stats.chisquare(count1, count2)
print(q)
produces something like
Power_divergenceResult(statistic=87.645335824746468, pvalue=1.3298580128472864e-05)
Power_divergenceResult(statistic=77.582358201839526, pvalue=0.00023275129585256563)
Basically, it means the test may pass if you run it as (1,2) but fail if you run it as (2,1), which is not good, IMHO. Chi2 is fine as soon as you test against expected values from a known distribution curve - there the asymmetry of the test makes sense.
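That use case can be sketched as follows (a hedged example, not from the original answer): bin one sample and compare the observed counts against the expected counts implied by the known N(mean=1, sd=2) curve over the same range.
import numpy as np
from scipy import stats
rand_num1 = 2.0*np.random.randn(10000) + 1.0
count1, edges = np.histogram(rand_num1, bins=40, range=(-2.0, 2.0))
# expected counts under N(mean=1, sd=2), renormalised to the binned range
cdf = stats.norm(loc=1.0, scale=2.0).cdf(edges)
probs = np.diff(cdf) / (cdf[-1] - cdf[0])
expected = probs * count1.sum()
print(stats.chisquare(count1, f_exp=expected))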
I would advise trying the Anderson-Darling test, along these lines:
q = stats.anderson_ksamp([np.sort(rand_num1), np.sort(rand_num2)])
print(q)
But remember, it is the same as with K-S, some samples may fail to pass the test even if they are drawn from the same underlying distribution - this is just the nature of the beast.
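As a side note on question 3 above (a hedged sketch, not part of the original answer): scipy's one-sample kstest accepts a CDF callable, so you can test directly against a normal with a non-zero mean and sd != 1.
import numpy as np
from scipy import stats
rand_num1 = 2.0*np.random.randn(10000) + 1.0
# one-sample K-S test against the CDF of N(mean=1, sd=2)
print(stats.kstest(rand_num1, stats.norm(loc=1.0, scale=2.0).cdf))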
UPDATE: Some reading material
https://stats.stackexchange.com/questions/187016/scipy-chisquare-applied-on-continuous-data
To get the correlation between two arrays in python, I am using:
from scipy.stats import pearsonr
x, y = [1,2,3], [1,5,7]
cor, p = pearsonr(x, y)
However, as stated in the docs, the p-value returned by pearsonr() is only reasonable for datasets larger than 500 or so. So how can I get a p-value that is reasonable for small datasets?
My temporary solution:
After reading up on linear regression, I have come up with my own small script, which basically uses the Fisher transformation to get the z-score, from which the p-value is calculated:
import numpy as np
from scipy.stats import zprob
n = len(x)
# Fisher z-transform of the correlation, scaled by sqrt(n - 3)
z = np.log((1+cor)/(1-cor))*0.5*np.sqrt(n-3)
# zprob is the standard normal CDF, so this gives a one-sided p-value
p = zprob(-z)
It works. However, I am not sure if it is more reasonable than the p-value given by pearsonr(). Is there a Python module which already has this functionality? I have not been able to find it in SciPy or Statsmodels.
Edit to clarify:
The dataset in my example is simplified. My real dataset is two arrays of 10-50 values.
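For reference, a minimal sketch of the same Fisher-transform computation written with scipy.stats.norm instead of the older zprob helper (the array values are placeholders):
import numpy as np
from scipy.stats import norm, pearsonr
x, y = [1, 2, 3, 4, 5], [1, 5, 7, 6, 9]  # placeholder data
cor, p_pearson = pearsonr(x, y)
n = len(x)
# Fisher z-transform of the correlation, scaled by sqrt(n - 3)
z = 0.5 * np.log((1 + cor) / (1 - cor)) * np.sqrt(n - 3)
p_one_sided = norm.cdf(-z)        # same quantity as zprob(-z) above
p_two_sided = 2 * norm.sf(abs(z))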
I am using scipy.stats to fit my data.
scipy.stats.invgauss.fit(my_data_array)
scipy.stats.wald.fit(my_data_array)
From the wiki page http://en.wikipedia.org/wiki/Inverse_Gaussian_distribution it says that the Wald distribution is just another name for the inverse Gaussian, but the two functions above give me different fitting parameters: scipy.stats.invgauss.fit gives me three parameters and scipy.stats.wald.fit gives two.
What is the difference between these two distributions in scipy.stats?
I was trying to find the answer here, http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wald.html and http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.invgauss.html, but really no clue.
The link to the scipy.stats wald distribution has the answer to your question.
wald is a special case of invgauss with mu == 1.
So the following should produce the same answer:
import numpy as np
import scipy.stats as st
my_data = np.random.randn(1000)
# wald.fit returns (loc, scale)
wald_params = st.wald.fit(my_data)
# invgauss.fit returns (mu, loc, scale); f0=1 fixes the shape parameter mu at 1
invgauss_params = st.invgauss.fit(my_data, f0=1)
wald_params and invgauss_params are the same, except that invgauss has a 1 in front of the other two parameters; that is the shape parameter which, as the docs say, is fixed at one for the wald distribution (I fixed it with the argument f0=1).
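A quick check (with hypothetical values) that the two really do coincide once invgauss's shape parameter is fixed at 1:
import numpy as np
import scipy.stats as st
x = np.linspace(0.1, 5.0, 50)
# wald's pdf equals invgauss's pdf with mu=1 (default loc=0, scale=1)
print(np.allclose(st.wald.pdf(x), st.invgauss.pdf(x, 1)))  # True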
Hope that helps.