Python: p value from scipy is not correct - python

I want to calculate a p value with the scipy package. This is the code I used:
x = st.ttest_1samp(df_efw.stack(),np.round(np.mean(df_lw).mean(),2))
This is my output:
Ttest_1sampResult(statistic=-1.3939917717040629, pvalue=0.16382682901590806)
I also calculated it manually and my statistic value is correct, but the p value is not. The p value can be read from the standard normal distribution table.
So the problem is: if you read the table, you will see that -1.39399 has a p value of 0.0823 and not 0.1638. So I think that either I wrote the code wrong or I am interpreting something wrong. Which is it?

By default, ttest_1samp returns the two-sided or two-tailed p-value, which is twice the single-sided p-value due to the symmetry about 0 of the t distribution. Consistent with this, your manually computed single-sided p-value is (roughly) half of SciPy's p-value.
One solution is just to divide the two-sided p-value from ttest_1samp by 2. In SciPy 1.6.0 and later, you can pass the argument alternative='greater' or alternative='less' to get a single-sided p-value.
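As a minimal sketch of the relationship described above: the sample and the hypothesised mean of 5.0 are made-up stand-ins (not the asker's df_efw and df_lw), and the alternative argument assumes SciPy 1.6.0 or later. The one-sided alternative is chosen to match the sign of the statistic, because only then is the one-sided p-value half of the two-sided one.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only (an assumption, not the asker's data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=4.8, scale=1.0, size=50)
popmean = 5.0

# Default: two-sided p-value
two_sided = stats.ttest_1samp(sample, popmean)

# One-sided test in the direction of the observed statistic
direction = 'less' if two_sided.statistic < 0 else 'greater'
one_sided = stats.ttest_1samp(sample, popmean, alternative=direction)

# Manual one-sided p-value from the t distribution with n - 1 degrees of freedom
manual = stats.t.sf(abs(two_sided.statistic), df=len(sample) - 1)

# A standard normal table only approximates this value for finite samples
normal_approx = stats.norm.sf(abs(two_sided.statistic))

print(two_sided.pvalue, one_sided.pvalue, manual, normal_approx)
```

Note that a standard normal table gives only an approximation of the t-based p-value, which is why a manually read value may be roughly, not exactly, half of SciPy's result.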
Further Reading
ttest_1samp documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html
The GitHub issue where the alternative argument was proposed: https://github.com/scipy/scipy/pull/12597
The resulting pull request: https://github.com/scipy/scipy/pull/12597

Related

multivariate normal pdf with nan in mean

Is there an efficient implementation in Python to evaluate the PDF of a multivariate normal distribution when there are missing values in x? I guess the idea would just be that you'd effectively reduce the dimensionality to whatever number of available data points you had for a particular vector for which you are trying to evaluate the probability. But I can't figure out if the scipy implementation has a way to ignore masked values.
e.g.,
from scipy.stats import multivariate_normal as mvnorm
import numpy as np
means = [0.0,0.0,0.0]
cov = np.array([[1.0,0.2,0.2],[0.2,1.0,0.2],[0.2,0.2,1.0]])
d = mvnorm(means,cov)
x = [0.5,-0.2,np.nan]
d.pdf(x)
yields output:
nan
(as expected)
Is there a way to efficiently evaluate the PDF for only values that are present (in this case, making effectively 3D case into a bivariate case?) using this implementation?
This question is a bit tricky in terms of both math and code. Let me elaborate.
First, the code. scipy.stats does not offer nan-handling as you desire. Speedy code likely requires implementing the multivariate normal distribution PDF by hand and applying it to NumPy arrays directly. Leveraging vectorization is the only way to efficiently offer this functionality for large-scale datasets. On the other hand, the nan-tolerant function nanTol_pdf() below provides the desired functionality while staying true to the multivariate normal distribution as implemented in SciPy. You might find it sufficient for your use case.
import numpy as np
from scipy.stats import multivariate_normal as mvnorm

def nanTol_pdf(d, x):
    '''
    Return the density of the frozen multivariate normal d, evaluated
    only on the non-NaN entries of the input vector x.
    '''
    # accept lists as well as arrays so that x can be indexed below
    x = np.asarray(x, dtype=float)
    # check presence of nan entries
    if np.isnan(x).any():
        # indices of the entries that are present
        subIndex = np.flatnonzero(~np.isnan(x))
        # lower-dimensional multiv. Gaussian distribution
        # (use the parameters stored on d, not a global cov)
        lowDim_mean = np.asarray(d.mean)[subIndex]
        lowDim_cov = np.asarray(d.cov)[np.ix_(subIndex, subIndex)]
        lowDim_d = mvnorm(lowDim_mean, lowDim_cov)
        return lowDim_d.pdf(x[subIndex])
    else:
        return d.pdf(x)
Regardless, the fact that we can do it shouldn't stop us from thinking about whether we should.
Second, the math. Mathematically speaking, it is unclear what you are trying to achieve. In your example, SciPy returns nan because you query it with an ill-defined input vector x. Returning not a number (nan) for an undefined output seems the most appropriate answer. Jointly truncating the distribution d and the input vector x circumvents the numerical problem but opens up statistical questions, in particular because probability density function values cannot be understood as (conditional) probabilities. Moreover, the output alone conceals whether truncation was applied. Remember that nanTol_pdf() will happily return a non-negative real number as long as at least one entry in the vector is a real number. Your use case will decide whether this is reasonable.
Finally, I would suggest at least considering missing data imputation techniques before moving forward. Let me know if this helps.
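To make explicit what the truncation computes: for a multivariate normal, keeping only the entries of the mean and the rows and columns of the covariance that belong to the observed coordinates gives the marginal distribution of those coordinates. The sketch below is a self-contained variant that takes the mean and covariance directly; the name marginal_pdf is hypothetical and chosen only for illustration. It checks the asker's 3D example against the corresponding bivariate density.

```python
import numpy as np
from scipy.stats import multivariate_normal

def marginal_pdf(mean, cov, x):
    """Density of the observed (non-NaN) coordinates of x under N(mean, cov)."""
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    x = np.asarray(x, dtype=float)
    observed = ~np.isnan(x)  # boolean mask of present entries
    if not observed.any():
        raise ValueError("x has no observed entries")
    sub_mean = mean[observed]
    sub_cov = cov[np.ix_(observed, observed)]
    return multivariate_normal(sub_mean, sub_cov).pdf(x[observed])

means = [0.0, 0.0, 0.0]
cov = [[1.0, 0.2, 0.2], [0.2, 1.0, 0.2], [0.2, 0.2, 1.0]]

# Third coordinate missing: evaluated as a bivariate density
value = marginal_pdf(means, cov, [0.5, -0.2, np.nan])

# Reference: the bivariate normal built by hand
reference = multivariate_normal([0.0, 0.0], [[1.0, 0.2], [0.2, 1.0]]).pdf([0.5, -0.2])
print(value, reference)
```

Densities computed for vectors with different numbers of observed entries live on different dimensions, so comparing them directly is usually not meaningful.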

Inverse normal random number generation in python?

I've used random.normal() in the past to generate a single number that, if called multiple times, would in aggregate create a bell curve distribution. What I'm trying to do now is to create the opposite / inverse, where the distribution is biased towards the extremes within a range. There are built-in functions in Excel that seem to do what I want. Is there a way to do it in Python? Thank you
It appears you want a distribution with an "upside-down bell curve"
compared to the normal distribution. If so, then the following method
implements this distribution via rejection sampling and a modified
version of the standard normal distribution. 'x0' and 'x1' are the lower
and upper bounds of the range of numbers to generate.
import math
import random

def invertedNormal(x0, x1):
    # Get the ends of the PDF (the bounding
    # box will cover the PDF at the given range)
    x0pdf = 1-math.exp(-(x0*x0))
    x1pdf = 1-math.exp(-(x1*x1))
    ymax = max(x0pdf, x1pdf)
    while True:
        # Choose a random x-coordinate
        x=random.random()*(x1-x0)+x0
        # Choose a random y-coordinate
        y=random.random()*ymax
        # Return x if y falls within PDF
        if y < 1-math.exp(-(x*x)):
            return x
You'd have to decide which probability distribution to use.
If you want to use pure Python, without external dependencies, check which distributions are available in the Random module: https://docs.python.org/3/library/random.html
For example you might use a Beta distribution with parameters (0.5, 0.5): https://docs.python.org/3/library/random.html#random.betavariate
See the Wikipedia page for beta distribution to understand the parameters: https://en.wikipedia.org/wiki/Beta_distribution
For advanced use, the external package scipy is the usual way to access probability distributions within Python: https://docs.scipy.org/doc/scipy/reference/stats.html
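As a rough sketch of the Beta(0.5, 0.5) suggestion using only the standard library: the helper name u_shaped and the range of -3 to 3 are assumptions made for illustration. The draw from random.betavariate lies in [0, 1] and is rescaled linearly to the requested range.

```python
import random

def u_shaped(x0, x1, rng=random):
    """Draw from a Beta(0.5, 0.5) distribution rescaled to [x0, x1]."""
    return x0 + (x1 - x0) * rng.betavariate(0.5, 0.5)

# Seeded generator so the illustration is repeatable
rng = random.Random(42)
samples = [u_shaped(-3.0, 3.0, rng) for _ in range(10_000)]

# Share of draws falling in the outer 10% of the range at either end;
# a uniform distribution would put about 20% there
outer = sum(1 for s in samples if s < -2.4 or s > 2.4) / len(samples)
print(outer)
```

Smaller shape parameters push more mass toward the two ends, so they can be tuned if the bias toward the extremes should be stronger or weaker.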
It sounds like what you want is to shift the distribution to the edge of the range and wrap around?
Something like this could do what you're looking for:
num = (random.normal() + 0.5) % 1

Combining p values using scipy

I have to combine p values and get one p value.
I'm using the scipy.stats.combine_pvalues function, but it gives a very small combined p value. Is that normal?
e.g.:
>>> import scipy.stats
>>> p_values_list=[8.017444955844044e-06, 0.1067379119652372, 5.306374345615846e-05, 0.7234201655194492, 0.13050605094545614, 0.0066989543716175, 0.9541246420333787]
>>> test_statistic, combined_p_value = scipy.stats.combine_pvalues(p_values_list, method='fisher',weights=None)
>>> combined_p_value
4.331727536209026e-08
As you see, combined_p_value is smaller than any given p value in the p_values_list.
How can it be?
Thanks in advance,
Burcak
It is correct, because you are testing whether all of your p-values come from a random uniform distribution, which is what you would expect if every individual null hypothesis were true. The alternative hypothesis is that at least one of the individual alternatives is true, which in your case is very plausible.
We can simulate this by drawing 1000 sets of p-values from a random uniform distribution, each set with the same length as your list of p-values:
import numpy as np
from scipy.stats import combine_pvalues
from matplotlib import pyplot as plt
random_p = np.random.uniform(0,1,(1000,len(p_values_list)))
res = np.array([combine_pvalues(i,method='fisher',weights=None) for i in random_p])
# column 0 of res holds the simulated chi-square statistics, column 1 the combined p-values
plt.hist(res[:, 0])
From your results, the chi-square is 62.456, which is really huge and nowhere near the simulated chi-square values above.
One thing to note is that the combining you did here does not take directionality into account. If that is possible in your test, you might want to consider using Stouffer's Z along with weights. Another sanity check is to run a simulation like the one above to generate lists of p-values under the null hypothesis and see how they differ from what you observed.
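For reference, the chi-square value quoted above can be reproduced by hand. The sketch below applies Fisher's method directly: the statistic is minus twice the sum of the log p-values, and under the null hypothesis it follows a chi-square distribution with 2k degrees of freedom for k p-values. It then compares the result with combine_pvalues.

```python
import numpy as np
from scipy.stats import chi2, combine_pvalues

p_values_list = [8.017444955844044e-06, 0.1067379119652372,
                 5.306374345615846e-05, 0.7234201655194492,
                 0.13050605094545614, 0.0066989543716175,
                 0.9541246420333787]

# Fisher's statistic: -2 * sum(ln p_i)
statistic = -2.0 * np.sum(np.log(p_values_list))

# Under the null, the statistic is chi-square with 2k degrees of freedom
dof = 2 * len(p_values_list)
combined_p = chi2.sf(statistic, dof)

# Same computation through SciPy
scipy_stat, scipy_p = combine_pvalues(p_values_list, method='fisher')
print(statistic, combined_p, scipy_stat, scipy_p)
```

Because the two tiny p-values each contribute a large log term, the statistic ends up far in the upper tail of the chi-square distribution, which is why the combined p-value is smaller than any single input.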
Interesting paper but maybe a bit on the statistics side
I am by no means an expert in this field, but I am interested in your question. After some reading on Wikipedia, it seems to me that combined_p_value tells you the likelihood of all p-values in the list being obtained under the same null hypothesis, which is very unlikely considering the two extremely small values.
Your set has two extremely small values: 1st and 3rd. If the thought process I described is correct, removing any of them should yield a much higher p-value, which is indeed the case:
remove 1st: p-value of 0.00010569305282803985
remove 3rd: p-value of 2.4713196031837724e-05
In conclusion, I think that this is a correct way of interpreting the meta-analysis that combine_pvalues actually describes.

Chi-Square test for groups of unequal size

I'd like to apply the chi-square test scipy.stats.chisquare, but the total number of observations is different in my groups.
import pandas as pd
data={'expected':[20,13,18,21,21,29,45,37,35,32,53,38,25,21,50,62],
'observed':[19,10,15,14,15,25,25,20,26,38,50,36,30,28,59,49]}
data=pd.DataFrame(data)
print(data.expected.sum())
print(data.observed.sum())
Ignoring this is incorrect, right?
Does the default behavior of scipy.stats.chisquare take this into account? I checked with pen and paper and it looks like it doesn't. Is there a parameter for this?
from scipy.stats import chisquare
# incorrect since the number of observations is unequal
chisquare(f_obs=data.observed, f_exp=data.expected)
When I do a manual adjustment I get a slightly different result.
# adjust actual number of observations
data['obs_prop']=data['observed'].apply(lambda x: x/data['observed'].sum())
data['observed_new']=data['obs_prop']*data['expected'].sum()
# proper way
chisquare(f_obs=data.observed_new, f_exp=data.expected)
Please correct me if I am wrong at some point. Thanks.
ps: I tagged R for additional statistical expertise
Basically this was a different statistical problem - Chi-square test of independence of variables in a contingency table.
from scipy.stats import contingency as cont
chi2, p, dof, exp=cont.chi2_contingency(data)
p
I didn't quite understand the question. However, the way I see it, you can use scipy.stats.chi2_contingency if you want to compute the independence test between two categorical variables.
scipy.stats.chisquare can also be used to compare observed vs expected frequencies. Here the number of categories should be the same. Logically, a category would get a frequency of 0 if there is an observed frequency but no expected frequency for it, and vice versa.
Hope this helps
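If a goodness-of-fit test is what is wanted, one common adjustment is to rescale the expected counts so that they sum to the observed total, which treats the expected column as proportions. This is a sketch under that assumption, not a statement that it is the right model for the data in the question. As far as I know, recent SciPy versions raise a ValueError when the two totals disagree, so some rescaling is needed before calling chisquare.

```python
import numpy as np
from scipy.stats import chisquare

expected = np.array([20, 13, 18, 21, 21, 29, 45, 37,
                     35, 32, 53, 38, 25, 21, 50, 62], dtype=float)
observed = np.array([19, 10, 15, 14, 15, 25, 25, 20,
                     26, 38, 50, 36, 30, 28, 59, 49], dtype=float)

# Assumption: expected counts only encode proportions, so rescale them
# to the observed total before testing
expected_scaled = expected * observed.sum() / expected.sum()

result = chisquare(f_obs=observed, f_exp=expected_scaled)
print(result.statistic, result.pvalue)
```

Rescaling the observed counts to the expected total, as in the question, changes the effective sample size and therefore the statistic, which would explain a different result.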

Different result using welch function between Matlab and Python

When I run the welch function on the same data in Matlab and Python, I get slightly different PSD estimates, while the sample frequencies are identical.
Here are the parameters I used in both Matlab and Python:
Matlab:
winlength=512;
overlap=0;
fftlength=1024;
fs=127.886;
[PSD, freqs] = pwelch(data, winlength, overlap, fftlength, fs);
Python:
freqs, PSD = welch(data, fs=127.886, window='hamming', nperseg=512,
                   noverlap=None, nfft=1024)
Here's a plot presenting the difference:
[plot not included]
Does anyone have any idea what I should change to get the same PSD results?
The Matlab documentation (https://se.mathworks.com/help/signal/ref/pwelch.html) says that the overlap parameter has to be a positive integer, thus 0 is not a valid value.
If you omit the overlap value (or pass a non-valid value), the parameter is automatically set to a 50% overlap, which changes the curve.
Try setting the Python function to a 50% overlap and see if they match.
BTW, you rarely want to have zero overlap, as this is likely to cause transients in the signal.
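On the Python side, the overlap can be checked directly. In scipy.signal.welch, noverlap=None means nperseg // 2, i.e. a 50% overlap, so the call in the question already uses 50% rather than zero. The sketch below uses synthetic noise as a stand-in for the asker's signal (an assumption for illustration) and compares the three settings.

```python
import numpy as np
from scipy.signal import welch

# Synthetic stand-in for the asker's data (assumption for illustration)
rng = np.random.default_rng(0)
data = rng.standard_normal(4096)

common = dict(fs=127.886, window='hamming', nperseg=512, nfft=1024)

# noverlap=None falls back to nperseg // 2 (50% overlap)
f_default, psd_default = welch(data, noverlap=None, **common)
f_half, psd_half = welch(data, noverlap=256, **common)

# Explicit zero overlap
f_zero, psd_zero = welch(data, noverlap=0, **common)

print(np.allclose(psd_default, psd_half))   # same estimate
print(np.array_equal(f_default, f_zero))    # same frequency grid
print(np.allclose(psd_default, psd_zero))   # different estimate
```

SciPy also removes the mean of each segment by default (detrend='constant'), which may be another setting worth aligning when comparing against Matlab.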
