Kolmogorov-Smirnov test for uniformity giving unexpected results - python

I'm trying to check whether my random variables come from a uniform distribution. In my case, though, I have arrays like [0, 0, 0.44, 0, ... 0] that I need to check. This vector looks somewhat uniform to me, but the KS test doesn't seem to agree.
Here's what I do:
import numpy as np
from scipy.stats import kstest, uniform

x = np.zeros(30)
x[3] = 0.44  # the kind of array I'm talking about
ks_statistic, p_value = kstest(x, 'uniform')
print(ks_statistic, p_value)
And here's what I get:
0.9666666666666667 9.713871499237658e-45
So the verdict is 'not uniform'. As far as I understand the math behind the KS test, it is possible that my almost-all-zeros array is far from anything stats.uniform would generate for comparison, so the distance between the distributions is huge. But there is also a good chance that I've got something wrong.
What do I do in order to test my variables for uniformity correctly?
Update: I checked out the Chi-square test and I got something a lot more like what I expected.
from scipy.stats import chisquare
stat, p_value = chisquare(x)
print(stat, p_value)
>>> 12.760000000000002 0.9960743669884126
So my question is why are these results so different and what am I missing?

I see where I'm wrong, that was easy.
What I claimed looks like uniformly distributed data looks nothing like it. With a continuous uniform distribution you essentially never see the same value twice, and my array is zeros almost everywhere. That is also where the KS statistic comes from: the empirical CDF of my array jumps to 29/30 ≈ 0.9667 at zero, while the uniform CDF is still 0 there, and that gap is exactly the statistic kstest reports.
I still don't get the difference between the p-values for KS and for Chi^2, though. But that's probably another question.
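For illustration, here is a minimal sketch (my own addition, not from the original post) contrasting a genuinely uniform sample with the mostly-zero array; the exact p-value for the uniform sample depends on the random draw:
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)

# A genuinely uniform sample: kstest should return a large p-value.
uniform_sample = rng.uniform(0, 1, size=30)
print(kstest(uniform_sample, 'uniform'))

# The mostly-zero array from the question: kstest rejects uniformity.
x = np.zeros(30)
x[3] = 0.44
print(kstest(x, 'uniform'))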

Related

Question regarding differences in calculating T statistics in Python for the difference in means

I am re-learning introductory statistics and wanted to try implementing my own versions of the general and unpooled formulas that find the T value. I implemented it in two ways: one simply replicates the formulas as Python functions, and the other uses Python's ability to generate a normal distribution and uses that to find the difference in means. But I noticed my values were pretty different in the two versions. So my question is: why is there a difference? Is it something about how the function itself works?
Here's the "generate a distribution itself" method:
from numpy.random import seed
from numpy.random import normal
from scipy import stats
from datetime import datetime
import math
# Plan: generate 2 random normal distributions with the desired criteria, and t-test them
data1 = normal(loc=65.2, scale=7.8, size=30)
data2 = normal(loc=70.3, scale=8.4, size=30)
stats.ttest_ind(a=data1, b=data2)
Ttest_indResult(statistic=-2.029830829733737, pvalue=0.04696953433513939)
As you can see, it gives a T statistic of ~-2.0298 and a p value of ~ 0.0470.
Here's my "manual version":
def pop_2_mean_pooled_t(mean1, mean2, s1, s2, n1, n2):
    dof = (n1 + n2) - 2
    mean_diff = mean1 - mean2
    # The N part on the right
    right_n = math.sqrt((1/n1) + (1/n2))
    # The Sp part
    sp_numerator_left = (n1 - 1) * (s1**2)
    sp_numerator_right = (n2 - 1) * (s2**2)
    sp = math.sqrt((sp_numerator_left + sp_numerator_right) / dof)
    pooled_sp = sp * right_n
    t = mean_diff / pooled_sp
    p = stats.t.cdf(t, dof)
    print("T is " + str(t))
    print("p is " + str(p))
    return t, p
pop_2_mean_pooled_t(65.2, 70.3, 7.8, 8.4, 30, 30)
T is -2.4368742610942298
p is 0.00895208222413155
(-2.4368742610942298, 0.00895208222413155)
As you can see, it gives a T statistic of ~-2.439 and a p value of ~ 0.009.
My question is: why is there a discrepancy here? My "manual version" is closer to the example I was referencing, but surely the generated version should be close too?
My understanding is that if a sample is large enough, it will resemble a normal distribution. Therefore, one could generate a normal distribution using code and use that to approximate the corresponding T values. For some reason, that differed quite a bit from my "manual" version.
Your thinking is basically correct (I did not check your formulae, though). What you're encountering is in the nature of the problem: the two random samples you're drawing are, well, random, and they differ between runs, so you will always get a different p-value and t-statistic.
Two suggestions from me:
increase the sample size in the first snippet to hundreds (not 30): you should get much closer to the stats from the second snippet.
keep 30 samples in the first snippet but run the simulation several times; you will learn the distributions of p-values and t-statistics and, again, you can check the values from your second snippet against the simulated distributions (see the sketch below).
(Some conceptual flaws occur in this approach, e.g. repeated testing affects the p-value, but let us put them aside for now; the goal is to see your two sets of values converge.)
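Here is a minimal sketch of that second suggestion (my own illustration, not part of the original answer): repeat the draw-and-test loop many times and look at the spread of the resulting statistics.
import numpy as np
from numpy.random import normal
from scipy import stats

# Repeat the two-sample draw many times and collect the test results.
t_values, p_values = [], []
for _ in range(5000):
    data1 = normal(loc=65.2, scale=7.8, size=30)
    data2 = normal(loc=70.3, scale=8.4, size=30)
    t, p = stats.ttest_ind(a=data1, b=data2)
    t_values.append(t)
    p_values.append(p)

print("t-statistic: mean %.3f, std %.3f" % (np.mean(t_values), np.std(t_values)))
print("p-value:     mean %.3f, std %.3f" % (np.mean(p_values), np.std(p_values)))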

Demonstrating the Universality of the Uniform using numpy - an issue with transformation

Recently I wanted to demonstrate generating a continuous random variable using the universality of the Uniform. For that, I wanted to use the combination of numpy and matplotlib. However, the generated random variable seems a little bit off to me, and I don't know whether that is caused by the way NumPy's random.uniform and vectorize work or whether I am doing something fundamentally wrong here.
Let U ~ Unif(0, 1) and X = F^-1(U). Then X is a random variable with CDF F (note that F^-1 here denotes the quantile function; I also omit the second part of the universality statement because it will not be needed).
Let's assume that the CDF of interest to me is the (simplified) logistic CDF, F(x) = 1 / (1 + exp(-x)); then the quantile function is F^-1(u) = log(u / (1 - u)).
According to the universality of the uniform, to generate such a random variable it is enough to plug U ~ Unif(0, 1) into F^-1. Therefore, I've written a very simple code snippet for that:
import numpy as np

U = np.random.uniform(0, 1, 1000000)

def logistic(u):
    # quantile function of the standard logistic distribution
    x = np.log(u / (1 - u))
    return x

logistic_transform = np.vectorize(logistic)
X = logistic_transform(U)
However, the result seems a little bit off to me. Although the histogram of the generated random variable X resembles a logistic distribution (whose simplified CDF I've used), the variable seems to be distributed very unevenly, and I can't wrap my head around exactly why that is. I would be grateful for any suggestions. Below are the histograms of U and X.
You have a large sample size, so you can increase the number of bins in your histogram and still get a good number of samples per bin. If you are using matplotlib's hist function, try (for example) bins=400. I get this plot, which has the symmetry that I think you expected:
Also (this is not relevant to the question), your function logistic will handle a NumPy array without wrapping it in vectorize, so you can save a few CPU cycles by writing X = logistic(U). And you can save a few lines of code by using scipy.special.logit instead of implementing it yourself.
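A quick sketch of those two suggestions combined (my own illustration; bins=400 is just the value suggested above):
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import logit

U = np.random.uniform(0, 1, 1000000)

# logit(u) = log(u / (1 - u)), applied elementwise, so no np.vectorize is needed.
X = logit(U)

plt.hist(X, bins=400)
plt.title("Histogram of X = logit(U)")
plt.show()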

How to generate a Q-Q plot manually without inverse distribution function in python

I have 4 different distributions which I've fitted to a sample of observations. Now I want to compare my results and find the best solution. I know there are a lot of different methods to do that, but I'd like to use a quantile-quantile (q-q) plot.
The formulas for my 4 distributions are:
where K0 is the modified Bessel function of the second kind and zeroth order, and Γ is the gamma function.
My sample style looks roughly like this: (0.2, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4, 0.6, 0.7 ...), so I have multiple identical values and also gaps in between them.
I've read the instructions on this site and tried to implement them in python. So, like in the link:
1) I sorted my data from the smallest to the largest value.
2) I computed "n" evenly spaced points on the interval (0,1), where "n" is my sample size.
3) And this is the point I can't manage.
As far as I understand, I should now use the values I calculated beforehand (those evenly spaced values), put them in the inverse functions of my above distributions and thus compute the theoretical quantiles of my distributions.
For reference, here are the inverse functions (partly calculated with WolframAlpha, as far as that was possible):
where W is the Lambert W-function and everything in brackets afterwards is the argument.
The problem is that there apparently is no inverse function for the first distribution. The next one would probably produce complex values (negative under the root, because b = 0.55 according to the fit), and the last two involve a Lambert W-function (and I'm unsure how to implement that in python).
So my question is, is there a way to calculate the q-q plots without the analytical expressions of the inverse distribution functions?
I'd appreciate any help you could give me very much!
A simpler and more conventional way to go about this is to compute the log likelihood for each model and choose the one that has the greatest log likelihood. You don't need the cdf or quantile function for that, only the density function, which you already have.
The log likelihood is just the sum of log p(x|model), where p(x|model) is the probability density of datum x under a given model. Here "model" means the model with parameters selected by maximizing the log likelihood over the possible parameter values.
You can be more careful about this by integrating the log likelihood over the parameter space, taking into account any prior probability assigned to each model; that would be a Bayesian approach.
It sounds like you are essentially looking to choose a model by minimizing the Kolmogorov-Smirnov (KS) statistic, which, despite its heavy name, is pretty simple: it is the maximum difference between the proposed CDF and the empirical CDF. That's defensible, but I think comparing log likelihoods is more conventional, and also simpler since you only need the pdf.
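A minimal sketch of that log-likelihood comparison (my own illustration; pdf_a and pdf_b are hypothetical placeholder densities, not the question's fitted models):
import numpy as np

# Hypothetical fitted densities, stand-ins for the question's four models.
def pdf_a(x):
    return 1.5 * np.exp(-1.5 * x)   # exponential density with rate 1.5

def pdf_b(x):
    return 0.5 * np.exp(-0.5 * x)   # exponential density with rate 0.5

sample = np.array([0.2, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4, 0.6, 0.7])

# Log likelihood = sum of log densities; prefer the model with the largest value.
for name, pdf in [("model A", pdf_a), ("model B", pdf_b)]:
    log_likelihood = np.sum(np.log(pdf(sample)))
    print(name, log_likelihood)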
It happens that there is an easier way. It's taken me a day or two to dig around until I was pointed toward the right method in scipy.stats. I was looking for the wrong sort of name!
First, build a subclass of rv_continuous to represent one of your distributions. We know the pdf for your distributions, so that's what we define. In this case there's just one parameter. If more are needed just add them to the def statement and use them in the return statement as required.
>>> from scipy import stats
>>> param = 3/2
>>> from math import exp
>>> class NoName(stats.rv_continuous):
...     def _pdf(self, x, param):
...         return param*exp(-param*x)
...
Now create an instance of this object, declare the lower end of its support (ie, the lowest value that the r.v. can assume), and what the parameters are called.
>>> noname = NoName(a=0, shapes='param')
I don't have an actual sample of values to play with. I'll create a pseudo-random sample.
>>> sample = noname.rvs(size=100, param=param)
Sort it to make it into the so-called 'empirical cdf'.
>>> empirical_cdf = sorted(sample)
The sample has 100 elements, so generate 100 points at which to sample the inverse cdf (quantile function), as discussed in the paper you referenced.
>>> theoretical_points = [(_-0.5)/len(sample) for _ in range(1, 1+len(sample))]
Get the quantile function values at these points.
>>> theoretical_cdf = [noname.ppf(_, param=param) for _ in theoretical_points]
Plot it all.
>>> from matplotlib import pyplot as plt
>>> plt.plot([0,3.5], [0, 3.5], 'b-')
[<matplotlib.lines.Line2D object at 0x000000000921B400>]
>>> plt.scatter(empirical_cdf, theoretical_cdf)
<matplotlib.collections.PathCollection object at 0x000000000921BD30>
>>> plt.show()
Here's the Q-Q plot that results.
Darn it ... Sorry, I was fixated on a slick solution that would somehow bypass the missing inverse CDF and calculate the quantiles directly (and avoid any numerical approaches). But it can also be done by simple brute force.
First you have to define the quantiles for your distributions yourself (for instance ten times more finely than the original/empirical quantiles). Then you calculate the corresponding CDF values. Then you compare these values one by one with the ones calculated in step 2 of the question. The quantiles whose CDF values show the smallest deviations are the ones you were looking for.
The precision of this solution is limited by the resolution of the quantiles you defined yourself.
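Here is a minimal sketch of that brute-force inversion (my own illustration; the cdf below is a hypothetical placeholder, not one of the question's fitted distributions):
import numpy as np

def cdf(x):
    # Hypothetical fitted CDF; replace with one of the actual distributions.
    return 1.0 - np.exp(-1.5 * x)

# A fine grid of candidate quantiles and their CDF values.
grid = np.linspace(0.0, 10.0, 100001)
grid_cdf = cdf(grid)

def approx_quantile(p):
    # Pick the grid point whose CDF value is closest to the target probability p.
    return grid[np.argmin(np.abs(grid_cdf - p))]

n = 11  # size of the example sample in the question
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical_quantiles = [approx_quantile(p) for p in probs]
print(theoretical_quantiles)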
But maybe I'm wrong and there is a more elegant way to solve this problem, then I would be happy to hear it!

Using Scipy's stats.kstest module for goodness-of-fit testing

I've read through existing posts about this module (and the Scipy docs), but it's still not clear to me how to use Scipy's kstest module to do a goodness-of-fit test when you have a data set and a callable function.
The PDF I want to test my data against isn't one of the standard scipy.stats distributions, so I can't just call it using something like:
kstest(mydata,'norm')
where mydata is a Numpy array. Instead, I want to do something like:
kstest(mydata,myfunc)
where 'myfunc' is the callable function. This doesn't work, which is unsurprising, since there's no way for kstest to know what the abscissa for the 'mydata' array is in order to generate the corresponding theoretical frequencies using 'myfunc'. Suppose the frequencies in 'mydata' correspond to the values of the random variable stored in the array 'abscissa'. Then I thought maybe I could use stats.ks_2samp:
ks_2samp(mydata,myfunc(abscissa))
but I don't know if that's statistically valid. (Sidenote: do kstest and ks_2samp expect frequency arrays to be normalized to one, or do they want the absolute frequencies?)
In any case, since the one-sample KS test is supposed to be used for goodness-of-fit testing, I have to assume there's some way to do it with kstest directly. How do you do this?
Some examples may shed some light on how to use scipy.stats.kstest. Let's first set up some test data, e.g. normally distributed with mean 5 and standard deviation 10:
>>> import numpy as np
>>> import scipy.stats
>>> data = scipy.stats.norm.rvs(loc=5, scale=10, size=(1000,))
To run kstest on these data we need a function f(x) that takes an array of quantiles, and returns the corresponding value of the cumulative distribution function. If we reuse the cdf function of scipy.stats.norm we could do:
>>> scipy.stats.kstest(data, lambda x: scipy.stats.norm.cdf(x, loc=5, scale=10))
(0.019340993719575206, 0.84853828416694665)
The above would normally be run with the more convenient form:
>>> scipy.stats.kstest(data, 'norm', args=(5, 10))
(0.019340993719575206, 0.84853828416694665)
If we have uniformly distributed data, it is easy to build the cdf by hand:
>>> data = np.random.rand(1000)
>>> scipy.stats.kstest(data, lambda x: x)
(0.019145675289412523, 0.85699937276355065)
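To connect this back to the original question, here is a hedged sketch of one way to test against a distribution whose pdf is only available as a callable: build the cdf by numerical integration and pass that callable to kstest. The density my_pdf below is a hypothetical placeholder, not the asker's actual model:
import numpy as np
from scipy import stats
from scipy.integrate import quad

def my_pdf(x):
    # Hypothetical normalized density on [0, inf); stand-in for the real model.
    return 2.0 * np.exp(-2.0 * x)

def my_cdf(values):
    # Integrate the pdf numerically up to each value to obtain CDF values.
    return np.array([quad(my_pdf, 0.0, v)[0] for v in np.atleast_1d(values)])

sample = stats.expon.rvs(scale=0.5, size=200)  # fake data just for the demo
print(stats.kstest(sample, my_cdf))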
As for ks_2samp, it tests the null hypothesis that both samples are drawn from the same probability distribution.
You can do, for example:
>>> from scipy.stats import ks_2samp
>>> import numpy as np
where x and y are two numpy arrays:
>>> ks_2samp(x, y)
(0.022999999999999909, 0.95189016804849658)
The first value is the test statistic and the second value is the p-value. If the p-value is greater than your significance level (for example 0.05 for a 5% level of significance), you cannot reject the null hypothesis that the two samples are drawn from the same distribution.
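A self-contained sketch of that usage (my own example; the numbers will differ from the output shown above):
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=500)   # sample from N(0, 1)
y = rng.normal(loc=0.0, scale=1.0, size=500)   # second sample from the same distribution
print(ks_2samp(x, y))   # large p-value: no evidence against a common distribution

z = rng.normal(loc=0.5, scale=1.0, size=500)   # sample from a shifted distribution
print(ks_2samp(x, z))   # small p-value: the null hypothesis is rejected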

Tracking down the assumptions made by SciPy's `ttest_ind()` function

I'm trying to write my own Python code to compute t-statistics and p-values for one and two tailed independent t tests. I can use the normal approximation, but for the moment I am trying to just use the t-distribution. I've been unsuccessful in matching the results of SciPy's stats library on my test data. I could use a fresh pair of eyes to see if I'm just making a dumb mistake somewhere.
Note, this is cross-posted from Cross-Validated because it's been up for a while over there with no responses, so I thought it can't hurt to also get some software developer opinions. I'm trying to understand if there's an error in the algorithm I'm using, which should reproduce SciPy's result. This is a simple algorithm, so it's puzzling why I can't locate the mistake.
My code:
import numpy as np
import scipy.stats as st
def compute_t_stat(pop1, pop2):
    num1 = pop1.shape[0]
    num2 = pop2.shape[0]
    # The formula for the t-stat when population variances differ.
    t_stat = (np.mean(pop1) - np.mean(pop2)) / np.sqrt(np.var(pop1)/num1 + np.var(pop2)/num2)
    # ADDED: The Welch-Satterthwaite degrees of freedom.
    df = ((np.var(pop1)/num1 + np.var(pop2)/num2)**2.0) / ((np.var(pop1)/num1)**2.0/(num1-1) + (np.var(pop2)/num2)**2.0/(num2-1))
    # Am I computing this wrong?
    # It should just come from the CDF like this, right?
    # The extra parameter is the degrees of freedom.
    one_tailed_p_value = 1.0 - st.t.cdf(t_stat, df)
    two_tailed_p_value = 1.0 - (st.t.cdf(np.abs(t_stat), df) - st.t.cdf(-np.abs(t_stat), df))
    # Computing with SciPy's built-ins
    # My results don't match theirs.
    t_ind, p_ind = st.ttest_ind(pop1, pop2)
    return t_stat, one_tailed_p_value, two_tailed_p_value, t_ind, p_ind
Update:
After reading a bit more on the Welch's t-test, I saw that I should be using the Welch-Satterthwaite formula to calculate degrees of freedom. I updated the code above to reflect this.
With the new degrees of freedom, I get a closer result. My two-sided p-value is off by about 0.008 from the SciPy version's... but this is still much too big an error so I must still be doing something incorrect (or SciPy distribution functions are very bad, but it's hard to believe they are only accurate to 2 decimal places).
Second update:
While continuing to try things, I thought maybe SciPy's version automatically computes the Normal approximation to the t-distribution when the degrees of freedom are high enough (roughly > 30). So I re-ran my code using the Normal distribution instead, and the computed results are actually further away from SciPy's than when I use the t-distribution.
Bonus question :)
(More statistical theory related; feel free to ignore)
Also, the t-statistic is negative. I was just wondering what this means for the one-sided t-test. Does this typically mean that I should be looking in the negative axis direction for the test? In my test data, population 1 is a control group who did not receive a certain employment training program. Population 2 did receive it, and the measured data are wage differences before/after treatment.
So I have some reason to think that the mean for population 2 will be larger. But from a statistical theory point of view, it doesn't seem right to concoct a test this way. How could I have known to check (for the one-sided test) in the negative direction without relying on subjective knowledge about the data? Or is this just one of those frequentist things that, while not philosophically rigorous, needs to be done in practice?
By using SciPy's built-in source() function, I could see a printout of the source code for ttest_ind(). Based on that source code, the SciPy built-in performs the t-test assuming that the variances of the two samples are equal; it does not use the Welch-Satterthwaite degrees of freedom. SciPy assumes equal variances but does not state this assumption.
I just want to point out that, crucially, this is why you should not just trust library functions. In my case, I actually do need the t-test for populations of unequal variances, and the degrees-of-freedom adjustment might matter for some of the smaller data sets I will run this on.
As I mentioned in some comments, the discrepancy between my code and SciPy's is about 0.008 for sample sizes between 30 and 400, and then slowly goes to zero for larger sample sizes. This is an effect of the extra (1/n1 + 1/n2) term in the equal-variances t-statistic denominator. Accuracy-wise, this is pretty important, especially for small sample sizes. It definitely confirms that I need to write my own function. (Possibly there are other, better Python libraries, but this at least should be known. Frankly, it's surprising this isn't front and center in the SciPy documentation for ttest_ind().)
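Side note (not part of the original answer): more recent SciPy releases expose this choice directly through ttest_ind's equal_var flag; passing equal_var=False performs Welch's unequal-variances test, so writing your own function may no longer be necessary.
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
pop1 = rng.normal(loc=0.0, scale=1.0, size=40)
pop2 = rng.normal(loc=0.5, scale=2.0, size=60)

print(st.ttest_ind(pop1, pop2))                   # pooled (equal-variances) t-test
print(st.ttest_ind(pop1, pop2, equal_var=False))  # Welch's t-test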
You are not calculating the sample variance; you are using the population variance. The sample variance divides by n-1 instead of n. np.var has an optional argument called ddof for exactly this reason.
This should give you your expected result:
import numpy as np
import scipy.stats as st
def compute_t_stat(pop1, pop2):
    num1 = pop1.shape[0]
    num2 = pop2.shape[0]
    # Sample variances (ddof=1 divides by n-1 rather than n).
    var1 = np.var(pop1, ddof=1)
    var2 = np.var(pop2, ddof=1)
    # The t-statistic when the population variances differ.
    t_stat = (np.mean(pop1) - np.mean(pop2)) / np.sqrt(var1/num1 + var2/num2)
    # The Welch-Satterthwaite degrees of freedom.
    df = ((var1/num1 + var2/num2)**2.0) / ((var1/num1)**2.0/(num1-1) + (var2/num2)**2.0/(num2-1))
    # One- and two-tailed p-values from the t CDF with df degrees of freedom.
    one_tailed_p_value = 1.0 - st.t.cdf(t_stat, df)
    two_tailed_p_value = 1.0 - (st.t.cdf(np.abs(t_stat), df) - st.t.cdf(-np.abs(t_stat), df))
    # SciPy's built-in (pooled, equal-variances) result, for comparison.
    t_ind, p_ind = st.ttest_ind(pop1, pop2)
    return t_stat, one_tailed_p_value, two_tailed_p_value, t_ind, p_ind
PS: SciPy is open source and mostly implemented in Python. You could have checked the source code for ttest_ind and found the mistake yourself.
For the bonus question: you don't decide on the side of a one-tailed test by looking at your t-value. You decide it beforehand with your hypothesis. If your null hypothesis is that the means are equal and your alternative hypothesis is that the second mean is larger, then your rejection region should be on the left (negative) side, because sufficiently small (negative) values of the t-statistic indicate that the alternative hypothesis is more likely than the null.
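As an illustration (my own sketch, not part of the original answer), here is how the one-sided p-value for the alternative "mean of population 2 is larger" can be computed from the left tail of the t distribution; the pooled degrees of freedom below match ttest_ind's default equal-variances assumption:
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(1)
pop1 = rng.normal(loc=0.0, scale=1.0, size=50)   # control group
pop2 = rng.normal(loc=0.5, scale=1.0, size=50)   # treated group, expected larger mean

t_stat, two_sided_p = st.ttest_ind(pop1, pop2)

# H1: mean2 > mean1 puts the rejection region in the left tail of the t-statistic.
df = len(pop1) + len(pop2) - 2   # pooled degrees of freedom
one_sided_p = st.t.cdf(t_stat, df)
print(t_stat, two_sided_p, one_sided_p)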
Looks like you forgot the **2 on the numerator of your df, the Welch-Satterthwaite degrees of freedom. That is,
df = (np.var(pop1)/num1 + np.var(pop2)/num2)/( (np.var(pop1)/num1)**(2.0)/(num1-1) + (np.var(pop2)/num2)**(2.0)/(num2-1) )
should be:
df = (np.var(pop1)/num1 + np.var(pop2)/num2)**2/( (np.var(pop1)/num1)**(2.0)/(num1-1) + (np.var(pop2)/num2)**(2.0)/(num2-1) )
