How to implement a KS-Test in Python

scipy.stats.kstest(rvs, cdf, N) can perform a KS test on a dataset rvs. It tests whether the dataset follows a probability distribution whose CDF is specified in the parameters of this method.
Consider now a dataset of N=4800 samples. I have performed a KDE on this data and, therefore, have an estimated PDF. This PDF looks an awful lot like a bimodal distribution. When plotting the estimated PDF and curve_fitting a bimodal distribution to it, these two plots are pretty much identical. The parameters of the fitted bimodal distribution are (scale1, mean1, stdv1, scale2, mean2, stdv2):
[0.6 0.036 0.52, 0.23 1.25 0.4]
How can I apply scipy.stats.kstest to test whether my data follow this bimodal distribution?
As my null hypothesis, I state that the estimated PDF equals the following PDF:
hypoDist = 0.6*norm(loc=0, scale=0.2).pdf(x_grid) + 0.3*norm(loc=1, scale=0.2).pdf(x_grid)
hypoCdf = np.cumsum(hypoDist)/len(x_grid)
x_grid is just a vector that contains the x-values at which I evaluate my estimated PDF, so each entry of pdf has a corresponding value in x_grid. It might be that my computation of hypoCdf is incorrect. Instead of dividing by len(x_grid), should I divide by np.sum(hypoDist)?
Challenge: cdf parameter of kstest cannot be specified as bimodal. Neither can I specify it to be hypoDist.
If I wanted to test whether my dataset was Gaussian distributed, I would write:
KS_result = kstest(measurementError, norm(loc=mean(pdf), scale=np.std(pdf)).cdf)
print(KS_result)
measurementError is the dataset that I have performed the KDE on. This returns:
statistic=0.459, pvalue=0.0
To me, it is a little irritating that the pvalue is 0.0

The cdf argument to kstest can be a callable that implements the cumulative distribution function of the distribution against which you want to test your data. To use it, you have to implement the CDF of your bimodal distribution. You want the distribution to be a mixture of two normal distributions. You can implement the CDF for this distribution by computing the weighted sum of the CDFs of the two normal distributions that make up the mixture.
Here's a script that shows how you can do this. To demonstrate how kstest is used, the script runs kstest twice. First it uses a sample that is not from the distribution. As expected, kstest computes a very small p-value for this first sample. It then generates a sample that is drawn from the mixture. For this sample, the p-value is not small.
import numpy as np
from scipy import stats


def bimodal_cdf(x, weight1, mean1, stdv1, mean2, stdv2):
    """
    CDF of a mixture of two normal distributions.
    """
    return (weight1*stats.norm.cdf(x, mean1, stdv1) +
            (1 - weight1)*stats.norm.cdf(x, mean2, stdv2))


# We only need weight1, since weight2 = 1 - weight1.
weight1 = 0.6
mean1 = 0.036
stdv1 = 0.52
mean2 = 1.25
stdv2 = 0.4

n = 200

# Create a sample from a regular normal distribution that has parameters
# similar to the bimodal distribution.
sample1 = stats.norm.rvs(0.5*(mean1 + mean2), 0.5, size=n)

# The result of kstest should show that sample1 is not from the bimodal
# distribution (i.e. the p-value should be very small).
stat1, pvalue1 = stats.kstest(sample1, cdf=bimodal_cdf,
                              args=(weight1, mean1, stdv1, mean2, stdv2))
print("sample1 p-value =", pvalue1)

# Create a sample from the bimodal distribution. This sample is the
# concatenation of samples from the two normal distributions that make
# up the bimodal distribution. The number of samples to take from the
# first distribution is determined by a binomial distribution of n
# samples with probability weight1.
n1 = np.random.binomial(n, p=weight1)
sample2 = np.concatenate((stats.norm.rvs(mean1, stdv1, size=n1),
                          stats.norm.rvs(mean2, stdv2, size=n - n1)))

# Most of the time, the p-value returned by kstest with sample2 will not
# be small. We expect the value to be uniformly distributed in the interval
# [0, 1], so in general it will not be very small.
stat2, pvalue2 = stats.kstest(sample2, cdf=bimodal_cdf,
                              args=(weight1, mean1, stdv1, mean2, stdv2))
print("sample2 p-value =", pvalue2)
Typical output (the numbers will be different each time the script is run):
sample1 p-value = 2.8395166853884146e-11
sample2 p-value = 0.3289374831186403
You might find that, for your problem, this test does not work well. You have 4800 samples, but in your code you have parameters whose numerical values have just one or two significant digits. Unless you have good reason to believe that your sample is drawn from a distribution with exactly those parameters, it is likely that kstest will return a very small p-value.
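To illustrate that last point, here is a minimal sketch. It assumes a sample of 4800 points that really is drawn from the mixture with the fitted parameters quoted in the question, and tests it once against those parameters and once against the coarser values from the question's hypoDist. The second weight is set to 0.4 so that the weights sum to 1, which is an assumption of this sketch, and the helper names mixture_cdf and draw_mixture are made up for the example.

```python
import numpy as np
from scipy import stats


def mixture_cdf(x, weight1, mean1, stdv1, mean2, stdv2):
    """CDF of a two-component normal mixture (weights sum to 1)."""
    return (weight1 * stats.norm.cdf(x, mean1, stdv1)
            + (1 - weight1) * stats.norm.cdf(x, mean2, stdv2))


def draw_mixture(rng, n, weight1, mean1, stdv1, mean2, stdv2):
    """Draw n points; each point comes from component 1 with probability weight1."""
    from_first = rng.random(n) < weight1
    return np.where(from_first,
                    rng.normal(mean1, stdv1, size=n),
                    rng.normal(mean2, stdv2, size=n))


rng = np.random.default_rng(12345)
fitted = (0.6, 0.036, 0.52, 1.25, 0.4)  # parameters quoted in the question
coarse = (0.6, 0.0, 0.2, 1.0, 0.2)      # hypoDist values, weights renormalized (assumption)

sample = draw_mixture(rng, 4800, *fitted)

exact = stats.kstest(sample, mixture_cdf, args=fitted)
rough = stats.kstest(sample, mixture_cdf, args=coarse)

print("fitted parameters: D =", exact.statistic, " p =", exact.pvalue)
print("coarse parameters: D =", rough.statistic, " p =", rough.pvalue)
```

With this many points, the KS statistic for the coarse parameters is far larger than for the parameters that generated the sample, and its p-value is essentially zero.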

Related

What is the hypothesis in .get_influence().cooks_distance in python?

Suppose I ended up with a Cook's distance array like this:
and I am looking at the first element (Cook's distance = 0.368 and p-value = 0.701).
How can I interpret the p-value? It is larger than 0.05, which would normally mean failing to reject H0, but what is H0?
example obtained from https://www.statology.org/cooks-distance-python/

The p-value here is not the p-value you get from a hypothesis test. If you check the Wikipedia article, Cook's distance is compared against an F distribution with p and n-p degrees of freedom. So the p-value you get is the probability of observing a value more extreme than that one, under the assumptions of a linear model.
We can look at the source code of statsmodels.stats.outliers_influence.OLSInfluence, the class whose cooks_distance method computes Cook's distance:
def cooks_distance(self):
    """Cook's distance and p-values

    Based on one step approximation d_params and on results.cov_params
    Cook's distance divides by the number of explanatory variables.

    p-values are based on the F-distribution which are only approximate
    outside of linear Gaussian models.

    Warning: The definition of p-values might change if we switch to using
    chi-square distribution instead of F-distribution, or if we make it
    dependent on the fit keyword use_t.
    """
    cooks_d2 = (self.d_params * np.linalg.solve(self.cov_params,
                                                self.d_params.T).T).sum(1)
    cooks_d2 /= self.k_vars
    from scipy import stats
    # alpha = 0.1
    # print stats.f.isf(1-alpha, n_params, res.df_modelwc)
    # TODO use chi2 # use_f option
    pvals = stats.f.sf(cooks_d2, self.k_vars, self.results.df_resid)

    return cooks_d2, pvals
The relevant line is pvals = stats.f.sf(cooks_d2, self.k_vars, self.results.df_resid). So you calculate Cook's distance and look at its 1-cdf value on the F distribution.
It is similar to how you obtain the p-value for a one-sided t-test: you ask what the probability is of observing a t-statistic more extreme than the one obtained from the test.
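As a quick check of that reading, here is a minimal sketch that applies the same stats.f.sf call to the value from the question. The model size is an assumption: the question does not state it, so the sketch assumes two estimated parameters (intercept and one slope) and 10 residual degrees of freedom. Under that assumption the call reproduces the quoted 0.701.

```python
from scipy import stats

cooks_d = 0.368  # Cook's distance quoted in the question
k_vars = 2       # assumed: intercept + one slope
df_resid = 10    # assumed: 12 observations - 2 parameters

# Same call as in cooks_distance: upper tail of F(k_vars, df_resid).
pval = stats.f.sf(cooks_d, k_vars, df_resid)

# With two numerator degrees of freedom the upper tail has a closed form,
# which gives an independent check of the number.
closed_form = (df_resid / (df_resid + k_vars * cooks_d)) ** (df_resid / 2)

print(round(pval, 3), round(closed_form, 3))
```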

How to calculate one-sided tolerance interval with scipy

I would like to calculate a one sided tolerance bound based on the normal distribution given a data set with known N (sample size), standard deviation, and mean.
If the interval were two sided I would do the following:
conf_int = stats.norm.interval(alpha, loc=mean, scale=sigma)
In my situation, I am bootstrapping samples, but if I weren't I would refer to this post on stackoverflow: Correct way to obtain confidence interval with scipy and use the following: conf_int = stats.norm.interval(0.68, loc=mean, scale=sigma / np.sqrt(len(a)))
How would you do the same thing, but calculate a one-sided bound instead (95% of values are above or below some bound x)?

I assume that you are interested in computing a one-sided tolerance bound using the normal distribution (based on the fact that you mention the scipy.stats.norm.interval function as the two-sided equivalent of what you need).
Then the good news is that, based on the tolerance interval Wikipedia page:
One-sided normal tolerance intervals have an exact solution in terms of the sample mean and sample variance based on the noncentral t-distribution.
(FYI: Unfortunately, this is not the case for the two-sided setting)
This assertion is based on this paper. In addition, paragraph 4.8 (page 23) provides the formulas.
The bad news is that I do not think there is a ready-to-use scipy function that you can safely tweak and use for your purpose.
But you can easily calculate it yourself. There are GitHub repositories containing such a calculator that you can use as inspiration, for example that one, from which I built the following illustrative example:
import numpy as np
from scipy.stats import norm, nct
# sample size
n=1000
# Percentile for the TI to estimate
p=0.9
# confidence level
g = 0.95
# a demo sample
x = np.array([np.random.normal(100) for k in range(n)])
# mean estimate based on the sample
mu_est = x.mean()
# standard deviation estimated based on the sample
sigma_est = x.std(ddof=1)
# (100*p)th percentile of the standard normal distribution
zp = norm.ppf(p)
# gth quantile of a non-central t distribution
# with n-1 degrees of freedom and non-centrality parameter np.sqrt(n)*zp
t = nct.ppf(g, df=n-1., nc=np.sqrt(n)*zp)
# k factor from Young et al paper
k = t / np.sqrt(n)
# One-sided tolerance upper bound
conf_upper_bound = mu_est + (k*sigma_est)
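The same calculation can be wrapped in a small helper that also returns the lower bound, which for the normal distribution follows by symmetry as mean - k*sd. This is only a sketch built on the formulas above; the function names tolerance_factor and one_sided_tolerance_bounds are made up for illustration.

```python
import numpy as np
from scipy.stats import norm, nct


def tolerance_factor(n, p, g):
    """k factor so that mean + k*sd bounds a proportion p with confidence g."""
    zp = norm.ppf(p)
    return nct.ppf(g, df=n - 1, nc=np.sqrt(n) * zp) / np.sqrt(n)


def one_sided_tolerance_bounds(x, p=0.9, g=0.95):
    """Return (lower, upper) one-sided bounds; use one of them, not both at once."""
    x = np.asarray(x, dtype=float)
    k = tolerance_factor(x.size, p, g)
    mean, sd = x.mean(), x.std(ddof=1)
    return mean - k * sd, mean + k * sd


rng = np.random.default_rng(0)
sample = rng.normal(loc=100.0, scale=1.0, size=50)  # demo data
lower, upper = one_sided_tolerance_bounds(sample, p=0.9, g=0.95)
print("lower bound:", lower)
print("upper bound:", upper)
```

Each bound is a separate one-sided statement: with confidence g, at least a proportion p of the population lies below upper (or, respectively, above lower).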

Here is a one-line solution with the openturns library, assuming your data is a numpy array named sample.
import openturns as ot
ot.NormalFactory().build(sample.reshape(-1, 1)).computeQuantile(0.95)
Let us unpack this. NormalFactory is a class designed to fit the parameters of a Normal distribution (mu and sigma) on a given sample: NormalFactory() creates an instance of this class.
The method build does the actual fitting and returns an object of the class Normal which represents the normal distribution with parameters mu and sigma estimated from the sample.
The sample reshape is there to make sure that OpenTURNS understands that the input sample is a collection of one-dimension points, not a single multi-dimensional point.
The class Normal then provides the method computeQuantile to compute any quantile of the distribution (the 95-th percentile in this example).
This solution does not compute the exact tolerance bound because it uses a quantile from a Normal distribution instead of a Student t-distribution. Effectively, that means that it ignores the estimation error on mu and sigma. In practice, this is only an issue for really small sample sizes.
To illustrate this, here is a comparison between the PDF of the standard normal N(0,1) distribution and the PDF of the Student t-distribution with 19 degrees of freedom (this means a sample size of 20). They can barely be distinguished.
deg_freedom = 19
graph = ot.Normal().drawPDF()
student = ot.Student(deg_freedom).drawPDF().getDrawable(0)
student.setColor('blue')
graph.add(student)
graph.setLegends(['Normal(0,1)', 't-dist k={}'.format(deg_freedom)])
graph
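To check how much the estimation error matters for a particular sample size, one option is to compare the plain normal quantile with the tolerance factor from the noncentral t-distribution. The sketch below uses scipy rather than OpenTURNS, and the coverage and confidence levels are assumed values chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm, nct

p, g = 0.95, 0.95  # coverage and confidence level (assumed values)
z = norm.ppf(p)    # factor that ignores the estimation error on mu and sigma

sizes = [10, 20, 50, 100]
# Tolerance factor k: the bound is mean + k*sd instead of mean + z*sd.
factors = [nct.ppf(g, df=n - 1, nc=np.sqrt(n) * z) / np.sqrt(n) for n in sizes]

print("normal quantile:", round(z, 3))
for n, k in zip(sizes, factors):
    print("n =", n, " tolerance factor k =", round(k, 3))
```

The factor k is always larger than the normal quantile and approaches it as n grows, so the gap between the two shows how much the shortcut underestimates the bound at a given sample size.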

How to sample from a custom distribution when parameters are known?

The goal is to get samples from a distribution whose parameters are known.
For example, the self-defined distribution is p(X|theta), where theta is the parameter vector of K dimensions and X is the random vector of N dimensions.
We know that (1) theta is known; (2) p(X|theta) is NOT known, but p(X|theta) ∝ f(X,theta), where f is a known function.
Can pymc3 sample from p(X|theta), and how?
The purpose is not to sample from the posterior distribution of the parameters, but to draw samples from a self-defined distribution.
I started with a simple example of sampling from a Bernoulli distribution and did the following:
import pymc3 as pm
import numpy as np
import scipy.stats as stats
import pandas as pd
import theano.tensor as tt

with pm.Model() as model1:
    p = 0.3
    density = pm.DensityDist('density',
                             lambda x1: tt.switch(x1, tt.log(p), tt.log(1 - p)),
                             )  # tt.switch(x1, tt.log(p), tt.log(1 - p)) is the log likelihood from pymc3 source code

with model1:
    step = pm.Metropolis()
    samples = pm.sample(1000, step=step)
I expected the result to be 1000 binary digits, with the proportion of 1s about 0.3. However, I got strange results where very large numbers occur in the output.
I know something is wrong. Please help me understand how to correctly write pymc3 code for such non-posterior MCMC sampling problems.

Prior predictive sampling (for which you should be using pm.sample_prior_predictive()) involves only using the RNGs provided by the RandomVariable objects in your compute graph. By default, DensityDist does not implement a RNG, but does provide the random parameter for this purpose, so you'll need to use that. The log-likelihood is only evaluated with respect to observables, so it plays no role here.
A simple way to generate a valid RNG for an arbitrary distribution is to use inverse transform sampling. In this case, one samples a uniform distribution on the unit interval and then transforms it through the inverse CDF of the desired function. For the Bernoulli case, the inverse CDF partitions the unit line based on the probability of success, assigning 0 to one part and 1 to the other.
Here is a factory-like implementation that creates a Bernoulli RNG compatible with pm.DensityDist's random parameter (i.e., accepts point and size kwargs).
def get_bernoulli_rng(p=0.5):

    def _rng(point=None, size=1):
        # Bernoulli inverse CDF, given p (prob of success)
        _icdf = lambda q: np.uint8(q < p)

        return _icdf(pm.Uniform.dist().random(point=point, size=size))

    return _rng
So, to fill out the example, it would go something like
with pm.Model() as m:
    p = 0.3
    y = pm.DensityDist('y', lambda x: tt.switch(x, tt.log(p), tt.log(1-p)),
                       random=get_bernoulli_rng(p))
    prior = pm.sample_prior_predictive(random_seed=2019)

prior['y'].mean() # 0.306
Obviously, this could equally be done with random=pm.Bernoulli.dist(p).random, but the above illustrates generically how one could do this with arbitrary distributions, given their inverse CDF, i.e., you only need to modify _icdf and the parameters.
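Independent of PyMC3, inverse transform sampling itself can be sketched in plain NumPy. The helper name sample_by_inverse_cdf is made up for this illustration, and the exponential case is added only as a second example of swapping in a different inverse CDF.

```python
import numpy as np


def sample_by_inverse_cdf(icdf, size, rng):
    """Push uniform draws on [0, 1) through an inverse CDF."""
    return icdf(rng.random(size))


rng = np.random.default_rng(2019)

# Bernoulli(p): map u < p to 1 and everything else to 0.
p = 0.3
bernoulli = sample_by_inverse_cdf(lambda u: (u < p).astype(np.uint8), 100_000, rng)

# Exponential(rate): the inverse CDF is -log(1 - u) / rate.
rate = 2.0
exponential = sample_by_inverse_cdf(lambda u: -np.log1p(-u) / rate, 100_000, rng)

print("Bernoulli mean:", bernoulli.mean())      # close to p
print("Exponential mean:", exponential.mean())  # close to 1 / rate
```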

How can I maximize the Poissonian likelihood of a histogram given a fit curve with scipy/numpy?

I have data in a python/numpy/scipy environment that needs to be fit to a probability density function. A way to do this is to create a histogram of the data and then fit a curve to this histogram. The method scipy.optimize.leastsq does this by minimizing the sum of (y - f(x))**2, where (x,y) would in this case be the histogram's bin centers and bin contents.
In statistical terms, this least-squares fit maximizes the likelihood of obtaining that histogram by sampling each bin count from a Gaussian centered around the fit function at that bin's position. You can easily see this: each term (y-f(x))**2 is, up to constants, -log(gauss(y|mean=f(x))), and the sum is the negative logarithm of the product of the Gaussian likelihoods for all the bins.
That is however not always accurate: for the type of statistical data I'm looking at, each bin count would be the result of a Poissonian process, so I want to minimize (the negative logarithm of the product over all the bins (x,y) of) poisson(y|mean=f(x)). The Poissonian comes very close to the Gaussian distribution for large values of f(x), but if my histogram doesn't have such good statistics, the difference would be relevant and would influence the fit.

If I understood correctly, you have data and want to see whether or not some probability distribution fits your data.
If that's the case, you need a QQ-plot; take a look at this StackOverflow question-answer. However, that is about the normal distribution, and you need code for the Poisson distribution. All you need to do is create some random data from a Poisson random generator and test your samples against it. Here you can find an example of a QQ-plot for the Poisson distribution. Here's the code from this web-site:
#! /usr/bin/env python
from pylab import *
p = poisson(lam=10, size=4000)
m = mean(p)
s = std(p)
n = normal(loc=m, scale=s, size=p.shape)
a = m-4*s
b = m+4*s
figure()
plot(sort(n), sort(p), 'o', color='0.85')
plot([a,b], [a,b], 'k-')
xlim(a,b)
ylim(a,b)
xlabel('Normal Distribution')
ylabel(r'Poisson Distribution with $\lambda=10$')  # raw string keeps the backslash for mathtext
grid(True)
savefig('qq.pdf')
show()
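For the fit the question itself asks about, one possible approach is to minimize the negative Poisson log-likelihood directly with scipy.optimize.minimize. The sketch below is not taken from either post: the Gaussian-shaped curve expected_counts stands in for the asker's fit function f(x), and the data, binning, and starting values are all assumptions made for the example.

```python
import numpy as np
from scipy import optimize, stats


def expected_counts(x, amplitude, center, width):
    """Hypothetical fit curve f(x): a Gaussian-shaped peak at the bin centers."""
    return amplitude * np.exp(-0.5 * ((x - center) / width) ** 2)


def poisson_nll(params, x, counts):
    """Negative log of the product of poisson(y | mean=f(x)) over all bins."""
    mu = np.clip(expected_counts(x, *params), 1e-12, None)  # keep the mean positive
    return -stats.poisson.logpmf(counts, mu).sum()


rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=1.5, size=300)  # demo data with few entries per bin

counts, edges = np.histogram(data, bins=20, range=(0.0, 10.0))
centers = 0.5 * (edges[:-1] + edges[1:])

start = [counts.max(), data.mean(), data.std()]
result = optimize.minimize(poisson_nll, start, args=(centers, counts),
                           method="Nelder-Mead")
amplitude, center, width = result.x
width = abs(width)  # the curve depends on width only through its square

print("amplitude, center, width:", amplitude, center, width)
```

Unlike least squares, this objective handles empty bins naturally, since a bin with zero counts contributes only the expected value f(x) to the sum.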

statsmodels - plotting the fitted distribution

The following code fits an oversimplified generalized linear model using statsmodels:
model = smf.glm('Y ~ 1', family=sm.families.NegativeBinomial(), data=df)
results = model.fit()
This gives the coefficient and a stderr:
            coef    stderr
Intercept   2.9471  0.120
Now I want to graphically compare the real distribution of the variable Y (histogram) with the distribution that comes from the model.
But I need two parameters r and p to evaluate the stats.nbinom(r,p) and plot it.
Is there a way to retrieve the parameters from the results of the fitting?
How can I plot the PMF?

Generalized linear models, GLM, in statsmodels currently does not estimate the extra parameter of the Negative Binomial distribution. Negative Binomial belongs to the exponential family of distributions only for fixed shape parameter.
However, statsmodels also has Negative Binomial as a Maximum Likelihood Model in discrete_model which estimates all parameters.
The parameterization of the Negative Binomial for count regression is in terms of the mean or expected value, which is different from the parameterization in scipy.stats.nbinom. Actually, there are two different commonly used parameterizations for the Negative Binomial count regression, usually called nb1 and nb2.
Here is a quickly written script that recovers the scipy.stats.nbinom parameters, n=size and p=prob, from the estimated parameters. Once you have the parameters for the scipy.stats distribution, you can use all the available methods: rvs, pmf, and so on.
Something like this should be made available in statsmodels.
In a few example runs, I got results like this
data generating parameters 50 0.25
estimated params 51.7167511571 0.256814610633
estimated params 50.0985814878 0.249989725917
As an aside, because of the underlying exponential reparameterization, the scipy optimizers sometimes have problems converging. In those cases, either providing better starting values or using Nelder-Mead as the optimization method usually helps.
import numpy as np
from scipy import stats
import statsmodels.api as sm

# generate some data to check
nobs = 1000
n, p = 50, 0.25
dist0 = stats.nbinom(n, p)
y = dist0.rvs(size=nobs)
x = np.ones(nobs)

loglike_method = 'nb1'  # or use 'nb2'
res = sm.NegativeBinomial(y, x, loglike_method=loglike_method).fit(start_params=[0.1, 0.1])

print(dist0.mean())
print(res.params)

mu = res.predict()  # use this for mean if not constant
mu = np.exp(res.params[0])  # shortcut, we just regress on a constant
alpha = res.params[1]

if loglike_method == 'nb1':
    Q = 1
elif loglike_method == 'nb2':
    Q = 0

size = 1. / alpha * mu**Q
prob = size / (size + mu)

print('data generating parameters', n, p)
print('estimated params ', size, prob)

# estimated distribution
dist_est = stats.nbinom(size, prob)
BTW: I ran into this before but didn't have time to look at it
https://github.com/statsmodels/statsmodels/issues/106
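The conversion step in the script can also be written as a small standalone function, which makes it easy to check against known values. This is only a sketch: the function name nbinom_params is made up, and the mean and dispersion values are not estimates but the ones implied by the data-generating parameters n=50, p=0.25 used in the script.

```python
from scipy import stats


def nbinom_params(mu, alpha, loglike_method="nb2"):
    """Convert a count-regression mean and dispersion to scipy's (n, p)."""
    if loglike_method == "nb1":
        size = mu / alpha   # variance = mu + alpha * mu
    elif loglike_method == "nb2":
        size = 1.0 / alpha  # variance = mu + alpha * mu**2
    else:
        raise ValueError("loglike_method must be 'nb1' or 'nb2'")
    prob = size / (size + mu)
    return size, prob


# nbinom(50, 0.25) has mean 150; the matching dispersion is 3 for nb1
# and 0.02 for nb2.
mu = 150.0
size1, prob1 = nbinom_params(mu, alpha=3.0, loglike_method="nb1")
size2, prob2 = nbinom_params(mu, alpha=0.02, loglike_method="nb2")
print(size1, prob1)
print(size2, prob2)

# The frozen distribution provides the PMF to draw over a histogram of Y.
dist = stats.nbinom(size2, prob2)
pmf_values = dist.pmf(range(100, 201))
```

Both parameterizations map back to the same (n, p) pair here, so the resulting PMF can be plotted over a normalized histogram of the data with any plotting library.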
