Using Scipy's stats.kstest module for goodness-of-fit testing - python

I've read through existing posts about this module (and the Scipy docs), but it's still not clear to me how to use Scipy's kstest module to do a goodness-of-fit test when you have a data set and a callable function.
The PDF I want to test my data against isn't one of the standard scipy.stats distributions, so I can't just call it using something like:
kstest(mydata,'norm')
where mydata is a Numpy array. Instead, I want to do something like:
kstest(mydata,myfunc)
where 'myfunc' is the callable function. This doesn't work, which is unsurprising, since there's no way for kstest to know what the abscissa for the 'mydata' array is in order to generate the corresponding theoretical frequencies using 'myfunc'. Suppose the frequencies in 'mydata' correspond to the values of the random variable in the array 'abscissa'. Then I thought maybe I could use stats.ks_2samp:
ks_2samp(mydata,myfunc(abscissa))
but I don't know if that's statistically valid. (Sidenote: do kstest and ks_2samp expect frequency arrays to be normalized to one, or do they want the absolute frequencies?)
In any case, since the one-sample KS test is supposed to be used for goodness-of-fit testing, I have to assume there's some way to do it with kstest directly. How do you do this?

Some examples may shed some light on how to use scipy.stats.kstest. Let's first set up some test data, e.g. normally distributed with mean 5 and standard deviation 10:
>>> data = scipy.stats.norm.rvs(loc=5, scale=10, size=(1000,))
To run kstest on these data we need a function f(x) that takes an array of quantiles, and returns the corresponding value of the cumulative distribution function. If we reuse the cdf function of scipy.stats.norm we could do:
>>> scipy.stats.kstest(data, lambda x: scipy.stats.norm.cdf(x, loc=5, scale=10))
(0.019340993719575206, 0.84853828416694665)
The above would normally be run with the more convenient form:
>>> scipy.stats.kstest(data, 'norm', args=(5, 10))
(0.019340993719575206, 0.84853828416694665)
If we have uniformly distributed data, it is easy to build the cdf by hand:
>>> data = np.random.rand(1000)
>>> scipy.stats.kstest(data, lambda x: x)
(0.019145675289412523, 0.85699937276355065)
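For the case in the question, where only a density is available, one option is to integrate the pdf numerically to obtain a cdf and pass that to kstest. The following is a minimal sketch under that assumption; my_pdf (a triangular density 2x on [0, 1]) and my_cdf are made-up names for illustration and are not part of SciPy.

```python
import numpy as np
from scipy import integrate, stats


def my_pdf(x):
    # Hypothetical density used for illustration: p(x) = 2x on [0, 1].
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0


def my_cdf(x):
    # kstest passes an array of data values; integrate the pdf up to each one.
    x = np.atleast_1d(np.asarray(x, dtype=float))
    upper = np.clip(x, 0.0, 1.0)
    return np.array([integrate.quad(my_pdf, 0.0, u)[0] for u in upper])


# Synthetic data drawn from the same density by inverse-cdf sampling:
# the cdf is x**2, so its inverse is sqrt(u).
rng = np.random.default_rng(0)
data = np.sqrt(rng.random(500))

result = stats.kstest(data, my_cdf)
print(result.statistic, result.pvalue)
```

Note that kstest works on the raw observations, not on binned frequencies, so the normalization question from the sidenote does not arise here.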

As for ks_2samp, it tests the null hypothesis that both samples are drawn from the same probability distribution.
You can do, for example:
>>> from scipy.stats import ks_2samp
>>> import numpy as np
Then, if x and y are two instances of numpy.array:
>>> ks_2samp(x, y)
(0.022999999999999909, 0.95189016804849658)
The first value is the test statistic, and the second value is the p-value. If the p-value is greater than 0.05 (for a significance level of 5%), you cannot reject the null hypothesis that the two samples come from the same distribution; if it is smaller, you reject it.
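As a hedged illustration of how the two-sample test behaves, the sketch below compares two samples drawn from the same normal distribution and two samples whose means differ. The sample sizes, the seed and the shift of 1.0 are arbitrary choices made for this example.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=1000)
y = rng.normal(loc=0.0, scale=1.0, size=1000)  # same distribution as x
z = rng.normal(loc=1.0, scale=1.0, size=1000)  # shifted mean

same = ks_2samp(x, y)
diff = ks_2samp(x, z)

alpha = 0.05  # significance level
for label, res in [("x vs y", same), ("x vs z", diff)]:
    decision = "reject" if res.pvalue < alpha else "cannot reject"
    print(label, res.statistic, res.pvalue, decision, "the null hypothesis")
```

Both inputs to ks_2samp are raw observations. Passing one array of data and one array of pdf values, as in the question, would compare two unrelated kinds of numbers.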


findfrequency spec.ar equivalent in Python

There is a very useful function in R called findfrequency on the forecast package that returns the period of the dominant frequency of a time series. More info on the function from the author can be found here: https://robjhyndman.com/hyndsight/tscharacteristics/
I want to implement something equivalent in Python and I am having trouble with the functions that should be equal to the spec.ar R function that is inside findfrequency.
The function starts from detrending the series which is easily done with x = statsmodels.tsa.tsatools.detrend(myTs, order=1, axis=0). Now that I have the residuals I would like to do in Python the equivalent of the spec.ar function in R that first fits an AR model to x (or uses the existing fit) and computes (and by default plots) the spectral density of the fitted model.
I have not found anything similar, so I am doing it one step at a time: first the AR fit and then the spectral estimation.
I am using the AirPassengers time series and I am not able to get the same results in R and Python for the AR order or coefficients.
My R code:
x <- AirPassengers
x <- residuals(tslm(x ~ trend))
ARmodel <- ar(x)
ARmodel
I get that 15 is the selected order for my autoregressive model.
My Python Code:
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.tsatools import detrend
dataPeriodic = pd.read_csv('AirPassengers.csv')
tsPeriodic = dataPeriodic.iloc[:,1]
x = detrend(tsPeriodic, order=1, axis=0)
n = x.shape[0]
est_order = sm.tsa.AR(x).select_order(maxlag=20, ic='aic', trend='nc')
print(est_order)
Here I get a very different result: the selected order is 10 instead of 15, and I have to specify the upper limit of the lag search with the maxlag parameter.
I have also tried tsa.AutoReg without success; it gives yet another result.
So, is there a way to fit an AR model in the same way that R does? Something similar to spec.ar, or even something similar to the findfrequency function? I am quite confused by the big differences the 'same' methods can produce in the two languages.
The closest thing I could find in Python to findfrequency from the R forecast package was pandas.infer_freq, used like this:
>>> import pandas as pd
>>> ts_data = pd.read_csv("ts_data.csv")
>>> pd.infer_freq(ts_data.index.values)
4
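For comparison, here is a rough sketch of the steps described in the question: detrend, fit an AR model, evaluate the spectral density of the fitted model and take the period of its peak. It is not a port of R's code. It assumes a Yule-Walker fit with AIC order selection and a default maximum order of min(n - 1, floor(10*log10(n))), which I believe mirrors R's ar() defaults but have not verified line by line. All function names are made up for this example, and the synthetic series merely stands in for AirPassengers.

```python
import numpy as np
from scipy.linalg import solve_toeplitz


def autocovariance(x, max_lag):
    # Biased sample autocovariance (divides by n) at lags 0..max_lag.
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    return np.array([np.dot(x[:n - k], x[k:]) / n for k in range(max_lag + 1)])


def fit_ar_yule_walker(x, max_order=None):
    # Fit AR(p) for p = 0..max_order by Yule-Walker and keep the lowest AIC.
    n = len(x)
    if max_order is None:
        max_order = min(n - 1, int(10 * np.log10(n)))  # assumed R-like default
    r = autocovariance(x, max_order)
    best_aic, best_phi, best_var = np.inf, np.zeros(0), r[0]
    for p in range(max_order + 1):
        if p == 0:
            phi, var = np.zeros(0), r[0]
        else:
            phi = solve_toeplitz(r[:p], r[1:p + 1])
            var = r[0] - np.dot(phi, r[1:p + 1])
        aic = n * np.log(var) + 2 * p
        if aic < best_aic:
            best_aic, best_phi, best_var = aic, phi, var
    return best_phi, best_var


def ar_spectrum(phi, var, n_freq=500):
    # Spectral density of an AR model on frequencies 0..0.5 cycles per sample.
    freqs = np.linspace(0.0, 0.5, n_freq)
    lags = np.arange(1, len(phi) + 1)
    transfer = 1.0 - np.exp(-2j * np.pi * np.outer(freqs, lags)) @ phi
    return freqs, var / np.abs(transfer) ** 2


def dominant_period(x):
    # Remove a linear trend, fit the AR model, return the period of the peak.
    x = np.asarray(x, dtype=float)
    t = np.arange(len(x))
    resid = x - np.polyval(np.polyfit(t, x, 1), t)
    phi, var = fit_ar_yule_walker(resid)
    freqs, spec = ar_spectrum(phi, var)
    peak = freqs[np.argmax(spec)]
    # Simplification: a peak at frequency zero is reported as period 1.
    return int(round(1.0 / peak)) if peak > 0 else 1


# Synthetic monthly-like series with a trend and a 12-step cycle.
rng = np.random.default_rng(0)
t = np.arange(144)
series = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(size=144)
print(dominant_period(series))
```

Differences in the selected order between implementations often come down to the estimation method (Yule-Walker versus least squares or maximum likelihood) and the default maximum lag, so those are the first settings worth aligning.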

How to obtain a python scipy-type continuous rv distribution object that is bounded?

I would like to define a bounded version of a continuous random variable distribution (say, an exponential, but I might want to use others as well). The bounds are 0 and 1. I would like to:
draw random variates (as done by scipy.stats.rv_continuous.rvs),
use the ppf (percent point function) (as done by scipy.stats.rv_continuous.ppf), and possibly
use the cdf (cumulative distribution function) (as done by scipy.stats.rv_continuous.cdf)
Possible approaches I can think of:
Getting random variates in an ad hoc way is not difficult
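import numpy as np
import scipy.stats
target_number_of_rv = 1000  # example value; not given in the original
d = scipy.stats.expon(0, 3/10.)  # an exponential distribution as an example
rv = d.rvs(size=target_number_of_rv)
rv = rv[(0 <= rv) & (rv <= 1)]  # keep only draws inside the bounds
while len(rv) < target_number_of_rv:
    # draw one more candidate and append it if it falls inside the bounds
    new = d.rvs(size=1)
    rv = np.concatenate([rv, new[(0 <= new) & (new <= 1)]])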
but 1) this is non-generic and potentially error-prone and 2) it does not help with the ppf or cdf.
Subclassing scipy.stats.rv_continuous, as is done here and here. Thereby, the ppf of scipy.stats.rv_continuous can be used. The drawback is that it requires the pdf (not just a pre-defined rv_continuous object or the pdf of the unbounded distribution and the bounds), and if this is wrong, cdf and ppf and everything else will be wrong as well.
Designing a class that takes care of applying the bounds to the rv generation and of correcting the value of the ppf obtained from the unbounded object in scipy.stats. A drawback is that this is non-generic and error-prone as well, and that it may be difficult to correct the ppf. My feeling is that the value of the cdf of the unbounded distribution could be rescaled by the share of probability mass that lies inside the bounds, but I may be wrong. That would be, for lower and upper bounds l and u and any valid quantile x (with l<=x<=u): (cdf(x)-cdf(l))/(cdf(u)-cdf(l)). Obtaining the ppf would, however, require inverting the resulting function.
My feeling is that there might be a better and more generic way to do this. Is there? Maybe with sympy? Maybe by somehow obtaining the function object of the unbounded cdf and modifying it directly?
Python is version: 3.6.2, scipy is version 0.19.1.
If the distribution is one of those that is available in scipy.stats then you can evaluate its integral between the two bounds using the cdf for that distribution. Otherwise, you can define the pdf for rv_continuous and then use its cdf to get this integral.
Now, you have, in effect, the pdf for the bounded version of the pdf you want because you have calculated the normalising constant for it, in that integral. You can proceed to use rv_continuous with the form that you have for the pdf plus the normalising constant and with the bounds.
Here's what your code might look like. The variable scale is set according to the scipy documentation. norm is the integral of the exponential pdf over [0,1]; only about .49 of the probability mass lies in that interval. Therefore, to make the exponential truncated to [0,1] have a total mass of one, we must divide its pdf by this factor.
Truncated_expon is defined as a subclass of rv_continuous as in the documentation. By supplying its pdf we make it possible (at least for such a simple integral!) for scipy to calculate this distribution's cdf and thereby to calculate random samples.
I have calculated the cdf at one as a check.
>>> from scipy import stats
>>> lamda = 2/3
>>> scale = 1/lamda
>>> norm = stats.expon.cdf(1, scale=scale)
>>> norm
0.48658288096740798
>>> import numpy as np
>>> class Truncated_expon(stats.rv_continuous):
...     def _pdf(self, x, lamda):
...         # divide by the mass on [0, 1] (the value of norm above when lamda = 2/3)
...         return lamda*np.exp(-lamda*x)/(1 - np.exp(-lamda))
...
>>> e = Truncated_expon(a=0, b=1, shapes='lamda')
>>> e.cdf(1, lamda=lamda)
1.0
>>> e.rvs(size=20, lamda=lamda)
array([ 0.20064067, 0.67646465, 0.89118679, 0.86093035, 0.14334989,
0.10505598, 0.53488779, 0.11606106, 0.41296616, 0.33650899,
0.95126415, 0.57481087, 0.04495104, 0.00308469, 0.23585195,
0.00653972, 0.59400395, 0.34919065, 0.91762547, 0.40098409])
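The rescaling formula from the question, (cdf(x)-cdf(l))/(cdf(u)-cdf(l)), can be inverted directly with the ppf of the unbounded distribution, which avoids having to write the pdf by hand. Below is a minimal sketch of such a wrapper; the class name Truncated and its interface are assumptions made for this example and are not part of SciPy.

```python
import numpy as np
from scipy import stats


class Truncated:
    """Restrict a frozen scipy.stats distribution to [lower, upper]."""

    def __init__(self, dist, lower, upper):
        self.dist = dist
        self.lower, self.upper = lower, upper
        # Probability mass of the unbounded distribution below each bound.
        self.f_lo, self.f_hi = dist.cdf(lower), dist.cdf(upper)

    def cdf(self, x):
        x = np.clip(x, self.lower, self.upper)
        return (self.dist.cdf(x) - self.f_lo) / (self.f_hi - self.f_lo)

    def ppf(self, q):
        # Map q in [0, 1] back to the unbounded cdf scale, then invert.
        q = np.asarray(q, dtype=float)
        return self.dist.ppf(self.f_lo + q * (self.f_hi - self.f_lo))

    def rvs(self, size=1, seed=None):
        # Inverse-cdf sampling: every draw lands inside the bounds.
        rng = np.random.default_rng(seed)
        return self.ppf(rng.random(size))


trunc = Truncated(stats.expon(scale=1.5), 0.0, 1.0)
sample = trunc.rvs(size=1000, seed=0)
print(trunc.cdf(1.0), trunc.ppf(0.5), sample.min(), sample.max())
```

For some cases SciPy already ships truncated versions, such as scipy.stats.truncexpon and scipy.stats.truncnorm, which may be simpler when they fit the problem.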

How to generate a Q-Q plot manually without inverse distribution function in python

I have 4 different distributions which I've fitted to a sample of observations. Now I want to compare my results and find the best solution. I know there are a lot of different methods to do that, but I'd like to use a quantile-quantile (q-q) plot.
The formulas for my 4 distributions are:
where K0 is the modified Bessel function of the second kind and zeroth order, and Γ is the gamma function.
My sample style looks roughly like this: (0.2, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4, 0.6, 0.7 ...), so I have multiple identical values and also gaps in between them.
I've read the instructions on this site and tried to implement them in python. So, like in the link:
1) I sorted my data from the smallest to the largest value.
2) I computed "n" evenly spaced points on the interval (0,1), where "n" is my sample size.
3) And this is the point I can't manage.
As far as I understand, I should now use the values I calculated beforehand (those evenly spaced values), put them in the inverse functions of my above distributions and thus compute the theoretical quantiles of my distributions.
For reference, here are the inverse functions (partly calculated with wolframalpha, and as far as that was possible):
where W is the Lambert W-function and everything in brackets afterwards is the argument.
The problem is that, apparently, no inverse function exists for the first distribution. The next one would probably produce complex values (a negative number under the root, because b = 0.55 according to the fit), and the last two contain a Lambert W-function (which I'm unsure how to implement in python).
So my question is, is there a way to calculate the q-q plots without the analytical expressions of the inverse distribution functions?
I'd appreciate any help you could give me very much!
A simpler and more conventional way to go about this is to compute the log likelihood for each model and choose the one that has the greatest log likelihood. You don't need the cdf or quantile function for that, only the density function, which you have already.
The log likelihood is just the sum of log p(x|model) where p(x|model) is the probability density of datum x under a given model. Here "model" = model with parameters selected by maximizing the log likelihood over the possible values of the parameters.
You can be more careful about this by integrating the likelihood over the parameter space, weighted by a prior on the parameters, taking into account also any prior probability assigned to each model; that would be a Bayesian approach.
It sounds like you are essentially looking to choose a model by minimizing the Kolmogorov-Smirnov (KS) statistic, which despite its heavy name is pretty simple: it is the largest absolute difference between the model's cdf and the empirical cdf of the sample. That's defensible, but I think comparing log likelihoods is more conventional, and also simpler since you need only the pdf.
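As a hedged illustration of the log likelihood comparison, the sketch below fits a few scipy.stats distributions to a synthetic sample and sums the log densities. The candidate list and the gamma-distributed sample are stand-ins chosen for this example; with custom densities you would evaluate your own log pdf at the fitted parameters instead.

```python
import numpy as np
from scipy import stats

# Synthetic positive-valued sample standing in for the real observations.
rng = np.random.default_rng(1)
sample = rng.gamma(shape=2.0, scale=0.5, size=500)

candidates = {
    "gamma": stats.gamma,
    "expon": stats.expon,
    "lognorm": stats.lognorm,
}

loglik = {}
for name, dist in candidates.items():
    # Maximum likelihood fit with the location fixed at zero.
    params = dist.fit(sample, floc=0)
    loglik[name] = np.sum(dist.logpdf(sample, *params))

best = max(loglik, key=loglik.get)
print(loglik, best)
```

Models with more free parameters tend to reach a higher log likelihood, so when the candidates differ in parameter count a penalized criterion such as AIC (2k minus twice the log likelihood) is a common adjustment.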
It happens that there is an easier way. It's taken me a day or two to dig around until I was pointed toward the right method in scipy.stats. I was looking for the wrong sort of name!
First, build a subclass of rv_continuous to represent one of your distributions. We know the pdf for your distributions, so that's what we define. In this case there's just one parameter. If more are needed just add them to the def statement and use them in the return statement as required.
>>> from scipy import stats
>>> param = 3/2
>>> import numpy as np
>>> class NoName(stats.rv_continuous):
...     def _pdf(self, x, param):
...         # np.exp (rather than math.exp) also accepts array arguments
...         return param*np.exp(-param*x)
...
Now create an instance of this object, declare the lower end of its support (i.e., the lowest value that the r.v. can assume), and what the parameters are called.
>>> noname = NoName(a=0, shapes='param')
I don't have an actual sample of values to play with. I'll create a pseudo-random sample.
>>> sample = noname.rvs(size=100, param=param)
Sort it to get the empirical quantiles (the order statistics of the sample).
>>> empirical_quantiles = sorted(sample)
The sample has 100 elements, so generate 100 probabilities at which to evaluate the inverse cdf, or quantile function, as discussed in the paper you referenced.
>>> probabilities = [(i-0.5)/len(sample) for i in range(1, 1+len(sample))]
Get the quantile function values at these points.
>>> theoretical_quantiles = [noname.ppf(p, param=param) for p in probabilities]
Plot it all.
>>> from matplotlib import pyplot as plt
>>> plt.plot([0,3.5], [0, 3.5], 'b-')
[<matplotlib.lines.Line2D object at 0x000000000921B400>]
>>> plt.scatter(empirical_quantiles, theoretical_quantiles)
<matplotlib.collections.PathCollection object at 0x000000000921BD30>
>>> plt.show()
Here's the Q-Q plot that results.
Darn it ... Sorry, I was fixated on a slick solution to somehow bypass the missing inverse CDF and calculate the quantiles directly (and avoid any numerical approaches). But it can also be done by simple brute force.
First, define the candidate quantiles for your distributions yourself (for instance on a grid ten times finer than the original/empirical quantiles). Then calculate the corresponding CDF values. Then compare these values one by one with the ones which were calculated in step 2 in the question. The quantiles whose CDF values have the smallest deviations are the ones you were looking for.
The precision of this solution is limited by the resolution of the quantiles you defined yourself.
But maybe I'm wrong and there is a more elegant way to solve this problem, then I would be happy to hear it!
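The brute-force idea can be sketched with a grid and linear interpolation instead of a one-by-one comparison. This is only an illustration: the exponential density, the grid limits and the grid resolution are assumptions picked so that the result can be checked against a known inverse.

```python
import numpy as np


def pdf(x, rate=1.5):
    # Stand-in density; replace with one of the fitted densities.
    return rate * np.exp(-rate * x)


# Tabulate the cdf on a fine grid with the trapezoidal rule.
grid = np.linspace(0.0, 10.0, 10001)
p = pdf(grid)
cdf_grid = np.concatenate([[0.0], np.cumsum(0.5 * (p[1:] + p[:-1]) * np.diff(grid))])
cdf_grid /= cdf_grid[-1]  # absorb the small mass beyond the grid

# Evenly spaced probabilities, as in step 2 of the question.
n = 100
probs = (np.arange(1, n + 1) - 0.5) / n

# Invert the tabulated cdf by interpolating grid values against cdf values.
theoretical_quantiles = np.interp(probs, cdf_grid, grid)

# Check against the exact exponential quantile function.
exact = -np.log(1.0 - probs) / 1.5
print(np.max(np.abs(theoretical_quantiles - exact)))
```

As noted above, the precision is limited by the grid resolution, and the grid has to extend far enough into the tail to cover the largest probability requested.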

Get statistical difference of correlation coefficient in python

To get the correlation between two arrays in python, I am using:
from scipy.stats import pearsonr
x, y = [1,2,3], [1,5,7]
cor, p = pearsonr(x, y)
However, as stated in the docs, the p-value returned from pearsonr() is only meaningful with datasets larger than 500. So how can I get a p-value that is reasonable for small datasets?
My temporary solution:
After reading up on linear regression, I have come up with my own small script, which basically uses the Fisher transformation to get the z-score, from which the p-value is calculated:
import numpy as np
from scipy.stats import norm
n = len(x)
z = np.log((1+cor)/(1-cor))*0.5*np.sqrt(n-3)
p = norm.cdf(-z)  # one-sided p-value; replaces zprob, which recent SciPy no longer provides
It works. However, I am not sure if it is more reasonable than the p-value given by pearsonr(). Is there a python module which already has this functionality? I have not been able to find it in SciPy or Statsmodels.
Edit to clarify:
The dataset in my example is simplified. My real dataset is two arrays of 10-50 values.
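Two options for samples of 10 to 50 values can be sketched as follows: a two-sided p-value from the Fisher transformation, and a permutation test that does not rely on a normal approximation. The function names, the number of permutations and the synthetic data are assumptions made for this illustration.

```python
import numpy as np
from scipy import stats


def fisher_z_pvalue(x, y):
    # Two-sided p-value from the Fisher z-transformation (needs n > 3).
    r = np.corrcoef(x, y)[0, 1]
    z = np.arctanh(r) * np.sqrt(len(x) - 3)
    return r, 2.0 * stats.norm.sf(abs(z))


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    # Shuffle y to break any association and count correlations at least
    # as extreme as the observed one.
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = abs(np.corrcoef(x, y)[0, 1])
    count = 0
    for _ in range(n_perm):
        r = np.corrcoef(x, rng.permutation(y))[0, 1]
        count += abs(r) >= observed
    return (count + 1) / (n_perm + 1)


# Synthetic example with 20 strongly correlated points.
rng = np.random.default_rng(3)
x = rng.normal(size=20)
y = x + 0.3 * rng.normal(size=20)

r, p_fisher = fisher_z_pvalue(x, y)
p_perm = permutation_pvalue(x, y)
print(r, p_fisher, p_perm)
```

The smallest p-value a permutation test can report is 1/(n_perm + 1), so raise n_perm if finer resolution is needed.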

Integrating a function using non-uniform measure (python/scipy)

I would like to integrate a function in python and provide the probability density (measure) used to sample values. If it's not obvious, integrating f(x)dx in [a,b] implicitly uses the uniform probability density over [a,b], and I would like to use my own probability density (e.g. exponential).
I can do it myself, using np.random.* but then
I miss the optimizations available in scipy.integrate.quad. Or maybe all those optimizations assume the uniform density?
I need to do the error estimation myself, which is not trivial. Or maybe it is? Maybe the error is just the variance of sum(f(x))/n?
Any ideas?
As unutbu said, if you have the density function, then you can just integrate the product of your function with the pdf using scipy.integrate.quad.
For the distributions that are available in scipy.stats, we can also just use the expect function.
For example
>>> import numpy as np
>>> from scipy import stats
>>> f = lambda x: x**2
>>> stats.norm.expect(f, loc=0, scale=1)
1.0000000000000011
>>> stats.norm.expect(f, loc=0, scale=np.sqrt(2))
1.9999999999999996
scipy.integrate.quad also has some predefined weight functions, although they are not normalized to be probability density functions.
The approximation error depends on the settings for the call to integrate.quad.
To summarize, four ways were suggested for calculating the expected value of f(x) under the probability density p(x):
Assuming p is given in closed form, use scipy.integrate.quad to integrate f(x)p(x)
Assuming p can be sampled from, draw N samples X from p, then estimate the expected value by np.mean(f(X)) and the error by np.std(f(X))/np.sqrt(N)
Assuming p is one of the distributions in scipy.stats (e.g. stats.norm), use its expect method, e.g. stats.norm.expect(f)
Assuming we have the CDF(x) of the distribution rather than p(x), calculate H=Inverse[CDF] and then integrate f(H(y)) over [0,1] using scipy.integrate.quad
Another possibility would be to integrate y -> f(H(y)) over [0,1], where H is the inverse of the cumulative distribution function of your probability distribution.
[This is because of a change of variable: setting y=CDF(x) and noting that p(x)=CDF'(x) gives dy=p(x)dx, and thus int{f(x)p(x)dx} == int{f(H(y))dy}, with H the inverse of CDF and y running from 0 to 1.]
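The four routes can be checked against each other on a case with a known answer. The sketch below uses f(x) = x**2 and a unit-rate exponential density, for which the expected value is 2; the distribution, the sample size and the seed are assumptions chosen for illustration.

```python
import numpy as np
from scipy import integrate, stats


def f(x):
    return x ** 2


dist = stats.expon(scale=1.0)  # E[x**2] = 2 for a unit-rate exponential

# 1) Integrate f(x) * p(x) over the support.
val_pdf, err_pdf = integrate.quad(lambda x: f(x) * dist.pdf(x), 0.0, np.inf)

# 2) Monte Carlo: average f over samples drawn from p.
rng = np.random.default_rng(0)
fx = f(rng.exponential(scale=1.0, size=200_000))
val_mc = fx.mean()
err_mc = fx.std(ddof=1) / np.sqrt(fx.size)  # standard error of the mean

# 3) The expect method of the scipy.stats distribution.
val_expect = dist.expect(f)

# 4) Change of variable y = CDF(x): integrate f(H(y)) over [0, 1].
val_ppf, err_ppf = integrate.quad(lambda y: f(dist.ppf(y)), 0.0, 1.0)

print(val_pdf, val_mc, err_mc, val_expect, val_ppf)
```

The quadrature routes return their own error estimates, while the Monte Carlo error shrinks only as 1/sqrt(N), which is why quad is usually preferable when the density or its inverse cdf is available in closed form.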
