How to calculate one-sided tolerance interval with scipy - python

I would like to calculate a one-sided tolerance bound based on the normal distribution, given a data set with known N (sample size), standard deviation, and mean.
If the interval were two-sided I would do the following:
conf_int = stats.norm.interval(alpha, loc=mean, scale=sigma)
In my situation I am bootstrapping samples, but if I weren't I would refer to this post on Stack Overflow: Correct way to obtain confidence interval with scipy, and use the following: conf_int = stats.norm.interval(0.68, loc=mean, scale=sigma / np.sqrt(len(a)))
How would you do the same thing, but calculate it as a one-sided bound (e.g. 95% of values are above, or below, some bound x)?

I assume that you are interested in computing a one-sided tolerance bound using the normal distribution (based on the fact that you mention the scipy.stats.norm.interval function as the two-sided equivalent of your need).
The good news is that, according to the tolerance interval Wikipedia page:
One-sided normal tolerance intervals have an exact solution in terms of the sample mean and sample variance based on the noncentral t-distribution.
(FYI: unfortunately, this is not the case for the two-sided setting.)
This assertion is based on this paper; paragraph 4.8 (page 23) provides the formulas.
The bad news is that I do not think there is a ready-to-use scipy function that you can safely tweak for your purpose.
But you can easily calculate it yourself. You can find GitHub repositories that contain such a calculator to take inspiration from, for example this one, from which I built the following illustrative example:
import numpy as np
from scipy.stats import norm, nct
# sample size
n = 1000
# proportion of the population the tolerance interval should cover
p = 0.9
# confidence level
g = 0.95
# a demo sample drawn from N(100, 1)
x = np.random.normal(loc=100, scale=1, size=n)
# mean estimate based on the sample
mu_est = x.mean()
# standard deviation estimated based on the sample
sigma_est = x.std(ddof=1)
# (100*p)th percentile of the standard normal distribution
zp = norm.ppf(p)
# gth quantile of a non-central t distribution
# with n-1 degrees of freedom and non-centrality parameter np.sqrt(n)*zp
t = nct.ppf(g, df=n-1., nc=np.sqrt(n)*zp)
# k factor from Young et al paper
k = t / np.sqrt(n)
# One-sided tolerance upper bound
conf_upper_bound = mu_est + (k*sigma_est)
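If a lower bound is needed instead (e.g. 95% confidence that 90% of the population lies above it), the same k factor is subtracted rather than added; a minimal sketch continuing the variables above (the N(100, 1) reference in the check simply mirrors how the demo sample was generated):
# One-sided tolerance lower bound: same k factor, subtracted instead of added
conf_lower_bound = mu_est - (k*sigma_est)

# quick sanity check: for a sample this large, the upper bound should sit
# slightly above the true 90th percentile of the underlying N(100, 1) population
print(conf_upper_bound, norm.ppf(p, loc=100, scale=1))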

Here is a one-line solution with the openturns library, assuming your data is a numpy array named sample.
import openturns as ot
ot.NormalFactory().build(sample.reshape(-1, 1)).computeQuantile(0.95)
Let us unpack this. NormalFactory is a class designed to fit the parameters of a Normal distribution (mu and sigma) on a given sample: NormalFactory() creates an instance of this class.
The method build does the actual fitting and returns an object of the class Normal which represents the normal distribution with parameters mu and sigma estimated from the sample.
The reshape is there to make sure that OpenTURNS understands the input sample as a collection of one-dimensional points, not as a single multi-dimensional point.
The class Normal then provides the method computeQuantile to compute any quantile of the distribution (the 95-th percentile in this example).
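If it helps readability, the same one-liner can be unpacked into intermediate steps (the variable names are only illustrative; sample is the same numpy array as above):
factory = ot.NormalFactory()                    # fitter for a Normal(mu, sigma) distribution
fitted = factory.build(sample.reshape(-1, 1))   # Normal distribution with mu, sigma estimated from sample
upper_bound = fitted.computeQuantile(0.95)      # 95th percentile of the fitted distribution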
This solution does not compute the exact tolerance bound, because it uses a quantile of the fitted Normal distribution instead of a Student t-distribution. Effectively, that means it ignores the estimation error on mu and sigma. In practice, this is only an issue for very small sample sizes.
To illustrate this, here is a comparison between the PDF of the standard normal N(0,1) distribution and the PDF of the Student t-distribution with 19 degrees of freedom (this means a sample size of 20). They can barely be distinguished.
deg_freedom = 19
graph = ot.Normal().drawPDF()
student = ot.Student(deg_freedom).drawPDF().getDrawable(0)
student.setColor('blue')
graph.add(student)
graph.setLegends(['Normal(0,1)', 't-dist k={}'.format(deg_freedom)])
graph

Related

What is the hypothesis in .get_influence().cooks_distance in python?

Suppose I ended up with a Cook's distance array like this:
and I am looking at the first element (Cook's distance = 0.368 and p-value = 0.701).
How can I interpret the p-value? It is larger than 0.05, so we do not reject H0, but what is H0?
(Example obtained from https://www.statology.org/cooks-distance-python/)
The p-value here is not the p-value you get from a hypothesis test. If you check the Wikipedia page, Cook's distance is compared to an F distribution with p and n-p degrees of freedom. So the p-value you get is actually the probability of observing a value more extreme than that, under the assumptions of a linear model.
We can look at the source code of statsmodels.stats.outliers_influence.OLSInfluence, whose cooks_distance method is what gets called to calculate Cook's distance:
def cooks_distance(self):
    """Cook's distance and p-values

    Based on one step approximation d_params and on results.cov_params.
    Cook's distance divides by the number of explanatory variables.

    p-values are based on the F-distribution which are only approximate
    outside of linear Gaussian models.

    Warning: The definition of p-values might change if we switch to using
    chi-square distribution instead of F-distribution, or if we make it
    dependent on the fit keyword use_t.
    """
    cooks_d2 = (self.d_params * np.linalg.solve(self.cov_params,
                                                self.d_params.T).T).sum(1)
    cooks_d2 /= self.k_vars

    from scipy import stats

    # alpha = 0.1
    # print stats.f.isf(1-alpha, n_params, res.df_modelwc)
    # TODO use chi2   # use_f option
    pvals = stats.f.sf(cooks_d2, self.k_vars, self.results.df_resid)

    return cooks_d2, pvals
The relevant line is pvals = stats.f.sf(cooks_d2, self.k_vars, self.results.df_resid). So you calculate Cook's distance and look up its survival-function value (1 - CDF) under the F distribution.
It is similar to how you obtain the p-value for a one-sided t-test: you ask what the probability is of observing a t-statistic more extreme than the one obtained from the test.
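To make that concrete, here is a small sketch reproducing the same computation with scipy directly; the Cook's distance, number of explanatory variables and residual degrees of freedom below are made-up values, not taken from the question:
from scipy import stats

cooks_d = 0.368     # hypothetical Cook's distance for one observation
k_vars = 3          # hypothetical number of explanatory variables
df_resid = 46       # hypothetical residual degrees of freedom

# survival function (1 - CDF) of the F(k_vars, df_resid) distribution,
# which is exactly what the statsmodels line above computes
p_val = stats.f.sf(cooks_d, k_vars, df_resid)
print(p_val)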

Find bootstrapped confidence interval with bootstrapped library

So let's imagine I have an array of sample data which is normally distributed. What I want, is to compute the probability of another sample being less than -3 and provide a bootstrapped confidence interval for that probability. After doing some research, I found the bootstrapped python library which I want to use to find the CI.
So I have:
import numpy as np
import bootstrapped.bootstrap as bs
import bootstrapped.stats_functions as bs_stats
mu, sigma = 2.5, 4 # mean and standard deviation
samples = np.random.normal(mu, sigma, 1000)
bs.bootstrap(samples, stat_func= ???)
What should I write for stat_func? I tried writing a lambda function to compute the probability of being less than -3, but it did not work. I know how to compute that probability for a sample; it's the CI which I am having a hard time dealing with.
I followed the example of stats_functions.mean from the bootstrapped package. Below it is wrapped in a 'factory' so that you can specify the level at which you want to calculate the frequency (sadly you cannot pass it as an optional argument to the functions that bootstrap() expects). Basically, prob_less_func_factory(level) returns a function that calculates the proportion of your sample that is less than that level. It can be used on matrices, just like the example I followed.
def prob_less_func_factory(level=-3.0):
    def prob_less_func(values, axis=1):
        '''Returns the proportion of samples that are less than 'level' in each row of a matrix'''
        return np.mean(np.asmatrix(values) < level, axis=axis).A1
    return prob_less_func
Now you pass it in like so
level = -3
bs_res = bs.bootstrap(samples, stat_func = prob_less_func_factory(level=level))
and the result I get (yours will be slightly different because samples is random) is
0.088 (0.06999999999999999, 0.105)
so the bootstrap function estimated (well, calculated) the proportion of values in samples that are less than -3 to be 0.088, and the confidence interval around it is (0.06999999999999999, 0.105).
As a check, we can calculate the theoretical probability of one sample from your distribution being less than -3:
from scipy.stats import norm
print(f'Theoretical Prob(N(mean={mu},std={sigma})<{level}): {norm.cdf(level, loc=mu,scale =sigma)}')
and we get
Theoretical Prob(N(mean=2.5,std=4)<-3): 0.08456572235133569
so it all seems consistent.

How to sample from a custom distribution when parameters are known?

The target is to get samples from a distribution whose parameters are known.
For example, the self-defined distribution is p(X|theta), where theta is a parameter vector of K dimensions and X is a random vector of N dimensions.
Now suppose that (1) theta is known; (2) p(X|theta) is NOT known explicitly, but I know p(X|theta) ∝ f(X,theta), and f is a known function.
Can pymc3 do such sampling from p(X|theta), and how?
The purpose is not to sample from the posterior distribution of the parameters, but to sample from a self-defined distribution.
Starting from a simple example of sampling from a Bernoulli distribution, I did the following:
import pymc3 as pm
import numpy as np
import scipy.stats as stats
import pandas as pd
import theano.tensor as tt
with pm.Model() as model1:
    p = 0.3
    # tt.switch(x1, tt.log(p), tt.log(1 - p)) is the Bernoulli log-likelihood,
    # taken from the pymc3 source code
    density = pm.DensityDist('density',
                             lambda x1: tt.switch(x1, tt.log(p), tt.log(1 - p)))

with model1:
    step = pm.Metropolis()
    samples = pm.sample(1000, step=step)
I expected the result to be 1000 binary digits, with the proportion of 1s being about 0.3. However, I got strange results where very large numbers occur in the output.
I know something is wrong. Please help on how to correctly write pymc3 code for such non-posterior MCMC sampling questions.
Prior predictive sampling (for which you should be using pm.sample_prior_predictive()) involves only the RNGs provided by the RandomVariable objects in your compute graph. By default, DensityDist does not implement an RNG, but it does provide the random parameter for this purpose, so you'll need to use that. The log-likelihood is only evaluated with respect to observables, so it plays no role here.
A simple way to generate a valid RNG for an arbitrary distribution is to use inverse transform sampling. In this case, one samples a uniform distribution on the unit interval and then transforms it through the inverse CDF of the desired function. For the Bernoulli case, the inverse CDF partitions the unit line based on the probability of success, assigning 0 to one part and 1 to the other.
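As a standalone illustration of inverse transform sampling (plain numpy, outside of PyMC3), one can draw exponential samples by pushing uniform draws through the exponential inverse CDF; the rate lam below is just an example value:
import numpy as np

lam = 2.0                           # example rate parameter
u = np.random.uniform(size=10000)   # U(0, 1) draws
# exponential inverse CDF: F^-1(q) = -ln(1 - q) / lam
samples = -np.log(1.0 - u) / lam
print(samples.mean())               # should be close to 1 / lam = 0.5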
Here is a factory-like implementation that creates a Bernoulli RNG compatible with pm.DensityDist's random parameter (i.e., accepts point and size kwargs).
def get_bernoulli_rng(p=0.5):
    def _rng(point=None, size=1):
        # Bernoulli inverse CDF, given p (prob of success)
        _icdf = lambda q: np.uint8(q < p)
        return _icdf(pm.Uniform.dist().random(point=point, size=size))
    return _rng
So, to fill out the example, it would go something like
with pm.Model() as m:
    p = 0.3
    y = pm.DensityDist('y', lambda x: tt.switch(x, tt.log(p), tt.log(1 - p)),
                       random=get_bernoulli_rng(p))
    prior = pm.sample_prior_predictive(random_seed=2019)

prior['y'].mean()  # 0.306
Obviously, this could equally be done with random=pm.Bernoulli.dist(p).random, but the above illustrates generically how one could do this with arbitrary distributions, given their inverse CDF, i.e., you only need to modify _icdf and the parameters.
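For instance, a hypothetical exponential version would only swap the inverse CDF; the factory name and rate lam here are mine, and the matching log-likelihood passed to DensityDist would be lambda x: tt.log(lam) - lam * x:
def get_exponential_rng(lam=1.0):
    def _rng(point=None, size=1):
        # exponential inverse CDF, given rate lam
        _icdf = lambda q: -np.log(1.0 - q) / lam
        return _icdf(pm.Uniform.dist().random(point=point, size=size))
    return _rng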

Multivariate normal CDF in Python

I am looking for a function to compute the CDF of a multivariate normal distribution. I have found that scipy.stats.multivariate_normal has only a method to compute the PDF (for a sample x), but not the CDF: multivariate_normal.pdf(x, mean=mean, cov=cov)
I am looking for the same thing but for the CDF, something like multivariate_normal.cdf(x, mean=mean, cov=cov); unfortunately multivariate_normal doesn't have a cdf method.
The only thing that I found is this: Multivariate Normal CDF in Python using scipy
but the method presented there, scipy.stats.mvn.mvnun(lower, upper, means, covar), doesn't take a sample x as a parameter, so I don't really see how to use it to get something like what I described above.
This is just a clarification of the points that #sascha made above in the comments for the answer. The relevant function can be found here:
As an example, in a bivariate normal distribution with diagonal covariance, the CDF evaluated at the mean should give (1/4) * total area = 0.25 (picture a scatterplot of the distribution split into four quadrants around the mean if you don't see why). The following example will let you play with it:
import numpy as np
from statsmodels.sandbox.distributions.extras import mvnormcdf

upper = np.array((0, 0))  # evaluate the CDF at the mean of the distribution
for i in range(1, 20, 2):
    cov_example = np.array(((i, 0), (0, i)))
    mean_example = np.array((0, 0))
    print(mvnormcdf(upper=upper, mu=mean_example, cov=cov_example))
The output of this is 0.25, 0.25, 0.25, 0.25...
The CDF of some distribution is actually an integral over the PDF of that distribution. That being so, you need to provide the function with the boundaries of the integral.
What most people mean when they ask for a p_value of some point in relation to some distribution is:
what is the chance of getting these values or higher given this distribution?
That quantity is not the value of the PDF at a point, but rather an integral of the PDF from that point onwards.
Accordingly, you need to set your point as the lower boundary, +inf (or some arbitrarily large value) as the upper boundary, and provide the mean vector and covariance matrix you already have:
import numpy as np
import scipy.stats  # note: scipy.stats.mvn is a private module and may be unavailable in recent scipy versions
from sys import maxsize

def mvn_p_value(x, mu, cov_matrix):
    upper_bounds = np.array([maxsize] * x.size)  # make an upper bound the size of your vector
    # mvnun returns (value, inform); the probability is the first element
    p_value = scipy.stats.mvn.mvnun(x, upper_bounds, mu, cov_matrix)[0]
    if 0.5 < p_value:  # this inversion is used for two-sided statistical testing
        p_value = 1 - p_value
    return p_value
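A quick usage sketch, continuing from the snippet above; the point, mean vector and covariance matrix are made up for illustration:
x = np.array([1.0, 2.0])                 # example point
mu = np.array([0.0, 0.0])                # example mean vector
cov_matrix = np.array([[1.0, 0.3],
                       [0.3, 1.0]])      # example covariance matrix
print(mvn_p_value(x, mu, cov_matrix))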

variance vs coefficient of variation

I need to identify which statistic lets me find which line of a digital image has the highest variation. I am using the variance (squared units, calculated as numpy.var(x)) and the coefficient of variation (unitless, calculated as numpy.std(x)/numpy.mean(x)), but I get different results, as here:
v1 = line(VAR(x))
v2 = line(CV(x))
print(v1,v2)
The result:
(12,17)
Should not both find the same line?
Which one could be better to use in this case?
The coefficient of variation and the variance are not supposed to pick the same array on random data. The coefficient of variation is sensitive to both the spread and the scale (mean) of your data, whereas the variance only reflects the spread.
Please see the example:
import numpy as np

x = np.random.randn(10)
x1 = x + 10

np.var(x), np.std(x) / np.mean(x)
# (2.0571740850649021, -2.2697110381499224)
np.var(x1), np.std(x1) / np.mean(x1)
# (2.0571740850649016, 0.1531035017615747)
Which one to choose depends on your application, but I'm leaning towards variance in your case.
Variance measures how much the data vary around the mean (or around the median, when the data are noisy).
The coefficient of variation is the standard deviation divided by the mean; it is usually expressed as a percentage.
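As a minimal sketch, reporting the coefficient of variation as a percentage only takes an extra factor of 100 (the sample x here is arbitrary demo data):
import numpy as np

x = np.random.randn(10) + 10                        # demo data with a nonzero mean
cv_percent = 100 * np.std(x, ddof=1) / np.mean(x)   # CV expressed in percent
print(cv_percent)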
