Suppose i ended up with a cook's distance array like this:
and looking at the first element (cook's distance = 0.368 and p-value = 0.701).
How can i interpret the p-value? It is larger than 0.05 and reject the H0, but what is H0?
example obtained from https://www.statology.org/cooks-distance-python/
The p value is not the p value you get from a hypothesis test. If you check wiki, Cook's distance follows a F distribution with p and n-p degrees of freedom. So the p-value you get is actually the probability of observing a value more extreme than that, with the assumptions of a linear model that is.
We can look at the source code for statsmodels.stats.outliers_influence.OLSInfluence which is the function called for calculating cooks distance:
def cooks_distance(self):
"""Cook's distance and p-values
Based on one step approximation d_params and on results.cov_params
Cook's distance divides by the number of explanatory variables.
p-values are based on the F-distribution which are only approximate
outside of linear Gaussian models.
Warning: The definition of p-values might change if we switch to using
chi-square distribution instead of F-distribution, or if we make it
dependent on the fit keyword use_t.
"""
cooks_d2 = (self.d_params * np.linalg.solve(self.cov_params,
self.d_params.T).T).sum(1)
cooks_d2 /= self.k_vars
from scipy import stats
# alpha = 0.1
# print stats.f.isf(1-alpha, n_params, res.df_modelwc)
# TODO use chi2 # use_f option
pvals = stats.f.sf(cooks_d2, self.k_vars, self.results.df_resid)
return cooks_d2, pvals
The relevant line is pvals = stats.f.sf(cooks_d2, self.k_vars, self.results.df_resid) . So you calculate cooks distance and look at its 1-cdf value on the F distribution.
It is similar to how you obtain the p-value for a one sided t-test, you ask what is the probability of observing a t-statistic more extreme than that obtained from the test.
Related
I would like to calculate a one sided tolerance bound based on the normal distribution given a data set with known N (sample size), standard deviation, and mean.
If the interval were two sided I would do the following:
conf_int = stats.norm.interval(alpha, loc=mean, scale=sigma)
In my situation, I am bootstrapping samples, but if I weren't I would refer to this post on stackoverflow: Correct way to obtain confidence interval with scipy and use the following: conf_int = stats.norm.interval(0.68, loc=mean, scale=sigma / np.sqrt(len(a)))
How would you do the same thing, but to calculate this as a one sided bound (95% of values are above or below x<--bound)?
I assume that you are interested in computing one-side tolerance bound using the normal distribution (based on the fact you mention the scipy.stats.norm.interval function as the two-sided equivalent of your need).
Then the good news is that, based on the tolerance interval Wikipedia page:
One-sided normal tolerance intervals have an exact solution in terms of the sample mean and sample variance based on the noncentral t-distribution.
(FYI: Unfortunately, this is not the case for the two-sided setting)
This assertion is based on this paper. Besides paragraph 4.8 (page 23) provides the formulas.
The bad news is that I do not think there is a ready-to-use scipy function that you can safely tweak and use for your purpose.
But you can easily calculate it yourself. You can find on Github repositories that contain such a calculator from which you can find inspiration, for example that one from which I built the following illustrative example:
import numpy as np
from scipy.stats import norm, nct
# sample size
n=1000
# Percentile for the TI to estimate
p=0.9
# confidence level
g = 0.95
# a demo sample
x = np.array([np.random.normal(100) for k in range(n)])
# mean estimate based on the sample
mu_est = x.mean()
# standard deviation estimated based on the sample
sigma_est = x.std(ddof=1)
# (100*p)th percentile of the standard normal distribution
zp = norm.ppf(p)
# gth quantile of a non-central t distribution
# with n-1 degrees of freedom and non-centrality parameter np.sqrt(n)*zp
t = nct.ppf(g, df=n-1., nc=np.sqrt(n)*zp)
# k factor from Young et al paper
k = t / np.sqrt(n)
# One-sided tolerance upper bound
conf_upper_bound = mu_est + (k*sigma_est)
Here is a one-line solution with the openturns library, assuming your data is a numpy array named sample.
import openturns as ot
ot.NormalFactory().build(sample.reshape(-1, 1)).computeQuantile(0.95)
Let us unpack this. NormalFactory is a class designed to fit the parameters of a Normal distribution (mu and sigma) on a given sample: NormalFactory() creates an instance of this class.
The method build does the actual fitting and returns an object of the class Normal which represents the normal distribution with parameters mu and sigma estimated from the sample.
The sample reshape is there to make sure that OpenTURNS understands that the input sample is a collection of one-dimension points, not a single multi-dimensional point.
The class Normal then provides the method computeQuantile to compute any quantile of the distribution (the 95-th percentile in this example).
This solution does not compute the exact tolerance bound because it uses a quantile from a Normal distribution instead of a Student t-distribution. Effectively, that means that it ignores the estimation error on mu and sigma. In practice, this is only an issue for really small sample sizes.
To illustrate this, here is a comparison between the PDF of the standard normal N(0,1) distribution and the PDF of the Student t-distribution with 19 degrees of freedom (this means a sample size of 20). They can barely be distinguished.
deg_freedom = 19
graph = ot.Normal().drawPDF()
student = ot.Student(deg_freedom).drawPDF().getDrawable(0)
student.setColor('blue')
graph.add(student)
graph.setLegends(['Normal(0,1)', 't-dist k={}'.format(deg_freedom)])
graph
The target is to get samples from a distribution whose parameters is known.
For example, the self-defined distribution is p(X|theta), where theta the parameter vector of K dimensions and X is the random vector of N dimensions.
Now we know (1) the theta is known; (2) p(X|theta) is NOT known, but I know p(X|theta) ∝ f(X,theta), and f is a known function.
Can pymc3 do such sampling from p(X|theta), and how?
The purpose is not sampling from posterior distribution of parameters, but want to samples from a self-defined distribution.
Starting from a simple example of sampling from a Bernoulli distribution. I did the following:
import pymc3 as pm
import numpy as np
import scipy.stats as stats
import pandas as pd
import theano.tensor as tt
with pm.Model() as model1:
p=0.3
density = pm.DensityDist('density',
lambda x1: tt.switch( x1, tt.log(p), tt.log(1 - p) ),
) #tt.switch( x1, tt.log(p), tt.log(1 - p) ) is the log likelihood from pymc3 source code
with model1:
step = pm.Metropolis()
samples = pm.sample(1000, step=step)
I expect the result is 1000 binary digits, with the proportion of 1 is about 0.3. However, I got strange results where very large numbers occur in the output.
I know something is wrong. Please help on how to correctly write pymc3 codes for such non-posterior MCMC sampling questions.
Prior predictive sampling (for which you should be using pm.sample_prior_predictive()) involves only using the RNGs provided by the RandomVariable objects in your compute graph. By default, DensityDist does not implement a RNG, but does provide the random parameter for this purpose, so you'll need to use that. The log-likelihood is only evaluated with respect to observables, so it plays no role here.
A simple way to generate a valid RNG for an arbitrary distribution is to use inverse transform sampling. In this case, one samples a uniform distribution on the unit interval and then transforms it through the inverse CDF of the desired function. For the Bernoulli case, the inverse CDF partitions the unit line based on the probability of success, assigning 0 to one part and 1 to the other.
Here is a factory-like implementation that creates a Bernoulli RNG compatible with pm.DensityDist's random parameter (i.e., accepts point and size kwargs).
def get_bernoulli_rng(p=0.5):
def _rng(point=None, size=1):
# Bernoulli inverse CDF, given p (prob of success)
_icdf = lambda q: np.uint8(q < p)
return _icdf(pm.Uniform.dist().random(point=point, size=size))
return _rng
So, to fill out the example, it would go something like
with pm.Model() as m:
p = 0.3
y = pm.DensityDist('y', lambda x: tt.switch(x, tt.log(p), tt.log(1-p)),
random=get_bernoulli_rng(p))
prior = pm.sample_prior_predictive(random_seed=2019)
prior['y'].mean() # 0.306
Obviously, this could equally be done with random=pm.Bernoulli.dist(p).random, but the above illustrates generically how one could do this with arbitrary distributions, given their inverse CDF, i.e., you only need to modify _icdf and the parameters.
I am looking for a function to compute the CDF for a multivariate normal distribution. I have found that scipy.stats.multivariate_normal have only a method to compute the PDF (for a sample x) but not the CDF multivariate_normal.pdf(x, mean=mean, cov=cov)
I am looking for the same thing but to compute the cdf, something like: multivariate_normal.cdf(x, mean=mean, cov=cov), but unfortunately multivariate_normal doesn't have a cdf method.
The only thing that I found is this: Multivariate Normal CDF in Python using scipy
but the presented method scipy.stats.mvn.mvnun(lower, upper, means, covar) doesn't take a sample x as a parameter, so I don't really see how to use it to have something similar to what I said above.
This is just a clarification of the points that #sascha made above in the comments for the answer. The relevant function can be found here:
As an example, in a multivariate normal distribution with diagonal covariance the cfd should give (1/4) * Total area = 0.25 (look at the scatterplot below if you don't understand why) The following example will allow you to play with it:
from statsmodels.sandbox.distributions.extras import mvnormcdf
from scipy.stats import mvn
for i in range(1, 20, 2):
cov_example = np.array(((i, 0), (0, i)))
mean_example = np.array((0, 0))
print(mvnormcdf(upper=upper, mu=mean_example, cov=cov_example))
The output of this is 0.25, 0.25, 0.25, 0.25...
The CDF of some distribution is actually an integral over the PDF of that distribution. That being so, you need to provide the function with the boundaries of the integral.
What most people mean when they ask for a p_value of some point in relation to some distribution is:
what is the chance of getting these values or higher given this distribution?
Note the area marked in red - it is not a point, but rather an integral from some point onwards:
Accordingly, you need to set your point as the lower boundary, +inf (or some arbitrarily high enough value) as the upper boundary and provide the means and covariance matrix you already have:
from sys import maxsize
def mvn_p_value(x, mu, cov_matrix):
upper_bounds = np.array([maxsize] * x.size) # make an upper bound the size of your vector
p_value = scipy.stats.mvn.mvnun(x, upper_bounds, mu, cov_matrix)[1]
if 0.5 < p_value: # this inversion is used for two-sided statistical testing
p_value = 1 - p_value
return p_value
I need to compute the mutual information, and so the shannon entropy of N variables.
I wrote a code that compute shannon entropy of certain distribution.
Let's say that I have a variable x, array of numbers.
Following the definition of shannon entropy I need to compute the probability density function normalized, so using the numpy.histogram is easy to get it.
import scipy.integrate as scint
from numpy import*
from scipy import*
def shannon_entropy(a, bins):
p,binedg= histogram(a,bins,normed=True)
p=p/len(p)
x=binedg[:-1]
g=-p*log2(p)
g[isnan(g)]=0.
return scint.simps(g,x=x)
Choosing inserting x, and carefully the bin number this function works.
But this function is very dependent on the bin number: choosing different values of this parameter I got different values.
Particularly if my input is an array of values constant:
x=[0,0,0,....,0,0,0]
the entropy of this variables obviously has to be 0, but if I choose the bin number equal to 1 I got the right answer, if I choose different values I got strange non sense (negative) answers.. what I am feeling is that numpy.histogram have the arguments normed=True or density= True that (as said in the official documentation) they should give back the histogram normalized, and probably I do some error in the moment that I swich from the probability density function (output of numpy.histogram) to the probability mass function (input of shannon entropy), I do:
p,binedg= histogram(a,bins,normed=True)
p=p/len(p)
I would like to find a way to solve these problems, I would like to have an efficient method to compute the shannon entropy independent of the bin number.
I wrote a function to compute the shannon entropy of a distribution of more variables, but I got the same error.
The code is this, where the input of the function shannon_entropydd is the array where at each position there is each variable that has to be involved in the statistical computation
def intNd(c,axes):
assert len(c.shape) == len(axes)
assert all([c.shape[i] == axes[i].shape[0] for i in range(len(axes))])
if len(axes) == 1:
return scint.simps(c,axes[0])
else:
return intNd(scint.simps(c,axes[-1]),axes[:-1])
def shannon_entropydd(c,bins=30):
hist,ax=histogramdd(c,bins,normed=True)
for i in range(len(ax)):
ax[i]=ax[i][:-1]
p=-hist*log2(hist)
p[isnan(p)]=0
return intNd(p,ax)
I need these quantities in order to be able to compute the mutual information between certain set of variables:
M_info(x,y,z)= H(x)+H(z)+H(y)- H(x,y,z)
where H(x) is the shannon entropy of the variable x
I have to find a way to compute these quantities so if some one has a completely different kind of code that works I can switch on it, I don't need to repair this code but find a right way to compute this statistical functions!
The result will depend pretty strongly on the estimated density. Can you assume a specific form for the density? You can reduce the dependence of the result on the estimate if you avoid histograms or other general-purpose estimates such as kernel density estimates. If you can give more detail about the variables involved, I can make more specific comments.
I worked with estimates of mutual information as part of the work for my dissertation [1]. There is some stuff about MI in section 8.1 and appendix F.
[1] http://riso.sourceforge.net/docs/dodier-dissertation.pdf
I think that if you choose bins = 1, you will always find an entropy of 0, as there is no "uncertainty" over the possible bin the values are in ("uncertainty" is what entropy measures). You should choose an number of bins "big enough" to account for the diversity of the values that your variable can take. If you have discrete values: for binary values, you should take such that bins >= 2. If the values that can take your variable are in {0,1,2}, you should have bins >= 3, and so on...
I must say that I did not read your code, but this works for me:
import numpy as np
x = [0,1,1,1,0,0,0,1,1,0,1,1]
bins = 10
cx = np.histogram(x, bins)[0]
def entropy(c):
c_normalized = c/float(np.sum(c))
c_normalized = c_normalized[np.nonzero(c_normalized)]
h = -sum(c_normalized * np.log(c_normalized))
return h
hx = entropy(cx)
I'm trying to automate a process that at some point needs to draw samples from a truncated multivariate normal. That is, it's a normal multivariate normal distribution (i.e. Gaussian) but the variables are constrained to a cuboid. My given inputs are the mean and covariance of the full multivariate normal but I need samples in my box.
Up to now, I'd just been rejecting samples outside the box and resampling as necessary, but I'm starting to find that my process sometimes gives me (a) large covariances and (b) means that are close to the edges. These two events conspire against the speed of my system.
So what I'd like to do is sample the distribution correctly in the first place. Googling led only to this discussion or the truncnorm distribution in scipy.stats. The former is inconclusive and the latter seems to be for one variable. Is there any native multivariate truncated normal? And is it going to be any better than rejecting samples, or should I do something smarter?
I'm going to start working on my own solution, which would be to rotate the untruncated Gaussian to it's principal axes (with an SVD decomposition or something), use a product of truncated Gaussians to sample the distribution, then rotate that sample back, and reject/resample as necessary. If the truncated sampling is more efficient, I think this should sample the desired distribution faster.
So, according to the Wikipedia article, sampling a multivariate truncated normal distribution (MTND) is more difficult. I ended up taking a relatively easy way out and using an MCMC sampler to relax an initial guess towards the MTND as follows.
I used emcee to do the MCMC work. I find this package phenomenally easy-to-use. It only requires a function that returns the log-probability of the desired distribution. So I defined this function
from numpy.linalg import inv
def lnprob_trunc_norm(x, mean, bounds, C):
if np.any(x < bounds[:,0]) or np.any(x > bounds[:,1]):
return -np.inf
else:
return -0.5*(x-mean).dot(inv(C)).dot(x-mean)
Here, C is the covariance matrix of the multivariate normal. Then, you can run something like
S = emcee.EnsembleSampler(Nwalkers, Ndim, lnprob_trunc_norm, args = (mean, bounds, C))
pos, prob, state = S.run_mcmc(pos, Nsteps)
for given mean, bounds and C. You need an initial guess for the walkers' positions pos, which could be a ball around the mean,
pos = emcee.utils.sample_ball(mean, np.sqrt(np.diag(C)), size=Nwalkers)
or sampled from an untruncated multivariate normal,
pos = numpy.random.multivariate_normal(mean, C, size=Nwalkers)
and so on. I personally do several thousand steps of sample discarding first, because it's fast, then force the remaining outliers back within the bounds, then run the MCMC sampling.
The number of steps for convergence is up to you.
Note also that emcee easily supports basic parallelization by adding the argument threads=Nthreads to the EnsembleSampler initialization. So you can make this blazing fast.
I have reimplemented an algorithm which does not depend on MCMC but creates independent and identically distributed (iid) samples from the truncated multivariate normal distribution. Having iid samples can be very useful! I used to also use emcee as described in the answer by Warrick, but for convergence the number of samples needed exploded in higher dimensions, making it impractical for my use case.
The algorithm was introduced by Botev (2016) and uses an accept-reject algorithm based on minimax exponential tilting. It was originally implemented in MATLAB but reimplementing it for Python increased the performance significantly compared to running it using the MATLAB engine in Python. It also works well and is fast at higher dimensions.
The code is available at: https://github.com/brunzema/truncated-mvn-sampler.
An Example:
d = 10 # dimensions
# random mu and cov
mu = np.random.rand(d)
cov = 0.5 - np.random.rand(d ** 2).reshape((d, d))
cov = np.triu(cov)
cov += cov.T - np.diag(cov.diagonal())
cov = np.dot(cov, cov)
# constraints
lb = np.zeros_like(mu) - 1
ub = np.ones_like(mu) * np.inf
# create truncated normal and sample from it
n_samples = 100000
tmvn = TruncatedMVN(mu, cov, lb, ub)
samples = tmvn.sample(n_samples)
Plotting the first dimension results in:
Reference:
Botev, Z. I., (2016), The normal law under linear restrictions: simulation and estimation via minimax tilting, Journal of the Royal Statistical Society Series B, 79, issue 1, p. 125-148
Simulating truncated multivariate normal can be tricky and usually involves some conditional sampling by MCMC.
My short answer is, you can use my code (https://github.com/ralphma1203/trun_mvnt)!!! It implements the Gibbs sampler algorithm from , which can handle general linear constraints in the form of , even when you have non-full rank D and more constraints than the dimensionality.
import numpy as np
from trun_mvnt import rtmvn, rtmvt
########## Traditional problem, probably what you need... ##########
##### lower < X < upper #####
# So D = identity matrix
D = np.diag(np.ones(4))
lower = np.array([-1,-2,-3,-4])
upper = -lower
Mean = np.zeros(4)
Sigma = np.diag([1,2,3,4])
n = 10 # want 500 final sample
burn = 100 # burn-in first 100 iterates
thin = 1 # thinning for Gibbs
random_sample = rtmvn(n, Mean, Sigma, D, lower, upper, burn, thin)
# Numpy array n-by-p as result!
random_sample
########## Non-full rank problem (more constraints than dimension) ##########
Mean = np.array([0,0])
Sigma = np.array([1, 0.5, 0.5, 1]).reshape((2,2)) # bivariate normal
D = np.array([1,0,0,1,1,-1]).reshape((3,2)) # non-full rank problem
lower = np.array([-2,-1,-2])
upper = np.array([2,3,5])
n = 500 # want 500 final sample
burn = 100 # burn-in first 100 iterates
thin = 1 # thinning for Gibbs
random_sample = rtmvn(n, Mean, Sigma, D, lower, upper, burn, thin) # Numpy array n-by-p as result!
A little late I guess but for the record, you could use Hamiltonian Monte Carlo. A module in Matlab exists named HMC exact. It shouldn't be too difficult to translate in Py.