Related
I would like to calculate a one sided tolerance bound based on the normal distribution given a data set with known N (sample size), standard deviation, and mean.
If the interval were two sided I would do the following:
conf_int = stats.norm.interval(alpha, loc=mean, scale=sigma)
In my situation, I am bootstrapping samples, but if I weren't I would refer to this post on stackoverflow: Correct way to obtain confidence interval with scipy and use the following: conf_int = stats.norm.interval(0.68, loc=mean, scale=sigma / np.sqrt(len(a)))
How would you do the same thing, but to calculate this as a one sided bound (95% of values are above or below x<--bound)?
I assume that you are interested in computing one-side tolerance bound using the normal distribution (based on the fact you mention the scipy.stats.norm.interval function as the two-sided equivalent of your need).
Then the good news is that, based on the tolerance interval Wikipedia page:
One-sided normal tolerance intervals have an exact solution in terms of the sample mean and sample variance based on the noncentral t-distribution.
(FYI: Unfortunately, this is not the case for the two-sided setting)
This assertion is based on this paper. Besides paragraph 4.8 (page 23) provides the formulas.
The bad news is that I do not think there is a ready-to-use scipy function that you can safely tweak and use for your purpose.
But you can easily calculate it yourself. You can find on Github repositories that contain such a calculator from which you can find inspiration, for example that one from which I built the following illustrative example:
import numpy as np
from scipy.stats import norm, nct
# sample size
n=1000
# Percentile for the TI to estimate
p=0.9
# confidence level
g = 0.95
# a demo sample
x = np.array([np.random.normal(100) for k in range(n)])
# mean estimate based on the sample
mu_est = x.mean()
# standard deviation estimated based on the sample
sigma_est = x.std(ddof=1)
# (100*p)th percentile of the standard normal distribution
zp = norm.ppf(p)
# gth quantile of a non-central t distribution
# with n-1 degrees of freedom and non-centrality parameter np.sqrt(n)*zp
t = nct.ppf(g, df=n-1., nc=np.sqrt(n)*zp)
# k factor from Young et al paper
k = t / np.sqrt(n)
# One-sided tolerance upper bound
conf_upper_bound = mu_est + (k*sigma_est)
Here is a one-line solution with the openturns library, assuming your data is a numpy array named sample.
import openturns as ot
ot.NormalFactory().build(sample.reshape(-1, 1)).computeQuantile(0.95)
Let us unpack this. NormalFactory is a class designed to fit the parameters of a Normal distribution (mu and sigma) on a given sample: NormalFactory() creates an instance of this class.
The method build does the actual fitting and returns an object of the class Normal which represents the normal distribution with parameters mu and sigma estimated from the sample.
The sample reshape is there to make sure that OpenTURNS understands that the input sample is a collection of one-dimension points, not a single multi-dimensional point.
The class Normal then provides the method computeQuantile to compute any quantile of the distribution (the 95-th percentile in this example).
This solution does not compute the exact tolerance bound because it uses a quantile from a Normal distribution instead of a Student t-distribution. Effectively, that means that it ignores the estimation error on mu and sigma. In practice, this is only an issue for really small sample sizes.
To illustrate this, here is a comparison between the PDF of the standard normal N(0,1) distribution and the PDF of the Student t-distribution with 19 degrees of freedom (this means a sample size of 20). They can barely be distinguished.
deg_freedom = 19
graph = ot.Normal().drawPDF()
student = ot.Student(deg_freedom).drawPDF().getDrawable(0)
student.setColor('blue')
graph.add(student)
graph.setLegends(['Normal(0,1)', 't-dist k={}'.format(deg_freedom)])
graph
As part of my research, I measure the mean and standard deviation of draws from a lognormal distribution. Given a value of the underlying normal distribution, it should be possible to analytically predict these quantities (as given at https://en.wikipedia.org/wiki/Log-normal_distribution).
However, as can be seen in the plots below, this does not seem to be the case. The first plot shows the mean of the lognormal data against the sigma of the gaussian, while the second plot shows the sigma of the lognormal data against that of the gaussian. Clearly, the "calculated" lines deviate from the "analytic" ones very significantly.
I take the mean of the gaussian distribution to be related to the sigma by mu = -0.5*sigma**2 as this ensures that the lognormal field ought to have mean of 1. Note, this is motivated by the area of physics that I work in: the deviation from analytic values still occurs if you set mu=0.0 for example.
By copying and pasting the code at the bottom of the question, it should be possible to reproduce the plots below. Any advice as to what might be causing this would be much appreciated!
Mean of lognormal vs sigma of gaussian:
Sigma of lognormal vs sigma of gaussian:
Note, to produce the plots above, I used N=10000, but have put N=1000 in the code below for speed.
import numpy as np
import matplotlib.pyplot as plt
mean_calc = []
sigma_calc = []
mean_analytic = []
sigma_analytic = []
ss = np.linspace(1.0,10.0,46)
N = 1000
for s in ss:
mu = -0.5*s*s
ln = np.random.lognormal(mean=mu, sigma=s, size=(N,N))
mean_calc += [np.average(ln)]
sigma_calc += [np.std(ln)]
mean_analytic += [np.exp(mu+0.5*s*s)]
sigma_analytic += [np.sqrt((np.exp(s**2)-1)*(np.exp(2*mu + s*s)))]
plt.loglog(ss,mean_calc,label='calculated')
plt.loglog(ss,mean_analytic,label='analytic')
plt.legend();plt.grid()
plt.xlabel(r'$\sigma_G$')
plt.ylabel(r'$\mu_{LN}$')
plt.show()
plt.loglog(ss,sigma_calc,label='calculated')
plt.loglog(ss,sigma_analytic,label='analytic')
plt.legend();plt.grid()
plt.xlabel(r'$\sigma_G$')
plt.ylabel(r'$\sigma_{LN}$')
plt.show()
TL;DR
Lognormal are positively skewed and heavy tailed distribution. When performing float arithmetic operations (such as sum, mean or std) on sample drawn from a highly skewed distribution, the sampling vector contains values with discrepancy over several order of magnitude (many decades). This makes the computation inaccurate.
The problem comes from those two lines:
mean_calc += [np.average(ln)]
sigma_calc += [np.std(ln)]
Because ln contains both very low and very high values with order of magnitude much higher than float precision.
The problem can be easily detected to warn user that its computation are wrong using the following predicate:
(max(ln) + min(ln)) <= max(ln)
Which is obviously false in Strictly Positive Real but must be considered when using Finite Precision Arithmetic.
Modifying your MCVE
If we slightly modify your MCVE to:
from scipy import stats
for s in ss:
mu = -0.5*s*s
ln = stats.lognorm(s, scale=np.exp(mu)).rvs(N*N)
f = stats.lognorm.fit(ln, floc=0)
mean_calc += [f[2]*np.exp(0.5*s*s)]
sigma_calc += [np.sqrt((np.exp(f[0]**2)-1)*(np.exp(2*mu + s*s)))]
mean_analytic += [np.exp(mu+0.5*s*s)]
sigma_analytic += [np.sqrt((np.exp(s**2)-1)*(np.exp(2*mu + s*s)))]
It gives the reasonably correct mean and standard deviation estimation even for high value of sigma.
The key is that fit uses MLE algorithm to estimates parameters. This totally differs from your original approach which directly performs the mean of the sample.
The fit method returns a tuple with (sigma, loc=0, scale=exp(mu)) which are parameters of the scipy.stats.lognorm object as specified in documentation.
I think you should investigate how you are estimating mean and standard deviation. The divergence probably comes from this part of your algorithm.
There might be several reasons why it diverges, at least consider:
Biased estimator: Are your estimator correct and unbiased? Mean is unbiased estimator (see next section) but maybe not efficient;
Sampled outliers from Pseudo Random Generator may not be as intense as they should be compared to the theoretical distribution: maybe MLE is less sensitive than your estimator New MCVE bellow does not support this hypothesis, but Float Arithmetic Error can explain why your estimators are underestimated;
Float Arithmetic Error New MCVE bellow highlights that it is part of your problem.
A scientific quote
It seems naive mean estimator (simply taking mean), even if unbiased, is inefficient to properly estimate mean for large sigma (see Qi Tang, Comparison of Different Methods for Estimating Log-normal Means, p. 11):
The naive estimator is easy to calculate and it is unbiased. However,
this estimator can be inefficient when variance is large and sample
size is small.
The thesis reviews several methods to estimate mean of a lognormal distribution and uses MLE as reference for comparison. This explains why your method has a drift as sigma increases and MLE stick better alas it is not time efficient for large N. Very interesting paper.
Statistical considerations
Recalling than:
Lognormal is a heavy and long tailed distribution positively skewed. One consequence is: as the shape parameter sigma grows, the asymmetry and skweness grows, so does the strength of outliers.
Effect of Sample Size: as the number of samples drawn from a distribution grows, the expectation of having an outlier increases (so does the extent).
Building a new MCVE
Lets build a new MCVE to make it clearer. The code bellow draws samples of different sizes (N ranges between 100 and 10000) from lognormal distribution where shape parameter varies (sigma ranges between 0.1 and 10) and scale parameter is set to be unitary.
import warnings
import numpy as np
from scipy import stats
# Make computation reproducible among batches:
np.random.seed(123456789)
# Parameters ranges:
sigmas = np.arange(0.1, 10.1, 0.1)
sizes = np.logspace(2, 5, 21, base=10).astype(int)
# Placeholders:
rv = np.empty((sigmas.size,), dtype=object)
xmean = np.full((3, sigmas.size, sizes.size), np.nan)
xstd = np.full((3, sigmas.size, sizes.size), np.nan)
xextent = np.full((2, sigmas.size, sizes.size), np.nan)
eps = np.finfo(np.float64).eps
# Iterate Shape Parameter:
for (i, s) in enumerate(sigmas):
# Create Random Variable:
rv[i] = stats.lognorm(s, loc=0, scale=1)
# Iterate Sample Size:
for (j, N) in enumerate(sizes):
# Draw Samples:
xs = rv[i].rvs(N)
# Sample Extent:
xextent[:,i,j] = [np.min(xs), np.max(xs)]
# Check (max(x) + min(x)) <= max(x)
if (xextent[0,i,j] + xextent[1,i,j]) - xextent[1,i,j] < eps:
warnings.warn("Potential Float Arithmetic Errors: logN(mu=%.2f, sigma=%2f).sample(%d)" % (0, s, N))
# Generate different Estimators:
# Fit Parameters using MLE:
fit = stats.lognorm.fit(xs, floc=0)
xmean[0,i,j] = fit[2]
xstd[0,i,j] = fit[0]
# Naive (Bad Estimators because of Float Arithmetic Error):
xmean[1,i,j] = np.mean(xs)*np.exp(-0.5*s**2)
xstd[1,i,j] = np.sqrt(np.log(np.std(xs)**2*np.exp(-s**2)+1))
# Log-transform:
xmean[2,i,j] = np.exp(np.mean(np.log(xs)))
xstd[2,i,j] = np.std(np.log(xs))
Observation: The new MCVE starts to raise warnings when sigma > 4.
MLE as Reference
Estimating shape and scale parameters using MLE performs well:
The two figures above show than:
Error on estimation grows alongside with shape parameter;
Error on estimation reduces as sample size increases (CTL);
Note than MLE also fits well the shape parameter:
Float Arithmetic
It is worthy to plot the extent of drawn samples versus shape parameter and sample size:
Or the decimal magnitude between smallest and largest number form the sample:
On my setup:
np.finfo(np.float64).precision # 15
np.finfo(np.float64).eps # 2.220446049250313e-16
It means we have at maximum 15 significant figures to work with, if the magnitude between two numbers exceed then the largest number absorb the smaller ones.
A basic example: What is the result of 1 + 1e6 if we can only keep four significant figures?
The exact result is 1,000,001.0 but it must be rounded off to 1.000e6. This implies: the result of the operation equals to the highest number because of the rounding precision. It is inherent of Finite Precision Arithmetic.
The two previous figures above in conjunction with statistical consideration supports your observation that increasing N does not improve estimation for large value of sigma in your MCVE.
Figures above and below show than when sigma > 3 we haven't enough significant figures (less than 5) to performs valid computations.
Further more we can say that estimator will be underestimated because largest numbers will absorb smallest and the underestimated sum will then be divided by N making the estimator biased by default.
When shape parameter becomes sufficiently large, computations are strongly biased because of Arithmetic Float Errors.
It means using quantities such as:
np.mean(xs)
np.std(xs)
When computing estimate will have huge Arithmetic Float Error because of the important discrepancy among values stored in xs. Figures below reproduce your issue:
As stated, estimations are in default (not in excess) because high values (few outliers) in sampled vector absorb small values (most of the sampled values).
Logarithmic Transformation
If we apply a logarithmic transformation, we can drastically reduces this phenomenon:
xmean[2,i,j] = np.exp(np.mean(np.log(xs)))
xstd[2,i,j] = np.std(np.log(xs))
And then the naive estimation of the mean is correct and will be far less affected by Arithmetic Float Error because all sample values will lie within few decades instead of having relative magnitude higher than the float arithmetic precision.
Actually, taking the log-transform returns the same result for mean and std estimation than MLE for each N and sigma:
np.allclose(xmean[0,:,:], xmean[2,:,:]) # True
np.allclose(xstd[0,:,:], xstd[2,:,:]) # True
Reference
If you are looking for complete and detailed explanations of this kind of issues when performing scientific computing, I recommend you to read the excellent book: N. J. Higham, Accuracy and Stability of Numerical Algorithms, Siam, Second Edition, 2002.
Bonus
Here an example of figure generation code:
import matplotlib.pyplot as plt
fig, axe = plt.subplots()
idx = slice(None, None, 5)
axe.loglog(sigmas, xmean[0,:,idx])
axe.axhline(1, linestyle=':', color='k')
axe.set_title(r"MLE: $x \sim \log\mathcal{N}(\mu=0,\sigma)$")
axe.set_xlabel(r"Standard Deviation, $\sigma$")
axe.set_ylabel(r"Mean Estimation, $\hat{\mu}$")
axe.set_ylim([0.1,10])
lgd = axe.legend([r"$N = %d$" % s for s in sizes[idx]] + ['Exact'], bbox_to_anchor=(1,1), loc='upper left')
axe.grid(which='both')
fig.savefig('Lognorm_MLE_Emean_Sigma.png', dpi=120, bbox_extra_artists=(lgd,), bbox_inches='tight')
I'm reading a book on Data Science for Python and the author applies 'sigma-clipping operation' to remove outliers due to typos. However the process isn't explained at all.
What is sigma clipping? Is it only applicable for certain data (eg. in the book it's used towards birth rates in US)?
As per the text:
quartiles = np.percentile(births['births'], [25, 50, 75]) #so we find the 25th, 50th, and 75th percentiles
mu = quartiles[1] #we set mu = 50th percentile
sig = 0.74 * (quartiles[2] - quartiles[0]) #???
This final line is a robust estimate of the sample mean, where the 0.74 comes
from the interquartile range of a Gaussian distribution.
Why 0.74? Is there a proof for this?
This final line is a robust estimate of the sample mean, where the 0.74 comes
from the interquartile range of a Gaussian distribution.
That's it, really...
The code tries to estimate sigma using the interquartile range to make it robust against outliers. 0.74 is a correction factor. Here is how to calculate it:
p1 = sp.stats.norm.ppf(0.25) # first quartile of standard normal distribution
p2 = sp.stats.norm.ppf(0.75) # third quartile
print(p2 - p1) # 1.3489795003921634
sig = 1 # standard deviation of the standard normal distribution
factor = sig / (p2 - p1)
print(factor) # 0.74130110925280102
In the standard normal distribution sig==1 and the interquartile range is 1.35. So 0.74 is the correction factor to turn the interquartile range into sigma. Of course, this is only true for the normal distribution.
Suppose you have a set of data. Compute its median m and its standard deviation sigma. Keep only the data that falls in the range (m-a*sigma,m+a*sigma) for some value of a, and discard everything else. This is one iteration of sigma clipping. Continue to iterate a predetermined number of times, and/or stop when the relative reduction in the value of sigma is small.
Sigma clipping is geared toward removing outliers, to allow for a more robust (i.e. resistant to outliers) estimation of, say, the mean of the distribution. So it's applicable to data where you expect to find outliers.
As for the 0.74, it comes from the interquartile range of the Gaussian distribution, as per the text.
The answers here are accurate and reasonable, but don't quite get to the heart of your question:
What is sigma clipping? Is it only applicable for certain data?
If we want to use mean (mu) and standard deviation (sigma) to figure out a threshold for ejecting extreme values in situations where we have a reason to suspect that those extreme values are mistakes (and not just very high/low values), we don't want to calculate mu/sigma using the dataset which includes these mistakes.
Sample problem: you need to compute a threshold for a temperature sensor to indicate when the temperature is "High" - but sometimes the sensor gives readings that are impossible, like "surface of the sun" high.
Imagine a series that looks like this:
thisSeries = np.array([1,2,3,4,1,2,3,4,5,3,4,5,3, 500, 1000])
Those last two values look like obvious mistakes - but if we use a typical stats function like a Normal PPF, it's going to implicitly assume that those outliers belong in the distribution, and perform its calculation accordingly:
st.norm.ppf(.975, thisSeries.mean(), thisSeries.std())
631.5029013468446
So using a two-sided 5% outlier threshold (meaning we will reject the lower and upper 2.5%), it's telling me that 500 is not an outlier. Even if I use a one-sided threshold of .95 (reject the upper 5%), it will give me 546 as the outlier limit, so again, 500 is regarded as non-outlier.
Sigma-clipping works by focusing on the inter-quartile range and using median instead of mean, so the thresholds won't be calculated under the influence of the extreme values.
thisDF = pd.DataFrame(thisSeries, columns=["value"])
intermed="value"
factor=5
quartiles = np.percentile(thisSeries, [25, 50, 75])
mu, sig = quartiles[1], 0.74 * (quartiles[2] - quartiles[0])
queryString = '({} < #mu - {} * #sig) | ({} > #mu + {} * #sig)'.format(intermed, factor, intermed, factor)
print(mu + 5 * sig)
10.4
print(thisDF.query(queryString))
500
1000
At factor=5, both outliers are correctly isolated, and the threshold is at a reasonable 10.4 - reasonable, given that the 'clean' part of the series is [1,2,3,4,1,2,3,4,5,3,4,5,3]. ('factor' in this context is a scalar applied to the thresholds)
To answer the question, then: sigma clipping is a method of identifying outliers which is immune from the deforming effects of the outliers themselves, and though it can be used in many contexts, it excels in situations where you suspect that the extreme values are not merely high/low values that should be considered part of the dataset, but rather that they are errors.
Here's an illustration of the difference between extreme values that are part of a distribution, and extreme values that are possibly errors, or just so extreme as to deform analysis of the rest of the data.
The data above was generated synthetically, but you can see that the highest values in this set are not deforming the statistics.
Now here's a set generated the same way, but this time with some artificial outliers injected (above 40):
If I sigma-clip this, I can get back to the original histogram and statistics, and apply them usefully to the dataset.
But where sigma-clipping really shines is in real world scenarios, in which faulty data is common. Here's an example that uses real data - historical observations of my heart-rate monitor. Let's look at the histogram without sigma-clipping:
I'm a pretty chill dude, but I know for a fact that my heart rate is never zero. Sigma-clipping handles this easily, and we can now look at the real distribution of heart-rate observations:
Now, you may have some domain knowledge that would enable you to manually assert outlier thresholds or filters. This is one final nuance to why we might use sigma-clipping - in situations where data is being handled entirely by automation, or we have no domain knowledge relating to the measurement or how it's taken, then we don't have any informed basis for filter or threshold statements.
It's easy to say that a heart rate of 0 is not a valid measurement - but what about 10? What about 200? And what if heart-rate is one of thousands of different measurements we're taking. In such cases, maintaining sets of manually defined thresholds and filters would be overly cumbersome.
I think there is a small typo to the sentence that "this final line is a strong estimate of the sample average". From the previous proof, I think the final line is a solid estimate of 1 Sigma for births if the normal distribution is followed.
I'm trying to understand what f_regression() in the feature selection package does.
(http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html#sklearn.feature_selection.f_regression)
According to the documentation, the first step in f_regression is as follows:
"1. the regressor of interest and the data are orthogonalized wrt constant regressors."
What does this line mean, exactly? What are these constant regressors?
Thanks!
It means that the mean is subtracted on both variables.
A constant regressor is a vector full of ones. What this vector can explain in your data is then subtracted out. This leads to a vector with zero sum, i.e. a centered variable.
What f1_regression essentially calculates is correlation, a scalar product between centered and appropriately rescaled variables.
The resulting score is a function of this value and the degrees of freedom, i.e. the dimensionality of the vectors. The higher the score, the more probably the variables are associated.
I need to compute the mutual information, and so the shannon entropy of N variables.
I wrote a code that compute shannon entropy of certain distribution.
Let's say that I have a variable x, array of numbers.
Following the definition of shannon entropy I need to compute the probability density function normalized, so using the numpy.histogram is easy to get it.
import scipy.integrate as scint
from numpy import*
from scipy import*
def shannon_entropy(a, bins):
p,binedg= histogram(a,bins,normed=True)
p=p/len(p)
x=binedg[:-1]
g=-p*log2(p)
g[isnan(g)]=0.
return scint.simps(g,x=x)
Choosing inserting x, and carefully the bin number this function works.
But this function is very dependent on the bin number: choosing different values of this parameter I got different values.
Particularly if my input is an array of values constant:
x=[0,0,0,....,0,0,0]
the entropy of this variables obviously has to be 0, but if I choose the bin number equal to 1 I got the right answer, if I choose different values I got strange non sense (negative) answers.. what I am feeling is that numpy.histogram have the arguments normed=True or density= True that (as said in the official documentation) they should give back the histogram normalized, and probably I do some error in the moment that I swich from the probability density function (output of numpy.histogram) to the probability mass function (input of shannon entropy), I do:
p,binedg= histogram(a,bins,normed=True)
p=p/len(p)
I would like to find a way to solve these problems, I would like to have an efficient method to compute the shannon entropy independent of the bin number.
I wrote a function to compute the shannon entropy of a distribution of more variables, but I got the same error.
The code is this, where the input of the function shannon_entropydd is the array where at each position there is each variable that has to be involved in the statistical computation
def intNd(c,axes):
assert len(c.shape) == len(axes)
assert all([c.shape[i] == axes[i].shape[0] for i in range(len(axes))])
if len(axes) == 1:
return scint.simps(c,axes[0])
else:
return intNd(scint.simps(c,axes[-1]),axes[:-1])
def shannon_entropydd(c,bins=30):
hist,ax=histogramdd(c,bins,normed=True)
for i in range(len(ax)):
ax[i]=ax[i][:-1]
p=-hist*log2(hist)
p[isnan(p)]=0
return intNd(p,ax)
I need these quantities in order to be able to compute the mutual information between certain set of variables:
M_info(x,y,z)= H(x)+H(z)+H(y)- H(x,y,z)
where H(x) is the shannon entropy of the variable x
I have to find a way to compute these quantities so if some one has a completely different kind of code that works I can switch on it, I don't need to repair this code but find a right way to compute this statistical functions!
The result will depend pretty strongly on the estimated density. Can you assume a specific form for the density? You can reduce the dependence of the result on the estimate if you avoid histograms or other general-purpose estimates such as kernel density estimates. If you can give more detail about the variables involved, I can make more specific comments.
I worked with estimates of mutual information as part of the work for my dissertation [1]. There is some stuff about MI in section 8.1 and appendix F.
[1] http://riso.sourceforge.net/docs/dodier-dissertation.pdf
I think that if you choose bins = 1, you will always find an entropy of 0, as there is no "uncertainty" over the possible bin the values are in ("uncertainty" is what entropy measures). You should choose an number of bins "big enough" to account for the diversity of the values that your variable can take. If you have discrete values: for binary values, you should take such that bins >= 2. If the values that can take your variable are in {0,1,2}, you should have bins >= 3, and so on...
I must say that I did not read your code, but this works for me:
import numpy as np
x = [0,1,1,1,0,0,0,1,1,0,1,1]
bins = 10
cx = np.histogram(x, bins)[0]
def entropy(c):
c_normalized = c/float(np.sum(c))
c_normalized = c_normalized[np.nonzero(c_normalized)]
h = -sum(c_normalized * np.log(c_normalized))
return h
hx = entropy(cx)