Weighted standard deviation in NumPy - python

numpy.average() has a weights option, but numpy.std() does not. Does anyone have suggestions for a workaround?

How about the following short "manual calculation"?
def weighted_avg_and_std(values, weights):
"""
Return the weighted average and standard deviation.
values, weights -- Numpy ndarrays with the same shape.
"""
average = numpy.average(values, weights=weights)
# Fast and numerically precise:
variance = numpy.average((values-average)**2, weights=weights)
return (average, math.sqrt(variance))

There is a class in statsmodels that makes it easy to calculate weighted statistics: statsmodels.stats.weightstats.DescrStatsW.
Assuming this dataset and weights:
import numpy as np
from statsmodels.stats.weightstats import DescrStatsW
array = np.array([1,2,1,2,1,2,1,3])
weights = np.ones_like(array)
weights[3] = 100
You initialize the class (note that you have to pass in the correction factor, the delta degrees of freedom at this point):
weighted_stats = DescrStatsW(array, weights=weights, ddof=0)
Then you can calculate:
.mean the weighted mean:
>>> weighted_stats.mean
1.97196261682243
.std the weighted standard deviation:
>>> weighted_stats.std
0.21434289609681711
.var the weighted variance:
>>> weighted_stats.var
0.045942877107170932
.std_mean the standard error of weighted mean:
>>> weighted_stats.std_mean
0.020818822467555047
Just in case you're interested in the relation between the standard error and the standard deviation: The standard error is (for ddof == 0) calculated as the weighted standard deviation divided by the square root of the sum of the weights minus 1 (corresponding source for statsmodels version 0.9 on GitHub):
standard_error = standard_deviation / sqrt(sum(weights) - 1)

Here's one more option:
np.sqrt(np.cov(values, aweights=weights))

There doesn't appear to be such a function in numpy/scipy yet, but there is a ticket proposing this added functionality. Included there you will find Statistics.py which implements weighted standard deviations.

There is a very good example proposed by gaborous:
import pandas as pd
import numpy as np
# X is the dataset, as a Pandas' DataFrame
mean = mean = np.ma.average(X, axis=0, weights=weights) # Computing the
weighted sample mean (fast, efficient and precise)
# Convert to a Pandas' Series (it's just aesthetic and more
# ergonomic; no difference in computed values)
mean = pd.Series(mean, index=list(X.keys()))
xm = X-mean # xm = X diff to mean
xm = xm.fillna(0) # fill NaN with 0 (because anyway a variance of 0 is
just void, but at least it keeps the other covariance's values computed
correctly))
sigma2 = 1./(w.sum()-1) * xm.mul(w, axis=0).T.dot(xm); # Compute the
unbiased weighted sample covariance
Correct equation for weighted unbiased sample covariance, URL (version: 2016-06-28)

A follow-up to "sample" or "unbiased" standard deviation in the "frequency weights" sense since "weighted sample standard deviation python" Google search leads to this post:
def frequency_sample_std_dev(X, n):
"""
Sample standard deviation for X and n,
where X[i] is the quantity each person in group i has,
and n[i] is the number of people in group i.
See Equation 6.4 of:
Montgomery, Douglas, C. and George C. Runger. Applied Statistics
and Probability for Engineers, Enhanced eText. Available from:
WileyPLUS, (7th Edition). Wiley Global Education US, 2018.
"""
n_groups = len(n)
n_people = sum(n)
lhs_numerator = sum([ni*Xi**2 for Xi, ni in zip(X, n)])
rhs_numerator = sum([Xi*ni for Xi, ni in zip(X,n)])**2/n_people
denominator = n_people-1
var = (lhs_numerator - rhs_numerator) / denominator
std = sqrt(var)
return std
Or modifying the answer by #Eric as follows:
def weighted_sample_avg_std(values, weights):
"""
Return the weighted average and weighted sample standard deviation.
values, weights -- Numpy ndarrays with the same shape.
Assumes that weights contains only integers (e.g. how many samples in each group).
See also https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Frequency_weights
"""
average = np.average(values, weights=weights)
variance = np.average((values-average)**2, weights=weights)
variance = variance*sum(weights)/(sum(weights)-1)
return (average, sqrt(variance))
print(weighted_sample_avg_std(X, n))

I was just searching for an API equivalent of the numpy np.std function that also allows the axis parameter to be set:
(I just tested it with two dimensions, so feel free for improvements if something is incorrect.)
def std(values, weights=None, axis=None):
"""
Return the weighted standard deviation.
axis -- the axis for std calculation
values, weights -- Numpy ndarrays with the same shape on the according axis.
"""
average = np.expand_dims(np.average(values, weights=weights, axis=axis), axis=axis)
# Fast and numerically precise:
variance = np.average((values-average)**2, weights=weights, axis=axis)
return np.sqrt(variance)
Thanks to Eric O Lebigot for the original answer.

Related

What does the "scale" parameter in scipy.stats.t.std() stand for?

My goal is, to find the standard deviation of a dataset with a supposed t-distribution to calculate the survival function given a quantile.
As the documentation of scipy.stats is very counter intuitive to me, I tried several things and ended up with the implementation below. (Note: the numerated variables only demonstrate, that there are different results. My goal is to end up with only one result each!)
import scipy
df, loc, scale = scipy.stats.t.fit(data, fdf=len(data)-1)
std1 = scipy.stats.t.std(df=df, loc=loc, scale=scale)
std2 = scipy.stats.t.std(df=df, loc=loc)
res1 = scipy.stats.sf(some_x, df, loc, scale)
res2 = scipy.stats.sf(some_x, df, loc, std1)
res3 = scipy.stats.sf(some_x, df, loc, std2)
I encountered that, loc equals the stats.t.mean() function, when given the values from the fit-function. But scale does not equal stats.t.std(). Hence the std1 and std2 are different and not equal to scale.
I can only find sources for the normal distribution, where it's stated that scale equals std.
How should I use the functions above appropriatly?
Any help or suggestions for editing the question would be much appreciated :)
Code on and stay healthy!
Student's T distribution is not supposed to be shifted or scaled, it's used as a standard distribution with mean=0, usually to test the difference between two means of normally distributed populations https://en.wikipedia.org/wiki/Student%27s_t-distribution.
Given a sample with n observation and the Student's T distribution with v=n-1 degrees of freedom, the standard deviation is sqrt(v / (v-2)).
You can check in scipy that this is true
n = 11
v = n - 1
dist = sps.t(df=v)
# standard deviation
# from scipy distribution
print(dist.std()) # will return 1.118033988749895
# standard deviation
# from theory
print(np.sqrt(v / (v - 2))) # will return 1.118033988749895

How to calculate one-sided tolerance interval with scipy

I would like to calculate a one sided tolerance bound based on the normal distribution given a data set with known N (sample size), standard deviation, and mean.
If the interval were two sided I would do the following:
conf_int = stats.norm.interval(alpha, loc=mean, scale=sigma)
In my situation, I am bootstrapping samples, but if I weren't I would refer to this post on stackoverflow: Correct way to obtain confidence interval with scipy and use the following: conf_int = stats.norm.interval(0.68, loc=mean, scale=sigma / np.sqrt(len(a)))
How would you do the same thing, but to calculate this as a one sided bound (95% of values are above or below x<--bound)?
I assume that you are interested in computing one-side tolerance bound using the normal distribution (based on the fact you mention the scipy.stats.norm.interval function as the two-sided equivalent of your need).
Then the good news is that, based on the tolerance interval Wikipedia page:
One-sided normal tolerance intervals have an exact solution in terms of the sample mean and sample variance based on the noncentral t-distribution.
(FYI: Unfortunately, this is not the case for the two-sided setting)
This assertion is based on this paper. Besides paragraph 4.8 (page 23) provides the formulas.
The bad news is that I do not think there is a ready-to-use scipy function that you can safely tweak and use for your purpose.
But you can easily calculate it yourself. You can find on Github repositories that contain such a calculator from which you can find inspiration, for example that one from which I built the following illustrative example:
import numpy as np
from scipy.stats import norm, nct
# sample size
n=1000
# Percentile for the TI to estimate
p=0.9
# confidence level
g = 0.95
# a demo sample
x = np.array([np.random.normal(100) for k in range(n)])
# mean estimate based on the sample
mu_est = x.mean()
# standard deviation estimated based on the sample
sigma_est = x.std(ddof=1)
# (100*p)th percentile of the standard normal distribution
zp = norm.ppf(p)
# gth quantile of a non-central t distribution
# with n-1 degrees of freedom and non-centrality parameter np.sqrt(n)*zp
t = nct.ppf(g, df=n-1., nc=np.sqrt(n)*zp)
# k factor from Young et al paper
k = t / np.sqrt(n)
# One-sided tolerance upper bound
conf_upper_bound = mu_est + (k*sigma_est)
Here is a one-line solution with the openturns library, assuming your data is a numpy array named sample.
import openturns as ot
ot.NormalFactory().build(sample.reshape(-1, 1)).computeQuantile(0.95)
Let us unpack this. NormalFactory is a class designed to fit the parameters of a Normal distribution (mu and sigma) on a given sample: NormalFactory() creates an instance of this class.
The method build does the actual fitting and returns an object of the class Normal which represents the normal distribution with parameters mu and sigma estimated from the sample.
The sample reshape is there to make sure that OpenTURNS understands that the input sample is a collection of one-dimension points, not a single multi-dimensional point.
The class Normal then provides the method computeQuantile to compute any quantile of the distribution (the 95-th percentile in this example).
This solution does not compute the exact tolerance bound because it uses a quantile from a Normal distribution instead of a Student t-distribution. Effectively, that means that it ignores the estimation error on mu and sigma. In practice, this is only an issue for really small sample sizes.
To illustrate this, here is a comparison between the PDF of the standard normal N(0,1) distribution and the PDF of the Student t-distribution with 19 degrees of freedom (this means a sample size of 20). They can barely be distinguished.
deg_freedom = 19
graph = ot.Normal().drawPDF()
student = ot.Student(deg_freedom).drawPDF().getDrawable(0)
student.setColor('blue')
graph.add(student)
graph.setLegends(['Normal(0,1)', 't-dist k={}'.format(deg_freedom)])
graph

Bivariate Skewness algorithm in Python

I wonder if anyone can help me on my interpretion of the algorithm that allow to compute the bivariate skewness which is a univariate measure of skewness for bivariate data.
The bivariate skewness is defined as described in this paper:
http://www.jstor.org/discover/10.2307/2346576?sid=21105063910471&uid=3737592&uid=4&uid=2
I considered p =2(bivariate) and took the same other assumptions as described in the paper.
Here is the function I wrote to compute b1,p (algorithm of Skewness in the paper) in python:
def multiSkew(x1, x2): #bivariate function, e.g use two columns of dataframe (same len)
covariance_x1_x2 = np.cov(x1,x2) # compute the covariance matrix
inv_covariance_x1_x2 = np.linalg.inv(covariance_x1_x2) #inverse of covariance mat
x1_x2_mean = np.mean(x1),np.mean(x2) #mean value of each variable
mk = []
for x_i in x1:
for y_i in x2:
x_diff = x_i - x1_x2_mean[0] #from the equation(see link) (xi-xbar)
y_diff = y_i - x1_x2_mean[1] # (xj-xbar)
yj = np.dot(np.dot(np.transpose(x_diff),inv_covariance_x1_x2),y_diff)
mk.append(yj**3)
skew=(1.0/(len(x1)**2))*sum(mk)
return skew
I have this when I tested it
My Skewness is too big. It normally should turn around zero if I am right
In: multiSkew(x1,x2)
Out[15]: 2809276168.079186
Can anyone, more advanced in programming ,help me please?I should have made an error somewhere in the summing part.
I don't think that there is a python module that can help me to compute the skewness of a multivariate set of data.In scipy, the skew module is just for univariate set of data by the way.

(Python) MVHR Covariance and OLS Beta difference

I calculated the minimum variance hedge ratio (MVHR) of two securities' returns by:
1. Calculating the optimal h* = Cov(S,F) / Var(F) using samples
2. Running an OLS regression and obtain the beta value
Both values differ slightly, for example I got h* = 0.9547 and beta = 0.9537. But they are supposed to be the same. Why is that so?
Below is my code:
import numpy as np
import statsmodels.api as sm
var = np.var(secRets, ddof = 1)
cov_denom = len(secRets) - 1
for i in range (0, len(secRets)):
cov_num += (indexRets[i] - indexAvg) * (secRets[i] - secAvg)
cov = cov_num / cov_denom
h = cov / var
ols_res = sm.OLS(indexRets, secRets).fit()
beta = ols_res.params[0]
print h, beta
indexRets and secRets are lists of daily returns of the index and the security (futures), respectively.
This is also a case of missing constant in OLS regression. The covariance and variance calculation subtracts the mean which is the same in the linear regression as including a constant. statsmodels doesn't include a constant by default unless you use the formulas.
For more details and an example see for example OLS of statsmodels does not work with inversely proportional data?
Also, you can replace the python loop to calculate the covariance by a call to numpy.cov.

P-value from Chi sq test statistic in Python

I have computed a test statistic that is distributed as a chi square with 1 degree of freedom, and want to find out what P-value this corresponds to using python.
I'm a python and maths/stats newbie so I think what I want here is the probability denisty function for the chi2 distribution from SciPy. However, when I use this like so:
from scipy import stats
stats.chi2.pdf(3.84 , 1)
0.029846
However some googling and talking to some colleagues who know maths but not python have said it should be 0.05.
Any ideas?
Cheers,
Davy
Quick refresher here:
Probability Density Function: think of it as a point value; how dense is the probability at a given point?
Cumulative Distribution Function: this is the mass of probability of the function up to a given point; what percentage of the distribution lies on one side of this point?
In your case, you took the PDF, for which you got the correct answer. If you try 1 - CDF:
>>> 1 - stats.chi2.cdf(3.84, 1)
0.050043521248705147
PDF
CDF
To calculate probability of null hypothesis given chisquared sum, and degrees of freedom you can also call chisqprob:
>>> from scipy.stats import chisqprob
>>> chisqprob(3.84, 1)
0.050043521248705189
Notice:
chisqprob is deprecated! stats.chisqprob is deprecated in scipy 0.17.0; use stats.distributions.chi2.sf instead
Update: as noted, chisqprob() is deprecated for scipy version 0.17.0 onwards. High accuracy chi-square values can now be obtained via scipy.stats.distributions.chi2.sf(), for example:
>>>from scipy.stats.distributions import chi2
>>>chi2.sf(3.84,1)
0.050043521248705189
>>>chi2.sf(1424,1)
1.2799986253099803e-311
While stats.chisqprob() and 1-stats.chi2.cdf() appear comparable for small chi-square values, for large chi-square values the former is preferable. The latter cannot provide a p-value smaller than machine epsilon,and will give very inaccurate answers close to machine epsilon. As shown by others, comparable values result for small chi-squared values with the two methods:
>>>from scipy.stats import chisqprob, chi2
>>>chisqprob(3.84,1)
0.050043521248705189
>>>1 - chi2.cdf(3.84,1)
0.050043521248705147
Using 1-chi2.cdf() breaks down here:
>>>1 - chi2.cdf(67,1)
2.2204460492503131e-16
>>>1 - chi2.cdf(68,1)
1.1102230246251565e-16
>>>1 - chi2.cdf(69,1)
1.1102230246251565e-16
>>>1 - chi2.cdf(70,1)
0.0
Whereas chisqprob() gives you accurate results for a much larger range of chi-square values, producing p-values nearly as small as the smallest float greater than zero, until it too underflows:
>>>chisqprob(67,1)
2.7150713219425247e-16
>>>chisqprob(68,1)
1.6349553217245471e-16
>>>chisqprob(69,1)
9.8463440314253303e-17
>>>chisqprob(70,1)
5.9304458500824782e-17
>>>chisqprob(500,1)
9.505397766554137e-111
>>>chisqprob(1000,1)
1.7958327848007363e-219
>>>chisqprob(1424,1)
1.2799986253099803e-311
>>>chisqprob(1425,1)
0.0
You meant to do:
>>> 1 - stats.chi2.cdf(3.84, 1)
0.050043521248705147
Some of the other solutions are deprecated. Use scipy.stats.chi2 Survival Function. Which is the same as 1 - cdf(chi_statistic, df)
Example:
from scipy.stats import chi2
p_value = chi2.sf(chi_statistic, df)
If you want to understand the math, the p-value of a sample, x (fixed), is
P[P(X) <= P(x)] = P[m(X) >= m(x)] = 1 - G(m(x)^2)
where,
P is the probability of a (say k-variate) normal distribution w/ known covariance (cov) and mean,
X is a random variable from that normal distribution,
m(x) is the mahalanobis distance = sqrt( < cov^{-1} (x-mean), x-mean >. Note that in 1-d this is just the absolute value of the z-score.
G is the CDF of the chi^2 distribution w/ k degrees of freedom.
So if you're computing the p-value of a fixed observation, x, then you compute m(x) (generalized z-score), and 1-G(m(x)^2).
for example, it's well known that if x is sampled from a univariate (k = 1) normal distribution and has z-score = 2 (it's 2 standard deviations from the mean), then the p-value is about .046 (see a z-score table)
In [7]: from scipy.stats import chi2
In [8]: k = 1
In [9]: z = 2
In [10]: 1-chi2.cdf(z**2, k)
Out[10]: 0.045500263896358528
For ultra-high precision, when scipy's chi2.sf() isn't enough, bring out the big guns:
>>> import numpy as np
>>> from rpy2.robjects import r
>>> np.exp(np.longdouble(r.pchisq(19000, 2, lower_tail=False, log_p=True)[0]))
1.5937563168532229629e-4126
Update by another user (WestCoastProjects) When using the values from the OP we get:
np.exp(np.longdouble(r.pchisq(3.84,1, lower_tail=False, log_p=True)[0]))
Out[5]: 0.050043521248705198928
So there's that 0.05 you were looking for.

Categories

Resources