I wonder if anyone can help me with my interpretation of the algorithm for computing bivariate skewness, which is a univariate measure of skewness for bivariate data.
The bivariate skewness is defined as described in this paper:
http://www.jstor.org/discover/10.2307/2346576?sid=21105063910471&uid=3737592&uid=4&uid=2
I considered p = 2 (bivariate) and otherwise used the same assumptions as described in the paper.
Here is the function I wrote to compute b1,p (the skewness statistic in the paper) in Python:
import numpy as np

def multiSkew(x1, x2):  # bivariate function, e.g. use two columns of a dataframe (same length)
    covariance_x1_x2 = np.cov(x1, x2)  # compute the covariance matrix
    inv_covariance_x1_x2 = np.linalg.inv(covariance_x1_x2)  # inverse of the covariance matrix
    x1_x2_mean = np.mean(x1), np.mean(x2)  # mean value of each variable
    mk = []
    for x_i in x1:
        for y_i in x2:
            x_diff = x_i - x1_x2_mean[0]  # from the equation (see link): (xi - xbar)
            y_diff = y_i - x1_x2_mean[1]  # (xj - xbar)
            yj = np.dot(np.dot(np.transpose(x_diff), inv_covariance_x1_x2), y_diff)
            mk.append(yj**3)
    skew = (1.0 / (len(x1)**2)) * sum(mk)
    return skew
This is what I get when I test it:
My skewness value is far too large. If I understand correctly, it should be close to zero.
In: multiSkew(x1,x2)
Out[15]: 2809276168.079186
Can anyone more experienced in programming help me, please? I must have made an error somewhere in the summation part.
I don't think there is a Python module that computes the skewness of a multivariate data set. In SciPy, scipy.stats.skew only handles univariate data, by the way.
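For reference, here is a minimal vectorized sketch of how I read Mardia's b_{1,p} (an assumption on my part: the statistic is b_{1,p} = (1/n^2) * sum_{i,j} [(x_i - xbar)^T S^{-1} (x_j - xbar)]^3, where S is the biased sample covariance and each x_i is the full observation vector rather than a single component):

import numpy as np

def mardia_skewness(data):
    """Sketch of Mardia's b_{1,p} for an (n, p) array of observations."""
    n = data.shape[0]
    centered = data - data.mean(axis=0)          # rows are (x_i - xbar)
    S = np.cov(data, rowvar=False, bias=True)    # biased covariance estimate (divide by n)
    S_inv = np.linalg.inv(S)
    G = centered @ S_inv @ centered.T            # G[i, j] = (x_i - xbar)^T S^{-1} (x_j - xbar)
    return (G ** 3).sum() / n ** 2

# bivariate usage: mardia_skewness(np.column_stack([x1, x2]))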
Related
I have a discrete power spectrum f(x), which is a power law with a specific exponent b: f(x) = A*x^b. I want to estimate that exponent when I have f(x) and x. I'm interested in using a Maximum Likelihood Estimator (MLE) to find it. Why? Well, there's a paper that says it's the method with the least bias for such distributions: Clauset et al. 2007.
I have already tried many different methods. For instance, I took the commonly known route of taking the log of both power and frequency and applying a least-squares fit to the log-log spectrum, but the slope (the sought exponent) from that procedure is sometimes biased. There is already a Python package implementing the method of that paper, but it does MLE based on a PDF or CDF of raw samples, not on the tabulated f(x) and x that I have. I tried editing it, but it's quite complicated. I also tried using PyTorch to do the optimization, with no success, since I'm not experienced with it.
How can I implement such MLE in python for my problem?
Edit: I found someone asking a similar question, but in R. Someone answered it here, but again it's in R, not Python.
Edit: added code
from astroML.time_series import generate_power_law
import numpy as np
import scipy
N=72 # number of points
dt=5*60 # time resolution
exponent= -2 # power law exponent
rand_seed= 1
time_max = 6*60*60 # observation time (N*dt = 21600 s)
x = np.linspace(0, time_max, N ) # time array
y = generate_power_law(N, dt, -exponent, random_state= rand_seed)
# amplitudes with power law array
# Get the spectrum by FFT for example
sample_freq = scipy.fft.rfftfreq(1*N, d=dt)[1:] # sampling frequency
sig_fft = scipy.fft.rfft(y,1*N)[1:] # FFT amplitudes
psd = (np.abs(sig_fft)**2)*2*dt/(N) # power spectral density
# do least squares fitting
logA = np.log(sample_freq )
logB = np.log(psd )
slope, intercept = np.polyfit(logA, logB, 1)
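As a quick sanity check (my addition, not part of the original post): the generated PSD scales as f^exponent, so the fitted slope should land roughly near the input exponent of -2, keeping in mind the bias mentioned above.

print(f"least-squares estimate of the exponent: {slope:.2f}")  # expect a value roughly around -2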
Another attempt at MLE, based on the article linked in a comment:
import numpy as np
from scipy.optimize import minimize

def func_powerlaw(x, params):
    slope = params[0]
    coefficient = params[1]
    return (x**slope) * coefficient

params = [1, 1]

# For MLE, minimize the negative log likelihood
def neglnlike(params, x, y):
    model = func_powerlaw(x, params)
    output = np.sum(np.log(model) + y/model)
    # Check that this is valid, returning a large number if not
    if not np.isfinite(output):
        return 1.0e30
    return output

res = minimize(neglnlike, params, args=(sample_freq, psd),
               method='Nelder-Mead')
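For completeness (a small usage note, not in the original post): the fitted parameters can be read off the optimizer result; res.x holds [slope, coefficient] in the order defined in func_powerlaw.

slope_mle, coefficient_mle = res.x
print(f"MLE slope: {slope_mle:.2f}, coefficient: {coefficient_mle:.3g}")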
See this library named powerlaw
https://github.com/jeffalstott/powerlaw
powerlaw is a toolbox using the statistical methods developed in Clauset et al. 2007 and Klaus et al. 2011 to determine if a probability distribution fits a power law.
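A minimal sketch of the library's basic interface (my addition; note the caveat above that it fits the distribution of raw observed samples, not a tabulated spectrum, and data below is a hypothetical 1-D array of positive values):

import powerlaw

fit = powerlaw.Fit(data)                    # estimates xmin and the exponent by MLE
print(fit.power_law.alpha)                  # fitted power-law exponent
print(fit.power_law.xmin)                   # estimated lower cutoff of the power-law regime
R, p = fit.distribution_compare('power_law', 'lognormal')  # likelihood-ratio test vs. an alternative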
Suppose I ended up with a Cook's distance array like this, and I am looking at the first element (Cook's distance = 0.368 and p-value = 0.701).
How can I interpret the p-value? It is larger than 0.05, so presumably H0 is not rejected, but what is H0 here?
Example obtained from https://www.statology.org/cooks-distance-python/
This p-value is not the p-value you get from a hypothesis test. If you check the Wikipedia article, Cook's distance is compared against an F distribution with p and n-p degrees of freedom. So the p-value you get is actually the probability of observing a value more extreme than that, under the assumptions of a linear model.
We can look at the source code for statsmodels.stats.outliers_influence.OLSInfluence, which is what gets called to calculate Cook's distance:
def cooks_distance(self):
    """Cook's distance and p-values

    Based on one step approximation d_params and on results.cov_params
    Cook's distance divides by the number of explanatory variables.

    p-values are based on the F-distribution which are only approximate
    outside of linear Gaussian models.

    Warning: The definition of p-values might change if we switch to using
    chi-square distribution instead of F-distribution, or if we make it
    dependent on the fit keyword use_t.
    """
    cooks_d2 = (self.d_params * np.linalg.solve(self.cov_params,
                                                self.d_params.T).T).sum(1)
    cooks_d2 /= self.k_vars
    from scipy import stats

    # alpha = 0.1
    # print stats.f.isf(1-alpha, n_params, res.df_modelwc)
    # TODO use chi2 # use_f option
    pvals = stats.f.sf(cooks_d2, self.k_vars, self.results.df_resid)

    return cooks_d2, pvals
The relevant line is pvals = stats.f.sf(cooks_d2, self.k_vars, self.results.df_resid). So you calculate Cook's distance and look at its upper-tail probability (1 - CDF) under the F distribution.
It is similar to how you obtain the p-value for a one-sided t-test: you ask what the probability is of observing a t-statistic more extreme than the one obtained from the test.
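For completeness, a short sketch (my addition) of how these values are typically obtained in practice, assuming an OLS model fitted with statsmodels; the returned tuple contains the distances and the F-based "p-values" discussed above:

import numpy as np
import statsmodels.api as sm

# hypothetical data; any fitted OLS model works the same way
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2 * x + rng.normal(size=50)

results = sm.OLS(y, sm.add_constant(x)).fit()
influence = results.get_influence()        # OLSInfluence object
cooks_d, pvals = influence.cooks_distance  # distances and F-based tail probabilities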
I am looking for a function to compute the CDF of a multivariate normal distribution. I have found that scipy.stats.multivariate_normal only has a method to compute the PDF (for a sample x), multivariate_normal.pdf(x, mean=mean, cov=cov), but not the CDF.
I am looking for the same thing but to compute the cdf, something like: multivariate_normal.cdf(x, mean=mean, cov=cov), but unfortunately multivariate_normal doesn't have a cdf method.
The only thing that I found is this: Multivariate Normal CDF in Python using scipy
but the presented method scipy.stats.mvn.mvnun(lower, upper, means, covar) doesn't take a sample x as a parameter, so I don't really see how to use it to have something similar to what I said above.
This is just a clarification of the points that @sascha made above in the comments on the answer. The relevant function is mvnormcdf from statsmodels.sandbox.distributions.extras.
As an example, for a multivariate normal distribution with diagonal covariance, evaluating the CDF at the mean should give (1/4) * total area = 0.25 (each variable independently has probability 1/2 of falling below its mean). The following example will let you play with it:
import numpy as np
from statsmodels.sandbox.distributions.extras import mvnormcdf
from scipy.stats import mvn

upper = np.array([0, 0])  # evaluate the CDF at the mean
for i in range(1, 20, 2):
    cov_example = np.array(((i, 0), (0, i)))
    mean_example = np.array((0, 0))
    print(mvnormcdf(upper=upper, mu=mean_example, cov=cov_example))
The output of this is 0.25, 0.25, 0.25, 0.25...
The CDF of some distribution is actually an integral over the PDF of that distribution. That being so, you need to provide the function with the boundaries of the integral.
What most people mean when they ask for a p_value of some point in relation to some distribution is:
what is the chance of getting these values or higher given this distribution?
Picture the upper-tail area of the distribution: it is not a point, but rather an integral from some point onwards.
Accordingly, you need to set your point as the lower boundary, +inf (or some arbitrarily high enough value) as the upper boundary and provide the means and covariance matrix you already have:
from sys import maxsize

import numpy as np
import scipy.stats

def mvn_p_value(x, mu, cov_matrix):
    upper_bounds = np.array([maxsize] * x.size)  # make an upper bound the size of your vector
    p_value = scipy.stats.mvn.mvnun(x, upper_bounds, mu, cov_matrix)[1]
    if 0.5 < p_value:  # this inversion is used for two-sided statistical testing
        p_value = 1 - p_value
    return p_value
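As a side note (my addition, not part of the answer above): newer SciPy releases ship a CDF method for multivariate_normal, so if upgrading is an option, the original request can be written directly as something like:

import numpy as np
from scipy.stats import multivariate_normal

mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.3],
                [0.3, 1.0]])
x = np.array([0.0, 0.0])

print(multivariate_normal.cdf(x, mean=mean, cov=cov))  # P(X1 <= 0, X2 <= 0)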
I'm trying to automate a process that at some point needs to draw samples from a truncated multivariate normal. That is, it's an ordinary multivariate normal distribution (i.e. Gaussian), but the variables are constrained to a cuboid. My given inputs are the mean and covariance of the full multivariate normal, but I need samples inside my box.
Up to now, I'd just been rejecting samples outside the box and resampling as necessary, but I'm starting to find that my process sometimes gives me (a) large covariances and (b) means that are close to the edges. These two events conspire against the speed of my system.
So what I'd like to do is sample the distribution correctly in the first place. Googling led only to this discussion or the truncnorm distribution in scipy.stats. The former is inconclusive and the latter seems to be for one variable. Is there any native multivariate truncated normal? And is it going to be any better than rejecting samples, or should I do something smarter?
I'm going to start working on my own solution, which would be to rotate the untruncated Gaussian to its principal axes (with an SVD decomposition or something), use a product of truncated Gaussians to sample the distribution, then rotate that sample back and reject/resample as necessary. If the truncated sampling is more efficient, I think this should sample the desired distribution faster.
So, according to the Wikipedia article, sampling a multivariate truncated normal distribution (MTND) is more difficult. I ended up taking a relatively easy way out and using an MCMC sampler to relax an initial guess towards the MTND as follows.
I used emcee to do the MCMC work. I find this package phenomenally easy to use. It only requires a function that returns the log-probability of the desired distribution. So I defined this function:
import numpy as np
from numpy.linalg import inv

def lnprob_trunc_norm(x, mean, bounds, C):
    if np.any(x < bounds[:, 0]) or np.any(x > bounds[:, 1]):
        return -np.inf
    else:
        return -0.5 * (x - mean).dot(inv(C)).dot(x - mean)
Here, C is the covariance matrix of the multivariate normal. Then, you can run something like
S = emcee.EnsembleSampler(Nwalkers, Ndim, lnprob_trunc_norm, args = (mean, bounds, C))
pos, prob, state = S.run_mcmc(pos, Nsteps)
for given mean, bounds and C. You need an initial guess for the walkers' positions pos, which could be a ball around the mean,
pos = emcee.utils.sample_ball(mean, np.sqrt(np.diag(C)), size=Nwalkers)
or sampled from an untruncated multivariate normal,
pos = numpy.random.multivariate_normal(mean, C, size=Nwalkers)
and so on. I personally do several thousand steps of sample discarding first, because it's fast, then force the remaining outliers back within the bounds, then run the MCMC sampling.
The number of steps for convergence is up to you.
Note also that emcee easily supports basic parallelization by adding the argument threads=Nthreads to the EnsembleSampler initialization. So you can make this blazing fast.
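As a small usage sketch (my addition, not part of the original answer): with the emcee 2.x interface used above, the raw chain is exposed as sampler.chain with shape (Nwalkers, Nsteps, Ndim), so burn-in can be discarded and the walkers flattened like this:

Nburn = 1000  # hypothetical burn-in length; tune for your own problem
samples = S.chain[:, Nburn:, :].reshape(-1, Ndim)  # flattened post-burn-in draws
# (with emcee 3.x the equivalent is: samples = S.get_chain(discard=Nburn, flat=True))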
I have reimplemented an algorithm which does not depend on MCMC but creates independent and identically distributed (iid) samples from the truncated multivariate normal distribution. Having iid samples can be very useful! I used to also use emcee as described in the answer by Warrick, but for convergence the number of samples needed exploded in higher dimensions, making it impractical for my use case.
The algorithm was introduced by Botev (2016) and uses an accept-reject algorithm based on minimax exponential tilting. It was originally implemented in MATLAB but reimplementing it for Python increased the performance significantly compared to running it using the MATLAB engine in Python. It also works well and is fast at higher dimensions.
The code is available at: https://github.com/brunzema/truncated-mvn-sampler.
An Example:
import numpy as np
# TruncatedMVN is the sampler class provided by the repository linked above

d = 10  # dimensions

# random mu and cov
mu = np.random.rand(d)
cov = 0.5 - np.random.rand(d ** 2).reshape((d, d))
cov = np.triu(cov)
cov += cov.T - np.diag(cov.diagonal())
cov = np.dot(cov, cov)

# constraints
lb = np.zeros_like(mu) - 1
ub = np.ones_like(mu) * np.inf

# create the truncated normal and sample from it
n_samples = 100000
tmvn = TruncatedMVN(mu, cov, lb, ub)
samples = tmvn.sample(n_samples)
Plotting the first dimension of the samples shows the distribution truncated at the lower bound of -1.
Reference:
Botev, Z. I. (2016). The normal law under linear restrictions: simulation and estimation via minimax tilting. Journal of the Royal Statistical Society, Series B, 79(1), 125-148.
Simulating a truncated multivariate normal can be tricky and usually involves some conditional sampling by MCMC.
My short answer is: you can use my code (https://github.com/ralphma1203/trun_mvnt)! It implements a Gibbs sampler algorithm that can handle general linear constraints of the form lower ≤ D x ≤ upper, even when D is not of full rank and there are more constraints than dimensions.
import numpy as np
from trun_mvnt import rtmvn, rtmvt
########## Traditional problem, probably what you need... ##########
##### lower < X < upper #####
# So D = identity matrix
D = np.diag(np.ones(4))
lower = np.array([-1,-2,-3,-4])
upper = -lower
Mean = np.zeros(4)
Sigma = np.diag([1,2,3,4])
n = 10 # number of final samples to draw
burn = 100 # burn-in first 100 iterates
thin = 1 # thinning for Gibbs
random_sample = rtmvn(n, Mean, Sigma, D, lower, upper, burn, thin)
# Numpy array n-by-p as result!
random_sample
########## Non-full rank problem (more constraints than dimension) ##########
Mean = np.array([0,0])
Sigma = np.array([1, 0.5, 0.5, 1]).reshape((2,2)) # bivariate normal
D = np.array([1,0,0,1,1,-1]).reshape((3,2)) # non-full rank problem
lower = np.array([-2,-1,-2])
upper = np.array([2,3,5])
n = 500 # want 500 final sample
burn = 100 # burn-in first 100 iterates
thin = 1 # thinning for Gibbs
random_sample = rtmvn(n, Mean, Sigma, D, lower, upper, burn, thin) # Numpy array n-by-p as result!
A little late, I guess, but for the record: you could use Hamiltonian Monte Carlo. A MATLAB module named HMC exact exists; it shouldn't be too difficult to translate it to Python.
numpy.average() has a weights option, but numpy.std() does not. Does anyone have suggestions for a workaround?
How about the following short "manual calculation"?
import math
import numpy

def weighted_avg_and_std(values, weights):
    """
    Return the weighted average and standard deviation.

    values, weights -- NumPy ndarrays with the same shape.
    """
    average = numpy.average(values, weights=weights)
    # Fast and numerically precise:
    variance = numpy.average((values - average)**2, weights=weights)
    return (average, math.sqrt(variance))
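A quick usage example of the function above (made-up numbers, my addition):

values = numpy.array([1.0, 2.0, 3.0, 4.0])
weights = numpy.array([1.0, 1.0, 1.0, 5.0])
print(weighted_avg_and_std(values, weights))  # weighted mean 3.25 and the corresponding (ddof=0) std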
There is a class in statsmodels that makes it easy to calculate weighted statistics: statsmodels.stats.weightstats.DescrStatsW.
Assuming this dataset and weights:
import numpy as np
from statsmodels.stats.weightstats import DescrStatsW
array = np.array([1,2,1,2,1,2,1,3])
weights = np.ones_like(array)
weights[3] = 100
You initialize the class (note that you have to pass in the correction factor, the delta degrees of freedom at this point):
weighted_stats = DescrStatsW(array, weights=weights, ddof=0)
Then you can calculate:
.mean the weighted mean:
>>> weighted_stats.mean
1.97196261682243
.std the weighted standard deviation:
>>> weighted_stats.std
0.21434289609681711
.var the weighted variance:
>>> weighted_stats.var
0.045942877107170932
.std_mean the standard error of weighted mean:
>>> weighted_stats.std_mean
0.020818822467555047
Just in case you're interested in the relation between the standard error and the standard deviation: The standard error is (for ddof == 0) calculated as the weighted standard deviation divided by the square root of the sum of the weights minus 1 (corresponding source for statsmodels version 0.9 on GitHub):
standard_error = standard_deviation / sqrt(sum(weights) - 1)
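A quick numerical check of that relation, continuing the example above (my addition):

print(weighted_stats.std / np.sqrt(weights.sum() - 1))  # ~0.0208, matching weighted_stats.std_mean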
Here's one more option:
np.sqrt(np.cov(values, aweights=weights))
There doesn't appear to be such a function in numpy/scipy yet, but there is a ticket proposing this added functionality. Included there you will find Statistics.py which implements weighted standard deviations.
There is a very good example proposed by gaborous:
import pandas as pd
import numpy as np
# X is the dataset, as a Pandas' DataFrame
# weights is a 1-D NumPy array of per-observation weights
# Compute the weighted sample mean (fast, efficient and precise)
mean = np.ma.average(X, axis=0, weights=weights)

# Convert to a pandas Series (it's just aesthetic and more
# ergonomic; no difference in computed values)
mean = pd.Series(mean, index=list(X.keys()))

xm = X - mean     # deviations from the mean
xm = xm.fillna(0) # fill NaN with 0 (a variance of 0 is just void,
                  # but it keeps the other covariance values correct)

# Compute the unbiased weighted sample covariance
sigma2 = 1. / (weights.sum() - 1) * xm.mul(weights, axis=0).T.dot(xm)
Correct equation for weighted unbiased sample covariance, URL (version: 2016-06-28)
A follow-up on the "sample" or "unbiased" standard deviation in the "frequency weights" sense, since a Google search for "weighted sample standard deviation python" leads to this post:
from math import sqrt

def frequency_sample_std_dev(X, n):
    """
    Sample standard deviation for X and n,
    where X[i] is the quantity each person in group i has,
    and n[i] is the number of people in group i.

    See Equation 6.4 of:
    Montgomery, Douglas, C. and George C. Runger. Applied Statistics
    and Probability for Engineers, Enhanced eText. Available from:
    WileyPLUS, (7th Edition). Wiley Global Education US, 2018.
    """
    n_groups = len(n)
    n_people = sum(n)
    lhs_numerator = sum([ni*Xi**2 for Xi, ni in zip(X, n)])
    rhs_numerator = sum([Xi*ni for Xi, ni in zip(X, n)])**2 / n_people
    denominator = n_people - 1
    var = (lhs_numerator - rhs_numerator) / denominator
    std = sqrt(var)
    return std
Or, modifying the answer by @Eric as follows:
from math import sqrt
import numpy as np

def weighted_sample_avg_std(values, weights):
    """
    Return the weighted average and weighted sample standard deviation.

    values, weights -- NumPy ndarrays with the same shape.
    Assumes that weights contains only integers (e.g. how many samples in each group).

    See also https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Frequency_weights
    """
    average = np.average(values, weights=weights)
    variance = np.average((values - average)**2, weights=weights)
    variance = variance * sum(weights) / (sum(weights) - 1)
    return (average, sqrt(variance))
print(weighted_sample_avg_std(X, n))
I was just searching for an API-compatible equivalent of numpy's np.std function that also allows the axis parameter to be set:
(I just tested it with two dimensions, so feel free for improvements if something is incorrect.)
import numpy as np

def std(values, weights=None, axis=None):
    """
    Return the weighted standard deviation.

    axis -- the axis for the std calculation
    values, weights -- NumPy ndarrays with the same shape along the given axis.
    """
    average = np.expand_dims(np.average(values, weights=weights, axis=axis), axis=axis)
    # Fast and numerically precise:
    variance = np.average((values - average)**2, weights=weights, axis=axis)
    return np.sqrt(variance)
Thanks to Eric O Lebigot for the original answer.
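A hypothetical usage example (my addition) exercising the axis argument, computing one weighted standard deviation per column:

rng = np.random.default_rng(0)
values = rng.normal(size=(100, 3))
weights = rng.random(100)                    # one weight per row
print(std(values, weights=weights, axis=0))  # weighted std of each column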