np.cov giving unexpected number of values - python

When using the np.cov command on a random dataset of 10 samples, I'm getting a 10x10 array as the answer. I think my data is not formatted correctly, but I'm not sure.
import numpy as np

np.random.seed(1)
rho = 0.2
sigma = 1
cov = (sigma**2) * np.array([[1, rho], [rho, 1]])  # 2x2 covariance matrix
mean1 = (0, 0)
x1 = np.random.multivariate_normal(mean1, cov, 10)  # shape (10, 2)
mean1 = np.mean(x1)
cov1 = np.cov(x1)
print(cov1)

This is the correct behavior: np.cov returns a covariance matrix.
In particular, it takes each row of the input as a variable, with the columns representing different observations of those variables. To treat columns as variables instead, pass rowvar=False.
So if you have two variables represented as two columns of a matrix, you can use np.cov(data, rowvar=False) (or equivalently np.cov(data.T)) to get a 2 by 2 covariance matrix, in which cov[0,1] and cov[1,0] hold the covariance between the two variables.
This is also discussed here.
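As a quick illustration, a minimal sketch applied to the x1 array from the question (10 samples of 2 variables):

np.cov(x1, rowvar=False)  # treat the 2 columns as variables -> 2x2 matrix
np.cov(x1.T)              # equivalent: transpose so each variable is a row

Both calls return the same 2 by 2 matrix, whose off-diagonal entries estimate the true covariance sigma**2 * rho.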

Related

Getting variance values for random samples generated from a standard normal distribution using numpy

I have a function that gives me probability distributions for each class, in terms of a matrix corresponding to mean values and another matrix corresponding to variance values. For example, if I had four classes then I would have the following outputs:
y_means = [1,2,3,4]
y_variance = [0.01,0.02,0.03,0.04]
I need to do the following calculation to the mean values to continue with the rest of my program:
import numpy as np

y_means = np.array(y_means)
y_means = np.reshape(y_means, (y_means.size, 1))  # column vector of shape (4, 1)
A = np.random.randn(10, y_means.size)             # 10 x 4 standard normal samples
y_means = np.matmul(A, y_means)                   # shape (10, 1)
Here, I have used the numpy.random.randn function to generate random samples from a standard normal distribution, and then multiplied them with the matrix of mean values to obtain a new output matrix. The dimension of the output matrix is then (10 x 1).
I need to do a similar calculation such that my output_variance is also a (10 x 1) matrix. But it is not meaningful to multiply the variances in the same way with random samples from a standard normal distribution, because this would produce negative values as well. This is undesirable because my ultimate aim is to create a normal distribution with these mean values and their corresponding variance values using:
torch.distributions.normal.Normal(loc=y_means, scale=y_variance)
So my question is: is there any method by which I can get a variance value for each random sample generated by numpy.random.randn? Then the multiplication of such a matrix with output_variance would make more sense.
Or if there is any other strategy for this that I might be unaware of, please let me know.
The problem mentioned in the question required another matrix of the same dimensions as A that corresponded to a variance measure for the random samples present in A.
Taking a row-wise or column-wise variance of A using numpy.var() did not give a 10 x 4 matrix to multiply with y_variance.
I had solved the above problem by using the following approach:
First create a matrix with the same dimensions as A with zero entries, using the following line of code:
A_var = np.zeros_like(A)
Then, using torch.distributions, create normal distributions with the values in A as the means and the zeros as the scales:
dist_A = torch.distributions.normal.Normal(loc=torch.Tensor(A), scale=torch.Tensor(A_var))
https://pytorch.org/docs/stable/distributions.html lists all the operations possible on Normal distributions in PyTorch. The sample() method can generate samples of any given shape from a distribution. This property was exploited to first generate a sample tensor of size 10 x 10 x 4 and then calculate the variance along axis 0.
np.var(np.array(dist_A.sample((10,))), axis=0)
This would result in a variance matrix of size 10 x 4, which can be used for calculations with y_variance.
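Putting the answer's steps together, a minimal end-to-end sketch (names and shapes assumed from the snippets above, not the author's verbatim code) might look like:

import numpy as np
import torch

A = np.random.randn(10, 4)   # 10 x 4 standard normal samples
A_var = np.zeros_like(A)     # zero scale for every entry

# validate_args=False is assumed here because a scale of exactly 0 fails
# torch's positivity check; the degenerate Normal then always returns its mean
dist_A = torch.distributions.normal.Normal(
    loc=torch.Tensor(A), scale=torch.Tensor(A_var), validate_args=False)

samples = dist_A.sample((10,))             # shape (10, 10, 4)
var_A = np.var(np.array(samples), axis=0)  # shape (10, 4)
print(var_A.shape)

Note that with a zero scale every draw equals its mean, so var_A comes out all zeros; a positive scale would yield nonzero sample variances.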

How to calculate the standard deviation of a list of m x n matrices in Python?

Say I have a data set of 100 items. The interesting part about this data set is that each item is a 4x3 matrix. My question is: how should I calculate the standard deviation of this data set? I tried the following code, but I don't know if the result is correct. If it is correct, I want to know how it works. I know the standard deviation equation for 1d data, but I don't know the definition of std for a collection of m x n matrices. The docstring of np.std only explains the 1d case.
import numpy as np

datalist = []
for _ in range(100):
    data = np.random.random((4, 3))
    datalist.append(data)

std = np.std(np.asarray(datalist))
print(std)
It seems you have some unnecessary steps. To begin with, you can get 100 matrices of 4x3 like this:
x = np.random.rand(100, 4, 3)
Then just call np.std on it:
np.std(x)
0.2827262559096299
That's if you want the standard deviation of all values. If you want it per matrix cell, specify the axis argument:
np.std(x, axis=0)
array([[0.27863211, 0.2670126 , 0.28752064],
[0.28540484, 0.25365294, 0.28905531],
[0.28848584, 0.27695767, 0.26886147],
[0.27138472, 0.3135065 , 0.29361115]])
axis=0 means that it's going to collapse axis 0 (the one with size 100), which returns a 4x3 matrix.
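As a sanity check, here is a small sketch (reusing x from above) that compares the axis=0 result with a manual per-cell computation:

manual = np.empty((4, 3))
for i in range(4):
    for j in range(3):
        manual[i, j] = np.std(x[:, i, j])  # std of the 100 values at cell (i, j)
print(np.allclose(manual, np.std(x, axis=0)))  # True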

numpy covariance between each column of a matrix and a vector

Based on this post, I can get the covariance between two vectors using np.cov((x,y), rowvar=0). I have an MxN matrix and an Mx1 vector. I want to find the covariance between each column of the matrix and the given vector. I know that I can write this with a for loop. I was wondering if I can somehow use np.cov() to get the result directly.
As Warren Weckesser said, numpy.cov(X, Y) is a poor fit for the job because it will simply join the arrays into one M by (N+1) array and find the huge (N+1) by (N+1) covariance matrix. But we always have the definition of covariance, and it is easy to use:
import numpy as np

A = np.sqrt(np.arange(12).reshape(3, 4))  # some 3 by 4 array
b = np.array([[2], [4], [5]])             # some 3 by 1 vector
cov = np.dot(b.T - b.mean(), A - A.mean(axis=0)) / (b.shape[0] - 1)
This returns the covariances of each column of A with b.
array([[ 2.21895142, 1.53934466, 1.3379221 , 1.20866607]])
The formula I used is for the sample covariance (which is what numpy.cov computes, too), hence the division by (b.shape[0] - 1). If you divide by b.shape[0] instead, you get the unadjusted population covariance.
For comparison, the same computation using np.cov:
import numpy as np
A = np.sqrt(np.arange(12).reshape(3, 4))
b = np.array([[2], [4], [5]])
np.cov(A, b, rowvar=False)[-1, :-1]
Same output, but it takes about twice as long (and for large matrices, the difference will be much larger). The slicing at the end is needed because np.cov computes a 5 by 5 matrix, in which only the first 4 entries of the last row are what you want. The rest is the covariance of A with itself, or of b with itself.
Correlation coefficient
The correlation coefficient is obtained by dividing by the square roots of the variances. Watch out for the -1 adjustment mentioned earlier: numpy.var does not apply it by default; to enable it, pass the ddof=1 parameter.
corr = cov / np.sqrt(np.var(b, ddof=1) * np.var(A, axis=0, ddof=1))
Check that the output is the same as that of the less efficient version:
np.corrcoef(A, b, rowvar=False)[-1, :-1]
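Putting it together, a quick self-check (reusing A, b, and corr from above):

expected = np.corrcoef(A, b, rowvar=False)[-1, :-1]
print(np.allclose(corr, expected))  # True: both approaches agree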

Pearson's correlation coefficient between all pairs of rows from two 2D arrays using scipy.stats.pearsonr vs. numpy.corrcoeff in python 3.5

I tried to calculate the Pearson's correlation coefficients between every pair of rows from two 2D arrays, and then sort the rows/columns of the resulting correlation matrix based on its diagonal elements. First, the correlation coefficient matrix (i.e., 'ccmtx') was calculated from one random matrix (i.e., 'randmtx') in the following code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
def correlation_map(x, y):
    n_row_x = x.shape[0]
    n_row_y = y.shape[0]
    ccmtx_xy = np.empty((n_row_x, n_row_y))
    for n in range(n_row_x):
        for m in range(n_row_y):
            ccmtx_xy[n, m] = pearsonr(x[n, :], y[m, :])[0]
    return ccmtx_xy
randmtx = np.random.randn(100, 1000) # generating random matrix
#ccmtx = np.corrcoef(randmtx, randmtx) # cc matrix based on numpy.corrcoef
ccmtx = correlation_map(randmtx, randmtx) # cc matrix based on scipy pearsonr
#
ccmtx_diag = np.diagonal(ccmtx)
#
ids, vals = np.argsort(ccmtx_diag, kind = 'mergesort'), np.sort(ccmtx_diag, kind = 'mergesort')
#ids, vals = np.argsort(ccmtx_diag, kind = 'quicksort'), np.sort(ccmtx_diag, kind = 'quicksort')
plt.plot(ids)
plt.show()
plt.plot(ccmtx_diag[ids])
plt.show()
vals[0]
The issue here is that when 'pearsonr' was used, the diagonal elements of 'ccmtx' are exactly 1.0, which makes sense. However, when 'corrcoef' was used, the diagonal elements of 'ccmtx' are not exactly one (and slightly less than 1 for some of them), seemingly due to floating point precision error.
I found it annoying that the auto-correlation matrix of a single matrix has diagonal elements that are not 1.0, since this resulted in the shuffling of rows/columns of the correlation matrix when it is sorted based on the diagonal elements.
My questions are:
[1] Is there any good way to accelerate the computation when I stick with the 'pearsonr' function? (e.g., a vectorized pearsonr?)
[2] Is there any good way/practice to prevent this precision error when using 'corrcoef' in numpy? (e.g., the 'decimals' option in np.around?)
I have searched for correlation coefficient calculations between all pairs of rows or columns of two matrices. However, as the algorithms contain some sort of "cov / variance" operation, this kind of precision issue seems to always exist.
Minor point: the 'mergesort' option seems to provide more reliable results than 'quicksort', as quicksort shuffled the 1d array of exact 1.0 values into random order.
Any thoughts/comments would be greatly appreciated!
For question 1 (a vectorized pearsonr), see the comments on the question.
I will answer only question 2: how to improve the precision of np.corrcoef.
The correlation matrix R is computed from the covariance matrix C according to
R_ij = C_ij / sqrt(C_ii * C_jj)
The numpy implementation is optimized for performance and memory usage: it computes the covariance matrix and then performs two divisions, by sqrt(C_ii) and by sqrt(C_jj). This separate square-rooting is where the imprecision comes from. For example:
np.sqrt(3 * 3) - 3 == 0.0
np.sqrt(3) * np.sqrt(3) - 3 == -4.4408920985006262e-16
We can fix this by implementing our own simple corrcoef routine:
def corrcoef(a, b):
    c = np.cov(a, b)
    d = np.diag(c)
    return c / np.sqrt(d[:, None] * d[None, :])
Note that this implementation requires more memory than the numpy implementation, because it needs to store a temporary matrix of size n * n, and it is slightly slower because it does n^2 square roots instead of only 2n.
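As a quick check, a sketch using the question's setup (randmtx regenerated here for illustration) to confirm that the diagonal now comes out as exactly 1.0:

randmtx = np.random.randn(100, 1000)
r = corrcoef(randmtx, randmtx)
print(np.all(np.diag(r) == 1.0))  # expected True: sqrt(d_i * d_i) recovers d_i exactly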

numpy.cov() returns unexpected output

I have an X dataset which has 9 features and 683 rows (683x9). I want to take the covariance matrix of this X dataset and another dataset with the same shape as X. I use the code np.cov(originalData, generatedData, rowvar=False) to get it, but it returns a covariance matrix of shape 18x18. I expected to get a 9x9 covariance matrix. Can you please help me fix it?
The method cov calculates the covariances for all pairs of variables that you give it. You have 9 variables in one array and 9 more in the other; that's 18 in total, so you get an 18 by 18 matrix. (Under the hood, cov concatenates the two arrays you gave it before calculating the covariance.)
If you are only interested in the covariance of the variables from the 1st array with the variables from the 2nd, pick the first half of rows and second half of columns:
C = np.cov(originalData, generatedData, rowvar=False)[:9, 9:]
Or in general, with two matrices X and Y that need not have the same number of columns,
C = np.cov(X, Y, rowvar=False)[:X.shape[1], X.shape[1]:]
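A minimal sketch with the shapes from the question (random data assumed here just for illustration):

import numpy as np

rng = np.random.default_rng(0)
originalData = rng.normal(size=(683, 9))
generatedData = rng.normal(size=(683, 9))

C = np.cov(originalData, generatedData, rowvar=False)[:9, 9:]
print(C.shape)  # (9, 9): each original feature against each generated feature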
