numpy.cov() returns unexpected output - python

I have a dataset X with 9 features and 683 rows (683x9). I want to compute the covariance matrix of this dataset and another dataset of the same shape as X. I use np.cov(originalData, generatedData, rowvar=False) to get it, but it returns a covariance matrix of shape 18x18. I expected a 9x9 covariance matrix. Can you please help me fix this?

The method cov calculates the covariances for all pairs of variables that you give it. You have 9 variables in one array and 9 more in the other; that's 18 in total, so you get an 18 by 18 matrix. (Under the hood, cov concatenates the two arrays you give it before calculating the covariance.)
If you are only interested in the covariance of the variables from the first array with the variables from the second, pick the first half of the rows and the second half of the columns:
C = np.cov(originalData, generatedData, rowvar=False)[:9, 9:]
Or, in general, with two not necessarily equal matrices X and Y, take the first X.shape[1] rows and the last Y.shape[1] columns (note the column offset is X.shape[1], since the variables of X come first in the joint matrix):
C = np.cov(X, Y, rowvar=False)[:X.shape[1], X.shape[1]:]
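As a quick check, here is a minimal sketch with made-up data of the shapes from the question:
import numpy as np

originalData = np.random.randn(683, 9)   # hypothetical stand-ins for the real datasets
generatedData = np.random.randn(683, 9)

full = np.cov(originalData, generatedData, rowvar=False)
print(full.shape)        # (18, 18): joint covariance of all 18 variables

cross = full[:9, 9:]
print(cross.shape)       # (9, 9): covariance of each original feature with each generated feature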

Related

Getting variance values for random samples generated from a standard normal distribution using numpy

I have a function that gives me probability distributions for each class, in terms of a matrix corresponding to mean values and another matrix corresponding to variance values. For example, if I had four classes then I would have the following outputs:
y_means = [1,2,3,4]
y_variance = [0.01,0.02,0.03,0.04]
I need to do the following calculation to the mean values to continue with the rest of my program:
y_means = np.array(y_means)
y_means = np.reshape(y_means,(y_means.size,1))
A = np.random.randn(10,y_means.size)
y_means = np.matmul(A,y_means)
Here, I have used the numpy.random.randn function to generate random samples from a standard normal distribution, and then multiplied them by the matrix of mean values to obtain a new output matrix. The output matrix then has shape (10 x 1).
I need to do a similar calculation such that my output_variances will also be a (10 x 1) matrix. But it is not meaningful to multiply the variances in the same way with random samples from a standard normal distribution, because this would result in negative values as well. This is undesirable because my ultimate aim would be to create a normal distribution with these mean values and their corresponding variance values using:
torch.distributions.normal.Normal(loc=y_means, scale=y_variance)
So my question is whether there is any method by which I can get a variance value for each random sample generated by numpy.random.randn? Because then the multiplication of such a matrix would make more sense with output_variance.
Or if there is any other strategy for this that I might be unaware of, please let me know.
The problem mentioned in the question required another matrix of the same dimension as A that corresponded to a variance measure for the random samples present in A.
Taking a row-wise or column-wise variance of the matrix denoted by A using numpy.var() didn't give a similar 10 x 4 matrix to multiply with y_variance.
I had solved the above problem by using the following approach:
First create a matrix with the same dimensions as A with zero entries, using the following line of code:
A_var = np.zeros_like(A)
then, using torch.distributions, create normal distributions with the values in A as the means and zeros as the variances:
dist_A = torch.distributions.normal.Normal(loc=torch.Tensor(A), scale=torch.Tensor(A_var))
https://pytorch.org/docs/stable/distributions.html lists all the operations possible on Normal distributions in PyTorch. The sample() method can generate samples of any size from a given distribution. This property was exploited to first generate a sample matrix of size 10 x 10 x 4 and then calculate the variance along axis 0.
np.var(np.array(dist_A.sample((10,))), axis=0)
This would result in a variance matrix of size 10 x 4, which can be used for calculations with y_variance.
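Putting the pieces together, a minimal sketch of the sample-then-variance pattern (with the assumption of a small nonzero scale, since a scale of exactly zero makes every sample equal its mean and the resulting variance identically zero):
import numpy as np
import torch

A = np.random.randn(10, 4)                # matrix of means, 10 x 4 as in the question
scale = torch.full((10, 4), 0.1)          # assumption: small nonzero scale instead of zeros
dist_A = torch.distributions.normal.Normal(loc=torch.Tensor(A), scale=scale)

samples = dist_A.sample((10,))            # shape (10, 10, 4): 10 draws per entry
A_var = np.var(samples.numpy(), axis=0)   # elementwise sample variance, shape (10, 4)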

PCA -- Calculating Reduced Size Matrix With Numpy

I am trying to use PCA to reduce the size of an input image from 4096 x 4096 to 4096 x 163 while keeping its important attributes. However, there is something off with my method as I get incorrect results. I believe it is while constructing my matrix U. My results vs correct results are listed below.
Start code:
# Reshape data to 4096 x 163
X_reshape = np.transpose(X_all, (1,2,3,0)).reshape(-1, X_all.shape[0])
X = X_reshape[:, :163]
mean_array = np.mean(X, axis = 1)
X_tilde = np.reshape(mean_array, (4096,1))
X_tilde = X - X_tilde
# Construct the covariance matrix for computing u'_i
covmat = np.cov(X_tilde.T)
# Compute u'_i, which is stored in the variable v
w, v = np.linalg.eig(covmat)
# Compute u_i from u'_i, and store it in the variable U
U = np.dot(X_tilde, v)
# Normalize u_i, i.e., each column of U
U = U / np.linalg.norm(U)
My results:
PC1 explains 0.08589754681312775% of the total variance
PC2 explains 0.07613195994874819% of the total variance
First 100 PCs explains 0.943423133356313% of the total variance
Shape of U: (4096, 163)
First 5 elements of first column of U: [-0.00908046 -0.00905446 -0.00887831 -0.00879843 -0.00850907]
First 5 elements of last column of U: [0.00047628 0.00048451 0.00045043 0.00035762 0.00025785]
Expected results:
PC1 explains 14.32% of the total variance
PC2 explains 7.08% of the total variance
First 100 PCs explains 94.84% of the total variance
Shape of U: (4096, 163)
First 5 elements of first column of U: [0.03381537 0.03353881 0.03292298 0.03238798 0.03146345]
First 5 elements of last column of U: [-0.00672667 -0.00496044 -0.00672151 -0.00759426 -0.00543667]
There must be something off with my calculations, but I just can't figure out what. Let me know if you need additional information.
Proof I am using: [image of the derivation, not reproduced here]
It looks to me like you have the steps out of order. You're dropping dimensions from the input before you calculate the eigenvectors and eigenvalues, so you're effectively randomly dropping a bunch of input at this stage with no justification.
# Reshape data to 4096 x 163
X_reshape = np.transpose(X_all, (1,2,3,0)).reshape(-1, X_all.shape[0])
X = X_reshape[:, :163]
I don't quite follow what the intent is behind the call to transpose above, but I don't think it matters. You can only drop dimensions from the input after calculating the eigenvectors and eigenvalues of the covariance matrix. And you don't drop dimensions from the data explicitly; you truncate the matrix of eigenvectors and then use that reduced eigenvector matrix for the projection step.
The covariance matrix in this case should be a 4096x4096 matrix. Sort the eigenpairs so that the largest eigenvalue and its corresponding eigenvector come first (np.linalg.eig does not guarantee any particular order). You can then truncate the number of eigenvectors to 163 and create the dimension-reduced projection.
It's possible that I've misunderstood something about the assignment, but I am pretty sure this is the problem. I'm reluctant to say more since it's homework.
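For concreteness, a minimal sketch of that order of operations, with synthetic data and illustrative shapes (using np.linalg.eigh, which suits symmetric matrices such as a covariance matrix):
import numpy as np

# X: 4096 x N data matrix, one flattened image per column (N is illustrative)
N = 500
X = np.random.randn(4096, N)

# Center each pixel (row) around its mean
X_tilde = X - X.mean(axis=1, keepdims=True)

# Full 4096 x 4096 covariance matrix -- no dimensions dropped yet
covmat = np.cov(X_tilde)

# Eigendecomposition of the symmetric covariance matrix
w, v = np.linalg.eigh(covmat)

# Sort eigenpairs by decreasing eigenvalue, then truncate to 163 components
order = np.argsort(w)[::-1]
w, v = w[order], v[:, order]
U = v[:, :163]                     # 4096 x 163 matrix of top eigenvectors

# Project the centered data onto the reduced basis
X_reduced = U.T @ X_tilde          # 163 x N
print(U.shape, X_reduced.shape)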

np.cov giving unexpected number of values

When using the np.cov command on a random dataset of 10 points, I'm getting a 10x10 array as the answer. I think my data is not formatted correctly, but I'm not sure.
np.random.seed(1)
rho = 0.2
sigma = 1
cov = (sigma**2)*np.array([[1,rho],[rho,1]])
mean1 = (0,0)
x1 = np.random.multivariate_normal(mean1, cov, (10))
mean1 = np.mean(x1)
cov1 = np.cov(x1)
print(cov1)
This is the correct behavior: np.cov returns a covariance matrix.
By default it takes each row of the input as a variable, with the columns representing different observations of those variables; since x1 has 10 rows, you get a 10x10 matrix. To reverse this behavior, pass rowvar=False.
In other words, if you have two variables represented as the two columns of a matrix, you can use np.cov(data, rowvar=False) (or, equivalently, np.cov(data.T)) to get a 2 by 2 covariance matrix, in which the elements at cov[0,1] and cov[1,0] are the covariance between the two variables.
This is also discussed here.
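A minimal sketch with the data from the question:
import numpy as np

np.random.seed(1)
rho, sigma = 0.2, 1
cov = (sigma**2) * np.array([[1, rho], [rho, 1]])
x1 = np.random.multivariate_normal((0, 0), cov, 10)   # 10 observations of 2 variables

print(np.cov(x1).shape)                  # (10, 10): rows treated as variables
print(np.cov(x1, rowvar=False).shape)    # (2, 2): columns treated as variables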

Coefficients from linear sum of matrices?

I have a basis set of square matrices and a data set that I need to find the coefficients for, given that my data is a linear sum of the basis set.
def basis(a, b, c):
    return a*gam1 + b*gam2 + c*kapp + allsky
So data = basis(a, b, c), and I need the best-fit (least-squares) values of the coefficients a, b and c. The basis matrices and the data matrix are all square matrices of size 89x89. I have tried using np.linalg.lstsq; however, since my A matrix would need to be built from the 4 basis matrices, the array is no longer 2-dimensional, and lstsq throws an error stating the array dimension must be 2. Any help is appreciated.
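One common way to set this up (a minimal sketch, assuming gam1, gam2, kapp, allsky, and data are all 89x89 arrays) is to flatten each matrix into a column vector, so the problem becomes an ordinary 2-D least-squares system; allsky carries no coefficient, so it moves to the right-hand side:
import numpy as np

# Hypothetical 89x89 inputs standing in for the arrays in the question
gam1, gam2, kapp, allsky = (np.random.randn(89, 89) for _ in range(4))
data = 2.0*gam1 - 0.5*gam2 + 1.5*kapp + allsky   # fabricated data with known coefficients

# Each flattened basis matrix becomes one column: A has shape (89*89, 3)
A = np.column_stack([gam1.ravel(), gam2.ravel(), kapp.ravel()])
b = (data - allsky).ravel()

coeffs, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(coeffs)   # approximately [2.0, -0.5, 1.5]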

Covariance matrix for 9 arrays using np.cov

I have 9 different numpy arrays that denote the same quantity, in our case xi. They are of length 19 each, i.e. they have been binned.
The difference between these 9 arrays is that they have been calculated using jackknife resampling, i.e. by omitting some elements each time and repeating the same procedure 9 times.
I would now like to calculate the covariance matrix, which should be of size 19x19. The square root of the diagonal elements of this covariance matrix should give me the error on this quantity (xi) for each bin (19 bins overall).
The equation for the covariance matrix (the original image is not reproduced here; presumably the standard jackknife estimator) is:
C_ij = ((N-1)/N) * sum_k (xi_i^(k) - mean(xi_i)) * (xi_j^(k) - mean(xi_j))
Here xi is the quantity, i and j index the 19 bins, and k runs over the N = 9 jackknife resamplings.
I didn't want to write manual code, so I tried numpy.cov:
vstack = np.vstack((array1,array2,....,array9))
cov = np.cov(vstack)
This is giving me a matrix of size 9x9 instead of 19x19.
What is the mistake here? Each array (array1, array2, etc.) is of length 19.
As you can see in the example in the docs, the output is square, with one row and one column per input row; since you have 9 rows (one per resampling), you get a 9x9 matrix.
If you expect a 19x19 matrix, then you have your rows and columns mixed up: np.cov wants one row per variable (here, per bin), so transpose the stacked array:
vst = np.vstack((array1,array2,....,array9))
cov_matrix = np.cov(vst.T)
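A quick shape check with dummy data (a minimal sketch; note that np.cov normalizes by N-1 by default, so if the jackknife prefactor (N-1)/N is intended, the result needs rescaling):
import numpy as np

N, nbins = 9, 19
arrays = [np.random.randn(nbins) for _ in range(N)]   # dummy stand-ins for the 9 jackknife arrays

vst = np.vstack(arrays)            # shape (9, 19): one row per resampling
cov_matrix = np.cov(vst.T)         # shape (19, 19): one row/column per bin
print(cov_matrix.shape)

# np.cov applies 1/(N-1); the jackknife estimator applies (N-1)/N, so rescale:
jk_cov = cov_matrix * (N - 1)**2 / N
errors = np.sqrt(np.diag(jk_cov))  # one error bar per bin
print(errors.shape)                # (19,)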
