PCA -- Calculating Reduced Size Matrix With Numpy - python

I am trying to use PCA to reduce the size of an input image from 4096 x 4096 to 4096 x 163 while keeping its important attributes. However, there is something off with my method as I get incorrect results. I believe it is while constructing my matrix U. My results vs correct results are listed below.
Start code:
# Reshape data to 4096 x 163
X_reshape = np.transpose(X_all, (1,2,3,0)).reshape(-1, X_all.shape[0])
X = X_reshape[:, :163]
mean_array = np.mean(X, axis = 1)
X_tilde = np.reshape(mean_array, (4096,1))
X_tilde = X - X_tilde
# Construct the covariance matrix for computing u'_i
covmat = np.cov(X_tilde.T)
# Compute u'_i, which is stored in the variable v
w, v = np.linalg.eig(covmat)
# Compute u_i from u'_i, and store it in the variable U
U = np.dot(X_tilde, v)
# Normalize u_i, i.e., each column of U
U = U / np.linalg.norm(U)
My results:
PC1 explains 0.08589754681312775% of the total variance
PC2 explains 0.07613195994874819% of the total variance
First 100 PCs explains 0.943423133356313% of the total variance
Shape of U: (4096, 163)
First 5 elements of first column of U: [-0.00908046 -0.00905446 -0.00887831 -0.00879843 -0.00850907]
First 5 elements of last column of U: [0.00047628 0.00048451 0.00045043 0.00035762 0.00025785]
Expected results:
PC1 explains 14.32% of the total variance
PC2 explains 7.08% of the total variance
First 100 PCs explains 94.84% of the total variance
Shape of U: (4096, 163)
First 5 elements of first column of U: [0.03381537 0.03353881 0.03292298 0.03238798 0.03146345]
First 5 elements of last column of U: [-0.00672667 -0.00496044 -0.00672151 -0.00759426 -0.00543667]
There must be something off with my calculations, I just can't figure out what. Let me know if you need additional information.
Proof I am using: (derivation image not reproduced here)

It looks to me like you have the steps out of order. You're dropping dimensions from the input before you calculate the eigenvectors and eigenvalues, so you're effectively randomly dropping a bunch of input at this stage with no justification.
# Reshape data to 4096 x 163
X_reshape = np.transpose(X_all, (1,2,3,0)).reshape(-1, X_all.shape[0])
X = X_reshape[:, :163]
I don't quite follow what the intent is behind the call to transpose above, but I don't think it matters. You can only drop dimensions from the input after calculating the eigenvectors and eigenvalues of the covariance matrix. And you don't drop dimensions from the data explicitly; you truncate the matrix of eigenvectors and then use that reduced eigenvector matrix for the projection step.
The covariance matrix in this case should be a 4096x4096 matrix. Note that np.linalg.eig does not guarantee any particular ordering, so sort the eigenpairs by eigenvalue in descending order so that the largest eigenvalue and its corresponding eigenvector come first. You can then truncate the eigenvector matrix to 163 columns and use it for the dimension-reduced projection.
It's possible that I've misunderstood something about the assignment, but I am pretty sure this is the problem. I'm reluctant to say more since it's homework.
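For reference, the corrected ordering described above can be sketched as follows. This is a minimal sketch on random stand-in data with deliberately small dimensions (50 features, 20 samples, 10 kept components, standing in for 4096, 163, 163); variable names mirror the question but the data is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_samples, k = 50, 20, 10  # stand-ins for 4096, 163, 163
X = rng.normal(size=(n_features, n_samples))

# Center each feature (row) across the samples
X_tilde = X - X.mean(axis=1, keepdims=True)

# Covariance over the features: an n_features x n_features matrix
covmat = np.cov(X_tilde)

# eigh (for symmetric matrices) returns eigenvalues in ascending order;
# flip so the largest eigenvalue and its eigenvector come first
w, v = np.linalg.eigh(covmat)
w, v = w[::-1], v[:, ::-1]

# Only now truncate: keep the top k eigenvectors as U
U = v[:, :k]

# Explained-variance ratios come from the eigenvalues
# (clip guards against tiny negative values from floating-point noise)
w_pos = np.clip(w, 0, None)
explained = w_pos / w_pos.sum()
```

The eigenvectors returned by eigh are already orthonormal, so no extra normalization of the columns of U is needed.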

Related

Getting variance values for random samples generated from a standard normal distribution using numpy

I have a function that gives me probability distributions for each class, in terms of a matrix corresponding to mean values and another matrix corresponding to variance values. For example, if I had four classes then I would have the following outputs:
y_means = [1,2,3,4]
y_variance = [0.01,0.02,0.03,0.04]
I need to do the following calculation to the mean values to continue with the rest of my program:
y_means = np.array(y_means)
y_means = np.reshape(y_means,(y_means.size,1))
A = np.random.randn(10,y_means.size)
y_means = np.matmul(A,y_means)
Here, I have used the numpy.random.randn function to generate random samples from a standard normal distribution, and then multiplied this with the matrix of mean values to obtain a new output matrix. The output matrix is then of size (10 x 1).
I need to do a similar calculation such that my output_variances will also be a (10 x 1) matrix. But it is not meaningful to multiply the variances in the same way with random samples from a standard normal distribution, because this would result in negative values as well. This is undesirable because my ultimate aim would be to create a normal distribution with these mean values and their corresponding variance values using:
torch.distributions.normal.Normal(loc=y_means, scale=y_variance)
So my question is: is there any method by which I can get a variance value for each random sample generated by numpy.random.randn? Because then multiplying such a matrix with output_variance would make more sense.
Or if there is any other strategy for this that I might be unaware of, please let me know.
The problem mentioned in the question required another matrix of the same dimension as A that corresponded to a variance measure for the random samples present in A.
Taking a row-wise or column-wise variance of the matrix denoted by A using numpy.var() didn't give a similar 10 x 4 matrix to multiply with y_variance.
I had solved the above problem by using the following approach:
First create a matrix with the same dimensions as A with zero entries, using the following line of code:
A_var = np.zeros_like(A)
then, using torch.distributions, create normal distributions with the values in A as the mean and zeroes as variance:
dist_A = torch.distributions.normal.Normal(loc=torch.Tensor(A), scale=torch.Tensor(A_var))
https://pytorch.org/docs/stable/distributions.html lists all the operations possible on Normal distributions in PyTorch. The sample() method can draw samples of any requested size from a given distribution. This property was exploited to first generate a sample tensor of size 10 x 10 x 4 and then calculate the variance along axis 0.
np.var(np.array(dist_A.sample((10,))), axis=0)
This would result in a variance matrix of size 10 x 4, which can be used for calculations with y_variance.
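For what it's worth, when the output is a fixed linear combination z = A @ y of independent components, the variance can also be propagated in closed form without any sampling: weights enter squared, so Var(z) = (A**2) @ y_variance. A minimal numpy-only sketch of this alternative (names mirror the question; this is not the answerer's approach above):

```python
import numpy as np

y_means = np.array([1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)
y_variance = np.array([0.01, 0.02, 0.03, 0.04]).reshape(-1, 1)

A = np.random.randn(10, y_means.size)

# Mean of the linear combination: (10 x 1)
out_means = A @ y_means

# Variance of a sum of independent variables: the weights enter squared,
# so the propagated variance is elementwise-squared A times the variances
out_variance = (A ** 2) @ y_variance  # (10 x 1), always non-negative
```

Because every entry of A**2 is non-negative, out_variance is guaranteed non-negative, which avoids the problem with negative values mentioned in the question.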

Using np.cov() on a centered matrix not equivalent to matrix multiplication between the array and its transpose

I'm trying to get the eigenvectors and values for the MNIST dataset.
I'm testing out a concept on the dataset so I can carry it to a different dataset
I have a matrix M where the rows are the images and the columns are the pixel values.
I'm trying to do the above in two ways (taken from https://mml-book.github.io/book/mml-book.pdf, chapter 10, section 1 and section 5):
M is of shape 500 rows x 784 columns
First, I'm using the following code:
V = np.cov(M.T)
and then using:
V2 = np.dot(M.T, M) / 783
According to numpy's guide on cov(), it seems like with a single array given, the results of both should be identical, but they're not: https://numpy.org/doc/stable/reference/generated/numpy.cov.html
sorry if the question is simple and there's an obvious answer
EDIT:
if I take the eigenvector with the highest eigenvalue from both methods and scale it so the lowest entry is zero and the highest is 255, I get the same vector. What am I missing here?
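The discrepancy is most likely centering: np.cov subtracts the column means internally, so the plain product M.T @ M / (n-1) only matches once M has been explicitly centered. A small sketch illustrating this, with random data in place of MNIST (same 500 x 784 shape as in the question):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(loc=5.0, size=(500, 784))  # rows = images, columns = pixels

# np.cov centers each variable (column of M) internally
V = np.cov(M.T)

# The plain product only agrees after explicitly removing the column means
Mc = M - M.mean(axis=0, keepdims=True)
V2 = Mc.T @ Mc / (M.shape[0] - 1)
```

On uncentered data the raw product V2 would instead approximate the second-moment matrix, whose leading eigenvector points roughly along the mean image, which is consistent with the two methods agreeing only after the vectors are rescaled.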

How to normalize 4D array ( not an image)?

I have a 4D array of shape (1948, 60, 2, 3) which tells the difference in end effector positions (x,y,z) over 60 time steps.
The number 1948 indicates the number of samples, 60 is the number of time steps, 2 is for left_arm and right_arm, 3 denotes the x,y,z positions.
a sample of how it looks is below:
array([[[ 3.93048840e-05, 7.70215296e-04, 1.13865805e-03],
[ 1.11679799e-04, -7.04810066e-04, 1.83552688e-04]],
[[ -6.26468389e-04, 6.86668923e-04, 1.57112482e-04],
[ 3.68164582e-04, 7.98948528e-04, 4.50642200e-04]],
[[ 2.51472961e-04, -2.48105983e-04, 7.52486843e-04],
[ 8.99905240e-05, 1.70473461e-04, -3.09927572e-04]],
[[ -7.52414330e-04, 5.46782063e-04, -3.76679264e-04],
[ -3.12531026e-04, -3.36585211e-04, 5.79075595e-05]],
[[ 7.69968002e-04, -1.95524291e-03, -8.65666619e-04],
[ 2.37583215e-04, 4.59415986e-04, 6.07292643e-04]],
[[ 1.41795261e-03, -1.62364401e-03, -8.99673829e-04],
I want to normalize this data as I need to train a neural network on it. How do I go about normalizing a 4D array? I have an intuition for images. Should I normalize each sample separately, or should the normalization be done over the entire 4D array?
The trick is to use keepdims=True, which lets broadcasting happen without the housekeeping work of re-extending dimensions. Hence, a solution that handles ndarrays of generic dimensionality would be:
# Get the min and max among all elements, separately for each component of the last axis
x_min = np.min(x, axis=tuple(range(x.ndim-1)), keepdims=True)
x_max = np.max(x, axis=tuple(range(x.ndim-1)), keepdims=True)
# Normalize with those min, max values, leveraging broadcasting
out = (x - x_min) / (x_max - x_min)
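As a quick sanity check of this recipe (using a smaller random array in place of the (1948, 60, 2, 3) data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 60, 2, 3))  # smaller stand-in for (1948, 60, 2, 3)

# Reduce over every axis except the last, keeping dims for broadcasting
x_min = np.min(x, axis=tuple(range(x.ndim - 1)), keepdims=True)
x_max = np.max(x, axis=tuple(range(x.ndim - 1)), keepdims=True)
out = (x - x_min) / (x_max - x_min)
```

After this, each of the three components of the last axis (the x, y, z channels) is independently scaled into [0, 1] across all samples, time steps, and arms.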
First, yes you can do normalization and there is no problem with that.
Second, there is nothing special about 4D arrays: normalization should simply be performed separately for each feature. Depending on the type of normalization, calculate the max and min (or mean and std) of each feature across all samples in the training set.
In your case you should decide which parts of the data come from the same distribution. So decide on each dimension:
1) The first dimension is just the number of samples, so it doesn't create new distributions. Treat it as the number of data entries.
2) Time step. Here you should decide: do the x, y, z values have a distinct distribution at each of the 60 time steps? If not, treat it the same way as the previous dimension. If yes, calculate max, min (or mean, std) separately at each time step. (For simplicity, ask: can the arm at step 0 actually take values similar to steps 30 or 60? If yes, they all correspond to data entries; if no, you have 60x as many features.)
3) Do the left arm and right arm have different x, y, z distributions? If yes, again calculate them separately. (I'd guess they do, since the left and right arms statistically tend to occupy different regions of space.)
4) The x, y, z values are definitely separate distributions, so calculate them separately.
Once you decide, you will have between 3 and 360 features (depending on your decisions); calculate the necessary values for them (max, min or mean, std) and perform the standard routine.
Hope it helps!

numpy.cov() returns unexpected output

I have a dataset X with 9 features and 683 rows (683 x 9). I want the covariance matrix between X and another dataset of the same shape. I use np.cov(originalData, generatedData, rowvar=False), but it returns a covariance matrix of shape 18 x 18; I expected 9 x 9. Can you please help me fix it?
The method cov calculates the covariances for all pairs of variables that you give it. You have 9 variables in one array, and 9 more in the other. That's 18 in total. So you get 18 by 18 matrix. (Under the hood, cov concatenates the two arrays you gave it before calculating the covariance).
If you are only interested in the covariance of the variables from the 1st array with the variables from the 2nd, pick the first half of rows and second half of columns:
C = np.cov(originalData, generatedData, rowvar=False)[:9, 9:]
Or in general, with two not necessarily equal matrices X and Y,
C = np.cov(X, Y, rowvar=False)[:X.shape[1], X.shape[1]:]
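A quick check of this slicing with random stand-in data (Y deliberately given a different width to show the general case; the cross block should match the cross-covariance computed directly):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(683, 9))
Y = rng.normal(size=(683, 5))  # deliberately a different width than X

full = np.cov(X, Y, rowvar=False)   # (9 + 5) x (9 + 5) joint covariance
C = full[:X.shape[1], X.shape[1]:]  # top-right 9 x 5 cross block

# Direct computation of the cross-covariance for comparison
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)
C_direct = Xc.T @ Yc / (X.shape[0] - 1)
```

The row slice stops at X.shape[1] and the column slice starts there too, because np.cov concatenates X's variables first, then Y's.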

Covariance matrix for 9 arrays using np.cov

I have 9 different numpy arrays that denote the same quantity, in our case xi. They are of length 19 each, i.e. they have been binned.
The difference between these 9 arrays is that they have been calculated using jackknife resampling, i.e. by omitting some of the elements each time, repeating this 9 times.
I would now like to calculate the covariance matrix, which should be of size 19x19. The square root of the diagonal elements of this covariance matrix should give me the error on this quantity (xi) for each bin (19 bins overall).
The equation for the covariance matrix is the standard jackknife estimator (the original equation image is not reproduced here):
C_ij = (N-1)/N * sum_{k=1}^{N} (xi_i^(k) - xbar_i) * (xi_j^(k) - xbar_j)
Here xi is the quantity, i and j index the 19 bins, k runs over the N = 9 jackknife resamples, and xbar_i is the mean of xi_i over the resamples.
I didn't want to write manual code, so I tried numpy.cov:
vstack = np.vstack((array1,array2,....,array9))
cov = np.cov(vstack)
This is giving me a matrix of size 9x9 instead of 19x19.
What is the mistake here? Each array, i.e. array1, array2...etc all are of length 19.
As you can see in the example in the docs, np.cov treats each row as a variable, so the output has shape (number of rows) x (number of rows); with 9 rows you get a 9x9 matrix.
If you expect a 19x19 matrix, then you have your rows and columns mixed up and should transpose:
vst = np.vstack((array1,array2,....,array9))
cov_matrix = np.cov(vst.T)
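A quick shape check with dummy data (9 jackknife resamples of 19 bins each, standing in for array1 through array9):

```python
import numpy as np

rng = np.random.default_rng(0)
arrays = [rng.normal(size=19) for _ in range(9)]  # stand-ins for array1..array9

vst = np.vstack(arrays)   # shape (9, 19): rows are resamples, columns are bins
cov_rows = np.cov(vst)    # treats the 9 rows as variables -> 9 x 9
cov_bins = np.cov(vst.T)  # treats the 19 bins as variables -> 19 x 19

# Square roots of the diagonal give a per-bin error estimate, length 19
errors = np.sqrt(np.diag(cov_bins))
```

One caveat: np.cov normalizes by 1/(N-1), while the jackknife covariance estimator carries an (N-1)/N prefactor, so a rescaling by (N-1)**2/N may be needed to match the jackknife formula exactly.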
