Getting first principal component and reduction in variance with PCA using Numpy - python

I am following this example here: https://machinelearningmastery.com/calculate-principal-component-analysis-scratch-python/
from numpy import array, mean, cov
from numpy.linalg import eig

A = array([[1, 2], [3, 4], [5, 6]])
print(A)
# calculate the mean of each column
M = mean(A.T, axis=1)
print(M)
# center columns by subtracting column means
C = A - M
print(C)
# calculate covariance matrix of centered matrix
V = cov(C.T)
print(V)
# eigendecomposition of covariance matrix
values, vectors = eig(V)
print(vectors)
print(values)
# project data
P = vectors.T.dot(C.T)
print(P.T)
which gives:
original data
[[1 2]
 [3 4]
 [5 6]]
column mean
[ 3.  4.]
centered matrix
[[-2. -2.]
 [ 0.  0.]
 [ 2.  2.]]
covariance matrix
[[ 4.  4.]
 [ 4.  4.]]
vectors
[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]
values
[ 8.  0.]
projected data
[[-2.82842712  0.        ]
 [ 0.          0.        ]
 [ 2.82842712  0.        ]]
If I want to find the first principal direction, do I simply take the eigenvector that corresponds to the largest eigenvalue? That would be [0.70710678, 0.70710678] here.
Building on this, is the first principal component the data projected onto that largest-eigenvalue eigenvector? Something like:
vectors[:,:1].T.dot(C.T)
which gives:
array([[-2.82842712, 0. , 2.82842712]])
I just fear I have the terminology confused, or I'm oversimplifying things. Thanks in advance!
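For what it's worth, that reading seems right: the first principal direction is the eigenvector paired with the largest eigenvalue, and the first principal component is the centered data projected onto that direction. A minimal numpy sketch building on the variables above (the eigenvalues themselves give the fraction of variance each component explains):
import numpy as np
# sort eigenvalues (and matching eigenvectors) in descending order
order = np.argsort(values)[::-1]
values, vectors = values[order], vectors[:, order]
# first principal direction: eigenvector of the largest eigenvalue
pc1_direction = vectors[:, 0]         # [0.70710678, 0.70710678]
# first principal component: centered data projected onto that direction
pc1_scores = C.dot(pc1_direction)     # [-2.82842712, 0., 2.82842712]
# fraction of total variance captured by the first component
explained = values[0] / values.sum()  # 1.0 for this toy data
Note that eigenvector signs are arbitrary, so a library such as sklearn may return the same projection negated.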

Related

Numpy - Product of Cartesian Product Along Last Axis of Jagged Array

Given a nested list with unequal number of elements, I would like to find the fastest way to calculate the product of the cartesian product along the last axis. In other words, first calculate the cartesian product between all sublists, then find the multiplicative product along all combinations. Then finally, I want to insert those values into a matrix of the same size/dimensionality as the original input. As an added piece of complexity, I want to pad axes of shape (1, ) with an extra 0. For example:
example1 = [[1, 2], [3, 4], [5], [6], [7]]
should result in
[[[[[ 630.    0.]
    [   0.    0.]]
   [[   0.    0.]
    [   0.    0.]]]
  [[[ 840.    0.]
    [   0.    0.]]
   [[   0.    0.]
    [   0.    0.]]]]
 [[[[1260.    0.]
    [   0.    0.]]
   [[   0.    0.]
    [   0.    0.]]]
  [[[1680.    0.]
    [   0.    0.]]
   [[   0.    0.]
    [   0.    0.]]]]]
which has a shape (2, 2, 2, 2, 2), although it would be (2, 2, 1, 1, 1) without padding.
My initial function is:
def convert_nest_to_product_tensor(nest):
    # find indices to collect elements from
    combinations = list(itertools.product(*[range(len(l)) for l in nest]))
    # collect elements and then calculate product for every Cartesian product
    products = np.array(
        [np.prod([nest[i][idx] for i, idx in enumerate(comb)]) for comb in combinations]
    )
    # pad tensor for axes of shape 1
    tensor_shape = [len(l) for l in nest]
    tensor_shape = tuple([axis_shape + 1 if axis_shape == 1 else axis_shape for axis_shape in tensor_shape])
    tensor = np.zeros(tensor_shape)
    # insert values
    for i, idx in enumerate(combinations):
        tensor[idx] = products[i]
    return tensor
However, it takes a while, specifically the part where I find the product of the Cartesian products. I tried replacing that component with np.meshgrid + np.stack:
products = np.stack(np.meshgrid(*nest), axis=-1).reshape(-1, len(nest))
products = np.prod(products, axis=-1)
and while I get the correct values much faster, they are not in the correct output order:
[[[[[ 630.    0.]
    [   0.    0.]]
   [[   0.    0.]
    [   0.    0.]]]
  [[[1260.    0.]
    [   0.    0.]]
   [[   0.    0.]
    [   0.    0.]]]]
 [[[[ 840.    0.]
    [   0.    0.]]
   [[   0.    0.]
    [   0.    0.]]]
  [[[1680.    0.]
    [   0.    0.]]
   [[   0.    0.]
    [   0.    0.]]]]]
Any feedback on how to make this work (quickly) is much appreciated!
A simple way of getting the cartesian tuples and product:
In [10]: alist = list(itertools.product(*example1))
In [11]: alist
Out[11]: [(1, 3, 5, 6, 7), (1, 4, 5, 6, 7), (2, 3, 5, 6, 7), (2, 4, 5, 6, 7)]
In [12]: [np.prod(x) for x in alist]
Out[12]: [630, 840, 1260, 1680]
Or use math.prod for a no-numpy solution.
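The ordering mismatch in the meshgrid attempt comes from meshgrid's default indexing='xy', which swaps the first two output axes. Passing indexing='ij' keeps the axes in input order, so the raveled products should line up with itertools.product. A sketch of that fix:
import numpy as np
example1 = [[1, 2], [3, 4], [5], [6], [7]]
# 'ij' (matrix) indexing keeps axis order identical to the input lists
grids = np.meshgrid(*example1, indexing='ij')
products = np.prod(np.stack(grids, axis=-1), axis=-1)
print(products.ravel())  # [ 630  840 1260 1680]
The unraveled result already has shape (2, 2, 1, 1, 1), so it can be written into the zero-padded tensor with a single slice assignment instead of a Python loop.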

Interpolate between two matrices with numpy

I have two HxW matrices A and B. I'd like to get an NxHxW matrix C such that C[0]=A, C[-1]=B, and each of the remaining N-2 slices is linearly interpolated between A and B. Is there a single numpy function I can do this with, without needing a for loop?
Just use linspace if you are looking for linear interpolation between just 2 points.
A = np.array([[0, 1],
              [2, 3]])
B = np.array([[ 1,  3],
              [-1, -2]])
C = np.linspace(A, B, 4)  # <- use N here; the N-2 middle slices are linearly interpolated between A and B
C
array([[[ 0.        ,  1.        ],   # <-- A matrix is C[0]
        [ 2.        ,  3.        ]],
       [[ 0.33333333,  1.66666667],   #
        [ 1.        ,  1.33333333]],  # <-- elementwise equally spaced values
       [[ 0.66666667,  2.33333333],   #
        [ 0.        , -0.33333333]],
       [[ 1.        ,  3.        ],   # <-- B matrix is C[-1]
        [-1.        , -2.        ]]])
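As a side note, np.linspace has accepted array start/stop values since NumPy 1.16 (I believe); on older versions the same result can be built with explicit broadcasting. A sketch under that assumption:
import numpy as np
N = 4
t = np.linspace(0, 1, N)[:, None, None]  # shape (N, 1, 1), broadcasts over HxW
C = A + t * (B - A)                      # C[0] == A, C[-1] == B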

Numpy covariance command returning matrix with more dimensions than input

I have an arbitrary row vector "u" and an arbitrary matrix "e" as follows:
u = np.resize(np.array([8,3]),[1,2])
e = np.resize(np.array([[2,2,5,5],[1, 6, 7, 4]]),[4,2])
np.cov(u,e)
array([[ 12.5,   0. ,   0. , -12.5,   7.5],
       [  0. ,   0. ,   0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ,   0. ,   0. ],
       [-12.5,   0. ,   0. ,  12.5,  -7.5],
       [  7.5,   0. ,   0. ,  -7.5,   4.5]])
The matrix that this returns is 5x5. This is confusing to me because the largest dimension of the inputs is only 4.
Thus, this may be less of a numpy question and more of a math question...not sure...
Please refer to the official numpy documentation (https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.cov.html) and check whether your usage of the numpy.cov function is consistent with what you are trying to achieve, and that you understand what you are trying to do.
When looking at the signature
numpy.cov(m, y=None, rowvar=True, bias=False, ddof=None, fweights=None, aweights=None)
m : array_like
A 1-D or 2-D array containing multiple variables and observations.
Each row of m represents a variable, and each column a single observation of all those variables. Also see rowvar below.
y : array_like, optional
An additional set of variables and observations. y has the same form as that of m.
Note how m and y are combined as shown in the last example on the page
>>> x = [-2.1, -1, 4.3]
>>> y = [3, 1.1, 0.12]
>>> X = np.stack((x, y), axis=0)
>>> print(np.cov(X))
[[ 11.71 -4.286 ]
[ -4.286 2.14413333]]
>>> print(np.cov(x, y))
[[ 11.71 -4.286 ]
[ -4.286 2.14413333]]
>>> print(np.cov(x))
11.71
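In the question above, then, np.cov(u, e) stacks u (one row, i.e. one variable) on top of e (four rows, four variables), giving five variables with two observations each, hence the 5x5 output. A quick sketch to confirm the stacking behaviour:
import numpy as np
u = np.resize(np.array([8, 3]), [1, 2])
e = np.resize(np.array([[2, 2, 5, 5], [1, 6, 7, 4]]), [4, 2])
# np.cov(m, y) concatenates y below m row-wise before computing covariance
stacked = np.vstack((u, e))  # shape (5, 2): 5 variables, 2 observations
print(np.allclose(np.cov(u, e), np.cov(stacked)))  # True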

How to index columns with a computed array?

Please have a look at this code:
import numpy as np
from scipy.spatial import distance
#1
X = [[0,0], [0,1], [0,2], [0,3], [0,4], [0,5]]
c = [[0,0], [0,1], [0,3]]
#2
dists = distance.cdist(X, c)
print(dists)
#3
dmini = np.argmin(dists, axis=1)
print(dmini)
#4
mindists = dists[:, dmini]
print(mindists)
(#1) So I have my data X, some other points (centroids) c, then (#2) I compute the distance from each point in X to all the centroids c, and store the result in dists.
(#3) Then I select the index of the minimum distances with argmin.
(#4) Now I only want to select the value of the minimum values, using the indexes computed in step #3.
However, I get a strange output.
# dists
[[ 0.  1.  3.]
 [ 1.  0.  2.]
 [ 2.  1.  1.]
 [ 3.  2.  0.]
 [ 4.  3.  1.]
 [ 5.  4.  2.]]
# dmini
[0 1 1 2 2 2]
# mindists
[[ 0.  1.  1.  3.  3.  3.]
 [ 1.  0.  0.  2.  2.  2.]
 [ 2.  1.  1.  1.  1.  1.]
 [ 3.  2.  2.  0.  0.  0.]
 [ 4.  3.  3.  1.  1.  1.]
 [ 5.  4.  4.  2.  2.  2.]]
Reading here and there, it seems possible to select specific columns by giving a list of integers (indexes). In this case I should use the dmini values for indexing columns along rows.
I was expecting mindists to be (6,) in shape. What am I doing wrong?
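What dists[:, dmini] does is pull out the full column for every index in dmini, for every row, which is why the result is (6, 6). To take one element per row, pair each row index with its own column index. A minimal sketch of the usual fixes:
import numpy as np
rows = np.arange(len(dmini))   # [0 1 2 3 4 5]
mindists = dists[rows, dmini]  # pairwise (fancy) indexing -> shape (6,)
print(mindists)                # [0. 0. 1. 0. 1. 2.]
# equivalently, skip argmin and reduce directly:
print(dists.min(axis=1))       # same (6,) result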

Numpy - Modal matrix and diagonal Eigenvalues

I wrote a simple linear algebra code in Python Numpy to calculate the diagonal matrix of eigenvalues by computing $M^{-1} A M$ (M is the modal matrix, whose columns are the eigenvectors), and it behaves strangely.
Here's the code:
import numpy as np
array = np.arange(16)
array = array.reshape(4, -1)
print(array)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]
eigenvalues, eigenvectors = np.linalg.eig(array)
print(eigenvalues)
[ 3.24642492e+01 -2.46424920e+00  1.92979794e-15 -4.09576009e-16]
print(eigenvectors)
[[-0.11417645 -0.7327781   0.54500164  0.00135151]
 [-0.3300046  -0.28974835 -0.68602671  0.40644504]
 [-0.54583275  0.15328139 -0.2629515  -0.8169446 ]
 [-0.76166089  0.59631113  0.40397657  0.40914805]]
inverseEigenVectors = np.linalg.inv(eigenvectors)  # M^(-1)
diagonal = inverseEigenVectors.dot(array).dot(eigenvectors)  # M^(-1).A.M
print(diagonal)
[[ 3.24642492e+01 -1.06581410e-14  5.32907052e-15  0.00000000e+00]
 [ 7.54951657e-15 -2.46424920e+00 -1.72084569e-15 -2.22044605e-16]
 [-2.80737213e-15  1.46768503e-15  2.33547852e-16  7.25592561e-16]
 [-6.22319863e-15 -9.69656080e-16 -1.38050658e-30  1.97215226e-31]]
The final 'diagonal' matrix should be a diagonal matrix with the eigenvalues on the main diagonal and zeros elsewhere, but it's not: the first two values on the main diagonal ARE eigenvalues, while the last two aren't (although, just like the last two eigenvalues, they are nearly zero).
And by the way, a number like -1.06581410e-14 is effectively zero, so how can I make numpy show it as zero?
What am I doing wrong?
Thanks...
Just round the final result to the desired number of digits:
print(diagonal.round(5))
array([[ 32.46425,   0.     ,   0.     ,   0.     ],
       [  0.     ,  -2.46425,   0.     ,   0.     ],
       [  0.     ,   0.     ,   0.     ,   0.     ],
       [  0.     ,   0.     ,   0.     ,   0.     ]])
Don't confuse precision of computation with printing policies.
>>> diagonal[np.abs(diagonal) < 0.0000000001] = 0
>>> print(diagonal)
[[ 32.4642492   0.          0.          0.       ]
 [  0.         -2.4642492   0.          0.       ]
 [  0.          0.          0.          0.       ]
 [  0.          0.          0.          0.       ]]
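If you only want to change how numpy displays small floats, without modifying the data, np.set_printoptions should also work: with suppress=True, floats print in fixed-point notation, so values that round to zero at the current precision display as 0. A sketch:
import numpy as np
np.set_printoptions(suppress=True)  # display-only: the data keeps full precision
print(diagonal)                     # the ~1e-14 entries now print as 0.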
