Principal Component Analysis (PCA) in Python numpy using the Snapshot method

Principal Component Analysis (PCA) in Python numpy using the Snapshot method - python

I am trying to implement PCA analysis using numpy to mimic the results from sklearn's decomposition.PCA classifier.
I am using as input vectors of N flattened images of fixed size M = 128x192 (image dimensions) joined horizontally into a single matrix D of dimensions MxN
I am aiming to use the Snapshot method, as other implementations (see here and here) crash my build while computing np.cov, since the size of the covariant matrix would be C = D(D^T) = MxM.
The snapshot method first computes C_acute = (D^T)D, then computes the (acute) eigenvectors and values of this NxN matrix. This gives eigenvectors that are (D^T)v, and eigenvalues that are the same.
To retrieve the eigenvectors v from the (acute) eigenvectors, we simply do v = (1/eigenvalue) * (D(v_acute)).
Here is the reference implementation I am using adapted from this SO post (which is known to work):
class TemplatePCA:
def __init__(self, n_components=None):
self.n_components = n_components
def fit_transform(self, X):
X -= np.mean(X, axis = 0)
R = np.cov(X, rowvar=False)
# calculate eigenvectors & eigenvalues of the covariance matrix
evals, evecs = np.linalg.eig(R)
# sort eigenvalue in decreasing order
idx = np.argsort(evals)[::-1]
evecs = evecs[:,idx]
# sort eigenvectors according to same index
evals = evals[idx]
# select the first n eigenvectors (n is desired dimension
# of rescaled data array, or dims_rescaled_data)
evecs = evecs[:, :self.n_components]
# carry out the transformation on the data using eigenvectors
# and return the re-scaled data
return -1 * np.dot(X, evecs) #
Here is the implementation I have so far.
class MyPCA:
def __init__(self, n_components=None):
self.n_components = n_components
def fit_transform(self, X):
X -= np.mean(X, axis = 0)
D = X.T
M, N = D.shape
D_T = X # D.T == (X.T).T == X
C_acute = np.dot(D_T, D)
eigen_values, eigen_vectors_acute = np.linalg.eig(C_acute)
eigen_vectors = []
for i in range(eigen_vectors_acute.shape[0]): # for each eigenvector
v = np.dot(D, eigen_vectors_acute[i]) / eigen_values[i]
eigen_vectors.append(v)
eigen_vectors = np.array(eigen_vectors)
# sort eigenvalues and eigenvectors in decreasing order
idx = np.argsort(eigen_values)[::-1]
eigen_vectors = eigen_vectors[:,idx]
eigen_values = eigen_values[idx]
# select the first n_components eigenvectors
eigen_vectors = eigen_vectors[:, :self.n_components]
# carry out the transformation on the data using eigenvectors
# return the re-scaled data (projection)
return np.dot(C_acute, eigen_vectors)
The reference text I am using notes that:
The eigenvector is now (D^T)v, so to do face detection we first multiply our test image vector by (D^T) before projecting onto the eigenimages.
I am not sure whether it is possible to retrieve the exact same principal components (i.e. eigenvectors) using this method, and it would seem impossible to even get the same eigenvectors back, since the size of the eigen_vectors_acute is only (4, 6) (meaning there are only 4 vectors), compared to the other method where it is (6, 6) (there are 6).
Running both on an input:
x = np.array([
[0.387,123, 789,256, 4878, 5.42],
[0.723,9.78,1.90,1234, 12104,5.25],
[1,123, 67.98,7.91,12756,5.52],
[1.524,1.34,23.456,1.23,6787,3.94],
])
# These two are the same
print(sklearn.decomposition.PCA(n_components=3).fit_transform(x))
print(TemplatePCA(n_components=3).fit_transform(x))
# This one is different
print(MyPCA(n_components=3).fit_transform(x))
Output:
[[ 4282.20163145 147.84415964 -267.73483211]
[-3025.62452358 683.58580386 67.76941319]
[-3599.15380006 -569.33984612 -148.62757658]
[ 2342.57669218 -262.09011737 348.5929955 ]]
[[-4282.20163145 -147.84415964 267.73483211]
[ 3025.62452358 -683.58580386 -67.76941319]
[ 3599.15380006 569.33984612 148.62757658]
[-2342.57669218 262.09011737 -348.5929955 ]]
[[ 3.35535639e+15, -5.70493660e+17, -8.57482740e+17],
[-2.45510474e+15, 4.17428591e+17, 6.27417685e+17],
[-2.82475918e+15, 4.80278997e+17, 7.21885236e+17],
[ 1.92450753e+15, -3.27213928e+17, -4.91820181e+17]]

Related

Is there a vector 2-norm function in Pydrake?

I have defined the following function to compute the pairwise distances between positions of different agents:
def compute_pairwise_distance(X, x_dims):
"""Compute the distance between each pair of agents"""
assert len(set(x_dims)) == 1
m = sym if X.dtype == object else np
n_agents = len(x_dims)
n_states = x_dims[0]
pair_inds = np.array(list(itertools.combinations(range(n_agents), 2)))
X_agent = X.reshape(-1, n_agents, n_states).swapaxes(0, 2)
dX = X_agent[:2, pair_inds[:, 0]] - X_agent[:2, pair_inds[:, 1]]
return m.linalg.norm(dX, axis=0)
where X is a 2-dimensional array, and x_dim refers to the dimension of the state vector of each agent as a list, such as [4,4,4], which means there are 3 agents each with 4 state vectors. However, since I'd like to include this pairwise distance metric into a collision-avoidance cost function in symbolic form, the following error occurred:
def cost_avoidance(x,x_dim):
#`x` here is a 1-dimensional vector
m = sym if x.dtype == object else np
if len(x_dim) == 1:
return 0
threshold = 0.5 #threshold distance below which cost avoidance is activated
distances = compute_pairwise_distance(x,x_dim)
cost_avoid = np.sum((distances[distances<threshold]-threshold)**2)*1000
return cost_avoid
<ipython-input-46-eeef96aeac91> in cost_avoidance(x, x_dim)
7 threshold = 0.5 #threshold distance below which cost avoidance is activated
8
----> 9 distances = compute_pairwise_distance(x,x_dim)
10
11 cost_avoid = np.sum((distances[distances<threshold]-threshold)**2)*1000
<ipython-input-45-8b13423fbdb8> in compute_pairwise_distance(X, x_dims)
14 # return torch.linalg.norm(dX, dim=0)
15
---> 16 return m.linalg.norm(dX, axis=0)
AttributeError: module 'pydrake.symbolic' has no attribute 'linalg'
It seems like I must use a symbolic version of vector 2-norm. However, I can't find one according to the documentation. Is there a symbolic vector 2-norm function in Pydrake at all?
https://drake.mit.edu/pydrake/index.html

If you do want the 2-norm in symbolic form, then np.sqrt(x.dot(x)) will do the trick:
import numpy as np
from pydrake.all import MakeVectorVariable
x = MakeVectorVariable(2, 'x')
print(np.sqrt(x.dot(x)))
But if your goal is collision avoidance, you might want to take a look at Drake's implementation of MinimumDistanceConstraint. There are a lot of details in there that make things work well -- like smoothing potential discontinuities at zero distance and/or when the closest body changes to a different collision pair.

How to vectorize indexing and computation when indexed tensors are different dimensions?

I'm trying to vectorize the following for-loop in Pytorch. I'd be happy with just vectorizing the inner for-loop, but doing the whole batch would also be awesome.
# B: the batch size
# N: the number of training examples
# dim: the dimension of each feature vector
# K: the number of discrete labels. each vector has a single label
# delta: margin for hinge loss
batch_data = torch.tensor(...) # Tensor of shape [B x N x d]
batch_labels = torch.tensor(...) # Tensor of shape [B x N x 1], each element is one of K labels (ints)
batch_losses = [] # Ultimately should be [B x 1]
batch_centroids = [] # Ultimately should be [B x K_i x dim]
for i in range(B):
centroids = [] # Keep track of the means for each class.
classes = torch.unique(labels) # Get the unique labels for the classes.
# NOTE: The number of classes K for each item in the batch might actually
# be different. This may complicate batch-level operations.
total_loss = 0
# For each class independently. This is the part I want to vectorize.
for cl in classes:
# Take the subset of training examples with that label.
subset = data[torch.where(labels == cl)]
# Find the centroid of that subset.
centroid = subset.mean(dim=0)
centroids.append(centroid)
# Get the distance between each point in the subset and the centroid.
dists = subset - centroid
norm = torch.linalg.norm(dists, dim=1)
# The loss is the mean of the hinge loss across the subset.
margin = norm - delta
hinge = torch.clamp(margin, min=0.0) ** 2
total_loss += hinge.mean()
# Keep track of everything. If it's too hard to keep track of centroids, that's also OK.
loss = total_loss.mean()
batch_losses.append(loss)
batch_centroids.append(centroids)
I've been scratching my head on how to deal with the irregularly sized tensors. The number of classes in each batch K_i is different, and the size of each subset is different.

It turns out it actually is possible to vectorize across ragged arrays. I'll use numpy, but code should be directly translatable to torch. The key technique is to:
Sort by ragged array membership
Perform an accumulation
Find boundary indices, compute adjacent differences
For a single (non-batch) input of an n x d matrix X and an n-length array label, the following returns the k x d centroids and n-length distances to respective centroids:
def vcentroids(X, label):
"""
Vectorized version of centroids.
"""
# order points by cluster label
ix = np.argsort(label)
label = label[ix]
Xz = X[ix]
# compute pos where pos[i]:pos[i+1] is span of cluster i
d = np.diff(label, prepend=0) # binary mask where labels change
pos = np.flatnonzero(d) # indices where labels change
pos = np.repeat(pos, d[pos]) # repeat for 0-length clusters
pos = np.append(np.insert(pos, 0, 0), len(X))
Xz = np.concatenate((np.zeros_like(Xz[0:1]), Xz), axis=0)
Xsums = np.cumsum(Xz, axis=0)
Xsums = np.diff(Xsums[pos], axis=0)
counts = np.diff(pos)
c = Xsums / np.maximum(counts, 1)[:, np.newaxis]
repeated_centroids = np.repeat(c, counts, axis=0)
aligned_centroids = repeated_centroids[inverse_permutation(ix)]
dist = np.sum((X - aligned_centroids) ** 2, axis=1)
return c, dist
Batching requires little special handling. For an input B x n x d array batch_X, with B x n batch labels batch_labels, create unique labels for each batch:
batch_k = batch_labels.max(axis=1) + 1
batch_k[1:] = batch_k[:-1]
batch_k[0] = 0
base = np.cumsum(batch_k)
batch_labels += base.expand_dims(1)
So now each batch element has a unique contiguous range of labels. I.e., the first batch element will have n labels in some range [0, k0) where k0 = batch_k[0], the second will have range [k0, k0 + k1) where k1 = batch_k[1], etc.
Then just flatten the n x B x d input to n*B x d and call the same vectorized method. Your loss function is derivable using the final distances and same position-array based reduction technique.
For a detailed explanation of how the vectorization works, see my blog post.

You can vectorize the whole thing if you use a one-hot encoding for your classes and a pairwise distance trick for your norms:
import torch
B = 32
N = 1000
dim = 50
K = 25
batch_data = torch.randn((B, N, dim))
batch_labels = torch.randint(0, K, size=(B, N))
batch_one_hot = torch.nn.functional.one_hot(batch_labels)
centroids = torch.matmul(
batch_one_hot.transpose(-1, 1).type(batch_data.dtype),
batch_data
) / batch_one_hot.sum(1)[..., None]
norms = torch.linalg.norm(batch_data[:, :, None] - centroids[:, None], axis=-1)
# Compute the rest of your loss
# ...
A couple things to watch out for:
You'll get a divide by zero for any batches that have a missing class. You can handle this by first computing the class sums (with matmul) and counts (summing the one-hot tensor along axis 1) separately. Then, mask the sums with count == 0 and divide the rest of them by their class counts.
If you have a large number of classes, this will cause memory problems because the one-hot tensor will be too big. In that case, the answer from #VF1 probably makes more sense.

Most efficient way to index into a numpy array from a scipy CSR matrix?

I have a numpy ndarray X with shape (4000, 3), where each sample in X is a 3D coordinate (x,y,z).
I have a scipy csr matrix nn_rad_csr of shape (4000, 4000), which is the nearest neighbors graph generated from sklearn.neighbors.radius_neighbors_graph(X, 0.01, include_self=True).
nn_rad_csr.toarray()[i] is a shape (4000,) sparse vector with binary weights (0 or 1) associated with the edges in the nearest neighbors graph from node X[i].
For instance, if nn_rad_csr.toarray()[i][j] == 1 then X[j] is within the nearest neighbor radius of X[i], whereas a value of 0 means it is not a neighbor.
What I'd like to do is have a function radius_graph_conv(X, rad) which returns an array Y which is X, averaged by its neighbors' values. I'm not sure how to exploit the sparsity of a CSR matrix to efficiently perform radius_graph_conv. I have two naive implementations of graph conv below.
import numpy as np
from sklearn.neighbors import radius_neighbors_graph, KDTree
def radius_graph_conv(X, rad):
nn_rad_csr = radius_neighbors_graph(X, rad, include_self=True)
csr_indices = nn_rad_csr.indices
csr_indptr = nn_rad_csr.indptr
Y = np.copy(X)
for i in range(X.shape[0]):
j, k = csr_indptr[i], csr_indptr[i+1]
neighbor_idx = csr_indices[j:k]
rad_neighborhood = X[neighbor_idx] # ndim always 2
Y[i] = np.mean(rad_neighborhood, axis=0)
return Y
def radius_graph_conv_matmul(X, rad):
nn_rad_arr = radius_neighbors_graph(X, rad, include_self=True).toarray()
# np.sum(nn_rad_arr, axis=-1) is basically a count of neighbors
return np.matmul(nn_rad_arr / np.sum(nn_rad_arr, axis=-1), X)
Is there a better way to do this? With a knn graph, its a very simple function, since the number of neighbors is fixed and you can just index into X, but with a radius or density based nearest neighbors graph, you have to work with a CSR, (or an array of arrays if you are using a kd tree).

Here is the direct way of exploiting csr format. Your matmul solution probably does similar things under the hood. But we save one lookup (from the .data attribute) by also exploiting that it is an adjacency matrix; also, diffing .indptr should be more efficient than summing the equivalent amount of ones.
>>> import numpy as np
>>> from scipy import sparse
>>>
# create mock data
>>> A = np.random.random((100, 100)) < 0.1
>>> A = (A | A.T).view(np.uint8)
>>> AS = sparse.csr_matrix(A)
>>> X = np.random.random((100, 3))
>>>
# dense solution for reference
>>> Xa = A # X / A.sum(axis=-1, keepdims=True)
# sparse solution
>>> XaS = np.add.reduceat(X[AS.indices], AS.indptr[:-1], axis=0) / np.diff(AS.indptr)[:, None]
>>>
# check they are the same
>>> np.allclose(Xa, XaS)
True

Calculate the Euclidean distance for 2 different size arrays [duplicate]

I have two arrays of x-y coordinates, and I would like to find the minimum Euclidean distance between each point in one array with all the points in the other array. The arrays are not necessarily the same size. For example:
xy1=numpy.array(
[[ 243, 3173],
[ 525, 2997]])
xy2=numpy.array(
[[ 682, 2644],
[ 277, 2651],
[ 396, 2640]])
My current method loops through each coordinate xy in xy1 and calculates the distances between that coordinate and the other coordinates.
mindist=numpy.zeros(len(xy1))
minid=numpy.zeros(len(xy1))
for i,xy in enumerate(xy1):
dists=numpy.sqrt(numpy.sum((xy-xy2)**2,axis=1))
mindist[i],minid[i]=dists.min(),dists.argmin()
Is there a way to eliminate the for loop and somehow do element-by-element calculations between the two arrays? I envision generating a distance matrix for which I could find the minimum element in each row or column.
Another way to look at the problem. Say I concatenate xy1 (length m) and xy2 (length p) into xy (length n), and I store the lengths of the original arrays. Theoretically, I should then be able to generate a n x n distance matrix from those coordinates from which I can grab an m x p submatrix. Is there a way to efficiently generate this submatrix?

(Months later)
scipy.spatial.distance.cdist( X, Y )
gives all pairs of distances,
for X and Y 2 dim, 3 dim ...
It also does 22 different norms, detailed
here .
# cdist example: (nx,dim) (ny,dim) -> (nx,ny)
from __future__ import division
import sys
import numpy as np
from scipy.spatial.distance import cdist
#...............................................................................
dim = 10
nx = 1000
ny = 100
metric = "euclidean"
seed = 1
# change these params in sh or ipython: run this.py dim=3 ...
for arg in sys.argv[1:]:
exec( arg )
np.random.seed(seed)
np.set_printoptions( 2, threshold=100, edgeitems=10, suppress=True )
title = "%s dim %d nx %d ny %d metric %s" % (
__file__, dim, nx, ny, metric )
print "\n", title
#...............................................................................
X = np.random.uniform( 0, 1, size=(nx,dim) )
Y = np.random.uniform( 0, 1, size=(ny,dim) )
dist = cdist( X, Y, metric=metric ) # -> (nx, ny) distances
#...............................................................................
print "scipy.spatial.distance.cdist: X %s Y %s -> %s" % (
X.shape, Y.shape, dist.shape )
print "dist average %.3g +- %.2g" % (dist.mean(), dist.std())
print "check: dist[0,3] %.3g == cdist( [X[0]], [Y[3]] ) %.3g" % (
dist[0,3], cdist( [X[0]], [Y[3]] ))
# (trivia: how do pairwise distances between uniform-random points in the unit cube
# depend on the metric ? With the right scaling, not much at all:
# L1 / dim ~ .33 +- .2/sqrt dim
# L2 / sqrt dim ~ .4 +- .2/sqrt dim
# Lmax / 2 ~ .4 +- .2/sqrt dim

To compute the m by p matrix of distances, this should work:
>>> def distances(xy1, xy2):
... d0 = numpy.subtract.outer(xy1[:,0], xy2[:,0])
... d1 = numpy.subtract.outer(xy1[:,1], xy2[:,1])
... return numpy.hypot(d0, d1)
the .outer calls make two such matrices (of scalar differences along the two axes), the .hypot calls turns those into a same-shape matrix (of scalar euclidean distances).

The accepted answer does not fully address the question, which requests to find the minimum distance between the two sets of points, not the distance between every point in the two sets.
Although a straightforward solution to the original question indeed consists of computing the distance between every pair and subsequently finding the minimum one, this is not necessary if one is only interested in the minimum distances. A much faster solution exists for the latter problem.
All the proposed solutions have a running time that scales as m*p = len(xy1)*len(xy2). This is OK for small datasets, but an optimal solution can be written that scales as m*log(p), producing huge savings for large xy2 datasets.
This optimal execution time scaling can be achieved using scipy.spatial.KDTree as follows
import numpy as np
from scipy import spatial
xy1 = np.array(
[[243, 3173],
[525, 2997]])
xy2 = np.array(
[[682, 2644],
[277, 2651],
[396, 2640]])
# This solution is optimal when xy2 is very large
tree = spatial.KDTree(xy2)
mindist, minid = tree.query(xy1)
print(mindist)
# This solution by #denis is OK for small xy2
mindist = np.min(spatial.distance.cdist(xy1, xy2), axis=1)
print(mindist)
where mindist is the minimum distance between each point in xy1 and the set of points in xy2

For what you're trying to do:
dists = numpy.sqrt((xy1[:, 0, numpy.newaxis] - xy2[:, 0])**2 + (xy1[:, 1, numpy.newaxis - xy2[:, 1])**2)
mindist = numpy.min(dists, axis=1)
minid = numpy.argmin(dists, axis=1)
Edit: Instead of calling sqrt, doing squares, etc., you can use numpy.hypot:
dists = numpy.hypot(xy1[:, 0, numpy.newaxis]-xy2[:, 0], xy1[:, 1, numpy.newaxis]-xy2[:, 1])

import numpy as np
P = np.add.outer(np.sum(xy1**2, axis=1), np.sum(xy2**2, axis=1))
N = np.dot(xy1, xy2.T)
dists = np.sqrt(P - 2*N)

I think the following function also works.
import numpy as np
from typing import Optional
def pairwise_dist(X: np.ndarray, Y: Optional[np.ndarray] = None) -> np.ndarray:
Y = X if Y is None else Y
xx = (X ** 2).sum(axis = 1)[:, None]
yy = (Y ** 2).sum(axis = 1)[:, None]
return xx + yy.T - 2 * (X # Y.T)
Explanation
Suppose each row of X and Y are coordinates of the two sets of points.
Let their sizes be m X p and p X n respectively.
The result will produce a numpy array of size m X n with the (i, j)-th entry being the distance between the i-th row and the j-th row of X and Y respectively.

I highly recommend using some inbuilt method for calculating squares, and roots for they are customized for optimized way to calculate and very safe against overflows.
#alex answer below is the most safest in terms of overflow and should also be very fast. Also for single points you can use math.hypot which now supports more than 2 dimensions.
>>> def distances(xy1, xy2):
... d0 = numpy.subtract.outer(xy1[:,0], xy2[:,0])
... d1 = numpy.subtract.outer(xy1[:,1], xy2[:,1])
... return numpy.hypot(d0, d1)
Safety concerns
i, j, k = 1e+200, 1e+200, 1e+200
math.hypot(i, j, k)
# np.hypot for 2d points
# 1.7320508075688773e+200
np.sqrt(np.sum((np.array([i, j, k])) ** 2))
# RuntimeWarning: overflow encountered in square
overflow/underflow/speeds

I think that the most straightforward and efficient solution is to do it like this:
distances = np.linalg.norm(xy1, xy2) # calculate the euclidean distances between the test point and the training features.
min_dist = numpy.min(dists, axis=1) # get the minimum distance
min_id = np.argmi(distances) # get the index of the class with the minimum distance, i.e., the minimum difference.

Although many answers here are great, there is another way which has not been mentioned here, using numpy's vectorization / broadcasting properties to compute the distance between each points of two different arrays of different length (and, if wanted, the closest matches). I publish it here because it can be very handy to master broadcasting, and it also solves this problem elengantly while remaining very efficient.
Assuming you have two arrays like so:
# two arrays of different length, but with the same dimension
a = np.random.randn(6,2)
b = np.random.randn(4,2)
You can't do the operation a-b: numpy complains with operands could not be broadcast together with shapes (6,2) (4,2). The trick to allow broadcasting is to manually add a dimension for numpy to broadcast along to. By leaving the dimension 2 in both reshaped arrays, numpy knows that it must perform the operation over this dimension.
deltas = a.reshape(6, 1, 2) - b.reshape(1, 4, 2)
# contains the distance between each points
distance_matrix = (deltas ** 2).sum(axis=2)
The distance_matrix has a shape (6,4): for each point in a, the distances to all points in b are computed. Then, if you want the "minimum Euclidean distance between each point in one array with all the points in the other array", you would do :
distance_matrix.argmin(axis=1)
This returns the index of the point in b that is closest to each point of a.

Python PCA - projection into lower dimensional space

i am trying to implement PCA, which worked well regarding the intermediate results such as eigenvalues and eigenvectors. Yet when i try to project the data (3 dimensional) into the a 2D-principal-component space, the result is wrong.
I spent a lot of time comparing my code to other implementations such as:
http://sebastianraschka.com/Articles/2014_pca_step_by_step.html
Yet after a long time there is no progress and I can not find the mistake. I assume the problem is a simple coding mistake due to the correct intermediate results.
Thanks in advance for anyone who actually read this question and thanks even more to those who give helpful comments/answers.
My code is as follows:
import numpy as np
class PCA():
def __init__(self, X):
#center the data
X = X - X.mean(axis=0)
#calculate covariance matrix based on X where data points are represented in rows
C = np.cov(X, rowvar=False)
#get eigenvectors and eigenvalues
d,u = np.linalg.eigh(C)
#sort both eigenvectors and eigenvalues descending regarding the eigenvalue
#the output of np.linalg.eigh is sorted ascending, therefore both are turned around to reach a descending order
self.U = np.asarray(u).T[::-1]
self.D = d[::-1]
**problem starts here**
def project(self, X, m):
#use the top m eigenvectors with the highest eigenvalues for the transformation matrix
Z = np.dot(X,np.asmatrix(self.U[:m]).T)
return Z
The result of my code is:
myresult
([[ 0.03463706, -2.65447128],
[-1.52656731, 0.20025725],
[-3.82672364, 0.88865609],
[ 2.22969475, 0.05126909],
[-1.56296316, -2.22932369],
[ 1.59059825, 0.63988429],
[ 0.62786254, -0.61449831],
[ 0.59657118, 0.51004927]])
correct result - such as by sklearn.PCA
([[ 0.26424835, -2.25344912],
[-1.29695602, 0.60127941],
[-3.59711235, 1.28967825],
[ 2.45930604, 0.45229125],
[-1.33335186, -1.82830153],
[ 1.82020954, 1.04090645],
[ 0.85747383, -0.21347615],
[ 0.82618248, 0.91107143]])
The input is defined as follows:
X = np.array([
[-2.133268233289599,0.903819474847349,2.217823388231679,-0.444779660856219,-0.661480010318842,-0.163814281248453,-0.608167714051449, 0.949391996219125],
[-1.273486742804804,-1.270450725314960,-2.873297536940942, 1.819616794091556,-2.617784834189455, 1.706200163080549,0.196983250752276,0.501491995499840],
[-0.935406638147949,0.298594472836292,1.520579082270122,-1.390457671168661,-1.180253547776717,-0.194988736923602,-0.645052874385757,-1.400566775105519]]).T

You need to center your data by subtracting the mean before you project it onto the new basis:
mu = X.mean(0)
C = np.cov(X - mu, rowvar=False)
d, u = np.linalg.eigh(C)
U = u.T[::-1]
Z = np.dot(X - mu, U[:2].T)
print(Z)
# [[ 0.26424835 -2.25344912]
# [-1.29695602 0.60127941]
# [-3.59711235 1.28967825]
# [ 2.45930604 0.45229125]
# [-1.33335186 -1.82830153]
# [ 1.82020954 1.04090645]
# [ 0.85747383 -0.21347615]
# [ 0.82618248 0.91107143]]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Principal Component Analysis (PCA) in Python numpy using the Snapshot method - python

Related

Is there a vector 2-norm function in Pydrake?

How to vectorize indexing and computation when indexed tensors are different dimensions?

Most efficient way to index into a numpy array from a scipy CSR matrix?

Calculate the Euclidean distance for 2 different size arrays [duplicate]

Python PCA - projection into lower dimensional space

Categories

Resources