Using sklearn's PCA to find un-mean-centered components - python

I am using sklearn's PCA in a physical modeling problem in which the return values of PCA.fit_transform() and PCA.components_ have physical meanings. However, sklearn's PCA automatically mean-centers the input data, so those return values live in mean-centered space. I realize that PCA.inverse_transform returns the original, un-mean-centered input data, but it does this through np.dot(X, PCA.components_) + PCA.mean_, where X is the return value of PCA.fit_transform().
In other words, how can I alter X and PCA.components_ into X1 and PCA.components_1 by using PCA.mean_ such that np.dot(X1,pca.components_1) returns the same value as PCA.inverse_transform(X)?
There is probably a simple linear algebra solution to this but I can't seem to figure it out.
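For reference, one standard way to write that algebra is to append a column of ones to the scores and append PCA.mean_ as an extra row of the components matrix. Below is a minimal sketch with synthetic data; X1 and components_1 are just the hypothetical augmented arrays the question asks about.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 6))  # stand-in for the physical data

pca = PCA(n_components=3)
X = pca.fit_transform(data)

# Absorb the mean into the linear map: an extra "score" of 1 per sample,
# multiplied by an extra "component" equal to pca.mean_.
X1 = np.hstack([X, np.ones((X.shape[0], 1))])
components_1 = np.vstack([pca.components_, pca.mean_])

assert np.allclose(np.dot(X1, components_1), pca.inverse_transform(X))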

Related

PCA from scratch and Sklearn PCA giving different output

I am trying to implement PCA from scratch. Following is the code:
import numpy as np
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()  # standardization
X_new = sc.fit_transform(X)
Z = np.divide(np.dot(X_new.T, X_new), X_new.shape[0])  # covariance matrix
eig_values, eig_vectors = np.linalg.eig(Z)  # eigenvalue/eigenvector calculation
eigval_sorted = np.sort(eig_values)[::-1]  # eigenvalues in descending order
ev_index = np.argsort(eig_values)[::-1]  # sort indices from the original eigenvalues, not the sorted copy
pc = eig_vectors[:, ev_index]  # eigenvectors sorted by eigenvalue
W = pc[:, 0:2]  # extracting 2 components
print(W)
and getting the following components:
[[ 0.52237162 -0.37231836]
[-0.26335492 -0.92555649]
[ 0.58125401 -0.02109478]
[ 0.56561105 -0.06541577]]
When I use the sklearn's PCA I get the following two components:
array([[ 0.52237162, -0.26335492, 0.58125401, 0.56561105],
[ 0.37231836, 0.92555649, 0.02109478, 0.06541577]])
Projection onto the new feature space gives different figures in the two cases (plots not reproduced here).
Where am I doing it wrong and what can be done to resolve the problem?
The result of a PCA is technically not n vectors, but a subspace of dimension n. This subspace is represented by n vectors that span it.
In your case, while the vectors are different, the spanned subspace is the same, so the result of the PCA is the same.
If you want to align your solution perfectly with the sklearn solution, you need to normalise your solution in the same way. Apparently sklearn resolves the sign ambiguity with a fixed convention; you'd need to dig into the documentation (or source) to see exactly which one.
edit:
Yes, of course, what I wrote is imprecise. The algorithm itself returns an ordered orthonormal basis: vectors of length one, orthogonal to each other, and ordered by their 'importance' to the dataset. So that is far more information than just the subspace.
However, if v, w, u are a solution of the PCA, then so are ±v, ±w, ±u.
edit: It seems that np.linalg.eig has no mechanism to guarantee that it will always return the same set of eigenvectors representing an eigenspace; see also here:
NumPy linalg.eig
So a new version of numpy, or just how the stars are aligned today, can change your result, although for a PCA it should only vary in sign (±).
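For completeness, here is a minimal sketch of one common sign convention (similar in spirit to the flip sklearn applies internally, but not taken from its source): make the largest-magnitude entry of each eigenvector positive.
import numpy as np

def fix_signs(vectors):
    # vectors: shape (n_features, n_components), one eigenvector per column
    idx = np.argmax(np.abs(vectors), axis=0)  # row of the largest |entry| in each column
    signs = np.sign(vectors[idx, np.arange(vectors.shape[1])])
    return vectors * signs  # flip columns whose largest entry was negative

# Usage with the question's variables: W_aligned = fix_signs(W)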

Sklearn LDA-analysis won't generate 2 dimensions

I'm trying to plot a 3-feature dataset with a binary classification on a matplotlib plot. This worked with an example dataset provided in a guide (http://www.apnorton.com/blog/2016/12/19/Visualizing-Multidimensional-Data-in-Python/), but when I use my own dataset instead, LinearDiscriminantAnalysis only outputs a one-dimensional series, no matter what number I put in "n_components". Why does this not work with my own code?
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

Data = pd.read_csv("DataFrame.csv", sep=";")
x = Data.iloc[:, [3, 5, 7]]
y = Data.iloc[:, 8]
lda = LDA(n_components=2)
lda_transformed = pd.DataFrame(lda.fit_transform(x, y))
plt.scatter(lda_transformed[y==0][0], lda_transformed[y==0][1], label='Loss', c='red')
plt.scatter(lda_transformed[y==1][0], lda_transformed[y==1][1], label='Win', c='blue')
plt.legend()
plt.show()
When the number of different class labels, C, is less than the number of observations (which is almost always the case), linear discriminant analysis will only ever produce C - 1 discriminating components. Using n_components from the sklearn API is only a means to choose possibly fewer components, e.g. in the case when you know what dimensionality you'd like to reduce down to. But you could never use n_components to get more components.
This is discussed in the Wikipedia section on Multiclass LDA. The definition of the between-class scatter is given as
\Sigma_b = \frac{1}{C} \sum_{i=1}^{C} (\mu_i - \mu)(\mu_i - \mu)^{T}
which is the empirical covariance matrix among the population of class means. By definition, such a covariance matrix has rank at most C - 1.
... the variability between features will be contained in the subspace spanned by the eigenvectors corresponding to the C − 1 largest eigenvalues ...
So, because LDA uses a decomposition of the covariance matrix of the class means, the dimensionality reduction it can provide is based on the number of class labels, not on the sample size or the feature dimensionality.
In the example you linked, it doesn't matter how many features there are. The point is that the example uses 3 simulated cluster centers, so there are 3 class labels. This means linear discriminant analysis can project the data onto either a 1-dimensional or a 2-dimensional discriminating subspace.
But in your data, you start out with only 2 class labels, a binary problem. This means the dimensionality of the linear discriminant model can be at most 1-dimensional, literally a line that forms the decision boundary between the two classes. Dimensionality reduction with LDA in this case would simply be the projection of data points onto a particular normal vector of that separating line.
If you want to specifically reduce down to two dimensions, you can try many of the other algorithms that sklearn provides: t-SNE, ISOMAP, PCA and kernel PCA, random projection, multi-dimensional scaling, among others. Many of these allow you to choose the dimensionality of the projected space, up to the original feature dimensionality, or sometimes you can even project into larger spaces, like with kernel PCA.
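As a quick illustration of one of those alternatives (a hedged sketch, not part of the original answer; it assumes x is the 3-column feature frame from the question), kernel PCA will happily give a 2-dimensional projection of a binary-labelled dataset:
from sklearn.decomposition import KernelPCA

# Non-linear 2-D projection of the 3 original features, purely for plotting.
embedded = KernelPCA(n_components=2, kernel="rbf").fit_transform(x)
# embedded has shape (n_samples, 2) and can be scattered like lda_transformed above.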
In the example you linked, dimensionality reduction by LDA goes from 13 features down to 2; in your own case, however, it goes from 3 features to 1 (even though you wanted 2 features), so it is not possible to plot in 2D.
If you really want to select 2 features out of the 3, you can use feature_selection.SelectKBest to choose the 2 best features, and there won't be any problem plotting in 2D.
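A minimal sketch of that suggestion (assuming x and y are the frames from the question; f_classif is just one reasonable scoring function):
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=2)  # keep the 2 highest-scoring features
x_2d = selector.fit_transform(x, y)
print(selector.get_support())  # boolean mask of the retained columns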
For more information, please read this fantastic answer for PCA:
https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
It's probably because the sklearn implementation won't allow you to do so if you only have 2 classes. The problem has been discussed here: https://github.com/scikit-learn/scikit-learn/issues/1967.

How to change parameters of a scikit learn function dynamically i.e. find best parameter

I am trying to do dimensionality reduction using the PCA function of sklearn; specifically:
from sklearn.decomposition import PCA
def mypca(X, comp):
    pca = PCA(n_components=comp)
    pca.fit(X)
    PCA(copy=True, n_components=comp, whiten=False)
    Xpca = pca.fit_transform(X)
    return Xpca

for n_comp in range(10, 1000, 20):
    Xpca = mypca(X, n_comp)  # X is a 2 dimensional array
    print(Xpca)
I am calling the mypca function from a loop with different values for comp, in order to find the best value of comp for the problem I am trying to solve. But mypca always returns the same Xpca, irrespective of the value of comp.
The value it returns is only correct for the first value of comp sent from the loop, i.e. the Xpca it returns every time is the one for comp = 10 in my case.
What should I do in order to find the best value of comp?
You use PCA to reduce the dimension.
From your code:
for n_comp in range(10, 1000, 20):
    Xpca = mypca(X, n_comp)  # X is a 2 dimensional array
    print(Xpca)
Your input dataset X is only a 2-dimensional array, and the minimum n_comp is 10, so PCA tries to find the 10 best dimensions for you. Since 10 > 2, you will always get the same answer. :)
It looks like you're trying to pass different values for number of components, and re-fit with each. A great thing about PCA is that it's actually not necessary to do this. You can fit the full number of components (even as many components as dimensions in your dataset), then simply discard the components you don't want (i.e. those with small variance). This is equivalent to re-fitting the entire model with fewer components. Saves a lot of computation.
How to do it:
# x = input data, size (<points>, <dimensions>)
from sklearn.decomposition import PCA

# fit the full model
max_components = x.shape[1]  # as many components as input dimensions
pca = PCA(n_components=max_components)
pca.fit(x)
# transform the data (contains all components)
y_all = pca.transform(x)
# keep only the top k components (with greatest variance)
k = 2
y = y_all[:, 0:k]
In terms of how to select the number of components, it depends what you want to do. One standard way of choosing the number of components k is to look at the fraction of variance explained (R^2) by each choice of k. If your data is distributed near a low-dimensional linear subspace, then when you plot R^2 vs. k, the curve will have an 'elbow' shape. The elbow will be located at the dimensionality of the subspace. It's good practice to look at this curve because it helps understand the data. Even if there's no clean elbow, it's common to choose a threshold value for R^2, e.g. to preserve 95% of the variance.
Here's how to do it (this should be done on the model with max_components components):
# Calculate fraction of variance explained
# for each choice of number of components
r2 = pca.explained_variance_.cumsum() / x.var(0, ddof=1).sum()  # ddof=1 matches the (n - 1) normalization of explained_variance_
Another way you might want to proceed is to take the PCA-transformed data and feed it to a downstream algorithm (e.g. classifier/regression), then select your number of components based on the performance (e.g. using cross validation).
Side note: make sure the code block inside mypca() is indented (as shown above), or it won't be interpreted as part of the function.
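To illustrate the cross-validation idea mentioned above, here is a rough sketch (not from the original answer) that assumes a classification task with labels y: a grid search over a PCA-plus-classifier pipeline picks the number of components with the best cross-validated accuracy.
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([("pca", PCA()), ("clf", LogisticRegression(max_iter=1000))])
param_grid = {"pca__n_components": [2, 5, 10, 20]}  # candidate dimensionalities
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)  # X, y: your data and labels
print(search.best_params_)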

How to use scikit-learn PCA for features reduction and know which features are discarded

I am trying to run a PCA on a matrix of dimensions m x n where m is the number of features and n the number of samples.
Suppose I want to preserve the nf features with the maximum variance. With scikit-learn I am able to do it in this way:
from sklearn.decomposition import PCA
nf = 100
pca = PCA(n_components=nf)
# X is the matrix transposed (n samples on the rows, m features on the columns)
pca.fit(X)
X_new = pca.transform(X)
Now, I get a new matrix X_new that has a shape of n x nf. Is it possible to know which features have been discarded and which have been retained?
Thanks
The features that your PCA object has determined during fitting are in pca.components_. The vector space orthogonal to the one spanned by pca.components_ is discarded.
Please note that PCA does not "discard" or "retain" any of your pre-defined features (encoded by the columns you specify). It mixes all of them (by weighted sums) to find orthogonal directions of maximum variance.
If this is not the behaviour you are looking for, then PCA dimensionality reduction is not the way to go. For some simple, general feature selection methods, you can take a look at sklearn.feature_selection.
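If you just want to see how strongly each original feature is weighted in the retained components, a rough sketch (not part of the original answer) is to look at the magnitudes in pca.components_. Note that this shows how features are mixed, not which ones were "discarded".
import numpy as np

# pca is the fitted PCA object from the question
loadings = np.abs(pca.components_)  # shape (nf, m): |weight| of each feature per component
total_weight = loadings.sum(axis=0)  # overall weight of each original feature
print(np.argsort(total_weight)[::-1])  # feature indices, most heavily weighted first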
The features projected onto the principal components retain the important information (the axes with maximum variance) and drop the axes with small variance. This behaviour is similar to compression (not discarding).
X_proj would be a better name than X_new, because it is the projection of X onto the principal components.
You can reconstruct X_rec as
X_rec = pca.inverse_transform(X_proj)  # X_proj is originally X_new
Here, X_rec is close to X, but the less important information was dropped by PCA, so we can say X_rec is denoised.
In my opinion, what is discarded is the noise.
The answer marked above is incorrect. The sklearn site clearly states that the components_ array is sorted, so it can't be used to identify the important features.
components_ : array, [n_components, n_features]
Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

Extracting PCA components with sklearn

I am using sklearn's PCA for dimensionality reduction on a large set of images. Once the PCA is fitted, I would like to see what the components look like.
One can do so by looking at the components_ attribute. Not realizing that was available, I did something else instead:
import numpy as np

# pca is an already-fitted PCA model; total_components, num_components,
# height and width come from the image data
each_component = np.eye(total_components)
component_im_array = pca.inverse_transform(each_component)
for i in range(num_components):
    component_im = component_im_array[i, :].reshape(height, width)
    # do something with component_im
In other words, I create an image in the PCA space that has all features but 1 set to 0. By inversely transforming them, I should then get the image in the original space which, once transformed, can be expressed solely with that PCA component.
The image (not reproduced here) showed the results: on the left, the component calculated using my method; on the right, pca.components_[i] directly. Additionally, with my method most images are very similar (though not identical), while the images obtained directly from components_ are very different, as I would have expected.
Is there a conceptual problem in my method? Clearly the components from pca.components_[i] are correct (or at least more correct) than the ones I'm getting. Thanks!
Components and inverse transform are two different things. The inverse transform maps the components back to the original image space:
from sklearn.decomposition import PCA

# Create a PCA model with two principal components
pca = PCA(2)
pca.fit(data)

# Get the scores by transforming the original data
scores = pca.transform(data)

# Reconstruct from the 2-dimensional scores
reconstruct = pca.inverse_transform(scores)

# The residual is the amount not explained by the first two components
residual = data - reconstruct
Thus you are inverse transforming back into the original data space, not obtaining the components, and so the results are completely different. You almost never inverse_transform the original data. pca.components_ are the actual vectors representing the underlying axes used to project the data into the PCA space.
The difference between grabbing the components_ and doing an inverse_transform on the identity matrix is that the latter adds in the empirical mean of each feature. I.e.:
def inverse_transform(self, X):
    return np.dot(X, self.components_) + self.mean_
where self.mean_ was estimated from the training set.
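A minimal sketch with synthetic data (not from the original answer) showing that relationship: inverse-transforming the identity matrix gives pca.components_ plus pca.mean_, so subtracting the mean recovers the components exactly.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))  # stand-in for the flattened images

pca = PCA(n_components=5).fit(data)

identity_images = pca.inverse_transform(np.eye(5))  # the question's approach
assert np.allclose(identity_images - pca.mean_, pca.components_)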
