Extracting PCA components with sklearn - python

I am using sklearn's PCA for dimensionality reduction on a large set of images. Once the PCA is fitted, I would like to see what the components look like.
One can do so by looking at the components_ attribute. Not realizing that was available, I did something else instead:
each_component = np.eye(total_components)
component_im_array = pca.inverse_transform(each_component)
for i in range(num_components):
    component_im = component_im_array[i, :].reshape(height, width)
    # do something with component_im
In other words, I create an image in the PCA space that has all features but 1 set to 0. By inversely transforming them, I should then get the image in the original space which, once transformed, can be expressed solely with that PCA component.
The following image shows the results. On the left is the component calculated using my method; on the right is pca.components_[i] taken directly. Additionally, with my method most of the images look very similar (though they are not identical), whereas the images obtained from components_ differ markedly, as I would have expected.
Is there a conceptual problem in my method? Clearly the components from pca.components_[i] are correct (or at least more correct) than the ones I'm getting. Thanks!

Components and the inverse transform are two different things. The inverse transform maps scores (data expressed in the PCA space) back to the original image space:
from sklearn.decomposition import PCA
# Create a PCA model with two principal components
pca = PCA(2)
pca.fit(data)
# Get the scores by projecting the original data onto the components
scores = pca.transform(data)
# Reconstruct from the 2-dimensional scores
reconstruct = pca.inverse_transform(scores)
# The residual is the part not explained by the first two components
residual = data - reconstruct
Thus you are inverse transforming data rather than reading off the components, so the two results are completely different. You almost never inverse_transform the original data itself. pca.components_ holds the actual vectors representing the underlying axes used to project the data into the PCA space.

The difference between grabbing the components_ and doing an inverse_transform on the identity matrix is that the latter adds in the empirical mean of each feature. I.e.:
def inverse_transform(self, X):
    return np.dot(X, self.components_) + self.mean_
where self.mean_ was estimated from the training set.
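A quick numerical check of this, as a minimal sketch on random stand-in data (the array shapes and variable names are illustrative assumptions, not taken from the question): feeding the identity matrix to inverse_transform returns each row of components_ with mean_ added, so subtracting mean_ recovers the components.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for flattened images (assumption: 50 samples, 12 "pixels").
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=(50, 12))

pca = PCA(n_components=4).fit(data)

# One-hot vectors in PCA space, one per fitted component.
each_component = np.eye(pca.n_components_)
via_inverse = pca.inverse_transform(each_component)

# With the default whiten=False, each row is a component shifted by the mean,
# so subtracting the mean gives back components_.
recovered = via_inverse - pca.mean_
```

This would also be consistent with the images in the question looking alike: each one is the shared mean image plus a single unit-length component.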

Related

How to reuse the coefficient obtained from scipy.interpolate.RectBivariateSpline.get_coeffs() to reconstruct the interpolation function?

I have a huge 3D (x,y,z) data set, with (x,y) being the inputs and (z) the output. The dataset is very large, and I need to use that information in real time with minimal delay.
A look-up table might therefore be too slow. My idea is to interpolate the dataset and, in real time, calculate the value instead of looking it up. That way I don't have to store the original dataset; I can store the coefficients instead, which hopefully take less space than the original data set.
I used scipy.interpolate.RectBivariateSpline to perform the interpolation. I was able to fit the data and obtain the coefficients, but I am not sure how to reconstruct the interpolation function from them.
I want to emphasize that the interpolation function will only be evaluated at the input (x,y) points. Generalization is not a concern here.
from scipy import interpolate
import numpy as np
x = np.arange(1,500)
y = np.arange(2,200)
X,Y = np.meshgrid(x,y)
z = np.sin(X+Y).T
a = interpolate.RectBivariateSpline(x,y,z)
# print(len(a.get_coeffs()))
# coefficients can be obtained by a.get_coeffs()
# I want to have the following
# f = construct_spline_from_coefficient(a.get_coeffs())
# z = f(x_old, y_old)
Another approach I had in mind is to use a deep neural network. Can anyone shed some light here? Is this overkill?
The solution is in the official SciPy documentation (link).
Use the bisplrep function (rep stands for representation) to obtain the spline representation tck (see the docstring for bisplrep).
The output tck is a list [tx, ty, c, kx, ky] and can be stored in a file.
Use bisplev (ev stands for evaluation) to evaluate the spline from tck.
For using neural networks for interpolation, see this state-of-the-art paper:
Training Neural Networks for and by Interpolation.
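As a minimal sketch of how this could look for the RectBivariateSpline object from the question: the coefficients alone are not enough, since bisplev also needs the knots and the spline degrees. The example below assumes these are stored together (cubic degrees are passed explicitly so they are known) and uses a smaller grid than the question to keep it quick.

```python
import numpy as np
from scipy import interpolate

# Smaller stand-in grid than in the question (assumption, for speed).
x = np.arange(1, 50, dtype=float)
y = np.arange(2, 40, dtype=float)
X, Y = np.meshgrid(x, y)
z = np.sin(X + Y).T  # shape (len(x), len(y))

kx = ky = 3  # spline degrees, set explicitly so they can be stored
spline = interpolate.RectBivariateSpline(x, y, z, kx=kx, ky=ky)

# Everything needed to rebuild the spline later: knots, coefficients, degrees.
tx, ty = spline.get_knots()
c = spline.get_coeffs()
tck = [tx, ty, c, kx, ky]

# Evaluate on the grid from the stored representation only.
z_rebuilt = interpolate.bisplev(x, y, tck)
```

The five entries of tck could be saved with, for example, np.savez and loaded again before calling bisplev.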

Using sklearn's PCA to find un-mean-centered components

I am using sklearn's PCA for use in a physical modeling problem. In this problem, the return values from PCA.fit_transform() and PCA.components_ have physical meanings. However, it seems that sklearn's PCA automatically mean centers the input data, so that the return values of PCA.fit_transform() and PCA.components_ are in mean-centered space. I realize that PCA.inverse_transform returns the original un-mean-centered input data, but it does this through np.dot(X, PCA.components_) + PCA.mean_ where X is the return value of PCA.fit_transform().
In other words, how can I alter X and PCA.components_ into X1 and PCA.components_1 by using PCA.mean_ such that np.dot(X1,pca.components_1) returns the same value as PCA.inverse_transform(X)?
There is probably a simple linear algebra solution to this but I can't seem to figure it out.
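One possible way to do this, sketched under the assumption that an extra constant column is acceptable in the physical model: append a column of ones to X and append mean_ as an extra row of components_, so the mean is absorbed into a single matrix product. The names X1 and components_1 follow the question; the data is random stand-in data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, size=(40, 6))  # stand-in data with a non-zero mean

pca = PCA(n_components=3)
X = pca.fit_transform(data)

# Augment the scores with a constant column and the components with the mean row.
X1 = np.hstack([X, np.ones((X.shape[0], 1))])
components_1 = np.vstack([pca.components_, pca.mean_])

# A single product now includes the mean offset.
reconstruction = np.dot(X1, components_1)
```

Note that the added mean row is generally neither unit length nor orthogonal to the other components, so components_1 is no longer an orthonormal basis.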

How to change parameters of a scikit learn function dynamically i.e. find best parameter

I am trying to do dimensionality reduction using PCA function of sklearn, specifically
from sklearn.decomposition import PCA
def mypca(X,comp):
pca = PCA(n_components=comp)
pca.fit(X)
PCA(copy=True, n_components=comp, whiten=False)
Xpca = pca.fit_transform(X)
return Xpca
for n_comp in range(10,1000,20):
Xpca = mypca(X,n_comp) # X is a 2 dimensional array
print Xpca
I am calling mypca function from a loop with different values for comp. I am doing this in order to find the best value of comp for the problem I am trying to solve. But mypca function always returns the same value i.e. Xpca irrespective of value of comp.
The value it returns is correct for first value of comp I send from the loop i.e. Xpca value which it sends each time is correct for comp = 10 in my case.
What should I do in order to find best value of comp?
You use PCA to reduce the dimension.
From your code:
for n_comp in range(10,1000,20):
Xpca = mypca(X,n_comp) # X is a 2 dimensional array
print Xpca
Your input dataset X is only a 2 dimensional array and the minimum n_comp is 10, so PCA tries to find the 10 best dimensions for you. Since 10 > 2, you will always get the same answer. :)
It looks like you're trying to pass different values for number of components, and re-fit with each. A great thing about PCA is that it's actually not necessary to do this. You can fit the full number of components (even as many components as dimensions in your dataset), then simply discard the components you don't want (i.e. those with small variance). This is equivalent to re-fitting the entire model with fewer components. Saves a lot of computation.
How to do it:
# x = input data, size(<points>, <dimensions>)
# fit the full model
max_components = x.shape[1] # as many components as input dimensions
pca = PCA(n_components=max_components)
pca.fit(x)
# transform the data (contains all components)
y_all = pca.transform(x)
# keep only the top k components (with greatest variance)
k = 2
y = y_all[:, 0:k]
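The equivalence between truncating and re-fitting can be checked numerically. This is a small sketch on random stand-in data; svd_solver="full" is set explicitly (an assumption made here so that both fits use the same exact solver).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in data: 200 points, 8 dimensions with different variances.
x = rng.normal(size=(200, 8)) * np.arange(1, 9)

k = 2
# Full model, truncated afterwards.
y_all = PCA(n_components=x.shape[1], svd_solver="full").fit_transform(x)
y_truncated = y_all[:, :k]

# Model re-fitted with only k components.
y_refit = PCA(n_components=k, svd_solver="full").fit_transform(x)

# Compare magnitudes so that a possible sign flip of a component does not matter.
same = np.allclose(np.abs(y_truncated), np.abs(y_refit))
```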
In terms of how to select the number of components, it depends what you want to do. One standard way of choosing the number of components k is to look at the fraction of variance explained (R^2) by each choice of k. If your data is distributed near a low-dimensional linear subspace, then when you plot R^2 vs. k, the curve will have an 'elbow' shape. The elbow will be located at the dimensionality of the subspace. It's good practice to look at this curve because it helps understand the data. Even if there's no clean elbow, it's common to choose a threshold value for R^2, e.g. to preserve 95% of the variance.
Here's how to do it (this should be done on the model with max_components components):
# Calculate fraction of variance explained
# for each choice of number of components
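# explained_variance_ uses the n-1 normalization in current sklearn, so match it with ddof=1
# (pca.explained_variance_ratio_.cumsum() gives the same curve for the full model)
r2 = pca.explained_variance_.cumsum() / x.var(0, ddof=1).sum()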
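Picking k from a variance threshold can then be done in one line. The sketch below uses random stand-in data with three strong directions and an illustrative 95% threshold; both are assumptions for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in data: 3 strong latent directions embedded in 10 dimensions plus small noise.
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
x = latent @ mixing + 0.05 * rng.normal(size=(300, 10))

# Fit the full model and accumulate the fraction of variance explained.
pca = PCA(n_components=x.shape[1]).fit(x)
r2 = pca.explained_variance_ratio_.cumsum()

threshold = 0.95  # illustrative choice
# Smallest k whose cumulative explained variance reaches the threshold.
k = int(np.searchsorted(r2, threshold) + 1)
```

Plotting r2 against the number of components would show the elbow described above.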
Another way you might want to proceed is to take the PCA-transformed data and feed it to a downstream algorithm (e.g. classifier/regression), then select your number of components based on the performance (e.g. using cross validation).
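A minimal sketch of the cross-validation route, assuming a logistic regression as the downstream model and a synthetic classification problem (both are placeholders, not part of the question): the number of components becomes a hyperparameter of a pipeline and is chosen by grid search.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in problem: 300 samples, 20 features, 5 of them informative.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

pipeline = Pipeline([
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Candidate numbers of components to compare by 5-fold cross-validation.
candidate_k = [2, 5, 10, 20]
search = GridSearchCV(pipeline, {"pca__n_components": candidate_k}, cv=5)
search.fit(X, y)

best_k = search.best_params_["pca__n_components"]
```

Because PCA sits inside the pipeline, it is re-fitted on each training fold, which keeps the validation folds out of the projection.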
Side note: Maybe just a formatting issue, but your code block in mypca() should be indented, or it won't be interpreted as part of the function.

Feature Extraction: Dense SURFs, PCA-whitening, Improved Fisher Vectors & GMMs

I'm trying to implement the classifier discussed in this paper. I've implemented everything apart from the feature extraction. In section 5.1, the author(s) writes:
"For each superpixel, two feature types are extracted: Dense surfs which are transformed using signed squarerooting and Lab color values. In our experiments it has proved beneficial to also extract features around the superpixels namely within its bounding box, to include more context. Both surf and color values are encoded using Improved Fisher Vectors as implemented in VlFeat and a gmm with 64 modes. We perform pca-whitening on both feature channels. In the end the two encoded feature vectors are concatenated, producing a dense vector with 8’576 values."
There are a lot of things going on here, and I am confused in what order I should be performing the steps, as well as on which portion of the data set.
Here's my interpretation, in pseudo python:
def getFeatures(images):
    surfs_arr = []
    colors_arr = []
    for image in images:
        superpixels = findSuperpixels(image)
        for superpixel in superpixels:
            box = boundingBox(superpixel)
            surfs = findDenseSURFs(box)
            colors = findColorValues(box)
            surfs_arr.append(surfs)
            colors_arr.append(colors)
    surfs_sample = (randomly choose X samples from surfs_arr)
    colors_sample = (randomly choose Y samples from colors_arr)  # or histogram?
    # gmm has covariances, means properties
    gmm_surf = GMM(surfs_sample, modes=64)
    gmm_color = GMM(colors_sample, modes=64)
    surfs_as_fisher_vectors = IFV(gmm_surf, surfs_arr)
    colors_as_fisher_vectors = IFV(gmm_color, colors_arr)
    pca_surfs = PCA(surfs_as_fisher_vectors, whiten, n_components=64)
    pca_colors = PCA(colors_as_fisher_vectors, whiten, n_components=64)
    features = concatenate((pca_surfs, pca_colors), axis=1)
    return features
My questions:
i. Should PCA-whitening be performed prior to creating the GMMs? (like in this example)
ii. Should I remove the surfs_sample and colors_sample sets from
surfs_arr and colors_arr, respectively, before they are encoded as
Fisher Vectors?
iii. As far as describing color values, is it best to leave them as is or
create a histogram?
iv. The author states that he uses Dense SURFs, but makes no mention of
how dense. Do you recommend a particular starting point? 4x4,
16x16? Am I misunderstanding this?
v. Any idea where the author comes up with "a dense vector with 8,576
values"? To get a consistent number of features w/ different size
superpixels, it seems to me that he must be
1) using a histogram to represent the color values, and either
2a) resizing each superpixel, or
2b) changing the density of his SURF grid.
I'm working in python w/ numpy, opencv, scikit-learn, mahotas, and a fisher vector implementation ported from VLFeat.
Thanks.
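On question v, a back-of-the-envelope check may help. An Improved Fisher Vector built from a GMM with K modes over D-dimensional descriptors has 2*K*D entries (one mean gradient and one variance gradient per mode and dimension), independent of how many descriptors fall inside the superpixel. Assuming standard 64-dimensional SURF descriptors and 3-dimensional Lab values (both assumptions; the quoted passage does not state them), the numbers add up to the quoted total:

```python
def fisher_vector_dim(n_modes, descriptor_dim):
    """Length of an Improved Fisher Vector: mean and variance terms per mode."""
    return 2 * n_modes * descriptor_dim

n_modes = 64    # "a gmm with 64 modes" from the quoted passage
surf_dim = 64   # assumption: standard (non-extended) SURF descriptor length
lab_dim = 3     # assumption: one L, a, b triple per pixel

surf_fv = fisher_vector_dim(n_modes, surf_dim)   # 8192
color_fv = fisher_vector_dim(n_modes, lab_dim)   # 384
total = surf_fv + color_fv                       # 8576
```

Under this reading, the fixed length comes from the Fisher encoding itself, so neither a histogram nor resizing would be needed. It would also mean that reducing each Fisher vector to 64 components, as in the pseudocode, gives 128 values rather than 8,576, which hints that the whitening is applied to the local descriptors before the GMM.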

How to use scikit-learn PCA for features reduction and know which features are discarded

I am trying to run a PCA on a matrix of dimensions m x n where m is the number of features and n the number of samples.
Suppose I want to preserve the nf features with the maximum variance. With scikit-learn I am able to do it in this way:
from sklearn.decomposition import PCA
nf = 100
pca = PCA(n_components=nf)
# X is the matrix transposed (n samples on the rows, m features on the columns)
pca.fit(X)
X_new = pca.transform(X)
Now, I get a new matrix X_new that has a shape of n x nf. Is it possible to know which features have been discarded or the retained ones?
Thanks
The features that your PCA object has determined during fitting are in pca.components_. The vector space orthogonal to the one spanned by pca.components_ is discarded.
Please note that PCA does not "discard" or "retain" any of your pre-defined features (encoded by the columns you specify). It mixes all of them (by weighted sums) to find orthogonal directions of maximum variance.
If this is not the behaviour you are looking for, then PCA dimensionality reduction is not the way to go. For some simple general feature selection methods, take a look at sklearn.feature_selection.
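To make the "weighted sums" point concrete, here is a small sketch on random stand-in data (shapes and scales are assumptions for illustration). It shows that every transformed column is a combination of all centered input features, and, as a plain NumPy alternative, how to rank the actual columns by variance if original features are what should be kept.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in data: 100 samples (rows), 6 features (columns) with different scales.
X = rng.normal(size=(100, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])

nf = 2
pca = PCA(n_components=nf).fit(X)
X_new = pca.transform(X)

# Each new column is a weighted sum of all centered original features,
# with the weights given by one row of components_.
manual = (X - pca.mean_) @ pca.components_.T

# If actual columns should be kept instead, rank them by variance directly.
top_features = np.argsort(X.var(axis=0))[::-1][:nf]
```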
Projecting the features onto the principal components retains the important information (the axes with maximum variance) and drops the axes with small variance. This behaviour is closer to compression than to discarding features.
X_proj is a better name than X_new, because it is the projection of X onto the principal components.
You can reconstruct X_rec as
X_rec = pca.inverse_transform(X_proj) # X_proj is originally X_new
Here, X_rec is close to X, but the less important information was dropped by PCA. So we can say X_rec is denoised.
In my opinion, it is the noise that is discarded.
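The denoising effect can be illustrated on synthetic data where the noise-free signal is known. The rank-2 signal, the noise level, and the variable names other than X_proj and X_rec are assumptions made for this sketch.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in data: a rank-2 signal in 20 dimensions plus small isotropic noise.
clean = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 20))
X = clean + 0.1 * rng.normal(size=clean.shape)

pca = PCA(n_components=2).fit(X)
X_proj = pca.transform(X)
X_rec = pca.inverse_transform(X_proj)

# Mean squared distance to the noise-free signal, before and after reconstruction.
err_noisy = np.mean((X - clean) ** 2)
err_rec = np.mean((X_rec - clean) ** 2)
```

In this setup the reconstruction is closer to the clean signal than the noisy input, because the noise lying outside the two retained directions is removed. On real data, where signal and noise are not so cleanly separated, some signal is dropped as well.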
The answer marked above is incorrect. The sklearn site clearly states that the components_ array is sorted, so it can't be used to identify the important features.
components_ : array, [n_components, n_features]
Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
