PCA with sklearn. Unable to figure out feature selection with PCA - python

I have been trying to do some dimensionality reduction using PCA. I currently have an image of size (100, 100) and I am using a filterbank of 140 Gabor filters where each filter gives me a response which is again an image of (100, 100). Now, I wanted to do feature selection where I only wanted to select non-redundant features and I read that PCA might be a good way to do.
So I proceeded to create a data matrix which has 10000 rows and 140 columns. So, each row contains the various responses of the Gabor filters for that filterbank. Now, as I understand it I can do a decomposition of this matrix using PCA as
from sklearn.decomposition import PCA
pca = pca(n_components = 3)
pca.fit(Q) # Q is my 10000 X 140 matrix
However, now I am confused as to how I can figure out which of these 140 feature vectors to keep from here. I am guessing it should give me 3 of these 140 vectors (corresponding to the Gabor filters which contain the most information about the image) but I have no idea how to proceed from here.

PCA will give you a linear combination of features, not a selection of features. It will give you the linear combination that is the best for reconstruction in the L2 sense, aka the one that captures the most variance.
What is you goal? If you do this on one image, any kind of selection will give you features that will discriminate best some parts of an image against other parts of the same image.
Also: Garbor Filters are a sparse basis for natural images. I would not expect anything interesting to happen unless you have very specific images.

Related

PCA in Sklearn: how to return dimensions which explain the most variation, in order? [duplicate]

I need to use pca to identify the dimensions with the highest variance of a certain set of data. I'm using scikit-learn's pca to do it, but I can't identify from the output of the pca method what are the components of my data with the highest variance. Keep in mind that I don't want to eliminate those dimensions, only identify them.
My data is organized as a matrix with 150 rows of data, each one with 4 dimensions. I'm doing as follow:
pca = sklearn.decomposition.PCA()
pca.fit(data_matrix)
When I print pca.explained_variance_ratio_, it outputs an array of variance ratios ordered from highest to lowest, but it doesn't tell me which dimension from the data they correspond to (I've tried changing the order of columns on my matrix, and the resulting variance ratio array was the same).
Printing pca.components_ gives me a 4x4 matrix (I left the original number of components as argument to pca) with some values I can't understand the meaning of...according to scikit's documentation, they should be the components with the maximum variance (the eigenvectors perhaps?), but no sign of which dimension those values refer to.
Transforming the data doesn't help either, because the dimensions are changed in a way I can't really know which one they were originally.
Is there any way I can get this information with scikit's pca? Thanks
The pca.explained_variance_ratio_ returned are the variances from principal components. You can use them to find how many dimensions (components) your data could be better transformed by pca. You can use a threshold for that (e.g, you count how many variances are greater than 0.5, among others). After that, you can transform the data by PCA using the number of dimensions (components) that are equal to principal components higher than the threshold used. The data reduced to these dimensions are different from the data on dimensions in original data.
you can check the code from this link:
http://scikit-learn.org/dev/tutorial/statistical_inference/unsupervised_learning.html#principal-component-analysis-pca

How can I initialize K means clustering on a data matrix with 569 rows (samples), and 30 columns (features)?

I'm having trouble understanding how to begin my solution. I have a matrix with 569 rows, each representing a single sample of my data, and 30 columns representing the features of each sample. My intuition is to plot each individual row, and see what the clusters (if any) look like, but I can't figure out how to do more than 2 rows on a single scatter plot.
I've spent several hours looking through tutorials, but have not been able to understand how to apply it to my data. I know a scatter plot takes 2 vectors as a parameter, so how could I possibly plot all 569 samples to cluster them? Am I missing something fundamental here?
#our_data is a 2-dimensional matrix of size 569 x 30
plt.scatter(our_data[0,:], our_data[1,:], s = 40)
My goal is to start k means clustering on the 569 samples.
Since you have a 30-dimensinal factor space, it is difficult to plot such data in 2D space (i.e. on canvas). In such cases usually apply dimension reduction techniques first. This could help to understand data structure. You can try to apply,e.g. PCA (principal component analysis) first, e.g.
#your_matrix.shape = (569, 30)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
projected_data = pca.fit_transform(your_matrix)
plt.scatter(projected_data[:,0], projected_data[:, 1]) # This might be very helpful for data structure understanding...
plt.show()
You can also look on other (including non-linear) dimension reduction techniques, such as, e.g. T-sne.
Further you can apply k-means or something else; or apply k-means to projected data.
If by initialize you mean picking the k initial clusters, one of the common ways of doing so is to use K-means++ described here which was developed in order to avoid poor clusterings.
It essentially entails semi-randomly choosing centers based upon a probability distribution of distances away from a first center that is chosen completely randomly.

Sklearn LDA-analysis won't generate 2 dimensions

I'm trying to plot a 3-feature dataset with a binary classification on a matplotlib plot. This worked with an example dataset provided in a guide (http://www.apnorton.com/blog/2016/12/19/Visualizing-Multidimensional-Data-in-Python/) but when I try to instead insert my own dataset, the LinearDiscriminantAnalysis will only output a one-dimensional series, no matter what number I put in "n_components". Why would this not work with my own code?
Data = pd.read_csv("DataFrame.csv", sep=";")
x = Data.iloc[:, [3, 5, 7]]
y = Data.iloc[:, 8]
lda = LDA(n_components=2)
lda_transformed = pd.DataFrame(lda.fit_transform(x, y))
plt.scatter(lda_transformed[y==0][0], lda_transformed[y==0][1], label='Loss', c='red')
plt.scatter(lda_transformed[y==1][0], lda_transformed[y==1][1], label='Win', c='blue')
plt.legend()
plt.show()
In the case when the number of different class labels, C, is less than the number of observations (almost always), then linear discriminant analysis will always produce C - 1 discriminating components. Using n_components from the sklearn API is only a means to choose possibly fewer components, e.g. in the case when you know what dimensionality you'd like to reduce down to. But you could never use n_components to get more components.
This is discussed in the Wikipedia section on Multiclass LDA. The definition of the between-class scatter is given as
\Sigma_{b} = (1 / C) \sum_{i}^{C}( (\mu_{i} - mu)(\mu_{i} - mu)^{T}
which is the empirical covariance matrix among the population of class means. By definition, such a covariance matrix has rank at most C - 1.
... the variability between features will be contained in the subspace spanned by the eigenvectors corresponding to the C − 1 largest eigenvalues ...
So because LDA uses a decomposition of the class mean covariance matrix, it means the dimensionality reduction it can provide is based on the number of class labels, and not on the sample size nor the feature dimensionality.
In the example you linked, it doesn't matter how many features there are. The point is that the example uses 3 simulated cluster centers, so there are 3 class labels. This means linear discriminant analysis could produce projection of the data onto either 1-dimensional or 2-dimensional discriminating subspaces.
But in your data, you start out with only 2 class labels, a binary problem. This means the dimensionality of the linear discriminant model can be at most 1-dimensional, literally a line that forms the decision boundary between the two classes. Dimensionality reduction with LDA in this case would simply be the projection of data points onto a particular normal vector of that separating line.
If you want to specifically reduce down to two dimensions, you can try many of the other algorithms that sklearn provides: t-SNE, ISOMAP, PCA and kernel PCA, random projection, multi-dimensional scaling, among others. Many of these allow you to choose the dimensionality of the projected space, up to the original feature dimensionality, or sometimes you can even project into larger spaces, like with kernel PCA.
In the example you give, dimension reduction by LDA reduces the data from 13 features to 2 features, however in your example it reduces from 3 to 1 (even though you wanted to get 2 features), thus it is not possible to plot in 2D.
If you really want to select 2 features out of 3, you can use feature_selection.SelectKBest to choose 2 best features and there won't be any problems plotting in 2D.
For more information, please read this fantastic answer for PCA:
https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
Probably it's because of sklearn implementation that won't allowed you to do so if you only have 2 class. The problem has been stated in here, https://github.com/scikit-learn/scikit-learn/issues/1967.

How to use scikit-learn PCA for features reduction and know which features are discarded

I am trying to run a PCA on a matrix of dimensions m x n where m is the number of features and n the number of samples.
Suppose I want to preserve the nf features with the maximum variance. With scikit-learn I am able to do it in this way:
from sklearn.decomposition import PCA
nf = 100
pca = PCA(n_components=nf)
# X is the matrix transposed (n samples on the rows, m features on the columns)
pca.fit(X)
X_new = pca.transform(X)
Now, I get a new matrix X_new that has a shape of n x nf. Is it possible to know which features have been discarded or the retained ones?
Thanks
The features that your PCA object has determined during fitting are in pca.components_. The vector space orthogonal to the one spanned by pca.components_ is discarded.
Please note that PCA does not "discard" or "retain" any of your pre-defined features (encoded by the columns you specify). It mixes all of them (by weighted sums) to find orthogonal directions of maximum variance.
If this is not the behaviour you are looking for, then PCA dimensionality reduction is not the way to go. For some simple general feature selection methods, you can take a look at sklearn.feature_selection
The projected features onto principal components will retain the important information (axes with maximum variances) and drop axes with small variances. This behavior is like to compression (Not discard).
And X_proj is the better name of X_new, because it is the projection of X onto principal components
You can reconstruct the X_rec as
X_rec = pca.inverse_transform(X_proj) # X_proj is originally X_new
Here, X_rec is close to X, but the less important information was dropped by PCA. So we can say X_rec is denoised.
In my opinion, I can say the noise is discard.
The answer marked above is incorrect. The sklearn site clearly states that the components_ array is sorted. so it can't be used to identify the important features.
components_ : array, [n_components, n_features]
Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

Extracting PCA components with sklearn

I am using sklearn's PCA for dimensionality reduction on a large set of images. Once the PCA is fitted, I would like to see what the components look like.
One can do so by looking at the components_ attribute. Not realizing that was available, I did something else instead:
each_component = np.eye(total_components)
component_im_array = pca.inverse_transform(each_component)
for i in range(num_components):
component_im = component_im_array[i, :].reshape(height, width)
# do something with component_im
In other words, I create an image in the PCA space that has all features but 1 set to 0. By inversely transforming them, I should then get the image in the original space which, once transformed, can be expressed solely with that PCA component.
The following image shows the results. On the left is the component calculated using my method. On the right is pca.components_[i] directly. Additionally, with my method, most images are very similar (but they are different) while by accessing the components_ the images are very different as I would have expected
Is there a conceptual problem in my method? Clearly the components from pca.components_[i] are correct (or at least more correct) than the ones I'm getting. Thanks!
Components and inverse transform are two different things. The inverse transform maps the components back to the original image space
#Create a PCA model with two principal components
pca = PCA(2)
pca.fit(data)
#Get the components from transforming the original data.
scores = pca.transform(data)
# Reconstruct from the 2 dimensional scores
reconstruct = pca.inverse_transform(scores )
#The residual is the amount not explained by the first two components
residual=data-reconstruct
Thus you are inverse transforming the original data and not the components, and thus they are completely different. You almost never inverse_transform the orginal data. pca.components_ are the actual vectors representing the underlying axis used to project the data to the pca space.
The difference between grabbing the components_ and doing an inverse_transform on the identity matrix is that the latter adds in the empirical mean of each feature. I.e.:
def inverse_transform(self, X):
return np.dot(X, self.components_) + self.mean_
where self.mean_ was estimated from the training set.

Categories

Resources