When do I apply PCA: before or after preprocessing (i.e. removing null values, encoding, etc.) the entire dataset? After I've completely preprocessed my dataset,
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
# fit the scaler on the training data only, then reuse it for the test data
x_train[:,0:14] = sc.fit_transform(x_train[:,0:14])
x_test[:,0:14] = sc.transform(x_test[:,0:14])
I'm left with a matrix of shape 113126 x 91.
Applying PCA to scaled data is better because you avoid the large-vs.-tiny problem between features.
The large-vs.-tiny problem means that the variances of the features differ wildly. For example, if one feature in a dataset has the range (-5, +5) and another lies in the range (-10000, +10000), the feature with the larger values can dominate the decomposition.
PCA is a dimensionality reduction technique: it transforms a large collection of variables into a smaller one that still contains most of the information in the original set. To reduce the dimensions, PCA takes the eigenvectors with the highest eigenvalues (of the data's covariance matrix) and maps your data points onto those vectors; hence the dimensionality is reduced.
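To make the eigenvector step concrete, here is a minimal NumPy sketch (mine, not from the original answer) of what PCA does under the hood: eigendecompose the covariance matrix and project the data onto the top-k eigenvectors.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 samples, 5 features

Xc = X - X.mean(axis=0)                  # center each feature
cov = np.cov(Xc, rowvar=False)           # 5 x 5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues

order = np.argsort(eigvals)[::-1]        # sort directions by variance, descending
top2 = eigvecs[:, order[:2]]             # keep the two strongest eigenvectors
X_reduced = Xc @ top2                    # project: (100, 5) -> (100, 2)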
Let me give you an example of how applying PCA after scaling helps.
First, the imports we will use for this example:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize
import matplotlib.pyplot as plt

# For reproducibility
np.random.seed(123)
Now let's make a dummy dataset on which we can see the effect of applying PCA with and without scaling.
rows = 100
features = 7
X = np.random.normal(size=[rows, features])
# append an 8th, binary feature on a larger scale (0 or 3)
X = np.append(X, 3*np.random.choice(2, size=[rows, 1]), axis=1)
The dummy dataset in X has 100 examples and 8 features (the 7 Gaussian ones plus the appended binary one). Now let's apply PCA to it without scaling and plot the data.
pca = PCA(2)                        # keep the two strongest components
low_x = pca.fit_transform(X)
plt.scatter(low_x[:,0], low_x[:,1])
Here is a plot of the data after reducing the number of features from 8 to 2 without scaling the dataset. You can see that the data points are bunched together and messy: one component has a much higher variance than the other, which will affect further processing or modeling.
Now let's normalize the data first and then apply PCA:
X_normalized = normalize(X)          # scale each sample to unit norm
pca = PCA(2)
low_x = pca.fit_transform(X_normalized)
plt.scatter(low_x[:,0], low_x[:,1])
In the resulting plot the data is cleanly scattered, and there is no big difference between the variances of the two components.
Hence, it is generally better to apply normalization (or another form of feature scaling) before applying PCA to a dataset, for example via a pipeline as sketched below.
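If you want the scaling and the PCA wired together so that exactly the same transform is fitted on the training data and reused on the test data, a sklearn Pipeline sketch could look like this (X_train and X_test are assumed to exist; n_components=2 is just an illustration):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipe = Pipeline([
    ("scale", StandardScaler()),    # standardize each feature first
    ("pca", PCA(n_components=2)),   # then reduce the scaled features
])

X_train_2d = pipe.fit_transform(X_train)  # fit scaler and PCA on train only
X_test_2d = pipe.transform(X_test)        # reuse the fitted transforms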
But always remember one thing: data science is largely trial and error. Try this; if it doesn't help your results, you can always try a different approach.
Can the dimension of the data be reduced to only one principal component?
I tried it on the iris dataset:
from sklearn.decomposition import PCA
import pandas as pd
import matplotlib.pyplot as plt
pca = PCA(n_components=1)
pca_X = pca.fit_transform(X) #X = standardized iris data
pca_df = pd.DataFrame(pca_X, columns=["PCA1"])
plt.plot(pca_df["PCA1"], "o")
We can see three different clusters. So can the dimension be reduced to 1?
You can choose to reduce the dimensions to 1 using PCA; the only thing it promises is that the resulting principal component points in the direction of highest variance in the data.
If you are reducing the dimensions in order to improve classification, you can instead use Linear Discriminant Analysis (LDA), which gives you the direction of maximum separation between the classes.
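A minimal sketch of what that would look like on the same standardized iris data (assuming X and the class labels y are available; note that LDA is supervised, so it needs y):
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=1)
lda_X = lda.fit_transform(X, y)  # supervised: uses the class labels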
Yes, the dimension can be reduced to 1, which is exactly what you have done in your example.
The y-axis in your plot shows the coordinate of each observation with respect to the first principal component.
The three clusters most likely correspond to the three species in the iris dataset and have nothing to do with the number of components.
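You can verify this by coloring the 1-D projection with the species labels (a sketch, assuming y holds the iris targets):
plt.scatter(range(len(pca_df)), pca_df["PCA1"], c=y)  # one color per species
plt.ylabel("PCA1")
plt.show()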
I need to do some PCA using sklearn and I want to make sure I do it the right way. Here is my code:
from sklearn.decomposition import PCA

pca = PCA(n_components=5)
pca_result = pca.fit_transform(data)
# note: these are singular values; the eigenvalues of the covariance
# matrix are in pca.explained_variance_
eigenvalues = pca.singular_values_
print(eigenvalues)
x = pca_result[:,0]
y = pca_result[:,1]
The data looks like this:
[[ -6.4186, -14.3534, 18.1296, -2.8110, 14.0298],
[ -7.1220, -17.1501, 21.2807, -3.5025, 16.4489],
[ -8.4652, -18.9316, 25.0303, -4.1773, 18.5066],
...,
[ -4.7054, 6.1389, 3.5146, -0.1036, -0.7332],
[ -5.8533, 9.9087, 4.1178, -0.5211, -2.2415],
[ -6.2969, 13.8951, 3.4365, -0.9207, -4.2024]]
These are the printed values: [1005.2761, 853.5491, 65.058365, 49.994457, 10.277865] (strictly, singular values rather than eigenvalues). I am not totally sure about the last two lines. I want to plot the data projected onto the 2-D subspace that accounts for most of the variation in the data (basically, make a 2-D plot of the 5-D data, since it seems to live on a 2-D manifold). Am I doing it right? Thank you!
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components.
Such dimensionality reduction can be a very useful step for visualising and processing high-dimensional datasets, while still retaining as much of the variance in the dataset as possible. For example, selecting L = 2 and keeping only the first two principal components finds the two-dimensional plane through the high-dimensional dataset in which the data is most spread out, so if the data contains clusters these too may be most spread out, and therefore most visible to be plotted out in a two-dimensional diagram; whereas if two directions through the data (or two of the original variables) are chosen at random, the clusters may be much less spread apart from each other, and may in fact be much more likely to substantially overlay each other, making them indistinguishable.
https://en.wikipedia.org/wiki/Principal_component_analysis
So you need to run:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(data)
x = pca_result[:,0]
y = pca_result[:,1]
Then you have a two-dimensional representation of your data.
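To check that those two components really do capture most of the variation, and to draw the 2-D plot, you can add something like this (a sketch using matplotlib):
import matplotlib.pyplot as plt

print(pca.explained_variance_ratio_)  # fraction of total variance per component

plt.scatter(x, y)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()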
I am using scikit-learn's FeatureAgglomeration to run a hierarchical clustering procedure on the features rather than on the observations.
This is my code:
from sklearn import cluster
import pandas as pd
#load the data
df = pd.read_csv('C:/Documents/data.csv')
agglo = cluster.FeatureAgglomeration(n_clusters=5)
agglo.fit(df)
df_reduced = agglo.transform(df)
My original df had the shape (990, 15), after using feature agglomeration, df_reduced now has (990, 5).
How do now find out how the original 15 features have been clustered together? In other words, what original features from df make up each of the 5 new features in df_reduced?
How the features within each cluster are combined during transform is determined by the way you performed the hierarchical clustering: the reduced feature set simply consists of the n_clusters cluster centers (which are n_samples-dimensional vectors). For certain applications you might consider computing the centers manually using a different definition of a cluster center (e.g. the median instead of the mean, to limit the influence of outliers).
import numpy as np

n_clusters = 5
n_features = 15
feature_identifier = np.arange(n_features)
feature_groups = [feature_identifier[agglo.labels_ == i] for i in range(n_clusters)]
new_features = [df.loc[:, df.columns[group]].mean(axis=1) for group in feature_groups]
Don't forget to standardize the features beforehand (for example with sklearn's StandardScaler). Otherwise you are grouping the scales of the quantities rather than clustering similar behavior.
Hope that helps!
Haven't tested the code. Let me know if there are problems.
After fitting the clusterer, agglo.labels_ tells you, for each feature of the original dataset, which cluster of the reduced dataset it belongs to.
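For example, to print which original columns ended up in each of the 5 clusters (a sketch against the df from the question):
for i in range(agglo.n_clusters):
    members = df.columns[agglo.labels_ == i]
    print(f"cluster {i}: {list(members)}")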
I have been trying to do some dimensionality reduction using PCA. I currently have an image of size (100, 100) and I am using a filterbank of 140 Gabor filters, where each filter gives me a response that is again a (100, 100) image. Now, I want to do feature selection, keeping only non-redundant features, and I read that PCA might be a good way to do it.
So I proceeded to create a data matrix with 10000 rows and 140 columns, where each row contains the responses of the 140 Gabor filters for that pixel. Now, as I understand it, I can decompose this matrix using PCA:
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
pca.fit(Q)  # Q is my 10000 x 140 matrix
However, now I am confused as to how I can figure out which of these 140 feature vectors to keep from here. I am guessing it should give me 3 of these 140 vectors (corresponding to the Gabor filters which contain the most information about the image) but I have no idea how to proceed from here.
PCA will give you a linear combination of the features, not a selection of features. It gives you the linear combination that is best for reconstruction in the L2 sense, i.e. the one that captures the most variance.
What is your goal? If you do this on a single image, any kind of selection will give you features that best discriminate some parts of the image against other parts of the same image.
Also: Gabor filters are a sparse basis for natural images. I would not expect anything interesting to happen unless you have very specific images.
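To see the "linear combination" point concretely: each of the 3 new features is a weighted sum of all 140 original columns, with the weights stored in pca.components_. A sketch against the fitted pca and Q from the question:
import numpy as np

Q_reduced = pca.transform(Q)                       # shape (10000, 3)
# the first new feature is a weighted sum over all 140 columns:
first_manual = (Q - pca.mean_) @ pca.components_[0]
print(np.allclose(first_manual, Q_reduced[:, 0]))  # True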
I am trying to run a PCA on a matrix of dimensions m x n where m is the number of features and n the number of samples.
Suppose I want to preserve the nf features with the maximum variance. With scikit-learn I am able to do it in this way:
from sklearn.decomposition import PCA
nf = 100
pca = PCA(n_components=nf)
# X is the matrix transposed (n samples on the rows, m features on the columns)
pca.fit(X)
X_new = pca.transform(X)
Now, I get a new matrix X_new that has a shape of n x nf. Is it possible to know which features have been discarded or the retained ones?
Thanks
The features that your PCA object has determined during fitting are in pca.components_. The vector space orthogonal to the one spanned by pca.components_ is discarded.
Please note that PCA does not "discard" or "retain" any of your pre-defined features (encoded by the columns you specify). It mixes all of them (by weighted sums) to find orthogonal directions of maximum variance.
If this is not the behaviour you are looking for, then PCA dimensionality reduction is not the way to go. For some simple, general feature selection methods, take a look at sklearn.feature_selection; one option is sketched below.
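As a quick illustration of actual feature selection (keeping or dropping original columns instead of mixing them), here is a sketch with VarianceThreshold; the threshold value is purely illustrative:
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.1)   # drop near-constant features
X_sel = selector.fit_transform(X)
kept = selector.get_support(indices=True)     # indices of the retained columns
print(kept)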
The features projected onto the principal components retain the important information (the axes with maximum variance) and drop the axes with small variance. This behavior is akin to compression, not discarding.
X_proj would be a better name than X_new, because it is the projection of X onto the principal components.
You can then reconstruct an approximation X_rec as
X_rec = pca.inverse_transform(X_proj) # X_proj is originally X_new
Here, X_rec is close to X, but the less important information was dropped by PCA, so X_rec can be considered denoised. In that sense, it is the noise that gets discarded.
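You can quantify how much information the projection kept by comparing X with its reconstruction (a short sketch):
import numpy as np

rel_error = np.linalg.norm(X - X_rec) / np.linalg.norm(X)
print(rel_error)  # small value -> little variance (noise) was dropped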
The answer marked above is incorrect: the sklearn documentation clearly states that the components_ array is sorted, so it cannot be used to identify the important features.
components_ : array, [n_components, n_features]
Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html