apply sklearn PCA on movielens dataset

apply sklearn PCA on movielens dataset - python

I have movielens dataset which I want to apply PCA on it, but sklearn PCA function dose not seems to do it correctly.
I have 718*8913 matrix which rows indicate the users and columns indicate movies
here is my python code :
Load movie names and movie ratings
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
ratings.drop(['timestamp'], axis=1, inplace=True)
def replace_name(x):
return movies[movies['movieId']==x].title.values[0]
ratings.movieId = ratings.movieId.map(replace_name)
M = ratings.pivot_table(index=['userId'], columns=['movieId'], values='rating')
df1 = M.replace(np.nan, 0, regex=True)
Standardizing
X_std = StandardScaler().fit_transform(df1)
Apply PCA
pca = PCA()
result = pca.fit_transform(X_std)
print result.shape
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()
I did't set any component number so I expect that PCA return 718*8913 matrix in new dimension but pca result size is 718*718 and pca.explained_variance_ratio_ size is 718, and sum of all members of it is 1, but how this is possible!!!
I have 8913 features and it return only 718 and sum of variance of them is equal to 1 can any one explain what is wrong here ?
my plot picture result:
As you can see in the above picture it just contain 718 component and sum of it is 1 but I have 8913 features where they gone?
Test with smaller example
I even try with scikit learn PCA example which can be found in documentation page of pca Here is the Link I change the example and just increase the number of features
import numpy as np
from sklearn.decomposition import PCA
import pandas as pd
X = np.array([[-1, -1,3,4,-1, -1,3,4], [-2, -1,5,-1, -1,3,4,2], [-3, -2,1,-1, -1,3,4,1],
[1, 1,4,-1, -1,3,4,2], [2, 1,0,-1, -1,3,4,2], [3, 2,10,-1, -1,3,4,10]])
ipca = PCA(n_components = 7)
print (X.shape)
ipca.fit(X)
result = ipca.transform(X)
print (result.shape);
and in this example we have 6 sample and 8 feauters I set the n_components to 7 but the result size is 6*6.
I think when the number of features is bigger than number of samples the maximum number of components scikit learn pca will return is equal to number of samples

See the documentation on PCA.
Because you did not pass an n_components parameter to PCA(), sklearn uses min(n_samples, n_features) as the value of n_components, which is why you get a reduced feature set equal to n_samples.
I believe your variance is equal to 1 because you didn't set the n_components, from the documentation:
If n_components is not set then all components are stored and the sum
of explained variances is equal to 1.0.

Related

Is there a way to compute the explained variance of PCA on a test set?

I want to see how well PCA worked with my data.
I applied PCA on a training set and used the returned pca object to transform on a test set. pca object has a variable pca.explained_variance_ratio_ which tells me the percentage of variance explained by each of the selected components for the training set. After applying the pca transform, I want to see how well this worked on the test set. I tried inverse_transform() that returned what the original values would look like but I have no way to compare how it worked on the train set vs test set.
pca = PCA(0.99)
pca.fit(train_df)
tranformed_test = pca.transform(test_df)
inverse_test = pca.inverse_transform(tranformed_test)
npt.assert_almost_equal(test_arr, inverse_test, decimal=2)
This returns:
Arrays are not almost equal to 2 decimals
Is there something like pca.explained_variance_ratio_ after transform()?

Variance explained for each components
You can compute it manually.
If the components X_i are orthogonal (which is the case in PCA), the explained variance by X_i out of X is: 1 - ||X_i - X||^2 / ||X - X_mean||^2
Hence the following example:
import numpy as np
from sklearn.decomposition import PCA
X_train = np.random.randn(200, 5)
X_test = np.random.randn(100, 5)
model = PCA(n_components=5).fit(X_train)
def explained_variance(X):
result = np.zeros(model.n_components)
for ii in range(model.n_components):
X_trans = model.transform(X)
X_trans_ii = np.zeros_like(X_trans)
X_trans_ii[:, ii] = X_trans[:, ii]
X_approx_ii = model.inverse_transform(X_trans_ii)
result[ii] = 1 - (np.linalg.norm(X_approx_ii - X) /
np.linalg.norm(X - model.mean_)) ** 2
return result
print(model.explained_variance_ratio_)
print(explained_variance(X_train))
print(explained_variance(X_test))
# [0.25335711 0.23100201 0.2195476 0.15717412 0.13891916]
# [0.25335711 0.23100201 0.2195476 0.15717412 0.13891916]
# [0.17851083 0.199134 0.24198887 0.23286815 0.14749816]
Total variance explained
Alternatively, if you only care about the total variance explained, you can use r2_score:
from sklearn.metrics import r2_score
model = PCA(n_components=2).fit(X_train)
print(model.explained_variance_ratio_.sum())
print(r2_score(X_train, model.inverse_transform(model.transform(X_train)),
multioutput='variance_weighted'))
print(r2_score(X_test, model.inverse_transform(model.transform(X_test)),
multioutput='variance_weighted'))
# 0.46445451252373826
# 0.46445451252373815
# 0.4470229486590848

Outlier detection with Local Outlier Factor (LOF)

I am working with healthcare insurance claims data and would like to identify fraudulent claims. Have been reading online to try and find a better method. I came across the following code on scikit-learn.org
Does anyone know how to select the outliers? the code plot them in a graph but I would like to select those outliers if possible.
I have tried appending the y_predictions to the x dataframe but that has not worked.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
np.random.seed(42)
# Generate train data
X = 0.3 * np.random.randn(100, 2)
# Generate some abnormal novel observations
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X + 2, X - 2, X_outliers]
# fit the model
clf = LocalOutlierFactor(n_neighbors=20)
y_pred = clf.fit_predict(X)
y_pred_outliers = y_pred[200:]
Below is the code i tried.
X['outliers'] = y_pred

The first 200 data are inliers while the last 20 are outliers. When you did fit_predict on X, you will get either outlier (-1) or inlier(1) in y_pred. So to get the predicted outliers, you need to get those y_pred = -1 and get the corresponding value in X. Below script will give you the outliers in X.
X_pred_outliers = [each[1] for each in list(zip(y_pred, X.tolist())) if each[0] == -1]
I combine y_pred and X into an array and check if y=-1, if yes then collect X values.
However, there are eight errors on the predictions (8 out of 220). These errors are -1 values in y_pred[:200] and 1 in y_pred[201:220]. Please be aware of the errors as well.

How to rank features correctly from PCA's eigenvector

My goal is to rank features of a supervised machine learning dataset, by contributions to theris Principal component, thanks to this answer.
I set up an experiment in which I construct a dataset contains 3 infomative, 3 redundent and 3 noise features in order. Then find the index of the largest component on each principal axes.
However, I got a realy worse rank by using this method. Dont know what mistakes I have made. Many thanks for helping. Here is my code:
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
# Make a dataset which contains 3 Infomative, redundant, noise features respectively
X, _ = make_classification(n_samples=20, n_features=9, n_informative=3,
n_redundant=3, random_state=0, shuffle=False)
cols = ['I_'+str(i) for i in range(3)]
cols += ['R_'+str(i) for i in range(3)]
cols += ['N_'+str(i) for i in range(3)]
dfX = pd.DataFrame(X, columns=cols)
# Rank each feature by each priciple axis maximum component
model = PCA().fit(dfX)
_ = model.transform(dfX)
n_pcs= model.components_.shape[0]
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]
most_important_names = [dfX.columns[most_important[i]] for i in range(n_pcs)]
rank = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}
rank outputs:
{'PC0': 'R_1',
'PC1': 'I_1',
'PC2': 'N_1',
'PC3': 'N_0',
'PC4': 'N_2',
'PC5': 'I_2',
'PC6': 'R_1',
'PC7': 'R_0',
'PC8': 'R_2'}
I am expecting to see infomative features I_x to be ranked top3.

PCA ranking criteria is the variance of each columns, if you would like to have a ranking, what you can do is to output the VarianceThreshold of each columns. You can do that by this
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold()
selector.fit_transform(dfX)
print(selector.variances_)
# outputs [1.57412087 1.08363799 1.11752334 0.58501874 2.2983772 0.2857617
# 1.09782539 0.98715471 0.93262548]
Which you can clearly see that the first 3 columns (I0, I1,I2) has the greatest variance, and thus makes the best candidate for using PCA with.

Feature/Variable importance after a PCA analysis

I have performed a PCA analysis over my original dataset and from the compressed dataset transformed by the PCA I have also selected the number of PC I want to keep (they explain almost the 94% of the variance). Now I am struggling with the identification of the original features that are important in the reduced dataset.
How do I find out which feature is important and which is not among the remaining Principal Components after the dimension reduction?
Here is my code:
from sklearn.decomposition import PCA
pca = PCA(n_components=8)
pca.fit(scaledDataset)
projection = pca.transform(scaledDataset)
Furthermore, I tried also to perform a clustering algorithm on the reduced dataset but surprisingly for me, the score is lower than on the original dataset. How is it possible?

First of all, I assume that you call features the variables and not the samples/observations. In this case, you could do something like the following by creating a biplot function that shows everything in one plot. In this example, I am using the iris data.
Before the example, please note that the basic idea when using PCA as a tool for feature selection is to select variables according to the magnitude (from largest to smallest in absolute values) of their coefficients (loadings). See my last paragraph after the plot for more details.
Overview:
PART1: I explain how to check the importance of the features and how to plot a biplot.
PART2: I explain how to check the importance of the features and how to save them into a pandas dataframe using the feature names.
PART 1:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
import pandas as pd
from sklearn.preprocessing import StandardScaler
iris = datasets.load_iris()
X = iris.data
y = iris.target
#In general a good idea is to scale the data
scaler = StandardScaler()
scaler.fit(X)
X=scaler.transform(X)
pca = PCA()
x_new = pca.fit_transform(X)
def myplot(score,coeff,labels=None):
xs = score[:,0]
ys = score[:,1]
n = coeff.shape[0]
scalex = 1.0/(xs.max() - xs.min())
scaley = 1.0/(ys.max() - ys.min())
plt.scatter(xs * scalex,ys * scaley, c = y)
for i in range(n):
plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5)
if labels is None:
plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center')
else:
plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')
plt.xlim(-1,1)
plt.ylim(-1,1)
plt.xlabel("PC{}".format(1))
plt.ylabel("PC{}".format(2))
plt.grid()
#Call the function. Use only the 2 PCs.
myplot(x_new[:,0:2],np.transpose(pca.components_[0:2, :]))
plt.show()
Visualize what's going on using the biplot
Now, the importance of each feature is reflected by the magnitude of the corresponding values in the eigenvectors (higher magnitude - higher importance)
Let's see first what amount of variance does each PC explain.
pca.explained_variance_ratio_
[0.72770452, 0.23030523, 0.03683832, 0.00515193]
PC1 explains 72% and PC2 23%. Together, if we keep PC1 and PC2 only, they explain 95%.
Now, let's find the most important features.
print(abs( pca.components_ ))
[[0.52237162 0.26335492 0.58125401 0.56561105]
[0.37231836 0.92555649 0.02109478 0.06541577]
[0.72101681 0.24203288 0.14089226 0.6338014 ]
[0.26199559 0.12413481 0.80115427 0.52354627]]
Here, pca.components_ has shape [n_components, n_features]. Thus, by looking at the PC1 (First Principal Component) which is the first row: [0.52237162 0.26335492 0.58125401 0.56561105]] we can conclude that feature 1, 3 and 4 (or Var 1, 3 and 4 in the biplot) are the most important. This is also clearly visible from the biplot (that's why we often use this plot to summarize the information in a visual way).
To sum up, look at the absolute values of the Eigenvectors' components corresponding to the k largest Eigenvalues. In sklearn the components are sorted by explained_variance_. The larger they are these absolute values, the more a specific feature contributes to that principal component.
PART 2:
The important features are the ones that influence more the components and thus, have a large absolute value/score on the component.
To get the most important features on the PCs with names and save them into a pandas dataframe use this:
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
np.random.seed(0)
# 10 samples with 5 features
train_features = np.random.rand(10,5)
model = PCA(n_components=2).fit(train_features)
X_pc = model.transform(train_features)
# number of components
n_pcs= model.components_.shape[0]
# get the index of the most important feature on EACH component
# LIST COMPREHENSION HERE
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]
initial_feature_names = ['a','b','c','d','e']
# get the names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]
# LIST COMPREHENSION HERE AGAIN
dic = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}
# build the dataframe
df = pd.DataFrame(dic.items())
This prints:
0 1
0 PC0 e
1 PC1 d
So on the PC1 the feature named e is the most important and on PC2 the d.
Nice article as well here: https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e?source=friends_link&sk=65bf5440e444c24aff192fedf9f8b64f

the pca library contains this functionality.
pip install pca
A demonstration to extract the feature importance is as following:
# Import libraries
import numpy as np
import pandas as pd
from pca import pca
# Lets create a dataset with features that have decreasing variance.
# We want to extract feature f1 as most important, followed by f2 etc
f1=np.random.randint(0,100,250)
f2=np.random.randint(0,50,250)
f3=np.random.randint(0,25,250)
f4=np.random.randint(0,10,250)
f5=np.random.randint(0,5,250)
f6=np.random.randint(0,4,250)
f7=np.random.randint(0,3,250)
f8=np.random.randint(0,2,250)
f9=np.random.randint(0,1,250)
# Combine into dataframe
X = np.c_[f1,f2,f3,f4,f5,f6,f7,f8,f9]
X = pd.DataFrame(data=X, columns=['f1','f2','f3','f4','f5','f6','f7','f8','f9'])
# Initialize
model = pca()
# Fit transform
out = model.fit_transform(X)
# Print the top features. The results show that f1 is best, followed by f2 etc
print(out['topfeat'])
# PC feature
# 0 PC1 f1
# 1 PC2 f2
# 2 PC3 f3
# 3 PC4 f4
# 4 PC5 f5
# 5 PC6 f6
# 6 PC7 f7
# 7 PC8 f8
# 8 PC9 f9
Plot the explained variance
model.plot()
Make the biplot. It can be nicely seen that the first feature with most variance (f1), is almost horizontal in the plot, whereas the second most variance (f2) is almost vertical. This is expected because most of the variance is in f1, followed by f2 etc.
ax = model.biplot(n_feat=10, legend=False)
Biplot in 3d. Here we see the nice addition of the expected f3 in the plot in the z-direction.
ax = model.biplot3d(n_feat=10, legend=False)

# original_num_df the original numeric dataframe
# pca is the model
def create_importance_dataframe(pca, original_num_df):
# Change pcs components ndarray to a dataframe
importance_df = pd.DataFrame(pca.components_)
# Assign columns
importance_df.columns = original_num_df.columns
# Change to absolute values
importance_df =importance_df.apply(np.abs)
# Transpose
importance_df=importance_df.transpose()
# Change column names again
## First get number of pcs
num_pcs = importance_df.shape[1]
## Generate the new column names
new_columns = [f'PC{i}' for i in range(1, num_pcs + 1)]
## Now rename
importance_df.columns =new_columns
# Return importance df
return importance_df
# Call function to create importance df
importance_df =create_importance_dataframe(pca, original_num_df)
# Show first few rows
display(importance_df.head())
# Sort depending on PC of interest
## PC1 top 10 important features
pc1_top_10_features = importance_df['PC1'].sort_values(ascending = False)[:10]
print(), print(f'PC1 top 10 feautres are \n')
display(pc1_top_10_features )
## PC2 top 10 important features
pc2_top_10_features = importance_df['PC2'].sort_values(ascending = False)[:10]
print(), print(f'PC2 top 10 feautres are \n')
display(pc2_top_10_features )

Select 5 data points closest to SVM hyperlane

I have written Python code using Sklearn to cluster my dataset:
af = AffinityPropagation().fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
n_clusters_= len(cluster_centers_indices)
I am exploring the use of query-by-clustering and so form an inital training dataset by:
td_title =[]
td_abstract = []
td_y= []
for each in centers:
td_title.append(title[each])
td_abstract.append(abstract[each])
td_y.append(y[each])
I then train my model (an SVM) on it by:
clf = svm.SVC()
clf.fit(X, data_y)
I wish to write a function that given the centres, the model, the X values and the Y values will append the 5 data points which the model is most unsure about, ie. the data points closest to the hyperplane. How can I do this?

The first steps of your process aren't entirely clear to me, but here's a suggestion for "Select(ing) 5 data points closest to SVM hyperplane". The scikit documentation defines decision_function as the distance of the samples to the separating hyperplane. The method returns an array which can be sorted with argsort to find the "top/bottom N samples".
Following this basic scikit example, define a function closestN to return the samples closest to the hyperplane.
import numpy as np
def closestN(X_array, n):
# array of sample distances to the hyperplane
dists = clf.decision_function(X_array)
# absolute distance to hyperplane
absdists = np.abs(dists)
return absdists.argsort()[:n]
Add these two lines to the scikit example to see the function implemented:
closest_samples = closestN(X, 5)
plt.scatter(X[closest_samples][:, 0], X[closest_samples][:, 1], color='yellow')
Original
Closest Samples Highlighted
If you need to append the samples to some list, you could somelist.append(closestN(X, 5)). If you needed the sample values you could do something like somelist.append(X[closestN(X, 5)]).
closestN(X, 5)
array([ 1, 20, 14, 31, 24])
X[closestN(X, 5)]
array([[-1.02126202, 0.2408932 ],
[ 0.95144703, 0.57998206],
[-0.46722079, -0.53064123],
[ 1.18685372, 0.2737174 ],
[ 0.38610215, 1.78725972]])

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

apply sklearn PCA on movielens dataset - python

Related

Is there a way to compute the explained variance of PCA on a test set?

Outlier detection with Local Outlier Factor (LOF)

How to rank features correctly from PCA's eigenvector

Feature/Variable importance after a PCA analysis

Select 5 data points closest to SVM hyperlane

Categories

Resources