I am trying to do PCA from sklearn with n_components=5. I apply the dimensionality reduction to my data using fit_transform(data).
Initially I tried the classical matrix multiplication between the pca.components_ values and my x_features data, but the results are different. So either I am doing the multiplication incorrectly or I did not understand how fit_transform works.
Below is a mock-up to compare classic matrix multiplication and fit_transform:
import numpy as np
from sklearn import decomposition
np.random.seed(0)
my_matrix = np.random.randn(100, 5)
mdl = decomposition.PCA(n_components=5)
mdl_FitTrans = mdl.fit_transform(my_matrix)
pca_components = mdl.components_
mdl_FitTrans_manual = np.dot(pca_components, my_matrix.transpose())
mdl_FitTrans_manualT = mdl_FitTrans_manual.transpose()
I am expecting mdl_FitTrans == mdl_FitTrans_manualT, but the result is False.
Check out how the transform() method is implemented in sklearn: https://github.com/scikit-learn/scikit-learn/blob/a5ab948/sklearn/decomposition/base.py#L101
According to it, the manual reduction is done as follows:
import numpy as np
from sklearn import decomposition
np.random.seed(0)
data = np.random.randn(100, 100)
mdl = decomposition.PCA(n_components=5)
mdl_fit = mdl.fit(data)
data_transformed = mdl_fit.transform(data)
data_transformed_manual = np.dot(data - mdl_fit.mean_, mdl.components_.T)
np.all(data_transformed == data_transformed_manual)
True
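For completeness, here is the same fix applied to the mock-up from the question (a quick sketch; np.allclose is used because the two code paths can differ by floating-point rounding):
import numpy as np
from sklearn import decomposition
np.random.seed(0)
my_matrix = np.random.randn(100, 5)
mdl = decomposition.PCA(n_components=5)
mdl_FitTrans = mdl.fit_transform(my_matrix)
# Center with the fitted mean, then project onto the components
manual = np.dot(my_matrix - mdl.mean_, mdl.components_.T)
print(np.allclose(mdl_FitTrans, manual))  # True
The original manual multiplication failed because it skipped the mean centering that transform() performs internally.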
Python t-SNE implementation from this resource: https://lvdmaaten.github.io/tsne/
Btw, I'm a beginner to scRNA-seq.
What I am trying to do: use a scRNA-seq data set and run t-SNE on it, but using previously calculated principal components (I have PCA.score and PCA.load files).
Q1: I should be able to use my selected, already-calculated PCs in the t-SNE, but which file do I use when running Y = tsne.tsne(X): pca.score or pca.load?
Q2: I've tried removing/replacing parts of the PCA-calculating code to remove the PCA preprocessing, but it always gives an error. What should I change so it properly uses my existing PCA data and doesn't calculate PCA from it again?
The PCA-processing code, in its raw form, is this:
def pca(X=np.array([]), no_dims=50):
    """
    Runs PCA on the NxD array X in order to reduce its dimensionality to
    no_dims dimensions.
    """
    print("Preprocessing the data using PCA...")
    (n, d) = X.shape
    X = X - np.tile(np.mean(X, 0), (n, 1))    # center each column
    (l, M) = np.linalg.eig(np.dot(X.T, X))    # eigendecomposition of X^T X
    Y = np.dot(X, M[:, 0:no_dims])            # project onto the top eigenvectors
    return Y
You should use the PCA score.
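The scores are the per-sample coordinates in PC space, while the loadings are the per-feature weights of each component; t-SNE needs one row per cell, so it is the scores you want. A minimal sketch with a made-up cells-by-genes matrix (the names and sizes here are hypothetical):
import numpy as np
from sklearn.decomposition import PCA
cells_by_genes = np.random.randn(500, 2000)   # hypothetical expression matrix
pca_mdl = PCA(n_components=20).fit(cells_by_genes)
scores = pca_mdl.transform(cells_by_genes)    # (500, 20): one row per cell, the t-SNE input
loadings = pca_mdl.components_                # (20, 2000): gene weights per component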
As for not running pca(), you can just comment out this line:
X = pca(X, initial_dims).real
What I did is add a parameter do_pca and edit the function like so:
def tsne(X=np.array([]), no_dims=2, initial_dims=50, perplexity=30.0, do_pca=True):
    """
    Runs t-SNE on the dataset in the NxD array X to reduce its
    dimensionality to no_dims dimensions. The syntax of the function is
    `Y = tsne.tsne(X, no_dims, perplexity)`, where X is an NxD NumPy array.
    """

    # Check inputs
    if isinstance(no_dims, float):
        print("Error: array X should have type float.")
        return -1
    if round(no_dims) != no_dims:
        print("Error: number of dimensions should be an integer.")
        return -1

    # Initialize variables
    if do_pca:
        X = pca(X, initial_dims).real
    (n, d) = X.shape
    max_iter = 50
    [.. rest stays the same ..]
Using an example dataset, without commenting out that line:
import numpy as np
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import sys
import os
from tsne import *
X,y = load_digits(return_X_y=True,n_class=3)
If we run the default:
res = tsne(X=X,initial_dims=20,do_pca=True)
plt.scatter(res[:,0],res[:,1],c=y)
If we pass it PCA output instead:
pc = pca(X)[:,:20]
res = tsne(X=pc,initial_dims=20,do_pca=False)
plt.scatter(res[:,0],res[:,1],c=y)
The following code snippet illustrates the issue:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
(nrows, ncolumns) = (1912392, 131)
X = np.random.random((nrows, ncolumns))
pca = PCA(n_components=28, random_state=0)
transformed_X1 = pca.fit_transform(X)
pca1 = pca.fit(X)
transformed_X2 = pca1.transform(X)
print((transformed_X1 != transformed_X2).sum()) # Gives output as 53546976
scalar = StandardScaler()
scaled_X1 = scalar.fit_transform(X)
scalar2 = scalar.fit(X)
scaled_X2 = scalar2.transform(X)
(scaled_X1 != scaled_X2).sum() # Gives output as 0
Can someone explain as to why the first output is not zero and the second output is?
Using this works:
pca = PCA(n_components=28, svd_solver = 'full')
transformed_X1 = pca.fit_transform(X)
pca1 = pca.fit(X)
transformed_X2 = pca1.transform(X)
print(np.allclose(transformed_X1, transformed_X2))
True
Apparently svd_solver='randomized' (which is what 'auto' resolves to for inputs this large) has enough process difference between .fit(X).transform(X) and fit_transform(X) to give different results even with the same seed. Also remember that floating-point errors make == and != unreliable judges of equality between the results of different processes, so use np.allclose().
It seems like StandardScaler.fit_transform() just directly uses .fit(X).transform(X) under the hood, so there were no floating point errors there to trip you up.
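A smaller reproduction of both behaviours (a sketch; the array is shrunk so it runs quickly, and the exact numbers will vary):
import numpy as np
from sklearn.decomposition import PCA
X = np.random.random((2000, 131))
# randomized solver: fit_transform and fit().transform() take different code paths
pca_r = PCA(n_components=28, svd_solver='randomized', random_state=0)
a = pca_r.fit_transform(X)
b = pca_r.fit(X).transform(X)
print((a == b).all())        # False: the two paths round differently
print(np.abs(a - b).max())   # inspect how large the discrepancy actually is
# full solver: both paths agree to within floating-point tolerance
pca_f = PCA(n_components=28, svd_solver='full')
print(np.allclose(pca_f.fit_transform(X), pca_f.fit(X).transform(X)))  # True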
I wanted to create my own transformer using scikit-learn's FunctionTransformer and followed their example as a dry run. It worked, but then I wanted to take the inverse of that transformation just to see the end result. However, when I tried inverse_transform, it returned the same thing as the transformation. How do I get the original values back? I ask because I plan to use this transformation on a target variable and then make predictions; those predictions will need to be inverse-transformed afterwards.
As a sidebar, should I fit on y_train and transform y_test, or can I transform y all at once?
My transformer:
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
import random

randomlist = []
for i in range(0, 100):
    n = random.randint(1, 100)
    randomlist.append(n)
y = pd.Series(randomlist)
y_train = y[:80]
y_test = y[80:]

target_trans = FunctionTransformer(np.log, validate=True, check_inverse=True)
logy_train = target_trans.fit_transform(y_train.values.reshape(-1, 1))
logy_test = target_trans.transform(y_test.values.reshape(-1, 1))
target_trans.inverse_transform(y_train.values.reshape(-1, 1))
Within FunctionTransformer() you not only need to set check_inverse=True but also supply the actual inverse function itself via inverse_func.
So for the above,
target_trans = FunctionTransformer(np.log, inverse_func=np.exp,
                                   validate=True, check_inverse=True)
which yields the desired result.
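Note also that inverse_transform should be applied to the transformed values (e.g. logy_train), not to the raw y_train as in the question. A quick round-trip check with the corrected transformer:
import numpy as np
from sklearn.preprocessing import FunctionTransformer
y = np.array([1.0, 10.0, 100.0]).reshape(-1, 1)
target_trans = FunctionTransformer(np.log, inverse_func=np.exp,
                                   validate=True, check_inverse=True)
logy = target_trans.fit_transform(y)
print(target_trans.inverse_transform(logy))  # recovers [1, 10, 100]
As for the sidebar question: this transformer is stateless (fit learns nothing for np.log), so fitting on y_train versus transforming all of y at once makes no difference here.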
How do I print the confusion matrix for a logistic regression while varying the threshold over [0.5, 0.6, 0.9], once with 0.5, once with 0.6, and so on?
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

X = [[0.7, 0.2], [0.9, 0.4]]
y = [1, -1]
model = LogisticRegression()
model = model.fit(X, y)
threshold = [0.5, 0.6, 0.9]

# y_pred per threshold is the missing piece; given predictions, the counts come from:
CM = confusion_matrix(y_true, y_pred)
TN = CM[0][0]
FN = CM[1][0]
TP = CM[1][1]
FP = CM[0][1]
Let's try this!
for i in threshold:
    # Column 1 holds P(class = 1); map the boolean to the {-1, 1} labels used in y
    y_predicted = np.where(model.predict_proba(X)[:, 1] > i, 1, -1)
    print(confusion_matrix(y, y_predicted))
predict_proba() returns a NumPy array with two columns, ordered by model.classes_. The first column is the probability of the first class (here -1) and the second column is the probability of the second class (here 1). That is why we take [:, 1] from the output of predict_proba(): to get the probabilities of target=1.
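To make the column layout concrete (reusing model and X from the question):
proba = model.predict_proba(X)   # shape (n_samples, 2)
print(model.classes_)            # [-1  1]: column 0 is class -1, column 1 is class 1
print(proba[:, 1])               # probability of class 1 for each sample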
I think an easy approach in pseudocode (based a bit on Python) would be:
1 - Predict on a set of known values X: y_prob = model.predict_proba(X), which gives the probability for each input in X.
2 - Then for each threshold compute the output, i.e. y_pred = 1 if y_prob > threshold else 0.
3 - Now compute the confusion matrix for each resulting vector.
If you need a deeper explanation of any point, let me know!
def predict_y_from_threshold(model, X, threshold):
    # Map each class-1 probability to 1 if it exceeds the threshold, else 0
    return np.array(list(map(lambda p: 1 if p > threshold else 0,
                             model.predict_proba(X)[:, 1])))
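Hypothetical usage on a synthetic 0/1-labelled problem (the helper above assumes 0/1 labels; make_classification and the names here are illustrative):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X01, y01 = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression().fit(X01, y01)
for t in [0.5, 0.6, 0.9]:
    print(t)
    print(confusion_matrix(y01, predict_y_from_threshold(clf, X01, t)))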
I have been playing around with sklearn PCA and it is behaving oddly.
from sklearn.decomposition import PCA
import numpy as np
identity = np.identity(10)
pca = PCA(n_components=10)
augmented_identity = pca.fit_transform(identity)
np.linalg.norm(identity - augmented_identity)
4.5997749080745738
Note that I set the number of dimensions to be 10. Shouldn't the norm be 0?
Any insight into why it is not would be appreciated.
Although PCA computes the orthogonal components from the covariance matrix, the input to PCA in sklearn is the data matrix itself, not the covariance/correlation matrix. fit_transform centers the data and rotates it onto those components, so the output generally differs from the input even when all components are kept, and the norm above has no reason to be zero. For data whose covariance is (close to) the identity, that rotation leaves the covariance essentially unchanged:
import numpy as np
from sklearn.decomposition import PCA
# gaussian random variable, 10-dimension, identity cov mat
X = np.random.randn(100000, 10)
pca = PCA(n_components=10)
X_transformed = pca.fit_transform(X)
np.linalg.norm(np.cov(X.T) - np.cov(X_transformed.T))
Out[219]: 0.044691263454134933
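In other words, the transform loses no information even though it changes the values; inverse_transform undoes the rotation and re-adds the mean, as a quick check on the identity example shows:
import numpy as np
from sklearn.decomposition import PCA
identity = np.identity(10)
pca = PCA(n_components=10)
augmented_identity = pca.fit_transform(identity)
# Undo the rotation and re-add the mean: the original matrix comes back
reconstructed = pca.inverse_transform(augmented_identity)
print(np.linalg.norm(identity - reconstructed))  # ~0 (floating-point precision)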