sklearn PCA not working - python

I have been playing around with sklearn PCA and it is behaving oddly.
from sklearn.decomposition import PCA
import numpy as np
identity = np.identity(10)
pca = PCA(n_components=10)
augmented_identity = pca.fit_transform(identity)
np.linalg.norm(identity - augmented_identity)
4.5997749080745738
Note that I set the number of dimensions to be 10. Shouldn't the norm be 0?
Any insight into why it is not would be appreciated.

Although PCA computes its orthogonal components from the covariance matrix, the input to PCA in sklearn is the data matrix itself, not a covariance/correlation matrix. Also, fit_transform centers the data and rotates it onto the principal axes, so the output generally differs from the input even when n_components equals the full dimensionality; only a rotation-invariant quantity such as an identity covariance survives unchanged:
import numpy as np
from sklearn.decomposition import PCA
# Gaussian random sample: 10-dimensional, identity covariance matrix
X = np.random.randn(100000, 10)
pca = PCA(n_components=10)
X_transformed = pca.fit_transform(X)
np.linalg.norm(np.cov(X.T) - np.cov(X_transformed.T))
0.044691263454134933
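If the goal is to recover the input, inverse_transform undoes both the centering and the rotation, so with n_components equal to the full dimensionality the reconstruction is exact up to floating point. A minimal sketch reusing the snippet from the question:
import numpy as np
from sklearn.decomposition import PCA
identity = np.identity(10)
pca = PCA(n_components=10)
transformed = pca.fit_transform(identity)
# fit_transform centers the data and rotates it onto the principal axes,
# so 'transformed' differs from 'identity'; inverse_transform undoes both.
reconstructed = pca.inverse_transform(transformed)
print(np.linalg.norm(identity - reconstructed))  # ~0, up to floating point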

Related

DBSCAN fit_predict on precomputed metrics outputs strange clusters

I am trying to get some practice with ML. Specifically, I'm attempting to apply DBSCAN to a precomputed distance matrix (just to check how this works). Yes, I know I could use the Euclidean metric directly, but I wanted to test the precomputed path.
I am unsure why the labels all have the same value for a data set of random points in 3 different regions; I expected DBSCAN to separate them. Note: even if I use non-overlapping ranges for data1/2/3 I still get a single cluster output.
Here is the code:
from sklearn.cluster import DBSCAN
from scipy.spatial.distance import pdist, squareform
import random
import numpy as np
import matplotlib.pyplot as plt
data1 = np.array([[random.randint(1, 400) for i in range(2)] for j in range(50)], dtype=np.float64)
data2 = np.array([[random.randint(300, 700) for i in range(2)] for j in range(50)], dtype=np.float64)
data3 = np.array([[random.randint(600, 900) for i in range(2)] for j in range(50)], dtype=np.float64)
data = np.append(np.append(data1, data2, axis=0), data3, axis=0)
d = pdist(data, lambda u, v: np.sqrt(((u - v) ** 2).sum()))  # Euclidean distance, computed by hand
distance_matrix = squareform(d)
cluster = DBSCAN(eps=0.3, min_samples=2, metric='precomputed')
dbscan_model = cluster.fit_predict(distance_matrix)
plt.scatter(data[:, 0], data[:, 1], s=100, c=dbscan_model)
plt.show()
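One thing worth checking (an assumption about the cause, not stated in the post): eps=0.3 is far smaller than any pairwise distance in this data, since the coordinates span hundreds of units, so DBSCAN labels every point as noise (-1) and the scatter plot shows a single colour. A minimal sketch with an eps on the scale of the actual nearest-neighbour distances, reusing the distance_matrix from above:
# eps must be comparable to the nearest-neighbour distances (tens of units here);
# otherwise every point is noise and fit_predict returns -1 for all rows.
cluster = DBSCAN(eps=50, min_samples=2, metric='precomputed')
labels = cluster.fit_predict(distance_matrix)
print(set(labels))  # with non-overlapping ranges this should show several clusters
Note that with the overlapping ranges above (data2 spans 300-700, bridging data1 and data3) the regions can still merge into one cluster, which is consistent with DBSCAN's density-chaining behaviour.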

Python logit regression matrix shape error "ValueError: endog and exog matrices are different sizes"

Basic setup: I'm trying to run a logit regression in Python on the probability of founding a business (the founder variable); the exogenous variables are year, age, edu_cat (education category), and sex.
The X matrix is (4, 650) and the y matrix is (1, 650). All of the variables within the X matrix have 650 non-NaN observations.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
x = np.array([df_all['Year'], df_all['Age'], df_all['Edu_cat'], df_all['sex']])
y = np.array([df_all['founder']])
logit_model = sm.Logit(y, x)
result = logit_model.fit()
print(result)
So I'm tracking that the shapes are good, but Python is telling me otherwise. Am I missing something basic?
I believe the issue is the shapes: statsmodels expects endog (y) to be one-dimensional, (650,), and exog (x) to be (650, 4) with one row per observation, but the code above builds y as (1, 650) and x as (4, 650). Flattening y and transposing x fixes the error.
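A minimal sketch of the fix, assuming df_all is the DataFrame from the question with the same column names:
# statsmodels expects endog with shape (nobs,) and exog with shape (nobs, k):
x = df_all[['Year', 'Age', 'Edu_cat', 'sex']].to_numpy()  # (650, 4)
y = df_all['founder'].to_numpy()                          # (650,)
logit_model = sm.Logit(y, x)
result = logit_model.fit()
print(result.summary())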

scRNA-seq: How to use TSNE python implementation using precalculated PCA score/load?

Python t-SNE implementation from this resource: https://lvdmaaten.github.io/tsne/
Btw, I'm a beginner to scRNA-seq.
What I am trying to do: use a scRNA-seq data set and run t-SNE on it, but using previously calculated PCs (I have PCA.score and PCA.load files).
Q1: I should be able to use my selected, pre-calculated PCs in the t-SNE, but which file do I use, pca.score or pca.load, when running Y = tsne.tsne(X)?
Q2: I've tried removing/replacing parts of the PCA-calculating code to attempt to skip the PCA preprocessing, but it always seems to give an error. What should I change so it properly uses my existing PCA data and does not recompute PCA from it?
The PCA preprocessing code, in its raw form, is this:
def pca(X=np.array([]), no_dims=50):
    """
    Runs PCA on the NxD array X in order to reduce its dimensionality to
    no_dims dimensions.
    """
    print("Preprocessing the data using PCA...")
    (n, d) = X.shape
    X = X - np.tile(np.mean(X, 0), (n, 1))   # center each column
    (l, M) = np.linalg.eig(np.dot(X.T, X))   # eigendecomposition of X'X
    Y = np.dot(X, M[:, 0:no_dims])           # project onto the top no_dims eigenvectors
    return Y
You should use the PCA score: the scores are the transformed sample coordinates (an N×K matrix), which is what tsne expects as input, whereas the loadings are the component vectors themselves.
As for not running PCA, you can just comment out this line:
X = pca(X, initial_dims).real
What I did instead is add a do_pca parameter and edit the function like so:
def tsne(X=np.array([]), no_dims=2, initial_dims=50, perplexity=30.0, do_pca=True):
    """
    Runs t-SNE on the dataset in the NxD array X to reduce its
    dimensionality to no_dims dimensions. The syntax of the function is
    `Y = tsne.tsne(X, no_dims, perplexity)`, where X is an NxD NumPy array.
    """
    # Check inputs
    if isinstance(no_dims, float):
        print("Error: array X should have type float.")
        return -1
    if round(no_dims) != no_dims:
        print("Error: number of dimensions should be an integer.")
        return -1
    # Initialize variables
    if do_pca:
        X = pca(X, initial_dims).real
    (n, d) = X.shape
    max_iter = 50
    [.. rest stays the same ..]
Using an example dataset, with the do_pca switch instead of commenting anything out:
import numpy as np
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
from tsne import *
X, y = load_digits(return_X_y=True, n_class=3)
If we run the default:
res = tsne(X=X, initial_dims=20, do_pca=True)
plt.scatter(res[:, 0], res[:, 1], c=y)
If we pass it pre-computed principal components instead:
pc = pca(X)[:, :20]
res = tsne(X=pc, initial_dims=20, do_pca=False)
plt.scatter(res[:, 0], res[:, 1], c=y)

fit_transform PCA inconsistent results

I am trying to run PCA from sklearn with n_components=5. I apply the dimensionality reduction to my data using fit_transform(data).
Initially I tried the classical matrix multiplication between the pca.components_ values and my x_features data, but the results are different. So either I am doing the multiplication incorrectly or I have misunderstood how fit_transform works.
Below is a mock-up comparing classic matrix multiplication and fit_transform:
import numpy as np
from sklearn import decomposition
np.random.seed(0)
my_matrix = np.random.randn(100, 5)
mdl = decomposition.PCA(n_components=5)
mdl_FitTrans = mdl.fit_transform(my_matrix)
pca_components = mdl.components_
mdl_FitTrans_manual = np.dot(pca_components, my_matrix.transpose())
mdl_FitTrans_manualT = mdl_FitTrans_manual.transpose()
I am expecting mdl_FitTrans == mdl_FitTrans_manual but the result is False.
Check out how the transform() method is implemented in sklearn: https://github.com/scikit-learn/scikit-learn/blob/a5ab948/sklearn/decomposition/base.py#L101
According to it, the manual reduction is done as follows (note the centering by mean_, which the multiplication above omits):
import numpy as np
from sklearn import decomposition
np.random.seed(0)
data = np.random.randn(100, 100)
mdl = decomposition.PCA(n_components=5)
mdl_fit = mdl.fit(data)
data_transformed = mdl_fit.transform(data)
data_transformed_manual = np.dot(data - mdl_fit.mean_, mdl.components_.T)
np.all(data_transformed == data_transformed_manual)
True
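Applying the same fix to the original mock-up (a sketch reusing the question's variable names), the missing piece was subtracting mean_ before projecting:
# Center first, then project onto the components; this yields the same
# (100, 5) row-per-sample layout that fit_transform returns.
mdl_FitTrans_manual = np.dot(my_matrix - mdl.mean_, pca_components.T)
print(np.allclose(mdl_FitTrans, mdl_FitTrans_manual))  # True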

Can Python do a Gaussian fitting and extrapolation?

I think numpy or scipy will do it, but I couldn't find how. Thanks!
import numpy as np
import scipy.stats as stats
np.random.seed(0)
gaussian = stats.norm
Generating some random, normal data:
data = gaussian.rvs(loc = 5, scale = 22, size = 1000)
Computing descriptive statistics:
print(data.mean())
# 4.00435243522
print(data.std())
# 21.7147294907
Fitting the data to a normal distribution:
mean, std = gaussian.fit(data)
print(mean, std)
# (4.0043524352157016, 21.714729490718568)
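For the extrapolation part, the fitted parameters define a full distribution, so one option is to evaluate its pdf or cdf at new points, including points outside the observed range. A minimal sketch continuing from the fit above:
# Evaluate the fitted normal anywhere, even beyond the data's range:
x_new = np.linspace(data.min() - 50, data.max() + 50, 200)
pdf_fitted = gaussian.pdf(x_new, loc=mean, scale=std)
cdf_fitted = gaussian.cdf(x_new, loc=mean, scale=std)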
