Python | SKlearn | PCA

Edit: Thanks for spotting the typo; it should be 60*50. I have corrected it in the question.
I am stuck on the following problem. After performing PCA on a matrix with 60 observations and 50 variables, I checked the shape of the PCA components and it comes out to be 50*50, whereas I think it should be 60*50. I checked the same in R and there, as per my understanding, it comes out to be 60*50. Please let me know if I am doing something wrong. PFB the code:
import numpy as np

arr = np.random.randn(20*3*50)
arr = (arr - np.mean(arr, axis=0)) / np.std(arr, axis=0)
arr = arr.reshape(60, 50)
arr.shape
# output: (60, 50)

arr[1:20, 2] = 1
arr[21:40, 1] = 2
arr[21:40, 2] = 2
arr[41:60, 1] = 1
arr.shape
# output: (60, 50)

from sklearn.decomposition import PCA
pca = PCA()
X_train_pca = pca.fit_transform(arr)
pca.components_.shape
# output: (50, 50)

Look at the PCA class in scikit-learn. The documentation tells us that if n_components is not set, all components are kept:
n_components == min(n_samples, n_features)
Since pca.components_ returns an array of shape (n_components, n_features), there is no confusion: with 60 samples and 50 features, n_components = min(60, 50) = 50, so components_ has shape (50, 50).
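To see the two shapes side by side, here is a minimal sketch along the lines of the code above (using freshly generated random data rather than the exact array from the question):
import numpy as np
from sklearn.decomposition import PCA

arr = np.random.randn(60, 50)          # 60 observations, 50 variables

pca = PCA()                            # n_components defaults to min(60, 50) = 50
scores = pca.fit_transform(arr)

print(pca.components_.shape)           # (50, 50): (n_components, n_features), the principal axes
print(scores.shape)                    # (60, 50): (n_samples, n_components), the projected data
The 60*50 array you were expecting is the transformed data returned by fit_transform; components_ holds the principal axes themselves.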

Related

ValueError: Found array with dim 3. Estimator expected <= 2. when using numpy arrays

I have two numpy arrays
import numpy as np
temp_1 = np.array([['19.78018766'],
                   ['19.72487359'],
                   ['19.70280336'],
                   ['19.69589641'],
                   ['19.69746018']])
temp_2 = np.array([['43.8'],
                   ['43.9'],
                   ['44'],
                   ['44.1'],
                   ['44.2']])
and I am preparing X = np.stack((temp_1,temp_2), axis=-1)
which looks something like this
X = [[['19.78018766' '43.8']]
[['19.72487359' '43.9']]
[['19.70280336' '44']]
[['19.69589641' '44.1']]
[['19.69746018' '44.2']]]
I have another variable Y which is also a numpy array
Y = np.array([['28.78'],
['32.72'],
['15.70'],
['32.69'],
['55.69']])
I am trying to run the RandomForestRegressor model
where
from sklearn.ensemble import RandomForestRegressor
clf = RandomForestRegressor()
clf.fit(X,Y)
However, it is giving me this error
ValueError: Found array with dim 3. Estimator expected <= 2.
This happens because X is 3-dimensional: np.stack gives it shape (5, 1, 2), while the estimator expects a 2-D array of shape (n_samples, n_features), so X and Y no longer line up as samples. Just reshape your X data to two dimensions, one row per sample:
# In this example 5 samples
X = X.reshape(5, 2)
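Putting the whole thing together, here is a minimal sketch of the fix (the reshape uses len(X) rather than a hard-coded 5 so it works for any number of samples, and the string entries are converted to floats because the estimator needs numeric input):
import numpy as np
from sklearn.ensemble import RandomForestRegressor

temp_1 = np.array([['19.78018766'], ['19.72487359'], ['19.70280336'],
                   ['19.69589641'], ['19.69746018']])
temp_2 = np.array([['43.8'], ['43.9'], ['44'], ['44.1'], ['44.2']])

X = np.stack((temp_1, temp_2), axis=-1)        # shape (5, 1, 2): 3-D, rejected by the estimator
X = X.reshape(len(X), -1).astype(float)        # shape (5, 2): one row per sample, numeric

Y = np.array([['28.78'], ['32.72'], ['15.70'],
              ['32.69'], ['55.69']]).astype(float)

clf = RandomForestRegressor()
clf.fit(X, Y.ravel())                          # ravel() gives the 1-D target sklearn expects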

why my PCA and PCA from sklearn get different results?

I tried to use the PCA implementation from "Machine Learning in Action", but I found that its results are not the same as those from the PCA in sklearn. I don't quite understand what is going on.
Below is my code:
import numpy as np
from sklearn.decomposition import PCA
x = np.array([
    [1, 2, 3, 4, 5, 0],
    [0.6, 0.7, 0.8, 0.9, 0.10, 0],
    [110, 120, 130, 140, 150, 0]
])

def my_pca(data, dim):
    remove_mean = data - data.mean(axis=0)
    cov_data = np.cov(remove_mean, rowvar=0)
    eig_val, eig_vec = np.linalg.eig(np.mat(cov_data))
    sorted_eig_val = np.argsort(eig_val)
    eig_index = sorted_eig_val[:-(dim+1):-1]
    transfer = eig_vec[:, eig_index]
    low_dim = remove_mean * transfer
    return np.array(low_dim, dtype=float)

pca = PCA(n_components=3)
pca.fit(x)
new_x = pca.transform(x)
print("sklearn")
print(new_x)

new_x = my_pca(x, 3)
print("my")
print(new_x)
Output:
sklearn
[[-9.32494230e+01 1.46120285e+00 2.37676120e-15]
[-9.89004904e+01 -1.43283197e+00 2.98143675e-14]
[ 1.92149913e+02 -2.83708789e-02 2.81307176e-15]]
my
[[ 9.32494230e+01 -1.46120285e+00 7.39333927e-14]
[ 9.89004904e+01 1.43283197e+00 -7.01760428e-14]
[-1.92149913e+02 2.83708789e-02 1.84375626e-14]]
The issue relates to your function, in particular the part where you calculate your eigenvalues and eigenvectors:
eig_val, eig_vec = np.linalg.eig(np.mat(cov_data))
It appears that scikit-learn uses "eigh" instead of "eig", so if you change np.linalg.eig to np.linalg.eigh, you should get the same results. Note that the two outputs above already agree up to sign: the sign of an eigenvector is arbitrary, so different solvers can legitimately flip whole components.
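As a quick check (assuming pca, x and my_pca from the question are still in scope), you can confirm that the two projections agree up to sign:
import numpy as np

sklearn_result = pca.transform(x)
my_result = my_pca(x, 3)

# identical except for the arbitrary sign of each component
print(np.allclose(np.abs(sklearn_result), np.abs(my_result)))  # True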

sorting via argsort - generalization to 2d matrices

For sorting a numpy array via argsort, we can do:
import numpy as np
x = np.random.rand(3)
x_sorted = x[np.argsort(x)]
I am looking for a pure numpy solution that generalizes to two or more dimensions; the indexing from the 1-D case does not work for 2-D matrices.
Y = np.random.rand(4, 3)
sort_indices = np.argsort(Y)
#Y_sorted = Y[sort_indices] (what would that line be?)
Related: I am looking for a pure numpy answer that addresses the same problem as solved in this answer: https://stackoverflow.com/a/53700995/2272172
Use np.take_along_axis:
import numpy as np
np.random.seed(42)
x = np.random.rand(3)
x_sorted = x[np.argsort(x)]
Y = np.random.rand(4, 3)
sort_indices = np.argsort(Y)
print(np.take_along_axis(Y, sort_indices, axis=1))
print(np.array(list(map(lambda x, y: y[x], np.argsort(Y), Y)))) # the solution provided
Output
[[0.15599452 0.15601864 0.59865848]
[0.05808361 0.60111501 0.86617615]
[0.02058449 0.70807258 0.96990985]
[0.18182497 0.21233911 0.83244264]]
[[0.15599452 0.15601864 0.59865848]
[0.05808361 0.60111501 0.86617615]
[0.02058449 0.70807258 0.96990985]
[0.18182497 0.21233911 0.83244264]]
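The same pattern works along any axis. For example, to sort each column of Y independently (axis 0 instead of the default last axis), pass the same axis to both argsort and take_along_axis:
import numpy as np

np.random.seed(42)
Y = np.random.rand(4, 3)

col_indices = np.argsort(Y, axis=0)                    # row index of each sorted element, per column
Y_cols_sorted = np.take_along_axis(Y, col_indices, axis=0)
print(Y_cols_sorted)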

Scikit Learn transform method - manual calculation?

I have a question about Scikit-Learn's PCA transform method. The code is found here - scroll down to find the transform() method.
They show the procedure in a simple example: first fit, then transform:
pca.fit(X) #step 1: fit()
X_transformed = fast_dot(X, self.components_.T) #step 2: transform()
I am trying to do this manually as follows:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.utils.extmath import fast_dot
iris = load_iris()
X = iris.data
y = iris.target
pca = PCA(n_components=3)
pca.fit(X)
Xm = X.mean(axis=1)
print pca.transform(X)[:5,:] #Method 1 - expected
X = X - Xm[None].T # or can use X = X - Xm[:, np.newaxis]
print fast_dot(X,pca.components_.T)[:5,:] #Method 2 - manual
Expected:
[[-2.68420713 -0.32660731 0.02151184]
[-2.71539062 0.16955685 0.20352143]
[-2.88981954 0.13734561 -0.02470924]
[-2.7464372 0.31112432 -0.03767198]
[-2.72859298 -0.33392456 -0.0962297 ]]
Manual
[[-0.98444292 -2.74509617 2.28864171]
[-0.75404746 -2.44769323 2.35917528]
[-0.89110797 -2.50829893 2.11501947]
[-0.74772562 -2.33452022 2.10205674]
[-1.02882877 -2.75241342 2.17090017]]
As you can see, the two results are different. Is there a step missing somewhere in the transform() method?
I'm not a great expert on PCA, but by looking at the sklearn source code I found your problem - you take the mean along the wrong axis.
Here's the solution:
Xm = X.mean(axis=0) # Axis 0 instead of 1
print pca.transform(X)[:5,:] #Method 1 - expected
X = X - Xm # No need for transpose now
print fast_dot(X,pca.components_.T)[:5,:] #Method 2 - manual
Results:
[[-2.68420713 0.32660731 -0.02151184]
[-2.71539062 -0.16955685 -0.20352143]
[-2.88981954 -0.13734561 0.02470924]
[-2.7464372 -0.31112432 0.03767198]
[-2.72859298 0.33392456 0.0962297 ]]
[[-2.68420713 0.32660731 -0.02151184]
[-2.71539062 -0.16955685 -0.20352143]
[-2.88981954 -0.13734561 0.02470924]
[-2.7464372 -0.31112432 0.03767198]
[-2.72859298 0.33392456 0.0962297 ]]
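For reference, a present-day equivalent of the manual calculation (a sketch, assuming a recent scikit-learn where fast_dot is no longer available and plain np.dot suffices) uses the mean stored by fit() in pca.mean_:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

pca = PCA(n_components=3)
pca.fit(X)

# centre with the mean learned by fit(), then project onto the components
manual = np.dot(X - pca.mean_, pca.components_.T)

print(np.allclose(manual, pca.transform(X)))  # True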

random subsampling of the majority class

I have unbalanced data and I want to perform random subsampling of the majority class, where each subsample is the same size as the minority class. I think this is already implemented in Weka and Matlab; is there an equivalent in sklearn?
Say your data looks like something generated from this code:
import numpy as np
x = np.random.randn(100, 3)
y = np.array([int(i % 5 == 0) for i in range(100)])
(only a 1/5th of y is 1, which is the minority class).
To find the size of the minority class, do:
>>> np.sum(y == 1)
20
To find the subset that consists of the majority class, do:
majority_x, majority_y = x[y == 0, :], y[y == 0]
To draw a random subset of size 20 (without replacement, so no majority row is picked twice), do:
inds = np.random.choice(majority_x.shape[0], 20, replace=False)
followed by
majority_x[inds, :]
and
majority_y[inds]
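Putting the pieces together, here is a minimal sketch that rebuilds a balanced dataset from the subsample (the seed is only there to make the example reproducible):
import numpy as np

np.random.seed(0)
x = np.random.randn(100, 3)
y = np.array([int(i % 5 == 0) for i in range(100)])

n_minority = np.sum(y == 1)                        # 20

majority_x, majority_y = x[y == 0, :], y[y == 0]
minority_x, minority_y = x[y == 1, :], y[y == 1]

# draw n_minority distinct rows from the majority class
inds = np.random.choice(majority_x.shape[0], n_minority, replace=False)

x_balanced = np.vstack([minority_x, majority_x[inds, :]])
y_balanced = np.concatenate([minority_y, majority_y[inds]])

print(x_balanced.shape, np.bincount(y_balanced))   # (40, 3) [20 20]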
