Why do my PCA and sklearn's PCA get different results? - python

I tried to use the PCA implementation from "Machine Learning in Action", but I found that the results it produces are not the same as those produced by the PCA in sklearn. I don't quite understand what is going on.
Below is my code:
import numpy as np
from sklearn.decomposition import PCA

x = np.array([
    [1, 2, 3, 4, 5, 0],
    [0.6, 0.7, 0.8, 0.9, 0.10, 0],
    [110, 120, 130, 140, 150, 0]
])

def my_pca(data, dim):
    # center the data
    remove_mean = data - data.mean(axis=0)
    # covariance matrix of the columns
    cov_data = np.cov(remove_mean, rowvar=0)
    eig_val, eig_vec = np.linalg.eig(np.mat(cov_data))
    # indices of the `dim` largest eigenvalues, in descending order
    sorted_eig_val = np.argsort(eig_val)
    eig_index = sorted_eig_val[:-(dim + 1):-1]
    transfer = eig_vec[:, eig_index]
    # project the centered data onto the selected eigenvectors
    low_dim = remove_mean * transfer
    return np.array(low_dim, dtype=float)

pca = PCA(n_components=3)
pca.fit(x)
new_x = pca.transform(x)
print("sklearn")
print(new_x)
new_x = my_pca(x, 3)
print("my")
print(new_x)
Output:
sklearn
[[-9.32494230e+01 1.46120285e+00 2.37676120e-15]
[-9.89004904e+01 -1.43283197e+00 2.98143675e-14]
[ 1.92149913e+02 -2.83708789e-02 2.81307176e-15]]
my
[[ 9.32494230e+01 -1.46120285e+00 7.39333927e-14]
[ 9.89004904e+01 1.43283197e+00 -7.01760428e-14]
[-1.92149913e+02 2.83708789e-02 1.84375626e-14]]

The issue lies in your function, in particular the line where you calculate the eigenvalues and eigenvectors:
eig_val, eig_vec = np.linalg.eig(np.mat(cov_data))
Note that your two outputs already agree up to the sign of each column. Eigenvectors are only defined up to sign, so both projections are valid; scikit-learn just settles on the opposite sign convention here, which happens to match what "eigh" returns rather than "eig". If you change np.linalg.eig to np.linalg.eigh, you should get the same results.
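For reference, a minimal sketch of the corrected function (eigh is also the more appropriate routine here, since a covariance matrix is symmetric); everything else follows the original code above:
import numpy as np

def my_pca(data, dim):
    remove_mean = data - data.mean(axis=0)
    cov_data = np.cov(remove_mean, rowvar=False)
    # eigh is designed for symmetric matrices such as a covariance matrix
    # and returns real eigenvalues in ascending order
    eig_val, eig_vec = np.linalg.eigh(cov_data)
    # pick the `dim` largest eigenvalues, in descending order
    eig_index = np.argsort(eig_val)[:-(dim + 1):-1]
    transfer = eig_vec[:, eig_index]
    return remove_mean @ transfer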

Related

PCA algorithm makes the iris dataset reversed on the y axis

I am implementing PCA in Python with this code:
import numpy as np

def OWN_PCA(X, num_components):
    # Step 1: center the data
    X_meaned = X - np.mean(X, axis=0)
    # creating the covariance matrix
    cov_mat = np.cov(X_meaned, rowvar=False)
    # calculating eigenvalues and eigenvectors
    eigen_values, eigen_vectors = np.linalg.eigh(cov_mat)
    # sorting the vectors by descending eigenvalue
    sorted_index = np.argsort(eigen_values)[::-1]
    sorted_eigenvalue = eigen_values[sorted_index]
    sorted_eigenvectors = eigen_vectors[:, sorted_index]
    # choosing the number of components
    eigenvector_subset = sorted_eigenvectors[:, 0:num_components]
    X_reduced = np.dot(eigenvector_subset.transpose(), X.transpose()).transpose()
    return X_reduced
The problem is that when I apply it to the iris dataset and plot it, I get this:
[scatter plot of the OWN_PCA projection]
and when I use PCA from sklearn, the image is reversed on the y axis:
[scatter plot of the sklearn PCA projection]
What is wrong with my code?
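This looks like the same sign ambiguity discussed in the answer above: np.linalg.eigh returns each eigenvector with an arbitrary sign, so the projection can come out mirrored relative to sklearn's. A minimal sketch of one way to fix a deterministic orientation (the convention chosen here, making the largest-magnitude entry of each eigenvector positive, is an assumption, not necessarily the exact one sklearn uses internally):
import numpy as np

def flip_signs(eigen_vectors):
    # force the largest-magnitude entry of each column to be positive
    max_abs_idx = np.argmax(np.abs(eigen_vectors), axis=0)
    signs = np.sign(eigen_vectors[max_abs_idx, range(eigen_vectors.shape[1])])
    return eigen_vectors * signs
Applying this to sorted_eigenvectors before the projection should make the orientation stable across implementations.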

Python | SKlearn | PCA

Edit: Thanks for spotting the typo; it should be 60*50. I have corrected it in the question.
I am stuck on the following problem. After performing PCA on a matrix with 60 observations and 50 variables, when I checked the shape of the PCA components it came out to be 50*50, whereas I think it should be 60*50. When I checked the same in R, it came out to be, as per my understanding, 60*50. Please let me know if I am doing something wrong. PFB the code:
import numpy as np
from sklearn.decomposition import PCA

arr = np.random.randn(20 * 3 * 50)
arr = (arr - np.mean(arr, axis=0)) / np.std(arr, axis=0)
arr = arr.reshape(60, 50)
arr.shape
# output: (60, 50)
arr[1:20, 2] = 1
arr[21:40, 1] = 2
arr[21:40, 2] = 2
arr[41:60, 1] = 1
arr.shape
# output: (60, 50)
pca = PCA()
X_train_pca = pca.fit_transform(arr)
pca.components_.shape
# output: (50, 50)
Look at the PCA class in scikit-learn. It tells us that:
...if n_components is not set all components are kept:
n_components == min(n_samples, n_features)
Since pca.components_ returns an array of shape (n_components, n_features), and here n_components == min(60, 50) == 50, a (50, 50) shape is exactly what is expected. The transformed data X_train_pca, in contrast, has shape (n_samples, n_components), i.e. (60, 50), which is likely the array you were looking at in R.
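A quick sketch that makes the two shapes explicit (random data, so only the shapes matter here):
import numpy as np
from sklearn.decomposition import PCA

arr = np.random.randn(60, 50)
pca = PCA()
X_train_pca = pca.fit_transform(arr)

# principal axes: one row per component, one column per feature
print(pca.components_.shape)  # (50, 50)
# projected data: one row per sample, one column per component
print(X_train_pca.shape)      # (60, 50)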

Canonical Discriminant Function in Python sklearn

I am learning about Linear Discriminant Analysis and am using the scikit-learn module. I am confused by the "coef_" attribute from the LinearDiscriminantAnalysis class. As far as I understand, these are the discriminant function coefficients (sklearn calls them weight vectors). Since there should be (n_classes-1) discriminant functions, I would expect the coef_ attribute to be an array with shape (n_components, n_features), but instead it prints an (n_classes, n_features) array. Below is an example of this using the Iris dataset example from sklearn. Since there are 3 classes and 2 components, I would expect print(lda.coef_) to give me a 2x4 array instead of a 3x4 array...
Maybe I'm misinterpreting what the weight vectors are, perhaps they are the coefficients for the classification function?
And how do I get the coefficients for each variable in each discriminant/canonical function?
[screenshot of Jupyter notebook]
Code here:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

lda = LinearDiscriminantAnalysis(n_components=2, store_covariance=True)
X_r = lda.fit(X, y).transform(X)

plt.figure()
colors = ['navy', 'turquoise', 'darkorange']  # one color per class
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], alpha=.8, color=color,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.xlabel('Function 1 (%.2f%%)' % (lda.explained_variance_ratio_[0] * 100))
plt.ylabel('Function 2 (%.2f%%)' % (lda.explained_variance_ratio_[1] * 100))
plt.title('LDA of IRIS dataset')

print(lda.coef_)
# output -> [[  6.24621637  12.24610757 -16.83743427 -21.13723331]
#            [ -1.51666857  -4.36791652   4.64982565   3.18640594]
#            [ -4.72954779  -7.87819105  12.18760862  17.95082737]]
You can calculate the coefficients with the following code (note that it assumes X is a pandas DataFrame, since it reads X.columns):
import numpy as np
import pandas as pd

def LDA_coefficients(X, lda):
    nb_col = X.shape[1]
    matrix = np.zeros((nb_col + 1, nb_col), dtype=int)
    Z = pd.DataFrame(data=matrix, columns=X.columns)
    # one unit vector per variable, plus a final row of zeros for the constant
    for j in range(0, nb_col):
        Z.iloc[j, j] = 1
    LD = lda.transform(Z)
    nb_funct = LD.shape[1]
    results = pd.DataFrame()
    index = ['const']
    for j in range(0, LD.shape[0] - 1):
        index = np.append(index, 'C' + str(j + 1))
    for i in range(0, LD.shape[1]):
        coef = [LD[-1][i]]
        for j in range(0, LD.shape[0] - 1):
            coef = np.append(coef, LD[j][i] - LD[-1][i])
        result = pd.Series(coef)
        result.index = index
        column_name = 'LD' + str(i + 1)
        results[column_name] = result
    return results
Before calling this function you need to fit the linear discriminant analysis:
lda = LinearDiscriminantAnalysis()
lda.fit(X,y)
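A hypothetical usage sketch with the iris data from the question (wrapping X in a pandas DataFrame so that X.columns exists, as the function assumes):
import pandas as pd
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
# one column per discriminant function, one row per variable plus the constant
print(LDA_coefficients(X, lda))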

SKLearn ElasticNetCV: Looking for a similar cross-validation-error-plot to Matlab's lassoPlot or R's plot(cv.glmnet(x,y))

I use sklearn.linear_model.ElasticNetCV and I would like to get a figure similar to what Matlab provides with lassoPlot with plottype=CV, or R's plot(cv.glmnet(x,y)): a plot of the cross-validation errors over the various alphas (note: in Matlab and R this parameter is called lambda). Here is an example:
import sklearn.linear_model as lm
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

# toy example
# generate 200 samples of five-dimensional artificial data X from
# exponential distributions with various means:
X = np.zeros((200, 5))
for col in range(5):
    X[:, col] = stats.expon.rvs(scale=1.0 / (col + 1), size=200)

# generate response data Y = X*r + eps where r has just two nonzero
# components, and the noise eps is normal with standard deviation 0.1:
r = np.array([0, 2, 0, -3, 0])
Y = np.dot(X, r) + np.random.randn(200) * 0.1

enet = lm.ElasticNetCV()
alphas, coefs, _ = enet.path(X, Y)

# plot regularization paths
plt.plot(-np.log10(alphas), coefs.T, linestyle='-')
plt.show()
I would also like to plot, in a separate figure, the cross-validation error for each alpha. But it seems that ElasticNetCV.path() does not return an MSE vector. Is there functionality in sklearn similar to Matlab's lassoPlot with plottype='CV' (see http://de.mathworks.com/help/stats/lasso-and-elastic-net.html) or R's cv.glmnet(x,y) (https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html)? Alternatively, I would implement it using sklearn.cross_validation. Do you have any suggestions?
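One possible approach (a sketch, not a definitive answer): the fitted ElasticNetCV object stores the per-fold mean squared errors along the regularization path in its mse_path_ attribute, with the corresponding alpha grid in alphas_, so the CV-error curve can be plotted directly. X and Y are as in the toy example above:
import numpy as np
import matplotlib.pyplot as plt
import sklearn.linear_model as lm

enet = lm.ElasticNetCV(cv=10)
enet.fit(X, Y)

# mse_path_ holds one MSE per (alpha, fold); average over the folds.
# With a list of l1_ratio values it gains an extra leading axis, so
# squeeze is used here, assuming the default scalar l1_ratio.
mse = np.squeeze(enet.mse_path_)
mean_mse = mse.mean(axis=-1)
alphas = np.squeeze(enet.alphas_)

plt.plot(-np.log10(alphas), mean_mse)
plt.axvline(-np.log10(enet.alpha_), linestyle='--')  # alpha selected by CV
plt.xlabel('-log10(alpha)')
plt.ylabel('mean cross-validation MSE')
plt.show()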

Replication of scikit svm.SVR.predict(X)

I'm trying to replicate scikit-learn's svm.SVR.predict(X) and don't know how to do it correctly.
The reason is that, after training the SVM with an RBF kernel, I would like to implement the prediction in another programming language (Java), so I need to be able to export the model's parameters and use them to predict unknown cases.
On scikit's documentation page I see that there are support_ and support_vectors_ attributes, but I don't understand how to replicate the .predict(X) method.
A solution of the form y_pred = f(X, svm.svr.support_, svm.svr.support_vectors_, etc., ...) is what I am looking for.
Thank you in advance!
Edit:
It's SVM for REGRESSION, not CLASSIFICATION!
Edit:
This is the code I am trying now, adapted from Calculating decision function of SVM manually, so far with no success...
from sklearn import svm
import numpy as np

X = [[0, 0], [1, 1], [1, 2], [1, 2]]
y = [0, 1, 1, 1]
clf = svm.SVR(gamma=1e-3)
clf.fit(X, y)

Xtest = [[0, 0]]
print('clf.decision_function:')
# note: recent scikit-learn versions removed decision_function from SVR;
# clf.predict(Xtest) returns the same value there
print(clf.decision_function(Xtest))

sup_vecs = clf.support_vectors_
dual_coefs = clf.dual_coef_
gamma = clf.gamma
intercept = clf.intercept_
diff = sup_vecs - Xtest

# Vectorized method
norm2 = np.array([np.linalg.norm(diff[n, :]) for n in range(np.shape(sup_vecs)[0])])
dec_func_vec = -1 * (dual_coefs.dot(np.exp(-gamma * (norm2 ** 2))) - intercept)
print('decision_function replication:')
print(dec_func_vec)
The results I'm getting are different for the two methods. Why?
clf.decision_function:
[[ 0.89500898]]
decision_function replication:
[ 0.89900498]
Thanks to the contribution of B#rmaley.exe, I found the way to replicate the SVM manually. I had to replace
dec_func_vec = -1 * (dual_coefs.dot(np.exp(-gamma*(norm2**2))) - intercept)
with
dec_func_vec = dual_coefs.dot(np.exp(-gamma*(norm2**2))) + intercept
So, the full vectorized method is:
# Vectorized method
norm2 = np.array([np.linalg.norm(diff[n, :]) for n in range(np.shape(sup_vecs)[0])])
dec_func_vec = dual_coefs.dot(np.exp(-gamma * (norm2 ** 2))) + intercept
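Putting it together, a minimal self-contained sketch of the replication, checked against clf.predict; the helper name rbf_svr_predict is just for illustration. For an RBF kernel the prediction is f(x) = sum_i dual_coef_i * exp(-gamma * ||sv_i - x||^2) + intercept:
import numpy as np
from sklearn import svm

def rbf_svr_predict(X_new, clf):
    # pairwise squared distances between support vectors and query points
    X_new = np.atleast_2d(X_new)
    sq_dist = ((clf.support_vectors_[:, None, :] - X_new[None, :, :]) ** 2).sum(axis=2)
    # weighted sum of RBF kernel values plus the intercept
    return clf.dual_coef_.dot(np.exp(-clf.gamma * sq_dist)).ravel() + clf.intercept_

X = [[0, 0], [1, 1], [1, 2], [1, 2]]
y = [0, 1, 1, 1]
clf = svm.SVR(gamma=1e-3)
clf.fit(X, y)

print(clf.predict([[0, 0]]))           # sklearn's own prediction
print(rbf_svr_predict([[0, 0]], clf))  # manual replication; should match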
