I tried to use the PCA implementation from "Machine Learning in Action", but I found that its results are not the same as those of the PCA in sklearn. I don't quite understand what is going on.
Below is my code:
import numpy as np
from sklearn.decomposition import PCA
x = np.array([
    [1, 2, 3, 4, 5, 0],
    [0.6, 0.7, 0.8, 0.9, 0.10, 0],
    [110, 120, 130, 140, 150, 0]
])

def my_pca(data, dim):
    remove_mean = data - data.mean(axis=0)
    cov_data = np.cov(remove_mean, rowvar=0)
    eig_val, eig_vec = np.linalg.eig(np.mat(cov_data))
    sorted_eig_val = np.argsort(eig_val)
    eig_index = sorted_eig_val[:-(dim + 1):-1]
    transfer = eig_vec[:, eig_index]
    low_dim = remove_mean * transfer
    return np.array(low_dim, dtype=float)
pca = PCA(n_components=3)
pca.fit(x)
new_x = pca.transform(x)
print("sklearn")
print(new_x)
new_x = my_pca(x, 3)
print("my")
print(new_x)
Output:
sklearn
[[-9.32494230e+01 1.46120285e+00 2.37676120e-15]
[-9.89004904e+01 -1.43283197e+00 2.98143675e-14]
[ 1.92149913e+02 -2.83708789e-02 2.81307176e-15]]
my
[[ 9.32494230e+01 -1.46120285e+00 7.39333927e-14]
[ 9.89004904e+01 1.43283197e+00 -7.01760428e-14]
[-1.92149913e+02 2.83708789e-02 1.84375626e-14]]
The issue relates to your function, in particular the part where you calculate your eigenvalues and eigenvectors:
eig_val, eig_vec = np.linalg.eig(np.mat(cov_data))
It appears that scikit-learn uses "eigh" instead of "eig", so if you change np.linalg.eig to np.linalg.eigh in that snippet, you should get the same results.
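For example, this is the only change needed in my_pca above (np.linalg.eigh is NumPy's solver for symmetric matrices such as a covariance matrix; it returns real eigenvalues in ascending order, so the argsort-based selection keeps working unchanged):

# eigh is intended for symmetric matrices like a covariance matrix and
# returns real eigenvalues in ascending order
eig_val, eig_vec = np.linalg.eigh(np.mat(cov_data))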
I am implementing PCA in Python with this code:
def OWN_PCA(X, num_components):
    # Step 1: center the data
    X_meaned = X - np.mean(X, axis=0)
    # creating covariance matrix
    cov_mat = np.cov(X_meaned, rowvar=False)
    # calculating eigenvalues and eigenvectors
    eigen_values, eigen_vectors = np.linalg.eigh(cov_mat)
    # sorting the vectors based on eigenvalues
    sorted_index = np.argsort(eigen_values)[::-1]
    sorted_eigenvalue = eigen_values[sorted_index]
    sorted_eigenvectors = eigen_vectors[:, sorted_index]
    # choosing number of components
    eigenvector_subset = sorted_eigenvectors[:, 0:num_components]
    X_reduced = np.dot(eigenvector_subset.transpose(), X.transpose()).transpose()
    return X_reduced
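For reference, this is how I apply it (a minimal sketch, assuming the standard sklearn iris loader):

import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data               # shape (150, 4)

X_reduced = OWN_PCA(X, 2)   # keep the first two principal components
print(X_reduced.shape)      # (150, 2)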
The problem is that when I apply it to the iris dataset and plot the result, the image comes out mirrored compared to the plot I get when I use PCA in sklearn.
What is wrong with my code?
I am stuck on the following problem. After performing PCA on a matrix with 60 observations and 50 variables, the shape of pca.components_ comes out to be 50*50, whereas I think it should be 60*50. I checked the same in R and there, as per my understanding, it comes out to be 60*50. Please let me know if I am doing something wrong. The code is below:
Edit: Thanks for spotting the typo; it should be 60*50. I have corrected it in the question.
import numpy as np
from sklearn.decomposition import PCA

arr = np.random.randn(20 * 3 * 50)
arr = (arr - np.mean(arr, axis=0)) / np.std(arr, axis=0)
arr = arr.reshape(60, 50)
arr.shape
# output: (60, 50)

arr[1:20, 2] = 1
arr[21:40, 1] = 2
arr[21:40, 2] = 2
arr[41:60, 1] = 1
arr.shape
# output: (60, 50)

pca = PCA()
X_train_pca = pca.fit_transform(arr)
pca.components_.shape
# output: (50, 50)
Look at the documentation of the PCA class in scikit-learn. It tells us that if n_components is not set, all components are kept:
n_components == min(n_samples, n_features)
Here min(60, 50) == 50, and since pca.components_ returns an array of shape (n_components, n_features), the (50, 50) shape is exactly what is expected: one row per component, one column per variable. The 60*50 array you have in mind is the transformed data, not the components.
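To make the distinction concrete, a short sketch reusing the arrays from the question; components_ describes the principal axes, while the transformed data is the per-observation array:

pca = PCA()
X_train_pca = pca.fit_transform(arr)
print(pca.components_.shape)   # (50, 50): min(60, 50) components x 50 features
print(X_train_pca.shape)       # (60, 50): one row per observation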
I am learning about Linear Discriminant Analysis and am using the scikit-learn module. I am confused by the "coef_" attribute from the LinearDiscriminantAnalysis class. As far as I understand, these are the discriminant function coefficients (sklearn calls them weight vectors). Since there should be (n_classes-1) discriminant functions, I would expect the coef_ attribute to be an array with shape (n_components, n_features), but instead it prints an (n_classes, n_features) array. Below is an example of this using the Iris dataset example from sklearn. Since there are 3 classes and 2 components, I would expect print(lda.coef_) to give me a 2x4 array instead of a 3x4 array...
Maybe I'm misinterpreting what the weight vectors are, perhaps they are the coefficients for the classification function?
And how do I get the coefficients for each variable in each discriminant/canonical function?
Code here:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

lda = LinearDiscriminantAnalysis(n_components=2, store_covariance=True)
X_r = lda.fit(X, y).transform(X)

colors = ['navy', 'turquoise', 'darkorange']  # one colour per class
plt.figure()
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], alpha=.8, color=color,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.xlabel('Function 1 (%.2f%%)' % (lda.explained_variance_ratio_[0] * 100))
plt.ylabel('Function 2 (%.2f%%)' % (lda.explained_variance_ratio_[1] * 100))
plt.title('LDA of IRIS dataset')
print(lda.coef_)
# output -> [[  6.24621637  12.24610757 -16.83743427 -21.13723331]
#            [ -1.51666857  -4.36791652   4.64982565   3.18640594]
#            [ -4.72954779  -7.87819105  12.18760862  17.95082737]]
You can calculate the coefficients with the following code:
import numpy as np
import pandas as pd

def LDA_coefficients(X, lda):
    # X is assumed to be a pandas DataFrame (X.columns is used below)
    nb_col = X.shape[1]
    matrix = np.zeros((nb_col + 1, nb_col), dtype=int)
    Z = pd.DataFrame(data=matrix, columns=X.columns)
    for j in range(0, nb_col):
        Z.iloc[j, j] = 1
    # transform the unit vectors plus the origin, then recover the
    # per-variable coefficients and the constant from the differences
    LD = lda.transform(Z)
    nb_funct = LD.shape[1]
    results = pd.DataFrame()
    index = ['const']
    for j in range(0, LD.shape[0] - 1):
        index = np.append(index, 'C' + str(j + 1))
    for i in range(0, LD.shape[1]):
        coef = [LD[-1][i]]
        for j in range(0, LD.shape[0] - 1):
            coef = np.append(coef, LD[j][i] - LD[-1][i])
        result = pd.Series(coef)
        result.index = index
        column_name = 'LD' + str(i + 1)
        results[column_name] = result
    return results
Before calling this function you need to complete the linear discriminant analysis:
lda = LinearDiscriminantAnalysis()
lda.fit(X,y)
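For instance, a minimal sketch on the iris data (note that the helper assumes X is a pandas DataFrame, because it reads X.columns):

import pandas as pd
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(LDA_coefficients(X, lda))   # one column per discriminant function (LD1, LD2)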
I use sklearn.linear_model.ElasticNetCV and I would like to get a figure similar to what Matlab provides with lassoPlot with plottype=CV, or R's plot(cv.glmnet(x,y)): a plot of the cross-validation errors over various alphas (note: in Matlab and R this parameter is called lambda). Here is an example:
import sklearn.linear_model as lm
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

# toy example:
# generate 200 samples of five-dimensional artificial data X from
# exponential distributions with various means
X = np.zeros((200, 5))
for col in range(5):
    X[:, col] = stats.expon.rvs(scale=1.0 / (col + 1), size=200)

# generate response data Y = X*r + eps where r has just two nonzero
# components, and the noise eps is normal with standard deviation 0.1
r = np.array([0, 2, 0, -3, 0])
Y = np.dot(X, r) + np.random.randn(200) * 0.1

enet = lm.ElasticNetCV()
alphas, coefs, _ = enet.path(X, Y)

# plot regularization paths
plt.plot(-np.log10(alphas), coefs.T, linestyle='-')
plt.show()
I would also like to plot, in a separate figure, the cross-validation error for each alpha, but it seems that ElasticNetCV.path() does not return an MSE vector. Is there functionality in sklearn similar to Matlab's lassoPlot with plottype='CV' (see http://de.mathworks.com/help/stats/lasso-and-elastic-net.html) or R's cv.glmnet(x,y) (https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html)? Alternatively, I would implement it using sklearn.cross_validation. Do you have any suggestions?
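Here is a sketch of what I have in mind, based on the mse_path_ and alphas_ attributes of a fitted ElasticNetCV (the squeeze is an assumption on my side, to drop a possible l1_ratio axis from mse_path_):

enet = lm.ElasticNetCV(cv=5)
enet.fit(X, Y)

# mse_path_ holds the cross-validation error for every (alpha, fold) pair;
# averaging over the folds gives one error value per alpha
mse = np.squeeze(enet.mse_path_)
plt.plot(-np.log10(enet.alphas_), mse.mean(axis=-1))
plt.xlabel('-log10(alpha)')
plt.ylabel('mean CV error (MSE)')
plt.show()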
I'm trying to replicate scikit-learn's svm.svr.predict(X) and don't know how to do it correctly.
What I want to do is this: after training the SVM with an RBF kernel, I would like to implement the prediction in another programming language (Java), so I need to be able to export the model's parameters in order to perform predictions on unseen cases.
On scikit's documentation page, I see that there are support_ and support_vectors_ attributes, but I don't understand how to replicate the .predict(X) method.
A solution of the form y_pred = f(X, svm.svr.support_, svm.svr.support_vectors_, etc.) is what I am looking for.
Thank you in advance!
Edit:
It's SVM for REGRESSION, not CLASSIFICATION!
Edit:
This is the code I am trying now, adapted from "Calculating decision function of SVM manually", with no success...
from sklearn import svm
import numpy as np

X = [[0, 0], [1, 1], [1, 2], [1, 2]]
y = [0, 1, 1, 1]

clf = svm.SVR(gamma=1e-3)
clf.fit(X, y)

Xtest = [[0, 0]]
print('clf.decision_function:')
print(clf.decision_function(Xtest))

sup_vecs = clf.support_vectors_
dual_coefs = clf.dual_coef_
gamma = clf.gamma
intercept = clf.intercept_
diff = sup_vecs - Xtest

# Vectorized method
norm2 = np.array([np.linalg.norm(diff[n, :]) for n in range(np.shape(sup_vecs)[0])])
dec_func_vec = -1 * (dual_coefs.dot(np.exp(-gamma * (norm2 ** 2))) - intercept)
print('decision_function replication:')
print(dec_func_vec)
The results I'm getting from the two methods are different. Why?
clf.decision_function:
[[ 0.89500898]]
decision_function replication:
[ 0.89900498]
Thanks to the contribution of B#rmaley.exe, I found the way to replicate SVM manually. I had to replace
dec_func_vec = -1 * (dual_coefs.dot(np.exp(-gamma*(norm2**2))) - intercept)
with
dec_func_vec = (dual_coefs.dot(np.exp(-gamma*(norm2**2))) + intercept)
So, the full vectorized method is:
# Vectorized method
norm2 = np.array([np.linalg.norm(diff[n, :]) for n in range(np.shape(sup_vecs)[0])])
dec_func_vec = dual_coefs.dot(np.exp(-gamma * (norm2 ** 2))) + intercept
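As a sanity check (my assumption here: for regression the decision function is the prediction itself), the replicated value should match clf.predict:

print(clf.predict(Xtest))   # should match dec_func_vec, i.e. about 0.895
print(dec_func_vec)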