sklearn Kernel PCA with different order of samples - python

I've run into a problem with the kernel PCA implemented in sklearn: the order of the samples before KPCA significantly influences the classification accuracy.
Here is my processing procedure:
Run kernel PCA on the input X (n_samples, n_components).
Shuffle X, then split it into a training set and a test set (10-fold).
Use an extra-trees classifier, SVC or another classifier implemented in sklearn to perform a binary classification task.
My code:
import numpy as np
import scipy.io as sio  # needed for sio.loadmat below
from sklearn.utils import shuffle
from sklearn.decomposition import KernelPCA
import sklearn.metrics as metrics
from sklearn.model_selection import KFold
from sklearn.ensemble import ExtraTreesClassifier as etclf
# load data
datapath=r"F:\..."
data=sio.loadmat(datapath+"\\...")
x=data["x"]
labels=data["labels"]
# kernel pca
gm=1e-5  # gamma must be a scalar, not a list
nfea=x.shape[1]
kpca=KernelPCA(n_components=nfea,kernel='rbf',gamma=gm,eigen_solver="auto",random_state=42)
x_pca=kpca.fit_transform(x)
# shuffle the x_pca with labels
x_shuffle,y_shuffle=shuffle(x_pca,labels,random_state=42)
data_label=np.concatenate((x_shuffle,y_shuffle),axis=1)
# 10-fold cross validation
kf = KFold(n_splits=10,shuffle=False)
for train,test in kf.split(data_label):
    x_train=data_label[train,:-1]
    x_test=data_label[test,:-1]
    y_train=data_label[train,-1]
    y_test=data_label[test,-1]
    # binary classification prediction
    clf=etclf(n_estimators=10,criterion='gini',random_state=42)
    clf.fit(x_train,y_train)
    y_pred=clf.predict(x_test)
    acc=metrics.accuracy_score(y_test,y_pred)
Before applying kernel PCA, I tried two orderings of x:
I sorted x by label from 1 to 0 (i.e., 1111111...111000000...000) and got an accuracy close to 0.99.
I shuffled x together with its labels (i.e., 1100011100...00101100100011) and got an accuracy of about 0.50.
I also tried other classifiers such as SVC and Gaussian naive Bayes, and the results were similar. I don't think the problem is the classifier or leakage between the training set and the test set. It seems more likely that KPCA introduces high correlations between samples that are close in order. I don't know how to explain this result.
Thanks for your help!
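One way to test the hypothesis that KPCA itself depends on sample order is to embed the same data twice, once in the original order and once permuted, and compare the results. The sketch below is only a diagnostic on synthetic data, not an explanation of the result above: the random data, gamma=0.1 and the dense solver are assumptions made for illustration. It compares inner products between embedded samples rather than raw coordinates, because the sign of each component is arbitrary.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Synthetic stand-in for x; the real data and gamma are not used here.
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 4))
perm = rng.permutation(len(X))

def embed(data):
    # Same kind of setup as in the question: RBF kernel, n_components = n_features.
    kpca = KernelPCA(n_components=data.shape[1], kernel="rbf",
                     gamma=0.1, eigen_solver="dense")
    return kpca.fit_transform(data)

Z = embed(X)              # embedding in the original order
Z_perm = embed(X[perm])   # embedding of the permuted samples

# Row i of Z_perm belongs to sample perm[i], so reorder G the same way.
G = Z @ Z.T
G_perm = Z_perm @ Z_perm.T
order_invariant = np.allclose(G_perm, G[np.ix_(perm, perm)], atol=1e-8)
print("embedding geometry independent of sample order:", order_invariant)
```

If the check passes on the real data as well, the order dependence is more likely to come from a later step (for example how rows and labels are paired after shuffling) than from KPCA.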

Related

Can you use LDA (Linear Discriminant Analysis) as part of sklearn pipeline for preprocessing?

So this is what I put together to run the data through variance threshold for feature selection, then normalizer and LDA for dimensionality reduction.
The LDA element I'm not too sure about as I can't find any examples of this being used in a pipeline (as dimensionality reduction / data transformation technique as opposed to a standalone classifier.)
I am a bit worried, as when this is used and the transformed data passed on to a series classifiers - they result in a series of identical accuracy, precision, recall and F1 scores. Only the application of AdaBoost brings back something different.
Is there something I'm doing wrong here?
pipeline = Pipeline([
    ('feature_selection', VarianceThreshold()),
    ('normaliser', Normalizer()),
    ('lda', LinearDiscriminantAnalysis())], verbose = True)
X_train_post_pipeline = pipeline.fit_transform(X_train, Y_train)
X_test_post_pipeline = pipeline.transform(X_test)
LinearDiscriminantAnalysis is a dimensionality reduction technique that can be compared to PCA, so it can be used within a pipeline as a preprocessing step.
The classifiers that consume its output may well end up with the same score, because LDA projects the inputs onto the most discriminative directions.
Below is an example of a pipeline that uses LDA as a preprocessing step:
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import Normalizer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_classes=2)
pipe = make_pipeline(VarianceThreshold(),
                     Normalizer(),
                     LinearDiscriminantAnalysis(),
                     LogisticRegression())
pipe.fit(X, y)
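A likely reason why different downstream classifiers report identical scores: LDA keeps at most n_classes - 1 components, so for a binary problem every classifier after it sees a single feature. The sketch below only illustrates that shape change on synthetic data; the dataset sizes are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic binary problem with 20 input features.
X, y = make_classification(n_samples=200, n_features=20, n_classes=2,
                           random_state=0)

# With two classes, LDA can produce at most one discriminant direction.
X_lda = LinearDiscriminantAnalysis().fit_transform(X, y)
print(X.shape, "->", X_lda.shape)
```

With a single input feature, most classifiers reduce to choosing a threshold on that axis, which would explain why their metrics coincide.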

How to reduce the number of vector features?

I'm doing k-fold cross-validation in scikit-learn. Here is the script:
import pandas as pd
import numpy as np
from time import time
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
from sklearn.metrics import classification_report, accuracy_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV, KFold, StratifiedKFold
r_filenameTSV = "TSV/A19784.tsv"
#DF 300 dimension start
tsv_read = pd.read_csv(r_filenameTSV, sep='\t', names=["vector"])
df = pd.DataFrame(tsv_read)
df = pd.DataFrame(df.vector.str.split(" ", n=1).tolist(), columns=['label', 'vector'])  # n must be passed by keyword in recent pandas
print(df)
#DF 300 dimension end
y = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1, 1).ravel()
print(y.shape)
X = pd.DataFrame([dict(y.split(':') for y in x.split()) for x in df['vector']])
print(X.astype(float).to_numpy())
print(X)
start = time()
clf = svm.SVC(kernel='rbf',
              C=32,
              gamma=8,
              )
print("K-Folds scores:")
originalclass = []
predictedclass = []
def classification_report_with_accuracy_score(y_true, y_pred):
    originalclass.extend(y_true)
    predictedclass.extend(y_pred)
    return accuracy_score(y_true, y_pred) # return accuracy score
inner_cv = StratifiedKFold(n_splits=10)
outer_cv = StratifiedKFold(n_splits=10)
# Nested CV with parameter optimization
nested_score = cross_val_score(clf, X=X, y=y, cv=outer_cv,
                               scoring=make_scorer(classification_report_with_accuracy_score))
# Average values in classification report for all folds in a K-fold Cross-validation
print(classification_report(originalclass, predictedclass))
print("10 folds processing seconds: {}".format(time() - start))
As you can see, my input is a pandas data frame with 300 features.
How can I reduce the number of features from 300 to 100?
Does everything have to be done in pandas (i.e. creating a df with at most 100 features per record), or can I use scikit-learn directly?
There are many ways to reduce the number of features in ML models. Here are some of them:
Use statistical measures such as information gain or Fisher score: compute the score between each feature and the target, then select the top 100.
Remove constant or quasi-constant features.
Use wrapper methods such as forward or backward feature selection, which search the feature space for the best combination. The mlxtend.feature_selection package offers these and is largely compatible with scikit-learn.
Use PCA, LDA, ...
Use embedded methods such as Lasso, Ridge or random forest, via SelectFromModel from sklearn.feature_selection.
Use correlation or covariance to determine which features are not contributing to accuracy. Dimension reduction reduces confusion and simplifies your model, often without compromising accuracy. Another approach is stepwise refinement: look at the area-under-the-curve score for each feature and remove the features that do not contribute significantly. Use t-SNE, an unsupervised method, to visualize your feature clusters.
https://github.com/dnishimoto/python-deep-learning/blob/master/ANSUR%202%20-%20Army%20-%20Dimension%20reduction.ipynb
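Two of the options above can be applied directly in scikit-learn, without building a reduced data frame by hand. This is a minimal sketch on synthetic data standing in for the 300-feature matrix; mutual_info_classif is used here as an approximation of information gain, and the dataset sizes are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for the 300-feature input.
X, y = make_classification(n_samples=200, n_features=300, n_informative=20,
                           random_state=0)

# Option 1: keep the 100 original features with the highest score.
selector = SelectKBest(score_func=mutual_info_classif, k=100)
X_selected = selector.fit_transform(X, y)

# Option 2: project onto 100 principal components (new, combined features).
pca = PCA(n_components=100)
X_projected = pca.fit_transform(X)

print(X_selected.shape, X_projected.shape)
```

Either transformer can also be placed in a Pipeline in front of the SVC, so that the reduction is fitted on the training folds only during cross-validation.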

How to weigh data points with sklearn training algorithms

I am looking to train either a random forest or gradient boosting algorithm using sklearn. My data has a weight for each data point that corresponds to the number of times that data point occurs in the dataset. Is there a way to give sklearn this weight during training, or do I need to expand my dataset into a non-weighted version in which each duplicate data point is represented individually?
You can definitely specify the weights while training these classifiers in scikit-learn. Specifically, this happens during the fit step. Here is an example using RandomForestClassifier but the same goes also for GradientBoostingClassifier:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np
data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 42)
Here I define some arbitrary weights just for the sake of the example:
weights = np.random.choice([1,2],len(y_train))
And then you can fit your model with these weights:
rfc = RandomForestClassifier(n_estimators = 20, random_state = 42)
rfc.fit(X_train,y_train, sample_weight = weights)
You can then evaluate your model on your test data.
Now, to your last point: in this example you could resample your training set according to the weights by duplication. But in most real-world cases this could end up being very tedious, because:
You would need to make sure all your weights are integers to perform duplication.
You would needlessly multiply the size of your data, which consumes memory and will most likely slow down training.
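For comparison, here is a minimal sketch of both routes with GradientBoostingClassifier: passing the counts as sample_weight, and expanding the data by duplication with np.repeat. The counts are made up for illustration, and the sketch does not claim that the two fitted models are identical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hypothetical occurrence counts (1 to 3) for each data point.
rng = np.random.RandomState(42)
counts = rng.randint(1, 4, size=len(y))

# Route 1: keep the data as is and pass the counts as weights.
gbc_weighted = GradientBoostingClassifier(n_estimators=20, random_state=42)
gbc_weighted.fit(X, y, sample_weight=counts)

# Route 2: duplicate each row according to its count (integer counts only).
X_expanded = np.repeat(X, counts, axis=0)
y_expanded = np.repeat(y, counts)
gbc_expanded = GradientBoostingClassifier(n_estimators=20, random_state=42)
gbc_expanded.fit(X_expanded, y_expanded)

print(X.shape, "->", X_expanded.shape)
```

The expanded array grows with the sum of the counts, which is the memory cost mentioned above.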

Low K-fold accuracy for First Fold

I created a text classifier, and I'm trying to use K-fold cross-validation. I can't figure out why my first fold has an accuracy of 55% while my other folds are overfitting at 99-100% accuracy. My data set is a 5109x2 dataframe with columns df["Features"] as the features and df["Labels"] as labels. df["Features"] holds comma-separated descriptors based on product-mapping keywords. I'm creating indicator variables from the sub-features with CountVectorizer(). The accuracies above are from a 5-fold CV.
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
def train(classifier, X, y):
    count_vect=CountVectorizer(min_df = 1,lowercase = False)
    y=pd.Series(y)
    X=count_vect.fit_transform(X)
    y=count_vect.fit_transform(y)
    kf=KFold(n_splits=5,shuffle=True)
    k_fold=pd.Series(np.zeros(5))
    for i,(train_index,test_index) in enumerate(kf.split(X)):
        print("Train",train_index, "Test",test_index)
        X_train,X_test=X[train_index],X[test_index]
        y_train,y_test=y[train_index],y[test_index]
        k_fold[i]=(print("For K=",i+1," Classifier accuracy= ",classifier.fit(X_train, y_train).score(X_test, y_test), "n = ",X_train.shape[0]))
train(MLPClassifier(hidden_layer_sizes= (100,),activation='relu',random_state=2, max_iter=100, warm_start=True),df["Features"], df["Labels"])
It is entirely possible that this is just a result of the data. There is no reason to implement this by hand, since scikit-learn has the functionality built in. If you want to test your implementation, try running the experiment with the shuffle parameter turned off to see whether you get the same results.
It is best practice to shuffle your data before running cross-validation anyway.
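A minimal sketch of the built-in route, assuming a toy list of comma-separated descriptors in place of df["Features"] and df["Labels"] (the texts and labels below are invented). cross_val_score fits a fresh clone of the pipeline on each fold, so nothing learned in one fold carries over into the next.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for df["Features"] and df["Labels"].
texts = ["red,metal,small", "blue,wood,large",
         "red,metal,large", "blue,wood,small"] * 10
labels = ["tool", "furniture", "tool", "furniture"] * 10

# The vectorizer lives inside the pipeline, so it is refitted on each training fold.
model = make_pipeline(
    CountVectorizer(lowercase=False),
    MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=2),
)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, texts, labels, cv=cv)
print(scores, scores.mean())
```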

accuracy difference between svm and logistic regression in python

I have two classifiers in Python: an SVM and a logistic regression.
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.metrics import accuracy_score  # used below to score both models
scaler = preprocessing.StandardScaler()
scaler.fit(synthetic_data)
synthetic_data = scaler.transform(synthetic_data)
test_data = scaler.transform(test_data)
svc = svm.SVC(tol=0.0001, C=100.0).fit(synthetic_data, synthetic_label)
predictedSVM = svc.predict(test_data)
print(accuracy_score(test_label, predictedSVM))
LRmodel = LogisticRegression(penalty='l2', tol=0.0001, C=100.0, random_state=1,max_iter=1000, n_jobs=-1)
predictedLR = LRmodel.fit(synthetic_data, synthetic_label).predict(test_data)
print(accuracy_score(test_label, predictedLR))
I use the same input for both, but their accuracies are very different. The SVM sometimes predicts every sample as 1. The accuracy of the SVM is 0.45 and the accuracy of logistic regression is 0.75. I tried different values of C, but the problem remains.
This is because SVC uses a radial (RBF) kernel by default (http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), which is not a linear classifier.
If you want a linear kernel, add the parameter kernel='linear' to SVC.
If you want to keep the radial kernel, I suggest also tuning the gamma parameter.
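The sketch below puts the two suggestions side by side on synthetic data, since synthetic_data and test_data from the question are not available. The dataset and the gamma values are assumptions for illustration only; the scores it prints say nothing about the original data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the data in the question.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "logistic regression": LogisticRegression(C=100.0, max_iter=1000),
    "SVC linear": SVC(kernel="linear", C=100.0),
    "SVC rbf, gamma='scale'": SVC(kernel="rbf", C=100.0, gamma="scale"),
    "SVC rbf, gamma=0.001": SVC(kernel="rbf", C=100.0, gamma=0.001),
}

scores = {}
for name, model in models.items():
    # Scale inside the pipeline so the scaler is fitted on the training data only.
    pipe = make_pipeline(StandardScaler(), model)
    scores[name] = pipe.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: {scores[name]:.3f}")
```

In practice gamma and C are usually tuned together, for example with GridSearchCV, rather than set by hand.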
