Accuracy difference between SVM and logistic regression in Python

I have two classifiers in Python: an SVM and a logistic regression.
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.metrics import accuracy_score  # needed for the accuracy_score calls below
scaler = preprocessing.StandardScaler()
scaler.fit(synthetic_data)
synthetic_data = scaler.transform(synthetic_data)
test_data = scaler.transform(test_data)
svc = svm.SVC(tol=0.0001, C=100.0).fit(synthetic_data, synthetic_label)
predictedSVM = svc.predict(test_data)
print(accuracy_score(test_label, predictedSVM))
LRmodel = LogisticRegression(penalty='l2', tol=0.0001, C=100.0, random_state=1, max_iter=1000, n_jobs=-1)
predictedLR = LRmodel.fit(synthetic_data, synthetic_label).predict(test_data)
print(accuracy_score(test_label, predictedLR))
I use the same input for both, but their accuracies are very different; the SVM sometimes predicts 1 for every test sample. The SVM's accuracy is 0.45 and logistic regression's is 0.75. I have tried several different values of C, but I still have the problem.

It is because SVC uses the RBF (radial basis function) kernel by default (http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), which is something different from a linear classifier.
If you want to use a linear kernel, add the parameter kernel='linear' to SVC.
If you want to keep using the RBF kernel, I suggest also tuning the gamma parameter.
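For example, a minimal sketch of both options, reusing the question's variables (the C and gamma values are illustrative, not tuned):
from sklearn import svm
# Option 1: a linear kernel, comparable to logistic regression's linear boundary
svc_linear = svm.SVC(kernel='linear', C=100.0, tol=0.0001).fit(synthetic_data, synthetic_label)
# Option 2: keep the default RBF kernel but tune gamma together with C,
# e.g. via GridSearchCV; a fixed value is shown here only for illustration
svc_rbf = svm.SVC(kernel='rbf', C=100.0, gamma=0.01).fit(synthetic_data, synthetic_label)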

Related

How does sklearn calculate accuracy on the validation set when XGBoost is given class weights?

I am using XGBoost's sklearn API with sklearn's RandomizedSearchCV() to train a boosted tree model with cross validation. My problem is imbalanced, so I've supplied the scale_pos_weight parameter to my XGBClassifier. For simplicity, let's say that I'm doing cross validation with two folds (k = 2). At the end of this post, I've provided an example of the model I'm fitting.
How is accuracy (or any metric) calculated on the validation set? Is the accuracy weighted using the scale_pos_weight argument given to XGBoost, or does sklearn calculate the unweighted accuracy?
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV
xgb_estimator = xgb.XGBClassifier(booster="gbtree")
tune_grid = {"scale_pos_weight": [1, 10, 100], "max_depth": [1, 5, 10]}  # simple hyperparameters as example
xgb_search = RandomizedSearchCV(xgb_estimator, tune_grid, cv=2, n_iter=10,
                                scoring="accuracy", refit=True,
                                return_train_score=True)
results = xgb_search.fit(X, y)
results.cv_results_  # look at cross-validation metrics
Does sklearn calculate the unweighted accuracy?
scikit-learn selects the model that maximizes the scoring parameter. When "accuracy" is provided, it returns unweighted accuracy, which is usually biased on imbalanced problems.
Pass scoring="balanced_accuracy" to optimize the class-balanced version instead.
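For example, a sketch of the question's search with that change (X and y are placeholders for the actual data):
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV
# Same search as in the question, but model selection now optimizes
# balanced accuracy (mean recall per class) instead of raw accuracy
xgb_estimator = xgb.XGBClassifier(booster="gbtree")
tune_grid = {"scale_pos_weight": [1, 10, 100], "max_depth": [1, 5, 10]}
xgb_search = RandomizedSearchCV(xgb_estimator, tune_grid, cv=2, n_iter=10,
                                scoring="balanced_accuracy", refit=True,
                                return_train_score=True)
results = xgb_search.fit(X, y)  # X, y: your feature matrix and labels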

sklearn Kernel PCA with different order of samples

I've encountered a problem with the kernel PCA implemented in sklearn: the order of the samples before KPCA significantly influences the classification accuracy.
Here is my processing procedure:
run Kernel PCA on the input X of shape (n_samples, n_features).
shuffle X, then split it into a training set and a test set (10-fold).
use an ExtraTrees classifier, SVC, or other classifiers implemented in sklearn to perform the binary classification task.
My code:
import numpy as np
import scipy.io as sio  # needed for loadmat below
from sklearn.utils import shuffle
from sklearn.decomposition import KernelPCA
import sklearn.metrics as metrics
from sklearn.model_selection import KFold
from sklearn.ensemble import ExtraTreesClassifier as etclf
# load data
datapath = r"F:\..."
data = sio.loadmat(datapath + "\\...")
x = data["x"]
labels = data["labels"]
# kernel pca
gm = 1e-5  # gamma must be a float, not a list
nfea = x.shape[1]
kpca = KernelPCA(n_components=nfea, kernel='rbf', gamma=gm, eigen_solver="auto", random_state=42)
x_pca = kpca.fit_transform(x)
# shuffle the x_pca with labels
x_shuffle, y_shuffle = shuffle(x_pca, labels, random_state=42)
data_label = np.concatenate((x_shuffle, y_shuffle), axis=1)
# 10-fold cross validation
kf = KFold(n_splits=10, shuffle=False)
for train, test in kf.split(data_label):
    x_train = data_label[train, :-1]
    x_test = data_label[test, :-1]
    y_train = data_label[train, -1]
    y_test = data_label[test, -1]
    # binary classification prediction
    clf = etclf(n_estimators=10, criterion='gini', random_state=42)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    acc = metrics.accuracy_score(y_test, y_pred)
Before applying kernel PCA, I tried two orderings of x:
when I sorted x by label from 1 to 0 (i.e., 1111111...111000000...000), I got an accuracy close to 0.99.
when I shuffled x together with its labels (i.e., 1100011100...00101100100011), I got an accuracy of about 0.50.
I also tried other classifiers such as SVC and Gaussian naive Bayes, and the results were similar, so I don't think it is a matter of the classifier or of leakage between the training and test sets. It seems more likely that KPCA creates high correlations between samples that are close together in order. I don't know how to explain this result.
Thanks for the help!
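One way to test the leakage hypothesis would be to re-fit the KPCA inside every fold instead of on the full dataset, e.g. with a Pipeline. This is only a sketch reusing the variables from the code above, not the original setup:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import KernelPCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import KFold, cross_val_score
# KernelPCA is now fitted on the training portion of each fold only,
# so test samples can no longer influence the learned components
pipe = Pipeline([("kpca", KernelPCA(kernel="rbf", gamma=1e-5)),
                 ("clf", ExtraTreesClassifier(n_estimators=10, random_state=42))])
cv = KFold(n_splits=10, shuffle=True, random_state=42)
print(cross_val_score(pipe, x, labels.ravel(), cv=cv).mean())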

Metric for K-fold Cross Validation for Regression models

I wanted to do cross validation on a regression (non-classification) model and ended up getting mean accuracies of about 0.90. However, I don't know what metric the method uses to compute these scores. I know how splitting in k-fold cross validation works; I just don't know the formula that the scikit-learn library uses to calculate the score of a prediction (I know how it works for classification models). Can someone give me the metric/formula used by sklearn.model_selection.cross_val_score?
Thanks in advance.
from sklearn.model_selection import cross_val_score

def metrics_of_accuracy(classifier, X_train, y_train):
    accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10)
    print(accuracies.mean())  # mean score across the 10 folds
    print(accuracies.std())   # standard deviation across folds
    return accuracies
By default, sklearn uses accuracy for classification and r2_score for regression when you call a model's score method (the same holds for cross_val_score). So in this case the metric is r2_score, whose formula is
r2 = 1 - (SSE(y_hat) / SSE(y_mean))
where
SSE(y_hat) is the sum of squared errors of the predictions
SSE(y_mean) is the sum of squared errors when every prediction is the mean of the actual values
You can also compute the same metric directly with sklearn.metrics.r2_score(y_true, y_pred). This score is also called the coefficient of determination, or R-squared. The formula is shown here: https://i.stack.imgur.com/USaWH.png
For more on this, see https://en.wikipedia.org/wiki/Coefficient_of_determination
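As a quick sanity check, the manual formula and sklearn's implementation agree (small made-up numbers):
import numpy as np
from sklearn.metrics import r2_score
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
# r2 = 1 - SSE(y_hat) / SSE(y_mean)
sse_pred = np.sum((y_true - y_pred) ** 2)
sse_mean = np.sum((y_true - y_true.mean()) ** 2)
print(1 - sse_pred / sse_mean)   # manual computation
print(r2_score(y_true, y_pred))  # sklearn, same value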

l1 regularized support for Multinomial Logistic Regression

The current sklearn LogisticRegression supports the multinomial setting but only allows l2 regularization, since the l-bfgs-b and newton-cg solvers only support that. Andrew Ng has a paper that discusses why l1 regularization can't be used directly with smooth solvers such as l-bfgs-b.
If I were to use sklearn's SGDClassifier with log loss and an l1 penalty, would that be the same as multinomial logistic regression with l1 regularization minimized by stochastic gradient descent? If not, are there any open-source Python packages that support an l1-regularized loss for multinomial logistic regression?
According to the SGD documentation:
For multi-class classification, a “one versus all” approach is used.
So I think SGDClassifier cannot perform multinomial logistic regression either.
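For reference, this is what that one-vs-all setup looks like in code (a small sketch on iris; the loss is named 'log' in older sklearn versions and 'log_loss' in recent ones):
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
X, y = load_iris(return_X_y=True)
# Fits one binary l1-penalized classifier per class (one-vs-all),
# not a joint multinomial model
clf = SGDClassifier(loss="log_loss", penalty="l1").fit(X, y)
print(clf.coef_.shape)  # (3, 4): one weight vector per class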
You can use statsmodels.discrete.discrete_model.MNLogit, which has a method fit_regularized which supports L1 regularization.
The example below is modified from this example:
import numpy as np
import statsmodels.api as sm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated
iris = load_iris()
X = iris.data
y = iris.target
X = sm.add_constant(X, prepend=False)  # an intercept is not included by default and should be added by the user
X_train, X_test, y_train, y_test = train_test_split(X, y)
mlogit_mod = sm.MNLogit(y_train, X_train)
alpha = 1 * np.ones((mlogit_mod.K, mlogit_mod.J - 1))  # alpha should be a scalar or have the same shape as results.params
alpha[-1, :] = 0  # choose not to regularize the constant
mlogit_l1_res = mlogit_mod.fit_regularized(method='l1', alpha=alpha)
y_pred = np.argmax(mlogit_l1_res.predict(X_test), 1)
Admittedly, the interface of this library is not as easy to use as scikit-learn's, but it provides more advanced statistical tools.
The lightning package supports multinomial logit with l1 regularization via SGD.

Ensemble methods with scikit-learn

Is there any way to combine different classifiers into one in sklearn? I found the sklearn.ensemble package. It contains different models, like AdaBoost and RandomForest, but they use decision trees under the hood, and I want to use different methods, like SVM and logistic regression. Is it possible with sklearn?
Do you just want to do majority voting? That is not implemented afaik. But as I said, you can just average the predict_proba scores. Or you can apply a LabelBinarizer to the predictions and average those; that would implement a voting scheme.
Even if you are not interested in the probabilities, averaging the predicted probabilities might be more robust than simple voting. This is hard to tell without trying it out, though.
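A minimal sketch of the probability-averaging idea (the dataset and model choices here are only illustrative):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
X, y = make_classification(n_samples=200, random_state=0)
lr = LogisticRegression().fit(X, y)
svc = SVC(probability=True).fit(X, y)  # probability=True enables predict_proba
# Average the class-probability estimates, then take the most probable class
avg_proba = (lr.predict_proba(X) + svc.predict_proba(X)) / 2
y_pred = np.argmax(avg_proba, axis=1)
This is essentially what sklearn's VotingClassifier with voting='soft' does, as used in the next answer.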
Yes, you can train different models on the same dataset and let each model make its predictions:
# Import functions to compute accuracy and split data
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# Import models, including VotingClassifier meta-model
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import VotingClassifier
# Set seed for reproducibility
SEED = 1
Now instantiate these models:
# Instantiate lr
lr = LogisticRegression(random_state = SEED)
# Instantiate knn
knn = KNN(n_neighbors = 27)
# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf = 0.13, random_state = SEED)
Then define them as a list of classifiers that will later be combined into one meta-model:
classifiers = [('Logistic Regression', lr),
               ('K Nearest Neighbours', knn),
               ('Classification Tree', dt)]
Now iterate over this predefined list of classifiers with a for loop (this assumes X_train, X_test, y_train, y_test have already been created, e.g. with the train_test_split imported above):
for clf_name, clf in classifiers:
    # Fit clf to the training set
    clf.fit(X_train, y_train)
    # Predict the test-set labels
    y_pred = clf.predict(X_test)
    # Calculate accuracy (accuracy_score expects y_true first)
    accuracy = accuracy_score(y_test, y_pred)
    # Report clf's accuracy on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy))
Finally, we'll evaluate the performance of a voting classifier that takes the outputs of the models defined in the classifiers list and assigns labels by majority voting.
# Voting Classifier
# Instantiate a VotingClassifier vc
vc = VotingClassifier(estimators = classifiers)
# Fit vc to the training set
vc.fit(X_train, y_train)
# Evaluate the test set predictions
y_pred = vc.predict(X_test)
# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy))
For this task I have been using DESlib, a scikit-learn-compatible library that for some reason is still quite unknown.
It's really useful, and it has a lot of combination rules:
https://deslib.readthedocs.io/en/latest/
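A minimal sketch of the usual DESlib pattern from its documentation (X_train, y_train, X_dsel, y_dsel, and X_test are placeholders; module paths may differ between versions):
from sklearn.ensemble import RandomForestClassifier
from deslib.des.knora_e import KNORAE
# Train a pool of base classifiers on the training split
pool = RandomForestClassifier(n_estimators=10).fit(X_train, y_train)
# KNORA-Eliminate picks competent pool members per test sample,
# judged on a held-out dynamic-selection set (X_dsel, y_dsel)
des = KNORAE(pool)
des.fit(X_dsel, y_dsel)
y_pred = des.predict(X_test)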
