I want to use Support Vector Regression (SVR) for regression as it seems quite powerful when I have several features. As I found a very-easy-to-use implementation in scikit-learn, I'm using that implementation. My questions below are regarding this python package in particular, but if you have any solution in any other language or package please let me know as well.
So, I'm using the following code:
from sklearn.svm import SVR
from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold
svr_rbf = SVR(kernel='rbf')
scoring = ['neg_mean_absolute_error', 'neg_mean_squared_error', 'r2']
scores = cross_validate(svr_rbf, X, y, cv=KFold(10, shuffle=True), scoring=scoring, return_train_score=False)
score = -1 * scores['test_neg_mean_absolute_error']
print("MAE: %.4f (%.4f)" % (score.mean(), score.std()))
score = -1 * scores['test_neg_mean_squared_error']
print("MSE: %.4f (%.4f)" % (score.mean(), score.std()))
score = scores['test_r2']
print("R^2: %.4f (%.4f)" % (score.mean(), score.std()))
As you can see, I can easily use 10-fold cross validation by dividing my data into 10 shuffled folds, and get the MAE, MSE and r^2 for each fold very easily.
However, my big question is: how can I get the p-value, r and adjusted r^2 for my SVR regression model specifically, just like I find in other Python packages such as statsmodels for linear regressions?
I guess I will have to implement the cross validation with KFold by myself in order to achieve this, but I don't think that's a big problem. The issue is that I'm not sure how to get these scores from sklearn's implementation of SVR itself.
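A minimal sketch of the manual KFold loop mentioned above, under some loud assumptions: the data is synthetic (make_regression stands in for the real X and y), "r" and the p-value are taken to mean the Pearson correlation between held-out targets and predictions (scipy.stats.pearsonr), and adjusted r^2 uses the usual formula 1 - (1 - r^2)(n - 1)/(n - p - 1) with n test samples and p features. This is not the coefficient-level p-value that statsmodels reports for linear regression; SVR does not expose one.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold
from sklearn.svm import SVR

# Synthetic stand-in data (assumption); replace with your own X and y.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

fold_stats = []
kf = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    model = SVR(kernel='rbf').fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])

    # Pearson r between held-out targets and predictions, with its p-value.
    r, p_value = pearsonr(y[test_idx], y_pred)

    # Adjusted R^2 for n test samples and p features.
    n, p = X[test_idx].shape
    r2 = r2_score(y[test_idx], y_pred)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

    fold_stats.append((r, p_value, r2, adj_r2))

fold_stats = np.array(fold_stats)
for col, name in enumerate(["r", "p-value", "R^2", "adjusted R^2"]):
    print("%s: %.4f (%.4f)" % (name, fold_stats[:, col].mean(), fold_stats[:, col].std()))
```

Note that adjusted r^2 needs more test samples per fold than features (n > p + 1), which can fail for small datasets with 10 folds.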
While exploring some classification models in scikit-learn, I noticed that the scores I got for log loss and for ROC AUC were consistently lower when performing cross validation than when fitting and predicting on the whole training set (done to check for overfitting), which did not make sense to me.
Specifically, with cross_validate I set the scoring to ['neg_log_loss', 'roc_auc'], and when manually fitting and predicting on the training set I used the metric functions log_loss and roc_auc_score.
To try to figure out what was happening, I wrote code to perform the cross validation manually, so that I could call the metric functions on the various folds myself and compare the results with the ones from cross_validate. As you can see below, I got different results even this way!
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold
kf = KFold(n_splits=3, random_state=42, shuffle=True)
log_reg = LogisticRegression(max_iter=1000)
for train_index, test_index in kf.split(dataset, dataset_labels):
    X_train, X_test = dataset[train_index], dataset[test_index]
    y_train, y_test = dataset_labels_np[train_index], dataset_labels_np[test_index]
    log_reg.fit(X_train, y_train)
    pr = log_reg.predict(X_test)
    ll = log_loss(y_test, pr)
    print(ll)
from sklearn.model_selection import cross_val_score
cv_ll = cross_val_score(log_reg, dataset_prepared_stand, dataset_labels, scoring='neg_log_loss',
                        cv=KFold(n_splits=3, random_state=42, shuffle=True))
print(abs(cv_ll))
Outputs:
4.795481869275026
4.560119170517534
5.589818973403791
[0.409817 0.32309 0.398375]
The outputs of running the same code for ROC AUC are:
0.8609669592272686
0.8678563239907938
0.8367147503682851
[0.925635 0.94032 0.910885]
To be sure I had written the code correctly, I also tried it with 'accuracy' as the scoring for cross validation and accuracy_score as the metric function, and in that case the results are consistent:
0.8611584327086882
0.8679727427597955
0.838160136286201
[0.861158 0.867973 0.83816 ]
Can someone explain to me why the results differ in the case of the log loss and the ROC AUC? Thanks!
Log-loss and auROC both need probability predictions, not the hard class predictions. So change
pr = log_reg.predict(X_test)
to
pr = log_reg.predict_proba(X_test)[:, 1]
(the subscripting is to grab the probabilities for the positive class, and assumes you're doing binary classification).
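To illustrate, here is a small self-contained sketch of the fix. The data is synthetic (make_classification is an assumption standing in for the asker's dataset); the point is only that, once predict_proba is used, the manual per-fold values should agree with what cross_validate reports for 'neg_log_loss' and 'roc_auc' on the same folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import KFold, cross_validate

# Synthetic binary data (assumption); replace with your own features and labels.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

kf = KFold(n_splits=3, shuffle=True, random_state=42)
log_reg = LogisticRegression(max_iter=1000)

manual_ll, manual_auc = [], []
for train_idx, test_idx in kf.split(X, y):
    log_reg.fit(X[train_idx], y[train_idx])
    # Probabilities of the positive class, not hard class labels.
    proba = log_reg.predict_proba(X[test_idx])[:, 1]
    manual_ll.append(log_loss(y[test_idx], proba))
    manual_auc.append(roc_auc_score(y[test_idx], proba))

# Same folds, scored by scikit-learn's built-in scorers.
cv = cross_validate(log_reg, X, y, cv=kf, scoring=['neg_log_loss', 'roc_auc'])

print("log loss:", np.array(manual_ll), -cv['test_neg_log_loss'])
print("ROC AUC: ", np.array(manual_auc), cv['test_roc_auc'])
```

Accuracy was already consistent in the question because it is computed from hard class labels, which is exactly what predict returns.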
I wanted to do cross validation on a regression (non-classification) model and ended up getting mean accuracies of about 0.90. However, I don't know what metric the method uses to compute these accuracies. I know how splitting in k-fold cross validation works; I just don't know the formula that the scikit-learn library uses to calculate the accuracy of prediction (I do know how it works for a classification model). Can someone give me the metric/formula used by sklearn.model_selection.cross_val_score?
Thanks in advance.
from sklearn.model_selection import cross_val_score
def metrics_of_accuracy(classifier, X_train, y_train):
    accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10)
    accuracies.mean()
    accuracies.std()
    return accuracies
By default, sklearn uses accuracy for classification and r2_score for regression when you call the model.score method (the same holds for cross_val_score). So in this case the metric is r2_score, whose formula is
r2 = 1 - (SSE(y_hat)/SSE(y_mean))
where
SSE(y_hat) is the sum of squared errors of the predictions made
SSE(y_mean) is the sum of squared errors when every prediction is the mean of the actual values
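As a quick check of the formula above, here is a small sketch that computes it by hand and compares it with sklearn.metrics.r2_score. The four target/prediction pairs are made-up illustration values, not data from the question.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up illustration values (assumption).
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# SSE(y_hat): squared error of the actual predictions.
sse_model = np.sum((y_true - y_pred) ** 2)
# SSE(y_mean): squared error if every prediction were the mean of y_true.
sse_mean = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - sse_model / sse_mean
r2_sklearn = r2_score(y_true, y_pred)
print(r2_manual, r2_sklearn)
```

Because the baseline is the mean predictor, the score is 0 for a model that does no better than predicting the mean and can be negative for a model that does worse.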
Yes. You can also compute the same metric directly with r2_score from sklearn.metrics:
r2_score(y_true, y_pred). This score is also called the coefficient of determination or R-squared.
The formula is shown in the image linked below:
https://i.stack.imgur.com/USaWH.png
For more on this, see:
https://en.wikipedia.org/wiki/Coefficient_of_determination
I'm building a Random Forest binary classifier in Python on a pre-processed dataset with 4898 instances, a 60-40 stratified split ratio, and 78% of the data belonging to one target label and the rest to the other. What value of n_estimators should I choose in order to achieve the most practically useful / best possible random forest classifier model? I plotted the accuracy vs n_estimators curve using the code snippet below. x_train and y_train are the features and target labels in the training set, and x_test and y_test are the features and target labels in the test set.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
scores = []
for k in range(1, 200):
    rfc = RandomForestClassifier(n_estimators=k)
    rfc.fit(x_train, y_train)
    y_pred = rfc.predict(x_test)
    scores.append(accuracy_score(y_test, y_pred))
import matplotlib.pyplot as plt
%matplotlib inline
# plot the relationship between K and testing accuracy
# plt.plot(x_axis, y_axis)
plt.plot(range(1, 200), scores)
plt.xlabel('Value of n_estimators for Random Forest Classifier')
plt.ylabel('Testing Accuracy')
Here, it is visible that a high value for n_estimators gives a good accuracy score, but the curve fluctuates randomly even for nearby values of n_estimators, so I can't pick the best one precisely. I only want to know about tuning the n_estimators hyperparameter: how should I choose it? Should I use a ROC or CAP curve instead of accuracy_score? Thanks.
See the RandomizedSearchCV example at https://github.com/dnishimoto/python-deep-learning/blob/master/Random%20Forest%20Tennis.ipynb
I used RandomizedSearchCV to find the best params for the Random Forest Classifier.
n_estimators is the number of decision trees to use.
Try using XGBoost to get more accuracy.
parameter_grid = {'n_estimators': [1, 2, 3, 4, 5],
                  'max_depth': [2, 4, 6, 8, 10],
                  'min_samples_leaf': [1, 2, 4],
                  'max_features': [1, 2, 3, 4, 5, 6, 7, 8]}
number_models = 4
random_RandomForest_class = RandomizedSearchCV(
    estimator=pipeline['clf'],
    param_distributions=parameter_grid,
    n_iter=number_models,
    scoring='accuracy',
    n_jobs=2,
    cv=4,
    refit=True,
    return_train_score=True)
random_RandomForest_class.fit(X_train, y_train)
predictions = random_RandomForest_class.predict(X)
print("Accuracy Score", accuracy_score(y, predictions))
print("Best params", random_RandomForest_class.best_params_)
print("Best score", random_RandomForest_class.best_score_)
It is natural that a random forest will stabilize after some number of n_estimators (because, unlike boosting, there is no mechanism to "slow down" the fitting). Since there is no benefit to adding more weak tree estimators beyond that point, you can choose around 50.
Don't use grid search for this case. It is overkill, and since you set the candidate values arbitrarily you may not end up with the optimum number.
For gradient boosting models, scikit-learn provides a staged_predict method with which you can measure the validation error at each stage of training to find the optimum number of trees.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
X_train, X_val, y_train, y_val = train_test_split(X, y)
# try a big number for n_estimators
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=100)
gbrt.fit(X_train, y_train)
# calculate the validation error after each boosting stage
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
# stages are 0-indexed, so add 1 to get the number of trees
bst_n_estimators = np.argmin(errors) + 1
gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)
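Since staged_predict belongs to the boosting estimators and RandomForestClassifier does not have it, a roughly analogous sketch for a random forest is to grow the forest incrementally with warm_start=True and track the out-of-bag (OOB) error as trees are added. The synthetic data and the step of 25 trees are assumptions made for illustration; this swaps in OOB error for the validation error used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption); replace with your own training set.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# warm_start=True keeps the already fitted trees and only adds new ones
# each time n_estimators is raised and fit is called again.
rfc = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)

oob_errors = {}
for n in range(25, 201, 25):
    rfc.set_params(n_estimators=n)
    rfc.fit(X, y)
    oob_errors[n] = 1 - rfc.oob_score_

for n, err in oob_errors.items():
    print("n_estimators=%d  OOB error=%.4f" % (n, err))
```

The idea is to look for the point where the OOB error flattens out rather than the single lowest value, since differences between nearby n_estimators are mostly noise.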
Is it just me, or do the existing answers not really answer your question? In case you are still looking for how to get the accuracy score and the n_estimators you want, maybe I can answer it.
First, you already answered it in your own code, in these lines.
scores = []
for k in range(1, 200):
    rfc = RandomForestClassifier(n_estimators=k)
    rfc.fit(x_train, y_train)
    y_pred = rfc.predict(x_test)
    scores.append(accuracy_score(y_test, y_pred))
As you can see, you already saved each accuracy_score in scores. So you just need to find the maximum value in the scores list.
maxs = max(scores)
maxs_idx = scores.index(maxs)
Then just put the print command in the final line. Because the loop starts at n_estimators=1, the matching n_estimators is the list index plus one.
print(f"Accuracy Score: {maxs} with n_estimators: {maxs_idx + 1}")
I hope your problem has already been solved. I also want to thank you, because your code helped me create a way to find the best estimators too.
I have a very unbalanced dataset (5000 positive, 300000 negative). I am using sklearn RandomForestClassifier to try and predict the probability of the positive class. I have data for multiple years and one of the features I've engineered is the class in the previous year, so I am withholding the last year of the dataset to test on in addition to my test set from within the years I'm training on.
Here is what I've tried (and the result):
Upsampling with SMOTE and SMOTEENN (weird score distributions, see first pic, predicted probabilities for positive and negative class are both the same, i.e., the model predicts a very low probability for most of the positive class)
Downsampling to a balanced dataset (recall is ~0.80 for the test set, but 0.07 for the out-of-year test set from sheer number of total negatives in the unbalanced out of year test set, see second pic)
Leave it unbalanced (weird scoring distribution again, precision goes up to ~0.60 and recall falls to 0.05 and 0.10 for test and out-of-year test set)
XGBoost (slightly better recall on the out-of-year test set, 0.11)
What should I try next? I'd like to optimize for F1, as false positives and false negatives are equally bad in my case. I would also like to incorporate k-fold cross validation, and I have read that I should do this before upsampling. a) Should I do this, and is it likely to help? b) How can I incorporate this into a pipeline similar to this:
from imblearn.pipeline import make_pipeline, Pipeline
clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
smote_enn = SMOTEENN(smote = sm)
kf = StratifiedKFold(n_splits=5)
pipeline = make_pipeline(??)
pipeline.fit(X_train, ytrain)
ypred = pipeline.predict(Xtest)
ypredooy = pipeline.predict(Xtestooy)
Upsampling with SMOTE and SMOTEENN: I am far from being an expert with those, but by upsampling your dataset you might amplify existing noise, which induces overfitting. This could explain why your algorithm cannot classify correctly, giving the results in the first graph.
I found a little bit more info here and maybe how to improve your results:
https://sci2s.ugr.es/sites/default/files/ficherosPublicaciones/1773_ver14_ASOC_SMOTE_FRPS.pdf
As I understand it, you seem to encounter the same overfitting problem when you downsample (at least for the target result of the previous year). It is hard to deduce the reason behind it without a view of the data, though.
Your overfitting problem might come from the number of features you use, which could add unnecessary noise. You might try to reduce the number of features and gradually increase it (using an RFE model). More info here:
https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/
As for models, you mention Random Forest and XGBoost, but you did not mention having tried a simpler model. You could try simpler models and focus on your data engineering.
If you have not tried it yet, maybe you could:
Downsample your data
Normalize all your data with a StandardScaler
Test "brute force" tuning of simple models such as Naive Bayes and Logistic Regression
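from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Define steps of the pipeline
# (the liblinear solver supports both the 'l1' and 'l2' penalties)
steps = [('scaler', StandardScaler()),
         ('log_reg', LogisticRegression(solver='liblinear'))]
pipeline = Pipeline(steps)
# Specify the hyperparameters, prefixed with the pipeline step name
parameters = {'log_reg__C': [1, 10, 100],
              'log_reg__penalty': ['l1', 'l2']}
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=42)
# Instantiate a GridSearchCV object: cv
cv = GridSearchCV(pipeline, param_grid=parameters)
# Fit to the training set
cv.fit(X_train, y_train)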
Anyway, for your example the pipeline could be as follows (I made it with Logistic Regression, but you can swap in another ML algorithm and change the parameter grid accordingly):
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
param_grid = {'C': [1, 10, 100]}
clf = LogisticRegression(solver='lbfgs')
sme = SMOTEENN(smote=SMOTE(k_neighbors=2), random_state=42)
# GridSearchCV takes the metric through its `scoring` argument
grid = GridSearchCV(estimator=clf, param_grid=param_grid, scoring="f1")
pipeline = Pipeline([('scale', StandardScaler()),
                     ('SMOTEENN', sme),
                     ('grid', grid)])
# random_state only takes effect when shuffle=True
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
score = cross_val_score(pipeline, X, y, cv=cv, scoring="f1")
I hope this may help you.
(edit: I added scoring="f1" in the GridSearchCV)
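On the question of whether to split into folds before resampling: a minimal sketch of the idea, written without imbalanced-learn so it runs with scikit-learn alone. Plain random downsampling of the majority class is swapped in for SMOTEENN here, and the data is synthetic; both are assumptions made only to keep the example short. The point it shows is that resampling touches the training part of each fold only, while F1 is measured on the untouched, still imbalanced test part.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data (assumption): roughly 5% positives.
X, y = make_classification(n_samples=3000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
rng = np.random.default_rng(0)

f1_scores = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Resample inside the fold: only the training indices are balanced.
    pos = train_idx[y[train_idx] == 1]
    neg = train_idx[y[train_idx] == 0]
    neg_down = rng.choice(neg, size=len(pos), replace=False)
    balanced_idx = np.concatenate([pos, neg_down])

    clf = RandomForestClassifier(n_estimators=25, random_state=1)
    clf.fit(X[balanced_idx], y[balanced_idx])

    # Evaluate on the original class distribution.
    f1_scores.append(f1_score(y[test_idx], clf.predict(X[test_idx])))

print("F1 per fold:", np.round(f1_scores, 3))
print("mean F1: %.3f" % np.mean(f1_scores))
```

An imblearn Pipeline passed to cross_val_score does the same bookkeeping for you, because its samplers are applied during fit only.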
As a beginner in scikit-learn, and trying to classify the iris dataset, I'm having problems with adjusting the scoring metric from scoring='accuracy' to others like precision, recall, f1 etc., in the cross-validation step. Below is the full code sample (enough to start at # Test options and evaluation metric).
# Load libraries
import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection # for command model_selection.cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)
# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
# Test options and evaluation metric
seed = 7
scoring = 'accuracy'
#Below, we build and evaluate 6 different models
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn: we calculate the cv-scores, their mean and std for each model
#
results = []
names = []
for name, model in models:
    # below, we do k-fold cross-validation
    # (shuffle=True is needed for random_state to take effect; recent scikit-learn versions raise an error otherwise)
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
Now, apart from scoring ='accuracy', I'd like to evaluate other performance metrics for this multiclass classification problem. But when I use, scoring='precision', it raises:
ValueError: Target is multiclass but average='binary'. Please choose another average setting.
My questions are:
1) I guess the above is happening because 'precision' and 'recall' are defined in scikit-learn only for binary classification-is that correct? If yes, then, which command(s) should replace scoring='accuracy' in the code above?
2) If I want to compute the confusion matrix, precision and recall for each fold while performing the k-fold cross validation, what commands should I type?
3) For the sake of experimentation, I tried scoring='balanced_accuracy', only to find:
ValueError: 'balanced_accuracy' is not a valid scoring value.
Why is this happening, when the model evaluation documentation (https://scikit-learn.org/stable/modules/model_evaluation.html) clearly says balanced_accuracy is a scoring method? I'm quite confused here, so actual code showing how to evaluate other performance metrics would be appreciated! Thanks in advance!!
1) I guess the above is happening because 'precision' and 'recall' are defined in scikit-learn only for binary classification-is that correct?
No. Precision & recall are certainly valid for multi-class problems, too - see the docs for precision & recall.
If yes, then, which command(s) should replace scoring='accuracy' in the code above?
The problem arises because, as you can see from the documentation links I have provided above, the default setting for these metrics is for binary classification (average='binary'). In your case of multi-class classification, you need to specify which exact "version" of the particular metric you are interested in (there is more than one); have a look at the relevant page of the scikit-learn documentation, but some valid options for your scoring parameter could be:
'precision_macro'
'precision_micro'
'precision_weighted'
'recall_macro'
'recall_micro'
'recall_weighted'
The documentation link above contains even an example of using 'recall_macro' with the iris data - be sure to check it.
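As a rough sketch of how these scoring strings can be used, the snippet below evaluates several of them in one pass with cross_validate. It loads iris through sklearn.datasets.load_iris instead of the CSV URL from the question and uses a single LogisticRegression model; both are simplifications assumed here for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

X, y = load_iris(return_X_y=True)

# Averaged variants of precision/recall/F1 work for multiclass targets.
scoring = ['accuracy', 'precision_macro', 'recall_macro', 'f1_weighted']
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

cv_results = cross_validate(LogisticRegression(max_iter=1000), X, y,
                            cv=kfold, scoring=scoring)

for name in scoring:
    fold_scores = cv_results['test_' + name]
    print("%s: %f (%f)" % (name, fold_scores.mean(), fold_scores.std()))
```

cross_val_score accepts only one scoring string at a time, so cross_validate is the more convenient call when several metrics are wanted from the same folds.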
2) If I want to compute the confusion matrix, precision and recall for each fold while performing the k-fold cross validation, what commands should I type?
This is not exactly trivial, but you can see a way in my answer for Cross-validation metrics in scikit-learn for each data split
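One possible way to do it, shown here as a sketch and not necessarily the approach of the linked answer: run the folds manually and compute the metrics on each held-out split. StratifiedKFold, macro averaging, and load_iris as the data source are choices assumed for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)
labels = [0, 1, 2]  # fix the label order so every fold gives a 3x3 matrix

fold_reports = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
for train_idx, test_idx in skf.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    fold_reports.append({
        'confusion_matrix': confusion_matrix(y[test_idx], y_pred, labels=labels),
        'precision_macro': precision_score(y[test_idx], y_pred, average='macro'),
        'recall_macro': recall_score(y[test_idx], y_pred, average='macro'),
    })

for i, report in enumerate(fold_reports):
    print("Fold %d: precision=%.3f recall=%.3f" % (
        i, report['precision_macro'], report['recall_macro']))
    print(report['confusion_matrix'])
```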
3) For the sake of experimentation, I tried scoring='balanced_accuracy', only to find:
ValueError: 'balanced_accuracy' is not a valid scoring value.
This is probably because you are using an older version of scikit-learn. balanced_accuracy became available only in v0.20 - you can verify that it is not available in v0.18. Upgrade your scikit-learn to v0.20 or later and you should be fine.