I have a question regarding preprocessing data for GridSearchCV, mainly pertaining to scaling.
So what I hope to achieve is:
Perform scaling (e.g. StandardScaler()) on training data during GridSearchCV, and not on the entire set which consists of both training data and testing data.
If I use a Pipeline, for example:
clf = make_pipeline(StandardScaler(), GridSearchCV(KerasRegressor(), param_grid = [....], cv=10, refit=True))
With this setup I do not get to choose whether StandardScaler() is fit on the training portion of each fold only; as far as I can tell, the scaling is applied to the entire set rather than being refit within each fold. A sketch of the alternative I have in mind is below.
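For reference, here is a sketch of that alternative, where the scaler sits inside a Pipeline that is passed to GridSearchCV, so that (as I understand it) the scaler is refit on the training folds only. I have substituted a placeholder sklearn estimator and parameter grid for my actual KerasRegressor setup:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR  # placeholder for my KerasRegressor
# The scaler and the estimator are steps of one pipeline...
pipe = Pipeline([("scaler", StandardScaler()), ("model", SVR())])
# ...and GridSearchCV cross-validates the whole pipeline, so in each fold the
# scaler is fit on the training portion and only applied to the held-out portion.
param_grid = {"model__C": [0.1, 1, 10]}  # placeholder grid; the step name prefixes the parameter
clf = GridSearchCV(pipe, param_grid=param_grid, cv=10, refit=True)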
Could I seek your advice please?
Thank you.
I'm trying to do sentiment analysis on text documents, but I got lost in the steps.
So my goal is to:
Train SVM, KNN and Naive Bayes algorithms
Use gridsearch to find best parameters
Evaluate models accuracy and find the best one
Use those parameters and get optimal result
Almost every guide I find uses the train_test_split method. But I've read that the holdout method isn't very accurate: you split the data into train and test sets (for example 80:20) and hold out that 20% for testing. So instead I wanted to use K-fold cross-validation. But the question is: how do I use it, and do I still need to split my data into train and test sets?
So far, what I've tried is:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

sentences = svietimas_data['text']
y = svietimas_data['sentiment']

# 80:10:10 split: 10% test, then 10% of the original data as validation
sentences_train, sentences_test, y_train, y_test = train_test_split(sentences, y, test_size=0.1, random_state=1)
sentences_train, sentences_validate, y_train, y_validate = train_test_split(sentences_train, y_train, test_size=0.1111, random_state=1)

classifier = KNeighborsClassifier()
weights = ['uniform', 'distance']
metric = ['euclidean', 'manhattan', 'minkowski']
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range, weights=weights, metric=metric)

# Vectorizer is fit on the training sentences only
vectorizer = TfidfVectorizer(lowercase=False, max_df=100)
vectorizer.fit(sentences_train)
X_train = vectorizer.transform(sentences_train)
X_validate = vectorizer.transform(sentences_validate)
X_test = vectorizer.transform(sentences_test)

grid_search = GridSearchCV(classifier, param_grid, cv=10, scoring='accuracy', return_train_score=False)
grid_search.fit(X_train, y_train)
print(grid_search.best_score_)
print(grid_search.best_params_)
I split the data into train, validation and test sets (80:10:10). I use my training data for the grid search to find the best parameters, and after that I put those parameters into my classifier and use it with the validation and test sets to find the best results, like this:
# apply the parameters found by the grid search (as described above)
classifier = KNeighborsClassifier(**grid_search.best_params_)
classifier.fit(X_train, y_train)
y_pred_validate = classifier.predict(X_validate)
print(classification_report(y_validate, y_pred_validate))
y_pred_test = classifier.predict(X_test)
print(classification_report(y_test, y_pred_test))
But since this method isn't very accurate, could I instead run the grid search on my whole data set and leave it at that? Or, after getting the best parameters with the 80% training set, should I put those parameters into the classifier and run K-fold cross-validation on the full data set? Because by using grid search or K-fold only with the training (80%) data I waste 20% of the data, and as far as I know, if I used 100% of the data, K-fold would split it into, say, k=5 sets, so the data wouldn't count as seen or overfitted?
Or what should my exact steps be to correctly achieve that goal?
You're doing parameter tuning, which is equivalent to training: this is why you must keep a fresh test set to evaluate the final model (otherwise performance could be overestimated).
However, since you're using CV in the first level of training, you only need one extra test set. So the typical process would be like this (a code sketch follows the list):
Split training and test set
Apply CV to the training set for all combinations of parameters (grid search), then pick the best parameters.
Re-train the final model on the full training set with the best parameters.
Evaluate the model on the test set
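A minimal sketch of those four steps with scikit-learn, reusing your sentences/y data and your TF-IDF + KNN setup (names and the exact grid are illustrative):
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# 1. Split training and test set
sentences_train, sentences_test, y_train, y_test = train_test_split(
    sentences, y, test_size=0.2, random_state=1, stratify=y)

# 2. Grid search with CV on the training set only; putting the vectorizer in the
#    pipeline means it is refit on each training fold rather than on all the data
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("knn", KNeighborsClassifier())])
param_grid = {"knn__n_neighbors": list(range(1, 31)), "knn__weights": ["uniform", "distance"]}
grid_search = GridSearchCV(pipe, param_grid, cv=10, scoring="accuracy")
grid_search.fit(sentences_train, y_train)

# 3. refit=True (the default) already re-trains the best pipeline on the full training set
best_model = grid_search.best_estimator_

# 4. Evaluate once on the untouched test set
print(classification_report(y_test, best_model.predict(sentences_test)))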
But since this method isn't very accurate, could I instead run the grid search on my whole data set and leave it at that?
If you don't evaluate on a fresh test set after parameter tuning, you might be overfitting (the "best" parameters could be best just by chance) and you wouldn't know it; the reported performance would be biased.
Or, after getting the best parameters with the 80% training set, should I put those parameters into the classifier and run K-fold cross-validation on the full data set?
It is possible to use CV for the final evaluation stage as well, but it's not that simple: you would have to use nested CV. It's usually not worth it, and it takes a lot more time because the parameter-tuning stage has to be repeated for every training split of the top-level CV.
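For completeness, a nested-CV sketch would look roughly like this (reusing the pipeline and grid from the sketch above), just to show why it is more expensive:
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold

inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

# The whole grid search is re-run inside every outer training fold, and each
# outer test fold only scores the tuned model from that fold
tuner = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring="accuracy")
nested_scores = cross_val_score(tuner, sentences, y, cv=outer_cv, scoring="accuracy")
print(nested_scores.mean(), nested_scores.std())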
Because by using grid search or K-fold only with the training (80%) data I waste 20% of the data
Actually, you don't waste the data. The test set is needed only for the purpose of a reliable evaluation, but once this is done you can perfectly well re-train your model on the full data.
Also, it is a bad sign when 20% of the data matters a lot for performance: it means that the model probably doesn't have a large enough training set, and even the full data might not be enough.
So, I am struggling to understand why, as a common practice, a cross-validation step is applied to a model that has not been trained yet. An example of what I mean can be found here. A piece of the code is pasted below:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# create dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# prepare the cross-validation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# create model
model = LogisticRegression()
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
Questions:
What would be the purpose of the cross-validation at that point?
Does some training procedure take place on any part of that code?
How does RepeatedKFold contribute to tackling an unbalanced dataset (let's assume that this is the case)?
Thanks in advance!
cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
According to the documentation, cross_val_score fits the model using the given cross-validation technique.
In the code above, model holds the estimator that will be fit, and cv holds the cross-validation scheme that cross_val_score will use to build the training and validation sets and evaluate the model.
In other words, those lines are just definitions; the actual training and cross-validation happen inside the cross_val_score function.
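Conceptually, cross_val_score does something close to this loop internally (a simplified sketch that ignores parallelism and the scorer machinery, reusing the model, cv, X and y defined above):
from sklearn.base import clone
from sklearn.metrics import accuracy_score

scores = []
for train_idx, test_idx in cv.split(X, y):
    fold_model = clone(model)                   # fresh, untrained copy of the estimator
    fold_model.fit(X[train_idx], y[train_idx])  # the actual training happens here
    scores.append(accuracy_score(y[test_idx], fold_model.predict(X[test_idx])))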
How does RepeatedKFold contribute to tackling an unbalanced dataset (let's assume that this is the case)?
K-fold CV generally doesn't tackle an unbalanced dataset; it just ensures that the result is not biased by a particular choice of the training/validation split.
Repeated k-fold cross-validation provides a way to improve the estimated performance of a machine learning model. This involves simply repeating the cross-validation procedure multiple times and reporting the mean result across all folds from all runs. This mean result is expected to be a more accurate estimate of the true unknown underlying mean performance of the model on the dataset, as calculated using the standard error.
If you want to tackle an unbalanced dataset, you have to use a better metric than accuracy, such as 'balanced_accuracy' or 'roc_auc', and make sure both the training and validation sets contain both positive and negative cases.
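For example, a sketch with the same model and data as above, but with stratified folds and a metric that is not dominated by the majority class:
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# stratification keeps the class ratio the same in every fold
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='balanced_accuracy', cv=cv, n_jobs=-1)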
I have a dataset with 450,000 data points, 12 features, and a label (0 or 1). I am using Python's imblearn library because my dataset is imbalanced (ratio 1:50, class 1 is the minority). I am using EasyEnsembleClassifier as the classifier. My problem is that I get high recall but very low precision, as you can see from the image below (90% recall, 8% precision, 14% F1 score).
Here is my code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from dask_ml.preprocessing import RobustScaler
from imblearn.ensemble import EasyEnsembleClassifier
from sklearn.metrics import classification_report, f1_score, accuracy_score, precision_score, confusion_matrix
from sklearn import metrics
df = pd.read_csv(...)
X = df[['features...']]
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = RobustScaler()
X_train = scaler.fit_transform(X_train)
clf = EasyEnsembleClassifier(n_estimators=50, n_jobs=-1, sampling_strategy = 1.0)
clf.fit(X_train, y_train)
X_test = scaler.transform(X_test)
y_pred = clf.predict(X_test)
------code for report------
.............
Output: (screenshot of the classification report; roughly 90% recall, 8% precision and 14% F1 for the minority class)
I tried different scalers, namely MinMaxScaler and StandardScaler. I tried changing the train-test split ratio and different parameters of EasyEnsembleClassifier. I also tried BalancedRandomForestClassifier from the same library, but the results are the same. Changing the number of estimators in the classifier also doesn't change the result.
What is the reason for these results? What can I do to improve precision without damaging recall? It looks like I am doing something wrong in my code or missing an important concept.
Edit:
I still couldn't figure out the true reason for my problem, but since no one answered my question, here are some ideas about what could be causing this behaviour, in case someone else runs into a similar problem:
Most probably my dataset is poorly labeled. It is possible that the model cannot distinguish the classes because they are very alike. I will try to generate some synthetic data and train my model again.
I did not test this, but some features may be harming the model. I need to inspect the data visually to find out if there is correlation between features and remove some of them, but I doubt this is the problem, because boosting classifiers should handle it automatically by weighting each feature.
Also, 12 features may not be enough in my case; I may need more. Although it is not easy to generate more features for my dataset, I will think about it.
Finally, maybe undersampling is not suited to my dataset. I will give oversampling techniques or SMOTE a shot if I feel desperate enough.
You could try other ensemble methods for class-imbalance learning. SMOTEBoost is one such method: it combines boosting with data sampling, essentially injecting the SMOTE technique at each boosting iteration.
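SMOTEBoost itself is not part of imbalanced-learn, but a rough approximation you could prototype with the libraries you already use is to chain SMOTE with a boosting classifier in an imblearn Pipeline (the oversampling is then applied once to the training data, not at every boosting round as in real SMOTEBoost). Reusing your X_train/y_train/X_test:
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import AdaBoostClassifier

# Oversample the minority class, then boost on the resampled training data
pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("boost", AdaBoostClassifier(n_estimators=50, random_state=42)),
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)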
This article could be of interest to you.
I'd like to manually analyse the errors that my ML model (whichever it is) makes, comparing its predictions with the labels. From my understanding, this should be done on instances of the validation set, not the training set.
I trained my model through GridSearchCV, extracting the best_estimator_, i.e. the one that performed best during cross-validation and was then retrained on the entire training set.
Therefore, my question is: how can I get predictions on a validation set to compare with the labels (without touching the test set), if my best model is re-trained on the whole training set?
One solution would be to split the training set further before performing the GridSearchCV, but I guess there must be a better solution, for example getting the predictions made on the validation sets during the cross-validation. Is there a way to get these predictions for the best estimator?
Thank you!
You can compute a validation curve with the model that you obtained from GridSearchCV. Read the documentation here. You will just need to define arrays for the hyperparameters that you want to inspect and a scoring function. Here is an example:
train_scores, valid_scores = validation_curve(model, X_train, y_train, param_name="alpha", param_range=np.logspace(-7, 3, 3), cv=5, scoring="accuracy")
I understood my conceptual error, I'll post here since maybe it can help some other ML beginners as me!
The solution that should work is to use cross_val_predict, splitting the folds in the same way as done in GridSearchCV. In fact, cross_val_predict re-trains the model on each fold and does not use the previously trained model! So the result is the same as getting the predictions on the validation sets during GridSearchCV.
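A minimal sketch of what I mean, with estimator, param_grid, X_train and y_train as placeholders, and the same KFold object passed to both GridSearchCV and cross_val_predict:
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict
from sklearn.metrics import classification_report

cv = KFold(n_splits=5, shuffle=True, random_state=0)
grid = GridSearchCV(estimator, param_grid, cv=cv).fit(X_train, y_train)

# Re-fits the best configuration fold by fold and collects the predictions made
# on each held-out fold, so they can be compared directly with y_train
val_pred = cross_val_predict(grid.best_estimator_, X_train, y_train, cv=cv)
print(classification_report(y_train, val_pred))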
The following code combines cross_validate with GridSearchCV to perform a nested cross-validation for an SVC on the iris dataset.
(Modified example of the following documentation page:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html#sphx-glr-auto-examples-model-selection-plot-nested-cross-validation-iris-py.)
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_validate, KFold
import numpy as np
np.set_printoptions(precision=2)
# Load the dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
# Set up possible values of parameters to optimize over
p_grid = {"C": [1, 10], "gamma": [.01, .1]}
# We will use a Support Vector Classifier with "rbf" kernel
svm = SVC(kernel="rbf")
# Choose techniques for the inner and outer loop of nested cross-validation
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)
# Perform nested cross-validation
clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv)  # note: the iid parameter was removed in recent scikit-learn versions
clf.fit(X_iris, y_iris)
best_estimator = clf.best_estimator_
cv_dic = cross_validate(clf, X_iris, y_iris, cv=outer_cv, scoring=['accuracy'], return_estimator=False, return_train_score=True)
mean_val_score = cv_dic['test_accuracy'].mean()
print('nested_train_scores: ', cv_dic['train_accuracy'])
print('nested_val_scores: ', cv_dic['test_accuracy'])
print('mean score: {0:.2f}'.format(mean_val_score))
cross_validate splits the data set into a training and a test set for each fold. In each fold, the input estimator is then trained on the training set associated with that fold. The estimator passed in here is clf, a parameterized GridSearchCV estimator, i.e. an estimator that cross-validates itself again.
I have three questions about the whole thing:
If clf is used as the estimator for cross_validate, does it (in the course of the GridSearchCV cross validation) split the above mentioned training set into a subtraining set and a validation set in order to determine the best hyper parameter combination?
Out of all models tested via GridSearchCV, does cross_validate validate only the model stored in the best_estimator_ attribute?
Does cross_validate train a model at all (if so, why?) or is the model stored in best_estimator_ validated directly via the test set?
To make it clearer how the questions are meant, here is an illustration of how I imagine the double cross validation at the moment.
If clf is used as the estimator for cross_validate, does it split the above mentioned training set into a subtraining set and a validation set in order to determine the best hyper parameter combination?
Yes. As you can see here at line 230, the training set is again split into a sub-training set and a validation set (specifically at line 240).
Update: Yes, when you pass the GridSearchCV classifier into cross_validate, it will again split the training set into train and test sets. Here is a link describing this in more detail. Your diagram and assumption are correct.
Out of all models tested via GridSearchCV, does cross_validate train & validate only the model stored in the variable best_estimator?
Yes. As you can see from the answers here and here, GridSearchCV returns the best_estimator_ in your case (since the refit parameter is True by default). However, this best estimator has to be trained again.
Does cross_validate train a model at all (if so, why?) or is the model stored in best_estimator_ validated directly via the test set?
As for your third and final question: yes, it trains an estimator and returns it if return_estimator is set to True (see this line). This makes sense, since how else could it return the scores without training an estimator in the first place?
Update
The reason the model is trained again is that cross_validate does not assume you are passing in the best classifier with the optimal parameters. In this specific case you are passing in a classifier from GridSearchCV, but cross_validate treats whatever estimator it receives as untrained and fits it on each fold. In other words: yes, in your case one might argue it shouldn't need to train again, since you have already done cross-validation with GridSearchCV and are using the best estimator. However, there is no way for cross_validate to know this, so it assumes you are passing in an un-optimized, untrained estimator, trains it again on each fold, and returns the corresponding scores.
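Roughly speaking, for the GridSearchCV estimator in your example, cross_validate does something like this per outer fold (a simplified sketch reusing clf, outer_cv, X_iris and y_iris from your code):
from sklearn.base import clone

for train_idx, test_idx in outer_cv.split(X_iris):
    fold_clf = clone(clf)  # a fresh, unfitted copy of the whole GridSearchCV object
    # the inner grid search is re-run on this outer training fold, and only the
    # resulting refitted best model is scored on the outer test fold
    fold_clf.fit(X_iris[train_idx], y_iris[train_idx])
    print(fold_clf.score(X_iris[test_idx], y_iris[test_idx]))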