Combining scikitlearn's GridsearchCV and lightgbm's mutliclass classifier - python

I am trying to find reliable hyper parameters for training a multiclass classifier, using both lgbm's "gbdt" and scikitlearn's GridsearchCV.
On the feature side of things there is a ~4k x 40 matrix, containing continuous values.
On the labeling side there is a pool of 4 categorical mutually exclusive classes.
To judge whether any given fold is performing well I would like to use lgbm's auc_mu metric, but I'm ok with any at this point. As you can see in the code below I resorted to weighted accuracy instead.
Below is a simplified version of how the gridsearch is initialised.
param_set = {
'n_estimators':[15, 25]
}
clf = lgb.LGBMModel(
boosting_type='gbdt',
num_leaves=31,
max_depth=5,
learning_rate=0.1,
n_estimators=100,
objective='multiclass',
num_class= len(np.unique(training_data.label)),
min_split_gain=0,
min_child_weight=1e-3,
min_child_samples=10,
subsample=1,
subsample_freq=0,
colsample_bytree=0.6,
reg_alpha=0.3,
reg_lambda=0.7,
random_state=42,
n_jobs=2)
gsearch = GridSearchCV(estimator = clf,
param_grid = param_set,
scoring="balanced_accuracy",
error_score='raise',
n_jobs=2,
cv=5,
verbose = 2)
When I try to call the fit function on the GridSearchCV object,
# separate total data into train/validation and test
stratifiedss = StratifiedShuffleSplit(
n_splits = 1, test_size = 0.2, train_size = 0.8, random_state=723)
for train_ind, test_ind in stratifiedss.split(X,y):
train_feature_obs = X.loc[train_ind]
train_labels = y[train_ind]
validation_feature_obs = X.loc[test_ind]
validation_labels = y[test_ind]
# transform data into lgb Dataset
training_data = lgb.Dataset(train_feature_obs, label=train_labels)
# call the GridSearchCV.fit
lgb_model2 = gsearch.fit(training_data.data.reset_index(drop=True), training_data.label)
it returns
ValueError: Classification metrics can't handle a mix of unknown and continuous-multioutput targets
So I am guessing the sklearnGridSearchCV has trouble evaluating the output of lgbmModel.predict().
I tried fitting a lgbmModel separetly and it should return an array with probabilities of the observation for each of the four classes, summing up to 100%.
I looked at:
ValueError: Classification metrics can't handle a mix of unknown and binary targets
I got the warning "UserWarning: One or more of the test scores are non-finite" when revising a toy scikit-learn gridsearchCV example
But that has not been conclusive yet.
How can I enable the sklearn.GridSearchCV to evaluate the performance of each fold of the lgbmModel classifier?
I am mostly confused as to where the "unknown" type is comnig from.
Any help would be much appreciated.
Regards, Robert

Related

Different features give the same Accuracy in SVM

For a classification task I try to compare two models (with SVM) where the difference between the models is just one feature (namely, cult_distance & inst_dist). Even after HP-tuning (using GridSearchCV) both models show the exact same accuracy. When I try a third model with both cult_distance & inst_dist in the model, still the exact same accuracy. My code is below and in the screenshot you see the classification report. As you can see in the report, the minority class is not predicted. I think my model is overfitting but I don't know how to solve this. Note: also when I run the model without any other features than cult_distance & inst_dist, the accuracy is still the same.
features1 = ['inst_dist', 'cult_distance', 'deal_numb', 'acq_sic', 'acq_stat', 'acq_form',
'tar_sic', 'tar_stat', 'tar_form', 'vendor_name', 'pay_method',
'advisor', 'stake', 'deal_value', 'days', 'sic_sim']
X = data.loc[:, features1]
y = data.loc[:, ['deal_status']]
### training the model
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, train_size = .75)
# defining parameter range
rfc = svm.SVC(random_state = 0)
param_grid = {'C': [0.001],
'gamma': ['auto'],
'kernel': ['rbf'],
'degree':[1]}
grid = GridSearchCV(rfc, param_grid = param_grid, refit = True, verbose = 2,n_jobs=-1)
# fitting the model for grid search
grid.fit(X_train, y_train)
# print best parameter after tuning
grid_predictions = grid.predict(X_test)
# print classification report
print(classification_report(y_test, grid_predictions))
EDIT: After a useful comment, I used class_weight to insert in the dataset. However the accuracy of the data is much lower than the baseline of 0.76?
The problem seems to be connected with unbalanced dataset. One class present much more in your dataset, so model learns to predict only that class. There are numerous ways to combat that, the simplest one is to adjust class_weight parameter in your model. In your situation it should be around class_weight = {0 : 3.2, 1 : 1} as second class is 3 times more popular. You can calculate that using compute_class_weight from sklearn.
Hope that was helpful!

Random forest with unbalanced class (positive is minority class), low precision and weird score distributions

I have a very unbalanced dataset (5000 positive, 300000 negative). I am using sklearn RandomForestClassifier to try and predict the probability of the positive class. I have data for multiple years and one of the features I've engineered is the class in the previous year, so I am withholding the last year of the dataset to test on in addition to my test set from within the years I'm training on.
Here is what I've tried (and the result):
Upsampling with SMOTE and SMOTEENN (weird score distributions, see first pic, predicted probabilities for positive and negative class are both the same, i.e., the model predicts a very low probability for most of the positive class)
Downsampling to a balanced dataset (recall is ~0.80 for the test set, but 0.07 for the out-of-year test set from sheer number of total negatives in the unbalanced out of year test set, see second pic)
Leave it unbalanced (weird scoring distribution again, precision goes up to ~0.60 and recall falls to 0.05 and 0.10 for test and out-of-year test set)
XGBoost (slightly better recall on the out-of-year test set, 0.11)
What should I try next? I'd like to optimize for F1, as both false positives and false negatives are equally bad in my case. I would like to incorporate k-fold cross validation and have read I should do this before upsampling, a) should I do this/is it likely to help and b) how can I incorporate this into a pipeline similar to this:
from imblearn.pipeline import make_pipeline, Pipeline
clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
smote_enn = SMOTEENN(smote = sm)
kf = StratifiedKFold(n_splits=5)
pipeline = make_pipeline(??)
pipeline.fit(X_train, ytrain)
ypred = pipeline.predict(Xtest)
ypredooy = pipeline.predict(Xtestooy)
Upsampling with SMOTE and SMOTEENN : I am far from being an expert with those but by upsampling your dataset you might amplify existing noise which induce overfitting. This could explain the fact that your algorithm cannot correctly classify, thus giving the results in the first graph.
I found a little bit more info here and maybe how to improve your results:
https://sci2s.ugr.es/sites/default/files/ficherosPublicaciones/1773_ver14_ASOC_SMOTE_FRPS.pdf
When you downsample you seem to encounter the same overfitting problem as I understand it (at least for the target result of the previous year). It is hard to deduce the reason behind it without a view on the data though.
Your overfitting problem might come from the number of features you use that could add unnecessary noise. You might try to reduce the number of features you use and gradually increase it (using a RFE model). More info here:
https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/
For the models you used, you mention Random Forest and XGBoost, but you did not mention having used simpler model. You could try simpler model and focus on you data engineering.
If you have not try it yet, maybe you could:
Downsample your data
Normalize all your data with a StandardScaler
Test "brute force" tuning of simple models such as Naive Bayes and Logistic Regression
# Define steps of the pipeline
steps = [('scaler', StandardScaler()),
('log_reg', LogisticRegression())]
pipeline = Pipeline(steps)
# Specify the hyperparameters
parameters = {'C':[1, 10, 100],
'penalty':['l1', 'l2']}
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
random_state=42)
# Instantiate a GridSearchCV object: cv
cv = GridSearchCV(pipeline, param_grid=parameters)
# Fit to the training set
cv.fit(X_train, y_train)
Anyway, for your example the pipeline could be (I made it with Logistic Regression but you can change it with another ML algorithm and change the parameters grid consequently):
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
param_grid = {'C': [1, 10, 100]}
clf = LogisticRegression(solver='lbfgs', multi_class = 'auto')
sme = SMOTEENN(smote = SMOTE(k_neighbors = 2), random_state=42)
grid = GridSearchCV(estimator=clf, param_grid = param_grid, score = "f1")
pipeline = Pipeline([('scale', StandardScaler()),
('SMOTEENN', sme),
('grid', grid)])
cv = StratifiedKFold(n_splits = 4, random_state=42)
score = cross_val_score(pipeline, X, y, cv=cv)
I hope this may help you.
(edit: I added score = "f1" in the GridSearchCV)

Keras Regression using Scikit Learn StandardScaler with Pipeline and without Pipeline

I am comparing the performance of two programs about KerasRegressor using Scikit-Learn StandardScaler: one program with Scikit-Learn Pipeline and one program without the Pipeline.
Program 1:
estimators = []
estimators.append(('standardise', StandardScaler()))
estimators.append(('multiLayerPerceptron', KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)))
pipeline = Pipeline(estimators)
log = pipeline.fit(X_train, Y_train)
Y_deep = pipeline.predict(X_test)
Program 2:
scale = StandardScaler()
X_train = scale.fit_transform(X_train)
X_test = scale.fit_transform(X_test)
model_np = KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)
log = model_np.fit(X_train, Y_train)
Y_deep = model_np.predict(X_test)
My problem is that Program 1 can achieve R2 score as 0.98 (3 trials on average) while Program 2 only achieve R2 score as 0.84 (3 trials on average.) Can anyone explain the difference between these two programs?
In the second case, you are calling StandardScaler.fit_transform() on both X_train and X_test. Its wrong usage.
You should call fit_transform() on X_train and then call only transform() on the X_test. Because thats what the Pipeline does.
The Pipeline as the documentation states, will:
fit():
Fit all the transforms one after the other and transform the data,
then fit the transformed data using the final estimator
predict():
Apply transforms to the data, and predict with the final estimator
So you see, it will only apply transform() to the test data, not fit_transform().
So elaborate my point, your code should be:
scale = StandardScaler()
X_train = scale.fit_transform(X_train)
#This is the change
X_test = scale.transform(X_test)
model_np = KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)
log = model_np.fit(X_train, Y_train)
Y_deep = model_np.predict(X_test)
Calling fit() or fit_transform() on test data wrongly scales it to a different scale than what was used on train data. And is a source of change in prediction.
Edit: To answer the question in comment:
See, fit_transform() is just a shortcut function for doing fit() and then transform(). For StandardScaler, fit() doesnt return anything, just learns the mean and standard deviation of data. And then transform() applies the learning on the data to return new scaled data.
So what you are saying leads to below two scenarios:
Scenario 1: Wrong
1) X_scaled = scaler.fit_transform(X)
2) Divide the X_scaled into X_scaled_train, X_scaled_test and run your model.
No need to scale again.
Scenario 2: Wrong (Basically equal to Scenario 1, reversing the scaling and spitting operations)
1) Divide the X into X_train, X_test
2) scale.fit_transform(X) [# You are not using the returned value, only fitting the data, so equivalent to scale.fit(X)]
3.a) X_train_scaled = scale.transform(X_train) #[Equals X_scaled_train in scenario 1]
3.b) X_test_scaled = scale.transform(X_test) #[Equals X_scaled_test in scenario 1]
You can try any of the scenario and maybe it will increase the performance of your model.
But there is one very important thing which is missing in them. When you do scaling on the whole data and then divide them into train and test, it is assumed that you know the test (unseen) data, which will not be true in real world cases. And will give you results which will not be according to real world results. Because in the real world, whole of the data will be our training data. It may also lead to over-fitting because the model has some information about the test data already.
So when evaluating the performance of machine learning models, it is recommended that you keep aside the test data before performing any operations on it. Because it is our unseen data, we know nothing about it. So ideal path of operations would be the one I answered, ie.:
1) Divide X into X_train and X_test (same for y)
2) X_train_scaled = scale.fit_transform(X_train) [#Learn the mean and SD of train data]
3) X_test_scaled = scale.transform(X_test) [#Use the mean and SD learned in step2 to convert test data]
4) Use the X_train_scaled for training the model and X_test_scaled in evaluation.
Hope it makes sense to you.

Model help using Scikit-learn when using GridSearch

As part of the Enron project, built the attached model, Below is the summary of the steps,
Below model gives highly perfect scores
cv = StratifiedShuffleSplit(n_splits = 100, test_size = 0.2, random_state = 42)
gcv = GridSearchCV(pipe, clf_params,cv=cv)
gcv.fit(features,labels) ---> with the full dataset
for train_ind, test_ind in cv.split(features,labels):
x_train, x_test = features[train_ind], features[test_ind]
y_train, y_test = labels[train_ind],labels[test_ind]
gcv.best_estimator_.predict(x_test)
Below model gives more reasonable but low scores
cv = StratifiedShuffleSplit(n_splits = 100, test_size = 0.2, random_state = 42)
gcv = GridSearchCV(pipe, clf_params,cv=cv)
gcv.fit(features,labels) ---> with the full dataset
for train_ind, test_ind in cv.split(features,labels):
x_train, x_test = features[train_ind], features[test_ind]
y_train, y_test = labels[train_ind],labels[test_ind]
gcv.best_estimator_.fit(x_train,y_train)
gcv.best_estimator_.predict(x_test)
Used Kbest to find out the scores and sorted the features and trying a combination of higher and lower scores.
Used SVM with a GridSearch using a StratifiedShuffle
Used the best_estimator_ to predict and calculate the precision and recall.
The problem is estimator is spitting out perfect scores, in some case 1
But when I refit the best classifier on training data then run the test it gives reasonable scores.
My doubt/question was what exactly GridSearch does with the test data after the split using the Shuffle split object we send in to it. I assumed it would not fit anything on Test data, if that was true then when I predict using the same test data, it should not give this high scores right.? since i used random_state value, the shufflesplit should have created the same copy for the Grid fit and also for the predict.
So, is using the same Shufflesplit for two wrong?
GridSearchCV as #Gauthier Feuillen said is used to search best parameters of an estimator for given data.
Description of GridSearchCV:-
gcv = GridSearchCV(pipe, clf_params,cv=cv)
gcv.fit(features,labels)
clf_params will be expanded to get all possible combinations separate using ParameterGrid.
features will now be split into features_train and features_test using cv. Same for labels
Now the gridSearch estimator (pipe) will be trained using features_train and labels_inner and scored using features_test and labels_test.
For each possible combination of parameters in step 3, The steps 4 and 5 will be repeated for cv_iterations. The average of score across cv iterations will be calculated, which will be assigned to that parameter combination. This can be accessed using cv_results_ attribute of gridSearch.
For the parameters which give the best score, the internal estimator will be re initialized using those parameters and refit for the whole data supplied into it(features and labels).
Because of last step, you are getting different scores in first and second approach. Because in the first approach, all data is used for training and you are predicting for that data only. Second approach has prediction on previously unseen data.
Basically the grid search will:
Try every combination of your parameter grid
For each of them it will do a K-fold cross validation
Select the best available.
So your second case is the good one. Otherwise you are actually predicting data that you trained with (which is not the case in the second option, there you only keep the best parameters from your gridsearch)

Is it the same to use 1) StandardScaler & Classifier vs 2) Pipeline(Scalar, Classifier)?

Does running a standard scaler and then a classifier give the same result as using a pipeline?
Hi, I have a classification problem and trying to scale the X variables using scikit learn's StandardScaler(). I see two options of doing this, should they in theory yield the same result? Because I am getting better precision score on my test data set when I use option (1).
(1)
scalar = StandardScaler()
xtrain_ = scalar.fit_transform(xtrain)
RFC = RandomForestClassifier(n_estimators=100)
RFC.fit(xtrain. ytrain)
xtest_ = scalar.transform(xtest)
score = cross_val_score(RFC, xtest_, ytest,cv=10, scoring ='precision')
(2)
RFCs = Pipeline([("scale", StandardScaler()), ("rf", RandomForestClassifier(n_estimators=100))])
RFCs.fit(xtrain, ytrain)
scores = cross_val_score(RFCs, xytest, ytest, cv=10, scoring='precision')
Your option number 2 uses a different data set (xytest) than your version number (1), which uses xtest. Furthermore, your crossvalidation should include the training, not only the prediction.
Apart from that they should be the same, while I advice you to use pipelines.

Categories

Resources