Train Loss and validation loss GradientBoostingClassifier - python

I am learning to do classification on the Cover Type data, which has 7 classes. I train my model with GradientBoostingClassifier from scikit-learn. When I plot my loss function, it looks like this:
Does this kind of plot show that my model suffers from high variance? If so, what should I do? And I don't know why, between iterations 200 and 500, the plot is shaped like a rectangle.
(EDIT)
I'm not sure what's wrong with my code, because I just used the regular code to fit the training data. I'm using a Jupyter notebook, so I'm just going to provide the code:
Y = train["Cover_Type"]
X = train.drop(["Cover_Type"], axis=1)

# split into training data and cross-validation (hold-out) data
from sklearn.model_selection import train_test_split
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.3, random_state=42)

from sklearn.ensemble import GradientBoostingClassifier

params = {'n_estimators': 1000, 'learning_rate': 0.3, 'max_features': 'sqrt'}
dtree = GradientBoostingClassifier(**params)
dtree.fit(X_train, Y_train)

# check the F1-score
from sklearn.metrics import f1_score
Y_pred = dtree.predict(X_val)  # predict the cross-validation data with the fitted model
print(Y_pred)
score = f1_score(Y_val, Y_pred, average="micro")
print("Gradient Boosting Tree F1-score: " + str(score))  # I got 0.86 F1-score

import numpy as np
import matplotlib.pyplot as plt

# Plot training deviance
# compute validation set deviance at each boosting stage
val_score = np.zeros((params['n_estimators'],), dtype=np.float64)
for i, Y_pred in enumerate(dtree.staged_predict(X_val)):
    val_score[i] = dtree.loss_(Y_val, Y_pred.reshape(-1, 1))

plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title('Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, dtree.train_score_, 'b-',
         label='Training Set Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, val_score, 'r-',
         label='Validation Set Deviance')
plt.legend(loc='upper right')
plt.xlabel('Boosting Iterations')
plt.ylabel('Deviance')

There are several issues, which I will explain one by one; I have also added the corrected code for your example.
The staged_predict(X) method should NOT be used
staged_predict(X) outputs the predicted class instead of the predicted probabilities, so it is not correct to use it here.
One could (where the context allows) use the staged_decision_function(X) method and pass the computed decisions at each stage to the model.loss_ attribute. But in this example it does not work (the loss computed from the staged decisions increases, while the actual loss decreases).
You should use staged_predict_proba(X) with the cross-entropy loss
You should use staged_predict_proba(X), and you also need to define a function that calculates the cross-entropy loss at each stage.
I have provided the code below. Note that I set the verbosity to 2; you can then see that the sklearn training loss at each stage is the same as our loss (a sanity check that our approach works correctly).
Why you get big jumps
I think the reason is that the GBC becomes very confident and predicts a label, say 1, with probability one, while the true label is actually something else, say 2. This creates big jumps (the cross-entropy goes to infinity). In such a scenario you should change your GBC parameters.
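To see why a single over-confident wrong prediction dominates the curve, here is a minimal sketch (not part of the original answer) of how the per-sample cross-entropy -log(p) blows up as the probability assigned to the true class approaches zero:
import numpy as np

# per-sample cross-entropy -log(p) for the probability assigned to the TRUE class
for p in [0.9, 0.5, 0.1, 1e-3, 1e-9]:
    print(f"p(true class) = {p:<8} -> loss = {-np.log(p):.2f}")
# the last values dwarf the others, which shows up as sudden jumps in the summed loss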
The code and the plot are given below.
The code is:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier


def _cross_entropy_like_loss(model, input_data, targets, num_estimators):
    loss = np.zeros((num_estimators, 1))
    for index, predict in enumerate(model.staged_predict_proba(input_data)):
        loss[index, :] = -np.sum(np.log([predict[sample_num, class_num - 1]
                                         for sample_num, class_num in enumerate(targets)]))
        print(f'ce loss {index}:{loss[index, :]}')
    return loss


covtype = fetch_covtype()
X = covtype.data
Y = covtype.target
n_estimators = 10
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.3, random_state=42)
clf = GradientBoostingClassifier(n_estimators=n_estimators, learning_rate=0.3, verbose=2)
clf.fit(X_train, Y_train)

tr_loss_ce = _cross_entropy_like_loss(clf, X_train, Y_train, n_estimators)
test_loss_ce = _cross_entropy_like_loss(clf, X_val, Y_val, n_estimators)

plt.figure()
plt.plot(np.arange(n_estimators) + 1, tr_loss_ce, '-r', label='training_loss_ce')
plt.plot(np.arange(n_estimators) + 1, test_loss_ce, '-b', label='val_loss_ce')
plt.ylabel('Error')
plt.xlabel('num_components')
plt.legend(loc='upper right')
The output of the console is shown below; from it you can easily verify that the approach is correct.
Iter Train Loss Remaining Time
1 482434.6631 1.04m
2 398501.7223 55.56s
3 351391.6893 48.51s
4 322290.3230 41.60s
5 301887.1735 34.65s
6 287438.7801 27.72s
7 276109.2008 20.82s
8 268089.2418 13.84s
9 261372.6689 6.93s
10 256096.1205 0.00s
ce loss 0:[ 482434.6630936]
ce loss 1:[ 398501.72228276]
ce loss 2:[ 351391.68933547]
ce loss 3:[ 322290.32300604]
ce loss 4:[ 301887.17346783]
ce loss 5:[ 287438.7801033]
ce loss 6:[ 276109.20077844]
ce loss 7:[ 268089.2418214]
ce loss 8:[ 261372.66892149]
ce loss 9:[ 256096.1205235]
The plot is here.

It seems like you have several issues. It's hard to say for sure, because you don't provide any code.
Does my model suffer from high variance?
First, your model is overfitting from the start. You can tell because your validation loss is increasing while your training loss is decreasing. What's interesting is that your validation loss increases from the very start, which suggests that your model is not working. So to answer your question: yes, it suffers from high variance.
What should I do?
Are you sure there is a trend in your data? The fact that the validation loss increases from the very start hints that either this model does not apply to your data at all, your data does not have a trend, or you have issues with your code. Perhaps try other models, and make sure your code is correct. Again, it's hard to say without a minimal example.
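Not part of the original answer, but one concrete lever against this kind of overfitting with gradient boosting is to lower the learning rate and let scikit-learn stop adding trees once an internal validation score stops improving (the n_iter_no_change / validation_fraction parameters). A minimal sketch, assuming the X_train and Y_train from the question:
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(
    n_estimators=1000,        # upper bound; early stopping usually ends far sooner
    learning_rate=0.05,       # much smaller than the 0.3 used in the question
    max_features='sqrt',
    validation_fraction=0.1,  # 10% of the training data held out internally
    n_iter_no_change=10,      # stop when 10 consecutive stages bring no improvement
    random_state=42,
)
clf.fit(X_train, Y_train)
print("stages actually fitted:", clf.n_estimators_)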
The strange rectangle
This looks strange. Either there is an issue with the data in your validation set (since this effect does not show up on the training curve), or you just have an issue with your code. If you provide a sample, we could probably help you more.

Related

Looking for ways and experiences to improve Keras model accuracy

I am an electrical engineer and I am looking for a solution to calculate the DC current of a permanent synchronous motor. So I decided to try ANN solutions with Keras and so on. Long story short, I'll show you a screenshot of some measured signals.
The first 5 signals are the measured signals. The last one is the DC current, which I want to estimate. Its value was recorded with the help of a current clamp. Okay, I started building a model in Python and tried some things that I assumed would increase the accuracy of the model. But after all that, I am not getting good results from the model, and my hope is that maybe I am choosing the wrong parameters or a model that is not ideal for this purpose.
Here is my code:
import numpy as np
from keras.layers import Dense, LSTM
from keras.models import Sequential
from keras.callbacks import EarlyStopping
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from matplotlib import pyplot as plt
import seaborn as sns
# Import the data and assign the input (X) and output (Y) columns
df = pd.read_csv('train_data.csv')
df = df[['rpm','iq','uq','udc','idc']]
X = df[df.columns[:-1]]
Y = df.idc
plt.figure()
sns.heatmap(df.corr(),annot=True)
plt.show()
# Split the data into input (x) training and testing data, and output (y) training and testing data,
# with training data being 80% of the data, and testing data being the remaining 20% of the data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)#, shuffle=True)
# Scale both training and testing input data
X_train = preprocessing.maxabs_scale(X_train)
X_test = preprocessing.maxabs_scale(X_test)
model = Sequential()
model.add(Dense(4, input_shape=(4,)))
model.add(Dense(4, input_shape=(4,)))
model.add(Dense(1, input_shape=(4,)))
model.compile(optimizer="adam", loss="msle", metrics=['mean_squared_logarithmic_error','accuracy'])
# Pass several parameters to 'EarlyStopping' function and assign it to 'earlystopper'
earlystopper = EarlyStopping(monitor='val_loss', min_delta=0, patience=15, verbose=1, mode='auto')
model.summary()
history = model.fit(X_train, y_train, epochs = 2000, validation_split = 0.3, verbose = 2, callbacks = [earlystopper])
# Runs model (the one with the activation function, although this doesn't really matter as they perform the same)
# with its current weights on the training and testing data
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# Calculates and prints r2 score of training and testing data
print("The R2 score on the Train set is:\t{:0.3f}".format(r2_score(y_train, y_train_pred)))
print("The R2 score on the Test set is:\t{:0.3f}".format(r2_score(y_test, y_test_pred)))
df = pd.read_csv('test_two_data.csv')
df = df[['rpm','iq','uq','udc','idc']]
X = df[df.columns[:-1]]
Y = df.idc
X_validate = preprocessing.maxabs_scale(X)
y_pred = model.predict(X_validate)
plt.plot(Y)
plt.plot(y_pred)
plt.show()
(weight_0,bias_0) = model.layers[0].get_weights()
(weight_1,bias_1) = model.layers[1].get_weights()
One limitation is that I can't use LSTM layers or other complex algorithms because I need to implement the trained model in a microcontroller on a motor application later.
I hope you can give me some advice to make my model a little more accurate.
Finally, here is a figure showing the worst prediction performance. Orange is the prediction and blue is the measured current.
The training dataset was this one.
The correlation between the individual values can be found here. Since the values of id and ud have no correlation to idc, I decided to delete them.
The most important thing to keep in mind when trying to improve the accuracy of the model is to ALWAYS normalise the input data, which basically means rescaling real-valued numeric attributes into the range 0 to 1. I am not able to understand the way you are providing the training data to the model. Could you please explain that? It would help in understanding and identifying the scope for higher accuracy.
Now, if we talk about parameters, I would suggest adding a tuning algorithm for the hyperparameters to get an optimized value for each of them.
It is always good practice to include hidden layers, which can provide better feature extraction.
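As an illustration of the normalisation advice above (a sketch, not from the original answer), scikit-learn's MinMaxScaler rescales each column into [0, 1]; this assumes the X_train and X_test arrays from the question:
from sklearn.preprocessing import MinMaxScaler

# fit the scaler on the training data only, then apply the same transform to the test data
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)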

How to score and measure accuracy on a model in python (Titanic Dataset, RandomForestRegressor)

I've created a model trained on the Titanic dataset, and I want to see an accuracy percentage for my model. I've done this before, but sadly I do not remember how. I looked on the internet and couldn't find anything; either I just entered the wrong words, or there isn't anything there.
# the tts function is `train_test_split` from `sklearn.model_selection`
train_X, val_X, train_y, val_y = tts(X, y, random_state = 0) # y is the state of survival
forest_model = RandomForestRegressor(random_state=0)
forest_model.fit(train_X, train_y)
val_predictions = forest_model.predict(val_X)
How can I calculate the accuracy?
I wonder why you are using RandomForestRegressor, as the Titanic dataset can be formulated as a binary classification problem. Assuming it is a mistake, to measure the accuracy of a RandomForestClassifier you can do:
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(val_y, val_predictions)
However, it is often better to use K-fold cross-validation, which gives you a more reliable accuracy estimate:
>>> from sklearn.model_selection import cross_val_score
>>> cross_val_score(forest_model, X, y, cv=10, scoring='accuracy') #10-fold cross validation, notice that I am giving X,y not X_train, y_train
K-fold cross-validation gives you 10 accuracy values, as it divides the data into 10 folds (i.e. parts). Then you can get the mean and standard deviation of the accuracy values as follows:
>>> import numpy as np
>>> accuracy = np.array(cross_val_score(forest_model, X, y, cv=10, scoring='accuracy'))
>>> accuracy.mean() #mean of accuracies
0.81
>>> accuracy.std() #standard deviation of accuracies
0.04
You can also use other scoring metrics such as F1-score, precision, recall, cohen_kappa_score, etc.
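For example (a sketch, not from the original answer, assuming forest_model is the RandomForestClassifier suggested above), the same cross_val_score call with a different scoring metric:
>>> from sklearn.model_selection import cross_val_score
>>> cross_val_score(forest_model, X, y, cv=10, scoring='f1')  # per-fold F1 instead of accuracy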
You can also calculate the model score to check its performance; for a classifier, score returns the accuracy. Note that it takes the validation features and the true labels, not the predictions:
forest_model.score(val_X, val_y)

How to choose n_estimators in RandomForestClassifier?

I'm building a random forest binary classifier in Python on a pre-processed dataset with 4898 instances, a 60-40 stratified split ratio, and 78% of the data belonging to one target label and the rest to the other. What value of n_estimators should I choose in order to achieve the most practically useful / best possible random forest classifier model? I plotted the accuracy vs. n_estimators curve using the code snippet below. x_train and y_train are the features and target labels of the training set, and x_test and y_test are the features and target labels of the test set.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

scores = []
for k in range(1, 200):
    rfc = RandomForestClassifier(n_estimators=k)
    rfc.fit(x_train, y_train)
    y_pred = rfc.predict(x_test)
    scores.append(accuracy_score(y_test, y_pred))

import matplotlib.pyplot as plt
%matplotlib inline
# plot the relationship between K and testing accuracy
# plt.plot(x_axis, y_axis)
plt.plot(range(1, 200), scores)
plt.xlabel('Value of n_estimators for Random Forest Classifier')
plt.ylabel('Testing Accuracy')
Here it is visible that a high value of n_estimators gives a good accuracy score, but the curve fluctuates randomly even for nearby values of n_estimators, so I can't pick the best one precisely. I only want to know about tuning the n_estimators hyperparameter: how should I choose it? Should I use a ROC or CAP curve instead of accuracy_score? Thanks.
See the RandomizedSearchCV example at https://github.com/dnishimoto/python-deep-learning/blob/master/Random%20Forest%20Tennis.ipynb
I used RandomizedSearchCV to find the best params for the Random Forest Classifier.
n_estimators is the number of decision trees to use.
Try using XGBoost to get more accuracy.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score

parameter_grid = {'n_estimators': [1, 2, 3, 4, 5],
                  'max_depth': [2, 4, 6, 8, 10],
                  'min_samples_leaf': [1, 2, 4],
                  'max_features': [1, 2, 3, 4, 5, 6, 7, 8]}
number_models = 4
random_RandomForest_class = RandomizedSearchCV(
    estimator=pipeline['clf'],   # the RandomForestClassifier step of an existing pipeline
    param_distributions=parameter_grid,
    n_iter=number_models,
    scoring='accuracy',
    n_jobs=2,
    cv=4,
    refit=True,
    return_train_score=True)
random_RandomForest_class.fit(X_train, y_train)
predictions = random_RandomForest_class.predict(X)
print("Accuracy Score", accuracy_score(y, predictions))
print("Best params", random_RandomForest_class.best_params_)
print("Best score", random_RandomForest_class.best_score_)
It is natural that a random forest will stabilize after some number of estimators (because there is no mechanism to "slow down" the fitting, unlike boosting). Since there is no benefit to adding more weak tree estimators, you can choose around 50.
Don't use grid search for this case - it is overkill - and since you set the candidate values arbitrarily, you may not end up with the optimum number anyway.
For gradient boosting there is a staged_predict method in scikit-learn, with which you can measure the validation error at each stage of training to find the optimum number of trees.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor

X_train, X_val, y_train, y_val = train_test_split(X, y)

# try a big number for n_estimators
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=100)
gbrt.fit(X_train, y_train)

# calculate the error on the validation set after each stage
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = np.argmin(errors) + 1

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)
Is it only me, or do the existing answers not really answer the question? In case you are still looking for how to get the accuracy score and the n_estimators you want, maybe I can answer it.
First, you have already answered it yourself in your code, in these lines:
scores = []
for k in range(1, 200):
    rfc = RandomForestClassifier(n_estimators=k)
    rfc.fit(x_train, y_train)
    y_pred = rfc.predict(x_test)
    scores.append(accuracy_score(y_test, y_pred))
As you can see, you have already saved the accuracy scores into scores. So you just need to find the maximum value in the scores list (note that the list is 0-indexed while n_estimators starts at 1, hence the + 1 below):
maxs = max(scores)
maxs_idx = scores.index(maxs)
Then just put a print command in the final lines:
print(f"Accuracy Score: {maxs} with n_estimators: {maxs_idx + 1}")
I hope your problem has already been solved. I also want to thank you, because your code helped me find a way to get the best estimator too.

KFold Cross Validation does not fix overfitting

I separate the features into X and y, then I preprocess my train and test data after splitting them with k-fold cross-validation. After that I fit the train data to my Random Forest Regressor model and calculate the confidence score. Why do I preprocess after splitting? Because people told me it's more correct to do it that way, and I'm keeping that principle for the sake of my model's performance.
This is my first time using k-fold cross-validation, because my model's score overfits and I thought I could fix it with cross-validation. I'm still confused about how to use it; I have read the documentation and some articles, but I do not really grasp how to apply it to my model. I tried anyway, and my model still overfits. Whether I use train-test split or cross-validation, my model's score is still 0.999. I do not know what my mistake is, since I'm very new to this method; I think maybe I did it wrong, so it does not fix the overfitting. Please tell me what's wrong with my code and how to fix it.
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
import scipy.stats as ss

avo_sales = pd.read_csv('avocados.csv')

avo_sales.rename(columns={'4046': 'small PLU sold',
                          '4225': 'large PLU sold',
                          '4770': 'xlarge PLU sold'},
                 inplace=True)

avo_sales.columns = avo_sales.columns.str.replace(' ', '')

x = np.array(avo_sales.drop(['TotalBags', 'Unnamed:0', 'year', 'region', 'Date'], axis=1))
y = np.array(avo_sales.TotalBags)

# X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

kf = KFold(n_splits=10)

for train_index, test_index in kf.split(x):
    X_train, X_test, y_train, y_test = x[train_index], x[test_index], y[train_index], y[test_index]

    impC = SimpleImputer(strategy='most_frequent')
    X_train[:, 8] = impC.fit_transform(X_train[:, 8].reshape(-1, 1)).ravel()
    X_test[:, 8] = impC.transform(X_test[:, 8].reshape(-1, 1)).ravel()

    imp = SimpleImputer(strategy='median')
    X_train[:, 1:8] = imp.fit_transform(X_train[:, 1:8])
    X_test[:, 1:8] = imp.transform(X_test[:, 1:8])

    le = LabelEncoder()
    X_train[:, 8] = le.fit_transform(X_train[:, 8])
    X_test[:, 8] = le.transform(X_test[:, 8])

    rfr = RandomForestRegressor()
    rfr.fit(X_train, y_train)

    confidence = rfr.score(X_test, y_test)
    print(confidence)
The reason you're overfitting is that a non-regularized tree-based model will adjust to the data until all training samples are correctly classified. See for example this image:
As you can see, this does not generalize well. If you don't specify arguments that regularize the trees, the model will fit the test data poorly, because it will basically just learn the noise in the training data. There are many ways to regularize trees in sklearn; you can find them here. For instance:
max_features
min_samples_leaf
max_depth
With proper regularization, you can get a model that generalizes well to the test data. Look at a regularized model for instance:
To regularize your model, instantiate the RandomForestRegressor() estimator like this:
rfr = RandomForestRegressor(max_features=0.5, min_samples_leaf=4, max_depth=6)
These argument values are arbitrary; it's up to you to find the ones that fit your data best. You can use domain-specific knowledge to choose these values, or a hyperparameter tuning search like GridSearchCV or RandomizedSearchCV (a sketch follows below).
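A minimal sketch of such a search (not part of the original answer; the parameter ranges are only illustrative and X_train, y_train are assumed to be one of the preprocessed training folds from the question):
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# illustrative search space; tailor it to your data
param_distributions = {
    'max_features': [0.3, 0.5, 0.8],
    'min_samples_leaf': [1, 2, 4, 8],
    'max_depth': [4, 6, 8, None],
}
search = RandomizedSearchCV(RandomForestRegressor(),
                            param_distributions=param_distributions,
                            n_iter=20, cv=5, random_state=42)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)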
Other than that, imputing the mean and median might introduce a lot of noise into your data. I would advise against it unless you have no other choice.
While @NicolasGervais' answer gets to the bottom of why your specific model is overfitting, I think there is a conceptual misunderstanding with regard to cross-validation in the original question; you seem to think that:
Cross-validation is a method that improves the performance of a machine learning model.
But this is not the case.
Cross validation is a method that is used to estimate the performance of a given model on unseen data. By itself, it cannot improve the accuracy.
In other words, the respective scores can tell you if your model is overfitting the training data, but simply applying cross-validation does not make your model better.
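A minimal sketch of using cross-validation purely as a measurement (not part of the original answer; it assumes a fully numeric feature matrix X and target y, i.e. after the imputation and encoding steps above):
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

# one R^2 score per fold; a large gap to the training score indicates overfitting
scores = cross_val_score(RandomForestRegressor(), X, y, cv=10)
print(scores.mean(), scores.std())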
Example:
Let's look at a dataset with 10 points, and fit a line through it:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

X = np.random.randint(0, 10, 10)
Y = np.random.randint(0, 10, 10)

fig = plt.figure(figsize=(1, 10))


def line(x, slope, intercept):
    return slope * x + intercept


for i in range(5):
    # note that this is not technically 5-fold cross-validation
    # because I allow the same datapoint to go into the test set
    # several times. For illustrative purposes it is fine imho.
    test_indices = np.random.choice(np.arange(10), 2)
    train_indices = list(set(range(10)) - set(test_indices))

    # get train and test sets
    X_train, Y_train = X[train_indices], Y[train_indices]
    X_test, Y_test = X[test_indices], Y[test_indices]

    # training set has one feature and multiple entries
    # so, reshape(-1, 1)
    X_train, Y_train, X_test, Y_test = X_train.reshape(-1, 1), Y_train.reshape(-1, 1), X_test.reshape(-1, 1), Y_test.reshape(-1, 1)

    # fit and evaluate linear regression
    reg = LinearRegression().fit(X_train, Y_train)
    score_train = reg.score(X_train, Y_train)
    score_test = reg.score(X_test, Y_test)

    # extract coefficients from model:
    slope, intercept = reg.coef_[0], reg.intercept_[0]
    print(score_test)

    # show train and test sets
    plt.subplot(5, 1, i + 1)
    plt.scatter(X_train, Y_train, c='k')
    plt.scatter(X_test, Y_test, c='r')

    # draw regression line
    plt.plot(np.arange(10), line(np.arange(10), slope, intercept))
    plt.ylim(0, 10)
    plt.xlim(0, 10)
    plt.title('train: {:.2f} test: {:.2f}'.format(score_train, score_test))
You can see that the scores on training and test set are vastly different. You can also see that the estimated parameters vary a lot with the change of train and test set.
That does not make your linear model any better at all.
But now you know exactly how bad it is :)

Random forest with unbalanced class (positive is minority class), low precision and weird score distributions

I have a very unbalanced dataset (5000 positive, 300000 negative). I am using sklearn RandomForestClassifier to try and predict the probability of the positive class. I have data for multiple years and one of the features I've engineered is the class in the previous year, so I am withholding the last year of the dataset to test on in addition to my test set from within the years I'm training on.
Here is what I've tried (and the result):
Upsampling with SMOTE and SMOTEENN (weird score distributions, see first pic, predicted probabilities for positive and negative class are both the same, i.e., the model predicts a very low probability for most of the positive class)
Downsampling to a balanced dataset (recall is ~0.80 for the test set, but 0.07 for the out-of-year test set from sheer number of total negatives in the unbalanced out of year test set, see second pic)
Leave it unbalanced (weird scoring distribution again, precision goes up to ~0.60 and recall falls to 0.05 and 0.10 for test and out-of-year test set)
XGBoost (slightly better recall on the out-of-year test set, 0.11)
What should I try next? I'd like to optimize for F1, as both false positives and false negatives are equally bad in my case. I would like to incorporate k-fold cross-validation and have read that I should do this before upsampling: a) should I do this / is it likely to help, and b) how can I incorporate it into a pipeline similar to this:
from imblearn.pipeline import make_pipeline, Pipeline
clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
smote_enn = SMOTEENN(smote = sm)
kf = StratifiedKFold(n_splits=5)
pipeline = make_pipeline(??)
pipeline.fit(X_train, ytrain)
ypred = pipeline.predict(Xtest)
ypredooy = pipeline.predict(Xtestooy)
Upsampling with SMOTE and SMOTEENN: I am far from being an expert on these, but by upsampling your dataset you might amplify existing noise, which induces overfitting. This could explain why your algorithm cannot classify correctly, giving the results in the first graph.
I found a little bit more info here and maybe how to improve your results:
https://sci2s.ugr.es/sites/default/files/ficherosPublicaciones/1773_ver14_ASOC_SMOTE_FRPS.pdf
When you downsample, you seem to encounter the same overfitting problem, as I understand it (at least for the target result of the previous year). It is hard to deduce the reason behind it without a look at the data, though.
Your overfitting problem might come from the number of features you use, which could add unnecessary noise. You could try to reduce the number of features you use and gradually increase it (using an RFE model); a sketch follows below the link. More info here:
https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/
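A minimal sketch of recursive feature elimination (not part of the original answer; it assumes X_train and y_train are your training arrays, reuses the clf_rf from the question, and the number of features to keep is only illustrative):
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)

# keep the 10 most important features (illustrative), dropping 2 per iteration
selector = RFE(estimator=clf_rf, n_features_to_select=10, step=2)
selector.fit(X_train, y_train)
X_train_reduced = selector.transform(X_train)
print(selector.support_)  # boolean mask of the kept features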
For the models, you mention Random Forest and XGBoost, but you did not mention having tried a simpler model. You could try a simpler model and focus on your data engineering.
If you have not tried it yet, maybe you could:
Downsample your data
Normalize all your data with a StandardScaler
Test "brute force" tuning of simple models such as Naive Bayes and Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV

# Define the steps of the pipeline
steps = [('scaler', StandardScaler()),
         ('log_reg', LogisticRegression(solver='liblinear'))]  # liblinear supports both l1 and l2
pipeline = Pipeline(steps)

# Specify the hyperparameters (prefixed with the pipeline step name)
parameters = {'log_reg__C': [1, 10, 100],
              'log_reg__penalty': ['l1', 'l2']}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=42)

# Instantiate a GridSearchCV object: cv
cv = GridSearchCV(pipeline, param_grid=parameters)

# Fit to the training set
cv.fit(X_train, y_train)
Anyway, for your example the pipeline could look like this (I built it with logistic regression, but you can swap in another ML algorithm and change the parameter grid accordingly):
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

param_grid = {'C': [1, 10, 100]}
clf = LogisticRegression(solver='lbfgs', multi_class='auto')
sme = SMOTEENN(smote=SMOTE(k_neighbors=2), random_state=42)
grid = GridSearchCV(estimator=clf, param_grid=param_grid, scoring="f1")

pipeline = Pipeline([('scale', StandardScaler()),
                     ('SMOTEENN', sme),
                     ('grid', grid)])

cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
score = cross_val_score(pipeline, X, y, cv=cv)
I hope this may help you.
(edit: I added scoring="f1" in the GridSearchCV)
