I'm building a Random Forest binary classifier in Python on a pre-processed dataset with 4898 instances, a 60-40 stratified split ratio, and 78% of the data belonging to one target label and the rest to the other. What value of n_estimators should I choose in order to achieve the most practically useful / best possible random forest classifier model? I plotted the accuracy vs. n_estimators curve using the code snippet below. x_train and y_train are the features and target labels in the training set respectively, and x_test and y_test are the features and target labels in the test set respectively.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
scores = []
for k in range(1, 200):
    rfc = RandomForestClassifier(n_estimators=k)
    rfc.fit(x_train, y_train)
    y_pred = rfc.predict(x_test)
    scores.append(accuracy_score(y_test, y_pred))
import matplotlib.pyplot as plt
%matplotlib inline
# plot the relationship between K and testing accuracy
# plt.plot(x_axis, y_axis)
plt.plot(range(1, 200), scores)
plt.xlabel('Value of n_estimators for Random Forest Classifier')
plt.ylabel('Testing Accuracy')
Here, it is visible that a high value of n_estimators gives a good accuracy score, but the curve fluctuates randomly even for nearby values of n_estimators, so I can't pick the best one precisely. I only want to know about tuning the n_estimators hyperparameter: how should I choose it? Should I use a ROC or CAP curve instead of accuracy_score? Thanks.
See this RandomizedSearchCV example: https://github.com/dnishimoto/python-deep-learning/blob/master/Random%20Forest%20Tennis.ipynb
I used RandomizedSearchCV to find the best parameters for the Random Forest classifier.
n_estimators is the number of decision trees to use.
Try using XGBoost to get more accuracy.
from sklearn.model_selection import RandomizedSearchCV

parameter_grid = {'n_estimators': [1, 2, 3, 4, 5],
                  'max_depth': [2, 4, 6, 8, 10],
                  'min_samples_leaf': [1, 2, 4],
                  'max_features': [1, 2, 3, 4, 5, 6, 7, 8]}
number_models = 4

random_RandomForest_class = RandomizedSearchCV(
    estimator=pipeline['clf'],  # the classifier step of the pipeline built in the linked notebook
    param_distributions=parameter_grid,
    n_iter=number_models,
    scoring='accuracy',
    n_jobs=2,
    cv=4,
    refit=True,
    return_train_score=True)

random_RandomForest_class.fit(X_train, y_train)
predictions = random_RandomForest_class.predict(X)

print("Accuracy Score", accuracy_score(y, predictions))
print("Best params", random_RandomForest_class.best_params_)
print("Best score", random_RandomForest_class.best_score_)
It is natural that a random forest will stabilize after some number of estimators (because, unlike boosting, there is no mechanism to "slow down" the fitting). Since there is no benefit to adding more weak tree estimators, you can choose a value around 50.
Don't use grid search for this case; it is overkill, and since you set the candidate values arbitrarily, you may not end up with the optimum number anyway.
For boosting, there is a staged_predict method in scikit-learn with which you can measure the validation error at each stage of training to find the optimum number of trees.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)

# try a big number for n_estimators
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=100)
gbrt.fit(X_train, y_train)

# calculate the error on the validation set at each stage
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = np.argmin(errors) + 1

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)
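For the random-forest case itself, a rough sketch (assuming the same X_train/y_train split as above) is to track the out-of-bag error while growing the forest with warm_start, and see where the curve flattens out:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(warm_start=True, oob_score=True,
                            n_estimators=20, random_state=42)
oob_errors = []
for n in range(20, 201, 20):
    rf.set_params(n_estimators=n)
    rf.fit(X_train, y_train)  # warm_start reuses the already-grown trees
    oob_errors.append((n, 1 - rf.oob_score_))

for n, err in oob_errors:
    print(n, round(err, 4))
Once the out-of-bag error stops improving, adding more trees only costs computation time.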
Is it just me, or do the existing answers not really answer your question? In case you are still looking for how to get the accuracy score and the n_estimators you want, maybe I can answer it.
First, you already answer it in your own code, in these lines.
scores = []
for k in range(1, 200):
    rfc = RandomForestClassifier(n_estimators=k)
    rfc.fit(x_train, y_train)
    y_pred = rfc.predict(x_test)
    scores.append(accuracy_score(y_test, y_pred))
As you can see, you already saved each accuracy_score into scores. So you just need to find the maximum value in the scores list. Keep in mind that range(1, 200) starts at 1, so the list index is one less than the corresponding n_estimators:
maxs = max(scores)
best_n_estimators = scores.index(maxs) + 1
Then just put the print command in the final line.
print(f"Accuracy Score: {maxs} with n_estimators: {best_n_estimators}")
I hope your problem has already been solved. I also want to thank you, because your code helped me create a way to find the best number of estimators too.
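As a rough sketch (assuming the x_train/y_train from your question), averaging cross-validated accuracy for each candidate n_estimators smooths out the run-to-run fluctuation before picking the maximum:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

candidates = list(range(10, 201, 10))
mean_scores = []
for k in candidates:
    rfc = RandomForestClassifier(n_estimators=k, random_state=0)
    # average accuracy over 5 folds to smooth out the noise
    mean_scores.append(cross_val_score(rfc, x_train, y_train,
                                       cv=5, scoring='accuracy').mean())

best_k = candidates[int(np.argmax(mean_scores))]
print(f"Best n_estimators: {best_k} (mean CV accuracy {max(mean_scores):.3f})")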
Related
I've created a model trained on the Titanic dataset, and I want to see an accuracy percentage for my model. I've done this before, but sadly, I do not remember how. I looked on the internet, and I couldn't find anything. Either I just entered the wrong words, or there isn't anything there.
# the tts function is `train_test_split` from `sklearn.model_selection`
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split as tts

train_X, val_X, train_y, val_y = tts(X, y, random_state=0)  # y is the state of survival
forest_model = RandomForestRegressor(random_state=0)
forest_model.fit(train_X, train_y)
val_predictions = forest_model.predict(val_X)
How can I calculate the accuracy?
I wonder why you are using RandomForestRegressor, as the Titanic dataset can be formulated as a binary classification problem. Assuming it is a mistake, to measure the accuracy of a RandomForestClassifier, you can do:
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(val_y, val_predictions)
However, it is often better to use K-fold cross-validation, which gives you more reliable accuracy:
>>> from sklearn.model_selection import cross_val_score
>>> cross_val_score(forest_model, X, y, cv=10, scoring='accuracy') #10-fold cross validation, notice that I am giving X,y not X_train, y_train
K-fold cross-validation gives you 10 accuracy values, as it divides the data into 10 folds (i.e. parts). Then, you can get the mean and standard deviation of the accuracy values as follows:
>>> import numpy as np
>>> accuracy = np.array(cross_val_score(forest_model, X, y, cv=10, scoring='accuracy'))
>>> accuracy.mean() #mean of accuracies
0.81
>>> accuracy.std() #standard deviation of accuracies
0.04
You can also use other scoring metrics such as F1-score, precision, recall, cohen_kappa_score, etc.
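For example, a minimal sketch of swapping the scoring metric (this assumes forest_model is a RandomForestClassifier, as suggested above; 'f1', 'precision' and 'recall' are built-in scorer names):
>>> from sklearn.model_selection import cross_val_score
>>> # assumes forest_model is a classifier, since F1 needs class predictions
>>> f1_scores = cross_val_score(forest_model, X, y, cv=10, scoring='f1')
>>> f1_scores.mean(), f1_scores.std()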
You can also calculate the model score to check performance (for a classifier, .score() returns the mean accuracy). Note that it takes the validation features and the true labels, not the predictions:
forest_model.score(val_X, val_y)
Exploring some classification models in scikit-learn, I noticed that the scores I got for log loss and for ROC AUC were consistently lower while performing cross-validation than while fitting and predicting on the whole training set (done to check for overfitting), which did not make sense to me.
Specifically, using cross_validate I set the scorings as ['neg_log_loss', 'roc_auc'], and while performing manual fitting and prediction on the training set I used the metric functions log_loss and roc_auc_score.
To try to figure out what was happening, I wrote code to perform the cross-validation manually in order to be able to call the metric functions on the various folds and compare the results with the ones from cross_validate. As you can see below, I got different results even like this!
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

kf = KFold(n_splits=3, random_state=42, shuffle=True)
log_reg = LogisticRegression(max_iter=1000)

for train_index, test_index in kf.split(dataset, dataset_labels):
    X_train, X_test = dataset[train_index], dataset[test_index]
    y_train, y_test = dataset_labels_np[train_index], dataset_labels_np[test_index]
    log_reg.fit(X_train, y_train)
    pr = log_reg.predict(X_test)
    ll = log_loss(y_test, pr)
    print(ll)

from sklearn.model_selection import cross_val_score
cv_ll = cross_val_score(log_reg, dataset_prepared_stand, dataset_labels, scoring='neg_log_loss',
                        cv=KFold(n_splits=3, random_state=42, shuffle=True))
print(abs(cv_ll))
Outputs:
4.795481869275026
4.560119170517534
5.589818973403791
[0.409817 0.32309 0.398375]
The output running the same code for ROC AUC are:
0.8609669592272686
0.8678563239907938
0.8367147503682851
[0.925635 0.94032 0.910885]
To be sure to have written the code right, I also tried the code using 'accuracy' as scoring for cross validation and accuracy_score as metric function and the results are instead consistent:
0.8611584327086882
0.8679727427597955
0.838160136286201
[0.861158 0.867973 0.83816 ]
Can someone explain to me why the results in the case of log loss and ROC AUC are different? Thanks!
Log-loss and auROC both need probability predictions, not the hard class predictions. So change
pr = log_reg.predict(X_test)
to
pr = log_reg.predict_proba(X_test)[:, 1]
(the subscripting is to grab the probabilities for the positive class, and assumes you're doing binary classification).
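For reference, a minimal sketch of the corrected loop (reusing the variable names from the question) would be the following; feeding the positive-class probabilities to log_loss and roc_auc_score should make the per-fold values line up with cross_val_score:
from sklearn.metrics import log_loss, roc_auc_score

for train_index, test_index in kf.split(dataset, dataset_labels):
    X_train, X_test = dataset[train_index], dataset[test_index]
    y_train, y_test = dataset_labels_np[train_index], dataset_labels_np[test_index]
    log_reg.fit(X_train, y_train)
    proba = log_reg.predict_proba(X_test)[:, 1]  # positive-class probabilities
    print(log_loss(y_test, proba), roc_auc_score(y_test, proba))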
I am separating the features into X and y, then I preprocess my train and test data after splitting it with k-fold cross-validation. After that I fit the train data to my Random Forest Regressor model and calculate the confidence score. Why do I preprocess after splitting? Because people tell me that it's more correct to do it that way, and I'm keeping that principle for the sake of my model's performance.
This is my first time using k-fold cross-validation, because my model score overfits and I thought I could fix it with cross-validation. I'm still confused about how to use it; I have read the documentation and some articles, but I do not really catch how to apply it to my model. I tried anyway, and my model still overfits. Whether I use train test split or cross-validation, my model score is still 0.999. I do not know what my mistake is, since I'm very new to this method, but I think maybe I did it wrong, so it does not fix the overfitting. Please tell me what's wrong with my code and how to fix it.
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
import scipy.stats as ss

avo_sales = pd.read_csv('avocados.csv')

avo_sales.rename(columns={'4046': 'small PLU sold',
                          '4225': 'large PLU sold',
                          '4770': 'xlarge PLU sold'},
                 inplace=True)

avo_sales.columns = avo_sales.columns.str.replace(' ', '')

x = np.array(avo_sales.drop(['TotalBags', 'Unnamed:0', 'year', 'region', 'Date'], axis=1))
y = np.array(avo_sales.TotalBags)

# X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

kf = KFold(n_splits=10)
for train_index, test_index in kf.split(x):
    X_train, X_test, y_train, y_test = x[train_index], x[test_index], y[train_index], y[test_index]

    impC = SimpleImputer(strategy='most_frequent')
    X_train[:, 8] = impC.fit_transform(X_train[:, 8].reshape(-1, 1)).ravel()
    X_test[:, 8] = impC.transform(X_test[:, 8].reshape(-1, 1)).ravel()

    imp = SimpleImputer(strategy='median')
    X_train[:, 1:8] = imp.fit_transform(X_train[:, 1:8])
    X_test[:, 1:8] = imp.transform(X_test[:, 1:8])

    le = LabelEncoder()
    X_train[:, 8] = le.fit_transform(X_train[:, 8])
    X_test[:, 8] = le.transform(X_test[:, 8])

    rfr = RandomForestRegressor()
    rfr.fit(X_train, y_train)
    confidence = rfr.score(X_test, y_test)
    print(confidence)
The reason you're overfitting is that a non-regularized tree-based model will adjust to the data until all training samples are fit almost perfectly. See for example this image:
As you can see, this does not generalize well. If you don't specify arguments that regularize the trees, the model will fit the test data poorly because it will basically just learn the noise in the training data. There are many ways to regularize trees in sklearn; you can find them here. For instance:
max_features
min_samples_leaf
max_depth
With proper regularization, you can get a model that generalizes well to the test data. Look at a regularized model for instance:
To regularize your model, instantiate the RandomForestRegressor() module like this:
rfr = RandomForestRegressor(max_features=0.5, min_samples_leaf=4, max_depth=6)
These argument values are arbitrary; it's up to you to find the ones that fit your data best. You can use domain-specific knowledge to choose these values, or a hyperparameter tuning search like GridSearchCV or RandomizedSearchCV.
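For instance, a minimal GridSearchCV sketch over those regularizing parameters could look like this (assuming an X_train/y_train split like the one in your loop; the grid values are only examples):
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# illustrative grid over the regularizing parameters mentioned above
param_grid = {'max_depth': [4, 6, 8],
              'min_samples_leaf': [2, 4, 8],
              'max_features': [0.3, 0.5, 0.8]}

search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)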
Other than that, imputing the mean or median might bring a lot of noise into your data. I would advise against it unless you have no other choice.
While @NicolasGervais' answer gets to the bottom of why your specific model is overfitting, I think there is a conceptual misunderstanding with regards to cross-validation in the original question; you seem to think that:
Cross-validation is a method that improves the performance of a machine learning model.
But this is not the case.
Cross validation is a method that is used to estimate the performance of a given model on unseen data. By itself, it cannot improve the accuracy.
In other words, the respective scores can tell you if your model is overfitting the training data, but simply applying cross-validation does not make your model better.
Example:
Let's look at a dataset with 10 points, and fit a line through it:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

X = np.random.randint(0, 10, 10)
Y = np.random.randint(0, 10, 10)

fig = plt.figure(figsize=(1, 10))

def line(x, slope, intercept):
    return slope * x + intercept

for i in range(5):
    # note that this is not technically 5-fold cross-validation
    # because I allow the same datapoint to go into the test set
    # several times. For illustrative purposes it is fine imho.
    test_indices = np.random.choice(np.arange(10), 2)
    train_indices = list(set(range(10)) - set(test_indices))

    # get train and test sets
    X_train, Y_train = X[train_indices], Y[train_indices]
    X_test, Y_test = X[test_indices], Y[test_indices]

    # training set has one feature and multiple entries
    # so, reshape(-1, 1)
    X_train, Y_train, X_test, Y_test = X_train.reshape(-1, 1), Y_train.reshape(-1, 1), X_test.reshape(-1, 1), Y_test.reshape(-1, 1)

    # fit and evaluate linear regression
    reg = LinearRegression().fit(X_train, Y_train)
    score_train = reg.score(X_train, Y_train)
    score_test = reg.score(X_test, Y_test)

    # extract coefficients from model:
    slope, intercept = reg.coef_[0], reg.intercept_[0]

    print(score_test)

    # show train and test sets
    plt.subplot(5, 1, i + 1)
    plt.scatter(X_train, Y_train, c='k')
    plt.scatter(X_test, Y_test, c='r')

    # draw regression line
    plt.plot(np.arange(10), line(np.arange(10), slope, intercept))
    plt.ylim(0, 10)
    plt.xlim(0, 10)
    plt.title('train: {:.2f} test: {:.2f}'.format(score_train, score_test))
You can see that the scores on training and test set are vastly different. You can also see that the estimated parameters vary a lot with the change of train and test set.
That does not make your linear model any better at all.
But now you know exactly how bad it is :)
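As a side note, a short sketch with scikit-learn's cross_validate gives the same train/test comparison without writing the folds by hand (this assumes the X and Y arrays above, reshaped into a single feature column):
from sklearn.model_selection import cross_validate

cv_results = cross_validate(LinearRegression(), X.reshape(-1, 1), Y,
                            cv=5, return_train_score=True)
print(cv_results['train_score'])  # R^2 on each training fold
print(cv_results['test_score'])   # R^2 on each held-out fold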
I have a very unbalanced dataset (5000 positive, 300000 negative). I am using sklearn RandomForestClassifier to try and predict the probability of the positive class. I have data for multiple years and one of the features I've engineered is the class in the previous year, so I am withholding the last year of the dataset to test on in addition to my test set from within the years I'm training on.
Here is what I've tried (and the result):
Upsampling with SMOTE and SMOTEENN (weird score distributions, see first pic, predicted probabilities for positive and negative class are both the same, i.e., the model predicts a very low probability for most of the positive class)
Downsampling to a balanced dataset (recall is ~0.80 for the test set, but 0.07 for the out-of-year test set from sheer number of total negatives in the unbalanced out of year test set, see second pic)
Leave it unbalanced (weird scoring distribution again, precision goes up to ~0.60 and recall falls to 0.05 and 0.10 for test and out-of-year test set)
XGBoost (slightly better recall on the out-of-year test set, 0.11)
What should I try next? I'd like to optimize for F1, as both false positives and false negatives are equally bad in my case. I would like to incorporate k-fold cross validation and have read I should do this before upsampling, a) should I do this/is it likely to help and b) how can I incorporate this into a pipeline similar to this:
from imblearn.pipeline import make_pipeline, Pipeline
clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
smote_enn = SMOTEENN(smote = sm)
kf = StratifiedKFold(n_splits=5)
pipeline = make_pipeline(??)
pipeline.fit(X_train, ytrain)
ypred = pipeline.predict(Xtest)
ypredooy = pipeline.predict(Xtestooy)
Upsampling with SMOTE and SMOTEENN: I am far from being an expert with those, but by upsampling your dataset you might amplify existing noise, which induces overfitting. This could explain why your algorithm cannot classify correctly, giving the results in the first graph.
I found a little bit more info here and maybe how to improve your results:
https://sci2s.ugr.es/sites/default/files/ficherosPublicaciones/1773_ver14_ASOC_SMOTE_FRPS.pdf
When you downsample, you seem to encounter the same overfitting problem as I understand it (at least for the target result of the previous year). It is hard to deduce the reason behind it without a look at the data, though.
Your overfitting problem might come from the number of features you use, which could add unnecessary noise. You might try to reduce the number of features you use and gradually increase it (using an RFE model). More info here:
https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/
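For what it's worth, a minimal RFECV sketch (reusing the X_train/ytrain names from your snippet; the base estimator and scoring are only examples) could look like this:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# recursively drop the least important features, keeping the set
# that maximizes cross-validated F1
selector = RFECV(RandomForestClassifier(n_estimators=25, random_state=1),
                 step=1, cv=StratifiedKFold(n_splits=5), scoring='f1')
selector.fit(X_train, ytrain)
print("Selected features:", selector.n_features_)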
For the models, you mention Random Forest and XGBoost, but you did not mention having used simpler models. You could try simpler models and focus on your data engineering.
If you have not tried it yet, maybe you could:
Downsample your data
Normalize all your data with a StandardScaler
Test "brute force" tuning of simple models such as Naive Bayes and Logistic Regression
# Define the steps of the pipeline
steps = [('scaler', StandardScaler()),
         ('log_reg', LogisticRegression(solver='liblinear'))]  # liblinear supports both l1 and l2
pipeline = Pipeline(steps)

# Specify the hyperparameters (prefixed with the pipeline step name)
parameters = {'log_reg__C': [1, 10, 100],
              'log_reg__penalty': ['l1', 'l2']}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=42)

# Instantiate a GridSearchCV object: cv
cv = GridSearchCV(pipeline, param_grid=parameters)

# Fit to the training set
cv.fit(X_train, y_train)
Anyway, for your example the pipeline could be as follows (I made it with Logistic Regression, but you can change it to another ML algorithm and change the parameter grid accordingly):
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

param_grid = {'C': [1, 10, 100]}

clf = LogisticRegression(solver='lbfgs', multi_class='auto')
sme = SMOTEENN(smote=SMOTE(k_neighbors=2), random_state=42)

grid = GridSearchCV(estimator=clf, param_grid=param_grid, scoring='f1')

pipeline = Pipeline([('scale', StandardScaler()),
                     ('SMOTEENN', sme),
                     ('grid', grid)])

cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
score = cross_val_score(pipeline, X, y, cv=cv)
I hope this may help you.
(edit: I added scoring='f1' in the GridSearchCV)
I am learning to do classification on the Cover Type data, which has 7 classes. I train my model with GradientBoostingClassifier from scikit-learn. When I try to plot my loss function, it looks like this:
Does this kind of plot show that my model suffers from high variance? If yes, what should I do? And I don't know why, in the middle of iterations 200 until 500, the plot is shaped like a rectangle.
(EDIT)
To edit this post: I'm not sure what's wrong with my code because I just used the regular code to fit the training data. I'm using a Jupyter notebook, so I'm just going to provide the code.
Y = train["Cover_Type"]
X = train.drop({"Cover_Type"}, axis=1)
#split training data dan cross validation
from sklearn.model_selection import train_test_split
X_train, X_val, Y_train, Y_val = train_test_split(X,Y,test_size=0.3,random_state=42)
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingClassifier
params = {'n_estimators': 1000,'learning_rate': 0.3, 'max_features' : 'sqrt'}
dtree=GradientBoostingClassifier(**params)
dtree.fit(X_train,Y_train)
#mau lihat F1-Score
from sklearn.metrics import f1_score
Y_pred = dtree.predict(X_val) #prediksi data cross validation menggunakan model tadi
print Y_pred
score = f1_score(Y_val, Y_pred, average="micro")
print("Gradient Boosting Tree F1-score: "+str(score)) # I got 0.86 F1-Score
import matplotlib.pyplot as plt
# Plot training deviance
# compute test set deviance
val_score = np.zeros((params['n_estimators'],), dtype=np.float64)
for i, Y_pred in enumerate(dtree.staged_predict(X_val)):
val_score[i] = dtree.loss_(Y_val, Y_pred.reshape(-1, 1))
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title('Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, dtree.train_score_, 'b-',
label='Training Set Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, val_score, 'r-',
label='Validation Set Deviance')
plt.legend(loc='upper right')
plt.xlabel('Boosting Iterations')
plt.ylabel('Deviance')
There are several issues, which I will explain one by one; I have also added the corrected code for your example.
staged_predict(X) should NOT be used
As staged_predict(X) outputs the predicted class instead of predicted probabilities, it is not correct to use it here.
One can (where the context allows it) use the staged_decision_function(X) method and pass the computed decisions at each stage to the model.loss_ attribute. But in this example it does not work (the loss computed from the staged decisions increases while the actual loss decreases).
You should use staged_predict_proba(X) with cross-entropy loss
You should use staged_predict_proba(X).
You also need to define a function that calculates the cross-entropy loss at each stage.
I have provided the code below. Note that I set the verbosity to 2; you can then see that the sklearn training loss at each stage is the same as our loss (as a sanity check that our approach works correctly).
Why you have big jumps
I think the reason is that the GBC becomes very confident and then predicts a label, say 1, with probability one while it is not correct (for example, the true label is 2). This creates big jumps, as the cross entropy goes to infinity. In such a scenario you should change your GBC parameters.
The code and the Plot are given below
The code is:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

def _cross_entropy_like_loss(model, input_data, targets, num_estimators):
    loss = np.zeros((num_estimators, 1))
    for index, predict in enumerate(model.staged_predict_proba(input_data)):
        loss[index, :] = -np.sum(np.log([predict[sample_num, class_num - 1]
                                         for sample_num, class_num in enumerate(targets)]))
        print(f'ce loss {index}:{loss[index, :]}')
    return loss

covtype = fetch_covtype()
X = covtype.data
Y = covtype.target
n_estimators = 10

X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.3, random_state=42)
clf = GradientBoostingClassifier(n_estimators=n_estimators, learning_rate=0.3, verbose=2)
clf.fit(X_train, Y_train)

tr_loss_ce = _cross_entropy_like_loss(clf, X_train, Y_train, n_estimators)
test_loss_ce = _cross_entropy_like_loss(clf, X_val, Y_val, n_estimators)

plt.figure()
plt.plot(np.arange(n_estimators) + 1, tr_loss_ce, '-r', label='training_loss_ce')
plt.plot(np.arange(n_estimators) + 1, test_loss_ce, '-b', label='val_loss_ce')
plt.ylabel('Error')
plt.xlabel('num_components')
plt.legend(loc='upper right')
The console output is shown below, from which you can easily verify that the approach is correct.
Iter Train Loss Remaining Time
1 482434.6631 1.04m
2 398501.7223 55.56s
3 351391.6893 48.51s
4 322290.3230 41.60s
5 301887.1735 34.65s
6 287438.7801 27.72s
7 276109.2008 20.82s
8 268089.2418 13.84s
9 261372.6689 6.93s
10 256096.1205 0.00s
ce loss 0:[ 482434.6630936]
ce loss 1:[ 398501.72228276]
ce loss 2:[ 351391.68933547]
ce loss 3:[ 322290.32300604]
ce loss 4:[ 301887.17346783]
ce loss 5:[ 287438.7801033]
ce loss 6:[ 276109.20077844]
ce loss 7:[ 268089.2418214]
ce loss 8:[ 261372.66892149]
ce loss 9:[ 256096.1205235]
The plot is here.
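As a rough sketch of "changing your GBC parameters" (illustrative values only), lowering the learning rate and enabling scikit-learn's built-in early stopping tends to temper the over-confident predictions that cause the loss spikes:
# illustrative settings only; tune them for your data
clf_tempered = GradientBoostingClassifier(
    n_estimators=500,
    learning_rate=0.05,       # smaller steps -> less over-confident probabilities
    max_depth=3,
    validation_fraction=0.1,  # hold out 10% of the training data internally
    n_iter_no_change=10,      # stop when that internal validation score stalls
    random_state=42,
)
clf_tempered.fit(X_train, Y_train)
print("boosting stages actually used:", clf_tempered.n_estimators_)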
It seems like you have several issues. It's hard to say for sure, because you don't provide any code.
Does my model suffer from high variance?
First, your model is overfitting from the start. You can tell that this is the case since your validation loss is increasing although your training loss is decreasing. What's interesting is that your validation loss is increasing from the very start, which suggests that your model is not working. So to answer your question: yes, it suffers from high variance.
What should I do?
Are you sure there is a trend in your data? The fact that the validation loss increases from the very start hints that either this model does not apply to your data at all, that your data does not have a trend, or that you have issues with your code. Perhaps try other models, and make sure your code is correct. Again, it's hard to say without a minimal example.
The strange rectangle
This looks strange. Either there is an issue with your data in the validation set (because this effect does not occur on the training set) or you just have an issue with your code. If you provide a sample, we could probably help you more.