Random Forest further improvement [closed] - python

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 12 months ago.
Improve this question
Following Jason Brownlee's tutorials, I developed my own Random forest classifier code. I paste it below, I would like to know what further improvements can I do to improve the accuracy to my code
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from matplotlib import pyplot
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.05, shuffle = True, random_state=0)
scaler = StandardScaler()
x_train = scaler.fit_transform(X_train)
x_test = scaler.transform(X_test)
# get a list of models to evaluate
def get_models():
models = dict()
# consider tree depths from 1 to 7 and None=full
depths = [i for i in range(1,8)] + [None]
for n in depths:
models[str(n)] = RandomForestClassifier(max_depth=n)
return models
# evaluate model using cross-validation
def evaluate_model(model, X, y):
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model and collect the results
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
return scores
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
# evaluate the model
scores = evaluate_model(model, X, y)
# store the results
results.append(scores)
names.append(name)
# summarize the performance along the way
print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
The data, X is a matrix of (140,20000) and y is (140,) categorical.
I got the following results but would like to explore how to improve accuracy further.
>1 0.573 (0.107)
>2 0.650 (0.089)
>3 0.647 (0.118)
>4 0.676 (0.101)
>5 0.708 (0.103)
>6 0.698 (0.124)
>7 0.726 (0.121)
>None 0.700 (0.107)

Here's what stands out to me:
You split the data but do not use the splits.
You're scaling the data, but tree-based methods like random forests do not need this step.
You are doing your own tuning loop, instead of using sklearn.model_selection.GridSearchCV. This is fine, but it can get quite fiddly (imagine wanting to step over another hyperparameter).
If you use GridSearchCV you don't need to do your own cross validation.
You're using accuracy for evaluation, which is usually not a great evaluation metric for multi-class classification. Weighted F1 is better.
If you're doing cross validation, you need to put the scaler in the CV loop (e.g. using a pipeline) because otherwise the scaler has seen the validation data... but you don't need a scaler for this learning algorithm so this point is moot.
I would probably do something like this:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
X, y = make_classification()
# Split the data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, shuffle=True, random_state=0)
# Make things for the cross validation.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
param_grid = {'max_depth': np.arange(3, 8)}
model = RandomForestClassifier(random_state=1)
# Create and train the cross validation.
clf = GridSearchCV(model, param_grid,
scoring='f1_weighted',
cv=cv, verbose=3)
clf.fit(X_train, y_train)
Take a look at clf.cv_results_ for the scores etc, which you can plot if you want. By default GridSearchCV trains a final model on the best hyperparameters, so you can make predictions with clf.
Almost forgot... you asked about improving the model :) Here are some ideas:
The above will help you tune on more hyperparameters (eg max_features, n_estimators, and min_samples_leaf). But don't get too carried away with hyperparameter tuning.
You could try transforming some features (columns in X), or adding new ones.
Look for more data, eg more rows, higher quality labels, etc.
Address any issues with class imbalance.
Try a more sophisticated algorithm, like gradient boosted trees (there are models in sklearn, or take a look at xgboost).

Related

Why the voting classifier has less accuracy than one of the individual predictors that made it

I have a simple question concerning the votting classifier. As I understood, the voting classifier should have the highest accuracy than those individual predictors which built it (the wisdom of the crowd). Here is the code
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
# import dataset
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
# split the dataset into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)
rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)
log_clf = LogisticRegression(solver='liblinear', random_state=42)
svm_clf = SVC(gamma='auto', random_state=42)
voting_clf = VotingClassifier(
estimators= [('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
voting='hard')
voting_clf = voting_clf.fit(X_train, y_train)
predictors_list= [log_clf, rnd_clf, svm_clf, voting_clf]
for clf in predictors_list:
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_pred, y_test)
print(clf.__class__.__name__, accuracy)
what I get as a accuracy is as follows:
LogisticRegression 0.776
RandomForestClassifier 0.88
SVC 0.864
VotingClassifier 0.864
As you can see for this run the Random Forest predictor has a slightly better accuray than the VotingClassifier!
Any explanation for this?
Many thanks in Advance
Fethi
Let's take a look at the voting parameter you passed 'hard'
documentation says:
If ‘hard’, uses predicted class labels for majority rule voting. Else if ‘soft’, predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.
So maybe, the prediction of ‍‍‍‍LogisticRegression and your SVC(SVM) are the same and wrong for some cases this makes your majority vote wrong for those cases.
you can use voting='soft' or assign weight as prior for prediction of each model, this way you make your prediction a little bit immune to the wrong prediction of bad models, and relay more on your best models.

KFold Cross Validation does not fix overfitting

I am separating the features in X and y then I preprocess my train test data after splitting it with k fold cross validation. After that i fit the train data to my Random Forest Regressor model and calculate the confidence score. Why do i preprocess after splitting? because people tell me that it's more correct to do it that way and i'm keeping that principle since that for the sake of my model performance.
This is my first time using KFold Cross Validation because my model score overifts and i thought i could fix it with cross validation. I'm still confused of how to use this, i have read the documentation and some articles but i do not really catch how do i really imply it to my model but i tried anyway and my model still overfits. Using train test split or cross validation resulting my model score is still 0.999, I do not know what is my mistake since i'm very new using this method but i think maybe i did it wrong so it does not fix the overfitting. Please tell me what's wrong with my code and how to fix this
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
import scipy.stats as ss
avo_sales = pd.read_csv('avocados.csv')
avo_sales.rename(columns = {'4046':'small PLU sold',
'4225':'large PLU sold',
'4770':'xlarge PLU sold'},
inplace= True)
avo_sales.columns = avo_sales.columns.str.replace(' ','')
x = np.array(avo_sales.drop(['TotalBags','Unnamed:0','year','region','Date'],1))
y = np.array(avo_sales.TotalBags)
# X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
kf = KFold(n_splits=10)
for train_index, test_index in kf.split(x):
X_train, X_test, y_train, y_test = x[train_index], x[test_index], y[train_index], y[test_index]
impC = SimpleImputer(strategy='most_frequent')
X_train[:,8] = impC.fit_transform(X_train[:,8].reshape(-1,1)).ravel()
X_test[:,8] = impC.transform(X_test[:,8].reshape(-1,1)).ravel()
imp = SimpleImputer(strategy='median')
X_train[:,1:8] = imp.fit_transform(X_train[:,1:8])
X_test[:,1:8] = imp.transform(X_test[:,1:8])
le = LabelEncoder()
X_train[:,8] = le.fit_transform(X_train[:,8])
X_test[:,8] = le.transform(X_test[:,8])
rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)
confidence = rfr.score(X_test, y_test)
print(confidence)
The reason you're overfitting is because a non-regularized tree-based model will adjust to the data until all training samples are correctly classified. See for example this image:
As you can see, this does not generalize well. If you don't specify arguments that regularize the trees, the model will fit the test data poorly because it will basically just learn the noise in the training data. There are many ways to regularize trees in sklearn, you can find them here. For instance:
max_features
min_samples_leaf
max_depth
With proper regularization, you can get a model that generalizes well to the test data. Look at a regularized model for instance:
To regularize your model, instantiate the RandomForestRegressor() module like this:
rfr = RandomForestRegressor(max_features=0.5, min_samples_leaf=4, max_depth=6)
These argument values are arbitrary, it's up to you to find the ones that fit your data best. You can use domain-specific knowledge to choose these values, or a hyperparameter tuning search like GridSearchCV or RandomizedSearchCV.
Other than that, imputing the mean and median might bring a lot of noise in your data. I would advise against it unless you had no other choice.
While #NicolasGervais answer gets to the bottom of why your specific model is overfitting, I think there is a conceptual misunderstanding with regards to cross-validation in the original question; you seem to think that:
Cross-validation is a method that improves the performance of a machine learning model.
But this is not the case.
Cross validation is a method that is used to estimate the performance of a given model on unseen data. By itself, it cannot improve the accuracy.
In other words, the respective scores can tell you if your model is overfitting the training data, but simply applying cross-validation does not make your model better.
Example:
Let's look at a dataset with 10 points, and fit a line through it:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
X = np.random.randint(0,10,10)
Y = np.random.randint(0,10,10)
fig = plt.figure(figsize=(1,10))
def line(x, slope, intercept):
return slope * x + intercept
for i in range(5):
# note that this is not technically 5-fold cross-validation
# because I allow the same datapoint to go into the test set
# several times. For illustrative purposes it is fine imho.
test_indices = np.random.choice(np.arange(10),2)
train_indices = list(set(range(10))-set(test_indices))
# get train and test sets
X_train, Y_train = X[train_indices], Y[train_indices]
X_test, Y_test = X[test_indices], Y[test_indices]
# training set has one feature and multiple entries
# so, reshape(-1,1)
X_train, Y_train, X_test, Y_test = X_train.reshape(-1,1), Y_train.reshape(-1,1), X_test.reshape(-1,1), Y_test.reshape(-1,1)
# fit and evaluate linear regression
reg = LinearRegression().fit(X_train, Y_train)
score_train = reg.score(X_train, Y_train)
score_test = reg.score(X_test, Y_test)
# extract coefficients from model:
slope, intercept = reg.coef_[0], reg.intercept_[0]
print(score_test)
# show train and test sets
plt.subplot(5,1,i+1)
plt.scatter(X_train, Y_train, c='k')
plt.scatter(X_test, Y_test, c='r')
# draw regression line
plt.plot(np.arange(10), line(np.arange(10), slope, intercept))
plt.ylim(0,10)
plt.xlim(0,10)
plt.title('train: {:.2f} test: {:.2f}'.format(score_train, score_test))
You can see that the scores on training and test set are vastly different. You can also see that the estimated parameters vary a lot with the change of train and test set.
That does not make your linear model any better at all.
But now you know exactly how bad it is :)

Selecting number of features for RFE model in Python

I have a dataset with more than 120 features, and I want to use RFE for selecting which features / column names I should use.
I have a problem because RFE is very slow. My code looks like this:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
full_df = pd.read_csv('data.csv')
x = full_df.iloc[:,:-1]
y = full_df.iloc[:,-1]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 42)
model = LogisticRegression(solver ='lbfgs')
for i in range(1,120):
rfe = RFE(model, i)
fit = rfe.fit(x_train, y_train)
acc = fit.score(x_test, y_test)
print(acc)
print(fit.support_)
My problem is this: rfe = RFE(model, i). I do not know what's the best number for i. That's why I put it in for i in range(1,120). Is there any better way to do this? is there any better function in scikit learn that can help me determine the number of features and names of those features?
Because this took to long, I changed my approach, and I want to see what you think about it, is it good / correct approach.
First I did PCA, and I found out that each column participates with around 1-0.4%, except last 9 columns. Last 9 columns participate with less than 0.00001% so I removed them. Now I have 121 features.
pca = PCA()
fit = pca.fit(x)
Then I split my data into train and test (with 121 features).
Then I used SelectFromModel, and I tested it with 4 different classifiers. Each classifier in SelectFromModel reduced the number of columns. I chosed the number of column that was determined by classifier that gave me the best accuracy:
model = SelectFromModel(clf, prefit=True)
#train_score = clf.score(x_train, y_train)
test_score = clf.score(x_test, y_test)
column_res = model.transform(x_train).shape
End finally I used 'RFE'. I have used number of columns that i get with 'SelectFromModel'.
rfe = RFE(model, number_of_columns)
fit = rfe.fit(x_train, y_train)
acc = fit.score(x_test, y_test)
Is this a good approach, or I did something wrong?
Also, If I got the biggest accuracy in SelectFromModel with one classifier, do I need to use the same classifier in RFE?

Difference Between Python's Functions `cls.score` and `cls.cv_result_`

I have written a code for a logistic regression in Python (Anaconda 3.5.2 with sklearn 0.18.2). I have implemented GridSearchCV() and train_test_split() to sort parameters and split the input data.
My goal is to find the overall (average) accuracy over the 10 folds with a standard error on the test data. Additionally, I try to predict correctly predicted class labels, creating a confusion matrix and preparing a classification report summary.
Please, advise me in the following:
(1) Is my code correct? Please, check each part.
(2) I have tried two different Sklearn functions, clf.score() and clf.cv_results_. I see that they give different results. Which one is correct? (However, the summaries are not included).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.pipeline import Pipeline
# Load any n x m data and label column. No missing or NaN values.
# I am skipping loading data part. One can load any data to test below code.
sc = StandardScaler()
lr = LogisticRegression()
pipe = Pipeline(steps=[('sc', sc), ('lr', lr)])
parameters = {'lr__C': [0.001, 0.01]}
if __name__ == '__main__':
clf = GridSearchCV(pipe, parameters, n_jobs=-1, cv=10, refit=True)
X_train, X_test, y_train, y_test = train_test_split(Data, labels, random_state=0)
# Train the classifier on data1's feature and target data
clf.fit(X_train, y_train)
print("Accuracy on training set: {:.2f}% \n".format((clf.score(X_train, y_train))*100))
print("Accuracy on test set: {:.2f}%\n".format((clf.score(X_test, y_test))*100))
print("Best Parameters: ")
print(clf.best_params_)
# Alternately using cv_results_
print("Accuracy on training set: {:.2f}% \n", (clf.cv_results_['mean_train_score'])*100))
print("Accuracy on test set: {:.2f}%\n", (clf.cv_results_['mean_test_score'])*100))
# Predict class labels
y_pred = clf.best_estimator_.predict(X_test)
# Confusion Matrix
class_names = ['Positive', 'Negative']
confMatrix = confusion_matrix(y_test, y_pred)
print(confMatrix)
# Accuracy Report
classificationReport = classification_report(labels, y_pred, target_names=class_names)
print(classificationReport)
I will appreciate any advise.
First of all, the desired metrics, i. e. the accuracy metrics, is already considered a default scorer of LogisticRegression(). Thus, we may omit to define scoring='accuracy' parameter of GridSearchCV().
Secondly, the parameter score(X, y) returns the value of the chosen metrics IF the classifier has been refit with the best_estimator_ after sorting all possible options taken from param_grid. It works like so as you have provided refit=True. Note that clf.score(X, y) == clf.best_estimator_.score(X, y). Thus, it does not print out the averaged metrics but rather the best metrics.
Thirdly, the parameter cv_results_ is a much broader summary as it includes the results of each fit. However, it prints out the averaged results obtained by averaging the batch results. These are the values that you wish to store.
Quick Example
Let me hereby introduce a toy example for better understanding:
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, 0.2)
param_grid = {'C': [0.001, 0.01]}
clf = GridSearchCV(cv=10, estimator=LogisticRegression(), refit=True,
param_grid=param_grid)
clf.fit(X_train, y_train)
clf.best_estimator_.score(X_train, y_train)
print('____')
clf.cv_results_
This code yields the following:
0.98107957707289928 # which is the best possible accuracy score
{'mean_fit_time': array([ 0.15465896, 0.23701136]),
'mean_score_time': array([ 0.0006465 , 0.00065773]),
'mean_test_score': array([ 0.934335 , 0.9376739]),
'mean_train_score': array([ 0.96475625, 0.98225632]),
'param_C': masked_array(data = [0.001 0.01],
'params': ({'C': 0.001}, {'C': 0.01})
mean_train_score has two mean values as I grid over two options for C parameter.
I hope that helps!

Support vector machine overfitting my data

I am trying to make predictions for the iris dataset. I have decided to use svms for this purpose. But, it gives me an accuracy 1.0. Is it a case of overfitting or is it because the model is very good? Here is my code.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
svm_model = svm.SVC(kernel='linear', C=1,gamma='auto')
svm_model.fit(X_train,y_train)
predictions = svm_model.predict(X_test)
accuracy_score(predictions, y_test)
Here, accuracy_score returns a value of 1. Please help me. I am a beginner in machine learning.
You can try cross validation:
Example:
from sklearn.model_selection import LeaveOneOut
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
#load iris data
iris = datasets.load_iris()
X = iris.data
Y = iris.target
#build the model
svm_model = SVC( kernel ='linear', C = 1, gamma = 'auto',random_state = 0 )
#create the Cross validation object
loo = LeaveOneOut()
#calculate cross validated (leave one out) accuracy score
scores = cross_val_score(svm_model, X,Y, cv = loo, scoring='accuracy')
print( scores.mean() )
Result (the mean accuracy of the 150 folds since we used leave-one-out):
0.97999999999999998
Bottom line:
Cross validation (especially LeaveOneOut) is a good way to avoid overfitting and to get robust results.
The iris dataset is not a particularly difficult one from where to get good results. However, you are right not trusting a 100% classification accuracy model. In your example, the problem is that the 30 test points are all correctly well classified. But that doesn't mean that your model is able to generalise well for all new data instances. Just try and change the test_size to 0.3 and the results are no longer 100% (it goes down to 97.78%).
The best way to guarantee robustness and avoid overfitting is using cross validation. An example on how to do this easily from your example:
from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
iris = datasets.load_iris()
X = iris.data[:, :4]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
svm_model = svm.SVC(kernel='linear', C=1, gamma='auto')
scores = cross_val_score(svm_model, iris.data, iris.target, cv=10) #10 fold cross validation
Here cross_val_score uses different parts of the dataset as testing data iteratively (cross validation) while keeping all your previous parameters. If you check score you will see that the 10 accuracies calculated now range from 87.87% to 100%. To report the final model performance you can for example use the mean of the scored values.
Hope this helps and good luck! :)

Categories

Resources