How to prevent overfitting in Random Forest - python

I have a random forest model I built to predict if NFL teams will score more combined points than the line Vegas has set. The features I use are Total - the total number of combined points Vegas thinks both teams will score, over_percentage - the percentage of public bets on the over, and under_percentage - the percentage of public bets on the under. The over means people are betting that both team's combined scores will be greater than the number Vegas sets and under means the combined score will go under the Vegas number. When I run my model I'm getting a confusion_matrix like this
and an accuracy_score of 76%. However, the predictions do not perform well. Right now I have it giving me the probability the classification will be 0. I'm wondering if there are parameters I can tune or solutions to prevent my model from overfitting. I have over 30K games in the training data set so I don't think lack of data is causing the issue.
Here is the code:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
training_data = pd.read_csv(
'/Users/aus10/NFL/Data/Betting_Data/Training_Data_Betting.csv')
test_data = pd.read_csv(
'/Users/aus10/NFL/Data/Betting_Data/Test_Data_Betting.csv')
df_model = training_data.dropna()
X = df_model.loc[:, ["Total", "Over_Percentage",
"Under_Percentage"]] # independent columns
y = df_model["Over_Under"] # target column
results = []
model = RandomForestClassifier(
random_state=1, n_estimators=500, min_samples_split=2, max_depth=30, min_samples_leaf=1)
n_estimators = [100, 300, 500, 800, 1200]
max_depth = [5, 8, 15, 25, 30]
min_samples_split = [2, 5, 10, 15, 100]
min_samples_leaf = [1, 2, 5, 10]
hyperF = dict(n_estimators=n_estimators, max_depth=max_depth,
min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf)
gridF = GridSearchCV(model, hyperF, cv=3, verbose=1, n_jobs=-1)
model.fit(X, y)
skf = StratifiedKFold(n_splits=2)
skf.get_n_splits(X, y)
StratifiedKFold(n_splits=2, random_state=None, shuffle=False)
for train_index, test_index in skf.split(X, y):
print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X, X
y_train, y_test = y, y
bestF = gridF.fit(X_train, y_train)
print(bestF.best_params_)
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(round(accuracy_score(y_test, y_pred), 2))
index = 0
count = 0
while count < len(test_data):
team = test_data.loc[index].at['Team']
total = test_data.loc[index].at['Total']
over_perc = test_data.loc[index].at['Over_Percentage']
under_perc = test_data.loc[index].at['Under_Percentage']
Xnew = [[total, over_perc, under_perc]]
# make a prediction
ynew = model.predict_proba(Xnew)
# show the inputs and predicted outputs
results.append(
{
'Team': team,
'Over': ynew[0][0]
})
index += 1
count += 1
sorted_results = sorted(results, key=lambda k: k['Over'], reverse=True)
df = pd.DataFrame(sorted_results, columns=[
'Team', 'Over'])
writer = pd.ExcelWriter('/Users/aus10/NFL/Data/ML_Results/Over_Probability.xlsx', # pylint: disable=abstract-class-instantiated
engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False)
df.style.set_properties(**{'text-align': 'center'})
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.width', 1000)
writer.save()
And here are links the the google docs with the test and training data.
Test Data
Training Data

There's a couple of things to note when using RandomForests. First of all you might want to use cross_validate in order to measure the performance of your model.
Furthermore RandomForests can be regularized by tweaking the following parameters:
Decreasing max_depth: This is a parameter that controls the maximum depth of the trees. The bigger it is, there more parameters will have, remember that overfitting happens when there's an excess of parameters being fitted.
Increasing min_samples_leaf: Instead of decreasing max_depth we can increase the minimum number of samples required to be at a leaf node, this will limit the growth of the trees too and prevent having leaves with very few samples (Overfitting!)
Decreasing max_features: As previously mentioned, overfitting happens when there's abundance of parameters being fitted, the number of parameters hold a direct relationship with the number of features in the model, therefore limiting the amount of features in each tree will prove valuable to help control overfitting.
Finally, you might want to try different values and approaches using GridSearchCV to automatize and try different combinations:
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV
rf_clf = RandomForestClassifier()
parameters = {'max_features':np.arange(5,10),'n_estimators':[500,1000,1500],'max_depth':[2,4,8,16]}
clf = GridSearchCV(rf_clf, parameters, cv = 5)
clf.fit(X,y)
This will a return a table with the performance of all the different models (given the combination of hyperparameter) which will allow you to find the best one easier.

You are splitting the data using train_test_split by setting it totest_split=0.25. The downside to this is that it randomly splits the data and completely ignores the distribution of the classes when doing so. Your model will suffer from sampling bias where the correct distribution of the data is not maintained across the train and test datasets.
In your train set the data could be skewed more towards a particular instance of the data compared to the test set and vice versa.
To overcome this you can use StratifiedKFoldCrossValidation which maintains the distribution of the classes accordingly.
Creates K-Fold for the dataframe
def kfold_(df):
df = pd.read_csv(file)
df["kfold"] = -1
df = df.sample(frac=1).reset_index(drop=True)
y= df.target.values
kf= model_selection.StratifiedKFold(n_splits=5)
for f, (t_, v_) in enumerate(kf.split(X=df, y=y)):
df.loc[v_, "kfold"] = f
This function should be run for each fold of the dataset that was created based on the previous function
def run(fold):
df = pd.read_csv(file)
df_train = df[df.kfold != fold].reset_index(drop=True)
df_valid= df[df.kfold == fold].reset_index(drop=True)
x_train = df_train.drop("label", axis = 1).values
y_train = df_train.label.values
x_valid = df_valid.drop("label", axis = 1).values
y_valid = df_valid.label.values
rf = RandomForestRegressor()
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid,
cv = 5, n_jobs = -1, verbose = 2)
grid_search.fit(x_train, y_train)
y_pred = model.predict(x_valid)
print(f"Fold: {fold}")
print(confusion_matrix(y_valid, y_pred))
print(classification_report(y_valid, y_pred))
print(round(accuracy_score(y_valid, y_pred), 2))
Moreover you should perform hyperparameter tuning to find the best parameters for you the other answer shows you how to do so.

Related

Predict one column based on the other columns with XGBoost in python

I have a large dataframe, and I want to predict the last column based on the other columns with xgboost, my codes are written below, but my prediction is wrong and I get the constant value.
the Data is not time-series, my trees also cant be plotted.
Overall is it possible by having 20 columns and I just wanna predict the 20th one by using the other 19th columns with this method?
#XGBoost
import xgboost as xgb
from sklearn.metrics import mean_squared_error
#Separate the target variable
X, y = f.iloc[:,:-1],f.iloc[:,-1]
data_dmatrix = xgb.DMatrix(data=X,label=y)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=123)
#Regressor
xg_reg = xgb.XGBRegressor(objective ='reg:linear', colsample_bytree = 0.3, learning_rate = 0.1,
max_depth = 5, alpha = 10, n_estimators = 10)
#Fit the regressor to the training set and make predictions on the test set
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
#RMSE
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))
#k-fold Cross Validation
params = {"objective":"reg:squarederror",'colsample_bytree': 0.3,'learning_rate': 0.1,
'max_depth': 10, 'alpha': 10}
cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3,
num_boost_round=50,early_stopping_rounds=10,metrics="rmse", as_pandas=True, seed=123)
print((cv_results["test-rmse-mean"]).tail(1))
#Visualizing
xg_reg = xgb.train(params=params, dtrain=data_dmatrix, num_boost_round=10)
#plot the trees
import matplotlib.pyplot as plt
xgb.plot_tree(xg_reg,num_trees=5)
plt.rcParams['figure.figsize'] = [50, 10]
plt.show()
#Examine the importance of each feature column in the original dataset within the model
xgb.plot_importance(xg_reg)
plt.rcParams['figure.figsize'] = [5, 5]
plt.show()
First of all, yes, the approach to predict the last column with the first 19 columns is ok.
If the model only produces constant values, I would change the parameters of the model.
Or train a linear model as a baseline first.

Decision tree Regressor model get max_depth value of the model with highest accuracy

Build a Decision tree Regressor model from X_train set and Y_train labels, with default parameters. Name the model as dt_reg.
Evaluate the model accuracy on the training data set and print its score.
Evaluate the model accuracy on the testing data set and print its score.
Predict the housing price for the first two samples of the X_test set and print them.(Hint : Use predict() function)
Fit multiple Decision tree regressors on X_train data and Y_train labels with max_depth parameter value changing from 2 to 5.
Evaluate each model's accuracy on the testing data set.
Hint: Make use of for loop
Print the max_depth value of the model with the highest accuracy.
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
import numpy as np
np.random.seed(100)
boston = datasets.load_boston()
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, random_state=30)
print(X_train.shape)
print(X_test.shape)
dt_reg = DecisionTreeRegressor()
dt_reg = dt_reg.fit(X_train, Y_train)
print(dt_reg.score(X_train,Y_train))
print(dt_reg.score(X_test,Y_test))
y_pred=dt_reg.predict(X_test[:2])
print(y_pred)
I want to get Print the max_depth value of the model with the highest accuracy. But fresco plays not submitted Let me know what is error.
max_reg = None
max_score = 0
t=()
for m in range(2, 6) :
rf_reg = DecisionTreeRegressor(max_depth=m)
rf_reg = rf_reg.fit(X_train, Y_train)
rf_reg_score = rf_reg.score(X_test,Y_test)
print (m, rf_reg_score ,max_score)
if rf_reg_score > max_score :
max_score = rf_reg_score
max_reg = rf_reg
t = (m,max_score)
print (t)
If you wish to continue to use the loop as you've done, you can create another variable called 'best_max_depth' and replace its value with dt_reg.max_depth if your if-statement condition is met (it being the best model so far).
I suggest however, you look into GridSearchCV to extract parameters from your best models and to loop through different parameter values.
max_reg = None
max_score = 0
best_max_depth = None
t=()
for m in range(2, 6) :
rf_reg = DecisionTreeRegressor(max_depth=m)
rf_reg = rf_reg.fit(X_train, Y_train)
rf_reg_score = rf_reg.score(X_test,Y_test)
print (m, rf_reg_score ,max_score)
if rf_reg_score > max_score :
max_score = rf_reg_score
max_reg = rf_reg
best_max_depth = rf_reg.max_depth
t = (m,max_score)
print (t)
Try this code -
myList = list(range(2,6))
scores =[]
for i in myList:
dt_reg = DecisionTreeRegressor(max_depth=i)
dt_reg.fit(X_train,Y_train)
scores.append(dt_reg.score(X_test, Y_test))
print(myList[scores.index(max(scores))])

sklearn Naive Bayes in python

I have trained a classifier on 'Rocks and Mines' dataset
(https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data)
And when calculating the accuracy score it always seems to be perfectly accurate (output is 1.0) which I find hard to believe. Am I making any mistakes, or naive bayes is this powerful?
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data'
data = urllib.request.urlopen(url)
df = pd.read_csv(data)
# replace R and M with 1 and 0
m = len(df.iloc[:, -1])
Y = df.iloc[:, -1].values
y_val = []
for i in range(m):
if Y[i] == 'M':
y_val.append(1)
else:
y_val.append(0)
df = df.drop(df.columns[-1], axis = 1) # dropping column containing 'R', 'M'
X = df.values
from sklearn.model_selection import train_test_split
# initializing the classifier
clf = GaussianNB()
# splitting the data
train_x, test_x, train_y, test_y = train_test_split(X, y_val, test_size = 0.33, random_state = 42)
# training the classifier
clf.fit(train_x, train_y)
pred = clf.predict(test_x) # making a prediction
from sklearn.metrics import accuracy_score
score = accuracy_score(pred, test_y)
# printing the accuracy score
print(score)
The X is the input and y_val is the output (I have converted 'R' and 'M' into 0's and 1's)
This is because of random_state argument inside train_test_split() function.
When you set random_state to an integer sklearn ensures that your data sampling is constant.
That means that everytime you run it by specifying random_state, you will get a same result, this is expected behavior.
Refer docs for further details.

How to run through loop to use non-scaled and scaled data in python for loop

I have the following code running through and fitting a model on the iris data using different modeling techniques. How can I add a second step in this process so I can demonstrate the improvement between using scaled and non-scaled data?
I don't need to run the scale transform outside of the loop, i was just having a lot of issues with transforming the data type from pandas dataframe to np array and back again.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import KFold
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.metrics import accuracy_score
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
sc = StandardScaler()
X_train_scale = sc.fit_transform(X_train)
X_test_scale = sc.transform(X_test)
numFolds = 10
kf = KFold(len(y_train), numFolds, shuffle=True)
# These are "Class objects". For each Class, find the AUC through
# 10 fold cross validation.
Models = [LogisticRegression, svm.SVC]
params = [{},{}]
for param, Model in zip(params, Models):
total = 0
for train_indices, test_indices in kf:
train_X = X_train[train_indices]; train_Y = y_train[train_indices]
test_X = X_train[test_indices]; test_Y = y_train[test_indices]
reg = Model(**param)
reg.fit(train_X, train_Y)
predictions = reg.predict(test_X)
total += accuracy_score(test_Y, predictions)
accuracy = total / numFolds
print ("CV accuracy score of {0}: {1}".format(Model.__name__, round(accuracy, 6)))
So ideally my output would be:
CV standard accuracy score of LogisticRegression: 0.683333
CV scaled accuracy score of LogisticRegression: 0.766667
CV standard accuracy score of SVC: 0.766667
CV scaled accuracy score of SVC: 0.783333
It seems like this is unclear, I am trying to loop through scaled and unscaled data, similar to how I am looping through the different ML algorithms.
I wanted to follow up with this. I was able to do this by creating a pipeline and using gridsearchCV
pipe = Pipeline([('scale', StandardScaler()),
('clf', LogisticRegression())])
param_grid = [{
'scale':[None,StandardScaler()],
'clf':[SVC(),LogisticRegression()]}]
grid_search = GridSearchCV(pipe, param_grid=param_grid,n_jobs=-1, verbose=1 )
In the end this got me the results I wanted and was able to test easily how to work between scaling and not scaling.
try this:
from __future__ import division
for param, Model in zip(params, Models):
total = 0
for train_indices, test_indices in kf:
train_X = X_train[train_indices]; train_Y = y_train[train_indices]
test_X = X_train[test_indices]; test_Y = y_train[test_indices]
reg = Model(**param)
reg.fit(train_X, train_Y)
predictions = reg.predict(test_X)
total += accuracy_score(test_Y, predictions)
accuracy = total / numFolds
print ("CV accuracy score of {0}: {1}".format(Model.__name__, round(accuracy, 6)))
# added to your code
if previous_accuracy:
improvement = 1 - (accuracy / previous_accuracy)
print "CV accuracy score improved by", improvement
else:
previous_accuracy = accuracy

My r-squared score is coming negative but my accuracy score using k-fold cross validation is coming to about 92%

For the code below, my r-squared score is coming out to be negative but my accuracies score using k-fold cross validation is coming out to be 92%. How's this possible? Im using random forest regression algorithm to predict some data. The link to the dataset is given in the link below:
https://www.kaggle.com/ludobenistant/hr-analytics
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
dataset = pd.read_csv("HR_comma_sep.csv")
x = dataset.iloc[:,:-1].values ##Independent variable
y = dataset.iloc[:,9].values ##Dependent variable
##Encoding the categorical variables
le_x1 = LabelEncoder()
x[:,7] = le_x1.fit_transform(x[:,7])
le_x2 = LabelEncoder()
x[:,8] = le_x1.fit_transform(x[:,8])
ohe = OneHotEncoder(categorical_features = [7,8])
x = ohe.fit_transform(x).toarray()
##splitting the dataset in training and testing data
from sklearn.cross_validation import train_test_split
y = pd.factorize(dataset['left'].values)[0].reshape(-1, 1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)
print(y_pred)
from sklearn.metrics import r2_score
r2_score(y_test , y_pred)
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = regressor, X = x_train, y = y_train, cv = 10)
accuracies.mean()
accuracies.std()
There are several issues with your question...
For starters, you are doing a very basic mistake: you think you are using accuracy as a metric, while you are in a regression setting and the actual metric used underneath is the mean squared error (MSE).
Accuracy is a metric used in classification, and it has to do with the percentage of the correctly classified examples - check the Wikipedia entry for more details.
The metric used internally in your chosen regressor (Random Forest) is included in the verbose output of your regressor.fit(x_train, y_train) command - notice the criterion='mse' argument:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
verbose=0, warm_start=False)
MSE is a positive continuous quantity, and it is not upper-bounded by 1, i.e. if you got a value of 0.92, this means... well, 0.92, and not 92%.
Knowing that, it is good practice to include explicitly the MSE as the scoring function of your cross-validation:
cv_mse = cross_val_score(estimator = regressor, X = x_train, y = y_train, cv = 10, scoring='neg_mean_squared_error')
cv_mse.mean()
# -2.433430574463703e-28
For all practical purposes, this is zero - you fit the training set almost perfectly; for confirmation, here is the (perfect again) R-squared score on your training set:
train_pred = regressor.predict(x_train)
r2_score(y_train , train_pred)
# 1.0
But, as always, the moment of truth comes when you apply your model on the test set; your second mistake here is that, since you train your regressor with scaled y_train, you should also scale y_test before evaluating:
y_test = sc_y.fit_transform(y_test)
r2_score(y_test , y_pred)
# 0.9998476914664215
and you get a very nice R-squared in the test set (close to 1).
What about the MSE?
from sklearn.metrics import mean_squared_error
mse_test = mean_squared_error(y_test, y_pred)
mse_test
# 0.00015230853357849051

Categories

Resources