I have a large dataframe, and I want to predict the last column based on the other columns with XGBoost. My code is below, but the predictions are wrong: I just get a constant value.
The data is not time series, and my trees also can't be plotted.
Overall, is it possible, with 20 columns, to predict the 20th one using the other 19 columns with this method?
#XGBoost
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error
#Separate the target variable
X, y = f.iloc[:,:-1],f.iloc[:,-1]
data_dmatrix = xgb.DMatrix(data=X,label=y)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=123)
#Regressor
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1,
                          max_depth=5, alpha=10, n_estimators=10)
#Fit the regressor to the training set and make predictions on the test set
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
#RMSE
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))
#k-fold Cross Validation
params = {"objective":"reg:squarederror",'colsample_bytree': 0.3,'learning_rate': 0.1,
'max_depth': 10, 'alpha': 10}
cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3,
num_boost_round=50,early_stopping_rounds=10,metrics="rmse", as_pandas=True, seed=123)
print((cv_results["test-rmse-mean"]).tail(1))
#Visualizing
xg_reg = xgb.train(params=params, dtrain=data_dmatrix, num_boost_round=10)
#plot the trees
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [50, 10]   # set the figure size before creating the plot
xgb.plot_tree(xg_reg, num_trees=5)
plt.show()
#Examine the importance of each feature column in the original dataset within the model
plt.rcParams['figure.figsize'] = [5, 5]
xgb.plot_importance(xg_reg)
plt.show()
First of all, yes, the approach of predicting the last column from the other 19 columns is fine.
If the model only produces constant values, I would change the parameters of the model: with only 10 boosting rounds, a learning rate of 0.1 and a strong L1 penalty (alpha=10), the ensemble may never move far from its constant base score.
Or train a linear model as a baseline first.
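As a minimal baseline sketch, assuming the same X_train/X_test/y_train/y_test split from the code above (LinearRegression is just one possible choice):
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
# Plain least-squares baseline; if even this beats the XGBoost RMSE, revisit the booster's parameters
baseline = LinearRegression().fit(X_train, y_train)
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline.predict(X_test)))
print("Baseline RMSE: %f" % baseline_rmse)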
I am getting this error (TypeError: fit() takes 2 positional arguments but 3 were given) while trying to perform GridSearchCV on the Random Forest algorithm. My code is below. I can't find a solution to this here. Can anyone help me, please?
# define dataset
feature_cols = ['V123','V170','V171','V188','V189','V190','V199','V200',
'V201','V228','V230','V242','V243','V244','V246',
'V257','V258']
X = newdf2NAb[feature_cols] # Features
y = newdf2NAb.isFraud
# split into train (0.8) and test (0.2) sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
xtrain = sc_x.fit_transform(X_train)
xtest = sc_x.transform(X_test)
# create model
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
# set candidate values for the random forest hyperparameters
param_grid_RF = {'n_estimators': [50, 100, 200],
                 'min_samples_split': [2, 5, 10],
                 'min_samples_leaf': [1, 3, 5],
                 'criterion': ["gini", "entropy", "log_loss"]}
# perform grid search with 5-fold cross-validation
from sklearn.model_selection import GridSearchCV
gs = GridSearchCV(rfc, param_grid_RF, cv=5)
gs.fit(xtrain, y_train)
# compute and print the best cross-validation score (accuracy by default for a classifier)
print(gs.best_score_)
# find and print the combination of parameters that gave the best score
print(gs.best_params_)
I have a random forest model I built to predict whether NFL teams will score more combined points than the line Vegas has set. The features I use are Total (the total number of combined points Vegas thinks both teams will score), over_percentage (the percentage of public bets on the over), and under_percentage (the percentage of public bets on the under). The over means people are betting that both teams' combined score will be greater than the number Vegas sets; the under means the combined score will go under the Vegas number. When I run my model I get a confusion matrix like this
and an accuracy_score of 76%. However, the predictions do not perform well. Right now it gives me the probability that the classification will be 0. I'm wondering if there are parameters I can tune, or other solutions, to prevent my model from overfitting. I have over 30K games in the training data set, so I don't think lack of data is causing the issue.
Here is the code:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
training_data = pd.read_csv(
'/Users/aus10/NFL/Data/Betting_Data/Training_Data_Betting.csv')
test_data = pd.read_csv(
'/Users/aus10/NFL/Data/Betting_Data/Test_Data_Betting.csv')
df_model = training_data.dropna()
X = df_model.loc[:, ["Total", "Over_Percentage",
"Under_Percentage"]] # independent columns
y = df_model["Over_Under"] # target column
results = []
model = RandomForestClassifier(
random_state=1, n_estimators=500, min_samples_split=2, max_depth=30, min_samples_leaf=1)
n_estimators = [100, 300, 500, 800, 1200]
max_depth = [5, 8, 15, 25, 30]
min_samples_split = [2, 5, 10, 15, 100]
min_samples_leaf = [1, 2, 5, 10]
hyperF = dict(n_estimators=n_estimators, max_depth=max_depth,
min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf)
gridF = GridSearchCV(model, hyperF, cv=3, verbose=1, n_jobs=-1)
model.fit(X, y)
skf = StratifiedKFold(n_splits=2)
skf.get_n_splits(X, y)
StratifiedKFold(n_splits=2, random_state=None, shuffle=False)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X, X
    y_train, y_test = y, y
bestF = gridF.fit(X_train, y_train)
print(bestF.best_params_)
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(round(accuracy_score(y_test, y_pred), 2))
index = 0
count = 0
while count < len(test_data):
    team = test_data.loc[index].at['Team']
    total = test_data.loc[index].at['Total']
    over_perc = test_data.loc[index].at['Over_Percentage']
    under_perc = test_data.loc[index].at['Under_Percentage']
    Xnew = [[total, over_perc, under_perc]]
    # make a prediction
    ynew = model.predict_proba(Xnew)
    # show the inputs and predicted outputs
    results.append(
        {
            'Team': team,
            'Over': ynew[0][0]
        })
    index += 1
    count += 1
sorted_results = sorted(results, key=lambda k: k['Over'], reverse=True)
df = pd.DataFrame(sorted_results, columns=[
'Team', 'Over'])
writer = pd.ExcelWriter('/Users/aus10/NFL/Data/ML_Results/Over_Probability.xlsx', # pylint: disable=abstract-class-instantiated
engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False)
df.style.set_properties(**{'text-align': 'center'})
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.width', 1000)
writer.save()
And here are links to the Google Docs with the test and training data.
Test Data
Training Data
There are a couple of things to note when using random forests. First of all, you might want to use cross_validate in order to measure the performance of your model.
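As a minimal sketch, assuming the model, X and y already defined in your code:
from sklearn.model_selection import cross_validate
# 5-fold cross-validation: a mean and a spread of the score instead of a single split
cv = cross_validate(model, X, y, cv=5, scoring='accuracy')
print(cv['test_score'].mean(), cv['test_score'].std())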
Furthermore, random forests can be regularized by tweaking the following parameters (a short sketch follows this list):
Decreasing max_depth: this parameter controls the maximum depth of the trees. The bigger it is, the more parameters the model will fit; remember that overfitting happens when there is an excess of parameters being fitted.
Increasing min_samples_leaf: instead of decreasing max_depth, we can increase the minimum number of samples required at a leaf node. This also limits the growth of the trees and prevents leaves with very few samples (overfitting!).
Decreasing max_features: as mentioned above, overfitting happens when there is an abundance of parameters being fitted, and the number of parameters is directly related to the number of features each tree considers. Limiting the number of features available at each split therefore helps control overfitting.
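As a sketch only, with illustrative values that are not tuned for your data (assuming a proper X_train/y_train split):
from sklearn.ensemble import RandomForestClassifier
# A more constrained forest than n_estimators=500, max_depth=30, min_samples_leaf=1
rf_regularized = RandomForestClassifier(n_estimators=500,
                                        max_depth=8,          # shallower trees
                                        min_samples_leaf=10,  # each leaf must cover at least 10 samples
                                        max_features='sqrt',  # random subset of features per split
                                        random_state=1)
rf_regularized.fit(X_train, y_train)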
Finally, you might want to try different values and approaches, using GridSearchCV to automate the search over different combinations:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
rf_clf = RandomForestClassifier()
parameters = {'max_features': np.arange(5, 10), 'n_estimators': [500, 1000, 1500], 'max_depth': [2, 4, 8, 16]}
clf = GridSearchCV(rf_clf, parameters, cv=5)
clf.fit(X, y)
This will return a table with the performance of all the different models (one per combination of hyperparameters), which makes it easier to find the best one.
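For instance, a small sketch of how the results could be inspected afterwards (cv_results_, best_params_ and best_score_ are standard GridSearchCV attributes):
import pandas as pd
# One row per hyperparameter combination, ranked by mean cross-validated score
results_table = pd.DataFrame(clf.cv_results_)
print(results_table[['params', 'mean_test_score', 'rank_test_score']].sort_values('rank_test_score').head())
print(clf.best_params_, clf.best_score_)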
You are splitting the data using train_test_split with test_size=0.25. The downside is that it splits the data randomly and completely ignores the distribution of the classes when doing so. Your model will suffer from sampling bias, where the correct distribution of the data is not maintained across the train and test datasets.
In your train set the data could be skewed more towards a particular class compared to the test set, and vice versa.
To overcome this you can use stratified k-fold cross-validation (StratifiedKFold), which maintains the distribution of the classes across folds.
This creates the folds for the dataframe:
import pandas as pd
from sklearn import model_selection

def kfold_(file):
    df = pd.read_csv(file)
    df["kfold"] = -1
    df = df.sample(frac=1).reset_index(drop=True)
    y = df.target.values
    kf = model_selection.StratifiedKFold(n_splits=5)
    for f, (t_, v_) in enumerate(kf.split(X=df, y=y)):
        df.loc[v_, "kfold"] = f
    return df
This function should then be run for each fold of the dataset created by the previous function:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

def run(fold):
    df = pd.read_csv(file)
    df_train = df[df.kfold != fold].reset_index(drop=True)
    df_valid = df[df.kfold == fold].reset_index(drop=True)
    x_train = df_train.drop("label", axis=1).values
    y_train = df_train.label.values
    x_valid = df_valid.drop("label", axis=1).values
    y_valid = df_valid.label.values
    rf = RandomForestClassifier()  # classifier, since the metrics below are classification metrics
    grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                               cv=5, n_jobs=-1, verbose=2)
    grid_search.fit(x_train, y_train)
    y_pred = grid_search.predict(x_valid)
    print(f"Fold: {fold}")
    print(confusion_matrix(y_valid, y_pred))
    print(classification_report(y_valid, y_pred))
    print(round(accuracy_score(y_valid, y_pred), 2))
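For completeness, a small usage sketch, assuming the five folds created by kfold_ above:
for fold in range(5):
    run(fold)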
Moreover, you should perform hyperparameter tuning to find the best parameters; the other answer shows you how to do so.
I am building a neural network with my research data in two ways: with a statistical program (SPSS) and with Python.
I am using scikit-learn's MLPRegressor. The problem I have is that, although my code apparently is well written (because it runs), the results do not make sense. The R² score should be around 0.70 (it is -4147.64), and the correlation represented in the graph should be almost linear (it is just a straight line at a constant distance from the X axis). Also, the x and y axes should have values ranging from 0 to 180, which is not the case (X from 20 to 100, y from -4100 to -3500).
If any of you can give a hand I would really appreciate it.
Thank you!
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score
vhdata = pd.read_csv('vhrawdata.csv')
vhdata.head()
X = vhdata[['PA NH4', 'PH NH4', 'PA K', 'PH K', 'PA NH4 + PA K', 'PH NH4 + PH K', 'PA IS', 'PH IS']]
y = vhdata['PMI']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
from sklearn.preprocessing import Normalizer
scaler = Normalizer().fit(X_train)
X_train_norm = scaler.transform(X_train)
X_test_norm = scaler.transform(X_test)
nnref = MLPRegressor(hidden_layer_sizes = [4], activation = 'logistic', solver = 'sgd', alpha = 1,
learning_rate= 'constant', learning_rate_init= 0.6, max_iter=40000, momentum=
0.3).fit(X_train, y_train)
y_predictions= nnref.predict(X_test)
print('Accuracy of NN classifier on training set (R2 score): {:.2f}'.format(nnref.score(X_train_norm, y_train)))
print('Accuracy of NN classifier on test set (R2 score): {:.2f}'.format(nnref.score(X_test_norm, y_test)))
plt.figure()
plt.scatter(y_test,y_predictions, marker = 'o', color='red')
plt.xlabel('PMI expected (hrs)')
plt.ylabel('PMI predicted (hrs)')
plt.title('Correlation of PMI predicted by MLP regressor and the actual PMI')
plt.show()
You have a couple of issues. First, it is important to use the right scaling or normalization when working with an MLP. Note that Normalizer rescales each sample (row) to unit norm rather than scaling each feature; NNs generally work best with features in the 0 to 1 range, so consider using sklearn's MinMaxScaler to accomplish this.
So:
from sklearn.preprocessing import Normalizer
scaler = Normalizer().fit(X_train)
X_train_norm = scaler.transform(X_train)
X_test_norm = scaler.transform(X_test)
Should be:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# fit the scaler on the training data only, then apply the same transform to the test data
X_train_norm = scaler.fit_transform(X_train)
X_test_norm = scaler.transform(X_test)
Next, you are training and testing on the unscaled data, but then computing your scores on the scaled data. Meaning:
nnref = MLPRegressor(hidden_layer_sizes = [4], activation = 'logistic', solver = 'sgd', alpha = 1,
learning_rate= 'constant', learning_rate_init= 0.6, max_iter=40000, momentum=
0.3).fit(X_train, y_train)
should be:
nnref = MLPRegressor(hidden_layer_sizes = [4], activation = 'logistic', solver = 'sgd', alpha = 1,
learning_rate= 'constant', learning_rate_init= 0.6, max_iter=40000, momentum=
0.3).fit(X_train_norm , y_train)
And...
y_predictions= nnref.predict(X_test)
Should be:
y_predictions= nnref.predict(X_test_norm)
Additional notes...
Scoring the model on the data it was trained on provides little value: the model has already seen those samples, so the score is optimistically biased and says nothing about generalization. A large gap between the training score and the test score is the classic sign of overfitting.
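As a small illustration of the difference, assuming the corrected nnref fitted on X_train_norm as shown above:
train_r2 = nnref.score(X_train_norm, y_train)  # optimistic: the model has already seen these samples
test_r2 = nnref.score(X_test_norm, y_test)     # honest estimate of how it generalizes
print('train R2: {:.2f}, test R2: {:.2f}'.format(train_r2, test_r2))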
Well, I found a mistake:
You train the model on samples that weren't normalized:
nnref = MLPRegressor(...).fit(X_train, y_train)
But later you score it on normalized samples:
nnref.score(X_train_norm, y_train)
Also the x and y axes should have values ranging from 0 to 180, which is not the case (X from 20 to 100, y from -4100 to -3500)
Scikit-learn does not change values by itself. If X is not in the range you expect, it means you've changed it somewhere, or your assumption about the range of X is incorrect.
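A quick way to check what is actually being plotted, using the y_test and y_predictions from your code:
print(y_test.min(), y_test.max())                # range on the x axis of the scatter plot
print(y_predictions.min(), y_predictions.max())  # range on the y axis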
I have tried to create a confusion matrix for a KNN classifier in Python, but the labeled classes are wrong.
The class attribute of the dataset is 2 (for benign) or 4 (for malignant), but when I plot the confusion matrix, all labels are 2. The code I use is below.
Data source: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
KNN classifier on Breast Cancer Wisconsin (Diagnostic) Data Set from UCI:
import numpy as np
import pandas as pd
from sklearn import neighbors
from sklearn.model_selection import train_test_split

data = pd.read_csv('/breast-cancer-wisconsin.data')
data.replace('?', 0, inplace=True)
data.drop('id', 1, inplace=True)
X = np.array(data.drop(' class ', 1))
Y = np.array(data[' class '])
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, Y_train)
accuracy = clf.score(X_test, Y_test)
Plot confusion matrix
import matplotlib.pyplot as plt
from sklearn.metrics import plot_confusion_matrix
disp = plot_confusion_matrix(clf, X_test, Y_test,
                             display_labels=Y,
                             cmap=plt.cm.Blues)
Confusion matrix
The problem is that you're specifying the display_labels argument with Y, where it should just be the target names used for plotting. As written, it just uses the first two values that appear in Y, which happen to be 2, 2. Note too that, as mentioned in the docs, the displayed labels will be the same as specified in labels if it is provided, so you just need:
import numpy as np
from sklearn.metrics import plot_confusion_matrix
fig, ax = plt.subplots(figsize=(8, 8))
disp = plot_confusion_matrix(clf, X_test, Y_test,
                             labels=np.unique(Y),
                             cmap=plt.cm.Blues, ax=ax)
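If you also want human-readable class names instead of the raw values, you can pass both arguments; the benign/malignant mapping below is taken from your description:
disp = plot_confusion_matrix(clf, X_test, Y_test,
                             labels=[2, 4],
                             display_labels=['benign (2)', 'malignant (4)'],
                             cmap=plt.cm.Blues, ax=ax)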
For the code below, my R-squared score is coming out negative, but my accuracy score using k-fold cross-validation is about 92%. How is this possible? I'm using the random forest regression algorithm to predict some data. The dataset is available at the link below:
https://www.kaggle.com/ludobenistant/hr-analytics
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
dataset = pd.read_csv("HR_comma_sep.csv")
x = dataset.iloc[:,:-1].values ##Independent variable
y = dataset.iloc[:,9].values ##Dependent variable
##Encoding the categorical variables
le_x1 = LabelEncoder()
x[:,7] = le_x1.fit_transform(x[:,7])
le_x2 = LabelEncoder()
x[:,8] = le_x1.fit_transform(x[:,8])
ohe = OneHotEncoder(categorical_features = [7,8])
x = ohe.fit_transform(x).toarray()
##splitting the dataset in training and testing data
from sklearn.cross_validation import train_test_split
y = pd.factorize(dataset['left'].values)[0].reshape(-1, 1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)
print(y_pred)
from sklearn.metrics import r2_score
r2_score(y_test , y_pred)
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = regressor, X = x_train, y = y_train, cv = 10)
accuracies.mean()
accuracies.std()
There are several issues with your question...
For starters, you are making a very basic mistake: you think you are using accuracy as a metric, while you are in a regression setting, where the actual metric used underneath is the mean squared error (MSE).
Accuracy is a metric used in classification, and it has to do with the percentage of correctly classified examples - check the Wikipedia entry for more details.
The metric used internally in your chosen regressor (Random Forest) is included in the verbose output of your regressor.fit(x_train, y_train) command - notice the criterion='mse' argument:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
verbose=0, warm_start=False)
MSE is a positive continuous quantity, and it is not upper-bounded by 1, i.e. if you got a value of 0.92, this means... well, 0.92, and not 92%.
Knowing that, it is good practice to include explicitly the MSE as the scoring function of your cross-validation:
cv_mse = cross_val_score(estimator = regressor, X = x_train, y = y_train, cv = 10, scoring='neg_mean_squared_error')
cv_mse.mean()
# -2.433430574463703e-28
For all practical purposes, this is zero - you fit the training set almost perfectly; for confirmation, here is the (perfect again) R-squared score on your training set:
train_pred = regressor.predict(x_train)
r2_score(y_train , train_pred)
# 1.0
But, as always, the moment of truth comes when you apply your model on the test set; your second mistake here is that, since you train your regressor with scaled y_train, you should also scale y_test before evaluating:
y_test = sc_y.fit_transform(y_test)
r2_score(y_test , y_pred)
# 0.9998476914664215
and you get a very nice R-squared in the test set (close to 1).
What about the MSE?
from sklearn.metrics import mean_squared_error
mse_test = mean_squared_error(y_test, y_pred)
mse_test
# 0.00015230853357849051