I have trained a classifier on the 'Rocks and Mines' dataset
(https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data).
When I calculate the accuracy score it always comes out as perfectly accurate (the output is 1.0), which I find hard to believe. Am I making a mistake, or is naive Bayes really this powerful?
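import urllib.request
import pandas as pd
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data'
data = urllib.request.urlopen(url)
# the file has no header row, so header=None keeps the first sample as data
df = pd.read_csv(data, header=None)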
# replace R and M with 1 and 0
m = len(df.iloc[:, -1])
Y = df.iloc[:, -1].values
y_val = []
for i in range(m):
    if Y[i] == 'M':
        y_val.append(1)
    else:
        y_val.append(0)
df = df.drop(df.columns[-1], axis = 1) # dropping column containing 'R', 'M'
X = df.values
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
# initializing the classifier
clf = GaussianNB()
# splitting the data
train_x, test_x, train_y, test_y = train_test_split(X, y_val, test_size = 0.33, random_state = 42)
# training the classifier
clf.fit(train_x, train_y)
pred = clf.predict(test_x) # making a prediction
from sklearn.metrics import accuracy_score
score = accuracy_score(pred, test_y)
# printing the accuracy score
print(score)
The X is the input and y_val is the output (I have converted 'R' and 'M' into 0's and 1's)
This is because of the random_state argument passed to the train_test_split() function.
When you set random_state to an integer, sklearn ensures that the data is split the same way on every call.
That means that every time you run the code with the same random_state you will get the same result; this is expected behavior.
Refer to the docs for further details.
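The effect of random_state described above can be checked with a small sketch. The toy arrays below are hypothetical stand-ins for the sonar features and labels; only train_test_split and its test_size and random_state arguments come from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the sonar features and labels (hypothetical values).
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Two calls with the same integer seed produce the same partition.
_, test_a, _, _ = train_test_split(X, y, test_size=0.33, random_state=42)
_, test_b, _, _ = train_test_split(X, y, test_size=0.33, random_state=42)
same_seed_matches = np.array_equal(test_a, test_b)

# A different seed will generally select different rows for the test set.
_, test_c, _, _ = train_test_split(X, y, test_size=0.33, random_state=7)
different_seed_matches = np.array_equal(test_a, test_c)

print("same seed, same test rows:", same_seed_matches)
print("different seed, same test rows:", different_seed_matches)
```

Note that a fixed seed only makes the score repeatable. It does not by itself say whether that score is a fair estimate, so trying a few different seeds is a cheap sanity check.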
I have built a random forest model to predict whether NFL teams will score more combined points than the line Vegas has set. The features I use are Total (the total number of combined points Vegas thinks both teams will score), over_percentage (the percentage of public bets on the over), and under_percentage (the percentage of public bets on the under). The over means people are betting that both teams' combined score will be greater than the number Vegas sets, and the under means the combined score will fall below the Vegas number. When I run my model I'm getting a confusion_matrix like this
and an accuracy_score of 76%. However, the predictions do not perform well. Right now I have it giving me the probability that the classification will be 0. I'm wondering whether there are parameters I can tune, or other solutions, to prevent my model from overfitting. I have over 30K games in the training data set, so I don't think lack of data is causing the issue.
Here is the code:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
training_data = pd.read_csv(
    '/Users/aus10/NFL/Data/Betting_Data/Training_Data_Betting.csv')
test_data = pd.read_csv(
    '/Users/aus10/NFL/Data/Betting_Data/Test_Data_Betting.csv')
df_model = training_data.dropna()
X = df_model.loc[:, ["Total", "Over_Percentage",
                     "Under_Percentage"]]  # independent columns
y = df_model["Over_Under"]  # target column
results = []
model = RandomForestClassifier(
    random_state=1, n_estimators=500, min_samples_split=2, max_depth=30, min_samples_leaf=1)
n_estimators = [100, 300, 500, 800, 1200]
max_depth = [5, 8, 15, 25, 30]
min_samples_split = [2, 5, 10, 15, 100]
min_samples_leaf = [1, 2, 5, 10]
hyperF = dict(n_estimators=n_estimators, max_depth=max_depth,
              min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf)
gridF = GridSearchCV(model, hyperF, cv=3, verbose=1, n_jobs=-1)
model.fit(X, y)
skf = StratifiedKFold(n_splits=2)
skf.get_n_splits(X, y)
StratifiedKFold(n_splits=2, random_state=None, shuffle=False)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X, X
    y_train, y_test = y, y
bestF = gridF.fit(X_train, y_train)
print(bestF.best_params_)
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(round(accuracy_score(y_test, y_pred), 2))
index = 0
count = 0
while count < len(test_data):
    team = test_data.loc[index].at['Team']
    total = test_data.loc[index].at['Total']
    over_perc = test_data.loc[index].at['Over_Percentage']
    under_perc = test_data.loc[index].at['Under_Percentage']
    Xnew = [[total, over_perc, under_perc]]
    # make a prediction
    ynew = model.predict_proba(Xnew)
    # show the inputs and predicted outputs
    results.append(
        {
            'Team': team,
            'Over': ynew[0][0]
        })
    index += 1
    count += 1
sorted_results = sorted(results, key=lambda k: k['Over'], reverse=True)
df = pd.DataFrame(sorted_results, columns=[
    'Team', 'Over'])
writer = pd.ExcelWriter('/Users/aus10/NFL/Data/ML_Results/Over_Probability.xlsx',  # pylint: disable=abstract-class-instantiated
                        engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False)
df.style.set_properties(**{'text-align': 'center'})
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.width', 1000)
writer.save()
And here are links to the Google Docs with the test and training data.
Test Data
Training Data
There are a couple of things to note when using random forests. First of all, you might want to use cross_validate to measure the performance of your model.
Furthermore, random forests can be regularized by tweaking the following parameters:
Decreasing max_depth: This parameter controls the maximum depth of the trees. The larger it is, the more parameters the model has; remember that overfitting happens when an excess of parameters is being fitted.
Increasing min_samples_leaf: Instead of decreasing max_depth, we can increase the minimum number of samples required at a leaf node. This also limits the growth of the trees and prevents leaves with very few samples (overfitting!).
Decreasing max_features: As mentioned above, overfitting happens when an abundance of parameters is being fitted, and the number of parameters is directly related to the number of features in the model. Limiting the number of features considered at each split therefore helps to control overfitting.
Finally, you might want to try different values and approaches using GridSearchCV to automate the search over different combinations:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
rf_clf = RandomForestClassifier()
# max_features cannot exceed the number of columns; X in the question has three
parameters = {'max_features': [1, 2, 3], 'n_estimators': [500, 1000, 1500], 'max_depth': [2, 4, 8, 16]}
clf = GridSearchCV(rf_clf, parameters, cv=5)
clf.fit(X, y)
After fitting, clf.cv_results_ holds the performance of all the different models (one entry per combination of hyperparameters) and clf.best_params_ holds the best combination, which makes it easier to find the best one.
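The cross_validate suggestion and the regularization parameters above can be combined into a quick overfitting check. This is a minimal sketch on synthetic data: the make_classification dataset, the helper name train_valid_gap, and the specific parameter values are assumptions for illustration, not the asker's data or tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the betting data (hypothetical; 3 features, noisy labels).
X, y = make_classification(n_samples=600, n_features=3, n_informative=2,
                           n_redundant=0, flip_y=0.3, random_state=0)


def train_valid_gap(model):
    """Return mean train and validation accuracy over 5 folds."""
    scores = cross_validate(model, X, y, cv=5, return_train_score=True)
    return scores["train_score"].mean(), scores["test_score"].mean()


# Deep trees with single-sample leaves, as in the question's settings.
deep = RandomForestClassifier(n_estimators=100, max_depth=30,
                              min_samples_leaf=1, random_state=1)
# A more regularized forest: shallower trees and larger leaves.
shallow = RandomForestClassifier(n_estimators=100, max_depth=4,
                                 min_samples_leaf=10, random_state=1)

deep_train, deep_valid = train_valid_gap(deep)
shallow_train, shallow_valid = train_valid_gap(shallow)

print("deep:    train=%.2f valid=%.2f" % (deep_train, deep_valid))
print("shallow: train=%.2f valid=%.2f" % (shallow_train, shallow_valid))
```

A large gap between the train and validation scores is the usual sign of overfitting; the regularized forest would be expected to show a smaller gap.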
You are splitting the data using train_test_split with test_size=0.25. The downside is that it splits the data randomly and, by default, ignores the distribution of the classes when doing so. Your model can then suffer from sampling bias, where the class distribution of the data is not maintained across the train and test datasets.
In your train set the data could be skewed more towards a particular class than in the test set, and vice versa.
To overcome this you can use stratified k-fold cross-validation (StratifiedKFold), which maintains the distribution of the classes in every fold.
Creates K-Fold for the dataframe
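def kfold_(file):
    # `file` is the path to the CSV; the target column is assumed to be "label", as in run() below
    df = pd.read_csv(file)
    df["kfold"] = -1
    df = df.sample(frac=1).reset_index(drop=True)  # shuffle the rows
    y = df.label.values
    kf = model_selection.StratifiedKFold(n_splits=5)
    for f, (t_, v_) in enumerate(kf.split(X=df, y=y)):
        df.loc[v_, "kfold"] = f  # mark each row with its validation fold
    df.to_csv(file, index=False)  # save the fold column so run() can read it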
This function should be run for each fold of the dataset that was created based on the previous function
def run(fold):
    df = pd.read_csv(file)
    # train on every fold except the held-out one
    df_train = df[df.kfold != fold].reset_index(drop=True)
    df_valid = df[df.kfold == fold].reset_index(drop=True)
    # drop the target and the fold marker so neither leaks into the features
    x_train = df_train.drop(["label", "kfold"], axis=1).values
    y_train = df_train.label.values
    x_valid = df_valid.drop(["label", "kfold"], axis=1).values
    y_valid = df_valid.label.values
    # a classifier is needed here because the metrics below are classification metrics
    rf = RandomForestClassifier()
    # param_grid is assumed to be defined elsewhere
    grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                               cv=5, n_jobs=-1, verbose=2)
    grid_search.fit(x_train, y_train)
    # GridSearchCV refits the best estimator, so predict with it
    y_pred = grid_search.predict(x_valid)
    print(f"Fold: {fold}")
    print(confusion_matrix(y_valid, y_pred))
    print(classification_report(y_valid, y_pred))
    print(round(accuracy_score(y_valid, y_pred), 2))
Moreover, you should perform hyperparameter tuning to find the best parameters for your model; the other answer shows you how to do so.
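The stratification point made in this answer can be illustrated with a short sketch. The imbalanced toy labels below are hypothetical; the calls shown are train_test_split with its stratify argument and StratifiedKFold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Imbalanced toy labels: 80 samples of class 0 and 20 of class 1 (hypothetical data).
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the 80/20 class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
print("test share of class 1:", y_te.mean())

# StratifiedKFold does the same for every validation fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_shares = [y[valid_idx].mean() for _, valid_idx in skf.split(X, y)]
print("per-fold share of class 1:", fold_shares)
```

If a single hold-out split is enough, passing stratify=y to train_test_split may be the smallest change; StratifiedKFold is the option when every sample should be used for validation once.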
Build a Decision tree Regressor model from X_train set and Y_train labels, with default parameters. Name the model as dt_reg.
Evaluate the model accuracy on the training data set and print its score.
Evaluate the model accuracy on the testing data set and print its score.
Predict the housing price for the first two samples of the X_test set and print them.(Hint : Use predict() function)
Fit multiple Decision tree regressors on X_train data and Y_train labels with max_depth parameter value changing from 2 to 5.
Evaluate each model's accuracy on the testing data set.
Hint: Make use of for loop
Print the max_depth value of the model with the highest accuracy.
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
import numpy as np
np.random.seed(100)
boston = datasets.load_boston()
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, random_state=30)
print(X_train.shape)
print(X_test.shape)
dt_reg = DecisionTreeRegressor()
dt_reg = dt_reg.fit(X_train, Y_train)
print(dt_reg.score(X_train,Y_train))
print(dt_reg.score(X_test,Y_test))
y_pred=dt_reg.predict(X_test[:2])
print(y_pred)
I want to print the max_depth value of the model with the highest accuracy, but Fresco Play does not accept my submission. Please let me know what the error is.
max_reg = None
max_score = 0
t=()
for m in range(2, 6):
    rf_reg = DecisionTreeRegressor(max_depth=m)
    rf_reg = rf_reg.fit(X_train, Y_train)
    rf_reg_score = rf_reg.score(X_test, Y_test)
    print(m, rf_reg_score, max_score)
    if rf_reg_score > max_score:
        max_score = rf_reg_score
        max_reg = rf_reg
        t = (m, max_score)
print(t)
If you wish to continue to use the loop as you've done, you can create another variable called 'best_max_depth' and set it to rf_reg.max_depth whenever your if-statement condition is met (that is, whenever the current model is the best so far).
However, I suggest you look into GridSearchCV, which loops through different parameter values for you and exposes the parameters of the best model.
max_reg = None
max_score = 0
best_max_depth = None
t=()
for m in range(2, 6):
    rf_reg = DecisionTreeRegressor(max_depth=m)
    rf_reg = rf_reg.fit(X_train, Y_train)
    rf_reg_score = rf_reg.score(X_test, Y_test)
    print(m, rf_reg_score, max_score)
    if rf_reg_score > max_score:
        max_score = rf_reg_score
        max_reg = rf_reg
        best_max_depth = rf_reg.max_depth
        t = (m, max_score)
print(t)
print(best_max_depth)  # the max_depth of the best model, as the task asks
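The GridSearchCV suggestion above could look roughly like the following. This is a sketch under assumptions: make_regression data stands in for the housing data, and the search picks max_depth by cross-validation on the training set rather than by the test-set score used in the loop, so it may select a different depth than the loop does.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data stands in for the housing data (hypothetical).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=30)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=30)

# Try max_depth 2..5, scoring each value by 5-fold cross-validated R^2.
search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid={"max_depth": list(range(2, 6))}, cv=5)
search.fit(X_train, Y_train)

best_max_depth = search.best_params_["max_depth"]
print("best max_depth:", best_max_depth)
print("test R^2 of the refit model:", search.score(X_test, Y_test))
```

For a regressor, score() returns the coefficient of determination (R^2), which is what the task text loosely calls accuracy.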
Try this code -
myList = list(range(2,6))
scores =[]
for i in myList:
    dt_reg = DecisionTreeRegressor(max_depth=i)
    dt_reg.fit(X_train, Y_train)
    scores.append(dt_reg.score(X_test, Y_test))
print(myList[scores.index(max(scores))])
I am trying to use a GMM model to classify the Iris data set. My model seems to produce inconsistent results, some runs have an accuracy of 90% and some of 33% (some even going all the way to 0%). I am not sure if the mistake occurs at the classification stage or my preprocessing or print statements for accuracy are incorrect.
I have tried changing the number of iterations and the init_params when defining the classifier. I am using scikit-learn version 0.21.3 and Python 3.7.0
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.mixture import GaussianMixture as GMM
def data_processor(input_file):
    data = []
    with open(input_file, 'r') as f:
        for line in f:
            line = line.strip().split(',')
            if line[-1] == 'Iris-setosa':
                line[-1] = 0
            elif line[-1] == 'Iris-versicolor':
                line[-1] = 1
            elif line[-1] == 'Iris-virginica':
                line[-1] = 2
            data.append(line)
    return np.array(data, dtype=float)
x, y = data_processor('iris.txt')[:, :-1], data_processor('iris.txt')[:, -1]
# split data into 5 chunks, 80% used for training and 20% for testing
skf = StratifiedKFold(n_splits=5)
train_index, test_index = next(iter(skf.split(x, y)))
x_train, y_train = x[train_index], y[train_index]
x_test, y_test = x[test_index], y[test_index]
# calculate number of components in data set using number of classes
num_classes = len(np.unique(y_train))
# build classifier and fit model
classifier = GMM(n_components=num_classes, covariance_type='full',
                 max_iter=200)
classifier.fit(x_train)
# Make predictions and print accuracy
y_train_pred = classifier.predict(x_train)
accuracy_training = np.mean(y_train_pred.ravel() == y_train.ravel()) * 100
print('Accuracy on training data =', accuracy_training)
y_test_pred = classifier.predict(x_test)
accuracy_testing = np.mean(y_test_pred.ravel() == y_test.ravel()) * 100
print('Accuracy on testing data =', accuracy_testing)
The number of test data is 150 rows, 50 per class. Previously with different models on the same data set, I was getting an accuracy of around 90-93%.
The input file contains data in this format: 5.9,3.0,5.1,1.8,Iris-virginica
I have also tried to scale the data using:
x = preprocessing.scale(x)
The data now looks as follows:
x[0] = [-0.90068117 1.03205722 -1.3412724 -1.31297673]
y[0] = 0.0
However, this did not affect the accuracy.
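One possible explanation, offered here as an assumption rather than a confirmed diagnosis: GaussianMixture is fitted without labels, so the index it assigns to each component is arbitrary and need not match the 0/1/2 class encoding. Comparing component indices to labels directly would then give accuracies that depend on the initialization. The sketch below uses scikit-learn's bundled iris data instead of 'iris.txt' and maps each component to the most common true label among its members before scoring.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X, y = load_iris(return_X_y=True)

gmm = GaussianMixture(n_components=3, covariance_type="full",
                      max_iter=200, random_state=0)
components = gmm.fit_predict(X)

# Raw comparison: only meaningful if component k happens to be class k.
raw_accuracy = np.mean(components == y)

# Map each component to the most common true label among its members.
mapping = {}
for k in np.unique(components):
    mapping[k] = np.bincount(y[components == k]).argmax()
mapped = np.array([mapping[k] for k in components])
mapped_accuracy = np.mean(mapped == y)

print("raw accuracy:", raw_accuracy)
print("accuracy after mapping components to classes:", mapped_accuracy)
```

The mapping here is learned and scored on the same samples for brevity; with a train/test split it would be built from the training set only and then applied to the test predictions.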
Is it possible to limit the maximum value of the prediction for each sample when using scikit-learn? In my input data there is a column ("Announcement") which holds the maximum value for that particular sample; "Result" is the true value. How can I limit each prediction to the range 0 to Announcement?
Here is a very small code snippet / example:
#!/usr/bin/env python3
from sklearn.linear_model import LinearRegression
import pandas as pd
from sklearn.model_selection import train_test_split
def main():
    mylist = [
        {'Id':101,'Username':"john",'Date':1475359200,'Announcement':111,'Result':50},
        {'Id':104,'Username':"john",'Date':1475359905,'Announcement':40,'Result':23},
        {'Id':222,'Username':"dave",'Date':1475399212,'Announcement':600,'Result':420},
        {'Id':301,'Username':"john",'Date':1475559256,'Announcement':300,'Result':150},
        {'Id':407,'Username':"dave",'Date':1475659277,'Announcement':10,'Result':8}
    ]
    df = pd.DataFrame(mylist)
    df['Username'] = pd.Series(pd.factorize(df['Username'])[0] + 1).astype('category')
    y = df['Result'].values
    df = df.drop('Result', axis=1)
    X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=2)
    clf = LinearRegression()
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    print("predictions")
    print(predictions)
    print("true values")
    print(y_test)
if __name__ == '__main__':
    main()
output:
predictions
[ 255.81049569 52.35007969]
true values
[420 8]
The issue in this case is the second value: the prediction is 52.35 although the Announcement for that sample is only 10.
thanks in advance
I'm not sure how to do this natively in Scikit-Learn, but you can set the predictions to be the value in the Announcement column if they are greater than this value with a list comprehension:
predictions = [p if p < a else a for p, a in zip(predictions, X_test['Announcement'])]
Result:
predictions
[255.81049569325114, 10]
true values
[420 8]
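A vectorized alternative to the list comprehension is NumPy's clip, which can also enforce the lower bound of 0 that the question asks for. This is a minimal sketch: the two prediction values are copied from the output above, and the announcement array holds the Announcement values of the same two test rows.

```python
import numpy as np

# Values taken from the example output above.
predictions = np.array([255.81049569, 52.35007969])
announcement = np.array([600, 10])  # Announcement of the two test rows

# Clamp each prediction to the interval [0, Announcement] element-wise.
clipped = np.clip(predictions, 0, announcement)
print(clipped)
```

Inside main() the same call would presumably be np.clip(predictions, 0, X_test['Announcement'].values). Either way this only post-processes the output; the linear model itself is still fitted without the constraint.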
I have the following code, which fits several modeling techniques to the iris data in a loop. How can I add a second step to this process so I can demonstrate the improvement between using scaled and non-scaled data?
I don't need to run the scale transform outside of the loop; I was just having a lot of issues converting the data from a pandas DataFrame to a NumPy array and back again.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import KFold
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.metrics import accuracy_score
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
sc = StandardScaler()
X_train_scale = sc.fit_transform(X_train)
X_test_scale = sc.transform(X_test)
numFolds = 10
kf = KFold(len(y_train), numFolds, shuffle=True)
# These are "Class objects". For each Class, find the AUC through
# 10 fold cross validation.
Models = [LogisticRegression, svm.SVC]
params = [{},{}]
for param, Model in zip(params, Models):
    total = 0
    for train_indices, test_indices in kf:
        train_X = X_train[train_indices]; train_Y = y_train[train_indices]
        test_X = X_train[test_indices]; test_Y = y_train[test_indices]
        reg = Model(**param)
        reg.fit(train_X, train_Y)
        predictions = reg.predict(test_X)
        total += accuracy_score(test_Y, predictions)
    accuracy = total / numFolds
    print ("CV accuracy score of {0}: {1}".format(Model.__name__, round(accuracy, 6)))
So ideally my output would be:
CV standard accuracy score of LogisticRegression: 0.683333
CV scaled accuracy score of LogisticRegression: 0.766667
CV standard accuracy score of SVC: 0.766667
CV scaled accuracy score of SVC: 0.783333
It seems like this is unclear, I am trying to loop through scaled and unscaled data, similar to how I am looping through the different ML algorithms.
I wanted to follow up on this. I was able to do it by creating a pipeline and using GridSearchCV:
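from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression())])
param_grid = [{
    'scale': [None, StandardScaler()],  # None skips the scaling step
    'clf': [SVC(), LogisticRegression()]}]
grid_search = GridSearchCV(pipe, param_grid=param_grid, n_jobs=-1, verbose=1)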
In the end this got me the results I wanted and was able to test easily how to work between scaling and not scaling.
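For completeness, here is a self-contained sketch of the pipeline approach that also fits the search and prints one line per combination, in the format the question asked for. It rests on a few assumptions: 'passthrough' is used in place of None to skip scaling, cv=10 mirrors the ten folds in the question, the search runs on the full first-two-feature iris data rather than a separate training split, and max_iter=1000 is added only to avoid convergence warnings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data[:, :2], iris.target  # first two features, as in the question

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
param_grid = {
    "scale": ["passthrough", StandardScaler()],  # unscaled vs. scaled
    "clf": [SVC(), LogisticRegression(max_iter=1000)],
}
grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=10)
grid_search.fit(X, y)

# One line per combination of scaling option and classifier.
results = grid_search.cv_results_
for params, score in zip(results["params"], results["mean_test_score"]):
    scaling = "scaled" if isinstance(params["scale"], StandardScaler) else "standard"
    name = type(params["clf"]).__name__
    print("CV {0} accuracy score of {1}: {2:.6f}".format(scaling, name, score))
```

Because the scaler sits inside the pipeline, it is refitted on the training part of every fold, so no information from the validation part leaks into the scaling.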
try this:
from __future__ import division, print_function
previous_accuracy = None  # no baseline until the first model has been scored
for param, Model in zip(params, Models):
    total = 0
    for train_indices, test_indices in kf:
        train_X = X_train[train_indices]; train_Y = y_train[train_indices]
        test_X = X_train[test_indices]; test_Y = y_train[test_indices]
        reg = Model(**param)
        reg.fit(train_X, train_Y)
        predictions = reg.predict(test_X)
        total += accuracy_score(test_Y, predictions)
    accuracy = total / numFolds
    print("CV accuracy score of {0}: {1}".format(Model.__name__, round(accuracy, 6)))
    # added to your code
    if previous_accuracy is not None:
        # relative change against the baseline; positive means an improvement
        improvement = (accuracy / previous_accuracy) - 1
        print("CV accuracy score improved by", improvement)
    else:
        previous_accuracy = accuracy