Check the accuracy of decision tree classifier with Python

I wrote a function that takes a dataset (Excel / pandas) and some values, and then predicts the outcome with a decision tree classifier, using sklearn.
Can you help me with this? I have looked over the web and this website, but I couldn't find an answer that works.
I have tried the following, but it does not work:
from sklearn.metrics import accuracy_score
score = accuracy_score(variable_list, result_list)
This is the error that I get:
ValueError: Classification metrics can't handle a mix of continuous-multioutput and multiclass targets
This is the code (I removed the accuracy code):
import pandas as pd
import math
import xlrd
from sklearn.model_selection import train_test_split
from sklearn import tree

def predict_concrete_class(input_data, cement, blast_fur_slug, fly_ash,
                           water, superpl, coarse_aggr, fine_aggr, days):
    data_for_tree = concrete_strenght_class(input_data)
    variable_list = []
    result_list = []
    for index, row in data_for_tree.iterrows():
        variable = row.tolist()
        variable = variable[0:8]
        variable_list.append(variable)
        result_list.append(row[-1])
    decision_tree = tree.DecisionTreeClassifier()
    decision_tree = decision_tree.fit(variable_list, result_list)
    input_values = [cement, blast_fur_slug, fly_ash, water, superpl, coarse_aggr, fine_aggr, days]
    prediction = decision_tree.predict([input_values])
    info = "Prediction of future concrete class after " + str(days) + " days: " + str(prediction[0])
    return info

print(predict_concrete_class(data, 500, 0, 0, 200, 0, 1125, 613, 3))

Split your data into train and test:
var_train, var_test, res_train, res_test = train_test_split(variable_list, result_list, test_size = 0.3)
Train your decision tree on train set:
decision_tree = tree.DecisionTreeClassifier()
decision_tree = decision_tree.fit(var_train, res_train)
Test model performance by calculating accuracy on test set:
res_pred = decision_tree.predict(var_test)
score = accuracy_score(res_test, res_pred)
Or you could directly use decision_tree.score:
score = decision_tree.score(var_test, res_test)
The error you are getting is because you are passing variable_list (your list of input features) to accuracy_score. It expects the true labels and the predicted labels, not the features.
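Putting those steps together, here is a minimal sketch of the corrected evaluation, assuming variable_list and result_list are built exactly as in the question:
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# hold out 30% of the rows for testing
var_train, var_test, res_train, res_test = train_test_split(
    variable_list, result_list, test_size=0.3, random_state=42)

# fit the tree only on the training portion
decision_tree = tree.DecisionTreeClassifier()
decision_tree.fit(var_train, res_train)

# compare the true test labels against the tree's predictions
res_pred = decision_tree.predict(var_test)
print(accuracy_score(res_test, res_pred))  # same value as decision_tree.score(var_test, res_test)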

You should perform cross-validation if you want to check the accuracy of your system.
You have to split your data set into two parts. The first one is used to train your system. Then you run the prediction process on the second part of the data set and compare the predicted results with the true labels. With this method, you check your system on data it has not seen during training.
In order to split your set, you should use train_test_split from sklearn.model_selection.
It will split your set randomly.
Here is a good read on the topic: https://machinelearningmastery.com/k-fold-cross-validation/
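If you want to go beyond a single split, a minimal k-fold cross-validation sketch (again assuming the variable_list and result_list from the question) could look like this:
from sklearn import tree
from sklearn.model_selection import cross_val_score

decision_tree = tree.DecisionTreeClassifier()

# 5-fold cross-validation: each fold is used once as the test set
# while the remaining folds are used for training
scores = cross_val_score(decision_tree, variable_list, result_list, cv=5)
print(scores)         # accuracy of each fold
print(scores.mean())  # average accuracy across the folds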

Related

Found input variables with inconsistent number of samples

I get the following error when trying to compute the score or the mean squared error for a single sample: "Found input variables with inconsistent number of samples". The model is a decision tree regressor from sklearn that predicts the chance of having a heart attack based on 13 other parameters. The model itself seems to work, but the metric calls always give me this kind of error regardless of how I transform the data. It's because the test sample is 1 row whilst the training data is 303 rows, but I don't know how to reconcile them.
import pandas as pd
from sklearn import tree
from sklearn.metrics import mean_squared_error
heart = pd.read_csv('heart.csv')
test = pd.read_csv('Heart_test.csv')
X = heart.iloc[:,0:13]
Y = heart.iloc[:,13:14]
test = test.iloc[:,0:13]
#print(X.head(), '\n', test.head(),'\n')
model = tree.DecisionTreeRegressor()
model = model.fit(X,Y)
y_prediction = model.predict(test)
print(mean_squared_error(Y, y_prediction))
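The error comes from comparing Y (303 training labels) with predictions made for the 1-row test file, which has no known label. One way to get a meaningful metric, sketched under the assumption that heart.csv is the only labelled data available, is to hold out part of it for evaluation:
from sklearn.model_selection import train_test_split

# keep 20% of the labelled data aside for evaluation
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=0)

model = tree.DecisionTreeRegressor()
model = model.fit(X_train, Y_train)

# both arguments now have the same number of samples
print(mean_squared_error(Y_val, model.predict(X_val)))

# the unlabeled 1-row test file can still be used for prediction only
print(model.predict(test))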

Multi Classification Ensemble Cross validation Function too many values to unpack (expected 2)

Link to sample file: https://www.dropbox.com/s/vk0ht1bowdhz85n/StackoverFlow_Example.csv?dl=0
The code below is in two parts: the function, and the main code that calls it. There are a bunch of print statements along the way to help troubleshoot. I believe the issue has to do with the "mean_feature_importances" variable. This procedure works and compares binary classifiers with no issues. I have tried to change it to evaluate multi-class classifiers so I can compare their performance. It makes sense why it originally expected only 2 labels, because that is what it was written for, but this model has 5 different labels to choose from. I have changed every value I think should be changed to accommodate 5 labels instead of 2. Please advise if I missed something; the issue happens on the return after print('19').
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier, Perceptron # linear classifiers
from sklearn.model_selection import StratifiedKFold # train/test splitting tool for cross-validation
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier, \
GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.multiclass import OneVsOneClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, roc_curve, auc # scoring metrics
Here is the function that performs the classifier ensemble cross-validation:
def train_MultiClass_classifier_ensemble_CV(classifiers, X_data, y_data, clf_params=None, cv_splits=10,
                                            random_state=21, return_trained_classifiers=True, verbose=0, prtParam=0):
    """
    Trains a list of classifiers on the input training data and returns cross-validated accuracy and f1 scores
    as well as feature_importances (where available). The list of trained classifier objects is also returned
    upon request.
    : param classifiers : List of classifier objects; expects each has a scikit-learn wrapper.
    : param X_data : Pandas dataframe containing our training features.
    : param y_data : Pandas dataframe containing our training class labels.
    : param clf_params : (Optional) List of dictionaries containing parameters for each classifier object
                         in the list 'classifiers'. If not provided, the already-initialized parameters of
                         each classifier object will be used.
    : param cv_splits : Integer number of cross-validation splits.
    : param random_state : Seed for reproducibility between executions.
    : param return_trained_classifiers : Boolean; if True, function will also return a list containing the fit classifier objects.
    : param verbose : The amount of status text displayed during execution; 0 for less, 1 for more.
    : return clf_comparison : A pandas dataframe tabulating the cross-validated performance of each classifier.
    : return mean_feature_importances : An array containing the ranked feature importances for each classifier having the feature_importances_ attribute.
    : return trained_classifiers : (if return_trained_classifiers=True) A list of trained classifier objects.
    """
    # initialization
    kfold = StratifiedKFold(n_splits=cv_splits, random_state=random_state)
    train_accuracy_mean = []
    train_accuracy_std = []
    test_accuracy_mean = []
    test_accuracy_std = []
    f1_score_mean = []
    f1_score_std = []
    mean_feature_importances = []
    trained_classifiers = []
    classifier_name = []

    if clf_params is None:  # construct using classifier's existing parameter assignment
        clf_params = []
        for clf in classifiers:
            #print(clf)
            params = clf.get_params()
            if 'random_state' in params.keys():  # assign random state / seed
                params['random_state'] = random_state
            elif 'seed' in params.keys():
                params['seed'] = random_state
            clf_params.append(params)

    # step through the classifiers for training and scoring with cross-validation
    for clf, params in zip(classifiers, clf_params):
        #print(clf)
        #print(params)
        # automatically obtain the name of the classifier
        name = get_clf_name(clf)
        classifier_name.append(name)

        if prtParam == 1:
            print(clf)
        if verbose == 1:  # print status
            print('\nPerforming Cross-Validation on Classifier %s of %s:'
                  % (len(classifier_name), len(classifiers)))
            print(name)

        # perform k-fold cross validation for this classifier and calculate scores for each split
        kth_train_accuracy = []
        kth_test_accuracy = []
        kth_test_f1_score = []
        kth_feature_importances = []

        for (train, test) in kfold.split(X_data, y_data):
            clf.set_params(**params)
            print(clf)
            print(params)
            OneVsOneClassifier(clf.fit(X_data.iloc[train], y_data.iloc[train]))
            kth_train_accuracy.append(clf.score(X_data.iloc[train], y_data.iloc[train]))
            print('1.1')
            kth_test_accuracy.append(clf.score(X_data.iloc[test], y_data.iloc[test]))
            print('2.2')
            kth_test_f1_score.append(f1_score(y_true=y_data.iloc[test], y_pred=clf.predict(X_data.iloc[test]), average='weighted'))
            print('3.3')
            if hasattr(clf, 'feature_importances_'):  # some classifiers (like linReg) lack this attribute
                print(clf.feature_importances_)
                kth_feature_importances.append(clf.feature_importances_)

        # populate scoring statistics for this classifier (over all cross-validation splits)
        train_accuracy_mean.append(np.mean(kth_train_accuracy))
        print('4')
        train_accuracy_std.append(np.std(kth_train_accuracy))
        print('5')
        test_accuracy_mean.append(np.mean(kth_test_accuracy))
        print('6')
        test_accuracy_std.append(np.std(kth_test_accuracy))
        print('7')
        f1_score_mean.append(np.mean(kth_test_f1_score))
        print('8')
        print('8-1')
        f1_score_std.append(np.std(kth_test_f1_score))
        print('9')
        print(kth_test_f1_score)

        # obtain array of mean feature importances, if this classifier had that attribute
        print('9-1')
        print(kth_feature_importances)
        if len(kth_feature_importances) == 0:
            print('10')
            print(mean_feature_importances)
            mean_feature_importances.append(False)
        else:
            print('10.1')
            mean_feature_importances.append(np.mean(kth_feature_importances, axis=0))

        # if requested, also export classifier after fitting on the complete training set
        if return_trained_classifiers is not False:
            print('12')
            clf.fit(X_data, y_data)
            print('13')
            trained_classifiers.append(clf)
            print('14')

        # remove AdaBoost feature importances (we won't discuss their interpretation)
        if type(clf) == type(AdaBoostClassifier()):
            print('15')
            mean_feature_importances[-1] = False
        print('16')

    # construct dataframe for comparison of classifiers
    clf_comparison = pd.DataFrame({'Classifier Name': classifier_name,
                                   'Mean Train Accuracy': train_accuracy_mean,
                                   'Train Accuracy Standard Deviation': train_accuracy_std,
                                   'Mean Test Accuracy': test_accuracy_mean,
                                   'Test Accuracy Standard Deviation': test_accuracy_std,
                                   'Mean Test F1-Score': f1_score_mean,
                                   'F1-Score Standard Deviation': f1_score_std})
    print('17')

    # enforce the desired column order
    clf_comparison = clf_comparison[['Classifier Name', 'Mean Train Accuracy',
                                     'Train Accuracy Standard Deviation', 'Mean Test Accuracy',
                                     'Test Accuracy Standard Deviation', 'Mean Test F1-Score',
                                     'F1-Score Standard Deviation']]
    print('18')

    # add return_trained_classifiers to the function return, if requested, otherwise omit
    if return_trained_classifiers is not False:
        print('19')
        print(clf_comparison)
        print(mean_feature_importances)
        print(trained_classifiers)
        return clf_comparison, mean_feature_importances, trained_classifiers
    else:
        print('20')
        return clf_comparison, mean_feature_importances
This is the code and attachment that should help you reproduce the error. The DataFrame can be downloaded above and placed here to run the code. I believe I included every package necessary to run the code; if not, please import it.
from sklearn.svm import LinearSVC  # missing from the imports above

dfage_train = pd.read_csv('StackoverFlow_Example.csv')
y1 = dfage_train['AgeBin']
X1 = dfage_train
X1 = X1.drop(['AgeBin'], axis=1)

num_jobs = -1  # I'll use all available CPUs when possible
Ageclassifier_list = [LogisticRegression(n_jobs=num_jobs, solver='lbfgs'),
                      RandomForestClassifier(criterion='entropy', n_estimators=100, n_jobs=num_jobs),
                      LinearSVC(class_weight=None, random_state=27, multi_class='ovr')]

X1['Pclass'] = X1['Pclass'].astype(int)
X1['isMale'] = X1['isMale'].astype(bool)
X1['Embarked'] = X1['Embarked'].astype(int)

clf_comp_Full_FeatureSet, mean_feature_importances = train_MultiClass_classifier_ensemble_CV(classifiers=Ageclassifier_list, prtParam=1,
                                                                                              verbose=1,
                                                                                              X_data=X1,
                                                                                              y_data=y1)
Error output
ValueError: too many values to unpack (expected 2)
Depending on a condition, your function train_MultiClass_classifier_ensemble_CV returns either 2 or 3 values. Don't do that, because when you assign the returned values there can be a mismatch. Here, it is returning 3 values, but you are trying to unpack them into only two variables. This is the problematic part:
if return_trained_classifiers is not False:
    print('19')
    print(clf_comparison)
    print(mean_feature_importances)
    print(trained_classifiers)
    return clf_comparison, mean_feature_importances, trained_classifiers  # three here
else:
    print('20')
    return clf_comparison, mean_feature_importances  # two here
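A minimal sketch of the two ways to resolve the mismatch, assuming the function itself stays as written:
# Option 1: unpack all three return values (return_trained_classifiers defaults to True)
clf_comp_Full_FeatureSet, mean_feature_importances, trained_classifiers = train_MultiClass_classifier_ensemble_CV(
    classifiers=Ageclassifier_list, prtParam=1, verbose=1, X_data=X1, y_data=y1)

# Option 2: explicitly ask for the two-value return
clf_comp_Full_FeatureSet, mean_feature_importances = train_MultiClass_classifier_ensemble_CV(
    classifiers=Ageclassifier_list, prtParam=1, verbose=1, X_data=X1, y_data=y1,
    return_trained_classifiers=False)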

sklearn class_weight in random forest does the opposite of what I expect

I'm using the RandomForestClassifier of sklearn on data with highly imbalanced classes--lots of 0 and few of 1. I'm interested in the number of 1s in the prediction. Example (credit):
# Load libraries
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn import datasets
# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Make class highly imbalanced by removing first 40 observations
X = X[46:,:]
y = y[46:]
# Create target vector indicating if class 0, otherwise 1
y = np.where((y == 0), 1, 0)
#split into training and testing
trainx = X[::2]
trainy = y[::2]
testx = X[1::2]
testy = y[1::2]
# Create decision tree classifer object
clf = RandomForestClassifier()
# Train model
clf.fit(trainx, trainy)
print(clf.predict(testx).sum())
This returns 2. Which is fine, except on my real data the result is a bit lower than the true answer. I want to deal with this by using the class_weight parameter. However when I do:
clf = RandomForestClassifier(class_weight="balanced")
# Train model
clf.fit(trainx, trainy)
print(clf.predict(testx).sum())
I get a result of 0. Same if I use class_weight={1:10}. If I use class_weight={1:.1} I get 2 again.
I get similar behavior on my real data: the higher the weight I give to class 1, the fewer 1s I get in the prediction.
This is the opposite of the behavior I expect (and the opposite of what the class_weight parameter does in svm). What's going on here? This question suggests sklearn is assigning the class labels by some kind of default, but that seems bizarre. Why wouldn't it use the class labels I gave it?
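One way to investigate what the weights are doing, sketched with the trainx, trainy and testx variables from the example above (the loop and random_state below are illustrative additions, not part of the original code), is to look at clf.classes_ and the predicted probabilities under each weighting:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# class counts in the training data: label 1 is the rare class here
print(np.bincount(trainy))

for weights in [None, "balanced", {0: 1, 1: 10}]:
    clf = RandomForestClassifier(class_weight=weights, random_state=0)
    clf.fit(trainx, trainy)
    proba = clf.predict_proba(testx)
    # clf.classes_ shows which column of predict_proba belongs to which label
    print(weights, clf.classes_, proba[:, 1].mean(), clf.predict(testx).sum())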

How to make (yes/no or 1-0) decisions with random forest?

This is the data set from Kaggle's Titanic competition (train and test csv files). Each file has features of passengers such as ID, sex, age, etc. The train file has a "survived" column with 0 and 1 values. The test file is missing the survived column as it has to be predicted.
This is my simple code using random forest to give me a starting benchmark:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve, auc
train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')
train['Type']='Train' #Create a flag for Train and Test Data set
test['Type']='Test'
fullData = pd.concat([train,test],axis=0) #Combined both Train and Test Data set
ID_col = ['PassengerId']
target_col = ["Survived"]
cat_cols = ['Name','Ticket','Sex','Cabin','Embarked']
num_cols= ['Pclass','Age','SibSp','Parch','Fare']
other_col=['Type'] #Test and Train Data set identifier
num_cat_cols = num_cols+cat_cols # Combined numerical and Categorical variables
for var in num_cat_cols:
    if fullData[var].isnull().any()==True:
        fullData[var+'_NA']=fullData[var].isnull()*1
#Impute numerical missing values with mean
fullData[num_cols] = fullData[num_cols].fillna(fullData[num_cols].mean(),inplace=True)
#Impute categorical missing values with -9999
fullData[cat_cols] = fullData[cat_cols].fillna(value = -9999)
#create label encoders for categorical features
for var in cat_cols:
    number = LabelEncoder()
    fullData[var] = number.fit_transform(fullData[var].astype('str'))
train=fullData[fullData['Type']=='Train']
test=fullData[fullData['Type']=='Test']
train['is_train'] = np.random.uniform(0, 1, len(train)) <= .75
Train, Validate = train[train['is_train']==True], train[train['is_train']==False]
features=list(set(list(fullData.columns))-set(ID_col)-set(target_col)-set(other_col))
x_train = Train[list(features)].values
y_train = Train["Survived"].values
x_validate = Validate[list(features)].values
y_validate = Validate["Survived"].values
x_test=test[list(features)].values
Train[list(features)]
#*************************
from sklearn import tree
random.seed(100)
rf = RandomForestClassifier(n_estimators=1000)
rf.fit(x_train, y_train)
status = rf.predict_proba(x_validate)
fpr, tpr, _ = roc_curve(y_validate, status[:,1]) #metrics. added by me
roc_auc = auc(fpr, tpr)
print(roc_auc)
final_status = rf.predict_proba(x_test)
test["Survived2"]=final_status[:,1]
test['my prediction']=np.where(test.Survived2 > 0.6, 1, 0)
test
As you can see, final_status gives the probability of survival. I'm wondering how to get yes/no (1 or 0) answers from it. The easiest thing I could think of was to say that if the probability is greater than 0.6 then the person survived, and otherwise died (the 'my prediction' column), but once I submit the results, the predictions are not good at all.
I appreciate any insights. Thanks
Transforming your probability into a binary output is the right way to go, but why did you choose > 0.6 and not > 0.5?
Also, if you are getting bad results in that case, it is most likely because you did not do a proper job of data cleaning and feature extraction. For example, the title ("Mr", "Mrs", ...) can give you an indication of the gender, which is a super important feature to consider in your problem (I assume this is the Titanic competition from Kaggle).
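As a hedged illustration of the kind of feature extraction being suggested, one could pull the title out of the raw Name column before any label encoding (column names taken from the Titanic train.csv):
import pandas as pd

train = pd.read_csv('train.csv')

# extract the honorific ("Mr", "Mrs", "Miss", ...) from names such as "Braund, Mr. Owen Harris"
train['Title'] = train['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
print(train['Title'].value_counts())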
I just needed to use a line like:
out = rf.predict(x_test)
and that would be the 0/1 answers I was looking for.
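For a two-class forest, rf.predict is essentially the same as thresholding the positive-class probability at 0.5, so the manual approach in the question differs only in using 0.6. A short sketch with the variables from the question:
# hard 0/1 predictions directly from the forest
out = rf.predict(x_test)

# the same decision expressed via probabilities and an explicit 0.5 threshold
proba = rf.predict_proba(x_test)[:, 1]
out_manual = np.where(proba > 0.5, 1, 0)

print((out == out_manual).mean())  # should be (close to) 1.0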

Difference between using train_test_split and cross_val_score in sklearn.cross_validation

I have a matrix with 20 columns. The last column are 0/1 labels.
The link to the data is here.
I am trying to run random forest on the dataset, using cross validation. I use two methods of doing this:
using sklearn.cross_validation.cross_val_score
using sklearn.cross_validation.train_test_split
I am getting different results when I do what I think is pretty much the same exact thing. To exemplify, I run a two-fold cross validation using the two methods above, as in the code below.
import csv
import numpy as np
import pandas as pd
from sklearn import ensemble
from sklearn.metrics import roc_auc_score
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score
#read in the data
data = pd.read_csv('data_so.csv', header=None)
X = data.iloc[:,0:18]
y = data.iloc[:,19]
depth = 5
maxFeat = 3
result = cross_val_score(ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth, max_features=maxFeat, oob_score=False), X, y, scoring='roc_auc', cv=2)
result
# result is now something like array([ 0.66773295, 0.58824739])
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.50)
RFModel = ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth, max_features=maxFeat, oob_score=False)
RFModel.fit(xtrain,ytrain)
prediction = RFModel.predict_proba(xtest)
auc = roc_auc_score(ytest, prediction[:,1:2])
print auc #something like 0.83
RFModel.fit(xtest,ytest)
prediction = RFModel.predict_proba(xtrain)
auc = roc_auc_score(ytrain, prediction[:,1:2])
print auc #also something like 0.83
My question is:
why am I getting different results, ie, why is the AUC (the metric I am using) higher when I use train_test_split?
Note:
When I use more folds (say 10), there appears to be some kind of pattern in my results, with the first calculation always giving me the highest AUC.
In the case of the two-fold cross validation in the example above, the first AUC is always higher than the second one; it's always something like 0.70 and 0.58.
Thanks for your help!
When using cross_val_score, you'll frequently want to use a KFold or StratifiedKFold iterator:
http://scikit-learn.org/0.10/modules/cross_validation.html#computing-cross-validated-metrics
http://scikit-learn.org/0.10/modules/generated/sklearn.cross_validation.KFold.html#sklearn.cross_validation.KFold
By default, cross_val_score will not randomize your data, which can produce odd results like this if your data isn't random to begin with.
The KFolds iterator has a random state parameter:
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html
So does train_test_split, which does randomize by default:
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
Patterns like the one you described are usually a result of a lack of randomness in the train/test sets.
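A minimal sketch of the shuffled cross-validation this answer is describing, written against the modern sklearn.model_selection module rather than the deprecated sklearn.cross_validation used in the question:
from sklearn import ensemble
from sklearn.model_selection import cross_val_score, StratifiedKFold

# shuffle the rows before making the folds, so each fold sees a random mix of the data
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
result = cross_val_score(
    ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth, max_features=maxFeat),
    X, y, scoring='roc_auc', cv=cv)
print(result)  # the two fold scores should now be much closer to each other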
The answer is what @KCzar pointed out. I just want to note that the easiest way I found to randomize the data (X and y with the same index shuffling) is the following:
p = np.random.permutation(len(X))
X, y = X[p], y[p]
source: Better way to shuffle two numpy arrays in unison
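Since X and y in this question are pandas objects rather than NumPy arrays, the same shuffle would go through .iloc; a small sketch:
import numpy as np

p = np.random.permutation(len(X))
X_shuffled, y_shuffled = X.iloc[p].reset_index(drop=True), y.iloc[p].reset_index(drop=True)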
