How can I find the overall accuracy of the outputs produced by a decision tree algorithm? I am able to get the top five class labels for the active user input, but I am only computing accuracy on the X_train and y_train dataset with accuracy_score(). Suppose I get the top five recommendations; I would like to get the accuracy for each class label and, from those, the overall accuracy for the output. Please suggest some ideas.
My Python script is here:
Here event is the list of the different class labels.
from itertools import chain

DTC = DecisionTreeClassifier()
DTC.fit(X_train_one_hot, y_train)
print("output from DTC:")
res = DTC.predict_proba(X_test_one_hot)
# flatten the predicted class probabilities into a single list
new = list(chain.from_iterable(res))
# Here I got the index values of the top five probabilities
index = sorted(range(len(new)), key=lambda i: new[i], reverse=True)[:5]
for i in index:
    print(event[i])
Here is the sample code I tried in order to get the accuracy for the predicted class labels:
Here index holds the indices of the top five class-label probabilities and event is the list of class labels.
for i in index:
    DTC.fit(X_train_one_hot, y_train)
    y_pred = event[i]
    AC = accuracy_score(event, y_pred) * 100
    print(AC)
Since you have a multi-class classification problem, you can calculate the accuracy of the classifier by using the confusion_matrix function from scikit-learn.
To get overall accuracy, sum the values in the diagonal and divide the sum by the total number of samples.
Consider the following simple multi-class classification example using the IRIS dataset:
import itertools
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
class_names = iris.target_names
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Run classifier, using a model that is too regularized (C too low) to see
# the impact on the results
classifier = svm.SVC(kernel='linear', C=0.01)
y_pred = classifier.fit(X_train, y_train).predict(X_test)
Now, to calculate the overall accuracy, use the confusion matrix:
conf_mat = confusion_matrix(y_test, y_pred)
acc = np.sum(conf_mat.diagonal()) / np.sum(conf_mat)
print('Overall accuracy: {} %'.format(acc*100))
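If you also want the accuracy for each class label, as asked above, one common option is to divide each diagonal entry of the confusion matrix by the number of true samples of that class (the row sum). A minimal sketch, assuming conf_mat was built with confusion_matrix(y_test, y_pred) as above:
# per-class accuracy (recall): correct predictions for a class / true samples of that class
per_class_acc = conf_mat.diagonal() / conf_mat.sum(axis=1)
for name, class_acc in zip(class_names, per_class_acc):
    print('{}: {:.2f} %'.format(name, class_acc * 100))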
Related
Background information
I fit a classifier on my training data. When testing my fitted best estimator, I predict the probabilities for one of the classes. I order both my X_test and my y_test by these probabilities in descending order.
Question
I want to understand which features were important (and to what extent) for the classifier when making only the 500 predictions with the highest probability, as a whole rather than for each individual prediction. Is the following code correct for this purpose?
y_test_probas = clf.predict_proba(X_test)[:, 1]
explainer = shap.Explainer(clf, X_train) # <-- here I put the X which the classifier was trained on?
top_n_indices = np.argsort(y_test_probas)[-500:]
shap_values = explainer(X_test.iloc[top_n_indices]) # <-- here I put the X I want the SHAP values for?
shap.plots.bar(shap_values)
Unfortunately, the shap documentation (bar plot) does not cover this case. Two things are different there:
They use the data the classifier was trained on (I want to use the data the classifier is tested on)
They use the whole X and not part of it (I want to use only part of the data)
Minimal reproducible example
import numpy as np
import pandas as pd
import shap
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Load the Titanic Survival dataset
data = pd.read_csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv")
# Preprocess the data
data = data.drop(["Name"], axis=1)
data = data.dropna()
data["Sex"] = (data["Sex"] == "male").astype(int)
# Split the data into predictors (X) and response variable (y)
X = data.drop("Survived", axis=1)
y = data["Survived"]
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit a logistic regression classifier
clf = LogisticRegression().fit(X_train, y_train)
# Get the predicted class probabilities for the positive class
y_test_probas = clf.predict_proba(X_test)[:, 1]
# Select the indices of the top 500 test samples with the highest predicted probability of the positive class
top_n_indices = np.argsort(y_test_probas)[-500:]
# Initialize the Explainer object with the classifier and the training set
explainer = shap.Explainer(clf, X_train)
# Compute the SHAP values for the top 500 test samples
shap_values = explainer(X_test.iloc[top_n_indices, :])
# Plot the bar plot of the computed SHAP values
shap.plots.bar(shap_values)
I don't want to know how the classifier decides all the predictions, only how it decides the predictions with the highest probability. Is that code suitable to answer this question? If not, what would suitable code look like?
I have a CSV of size 12500 x 3. The first two columns (A and B) are inputs and the final column (C) is the sum of the two columns.
I wanted to build a prediction model to get the value of C for a given A and B. This is just a basic model to improve my understanding of machine learning.
The accuracy score is almost zero (0.00032), and the model is way too simple to get the predictions wrong. The code is below:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = pd.read_csv('Dataset.csv') #importing dataset
X = data.drop(columns=['C'])
y = data['C']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = DecisionTreeClassifier()
model.fit(X_train,y_train)
predictions = model.predict(X_test)
score = accuracy_score(y_test, predictions)
score
I did not even include outliers in the data, and I created the CSV using Excel formulas. I used Jupyter Notebook to build this prediction model. Can someone please point out if/what I'm doing wrong?
Before you build your model, you should understand its behavior and main purpose. A decision tree classifier assigns data to discrete classes based on criteria extracted from the data, so here every distinct value of C is treated as its own class, and exact-match accuracy on such a target will be close to zero. For this purpose, you should choose a simple linear regression model, not a decision tree classifier.
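A minimal sketch of that change, assuming the same Dataset.csv with columns A, B and C, is to frame the task as regression and score it with a regression metric instead of accuracy_score:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

data = pd.read_csv('Dataset.csv')
X = data.drop(columns=['C'])
y = data['C']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearRegression()  # fit C as a linear function of A and B
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(r2_score(y_test, predictions))  # R^2 should be close to 1.0 when C = A + B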
I am fairly new to Machine Learning and currently I am going through the book 'Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow' from O'Reilly.
In exercise 8 of chapter 6 (here is the full exercise) the goal is to fit a random forest classifier on the training set of the moons dataset. For that, it is asked to generate 1,000 subsets of the training set, each containing 100 instances selected randomly. This can be done by means of the ShuffleSplit class from scikit-learn. When I do so, I also get nice results. However, I tried to implement the ShuffleSplit behaviour myself, and something goes wrong that I can't figure out.
First I do a small grid search to find a good parameter setting for fitting a single decision tree:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
import numpy as np
np.random.seed(1234)
X, y = make_moons(n_samples=10000, noise=0.4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
param_dic = {'max_leaf_nodes': list(range(2, 100)),
             'min_samples_split': [2, 3, 4]}
dtc = DecisionTreeClassifier()
dtc_grid = GridSearchCV(dtc, param_dic, cv=5, scoring='accuracy')
dtc_grid.fit(X_train, y_train)
best_params = dtc_grid.best_params_
dtc.set_params(**best_params)
dtc.fit(X_train, y_train)
predictions = dtc.predict(X_train)
print(f'Accuracy Score on Train Set: {accuracy_score(y_train, predictions)}')
predictions = dtc.predict(X_test)
print(f'Accuracy Score on Test Set: {accuracy_score(y_test, predictions)}')
# Accuracy Score on Train Set: 0.855875
# Accuracy Score on Test Set: 0.8525
Now I want to create 1000 random subsets of the training set of size 100. For that I create the following class:
from sklearn.base import TransformerMixin, BaseEstimator
import random
class CreateSubsets(TransformerMixin, BaseEstimator):
    def __init__(self, n_sets, set_size):
        self.n_sets = n_sets
        self.set_size = set_size

    def fit(self, X, y):
        return self

    def transform(self, X):
        self.new_sets = {}
        for set_i in range(self.n_sets):
            idx = random.choices(range(len(X)), k=self.set_size)
            X_new, y_new = X[idx], y[idx]
            self.new_sets[set_i] = (X_new, y_new)
        return self.new_sets
n_sets, set_size = 1000, 100
cs = CreateSubsets(n_sets, set_size)
new_sets = cs.fit_transform(X_train,y_train)
Next I fit the decision tree classifier with the optimal hyperparameter setting found by GridSearchCV on each of the small random datasets and predict the labels of the test set observations. Those predictions are saved in tree_predictions for later use. Finally, the accuracy score is computed for each fit and those values are averaged.
dtc = DecisionTreeClassifier()
dtc.set_params(**best_params)
tree_predictions = np.zeros((n_sets, len(X_test)))
accuracy_scores = []
for set_i in range(n_sets):
    X_train_new, y_train_new = new_sets[set_i][0], new_sets[set_i][1]
    dtc.fit(X_train_new, y_train_new)
    predictions = dtc.predict(X_test)
    tree_predictions[set_i, :] = predictions
    accuracy_scores.append(accuracy_score(y_test, predictions))
print(f' Average performance of Decision Tree Classifier with optimal Hyperparameter Setting,\n fitted on smaller subsets: {np.mean(accuracy_scores)}')
# Average performance of Decision Tree Classifier with optimal Hyperparameter Settings, fitted on smaller subsets: 0.4941225
As a final step, for each instance of the test set I do a majority vote over the predictions (by taking the mode of the predictions per test observation). Computing the accuracy score on those majority-vote predictions should give a higher value than the average accuracy above.
from scipy.stats import mode
predictions = mode(tree_predictions, axis=0)
accuracy_score(y_test, predictions[0].reshape(-1))
#0.403
Something certainly goes wrong with the class CreateSubsets, but I can't figure out what it is. If I use ShuffleSplit from sklearn like this instead, the random forest classifier performs well:
from sklearn.model_selection import ShuffleSplit
n_sets, set_size = 1000, 100
new_sets = []
rs = ShuffleSplit(n_splits=n_sets, test_size=len(X_train) - set_size)
for mini_train_index, mini_test_index in rs.split(X_train):
    X_mini_train = X_train[mini_train_index]
    y_mini_train = y_train[mini_train_index]
    new_sets.append((X_mini_train, y_mini_train))
My model uses feature importance for feature selection with XGBoost. At the end, it outputs all the confusion matrices/results and how many features each model includes. That now works successfully, but I also need the feature names that were used in each model to be output as well.
I get a warning that says "X has feature names, but SelectFromModel was fitted without feature names", so I know something needs to be added to have them be in the model before I can output them, but I'm not sure how to handle either of those steps. I found several old questions about this, but I wasn't able to successfully apply any of them to my particular code. I'd really appreciate any ideas you have. Thank you!
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
# load data
dataset = df_train
# split data into X and y
X_train = df[df.columns.difference(['IsDeceased','IsTotal','Deceased','Sick','Injured','Displaced','Homeless','MissingPeople','Other','Total'])]
y_train = df['IsDeceased'].values
X_test = df_test[df_test.columns.difference(['IsDeceased','IsTotal','Deceased','Sick','Injured','Displaced','Homeless','MissingPeople','Other','Total'])]
y_test = df_test['IsDeceased'].values
# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    print(thresh)
    # eval model
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    report = classification_report(y_test, y_pred)
    print("Thresh= {} , n= {}\n {}".format(thresh, select_X_train.shape[1], report))
    cm = confusion_matrix(y_test, y_pred)
    print(cm)
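For reference, one way to recover the selected feature names is the boolean mask returned by SelectFromModel.get_support(). A minimal sketch, assuming X_train is still the pandas DataFrame built above, is to add this inside the loop after creating selection:
    # names of the columns kept by the current threshold
    selected_features = X_train.columns[selection.get_support()]
    print("Features used at thresh {}: {}".format(thresh, list(selected_features)))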
I'm implementing Naive Bayes with scikit-learn on imbalanced data.
My data has more than 16k records and 6 output categories.
I tried to fit the model with a sample_weight calculated by sklearn.utils.class_weight.
The sample_weight I got looks something like this:
sample_weight = [11.77540107 1.82284768 0.64688602 2.47138047 0.38577435 1.21389195]
import numpy as np
data_set = np.loadtxt("./data/_vector21.csv", delimiter=",")
inp_vec = data_set[:, 1:22]
out_vec = data_set[:, 22:]
#
# # Split dataset into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(inp_vec, out_vec, test_size=0.2) # 80% training and 20% test
#
# class weight
from keras.utils.np_utils import to_categorical
output_vec_categorical = to_categorical(y_train)
from sklearn.utils import class_weight
y_ints = [y.argmax() for y in output_vec_categorical]
c_w = class_weight.compute_class_weight('balanced', np.unique(y_ints), y_ints)
cw = {}
for i in set(y_ints):
    cw[i] = c_w[i]
# Create a Gaussian Classifier
from sklearn.naive_bayes import *
model = GaussianNB()
# Train the model using the training sets
print(c_w)
model.fit(X_train, y_train, c_w)
# Predict the response for test dataset
y_pred = model.predict(X_test)
# Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("\nClassification Report: \n", (metrics.classification_report(y_test, y_pred)))
print("\nAccuracy: %.3f%%" % (metrics.accuracy_score(y_test, y_pred)*100))
I got this message:
ValueError: Found input variables with inconsistent numbers of samples: [13212, 6]
Can anyone tell me what I did wrong and how I can fix it?
Thanks a lot.
The sample_weight and class_weight are two different things.
As their name suggests:
sample_weight is to be applied to individual samples (rows in your data). So the length of sample_weight must match the number of samples in your X.
class_weight is to make the classifier give more importance and attention to the classes. So the length of class_weight must match the number of classes in your targets.
You are calculating class weights, not sample weights, with sklearn.utils.class_weight, but then you try to pass them as sample_weight. Hence the dimension mismatch error.
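For illustration, a minimal sketch of how the shapes would need to line up, assuming cw is the class-to-weight dict built above and y_ints holds one integer label per training sample: expand the per-class weights into one weight per training row before passing them to fit.
import numpy as np
# one weight per training sample, looked up from the per-class weights,
# so that len(sample_weight) == len(X_train)
sample_weight = np.array([cw[label] for label in y_ints])
model.fit(X_train, y_train, sample_weight=sample_weight)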
Please see the following questions for more understanding of how these two weights interact internally:
What is the difference between sample weight and class weight options in scikit learn?
https://stats.stackexchange.com/questions/244630/difference-between-sample-weight-and-class-weight-randomforest-classifier
This way I was able to calculate the weights to deal with class imbalance.
from sklearn.utils import class_weight
from sklearn import naive_bayes
sample = class_weight.compute_sample_weight('balanced', y_train)
#Classifier Naive Bayes
naive = naive_bayes.MultinomialNB()
naive.fit(X_train,y_train, sample_weight=sample)
predictions_NB = naive.predict(X_test)