I have a dataset of reviews with a positive/negative class label, and I am applying a Decision Tree to it. First, I convert the reviews into a Bag of Words representation. Here sorted_data['Text'] contains the reviews and final_counts is a sparse count matrix.
I split the data into train and test sets.
from sklearn.model_selection import train_test_split
X_tr, X_test, y_tr, y_test = train_test_split(sorted_data['Text'], labels, test_size=0.3, random_state=0)
# BOW
count_vect = CountVectorizer()
count_vect.fit(X_tr.values)
final_counts = count_vect.transform(X_tr.values)
Then I apply the Decision Tree algorithm as follows:
# chosen tree depth (tuned separately)
optimal_lambda = 15
# apply the vectorizer fitted on the train data to the test data
final_counts_x_test = count_vect.transform(X_test.values)
# instantiate the learning model with the chosen max depth
bow_reg_optimal = DecisionTreeClassifier(max_depth=optimal_lambda, random_state=0)
# fitting the model
bow_reg_optimal.fit(final_counts, y_tr)
# predict the response
pred = bow_reg_optimal.predict(final_counts_x_test)
# evaluate accuracy
acc = accuracy_score(y_test, pred) * 100
print('\nThe accuracy of the Decision Tree for depth = %d is %f%%' % (optimal_lambda, acc))
bow_reg_optimal is a decision tree classifier. Could anyone tell me how to get the feature importances from this decision tree classifier?
Use the feature_importances_ attribute, which will be defined once fit() is called. For example:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# toy data: 1000 samples, 2 random features, 5 classes
X = np.random.rand(1000, 2)
y = np.random.randint(0, 5, 1000)

tree = DecisionTreeClassifier().fit(X, y)
tree.feature_importances_
# array([ 0.51390759,  0.48609241])
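In your case, once bow_reg_optimal has been fitted you can line the importances up with the CountVectorizer vocabulary to see which words the tree splits on. A minimal sketch, assuming the count_vect and bow_reg_optimal objects from the question (get_feature_names_out needs a reasonably recent scikit-learn; older versions expose get_feature_names instead):
import numpy as np

# one entry per bag-of-words column, in the same order for both arrays
feature_names = count_vect.get_feature_names_out()
importances = bow_reg_optimal.feature_importances_

# print the ten most important words
top = np.argsort(importances)[::-1][:10]
for idx in top:
    print('{}: {:.4f}'.format(feature_names[idx], importances[idx]))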
Background information
I fit a classifier on my training data. When testing my fitted best estimator, I predict the probabilities for one of the classes. I order both my X_test and my y_test by the probabilities in descending order.
Question
I want to understand which features were important (and to what extent) for the classifier's 500 predictions with the highest probability, taken as a whole rather than per prediction. Is the following code correct for this purpose?
y_test_probas = clf.predict_proba(X_test)[:, 1]
explainer = shap.Explainer(clf, X_train) # <-- here I put the X which the classifier was trained on?
top_n_indices = np.argsort(y_test_probas)[-500:]
shap_values = explainer(X_test.iloc[top_n_indices]) # <-- here I put the X I want the SHAP values for?
shap.plots.bar(shap_values)
Unfortunately, the shap documentation (bar plot) does not cover this case. Two things are different there:
They use the data the classifier was trained on (I want to use the data the classifier is tested on)
They use the whole X and not part of it (I want to use only part of the data)
Minimal reproducible example
import numpy as np
import pandas as pd
import shap
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Load the Titanic Survival dataset
data = pd.read_csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv")
# Preprocess the data
data = data.drop(["Name"], axis=1)
data = data.dropna()
data["Sex"] = (data["Sex"] == "male").astype(int)
# Split the data into predictors (X) and response variable (y)
X = data.drop("Survived", axis=1)
y = data["Survived"]
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit a logistic regression classifier
clf = LogisticRegression().fit(X_train, y_train)
# Get the predicted class probabilities for the positive class
y_test_probas = clf.predict_proba(X_test)[:, 1]
# Select the indices of the top 500 test samples with the highest predicted probability of the positive class
top_n_indices = np.argsort(y_test_probas)[-500:]
# Initialize the Explainer object with the classifier and the training set
explainer = shap.Explainer(clf, X_train)
# Compute the SHAP values for the top 500 test samples
shap_values = explainer(X_test.iloc[top_n_indices, :])
# Plot the bar plot of the computed SHAP values
shap.plots.bar(shap_values)
I don't want to know how the classifier decides all the predictions, only how it decides the predictions with the highest probability. Is the code above suitable to answer this question? If not, what would suitable code look like?
I am fairly new to Machine Learning and currently I am going through the book 'Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow' from O'Reilly.
In exercise 8 of chapter 6 (here is the full exercise) the goal is to fit a random forest classifier on the training set of the moons dataset. For that, the exercise asks you to generate 1,000 subsets of the training set, each containing 100 randomly selected instances. This can be done with the ShuffleSplit class from scikit-learn, and when I do so I get nice results. However, when I tried to implement the subsetting myself, something goes wrong and I can't figure out what it is.
First I do a small Grid Search to find a good parameter setting for fitting one single decision tree:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
import numpy as np
np.random.seed(1234)
X, y = make_moons(n_samples=10000, noise=0.4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
param_dic = {'max_leaf_nodes': list(range(2, 100)),
             'min_samples_split': [2, 3, 4]}
dtc = DecisionTreeClassifier()
dtc_grid = GridSearchCV(dtc, param_dic, cv=5, scoring='accuracy')
dtc_grid.fit(X_train, y_train)
best_params = dtc_grid.best_params_
dtc.set_params(**best_params)
dtc.fit(X_train, y_train)
predictions = dtc.predict(X_train)
print(f'Accuracy Score on Train Set: {accuracy_score(y_train, predictions)}')
predictions = dtc.predict(X_test)
print(f'Accuracy Score on Test Set: {accuracy_score(y_test, predictions)}')
# Accuracy Score on Train Set: 0.855875
# Accuracy Score on Test Set: 0.8525
Now I want to create 1000 random subsets of the training set of size 100. For that I create the following class:
from sklearn.base import TransformerMixin, BaseEstimator
import random
class CreateSubsets(TransformerMixin, BaseEstimator):
    def __init__(self, n_sets, set_size):
        self.n_sets = n_sets
        self.set_size = set_size

    def fit(self, X, y):
        return self

    def transform(self, X):
        self.new_sets = {}
        for set_i in range(self.n_sets):
            idx = random.choices(range(len(X)), k=self.set_size)
            X_new, y_new = X[idx], y[idx]
            self.new_sets[set_i] = (X_new, y_new)
        return self.new_sets
n_sets, set_size = 1000, 100
cs = CreateSubsets(n_sets, set_size)
new_sets = cs.fit_transform(X_train,y_train)
Next I fit the decision tree classifier, with the optimal hyperparameter setting found by GridSearchCV, on each of the small random datasets and predict labels for the test set observations. Those predictions are saved in tree_predictions for later use. Finally, the accuracy score is computed for each fit and those values are averaged.
dtc = DecisionTreeClassifier()
dtc.set_params(**best_params)
tree_predictions = np.zeros((n_sets, len(X_test)))
accuracy_scores = []
for set_i in range(n_sets):
    X_train_new, y_train_new = new_sets[set_i][0], new_sets[set_i][1]
    dtc.fit(X_train_new, y_train_new)
    predictions = dtc.predict(X_test)
    tree_predictions[set_i, :] = predictions
    accuracy_scores.append(accuracy_score(y_test, predictions))
print(f' Average performance of Decision Tree Classifier with optimal Hyperparameter Setting,\n fitted on smaller subsets: {np.mean(accuracy_scores)}')
# Average performance of Decision Tree Classifier with optimal Hyperparameter Settings, fitted on smaller subsets: 0.4941225
As a final step, for each instance of the test set I do a majority vote over the predictions (by taking the mode of the predictions per test observation). The accuracy score computed on those majority predictions should be higher than the average accuracy score from before.
from scipy.stats import mode
predictions = mode(tree_predictions, axis=0)
accuracy_score(y_test, predictions[0].reshape(-1))
#0.403
Something certainly goes wrong with the class CreateSubsets, but I can't figure out what it is. If I use ShuffleSplit from sklearn like this instead, the random forest classifier performs well:
from sklearn.model_selection import ShuffleSplit
n_sets, set_size = 1000, 100
new_sets = []
rs = ShuffleSplit(n_splits=n_sets, test_size=len(X_train) - set_size)
for mini_train_index, mini_test_index in rs.split(X_train):
    X_mini_train = X_train[mini_train_index]
    y_mini_train = y_train[mini_train_index]
    new_sets.append((X_mini_train, y_mini_train))
The following code shows that, when a classifier is trained and evaluated on a SMOTE-resampled dataset, the accuracy and the roc_auc_score come out the same.
from imblearn.over_sampling import SMOTE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, roc_auc_score

sm = SMOTE(random_state=2)        # creating SMOTE object
X_t, y_t = sm.fit_resample(X, y)  # resampling X and y (fit_sample in older imbalanced-learn versions)
clf_temp = LinearDiscriminantAnalysis()
clf_temp.fit(X_t,y_t)
y_score = clf_temp.predict(X_t)
print(accuracy_score(y_t,y_score))
print(roc_auc_score(y_t,y_score))
0.8015075376884422
0.8015075376884422
whereas training the same classifier on the original data without SMOTE gives a different result:
clf_temp.fit(X,y)
y_score = clf_temp.predict(X)
print(accuracy_score(y,y_score))
print(roc_auc_score(y,y_score))
0.7996592361209712
0.7183922969169426
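For context on what is being compared here (a hedged illustration of my own, not part of the original post): when roc_auc_score is given hard 0/1 predictions instead of probabilities, it reduces to (TPR + TNR) / 2, i.e. balanced accuracy, and balanced accuracy coincides with plain accuracy whenever the classes are exactly balanced, as they are after SMOTE. A minimal sketch with made-up labels:
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, roc_auc_score

# perfectly balanced ground truth (as after SMOTE) and some hard 0/1 predictions
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])

print(accuracy_score(y_true, y_pred))           # 0.75
print(balanced_accuracy_score(y_true, y_pred))  # 0.75
print(roc_auc_score(y_true, y_pred))            # 0.75
Passing the predicted probabilities (clf_temp.predict_proba(X_t)[:, 1]) to roc_auc_score instead of the hard predictions would generally give a different number.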
I have the following code that loops through and fits a model on the iris data using different modeling techniques. How can I add a second step to this process so I can demonstrate the improvement from using scaled versus non-scaled data?
I don't need to run the scaling transform outside of the loop; I was just having a lot of issues converting between a pandas DataFrame and a NumPy array and back again.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.metrics import accuracy_score
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
sc = StandardScaler()
X_train_scale = sc.fit_transform(X_train)
X_test_scale = sc.transform(X_test)
numFolds = 10
kf = KFold(n_splits=numFolds, shuffle=True)
# These are class objects (not instances). For each model class,
# estimate the accuracy through 10 fold cross validation.
Models = [LogisticRegression, svm.SVC]
params = [{},{}]
for param, Model in zip(params, Models):
    total = 0
    for train_indices, test_indices in kf.split(X_train):
        train_X = X_train[train_indices]; train_Y = y_train[train_indices]
        test_X = X_train[test_indices]; test_Y = y_train[test_indices]
        reg = Model(**param)
        reg.fit(train_X, train_Y)
        predictions = reg.predict(test_X)
        total += accuracy_score(test_Y, predictions)
    accuracy = total / numFolds
    print("CV accuracy score of {0}: {1}".format(Model.__name__, round(accuracy, 6)))
So ideally my output would be:
CV standard accuracy score of LogisticRegression: 0.683333
CV scaled accuracy score of LogisticRegression: 0.766667
CV standard accuracy score of SVC: 0.766667
CV scaled accuracy score of SVC: 0.783333
It seems this was unclear: I am trying to loop over scaled and unscaled data, similar to how I am looping over the different ML algorithms.
I wanted to follow up on this. I was able to do it by creating a pipeline and using GridSearchCV:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression())])

param_grid = [{'scale': [None, StandardScaler()],
               'clf': [SVC(), LogisticRegression()]}]

grid_search = GridSearchCV(pipe, param_grid=param_grid, n_jobs=-1, verbose=1)
In the end this got me the results I wanted, and I was able to easily test the effect of scaling versus not scaling.
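To actually read off the scaled vs. unscaled comparison, one option (a sketch of my own, assuming the pipe, param_grid, and the X_train/y_train split from the question above) is to fit the grid and inspect cv_results_, which contains one row per scale/clf combination:
import pandas as pd

# fit the grid on the training data from the question
grid_search.fit(X_train, y_train)

# one row per (scale, clf) combination, with the mean CV accuracy
results = pd.DataFrame(grid_search.cv_results_)
print(results[['param_scale', 'param_clf', 'mean_test_score']])
The rows where param_scale is None are the unscaled runs, so the table directly shows the per-classifier effect of scaling.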
try this:
previous_accuracy = None
for param, Model in zip(params, Models):
    total = 0
    for train_indices, test_indices in kf.split(X_train):
        train_X = X_train[train_indices]; train_Y = y_train[train_indices]
        test_X = X_train[test_indices]; test_Y = y_train[test_indices]
        reg = Model(**param)
        reg.fit(train_X, train_Y)
        predictions = reg.predict(test_X)
        total += accuracy_score(test_Y, predictions)
    accuracy = total / numFolds
    print("CV accuracy score of {0}: {1}".format(Model.__name__, round(accuracy, 6)))
    # added to your code: track the previous model's score and report the relative change
    if previous_accuracy is not None:
        improvement = (accuracy - previous_accuracy) / previous_accuracy
        print("CV accuracy score changed by {0:.2%}".format(improvement))
    previous_accuracy = accuracy
How can I find the overall accuracy of the outputs obtained by running a decision tree algorithm? I am able to get the top five class labels for the active user input, but I am only getting the accuracy on the X_train and y_train dataset using accuracy_score(). Suppose I get the top five recommendations; I want the accuracy for each class label and, from those, the overall accuracy of the output. Please suggest an approach.
My Python script is below; here event contains the different class labels:
DTC= DecisionTreeClassifier()
DTC.fit(X_train_one_hot,y_train)
print("output from DTC:")
res=DTC.predict_proba(X_test_one_hot)
new=list(chain.from_iterable(res))
#Here I got the index value of top five probabilities
index=sorted(range(len(new)), key=lambda i: new[i], reverse=True)[:5]
for i in index:
    print(event[i])
Here is the sample code I tried in order to get the accuracy for the predicted class labels. Here index holds the indices of the top five class-label probabilities and event holds the different class labels:
for i in index:
    DTC.fit(X_train_one_hot,y_train)
    y_pred=event[i]
    AC=accuracy_score((event,y_pred)*100)
    print(AC)
Since you have a multi-class classification problem, you can calculate the accuracy of the classifier using the confusion_matrix function from scikit-learn.
To get the overall accuracy, sum the values on the diagonal and divide the sum by the total number of samples.
Consider the following simple multi-class classification example using the Iris dataset:
import numpy as np
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
class_names = iris.target_names
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Run classifier, using a model that is too regularized (C too low) to see
# the impact on the results
classifier = svm.SVC(kernel='linear', C=0.01)
y_pred = classifier.fit(X_train, y_train).predict(X_test)
Now, to calculate the overall accuracy, use the confusion matrix:
conf_mat = confusion_matrix(y_test, y_pred)  # rows are true labels, columns are predictions
acc = np.sum(conf_mat.diagonal()) / np.sum(conf_mat)
print('Overall accuracy: {} %'.format(acc*100))
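Since the question also asks for a per-class figure, here is a hedged sketch of my own (using the conf_mat and class_names from the example above) showing how per-class accuracy (recall) can be read off the confusion matrix, plus a cross-check of the overall accuracy via accuracy_score:
from sklearn.metrics import accuracy_score

# per-class accuracy (recall): correct predictions for a class
# divided by the number of true samples of that class (the row sum)
per_class_acc = conf_mat.diagonal() / conf_mat.sum(axis=1)
for name, class_acc in zip(class_names, per_class_acc):
    print('Accuracy for class {}: {:.2f} %'.format(name, class_acc * 100))

# sanity check: accuracy_score gives the same overall number as the diagonal sum
print('Overall accuracy: {:.2f} %'.format(accuracy_score(y_test, y_pred) * 100))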