Use SHAP values to explain LogisticRegression Classification - python

I am trying to do some bad case analysis on my product categorization model using SHAP. My data looks something like this:
corpus_train, corpus_test, y_train, y_test = train_test_split(data['Name_Description'],
data['Category_Target'],
test_size = 0.2,
random_state=8)
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 3), min_df=3, analyzer='word')
X_train = vectorizer.fit_transform(corpus_train)
X_test = vectorizer.transform(corpus_test)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
X_train_sample = shap.sample(X_train, 100)
X_test_sample = shap.sample(X_test, 20)
masker = shap.maskers.Independent(data=X_test_sample)
explainer = shap.LinearExplainer(model, masker=masker)
shap_values = explainer.shap_values(X_test_sample)
X_test_array = X_test_sample.toarray()
shap.summary_plot(shap_values, X_test_array, feature_names=vectorizer.get_feature_names_out(), class_names=data['Category'].unique())
Now to save space I didn't include the actual summary plot, but it looks fine. My issue is that I want to be able to analyze a single prediction and get something more along these lines:
In other words, I want to know which specific words contribute the most to the prediction. But when I run the code in cell 36 in the image above I get an
AttributeError: 'numpy.ndarray' object has no attribute 'output_names'
I'm still confused about how shap_values is indexed. How can I solve this?

I was unable to find a solution with SHAP, but I found one using LIME. The following code displays a very similar output, where it's easy to see how the model made its prediction and how much certain words contributed.
# `model` is the fitted LogisticRegression from the question
c = make_pipeline(vectorizer, model)
# save a list-of-strings version of the test corpus
ls_X_test = list(corpus_test)
# save the class names in a list to increase interpretability
class_names = list(data.Category.unique())
# Create the LIME explainer
# add the class names for interpretability
LIME_explainer = LimeTextExplainer(class_names=class_names)
# explain the chosen prediction (idx is the position of the test example to explain)
# use the probability results of the logistic regression
# can also add num_features parameter to reduce the number of features explained
LIME_exp = LIME_explainer.explain_instance(ls_X_test[idx], c.predict_proba)
LIME_exp.show_in_notebook(text=True, predict_proba=True)

With KernelSHAP, first compute the Shapley values for a sample of the data, then plot a single instance, as follows:
#convert your training and testing data using the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
tfidf_train = tfidf_vectorizer.fit_transform(IV_train)
tfidf_test = tfidf_vectorizer.transform(IV_test)
model=LogisticRegression()
model.fit(tfidf_train, DV_train)
# apply SHAP
# first shorten the data, densify it, and convert the test sample to a DataFrame
X_train_sample = tfidf_train[0:20].toarray()
X_test_sample = tfidf_test[0:20].toarray()
sample_text = pd.DataFrame(X_test_sample)
SHAP_explainer = shap.KernelExplainer(model.predict, X_train_sample)
shap_vals = SHAP_explainer.shap_values(X_test_sample)
#print it.
print(df_test.iloc[7].Text , df_test.iloc[7].Label)
shap.initjs()
shap.force_plot(SHAP_explainer.expected_value, shap_vals[7,:],sample_text.iloc[7,:], feature_names=tfidf_vectorizer.get_feature_names_out())
For this instance, the original text is "good article interested natural alternatives treat ADHD" and the label is "1".

Related

How to predict sentiment of unseen text?

Using scikit-learn, I have trained my model but don't know how to use it to predict new text passages. I have watched tons of tutorials, but none of them go beyond training and testing. Below is the code I'm using:
data_source_url = "/path/to/file.csv"
airline_tweets = pd.read_csv(data_source_url)
features = airline_tweets.iloc[:, 10].values
labels = airline_tweets.iloc[:, 1].values
processed_features = []
# I do some text processing here and then append the text to processed_features
vectorizer = CountVectorizer(analyzer = 'word', lowercase = False)
features = vectorizer.fit_transform(processed_features)
features_nd = features.toarray() # for easy usage
X_train, X_test, y_train, y_test = train_test_split(features_nd, labels, train_size=0.80, random_state=1234)
log_model = LogisticRegression()
log_model = log_model.fit(X=X_train, y=y_train)
predictions = log_model.predict(X_test)
Basically, you just need to follow the same steps to transform your new dataset, then use your trained model to predict. It looks something like this:
new_dataset = ... # read your new dataset
new_processed_features = []
# do the same text processing here
# Use the same vectorizer to transform your new dataset
new_features = vectorizer.transform(new_processed_features)
new_features_nd = new_features.toarray() # for easy usage
# Use your trained model to predict new dataset
new_predictions = log_model.predict(new_features_nd)
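The same idea as a self-contained sketch: the vectorizer is fitted once on the training text and only transform() is applied to unseen text, so the new feature matrix has the same columns the model was trained on. The tiny corpus and its labels are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training data standing in for the processed tweets.
train_texts = ["great flight", "friendly crew", "lost my luggage", "terrible delay"]
train_labels = ["positive", "positive", "negative", "negative"]

vectorizer = CountVectorizer(analyzer="word", lowercase=False)
train_features = vectorizer.fit_transform(train_texts)  # learns the vocabulary
log_model = LogisticRegression().fit(train_features, train_labels)

# Unseen text: reuse the fitted vectorizer, never refit it here.
new_texts = ["friendly crew but terrible delay", "great crew"]
new_features = vectorizer.transform(new_texts)
new_predictions = log_model.predict(new_features)
print(new_features.shape, new_predictions)
```

Words that were not in the training vocabulary (such as "but" here) are silently ignored by transform().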

How can we predict target values for new data, based on a different dataset? scikit learn / gaussianNB

I am struggling to understand how training our algorithms connects with making predictions on new data.
My situation: I have an algorithm that I use on a labeled dataset. After the steps of importing it, encoding it, fit_transforming it and fitting it to make predictions on the data_test of the train_test_split function I get a really nice prediction from using the labeled dataset.
I am stumped as to how I need to feed a new dataset (unlabeled this time) to the trained model, which has learned from the labeled dataset. I know that technically the data used to train withheld the labels from itself to predict, but I am unaware how I have to provide the gaussianNB algorithm new data features to predict unknown labels.
My code for the training:
df = pd.read_csv(chosen_file, sep=',')
cat_cols = df.select_dtypes(include=['object'])
cat_cols_filled = cat_cols.fillna('0')
le = LabelEncoder()
cat_cols_fitted = cat_cols_filled.apply(lambda col: le.fit_transform(col))
non_cat_cols = df.select_dtypes(exclude=['object'])
non_cat_cols_filled = non_cat_cols.fillna('0')
non_cat_cols_fitted = non_cat_cols_filled.apply(lambda col: le.fit_transform(col))
target_prep = df.iloc[:,-1]
target = le.fit_transform(target_prep.astype(str))
data = pd.concat([cat_cols_fitted, non_cat_cols_fitted], axis=1)
try:
data_train, data_test, target_train, target_test = train_test_split(data, target, train_size=0.3)
alg = GaussianNB()
pred = alg.fit(data_train, target_train).predict(***data_test***)
This is all fine and dandy. But I cannot understand how I have to give something in place of data_test. Do I need to provide the new dataset with some placeholder values for the label column? My label column from the beginning dataframe is the last one.
My attempt:
new_df = pd.read_csv(new_chosen_file, sep=',')
new_cat_cols = new_df.select_dtypes(include=['object'])
new_cat_cols_filled = new_cat_cols.fillna('0')
new_cat_cols_fitted = new_cat_cols_filled.apply(lambda col: le.fit_transform(col))
new_non_cat_cols = new_df.select_dtypes(exclude=['object'])
new_non_cat_cols_filled = new_non_cat_cols.fillna('0')
new_non_cat_cols_fitted = new_non_cat_cols_filled.apply(lambda col: le.fit_transform(col))
new_data = pd.concat([new_cat_cols_fitted, new_non_cat_cols_fitted], axis=1)
print(new_data)
new_pred = alg.predict(new_data)
new_prediction = pd.DataFrame({'NEW ML prediction':new_pred})
print(new_pred)
print(new_prediction)
Notice that I do not provide the target column in the new dataset. However, the program errors out if my column count does not match, so I am forced to add at least the label column to avoid that.
Am I way off in my understanding of how this works? Please let me know.
EDIT:
I found my major screw-up in the code. I had not separated my target column from the data DataFrame. This was why data had 10 columns.
I can finally appreciate the simplicity of the code.
You instantiate an empty model as alg, then assign the prediction from the chained fit-and-predict call to a variable named pred. Because fit() trains the estimator in place and returns it, alg does hold the fitted model afterwards, but the chained call hides that.
Chaining multiple methods, as in
alg.fit(data_train, target_train).predict(data_test), is known as method chaining and can cause confusion.
A cleaner and more readable alternative is:
alg = GaussianNB() # instantiate the model
alg = alg.fit(data_train, target_train) # fit the model on the training data
pred = alg.predict(data_test) # predict on the test data
new_pred = alg.predict(new_data) # predict on the new data
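One more point about the attempt in the question: calling le.fit_transform() again on the new data relearns the category-to-integer mapping, so the same category can end up with a different code than it had during training. A minimal sketch of one way around this, assuming scikit-learn's OrdinalEncoder (0.24 or later for handle_unknown) is fitted once on the training features and then reused; the column names and values are made up:

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical labeled training data; the last column is the target.
train_df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "green"],
    "size": ["S", "L", "M", "L", "S", "M"],
    "label": ["a", "b", "a", "b", "b", "a"],
})
# Hypothetical new data: same feature columns, no target column.
new_df = pd.DataFrame({"color": ["blue", "purple"], "size": ["M", "S"]})

feature_cols = ["color", "size"]
# Categories never seen during training are encoded as -1 instead of raising.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_train = encoder.fit_transform(train_df[feature_cols])  # fit on training data only
alg = GaussianNB().fit(X_train, train_df["label"])

X_new = encoder.transform(new_df[feature_cols])  # transform only, no refit
new_pred = alg.predict(X_new)
print(X_new, new_pred)
```

No placeholder label column is needed: the model only ever sees the feature columns, in the same order as during training.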

using saved sklearn model to make prediction

I have a saved logistic regression model which I trained with training data and saved using joblib. I am trying to load this model in a different script, pass it new data and make a prediction based on the new data.
I am getting the following error "sklearn.exceptions.NotFittedError: CountVectorizer - Vocabulary wasn't fitted." Do I need to fit the data again ? I would have thought that the point of being able to save the model would be to not have to do this.
The code I am using is below excluding the data cleaning section. Any help to get the prediction to work would be appreciated.
new_df = pd.DataFrame(latest_tweets,columns=['text'])
new_df.to_csv('new_tweet.csv',encoding='utf-8')
csv = 'new_tweet.csv'
latest_df = pd.read_csv(csv)
latest_df.dropna(inplace=True)
latest_df.reset_index(drop=True,inplace=True)
new_x = latest_df.text
loaded_model = joblib.load("finalized_mode.sav")
tfidf_transformer = TfidfTransformer()
cvec = CountVectorizer()
x_val_vec = cvec.transform(new_x)
X_val_tfidf = tfidf_transformer.transform(x_val_vec)
result = loaded_model.predict(X_val_tfidf)
print (result)
Your training has three parts that are fitted to the data:
CountVectorizer: learns the vocabulary of the training data and returns counts
TfidfTransformer: learns the document frequencies of that vocabulary from the counts and returns tf-idf values
LogisticRegression: learns the feature coefficients for optimum classification performance
Since each part learns something from the data and uses it to output the transformed data, you need all three parts when testing on new data. But you are only saving the lr with joblib, so the other two are lost, and with them the training vocabulary and counts.
Now in your testing part, you are initializing a new CountVectorizer and TfidfTransformer and calling fit() (fit_transform()), which learns the vocabulary only from this new data. So there will be fewer words than in the training vocabulary. But then you load the previously saved LR model, which expects the same features as the training data. Hence this error:
ValueError: X has 130 features per sample; expecting 223086
What you need to do is this:
During training:
filename = 'finalized_model.sav'
joblib.dump(lr, filename)
filename = 'finalized_countvectorizer.sav'
joblib.dump(cvec, filename)
filename = 'finalized_tfidftransformer.sav'
joblib.dump(tfidf_transformer, filename)
During testing
loaded_model = joblib.load("finalized_model.sav")
loaded_cvec = joblib.load("finalized_countvectorizer.sav")
loaded_tfidf_transformer = joblib.load("finalized_tfidftransformer.sav")
# Observe that I only use transform(), not fit_transform()
x_val_vec = loaded_cvec.transform(new_x)
X_val_tfidf = loaded_tfidf_transformer.transform(x_val_vec)
result = loaded_model.predict(X_val_tfidf)
Now you won't get that error.
Recommendation:
You should use TfidfVectorizer in place of both CountVectorizer and TfidfTransformer, so that you don't have to use two objects all the time.
Along with that, you should use a Pipeline to combine the two steps, TfidfVectorizer and LogisticRegression, so that you only have to handle a single object (which is easier to save, load, and pass around).
So edit the training part like this:
tfidf_vectorizer = TfidfVectorizer()
lr = LogisticRegression()
tfidf_lr_pipe = Pipeline([('tfidf', tfidf_vectorizer), ('lr', lr)])
# Internally your X_train will be automatically converted to tfidf
# and that will be passed to lr
tfidf_lr_pipe.fit(X_train, y_train)
# Similarly here only transform() will be called internally for tfidfvectorizer
# And that data will be passed to lr.predict()
y_preds = tfidf_lr_pipe.predict(X_test)
# Now you can save this pipeline alone (which will save all its internal parts)
filename = 'finalized_model.sav'
joblib.dump(tfidf_lr_pipe, filename)
During testing, do this:
loaded_pipe = joblib.load("finalized_model.sav")
result = loaded_pipe.predict(new_x)
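A self-contained sketch of that round trip, with made-up training texts and a temporary directory in place of a real path: the whole pipeline is dumped as one object, loaded back, and given raw strings.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up training data for illustration.
X_train = ["love this phone", "great battery life", "awful screen", "battery died fast"]
y_train = [1, 1, 0, 0]

tfidf_lr_pipe = Pipeline([("tfidf", TfidfVectorizer()), ("lr", LogisticRegression())])
tfidf_lr_pipe.fit(X_train, y_train)

new_x = ["great phone", "awful battery"]
with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, "finalized_model.sav")
    joblib.dump(tfidf_lr_pipe, filename)  # saves vectorizer and classifier together
    loaded_pipe = joblib.load(filename)

# The loaded pipeline takes raw text; its vectorizer is already fitted.
result = loaded_pipe.predict(new_x)
print(result)
```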
You have not fit the CountVectorizer.
You should do it like this:
cvec = CountVectorizer()
x_val_vec = cvec.fit_transform(new_x)
Similarly, TfidfTransformer must be used like this:
X_val_tfidf = tfidf_transformer.fit_transform(x_val_vec)

classify new document - Random Forest, Bag of Words

This is my first attempt at document classification with ML and Python.
I first query my database to extract 5000 articles related to money laundering and convert them to pandas df
Then I extract 500 articles not related to money laundering and also convert them to pandas df
I concatenate both dfs and label them either 'money-laundering' or 'other'
I do preprocessing (removing punctuation and stopwords, lower case etc)
and then feed the model based on bag of words principle as below:
vectorizer = CountVectorizer(analyzer = "word",
tokenizer = None,
preprocessor = None,
stop_words = None,
max_features = 5000)
text_features = vectorizer.fit_transform(full_df["processed full text"])
text_features = text_features.toarray()
labels = np.array(full_df['category'])
X_train, X_test, y_train, y_test = train_test_split(text_features, labels, test_size=0.33)
forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)
accuracy_score(y_pred=y_pred, y_true=y_test)
It works fine so far (even though it gives me a suspiciously high accuracy of 99%). But I would now like to test it on a completely new text document. If I vectorize it and call forest.predict(test), it fails with:
ValueError: Number of features of the model must match the input. Model n_features is 5000 and input n_features is 45
I am not sure how to overcome this so that I can classify a totally new article.
First of all, even though my proposal may work, I want to stress that this solution has statistical and computational consequences that you need to understand before running this code.
Let's assume you have an initial corpus of texts full_df["processed full text"] and test is the new text you would like to classify.
Then, let's define full_added as the corpus of texts containing both full_df and test.
text_features = vectorizer.fit_transform(full_added)
text_features = text_features.toarray()
You could use the rows of text_features that come from full_df as your training set (X_train = text_features[:-1] and y_train = np.array(full_df['category']), assuming test is the single last row of full_added).
And then you can run
forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit(X_train, y_train)
y_pred = forest.predict(text_features[-1:])
Of course, in this solution, you have already defined your parameters and you consider your model robust on new data.
Another remark is that if you have a stream of new texts as input that you would like to analyze, this solution would be dreadful since the computational time of computing a new vectorizer.fit_transform(full_added) would increase dramatically.
I hope it helps.
My first implementation of Naive Bayes used the TextBlob library. It was extremely slow and my machine eventually ran out of memory.
The second try was based on this article http://zacstewart.com/2015/04/28/document-classification-with-scikit-learn.html and used MultinomialNB from sklearn.naive_bayes. It worked like a charm:
#initialize vectorizer
count_vectorizer = CountVectorizer(analyzer = "word",
tokenizer = None,
preprocessor = None,
stop_words = None,
max_features = 5000)
counts = count_vectorizer.fit_transform(df['processed full text'].values)
targets = df['category'].values
#divide into train and test sets
X_train, X_test, y_train, y_test = train_test_split(counts, targets, test_size=0.33)
#create classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
#check accuracy
y_pred = classifier.predict(X_test)
accuracy_score(y_true=y_test, y_pred=y_pred)
#check on completely new example
new_counts = count_vectorizer.transform([processed_test_string])
prediction = classifier.predict(new_counts)
prediction
output:
array(['money laundering'],
dtype='<U16')
And the accuracy is around 91%, so more realistic than 99.96%.
Exactly what I wanted. Would be also nice to see the most informative features, I will try to work it out. Thanks everyone.

Scikit Learn - ValueError: X has 26879 features per sample; expecting 7087

I am doing feature selection by first training LogisticRegression with L1 penalty and then using the reduced feature set to re-train the model using L2 penalty. Now, when I try to predict test data, the transform() done on it results in a different dimensional array. I am confused as to how to re-size the test data to be able to predict.
Appreciate any help. Thank you.
vectorizer = CountVectorizer()
output = vectorizer.fit_transform(train_data)
output_test = vectorizer.transform(test_data)
logistic = LogisticRegression(penalty = "l1")
logistic.fit(output, train_labels)
predictions = logistic.predict(output_test)
logistic = LogisticRegression(penalty = "l2", C = i + 1)
output = logistic.fit_transform(output, train_labels)
predictions = logistic.predict(output_test)
The following error message is shown resulting from the last predict line. Original number of features is 26879:
ValueError: X has 26879 features per sample; expecting 7087
There seem to be a couple of things wrong here.
Firstly, I suggest you give different names to the two logistic models, as you need both to make a prediction.
In your code, you never call transform on the L1 logistic regression, so the features are never reduced, which is not what you say you want to do.
What you should be doing is
l1_logreg = LogisticRegression(penalty="l1")
l1_logreg.fit(output, train_labels)
out_reduced = l1_logreg.transform(output)
out_reduced_test = l1_logreg.transform(output_test)
l2_logreg = LogisticRegression(penalty="l2")
l2_logreg.fit(out_reduced, train_labels)
predictions = l2_logreg.predict(out_reduced_test)
or
pipe = make_pipeline(CountVectorizer(), LogisticRegression(penalty="l1"),
LogisticRegression(penalty="l2"))
pipe.fit(train_data, train_labels)
predictions = pipe.predict(test_data)
FYI I wouldn't expect that to work better than just doing l2 logreg. Also you could try SGDClassifier(penalty="elasticnet").
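In recent scikit-learn releases LogisticRegression no longer has a transform() method, so it cannot be used directly as a feature selector. The usual replacement is to wrap the L1 model in SelectFromModel. A minimal sketch of the same L1-then-L2 idea under that assumption, with made-up data and an arbitrary C=10:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up data for illustration.
train_data = [
    "good food good service", "good value friendly staff", "good location clean room",
    "good breakfast nice view", "bad food slow service", "bad value rude staff",
    "bad location dirty room", "bad breakfast noisy street",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_data = ["good room", "bad service"]

pipe = make_pipeline(
    CountVectorizer(),
    # Keeps only the features whose L1 coefficients are effectively non-zero.
    SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=10)),
    LogisticRegression(penalty="l2"),
)
pipe.fit(train_data, train_labels)
predictions = pipe.predict(test_data)  # the test data gets the same reduction

n_total = len(pipe[0].vocabulary_)
n_kept = int(pipe[1].get_support().sum())
print(n_total, n_kept, predictions)
```

Because the selector sits inside the pipeline, the test data is reduced to the same columns as the training data, which avoids the feature-count mismatch.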
