How to obtain probabilities from Pyspark One-vs-Rest multiclass classifier - python

It does not appear that Pyspark Onv-vs-Rest classifier provides probabilities. Is there a way to do this?
I am appending code below. I am adding the standard multiclass classifier for comparison.
from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# load data file.
inputData = spark.read.format("libsvm") \
.load("/data/mllib/sample_multiclass_classification_data.txt")
(train, test) = inputData.randomSplit([0.8, 0.2])
# instantiate the base classifier.
lr = LogisticRegression(maxIter=10, tol=1E-6, fitIntercept=True)
# instantiate the One Vs Rest Classifier.
ovr = OneVsRest(classifier=lr)
# train the multiclass model.
ovrModel = ovr.fit(train)
lrm = lr.fit(train)
# score the model on test data.
predictions = ovrModel.transform(test)
predictions2 = lrm.transform(test)
predictions.show(6)
predictions2.show(6)

I don't think you can access the probabilities(confidence) vector because it takes the max of the confidence and drops the confidence vector. To test, you can make a copy of the class and modify it and remove the .drop(accColName)
http://spark.apache.org/docs/2.0.1/api/python/_modules/pyspark/ml/classification.html
# output the index of the classifier with highest confidence as prediction
labelUDF = udf(
lambda predictions: float(max(enumerate(predictions), key=operator.itemgetter(1))[0]),
DoubleType())
# output label and label metadata as prediction
return aggregatedDataset.withColumn(
self.getPredictionCol(), labelUDF(aggregatedDataset[accColName])).drop(accColName)

Related

How can I explain a part of my predictions as a whole with SHAP values? (and not every single prediction)

Background information
I fit a classifier on my training data. When testing my fitted best estimator, I predict the probabilities for one of the classes. I order both my X_test and my y_test by the probabilites in a descending order.
Question
I want to understand which features were important (and to what extend) for the classifier to predict only the 500 predictions with the highest probability as a whole, not for each prediction. Is the following code correct for this purpose?
y_test_probas = clf.predict_proba(X_test)[:, 1]
explainer = shap.Explainer(clf, X_train) # <-- here I put the X which the classifier was trained on?
top_n_indices = np.argsort(y_test_probas)[-500:]
shap_values = explainer(X_test.iloc[top_n_indices]) # <-- here I put the X I want the SHAP values for?
shap.plots.bar(shap_values)
Unfortunately, the shap documentation (bar plot) does not cover this case. Two things are different there:
They use the data the classifier was trained on (I want to use the data the classifier is tested on)
They use the whole X and not part of it (I want to use only part of the data)
Minimal reproducible example
import numpy as np
import pandas as pd
import shap
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Load the Titanic Survival dataset
data = pd.read_csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv")
# Preprocess the data
data = data.drop(["Name"], axis=1)
data = data.dropna()
data["Sex"] = (data["Sex"] == "male").astype(int)
# Split the data into predictors (X) and response variable (y)
X = data.drop("Survived", axis=1)
y = data["Survived"]
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit a logistic regression classifier
clf = LogisticRegression().fit(X_train, y_train)
# Get the predicted class probabilities for the positive class
y_test_probas = clf.predict_proba(X_test)[:, 1]
# Select the indices of the top 500 test samples with the highest predicted probability of the positive class
top_n_indices = np.argsort(y_test_probas)[-500:]
# Initialize the Explainer object with the classifier and the training set
explainer = shap.Explainer(clf, X_train)
# Compute the SHAP values for the top 500 test samples
shap_values = explainer(X_test.iloc[top_n_indices, :])
# Plot the bar plot of the computed SHAP values
shap.plots.bar(shap_values)
I don't want to know how the classifier decides all the predictions, but on the predictions with the highest probability. Is that code suitable to answer this question? If not, how would a suitable code look like?

Unable to convert array of bytes/strings into decimal numbers with dtype='numeric'

I am working on a classification model in which I use the logistic regression algorithm. I got as result: Prediction LR": "['AnxiousPersonalityDisorder']". Now I need to calculate the probability of this result and I have a lot of problem.
here is the code if anyone has an idea of the source of the problem
# code in colab notebook
x = df.text.values.tolist()
y = df.label.values.tolist()
vectorizer = CountVectorizer()
data1 = vectorizer.fit_transform(x)
x_train, x_test, y_train, y_test = train_test_split(data1, y, test_size=0.2, random_state=32,stratify=y)
#Training model
lr = LogisticRegression(random_state=40)
lr.fit(x_train, y_train)
y_pred_train = lr.predict(x_train)
y_pred_test = lr.predict(x_test)
i use fastapi for the deployment
class EmotionAdoPrediction():
def __init__(self):
# Note: for model_path you should make full path (see docker volume)
self.vector = load("C:/Users/aabid/PycharmProjects/emotion-detection-multilabels/model1/Lrvector.joblib")
self.model = load("C:/Users/aabid/PycharmProjects/emotion-detection-multilabels/model1/LRClassifier.joblib")
self.classes_names = {0: ["angry"], 1: ["joy"], 2: ["sadness"], 3: ["fear"]}
return
def predict(self, text):
text = self.data_cleaning(text)
text_clean = self.standardization(text)
probs = self.model.predict([[text_clean]])[:, 1]
proba = np.max(probs[0])
class_ind = np.argmax(probs[0])
return self.classes_names[class_ind], proba```
I suppose that you use the scikit-learn package for logistic regression. If this is the case than you can calculate the probability of the result. The LogisticRegression class contains the predict_log_proba(X) and the predict_proba(X) function.
According to the documentation:
predict_log_proba(X): Predict logarithm of probability estimates.
predict_proba(X): For a multi_class problem, if multi_class is set to be “multinomial” the softmax function is used to find the predicted probability of each class.
Use the predict_proba function, because you have a multi class problem.
y_prob = lr.predict_proba(x_test)
EDIT: Also encode your labels from categorical values to numerical values. Use pandas.get_dummies to convert the categorical value.

Making predictions using all labels in multilabel text classification

I'm currently working on a multilabel text classification problem, in which I have 4 labels, which is represented as 4 dummy variables. I have tried out several ways to transform the data in a way that is suitable for making the MLC.
Right now I'm running with pipelines, but as far as I can see, this doesn't fit a model with all labels included, but rather makes 1 model per label - do you agree with this?
I have tried to use MultiLabelBinarizer and LabelBinarizer, but with no luck.
Do you have a tip on how I can solve this problem in a way that makes the model include all the labels in one model, taking into account the different label combinations?
A subset of the data and my code is here:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# Import data
df = import_data("product_data")
# Define dataframe to only include relevant columns
df = df.loc[:,['text','TV','Internet','Mobil','Fastnet']]
# Define dataframe with labels
df_labels = df.loc[:,['TV','Internet','Mobil','Fastnet']]
# Sum the number of labels per text
sum_column = df["TV"] + df["Internet"] + df["Mobil"] + df["Fastnet"]
df["label_sum"] = sum_column
# Remove texts with no labels
df.drop(df[df['label_sum'] == 0].index, inplace = True)
# Split dataset
train, test = train_test_split(df, random_state=42, test_size=0.2, shuffle=True)
X_train = train.text
X_test = test.text
categories = ['TV','Internet','Mobil','Fastnet']
# Model
LogReg_pipeline = Pipeline([
('tfidf', TfidfVectorizer(analyzer = 'word', max_df=0.20)),
('clf', LogisticRegression(solver='lbfgs', multi_class = 'ovr', class_weight = 'balanced', n_jobs=-1)),
])
for category in categories:
print('... Processing {}'.format(category))
LogReg_pipeline.fit(X_train, train[category])
prediction = LogReg_pipeline.predict(X_test)
print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
https://www.transfernow.net/dl/20210921NbWDt3eo
Code Analysis
The scikit-learn LogisticRegression classifier using OVR (one-vs-rest) can only predict a single output/label at a time. Since you are training the model in the pipeline on multiple labels one at a time, you will produce one trained model per label. The algorithm itself will be the same for all models, but you would have trained them differently.
Multi-Output Regressor
Multi-output regressors can accept multiple independent labels and generate one prediction for each target.
The output should be the same as what you have, but you only need to maintain a single model and train it once.
To use this approach, wrap your LR model in a MultiOutputRegressor.
Here is a good tutorial on multi-output regression models.
model = LogisticRegression(solver='lbfgs', multi_class='ovr', class_weight='balanced', n_jobs=-1)
pipeline = Pipeline([
('tfidf', TfidfVectorizer(analyzer = 'word', max_df=0.20)),
('clf', MultiOutputRegressor(model))])
preds = pipeline.fit(X_train, df_labels).predict(X_test)
df_preds = combine_data(X=X_test, Y=preds, y_cols=categories)
combine_data() merges all data into a single DataFrame for convenience:
def combine_data(X, Y, y_cols):
""" X is a dataframe, Y is a np array, y_cols is a list """
df_out = pd.DataFrame(Y, columns=y_cols)
df_out.index = X.index
return pd.concat([X, df_out], axis=1).sort_index()
Multinomial Logistic Regression
To use a LogisticRegression classifier on all labels at once, set multi_class=multinomial.
The softmax function is used to find the predicted probability of a sample belonging to a class.
You'll need to reverse the one-hot encoding on the label to get back the categorical variable (answer here). If you have the original label before one-hot encoding, use that.
Here is a good tutorial on multinomial logistic regression.
label_col=["text_source"]
clf = LogisticRegression(multi_class='multinomial', solver='lbfgs')
model = clf.fit(df_train[input_cols], df_train[label_col])
# Generate a table of probabilities for each class
probs = model.predict_proba(X_test)
df_probs = combine_data(X=X_test, Y=probs, y_cols=label_col)
# Predict the class for a sample, i.e. the one with the highest probability
preds = model.predict(X_test)
df_preds = combine_data(X=X_test, Y=preds, y_cols=label_col)

How to predict for test query with multinomial Bayes

Trying out Multinomial Naive Bayes. My data set (df) looks like this :
I created training and test dataset and did train the model. This is what I tried so far:
from pyspark.ml.classification import NaiveBayes
# Initialise the model
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
# Fit the model
model = nb.fit(train)
# Make predictions on test data
predictions = model.transform(test)
Now how do I predict the top labels for query = "something" ?

decision tree algorithm on feature set

I'm trying to predict the no.of updates('sys_mod_count')based on the text description('eng')
I have predefined the 'sys_mod_count' into two classes if >=17 as 1; <17 as 0.
But I want to remove this condition as this value is not available at decision time in real world.
I'm thinking to do this in Decision tree/ Random forest method to train the classifier on feature set.
def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False):
# fit the training dataset on the classifier
classifier.fit(feature_vector_train, label)
# predict the labels on validation dataset
predictions = classifier.predict(feature_vector_valid)
# return metrics.accuracy_score(predictions, valid_y)
return predictions
import pandas as pd
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
df_3 =pd.read_csv('processedData.csv', sep=";")
st_new = df_3[['sys_mod_count','eng','ger']]
st_new['updates_binary'] = st_new['sys_mod_count'].apply(lambda x: 1 if x >= 17 else 0)
st_org = st_new[['eng','updates_binary']]
st_org = st_org.dropna(axis=0, subset=['eng']) #Determine if column 'eng'contain missing values are removed
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(st_org['eng'], st_org['updates_binary'],stratify=st_org['updates_binary'],test_size=0.20)
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(st_org['eng'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)
# Naive Bayes on Word Level TF IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xvalid_tfidf)
print ("NB, WordLevel TF-IDF: ", metrics.accuracy_score(accuracy, valid_y))
This seems to be a threshold setting problem - you would like to set a threshold at which a certain classification is made. No supervised classifier can set the threshold for you because if it does not have any training data with binary classes, then you cannot train the cvlassifier, and to create training data, you need to set the threshold to begin with. It's a chicken and egg problem.
If you have some way of identifying which binary label is correct, then you can vary the threshold and measure errors similar to how it's suggested here. Then you can either run a Classifier on your binary labels based on the threshold or a Regressor on sys_mod_count and convert to binary based on the identified threshold.
The above approach does not work if you have no way to identify what the correct binary label should be. Then, the problem you are trying to solve is creating some boundary between points based on the value of your sys_mod_count variable. This is unsupervised learning. So, techniques like clustering will be helpful here. You can cluster your data into two clusters based on the distance of points from each other, and then label each cluster, which becomes your binary label.

Categories

Resources