I'm currently working on a multilabel text classification problem in which I have 4 labels, represented as 4 dummy variables. I have tried several ways to transform the data into a form suitable for multilabel classification (MLC).
Right now I'm running with pipelines, but as far as I can see, this doesn't fit a model with all labels included, but rather makes 1 model per label - do you agree with this?
I have tried to use MultiLabelBinarizer and LabelBinarizer, but with no luck.
Do you have a tip on how I can solve this problem in a way that makes the model include all the labels in one model, taking into account the different label combinations?
A subset of the data and my code is here:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
# Import data
df = import_data("product_data")
# Define dataframe to only include relevant columns
df = df.loc[:,['text','TV','Internet','Mobil','Fastnet']]
# Define dataframe with labels
df_labels = df.loc[:,['TV','Internet','Mobil','Fastnet']]
# Sum the number of labels per text
sum_column = df["TV"] + df["Internet"] + df["Mobil"] + df["Fastnet"]
df["label_sum"] = sum_column
# Remove texts with no labels
df.drop(df[df['label_sum'] == 0].index, inplace = True)
# Split dataset
train, test = train_test_split(df, random_state=42, test_size=0.2, shuffle=True)
X_train = train.text
X_test = test.text
categories = ['TV','Internet','Mobil','Fastnet']
# Model
LogReg_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word', max_df=0.20)),
    ('clf', LogisticRegression(solver='lbfgs', multi_class='ovr', class_weight='balanced', n_jobs=-1)),
])
# Fit and evaluate one binary model per label
for category in categories:
    print('... Processing {}'.format(category))
    LogReg_pipeline.fit(X_train, train[category])
    prediction = LogReg_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
https://www.transfernow.net/dl/20210921NbWDt3eo
Code Analysis
The scikit-learn LogisticRegression classifier using OVR (one-vs-rest) can only predict a single output/label at a time. Because your loop fits the pipeline on one label at a time, you end up with one trained model per label. The algorithm is the same for every model, but each one is fitted to a different target column. Note also that the loop reuses the same pipeline object, so each fit replaces the previous one and only the model for the last label is kept.
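Multi-Output Classifier
A multi-output wrapper accepts several independent label columns and generates one prediction for each target.
The output should be the same as what you have now, but you only need to maintain a single object and call fit once. Internally it still fits one copy of the estimator per label, so label combinations are not modelled jointly.
To use this approach, wrap your LR model in a MultiOutputClassifier. MultiOutputRegressor exposes the same fit/predict interface, but it is intended for continuous targets; the classifier variant is the appropriate one for 0/1 labels and also provides predict_proba.
Here is a good tutorial on multi-output regression models.
from sklearn.multioutput import MultiOutputClassifier
model = LogisticRegression(solver='lbfgs', multi_class='ovr', class_weight='balanced', n_jobs=-1)
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word', max_df=0.20)),
    ('clf', MultiOutputClassifier(model))])
# Fit on the training rows only, so that X and Y have the same number of samples
preds = pipeline.fit(X_train, train[categories]).predict(X_test)
df_preds = combine_data(X=X_test, Y=preds, y_cols=categories)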
combine_data() merges all data into a single DataFrame for convenience:
import pandas as pd
def combine_data(X, Y, y_cols):
    """ X is a DataFrame or Series, Y is a np array, y_cols is a list of column names """
    df_out = pd.DataFrame(Y, columns=y_cols)
    df_out.index = X.index
    return pd.concat([X, df_out], axis=1).sort_index()
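For illustration, here is a minimal sketch of the multi-output wrapper on made-up data. The texts, the two label columns and the variable names are hypothetical, and MultiOutputClassifier is used as the wrapper because the targets are 0/1 indicators. It shows that a single fit call handles all label columns, while one estimator per label is still fitted internally.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data: two labels per text, stored as 0/1 indicator columns
texts = [
    "tv signal is gone", "tv remote broken", "slow internet tonight",
    "internet router offline", "tv and internet both down", "no tv no internet",
    "invoice question", "change my address",
]
Y = np.array([
    [1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [1, 1], [0, 0], [0, 0],
])  # columns: TV, Internet

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(LogisticRegression(class_weight="balanced"))),
])
pipe.fit(texts, Y)  # one call, all label columns at once

new_texts = ["tv is broken", "internet offline"]
preds = pipe.predict(new_texts)          # array of shape (n_samples, n_labels)
probas = pipe.predict_proba(new_texts)   # list with one (n_samples, 2) array per label
n_fitted = len(pipe.named_steps["clf"].estimators_)  # one fitted estimator per label
```

If the dependence between labels matters, this wrapper will not capture it, since each label is still predicted on its own.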
Multinomial Logistic Regression
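To use a single LogisticRegression classifier on all classes at once, set multi_class='multinomial'. This only applies when every sample belongs to exactly one class (multiclass); a multinomial model cannot assign several labels to the same text.
The softmax function is used to compute the predicted probability of a sample belonging to each class, so the probabilities sum to 1 across the classes.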
You'll need to reverse the one-hot encoding on the label to get back the categorical variable (answer here). If you have the original label before one-hot encoding, use that.
Here is a good tutorial on multinomial logistic regression.
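A minimal sketch of the two steps above, on made-up data: reversing a one-hot encoding with idxmax, then fitting one logistic regression over all classes. The column names and feature values are hypothetical, and the sketch assumes every row has exactly one label. The multi_class argument is left out on purpose, since recent scikit-learn releases already fit a multinomial model for multiclass targets with the lbfgs solver.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical one-hot label columns where every row has exactly one 1
onehot = pd.DataFrame({
    "TV":       [1, 0, 0, 1, 0, 0],
    "Internet": [0, 1, 0, 0, 1, 0],
    "Mobil":    [0, 0, 1, 0, 0, 1],
})
# Reverse the one-hot encoding: take the name of the column holding the 1
y = onehot.idxmax(axis=1)

# Hypothetical numeric features, one row per sample
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
              [0.1, 0.9], [0.9, 0.2], [1.1, 0.9]])

clf = LogisticRegression(solver="lbfgs").fit(X, y)
probs = clf.predict_proba(X)    # one column per entry in clf.classes_
row_sums = probs.sum(axis=1)    # class probabilities sum to 1 for every row
```

The columns of probs follow the order of clf.classes_, which is why that attribute is the natural choice for column names when building a probability table.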
# df_train and input_cols are placeholders for your training frame and feature columns;
# X_test must contain the same feature columns
label_col = "text_source"
clf = LogisticRegression(multi_class='multinomial', solver='lbfgs')
model = clf.fit(df_train[input_cols], df_train[label_col])
# Generate a table of probabilities, one column per class
probs = model.predict_proba(X_test)
df_probs = combine_data(X=X_test, Y=probs, y_cols=model.classes_)
# Predict the class for a sample, i.e. the one with the highest probability
preds = model.predict(X_test)
df_preds = combine_data(X=X_test, Y=preds, y_cols=[label_col])
Related
Background information
I fit a classifier on my training data. When testing my fitted best estimator, I predict the probabilities for one of the classes. I order both my X_test and my y_test by the probabilities in descending order.
Question
I want to understand which features were important (and to what extent) for the classifier when predicting only the 500 predictions with the highest probability, taken as a whole rather than for each prediction individually. Is the following code correct for this purpose?
y_test_probas = clf.predict_proba(X_test)[:, 1]
explainer = shap.Explainer(clf, X_train) # <-- here I put the X which the classifier was trained on?
top_n_indices = np.argsort(y_test_probas)[-500:]
shap_values = explainer(X_test.iloc[top_n_indices]) # <-- here I put the X I want the SHAP values for?
shap.plots.bar(shap_values)
Unfortunately, the shap documentation (bar plot) does not cover this case. Two things are different there:
They use the data the classifier was trained on (I want to use the data the classifier is tested on)
They use the whole X and not part of it (I want to use only part of the data)
Minimal reproducible example
import numpy as np
import pandas as pd
import shap
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Load the Titanic Survival dataset
data = pd.read_csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv")
# Preprocess the data
data = data.drop(["Name"], axis=1)
data = data.dropna()
data["Sex"] = (data["Sex"] == "male").astype(int)
# Split the data into predictors (X) and response variable (y)
X = data.drop("Survived", axis=1)
y = data["Survived"]
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit a logistic regression classifier
clf = LogisticRegression().fit(X_train, y_train)
# Get the predicted class probabilities for the positive class
y_test_probas = clf.predict_proba(X_test)[:, 1]
# Select the indices of the top 500 test samples with the highest predicted probability of the positive class
top_n_indices = np.argsort(y_test_probas)[-500:]
# Initialize the Explainer object with the classifier and the training set
explainer = shap.Explainer(clf, X_train)
# Compute the SHAP values for the top 500 test samples
shap_values = explainer(X_test.iloc[top_n_indices, :])
# Plot the bar plot of the computed SHAP values
shap.plots.bar(shap_values)
I don't want to know how the classifier arrives at all of its predictions, only at the predictions with the highest probability. Is this code suitable for answering that question? If not, what would suitable code look like?
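A note on the selection step used in the code above, as a minimal sketch with made-up probabilities (the values are hypothetical): np.argsort sorts in ascending order, so the slice [-n:] returns the indices of the n largest probabilities, smallest first. If fewer than n rows exist, the slice simply returns every index.

```python
import numpy as np

# Hypothetical predicted probabilities for the positive class
y_test_probas = np.array([0.10, 0.85, 0.40, 0.95, 0.60])

# argsort is ascending, so the last n entries index the n largest probabilities
top_2 = np.argsort(y_test_probas)[-2:]
# Reverse the slice to list the most confident prediction first
top_2_desc = top_2[::-1]

# Asking for more rows than exist returns all indices, i.e. no subset at all
top_500 = np.argsort(y_test_probas)[-500:]
```

In the Titanic example the test split appears to hold fewer than 500 rows, so the slice there would select the whole test set rather than a top subset.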
I have trained multiclass classification models on my training and test sets and have achieved good results with SVC. Now I want to use the model to make predictions on my entire dataframe, but I get the following error: ValueError: X has 36976 features, but SVC is expecting 8989 features as input.
My dataframe has two columns: one with the categories (which I manually labeled for around 1/5 of the dataframe) and the text columns with all the texts (including those that have not been labeled).
data={'categories':['1','NaN','3', 'NaN'], 'documents':['Paragraph 1.\nParagraph 2.\nParagraph 3.', 'Paragraph 1.\nParagraph 2.', 'Paragraph 1.\nParagraph 2.\nParagraph 3.\nParagraph 4.', 'Paragraph 1.\nParagraph 2.']}
df=pd.DataFrame(data)
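First, I drop the rows with NaN values in the 'categories' column. I then create the document-term matrix, define 'y', and split into training and test sets.
# word_tokenize is assumed to be NLTK's tokenizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
tf = CountVectorizer(tokenizer=word_tokenize)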
X = tf.fit_transform(df['documents'])
y = df['categories']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
Second, I run the SVC model getting good results:
from sklearn import metrics
from sklearn.svm import SVC
svm = SVC(C=0.1, class_weight='balanced', kernel='linear', probability=True)
model = svm.fit(X_train, y_train)
print('accuracy:', model.score(X_test, y_test))
y_pred = model.predict(X_test)
print(metrics.classification_report(y_test, y_pred))
Finally, I try to apply the SVC model to predict the categories of the entire 'documents' column of my dataframe. To do so, I create the document-term matrix of the entire 'documents' column and then apply the model:
tf_entire_df = CountVectorizer(tokenizer=word_tokenize)
X_entire_df = tf_entire_df.fit_transform(df['documents'])
y_pred_entire_df = model.predict(X_entire_df)
But then I get the error that my X_entire_df has more features than the SVC model is expecting as input. I imagine that this is because I am now trying to apply the model to the whole 'documents' column, but I do not know how to fix this.
I would appreciate your help!
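This error usually comes from feeding the model data whose features differ from those seen during training (more or fewer columns). Here, the second CountVectorizer is fitted on the whole 'documents' column, so it builds a larger vocabulary (36976 terms) than the one the SVC was trained on (8989 terms). Reusing the already fitted vectorizer, i.e. calling tf.transform(df['documents']) instead of a new fit_transform, keeps the columns aligned.
I would strongly suggest using sklearn.pipeline.Pipeline to bundle the preprocessing (CountVectorizer) and your machine learning model (SVC) in a single object.
From experience, this helps a lot in avoiding tedious preprocessing and fitting issues.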
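A minimal sketch of such a pipeline, assuming made-up documents and labels (all texts, labels and variable names below are hypothetical, and the default tokenizer is used instead of word_tokenize to keep the example self-contained). The vocabulary is learned once in fit, and predict reuses it, so unseen words in new documents are simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical labelled documents (the rows that were manually categorised)
labelled_docs = [
    "the striker scored a late goal", "the goalkeeper saved a penalty",
    "the team won the football match", "parliament passed the new budget",
    "the minister announced an election", "voters elected a new parliament",
]
labels = ["sport", "sport", "sport", "politics", "politics", "politics"]

# Hypothetical full corpus, including unlabelled texts with unseen words
all_docs = labelled_docs + ["the coach praised the striker", "the senate debated the budget"]

text_clf = Pipeline([
    ("vectorizer", CountVectorizer()),   # vocabulary is learned in fit() only
    ("svc", SVC(C=0.1, class_weight="balanced", kernel="linear")),
])
text_clf.fit(labelled_docs, labels)

# predict() reuses the fitted vocabulary, so the feature count always matches
all_preds = text_clf.predict(all_docs)
n_features = len(text_clf.named_steps["vectorizer"].vocabulary_)
n_expected = text_clf.named_steps["svc"].n_features_in_
```

With this setup the train/test split is done on the raw text column, and the pipeline receives strings rather than a precomputed document-term matrix.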
My model uses feature importance for feature selection with XGBOOST. But, at the end, it outputs all the confusion matrices/results and how many features the model includes. That now works successfully, but I also need to have the feature names that were used in each model outputted as well.
I get a warning that says "X has feature names, but SelectFromModel was fitted without feature names", so I know something needs to be added to have them be in the model before I can output them, but I'm not sure how to handle either of those steps. I found several old questions about this, but I wasn't able to successfully implement any of them to my particular code. I'd really appreciate any ideas you have. Thank you!
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
# load data
dataset = df_train
# split data into X and y
X_train = df[df.columns.difference(['IsDeceased','IsTotal','Deceased','Sick','Injured','Displaced','Homeless','MissingPeople','Other','Total'])]
y_train = df['IsDeceased'].values
X_test = df_test[df_test.columns.difference(['IsDeceased','IsTotal','Deceased','Sick','Injured','Displaced','Homeless','MissingPeople','Other','Total'])]
y_test = df_test['IsDeceased'].values
# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    print(thresh)
    # eval model
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    report = classification_report(y_test, y_pred)
    print("Thresh= {} , n= {}\n {}".format(thresh, select_X_train.shape[1], report))
    cm = confusion_matrix(y_test, y_pred)
    print(cm)
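One possible way to get the names, sketched under a few assumptions: the data below is synthetic, the column names are hypothetical, and RandomForestClassifier stands in for XGBClassifier only to keep the example self-contained (both expose feature_importances_). The idea is to ask the selector for its boolean mask with get_support() and use that mask to index the DataFrame columns, which also keeps the selected data as a DataFrame with its names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Hypothetical data with named columns, standing in for X_train / y_train
X_arr, y = make_classification(n_samples=200, n_features=6, n_informative=3, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])

# RandomForestClassifier stands in for XGBClassifier here
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

selected_per_threshold = {}
for thresh in sorted(model.feature_importances_):
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    mask = selection.get_support()     # boolean mask over the input columns
    names = list(X.columns[mask])      # names of the features that were kept
    select_X = X.loc[:, mask]          # still a DataFrame, so names are preserved
    selected_per_threshold[thresh] = names
    print("Thresh={:.3f}, n={}, features={}".format(thresh, len(names), names))
```

Indexing with the mask instead of calling selection.transform on the DataFrame is a design choice of this sketch; it sidesteps the mismatch between a selector without stored feature names and an input that has them.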
I have scraped some data from spotify to see if I can classify the music genre of different songs.
I have split my data up into a test set and a remaining set, which I have then further divided into training and validation set.
When I run the model (I try to classify between 112 genres) I get 30% accuracy in the validation set. Of course this is not great, but to be expected with 112 genres and limited data. What really confuses me is that when I apply the model to the test data, accuracy goes down to 1%.
I am not sure why that is: as far as I can see the validation and test data should be comparable. I train the model on the training data which should be completely independent.
I must be making some mistake, either allowing the model to peek into the validation data (better performance there) or messing up my test data.
Or maybe applying the model twice messes things up?
Any idea what could be going on or how to debug it?
Thanks a lot!
Franka
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle
# re-read data
track_df = pd.read_csv('track_df_corr.csv')
features = [ 'acousticness', 'speechiness',
'key', 'liveness', 'instrumentalness', 'energy', 'tempo',
'loudness', 'danceability', 'valence',
'duration_mins', 'year', 'genre']
track_df = track_df[features]
#First make a big split of all the data into test and train.
train, test = train_test_split(track_df, test_size=0.2, random_state = 0)
#Then create training and validation data set from the traindata.
# Read the data. Assign train and test data
# "full" is the data before preprocessing
X_full = train
X_test_full = test
# select to be predicted data
y = X_full.genre # just the target for the training data
y = pd.factorize(y)[0] # keep only the integer codes ([0]); the classifier needs numbers
# Since we later want to evaluate our model on the test data, we also need a y_test.
# select to be predicted data
y_test = X_test_full.genre # just the target for the test data
y_test = pd.factorize(y_test)[0] # keep only the integer codes ([0])
# numbers needed for classifier
# remove to be predicted variable
X_full.drop(['genre'], axis=1, inplace=True) # rest of training free of target, which is now stored in y
X_test_full.drop(['genre'], axis=1, inplace=True) # not sure if necessary but cannot hurt
# Break off validation set from training data (X_full)
# Remember we still have X_test_full as an entirely independent test set.
# Here we just create our training and validation sets from X_full.
X_train_full, X_valid_full, y_train, y_valid = \
train_test_split(X_full, y, train_size=0.8, test_size=0.2, random_state=0)
# General preprocessing steps: take care of categorical data (does not apply here).
categorical_cols = [cname for cname in X_train_full.columns if
X_train_full[cname].nunique() < 10 and
X_train_full[cname].dtype == "object"]
# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if
X_train_full[cname].dtype in ['int64', 'float64']]
# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()
#Time to run the model.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
#Run our model on the TRAINING data
# FRR set up input values that are passed to the Bundle below
# Preprocessing for NUMERICAL data
numerical_transformer = SimpleImputer(strategy='median')
# Preprocessing for CATEGORICAL data
categorical_transformer = Pipeline(steps=[ # FRR Pipeline of transforms with a "final estimator", here "categorical_transformer".
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# FRR Run the numerical_transformer and categorical_transformer defined above here:
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer( # frr Applies transformers to columns of an array or pandas DataFrame.
transformers=[ #frr List of (name,transformer,cols) tuples specifying the transformer objects to
#be applied to subsets of the data.
('num', numerical_transformer, numerical_cols),
('cat', categorical_transformer, categorical_cols)
])
# Define model
model = RandomForestClassifier(n_estimators=100, random_state=0)
# Bundle preprocessing and modeling code in a pipeline
# clf stands for classifier.
# Pipeline can be used to chain multiple estimators into one
# Preprocessing of training data, fit model
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)
                     ])
# Calling fit on the pipeline is the same as calling fit on each estimator in turn (here: preprocessor and model)
clf.fit(X_train, y_train)
# --------------------------------------------------------
#Test our model on the VALIDATION data
# Preprocessing of validation data, get predictions
preds = clf.predict(X_valid)
# Return the mean accuracy on the given test data and labels.
clf.score(X_valid, y_valid) # this is correct!
# The code yields a value around 30%.
# --------------------------------------------------------
# Apply our model on the TESTING data
# Preprocessing of test data, get predictions
preds_test = clf.predict(X_test)
clf.score(X_test, y_test)
#The code yields a value around 1%.
The problem that I see is that you're encoding the train and test labels using pd.factorize. Since you're using pd.factorize on y and y_test independently, the resulting encodings will not correspond to one another. You want to use a LabelEncoder, so that when you fit the encoder using the train data, you then transform y_test using the same encoding scheme.
Here's an example to illustrate this:
from sklearn.preprocessing import LabelEncoder
l = [1,4,6,1,4]
le = LabelEncoder()
le.fit(l)
le.transform(l)
# array([0, 1, 2, 0, 1], dtype=int64)
le.transform([1,6,4])
# array([0, 2, 1], dtype=int64)
Here we get the correct encodings. However if we apply a pd.factorize, obviously pandas can't guess which are the correct encodings:
import pandas as pd
pd.factorize(l)[0]
# array([0, 1, 2, 0, 1], dtype=int64)
pd.factorize([1,6,4])[0]
# array([0, 1, 2], dtype=int64)
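Applied to a train/test split, the same idea could look like the following minimal sketch. The DataFrame is a made-up stand-in for track_df, and fitting the encoder on the full genre column before splitting is an assumption made here so that a genre occurring only in the test split does not raise an error in transform.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for track_df: one feature column and a genre column
track_df = pd.DataFrame({
    "tempo": [120, 95, 140, 128, 90, 150, 110, 100],
    "genre": ["rock", "jazz", "techno", "rock", "jazz", "techno", "rock", "jazz"],
})

# Fit the encoder once on all genre names, before any splitting
le = LabelEncoder().fit(track_df["genre"])

train, test = train_test_split(track_df, test_size=0.25, random_state=0)
y = le.transform(train["genre"])        # same mapping ...
y_test = le.transform(test["genre"])    # ... for both splits

# The integer codes can be mapped back to genre names
decoded = le.inverse_transform(y_test)
```

Because both targets come from the same fitted encoder, a given integer refers to the same genre in the training, validation and test sets.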
I'm implementing Naive Bayes with sklearn on imbalanced data.
My data has more than 16k records and 6 output categories.
I tried to fit the model with the sample_weight calculated by sklearn.utils.class_weight.
The resulting sample_weight looks something like:
sample_weight = [11.77540107 1.82284768 0.64688602 2.47138047 0.38577435 1.21389195]
import numpy as np
data_set = np.loadtxt("./data/_vector21.csv", delimiter=",")
inp_vec = data_set[:, 1:22]
out_vec = data_set[:, 22:]
#
# # Split dataset into training set and test set
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer releases
X_train, X_test, y_train, y_test = train_test_split(inp_vec, out_vec, test_size=0.2) # 80% training and 20% test
#
# class weight
from keras.utils.np_utils import to_categorical
output_vec_categorical = to_categorical(y_train)
from sklearn.utils import class_weight
y_ints = [y.argmax() for y in output_vec_categorical]
c_w = class_weight.compute_class_weight(class_weight='balanced', classes=np.unique(y_ints), y=y_ints)
cw = {}
for i in set(y_ints):
cw[i] = c_w[i]
# Create a Gaussian Classifier
from sklearn.naive_bayes import *
model = GaussianNB()
# Train the model using the training sets
print(c_w)
model.fit(X_train, y_train, c_w)
# Predict the response for test dataset
y_pred = model.predict(X_test)
# Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("\nClassification Report: \n", (metrics.classification_report(y_test, y_pred)))
print("\nAccuracy: %.3f%%" % (metrics.accuracy_score(y_test, y_pred)*100))
I got this message:
ValueError: Found input variables with inconsistent numbers of samples: [13212, 6]
Can anyone tell me what I did wrong and how I can fix it?
Thanks a lot.
The sample_weight and class_weight are two different things.
As their name suggests:
sample_weight is to be applied to individual samples (rows in your data). So the length of sample_weight must match the number of samples in your X.
class_weight is to make the classifier give more importance and attention to the classes. So the length of class_weight must match the number of classes in your targets.
By calling compute_class_weight from sklearn.utils.class_weight you are calculating class weights (6 values, one per class), but you then pass them as sample_weight, which must hold one value per training row (13212). Hence the dimension mismatch error.
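The difference in length can be seen in a minimal sketch on made-up data (the arrays and variable names are hypothetical). compute_class_weight returns one value per class, compute_sample_weight returns one value per row, and only the latter has the shape that fit expects for sample_weight.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight

# Hypothetical imbalanced data: 3 classes, 10 samples, 2 features
rng = np.random.default_rng(0)
X_train = rng.normal(size=(10, 2))
y_train = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])

# One weight per class: n_samples / (n_classes * count_of_class)
class_w = compute_class_weight(class_weight="balanced", classes=np.unique(y_train), y=y_train)

# One weight per sample: each row receives the weight of its own class
sample_w = compute_sample_weight(class_weight="balanced", y=y_train)

# GaussianNB has no class_weight option, so the per-sample weights go into fit()
model = GaussianNB().fit(X_train, y_train, sample_weight=sample_w)
```

Here y_train is a 1-D array of class indices; a one-hot target like the one in the question would first need to be converted, for example with argmax along axis 1.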
Please see the following questions for more understanding of how these two weights interact internally:
What is the difference between sample weight and class weight options in scikit learn?
https://stats.stackexchange.com/questions/244630/difference-between-sample-weight-and-class-weight-randomforest-classifier
This way I was able to calculate the weights to deal with class imbalance.
from sklearn import naive_bayes
from sklearn.utils import class_weight
sample = class_weight.compute_sample_weight('balanced', y_train)
#Classifier Naive Bayes
naive = naive_bayes.MultinomialNB()
naive.fit(X_train,y_train, sample_weight=sample)
predictions_NB = naive.predict(X_test)