BertForSequenceClassification vs. BertForMultipleChoice for sentence multi-class classification - python

I'm working on a text classification problem (e.g. sentiment analysis), where I need to classify a text string into one of five classes.
I just started using the Hugging Face Transformers package and BERT with PyTorch. What I need is a classifier with a softmax layer on top so that I can do 5-way classification. Confusingly, there seem to be two relevant options in the Transformers package: BertForSequenceClassification and BertForMultipleChoice.
Which one should I use for my 5-way classification task? What are the appropriate use cases for them?
The documentation for BertForSequenceClassification doesn't mention softmax at all, although it does mention cross-entropy. I am not sure if this class is only for 2-class classification (i.e. logistic regression).
Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks.
labels (torch.LongTensor of shape (batch_size,), optional, defaults to None) – Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss), If config.num_labels > 1 a classification loss is computed (Cross-Entropy).
The documentation for BertForMultipleChoice mentions softmax, but the way the labels are described, it sounds like this class is for multi-label classification (that is, a binary classification for multiple labels).
Bert Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a softmax) e.g. for RocStories/SWAG tasks.
labels (torch.LongTensor of shape (batch_size,), optional, defaults to None) – Labels for computing the multiple choice classification loss. Indices should be in [0, ..., num_choices] where num_choices is the size of the second dimension of the input tensors.
Thank you for any help.

The answer to this lies in the (admittedly very brief) description of what the tasks are about:
[BertForMultipleChoice] [...], e.g. for RocStories/SWAG tasks.
When looking at the paper for SWAG, it seems that the task is actually learning to choose from varying options. This is in contrast to your "classical" classification task, in which the "choices" (i.e., classes) do not vary across your samples, which is exactly what BertForSequenceClassification is for.
Both variants can in fact handle an arbitrary number of classes (for BertForSequenceClassification) or choices (for BertForMultipleChoice): the number of classes is set via the num_labels parameter in the config, while the number of choices is given by the second dimension of the input tensors. But since it seems like you are dealing with a case of "classical classification", I suggest using the BertForSequenceClassification model.
Briefly addressing the missing softmax in BertForSequenceClassification: since the set of classes is fixed and independent of the sample (unlike multiple choice, where the set of choices changes per sample), the head can simply output raw logits and use cross-entropy loss, which folds the softmax into the loss computation for increased numerical stability.
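As a minimal sketch (assuming the transformers and torch packages are installed; the checkpoint name, example text, and label index below are purely illustrative), 5-way classification with BertForSequenceClassification could look like this:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# num_labels=5 puts a 5-way classification head on top of the pooled output
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=5)

inputs = tokenizer("I really enjoyed this movie!", return_tensors="pt")
labels = torch.tensor([3])  # class index in [0, 4]

outputs = model(**inputs, labels=labels)
loss, logits = outputs[:2]  # cross-entropy loss is computed internally; works for tuple or ModelOutput returns
probs = torch.softmax(logits, dim=-1)  # apply softmax yourself if you need probabilities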

Related

Model doesn't learn from data

We have a dataset with ~40000 data points, each having 160 features. We know nothing about what each feature represents, but they are 0-5 integers, most probably some rankings. Our task is to take a subset of those features, let's say (40000, 30), and predict the initial (40000, 160) data. In other words, we need to create a model that takes 30 features as input and outputs the full set of 160 features.
[Example of the dataset: https://i.stack.imgur.com/Ko6nR.png]
What we have done so far: we trained an ANN with the following architecture:
30->200->150->163
We are calculating an accuracy score by rounding the prediction (let's say I predicted 3.6 for a target of 4; 3.6 rounds to 4, and 4 == 4, so True).
We got ~52% accuracy and nothing makes it go higher.
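For reference, a minimal sketch of the rounding-based accuracy described above (assuming numpy arrays y_pred and y_true of the same shape; the names are illustrative):

import numpy as np

def rounded_accuracy(y_pred, y_true):
    # e.g. a prediction of 3.6 rounds to 4 and counts as correct if the target is 4
    return np.mean(np.rint(y_pred) == y_true)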
So, the problem is a multi-output regression problem. The prediction is done from 30 discrete numeric features. Normalization was done using both min-max scaling and standardization (the target is also normalized). In the model we tried different numbers of layers with different capacities, tried batch norm, different activations (ReLU is used now; no activation is used for the output layer), different losses (MSE is the current one), and different optimizers (Adam is the current one). Both Keras and PyTorch were used, in case something was wrong with the PyTorch implementation.
So, the accuracy still remains at 50-52%. There is one straightforward observation: when we increase the model capacity (the number of parameters), the model should become more prone to overfitting. Yet even after increasing the model capacity very substantially, we couldn't make the model overfit the data. We tried to use the features separately (for example, predict one feature from another): nothing useful. We also tried to predict 1 feature using 159 features, but again ~52% and even less.
What I understand and can conclude from this is that there is no relationship between those ratings and most of them can't predict the others. What do you think about this case?

XGBoost for multiclassification and imbalanced data

I am dealing with a classification problem with 3 classes [0,1,2], and imbalanced class distribution as shown below.
I want to apply XGBClassifier (in Python) to this classification problem, but the model does not respond to class_weight adjustments and skews towards the majority class 0, and ignores the minority classes 1,2. Which hyperparameters other than class_weight can help me?
I tried 1) computing class weights using sklearn compute_class_weight; 2) setting weights according to the relative frequency of the classes; 3) and also manually adjusting classes with extreme values to see if any change happens at all, such as {0:0.5,1:100,2:200}. But in any case, it does not help the classifier to take the minority classes into account.
Observations:
I can handle the problem in the binary case: if I make it a binary classification problem by merging classes [1,2] into one, then I can get the classifier to work properly by adjusting scale_pos_weight (even in this case class_weight alone does not help).
But scale_pos_weight, as far as I know, works only for binary classification. Is there an analogue of this parameter for multi-class classification problems?
Using RandomForestClassifier instead of XGBClassifier, I can handle the problem by setting class_weight='balanced_subsample' and tuning max_leaf_nodes. But, for some reason, this approach does not work for XGBClassifier.
Remark: I know about balancing techniques, such as over/undersampling or SMOTE. But I want to avoid them as much as possible, and prefer a solution using hyperparameter tuning of the model if possible.
My observation above shows that this can work for the binary case.
The sample_weight parameter is useful for handling imbalanced data when training with XGBoost. You can compute sample weights using compute_sample_weight() from the sklearn library.
This code should work for multiclass data:
from sklearn.utils.class_weight import compute_sample_weight

sample_weights = compute_sample_weight(
    class_weight='balanced',
    y=train_df['class']  # provide your own target name
)

xgb_classifier.fit(X, y, sample_weight=sample_weights)
You can use sample_weight as @Prakash Dahal suggested, but compute your own weights. I found that different weights made a dramatic difference (I have 12 classes and very imbalanced data).
If you compute your own weights, you need to assign the relevant weight to each entry and pass the param to the classifier in the same way:
xgb_class.fit(X_train, y_train, sample_weight=weights)
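For illustration, a minimal sketch of hand-computed inverse-frequency weights (assuming y_train is a pandas Series and xgb_class is an XGBClassifier as above; the balanced-style weighting scheme here is just one possible choice):

class_counts = y_train.value_counts()
# weight of a class = n_samples / (n_classes * class_count), i.e. rarer classes get larger weights
class_weights = {c: len(y_train) / (len(class_counts) * n) for c, n in class_counts.items()}
weights = y_train.map(class_weights).to_numpy()

xgb_class.fit(X_train, y_train, sample_weight=weights)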

Multi-label text classification with non-uniform distribution of class labels for every train data

I have a multi-label classification problem: I want to classify texts with six labels. Each text can have one to six labels, but the label distribution is not uniform. For example, 10 people annotated sentence1 as below:
These label values are the numbers of votes for each class. I can normalize them, e.g. sad 0.7, anger 0.2, fear 0.1, happy 0.0, ...
What is the best classifier for this problem? What is the best representation for the labels; should I normalize them or not?
What keywords should I search for this kind of multi-label classification problem, where the label probabilities are not equal?
Well, first, to clarify whether I understand your problem correctly: you have sentences=[sent1, sent2, ..., sentn] and you want to classify them into six labels labels=[l1, l2, ..., l6]. Your data isn't the labels themselves, but the probability of each label applying to a text. You also mentioned that the six labels come from human annotation (10 people voting per sentence).
If this is the case, you can approach the problem from either a multi-label classification or a multi-target regression perspective. I'll describe what you can do with your data in both cases:
Multi-label classification: in this case, you need to define the classes for each sentence so that you can train your model; right now you only have the probabilities. You can do that by choosing a threshold: the labels whose probabilities are above the threshold are considered the labels for that sentence (see the sketch below). You can read more about the evaluation metrics here.
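For example, a minimal sketch of that thresholding step (the 0.3 threshold and the vote fractions are purely illustrative):

import numpy as np

votes = np.array([[0.7, 0.2, 0.1, 0.0, 0.0, 0.0],   # sentence 1: sad, anger, fear, happy, ...
                  [0.1, 0.1, 0.0, 0.6, 0.1, 0.1]])  # sentence 2
threshold = 0.3
labels = (votes >= threshold).astype(int)           # multi-hot label vector per sentence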
Multi-target regression: in this case, you don't need to define the classes; you use the training input directly and predict the probability of each label. I think this is a better and easier fit, given how your data was collected. If you want to know more about multi-target regression, you can read more about it here, but be aware that the models used in that tutorial are not the state-of-the-art.
Training models: you can use both shallow and deep models for this task. You need a model that receives a sentence as input and predicts six labels or six probabilities. I suggest you take a look at this example; it can be a very good starting point for your work. The author provides a tutorial on how to build a multi-label text classifier using deep neural networks. He basically built an LSTM with a feed-forward layer at the end to classify the labels. If you decide to use regression instead of classification, you can just drop the final activation.
The best results are likely to be obtained by deep neural networks, so the article I sent you should work very well. I also suggest you take a look at state-of-the-art methods for text classification, such as BERT or XLNet. I implemented a multi-label classification method using BERT; maybe it can be helpful to you.

Multi-output regression

I have been looking into multi-output regression for the last few weeks. I am working with the scikit-learn package. My machine learning problem has an input of 3 features and needs to predict two output variables. Some ML models in the sklearn package support multi-output regression natively. If a model does not support this, the sklearn multi-output regression wrapper can be used to convert it. The MultiOutputRegressor class fits one regressor per target.
Does the MultiOutputRegressor class, or the algorithms with native multi-output support, take the underlying relationship of the input variables into account?
Instead of a multi-output regression algorithm, should I use a neural network?
1) For your first question, I have divided that into two parts.
First part has the answer written in the documentation you linked and also in this user guide topic, which states explicitly that:
As MultiOutputRegressor fits one regressor per target it can not take
advantage of correlations between targets.
The second part of the first question asks about other algorithms that support this. For that you can look at the "inherently multiclass" part in the user guide. "Inherently multi-class" means that the algorithm does not rely on a One-vs-Rest or One-vs-One strategy to handle multiple classes (OvO and OvR fit multiple models for multiple classes and so may not use the relationship between targets); instead, it handles the multi-class setting within a single model. This lists the following:
sklearn.naive_bayes.BernoulliNB
sklearn.tree.DecisionTreeClassifier
sklearn.tree.ExtraTreeClassifier
sklearn.ensemble.ExtraTreesClassifier
sklearn.naive_bayes.GaussianNB
sklearn.neighbors.KNeighborsClassifier
sklearn.semi_supervised.LabelPropagation
sklearn.semi_supervised.LabelSpreading
sklearn.discriminant_analysis.LinearDiscriminantAnalysis
sklearn.svm.LinearSVC (setting multi_class=”crammer_singer”)
sklearn.linear_model.LogisticRegression (setting multi_class=”multinomial”)
...
...
...
Try replacing the 'Classifier' at the end with 'Regressor' and look at the documentation of the fit() method there. For example, let's take DecisionTreeRegressor.fit():
y : array-like, shape = [n_samples] or [n_samples, n_outputs]
The target values (real numbers).
Use dtype=np.float64 and order='C' for maximum efficiency.
You see that it supports a 2-d array for the targets (y), so it may be able to use the correlations and underlying relationships between targets.
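As a minimal sketch (with random, purely illustrative data), both options look like this: DecisionTreeRegressor accepts a 2-d target directly, while MultiOutputRegressor fits one independent model per target.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import LinearRegression

X = np.random.rand(100, 3)  # 3 input features
y = np.random.rand(100, 2)  # 2 output targets

tree = DecisionTreeRegressor().fit(X, y)                      # native multi-output support
wrapped = MultiOutputRegressor(LinearRegression()).fit(X, y)  # one regressor per target

print(tree.predict(X[:2]).shape)     # (2, 2)
print(wrapped.predict(X[:2]).shape)  # (2, 2)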
2) Now, for your second question about whether to use a neural network or not: it depends on personal preference, the type of problem, the amount and type of data you have, and how much training you are willing to do. Maybe you can try multiple algorithms and choose whatever gives the best output for your data and problem.

How to choose cross-entropy loss in TensorFlow?

Classification problems, such as logistic regression or multinomial
logistic regression, optimize a cross-entropy loss.
Normally, the cross-entropy layer follows the softmax layer,
which produces a probability distribution.
In tensorflow, there are at least a dozen different cross-entropy loss functions:
tf.losses.softmax_cross_entropy
tf.losses.sparse_softmax_cross_entropy
tf.losses.sigmoid_cross_entropy
tf.contrib.losses.softmax_cross_entropy
tf.contrib.losses.sigmoid_cross_entropy
tf.nn.softmax_cross_entropy_with_logits
tf.nn.sigmoid_cross_entropy_with_logits
...
Which ones work only for binary classification, and which are suitable for multi-class problems? When should you use sigmoid instead of softmax? How are the sparse functions different from the others, and why do they exist only for softmax?
Related (more math-oriented) discussion: What are the differences between all these cross-entropy losses in Keras and TensorFlow?.
Preliminary facts
In the functional sense, the sigmoid is a special case of the softmax function, when the number of classes equals 2. Both of them do the same operation: transform the logits (see below) to probabilities.
In simple binary classification, there's no big difference between the two; however, in the case of multinomial classification, sigmoid allows you to deal with non-exclusive labels (a.k.a. multi-labels), while softmax deals with mutually exclusive classes (see below).
A logit (also called a score) is a raw unscaled value associated with a class, before computing the probability. In terms of neural network architecture, this means that a logit is an output of a dense (fully-connected) layer.
Tensorflow naming is a bit strange: all of the functions below accept logits, not probabilities, and apply the transformation themselves (which is simply more efficient).
Sigmoid functions family
tf.nn.sigmoid_cross_entropy_with_logits
tf.nn.weighted_cross_entropy_with_logits
tf.losses.sigmoid_cross_entropy
tf.contrib.losses.sigmoid_cross_entropy (DEPRECATED)
As stated earlier, the sigmoid loss function is for binary classification. But the tensorflow functions are more general and also allow multi-label classification, when the classes are independent. In other words, tf.nn.sigmoid_cross_entropy_with_logits solves N binary classifications at once.
The labels must be one-hot encoded or can contain soft class probabilities.
tf.losses.sigmoid_cross_entropy in addition allows you to set the in-batch weights, i.e. make some examples more important than others.
tf.nn.weighted_cross_entropy_with_logits allows you to set class weights (remember, the classification is binary), i.e. make positive errors larger than negative errors. This is useful when the training data is unbalanced.
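A minimal sketch of the multi-label case with the tf.nn variant (TF 2.x eager style; the numbers are purely illustrative):

import tensorflow as tf

logits = tf.constant([[ 2.0, -1.0,  0.5],
                      [-0.5,  3.0, -2.0]])  # [batch_size, num_labels], raw scores
labels = tf.constant([[1.0, 0.0, 1.0],
                      [0.0, 1.0, 0.0]])     # each label is an independent yes/no

# one binary cross-entropy per label, i.e. N binary classifications at once
per_label = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
loss = tf.reduce_mean(per_label)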
Softmax functions family
tf.nn.softmax_cross_entropy_with_logits (DEPRECATED IN 1.5)
tf.nn.softmax_cross_entropy_with_logits_v2
tf.losses.softmax_cross_entropy
tf.contrib.losses.softmax_cross_entropy (DEPRECATED)
These loss functions should be used for multinomial mutually exclusive classification,
i.e. pick one out of N classes. Also applicable when N = 2.
The labels must be one-hot encoded or can contain soft class probabilities:
a particular example can belong to class A with 50% probability and class B
with 50% probability. Note that strictly speaking it doesn't mean that
it belongs to both classes, but one can interpret the probabilities this way.
Just like in the sigmoid family, tf.losses.softmax_cross_entropy allows you to set the in-batch weights, i.e. make some examples more important than others. As far as I know, as of tensorflow 1.3, there's no built-in way to set class weights.
[UPD] In tensorflow 1.5, v2 version was introduced and the original softmax_cross_entropy_with_logits loss got deprecated. The only difference between them is that in a newer version, backpropagation happens into both logits and labels (here's a discussion why this may be useful).
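A minimal sketch of the mutually exclusive case with the tf.nn variant (TF 2.x eager style, where tf.nn.softmax_cross_entropy_with_logits already has the v2 behaviour; the numbers are purely illustrative):

import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1],
                      [0.5, 0.5, 3.0]])  # [batch_size, num_classes]
labels = tf.constant([[1.0, 0.0, 0.0],   # one-hot ...
                      [0.0, 0.5, 0.5]])  # ... or soft class probabilities

per_example = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
loss = tf.reduce_mean(per_example)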
Sparse functions family
tf.nn.sparse_softmax_cross_entropy_with_logits
tf.losses.sparse_softmax_cross_entropy
tf.contrib.losses.sparse_softmax_cross_entropy (DEPRECATED)
Like ordinary softmax above, these loss functions should be used for
multinomial mutually exclusive classification, i.e. pick one out of N classes.
The difference is in labels encoding: the classes are specified as integers (class index),
not one-hot vectors. Obviously, this doesn't allow soft classes, but it
can save some memory when there are thousands or millions of classes.
However, note that logits argument must still contain logits per each class,
thus it consumes at least [batch_size, classes] memory.
Like above, the tf.losses version has a weights argument which allows you to set the in-batch weights.
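A minimal sketch of the sparse variant: the same task as above, but the labels are class indices (TF 2.x eager style; the numbers are purely illustrative):

import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1],
                      [0.5, 0.5, 3.0]])  # still [batch_size, num_classes]
labels = tf.constant([0, 2])             # class indices, shape [batch_size]; no soft classes

per_example = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
loss = tf.reduce_mean(per_example)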
Sampled softmax functions family
tf.nn.sampled_softmax_loss
tf.contrib.nn.rank_sampled_softmax_loss
tf.nn.nce_loss
These functions provide another alternative for dealing with a huge number of classes.
Instead of computing and comparing an exact probability distribution, they compute
a loss estimate from a random sample.
The arguments weights and biases specify a separate fully-connected layer that
is used to compute the logits for a chosen sample.
Like above, labels are not one-hot encoded, but have the shape [batch_size, num_true].
Sampled functions are only suitable for training. In test time, it's recommended to
use a standard softmax loss (either sparse or one-hot) to get an actual distribution.
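A minimal sketch of sampled softmax at training time (TF 2.x eager style; the class count, dimensions, and values are purely illustrative):

import tensorflow as tf

num_classes, dim, batch_size = 10000, 128, 4

# the separate fully-connected layer mentioned above: one weight row and one bias per class
class_weights = tf.Variable(tf.random.normal([num_classes, dim]))
class_biases = tf.Variable(tf.zeros([num_classes]))

hidden = tf.random.normal([batch_size, dim])                    # network output before the final layer
labels = tf.constant([[1], [42], [7], [9999]], dtype=tf.int64)  # shape [batch_size, num_true]

loss = tf.nn.sampled_softmax_loss(
    weights=class_weights, biases=class_biases,
    labels=labels, inputs=hidden,
    num_sampled=64, num_classes=num_classes)                    # per-example losses, shape [batch_size]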
Another alternative loss is tf.nn.nce_loss, which performs noise-contrastive estimation (if you're interested, see this very detailed discussion). I've included this function to the softmax family, because NCE guarantees approximation to softmax in the limit.
However, for version 1.5, softmax_cross_entropy_with_logits_v2 must be used instead, passing its arguments as keyword arguments (argument_key=...), for example:
softmax_cross_entropy_with_logits_v2(_sentinel=None, labels=y,
                                     logits=my_prediction, dim=-1, name=None)
While it is great that the accepted answer contains a lot more info than what was asked, I felt that sharing a few generic rules of thumb would make the answer more compact and intuitive:
There is just one real loss function: cross-entropy (CE). For the special case of binary classification, this loss is called binary CE (note that the formula does not change), and for non-binary or multi-class situations the same loss is called categorical CE (CCE). The sparse functions are a special case of categorical CE where the expected values are not one-hot encoded but are integers.
We have the softmax formula, which is an activation for the multi-class scenario. For the binary scenario, the same formula is given a special name: sigmoid activation.
Because there are sometimes numerical instabilities (for extreme values) when dealing with logarithmic functions, TF recommends combining the activation layer and the loss layer into one single function. This combined function is numerically more stable. TF provides these combined functions, and they are suffixed with _with_logits.
With this, let us now approach some situations. Say there is a simple binary classification problem: is a cat present in the image or not? What is the choice of activation and loss function? It will be a sigmoid activation and a (binary) CE. So one could use sigmoid_cross_entropy or, more preferably, sigmoid_cross_entropy_with_logits. The latter combines the activation and the loss function and is supposed to be numerically stable.
How about multi-class classification? Say we want to know whether a cat, a dog, or a donkey is present in the image. What is the choice of activation and loss function? It will be a softmax activation and a (categorical) CE. So one could use softmax_cross_entropy or, more preferably, softmax_cross_entropy_with_logits. We assume that the expected value is one-hot encoded (100 or 010 or 001). If (for some weird reason) this is not the case and the expected value is an integer (either 1 or 2 or 3), you could use the 'sparse' counterparts of the above functions.
There could be a third case: multi-label classification. So there could be both a dog and a cat in the same image. How do we handle this? The trick here is to treat this situation as multiple binary classification problems: basically cat or no cat, dog or no dog, and donkey or no donkey. Find the loss for each of the 3 binary classifications and then add them up. So essentially this boils down to using the sigmoid_cross_entropy_with_logits loss.
This answers the 3 specific questions you have asked. The functions shared above are all that are needed. You can ignore the tf.contrib family, which is deprecated and should not be used.
