I am working on a sequential labeling problem with imbalanced classes and I would like to use sample_weight to address the imbalance. Basically, if I train the model for about 10 epochs, I get great results. If I train for more epochs, val_loss keeps dropping, but I get worse results. I'm guessing the model just detects more of the dominant class to the detriment of the smaller classes.
The model has two inputs, for word embeddings and character embeddings, and each word is labeled with one of 7 possible classes, from 0 to 6.
With the padding, the shape of my input layer for word embeddings is (3000, 150) and the input layer for character embeddings is (3000, 150, 15). I use a 0.3 split for testing and training data, which means X_train for word embeddings is (2000, 150) and (2000, 150, 15) for char embeddings. y contains the correct class for each word, encoded as a one-hot vector of dimension 7, so its shape is (3000, 150, 7). y is likewise split into a training and a testing set. Each input is then fed into a Bidirectional LSTM.
The output is a matrix with one of the 7 categories assigned for each word of the 2000 training samples, so the size is (2000, 150, 7).
At first, I simply tried to define sample_weight as an np.array of length 7 containing the weights for each class:
from collections import Counter
import numpy as np

# Count how often each class occurs across all timesteps of y
count = [list(array).index(1) for sample in y for array in sample]
count = dict(Counter(count))
count[0] = 0  # ignore the padding class
total = sum(count.values())
count = {k: count[k] / total for k in count}
category_weights = np.zeros(7)
for f in count:
    category_weights[f] = count[f]
But I get the following error: ValueError: Found a sample_weight array with shape (7,) for an input with shape (2000, 150, 7). sample_weight cannot be broadcast.
Looking at the docs, it looks like I should instead be passing a 2D array with shape (samples, sequence_length). So I create a (3000, 150) array with a concatenation of the weights of every word of each sequence:
# Build a (samples, sequence_length) matrix with the weight of each word's class
weights = []
for sample in y:
    current_weight = []
    for line in sample:
        current_weight.append(count[list(line).index(1)])
    weights.append(current_weight)
weights = np.array(weights)
and pass that to the fit function through the sample_weight parameter after having added the sample_weight_mode="temporal" option in compile().
I first got an error telling me the dimension was wrong; however, after generating the weights for only the training samples, I ended up with a (2000, 150) array that I can use to fit my model.
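For reference, the compile/fit calls look roughly like this (a sketch only; model, X_train_words, X_train_chars, y_train and weights_train stand for the objects described above, and the optimizer/loss choices are just illustrative):

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"],
              sample_weight_mode="temporal")  # enables per-timestep weights

model.fit([X_train_words, X_train_chars], y_train,
          sample_weight=weights_train,        # shape (2000, 150): one weight per word
          batch_size=32, epochs=10)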
Is this a proper way to define sample_weights, or am I doing it all wrong? I can't say I've noticed any improvements from adding the weights, so I must have missed something.
I think you are confusing sample_weights and class_weights. Checking the docs a bit we can see the differences between them:
sample_weights is used to provide a weight for each training sample. That means that you should pass a 1D array with the same number of elements as your training samples (indicating the weight for each of those samples). In case you are using temporal data you may instead pass a 2D array, enabling you to give weight to each timestep of each sample.
class_weights is used to provide a weight or bias for each output class. This means you should pass a weight for each class that you are trying to classify. Furthermore, this parameter expects a dictionary to be passed to it (not an array, that is why you got that error). For example consider this situation:
class_weight = {0 : 1. , 1: 50.}
In this case (a binary classification problem) you are giving 50 times as much weight (or "relevance") to your samples of class 1 compared to class 0. This way you can compensate for imbalanced datasets. Here is another useful post explaining more about this and other options to consider when dealing with imbalanced datasets.
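For instance, a common way to build such a dictionary from class counts is inverse-frequency weighting. The sketch below is illustrative only and assumes y_int holds the integer class labels:

from collections import Counter

counts = Counter(y_int)                  # e.g. {0: 120000, 1: 3000, ...}
total = sum(counts.values())
class_weight = {c: total / (len(counts) * n) for c, n in counts.items()}

# Note: some Keras versions reject class_weight for 3-D (per-timestep) targets;
# in that case expand these weights into a (samples, timesteps) sample_weight matrix.
model.fit(X_train, y_train, class_weight=class_weight, epochs=10)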
If I train for more epochs, val_loss keeps dropping, but I get worse results.
Probably you are over-fitting, and something that may be contributing to that is the imbalanced classes in your dataset, as you correctly suspected. Compensating with class weights should help mitigate this; however, there may still be other factors causing over-fitting that escape the scope of this question/answer (so make sure to watch out for those after solving this one).
Judging by your post, it seems to me that what you need is to use class_weight to balance your dataset for training, for which you will need to pass a dictionary indicating the weight ratios between your 7 classes. Consider using sample_weight only if you want to give each sample a custom weight.
If you want a more detailed comparison between the two, consider checking this answer I posted on a related question. Spoiler: sample_weight overrides class_weight, so you have to use one or the other, but not both, so be careful not to mix them.
Update: As of the moment of this edit (March 27, 2020), looking at the source code of training_utils.standardize_weights() we can see that it now supports both class_weights and sample_weights:
Everything gets normalized to a single sample-wise (or timestep-wise) weight array. If both sample_weights and class_weights are provided, the weights are multiplied together.
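In that case a combined call looks roughly like this (a sketch for ordinary 2-D targets; per_sample_weights is an illustrative name, and the effective weight of each sample becomes class_weight[label] * sample_weight):

model.fit(X_train, y_train,
          class_weight={0: 1.0, 1: 50.0},     # per-class bias
          sample_weight=per_sample_weights,   # 1-D array, one weight per sample
          epochs=10)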
I searched online for the same question and I did get a good accuracy improvement after using sample_weight correctly in my case.
I think your understanding is correct and the procedure is also correct. One possible reason you see no improvement is that, when you pass in sample_weight, a higher value means a higher weight, so you cannot use the word counts directly. You might consider using the inverted count frequency instead:
total = sum(count.values())
count = {k: count[k] / total for k in count}
category_weights = np.zeros(7)
for f in count:
    category_weights[f] = 1 - count[f]  # rare classes get the larger weights
I am working with the REINFORCE algorithm in PyTorch. I noticed that the batch predictions of my simple network with Softmax don't sum to 1 (not even close to 1). I am attaching a minimal working example so that you can reproduce it. What am I missing here?
import numpy as np
import torch
obs_size = 9
HIDDEN_SIZE = 9
n_actions = 2
np.random.seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(obs_size, HIDDEN_SIZE),
    torch.nn.ReLU(),
    torch.nn.Linear(HIDDEN_SIZE, n_actions),
    torch.nn.Softmax(dim=0)
)
state_transitions = np.random.rand(3, obs_size)
state_batch = torch.Tensor(state_transitions)
pred_batch = model(state_batch) # WRONG PREDICTIONS!
print('wrong predictions:\n', *pred_batch.detach().numpy())
# [0.34072137 0.34721774] [0.30972624 0.30191955] [0.3495524 0.3508627]
# DOES NOT SUM TO 1 !!!
pred_batch = [model(s).detach().numpy() for s in state_batch] # CORRECT PREDICTIONS
print('correct predictions:\n', *pred_batch)
# [0.5955179 0.40448207] [0.6574412 0.34255883] [0.624833 0.37516695]
# DOES SUM TO 1 AS EXPECTED
Although PyTorch lets us get away with it, we don’t actually provide an input with the right dimensionality. We have a model that takes one input and produces one output, but PyTorch nn.Module and its subclasses are designed to do so on multiple samples at the same time. To accommodate multiple samples, modules expect the zeroth dimension of the input to be the number of samples in the batch.
Deep Learning with PyTorch
That your model works on each individual sample is an implementation nicety. You have specified the wrong dimension for the softmax (across the batch instead of across the action logits), so when given a batch it computes the softmax across samples instead of within each sample:
nn.Softmax requires us to specify the dimension along which the softmax function is applied:
softmax = nn.Softmax(dim=1)
In this case, we have two input vectors in two rows (just like when we work with batches), so we initialize nn.Softmax to operate along dimension 1.
Change torch.nn.Softmax(dim=0) to torch.nn.Softmax(dim=1) to get appropriate results.
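A minimal sketch of the corrected model (same layers as in the question, only the softmax dimension changed):

import torch

obs_size, HIDDEN_SIZE, n_actions = 9, 9, 2
model = torch.nn.Sequential(
    torch.nn.Linear(obs_size, HIDDEN_SIZE),
    torch.nn.ReLU(),
    torch.nn.Linear(HIDDEN_SIZE, n_actions),
    torch.nn.Softmax(dim=1),  # softmax over the action dimension, not the batch
)

state_batch = torch.rand(3, obs_size)
pred_batch = model(state_batch)
print(pred_batch.sum(dim=1))  # each row now sums to 1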
I've taken a quick course in neural networks to better understand them and now I'm trying them out for myself in R. I'm following this documentation of Keras.
The way I understand what is happening:
We are inputting a series of images and transforming these images into numerical matrices based on the arrangement of the pixels and the colors in those pixels. We then build a neural network model to learn the pattern of these arrangements, depending on the classification (0 to 9). We then use the model to predict which class an image belongs to. I'll be honest and admit I'm not entirely sure what y_train and x_train are. I simply see them as one training set and one validation set, so I'm not sure what the difference between x and y is.
My question:
I've followed the steps to a T, the model runs fine, and the predictions look like they do in the documentation. Ultimately, the prediction looks like this:
I take this to mean that observation 1 in x_test is predicted to be a category 7.
However, looking at x_test it looks like this:
There is a 0 in every column and row, also if I scroll further down. This is where I get confused. I'm also not sure how to view the original images, to see for myself how well the model is predicting them. I would eventually like to draw a number myself in Paint or so and then see if the model can predict it, but for that I first need to understand what is going on. I feel I am close, but I just need a little nudge!
I think it would help if you read more about the input and output layers' dimensions.
In your example:
Input layer:
A single training image has dimensions 28*28, which is then flattened into a single vector of dimension 784. This acts as the input layer for the neural network.
So for m training examples, your input layer will have dimensions (m, 784). By analogy to traditional ML systems, you can imagine that each pixel of an image is converted into a feature (x1, x2, ... x784), and your training set is a dataframe with m rows and 784 columns, which is then fed into the neural network to compute y_hat = f(x1, x2, x3, ... x784).
Output layer:
As an output for our neural network, we want it to predict which number it is from 0 to 9. So for a single training example, the output layer has dimension 10, representing each number from 0 to 9 and for n testing examples the output layer would be a matrix with dimension n*10.
Our y is a vector of length n, something like [1,7,8,2,.....], containing the true value for each testing example. But to match the dimension of the output layer, the y vector is converted using one-hot encoding: each label becomes a length-10 vector with a 1 at the index of that digit and zeros everywhere else, so the number 7 becomes [0,0,0,0,0,0,0,1,0,0].
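As an illustration, this is the conversion that one-hot encoding performs; the snippet below is in Python (the R keras interface provides an analogous to_categorical() function):

import numpy as np
from tensorflow.keras.utils import to_categorical

y = np.array([1, 7, 8, 2])            # true labels for four examples
print(to_categorical(y, num_classes=10))
# the row for 7 is [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]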
So in your question, if you wish to see the original image, you should be able to view it before reshaping the training examples, with something like image(mnist$test$x[1, , ]).
Hope this helps!!
y_train are the labels and x_train is the training data, so images in this example. You need to use some kind of plotting library to plot the x's. In this example you are probably not expected to input your own drawings; if you want to, you would need to preprocess them in the same way as MNIST and pass them to the model.
I have a dataset of the following form: A series of M observations of N-dimensional data. In order to obtain latent factors from this data, I wish to make a single hidden-layer autoencoder trained on this data. Every dimension of a single observation is either a 0 or a 1. But the keras Model returns floats. Is there a way to add a layer to enforce a 0 or 1 as output?
I tried using a simple keras Model to solve this problem. It claims good accuracy on the data, but when looking at the raw data it predicts the 0's correctly and often completely ignores the 1's.
import tensorflow as tf

n_nodes = 50

# Single-hidden-layer autoencoder: input -> dense bottleneck -> sigmoid reconstruction
input_1 = tf.keras.layers.Input(shape=(x_train.shape[1],))
x = tf.keras.layers.Dense(n_nodes, activation='relu')(input_1)
output_1 = tf.keras.layers.Dense(x_train.shape[1], activation='sigmoid')(x)

model = tf.keras.models.Model(input_1, output_1)
my_optimizer = tf.keras.optimizers.RMSprop()
my_optimizer.lr = 0.002
model.compile(optimizer=my_optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10000)
predictions = model.predict(x_test)
I then validate these predictions by looking at all experiments and checking whether a large (>0.1) value is returned for the elements which are 1. The performance on the 1's is very poor.
I have seen that the loss converges at around 10000 epochs. However, the autoencoder fails to properly predict almost all the 1's in the data set. Even when setting the width of the hidden layer to be identical to the dimensionality of the data (n_nodes = x_train.shape[1]) the autoencoder still performs badly, and it gets even worse if I increase the width of the hidden layer.
[0, 1] outputs should generally be rounded such that >=0.5 rounds to 1 when outputting a final prediction and <0.5 rounds to 0. However your labels should be float values {0.0, 1.0} for the loss function (which I expect they are already). You can compute accuracy by rounding the outputs and comparing to your binary labels to count errors for {0, 1}, but they must be in continuous form [0.0, 1.0] for the loss and gradient calculations to work.
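For example, a minimal sketch using the predictions array from your code (assuming x_test holds the binary targets you are reconstructing):

import numpy as np

# Hard {0, 1} predictions for evaluation only; the raw continuous outputs
# are what the loss/gradient computation must see during training.
binary_predictions = (predictions >= 0.5).astype(int)
accuracy = (binary_predictions == x_test).mean()
print(accuracy)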
If you are doing all of that (and it does appear that things are set up correctly in your code), there might be a number of reasons for poor performance:
Your dense, "constriction" layer should be significantly smaller than your input. In making it smaller you are forcing the auto-encoder to learn a representative form of the input that can be used to produce the output. This representative form is likely to generalize well. If you increase the size of your hidden layer the network will have much more capacity to memorize the inputs.
You might have many more 0 values than 1 values. If that is the case, then in the absence of actual learning the network could get stuck just predicting 0 as a "best guess" because that's "usually right". This is a harder problem to tackle. You might consider multiplying the loss by a vector of labels * eta + 1, which would effectively increase the learning rate for the ones labels. Example: your labels are [0, 1, 0] and eta is a hyper-parameter value >1, let's say eta=2.0. Then labels * eta + 1 = [1.0, 3.0, 1.0], which scales up the gradient signal for the 1 values by increasing the loss for the 1's only (see the sketch after these suggestions). This isn't a bulletproof method of increasing the importance of the 1's class, but it's something simple to try. If it makes any improvement then follow up on this line of reasoning in more detail.
You have only 1 hidden layer, which limits how much non-linearity the encoder can capture; you might try 3 hidden layers to add a little more. Your center layer should be fairly small, try something like 5 or 10 neurons; it should need to squeeze the data through a fairly tight constriction point to extract a general-purpose representation.
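As a sketch of the loss-scaling idea from the second suggestion above (eta and the function name are illustrative; note this uses element-wise binary cross-entropy rather than the categorical_crossentropy from your code, since each output dimension is an independent 0/1 value):

import tensorflow as tf
from tensorflow.keras import backend as K

eta = 2.0  # hypothetical hyper-parameter; values > 1 upweight the 1 labels

def weighted_bce(y_true, y_pred):
    # Element-wise cross-entropy, scaled by (y_true * eta + 1) so that
    # positions where the label is 1 contribute more to the loss.
    per_element = K.binary_crossentropy(y_true, y_pred)
    return K.mean((y_true * eta + 1.0) * per_element, axis=-1)

model.compile(optimizer=my_optimizer, loss=weighted_bce, metrics=['accuracy'])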
I am trying to follow the udacity tutorial on tensorflow where I came across the following two lines for word embedding models:
# Look up embeddings for inputs.
embed = tf.nn.embedding_lookup(embeddings, train_dataset)
# Compute the softmax loss, using a sample of the negative labels each time.
loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(softmax_weights, softmax_biases, embed,
                               train_labels, num_sampled, vocabulary_size))
Now I understand that the second statement is for sampling negative labels. But the question is: how does it know what the negative labels are? All I am providing to the second function is the current input and its corresponding labels, along with the number of labels that I want to (negatively) sample. Isn't there a risk of sampling from the input set itself?
This is the full example: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/udacity/5_word2vec.ipynb
You can find the documentation for tf.nn.sampled_softmax_loss() here. There is even a good explanation of Candidate Sampling provided by TensorFlow here (pdf).
How does it know what the negative labels are?
TensorFlow will randomly select negative classes among all the possible classes (for you, all the possible words).
Isn't there the risk of sampling from the input set in itself?
When you want to compute the softmax probability for your true label, you compute it over the sampled subset: roughly exp(logits[true_label]) / sum(exp(logits[sampled_labels])). As the number of classes is huge (the vocabulary size), there is very little probability of sampling the true_label as a negative label.
Anyway, I think TensorFlow removes this possibility altogether when randomly sampling. (EDIT: @Alex confirms TensorFlow does this by default)
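In recent TensorFlow versions this is controlled by the remove_accidental_hits argument of tf.nn.sampled_softmax_loss, which defaults to True. A sketch with keyword arguments (note that newer versions expect labels before inputs, unlike the positional order in the question's snippet):

loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(
        weights=softmax_weights,        # [num_classes, dim]
        biases=softmax_biases,          # [num_classes]
        labels=train_labels,            # [batch_size, 1] true classes
        inputs=embed,                   # [batch_size, dim]
        num_sampled=num_sampled,
        num_classes=vocabulary_size,
        remove_accidental_hits=True))   # discount sampled classes that collide with a true label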
Candidate sampling explains how the sampled loss function is calculated:
Compute the loss function over a subset C of all possible classes L, where C = T ⋃ S, T is the set of target classes for the example, and S is a randomly chosen sample of classes.
The code you provided uses tf.nn.embedding_lookup to get the inputs [batch_size, dim] embed.
Then it uses tf.nn.sampled_softmax_loss to get the sampled loss function:
softmax_weights: A Tensor of shape [num_classes, dim].
softmax_biases: A Tensor of shape [num_classes]. The class biases.
embed: A Tensor of shape [batch_size, dim].
train_labels: A Tensor of shape [batch_size, 1]. The target classes T.
num_sampled: An int. The number of classes to randomly sample per batch; the number of classes in S.
vocabulary_size: The number of possible classes.
sampled_values: defaults to log_uniform_candidate_sampler
For one batch, the target classes are just train_labels (T). tf.nn.sampled_softmax_loss chooses num_sampled classes at random (S, drawn by default with log_uniform_candidate_sampler) to act as negative samples, and scores them against embed using the corresponding rows of softmax_weights and softmax_biases. Since the sampler draws from all vocabulary_size classes, a class that appears in train_labels[i] can be drawn again, in which case it is not really a negative label.
According to Candidate Sampling, page 2, there are different types: for NCE and negative sampling, NEG = S, which may contain a part of T; for sampled logistic and sampled softmax, NEG = S − T, i.e. T is explicitly removed.
So, depending on the sampling scheme, there can indeed be a chance of sampling from the training set itself.
I have a dataset that I split in two for training and testing a random forest classifier with scikit learn.
I have 87 classes and 344 samples. The output of predict_proba is, most of the time, a 3-dimensional array of shape (87, 344, 2) (it's actually a list of 87 numpy.ndarrays, each with shape (344, 2)).
Sometimes, when I pick a different subset of samples for training and testing, I only get a 2-dimensional array (87, 344) (though I can't work out in which cases).
My two questions are:
what do these dimensions represent? I worked out that to get a ROC AUC score, I have to take one half of the output (that is, (87, 344, 2)[:,:,1]), transpose it, and then compare it with my ground truth (essentially roc_auc_score(ground_truth, output_of_predict_proba[:,:,1].T)). But I don't understand what it really means.
why does the output change with different subsets of the data? I can't understand in which cases it returns a 3D array and in which cases a 2D one.
classifier.predict_proba() returns the class probabilities. The dimensions of the array will vary depending on how many classes there are in the subset you train on.
Are you sure the arrays you're using to fit the RF have the right shape? (n_samples, n_features) for the data and (n_samples,) for the target classes.
You should get an array Y_pred of shape (n_samples, n_classes), so (344, 87) in your case, where item i of row r is the predicted probability of class i for the sample X[r,:]. Note that sum(Y_pred[r,:]) = 1.
However, I think that if your target array Y has shape (n_samples, n_classes), where each row is all zeros except for a one corresponding to the class of the sample, then sklearn takes it as a multi-output prediction problem (considering each class independently), which I don't think is what you want. In that case, for each class and each sample, you would predict the probability of belonging to that class or not.
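A minimal sketch illustrating the two cases with toy data (array sizes here are made up):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(20, 5)                  # 20 samples, 5 features
y = rng.randint(0, 3, 20)            # 3 classes encoded as integers 0..2

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
print(clf.predict_proba(X).shape)    # (20, 3): one probability per class

# One-hot targets are treated as a multi-output problem instead:
Y_onehot = np.eye(3)[y]              # shape (20, 3)
clf2 = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, Y_onehot)
proba = clf2.predict_proba(X)        # list of 3 arrays, each of shape (20, 2)
print(len(proba), proba[0].shape)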
Finally, the output indeed depends on the training set because it depends on the number of classes present in the training set. You can get that number with the n_classes_ attribute, and the class values themselves with the classes_ attribute. See the documentation.
Hope it helps!