I am trying to understand the NCE loss function in Tensorflow. NCE loss is employed for a word2vec task, for instance:
# Look up embeddings for inputs.
embeddings = tf.Variable(
tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)
# Construct the variables for the NCE loss
nce_weights = tf.Variable(
tf.truncated_normal([vocabulary_size, embedding_size],
stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
# Compute the average NCE loss for the batch.
# tf.nce_loss automatically draws a new sample of the negative labels each
# time we evaluate the loss.
loss = tf.reduce_mean(
tf.nn.nce_loss(weights=nce_weights,
biases=nce_biases,
labels=train_labels,
inputs=embed,
num_sampled=num_sampled,
num_classes=vocabulary_size))
more details, please reference Tensorflow word2vec_basic.py
What are the input and output matrices in the NCE function?
In a word2vec model, we are interested in building representations for words. In the training process, given a slid window, every word will have two embeddings: 1) when the word is a centre word; 2) when the word is a context word. These two embeddings are called input and output vectors, respectively. (more explanations of input and output matrices)
In my opinion, the input matrix is embeddings and the output matrix is nce_weights. Is it right?
What is the final embedding?
According to a post by s0urcer also relating to nce, it says the final embedding matrix is just the input matrix. While, some others saying, the final_embedding=input_matrix+output_matrix. Which is right/more common?
Let's look at the relative code in word2vec example (examples/tutorials/word2vec).
embeddings = tf.Variable(
tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)
These two lines create embedding representations. embeddings is a matrix where each row represents a word vector. embedding_lookup is a quick way to get vectors corresponding to train_inputs. In word2vec example, train_inputs consists of some int32 number, representing the id of target words. Basically, it can be placed by hidden layer feature.
# Construct the variables for the NCE loss
nce_weights = tf.Variable(
tf.truncated_normal([vocabulary_size, embedding_size],
stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
These two lines create parameters. They will be updated by optimizer during training. We can use tf.matmul(embed, tf.transpose(nce_weights)) + nce_biases to get final output score. In other words, last inner-product layer in classification can be replaced by it.
loss = tf.reduce_mean(
tf.nn.nce_loss(weights=nce_weights, # [vocab_size, embed_size]
biases=nce_biases, # [vocab_size]
labels=train_labels, # [bs, 1]
inputs=embed, # [bs, embed_size]
num_sampled=num_sampled,
num_classes=vocabulary_size))
These lines create nce loss, #garej has given a very good explanation. num_sampled refers to the number of negative sampling in nce algorithm.
To illustrate the usage of nce, we can apply it in mnist example (examples/tutorials/mnist/mnist_deep.py) with following 2 steps:
1. Replace embed with hidden layer output. The dimension of hidden layer is 1024 and num_output is 10. Minimum value of num_sampled is 1. Remember to remove the last inner-product layer in deepnn().
y_conv, keep_prob = deepnn(x)
num_sampled = 1
vocabulary_size = 10
embedding_size = 1024
with tf.device('/cpu:0'):
embed = y_conv
# Construct the variables for the NCE loss
nce_weights = tf.Variable(
tf.truncated_normal([vocabulary_size, embedding_size],
stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
2. Create loss and compute output. After computing the output, we can use it to calculate accuracy. Note that the label here is not one-hot vector as used in softmax. Labels are the original label of training samples.
loss = tf.reduce_mean(
tf.nn.nce_loss(weights=nce_weights,
biases=nce_biases,
labels=y_idx,
inputs=embed,
num_sampled=num_sampled,
num_classes=vocabulary_size))
output = tf.matmul(y_conv, tf.transpose(nce_weights)) + nce_biases
correct_prediction = tf.equal(tf.argmax(output, 1), tf.argmax(y_, 1))
When we set num_sampled=1, the val accuracy will end at around 98.8%. And if we set num_sampled=9, we can get almost the same val accuracy as trained by softmax. But note that nce is different from softmax.
Full code of training mnist by nce can be found here. Hope it is helpful.
The embeddings Tensor is your final output matrix. It maps words to vectors. Use this in your word prediction graph.
The input matrix is a batch of centre-word : context-word pairs (train_input and train_label respectively) generated from the training text.
While the exact workings of the nce_loss op are not yet know to me, the basic idea is that it uses a single layer network (parameters nce_weights and nce_biases) to map an input vector (selected from embeddings using the embed op) to an output word, and then compares the output to the training label (an adjacent word in the training text) and also to a random sub-sample (num_sampled) of all other words in the vocab, and then modifies the input vector (stored in embeddings) and the network parameters to minimise the error.
What are the input and output matrices in the NCE function?
Take for example the skip gram model, for this sentence:
the quick brown fox jumped over the lazy dog
the input and output pairs are:
(quick, the), (quick, brown), (brown, quick), (brown, fox), ...
For more information please refer to the tutorial.
What is the final embedding?
The final embedding you should extract is usually the {w} between the input and hidden layer.
To illustrate more intuitively take a look at the following picture:
The one hot vector [0, 0, 0, 1, 0] is the input layer in the above graph, the output is the word embedding [10, 12, 19], and W(in the graph above) is the matrix in between.
For detailed explanation please read this tutorial.
1) In short, it is right in general, but just partly right for the function in question. See tutorial:
The noise-contrastive estimation loss is defined in terms of a
logistic regression model. For this, we need to define the weights and
biases for each word in the vocabulary (also called the output weights
as opposed to the input embeddings).
So inputs to the function nce_loss are output weights and a small part of input embeddings, among the other stuff.
2) 'Final' embedding (aka Word vectors, aka Vector representations of Words) is what you call input matrix. Embeddings are strings (vectors) of that matrix, corresponding to each word.
Warn In fact, this terminology is confusing because of input and output concepts usage in NN environment. Embeddings matrix is not an input to NN, as input to NN is technically an input layer. You obtain the final state of this matrix during training process. Nonetheless, this matrix should be initialised in programming, because an algorithm has to start from some random state of this matrix to gradually update it during training.
The same is true for weights - they are to be initialised also. It happens in the following line:
nce_weights = tf.Variable(
tf.truncated_normal([50000, 128], stddev=1.0 / math.sqrt(128)))
Each embedding vector can be multiplied by a vector from weights matrix (in a string to column manner). So we will get the scalar in the NN output layer. The norm of this scalar is interpreted as probability that the target word (from input layer) will be accompanied by label [or context] word corresponding to the place of scalar in output layer.
So, if we are saying about inputs (arguments) to the function, then both matrixes are such: weights and a batch sized extraction from embeddings:
tf.nn.nce_loss(weights=nce_weights, # Tensor of shape(50000, 128)
biases=nce_biases, # vector of zeros; len(128)
labels=train_labels, # labels == context words enums
inputs=embed, # Tensor of shape(128, 128)
num_sampled=num_sampled, # 64: randomly chosen negative (rare) words
num_classes=vocabulary_size)) # 50000: by construction
This nce_loss function outputs a vector of batch_size - in the TensorFlow example a shape(128,) tensor.
Then reduce_mean() reduces this result to a scalar taken the mean of those 128 values, which is in fact an objective for further minimization.
Hope this helps.
From the paper Learning word embeddings efficiently with noise-contrastive estimation:
NCE is based on the reduction of density estimationto probabilistic binary classification. The basic idea is to train a logistic regression classifier to
discriminate between samples from the data distribution and samples from some “noise” distribution
We could find out that in word embedding, the NCE is actually negative sampling. (For the difference between this two, see paper: Notes on Noise Contrastive Estimation and Negative Sampling)
Therefore, you do not need to input the noise distribution. And also from the quote, you will find out that it is actually a logistic regression: weight and bias would be the one need for logistic regression. If you are familiar with word2vec, it is just adding a bias.
Related
In trying implement the cross entropy loss with l2 regularization described here A Fast and Accurate Dependency Parser using Neural Networks, I get the error: ValueError: Cannot feed value of shape (48,) for Tensor u'Placeholder_2:0', which has shape '(48, 1) after getting a negative loss for a few steps in training. My loss is as follows:
self.loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=self.prediction, labels=self.train_labels))
+ 1e-8 * (tf.nn.l2_loss(w1)
+ tf.nn.l2_loss(w2))
and is a negative value at step 0. I believe my problem is that some labels are -1 and should be ignored as stated in the paper: "A slight
variation is that we compute the softmax probabilities
only among the feasible transitions in
practice". How would I go about ignoring these labels when calculating the loss?
This thread comes close: What is the purpose of weights and biases in tensorflow word2vec example?
But I am still missing something from my interpretation of this: https://github.com/tensorflow/tensorflow/blob/r1.2/tensorflow/examples/tutorials/word2vec/word2vec_basic.py
From what I understand, you feed the network the indices of target and context words from your dictionary.
_, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
average_loss += loss_val
The batch inputs are then looked up to return the vectors that are randomly generated at the beginning
embeddings = tf.Variable(
tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
# Look up embeddings for inputs.
embed = tf.nn.embedding_lookup(embeddings, train_inputs)
Then an optimizer adjusts the weights and biases to best predict the label as opposed to num_sampled random alternatives
loss = tf.reduce_mean(
tf.nn.nce_loss(weights=nce_weights,
biases=nce_biases,
labels=train_labels,
inputs=embed,
num_sampled=num_sampled,
num_classes=vocabulary_size))
# Construct the SGD optimizer using a learning rate of 1.0.
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
My questions are as follows:
Where do the embeddings variable get updated?. It appears to me that I could get the final result by either running the index of a word through the neural network, or by just taking the final_embeddings vectors and using that. But I do not understand where embeddings is ever changed from its random initialization.
If I were to draw this computation graph, what would it look like (or better yet, what is the best way to actually do so)?
Is this running all of the context/target pairs in the batch at once? Or one by one?
Embeddings: Embeddings is a variable. It gets updated every time you do backprop (while running optimizer with loss)
Grpah: Did you try saving the graph and displaying it in tensorboard ? Is this what you're looking for ?
Batching: Atleast in the example you linked, he is doing batch processing using the function at line 96. https://github.com/tensorflow/tensorflow/blob/r1.2/tensorflow/examples/tutorials/word2vec/word2vec_basic.py#L96
Please correct me if I misunderstood your question.
I am trying to train an autoencoder on some simulated data where an input is basically a vector with Gaussian noise applied. The code is almost exactly the same as in this example: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/autoencoder.py
The only differences are I changed the network parameters and the cost function:
n_hidden_1 = 32 # 1st layer num features
n_hidden_2 = 16 # 2nd layer num features
n_input = 149 # LunaH-Map data input (number of counts per orbit)
cost = tf.reduce_mean(-tf.reduce_sum(y_true * tf.log(y_pred), reduction_indices=[1]))
During training, the error steadily decreases down to 0.00015, but the predicted and true values are very different, e.g.
as shown in this image. In fact, the predicted y vector is almost all ones.
How is it possible to get decreasing error with very wrong predictions? Is it possible that my network is just trying to move the weights closer to log(1) so as to minimize the cross entropy cost? If so, how do I combat this?
Yes, the network simply learns to to predict 1 which reduces the loss. The cross-entropy loss you are using is categorical which is used when y_true is one-hot code (example: [0,0,1,0]) and final layer is softmax (ensures sum of all output is 1). So when y_true[idx] is 0, the loss don't care while when the y_true[idx] is 1 and y_pred[idx] is 0 there is infinite(high) loss but if its 1 then loss is again 0.
Now categorical cross-entropy loss is not suitable for autoencoders. For real valued inputs and hence outputs its mean-squared-error, which is what is used in example that you cited. But there the final activation layer is sigmoid, implicitly saying that each element of x is 0/1. So either you need to convert your data to support the same or have the last layer of decoder linear.
If you do want to use cross-entropy loss you can use binary cross-entropy
For inputs with 0,1 binary cross-entropy: tf.reduce_mean(y_true * tf.log(y_pred) + (1-y_true) * tf.log(1-y_pred)). If you work it out in both misprediction case 0-1, 1-0, the network gets infinite loss. Note again here the final layer should be softmax and elements of x should be between 0 and 1
I am starting to use tensorflow (coming from Caffe), and I am using the loss sparse_softmax_cross_entropy_with_logits. The function accepts labels like 0,1,...C-1 instead of onehot encodings. Now, I want to use a weighting depending on the class label; I know that this could be done maybe with a matrix multiplication if I use softmax_cross_entropy_with_logits (one hot encoding), Is there any way to do the same with sparse_softmax_cross_entropy_with_logits?
import tensorflow as tf
import numpy as np
np.random.seed(123)
sess = tf.InteractiveSession()
# let's say we have the logits and labels of a batch of size 6 with 5 classes
logits = tf.constant(np.random.randint(0, 10, 30).reshape(6, 5), dtype=tf.float32)
labels = tf.constant(np.random.randint(0, 5, 6), dtype=tf.int32)
# specify some class weightings
class_weights = tf.constant([0.3, 0.1, 0.2, 0.3, 0.1])
# specify the weights for each sample in the batch (without having to compute the onehot label matrix)
weights = tf.gather(class_weights, labels)
# compute the loss
tf.losses.sparse_softmax_cross_entropy(labels, logits, weights).eval()
Specifically for binary classification, there is weighted_cross_entropy_with_logits, that computes weighted softmax cross entropy.
sparse_softmax_cross_entropy_with_logits is tailed for a high-efficient non-weighted operation (see SparseSoftmaxXentWithLogitsOp which uses SparseXentEigenImpl under the hood), so it's not "pluggable".
In multi-class case, your option is either switch to one-hot encoding or use tf.losses.sparse_softmax_cross_entropy loss function in a hacky way, as already suggested, where you will have to pass the weights depending on the labels in a current batch.
The class weights are multiplied by the logits, so that still works for sparse_softmax_cross_entropy_with_logits. Refer to this solution for "Loss function for class imbalanced binary classifier in Tensor flow."
As a side note, you can pass weights directly into sparse_softmax_cross_entropy
tf.contrib.losses.sparse_softmax_cross_entropy(logits, labels, weight=1.0, scope=None)
This method is for cross-entropy loss using
tf.nn.sparse_softmax_cross_entropy_with_logits.
Weight acts as a coefficient for the loss. If a scalar is provided, then the loss is simply scaled by the given value. If weight is a tensor of size [batch_size], then the loss weights apply to each corresponding sample.
I am trying to follow the udacity tutorial on tensorflow where I came across the following two lines for word embedding models:
# Look up embeddings for inputs.
embed = tf.nn.embedding_lookup(embeddings, train_dataset)
# Compute the softmax loss, using a sample of the negative labels each time.
loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(softmax_weights, softmax_biases,
embed, train_labels, num_sampled, vocabulary_size))
Now I understand that the second statement is for sampling negative labels. But the question is how does it know what the negative labels are? All I am providing the second function is the current input and its corresponding labels along with number of labels that I want to (negatively) sample from. Isn't there the risk of sampling from the input set in itself?
This is the full example: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/udacity/5_word2vec.ipynb
You can find the documentation for tf.nn.sampled_softmax_loss() here. There is even a good explanation of Candidate Sampling provided by TensorFlow here (pdf).
How does it know what the negative labels are?
TensorFlow will randomly select negative classes among all the possible classes (for you, all the possible words).
Isn't there the risk of sampling from the input set in itself?
When you want to compute the softmax probability for your true label, you compute: logits[true_label] / sum(logits[negative_sampled_labels]. As the number of classes is huge (the vocabulary size), there is very little probability to sample the true_label as a negative label.
Anyway, I think TensorFlow removes this possibility altogether when randomly sampling. (EDIT: #Alex confirms TensorFlow does this by default)
Candidate sampling explains how the sampled loss function is calculated:
Compute the loss function in a subset C of all training samples L, where C = T ⋃ S, T is the samples in target classes, and S is the randomly chosen samples in all classes.
The code you provided uses tf.nn.embedding_lookup to get the inputs [batch_size, dim] embed.
Then it uses tf.nn.sampled_softmax_loss to get the sampled loss function:
softmax_weights: A Tensor of shape [num_classes, dim].
softmax_biases: A Tensor of shape [num_classes]. The class biases.
embed: A Tensor of shape [batch_size, dim].
train_labels: A Tensor of shape [batch_size, 1]. The target classes T.
num_sampled: An int. The number of classes to randomly sample per batch. the numbed of classes in S.
vocabulary_size: The number of possible classes.
sampled_values: default to log_uniform_candidate_sampler
For one batch, the target samples are just train_labels (T). It chooses num_sampled samples from embed randomly (S) to be negative samples.
It will uniformly sample from embed respect to the softmax_wiehgt and softmax_bias. Since embed is embeddings[train_dataset] (of shape [batch_size, embedding_size]), if embeddings[train_dataset[i]] contains train_labels[i], it might be selected back, then it is not negative label.
According to Candidate sampling page 2, there are different types. For NCE and negative sampling, NEG=S, which may contain a part of T; for sampled logistic, sampled softmax, NEG = S-T explicitly delete T.
Indeed, it might be a chance of sampling from train_ set.