Predicting next word using the language model tensorflow example - python

The TensorFlow tutorial on language models shows how to compute the probability of sentences:
probabilities = tf.nn.softmax(logits)
The comments below it also mention a way of predicting the next word instead of probabilities, but do not specify how this can be done. So how can I output a word instead of a probability using this example?
lstm = rnn_cell.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory.
state = tf.zeros([batch_size, lstm.state_size])
loss = 0.0
for current_batch_of_words in words_in_dataset:
    # The value of state is updated after processing each batch of words.
    output, state = lstm(current_batch_of_words, state)
    # The LSTM output can be used to make next word predictions
    logits = tf.matmul(output, softmax_w) + softmax_b
    probabilities = tf.nn.softmax(logits)
    loss += loss_function(probabilities, target_words)

Your output is a TensorFlow tensor, and you can get its arg max (the predicted most probable class) with a TensorFlow function. This is normally the tensor that contains the next word's probabilities.
In the "Evaluate the Model" section of this page, your output tensor is y in the following example:
First we'll figure out where we predicted the correct label. tf.argmax
is an extremely useful function which gives you the index of the
highest entry in a tensor along some axis. For example, tf.argmax(y,1)
is the label our model thinks is most likely for each input, while
tf.argmax(y_,1) is the true label. We can use tf.equal to check if our
prediction matches the truth.
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
A different approach is to work with pre-vectorized (embedded/encoded) words. You could embed your words with Word2vec to accelerate learning; you might want to take a look at this. Each word is then represented as a point in, say, a 300-dimensional space of meaning, and you can automatically find the N words closest to the point the network predicts at its output. In that case y no longer holds one feature per word but a word embedding, whose dimensionality ranges from about 100 to 2000 depending on the model. The argmax approach no longer works; you could instead compare the prediction to the candidate words by cosine similarity, although I am not sure whether that could cause numerical instabilities. Googling something like "man woman queen word addition word2vec" will help you understand embeddings better.
Note: when I talk about word2vec here, I mean using an external pre-trained word2vec model so that your network only sees pre-embedded inputs and produces embeddings as outputs. The words corresponding to those outputs can then be recovered with word2vec by looking up the most similar words.
Note that the approach I suggest is strict, since it only tells us whether we predicted EXACTLY the word we wanted to predict. For a softer measure, you could use ROUGE or BLEU metrics to evaluate your model if you predict sentences or anything longer than a word.
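The nearest-word lookup described above can be sketched in plain Python. This is only an illustration under stated assumptions: the three-dimensional toy_embeddings table and the predicted_vector are made up, and a real setup would query a pre-trained word2vec model instead. The zero-norm guard is one simple way to avoid a division by zero in the cosine similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # guard against division by zero
    return dot / (norm_a * norm_b)

def nearest_words(predicted, embeddings, n=2):
    """Return the n words whose embeddings are most similar to the predicted vector."""
    ranked = sorted(
        embeddings,
        key=lambda word: cosine_similarity(predicted, embeddings[word]),
        reverse=True,
    )
    return ranked[:n]

# Hypothetical toy embedding table (real word2vec vectors have 100+ dimensions).
toy_embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "apple": [0.1, 0.9, 0.0],
}

# Hypothetical network output: a point in embedding space, not a probability vector.
predicted_vector = [0.8, 0.2, 0.9]
top = nearest_words(predicted_vector, toy_embeddings, n=2)
print(top)
```

Because the ranking uses the angle between vectors rather than their length, the scale of the network output does not change which words come out on top.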

You need to find the argmax of the probabilities, and translate the index back to a word by reversing the word_to_id map. To get this to work, you must save the probabilities in the model and then fetch them from the run_epoch function (you could also save just the argmax itself). Here's a snippet:
import numpy as np
# map each id back to its word
inverseDictionary = dict(zip(word_to_id.values(), word_to_id.keys()))
def run_epoch(...):
    # index of the largest logit, i.e. the id of the most likely next word
    decodedWordId = int(np.argmax(logits))
    print(" ".join([inverseDictionary[int(x1)] for x1 in np.nditer(x)])
          + " got: " + inverseDictionary[decodedWordId]
          + " expected: " + inverseDictionary[int(y)])
See full implementation here:

It is actually an advantage that the function returns a probability instead of the word itself. Since it is using a list of words, with the associated probabilities, you can do further processing, and increase the accuracy of your result.
To answer your question:
You can take the list of words, iterate through it, and have the program display the word with the highest probability.
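As a minimal sketch of that last step, assuming the probabilities for one position have already been fetched as a plain list: the tiny word_to_id vocabulary and the probs values below are made up for illustration, and top_k_words is a hypothetical helper showing the kind of further processing that keeping the probabilities allows.

```python
# Hypothetical vocabulary; in the tutorial this map comes from the data reader.
word_to_id = {"the": 0, "cat": 1, "sat": 2, "<eos>": 3}
id_to_word = {idx: word for word, idx in word_to_id.items()}

def predict_next_word(probabilities):
    """Return the word whose probability is highest (the argmax)."""
    best_id = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return id_to_word[best_id]

def top_k_words(probabilities, k):
    """Return the k most probable (word, probability) pairs, best first."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    return [(id_to_word[i], probabilities[i]) for i in ranked[:k]]

# Made-up softmax output for one position, indexed by word id.
probs = [0.1, 0.2, 0.6, 0.1]
next_word = predict_next_word(probs)
print(next_word)              # sat
print(top_k_words(probs, 2))  # [('sat', 0.6), ('cat', 0.2)]
```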


inverse ratio between weights and frequent words when using tfidf?

I have an elementary doubt about the logistic regression model and the vectorized corpora with Tfidf.
Suppose I have a corpus where the word "need" appears in 10 documents. So the value of the "need" column in the Tfidf matrix will be less than if the word only appeared in 5 documents.
The point is that, when adjusting its parameters, the model will have to give a greater weight to the "need" feature to compensate for the low value in the input. And finally, a word with little relevance in the corpus will have a very high weight, precisely due to its low relevance.
To test this I used this code. You'll see that if you add occurrences of "need" when instantiating the model with vectorizer.fit_transform, the value of the "need" column in the tfidf array goes down, and the final weight goes up.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(use_idf=True,stop_words=[])
vectorizer.fit_transform(["he need to get a car","you need to get a car","she need to get a car","i need a beer","please give me a beer"])
vector = vectorizer.transform(["he need to get a car","you need to get a car","she need to get a car","i love you"])
df = pd.DataFrame(vector.todense().tolist(),columns=vectorizer.get_feature_names_out())
from sklearn.linear_model import LogisticRegression
scikit_log_reg = LogisticRegression(verbose=1, solver='liblinear',random_state=0, C=5, penalty='l2',max_iter=1000)
model = scikit_log_reg.fit(vector, ["TRUE","TRUE","TRUE","FALSE"])
weights_df = pd.DataFrame(model.coef_,columns=vectorizer.get_feature_names_out())
Is my reasoning correct? Is it okay for it to work like this?
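The first half of this reasoning can be checked by hand. The sketch below uses the smoothed idf formula that scikit-learn documents for TfidfVectorizer with smooth_idf=True, idf = ln((1 + n) / (1 + df)) + 1. It assumes a made-up corpus of 20 documents and deliberately ignores the L2 row normalization and the regularization penalty, both of which influence the weights a real model ends up with. The "compensating weight" part is plain arithmetic on weight * value, not a claim about what the solver actually converges to.

```python
import math

def smoothed_idf(n_documents, document_frequency):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1.0

n_docs = 20  # hypothetical corpus size
idf_in_5_docs = smoothed_idf(n_docs, 5)
idf_in_10_docs = smoothed_idf(n_docs, 10)

# A feature contributes weight * value to the logit. To reach the same
# contribution with a smaller input value, the weight has to be larger.
target_contribution = 2.0
weight_for_rare = target_contribution / idf_in_5_docs
weight_for_common = target_contribution / idf_in_10_docs

print(idf_in_5_docs > idf_in_10_docs)       # the more frequent word gets the lower idf
print(weight_for_common > weight_for_rare)  # and needs the larger weight to compensate
```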

Using pretrained word embeddings to classify "pools" of words

I have seen many papers explaining the use of pretrained word embeddings (such as Word2Vec or Fasttext) for sentence sentiment classification using CNNs (like Yoon Kim's paper). However, these classifiers also account for the order in which the words appear.
My application of word embeddings is to predict the class of "pools" of words. For example, in the following list of lists
example = [["red", "blue", "green", "orange"], ["bear", "horse", "cow"], ["brown", "pink"]]
The order of the words doesn't matter, but I want to classify the sublists into either class of color or animal.
Are there any prebuilt Keras implementations of this, or any papers you could point me to which address this type of classification problem based on pretrained word embeddings?
I am sorry if this is off-topic in this forum. If so, please let me know where would be a better place to post it.
The key point in creating that classifier would be to avoid any bias from the order of words in the list. A naive LSTM solution would just look at the first or last few words and try to classify; this effect could be reduced by feeding a different permutation of each list every time. Perhaps a simpler approach might be:
from keras.models import Model
from keras.layers import Input, Dense, TimeDistributed, Lambda
from keras import backend as K

# unknown number of words per list, each a 300-dimensional word2vec vector
inputs = Input(shape=(None, 300))
# some feature extraction per word
latent = TimeDistributed(Dense(latent_dim, activation='relu'))(inputs)
latent = TimeDistributed(Dense(latent_dim, activation='relu'))(latent)
# sum over the word axis (axis 1), so the order of the words cannot matter
pooled = Lambda(lambda x: K.sum(x, axis=1))(latent)
out = Dense(num_classes, activation='softmax')(pooled)
model = Model(inputs, out)
model.compile(loss='categorical_crossentropy', optimizer='sgd')
Here the reduced sum avoids any ordering bias: if a majority of words express similar features of a certain class, the sum will also lean towards that class.
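The order-independence of the summed representation can be checked without Keras. This is a small sketch with made-up two-dimensional word vectors and a hypothetical fixed dense layer; it only illustrates that applying the same per-word transformation and then summing over the words gives the same result for any permutation of the list.

```python
def dense_relu(vector, weights, bias):
    """One dense layer with ReLU; weights holds one row per output unit."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, vector)) + b)
        for row, b in zip(weights, bias)
    ]

def sum_pool(word_vectors, weights, bias):
    """Apply the same dense layer to every word, then sum over the word axis."""
    features = [dense_relu(v, weights, bias) for v in word_vectors]
    return [sum(column) for column in zip(*features)]

# Made-up embeddings for a pool of three words, and a made-up layer.
words = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
weights = [[1.0, -1.0], [0.5, 0.5]]
bias = [0.0, 0.0]

pooled = sum_pool(words, weights, bias)
pooled_reversed = sum_pool(list(reversed(words)), weights, bias)
print(pooled)           # [3.0, 3.5]
print(pooled_reversed)  # [3.0, 3.5]
```

With arbitrary floating-point inputs the two sums may differ in the last bits because addition order changes, so compare with a tolerance rather than with exact equality.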

Using sample_weight in Keras for sequence labelling

I am working on a sequential labeling problem with unbalanced classes and I would like to use sample_weight to resolve the unbalance issue. Basically if I train the model for about 10 epochs, I get great results. If I train for more epochs, val_loss keeps dropping, but I get worse results. I'm guessing the model just detects more of the dominant class to the detriment of the smaller classes.
The model has two inputs, for word embeddings and character embeddings, and the output is one of 7 possible classes from 0 to 6.
With the padding, the shape of my input layer for word embeddings is (3000, 150) and the input layer for character embeddings is (3000, 150, 15). I use a 0.3 split for testing and training data, which means X_train for word embeddings is (2000, 150) and (2000, 150, 15) for char embeddings. y contains the correct class for each word, encoded in a one-hot vector of dimension 7, so its shape is (3000, 150, 7). y is likewise split into a training and testing set. Each input is then fed into a Bidirectional LSTM.
The output is a matrix with one of the 7 categories assigned for each word of the 2000 training samples, so the size is (2000, 150, 7).
At first, I simply tried to define sample_weight as an np.array of length 7 containing the weights for each class:
from collections import Counter
import numpy as np

# position of the 1 in every one-hot label vector, i.e. the class id of every word
count = [list(array).index(1) for arrays in y for array in arrays]
count = dict(Counter(count))
count[0] = 0
total = sum([count[key] for key in count])
count = {key: count[key] / total for key in count}
category_weights = np.zeros(7)
for f in count:
    category_weights[f] = count[f]
But I get the following error ValueError: Found a sample_weight array with shape (7,) for an input with shape (2000, 150, 7). sample_weight cannot be broadcast.
Looking at the docs, it looks like I should instead be passing a 2D array with shape (samples, sequence_length). So I create a (3000, 150) array with a concatenation of the weights of every word of each sequence:
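weights = []
for sample in y:
    current_weight = []
    for line in sample:
        # weight of the class that this word's one-hot vector encodes
        current_weight.append(category_weights[list(line).index(1)])
    weights.append(current_weight)
weights = np.array(weights)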
and pass that to the fit function through the sample_weight parameter after having added the sample_weight_mode="temporal" option in compile().
I first got an error telling me the dimension was wrong, however after generating the weights for only the training sample, I end up with a (2000, 150) array that I can use to fit my model.
Is this a proper way to define sample_weights or am I doing it all wrong ? I can't say I've noticed any improvements from adding the weights, so I must have missed something.
I think you are confusing sample_weights and class_weights. Checking the docs a bit we can see the differences between them:
sample_weights is used to provide a weight for each training sample. That means that you should pass a 1D array with the same number of elements as your training samples (indicating the weight for each of those samples). In case you are using temporal data you may instead pass a 2D array, enabling you to give weight to each timestep of each sample.
class_weights is used to provide a weight or bias for each output class. This means you should pass a weight for each class that you are trying to classify. Furthermore, this parameter expects a dictionary to be passed to it (not an array, that is why you got that error). For example consider this situation:
class_weight = {0 : 1. , 1: 50.}
In this case (a binary classification problem) you are giving 50 times as much weight (or "relevance") to your samples of class 1 compared to class 0. This way you can compensate for imbalanced datasets. Here is another useful post explaining more about this and other options to consider when dealing with imbalanced datasets.
If I train for more epochs, val_loss keeps dropping, but I get worse results.
Probably you are over-fitting, and something that may be contributing to that is the imbalanced classes your dataset has, as you correctly suspected. Compensating the class weights should help mitigate this, however there may still be other factors that can cause over-fitting that escape the scope of this question/answer (so make sure to watch out for those after solving this question).
Judging by your post, it seems to me that what you need is class_weight to balance your dataset for training, for which you will need to pass a dictionary indicating the weight ratios between your 7 classes. Consider using sample_weight only if you want to give each sample a custom weight.
If you want a more detailed comparison between those two consider checking this answer I posted on a related question. Spoiler: sample_weight overrides class_weight, so you have to use one or the other, but not both, so be careful with not mixing them.
Update: As of the moment of this edit (March 27, 2020), looking at the source code of training_utils.standardize_weights() we can see that it now supports both class_weights and sample_weights:
Everything gets normalized to a single sample-wise (or timestep-wise)
weight array. If both sample_weights and class_weights are provided,
the weights are multiplied together.
I searched online for the same question and I did have good accuracy improvement after using sample_weight correctly in my case.
I think your understanding is correct and the procedure is also correct. One possible reason that you don't see improvements in your case is that, in sample_weight, a higher value means a higher weight, so you cannot use the class frequency directly. You might consider using the inverted frequency instead:
total = sum([count[key] for key in count])
count = {key: count[key] / total for key in count}
category_weights = np.zeros(7)
for f in count:
    category_weights[f] = 1 - count[f]
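Putting the pieces of this thread together, here is a minimal sketch of how the per-timestep weight array could be built from inverted class frequencies. It makes several assumptions that are not in the posts: labels are given as integer class ids rather than one-hot vectors, class 0 is treated as padding and gets weight 0, and the helper names are hypothetical. The result has the (samples, sequence_length) shape that sample_weight expects in temporal mode; wrap it in np.array before passing it to fit.

```python
from collections import Counter

def inverse_frequency_weights(label_sequences, num_classes, padding_class=0):
    """Weight each class by 1 - (its share of the non-padding labels)."""
    counts = Counter(
        label for seq in label_sequences for label in seq
        if label != padding_class
    )
    total = sum(counts.values())
    weights = [0.0] * num_classes  # padding and unseen classes keep weight 0
    for cls, n in counts.items():
        weights[cls] = 1.0 - n / total
    return weights

def temporal_sample_weights(label_sequences, class_weights):
    """One weight per timestep: shape (samples, sequence_length)."""
    return [[class_weights[label] for label in seq] for seq in label_sequences]

# Made-up data: 2 padded sequences of length 4, classes 0 (padding), 1 and 2.
labels = [[1, 1, 1, 2], [1, 2, 0, 0]]
class_weights = inverse_frequency_weights(labels, num_classes=3)
sample_weights = temporal_sample_weights(labels, class_weights)
print(class_weights)
print(sample_weights)
```

In this toy data class 1 is twice as frequent as class 2, so it ends up with the smaller weight, which is the direction the answer above argues for.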

Use trained discriminator in GAN to calculate probabilities

I followed this tutorial on GAN -
I want to use the trained discriminator to calculate probabilities for test images (I trained on images that represent a certain set, and want to check the probability that a test image resembles that set). I used the following code (after reloading the model):
newP=, feed_dict={x_placeholder: dataset2})
print("prob: " + str(newP))
But it is not giving probabilities, just some seemingly random floats > 1. How can I use the trained discriminator to find probabilities?
Use prob = tf.nn.sigmoid(Dx) for your probabilities. Dx outputs a single value per image, which the sigmoid maps into the range 0-1. Softmax is not an option here: for a single output it is always 1 (exp(Dx)/exp(Dx) = 1).
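The difference is easy to see on a few made-up discriminator outputs. The sketch below uses plain Python instead of TensorFlow, and the logits values are invented for illustration; it only shows that a sigmoid gives a usable probability per image, while a softmax taken over a single value is always 1.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function, maps any real number into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def softmax(values):
    """Softmax over a list, shifted by the maximum for stability."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw discriminator outputs, one value per test image.
logits = [3.2, -1.5, 0.0]

sigmoid_probs = [sigmoid(z) for z in logits]
softmax_probs = [softmax([z])[0] for z in logits]  # softmax over ONE value each

print(sigmoid_probs)
print(softmax_probs)  # [1.0, 1.0, 1.0]
```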

Tensorflow negative sampling

I am trying to follow the udacity tutorial on tensorflow where I came across the following two lines for word embedding models:
# Look up embeddings for inputs.
embed = tf.nn.embedding_lookup(embeddings, train_dataset)
# Compute the softmax loss, using a sample of the negative labels each time.
loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(softmax_weights, softmax_biases,
                                                 embed, train_labels, num_sampled, vocabulary_size))
Now I understand that the second statement is for sampling negative labels. But the question is how does it know what the negative labels are? All I am providing the second function is the current input and its corresponding labels along with number of labels that I want to (negatively) sample from. Isn't there the risk of sampling from the input set in itself?
This is the full example:
You can find the documentation for tf.nn.sampled_softmax_loss() here. There is even a good explanation of Candidate Sampling provided by TensorFlow here (pdf).
How does it know what the negative labels are?
TensorFlow will randomly select negative classes among all the possible classes (for you, all the possible words).
Isn't there the risk of sampling from the input set in itself?
When you want to compute the softmax probability for your true label, you compute: exp(logits[true_label]) / sum(exp(logits[c]) for c in [true_label] + negative_sampled_labels). As the number of classes is huge (the vocabulary size), there is very little probability of sampling the true_label as a negative label.
Anyway, I think TensorFlow removes this possibility altogether when randomly sampling. (EDIT: @Alex confirms TensorFlow does this by default)
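To make the computation above concrete, here is a simplified sketch in plain Python. It is not what tf.nn.sampled_softmax_loss does internally (the real function also corrects the logits for the sampling probabilities); it only shows a softmax restricted to the true label plus the sampled negatives, with the true label dropped from the negatives if it was drawn by accident. The logits values and the function name are made up.

```python
import math

def sampled_softmax_probability(logits, true_label, negative_labels):
    """Softmax probability of true_label over {true_label} plus the sampled negatives."""
    # Drop duplicates and accidental hits on the true label.
    negatives = sorted(set(negative_labels) - {true_label})
    candidates = [true_label] + negatives
    m = max(logits[c] for c in candidates)  # shift for numerical stability
    exps = {c: math.exp(logits[c] - m) for c in candidates}
    return exps[true_label] / sum(exps.values())

# Made-up logits for a 4-word vocabulary; the true next word has id 0.
logits = [2.0, 1.0, 0.0, -1.0]

p_sampled = sampled_softmax_probability(logits, true_label=0, negative_labels=[2, 3])
p_full = sampled_softmax_probability(logits, true_label=0, negative_labels=[1, 2, 3])
p_with_hit = sampled_softmax_probability(logits, true_label=0, negative_labels=[0, 2, 3])
print(p_sampled, p_full, p_with_hit)
```

Using every other class as a negative recovers the ordinary softmax, which is why p_full is the smaller of the two values here.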
Candidate sampling explains how the sampled loss function is calculated:
Compute the loss function over a subset C of all classes L, where C = T ⋃ S, T is the set of target classes, and S is a randomly chosen set of classes.
The code you provided uses tf.nn.embedding_lookup to get the inputs embed, of shape [batch_size, dim].
Then it uses tf.nn.sampled_softmax_loss to get the sampled loss function:
softmax_weights: A Tensor of shape [num_classes, dim].
softmax_biases: A Tensor of shape [num_classes]. The class biases.
embed: A Tensor of shape [batch_size, dim].
train_labels: A Tensor of shape [batch_size, 1]. The target classes T.
num_sampled: An int. The number of classes to randomly sample per batch, i.e. the number of classes in S.
vocabulary_size: The number of possible classes.
sampled_values: default to log_uniform_candidate_sampler
For one batch, the target classes are just train_labels (T). The function then randomly draws num_sampled class ids from the vocabulary (S) to serve as negative samples.
The draw is made by the candidate sampler (log-uniform by default, as listed above) over all vocabulary_size classes, not over embed, so nothing prevents a drawn id from being equal to train_labels[i]; if that happens, the sampled class is not really a negative label.
According to page 2 of the Candidate Sampling document, the variants differ on this point. For NCE and negative sampling, NEG = S, which may contain part of T; for sampled logistic and sampled softmax, NEG = S - T, which explicitly removes T.
So there is indeed a chance of sampling from the train_ set.
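The difference between NEG = S and NEG = S - T can be sketched in a few lines. This is a simplified stand-in, not TensorFlow's implementation: it draws with replacement from a log-uniform (Zipfian) distribution, P(k) = (log(k + 2) - log(k + 1)) / log(range_max + 1), and filters out accidental hits from the list, whereas the real op samples unique classes by default and handles hits by masking their logits. The function names, seed, and sizes are assumptions chosen for illustration.

```python
import math
import random

def log_uniform_sample(range_max, rng):
    """Draw a class id in [0, range_max); small ids (frequent words) are more likely."""
    value = int(math.exp(rng.random() * math.log(range_max + 1))) - 1
    return min(max(value, 0), range_max - 1)  # clamp against rounding at the edges

def draw_negatives(true_label, num_sampled, vocabulary_size, rng,
                   remove_accidental_hits=True):
    """Sample S; optionally return S - T by dropping draws equal to the true label."""
    sampled = [log_uniform_sample(vocabulary_size, rng) for _ in range(num_sampled)]
    if remove_accidental_hits:
        sampled = [c for c in sampled if c != true_label]
    return sampled

true_label = 0  # the most frequent class, so accidental hits are likely

# Same seed twice, so both calls see the same underlying draws.
raw = draw_negatives(true_label, 64, 50, random.Random(0),
                     remove_accidental_hits=False)  # NEG = S
cleaned = draw_negatives(true_label, 64, 50, random.Random(0))  # NEG = S - T

print(raw.count(true_label), "accidental hits in S")
print(true_label in cleaned)  # False
```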

