Keras one hot embedding before LSTM

Suppose I have a training dataset of several sequences with padded length = 40 and a dictionary length of 80, e.g., example = [0, 0, 0, 3, 4, 9, 22, ...], and I want to feed that into an LSTM layer. What I want to do is apply a one-hot encoder to the sequences, e.g., example_after_one_hot.shape = (40, 80). Is there a Keras layer that is able to do this? I have tried Embedding; however, it seems that it is not a one-hot encoding.
Edit: another way is to use an Embedding layer. Given that the dictionary only contains 80 different keys, how should I set the output of the Embedding layer?

I think you're looking for a pre-processing task, not something that is strictly part of your network.
Keras has a one-hot text pre-processing function that may be able to help you. Take a look at Keras text preprocessing. If this doesn't fit your needs, it's fairly easy to pre-process it yourself with numpy. You can do something like...
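import numpy
# one slot per (sentence, position, dictionary entry)
X = numpy.zeros(shape=(len(sentences), 40, 80), dtype='float32')
for i, sent in enumerate(sentences):
    for j, word in enumerate(sent):
        # mark the dictionary index of the j-th word of sentence i
        X[i, j, word] = 1.0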
This will give you a one-hot encoding for a 2D-array of "sentences", where each word in the array is an integer less than 80. Of course the data doesn't have to be sentences, it can be any type of data.
Note that Embedding layers are for learning a distributed representation of the data, not for putting data in a one-hot format.
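The loop above can also be written as a single indexing operation. This is only a sketch with a made-up toy batch (the random `sequences` array is an assumption, not data from the question); it relies on the fact that row i of an identity matrix is the one-hot vector for id i.

```python
import numpy as np

vocab_size = 80  # dictionary length from the question
seq_len = 40     # padded sequence length from the question

# Hypothetical toy batch: 3 padded sequences of integer ids in [0, vocab_size).
rng = np.random.default_rng(0)
sequences = rng.integers(0, vocab_size, size=(3, seq_len))

# Row i of the identity matrix is the one-hot vector for id i,
# so indexing it with the id array encodes the whole batch at once.
one_hot = np.eye(vocab_size, dtype="float32")[sequences]

print(one_hot.shape)  # (3, 40, 80)
```

Note that the padding id 0 is encoded like any other id here, so padded positions become a one-hot vector with a 1 in column 0.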

Related

Add sequential features to 1D CNN classification model

I am building a 1D CNN model using Keras for text classification where the input is a sequence of words generated by tokenizer.texts_to_sequences. Is there a way to also feed in a sequence of numerical features (e.g. a score) for each word in the sequence? For example, for sentence 1 the input would be ['the', 'dog', 'barked'] and each word in this particular sequence has the scores [0.9, 0.75, 0.6]. The scores are not word specific, but sentence specific scores of the words (if that makes a difference for how to format the input). Would an LSTM be more appropriate in this case?
Many thanks in advance!

Yes, just use 2 channels in the input tensor.
In better terms, if your input previously had shape: (batch_size, seq_len)
Now you could have: (batch_size, seq_len, 2)
If you look at the Keras documentation, you see that the data_format parameter takes a string, either channels_last (default) or channels_first. In this case the default is fine, because the 2 (the number of channels) comes last.
You can just stack the 2 input arrays into a tensor with this shape.
Now if you use a word embedding probably the number of channels will not be 2, but it would be embedding_dim + 1, so the final input shape would be: (batch_size, seq_len, embedding_dim + 1)
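To make the shapes concrete, here is a small NumPy sketch of the stacking step. The sizes and the random `embedded` array are made up for illustration; in a real model the embedded tensor would come from the embedding layer, and the join would likely happen inside the network (for example with a concatenation layer in a two-input model).

```python
import numpy as np

batch_size, seq_len, embedding_dim = 2, 3, 4  # hypothetical toy sizes

# Stand-in for the output of an embedding lookup: one vector per word.
rng = np.random.default_rng(0)
embedded = rng.normal(size=(batch_size, seq_len, embedding_dim)).astype("float32")

# One score per word, e.g. [0.9, 0.75, 0.6] for ['the', 'dog', 'barked'].
scores = np.array([[0.9, 0.75, 0.6],
                   [0.1, 0.5, 0.3]], dtype="float32")

# Add a trailing channel axis to the scores, then join along the last axis.
features = np.concatenate([embedded, scores[..., np.newaxis]], axis=-1)

print(features.shape)  # (2, 3, 5), i.e. (batch_size, seq_len, embedding_dim + 1)
```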
In general you can also refer to this other Stack Overflow question.
In any case, both CNN 1D and LSTM could be good models... but this you need to discover yourself depending on your task, data and model constraints.
Now, as a final remark, you could even think of a model with multiple inputs: one for the word sequence and the other for the scores. See this documentation page or this random tutorial I found on the internet. You can also refer to the same SO question.

Pytorch - Token embeddings using Character level LSTM

I'm trying to train a neural network that classifies a sequence of words. Based on a paper I'm trying to replicate, I'd need to have both token-level embeddings and character-level embeddings of tokens.
For example, take this sentence:
The shop is open
I need 2 embeddings - one is the normal nn.Embedding layer for the token-level embedding (very simplified!):
[The, shop, is, open] -> nn.Embedding -> [4,3,7,2]
the other is a BiLSTM embedding on the character-level:
[[T,h,e], [s,h,o,p], [i,s], [o,p,e,n]] -> nn.LSTM -> [9,10,23,5]
Both of them produce word-level embeddings but on a different scale. I tried working out how to do this in PyTorch but I can't seem to do it. The only time I can do them both at the same time is if I pass the characters as one long sequence ([t,h,e,s,h,o,p,i,s,o,p,e,n]), but that will only produce one embedding.
If anyone could help that would be appreciated.

The only time I can do them both at the same time is if I pass the characters as one long sequence ([t,h,e,s,h,o,p,i,s,o,p,e,n])
Essentially, what you have to do is:
Split sentences into words (each sentence has (or will have) its respective nn.Embedding)
Split each word into single letters (essentially adding another dimension)
About second point
Compare word-level embeddings:
[The, shop, is, open]
This is a single example; let's assume each word is encoded with a 300-dimensional vector. So you get a shape of (1, 4, 300) (batch goes first; padding is also needed, as usual with RNNs). This data can go directly into some RNN or similar "text" model.
[[T,h,e], [s,h,o,p], [i,s], [o,p,e,n]]
In this case, we would have data of shape (assuming 50 dimensional vector for single letter) (1, 4, 4, 50). Please notice I have padded to the longest word based on length!
Such input cannot go into RNNs directly because it is 4D instead of the required 3D. But notice that each word can be treated independently (as a different sample), hence we can go for shape (4, 4, 50) (a reshape is needed), where the zeroth dimension corresponds to single words, the first to the letters contained in each word, and the last is the vector dimensionality.
For batches of data
In general, for word-level encoding it is pretty simple, as you always have (batch, timesteps, embedding).
For character level, you should form your data into a tensor of shape (batch, word_timesteps, character_timesteps, embedding), which has to be reshaped into (batch * word_timesteps, character_timesteps, embedding).
This requires some fun with padding and the size of batch grows really fast so data splitting might be needed.
Output from character level LSTM
You should get (batch * word_timesteps, network_embedding) as output (remember to take last timestep from each word!). In our case it would be (4, 50).
Given that, you can reshape this matrix into (batch, timesteps, dimension) ((1, 4, 50) in our case).
Finally, you can concatenate this embedding with the word-level embedding across the last dimension to get a (1, 4, 350) output matrix in total. You can pass this into another RNN layer or however you wish to proceed.
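The shape bookkeeping described above can be checked with plain NumPy. This is only a sketch: random arrays stand in for the nn.Embedding outputs, and `fake_char_encoder` is a hypothetical placeholder for the character-level LSTM (it just averages over letters, whereas a real model would return its final hidden state).

```python
import numpy as np

batch, words, chars = 1, 4, 4   # "The shop is open", padded to 4 letters
char_dim, word_dim = 50, 300    # embedding sizes used in the answer

rng = np.random.default_rng(0)
word_emb = rng.normal(size=(batch, words, word_dim))         # word-level embeddings
char_emb = rng.normal(size=(batch, words, chars, char_dim))  # letter-level embeddings

# Treat every word as its own sample for the character-level network.
char_flat = char_emb.reshape(batch * words, chars, char_dim)  # (4, 4, 50)

def fake_char_encoder(x):
    """Hypothetical stand-in for the character LSTM: one vector per word."""
    return x.mean(axis=1)

char_word = fake_char_encoder(char_flat)               # (4, 50)
char_word = char_word.reshape(batch, words, char_dim)  # (1, 4, 50)

# Concatenate both word-level representations along the feature axis.
combined = np.concatenate([word_emb, char_word], axis=-1)
print(combined.shape)  # (1, 4, 350)
```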
Additional points
If you wish to keep information between words for the character-level embedding, you would have to pass hidden_state to N elements in the batch (where N is the number of words in the sentence). That might make it a little harder, but it should be doable; just remember that an LSTM has an effective capacity of 100-1000 timesteps AFAIK, and with long sentences you can easily surpass this number of letters.

Variable sentence length for LSTM using word2vec as inputs on tensorflow

I am building an LSTM Model using word2vec as an input. I am using the tensorflow framework. I have finished word embedding part, but I am stuck with LSTM part.
The issue here is that I have different sentence lengths, which means that I have to either do padding or use dynamic_rnn with specified sequence length. I am struggling with both of them.
Padding.
The confusing part of padding is when I do padding. My model goes like
word_matrix=model.wv.syn0
X = tf.placeholder(tf.int32, shape)
data = tf.placeholder(tf.float32, shape)
data = tf.nn.embedding_lookup(word_matrix, X)
Then, I am feeding sequences of word indices for word_matrix into X. I am worried that if I pad zeros onto the sequences fed into X, then I would incorrectly keep feeding unnecessary input (word_matrix[0] in this case).
So, I am wondering what is the correct way of 0 padding. It would be great if you let me know how to implement it with tensorflow.
dynamic_rnn
For this, I have declared a list containing all the lengths of sentences and feed those along with X and y at the end. In this case, I cannot feed the inputs as batch though. Then, I have encountered this error (ValueError: as_list() is not defined on an unknown TensorShape.), which seems to me that sequence_length argument only accepts list? (My thoughts might be entirely incorrect though).
The following is my code for this.
X = tf.placeholder(tf.int32)
labels = tf.placeholder(tf.int32, [None, numClasses])
length = tf.placeholder(tf.int32)
data = tf.placeholder(tf.float32, [None, None, numDimensions])
data = tf.nn.embedding_lookup(word_matrix, X)
lstmCell = tf.contrib.rnn.BasicLSTMCell(lstmUnits, state_is_tuple=True)
lstmCell = tf.contrib.rnn.DropoutWrapper(cell=lstmCell, output_keep_prob=0.25)
initial_state = lstmCell.zero_state(batchSize, tf.float32)
value, _ = tf.nn.dynamic_rnn(lstmCell, data, sequence_length=length,
                             initial_state=initial_state, dtype=tf.float32)
I am really struggling with this part, so any help would be very much appreciated.
Thank you in advance.

Tensorflow does not support variable length Tensor. So when you declare a Tensor, the list/numpy array should have a uniform shape.
From your first part, what I understand is that you were already able to pad zeros in the last time steps of each sequence, which is the ideal situation. Here is how it should look for a batch size of 4, a max sequence length of 10 and 50 hidden units ->
[4,10,50] would be the size of your whole batch, but internally, it may be shaped like this when you try to visualize the paddings ->
`[[5+5pad,50],[10,50],[8+2pad,50],[9+1pad,50]]`
Each pad represents one time step holding a tensor of size 50, filled with nothing but zeros. Look at this question and this one to know more about how to pad manually.
You will use dynamic rnn for the exact reason that you do not want to compute on the padded time steps. The tf.nn.dynamic_rnn API ensures that when you pass the sequence_length argument.
For the example above, that argument will be [5,10,8,9]. You can compute it by counting the non-zero entries for each batch component. A simple way to compute that would be:
# X is the zero-padded index tensor of shape [batch, max_len]; index 0 must be reserved for padding
data_mask = tf.cast(X, tf.bool)
data_len = tf.reduce_sum(tf.cast(data_mask, tf.int32), axis=1)
and pass it in the tf.nn.dynamic_rnn api:
tf.nn.dynamic_rnn(lstmCell, data, sequence_length=data_len, initial_state=initial_state)
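The padding and length computation can be tried out in NumPy before wiring it into the graph. This is a sketch with a made-up batch (the `sentences` list is hypothetical) and it assumes index 0 is reserved for padding and never used by a real word.

```python
import numpy as np

# Hypothetical batch of word-index sequences with lengths 5, 10, 8 and 9.
# Index 0 is assumed to be reserved for padding.
sentences = [list(range(1, n + 1)) for n in (5, 10, 8, 9)]
max_len = max(len(s) for s in sentences)

# Right-pad every sequence with zeros to the common length.
padded = np.zeros((len(sentences), max_len), dtype="int32")
for i, sent in enumerate(sentences):
    padded[i, :len(sent)] = sent

# Count the non-zero entries per row to recover the true lengths.
seq_len = (padded != 0).sum(axis=1)
print(seq_len.tolist())  # [5, 10, 8, 9]
```

Both `padded` and `seq_len` can then be fed as a batch, with `seq_len` going to the sequence_length argument.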

Predicting the next word with Keras: how to retrieve prediction for each input word

I am having some problems understanding how to retrieve the predictions from a Keras model.
I want to build a simple system that predicts the next word, but I don't know how to output the complete list of probabilities for each word.
This is my code right now:
model = Sequential()
model.add(Embedding(vocab_size, embedding_size, input_length=55, weights=[pretrained_weights]))
model.add(Bidirectional(LSTM(units=embedding_size)))
model.add(Dense(23690, activation='softmax')) # 23690 is the total number of classes
model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(lr=0.0005),
              metrics=['accuracy'])
# fit network
model.fit(np.array(X_train), np.array(y_train), epochs=10)
score = model.evaluate(x=np.array(X_test), y=np.array(y_test), batch_size=32)
prediction = model.predict(np.array(X_test), batch_size=32)
First question:
Training set: list of sentences (vectorized and transformed to indices).
I saw some examples online where people divide X_train and y_train like this:
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
Should I instead transform the X_train and the y_train in order to have sliding sequences, where for example I have
X = [[10, 9, 4, 5]]
X_train = [[10, 9], [9, 4], [4, 5]]
y_train = [[9], [4], [5]]
Second question:
Right now the model returns only one element for each input. How can I return the predictions for each word? I want to be able to have an array of output words for each word, not a single output.
I read that I could use a TimeDistributed layer, but I have problems with the input, because the Embedding layer takes a 2D input, while the TimeDistributed takes a 3D input.
Thank you for the help!

For what you're asking, I don't think a Bidirectional network would be good. (The reverse direction would be trying to predict something that does not appear at the end, but before the beginning, and I believe you're going to want to take the output and make it an input and keep predicting further, right?)
So, first, remove the Bidirectional from your model, keep only the LSTM.
Keras recurrent layers may output only the last step, or, if you set return_sequences=True, output all steps.
So, the trick is adjusting both the data and the model like this:
In the LSTM layers, add return_sequences=True. (Your output will be entire sentences)
Make Y be entire sentences one step ahead of X: X,y = sequences[:,:-1], sequences[:,1:]
Just be aware that this will make your output 3D. If you're interested only in the last word, you can manually take it from the output: lastWord = outputs[:,-1]
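As a quick illustration of the shifted targets and of picking the last step, here is a NumPy sketch. The token ids come from the question; `vocab_size` and the random `outputs` array are made-up stand-ins for a real model's prediction.

```python
import numpy as np

# Hypothetical batch holding one tokenised sentence (ids from the question).
sequences = np.array([[10, 9, 4, 5]])

X = sequences[:, :-1]  # every word except the last
y = sequences[:, 1:]   # the same sentence shifted one step ahead

print(X.tolist())  # [[10, 9, 4]]
print(y.tolist())  # [[9, 4, 5]]

# With return_sequences=True the model emits one distribution per input step.
# A random array of shape (batch, steps, vocab_size) stands in for that output.
vocab_size = 12  # hypothetical
outputs = np.random.default_rng(0).random((1, X.shape[1], vocab_size))
last_word = outputs[:, -1]  # distribution for the word following the sentence
print(last_word.shape)  # (1, 12)
```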
About sliding windows: don't use them. They totally defeat the purpose of LSTMs which is learning long sequences. (Ok, this statement may be exaggerated, you might want to use sliding windows if your sequences are too long for faster training, but for sentences, you probably need to have all words of the sentence otherwise the context is lost)
About TimeDistributed layers: only use them when you want to add an extra time dimension. Since LSTMs already use a time dimension, you're ok without a TimeDistributed. If you wanted, for instance to process an entire text, and you decided to go sentence by sentence, and inside each sentence word by word, you could try something with two time dimensions.
About predicting indefinitely into the future: for that, you'd have to use stateful=True LSTM layers, and create manual loops that get the last output step and feed it as an input for taking one more step.

Using sample_weight in Keras for sequence labelling

I am working on a sequential labeling problem with unbalanced classes and I would like to use sample_weight to resolve the unbalance issue. Basically if I train the model for about 10 epochs, I get great results. If I train for more epochs, val_loss keeps dropping, but I get worse results. I'm guessing the model just detects more of the dominant class to the detriment of the smaller classes.
The model has two inputs, for word embeddings and character embeddings, and the output is one of 7 possible classes from 0 to 6.
With the padding, the shape of my input layer for word embeddings is (3000, 150) and the input layer for char embeddings is (3000, 150, 15). I use a 0.3 split for testing and training data, which means X_train for word embeddings is (2000, 150) and (2000, 150, 15) for char embeddings. y contains the correct class for each word, encoded in a one-hot vector of dimension 7, so its shape is (3000, 150, 7). y is likewise split into a training and testing set. Each input is then fed into a Bidirectional LSTM.
The output is a matrix with one of the 7 categories assigned for each word of the 2000 training samples, so the size is (2000, 150, 7).
At first, I simply tried to define sample_weight as an np.array of length 7 containing the weights for each class:
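from collections import Counter
import numpy as np
# class index of every word in every sequence (y is one-hot)
count = [list(array).index(1) for arrays in y for array in arrays]
count = dict(Counter(count))
count[0] = 0
total = sum(count[key] for key in count)
# relative frequency per class
count = {key: count[key] / total for key in count}
category_weights = np.zeros(7)
for f in count:
    category_weights[f] = count[f]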
But I get the following error ValueError: Found a sample_weight array with shape (7,) for an input with shape (2000, 150, 7). sample_weight cannot be broadcast.
Looking at the docs, it looks like I should instead be passing a 2D array with shape (samples, sequence_length). So I create a (3000, 150) array with a concatenation of the weights of every word of each sequence:
weights = []
for sample in y:
    current_weight = []
    for line in sample:
        current_weight.append(frequency[list(line).index(1)])
    weights.append(current_weight)
weights = np.array(weights)
and pass that to the fit function through the sample_weight parameter after having added the sample_weight_mode="temporal" option in compile().
I first got an error telling me the dimension was wrong, however after generating the weights for only the training sample, I end up with a (2000, 150) array that I can use to fit my model.
Is this a proper way to define sample_weight or am I doing it all wrong? I can't say I've noticed any improvements from adding the weights, so I must have missed something.

I think you are confusing sample_weight and class_weight. Checking the docs a bit we can see the differences between them:
sample_weight is used to provide a weight for each training sample. That means that you should pass a 1D array with the same number of elements as your training samples (indicating the weight for each of those samples). If you are using temporal data, you may instead pass a 2D array, enabling you to give a weight to each timestep of each sample.
class_weight is used to provide a weight or bias for each output class. This means you should pass a weight for each class that you are trying to classify. Furthermore, this parameter expects a dictionary to be passed to it (not an array, which is why you got that error). For example, consider this situation:
class_weight = {0 : 1. , 1: 50.}
In this case (a binary classification problem) you are giving 50 times as much weight (or "relevance") to your samples of class 1 compared to class 0. This way you can compensate for imbalanced datasets. Here is another useful post explaining more about this and other options to consider when dealing with imbalanced datasets.
If I train for more epochs, val_loss keeps dropping, but I get worse results.
Probably you are over-fitting, and something that may be contributing to that is the imbalanced classes your dataset has, as you correctly suspected. Compensating the class weights should help mitigate this, however there may still be other factors that can cause over-fitting that escape the scope of this question/answer (so make sure to watch out for those after solving this question).
Judging by your post, seems to me that what you need is to use class_weight to balance your dataset for training, for which you will need to pass a dictionary indicating the weight ratios between your 7 classes. Consider using sample_weight only if you want to give each sample a custom weight for consideration.
If you want a more detailed comparison between those two consider checking this answer I posted on a related question. Spoiler: sample_weight overrides class_weight, so you have to use one or the other, but not both, so be careful with not mixing them.
Update: As of the moment of this edit (March 27, 2020), looking at the source code of training_utils.standardize_weights() we can see that it now supports both class_weights and sample_weights:
Everything gets normalized to a single sample-wise (or timestep-wise)
weight array. If both sample_weights and class_weights are provided,
the weights are multiplied together.
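For the temporal case from the question, per-class weights can be turned into a (samples, timesteps) array with a single lookup. This is a sketch with made-up labels and weight values (both `labels` and `class_weights` are hypothetical); the resulting array is what would be passed as sample_weight together with sample_weight_mode="temporal", as described in the question.

```python
import numpy as np

num_classes = 7
# Hypothetical labels: 2 samples, 3 timesteps, turned into one-hot vectors.
labels = np.array([[1, 3, 0],
                   [6, 1, 1]])
y = np.eye(num_classes)[labels]  # shape (2, 3, 7)

# Hypothetical per-class weights, indexed by class id.
class_weights = np.array([0.0, 1.0, 5.0, 5.0, 2.0, 2.0, 10.0])

# Look up the weight of each timestep's true class.
sample_weight = class_weights[y.argmax(axis=-1)]
print(sample_weight.tolist())  # [[1.0, 5.0, 0.0], [10.0, 1.0, 1.0]]
```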

I searched online for the same question and I did have good accuracy improvement after using sample_weight correctly in my case.
I think your understanding is correct and the procedure is also correct. One possible reason that you don't see improvements in your case is that, when you pass in sample_weight, a higher value means a higher weight. This means that you cannot use the word count directly. You might consider using the inverted count frequency:
total = sum(count[key] for key in count)
count = {key: count[key] / total for key in count}
# allocate the array once, before the loop, so earlier entries are not reset
category_weights = np.zeros(7)
for f in count:
    category_weights[f] = 1 - count[f]
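The same inverted-frequency idea can be written without explicit loops. This is a sketch on made-up integer labels (the `labels` array is hypothetical); it keeps the 1 - frequency scheme from above rather than any other weighting formula.

```python
import numpy as np

num_classes = 7
# Hypothetical integer labels: 2 samples, 4 timesteps.
labels = np.array([[1, 1, 1, 2],
                   [1, 1, 3, 1]])

# Relative frequency of each class over all timesteps.
counts = np.bincount(labels.ravel(), minlength=num_classes)
freq = counts / counts.sum()

# Rarer classes get a weight closer to 1, frequent ones a smaller weight.
category_weights = 1.0 - freq
print(category_weights[1], category_weights[2])  # 0.25 0.875
```

Classes that never occur end up with weight 1 here, which has no effect because no timestep carries them.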
