I'm trying to train a neural network that classifies a sequence of words. Based on a paper I'm trying to replicate, I'd need to have both token-level embeddings and character-level embeddings of tokens.
For example, take this sentence:
The shop is open
I need 2 embeddings - one is the normal nn.Embedding layer for the token-level embedding (very simplified!):
[The, shop, is, open] -> nn.Embedding -> [4,3,7,2]
the other is a BiLSTM embedding on the character-level:
[[T,h,e], [s,h,o,p], [i,s], [o,p,e,n]] -> nn.LSTM -> [9,10,23,5]
Both of them produce word-level embeddings but on a different scale. I tried working out how to do this in PyTorch but I can't seem to do it. The only time I can do them both at the same time is if I pass the characters as one long sequence ([t,h,e,s,h,o,p,i,s,o,p,e,n]), but that will only produce one embedding.
If anyone could help that would be appreciated.
The only time I can do them both at the same time is if I pass the characters as one long sequence ([t,h,e,s,h,o,p,i,s,o,p,e,n])
Essentially, what you have to do is:
Split sentences into words (each sentence has (or will have) it's respective nn.Embedding)
Split each word into single letters (essentially adding another dimension)
About second point
Compare word-level embeddings:
[The, shop, is, open]
This is single example, let's assume each word is encoded with 300 dimensional vector. So you get shape of (1, 4, 300) (batch goes first, also padding as per usual with RNNs is needed). This data can go directly to some RNN or similar "text" models.
[[T,h,e], [s,h,o,p], [i,s], [o,p,e,n]]
In this case, we would have data of shape (assuming 50 dimensional vector for single letter) (1, 4, 4, 50). Please notice I have padded to the longest word based on length!
Such input cannot go into RNNs for obvious reasons (it's 4D instead of 3D as required). But one can notice, that each word can be treated independently (as different sample), hence we can go for shape (4, 4, 50) (transpose is needed), where zeroth dimension corresponds to single words, first to letters contained in that word and last is vector dimensionality.
For batches of data
In general, for word-level encoding it is pretty simple, as you always have (batch, timesteps, embedding).
For character level, you should form your data into vector of shape (batch, word_timesteps, character_timesteps, embedding), which has to be transformed into (batch * word_timesteps, character_timesteps, embedding).
This requires some fun with padding and the size of batch grows really fast so data splitting might be needed.
Output from character level LSTM
You should get (batch * word_timesteps, network_embedding) as output (remember to take last timestep from each word!). In our case it would be (4, 50).
Given that, you can reshape this matrix into (batch, timesteps, dimension) ((1, 4, 50) in our case).
Finally, you can concatenate this embedding with word-level embedding across last dimension to get (1, 4, 350) output matrix in total. You can pass this into another RNN layer or however you wish to proceed
Additional points
If you wish to keep information between words for character-level embedding, you would have to pass hidden_state to N elements in batch (where N is the number of words in sentence). That might it a little harder, but should be doable, just remember LSTM has effective capacity of 100-1000 AFAIK and with long sentences you can easily surpass this number of letters.
Related
I am building a 1D CNN model using Keras for text classification where the input is a sequence of words generated by tokenizer.texts_to_sequences. Is there a way to also feed in a sequence of numerical features (e.g. a score) for each word in the sequence? For example, for sentence 1 the input would be ['the', 'dog', 'barked'] and each word in this particular sequence has the scores [0.9, 0.75, 0.6]. The scores are not word specific, but sentence specific scores of the words (if that makes a difference for how to format the input). Would an LSTM be more appropriate in this case?
Many thanks in advance!
Yes, just use 2 channels in the input tensor.
In better terms, if you input before had shape: (batch_size, seq_len)
Now you could have: (batch_size, seq_len, 2)
If you look at the Keras documentation, you see that with the parameter data_format you pass a string, one of channels_last (default) or channels_first. In this case the default would be fine, because the 2 (number of channels is last).
You can just stack the 2 input arrays into a tensor with this shape.
Now if you use a word embedding probably the number of channels will not be 2, but it would be embedding_dim + 1, so the final input shape would be: (batch_size, seq_len, embedding_dim + 1)
In general you can also refer to this other Stack Overflow question.
In any case, both CNN 1D and LSTM could be good models... but this you need to discover yourself depending on your task, data and model constraints.
Now as a final remark, you could even think of a model with multiple inputs one the word sequence and the other the scores. See this documentation page or this random tutorial I found on the internet. You can again refer also to the same SO question.
I am reading this Kaggle notebook.
In the class DisasterDetector, in build_model(), clf_output = sequence_output[:, 0, :]
. A sigmoid activation is then applied in order to generate the model output.
The location the BertLayer was obtained from on tfhub describes the shape of sequence_output as [batch_size, max_seq_length, 768]. Why are we choosing only the first index over the max_seq_length dimension (indexing a 0)? If this corresponds to only the first token in the output sequence, and not the other tokens, why is this used in the binary classification task?
the first token of output sequence is from the first of input ,i e. [CLS].
the [CLS] is regarded as the represition of the whole input sequence.
u can read the original paper to understand it better.
I am trying to train an LSTM, followed by a Dense layer in keras with numerical input sequences of different lengths. the numbers range is [1,13]. each one of those sequences ends with the same number, 13 in my case.
I train the controller on a few sequences, use the trained model to generate a few more sequences with the same properties, add them to the training set and train the LSTM again. As this loop goes on, the LSTM predictions start converging towards the final value of each sequence.
The sequences are padded to a certain maximum length. as a result the x_train data is of the size (None, max_len-1) and y_train data is categorical data of the last element of each input sequence. in this case, every element in the y_train data is the same (one hot encoded vector for the number 13).
Is the way input and output data is structured the reason for this skewing of predictions?
Is there a way to work around it?
I am confused about how to format my own pre-trained weights for Keras Embedding layer if I'm also setting mask_zero=True. Here's a concrete toy example.
Suppose I have a vocabulary of 4 words [1,2,3,4] and am using vector weights defined by:
weight[1]=[0.1,0.2]
weight[2]=[0.3,0.4]
weight[3]=[0.5,0.6]
weight[4]=[0.7,0.8]
I want to embed sentences of length up to 5 words, so I have to zero pad them before feeding them into the Embedding layer. I want to mask out the zeros so further layers don't use them.
Reading the Keras docs for Embedding, it says the 0 value can't be in my vocabulary.
mask_zero: Whether or not the input value 0 is a special "padding"
value that should be masked out. This is useful when using recurrent
layers which may take variable length input. If this is True then all
subsequent layers in the model need to support masking or an exception
will be raised. If mask_zero is set to True, as a consequence, index 0
cannot be used in the vocabulary (input_dim should equal size of
vocabulary + 1).
So what I'm confused about is how to construct the weight array for the Embedding layer, since "index 0 cannot be used in the vocabulary." If I build the weight array as
[[0.1,0.2],
[0.3,0.4],
[0.5,0.6],
[0.7,0.8]]
then normally, word 1 would point to index 1, which in this case holds the weights for word 2. Or is it that when you specify mask_zero=True, Keras internally makes it so that word 1 points to index 0? Alternatively, do you just prepend a vector of zeros in index zero, as follows?
[[0.0,0.0],
[0.1,0.2],
[0.3,0.4],
[0.5,0.6],
[0.7,0.8]]
This second option seems to me to put the zero into the vocabulary. In other words, I'm very confused. Can anyone shed light on this?
You're second approach is correct. You will want to construct your embedding layer in the following way
embedding = Embedding(
output_dim=embedding_size,
input_dim=vocabulary_size + 1,
input_length=input_length,
mask_zero=True,
weights=[np.vstack((np.zeros((1, embedding_size)),
embedding_matrix))],
name='embedding'
)(input_layer)
where embedding_matrix is the second matrix you provided.
You can see this by looking at the implementation of keras' embedding layer. Notably, how mask_zero is only used to literally mask the inputs
def compute_mask(self, inputs, mask=None):
if not self.mask_zero:
return None
output_mask = K.not_equal(inputs, 0)
return output_mask
thus the entire kernel is still multiplied by the input, meaning all indexes are shifted up by one.
The first layer of my neural network is like this:
model.add(Conv1D(filters=40,
kernel_size=25,
input_shape=x_train.shape[1:],
activation='relu',
kernel_regularizer=regularizers.l2(5e-6),
strides=1))
if my input shape is (600,10)
i get (None, 576, 40) as output shape
if my input shape is (6000,1)
i get (None, 5976, 40) as output shape
so my question is what exactly is happening here? is the first example simply ignoring 90% of the input?
It is not "ignoring" a 90% of the input, the problem is simply that if you perform a 1-dimensional convolution with a kernel of size K over an input of size X the result of the convolution will have size X - K + 1. If you want the output to have the same size as the input, then you need to extend or "pad" your data. There are several strategies for that, such as add zeros, replicate the value at the ends or wrap around. Keras' Convolution1D has a padding parameter that you can set to "valid" (the default, no padding), "same" (add zeros at both sides of the input to obtain the same output size as the input) and "causal" (padding with zeros at one end only, idea taken from WaveNet).
Update
About the questions in your comments. So you say your input is (600, 10). That, I assume, is the size of one example, and you have a batch of examples with size (N, 600, 10). From the point of view of the convolution operation, this means you have N examples, each of with a length of at most 600 (this "length" may be time or whatever else, it's just the dimension across which the convolution works) and, at each of these 600 points, you have vectors of size 10. Each of these vectors is considered an atomic sample with 10 features (e.g. price, heigh, size, whatever), or, as is sometimes called in the context of convolution, "channels" (from the RGB channels used in 2D image convolution).
The point is, the convolution has a kernel size and a number of output channels, which is the filters parameter in Keras. In your example, what the convolution does is take every possible slice of 25 contiguous 10-vectors and produce a single 40-vector for each (that, for every example in the batch, of course). So you pass from having 10 features or channels in your input to having 40 after the convolution. It's not that it's using only one of the 10 elements in the last dimension, it's using all of them to produce the output.
If the meaning of the dimensions in your input is not what the convolution is interpreting, or if the operation it is performing is not what you were expecting, you may need to either reshape your input or use a different kind of layer.