How is BERT Layer sequence output used?

I am reading this Kaggle notebook.
In the class DisasterDetector, build_model() computes clf_output = sequence_output[:, 0, :], and a sigmoid activation is then applied to generate the model output.
The TF Hub page the BertLayer was obtained from describes the shape of sequence_output as [batch_size, max_seq_length, 768]. Why are we choosing only the first index along the max_seq_length dimension (indexing with 0)? If this corresponds to only the first token of the output sequence, and not the other tokens, why is it used for the binary classification task?

The first token of the output sequence comes from the first token of the input, i.e. [CLS].
The [CLS] token is regarded as the representation of the whole input sequence.
You can read the original paper to understand it better.
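For context, a minimal sketch of that pattern (the TF Hub URL and max_seq_length here are illustrative, not taken from the notebook):

import tensorflow as tf
import tensorflow_hub as hub

max_seq_length = 128  # illustrative value

bert_layer = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
    trainable=True)

input_ids = tf.keras.Input(shape=(max_seq_length,), dtype=tf.int32)
input_mask = tf.keras.Input(shape=(max_seq_length,), dtype=tf.int32)
segment_ids = tf.keras.Input(shape=(max_seq_length,), dtype=tf.int32)

pooled_output, sequence_output = bert_layer([input_ids, input_mask, segment_ids])
clf_output = sequence_output[:, 0, :]  # (batch_size, 768): the [CLS] vector
out = tf.keras.layers.Dense(1, activation='sigmoid')(clf_output)

model = tf.keras.Model([input_ids, input_mask, segment_ids], out)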


Add sequential features to 1D CNN classification model

I am building a 1D CNN model using Keras for text classification where the input is a sequence of words generated by tokenizer.texts_to_sequences. Is there a way to also feed in a sequence of numerical features (e.g. a score) for each word in the sequence? For example, for sentence 1 the input would be ['the', 'dog', 'barked'] and each word in this particular sequence has the scores [0.9, 0.75, 0.6]. The scores are not word specific, but sentence specific scores of the words (if that makes a difference for how to format the input). Would an LSTM be more appropriate in this case?
Many thanks in advance!
Yes, just use 2 channels in the input tensor.
In better terms, if your input before had shape (batch_size, seq_len), now you could have (batch_size, seq_len, 2).
If you look at the Keras documentation, you see that the parameter data_format takes a string, either channels_last (the default) or channels_first. In this case the default is fine, because the 2 (the number of channels) comes last.
You can just stack the 2 input arrays into a tensor with this shape.
Now if you use a word embedding, the number of channels will probably not be 2 but embedding_dim + 1, so the final input shape would be (batch_size, seq_len, embedding_dim + 1).
In general you can also refer to this other Stack Overflow question.
In any case, both a 1D CNN and an LSTM could be good models, but that is something you need to discover yourself depending on your task, data and model constraints.
Now as a final remark, you could even think of a model with multiple inputs: one for the word sequence and the other for the scores (see the sketch below). See this documentation page or this random tutorial I found on the internet. You can again refer to the same SO question.
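For illustration, a minimal Keras sketch that concatenates the word embedding with the score as one extra channel (sizes and variable names are illustrative, not from the question):

import tensorflow as tf
from tensorflow.keras import layers

vocab_size, seq_len, embedding_dim = 1000, 20, 64  # illustrative sizes

words = tf.keras.Input(shape=(seq_len,), dtype='int32')  # ids from texts_to_sequences
scores = tf.keras.Input(shape=(seq_len, 1))              # one score per word position

embedded = layers.Embedding(vocab_size, embedding_dim)(words)  # (batch, seq_len, embedding_dim)
x = layers.Concatenate(axis=-1)([embedded, scores])            # (batch, seq_len, embedding_dim + 1)
x = layers.Conv1D(128, kernel_size=3, activation='relu')(x)
x = layers.GlobalMaxPooling1D()(x)
out = layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.Model([words, scores], out)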

Pytorch - Token embeddings using Character level LSTM

I'm trying to train a neural network that classifies a sequence of words. Based on a paper I'm trying to replicate, I'd need to have both token-level embeddings and character-level embeddings of tokens.
For example, take this sentence:
The shop is open
I need 2 embeddings - one is the normal nn.Embedding layer for the token-level embedding (very simplified!):
[The, shop, is, open] -> nn.Embedding -> [4,3,7,2]
the other is a BiLSTM embedding on the character-level:
[[T,h,e], [s,h,o,p], [i,s], [o,p,e,n]] -> nn.LSTM -> [9,10,23,5]
Both of them produce word-level embeddings but on a different scale. I tried working out how to do this in PyTorch but I can't seem to do it. The only time I can do them both at the same time is if I pass the characters as one long sequence ([t,h,e,s,h,o,p,i,s,o,p,e,n]), but that will only produce one embedding.
If anyone could help that would be appreciated.
The only time I can do them both at the same time is if I pass the characters as one long sequence ([t,h,e,s,h,o,p,i,s,o,p,e,n])
Essentially, what you have to do is:
Split sentences into words (each sentence has (or will have) its respective nn.Embedding)
Split each word into single letters (essentially adding another dimension)
About the second point:
Compare word-level embeddings:
[The, shop, is, open]
This is a single example; let's assume each word is encoded with a 300 dimensional vector. So you get a shape of (1, 4, 300) (batch goes first; padding is also needed, as usual with RNNs). This data can go directly into an RNN or similar "text" models.
[[T,h,e], [s,h,o,p], [i,s], [o,p,e,n]]
In this case, we would have data of shape (1, 4, 4, 50) (assuming a 50 dimensional vector for a single letter). Please notice I have padded to the longest word!
Such input cannot go into RNNs for obvious reasons (it's 4D instead of the required 3D). But one can notice that each word can be treated independently (as a different sample), hence we can go for shape (4, 4, 50) (a reshape is needed), where the zeroth dimension corresponds to single words, the first to the letters contained in each word, and the last is the vector dimensionality.
For batches of data
In general, for word-level encoding it is pretty simple, as you always have (batch, timesteps, embedding).
For character level, you should form your data into vector of shape (batch, word_timesteps, character_timesteps, embedding), which has to be transformed into (batch * word_timesteps, character_timesteps, embedding).
This requires some fun with padding, and the batch size grows really fast, so splitting the data might be needed.
Output from character level LSTM
You should get (batch * word_timesteps, network_embedding) as output (remember to take the last timestep from each word!). In our case that would be (4, 50).
Given that, you can reshape this matrix into (batch, timesteps, dimension) ((1, 4, 50) in our case).
Finally, you can concatenate this embedding with the word-level embedding across the last dimension to get a (1, 4, 350) output matrix in total. You can pass this into another RNN layer or proceed however you wish; see the sketch below.
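A minimal PyTorch sketch of these steps, using the dimensions from above (the vocabulary and alphabet sizes are illustrative):

import torch
import torch.nn as nn

batch, words, chars = 1, 4, 4                    # padded word and character counts
word_dim, char_dim, char_hidden = 300, 50, 50    # dimensions used above

word_embedding = nn.Embedding(100, word_dim, padding_idx=0)  # illustrative vocab of 100
char_embedding = nn.Embedding(30, char_dim, padding_idx=0)   # illustrative alphabet of 30
char_lstm = nn.LSTM(char_dim, char_hidden, batch_first=True)

word_ids = torch.randint(1, 100, (batch, words))             # (1, 4)
char_ids = torch.randint(1, 30, (batch, words, chars))       # (1, 4, 4)

word_vecs = word_embedding(word_ids)                         # (1, 4, 300)

flat = char_ids.view(batch * words, chars)                   # (4, 4): words as samples
char_vecs = char_embedding(flat)                             # (4, 4, 50)
# with real padding, wrap char_vecs in pack_padded_sequence so h_n sees true lengths
_, (h_n, _) = char_lstm(char_vecs)                           # h_n: (1, 4, 50)
char_word_vecs = h_n[-1].view(batch, words, char_hidden)     # (1, 4, 50)

combined = torch.cat([word_vecs, char_word_vecs], dim=-1)    # (1, 4, 350)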
Additional points
If you wish to keep information between words for the character-level embedding, you would have to pass the hidden_state across the N elements in the batch (where N is the number of words in the sentence). That might make it a little harder, but it should be doable; just remember that an LSTM has an effective capacity of around 100-1000 timesteps AFAIK, and with long sentences you can easily surpass that number of letters.

How to correctly use mask_zero=True for Keras Embedding with pre-trained weights?

I am confused about how to format my own pre-trained weights for Keras Embedding layer if I'm also setting mask_zero=True. Here's a concrete toy example.
Suppose I have a vocabulary of 4 words [1,2,3,4] and am using vector weights defined by:
weight[1]=[0.1,0.2]
weight[2]=[0.3,0.4]
weight[3]=[0.5,0.6]
weight[4]=[0.7,0.8]
I want to embed sentences of length up to 5 words, so I have to zero pad them before feeding them into the Embedding layer. I want to mask out the zeros so further layers don't use them.
The Keras docs for Embedding say the value 0 can't be in my vocabulary:
mask_zero: Whether or not the input value 0 is a special "padding"
value that should be masked out. This is useful when using recurrent
layers which may take variable length input. If this is True then all
subsequent layers in the model need to support masking or an exception
will be raised. If mask_zero is set to True, as a consequence, index 0
cannot be used in the vocabulary (input_dim should equal size of
vocabulary + 1).
So what I'm confused about is how to construct the weight array for the Embedding layer, since "index 0 cannot be used in the vocabulary." If I build the weight array as
[[0.1,0.2],
[0.3,0.4],
[0.5,0.6],
[0.7,0.8]]
then normally, word 1 would point to index 1, which in this case holds the weights for word 2. Or is it that when you specify mask_zero=True, Keras internally makes it so that word 1 points to index 0? Alternatively, do you just prepend a vector of zeros in index zero, as follows?
[[0.0,0.0],
[0.1,0.2],
[0.3,0.4],
[0.5,0.6],
[0.7,0.8]]
This second option seems to me to put the zero into the vocabulary. In other words, I'm very confused. Can anyone shed light on this?
Your second approach is correct. You will want to construct your embedding layer in the following way:
embedding = Embedding(
    output_dim=embedding_size,
    input_dim=vocabulary_size + 1,
    input_length=input_length,
    mask_zero=True,
    weights=[np.vstack((np.zeros((1, embedding_size)),
                        embedding_matrix))],
    name='embedding'
)(input_layer)
where embedding_matrix is the second matrix you provided.
You can see this by looking at the implementation of Keras' Embedding layer, notably how mask_zero is only used to literally mask the inputs:
def compute_mask(self, inputs, mask=None):
    if not self.mask_zero:
        return None
    output_mask = K.not_equal(inputs, 0)
    return output_mask
thus the full embedding matrix is still looked up with the raw input indexes, which is why all word indexes are shifted up by one.
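To make this concrete, a small sketch with the toy weights from the question (note: newer Keras versions drop the weights argument in favor of an embeddings_initializer such as Constant):

import numpy as np
from tensorflow.keras.layers import Embedding

# toy weights from the question, rows for words 1..4
embedding_matrix = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]])

layer = Embedding(input_dim=5,  # vocabulary size (4) + 1 for the padding index
                  output_dim=2,
                  mask_zero=True,
                  weights=[np.vstack((np.zeros((1, 2)), embedding_matrix))])

seq = np.array([[1, 2, 0, 0, 0]])  # words 1 and 2, zero-padded to length 5
print(layer(seq))                  # rows [0.1, 0.2], [0.3, 0.4], then zero rows
print(layer.compute_mask(seq))     # [[True, True, False, False, False]]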

Understanding tensorflow RNN encoder input for translation task

I'm following this tutorial (https://theneuralperspective.com/2016/11/20/recurrent-neural-networks-rnn-part-3-encoder-decoder/) to implement a RNN using tensorflow.
I have prepared the input data for both the encoder and the decoder as described, but I'm having trouble understanding what input the dynamic_rnn function in tensorflow expects for this particular task, so I'm going to describe the data I have in the hope that someone has an idea.
I have 20 sentences in English and in French. As described at the link above, each sentence is padded to the longest sentence in its language, and GO and EOS tokens are added where expected.
Since not all the code is available at the link, I tried writing some of the missing code myself. This is what my encoder code looks like:
with tf.variable_scope('encoder') as scope:
    # Encoder RNN cell
    self.encoder_lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(
        hidden_units, forget_bias=0.0, state_is_tuple=True)
    self.encoder_cell = tf.nn.rnn_cell.MultiRNNCell(
        [self.encoder_lstm_cell] * num_layers, state_is_tuple=True)
    # Embed encoder RNN inputs
    with tf.device("/cpu:0"):
        embedding = tf.get_variable(
            "embedding", [self.vocab_size_en, hidden_units], dtype=data_type)
        self.embedded_encoder_inputs = tf.nn.embedding_lookup(
            embedding, self.encoder_inputs)
    # Outputs from encoder RNN
    self.encoder_outputs, self.encoder_state = tf.nn.dynamic_rnn(
        cell=self.encoder_cell,
        inputs=self.embedded_encoder_inputs,
        sequence_length=self.seq_lens_en, time_major=False, dtype=tf.float32)
My embeddings are of shape [20, 60, 60], but I think this is incorrect. If I look at the documentation for dynamic_rnn (https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn), the inputs are expected to be of shape [batch, max_time, ...]. If I look at this example (https://medium.com/@erikhallstrm/using-the-dynamicrnn-api-in-tensorflow-7237aba7f7ea#.4hrf7z1pd), the inputs are supposed to be [batch, truncated_backprop_length, state_size].
The max length of a sentence for the encoder is 60. This is where I am confused about what exactly the last two dimensions of the input should be. Should max_time be the max length of my sentence sequences, or some other quantity? And should the last dimension also be the max length of the sentences' sequences, or the length of the current input without the padding?
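For reference, a minimal sketch (not a definitive answer) of the shapes tf.nn.dynamic_rnn expects with time_major=False, plugged with the question's numbers (hidden_units is illustrative):

import tensorflow as tf  # TF 1.x API, as in the question

batch_size, max_time, embedding_dim, hidden_units = 20, 60, 60, 128

# inputs: [batch, max_time, embedding_dim]; max_time is the padded sentence length
inputs = tf.placeholder(tf.float32, [batch_size, max_time, embedding_dim])
# sequence_length carries each sentence's true length, without the padding
seq_lens = tf.placeholder(tf.int32, [batch_size])

cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_units, state_is_tuple=True)
outputs, state = tf.nn.dynamic_rnn(cell, inputs,
                                   sequence_length=seq_lens,
                                   time_major=False, dtype=tf.float32)
# outputs: [batch, max_time, hidden_units]; timesteps past seq_lens come out as zeros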

How to pass 3d Tensor to tensorflow RNN embedding_rnn_seq2seq

I'm trying to feed in sentences in which each word has a word2vec representation.
How can I do this in the tensorflow seq2seq models?
Suppose the variable
enc_inp = [tf.placeholder(tf.int32, shape=(None, 10), name="inp%i" % t)
           for t in range(seq_length)]
which has dimensions [num_of_observations (batch_size) x word_vec_representation x sentence_length].
When I pass it to embedding_rnn_seq2seq
decode_outputs, decode_state = seq2seq.embedding_rnn_seq2seq(
    enc_inp, dec_inp, stacked_lstm,
    seq_length, seq_length, embedding_dim)
the following error occurs:
ValueError: Linear is expecting 2D arguments: [[None, 10, 50], [None, 50]]
Also there is a more complex problem:
How can I pass a vector, not a scalar, as input to the first cell of my RNN?
Right now it looks like this (for any sequence):
get the first value of the sequence (a scalar)
compute the first RNN layer's embedding cell output
compute the second RNN layer's embedding cell output
etc.
But this is needed:
get the first value of the sequence (a vector)
compute the first RNN layer's cell output (like an ordinary perceptron computation, where the input is a vector)
compute the second RNN layer's cell output (likewise, with a vector input)
The main point is that the seq2seq models do the word embedding internally themselves.
Here is a reddit question and answer.
Also, if somebody wants to use a pretrained Word2Vec, there are ways to do it;
see:
stackoverflow 1
stackoverflow 2
So this can be used not only for word embedding.
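Since embedding_rnn_seq2seq embeds the token ids itself, the encoder and decoder inputs should be plain integer ids per timestep, not word2vec vectors. A hedged sketch reusing the question's variables (note that the fourth and fifth arguments are vocabulary sizes, num_encoder_symbols and num_decoder_symbols, not sequence lengths):

# one placeholder per timestep, each holding a batch of token ids (one scalar per example)
enc_inp = [tf.placeholder(tf.int32, shape=(None,), name="inp%i" % t)
           for t in range(seq_length)]
dec_inp = [tf.placeholder(tf.int32, shape=(None,), name="dec%i" % t)
           for t in range(seq_length)]

# embedding_rnn_seq2seq builds its own embedding matrix of size embedding_dim internally
decode_outputs, decode_state = seq2seq.embedding_rnn_seq2seq(
    enc_inp, dec_inp, stacked_lstm,
    vocab_size, vocab_size, embedding_dim)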
