How to get word embeddings from the pretrained transformers - python

I am working on a word-level classification task on multilingual data using XLM-R. I know that XLM-R uses SentencePiece as its tokenizer, which sometimes splits words into subwords.
For example, the phrase "deception master" is tokenized as de ception master; the word deception has been split into two subwords.
How can I get the embedding of deception? I could take the mean of the subword embeddings to get the word embedding, as done here, but I have to implement my code in TensorFlow, and the TensorFlow computational graph doesn't support NumPy.
I could store the final hidden embeddings, after taking the mean over subwords, in a NumPy array and give this array as input to the model, but I want to fine-tune the transformer.
How can I get word embeddings from the subword embeddings produced by the transformer?

Joining subword embeddings into words for word labeling is not how this problem is usually approached. The usual approach is the opposite: keep the subwords as they are, but adjust the labels to respect the tokenization of the pre-trained model.
One of the reasons is that the data typically comes in batches. When merging subwords into words, every sentence in the batch would end up with a different length, which would require processing each sentence independently and padding the batch again – this would be slow. Also, if you do not average the neighboring embeddings, you get more fine-grained information from the loss function, which tells you explicitly which subword is responsible for an error.
When tokenizing using SentencePiece, you can get the indices in the original string:
from transformers import XLMRobertaTokenizerFast
tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")
tokenizer("deception master", return_offsets_mapping=True)
This returns the following dictionary:
{'input_ids': [0, 8, 63928, 31347, 2],
'attention_mask': [1, 1, 1, 1, 1],
'offset_mapping': [(0, 0), (0, 2), (2, 9), (10, 16), (0, 0)]}
With the offsets, you can find out whether a subword corresponds to a word that you want to label. There are various strategies that could be used for encoding the labels. The easiest one is just to copy the label to every subword. A fancier way would be to use schemes from named entity recognition, such as IOB tagging, which explicitly marks the beginning of a labeled segment.
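For example, a minimal sketch of the first strategy (copying the word-level label to every subword) using the offsets returned above; the word labels and the -100 ignore value are illustrative assumptions, not part of the original question:
text = "deception master"
word_labels = {"deception": 1, "master": 0}        # hypothetical word-level labels
enc = tokenizer(text, return_offsets_mapping=True)

# character spans of the words we want to label
word_spans = []
pos = 0
for word in text.split():
    start = text.index(word, pos)
    word_spans.append((start, start + len(word), word_labels[word]))
    pos = start + len(word)

subword_labels = []
for start, end in enc["offset_mapping"]:
    if start == end:                  # special tokens such as <s> and </s>
        subword_labels.append(-100)   # a value the loss can mask/ignore
        continue
    label = next(lab for s, e, lab in word_spans if s <= start < e)
    subword_labels.append(label)

print(subword_labels)                 # [-100, 1, 1, 0, -100]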

Related

Pytorch - Token embeddings using Character level LSTM

I'm trying to train a neural network that classifies a sequence of words. Based on a paper I'm trying to replicate, I'd need to have both token-level embeddings and character-level embeddings of tokens.
For example, take this sentence:
The shop is open
I need 2 embeddings - one is the normal nn.Embedding layer for the token-level embedding (very simplified!):
[The, shop, is, open] -> nn.Embedding -> [4,3,7,2]
the other is a BiLSTM embedding on the character-level:
[[T,h,e], [s,h,o,p], [i,s], [o,p,e,n]] -> nn.LSTM -> [9,10,23,5]
Both of them produce word-level embeddings but on a different scale. I tried working out how to do this in PyTorch but I can't seem to do it. The only time I can do them both at the same time is if I pass the characters as one long sequence ([t,h,e,s,h,o,p,i,s,o,p,e,n]), but that will only produce one embedding.
If anyone could help that would be appreciated.
The only time I can do them both at the same time is if I pass the characters as one long sequence ([t,h,e,s,h,o,p,i,s,o,p,e,n])
Essentially, what you have to do is:
Split sentences into words (each sentence has (or will have) its respective nn.Embedding)
Split each word into single letters (essentially adding another dimension)
About the second point
Compare word-level embeddings:
[The, shop, is, open]
This is a single example; let's assume each word is encoded with a 300-dimensional vector, so you get a shape of (1, 4, 300) (batch goes first; padding is needed as usual with RNNs). This data can go directly into an RNN or similar "text" model.
[[T,h,e], [s,h,o,p], [i,s], [o,p,e,n]]
In this case, assuming a 50-dimensional vector for each letter, we would have data of shape (1, 4, 4, 50). Notice that each word has been padded to the length of the longest word!
Such input cannot go into an RNN for obvious reasons (it's 4D instead of the required 3D). But notice that each word can be treated independently (as a different sample), hence we can go for shape (4, 4, 50) (a reshape is needed), where the zeroth dimension corresponds to single words, the first to the letters contained in that word, and the last is the vector dimensionality.
For batches of data
In general, for word-level encoding it is pretty simple, as you always have (batch, timesteps, embedding).
For the character level, you should form your data into a tensor of shape (batch, word_timesteps, character_timesteps, embedding), which has to be reshaped into (batch * word_timesteps, character_timesteps, embedding).
This requires some fun with padding, and the batch size grows really fast, so data splitting might be needed.
Output from character level LSTM
You should get (batch * word_timesteps, network_embedding) as output (remember to take the last timestep of each word!). In our case it would be (4, 50).
Given that, you can reshape this matrix into (batch, timesteps, dimension) ((1, 4, 50) in our case).
Finally, you can concatenate this embedding with the word-level embedding across the last dimension to get a (1, 4, 350) output matrix in total. You can pass this into another RNN layer or proceed however you wish; a minimal sketch follows below.
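A minimal sketch of the shapes described above (dimensions follow the example; the vocabulary sizes, hidden sizes, and random inputs are illustrative, and with real padded data you would use packed sequences to get the true last timestep of each word):
import torch
import torch.nn as nn

batch, word_timesteps, char_timesteps = 1, 4, 4
word_vocab, char_vocab = 1000, 60                  # hypothetical vocabulary sizes

word_emb = nn.Embedding(word_vocab, 300)
char_emb = nn.Embedding(char_vocab, 50)
char_lstm = nn.LSTM(input_size=50, hidden_size=50, batch_first=True)

word_ids = torch.randint(0, word_vocab, (batch, word_timesteps))
char_ids = torch.randint(0, char_vocab, (batch, word_timesteps, char_timesteps))

words = word_emb(word_ids)                                        # (1, 4, 300)
chars = char_emb(char_ids)                                        # (1, 4, 4, 50)
chars = chars.view(batch * word_timesteps, char_timesteps, 50)    # (4, 4, 50)
_, (h_n, _) = char_lstm(chars)                                    # hidden state at the last timestep
char_words = h_n[-1].view(batch, word_timesteps, 50)              # (1, 4, 50)

combined = torch.cat([words, char_words], dim=-1)                 # (1, 4, 350)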
Additional points
If you wish to keep information flowing between words in the character-level embedding, you would have to pass the hidden_state across the N elements in the batch (where N is the number of words in the sentence). That might make it a little harder, but it should be doable; just remember that an LSTM has an effective capacity of roughly 100-1000 timesteps AFAIK, and with long sentences you can easily surpass this number of letters.

Using pretrained word embeddings to classify "pools" of words

I have seen many papers explaining the use of pretrained word embeddings (such as Word2Vec or Fasttext) for sentence sentiment classification using CNNs (like Yoon Kim's paper). However, these classifiers also account for the order in which the words appear.
My application of word embeddings is to predict the class of "pools" of words. For example, in the following list of lists
example = [["red", "blue", "green", "orange"], ["bear", "horse", "cow"], ["brown", "pink"]]
The order of the words doesn't matter, but I want to classify each sublist as either the color class or the animal class.
Are there any prebuilt Keras implementations of this, or any papers you could point me to which address this type of classification problem based on pretrained word embeddings?
I am sorry if this is off-topic in this forum. If so, please let me know where would be a better place to post it.
The key point in creating that classifier is to avoid any bias from the order of the words in a list. A naive LSTM solution would just look at the first or last few words and try to classify; this effect could be reduced by feeding permutations of the lists every time. Perhaps a simpler approach might be:
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Input, TimeDistributed, Dense, Lambda
from tensorflow.keras.models import Model

latent_dim, num_classes = 128, 2   # illustrative sizes

# unknown number of words in each list, each a 300-dim word2vec vector
inp = Input(shape=(None, 300))
# some feature extraction per word
latent = TimeDistributed(Dense(latent_dim, activation='relu'))(inp)
latent = TimeDistributed(Dense(latent_dim, activation='relu'))(latent)
# reduce-sum over the word axis (axis=1) so ordering does not matter
summed = Lambda(lambda x: K.sum(x, axis=1))(latent)
out = Dense(num_classes, activation='softmax')(summed)
model = Model(inp, out)
model.compile(loss='categorical_crossentropy', optimizer='sgd')
The reduce-sum over words avoids any ordering bias; if the majority of words express features of a certain class, then the sum will also lean towards that class.
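As a usage sketch for the input side (assuming a pretrained gensim KeyedVectors object named wv with 300-dimensional vectors; the names and pools are illustrative):
import numpy as np

pools = [["red", "blue", "green", "orange"], ["bear", "horse", "cow"]]
max_len = max(len(p) for p in pools)

# stack each pool of words into a padded (num_pools, max_len, 300) array;
# all-zero padding rows contribute nothing to the reduce-sum
X = np.zeros((len(pools), max_len, 300), dtype="float32")
for i, pool in enumerate(pools):
    for j, word in enumerate(pool):
        if word in wv:                 # skip out-of-vocabulary words
            X[i, j] = wv[word]

# X can then be passed to model.fit together with one-hot class labels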

Keras one hot embedding before LSTM

Suppose I have a training dataset of several sequences padded to length 40, with a dictionary size of 80, e.g., example = [0, 0, 0, 3, 4, 9, 22, ...], and I want to feed them into an LSTM layer. What I want to do is apply a one-hot encoding to the sequences, e.g., example_after_one_hot.shape = (40, 80). Is there a Keras layer that is able to do this? I have tried Embedding; however, it seems that is not a one-hot encoding.
Edit: another way is to use an Embedding layer. Given that the dictionary only contains 80 different keys, how should I set the output of the Embedding layer?
I think you're looking for a pre-processing task, not something that is strictly part of your network.
Keras has a one-hot text pre-processing function that may be able to help you. Take a look at Keras text preprocessing. If this doesn't fit your needs, it's fairly easy to pre-process it yourself with numpy. You can do something like...
import numpy

X = numpy.zeros(shape=(len(sentences), 40, 80), dtype='float32')
for i, sent in enumerate(sentences):
    for j, word in enumerate(sent):
        X[i, j, word] = 1.0
This will give you a one-hot encoding for a 2D-array of "sentences", where each word in the array is an integer less than 80. Of course the data doesn't have to be sentences, it can be any type of data.
Note that Embedding layers are for learning a distributed representation of the data, not for putting data in a one-hot format.
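If you do want the one-hot step inside the model itself, so that raw integer sequences can be fed directly, a minimal sketch with tf.one_hot wrapped in a Lambda layer might look like this (assuming tf.keras; the hidden size and output layer are illustrative):
import tensorflow as tf
from tensorflow.keras.layers import Input, Lambda, LSTM, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=(40,), dtype='int32')                    # padded integer sequences
one_hot = Lambda(lambda x: tf.one_hot(x, depth=80))(inputs)   # (batch, 40, 80)
hidden = LSTM(64)(one_hot)
outputs = Dense(1, activation='sigmoid')(hidden)
model = Model(inputs, outputs)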

How to get both the word embeddings vector and context vector of a given word by using word2vec?

from gensim.models import word2vec
sentences = word2vec.Text8Corpus('TextFile')
model = word2vec.Word2Vec(sentences, size=200, min_count = 2, workers = 4)
print model['king']
Is the output vector the context vector of 'king' or the word embedding vector of 'king'? How can I get both context vector of 'king' and the word embedding vector of 'king'? Thanks!
It is the embedding vector for 'king'.
If you use hierarchical softmax, the context vectors are:
model.syn1
and if you use negative sampling they are:
model.syn1neg
The vectors can be accessed by:
model.syn1[model.vocab[word].index]
'Context vector' is also a 'word embedding' vector. Word embedding means mapping vocabulary items to vectors of real numbers.
I assume you meant the center word's vector when you said 'word embedding' vector.
In the word2vec algorithm, training the model creates two different vectors for each word: one for when 'king' is used as the center word and one for when it is used as a context word.
I don't know how gensim treats these two vectors, but normally people average the context and center vectors, or concatenate the two vectors. It might not be the most elegant way to treat the vectors, but it works very well in practice.
So when you call model['king'] on some pre-trained model, the vector you see is probably an averaged version of the two vectors.
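Putting the two answers together, a minimal sketch (assuming the same pre-4.0 gensim attributes used above and a model trained with negative sampling, which is the default):
import numpy as np

idx = model.vocab['king'].index
center_vec = model['king']                   # the vector returned by model[word]
context_vec = model.syn1neg[idx]             # the corresponding context vector
avg_vec = (center_vec + context_vec) / 2.0   # average the two vectors ...
cat_vec = np.concatenate([center_vec, context_vec])  # ... or concatenate them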

Predicting next word using the language model tensorflow example

The TensorFlow tutorial on language models allows you to compute the probability of sentences:
probabilities = tf.nn.softmax(logits)
In the comments below, it also mentions a way of predicting the next word instead of probabilities, but does not specify how this can be done. So how can I output a word instead of a probability using this example?
lstm = rnn_cell.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory.
state = tf.zeros([batch_size, lstm.state_size])
loss = 0.0
for current_batch_of_words in words_in_dataset:
    # The value of state is updated after processing each batch of words.
    output, state = lstm(current_batch_of_words, state)
    # The LSTM output can be used to make next word predictions
    logits = tf.matmul(output, softmax_w) + softmax_b
    probabilities = tf.nn.softmax(logits)
    loss += loss_function(probabilities, target_words)
Your output is a TensorFlow tensor, and it is possible to get its max argument (the most probable predicted class) with a TensorFlow function. This is normally the tensor that contains the next word's probabilities.
At "Evaluate the Model" on this page, your output tensor is y in the following example:
First we'll figure out where we predicted the correct label. tf.argmax
is an extremely useful function which gives you the index of the
highest entry in a tensor along some axis. For example, tf.argmax(y,1)
is the label our model thinks is most likely for each input, while
tf.argmax(y_,1) is the true label. We can use tf.equal to check if our
prediction matches the truth.
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
Another, different approach is to work with pre-vectorized (embedded/encoded) words. You could vectorize your words (and therefore embed them) with word2vec to accelerate learning; you might want to take a look at this. Each word could be represented as a point in a 300-dimensional space of meaning, and you could automatically find the "N words" closest to the predicted point in space at the output of the network. In that case the argmax way of proceeding no longer works; you could instead compare on cosine similarity with the words you actually wanted to predict, although I am not sure whether this could cause numerical instabilities. In that case y will not represent words as features, but word embeddings with a dimensionality of, say, 100 to 2000 depending on the model. You could Google something like "man woman queen word addition word2vec" to understand the subject of embeddings better.
Note: when I talk about word2vec here, I mean using an external pre-trained word2vec model so that your training only has pre-embedded inputs and creates embedding outputs. The words corresponding to those outputs can then be recovered with word2vec by finding the most similar words to each predicted vector.
Notice that the approach I suggest is not exact, since it is only useful for knowing whether we predict EXACTLY the word that we wanted to predict. For a softer approach, it would be possible to use ROUGE or BLEU metrics to evaluate your model in case you use sentences or something longer than a word.
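If you go the embedding-output route, the closest predicted words could be recovered with something like the following (a hypothetical sketch: wv is a pretrained gensim KeyedVectors and predicted_vec is the network's output embedding, neither of which comes from the original answer):
candidates = wv.similar_by_vector(predicted_vec, topn=5)   # list of (word, cosine similarity)
print(candidates)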
You need to find the argmax of the probabilities, and translate the index back to a word by reversing the word_to_id map. To get this to work, you must save the probabilities in the model and then fetch them from the run_epoch function (you could also save just the argmax itself). Here's a snippet:
inverseDictionary = dict(zip(word_to_id.values(), word_to_id.keys()))

def run_epoch(...):
    decodedWordId = int(np.argmax(logits))
    print(" ".join([inverseDictionary[int(x1)] for x1 in np.nditer(x)])
          + " got: " + inverseDictionary[decodedWordId]
          + " expected: " + inverseDictionary[int(y)])
See full implementation here: https://github.com/nelken/tf
It is actually an advantage that the function returns probabilities instead of the word itself. Since you get a list of words with their associated probabilities, you can do further processing and increase the accuracy of your result.
To answer your question:
You can take the list of words, iterate through it, and make the program display the word with the highest probability.
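For completeness, a minimal sketch of that last step, reusing the names from the snippet above (probabilities_value stands for the probabilities fetched with session.run and is an illustrative name):
predicted_id = int(np.argmax(probabilities_value))   # index of the highest probability
print(inverseDictionary[predicted_id])               # the most probable next word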
