Using pretrained word embeddings to classify "pools" of words - python

I have seen many papers explaining the use of pretrained word embeddings (such as Word2Vec or FastText) for sentence sentiment classification with CNNs (like Yoon Kim's paper). However, these classifiers also account for the order in which the words appear.
My application of word embeddings is to predict the class of "pools" of words. For example, take the following list of lists:
example = [["red", "blue", "green", "orange"], ["bear", "horse", "cow"], ["brown", "pink"]]
The order of the words doesn't matter, but I want to classify each sublist as either the color class or the animal class.
Are there any prebuilt Keras implementations of this, or any papers you could point me to which address this type of classification problem based on pretrained word embeddings?
I am sorry if this is off-topic in this forum. If so, please let me know where would be a better place to post it.

The key point in building such a classifier is to avoid any bias from the order of the words in a list. A naive LSTM solution would just look at the first or last few words and try to classify from those; this effect could be reduced by feeding permutations of the lists during training. Perhaps a simpler approach might be:
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Dense, Input, Lambda, TimeDistributed
from tensorflow.keras.models import Model

latent_dim, num_classes = 64, 2  # example sizes
# unknown number of words per list, each a 300-dim word2vec vector
inp = Input(shape=(None, 300))
# some feature extraction per word
latent = TimeDistributed(Dense(latent_dim, activation='relu'))(inp)
latent = TimeDistributed(Dense(latent_dim, activation='relu'))(latent)
# reduce-sum over the word axis, so word order cannot matter
summed = Lambda(lambda x: K.sum(x, axis=1))(latent)
out = Dense(num_classes, activation='softmax')(summed)
model = Model(inp, out)
model.compile(loss='categorical_crossentropy', optimizer='sgd')
where the reduced sum avoids any ordering bias: if a majority of the words express features of a certain class, the sum will also lean towards that class.
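As a minimal sketch of how the input could be fed to that model, assuming a pretrained 300-dimensional gensim KeyedVectors object named word_vectors; the lookup helper, example pools, and padding scheme here are illustrative, not part of the original answer.

import numpy as np
from tensorflow.keras.utils import to_categorical

def pool_to_matrix(pool, word_vectors, dim=300):
    """Stack the word2vec vector of each word; skip out-of-vocabulary words."""
    vecs = [word_vectors[w] for w in pool if w in word_vectors]
    return np.array(vecs) if vecs else np.zeros((1, dim))

pools = [["red", "blue", "green", "orange"], ["bear", "horse", "cow"]]
labels = [0, 1]  # 0 = color, 1 = animal

# Zero-pad every pool in the batch to the same word count; zero rows add
# nothing to the reduce-sum, so they do not disturb the pooled features.
matrices = [pool_to_matrix(p, word_vectors) for p in pools]
max_len = max(m.shape[0] for m in matrices)
batch = np.stack([np.vstack([m, np.zeros((max_len - m.shape[0], 300))])
                  for m in matrices])

model.fit(batch, to_categorical(labels, num_classes), epochs=10)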

Related

How to embed an extra layer of categorical feature in BERT model

OK, let me explain my issue clearly. I have two lists: one is a list of input sentences, and the other is a list of categorical features (customer names), one per sentence.
comment = ["I hate the food", "Service is bad", "Service is bad but food is good", "Food is delicious"]
customer_name = ["umar","adil","cameron","Daniel"]
The mapping is positional: each comment belongs to the customer at the same index. This is just a sample dataset; I have thousands of records in my real dataset.
I am doing sentiment analysis using a BERT model, but now I want to add customer_name as an additional feature to that model.
Right now this is how I am tokenizing my input sentences and generating the input_ids and attention_masks:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

input_ids = []
attention_masks = []

# df is assumed to be a pandas DataFrame with a 'Comment' column
for sentence in df['Comment'].tolist():
    dictionary = tokenizer.encode_plus(
        sentence,
        add_special_tokens=True,
        max_length=512,
        pad_to_max_length=True,
        return_attention_mask=True,
        return_tensors='pt',
    )
    # encode_plus returns a dictionary
    input_ids.append(dictionary['input_ids'])
    attention_masks.append(dictionary['attention_mask'])
You can see that I am passing each comment as input, and this function returns the input_ids and attention_mask for each comment; I then build the input_ids and attention_mask layers and feed them into my BERT model. But now I want to add the customer name as an additional feature in my BERT model.
I asked a few experts, and they told me that the traditional way is to hold an embedding for each category and concatenate it (or combine the features some other way) with the contextualized output of BERT before feeding it to the classification layer.
I searched the internet, and this is what I found:
# batch with two sentences (i.e. the citation text you have already used)
i = t(["paper title 1", "paper title 2"], padding=True, return_tensors="pt")
# We assume that the first sentence (i.e. paper title 1) belongs to category 23 and the second sentence to category 42
# You probably want to use a dictionary in your own code
i["categorical_feature_ids"] = torch.tensor([23,42])
Because I am not an expert in ML, I am confused about one thing: where do we give our additional feature (customer name) as an input? For example, we can see that we pass our input (the comment) to the BERT tokenizer's encode method, but what about other features such as the customer name? It's fine that we can create an extra embedding layer for each feature, but how do we inject those features (as input) into the model? The piece of code above is not enough for me to understand that.
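A hedged sketch of one possible wiring (not from the original thread): embed the customer ID with nn.Embedding, take BERT's pooled output for the comment, concatenate both, and feed the result to a linear classification layer. The class name, embedding size, number of classes, and the use of pooler_output (rather than the [CLS] hidden state) below are illustrative assumptions.

import torch
import torch.nn as nn
from transformers import BertModel

class BertWithCustomer(nn.Module):
    def __init__(self, num_customers, customer_dim=16, num_classes=2):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.customer_emb = nn.Embedding(num_customers, customer_dim)
        hidden = self.bert.config.hidden_size  # 768 for bert-base
        self.classifier = nn.Linear(hidden + customer_dim, num_classes)

    def forward(self, input_ids, attention_mask, customer_ids):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.pooler_output              # (batch, 768) sentence vector
        cust = self.customer_emb(customer_ids)  # (batch, customer_dim)
        combined = torch.cat([pooled, cust], dim=1)
        return self.classifier(combined)

# Map each customer name to an integer ID before training
customer_to_id = {name: i for i, name in enumerate(sorted(set(customer_name)))}
customer_ids = torch.tensor([customer_to_id[n] for n in customer_name])

model = BertWithCustomer(num_customers=len(customer_to_id))
# Forward pass on the tokenized batch; logits has shape (batch, num_classes)
logits = model(torch.cat(input_ids), torch.cat(attention_masks), customer_ids)

Concatenating with the pooled vector is only one choice; adding the feature embedding to the token embeddings, or appending it as an extra input token, are other options people use.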

How to get word embeddings from the pretrained transformers

I am working on a word-level classification task on multilingual data using XLM-R. I know that XLM-R uses SentencePiece as its tokenizer, which sometimes splits words into subwords.
For example, the sentence "deception master" is tokenized as de ception master; the word deception has been tokenized into two subwords.
How can I get the embedding of deception? I could take the mean of the subwords to get the embedding of the word, as done here, but I have to implement my code in TensorFlow, and the TensorFlow computational graph doesn't support NumPy.
I could store the final hidden embeddings, after taking the mean of the subwords, in a NumPy array and give this array as input to the model, but I want to fine-tune the transformer.
How can I get the word embeddings from the subword embeddings given by the transformer?
Joining subword embeddings into words for word labeling is not how this problem is usually approached. The usual approach is the opposite: keep the subwords as they are, but adjust the labels to respect the tokenization of the pre-trained model.
One of the reasons is that the data is typically processed in batches. When merging subwords into words, every sentence in the batch would end up with a different length, which would require processing each sentence independently and padding the batch again; this would be slow. Also, if you do not average the neighboring embeddings, you get more fine-grained information from the loss function, which tells you explicitly which subword is responsible for an error.
When tokenizing using SentencePiece, you can get the indices in the original string:
from transformers import XLMRobertaTokenizerFast
tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")
tokenizer("deception master", return_offsets_mapping=True)
This returns the following dictionary:
{'input_ids': [0, 8, 63928, 31347, 2],
'attention_mask': [1, 1, 1, 1, 1],
'offset_mapping': [(0, 0), (0, 2), (2, 9), (10, 16), (0, 0)]}
With the offsets, you can find out whether a subword corresponds to a word that you want to label. There are various strategies for encoding the labels. The easiest one is just to copy the label to every subword, as sketched below. A fancier way would be to use schemes from named entity recognition, such as IOB tagging, which explicitly marks the beginning of each labeled segment.
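A small sketch of the copy-the-label-to-every-subword strategy using those offsets; the whitespace-split words and the example labels below are illustrative, not from the original question.

from transformers import XLMRobertaTokenizerFast

tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")

sentence = "deception master"
word_labels = ["B-SKILL", "O"]  # one (illustrative) label per whitespace word

encoding = tokenizer(sentence, return_offsets_mapping=True)

# End offset of each whitespace-separated word in the original string
word_ends, pos = [], 0
for w in sentence.split():
    pos = sentence.index(w, pos) + len(w)
    word_ends.append(pos)

subword_labels = []
for start, end in encoding["offset_mapping"]:
    if start == end:                 # special tokens like <s> and </s>
        subword_labels.append("IGNORE")
        continue
    # the word whose character span contains this subword's end offset
    word_idx = next(i for i, we in enumerate(word_ends) if end <= we)
    subword_labels.append(word_labels[word_idx])

print(list(zip(encoding.tokens(), subword_labels)))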

How to get the nearest documents for a word in gensim in python

I am using the doc2vec model as follows to construct my document vectors.
import json
from collections import namedtuple

from gensim.models import doc2vec

dataset = json.load(open(input_file))

docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for description in dataset:
    tags = [description[0]]
    words = description[1]
    docs.append(analyzedDocument(words, tags))

model = doc2vec.Doc2Vec(docs, vector_size=100, window=10, min_count=1, workers=4, epochs=20)
I have seen that gensim doc2vec also includes word vectors. Suppose I have a word vector created for the word deep learning. My question is: is it possible to get the documents nearest to the deep learning word vector in gensim in Python?
I am happy to provide more details if needed.
Some Doc2Vec modes will co-train doc-vectors and word-vectors in the "same space". Then, if you have a word-vector for 'deep_learning', you can ask for documents near that vector, and the results may be useful for you. For example:
similar_docs = d2v_model.docvecs.most_similar(
    positive=[d2v_model.wv['deep_learning']]
)
But:
that is only going to be as good as how well your model learned 'deep_learning' as a word meaning what you think it means;
a training set of known-good documents fitting the category 'deep_learning' (and other categories) could work better, whether you hand-curate those or bootstrap from other sources (say, the Wikipedia category 'Deep Learning', or other curated/search-result sets that you trust);
reducing a category to a single summary point (one vector) may not be as good as having a variety of examples, many points, that all fit the category. (Relevant docs may not form a neat sphere around a summary point, but rather populate exotically-shaped regions of the high-dimensional doc-vector space.) If you have a lot of good examples of each category, you could instead train a classifier, as sketched below, to label any further uncategorized docs, or to rank them in relation to the trained categories.
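A rough sketch of that classifier route, not from the original answer: assume the Doc2Vec model from the question is already trained, and that a hypothetical doc_labels dict maps each training tag to its known category.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical: doc_labels maps each training tag to its category name,
# e.g. {"doc_001": "deep_learning", "doc_002": "databases", ...}
tags = list(doc_labels)
X = np.array([model.dv[tag] for tag in tags])   # model.docvecs in older gensim
y = np.array([doc_labels[tag] for tag in tags])

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Categorize an unseen document by inferring a vector for its tokens
new_vec = model.infer_vector(["convolutional", "networks", "for", "images"])
print(clf.predict([new_vec]))
print(clf.predict_proba([new_vec]))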

How to get both the word embeddings vector and context vector of a given word by using word2vec?

from gensim.models import word2vec

sentences = word2vec.Text8Corpus('TextFile')
model = word2vec.Word2Vec(sentences, size=200, min_count=2, workers=4)
print(model['king'])
Is the output vector the context vector of 'king' or the word embedding vector of 'king'? How can I get both the context vector and the word embedding vector of 'king'? Thanks!
It is the embedding vector for 'king'.
If you use hierarchical softmax, the context vectors are:
model.syn1
and if you use negative sampling they are:
model.syn1neg
The vectors can be accessed by:
model.syn1[model.vocab[word].index]
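For example, a small self-contained sketch of accessing both vectors, using current gensim 4.x attribute names (older versions use model.vocab / model.wv.vocab instead of key_to_index), a toy corpus, and the default negative-sampling mode:

from gensim.models import Word2Vec

# Toy corpus just so the snippet runs end to end
sentences = [["the", "king", "rules", "the", "kingdom"],
             ["the", "queen", "rules", "too"]] * 50
model = Word2Vec(sentences, vector_size=200, min_count=2, workers=4)

idx = model.wv.key_to_index['king']
center_vector = model.wv['king']       # input / center-word embedding
context_vector = model.syn1neg[idx]    # output / context vector (negative sampling)
print(center_vector.shape, context_vector.shape)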
A 'context vector' is also a 'word embedding' vector: word embedding just means mapping vocabulary words to vectors of real numbers.
I assume you meant the center word's vector when you said 'word embedding' vector.
In the word2vec algorithm, training creates two different vectors for each word (one for when 'king' is the center word and one for when it is a context word).
I don't know exactly how gensim treats these two vectors, but normally people average the context and center vectors, or concatenate the two. It might not be the most elegant way to treat the vectors, but it works very well.
So when you call model['king'] on some pre-trained vectors, the vector you see is probably an averaged version of the two vectors.

Predicting next word using the language model tensorflow example

The TensorFlow tutorial on language modelling allows you to compute the probability of sentences:
probabilities = tf.nn.softmax(logits)
The comments below also mention a way of predicting the next word instead of probabilities, but do not specify how this can be done. So how can I output a word instead of a probability using this example?
lstm = rnn_cell.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory.
state = tf.zeros([batch_size, lstm.state_size])
loss = 0.0

for current_batch_of_words in words_in_dataset:
    # The value of state is updated after processing each batch of words.
    output, state = lstm(current_batch_of_words, state)

    # The LSTM output can be used to make next word predictions
    logits = tf.matmul(output, softmax_w) + softmax_b
    probabilities = tf.nn.softmax(logits)
    loss += loss_function(probabilities, target_words)
Your output is a TensorFlow tensor, and you can get its max argument (the index of the most probable class) with a TensorFlow function. This tensor normally contains the next word's probabilities.
At "Evaluate the Model" from this page, your output list is y in the following example:
First we'll figure out where we predicted the correct label. tf.argmax
is an extremely useful function which gives you the index of the
highest entry in a tensor along some axis. For example, tf.argmax(y,1)
is the label our model thinks is most likely for each input, while
tf.argmax(y_,1) is the true label. We can use tf.equal to check if our
prediction matches the truth.
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
A different approach is to work with pre-vectorized (embedded/encoded) words. You could vectorize your words with Word2vec to accelerate learning; you might want to take a look at this. Each word could be represented as a point in a 300-dimensional space of meaning, and you could automatically find the "N words" closest to the predicted point in that space at the output of the network. In that case the argmax approach no longer works; instead you could compare by cosine similarity with the words you actually wanted to predict, though I am not sure whether this could cause numerical instabilities. Here y would not represent words as one-hot features, but word embeddings with a dimensionality of, say, 100 to 2000, depending on the model. You could Google something like "man woman queen word addition word2vec" for more background on embeddings.
Note: when I talk about word2vec here, I mean using an external pre-trained word2vec model so that training only receives pre-embedded inputs and produces embedding outputs. The words corresponding to those outputs can then be recovered with word2vec by looking up the closest (top predicted) words.
Notice that the approach I suggest is not exact, since it only tells you whether we predicted EXACTLY the word we wanted to predict. For a softer evaluation, you could use ROUGE or BLEU metrics in case you work with sentences or something longer than a single word.
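As a hedged illustration of that nearest-word lookup (not part of the original tutorial; the pretrained-vector file name below is just an assumption), gensim can return the vocabulary words closest by cosine similarity to whatever vector the network predicts:

import numpy as np
from gensim.models import KeyedVectors

# Assumed: a pretrained word2vec file on disk; the path is illustrative
wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                        binary=True)

# Stand-in for the network's output at one time step: a 300-dim vector
predicted_embedding = np.asarray(wv['king'], dtype=np.float32)

# Cosine-similarity lookup of the closest vocabulary words
print(wv.similar_by_vector(predicted_embedding, topn=5))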
You need to find the argmax of the probabilities, and translate the index back to a word by reversing the word_to_id map. To get this to work, you must save the probabilities in the model and then fetch them from the run_epoch function (you could also save just the argmax itself). Here's a snippet:
inverseDictionary = dict(zip(word_to_id.values(), word_to_id.keys()))

def run_epoch(...):
    decodedWordId = int(np.argmax(logits))
    print(" ".join([inverseDictionary[int(x1)] for x1 in np.nditer(x)])
          + " got: " + inverseDictionary[decodedWordId]
          + " expected: " + inverseDictionary[int(y)])
See full implementation here: https://github.com/nelken/tf
It is actually an advantage that the function returns probabilities instead of the word itself: since you get the whole vocabulary with associated probabilities, you can do further processing and improve the accuracy of your result.
To answer your question: you can take the list of words, iterate through it, and have the program display the word with the highest probability.
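For instance, a small sketch of that further processing (assuming probabilities is a 1-D NumPy array of next-word probabilities for one position, and word_to_id is the vocabulary mapping from the snippets above): list the top few candidates instead of only the single argmax.

import numpy as np

# probabilities: one row of tf.nn.softmax(logits) for a single position
# word_to_id: the vocabulary mapping used when building the dataset
inverse_dictionary = dict(zip(word_to_id.values(), word_to_id.keys()))

top_k = 5
best_ids = np.argsort(probabilities)[::-1][:top_k]
for word_id in best_ids:
    print(inverse_dictionary[int(word_id)], float(probabilities[word_id]))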
