How can I convert words to vectors (word embedding) if I don't have a predefined dictionary of words? Most word embedding implementations like Word2vec and GloVe use a fixed vocabulary: the inputs to the neural network are one-hot encoded and the hidden layer sizes depend on the vocabulary size, which makes it impossible to add a new word later without re-training all the vectors. I need a network that outputs a fixed-dimensional vector for any arbitrary word. But how do I feed the word into the network? One-hot encoding is not possible, since I don't have a fixed dictionary of words.
Will converting the word to a trigram vector or a bigram vector work? Trigram vectors have been used for sentence embedding (Deep Sentence Embedding Using Long Short-Term Memory Networks), but I doubt they will work equally well for word embedding, since both the network architecture (word embedding uses a shallow network, whereas sentence embedding uses RNNs) and the auxiliary task are different. Please help.
Note:
By "converting to trigram vector" I mean the following :
Let the input word be "CAT" Add #s at the beginning and at the end :
"#CAT#"
List all the possible tri-grams: #CA, CAT, AT#
Each trigram is converted to a one hot encoded vector of dimension NxNxN
where N is my character set size. eg., E("#CA") = {0,0,0,0,0,...,0,1,0,0,0}
The one hot encoded vector of every trigram of the word is added to
get the "tri-gram vector" of the word.
e.g., trigram_vec("CAT") = {0,0,0,0,...0,1,0,0,...0,0,1,0,...0,0,1,0,0,0,0}
Thanks for any help in advance!
Related
I have some 100 given words. Is there any available package/library in Python that I can use to generate some 7-10 sentences using only these 100 words?
Say the words are ['Hello', 'is', 'time', 'you?', 'What', 'how', 'it?', 'are'];
the sentences would be ['Hello how are you?', 'What time is it?'].
You can train a language model (if you have training data -- some news or Wikipedia articles, a book in txt format, ...) and generate sentences from it (here is a detailed tutorial on how to do that).
Or you can use a pretrained language model to give a score to sentences you generate randomly (a sequence of random words) and keep the ones with the highest score. The higher the score, the more grammatical (according to the training corpus) the sentence will be.
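Here is a rough sketch of the second option, assuming GPT-2 from the transformers library as the scorer (the word list, the number of candidates, and the sentence length are arbitrary choices):

import random
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

words = ['Hello', 'is', 'time', 'you?', 'What', 'how', 'it?', 'are']

def score(sentence):
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean negative log-likelihood per token
    return -loss.item()                      # higher score = more plausible sentence

# generate random word sequences and keep the best-scoring ones
candidates = [" ".join(random.sample(words, 4)) for _ in range(200)]
print(sorted(candidates, key=score, reverse=True)[:5])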
As far as I understand BERT's operating logic, it changes 50% of the sentences it takes as input and doesn't touch the rest.
1-) Is the changed part the operation performed with tokenizer.encoder? And is this equal to input_ids?
Then padding is done: a matrix is created according to the specified Max_len, and the empty part is filled with 0s.
After that, a [CLS] token is placed at the start of each sentence and a [SEP] token is placed at the end of the sentence.
2-) Is the input_mask created in this process?
3-) Also, where do we use input_segment?
The input_mask obtained by encoding the sentences does not indicate the presence of [MASK] tokens. Instead, when the batch of sentences is tokenized, prepended with [CLS], and appended with [SEP] tokens, the sequences end up with arbitrary lengths.
To make all the sentences in the batch have a fixed number of tokens, zero padding is performed. The input_mask variable shows whether a given token position contains an actual token or is a zero-padded position.
The [MASK] token is used only if you want to train on the Masked Language Model (MLM) objective.
BERT is trained on two objectives, MLM and Next Sentence Prediction (NSP). In NSP, you pass two sentences and try to predict whether the second sentence is the one that follows the first. segment_id holds the information about which sentence a particular token belongs to.
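As a concrete illustration, here is a minimal sketch using the Hugging Face transformers tokenizer (the model name and max_length are arbitrary choices):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

enc = tokenizer(
    "How are you?",            # sentence A
    "I am fine, thanks.",      # sentence B (the NSP-style pair)
    padding="max_length",
    max_length=16,
    truncation=True,
)

print(enc["input_ids"])        # token IDs, zero-padded up to max_length
print(enc["attention_mask"])   # the "input_mask": 1 for real tokens, 0 for padding
print(enc["token_type_ids"])   # the "segment ids": 0 for sentence A, 1 for sentence B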
What would be the best strategy to mask only specific words during LM training?
My aim is to dynamically mask, at batch time, only words of interest which I have previously collected in a list.
I have already had a look at the mask_tokens() function in the DataCollatorForLanguageModeling class, which is the function that actually masks the tokens during each batch, but I cannot find any efficient and smart way to mask only specific words and their corresponding IDs.
I tried a naive approach consisting of matching all the IDs of each batch against a list of word IDs to mask. However, the for-loop approach has a negative impact on performance.
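For reference, the kind of vectorized masking I am after would look roughly like this (an untested sketch assuming PyTorch tensors; ids_to_mask holds the token IDs of my words of interest, and mlm_probability is just the usual masking rate):

import torch

def mask_only_selected(inputs, tokenizer, ids_to_mask, mlm_probability=0.15):
    labels = inputs.clone()
    # True only where the token ID is one of the words of interest
    candidates = torch.isin(inputs, torch.tensor(ids_to_mask))
    # sample the usual Bernoulli mask, but restricted to the candidate positions
    probability_matrix = torch.full(labels.shape, mlm_probability)
    probability_matrix.masked_fill_(~candidates, value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100   # loss is computed only on masked positions
    inputs[masked_indices] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
    return inputs, labels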
Side issue about word-prefixed space - already fixed
Thanks to #amdex1 and #cronoik for helping with a side issue.
This problem arose because the tokenizer not only splits a single word into multiple tokens, but also adds special characters if the word does not occur at the beginning of a sentence.
E.g.:
The word "Valkyria":
at the beginning of a sentence it gets split as ['V', 'alky', 'ria'] with corresponding IDs [846, 44068, 6374],
while in the middle of a sentence it gets split as ['ĠV', 'alky', 'ria'] with corresponding IDs [468, 44068, 6374].
It is solved by setting add_prefix_space=True in the tokenizer.
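For example (a small sketch assuming a RoBERTa-style byte-level BPE tokenizer from transformers; the model name is only an example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
# with add_prefix_space=True the word is tokenized as if preceded by a space,
# so it gets the same pieces at the start and in the middle of a sentence
print(tokenizer.tokenize("Valkyria"))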
I have a pre-trained word2vec model that I load into spaCy to vectorize new words. Given new text, I run nlp('hi').vector to obtain the vector for the word 'hi'.
Eventually, a new word needs to be vectorized that is not present in the vocabulary of my pre-trained model. In this scenario spaCy defaults to a vector filled with zeros. I would like to be able to set this default vector for OOV terms.
Example:
import spacy
path_model = '/home/bionlp/spacy.bio_word2vec.model'
nlp = spacy.load(path_model)
print(nlp('abcdef').vector, '\n', nlp('gene').vector)
This code outputs a dense vector for the word 'gene' and a vector full of 0s for the word 'abcdef' (since it is not present in the vocabulary).
My goal is to be able to specify the vector for missing words, so instead of getting a vector full of 0s for the word 'abcdef' you can get (for instance) a vector full of 1s.
If you simply want your plug-vector instead of the SpaCy default all-zeros vector, you could just add an extra step where you replace any all-zeros vectors with yours. For example:
words = ['words', 'may', 'by', 'fehlt']
my_oov_vec = ... # whatever you like
spacy_vecs = [nlp(word).vector for word in words]
# keep spaCy's vector when it is non-zero, otherwise substitute the plug vector
fixed_vecs = [vec if vec.any() else my_oov_vec
              for vec in spacy_vecs]
I'm not sure why you'd want to do this. Lots of work with word-vectors simply elides out-of-vocabulary words; using any plug value, including SpaCy's zero-vector, may just be adding unhelpful noise.
And if better handling of OOV words is important, note that some other word-vector models, like FastText, can synthesize better-than-nothing guess-vectors for OOV words, by using vectors learned for subword fragments during training. That's similar to how people can often work out the gist of a word from familiar word-roots.
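For example, with gensim's FastText implementation (a toy sketch; the corpus and the parameters are placeholders):

from gensim.models import FastText

sentences = [["the", "gene", "expression", "was", "measured"],
             ["protein", "levels", "rise", "with", "gene", "activity"]]
model = FastText(sentences, vector_size=32, min_count=1, epochs=50)

# "genes" never occurs in the toy corpus, but FastText assembles a guess-vector
# from its character n-grams instead of raising a KeyError
print(model.wv["genes"][:5])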
I have a Chinese text and I use a Bi-LSTM to predict which of these classes each character of the text belongs to:
B (if the character is at the beginning of a word),
I (if it is inside a word),
E (if it is at the end of a word),
S (if it is a single-character word).
To do this, I took each character of the text and built a dictionary; thanks to this I was able to transform the sequence of characters into a sequence of numbers that I feed to my network (after the padding phase), for example:
dictionary = {'t': 1, 'h': 2, 'e': 3, 'p': 4, 'n': 5}
The pen -> 123 435 -> network -> BIE BIE
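In code, that step looks roughly like this (a toy sketch with the example dictionary above; the helper name is just for illustration):

dictionary = {'t': 1, 'h': 2, 'e': 3, 'p': 4, 'n': 5}

def encode(text):
    # one index per character, word by word ("The pen" -> [1, 2, 3] and [4, 3, 5])
    return [[dictionary[c] for c in word.lower()] for word in text.split()]

print(encode("The pen"))   # [[1, 2, 3], [4, 3, 5]], padded before being fed to the network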
Everything is fine if I'm working with unigrams. However, my network should also read bigrams. How should I handle bigrams? I don't have specific labels for bigrams. (Maybe my network should give two labels for each bigram? That doesn't make sense to me.)