As far as I understand BERT's operating logic, it changes 50% of the sentences it takes as input and doesn't touch the rest.
1) Is the changed part the operation performed with tokenizer.encode? And is this equal to input_ids?
Then padding is done: a matrix is created according to the specified max_len, and the empty positions are filled with 0.
After that, a [CLS] token is placed at the start of each sentence and a [SEP] token is placed at the end.
2) Is input_mask created in this process?
3) In addition, where do we use input_segment?
The input_mask obtained by encoding the sentences does not indicate the presence of [MASK] tokens. Instead, when the sentences in a batch are tokenized, prepended with a [CLS] token, and appended with a [SEP] token, each one ends up with an arbitrary length.
To give all the sentences in the batch a fixed number of tokens, zero padding is performed. The input_mask variable then shows whether a given token position contains an actual token or is a zero-padded position.
The [MASK] token is used only if you want to train on the Masked Language Model (MLM) objective.
BERT is trained on two objectives, MLM and Next Sentence Prediction (NSP). In NSP, you pass two sentences and try to predict whether the second sentence follows the first one or not. segment_id holds the information about which sentence a particular token belongs to.
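To make the three tensors concrete, here is a minimal sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (names chosen for illustration, not taken from the question):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Encode a sentence pair (the NSP-style input), padded to a fixed length.
enc = tokenizer(
    "The cat sat on the mat.",
    "It was very comfortable.",
    padding="max_length",
    max_length=32,
)

print(enc["input_ids"])       # token ids: [CLS] sentence A [SEP] sentence B [SEP], then 0s for padding
print(enc["attention_mask"])  # 1 for real tokens, 0 for padded positions (this is the "input_mask")
print(enc["token_type_ids"])  # 0 for sentence A tokens, 1 for sentence B tokens (the "segment ids")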
Related
What would be the best strategy to mask only specific words during the LM training?
My aim is to dynamically mask, at batch time, only the words of interest that I have previously collected in a list.
I have already had a look at the mask_tokens() function in the DataCollatorForLanguageModeling class, which is the function that actually masks the tokens during each batch, but I cannot find any efficient and smart way to mask only specific words and their corresponding IDs.
I tried one naive approach consisting of matching all the IDs of each batch against a list of word IDs to mask, but such a for-loop approach has a negative impact on performance.
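For reference, here is a vectorized sketch of that ID-matching idea, assuming PyTorch 1.10+ (for torch.isin) and a hypothetical maskable_ids list of token IDs; this is not the Hugging Face collator itself, only an illustration:

import torch

# Hypothetical token ids for the words of interest (would come from the tokenizer).
maskable_ids = torch.tensor([2054, 2154, 3000])

def mask_words_of_interest(input_ids, tokenizer, mlm_probability=0.15):
    # Positions eligible for masking: the token id appears in the list of interest.
    candidate_mask = torch.isin(input_ids, maskable_ids)
    # Among the candidates, keep a random subset with the usual MLM probability.
    probs = torch.full(input_ids.shape, mlm_probability)
    masked_indices = torch.bernoulli(probs).bool() & candidate_mask
    labels = input_ids.clone()
    labels[~masked_indices] = -100            # compute the loss only on masked positions
    masked_inputs = input_ids.clone()
    masked_inputs[masked_indices] = tokenizer.mask_token_id
    return masked_inputs, labels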
Side issue about word prefixed space - Already fixed
Thanks to #amdex1 and #cronoik for helping with a side issue.
This problem arose because the tokenizer not only splits a single word into multiple tokens, but also adds special characters if the word does not occur at the beginning of a sentence.
E.g.:
The word "Valkyria":
at the beginning of a sentence gets split as ['V', 'alky', 'ria'] with corresponding IDs: [846, 44068, 6374];
while in the middle of a sentence as ['ĠV', 'alky', 'ria'] with corresponding IDs: [468, 44068, 6374].
This is solved by setting add_prefix_space=True in the tokenizer.
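A quick sketch of that fix, assuming a RoBERTa-style byte-level BPE tokenizer from transformers (which is where the Ġ marker comes from); the checkpoint name is illustrative:

from transformers import RobertaTokenizer

tok = RobertaTokenizer.from_pretrained("roberta-base", add_prefix_space=True)

# With add_prefix_space=True the word is tokenized as if it were preceded by a space,
# so the start-of-sentence and mid-sentence cases produce the same (Ġ-prefixed) pieces.
print(tok.tokenize("Valkyria"))
print(tok.tokenize("the Valkyria"))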
I have a pre-trained word2vec model that I load into spaCy to vectorize new words. Given new text, I call nlp('hi').vector to obtain the vector for the word 'hi'.
Eventually, a new word needs to be vectorized which is not present in the vocabulary of my pre-trained model. In this scenario spacy defaults to a vector filled with zeros. I would like to be able to set this default vector for OOV terms.
Example:
import spacy

path_model = '/home/bionlp/spacy.bio_word2vec.model'
nlp = spacy.load(path_model)
print(nlp('abcdef').vector, '\n', nlp('gene').vector)
This code outputs a dense vector for the word 'gene' and a vector full of 0s for the word 'abcdef', since it's not present in the vocabulary.
My goal is to be able to specify the vector for missing words, so instead of getting a vector full of 0s for the word 'abcdef' you can get (for instance) a vector full of 1s.
If you simply want your plug vector instead of spaCy's default all-zeros vector, you could just add an extra step where you replace any all-zeros vector with yours. For example:
words = ['words', 'may', 'by', 'fehlt']
my_oov_vec = ...  # whatever you like
spacy_vecs = [nlp(word).vector for word in words]
fixed_vecs = [vec if vec.any() else my_oov_vec
              for vec in spacy_vecs]
I'm not sure why you'd want to do this. Lots of work with word-vectors simply elides out-of-vocabulary words; using any plug value, including SpaCy's zero-vector, may just be adding unhelpful noise.
And if better handling of OOV words is important, note that some other word-vector models, like FastText, can synthesize better-than-nothing guess-vectors for OOV words, by using vectors learned for subword fragments during training. That's similar to how people can often work out the gist of a word from familiar word-roots.
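As a rough illustration of that subword behaviour, here is a sketch assuming Gensim's FastText implementation (gensim 4.x API) and a tiny toy corpus; a real setup would train on a proper corpus or load pretrained vectors:

from gensim.models import FastText

sentences = [["the", "gene", "expression", "was", "measured"],
             ["several", "genes", "were", "analysed"]]

model = FastText(sentences, vector_size=50, window=3, min_count=1, epochs=10)

# "genetic" never appears in the corpus, but FastText assembles a guess-vector
# from the character n-grams it shares with seen words such as "gene"/"genes".
print(model.wv["genetic"][:5])
print("genetic" in model.wv.key_to_index)  # False: it is not in the vocabulary itself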
I am using spaCy's NLP model to work out the POS of input data so that my Markov chains can be a bit more grammatically correct, as with the example in the Python markovify library found here. However, the way that spaCy splits tokens makes reconstructing them difficult, because certain grammatical elements are also split up; for example, "don't" becomes ["do", "n't"]. This means that you can't rejoin generated Markov chains simply with spaces anymore, but need to know whether the tokens make up one word.
I assumed that the is_left_punct and is_right_punct properties of tokens might relate to this, but they don't seem to. My current code simply accounts for PUNCT tokens, but the "do n't" problem persists.
Is there a property of the tokens that I can use to tell the method that joins sentences together when to omit spaces, or some other way to know this?
spaCy tokens have a whitespace_ attribute which is always set.
You can always use that: it contains the actual whitespace when it was present, and is an empty string when it was not.
This occurs in cases like the one you mentioned, when tokenisation splits a continuous string.
So for "don't", the whitespace_ of the "do" token will be the empty string.
For example
[bool(token.whitespace_) for token in nlp("don't")]
Should produce
[False, False]
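Building on that, a minimal sketch of rejoining tokens without introducing spurious spaces (the pipeline name is just an example; any spaCy model with the default tokenizer will do):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I don't know, really.")

# Append each token's text plus the whitespace that actually followed it
# ("" between "do" and "n't", " " elsewhere) to recover the original string.
rebuilt = "".join(token.text + token.whitespace_ for token in doc)
assert rebuilt == "I don't know, really."

# Equivalently, text_with_ws already bundles the token text with its trailing space.
rebuilt_too = "".join(token.text_with_ws for token in doc)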
I hope I don't have to provide an example set.
I have a 2D array where each array contains a set of words from sentences.
I am using a CountVectorizer to effectively call fit_transform on the whole 2D array, such that I can build a vocabulary of words.
However, I have sentences like:
u'Besides EU nations , Switzerland also made a high contribution at Rs 171 million LOCATION_SLOT~-nn+nations~-prep_besides nations~-prep_besides+made~prep_at made~prep_at+rs~num rs~num+NUMBER_SLOT'
And my current vectorizer is too strict: it strips out characters like ~ and + when building tokens. Instead, I want it to treat each whitespace-separated chunk (as split() would produce) as a token in the vocab, i.e. rs~num+NUMBER_SLOT should be a word in itself in the vocab, as should made. At the same time, stopwords like the and a (the normal stopwords set) should be removed.
Current vectorizer:
vectorizer = CountVectorizer(analyzer="word", stop_words=None, tokenizer=None, preprocessor=None, max_features=5000)
You can specify a token_pattern but I am not sure which one I could use to achieve my aims. Trying:
token_pattern="[^\s]*"
Leads to a vocabulary of:
{u'': 0, u'p~prep_to': 3764, u'de~dobj': 1107, u'wednesday': 4880, ...}
Which messes things up as u'' is not something I want in my vocabulary.
What is the right token_pattern for the type of vocabulary_ I want to build?
I have figured this out. The vectorizer was allowing 0 or more non-whitespace items - it should allow 1 or more. The correct CountVectorizer is:
CountVectorizer(analyzer="word", token_pattern=r"[\S]+", tokenizer=None, preprocessor=None, stop_words=None, max_features=5000)
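As a quick sanity check of that pattern (a sketch; the sample string is shortened from the one in the question, and CountVectorizer lowercases by default):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Besides EU nations , Switzerland also made a high contribution "
        "rs~num+NUMBER_SLOT made~prep_at+rs~num"]

vectorizer = CountVectorizer(analyzer="word", token_pattern=r"[\S]+",
                             tokenizer=None, preprocessor=None,
                             stop_words=None, max_features=5000)
vectorizer.fit(docs)

# Every whitespace-separated chunk becomes its own vocabulary entry, including
# the ~/+ compound features, and there is no empty-string entry any more.
print(sorted(vectorizer.vocabulary_))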
How can I convert words to vectors (word embeddings) if I don't have a predefined dictionary of words? Most word embedding implementations like Word2vec and GloVe have a fixed dictionary of words. The inputs to the neural network are one-hot encoded, and the hidden layer sizes also depend on the vocab size, which makes it impossible to add a new word later on without re-training all the vectors. I need a network that outputs a fixed-dimensional vector for any arbitrary word input. But how do I feed the 'word' into the network? One-hot encoding is not possible, as I don't have a fixed dictionary of words.
Would converting the word to a trigram vector or a bigram vector work? Trigram vectors have been used for sentence embedding (Deep Sentence Embedding Using Long Short-Term Memory Networks), but I doubt it will work equally well for word embedding, since both the network architecture (word embedding uses a shallow network whereas sentence embedding uses RNNs) and the auxiliary task change. Please help.
Note:
By "converting to trigram vector" I mean the following :
Let the input word be "CAT" Add #s at the beginning and at the end :
"#CAT#"
List all the possible tri-grams: #CA, CAT, AT#
Each trigram is converted to a one hot encoded vector of dimension NxNxN
where N is my character set size. eg., E("#CA") = {0,0,0,0,0,...,0,1,0,0,0}
The one hot encoded vector of every trigram of the word is added to
get the "tri-gram vector" of the word.
e.g., trigram_vec("CAT") = {0,0,0,0,...0,1,0,0,...0,0,1,0,...0,0,1,0,0,0,0}
Thanks for any help in advance!
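For concreteness, here is a minimal sketch of the trigram-vector construction described in the note above, assuming a character set of # plus lowercase a-z (so N = 27 and the vector has N^3 = 19683 dimensions):

import numpy as np

CHARS = "#abcdefghijklmnopqrstuvwxyz"   # assumed character set; N = 27
N = len(CHARS)
IDX = {c: i for i, c in enumerate(CHARS)}

def trigram_index(tri):
    # Map a 3-character trigram to a single index in [0, N**3).
    a, b, c = (IDX[ch] for ch in tri)
    return (a * N + b) * N + c

def trigram_vector(word):
    # Sum the one-hot encodings of all trigrams of "#word#".
    padded = "#" + word.lower() + "#"
    vec = np.zeros(N ** 3)
    for i in range(len(padded) - 2):
        vec[trigram_index(padded[i:i + 3])] += 1
    return vec

v = trigram_vector("CAT")   # trigrams: #ca, cat, at#
print(v.shape)              # (19683,)
print(v.sum())              # 3.0, one count per trigram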