I'm new to NLP (pardon the very noob question!), and am looking for a way to perform vector operations on sentence embeddings (e.g., randomization in embedding-space in a uniform ball around a given sentence) and then decode them. I'm currently attempting to use the following strategy with T5 and Huggingface Transformers:
1. Encode the text with T5Tokenizer.
2. Run a forward pass through the encoder with model.encoder and use the last hidden state as the embedding. (I've tried .generate as well, but it doesn't allow me to use the decoder separately from the encoder.)
3. Perform any desired operations on the embedding.
4. The problematic step: pass the embedding through model.decoder and decode the output with the tokenizer.
I'm having trouble with (4). My sanity check: I set (3) to do nothing (no change to the embedding), and I check whether the resulting text is the same as the input. So far, that check always fails.
I get the sense that I'm missing something rather important (perhaps related to the lack of beam search or another generation method?). I'm also unsure whether what I'm treating as an embedding in (2) is actually one.
How would I go about encoding a sentence embedding with T5, modifying it in that vector space, and then decoding it into generated text? Also, might another model be a better fit?
As a sample, below is my incredibly broken code, based on this:
import transformers

t5_model = transformers.T5ForConditionalGeneration.from_pretrained("t5-large")
t5_tok = transformers.T5Tokenizer.from_pretrained("t5-large")

text = "Foo bar is typing some words."
input_ids = t5_tok(text, return_tensors="pt").input_ids

# Encoder forward pass; I treat the last hidden state as the sentence embedding.
encoder_output_vectors = t5_model.encoder(input_ids, return_dict=True).last_hidden_state

# The rest is what I think is problematic:
decoder_input_ids = t5_tok("<pad>", return_tensors="pt", add_special_tokens=False).input_ids
decoder_output = t5_model.decoder(decoder_input_ids, encoder_hidden_states=encoder_output_vectors)
t5_tok.decode(decoder_output.last_hidden_state[0].softmax(0).argmax(1))
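For clarity, this is the kind of round trip I'm after; a minimal sketch, assuming (I haven't verified this) that generate() will reuse precomputed encoder outputs when they are passed in, with a no-op placeholder where the embedding-space operation would go:

import torch
import transformers
from transformers.modeling_outputs import BaseModelOutput

t5_model = transformers.T5ForConditionalGeneration.from_pretrained("t5-large")
t5_tok = transformers.T5Tokenizer.from_pretrained("t5-large")

input_ids = t5_tok("Foo bar is typing some words.", return_tensors="pt").input_ids

# Encode once and keep the per-token hidden states as the "embedding".
hidden = t5_model.encoder(input_ids, return_dict=True).last_hidden_state

# Placeholder for the embedding-space operation (currently a no-op perturbation).
hidden = hidden + 0.0 * torch.randn_like(hidden)

# Hand the (possibly modified) encoder states to generate(), which runs the
# decoder with a proper decoding strategy instead of a single forward pass.
out_ids = t5_model.generate(
    encoder_outputs=BaseModelOutput(last_hidden_state=hidden),
    max_length=20,
)
print(t5_tok.decode(out_ids[0], skip_special_tokens=True))

This only shows the structure of the round trip; whether an unmodified embedding should reproduce the input text exactly is part of what I'm unsure about.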
Related
Following this PyTorch tutorial, I'm able to create and train a transformer model on a custom dataset. The problem is, I've scoured the web and have found no clear answers: how do I use this model to generate text? I took a stab at it by encoding my SOS token and seed text and passing them through the model's forward method, but this only produces repeating garbage. The src_mask doesn't appear to be the right size, or isn't functioning at all.
def generate(model: nn.Module, src_text: str):
    src = BeatleSet.encode(src_text.lower())  # encodes seed text
    SOS = BeatleSet.textTokDict['<sos>']; EOS = BeatleSet.textTokDict['<eos>']  # obtain sos and eos tokens
    model.eval()
    entry = [SOS] + src
    y_input = torch.tensor([entry], dtype=torch.long, device=device)  # convert entry to tensor
    num_tokens = len(BeatleSet)
    for i in range(50):
        src_mask = generate_square_subsequent_mask(y_input.size(0)).to(device)  # creates a mask of size 1,1 (???)
        pred = model(y_input, src_mask)  # passed through forward method
        next_item = pred.topk(1)[1].view(-1)[-1].item()  # selecting the most probable next token (I think)
        next_item = torch.tensor([[next_item]], device=device)
        y_input = torch.cat((y_input, next_item), dim=1)  # appended to inputs to be run again
        if next_item.view(-1).item() == EOS:
            break
    return " ".join(BeatleSet.decode(y_input.view(-1).tolist()))
print(generate(model, "Oh yeah I"))
For the record, I'm following the architecture to the letter. This should be reproducible with the WikiText-2 dataset used in the tutorial. Please advise; I've been banging my head against this one.
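My current suspicion, sketched below under the assumption that the tutorial's model expects inputs of shape [seq_len, batch_size] and a causal mask of shape [seq_len, seq_len], is that the mask has to grow with the generated sequence rather than stay at size 1. I haven't confirmed that this alone fixes the repeating output:

entry = [SOS] + src
y_input = torch.tensor([entry], dtype=torch.long, device=device).t()  # shape [seq_len, 1]
for i in range(50):
    seq_len = y_input.size(0)
    src_mask = generate_square_subsequent_mask(seq_len).to(device)  # [seq_len, seq_len], grows each step
    pred = model(y_input, src_mask)          # [seq_len, 1, num_tokens]
    next_item = pred[-1, 0].argmax().item()  # most probable token at the last position
    y_input = torch.cat((y_input, torch.tensor([[next_item]], device=device)), dim=0)
    if next_item == EOS:
        break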
Using HuggingFace's pipeline tool, I was surprised to find that there was a significant difference in output when using the fast vs slow tokenizer.
Specifically, when I run the fill-mask pipeline, the probabilities assigned to the words that would fill in the mask are not the same for the fast and slow tokenizer. Moreover, while the predictions of the fast tokenizer remain constant regardless of the number and length of sentences input, the same is not true for the slow tokenizer.
Here's a minimal example:
from transformers import pipeline
slow = pipeline('fill-mask', model='bert-base-cased', \
tokenizer=('bert-base-cased', {"use_fast": False}))
fast = pipeline('fill-mask', model='bert-base-cased', \
tokenizer=('bert-base-cased', {"use_fast": True}))
s1 = "This is a short and sweet [MASK]." # "example"
s2 = "This is [MASK]." # "shorter"
slow([s1, s2])
fast([s1, s2])
slow([s2])
fast([s2])
Each pipeline call yields the top-5 tokens that could fill in for [MASK], along with their probabilities. I've omitted the actual outputs for brevity, but the probabilities assigned to the words that fill in [MASK] for s2 are not the same across all of the calls: the final three calls give the same probabilities, while the first yields different ones. The differences are large enough that the top-5 predictions are not consistent across the two groups.
The cause, as far as I can tell, is that the fast and slow tokenizers return different outputs. The fast tokenizer standardizes the sequence length to 512 by padding with 0s and then creates an attention mask that blocks out the padding. In contrast, the slow tokenizer only pads to the length of the longest sequence and does not create such an attention mask; instead, it sets the token type id of the padding to 1 (rather than 0, which is the type of the non-padding tokens). By my understanding of HuggingFace's implementation (found here), these are not equivalent.
Does anyone know if this is intentional?
It appears that this bug was fixed sometime between transformers-2.5.0 and transformers-2.8.0 or tokenizers-0.5.0 and tokenizers-0.5.2. Upgrading my install of transformers (and tokenizers) solved this.
(Let me know if this question / answer is trivial, and should be deleted)
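For anyone who wants to double-check after upgrading, here is a quick sketch that compares what the two tokenizers actually produce, assuming a recent transformers release where both should now yield identical input_ids and attention masks:

from transformers import AutoTokenizer

slow_tok = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
fast_tok = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)

batch = ["This is a short and sweet [MASK].", "This is [MASK]."]

# Pad to the longest sequence in the batch and compare the raw encodings.
slow_enc = slow_tok(batch, padding=True)
fast_enc = fast_tok(batch, padding=True)

print(slow_enc["input_ids"] == fast_enc["input_ids"])            # expected: True
print(slow_enc["attention_mask"] == fast_enc["attention_mask"])  # expected: True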
I would like to know if it's possible for me to use my own tokenized/segmented documents (with my own vocab file as well) as the input file to the create_pretraining_data.py script (git source: https://github.com/google-research/bert).
The main reason for this question is because the segmentation/tokenization for the Khmer language is different than that of English.
Original:
វាមានមកជាមួយនូវ
Segmented/Tokenized:
វា មាន មក ជាមួយ នូវ
I tried something on my own and managed to get some results after running the create_pretraining_data.py and run_pretraining.py scripts. However, I'm not sure if what I'm doing can be considered correct.
I also would like to know the method that I should use to verify my model.
Any help is highly appreciated!
Script Modifications
The modifications that I did were:
1. Make input file in a list format
Instead of normal plain text, my input file comes from my custom Khmer tokenization output, which I then convert into a list format mimicking the structure I get when running the sample English text.
[[['ដំណាំ', 'សាវម៉ាវ', 'ជា', 'ប្រភេទ', 'ឈើ', 'ហូប', 'ផ្លែ'],
['វា', 'ផ្តល់', 'ផប្រយោជន៍', 'យ៉ាង', 'ច្រើន', 'ដល់', 'សុខភាព']],
[['cmt', '$', '270', 'នាំ', 'លាភ', 'នាំ', 'សំណាង', 'ហេង', 'ហេង']]]
* The outer bracket indicates a source file, the first nested bracket indicates a document, and the second nested bracket indicates a sentence. This is exactly the same structure as the all_documents variable inside the create_training_instances() function.
2. Vocab file from unique segmented words
This is the part that I have serious doubts about. To create my vocab file, all I did was collect the unique tokens from the whole set of documents and then add the core special tokens [CLS], [SEP], [UNK] and [MASK]. I'm not sure if this is the correct way to do it.
Feedback on this part is highly appreciated!
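In code, the vocab file is built roughly like this (just a sketch; the input file name is a placeholder, and I've put [PAD] as the first entry because the released English vocab.txt starts with it):

import ast

# Load the nested [document][sentence][token] structure shown above.
# "khmer_documents.txt" is a placeholder for my actual input file.
with open("khmer_documents.txt", "r", encoding="utf-8") as reader:
    all_documents = ast.literal_eval(reader.read())

special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
unique_tokens = sorted({
    token
    for document in all_documents
    for sentence in document
    for token in sentence
})

# BERT's vocab.txt format: one token per line; the line number is the token id.
with open("vocab.txt", "w", encoding="utf-8") as writer:
    for token in special_tokens + unique_tokens:
        writer.write(token + "\n")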
3. Skip tokenization step inside the create_training_instances() function
Since my input file already matches the structure of the all_documents variable, I skip lines 183 to 207 and replace them with code that reads my input as-is:
import ast

all_documents = []
for input_file in input_files:
  with tf.gfile.GFile(input_file, "r") as reader:
    # Each file already contains the nested [document][sentence][token] lists.
    all_documents.extend(ast.literal_eval(reader.read()))
Results/Output
The raw input file (before custom tokenization) is from random web-scraping.
Some information on the raw and vocab file:
Number of documents/articles: 5
Number of sentences: 78
Number of vocabs: 649 (including [CLS], [SEP] etc.)
Below is the output (tail end of it) after running the create_pretraining_data.py
And this is what I get after running the run_pretraining.py
As shown in the output above, I'm getting a very low accuracy, hence my concern about whether I'm doing this correctly.
First of all, you seem to have very little training data (you mention a vocabulary size of 649). BERT is a huge model which needs a lot of training data. The English models published by Google are trained on at least all of English Wikipedia. Think about that!
BERT uses something called WordPiece, which guarantees a fixed vocabulary size. Rare words are split into subword pieces, e.g. Jet makers feud over seat width with big orders at stake becomes _J et _makers _fe ud _over _seat _width _with _big _orders _at _stake.
WordpieceTokenizer.tokenize(text) takes text that has already been pre-tokenized by whitespace, so you should replace the BasicTokenizer, which normally runs before the WordpieceTokenizer, with your specific tokenizer that separates your tokens by whitespace.
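Something along these lines might be what you need (an untested sketch against the repo's tokenization.py, assuming your vocab.txt already exists):

import tokenization

# Skip BasicTokenizer entirely: the Khmer text is already whitespace-segmented,
# and WordpieceTokenizer splits its input on whitespace by itself.
vocab = tokenization.load_vocab("vocab.txt")
wordpiece = tokenization.WordpieceTokenizer(vocab=vocab)

segmented = "វា មាន មក ជាមួយ នូវ"
tokens = wordpiece.tokenize(segmented)                   # words not in vocab become [UNK]
ids = tokenization.convert_tokens_to_ids(vocab, tokens)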
To train your own WordPiece tokenizer, have a look at SentencePiece, which in BPE mode is essentially the same as WordPiece.
You can then export a vocabulary list from the trained model.
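For example, here is a rough sketch with the sentencepiece Python package (the file names and vocabulary size are placeholders):

import sentencepiece as spm

# Train a BPE model on the raw Khmer text, one sentence per line.
spm.SentencePieceTrainer.train(
    input="khmer_corpus.txt",   # placeholder path
    model_prefix="khmer_bpe",
    vocab_size=1000,            # placeholder size
    model_type="bpe",
)

# khmer_bpe.vocab lists one piece per line and can be turned into a
# BERT-style vocab.txt; the model can also segment new text directly.
sp = spm.SentencePieceProcessor(model_file="khmer_bpe.model")
print(sp.encode("វាមានមកជាមួយនូវ", out_type=str))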
I did not pretrain a BERT model myself, so I cannot tell you exactly where in the code you would need to change things.
I ran a word2vec algorithm on text of about 750k words (before removing some stop words). Using my model, I started looking at the most similar words to particular words of my choosing, and the similarity scores (from the model.wv.most_similar method) are all super close to 1. The tenth-closest score is still around .998, so I don't see any meaningful separation between words, which makes the "most similar" lists meaningless.
My constructor for the model is
model = Word2Vec(all_words, size=75, min_count=30, window=10, sg=1)
I think the problem may lie in how I structure the text to run the neural net on. I store all the words like so:
all_sentences = nltk.sent_tokenize(v)
all_words = [nltk.word_tokenize(sent) for sent in all_sentences]
all_words = [[word for word in all_words[0] if word not in nltk.corpus.stopwords.words('english')]]
...where v is the result of calling read() on a txt file.
Have you looked at all_words, just before passing it to Word2Vec, to make sure it contains the size and variety of corpus you expected? (That last stop-word stripping step looks like it'll only operate on the very 1st sentence, all_words[0].)
Also, have you enabled logging at the INFO level, and watched the output for indicators of the model's final vocabulary size & training progress, to check if those values are as expected?
Note that removing stopwords isn't strictly necessary for word2vec training. Their presence doesn't hurt much, and the default frequent-word downsampling, controlled by the sample parameter, already serves to often-ignore very-frequent words like stopwords.
(Also, min_count=30 is fairly aggressive for a smallish corpus.)
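Enabling that logging is a one-liner; this is the standard Python logging setup, nothing gensim-specific beyond the level:

import logging

# Surface gensim's progress reports: effective vocabulary size after min_count,
# training epochs, words/sec, etc.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)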
Based on my knowledge, I recommend the following:
Use sg=0 to train a continuous-bag-of-words model instead of the skip-gram model. CBOW works better on smaller datasets; the skip-gram model in the original paper was trained on about 1 billion words.
Use min_count=5, which is what the paper used with that billion-word corpus. I think 30 is far too aggressive for your data.
Don't remove the stop words, as doing so changes the neighboring words inside the moving window.
Use more iterations, e.g. iter=10.
Use gensim.utils.simple_preprocess instead of word_tokenize, since the punctuation isn't helpful in this case.
Also, I recommend splitting your dataset into paragraphs instead of sentences, though I don't know whether that is applicable to your dataset.
When following these steps, your code should be:
>>> import nltk
>>> from gensim.models import Word2Vec
>>> from gensim.utils import simple_preprocess
>>> all_sentences = nltk.sent_tokenize(v)
>>> all_words = [simple_preprocess(sent) for sent in all_sentences]
>>> # define the model
>>> model = Word2Vec(all_words, size=75, min_count=5, window=10, sg=0, iter=10)
I am working on a text classification task where my dataset contains a lot of abbreviations and proper nouns, for instance: Milka choc. bar.
My idea is to use bidirectional LSTM model with word2vec embedding.
And here is my problem: how do I encode words that do not appear in the dictionary?
I partially solved this problem by merging pre-trained vectors with randomly initialized ones. Here is my implementation:
import numpy as np
import gensim
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from gensim.models.keyedvectors import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format('ru.vec', binary=False, unicode_errors='ignore')

EMBEDDING_DIM = 300
vocabulary_size = min(len(word_index) + 1, num_words)
embedding_matrix = np.zeros((vocabulary_size, EMBEDDING_DIM))

for word, i in word_index.items():
    if i >= num_words:
        continue
    try:
        # Use the pre-trained vector when the word is in the vocabulary...
        embedding_vector = word_vectors[word]
        embedding_matrix[i] = embedding_vector
    except KeyError:
        # ...otherwise fall back to a random initialization.
        embedding_matrix[i] = np.random.normal(0, np.sqrt(0.25), EMBEDDING_DIM)
def LSTMModel(X, words_nb, embed_dim, num_classes):
    _input = Input(shape=(X.shape[1],))
    X = embedding_layer = Embedding(words_nb,
                                    embed_dim,
                                    weights=[embedding_matrix],
                                    trainable=True)(_input)
    X = The_rest_of__the_LSTM_model()(X)
Do you think that allowing the model to adjust the embedding weights is a good idea?
Could you please tell me how I can encode words like choc? Obviously, this abbreviation stands for chocolate.
It is often not a good idea to adjust word2vec embeddings if you do not have a sufficiently large training corpus. To see why, take an example where your corpus contains television but not TV. Even though both might have word2vec embeddings, after training only television will be adjusted and not TV, so you disrupt the information coming from word2vec.
To solve this problem you have 3 options:
1. Let the LSTM in the upper layers figure out what the word might mean based on its context. For example, in I like choc the LSTM can figure out that it refers to an object. This was demonstrated by Memory Networks.
2. The easy option: pre-process and canonicalise as much as you can before passing the text to the model (see the sketch after this list). Spell checkers often capture these very well and are really fast.
3. Use character encoding alongside word2vec. This is employed in many question-answering models such as BiDAF, where a character-based representation is merged with word2vec so you have some information relating characters to words. In this case, choc might end up similar to chocolate.
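A minimal sketch of option 2, assuming you maintain a small abbreviation dictionary (the mapping below is made up for illustration; a spell checker could populate it automatically):

# Hypothetical abbreviation dictionary for illustration only.
CANONICAL = {
    "choc": "chocolate",
    "choc.": "chocolate",
}

def canonicalise(tokens):
    """Replace known abbreviations with their full forms before the embedding lookup."""
    return [CANONICAL.get(token.lower(), token) for token in tokens]

print(canonicalise(["Milka", "choc.", "bar"]))  # ['Milka', 'chocolate', 'bar']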
One way to do this would be to add a function that maps your abbreviations to the existing vectors that are most likely related, i.e. initialize the choc vector to the chocolate vector in w2v. You can find candidates by comparing prefixes, e.g. checking whether word_in_your_embedding_matrix[:len(abbreviated_word)] matches the abbreviation.
There are two possible cases:
If there is only one candidate that starts with the same n letters as your abbreviation, you can initialize your abbreviation embedding with that candidate's vector.
If there are multiple words that start with the same n letters as your abbreviation, you can use their average as your initialization vector.
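A rough sketch of that idea with gensim KeyedVectors, reusing the ru.vec file, embedding_matrix and word_index from the question's code; this assumes a pre-4.0 gensim where KeyedVectors exposes index2word, and of course a prefix match is only a heuristic:

import numpy as np
from gensim.models.keyedvectors import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format('ru.vec', binary=False, unicode_errors='ignore')

def prefix_init_vector(abbreviation, word_vectors, embedding_dim=300):
    """Average the vectors of all vocabulary words that share the abbreviation's prefix."""
    prefix = abbreviation.lower().rstrip('.')
    candidates = [w for w in word_vectors.index2word if w.lower().startswith(prefix)]
    if not candidates:
        # No prefix match: fall back to the random initialization used above.
        return np.random.normal(0, np.sqrt(0.25), embedding_dim)
    return np.mean([word_vectors[w] for w in candidates], axis=0)

# Example: seed the embedding row for "choc" before training the LSTM.
embedding_matrix[word_index['choc']] = prefix_init_vector('choc', word_vectors)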