Fast and slow tokenizers yield different results - python

Using HuggingFace's pipeline tool, I was surprised to find that there was a significant difference in output when using the fast vs slow tokenizer.
Specifically, when I run the fill-mask pipeline, the probabilities assigned to the words that would fill in the mask are not the same for the fast and slow tokenizer. Moreover, while the predictions of the fast tokenizer remain constant regardless of the number and length of sentences input, the same is not true for the slow tokenizer.
Here's a minimal example:
from transformers import pipeline
slow = pipeline('fill-mask', model='bert-base-cased',
                tokenizer=('bert-base-cased', {"use_fast": False}))
fast = pipeline('fill-mask', model='bert-base-cased',
                tokenizer=('bert-base-cased', {"use_fast": True}))
s1 = "This is a short and sweet [MASK]." # "example"
s2 = "This is [MASK]." # "shorter"
slow([s1, s2])
fast([s1, s2])
slow([s2])
fast([s2])
Each pipeline call yields the top-5 tokens that could fill in for [MASK], along with their probabilities. I've omitted the actual outputs for brevity, but the probabilities assigned to the words filling in [MASK] for s2 are not the same across all of these calls. The final three calls give the same probabilities, while the first yields different ones. The differences are large enough that even the top-5 tokens are not consistent across the two groups.
The cause, as far as I can tell, is that the fast and slow tokenizers return different outputs. The fast tokenizer standardizes the sequence length to 512 by padding with 0s, and then creates an attention mask that blocks out the padding. In contrast, the slow tokenizer only pads to the length of the longest sequence and does not create such an attention mask. Instead, it sets the token type id of the padding to 1 (rather than 0, which is the type of the non-padding tokens). By my understanding of HuggingFace's implementation (found here), these are not equivalent.
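For anyone who wants to check this directly, here is a rough sketch of how the two tokenizers' raw outputs can be compared. It uses the newer AutoTokenizer calling convention rather than the tuple form passed to pipeline above, so treat it as illustrative rather than an exact reproduction of the 2.x behavior:
from transformers import AutoTokenizer

slow_tok = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
fast_tok = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)

batch = ["This is a short and sweet [MASK].", "This is [MASK]."]
# Encode the same batch with both tokenizers and compare field by field.
slow_enc = slow_tok(batch, padding=True)
fast_enc = fast_tok(batch, padding=True)
for key in ("input_ids", "token_type_ids", "attention_mask"):
    print(key, slow_enc[key] == fast_enc[key])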
Does anyone know if this is intentional?

It appears that this bug was fixed sometime between transformers-2.5.0 and transformers-2.8.0 or tokenizers-0.5.0 and tokenizers-0.5.2. Upgrading my install of transformers (and tokenizers) solved this.
(Let me know if this question / answer is trivial, and should be deleted)

Related

Decoding hidden layer embeddings in T5

I'm new to NLP (pardon the very noob question!), and am looking for a way to perform vector operations on sentence embeddings (e.g., randomization in embedding-space in a uniform ball around a given sentence) and then decode them. I'm currently attempting to use the following strategy with T5 and Huggingface Transformers:
1. Encode the text with T5Tokenizer.
2. Run a forward pass through the encoder with model.encoder. Use the last hidden state as the embedding. (I've tried .generate as well, but it doesn't allow me to use the decoder separately from the encoder.)
3. Perform any desired operations on the embedding.
4. The problematic step: pass it through model.decoder and decode with the tokenizer.
I'm having trouble with (4). My sanity check: I set (3) to do nothing (no change to the embedding), and I check whether the resulting text is the same as the input. So far, that check always fails.
I get the sense that I'm missing something rather important (something to do with the lack of beam search or some other similar generation method?). I'm unsure of whether what I think is an embedding (as in (2)) is even correct.
How would I go about encoding a sentence embedding with T5, modifying it in that vector space, and then decoding it into generated text? Also, might another model be a better fit?
As a sample, below is my incredibly broken code, based on this:
import transformers

t5_model = transformers.T5ForConditionalGeneration.from_pretrained("t5-large")
t5_tok = transformers.T5Tokenizer.from_pretrained("t5-large")
text = "Foo bar is typing some words."
input_ids = t5_tok(text, return_tensors="pt").input_ids
encoder_output_vectors = t5_model.encoder(input_ids, return_dict=True).last_hidden_state
# The rest is what I think is problematic:
decoder_input_ids = t5_tok("<pad>", return_tensors="pt", add_special_tokens=False).input_ids
decoder_output = t5_model.decoder(decoder_input_ids, encoder_hidden_states=encoder_output_vectors)
t5_tok.decode(decoder_output.last_hidden_state[0].softmax(0).argmax(1))
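For illustration, here is a hedged sketch of one way the pieces might fit together. It assumes T5ForConditionalGeneration's forward accepts encoder_outputs and exposes a decoder_start_token_id in its config, and that greedy token-by-token decoding is acceptable; it is a sketch under those assumptions, not a confirmed fix:
import torch
import transformers

t5_model = transformers.T5ForConditionalGeneration.from_pretrained("t5-large")
t5_tok = transformers.T5Tokenizer.from_pretrained("t5-large")

text = "Foo bar is typing some words."
input_ids = t5_tok(text, return_tensors="pt").input_ids
encoder_outputs = t5_model.encoder(input_ids, return_dict=True)
# ... modify encoder_outputs.last_hidden_state here if desired ...

# Greedy, token-by-token decoding from the (possibly modified) encoder states.
decoder_ids = torch.tensor([[t5_model.config.decoder_start_token_id]])
for _ in range(20):
    out = t5_model(decoder_input_ids=decoder_ids, encoder_outputs=encoder_outputs)
    next_id = out.logits[:, -1, :].argmax(-1, keepdim=True)  # logits come from the lm_head projection
    decoder_ids = torch.cat([decoder_ids, next_id], dim=-1)
    if next_id.item() == t5_tok.eos_token_id:
        break
print(t5_tok.decode(decoder_ids[0], skip_special_tokens=True))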

How does gensim manage to find the most similar words so fast?

Let's say we train a model with more than 1 million words. To find the most similar words, we need to calculate the distance between the embedding of the test word and the embeddings of all 1 million words, and then pick the nearest ones. Gensim seems to calculate the results very fast, yet when I try to compute the most similar words myself, my function is extremely slow:
import operator
import numpy as np

def euclidean_most_similars(model, word, topn=10):
    distances = {}
    vec1 = model[word]
    for item in model.wv.vocab:
        if item != word:
            vec2 = model[item]
            dist = np.linalg.norm(vec1 - vec2)
            distances[(word, item)] = dist
    sorted_distances = sorted(distances.items(), key=operator.itemgetter(1))
    return sorted_distances[:topn]
I would like to know how Gensim manages to find the nearest words so fast, and what an efficient way to calculate the most similar words would be.
As @g-anderson commented, the gensim source can be reviewed to see exactly what it does. However, gensim is not actually using any of its own optimized Cython or compiled-C code as part of its most_similar() method – which can be reviewed at:
https://github.com/RaRe-Technologies/gensim/blob/b287fd841c31d0dfa899d784da0bd5b3669e104d/gensim/models/keyedvectors.py#L689
Instead, by using numpy/scipy bulk array operations, those libraries' highly optimized code will take advantage of both CPU primitives and multithreading to calculate all the relevant similarities far faster than an interpreted Python loop.
(The key workhorse is the numpy dot operation: one call which creates an ordered array of all the similarities – skipping the loop & your interim-results dict entirely. But the argsort, passing through to numpy implementations as well, likely also outperforms the idiomatic sorted().)
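To make the contrast concrete, here is a minimal sketch of that bulk-array approach. It is not gensim's actual code; it assumes you already have a (V, D) matrix of unit-normalized vectors and a word-to-row-index dict:
import numpy as np

def most_similar_bulk(word_vectors, vocab_index, word, topn=10):
    # word_vectors: (V, D) array of unit-normalized vectors
    # vocab_index: dict mapping word -> row index
    vec = word_vectors[vocab_index[word]]
    sims = word_vectors @ vec                  # every cosine similarity in one call
    best = np.argsort(-sims)                   # indices sorted by descending similarity
    index_to_word = {i: w for w, i in vocab_index.items()}
    return [(index_to_word[i], float(sims[i]))
            for i in best if i != vocab_index[word]][:topn]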

Word2Vec Vocab Similarities

I ran a word2vec algorithm on text of about 750k words (before removing some stop words). Using my model, I started looking at the most similar words to particular words of my choosing, and the similarity scores (from the model.wv.most_similar method) are all super close to 1. The tenth-closest score is still around .998, so I feel like I'm not getting any meaningful differences between word similarities, which leads to meaningless "most similar" results.
My constructor for the model is
model = Word2Vec(all_words, size=75, min_count=30, window=10, sg=1)
I think the problem may lie in how I structure the text to run the neural net on. I store all the words like so:
all_sentences = nltk.sent_tokenize(v)
all_words = [nltk.word_tokenize(sent) for sent in all_sentences]
all_words = [[word for word in all_words[0] if word not in nltk.corpus.stopwords.words('english')]]
...where v is the result of calling read() on a txt file.
Have you looked at all_words, just before passing it to Word2Vec, to make sure it contains the size and variety of corpus you expected? (That last stop-word stripping step looks like it'll only operate on the very 1st sentence, all_words[0].)
Also, have you enabled logging at the INFO level, and watched the output for indicators of the model's final vocabulary size & training progress, to check if those values are as expected?
Note that removing stopwords isn't strictly necessary for word2vec training. Their presence doesn't hurt much, and the default frequent-word downsampling, controlled by the sample parameter, already serves to often-ignore very-frequent words like stopwords.
(Also, min_count=30 is fairly aggressive for a smallish corpus.)
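For reference, a quick sketch of both checks (this assumes NLTK's stopwords corpus is already downloaded): enable INFO logging before building the model, and strip stop-words per sentence rather than only from all_words[0]:
import logging
from nltk.corpus import stopwords

# Gensim logs final vocabulary size and training progress at the INFO level.
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s",
                    level=logging.INFO)

stop_set = set(stopwords.words('english'))
all_words = [[word for word in sent if word not in stop_set]
             for sent in all_words]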
Based on my knowledge, I recommend the following:
Use sg=0 to use the continuous bag-of-words model instead of the skip-gram model. CBOW works better on smaller datasets; the skip-gram model in the original paper was trained on over 1 billion words.
Use min_count=5, which is what was used in the paper on that much larger corpus. I think 30 is way too high for your data.
Don't remove the stop words, as doing so changes the neighboring words inside the moving window.
Use more iterations, e.g. iter=10.
Use gensim.utils.simple_preprocess instead of word_tokenize, as the punctuation isn't helpful in this case.
Also, I recommend splitting your dataset into paragraphs instead of sentences, though I don't know whether that is applicable to your data.
When following these steps, your code should be:
>>> import nltk
>>> from gensim.models import Word2Vec
>>> from gensim.utils import simple_preprocess
>>> all_sentences = nltk.sent_tokenize(v)
>>> all_words = [simple_preprocess(sent) for sent in all_sentences]
>>> # define the model
>>> model = Word2Vec(all_words, size=75, min_count=5, window=10, sg=0, iter=10)

How to Compare Sentences with an idea of the positions of keywords?

I want to compare two sentences. As an example,
sentence1="football is good,cricket is bad"
sentence2="cricket is good,football is bad"
Generally these sentences have no relationship, i.e. they have different meanings. But when I compare them with Python NLTK tools, I get 100% similarity. How can I fix this issue?
Yes, wup_similarity internally uses synsets for single tokens to calculate similarity.
Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).
Since the ancestor nodes for cricket and football would be the same, wup_similarity will return 1.
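You can check this yourself against WordNet directly; here is a small sketch (the sense numbers below are my guess at the sport senses, so verify them with wn.synsets first):
from nltk.corpus import wordnet as wn

# Pick the sport senses (check wn.synsets('cricket') / wn.synsets('football')).
cricket = wn.synset('cricket.n.02')
football = wn.synset('football.n.01')
print(cricket.wup_similarity(football))  # high, since the two senses share close ancestors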
If you want to fix this issue, wup_similarity is not a good choice.
The simplest token-based way would be to fit a vectorizer and then calculate cosine similarity.
E.g.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
corpus = ["football is good,cricket is bad", "cricket is good,football is bad"]
vectorizer = CountVectorizer(ngram_range=(1, 3))
vectorizer.fit(corpus)
x1 = vectorizer.transform(["football is good,cricket is bad"])
x2 = vectorizer.transform(["cricket is good,football is bad"])
cosine_similarity(x1, x2)
There are more intelligent methods to measure semantic similarity, though. One that can be tried easily is Google's Universal Sentence Encoder (USE).
See this link
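A minimal sketch of trying USE via TensorFlow Hub (assuming the public model URL below is still valid; this is my own illustration, not taken from the link above):
import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
vecs = embed(["football is good,cricket is bad",
              "cricket is good,football is bad"]).numpy()
# Cosine similarity between the two sentence embeddings.
cos = np.dot(vecs[0], vecs[1]) / (np.linalg.norm(vecs[0]) * np.linalg.norm(vecs[1]))
print(cos)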
Semantic Similarity is a bit tricky this way, since even if you use context counts (which would be n-grams > 5) you cannot cope with antonyms (e.g. black and white) well enough. Before using different methods, you could try using a shallow parser or dependency parser for extracting subject-verb or subject-verb-object relations (e.g. ), which you can use as dimensions. If this does not give you the expected similarity (or values adequate for your application), use word embeddings trained on really large data.
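As an illustration of that parsing idea, here is a rough sketch of pulling subject/predicate pairs out of a dependency parse. spaCy and its en_core_web_sm model are my own choice here (not something prescribed above), and the exact parse of short clauses like these may vary:
import spacy

nlp = spacy.load("en_core_web_sm")

def subject_predicate_pairs(sentence):
    # Pair each object / predicate complement with its clause's subject.
    doc = nlp(sentence)
    pairs = set()
    for tok in doc:
        if tok.dep_ in ("dobj", "attr", "acomp"):
            for child in tok.head.children:
                if child.dep_ in ("nsubj", "nsubjpass"):
                    pairs.add((child.lemma_, tok.lemma_))
    return pairs

print(subject_predicate_pairs("football is good, cricket is bad"))
print(subject_predicate_pairs("cricket is good, football is bad"))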

How to count frequency in gensim.Doc2Vec?

I am training a model with gensim. My corpus consists of many short sentences, and each sentence has a frequency indicating how many times it occurs in the total corpus. I implement it as below; as you can see, I simply repeat each sentence freq times. This works if the data is small, but when the data grows the frequencies can become very large, which costs too much memory, and my machine cannot afford it.
So:
1. Can I just record the frequency of each record instead of repeating it freq times?
2. Are there any other ways to save memory?
import datetime

import gensim
from gensim.models.doc2vec import TaggedDocument

class AddressSentences(object):
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as fi:
            headers = next(fi).strip().split(",")
            i_address, i_freq = headers.index("address"), headers.index("freq")
            index = 0
            for line in fi:
                cols = line.strip().split(",")
                freq = cols[i_freq]
                address = cols[i_address].split()
                # Here I do the repetition
                for i in range(int(freq)):
                    yield TaggedDocument(address, [index])
                index += 1

print("START %s" % datetime.datetime.now())
train_corpus = list(AddressSentences("/data/corpus.csv"))
model = gensim.models.doc2vec.Doc2Vec(size=50, min_count=2, iter=55)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.iter)
print("END %s" % datetime.datetime.now())
corpus is something like this:
address,freq
Cecilia Chapman 711-2880 Nulla St.,1000
The Business Centre,1000
61 Wellfield Road,500
Celeste Slater 606-3727 Ullamcorper. Street,600
Theodore Lowe Azusa New York 39531,700
Kyla Olsen Ap #651-8679 Sodales Av.,300
Two options for your exact question:
(1)
You don't need to reify your corpus iterator into a fully in-memory list, with your line:
train_corpus = list(AddressSentences("/data/corpus.csv"))
The gensim Doc2Vec model can use your iterable object directly as its corpus, since it implements __iter__() (and thus can be iterated over multiple times). So you can just do:
train_corpus = AddressSentences("/data/corpus.csv")
Then each line will be read, and each repeated TaggedDocument re-yield()ed, without requiring the full set in memory.
(2)
Alternatively, in such cases you may sometimes just want to write a separate routine that takes your original file, and rather than directly yielding TaggedDocuments, does the repetition to create a tangible file on disk which includes the repetitions. Then, use a more simple iterable reader to stream that (already-repeated) dataset into your model.
A negative of this approach, in this particular case, is that it would increase the amount of (likely relatively laggy) disk-IO. However, if the special processing your iterator is doing is more costly – such as regex-based tokenization – this sort of process-and-rewrite can help avoid duplicate work by the model later. (The model needs to scan your corpus once for vocabulary-discovery, then again iter times for training – so any time-consuming work in your iterator will be done redundantly, and may be the bottleneck that keeps other training threads idle waiting for data.)
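A rough sketch of that rewrite step (the file paths are placeholders, and this assumes the CSV has no commas inside the address field, as in your sample):
import csv

# Expand the freq column into a plain file with one address per line,
# repeated freq times, so a simple line-reader can stream it later.
with open("/data/corpus.csv") as src, open("/data/corpus_repeated.txt", "w") as dst:
    reader = csv.DictReader(src)
    for row in reader:
        for _ in range(int(row["freq"])):
            dst.write(row["address"] + "\n")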
But after those two options, some Doc2Vec-specific warnings:
Repeating documents like this may not benefit the Doc2Vec model, as compared to simply iterating over the full diverse set. It's the tug-of-war interplay of contrasting examples that causes the word-vectors/doc-vectors in Word2Vec/Doc2Vec models to settle into useful relative arrangements.
Repeating exact documents/word-contexts is a plausible way to "overweight" those examples, but even if that's really what you want, and would help your end-goals, it'd be better to shuffle those repeats through the whole set.
Repeating one example consecutively is like applying the word-cooccurrences of that example like a jackhammer on the internal neural-network, without any chance for interleaved alternate examples to find a mutually-predictive weight arrangement. The iterative gradient-descent optimization through all diverse examples ideally works more like gradual water-driven erosion & re-deposition of values.
That suggests another possible reason to take the second approach, above: after writing the file-with-repeats, you could use an external line-shuffling tool (like sort -R or shuf on Linux) to shuffle the file. Then, the 1000 repeated lines of some examples would be evenly spread among all the other (repeated) examples, a friendlier arrangement for dense-vector learning.
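If an external shuffling tool isn't handy, the same shuffle can be done in Python, assuming the repeated file fits in memory (the file names are the placeholder names from the sketch above):
import random

with open("/data/corpus_repeated.txt") as f:
    lines = f.readlines()
random.shuffle(lines)
with open("/data/corpus_repeated_shuffled.txt", "w") as f:
    f.writelines(lines)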
In any case, I would try leaving out repetition entirely, or shuffling repetitions, and evaluate which steps are really helping on whatever the true end goal is.
