I am trying to understand what is going wrong in the following example.
To train on the 'text8' dataset as described in the docs, one only has to do the following:
import gensim.downloader as api
from gensim.models import Word2Vec
dataset = api.load('text8')
model = Word2Vec(dataset)
Doing this gives very good embedding vectors, as verified by evaluating on a word-similarity task.
However, when manually loading the same text file used above, as in
text_path = '~/gensim-data/text8/text'
text = []
with open(text_path) as file:
    for line in file:
        text.extend(line.split())
text = [text]
model = Word2Vec(text)
The model still says it's training for the same number of epochs as above (5), but training is much faster, and the resulting vectors perform very, very badly on the similarity task.
What is happening here? I suppose it could have to do with the number of 'sentences', but the text8 file seems to have only a single line, so does gensim.downloader split the text8 file into sentences? If so, into chunks of what length?
In your second example, you've created a training dataset with just a single text containing the entire contents of the file. That's roughly 17 million word tokens, in a single list.
Word2Vec (& other related algorithms) in gensim have an internal implementation limitation, in their optimized paths, of 10,000 tokens per text item. All additional tokens are ignored.
So, in your 2nd case, well over 99% of your data is being discarded. Training may seem instant, but very little actual training will have occurred. (Word-vectors for words that only appear past the 1st 10,000 tokens won't have been trained at all, having only their initial randomly-set values.) If you enable logging at the INFO level, you'll see more details about each step of the process, and discrepancies like this may be easier to identify.
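For example, a minimal sketch of turning on that logging (via the standard Python logging module, which gensim reports through), placed before building the model:
import logging
# INFO-level logging shows corpus scan counts, training progress & effective parameters,
# which makes silent data-discarding like the above much easier to spot.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)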
Yes, the api.load() variant takes extra steps to break the single-line-file into 10,000-token chunks. I believe it's using the LineSentence utility class for this purpose, whose source can be examined here:
https://github.com/RaRe-Technologies/gensim/blob/e859c11f6f57bf3c883a718a9ab7067ac0c2d4cf/gensim/models/word2vec.py#L1209
However, I recommend avoiding the api.load() functionality entirely. It doesn't just download data; it also downloads a shim of additional outside-of-version-control Python code for prepping that data for extra operations. Such code is harder to browse & less well-reviewed than official gensim release code as packaged for PyPI/etc, which also presents a security risk. Each load target (by name like 'text8') might do something different, leaving you with a different object type as the return value.
It's much better for understanding to directly download precisely the data files you need, to known local paths, and do the IO/prep yourself, from those paths, so you know what steps have been applied, and the only code you're running is the officially versioned & released code.
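As a minimal sketch of that approach (assuming the same local text8 path as in the question), you could read the tokens yourself and break them into 10,000-token texts, avoiding the truncation problem in the 2nd example:
import os
from gensim.models import Word2Vec

text_path = os.path.expanduser('~/gensim-data/text8/text')

# Read all whitespace-separated tokens from the (single-line) text8 file.
with open(text_path) as f:
    tokens = f.read().split()

# Break the one long token list into texts of at most 10,000 tokens each,
# staying under the per-text limit of the optimized training paths.
chunk_size = 10000
sentences = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

model = Word2Vec(sentences)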
I am working on an NLP project using FastText. I have some texts which contain words like #.poisonjamak, #aminagabread, #iamquak123 and I want to see their FastText representation. I want to mention that the model has the following form:
# FastText
ft_model = FastText(word_tokenized_corpus,
                    max_n=0,
                    vector_size=64,
                    window=5,
                    min_count=1,
                    sg=1,
                    workers=20,
                    epochs=50,
                    seed=42)
Using this, I tried to see their representations; however, I get an error:
print(ft_model.wv['#.poisonjamak'])
KeyError: 'cannot calculate vector for OOV word without ngrams'
Of course, these words are in my texts. I get the above error for all 3 of these words; however, if I do the following, it works:
print(ft_model.wv['#.poisonjamak']) -----> print(ft_model.wv['poisonjamak'])
print(ft_model.wv['#aminagabread']) -----> print(ft_model.wv['aminagabread'])
print(ft_model.wv['#_iamquak123_']) -----> print(ft_model.wv['_iamquak123_'])
Question: So do you know why I have this problem?
Update:
My dataset is called 'df' and the column with the texts is called 'text'. I am using the following code to prepare the texts for FastText. The FastText model is trained on word_tokenized_corpus:
import nltk

extra_list = df.text.tolist()
final_corpus = [sentence for sentence in extra_list if sentence.strip() != '']
word_punctuation_tokenizer = nltk.WordPunctTokenizer()
word_tokenized_corpus = [word_punctuation_tokenizer.tokenize(sent) for sent in final_corpus]
As comments note, the main issue is likely with your tokenizer, which won't put '#' characters inside your tokens. As a result, your FastText model isn't seeing the tokens you expect – but probably does have a word-vector for the 'word' '#'.
Separately reviewing your actual word_tokenized_corpus, to see what it truly includes before the model gets to do its training, is a good way to confirm this (or catch this class of error in the future).
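For instance, a quick check on the example tokens from the question shows WordPunctTokenizer splitting off the leading punctuation, so the model never sees a token like '#aminagabread':
import nltk

tokenizer = nltk.WordPunctTokenizer()
# Runs of word characters & runs of punctuation become separate tokens,
# so the '#' never stays attached to the word.
print(tokenizer.tokenize('#aminagabread'))   # ['#', 'aminagabread']
print(tokenizer.tokenize('#.poisonjamak'))   # ['#.', 'poisonjamak']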
There is, however, another contributing issue: your use of the max_n=0 parameter. This turns off subword learning, by qualifying no positive-length word-substrings (aka 'character n-grams') for vector-learning, and essentially turns FastText into plain Word2Vec.
If instead you were using FastText in a more usual way, it would've learned subword-vectors for some of the subwords in 'aminagabread' etc, and thus would've provided synthetic "guess" word-vectors for the full '#aminagabread' unseen OOV token.
So in a way, you're only seeing the error letting you know about a problem in your tokenization because of this other deviation from usual FastText OOV behavior. If you really want FastText for its unique benefit of synthetic vectors for OOV words, you should return to a more typical max_n setting.
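As a rough illustration of that usual behavior, here's a minimal sketch on gensim's tiny bundled toy corpus (not your data) showing that with subword learning left on, even a never-seen token gets a synthesized vector instead of a KeyError:
from gensim.models import FastText
from gensim.test.utils import common_texts  # tiny toy corpus bundled with gensim

# The defaults min_n=3, max_n=6 keep character n-gram (subword) learning enabled.
ft = FastText(common_texts, vector_size=16, window=3, min_count=1, epochs=10)

oov_word = 'computational'                 # not in the toy corpus's vocabulary
print(oov_word in ft.wv.key_to_index)      # False: it's OOV
print(ft.wv[oov_word][:5])                 # still returns a vector synthesized from its n-grams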
Separate usage tips:
min_count=1 is usually a bad idea with such word2vec-family algorithms: such rare words don't have enough varied usage examples to get good vectors themselves, and the failed attempt to try degrades training for surrounding words. Often, discarding such words entirely (as with the default min_count=5), as if they weren't there at all, improves downstream evaluations.
Because of some inherent threading inefficiencies of the Python Global Interpreter Lock ("GIL"), and the gensim approach of iterating over your corpus in one thread that parcels work out to worker threads, you are likely to get higher training throughput with fewer workers than your workers=20 setting, even if you have 20 (or far more) CPU cores. The exact best setting in any situation will vary with many things, including some of the model parameters, and only trial-and-error can narrow the best value. But it's more likely to be in the 6-12 range than 16+, even when more cores are available.
I am using the Doc2Vec model from gensim (4.1.2) python library.
I trained the model on my corpus of documents and used infer_vector(). Then I saved the model and tried infer_vector() on the same text, but I get a totally different vector. What is wrong?
Here is example of code:
doc2vec_model.infer_vector(["system", "response"])
array([-1.02667394e-03, -2.73817539e-04, -2.08510624e-04, 1.01583987e-03,
-4.99124289e-04, 4.82861622e-04, -9.00296785e-04, 9.18195175e-04,
....
doc2vec_model.save('model/doc2vec')
If I load saved model
fname = "model/model_doc2vec"
model = Doc2Vec.load(fname)
model.infer_vector(["system", "response"])
array([-1.07945153e-03, 2.80674692e-04, 4.65555902e-04, 6.55420765e-04,
7.65898672e-04, -9.16261168e-04, 9.15124183e-05, -5.18970715e-04,
....
First, there's a natural amount of variance from one run of infer_vector() to another, that's inherent to how the algorithm works. The vector will be at least a little different every time you run it, even without the save/load between. For more details, see:
Q12: I've used Doc2Vec infer_vector() on a single text, but the resulting vector is different each time. Is there a bug or have I made a mistake? (doc2vec inference non-determinism)
Second, a 2-word text is a minimal corner-case on which Doc2Vec is less likely to work very well. It's better on texts that are at least dozens of words long. In particular, both the training & inference are processes that work in proportion to the number of words in a text. So a 100-word text, that goes through inference to find a new vector, will get 50x more 'adjustment nudges' than a mere 2-word text - and thus tend to be somewhat more stable, run-to-run, than a tiny text. (As mentioned in the FAQ item linked above, increasing the epochs may help a bit, making a small text a little more like a longer text – but I would still expect any small text to be more at the mercy of vagaries of the random initialization, and random sampling during incremental adjustment, than a longer text.)
Finally, other problems in the model – like insufficient training data, overfitting (especially when the model is too large for the amount of training data), or other suboptimal parameters or errors during training – can make a model that's especially inconsistent from inference to inference.
The vectors from repeated inferences will never be identical, but they should be fairly close, when parameters are good & training is sufficient. (In fact, one indirect way to test if a model is doing anything useful is to check, at the end of training, how often a re-inferred vector for a training text is the top, or one of the few top, neighbors of the same text's vector from bulk training.)
One possible error could be too few epochs – the default of 5, inherited from Word2Vec, is often too few, with 10 or 20 often being better. (Or, if you're struggling with minimal amounts of data, even more epochs can help eke out some results – though really, this algorithm needs lots of training data. Published results typically use at least tens-of-thousands, if not millions, of separate training docs, each at least dozens, but ideally hundreds or in some cases thousands, of words long.) With less data (and possibly too many vector_size dimensions for tiny training data), models will be 'looser' or more arbitrary when modeling new data.
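Here's a rough sketch of that kind of sanity check, assuming a gensim 4.x Doc2Vec model and a corpus whose documents were tagged with their integer positions (doc2vec_model and train_docs are placeholder names for your own objects):
# doc2vec_model: an already-trained gensim 4.x Doc2Vec model
# train_docs: the same tokenized texts used for bulk training, in order
matches = 0
for i, words in enumerate(train_docs):
    inferred = doc2vec_model.infer_vector(words, epochs=doc2vec_model.epochs)
    # Which bulk-trained doc-vectors are closest to the freshly re-inferred vector?
    top = doc2vec_model.dv.most_similar([inferred], topn=3)
    if any(tag == i for tag, _ in top):
        matches += 1
print(f"{matches}/{len(train_docs)} re-inferred vectors rank their own "
      f"bulk-trained vector among the top 3 neighbors")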
Another very common error is to follow some of the bad tutorials online which include calling .train() many times in your own training loop, (mis-)managing the training alpha manually. This is almost never a good idea. See this other answer for more details on this common error:
My Doc2Vec code, after many loops/epochs of training, isn't giving good results. What might be wrong?
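For reference, a minimal sketch of the usual recommended pattern – one build_vocab() plus one train() call, letting gensim manage the alpha decay across epochs – assuming a list of TaggedDocument items named tagged_docs:
from gensim.models.doc2vec import Doc2Vec

# A single train() call; gensim handles the learning-rate (alpha) decay
# across all epochs internally. No manual loop over .train() is needed.
model = Doc2Vec(vector_size=100, min_count=5, epochs=20)
model.build_vocab(tagged_docs)
model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.epochs)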
Good evening, I have a relatively simple question that primarily comes from my inexperience with python. I would like to extract word embeddings for a list of words. Here I have created a simple list:
list_word = [['Word'],
             ['ant'],
             ['bear'],
             ['beaver'],
             ['bee'],
             ['bird']]
Then load gensim and other required libraries:
#import tweepy # Obtain Tweets via API
import re # Obtain expressions
from gensim.models import Word2Vec #Import gensim Word2Vec
Now when I use the Word2Vec function I run the following:
#extract embedding length 12
model = Word2Vec(list_word, min_count = 3, size = 12)
print(model)
When the model is run I then see that the vocab size is 1, when it should not be. The output is the following:
Word2Vec(vocab=1, size=12, alpha=0.025)
I imagine that the imported data is not in the correct format and could use some advice or even example code on how to transform it into the correct format. Thank you for your help.
Your list_word, 6 sentences each with a single word, is insufficient to train Word2Vec, which requires a lot of varied, realistic text data. Among other problems:
words that only appear once will be ignored due to the min_count=3 setting (& it's not a good idea to lower that parameter)
single-word sentences have none of the nearby-words contexts the algorithm uses
getting good 'dense' vectors requires a vocabulary far larger than the vector-dimensionality, and many varied examples of each word's use with other words
Try using a larger dataset, and you'll see more realistic results. Also, enabling Python logging at the INFO level will show a lot of progress as the code runs - and perhaps hint at issues, as you notice steps happening with or without reasonable counts & delays.
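As a rough sketch of the expected input shape – many multi-word tokenized sentences, rather than one word per 'sentence' – using gensim's tiny bundled example corpus (a real corpus should be far larger), with INFO logging enabled:
import logging
from gensim.models import Word2Vec
from gensim.test.utils import common_texts  # small bundled list of tokenized sentences

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

# Each training item is a list of word tokens from one sentence/document.
print(common_texts[0])   # e.g. ['human', 'interface', 'computer']

# Note: in gensim 4.x the dimensionality parameter is vector_size;
# older releases (like the one in the question) called it size.
model = Word2Vec(common_texts, vector_size=12, min_count=1)
print(len(model.wv))     # vocabulary size well above 1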
I am trying to apply the word2vec model implemented in the library gensim 3.6 in Python 3.7, on a Windows 10 machine. I have a list of sentences (each sentence is a list of words) as input to the model, after performing preprocessing.
I have computed the results (obtaining the 10 most similar words for a given input word using model.wv.most_similar) in Anaconda's Spyder and then in the Sublime Text editor.
But I am getting different results for the same source code executed in the two editors.
Which result should I choose, and why?
I have attached screenshots of the results obtained by running the same code in both Spyder and Sublime Text. The input word for which I need to obtain the 10 most similar words is #universe#
I am really confused about how to choose between the results, and on what basis. Also, I have only recently started learning Word2Vec.
Any suggestion is appreciated.
Results Obtained in Spyder:
Results Obtained using Sublime Text:
The Word2Vec algorithm makes use of randomization internally. Further, when (as is usual for efficiency) training is spread over multiple threads, some additional order-of-presentation randomization is introduced. These mean that two runs, even in the exact same environment, can have different results.
If the training is effective – sufficient data, appropriate parameters, enough training passes – all such models should be of similar quality when doing things like word-similarity, even though the actual word-vectors will end up in different coordinates. There'll be some jitter in the relative rankings of words, but the results should be broadly similar.
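If you genuinely need run-to-run reproducibility – usually only worth it for debugging, since it costs training speed – the gensim docs note it requires a fixed seed, a single worker thread, and a fixed PYTHONHASHSEED. A minimal sketch, where sentences stands in for your own tokenized corpus:
# Full determinism needs: a fixed seed, workers=1 (to remove thread-scheduling
# jitter), and launching the interpreter with a fixed hash seed, e.g.
#   PYTHONHASHSEED=0 python train_word2vec.py
from gensim.models import Word2Vec

model = Word2Vec(sentences, seed=42, workers=1)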
That your results are vaguely related to 'universe' but not impressively so, and that they vary so much from one run to another, suggest there may be problems with your data, parameters, or quantity of training. (We'd expect the results to vary a little, but not that much.)
How much data do you have? (Word2Vec benefits from lots of varied word-usage examples.)
Are you retaining rare words, by making min_count lower than its default of 5? (Such words tend not to get good vectors, and also wind up interfering with the improvement of nearby words' vectors.)
Are you trying to make very-large vectors? (Smaller datasets and smaller vocabularies can only support smaller vectors. Too-large vectors allow 'overfitting', where idiosyncrasies of the data are memorized rather than generalized patterns learned. Or, they allow the model to continue improving in many different non-competitive directions, so model end-task/similarity results can be very different from run-to-run, even though each model is doing about-as-well as the other on its internal word-prediction tasks.)
Have you stuck with the default epochs=5 even with a small dataset? (A large, varied dataset requires fewer training passes - because all words appear many times, all throughout the dataset, anyway. If you're trying to squeeze results from thinner data, more epochs may help a little – but not as much as more varied data would.)
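One way to compare two runs more rigorously than eyeballing most_similar() lists is a standard word-similarity benchmark; a rough sketch using the small WordSim-353 file that ships with gensim (model stands in for your trained model):
from gensim.test.utils import datapath

# Returns Pearson & Spearman correlations against human similarity judgments,
# plus the fraction of benchmark pairs that were out-of-vocabulary for this model.
pearson, spearman, oov_ratio = model.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))
print(spearman, oov_ratio)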
I am trying to split financial documents into sentences. I have ~50,000 documents containing plain English text. The total file size is ~2.6 GB.
I am using NLTK's PunktSentenceTokenizer with the standard English pickle file. I additionally tweaked it by providing extra abbreviations, but the results are still not accurate enough.
Since NLTK's PunktSentenceTokenizer is based on the unsupervised algorithm by Kiss & Strunk (2006), I am trying to train the sentence tokenizer on my own documents, following the training data format for nltk punkt.
import nltk.tokenize.punkt
import pickle
import codecs
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
text = codecs.open("someplain.txt", "r", "utf8").read()
tokenizer.train(text)
out = open("someplain.pk", "wb")
pickle.dump(tokenizer, out)
out.close()
Unfortunately, when running the code, I got an error that there was not sufficient memory. (Mainly because I first concatenated all the files into one big file.)
Now my questions are:
How can I train the algorithm batchwise, and would that lead to lower memory consumption?
Can I use the standard English pickle file and do further training with that already trained object?
I am using Python 3.6 (Anaconda 5.2) on Windows 10 on a Core I7 2600K and 16GB RAM machine.
I found this question after running into this problem myself. I figured out how to train the tokenizer batchwise and am leaving this answer for anyone else looking to do this. I was able to train a PunktSentenceTokenizer on roughly 200GB of Biomedical text content in around 12 hours with a memory footprint no greater than 20GB at a time. Nevertheless, I'd like to second @colidyre's recommendation to prefer other tools over the PunktSentenceTokenizer in most situations.
There is a class PunktTrainer you can use to train the PunktSentenceTokenizer in a batchwise fashion.
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
Suppose we have a generator that yields a stream of training texts
texts = text_stream()
In my case, each iteration of the generator queries a database for 100,000 texts at a time, then yields all of these texts concatenated together.
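If your corpus lives in plain-text files rather than a database, a hypothetical text_stream() generator might look roughly like this (the directory, file pattern & batch size are illustrative assumptions):
from pathlib import Path

def text_stream(root_dir='documents/', batch_size=1000):
    """Yield concatenated batches of document texts, one batch per iteration."""
    batch = []
    for path in Path(root_dir).glob('*.txt'):
        batch.append(path.read_text(encoding='utf8'))
        if len(batch) == batch_size:
            yield '\n\n'.join(batch)
            batch = []
    if batch:
        yield '\n\n'.join(batch)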
We can instantiate a PunktTrainer and then begin training
trainer = PunktTrainer()

for text in texts:
    trainer.train(text)
    trainer.freq_threshold()
Notice the call to the freq_threshold method after processing each text. This reduces the memory footprint by cleaning up information about rare tokens that are unlikely to influence future training.
Once this is complete, call the finalize_training() method. Then you can instantiate a new tokenizer using the parameters found during training.
trainer.finalize_training()
tokenizer = PunktSentenceTokenizer(trainer.get_params())
@colidyre recommended using spaCy with added abbreviations. However, it can be difficult to know in advance which abbreviations will appear in your text domain. To get the best of both worlds, you can add the abbreviations found by Punkt. You can get a set of these abbreviations in the following way:
params = trainer.get_params()
abbreviations = params.abbrev_types
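To close the loop with the pickling approach from the question, a minimal sketch of persisting the trained tokenizer and applying it (file names and the sample text are illustrative):
import pickle

# Persist the trained tokenizer so training doesn't have to be repeated.
with open('financial_punkt.pk', 'wb') as out:
    pickle.dump(tokenizer, out)

# Later: reload it and split raw document text into sentences.
with open('financial_punkt.pk', 'rb') as f:
    tokenizer = pickle.load(f)

print(tokenizer.tokenize('Revenue grew 4.2 pct. in Q3. Margins improved.'))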
As described in the source code:
Punkt Sentence Tokenizer
This tokenizer divides a text into a list of sentences
by using an unsupervised algorithm to build a model for abbreviation
words, collocations, and words that start sentences. It must be
trained on a large collection of plaintext in the target language
before it can be used.
It is not very clear what a large collection really means. In the paper, there is no information given about learning curves (i.e. when it is sufficient to stop the learning process because enough data has been seen). The Wall Street Journal corpus is mentioned there (it has approximately 30 million words). So it is very unclear whether you can simply trim your training corpus and still get a smaller memory footprint.
There is also an open issue on your topic mentioning 200 GB of RAM and more. As you can see there, NLTK probably does not have a good implementation of the algorithm presented by Kiss & Strunk (2006).
I cannot see how to batch it, as you can see from the signature of the train() method (NLTK version 3.3):
def train(self, train_text, verbose=False):
    """
    Derives parameters from a given training text, or uses the parameters
    given. Repeated calls to this method destroy previous parameters. For
    incremental training, instantiate a separate PunktTrainer instance.
    """
But there are probably more issues; e.g., if you compare the signature in the released version 3.3 with the git-tagged version 3.3, there is a new parameter finalize which might be helpful and indicates a possible batch process or a possible merge with an already-trained model:
def train(self, text, verbose=False, finalize=True):
    """
    Collects training data from a given text. If finalize is True, it
    will determine all the parameters for sentence boundary detection. If
    not, this will be delayed until get_params() or finalize_training() is
    called. If verbose is True, abbreviations found will be listed.
    """
Anyway, I would strongly recommend not using NLTK's Punkt sentence tokenizer if you want to do sentence tokenization beyond the playground level. Nevertheless, if you want to stick with that tokenizer, I would simply recommend using the provided models rather than training new ones, unless you have a server with a huge amount of RAM.