I'm trying to extract morphs/similar words in Sinhala language using Fasttext.
But FastText takes a 1 second for 2.64 words. How can I increase the speed without changing the model size?
My code looks like this:
import fasttext
fasttext.util.download_model('si', if_exists='ignore') # Sinhala
ft = fasttext.load_model('cc.si.300.bin')
words_file = open(r'/Datasets/si_words_filtered.txt')
words = words_file.readlines()
words = words[0:300]
synon_dict = dict()
from tqdm import tqdm_notebook
for i in tqdm_notebook(range(len(words))):
word = words[i].strip()
synon = ft.get_nearest_neighbors(word)[0][1] ### takes a lot of time
if is_strictly_sinhala_word(synon):
synon_dict[word] = synon
import json
with open("out.json", "w", encoding='utf8') as f:
json.dump(synon_dict, f, ensure_ascii=False)
To do a fully accurate get_nearest_neighbors()-type of calculation is inherently fairly expensive, requiring a lookup & calculation against every word in the set, for each new word.
As it looks like that set of vectors is near or beyond 2GB in size, when just the word-vectors are loaded, that means a scan of 2GB of addressable memory may be the dominant factor in the runtime.
Some things to try that might help:
Ensure that you have plenty of RAM - if there's any use of 'swap'/virtual-memory, that will make things far slower.
Avoid all unnecessary comparisons - for example, perform your is_strictly_sinhala_word() check before the expensive step, so you can skip the costly step if not interested in the results. Also, you could consider shrinking the full set of word-vectors to eliminate those that you are unlikely to want as responses. This might involve throwing out words you know are not of the language-of-interest, or all lower-frequency words. (If you can throw out half the words as possible nearest-neighbors before even trying the get_nearest_neighbors(), it will go roughly twice as fast.) More on these options below.
Try other word-vector libraries, to see if they offer any improvement. For example, the Python Gensim project can load either plain sets of full-word vectors (eg, the cc.si.300.vec words-only file) or FastText models (the .bin file), and offers a .most_similar() function that has some extra options & might, in some cases, offer different performance. (Though, the official Facebook Fasttext .get_nearest_neighbors() is probably pretty good.)
Use an "approximate nearest neighbors" library to pre-build an index of the word-vector space that can then offer extra-fast nearest-neighbor lookups - although at some risk of not finding the exact right top-N neighbors. There are many such libraries – see this benchmarking project that compares over 20 of them. But, adding this step complicates things & the tradeoff of that complexity & the imperfect result may not be worth the effort & time-savings. So, just remember that it's a possibility if your need s large enough & nothing else helps.
With regard to slimming the set of vectors searched:
The Gensim KeyedVectors.load_word2vec_format() function, which can load the .vec words-only file, has an option limit that will only read the specified number of words from the file. It looks like the .vec file for your dataset has over 800k words - but if you chose to load only 400k, your .most_similar() calculations would go about twice as fast. (And, since such files typically front-load the files with the most-common words, the loss of the far-rarer words may not be a concern.)
Siilarly, even if you load all the vectors, the Gensim .most_similar() function has a restrict_vocab option that can limit searches to just the 1st words of that count, which could also speed things or helpfully drop obscure words that may be of less interest.
The .vec file may be easier to work with if you wanted to pre-filter the words to, for example, eliminate non-Sinhala words. (Note: the usual .load_word2vec_format() text format needs a 1st line that declares the count of words & word-dimensionality, but you may leave that off, then load using the no_header=True option, which instead uses 2 full passes over the file to get the count.)
Related
I am trying to use the gensim word2vec most_similar function in the following way:
wv_from_bin.most_similar(positive=["word_a", "word_b"])
So basically, I multiple query words and I want to return the most similar outputs, but from a finite set. i.e. if vocab is 2000 words, then I want to return the most similar from a set of say 100 words, and not all 2000.
e.g.
Vocab:
word_a, word_b, word_c, word_d, word_e ... words_z
Finite set:
word_d, word_e, word_f
most_similar on whole vocab
wv_from_bin.most_similar(positive=["word_a", "word_b"])
output = ['word_d', 'word_f', 'word_g', 'word_x'...]
desired output
finite_set = ['word_d', 'word_e', 'word_f']
wv_from_bin.most_similar(positive=["word_a", "word_b"], finite_set) <-- some way of passing the finite set
output = ['word_d', 'word_f']
Depending on your specific patterns of use, you have a few options.
If you want to confine your results to a contiguous range of words in the KeyedVectors instance, a few optional parameters can help.
Most often, people want to confine results to the most frequent words. Those are generally those with the best-trained word-vectors. (When you get deep into less-frequent words, the few training examples tend to make their vectors somewhat more idiosyncratic – both from randomization that's part of the algorithm, and from any ways the limited number of examples don't reflect the word's "true" generalizable sense in the wider world.)
Using the optional parameter restrict_vocab, with an integer value N, will limit the results to just the first N words in the KeyedVectors (which by usual conventions are those that were most-frequent in the training data). So for example, adding restrict_vocab=10000 to a call against a set-of-vectors with 50000 words will only retun the most-similar words from the 1st 10000 known words. Due to the effect mentioned above, these will often be the most reliable & sensible results - while nearby words from the longer-tail of low-frequency words are more likely to seem a little out of place.
Similarly, instead of restrict_vocab, you can use the optional clip_start & clip_end parameters to limit results to any other contiguous range. For example, adding clip_start=100, clip_end=1000 to your most_similar() call will only return results from the 900 words in that range (leaving out the 100 most-common words in the usual case). I suppose that might be useful if you're finding the most-frequent words to be too generic – though I haven't noticed that being a typical problem.
Based on the way the underlying bulk-vector libraries work, both of the above options efficiently calculate only the needed similarities before sorting out the top-N, using native routines that might achieve nice parallelism without any extra effort.
If your words are a discontiguous mix throughout the whole KeyedVectors, there's no built-in support for limiting the results.
Two options you could consider include:
Especially if you repeatedly search against the exact same subset of words, you could try creating a new KeyedVectors object with just those words - then every most_similar() against that separate set is just what you need. See the constructor & add_vector() or add_vectors() methods in the KeyedVectors docs for how that could be done.
Requesting a larger set of results, then filtering your desired subset. For example, if you supply topn=len(wv_from_bin), you'll get back every word, ranked. You could then filter those down to only your desired subset. This does extra work, but that might not be a concern depending on your model size & required throughput. For example:
finite_set = set(['word_d', 'word_e', 'word_f']) # set for efficient 'in'
all_candidates = wv_from_bin.most_similar(positive=["word_a", "word_b"],
topn=len(vw_from_bin))
filtered_results = [word_sim for word_sim in all_candidates if word_sim[0] in finite_set]
You could save a little of the cost of the above by getting all the similarities, unsorted, using the topn=None option - but then you'd still have to subset those down to your words-of-interest, then sort yourself. But you'd still be paying the cost of all the vector-similarity calculations for all words, which in typical large-vocabularies is more of the runtime than the sort.
If you were tempted to iterate over your subset & calculate the similarities 1-by-1, be aware that can't take advantage of the math library's bulk vector operations – which use vector CPU operations on large ranges of the underlying data – so will usually be a lot slower.
Finally, as an aside: if your vocabulary is truly only ~2000 words, youre far from the bulk of data/words for which word2vec (and dense embedding word-vectors in general) usually shine. You may be disappointed in results unless you get a lot more data. (And in the meantime, such small vocabs may have problems effectively training typical word2vec dimensionalities (vector_size) of 100, 300, or more. (Using smaller vector_size, when you have a smaller vocab & less training data, can help a bit.)
On the other hand, if you're in some domain other than real-language texts with an inherently limited unique vocabulary – like say category-tags or product-names or similar – and you have the chance to train your own word-vectors, you may want to try a wider range of training parameters than the usual defaults. Some recommendation-type apps may benefit from values very different from the ns_exponent default, & if the source data's token-order is arbitrary, rather than meaningful, using a giant window or setting shrink_windows=False will deemphasize immediate-neighbors.
I’m in the process of trying to get document similarity values for a corpus of approximately 5,000 legal briefs with Doc2Vec (I recognize that the corpus may be a little bit small, but this is a proof-of-concept project for a larger corpus of approximately 15,000 briefs I’ll have to compile later).
Basically, every other component in the creation of the model is going relatively well so far – each brief I have is in a text file within a larger folder, so I compiled them in my script using glob.glob – but I’m running into a tokenization problem. The difficulty is, as these documents are legal briefs, they contain numbers that I’d like to keep, and many of the guides I’ve been using to help me write the code use Gensim’s simple preprocessing, which I believe eliminates digits from the corpus, in tandem with the TaggedDocument feature. However, I want to do as little preprocessing on the texts as possible.
Below is the code I’ve used, and I’ve tried swapping simple_preprocess for genism.utils.tokenize, but when I do that, I get generator objects that don’t appear workable in my final Doc2Vec model, and I can’t actually see how the corpus looks. When I’ve tried to use other tokenizers, like nltk, I don’t know how to fit that into the TaggedDocument component.
brief_corpus = []
for brief_filename in brief_filenames:
with codecs.open(brief_filename, "r", "utf-8") as brief_file:
brief_corpus.append(
gensim.models.doc2vec.TaggedDocument(
gensim.utils.simple_preprocess(
brief_file.read()),
["{}".format(brief_filename)])) #tagging each brief with its filename
I’d appreciate any advice that anyone can give that would help me combine a tokenizer that just separated on whitespace and didn’t eliminate any numbers with the TaggedDocument feature. Thank you!
Update: I was able to create a rudimentary code for some basic tokenization (I do plan on refining it further) without having to resort to Gensim's simple_preprocessing function. However, I'm having difficulty (again!) when using the TaggedDocument feature - but this time, the tags (which I want to be the file names of each brief) don't match the tokenized document. Basically, each document has a tag, but it's not the right one.
Can anyone possibly advise where I might have gone wrong with the new code below? Thanks!
briefs = []
BriefList = [p for p in os.listdir(FILEPATH) if p.endswith('.txt')]
for brief in BriefList:
str = open(FILEPATH + brief,'r').read()
tokens = re.findall(r"[\w']+|[.,!?;]", str)
tagged_data = [TaggedDocument(tokens, [brief]) for brief in BriefList]
briefs.append(tagged_data)
You're likely going to want to write your own preprocessing/tokenization functions. But don't worry, it's not hard to outdo Gensim's simple_preprocess, even with very crude code.
The only thing Doc2Vec needs as the words of a TaggedDocument is a list of string tokens (typically words).
So first, you might be surprised how well it works to just do a default Python string .split() on your raw strings - which just breaks text on whitespace.
Sure, a bunch of the resulting tokens will then be mixes of words & adjoining punctuation, which may be nearly nonsense.
For example, the word 'lawsuit' at the end of the sentence might appear as 'lawsuit.', which then won't be recognized as the same token as 'lawsuit', and might not appear enough min_count times to even be considered, or otherwise barely rise above serving as noise.
But especially for both longer documents, and larger datasets, no one token, or even 1% of all tokens, has that much influence. This isn't exact-keyword-search, where failing to return a document with 'lawsuit.' for a query on 'lawsuit' would be a fatal failure. A bunch of words 'lost' to such cruft may have hadly any effect on the overall document, or model, performance.
As your datasets seem manageable enough to run lots of experiments, I'd suggest trying this dumbest-possible tokenization – only .split() – just as a baseline to become confident that the algorithm still mostly works as well as some more intrusive operation (like simple_preprocess()).
Then, as you notice, or suspect, or ideally measure with some repeatable evaluation, that some things you'd want to be meaningful tokens aren't treated right, gradually add extra steps of stripping/splitting/canonicalizing characters or tokens. But as much as possible: checking that the extra complexity of code, and runtime, is actually delivering benefits.
For example, further refinements could be some mix of:
For each token created by the simple split(), strip off any non-alphanumeric leading/trailing chars. (Advantages: eliminates that punctuation-fouling-words cruft. Disadvantages: might lose useful symbols, like the leading $ of monetary amounts.)
Before splitting, replace certain single-character punctuation-marks (like say ['.', '"', ',', '(', ')', '!', '?', ';', ':']) with the same character with spaces on both sides - so that they're never connected with nearby words, and instead survive a simple .split() as standalone tokens. (Advantages: also prevents words-plus-punctuation cruft. Disadvantages: breaks up numbers like 2,345.77 or some useful abbreviations.)
At some appropriate stage in tokenization, canonicalize many varied tokens into a smaller set of tokens that may be more meaningful than each of them as rare standalone tokens. For example, $0.01 through $0.99 might all be turned into $0_XX - which then has a better chance of influencting the model, & being associated with 'tiny amount' concepts, than the original standalone tokens. Or replacing all digits with #, so that numbers of similar magnitudes share influence, without diluting the model with a token for every single number.
The exact mix of heuristics, and order of operations, will depend on your goals. But with a corpus only in the thousands of docs (rather than hundreds-of-thousands or millions), even if you do these replacements in a fairly inefficient way (lots of individual string- or regex- replacements in serial), it'll likely be a manageable preprocessing cost.
But you can start simple & only add complexity that your domain-specific knowledge, and evaluations, justifies.
I need to put the texts contained in a column of a MySQL database (about 3 million rows) into a list of lists of tokens. These texts (which are tweets, therefore they are generally short) must be preprocessed before being included in the list (stop words, hashtags, tags etc. must be removed). This list should be passed later as a Word2Vec parameter. This is the part of the code involved
import mysql.connector
import re
from gensim.models import Word2Vec
import preprocessor as p
p.set_options(
p.OPT.URL,
p.OPT.MENTION,
p.OPT.HASHTAG,
p.OPT.NUMBER
)
conn = mysql.connector.connect(...)
cursor = conn.cursor()
query = "SELECT text FROM tweet"
cursor.execute(query)
table = cursor.fetchall()
stopwords = open('stopwords.txt', encoding='utf-8').read().split('\n')
sentences = []
for row in table:
sentences = sentences + [[w for w in re.sub(r'[^\w\s-]', ' ', p.clean(row[0])).lower().split() if w not in stopwords and len(w) > 2]]
cursor.close()
conn.close()
model = Word2Vec(sentences)
...
Obviously it takes a lot of time and I know that my method is probably inefficient. Can anyone recommend a better one? I know it is not a question directly related to gensim and Word2Vec but perhaps those who use them have already faced the problem of working with a large amount of texts.
You haven't mentioned how long your code takes to run, but some potential sources of slowdown in your current technique might include:
the overhead of regex-based preprocessing, especially if a large number of independent regexes are each applied, separately, to the same texts
the inefficiency of expanding a Python list by appending one new item at a time - which as the list grows larger can sometimes be a factor
virtual-memory swapping, if the size of your data exceeds physical RAM
You can check the swapping issue by monitoring memory use using a platform-specific tool (like top on Linux systems) to view memory usage during the operation. If that's a contributor, using a machine with more RAM, or making other code changes to reduce RAM usage (see below), will help.
Your full prprocessing code isn't shown, but a common approach is a lot of independent steps, each of which involves one or more regular-expressions, but then returns a plain modified string (for future steps).
As appealingly simple & pluggable as that is, it often becomes a source of avoidable slowness in preprocessing large amounts of text. For example, each regex/step itself might have to repeat detecting token-boundaries, or splitting then re-concatenating a string. Or, the regexes might use complex match patterns, or techniques (like backtracking) that can be expensive on worst-case inputs.
Often this sort of preprocessing can be greatly improved by one or more of:
coalescing multiple regexes into a single step, so a string faces one front-to-back pass, rather than N
breaking into short tokens early, then leaving the text as a list-of-tokens for later steps - thus never redundantly splitting/joining, and letting later token-oriented steps to work on smaller strings and perhaps even simpler (non-regex) string-tests
Also, even if the preprocessing is still a bit time-consuming, a big process improvement is usually to be sure to only repeat it when the data changes. That is, if you're going to try a bunch of different downstream steps, like different Word2Vec parameters, make sure you're not doing the expensive preprocessing every time. Do it once, write the results aside to a file, then reuse the results file until it needs to be regenerated (because the data or preprocessing rules have changed).
Finally, if the append-one-more pattern is contributing to your slowness, you could pre-allocate your sentences (sentences = [Null,] * desired_length), then replace each row in your loop rather than append (sentences[row_num] = preprocessed_text). But that might not be a major factor, and in fact the suggestion above, about "reuse the results file", is a better way to minimize list-ops/RAM-usage, as well as enable reuse across alternate runs.
That is, open a new working file before your loop. Append each preprocessed text – with spaces between the tokens, and a newline at the end – as one new line to this file. Then, have your Word2Vec step work directly from that file. (In Gensim, you can do this by wrapping the file with a LineSentence utility object, which reads a file of that format back as a re-iterable sequence, with each item being a list-of-tokens, or by using the corpus_file parameter to feed the filename directly to Word2Vec.)
From that list of possible tactics, I'd try:
First, time your existing code for preprocessing (creating your sentences
Then, eliminate all fancy preprocessing, doing nothing more complicated than .split(), and re-time. If there's a big change, then yes, the preprocessing is the major slowdown, and concentrate on improving that.
If even that minimal preprocessing still seems slower-than-desired, then maybe the RAM/concatenation issues are a concern, and try writing to an interim file.
Separately: it's not strictly necessary to worry about removing stop-words in word2vec training - much published work doesn't bother with that step, and the algorithm already includes a sample parameter which causes it to skip a lot of the very-overrepresented words during training as less-interesting. Similarly, 2- and even 1- character tokens may still be interesting, especially in the domain of tweets, so you might not want to always discard them. (For example, lone emoji can be significant 'words'.)
I have some unsupervised data (100.000 files) and each file has a paragraph containing one sentence. The preprocessing went wrong and deleted all stop points (.).
I used word2vec on a small sample (2000 files) and it treated each document as one sentence.
Should I continue the process on all remaining files? Or this would result to a bad model ?
Thank you
Did you try it, and get bad results?
I'm not sure what you mean by "deleted all stop points". But, Gensim's Word2Vec is oblivious to what your tokens are, and doesn't really have any idea of 'sentences'.
All that matters is the lists-of-tokens you provide. (Sometimes people include puntuation like '.' as tokens, and sometimes it's stripped - and it doesn't make a very big different either way, and to the extent it does, whether it's good or bad may depend on your data & goals.)
Any lists-of-tokens that include neighboring related tokens, for the sort of context-window training that's central to the word2vec algorithm, should work well.
For example, it can't learn anything from one-word texts, where there are no neighboring words. But running togther sentences, paragraphs, and even full documents into long texts works fine.
Even concatenating wholly-unrelated texts doesn't hurt much: the bit of random noise from unrelated words now in-each-others' windows is outweighed, with enough training, by the meaningful relationships in the much-longer runs of truly-related text.
The main limit to consider is that each training text (list of tokens) shouldn't be more than 10,000 tokens long, as internal implementation limits up through Gensim 4.0 mean tokens past the 10,000th position will be ignored. (This limit might eventually be fixed - but until then, just splitting overlong texts into 10,000-token chunks is a fine workaround with negligible effects via the lost contexts at the break points.)
I have downloaded Google's pretrained word embeddings as a binary file here (GoogleNews-vectors-negative300.bin.gz). I want to be able to filter the embedding based on some vocabulary.
I first tried loading the bin file as a KeyedVector object, and then creating a dictionary that uses its vocabulary along with another vocabulary as a filter. However, it takes a long time.
# X is the vocabulary we are interested in
embeddings = KeyedVectors.load_word2vec_format('GoogleNews-vectors-
negative300.bin.gz', binary=True)
embeddings_filtered = dict((k, embeddings[k]) for k in X if k in list(embeddings.wv.vocab.keys()))
It takes a very long time to run. I am not sure if this is the most efficient solution. Should I filter it out in the load_word2vec_format step first?
Your dict won't have all the features of a KeyedVectors object, and it won't be stored as compactly. The KeyedVectors stores all vectors in a large contiguous native 2D array, with a dict indicating the row for each word's vector. Your second dict, with a separate vector for each word, will involve more overhead. (And further, as the vectors you get back from embeddings[k] will be "views" into the full vector – so your subset may actually indirectly retain the larger array, even after you try to discard the KeyedVectors.)
Since it's likely that a reason you only want a subset of the original vectors is that the original set was too large, having a dict that takes as much or more memory probably isn't ideal.
You should consider two options:
load_word2vec_format() includes an optional limit parameter that only loads the first N words from the supplied file. As such files are typically sorted from most-frequent to least-frequent words, and the less-frequent words are both far less useful and of lower vector quality, it is often practical to just use the first 1 million, or 500,000, or 100,000, etc entries for a large memory & speed savings.
You could try filtering on load. You'd need to adapt the loading code to do this. Fortunately you can review the full source code for load_word2vec_format() (it's just a few dozen lines) inside your local gensim instalation, or online at the project source code hosting at:
https://github.com/RaRe-Technologies/gensim/blob/9c5215afe3bc4edba7dde565b6f2db982bba5113/gensim/models/utils_any2vec.py#L123
You'd write your own version of this routine that skips words not of interest. (It might have to do two passes over the file, one to count the words of interest, then a second to actually allocate the right-sized in-memory arrays and do the real reading.)