I am using sklearn's TfidfVectorizer for document clustering. I have 20 million texts for which I want to compute clusters, but calculating the TF-IDF matrix is taking too much time and the system gets stuck.
Is there any technique to deal with this problem? Is there an alternative method for this in another Python module?
Well, a corpus of 20 million texts is very large, and without meticulous and comprehensive preprocessing, or some good computing instances (i.e. a lot of memory and good CPUs), the TF-IDF calculation may take a lot of time.
What you can do:
Limit your text corpus to a few hundred thousand samples (say, 200,000 texts). Adding many more texts may not introduce much more variance than a much smaller (but reasonable) dataset.
Preprocess your texts as much as you can. A basic approach would be: tokenize your texts, remove stop words, apply stemming, and use n-grams carefully.
Once you've done all these steps, see how much you've reduced the size of your vocabulary. It should be much smaller than the original one.
If your dataset is not too big, these steps might help you compute the TF-IDF much faster.
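For example, a rough sketch of a vectorizer configured along those lines (parameter values are illustrative, not tuned):
from sklearn.feature_extraction.text import TfidfVectorizer
# `texts` is assumed to be your (reduced) list of raw documents
vectorizer = TfidfVectorizer(
    lowercase=True,
    stop_words="english",   # drop common stop words
    max_features=50_000,    # cap the vocabulary size
    min_df=5,               # ignore very rare terms
    max_df=0.5,             # ignore terms appearing in more than half the documents
    ngram_range=(1, 1),     # unigrams only; widen carefully if needed
)
tfidf_matrix = vectorizer.fit_transform(texts)  # sparse CSR matrix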
Start small.
First cluster only 100,000 documents. Only once that works (because it probably won't) should you think about scaling up.
If you don't succeed in clustering the subset (and text clusters are usually pretty bad), then you won't fare well on the large set.
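A rough sketch of that approach (sample size and cluster count are arbitrary placeholders):
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

# `all_texts` is assumed to hold the full collection of documents
subset = random.sample(all_texts, 100_000)

X = TfidfVectorizer(stop_words="english", max_features=50_000).fit_transform(subset)

# MiniBatchKMeans handles sparse TF-IDF matrices better than plain KMeans at this scale
km = MiniBatchKMeans(n_clusters=50, random_state=42)
labels = km.fit_predict(X)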
Related
I am curious to know whether there are any implications of using a different source when calling build_vocab and train on a Gensim FastText model. Will this impact the contextual representation of the word embeddings?
My intention for doing this is that there is a specific set of words I am interested in getting vector representations for, and when calling model.wv.most_similar I only want words defined in this vocab list to be returned, rather than all possible words in the training corpus. I would use the result of this to decide whether to group those words as relevant to each other based on a similarity threshold.
Following is the code snippet I am using; I'd appreciate your thoughts on any concerns or implications with this approach.
vocab.txt contains a list of unique words of interest
corpus.txt contains full conversation text (i.e. chat messages) where each line represents a paragraph/sentence per chat
A follow up question to this is what values should I set for total_examples & total_words during training in this case?
from gensim.models.fasttext import FastText
model = FastText(min_count=1, vector_size=300,)
corpus_path = f'data/{client}-corpus.txt'
vocab_path = f'data/{client}-vocab.txt'
# Unsure if below counts should be based on the training corpus or vocab
corpus_count = get_lines_count(corpus_path)
total_words = get_words_count(corpus_path)
# build the vocabulary
model.build_vocab(corpus_file=vocab_path)
# train the model
model.train(corpus_file=corpus_path, epochs=100,
total_examples=corpus_count, total_words=total_words,
)
# save the model
model.save(f'models/gensim-fastext-model-{client}')
In case someone has a similar question, I'll paste the reply I got when asking this in the Gensim Discussion Group, for reference:
You can try it, but I wouldn't expect it to work well for most
purposes.
The build_vocab() call establishes the known vocabulary of the
model, & caches some stats about the corpus.
If you then supply another corpus – & especially one with more words
– then:
You'll want your train() parameters to reflect the actual size of your training corpus. You'll want to provide a true total_examples and total_words count that are accurate for the training-corpus.
Every word in the training corpus that's not in the known vocabulary is ignored completely, as if it wasn't even there. So you might as
well filter your corpus down to just the words-of-interest first, then
use that same filtered corpus for both steps. Will the example texts
still make sense? Will that be enough data to train meaningful,
generalizable word-vectors for just the words-of-interest, alongside
other words-of-interest, without the full texts? (You could look at
your pre-filtered corpus to get a sense of that.) I'm not sure - it
could depend on how severely trimming to just the words-of-interest
changed the corpus. In particular, to train high-dimensional dense
vectors – as with vector_size=300 – you need a lot of varied data.
Such pre-trimming might thin the corpus so much as to make the
word-vectors for the words-of-interest far less useful.
You could certainly try it both ways – pre-filtered to just your
words-of-interest, or with the full original corpus – and see which
works better on downstream evaluations.
More generally, if the concern is training time with the full corpus,
there are likely other ways to get an adequate model in an acceptable
amount of time.
If using corpus_file mode, you can increase workers to equal the
local CPU core count for a nearly-linear speedup from number of cores.
(In traditional corpus_iterable mode, max throughput is usually
somewhere in the 6-12 worker threads, as long as you have that many
cores.)
min_count=1 is usually a bad idea for these algorithms: they tend to
train faster, in less memory, leaving better vectors for the remaining
words when you discard the lowest-frequency words, as the default
min_count=5 does. (It's possible FastText can eke a little bit of
benefit out of lower-frequency words via their contribution to
character-n-gram-training, but I'd only ever lower the default
min_count if I could confirm it was actually improving relevant
results.)
If your corpus is so large that training time is a concern, often a
more-aggressive (smaller) sample parameter value not only speeds
training (by dropping many redundant high-frequency words), but often
improves final word-vector quality for downstream purposes as well (by
letting the rarer words have relatively more influence on the model in
the absence of the downsampled words).
And again if the corpus is so large that training time is a concern,
then epochs=100 is likely overkill. I believe the GoogleNews
vectors were trained using only 3 passes – over a gigantic corpus. A
sufficiently large & varied corpus, with plenty of examples of all
words all throughout, could potentially train in 1 pass – because each
word-vector can then get more total training-updates than many epochs
with a small corpus. (In general larger epochs values are more often
used when the corpus is thin, to eke out something – not on a corpus
so large you're considering non-standard shortcuts to speed the
steps.)
-- Gordon
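Putting those suggestions together, a sketch of a revised setup (the parameter values here are illustrative assumptions, not recommendations from the reply):
import os
from gensim.models.fasttext import FastText

corpus_path = f'data/{client}-corpus.txt'  # the full training corpus, not the vocab list

model = FastText(
    vector_size=300,
    min_count=5,             # the default; drops the rarest words
    sample=1e-4,             # more aggressive downsampling of very frequent words
    workers=os.cpu_count(),  # corpus_file mode scales roughly linearly with cores
)

# Build the vocabulary and train on the SAME corpus, with counts taken from it
model.build_vocab(corpus_file=corpus_path)
model.train(
    corpus_file=corpus_path,
    total_examples=model.corpus_count,
    total_words=model.corpus_total_words,
    epochs=5,                # far fewer than 100 for a large corpus
)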
I have to assess pairwise similarities of documents of different sizes (from 300 words to more than 200k words). To do so, I have created a procedure making use of the LSA algorithm as implemented in gensim. It includes these steps: document preprocessing, creating BoW vectors, applying TF-IDF weighting, finding topic distributions for documents using LSA, and computing pairwise similarities.
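In gensim terms, the procedure corresponds roughly to this sketch (identifiers and the number of topics are placeholders, not my exact settings):
from gensim import corpora, models, similarities

# `docs` is assumed to be the list of preprocessed, tokenized documents
dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

tfidf = models.TfidfModel(bow_corpus)
lsa = models.LsiModel(tfidf[bow_corpus], id2word=dictionary, num_topics=300)
lsa_corpus = lsa[tfidf[bow_corpus]]

# pairwise cosine similarities in the LSA topic space
index = similarities.MatrixSimilarity(lsa_corpus, num_features=lsa.num_topics)
pairwise = index[lsa_corpus]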
The results I have obtained so far are reasonable to the extent that I was able to verify similarities manually. Nevertheless, I have doubts about the methodological correctness of applying LSA to a corpus of documents of very different sizes. I suspect that LSA might find topic distributions for documents more accurately when documents in a corpus are of comparable lengths (e.g., between 100 and 1500 words), while having documents of very different sizes in the same corpus might reduce accuracy of topic assignment for some documents, leading to inadequate similarity assessment further down the pipeline.
I have looked up papers applying LSA to a similarly structured corpus or discussing this problem methodologically, but found no relevant insights. All papers I have found deal with corpora of similarly sized documents.
Could anybody please point me to relevant research dealing with this problem, reflect on this problem considering the inner workings of LSA, or simply share their own experience of dealing with corpora of documents of mixed sizes? Any insight would be appreciated.
If LSA indeed applies best to corpora of similarly sized documents, how can one apply it to a mixed-size corpus? As I see it, one option would be to split large documents into smaller parts, run the procedure, and then average the computed similarity values. If this would be a correct approach, please let me know.
I'm fairly new to NLP and trying to learn the techniques that can help me get my job done.
Here is my task: I have to classify stages of a drilling process based on text memos.
I have to classify labels for "Activity", "Activity Detail", "Operation" based on what's written in "Com" column.
I've been reading a lot of articles online and all the different kinds of techniques that I've read really confuses me.
The buzz words that I'm trying to understand are
Skip-gram (prediction based method, Word2Vec)
TF-IDF (frequency based method)
Co-Occurrence Matrix (frequency based method)
I have about 40,000 rows of data (pretty small, I know), and I came across an article saying that neural-net-based models like Skip-gram might not be a good choice when I have a small amount of training data. So I was also looking into frequency-based methods. Overall, I am unsure which technique is the best for me.
Here's what I understand:
Skip-gram: a technique used to represent words in a vector space. But I don't understand what to do next once I've vectorized my corpus.
TF-IDF: tells how important each word is in each sentence. But I still don't know how it can be applied to my problem.
Co-Occurrence Matrix: I don't really understand what it is.
All the three techniques are to numerically represent texts. But I am unsure what step I should take next to actually classify labels.
What approach and sequence of techniques should I use to tackle my problem? If there's any open-source Jupyter notebook project, or a link to an article (hopefully with code) that did a similar job, please share it here.
Let's make things a bit clearer. Your task is to create a system that will predict labels for given texts, right? And label prediction (classification) can't be done on unstructured data (texts), so you need to make your data structured, and then train your classifier and use it for inference. Therefore, you need two separate systems:
Text vectorizer (as you said, it helps to numerically represent texts).
Classifier (to predict the labels for numerically represented texts).
Skip-Gram and co-occurrence matrices are ways to vectorize your texts (here is a nice article that explains their difference). In the case of skip-gram, you could download and use a 3rd-party model that already has a mapping of vectors to most words; in the case of a co-occurrence matrix, you need to build it from your own texts (if you have domain-specific lexis, that can be the better way). In such a matrix you can use different measures to represent the degree of co-occurrence of words with words or documents with documents. TF-IDF is one such measure (it gives a score for every word-document pair); there are a lot of others (PMI, BM25, etc.). This article should help you implement classification with a co-occurrence matrix on your data. And this one gives an idea of how to do the same with Word2Vec.
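For example, a minimal TF-IDF + classifier baseline in scikit-learn could look like the sketch below (the file name and column names are assumptions based on your description):
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("drilling_memos.csv")  # hypothetical file with 'Com' and the label columns

X_train, X_test, y_train, y_test = train_test_split(
    df["Com"], df["Activity"], test_size=0.2, random_state=42
)

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # held-out accuracy
You could repeat this with the "Activity Detail" and "Operation" columns as targets (or use a multi-output setup), and only move to Word2Vec or co-occurrence features if this baseline isn't good enough.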
Hope it helped!
I am trying to apply the word2vec model implemented in the gensim 3.6 library, using Python 3.7 on a Windows 10 machine. I have a list of sentences (each sentence is a list of words) as input to the model, after performing preprocessing.
I have computed the results (obtaining the 10 most similar words for a given input word using model.wv.most_similar) in Anaconda's Spyder and then in the Sublime Text editor.
But I am getting different results for the same source code executed in the two editors.
Which result should I choose, and why?
I am attaching screenshots of the results obtained by running the same code in both Spyder and Sublime Text. The input word for which I need to obtain the 10 most similar words is #universe#.
I am really confused about how to choose between the results, and on what basis. Also, I have only recently started learning Word2Vec.
Any suggestion is appreciated.
Results Obtained in Spyder:
Results Obtained using Sublime Text:
The Word2Vec algorithm makes use of randomization internally. Further, when (as is usual for efficiency) training is spread over multiple threads, some additional order-of-presentation randomization is introduced. These mean that two runs, even in the exact same environment, can have different results.
If the training is effective – sufficient data, appropriate parameters, enough training passes – all such models should be of similar quality when doing things like word-similarity, even though the actual words will be in different places. There'll be some jitter in the relative rankings of words, but the results should be broadly similar.
That your results are vaguely related to 'universe' but not impressively so, and that they vary so much from one run to another, suggest there may be problems with your data, parameters, or quantity of training. (We'd expect the results to vary a little, but not that much.)
How much data do you have? (Word2Vec benefits from lots of varied word-usage examples.)
Are you retaining rare words, by making min_count lower than its default of 5? (Such words tend not to get good vectors, and also wind up interfering with the improvement of nearby words' vectors.)
Are you trying to make very-large vectors? (Smaller datasets and smaller vocabularies can only support smaller vectors. Too-large vectors allow 'overfitting', where idiosyncrasies of the data are memorized rather than generalized patterns learned. Or, they allow the model to continue improving in many different non-competitive directions, so model end-task/similarity results can be very different from run-to-run, even though each model is doing about as well as the other on its internal word-prediction tasks.)
Have you stuck with the default epochs=5 even with a small dataset? (A large, varied dataset requires fewer training passes - because all words appear many times, all throughout the dataset, anyway. If you're trying to squeeze results from thinner data, more epochs may help a little – but not as much as more varied data would.)
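As an aside, if you want run-to-run reproducibility while you investigate, here is a rough sketch (this assumes gensim 3.6's Word2Vec; fully identical runs also require a fixed PYTHONHASHSEED environment variable):
from gensim.models import Word2Vec

# `sentences` is assumed to be your list of tokenized sentences
model = Word2Vec(
    sentences,
    size=100,    # `size` in gensim 3.x; renamed to `vector_size` in gensim 4+
    seed=42,     # fix the random seed
    workers=1,   # a single worker avoids order-of-presentation randomness
)
print(model.wv.most_similar('universe', topn=10))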
I am working with the Gensim library to train some data files using doc2vec. When testing the similarity of one of the files using model.docvecs.most_similar("file"), I always get all results above 91% with almost no difference between them (which is not logical), because the files do not have similarities between them, so the results are inaccurate.
Here is the code for training the model
model = gensim.models.Doc2Vec(vector_size=300, min_count=0, alpha=0.025, min_alpha=0.00025,dm=1)
model.build_vocab(it)
for epoch in range(100):
    model.train(it, epochs=model.iter, total_examples=model.corpus_count)
    model.alpha -= 0.0002
    model.min_alpha = model.alpha
model.save('doc2vecs.model')
model_d2v = gensim.models.doc2vec.Doc2Vec.load('doc2vecs.model')
sim = model_d2v.docvecs.most_similar('file1.txt')
print sim
This is the output result:
[('file2.txt', 0.9279470443725586), ('file6.txt', 0.9258157014846802), ('file3.txt', 0.92499840259552), ('file5.txt', 0.9209873676300049), ('file4.txt', 0.9180108308792114), ('file7.txt', 0.9141069650650024)]
What am I doing wrong? How could I improve the accuracy of the results?
What is your it data, and how is it prepared? (For example, what does print(iter(it).next()) do, especially if you call it twice in a row?)
By calling train() 100 times, while also retaining the default model.iter of 5, you're actually making 500 passes over the data. The first 5 passes will use train()'s internal, effective alpha management to lower the learning rate gradually to your declared min_alpha value. Then your next 495 passes will run at your own clumsily managed alpha rates, first back up near 0.025 and then lowered with each batch-of-5 until you reach 0.005.
None of that is a good idea. You can just call train() once, passing it your desired number of epochs. A typical number of epochs in published work is 10-20. (A bit more might help with a small dataset, but if you think you need hundreds, something else is probably wrong with the data or setup.)
If it's a small amount of data, you won't get very interesting Word2Vec/Doc2Vec results, as these algorithms depend on lots of varied examples. Published results tend to use training sets with tens-of-thousands to millions of documents, and each document at least dozens, but preferably hundreds, of words long. With tinier datasets, sometimes you can squeeze out adequate results by using more training passes, and smaller vectors. Also using the simpler PV-DBOW mode (dm=0) may help with smaller corpuses/documents.
The values reported by most_similar() are not similarity "percentages". They're cosine-similarity values, from -1.0 to 1.0, and their absolute values are less important than the relative ranks of different results. So it shouldn't matter if there are a lot of results with >0.9 similarities – as long as those documents are more like the query document than those lower in the rankings.
Looking at the individual documents suggested as most-similar is thus the real test. If they seem like nonsense, it's likely there are problems with your data or its preparation, or training parameters.
For datasets with sufficient, real natural-language text, it's typical for higher min_count values to give better results. Real text tends to have lots of low-frequency words that don't imply strong things without many more examples, and thus keeping them during training serves as noise making the model less strong.
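Putting those points together, a corrected training sketch (the parameter values are illustrative, not tuned for your data):
import gensim

# `it` is assumed to be a restartable iterable of TaggedDocument objects
model = gensim.models.doc2vec.Doc2Vec(
    vector_size=300,
    min_count=5,  # discard the rarest words
    dm=0,         # PV-DBOW; often works better on small corpora
    epochs=20,    # one train() call with a typical epoch count
)
model.build_vocab(it)
model.train(it, total_examples=model.corpus_count, epochs=model.epochs)
model.save('doc2vecs.model')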
Without knowing the contents of the documents, here are two hints that might help you.
Firstly, 100 epochs will probably be too small for the model to learn the differences.
Also, check the contents of the documents against the corpus you are using, and make sure the vocabulary is relevant for your files.