I'm trying to build a vectorizer for a text mining problem. The vocabulary should be fitted from the given files. However, the number of files that will build the dictionary vocabulary_ is relatively large (say 10^5). Is there a simple way to parallelize that?
Update: As I found out, there is a "manual" way... Unfortunately, it only works for min_df=1. Let me describe by example what I do for two cores:
Split your input into two chunks. Train vectorizers (say vec1 and vec2), each on one core and on one chunk of your data (I used multiprocessing.Pool). Then,
# Use sets to dedupe tokens
vocab = set(vec1.vocabulary_) | set(vec2.vocabulary_)
# Create final vectorizer with given vocabulary
final_vec = CountVectorizer(vocabulary=vocab)
# Create the dictionary final_vec.vocabulary_
final_vec._validate_vocabulary()
will do the job.
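For reference, here is a rough end-to-end sketch of that two-core workflow. It assumes the input is a plain list of raw text documents; the helper function, chunking scheme, and placeholder documents are illustrative, while the merging and _validate_vocabulary() steps follow the recipe above:
from multiprocessing import Pool
from sklearn.feature_extraction.text import CountVectorizer

def fit_chunk(chunk):
    # fit a vectorizer on one chunk and return its vocabulary as a set
    return set(CountVectorizer(min_df=1).fit(chunk).vocabulary_)

if __name__ == '__main__':
    documents = ["first text ...", "second text ...", "third text ...", "fourth text ..."]
    chunks = [documents[::2], documents[1::2]]      # two chunks, one per core

    with Pool(2) as pool:
        vocabs = pool.map(fit_chunk, chunks)        # train vec1/vec2 in parallel

    vocab = set.union(*vocabs)                      # dedupe and merge the vocabularies
    final_vec = CountVectorizer(vocabulary=vocab)
    final_vec._validate_vocabulary()                # builds final_vec.vocabulary_
    X = final_vec.transform(documents)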
You can use MLlib, the machine learning library included in Apache Spark, which will handle the distribution across nodes.
Here's a tutorial on how to use it for feature extraction.
https://spark.apache.org/docs/latest/mllib-feature-extraction.html
You can also check the scikit-learn documentation on how to optimize for speed to get some inspiration.
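If you do go down the Spark road, a rough sketch of fitting a vocabulary across a cluster could look like the following. Note it uses the DataFrame-based pyspark.ml API rather than the RDD-based mllib API shown in the linked tutorial, and the input path and column names are only illustrative:
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer

spark = SparkSession.builder.appName("fit-vocabulary").getOrCreate()

# each line of the input files becomes one document row here; adapt to your own layout
df = spark.read.text("hdfs:///path/to/corpus/*.txt").withColumnRenamed("value", "text")

tokens = Tokenizer(inputCol="text", outputCol="tokens").transform(df)
cv_model = CountVectorizer(inputCol="tokens", outputCol="features").fit(tokens)

print(len(cv_model.vocabulary))   # vocabulary fitted across the whole cluster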
Good evening, I have a relatively simple question that primarily comes from my inexperience with Python. I would like to extract word embeddings for a list of words. Here I have created a simple list:
list_word = [['Word'],
             ['ant'],
             ['bear'],
             ['beaver'],
             ['bee'],
             ['bird']]
Then load gensim and other required libraries:
#import tweepy # Obtain Tweets via API
import re # Obtain expressions
from gensim.models import Word2Vec # Import gensim Word2Vec
Now when I use the Word2Vec function I run the following:
#extract embedding length 12
model = Word2Vec(list_word, min_count = 3, size = 12)
print(model)
When the model is run I then see that the vocab size is 1, when it should not be. The output is the following:
Word2Vec(vocab=1, size=12, alpha=0.025)
I imagine that the imported data is not in the correct format and could use some advice or even example code on how to transform it into the correct format. Thank you for your help.
Your list_word, 6 sentences each with a single word, is insufficient to train Word2Vec, which requires a lot of varied, realistic text data. Among other problems:
- words that only appear once will be ignored due to the min_count=3 setting (& it's not a good idea to lower that parameter)
- single-word sentences have none of the nearby-word contexts the algorithm uses
- getting good 'dense' vectors requires a vocabulary far larger than the vector dimensionality, and many varied examples of each word's use with other words
Try using a larger dataset, and you'll see more realistic results. Also, enabling Python logging at the INFO level will show a lot of progress as the code runs - and perhaps hint at issues, as you'll notice whether steps happen with reasonable counts & delays.
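As a rough illustration of the expected input shape and the logging suggestion (this uses gensim's tiny bundled common_texts corpus purely to show the format; a real corpus should be far larger, and in gensim 4+ the size parameter is called vector_size):
import logging
from gensim.models import Word2Vec
from gensim.test.utils import common_texts  # list of tokenized, multi-word sentences

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

# each training item is a list of several tokens, not a single word;
# min_count=1 only because this toy corpus is tiny
model = Word2Vec(common_texts, min_count=1, size=12)  # vector_size=12 in gensim 4+
print(model)  # vocab is now larger than 1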
I am trying to get the best features for my data for classification. For this I want to try feature selection using SVM, KNN, LDA and QDA.
Also, the way to test this data is a leave-one-out approach, not cross-validation by splitting the data into parts (basically I can't split one file/matrix, but have to leave one file out for testing while training with the other files).
I tried using sfs with SVM in MATLAB but keep getting only the first feature and nothing else (there are 254 features).
Is there any way to do this in Python or MATLAB?
If you're trying to code the feature selector from scratch, I think you should first get a deeper grounding in the theory of your algorithm of choice.
But if you're looking for a way to get results faster, scikit-learn provides you with a variety of tools for feature selection. Have a look at this page.
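For example, a hedged sketch of forward sequential feature selection with a linear SVM, scored with leave-one-out, might look like this in scikit-learn (SequentialFeatureSelector needs scikit-learn >= 0.24; the data here is a random placeholder, and if each of your files contributes several rows you would need a group-aware splitting strategy instead):
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut
from sklearn.feature_selection import SequentialFeatureSelector

# placeholder data: one row per file, 254 features each
X = np.random.rand(40, 254)
y = np.random.randint(0, 2, size=40)

svm = SVC(kernel='linear')   # swap in KNN/LDA/QDA estimators to compare
selector = SequentialFeatureSelector(svm,
                                     n_features_to_select=10,  # pick how many you want
                                     direction='forward',
                                     cv=LeaveOneOut())         # one fit per left-out sample, so this can be slow
selector.fit(X, y)
print(np.flatnonzero(selector.get_support()))  # indices of the selected features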
I am trying to understand what is going wrong in the following example.
To train on the 'text8' dataset as described in the docs, one only has to do the following:
import gensim.downloader as api
from gensim.models import Word2Vec
dataset = api.load('text8')
model = Word2Vec(dataset)
Doing this gives very good embedding vectors, as verified by evaluating on a word-similarity task.
However, when manually loading the same text file that is used above, as in
text_path = '~/gensim-data/text8/text'
text = []
with open(text_path) as file:
    for line in file:
        text.extend(line.split())
text = [text]
model = Word2Vec(text)
The model still says it's training for the same number of epochs as above (5), but training is much faster, and the resulting vectors perform very, very badly on the similarity task.
What is happening here? I suppose it could have to do with the number of 'sentences', but the text8 file seems to have only a single line, so does gensim.downloader split the text8 file into sentences? If yes, of which length?
In your second example, you've created a training dataset with just a single text containing the entire contents of the file. That's roughly 17 million word tokens, in a single list.
Word2Vec (& other related algorithms) in gensim have an internal implementation limitation, in their optimized paths, of 10,000 tokens per text item. All additional tokens are ignored.
So, in your 2nd case, over 99% of your data is being discarded. Training may seem instant, but very little actual training will have occurred. (Word-vectors for words that only appear past the 1st 10,000 tokens won't have been trained at all, having only their initial randomly-set values.) If you enable logging at the INFO level, you'll see more details about each step of the process, and discrepancies like this may be easier to identify.
Yes, the api.load() variant takes extra steps to break the single-line file into 10,000-token chunks. I believe it's using the LineSentence utility class for this purpose, whose source can be examined here:
https://github.com/RaRe-Technologies/gensim/blob/e859c11f6f57bf3c883a718a9ab7067ac0c2d4cf/gensim/models/word2vec.py#L1209
However, I recommend avoiding the api.load() functionality entirely. It doesn't just download data; it also downloads a shim of additional outside-of-version-control Python code for prepping that data for extra operations. Such code is harder to browse & less well-reviewed than official gensim release code as packaged for PyPI/etc, which also presents a security risk. Each load target (by name like 'text8') might do something different, leaving you with a different object type as the return value.
It's much better for understanding to directly download precisely the data files you need, to known local paths, and do the IO/prep yourself, from those paths, so you know what steps have been applied, and the only code you're running is the officially versioned & released code.
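For reference, a minimal sketch of doing that prep yourself, assuming the text8 file has already been downloaded and unpacked to the path from the question; the 10,000-token chunking mirrors what the downloader's loader does:
import os
from gensim.models import Word2Vec

# illustrative path; point this at wherever you downloaded/unpacked text8
text_path = os.path.expanduser('~/gensim-data/text8/text')
with open(text_path) as f:
    tokens = f.read().split()

# break the single token stream into texts of at most 10,000 tokens,
# so nothing falls past gensim's per-text limit
max_len = 10000
sentences = [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

model = Word2Vec(sentences)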
I am using online LDA to perform a topic modeling task. I am using the core code based on the original online LDA paper (Hoffman, Blei and Bach, "Online Learning for Latent Dirichlet Allocation," NIPS, 2010), and the code is available at https://github.com/blei-lab/onlineldavb.
I am using a train set of ~167000 documents. The code generates lambda files as output, which I use to generate the topics (https://github.com/wellecks/online_lda_python, printtopics.py). But I am not sure how I can use it to find topics on new test data (similar to model.get_document_topics in gensim).
Please help to resolve my confusion.
Follow the same data processing steps on the test data (i.e. tokenization etc.) and then use your training-data vocabulary to transform the test data into a gensim corpus.
Once you have the test corpus, use the LDA model to find the document-topic distribution. Hope this helps.
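If you do go the gensim route (i.e., a gensim LdaModel rather than the onlineldavb code), the workflow described above might look like this hedged sketch, with illustrative file names and minimal preprocessing:
from gensim.corpora import Dictionary
from gensim.models import LdaModel

dictionary = Dictionary.load('train.dict')   # vocabulary built on the training data
lda = LdaModel.load('train.lda')             # LDA model trained on that corpus

test_documents = ["some unseen document text", "another unseen document"]
test_tokens = [doc.lower().split() for doc in test_documents]   # same preprocessing as training
test_corpus = [dictionary.doc2bow(tokens) for tokens in test_tokens]

doc_topics = [lda.get_document_topics(bow) for bow in test_corpus]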
In the code you already have, there is enough to do this. What you have is the lambda (the word-topic matrix); what you want to compute is the gamma (the document-topic matrix).
All you need to do is call OnlineLDA.do_e_step on the documents; the results are the topic vectors. Performance might be improved by stripping out the sstats from it, as those are only needed to update the lambda. The result would be a function that only infers the topic vectors for the model.
You don't need to update the model, since you aren't training it, which is what update_lambda does after calling do_e_step.
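A hedged sketch of that inference-only path, assuming the OnlineLDA constructor, dirichlet_expectation helper, and do_e_step behave as in the reference implementation (in some versions the raw-document variant is named do_e_step_docs); the file names, internal attribute updates, and hyperparameters are illustrative and should match whatever you used for training:
import numpy as np
import onlineldavb

# rebuild an OnlineLDA object with the same settings used for training
vocab = open('./dictnostops.txt').readlines()   # the vocabulary file used in training
K, D = 100, 167000                              # topic count and corpus size from training
olda = onlineldavb.OnlineLDA(vocab, K, D, 1./K, 1./K, 1024., 0.7)

# replace the randomly initialized lambda with the trained one, and refresh
# the cached expectations that do_e_step relies on (internal attributes)
olda._lambda = np.loadtxt('lambda-final.dat')
olda._Elogbeta = onlineldavb.dirichlet_expectation(olda._lambda)
olda._expElogbeta = np.exp(olda._Elogbeta)

test_docs = ["first unseen document ...", "second unseen document ..."]
gamma, _sstats = olda.do_e_step(test_docs)              # E-step only; no update_lambda
doc_topics = gamma / gamma.sum(axis=1, keepdims=True)   # per-document topic proportions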
I want to use my own algorithm to extract features from training data and then fit and transform using CountVectorizer in scikit-learn.
Currently I am doing:
from sklearn.feature_extraction.text import CountVectorizer
cvect_obj = CountVectorizer()
vects = cvect_obj.fit_transform(training_data)
fit_transform(training_data) automatically extracts features and transforms them, but I want to use my own algorithm to extract features.
Actually, it is not really possible to do this with CountVectorizer directly. As a rule, scikit-learn only adds well-established algorithms: a rule of thumb is at least 3 years since publication, 200+ citations, and wide use and usefulness. A technique that provides a clear-cut improvement (e.g. an enhanced data structure or efficient approximation) on a widely-used method will also be considered for inclusion.
Moreover, your implementation doesn't need to be in scikit-learn to be used together with scikit-learn tools. Implement your favorite algorithm in a scikit-learn compatible way, upload it to GitHub, and it can be listed under Related Projects.
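To make that concrete, here's a minimal sketch of what "scikit-learn compatible" usually means: expose fit/transform and subclass the sklearn mixins so the estimator gains fit_transform and can sit inside a Pipeline. The feature logic here is only a placeholder for your own algorithm:
from scipy.sparse import csr_matrix
from sklearn.base import BaseEstimator, TransformerMixin

class MyExtractor(BaseEstimator, TransformerMixin):
    """Placeholder for your own feature-extraction algorithm."""

    def fit(self, raw_documents, y=None):
        # learn whatever state your algorithm needs from the raw documents
        return self

    def transform(self, raw_documents):
        # return an (n_documents x n_features) sparse matrix;
        # here a dummy single all-zero feature, just to show the contract
        return csr_matrix((len(raw_documents), 1))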
As you can't alter the sklearn core, you can always keep your own feature extraction. Just make sure it produces the kind of input most of the numeric modules in sklearn deal with: a sparse matrix, such as scipy.sparse.csr_matrix.
All you need is a method or a module that takes data in raw form (say, a sentence) and converts it into a sparse matrix. A basic skeleton I would write would be:
from scipy.sparse import csr_matrix

class MyFeatureExtractor:
    def __init__(self):
        self.vocabulary_ = {}  # map between word and feature index

    def fit(self, sentences, y=None):
        # learn the words after a basic NLP pipeline; build the word -> index map
        for sentence in sentences:
            for word in sentence.split():
                self.vocabulary_.setdefault(word, len(self.vocabulary_))
        return self

    def transform(self, sentences):
        # for each sentence, build a sparse count vector of vocabulary length
        rows, cols = [], []
        for i, sentence in enumerate(sentences):
            for word in sentence.split():
                if word in self.vocabulary_:
                    rows.append(i); cols.append(self.vocabulary_[word])
        return csr_matrix(([1] * len(rows), (rows, cols)),
                          shape=(len(sentences), len(self.vocabulary_)))
Now you can use your MyFeatureExtractor to fit and transform, just like regular sklearn modules.
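For instance (a hypothetical usage with placeholder training data):
training_data = ["the cat sat on the mat", "the dog barked"]

extractor = MyFeatureExtractor().fit(training_data)
X_train = extractor.transform(training_data)          # scipy sparse count matrix
X_new = extractor.transform(["a new cat sentence"])   # reuses the learned vocabulary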