How to reduce topic classification time in textblob naive bayes classifier - python

I am using pickle to save a Naive Bayes classification model. After training on 5,600 records, the saved file is 2.1 GB. Loading that file takes nearly 2 minutes, and classifying a piece of text takes 5.5 minutes. I am using the following code to load the model and classify:
import pickle

# load the trained classifier from disk
classifierPickle = pickle.load(open("classifier.pickle", "rb"))
# classify a piece of text; returns the predicted topic (category)
classifierPickle.classify("want to go some beautiful work place")
The first line loads the pickled object and the second classifies the text, returning which topic (category) it belongs to. I am using the following code to save the model:
file = open('C:/burberry_model/classifier.pickle', 'wb')
pickle.dump(classifier, file, -1)  # classifier is the trained textblob NaiveBayesClassifier
file.close()
Everything I am using comes from textblob. The environment is Windows, 28 GB RAM, a four-core CPU. It would be very helpful if anyone could resolve this issue.

Since textblob is built on top of NLTK, it is a pure-Python implementation, which reduces its speed by a huge margin. Secondly, since your pickle file is 2.1 GB, it expands to even more than that when loaded directly into RAM, increasing the time further.
Also, since you're using Windows, Python is relatively slower there. If speed is a main concern for you, it would be useful to do the feature selection and vector construction with textblob/NLTK and then use scikit-learn's Naive Bayes classifier, which has C bindings, so I'm guessing it would be significantly faster.
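As a rough sketch of that route (the sample data and variable names below are illustrative, not from the question): vectorize the text with scikit-learn, train its MultinomialNB, and persist both pieces with joblib, which typically produces a far smaller file than pickling an NLTK/textblob classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import joblib

# placeholder training data; in practice these would be your 5,600 records
train_texts = ["want to go some beautiful work place", "quarterly sales figures improved"]
train_labels = ["travel", "business"]

vectorizer = CountVectorizer()                 # bag-of-words feature extraction
X = vectorizer.fit_transform(train_texts)

clf = MultinomialNB()
clf.fit(X, train_labels)

# persist the vectorizer and the classifier together
joblib.dump((vectorizer, clf), "classifier.joblib")

# later: load and classify
vectorizer, clf = joblib.load("classifier.joblib")
print(clf.predict(vectorizer.transform(["want to go some beautiful work place"])))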

Related

Inconsistent results when training gensim model with gensim.downloader vs manual loading

I am trying to understand what is going wrong in the following example.
To train on the 'text8' dataset as described in the docs, one only has to do the following:
import gensim.downloader as api
from gensim.models import Word2Vec
dataset = api.load('text8')
model = Word2Vec(dataset)
Doing this gives very good embedding vectors, as verified by evaluating on a word-similarity task.
However, when loading the same textfile which is used above manually, as in
text_path = '~/gensim-data/text8/text'
text = []
with open(text_path) as file:
    for line in file:
        text.extend(line.split())
text = [text]
model = Word2Vec(text)
The model still says it's training for the same number of epochs as above (5), but training is much faster, and the resulting vectors perform very, very poorly on the similarity task.
What is happening here? I suppose it could have to do with the number of 'sentences', but the text8 file seems to have only a single line, so does gensim.downloader split the text8 file into sentences? If so, into chunks of what length?
In your second example, you've created a training dataset with just a single text with the entire contents of the file. That's about 1.1 million word tokens, in a single list.
Word2Vec (& other related algorithms) in gensim have an internal implementation limitation, in their optimized paths, of 10,000 tokens per text item. All additional tokens are ignored.
So, in your 2nd case, 99% of your data is being discarded. Training may seem instant, but very little actual training will have occurred. (Word-vectors for words that only appear past the 1st 10,000 tokens won't have been trained at all, having only their initial randomly-set values.) If you enable logging at the INFO level, you'll see more details about each step of the process, and discrepancies like this may be easier to identify.
Yes, the api.load() variant takes extra steps to break the single-line-file into 10,000-token chunks. I believe it's using the LineSentence utility class for this purpose, whose source can be examined here:
https://github.com/RaRe-Technologies/gensim/blob/e859c11f6f57bf3c883a718a9ab7067ac0c2d4cf/gensim/models/word2vec.py#L1209
However, I recommend avoiding the api.load() functionality entirely. It doesn't just download data; it also downloads a shim of additional outside-of-version-control Python code for prepping that data for extra operations. Such code is harder to browse & less well-reviewed than official gensim release code as packaged for PyPI/etc, which also presents a security risk. Each load target (by name like 'text8') might do something different, leaving you with a different object type as the return value.
It's much better for understanding to directly download precisely the data files you need, to known local paths, and do the IO/prep yourself, from those paths, so you know what steps have been applied, and the only code you're running is the officially versioned & released code.
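A minimal sketch of that manual approach for text8 (the local path is an assumption; the 10,000-token chunking mirrors the per-text limit described above, and gensim's Text8Corpus helper does essentially the same thing):
from gensim.models import Word2Vec

# assumes you have downloaded and unzipped text8 to a known local path yourself
text8_path = '/path/to/text8'

with open(text8_path) as f:
    tokens = f.read().split()      # the whole file is one long line of space-separated tokens

# break the single token stream into texts of at most 10,000 tokens each,
# so no tokens are silently dropped by the internal per-text limit
sentences = [tokens[i:i + 10000] for i in range(0, len(tokens), 10000)]

model = Word2Vec(sentences)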

Optimizing RAM usage when training a learning model

I have been working on creating and training a deep learning model for the first time. I did not have any knowledge of the subject prior to the project, and therefore my knowledge is limited even now.
I used to run the model on my own laptop, but after implementing a well-working OHE (one-hot encoding) and SMOTE I simply couldn't run it on my own device anymore due to a MemoryError (8 GB of RAM). Therefore I am currently running the model on a 30 GB RAM RDP, which I thought would let me do so much more.
My code seems to have some horrible inefficiencies, and I wonder if they can be solved. One example: using pandas.concat makes my model's RAM usage skyrocket from 3 GB to 11 GB, which seems very extreme. Afterwards I drop a few columns, making the RAM spike to 19 GB, but it actually returns to 11 GB after the computation is completed (unlike the concat). I have also forced myself to stop using SMOTE for now, just because the RAM usage goes up way too much.
At the end of the code, where the training happens, the model breathes its final breath while trying to fit. What can I do to optimize this?
I have thought about splitting the code into multiple parts (for example preprocessing and training), but to do so I would need to store massive datasets in a pickle, which can only reach 4 GB (correct me if I'm wrong). I have also given thought to using pre-trained models, but I truly did not understand how that process works or how to use one in Python.
P.S.: I would also like my SMOTE back if possible
Thank you all in advance!
Let's analyze the steps:
Step 1: OHE
For your OHE, the only dependence between data points is that it needs to be clear which categories exist overall. So the OHE can be broken into two steps, neither of which requires all data points to be in RAM.
Step 1.1: determine categories
Stream read your data points, collecting all the categories. It is not necessary to save the data points you read.
Step 1.2: transform data
After step 1.1, each data point can be independently converted. So stream read, convert, stream write. You only need one or very few data points in memory at all times.
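A minimal sketch of those two passes, assuming a CSV input with a single categorical column named 'color' (the file and column names are placeholders):
import csv

# Pass 1 (Step 1.1): stream through the file once to collect the categories
categories = set()
with open('data.csv', newline='') as f:
    for row in csv.DictReader(f):
        categories.add(row['color'])
categories = sorted(categories)

# Pass 2 (Step 1.2): stream again, writing one one-hot-encoded row at a time
with open('data.csv', newline='') as f_in, open('data_ohe.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    writer.writerow(['color_' + c for c in categories])
    for row in csv.DictReader(f_in):
        writer.writerow([1 if row['color'] == c else 0 for c in categories])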
Step 1.3: feature selection
It may be worthwhile to look at feature selection to reduce the memory footprint and improve performance. This answer argues it should happen before SMOTE.
Feature selection methods based on entropy depend on all data. While you can probably also throw something together that streams, one approach that worked well for me in the past is removing features that only one or two data points have, since these features definitely have low entropy and probably don't help the classifier much. This can be done again like Steps 1.1 and 1.2.
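A sketch of that frequency filter in the same streaming style, applied to the one-hot-encoded file from the previous sketch (the threshold and file names are placeholders):
import csv
from collections import Counter

# Pass 1: count in how many rows each binary feature is non-zero
counts = Counter()
with open('data_ohe.csv', newline='') as f:
    for row in csv.DictReader(f):
        counts.update(name for name, value in row.items() if value == '1')

# keep only features seen in at least 3 rows
kept = [name for name in counts if counts[name] >= 3]

# Pass 2: rewrite the file with only the kept columns
with open('data_ohe.csv', newline='') as f_in, open('data_filtered.csv', 'w', newline='') as f_out:
    writer = csv.DictWriter(f_out, fieldnames=kept, extrasaction='ignore')
    writer.writeheader()
    for row in csv.DictReader(f_in):
        writer.writerow(row)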
Step 2: SMOTE
I don't know SMOTE well enough to give an answer, but maybe the problem solves itself once you do feature selection. In any case, save the resulting data to disk so you do not need to recompute it for every training run.
Step 3: training
See if the training can be done in batches or streaming (online, basically), or simply with less sampled data.
With regards to saving to disk: Use a format that can be easily streamed, like csv or some other splittable format. Don't use pickle for that.
Slightly orthogonal to your actual question: if your high RAM usage is caused by having the entire dataset in memory for training, you could eliminate that memory footprint by reading and storing only one batch at a time: read a batch, train on this batch, read the next batch, and so on.
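A sketch of that batch-at-a-time idea with a Keras model, assuming the preprocessed data lives in a CSV with a 'label' column (the architecture, file name, and sizes are placeholders):
import pandas as pd
import tensorflow as tf

def batch_generator(path, batch_size=1024):
    # reads the CSV in chunks so only one batch is ever in memory
    while True:  # Keras expects the generator to loop indefinitely
        for chunk in pd.read_csv(path, chunksize=batch_size):
            X = chunk.drop(columns=['label']).to_numpy(dtype='float32')
            y = chunk['label'].to_numpy(dtype='float32')
            yield X, y

n_features = 100          # placeholder: number of columns after OHE / feature selection
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(n_features,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')

model.fit(batch_generator('data_filtered.csv'), steps_per_epoch=500, epochs=5)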

gensim doc2vec give non-determined result

I am using the Doc2Vec model in gensim python library.
Every time I feed the model the same sentence data and set the seed parameter of Doc2Vec to a fixed number, the model gives different vectors after it is built.
For testing purposes, I need a deterministic result every time I give it unchanged input data. I have searched a lot and have not found a way to keep gensim's result unchanged.
Is there anything wrong with the way I am using it? Thanks in advance for replying.
Here is my code:
from gensim.models.doc2vec import Doc2Vec
model = Doc2Vec(sentences, dm=1, dm_concat=1, size=100, window=5, hs=0, min_count=10, seed=64)
result = model.docvecs
The Doc2Vec algorithm makes use of randomness in both initialization and training, and efficient multi-threaded training introduces more randomness because the batches across multiple threads won't necessarily be trained against in the same order from run to run.
If the model is training well, the jitter in results from run-to-run shouldn't be large, and the quality of downstream assessments shouldn't vary much. If the quality-of-results does vary a lot, there are likely other problems with the application of the algorithm to your data or training.
Separately: you almost certainly don't want to be using the non-default dm_concat=1 mode. It results in a much larger, much slower-to-train model, and there aren't any clear public examples of it being worth that extra cost. (I'd only try it if I had a strong baseline result from more simple modes, and lots and lots of data and time.)
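For what it's worth, if you need exact run-to-run reproducibility for tests, my understanding is that you have to accept slower single-threaded training and also pin Python's hash randomization. A sketch, using current gensim 4.x argument names (the toy documents are placeholders):
# run with a fixed hash seed, e.g.:  PYTHONHASHSEED=0 python train.py
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(words=['some', 'example', 'words'], tags=[0]),
             TaggedDocument(words=['another', 'example', 'sentence'], tags=[1])]

model = Doc2Vec(documents,
                vector_size=100, window=5, hs=0, min_count=1,
                seed=64,
                workers=1)     # a single worker thread removes ordering jitter between runs

print(model.dv[0])             # the same vector on every run under the conditions above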

How to make Naive Bayes classifier for large datasets in python

I have large datasets of 2-3 GB. I am using the NLTK Naive Bayes classifier with this data as training data. When I run the code on small datasets it runs fine, but on the large datasets it runs for a very long time (more than 8 hours) and then crashes without much of an error. I believe it is because of a memory issue.
Also, after classifying the data I want the classifier dumped to a file so that it can be used later on test data. This process also takes too much time and then crashes, as it loads everything into memory first.
Is there a way to resolve this?
Another question: is there a way to parallelize this whole operation, i.e. parallelize the classification of this large dataset using a framework like Hadoop/MapReduce?
I think you need to increase the memory dynamically to overcome this problem. I hope these links help you:
Python Memory Management
Parallelism in Python

Saving scikit-learn classifier causes memory error

My machine has 16 GB of RAM and the training program uses up to 2.6 GB of memory.
But when I want to save the classifier (trained with sklearn.svm.SVC on a large dataset) as a pickle file, it consumes more memory than my machine can give. I am eager to know of any alternative approaches for saving a classifier.
I've tried:
pickle and cPickle
Dump as w or wb
Set fast = True
None of them work; they always raise a MemoryError. Occasionally the file was saved, but loading it causes ValueError: insecure string pickle.
Thank you in advance!
Update
Thank you all. I didn't try joblib; it works after setting protocol=2.
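For reference, the two approaches mentioned in this update look roughly like this (a tiny SVC stands in for the real model trained on the large dataset; the file names are placeholders):
import pickle
import joblib
from sklearn.svm import SVC

# a tiny stand-in for the real classifier
clf = SVC().fit([[0, 0], [1, 1]], [0, 1])

# pickle with an explicit binary protocol (protocol=2), which is what worked here
with open('classifier.pickle', 'wb') as f:
    pickle.dump(clf, f, protocol=2)

# the alternative: joblib, generally recommended for scikit-learn estimators
joblib.dump(clf, 'classifier.joblib')
clf = joblib.load('classifier.joblib')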
I would suggest using the out-of-core classifiers from scikit-learn. These are batch-learning algorithms that store the model output as a compressed sparse matrix and are very time efficient.
To start with, the following link really helped me.
http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html
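The core pattern from that example, roughly (the batch source, feature count, and classes below are placeholders):
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)   # stateless, so no fit pass over all data is needed
clf = SGDClassifier()                               # supports incremental learning via partial_fit
classes = [0, 1]                                    # all classes must be known up front

def iter_batches():
    # placeholder: yield (texts, labels) batches streamed from disk
    yield (["an example document", "another document"], [0, 1])

for texts, labels in iter_batches():
    X = vectorizer.transform(texts)
    clf.partial_fit(X, labels, classes=classes)     # classes is only required on the first call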
