How to make a Naive Bayes classifier for large datasets in Python

I have large datasets of 2-3 GB. I am using the NLTK Naive Bayes classifier, with this data as training data. When I run the code on small datasets it runs fine, but on large datasets it runs for a very long time (more than 8 hours) and then crashes without much of an error message. I believe this is because of a memory issue.
Also, after training I want the classifier dumped to a file so that it can be used later on test data. This process also takes too much time and then crashes, as it loads everything into memory first.
Is there a way to resolve this?
Another question: is there a way to parallelize this whole operation, i.e. parallelize the classification of this large dataset using some framework like Hadoop/MapReduce?

You may need to manage memory dynamically to overcome this problem. I hope these links help you:
Python Memory Management
Parallelism in Python
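If the NLTK classifier cannot hold the whole training set in memory, one workaround (a sketch only, and it swaps NLTK for scikit-learn, which the question does not mention) is to train a Naive Bayes model incrementally with partial_fit, reading the data in chunks. The file format, label set, and the iter_chunks helper below are assumptions:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB
import pickle

def iter_chunks(path, chunk_size=10000):
    # hypothetical helper: yields (texts, labels) chunks from a tab-separated
    # file so the full 2-3 GB dataset is never loaded at once
    texts, labels = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, text = line.rstrip("\n").split("\t", 1)
            texts.append(text)
            labels.append(label)
            if len(texts) == chunk_size:
                yield texts, labels
                texts, labels = [], []
    if texts:
        yield texts, labels

vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = MultinomialNB()
all_labels = ["pos", "neg"]               # assumed label set; adjust to your data

for texts, labels in iter_chunks("train.tsv"):
    X = vectorizer.transform(texts)        # sparse matrix, small memory footprint
    clf.partial_fit(X, labels, classes=all_labels)

# the fitted model itself is small, so dumping it to disk is cheap
with open("nb_model.pickle", "wb") as f:
    pickle.dump(clf, f, protocol=2)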

Related

Python unable to access multiple GB of my RAM?

I'm writing a machine learning project for fun, but I have run into an interesting error that I can't seem to fix. I'm using sklearn (LinearSVC, train_test_split), numpy, and a few other small libraries like collections.
The project is a comment classifier: you put in a comment, it spits out a classification. The problem I'm running into is a MemoryError (Unable to allocate 673. MiB for an array with shape (7384, 11947) and data type float64) when doing a train_test_split to check the classifier accuracy, specifically when I call model.fit.
There are 11,947 unique words that my program finds, and I have a large training sample (14,769), but I've never had an issue where I run out of RAM. The problem is, I'm not running out of RAM: I have 32 GB, but the program ends up using less than 1 GB before it gives up.
Is there something obvious I'm missing?
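For what it's worth, 673 MiB is exactly the size of a dense float64 matrix with that shape, so somewhere a sparse matrix is probably being converted to a dense array (for example via .toarray()). Below is a minimal sketch of keeping everything sparse, with placeholder data and an assumed TfidfVectorizer front end, since the question does not say how the word features are built:

import numpy as np

# the reported allocation matches a dense float64 matrix of that shape:
print(7384 * 11947 * 8 / 2**20)   # ~673.0 MiB

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

comments = ["great video", "terrible video", "great content", "bad content"]  # placeholder data
labels = ["pos", "neg", "pos", "neg"]

X = TfidfVectorizer().fit_transform(comments)       # scipy sparse matrix
X_train, X_test, y_train, y_test = train_test_split(X, labels, random_state=0)
LinearSVC().fit(X_train, y_train)                   # LinearSVC accepts sparse input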

Optimizing RAM usage when training a learning model

I have been working on creating and training a deep learning model for the first time. I did not have any knowledge about the subject prior to the project, and therefore my knowledge is limited even now.
I used to run the model on my own laptop, but after implementing a well-working OHE and SMOTE I simply couldn't run it on my own device anymore due to a MemoryError (8 GB of RAM). Therefore I am currently running the model on a 30 GB RAM RDP, which allows me to do so much more, I thought.
My code seems to have some horrible inefficiencies, and I wonder if they can be solved. One example is that using pandas.concat makes my model's RAM usage skyrocket from 3 GB to 11 GB, which seems very extreme; afterwards I drop a few columns, which makes the RAM spike to 19 GB but actually returns to 11 GB once the computation is completed (unlike the concat). I have also forced myself to stop using SMOTE for now, just because the RAM usage would go up way too much.
At the end of the code, where the training happens, the model breathes its last while trying to fit. What can I do to optimize this?
I have thought about splitting the code into multiple parts (for example preprocessing and training), but to do so I would need to store massive datasets in a pickle, which can only reach 4 GB (correct me if I'm wrong). I have also given thought to using pre-trained models, but I truly did not understand how this process works and how to use one in Python.
P.S.: I would also like my SMOTE back if possible.
Thank you all in advance!
Let's analyze the steps:
Step 1: OHE
For your OHE, the only dependence between data points is that it needs to be clear which categories exist overall. So the OHE can be broken into two steps, neither of which requires that all data points be in RAM.
Step 1.1: determine categories
Stream read your data points, collecting all the categories. It is not necessary to save the data points you read.
Step 1.2: transform data
After step 1.1, each data point can be independently converted. So stream read, convert, stream write. You only need one or very few data points in memory at all times.
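A minimal sketch of Steps 1.1 and 1.2 together, assuming the raw data is a CSV and the categorical column names are known (the column names and file paths below are made up; non-categorical columns are left out for brevity):

import csv

CATEGORICAL = ["color", "country"]        # hypothetical column names

# pass 1 (Step 1.1): stream over the file, collecting the categories per column
categories = {col: set() for col in CATEGORICAL}
with open("raw.csv", newline="") as f:
    for row in csv.DictReader(f):
        for col in CATEGORICAL:
            categories[col].add(row[col])

# fix an ordering so every row is encoded the same way
categories = {col: sorted(vals) for col, vals in categories.items()}

# pass 2 (Step 1.2): stream read, convert each row independently, stream write
with open("raw.csv", newline="") as fin, open("encoded.csv", "w", newline="") as fout:
    reader = csv.DictReader(fin)
    writer = csv.writer(fout)
    header = [f"{col}={val}" for col in CATEGORICAL for val in categories[col]]
    writer.writerow(header)
    for row in reader:
        encoded = [int(row[col] == val)
                   for col in CATEGORICAL for val in categories[col]]
        writer.writerow(encoded)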
Step 1.3: feature selection
It may be worthwhile to look at feature selection to reduce the memory footprint and improve performance. This answer argues it should happen before SMOTE.
Feature selection methods based on entropy depend on all the data. While you can probably throw together something that streams as well, one approach that worked well for me in the past is removing features that only one or two data points have, since these features definitely have low entropy and probably don't help the classifier much. This can again be done in the same two-pass style as Steps 1.1 and 1.2, as sketched below.
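A streaming version of that rare-feature filter, again done in two passes; it assumes the one-hot-encoded CSV produced in the previous sketch:

import csv
from collections import Counter

MIN_SUPPORT = 3          # keep features that appear in at least 3 data points

# pass 1: count in how many rows each feature is non-zero
support = Counter()
with open("encoded.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        for name, value in zip(header, row):
            if value != "0":
                support[name] += 1

keep = [i for i, name in enumerate(header) if support[name] >= MIN_SUPPORT]

# pass 2: stream read, write out only the kept columns
with open("encoded.csv", newline="") as fin, \
        open("selected.csv", "w", newline="") as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    header = next(reader)
    writer.writerow([header[i] for i in keep])
    for row in reader:
        writer.writerow([row[i] for i in keep])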
Step 2: SMOTE
I don't know SMOTE well enough to give an answer, but maybe the problem has already solved itself if you do feature selection. In any case, save the resulting data to disk so you do not need to recompute it for every training run.
Step 3: training
See if the training can be done in batches or streaming (online learning, basically), or simply on a smaller sample of the data.
With regard to saving to disk: use a format that can be easily streamed, like CSV or some other splittable format. Don't use pickle for that.
Slightly orthogonal to your actual question: if your high RAM usage is caused by having the entire dataset in memory during training, you can eliminate that memory footprint by reading and storing only one batch at a time: read a batch, train on this batch, read the next batch, and so on.
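A sketch of that batch loop using pandas chunked reading and a scikit-learn model that supports partial_fit (the file name, chunk size, and "label" column are assumptions; any online-capable model would do in place of SGDClassifier):

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()                    # any estimator with partial_fit works
classes = np.array([0, 1])               # assumed label values

# read the training file one chunk at a time instead of loading it all at once
for chunk in pd.read_csv("train_preprocessed.csv", chunksize=50_000):
    y = chunk.pop("label").to_numpy()
    X = chunk.to_numpy()
    clf.partial_fit(X, y, classes=classes)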

Multi-processing on Large Image Dataset in Python

I have a very large image dataset (>50 GB, single images in a folder) for training. To make loading of images more efficient, I first load part of the images into RAM and then send small batches to the GPU for training.
I want to further speed up the data preparation process before feeding the images to the GPU and was thinking about multiprocessing. But I'm not sure how I should do it, any ideas?
For speed I would advise using HDF5 or LMDB:
I have successfully used ml-pyxis for creating deep learning datasets using LMDBs.
It allows you to create binary blobs (LMDB), and they can be read quite fast.
The link above comes with some simple examples of how to create and read the data, including Python generators/iterators.
For multiprocessing:
I personally work with Keras, and by using a Python generator it is possible to train with multiprocessing for data loading using the fit_generator method.
fit_generator(self, generator, samples_per_epoch, nb_epoch,
              verbose=1, callbacks=[], validation_data=None,
              nb_val_samples=None, class_weight={}, max_q_size=10,
              nb_worker=1, pickle_safe=False)
Fits the model on data generated batch-by-batch by a Python generator. The generator is run in parallel to the model, for efficiency. For instance, this allows you to do real-time data augmentation on images on the CPU in parallel to training your model on the GPU. You can find the source code here, and the documentation here.
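A minimal usage sketch against that (older Keras 1.x) signature; model, train_files, load_image, and load_label are placeholders for your own model and I/O code:

import numpy as np

def image_batch_generator(file_list, batch_size=32):
    # loop forever; Keras stops each epoch after samples_per_epoch samples
    while True:
        for start in range(0, len(file_list), batch_size):
            batch = file_list[start:start + batch_size]
            X = np.stack([load_image(path) for path in batch])   # placeholder I/O
            y = np.stack([load_label(path) for path in batch])   # placeholder I/O
            yield X, y

model.fit_generator(image_batch_generator(train_files),
                    samples_per_epoch=len(train_files),
                    nb_epoch=10,
                    nb_worker=4,        # several CPU workers fill the batch queue
                    pickle_safe=True)   # use processes instead of threads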
I don't know whether you prefer TensorFlow, Keras, Torch, Caffe, or whatever.
Multiprocessing here simply means using multiple GPUs.
Basically you are trying to leverage more hardware by delegating or spawning one child process for every GPU and letting them do their magic. The example above is for logistic regression.
Of course you would be more keen on looking into convnets.
This LSU material (pages 48-52 [slides 11-14]) builds some intuition.
Keras does not yet officially provide support, but you can "proceed at your own risk".
For multiprocessing, TensorFlow is a better way to go about this (in my opinion).
In fact they have some good documentation on it too.

How to reduce topic classification time in a TextBlob Naive Bayes classifier

I am using pickle to save a Naive Bayes classification model. The file I saved after classification of 5,600 records is 2.1 GB. Loading that file takes nearly 2 minutes, and classifying some text takes 5.5 minutes. I am using the following code to load the model and classify:
classifierPickle = pickle.load(open("classifier.pickle", "rb"))
classifierPickle.classify("want to go some beatifull work place")
The first line loads the pickled object and the second one classifies text, returning which topic (category) it belongs to. I am using the following code to save the model:
file = open('C:/burberry_model/classifier.pickle', 'wb')
pickle.dump(object, file, -1)
Everything I am using is from textblob. The environment is Windows, 28 GB RAM, four CPU cores. It would be very helpful if anyone could resolve this issue.
Since textblob is built on top of NLTK, it is a pure Python implementation, which reduces its speed by a huge margin. Secondly, since your pickle file is 2.1 GB, it expands even further and is loaded directly into RAM, increasing the time even more.
Also, since you're using Windows, Python is relatively slower. If speed is a main concern for you, it would be useful to use the feature selector and vector constructor from textblob/NLTK but switch to the scikit-learn NB classifier, which has C bindings, so I'm guessing it would be significantly faster.
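A sketch of that suggestion, keeping the text processing minimal and swapping in scikit-learn's Naive Bayes plus joblib for persistence (the training data below is just a placeholder):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
import joblib

train_texts = ["lovely beach resort", "quarterly sales report"]   # placeholders
train_topics = ["travel", "business"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_topics)

# this model is typically far smaller on disk than a 2.1 GB pickled classifier
joblib.dump(model, "classifier.joblib")

model = joblib.load("classifier.joblib")
print(model.predict(["want to go some beatifull work place"]))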

Saving scikit-learn classifier causes memory error

My machine has 16 GB of RAM and the training program uses up to 2.6 GB of memory.
But when I want to save the classifier (trained using sklearn.svm.SVC on a large dataset) as a pickle file, it consumes more memory than my machine can give. I am eager to know any alternative approaches to save a classifier.
I've tried:
pickle and cPickle
Dump as w or wb
Set fast = True
None of them works; they always raise a MemoryError. Occasionally the file was saved, but loading it causes ValueError: insecure string pickle.
Thank you in advance!
Update
Thank you all. I didn't try joblib; pickling works after setting protocol=2.
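For reference, a minimal sketch of the two saving routes mentioned here, with a small SVC trained on toy data standing in for the real model:

import pickle
import joblib
from sklearn import datasets, svm

# toy stand-in for the classifier trained on the large dataset
X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC().fit(X, y)

with open("classifier.pkl", "wb") as f:
    pickle.dump(clf, f, protocol=2)      # protocol=2, which worked in the update

joblib.dump(clf, "classifier.joblib")    # the joblib route that was not tried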
I would suggest using the out-of-core classifiers from scikit-learn. These are batch learning algorithms that store the model output as a compressed sparse matrix and are very time-efficient.
To start with, the following link really helped me:
http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html
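In the spirit of that example (the linked page covers it in much more detail), an out-of-core loop combines a stateless HashingVectorizer with partial_fit; the tiny inline batches below are placeholders for data streamed from disk:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)   # stateless, no fitting needed
clf = SGDClassifier()
classes = ["spam", "ham"]                          # assumed label set

batches = [                                         # placeholder mini-batches
    (["cheap pills now", "meeting at noon"], ["spam", "ham"]),
    (["win money fast", "lunch tomorrow?"], ["spam", "ham"]),
]
for texts, labels in batches:
    X = vectorizer.transform(texts)
    clf.partial_fit(X, labels, classes=classes)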
