I am using sklearn's RandomForestClassifier to predict a set of classes. I have over 26,000 classes, so the size of the classifier exceeds 30 GB. I am running it on Linux with 64 GB of RAM and 20 GB of disk space.
I am trying to pickle my model using joblib, but it is not working, presumably because I don't have enough disk space. Is there any way this could be done? Maybe some kind of compression technique, or something else?
You could try to gzip the pickle:
import gzip
import pickle
from io import BytesIO

compressed_pickle = BytesIO()
with gzip.GzipFile(fileobj=compressed_pickle, mode='wb') as f:
    f.write(pickle.dumps(classifier))
Then you can write the compressed_pickle to a file.
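For example, a minimal sketch of writing the buffer to disk (assuming the compressed_pickle BytesIO object from above):
with open('rf_classifier.pickle', 'wb') as f:
    f.write(compressed_pickle.getvalue())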
To read it back:
import zlib
import pickle

with open('rf_classifier.pickle', 'rb') as f:
    compressed_pickle = f.read()
# 16 + MAX_WBITS tells zlib to expect a gzip header
rf_classifier = pickle.loads(zlib.decompress(compressed_pickle, 16 + zlib.MAX_WBITS))
EDIT
Pickle protocol versions prior to 4 have a hard 4 GB limit on the size of serialized objects. Protocol 4 (available since Python 3.4) does not have this limit; just specify the protocol version:
pickle.dumps(obj, protocol=4)
For older versions of Python, please refer to this answer:
_pickle in python3 doesn't work for large data saving
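The workaround described there is, roughly, to pickle in memory and then write the bytes to disk in chunks. A minimal sketch (the chunk size is arbitrary; it just stays under the ~2 GB single-write limit):
import pickle

max_bytes = 2**31 - 1
bytes_out = pickle.dumps(classifier, protocol=pickle.HIGHEST_PROTOCOL)
with open('rf_classifier.pickle', 'wb') as f:
    for idx in range(0, len(bytes_out), max_bytes):
        f.write(bytes_out[idx:idx + max_bytes])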
A possible workaround is to dump the individual trees into a folder:
import _pickle as cPickle

path = '/folder/tree_{}'
for i, tree in enumerate(model.estimators_):
    with open(path.format(i), 'wb') as f:
        cPickle.dump(tree, f)
In sklearn's implementation of Random Forest, the attribute estimators_ is a list containing the individual trees, so you can serialize each tree individually into a folder.
To generate predictions, you can average the individual trees' predictions:
# load the trees
import _pickle as cPickle
import numpy as np

path = '/folder/tree_{}'
trees = []
for i in range(num_trees):
    with open(path.format(i), 'rb') as f:
        trees.append(cPickle.load(f))

# generate predictions
predictions = []
for tree in trees:
    predictions.append(tree.predict(X))
predictions = np.asarray(predictions)  # shape: (num_trees, num_samples)

# average predictions as in a RF
y_pred = predictions.mean(axis=0)
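Note that sklearn's RandomForestClassifier actually averages class probabilities rather than hard labels before taking the argmax, so a closer sketch (assuming all trees come from the same forest and therefore share classes_) would be:
# average the per-tree class probabilities, then pick the most likely class
proba = np.mean([tree.predict_proba(X) for tree in trees], axis=0)
y_pred = trees[0].classes_[np.argmax(proba, axis=1)]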
Related
With the old input-pipeline API I can do:
filename_queue = tf.train.string_input_producer(filenames, shuffle=True)
and then pass the filename queue to a reader, for example:
reader = tf.TFRecordReader()
_, serialized_example = reader.read_up_to(filename_queue, n)
How can I achieve similar behaviour with the Dataset API?
tf.data.TFRecordDataset() expects a tensor of file names in a fixed order.
Start reading them in order, shuffle right after:
BUFFER_SIZE = 1000 # arbitrary number
# define filenames somewhere, e.g. via glob
dataset = tf.data.TFRecordDataset(filenames).shuffle(BUFFER_SIZE)
EDIT:
The input pipeline of this question gave me an idea on how to implement filename shuffling with the Dataset API:
dataset = tf.data.Dataset.from_tensor_slices(filenames)
dataset = dataset.shuffle(BUFFER_SIZE) # doesn't need to be big
dataset = dataset.flat_map(tf.data.TFRecordDataset)
dataset = dataset.map(decode_example, num_parallel_calls=5) # add your decoding logic here
# further processing of the dataset
This will emit all the data of one file before the data of the next, and so on: files are shuffled, but the data inside each file is produced in the same order.
You can alternatively replace dataset.flat_map with interleave to process multiple files at the same time and return samples from each:
dataset = dataset.interleave(tf.data.TFRecordDataset, cycle_length=4)
Note: interleave does not actually run in multiple threads; it is a round-robin operation. For true parallel processing, see parallel_interleave.
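As a rough sketch of that (assuming a TF 1.x installation, where parallel_interleave lives in tf.contrib.data):
dataset = dataset.apply(
    tf.contrib.data.parallel_interleave(
        tf.data.TFRecordDataset,
        cycle_length=4,   # number of files read concurrently
        sloppy=True))     # allow out-of-order elements for better throughput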
The current Tensorflow version (v1.5 as of 02/2018) does not seem to support filename shuffling natively in the Dataset API. Here is a simple workaround using numpy:
import numpy as np
import tensorflow as tf
myShuffledFileList = np.random.choice(myInputFileList, size=len(myInputFileList), replace=False).tolist()
dataset = tf.data.TFRecordDataset(myShuffledFileList)
I have training datasets xtrain, ytrain, xtest and ytest. They are all numpy arrays. I want to save them together into a single file, so that I can load them into the workspace the way keras does for mnist.load_data:
(xtrain, ytrain), (xtest, ytest) = mnist.load_data(filepath)
In Python, is there any way to save my training datasets into such a single file? Or are there other appropriate methods to save them?
You have a number of options:
npz
hdf5
pickle
Keras provides an option to save models to HDF5. Also, note that of the three, HDF5 is the only interoperable format.
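For the npz option, a minimal sketch (the array names are just the ones from the question):
import numpy as np

# save all four arrays into a single compressed .npz file
np.savez_compressed('dataset.npz', xtrain=xtrain, ytrain=ytrain, xtest=xtest, ytest=ytest)

# load them back
with np.load('dataset.npz') as data:
    xtrain, ytrain = data['xtrain'], data['ytrain']
    xtest, ytest = data['xtest'], data['ytest']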
Pickle is a good way to go:
import pickle as pkl

# to save it (note the binary mode; pickle files are binary)
with open("train.pkl", "wb") as f:
    pkl.dump([train_x, train_y], f)

# to load it
with open("train.pkl", "rb") as f:
    train_x, train_y = pkl.load(f)
If your dataset is huge, I would recommend checking out HDF5, as @Lukasz Tracewski mentioned.
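A minimal sketch with h5py (the dataset names are just illustrative):
import h5py

# save
with h5py.File('train.h5', 'w') as f:
    f.create_dataset('train_x', data=train_x, compression='gzip')
    f.create_dataset('train_y', data=train_y, compression='gzip')

# load
with h5py.File('train.h5', 'r') as f:
    train_x = f['train_x'][:]
    train_y = f['train_y'][:]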
I find hickle is a very nice way to save them all together into a dict:
import hickle as hkl
data = {'xtrain': xtrain, 'xtest': xtest, 'ytrain': ytrain, 'ytest': ytest}
hkl.dump(data, 'data.hkl')
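To load it back, hickle's load mirrors dump:
data = hkl.load('data.hkl')
xtrain, xtest = data['xtrain'], data['xtest']
ytrain, ytest = data['ytrain'], data['ytest']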
You could simply use numpy.save:
np.save('xtrain.npy', xtrain)
or in a human readable format
np.savetxt('xtrain.txt', xtrain)
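And to load them back:
xtrain = np.load('xtrain.npy')
# or, for the text version
xtrain = np.loadtxt('xtrain.txt')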
I am trying to save a large number of images. I want to save them in a format that uses as little disk space as possible. I have tested HDF5 and cPickle in Python. Surprisingly, I found that the data files generated by PyTables and cPickle are much larger than the folder containing the same images.
My code is here:
import cv2
import copy
import cPickle as pickle
import tables
import numpy as np

image = cv2.imread("aloel.jpg")
images = []
for i in xrange(1000):
    images.append(copy.deepcopy(image))
images = np.asarray(images, dtype=np.uint8)

hdf5_path = "img.hdf5"
filters = tables.Filters(complevel=5, complib='blosc')
with tables.open_file(hdf5_path, mode='w', filters=filters) as hdf5_file:
    data_storage = hdf5_file.create_array(hdf5_file.root, 'data', obj=images)

with open('img.pickle', 'wb') as f:
    pickle.dump(images, f, protocol=pickle.HIGHEST_PROTOCOL)
The folder that contains 1000 copies of aloel.jpg consumes 61.5 MB, but the img.hdf5 and img.pickle are both 1.3GB in size.
I wonder why this occurs. If this is the case, does it mean it would be better to save the image data directly as individual image files rather than into a pickle or HDF5 file?
Update:
Your problem is that compression is not being applied at all, because it requires chunking, which can be achieved by replacing create_array with create_carray. Then apply zlib with complevel 5 and you should already see some improvement. For this particular case it also makes sense to chunk along the repeated data axis, so adding something like chunkshape=[100,100,100,3] to the create_carray call should make a major difference.
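A rough sketch of that change (the chunkshape values are just the ones suggested above; tune them to your image dimensions):
import tables

filters = tables.Filters(complevel=5, complib='zlib')
with tables.open_file('img.hdf5', mode='w', filters=filters) as hdf5_file:
    data_storage = hdf5_file.create_carray(
        hdf5_file.root, 'data',
        obj=images,                      # the (1000, H, W, 3) uint8 array from the question
        chunkshape=(100, 100, 100, 3))   # chunk along the repeated axis too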
JPEG is a highly efficient lossy compression algorithm. Blosc is optimised for speed, and pickle output is not compressed at all by default. There are other filter options available for HDF5; take a look at https://support.hdfgroup.org/services/filters.html and I believe you can find one that gets close to the original JPEG.
There are actually a lot of questions about persistence, and I have tried a lot using pickle or joblib.dumps, but when I use them to save my random forest I get this:
ValueError: ("Buffer dtype mismatch, expected 'SIZE_t' but got 'long'", <type 'sklearn.tree._tree.ClassificationCriterion'>, (1, array([10])))
Can any one tell me why?
Some code for review:
forest = RandomForestClassifier()
forest.fit(data[:n_samples], target[:n_samples])

import cPickle
with open('rf.pkl', 'wb') as f:
    cPickle.dump(forest, f)
with open('rf.pkl', 'rb') as f:
    forest = cPickle.load(f)
or
from sklearn.externals import joblib
joblib.dump(forest, 'rf.pkl')

from sklearn.externals import joblib
forest = joblib.load('rf.pkl')
It is caused by saving and loading with different 32-bit/64-bit versions of Python, as suggested by Scikits-Learn RandomForrest trained on 64bit python wont open on 32bit python.
Try to import the joblib package directly:
import joblib
# ...
# save
joblib.dump(rf, "some_path")
# load
rf2 = joblib.load("some_path")
I've put the full working example with the code and comments here.
I'm slightly confused about how to save a trained classifier. Re-training a classifier each time I want to use it is obviously really bad and slow, so how do I save it and then load it again when I need it? Code is below; thanks in advance for your help. I'm using Python with the NLTK Naive Bayes Classifier.
classifier = nltk.NaiveBayesClassifier.train(training_set)

# look inside the classifier train method in the source code of the NLTK library
def train(labeled_featuresets, estimator=nltk.probability.ELEProbDist):
    # Create the P(label) distribution
    label_probdist = estimator(label_freqdist)
    # Create the P(fval|label, fname) distribution
    feature_probdist = {}
    return NaiveBayesClassifier(label_probdist, feature_probdist)
To save:
import pickle
f = open('my_classifier.pickle', 'wb')
pickle.dump(classifier, f)
f.close()
To load later:
import pickle
f = open('my_classifier.pickle', 'rb')
classifier = pickle.load(f)
f.close()
I ran into the same problem, and you cannot save the object since it is an ELEFreqDistr NLTK class. Anyhow, NLTK is terribly slow. Training took 45 minutes on a decent set, so I decided to implement my own version of the algorithm (run it with PyPy or rename it to .pyx and install Cython). It takes about 3 minutes with the same set, and it can simply save data as JSON (I'll implement pickle, which is faster/better).
I started a simple GitHub project; check out the code here
To retrain the pickled classifier:
f = open('originalnaivebayes5k.pickle', 'rb')
classifier = pickle.load(f)
f.close()
classifier.train(training_set)
print('Accuracy:', nltk.classify.accuracy(classifier, testing_set) * 100)