How to save a randomforest in scikit-learn? - python

Actually there are a lot of questions about persistence, but I have tried a lot using pickle or joblib.dump. When I use either of them to save my random forest I get this:
ValueError: ("Buffer dtype mismatch, expected 'SIZE_t' but got 'long'", <type 'sklearn.tree._tree.ClassificationCriterion'>, (1, array([10])))
Can anyone tell me why?
Some code for review:
forest = RandomForestClassifier()
forest.fit(data[:n_samples], target[:n_samples])
import cPickle
with open('rf.pkl', 'wb') as f:
    cPickle.dump(forest, f)
with open('rf.pkl', 'rb') as f:
    forest = cPickle.load(f)
or
from sklearn.externals import joblib
joblib.dump(forest, 'rf.pkl')
from sklearn.externals import joblib
forest = joblib.load('rf.pkl')

It is caused by saving and loading with different 32/64-bit builds of Python, as Scikits-Learn RandomForrest trained on 64bit python wont open on 32bit python suggests.
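If you are unsure which builds you are using, a quick check (a minimal sketch using only the standard library; run it in both the saving and the loading environment) is:
import struct
import sys

print(struct.calcsize("P") * 8)  # prints 32 on a 32-bit build, 64 on a 64-bit build
print(sys.version)               # interpreter version, also worth comparing between environments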

Try to import the joblib package directly:
import joblib
# ...
# save
joblib.dump(rf, "some_path")
# load
rf2 = joblib.load("some_path")
I've put the full working example with the code and comments here.

Related

Error: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ram when trying to unpickle a file

I am running into this error; I can't unpickle a file in my Jupyter notebook:
import os
import pickle
import joblib
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
filename = open("loan_model3.pkl", "rb")
mdl = pickle.load(filename)
filename.close()
It always shows the error message below, even though I've upgraded all my libraries.
Error Message:
FileNotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ram://89506590-ec42-44a9-b67c-3ee4cc8e884e/variables/variables. You may be trying to load on a different device from the computational device. Consider setting the experimental_io_device option in tf.saved_model.LoadOptions to the io_device such as '/job:localhost'.
I tried upgrading my libraries, but it still didn't work.
I got the same error when I was trying to store my Sequential model in a .pkl file. Since a Sequential model is a TensorFlow Keras model, we have to store it in an .h5 file instead; Keras saves models in this format because it can easily store the weights and the model configuration in a single file.
Code:
from keras.models import load_model
model.save('model.h5')
model_final = load_model('model.h5')
I don't know if you are still here, but I found the solution: you should not save the TensorFlow model into a pickle file, but into an .h5 file instead.
from tensorflow import keras

## save model
save_path = './model.h5'
model.save(save_path)

## load tensorflow model
model = keras.models.load_model(save_path)
This worked for me. Hope this helps you too.
This worked for me:
import tensorflow as tf
path = './model.h5'
model.save(path)
loaded_model = tf.keras.models.load_model(path)
I faced the same issue, but saving the model as an .h5 file worked for me. Now I'm able to load the .h5 model.

RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'

I am trying to train PeleeNet with PyTorch and got the error above at train.py, line 80, using the pelee_voc train configuration.
Reading the link provided in #Dwijay's answer, I found a solution that does not require any source code change.
Indeed, changing the PyTorch source code is, I would say, rather risky.
But the idea of modifying the Generator is the right one.
By default the random number generator generates numbers on the CPU, but we want them on the GPU.
Therefore, one should instead modify the DataLoader instantiation to use the default CUDA device.
This is highlighted in this GitHub comment:
data_loader = data.DataLoader(
    ...,
    generator=torch.Generator(device='cuda'),
)
This fix worked for me in PyTorch 1.11 (and worked for this other user in PyTorch 1.10).
I had the same issue, but on Ubuntu 20.04.
I tried turning shuffle off as mentioned and that worked, but it's not the correct way, as it will make your training worse.
Keep shuffle ON and follow the steps below; they vary slightly with the PyTorch version:
In file "site-packages/torch/utils/data/sampler.py" located in anaconda or wherever.
[Modify line 116]: generator = torch.Generator()
change to generator = torch.Generator(device='cuda')
[Modify line 126]: yield from torch.randperm(n, generator=generator).tolist()
change to yield from torch.randperm(n, generator=generator, device='cuda').tolist()
The line numbers may differ between versions, but the point to note is adding device='cuda' to these calls.
Hope this helps!!!
Turning the shuffle parameter off in the DataLoader solved it.
Got the answer from here.
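For reference, a minimal sketch of that workaround (train_dataset here is a hypothetical Dataset object standing in for your own); note that, as pointed out above, disabling shuffling can hurt training:
from torch.utils.data import DataLoader

# shuffle=False avoids the CPU-generator randperm call that raises the error,
# at the cost of not reshuffling the samples between epochs
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=False)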
I just wrote some quick code to automate #Dwijay Bane's answer:
import os
import inspect
import torch

# Find the location of the torch package
package_path = os.path.dirname(inspect.getfile(torch))
full_path = os.path.join(package_path, 'utils/data/sampler.py')

# Read in the file
with open(full_path, 'r') as file:
    filedata = file.read()

# Replace the target string
filedata = filedata.replace('generator = torch.Generator()', 'generator = torch.Generator(device=\'cuda\')')
filedata = filedata.replace('yield from torch.randperm(n, generator=generator).tolist()', 'yield from torch.randperm(n, generator=generator, device=\'cuda\').tolist()')

# Write the file out again
with open(full_path, 'w') as file:
    file.write(filedata)

How to serialize a large randomforest classifier

I am using sklearn's RandomForestClassifier to predict a set of classes. I have over 26000 classes, and therefore the size of the classifier exceeds 30 GB. I am running it on Linux with 64 GB of RAM and 20 GB of storage.
I am trying to pickle my model using joblib, but it is not working, presumably because I don't have enough secondary storage. Is there any way this could be done? Maybe some kind of compression technique or something else?
You could try to gzip the pickle:
import gzip, pickle
from io import BytesIO  # StringIO was for Python 2; binary data needs BytesIO on Python 3
compressed_pickle = BytesIO()
with gzip.GzipFile(fileobj=compressed_pickle, mode='wb') as f:
    f.write(pickle.dumps(classifier))
Then you can write the contents of compressed_pickle (via compressed_pickle.getvalue()) to a file.
To read it back:
import zlib, pickle
with open('rf_classifier.pickle', 'rb') as f:
    compressed_pickle = f.read()
rf_classifier = pickle.loads(zlib.decompress(compressed_pickle, 16 + zlib.MAX_WBITS))
EDIT
It appears Python versions prior to 3.4 had a hard limit of 4 GB on the serialized object size. Pickle protocol version 4 does not have this limit; just specify the protocol version:
pickle.dumps(obj, protocol=4)
For older versions of Python, please refer to this answer:
_pickle in python3 doesn't work for large data saving
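Another option, not from the answers above but aimed directly at the storage constraint in the question, is joblib's built-in compression, which writes a compressed file in a single step:
from joblib import dump, load

# compress takes an integer from 0 (no compression) to 9 (maximum);
# higher values give smaller files at the cost of slower dump/load
dump(classifier, 'rf_classifier.joblib', compress=3)
classifier = load('rf_classifier.joblib')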
A possible workaround is to dump the individual trees into a folder:
path = '/folder/tree_{}'
import _pickle as cPickle

for i, tree in enumerate(model.estimators_):
    with open(path.format(i), 'wb') as f:
        cPickle.dump(tree, f)
In sklearn's implementation of Random Forest, the attribute "estimators_" is a list containing the individual trees, so you can serialize all trees individually into a folder.
To generate predictions, you can average the trees' predictions:
# load the trees
path = '/folder/tree_{}'
import _pickle as cPickle
import numpy as np

trees = []
for i in range(num_trees):
    with open(path.format(i), 'rb') as f:
        trees.append(cPickle.load(f))

# generate predictions
predictions = []
for tree in trees:
    predictions.append(tree.predict(X))
predictions = np.asarray(predictions)  # shape: (num_trees, n_samples)

# average predictions as in a RF
y_pred = predictions.mean(axis=0)
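One caveat, not in the original answer: for a classifier, tree.predict returns class labels, and averaging labels is rarely meaningful. A sketch closer to what RandomForestClassifier does internally (assuming all trees were fitted on the same label set, so trees[0].classes_ applies to every tree) averages the per-class probabilities and then takes the most likely class:
# average class probabilities across trees, then pick the most likely class
proba = np.mean([tree.predict_proba(X) for tree in trees], axis=0)  # (n_samples, n_classes)
y_pred = trees[0].classes_[np.argmax(proba, axis=1)]                # map indices back to labels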

Python - Graphviz - Remove legend on nodes of DecisionTreeClassifier

I have a decision tree classifier from sklearn and I use pydotplus to show it.
However, for my presentation I don't really like having so much information on each node (entropy, samples and value).
To make it easier to explain to people, I would like to keep only the decision and the class on each node.
Where can I modify the code to do this?
Thank you.
According to the documentation, it is not possible to omit all the additional information inside the boxes. The only thing you can directly omit is the impurity parameter.
However, I have done it another, somewhat crooked way. First, I save the .dot file with impurity set to False. Then I open it up, convert it to a string, use regex to strip out the redundant labels, and resave it.
The code goes like this:
import re
import pydotplus  # install it via pip install pydotplus
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.datasets import load_iris

data = load_iris()
clf = DecisionTreeClassifier()
clf.fit(data.data, data.target)
export_graphviz(clf, out_file='tree.dot', impurity=False, class_names=True)

PATH = '/path/to/dotfile/tree.dot'
f = pydotplus.graph_from_dot_file(PATH).to_string()
f = re.sub('(\\\\nsamples = [0-9]+)(\\\\nvalue = \[[0-9]+, [0-9]+, [0-9]+\])', '', f)
f = re.sub('(samples = [0-9]+)(\\\\nvalue = \[[0-9]+, [0-9]+, [0-9]+\])\\\\n', '', f)
with open('tree_modified.dot', 'w') as file:
    file.write(f)
Here are the images before and after modification:
In your case, there seem to be more parameters in the boxes, so you may want to tweak the code a little.
I hope that helps!

Save Naive Bayes Trained Classifier in NLTK

I'm slightly confused about how to save a trained classifier. Re-training a classifier each time I want to use it is obviously slow and wasteful; how do I save it and then load it again when I need it? The code is below; thanks in advance for your help. I'm using Python with the NLTK Naive Bayes Classifier.
classifier = nltk.NaiveBayesClassifier.train(training_set)
# look inside the classifier train method in the source code of the NLTK library
def train(labeled_featuresets, estimator=nltk.probability.ELEProbDist):
    # Create the P(label) distribution
    label_probdist = estimator(label_freqdist)
    # Create the P(fval|label, fname) distribution
    feature_probdist = {}
    return NaiveBayesClassifier(label_probdist, feature_probdist)
To save:
import pickle
f = open('my_classifier.pickle', 'wb')
pickle.dump(classifier, f)
f.close()
To load later:
import pickle
f = open('my_classifier.pickle', 'rb')
classifier = pickle.load(f)
f.close()
I ran into the same problem, and you cannot save the object since it is an ELEFreqDistr NLTK class. In any case NLTK is terribly slow: training took 45 minutes on a decent set, so I decided to implement my own version of the algorithm (run it with PyPy, or rename it .pyx and install Cython). It takes about 3 minutes with the same set, and it can simply save its data as JSON (I'll implement pickle, which is faster/better).
I started a simple GitHub project; check out the code here.
To retrain the pickled classifier:
f = open('originalnaivebayes5k.pickle', 'rb')
classifier = pickle.load(f)
classifier.train(training_set)
print('Accuracy:', nltk.classify.accuracy(classifier, testing_set) * 100)
f.close()
