I trained a NaiveBayes classifier to do elementary sentiment analysis. The model is 208 MB. Loading it takes a rather long time, so I want to load it only once and then have Gearman workers keep calling the loaded model to get results. How do I load the model only once and then keep calling it?
Some code, hope this helps:
import nltk.data
c=nltk.data.load("/path/to/classifier.pickle")
This remains as the loader script.
Now I have a Gearman worker script which should call this "c" object and then classify the text.
c.classify('features')
This is what I want to do.
Thanks.
If the question is how to use pickle, then this is the answer:
import pickle

class Model(object):
    # some crazy array of data
    def getClass(self, sentiment):
        # return the class of the sentiment
        pass

def loadModel(filename):
    with open(filename, 'rb') as f:
        return pickle.load(f)

def saveModel(model, filename):
    with open(filename, 'wb') as f:
        pickle.dump(model, f)

m = loadModel('bayesian.pickle')
If loading such a large object this way is the problem, then I don't know whether pickle can help.
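One way to get the "load once" behavior the question asks for is to cache the loaded object inside the long-running worker process, so that only the first call pays the loading cost. The following is a minimal sketch using only the standard library. The names `get_model` and `classify` and the small dict standing in for the classifier are hypothetical, and the Gearman wiring is left out:

```python
import functools
import os
import pickle
import tempfile


@functools.lru_cache(maxsize=None)
def get_model(path):
    """Unpickle the model at `path` once; later calls return the cached object."""
    with open(path, 'rb') as f:
        return pickle.load(f)


def classify(path, word):
    """Hypothetical task function that a worker could call for every job."""
    model = get_model(path)  # cheap after the first call
    return model.get(word, 'neutral')


# Stand-in for the real 208 MB classifier: a tiny dict pickled to a temp file.
demo_path = os.path.join(tempfile.mkdtemp(), 'classifier.pickle')
with open(demo_path, 'wb') as f:
    pickle.dump({'good': 'pos', 'bad': 'neg'}, f)

print(classify(demo_path, 'good'))
print(get_model(demo_path) is get_model(demo_path))
```

Each worker process would still load its own copy once; the cache only avoids reloading within a process.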
I am trying to optimize a function with nevergrad and save the resulting recommendation in a pickle. But when I assert saved_obj == loaded_obj it raises an error.
To reproduce the issue, I uploaded per_recommendation.pkl:
https://drive.google.com/file/d/1bqxO2JjrTP2qh23HT-qdr9Mf1Kfe4mtC/view?usp=sharing
import pickle
import nevergrad

# load obj (in the real code this is the result of some optimization)
with open('/home/franchesoni/Downloads/per_recommendation.pkl', 'rb') as f:
    r2 = pickle.load(f)
# r2 = optimizer.minimize(fn)

# save the object
with open('/home/franchesoni/Downloads/per_recommendation2.pkl', 'wb') as f:
    pickle.dump(r2, f)

# load the object
with open('/home/franchesoni/Downloads/per_recommendation2.pkl', 'rb') as f:
    recommendation = pickle.load(f)

# they are different!
assert r2 == recommendation
Is this normal or expected?
Off-topic: in the Python docs I read that pickle is unsafe. Is it dangerous to open (for example) the file I uploaded? Is it dangerous to reveal paths like /home/franchesoni?
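On the safety question: a pickle file can name arbitrary importable functions and have them called during loading, which is why the docs warn against unpickling data from untrusted sources. The standard library documents one mitigation, overriding `Unpickler.find_class` to allow only chosen globals. Below is a minimal sketch along those lines. The allow-list and the name `restricted_loads` are illustrative choices and would have to be extended before real objects (such as nevergrad's) could be loaded:

```python
import builtins
import io
import os
import pickle

# Names the unpickler is allowed to resolve; everything else is rejected.
SAFE_BUILTINS = {'range', 'complex', 'set', 'frozenset', 'slice'}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream references a global (class/function).
        if module == 'builtins' and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    """Like pickle.loads, but refuses to import arbitrary globals."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


# Plain containers and allow-listed builtins load fine.
print(restricted_loads(pickle.dumps([1, 2, {'a': range(3)}])))

# A pickle that references os.system is rejected before anything can be called.
try:
    restricted_loads(pickle.dumps(os.system))
except pickle.UnpicklingError as exc:
    print('blocked:', exc)
```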
In python, I am trying to store a list to a file. I've tried pickle, json, etc, but none of them support classes being inside those lists. I can't sacrifice the lists or the classes, I must maintain both. How can I do it?
My current attempt:
try:
    with open('file.json', 'r') as file:
        allcards = json.load(file)
except:
    allcards = []

def saveData(list):
    with open('file.json', 'w') as file:
        print(list)
        json.dump(list, file, indent=2)
saveData is called elsewhere. I've done all the testing I can and have determined that the error comes from trying to save the list, because it contains class instances. It throws the error
Object of type Card is not JSON serializable
whenever I use the JSON method. The other methods don't raise any errors, but the list isn't loaded when I restart the program.
Edit: As for the pickle method, here is what it looks like:
try:
    with open('allcards.dat', 'rb') as file:
        allcards = pickle.load(file)
        print(allcards)
except:
    allcards = []

class Card():
    def __init__(self, owner, name, rarity, img, pack):
        self.owner = str(owner)
        self.name = str(name)
        self.rarity = str(rarity)
        self.img = img
        self.pack = str(pack)

def saveData(list):
    with open('allcards.dat', 'wb') as file:
        pickle.dump(list, file)
When I do this, the code runs as normal, but the list is not saved. The print(allcards) does not trigger either, which makes me believe it's somehow not detecting the file, or hitting some other error that sends it straight to the except branch. Also, img is supposed to always be a link, in case that changes anything.
I have no other way I believe I can help solve this issue, but I can post more code if need be.
Please help, and thanks in advance.
Python's built-in pickle module can serialize instances of a class, but it stores only a reference to the class, so the class definition must already be available when the file is loaded. There are libraries that extend the pickle module and can also serialize the class itself. Dill and Cloudpickle both support this and have the same interface as the pickle module.
Dill: https://github.com/uqfoundation/dill
Cloudpickle: https://github.com/cloudpipe/cloudpickle
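The bare except in the question's loader turns every failure into an empty list, which would explain why print(allcards) never runs and no error appears. One likely culprit, assuming the script is laid out as posted, is that Card is defined only after pickle.load runs. Here is a minimal sketch that treats only a missing file as "nothing saved yet" and lets every other error surface. The function names are hypothetical and plain dicts stand in for Card objects:

```python
import os
import pickle
import tempfile


def load_cards(path):
    """Return the saved list, or [] only when no file has been saved yet."""
    try:
        with open(path, 'rb') as f:
            return pickle.load(f)
    except FileNotFoundError:
        return []
    # Any other exception (corrupt file, class not defined yet) now propagates
    # with a traceback instead of silently resetting the list.


def save_cards(cards, path):
    with open(path, 'wb') as f:
        pickle.dump(cards, f)


path = os.path.join(tempfile.mkdtemp(), 'allcards.dat')
print(load_cards(path))  # no file yet
save_cards([{'name': 'Ace', 'rarity': 'rare'}], path)
print(load_cards(path))
```

With real Card instances, the class definition would need to appear before the call to load_cards.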
//EDIT
The article linked below is good, but the example I wrote was bad.
This time I've created a new snippet from scratch. Sorry for making the earlier one more complicated than it needed to be.
import json

class Card(object):
    @classmethod
    def from_json(cls, data):
        # rebuild a Card from the dict produced by save()
        return cls(**data)

    def __init__(self, figure, color):
        self.figure = figure
        self.color = color

    def __repr__(self):
        return f"<Card: [{self.figure} of {self.color}]>"

def save(cards):
    with open('file.json', 'w') as f:
        # default= tells json how to turn a Card into something serializable
        json.dump(cards, f, indent=4, default=lambda c: c.__dict__)

def load():
    with open('file.json', 'r') as f:
        obj_list = json.load(f)
    return [Card.from_json(obj) for obj in obj_list]

cards = []
cards.append(Card("1", "clubs"))
cards.append(Card("K", "spades"))
save(cards)
cards_from_file = load()
print(cards_from_file)
Source
import nltk
import pickle

input_file=open('file.txt', 'r')
input_datafile=open('newskills1.txt', 'r')
string=input_file.read()
fp=(input_datafile.read().splitlines())

def extract_skills(string):
    skills=pickle.load(fp)
    skill_set=[]
    for skill in skills:
        skill= ''+skill+''
        if skill.lower() in string:
            skill_set.append(skill)
    return skill_set

if __name__ == '__main__':
    skills= extract_skills(string)
    print(skills)
I want to print the skills from the file, but pickle is not working here.
It shows the error:
_pickle.UnpicklingError: the STRING opcode argument must be quoted
The file containing the pickled data must be written and read as a binary file. See the documentation for examples.
Your extraction function should look like:
def extract_skills(path):
    with open(path, 'rb') as inputFile:
        skills = pickle.load(inputFile)
    return skills
Of course, you will need to dump your data into a file opened as binary as well:
def save_skills(path, skills):
    with open(path, 'wb') as outputFile:
        # pickle.dump takes the object first, then the file
        pickle.dump(skills, outputFile)
Additionally, the logic of your main seems a bit flawed.
The code that follows if __name__ == '__main__' is executed only when the script is run as the main module, so everything outside that block should be static, i.e. definitions only.
Basically, your script should not do anything unless it is run as main.
Here is a cleaner version.
import pickle

def extract_skills(path):
    ...

def save_skills(path, skills):
    ...

if __name__ == '__main__':
    inputPath = "skills_input.pickle"
    outputPath = "skills_output.pickle"
    skills = extract_skills(inputPath)
    # Modify skills
    save_skills(outputPath, skills)
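To make the binary-mode round trip concrete, here is a small self-contained sketch. It assumes the skills are a plain list of strings, and it adds a hypothetical `text` parameter so the matching from the question can be shown. The file name and the sample data are made up:

```python
import os
import pickle
import tempfile


def save_skills(path, skills):
    with open(path, 'wb') as output_file:  # 'wb': pickle writes bytes
        pickle.dump(skills, output_file)


def extract_skills(path, text):
    """Load the pickled skill list and keep the skills mentioned in `text`."""
    with open(path, 'rb') as input_file:  # 'rb': pickle reads bytes
        skills = pickle.load(input_file)
    lowered = text.lower()
    return [skill for skill in skills if skill.lower() in lowered]


if __name__ == '__main__':
    path = os.path.join(tempfile.mkdtemp(), 'skills.pickle')
    save_skills(path, ['Python', 'SQL', 'Docker'])
    print(extract_skills(path, 'Looking for python and docker experience'))
```

If newskills1.txt is really a plain text file with one skill per line, as the splitlines() call in the question suggests, it was never pickled in the first place. In that case it could simply be read as text, or converted once with save_skills.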
I am on a VM in a directory that contains my Python (2.7) class. I am trying to pickle an instance of my class to a directory in my HDFS.
I'm trying to do something along the lines of:
import pickle

my_obj = MyClass() # the class instance that I want to pickle
with open('hdfs://domain.example.com/path/to/directory/') as hdfs_loc:
    pickle.dump(my_obj, hdfs_loc)
From what research I've done, I think something like snakebite might be able to help...but does anyone have more concrete suggestions?
If you use PySpark, then you can use the saveAsPickleFile method:
temp_rdd = sc.parallelize([my_obj])  # parallelize expects a collection, so wrap the single object in a list
temp_rdd.coalesce(1).saveAsPickleFile("/test/tmp/data/destination.pickle")
Here is a workaround if you are running in a Jupyter notebook with sufficient permissions:
import pickle

my_obj = MyClass() # the class instance that I want to pickle
local_filename = "pickle.p"
hdfs_loc = "//domain.example.com/path/to/directory/"
with open(local_filename, 'wb') as f:
    pickle.dump(my_obj, f)
!!hdfs dfs -copyFromLocal $local_filename $hdfs_loc
You can dump a pickled object to HDFS with PyArrow:
import pickle
import pyarrow as pa

my_obj = MyClass() # the class instance that I want to pickle
# legacy API: newer PyArrow releases deprecate pa.hdfs in favor of pyarrow.fs.HadoopFileSystem
hdfs = pa.hdfs.connect()
with hdfs.open('hdfs://domain.example.com/path/to/directory/filename.pkl', 'wb') as hdfs_file:
    pickle.dump(my_obj, hdfs_file)
I have a pkl file from MNIST dataset, which consists of handwritten digit images.
I'd like to take a look at each of those digit images, so I need to unpack the pkl file, except I can't find out how.
Is there a way to unpack/unzip a pkl file?
Generally
Your pkl file is, in fact, a serialized pickle file, which means it has been dumped using Python's pickle module.
To un-pickle the data you can:
import pickle

with open('serialized.pkl', 'rb') as f:
    data = pickle.load(f)
For the MNIST data set
Note that gzip is only needed if the file is compressed:
import gzip
import pickle

with gzip.open('mnist.pkl.gz', 'rb') as f:
    # if the file was pickled under Python 2, Python 3 may need pickle.load(f, encoding='latin1')
    train_set, valid_set, test_set = pickle.load(f)
Where each set can be further divided (i.e. for the training set):
train_x, train_y = train_set
Those would be the inputs (digits) and outputs (labels) of your sets.
If you want to display the digits:
import matplotlib.cm as cm
import matplotlib.pyplot as plt
plt.imshow(train_x[0].reshape((28, 28)), cmap=cm.Greys_r)
plt.show()
The other alternative would be to look at the original data:
http://yann.lecun.com/exdb/mnist/
But that will be harder, as you'll need to write a program to read the binary data in those files. So I recommend using Python and loading the data with pickle. As you've seen, it's very easy. ;-)
Handy one-liner
pkl() (
  python -c 'import pickle,sys;d=pickle.load(open(sys.argv[1],"rb"));print(d)' "$1"
)
pkl my.pkl
This prints the __str__ of the pickled object.
The generic problem of visualizing an object is of course undefined, so if __str__ is not enough, you will need a custom script.
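As one possible shape for such a custom script, the sketch below prints a short structural summary (types, lengths, array shapes) instead of the full __str__. The function name `describe`, the five-child limit, and the sample object are all arbitrary illustrative choices:

```python
import pickle


def describe(obj, depth=0, max_depth=3):
    """Return a list of indented lines summarizing the structure of `obj`."""
    pad = '  ' * depth
    line = f"{pad}{type(obj).__name__}"
    if hasattr(obj, 'shape'):  # e.g. NumPy arrays
        return [f"{line} shape={obj.shape}"]
    if isinstance(obj, (str, bytes)):
        return [f"{line} len={len(obj)}"]
    if isinstance(obj, dict):
        lines = [f"{line} with {len(obj)} keys"]
        children = list(obj.values())
    elif isinstance(obj, (list, tuple)):
        lines = [f"{line} with {len(obj)} items"]
        children = list(obj)
    else:
        return [f"{line}: {obj!r}"]
    if depth < max_depth:
        for child in children[:5]:  # show at most 5 children per container
            lines.extend(describe(child, depth + 1, max_depth))
    return lines


# Demo on an in-memory round trip; for a file, use pickle.load(open(path, 'rb')).
sample = pickle.loads(pickle.dumps(([1, 2, 3], {'a': 'xyz'}, 7)))
print('\n'.join(describe(sample)))
```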
In case you want to work with the original MNIST files, here is how you can deserialize them.
If you haven't downloaded the files yet, do that first by running the following in the terminal:
wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Then save the following as deserialize.py and run it.
import numpy as np
import gzip

IMG_DIM = 28

def decode_image_file(fname):
    n_bytes_per_img = IMG_DIM*IMG_DIM
    with gzip.open(fname, 'rb') as f:
        bytes_ = f.read()
    data = bytes_[16:]  # skip the 16-byte header
    if len(data) % n_bytes_per_img != 0:
        raise Exception('Something wrong with the file')
    # one row per image, one column per pixel
    result = np.frombuffer(data, dtype=np.uint8).reshape(
        len(data)//n_bytes_per_img, n_bytes_per_img)
    return result

def decode_label_file(fname):
    with gzip.open(fname, 'rb') as f:
        bytes_ = f.read()
    data = bytes_[8:]  # skip the 8-byte header
    result = np.frombuffer(data, dtype=np.uint8)
    return result

train_images = decode_image_file('train-images-idx3-ubyte.gz')
train_labels = decode_label_file('train-labels-idx1-ubyte.gz')
test_images = decode_image_file('t10k-images-idx3-ubyte.gz')
test_labels = decode_label_file('t10k-labels-idx1-ubyte.gz')
The script doesn't normalize the pixel values like in the pickled file. To do that, all you have to do is
train_images = train_images/255
test_images = test_images/255
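The offsets of 16 and 8 bytes skipped in deserialize.py come from the IDX header that precedes the data: a 4-byte magic number (two zero bytes, a data type code, and the number of dimensions), followed by one 4-byte big-endian size per dimension. The sketch below parses that header with the standard library. The function name `parse_idx_header` is hypothetical, and the headers are synthetic stand-ins with the MNIST layout rather than bytes read from the real files:

```python
import struct


def parse_idx_header(raw):
    """Parse the big-endian IDX header; return (dims, offset of first data byte)."""
    # Bytes 0-1 are zero, byte 2 is the data type code, byte 3 the number of dims.
    zero, dtype_code, ndim = struct.unpack('>HBB', raw[:4])
    if zero != 0:
        raise ValueError('not an IDX file')
    # Each dimension size is a 4-byte big-endian unsigned int.
    dims = struct.unpack('>' + 'I' * ndim, raw[4:4 + 4 * ndim])
    return dims, 4 + 4 * ndim


# Synthetic headers with the same layout as the MNIST files (no real data).
image_header = struct.pack('>HBBIII', 0, 0x08, 3, 60000, 28, 28)
label_header = struct.pack('>HBBI', 0, 0x08, 1, 60000)

print(parse_idx_header(image_header))
print(parse_idx_header(label_header))
```

Reading the dimensions from the header this way would avoid hard-coding IMG_DIM and the offsets.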
The pickle module (and gzip, if the file is compressed) needs to be used.
NOTE: Both are already in the Python standard library, so there is no need to install anything new.