Can't pickle _thread.RLock objects - PySpark model - Python

I created a RandomForest model with PySpark.
I need to save this model as a file with a .pkl extension, so I used the pickle library, but when I try to use it I get the following error:
TypeError Traceback (most recent call last)
<ipython-input-76-bf32d5617a63> in <module>()
2
3 filename = "drive/My Drive/Progetto BigData/APPOGGIO/Modelli/SVM/svm_sentiment_analysis"
----> 4 pickle.dump(model, open(filename, "wb"))
TypeError: can't pickle _thread.RLock objects
Is it possible to use pickle with a PySpark model like RandomForest, or can it only be used with a scikit-learn model?
This is my code:
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(labelCol = "label", featuresCol = "word2vect", weightCol = "classWeigth", seed = 0, maxDepth=10, numTrees=100, impurity="gini")
model = rf.fit(train_df)
# Save our model into a file with the help of pickle library
filename = "drive/My Drive/Progetto BigData/APPOGGIO/Modelli/SVM/svm_sentiment_analysis"
pickle.dump(model, open(filename, "wb"))
My environment is Google Colab.
I need to turn the model into a pickle file to create a web app. To save a model I would normally use the .save(path) method, but in this case .save is not what I need.
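For reference, the native persistence route mentioned above looks roughly like this (a minimal sketch, not from the original post; the output path is a placeholder, and loading the model back requires an active SparkSession):

from pyspark.ml.classification import RandomForestClassificationModel

# Save the fitted model with Spark's own writer (this creates a directory, not a single .pkl file)
model_path = "drive/My Drive/models/rf_sentiment_model"  # hypothetical path
model.write().overwrite().save(model_path)

# Later, e.g. inside the web app, reload it with the matching model class
loaded_model = RandomForestClassificationModel.load(model_path)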
Is it possible that a PySpark model cannot be transformed into a file?
Thanks in advance!!

Related

Get "Can't get attribute" error while loading my pickle file

I am trying to use pickle to save and load my ML models, but I get an error. Here is a simplified version of the code I use to save my model:
import pickle

def test(x, y):
    return x + y

filename = 'test.pkl'
pickle.dump(test, open(filename, 'wb'))
I can load the pickle file from the same notebook that created it, but if I close the notebook and try to load the pickle in a new one with the code below:
import pickle
filename = 'test.pkl'
loaded_model = pickle.load(open(filename, 'rb'))
I get this error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[2], line 2
1 filename = 'test.pkl'
----> 2 loaded_model = pickle.load(open(filename, 'rb'))
AttributeError: Can't get attribute 'test' on <module '__main__'>
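A note that is not part of the original thread: pickle stores plain functions by reference (module plus qualified name), not by value, so the loading session must be able to resolve __main__.test. A minimal sketch of a workaround is to re-define (or import) the function before unpickling:

import pickle

# Re-define (or import) the function under the same name so that
# the unpickler can resolve __main__.test.
def test(x, y):
    return x + y

filename = 'test.pkl'
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

print(loaded_model(2, 3))  # prints 5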

How to parse, edit and generate object_detection/pipeline.config files using Google Protobuf

I'm training multiple models in a common ensemble learning paradigm. Currently I'm working with a few detectors, and each time I train I have to edit the config file of each detector. This obviously causes confusion, and a few times I started training with the wrong config files.
As a solution I'm trying to build an editor for the Google Object Detection API config files. The config files use Google Protocol Buffers.
Link to the files I use: pipeline.proto, object_detection/protos, example .config file
I've tried the following code:
from object_detection.protos import input_reader_pb2

with open('/models/research/object_detection/samples/configs/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync.config', 'rb') as f:
    config = f.read()

read = input_reader_pb2.InputReader().ParseFromString(config)
And I get the following error:
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-19-8043e6bb108f>", line 1, in <module>
input_reader_pb2.InputReader().ParseFromString(txt)
google.protobuf.message.DecodeError: Error parsing message
What am I missing here? What is the appropriate way to parse and edit the config file?
Thanks,
Hod
Using the following code I was able to parse a config file. (The .config files are text-format protobufs, so they need to be read with text_format rather than the binary ParseFromString.)
import tensorflow as tf
from google.protobuf import text_format
from object_detection.protos import pipeline_pb2
def get_configs_from_pipeline_file(pipeline_config_path, config_override=None):
    '''
    read .config and convert it to proto_buffer_object
    '''
    pipeline_config = pipeline_pb2.TrainEvalPipelineConfig()
    with tf.gfile.GFile(pipeline_config_path, "r") as f:
        proto_str = f.read()
        text_format.Merge(proto_str, pipeline_config)
    if config_override:
        text_format.Merge(config_override, pipeline_config)
    # print(pipeline_config)
    return pipeline_config

def create_configs_from_pipeline_proto(pipeline_config):
    '''
    Returns the configurations as a dictionary
    '''
    configs = {}
    configs["model"] = pipeline_config.model
    configs["train_config"] = pipeline_config.train_config
    configs["train_input_config"] = pipeline_config.train_input_reader
    configs["eval_config"] = pipeline_config.eval_config
    configs["eval_input_configs"] = pipeline_config.eval_input_reader
    # Keeps eval_input_config only for backwards compatibility. All clients should
    # read eval_input_configs instead.
    if configs["eval_input_configs"]:
        configs["eval_input_config"] = configs["eval_input_configs"][0]
    if pipeline_config.HasField("graph_rewriter"):
        configs["graph_rewriter_config"] = pipeline_config.graph_rewriter
    return configs

configs = get_configs_from_pipeline_file('faster_rcnn_resnet101_pets.config')
config_as_dict = create_configs_from_pipeline_proto(configs)
Referred from here.
Since you have the object_detection API installed, you can just do the following:
from object_detection.utils import config_util
pipeline_config = config_util.get_configs_from_pipeline_file('/path/to/config/file')
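As a quick usage sketch (an illustration, not from the original answer; the dictionary keys follow the ones built in the previous answer, and batch_size is just an example field):

from object_detection.utils import config_util

configs = config_util.get_configs_from_pipeline_file('/path/to/config/file')
print(configs['train_config'].batch_size)  # inspect a field of the train config proto
configs['train_config'].batch_size = 8     # edit it in place before converting back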
This is what I've found to be a useful approach to overriding the object detection pipeline.config:
from object_detection.utils import config_util
from object_detection import model_lib_v2
PIPELINE_CONFIG_PATH = 'path_to_your_pipeline.config'
# Load the pipeline config as a dictionary
pipeline_config_dict = config_util.get_configs_from_pipeline_file(PIPELINE_CONFIG_PATH)
# OVERRIDE EXAMPLES
# Example 1: Override the train tfrecord path
pipeline_config_dict['train_input_config'].tf_record_input_reader.input_path[0] = 'your/override/path/to/train.record'
# Example 2: Override the eval tfrecord path
pipeline_config_dict['eval_input_config'].tf_record_input_reader.input_path[0] = 'your/override/path/to/test.record'
# Convert the pipeline dict back to a protobuf object
pipeline_config = config_util.create_pipeline_proto_from_configs(pipeline_config_dict)
# EXAMPLE USAGE:
# Example 1: Run the object detection train loop with your overrides (has to be a string representation)
model_lib_v2.train_loop(config_override=str(pipeline_config))
# Example 2: Save the pipeline config to disk
config_util.save_pipeline_config(pipeline_config, 'path/to/save/new/pipeline.config')

Import GoogleNews-vectors-negative300.bin

I am working on code using gensim and am having a tough time troubleshooting a ValueError. I was finally able to unzip the GoogleNews-vectors-negative300.bin.gz file so I could use it in my model. I also tried gzip, but without success. The error occurs in the last line of the code below. I would like to know what can be done to fix the error. Are there any workarounds? Finally, is there a website I could reference?
Thank you respectfully for your assistance!
import gensim
from keras import backend
from keras.layers import Dense, Input, Lambda, LSTM, TimeDistributed
from keras.layers.merge import concatenate
from keras.layers.embeddings import Embedding
from keras.models import Model

pretrained_embeddings_path = "GoogleNews-vectors-negative300.bin"
word2vec = gensim.models.KeyedVectors.load_word2vec_format(
    pretrained_embeddings_path, binary=True)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-23bd96c1d6ab> in <module>()
      1 pretrained_embeddings_path = "GoogleNews-vectors-negative300.bin"
----> 2 word2vec = gensim.models.KeyedVectors.load_word2vec_format(pretrained_embeddings_path, binary=True)

C:\Users\green\Anaconda3\envs\py35\lib\site-packages\gensim\models\keyedvectors.py in load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype)
    244                 word.append(ch)
    245             word = utils.to_unicode(b''.join(word), encoding=encoding, errors=unicode_errors)
--> 246             weights = fromstring(fin.read(binary_len), dtype=REAL)
    247             add_word(word, weights)
    248         else:

ValueError: string size must be a multiple of element size
Edit: The S3 URL has stopped working. You can download the data from Kaggle or use this Google Drive link (be careful when downloading files from Google Drive).
The commands below no longer work.
brew install wget
wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
This downloads the GZIP compressed file that you can uncompress using:
gzip -d GoogleNews-vectors-negative300.bin.gz
You can then use the command below to load the word vectors.
from gensim import models
w = models.KeyedVectors.load_word2vec_format(
    '../GoogleNews-vectors-negative300.bin', binary=True)
You have to write the complete path.
Use this path:
https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Try this:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')
vec_king = wv['king']
Also, visit this link: https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py
Here is what worked for me. I loaded only part of the model rather than the entire model, as it's huge.
!pip install wget
import wget
import gzip

url = 'https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz'
filename = wget.download(url)

# Decompress the downloaded archive
f_in = gzip.open('GoogleNews-vectors-negative300.bin.gz', 'rb')
f_out = open('GoogleNews-vectors-negative300.bin', 'wb')
f_out.writelines(f_in)
f_in.close()
f_out.close()

import gensim
from gensim.models import Word2Vec, KeyedVectors
from sklearn.decomposition import PCA

# Load only the first 100,000 vectors to keep memory usage manageable
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True, limit=100000)
You can use this URL that points to Google Drive's download of the bin.gz file:
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g
Alternative mirrors (including the S3 mentioned here) seem to be broken.

Why can't np.load() read my ndarray data from a pickled file?

I am trying to analyze tensor data, but I could not read the data in a pickled file using np.load(). My Python code is as follows:
import pickle
import numpy as np
import sktensor as skt
import numpy.random as rn
data = np.ones((10, 8, 3), dtype='int32')  # 3-mode count tensor of size 10 x 8 x 3
##data = skt.dtensor(data)

with open('data.dat', 'w+') as f:  # can be stored as a .dat using pickle
    pickle.dump(data, f)

with open('data.dat', 'r+') as f:  # can be loaded back in using pickle.load
    tmp = pickle.load(f)

assert np.allclose(tmp, data)
But when I attempted to use np.load() to load the data in data.dat as follows:
np.load('G:\data.dat')
the following error appears:
Traceback (most recent call last):
File "<pyshell#34>", line 1, in <module>
np.load('D:/GDELT_Tensor/data.dat', mmap_mode = 'r')
File "C:\Python27\lib\site-packages\numpy\lib\npyio.py", line 416, in load
"Failed to interpret file %s as a pickle" % repr(file))
IOError: Failed to interpret file 'D:/data.dat' as a pickle.
Can anyone help me?
Don't use the pickle module to save NumPy arrays. Instead, use one of the methods here: http://docs.scipy.org/doc/numpy/reference/routines.io.html
There's even one that uses pickle under the hood, for example:
np.save('data.npy', data)   # np.save writes NumPy's own .npy format
tmp = np.load('data.npy')
Another format like CSV or HDF5 might be more suitable for most applications, especially where you might want to interoperate with non-Python systems.
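If HDF5 fits better, a minimal sketch (assuming the h5py package is installed; file and dataset names are placeholders) looks like this:

import h5py
import numpy as np

data = np.ones((10, 8, 3), dtype='int32')

# Write the array to an HDF5 dataset
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('tensor', data=data)

# Read it back
with h5py.File('data.h5', 'r') as f:
    tmp = f['tensor'][:]

assert np.allclose(tmp, data)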

Using Pickle object like an API call

I trained a NaiveBayes classifier to do elementary sentiment analysis. The model is 208 MB. I want to load it only once and then use Gearman workers to keep calling the model to get results. It takes a rather long time just to load it once. How do I load the model only once and then keep calling it?
Some code, hope this helps:
import nltk.data
c=nltk.data.load("/path/to/classifier.pickle")
This remains as the loader script.
Now I have a Gearman worker script which should call this "c" object and then classify the text.
c.classify('features')
This is what I want to do.
Thanks.
If the question is how to use pickle, then here's the answer:
import pickle

class Model(object):
    # some crazy array of data
    def getClass(self, sentiment):
        # return the class of the sentiment
        pass

def loadModel(filename):
    f = open(filename, 'rb')
    res = pickle.load(f)
    f.close()
    return res

def saveModel(model, filename):
    f = open(filename, 'wb')
    pickle.dump(model, f)
    f.close()

m = loadModel('bayesian.pickle')
If it's a problem to load such a large object this way, then I don't know whether pickle could help.
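That said, a rough sketch of the "load once, serve many requests" pattern the question asks about could look like the following. This is an assumption on my part, based on the python-gearman package; the server address, task name, and payload format are placeholders.

import nltk.data
import gearman  # hypothetical: python-gearman package

# Loaded exactly once, when the worker process starts (the slow 208 MB step)
c = nltk.data.load("/path/to/classifier.pickle")

def classify_job(gearman_worker, gearman_job):
    # gearman_job.data carries the text/features sent by the client
    return str(c.classify(gearman_job.data))

if __name__ == '__main__':
    gm_worker = gearman.GearmanWorker(['localhost:4730'])
    gm_worker.register_task('classify', classify_job)
    gm_worker.work()  # blocks and serves jobs, reusing the already-loaded model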
