What is the right way to save/load models in Spark/PySpark - Python

I'm working with Spark 1.3.0 using PySpark and MLlib and I need to save and load my models. I use code like this (taken from the official documentation):
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
data = sc.textFile("data/mllib/als/test.data")
ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
rank = 10
numIterations = 20
model = ALS.train(ratings, rank, numIterations)
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
predictions.collect() # shows me some predictions
model.save(sc, "model0")
# Trying to load saved model and work with it
model0 = MatrixFactorizationModel.load(sc, "model0")
predictions0 = model0.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
When I try to use model0 I get a long traceback, which ends with this:
Py4JError: An error occurred while calling o70.predict. Trace:
py4j.Py4JException: Method predict([class org.apache.spark.api.java.JavaRDD]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
So my question is: am I doing something wrong? As far as I can tell from debugging, the models are stored (both locally and on HDFS) and they contain many files with some data. My feeling is that the models are saved correctly but are not loaded correctly. I also googled around but found nothing related.
It looks like this save/load feature was added recently in Spark 1.3.0, which leads to another question: what was the recommended way to save/load models before the 1.3.0 release? I haven't found any nice way to do this, at least for Python. I also tried Pickle, but ran into the same issues as described here: Save Apache Spark mllib model in python

One way to save a model (in Scala, but the approach is probably similar in Python):
// persist model to HDFS
sc.parallelize(Seq(model), 1).saveAsObjectFile("linReg.model")
Saved model can then be loaded as:
val linRegModel = sc.objectFile[LinearRegressionModel]("linReg.model").first()
See also related question
For more details see (ref)

As of this pull request merged on Mar 28, 2015 (a day after your question was last edited) this issue has been resolved.
You just need to clone/fetch the latest version from GitHub (git clone git://github.com/apache/spark.git -b branch-1.3), then build it (following the instructions in spark/README.md) with $ mvn -DskipTests clean package.
Note: I ran into trouble building Spark because Maven was being wonky. I resolved that issue by using $ update-alternatives --config mvn and selecting the 'path' that had Priority: 150, whatever that means. Explanation here.
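Once you are running a build that includes that fix, the save/load round trip from the question should work as documented. A minimal sketch reusing the question's own variables (ratings, rank, numIterations, testdata):
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel

# Train and persist the model (variables as defined in the question).
model = ALS.train(ratings, rank, numIterations)
model.save(sc, "model0")

# Load it back and predict with the restored model.
same_model = MatrixFactorizationModel.load(sc, "model0")
predictions0 = same_model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))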

I ran into this as well; it looks like a bug.
I have reported it to the Spark JIRA.

Use a Pipeline from the ML package to train the model, and then use MLWriter and MLReader to save the model and read it back.
from pyspark.ml import Pipeline
from pyspark.ml import PipelineModel
pipeTrain.write().overwrite().save(outpath)
model_in = PipelineModel.load(outpath)
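A fuller sketch of that flow, assuming Spark 2.x+ with a SparkSession and an illustrative training DataFrame train_df with columns f1, f2 and label (the stage, column, and variable names here are just examples, not from the original answer):
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Build a simple two-stage pipeline (illustrative stages).
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

pipeTrain = pipeline.fit(train_df)            # fitted PipelineModel
pipeTrain.write().overwrite().save(outpath)   # persisted via MLWriter
model_in = PipelineModel.load(outpath)        # restored via MLReader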


Talos, syntax to select best_model

I'm running a toy model for learning on Ubuntu 21.10, in a conda environment that includes Python 3.7.4, Keras 2.4.3 and Talos 1.0, among many other packages. I've run a Talos scan using this code:
jam1 = talos.Scan(data,
                  labels[0,],
                  model=DLAt,
                  params=ParamsJam1,
                  experiment_name="DL2Outputs")
However, I've tried everything I can find and cannot work out the correct syntax to select the best model using talos.best_model.
bm = talos.best_model(metric='loss', asc=False)
just gets this error.
AttributeError: module 'talos' has no attribute 'best_model'
Is this not the correct function to achieve this?
The best model isn't exposed on the talos module itself; it's available on the Scan object (jam1 in your code):
bm = jam1.best_model(metric='loss', asc=False)

DeepPavlov FAQ Bot is returning a 'collections.OrderedDict' object is not callable error

I'm trying to use Colab to build an FAQ bot with DeepPavlov. I modified a tutorial notebook from the DeepPavlov site; the only thing I changed is using my own sample dataset, yet I get the 'collections.OrderedDict' object is not callable error when calling
answer=model_config(["help"])
answer
The full code for this (separated into cells) is
!pip install -q deeppavlov
from deeppavlov import configs
from deeppavlov.core.common.file import read_json
from deeppavlov.core.commands.infer import build_model
from deeppavlov import configs, train_model
model_config = read_json(configs.faq.tfidf_logreg_en_faq)
model_config["dataset_reader"]["data_path"] = None
model_config["dataset_reader"]["data_url"] = "https://docs.google.com/spreadsheets/d/e/2PACX-1vSUFqHL9u_KkSCfw03bYCIuzfCzfOwXlvsQeTb8tMVaSDYohcHbfL8jNtV24AZiSLNnJJQs58dIsO8A/pub?gid=788315638&single=true&output=csv"
model_config
answer=model_config(["help"])
answer
Does anyone know a fix so my bot runs with the sample dataset URL I provided in my code? I'm new to bots, DeepPavlov, and Colab, so the learning curve is steep.
Your code is missing the model training part: you are calling the config object instead of actually training a model and using it for prediction on your data.
However, this is not the only problem. First, you should change the data_path value to a string, otherwise you will run into problems there (you can try it yourself to check). Second, while running your code with my corrections I hit a CSV-parsing error, so please check your CSV file again and make sure to remove any empty rows. After you do that, the following code should work correctly.
model_config = read_json(configs.faq.tfidf_logreg_en_faq)
model_config["dataset_reader"]["data_path"] = ''
model_config["dataset_reader"]["data_url"] = "your-dataset-link"
faq = train_model(model_config)
answer = faq(["help"])
answer
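Since the thread is about saving and loading models, it is worth noting that train_model serializes the trained components to the paths given in the config, so a later session can (to the best of my understanding of DeepPavlov's API) restore the bot without retraining:
from deeppavlov import build_model

# Rebuild the already-trained FAQ pipeline from the files written by train_model.
faq = build_model(model_config)
print(faq(["help"]))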

How to load a finetuned sciBERT model in AllenNLP?

I have finetuned the SciBERT model on the SciIE dataset. The repository uses AllenNLP to finetune the model. The training is executed as follows:
python -m allennlp.run train $CONFIG_FILE --include-package scibert -s "$#"
After a successful training I have a model.tar.gz file as output that contains weights.th, config.json, and a vocabulary folder. I have tried to load it with the AllenNLP predictor:
from allennlp.predictors.predictor import Predictor
predictor = Predictor.from_path("model.tar.gz")
But I get the following error:
ConfigurationError: bert-pretrained not in acceptable choices for
dataset_reader.token_indexers.bert.type: ['single_id', 'characters',
'elmo_characters', 'spacy', 'pretrained_transformer',
'pretrained_transformer_mismatched']. You should either use the
--include-package flag to make sure the correct module is loaded, or use a fully qualified class name in your config file like {"model":
"my_module.models.MyModel"} to have it imported automatically.
I have never worked with allenNLP, so I am quite lost about what to do.
For reference, this is the part of the config that describes the token indexers:
"token_indexers": {
"bert": {
"type": "bert-pretrained",
"do_lowercase": "false",
"pretrained_model": "/home/tomaz/neo4j/scibert/model/vocab.txt",
"use_starting_offsets": true
}
}
I am using this allennlp version:
Name: allennlp
Version: 1.2.1
Edit:
I think I have made a lot of progress: I have to use the same version that was used to train the model, and I can import the modules like so:
from allennlp.predictors.predictor import Predictor
from scibert.models.bert_crf_tagger import *
from scibert.models.bert_text_classifier import *
from scibert.models.dummy_seq2seq import *
from scibert.dataset_readers.classification_dataset_reader import *
predictor = Predictor.from_path("scibert_ner/model.tar.gz",
                                dataset_reader="classification_dataset_reader")
predictor.predict(
sentence="Did Uriah honestly think he could beat The Legend of Zelda in under three hours?"
)
Now I get an error:
No default predictor for model type bert_crf_tagger.\nPlease specify a
predictor explicitly
I know that I can use the predictor_name to specify a predictor explicitly, but I haven't got the faintest idea which name to pick that would work
I have seen a lot of people having this problem. Upon going through the repository code, I found this to be the easiest way to run the predictions:
python -m allennlp.run predict /path/to/saved_model/model.tar.gz /path/to/test.txt \
    --include-package scibert --use-dataset-reader \
    --output-file /path/to/where/you/want/predict.txt \
    --predictor sentence-tagger --batch-size 16
What did I add? The predictor sentence-tagger. If you go through the repository, you will find that the registered predictor is sentence-tagger, although the DEFAULT_DICT of the taggers contains sentence_tagger. A lot of confusion, right?
This answer also saves you from writing a predictor.
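If you would rather stay in Python than use the CLI, the same registered name can be passed to Predictor.from_path via predictor_name. A sketch based on the allennlp 1.x API, reusing the imports from the question's edit (the example sentence is just illustrative):
from allennlp.predictors.predictor import Predictor
from scibert.models.bert_crf_tagger import *  # registers the custom model

predictor = Predictor.from_path(
    "scibert_ner/model.tar.gz",
    predictor_name="sentence-tagger",  # the predictor registered in the scibert repo
)
print(predictor.predict(sentence="Induction of apoptosis was observed in HeLa cells."))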

Google AI Platform: Unexpected error when loading the model: 'str' object has no attribute 'decode' [Keras 2.3.1, TF 1.15]

I am trying to use the beta Google Custom Prediction Routine in Google's AI Platform to run a live version of my model.
I include predictor.py in my package, which contains a Predictor class like this:
import os
import numpy as np
import pickle
import keras
from keras.models import load_model


class Predictor(object):
    """Interface for constructing custom predictors."""

    def __init__(self, model, preprocessor):
        self._model = model
        self._preprocessor = preprocessor

    def predict(self, instances, **kwargs):
        """Performs custom prediction.

        Instances are the decoded values from the request. They have already
        been deserialized from JSON.

        Args:
            instances: A list of prediction input instances.
            **kwargs: A dictionary of keyword args provided as additional
                fields on the predict request body.

        Returns:
            A list of outputs containing the prediction results. This list must
            be JSON serializable.
        """
        # pre-processing
        preprocessed_inputs = self._preprocessor.preprocess(instances[0])
        # predict
        outputs = self._model.predict(preprocessed_inputs)
        # post-processing (flip each prediction horizontally)
        outputs = np.array([np.fliplr(x) for x in outputs])
        return outputs.tolist()

    @classmethod
    def from_path(cls, model_dir):
        """Creates an instance of Predictor using the given path.

        Loading of the predictor should be done in this method.

        Args:
            model_dir: The local directory that contains the exported model
                file along with any additional files uploaded when creating the
                version resource.

        Returns:
            An instance implementing this Predictor class.
        """
        model_path = os.path.join(model_dir, 'keras.model')
        model = load_model(model_path, compile=False)
        preprocessor_path = os.path.join(model_dir, 'preprocess.pkl')
        with open(preprocessor_path, 'rb') as f:
            preprocessor = pickle.load(f)
        return cls(model, preprocessor)
The full error Create Version failed. Bad model detected with error: "Failed to load model: Unexpected error when loading the model: 'str' object has no attribute 'decode' (Error code: 0)" indicates that the issue is in this script, specifically when loading the model. However, I am able to successfully load the model in my notebook locally with the same code block in predict.py:
from keras.models import load_model
model = load_model('keras.model', compile=False)
I have seen similar posts which suggest setting the version of h5py to <3.0.0, but this hasn't helped. I can pin module versions for my custom prediction routine in a setup.py file like this:
from setuptools import setup

REQUIRED_PACKAGES = ['keras==2.3.1', 'h5py==2.10.0', 'opencv-python', 'pydicom', 'scikit-image']

setup(
    name='my_custom_code',
    install_requires=REQUIRED_PACKAGES,
    include_package_data=True,
    version='0.23',
    scripts=['predictor.py', 'preprocess.py'])
Unfortunately, I haven't found a good way to debug model deployment in Google's AI Platform, and the troubleshooting guide is unhelpful. Any pointers would be much appreciated. Thanks!
Edit 1:
The h5py module's version is wrong: it is 3.1.0 despite being set to 2.10.0 in setup.py. Does anyone know why? I confirmed that the Keras version and other modules are set properly, however. I've tried 'h5py==2.9.0' and 'h5py<3.0.0' to no avail. More on including PyPI package dependencies here.
Edit 2:
So it turns out Google currently does not support this capability.
I encountered the same problem with AI Platform, with code that was running fine two months ago when we last trained our models. Indeed, it is due to the h5py dependency, which suddenly fails to load the h5 model.
After a while I was able to make it work with runtime 2.2 and python version 3.7. I am also using the custom prediction routine and my model was a simple 2-layer bidirectional LSTM serving classifications.
I had a notebook VM set up with TF == 2.1 and downgraded h5py to <3.0.0 with:
!pip uninstall -y h5py
!pip install 'h5py < 3.0.0'
My setup.py looks like this:
from setuptools import setup

REQUIRED_PACKAGES = ['tensorflow==2.1', 'h5py<3.0.0']

setup(
    name="my_package",
    version="0.1",
    include_package_data=True,
    scripts=["preprocess.py", "model_prediction.py"]
)
I added compile=False to my model load code. Without it, I ran into another problem with deployment, which gave the following error: Create Version failed. Bad model detected with error: "Failed to load model: Unexpected error when loading the model: 'sample_weight_mode' (Error code: 0)"
The code change from the OP:
model = keras.models.load_model(
    os.path.join(model_dir, 'model.h5'), compile=False)
And this let the model be deployed as before without a problem. I suspect compile=False might mean slower prediction serving, but I have not noticed anything so far.
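If you later need the compiled state (for example to call evaluate or continue training), you can compile manually after loading. A small sketch, assuming you still know the original optimizer and loss (the names below are placeholders, not from the original answer):
from keras.models import load_model

model = load_model('model.h5', compile=False)
# Re-attach an optimizer/loss by hand; pure serving code can skip this step.
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])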
Hope this helps anyone stuck and googling these issues!

How to save to disk / export a lightgbm LGBMRegressor model trained in python?

Hi I am unable to find a way to save a lightgbm.LGBMRegressor model to a file for later re-use.
Try:
my_model.booster_.save_model('mode.txt')
#load from model:
bst = lgb.Booster(model_file='mode.txt')
Note: the API states that
bst = lgb.train(…)
bst.save_model('model.txt', num_iteration=bst.best_iteration)
Depending on the version, one of the above works. More generically, you can also use pickle or something similar to serialize your model.
import joblib
# save model
joblib.dump(my_model, 'lgb.pkl')
# load model
gbm_pickle = joblib.load('lgb.pkl')
Let me know if that helps
For Python 3.7 and lightgbm==2.3.1, I found that the previous answers were insufficient to correctly save and load a model. The following worked:
lgbr = lightgbm.LGBMRegressor(n_estimators=200, max_depth=5)
lgbr.fit(train[num_columns], train["prep_time_seconds"])
preds = lgbr.predict(predict[num_columns])
lgbr.booster_.save_model('lgbr_base.txt')
Finally, we can validate that this worked via:
model = lightgbm.Booster(model_file='lgbr_base.txt')
model.predict(predict[num_columns])
Without the above, I was getting the error: AttributeError: 'LGBMRegressor' object has no attribute 'save_model'
With the latest version of LightGBM, using import lightgbm as lgb, here is how to do it:
model.save_model('lgb_classifier.txt', num_iteration=model.best_iteration)
and then you can read the model back as follows:
model = lgb.Booster(model_file='lgb_classifier.txt')
clf.save_model('lgbm_model.mdl')
clf = lgb.Booster(model_file='lgbm_model.mdl')
