PyTorch model saving error: "Can't pickle local object" - python

When I try to save the PyTorch model with this piece of code:
checkpoint = {'model': Net(), 'state_dict': model.state_dict(),'optimizer' :optimizer.state_dict()}
torch.save(checkpoint, 'Checkpoint.pth')
I get the following error:
E:\PROGRAM FILES\Anaconda\envs\staj_projesi\lib\site-packages\torch\serialization.py:251: UserWarning: Couldn't retrieve source code for container of type Net. It won't be checked for correctness upon loading.
...
"type " + obj.__name__ + ". It won't be checked "
Can't pickle local object 'trainModel.<locals>.Net'
When I try to save the PyTorch model with this piece of code:
checkpoint = {'state_dict': model.state_dict(),'optimizer' :optimizer.state_dict()}
torch.save(checkpoint, 'Checkpoint.pth')
I don't get any errors, but I want to save the ANN class as well. How can I solve this problem? Also, in other projects I was able to save the model using the first structure.

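You can't, at least not this way. torch.save pickles whatever object you hand it, and pickle cannot serialize a class that is defined inside a function. Here Net is local to trainModel, hence "Can't pickle local object".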
When you use the following:
checkpoint = {'model': Net(), 'state_dict': model.state_dict(),'optimizer' :optimizer.state_dict()}
torch.save(checkpoint, 'Checkpoint.pth')
You are trying to save the model object itself. The learned parameters are already in model.state_dict(), and when loading a model from a state_dict you first instantiate the model class and then call load_state_dict() on that object.
This is exactly the reason why the second method works properly:
checkpoint = {'state_dict': model.state_dict(),'optimizer' :optimizer.state_dict()}
torch.save(checkpoint, 'Checkpoint.pth')
I would suggest reading the PyTorch docs on how to properly save/load a model at the following link:
https://pytorch.org/tutorials/beginner/saving_loading_models.html
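The error itself can be reproduced without PyTorch, since torch.save uses the standard pickle module by default. The following is only a minimal sketch: train_model, Net and TopLevelNet are made-up stand-ins for the asker's trainModel and model class, not their actual code.

```python
import pickle


class TopLevelNet:
    """Stand-in for a model class defined at module level."""


def train_model():
    # Stand-in for the asker's trainModel: the class only exists inside
    # this function, so pickle has no importable name to refer to it by.
    class Net:
        pass

    return Net()


def try_pickle(obj):
    """Return None if obj pickles, otherwise the error message."""
    try:
        pickle.dumps(obj)
        return None
    except (pickle.PicklingError, AttributeError) as exc:
        return str(exc)


local_error = try_pickle(train_model())
print("local class:", local_error)
print("module-level class:", try_pickle(TopLevelNet()))
```

Moving the class definition out of the function (for example into its own module that both the training and the loading script import) is the usual way around this; saving only the state_dict avoids pickling the class at all.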

Save and load models the usual way (https://pytorch.org/tutorials/beginner/saving_loading_models.html). If you also have args or dicts you want to save, and perhaps a lambda function, I sometimes pass dill as the pickle module and the errors go away, e.g.:
def save_for_meta_learning(args, ckpt_filename='ckpt.pt'):
    if is_lead_worker(args.rank):
        import dill
        args.logger.save_current_plots_and_stats()
        # - ckpt
        assert uutils.xor(args.training_mode == 'epochs', args.training_mode == 'iterations')
        args_pickable = uutils.make_args_pickable(args)
        # args.meta_learner.args = args_pickable
        f: nn.Module = get_model_from_ddp(args.base_model)
        # pickle vs torch_uu.save https://discuss.pytorch.org/t/advantages-disadvantages-of-using-pickle-module-to-save-models-vs-torch-save/79016
        torch.save({'training_mode': args.training_mode,  # its or epochs
                    'it': args.it,
                    'epoch_num': args.epoch_num,
                    # 'args': args_pickable,
                    'args_pickable': args_pickable,
                    # 'meta_learner': args.meta_learner,
                    'meta_learner_str': str(args.meta_learner),
                    # 'f': f,
                    'f_state_dict': f.state_dict(),
                    'f_str': str(f),
                    # 'f_modules': f._modules,
                    # 'f_modules_str': str(f._modules),
                    'outer_opt_state_dict': args.outer_opt.state_dict()
                    },
                   pickle_module=dill,
                   f=args.log_root / ckpt_filename)

Related

TensorFlow estimator is getting incorrect download input path

I am trying to run a TensorFlow estimator in Sagemaker Studio. This has worked in the past but after saving an html file of my notebook, that path is now being appended to the directory I am providing to download the input data for the training session.
My code:
# Set base model name that is used to save and load the model. Append a timestamp to it for uniqueness if batch tuning.
#Model Definition File
model_file = "Model.py"
#Input data saved as .npy
input_name = "inputdata.npy"
#Label data saved as .npy
label_name = "inputlabels.npy"
#Test data saved as .npy
test_name = "testdata.npy"
#Test labels saved as .npy
test_labels ="testlabels.npy"
#Path for trained model
bucket_dir = "s3://my-bucket"
#Bucket for trained model
bucket = "my-bucket"
model_name = "my_model"
model_name = model_name + ct.strftime("-%m%d%y-%H%M%S")
load_model_dir = os.path.join(bucket_dir,model_name)
data_dest = "/" + model_name + "/"+ input_name
print(logfile_name)
print(model_name)
print(load_model_dir)
print(data_dest)
#Define hyperparameters for hyperparameter tuning and set default values
shared_hyperparameters = {
    'bucket': bucket,
    'model_name': model_name,
    'model_dir': model_name,
    'sm_model_dir': model_name,
    'logfile_name': logfile_name,
    'train_data': input_name,
    'learning_rate': .001,
    'epochs': 100,
    'train': bucket_dir,
    'test': bucket_dir,
    'train_labels': label_name,
    'test_data': test_name,
    'test_labels': test_labels
}
...
aws_estimator = TensorFlow(
    entry_point=model_file,  # Model definition .py file
    bucket=bucket,
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    framework_version="2.1.0",
    py_version="py3",
    distribution={"parameter_server": {"enabled": True}},
    hyperparameters=shared_hyperparameters,
    metric_definitions=metric_definitions,
    log="All",
    my_name=model_name,
    log_name=logfile_name,
    train_data=input_name,
    train_labels=label_name,
)
history = aws_estimator.fit(bucket_dir)
This results in the following error:
UnexpectedStatusException: Error for Training job tensorflow-training-2021-11-19-22-51-12-360: Failed. Reason: ClientError: Data download failed:S3 key: s3://my-bucket/s3://my-bucket/model_dir/my-notebook.html has an illegal char sub-sequence '//' in it
I am not sure why the path to the HTML file is being appended onto the bucket_dir when it wasn't before. I saw a similar problem on the AWS forums, but no helpful response was provided. I have tried printing out what the SM_CHANNEL_TRAINING environment variable is before and after training, and it is None.
I'm not sure what your issue is, but what's weird is that you're passing lots of non-existent parameters to the TensorFlow estimator object (e.g., 'bucket').
To get started, I recommend cleaning things up: check the list of known parameters (also check the base classes, Framework and EstimatorBase), remove the non-existent parameters, and try again.
For some reason, this problem was resolved when I switched to running an instance with TensorFlow 2.3 instead of 2.7.

How can I retrieve the model.pkl in the experiment in Databricks

I want to retrieve the pickle off my trained model, which I know is in the run file inside my experiments in Databricks.
It seems that the mlflow.pyfunc.load_model can only do the predict method.
Is there a way to access the pickle directly?
I also tried passing the artifact path from the run to pickle.load(path) (example of path: dbfs:/databricks/mlflow-tracking/20526156406/92f3ec23bf614c9d934dd0195/artifacts/model/model.pkl).
Use the framework's native load_model() method (e.g. mlflow.sklearn.load_model()) or download_artifacts().
I recently found a solution, which can be done with either of the following two approaches:
Use a customized predict function at the moment of saving the model (check the Databricks documentation for more details).
Example given by Databricks:
class AddN(mlflow.pyfunc.PythonModel):
    def __init__(self, n):
        self.n = n

    def predict(self, context, model_input):
        return model_input.apply(lambda column: column + self.n)

# Construct and save the model
model_path = "add_n_model"
add5_model = AddN(n=5)
mlflow.pyfunc.save_model(path=model_path, python_model=add5_model)
# Load the model in `python_function` format
loaded_model = mlflow.pyfunc.load_model(model_path)
Download the model artifact and unpickle it yourself:
import pickle
from mlflow.tracking import MlflowClient
client = MlflowClient()
tmp_path = client.download_artifacts(run_id="0c7946c81fb64952bc8ccb3c7c66bca3", path='model/model.pkl')
# download_artifacts returns a local path; open it in binary mode to unpickle
with open(tmp_path, 'rb') as f:
    model = pickle.load(f)
client.list_artifacts(run_id="0c7946c81fb64952bc8ccb3c7c66bca3", path="")
client.list_artifacts(run_id="0c7946c81fb64952bc8ccb3c7c66bca3", path="model")
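A likely reason the pickle.load(path) attempt in the question fails is that pickle.load expects an open binary file object, not a path string (and a dbfs:/ URI is not a local filesystem path in any case). The sketch below shows the difference with a throwaway local file; the dictionary stands in for a real model and load_pickle is a hypothetical helper name.

```python
import os
import pickle
import tempfile


def load_pickle(path):
    """Open a local file in binary mode and unpickle its contents."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Write a small stand-in "model" to a temporary local file.
tmp_dir = tempfile.mkdtemp()
local_path = os.path.join(tmp_dir, "model.pkl")
with open(local_path, "wb") as fh:
    pickle.dump({"coef": [1.0, 2.0]}, fh)

# Passing the path string directly is rejected by pickle.load.
try:
    pickle.load(local_path)
    path_accepted = True
except TypeError:
    path_accepted = False

restored = load_pickle(local_path)
print(path_accepted, restored)
```

The same load_pickle call would apply to the local path returned by download_artifacts.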

Tensorflow: How to save a 'DNNRegressorV2' model? python

I ran into a problem when trying to save a trained model. I've tried:
model.save('~/Desktop/models/')
but it gave me an error AttributeError: 'DNNRegressorV2' object has no attribute 'save'
I have also tried:
tf.saved_model.save(model, mobilenet_save_path)
but it gave me an error ValueError: Expected a Trackable object for export, got <tensorflow_estimator.python.estimator.canned.dnn.DNNRegressorV2 object at 0x111cc4b70>.
Any idea?
>type(model)
<class 'tensorflow_estimator.python.estimator.canned.dnn.DNNRegressorV2'>
To save an Estimator you need to create a serving_input_receiver. This function builds a part of a tf.Graph that parses the raw data received by the SavedModel.
The tf.estimator.export module contains functions to help build these receivers.
The following code builds a receiver, based on the feature_columns, that accepts serialized tf.Example protocol buffers, which are often used with tf-serving.
import os
import tempfile
import tensorflow as tf
# input_column and estimator are the feature column and trained Estimator from your own code
tmpdir = tempfile.mkdtemp()
serving_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(
    tf.feature_column.make_parse_example_spec([input_column]))
estimator_base_path = os.path.join(tmpdir, 'from_estimator')
estimator_path = estimator.export_saved_model(estimator_base_path, serving_input_fn)
You can also load and run that model, from python:
imported = tf.saved_model.load(estimator_path)
def predict(x):
    example = tf.train.Example()
    example.features.feature["x"].float_list.value.extend([x])
    return imported.signatures["predict"](
        examples=tf.constant([example.SerializeToString()]))
print(predict(1.5))
print(predict(3.5))

How to send a tf.example into a TensorFlow Serving gRPC predict request

I have data in tf.example form and am attempting to make requests in predict form (using gRPC) to a saved model. I am unable to identify the method call to effect this.
I am starting with the well known Automobile pricing DNN regression model (https://github.com/tensorflow/models/blob/master/samples/cookbook/regression/dnn_regression.py) which I have already exported and mounted via the TF Serving docker container
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc
stub = prediction_service_pb2_grpc.PredictionServiceStub(grpc.insecure_channel("localhost:8500"))
tf_ex = tf.train.Example(
    features=tf.train.Features(
        feature={
            'curb-weight': tf.train.Feature(float_list=tf.train.FloatList(value=[5.1])),
            'highway-mpg': tf.train.Feature(float_list=tf.train.FloatList(value=[3.3])),
            'body-style': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"wagon"])),
            'make': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"Honda"])),
        }
    )
)
request = predict_pb2.PredictRequest()
request.model_spec.name = "regressor_test"
# Tried this:
request.inputs['inputs'].CopyFrom(tf_ex)
# Also tried this:
request.inputs['inputs'].CopyFrom(tf.contrib.util.make_tensor_proto(tf_ex))
# This doesn't work either:
request.input.example_list.examples.extend(tf_ex)
# If it did work, I would like to inference on it like this:
result = self.stub.Predict(request, 10.0)
Thanks for any advice
I assume your SavedModel has a serving_input_receiver_fn that takes a string as input and parses it into a tf.Example (see "Using SavedModel with Estimators"):
def serving_example_input_receiver_fn():
    serialized_tf_example = tf.placeholder(dtype=tf.string)
    receiver_tensors = {'inputs': serialized_tf_example}
    features = tf.parse_example(serialized_tf_example, YOUR_EXAMPLE_SCHEMA)
    return tf.estimator.export.ServingInputReceiver(features, receiver_tensors)
So serving_input_receiver_fn accepts a string, which means you have to call SerializeToString() on your tf.Example. Besides, serving_input_receiver_fn works like the input_fn used for training: data is fed to the model in batches, so the serialized example goes inside a list.
The code could change to:
from tensorflow.core.framework import types_pb2
request = predict_pb2.PredictRequest()
request.model_spec.name = "regressor_test"
request.model_spec.signature_name = 'your method signature, check use saved_model_cli'
request.inputs['inputs'].CopyFrom(tf.make_tensor_proto([tf_ex.SerializeToString()], dtype=types_pb2.DT_STRING))
@hakunami's answer didn't work for me. But when I modify the last line to
request.inputs['inputs'].CopyFrom(tf.make_tensor_proto([tf_ex.SerializeToString()], dtype=types_pb2.DT_STRING, shape=[1]))
it works. If "shape" is None, the resulting tensor proto represents the numpy array precisely.

Sagemaker "Could not find model data" when trying to deploy my model

I have a training script in Sagemaker like,
def train(current_host, hosts, num_cpus, num_gpus, channel_input_dirs, model_dir, hyperparameters, **kwargs):
    ... Train a network ...
    return net

def save(net, model_dir):
    # save the model
    logging.info('Saving model')
    y = net(mx.sym.var('data'))
    y.save('%s/model.json' % model_dir)
    net.collect_params().save('%s/model.params' % model_dir)

def model_fn(model_dir):
    symbol = mx.sym.load('%s/model.json' % model_dir)
    outputs = mx.symbol.softmax(data=symbol, name='softmax_label')
    inputs = mx.sym.var('data')
    param_dict = gluon.ParameterDict('model_')
    net = gluon.SymbolBlock(outputs, inputs, param_dict)
    net.load_params('%s/model.params' % model_dir, ctx=mx.cpu())
    return net
Most of which I stole from the MNIST Example.
When I train, everything goes fine, but when trying to deploy like,
m = MXNet("lstm_trainer.py",
          role=role,
          train_instance_count=1,
          train_instance_type="ml.c4.xlarge",
          hyperparameters={'batch_size': 100,
                           'epochs': 20,
                           'learning_rate': 0.1,
                           'momentum': 0.9,
                           'log_interval': 100})
m.fit(inputs) # No errors
predictor = m.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
I get, (full output)
INFO:sagemaker:Creating model with name: sagemaker-mxnet-py2-cpu-2018-01-17-20-52-52-599
---------------------------------------------------------------------------
... Stack dump ...
ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Could not find model data at s3://sagemaker-us-west-2-01234567890/sagemaker-mxnet-py2-cpu-2018-01-17-20-52-52-599/output/model.tar.gz.
Looking in my S3 bucket s3://sagemaker-us-west-2-01234567890/sagemaker-mxnet-py2-cpu-2018-01-17-20-52-52-599/output/model.tar.gz, I in fact don't see the model.
What am I missing?
When you are calling the training job you should specify the output directory:
#Bucket location where results of model training are saved.
model_artifacts_location = 's3://<bucket-name>/artifacts'
m = MXNet(entry_point='lstm_trainer.py',
          role=role,
          output_path=model_artifacts_location,
          ...)
If you don't specify the output directory, the estimator uses a default location, which it might not have permission to create or write to.
I have had the same issue using a different Estimator in a very similar way on Sagemaker.
My issue was that on a re-deploy, after the first deployment, I had to delete the old "Endpoint Configuration", which was confusingly pointing the endpoint to an old model location. I imagine this could be done from Python using the AWS API, although it is very easy to check on the portal whether this is the same issue.
