I have a training script in SageMaker that looks like this:
def train(current_host, hosts, num_cpus, num_gpus, channel_input_dirs, model_dir, hyperparameters, **kwargs):
    ... Train a network ...
    return net

def save(net, model_dir):
    # save the model
    logging.info('Saving model')
    y = net(mx.sym.var('data'))
    y.save('%s/model.json' % model_dir)
    net.collect_params().save('%s/model.params' % model_dir)

def model_fn(model_dir):
    symbol = mx.sym.load('%s/model.json' % model_dir)
    outputs = mx.symbol.softmax(data=symbol, name='softmax_label')
    inputs = mx.sym.var('data')
    param_dict = gluon.ParameterDict('model_')
    net = gluon.SymbolBlock(outputs, inputs, param_dict)
    net.load_params('%s/model.params' % model_dir, ctx=mx.cpu())
    return net
Most of which I stole from the MNIST Example.
When I train, everything goes fine, but when I try to deploy like this:
m = MXNet("lstm_trainer.py",
          role=role,
          train_instance_count=1,
          train_instance_type="ml.c4.xlarge",
          hyperparameters={'batch_size': 100,
                           'epochs': 20,
                           'learning_rate': 0.1,
                           'momentum': 0.9,
                           'log_interval': 100})
m.fit(inputs) # No errors
predictor = m.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
I get (full output):
INFO:sagemaker:Creating model with name: sagemaker-mxnet-py2-cpu-2018-01-17-20-52-52-599
---------------------------------------------------------------------------
... Stack dump ...
ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Could not find model data at s3://sagemaker-us-west-2-01234567890/sagemaker-mxnet-py2-cpu-2018-01-17-20-52-52-599/output/model.tar.gz.
Looking in my S3 bucket s3://sagemaker-us-west-2-01234567890/sagemaker-mxnet-py2-cpu-2018-01-17-20-52-52-599/output/model.tar.gz, I in fact don't see the model.
What am I missing?
When you are calling the training job you should specify the output directory:
# Bucket location where results of model training are saved.
model_artifacts_location = 's3://<bucket-name>/artifacts'
m = MXNet(entry_point='lstm_trainer.py',
          role=role,
          output_path=model_artifacts_location,
          ...)
If you don't specify the output directory, the estimator uses a default location, which it might not have permission to create or write to.
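As a rough sketch of where to look after training: judging by the error message in the question, the model archive is expected at `<output location>/<training job name>/output/model.tar.gz`. Assuming the same layout applies beneath a custom `output_path` (an assumption worth confirming against your own bucket), the helpers below build that URI and split it into bucket and key so you can check it by hand. The function names and the `my-bucket` bucket are made up for this example.

```python
from urllib.parse import urlparse


def expected_model_uri(output_path, job_name):
    """Build the URI where the model archive is expected (layout is an assumption)."""
    return "%s/%s/output/model.tar.gz" % (output_path.rstrip("/"), job_name)


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: %r" % uri)
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical bucket name; the job name is the one from the question.
uri = expected_model_uri("s3://my-bucket/artifacts",
                         "sagemaker-mxnet-py2-cpu-2018-01-17-20-52-52-599")
bucket, key = split_s3_uri(uri)
print(bucket)
print(key)
```

If the key printed here does not exist in the bucket after `fit()` returns, the training job most likely never uploaded an archive there.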
I had the same issue using a different Estimator in a very similar way on SageMaker.
In my case, when re-deploying after the first deployment, I had to delete the old "Endpoint Configuration", which was confusingly pointing the endpoint to an old model location. I imagine this could be done from Python using the AWS API, although it is very easy to test on the portal whether this is the same issue.
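For the "from Python" route, a minimal sketch might look like the following. It assumes the boto3 SageMaker client, whose `delete_endpoint_config` call takes an `EndpointConfigName` argument; the wrapper function and the configuration name are hypothetical, and the client is passed in so the helper itself has no AWS dependency.

```python
def delete_stale_endpoint_config(sagemaker_client, config_name):
    """Delete an endpoint configuration by name and return that name."""
    sagemaker_client.delete_endpoint_config(EndpointConfigName=config_name)
    return config_name


# Typical use (needs boto3 and AWS credentials, so it is left commented out):
# import boto3
# client = boto3.client("sagemaker")
# delete_stale_endpoint_config(client, "my-old-endpoint-config")  # hypothetical name
```

After the stale configuration is gone, calling `deploy()` again should create a fresh one that points at the current model.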
We are currently moving our models from single model endpoints to multi model endpoints within AWS SageMaker. After deploying the Multi Model Endpoint using prebuilt TensorFlow containers I receive the following error when calling the predict() method:
{"error": "JSON Parse error: The document root must not be followed by other value at offset: 17"}
I invoke the endpoint like this:
data = np.random.rand(n_samples, n_features)
predictor = Predictor(endpoint_name=endpoint_name)
prediction = predictor.predict(data=serializer.serialize(data), target_model=model_name)
My function for processing the input is the following:
def _process_input(data, context):
    data = data.read().decode('utf-8')
    data = [float(x) for x in data.split(',')]
    return json.dumps({'instances': [data]})
For the training I configured my container as follows:
tensorflow_container = TensorFlow(
    entry_point=path_script,
    framework_version='2.4',
    py_version='py37',
    instance_type='ml.m4.2xlarge',
    instance_count=1,
    role=EXECUTION_ROLE,
    sagemaker_session=sagemaker_session,
    hyperparameters=hyperparameters)
tensorflow_container.fit()
To deploy the endpoint, I first initialize a Model from a given Estimator and then a MultiDataModel:
model = estimator.create_model(
    role=EXECUTION_ROLE,
    image_uri=estimator.training_image_uri(),
    entry_point=path_serving)
mdm = MultiDataModel(
    name=endpoint_name,
    model_data_prefix=dir_model_data,
    model=model,
    sagemaker_session=sagemaker.Session())
mdm.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name)
Afterwards the single models are added using:
mdm.add_model(
    model_data_source=source_path,
    model_data_path=model_name)
Thank you for any hints and help.
This error usually means the JSON payload is damaged or malformed. I recommend running it through a JSON validator such as https://jsonlint.com/
I work at AWS and my opinions are my own - Thanks,Raghu
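The same check can be done locally before the payload is sent. The wording "document root must not be followed by other value" suggests that something follows the first complete JSON value; Python's `json` module reports that situation as "Extra data". The sketch below is only an illustration: the `check_json` helper and both sample payloads are made up, and it does not reproduce the serving container's parser.

```python
import json


def check_json(payload):
    """Return (True, parsed) for one valid JSON document, else (False, message)."""
    try:
        return True, json.loads(payload)
    except json.JSONDecodeError as err:
        return False, "%s at offset %d" % (err.msg, err.pos)


good = json.dumps({"instances": [[0.1, 0.2, 0.3]]})
bad = good + "[0.4]"  # a second value after the document root

print(check_json(good))
print(check_json(bad))
```

Running the string that actually leaves `_process_input` through a check like this should show whether the payload or the request body is at fault.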
I am trying to run a TensorFlow estimator in Sagemaker Studio. This has worked in the past but after saving an html file of my notebook, that path is now being appended to the directory I am providing to download the input data for the training session.
My code:
# Set base model name that is used to save and load the model. Append a timestamp to it for uniqueness if batch tuning.
#Model Definition File
model_file = "Model.py"
#Input data saved as .npy
input_name = "inputdata.npy"
#Label data saved as .npy
label_name = "inputlabels.npy"
#Test data saved as .npy
test_name = "testdata.npy"
#Test labels saved as .npy
test_labels ="testlabels.npy"
#Path for trained model
bucket_dir = "s3://my-bucket"
#Bucket for trained model
bucket = "my-bucket"
model_name = "my_model"
model_name = model_name + ct.strftime("-%m%d%y-%H%M%S")
load_model_dir = os.path.join(bucket_dir,model_name)
data_dest = "/" + model_name + "/"+ input_name
print(logfile_name)
print(model_name)
print(load_model_dir)
print(data_dest)
#Define hyperparameters for hyperparameter tuning and set default values
shared_hyperparameters = {
    'bucket': bucket,
    'model_name': model_name,
    'model_dir': model_name,
    'sm_model_dir': model_name,
    'logfile_name': logfile_name,
    'train_data': input_name,
    'learning_rate': .001,
    'epochs': 100,
    'train': bucket_dir,
    'test': bucket_dir,
    'train_labels': label_name,
    'test_data': test_name,
    'test_labels': test_labels
}
...
aws_estimator = TensorFlow(
    entry_point=model_file,  # Model definition .py file
    bucket=bucket,
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    framework_version="2.1.0",
    py_version="py3",
    distribution={"parameter_server": {"enabled": True}},
    hyperparameters=shared_hyperparameters,
    metric_definitions=metric_definitions,
    log="All",
    my_name=model_name,
    log_name=logfile_name,
    train_data=input_name,
    train_labels=label_name,
)
history = aws_estimator.fit(bucket_dir)
This results in the following error:
UnexpectedStatusException: Error for Training job tensorflow-training-2021-11-19-22-51-12-360: Failed. Reason: ClientError: Data download failed:S3 key: s3://my-bucket/s3://my-bucket/model_dir/my-notebook.html has an illegal char sub-sequence '//' in it
I am not sure why the path to the HTML file is being appended onto the bucket_dir when it wasn't before. I saw a similar problem on the AWS forums, but no helpful response was provided. I have tried printing out what the SM_CHANNEL_TRAINING environment variable is before and after training, and it is None.
I'm not sure what your issue is, but what's weird is that you're passing lots of non-existent parameters to the TensorFlow estimator object (e.g., 'bucket').
To get started, I recommend cleaning things up: see the list of known parameters here (also check the base classes: Framework and EstimatorBase), remove the non-existent parameters, and try again.
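While cleaning up, it may also help to check every S3 URI you pass in against the complaint in the error message (a key containing '//', here because a second s3:// URI ended up inside the key). The sketch below is a local sanity check only; the `s3_key_problems` helper is made up for illustration and says nothing about where SageMaker builds the bad path.

```python
from urllib.parse import urlparse


def s3_key_problems(uri):
    """Return a list of human-readable problems found in an S3 URI."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        return ["not of the form s3://bucket/key"]
    problems = []
    key = parsed.path.lstrip("/")
    if "//" in key:
        problems.append("key contains '//'")
    if "s3://" in key:
        problems.append("key embeds a second s3:// URI")
    return problems


# The URI from the error message, and a well-formed one for comparison.
print(s3_key_problems("s3://my-bucket/s3://my-bucket/model_dir/my-notebook.html"))
print(s3_key_problems("s3://my-bucket/my_model/inputdata.npy"))
```

Any value that is both a full URI and later joined onto another prefix (for example a hyperparameter such as 'model_dir') is a candidate for this kind of doubling.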
For some reason, this problem was resolved when I switched to running an instance with TensorFlow 2.3 instead of 2.7.
I want to retrieve the pickle of my trained model, which I know is in the run file inside my experiments in Databricks.
It seems that mlflow.pyfunc.load_model only exposes the predict method.
Is there an option to access the pickle directly?
I also tried to use the path in the run using the pickle.load(path) (example of path: dbfs:/databricks/mlflow-tracking/20526156406/92f3ec23bf614c9d934dd0195/artifacts/model/model.pkl).
Use the framework's native load_model() method (e.g. mlflow.sklearn.load_model()) or download_artifacts().
I recently found a solution, which can be done by either of the following two approaches:
Use a customized predict function at the moment of saving the model (check the Databricks documentation for more details).
Example given by Databricks:
import mlflow.pyfunc

class AddN(mlflow.pyfunc.PythonModel):
    def __init__(self, n):
        self.n = n

    def predict(self, context, model_input):
        return model_input.apply(lambda column: column + self.n)

# Construct and save the model
model_path = "add_n_model"
add5_model = AddN(n=5)
mlflow.pyfunc.save_model(path=model_path, python_model=add5_model)
# Load the model in `python_function` format
loaded_model = mlflow.pyfunc.load_model(model_path)
Download the model artefact from the run and load it with pickle:
import pickle
from mlflow.tracking import MlflowClient

client = MlflowClient()
# Download the pickled model from the run's artifacts to a local temporary path
tmp_path = client.download_artifacts(run_id="0c7946c81fb64952bc8ccb3c7c66bca3", path='model/model.pkl')
with open(tmp_path, 'rb') as f:
    model = pickle.load(f)
# List the run's artifacts to find the right artifact path
client.list_artifacts(run_id="0c7946c81fb64952bc8ccb3c7c66bca3", path="")
client.list_artifacts(run_id="0c7946c81fb64952bc8ccb3c7c66bca3", path="model")
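One detail that is easy to miss: `pickle.load` expects an open binary file object, not a path string, which is probably why calling it directly on the dbfs:/ path in the question did not work. A small stand-alone sketch of the loading step, using a temporary file and a plain dict as stand-ins for the downloaded artifact (both are assumptions for illustration):

```python
import os
import pickle
import tempfile


def load_pickle(local_path):
    """Open a local file in binary mode and unpickle its contents."""
    with open(local_path, "rb") as handle:
        return pickle.load(handle)


# Stand-in for a downloaded artifact: a plain dict pickled to a temp file.
tmp_dir = tempfile.mkdtemp()
artifact_path = os.path.join(tmp_dir, "model.pkl")
with open(artifact_path, "wb") as handle:
    pickle.dump({"coef": [1.0, 2.0]}, handle)

restored = load_pickle(artifact_path)
print(restored)
```

With a real run, `artifact_path` would be the local path returned by the download call rather than a file written by hand.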
When I try to save the PyTorch model with this piece of code:
checkpoint = {'model': Net(), 'state_dict': model.state_dict(),'optimizer' :optimizer.state_dict()}
torch.save(checkpoint, 'Checkpoint.pth')
I get the following error:
E:\PROGRAM FILES\Anaconda\envs\staj_projesi\lib\site-packages\torch\serialization.py:251: UserWarning: Couldn't retrieve source code for container of type Net. It won't be checked for correctness upon loading.
...
"type " + obj.__name__ + ". It won't be checked "
Can't pickle local object 'trainModel.<locals>.Net'
When I try to save the PyTorch model with this piece of code:
checkpoint = {'state_dict': model.state_dict(),'optimizer' :optimizer.state_dict()}
torch.save(checkpoint, 'Checkpoint.pth')
I don't get any errors, but I want to save the ANN class. How can I solve this problem? Also, I was able to save the model with the first structure in other projects before.
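You can't pickle it this way.
When you use the following:
checkpoint = {'model': Net(), 'state_dict': model.state_dict(),'optimizer' :optimizer.state_dict()}
torch.save(checkpoint, 'Checkpoint.pth')
you are asking torch.save to pickle a Net instance itself. torch.save relies on pickle, and pickle cannot serialize a class defined inside a function (here Net is local to trainModel), which is what the error message says. The learned parameters are already in model.state_dict(), and when loading a model from a state_dict you should first instantiate a model object and then load the state into it. This would also explain why the first structure worked in your other projects, presumably because Net was defined at module level there.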
This is exactly the reason why the second method works properly:
checkpoint = {'state_dict': model.state_dict(),'optimizer' :optimizer.state_dict()}
torch.save(checkpoint, 'Checkpoint.pth')
I would suggest reading the PyTorch docs on how to properly save/load a model at the following link:
https://pytorch.org/tutorials/beginner/saving_loading_models.html
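The "Can't pickle local object" part can be reproduced without PyTorch at all. The sketch below uses a made-up stand-in class (not a real nn.Module) defined inside a function, the way Net is defined inside trainModel, and shows that the object itself cannot be pickled while its plain-data state can:

```python
import pickle


def make_local_net():
    class Net:  # defined inside a function, like Net inside trainModel
        def __init__(self):
            self.weights = [0.1, 0.2]

        def state_dict(self):
            # Plain data only, standing in for a real state_dict
            return {"weights": list(self.weights)}

    return Net()


def can_pickle(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


net = make_local_net()
print(can_pickle(net))               # instance of a function-local class
print(can_pickle(net.state_dict()))  # its plain-data state
```

Moving the class definition to module level, or saving only the state_dict and re-creating the model on load, avoids the problem.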
Follow the usual way of saving and loading models (https://pytorch.org/tutorials/beginner/saving_loading_models.html). If you also have args or dicts you want to save, and perhaps a lambda function, I sometimes use dill and the errors go away, e.g.:
def save_for_meta_learning(args, ckpt_filename='ckpt.pt'):
    if is_lead_worker(args.rank):
        import dill
        args.logger.save_current_plots_and_stats()
        # - ckpt
        assert uutils.xor(args.training_mode == 'epochs', args.training_mode == 'iterations')
        args_pickable = uutils.make_args_pickable(args)
        # args.meta_learner.args = args_pickable
        f: nn.Module = get_model_from_ddp(args.base_model)
        # pickle vs torch_uu.save https://discuss.pytorch.org/t/advantages-disadvantages-of-using-pickle-module-to-save-models-vs-torch-save/79016
        torch.save({'training_mode': args.training_mode,  # its or epochs
                    'it': args.it,
                    'epoch_num': args.epoch_num,
                    # 'args': args_pickable,
                    'args_pickable': args_pickable,
                    # 'meta_learner': args.meta_learner,
                    'meta_learner_str': str(args.meta_learner),
                    # 'f': f,
                    'f_state_dict': f.state_dict(),
                    'f_str': str(f),
                    # 'f_modules': f._modules,
                    # 'f_modules_str': str(f._modules),
                    'outer_opt_state_dict': args.outer_opt.state_dict()
                    },
                   pickle_module=dill,
                   f=args.log_root / ckpt_filename)
I am building a data transformation and training pipeline on Azure Machine Learning Service. I'd like to save my fitted transformer (e.g. tf-idf) to the blob, so my prediction pipeline can later access it.
transformed_data = PipelineData("transformed_data",
                                datastore = default_datastore,
                                output_path_on_compute="my_project/tfidf")
step_tfidf = PythonScriptStep(name = "tfidf_step",
                              script_name = "transform.py",
                              arguments = ['--input_data', blob_train_data,
                                           '--output_folder', transformed_data],
                              inputs = [blob_train_data],
                              outputs = [transformed_data],
                              compute_target = aml_compute,
                              source_directory = project_folder,
                              runconfig = run_config,
                              allow_reuse = False)
The above code saves the transformer to a current run's folder, which is dynamically generated during each run.
I want to save the transformer to a fixed location on blob, so I can access it later, when calling a prediction pipeline.
I tried to use an instance of DataReference class as PythonScriptStep output, but it results in an error:
ValueError: Unexpected output type: <class 'azureml.data.data_reference.DataReference'>
It's because PythonScriptStep only accepts PipelineData or OutputPortBinding objects as outputs.
How could I save my fitted transformer so it's later accessible by any arbitrary process (e.g. my prediction pipeline)?
This is likely not flexible enough for your needs (also, I haven't tested this yet), but if you are using scikit-learn one possibility is to include the tf-idf/transformation step into a scikit-learn Pipeline object and register that into your workspace.
Your training script would thus contain:
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words = list(text.ENGLISH_STOP_WORDS))),
    ('classifier', SGDClassifier())
])
pipeline.fit(train[label].values, train[pred_label].values)
# Serialize the pipeline
joblib.dump(value=pipeline, filename='outputs/model.pkl')
and your experiment submission script would contain
run = exp.submit(src)
run.wait_for_completion(show_output = True)
model = run.register_model(model_name='my_pipeline', model_path='outputs/model.pkl')
Then, you could use the registered "model" and deploy it as a service as explained in the documentation, by loading it into a scoring script via
model_path = Model.get_model_path('my_pipeline')
# deserialize the model file back into a sklearn model
model = joblib.load(model_path)
However this would bake the transformation in your pipeline, and thus would not be as modular as you ask...
Another option will be to use DataTransferStep and use it to copy the output to a "known location." This notebook has examples of using DataTransferStep to copy data from and to various supported datastores.
from azureml.core import Datastore
from azureml.data.data_reference import DataReference
from azureml.exceptions import ComputeTargetException
from azureml.core.compute import ComputeTarget, DataFactoryCompute
from azureml.pipeline.steps import DataTransferStep

blob_datastore = Datastore.get(ws, "workspaceblobstore")
blob_data_ref = DataReference(
    datastore=blob_datastore,
    data_reference_name="knownloaction",
    path_on_datastore="knownloaction")
data_factory_name = 'adftest'

def get_or_create_data_factory(workspace, factory_name):
    try:
        return DataFactoryCompute(workspace, factory_name)
    except ComputeTargetException as e:
        if 'ComputeTargetNotFound' in e.message:
            print('Data factory not found, creating...')
            provisioning_config = DataFactoryCompute.provisioning_configuration()
            data_factory = ComputeTarget.create(workspace, factory_name, provisioning_config)
            data_factory.wait_for_completion()
            return data_factory
        else:
            raise e

data_factory_compute = get_or_create_data_factory(ws, data_factory_name)
# Assuming output data is your output from the step that you want to copy
transfer_to_known_location = DataTransferStep(
    name="transfer_to_known_location",
    source_data_reference=[output_data],
    destination_data_reference=blob_data_ref,
    compute_target=data_factory_compute
)
from azureml.pipeline.core import Pipeline
from azureml.core import Workspace, Experiment
pipeline_01 = Pipeline(
    description="transfer_to_known_location",
    workspace=ws,
    steps=[transfer_to_known_location])
pipeline_run_01 = Experiment(ws, "transfer_to_known_location").submit(pipeline_01)
pipeline_run_01.wait_for_completion()
Another solution is to pass DataReference as an input to your PythonScriptStep.
Then inside transform.py you're able to read this DataReference as a command line argument.
You can parse it and use it just as any regular path to save your vectorizer to.
E.g. you can:
step_tfidf = PythonScriptStep(name = "tfidf_step",
                              script_name = "transform.py",
                              arguments = ['--input_data', blob_train_data,
                                           '--output_folder', transformed_data,
                                           '--transformer_path', trained_transformer_path],
                              inputs = [blob_train_data, trained_transformer_path],
                              outputs = [transformed_data],
                              compute_target = aml_compute,
                              source_directory = project_folder,
                              runconfig = run_config,
                              allow_reuse = False)
Then inside your script (transform.py in the example above) you can e.g.:
import argparse
import joblib as jbl
import os
from sklearn.feature_extraction.text import TfidfVectorizer

parser = argparse.ArgumentParser()
parser.add_argument('--transformer_path', dest="transformer_path", required=True)
args = parser.parse_args()

tfidf = TfidfVectorizer()  # HERE CREATE AND TRAIN YOUR VECTORIZER
vect_filename = os.path.join(args.transformer_path, 'my_vectorizer.jbl')
# Save the fitted vectorizer to the path passed in via the DataReference
jbl.dump(tfidf, vect_filename)
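The save side and the later load side (for example in the prediction pipeline) can be sketched as a pair of helpers. This is only an illustration under stated assumptions: it uses the standard library's pickle and a temporary directory as stand-ins for joblib and the mounted DataReference path, and the helper names and the plain dict "transformer" are made up.

```python
import os
import pickle
import tempfile


def save_transformer(transformer, folder, filename="my_vectorizer.pkl"):
    """Write the transformer under folder, creating the folder if needed."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    with open(path, "wb") as handle:
        pickle.dump(transformer, handle)
    return path


def load_transformer(folder, filename="my_vectorizer.pkl"):
    """Read the transformer back from the same fixed location."""
    with open(os.path.join(folder, filename), "rb") as handle:
        return pickle.load(handle)


# Stand-ins: a temp directory for the fixed blob path, a dict for the vectorizer.
fixed_folder = os.path.join(tempfile.mkdtemp(), "my_project", "tfidf")
stand_in = {"vocabulary": {"hello": 0, "world": 1}}
saved_path = save_transformer(stand_in, fixed_folder)
print(load_transformer(fixed_folder) == stand_in)
```

In the real pipeline both steps would receive the same DataReference path as a command-line argument, so the prediction step reads exactly what the training step wrote.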
EXTRA: The third way would be to just register the vectorizer as another model in your workspace. You can then use it exactly as any other registered model. (Though this option does not involve explicit writing to blob - as specified in the question above)