SageMaker Endpoint: ServiceUnavailable 503 when calling the InvokeEndpoint operation - python

I've deployed a model as a SageMaker endpoint. It worked fine for some time, but now when I invoke the model through boto3:
import boto3

client = boto3.client('sagemaker-runtime')
response = client.invoke_endpoint(
    EndpointName="my-sagemaker-endpoint",
    ContentType="text/csv",
    Body=payload,
)
I get the following error:
ServiceUnavailable: An error occurred (ServiceUnavailable) when calling the InvokeEndpoint operation (reached max retries: 4): A transient exception occurred while retrieving variant instances. Please try again later.
Researching this error in the SageMaker documentation, I found that it states the following:
The request has failed due to a temporary failure of the server.
I've also checked the instance metrics in CloudWatch and there's nothing unusual.
I'm not sure why this error is happening; any suggestions would be helpful.

TL;DR: The error occurs because the instance is unable to retrieve the SageMaker model artifact from S3.
Explanation
SageMaker endpoints implement a /ping route that checks whether the model artifact can be loaded on the instance. The model artifact is first retrieved from S3 and then loaded into the instance; if the model is not available in S3, this check fails.
In my case the model artifact couldn't be retrieved from S3 because it had been accidentally deleted, so it couldn't be loaded, and the /ping health check raised a No such file or directory error.
This in turn makes the load balancer assume the instance has a problem and block access to it, so when you try to invoke the endpoint you get a 503 Service Unavailable error.
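You can confirm this from outside the container by checking the endpoint status with boto3; a minimal sketch (using the endpoint name from the question):
import boto3

sm = boto3.client('sagemaker')
desc = sm.describe_endpoint(EndpointName="my-sagemaker-endpoint")
# An endpoint whose instances fail the /ping health check will not be "InService".
print(desc['EndpointStatus'], desc.get('FailureReason', ''))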
Solution
I worked this out by redeploying to a new endpoint, but this time considering the following (see the sketch after this list):
Use at least two instances, so that each instance is placed in a different Availability Zone and the load balancer can always reach at least one healthy instance.
Ensure that only specific roles have s3:PutObject permission on the S3 model artifact path models/model-name/version.
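A minimal redeployment sketch with the SageMaker Python SDK; the container image, artifact path, role, and endpoint name below are all illustrative assumptions:
from sagemaker.model import Model

# Hypothetical names; substitute your own container image, artifact path, and role.
model = Model(
    image_uri=image_uri,
    model_data="s3://my-bucket/models/model-name/version/model.tar.gz",
    role=execution_role,
)
# Two instances, so the load balancer always has a healthy target
# in a second Availability Zone.
predictor = model.deploy(
    initial_instance_count=2,
    instance_type="ml.m5.xlarge",
    endpoint_name="my-sagemaker-endpoint-v2",
)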

Related

Amazon SageMaker could not get a response from the endpoint

I have built an anomaly detection model using the AWS SageMaker built-in algorithm Random Cut Forest (RCF):
rcf = RandomCutForest(
    role=execution_role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    num_samples_per_tree=1000,
    num_trees=100,
    encrypt_inter_container_traffic=True,
    enable_network_isolation=True,
    enable_sagemaker_metrics=True,
)
and created the endpoint:
rcf_inference = rcf.deploy(
    initial_instance_count=4,
    instance_type="ml.m5.xlarge",
    endpoint_name="RCF-container2",
    enable_network_isolation=True,
)
But when I try to get predictions from the endpoint, I run into the following error:
results = rcf_inference.predict(df.values)
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from model with message "Amazon SageMaker could not get a response from the RCF-container2 endpoint. This can occur when CPU or memory utilization is high. To check your utilization, see Amazon CloudWatch. To fix this problem, use an instance type with more CPU capacity or memory."
I have tried a larger CPU instance, but I am still getting the same issue. I suspect the issue is functional rather than capacity-related.
Please help.
I would suggest checking the CloudWatch Logs to see if there is any other error that could point to the issue.
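For example, a minimal sketch that pulls the endpoint's container logs with boto3 (the log group name follows SageMaker's /aws/sagemaker/Endpoints/<endpoint-name> convention):
import boto3

logs = boto3.client('logs')
group = '/aws/sagemaker/Endpoints/RCF-container2'
# One log stream per instance/container; print the most recent events of each.
for stream in logs.describe_log_streams(logGroupName=group)['logStreams']:
    events = logs.get_log_events(
        logGroupName=group,
        logStreamName=stream['logStreamName'],
        limit=50,
    )
    for event in events['events']:
        print(event['message'])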
I work for AWS and my opinions are my own.

When calling a SageMaker deploy_endpoint function with an a1.small instance, I'm given an error that I can't open an ml.m5.2xlarge instance

While working through a notebook generated by Autopilot, I went to execute the final code cell:
pipeline_model.deploy(
    initial_instance_count=1,
    instance_type='a1.small',
    endpoint_name=pipeline_model.name,
    wait=True,
)
I get this error:
ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateEndpoint operation: The account-level service limit 'ml.m5.2xlarge for endpoint usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit.
The most important part of that is the last line, where it mentions resource limits. But I'm not trying to launch the instance type the error complains about.
Does the endpoint NEED to be on an ml.m5.2xlarge instance? Or is the code acting up?
Thanks in advance guys and gals.
You should use one of the supported on-demand ML hosting instance types, as detailed at this link. SageMaker hosting instance types are prefixed with ml., so I think the invalid instance_type='a1.small' is being replaced by a valid default (ml.m5.2xlarge), for which your account's quota is 0 instances. The weird part is that instance_type='a1.small' was generated by SageMaker Autopilot in the first place.
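A sketch of the corrected call, assuming your account has quota for the chosen type (ml.m5.xlarge here is only an example):
pipeline_model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge',  # must be a supported ml.* hosting instance type
    endpoint_name=pipeline_model.name,
    wait=True,
)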

Updating Sagemaker Endpoint with new Endpoint Configuration

I'm a bit confused about automating SageMaker model retraining.
Currently I have a notebook instance with a SageMaker LinearLearner model performing a classification task. Using an Estimator I run training, then deploy the model, which creates an endpoint. I then use a Lambda function to invoke this endpoint and put it behind API Gateway, which gives me an API endpoint that accepts POST requests and sends back a response with the class.
Now I'm facing the problem of retraining. For that I use a serverless approach, with a Lambda function that reads environment variables for the training jobs. The problem is that SageMaker does not allow you to overwrite a training job; you can only create a new one. My goal is to automate the step where the new training job and the new endpoint configuration are applied to the existing endpoint, so that I don't need to change anything in API Gateway. Is it somehow possible to automatically attach a new endpoint config to an existing endpoint?
Thanks
Yes, use the UpdateEndpoint API. However, if you are using the Python SageMaker SDK, be aware that there might be some documentation floating around asking you to call
model.deploy(..., update_endpoint=True)
This is apparently now deprecated in v2 of the SageMaker SDK:
You should instead use the Predictor class to perform this update:
from sagemaker.predictor import Predictor

predictor = Predictor(
    endpoint_name="YOUR-ENDPOINT-NAME",
    sagemaker_session=sagemaker_session_object,
)
predictor.update_endpoint(instance_type="ml.t2.large", initial_instance_count=1)
If I am understanding the question correctly, you should be able to use CreateEndpointConfig near the end of the training job, then use UpdateEndpoint:
Deploys the new EndpointConfig specified in the request, switches to using newly created endpoint, and then deletes resources provisioned for the endpoint using the previous EndpointConfig (there is no availability loss).
If the API Gateway / Lambda is routed via the endpoint ARN, that should not change after using UpdateEndpoint.
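A minimal boto3 sketch of that flow (all names are illustrative; the model "my-model-v2" is assumed to have already been created from the new training job's artifacts):
import boto3

sm = boto3.client('sagemaker')

# Register a new endpoint config pointing at the retrained model.
sm.create_endpoint_config(
    EndpointConfigName='my-endpoint-config-v2',
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': 'my-model-v2',  # hypothetical model from the new training job
        'InstanceType': 'ml.m5.xlarge',
        'InitialInstanceCount': 1,
    }],
)

# Swap the existing endpoint to the new config; the endpoint name
# (and hence the Lambda / API Gateway wiring) stays the same.
sm.update_endpoint(
    EndpointName='my-existing-endpoint',
    EndpointConfigName='my-endpoint-config-v2',
)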

TensorFlow - S3 object does not exist

How do I set up direct private bucket access for Tensorflow?
After running:
from tensorflow.python.lib.io import file_io

print(file_io.stat('s3://my/private/bucket/file.json'))
I end up with an error:
NotFoundError: Object s3://my/private/bucket/file.json does not exist
However, the same line on a public object works without an error:
print(file_io.stat('s3://ryft-public-sample-data/wikipedia-20150518.bin'))
There appears to be an article on support here: https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/s3.md
However, I end up with the same error after exporting the variables shown.
I have the AWS CLI set up with all credentials, and boto3 can view and download the file in question. I am wondering how I can give TensorFlow direct S3 access when the bucket is private.
I had the same problem when trying to access files in a private S3 bucket from a SageMaker notebook. The mistake I made was trying to use credentials obtained from boto3, which don't seem to be valid outside that session.
The solution was not to specify credentials at all (in that case the role attached to the machine is used) and instead just set the region name (for some reason it wasn't read from the ~/.aws/config file), as follows:
import boto3
import os

session = boto3.Session()
os.environ['AWS_REGION'] = session.region_name
NOTE: when debugging this error, it was useful to look at the CloudWatch logs, as the S3 client's logs were printed only there and not in the Jupyter notebook.
There I first saw that:
When I did specify the credentials from boto3, the error was: The AWS Access Key Id you provided does not exist in our records.
When accessing without the AWS_REGION env variable set, I got: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint. This is apparently common when the bucket's region is not specified (see "301 Moved Permanently after S3 uploading").
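Putting it together, a minimal sketch of the working access pattern (relying on the attached role for credentials; the private object path is the one from the question):
import os

import boto3
from tensorflow.python.lib.io import file_io

# No explicit credentials: the role attached to the machine is used.
# Only the region needs to be set explicitly.
os.environ['AWS_REGION'] = boto3.Session().region_name
print(file_io.stat('s3://my/private/bucket/file.json'))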

SageMaker example access denied

I am running the k-means example in SageMaker:
from sagemaker import KMeans

data_location = 's3://{}/kmeans_highlevel_example/data'.format(bucket)
output_location = 's3://{}/kmeans_example/output'.format(bucket)

kmeans = KMeans(role=role,
                train_instance_count=2,
                train_instance_type='ml.c4.8xlarge',
                output_path=output_location,
                k=10,
                data_location=data_location)
When I run this cell, an access denied error appears:
%%time
kmeans.fit(kmeans.record_set(train_set[0]))
The error returned is:
ClientError: An error occurred (AccessDenied) when calling the PutObject operation: Access Denied
I also read other questions, but their answers do not solve my problem.
Would you please look at my case?
To be able to train a job in SageMaker, you need to pass in an AWS IAM role that allows SageMaker to access your S3 bucket.
The error means that SageMaker does not have permissions to write files in the bucket that you specified.
You can find the permissions that you need to add to your role here: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html#sagemaker-roles-createtrainingjob-perms
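As an illustrative sketch (the bucket and role names are hypothetical, and the actions are a minimal subset; see the linked docs for the full list), such permissions can be attached with boto3:
import json

import boto3

iam = boto3.client('iam')

# Hypothetical names; scope the resource to your own bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-sagemaker-bucket",
            "arn:aws:s3:::my-sagemaker-bucket/*",
        ],
    }],
}
iam.put_role_policy(
    RoleName="MySageMakerExecutionRole",
    PolicyName="SageMakerS3Access",
    PolicyDocument=json.dumps(policy),
)
For KMS-encrypted buckets, kms:Decrypt and kms:GenerateDataKey would also be needed, as noted below.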
Another thing to consider: if you are using an encrypted bucket that requires KMS decryption, make sure to also include KMS-related permissions.
I've noticed that sometimes the error shown is PutObject operation: Access Denied while the failure is actually KMS-related.
I faced the same problem: my SageMaker notebook instance wasn't able to read or write files in my S3 bucket. The first step of troubleshooting is locating the role for your SageMaker notebook instance, which you can find in the instance's details in the console.
Then go to that specific role in IAM and attach another policy to it.
I attached AmazonS3FullAccess, but you can create a custom policy.
I was getting confused because I was logged in as the admin user. However, when you work from a SageMaker notebook instance, your user's policies/roles are not used to perform actions; the instance's execution role is.
In my case, I had simply forgotten to change the S3 bucket name from the default given in the example to something unique.
