I uploaded my model to ML-engine and when trying to make a prediction I receive the following error:
ERROR: (gcloud.ml-engine.predict) HTTP request failed. Response: { "error": {
"code": 429,
"message": "Prediction server is out of memory, possibly because model size is too big.",
"status": "RESOURCE_EXHAUSTED" } }
My model size is 151.1 MB. I have already taken all the actions suggested on the Google Cloud website, such as quantization. Is there a possible solution, or anything else I could do to make it work?
Thanks
Typically a model of this size should not result in OOM. Since TF does a lot of lazy initialization, some OOMs won't be detected until the first request triggers initialization of the data structures. In rare cases a graph can blow up to 10x its size in memory, causing OOM.
1) Did you see the prediction error consistently? Due to the way TensorFlow schedules nodes, the memory usage for the same graph can differ across runs. Make sure to run the prediction multiple times and check whether it returns 429 every time.
2) Please make sure 151.1MB is the size of your SavedModel Directory.
3) You can also debug the peak memory locally, for instance by watching top while running gcloud ml-engine local predict, or by loading the model into memory in a Docker container and monitoring it with docker stats or some other memory-usage tool (see the sketch after this list). You can also try TensorFlow Serving for debugging (https://www.tensorflow.org/serving/serving_basic) and post the results.
4) If you find the memory problem is persistent, please contact cloudml-feedback@google.com for further assistance. Make sure you include your project number and the associated account for further debugging.
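For step 3, a rough way to check how much memory just loading the model takes locally is sketched below (a minimal sketch, assuming TensorFlow 2.x and psutil are available; the SavedModel path is a placeholder):

import os
import psutil
import tensorflow as tf

process = psutil.Process(os.getpid())
print("RSS before load: %.1f MB" % (process.memory_info().rss / 1e6))

# Replace with the path to your SavedModel directory.
model = tf.saved_model.load("export_dir")

print("RSS after load: %.1f MB" % (process.memory_info().rss / 1e6))

If the RSS jump is far larger than the 151.1 MB on disk, that points to the kind of graph blow-up mentioned above.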
I have built an anomaly detection model using AWS SageMaker inbuilt model: random cut forest.
from sagemaker import RandomCutForest

rcf = RandomCutForest(
    role=execution_role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    num_samples_per_tree=1000,
    num_trees=100,
    encrypt_inter_container_traffic=True,
    enable_network_isolation=True,
    enable_sagemaker_metrics=True,
)
and created the endpoint:
rcf_inference = rcf.deploy(
    initial_instance_count=4,
    instance_type="ml.m5.xlarge",
    endpoint_name='RCF-container2',
    enable_network_isolation=True,
)
But when I try to get a prediction from the endpoint, I run into the following error:
results = rcf_inference.predict(df.values)
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from model with message "Amazon SageMaker could not get a response from the RCF-container2 endpoint. This can occur when CPU or memory utilization is high. To check your utilization, see Amazon CloudWatch. To fix this problem, use an instance type with more CPU capacity or memory."
I have tried a larger CPU instance, but I am still getting the same issue, so I suspect the problem is functional rather than a resource limit.
Please help.
I would suggest checking the CloudWatch Logs to see if there is any other error that could point to the issue.
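For example, here is a minimal sketch (not part of the original answer; it assumes boto3 credentials and that the endpoint's log group follows the usual /aws/sagemaker/Endpoints/<endpoint-name> convention) of pulling the most recent log events for the endpoint:

import boto3

logs = boto3.client("logs")
log_group = "/aws/sagemaker/Endpoints/RCF-container2"

# Grab the most recently active log streams for the endpoint.
streams = logs.describe_log_streams(
    logGroupName=log_group,
    orderBy="LastEventTime",
    descending=True,
    limit=5,
)["logStreams"]

for stream in streams:
    events = logs.get_log_events(
        logGroupName=log_group,
        logStreamName=stream["logStreamName"],
        limit=50,
    )["events"]
    for event in events:
        print(event["message"])

Any error logged by the model container itself will usually show up here before the generic "could not get a response" message.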
I work for AWS and my opinions are my own.
I have created a Flask API in Python and deployed it as a container image on GCP Cloud Run, triggered through Cloud Scheduler. In my code I am reading a large dataset (15 million rows and 20 columns) from BigQuery. I have set my system config to 8 GB RAM and 4 CPUs.
Problem 1: It is taking too much time to read the data (about 2200 seconds).
import numpy as np
import pandas as pd
from pandas.io import gbq

query = """ SELECT * FROM TABLE_SALES"""
df = gbq.read_gbq(query, project_id="project_name")
Is there a more efficient way to read the data from BQ?
Problem 2: My code stops working after reading the data. When I checked the logs, I got this:
error - 503
textPayload: "The request failed because either the HTTP response was malformed or connection to the instance had an error.
While handling this request, the container instance was found to be using too much memory and was terminated. This is likely to cause a new container instance to be used for the next request to this revision. If you see this message frequently, you may have a memory leak in your code or may need more memory. Consider creating a new revision with more memory."
One workaround is to increase the system configuration; if that is the solution, please let me know roughly what it would cost.
You can try a GCP Dataflow batch job to read large data from BQ.
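A minimal sketch of what such a batch read could look like with Apache Beam (illustrative only; the project, dataset, region, and bucket names are placeholders, and apache-beam[gcp] is assumed to be installed):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="project_name",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
            query="SELECT * FROM `project_name.dataset.TABLE_SALES`",
            use_standard_sql=True,
        )
        # Replace with whatever per-row processing you need.
        | "Process" >> beam.Map(lambda row: row)
    )

This moves the heavy read out of the Cloud Run container entirely.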
Depending on the complexity of your BigQuery query, you may want to consider the high-performance BigQuery Storage API: https://cloud.google.com/bigquery/docs/reference/storage/libraries
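For instance, a minimal sketch (assuming google-cloud-bigquery and google-cloud-bigquery-storage are installed; the project and table names are the placeholders from the question):

from google.cloud import bigquery

client = bigquery.Client(project="project_name")

# With google-cloud-bigquery-storage installed, to_dataframe() streams the
# result through the BigQuery Storage API, which is much faster than the
# row-by-row REST download that pandas.io.gbq uses.
df = client.query("SELECT * FROM TABLE_SALES").to_dataframe()
print(df.shape)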
I just started using AWS SageMaker for running and maintaining models and experiments. I just wanted to know: is there any persistent layer for SageMaker from which I can get the data for my experiments/models instead of looking in SageMaker Studio? Does SageMaker save the experiments or their data (such as S3 locations) in any table, something like ModelDB?
SageMaker Studio uses the SageMaker API to pull all of the data it displays. Essentially, there is no secret API being invoked here.
Quite a bit of what is displayed for experiments comes from Search results, with the rest coming from List* or Describe* calls. Studio takes the results from the Search request and displays them in the table format you are seeing. Search results over the ExperimentTrialComponent resource that have a source (such as a training job) are enhanced with the original source's data ([result]::SourceDetail::TrainingJob) where supported (work is ongoing to add additional source-detail resource types).
All of the metadata that is related to resources in SageMaker is available via the APIs; there is no other location (in the cloud) like s3 for that data.
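For illustration, a minimal sketch (not from the original answer; it assumes boto3 credentials and uses a placeholder experiment name) of calling the same Search API over ExperimentTrialComponent:

import boto3

sm = boto3.client("sagemaker")

# Search for the trial components belonging to one experiment, which is
# essentially the call Studio issues behind the scenes.
response = sm.search(
    Resource="ExperimentTrialComponent",
    SearchExpression={
        "Filters": [
            {
                "Name": "Parents.ExperimentName",
                "Operator": "Equals",
                "Value": "my-experiment",  # placeholder name
            }
        ]
    },
    MaxResults=50,
)

for result in response["Results"]:
    tc = result["TrialComponent"]
    print(tc["TrialComponentName"], tc.get("Metrics", []))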
As of this time, there is no effort that I'm aware of to determine whether support for SageMaker could be added to ModelDB. Given that ModelDB appears to assume it is talking to a relational database, it seems unlikely to be doable. (I only skimmed the overview, so this might be inaccurate.)
I'm training a large-ish model and am trying to use the Azure Machine Learning service in Azure Notebooks for this purpose.
I thus create an Estimator to train locally:
from azureml.train.estimator import Estimator
estimator = Estimator(source_directory='./source_dir',
                      compute_target='local',
                      entry_script='train.py')
(my train.py should load and train starting from a large word vector file).
When running with
run = experiment.submit(config=estimator)
I get
TrainingException:
====================================================================
While attempting to take snapshot of
/data/home/username/notebooks/source_dir Your total
snapshot size exceeds the limit of 300.0 MB. Please see
http://aka.ms/aml-largefiles on how to work with large files.
====================================================================
The link provided in the error is likely broken.
Contents in my ./source_dir indeed exceed 300 MB.
How can I solve this?
You can place the training files outside source_dir so that they don't get uploaded as part of submitting the experiment, and then upload them separately to the data store (which is basically using the Azure storage associated with your workspace). All you need to do then is reference the training files from train.py.
See the Train model tutorial for an example of how to upload data to the data store and then access it from the training file.
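A minimal sketch of that workflow (illustrative only, assuming the classic azureml-sdk Estimator API used above; the local folder, target path, and script argument are placeholders):

from azureml.core import Workspace
from azureml.train.estimator import Estimator

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Upload the large word-vector files to the workspace datastore instead of
# keeping them inside source_dir, so they are not part of the snapshot.
datastore.upload(
    src_dir="./word_vectors",
    target_path="word_vectors",
    overwrite=True,
    show_progress=True,
)

# Hand the datastore location to train.py as a script argument.
estimator = Estimator(
    source_directory="./source_dir",
    compute_target="local",
    entry_script="train.py",
    script_params={"--data-folder": datastore.path("word_vectors").as_download()},
)

train.py then reads the word-vector file from the --data-folder path instead of from source_dir.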
After reading the GitHub issue "Encounter total Snapshot size 300MB while start logging" and the official document "Manage and request quotas for Azure resources" for the Azure ML service, I think this is an issue that will need some time for Azure to fix.
Meanwhile, I recommend that you try migrating the current work to Azure Databricks: upload your dataset and code, then run them in an Azure Databricks notebook, which is hosted on a Spark cluster, without worrying about memory or storage limits. You can refer to these samples for Azure ML on Azure Databricks.
I developed a machine learning model locally and wanted to deploy it as a web service using Azure Functions.
First, the model was serialized with the pickle module and then uploaded to the blob storage associated with the Azure Functions instance (Python, HTTP trigger).
Then, after installing all required modules, the following code was used to make a prediction on sample data:
import json
import os
import pickle
import time

# read the request payload written by the Functions runtime
postreqdata = json.loads(open(os.environ['req']).read())

# load the pickled model from the function's file system
with open('D:/home/site/wwwroot/functionName/model_test.pkl', 'rb') as txt:
    mod = txt.read()
model = pickle.loads(mod)

# score the feature values taken from the POST request
scs = [[postreqdata['var_1'], postreqdata['var_2']]]
prediction = model.predict_proba(scs)[0][1]

# write the prediction back as the HTTP response
response = open(os.environ['res'], 'w')
response.write(str(prediction))
response.close()
where predict_proba is a method of the trained model and the scs variable is defined to extract the particular values (the model's input variables) from the POST request.
The whole code works fine: predictions are sent as the response and the values are correct, but the execution after sending a request takes 150 seconds! (Locally it is less than 1 s.) Moreover, after measuring which part of the code takes so long, it turns out to be the pickle.loads(mod) call.
Do you have any idea why it takes such a huge amount of time? The model size is very small (a few kB).
Thanks
When a call is made to the AML service, the first call must warm up the container. By default a web service has 20 containers; each container is cold, and a cold container can cause a large (30 sec) delay.
I suggest referring to this thread, Azure Machine Learning Request Response latency, and this article about Azure ML performance.
Hope it helps you.