Serve TensorFlow models in parallel with Ray - python

I was looking at this StackOverflow thread on using ray.serve to have a saved TF model predict in parallel:
https://stackoverflow.com/a/62459372
I tried something similar with the following:
import ray
from ray import serve; serve.init()
import tensorflow as tf

class A:
    def __init__(self):
        self.model = tf.constant(1.0)  # dummy example

    @serve.accept_batch
    def __call__(self, *, input_data=None):
        print(input_data)  # test if method is entered
        # do stuff, serve model

if __name__ == '__main__':
    serve.create_backend("tf", A,
        # configure resources
        ray_actor_options={"num_cpus": 2},
        # configure replicas
        config={
            "num_replicas": 2,
            "max_batch_size": 24,
            "batch_wait_timeout": 0.1
        }
    )
    serve.create_endpoint("tf", backend="tf")
    handle = serve.get_handle("tf")

    args = [1, 2, 3]
    futures = [handle.remote(input_data=i) for i in args]
    result = ray.get(futures)
However, I get the following error:
TypeError: __call__() takes 1 positional argument but 2 positional arguments (and 1 keyword-only argument) were given.
There's something wrong with the arguments passed into __call__. This seems like a simple mistake; how should I change the args array so that the __call__ method is actually entered?

The API was updated for Ray 1.0. Please see the migration guide: https://gist.github.com/simon-mo/6d23dfed729457313137aef6cfbc7b54
For the specific code sample you posted, you can update it to:
import ray
from ray import serve
import tensorflow as tf

class A:
    def __init__(self):
        self.model = tf.constant(1.0)  # dummy example

    @serve.accept_batch
    def __call__(self, requests):
        for req in requests:
            print(req.data)  # test if method is entered
            # do stuff, serve model

if __name__ == '__main__':
    client = serve.start()
    client.create_backend("tf", A,
        # configure resources
        ray_actor_options={"num_cpus": 2},
        # configure replicas
        config={
            "num_replicas": 2,
            "max_batch_size": 24,
            "batch_wait_timeout": 0.1
        }
    )
    client.create_endpoint("tf", backend="tf")
    handle = client.get_handle("tf")

    args = [1, 2, 3]
    futures = [handle.remote(i) for i in args]
    result = ray.get(futures)
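If I recall the batching contract correctly, a callable decorated with serve.accept_batch is expected to return one result per request in the batch, and Serve maps each result back to the caller that submitted it. A small sketch of that (illustrative only; the doubling is arbitrary):

    @serve.accept_batch
    def __call__(self, requests):
        # return a list with exactly one entry per request in the batch
        return [req.data * 2 for req in requests]

With something like this in place, ray.get(futures) for args = [1, 2, 3] should come back as [2, 4, 6].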

Related

Python 3. How to access the dictionary of a running module from another module?

I have a working Flask server on which NLP models are run. While 5 devices were connected to it there were no problems, but when we increased this number to 15 a performance problem appeared. We increased the number of cores on this server from 12 to 24, but only got a performance increase of around 20%. We then tried another approach: we launched 3 servers via Gunicorn and taskset --cpu-list (giving 8 cores to each server) on different ports, with a limit of 5 devices per server. Here the performance almost doubled and became roughly the same as on the 12-core server with 5 devices. But then another problem arose: each server loads its models in isolation from the others, so RAM consumption tripled.
The models are loaded into a dictionary by this code:
from flask import Flask
import torch
import torch.nn as nn
import torchvision
import os
import learning

app = Flask(__name__)
models_list_name = os.listdir('models')
global_dict_models = {}

for token in models_list_name:
    try:
        tok, n_class, model_type, typ, h_layer = token.split('.')
        t, model = learning.get_model(model_type, typ)
        n_class = int(n_class)
        h_layer = int(h_layer)
        model = learning.NewModel(model, n_class, typ, h_layer)
        model.load_state_dict(torch.load(os.path.join('models', token), map_location=torch.device('cpu')))
        print(model, tok)
        with open(tok, 'r') as f:
            label = f.read().split('\n')
        print(label)
        global_dict_models[tok] = (t, model, label)
    except ValueError:
        pass
and when a prediction is needed, the request simply accesses the dictionary:
t, model, labels = global_dict_models[token]
x = t.encode_plus(text, add_special_tokens=True, max_length=512, truncation=True, padding="max_length", return_tensors='pt')
output = torch.sigmoid(model(x['input_ids'].squeeze(1), x['attention_mask'])).detach().cpu().numpy()
The models on all servers are the same. I want to make a single module that loads the models and builds the dictionary, so that each server simply accesses it through a function.
I tried to create a separate module along these lines:
from flask import Flask, request, make_response
import torch
import torch.nn as nn
import torchvision
import os
import learning

app = Flask(__name__)
models_list_name = os.listdir('models')
global_dict_models = {}

def load_models():
    for token in models_list_name:
        try:
            tok, n_class, model_type, typ, h_layer = token.split('.')
            t, model = learning.get_model(model_type, typ)
            n_class = int(n_class)
            h_layer = int(h_layer)
            model = learning.NewModel(model, n_class, typ, h_layer)
            model.load_state_dict(torch.load(os.path.join('models', token), map_location=torch.device('cpu')))
            print(model, tok)
            with open(tok, 'r') as f:
                label = f.read().split('\n')
            print(label)
            global_dict_models[tok] = (t, model, label)
        except ValueError:
            pass

def dict_models(token):
    t, model, labels = global_dict_models[token]
    return t, model, labels

if __name__ == '__main__':
    load_models()
    print('Start')
    app.run(host="0.0.0.0", port="5000", threaded=True, processes=1)
But then the problem arose that the dictionary built in this module could not be read from the other modules.
How can I load the dictionary once and have access to it from the other modules? Or how can I solve this problem in another way?
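One possible direction (a rough sketch only, not a verified answer): build the dictionary once at import time and start Gunicorn with --preload, so the app, including the loaded models, is created in the master process and then shared copy-on-write by the forked workers. This assumes the models are only read after loading; the route and token below are hypothetical placeholders, the rest reuses names from the question.

from flask import Flask
import os
import torch
import learning

app = Flask(__name__)
global_dict_models = {}

def load_models():
    # same loading loop as in the question, filling global_dict_models
    for token in os.listdir('models'):
        try:
            tok, n_class, model_type, typ, h_layer = token.split('.')
            t, model = learning.get_model(model_type, typ)
            model = learning.NewModel(model, int(n_class), typ, int(h_layer))
            model.load_state_dict(torch.load(os.path.join('models', token),
                                             map_location=torch.device('cpu')))
            with open(tok, 'r') as f:
                label = f.read().split('\n')
            global_dict_models[tok] = (t, model, label)
        except ValueError:
            pass

# run at import time so the dictionary already exists in the Gunicorn master
# process before the workers are forked
load_models()

@app.route('/predict/<token>')  # hypothetical route
def predict(token):
    # look up the already-loaded model instead of loading it again
    t, model, labels = global_dict_models[token]
    # ... run the same prediction code as in the question ...
    return 'ok'

# Hypothetical launch command: load once, then fork workers that share memory:
#   gunicorn --preload -w 3 -b 0.0.0.0:5000 app:app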

predict() got an unexpected keyword argument 'stats'

I am trying to get predictions from a TensorFlow custom prediction routine served on AI Platform.
I have managed to serve it with the following settings: --runtime-version 2.3 --python-version 3.7 --machine-type mls1-c4-m2
But I keep getting this error when I try to make any predictions:
ERROR:root:Prediction failed: predict() got an unexpected keyword argument 'stats'
ERROR:root:Prediction failed: unknown error.
The routine has two steps:
Takes the input (a string) and transforms it into an embedding using a bow model in .pkl format
Uses the embedding to get predictions from a Keras model saved as an .h5 file
This is my setup.py:
from setuptools import setup, find_packages

REQUIRED_PACKAGES = ['Keras==2.3.1', 'sklearn==0.0', 'h5py<3.0.0', 'numpy==1.16.0', 'scipy==1.4.1', 'pyyaml==5.2']

setup(
    name='my_custom_code',
    version='0.1',
    scripts=['predictor.py'],
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=False,
    description=''
)
And this is my predictor.py
import os
import pickle
import tensorflow as tf
import numpy as np

class MyPredictor(object):
    def __init__(self, model, bow_model):
        self._model = model
        self._bow_model = bow_model

    def predict(self, instances):
        outputs = []
        for x in instances:
            vector = self.embedding(x)
            output = self._model.predict(vector)
            outputs.append(output)
        return outputs

    def embedding(self, statement):
        vector = self._bow_model.transform(statement).toarray()
        vector = vector.to_list()
        return vector

    @classmethod
    def from_path(cls, model_dir):
        model_path = os.path.join(model_dir, 'model.h5')
        model = tf.keras.models.load_model(model_path, compile=False)
        preprocessor_path = os.path.join(model_dir, 'bow.pkl')
        with open(preprocessor_path, 'rb') as f:
            bow_model = pickle.load(f)
        return cls(model, bow_model)
The script I'm using for testing is:
import googleapiclient.discovery

instances = ['test', 'test']

service = googleapiclient.discovery.build('ml', 'v1')
name = 'projects/{}/models/{}/versions/{}'.format(PROJECT_ID, MODEL_NAME, VERSION_NAME)

response = service.projects().predict(
    name=name,
    body={'instances': instances}
).execute()

if 'error' in response:
    raise RuntimeError(response['error'])
else:
    print(response['predictions'])
According to the custom prediction routine documentation, when creating the predictor class, the predict() method should be defined with self, instances, **kwargs arguments to properly handle the prediction request:
instances: A list of prediction input instances.
**kwargs: A dictionary of keyword args provided as additional fields on the predict request body.
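Based on that, a minimal sketch of the fix is to add **kwargs to the predict() signature so that extra request fields such as stats are accepted (and simply ignored here):

    def predict(self, instances, **kwargs):
        outputs = []
        for x in instances:
            vector = self.embedding(x)
            output = self._model.predict(vector)
            outputs.append(output)
        return outputs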

Why do I get PicklingError while parallelizing tasks using Ray in Python?

I am trying to run a machine learning prediction job in parallel on a huge pandas dataframe. It seems like ray is a nice package for multiprocessing in Python. This is the code:
import joblib
import numpy as np
import pandas as pd
import ray

model_path = './models/lr.pkl'
df = pd.read_csv('./data/input.csv')
dfs = np.array_split(df, 4)
features = ['item_text', 'description', 'amount']

ray.init()

@ray.remote
def predict(model_path, df, features):
    model = joblib.load(model_path)
    pred_df = model.predict(df[features])
    return pred_df

result_ids = []
for i in range(4):
    result_ids.append(predict.remote(model_path, dfs[i], features))

results = ray.get(result_ids)
When I ran it, I got the following error:
PicklingError: args[0] from __newobj__ args has the wrong class
I take it args[0] refers to model_path. It is just a string, why wrong class? What am I missing?
It turns out the remote function can't take more than two arguments. After I tupled the two static arguments, it worked.
@ray.remote
def predict(df, args):
    model_path, features = args
    model = joblib.load(model_path)
    pred_df = model.predict(df[features])
    return pred_df

args = (model_path, features)
result_ids = []
for i in range(4):
    result_ids.append(predict.remote(dfs[i], args))
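As a side note (not part of the original fix), one way to avoid re-loading the model inside every task is to load it once per worker in a Ray actor; a rough sketch under the same assumptions as above (model_path, dfs and features as defined earlier):

import joblib
import ray

@ray.remote
class Predictor:
    def __init__(self, model_path, features):
        # the model is loaded once when the actor starts, not once per call
        self.model = joblib.load(model_path)
        self.features = features

    def predict(self, df):
        return self.model.predict(df[self.features])

# one actor per chunk; each actor keeps its own loaded copy of the model
predictors = [Predictor.remote(model_path, features) for _ in range(4)]
result_ids = [p.predict.remote(dfs[i]) for i, p in enumerate(predictors)]
results = ray.get(result_ids)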

How to deploy a simple neural network from A-Z in MXNet

I am trying to build a simple neural network in MXNet and deploy it on a server using mxnet-model-server.
The biggest issue is deploying the model: the model server crashes after loading the .mar file, but I have no idea what the problem could be.
I used the following code to create a custom (but very simple) neural network for testing:
from __future__ import print_function
import numpy as np
import mxnet as mx
from mxnet import nd, autograd, gluon

data_ctx = mx.cpu()
model_ctx = mx.cpu()

# fix the seed
np.random.seed(42)
mx.random.seed(42)

num_examples = 1000

X = mx.random.uniform(shape=(num_examples, 49))
y = mx.random.uniform(shape=(num_examples, 1))
dataset_train = mx.gluon.data.dataset.ArrayDataset(X, y)
dataset_test = dataset_train

data_loader_train = mx.gluon.data.DataLoader(dataset_train, batch_size=25)
data_loader_test = mx.gluon.data.DataLoader(dataset_test, batch_size=25)

num_outputs = 2
net = gluon.nn.HybridSequential()
net.hybridize()
with net.name_scope():
    net.add(gluon.nn.Dense(49, activation="relu"))
    net.add(gluon.nn.Dense(64, activation="relu"))
    net.add(gluon.nn.Dense(num_outputs))

net.collect_params().initialize(mx.init.Normal(sigma=.1), ctx=model_ctx)
softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': .01})

epochs = 1
smoothing_constant = .01

for e in range(epochs):
    cumulative_loss = 0
    for i, (data, label) in enumerate(data_loader_train):
        data = data.as_in_context(model_ctx).reshape((-1, 49))
        label = label.as_in_context(model_ctx)
        with autograd.record():
            output = net(data)
            loss = softmax_cross_entropy(output, label)
        loss.backward()
        trainer.step(data.shape[0])
        cumulative_loss += nd.sum(loss).asscalar()
Then I exported the model using:
net.export("model_files/my_project")
The result is a .json and a .params file.
I created a signature.json:
{
  "inputs": [
    {
      "data_name": "data",
      "data_shape": [
        1,
        49
      ]
    }
  ]
}
The model handler is the same from the mxnet tutorial:
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
# Licensed under the Apache License, Version 2.0 (the "License").
# You may not use this file except in compliance with the License.
# A copy of the License is located at
#     http://www.apache.org/licenses/LICENSE-2.0
# or in the "license" file accompanying this file. This file is distributed
# on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
# express or implied. See the License for the specific language governing
# permissions and limitations under the License.

"""
ModelHandler defines a base model handler.
"""

import logging
import time


class ModelHandler(object):
    """
    A base Model handler implementation.
    """

    def __init__(self):
        self.error = None
        self._context = None
        self._batch_size = 0
        self.initialized = False

    def initialize(self, context):
        """
        Initialize model. This will be called during model loading time
        :param context: Initial context contains model server system properties.
        :return:
        """
        self._context = context
        self._batch_size = context.system_properties["batch_size"]
        self.initialized = True

    def preprocess(self, batch):
        """
        Transform raw input into model input data.
        :param batch: list of raw requests, should match batch size
        :return: list of preprocessed model input data
        """
        assert self._batch_size == len(batch), "Invalid input batch size: {}".format(len(batch))
        return None

    def inference(self, model_input):
        """
        Internal inference methods
        :param model_input: transformed model input data
        :return: list of inference output in NDArray
        """
        return None

    def postprocess(self, inference_output):
        """
        Return predict result in batch.
        :param inference_output: list of inference output
        :return: list of predict results
        """
        return ["OK"] * self._batch_size

    def handle(self, data, context):
        """
        Custom service entry point function.
        :param data: list of objects, raw input from request
        :param context: model server context
        :return: list of outputs to be send back to client
        """
        self.error = None  # reset earlier errors
        try:
            preprocess_start = time.time()
            data = self.preprocess(data)
            inference_start = time.time()
            data = self.inference(data)
            postprocess_start = time.time()
            data = self.postprocess(data)
            end_time = time.time()

            metrics = context.metrics
            metrics.add_time("PreprocessTime", round((inference_start - preprocess_start) * 1000, 2))
            metrics.add_time("InferenceTime", round((postprocess_start - inference_start) * 1000, 2))
            metrics.add_time("PostprocessTime", round((end_time - postprocess_start) * 1000, 2))
            return data
        except Exception as e:
            logging.error(e, exc_info=True)
            request_processor = context.request_processor
            request_processor.report_status(500, "Unknown inference error")
            return [str(e)] * self._batch_size
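Note that the class above is only the base handler. For the ssd_service:handle entry point used by the archiver command below to resolve, there would normally also be a file named ssd_service.py alongside the model files exposing a module-level handle() function. A minimal sketch of that wiring (a real service would usually subclass ModelHandler and implement preprocess/inference/postprocess rather than use the base class directly; the module name model_handler is an assumption about where the class above lives):

# ssd_service.py (hypothetical wiring for --handler ssd_service:handle)
from model_handler import ModelHandler  # assuming the base class above is in model_handler.py

_service = ModelHandler()

def handle(data, context):
    # initialize lazily on the first request, then delegate to the handler
    if not _service.initialized:
        _service.initialize(context)
    if data is None:
        return None
    return _service.handle(data, context)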
Then I created the .mar file using:
model-archiver --model-name my_project --model-path my_project --handler ssd_service:handle
Starting the model on the server:
mxnet-model-server --start --model_store my_project --models ssd=my_project.mar
I literally followed every tutorial on:
https://github.com/awslabs/mxnet-model-server
However, the server is crashing: the worker dies, the backend worker dies, workers are disconnected, and I get Load model failed: ssd, error: worker died.
I have absolutely no clue what to do so I would be very glad if you helped me out!
Best
I tried out your code and it works fine on my laptop. If I run: curl -X POST http://127.0.0.1:8080/predictions/ssd -F "data=[0 1 2 3 4]", I get: OK%
I can only guess why it doesn't work on your machine:
Notice that the model-store argument should be written with - and not with _ as it is in your example. My command to run mxnet-model-server looks like this: mxnet-model-server --start --model-store ./ --models ssd=my_project.mar
Which version of mxnet-model-server do you use? The latest is 1.0.2, but I have 1.0.1 installed, so maybe you want to downgrade and try it out: pip install mxnet-model-server==1.0.1.
Same question for the MXNet version. In my case I use a nightly build, which I get via pip install mxnet --pre. I see that your model is very basic, so it shouldn't depend on much... Nevertheless, install 1.4.0 (the current one) just in case.
I'm not sure, but I hope it will help you.

IPython cluster and PicklingError

My problem seems to be similar to this thread; however, while I think I am following the advised method, I still get a PicklingError. When I run my process locally, without sending it to an IPython cluster engine, the function works fine.
I am using zipline with IPython's notebook, so I first create a class based on zipline.TradingAlgorithm
Cell [ 1 ]
from IPython.parallel import Client
rc = Client()
lview = rc.load_balanced_view()
Cell [ 2 ]
%%px --local # This ensures that the Class and modules exist on each engine
import zipline as zpl
import numpy as np

class Agent(zpl.TradingAlgorithm):  # must define initialize and handle_data methods
    def initialize(self):
        self.valueHistory = None
        pass

    def handle_data(self, data):
        for security in data.keys():
            ## Just randomly buy/sell/hold for each security
            coinflip = np.random.random()
            if coinflip < .25:
                self.order(security, 100)
            elif coinflip > .75:
                self.order(security, -100)
        pass
Cell [ 3 ]
from zipline.utils.factory import load_from_yahoo
start = '2013-04-01'
end = '2013-06-01'
sidList = ['SPY', 'GOOG']
data = load_from_yahoo(stocks=sidList, start=start, end=end)

agentList = []
for i in range(3):
    agentList.append(Agent())

def testSystem(agent, data):
    results = agent.run(data)  #-- This is how the zipline based class is executed
    #-- next I'm just storing the final value of the test so I can plot later
    agent.valueHistory.append(results['portfolio_value'][len(results['portfolio_value']) - 1])
    return agent

for i in range(10):
    tasks = []
    for agent in agentList:
        #agent = testSystem(agent,data) ## On its own, this works!
        #-- To Test, uncomment the above line and comment out the next two
        tasks.append(lview.apply_async(testSystem, agent, data))
    agentList = [ar.get() for ar in tasks]

for agent in agentList:
    plot(agent.valueHistory)
Here is the Error produced:
PicklingError                             Traceback (most recent call last)
/Library/Python/2.7/site-packages/IPython/kernel/zmq/serialize.pyc in serialize_object(obj, buffer_threshold, item_threshold)
    100         buffers.extend(_extract_buffers(cobj, buffer_threshold))
    101
--> 102     buffers.insert(0, pickle.dumps(cobj,-1))
    103     return buffers
    104

PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
If I override the run() method from zipline.TradingAlgorithm with something like:
def run(self, data):
    return 1
then the passing off to the engines works, but obviously the guts of the test are not performed. Trying something like this...
def run(self, data):
    return zpl.TradingAlgorithm.run(self, data)
...results in the same PicklingError. As run is a method internal to zipline.TradingAlgorithm and I don't know everything that it does, how would I make sure it is passed through?
It looks like the zipline TradingAlgorithm object is not pickleable after it has been run:
import pickle
import zipline as zpl

class Agent(zpl.TradingAlgorithm):  # must define initialize and handle_data methods
    def handle_data(self, data):
        pass

agent = Agent()
pickle.dumps(agent)[:32]  # ok

agent.run(data)
pickle.dumps(agent)[:32]  # fails
But this suggests to me that you should be creating the Agents on the engines, and only passing data / results back and forth (ideally, not passing data across at all, or at most once).
Minimizing data transfers might look something like this:
define the class:
%%px
import zipline as zpl
import numpy as np

class Agent(zpl.TradingAlgorithm):  # must define initialize and handle_data methods
    def initialize(self):
        self.valueHistory = []

    def handle_data(self, data):
        for security in data.keys():
            ## Just randomly buy/sell/hold for each security
            coinflip = np.random.random()
            if coinflip < .25:
                self.order(security, 100)
            elif coinflip > .75:
                self.order(security, -100)
load the data
%%px
from zipline.utils.factory import load_from_yahoo
start = '2013-04-01'
end = '2013-06-01'
sidList = ['SPY','GOOG']
data = load_from_yahoo(stocks=sidList,start=start,end=end)
agent = Agent()
and run the code:
from IPython import parallel  # needed for parallel.Reference below

def testSystem(agent, data):
    results = agent.run(data)  #-- This is how the zipline based class is executed
    #-- next I'm just storing the final value of the test so I can plot later
    agent.valueHistory.append(results['portfolio_value'][len(results['portfolio_value']) - 1])

# create references to the remote agent / data objects
agent_ref = parallel.Reference('agent')
data_ref = parallel.Reference('data')

tasks = []
for i in range(10):
    for j in range(len(rc)):
        tasks.append(lview.apply_async(testSystem, agent_ref, data_ref))

# wait for the tasks to complete
[t.get() for t in tasks]
And plot the results, never fetching the agents themselves
%matplotlib inline
import matplotlib.pyplot as plt

for history in rc[:].apply_async(lambda: agent.valueHistory):
    plt.plot(history)
This is not quite the same code you shared - three agents bouncing back and forth across all your engines, whereas this has one agent per engine. I don't know enough about zipline to say whether that's useful to you or not.
