I created a server that should run ONNX Runtime sessions in parallel.
First question: should I use multiple threads or multiple processes?
I tried multi-threading with app.run(host='127.0.0.1', port='12345', threaded=True).
With 3 threads the GPU memory stays below 8 GB and the program runs. With 4 threads the GPU memory needed exceeds 8 GB and the program fails with: onnxruntime::CudaCall CUBLAS failure 3: CUBLAS_STATUS_ALLOC_FAILED.
I understand the problem is that the GPU runs out of memory, but I would like the program not to crash. So I tried to limit the number of threads by setting intra_op_num_threads = 2, inter_op_num_threads = 2 and os.environ["OMP_NUM_THREADS"] = "2", but none of that worked.
I also tried 'gpu_mem_limit', which didn't work either.
import onnxruntime as rt
from flask import Flask, request

app = Flask(__name__)

# The session is created once at import time
sess = rt.InferenceSession(model_XXX, providers=['CUDAExecutionProvider'])

@app.route('/algorithm', methods=['POST'])
def parser():
    prediction = sess.run(...)

if __name__ == '__main__':
    app.run(host='127.0.0.1', port='12345', threaded=True)
My understanding is that the Flask HTTP server may be using a different sess for each call.
How can I make each call use the same ONNX Runtime session?
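Not a definitive fix, but one way to approach this is to create the session once at module level (an ONNX Runtime session can be called from multiple threads), cap the CUDA memory arena through provider options, and bound the number of concurrent runs with a semaphore. A minimal sketch, assuming the provider-option tuple syntax of recent ONNX Runtime releases; the gpu_mem_limit value, the model path and the predict helper are illustrative placeholders:
import threading
import onnxruntime as rt

# Cap the CUDA memory arena (in bytes); 2 GiB is just an example value
cuda_options = {
    'gpu_mem_limit': 2 * 1024 * 1024 * 1024,
    'arena_extend_strategy': 'kSameAsRequested',
}

# One session, created once and shared by every Flask request thread
sess = rt.InferenceSession('model.onnx',
                           providers=[('CUDAExecutionProvider', cuda_options)])

# Allow at most two requests on the GPU at the same time
gpu_slots = threading.BoundedSemaphore(2)

def predict(input_feed):
    with gpu_slots:
        return sess.run(None, input_feed)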
System information
OS Platform and Distribution: Windows10
ONNX Runtime version: 1.8
Python version: python 3.7
GPU model and memory: RTX3070 - 8G
Related
I have a Python server application, which provides TensorFlow / Keras model inference services. Multiple different models can be loaded and used at the same time, for multiple different clients. A client can request to load another model, but this has no effect on the other clients (i.e. their models stay in memory and in use as they are, so each client can ask to load another model regardless of the state of any other client).
The logic and implementation works, however, I am not sure how to correctly free memory in this setup. When a client asks for a new model to load, then the previously loaded model will simply be deleted from memory (via the Python del command), then the new model is being loaded via tensorflow.keras.models.load_model().
From what I read in the Keras documentation one might want to clear a Keras session in order to free memory via calling tf.keras.backend.clear_session(). However, that seems to release all TF memory, which is a problem in my case, since other Keras models for other clients are still in use at the same time, as described above.
Moreover, it seems I cannot put every model into their own process, since I cannot access the single GPU from different running processes in parallel (or at all).
So in other words: when loading a new TensorFlow / Keras model while other models are also in memory and in use, how can I free the TF memory of the previously loaded model without interfering with the other currently loaded models?
When a TensorFlow session starts, it tries to allocate all of the available GPU memory. This is what prevents multiple processes from running sessions. The ideal way to stop this is to ensure that the TF session only allocates part of the memory. From the docs, there are two ways to do this (depending on your TF version).
The simple way (TF 2.2+):
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
for tf 2.0/2.1
import tensorflow as tf
tf.config.gpu.set_per_process_memory_growth(True)
for tf 1.* (allocate a fixed fraction of the memory per process, here roughly a third)
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
The other method is more controlled IMHO and scales better. It requires that you create logical devices and manually control placement for each of them.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Create two logical GPUs with 1 GB of memory each on the first physical GPU
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024),
             tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
Now you have to manually control placement using the with tf.device(...) block:
gpus = tf.config.experimental.list_logical_devices('GPU')
if gpus:
    # Replicate your computation on multiple GPUs
    c = []
    for gpu in gpus:
        with tf.device(gpu.name):
            a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
            b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
            c.append(tf.matmul(a, b))
    with tf.device('/CPU:0'):
        matmul_sum = tf.add_n(c)
    print(matmul_sum)
Using this you won't run out of memory and can run multiple processes at once.
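Tying this back to the multi-model question above, a minimal sketch (assuming the two 1 GB logical devices configured above, and hypothetical model files model_a.h5 / model_b.h5) would pin one Keras model to each logical GPU, so each model draws from its own device's memory pool:
import tensorflow as tf

logical_gpus = tf.config.experimental.list_logical_devices('GPU')

# Variables created inside a device scope are placed on that logical GPU
with tf.device(logical_gpus[0].name):
    model_a = tf.keras.models.load_model('model_a.h5')
with tf.device(logical_gpus[1].name):
    model_b = tf.keras.models.load_model('model_b.h5')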
You can fork a new worker process per customer. Each process executes its operations in an environment isolated from the others, which is the safer, more isolated way.
I created a basic scenario with two parts. The main part is responsible for starting, driving and killing the processes. The client part is responsible for executing operations ordered by the server; each client waits for orders via HTTP requests.
main.py
import subprocess
import sys
import requests

class ClientOperator:
    def __init__(self, name, port, model):
        self.name = name
        self.port = port
        self.proc = subprocess.Popen([sys.executable, 'client.py',
                                      f'--port={port}', f'--model={model}'])

    def process(self, a, b):
        response = requests.get(f'http://localhost:{self.port}/process',
                                params={'a': a, 'b': b}).json()
        print(f'{self.name} process {a} + {b} = {response}')

    def close(self):
        print(f'{self.name} is closing')
        self.proc.terminate()

customer1 = ClientOperator('John', 20001, 'model1.hdf5')
customer2 = ClientOperator('Oscar', 20002, 'model2.hdf5')

customer1.process(5, 10)
customer2.process(4, 6)

# stop customer1
customer1.close()
client.py
import argparse
from flask import Flask, request, jsonify

# parse arguments
parser = argparse.ArgumentParser()
parser.add_argument('--port', '-p', type=int)
parser.add_argument('--model', '-m', type=str)
args = parser.parse_args()

model = args.model

app = Flask(__name__)

@app.route('/process', methods=['GET'])
def process():
    result = int(request.args['a']) + int(request.args['b'])
    return jsonify({'result': result, 'model': model})

if __name__ == '__main__':
    app.run(host="localhost", port=args.port)
Output:
$ python main.py
* Serving Flask app "client" (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: off
* Running on http://localhost:20002/ (Press CTRL+C to quit)
* Serving Flask app "client" (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: off
* Running on http://localhost:20001/ (Press CTRL+C to quit)
127.0.0.1 - - [22/Jan/2021 16:31:26] "GET /process?a=5&b=10 HTTP/1.1" 200 -
John process 5 + 10 = {'model': 'model1.hdf5', 'result': 15}
127.0.0.1 - - [22/Jan/2021 16:31:27] "GET /process?a=4&b=6 HTTP/1.1" 200 -
Oscar process 4 + 6 = {'model': 'model2.hdf5', 'result': 10}
John is closing
I am running a trigger function on INSERT/UPDATE that creates a new process which sends a POST request to an API.
On an Ubuntu + PostgreSQL 12 docker container I was able to fork the new process without an issue with the code below
pid=os.fork()
... do some logic
req = urllib2.Request(apiURI)
f = urllib2.urlopen(req)
Now, attempting the same on my Windows machine, it's clear that fork is not an option.
What is the best practice for running multiprocessing on a Windows system?
fork() is not supported by Windows.
You can achieve the same using the multiprocessing module:
from multiprocessing import Process

def foo():
    print('hello')

if __name__ == '__main__':
    p = Process(target=foo)
    p.start()
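Applied to the original trigger use case, a rough sketch might look like the following; the URL, payload and send_post helper are placeholders, and urllib.request stands in for the Python 2 urllib2:
from multiprocessing import Process
from urllib import request

def send_post(api_uri, payload):
    # Passing data makes urlopen issue a POST request
    req = request.Request(api_uri, data=payload)
    with request.urlopen(req) as f:
        print(f.getcode())

if __name__ == '__main__':
    # On Windows, multiprocessing spawns a fresh interpreter,
    # so the __main__ guard is required
    p = Process(target=send_post,
                args=('http://localhost:8000/api', b'{"event": "insert"}'))
    p.start()
    p.join()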
I wrote a very simple Flask web service that simply returns the text hey23 upon being hit, hosted on an AWS EC2 t2.micro machine (1 GB RAM, 1 CPU).
To execute this Python application I used uWSGI as my app server. Finally, I've put the complete setup behind Nginx.
So my stack is Flask + uWSGI + Nginx.
Everything is working fine; my only complaint is the execution time. The average latency measured using wrk is ~370 ms, which is too much considering the amount of work this service is doing.
Running 30s test @ http://XX.XXX.XX.XX/printtest
12 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 370.98ms 178.90ms 1.96s 91.78%
Req/Sec 93.72 36.72 270.00 69.55%
33124 requests in 30.11s, 5.41MB read
Socket errors: connect 0, read 0, write 0, timeout 15
Non-2xx or 3xx responses: 1173
Requests/sec: 1100.26
Transfer/sec: 184.14KB
hello-test.py
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/print23")
def helloprint():
    return "hey23"

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=8080, threaded=True)
uwsgi.ini
[uwsgi]
#application's base folder
base = /var/www/demoapp
master = true
#python module to import
app = hello-test
module = %(app)
home = %(base)/venv
pythonpath = %(base)
#socket file's location
socket = /var/www/demoapp/%n.sock
#permissions for the socket file
chmod-socket = 644
#the variable that holds a flask application inside the module imported at line #6
callable = app
disable-logging = True
#location of log files
logto = /var/log/uwsgi/%n.log
max-worker-lifetime = 30
processes = 10
threads = 2
enable-threads = True
cheaper = 2
cheaper-initial = 5
cheaper-step = 1
cheaper-algo = spare
cheaper-overload = 5
Even if I forget about the wrk benchmark, I get similar latency when posting GET requests from my Postman client.
What is wrong here? No matter what, some takeaways:
The code cannot be optimised; it just has to return the string hey23.
Nothing can be wrong with Nginx.
I certainly assume ~370 ms is not a good response time for an API doing such a simple task.
Changing the region in which my EC2 machine is hosted may bring some change, but come on, that should not be the sole reason.
Then what am I missing?
I created some custom classifiers locally, and then I tried to deploy to Bluemix an app that classifies an image based on the classifiers I made.
When I try to deploy it, it fails to start.
import os
import json
from os.path import join, dirname
from os import environ
from watson_developer_cloud import VisualRecognitionV3
import time

start_time = time.time()

visual_recognition = VisualRecognitionV3(VisualRecognitionV3.latest_version, api_key='*************')

with open(join(dirname(__file__), './test170.jpg'), 'rb') as image_file:
    print(json.dumps(visual_recognition.classify(images_file=image_file, threshold=0, classifier_ids=['Angle_971786581']), indent=2))

print("--- %s seconds ---" % (time.time() - start_time))
Even if I try to deploy a simple print, it fails to deploy, but the starter app I get from Bluemix, or a Flask tutorial I found online (https://www.ibm.com/blogs/bluemix/2015/03/simple-hello-world-python-app-using-flask/), deploys just fine.
I'm very new to web programming and to using cloud services, so I'm totally lost.
Thank you.
Bluemix is expecting your python application to serve on a port. If your application isn't serving some kind of response on the port, it assumes the application failed to start.
import os
from flask import Flask

app = Flask(__name__)

# On Bluemix, get the port number from the environment variable PORT
# When running this app on the local machine, default the port to 8080
port = int(os.getenv('PORT', 8080))

@app.route('/')
def hello_world():
    return 'Hello World! I am running on port ' + str(port)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=port)
It looks like you're writing your code to just execute once and stop. Instead, make it do the work when someone hits your URL, like shown in the hello_world() function above.
Think about what you want to happen when someone goes to YOUR_APP_NAME.mybluemix.net
If you do not want your application to be a WEB application, but instead just execute once (a background worker application), then use the --no-route option at the end of your cf push command. Then, look at the logs using cf logs appname --recent to see the output of your application
https://console.ng.bluemix.net/docs/manageapps/depapps.html#deployingapps
The main problem was the watson-developer-cloud module, which gave me an error that it could not be found.
I downgraded to Python 2.7.12, installing it for all users.
I modified runtime.txt and requirements.txt (requirements.txt possibly not needed).
Staged with Diego, using the no-route option and the set-health-check APP_NAME none command.
Those fixed the problem, but I still get an exit status 0.
When you deploy an app in Bluemix, you should have a requirements.txt which includes the services you used in your app.
So you should check your requirements.txt; maybe you left out
watson_developer_cloud
and then the requirements.txt looks like this:
Flask==0.10.1
watson_developer_cloud
I have written a single-user application that currently works with Flask's internal web server. It does not seem to be very robust, and it crashes with all sorts of socket errors as soon as a page takes a long time to load and the user navigates elsewhere while waiting. So I thought to replace it with Apache.
The problem is, my current code is a single program that first launches about ten threads to do stuff, for example set up ssh tunnels to remote servers and zmq connections to communicate with a database located there. Finally it enters run() loop to start the internal server.
I followed all sorts of instructions and managed to get Apache to serve the initial page. However, everything goes wrong, as I now don't have any worker threads available, nor any globally initialised classes, and none of my global variables holding interfaces to these threads exist.
Obviously I am not a web developer.
How badly "wrong" is my current code? Is there any way to make it work with Apache with a reasonable amount of work? Can I have Apache just replace the run() part and communicate with the running application? My current app, in a very simplified form (without the data processing threads), is something like this:
from flask import Flask, render_template

comm = None

app = Flask(__name__)

class CommsHandler(object):
    def __init__(self):
        # Init communication links to external servers and databases
        ...

    def request_data(self, request):
        # Use initialised links to request something
        return result

@app.route("/", methods=["GET"])
def mainpage():
    return render_template("main.html")

@app.route("/foo", methods=["GET"])
def foo():
    a = comm.request_data("xyzzy")
    return render_template("foo.html", data=a)

comm = CommsHandler()
app.run()
Or have I done this completely wrong? When I remove app.run() and just import the app object into the wsgi script, I do get a response from the main page, as it does not need a reference to the global variable comm.
/foo does not work, as comm is an uninitialised variable, and I can see why, of course. I just never thought this would need to be exported to Apache or any other web server.
So the question is: can I launch this application somehow in an rc script at boot, set up its communication links and everything, and have Apache/wsgi just call functions of the running application instead of launching a new one?
Hannu
This is the simple app with Flask, run on the internal server:
from flask import Flask
app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello World!"

if __name__ == "__main__":
    app.run()
To run it on an Apache server, check out the FastCGI docs:
from flup.server.fcgi import WSGIServer
from yourapplication import app

if __name__ == '__main__':
    WSGIServer(app).run()
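Similarly, if Apache is used with mod_wsgi rather than FastCGI, one sketch of the entry point could look like this (the module name myapp and the daemon-mode setup are assumptions mirroring the simplified code in the question): the file is imported once per daemon process, so the communication links set up at module level, including the CommsHandler, live as long as that process does.
# wsgi.py - hypothetical entry point for mod_wsgi daemon mode
import myapp

# Runs once per WSGI process: set up the communication links here
# instead of just before app.run()
myapp.comm = myapp.CommsHandler()

# mod_wsgi looks for a module-level callable named "application"
application = myapp.app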