Worker timeout when preloading PyTorch model in Flask app on Render.com

In my app.py I have a function that uses a pretrained PyTorch model to generate keywords:
@app.route('/get_keywords')
def get_keywords():
    generated_keywords = ml_controller.generate_keywords()
    return jsonify(keywords=generated_keywords)
and in ml_controller.py I have
def generate_keywords():
    model = load_keywords_model()
    output = model.generate()
    return output
This is working fine. Calls to /get_keywords correctly return the generated keywords. However this solution is quite slow since the model gets loaded on each call. Hence I tried to load the model just once by moving it outside my function:
model = load_keywords_model()

def generate_keywords():
    output = model.generate()
    return output
But now all calls to /get_keywords time out when I deploy my app to Render.com. (Locally it's working.) Strangely the problem is not that the model does not get loaded. When I write
model = load_keywords_model()
testOutput = model.generate()
print(testOutput)

def generate_keywords():
    output = model.generate()
    return output
a bunch of keywords are generated when I boot gunicorn. Also, all other endpoints that don't call ml_controller.generate_keywords() work without problems.
For testing purposes I also added a dummy function to ml_controller.py that I can call without problems
def dummy_string():
    return "dummy string"
Based on answers to similar problems I found, I'm starting Gunicorn with
gunicorn app:app --timeout 740 --preload --log-level debug
and in app.py I'm using
if __name__ == '__main__':
    app.run(debug=False, threaded=False)
However, the problem still persists.

The problem is that there's some bug that occurs for PyTorch models when Gunicorn is started with the --preload flag.
Render.com silently adds this flag and doesn't show it in the settings, which is why it took me days to figure this out. You can see all settings Render.com adds by calling printenv in the console.
To resolve the issue add a new environment variable
GUNICORN_CMD_ARGS: '--access-logfile - --bind=0.0.0.0:10000'
which overwrites Render.com's standard settings
GUNICORN_CMD_ARGS: '--preload --access-logfile - --bind=0.0.0.0:10000'
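If overriding the environment variable isn't an option, another workaround (a sketch, not part of the original answer; load_keywords_model is the helper from the question) is to load the model lazily on first use, so nothing heavy runs while Gunicorn imports and forks the app under --preload:

# ml_controller.py -- sketch: load the model once per worker, on the first request
_model = None

def _get_model():
    global _model
    if _model is None:
        # Deferred until the first call, so --preload only imports the module;
        # the expensive load happens inside each worker after the fork.
        _model = load_keywords_model()
    return _model

def generate_keywords():
    return _get_model().generate()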


Can't run a Flask server [duplicate]

I'm trying to follow this online tutorial for a Flask/TensorFlow/React application, but now I'm having some trouble at the end trying to run the Flask server.
Flask version: 2.2.3
Python version: 3.10.0
I've searched online for solutions, but nothing I've tried has worked. Here are the ways I've tried to run the application:
Not sure if this could be helpful in coming to a solution, but in case it is, here's my app.py file:
import os
import numpy as np
from flask import Flask, request
from flask_cors import CORS
from keras.models import load_model
from PIL import Image, ImageOps

app = Flask(__name__)  # new
CORS(app)  # new

@app.route('/upload', methods=['POST'])
def upload():
    # Disable scientific notation for clarity
    np.set_printoptions(suppress=True)
    # Load the model
    model = load_model("keras_Model.h5", compile=False)
    # Load the labels
    class_names = open("labels.txt", "r").readlines()
    # Create the array of the right shape to feed into the keras model
    # The 'length' or number of images you can put into the array is
    # determined by the first position in the shape tuple, in this case 1
    data = np.ndarray(shape=(1, 224, 224, 3), dtype=np.float32)
    # Replace this with the path to your image
    image = Image.open("<IMAGE_PATH>").convert("RGB")
    # resizing the image to be at least 224x224 and then cropping from the center
    size = (224, 224)
    image = ImageOps.fit(image, size, Image.Resampling.LANCZOS)
    # turn the image into a numpy array
    image_array = np.asarray(image)
    # Normalize the image
    normalized_image_array = (image_array.astype(np.float32) / 127.5) - 1
    # Load the image into the array
    data[0] = normalized_image_array
    # Predicts the model
    prediction = model.predict(data)
    index = np.argmax(prediction)
    class_name = class_names[index]
    confidence_score = prediction[0][index]
    # Print prediction and confidence score
    print("Class:", class_name[2:], end="")
    print("Confidence Score:", confidence_score)
Does anyone know what I'm doing wrong here, is there maybe something obvious I'm missing that's causing the problem? If there's any other info I can add that may be helpful, please let me know, thanks.
Edit:
I've added the execution section at the end of my code:
if __name__ == '__main__':
    app.run(host="127.0.0.1", port=5000)
And now 'python -m flask run' does attempt to run the app, so the original question is answered. There does seem to be a subsequent problem now, though: it constantly returns this error. I installed TensorFlow using 'pip3 install tensorflow' and it does install successfully, but then it always returns a module-not-found error. TensorFlow doesn't appear in the pip freeze package list; I'm now looking into how/why this is.
Edit2: My question was flagged as already answered here, though I'm struggling to see how, as that post makes no mention at all of why 'flask run' (or any other way of trying to run a Flask app) might not work, or what to do when it doesn't, which is what this question is about. That post simply discusses the 'correct' way to run a Flask app, not how to run one at all when it won't run.
If you don't want to use the flask command, you need to add code to run the development server instead.
if __name__ == "__main__":
    app.run()
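With that block at the end of app.py, running python app.py from the project directory starts Flask's development server (on http://127.0.0.1:5000 by default), without needing the flask command.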

Keep my test data when running a TestCase in Django

I have an initial_data command in my project. When I try to create my test database and run this command with "py manage.py initial_data", it doesn't work, because the command creates data in the main database and not in the test database; actually, I can't even tell which one is the test database.
So I gave up on using this command and looked for another way to solve my problem: I decided to create my initial data manually.
The initial_data command creates the data that my other tables require.
Why am I trying to create this data? Because I have more than 500 tests, and these tests depend on that other test data.
So I have these tests:
class BankServiceTest(TestCase):
    def setUp(self):
        self.bank_model = models.Bank
        self.country_model = models.Country

    def test_create_bank_object(self):
        self.bank_model.objects.create(name='bank')
        re1 = self.bank_model.objects.get(name='bank')
        self.assertEqual(re1.name, 'bank')

    def test_create_country_object(self):
        self.country_model.objects.create(name='country', code=100)
        bank = self.bank_model.objects.get(name='bank')
        re1 = self.country_model.objects.get(code=100)
        self.assertEqual(re1.name, 'country')
After running "test_create_bank_object", I want the data created in the first test method to still be available in the second one; effectively, that would do the same job as the initial_data command.
So what is the right approach for this problem?
I tried unittest.TestCase, but it didn't work for me: it creates the test data in the main database, which I don't want, or maybe I just used it incorrectly.
What I actually need is to keep my data available while the tests are running.
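One possible approach, added here as a hedged sketch (it is not from the original post): Django's TestCase.setUpTestData creates shared objects once per test class, inside the test database, and they are visible to every test method of that class. The model names below are taken from the question; the import path is hypothetical.

from django.test import TestCase
from myapp import models  # hypothetical import path; the question doesn't show its imports

class BankServiceTest(TestCase):
    @classmethod
    def setUpTestData(cls):
        # Created once for the whole class, in the test database,
        # and available in every test method of this class.
        cls.bank = models.Bank.objects.create(name='bank')
        cls.country = models.Country.objects.create(name='country', code=100)

    def test_bank_object(self):
        self.assertEqual(models.Bank.objects.get(name='bank').name, 'bank')

    def test_country_object(self):
        # The bank created in setUpTestData is still visible here.
        self.assertEqual(models.Country.objects.get(code=100).name, 'country')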

Locust: How to use distributed mode programmatically

I'm using Locust to load test a bunch of apps. The problem is that it's using more than 90% of the CPU, so I want to switch to distributed mode with a master and workers. I know how it's done on the command line, but I'm using Locust as a library, and it seems the docs don't cover this case. Here are some snippets of my code:
User:
# host, port and reqs are external parameters
class User(HttpUser):
    host = f"{host}/{port}"

    @task
    def task1(self):
        for req in reqs:
            self.client.request(req['method'], req['path'], name=f"{port}", headers=req['headers'], data=req['body'])

    def on_start(self):
        self.client.verify = False
main:
env = Environment(user_classes=[User])
env.create_local_runner()
# env.create_web_ui("127.0.0.1", 8089)
gevent.spawn(stats_printer(env.stats))
gevent.spawn(stats_history, env.runner)
csvWriter = StatsCSVFileWriter(
    environment=env,
    base_filepath=f"{CWD}/outputs/{port}",
    full_history=True,
    percentiles_to_report=[0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
)
gevent.spawn(csvWriter)
env.runner.start(100, spawn_rate=5)
gevent.spawn_later(30, lambda: saveReportAndQuit(env, port))
env.runner.greenlet.join()
# env.web_ui.stop()
python version: 3.9.7
Locust version: 2.4.0
The Locust docs do in fact mention this. They don't have a full example, granted, but you should have what you need to do it from the docs.
The Environment instance’s create_local_runner, create_master_runner or create_worker_runner can then be used to start a Runner instance, which can be used to start a load test
create_worker_runner(master_host, master_port)
    Create a WorkerRunner instance for this Environment.
    Parameters:
        master_host – Host/IP of a running master node
        master_port – Port on master node to connect to
You'll need to know the master's IP address in code in order to use this, which Locust can't really help you with. But then you should be able to do something like this:
env = Environment(user_classes=[User])
env.create_worker_runner('0.0.0.0', 5557)
env.runner.greenlet.join()
I don't think you need to start the worker runner since the master should tell it to when your test starts, but I'm not 100% sure. Try it out and see what happens.
Also, if you want to use the same file for the master and the workers, you'll need some logic to decide which to run; a sketch of that is below.
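A minimal sketch of that role selection, assuming a hypothetical --worker command-line flag and MASTER_HOST environment variable (neither is part of Locust; they are just one way to choose a role at startup), and reusing the User class from the question:

import argparse
import os

import gevent
from locust.env import Environment

parser = argparse.ArgumentParser()
parser.add_argument("--worker", action="store_true", help="run as a worker instead of the master")
args = parser.parse_args()

env = Environment(user_classes=[User])

if args.worker:
    # Workers connect to the master and wait for it to drive the test.
    env.create_worker_runner(os.environ.get("MASTER_HOST", "127.0.0.1"), 5557)
else:
    # The master listens for workers (5557 is Locust's default port) and starts the test.
    env.create_master_runner("*", 5557)
    env.runner.start(100, spawn_rate=5)

env.runner.greenlet.join()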

Change Logdir of Ray RLlib Training instead of ~/ray_results

I'm using Ray & RLlib to train RL agents on an Ubuntu system. Tensorboard is used to monitor the training progress by pointing it to ~/ray_results where all the log files for all runs are stored. Ray Tune is not being used.
For example, on starting a new Ray/RLlib training run, a new directory will be created at
~/ray_results/DQN_ray_custom_env_2020-06-07_05-26-32djwxfdu1
To visualize the training progress, we need to start Tensorboard using
tensorboard --logdir=~/ray_results
Question: Is it possible to configure Ray/RLlib to change the output directory of the log files from ~/ray_results to another location?
Additionally, instead of logging to a directory named something like DQN_ray_custom_env_2020-06-07_05-26-32djwxfdu1, can this directory name be set by ourselves?
Failed attempt: I tried setting
os.environ['TUNE_RESULT_DIR'] = '~/another_dir'
before running ray.init(), but the result log files were still being written to ~/ray_results.
Without using Tune, you can change the logdir using rllib's "Trainer". The "Trainer" class takes in an optional "logger_creator" if you want to specify where to save the log (see here).
A concrete example:
Define your customized logger creator (you can simply modify from the default one):
import os
import tempfile
from datetime import datetime
from ray.tune.logger import UnifiedLogger

def custom_log_creator(custom_path, custom_str):
    timestr = datetime.today().strftime("%Y-%m-%d_%H-%M-%S")
    logdir_prefix = "{}_{}".format(custom_str, timestr)

    def logger_creator(config):
        if not os.path.exists(custom_path):
            os.makedirs(custom_path)
        logdir = tempfile.mkdtemp(prefix=logdir_prefix, dir=custom_path)
        return UnifiedLogger(config, logdir, loggers=None)

    return logger_creator
Pass this logger_creator to the trainer, and start training:
trainer = PPOTrainer(config=config, env='CartPole-v0',
                     logger_creator=custom_log_creator(os.path.expanduser("~/another_ray_results/subdir"), 'custom_dir'))

for i in range(ITER_NUM):
    result = trainer.train()
You will find the training results (i.e., TensorBoard events file, params, model, ...) saved under "~/another_ray_results/subdir" with your specified naming convention.
Is it possible to configure Ray/RLlib to change the output directory of the log files from ~/ray_results to another location?
There is currently no way to configure this using the RLlib CLI tool (rllib).
If you're okay with Python API, then, as described in documentation, local_dir parameter of tune.run is responsible for specifying output directory, default is ~/ray_results.
Additionally, instead of logging to a directory named something like DQN_ray_custom_env_2020-06-07_05-26-32djwxfdu1, can this directory name be set by ourselves?
This is governed by the trial_name_creator parameter of tune.run. It must be a function that accepts a trial object and formats it into a string, like so:
def trial_name_id(trial):
    return f"{trial.trainable_name}_{trial.trial_id}"

tune.run(..., trial_name_creator=trial_name_id)
Just for anyone who bumps into this problem with Ray Tune.
You can specify local_dir for run_config within tune.Tuner:
# This logs to 2 different trial folders:
# ./results/test_experiment/trial_name_1 and ./results/test_experiment/trial_name_2
# Only trial_name is autogenerated.
tuner = tune.Tuner(trainable,
                   tune_config=tune.TuneConfig(num_samples=2),
                   run_config=air.RunConfig(local_dir="./results", name="test_experiment"))
results = tuner.fit()
Please see this link for more info.

TensorFlow: minimalist program fails on distributed mode

I wrote a very simple program that runs just fine without distribution but hangs on CheckpointSaverHook in distributed mode (everything on my localhost though!). I've seen there's been a few questions about hanging in distributed mode, but none seem to match my question.
Here's the script (made to toy with the new layers API):
import numpy as np
import tensorflow as tf
from tensorflow.contrib.learn.python.learn import learn_runner
from tensorflow.contrib import layers

DATA_SIZE = 10
DIMENSION = 5
FEATURES = 'features'

def generate_input_fn():
    def _input_fn():
        mid = int(DATA_SIZE / 2)
        data = np.array([np.ones(DIMENSION) if x < mid else -np.ones(DIMENSION) for x in range(DATA_SIZE)])
        labels = ['0' if x < mid else '1' for x in range(DATA_SIZE)]
        table = tf.contrib.lookup.string_to_index_table_from_tensor(tf.constant(['0', '1']))
        label_tensor = table.lookup(tf.convert_to_tensor(labels, dtype=tf.string))
        return dict(zip([FEATURES], [tf.convert_to_tensor(data, dtype=tf.float32)])), label_tensor
    return _input_fn

def build_estimator(model_dir):
    features = layers.real_valued_column(FEATURES, dimension=DIMENSION)
    return tf.contrib.learn.DNNLinearCombinedClassifier(
        model_dir=model_dir,
        dnn_feature_columns=[features],
        dnn_hidden_units=[20, 20])

def generate_exp_fun():
    def _exp_fun(output_dir):
        return tf.contrib.learn.Experiment(
            build_estimator(output_dir),
            train_input_fn=generate_input_fn(),
            eval_input_fn=generate_input_fn(),
            train_steps=100
        )
    return _exp_fun

if __name__ == '__main__':
    tf.logging.set_verbosity(tf.logging.DEBUG)
    learn_runner.run(generate_exp_fun(), 'job_dir')
To test distributed mode, I simply launch it with the environment variable TF_CONFIG={"cluster": {"ps":["localhost:5040"], "worker":["localhost:5041"]}, "task":{"type":"worker","index":0}, "environment": "local"} (this is for the worker; the same, with the "ps" task type, is used to launch the parameter server).
I use tensorflow-1.0.1 (but had the same behavior with 1.0.0) on Windows 64-bit, CPU only. I actually never get any error; it just hangs forever after INFO:tensorflow:Create CheckpointSaverHook. I've tried to attach the Visual Studio C++ debugger to the process, but with little success so far, so I can't print a stack trace of what's happening in the native part.
P.S.: it's not a problem with DNNLinearCombinedClassifier, because it fails as well with a simple tf.contrib.learn.LinearClassifier. And as noted in the comments, it's not due to both processes running on localhost, since it also fails when running on separate VMs.
EDIT: I think there's actually an issue with server launching. It looks like the server is not launched when you're in local mode (whether distributed or not), cf. tensorflow/contrib/learn/python/learn/experiment.py l.250-258:
# Start the server, if needed. It's important to start the server before
# we (optionally) sleep for the case where no device_filters are set.
# Otherwise, the servers will wait to connect to each other before starting
# to train. We might as well start as soon as we can.
config = self._estimator.config
if (config.environment != run_config.Environment.LOCAL and
        config.environment != run_config.Environment.GOOGLE and
        config.cluster_spec and config.master):
    self._start_server()
This will prevent the server from being started in local mode for the workers... Does anyone have an idea whether this is a bug, or whether there's something I'm missing?
So this has been answered in https://github.com/tensorflow/tensorflow/issues/8796: in the end, one should use CLOUD for any distributed operation.
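Concretely, for the TF_CONFIG from the question, that means keeping the same cluster and task blocks but switching "environment" from "local" to "cloud" so the worker actually starts its server. A sketch (setting the variable from Python rather than exporting it in the shell, which is just a presentation choice):

import json
import os

# Worker process; the parameter server uses the same cluster block
# with "task": {"type": "ps", "index": 0}.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"ps": ["localhost:5040"], "worker": ["localhost:5041"]},
    "task": {"type": "worker", "index": 0},
    "environment": "cloud",
})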
