I am running a Flask application (a REST API) with gunicorn, and almost every 30 seconds I see a batch of [CRITICAL] WORKER TIMEOUT (pid:14727) messages.
My settings are the following:
gunicorn --worker-class gevent \
    --timeout 30 --graceful-timeout 20 \
    --max-requests-jitter 2000 --max-requests 1500 \
    -w 50 \
    --log-level DEBUG --capture-output \
    --bind 0.0.0.0:5000 run:app
I saw a previous post that said to throw more RAM at this, but from the looks of it:
$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 513926
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 131071
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 1550298
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
The heap is unlimited and the stack size is 8192 KB (about 8 MB).
Log sample
+0000] [26657] [DEBUG] GET /timer
[2017-01-21 14:07:30 +0000] [26657] [DEBUG] GET /timer
[2017-01-21 14:07:33 +0000] [26657] [DEBUG] GET /timer
[2017-01-21 14:07:33 +0000] [26652] [DEBUG] GET /timer
10.193.80.149 - - [21/Jan/2017:14:07:34 +0000] "GET /timer?id=699ec59eccd3fb929b3dd7707e542ed15acd4181:6f136b54-2cb5-42ef-9def-f69caaba57ef HTTP/1.1" 200 - "-" "-"
10.193.80.147 - - [21/Jan/2017:14:07:35 +0000] "GET /timer?id=e7963c53603ed9249b0aa557d8a64cea89fb0bf4:6f136b54-2cb5-42ef-9def-f69caaba57ef HTTP/1.1" 200 - "-" "-"
10.193.80.150 - - [21/Jan/2017:14:07:35 +0000] "GET /timer?id=4b750805193fb4d00c3ce1465c266ed932a24e55:6f136b54-2cb5-42ef-9def-f69caaba57ef HTTP/1.1" 200 - "-" "-"
[2017-01-21 14:07:37 +0000] [26657] [DEBUG] GET /timer
[2017-01-21 14:07:37 +0000] [26657] [DEBUG] GET /timer
[2017-01-21 14:07:37 +0000] [26635] [CRITICAL] WORKER TIMEOUT (pid:27202)
[2017-01-21 14:07:37 +0000] [26635] [CRITICAL] WORKER TIMEOUT (pid:27205)
What I noticed is that only a handful of workers (26657, 26652, 26651) ever do the work; everything else just gives me the worker timeout.
You have some requests that take longer than 30 seconds to finish, which is why their workers are killed. Either:
tune your code so each request finishes in under 30 seconds (the slowness might also come from a slow database or other dependencies; the timing sketch below can help find the culprits),
check whether your host is short on resources, i.e. CPU or RAM. Adding more RAM only helps if each of your gunicorn processes eats a lot of memory and the machine starts swapping; try e.g. top to check whether CPU or RAM is saturated, or
increase the timeout by changing --timeout 30 to a higher number. That is really the worst option, because it doesn't solve the underlying problem that your Flask app responds slowly to incoming requests, and killing long-running requests often keeps the other Flask workers from running into resource problems as well.
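If you are not sure which endpoints are slow, a lightweight first step is to time every request and log anything that comes close to the timeout. A minimal sketch, assuming a plain Flask app (the 25-second threshold is an arbitrary example, not part of the original setup):

# Minimal request-timing sketch: log any request that takes longer than a
# chosen threshold so slow endpoints can be found before gunicorn's --timeout
# kills the worker.
import time

from flask import Flask, g, request

app = Flask(__name__)

SLOW_THRESHOLD = 25  # seconds; arbitrary, chosen to sit just under --timeout 30

@app.before_request
def start_timer():
    g.start = time.time()

@app.after_request
def log_slow_requests(response):
    elapsed = time.time() - g.start
    if elapsed > SLOW_THRESHOLD:
        app.logger.warning("Slow request: %s %s took %.1fs",
                           request.method, request.path, elapsed)
    return response

Keep in mind that a request which actually exceeds the timeout is killed before after_request runs, so this only catches requests that finish just under the limit; for the killed ones, compare the debug log's GET lines against the access log.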
Related
A FastAPI application restarts after a gunicorn worker timeout. Is it possible to handle such a signal from the FastAPI application (the shutdown event doesn't help) before the application restarts?
The problem is that some function exceeds the default time limit (30 seconds), which is OK, and we want to handle the situation by catching such a signal so we can notify the user about the error. Otherwise, the user sees upstream connect error or disconnect/reset before headers. reset reason: connection termination.
INFO [83] uvicorn.error Application startup complete. ()
CRITICAL [70] gunicorn.error WORKER TIMEOUT (pid:83) (83,)
CRITICAL [70] gunicorn.error WORKER TIMEOUT (pid:83) (83,)
WARNING [70] gunicorn.error Worker with pid 83 was terminated due to signal 6 (83, 6)
WARNING [70] gunicorn.error Worker with pid 83 was terminated due to signal 6 (83, 6)
INFO [83] gunicorn.error Booting worker with pid: 83 (83,)
INFO [83] gunicorn.error Booting worker with pid: 83 (83,)
INFO [83] uvicorn.error Started server process [83] (83,)
INFO [83] uvicorn.error Waiting for application startup. ()
INFO [83] uvicorn.error Application startup complete. ()
Unfortunately, a timeout increase isn't feasible.
I did try @app.on_event("shutdown") and some general FastAPI exception-handling methods, but nothing helped.
Gunicorn sends SIGABRT (signal 6) to a worker process when it times out.
So the process (FastAPI in this case) needs to catch that signal, but on_event cannot do it, because FastAPI (Starlette) events are not OS signals.
There is a simple solution, though: Gunicorn server hooks.
def worker_abort(worker):
...
Called when a worker received the SIGABRT signal.
This call generally happens on timeout.
The callable needs to accept one instance variable for the initialized Worker.
Of course, you will lose the request at that point; you have to find another way to send a response to the user. I recommend using FastAPI BackgroundTasks or Celery.
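A minimal sketch of such a hook in a gunicorn config file (the notify_user_of_timeout helper is hypothetical; replace it with whatever out-of-band notification you use, e.g. a Celery task):

# gunicorn.conf.py
# Sketch of the worker_abort server hook; it is called when a worker receives
# SIGABRT, which generally happens on timeout.

def worker_abort(worker):
    worker.log.warning("Worker %s timed out, notifying the user out of band", worker.pid)
    # notify_user_of_timeout(worker)  # hypothetical helper, e.g. enqueue a Celery task

Start gunicorn with -c gunicorn.conf.py so the hook is picked up.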
I have an Elastic Beanstalk Python worker environment. The average job running time is about 20 seconds. Sometimes the following scenario happens:
sqsd picks a message from the sqs queue and sends it to the worker.
The worker starts processing the message.
Within a few seconds (anywhere from 1 to 30 seconds), sqsd gets the following error and parks the message in the dead-letter queue, as I have configured the retries to 1.
127.0.0.1 (-) - - [23/Nov/2017:19:48:17 +0000] "POST / HTTP/1.1" 500 527 "-" "aws-sqsd/2.3"
The worker continues to process the message and finishes successfully. I have logs to trace that.
That makes the environment unhealthy in general.
I have the connection timeout = 60 seconds, Inactivity timeout = 600, Visibility timeout = 600, HTTP connections = 2.
I have the following in the configs as well
option_settings:
  aws:elasticbeanstalk:container:python:
    NumProcesses: 3
    NumThreads: 10

files:
  "/etc/httpd/conf.d/wsgi_custom.conf":
    mode: "000644"
    owner: root
    group: root
    content: |
      WSGIApplicationGroup %{GLOBAL}
Is this because of some memory limit that WSGI puts on every request? That is the only thing I can think of.
I have implemented a simple microservice using Flask, where the method that handles the request calculates a response based on the request data and a rather large data structure loaded into memory.
Now, when I deploy this application using gunicorn with a large number of workers, I would simply like to share the data structure among the request handlers of all workers. Since the data is only read, there is no need for locking or anything similar. What is the best way to do this?
Essentially what would be needed is this:
load/create the large data structure when the server is initialized
somehow get a handle inside the request handling method to access the data structure
As far as I understand, gunicorn lets me implement various hook functions, e.g. for when the server is initialized, but a Flask request handler method does not know anything about the gunicorn server's data structures.
I do not want to use something like Redis or a database system for this, since all the data is in a data structure that needs to be loaded in memory and no deserialization should be involved.
The calculation carried out for each request, which uses the large data structure, can be lengthy, so it must happen concurrently in a truly independent thread or process for each request (this should scale by running on a multi-core computer).
You can use preloading.
This will allow you to create the data structure ahead of time, then fork each request handling process. This works because of copy-on-write and the knowledge that you are only reading from the large data structure.
Note: Although this will work, it should probably only be used for very small apps or in a development environment. I think the more production-friendly way of doing this would be to queue up these calculations as tasks on the backend, since they will be long-running; you can then notify users of the completed state (see the sketch after the preloading example below).
Here is a little snippet to see the difference preloading makes.
# app.py
import flask
app = flask.Flask(__name__)
def load_data():
print('calculating some stuff')
return {'big': 'data'}
@app.route('/')
def index():
return repr(data)
data = load_data()
Running with gunicorn app:app --workers 2:
[2017-02-24 09:01:01 -0500] [38392] [INFO] Starting gunicorn 19.6.0
[2017-02-24 09:01:01 -0500] [38392] [INFO] Listening at: http://127.0.0.1:8000 (38392)
[2017-02-24 09:01:01 -0500] [38392] [INFO] Using worker: sync
[2017-02-24 09:01:01 -0500] [38395] [INFO] Booting worker with pid: 38395
[2017-02-24 09:01:01 -0500] [38396] [INFO] Booting worker with pid: 38396
calculating some stuff
calculating some stuff
And running with gunicorn app:app --workers 2 --preload:
calculating some stuff
[2017-02-24 09:01:06 -0500] [38403] [INFO] Starting gunicorn 19.6.0
[2017-02-24 09:01:06 -0500] [38403] [INFO] Listening at: http://127.0.0.1:8000 (38403)
[2017-02-24 09:01:06 -0500] [38403] [INFO] Using worker: sync
[2017-02-24 09:01:06 -0500] [38406] [INFO] Booting worker with pid: 38406
[2017-02-24 09:01:06 -0500] [38407] [INFO] Booting worker with pid: 38407
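As mentioned in the note above, for long-running calculations the more production-friendly pattern is to submit them as background tasks and let clients poll for the result. A rough, standard-library-only sketch (endpoint names and expensive_calculation are illustrative; the jobs dict lives in one worker process, which is exactly why a real task queue such as Celery is the better fit once you run several gunicorn workers):

# Rough sketch: submit the long-running calculation as a background task and
# let the client poll for the result by job id.
import uuid
from concurrent.futures import ThreadPoolExecutor

import flask

app = flask.Flask(__name__)
executor = ThreadPoolExecutor(max_workers=4)
jobs = {}  # job id -> Future; per-process, not shared between gunicorn workers

def expensive_calculation(payload):
    # placeholder for the lengthy work that reads the large data structure
    return {'echo': payload}

@app.route('/submit', methods=['POST'])
def submit():
    job_id = str(uuid.uuid4())
    jobs[job_id] = executor.submit(expensive_calculation, flask.request.get_json())
    return flask.jsonify(job_id=job_id), 202

@app.route('/status/<job_id>')
def status(job_id):
    future = jobs.get(job_id)
    if future is None:
        return flask.jsonify(state='unknown'), 404
    if not future.done():
        return flask.jsonify(state='pending')
    return flask.jsonify(state='done', result=future.result())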
I have an app on Pyramid. I run it in uWSGI with this config:
[uwsgi]
socket = mysite:8055
master = true
processes = 4
vacuum = true
lazy-apps = true
gevent = 100
And nginx config:
server {
    listen 8050;
    include uwsgi_params;
    location / {
        uwsgi_pass mysite:8055;
    }
}
Usually everything is fine, but sometimes uWSGI kills workers, and I have no idea why.
I see in the uWSGI logs:
DAMN ! worker 2 (pid: 4247) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 2 (new pid: 4457)
but there are no Python exceptions in the logs.
Sometimes I also see in the uWSGI logs:
invalid request block size: 11484 (max 4096)...skip
[uwsgi-http key: my site:8050 client_addr: 127.0.0.1 client_port: 63367] hr_instance_read(): Connection reset by peer [plugins/http/http.c line 614]
And nginx errors.log:
*13388 upstream prematurely closed connection while reading response header from upstream, client: 127.0.0.1,
*13955 recv() failed (104: Connection reset by peer) while reading response header from upstream, client:
I think this can be solved by adding buffer-size=32768, but it is unlikely that this is why uWSGI kills the workers.
Why does uWSGI kill workers, and how can I find out the reason?
The line "DAMN ! worker 2 (pid: 4247) died, ..." tells me nothing.
Signal 9 means the worker received a SIGKILL, so something sent a kill to your worker. It is relatively likely that the out-of-memory killer decided to kill your app because it was using too much memory. Try watching the workers with a process monitor such as top to see whether they use a lot of memory; if the OOM killer is responsible, the kernel log (dmesg) will usually show a corresponding message.
Try adding the harakiri-verbose = true option to the uWSGI config.
I had the same problem. For me, editing the uwsgi.ini file and changing the reload-on-rss setting from 2048 to 4048 and harakiri to 600 solved it.
For me, it was that I hadn't set app.config["SERVER_NAME"] = "x".
I have a Flask app with the following route:
import logging
import sys

from flask import Flask

app = Flask(__name__)

@app.route('/')
def index():
    console = logging.StreamHandler()
    log = logging.getLogger("asdasd")
    log.addHandler(console)
    log.setLevel(logging.DEBUG)
    log.error("Something")
    print >> sys.stderr, "Another thing"
    return 'ok'
I run this using
gunicorn --access-logfile /mnt/log/test.log --error-logfile /mnt/log/test.log --bind 0.0.0.0:8080 --workers 2 --worker-class gevent --log-level debug server:app
The logs are as below:
2014-06-26 00:13:55 [21621] [INFO] Using worker: gevent
2014-06-26 00:13:55 [21626] [INFO] Booting worker with pid: 21626
2014-06-26 00:13:55 [21627] [INFO] Booting worker with pid: 21627
2014-06-26 00:14:05 [21626] [DEBUG] GET /
10.224.67.41 - - [26/Jun/2014:00:14:14 +0000] "GET / HTTP/1.1" 200 525 "-" "python-requests/2.2.1 CPython/2.7.5 Darwin/13.2.0"
2014-06-26 00:14:14 [21626] [DEBUG] Closing connection.
What's happening to my logs in the index method?
As of Gunicorn 19.0, gunicorn has stopped redirecting stderr to its logs.
Refer to https://github.com/benoitc/gunicorn/commit/41523188bc05fcbba840ba2e18ff67cd9df638e9
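A common workaround (a sketch, not the only option) is to hand Flask's logger the handlers of gunicorn's own error logger, so that application log records end up in --error-logfile; newer gunicorn versions also offer a --capture-output flag that redirects stdout/stderr into the error log again.

# Sketch: reuse gunicorn's error-log handlers for the Flask app logger so that
# app.logger.error(...) calls inside request handlers show up in --error-logfile.
import logging

from flask import Flask

app = Flask(__name__)

if __name__ != "__main__":
    # Running under gunicorn: adopt its handlers and log level.
    gunicorn_logger = logging.getLogger("gunicorn.error")
    app.logger.handlers = gunicorn_logger.handlers
    app.logger.setLevel(gunicorn_logger.level)

@app.route('/')
def index():
    app.logger.error("Something")
    return 'ok'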