I am running a Flask application and hosting it on Kubernetes from a Docker container. Gunicorn is managing workers that reply to API requests.
The following warning message is a regular occurrence, and it seems like requests are being canceled for some reason. On Kubernetes, the pod is showing no odd behavior or restarts and stays within 80% of its memory and CPU limits.
[2021-03-31 16:30:31 +0200] [1] [WARNING] Worker with pid 26 was terminated due to signal 9
How can we find out why these workers are killed?
I encountered the same warning message.
[WARNING] Worker with pid 71 was terminated due to signal 9
I came across this FAQ, which says that "A common cause of SIGKILL is when OOM killer terminates a process due to low memory condition."
I used dmesg and realized that it was indeed killed because it was running out of memory.
Out of memory: Killed process 776660 (gunicorn)
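If you have access to the node's kernel log, a quick way to confirm OOM kills yourself is to filter dmesg for them (a sketch; the exact message wording varies by kernel version):
dmesg -T | grep -iE "killed process|out of memory"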
In our case, the application was taking around 5-7 minutes to load ML models and dictionaries into memory.
So adding a timeout period of 600 seconds solved the problem for us.
gunicorn main:app \
--workers 1 \
--worker-class uvicorn.workers.UvicornWorker \
--bind 0.0.0.0:8443 \
--timeout 600
I encountered the same warning message when I limited the Docker container's memory, e.g. with -m 3000m.
See docker-memory and gunicorn - Why are Workers Silently Killed?
The simple way to avoid this is to set a higher memory limit for Docker, or not to set one at all.
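If you do want to keep a limit, just make it generous enough for your workers, for example (6g is only an illustrative value, and the image name is a placeholder):
docker run -m 6g my-gunicorn-image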
I was using AWS Beanstalk to deploy my flask application and I had a similar error.
In the log I saw:
web: MemoryError
[CRITICAL] WORKER TIMEOUT
[WARNING] Worker with pid XXXXX was terminated due to signal 9
I was using the t2.micro instance and when I changed it to t2.medium my app worked fine. In addition to this, I changed the timeout in my nginx config file.
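The nginx proxy timeouts I mean look roughly like this (a sketch, assuming a standard proxy_pass setup on port 8000; the 300s values are only examples):
location / {
    proxy_pass http://127.0.0.1:8000;
    proxy_connect_timeout 300s;
    proxy_send_timeout 300s;
    proxy_read_timeout 300s;
}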
In my case, the problem was a long application startup caused by ML model warm-up (over 3s).
It may be that your liveness check in kubernetes is killing your workers.
If your liveness check is configured as an http request to an endpoint in your service, your main request may block the health check request, and the worker gets killed by your platform because the platform thinks that the worker is unresponsive.
That was my case. I have a gunicorn app with a single uvicorn worker, which only handles one request at a time. It worked fine locally but would have the worker sporadically killed when deployed to kubernetes. It would only happen during a call that takes about 25 seconds, and not every time.
It turned out that my liveness check was configured to hit the /health route every 10 seconds, time out in 1 second, and retry 3 times. So this call would time out sometimes, but not always.
If this is your case, a possible solution is to reconfigure your liveness check (or whatever health check mechanism your platform uses) so it can wait until your typical request finishes. Or allow for more threads - anything that makes sure the health check is not blocked long enough to trigger a worker kill.
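For example, a more forgiving probe might look something like this (a sketch, assuming the service exposes /health on port 8000; tune the numbers to your slowest expected request):
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 30
  timeoutSeconds: 30
  failureThreshold: 3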
You can see that adding more workers may help with (or hide) the problem.
Also, see this reply to a similar question: https://stackoverflow.com/a/73993486/2363627
Check memory usage
In my case, I could not use the dmesg command, so I checked memory usage with a docker command instead:
sudo docker stats <container-id>
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
289e1ad7bd1d funny_sutherland 0.01% 169MiB / 1.908GiB 8.65% 151kB / 96kB 8.23MB / 21.5kB 5
In my case, the workers were not being terminated because of memory.
I encountered the same problem too, and it was because Docker's memory usage was limited to 2GB. If you are using Docker Desktop, you just need to go to Resources and increase the memory dedicated to Docker (if not, you need to find the Docker command line option to do that).
If that doesn't solve the problem, then it might be the timeout that kills the worker; you will need to add the timeout argument to the gunicorn command:
CMD ["gunicorn","--workers", "3", "--timeout", "1000", "--bind", "0.0.0.0:8000", "wsgi:app"]
Related
I am currently working on a service that is supposed to provide an HTTP endpoint in Cloud Run and I don't have much experience. I am currently using flask + gunicorn and can also call the service. My main problem now is optimising for multiple simultaneous requests. Currently, the service in Cloud Run has 4GB of memory and 1 CPU allocated to it. When it is called once, the instance that is started directly consumes 3.7GB of memory and about 40-50% of the CPU (I use a neural network to embed my data). Currently, my settings are very basic:
memory: 4096M
CPU: 1
min-instances: 0
max-instances: 1
concurrency: 80
Workers: 1 (Gunicorn)
Threads: 1 (Gunicorn)
Timeout: 0 (Gunicorn, as recommended by Google)
If I up the number of workers to two, I would need to up the memory to 8GB. If I do that, my service should be able to work on two requests simultaneously with one instance, provided the 1 allocated CPU has more than one core. But what happens if there is a third request? I would like to think that Cloud Run will start a second instance. Does the new instance also get 1 CPU and 8GB of memory, and if not, what is the best practice for me?
One of the best practices is to let Cloud Run scale automatically instead of trying to optimize each instance. Using 1 worker is a good idea to limit the memory footprint and reduce the cold start.
I recommend playing with the threads, typically setting them to 8 or 16, to leverage the concurrency parameter.
If you set those values too low, Cloud Run's internal load balancer will route the request to the instance, thinking it will be able to serve it, but if Gunicorn can't accept the new request, you will have issues.
Tune your service with the correct parameters for CPU and memory, but also threads and concurrency, to find the correct ones. Hey is a useful tool to stress your service and observe what happens when you scale.
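A typical hey run looks something like this (a sketch; the URL is a placeholder and the duration/concurrency values are only examples):
hey -z 60s -c 50 https://YOUR-SERVICE-URL/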
The best practice so far is: for environments with multiple CPU cores, increase the number of workers to be equal to the cores available. Timeout is set to 0 to disable the timeouts of the workers, which allows Cloud Run to handle instance scaling. Adjust the number of workers and threads on a per-application basis. For example, try using a number of workers equal to the cores available, make sure there is a performance improvement, then adjust the number of threads, i.e.:
CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 main:app
I'm running a flask application with gunicorn and the gevent worker class. In my own test environment, I follow the official guideline of multiprocessing.cpu_count() * 2 + 1 to set the worker number.
If I want to run the application in a Kubernetes pod and assume that the resources will be like
resources:
  limits:
    cpu: "10"
    memory: "5Gi"
  requests:
    cpu: "3"
    memory: "3Gi"
how do I calculate the worker number? Should I use the limits CPU or the requests CPU?
PS. I'm launching the application via a binary file packaged by PyInstaller, in essence flask run (python script.py), and I launch gunicorn in the main thread:
def run():
    ...
    if config.RUN_MODEL == 'GUNICORN':
        sys.argv += [
            "--worker-class", "gevent",
            "-w", config.GUNICORN_WORKER_NUMBER,
            "--worker-connections", config.GUNICORN_WORKER_CONNECTIONS,
            "--access-logfile", "-",
            "--error-logfile", "-",
            "-b", "0.0.0.0:8001",
            "--max-requests", config.GUNICORN_MAX_REQUESTS,
            "--max-requests-jitter", config.GUNICORN_MAX_REQUESTS_JITTER,
            "--timeout", config.GUNICORN_TIMEOUT,
            "--access-logformat", '%(t)s %(l)s %(u)s "%(r)s" %(s)s %(M)sms',
            "app.app_runner:app"
        ]
        sys.exit(gunicorn.run())

if __name__ == "__main__":
    run()
PS. Whether I set the worker number by the limits CPU (10*2+1=21) or the requests CPU (3*2+1=7), the performance still can't catch up with my expectations. Any suggestions to improve performance are welcome under this question.
How do I calculate the worker number? Should I use the limits CPU or the requests CPU?
It depends on your situation. First, look at the documentation about requests and limits (this example is for memory, but the same applies to CPU).
If the node where a Pod is running has enough of a resource available, it's possible (and allowed) for a container to use more resource than its request for that resource specifies. However, a container is not allowed to use more than its resource limit.
For example, if you set a memory request of 256 MiB for a container, and that container is in a Pod scheduled to a Node with 8GiB of memory and no other Pods, then the container can try to use more RAM.
If you set a memory limit of 4GiB for that container, the kubelet (and container runtime) enforce the limit. The runtime prevents the container from using more than the configured resource limit. For example: when a process in the container tries to consume more than the allowed amount of memory, the system kernel terminates the process that attempted the allocation, with an out of memory (OOM) error.
Answering your question: first of all, you need to know how many resources (e.g. CPU) your application needs. The request will be the minimum amount of CPU that the application must receive (you have to calculate this value yourself; in other words, you must know the minimum CPU the application needs to run properly and set that value). If your application performs better when it receives more CPU, consider adding a limit (this is the maximum amount of CPU an application can receive). If you want to calculate the worker number based on the highest performance, use the limit to calculate the value. If, on the other hand, you want your application to run smoothly (perhaps not as fast as possible, but consuming fewer resources), use the request.
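If you want to avoid hard-coding the number, one option is to derive it from a value injected into the container, for example the CPU limit exposed through the Kubernetes Downward API (resourceFieldRef on limits.cpu). A minimal sketch, where the CPU_LIMIT variable name and the fallback of 1 are my own assumptions:
import os

# CPU_LIMIT is assumed to be injected into the container, e.g. via the
# Downward API (resourceFieldRef on resource: limits.cpu).
cpu_limit = int(os.environ.get("CPU_LIMIT", "1"))

# The (2 * cores) + 1 rule of thumb from the gunicorn documentation.
workers = cpu_limit * 2 + 1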
I have an issue with gunicorn behind an nginx controller.
I have a microservice written in Python with aiohttp, and I am using gunicorn. The microservice is deployed in a Kubernetes cluster. I decided to test my app with a stress test, and for this purpose I used locust. The problem is: when I run my app in a Docker container locally, it shows pretty good results, but when I run the stress test in the Kubernetes cluster I see high memory usage by the pod where my app is running. I thought it was a memory leak, so I checked docker stats while stress testing the app locally, and it was using 80-90 MiB of RAM. But when I run the stress test within the cluster I see growing memory usage on the Grafana dashboard. Memory usage reaches up to 1.2 GB, and when I stop locust it does not stabilize - it just jumps between 600 MB and 1.2 GB, and I see spikes on the graph.
The pod is given 1 cpu and unlimited memory for now.
This is my gunicorn config:
workers = 1
bind = f"{SERVICE_HOST}:{SERVICE_PORT}"
worker_class = "aiohttp.GunicornUVLoopWebWorker"
#worker_connections = 4096
#max_requests = 4096
#max_requests_jitter = 100
I have tried different configurations of gunicorn with 3 workers (2*nCPU + 1) and max_requests with jitter to restart workers, but haven't got good results.
One thing I discovered: when I run a high load (500 users simultaneously), locust shows client timeouts with 'Remote disconnected'. I have read in the gunicorn docs that it is good practice to put gunicorn behind nginx, because nginx can buffer the responses. And when I am testing locally or within a cluster I do not have errors like that.
The main question I have not figured out yet is why the memory usage differs locally and within a cluster?
With 1 worker, when testing locally, docker stats shows 80-90 MiB, but the Grafana graph shows what I have already described...
First of all, thanks to @moonkotte for trying to help!
Today I found out what the cause of this problem is.
So, the problem is related to gunicorn workers and the prometheus_multiproc_dir env variable, where the path to save counter data is set. I don't actually know for now why this is happening, but I just deleted this env variable and everything worked fine, except Prometheus :). I think this relates to this issue and these limitations. I will dig deeper to solve this.
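For anyone who needs to keep the Prometheus client working across several gunicorn workers instead of just removing the variable, the prometheus_client multiprocess docs suggest cleaning up exited workers from a gunicorn child_exit hook. A rough sketch of that hook in gunicorn.conf.py (not what I ended up using):
from prometheus_client import multiprocess

def child_exit(server, worker):
    # Called by the gunicorn master when a worker exits; remove that
    # worker's metric files so stale counters don't pile up in the
    # multiprocess directory.
    multiprocess.mark_process_dead(worker.pid)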
I have an nginx + gunicorn Django application. Those processes never die; for some reason they just hang until there is no response:
Once it runs in a local Windows 10 env it works really well - no memory leaks, no hangs.
I think locally it only "forks" (I know Windows doesn't fork) one main process, but why do the gunicorn processes never die?
The issue was:
if qs:
    print('sql:', qs.query)
    print('explain:', qs.explain())
This evaluates the whole QuerySet, which is 2 million rows...
changed to:
if qs.exists():
    print('sql:', qs.query)
    print('explain:', qs.explain())
solved the issue.
I have several apps running on uWSGI. Most of them grow in memory usage over time. I've always attributed this to a memory leak that I hadn't tracked down. But lately I've noticed that the growth is quite chunky. I'm wondering if each chunk correlates with a process being started.
Does uWSGI start all processes at boot time, or does it only start up a new one when there are enough requests coming in to make it necessary?
Here's an example config:
[uwsgi]
strict = true
wsgi-file = foo.py
callable = app
die-on-term = true
http-socket = :2345
master = true
enable-threads = true
thunder-lock = true
processes = 6
threads = 1
memory-report = true
update: this looks relevant: http://uwsgi-docs.readthedocs.org/en/latest/Cheaper.html
Does "worker" mean the same thing as "process" (answer seems to be yes)? If so then it seems like if I want the number to remain constant always, I should do:
cheaper = 6
cheaper-initial = 6
processes = 6
Yes, uWSGI will start all processes (or workers - worker is an alias for process in uWSGI config) at boot time, but what happens from then on depends on your application. If the application imports all modules at boot time, it will be fully loaded before the first request; but if some modules are loaded at request time, each worker will be fully loaded only after its first requests (assuming that any request loads all modules - if not, a worker will be fully loaded only after serving the combination of requests that loads all of them).
But even after loading all modules, the application's memory usage won't be constant. There may be some logging, global variables, debug information etc. accumulating on every request. If you're using any framework, it is possible that it will save some data for debugging, statistics etc.
By default, cheaper mode is not enabled - that means uWSGI will spawn all workers at startup. If you want to use cheaper mode, you need to define at least the cheaper parameter. You can find more about the usage of the cheaper system in the documentation.
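A minimal dynamic setup, in contrast to the constant-size config above, might look like this (a sketch; the numbers are only illustrative):
[uwsgi]
; keep between 2 and 8 workers, starting with 4 at boot and adding 1 at a time
processes = 8
cheaper = 2
cheaper-initial = 4
cheaper-step = 1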
There are many other systems built into uWSGI to control load based on the amount of requests. For example:
uWSGI Legion subsystem
On demand vassals
Socket activation (with inetd/xinetd, with upstart, with systemd, with circus or mentioned above on demand vassals when running emperor mode)
Broodlord mode
Zerg mode
If you're worried that uWSGI will take up too many resources, there are solutions for that too:
Running uWSGI in a Linux CGroup
Linux Namespaces
Or even running docker instances