I have an issue with gunicorn behind nginx controller.
I have a microservice written in Python with aiohttp, and I am using gunicorn. The microservice is deployed in a Kubernetes cluster. I decided to stress test my app, using Locust for this purpose. The problem is: when I run the app in a Docker container locally, the results are pretty good, but when I run the stress test in the Kubernetes cluster I see high memory usage by the pod where my app is running. Suspecting a memory leak, I checked docker stats while stress testing the app locally, and it was using 80-90 MiB of RAM. But when I run the stress test within the cluster, I see memory usage growing on the Grafana dashboard. It reaches up to 1.2 GB, and when I stop Locust it does not stabilize; it just jumps between 600 MB and 1.2 GB, and I see spikes on the graph.
The pod is given 1 CPU and unlimited memory for now.
This is my gunicorn config:
workers = 1  # single worker process
bind = f"{SERVICE_HOST}:{SERVICE_PORT}"
worker_class = "aiohttp.GunicornUVLoopWebWorker"  # aiohttp worker on a uvloop event loop
# worker_connections = 4096
# max_requests = 4096
# max_requests_jitter = 100
I have tried different gunicorn configurations, with 3 workers (2*nCPU + 1) and max_requests with jitter to restart the workers, but haven't gotten good results.
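For illustration, that variant would look roughly like this in the gunicorn config file (the values here are examples, not necessarily the exact ones used):

workers = 3                  # 2 * nCPU + 1 with 1 CPU
max_requests = 4096          # recycle a worker after this many requests
max_requests_jitter = 100    # randomize the restart point so workers don't recycle at once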
One thing I discovered: when I generate high load (500 simultaneous users), Locust shows client timeouts with 'Remote disconnected'. I have read in the gunicorn docs that it is good practice to put gunicorn behind nginx because nginx can buffer the responses. And when I am testing locally or within the cluster I do not get errors like that.
The main question I have not figured out yet: why does the memory usage differ locally and within the cluster?
With 1 worker, when testing locally, docker stats shows 80-90 MiB, but the Grafana graph shows what I have already described...
First of all, thanks to @moonkotte for trying to help!
Today I found out what the cause of this problem is.
So, the problem is related to gunicorn workers and the prometheus_multiproc_dir env variable, where the path for saving counter data is set. I don't actually know yet why this is happening, but I just deleted this env variable and everything worked fine, except Prometheus :). I think this relates to this issue and these limitations. I will dig deeper to solve this.
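For context, here is a minimal sketch of how the Prometheus Python client's multiprocess mode is usually wired up under gunicorn (paths and function names are illustrative, not my exact setup). Each worker PID writes its own metric files into that directory, and recycled workers leave new files behind, which seems consistent with the growth I saw:

import os

from prometheus_client import CollectorRegistry, generate_latest, multiprocess

# The client reads this env variable (PROMETHEUS_MULTIPROC_DIR in newer
# releases, prometheus_multiproc_dir in older ones) and writes one set of
# metric files per worker PID into the directory.
os.environ.setdefault("prometheus_multiproc_dir", "/tmp/prometheus_metrics")

def metrics_payload() -> bytes:
    # Aggregate the per-process files into a single scrape response.
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    return generate_latest(registry)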
I am currently working on a service that is supposed to provide an HTTP endpoint in Cloud Run, and I don't have much experience. I am currently using Flask + gunicorn and can also call the service. My main problem now is optimising for multiple simultaneous requests. Currently, the service in Cloud Run has 4GB of memory and 1 CPU allocated to it. When it is called once, the instance that is started immediately consumes 3.7GB of memory and about 40-50% of the CPU (I use a neural network to embed my data). Currently, my settings are very basic:
memory: 4096M
CPU: 1
min-instances: 0
max-instances: 1
concurrency: 80
Workers: 1 (Gunicorn)
Threads: 1 (Gunicorn)
Timeout: 0 (Gunicorn, as recommended by Google)
If I up the number of workers to two, I would need to up the memory to 8GB. If I do that, my service should be able to work on two requests simultaneously with one instance, provided the 1 CPU allocated has more than one core. But what happens if there is a third request? I would like to think that Cloud Run will start a second instance. Does the new instance also get 1 CPU and 8GB of memory, and if not, what is the best practice for me?
One of the best practices is to let Cloud Run scale automatically instead of trying to optimize each instance. Using 1 worker is a good idea to limit the memory footprint and reduce the cold start.
I recommend playing with the threads, typically setting them to 8 or 16, to leverage the concurrency parameter.
If you set those values too low, Cloud Run's internal load balancer will route a request to the instance thinking it will be able to serve it, but if Gunicorn can't accept the new request, you will have issues.
Tune your service with the correct CPU and memory parameters, but also the threads and the concurrency, to find the right values. Hey is a useful tool to stress your service and observe what happens when you scale.
The best practice so far is: for environments with multiple CPU cores, increase the number of workers to be equal to the cores available. Timeout is set to 0 to disable worker timeouts and allow Cloud Run to handle instance scaling. Adjust the number of workers and threads on a per-application basis. For example, try using a number of workers equal to the cores available and make sure there is a performance improvement, then adjust the number of threads, i.e.:
CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 main:app
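The same settings can also live in a gunicorn config file. A minimal sketch, assuming the PORT variable Cloud Run provides plus illustrative env variable names for the worker and thread counts:

import multiprocessing
import os

# gunicorn.conf.py - illustrative equivalent of the CLI flags above
bind = f":{os.environ.get('PORT', '8080')}"
workers = int(os.environ.get('WEB_CONCURRENCY', multiprocessing.cpu_count()))
threads = int(os.environ.get('THREADS', '8'))
timeout = 0  # disable the worker timeout so Cloud Run handles request timeouts and scaling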
I am using Python Flask in a GKE container and memory keeps increasing inside the pod. I have set a limit on the pod, but it's getting killed.
I am thinking it's a memory leak; can anybody suggest something after looking at this? As disk usage increases, memory also increases, and there are some page faults as well.
Is there anything on the container side in the Linux OS (I'm using the python-slim base image)? Is memory not being returned to the OS, or is it a Python/Flask memory management issue?
To check for a memory leak I have added StackImpact to the application.
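As a lighter-weight cross-check, here is a hedged sketch using the standard library's tracemalloc inside the Flask app (the /debug/memory route is made up for illustration):

import tracemalloc

from flask import Flask

app = Flask(__name__)
tracemalloc.start()  # begin tracking Python-level allocations at startup

@app.route("/debug/memory")
def memory_top():
    # Report the top allocation sites; if these keep growing the leak is in
    # Python objects, otherwise the memory may be held by the allocator/OS.
    snapshot = tracemalloc.take_snapshot()
    top = snapshot.statistics("lineno")[:10]
    return "\n".join(str(stat) for stat in top), 200, {"Content-Type": "text/plain"}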
Please help...!
Thanks in advance
If you added a resource memory limit to each GKE Deployment, then when the memory limit is hit the pod is killed, rescheduled, and restarted, and the other pods on the node should be fine.
You can find more information by running this command:
kubectl describe pod <YOUR_POD_NAME>
kubectl top pods
Please note that if you set a memory request larger than the amount of memory available on your nodes, the pod will never be scheduled.
And if the Pod cannot be scheduled because of insufficient resources or a configuration error, you might encounter an error indicating a lack of memory or of another resource. If a Pod is stuck in Pending, it means it cannot be scheduled onto a node. In this case you need to delete Pods, adjust resource requests, or add new nodes to your cluster. You can find more information here.
Additionally, as per this document, Horizontal Pod Autoscaling (HPA) scales the replicas of your deployments based on metrics like memory or CPU usage.
I have been deploying apps to Kubernetes for the last 2 years, and in my org all our apps (especially stateless ones) run in Kubernetes. I still have a fundamental question, because very recently we found some issues with a few of our Python apps.
Initially, when we deployed our Python apps (written in Flask and Django), we ran them with python app.py. It's known that, because of the GIL, Python threads cannot execute Python code in parallel, so the app will only serve one request at a time; and if that one request is CPU heavy, it cannot process further requests. This sometimes causes the health API to stop responding. We have observed that if there is a single request that is not I/O bound and is doing some computation, it holds the CPU and we cannot process another request in parallel. And since it is only doing a modest amount of work, we have observed there is no increase in CPU utilization either. This has an impact on how the HorizontalPodAutoscaler works; it is unable to scale the pods.
Because of this, we started using uWSGI in our pods. uWSGI can basically run multiple processes under the hood and handle multiple requests in parallel, automatically spinning up new processes on demand. But here comes another problem we have seen: uWSGI is slow at auto-scaling processes to serve requests, and this causes HTTP 503 errors. Because of this we are unable to serve a few of our APIs with 100% availability.
At the same time, all our other apps, written in Node.js, Java and Golang, give 100% availability.
I am looking for the best way to run a Python app with 100% (99.99%) availability in Kubernetes, with the following:
Having health API and liveness API served by the app
An app running in Kubernetes
If possible, without uWSGI (a single process per pod is the fundamental Docker concept)
If uWSGI is needed, is there any specific config we can apply in the k8s environment?
We use Twisted's WSGI server with 30 threads and it's been solid for our Django application. It keeps to a single process per pod model, which more closely matches Kubernetes' expectations, as you mentioned. Yes, the GIL means only one of those 30 threads can be running Python code at a time, but as with most webapps, most of those threads are blocked on I/O (usually waiting for a response from the database) the vast majority of the time. Then run multiple replicas on top of that, both for redundancy and to give you true concurrency at whatever level you need (we usually use 4-8 depending on the site traffic; some big ones are up to 16).
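For reference, a minimal sketch of wiring a Django WSGI application into Twisted's WSGI server with a 30-thread pool (the myproject.wsgi module and the port are placeholders, not our exact setup):

from twisted.internet import reactor
from twisted.python.threadpool import ThreadPool
from twisted.web.server import Site
from twisted.web.wsgi import WSGIResource

from myproject.wsgi import application  # placeholder: your Django WSGI app

# Dedicated pool of 30 threads for WSGI requests; the GIL means only one runs
# Python at a time, but threads blocked on I/O (DB, caches) don't hold it.
pool = ThreadPool(minthreads=1, maxthreads=30)
pool.start()
reactor.addSystemEventTrigger("after", "shutdown", pool.stop)

site = Site(WSGIResource(reactor, pool, application))
reactor.listenTCP(8000, site)
reactor.run()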
I have exactly the same problem with a Python deployment running a Flask application. Most API calls are handled in a matter of seconds, but there are some CPU-intensive requests that hold the GIL for 2 minutes... The pod keeps accepting requests, ignores the configured timeouts, ignores a connection closed by the user; then, after 1 minute of failing liveness probes, the pod is restarted by the kubelet.
So one fat request can dramatically drop the availability.
I see two different solutions:
have a separate deployment that hosts only the long-running API calls, and configure the ingress to route requests between these two deployments;
using multiprocessing, handle liveness/readiness probes in the main process and every other request in a child process (a rough sketch follows below);
There are pros and cons for each solution; maybe I will need a combination of both. Also, if I need a steady flow of Prometheus metrics, I might need to create a proxy server on the application layer (one more container in the same pod). I also need to configure the ingress to keep a single upstream connection to the Python pods, so that long-running requests are queued while short ones are processed concurrently (yep, Python, concurrency, good joke). Not sure, though, that it will scale well with HPA.
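A rough sketch of the second option, assuming the app is served with a threaded server so the probe route stays free while a child process works (route names and the work function are made up):

from concurrent.futures import ProcessPoolExecutor

from flask import Flask, jsonify, request

app = Flask(__name__)
# Heavy work runs in a child process; the main process only waits on the
# future (releasing the GIL), so liveness/readiness probes keep answering.
executor = ProcessPoolExecutor(max_workers=2)

def heavy_call(payload):
    # Stand-in for the CPU-bound request that would otherwise hold the GIL.
    return sum(i * i for i in range(10_000_000))

@app.route("/healthz")
def healthz():
    return "ok"

@app.route("/heavy", methods=["POST"])
def heavy():
    future = executor.submit(heavy_call, request.get_json(silent=True))
    return jsonify(result=future.result())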
So yeah, running a production-ready Python REST API server on Kubernetes is not a piece of cake. Go and Java have a much better ecosystem for microservice applications.
PS
Here is a good article that shows there is no need to run your app in Kubernetes with WSGI:
https://techblog.appnexus.com/beyond-hello-world-modern-asynchronous-python-in-kubernetes-f2c4ecd4a38d
PPS
I'm considering using the Prometheus exporter for Flask. It looks better than running a Python client in a separate thread:
https://github.com/rycus86/prometheus_flask_exporter
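Basic wiring looks roughly like this (a minimal sketch following that project's README):

from flask import Flask
from prometheus_flask_exporter import PrometheusMetrics

app = Flask(__name__)
# Registers default per-request metrics and exposes them on /metrics of the
# same app, so no extra thread or sidecar is needed for scraping.
metrics = PrometheusMetrics(app)

@app.route("/")
def index():
    return "hello"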
I have a Flask-Restful App with a pretty standard server stack
WSGI Server : Gunicorn
Async worker class : Gevent / sync
When I start/restart my Flask app via supervisorctl, the CPU load goes very high until the app is loaded, and it takes around 10-20 seconds for the app to load.
I've tried running the app on the following instance configs
8 core, 15 GB RAM
2 core, 7.5 GB RAM
But on both instances the behaviour is quite similar: CPU usage rises drastically to ~35-40%.
I'm not able to find the root cause of this load. Could there be an issue with Gunicorn combined with Flask-Restful?
Any help would be appreciated.
I'm running a Flask app on Heroku (on the free tier) and running into some trouble when scheduling tasks using apply_async. If I schedule more than two tasks, I get a long stacktrace with the exception:
AccessRefused(403, u"ACCESS_REFUSED - access to exchange 'celeryresults' in vhost 'rthtwchf' refused for user 'rthtwchf'", (40, 10), 'Exchange.declare')
The odd thing is the first two tasks (before restarting all of my processes) always seem to complete with no issue.
A little bit of search engine sleuthing leads me to https://stackoverflow.com/questions/21071906/celery-cannot-connect-remote-worker-with-new-username which makes it look like a permissions issue, but I'd assume that the Heroku CloudAMQP service would have taken care of that already.
Any advice is appreciated!
I think your connections are exceeding 3 (the free plan limit). Set BROKER_POOL_LIMIT to 1 and it will work.
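A minimal sketch of where that setting goes, assuming the broker URL is provided through the CLOUDAMQP_URL env variable (adjust to your setup):

import os

from celery import Celery

app = Celery("tasks", broker=os.environ.get("CLOUDAMQP_URL"))
# Cap the broker connection pool so the worker plus the web dyno stay under
# the free plan's connection limit (older config name: BROKER_POOL_LIMIT).
app.conf.broker_pool_limit = 1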