Cloud Run with Gunicorn Best-Practise


I am currently working on a service that is supposed to provide an HTTP endpoint in Cloud Run and I don't have much experience. I am currently using flask + gunicorn and can also call the service. My main problem now is optimising for multiple simultaneous requests. Currently, the service in Cloud Run has 4GB of memory and 1 CPU allocated to it. When it is called once, the instance that is started directly consumes 3.7GB of memory and about 40-50% of the CPU (I use a neural network to embed my data). Currently, my settings are very basic:
memory: 4096M
CPU: 1
min-instances: 0
max-instances: 1
concurrency: 80
Workers: 1 (Gunicorn)
Threads: 1 (Gunicorn)
Timeout: 0 (Gunicorn, as recommended by Google)
If I increase the number of workers to two, I would need to increase the memory to 8GB. If I do that, my service should be able to work on two requests simultaneously with one instance, provided the 1 allocated CPU has more than one core. But what happens if there is a third request? I would like to think that Cloud Run will start a second instance. Does the new instance also get 1 CPU and 8GB of memory, and if not, what is the best practice for me?

One of the best practices is to let Cloud Run scale automatically instead of trying to optimize each instance. Using 1 worker is a good idea to limit the memory footprint and reduce the cold start time.
I recommend experimenting with the threads, typically setting them to 8 or 16, to make use of the concurrency parameter.
If you set the thread count too low relative to the concurrency, the Cloud Run internal load balancer will still route requests to the instance, thinking it will be able to serve them, but if Gunicorn can't accept new requests, you will have issues.
Tune your service's CPU and memory, and also the threads and the concurrency, to find the right values. Hey is a useful tool to stress your service and observe what happens when you scale.
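As a rough sketch of how the concurrency setting relates to what Gunicorn can actually work on at once: with thread-based workers, one instance handles about workers x threads requests at the same time. The helper names below are made up for illustration, and the numbers are only starting points to verify under load.

```python
def gunicorn_capacity(workers: int, threads: int) -> int:
    """Requests one instance can work on at the same time (thread-based workers)."""
    return workers * threads


def check_concurrency(concurrency: int, workers: int, threads: int) -> str:
    """Compare the Cloud Run concurrency setting with Gunicorn's capacity."""
    capacity = gunicorn_capacity(workers, threads)
    if concurrency > capacity:
        return (f"concurrency {concurrency} > capacity {capacity}: "
                "extra requests queue inside the instance")
    return f"concurrency {concurrency} <= capacity {capacity}: ok"


# Settings from the question: concurrency 80 with 1 worker x 1 thread.
print(check_concurrency(80, workers=1, threads=1))
# One possible starting point: 1 worker, 8 threads, concurrency lowered to match.
print(check_concurrency(8, workers=1, threads=8))
```

For a model that already uses 40-50% of the CPU per request, the thread count that actually helps may be well below 8, so it is worth measuring rather than assuming.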

The best practice so far is: for environments with multiple CPU cores, increase the number of workers to be equal to the cores available. The timeout is set to 0 to disable the worker timeouts and allow Cloud Run to handle instance scaling. Adjust the number of workers and threads on a per-application basis. For example, try a number of workers equal to the cores available, make sure there is a performance improvement, then adjust the number of threads, e.g.:
CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 main:app
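The same settings can also live in a Gunicorn config file, which is plain Python. This is only a sketch mirroring the flags above; the file name and the 8080 fallback port are assumptions made for the example.

```python
# gunicorn.conf.py: a sketch equivalent to the command-line flags above
import os

# Cloud Run passes the port to listen on in the PORT environment variable;
# 8080 is only a local fallback chosen for this sketch.
bind = f":{os.environ.get('PORT', '8080')}"
workers = 1   # one worker process, so the model is loaded once
threads = 8   # requests handled concurrently inside that worker
timeout = 0   # disable Gunicorn's worker timeout, as in the CMD line
```

With such a file in the image, the command would become something like CMD exec gunicorn --config gunicorn.conf.py main:app.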

Related

How to set gunicorn worker number in a kubernetes' pod

I'm running a flask application with gunicorn and gevent worker class. In my own test environment, I follow the official guide multiprocessing.cpu_count() * 2 + 1 to set worker number.
If I want to put the application on Kubernetes' pod and assume that resources will be like
resources:
  limits:
    cpu: "10"
    memory: "5Gi"
  requests:
    cpu: "3"
    memory: "3Gi"
how to calculate the worker number? should I use limits CPU or requests CPU?
PS. I'm launching the application via a binary file packaged by pyinstaller, in essence flask run (python script.py), and I launch gunicorn in the main thread:
def run():
    ...
    if config.RUN_MODEL == 'GUNICORN':
        # Append the gunicorn options to the command line before handing over to gunicorn
        sys.argv += [
            "--worker-class", "gevent",
            "-w", config.GUNICORN_WORKER_NUMBER,
            "--worker-connections", config.GUNICORN_WORKER_CONNECTIONS,
            "--access-logfile", "-",
            "--error-logfile", "-",
            "-b", "0.0.0.0:8001",
            "--max-requests", config.GUNICORN_MAX_REQUESTS,
            "--max-requests-jitter", config.GUNICORN_MAX_REQUESTS_JITTER,
            "--timeout", config.GUNICORN_TIMEOUT,
            "--access-logformat", '%(t)s %(l)s %(u)s "%(r)s" %(s)s %(M)sms',
            "app.app_runner:app"
        ]
        sys.exit(gunicorn.run())

if __name__ == "__main__":
    run()
PS. Whether I set the worker number from the limits CPU (10*2+1=21) or the requests CPU (3*2+1=7), the performance still can't catch up with my expectations. Any suggestions for improving performance are welcome under this question.

how to calculate the worker number? should I use limits CPU or requests CPU?
It depends on your situation. First, look at the documentation about requests and limits (this example is for memory, but the same applies to CPU).
If the node where a Pod is running has enough of a resource available, it's possible (and allowed) for a container to use more resource than its request for that resource specifies. However, a container is not allowed to use more than its resource limit.
For example, if you set a memory request of 256 MiB for a container, and that container is in a Pod scheduled to a Node with 8GiB of memory and no other Pods, then the container can try to use more RAM.
If you set a memory limit of 4GiB for that container, the kubelet (and container runtime) enforce the limit. The runtime prevents the container from using more than the configured resource limit. For example: when a process in the container tries to consume more than the allowed amount of memory, the system kernel terminates the process that attempted the allocation, with an out of memory (OOM) error.
To answer your question: first of all, you need to know how many resources (e.g. CPU) your application needs. The request is the minimum amount of CPU that the application must receive. You have to determine this value yourself; in other words, you must know the minimum CPU the application needs to run properly and set the request to that. If your application performs better when it receives more CPU, consider adding a limit (the maximum amount of CPU the application can receive). If you want to calculate the worker number for the highest performance, use the limit. If, on the other hand, you want your application to run smoothly (perhaps not as fast as possible, but consuming fewer resources), use the request.
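A small sketch of that calculation, applying the (2 x cores) + 1 starting formula to a Kubernetes CPU quantity. The function names are hypothetical. Passing the value in explicitly is deliberate: inside a container, multiprocessing.cpu_count() typically reports the node's CPUs rather than the pod's request or limit.

```python
def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity ("3", "0.5", "500m") to cores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):  # millicores, e.g. "500m" is half a core
        return int(quantity[:-1]) / 1000
    return float(quantity)


def suggested_workers(cpu_quantity: str) -> int:
    """Starting-point worker count, (2 x cores) + 1, using whole cores (minimum 1)."""
    cores = max(1, int(parse_cpu(cpu_quantity)))
    return 2 * cores + 1


print(suggested_workers("3"))   # requests.cpu from the question: 7
print(suggested_workers("10"))  # limits.cpu from the question: 21
```

Either result is only a first guess; the number that actually performs best still has to be found by testing under load.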

gunicorn behind nginx high memory usage

I have an issue with gunicorn behind nginx controller.
I have a microservice written in Python with aiohttp, and I am using gunicorn. The microservice is deployed in a Kubernetes cluster. I decided to stress test my app, and for this purpose I used locust. The problem is: when I run my app in a Docker container locally, it shows pretty good results, but when I run the stress test in a Kubernetes cluster I see high memory usage by the pod where my app is running. I thought it was a memory leak, so I checked docker stats while stress testing my app locally, and it was using 80-90 MiB of RAM. But when I run the stress test within the cluster, I see growing memory usage on the Grafana dashboard. Memory usage reaches up to 1.2 GB, and when I stop locust it does not stabilize; it just jumps between 600 MB and 1.2 GB, and I see spikes on the graph.
The pod is given 1 cpu and unlimited memory for now.
This is my gunicorn config:
workers = 1
bind = f"{SERVICE_HOST}:{SERVICE_PORT}"
worker_class = "aiohttp.GunicornUVLoopWebWorker"
#worker_connections = 4096
#max_requests = 4096
#max_requests_jitter = 100
I have tried different gunicorn configurations, with 3 workers (2*nCPU + 1) and with max_requests plus jitter to restart workers, but haven't got good results.
One thing I discovered - when I am doing high load (500 users simultaneously) locust shows client timeouts with 'Remote disconnected'. I have read in gunicorn docs that it is a good practice to put gunicorn behind nginx because nginx can buffer the responses. And when I am testing locally or within a cluster I do not have errors like that.
The main question I have not figured out yet is why the memory usage differs locally and within a cluster?
With 1 worker when testing locally docker stats shows 80-90 MiB, but grafana graph shows what I have already described...

First of all thanks to #moonkotte for trying to help!
Today I found out what the cause of this problem is.
So, the problem is related to gunicorn workers and the prometheus_multiproc_dir env variable, which sets the path where counter data is saved. I don't know yet why this is happening, but I just deleted this env variable and everything worked fine, except Prometheus :). I think this relates to this issue and these limitations. Will dig deeper to solve this.
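In multiprocess mode, the Prometheus Python client writes per-process metric files into that directory. One thing that could be checked, as a stdlib-only sketch, is whether the directory grows during a load test. The helper name is made up, and whether this explains the memory graph depends on how the directory is mounted, which the post does not say.

```python
import os


def dir_usage(path: str) -> tuple[int, int]:
    """Return (number of files, total bytes) for files directly inside path."""
    count = 0
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file():
                count += 1
                total += entry.stat().st_size
    return count, total


# Look up the directory under either spelling of the variable;
# "." is only a fallback so the sketch runs anywhere.
target = (os.environ.get("PROMETHEUS_MULTIPROC_DIR")
          or os.environ.get("prometheus_multiproc_dir")
          or ".")
files, size = dir_usage(target)
print(f"{target}: {files} files, {size} bytes")
```

Running this inside the pod before, during and after the locust run would show whether the file count keeps climbing as workers restart.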

uwsgi worker not distributing evenly

I have a Django project configured with nginx and uwsgi. There isn't much cpu processing involved in the website. There is mostly simple read, but we expect lot of hits. I have used apache bench mark for load testing. Giving a simple ab -n 200 -c 200 <url> is making the website slower (while the benchmark test is on, not able to open the website in any browser even from a different ip address). I have given number of processes as 16 and threads as 8. my uwsgi.ini file is given below.
[uwsgi]
master = true
socket = /tmp/uwsgi.sock
chmod-socket = 666
chdir = <directory>
wsgi-file = <wsgi file path>
processes = 16
threads = 8
virtualenv = <virtualenv path>
vacuum = true
enable-threads = true
daemonize= <log path>
stats= /tmp/stats.sock
When I check uwsgitop, I see that workers 7 and 8 are handling most of the requests; the rest process far fewer requests. Could this be the reason why I cannot load the website in a browser while the benchmark is running? How can I use the uwsgi processes efficiently to serve the maximum number of concurrent requests?
This is the result of htop. Not much memory or processor is used during the benchmark. Can somebody help me set up the server efficiently?

As far as I can see, there are only 2 cores. You cannot spread a massive number of processes and threads over just two cores. You only gain something if your threads have to wait for I/O; then they go to sleep and others can work.
At most two (= the number of cores) run at the same time.
You do not provide much information about your app except that it's "mostly simple read, but we expect lot of hits". This is not the sound of a lot of IO waits.
I guess the database is running on the same host as well (will need some CPU time as well)
Try to lower your threads/processes to 4 at first. Then play around with +/- 1 and test accordingly.
Read https://uwsgi-docs.readthedocs.io/en/latest/ThingsToKnow.html
You'll find sentences like:
There is no magic rule for setting the number of processes or threads
to use. It is very much application and system dependent.
By default the Python plugin does not initialize the GIL. This means
your app-generated threads will not run. If you need threads, remember
to enable them with enable-threads. Running uWSGI in multithreading
mode (with the threads options) will automatically enable threading
support. This “strange” default behaviour is for performance reasons,
no shame in that.

If you have the budget, upgrade your processor to one your motherboard supports; a Core i3 or above would be better.
This is because you have only a two-core processor, which easily overheats when you run multi-threaded software. You can't run every task on it. Sometimes it runs fast and then stalls under heavy multi-threaded software.

Production ready Python apps on Kubernetes

I have been deploying apps to Kubernetes for the last 2 years. And in my org, all our apps(especially stateless) are running in Kubernetes. I still have a fundamental question, just because very recently we found some issues with respect to our few python apps.
Initially, when we deployed our Python apps (written in Flask and Django), we ran them using python app.py. It's known that, because of the GIL, Python doesn't really run system threads in parallel, and it will only serve one request at a time; if that one request is CPU heavy, it will not be able to process further requests. This sometimes causes the health API to stop responding. We have observed that if there is a single request which is not I/O-bound and is doing some computation, it holds the CPU and we cannot process another request in parallel. And since it's only doing a few operations, we have observed no increase in CPU utilization either. This has an impact on how the HorizontalPodAutoscaler works: it is unable to scale the pods.
Because of this, we started using uWSGI in our pods. Basically, uWSGI can run multiple processes under the hood, handle multiple requests in parallel, and automatically spin up new processes on demand. But here comes another problem we have seen: uWSGI is too slow at auto-scaling the processes to serve the requests, and this causes HTTP 503 errors. Because of this, we are unable to serve a few of our APIs with 100% availability.
At the same time, all our other apps, written in nodejs, java and golang, are giving 100% availability.
I am looking at what is the best way by which I can run a python app in 100%(99.99) availability in Kubernetes, with the following
Having health API and liveness API served by the app
An app running in Kubernetes
If possible without uwsgi(Single process per pod is the fundamental docker concept)
If with uwsgi, are there any specific config we can apply for k8s env

We use Twisted's WSGI server with 30 threads and it's been solid for our Django application. It keeps to a single-process-per-pod model, which more closely matches Kubernetes' expectations, as you mentioned. Yes, the GIL means only one of those 30 threads can be running Python code at a time, but as with most webapps, most of those threads are blocked on I/O (usually waiting for a response from the database) the vast majority of the time. Then run multiple replicas on top of that, both for redundancy and to give you true concurrency at whatever level you need (we usually use 4-8 depending on the site traffic; some big ones are up to 16).
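The point about threads blocked on I/O can be illustrated with a stdlib-only sketch. Here time.sleep stands in for waiting on the database, and the function names are made up for the example; it is not the Twisted setup described above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(wait_s: float) -> float:
    """Stand-in for a view that spends its time waiting on the database."""
    time.sleep(wait_s)  # sleeping releases the GIL, like a blocking socket read
    return wait_s


def run_batch(n_requests: int, n_threads: int, wait_s: float = 0.1) -> float:
    """Serve n_requests with n_threads and return the wall-clock time taken."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(fake_request, [wait_s] * n_requests))
    return time.perf_counter() - start


print(f"1 thread:  {run_batch(8, 1):.2f}s")
print(f"8 threads: {run_batch(8, 8):.2f}s")
```

With one thread the waits add up; with eight they overlap. The same experiment with a CPU-bound function in place of the sleep would show little or no gain, which is why replicas are still needed for true parallelism.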

I have exactly the same problem with a Python deployment running a Flask application. Most API calls are handled in a matter of seconds, but there are some CPU-intensive requests that hold the GIL for 2 minutes. The pod keeps accepting requests, ignores the configured timeouts, and ignores a connection closed by the user; then, after 1 minute of failing liveness probes, the pod is restarted by the kubelet.
So 1 fat request can dramatically drop the availability.
I see two different solutions:
have a separate deployment that will host only long running api calls; configure ingress to route requests between these two deployments;
using multiprocessing, handle liveness/readiness probes in the main process, while every other request is handled in a child process;
There are pros and cons for each solution, maybe I will need a combination of both. Also if I need a steady flow of prometheus metrics, I might need to create a proxy server on the application layer (1 more container on the same pod). Also need to configure ingress to have a single upstream connection to python pods, so that long running request will be queued, whereas short ones will be processed concurrently (yep, python, concurrency, good joke). Not sure tho it will scale well with HPA.
So yeah, running production ready python rest api server on kubernetes is not a piece of cake. Go and java have a much better ecosystem for microservice applications.
PS
here is a good article that shows that there is no need to run your app in kubernetes with WSGI
https://techblog.appnexus.com/beyond-hello-world-modern-asynchronous-python-in-kubernetes-f2c4ecd4a38d
PPS
I'm considering using the Prometheus exporter for Flask. It looks better than running a Python client in a separate thread:
https://github.com/rycus86/prometheus_flask_exporter

How to calculate max requests per second of a Django app?

I am about the deploy a Django app, and then it struck me that I couldn't find a way to anticipate how many requests per second my application can handle.
Is there a way of calculating how many requests per second can a Django application handle, without resorting to things like doing a test deployment and use an external tool such as locust?
I know there are several factors involved (such as number of database queries, etc.), but perhaps there is a convenient way of calculating, even estimating, how many visitors can a single Django app instance handle.
EDIT: Removed the mention to Gunicorn, since it only adds confusion to what I truly wanted to know.

Is there a way of calculating how many requests per second can a
Django application handle, without resorting to things like doing a
test deployment and use an external tool such as locust?
No and yes. As mackarone pointed out, I don't think there's any way to avoid measuring it. Consider the case where you did a local benchmark on your local dev server talking to a local DB instance, in order to generate a baseline for estimation. The issue with this is that the hardware and the network (distance between services) make a huge difference. So any numbers you generated locally would be relatively worthless for capacity planning.
In my experience, local testing is great for relative changes. Consider the case where you wanted to see the performance impact of SQL query planning. Establishing a local baseline, making the change, then observing the effect locally is useful to gauge relative speedup.
How to generate these numbers?
I would recommend deploying the app to the hardware and network you plan on testing on. This deployment should use your production configuration and component topology (i.e. if you're going to run gunicorn, make sure gunicorn is running instead of NGINX, or if you're going to have a proxy in front of gunicorn, make sure that is set up). I would run a single instance of your application using your production config.
Once this is running, I would launch a load test against the single instance using any of the popular load testing tools:
Apache Benchmark
Siege
Vegeta
K6
etc
You can launch these load tests from your single machine and ramp up traffic until response times are no longer acceptable in order to get a feel for the # of concurrent connections, and throughput your application can accommodate.
Now you have some idea of what a single instance of your service is able to handle. Up until your db (or other shared resources) are saturated these numbers can be used to project how many instances of your service are necessary to handle some amount of traffic!
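The projection step is simple arithmetic and can be sketched as follows. The function name and the headroom parameter are assumptions added for the example, and the figures are invented rather than measured.

```python
import math


def instances_needed(target_rps: float, per_instance_rps: float,
                     headroom: float = 1.0) -> int:
    """Instances required for target_rps, using only a fraction (headroom)
    of each instance's measured capacity."""
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    return math.ceil(target_rps / (per_instance_rps * headroom))


# Example: one instance measured at 120 requests/s, target of 1000 requests/s.
print(instances_needed(1000, 120))                # 9
print(instances_needed(1000, 120, headroom=0.7))  # 12
```

This linear projection only holds until the database or another shared resource saturates, as noted above.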

According to the Gunicorn documentation
How Many Workers?
DO NOT scale the number of workers to the number of clients you expect to have. Gunicorn should only need 4-12 worker processes to handle hundreds or thousands of requests per second.
Gunicorn relies on the operating system to provide all of the load balancing when handling requests. Generally we recommend (2 x $num_cores) + 1 as the number of workers to start off with. While not overly scientific, the formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request.
Obviously, your particular hardware and application are going to affect the optimal number of workers. Our recommendation is to start with the above guess and tune using TTIN and TTOU signals while the application is under load.
Always remember, there is such a thing as too many workers. After a point your worker processes will start thrashing system resources decreasing the throughput of the entire system.
The best thing is to tune it using a load testing tool such as locust, as you mentioned.
Emphasis mine
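The TTIN and TTOU signals mentioned in the quote can be sent from Python as well as from the shell. This is a Unix-only sketch: the function name, the injectable send parameter and the pause between signals are choices made for the example, and master_pid is the PID of the Gunicorn master process.

```python
import os
import signal
import time


def adjust_workers(master_pid: int, delta: int, send=os.kill,
                   pause_s: float = 0.5) -> None:
    """Ask a running Gunicorn master to add (delta > 0) or remove (delta < 0) workers.

    The master adds one worker per TTIN signal and removes one per TTOU.
    """
    sig = signal.SIGTTIN if delta > 0 else signal.SIGTTOU
    for _ in range(abs(delta)):
        send(master_pid, sig)
        # Pause so that consecutive identical signals are not merged
        # before the master has handled the previous one.
        time.sleep(pause_s)
```

For example, adjust_workers(master_pid, 2) while a load test is running adds two workers, so the effect on throughput can be observed without restarting the server.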

You have to install loadtest first; it is an npm package.
I was learning Redis when I found it, and it worked for me.
For more, check this tutorial: https://realpython.com/caching-in-django-with-redis/#start-by-measuring-performance
npm install -g loadtest
loadtest -n 100 -k http://localhost:8000/myUrl/
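If installing an npm package is not an option, a similar quick measurement can be sketched with the Python standard library alone. This is not a replacement for a real load testing tool; the function names are made up, and failed requests simply raise an exception.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def timed_get(url: str) -> float:
    """Fetch url once and return the elapsed time in seconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=10) as response:
        response.read()
    return time.perf_counter() - start


def load_test(url: str, n_requests: int = 100, concurrency: int = 10) -> dict:
    """Send n_requests GETs using `concurrency` threads; report throughput and latency."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_get, [url] * n_requests))
    elapsed = time.perf_counter() - start
    return {
        "requests": n_requests,
        "rps": n_requests / elapsed,
        "mean_s": sum(latencies) / len(latencies),
        "max_s": latencies[-1],
    }
```

Calling load_test("http://localhost:8000/myUrl/", n_requests=100) roughly mirrors the command above and returns the requests per second together with the mean and maximum latency.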
