I'm using Python with waitress to serve requests. They are being served really slowly and the queue eventually is full and blocked. However the use of CPU is almost nothing, around 2 % of the resources.
I'm serving waitress as
serve(PrefixMiddleware(app_config), port=8041, url_scheme='http')
I've tried to add more threads and increase the backlog, with a small improvement but this doesn't scale well.
What am I doing wrong? How can I achieve that a bigger portion of the CPU is used to actually process the requests?
Related
I have been deploying apps to Kubernetes for the last 2 years. And in my org, all our apps(especially stateless) are running in Kubernetes. I still have a fundamental question, just because very recently we found some issues with respect to our few python apps.
Initially when we deployed, our python apps(Written in Flask and Django), we ran it using python app.py. It's known that, because of GIL, python really doesn't have support for system threads, and it will only serve one request at a time, but in case the one request is CPU heavy, it will not be able to process further requests. This is causing sometimes the health API to not work. We have observed that, at this moment, if there is a single request which is not IO and doing some operation, we will hold the CPU and cannot process another request in parallel. And since it's only doing fewer operations, we have observed there is no increase in the CPU utilization also. This has an impact on how HorizontalPodAutoscaler works, its unable to scale the pods.
Because of this, we started using uWSGI in our pods. So basically uWSGI can run multiple pods under the hood and handle multiple requests in parallel, and automatically spin new processes on demand. But here comes another problem, that we have seen, uwsgi is lacking speed in auto-scaling the process tocorrected serve the request and its causing HTTP 503 errors, Because of this we are unable to serve our few APIs in 100% availability.
At the same time our all other apps, written in nodejs, java and golang, is giving 100% availability.
I am looking at what is the best way by which I can run a python app in 100%(99.99) availability in Kubernetes, with the following
Having health API and liveness API served by the app
An app running in Kubernetes
If possible without uwsgi(Single process per pod is the fundamental docker concept)
If with uwsgi, are there any specific config we can apply for k8s env
We use Twisted's WSGI server with 30 threads and it's been solid for our Django application. Keeps to a single process per pod model which more closely matches Kubernetes' expectations, as you mentioned. Yes, the GIL means only one of those 30 threads can be running Python code at time, but as with most webapps, most of those threads are blocked on I/O (usually waiting for a response from the database) the vast majority of the time. Then run multiple replicas on top of that both for redundancy and to give you true concurrency at whatever level you need (we usually use 4-8 depending on the site traffic, some big ones are up to 16).
I have exactly the same problem with a python deployment running the Flask application. Most api calls are handled in a matter of seconds, but there are some cpu intensive requests that acquire GIL for 2 minutes.... The pod keep accepting requests, ignores the configured timeouts, ignores a closed connection by the user; then after 1 minute of liveness probes failing, the pod is restarted by kubelet.
So 1 fat request can dramatically drop the availability.
I see two different solutions:
have a separate deployment that will host only long running api calls; configure ingress to route requests between these two deployments;
using multiprocessing handle liveness/readyness probes in a main process, every other request must be handled in the child process;
There are pros and cons for each solution, maybe I will need a combination of both. Also if I need a steady flow of prometheus metrics, I might need to create a proxy server on the application layer (1 more container on the same pod). Also need to configure ingress to have a single upstream connection to python pods, so that long running request will be queued, whereas short ones will be processed concurrently (yep, python, concurrency, good joke). Not sure tho it will scale well with HPA.
So yeah, running production ready python rest api server on kubernetes is not a piece of cake. Go and java have a much better ecosystem for microservice applications.
PS
here is a good article that shows that there is no need to run your app in kubernetes with WSGI
https://techblog.appnexus.com/beyond-hello-world-modern-asynchronous-python-in-kubernetes-f2c4ecd4a38d
PPS
Im considering to use prometheus exporter for flask. Looks better than running a python client in a separate thread;
https://github.com/rycus86/prometheus_flask_exporter
We have an application that is getting 20-30 requests a second. Waitress seems to be buckling under the load despite us tweaking performance vars. It doesn't crash nor give any errors. Instead, it seems to send (we assume) a ERRCONRESET to Nginx which is sending requests to it. This hypothesis is from the waitress documentation that notes when the backlog is past its limit it may send ERRCONRESETs to the requesting party. Further, Nginx returns 504s to us when waitress is under load. The python application itself continues to run seemingly fine.
We attempted to increase threads up (50 threads) and connection limits (1000) as well. We also lowered the channel_timeout and cleanup_interval (10 sec and 15sec respectively). This still showed no improvement on performance under load. Lastly we even attempted to increase the backlog to 2048. None of this has produced any significant impact.
On some level I even wonder if the new limits proscribed are being respected as running netcat shows long running connections that are not being terminated for well over 60 seconds. We're under the impression Waitress should be able to handle this load, yet it is not. To note we have scaled this up to 6 concurrent instances behind an LB to take requests and are still getting these errors.
Any feedback or performance tips would be appreciated. We are running these on pretty beefy AWS instances layered upon kubernetes. They are taking negligible CPU and RAM sources. When it does work its millisecond return times, so I cannot see any bottle necks in the code that may be contributing, onyl the fact that some how the connections and backlog are being overwhelmed.
See below for our config of waitress to start the app.
waitress.serve(app.app,
host=os.getenv('HOST', '0.0.0.0'),
port=int(os.getenv('PORT', '3000')),
expose_tracebacks=True,
connection_limit=os.getenv('CONNECTION_LIMIT', '1000'),
threads=os.getenv('THREADS', '50'),
channel_timeout=os.getenv('CHANNEL_TIMEOUT', '10'),
cleanup_interval=os.getenv('CLEANUP_INTERVAL', '30'),
backlog=os.getenv('BACKLOG', '2048'))
I'm using Python + Bottle as a webserver.
As I use the production server for many other websites, I don't want Python + Bottle to eat 70% of the CPU for example.
How is it possible to limit the CPU usage of a Python Bottle webserver?
I was thinking about using resource.setrlimit, but is this a good way to do it?
With which syntax should we use resource.setrlimit to set the limit to 20% of the CPU for example?
Step 1
You should ask yourself whether resource optimization is really necessary. If you're certain that specific application consumes too much resources than it could or should then go to step 2.
Step 2
When your application consumes too much resources, first thing you should do is try to identify bottlenecks in it and see if they can be optimized away. Python has various tools that can help you (code profilers, PyPy etc.) If there's nothing you can do in that regard, go to step 3.
Step 3
If you absolutely must limit process resources, keep in mind that:
OS has a very sophisticated scheduling mechanisms that do their best to ensure each running process gets a fair share of CPU time. Even on processor overload, things will still run fine (up to some point).
if you deliberately limit CPU time of one of your processes it may respond slowly or not at all due to network timeouts.
My answer to this question is - reduce static priority of your server if you think it may starve other services but then your server may suffer from starvation when processors are overload. Using nice utility would be my choice as it won't litter your code.
I have a simple Flask application that exposes one api. Calling the api runs a python algorithm that does a lot of string manipulation and file reading (no writing). The algorithm takes about 1000ms. I'm trying to see if there's anyway to optimize concurrent requests. I'm running on a single instance of 4 vCPU VM.
I wrote a client that makes a request every 1000ms. There's minimal RAM usage, and CPU usage is about 35%. When I up the request to every 750ms. RAM usage did not increase by much, but CPU usage doubles to 70%. If I increase the requests to every 500ms, the response will start taking longer time, eventually timing out. CPU usage is at 100%, and RAM is still minimal.
I followed this tutorial to set my application. I enabled threads in my uWSGI settings. However, I did not really notice much difference.
I was hoping to get some advice on what I can do software/settings-wise to respond better to concurrent requests.
The setup
Our setup is unique in the following ways:
we have a large number of distinct Django installations on a single server.
each of these has its own code base, and even runs as a separate linux user. (Currently implemented using Apache mod_wsgi, each installation configured with a small number of threads (2-5) behind a nginx proxy).
each of these installations have a significant memory footprint (20 - 200 MB)
these installations are "web apps" - they are not exposed to the general web, and will be used by a limited nr. of users (1 - 100).
traffic is expected to be in (small) bursts per-installation. I.e. if a certain installation becomes used, a number of follow up requests are to be expected for that installation (but not others).
As each of these processes has the potential to rack up anywhere between 20 and 200 MB of memory, the total memory footprint of the Django processes is "too large". I.e. it quickly exceeds the available physical memory on the server, leading to extensive swapping.
I see 2 specific problems with the current setup:
We're leaving the guessing of which installation needs to be in physical mememory to the OS. It would seem to me that we can do better. Specifically, an installation that currently gets more traffic would be better off with a larger number of ready workers. Also: installations that get no traffic for extensive amounts of time could even do with 0 ready workers as we can deal with the 1-2s for the initial request as long as follow-up requests are fast enough. A specific reason I think we can be "smarter than the OS": after a server restart on a slow day the server is much more responsive (difference is so great it can be observed w/ the naked eye). This would suggest to me that the overhead of presumably swapped processes is significant even if they have not currenlty activily serving requests for a full day.
Some requests have larger memory needs than others. A process that has once dealt with one such a request has claimed the memory from the OS, but due to framentation will likely not be able to return it. It would be worthwhile to be able to retire memory-hogs. (Currenlty we simply have a retart-after-n-requests configured on Apache, but this is not specifically triggered after the fragmentation).
The question:
My idea for a solution would be to have the main server spin up/down workers per installation depending on the needs per installation in terms of traffic. Further niceties:
* configure some general system constraints, i.e. once the server becomes busy be less generous in spinning up processes
* restart memory hogs.
There are many python (WSGI) servers available. Which of them would (easily) allow for such a setup. And what are good pointers for that?
See if uWSGI works for you. I don't think there is something more flexible.
You can have it spawn and kill workers dynamically, set max memory usage etc. Or you might come with better ideas after reading their docs.