I have been deploying apps to Kubernetes for the last 2 years. And in my org, all our apps(especially stateless) are running in Kubernetes. I still have a fundamental question, just because very recently we found some issues with respect to our few python apps.
Initially when we deployed, our python apps(Written in Flask and Django), we ran it using python app.py. It's known that, because of GIL, python really doesn't have support for system threads, and it will only serve one request at a time, but in case the one request is CPU heavy, it will not be able to process further requests. This is causing sometimes the health API to not work. We have observed that, at this moment, if there is a single request which is not IO and doing some operation, we will hold the CPU and cannot process another request in parallel. And since it's only doing fewer operations, we have observed there is no increase in the CPU utilization also. This has an impact on how HorizontalPodAutoscaler works, its unable to scale the pods.
Because of this, we started using uWSGI in our pods. So basically uWSGI can run multiple pods under the hood and handle multiple requests in parallel, and automatically spin new processes on demand. But here comes another problem, that we have seen, uwsgi is lacking speed in auto-scaling the process tocorrected serve the request and its causing HTTP 503 errors, Because of this we are unable to serve our few APIs in 100% availability.
At the same time our all other apps, written in nodejs, java and golang, is giving 100% availability.
I am looking at what is the best way by which I can run a python app in 100%(99.99) availability in Kubernetes, with the following
Having health API and liveness API served by the app
An app running in Kubernetes
If possible without uwsgi(Single process per pod is the fundamental docker concept)
If with uwsgi, are there any specific config we can apply for k8s env
We use Twisted's WSGI server with 30 threads and it's been solid for our Django application. Keeps to a single process per pod model which more closely matches Kubernetes' expectations, as you mentioned. Yes, the GIL means only one of those 30 threads can be running Python code at time, but as with most webapps, most of those threads are blocked on I/O (usually waiting for a response from the database) the vast majority of the time. Then run multiple replicas on top of that both for redundancy and to give you true concurrency at whatever level you need (we usually use 4-8 depending on the site traffic, some big ones are up to 16).
I have exactly the same problem with a python deployment running the Flask application. Most api calls are handled in a matter of seconds, but there are some cpu intensive requests that acquire GIL for 2 minutes.... The pod keep accepting requests, ignores the configured timeouts, ignores a closed connection by the user; then after 1 minute of liveness probes failing, the pod is restarted by kubelet.
So 1 fat request can dramatically drop the availability.
I see two different solutions:
have a separate deployment that will host only long running api calls; configure ingress to route requests between these two deployments;
using multiprocessing handle liveness/readyness probes in a main process, every other request must be handled in the child process;
There are pros and cons for each solution, maybe I will need a combination of both. Also if I need a steady flow of prometheus metrics, I might need to create a proxy server on the application layer (1 more container on the same pod). Also need to configure ingress to have a single upstream connection to python pods, so that long running request will be queued, whereas short ones will be processed concurrently (yep, python, concurrency, good joke). Not sure tho it will scale well with HPA.
So yeah, running production ready python rest api server on kubernetes is not a piece of cake. Go and java have a much better ecosystem for microservice applications.
PS
here is a good article that shows that there is no need to run your app in kubernetes with WSGI
https://techblog.appnexus.com/beyond-hello-world-modern-asynchronous-python-in-kubernetes-f2c4ecd4a38d
PPS
Im considering to use prometheus exporter for flask. Looks better than running a python client in a separate thread;
https://github.com/rycus86/prometheus_flask_exporter
Related
I built an real-time collaboration application with Prosemirror that uses a centralised operational transform algorithm (described here by Marijn Haverbeke) with a Python server using Django Channels with prosemirror-py as the central point.
The server creates a DocumentInstance for every document users are collaborating on, keeps it in memory and occasionally stores it in a Redis database. As long as there is only one Dyno, all requests are routed there. Then the server looks up the instance the request belongs to and updates it.
I would like to take advantage of Heroku's horizontal scaling and run more than one dyno. But as I understand, this would imply requests being routed to any of the running dynos. But since one DocumentInstance can only live on one server this would not work.
Is there a way to make sure that requests belonging to a specific DocumentInstance are only routed to the machine that keep that keeps it?
Or maybe there is an alternative architecture that I am overlooking?
When building scalable architecture the horizontal autoscaling layer is typically ephemeral, meaning no state is kept there it typically serves either for processing or IO to other services as these boxes are shut down and spun up with regularity. You should not be keeping the DocumentInstance data there as it will be lost given dynos spin up and shut down during autoscaling.
If you're using Redis as a centralized datastore you would be much better off either saving in real-time to Redis and having all dynos write/call to Redis to keep in sync, Redis is capable of this and would move all state from ephemeral dynos to a non-emphereal datastore. You may however be able to find a middle ground for your case using https://devcenter.heroku.com/articles/session-affinity.
I run 123 Dyno an autoscaling and monitoring add-on for Heroku for more options and faster, configurable autoscaling on Heroku when you get there.
I am about the deploy a Django app, and then it struck me that I couldn't find a way to anticipate how many requests per second my application can handle.
Is there a way of calculating how many requests per second can a Django application handle, without resorting to things like doing a test deployment and use an external tool such as locust?
I know there are several factors involved (such as number of database queries, etc.), but perhaps there is a convenient way of calculating, even estimating, how many visitors can a single Django app instance handle.
EDIT: Removed the mention to Gunicorn, since it only adds confusion to what I truly wanted to know.
Is there a way of calculating how many requests per second can a
Django application handle, without resorting to things like doing a
test deployment and use an external tool such as locust?
No and Yes. As mackarone pointed out, I don't think there's anyway you avoid measuring it. Consider the case where you did a local benchmark on your local dev server talking to a local DB instance, in order to generate a baseline for estimation. The issue with this is that the hardware, network (distance between services) all make a huge difference. So any numbers you generated locally would be relatively worthless for capacity planning.
In my experiences, local testing is great for relative changes. Consider the case where you wanted to see the performance impact of sql query planninng. Establishing a local baseline, making the change, than observing the effect locally is useful to gauge relative speedup.
How to generate these numbers?
I would recommend deploying the app to the hardware, and network you plan on testing on. This deploy should use your production configuration and component topology (ie if you're going to run gunicorn, make sure gunicorn is running instead of NGINX, or if you're going to have a proxy in front of gunicorn, make sure that is setup. I would run a single instance of your application using your production config.
Once this is running, I would launch a load test against the single instance using any of the popular load testing tools:
Apache Benchmark
Siege
Vegeta
K6
etc
You can launch these load tests from your single machine and ramp up traffic until response times are no longer acceptable in order to get a feel for the # of concurrent connections, and throughput your application can accommodate.
Now you have some idea of what a single instance of your service is able to handle. Up until your db (or other shared resources) are saturated these numbers can be used to project how many instances of your service are necessary to handle some amount of traffic!
According to the Gunicorn documentation
How Many Workers?
DO NOT scale the number of workers to the number of clients you expect to have. Gunicorn should only need 4-12 worker processes to handle hundreds or thousands of requests per second.
Gunicorn relies on the operating system to provide all of the load balancing when handling requests. Generally we recommend (2 x $num_cores) + 1 as the number of workers to start off with. While not overly scientific, the formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request.
Obviously, your particular hardware and application are going to affect the optimal number of workers. Our recommendation is to start with the above guess and tune using TTIN and TTOU signals while the application is under load.
Always remember, there is such a thing as too many workers. After a point your worker processes will start thrashing system resources decreasing the throughput of the entire system.
The best thing is tune it using some load testing tool as locust as you mentioned.
Emphasis mine
You have to install (loadtest) first, it is a npm package,
I was learning redis and at that time I found this, you can use it, it worked for me,
For More check this tutorial: https://realpython.com/caching-in-django-with-redis/#start-by-measuring-performance
npm install -g loadtest
loadtest -n 100 -k http://localhost:8000/myUrl/
I'm working on a twisted web application which uploads files and encrypts them, returning the url+key to the user.
I've been tasked with scaling this application. At the moment when there are more than 3-4 concurrent upload requests the performance will drop off significantly.
I'm no Twisted expert but I assume this is due to it running in a single python process, being a high cpu application and the GIL?
How could I go about scaling this?
If this was a different framework such as Flask I would just put uwsgi in front of it and scale the number of processes. Would something similar work for Twisted and if so what tools are generally used for this?
If you think you could throw uwsgi in front of the application, I suppose it is pretty close to shared-nothing. So you can run multiple instances of the program and gain a core's worth of performance from each.
There are a couple really obvious options for exactly how to run the multiple instances. You could have a load balancer in front. You could have the processes share a listening port. There are probably more possibilities, too.
Since your protocol seems to be HTTP, any old HTTP load balancer should be applicable. It needn't be Twisted or Python based itself (though certainly it could be).
If you'd rather share a listening port, Twisted has APIs for passing file descriptors between processes (IReactorSocket) and for launching new processes that inherit a file descriptor from the parent (IReactorProcess).
I've a machine learning application which uses flask to expose api(for production this is not a good idea, but even if I'll use django in future the idea of the question shouldn't change).
The main problem is how to serve multiple requests to my app. Few months back celery has been added to get around this problem. The number of workers in celery that was spawned is equal to the number of cores present in the machine. For very few users this was looking fine and was in production for some time.
When the number of concurrent users got increased, it was evident that we should do a performance testing on it. It turns out: it is able to handle 20 users for 30 GB and 8 core machine without authentication and without any front-end. Which is not looking like a good number.
I didn't know there are things like: application server, web server, model server. When googling for this problem: gunicorn was a good application server python application.
Should I use gunicorn or any other application server along with celery and why
If I remove celery and only use gunicorn with the application can I achieve concurrency. I have read somewhere celery is not good for machine learning applications.
What are the purposes of gunicorn and celery. How can we achieve the best out of both.
Note: Main goal is to maximize concurrency. While serving in production authentication will be added. One front-end application might come into action in between in production.
There is no shame in flask. If in fact you just need a web API wrapper, flask is probably a much better choice than django (simply because django is huge and you'd be using only a fraction of its capability).
However, your concurrency problems are apparently stemming from the fact that you are doing some heavy-duty processing for each request. There is simply no way around that; if you require a certain amount of computational resources per request, you can't magic those up. From here on, it's a juggling act.
If you want a guaranteed response immediately, you need to have as many workers as potential simultaneous requests. This may involve load balancing over multiple servers, if you can't scrounge up enough resources on one server. (cue gunicorn, a web application server, responsible for accepting connections and then distributing them to multiple application processes.)
If you are okay with not getting an immediate response, you can let stuff queue up. (cue celery, a task queue, which worker processes can use to retrieve the next thing to be done, and deposit results). This works best if you don't need a response in the same request-response cycle; e.g. you submit a job from client, and they only get an acknowledgement that the job has been received; you would need a second request to ask about the status of the job, and possibly the results of the job if it is finished.
Alternately, instead of Flask you could use websockets or Tornado, to push out the response to the client when it is available (as opposed to user polling for results, or waiting on a live HTTP connection and taking up a server process).
My technology stack is Redis as a channels backend, Postgresql as a database, Daphne as an ASGI server, Nginx in front of a whole application. Everything is deployed using Docker Swarm, with only Redis and Database outside. I have about 20 virtual hosts, with 20 interface servers, 40 http workers and 20 websocket workers. Load balancing is done using Ingress overlay Docker network.
The problem is, sometimes very weird things happen regarding performance. Most of requests are handled in under 400ms, but sometimes request can take up to 2-3s, even during very small load. Profiling workers with Django Debug Toolbar or middleware-based profilers shows nothing (timing 0.01s or so)
My question: is there any good method of profiling a whole request path with django-channels? I would like how much time each phase takes, i.e when request was processed by Daphne, when worker started processing, when it finished, when interface server sent response to the client. Currently, I have no idea how to solve this.
Django-silk might be helpful to you in profiling the request and database searching time with following reasons:
It is easy to set by simply adding the configs on settings.py of your Django project.
Can be customised: by using the provided decorator, you can profile the function or methods and get their running performance.
Dynamic setting: you can choose to dynamically allocate silk to methods and also set the profiling rate you want during the running time.
As the documentation states:
Silk is a live profiling and inspection tool for the Django framework. Silk intercepts and stores HTTP requests and database queries before presenting them in a user interface for further inspection
Note: silk may double your database searching time, so it may cause some trouble if you set it on your production environment. However, the increase from silk will be shown separately on the dash board.
https://github.com/jazzband/django-silk
Why not stick a monitoring tool something like Kibana or New Relic and monitor why and what's taking so long for a small payload response. It can tell you the time spent on Python, PostgreSQL and Memcache (Redis).