Routing requests to a specific Heroku Dyno - python

I built a real-time collaboration application with ProseMirror that uses a centralised operational transform algorithm (described here by Marijn Haverbeke), with a Python server using Django Channels and prosemirror-py as the central point.
The server creates a DocumentInstance for every document users are collaborating on, keeps it in memory and occasionally stores it in a Redis database. As long as there is only one Dyno, all requests are routed there. Then the server looks up the instance the request belongs to and updates it.
I would like to take advantage of Heroku's horizontal scaling and run more than one dyno. But as I understand it, this would mean requests being routed to any of the running dynos, and since a DocumentInstance can only live on one server, this would not work.
Is there a way to make sure that requests belonging to a specific DocumentInstance are only routed to the machine that keeps it?
Or maybe there is an alternative architecture that I am overlooking?

When building scalable architecture, the horizontal autoscaling layer is typically ephemeral, meaning no state is kept there; it serves either for processing or for IO to other services, because these boxes are shut down and spun up with regularity. You should not keep the DocumentInstance data there, as it will be lost when dynos spin up and shut down during autoscaling.
If you're using Redis as a centralized datastore, you would be much better off saving to Redis in real time and having all dynos read and write through Redis to stay in sync. Redis is capable of this, and it moves all state from ephemeral dynos to a non-ephemeral datastore. You may, however, be able to find a middle ground for your case using https://devcenter.heroku.com/articles/session-affinity.
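For example, here is a hedged sketch of moving the authoritative document state into Redis so that any dyno can serve any request; the key layout, the version/steps shape, and the helper names are assumptions for illustration, not the app's actual data model:

import json
import redis

r = redis.Redis.from_url("redis://localhost:6379/0")  # e.g. REDIS_URL on Heroku

def load_instance(doc_id):
    raw = r.get(f"doc:{doc_id}")
    return json.loads(raw) if raw else {"version": 0, "steps": []}

def apply_steps(doc_id, client_version, new_steps):
    # Optimistic concurrency: only append if the client saw the latest version,
    # so two dynos can never clobber each other's updates.
    with r.pipeline() as pipe:
        while True:
            try:
                pipe.watch(f"doc:{doc_id}")
                doc = load_instance(doc_id)
                if client_version != doc["version"]:
                    pipe.unwatch()
                    return doc  # client must rebase on the newer state first
                doc["steps"].extend(new_steps)
                doc["version"] += len(new_steps)
                pipe.multi()
                pipe.set(f"doc:{doc_id}", json.dumps(doc))
                pipe.execute()
                return doc
            except redis.WatchError:
                continue  # another dyno updated the doc in the meantime; retry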
I run 123 Dyno, an autoscaling and monitoring add-on for Heroku, which gives you more options and faster, configurable autoscaling when you get to that point.

Related

Production ready Python apps on Kubernetes

I have been deploying apps to Kubernetes for the last 2 years, and in my org all our apps (especially the stateless ones) run in Kubernetes. I still have a fundamental question, because very recently we found some issues with a few of our Python apps.
Initially, when we deployed our Python apps (written in Flask and Django), we ran them using python app.py. It's known that, because of the GIL, Python cannot run CPU-bound code on more than one thread at a time, so a single process effectively serves one request at a time; if that one request is CPU heavy, further requests are not processed. This sometimes causes the health API to stop responding. We have observed that if a single request is CPU-bound rather than IO-bound, it holds the GIL and we cannot process another request in parallel. And since only one core is busy, we also see no increase in overall CPU utilization, which has an impact on how the HorizontalPodAutoscaler works: it is unable to scale the pods.
Because of this, we started using uWSGI in our pods. uWSGI can run multiple worker processes under the hood, handle multiple requests in parallel, and automatically spin up new processes on demand. But here comes another problem: we have seen that uWSGI is slow to auto-scale the processes needed to serve requests, and this causes HTTP 503 errors. Because of this, we are unable to serve a few of our APIs with 100% availability.
At the same time, all our other apps, written in Node.js, Java and Golang, are giving 100% availability.
I am looking for the best way to run a Python app with 100% (99.99%) availability in Kubernetes, with the following:
Having health API and liveness API served by the app
An app running in Kubernetes
If possible, without uWSGI (a single process per pod is the fundamental Docker concept)
If uWSGI is required, are there any specific configs we can apply for the k8s environment?
We use Twisted's WSGI server with 30 threads and it's been solid for our Django application. It keeps to a single-process-per-pod model, which more closely matches Kubernetes' expectations, as you mentioned. Yes, the GIL means only one of those 30 threads can be running Python code at a time, but as with most webapps, most of those threads are blocked on I/O (usually waiting for a response from the database) the vast majority of the time. Then run multiple replicas on top of that, both for redundancy and to give you true concurrency at whatever level you need (we usually use 4-8 depending on the site traffic; some big ones are up to 16).
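For reference, here is a minimal sketch of that kind of setup; the module path myproject.wsgi and the port are assumptions, not details from the answer above:

from twisted.internet import reactor
from twisted.python.threadpool import ThreadPool
from twisted.web.server import Site
from twisted.web.wsgi import WSGIResource

from myproject.wsgi import application  # hypothetical Django WSGI module

# Dedicated 30-thread pool for WSGI requests, started and stopped with the reactor.
pool = ThreadPool(minthreads=1, maxthreads=30)
pool.start()
reactor.addSystemEventTrigger("after", "shutdown", pool.stop)

# Wrap the Django WSGI application and serve it on port 8000.
resource = WSGIResource(reactor, pool, application)
reactor.listenTCP(8000, Site(resource))
reactor.run()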
I have exactly the same problem with a Python deployment running a Flask application. Most API calls are handled in a matter of seconds, but there are some CPU-intensive requests that hold the GIL for 2 minutes. The pod keeps accepting requests, ignores the configured timeouts, ignores a connection closed by the user; then, after 1 minute of liveness probes failing, the pod is restarted by kubelet.
So 1 fat request can dramatically drop the availability.
I see two different solutions:
have a separate deployment that hosts only the long-running API calls, and configure ingress to route requests between these two deployments;
using multiprocessing, handle the liveness/readiness probes in the main process and handle every other request in a child process (see the sketch after this list);
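Here is a hedged sketch of the second option, assuming Flask and concurrent.futures (which builds on multiprocessing); the endpoint names and pool size are made up for illustration:

from concurrent.futures import ProcessPoolExecutor
from flask import Flask, jsonify, request

app = Flask(__name__)
executor = ProcessPoolExecutor(max_workers=4)  # tune to the cores available to the pod

def heavy_computation(payload):
    # stand-in for the CPU-bound work that would otherwise hold the GIL for minutes
    return sum(i * i for i in range(10_000_000))

@app.route("/healthz")
def healthz():
    # answered directly in the main process, so probes stay fast even under load
    return jsonify(status="ok")

@app.route("/compute", methods=["POST"])
def compute():
    # the CPU-heavy work runs in a child process; this thread only waits on the
    # future, which releases the GIL, so probes and cheap requests keep flowing
    result = executor.submit(heavy_computation, request.get_json()).result()
    return jsonify(result=result)

# Note: this only helps if the server handles requests in multiple threads
# (e.g. a threaded WSGI server), so a probe can be answered while /compute waits.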
There are pros and cons to each solution; maybe I will need a combination of both. Also, if I need a steady flow of Prometheus metrics, I might need to create a proxy server at the application layer (one more container in the same pod). I would also need to configure ingress to keep a single upstream connection to the Python pods, so that long-running requests are queued while short ones are processed concurrently (yep, Python, concurrency, good joke). Not sure, though, whether it will scale well with HPA.
So yeah, running a production-ready Python REST API server on Kubernetes is not a piece of cake. Go and Java have a much better ecosystem for microservice applications.
PS
Here is a good article arguing that there is no need to run your app under WSGI in Kubernetes:
https://techblog.appnexus.com/beyond-hello-world-modern-asynchronous-python-in-kubernetes-f2c4ecd4a38d
PPS
I'm considering using the Prometheus exporter for Flask. It looks better than running a Python client in a separate thread:
https://github.com/rycus86/prometheus_flask_exporter

How to calculate max requests per second of a Django app?

I am about to deploy a Django app, and it struck me that I couldn't find a way to anticipate how many requests per second my application can handle.
Is there a way of calculating how many requests per second a Django application can handle, without resorting to things like doing a test deployment and using an external tool such as Locust?
I know there are several factors involved (such as the number of database queries, etc.), but perhaps there is a convenient way of calculating, or at least estimating, how many visitors a single Django app instance can handle.
EDIT: Removed the mention to Gunicorn, since it only adds confusion to what I truly wanted to know.
Is there a way of calculating how many requests per second a Django application can handle, without resorting to things like doing a test deployment and using an external tool such as Locust?
No and yes. As mackarone pointed out, I don't think there's any way to avoid measuring it. Consider the case where you did a local benchmark on your local dev server talking to a local DB instance, in order to generate a baseline for estimation. The issue with this is that the hardware and the network (distance between services) make a huge difference, so any numbers you generated locally would be relatively worthless for capacity planning.
In my experience, local testing is great for relative changes. Consider the case where you wanted to see the performance impact of SQL query planning: establishing a local baseline, making the change, then observing the effect locally is useful for gauging the relative speedup.
How to generate these numbers?
I would recommend deploying the app to the hardware and network you plan to run on. This deploy should use your production configuration and component topology (i.e. if you're going to run gunicorn, make sure gunicorn is running instead of NGINX, or if you're going to have a proxy in front of gunicorn, make sure that is set up). I would run a single instance of your application using your production config.
Once this is running, I would launch a load test against the single instance using any of the popular load testing tools:
Apache Benchmark
Siege
Vegeta
K6
etc
You can launch these load tests from a single machine and ramp up traffic until response times are no longer acceptable, to get a feel for the number of concurrent connections and the throughput your application can accommodate.
Now you have some idea of what a single instance of your service is able to handle. Up until your DB (or other shared resources) become saturated, these numbers can be used to project how many instances of your service are necessary to handle a given amount of traffic!
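As a back-of-the-envelope illustration (all numbers below are made up, not measurements), the projection is simple arithmetic once you have the single-instance figure:

import math

single_instance_rps = 250   # measured throughput at acceptable latency (assumed)
expected_peak_rps = 1200    # traffic you expect to absorb (assumed)
headroom = 1.3              # safety margin for spikes and deploys

instances = math.ceil(expected_peak_rps * headroom / single_instance_rps)
print(instances)  # -> 7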
According to the Gunicorn documentation
How Many Workers?
DO NOT scale the number of workers to the number of clients you expect to have. Gunicorn should only need 4-12 worker processes to handle hundreds or thousands of requests per second.
Gunicorn relies on the operating system to provide all of the load balancing when handling requests. Generally we recommend (2 x $num_cores) + 1 as the number of workers to start off with. While not overly scientific, the formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request.
Obviously, your particular hardware and application are going to affect the optimal number of workers. Our recommendation is to start with the above guess and tune using TTIN and TTOU signals while the application is under load.
Always remember, there is such a thing as too many workers. After a point your worker processes will start thrashing system resources decreasing the throughput of the entire system.
The best thing is to tune it using a load testing tool such as Locust, as you mentioned.
Emphasis mine
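As a concrete starting point, the heuristic from the quote can be written down in a gunicorn config file; the filename and bind address below are assumptions:

# gunicorn_conf.py - run with: gunicorn -c gunicorn_conf.py myproject.wsgi
import multiprocessing

# (2 x $num_cores) + 1, the documented rule of thumb
workers = multiprocessing.cpu_count() * 2 + 1
bind = "0.0.0.0:8000"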
You have to install loadtest first; it is an npm package.
I came across it while learning Redis, and it worked well for me.
For more detail, check this tutorial: https://realpython.com/caching-in-django-with-redis/#start-by-measuring-performance
npm install -g loadtest
loadtest -n 100 -k http://localhost:8000/myUrl/

gunicorn and/or celery: What is the way get the best out of both?

I have a machine learning application which uses Flask to expose an API (for production this is not a good idea, but even if I use Django in the future, the idea behind the question shouldn't change).
The main problem is how to serve multiple requests to my app. A few months back, Celery was added to get around this problem. The number of Celery workers spawned is equal to the number of cores present in the machine. For very few users this looked fine and was in production for some time.
When the number of concurrent users increased, it became evident that we should do performance testing on it. It turns out it can handle 20 users on a 30 GB, 8-core machine, without authentication and without any front end, which does not look like a good number.
I didn't know there were things like application servers, web servers, and model servers. When googling this problem, gunicorn came up as a good application server for Python applications.
Should I use gunicorn or any other application server along with Celery, and why?
If I remove Celery and only use gunicorn with the application, can I achieve concurrency? I have read somewhere that Celery is not good for machine learning applications.
What are the purposes of gunicorn and Celery, and how can we get the best out of both?
Note: the main goal is to maximize concurrency. Authentication will be added when serving in production, and a front-end application might also come into play.
There is no shame in flask. If in fact you just need a web API wrapper, flask is probably a much better choice than django (simply because django is huge and you'd be using only a fraction of its capability).
However, your concurrency problems are apparently stemming from the fact that you are doing some heavy-duty processing for each request. There is simply no way around that; if you require a certain amount of computational resources per request, you can't magic those up. From here on, it's a juggling act.
If you want a guaranteed response immediately, you need to have as many workers as potential simultaneous requests. This may involve load balancing over multiple servers, if you can't scrounge up enough resources on one server. (cue gunicorn, a web application server, responsible for accepting connections and then distributing them to multiple application processes.)
If you are okay with not getting an immediate response, you can let stuff queue up. (cue celery, a task queue, which worker processes can use to retrieve the next thing to be done, and deposit results). This works best if you don't need a response in the same request-response cycle; e.g. you submit a job from client, and they only get an acknowledgement that the job has been received; you would need a second request to ask about the status of the job, and possibly the results of the job if it is finished.
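To make the queueing approach concrete, here is a hedged sketch of that split; the broker URLs, endpoint names, and the run_inference task are illustrative assumptions, not part of the original setup:

from celery import Celery
from flask import Flask, jsonify, request

# Celery app backed by Redis (broker and result backend URLs are assumed)
celery_app = Celery("ml_tasks",
                    broker="redis://localhost:6379/0",
                    backend="redis://localhost:6379/1")

@celery_app.task
def run_inference(payload):
    # stand-in for the heavy ML prediction
    return {"prediction": 42}

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # enqueue the job and acknowledge immediately instead of blocking the worker
    job = run_inference.delay(request.get_json())
    return jsonify(job_id=job.id), 202

@app.route("/result/<job_id>")
def result(job_id):
    # second request: ask for the status/result of a previously submitted job
    res = celery_app.AsyncResult(job_id)
    if not res.ready():
        return jsonify(status="pending"), 202
    return jsonify(status="done", result=res.get())

The Flask app would run under gunicorn with a handful of workers, while one or more Celery worker processes (started with celery -A <module> worker) do the actual computation.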
Alternately, instead of Flask you could use websockets or Tornado, to push out the response to the client when it is available (as opposed to user polling for results, or waiting on a live HTTP connection and taking up a server process).

How to profile django channels?

My technology stack is Redis as a channels backend, PostgreSQL as the database, Daphne as the ASGI server, and Nginx in front of the whole application. Everything is deployed using Docker Swarm, with only Redis and the database outside. I have about 20 virtual hosts, with 20 interface servers, 40 HTTP workers and 20 websocket workers. Load balancing is done using the Ingress overlay Docker network.
The problem is that sometimes very weird things happen with performance. Most requests are handled in under 400 ms, but sometimes a request can take up to 2-3 s, even under very light load. Profiling the workers with Django Debug Toolbar or middleware-based profilers shows nothing (timings of 0.01 s or so).
My question: is there any good method of profiling the whole request path with django-channels? I would like to know how much time each phase takes, i.e. when the request was processed by Daphne, when the worker started processing, when it finished, and when the interface server sent the response to the client. Currently, I have no idea how to solve this.
Django-silk might be helpful to you in profiling the request and database query time, for the following reasons:
It is easy to set up, by simply adding the config to the settings.py of your Django project.
It can be customised: using the provided decorator, you can profile functions or methods and inspect their runtime performance.
Dynamic settings: you can choose to dynamically apply silk to methods and also set the profiling rate you want at run time.
As the documentation states:
Silk is a live profiling and inspection tool for the Django framework. Silk intercepts and stores HTTP requests and database queries before presenting them in a user interface for further inspection
Note: silk may double your database query time, so it may cause some trouble if you enable it in your production environment. However, the overhead added by silk is shown separately on the dashboard.
https://github.com/jazzband/django-silk
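Here is a minimal sketch of the setup described above; it follows django-silk's documented settings, but the sampling rate and the profiled view are arbitrary examples:

# settings.py
INSTALLED_APPS = [
    # ...
    "silk",
]

MIDDLEWARE = [
    "silk.middleware.SilkyMiddleware",
    # ...
]

SILKY_PYTHON_PROFILER = True     # attach cProfile output to intercepted requests
SILKY_INTERCEPT_PERCENT = 10     # only profile a sample of traffic

# views.py - profiling a specific code path with the provided decorator
from silk.profiling.profiler import silk_profile

@silk_profile(name="Process document update")
def process_update(request):
    ...

You would also mount silk's UI by including silk.urls in your project's urls.py.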
Why not stick a monitoring tool such as Kibana or New Relic in front of it and see why and what is taking so long for a small-payload response? It can tell you the time spent in Python, PostgreSQL and Memcache (Redis).

A question about sites downtime updates

I have a server where I run a Django application, but I have a little problem:
when I commit and push new changes to the server with Mercurial, there is a tiny window (like 1 microsecond) where the home page is unreachable.
I have Apache on the server.
How can I solve this?
You could run multiple instances of the django app (either on the same machine with different ports or on different machines) and use apache to reverse proxy requests to each instance. It can failover to instance B whilst instance A is rebooting. See mod_proxy.
If the downtime is as short as you say, though, it is unlikely to be an issue worth worrying about.
Also note that there are likely to be better (and easier) proxies than Apache. Nginx is popular, as is HAProxy.
If you have significant traffic and care about downtime measured in microseconds, it's probably best to push new changes to your web servers one at a time, removing each machine from the load balancer rotation for the moment you're doing the upgrade on it.
When using apachectl graceful, you minimize the time the website is unavailable when 'restarting' Apache. All children are 'kindly' requested to restart and get their new configuration when they're not doing anything.
The USR1 or graceful signal causes the parent process to advise the children to exit after their current request (or to exit immediately if they're not serving anything). The parent re-reads its configuration files and re-opens its log files. As each child dies off the parent replaces it with a child from the new generation of the configuration, which begins serving new requests immediately.
At a heavy-traffic website, you will notice some performance loss, as some children will temporarily not accept new connections. It's my experience, however, that TCP recovers perfectly from this.
Considering that some websites take several minutes or hours to update, that is completely acceptable. If it is a really big issue, you could use a proxy, running multiple instances and updating them one at a time, or update at an off-peak moment.
If you're at the point of complaining about a 1/1,000,000th of a second outage, then I suggest the following approach:
Front end load balancers pointing to multiple backend servers.
Remove one backend server from the load balancer to ensure no traffic will go to it.
Wait until all the requests that server was processing have been completed.
Shutdown the webserver on that instance.
Update the django instance on that machine.
Add that instance back to the load balancers.
Repeat for every other server.
This will ensure that the 1/1,000,000th of a second gap is removed.
I think it's normal, since Django may need to restart its server after your update.
