How can I handle multiple Python requests on my AWS EC2 instance?

How can I handle multiple Python requests on my AWS EC2 instance? - python

I have a Flask app deployed on Elastic Beanstalk onto an EC2 instance on AWS. If 100 people simultaneously connected to my server, then wouldn't that mean that they have to wait in a queue of 100 since the app can only handle one instance at a time?
How can I make it so that I can handle more requests using the same IP address to connect to? Thanks!

The short answer is to use uWSGI or gunicorn.
The longer answer is that your intuition is correct - what you are worrying about is "concurrency", or the number of simultaneous requests your app can handle. And yes, a single Flask app without any application server can handle one request at a time. How do you change that? For most Python apps, the unit of concurrency is a process (there are frameworks that change that, but the majority of app deployments are probably process-based). That is, you run a process for each concurrent request you think you'll need. App servers like uWSGI do the listening for your app, then dispatch the request to a process from a pool. So, how many processes do you need?
The second concept you need is "throughput" - how many requests can be served in a specific time, which is influenced by, but different from, "concurrency" and is where your intuition may mislead you. Let's say you have 8 processes. You may think "but I'll have 100 users, 8 is clearly not enough". Let's assume you know that each request completes in 1/8 (.125) seconds. That means that each process can serve 8 requests a second. Times 8 processes; your throughput will be (roughly) 64 requests per second. 8 process gets you a lot closer to your 100 users than you may have otherwise expected. Your 100 users probably won't actually issue requests in that 1 second window. Possible, but unlikely. The issue isn't really the concurrency, but whether the user gets a response in a reasonable time.
Hope this helps. Scaling is a wonderful topic - both straightforward and frustratingly nuanced at the same time. As your traffic increases, the above guidance will shift and you'll need more and more advanced techniques. But to get started - keep it simple and focus on the basics.
See How many concurrent requests does a single Flask process receive?

Related

gunicorn and/or celery: What is the way get the best out of both?

I've a machine learning application which uses flask to expose api(for production this is not a good idea, but even if I'll use django in future the idea of the question shouldn't change).
The main problem is how to serve multiple requests to my app. Few months back celery has been added to get around this problem. The number of workers in celery that was spawned is equal to the number of cores present in the machine. For very few users this was looking fine and was in production for some time.
When the number of concurrent users got increased, it was evident that we should do a performance testing on it. It turns out: it is able to handle 20 users for 30 GB and 8 core machine without authentication and without any front-end. Which is not looking like a good number.
I didn't know there are things like: application server, web server, model server. When googling for this problem: gunicorn was a good application server python application.
Should I use gunicorn or any other application server along with celery and why
If I remove celery and only use gunicorn with the application can I achieve concurrency. I have read somewhere celery is not good for machine learning applications.
What are the purposes of gunicorn and celery. How can we achieve the best out of both.
Note: Main goal is to maximize concurrency. While serving in production authentication will be added. One front-end application might come into action in between in production.

There is no shame in flask. If in fact you just need a web API wrapper, flask is probably a much better choice than django (simply because django is huge and you'd be using only a fraction of its capability).
However, your concurrency problems are apparently stemming from the fact that you are doing some heavy-duty processing for each request. There is simply no way around that; if you require a certain amount of computational resources per request, you can't magic those up. From here on, it's a juggling act.
If you want a guaranteed response immediately, you need to have as many workers as potential simultaneous requests. This may involve load balancing over multiple servers, if you can't scrounge up enough resources on one server. (cue gunicorn, a web application server, responsible for accepting connections and then distributing them to multiple application processes.)
If you are okay with not getting an immediate response, you can let stuff queue up. (cue celery, a task queue, which worker processes can use to retrieve the next thing to be done, and deposit results). This works best if you don't need a response in the same request-response cycle; e.g. you submit a job from client, and they only get an acknowledgement that the job has been received; you would need a second request to ask about the status of the job, and possibly the results of the job if it is finished.
Alternately, instead of Flask you could use websockets or Tornado, to push out the response to the client when it is available (as opposed to user polling for results, or waiting on a live HTTP connection and taking up a server process).

How do I run background job in Flask without threading or task-queue

I am building REST API with Flask-restplus. One of my endpoints takes a file uploaded from client and run some analysis. The job uses up to 30 seconds. I don't want the job to block the main process. So the endpoint will return a response with 200 or 201 right away, the job can still be running. Results will be saved to database which will be retrieved later.
It seems I have two options for long-running jobs.
Threading
Task-queue
Threading is relatively simpler. But problem is, there is a limit of thread numbers for Flask app. In a standalone Python app, I could use a queue for the threads. But this is REST api, each request call is independent. I don't know if there is a way to maintain a global queue for that. So if the requests exceed the thread limit, it won't be able to take more requests.
Task-queue with Celery and Redis is probably better option. But this is just a proof of concept thing, and time line is kind of tight. Setting up Celery, Redis with Flask is not easy, I am having lots of trouble on my dev machine which is a Windows. It will be deployed on AWS which is kind of complex.
I wonder if there is a third option for this case?

I would HIGHLY recommend using Celery as you have already mentioned in your post. It is built exactly for this use case. Their docs are really informative and there are no shortage of examples online that can get you up and running quickly.
Additionally, I would say THIS would be an excellent first resource for you to start with.

Celery is a fantastic solution to this problem I have used quite successfully in the past to manage millions of jobs per day.
The only real downside is the initial learning curve and complexity of debugging when things go sour (it can happen, especially with millions of jobs).

with flask, is it bad to use multiprocessing.Process to handle concurrent requests?

The flask python server can handle by default only one connection at a time.
By using multiprocessing.Process() one can spawn work tasks for every request. Each request takes some time for example to query a database.
question 1.
Why is it bad and why a WSGI server would be recommended and superior?
question 2.
It works with multiprocessing.Process(). Maybe it is not structured. But what real problem can happen in the future?

By the method of using multiprocess library and attempting concurrent processes for requests you risk limiting the concurrency to maximum of number of cores the machine CPU has. This is loosely equivalent to using --workers flag with something like gunicorn and providing maximum number of cores available for the guincorn server to run. While sure one can write the required logic to provide CPU time to each connection it seems like a lot of effort when WSGI frameworks exists to do exactly that. I'd suggest you going through Settings and Design documentation of GUnicorn to obtain an even better clarity on your situation.

Setting Django WSGI workers with long external API response

I'm writing an e-commerce plug-in app in Python/Django that integrates with Shopify stores. Whenever a customer for a store reaches checkout, Shopify sends a request to my app with shopping cart and destination address data, and my app is required to respond with shipping price information. The problem is that I need to make an external API call between them sending me the request and sending them the response, and under moderate load, my WSGI workers get filled very easily.
I'm trying to avoid scaling out unnecessarily. Should I simply increase my number of workers past the recommended cores * 2 + 1? Do I simply monitor CPU load in order to adjust this number? What's the ideal CPU load % I should be looking for? Since I'm also handing short non-blocked requests from the same app, will this cause any problems?
Is Django simply not a good match for this kind of use-case? If so, what is a good match, and what would be the best way to apply it without rewriting my whole app?
EDIT: My WSGI server is Gunicorn

There are a couple of things you can do to improve the performance of gunicorn here. Given your design, it's almost certain that your workers are IO-bound. So for a start you could configure them to use multiple threads per worker; the docs suggest 2-4.
However, again because of the IO-bound nature of your site, it seems likely that you'll get even better improvements by using one of the asynchronous worker types. See the design docs for details: I don't think there is much to choose between gevent and eventlet, personally I've had good results from the former.

Are uwsgi processes stuck when using Comet(Long Polling)?

I believe nginx is event based so with 1 single worker it can take multiple requests, say 100requests/second. These requests will then be pass on to uwsgi to be process and then once it's done it will push the result back to nginx and nginx will push the result to the user that do http request.
Assuming I am only using 1 worker(no thread) for my uwsgi, uwsgi will process this 100 request one by one right? So it will need to do 100 processes to complete the entire requests.
Now what happen if I am planning to use long polling to get a quick update on my front end How does facebook, gmail send the real time notification?
I believe it will force the uwsgi to process a single request(which is the long polling process) and suspend all the other requests, hence causing the entire system to broke down.
Do I have any misconception of how uwsgi work, or is there any other solution to implement long polling?
Thank You

Your analysis is right, long-polling is not well-suited for multiprocesses or multithreads modes (in term of costs). Each process/thread can manage a single request. Lucky enough uWSGI supports dozens of
non-blocking/evented/microthreads-based technologies (like gevent, or lower-levels greenlets), if your app can be adapted to this patterns (and this is not a no-brain task, so do not hope monkey-patching will be enough) you will win.
In addition to this, if you like/tolerate callback-based programming and you do not need uWSGI specific features, i find Tornado a great solution for the problem.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.