I am developing my final degree project and I am facing some problems with Python, Flask, socketIO and background threads.
My solution takes some files as input, processes them, makes some calculations, and generates an image and a CSV file. Those files are then uploaded to a storage service. I want to process the files on a background thread and notify my clients (web, Android, and iOS) using websockets. Right now, I am using Flask-SocketIO with eventlet as the async_mode of my socket. When a client uploads the files, the processing is started in a background thread (using socketio.start_background_task), but that heavy process (it takes about 30 minutes to finish) seems to take control of the main thread. As a result, when I try to make an HTTP request to the server, the response loads indefinitely.
I would like to know if there is a way to make this work using eventlet or maybe using another different approach.
Thank you in advance.
Eventlet uses cooperative multitasking, which means that you cannot have a task using the CPU for long periods of time, as this prevents other tasks from running.
In general it is a bad idea to include CPU heavy tasks in an eventlet process, so one possible solution would be to offload the CPU heavy work to an external process, maybe through Celery or RQ. Another option that sometimes works (but not always) is to add calls to socketio.sleep(0) inside your CPU heavy task as frequently as possible. The sleep call interrupts the function for a moment and allows other functions waiting for the CPU to run.
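As a rough illustration of the second option, here is a minimal sketch of a CPU-heavy loop that yields at regular intervals. The names process_files, cooperative_sleep and yield_every are assumptions made up for this example. In a Flask-SocketIO app you would pass socketio.sleep as cooperative_sleep; the default of time.sleep only keeps the sketch runnable without eventlet.

```python
import time


def process_files(chunks, cooperative_sleep=time.sleep, yield_every=10):
    """CPU-heavy loop that hands control back to the scheduler periodically.

    cooperative_sleep is an assumption of this sketch: under Flask-SocketIO
    you would pass socketio.sleep here instead of time.sleep.
    """
    total = 0
    for index, chunk in enumerate(chunks, start=1):
        total += sum(value * value for value in chunk)  # stand-in for real work
        if index % yield_every == 0:
            cooperative_sleep(0)  # let other waiting tasks run
    return total


yield_calls = []
result = process_files(
    chunks=[range(100)] * 50,
    cooperative_sleep=yield_calls.append,  # record each yield instead of sleeping
)
```

The smaller yield_every is, the more often other tasks get a turn, at the cost of some overhead in the heavy task. If the work between two yields is itself long, the server will still stall for that long, which is why this approach does not always work.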
Related
I'm currently working on a Python project that receives a lot of AWS SQS messages (more than 1 million each day), processes these messages, and sends them to another SQS queue with additional data. Everything works fine, but now we need to speed up this process a lot!
From what we have seen, our biggest bottleneck is the HTTP requests used to send and receive messages through the AWS SQS API. So basically, our code is mostly I/O bound due to these HTTP requests.
We are trying to scale this process using one of the following methods:
Using Python's multiprocessing: this seems like a good idea, but our workers run on small machines, usually with a single core. Creating several processes may still give some benefit, since the CPU will probably switch to another process while one is stuck on an I/O operation. Still, that seems like a lot of process-management overhead and resources for operations that don't need to run in parallel, only concurrently.
Using Python's threading: since the GIL confines all threads to a single core, and threads have less overhead than processes, this seems like a good option. While one thread is stuck waiting for an HTTP response, the CPU can pick up another thread, and so on. This would give us the concurrent execution we want. But my question is: how does Python's threading know when it can switch from one thread to another? Does it know that a thread is currently in an I/O operation and can be swapped for another one? Will this approach maximize CPU usage and avoid busy waiting? Do I have to give up control of the CPU explicitly inside a thread, or is this done automatically in Python?
Recently, I also read about a concept called green threads, using Eventlet in Python. From what I saw, they seem like the perfect match for my project. They have little overhead and don't create OS threads like threading does. But will we have the same problems as with threading regarding CPU control? Does a green thread need to signal that the CPU may switch to another one? I saw in some examples that Eventlet offers some built-in libraries like urlopen, but not Requests.
The last option we considered was using Python's asyncio and async libraries such as aiohttp. I have done some basic experimenting with asyncio and wasn't very pleased, but I understand that most of that comes from the fact that Python is not a naturally asynchronous language. From what I saw, it would behave something like Eventlet.
So what do you think would be the best option here? What library would allow me to maximize performance on a single-core machine while avoiding busy waits as much as possible?
I am developing a REST API using Python Flask. (The client is a mobile app.)
However, the important functionality is a batch program that reads data from the DB, processes it, and then updates (or inserts) the data when a user sends a POST request with user data.
Considering that it involves a lot of reads, writes, and computation, how would you develop it?
These are the options I can think of:
Use procedures in DB
Create an external deployment program that is independent of API.
Create a separate batch server
Just run it on the API server
I cannot judge which is right with my current knowledge.
The important thing is that execution should not be slow.
To the user, it should feel as though everything is running on their own device.
I would like to ask you for advice on back-end development.
I would recommend considering asyncio. This is pretty much your use case: I/O is time-consuming but doesn't require a lot of CPU. So essentially you want that I/O to be done asynchronously while the rest of the server carries on.
The server receives a request that requires I/O.
It hands that request off to your asyncio architecture so it can be performed.
The server is immediately available to receive other requests while the previous I/O request is being processed.
The previous I/O request finishes. asyncio offers a few ways to deal with this.
See the docs, but you could provide a callback, or build your logic to take advantage of asyncio's event loop (which essentially manages switching back and forth between contexts, e.g. the "main" context of your server serving requests and the async I/O operations that you have queued up).
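The flow described above might look roughly like this. It is only a sketch: slow_io and the event names are hypothetical, and asyncio.sleep stands in for a real network or database wait. It shows both ways of handling completion mentioned here, a done callback and awaiting the task.

```python
import asyncio


async def slow_io(name, delay):
    """Hypothetical I/O call; asyncio.sleep stands in for a network or DB wait."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def main():
    events = []
    # Queue the I/O without waiting for it; the event loop runs it in the background.
    task = asyncio.create_task(slow_io("report", 0.05))
    # Option 1: a callback fired when the I/O finishes.
    task.add_done_callback(lambda t: events.append(("callback", t.result())))

    # Meanwhile the "server" keeps handling other work.
    for request_id in range(3):
        events.append(("handled", request_id))
        await asyncio.sleep(0)  # give the loop a chance to switch tasks

    # Option 2: await the result at the point where it is needed.
    events.append(("awaited", await task))
    return events


events = asyncio.run(main())
```

Note that this only helps while the pending work is waiting on I/O. A long computation inside a coroutine would still block the loop.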
I have a Flask app that is using external scripts to perform certain actions. In one of the scripts, I am using threading to run the threads.
I am using the following code for the actual threading:
    for a_device in get_devices:
        my_thread = threading.Thread(target=DMCA.do_connect, args=(self, a_device, cmd))
        my_thread.start()

    main_thread = threading.currentThread()
    for some_thread in threading.enumerate():
        if some_thread != main_thread:
            some_thread.join()
However, when this script is run (from a form), the process hangs and the webpage stays stuck in a continuous loading cycle.
Is there another way to use multithreading within the app?
Implementing threading by myself in a Flask app has always ended in some kind of disaster for me. You might want to use a distributed task queue such as Celery. Even though it might be tempting to spin off threads by yourself to get it finished faster, you will start to face all kinds of problems along the way and just end up wasting a lot of time (IMHO).
Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.
The execution units, called tasks, are executed concurrently on a single or more worker servers using multiprocessing, Eventlet, or gevent. Tasks can execute asynchronously (in the background) or synchronously (wait until ready).
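To make the asynchronous versus synchronous distinction concrete without requiring a broker, here is a small sketch that uses the standard library's concurrent.futures as a stand-in. This is not Celery and not its API; build_report and the device names are hypothetical. It only shows the shape of "submit now, collect later".

```python
from concurrent.futures import ThreadPoolExecutor


def build_report(device):
    """Hypothetical task body; a real one would talk to the device."""
    return f"report for {device}"


# Stand-in for a task queue worker pool; this is NOT Celery.
executor = ThreadPoolExecutor(max_workers=4)

# Asynchronous use: submit and return immediately, keeping a handle for later.
pending = {device: executor.submit(build_report, device) for device in ("sw1", "sw2")}

# Synchronous use: block until each task is ready.
reports = {device: future.result(timeout=5) for device, future in pending.items()}
executor.shutdown()
```

In a web handler you would normally stop after the submit step and return a response right away, then let the client poll or be notified when the result is ready.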
Here are some good resources that you can use to get started
Using Celery With Flask - Miguel Grinberg
Celery Background Tasks - Flask Documentation
I am working on a Django application which uses Celery for distributed async processing. I have been tasked with integrating a process that was originally written with concurrent.futures. So my question is: can this job, with its concurrent.futures processing, work inside the Celery task queue? Would it cause any problems? If so, what would be the best way forward? The existing concurrent process is resource intensive, since it is able to avoid the GIL, and it is also very fast because of that. In addition, the process uses concurrent.futures.ProcessPoolExecutor, and inside it a few (<5) concurrent.futures.ThreadPoolExecutor jobs.
So the real question is: should we extract all the core functions of the process and rewrite them as separate Celery tasks, or just keep the original code and run it as one big piece of code within the Celery queue?
As per the design of the system, a user can submit several such Celery tasks, each containing the concurrent.futures code.
Any help will be appreciated.
Your library should work without modification. There's no harm in having threaded code running within Celery, unless you are mixing in gevent with non-gevent compatible code for example.
Reasons to break the code up would be for resource management (reduce memory/CPU overhead). With threading, the thing you want to monitor is CPU load. Once your concurrency causes enough load (e.g. threads doing CPU intensive work), the OS will start swapping between threads, and your processing gets slower, not faster.
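One simple way to keep concurrency from outrunning the CPU is to cap the pool size. The sketch below is an assumption-laden illustration, not a tuning recommendation: pick_worker_count and crunch are hypothetical names, and capping at the core count is just one possible heuristic.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def pick_worker_count(requested, cpu_count=None):
    """Cap the number of CPU-heavy workers at the number of available cores."""
    cores = cpu_count or os.cpu_count() or 1
    return max(1, min(requested, cores))


def crunch(n):
    return sum(i * i for i in range(n))  # stand-in for CPU-intensive work


workers = pick_worker_count(requested=16)
with ThreadPoolExecutor(max_workers=workers) as pool:
    totals = list(pool.map(crunch, [10, 100, 1000]))
```

When several such tasks can run at once on the same machine, the cap would have to account for all of them together, which is one argument for splitting the work into smaller tasks.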
I'm trying to decide if I should use gevent or threading to implement concurrency for web scraping in python.
My program should be able to support a large (~1000) number of concurrent workers. Most of the time, the workers will be waiting for requests to come back.
Some guiding questions:
What exactly is the difference between a thread and a greenlet? What is the max number of threads \ greenlets I should create in a single process (with regard to the spec of the server)?
A Python thread is an OS thread that is controlled by the OS, which makes it a lot heavier since it needs context switches. Green threads are lightweight, and since they live in userspace, the OS does not create or manage them.
I think you can use gevent. Gevent = event loop (libev) + coroutines (greenlet) + monkey patching. Gevent gives you thread-like concurrency without using OS threads, so you can write normal-looking code and still get async I/O.
Make sure you don't have CPU bound stuff in your code.
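To give a feel for how cheap userspace workers are, here is a sketch with 1000 concurrent workers. It uses the standard library's asyncio coroutines as a stand-in for greenlets, so it is not gevent code; fetch and crawl are hypothetical names, and asyncio.sleep stands in for waiting on a response.

```python
import asyncio


async def fetch(worker_id):
    """Hypothetical request; asyncio.sleep stands in for waiting on the network."""
    await asyncio.sleep(0.01)
    return worker_id


async def crawl(worker_count):
    # Each worker is a coroutine scheduled in user space, not an OS thread.
    return await asyncio.gather(*(fetch(i) for i in range(worker_count)))


results = asyncio.run(crawl(1000))
```

All 1000 workers wait at the same time in a single OS thread. The same number of OS threads would each need their own stack and kernel scheduling.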
I don't think you have thought this whole thing through. I have built some sizeable lightweight-thread apps with Greenlets created from the Gevent framework. As long as you allow control to switch between Greenlets with appropriate sleeps or switches, everything tends to work fine. Rather than blocking or waiting indefinitely for a reply, it is recommended that the wait or block time out, raise an exception, catch it, sleep (in the except part of your code), and then loop again; otherwise you will not switch Greenlets readily.
Also, take care to join and/or kill all Greenlets, since you could end up with zombies that cause all sorts of effects that you do not want.
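The time-out, except, sleep, and loop pattern can be sketched like this. The sketch uses the standard library's queue.Queue and time.sleep so it runs anywhere; wait_for_reply and its parameters are hypothetical. Under gevent, the assumption is that you would use gevent.queue.Queue and pass gevent.sleep as cooperative_sleep.

```python
import queue
import time


def wait_for_reply(replies, cooperative_sleep=time.sleep, timeout=0.01, max_attempts=5):
    """Poll for a reply with a short timeout, yielding between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return replies.get(timeout=timeout), attempt
        except queue.Empty:
            cooperative_sleep(0)  # under gevent this would be gevent.sleep(0)
    return None, max_attempts


inbox = queue.Queue()
empty_result = wait_for_reply(inbox, max_attempts=2)  # nothing arrives, gives up
inbox.put("pong")
ready_result = wait_for_reply(inbox)  # reply is already waiting
```

Bounding the number of attempts also makes it easier to end the worker cleanly, so it can be joined instead of lingering.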
However, I would not recommend this for your application. Rather, I would suggest one of the following Websockets extensions that use Gevent... See this link
Websockets in Flask
and this link
https://www.shanelynn.ie/asynchronous-updates-to-a-webpage-with-flask-and-socket-io/
I have implemented a very nice app with Flask-SocketIO
https://flask-socketio.readthedocs.io/en/latest/
It runs through Gunicorn with Nginx very nicely from a Docker container. The SocketIO interfaces very nicely with Javascript on the client side.
(Be careful with the web scraping: use something like Scrapy with the appropriate ethical scraping settings enabled.)