I’ve got a burning question. Recently I’ve been learning asyncio in Python and found it very useful and efficient, but here is my question: is it efficient to use it for “normal” things?
It’s obvious that using asynchronous operations for making requests, handling requests (in APIs), and working with files will give us a performance gain. But what about other operations? For example, if I want to do a lot of complicated mathematical operations, or just standard operations (no files, no web), would asyncio help me at all? Is there any reason to use it in apps where we are not making requests or doing all this web and file stuff?
I’m wondering because my college teachers never mentioned that we can’t get any improvement by using it for plain math or standard (local, non-file, non-web) operations, and I thought that we benefit from it almost always. Am I totally wrong? Is it that way just in Python, or in every other language too?
asyncio is, first and foremost, a convenient way to run multiple execution flows (compared to common alternatives like callbacks and threads).
Why would someone want to run multiple execution flows? Usually to gain performance, for example:
You don't want to waste time waiting for one network request to finish, so you start another one concurrently, gaining performance.
You don't want to waste time waiting for one OS thread to finish, so you start another one concurrently. In Python, due to the GIL, you won't gain performance with threads for CPU-bound operations. But they can still be useful for network stuff, or specifically in asyncio as a common way to run something blocking without freezing the event loop.
You don't want to waste time waiting for one OS process to finish, so you start another one concurrently.
The last item is a way to gain performance even for purely CPU-bound operations (if the machine has multiple cores). You can see an example here (third option). asyncio here, again, is just a tool for conveniently managing execution flows. Nothing stops you from using a plain ProcessPoolExecutor with de-facto callbacks, as shown here.
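To make that concrete, here is a minimal sketch of the ProcessPoolExecutor pattern, assuming a made-up cpu_bound function as the heavy work; the event loop farms the calls out to worker processes and simply awaits the results:

import asyncio
from concurrent.futures import ProcessPoolExecutor

def cpu_bound(n):
    # hypothetical CPU-heavy work: sum of squares
    return sum(i * i for i in range(n))

async def main():
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # each call runs in a separate process, so multiple cores are used;
        # the event loop just awaits the results without blocking
        results = await asyncio.gather(
            *(loop.run_in_executor(pool, cpu_bound, 10_000_000) for _ in range(4))
        )
    print(results)

if __name__ == "__main__":
    asyncio.run(main())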
Related
I'm currently working on a Python project that receives a lot of AWS SQS messages (more than 1 million each day), processes these messages, and sends them to another SQS queue with additional data. Everything works fine, but now we need to speed this process up a lot!
From what we have seen, our biggest bottleneck is the HTTP requests used to send and receive messages from the AWS SQS API. So basically, our code is mostly I/O bound due to these HTTP requests.
We are trying to scale this process up by one of the following methods:
Using Python's multiprocessing: this seems like a good idea, but our workers run on small machines, usually with a single core. So creating different processes may still give some benefit, since the CPU will probably switch between processes as one or another is stuck at an I/O operation. But still, that seems like a lot of process-management overhead and resources for operations that don't need to run in parallel, just concurrently.
Using Python's threading: since the GIL locks all threads to a single core, and threads have less overhead than processes, this seems like a good option. As one thread is stuck waiting for an HTTP response, the CPU can take another thread to run, and so on. This would get us our desired concurrent execution. But my question is: how does Python's threading know that it can switch one thread for another? Does it know that some thread is currently on an I/O operation so that it can switch it out for another one? Will this approach absolutely maximize CPU usage, avoiding busy waiting? Do I specifically have to give up control of the CPU inside a thread, or is this done automatically in Python?
Recently, I also read about a concept called green threads, using Eventlet in Python. From what I saw, they seem like the perfect match for my project. They have little overhead and don't create OS threads like threading does. But will we have the same problems as with threading regarding CPU control? Does a green thread need to warn the CPU that it may take another one? I saw in some examples that Eventlet offers green versions of some built-in libraries, like urlopen, but not Requests.
The last option we considered was using Python's asyncio and async libraries such as aiohttp. I have done some basic experimenting with asyncio and wasn't very pleased. But I can understand that most of that comes from the fact that Python is not a naturally asynchronous language. From what I saw, it would behave something like Eventlet.
So what do you think would be the best option here? What library would allow me to maximize performance on a single-core machine, avoiding busy waits as much as possible?
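For reference, here is a minimal sketch of what the asyncio/aiohttp option looks like (the URLs are placeholders, not actual SQS endpoints): all requests are in flight concurrently on a single thread, and the event loop switches tasks at each await instead of busy-waiting:

import asyncio
import aiohttp

async def fetch(session, url):
    # while this request waits on the network, the event loop runs other tasks
    async with session.get(url) as resp:
        return await resp.text()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        # all requests run concurrently on one thread/core
        return await asyncio.gather(*(fetch(session, u) for u in urls))

if __name__ == "__main__":
    pages = asyncio.run(main(["https://example.com"] * 10))
    print(len(pages), "responses")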
Based on my understanding, threads cannot be executed in parallel (they are executed based on availability, seemingly at random), and that's the reason Eventlet is being used.
If Eventlet is more for parallelism, why can't we just use Python's multiprocessing module?
I thought of executing multiple processes and using the join() method to check whether all the processes are complete.
Can someone explain if my understanding is correct?
Based on my understanding, threads cannot be executed in parallel (they are executed based on availability, seemingly at random)
Correct
and that's the reason Eventlet is being used.
Not so correct. The Eventlet library is used to simplify non-blocking IO programming. It does not actually add parallelism. Thread execution is still limited to one thread at a time due to the GIL. But it is used because it greatly simplifies the process of launching, scheduling, and managing IO-bound threads, particularly ones that do not need to interact with each other.
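As an illustration, here is a minimal sketch of that style (the URLs are placeholders): monkey-patching makes the blocking stdlib socket calls cooperative, and a GreenPool launches and schedules the green threads:

import eventlet
eventlet.monkey_patch()  # make blocking stdlib calls (sockets, etc.) cooperative

from urllib.request import urlopen

def fetch(url):
    # each fetch runs in its own green thread; while one waits on the
    # network, the Eventlet hub switches to another
    return urlopen(url).read()

pool = eventlet.GreenPool(size=10)
urls = ["https://example.com"] * 5  # placeholder URLs
for body in pool.imap(fetch, urls):
    print(len(body), "bytes")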
If Eventlet is more for parallelism
As I just mentioned, this is not what they exist for.
why can't we just use Python's multiprocessing module? I thought of executing multiple processes and using the join() method to check whether all the processes are complete.
You certainly can! And you will get actual parallel execution with this approach. But you may not get the same speedup. The multiprocessing library is better suited for CPU-bound parallel tasks, because those are the ones that need more frequent access to the interpreter. You may actually see an increase in execution time when using multiprocessing with IO-bound tasks because of the overhead of multiple process execution and management.
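For completeness, here is a minimal sketch of the join() approach you describe, with a made-up CPU-bound placeholder as the work:

from multiprocessing import Process

def work(n):
    # placeholder CPU-bound task: sum of squares
    print(sum(i * i for i in range(n)))

if __name__ == "__main__":
    procs = [Process(target=work, args=(2_000_000,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # blocks until that process is complete
    print("all processes finished")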
As is the case with most optimization and execution time questions, trying both and profiling is the surefire way to guarantee you're using the "best" option for your application. Though you may find that if you write the code to utilize Eventlets first, then try to modify it to use regular threads or multiprocessing, you'll have to write more boilerplate code just to manage the threads or processes, and the value of Eventlets should become more obvious.
In my limited understanding, it is the performance factor that drives programming with multithreading in most cases, but not all (irrespective of Java or Python).
I was reading this enlightening article on the GIL on SO. The article summarizes that Python adopts the GIL mechanism, i.e. only a single thread can execute Python bytecode at any given time.
This makes single-threaded applications faster.
My question is as follows:
Since only one thread is served at any given point, do the multiprocessing or threading modules provide a way to overcome this limitation imposed by the GIL? If not, what features do they provide for doing real multi-task work?
There was a question asked in the comments section of the accepted answer on the above post, but it was never answered. I had this question in my mind too:
so at any point in time, only one thread will be serving content to a client...
so there is no point in actually using multithreading to improve performance, right?
You're right about the GIL: there is no point in using multithreading for CPU-bound computation, as the CPU will only ever be used by one thread.
But that previous statement may have enlightened you: if your computation is not CPU-bound, you may take advantage of multithreading.
A typical example is when your application spends most of its time waiting for something.
One of many, many examples of a non-CPU-bound program:
Say you want to build a web crawler. You have to crawl many, many websites and store them in a database. What costs time? Waiting for the servers to send data, actually downloading the data, and storing it in the database; nothing CPU-bound here. So you may get a faster crawler using a pool of crawlers instead of a single crawler. Typically, when one website is almost down and very slow to respond (~30 s), a single-threaded application will sit waiting for that website, and you're stuck. In a multithreaded application, the other threads will continue crawling, and that's cool.
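Here is a minimal sketch of such a crawler pool, with placeholder URLs; the GIL is released while a thread waits on the socket, so a slow site only stalls its own thread:

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

urls = ["https://example.com"] * 10  # placeholder list of sites to crawl

def crawl(url):
    # the GIL is released during the blocking socket wait,
    # so the other threads keep downloading in the meantime
    with urlopen(url, timeout=30) as resp:
        return resp.read()

with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(crawl, urls))

print(sum(len(p) for p in pages), "bytes fetched")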
On the other hand, as there is one GIL per process, you may use multiprocessing to do CPU-bound computation.
As a side note, there exist some more-or-less partial implementations of Python without the GIL. I'd like to mention one that I think is well on its way to achieving something cool: PyPy STM. Searching for "get rid of the GIL" will easily turn up a lot of threads on the subject.
Multiprocessing side-steps the GIL issue because code runs in a separate process while the GIL is only concerned with a single process. Within a process, multithreading may be faster to the extent that threads are waiting for some relatively slow resource like the disk or network.
A quick Google search yielded this informative slideshow: http://www.dabeaz.com/python/UnderstandingGIL.pdf
But what it fails to present is the fact that all threads are contained within a process. And because of the GIL, a CPython process effectively runs its Python threads on only one CPU (or core) at a time. So while the GIL, on a per-process basis, does manage the threads in that process and doesn't always deliver the expected performance, at large scales it should still perform better than single-threaded operation.
The GIL is always a hot topic in Python, but it is usually a non-issue. It makes most programs much safer. If you want real computational performance, try PyOpenCL. Any modern real-world high-performance number crunching should be done on GPUs (OpenCL also runs happily on CPUs), and there are no GIL issues there.
If you want to do multithreading in Python to improve I/O-bound performance, the GIL is not an issue.
Lastly, if you want to utilize multiple CPUs to increase the performance of your pure number crunching, and in a Pythonic fashion, use multiprocessing.
But it's still not as fast as coding your multithreaded application in assembly. Good luck not making typos.
We're all aware of the horrors of the GIL, and I've seen a lot of discussion about the right time to use the multiprocessing module, but I still don't feel that I have a good intuition about when threading in Python (focusing mainly on CPython) is the right answer.
What are instances in which the GIL is not a significant bottleneck? What are the types of use cases where threading is the most appropriate answer?
Threading really only makes sense if you have a lot of blocking I/O going on. If that's the case, then some threads can sleep while other threads work. If threads are CPU-bound, you're not likely to see much benefit from multithreading.
Note that the multiprocessing module, while more difficult to code for, makes use of separate processes and therefore doesn't suffer the downsides of the GIL.
Since you seem to be looking for examples, here are some off the top of my head and grabbed from searching for CPU-bound and I/O-bound examples (I can't seem to find many). I am no expert, so please feel free to correct anything I've miscategorized. It's also worth noting that advancing technology could move a problem from one category to another.
CPU Bound Tasks (use multiprocessing)
Numerical methods/approximations for mathematical functions (calculating digits of pi, etc.)
Image processing
Performing convolutions
Calculating transforms for graphics programming (possibly handled by GPU)
Audio/video compression/decompression
I/O Bound Tasks (threading is probably OK)
Sending data across a network
Writing to/reading from the disk
Asking for user input
Audio/video streaming
The GIL prevents Python from running more than one thread at a time (only one thread can execute Python bytecode at any given moment).
If your code releases the GIL before jumping into a C extension, other Python threads can continue while the C code runs, just like with the blocking I/O that other people have mentioned.
ctypes does this automatically, and so does NumPy. So if your code uses them a lot, it may not be significantly restricted by the GIL.
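A minimal sketch of this effect, assuming NumPy is installed: the matrix products run inside the underlying BLAS library with the GIL released, so these threads can genuinely overlap on multiple cores:

import threading
import numpy as np

a = np.random.rand(2000, 2000)

def heavy():
    # np.dot spends its time inside BLAS with the GIL released,
    # so several of these calls can run in parallel despite being threads
    np.dot(a, a)

threads = [threading.Thread(target=heavy) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("done")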
Besides CPU-bound and I/O-bound tasks, there are still more use cases. For example, threads enable concurrent tasks. A lot of GUI programming falls into this category: the main loop has to stay responsive to mouse events, so any time you have a task that takes a while and you don't want to freeze the UI, you run it on a separate thread. It is less about performance and more about concurrency.
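A minimal sketch of this pattern with tkinter (the three-second sleep is a stand-in for real work): the worker thread never touches the widgets; it posts its result to a queue that the main loop polls:

import queue
import threading
import time
import tkinter as tk

results = queue.Queue()

def slow_task():
    time.sleep(3)  # stand-in for a long-running job
    results.put("done")  # hand the result back to the UI thread

def poll():
    # runs on the main loop: check for a result without blocking the UI
    try:
        label.config(text=results.get_nowait())
    except queue.Empty:
        root.after(100, poll)  # try again in 100 ms

root = tk.Tk()
label = tk.Label(root, text="working...")
label.pack()
threading.Thread(target=slow_task, daemon=True).start()
root.after(100, poll)
root.mainloop()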
I have a Python application that grabs a collection of data, and for each piece of data in that collection it performs a task. The task takes some time to complete because there is a delay involved. Because of this delay, I don't want the pieces of data to perform the task one after another; I want them all to happen in parallel. Should I be using multiprocessing or threading for this operation?
I attempted to use threading but had some trouble; often some of the tasks would never actually fire.
If you are truly compute bound, using the multiprocessing module is probably the lightest weight solution (in terms of both memory consumption and implementation difficulty.)
If you are I/O bound, using the threading module will usually give you good results. Make sure that you use thread-safe storage (like queue.Queue) to hand data to your threads, or else hand each thread a single piece of data that is unique to it when it is spawned.
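Here is a minimal sketch of that queue pattern, with a made-up process() standing in for the real per-item task:

import queue
import threading

def process(item):
    # made-up per-item task; imagine a network call or other delay here
    print("processed", item)

data = range(10)
tasks = queue.Queue()

def worker():
    while True:
        item = tasks.get()
        if item is None:  # sentinel value: time to shut down
            break
        process(item)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for item in data:
    tasks.put(item)
for _ in threads:
    tasks.put(None)  # one sentinel per worker
for t in threads:
    t.join()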
PyPy is focused on performance. It has a number of features that can help with compute-bound processing. They also have support for Software Transactional Memory, although that is not yet production quality. The promise is that you can use simpler parallel or concurrent mechanisms than multiprocessing (which has some awkward requirements.)
Stackless Python is also a nice idea. Stackless has portability issues, as discussed below. Unladen Swallow was promising, but is now defunct. Pyston is another (unfinished) Python implementation focused on speed. It takes a different approach from PyPy, which may yield better (or just different) speedups.
Tasks run sequentially, but they give you the illusion that they run in parallel. Tasks are good for file or connection I/O because they are lightweight.
Multiprocessing with a Pool may be the right solution for you, because processes run in parallel, so they are very good for intensive computing, with each process running on one CPU (or core).
Setting up multiprocessing can be very easy:
from multiprocessing import Pool

def worker(input_item):
    # do_some_work is a placeholder; pass the item through to it
    output = do_some_work(input_item)
    return output

if __name__ == "__main__":
    pool = Pool()  # one process per CPU (or core); use Pool(4) to force 4 processes
    list_of_results = pool.map(worker, input_list)  # launches everything automatically
For small collections of data, simply create subprocesses with subprocess.Popen.
Each subprocess can simply get its piece of data from stdin or from command-line arguments, do its processing, and simply write the result to an output file.
When the subprocesses have all finished (or timed out), you simply merge the output files.
Very simple.
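A minimal sketch of that idea, assuming a hypothetical worker.py script that takes its piece of data as a command-line argument and prints its result:

import subprocess

chunks = ["a", "b", "c"]  # placeholder pieces of data

# one subprocess per piece of data, each writing to its own output file
procs = []
for chunk in chunks:
    out = open(f"out_{chunk}.txt", "w")
    procs.append(subprocess.Popen(["python", "worker.py", chunk], stdout=out))

for p in procs:
    p.wait()  # or p.wait(timeout=...) to enforce a timeout

# merge the output files
with open("merged.txt", "w") as merged:
    for chunk in chunks:
        with open(f"out_{chunk}.txt") as f:
            merged.write(f.read())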
You might consider looking into Stackless Python. If you have control over the function that takes a long time, you can just throw some stackless.schedule()s in there (yielding to the next coroutine), or else you can set Stackless to preemptive multitasking.
In Stackless, you don't have threads, but tasklets or greenlets which are essentially very lightweight threads. It works great in the sense that there's a pretty good framework with very little setup to get multitasking going.
However, Stackless hinders portability because you have to replace a few of the standard Python libraries -- Stackless removes reliance on the C stack. It's very portable if the next user also has Stackless installed, but that will rarely be the case.
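For a flavor of the cooperative style, here is a tiny sketch that only runs under the Stackless interpreter:

import stackless  # requires the Stackless Python interpreter

def task(name):
    for i in range(3):
        print(name, i)
        stackless.schedule()  # yield to the next runnable tasklet

stackless.tasklet(task)("a")
stackless.tasklet(task)("b")
stackless.run()  # run tasklets until all have finished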
Using CPython's threading model will not give you any performance improvement, because the threads are not actually executed in parallel, due to the GIL (which protects the interpreter's memory management). Multiprocessing would allow parallel execution. Obviously, in that case you have to have multiple cores available to farm your parallel jobs out to.
There is much more information available in this related question.
If you can easily partition and separate the data you have, it sounds like you should just do that partitioning externally and feed the pieces to several processes of your program (i.e. several processes instead of threads).
IronPython has real multithreading, unlike CPython and its GIL. So depending on what you're doing, it may be worth looking at. But it sounds like your use case is better suited to the multiprocessing module.
To the guy who recommends Stackless Python: I'm not an expert on it, but it seems to me that he's talking about software "multithreading", which is actually not parallel at all (it still runs in one physical thread, so it cannot scale to multiple cores). It's merely an alternative way to structure an asynchronous (but still single-threaded, non-parallel) application.
You may want to look at Twisted. It is designed for asynchronous network tasks.