Threads vs Asynchronous Execution in Python

Threads vs Asynchronous Execution in Python - python

I'm relatively new to threading and asynchronous programming in general, but I'm trying to understand the distinction between the two as it relates to the GIL in CPython.
From the reading that I've down, I understand that threads have their own stack and the two models are a different programming paradigm. But given that they cannot run concurrently because of the GIL, are python threads a type of asynchronous execution underneath? I'd really like to get a better understanding of how the python interpreter implements threading, specifically how it determines when one thread is blocking and another can be executed?

The GIL does only play a role when executing python code - calling Functions that are implemented in C, for example, GIL shouldn't interfere afaik. Also, downloading files or moving files from the disk can be made work concurrently with python Threads.
Quote from the Python Wiki:
Note that potentially blocking or long-running operations, such as I/O, image processing, and NumPy number crunching, happen outside the GIL. Therefore it is only in multithreaded programs that spend a lot of time inside the GIL, interpreting CPython bytecode, that the GIL becomes a bottleneck.
You might have a look on the multiprocessing module, which allows you to overcome the GIL and use multiple Cores on the machine. Also, there is work going on to make PyPy, an alternative Python Interpreter, become GIL-less in some day (just search for STM/AME).

Related

Parallel threading python GIL vs Java

I know that python has a GIL that make threads not be able to run at the same time therefore threading is just context switching.
Why is java different?
Threads on the same CPU in every language cannot run parallel.
Is creating new thread in java utilizes cores in multi core machine?
python can only spawn threads on the same CPU, in contrast to java?
If 1. Is the case, when using more threads than CPUs even in java it comes back to context switching again for several of them?
If 1. Is the case then how is it differ from multiprocessing? Because utilizing multiple cores isn't guaranteed?
Isn't the whole point of threading is being able to use the same memory space? If java does run some of them in multiple threads for perallelism, how do they really share memory?
Thank you

Why is java different?
Because it is able to effectively use multiple cores at the same time.
Does creating a new thread in java utilizes cores in multi core machine?
Yes.
Python can only spawn threads on the same CPU, in contrast to Java?
Java can spawn multiple threads which will on different CPUs. Java is not responsible for the actual thread scheduling. That is handled by the OS. And the OS may reschedule a thread to a different CPU to the one that it started on.
I am not sure about the precise details for Python, but I think the GIL is an implementation detail rather than something that it intrinsic to the language itself1. But either way, in a Python implementation, the GIL means that you would get little performance benefit in spawning threads on multiple cores. As this page says:
"The Python Global Interpreter Lock or GIL, in simple words, is a mutex (or a lock) that allows only one thread to hold the control of the Python interpreter."
If 1. is the case, when using more threads than CPUs does it come back to context switching in Java?
It depends. When switching a CPU between threads belonging to different processes, a full context switch is involved. But when switching between threads in the same process, only the (user) registers need to be switched. (The virtual memory registers and caches don't need to be switched / flushed because the threads share the same virtual address space.)
If 1. is the case then how is it differ from multiprocessing? Because utilizing multiple cores isn't guaranteed?
The key difference between multi-threading and multi-processing is that processes do not share any memory. By contrast, one thread in a process can see the memory of all of the others ... modulo issues of when changes are visible.
This difference has a variety of consequences.
Isn't the whole point of threading is being able to use the same memory space?
Yes, that is the main point ... when you compare multi-threading with multi-processing.
If Java does run some of them in multiple threads for parallelism ...
Java supports threads for many reasons. Parallelism is only one of those reasons. Others include multiplexing I/O and simplifying certain kinds of programming problem. These other reasons are also relevant to Python.
... how do [Java threads] really share memory?
The hardware deals with the issues of making the physical memory visible to all of the threads, and propagation of changes via the memory caches. It is complicated.
In Java the onus is on the programmer to "do the right thing" when threads make use of shared variables / objects. You need to use volatile variables, or synchronized blocks / methods, or something else that ensures that there is a happens before chain between a write and subsequent read. (Otherwise you can get issues with changes not being visible.)
This transfer of responsibility to the programmer allows the compiler to generate code with fewer main memory operations ... and hence that is faster. The downside is that if an application doesn't obey the rules, it is liable to behave in unexpected ways.
By contrast, in Python the memory model is unspecified, but there is an expectation (by millions of Python programmers) that it will behave in an intuitive fashion; i.e. a shared variable write performed by one thread will immediately be visible to other threads. This is hard to achieve efficiently while also allowing Python threads to run in parallel.
1 - While the GIL is not formally part of the Python spec, the influence of GIL on the (unspecified!) Python memory model and Python programmers assumptions make it more than merely an implementation detail. It remains to be seen if Python can successfully evolve into a language where multi-threading can use multiple cores effectively.

Not a complete answer here, but just adding a couple of things that Stephen C didn't already say:
Python can only spawn threads on the same CPU, in contrast to java?
That would be an optimization, not an essential fact. There's no reason in principle why Python could not simply allow the OS to schedule its threads on whatever CPU happened to be available at any given time.
OTOH, given that no two Python threads can do significant work at the same time, it potentially could improve performance if the threads all had affinity for the same CPU. (See what Stephen C said about "full context switch" vs. "only the (user) registers."
Giving user-mode processes control over processor affinity is a relatively new feature in some operating systems. I have no idea of whether or not any Python version actually uses that feature.
If java does run...multiple threads for parallelism...?
Java doesn't "run multiple threads for parallelism." Your Java program creates multiple threads for whatever reason you happen to want them. Most modern OSs provide threads. Java simply makes that ability available to application programmers in a way that is tightly integrated with the language itself. You are free to use them (or not) however you see fit.

How to speed up nested loops in python with concurrency?

i have the following code:
def multiple_invoice_matches(payment_regex, invoice_regex):
multiple_invoice_payment_matches=[]
for p in payment_regex:
if p["match_count"]>1:
for k in p["matches"]:
for i in invoice_regex:
if i["rechnung_nr"] ==k:
multiple_invoice_payment_matches.append({"fuzzy_ratio":100, "type":2, "m_match":0, "invoice":i, "payment":p})
return multiple_invoice_payment_matches
The sizes of payment_regex and invoice_regex are really huge. Therefore, the code snippet give above takes too much time to return the result. How can I speed up running time of this code?

You could take a look at the numba library, if your data has the possibility of parallelization, rewrite your function using the numba library would definitely speed up your code.
Without the dimensions of size and how your data is structured it's kind of hard to give a general approach to optimize your function.

I could say partition your data into multiple ranges (either by payment_regex, or by invoice_regex, or both) and then add those partitions to a work queue that is processed by multiple threads. Wait for those threads to finish (i.e.: join them), and then construct your final list based on the partial results you got for each partition.
This will work well in other programming languages, but unfortunately, not in Python, because of GIL - the Python's Global Interpreter Lock.
If you don't know much about GIL here's a decent article, saying:
The Python Global Interpreter Lock or GIL, in simple words,
is a mutex (or a lock) that allows only one thread to hold
the control of the Python interpreter.
[...]
The impact of the GIL isn’t visible to developers who execute
single-threaded programs, but it can be a performance bottleneck
in CPU-bound and multi-threaded code.
To evade GIL you basically have two options:
(1) spawn multiple Python processes and use shared memory for backing up your data => concurrency will now rely on the OS for switching between processes (e.g.: use numpy and shared memory, see here)
(2) use a Python package that can manipulate your data and implements the multi-threading model in C, where GIL is not effective (e.g.: use numba)
You may ask yourself then why Python supports multi-threading in the first place?
Multi-threading in Python is mostly useful when the threads are blocked by IO operations (read/write of files, sockets, etc.) or by other system calls that put the thread in the sleep state. That's where Python releases the GIL lock and other threads can operate concurrently while some are at sleep.

Having no performance problems with threading and Python's Global Interpreter Lock. Scalability?

I have been having no problems with performance with Python's Global Interpreter Lock. I've had to make a few things thread-safe - despite common advice, the GIL does NOT automatically guarantee thread-safety - but I've got a program commonly running upwards of 10 threads, where all of them can be active at any time, including together. It is a somewhat complex asynchronous messaging system.
I understand multiprocessing and am even using Celery in this program, but the solution would have to be very convoluted to work through multiprocessing for this problem set.
I'm running 2.7 and using recursive locks despite their performance penalties.
My question is this: will I run into scaling problems with the GIL? I have seen no performance problems with it so far. Measuring this is...problematic. Is there a number of threads or something similar that you hit and it just starts choking? Does GIL performance differ significantly from executing multi-threaded code on a single-core CPU?
Thanks!

The GIL is a complex topic and the exact behavior in your case is hard to explain without your code. So I can not tell you if you will run into troubles in future. I can just advise to bring you project to a recent version of Python 3 if posdible. There have been many improvents made to the GIL in Python 3.
The is nothing like a magic number of threads at which Python will break. The general rule is just: The more threads, the more problem. Most complicated is going from one to two.
The GIL is released in some situations, especially when C code is executed or I/O is done. This allows code to run in parallel. With the advanced featured of modern CPUs is wouldn't be wise to limit your code to just one CPU.

When are Python threads fast?

We're all aware of the horrors of the GIL, and I've seen a lot of discussion about the right time to use the multiprocessing module, but I still don't feel that I have a good intuition about when threading in Python (focusing mainly on CPython) is the right answer.
What are instances in which the GIL is not a significant bottleneck? What are the types of use cases where threading is the most appropriate answer?

Threading really only makes sense if you have a lot of blocking I/O going on. If that's the case, then some threads can sleep while other threads work. If threads are CPU-bound, you're not likely to see much benefit from multithreading.
Note that the multiprocessing module, while more difficult to code for, makes use of separate processes and therefore doesn't suffer the downsides of the GIL.

Since you seem to be looking for examples, here are some off the top of my head and grabbed from searching for CPU-bound and I/O-bound examples (I can't seem to find many). I am no expert, so please feel free to correct anything I've miscategorized. It's also worth noting that advancing technology could move a problem from one category to another.
CPU Bound Tasks (use multiprocessing)
Numerical methods/approximations for mathematical functions (calculating digits of pi, etc.)
Image processing
Performing convolutions
Calculating transforms for graphics programming (possibly handled by GPU)
Audio/video compression/decompression
I/O Bound Tasks (threading is probably OK)
Sending data across a network
Writing to/reading from the disk
Asking for user input
Audio/video streaming

The GIL prevents python from running multiple threads.
If your code releases the GIL before jumping into a C extension, other python threads can continue while the C code runs. Like with the blocking IO, that other people have mentioned.
Ctypes does this automatically, and so does numpy. So if your code uses them a lot, it may not be significantly restricted by the GIL.

Besides the CPU bound and I/O bound tasks, there is still more use cases. For example, thread enables concurrent tasks. A lot of GUI programming fall into this category. The main loop have to be responsive to mouse events. So anytime you have a task that take a while and you don't want to freeze the UI, you do it on a separate thread. It is less about performance and more about parallelism.

Python Global Interpreter Lock (GIL) workaround on multi-core systems using taskset on Linux?

So I just finished watching this talk on the Python Global Interpreter Lock (GIL) http://blip.tv/file/2232410.
The gist of it is that the GIL is a pretty good design for single core systems (Python essentially leaves the thread handling/scheduling up to the operating system). But that this can seriously backfire on multi-core systems and you end up with IO intensive threads being heavily blocked by CPU intensive threads, the expense of context switching, the ctrl-C problem[*] and so on.
So since the GIL limits us to basically executing a Python program on one CPU my thought is why not accept this and simply use taskset on Linux to set the affinity of the program to a certain core/cpu on the system (especially in a situation with multiple Python apps running on a multi-core system)?
So ultimately my question is this: has anyone tried using taskset on Linux with Python applications (especially when running multiple applications on a Linux system so that multiple cores can be used with one or two Python applications bound to a specific core) and if so what were the results? is it worth doing? Does it make things worse for certain workloads? I plan to do this and test it out (basically see if the program takes more or less time to run) but would love to hear from others as to your experiences.
Addition: David Beazley (the guy giving the talk in the linked video) pointed out that some C/C++ extensions manually release the GIL lock and if these extensions are optimized for multi-core (i.e. scientific or numeric data analysis/etc.) then rather than getting the benefits of multi-core for number crunching the extension would be effectively crippled in that it is limited to a single core (thus potentially slowing your program down significantly). On the other hand if you aren't using extensions such as this
The reason I am not using the multiprocessing module is that (in this case) part of the program is heavily network I/O bound (HTTP requests) so having a pool of worker threads is a GREAT way to squeeze performance out of a box since a thread fires off an HTTP request and then since it's waiting on I/O gives up the GIL and another thread can do it's thing, so that part of the program can easily run 100+ threads without hurting the CPU much and let me actually use the network bandwidth that is available. As for stackless Python/etc I'm not overly interested in rewriting the program or replacing my Python stack (availability would also be a concern).
[*] Only the main thread can receive signals so if you send a ctrl-C the Python interpreter basically tries to get the main thread to run so it can handle the signal, but since it doesn't directly control which thread is run (this is left to the operating system) it basically tells the OS to keep switching threads until it eventually hits the main thread (which if you are unlucky may take a while).

Another solution is:
http://docs.python.org/library/multiprocessing.html
Note 1: This is not a limitation of the Python language, but of CPython implementation.
Note 2: With regard to affinity, your OS shouldn't have a problem doing that itself.

I have never heard of anyone using taskset for a performance gain with Python. Doesn't mean it can't happen in your case, but definitely publish your results so others can critique your benchmarking methods and provide validation.
Personally though, I would decouple your I/O threads from the CPU bound threads using a message queue. That way your front end is now completely network I/O bound (some with HTTP interface, some with message queue interface) and ideal for your threading situation. Then the CPU intense processes can either use multiprocessing or just be individual processes waiting for work to arrive on the message queue.
In the longer term you might also want to consider replacing your threaded I/O front-end with Twisted or some thing like eventlets because, even if they won't help performance they should improve scalability. Your back-end is now already scalable because you can run your message queue over any number of machines+cpus as needed.

An interesting solution is the experiment reported by Ryan Kelly on his blog: http://www.rfk.id.au/blog/entry/a-gil-adventure-threading2/
The results seems very satisfactory.

I've found the following rule of thumb sufficient over the years: If the workers are dependent on some shared state, I use one multiprocessing process per core (CPU bound), and per core a fix pool of worker threads (I/O bound). The OS will take care of assigining the different Python processes to the cores.

The Python GIL is per Python interpreter. That means the only to avoid problems with it while doing multiprocessing is simply starting multiple interpreters (i.e. using seperate processes instead of threads for concurrency) and then using some other IPC primitive for communication between the processes (such as sockets). That being said, the GIL is not a problem when using threads with blocking I/O calls.
The main problem of the GIL as mentioned earlier is that you can't execute 2 different python code threads at the same time. A thread blocking on a blocking I/O call is blocked and hence not executin python code. This means it is not blocking the GIL. If you have two CPU intensive tasks in seperate python threads, that's where the GIL kills multi-processing in Python (only the CPython implementation, as pointed out earlier). Because the GIL stops CPU #1 from executing a python thread while CPU #0 is busy executing the other python thread.

Until such time as the GIL is removed from Python, co-routines may be used in place of threads. I have it on good authority that this strategy has been implemented by two successful start-ups, using greenlets in at least one case.

This is a pretty old question but since everytime I search about information related to python and performance on multi-core systems this post is always on the result list, I would not let this past before me an do not share my thoughts.
You can use the multiprocessing module that rather than create threads for each task, it creates another process of cpython compier interpreting your code.
It would make your application to take advantage of multicore systems.
The only problem that I see on this approach is that you will have a considerable overhead by creating an entire new process stack on memory. (http://en.wikipedia.org/wiki/Thread_(computing)#How_threads_differ_from_processes)
Python Multiprocessing module:
http://docs.python.org/dev/library/multiprocessing.html
"The reason I am not using the multiprocessing module is that (in this case) part of the program is heavily network I/O bound (HTTP requests) so having a pool of worker threads is a GREAT way to squeeze performance out of a box..."
About this, I guess that you can have also a pool of process too: http://docs.python.org/dev/library/multiprocessing.html#using-a-pool-of-workers
Att,
Leo

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.