How does dask achieve parallelism?

I don't quite understand dask's parallelism model (https://docs.dask.org/en/latest/delayed-best-practices.html)
Given that python is single-threaded, what performance benefit can delayed actually offer? My understanding is it infers independent processes/functions as parts of a graph and then executes them in "parallel", but how is that possible?
I see how they might be "concurrent" processes, but even so - given that the function is sync, how can it perform any concurrent processes?

Simple: Python is not "single-threaded"; it can run many threads simultaneously. You may be thinking of the global interpreter lock (GIL), which makes the interpreter run exactly one operation at a time from one of the threads. Many libraries do not need to hold the GIL, however, so thread-based parallelism is real and useful in many cases. This is generally true for numerical libraries (pandas, ...) and other things that do most of their work in compiled C/C++ code.
In addition, Dask supports process-based parallelism, which bypasses the GIL at the cost of communication and memory overhead. Whether this is better or worse for you will depend on your workload.
Finally, the distributed scheduler is ideal even on a single machine, because it enables you to choose the threads/processes mix that is right for whatever you are doing.
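To make this concrete, here is a minimal dask.delayed sketch (the load/process bodies are stand-ins, not dask API): nothing runs until compute is called, and the scheduler keyword picks the threads/processes mix described above.

from dask import delayed, compute

@delayed
def load(n):
    return list(range(n))          # stand-in for reading data

@delayed
def process(data):
    return sum(data)               # stand-in for a GIL-releasing numeric transform

tasks = [process(load(n)) for n in (10, 20, 30)]   # builds a task graph; nothing has run yet
print(compute(*tasks, scheduler="threads"))    # thread pool: shared memory, scales when the GIL is released
print(compute(*tasks, scheduler="processes"))  # process pool: sidesteps the GIL, pays serialization overhead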

Parallel threading python GIL vs Java

I know that Python has a GIL that prevents threads from running at the same time, so threading is just context switching.
Why is Java different?
1. Threads on the same CPU cannot run in parallel, in any language.
2. Does creating a new thread in Java utilize cores on a multi-core machine?
3. Can Python only spawn threads on the same CPU, in contrast to Java?
4. If 1. is the case, when using more threads than CPUs, does even Java come back to context switching for several of them?
5. If 1. is the case, how does it differ from multiprocessing, since utilizing multiple cores isn't guaranteed?
6. Isn't the whole point of threading being able to use the same memory space? If Java does run some of them in multiple threads for parallelism, how do they really share memory?
Thank you
Why is Java different?
Because it is able to effectively use multiple cores at the same time.
Does creating a new thread in Java utilize cores on a multi-core machine?
Yes.
Python can only spawn threads on the same CPU, in contrast to Java?
Java can spawn multiple threads which will run on different CPUs. Java is not responsible for the actual thread scheduling; that is handled by the OS, and the OS may reschedule a thread onto a different CPU from the one that it started on.
I am not sure about the precise details for Python, but I think the GIL is an implementation detail rather than something that is intrinsic to the language itself¹. Either way, in a Python implementation the GIL means that you would get little performance benefit from spawning threads on multiple cores. As this page says:
"The Python Global Interpreter Lock or GIL, in simple words, is a mutex (or a lock) that allows only one thread to hold the control of the Python interpreter."
If 1. is the case, when using more threads than CPUs does it come back to context switching in Java?
It depends. When switching a CPU between threads belonging to different processes, a full context switch is involved. But when switching between threads in the same process, only the (user) registers need to be switched. (The virtual memory registers and caches don't need to be switched / flushed because the threads share the same virtual address space.)
If 1. is the case, how does it differ from multiprocessing, since utilizing multiple cores isn't guaranteed?
The key difference between multi-threading and multi-processing is that processes do not share any memory. By contrast, one thread in a process can see the memory of all of the others ... modulo issues of when changes are visible.
This difference has a variety of consequences.
Isn't the whole point of threading being able to use the same memory space?
Yes, that is the main point ... when you compare multi-threading with multi-processing.
If Java does run some of them in multiple threads for parallelism ...
Java supports threads for many reasons. Parallelism is only one of those reasons. Others include multiplexing I/O and simplifying certain kinds of programming problem. These other reasons are also relevant to Python.
... how do [Java threads] really share memory?
The hardware deals with the issues of making the physical memory visible to all of the threads, and propagation of changes via the memory caches. It is complicated.
In Java the onus is on the programmer to "do the right thing" when threads make use of shared variables / objects. You need to use volatile variables, or synchronized blocks / methods, or something else that ensures that there is a happens before chain between a write and subsequent read. (Otherwise you can get issues with changes not being visible.)
This transfer of responsibility to the programmer allows the compiler to generate code with fewer main memory operations ... and hence that is faster. The downside is that if an application doesn't obey the rules, it is liable to behave in unexpected ways.
By contrast, in Python the memory model is unspecified, but there is an expectation (by millions of Python programmers) that it will behave in an intuitive fashion; i.e. a shared variable write performed by one thread will immediately be visible to other threads. This is hard to achieve efficiently while also allowing Python threads to run in parallel.
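As a minimal sketch of that expectation (plain CPython, standard library only): a write performed in one thread is observed by the main thread after the join.

import threading, time

shared = {"done": False}      # one object, visible to every thread in the process

def worker():
    time.sleep(0.1)
    shared["done"] = True     # write in the worker thread...

t = threading.Thread(target=worker)
t.start()
t.join()
print(shared["done"])         # ...read in the main thread: prints True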
¹ While the GIL is not formally part of the Python spec, the influence of the GIL on the (unspecified!) Python memory model and on Python programmers' assumptions makes it more than merely an implementation detail. It remains to be seen whether Python can successfully evolve into a language where multi-threading can use multiple cores effectively.
Not a complete answer here, but just adding a couple of things that Stephen C didn't already say:
Can Python only spawn threads on the same CPU, in contrast to Java?
That would be an optimization, not an essential fact. There's no reason in principle why Python could not simply allow the OS to schedule its threads on whatever CPU happened to be available at any given time.
OTOH, given that no two Python threads can do significant work at the same time, it potentially could improve performance if the threads all had affinity for the same CPU. (See what Stephen C said about "full context switch" vs. "only the (user) registers".)
Giving user-mode processes control over processor affinity is a relatively new feature in some operating systems. I have no idea of whether or not any Python version actually uses that feature.
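For what it's worth, CPython does expose the OS affinity API on Linux via os.sched_setaffinity (a Linux-only sketch; availability on other platforms is not assumed):

import os

os.sched_setaffinity(0, {0})        # pin the current process (pid 0 = self) to core 0
print(os.sched_getaffinity(0))      # -> {0}: all of its threads now share that one core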
If Java does run...multiple threads for parallelism...?
Java doesn't "run multiple threads for parallelism." Your Java program creates multiple threads for whatever reason you happen to want them. Most modern OSs provide threads. Java simply makes that ability available to application programmers in a way that is tightly integrated with the language itself. You are free to use them (or not) however you see fit.

How to speed up nested loops in python with concurrency?

I have the following code:
def multiple_invoice_matches(payment_regex, invoice_regex):
    multiple_invoice_payment_matches = []
    for p in payment_regex:
        if p["match_count"] > 1:
            for k in p["matches"]:
                for i in invoice_regex:
                    if i["rechnung_nr"] == k:
                        multiple_invoice_payment_matches.append(
                            {"fuzzy_ratio": 100, "type": 2, "m_match": 0,
                             "invoice": i, "payment": p})
    return multiple_invoice_payment_matches
The sizes of payment_regex and invoice_regex are really huge. Therefore, the code snippet given above takes too much time to return a result. How can I speed up the running time of this code?
You could take a look at the numba library. If your data lends itself to parallelization, rewriting your function with numba would definitely speed up your code.
Without knowing the dimensions of your data and how it is structured, it's hard to give a general approach to optimizing your function.
I could say: partition your data into multiple ranges (either by payment_regex, or by invoice_regex, or both), then add those partitions to a work queue that is processed by multiple threads. Wait for those threads to finish (i.e. join them), and then construct your final list based on the partial results you got for each partition.
This will work well in other programming languages, but unfortunately, not in Python, because of GIL - the Python's Global Interpreter Lock.
If you don't know much about the GIL, here's a decent article, which says:
The Python Global Interpreter Lock or GIL, in simple words, is a mutex (or a lock) that allows only one thread to hold the control of the Python interpreter.
[...]
The impact of the GIL isn't visible to developers who execute single-threaded programs, but it can be a performance bottleneck in CPU-bound and multi-threaded code.
To evade GIL you basically have two options:
(1) spawn multiple Python processes and use shared memory for backing up your data => concurrency will now rely on the OS for switching between processes (e.g.: use numpy and shared memory, see here)
(2) use a Python package that can manipulate your data and implements the multi-threading model in C, where GIL is not effective (e.g.: use numba)
You may then ask yourself: why does Python support multi-threading in the first place?
Multi-threading in Python is mostly useful when threads are blocked by I/O operations (reading/writing files, sockets, etc.) or by other system calls that put the thread into the sleep state. That is where Python releases the GIL, and other threads can operate concurrently while some are asleep.
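As a rough sketch of option (1), here is how the original function could be partitioned across processes with the standard multiprocessing module (match_chunk and multiple_invoice_matches_parallel are hypothetical helpers; note that every worker receives its own copy of invoice_regex, which is exactly the communication overhead mentioned above):

from multiprocessing import Pool

def match_chunk(args):
    payments_chunk, invoices = args
    matches = []
    for p in payments_chunk:                      # same logic as the original nested loop
        if p["match_count"] > 1:
            for k in p["matches"]:
                for i in invoices:
                    if i["rechnung_nr"] == k:
                        matches.append({"fuzzy_ratio": 100, "type": 2,
                                        "m_match": 0, "invoice": i, "payment": p})
    return matches

def multiple_invoice_matches_parallel(payment_regex, invoice_regex, workers=4):
    size = len(payment_regex) // workers + 1
    chunks = [(payment_regex[i:i + size], invoice_regex)
              for i in range(0, len(payment_regex), size)]
    with Pool(workers) as pool:                   # one OS process per worker
        parts = pool.map(match_chunk, chunks)
    return [m for part in parts for m in part]    # merge the partial results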

When are Python threads fast?

We're all aware of the horrors of the GIL, and I've seen a lot of discussion about the right time to use the multiprocessing module, but I still don't feel that I have a good intuition about when threading in Python (focusing mainly on CPython) is the right answer.
What are instances in which the GIL is not a significant bottleneck? What are the types of use cases where threading is the most appropriate answer?
Threading really only makes sense if you have a lot of blocking I/O going on. If that's the case, then some threads can sleep while other threads work. If threads are CPU-bound, you're not likely to see much benefit from multithreading.
Note that the multiprocessing module, while more difficult to code for, makes use of separate processes and therefore doesn't suffer the downsides of the GIL.
Since you seem to be looking for examples, here are some off the top of my head and grabbed from searching for CPU-bound and I/O-bound examples (I can't seem to find many). I am no expert, so please feel free to correct anything I've miscategorized. It's also worth noting that advancing technology could move a problem from one category to another.
CPU Bound Tasks (use multiprocessing)
Numerical methods/approximations for mathematical functions (calculating digits of pi, etc.)
Image processing
Performing convolutions
Calculating transforms for graphics programming (possibly handled by GPU)
Audio/video compression/decompression
I/O Bound Tasks (threading is probably OK)
Sending data across a network
Writing to/reading from the disk
Asking for user input
Audio/video streaming
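A minimal sketch of the I/O-bound case with the standard library (the URL list is just a placeholder): while one thread waits on the network, the GIL is released and the others proceed.

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    with urlopen(url) as resp:          # thread blocks in the OS; GIL is released while waiting
        return len(resp.read())

urls = ["https://example.com"] * 5      # placeholder targets
with ThreadPoolExecutor(max_workers=5) as pool:
    print(list(pool.map(fetch, urls)))  # downloads overlap despite the GIL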
The GIL prevents Python from running bytecode in more than one thread at a time.
If your code releases the GIL before jumping into a C extension, other Python threads can continue while the C code runs, just like with the blocking I/O that other people have mentioned.
Ctypes does this automatically, and so does numpy. So if your code uses them a lot, it may not be significantly restricted by the GIL.
Besides CPU-bound and I/O-bound tasks, there are still more use cases. For example, threads enable concurrent tasks; a lot of GUI programming falls into this category. The main loop has to stay responsive to mouse events, so any time you have a task that takes a while and you don't want to freeze the UI, you run it on a separate thread. It is less about performance and more about concurrency.
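A small tkinter sketch of that pattern (standard library only; the sleep stands in for the slow task): the work happens on a worker thread, and the result is handed back to the main loop through a queue, since tkinter widgets should only be touched from the thread that created them.

import queue, threading, time, tkinter as tk

msgs = queue.Queue()

def long_task():
    time.sleep(3)              # stand-in for slow work
    msgs.put("done")           # hand the result to the GUI thread

def poll():
    try:
        label.config(text=msgs.get_nowait())
    except queue.Empty:
        pass
    root.after(100, poll)      # re-check from the main loop; the UI stays responsive

root = tk.Tk()
label = tk.Label(root, text="working...")
label.pack()
threading.Thread(target=long_task, daemon=True).start()
poll()
root.mainloop()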

Would limiting a GILed Python program to a single CPU boost performance?

Following up on David Beazley's paper regarding Python and the GIL, would it be a good practice to limit a Python program (CPython, with GIL and all) to a single CPU on a Windows-based multi-core system?
Would it boost performance?
UPDATE: Assume multiple threads are used (not sure if it makes a difference)
The paper does indeed imply that limiting a program to a single core would improve performance in that particular case. However, there are a number of concerns that you would need to deal with:
His tests are mainly for compute intensive threads rather than IO bound threads. If the threads you are using often block voluntarily (such as in a web server waiting for a client) then you don't run into GIL issues at all.
The GIL issues deal specifically with threads and not processes. I may be reading your question wrong, but you seem to be asking about restricting all Python programs to a single core. Programs using processes for parallelism don't suffer from GIL issues and restricting them to a single core will make them slower.
The GIL is drastically different in Python 3.2 (as David mentions in this video). The GIL was changed explicitly to deal with such issues; while it still has problems, it no longer has this particular one.
In summary, the only time you want to complicate your life by forcing the OS to restrict the program to a single core is when you are running a:
Multithreaded
Compute-intensive
Pre-3.2 Python
program on a multicore machine.
Bias: for parallel computing involving heavy CPU processing, I much prefer message passing and cooperating processes to thread programming (of course, it depends on the problem).
You shouldn't limit your programs to one core. Beazley was just demonstrating a specific problem that performed poorly under those unique conditions (an I/O-bound thread and a CPU-bound thread competing against each other). Ideally you want to avoid those conditions by using a different method (import multiprocessing).
I think the best solution is to put your CPU bound tasks in other processes using the multiprocessing module so that they utilize their own cores, and IO bound tasks in threads (or microthreads/coroutines, if you read his interesting paper on that: http://www.dabeaz.com/coroutines/) since the GIL is best suited for those types of tasks.
Conclusion: Python threads are best suited for IO bound tasks, NOT CPU bound.

multiprocess or threading in python?

I have a python application that grabs a collection of data and for each piece of data in that collection it performs a task. The task takes some time to complete as there is a delay involved. Because of this delay, I don't want each piece of data to perform the task subsequently, I want them to all happen in parallel. Should I be using multiprocess? or threading for this operation?
I attempted to use threading but had some trouble, often some of the tasks would never actually fire.
If you are truly compute bound, using the multiprocessing module is probably the lightest weight solution (in terms of both memory consumption and implementation difficulty.)
If you are I/O bound, using the threading module will usually give you good results. Make sure that you use thread-safe storage (like queue.Queue) to hand data to your threads, or else hand each thread a single piece of data that is unique to it when it is spawned.
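A minimal sketch of that pattern (do_task and collection are hypothetical placeholders for the question's per-item task and data):

import queue, threading

q = queue.Queue()

def worker():
    while True:
        item = q.get()
        if item is None:          # sentinel: time to exit
            break
        do_task(item)             # hypothetical I/O-bound task

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for item in collection:          # hypothetical data collection from the question
    q.put(item)
for _ in threads:                # one sentinel per worker
    q.put(None)
for t in threads:
    t.join()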
PyPy is focused on performance. It has a number of features that can help with compute-bound processing. They also have support for Software Transactional Memory, although that is not yet production quality. The promise is that you can use simpler parallel or concurrent mechanisms than multiprocessing (which has some awkward requirements.)
Stackless Python is also a nice idea, though it has portability issues (as noted below). Unladen Swallow was promising, but is now defunct. Pyston is another (unfinished) Python implementation focusing on speed; it takes a different approach from PyPy, which may yield better (or just different) speedups.
Tasks run sequentially, but you have the illusion that they run in parallel. Tasks are good for file or connection I/O, because they are lightweight.
Multiprocessing with Pool may be the right solution for you, because processes run in parallel, which makes them very good for intensive computing; each process runs on one CPU (or core).
Setting up multiprocessing is very easy:
from multiprocessing import Pool

def worker(input_item):
    output = do_some_work(input_item)  # your per-item task
    return output

if __name__ == "__main__":
    pool = Pool()  # makes one process per CPU (or core); use Pool(4) to force 4 processes, for example
    list_of_results = pool.map(worker, input_list)  # launches all of them automatically
For small collections of data, simply create subprocesses with subprocess.Popen.
Each subprocess can simply get its piece of data from stdin or from command-line arguments, do its processing, and simply write the result to an output file.
When the subprocesses have all finished (or timed out), you simply merge the output files.
Very simple.
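A hedged sketch of that approach, where worker.py is a hypothetical script that takes its datum and an output path as command-line arguments:

import subprocess, sys

items = ["a", "b", "c"]                             # placeholder data
procs = [subprocess.Popen([sys.executable, "worker.py", item, f"out_{i}.txt"])
         for i, item in enumerate(items)]           # all workers run in parallel
for p in procs:
    p.wait()                                        # or p.wait(timeout=...) to enforce a time limit
with open("merged.txt", "w") as out:
    for i in range(len(items)):                     # merge the per-worker output files
        out.write(open(f"out_{i}.txt").read())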
You might consider looking into Stackless Python. If you have control over the function that takes a long time, you can just throw some stackless.schedule()s in there (saying yield to the next coroutine), or else you can set Stackless to preemptive multitasking.
In Stackless, you don't have threads, but tasklets or greenlets which are essentially very lightweight threads. It works great in the sense that there's a pretty good framework with very little setup to get multitasking going.
However, Stackless hinders portability because you have to replace a few of the standard Python libraries -- Stackless removes reliance on the C stack. It's very portable if the next user also has Stackless installed, but that will rarely be the case.
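A tiny sketch of the cooperative style described above (this requires the Stackless Python interpreter, not CPython; treat the exact API as an assumption):

import stackless

def task(name):
    for i in range(3):
        print(name, i)
        stackless.schedule()   # yield to the next tasklet

stackless.tasklet(task)("A")   # create two lightweight tasklets
stackless.tasklet(task)("B")
stackless.run()                # runs them to completion; output interleaves A and B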
Using CPython's threading model will not give you any performance improvement, because the threads are not actually executed in parallel, due to the global interpreter lock (which exists largely to protect CPython's reference-counting memory management). Multiprocessing would allow parallel execution. Obviously, in this case you have to have multiple cores available to farm your parallel jobs out to.
There is much more information available in this related question.
If you can easily partition and separate the data you have, it sounds like you should just do that partitioning externally and feed the pieces to several processes of your program (i.e. several processes instead of threads).
IronPython has real multithreading, unlike CPython and its GIL. So, depending on what you're doing, it may be worth looking at. But it sounds like your use case is better suited to the multiprocessing module.
To the guy who recommends Stackless Python: I'm not an expert on it, but it seems to me that he's talking about software "multithreading", which is actually not parallel at all (it still runs in one physical thread, so it cannot scale to multiple cores). It's merely an alternative way to structure an asynchronous (but still single-threaded, non-parallel) application.
You may want to look at Twisted. It is designed for asynchronous network tasks.
