Should I use Python native multithread or Multiple Tasks in Airflow?

Should I use Python native multithread or Multiple Tasks in Airflow? - python

I'm refactoring a .NET application to airflow. This .NET application uses multiple threads to extract and process data from a mongoDB (Without multiple threads the process takes ~ 10hrs, with multi threads i can reduce this) .
In each documment on mongoDB I have a key value namedprocess. This value is used to control which thread process the documment. I'm going to develop an Airflow DAG to optimize this process. My doubt is about performance and the best way to do this.
My application should have multiple tasks (I will control the process variable in the input of the python method). Or should I use only 1 task and use Python MultiThreading inside this task? The image below illustrates my doubt.
Multi Task X Single Task (Multi Threading)
I know that using MultiTask I'm going to do more DB Reads (1 per task). Although, using Python Multi Threading I know I'll have to do a lot of control processing inside de task method. What is the best, fastest and optimized way to do this?

It really depends on the nature of your processing.
Multi-threading in Python can be limiting because of GIL (Global Interpreter Lock) - there are some operations that require exclusive lock, and this limit the parallelism it can achieve. Especially if you mix CPU and I/O operations the effects might be that a lot of time is spent by threads waiting for the lock. But it really depends on what you do - you need to experiment to see if GIL affects your multithreading.
Multiprocessing (which is used by Airflow for Local Executor) is better because each process runs effectively a separate Python interpreter. So each process has it's own GIL - at the expense of resources used (each process uses it's own memory, sockets and so on). Each task in Airlfow will run in a separate process.
However Airflow offers a bit more - it also offers multi-machine. You can run separate workers With X processes on Y machines, effectively running up to X*Y processes at a time.
Unfortunately, Airflow is (currently) not well suited to run dynamic number of parallel tasks of the same type. Specifically if you would like to split load to process to N pieces and run each piece in a separate task - this would only really work if N is constant and does not change over time for the same DAG (like if you know you have 10 machines, with 4 CPUs, you's typically want to run 10*4 = 40 tasks at a time, so you'd have to split your job into 40 tasks. And it cannot change dynamically between runs really - you'd have to write your DAG to run 40 parallel tasks every time it runs.
Not sure if I helped, but there is no single "best optimised" answer - you need to experiment and check what works best for your case.

Related

Celery vs threading for short tasks (5 seconds each)

I implemented a script in which every day I process several urls and make many I/O operations, and I am subclassing threading.Thread and starting a number of threads (say 32).
The workload varies day by day but as soon as the processing starts I am sure that no more tasks will be added to the input queue.
Also, my script is not supporting any front-end (at least for now).
I feel though that this solution will not be so easily scalable in the case of multiple processes / machines and would like to give Celery (or any distributed task queue) a shot, but I always read that it’s better suited for long-running tasks running in the background to avoid blocking a UI.
On the other hand, I have also read that having many small tasks is not a problem with Celery.
What’s your thought on this? Would be easy to scale Celery workers possibly across processes / machines?

Python multiprocessing and concurency vs paralellism

I have created and application. In this application I use multiprocessing library. In that application I do spin two processes (instances of the same class) to consume data from Kafka and put into Python Queue.
This is the library I used:
Python multiprocessing
Q1. Is it concurrency or is it parallelism?
Q2. Is it multithreading or is it multiprocessing?
Q3. How does Python maps Processes to CPUs? (does this question make sense?)
I understand in order to speak about multithreading I need to use separate / multiple CPUs (so separate threads are mapped to separate CPU threads).
I understand in order to speak about multiprocessing I need to use separate memory space for both processes? Is it correct?
I assume if I spin two processes within one Application instance => we talk about concurrency.
If I spin multiple instances of above application then we would talk about parallelism? (multiple CPUs, separate memory spaces used)?
I see that Python library defines it as follows: Python multiprocessing library
The multiprocessing package offers both local and remote concurrency
...
Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine.
...
A prime example of this is the Pool object which offers a convenient means of parallelizing the execution of a function across multiple input values, distributing the input data across processes (data parallelism).

First, separate threads are not mapped to separate CPU-s. That's optional, and in python due to the GIL, all threads in a process will run on the same CPU
1) It's both concurrency, in that the order of execution is not set, and parallelism, since the multiprocessing package can run on multiple processors, bypassing the GIL limitations.
2) Since the threading package is another story, then it's definitely multiprocessing
3) I may be speaking out of line, but python , IMO does NOT map processes to CPU-s, it leaves this detail to the OS

Q1: It is at least concurrency, can be parallelism as well (terms intended as defined in the answer to this question). Clearly, if you have one processor only, true parallellism cannot be achieved, becuse only one process can use the CPU at a single time. In that case, however, the muliprocessing library still allows you to define multiple tasks, that run in separate processes. It will be the OS's scheduler to decide which process runs when.
Q2: Multiprocessing (...which is kind of implied by the library name). Due to the Global Interpreter Lock present in most Python interpreter implementations, parallelism with threads is impossible. Multiprocessing offers a threading-like interface that makes use of processes under the hood.
Q3: It doesn't. Python spawns processes, the OS scheduler decided who runs where and when. There are some ways to execute processes on specific CPUs, but this is not the default behaviour of multiprocessing (and I'm not aware of any way to force the library to pin processes to CPUs).

Why python does (not) use more CPUs? [duplicate]

I'm slightly confused about whether multithreading works in Python or not.
I know there has been a lot of questions about this and I've read many of them, but I'm still confused. I know from my own experience and have seen others post their own answers and examples here on StackOverflow that multithreading is indeed possible in Python. So why is it that everyone keep saying that Python is locked by the GIL and that only one thread can run at a time? It clearly does work. Or is there some distinction I'm not getting here?
Many posters/respondents also keep mentioning that threading is limited because it does not make use of multiple cores. But I would say they are still useful because they do work simultaneously and thus get the combined workload done faster. I mean why would there even be a Python thread module otherwise?
Update:
Thanks for all the answers so far. The way I understand it is that multithreading will only run in parallel for some IO tasks, but can only run one at a time for CPU-bound multiple core tasks.
I'm not entirely sure what this means for me in practical terms, so I'll just give an example of the kind of task I'd like to multithread. For instance, let's say I want to loop through a very long list of strings and I want to do some basic string operations on each list item. If I split up the list, send each sublist to be processed by my loop/string code in a new thread, and send the results back in a queue, will these workloads run roughly at the same time? Most importantly will this theoretically speed up the time it takes to run the script?
Another example might be if I can render and save four different pictures using PIL in four different threads, and have this be faster than processing the pictures one by one after each other? I guess this speed-component is what I'm really wondering about rather than what the correct terminology is.
I also know about the multiprocessing module but my main interest right now is for small-to-medium task loads (10-30 secs) and so I think multithreading will be more appropriate because subprocesses can be slow to initiate.

The GIL does not prevent threading. All the GIL does is make sure only one thread is executing Python code at a time; control still switches between threads.
What the GIL prevents then, is making use of more than one CPU core or separate CPUs to run threads in parallel.
This only applies to Python code. C extensions can and do release the GIL to allow multiple threads of C code and one Python thread to run across multiple cores. This extends to I/O controlled by the kernel, such as select() calls for socket reads and writes, making Python handle network events reasonably efficiently in a multi-threaded multi-core setup.
What many server deployments then do, is run more than one Python process, to let the OS handle the scheduling between processes to utilize your CPU cores to the max. You can also use the multiprocessing library to handle parallel processing across multiple processes from one codebase and parent process, if that suits your use cases.
Note that the GIL is only applicable to the CPython implementation; Jython and IronPython use a different threading implementation (the native Java VM and .NET common runtime threads respectively).
To address your update directly: Any task that tries to get a speed boost from parallel execution, using pure Python code, will not see a speed-up as threaded Python code is locked to one thread executing at a time. If you mix in C extensions and I/O, however (such as PIL or numpy operations) and any C code can run in parallel with one active Python thread.
Python threading is great for creating a responsive GUI, or for handling multiple short web requests where I/O is the bottleneck more than the Python code. It is not suitable for parallelizing computationally intensive Python code, stick to the multiprocessing module for such tasks or delegate to a dedicated external library.

Yes. :)
You have the low level thread module and the higher level threading module. But it you simply want to use multicore machines, the multiprocessing module is the way to go.
Quote from the docs:
In CPython, due to the Global Interpreter Lock, only one thread can
execute Python code at once (even though certain performance-oriented
libraries might overcome this limitation). If you want your
application to make better use of the computational resources of
multi-core machines, you are advised to use multiprocessing. However,
threading is still an appropriate model if you want to run multiple
I/O-bound tasks simultaneously.

Threading is Allowed in Python, the only problem is that the GIL will make sure that just one thread is executed at a time (no parallelism).
So basically if you want to multi-thread the code to speed up calculation it won't speed it up as just one thread is executed at a time, but if you use it to interact with a database for example it will.

I feel for the poster because the answer is invariably "it depends what you want to do". However parallel speed up in python has always been terrible in my experience even for multiprocessing.
For example check this tutorial out (second to top result in google): https://www.machinelearningplus.com/python/parallel-processing-python/
I put timings around this code and increased the number of processes (2,4,8,16) for the pool map function and got the following bad timings:
serial 70.8921644706279
parallel 93.49704207479954 tasks 2
parallel 56.02441442012787 tasks 4
parallel 51.026168536394835 tasks 8
parallel 39.18044807203114 tasks 16
code:
# increase array size at the start
# my compute node has 40 CPUs so I've got plenty to spare here
arr = np.random.randint(0, 10, size=[2000000, 600])
.... more code ....
tasks = [2,4,8,16]
for task in tasks:
tic = time.perf_counter()
pool = mp.Pool(task)
results = pool.map(howmany_within_range_rowonly, [row for row in data])
pool.close()
toc = time.perf_counter()
time1 = toc - tic
print(f"parallel {time1} tasks {task}")

Does Python support multithreading? Can it speed up execution time?

Yes. :)
You have the low level thread module and the higher level threading module. But it you simply want to use multicore machines, the multiprocessing module is the way to go.
Quote from the docs:
In CPython, due to the Global Interpreter Lock, only one thread can
execute Python code at once (even though certain performance-oriented
libraries might overcome this limitation). If you want your
application to make better use of the computational resources of
multi-core machines, you are advised to use multiprocessing. However,
threading is still an appropriate model if you want to run multiple
I/O-bound tasks simultaneously.

Threading is Allowed in Python, the only problem is that the GIL will make sure that just one thread is executed at a time (no parallelism).
So basically if you want to multi-thread the code to speed up calculation it won't speed it up as just one thread is executed at a time, but if you use it to interact with a database for example it will.

I feel for the poster because the answer is invariably "it depends what you want to do". However parallel speed up in python has always been terrible in my experience even for multiprocessing.
For example check this tutorial out (second to top result in google): https://www.machinelearningplus.com/python/parallel-processing-python/
I put timings around this code and increased the number of processes (2,4,8,16) for the pool map function and got the following bad timings:
serial 70.8921644706279
parallel 93.49704207479954 tasks 2
parallel 56.02441442012787 tasks 4
parallel 51.026168536394835 tasks 8
parallel 39.18044807203114 tasks 16
code:
# increase array size at the start
# my compute node has 40 CPUs so I've got plenty to spare here
arr = np.random.randint(0, 10, size=[2000000, 600])
.... more code ....
tasks = [2,4,8,16]
for task in tasks:
tic = time.perf_counter()
pool = mp.Pool(task)
results = pool.map(howmany_within_range_rowonly, [row for row in data])
pool.close()
toc = time.perf_counter()
time1 = toc - tic
print(f"parallel {time1} tasks {task}")

If `threading` module doesn't really let multiple threads run at once due to GIL, why is ParallelPython using all my cores?

AFAIK, the threading.Thread instances can't actually run in parallell, due to the Global Interpreter Lock, which forces only one thread to be able to run at any time (except for when blocking on I/O operations).
ParalellPython uses the threading module.
If I however submit multiple local jobs to it, it DOES execute them in parallel, or at least so it would seem.
I have 8 cores, and if I start 8 jobs to simply run empty loops, they all take up 12-13% of the CPU (meaning they each get executed on one core, and I can see this in my task manager)
Does anyone know how this can happen?

As the linked page says,
PP module overcomes this limitation and provides a simple way to write parallel python applications. Internally ppsmp uses processes and IPC (Inter Process Communications) to organize parallel computations
So the actual parallelism must be due to invoking multiple processes, as one would expect.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.