parallel computations with task manager

I need to run some parallel computations in python. The only compatible approach I can think of is the multiprocess/fork model, which is less than ideal for several reasons:
from what I understand, forks in windows are expensive
fine-grained process management (signals, e.g. SIGSTOP/SIGCONT) is clunky (i.e. outside the language)
These are the task requirements:
tasks may spawn new tasks
tasks must be registered with the task manager
tasks do not require shared state
tasks must return a value (python object)
The task manager is responsible for scheduling and limiting the number of concurrent tasks. These are the task manager requirements:
when a new task is started, the task manager may suspend other tasks based on a predetermined limit
when a task returns, the task manager may continue other suspended tasks
when the return value of a task is requested, the task manager may reorganize the task priority (prevent deadlocks)
So you see, the task manager doesn't need to be a parallel/concurrent process. Each task may make synchronous calls to the task manager on starting or stopping. Tasks waiting on other tasks may also make synchronous calls.
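A minimal sketch of the suspend/continue step described above, assuming a POSIX system (SIGSTOP and SIGCONT do not exist on Windows). The function name suspend_and_resume and the tiny child program are made up for illustration; a real task manager would keep track of many such children.

```python
import os
import signal
import subprocess
import sys

# Hypothetical child task: sleeps briefly, then prints a result.
CHILD_CODE = "import time; time.sleep(0.2); print('done')"


def suspend_and_resume():
    """Start a child, stop it with SIGSTOP, then resume it with SIGCONT."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdout=subprocess.PIPE,
        text=True,
    )

    # Suspend the child from the outside (no cooperation needed).
    os.kill(child.pid, signal.SIGSTOP)

    # WUNTRACED makes waitpid report a stopped (not only exited) child.
    _, status = os.waitpid(child.pid, os.WUNTRACED)
    was_stopped = os.WIFSTOPPED(status)

    # Let the child continue and collect its result.
    os.kill(child.pid, signal.SIGCONT)
    output, _ = child.communicate()
    return was_stopped, output.strip(), child.returncode
```

This only shows the signalling mechanics; the scheduling policy (which task to stop, and when) is left out.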
I can't seem to think of any other approaches:
asyncio can start parallel processes within a limited pool, but that approach is more suited to data parallelism than to task pre-emption. Externally pre-empting (suspending) a task isn't compatible with cooperatively programmed events. Correct me if I'm wrong, but while I could use asyncio, it wouldn't make my life easier (an abstraction without benefit), as I would still be required to use processes, and signals on "task-start/stop" events?
stackless python might be suitable, but it isn't really python?
Any ideas?
P.S. My end-goal is to automatically parallelize (decorated) function calls. The task manager limits the number of tasks executing in parallel (e.g. recursive functions) to avoid thrashing (fork bombs). I need to use python, even though a lazy (task waiting), pure (no shared state) and stackless (lightweight threads) language might be more suitable...

Wow, this question is old and I'm surprised a Stackless Python user hasn't chimed in...
Then again, Stackless Python was/is way ahead of its time and there are very few of us out there putting it to use.
Stackless Python is indeed Python. It is a little more than just Python, but it is Python nonetheless.
Stackless Python Wiki
I think it would suit your needs very well. It is still up-to-date and maintained with a commit as recent as this month. It's rather solid and has worked wonderfully for my needs.

asyncio and coroutines vs task queues

I've been reading about asyncio module in python 3, and more broadly about coroutines in python, and I can't get what makes asyncio such a great tool.
I have the feeling that all you can do with coroutines, you can do better by using task queues based on the multiprocessing module (celery for example).
Are there use cases where coroutines are better than task queues?

Not a proper answer, but a list of hints that could not fit into a comment:
You are mentioning the multiprocessing module (and let's consider threading too). Suppose you have to handle hundreds of sockets: can you spawn hundreds of processes or threads?
Again, with threads and processes: how do you handle concurrent access to shared resources? What is the overhead of mechanisms like locking?
Frameworks like Celery also add significant overhead. Can you use it e.g. for handling every single request on a high-traffic web server? By the way, in that scenario, who is responsible for handling sockets and connections (Celery by its nature can't do that for you)?
Be sure to read the rationale behind asyncio. That rationale (among other things) mentions a system call: writev() -- isn't that much more efficient than multiple write()s?
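To make the "hundreds of sockets" hint concrete, here is a rough sketch of one thread servicing many concurrent waits with asyncio. The names fake_request and run are invented for this example, and asyncio.sleep only stands in for waiting on a real socket.

```python
import asyncio
import time


async def fake_request(i, delay=0.1):
    # Stand-in for waiting on a socket; a real client would await network I/O here.
    await asyncio.sleep(delay)
    return i * 2


async def main(n):
    # Schedule all coroutines at once; they wait concurrently in a single thread.
    return await asyncio.gather(*(fake_request(i) for i in range(n)))


def run(n=500):
    start = time.perf_counter()
    results = asyncio.run(main(n))
    elapsed = time.perf_counter() - start
    return results, elapsed
```

Calling run(500) should take roughly one delay rather than 500 of them, with no extra threads or processes involved.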

Adding to the above answer:
If the task at hand is I/O bound and operates on shared data, coroutines and asyncio are probably the way to go.
If, on the other hand, you have CPU-bound tasks where data is not shared, a multiprocessing system like Celery should be better.
If the task at hand is both CPU and I/O bound and sharing of data is not required, I would still use Celery. You can use async I/O from within Celery!
If you have a CPU bound task but with the need to share data, the only viable option as I see now is to save the shared data in a database. There have been recent attempts like pyparallel but they are still work in progress.

Python Multiprocessing vs Eventlet

Based on my understanding, threads cannot be executed in parallel (executed based on availability and random) and that's the reason Eventlet is being used.
If Eventlet is more for parallelism, why can't we just use the multiprocessing module of Python?
I thought of executing multiple processes and using the join() method to check whether all the processes are complete.
Can someone explain if my understanding is correct?

Based on my understanding, threads cannot be executed in parallel (executed based on availability and random)
Correct
and that's the reason Eventlet is being used.
Not so correct. The Eventlet library is used to simplify non-blocking IO programming. It does not actually add parallelism. Thread execution is still limited to one thread at a time due to the GIL. But it is used because it greatly simplifies the process of launching, scheduling, and managing IO-bound threads, particularly ones that do not need to interact with each other.
If Eventlet is more for parallelism
As I just mentioned, this is not what they exist for.
why can't we just use the multiprocessing module of Python? I thought of executing multiple processes and using the join() method to check whether all the processes are complete.
You certainly can! And you will get actual parallel execution with this approach. But you may not get the same speedup. The multiprocessing library is better suited for CPU-bound parallel tasks, because those are the ones that need more frequent access to the interpreter. You may actually see an increase in execution time when using multiprocessing with IO-bound tasks because of the overhead of multiple process execution and management.
As is the case with most optimization and execution time questions, trying both and profiling is the surefire way to guarantee you're using the "best" option for your application. Though you may find that if you write the code to utilize Eventlets first, then try to modify it to use regular threads or multiprocessing, you'll have to write more boilerplate code just to manage the threads or processes, and the value of Eventlets should become more obvious.

Green-threads and thread in Python

As Wikipedia states:
Green threads emulate multi-threaded environments without relying on any native OS capabilities, and they are managed in user space instead of kernel space, enabling them to work in environments that do not have native thread support.
Python's threads are implemented as pthreads (kernel threads),
and because of the global interpreter lock (GIL), a Python process only runs one thread at a time.
[QUESTION]
But in the case of Green-threads (or so-called greenlet or tasklets),
Does the GIL affect them? Can there be more than one greenlet
running at a time?
What are the pitfalls of using greenlets or tasklets?
If I use greenlets, how many of them can a process handle? (I am wondering because in a single process you can open threads up to the ulimit (-s, -v) set in your *ix system.)
I need a little insight, and it would help if someone could share their experience, or guide me to the right path.

You can think of greenlets more like cooperative threads. What this means is that there is no scheduler pre-emptively switching between your threads at any given moment - instead your greenlets voluntarily/explicitly give up control to one another at specified points in your code.
Does the GIL affect them? Can there be more than one greenlet running
at a time?
Only one code path is running at a time - the advantage is you have ultimate control over which one that is.
What are the pitfalls of using greenlets or tasklets?
You need to be more careful - a badly written greenlet will not yield control to other greenlets. On the other hand, since you know when a greenlet will context switch, you may be able to get away with not creating locks for shared data-structures.
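The cooperative hand-off can be illustrated without any third-party library. The sketch below uses plain generators and a hand-written round-robin loop as a stand-in; it is not the greenlet or gevent API, and the names worker and run_round_robin are invented for this example.

```python
from collections import deque


def worker(name, steps, log):
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # voluntarily hand control back to the scheduler


def run_round_robin(tasks):
    """Run each task until its next yield, then move on to the next task."""
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)          # resume the task until it yields again
        except StopIteration:
            continue            # task finished; do not reschedule it
        queue.append(task)


log = []
run_round_robin([worker("a", 2, log), worker("b", 3, log)])
# log is now ["a:0", "b:0", "a:1", "b:1", "b:2"]
```

A worker that loops without ever reaching a yield would never return control to run_round_robin, which is exactly the pitfall described above.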
If I use greenlets, how many of them can a process handle? (I am wondering because in a single process you can open threads up to the ulimit set in your *ix system.)
With regular threads, the more you have the more scheduler overhead you have. Also regular threads still have a relatively high context-switch overhead. Greenlets do not have this overhead associated with them. From the bottle documentation:
Most servers limit the size of their worker pools to a relatively low
number of concurrent threads, due to the high overhead involved in
switching between and creating new threads. While threads are cheap
compared to processes (forks), they are still expensive to create for
each new connection.
The gevent module adds greenlets to the mix. Greenlets behave similar
to traditional threads, but are very cheap to create. A gevent-based
server can spawn thousands of greenlets (one for each connection) with
almost no overhead. Blocking individual greenlets has no impact on the
servers ability to accept new requests. The number of concurrent
connections is virtually unlimited.
There's also some further reading here if you're interested:
http://sdiehl.github.io/gevent-tutorial/

I assume you're talking about eventlet/gevent greenlets.
1) Only one greenlet can run at a time.
2) It's cooperative multithreading, which means that if a greenlet is stuck in an infinite loop, your entire program is stuck. Typically, greenlets are scheduled either explicitly or during I/O.
3) A lot more than threads; it depends on the amount of RAM available.

multiprocess or threading in python?

I have a python application that grabs a collection of data and for each piece of data in that collection it performs a task. The task takes some time to complete as there is a delay involved. Because of this delay, I don't want the tasks to run sequentially, one piece of data after another; I want them all to happen in parallel. Should I be using multiprocessing or threading for this operation?
I attempted to use threading but had some trouble, often some of the tasks would never actually fire.

If you are truly compute bound, using the multiprocessing module is probably the lightest weight solution (in terms of both memory consumption and implementation difficulty.)
If you are I/O bound, using the threading module will usually give you good results. Make sure that you use thread safe storage (like the Queue) to hand data to your threads. Or else hand them a single piece of data that is unique to them when they are spawned.
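As a rough illustration of handing data to threads through a thread-safe Queue, here is one possible worker-pool layout. The process function is a placeholder for the real I/O-bound task, and the sketch assumes the items are unique, hashable and never None (None is used as the shutdown signal).

```python
import queue
import threading


def process(item):
    # Placeholder for the real (I/O-bound) task.
    return item * item


def worker(tasks, results):
    while True:
        item = tasks.get()
        if item is None:        # sentinel: no more work for this thread
            break
        results.put((item, process(item)))


def run_with_threads(items, num_threads=4):
    tasks, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for item in items:
        tasks.put(item)
    for _ in threads:
        tasks.put(None)         # one sentinel per worker thread
    for t in threads:
        t.join()                # every task has been handled once this returns
    return dict(results.get() for _ in range(len(items)))
```

Joining every thread before reading the results is one way to make sure no task is silently skipped.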
PyPy is focused on performance. It has a number of features that can help with compute-bound processing. They also have support for Software Transactional Memory, although that is not yet production quality. The promise is that you can use simpler parallel or concurrent mechanisms than multiprocessing (which has some awkward requirements.)
Stackless Python is also a nice idea. Stackless has portability issues as indicated above. Unladen Swallow was promising, but is now defunct. Pyston is another (unfinished) Python implementation focusing on speed. It is taking an approach different to PyPy, which may yield better (or just different) speedups.

Tasks run sequentially, but give the illusion of running in parallel. They are a good fit for file or connection I/O because they are lightweight.
Multiprocessing with a Pool may be the right solution for you, because processes run in parallel and so are well suited to intensive computing: each process runs on its own CPU (or core).
Setting up multiprocessing can be very easy:
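from multiprocessing import Pool

def worker(input_item):
    # do_some_work is a placeholder for your own task function
    output = do_some_work(input_item)
    return output

if __name__ == "__main__":  # guard needed where child processes re-import this module (e.g. Windows)
    pool = Pool()  # one process per CPU (or core) by default; use Pool(4) to force 4 processes
    # input_list is your collection of data; map distributes its items across the workers
    list_of_results = pool.map(worker, input_list)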

For small collections of data, simply create subprocesses with subprocess.Popen.
Each subprocess can simply get its piece of data from stdin or from command-line arguments, do its processing, and simply write the result to an output file.
When the subprocesses have all finished (or timed out), you simply merge the output files.
Very simple.
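A small sketch of that approach, with two liberties taken for brevity: each child gets its piece of data as a command-line argument, and results come back over stdout pipes instead of output files. CHILD_CODE and run_in_subprocesses are names invented for this example.

```python
import subprocess
import sys

# Hypothetical child program: squares the integer passed as its first argument.
CHILD_CODE = "import sys; print(int(sys.argv[1]) ** 2)"


def run_in_subprocesses(items, timeout=30):
    # Start every child first so they all run at the same time.
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", CHILD_CODE, str(item)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for item in items
    ]
    # Then collect ("merge") the outputs in the original order.
    results = []
    for proc in procs:
        out, _ = proc.communicate(timeout=timeout)
        results.append(int(out))
    return results
```

For example, run_in_subprocesses([1, 2, 3]) starts three interpreters and returns their outputs as a list.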

You might consider looking into Stackless Python. If you have control over the function that takes a long time, you can just throw some stackless.schedule()s in there (saying yield to the next coroutine), or else you can set Stackless to preemptive multitasking.
In Stackless, you don't have threads, but tasklets or greenlets which are essentially very lightweight threads. It works great in the sense that there's a pretty good framework with very little setup to get multitasking going.
However, Stackless hinders portability because you have to replace a few of the standard Python libraries -- Stackless removes reliance on the C stack. It's very portable if the next user also has Stackless installed, but that will rarely be the case.

Using CPython's threading model will not give you any performance improvement for CPU-bound work, because the threads are not actually executed in parallel: the global interpreter lock (GIL), which protects CPython's reference-counting memory management, lets only one thread run Python bytecode at a time. Multiprocessing would allow parallel execution. Obviously in this case you have to have multiple cores available to farm out your parallel jobs to.
There is much more information available in this related question.

If you can easily partition and separate the data you have, it sounds like you should just do that partitioning externally, and feed them to several processes of your program. (i.e. several processes instead of threads)

IronPython has real multithreading, unlike CPython and its GIL. So depending on what you're doing it may be worth looking at. But it sounds like your use case is better suited to the multiprocessing module.
To the guy who recommends stackless python, I'm not an expert on it, but it seems to me that he's talking about software "multithreading", which is actually not parallel at all (it still runs in one physical thread, so it cannot scale to multiple cores). It's merely an alternative way to structure asynchronous (but still single-threaded, non-parallel) applications.

You may want to look at Twisted. It is designed for asynchronous network tasks.

Python Global Interpreter Lock (GIL) workaround on multi-core systems using taskset on Linux?

So I just finished watching this talk on the Python Global Interpreter Lock (GIL) http://blip.tv/file/2232410.
The gist of it is that the GIL is a pretty good design for single core systems (Python essentially leaves the thread handling/scheduling up to the operating system). But that this can seriously backfire on multi-core systems and you end up with IO intensive threads being heavily blocked by CPU intensive threads, the expense of context switching, the ctrl-C problem[*] and so on.
So since the GIL limits us to basically executing a Python program on one CPU my thought is why not accept this and simply use taskset on Linux to set the affinity of the program to a certain core/cpu on the system (especially in a situation with multiple Python apps running on a multi-core system)?
So ultimately my question is this: has anyone tried using taskset on Linux with Python applications (especially when running multiple applications on a Linux system so that multiple cores can be used with one or two Python applications bound to a specific core) and if so what were the results? is it worth doing? Does it make things worse for certain workloads? I plan to do this and test it out (basically see if the program takes more or less time to run) but would love to hear from others as to your experiences.
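For experimenting with this from inside the program rather than via the taskset command, the standard library exposes the same affinity mechanism on Linux. The helper below is a sketch (pin_to_one_cpu is an invented name); it picks the lowest-numbered CPU the process is currently allowed to use, which is an arbitrary choice for illustration.

```python
import os


def pin_to_one_cpu():
    """Pin the current process to a single allowed CPU.

    Returns the new affinity set, or None where the call is unavailable
    (os.sched_setaffinity exists only on some platforms, notably Linux).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)   # 0 means "the calling process"
    target = min(allowed)               # arbitrary choice: lowest allowed CPU
    os.sched_setaffinity(0, {target})
    return os.sched_getaffinity(0)
```

Timing the same workload with and without the call is one way to run the comparison described above.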
Addition: David Beazley (the guy giving the talk in the linked video) pointed out that some C/C++ extensions manually release the GIL lock and if these extensions are optimized for multi-core (i.e. scientific or numeric data analysis/etc.) then rather than getting the benefits of multi-core for number crunching the extension would be effectively crippled in that it is limited to a single core (thus potentially slowing your program down significantly). On the other hand if you aren't using extensions such as this
The reason I am not using the multiprocessing module is that (in this case) part of the program is heavily network I/O bound (HTTP requests), so having a pool of worker threads is a GREAT way to squeeze performance out of a box: a thread fires off an HTTP request and then, since it's waiting on I/O, gives up the GIL so another thread can do its thing. That part of the program can easily run 100+ threads without hurting the CPU much, and lets me actually use the network bandwidth that is available. As for stackless Python/etc I'm not overly interested in rewriting the program or replacing my Python stack (availability would also be a concern).
[*] Only the main thread can receive signals so if you send a ctrl-C the Python interpreter basically tries to get the main thread to run so it can handle the signal, but since it doesn't directly control which thread is run (this is left to the operating system) it basically tells the OS to keep switching threads until it eventually hits the main thread (which if you are unlucky may take a while).

Another solution is:
http://docs.python.org/library/multiprocessing.html
Note 1: This is not a limitation of the Python language, but of the CPython implementation.
Note 2: With regard to affinity, your OS shouldn't have a problem doing that itself.

I have never heard of anyone using taskset for a performance gain with Python. Doesn't mean it can't happen in your case, but definitely publish your results so others can critique your benchmarking methods and provide validation.
Personally though, I would decouple your I/O threads from the CPU bound threads using a message queue. That way your front end is now completely network I/O bound (some with HTTP interface, some with message queue interface) and ideal for your threading situation. Then the CPU intense processes can either use multiprocessing or just be individual processes waiting for work to arrive on the message queue.
In the longer term you might also want to consider replacing your threaded I/O front-end with Twisted or something like eventlet because, even if they won't help performance, they should improve scalability. Your back-end is now already scalable because you can run your message queue over any number of machines+cpus as needed.

An interesting solution is the experiment reported by Ryan Kelly on his blog: http://www.rfk.id.au/blog/entry/a-gil-adventure-threading2/
The results seem very satisfactory.

I've found the following rule of thumb sufficient over the years: If the workers are dependent on some shared state, I use one multiprocessing process per core (CPU bound), and per core a fixed pool of worker threads (I/O bound). The OS will take care of assigning the different Python processes to the cores.

The Python GIL is per Python interpreter. That means the only way to avoid problems with it while doing multiprocessing is simply to start multiple interpreters (i.e. using separate processes instead of threads for concurrency) and then use some other IPC primitive for communication between the processes (such as sockets). That being said, the GIL is not a problem when using threads with blocking I/O calls.
The main problem of the GIL, as mentioned earlier, is that you can't execute 2 different Python code threads at the same time. A thread blocked on a blocking I/O call is not executing Python code, which means it is not holding the GIL. If you have two CPU-intensive tasks in separate Python threads, that's where the GIL kills multi-processing in Python (only in the CPython implementation, as pointed out earlier), because the GIL stops CPU #1 from executing a Python thread while CPU #0 is busy executing the other Python thread.
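The point about blocking calls can be checked with a small timing sketch. Here time.sleep stands in for a blocking I/O call (it releases the GIL while waiting), and blocking_io and timed_threads are names invented for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_io(delay):
    # Stand-in for a blocking I/O call; the GIL is released while sleeping.
    time.sleep(delay)
    return delay


def timed_threads(n=8, delay=0.2):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(blocking_io, [delay] * n))
    elapsed = time.perf_counter() - start
    return results, elapsed
```

With 8 threads and a 0.2 second delay, the elapsed time should be close to 0.2 seconds rather than the 1.6 seconds a sequential loop would need. Replacing the sleep with a pure-Python CPU loop would not show the same overlap under CPython's GIL.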

Until such time as the GIL is removed from Python, co-routines may be used in place of threads. I have it on good authority that this strategy has been implemented by two successful start-ups, using greenlets in at least one case.

This is a pretty old question, but since this post always shows up in the results whenever I search for information about Python and performance on multi-core systems, I could not pass by without sharing my thoughts.
You can use the multiprocessing module, which, rather than creating a thread for each task, creates another CPython interpreter process to run your code.
This lets your application take advantage of multi-core systems.
The only problem I see with this approach is that you incur considerable overhead by creating an entirely new process stack in memory. (http://en.wikipedia.org/wiki/Thread_(computing)#How_threads_differ_from_processes)
Python Multiprocessing module:
http://docs.python.org/dev/library/multiprocessing.html
"The reason I am not using the multiprocessing module is that (in this case) part of the program is heavily network I/O bound (HTTP requests) so having a pool of worker threads is a GREAT way to squeeze performance out of a box..."
About this, I guess that you can also have a pool of processes: http://docs.python.org/dev/library/multiprocessing.html#using-a-pool-of-workers
Att,
Leo
