Pool with generator: Chunkify - python

I'm using pathos's ProcessingPool, but I suppose the technique required is comparable to the one for multiprocessing.Pool.
I have a generator that yields a large (huge) list of things to do.
from pathos.multiprocessing import ProcessingPool

def doStuff(item):
    return 1

pool = ProcessingPool(nodes=32)
values = pool.map(doStuff, myGenerator)
Unfortunately, the function applied to the generated items (here: doStuff) returns very quickly. As a result, I have so far been unable to get any speedup from parallelizing this; in fact, the multiprocessing version of the code runs slower than the original.
I assume this is because the overhead of delivering the next item from the pool to the workers is large compared to the time it takes the worker to complete the task.
I suppose the solution would be to "chunkify" the generated items: group the items into n lists and then hand them to a pool with n workers (since all of the jobs should take almost exactly the same amount of time), or perhaps a less extreme version of that.
What would be a good way of achieving this in Python?
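A minimal sketch of the chunking idea described above, assuming itertools.islice for batching and a plain multiprocessing.Pool (a pathos ProcessingPool could be substituted with the same call shape); the generator and chunk size here are stand-ins:

from itertools import islice
from multiprocessing import Pool

def doStuff(item):
    return 1

def chunks(iterable, size):
    # pull `size` items at a time from the generator until it is exhausted
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def doStuffOnChunk(chunk):
    # one task per chunk, so per-task overhead is paid once per `size` items
    return [doStuff(item) for item in chunk]

if __name__ == '__main__':
    myGenerator = iter(range(10**6))          # stand-in for the real generator
    with Pool(processes=32) as pool:
        values = [v for result in pool.imap(doStuffOnChunk, chunks(myGenerator, 10000))
                    for v in result]

Note that multiprocessing.Pool.map and imap also accept a chunksize argument, which achieves much the same effect without a helper function.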

Related

Iterating the results of a multiprocessing list is consuming large amounts of memory

I have a large list. I want to process each item. I'd like to segment the list and process each segment on a different CPU. I'm using the pathos multiprocessing library. I've created the following function:
from multiprocessing import cpu_count
from pathos.multiprocessing import ProcessingPool as Pool

def map_list_in_segments(l, f):
    cpus = max(1, int(cpu_count() / 2) - 1)
    seg_length = int(len(l) / cpus)
    segments = [l[x:x+seg_length] for x in range(0, len(l), seg_length)]
    pool = Pool(nodes=cpus)
    mapped_segments = pool.map(lambda seg: f(seg), segments)
    return (sg for seg in mapped_segments for sg in seg)
It returns the correct result and uses all (or almost all) the CPUs. However, iterating over the returned list results in very large amounts of memory being consumed unexpectedly.
At first I was returning a list comprehension. I switched that to a generator, hoping for less memory consumption, but that didn't improve anything.
Update based on comments:
I was unaware of imap and uimap and that they automatically chunk the input list. I gave uimap a try but saw very low CPU utilization and very long running times. One of the processes had very high CPU utilization though. What I think is happening is that there is a lot of pickling going on. The f that I'm passing in has a large object in a closure. When using the ProcessingPool methods (map, imap, uimap) this object needs to be pickled for each element in the list. I suspect that this is what the one process that is very busy is doing. The other processes are throttled by this pickling.
If so, this explains why my manual segmenting is causing significant gains in CPU utilization: the large object only needs to be pickled once per segment instead of for every item.
I then tried using uimap in my map_list_in_segments, hoping for a drop in memory consumption but this did not occur. Here's how the code looks that calls the method and iterates the results:
segments = multiprocessing.map_list_in_segments(l, lambda seg: process_segment(seg, large_object_needed_for_processing))
for seg in segments:
    for item in seg:
        # do something with item
        ...
My (limited) understanding of generators is that the first for loop that is looping through the segments should release each one from memory as it iterates. If so it would seem that the large memory usage is the pickling of the return values of the process_segment method. I'm not returning large amounts of data (about 1K bytes for each item) and the size of l I'm working with is 6000 items. Not sure why 5GB of memory gets consumed.
The problem with multiprocessing is that communication between processes is expensive. If your result is equivalent in size to your input, you're probably going to spend most of your time pickling and unpickling data rather than doing anything useful. This depends on how expensive f is, but you might be better off not using multiprocessing here.
Some further testing reveals that the pickling isn't the issue. The processing I was doing in the for item in seg was constructing additional objects that were consuming a large amount of memory.
The insights derived from this exercise and from the intelligent commenters:
ProcessPool methods (map, imap, uimap) automatically chunk the list.
If you are passing a large object to f (via a closure), you might find that manually chunking the list (as above) saves a lot of pickling and increases CPU utilization; see the sketch after this list for a related approach.
Using imap and uimap can significantly reduce memory usage.
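Not from the original discussion, but one related way to avoid re-pickling a large object for every item or segment is to send it to each worker exactly once through a Pool initializer. A rough sketch with the standard multiprocessing module; the large object, the input list, and the per-item work are stand-ins:

from multiprocessing import Pool

_large_object = None

def _init_worker(large_object):
    # runs once in each worker process, so the object is pickled
    # once per worker rather than once per item or per segment
    global _large_object
    _large_object = large_object

def _process_item(item):
    # uses the worker-local copy instead of capturing it in a closure
    return (item, len(_large_object))        # stand-in for the real per-item work

if __name__ == '__main__':
    large_object = {"lots": "of data"}       # stand-in for the real large object
    l = list(range(6000))                    # stand-in for the input list
    with Pool(initializer=_init_worker, initargs=(large_object,)) as pool:
        for result in pool.imap(_process_item, l, chunksize=256):
            pass   # consume results lazily to keep memory usage flat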

How to run generator code in parallel?

I have code like this:
def generator():
    while True:
        x = ...  # do slow calculation
        yield x
I would like to move the slow calculation to separate process(es).
I'm working in Python 3.6, so I have concurrent.futures.ProcessPoolExecutor. It's just not obvious how to parallelize a generator using it.
The difference from a regular concurrent scenario using map is that there is nothing to map here (the generator runs forever), and we don't want all the results at once; we want to queue them up and wait until the queue is not full before calculating more results.
I don't have to use concurrent.futures; multiprocessing is fine too. It's a similar problem: it's not obvious how to use it from inside a generator.
Slight twist: each value returned by the generator is a large numpy array (10 megabytes or so). How do I transfer that without pickling and unpickling? I've seen the docs for multiprocessing.Array but it's not totally obvious how to transfer a numpy array using that.
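(For reference, a minimal sketch of the multiprocessing.Array route, assuming a fixed-size float64 buffer: numpy can view the shared buffer directly with numpy.frombuffer, so the array data itself is never pickled.)

import numpy as np
from multiprocessing import Array, Process

def worker(shared):
    # re-wrap the shared buffer inside the child process; no data is copied
    arr = np.frombuffer(shared, dtype=np.float64)
    arr[:] = np.arange(len(arr), dtype=np.float64)

if __name__ == '__main__':
    # room for ~10 MB of float64 values; lock=False gives a raw, lock-free buffer
    shared = Array('d', 10 * 1024 * 1024 // 8, lock=False)
    p = Process(target=worker, args=(shared,))
    p.start()
    p.join()
    result = np.frombuffer(shared, dtype=np.float64)   # parent sees the worker's writes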
In this type of situation I usually use the joblib library. It is a parallel computation framework based on multiprocessing. It supports memmapping precisely for the cases where you have to handle large numpy arrays. I believe it is worth checking out for your case.
Maybe joblib's documentation is not explicit enough on this point, as it shows only examples with for loops; since you want to use a generator, I should point out that it does indeed work with generators. An example that would achieve what you want is the following:
from joblib import Parallel, delayed

def my_long_running_job(x):
    # do something with x
    ...

# you can customize the number of jobs
Parallel(n_jobs=4)(delayed(my_long_running_job)(x) for x in generator())
Edit: I don't know what kind of processing you want to do, but if it releases the GIL you could also consider using threads. This way you won't have the problem of having to transfer large numpy arrays between processes, and you will still benefit from true parallelism.
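If the threading route fits (i.e. my_long_running_job spends its time in GIL-releasing code such as numpy), the joblib call above needs only a backend argument; a sketch under that assumption:

from joblib import Parallel, delayed

# backend="threading" keeps everything in one process, so the ~10 MB arrays
# are shared by reference instead of being pickled between processes
results = Parallel(n_jobs=4, backend="threading")(
    delayed(my_long_running_job)(x) for x in generator()
)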

Why is pool.map slower than normal map?

I'm trying the following code:
import multiprocessing
import time
import random

def square(x):
    return x**2

pool = multiprocessing.Pool(4)
l = [random.random() for i in xrange(10**8)]

now = time.time()
pool.map(square, l)
print time.time() - now

now = time.time()
map(square, l)
print time.time() - now
and the pool.map version consistently runs several seconds more slowly than the normal map version (19 seconds vs 14 seconds).
I've looked at the questions: Why is multiprocessing.Pool.map slower than builtin map? and multiprocessing.Pool() slower than just using ordinary functions
and they seem to chalk it up to either IPC overhead or disk saturation, but I feel like in my example those aren't obviously the issue: I'm not writing/reading anything to/from disk, and the computation is long enough that it seems like IPC overhead should be small compared to the total time saved by the multiprocessing (I'm estimating that, since I'm doing work on 4 cores instead of 1, I should cut the computation time down from 14 seconds to about 3.5 seconds). I don't think I'm saturating my CPU; checking cat /proc/cpuinfo shows that I have 4 cores, but even when I multiprocess with only 2 processes it's still slower than the normal map function (and even slower than with 4 processes). What else could be slowing down the multiprocessed version? Am I misunderstanding how IPC overhead scales?
If it's relevant, this code is written in Python 2.7, and my OS is Linux Mint 17.2
pool.map splits the list into chunks of jobs and dispatches them to the worker processes; every element still has to be pickled, sent to a worker, and have its result pickled and sent back.
The work a single process is doing is shown in your code:
def square(x):
    return x**2
This operation takes very little time on modern CPUs, no matter how big the number is.
In your example you're creating a huge list and performing a trivial operation on every single element. Of course the IPC overhead will outweigh any gain compared to the regular map function, which is optimized for fast looping.
In order to see your example working as you expect, just add a time.sleep(0.1) call to the square function. This simulates a long running task. Of course you might want to reduce the size of the list or it will take forever to complete.
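For example, a sketch of the modified benchmark (shrunk to 100 items and written so it also runs on Python 3); the timings in the comments are only rough expectations:

import multiprocessing
import time
import random

def square(x):
    time.sleep(0.1)              # simulate a long-running task
    return x ** 2

if __name__ == '__main__':
    l = [random.random() for i in range(100)]    # 100 items instead of 10**8

    pool = multiprocessing.Pool(4)
    now = time.time()
    pool.map(square, l)
    print(time.time() - now)     # roughly 100 * 0.1 / 4 = 2.5 seconds

    now = time.time()
    for x in l:
        square(x)
    print(time.time() - now)     # roughly 100 * 0.1 = 10 seconds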

Naive and easiest way to decompose independent loop into parallel threads/processes

I have a loop of intensive calculations, and I want them to be accelerated using the multicore processor, as they are independent: all can be performed in parallel. What is the easiest way to do that in Python?
Let's imagine that those calculations have to be summed at the end. How can I easily add them to a list or to a float variable?
Thanks for any pedagogical answers that use Python libraries ;o)
From my experience, multi-threading is probably not going to be a viable option for speeding things up (due to the Global Interpreter Lock).
A good alternative is the multiprocessing module. This may or may not work well, depending on how much data you end up having to pass around from one process to another.
Another good alternative would be to consider using numpy for your computations (if you aren't already). If you can vectorize your code, you should be able to achieve significant speedups even on a single core. Depending on what exactly you're doing and on your build of numpy, it might even be able to transparently distribute the computations across multiple cores.
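As a rough illustration of the vectorization point, assuming the loop body were something as simple as squaring and summing:

import numpy as np

x = np.random.random(10**7)    # hypothetical inputs
total = np.sum(x ** 2)         # the whole loop plus the final sum, in two vectorized calls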
Edit: Here is a complete example of using the multiprocessing module to perform a simple computation. It uses four processes to compute the squares of the numbers from zero to nine.
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)  # start 4 worker processes
    inputs = range(10)
    result = pool.map(f, inputs)
    print result
This is meant as a simple illustration. Given the trivial nature of f(), this parallel version will almost certainly be slower than computing the same thing serially.
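To cover the second part of the question (summing at the end): pool.map returns an ordinary list in input order, so, continuing the example above inside the same __main__ block, accumulating the results is just:

results = pool.map(f, inputs)   # a plain Python list
total = sum(results)            # or keep the list itself if every value is needed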
Multicore processing is a bit difficult to do in CPython (thanks to the GIL). However, there is the multiprocessing module, which lets you use subprocesses (not threads) to split your work across multiple cores.
The module is relatively straightforward to use, as long as your code can really be split into multiple parts and doesn't depend on shared objects. The linked documentation should be a good starting point.

What is a neat way to divide huge nested loops among 8 (or more) processes using Python?

This time I'm facing a "design" problem. Using Python, I have to implement a mathematical algorithm which uses 5 parameters. To find the best combination of these 5 parameters, I used a 5-layer nested loop to enumerate all possible combinations in a given range. The time it takes to finish turned out to be far beyond my expectation. So I think it's time to use multithreading...
The task at the core of the nested loops is calculation and saving. In the current code, the result from every calculation is appended to a list, and the list is written to a file at the end of the program.
Since I don't have much experience with multithreading in any language, not to mention Python, I would like to ask for some hints on what the structure should be for this problem: namely, how should the calculations be assigned to the threads dynamically, and how should the threads save their results and later combine them into one file? I hope the number of threads can be adjustable.
Any illustration with code would be very helpful.
Thank you very much for your time, I appreciate it.
Update (2nd day):
Thanks for all the helpful answers. Now I know that what I need is multiprocessing rather than multithreading; I always confuse these two concepts, because I thought that if a program is multithreaded, the OS will automatically use multiple processors to run it when available.
I will find time for some hands-on work with multiprocessing tonight.
You can try using jug, a library I wrote for very similar problems. Your code would then look something like
from jug import TaskGenerator

evaluate = TaskGenerator(evaluate)

results = []
for p0 in [1, 2, 3]:
    for p1 in xrange(10):
        for p2 in xrange(10, 20):
            for p3 in [True, False]:
                for p4 in xrange(100):
                    results.append(evaluate(p0, p1, p2, p3, p4))
Now you could run as many processes as you'd like (even across a network if you have access to a computer cluster).
Multithreading in Python won't win you anything in this kind of problem, since Python doesn't execute threads in parallel (it uses them for I/O concurrency, mostly).
You want multiprocessing instead, or a friendly wrapper for it such as joblib:
from joblib import Parallel, delayed
# -1 == use all available processors
results = Parallel(n_jobs=-1)(delayed(evaluate)(x) for x in enum_combinations())
print best_of(results)
Where enum_combinations would enumerate all combinations of your five parameters; you can likely implement it by putting a yield at the bottom of your nested loop.
joblib distributes the combinations over multiple worker processes, taking care of some load balancing.
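A minimal sketch of what enum_combinations could look like, with the parameter ranges borrowed from the jug example above purely for illustration (with tuples like this, the call above would unpack them, i.e. delayed(evaluate)(*x)):

def enum_combinations():
    # yields one (p0, p1, p2, p3, p4) tuple per combination
    for p0 in [1, 2, 3]:
        for p1 in xrange(10):
            for p2 in xrange(10, 20):
                for p3 in [True, False]:
                    for p4 in xrange(100):
                        yield (p0, p1, p2, p3, p4)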
Assuming this is a calculation-heavy problem (and thus CPU-bound), multi-threading won't help you much in Python due to the GIL.
What you can, however, do is split the calculation across multiple processes to take advantage of extra CPU cores. The easiest way to do this is with the multiprocessing library.
There are a number of examples of how to use multiprocessing on its documentation page.
