Why is pool.map slower than normal map? - python

I'm trying the following code:
import multiprocessing
import time
import random
def square(x):
    return x**2
pool = multiprocessing.Pool(4)
l = [random.random() for i in xrange(10**8)]
now = time.time()
pool.map(square, l)
print time.time() - now
now = time.time()
map(square, l)
print time.time() - now
and the pool.map version consistently runs several seconds more slowly than the normal map version (19 seconds vs 14 seconds).
I've looked at the questions: Why is multiprocessing.Pool.map slower than builtin map? and multiprocessing.Pool() slower than just using ordinary functions
and they seem to chalk it up to either IPC overhead or disk saturation, but neither seems obviously the issue in my example: I'm not reading from or writing to disk, and the computation is long enough that IPC overhead should be small compared to the total time saved by multiprocessing (since I'm doing the work on 4 cores instead of 1, I'd expect the computation time to drop from 14 seconds to about 3.5 seconds). I don't think I'm saturating my CPU; cat /proc/cpuinfo shows that I have 4 cores, but even when I use a pool of only 2 processes it's still slower than the plain map (and even slower than with 4 processes). What else could be slowing down the multiprocessed version? Am I misunderstanding how IPC overhead scales?
If it's relevant, this code is written in Python 2.7, and my OS is Linux Mint 17.2

pool.map splits the list into chunks of work and dispatches them to the worker processes; every element has to be pickled, sent to a worker over a pipe, unpickled, processed, and then the result has to travel back the same way.
The work a single process does per element is shown in your code:
def square(x):
    return x**2
This operation takes very little time on modern CPUs, no matter how big the number is.
In your example you're creating a huge list and performing a trivial operation on every single element. Of course the IPC overhead will dominate compared to the regular map function, which simply loops in a single process and never has to serialize anything.
To see your example behave as you expect, just add a time.sleep(0.1) call to the square function to simulate a long-running task. Of course you might want to reduce the size of the list or it will take forever to complete.
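For example, a rough sketch of that experiment, keeping the question's Python 2 syntax (the list size and the 0.1-second sleep are arbitrary choices, not values from the original post):

import multiprocessing
import time

def square(x):
    time.sleep(0.1)  # simulate a task that actually takes a while
    return x**2

if __name__ == '__main__':
    l = range(100)  # a much smaller list, since each call now takes ~0.1s

    pool = multiprocessing.Pool(4)
    now = time.time()
    pool.map(square, l)  # roughly 100 * 0.1s spread over 4 workers, plus overhead
    print 'pool.map:', time.time() - now

    now = time.time()
    map(square, l)  # roughly 100 * 0.1s in a single process
    print 'map:', time.time() - now

With work of that size per element, the pooled version should come out clearly ahead despite the IPC cost.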

Related

Why does my multiprocessing job in Python take longer than a single process?

I have to run my code that takes about 3 hours to complete 100 times, which amounts to a total computation time of 300 hours. This code is expensive and includes taking Laplacians, contour points, and outputting plots. I also have access to a computing cluster, which grants me access to 100 cores at once. So I wondered if I could run my 100 jobs on 100 individual cores at once --- no communication between cores --- and reduce the total computation time to somewhere around 3 hours still.
My online reading took me to multiprocessing.Pool, where I ended up using Pool.apply_async(). Quickly, I realized that the average completion time per task was way over 3 hours, the completion time of one task prior to parallelizing. Further research revealed it might be an overhead issue one has to deal with when parallelizing, but I don't think that can account for an 8-fold increase in computation time.
I was able to reproduce the issue with a simple, shareable piece of code:
import numpy as np
import matplotlib.pyplot as plt
from multiprocessing import Pool
import time

def f(x):
    '''Junk to just waste computing power!'''
    arr = np.ones(shape=(10000,10000))*x
    val1 = arr*arr
    val2 = np.sqrt(val1)
    val3 = np.roll(arr,-1) - np.roll(arr,1)
    plt.subplot(121)
    plt.imshow(arr)
    plt.subplot(122)
    plt.imshow(val3)
    plt.savefig('f.png')
    plt.close()
    print('Done')

N = 1
pool = Pool(processes=N)
t0 = time.time()
for i in range(N):
    pool.apply_async(f,[i])
pool.close()
pool.join()
print('Time taken = ', time.time()-t0)
The machine I'm on has 8 cores, and if I've understood my online reading correctly, setting N=1 should force Python to run the job exclusively on one core. Looking at the completion times, I get:
N=1 || Time taken = 7.964005947113037
N=7 || Time taken = 40.3266499042511.
Put simply, I don't understand why the times aren't the same. As for my original code, the time difference between single-task and parallelized-task is about 8-fold.
[Question 1] Is this all a case of overhead (whatever it entails!)?
[Question 2] Is it at all possible to achieve a case where I can get the same computation time as that of a single-task for an N-task job running independently on N cores?
For instance, in SLURM you could do sbatch --array=0-N myCode.slurm, where myCode.slurm would be a single-task Python code, and this would do exactly what I want. But, I need to achieve this same outcome from within Python itself.
Thank you in advance for your help and input!
I suspect that (perhaps due to memory overhead) your program is simply not CPU bound when running multiple copies, so process parallelism is not speeding it up as much as you might think.
An easy way to test this is just to have something else do the parallelism and see if anything is improved.
With N = 8 it took 31.2 seconds on my machine. With N = 1 my machine took 7.2 seconds. I then just fired off 8 copies of the N = 1 version.
$ for i in $(seq 8) ; do python paralleltest & done
...
Time taken = 32.07925891876221
Done
Time taken = 33.45247411727905
Done
Done
Done
Done
Done
Done
Time taken = 34.14084982872009
Time taken = 34.21410894393921
Time taken = 34.44455814361572
Time taken = 34.49029612541199
Time taken = 34.502259969711304
Time taken = 34.516881227493286
I also tried this with the entire multiprocessing stuff removed and just a single call to f(0) and the results were quite similar. So the python multiprocessing is doing what it can, but this job has constraints beyond just CPU.
When I replaced the code with something less memory intensive (math.factorial(1400000)), then the parallelism looked a lot better. 14.0 seconds for 1 copy, 16.44 seconds for 8 copies.
SLURM has the ability to make use of multiple hosts as well (depending on your configuration). The multiprocessing pool does not. This method will be limited to the resources on a single host.
This occurs at least partly because you're already taking advantage of most of your CPU in just one process. Numpy is generally built to use multiple threads without any need for multiprocessing. The benefit is greatest when a significant amount of time is spent inside numpy functions; otherwise the comparatively slow Python code in between numpy calls dominates the runtime. When you add tasks in parallel with the CPU already fully loaded, each one runs slower. See this question on adjusting the number of threads numpy uses, and possibly play with setting it to only 1 thread to convince yourself this is the case.
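One common way to pin numpy's internal thread pools to a single thread is to set the relevant environment variables before numpy is imported; a minimal sketch (which variables actually matter depends on the BLAS backend your numpy build uses):

import os

# Must be set before `import numpy` so the BLAS/OpenMP runtimes pick them up.
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

import numpy as np

arr = np.ones((10000, 10000))
val = np.sqrt(arr * arr)  # now runs single-threaded inside this process

With that in place, the per-process numbers should make it clearer how much of the CPU a single worker was already using.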

Python multiprocessing poor performance

I have to run a prediction using prophet on a certain number of series (several thousands).
prophet works fine but does not make use of multiple CPUs. Each prediction takes around 36 seconds (actually the whole function, which also does some data preprocessing and post-processing). If I run sequentially (just 15 series as a test) it takes 540 seconds to complete. Here is the code:
for i in G:
    predictions = list(make_prediction(i, c, p))
where G is an iterator that returns one series at a time (capped at 15 series for the test), and c and p are two dataframes (the function only reads them).
I then tried with joblib:
predictions = Parallel(n_jobs=5)( delayed(make_prediction)(i, c, p) for i in G)
Time taken 420 seconds.
Then I tried Pool:
p = Pool(5)
predictions = list(p.imap(make_prediction2, G))
p.close()
p.join()
Since map can pass just one parameter, I call a wrapper function, make_prediction2, that calls make_prediction(i, c, p) for each item i from G. Time taken: 327 seconds.
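As an aside (this is an illustration, not part of the original code), the single-argument restriction can also be worked around with functools.partial instead of a hand-written wrapper; note that the bound dataframes are still serialized and shipped with the tasks, so this is about convenience rather than speed:

from functools import partial
from multiprocessing import Pool

# Bind the read-only dataframes; the pool then only passes each series.
predict_one = partial(make_prediction, c=c, p=p)

pred_pool = Pool(5)
predictions = list(pred_pool.imap(predict_one, G))
pred_pool.close()
pred_pool.join()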
Eventually I tried ray:
I decorated make_prediction with @ray.remote and called:
predictions = ray.get([make_prediction.remote(i, c, p) for i in G])
Time taken: 340 seconds! I also tried making c and p Ray objects with c_ray = ray.put(c) (and the same for p) and passing c_ray and p_ray to the function, but I could not see any improvement in performance.
I understand the overhead required by forking, and the time taken by the function is not huge, but I'd still expect better performance (a gain of less than 40% at best while using 5 times as many CPUs does not seem amazing), especially from ray. Am I missing something or doing something wrong? Any ideas for getting better performance?
Let me point out that RAM usage is under control. Each process, in the worst case, uses less than 2 GB of memory, so less than 10 GB in total with 26 GB available.
Python 3.7.7, Ray 0.8.5, joblib 0.14.1, on macOS 10.14.6, i7 with 4 physical cores and 8 threads (cpu_count() = 8; with 5 CPUs working, no throttling reported by pmset -g thermlog).
PS: Increasing the test size to 50 series, performance improves, especially with ray, which suggests the overhead of forking is significant in such a small test. I will run a longer test to get a more accurate performance metric, but I suspect I won't get much beyond 50%, a value that seems consistent with other posts I have seen, where a 90% reduction required 50 CPUs.
**************************** UPDATE ***********************************
Increasing the number of series to 100, ray hasn't shown good performance (maybe I'm missing something in its implementation). Pool, on the other hand, using an initializer function and imap_unordered, did better. There is a tangible overhead caused by forking and preparing each process's environment, but I got really weird results:
a single CPU took 2685 seconds to complete the job (100 series);
using the pool as described above with 2 CPUs took 1808 seconds (-33%, which seems fine);
using the same configuration with 4 CPUs took 1582 seconds (-41% compared to a single CPU but just -12.5% compared to the 2-CPU job).
Doubling the number of CPUs for just a 12% decrease in time? Even worse, using 5 CPUs takes nearly the same time (please note that 100 divides evenly by 1, 2, 4 and 5, so the queues are always filled)! No throttling, no swap, and the machine has 8 cores and plenty of unused CPU power even while running the test, so there are no bottlenecks. What's wrong?
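For reference, a minimal sketch of the initializer + imap_unordered pattern mentioned above (the helper names are illustrative, not taken from the original code): the dataframes are sent to each worker once, at pool creation, instead of travelling with every task:

from multiprocessing import Pool

_c = None
_p = None

def init_worker(c, p):
    # Runs once per worker process; stash the read-only dataframes as globals.
    global _c, _p
    _c, _p = c, p

def predict_one(series):
    return make_prediction(series, _c, _p)

if __name__ == '__main__':
    with Pool(processes=4, initializer=init_worker, initargs=(c, p)) as pool:
        # imap_unordered yields results as they finish, not in input order.
        predictions = list(pool.imap_unordered(predict_one, G))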

Dask's slow compute performance on single core

I have implemented a map function that parses strings into an XML tree, traverses this tree and extracts some information. A lot of if-then-else stuff, no additional IO code.
The speedup we got from Dask was not very satisfying, so we took a closer look at the raw execution performance on a single but large item (580 MB of XML string) in a single partition.
Here's the code:
def timedMap(x):
    start = time.time()
    # computation, no IO or access to Dask variables, no threading or multiprocessing
    ...
    return time.time() - start
print("Direct Execution")
print(timedMap(xml_string))
print("Dask Distributed")
import dask
from dask.distributed import Client
client = Client(threads_per_worker=1, n_workers=1)
print(client.submit(timedMap, xml_string).result())
client.close()
print("Dask Multiprocessing")
import dask.bag as db
bag = db.from_sequence([xml_string], npartitions=1)
print(bag.map(timedMap).compute()[0])
The output (this is the time excluding the before/after overhead) is:
Direct Execution
54.59478211402893
Dask Distributed
91.79525017738342
Dask Multiprocessing
86.94628381729126
I've repeated this many times. It seems that Dask not only has overhead for communication and task management, but the individual computation steps are also significantly slower.
Why is the computation inside Dask so much slower? I suspected the profiler and increased the profiling interval from 10 to 1000 ms, which knocked off 5 seconds. But still... Then I suspected memory pressure, but the worker doesn't reach its limit, not even 50% of it.
In terms of overhead (measuring total times submit+result and map+compute), Dask adds 18 seconds for the distributed case and 3 seconds for the multiprocessing case. I'm happy to pay this overhead, but I don't like that the actual computation takes so much longer. Do you have a clue why this would be the case? What can I tweak to speed this up?
Note: when I double the compute load, all durations roughly double as well.
All the best,
Boris
You shouldn't expect any speedup because you have only a single partition.
You should expect some slowdown because you're moving 500MB across processes in Python. I recommend that you read the following: https://docs.dask.org/en/latest/best-practices.html#load-data-with-dask
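A sketch of the idea behind that link (the file paths and the use of dask.delayed are illustrative here, not from the original post): have each task read its own input inside the worker, so the 580 MB string never has to be pickled and shipped from the client process:

import dask

@dask.delayed
def load_and_parse(path):
    # The XML is read inside the worker; only the (small) result travels back.
    with open(path) as f:
        return timedMap(f.read())

results = dask.compute(*[load_and_parse(p) for p in ['a.xml', 'b.xml']])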

Pool with generator: Chunkify

I'm using pathos.Pool, but I suppose the technique required is comparable to that of multiprocessing.Pool.
I have a generator that yields a large (huge) list of things to do.
from pathos import ProcessingPool

def doStuff(item):
    return 1

pool = ProcessingPool(nodes=32)
values = pool.map(doStuff, myGenerator)
Unfortunately, the function applied to my generated items (here: doStuff) completes very quickly. As a result, so far I have been unable to get a speed boost from parallelizing this - in fact, the multiprocessing version of the code runs slower than the original.
I assume this is because the overhead of delivering the next item from the pool to the workers is large compared to the time it takes the worker to complete the task.
I suppose the solution would be to "chunkify" the generated items: group the items into n lists and then provide them to a pool with n workers (since all of the jobs should take almost exactly the same amount of time). Or perhaps a less extreme version.
What would be a good way of achieving this in Python?
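One possible sketch of that chunking idea, shown here with the standard library's multiprocessing.Pool (pathos' pools are largely API-compatible, but check your version); myGenerator stands in for the question's generator, and the chunk size of 10000 is an arbitrary choice:

import itertools
from multiprocessing import Pool

def doStuff(item):
    return 1

def chunked(iterable, size):
    # Yield successive lists of up to `size` items from any iterable/generator.
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

def doChunk(chunk):
    # Each worker processes a whole batch per IPC round trip.
    return [doStuff(item) for item in chunk]

if __name__ == '__main__':
    with Pool(processes=32) as pool:
        nested = pool.imap(doChunk, chunked(myGenerator, 10000))
        values = [v for chunk in nested for v in chunk]

Note that multiprocessing.Pool.map and imap also accept a chunksize argument that does this batching internally, which may be enough on its own.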

Python "range" resource consumption

I wrote the following script.
Basically, I'm just learning Python for machine learning and wanted to check how really computationally intensive tasks would perform. I observe that for 10**8 iterations, Python takes up a lot of RAM (around 3.8 GB) and also a lot of CPU time (it just froze my system).
I want to know if there is any way to limit the time/memory consumption, either through code or some global settings.
Script -
import time

initial_start = time.clock()
for i in range(9):
    start = time.clock()
    for j in range(10**i):
        pass
    stop = time.clock()
    print 'Looping exp(',i,') times takes', stop - start, 'seconds'
final_stop = time.clock()
print 'Overall program time is',final_stop - initial_start,'seconds'
In Python 2, range creates a list. Use xrange instead. For a more detailed explanation see Should you always favor xrange() over range()?
Note that a no-op for loop is a very poor benchmark that tells you pretty much nothing about Python.
Also note, as per gnibbler's comment, that Python 3's range works like Python 2's xrange.
Look at this question: How to limit the heap size?
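One Unix-only way to cap memory from within the code is the resource module; a minimal sketch (the 1 GiB cap is an arbitrary example value):

import resource

# Cap the process's address space; allocations beyond the soft limit
# raise MemoryError instead of swamping the whole machine.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (1 * 1024 ** 3, hard))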
As for timing your script, the timeit module measures the time it takes to perform an action more accurately:
>>> import timeit
>>> for i in range(9):
... print timeit.timeit(stmt='pass', number=10**i)
...
0.0
0.0
0.0
0.0
0.0
0.015625
0.0625
0.468752861023
2.98439407349
Your example spends most of its time dealing with the gigantic lists of numbers you're putting in memory. Using xrange instead of range will help with that, but you're still using a terrible benchmark: the loop executes over and over without actually doing anything, so the CPU is just busy checking the condition and entering the loop.
As you can see, creating these lists takes the majority of the time here:
>>> timeit.timeit(stmt='range(10**7)', number=1)
0.71875405311584473
>>> timeit.timeit(stmt='for i in range(10**7): pass', number=1)
1.093757152557373
Python takes that much RAM because you're creating a very large list of length 10**8 with the range function. That's where iterators become useful.
Use xrange instead of range.
It works the same way as range does, but instead of creating that large list in memory, xrange just computes the next index on the fly (incrementing its value by 1 each iteration).
If you're considering Python for machine learning, take a look at numpy. Its philosophy is to implement all "inner loops" (matrix operations, linear algebra) in optimized C, and to use Python to manipulate input and output and to manage high-level algorithms - sort of like Matlab driven by Python. That gives you the best of both worlds: the ease and readability of Python, and the speed of C.
To get back to your question, benchmarking numpy operations will give you a more realistic assessment of Python's performance for machine learning.
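For instance, a more representative micro-benchmark (the matrix sizes and repeat count are arbitrary choices) would time a vectorized numpy operation rather than an empty loop:

import timeit

setup = "import numpy as np; a = np.random.rand(1000, 1000); b = np.random.rand(1000, 1000)"
print(timeit.timeit('a.dot(b)', setup=setup, number=10))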
As for the CPU, you have a for loop running for billions of iterations without any sort of sleep or pause in between, so it's no wonder the process hogs the CPU completely (at least on a single-core computer).
