Dask's slow compute performance on single core

Dask's slow compute performance on single core - python

I have implemented a map function that parses strings into an XML tree, traverses this tree and extracts some information. A lot of if-then-else stuff, no additional IO code.
The speedup we got from Dask was not very satisfying, so we took a closer look at the raw execution performance on a single but large item (580 MB of XML string) in a single partition.
Here's the code:
def timedMap(x):
start = time.time()
# computation, no IO or access to Dask variables, no threading or multiprocessing
...
return time.time() - start
print("Direct Execution")
print(timedMap(xml_string))
print("Dask Distributed")
import dask
from dask.distributed import Client
client = Client(threads_per_worker=1, n_workers=1)
print(client.submit(timedMap, xml_string).result())
client.close()
print("Dask Multiprocessing")
import dask.bag as db
bag = db.from_sequence([xml_string], npartitions=1)
print(bag.map(timedMap).compute()[0])
The output (that's the time without before/after overhead) is:
Direct Execution
54.59478211402893
Dask Distributed
91.79525017738342
Dask Multiprocessing
86.94628381729126
I've repeated this many times. It seems that Dask has not only an overhead for communication and task management, but the individual computation steps are also significantly slower as well.
Why is the computation inside Dask so much slower? I suspected the profiler and increased the profiling interval from 10 to 1000ms, which knocked of 5 seconds. But still... Then I suspected memory pressure, but the worker doesn't reach it's limit, not even the 50%.
In terms of overhead (measuring total times submit+result and map+compute), Dask adds 18 seconds for the distributed case and 3 seconds for the multiprocessing case. I'm happy to pay this overhead, but I don't like that the actual computation takes so much longer. Do you have a clue why this would be the case? What can I tweak to speed this up?
Note: when I double the compute load, all durations roughly double as well.
All the best,
Boris

You shouldn't expect any speedup because you have only a single partition.
You should expect some slowdown because you're moving 500MB across processes in Python. I recommend that you read the following: https://docs.dask.org/en/latest/best-practices.html#load-data-with-dask

Related

Why does my multiprocessing job in Python take longer than a single process?

I have to run my code that takes about 3 hours to complete 100 times, which amounts to a total computation time of 300 hours. This code is expensive and includes taking Laplacians, contour points, and outputting plots. I also have access to a computing cluster, which grants me access to 100 cores at once. So I wondered if I could run my 100 jobs on 100 individual cores at once --- no communication between cores --- and reduce the total computation time to somewhere around 3 hours still.
My online reading took me to multiprocessing.Pool, where I ended up using Pool.apply_async(). Quickly, I realized that the average completion time per task was way over 3 hours, the completion time of one task prior to parallelizing. Further research revealed it might be an overhead issue one has to deal with when parallelizing, but I don't think that can account for an 8-fold increase in computation time.
I was able to reproduce the issue on a simple, and shareable piece of code:
import numpy as np
import matplotlib.pyplot as plt
from multiprocessing import Pool
import time
def f(x):
'''Junk to just waste computing power!'''
arr = np.ones(shape=(10000,10000))*x
val1 = arr*arr
val2 = np.sqrt(val1)
val3 = np.roll(arr,-1) - np.roll(arr,1)
plt.subplot(121)
plt.imshow(arr)
plt.subplot(122)
plt.imshow(val3)
plt.savefig('f.png')
plt.close()
print('Done')
N = 1
pool = Pool(processes=N)
t0 = time.time()
for i in range(N):
pool.apply_async(f,[i])
pool.close()
pool.join()
print('Time taken = ', time.time()-t0)
The machine I'm on has 8 cores, and if I've understood my online reading correctly, setting N=1 should force Python to run the job exclusively on one core. Looking at the completion times, I get:
N=1 || Time taken = 7.964005947113037
N=7 || Time taken = 40.3266499042511.
Put simply, I don't understand why the times aren't the same. As for my original code, the time difference between single-task and parallelized-task is about 8-fold.
[Question 1] Is this all a case of overhead (whatever it entails!)?
[Question 2] Is it at all possible to achieve a case where I can get the same computation time as that of a single-task for an N-task job running independently on N cores?
For instance, in SLURM you could do sbatch --array=0-N myCode.slurm, where myCode.slurm would be a single-task Python code, and this would do exactly what I want. But, I need to achieve this same outcome from within Python itself.
Thank you in advance for your help and input!

I suspect that (perhaps due to memory overhead) your program is simply not CPU bound when running multiple copies, therefore process parallelism is not speeding it up as much as you might think.
An easy way to test this is just to have something else do the parallelism and see if anything is improved.
With N = 8 it took 31.2 seconds on my machine. With N = 1 my machine took 7.2 seconds. I then just fired off 8 copies of the N = 1 version.
$ for i in $(seq 8) ; do python paralleltest & done
...
Time taken = 32.07925891876221
Done
Time taken = 33.45247411727905
Done
Done
Done
Done
Done
Done
Time taken = 34.14084982872009
Time taken = 34.21410894393921
Time taken = 34.44455814361572
Time taken = 34.49029612541199
Time taken = 34.502259969711304
Time taken = 34.516881227493286
I also tried this with the entire multiprocessing stuff removed and just a single call to f(0) and the results were quite similar. So the python multiprocessing is doing what it can, but this job has constraints beyond just CPU.
When I replaced the code with something less memory intensive (math.factorial(1400000)), then the parallelism looked a lot better. 14.0 seconds for 1 copy, 16.44 seconds for 8 copies.
SLURM has the ability to make use of multiple hosts as well (depending on your configuration). The multiprocessing pool does not. This method will be limited to the resources on a single host.

This occurs at least partly because you're already taking advantage of most of your cpu in just one process. Numpy is generally built to take advantage of multiple threads without the need for multiprocessing. The benefit is greatest when a significant amount of time is spent inside numpy functions or else the comparatively slower python will dominate the time taken in-between numpy calls. When you add tasks in parallel with the CPU already fully loaded, each one runs slower. See this question on adjusting the number of threads numpy uses, and possibly play with setting it to only 1 thread to convince yourself this is the case.

Unexpected performance of multiprocessing.Pool()

I find that multiprocessing.Pool() doesn't behave as expected in my case below. Could anyone explain why it behaved in the way and how to improve the performance if possible. Following is just simplistic code:
import numpy as np
import multiprocessing
from itertools import repeat
def group_data_by_runID(args):
data, runID = args
return data[data[:,0].astype(int)==runID,:]
%%time
DATA = np.array([[0,1],[0,2],[0,3],[0,4],[1,5],[1,6],[1,7],[1,8],[2,9],[2,10],[2,11],[2,12]])
runIDs = [0,1,2]*10000000
pool = multiprocessing.Pool(40)
list(pool.map(group_data_by_runID, zip(repeat(DATA), runIDs)))
As you can see in the above code that I intended to use 40 cores (56 cores and far more than enough memory available on this system) to run the code, it took 1min 31s. Then I used:
list(map(group_data_by_runID, zip(repeat(DATA), runIDs)))
It took 2min 33s. So the performance of using 40 cores only again less than twice performance, which is very weird to me. I also notice that even I 40 cores, it sometimes doesn't actually launch it in 40 cores as it can be seen in htop.
Where I did wrong? And how can I improve the speed. Please note that the actual data is much larger.

Maybe there are still many people like me who are confused by the performance of multiprocessing in python. Sometime you may achieve a performance gain and sometime you may even get worse performance. Thus I decided to answer this question by myself according to my own experience with multiprocessing.
There may be a overhead by using multiprocessing if your input data is large because these data would be copied and sent across the wire to different processes as juanpa commented above. This overhead could be very significant. However, we can still get a huge performance gain by chopping the input data into small chunks and let each process handle each chunk.
Another scenario where a significant performance gain can be achieved is there is no input data. Such as reading data from tens or hundreds of files.
Although multiprocessing can boost the speed, the majority of energy shall still be spent on the algorithm itself, which may fundamentally determine the efficiency of the code.

ipyparallel strange overhead behavior

Im trying to understand how to do distributed processing with ipyparallel and jupyter notebook, so i did some test and got odd results.
from ipyparallel import Client
%px import numpy as np
rc = Client()
dview = rc[:]
bview = rc.load_balanced_view()
print(len(dview))
print(len(bview))
data = [np.random.rand(10000)] * 4
%time np.sin(data)
%%time #45.7ms
results = dview.map(np.sin, data)
results.get()
%%time #110ms
dview.push({'data': data})
%px results = np.sin(data)
results
%%time #4.9ms
results = np.sin(data)
results
%%time #93ms
results = bview.map(np.sin, data)
results.get()
What is the matter with the overhead?
Is the task i/o bound in this case and just 1 core can do it better?
I tried larger arrays and still got better times with no parallel processing.
Thanks for the advice!

The problem seems to be the io. Push pushes the whole set of data to every node. I am not sure about the map function, but most likely it splits the data in chunks that are sent to nodes. So smaller chunks - faster processing. Load balancer most likely sends the data and the task two time to the same node, which significantly hits performance.
And how did you manage to send the data in 40 ms? I am used to http protocol where only the handshake takes about a second. For me 40 ms in the network is lightning fast.
EDIT About long times (40ms):
In local networks the ping time of 1-10ms is considered a normal situation. Taking into account that you first need to make a handshake (minimum 2 signals) and only then send the data (minimum 1 signal) and wait for the response (another signal) you already talk about 20ms just for connecting two computers. Of course you can try to minimize the ping time to 1ms and then use a faster MPI protocol. But as I understand it does not improve the situation significantly. Only one order of magnitude faster.
Therefore the general recommendations are to use larger jobs. For example, a pretty fast dask distributed framework (faster than Celery based on benchmarks) recommends tasks times to be more than 100ms. Otherwise the overhead of the framework starts overweighting the time of the execution and the parallelization benefits are disappearing. Efficiency on Dask Distributed

Why is pool.map slower than normal map?

I'm trying the following code:
import multiprocessing
import time
import random
def square(x):
return x**2
pool = multiprocessing.Pool(4)
l = [random.random() for i in xrange(10**8)]
now = time.time()
pool.map(square, l)
print time.time() - now
now = time.time()
map(square, l)
print time.time() - now
and the pool.map version consistently runs several seconds more slowly than the normal map version (19 seconds vs 14 seconds).
I've looked at the questions: Why is multiprocessing.Pool.map slower than builtin map? and multiprocessing.Pool() slower than just using ordinary functions
and they seem to chalk it up to to either IPC overhead or disk saturation, but I feel like in my example those aren't obviously the issue; I'm not writing/reading anything to/from disk, and the computation is long enough that it seems like IPC overhead should be small compared to the total time saved by the multiprocessing (I'm estimating that, since I'm doing work on 4 cores instead of 1, I should cut the computation time down from 14 seconds to about 3.5 seconds). I'm not saturating my cpu I don't think; checking cat /proc/cpuinfo shows that I have 4 cores, but even when I multiprocess to only 2 processes it's still slower than just the normal map function (and even slower than 4 processes). What else could be slowing down the multiprocessed version? Am I misunderstanding how IPC overhead scales?
If it's relevant, this code is written in Python 2.7, and my OS is Linux Mint 17.2

pool.map splits a list into N jobs (where N is the size of the list) and dispatches those to the processes.
The work a single process is doing is shown in your code:
def square(x):
return x**2
This operation takes very little time on modern CPUs, no matter how big the number is.
In your example you're creating a huge list and performing an irrelevant operation on every single element. Of course the IPC overhead will be greater compared to the regular map function which is optimized for fast looping.
In order to see your example working as you expect, just add a time.sleep(0.1) call to the square function. This simulates a long running task. Of course you might want to reduce the size of the list or it will take forever to complete.

Multiple Thread data generator

I have a small python script used to generate lots of data to a file, it takes about 6 mins to generate 6GB data, however, my target data size could up to 1TB, for linear calculation, it will take about 1000 mins to generate 1TB data which I think it's unacceptable for me.
So I am wondering will multiple threading help me here to short the time? and why could that be? If not, do I have other options?
Thanks!

Currently, typical hard drives can write on the order of 100 MB per second.
Your program is writing 6 GB in 6 minutes, which means the overall throughput is ~ 17 MB/s.
So your program is not pushing data to disk anywhere near the maximum rate (assuming you have a typical hard drive).
So your problem might actually be CPU-bound.
If this "back-of-the-envelope" calculation is correct, and if you have a machine with multiple processors, using multiple processes could help you generate more data quicker, which could then be sent to a single process which writes the data to disk.
Note that if you are using CPython, the most common implementation of Python, then the GIL (global interpreter lock) prevents multiple threads from running at the same time. So to do concurrent calculations, you need to use multiple processes rather than multiple threads. The multiprocessing or concurrent.futures module can help you here.
Note that if your hard drive can write 100 MB/s, it would still take ~ 160 minutes to write a 1TB to disk, and if your multiple processes generated data at a rate greater than 100 MB/s, then the extra processes would not lead to any speed gain.
Of course, your hardware may be much faster or much slower than this, so it pays to know your hardware specs.
You can estimate how fast you can write to disk using Python by doing a simple experiment:
with open('/tmp/test', 'wb') as f:
x = 'A'*10**8
f.write(x)
% time python script.py
real 0m0.048s
user 0m0.020s
sys 0m0.020s
% ls -l /tmp/test
-rw-rw-r-- 1 unutbu unutbu 100000000 2014-09-12 17:13 /tmp/test
This shows 100 MB were written in 0.511s. So the effective throughput was ~195 MB/s.
Note that if you instead call f.write in a loop:
with open('/tmp/test', 'wb') as f:
for i in range(10**7):
f.write('A')
then the effective throughput drops dramatically to just ~ 3MB/s. So how you structure your program -- even if using just a single process -- can make a big difference. This is an example of how collecting your data into fewer but bigger writes can improve performance.
As Max Noel and kipodi have already pointed out, you can also try writing to /dev/null:
with open(os.devnull, 'wb') as f:
and timing a shortened version of your current script. This will show you how much time is being consumed (mainly) by CPU computation. It's this portion of the overall run time that may be improved by using concurrent processes. If it is large then there is hope that multiprocessing may improve performance.

In all likelihood, multithreading won't help you.
Your data generation speed is either:
IO-bound (that is, limited by the speed of your hard drive), and the only way to speed it up is to get a faster storage device. The only type of parallelization that can help you is finding a way to spread your writes across multiple devices (can you use multiple hard drives?).
CPU-bound, in which case Python's GIL means you can't take advantage of multiple CPU cores within one process. The way to speed your program up is to make it so you can run multiple instances of it (multiple processes), each generating part of your data set.
Regardless, the first thing you need to do is profile your program. What parts are slow? Why are they slow? Is your process IO-bound or CPU-bound? Why?

6 mins to generate 6GB means you take a minute to generate 1 GB. A typical hard drive is capable of up to 80 - 100 MB/s throughput when new. This leaves you with approximately 6 GB / minute IO limit.
So it looks like the limiting factor is the CPU, which is good news (running more instances can help you).
However I wouldn't use multithreading for Python because of GIL. A better idea will be to run some scripts writing to different offsets in different processes or tu multiprocessing module of Python.
I would check it though with running it an writing to /dev/null to make sure you truly are CPU bound.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.