I have to run a prediction using prophet on a certain number of series (several thousand).
prophet works fine but does not make use of multiple CPUs. Each prediction takes around 36 seconds (this is actually for the whole function, which also does some data preprocessing and post-processing). Running sequentially (just 15 series, as a test) takes 540 seconds to complete. Here is the code:
for i in G:
    predictions = list(make_prediction(i, c, p))
where G is an iterator that returns one series at a time (capped at 15 series for the test), and c and p are two dataframes (the function only reads from them).
I then tried with joblib:
predictions = Parallel(n_jobs=5)(delayed(make_prediction)(i, c, p) for i in G)
Time taken 420 seconds.
Then I tried Pool:
p = Pool(5)
predictions = list(p.imap(make_prediction2, G))
p.close()
p.join()
Since imap can pass just one parameter, make_prediction2 is a wrapper that calls make_prediction(i, c, p). Time taken: 327 seconds.
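A sketch of what the wrapper looks like (assuming the forked workers inherit the read-only dataframes c and p as module-level objects):

def make_prediction2(i):
    # c and p are module-level dataframes inherited by the forked worker,
    # so only the series i travels through the queue
    return make_prediction(i, c, p)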
Eventually I tried ray:
I decorated make_prediction with @ray.remote and called:
predictions = ray.get([make_prediction.remote(i, c, p) for i in G])
Time taken: 340 seconds! I also tried to make c and p ray objects with c_ray = ray.put(c) (and the same for p), passing c_ray and p_ray to the function, but I could not see any improvement in performance.
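Sketched out, the ray.put variant looks roughly like this (ray.remote applied to the existing function instead of used as a decorator; the name make_prediction_r is just illustrative):

import ray

ray.init()

make_prediction_r = ray.remote(make_prediction)
c_ray = ray.put(c)  # the dataframes go into the object store once
p_ray = ray.put(p)
predictions = ray.get([make_prediction_r.remote(i, c_ray, p_ray) for i in G])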
I understand the overhead required by forking, and the time taken by the function itself is not huge, but I'd expect better performance (less than a 40% gain, at best, while using 5 times as many CPUs does not seem amazing), especially from ray. Am I missing something or doing something wrong? Any ideas to get better performance?
Let me point out that RAM usage is under control. Each process, in the worst scenario, uses less than 2 GB of memory, so less than 10 GB in total with 26 GB available.
Python 3.7.7, Ray 0.8.5, joblib 0.14.1 on macOS 10.14.6, i7 with 4 physical cores and 8 threads (cpu_count() = 8; with 5 CPUs working, no throttling is reported by pmset -g thermlog).
PS: Increasing the test size to 50 series, the performance improves, especially with ray, meaning that the overhead of forking is relevant in such a small test. I will run a longer test to get a more accurate performance metric, but I guess I won't get far from a 50% reduction, a value that seems consistent with other posts I have seen, where a 90% reduction required 50 CPUs.
**************************** UPDATE ***********************************
Increasing the number of series to 100, ray has not shown good performance (maybe I'm missing something in its implementation). Pool, instead, using an initializer function and imap_unordered, did better (see the sketch at the end of this update). There is a tangible overhead caused by forking and preparing each process's environment, but I got really weird results:
a single CPU took 2685 seconds to complete the job (100 series);
using Pool as described above with 2 CPUs took 1808 seconds (-33%, which seems fine);
using the same configuration with 4 CPUs took 1582 seconds (-41% compared to a single CPU but just -12.5% compared to the 2-CPU job).
Doubling the number of CPUs for just a 12% decrease in time? Even worse, using 5 CPUs takes nearly the same time (please note that 100 is evenly divisible by 1, 2, 4 and 5, so the queues are always filled)! No throttling, no swapping, and the machine has 8 cores and plenty of unused CPU power even while running the test, so no bottlenecks. What's wrong?
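For reference, the Pool setup I mean is roughly this (a sketch; the worker and init_worker names are just illustrative):

from multiprocessing import Pool

def init_worker(c_arg, p_arg):
    # store the two read-only dataframes once per worker process
    global c, p
    c, p = c_arg, p_arg

def worker(i):
    return make_prediction(i, c, p)

if __name__ == '__main__':
    with Pool(processes=4, initializer=init_worker, initargs=(c, p)) as pool:
        predictions = list(pool.imap_unordered(worker, G))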
I'm trying to put together a simple example to illustrate the benefits of using numba.prange for myself and some colleagues, but I'm unable to get a decent speedup. I coded up a simple 1D diffusion solver, which essentially loops over a long array, combining elements i+1, i, and i-1, and writing the results into element i of a second array. This should be a pretty perfect use case for a parallel for loop, similar to OpenMP in Fortran or C.
My complete example is included below:
import numpy as np
from numba import jit, prange

@jit(nopython=True, parallel=True)
def diffusion(Nt):
    alpha = 0.49
    x = np.linspace(0, 1, 10000000)
    # Initial condition
    C = 1/(0.25*np.sqrt(2*np.pi)) * np.exp(-0.5*((x-0.5)/0.25)**2)
    # Temporary work array
    C_ = np.zeros_like(C)
    # Loop over time (normal for-loop)
    for j in range(Nt):
        # Loop over array elements (space, parallel for-loop)
        for i in prange(1, len(C)-1):
            C_[i] = C[i] + alpha*(C[i+1] - 2*C[i] + C[i-1])
        C[:] = C_
    return C
# Run once to just-in-time compile
C = diffusion(1)
# Check timing
%timeit C = diffusion(100)
When running with parallel=False, this takes about 2 seconds, and with parallel=True it takes about 1.5 seconds. I'm running on a MacBook Pro with 4 physical cores, and Activity Monitor reports around 100% CPU usage without parallelisation and around 700% with it.
I would have expected something closer to a factor 4 speedup. Am I doing something wrong?
The poor scalability certainly comes from saturation of the RAM, which is shared by all cores on a desktop machine. Indeed, your code is memory-bound, and memory throughput is quite limited on modern machines compared to the computing power of CPUs (or GPUs). As a result, 1 or 2 cores are generally enough to saturate the RAM on most desktop machines (a few more cores are needed on computing servers).
On a 10-core Intel Xeon processor with 40~43 GiB/s of RAM bandwidth, the code takes 1.32 seconds in parallel and 2.56 seconds sequentially. This means only 2 times faster with 10 cores. That being said, the parallel loop reads the full C array once per time step and also reads and writes the full C_ array once per time step (x86 processors need to read written memory by default, due to the write-allocate cache policy). The C[:] = C_ copy does the same thing. This means (2*3)*(8*10e6)*100/1024**3 = 44.7 GiB of RAM is read/written during only 1.32 seconds in parallel, resulting in a 33.9 GiB/s memory throughput, which reaches 80% of the RAM bandwidth (very good for this use case).
To speed up this code, you need to read/write less data from/to RAM and compute on data as much as possible while it is in cache. The first thing to do is to use a double-buffering approach with two views, so as to avoid a very expensive copy (a minimal sketch follows). Another optimization is to try to compute multiple time steps in parallel at the same time. This is theoretically possible using a complex trapezoidal tiling strategy, but it is very tricky to implement in practice, especially in Numba. High-performance stencil libraries should do that for you. Such optimizations should improve not only the sequential execution but also the scalability of the resulting parallel code.
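Here is a minimal sketch of the double-buffering idea, reusing the diffusion example from the question (note that with reference swapping the boundary values are simply carried over from the initial condition instead of being rewritten each step):

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def diffusion(Nt):
    alpha = 0.49
    x = np.linspace(0, 1, 10000000)
    C = 1/(0.25*np.sqrt(2*np.pi)) * np.exp(-0.5*((x-0.5)/0.25)**2)
    C_ = C.copy()
    for j in range(Nt):
        for i in prange(1, len(C)-1):
            C_[i] = C[i] + alpha*(C[i+1] - 2*C[i] + C[i-1])
        C, C_ = C_, C  # swap the two buffers instead of copying the whole array
    return C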
I have to run my code that takes about 3 hours to complete 100 times, which amounts to a total computation time of 300 hours. This code is expensive and includes taking Laplacians, contour points, and outputting plots. I also have access to a computing cluster, which grants me access to 100 cores at once. So I wondered if I could run my 100 jobs on 100 individual cores at once --- no communication between cores --- and reduce the total computation time to somewhere around 3 hours still.
My online reading took me to multiprocessing.Pool, where I ended up using Pool.apply_async(). Quickly, I realized that the average completion time per task was way over 3 hours, the completion time of one task prior to parallelizing. Further research revealed it might be an overhead issue one has to deal with when parallelizing, but I don't think that can account for an 8-fold increase in computation time.
I was able to reproduce the issue on a simple, and shareable piece of code:
import numpy as np
import matplotlib.pyplot as plt
from multiprocessing import Pool
import time

def f(x):
    '''Junk to just waste computing power!'''
    arr = np.ones(shape=(10000,10000))*x
    val1 = arr*arr
    val2 = np.sqrt(val1)
    val3 = np.roll(arr,-1) - np.roll(arr,1)
    plt.subplot(121)
    plt.imshow(arr)
    plt.subplot(122)
    plt.imshow(val3)
    plt.savefig('f.png')
    plt.close()
    print('Done')

N = 1
pool = Pool(processes=N)
t0 = time.time()
for i in range(N):
    pool.apply_async(f,[i])
pool.close()
pool.join()
print('Time taken = ', time.time()-t0)
The machine I'm on has 8 cores, and if I've understood my online reading correctly, setting N=1 should force Python to run the job exclusively on one core. Looking at the completion times, I get:
N=1 || Time taken = 7.964005947113037
N=7 || Time taken = 40.3266499042511.
Put simply, I don't understand why the times aren't the same. As for my original code, the time difference between single-task and parallelized-task is about 8-fold.
[Question 1] Is this all a case of overhead (whatever it entails!)?
[Question 2] Is it at all possible to achieve a case where I can get the same computation time as that of a single-task for an N-task job running independently on N cores?
For instance, in SLURM you could do sbatch --array=0-N myCode.slurm, where myCode.slurm would be a single-task Python code, and this would do exactly what I want. But, I need to achieve this same outcome from within Python itself.
Thank you in advance for your help and input!
I suspect that (perhaps due to memory overhead) your program is simply not CPU bound when running multiple copies, therefore process parallelism is not speeding it up as much as you might think.
An easy way to test this is just to have something else do the parallelism and see if anything is improved.
With N = 8 it took 31.2 seconds on my machine. With N = 1 my machine took 7.2 seconds. I then just fired off 8 copies of the N = 1 version.
$ for i in $(seq 8) ; do python paralleltest & done
...
Time taken = 32.07925891876221
Done
Time taken = 33.45247411727905
Done
Done
Done
Done
Done
Done
Time taken = 34.14084982872009
Time taken = 34.21410894393921
Time taken = 34.44455814361572
Time taken = 34.49029612541199
Time taken = 34.502259969711304
Time taken = 34.516881227493286
I also tried this with the entire multiprocessing stuff removed and just a single call to f(0) and the results were quite similar. So the python multiprocessing is doing what it can, but this job has constraints beyond just CPU.
When I replaced the code with something less memory-intensive (math.factorial(1400000)), the parallelism looked a lot better: 14.0 seconds for 1 copy, 16.44 seconds for 8 copies.
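Roughly what the less memory-intensive variant looked like (a sketch):

import math

def f(x):
    '''CPU-bound junk with negligible memory traffic'''
    math.factorial(1400000)
    print('Done')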
SLURM has the ability to make use of multiple hosts as well (depending on your configuration). The multiprocessing pool does not. This method will be limited to the resources on a single host.
This occurs at least partly because you're already taking advantage of most of your CPU in just one process. Numpy is generally built to take advantage of multiple threads without the need for multiprocessing. The benefit is greatest when a significant amount of time is spent inside numpy functions; otherwise the comparatively slower Python will dominate the time spent between numpy calls. When you add tasks in parallel with the CPU already fully loaded, each one runs slower. See this question on adjusting the number of threads numpy uses, and possibly play with setting it to only 1 thread to convince yourself this is the case.
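One way to try that (a sketch; which environment variable matters depends on the BLAS backend numpy was built against) is to cap the thread pools before numpy is imported and re-run the timing comparison:

import os
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

import numpy as np  # must be imported after the limits are set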
I have implemented a map function that parses strings into an XML tree, traverses this tree and extracts some information. A lot of if-then-else stuff, no additional IO code.
The speedup we got from Dask was not very satisfying, so we took a closer look at the raw execution performance on a single but large item (580 MB of XML string) in a single partition.
Here's the code:
import time

def timedMap(x):
    start = time.time()
    # computation, no IO or access to Dask variables, no threading or multiprocessing
    ...
    return time.time() - start
print("Direct Execution")
print(timedMap(xml_string))
print("Dask Distributed")
import dask
from dask.distributed import Client
client = Client(threads_per_worker=1, n_workers=1)
print(client.submit(timedMap, xml_string).result())
client.close()
print("Dask Multiprocessing")
import dask.bag as db
bag = db.from_sequence([xml_string], npartitions=1)
print(bag.map(timedMap).compute()[0])
The output (that's the time without before/after overhead) is:
Direct Execution
54.59478211402893
Dask Distributed
91.79525017738342
Dask Multiprocessing
86.94628381729126
I've repeated this many times. It seems that Dask not only has overhead for communication and task management; the individual computation steps are also significantly slower.
Why is the computation inside Dask so much slower? I suspected the profiler and increased the profiling interval from 10 to 1000 ms, which knocked off 5 seconds. But still... Then I suspected memory pressure, but the worker doesn't reach its limit, not even 50% of it.
In terms of overhead (measuring the total times of submit+result and map+compute), Dask adds 18 seconds for the distributed case and 3 seconds for the multiprocessing case. I'm happy to pay this overhead, but I don't like that the actual computation takes so much longer. Do you have a clue why this would be the case? What can I tweak to speed it up?
Note: when I double the compute load, all durations roughly double as well.
All the best,
Boris
You shouldn't expect any speedup because you have only a single partition.
You should expect some slowdown because you're moving 500MB across processes in Python. I recommend that you read the following: https://docs.dask.org/en/latest/best-practices.html#load-data-with-dask
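One way to avoid re-shipping the large XML string with the task (a sketch, using the client and timedMap from the question) is to scatter it to the worker once and submit against the resulting future:

from dask.distributed import Client

client = Client(threads_per_worker=1, n_workers=1)
xml_future = client.scatter(xml_string)  # moved to the worker once
print(client.submit(timedMap, xml_future).result())
client.close()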
I need to run a very CPU- and memory-intensive Python calculation (Monte Carlo-like). I benchmarked execution on my development machine, where I can only run one core due to memory (up to 9 GB per worker).
I attempted to run the same thing via the server (32 cores, 256 GB RAM) using multiprocessing.Pool. Surprisingly, increasing the number of workers increases the runtime per core, quite dramatically: with 8 workers instead of 4, each one runs 3 times longer. Performance Monitor shows at most 9 GB x 8, far below the maximum available.
Win Server 2008 R2, 256 GB RAM, Intel® Xeon® Processor E5-2665 x2
I know that
1. Time is spent in the function itself, in three CPU-expensive steps.
2. Of these, the first (random drawings and conversion to events) and the last (a C++ module for aggregation) are much less sensitive to the problem (runtime increases by up to a factor of 2). The second step, containing Python matrix algebra including the scipy.linalg.blas.dgemm function, can be 6 times more expensive when I run more cores. It does not consume most of the memory (step 1 does; after step 1 no more than 5 GB is in use).
3. If I manually run the same pieces from different DOS boxes, I see identical behaviour.
I need the calculation time to scale in order to improve performance, but it doesn't. Am I missing something? Python memory limitations? Something specific to Windows Server 2008? A BLAS oversubscription problem?
You are missing information about the GIL. In CPython, threading does not give you additional performance for computation; it only allows a calculation to run while some time-consuming IO operation is waiting in another thread.
To get a performance speedup, your function needs to release the GIL. That means it cannot be pure Python, but must be written in Cython/C/C++ with the proper configuration.
I've developed a tool that requires the user to provide the number of CPUs available to run it.
As part of the program, the tool calls HMMER (hmmer - http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf) which itself is quite slow and needs multiple CPUs to run.
I'm confused about the most efficient way to spread the CPUs considering how many CPUs the user has specified.
For instance, assuming the user gave N cpus, I could run
N HMMER jobs with 1 CPU each
N/2 jobs with 2 CPUs each
etc..
My current solution is to arbitrarily open a pool of size N/5 and then call HMMER with 5 CPUs in each process of the pool:
pool = multiprocessing.Pool(processes = N//5)  # integer division so Pool receives an int
pool.map_async(run_scan,tuple(jobs))
pool.close()
pool.join()
where run_scan calls HMMER and jobs holds all the command line arguments for each HMMER job as dictionaries.
The program is very slow and I was wondering if there was a better way to do this.
Thanks
Almost always, parallelization comes at some cost in efficiency, but the cost depends strongly on the specifics of the computation, so I think the only way to answer this question is a series of experiments.
(I'm assuming memory or disk I/O isn't an issue here; don't know much about HMMER, but the user's guide doesn't mention memory at all in the requirements section.)
Run the same job on one core (--cpu 1), then two cores, four, six, ..., and see how long it takes. That will give you an idea of how well the jobs get parallelized. Used CPU time = runtime * number of cores should remain constant.
Once you notice a below-linear correlation between runtime and number of cores devoted to the job, that's when you start to run multiple jobs in parallel. Say you have 24 cores, and a job takes 240 seconds on a single core, 118 seconds on two cores, 81 seconds on three cores, 62 seconds on four, but a hardly faster 59 seconds for five cores (instead of the expected 48 secs), you should run 6 jobs in parallel with 4 cores each.
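A sketch of that rule of thumb in code (the hmmsearch call and the job['args'] key are illustrative; run_scan and jobs come from the question): with 24 cores and 4 cores per HMMER job, keep 6 jobs in flight at a time and pass --cpu 4 to each one.

import subprocess
from multiprocessing import Pool

TOTAL_CORES = 24
CORES_PER_JOB = 4

def run_scan(job):
    # 'job' is assumed to carry the HMMER command-line arguments as a dict
    cmd = ['hmmsearch', '--cpu', str(CORES_PER_JOB)] + job['args']
    subprocess.run(cmd, check=True)

if __name__ == '__main__':
    with Pool(processes=TOTAL_CORES // CORES_PER_JOB) as pool:
        pool.map(run_scan, jobs)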
You might see a sharp decline at about n_cores/2: some computations don't work well with Hyperthreading, and the number of cores is effectively half of what the CPU manufacturer claims.