Parallel Processing Python : Why parallel processing is slower than serial processing? [duplicate]

    from multiprocessing import Pool
    def op1(data):
        return [data[elem] + 1 for elem in range(len(data))]
    data = [[elem for elem in range(20)] for elem in range(500000)]
    import time
    start_time = time.time()
    re = []
    for data_ in data:
        re.append(op1(data_))
    print('--- %s seconds ---' % (time.time() - start_time))
    start_time = time.time()
    pool = Pool(processes=4)
    data = pool.map(op1, data)
    print('--- %s seconds ---' % (time.time() - start_time))
I get a much slower run time with the pool than with the for loop. But isn't the pool supposed to use 4 processors to do the computation in parallel?

Short answer: Yes, the operations will usually run on (a subset of) the available cores. But the communication overhead is large, and in your example the workload is too small compared to that overhead.
When you construct a pool, a number of worker processes are created. When you then call map on some input, the following happens:
the data will be split: every worker gets an approximately fair share;
the data will be communicated to the workers;
every worker will process their share of work;
the result is communicated back to the main process; and
the main process groups the results together.
Splitting, communicating, and joining the data are all steps carried out by the main process, so they cannot be parallelized. Since the operation itself is fast (O(n) in the input size n), the overhead has the same time complexity as the work.
So in terms of complexity, even if you had millions of cores it would not make much difference, because communicating the list is probably already more expensive than computing the results.
That's why you should parallelize computationally expensive tasks, not trivial ones. The amount of processing should be large compared to the amount of communication.
In your example the work is trivial: you add 1 to every element. Serializing, however, is less trivial: the lists you send to the workers have to be encoded (pickled), and the results have to be encoded again on the way back.
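The imbalance between work and communication can be made visible without starting any processes. The following is a minimal sketch, assuming that Pool moves arguments and results with pickle (as the other answers describe). compare_costs is a hypothetical helper name, and the row count is scaled down from the question so that it runs quickly.

```python
import pickle
import time


def op1(data):
    # Same trivial work as in the question: add 1 to every element.
    return [value + 1 for value in data]


def compare_costs(rows):
    """Time the computation and a pickle round trip of inputs and outputs."""
    start = time.perf_counter()
    results = [op1(row) for row in rows]
    compute_seconds = time.perf_counter() - start

    start = time.perf_counter()
    # Roughly what has to happen to ship inputs to workers and results back.
    rows_copy = pickle.loads(pickle.dumps(rows))
    results_copy = pickle.loads(pickle.dumps(results))
    transfer_seconds = time.perf_counter() - start

    return compute_seconds, transfer_seconds, rows_copy, results_copy


if __name__ == "__main__":
    rows = [list(range(20)) for _ in range(50000)]
    compute_seconds, transfer_seconds, _, _ = compare_costs(rows)
    print("compute:  %.3f s" % compute_seconds)
    print("transfer: %.3f s" % transfer_seconds)
```

The exact numbers depend on the machine. When the transfer time is of the same order as the compute time, adding workers cannot make the pool version faster than the plain loop.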

There are a couple of potential trouble spots with your code, but primarily it's too simple.
The multiprocessing module works by creating separate processes and communicating among them. For each process created, you pay the operating system's process startup cost as well as the Python startup cost. Those costs may be high or low, but they are never zero.
Once you have paid those startup costs, you pool.map the worker function across all the processes, and that function basically adds 1 to a few numbers. This is not a significant load, as your tests show.
What's worse, you're using .map(), which is implicitly ordered (compare with .imap_unordered()), so there's synchronization going on, leaving even less freedom for the various CPU cores to give you speed.
If there's a problem here, it's a "design of experiment" problem - you haven't created a sufficiently difficult problem for multiprocessing to be able to help you.
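For comparison, here is a sketch of an experiment where each task does enough work for the pool to have a chance of paying off. heavy, run_serial, and run_parallel are hypothetical names, and the input sizes and chunksize are arbitrary choices rather than tuned values.

```python
import time
from multiprocessing import Pool


def heavy(n):
    # Hypothetical CPU-heavy task: sum of squares in pure Python.
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_serial(inputs):
    return [heavy(n) for n in inputs]


def run_parallel(inputs, processes=4, chunksize=8):
    # imap_unordered yields results as they finish, in no particular order;
    # chunksize batches several inputs into one message to a worker.
    with Pool(processes=processes) as pool:
        return list(pool.imap_unordered(heavy, inputs, chunksize=chunksize))


if __name__ == "__main__":
    inputs = [200000] * 64

    start = time.perf_counter()
    serial = run_serial(inputs)
    print("serial:   %.2f s" % (time.perf_counter() - start))

    start = time.perf_counter()
    parallel = run_parallel(inputs)
    print("parallel: %.2f s" % (time.perf_counter() - start))

    # Order may differ, so compare the sorted results.
    assert sorted(serial) == sorted(parallel)
```

Each task now receives one small integer and returns one integer, while doing a few hundred thousand loop iterations in between, so computation dominates communication.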

As others have noted, the overhead that you pay to facilitate multiprocessing is more than the time saved by parallelizing across multiple cores. In other words, your function op1() does not require enough CPU resources to see a performance gain from parallelizing.
In the multiprocessing.Pool class, the majority of this overhead is spent serializing and deserializing data before the data is shuttled between the parent process (which creates the Pool) and the child "worker" processes.
This blog post explores, in greater detail, how expensive pickling (serializing) can be when using the multiprocessing.Pool module.

Related

Pandas DataFrame Multithreading No Performance Gain

I have a dictionary data (in memory) with ~10,000 keys. Each key represents a stock ticker, and the value stores a pandas DataFrame holding the time series of daily stock prices. I am trying to calculate the pairwise Pearson correlation.
The code takes a long time (~3 hr) to iterate through all the combinations, O(n^2) ~ C(10000, 2). I tried the multiprocessing.dummy package but saw no performance gain AT ALL (it actually gets slower as the number of workers increases).
    from multiprocessing.dummy import Pool
    from scipy.stats import pearsonr  # assuming SciPy's pearsonr is the one meant
    def calculate_correlation(pair):
        # pseudo code here
        t1, t2 = pair  # Python 3 no longer allows tuple parameters in the signature
        return pearsonr(data[t1]['Close'], data[t2]['Close'])
    tickers = list(data.keys())
    todos = []
    for idx, t1 in enumerate(tickers):
        for t2 in tickers[idx:]:  # only the matrix top triangle
            todos.append((t1, t2))
    pool = Pool(4)
    results = pool.map(calculate_correlation, todos)
    pool.close()
    pool.join()
All the data has been loaded into memory, so it should not be I/O intensive. Is there any reason why there is no performance gain at all?

When you use multiprocessing.dummy, you are using threads, not processes. For a CPU-bound application in Python you will usually not get a performance boost from multithreading; use multiprocessing instead to parallelize your code. So change
    from multiprocessing.dummy import Pool
to
    from multiprocessing import Pool
This should substantially improve performance, provided the work per task is large enough to outweigh the cost of passing arguments and results between processes.
The above will fix your problem; if you want to know why this happened, read on.
CPython has a Global Interpreter Lock (GIL) that prevents two threads in the same process from executing Python bytecode at the same time. If your code did a lot of disk I/O, multithreading would have helped, because a thread releases the GIL while it waits on I/O. The same holds if your Python code hands the heavy work to an external program or to a library that releases the GIL. Multiprocessing, on the other hand, runs separate processes, each with its own interpreter and its own GIL, so a CPU-bound Python application such as yours can run on several cores in parallel.
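The difference can be checked with a small experiment. This is a sketch under the assumption of a standard CPython build with the GIL enabled; cpu_bound and timed_map are hypothetical names, and the workload is a stand-in for the correlation code, not a reproduction of it.

```python
import time
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool


def cpu_bound(n):
    # Pure-Python loop: the running thread holds the GIL while it computes.
    total = 0
    for i in range(n):
        total += i % 7
    return total


def timed_map(pool_cls, func, inputs, workers=4):
    """Run func over inputs with the given pool class; return (results, seconds)."""
    start = time.perf_counter()
    with pool_cls(workers) as pool:
        results = pool.map(func, inputs)
    return results, time.perf_counter() - start


if __name__ == "__main__":
    inputs = [300000] * 16
    thread_results, thread_seconds = timed_map(ThreadPool, cpu_bound, inputs)
    process_results, process_seconds = timed_map(Pool, cpu_bound, inputs)
    assert thread_results == process_results
    print("threads:   %.2f s" % thread_seconds)
    print("processes: %.2f s" % process_seconds)
```

Both pools expose the same map interface, which is why swapping the import is the only change needed in the question's code.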

multiprocessing - Influence of number of processes on processing time

It may be a really stupid question, but I didn't find any documentation that answers it fully. I'm trying to familiarise myself with the multiprocessing library in Python by parallelizing a task with multiprocessing.Pool.
I initiate the number of processes in my Pool with:
Pool(processes=nmbr_of_processes).
The thing is, I don't understand exactly how this number of processes reduces the work duration. I wrote a script to evaluate it.
    import multiprocessing as mp
    import time
    def test_operation(y):
        total = 0  # renamed so the built-in sum() is not shadowed
        for x in range(1000):
            total += y * x
        return total
    def main():
        time1 = time.time()
        p = mp.Pool(processes=2)
        result = p.map(test_operation, range(100000))
        p.close()
        p.join()
        print('Parallel took {} seconds'.format(time.time() - time1))
        final = list()
        time2 = time.time()
        for y in range(100000):
            final.append(test_operation(y))
        print('Serial took {} seconds'.format(time.time() - time2))
    if __name__ == '__main__':
        main()
The thing is, when I'm using 2 processes with mp.Pool(processes=2) I get typically:
Parallel took 5.162384271621704 seconds
Serial took 9.853888034820557 seconds
And if I'm using more processes, like p = mp.Pool(processes=4)
I get:
Parallel took 6.404058218002319 seconds
Serial took 9.667300701141357 seconds
I'm working on a Mac mini with a dual-core 3 GHz i7. I know I can't cut the duration to exactly half of the serial run. But I can't understand why adding more processes increases the duration compared to a run with 2 processes. And if there is an optimal number of processes to start depending on the CPU, what would it be?

The thing to note here is that this applies to CPU-bound tasks; your code is heavy on CPU usage. The first thing to do is check how many logical cores you have:
    import multiprocessing as mp
    print(mp.cpu_count())
For CPU-bound tasks like this, there is no benefit in creating a pool with more workers than logical cores. If you don't specify the size of the Pool, it defaults to this number. However, this neglects something else: your code is not the only thing that your OS has to run.
If you launch as many processes as logical cores, the system has to interrupt your processes periodically just to keep everything else running, so you're likely to take a performance hit. You can't monopolise all cores. A general rule of thumb is a pool size of cpu_count() - 1, which leaves the OS one core free for other processes.
I was surprised to find that other answers I found don't mention this general rule; it seems to be confined to comments etc. However, your own tests show that it is applicable to the performance in your case so is a reasonable heuristic to determine pool size.
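The rule of thumb can be written down as a small helper. This is only a sketch: suggested_pool_size is a hypothetical name, and the lower bound of 1 is an added assumption so that a single-core machine still gets one worker.

```python
import multiprocessing as mp


def suggested_pool_size(logical_cores=None):
    """Leave one core to the OS, but never go below one worker."""
    if logical_cores is None:
        logical_cores = mp.cpu_count()
    return max(1, logical_cores - 1)


if __name__ == "__main__":
    print("logical cores:", mp.cpu_count())
    print("suggested pool size:", suggested_pool_size())
```

The result can be passed straight to the pool, for example mp.Pool(processes=suggested_pool_size()). It is a starting point for measurement, not a guarantee of the best setting.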

python multiprocessing Pool not always using all workers

The problem:
When sending 1000 tasks to apply_async, they run in parallel on all 48 CPUs, but then fewer and fewer CPUs stay busy until only one is still running. Only when that last one finishes its task do all the CPUs start running again, each with a new task. It shouldn't need to wait for any "task batch" like this.
My (simplified) code:
from multiprocessing import Pool
pool = Pool(47)
tasks = [pool.apply_async(json2features, (j,)) for j in jsons]
feats = [t.get() for t in tasks]
jsons = [...] is a list of about 1000 JSONs already loaded to memory and parsed to objects.
json2features(json) does some CPU-heavy work on a json, and returns an array of numbers.
This function may take between 1 second and 15 minutes to run, and because of this I sort the jsons using a heuristic, s.t. hopefully the longest tasks are first in the list, and thus start first.
The json2features function also prints when a task is finished and how long it took. It all runs on an Ubuntu server with 48 cores and, like I said above, it starts out great, using all 47 cores. Then as the tasks get completed, fewer and fewer cores run, which at first sounds perfectly OK, were it not that after the last core finishes (when I see its print to stdout), all CPUs start running again on new tasks, meaning it wasn't really the end of the list. It may do the same thing again, and then again for the actual end of the list.
Sometimes it can be using just one core for 5 minutes, and when the task is finally done, it starts using all cores again, on new tasks. (So it's not stuck on some IPC overhead.)
There are no repeated jsons, nor any dependencies between them (it's all static, fresh-from-disk data, no references etc.), nor any dependency between json2features calls (no global state or anything) except for them using the same terminal for their print.
I was suspicious that the problem was that a worker doesn't get released until get is called on its result, so I tried the following code:
from multiprocessing import Pool
pool = Pool(47)
tasks = [pool.apply_async(print, (i,)) for i in range(1000)]
# feats = [t.get() for t in tasks]
And it does print all 1000 numbers, even though get isn't called.
I have run out of ideas about what the problem might be.
Is this really the normal behavior of Pool?
Thanks a lot!

The multiprocessing.Pool relies on a single os.pipe to deliver the tasks to the workers.
On Unix, the default pipe size usually ranges from 4 to 64 KiB. If the JSONs you are delivering are large, the pipe can get clogged at any given point in time.
This means that, while one of the workers is busy reading a large JSON from the pipe, all the other workers starve.
It is generally bad practice to share large data via IPC, as it leads to bad performance. This is even underlined in the multiprocessing programming guidelines:
"Avoid shared state. As far as possible one should try to avoid shifting large amounts of data between processes."
Instead of reading the JSON files in the main process, just send the workers their file names and let them open and read the content. You will likely notice an improvement in performance, because you are moving the JSON loading phase into the concurrent domain as well.
Note that the same is true for the results. A single os.pipe is used to return the results to the main process as well. If one or more workers clog the results pipe, all the processes end up waiting for the main one to drain it. Large results should be written to files as well. You can then use multithreading in the main process to quickly read the results back from the files.
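A sketch of the file-name approach might look like the following. features_from_path and extract_all are hypothetical names, and the one-element feature list is only a placeholder for the real json2features work from the question.

```python
import json
from multiprocessing import Pool


def features_from_path(path):
    # Runs inside a worker: only the short path string crossed the pipe.
    with open(path, "r", encoding="utf-8") as handle:
        document = json.load(handle)
    # Placeholder for the real CPU-heavy work (json2features in the question).
    return [len(document)]


def extract_all(paths, processes=4):
    # Each task carries a file name instead of a parsed JSON object.
    with Pool(processes=processes) as pool:
        return pool.map(features_from_path, paths)
```

Call extract_all(paths) from under an if __name__ == "__main__": guard in the main script. If the feature arrays themselves are large, the worker could write them to a file and return only that file's name, in line with the advice about results above.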

Python multithreading performance

I have to process, say, thousands of records in an array. I used a normal for loop like this:
    for record in records:
        results = processFile(record)
        write_output_record(o, results)
The script above took 427.270612955 seconds!
As there is no dependency between these records, I used Python's multithreading module in the hope of speeding up the process. Below is my implementation:
import multiprocessing
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(processes=threads)
results = pool.map(processFile, records)
pool.close()
pool.join()
write_output(o, results)
My computer has 8 CPUs, and it takes 852.153398991 seconds.
Can somebody help me figure out what I am doing wrong?
PS: the processFile function does no I/O. It mostly processes the record and sends back the updated record.

Try using vmstat to verify whether it's a memory issue. Sometimes multithreading can slow your system down if each thread pushes up the RAM usage by a significant amount.
Usually people encounter three types of issues: CPU bound (Constraint on CPU computations), Memory bound (Constraint on RAM) and I/O bound (Network & hard drive I/O constraints).
