from multiprocessing import Pool
def op1(data):
    return [data[elem] + 1 for elem in range(len(data))]
data = [[elem for elem in range(20)] for elem in range(500000)]
import time
start_time = time.time()
re = []
for data_ in data:
    re.append(op1(data_))
print('--- %s seconds ---' % (time.time() - start_time))
start_time = time.time()
pool = Pool(processes=4)
data = pool.map(op1, data)
print('--- %s seconds ---' % (time.time() - start_time))
I get a much slower run time with the pool than I get with the for loop. But isn't the pool supposed to be using 4 processes to do the computation in parallel?
Short answer: Yes, the operations will usually be done on (a subset of) the available cores. But the communication overhead is large. In your example the workload is too small compared to the overhead.
When you construct a pool, a number of workers are created. If you then instruct it to map a given input, the following happens:
the data will be split: every worker gets an approximately fair share;
the data will be communicated to the workers;
every worker will process their share of work;
the results are communicated back to the main process; and
the main process groups the results together.
Now splitting, communicating and joining the data are all steps that are carried out by the main process. They cannot be parallelized. Since the operation itself is fast (O(n) in the input size n), the overhead has the same time complexity.
So, complexity-wise, even if you had millions of cores it would not make much difference, because communicating the list is probably already more expensive than computing the results.
That's why you should parallelize computationally expensive tasks, not trivial ones: the amount of processing should be large compared to the amount of communication.
In your example the work is trivial: you add 1 to every element. Serialization, however, is less trivial: you have to encode the lists you send to the workers.
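As a rough sketch of that trade-off (the heavy_op function and its repetition count are invented purely for illustration, not taken from your code), the version below gives each task enough CPU work to outweigh the serialization cost, and passes a larger chunksize so fewer, bigger messages are sent:

import time
from multiprocessing import Pool

def heavy_op(row):
    # Repeat the additions many times so each task carries real CPU work;
    # the factor of 2000 is arbitrary, chosen only to dwarf the pickling cost.
    total = 0
    for _ in range(2000):
        total += sum(x + 1 for x in row)
    return total

if __name__ == '__main__':
    data = [list(range(20)) for _ in range(10000)]

    start = time.time()
    serial = [heavy_op(row) for row in data]
    print('serial:   %.2f s' % (time.time() - start))

    start = time.time()
    with Pool(processes=4) as pool:
        # A larger chunksize means fewer, bigger messages to the workers.
        parallel = pool.map(heavy_op, data, chunksize=500)
    print('parallel: %.2f s' % (time.time() - start))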
There are a couple of potential trouble spots with your code, but primarily it's too simple.
The multiprocessing module works by creating different processes and communicating among them. For each process created, you have to pay the operating system's process startup cost as well as the Python startup cost. Those costs can be high or low, but they're non-zero in any case.
Once you pay those startup costs, you then pool.map the worker function across all the processes, which basically adds 1 to a few numbers. This is not a significant load, as your tests show.
What's worse, you're using .map() which is implicitly ordered (compare with .imap_unordered()), so there's synchronization going on - leaving even less freedom for the various CPU cores to give you speed.
If there's a problem here, it's a "design of experiment" problem - you haven't created a sufficiently difficult problem for multiprocessing to be able to help you.
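For reference, a minimal sketch of the .imap_unordered() variant mentioned above (the pool size and chunksize here are just placeholders); it relaxes the ordering constraint, though with work this trivial it still won't beat the plain loop:

from multiprocessing import Pool

def op1(row):
    return [x + 1 for x in row]

if __name__ == '__main__':
    data = [list(range(20)) for _ in range(500000)]
    with Pool(processes=4) as pool:
        # imap_unordered yields results as workers finish, in arbitrary order,
        # so the parent never has to hold results back to preserve ordering.
        results = list(pool.imap_unordered(op1, data, chunksize=1000))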
As others have noted, the overhead that you pay to facilitate multiprocessing is more than the time-savings gained by parallelizing across multiple cores. In other words, your function op1() does not require enough CPU resources to see performance gain from parallelizing.
In the multiprocessing.Pool class, the majority of this overhead is spent serializing and deserializing data before it is shuttled between the parent process (which creates the Pool) and the child "worker" processes.
This blog post explores, in greater detail, how expensive pickling (serializing) can be when using multiprocessing.Pool.
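To get a feel for the serialization cost on its own, a small sketch like the one below (the sizes match the question; the timing is illustrative, not a rigorous benchmark) pickles the same payload the pool has to ship to its workers:

import pickle
import time

data = [list(range(20)) for _ in range(500000)]

start = time.time()
payload = pickle.dumps(data)      # what the parent must encode
restored = pickle.loads(payload)  # what the other end must decode
elapsed = time.time() - start

print('pickle round trip: %.2f s, %.1f MB' % (elapsed, len(payload) / 1e6))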
Related
I have a dictionary (in memory) data that has ~10,000 keys, where each key represents a stock ticker and the value stores the pandas DataFrame representation of time-series data for the daily stock price. I am trying to calculate the pairwise Pearson correlation.
The code takes a long time (~3 hours) to iterate through all the combinations, which is O(n^2), roughly C(10000, 2) pairs. I tried to use the multiprocessing dummy package but saw no performance gain AT ALL (it actually gets slower as the number of workers increases).
from multiprocessing.dummy import Pool
from scipy.stats import pearsonr  # assumed import; data is the dict of ticker -> DataFrame described above

def calculate_correlation(pair):
    t1, t2 = pair
    # pseudo code here
    return pearsonr(data[t1]['Close'], data[t2]['Close'])

todos = []
for idx, t1 in enumerate(list(data.keys())):
    for t2 in list(data.keys())[idx:]:  # only the upper triangle of the matrix
        todos.append((t1, t2))

pool = Pool(4)
results = pool.map(calculate_correlation, todos)
pool.close()
pool.join()
All the data has been loaded into memory, so it should not be I/O intensive. Is there any reason why there is no performance gain at all?
When you use multiprocessing.dummy, you're using threads, not processes. For a CPU-bound application in Python, you are usually not going to get a performance boost from multi-threading. You should use multi-processing instead to parallelize your code. So, if you change your code from
from multiprocessing.dummy import Pool
to
from multiprocessing import Pool
This should substantially improve your performance.
The above should fix your problem, but if you want to know why it happened, please continue reading.
Multi-threading in Python is limited by the Global Interpreter Lock (GIL), which prevents two threads in the same process from executing Python bytecode at the same time. If you had a lot of disk I/O happening, multi-threading would have helped, because the GIL is released while a thread waits on I/O. Similarly, if your Python code spent most of its time calling into an external application or library that releases the GIL, multi-threading would have helped. Multi-processing, on the other hand, runs separate processes, each with its own interpreter and GIL. In a CPU-bound Python application such as yours, switching from multi-threading to multi-processing lets the work run on several cores in parallel, which will boost the performance of your application.
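A self-contained sketch of the difference (the busy_work function and the sizes are invented just to have something CPU-bound to time):

import time
from multiprocessing import Pool as ProcessPool
from multiprocessing.dummy import Pool as ThreadPool

def busy_work(n):
    # Pure-Python CPU work: threads serialize on the GIL, processes do not.
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(pool_cls, label):
    start = time.time()
    with pool_cls(4) as pool:
        pool.map(busy_work, [200000] * 200)
    print('%s: %.2f s' % (label, time.time() - start))

if __name__ == '__main__':
    timed(ThreadPool, 'threads   (multiprocessing.dummy)')
    timed(ProcessPool, 'processes (multiprocessing)')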
It may be a really stupid question, but I didn't find any doc which perfectly answers it. I'm trying to familiarise myself with the multiprocessing library in Python and to parallelize a task using multiprocessing.Pool.
I set the number of processes in my Pool with:
Pool(processes=nmbr_of_processes).
The thing is, I don't understand exactly how this number of processes reduces the work duration. I wrote a script to evaluate it:
import time
import multiprocessing as mp

def test_operation(y):
    total = 0
    for x in range(1000):
        total += y * x
    return total

def main():
    time1 = time.time()
    p = mp.Pool(processes=2)
    result = p.map(test_operation, range(100000))
    p.close()
    p.join()
    print('Parallel took {} seconds'.format(time.time() - time1))

    final = list()
    time2 = time.time()
    for y in range(100000):
        final.append(test_operation(y))
    print('Serial took {} seconds'.format(time.time() - time2))

if __name__ == '__main__':
    main()
The thing is, when I'm using 2 processes with mp.Pool(processes=2) I typically get:
Parallel took 5.162384271621704 seconds
Serial took 9.853888034820557 seconds
And if I'm using more processes, like p = mp.Pool(processes=4)
I get:
Parallel took 6.404058218002319 seconds
Serial took 9.667300701141357 seconds
I'm working on a Mac Mini with a dual-core i7 at 3 GHz. I know I can't cut the work duration in half compared to running it serially. But I can't understand why adding more processes increases the duration compared to using 2 processes. And if there is an optimal number of processes to start, depending on the CPU, what would it be?
The thing to note here is that this applies to CPU-bound tasks; your code is heavy on CPU usage. The first thing to do is check how many theoretical cores you have:
import multiprocessing as mp
print(mp.cpu_count())
For CPU-bound tasks like this, there is no benefit to be gained by creating a pool with more workers than theoretical cores. If you don't specify the size of the Pool, it will default back to this number. However, this neglects something else; your code is not the only thing that your OS has to run.
If you launch as many processes as theoretical cores, the system has no choice but to interrupt your processes periodically simply to keep running, so you're likely to get a performance hit. You can't monopolise all cores. The general rule-of-thumb here is to have a pool size of cpu_count() - 1, which leaves the OS a core free to use on other processes.
I was surprised to find that other answers don't mention this general rule; it seems to be confined to comments and the like. However, your own tests show that it applies to the performance in your case, so it is a reasonable heuristic for determining pool size.
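A minimal sketch of that heuristic (the work function below is only a placeholder standing in for your test_operation):

import multiprocessing as mp

def work(y):
    # Placeholder CPU-bound task.
    return sum(y * x for x in range(1000))

if __name__ == '__main__':
    # Leave one core free for the OS and everything else it has to run.
    n_workers = max(1, mp.cpu_count() - 1)
    with mp.Pool(processes=n_workers) as pool:
        results = pool.map(work, range(100000))
    print('used {} workers'.format(n_workers))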
I have to process, say, thousands of records in an array. I did the normal for loop like this:
for record in records:
    results = processFile(record)
    write_output_record(o, results)
The script above took 427.270612955 seconds!
As there is no dependency between these records, I used Python's multithreading module in the hope of speeding up the process. Below is my implementation:
import multiprocessing
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(processes=threads)
results = pool.map(processFile, records)
pool.close()
pool.join()
write_output(o, results)
My computer has 8 CPUs, and this takes 852.153398991 seconds.
Can somebody help me understand what I am doing wrong?
PS: the processFile function does no I/O; it mostly processes the records and sends back the updated record.
Try using vmstat and verify whether it's a memory issue. Sometimes multithreading can slow your system down if each thread pushes up the RAM usage by a significant amount.
Usually people encounter three types of bottleneck: CPU bound (constrained by CPU computation), memory bound (constrained by RAM) and I/O bound (constrained by network and hard-drive I/O).
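If you want a quick check from inside the script itself, a minimal sketch along these lines (the peak_rss_mb helper is just an illustration, using only the standard-library resource module, which is Unix-only) reports the peak memory of the process before and after the pool runs:

import resource
import sys

def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss / 1024 if sys.platform != 'darwin' else rss / (1024 * 1024)

print('peak RSS before: %.1f MB' % peak_rss_mb())
# ... run the thread pool over the records here ...
print('peak RSS after:  %.1f MB' % peak_rss_mb())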
I have been playing around with a multiprocessing problem and noticed that my algorithm is slower when I parallelize it than when it runs single-threaded.
In my code I don't share memory.
And I'm pretty sure my algorithm (see code), which is just nested loops, is CPU bound.
However, no matter what I do, the parallel code runs 10-20% slower on all my computers.
I also ran this on a 20-CPU virtual machine, and the single-threaded version beats the multithreaded one every time (it's actually even slower up there than on my computer).
from multiprocessing.dummy import Pool as ThreadPool
from multi import chunks
from random import random
import logging
import time

## Build two sets of stuff whose product we can iterate over
S = []
for x in range(100000):
    S.append({'value': x * random()})

H = []
for x in range(255):
    H.append({'value': x * random()})

# the function for each thread
# just nested iteration
def doStuff(HH):
    R = []
    for k in HH['S']:
        for h in HH['H']:
            R.append(k['value'] * h['value'])
    return R

# we will split the work
# between the worker threads and give each
# 5 items to iterate over against the big list
HChunks = chunks(H, 5)
XChunks = []

# turn them into dictionaries, so I can pass in both
# the S and H lists
# Note: I do this because I'm not sure whether using the global
# S would spend too much time on cache synchronization or not;
# the idea is that I don't want each thread to share anything.
for x in HChunks:
    XChunks.append({'H': x, 'S': S})

print("Process")
t0 = time.time()
pool = ThreadPool(4)
R = pool.map(doStuff, XChunks)
pool.close()
pool.join()
t1 = time.time()

# measured time for 4 threads is slower
# than when I have this code just do
# doStuff(..) in a non-parallel way
# Why!?
total = t1 - t0
print("Took", total, "secs")
There are many related questions open, but many are geared toward code being structured incorrectly, e.g. each worker being I/O bound and such.
You are using multithreading, not multiprocessing. While many languages allow threads to run in parallel, Python does not. A thread is just a separate flow of control, i.e. it holds its own stack, current function, etc. The Python interpreter just switches between executing each of them every now and then.
Basically, all threads are effectively running on a single core. They will only speed up your program when you are not CPU bound.
multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module.
Multithreading is usually slower than single threading if you are CPU bound. This is because the work and processing resources stay the same, but you add overhead for managing the threads, e.g. switching between them.
How to fix this: instead of from multiprocessing.dummy import Pool as ThreadPool, use from multiprocessing import Pool as ThreadPool, so the rest of your code can stay unchanged.
You might want to read up on the GIL, the Global Interpreter Lock. It's what prevents threads from running in parallel (that, and it has implications for single-threaded performance). Python interpreters other than CPython may not have a GIL and may be able to run multithreaded code on several cores.
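To close, here is a hedged, self-contained sketch of that fix applied to the question's structure. The chunks helper from the multi module isn't shown in the question, so a hypothetical inline version is used here; note also that every chunk still carries all of S out and its result list back, so serialization may eat part of the gain, as the earlier answers explain:

import time
from random import random
from multiprocessing import Pool as ThreadPool  # a process pool, keeping the original alias

def chunks(lst, n):
    # Hypothetical stand-in for multi.chunks: split lst into pieces of length n.
    return [lst[i:i + n] for i in range(0, len(lst), n)]

def doStuff(HH):
    R = []
    for k in HH['S']:
        for h in HH['H']:
            R.append(k['value'] * h['value'])
    return R

if __name__ == '__main__':
    S = [{'value': x * random()} for x in range(100000)]
    H = [{'value': x * random()} for x in range(255)]
    XChunks = [{'H': x, 'S': S} for x in chunks(H, 5)]

    t0 = time.time()
    with ThreadPool(4) as pool:  # despite the alias, these are now processes
        R = pool.map(doStuff, XChunks)
    print("Took", time.time() - t0, "secs")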