I am trying to speed up some code with multiprocessing in Python, but I cannot understand one point. Assume I have the following dumb function:
import time
from multiprocessing.pool import Pool
def foo(_):
for _ in range(100000000):
a = 3
When I run this code without using multiprocessing (see the code below) on my laptop (Intel - 8 cores cpu) time taken is ~2.31 seconds.
t1 = time.time()
foo(1)
print(f"Without multiprocessing {time.time() - t1}")
Instead, when I run this code by using Python multiprocessing library (see the code below) time taken is ~6.0 seconds.
pool = Pool(8)
t1 = time.time()
pool.map(foo, range(8))
print(f"Sample multiprocessing {time.time() - t1}")
From the best of my knowledge, I understand that when using multiprocessing there is some time overhead mainly caused by the need to spawn the new processes and to copy the memory state. However, this operation should be performed just once when the processed are initially spawned at the very beginning and should not be that huge.
So what I am missing here? Is there something wrong in my reasoning?
Edit: I think it is better to be more explicit on my question. What I expected here was the multiprocessed code to be slightly slower than the sequential one. It is true that I don't split the whole work across the 8 cores, but I am using 8 cores in parallel to do the same job (hence in an ideal world the processing time should more or less stay the same). Considering the overhead of spawning new processes, I expected a total increase in time of some (not too big) percentage, but not of a ~2.60x increase as I got here.
Well, multiprocessing can't possibly make this faster: you're not dividing the work across 8 processes, you're asking each of 8 processes to do the entire thing. Each process will take at least as long as your code doing it just once without using multiprocessing.
So if multiprocessing weren't helping at all, you'd expect it to take about 8 times as long (it's doing 8x the work!) as your single-processor run. But you said it's not taking 2.31 * 8 ~= 18.5 seconds, but "only" about 6. So you're getting better than a factor of 3 speedup.
Why not more than that? Can't guess from here. That will depend on how many physical cores your machine has, and how much other stuff you're running at the same time. Each process will be 100% CPU-bound for this specific function, so the number of "logical" cores is pretty much irrelevant - there's scant opportunity for processor hyper-threading to help. So I'm guessing you have 4 physical cores.
On my box
Sample timing on my box, which has 8 logical cores but only 4 physical cores, and otherwise left the box pretty quiet:
Without multiprocessing 2.468580484390259
Sample multiprocessing 4.78624415397644
As above, none of that surprises me. In fact, I was a little surprised (but pleasantly) at how effectively the program used up the machine's true capacity.
#TimPeters already answered that you are actually just running the job 8 times across the 8 Pool subprocesses, so it is slower not faster.
That answers the issue but does not really answer what your real underlying question was. It is clear from your surprise at this result, that you were expecting that the single job to somehow be automatically split up and run in parts across the 8 Pool processes. That is not the way that it works. You have to build in/tell it how to split up the work.
Different kinds of jobs needs need to be subdivided in different ways, but to continue with your example you might do something like this:
import time
from multiprocessing.pool import Pool
def foo(_):
for _ in range(100000000):
a = 3
def foo2(job_desc):
start, stop = job_desc
print(f"{start}, {stop}")
for _ in range(start, stop):
a = 3
def main():
t1 = time.time()
foo(1)
print(f"Without multiprocessing {time.time() - t1}")
pool_size = 8
pool = Pool(pool_size)
t1 = time.time()
top_num = 100000000
size = top_num // pool_size
job_desc_list = [[size * j, size * (j+1)] for j in range(pool_size)]
# this is in case the the upper bound is not a multiple of pool_size
job_desc_list[-1][-1] = top_num
pool.map(foo2, job_desc_list)
print(f"Sample multiprocessing {time.time() - t1}")
if __name__ == "__main__":
main()
Which results in:
Without multiprocessing 3.080709171295166
0, 12500000
12500000, 25000000
25000000, 37500000
37500000, 50000000
50000000, 62500000
62500000, 75000000
75000000, 87500000
87500000, 100000000
Sample multiprocessing 1.5312283039093018
As this shows, splitting the job up does allow it to take less time. The speedup will depend on the number of CPUs. In a CPU bound job you should try to limit it the pool size to the number of CPUs. My laptop has plenty more CPU's but some of the benefit is lost to the overhead. If the jobs were longer this should look more useful.
Related
I am coding a phase retrieval algorithm and am currently stuck with what I think is a bad multiprocessing speedup, given the problem.
The algorithm itself is composed of an iterative sequence of float operations on a numpy matrix whose size is on the order of 10-100 MB.
No IO operations are done while the algorithm is running.
Parallelization just amounts to running several of these iterative procedures in parallel using multiprocessing.Process.
I tested the program on a node with 40 physical CPU Cores (80 threads) and 250 GB RAM.
Given that there is no communication between the processes and no IO calls I expected a multiprocessing speedup somewhere between 40 and 80 times on this node (was this naive ?).
However the best I could achieve was a speedup of about 20 using 40 parallel Processes.
To diagnose I used cProfile on a random process which is part of a run with 1,40 and 70 parallel executions.
It tuns out that the relative amount of time spent for each sub part of the iterative procedure stays roughly constant for all three tests but each individual operation takes much longer. This is to the point where a simple call to numpy.conjugate takes 4 to 5 times longer in the 70 parallel processes case as compared to the single process case. Here are some Snakeviz diagrams:
Clearly I am running into some kind of bottleneck but what is it and how to further diagnose?
RAM space is not a problem the 250GB are more than enough.
Could RAM bandwidth be an issue ? How to check that?
The CPU-core usage of a single process in all cases is close to 100%. With 80 available "cores" this results in roughly 1%,50%,90% overall CPU usage for the 1,40 and 70 parallel processes case.
EDIT
Here is a minimum working example just using numpy.conjugate()
from multiprocessing import Process, Queue
import time
import numpy as np
import timeit
import os
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'
def f(q):
a = np.random.rand(64,254,128) + 1.j*np.random.rand(64,254,128)
start = time.time()
for i in range(10):
a.conj()
duration = time.time()-start
q.put(duration)
def speed_test(number_of_processes=1):
number_of_processes = number_of_processes
process_list=[]
queue = Queue()
#Start processes
for p_id in range(number_of_processes):
p = Process(target=f,args=(queue,))
process_list.append(p)
p.start()
#Wait until all processes are finished
for p in process_list:
p.join()
output = []
while queue.qsize() != 0:
output.append(queue.get())
return np.mean(output)
if __name__ == '__main__':
p1 = speed_test(number_of_processes=1)
p40 = speed_test(number_of_processes=40)
p70 = speed_test(number_of_processes=70)
print('\n 1 process took {} seconds\n 40 processses took {} seconds \n 70 processes took {} seconds'.format(p1,p40,p70))
The numpy.conjugate() part of this code runs about 5 times slower if executed in parallel on 70 cores compared to executing it on a single core. I would have expected a much lower runtime difference what is causing this?
EDIT 2
On suggestion from #Ahmed AEK, here is a plot of the timings for different array sizes.
And for bigger matrices:
Thanks for reading the post!
If I can provide further data just let me know.
I have a python program which 1) Reads from a very large file from Disk(~95% time) and then 2) Process and Provide a relatively small output (~5% time). This Program is to be run on TeraBytes of files .
Now i am looking to Optimize this Program by utilizing Multi Processing and Multi Threading . The platform I am running is a Virtual Machine with 4 Processors on a virtual Machine .
I plan to have a scheduler Process which will execute 4 Processes (same as processors) and then Each Process should have some threads as most part is I/O . Each Thread will process 1 file & will report result to the Main Thread which in turn will report it back to scheduler Process via IPC . Scheduler can queue these and eventually write them to disk in ordered manner
So wondering How does one decide number of Processes and Threads to create for such scenario ? Is there a Mathematical way to figure out whats the best mix .
Thankyou
I think I would arrange it the inverse of what you are doing. That is, I would create a thread pool of a certain size that would be responsible for producing the results. The tasks that get submitted to this pool would be passed as an argument a processor pool that could be used by the worker thread for submitting the CPU-bound portions of work. In other words, the thread pool workers would primarily be doing all the disk-related operations and handing off to the processor pool any CPU-intensive work.
The size of the processor pool should just be the number of processors you have in your environment. It's difficult to give a precise size for the thread pool; it depends on how many concurrent disk operations it can handle before the law of diminishing returns come into play. It also depends on your memory: The larger the pool, the greater the memory resources that will be taken, especially if entire files have to be read into memory for processing. So, you may have to experiment with this value. The code below outlines these ideas. What you gain from the thread pool is overlapping of I/O operations greater than you would achieve if you just used a small processor pool:
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from functools import partial
import os
def cpu_bound_function(arg1, arg2):
...
return some_result
def io_bound_function(process_pool_executor, file_name):
with open(file_name, 'r') as f:
# Do disk related operations:
. . . # code omitted
# Now we have to do a CPU-intensive operation:
future = process_pool_executor.submit(cpu_bound_function, arg1, arg2)
result = future.result() # get result
return result
file_list = [file_1, file_2, file_n]
N_FILES = len(file_list)
MAX_THREADS = 50 # depends on your configuration on how well the I/O can be overlapped
N_THREADS = min(N_FILES, MAX_THREADS) # no point in creating more threds than required
N_PROCESSES = os.cpu_count() # use the number of processors you have
with ThreadPoolExecutor(N_THREADS) as thread_pool_executor:
with ProcessPoolExecutor(N_PROCESSES) as process_pool_executor:
results = thread_pool_executor.map(partial(io_bound_function, process_pool_executor), file_list)
Important Note:
Another far simpler approach is to just have a single, processor pool whose size is greater than the number of CPU processors you have, for example, 25. The worker processes will do both I/O and CPU operations. Even though you have more processes than CPUs, many of the processes will be in a wait state waiting for I/O to complete allowing CPU-intensive work to run.
The downside to this approach is that the overhead in creating a N processes is far greater than the overhead in creating N threads + a small number of processes. However, as the running time of the tasks submitted to the pool becomes increasingly larger, then this increased overhead becomes decreasingly a smaller percentage of the total run time. So, if your tasks are not trivial, this could be a reasonably performant simplification.
Update: Benchmarks of Both Approaches
I did some benchmarks against the two approaches processing 24 files whose sizes were approximately 10,000KB (actually, these were just 3 different files processed 8 times each, so there might have been some caching done):
Method 1 (thread pool + processor pool)
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from functools import partial
import os
from math import sqrt
import timeit
def cpu_bound_function(b):
sum = 0.0
for x in b:
sum += sqrt(float(x))
return sum
def io_bound_function(process_pool_executor, file_name):
with open(file_name, 'rb') as f:
b = f.read()
future = process_pool_executor.submit(cpu_bound_function, b)
result = future.result() # get result
return result
def main():
file_list = ['/download/httpd-2.4.16-win32-VC14.zip'] * 8 + ['/download/curlmanager-1.0.6-x64.exe'] * 8 + ['/download/Element_v2.8.0_UserManual_RevA.pdf'] * 8
N_FILES = len(file_list)
MAX_THREADS = 50 # depends on your configuration on how well the I/O can be overlapped
N_THREADS = min(N_FILES, MAX_THREADS) # no point in creating more threds than required
N_PROCESSES = os.cpu_count() # use the number of processors you have
with ThreadPoolExecutor(N_THREADS) as thread_pool_executor:
with ProcessPoolExecutor(N_PROCESSES) as process_pool_executor:
results = list(thread_pool_executor.map(partial(io_bound_function, process_pool_executor), file_list))
print(results)
if __name__ == '__main__':
print(timeit.timeit(stmt='main()', number=1, globals=globals()))
Method 2 (processor pool only)
from concurrent.futures import ProcessPoolExecutor
from math import sqrt
import timeit
def cpu_bound_function(b):
sum = 0.0
for x in b:
sum += sqrt(float(x))
return sum
def io_bound_function(file_name):
with open(file_name, 'rb') as f:
b = f.read()
result = cpu_bound_function(b)
return result
def main():
file_list = ['/download/httpd-2.4.16-win32-VC14.zip'] * 8 + ['/download/curlmanager-1.0.6-x64.exe'] * 8 + ['/download/Element_v2.8.0_UserManual_RevA.pdf'] * 8
N_FILES = len(file_list)
MAX_PROCESSES = 50 # depends on your configuration on how well the I/O can be overlapped
N_PROCESSES = min(N_FILES, MAX_PROCESSES) # no point in creating more threds than required
with ProcessPoolExecutor(N_PROCESSES) as process_pool_executor:
results = list(process_pool_executor.map(io_bound_function, file_list))
print(results)
if __name__ == '__main__':
print(timeit.timeit(stmt='main()', number=1, globals=globals()))
Results:
(I have 8 cores)
Thread Pool + Processor Pool: 13.5 seconds
Processor Pool Alone: 13.3 seconds
Conclusion: I would try the simpler approach first of just using a processor pool for everything. Now the tricky bit is deciding what the maximum number of processes to create, which was part of your original question and had a simple answer when all it was doing was the CPU-intensive computations. If the number of files you are reading are not too many, then the point is moot; you can have one process per file. But if you have hundreds of files, you will not want to have hundreds of processes in your pool (there is also an upper limit to how many processes you can create and again there are those nasty memory constraints). There is just no way I can give you an exact number. If you do have a large number of files, start with a smallish pool size and keep incrementing until you get no further benefit (of course, you probably do not want to be processing more files than some maximum number for these tests or you will be running forever just deciding on a good pool size for the real run).
For parallel processing:
I saw this question, and quoting the accepted answer:
In practice, it can be difficult to find the optimal number of threads and even that number will likely vary each time you run the program. So, theoretically, the optimal number of threads will be the number of cores you have on your machine. If your cores are "hyper threaded" (as Intel calls it) it can run 2 threads on each core. Then, in that case, the optimal number of threads is double the number of cores on your machine.
For multiprocessing:
Someone asked a similar question here, and the accepted answer said this:
If all of your threads/processes are indeed CPU-bound, you should run as many processes as the CPU reports cores. Due to HyperThreading, each physical CPU cores may be able to present multiple virtual cores. Call multiprocessing.cpu_count to get the number of virtual cores.
If only p of 1 of your threads is CPU-bound, you can adjust that number by multiplying by p. For example, if half your processes are CPU-bound (p = 0.5) and you have two CPUs with 4 cores each and 2x HyperThreading, you should start 0.5 * 2 * 4 * 2 = 8 processes.
The key here is understand what machine are you using, from that, you can choose a nearly optimal number of threads/processes to split the execution of you code. And I said nearly optimal because it will vary a little bit every time you run your script, so it'll be difficult to predict this optimal number from a mathematical point of view.
For your specific situation, if your machine has 4 cores, I would recommend you to only create 4 threads max, and then split them:
1 to the main thread.
3 for file reading and process.
using multiple processes to speed up IO performance may not be a good idea, check this and the sample code below it to see wether it is helpful
One idea can be to have a thread only reading the file (If I understood well, there is only one file) and pushing the independent parts (for ex. rows) into queue with messages.
The messages can be processed by 4 threads. In this way, you can optimize the load between the processors.
On a strongly I/O-bound process (like what you are describing), you do not necessarily need multithreading nor multiprocessing: you could also use more advanced I/O primitives from your OS.
For example on Linux you can submit read requests to the kernel along with a suitably sized mutable buffer and be notified when the buffer is filled. This can be done using the AIO API, for which I've written a pure-python binding: python-libaio (libaio on pypi)), or with the more recent io_uring API for which there seems to be a CFFI python binding (liburing on pypy) (I have neither used io_uring nor this python binding).
This removes the complexity of parallel processing at your level, may reduce the number of OS/userland context switches (reducing the cpu time even further), and lets the OS know more about what you are trying to do, giving it the opportunity of scheduling the IO more efficiently (in a virtualised environment I would not be surprised if it reduced the number of data copies, although I have not tried it myself).
Of course, the downside is that your program will be more tightly bound to the OS you are executing it on, requiring more effort to get it to run on another one.
I am trying to compare sequential computation and parallel computation in Python.
This is the bench mark function.
def benchmking_f(n=0):
import time
items = range(int(10**(6+n)))
def f2(x):return x*x
start = time.time()
sum_squared = 0
for i in items:
sum_squared += f2(i)
return time.time() - start
this sequential computation
problem_size = 2
import time
start = time.time()
tlist = []
for i in range(5):
tlist.append(benchmking_f(problem_size))
print('for loop took {}s'.format(time.time() - start))
print('each iterate took')
print(tlist)
took about 70s to finish the job; each iterate took
[14.209498167037964, 13.92169737815857, 13.949078798294067, 13.94432258605957, 14.004642486572266]
this parallel approach
problem_size = 2
import itertools
import multiprocessing
start = time.time()
pool = multiprocessing.Pool(5)
tlist = list(pool.map(benchmking_f, itertools.repeat(problem_size, 5)))
print('pool.map took {}s'.format(time.time() - start))
print('each iterate took')
print(tlist)
took about 42.45s; each iterate took
[41.17476940155029, 41.92032074928284, 41.50966739654541, 41.348535776138306, 41.06284761428833]
question
A piece of the whole computation (benchmking_f in this case) took about 14s in sequential and 42.45s in parallel
Why is that?
Note:
I am not asking the total time. I am asking the time that A piece of the whole computation, which takes on one iteration in for loop, and one process/thread in parallel.
1-iter benchmking_f takes.
How many physical (not logical) cores do you have? You're trying to run 5 copies of the function simultaneously, the function takes 100% of one core for as long as it runs, and unless you have at least 5 physical cores they're going to fight each other tooth and nail for cycles.
I have 4 physical cores, but want to use my machine for other things too, so reduced your Pool(5) with Pool(3). Then the per-iterate timings were about the same either way.
Back of the envelope
Suppose you have a task that nails 100% of a CPU for T seconds. If you want to run S copies of that task simultaneously, that requires T*S cpu-seconds in total. If you have C entirely free physical cores to throw at it, at most min(C, S) cores can be working on the aggregate simultaneously, so to a first approximation the time needed will be:
T*S / min(C, S)
As another reply said, when you have more processes running than cores, the OS cycles through the processes for the duration, acting to make them all take about the same amount of wall-clock time (during some amount of which each process is doing nothing at all except waiting for the OS to let it run again for a while).
I'm guessing you have 2 physical cores. For your example, T is about 14 seconds, and S is 5, so if you had C=2 cores that works out to
14*5 / min(2, 5) = 14*5/2 = 35
seconds. You're actually seeing something closer to 41. Overheads account for part of that, but seems likely your machine was also doing other work at the same time, so your test run didn't get 100% of the 2 cores.
Total time is reduced: 70 second vs 42 second.
Your computer is processing 5 things at the same time, probably in a round-robin fashion. Threading overhead (context load etc.) occurs and each thread took longer. However, since longer threads are run in parallel, 5 threads finished within 42 second.
For sequential, your computer is processing the same thing 5 times. Each thread can run until it finishes without interrupt (hence, no overhead). Yet, all takes 70 second to finish.
I'm beginner to python and machine learning. I'm trying to reproduce the code for countvectorizer() using multi-threading. I'm working with yelp dataset to do sentiment analysis using LogisticRegression. This is what I've written so far:
Code snippet:
from multiprocessing.dummy import Pool as ThreadPool
from threading import Thread, current_thread
from functools import partial
data = df['text']
rev = df['stars']
y = []
def product_helper(args):
return featureExtraction(*args)
def featureExtraction(p,t):
temp = [0] * len(bag_of_words)
for word in p.split():
if word in bag_of_words:
temp[bag_of_words.index(word)] += 1
return temp
# function to be mapped over
def calculateParallel(threads):
pool = ThreadPool(threads)
job_args = [(item_a, rev[i]) for i, item_a in enumerate(data)]
l = pool.map(product_helper,job_args)
pool.close()
pool.join()
return l
temp_X = calculateParallel(12)
Here this is just part of code.
Explanation:
df['text'] has all the reviews and df['stars'] has the ratings (1 through 5). I'm trying to find the word count vector temp_X using multi-threading. bag_of_words is a list of some frequent words of choice.
Question:
Without multi-threading , I was able to compute the temp_X in around 24 minutes and the above code took 33 mins for a dataset of size 100k reviews. My machine has 128GB of DRAM and 12 cores (6 physical cores with hyperthreading i.e., threads per core=2).
What am I doing wrong here?
Your whole code seems CPU Bound rather than IO Bound.You are just using threads which are under GIL so effectively running just one thread plus overheads.It runs only on one core.To run on multiple cores use
Use
import multiprocessing
pool = multiprocessing.Pool()
l = pool.map_async(product_helper,job_args)
from multiprocessing.dummy import Pool as ThreadPool is just a wrapper over thread module.It utilises just one core and not more than that.
Python and threads dont really work together very well. There is a known issue called the GIL (global interperter lock). Basically there is a lock in the interperter that makes all threads not run in parallel (even if you have multiple cpu cores). Python will simply give each thread a few milliseconds of cpu time one after another (and the reason it became slower is the overhead from context switching between those threads).
Here is a really good document explaining how it works: http://www.dabeaz.com/python/UnderstandingGIL.pdf
To fix your problem i suggest you try multi processing:
https://pymotw.com/2/multiprocessing/basics.html
Note: multiprocessing is not 100% equivilent to multithreading. Multiprocessing will run at parallel but the diffrent processes wont share memory so if you change a variable in one of them it will not be changed in the other process.
I am trying to get to grips with multiprocessing in Python. I started by creating this code. It simply computes cos(i) for integers i and measures the time taken when one uses multiprocessing and when one does not. I am not observing any time difference. Here is my code:
import multiprocessing
from multiprocessing import Pool
import numpy as np
import time
def tester(num):
return np.cos(num)
if __name__ == '__main__':
starttime1 = time.time()
pool_size = multiprocessing.cpu_count()
pool = multiprocessing.Pool(processes=pool_size,
)
pool_outputs = pool.map(tester, range(5000000))
pool.close()
pool.join()
endtime1 = time.time()
timetaken = endtime1 - starttime1
starttime2 = time.time()
for i in range(5000000):
tester(i)
endtime2 = time.time()
timetaken2 = timetaken = endtime2 - starttime2
print( 'The time taken with multiple processes:', timetaken)
print( 'The time taken the usual way:', timetaken2)
I am observing no (or very minimal) difference between the two times measured. I am using a machine with 8 cores, so this is surprising. What have I done incorrectly in my code?
Note that I learned all of this from this.
http://pymotw.com/2/multiprocessing/communication.html
I understand that "joblib" might be more convenient for an example like this, but the ultimate thing that this needs to be applied to does not work with "joblib".
Your job seems the computation of a single cos value. This is going to be basically unnoticeable compared to the time of communicating with the slave.
Try making 5 computations of 1000000 cos values and you should see them going in parallel.
First, you wrote :
timetaken2 = timetaken = endtime2 - starttime2
So it is normal if you have the same times displayed. But this is not the important part.
I ran your code on my computer (i7, 4 cores), and I get :
('The time taken with multiple processes:', 14.95710802078247)
('The time taken the usual way:', 6.465447902679443)
The multiprocessed loop is slower than doing the for loop. Why?
The multiprocessing module can use multiple processes, but still has to work with the Python Global Interpreter Lock, wich means you can't share memory between your processes. So when you try to launch a Pool, you need to copy useful variables, process your calculation, and retrieve the result. This cost you a little time for every process, and makes you less effective.
But this happens because you do a very small computation : multiprocessing is only useful for larger calculation, when the memory copying and results retrieving is cheaper (in time) than the calculation.
I tried with following tester, which is much more expensive, on 2000 runs:
def expenser_tester(num):
A=np.random.rand(10*num) # creation of a random Array 1D
for k in range(0,len(A)-1): # some useless but costly operation
A[k+1]=A[k]*A[k+1]
return A
('The time taken with multiple processes:', 4.030329942703247)
('The time taken the usual way:', 8.180987119674683)
You can see that on an expensive calculation, it is more efficient with the multiprocessing, even if you don't always have what you could expect (I could have a x4 speedup, but I only got x2)
Keep in mind that Pool has to duplicate every bit of memory used in calculation, so it may be memory expensive.
If you really want to improve a small calculation like your example, make it big by grouping and sending a list of variable to the pool instead of one variable by process.
You should also know that numpy and scipy have a lot of expensive function written in C/Fortran and already parallelized, so you can't do anything much to speed them.
If the problem is cpu bounded then you should see the required speed-up (if the operation is long enough and overhead is not significant). But when multiprocessing (because memory is not shared between processes) it's easier to have a memory bound problem.