I wrote a program that calculates primes to test the difference between using multithreading and using just a single thread. I read that multiprocessing bypasses the GIL, so I expected a decent performance boost.
Here is the code I used to test it:
def prime(n):
    if n == 2:
        return n
    if n & 1 == 0:
        return None
    d = 3
    while d * d <= n:
        if n % d == 0:
            return None
        d = d + 2
    return n
loop = range(2,1000000)
chunks = range(1,1000000,1000)
def chunker(chunk):
    ret = []
    r2 = chunk + 1000
    r1 = chunk
    for k in range(r1,r2):
        ret.append(prime(k))
    return ret
from multiprocessing import cpu_count
from multiprocessing.dummy import Pool
from time import time as t
pool = Pool(12)
start = t()
results = pool.map(prime, loop)
print(t() - start)
pool.close()
filtered = filter(lambda score: score != None, results)
new = []
start = t()
for i in loop:
    new.append(prime(i))
print(t()-start)
pool = Pool(12)
start = t()
results = pool.map_async(chunker, chunks).get()
print(t() - start)
pool.close()
I executed the program and these were the times:
multiprocessing without chunks:
4.953783750534058
single thread:
5.067057371139526
multiprocessing with chunks:
5.041667222976685
Maybe you already noticed, but multiprocessing isn't that much faster. I have a 6-core, 12-thread AMD Ryzen CPU, so I expected that if I could use all those threads, I would at least double the performance. But no. If I look in Task Manager, the average CPU usage with multiprocessing is 12%, while single-threaded uses around 10% of the CPU.
So what is going on? Did I do something wrong? Or does being able to bypass the GIL not mean being able to use more cores?
If I can't use more cores with multiprocessing, how can I do it then?
from multiprocessing.dummy import Pool
from time import time as t
pool = Pool(12)
From the documentation:
multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module.
In other words, you're still using threads, not processes.
To use processes, do from multiprocessing import Pool instead.
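A minimal sketch of the same test with real worker processes, assuming Python 3. The prime function follows the question (with an extra guard for n < 2); count_primes, the limit of 20000 and the chunksize of 1000 are illustrative choices, not from the original post. The `if __name__ == '__main__':` guard matters on platforms that start workers by spawning a fresh interpreter, such as Windows.

```python
from multiprocessing import Pool, cpu_count


def prime(n):
    """Return n if it is prime, otherwise None (logic from the question)."""
    if n == 2:
        return n
    if n < 2 or n & 1 == 0:  # the n < 2 guard is an addition
        return None
    d = 3
    while d * d <= n:
        if n % d == 0:
            return None
        d += 2
    return n


def count_primes(limit, processes=None, chunksize=1000):
    """Hypothetical helper: count primes in [2, limit) using worker processes."""
    with Pool(processes or cpu_count()) as pool:
        # chunksize batches many numbers into one message per task,
        # which keeps inter-process communication overhead low.
        results = pool.map(prime, range(2, limit), chunksize=chunksize)
    return sum(1 for r in results if r is not None)


if __name__ == '__main__':
    print(count_primes(20000))
```

With multiprocessing.Pool instead of multiprocessing.dummy.Pool, Task Manager should show several Python processes busy at once; how much faster it gets depends on the machine and on the chunk size.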
I'm trying to implement a task in parallel using concurrent.futures. Please find below a piece of code for it:
import os
import time
from concurrent.futures import ProcessPoolExecutor as PE
import concurrent.futures
# num CPUs
cpu_num = len(os.sched_getaffinity(0))
print("Number of cpu available : ",cpu_num)
# max_Worker = cpu_num
max_Worker = 1
# A fake input array
n=1000000
array = list(range(n))
results = []
# A fake function being applied to each element of array
def task(i):
    return i**2
x = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=max_Worker) as executor:
    features = {executor.submit(task, j) for j in array}
    # the real function is heavy and we need to be sure of completeness of each run
    for future in concurrent.futures.as_completed(features):
        results.append(future.result())
results = [future.result() for future in features]
y = time.time()
print('=========================================')
print(f"Train data preparation time (s): {(y-x)}")
print('=========================================')
And now my questions,
Although there is no error, is it correct/optimized?
While playing with the number of workers, there seems to be no improvement in speed (e.g., 1 vs. 16 makes no difference). What is the problem, and how can it be solved?
Thanks in advance,
See my comment to your question. To the overhead I mentioned in that comment you also need to add the overhead of just creating the process pool itself.
The following is a benchmark with several results. The first is a timing from just calling the worker function task 100000 times and creating a results list and printing out the last element of that list. It will become apparent why I have reduced the number of times I am calling task from 1000000 to 100000.
The next attempt is to use multiprocessing to accomplish the same thing using a ProcessPoolExecutor with the submit method and then processing the Future instances that are returned.
The next attempt is to instead use the map method with the default chunksize argument of 1 being used. It is important to understand this argument. With a chunksize value of 1, each element of the iterable that is passed to the map method is written individually to a queue of tasks as a chunk to be processed by the processes in the pool. When a pool process becomes idle and looks for work, it pulls the next chunk of tasks from the queue, processes each task in the chunk and then becomes idle again. When a lot of tasks are submitted via map, a chunksize value of 1 is inefficient. You would expect its performance to be equivalent to repeatedly issuing submit calls for each element of the iterable.
The next attempt specifies a chunksize value which approximates more or less the value that the map function used by the Pool class in the multiprocessing package would have used by default. As you can see, the improvement is dramatic, but still not an improvement over the non-multiprocessing case.
The final attempt uses the multiprocessing facility provided by package multiprocessing and its multiprocessing.pool.Pool class. The difference in this benchmark is that its map function uses a more intelligent default chunksize when no chunksize argument is specified.
import os
import time
from concurrent.futures import ProcessPoolExecutor as PE
from multiprocessing import Pool
# A fake function being applied to each element of array
def task(i):
    return i**2
# required for Windows:
if __name__ == '__main__':
    n=100000
    t1 = time.time()
    results = [task(i) for i in range(n)]
    print('Non-multiprocessing time:', time.time() - t1, results[-1])
    # num CPUs
    cpu_num = os.cpu_count()
    print("Number of CPUs available: ",cpu_num)
    t1 = time.time()
    with PE(max_workers=cpu_num) as executor:
        futures = [executor.submit(task, i) for i in range(n)]
        results = [future.result() for future in futures]
    print('Multiprocessing time using submit:', time.time() - t1, results[-1])
    t1 = time.time()
    with PE(max_workers=cpu_num) as executor:
        results = list(executor.map(task, range(n)))
    print('Multiprocessing time using map:', time.time() - t1, results[-1])
    t1 = time.time()
    chunksize = n // (4 * cpu_num)
    with PE(max_workers=cpu_num) as executor:
        results = list(executor.map(task, range(n), chunksize=chunksize))
    print(f'Multiprocessing time using map: {time.time() - t1}, chunksize: {chunksize}', results[-1])
    t1 = time.time()
    with Pool(cpu_num) as executor:
        results = executor.map(task, range(n))
    print('Multiprocessing time using Pool.map:', time.time() - t1, results[-1])
Prints:
Non-multiprocessing time: 0.027019739151000977 9999800001
Number of CPUs available: 8
Multiprocessing time using submit: 77.34723353385925 9999800001
Multiprocessing time using map: 79.52981925010681 9999800001
Multiprocessing time using map: 0.30500149726867676, chunksize: 3125 9999800001
Multiprocessing time using Pool.map: 0.2799997329711914 9999800001
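To make the effect of chunksize more concrete, here is a rough sketch of how an iterable gets grouped into chunks before being handed to the workers. compute_chunksize only approximates the default heuristic of Pool.map described above, and split_into_chunks is a hypothetical helper written for illustration; neither is the library's actual internal code.

```python
from itertools import islice


def compute_chunksize(iterable_size, pool_size):
    """Approximate the default chunksize heuristic of multiprocessing.Pool.map."""
    chunksize, remainder = divmod(iterable_size, pool_size * 4)
    if remainder:
        chunksize += 1
    return chunksize


def split_into_chunks(iterable, chunksize):
    """Yield successive tuples of at most `chunksize` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = tuple(islice(it, chunksize))
        if not chunk:
            return
        yield chunk


if __name__ == '__main__':
    n, workers = 100000, 8
    size = compute_chunksize(n, workers)
    chunks = list(split_into_chunks(range(n), size))
    # Each chunk is one message on the task queue instead of `size` messages.
    print(size, len(chunks))
```

With a chunksize of 1 the same 100000 elements would mean 100000 separate queue messages, which is consistent with the large gap between the two map timings above.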
Update
The following benchmarks use a version of task that is very CPU-intensive and show the benefit of multiprocessing. It would also seem that for this small iterable size (100), forcing a chunksize value of 1 for the Pool.map case (it would by default compute a chunksize value of 4) is slightly more performant.
import os
import time
from concurrent.futures import ProcessPoolExecutor as PE
from multiprocessing import Pool
# A fake function being applied to each element of array
def task(i):
    for _ in range(1_000_000):
        result = i ** 2
    return result
def compute_chunksize(iterable_size, pool_size):
    chunksize, remainder = divmod(iterable_size, pool_size * 4)
    if remainder:
        chunksize += 1
    return chunksize
# required for Windows:
if __name__ == '__main__':
    n = 100
    cpu_num = os.cpu_count()
    chunksize = compute_chunksize(n, cpu_num)
    t1 = time.time()
    results = [task(i) for i in range(n)]
    t2 = time.time()
    print('Non-multiprocessing time:', t2 - t1, results[-1])
    # num CPUs
    print("Number of CPUs available: ",cpu_num)
    t1 = time.time()
    with PE(max_workers=cpu_num) as executor:
        futures = [executor.submit(task, i) for i in range(n)]
        results = [future.result() for future in futures]
    t2 = time.time()
    print('Multiprocessing time using submit:', t2 - t1, results[-1])
    t1 = time.time()
    with PE(max_workers=cpu_num) as executor:
        results = list(executor.map(task, range(n)))
    t2 = time.time()
    print('Multiprocessing time using map:', t2 - t1, results[-1])
    t1 = time.time()
    with PE(max_workers=cpu_num) as executor:
        results = list(executor.map(task, range(n), chunksize=chunksize))
    t2 = time.time()
    print(f'Multiprocessing time using map: {t2 - t1}, chunksize: {chunksize}', results[-1])
    t1 = time.time()
    with Pool(cpu_num) as executor:
        results = executor.map(task, range(n))
    t2 = time.time()
    print('Multiprocessing time using Pool.map:', t2 - t1, results[-1])
    t1 = time.time()
    with Pool(cpu_num) as executor:
        results = executor.map(task, range(n), chunksize=1)
    t2 = time.time()
    print('Multiprocessing time using Pool.map (chunksize=1):', t2 - t1, results[-1])
Prints:
Non-multiprocessing time: 23.12758779525757 9801
Number of CPUs available: 8
Multiprocessing time using submit: 5.336004018783569 9801
Multiprocessing time using map: 5.364996671676636 9801
Multiprocessing time using map: 5.444890975952148, chunksize: 4 9801
Multiprocessing time using Pool.map: 5.400001287460327 9801
Multiprocessing time using Pool.map (chunksize=1): 4.698001146316528 9801
I've been reading threads like this one, but none of them seems to work for my case. I'm trying to parallelize the following toy example to fill a NumPy array inside a for loop using multiprocessing in Python:
import numpy as np
from multiprocessing import Pool
import time
def func1(x, y=1):
    return x**2 + y
def func2(n, parallel=False):
    my_array = np.zeros((n))
    # Parallelized version:
    if parallel:
        pool = Pool(processes=6)
        for idx, val in enumerate(range(1, n+1)):
            result = pool.apply_async(func1, [val])
            my_array[idx] = result.get()
        pool.close()
    # Not parallelized version:
    else:
        for i in range(1, n+1):
            my_array[i-1] = func1(i)
    return my_array
def main():
    start = time.time()
    my_array = func2(60000)
    end = time.time()
    print(my_array)
    print("Normal time: {}\n".format(end-start))
    start_parallel = time.time()
    my_array_parallelized = func2(60000, parallel=True)
    end_parallel = time.time()
    print(my_array_parallelized)
    print("Time based on multiprocessing: {}".format(end_parallel-start_parallel))
if __name__ == '__main__':
    main()
The lines in the code based on Multiprocessing seem to work and give you the right results. However, it takes far longer than the non parallelized version. Here is the output of both versions:
[2.00000e+00 5.00000e+00 1.00000e+01 ... 3.59976e+09 3.59988e+09
3.60000e+09]
Normal time: 0.01605963706970215
[2.00000e+00 5.00000e+00 1.00000e+01 ... 3.59976e+09 3.59988e+09
3.60000e+09]
Time based on multiprocessing: 2.8775112628936768
My intuition tells me that there should be a better way of capturing results from pool.apply_async(). What am I doing wrong? What is the most efficient way to accomplish this? Thanks.
Creating processes is expensive. On my machine it takes at least several hundred microseconds per process created. Moreover, the multiprocessing module copies the input data to the worker processes and then gathers the results back from the process pool. This inter-process communication is very slow too. The problem is that your computation is trivial and can be done very quickly, likely much faster than all the introduced overhead. The multiprocessing module is only useful when you are dealing with fairly small datasets and performing intensive computation (relative to the amount of data transferred).
Fortunately, when it comes to numerical computations using NumPy, there is a simple and fast way to parallelize your application: the Numba JIT. Numba can parallelize code if you explicitly use parallel constructs (parallel=True and prange). It uses threads working in shared memory rather than heavyweight processes. Numba can bypass the GIL if your code deals with native types and NumPy arrays instead of pure Python dynamic objects (lists, big integers, classes, etc.). Here is an example:
import numpy as np
import numba as nb
import time
@nb.njit
def func1(x, y=1):
    return x**2 + y
@nb.njit('float64[:](int64)', parallel=True)
def func2(n):
    my_array = np.zeros(n)
    for i in nb.prange(1, n+1):
        my_array[i-1] = func1(i)
    return my_array
def main():
    start = time.time()
    my_array = func2(60000)
    end = time.time()
    print(my_array)
    print("Numba time: {}\n".format(end-start))
if __name__ == '__main__':
    main()
Because Numba compiles the code at runtime, it is able to fully optimize the loop to a no-op resulting in a time close to 0 second in this case.
Here is the solution proposed by @thisisalsomypassword that improves my initial proposal. That is, "collecting the AsyncResult objects in a list within the loop and then calling AsyncResult.get() after all processes have started on each result object":
import numpy as np
from multiprocessing import Pool
import time
def func1(x, y=1):
    time.sleep(0.1)
    return x**2 + y
def func2(n, parallel=False):
    my_array = np.zeros((n))
    # Parallelized version:
    if parallel:
        pool = Pool(processes=6)
        ####### HERE COMES THE CHANGE #######
        results = [pool.apply_async(func1, [val]) for val in range(1, n+1)]
        for idx, val in enumerate(results):
            my_array[idx] = val.get()
        #######
        pool.close()
    # Not parallelized version:
    else:
        for i in range(1, n+1):
            my_array[i-1] = func1(i)
    return my_array
def main():
    start = time.time()
    my_array = func2(600)
    end = time.time()
    print(my_array)
    print("Normal time: {}\n".format(end-start))
    start_parallel = time.time()
    my_array_parallelized = func2(600, parallel=True)
    end_parallel = time.time()
    print(my_array_parallelized)
    print("Time based on multiprocessing: {}".format(end_parallel-start_parallel))
if __name__ == '__main__':
    main()
Now it works. Time is reduced considerably with Multiprocessing:
Normal time: 60.107836008071
Time based on multiprocessing: 10.049324989318848
time.sleep(0.1) was added in func1 to cancel out the effect of being a super trivial task.
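As a possible alternative to collecting AsyncResult objects by hand, a single Pool.map call returns the results in input order. The sketch below assumes Python 3 and uses plain lists so it has no dependencies; fill_serial and fill_parallel are hypothetical helper names, and time.sleep is left out of func1 to keep the example quick.

```python
from multiprocessing import Pool


def func1(x, y=1):
    return x**2 + y


def fill_serial(n):
    """Reference result computed in a single process."""
    return [func1(i) for i in range(1, n + 1)]


def fill_parallel(n, processes=6, chunksize=None):
    """Hypothetical helper: compute func1 over 1..n with one blocking map call."""
    with Pool(processes=processes) as pool:
        # map preserves input order, so values[i] corresponds to input i + 1.
        return pool.map(func1, range(1, n + 1), chunksize=chunksize)


if __name__ == '__main__':
    values = fill_parallel(600)
    print(values[:3], values[-1])
```

The returned list can then be copied into the NumPy array in one step, for example with my_array[:] = values. For a function as cheap as func1 this is still unlikely to beat the serial loop, for the overhead reasons given above.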
I am learning about python's multiprocessing module. I want to make my code use all my CPU resources. This is the code I wrote:
from multiprocessing import Process
import time
def work():
    for i in range(1000):
        x=5
        y=10
        z=x+y
if __name__ == '__main__':
    start1 = time.time()
    for i in range(100):
        p=Process(target=work)
        p.start()
        p.join()
    end1=time.time()
    start = time.time()
    for i in range(100):
        work()
    end=time.time()
    print(f'With Parallel {end1-start1}')
    print(f'Without Parallel {end-start}')
The output I get is this:
With Parallel 0.8802454471588135
Without Parallel 0.00039649009704589844
I tried experimenting with different range values in the for loops, or using only a print statement in the work function, but every time the version without parallelism runs faster. Is there something I am missing?
Thanks in advance!
Your benchmark method is problematic:
for i in range(100):
    p = Process(target=work)
    p.start()
    p.join()
I guess you want to run 100 processes in parallel, but Process.join() blocks until the process exits, so you effectively run them serially. Besides, running more busy processes than there are CPU cores leads to high CPU contention, which is a performance penalty. And as a comment pointed out, your work() function is too simple compared to the overhead of process creation.
A better version:
import multiprocessing
import time
def work():
    for i in range(2000000):
        pow(i, 10)
n_processes = multiprocessing.cpu_count() # 8
total_runs = n_processes * 4
ps = []
n = total_runs
start1 = time.time()
while n:
    # ensure processes number limit
    ps = [p for p in ps if p.is_alive()]
    if len(ps) < n_processes:
        p = multiprocessing.Process(target=work)
        p.start()
        ps.append(p)
        n = n-1
    else:
        time.sleep(0.01)
# wait for all processes to finish
while any(p.is_alive() for p in ps):
    time.sleep(0.01)
end1=time.time()
start = time.time()
for i in range(total_runs):
    work()
end=time.time()
print(f'With Parallel {end1-start1:.4f}s')
print(f'Without Parallel {end-start:.4f}s')
print(f'Acceleration factor {(end-start)/(end1-start1):.2f}')
result:
With Parallel 4.2835s
Without Parallel 33.0244s
Acceleration factor 7.71
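The manual bookkeeping above (polling is_alive() to cap the number of live processes) can also be delegated to a pool. The following is a sketch under a few assumptions: Python 3, a work() variant that takes an iteration count and returns a checksum so it can be verified, and a hypothetical run_parallel helper. It is not the code that produced the timings above.

```python
import multiprocessing


def work(iterations=200000):
    """CPU-bound stand-in for work(); returns a checksum of the squares."""
    total = 0
    for i in range(iterations):
        total += pow(i, 2)
    return total


def run_parallel(total_runs, n_processes=None, iterations=200000):
    """Hypothetical helper: run work() total_runs times on a fixed-size pool."""
    with multiprocessing.Pool(n_processes) as pool:
        # The pool never runs more than n_processes tasks at once,
        # so no manual is_alive() polling is needed.
        return pool.map(work, [iterations] * total_runs)


if __name__ == '__main__':
    results = run_parallel(multiprocessing.cpu_count() * 4)
    print(len(results))
```

A pool also reuses its worker processes across tasks, so the process creation cost is paid once per worker rather than once per run.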
I am trying to understand how multiprocessing works with Python. Here's my test code:
import numpy as np
import multiprocessing
import time
def worker(a):
    for i in range(len(a)):
        for j in arr2:
            a[i] = a[i]*j
    return len(a)
arr2 = np.random.rand(10000).tolist()
if __name__ == '__main__':
    multiprocessing.freeze_support()
    cores = multiprocessing.cpu_count()
    arr1 = np.random.rand(1000000).tolist()
    tmp = time.time()
    pool = multiprocessing.Pool(processes=cores)
    result = pool.map(worker, [arr1], chunksize=1000000/(cores-1))
    print "mp time", time.time()-tmp
I have 8 cores. It usually ends up with 7 processes using only ~3% of the CPU for about a second, while the last process uses ~1/8 of the CPU forever... (it has been running for about 15 minutes)
I understand that the interprocess communication usually bounds the complexity of parallel programming, but does it usually take this long? What else could cause the last process to take forever?
This thread: Python multiprocessing never joins seems to address a similar issue but it doesn't solve the problem with Pool.
It looks like you want to divide the work into chunks. You can use the range function to partition the data. On Linux, forked processes get a copy-on-write view of the parent memory, so you can just pass down the indexes you want to work on. On Windows, no such luck: you need to pass in each sublist. This program should do it:
import numpy as np
import multiprocessing
import time
import platform
def worker(a):
    if platform.system() == "Linux":
        # on Linux we passed in (start, end) indexes
        start, end = a
        a = arr1[start:end]
    for i in range(len(a)):
        for j in arr2:
            a[i] = a[i]*j
    return len(a)
arr2 = np.random.rand(10000).tolist()
if __name__ == '__main__':
    multiprocessing.freeze_support()
    cores = multiprocessing.cpu_count()
    arr1 = np.random.rand(1000000).tolist()
    tmp = time.time()
    pool = multiprocessing.Pool(processes=cores)
    chunk = (len(arr1)+cores-1)//cores
    # on Windows, pass the sublist, on linux just the indexes and let the
    # worker split from the view of parent memory space
    if platform.system() == "Linux":
        seq = [(i, i+chunk) for i in range(0, len(arr1), chunk)]
    else:
        seq = [arr1[i:i+chunk] for i in range(0, len(arr1), chunk)]
    result = pool.map(worker, seq, chunksize=1)
    print "mp time", time.time()-tmp
The key point is here:
pool.map iterates over the object you pass it, which is [arr1] in your program. Note that the object is [arr1], not arr1, which means the iterable you pass to pool.map has a length of only one.
I think the simplest solution is to replace [arr1] with arr1.
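A small sketch of why [arr1] produces only one task, and one way to get several. Note that worker in the question expects a list, so passing arr1 directly would hand it single floats; splitting arr1 into sublists keeps the worker unchanged. The split helper is hypothetical and a short integer list stands in for the real data.

```python
def split(seq, parts):
    """Split seq into at most `parts` contiguous sublists of near-equal size."""
    chunk = max(1, (len(seq) + parts - 1) // parts)
    return [seq[i:i + chunk] for i in range(0, len(seq), chunk)]


if __name__ == '__main__':
    arr1 = list(range(10))
    # A one-element list: map sees exactly one task, so one process does all the work.
    print(len([arr1]))
    # Several sublists: map sees one task per sublist.
    print(len(split(arr1, 4)))
```

The resulting list of sublists is what would be passed as the second argument to pool.map in place of [arr1].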
I am a newbie to Python and I am trying to use multiprocessing for one of my applications.
I have a very simple multiplication program and I was trying to asynchronously spawn parallel processes to calculate the multiplication of a range of numbers. When I do this without pooling, it is at least twice and sometimes even 4 times faster. I am not sure what the reason for this behavior could be.
I am using python 2.7.1
Non-Pool.py
#!/usr/bin/python
import time
def f(x):
    return x*x
st = time.time()
t = 10000000
f(t)
map(f, range(t))
et = time.time()
tt = (str((et-st)%60)+'--'+str((et-st/60)))
print tt
Pool.py
#!/usr/bin/python
from multiprocessing import Pool
import time
def f(x):
    return x*x
st = time.time()
t = 10000000
if __name__ == '__main__':
    pool = Pool(processes=4) # start 4 worker processes
    result = pool.apply_async(f, [t]) # evaluate f(t) asynchronously
    result.get(timeout=1) # wait up to 1 second for the result
    pool.map(f, range(t)) # square every number in range(t)
    et = time.time()
    tt = (str((et-st)%60)+'--'+str((et-st/60)))
    print tt
    exit(0)
Execution Times: (Format >> minutes--seconds)
Macha-MacBook-Pro:Downloads me$ ./nonpool.py
2.03456997871--1352551406.28
Macha-MacBook-Pro:Downloads me$ ./pool.py
4.69528508186--1352551417.28
You might check related answers, e.g., python prime crunching: processing pool is slower? -- the overhead of setting up a processing pool is high, but so is sending and receiving single integers in arguments and results.
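One way to reduce the cost of sending and receiving single integers is to send each worker a range and get back one value per range. The sketch below assumes Python 3 and assumes that only an aggregate (here the sum of squares) is needed rather than the full list of squares, which differs from the question's code; sum_squares and make_ranges are hypothetical helpers.

```python
from multiprocessing import Pool


def f(x):
    return x * x


def sum_squares(bounds):
    """Worker: receives (start, stop) and returns one number for the whole range."""
    start, stop = bounds
    return sum(f(x) for x in range(start, stop))


def make_ranges(total, parts):
    """Split range(total) into contiguous (start, stop) pairs, at most `parts` of them."""
    step = max(1, (total + parts - 1) // parts)
    return [(i, min(i + step, total)) for i in range(0, total, step)]


if __name__ == '__main__':
    with Pool(processes=4) as pool:
        # Four small tuples go out and four integers come back,
        # instead of one message per element.
        partial = pool.map(sum_squares, make_ranges(1000000, 4))
    print(sum(partial))
```

If the full list of squares is required, passing a larger chunksize to pool.map is the closer equivalent, although the results still have to be sent back to the parent process.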