Pooled processes are slower when compared to non-pool - python

I am a newbie to Python and I am trying to use multiprocessing for one of my applications.
I actually have a very simple multiplication program, and I was trying to asynchronously generate parallel processes to calculate the squares of a range of numbers. When I do this without pooling, the time is at least twice, and sometimes even 4 times, faster. I am not sure what the reason for this behavior could be.
I am using python 2.7.1
Non-Pool.py
#!/usr/bin/python
import time
def f(x):
    return x*x
st = time.time()
t = 10000000
f(t)
map(f, range(t))
et = time.time()
tt = (str((et-st)%60)+'--'+str((et-st/60)))
print tt
Pool.py
#!/usr/bin/python
from multiprocessing import Pool
import time
def f(x):
    return x*x

st = time.time()
t = 10000000

if __name__ == '__main__':
    pool = Pool(processes=4)            # start 4 worker processes
    result = pool.apply_async(f, [t])   # evaluate f(t) asynchronously
    result.get(timeout=1)               # block (up to 1 s) for that single result
    pool.map(f, range(t))               # square the whole range in the pool
    et = time.time()
    tt = (str((et-st)%60)+'--'+str((et-st/60)))
    print tt
    exit(0)
Execution Times: (Format >> minutes--seconds)
Macha-MacBook-Pro:Downloads me$ ./nonpool.py
2.03456997871--1352551406.28
Macha-MacBook-Pro:Downloads me$ ./pool.py
4.69528508186--1352551417.28

You might check related answers, e.g., python prime crunching: processing pool is slower? -- the overhead of setting up a processing pool is high, but so is sending and receiving single integers in arguments and results.
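For what it's worth, here is a minimal sketch of the batching idea (my illustration built on the question's numbers, not code from the linked answer): each worker receives a whole sub-range and sends back a single partial sum, so only a handful of small objects ever cross the process boundary. The helper name square_sum and the chunking scheme are made up for this example.

#!/usr/bin/python
from multiprocessing import Pool
import time

def square_sum(bounds):
    # All per-element work happens inside the worker; only one number
    # is pickled back to the parent instead of millions of results.
    lo, hi = bounds
    total = 0
    for x in range(lo, hi):
        total += x * x
    return total

if __name__ == '__main__':
    t = 10000000
    workers = 4
    step = t // workers
    chunks = [(lo, min(lo + step, t)) for lo in range(0, t, step)]

    st = time.time()
    pool = Pool(processes=workers)
    partial_sums = pool.map(square_sum, chunks)   # one task per worker
    pool.close()
    pool.join()
    et = time.time()
    print(et - st)

With four tasks instead of ten million, the per-call pickling overhead stops dominating; whether the pool actually wins still depends on how expensive the per-element work is.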

Multiprocessing in Python: Parallelize a for loop to fill a Numpy array

I've been reading threads like this one, but none of them seems to work for my case. I'm trying to parallelize the following toy example to fill a Numpy array inside a for loop using multiprocessing in Python:
import numpy as np
from multiprocessing import Pool
import time

def func1(x, y=1):
    return x**2 + y

def func2(n, parallel=False):
    my_array = np.zeros((n))
    # Parallelized version:
    if parallel:
        pool = Pool(processes=6)
        for idx, val in enumerate(range(1, n+1)):
            result = pool.apply_async(func1, [val])
            my_array[idx] = result.get()
        pool.close()
    # Not parallelized version:
    else:
        for i in range(1, n+1):
            my_array[i-1] = func1(i)
    return my_array

def main():
    start = time.time()
    my_array = func2(60000)
    end = time.time()
    print(my_array)
    print("Normal time: {}\n".format(end-start))

    start_parallel = time.time()
    my_array_parallelized = func2(60000, parallel=True)
    end_parallel = time.time()
    print(my_array_parallelized)
    print("Time based on multiprocessing: {}".format(end_parallel-start_parallel))

if __name__ == '__main__':
    main()
The multiprocessing-based code seems to work and gives the right results. However, it takes far longer than the non-parallelized version. Here is the output of both versions:
[2.00000e+00 5.00000e+00 1.00000e+01 ... 3.59976e+09 3.59988e+09
3.60000e+09]
Normal time: 0.01605963706970215
[2.00000e+00 5.00000e+00 1.00000e+01 ... 3.59976e+09 3.59988e+09
3.60000e+09]
Time based on multiprocessing: 2.8775112628936768
My intuition tells me that there should be a better way of capturing results from pool.apply_async(). What am I doing wrong? What is the most efficient way to accomplish this? Thanks.
Creating processes is expensive. On my machine it takes at least several hundred microseconds per process created. Moreover, the multiprocessing module copies the data to be computed between processes and then gathers the results from the process pool. This inter-process communication is very slow too. The problem is that your computation is trivial and can be done very quickly, likely much faster than all the introduced overhead. The multiprocessing module is only useful when you are moving quite small datasets and performing intensive computation on them (relative to the amount of data transferred).
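To put a rough, machine-dependent number on that claim, one quick sketch is to time the creation of a few processes that do nothing at all (the helper name noop is made up here):

import time
from multiprocessing import Process

def noop():
    pass

if __name__ == '__main__':
    n = 20
    start = time.time()
    procs = [Process(target=noop) for _ in range(n)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # The exact figure varies widely between fork- and spawn-based
    # platforms, but it is always far more than computing x**2 + y.
    print("average per process:", (time.time() - start) / n)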
Fortunately, when it comes to numerical computations using Numpy, there is a simple and fast way to parallelize your application: the Numba JIT. Numba can parallelize code if you explicitly use parallel constructs (parallel=True and prange). It uses threads rather than heavyweight processes, and the threads work in shared memory. Numba can bypass the GIL as long as your code works on native types and Numpy arrays rather than pure Python dynamic objects (lists, big integers, classes, etc.). Here is an example:
import numpy as np
import numba as nb
import time

@nb.njit
def func1(x, y=1):
    return x**2 + y

@nb.njit('float64[:](int64)', parallel=True)
def func2(n):
    my_array = np.zeros(n)
    for i in nb.prange(1, n+1):
        my_array[i-1] = func1(i)
    return my_array

def main():
    start = time.time()
    my_array = func2(60000)
    end = time.time()
    print(my_array)
    print("Numba time: {}\n".format(end-start))

if __name__ == '__main__':
    main()
Because Numba compiles the code at runtime, it is able to fully optimize the loop down to a no-op, resulting in a time close to 0 seconds in this case.
Here is the solution proposed by @thisisalsomypassword that improves my initial proposal. That is, collect the AsyncResult objects in a list within the loop, and then call AsyncResult.get() on each result object after all processes have started:
import numpy as np
from multiprocessing import Pool
import time

def func1(x, y=1):
    time.sleep(0.1)
    return x**2 + y

def func2(n, parallel=False):
    my_array = np.zeros((n))
    # Parallelized version:
    if parallel:
        pool = Pool(processes=6)
        ####### HERE COMES THE CHANGE #######
        results = [pool.apply_async(func1, [val]) for val in range(1, n+1)]
        for idx, val in enumerate(results):
            my_array[idx] = val.get()
        #######
        pool.close()
    # Not parallelized version:
    else:
        for i in range(1, n+1):
            my_array[i-1] = func1(i)
    return my_array

def main():
    start = time.time()
    my_array = func2(600)
    end = time.time()
    print(my_array)
    print("Normal time: {}\n".format(end-start))

    start_parallel = time.time()
    my_array_parallelized = func2(600, parallel=True)
    end_parallel = time.time()
    print(my_array_parallelized)
    print("Time based on multiprocessing: {}".format(end_parallel-start_parallel))

if __name__ == '__main__':
    main()
Now it works. Time is reduced considerably with Multiprocessing:
Normal time: 60.107836008071
Time based on multiprocessing: 10.049324989318848
time.sleep(0.1) was added to func1 to offset the fact that the original task is too trivial for parallelism to pay off.
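As a side note (my addition, not part of the accepted fix): when every task is the same function applied over a range, pool.map expresses the same pattern more compactly, preserves result order, and handles batching internally, so the AsyncResult bookkeeping is not needed at all. A sketch, assuming the same func1 with the sleep:

import numpy as np
from multiprocessing import Pool
import time

def func1(x, y=1):
    time.sleep(0.1)
    return x**2 + y

if __name__ == '__main__':
    n = 600
    with Pool(processes=6) as pool:   # context manager needs Python 3.3+
        my_array = np.array(pool.map(func1, range(1, n + 1)))
    print(my_array[:5])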

why doesn't multiprocessing use all my cores

So I made a program that calculates primes to test the difference between using multithreading and just using a single thread. I read that multiprocessing bypasses the GIL, so I expected a decent performance boost.
So here we have my code to test it:
def prime(n):
    if n == 2:
        return n
    if n & 1 == 0:
        return None
    d = 3
    while d * d <= n:
        if n % d == 0:
            return None
        d = d + 2
    return n

loop = range(2, 1000000)
chunks = range(1, 1000000, 1000)

def chunker(chunk):
    ret = []
    r2 = chunk + 1000
    r1 = chunk
    for k in range(r1, r2):
        ret.append(prime(k))
    return ret

from multiprocessing import cpu_count
from multiprocessing.dummy import Pool
from time import time as t

pool = Pool(12)
start = t()
results = pool.map(prime, loop)
print(t() - start)
pool.close()

filtered = filter(lambda score: score != None, results)

new = []
start = t()
for i in loop:
    new.append(prime(i))
print(t()-start)

pool = Pool(12)
start = t()
results = pool.map_async(chunker, chunks).get()
print(t() - start)
pool.close()
I executed the program and this where the times:
multi processing without chunks:
4.953783750534058
single thread:
5.067057371139526
multiprocessing with chunks:
5.041667222976685
Maybe you already noticed, but multiprocessing isn't that much faster. I have a 6-core, 12-thread AMD Ryzen CPU, so I expected that if I could use all those threads, I would at least double the performance. But no. If I look in Task Manager, the average CPU usage with multiprocessing is 12%, while the single-threaded version uses around 10%.
So what is going on? Did I do something wrong? Or does being able to bypass the GIL not mean being able to use more cores?
If I can't use more cores with multiprocessing, how can I do it then?
from multiprocessing.dummy import Pool
from time import time as t
pool = Pool(12)
From the documentation:
multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module.
In other words, you're still using threads, not processes.
To use processes, do from multiprocessing import Pool instead.
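Under that fix, the tail of the question's script might look like the sketch below. It reuses chunker and chunks exactly as defined in the question; the __main__ guard matters on spawn-based platforms (Windows, and recent Python builds on macOS), where worker processes re-import the module.

from multiprocessing import Pool, cpu_count
from time import time as t

if __name__ == '__main__':
    pool = Pool(cpu_count())      # real worker processes; 12 on this CPU
    start = t()
    results = pool.map_async(chunker, chunks).get()
    print(t() - start)
    pool.close()
    pool.join()

With real processes and chunked tasks, the prime sieve is CPU-bound pure Python, so all cores can be busy at once.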

Python Multiprocessing Query

I am learning about python's multiprocessing module. I want to make my code use all my CPU resources. This is the code I wrote:
from multiprocessing import Process
import time

def work():
    for i in range(1000):
        x = 5
        y = 10
        z = x + y

if __name__ == '__main__':
    start1 = time.time()
    for i in range(100):
        p = Process(target=work)
        p.start()
        p.join()
    end1 = time.time()

    start = time.time()
    for i in range(100):
        work()
    end = time.time()

    print(f'With Parallel {end1-start1}')
    print(f'Without Parallel {end-start}')
The output I get is this:
With Parallel 0.8802454471588135
Without Parallel 0.00039649009704589844
I tried experimenting with different range values in the for loops, and with using only a print statement in the work function, but every time the non-parallel version runs faster. Is there something I am missing?
Thanks in advance!
Your benchmark method is problematic:
for i in range(100):
    p = Process(target=work)
    p.start()
    p.join()
I guess you want to run 100 processes in parallel, but Process.join() blocks until the process exits, so you effectively run them serially. Besides, running more busy processes than you have CPU cores leads to high CPU contention, which is a performance penalty. And as a comment pointed out, your work() function is too simple compared to the overhead of process creation.
A better version:
import multiprocessing
import time

def work():
    for i in range(2000000):
        pow(i, 10)

n_processes = multiprocessing.cpu_count()  # 8
total_runs = n_processes * 4

ps = []
n = total_runs
start1 = time.time()
while n:
    # keep the number of live processes at or below the core count
    ps = [p for p in ps if p.is_alive()]
    if len(ps) < n_processes:
        p = multiprocessing.Process(target=work)
        p.start()
        ps.append(p)
        n = n - 1
    else:
        time.sleep(0.01)

# wait for all processes to finish
while any(p.is_alive() for p in ps):
    time.sleep(0.01)
end1 = time.time()

start = time.time()
for i in range(total_runs):
    work()
end = time.time()

print(f'With Parallel {end1-start1:.4f}s')
print(f'Without Parallel {end-start:.4f}s')
print(f'Acceleration factor {(end-start)/(end1-start1):.2f}')
result:
With Parallel 4.2835s
Without Parallel 33.0244s
Acceleration factor 7.71
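As a hedged alternative to the manual is_alive() bookkeeping above: a Pool of cpu_count() workers enforces the same concurrency limit and distributes the total_runs calls for you. The dummy parameter on work exists only so the function can be fed to pool.map; this is a sketch, not the answer's code.

import multiprocessing
import time

def work(_=None):
    # Dummy parameter only so the function can be used with pool.map.
    for i in range(2000000):
        pow(i, 10)

if __name__ == '__main__':
    n_processes = multiprocessing.cpu_count()
    total_runs = n_processes * 4

    start = time.time()
    with multiprocessing.Pool(n_processes) as pool:
        pool.map(work, range(total_runs))
    end = time.time()
    print(f'With Pool {end - start:.4f}s')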

Python: Parallel Processing in Joblib Makes the Code Run Even Slower

I want to integrate parallel processing to make my for loops run faster.
However, I noticed that it has just made my code run slower. See the example below, where I am using joblib with a simple function on a list of random integers. Notice that it runs faster without the parallel processing than with it.
Any insight as to what is happening?
import random
import time
from joblib import Parallel, delayed

def f(x):
    return x**x

if __name__ == '__main__':
    s = [random.randint(0, 100) for _ in range(0, 10000)]

    # without parallel processing
    t0 = time.time()
    out1 = [f(x) for x in s]
    t1 = time.time()
    print("without parallel processing: ", t1 - t0)

    # with parallel processing
    t0 = time.time()
    out2 = Parallel(n_jobs=8, batch_size=len(s), backend="threading")(delayed(f)(x) for x in s)
    t1 = time.time()
    print("with parallel processing: ", t1 - t0)
I am getting the following output:
without parallel processing: 0.0070569515228271484
with parallel processing: 0.10714387893676758
The parameter batch_size=len(s) effectively says: give each worker a batch of len(s) tasks. This means you create 8 threads but then hand the entire workload to one of them.
Also, you might want to increase the workload to see a measurable advantage. I prefer to use time.sleep delays:
def f(x):
    time.sleep(0.001)
    return x**x

out2 = Parallel(n_jobs=8,
                # batch_size=len(s),
                backend="threading")(delayed(f)(x) for x in s)
without parallel processing: 11.562264442443848
with parallel processing: 1.412865400314331
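For genuinely CPU-bound functions (no sleep), the threading backend stays limited by the GIL; joblib's default process-based "loky" backend can then use several cores, provided each task does enough work to amortize the pickling overhead. A sketch with a deliberately heavier, made-up workload:

import random
import time
from joblib import Parallel, delayed

def f(x):
    # Heavier, made-up CPU-bound workload so the per-task overhead
    # of a process backend is worth paying.
    return sum(i * i for i in range(x * 1000))

if __name__ == '__main__':
    s = [random.randint(0, 100) for _ in range(10000)]

    t0 = time.time()
    out1 = [f(x) for x in s]
    print("without parallel processing:", time.time() - t0)

    t0 = time.time()
    out2 = Parallel(n_jobs=8, backend="loky")(delayed(f)(x) for x in s)
    print("with loky backend:", time.time() - t0)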

Python array sum vs MATLAB

I'm slowly switching to Python and I wanted to make a simple test for comparing the performance of a simple array summation. I generate a random 1000x1000 array and add one to each of the values in this array.
Here is my script in Python:
import time
import numpy
from numpy.random import random

def testAddOne(data):
    """
    Test addOne
    """
    return data + 1

i = 1000
data = random((i,i))

start = time.clock()
for x in xrange(1000):
    testAddOne(data)
stop = time.clock()

print stop - start
And my function in MATLAB:
function test
%parameter declaration
c=rand(1000);
tic
for t = 1:1000
testAddOne(c);
end
fprintf('Structure: \n')
toc
end
function testAddOne(c)
c = c + 1;
end
The Python script takes 2.77-2.79 seconds, about the same as the MATLAB function (I'm actually quite impressed by Numpy!). What would I have to change in my Python script to use multithreading? I can't do it in MATLAB since I don't have the toolbox.
Multithreading in Python is only useful for situations where threads get blocked, e.g. waiting on input, which is not the case here (see the answers to this question for more details). However, multiprocessing is easy to do in Python. Multiprocessing in general is covered here.
A program taking a similar approach to your example is below:
import time
import numpy
from numpy.random import random
from multiprocessing import Process

def testAddOne(data):
    return data + 1

def testAddN(data, N):
    # print "testAddN", N
    for x in xrange(N):
        testAddOne(data)

if __name__ == '__main__':
    matrix_size = 1000
    num_adds = 10000
    num_processes = 4

    data = random((matrix_size, matrix_size))

    start = time.clock()
    if num_processes > 1:
        processes = [Process(target=testAddN, args=(data, num_adds/num_processes))
                     for i in range(num_processes)]
        for p in processes:
            p.start()
        for p in processes:
            p.join()
    else:
        testAddN(data, num_adds)
    stop = time.clock()

    print "Elapsed", stop - start
A more useful example using a pool of worker processes to successively add 1 to different matrices is below.
import time
import numpy
from numpy.random import random
from multiprocessing import Pool

def testAddOne(data):
    return data + 1

def testAddN(dataN):
    data, N = dataN
    for x in xrange(N):
        data = testAddOne(data)
    return data

if __name__ == '__main__':
    num_matrices = 4
    matrix_size = 1000
    num_adds_per_matrix = 2500
    num_processes = 4

    inputs = [(random((matrix_size, matrix_size)), num_adds_per_matrix)
              for i in range(num_matrices)]
    #print inputs # test using, e.g., matrix_size = 2

    start = time.clock()
    if num_processes > 1:
        proc_pool = Pool(processes=num_processes)
        outputs = proc_pool.map(testAddN, inputs)
    else:
        outputs = map(testAddN, inputs)
    stop = time.clock()

    #print outputs # test using, e.g., matrix_size = 2
    print "Elapsed", stop - start
In this case the code in testAddN actually does something with the result of calling testAddOne. And you can uncomment the print statements to check that some useful work is being done.
In both cases I've changed the total number of additions to 10000; with fewer additions the cost of starting up processes becomes more significant (you can experiment with the parameters, including num_processes). On my machine, compared to running everything in the same process with num_processes=1, I got just under a 2x speedup by spawning four processes with num_processes=4.
