Several questions about multiprocessing.Pool - python

I recently started learning multiprocessing in Python, and I have some questions about it. The following code shows my example:
import multiprocessing
from time import *

def func(n):
    for i in range(100):
        print(i, "/ 100")
        for j in range(100000):
            a = j*i
            b = j*i/2

if __name__ == '__main__':
    # Test with multiprocessing
    pool = multiprocessing.Pool(processes=4)
    t1 = clock()
    pool.map(func, range(10))
    pool.close()
    t2 = clock()
    print(t2-t1)

    # Test without multiprocessing
    func(range(10))
    t3 = clock()
    print(t3-t2)
Does this code use the four cores of the CPU, or did I make a mistake?
Why is the runtime without multiprocessing so much faster? Is there a mistake?
Why does the print command not work while using multiprocessing?

Yes, it does run four processes at a time from your pool. However, your multiprocessing example runs func ten times, whereas the plain call runs it only once. In addition, starting processes has some runtime overhead. These two things probably account for the difference in run times.
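As a rough illustration (a sketch, not the asker's code), a fairer timing would call func the same number of times on both sides and use time.perf_counter rather than the deprecated time.clock:

import multiprocessing
import time

def func(n):
    for i in range(100):
        for j in range(100000):
            a = j*i
            b = j*i/2

if __name__ == '__main__':
    with multiprocessing.Pool(processes=4) as pool:
        t1 = time.perf_counter()
        pool.map(func, range(10))   # ten calls spread over four worker processes
        t2 = time.perf_counter()
    print("pool:", t2 - t1)

    t3 = time.perf_counter()
    for n in range(10):             # the same ten calls, one after another
        func(n)
    t4 = time.perf_counter()
    print("sequential:", t4 - t3)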
I think a simpler example is instructive. func now sleeps for five seconds, then prints out its input n, along with the time.
import multiprocessing
import time

def func(n):
    time.sleep(5)
    print([n, time.time()])

if __name__ == '__main__':
    # Test with multiprocessing
    print("With multiprocessing")
    pool = multiprocessing.Pool(processes=4)
    pool.map(func, range(5))
    pool.close()

    # Test without multiprocessing
    print("Without multiprocessing")
    func(1)
pool.map(func, range(5)) runs func(0), func(1), ..., func(4).
This outputs
With multiprocessing
[2, 1480778015.3355303]
[3, 1480778015.3355303]
[1, 1480778015.3355303]
[0, 1480778015.3355303]
[4, 1480778020.3495753]
Without multiprocessing
[1, 1480778025.3653867]
Note that the first four are output at the same time, and not strictly in order. The fifth (n == 4) is output five seconds later, which makes sense: the pool has four processes, so the fifth task could only start once one of the first four had finished.
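To confirm that reading, here is a quick variation (a sketch only): with five worker processes instead of four, each task gets its own worker, so all five print at roughly the same time.

import multiprocessing
import time

def func(n):
    time.sleep(5)
    print([n, time.time()])

if __name__ == '__main__':
    # One worker per task, so no task has to wait for a free process.
    with multiprocessing.Pool(processes=5) as pool:
        pool.map(func, range(5))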

Related

Multithreaded Python Program faster than Single Threaded program for CPU bound task

EDIT: It turns out this odd behaviour was happening only with Python inside my WSL Ubuntu install. Otherwise, the sequential version does run faster than the multithreaded one.
I understand that, in CPython, multiple threads generally just context-switch on the same CPU core rather than using several cores, unlike multiprocessing, where several Python interpreter processes are started.
I know this makes multithreading useful for I/O-bound tasks if done right; CPU-bound tasks, however, should actually be slower with multithreading. So I experimented with three code snippets, each doing some CPU-bound calculations:
Example 1 : Runs tasks in sequence (single thread)
Example 2 : Runs each task in different thread (Multithreaded)
Example 3 : Runs each task in separate processes (Multi-processed)
To my surprise, even though the task is CPU-bound, Example 2, which uses multiple threads, executes faster (about 1.5 s on average) than Example 1, which uses a single thread (about 2.2 s on average). Example 3 runs fastest, as expected (about 1 s on average).
I don't know what I am doing wrong.
Example 1 : Run tasks Sequentially
import time
import math

nums = [8, 7, 8, 5, 8]

def some_computation(n):
    counter = 0
    for i in range(int(math.pow(n, n))):
        counter += 1

if __name__ == '__main__':
    start = time.time()
    for i in nums:
        some_computation(i)
    end = time.time()
    print("Total time of program execution : ", round(end-start, 4))
Example 2 : Run tasks with Multithreading
import threading
import time
import math

nums = [8, 7, 8, 5, 8]

def some_computation(n):
    counter = 0
    for i in range(int(math.pow(n, n))):
        counter += 1

if __name__ == '__main__':
    start = time.time()
    threads = []
    for i in nums:
        x = threading.Thread(target=some_computation, args=(i,))
        threads.append(x)
        x.start()
    for t in threads:
        t.join()
    end = time.time()
    print("Total time of program execution : ", round(end-start, 4))
Example 3 : Run tasks in parallel with multiprocessing module
from multiprocessing import Pool
import time
import math

nums = [8, 7, 8, 5, 8]

def some_computation(n):
    counter = 0
    for i in range(int(math.pow(n, n))):
        counter += 1

if __name__ == '__main__':
    start = time.time()
    pool = Pool(processes=3)
    for i in nums:
        pool.apply_async(some_computation, [i])
    pool.close()
    pool.join()
    end = time.time()
    print("Total time of program execution : ", round(end-start, 4))
It turns out this was happening only in the Ubuntu install I have under Windows Subsystem for Linux. My original snippets run as expected in a native Windows or Ubuntu Python environment, but not in WSL, i.e. sequential execution runs faster than the multithreaded version there. Thanks to Vlad for double-checking things on your end.
As stated in my comment, it's a question of what the function is actually doing.
If we make the nums list longer (i.e., there will be more concurrent threads/processes) and also adjust how the loop range is calculated, then we see this:
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

nums = [8, 7, 8, 5, 8, 8, 5, 4, 8, 7, 7, 8, 8, 7, 8, 8, 8]

def some_computation(n):
    counter = 0
    for _ in range(n * 1_000_000):
        counter += 1
    return counter

def sequential():
    for n in nums:
        some_computation(n)

def threaded():
    with ThreadPoolExecutor() as executor:
        executor.map(some_computation, nums)

def pooled():
    with ProcessPoolExecutor() as executor:
        executor.map(some_computation, nums)

if __name__ == '__main__':
    for func in sequential, threaded, pooled:
        start = time.perf_counter()
        func()
        end = time.perf_counter()
        print(func.__name__, f'{end-start:.4f}')
Output:
sequential 4.8998
threaded 5.1257
pooled 0.7760
This indicates that the complexity of some_computation() determines how the system behaves. With this code and its adjusted parameters, we see that threading is slower than running sequentially (as one would typically expect) and, of course, multiprocessing is significantly faster.

Does apply_async always assign tasks to processes as they become available?

I'm using Python's multiprocessing module for some analyses. Specifically, I use a pool and the apply_async method to compute results for a large number of inputs. The time required to compute a result varies drastically with the input, which got me wondering whether the pool will either A) immediately assign a process to each input when I call apply_async, or B) assign inputs to processes as they become ready. My worry is that A) would imply that one of the pool's processes might be assigned several of the 'heavy' inputs and so take longer than necessary to complete.
I tried running the code below and got similar run times in each iteration (< 2 seconds), indicating that B) is the case, at least here. Is that what should always be expected?
import multiprocessing as mp
import numpy as np
import time

def foo(x):
    if x % 4 == 0:
        time.sleep(1)   # Simulating a large workload
    else:
        time.sleep(.1)  # Simulating a small workload
    return 2*x

if __name__ == '__main__':
    for _ in range(20):
        with mp.Pool(processes=4) as pool:
            jobs = []
            now = time.time()
            vals = list(range(1, 9))
            np.random.shuffle(vals)
            for x in vals:
                jobs.append(pool.apply_async(foo, args=(x,)))
            results = [j.get() for j in jobs]
            elapsed = time.time() - now
            print(elapsed)

Multiprocessing in Python 3.8

I'm trying to use multiprocessing in Python for the first time. I wrote a basic prime-searching program and I want to run it simultaneously on each core. The problem is: when the program does the multiprocessing, it runs not only the 'primesearch' function but also the beginning of the script. My expected output would be a list of prime numbers between 0 and a limit, but it prints "Enter a limit: " 16 times (I have 16 cores and 16 processes).
Here is my code:
import time
import os
from multiprocessing import Process

# Defining lists
primes = []
processes = []
l = [0]

limit = int(input('Enter a limit: '))

def primesearch(lower, upper):
    global primes
    for num in range(lower, upper):
        if num > 1:
            for i in range(2, num):
                if (num % i) == 0:
                    break
            else:
                primes.append(num)

# Start the clock
starter = time.perf_counter()

# Dividing data
step = limit // os.cpu_count()
for x in range(os.cpu_count()):
    l.append(step * (x+1))
l[-1] = limit

# Multiprocessing
for init in range(os.cpu_count()):
    processes.append(Process(target=primesearch, args=[l[init], l[init + 1]]))

for process in processes:
    process.start()

for process in processes:
    process.join()

# End clock
finish = time.perf_counter()

print(primes)
print(f'Finished in {round(finish-starter, 2)} second')
What could be the problem?
You are using Windows. If you read the Python documentation for multiprocessing, it will tell you that you should protect your main code with if __name__ == "__main__":. This is because on Windows each new process re-executes the complete main .py file from the top.
This is used in pretty much every example in the documentation, and is explained in the 'Programming guidelines' section at the end.
See https://docs.python.org/3/library/multiprocessing.html
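A minimal sketch of that restructuring (not the asker's exact code): everything that should run only once, including the input() prompt, the range splitting and the process start/join calls, goes under the guard.

import os
from multiprocessing import Process

def primesearch(lower, upper):
    ...  # the prime-checking loop from the question, unchanged

if __name__ == '__main__':
    # Runs only in the parent process, so the prompt appears exactly once.
    limit = int(input('Enter a limit: '))
    step = limit // os.cpu_count()
    bounds = [step * i for i in range(os.cpu_count())] + [limit]
    processes = [Process(target=primesearch, args=(bounds[i], bounds[i + 1]))
                 for i in range(os.cpu_count())]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

Collecting the primes back in the parent process is a separate issue, which the next answer addresses.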
Apart from the __main__ issue, your way of using primes as a global list doesn't work: under multiprocessing each process gets its own copy of the list, so the parent never sees the values the children append.
I imported Queue from multiprocessing and used primes = Queue() and
size = primes.qsize()
print([primes.get() for _ in range(size)])
primes.close()
in the main function, and primes.put(num) in your function. I don't know if it's the best way; it works for me, but if N > 12000 the console freezes. Also, in this case, using multiprocessing is actually slightly slower than a single process.
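A sketch of how that Queue version might fit together (a hypothetical layout, not the answerer's exact code). It passes the queue to each process explicitly and drains it before joining; joining while the queue's underlying pipe is still full can block, which may be related to the freeze mentioned above.

import os
from multiprocessing import Process, Queue

def primesearch(lower, upper, primes):
    for num in range(lower, upper):
        if num > 1:
            for i in range(2, num):
                if num % i == 0:
                    break
            else:
                primes.put(num)
    primes.put(None)  # sentinel: this worker is finished

if __name__ == '__main__':
    limit = int(input('Enter a limit: '))
    primes = Queue()

    step = limit // os.cpu_count()
    bounds = [step * i for i in range(os.cpu_count())] + [limit]
    processes = [Process(target=primesearch, args=(bounds[i], bounds[i + 1], primes))
                 for i in range(os.cpu_count())]
    for p in processes:
        p.start()

    # Drain the queue first: collect results until every worker has sent its
    # sentinel, and only then join the processes.
    results, done = [], 0
    while done < len(processes):
        item = primes.get()
        if item is None:
            done += 1
        else:
            results.append(item)
    for p in processes:
        p.join()

    print(sorted(results))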
If you aim for speed, you can test divisors only up to the square root of num, which greatly reduces the work. There are many more optimizations you can do. If you are testing huge numbers, you can use the Rabin-Miller primality test.
http://inventwithpython.com/cracking/chapter22.html
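For example, a trial-division check limited to the square root might look like this (a sketch; math.isqrt is available from Python 3.8):

import math

def is_prime(num):
    # Any composite num has a divisor no greater than sqrt(num),
    # so there is no need to test beyond that point.
    if num < 2:
        return False
    for i in range(2, math.isqrt(num) + 1):
        if num % i == 0:
            return False
    return True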

Why is the multiprocessed code in the given example taking more time than the usual sequential execution?

So I tried my hand at multiprocessing in Python: I executed a simple map function using both techniques and benchmarked them. However, the strange thing is that the code where I created a pool of 4 workers actually took more time. Here is my general code:
from datetime import datetime
from multiprocessing.dummy import Pool as ThreadPool

def square(x):
    return x*x

l = xrange(10000000)
map(square, l)
Executing this code took about 1.5 secs
Now I created a pool with 4 workers for multiprocessing using the following code:
from datetime import datetime
from multiprocessing.dummy import Pool as ThreadPool

def square(x):
    return x*x

l = xrange(10000000)

pool = ThreadPool(4)
results = pool.map(square, l)
pool.close()
pool.join()
Now when I benchmarked it, the multiprocessed code actually took more time (around 2.5 secs). Since it is a CPU-bound task, I am a bit confused as to why it took more time when it should actually have taken less. Any views on what I am doing wrong?
Edit: Instead of multiprocessing.dummy I used multiprocessing, and it was still slower. Even slower, in fact.
This is not surprising. Your test is a very poor one: threads pay off for long-running tasks, but the function you are testing returns almost instantly. Here the primary factor is the overhead of setting up the threads, which far outweighs any benefit you could possibly get from threading.
The problem is that you're using dummy, i.e. multithreading, not multiprocessing. Multithreading won't make CPU-bound tasks faster; it only helps I/O-bound tasks.
Try again with multiprocessing.Pool and you should have more success.
multiprocessing.dummy in Python is not utilising 100% cpu
Also, you need to chunk your input sequence into subsequences so that every process does enough calculation to make the overhead worth it.
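As an aside (a sketch, not part of the original answer): Pool.map can do this chunking for you through its chunksize argument, so each worker receives batches of inputs rather than one item at a time. Whether it actually beats the serial version still depends on how expensive the per-item work is.

from multiprocessing import Pool

def square(x):
    return x*x

if __name__ == '__main__':
    # chunksize controls how many items are sent to a worker in one batch.
    with Pool(4) as pool:
        results = pool.map(square, range(10000000), chunksize=100000)
    print(len(results), results[:3], results[-1])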
I put this into a solution. Note that you must create the multiprocessing pool only in the main execution path (under if __name__ == '__main__':), because Python starts worker subprocesses that each carry out part of the mapping.
import time
from multiprocessing import Pool as ThreadPool

def square(x):
    return x*x

def squareChunk(chunk):
    return [square(x) for x in chunk]

def chunks(l, n):
    n = max(1, n)
    return (l[i:i+n] for i in range(0, len(l), n))

def flatten(ll):
    lst = []
    for l in ll:
        lst.extend(l)
    return lst

if __name__ == '__main__':
    start_time = time.time()
    r1 = range(10000000)
    nProcesses = 100
    chunked = chunks(r1, int(len(r1)/nProcesses))  # split the original range into decent-sized chunks
    pool = ThreadPool(4)
    results = flatten(pool.map(squareChunk, chunked))
    pool.close()
    pool.join()
    print("--- Parallel map %g seconds ---" % (time.time() - start_time))

    start_time = time.time()
    r2 = range(10000000)
    squareChunk(r2)
    print("--- Serial map %g seconds ---" % (time.time() - start_time))
I get the following printout:
--- Parallel map 3.71226 seconds ---
--- Serial map 2.33983 seconds ---
Now the question is: shouldn't the parallel map be faster?
It could be that the chunking itself is costing us efficiency. But it could also be that the engine is more "warmed up" when the serial processing runs second. So I turned the measurements around:
import time
from multiprocessing import Pool as ThreadPool

def square(x):
    return x*x

def squareChunk(chunk):
    return [square(x) for x in chunk]

def chunks(l, n):
    n = max(1, n)
    return (l[i:i+n] for i in range(0, len(l), n))

def flatten(ll):
    lst = []
    for l in ll:
        lst.extend(l)
    return lst

if __name__ == '__main__':
    start_time = time.time()
    r2 = range(10000000)
    squareChunk(r2)
    print("--- Serial map %g seconds ---" % (time.time() - start_time))

    start_time = time.time()
    r1 = range(10000000)
    nProcesses = 100
    chunked = chunks(r1, int(len(r1)/nProcesses))  # split the original range into decent-sized chunks
    pool = ThreadPool(4)
    results = flatten(pool.map(squareChunk, chunked))
    pool.close()
    pool.join()
    print("--- Parallel map %g seconds ---" % (time.time() - start_time))
And now I got:
--- Serial map 4.176 seconds ---
--- Parallel map 2.68242 seconds ---
So it's not so clear which of the two is faster. But if you want to do multiprocessing, you have to consider whether the overhead of creating the workers is actually much smaller than the speedup you expect. You also run into cache locality issues, etc.
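One way to reduce that warm-up/order effect (a sketch, not from the original answer) is to measure each variant several times with timeit and take the best run:

import timeit
from multiprocessing import Pool

def square(x):
    return x*x

def serial():
    return [square(x) for x in range(1000000)]

def parallel():
    with Pool(4) as pool:
        return pool.map(square, range(1000000), chunksize=100000)

if __name__ == '__main__':
    # number=1 runs each callable once per repetition; min() takes the best of five runs.
    print("serial  :", min(timeit.repeat(serial, number=1, repeat=5)))
    print("parallel:", min(timeit.repeat(parallel, number=1, repeat=5)))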

Why is the CPU usage of my threaded python merge sort only 50% for each core?

Questions
Why is the CPU usage of my threaded python merge sort only 50% for each core?
Why does this result in "cannot create new thread" errors for relatively small inputs (100000)?
How can I make this more Pythonic? (It's very ugly.)
Linux/Ubuntu 12.04, 64-bit, i5 mobile (quad core)
from random import shuffle
from threading import *
import time
import Queue

q = Queue.LifoQueue()

def merge_sort(num, q):
    end = len(num)
    if end > 1:
        mid = end / 2
        thread = Thread(target=merge_sort, args=(num[0:mid], q,))
        thread1 = Thread(target=merge_sort, args=(num[mid:end], q,))
        thread.setDaemon(True)
        thread1.setDaemon(True)
        thread.start()
        thread1.start()
        thread.join()
        thread1.join()
        return merge(q.get(num), q.get(num))
    else:
        if end != 0:
            q.put(num)
        else:
            print "?????"
        return

def merge(num, num1):
    a = []
    while len(num) is not 0 and len(num1) is not 0:
        if num[0] < num1[0]:
            a.append(num.pop(0))
        else:
            a.append(num1.pop(0))
    if len(num) is not 0:
        for i in range(0, len(num)):
            a.append(num.pop(0))
    if len(num1) is not 0:
        for i in range(0, len(num1)):
            a.append(num1.pop(0))
    q.put(a)
    return a

def main():
    val = long(raw_input("Please enter the maximum value of the range:")) + 1
    start_time = time.time()
    numbers = xrange(0, val)
    shuffle(numbers)
    numbers = merge_sort(numbers[0:val], q)
    # print "Sorted list is: \n"
    # for number in numbers:
    #     print number
    print str(time.time() - start_time) + " seconds to run.\n"

if __name__ == "__main__":
    main()
For the 100000 input your code tries to create roughly 200,000 threads: each recursive call on a slice longer than one element spawns two more threads, and there are about n - 1 such calls. Python threads are real OS threads, so the 50% CPU load that you are seeing is probably the system being busy handling all those threads. On my system the error happens at around 32,000 threads.
Your code as written can't possibly work:
from random import shuffle
#XXX won't work
numbers = xrange(0, val)
shuffle(numbers)
xrange() is not a mutable sequence.
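A minimal fix for that part (a sketch in the question's Python 2 style) is to materialise the range into a list, which shuffle can then reorder in place:

from random import shuffle

val = 100000                     # example size; the question reads this from raw_input
numbers = list(xrange(0, val))   # a real, mutable list
shuffle(numbers)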
Note: the sorting takes much less time than the random shuffling of the array:
import numpy as np
numbers = np.random.permutation(10000000) # here spent most of the time
numbers.sort()
If you want to sort parts of the array using different threads, you can do it like this:
from multiprocessing.dummy import Pool # use threads
N = len(numbers)
Pool(2).map(lambda a: a.sort(), [numbers[:N//2], numbers[N//2:]])
a.sort() releases the GIL, so the code uses 2 CPUs.
If you include the time it takes to merge the sorted parts, it may be faster just to sort the whole array at once (numbers.sort()) in a single thread.
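A quick timing sketch of that trade-off (illustrative only; the merge step is left out, and it would only add to the threaded time):

import time
import numpy as np
from multiprocessing.dummy import Pool  # thread pool

N = 10000000
base = np.random.permutation(N)

a = base.copy()
t0 = time.time()
a.sort()                                 # single-threaded sort of the whole array
print("single sort :", time.time() - t0)

b = base.copy()
t0 = time.time()
# ndarray.sort() releases the GIL, so two threads can sort the halves in parallel.
Pool(2).map(lambda part: part.sort(), [b[:N//2], b[N//2:]])
print("two threads :", time.time() - t0)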
You may want to look into using Parallel Python, since by default CPython is restricted to one core by the Global Interpreter Lock (GIL). This is why CPython cannot perform truly concurrent CPU-bound operations. CPython is, however, still great at carrying out I/O-bound tasks.
There is a good article that describes the threading limitations of CPython here.
