I am running the following benchmark script on a Windows machine. I noticed that the order in which multiprocessed() is executed affects its performance. If I execute multiprocessed() first, it runs faster than simple() and multithreaded(); if I execute it last, it takes almost twice as long as multithreaded() and simple().
import random
from threading import Thread
from multiprocessing import Process
import time
size = 10000000 # Number of random numbers to add to list
threads = 8 # Number of threads to create
my_list = []
for i in range(0,threads):
    my_list.append([])
def func(count, mylist):
    for i in range(count):
        mylist.append(random.random())
processes = []
for i in range(0, threads):
    p = Process(target=func,args=(size,my_list[i]))
    processes.append(p)
def multithreaded():
    jobs = []
    for i in range(0, threads):
        thread = Thread(target=func,args=(size,my_list[i]))
        jobs.append(thread)
    # Start the threads
    for j in jobs:
        j.start()
    # Ensure all of the threads have finished
    for j in jobs:
        j.join()
def simple():
    for i in range(0, threads):
        func(size,my_list[i])
def multiprocessed():
    global processes
    # Start the processes
    for p in processes:
        p.start()
    # Ensure all processes have finished execution
    for p in processes:
        p.join()
if __name__ == "__main__":
    start = time.time()
    multiprocessed()
    print("elasped time:{}".format(time.time()-start))
    start = time.time()
    simple()
    print("elasped time:{}".format(time.time()-start))
    start = time.time()
    multithreaded()
    print("elasped time:{}".format(time.time()-start))
Results #1: multiprocessed (2.85 s) -> simple (7.39 s) -> multithreaded (7.84 s)
Results #2: multithreaded (7.84 s) -> simple (7.53 s) -> multiprocessed (13.96 s)
Why is that? How do I properly use multiprocessing on Windows to improve speed by utilizing all CPU cores?
Your timing code doesn't isolate each test from the effects of the others. If you execute multiprocessed first, the sublists of my_list are empty. If you execute it last, the sublists are full of elements added by the other runs, dramatically increasing the communication overhead involved in sending the data to the worker processes.
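One way to isolate the runs, as a minimal sketch rather than a definitive benchmark, is to build fresh empty lists for every measurement so that no run inherits data from an earlier one. The names fill, fresh_lists, run_simple, run_multiprocessed and timed are hypothetical, and SIZE and WORKERS are deliberately smaller than the values in the question.

```python
import random
import time
from multiprocessing import Process

SIZE = 100_000  # hypothetical; the question uses 10,000,000
WORKERS = 4     # hypothetical worker count


def fill(count, target):
    """Append `count` random floats to `target`."""
    for _ in range(count):
        target.append(random.random())


def fresh_lists(n):
    """Return n new empty lists so no run inherits data from an earlier one."""
    return [[] for _ in range(n)]


def run_simple(lists, count=SIZE):
    for lst in lists:
        fill(count, lst)


def run_multiprocessed(lists, count=SIZE):
    # Each child works on its own copy of its list, so the parent's lists
    # stay empty. The arguments still have to reach the child, which is
    # why their size at start() time matters.
    procs = [Process(target=fill, args=(count, lst)) for lst in lists]
    for p in procs:
        p.start()
    for p in procs:
        p.join()


def timed(runner):
    """Time one runner against lists created just for this run."""
    lists = fresh_lists(WORKERS)
    start = time.perf_counter()
    runner(lists)
    return time.perf_counter() - start


if __name__ == "__main__":
    for runner in (run_simple, run_multiprocessed):
        print(f"{runner.__name__}: {timed(runner):.2f}s")
```

With this layout the order of the runs should no longer change how much data is sent to the worker processes.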
So I made a program that calculates primes to test the difference between using multithreading and just using a single thread. I read that multiprocessing bypasses the GIL, so I expected a decent performance boost.
So here we have my code to test it:
def prime(n):
    if n == 2:
        return n
    if n & 1 == 0:
        return None
    d = 3
    while d * d <= n:
        if n % d == 0:
            return None
        d = d + 2
    return n
loop = range(2,1000000)
chunks = range(1,1000000,1000)
def chunker(chunk):
    ret = []
    r2 = chunk + 1000
    r1 = chunk
    for k in range(r1,r2):
        ret.append(prime(k))
    return ret
from multiprocessing import cpu_count
from multiprocessing.dummy import Pool
from time import time as t
pool = Pool(12)
start = t()
results = pool.map(prime, loop)
print(t() - start)
pool.close()
filtered = filter(lambda score: score != None, results)
new = []
start = t()
for i in loop:
    new.append(prime(i))
print(t()-start)
pool = Pool(12)
start = t()
results = pool.map_async(chunker, chunks).get()
print(t() - start)
pool.close()
I executed the program and these were the times:
multi processing without chunks:
4.953783750534058
single thread:
5.067057371139526
multiprocessing with chunks:
5.041667222976685
Maybe you already noticed, but multiprocessing isn't that much faster. I have a 6-core, 12-thread AMD Ryzen CPU, so I expected that if I could use all those threads, I would at least double the performance. But no. If I look in Task Manager, the average CPU usage with multiprocessing is 12%, while the single-threaded version uses around 10% of the CPU.
So what is going on? Did I do something wrong? Or does being able to bypass the GIL not mean being able to use more cores?
If I can't use more cores with multiprocessing, how can I do it then?
from multiprocessing.dummy import Pool
from time import time as t
pool = Pool(12)
From the documentation:
multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module.
In other words, you're still using threads, not processes.
To use processes, do from multiprocessing import Pool instead.
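As a rough sketch of that change, assuming the prime test is rewritten to return a boolean (is_prime and count_primes are hypothetical names, not from the question), a real process pool could be used like this. The chunksize argument of Pool.map groups many numbers into each message so that less time is spent on inter-process communication.

```python
from multiprocessing import Pool
from time import perf_counter


def is_prime(n):
    """Trial division; returns True when n is prime."""
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def count_primes(numbers, pool=None, chunksize=10_000):
    """Count primes serially, or with pool.map when a pool is supplied."""
    if pool is None:
        return sum(map(is_prime, numbers))
    return sum(pool.map(is_prime, numbers, chunksize=chunksize))


if __name__ == "__main__":  # required on Windows, where workers re-import this module
    numbers = range(2, 1_000_000)
    start = perf_counter()
    with Pool() as pool:  # by default, one worker process per available CPU
        total = count_primes(numbers, pool)
    print(total, "primes in", round(perf_counter() - start, 2), "s")
```

The chunksize of 10,000 is an arbitrary assumption; the best value depends on how expensive each call is relative to the cost of sending its argument and result.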
I want to implement a recursive parallel algorithm in which a pool is created only once. At each time step the workers do a job, I wait for all the jobs to finish, and then I call the workers again with the previous outputs as inputs, and so on for the next time step.
My problem is that I have implemented a version where I create and kill the pool at every time step, but this is extremely slow, even slower than the sequential version. When I try to implement a version where the pool is created only once at the beginning, I get an assertion error when I call join().
This is my code:
def log_result(result):
    tempx , tempb, u = result
    X[:,u,np.newaxis], b[:,u,np.newaxis] = tempx , tempb
workers = mp.Pool(processes = 4)
for t in range(p,T):
    count = 0 #==========This is only master's job=============
    for l in range(p):
        for k in range(4):
            gn[count]=train[t-l-1,k]
            count+=1
    G = G*v + gn # gn.T#==================================
    if __name__ == '__main__':
        for i in range(4):
            workers.apply_async(OULtraining, args=(train[t,i], X[:,i,np.newaxis], b[:,i,np.newaxis], i, gn), callback = log_result)
        workers.join()
X and b are the matrices that I want to update directly at the master's memory.
What is wrong here, and why do I get the assertion error?
Can I implement what I want with a pool or not?
You cannot join a pool that has not been closed first, because join() waits for the worker processes to terminate, not for jobs to complete (https://docs.python.org/3.6/library/multiprocessing.html section 17.2.2.9).
But closing the pool is not what you want, so you cannot use this. join() is out, and you need to implement "wait until all jobs have completed" yourself.
One way of doing this without busy loops would be using a queue. You could also work with bounded semaphores, but they do not work on all operating systems.
counter = 0
lock_queue = multiprocessing.Queue()
counter_lock = multiprocessing.Lock()
def log_result(result):
    global counter  # without this, "counter += 1" raises UnboundLocalError
    tempx , tempb, u = result
    X[:,u,np.newaxis], b[:,u,np.newaxis] = tempx , tempb
    with counter_lock:
        counter += 1
        if counter == 4:
            counter = 0
            lock_queue.put(42)
workers = mp.Pool(processes = 4)
for t in range(p,T):
    count = 0 #==========This is only master's job=============
    for l in range(p):
        for k in range(4):
            gn[count]=train[t-l-1,k]
            count+=1
    G = G*v + gn # gn.T#==================================
    if __name__ == '__main__':
        counter = 0
        for i in range(4):
            workers.apply_async(OULtraining, args=(train[t,i], X[:,i,np.newaxis], b[:,i,np.newaxis], i, gn), callback = log_result)
        lock_queue.get(block=True)
This resets a global counter before submitting jobs. As soon as a job completes, your callback increments the counter. When the counter hits 4 (your number of jobs), the callback knows it has processed the last result and puts a dummy message on a queue. Your main program is waiting at Queue.get() for something to appear there.
This allows your main program to block until all jobs have completed, without closing down the pool.
If you replace multiprocessing.Pool with ProcessPoolExecutor from concurrent.futures, you can skip this part and use
concurrent.futures.wait(fs, timeout=None, return_when=ALL_COMPLETED)
to block until all submitted tasks have finished. From a functional standpoint there is no difference between the two approaches. The concurrent.futures version is a couple of lines shorter, but the result is exactly the same.
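The concurrent.futures variant might look roughly like the sketch below. Here step is a hypothetical stand-in for OULtraining, the plain list state stands in for the X and b matrices, and run_steps is an illustrative name; none of these come from the question.

```python
from concurrent.futures import ProcessPoolExecutor, wait, ALL_COMPLETED


def step(value, index):
    """Hypothetical per-worker job standing in for OULtraining."""
    return index, value * 2 + index


def run_steps(executor, state, n_steps):
    """At each time step, submit one job per slot and wait for all of them."""
    for _ in range(n_steps):
        futures = [executor.submit(step, state[i], i) for i in range(len(state))]
        wait(futures, return_when=ALL_COMPLETED)  # blocks; the pool stays open
        for fut in futures:
            index, new_value = fut.result()
            state[index] = new_value  # the master updates its own memory
    return state


if __name__ == "__main__":
    # The executor is created once and reused for every time step.
    with ProcessPoolExecutor(max_workers=4) as executor:
        print(run_steps(executor, [1, 1, 1, 1], n_steps=3))
```

Because run_steps only relies on submit(), any Executor can be passed in, which makes it easy to try the logic with a ThreadPoolExecutor first.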
I am learning about python's multiprocessing module. I want to make my code use all my CPU resources. This is the code I wrote:
from multiprocessing import Process
import time
def work():
    for i in range(1000):
        x=5
        y=10
        z=x+y
if __name__ == '__main__':
    start1 = time.time()
    for i in range(100):
        p=Process(target=work)
        p.start()
        p.join()
    end1=time.time()
    start = time.time()
    for i in range(100):
        work()
    end=time.time()
    print(f'With Parallel {end1-start1}')
    print(f'Without Parallel {end-start}')
The output I get is this:
With Parallel 0.8802454471588135
Without Parallel 0.00039649009704589844
I tried experimenting with different range values in the for loops, and with using only a print statement in the work function, but every time the version without parallelism runs faster. Is there something I am missing?
Thanks in advance!
Your benchmark method is problematic:
for i in range(100):
    p = Process(target=work)
    p.start()
    p.join()
I guess you want to run 100 processes in parallel, but Process.join() blocks until the process exits, so you effectively run them serially. Besides, running more busy processes than there are CPU cores leads to heavy CPU contention, which is a performance penalty. And as a comment pointed out, your work() function is too simple compared to the overhead of creating a Process.
A better version:
import multiprocessing
import time
def work():
    for i in range(2000000):
        pow(i, 10)
if __name__ == '__main__':  # needed where child processes are spawned, e.g. on Windows
    n_processes = multiprocessing.cpu_count() # 8
    total_runs = n_processes * 4
    ps = []
    n = total_runs
    start1 = time.time()
    while n:
        # ensure processes number limit
        ps = [p for p in ps if p.is_alive()]
        if len(ps) < n_processes:
            p = multiprocessing.Process(target=work)
            p.start()
            ps.append(p)
            n = n-1
        else:
            time.sleep(0.01)
    # wait for all processes to finish
    while any(p.is_alive() for p in ps):
        time.sleep(0.01)
    end1=time.time()
    start = time.time()
    for i in range(total_runs):
        work()
    end=time.time()
    print(f'With Parallel {end1-start1:.4f}s')
    print(f'Without Parallel {end-start:.4f}s')
    print(f'Acceleration factor {(end-start)/(end1-start1):.2f}')
result:
With Parallel 4.2835s
Without Parallel 33.0244s
Acceleration factor 7.71
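The manual throttling loop above can also be delegated to multiprocessing.Pool, which keeps at most a fixed number of worker processes busy. This is only a sketch under assumptions: the workload size n and the helper run_in_pool are hypothetical, and work() here returns a value, unlike the version above.

```python
import multiprocessing
import time


def work(n=200_000):
    """CPU-bound placeholder task; n is a hypothetical workload size."""
    total = 0
    for i in range(n):
        total += pow(i, 2)
    return total


def run_in_pool(pool, total_runs, n=200_000):
    """Submit total_runs tasks; the pool caps how many run at once."""
    return pool.map(work, [n] * total_runs)


if __name__ == "__main__":
    n_processes = multiprocessing.cpu_count()
    start = time.perf_counter()
    with multiprocessing.Pool(n_processes) as pool:
        results = run_in_pool(pool, n_processes * 4)
    print(f"{len(results)} tasks in {time.perf_counter() - start:.2f}s")
```

The pool reuses its worker processes across tasks, so the process start-up cost is paid once per worker rather than once per task.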
I am trying to come up with a way to have threads work toward the same goal without interfering with each other. In this case I am using 4 threads to add up every number between 0 and 90,000. This code runs, but it ends almost immediately (runtime: 0.00399994850159 sec) and only outputs 0. Originally I wanted to do it with a global variable, but I was worried about the threads interfering with each other (i.e. the small chance that two threads double count or skip a number due to unlucky timing of the reads/writes). So instead I distributed the workload beforehand. If there is a better way to do this, please share. This is my simple way of trying to get some experience with multithreading. Thanks
import threading
import time
start_time = time.time()
tot1 = 0
tot2 = 0
tot3 = 0
tot4 = 0
def Func(x,y,tot):
    tot = 0
    i = y-x
    while z in range(0,i):
        tot = tot + i + z
# class Tester(threading.Thread):
#     def run(self):
#         print(n)
w = threading.Thread(target=Func, args=(0,22499,tot1))
x = threading.Thread(target=Func, args=(22500,44999,tot2))
y = threading.Thread(target=Func, args=(45000,67499,tot3))
z = threading.Thread(target=Func, args=(67500,89999,tot4))
w.start()
x.start()
y.start()
z.start()
w.join()
x.join()
y.join()
z.join()
# while (w.isAlive() == False | x.isAlive() == False | y.isAlive() == False | z.isAlive() == False): {}
total = tot1 + tot2 + tot3 + tot4
print total
print("--- %s seconds ---" % (time.time() - start_time))
You have a bug that makes this program end almost immediately. Look at while z in range(0,i): in Func. z isn't defined in the function, and it's only by luck (bad luck, really) that you happen to have a global variable z = threading.Thread(target=Func, args=(67500,89999,tot4)) that masks the problem. You are testing whether the thread object is in a list of integers... and it's not!
The next problem is with the global variables. First, you are absolutely right that using a single global variable is not thread safe. The threads would mess with each other's calculations. But you misunderstand how globals work. When you do threading.Thread(target=Func, args=(67500,89999,tot4)), Python passes the object currently referenced by tot4 to the function, but the function has no idea which global it came from. You only update the local variable tot and discard it when the function completes.
A solution is to use a global container to hold the calculations, as shown in the example below. Unfortunately, this is actually slower than just doing all the work in one thread. The Python global interpreter lock (GIL) only lets one thread run at a time, so threading does not speed up, and can slow down, CPU-intensive tasks implemented in pure Python.
You could look at the multiprocessing module to split this into multiple processes. That works well if the cost of running the calculation is large compared to the cost of starting the process and passing it data.
Here is a working copy of your example:
import threading
import time
start_time = time.time()
tot = [0] * 4
def Func(x,y,tot_index):
    my_total = 0
    i = y-x
    for z in range(0,i):
        my_total = my_total + i + z
    tot[tot_index] = my_total
# class Tester(threading.Thread):
#     def run(self):
#         print(n)
w = threading.Thread(target=Func, args=(0,22499,0))
x = threading.Thread(target=Func, args=(22500,44999,1))
y = threading.Thread(target=Func, args=(45000,67499,2))
z = threading.Thread(target=Func, args=(67500,89999,3))
w.start()
x.start()
y.start()
z.start()
w.join()
x.join()
y.join()
z.join()
# while (w.isAlive() == False | x.isAlive() == False | y.isAlive() == False | z.isAlive() == False): {}
total = sum(tot)
print(total)
print("--- %s seconds ---" % (time.time() - start_time))
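The point about argument passing made in this answer can be seen in isolation with a small sketch. The names counter, slots, rebind and mutate are hypothetical and chosen only for illustration: assigning to a parameter rebinds a local name, while item assignment changes the shared object.

```python
counter = 0  # global int (immutable)
slots = [0]  # global list (mutable)


def rebind(value):
    # Assignment creates a new local binding; the caller's variable is untouched.
    value = value + 10


def mutate(container):
    # Item assignment changes the object that both names refer to.
    container[0] = container[0] + 10


rebind(counter)
mutate(slots)
print(counter, slots)  # prints: 0 [10]
```

This is why the original tot1 to tot4 stayed at 0, while writing into a shared list makes the per-thread results visible to the main thread.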
You can pass in a mutable object to collect the results, either keyed by an identifier (e.g. a dict) or simply a list that each thread append()s to, e.g.:
import threading
def Func(start, stop, results):
    results.append(sum(range(start, stop+1)))
rngs = [(0, 22499), (22500, 44999), (45000, 67499), (67500, 89999)]
results = []
jobs = [threading.Thread(target=Func, args=(start, stop, results)) for start, stop in rngs]
for j in jobs:
    j.start()
for j in jobs:
    j.join()
print(sum(results))
# 4049955000
# 100 loops, best of 3: 2.35 ms per loop
As others have noted, you could look at multiprocessing to split the work across multiple processes that can run in parallel. This helps especially with CPU-intensive tasks, assuming there isn't a huge amount of data to pass between the processes.
Here's a simple implementation of the same functionality using multiprocessing:
from multiprocessing import Pool
POOL_SIZE = 4
NUMBERS = 90000
def func(_range):
    tot = 0
    for z in range(*_range):
        tot += z
    return tot
if __name__ == '__main__':  # needed where workers are spawned, e.g. on Windows
    with Pool(POOL_SIZE) as pool:
        chunk_size = int(NUMBERS / POOL_SIZE)
        chunks = ((i, i + chunk_size) for i in range(0, NUMBERS, chunk_size))
        print(sum(pool.imap(func, chunks)))
In the code above, chunks is a generator that produces the same ranges that were hardcoded in the original version. It is given to imap, which works like the standard map except that it executes the function in the processes of the pool.
A lesser-known fact about multiprocessing is that you can easily convert the code to use threads instead of processes by using the historically undocumented multiprocessing.pool.ThreadPool. To convert the above example to use threads, just change the import to:
from multiprocessing.pool import ThreadPool as Pool
I am a newbie to Python and I am trying to use multiprocessing for one of my applications.
I have a very simple multiplication program, and I was trying to asynchronously spawn parallel processes to calculate the multiplication for a range of numbers. When I do this without pooling, it is at least twice and sometimes even 4 times faster. I am not sure what the reason for this behavior could be.
I am using python 2.7.1
Non-Pool.py
#!/usr/bin/python
import time
def f(x):
    return x*x
st = time.time()
t = 10000000
f(t)
map(f, range(t))
et = time.time()
tt = (str((et-st)%60)+'--'+str((et-st/60))) # note: et-st/60 parses as et - (st/60)
print tt
Pool.py
#!/usr/bin/python
from multiprocessing import Pool
import time
def f(x):
    return x*x
st = time.time()
t = 10000000
if __name__ == '__main__':
    pool = Pool(processes=4) # start 4 worker processes
    result = pool.apply_async(f, [t]) # evaluate f(t) asynchronously
    result.get(timeout=1) # wait up to 1 second for f(t); the value is not printed
    pool.map(f, range(t)) # compute f for every value in range(t); the list is discarded
    et = time.time()
    tt = (str((et-st)%60)+'--'+str((et-st/60))) # note: et-st/60 parses as et - (st/60)
    print tt
    exit(0)
Execution Times: (Format >> minutes--seconds)
Macha-MacBook-Pro:Downloads me$ ./nonpool.py
2.03456997871--1352551406.28
Macha-MacBook-Pro:Downloads me$ ./pool.py
4.69528508186--1352551417.28
You might check related answers, e.g., python prime crunching: processing pool is slower? -- the overhead of setting up a processing pool is high, but so is sending and receiving single integers in arguments and results.
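One way to reduce that per-item overhead, sketched here under assumptions, is to hand each worker a whole range and have it return a single number. Note that this computes a sum of squares rather than the list of squares from the question, and that sum_squares and make_ranges are hypothetical helper names.

```python
from multiprocessing import Pool


def sum_squares(bounds):
    """Do a whole range of work per task and return one number."""
    start, stop = bounds
    return sum(x * x for x in range(start, stop))


def make_ranges(n, parts):
    """Split range(n) into contiguous (start, stop) pairs."""
    step = -(-n // parts)  # ceiling division
    return [(i, min(i + step, n)) for i in range(0, n, step)]


if __name__ == "__main__":
    n = 10_000_000
    with Pool(processes=4) as pool:
        # 4 small messages out and 4 numbers back, instead of
        # 10,000,000 integers travelling in each direction.
        total = sum(pool.map(sum_squares, make_ranges(n, 4)))
    print(total)
```

Whether this beats the single-process version still depends on how much work each task does compared with the cost of starting the pool.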