How to use python multiprocessing Pool.map within loop - python

I am running a simulation using Runge-Kutta. At every time step two FFT of two independent variables are necessary which can be parallelized. I implemented the code like this:
from multiprocessing import Pool
import numpy as np
pool = Pool(processes=2) # I like to calculate only 2 FFTs parallel
# in every time step, therefor 2 processes
def Splitter(args):
'''I have to pass 2 arguments'''
return makeSomething(*args):
def makeSomething(a,b):
'''dummy function instead of the one with the FFT'''
return a*b
def RungeK():
# ...
# a lot of code which create the vectors A and B and calculates
# one Kunge-Kutta step for them
# ...
n = 20 # Just something for the example
A = np.arange(50000)
B = np.ones_like(A)
for i in xrange(n): # loop over the time steps
A *= np.mean(B)*B - A
B *= np.sqrt(A)
results = pool.map(Splitter,[(A,3),(B,2)])
A = results[0]
B = results[1]
print np.mean(A) # Some output
print np.max(B)
if __name__== '__main__':
RungeK()
Unfortunately python generates a unlimited number of processes after reaching the loop. Before it seems that only two processes are running. Also my memory fills up. Adding a
pool.close()
pool.join()
behind the loop does not solve my problem, and to put it inside the loop makes no sense for me. Hope you can help.

Move the creation of the pool into the RungeK function;
def RungeK():
# ...
# a lot of code which create the vectors A and B and calculates
# one Kunge-Kutta step for them
# ...
pool = Pool(processes=2)
n = 20 # Just something for the example
A = np.arange(50000)
B = np.ones_like(A)
for i in xrange(n): # loop over the time steps
A *= np.mean(B)*B - A
B *= np.sqrt(A)
results = pool.map(Splitter, [(A, 3), (B, 2)])
A = results[0]
B = results[1]
pool.close()
print np.mean(A) # Some output
print np.max(B)
Alternatively, put it in the main block.
This is probably a side effect of how multiprocessing works. E.g. on MS windows, you need to be able to import the main module without side effects (like creating new processes).

Related

How to make 6 calculation as fast as possible based on one datastream?

I have one stream of data who is coming very fast, and when a new data arrive, I would like to make 6 different calculation based on it.
I would like to make those calculation as fast as possible so I can update as soon as I receive new data.
The data can arrive as fast as milliseconds so my calculation must be very fast.
So the best thing I was thinking of was to make those calculations on 6 different Threads at the same time.
I never used threads before so I don't know where to place it.
This is the code who describe my problem
What can I do from here?
import numpy as np
import time
np.random.seed(0)
def calculation_1(data, multiplicator):
r = np.log(data * (multiplicator+1))
return r
start = time.time()
for ii in range(1000000):
data_stream_main = [np.random.uniform(0, 2.0), np.random.uniform(10, 1000.0), np.random.uniform(0, 0.01)]
# calculation that has to be done together
calc_1 = calculation_1(data=data_stream_main[0], multiplicator=2)
calc_2 = calculation_1(data=data_stream_main[0], multiplicator=3)
calc_3 = calculation_1(data=data_stream_main[1], multiplicator=2)
calc_4 = calculation_1(data=data_stream_main[1], multiplicator=3)
calc_5 = calculation_1(data=data_stream_main[2], multiplicator=2)
calc_6 = calculation_1(data=data_stream_main[2], multiplicator=3)
print(calc_1)
print(calc_2)
print(calc_3)
print(calc_4)
print(calc_5)
print(calc_6)
print("total time:", time.time() - start)
You can use either class multiprocessing.pool.Pool or concurrent.futures.ProcessPoolExecutor to create a multiprocessing pool of 6 processes to which you can submit your 6 tasks in your loop to execute in parallel and await the results. The following example uses multiprocessing.pool.Pool.
But, the result will be very disappointing.
The problem is that (1) There is overhead in initially creating the 6 processes and (2) overhead in queueing up each task to execute in the different address space that the subprocesses live. This means that for multiprocessing to be advantageous, your worker function, calculation_1 in this case, needs to be a less-trivial, longer-running, more-CPU-intensive function. If you were to add to your worker function the following "do-nothing", CPU-intensive loop ...
cnt = 0
for i in range(100000):
cnt += 1
... then the following multiprocessing code would run several times more quickly. As is, stick with what you have.
import numpy as np
import multiprocessing as mp
import time
def calculation_1(data, multiplicator):
r = np.log(data * (multiplicator+1))
"""
cnt = 0
for i in range(100000):
cnt += 1
"""
return r
# required for Windows and other platforms that use spawn for creating new processes:
if __name__ == '__main__':
np.random.seed(0)
# no point in using more processes than processors:
n_processors = min(6, mp.cpu_count())
pool = mp.Pool(n_processors)
start = time.time()
for ii in range(1000000):
data_stream_main = [np.random.uniform(0, 2.0), np.random.uniform(10, 1000.0), np.random.uniform(0, 0.01)]
# calculation that has to be done together
# submit tasks:
result_1 = pool.apply_async(calculation_1, (data_stream_main[0], 2))
result_2 = pool.apply_async(calculation_1, (data_stream_main[0], 3))
result_3 = pool.apply_async(calculation_1, (data_stream_main[1], 2))
result_4 = pool.apply_async(calculation_1, (data_stream_main[1], 3))
result_5 = pool.apply_async(calculation_1, (data_stream_main[2], 2))
result_6 = pool.apply_async(calculation_1, (data_stream_main[2], 3))
# wait for results:
calc_1 = result_1.get()
calc_2 = result_2.get()
calc_3 = result_3.get()
calc_4 = result_4.get()
calc_5 = result_5.get()
calc_6 = result_6.get()
print(calc_1)
print(calc_2)
print(calc_3)
print(calc_4)
print(calc_5)
print(calc_6)
print("total time:", time.time() - start)
You could factorize the calculation by separating the log(data) from the log(multiplicator).
Given that np.log(data * (multiplicator+1)) is the same as np.log(data) + np.log(multiplicator+1), you can compute and store the 2 possible values of np.log(multiplicator+1) in global variables, then only compute log(data) once per index (thus saving 50%) on that part.
# global variables and calculation function:
multiplicator2 = np.log(3)
multiplicator3 = np.log(4)
def calculation_1(data):
logData = np.log(data)
return logData + multiplicator2, logData + multiplicator3
# in the loop:...
calc_1,calc_2 = calculation_1(data_stream_main[0])
calc_3,calc_4 = calculation_1(data_stream_main[1])
calc_5,calc_6 = calculation_1(data_stream_main[2])
If you can afford to buffer several rows of data into a numpy matrix before outputing the result, you may get some performance improvement by using numpy's parallelism to perform the calculation on the whole matrix (or chunk) and output the result in chunks instead of one row at a time. Separating reception of the data from computation and output is where the use of threads may provide a benefit.
For example:
start = time.time()
chunk = []
multiplicators = np.array([2,2,2,3,3,3])
for ii in range(1000000):
data_stream_main = [np.random.uniform(0, 2.0), np.random.uniform(10, 1000.0), np.random.uniform(0, 0.01)]
chunk.append(data_stream_main*2)
if len(chunk)< 1000: continue
# process 1000 lines at a time and output results
calcs = np.log(np.array(chunk)*multiplicators)
calc_1,calc_4,calc_2,calc_5,calc_3,calc6 = calcs[-1,:]
chunk = [] # reset chunk
print("total time:", time.time() - start) # 2.7 (compared to 6.6)

Using Multithreading or Multiprocessing to improve computational speed

I am iterating through very large file size [mesh]. Since the iteration is independent, I would like to split my mesh into smaller sizes and run them all at the same time in order to lower computation time. Below is a sample code. For example, if mesh is of length=50000, I would like to divide the mesh into 100 and run fun for each mesh/100 at the same time.
import numpy as np
def fnc(data, mesh):
d = []
for i, dummy_val in enumerate(mesh):
d.append(np.sqrt((data[:, 0]-mesh[i, 0])**2.0 + (data[:, 1]-mesh[i, 1])**2.0))
return d
interpolate = fnc(mydata, mymesh)
I would like to know to achieve this using multiprocessing or multithreading as I'm unable to reconcile it with the execution of my loop.
This will give you the general idea. I couldn't test this since I do not have your data. The default constructor for ProcessPoolExecutor will use the number of processors on your computer. But since that determines the level of multiprocessing you can have, it will probably be more efficient to set the N_CHUNKS parameter to the number of simultaneous processes you can support. That is, if you have a processing pool size of 6, then it is better to just divide your array into 6 large chunks and have 6 processes do the work rather than breaking it up into smaller pieces where processes will have to wait to run. So you should probably specify a specific max_workers number to the ProcessPoolExecutor not greater than the number of processors you have and set N_CHUNKS to the same value.
from concurrent.futures import ProcessPoolExecutor, as_completed
import numpy as np
def fnc(data, mesh):
d = []
for i, dummy_val in enumerate(mesh):
d.append(np.sqrt((data[:, 0]-mesh[i, 0])**2.0 + (data[:, 1]-mesh[i, 1])**2.0))
return d
def main(data, mesh):
#N_CHUNKS = 100
N_CHUNKS = 6 # assuming you have 6 processors; see max_workers parameter
n = len(mesh)
assert n != 0
if n <= N_CHUNKS:
N_CHUNKS = 1
chunk_size = n
last_chunk_size = n
else:
chunk_size = n // N_CHUNKS
last_chunk_size = n - chunk_size * (N_CHUNKS - 1)
with ProcessPoolExcutor(max_workers=N_CHUNKS) as executor: # assuming you have 6 processors
the_futures = {}
start = 0
for i in range(N_CHUNKS - 1):
future = executor.submit(fnc, data, mesh[start:start+chunk_size]) # pass slice
the_futures[future] = (start, start+chunk_size) # map future to request parameters
start += chunk_size
if last_chunk_size:
future = executor.submit(fnc, data, mesh[start:n]) # pass slice
the_futures[future] = (start, start+n)
for future in as_completed(the_futures):
(start, end) = the_futures[future] # the original range
d = future.result() # do something with the results
if __name__ == '__main__':
# the call to main must be done in a block governed by if __name__ == '__main__' or you will get into a recursive
# loop where each subprocess calls main again
main(data, mesh)

Can Python threads work on the same process?

I am trying to come up with a way to have threads work on the same goal without interfering. In this case I am using 4 threads to add up every number between 0 and 90,000. This code runs but it ends almost immediately (Runtime: 0.00399994850159 sec) and only outputs 0. Originally I wanted to do it with a global variable but I was worried about the threads interfering with each other (ie. the small chance that two threads double count or skip a number due to strange timing of the reads/writes). So instead I distributed the workload beforehand. If there is a better way to do this please share. This is my simple way of trying to get some experience into multi threading. Thanks
import threading
import time
start_time = time.time()
tot1 = 0
tot2 = 0
tot3 = 0
tot4 = 0
def Func(x,y,tot):
tot = 0
i = y-x
while z in range(0,i):
tot = tot + i + z
# class Tester(threading.Thread):
# def run(self):
# print(n)
w = threading.Thread(target=Func, args=(0,22499,tot1))
x = threading.Thread(target=Func, args=(22500,44999,tot2))
y = threading.Thread(target=Func, args=(45000,67499,tot3))
z = threading.Thread(target=Func, args=(67500,89999,tot4))
w.start()
x.start()
y.start()
z.start()
w.join()
x.join()
y.join()
z.join()
# while (w.isAlive() == False | x.isAlive() == False | y.isAlive() == False | z.isAlive() == False): {}
total = tot1 + tot2 + tot3 + tot4
print total
print("--- %s seconds ---" % (time.time() - start_time))
You have a bug that makes this program end almost immediately. Look at while z in range(0,i): in Func. z isn't defined in the function and its only by luck (bad luck really) that you happen to have a global variable z = threading.Thread(target=Func, args=(67500,89999,tot4)) that masks the problem. You are testing whether the thread object is in a list of integers... and its not!
The next problem is with the global variables. First, you are absolutely right that using a single global variable is not thread safe. The threads would mess with each others calculations. But you misunderstand how globals work. When you do threading.Thread(target=Func, args=(67500,89999,tot4)), python passes the object currently referenced by tot4 to the function, but the function has no idea which global it came from. You only update the local variable tot and discard it when the function completes.
A solution is to use a global container to hold the calculations as shown in the example below. Unfortunately, this is actually slower than just doing all the work in one thread. The python global interpreter lock (GIL) only lets 1 thread run at a time and only slows down CPU-intensive tasks implemented in pure python.
You could look at the multiprocessing module to split this into multiple processes. That works well if the cost of running the calculation is large compared to the cost of starting the process and passing it data.
Here is a working copy of your example:
import threading
import time
start_time = time.time()
tot = [0] * 4
def Func(x,y,tot_index):
my_total = 0
i = y-x
for z in range(0,i):
my_total = my_total + i + z
tot[tot_index] = my_total
# class Tester(threading.Thread):
# def run(self):
# print(n)
w = threading.Thread(target=Func, args=(0,22499,0))
x = threading.Thread(target=Func, args=(22500,44999,1))
y = threading.Thread(target=Func, args=(45000,67499,2))
z = threading.Thread(target=Func, args=(67500,89999,3))
w.start()
x.start()
y.start()
z.start()
w.join()
x.join()
y.join()
z.join()
# while (w.isAlive() == False | x.isAlive() == False | y.isAlive() == False | z.isAlive() == False): {}
total = sum(tot)
print total
print("--- %s seconds ---" % (time.time() - start_time))
You can pass in a mutable object that you can add your results either with an identifier, e.g. dict or just a list and append() the results, e.g.:
import threading
def Func(start, stop, results):
results.append(sum(range(start, stop+1)))
rngs = [(0, 22499), (22500, 44999), (45000, 67499), (67500, 89999)]
results = []
jobs = [threading.Thread(target=Func, args=(start, stop, results)) for start, stop in rngs]
for j in jobs:
j.start()
for j in jobs:
j.join()
print(sum(results))
# 4049955000
# 100 loops, best of 3: 2.35 ms per loop
As others have noted you could look multiprocessing in order to split the work to multiple different processes that can run parallel. This would benefit especially in CPU-intensive tasks assuming that there isn't huge amount of data to pass between the processes.
Here's a simple implementation of the same functionality using multiprocessing:
from multiprocessing import Pool
POOL_SIZE = 4
NUMBERS = 90000
def func(_range):
tot = 0
for z in range(*_range):
tot += z
return tot
with Pool(POOL_SIZE) as pool:
chunk_size = int(NUMBERS / POOL_SIZE)
chunks = ((i, i + chunk_size) for i in range(0, NUMBERS, chunk_size))
print(sum(pool.imap(func, chunks)))
In above chunks is a generator that produces the same ranges that were hardcoded in original version. It's given to imap which works the same as standard map except that it executes the function in the processes within the pool.
Less known fact about multiprocessing is that you can easily convert the code to use threads instead of processes by using undocumented multiprocessing.pool.ThreadPool. In order to convert above example to use threads just change import to:
from multiprocessing.pool import ThreadPool as Pool

Python array sum vs MATLAB

I'm slowly switching to Python and I wanted to make a simple test for comparing the performance of a simple array summation. I generate a random 1000x1000 array and add one to each of the values in this array.
Here my script in Python :
import time
import numpy
from numpy.random import random
def testAddOne(data):
"""
Test addOne
"""
return data + 1
i = 1000
data = random((i,i))
start = time.clock()
for x in xrange(1000):
testAddOne(data)
stop = time.clock()
print stop - start
And my function in MATLAB:
function test
%parameter declaration
c=rand(1000);
tic
for t = 1:1000
testAddOne(c);
end
fprintf('Structure: \n')
toc
end
function testAddOne(c)
c = c + 1;
end
The Python takes 2.77 - 2.79 seconds, the same as the MATLAB function (I'm actually quite impressed by Numpy!). What would I have to change to my Python script to use multithreading? I can't in MATLAB since I don,t have the toolbox.
Multi threading in Python is only useful for situations where threads get blocked, e.g. on getting input, which is not the case here (see the answers to this question for more details). However, multi processing is easy to do in Python. Multiprocessing in general is covered here.
A program taking a similar approach to your example is below
import time
import numpy
from numpy.random import random
from multiprocessing import Process
def testAddOne(data):
return data + 1
def testAddN(data,N):
# print "testAddN", N
for x in xrange(N):
testAddOne(data)
if __name__ == '__main__':
matrix_size = 1000
num_adds = 10000
num_processes = 4
data = random((matrix_size,matrix_size))
start = time.clock()
if num_processes > 1:
processes = [Process(target=testAddN, args=(data,num_adds/num_processes))
for i in range(num_processes)]
for p in processes:
p.start()
for p in processes:
p.join()
else:
testAddN(data,num_adds)
stop = time.clock()
print "Elapsed", stop - start
A more useful example using a pool of worker processes to successively add 1 to different matrices is below.
import time
import numpy
from numpy.random import random
from multiprocessing import Pool
def testAddOne(data):
return data + 1
def testAddN(dataN):
data,N=dataN
for x in xrange(N):
data = testAddOne(data)
return data
if __name__ == '__main__':
num_matrices = 4
matrix_size = 1000
num_adds_per_matrix = 2500
num_processes = 4
inputs = [(random((matrix_size,matrix_size)), num_adds_per_matrix)
for i in range(num_matrices)]
#print inputs # test using, e.g., matrix_size = 2
start = time.clock()
if num_processes > 1:
proc_pool = Pool(processes=num_processes)
outputs = proc_pool.map(testAddN, inputs)
else:
outputs = map(testAddN, inputs)
stop = time.clock()
#print outputs # test using, e.g., matrix_size = 2
print "Elapsed", stop - start
In this case the code in testAddN actually does something with the result of calling testAddOne. And you can uncomment the print statements to check that some useful work is being done.
In both cases I've changed the total number of additions to 10000; with fewer additions the cost of starting up processes becomes more significant (but you can experiment with the parameters). And you can experiment with num_processes also. On my machine I found that compared to running in the same process with num_processes=1 I got just under a 2x speedup spawning four processes with num_processes=4.

Multiprocessing in python to speed up functions

I am confused with Python multiprocessing.
I am trying to speed up a function which process strings from a database but I must have misunderstood how multiprocessing works because the function takes longer when given to a pool of workers than with “normal processing”.
Here an example of what I am trying to achieve.
from time import clock, time
from multiprocessing import Pool, freeze_support
from random import choice
def foo(x):
TupWerteMany = []
for i in range(0,len(x)):
TupWerte = []
s = list(x[i][3])
NewValue = choice(s)+choice(s)+choice(s)+choice(s)
TupWerte.append(NewValue)
TupWerte = tuple(TupWerte)
TupWerteMany.append(TupWerte)
return TupWerteMany
if __name__ == '__main__':
start_time = time()
List = [(u'1', u'aa', u'Jacob', u'Emily'),
(u'2', u'bb', u'Ethan', u'Kayla')]
List1 = List*1000000
# METHOD 1 : NORMAL (takes 20 seconds)
x2 = foo(List1)
print x2[1:3]
# METHOD 2 : APPLY_ASYNC (takes 28 seconds)
# pool = Pool(4)
# Werte = pool.apply_async(foo, args=(List1,))
# x2 = Werte.get()
# print '--------'
# print x2[1:3]
# print '--------'
# METHOD 3: MAP (!! DOES NOT WORK !!)
# pool = Pool(4)
# Werte = pool.map(foo, args=(List1,))
# x2 = Werte.get()
# print '--------'
# print x2[1:3]
# print '--------'
print 'Time Elaspse: ', time() - start_time
My questions:
Why does apply_async takes longer than the “normal way” ?
What I am doing wrong with map?
Does it makes sense to speed up such tasks with multiprocessing at all?
Finally: after all I have read here, I am wondering if multiprocessing in python works on windows at all ?
So your first problem is that there is no actual parallelism happening in foo(x), you are passing the entire list to the function once.
1)
The idea of a process pool is to have many processes doing computations on separate bits of some data.
# METHOD 2 : APPLY_ASYNC
jobs = 4
size = len(List1)
pool = Pool(4)
results = []
# split the list into 4 equally sized chunks and submit those to the pool
heads = range(size/jobs, size, size/jobs) + [size]
tails = range(0,size,size/jobs)
for tail,head in zip(tails, heads):
werte = pool.apply_async(foo, args=(List1[tail:head],))
results.append(werte)
pool.close()
pool.join() # wait for the pool to be done
for result in results:
werte = result.get() # get the return value from the sub jobs
This will only give you an actual speedup if the time it takes to process each chunk is greater than the time it takes to launch the process, in the case of four processes and four jobs to be done, of course these dynamics change if you've got 4 processes and 100 jobs to be done. Remember that you are creating a completely new python interpreter four times, this isn't free.
2) The problem you have with map is that it applies foo to EVERY element in List1 in a separate process, this will take quite a while. So if you're pool has 4 processes map will pop an item of the list four times and send it to a process to be dealt with - wait for process to finish - pop some more stuff of the list - wait for the process to finish. This makes sense only if processing a single item takes a long time, like for instance if every item is a file name pointing to a one gigabyte text file. But as it stands map will just take a single string of the list and pass it to foo where as apply_async takes a slice of the list. Try the following code
def foo(thing):
print thing
map(foo, ['a','b','c','d'])
That's the built-in python map and will run a single process, but the idea is exactly the same for the multiprocess version.
Added as per J.F.Sebastian's comment: You can however use the chunksize argument to map to specify an approximate size of for each chunk.
pool.map(foo, List1, chunksize=size/jobs)
I don't know though if there is a problem with map on Windows as I don't have one available for testing.
3) yes, given that your problem is big enough to justify forking out new python interpreters
4) can't give you a definitive answer on that as it depends on the number of cores/processors etc. but in general it should be fine on Windows.
On question (2)
With the guidance of Dougal and Matti, I figured out what's went wrong.
The original foo function processes a list of lists, while map requires a function to process single elements.
The new function should be
def foo2 (x):
TupWerte = []
s = list(x[3])
NewValue = choice(s)+choice(s)+choice(s)+choice(s)
TupWerte.append(NewValue)
TupWerte = tuple(TupWerte)
return TupWerte
and the block to call it :
jobs = 4
size = len(List1)
pool = Pool()
#Werte = pool.map(foo2, List1, chunksize=size/jobs)
Werte = pool.map(foo2, List1)
pool.close()
print Werte[1:3]
Thanks to all of you who helped me understand this.
Results of all methods:
for List * 2 Mio records: normal 13.3 seconds , parallel with async: 7.5 seconds, parallel with with map with chuncksize : 7.3, without chunksize 5.2 seconds
Here's a generic multiprocessing template if you are interested.
import multiprocessing as mp
import time
def worker(x):
time.sleep(0.2)
print "x= %s, x squared = %s" % (x, x*x)
return x*x
def apply_async():
pool = mp.Pool()
for i in range(100):
pool.apply_async(worker, args = (i, ))
pool.close()
pool.join()
if __name__ == '__main__':
apply_async()
And the output looks like this:
x= 0, x squared = 0
x= 1, x squared = 1
x= 2, x squared = 4
x= 3, x squared = 9
x= 4, x squared = 16
x= 6, x squared = 36
x= 5, x squared = 25
x= 7, x squared = 49
x= 8, x squared = 64
x= 10, x squared = 100
x= 11, x squared = 121
x= 9, x squared = 81
x= 12, x squared = 144
As you can see, the numbers are not in order, as they are being executed asynchronously.

Categories

Resources