starmap, starmap_async python - python

I want to translate a huge MATLAB model to Python, so I need to work on the key functions first. One key function handles parallel processing. The input is a matrix of parameters in which every row holds the parameters for one run. These parameters are used within a computation-heavy function that should run in parallel. No run needs the results of any other run, so all processes can run independently of each other.
Why is starmap_async slower on my PC? Also, when I add more code (to test consecutive computation), Python crashes (I use Spyder). Can you give me advice?
import time
import numpy as np
import multiprocessing as mp
from functools import partial
# Create simulated data matrix
data = np.random.random((100,3000))
data = np.column_stack((np.arange(1,len(data)+1,1),data))
def EAF_DGL(*z, package_num):
    sum_row = 0
    for i in range(1,np.shape(z)[0]):
        sum_row = sum_row + z[i]
    func_result = np.column_stack((package_num,z[0],sum_row))
    return func_result
t0 = time.time()
if __name__ == "__main__":
    package_num = 1
    help_EAF_DGL = partial(EAF_DGL, package_num=1)
    with mp.Pool() as pool:
        #result = pool.starmap(partial(EAF_DGL, package_num), [(data[i]) for i in range(0,np.shape(data)[0])])
        result = pool.starmap_async(help_EAF_DGL, [(data[i]) for i in range(0,np.shape(data)[0])]).get()
        pool.close()
        pool.join()
t1 = time.time()
calculation_time_parallel_async = t1-t0
print(calculation_time_parallel_async)
t2 = time.time()
if __name__ == "__main__":
    package_num = 1
    help_EAF_DGL = partial(EAF_DGL, package_num=1)
    with mp.Pool() as pool:
        #result = pool.starmap(partial(EAF_DGL, package_num), [(data[i]) for i in range(0,np.shape(data)[0])])
        result = pool.starmap(help_EAF_DGL, [(data[i]) for i in range(0,np.shape(data)[0])])
        pool.close()
        pool.join()
t3 = time.time()
calculation_time_parallel = t3-t2
print(calculation_time_parallel)
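One possible way to structure the row-per-run setup described above is sketched here. It assumes a hypothetical worker, run_row, that receives a whole parameter row as one array instead of having the row unpacked into thousands of positional arguments; run_row and run_all are illustrative names, not part of the original model. Note that calling .get() immediately after starmap_async blocks in the same way starmap does, so a timing difference between the two runs above may have more to do with pool start-up than with the method itself.

```python
import multiprocessing as mp
from functools import partial

import numpy as np


def run_row(row, package_num):
    """Hypothetical worker: row[0] is the run id, row[1:] are the parameters."""
    return np.array([package_num, row[0], row[1:].sum()])


def run_all(data, package_num=1, processes=None):
    """Apply run_row to every row of data in parallel and stack the results."""
    worker = partial(run_row, package_num=package_num)
    with mp.Pool(processes) as pool:
        # Pool.map blocks until all rows are done and preserves row order.
        rows = pool.map(worker, data)
    return np.vstack(rows)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.random((100, 3000))
    data = np.column_stack((np.arange(1, len(data) + 1), data))
    result = run_all(data)
    print(result.shape)  # one output row per input row
```

Keeping the pool creation and all timing code inside the `if __name__ == "__main__":` guard matters on platforms that start workers by re-importing the main module.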

Related

Thread Pool Executor using Concurrent: no improvement for various number of workers

I'm trying to implement a task in parallel using Concurrent. Please find below a piece of code for it:
import os
import time
from concurrent.futures import ProcessPoolExecutor as PE
import concurrent.futures
# num CPUs
cpu_num = len(os.sched_getaffinity(0))
print("Number of cpu available : ",cpu_num)
# max_Worker = cpu_num
max_Worker = 1
# A fake input array
n=1000000
array = list(range(n))
results = []
# A fake function being applied to each element of array
def task(i):
    return i**2
x = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=max_Worker) as executor:
    features = {executor.submit(task, j) for j in array}
    # the real function is heavy and we need to be sure of completeness of each run
    for future in concurrent.futures.as_completed(features):
        results.append(future.result())
results = [future.result() for future in features]
y = time.time()
print('=========================================')
print(f"Train data preparation time (s): {(y-x)}")
print('=========================================')
And now my questions:
Although there is no error, is this correct and optimized?
While playing with the number of workers, there seems to be no improvement in speed (e.g., 1 vs. 16 makes no difference). What is the problem, and how can it be solved?
Thanks in advance.
See my comment to your question. To the overhead I mentioned in that comment, you also need to add the overhead of creating the process pool itself.
The following is a benchmark with several results. The first is a timing from just calling the worker function task 100000 times and creating a results list and printing out the last element of that list. It will become apparent why I have reduced the number of times I am calling task from 1000000 to 100000.
The next attempt is to use multiprocessing to accomplish the same thing using a ProcessPoolExecutor with the submit method and then processing the Future instances that are returned.
The next attempt instead uses the map method with its default chunksize argument of 1. It is important to understand this argument. With a chunksize value of 1, each element of the iterable passed to map is written individually to a queue of tasks as a chunk to be processed by the processes in the pool. When a pool process becomes idle, it pulls the next chunk of tasks from the queue, processes each task in the chunk, and then becomes idle again. When a lot of tasks are submitted via map, a chunksize value of 1 is inefficient. You would expect its performance to be equivalent to repeatedly issuing submit calls for each element of the iterable.
The next attempt specifies a chunksize value which approximates more or less the value that the map function used by the Pool class in the multiprocessing package would have used by default. As you can see, the improvement is dramatic, but still not an improvement over the non-multiprocessing case.
The final attempt uses the multiprocessing facility provided by the multiprocessing package and its multiprocessing.pool.Pool class. The difference in this benchmark is that its map function uses a more intelligent default chunksize when no chunksize argument is specified.
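To make the effect of chunksize concrete, here is a minimal sketch of how an iterable might be split into chunks before being queued. The helper split_into_chunks is hypothetical and only mimics the idea; it is not the code that either library actually runs.

```python
from itertools import islice


def split_into_chunks(iterable, chunksize):
    """Yield successive tuples of at most chunksize items from iterable."""
    it = iter(iterable)
    while True:
        chunk = tuple(islice(it, chunksize))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Small case to show the shape of the chunks.
    print(list(split_into_chunks(range(7), 3)))
    # Number of queue items for 100000 tasks with two different chunk sizes.
    for size in (1, 3125):
        n_chunks = sum(1 for _ in split_into_chunks(range(100_000), size))
        print(f"chunksize={size}: {n_chunks} queue items")
```

With a chunksize of 1, every task is a separate queue item, which is where the per-task transfer overhead accumulates; larger chunks amortize that cost over many tasks.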
import os
import time
from concurrent.futures import ProcessPoolExecutor as PE
from multiprocessing import Pool
# A fake function being applied to each element of array
def task(i):
    return i**2
# required for Windows:
if __name__ == '__main__':
    n=100000
    t1 = time.time()
    results = [task(i) for i in range(n)]
    print('Non-multiprocessing time:', time.time() - t1, results[-1])
    # num CPUs
    cpu_num = os.cpu_count()
    print("Number of CPUs available: ",cpu_num)
    t1 = time.time()
    with PE(max_workers=cpu_num) as executor:
        futures = [executor.submit(task, i) for i in range(n)]
        results = [future.result() for future in futures]
    print('Multiprocessing time using submit:', time.time() - t1, results[-1])
    t1 = time.time()
    with PE(max_workers=cpu_num) as executor:
        results = list(executor.map(task, range(n)))
    print('Multiprocessing time using map:', time.time() - t1, results[-1])
    t1 = time.time()
    chunksize = n // (4 * cpu_num)
    with PE(max_workers=cpu_num) as executor:
        results = list(executor.map(task, range(n), chunksize=chunksize))
    print(f'Multiprocessing time using map: {time.time() - t1}, chunksize: {chunksize}', results[-1])
    t1 = time.time()
    with Pool(cpu_num) as executor:
        results = executor.map(task, range(n))
    print('Multiprocessing time using Pool.map:', time.time() - t1, results[-1])
Prints:
Non-multiprocessing time: 0.027019739151000977 9999800001
Number of CPUs available: 8
Multiprocessing time using submit: 77.34723353385925 9999800001
Multiprocessing time using map: 79.52981925010681 9999800001
Multiprocessing time using map: 0.30500149726867676, chunksize: 3125 9999800001
Multiprocessing time using Pool.map: 0.2799997329711914 9999800001
Update
The following benchmarks use a version of task that is very CPU-intensive and show the benefit of multiprocessing. It would also seem that, for this small iterable size (100), forcing a chunksize value of 1 for the Pool.map case (by default it would compute a chunksize value of 4) is slightly more performant.
import os
import time
from concurrent.futures import ProcessPoolExecutor as PE
from multiprocessing import Pool
# A fake function being applied to each element of array
def task(i):
    for _ in range(1_000_000):
        result = i ** 2
    return result
def compute_chunksize(iterable_size, pool_size):
    chunksize, remainder = divmod(iterable_size, pool_size * 4)
    if remainder:
        chunksize += 1
    return chunksize
# required for Windows:
if __name__ == '__main__':
    n = 100
    cpu_num = os.cpu_count()
    chunksize = compute_chunksize(n, cpu_num)
    t1 = time.time()
    results = [task(i) for i in range(n)]
    t2 = time.time()
    print('Non-multiprocessing time:', t2 - t1, results[-1])
    # num CPUs
    print("Number of CPUs available: ",cpu_num)
    t1 = time.time()
    with PE(max_workers=cpu_num) as executor:
        futures = [executor.submit(task, i) for i in range(n)]
        results = [future.result() for future in futures]
    t2 = time.time()
    print('Multiprocessing time using submit:', t2 - t1, results[-1])
    t1 = time.time()
    with PE(max_workers=cpu_num) as executor:
        results = list(executor.map(task, range(n)))
    t2 = time.time()
    print('Multiprocessing time using map:', t2 - t1, results[-1])
    t1 = time.time()
    with PE(max_workers=cpu_num) as executor:
        results = list(executor.map(task, range(n), chunksize=chunksize))
    t2 = time.time()
    print(f'Multiprocessing time using map: {t2 - t1}, chunksize: {chunksize}', results[-1])
    t1 = time.time()
    with Pool(cpu_num) as executor:
        results = executor.map(task, range(n))
    t2 = time.time()
    print('Multiprocessing time using Pool.map:', t2 - t1, results[-1])
    t1 = time.time()
    with Pool(cpu_num) as executor:
        results = executor.map(task, range(n), chunksize=1)
    t2 = time.time()
    print('Multiprocessing time using Pool.map (chunksize=1):', t2 - t1, results[-1])
Prints:
Non-multiprocessing time: 23.12758779525757 9801
Number of CPUs available: 8
Multiprocessing time using submit: 5.336004018783569 9801
Multiprocessing time using map: 5.364996671676636 9801
Multiprocessing time using map: 5.444890975952148, chunksize: 4 9801
Multiprocessing time using Pool.map: 5.400001287460327 9801
Multiprocessing time using Pool.map (chunksize=1): 4.698001146316528 9801

How to make 6 calculation as fast as possible based on one datastream?

I have one stream of data that arrives very fast, and whenever new data arrives, I would like to make 6 different calculations based on it.
I would like to make those calculations as fast as possible so I can update as soon as I receive new data.
The data can arrive at millisecond intervals, so my calculations must be very fast.
The best approach I could think of was to run those calculations on 6 different threads at the same time.
I have never used threads before, so I don't know where to put them.
This is the code that describes my problem.
What can I do from here?
import numpy as np
import time
np.random.seed(0)
def calculation_1(data, multiplicator):
    r = np.log(data * (multiplicator+1))
    return r
start = time.time()
for ii in range(1000000):
    data_stream_main = [np.random.uniform(0, 2.0), np.random.uniform(10, 1000.0), np.random.uniform(0, 0.01)]
    # calculation that has to be done together
    calc_1 = calculation_1(data=data_stream_main[0], multiplicator=2)
    calc_2 = calculation_1(data=data_stream_main[0], multiplicator=3)
    calc_3 = calculation_1(data=data_stream_main[1], multiplicator=2)
    calc_4 = calculation_1(data=data_stream_main[1], multiplicator=3)
    calc_5 = calculation_1(data=data_stream_main[2], multiplicator=2)
    calc_6 = calculation_1(data=data_stream_main[2], multiplicator=3)
print(calc_1)
print(calc_2)
print(calc_3)
print(calc_4)
print(calc_5)
print(calc_6)
print("total time:", time.time() - start)
You can use either class multiprocessing.pool.Pool or concurrent.futures.ProcessPoolExecutor to create a multiprocessing pool of 6 processes to which you can submit your 6 tasks in your loop to execute in parallel and await the results. The following example uses multiprocessing.pool.Pool.
But, the result will be very disappointing.
The problem is that there is (1) overhead in initially creating the 6 processes and (2) overhead in queueing up each task to execute in the separate address spaces in which the subprocesses live. This means that for multiprocessing to be advantageous, your worker function, calculation_1 in this case, needs to be a less trivial, longer-running, more CPU-intensive function. If you were to add to your worker function the following "do-nothing", CPU-intensive loop ...
    cnt = 0
    for i in range(100000):
        cnt += 1
... then the following multiprocessing code would run several times more quickly. As is, stick with what you have.
import numpy as np
import multiprocessing as mp
import time
def calculation_1(data, multiplicator):
    r = np.log(data * (multiplicator+1))
    """
    cnt = 0
    for i in range(100000):
        cnt += 1
    """
    return r
# required for Windows and other platforms that use spawn for creating new processes:
if __name__ == '__main__':
    np.random.seed(0)
    # no point in using more processes than processors:
    n_processors = min(6, mp.cpu_count())
    pool = mp.Pool(n_processors)
    start = time.time()
    for ii in range(1000000):
        data_stream_main = [np.random.uniform(0, 2.0), np.random.uniform(10, 1000.0), np.random.uniform(0, 0.01)]
        # calculation that has to be done together
        # submit tasks:
        result_1 = pool.apply_async(calculation_1, (data_stream_main[0], 2))
        result_2 = pool.apply_async(calculation_1, (data_stream_main[0], 3))
        result_3 = pool.apply_async(calculation_1, (data_stream_main[1], 2))
        result_4 = pool.apply_async(calculation_1, (data_stream_main[1], 3))
        result_5 = pool.apply_async(calculation_1, (data_stream_main[2], 2))
        result_6 = pool.apply_async(calculation_1, (data_stream_main[2], 3))
        # wait for results:
        calc_1 = result_1.get()
        calc_2 = result_2.get()
        calc_3 = result_3.get()
        calc_4 = result_4.get()
        calc_5 = result_5.get()
        calc_6 = result_6.get()
    print(calc_1)
    print(calc_2)
    print(calc_3)
    print(calc_4)
    print(calc_5)
    print(calc_6)
    print("total time:", time.time() - start)
You could factorize the calculation by separating the log(data) from the log(multiplicator).
Given that np.log(data * (multiplicator+1)) is the same as np.log(data) + np.log(multiplicator+1), you can compute and store the 2 possible values of np.log(multiplicator+1) in global variables, then only compute log(data) once per index (thus saving 50%) on that part.
# global variables and calculation function:
multiplicator2 = np.log(3)
multiplicator3 = np.log(4)
def calculation_1(data):
    logData = np.log(data)
    return logData + multiplicator2, logData + multiplicator3
# in the loop:...
calc_1,calc_2 = calculation_1(data_stream_main[0])
calc_3,calc_4 = calculation_1(data_stream_main[1])
calc_5,calc_6 = calculation_1(data_stream_main[2])
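As a quick sanity check of the identity log(a*b) = log(a) + log(b) that this factoring relies on, the following sketch compares both forms on random positive inputs. The function names direct_log and factored_log are made up for this illustration, and the two results agree only up to floating-point rounding, hence the tolerance-based comparison.

```python
import numpy as np


def direct_log(data, multiplicator):
    """The original formulation: one log of the product."""
    return np.log(data * (multiplicator + 1))


def factored_log(data, multiplicator):
    """The factored formulation: log(data) plus a constant term."""
    return np.log(data) + np.log(multiplicator + 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.uniform(0.01, 1000.0, size=1000)  # strictly positive inputs
    for multiplicator in (2, 3):
        same = np.allclose(direct_log(data, multiplicator),
                           factored_log(data, multiplicator))
        print(multiplicator, same)
```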
If you can afford to buffer several rows of data into a numpy matrix before outputting the result, you may get some performance improvement by using numpy's vectorized operations to perform the calculation on the whole matrix (or chunk) and output the result in chunks instead of one row at a time. Separating reception of the data from computation and output is where the use of threads may provide a benefit.
For example:
start = time.time()
chunk = []
# one factor per column, i.e. (multiplicator+1) for multiplicator = 2,2,2,3,3,3
multiplicators = np.array([3,3,3,4,4,4])
for ii in range(1000000):
    data_stream_main = [np.random.uniform(0, 2.0), np.random.uniform(10, 1000.0), np.random.uniform(0, 0.01)]
    chunk.append(data_stream_main*2)  # list repetition: [d0, d1, d2, d0, d1, d2]
    if len(chunk)< 1000: continue
    # process 1000 lines at a time and output results
    calcs = np.log(np.array(chunk)*multiplicators)
    # columns are ordered (d0,d1,d2) for multiplicator 2, then (d0,d1,d2) for multiplicator 3
    calc_1,calc_3,calc_5,calc_2,calc_4,calc_6 = calcs[-1,:]
    chunk = [] # reset chunk
print("total time:", time.time() - start) # 2.7 (compared to 6.6)
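The idea of separating reception from computation could look roughly like the sketch below, which uses a queue.Queue between the receiving code and a single worker thread. Everything here (receiver, consumer, process_stream, the STOP sentinel, the chunk size) is an assumption made for illustration; a real data feed would replace the rows argument.

```python
import queue
import threading

import numpy as np

FACTORS = np.array([3, 3, 3, 4, 4, 4])  # multiplicator + 1 for each column
STOP = object()  # sentinel telling the consumer to finish


def receiver(rows, out_queue, chunk_size):
    """Collect incoming rows into chunks and hand them to the consumer."""
    chunk = []
    for row in rows:
        chunk.append(list(row) * 2)  # [d0, d1, d2, d0, d1, d2]
        if len(chunk) == chunk_size:
            out_queue.put(chunk)
            chunk = []
    if chunk:
        out_queue.put(chunk)  # flush the final partial chunk
    out_queue.put(STOP)


def consumer(in_queue, results):
    """Compute the six logs for each chunk with one vectorized call."""
    while True:
        chunk = in_queue.get()
        if chunk is STOP:
            break
        results.append(np.log(np.array(chunk) * FACTORS))


def process_stream(rows, chunk_size=1000):
    """Receive on the calling thread, compute on a worker thread."""
    q = queue.Queue()
    results = []
    worker = threading.Thread(target=consumer, args=(q, results))
    worker.start()
    receiver(rows, q, chunk_size)
    worker.join()
    return np.vstack(results) if results else np.empty((0, 6))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rows = rng.uniform(0.01, 10.0, size=(2500, 3))
    print(process_stream(rows).shape)
```

Whether this helps in practice depends on how much of the time is spent waiting for data; the thread does not make the arithmetic itself faster.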

Python multiprocessing calculations significantly slower than sequential calculations

I was playing with multiprocessing in Python. I'm trying to distribute calculations on arrays across multiple CPU cores. To do that, I'm forking as many processes as multiprocessing.cpu_count() returns and passing subsets of the array to the processes (by partitioning the array indices). The array is operated on as a shared memory object.
However, for varying array sizes I do not see any runtime improvement. Why is that?
This is just a toy example; I'm not trying to achieve anything with these calculations.
import multiprocessing as mp
import numpy as np
import time
import sharedmem
def some_function_mult(q, arr, index, width):
    q.put((sum(arr[index:index+width])/np.amax(arr[index:index+width])**2)/40)
def some_function(arr, index, width):
    return sum((arr[index:index+width])/np.amax(arr[index:index+width])**2)/40
def main():
    num = mp.cpu_count()
    n = 200000000
    width = n // num  # integer division so the slice bounds stay integers
    random_array = np.random.randint(0,255,n)
    shared = sharedmem.empty(n)
    shared[:] = random_array
    print (shared)
    queue = mp.Queue()
    processes = [mp.Process(target=some_function_mult, args=(queue, shared, i*width, width)) for i in range(num)]
    start_time = time.time()
    for p in processes:
        p.start()
    result = []
    for p in processes:
        result.append(queue.get())
    for p in processes:
        p.join()
    end_time = time.time()
    print ('Multiprocessing execution time = ' + str(end_time-start_time))
    print (result)
    result = []
    start_time =time.time()
    for i in range(num):
        result.append(some_function(random_array, i*width, width))
    end_time = time.time()
    print ('Sequential processing time = ' + str(end_time-start_time))
    print (result)
if __name__ == '__main__':
    main()

Is this the right way to use multiprocessing queue with python?

I am trying to speed up my code by using multiprocessing with Python. The only problem I ran into when implementing multiprocessing was that my function has a return statement and I needed to save that data to a list. The best way I found using Google was to use a queue, storing the value with "q.put()" and retrieving it with "q.get()". The only issue is that I think I'm not using this the right way, because when I run the script from the command prompt, it shows I'm hardly using my CPU and I only see one Python process running. If I remove "q.get()", the process is super fast and utilizes my CPU. Am I doing this the right way?
import time
import numpy as np
import pandas as pd
import multiprocessing
from multiprocessing import Process, Queue
def test(x,y,q):
    q.put(x * y)
if __name__ == '__main__':
    q = Queue()
    one = []
    two = []
    three = []
    start_time = time.time()
    for x in np.arange(30, 60, 1):
        for y in np.arange(0.01, 2, 0.5):
            p = multiprocessing.Process(target=test, args=(x, y, q))
            p.start()
            one.append(q.get())
            two.append(int(x))
            three.append(float(y))
            print(x, ' | ', y, ' | ', one[-1])
            p.join()
    print("--- %s seconds ---" % (time.time() - start_time))
    d = {'x' : one, 'y': two, 'q' : three}
    data = pd.DataFrame(d)
    print(data.tail())
No, this is not correct. You start a process and then immediately wait for its result with q.get, so only one process runs at a time. If you want to operate on many tasks, use multiprocessing.Pool:
import time
import numpy as np
from multiprocessing import Pool
from itertools import product
def test(args):
    # tuple parameters in the signature are not valid in Python 3, so unpack here
    x, y = args
    return x, y, x * y
def main():
    start_time = time.time()
    pool = Pool()
    result = pool.map(test, product(np.arange(30, 60, 1), np.arange(0.01, 2, 0.5)))
    pool.close()
    print("--- %s seconds ---" % (time.time() - start_time))
    print(result)
if __name__ == '__main__':
    main()

Python array sum vs MATLAB

I'm slowly switching to Python and I wanted to make a simple test for comparing the performance of a simple array summation. I generate a random 1000x1000 array and add one to each of the values in this array.
Here is my script in Python:
import time
import numpy
from numpy.random import random
def testAddOne(data):
    """
    Test addOne
    """
    return data + 1
i = 1000
data = random((i,i))
start = time.perf_counter()  # time.clock() was removed in Python 3.8
for x in range(1000):
    testAddOne(data)
stop = time.perf_counter()
print(stop - start)
And my function in MATLAB:
function test
    %parameter declaration
    c=rand(1000);
    tic
    for t = 1:1000
        testAddOne(c);
    end
    fprintf('Structure: \n')
    toc
end
function testAddOne(c)
    c = c + 1;
end
The Python script takes 2.77 - 2.79 seconds, the same as the MATLAB function (I'm actually quite impressed by NumPy!). What would I have to change in my Python script to use multithreading? I can't do it in MATLAB since I don't have the toolbox.
Multi threading in Python is only useful for situations where threads get blocked, e.g. on getting input, which is not the case here (see the answers to this question for more details). However, multi processing is easy to do in Python. Multiprocessing in general is covered here.
A program taking a similar approach to your example is below
import time
import numpy
from numpy.random import random
from multiprocessing import Process
def testAddOne(data):
    return data + 1
def testAddN(data,N):
    # print("testAddN", N)
    for x in range(N):
        testAddOne(data)
if __name__ == '__main__':
    matrix_size = 1000
    num_adds = 10000
    num_processes = 4
    data = random((matrix_size,matrix_size))
    start = time.perf_counter()  # wall-clock timer; time.clock() was removed in Python 3.8
    if num_processes > 1:
        # integer division so each process gets a whole number of additions
        processes = [Process(target=testAddN, args=(data,num_adds//num_processes))
                     for i in range(num_processes)]
        for p in processes:
            p.start()
        for p in processes:
            p.join()
    else:
        testAddN(data,num_adds)
    stop = time.perf_counter()
    print("Elapsed", stop - start)
A more useful example using a pool of worker processes to successively add 1 to different matrices is below.
import time
import numpy
from numpy.random import random
from multiprocessing import Pool
def testAddOne(data):
    return data + 1
def testAddN(dataN):
    data,N=dataN
    for x in range(N):
        data = testAddOne(data)
    return data
if __name__ == '__main__':
    num_matrices = 4
    matrix_size = 1000
    num_adds_per_matrix = 2500
    num_processes = 4
    inputs = [(random((matrix_size,matrix_size)), num_adds_per_matrix)
              for i in range(num_matrices)]
    #print(inputs) # test using, e.g., matrix_size = 2
    start = time.perf_counter()  # wall-clock timer; time.clock() was removed in Python 3.8
    if num_processes > 1:
        proc_pool = Pool(processes=num_processes)
        outputs = proc_pool.map(testAddN, inputs)
    else:
        outputs = list(map(testAddN, inputs))  # map is lazy in Python 3
    stop = time.perf_counter()
    #print(outputs) # test using, e.g., matrix_size = 2
    print("Elapsed", stop - start)
In this case the code in testAddN actually does something with the result of calling testAddOne. And you can uncomment the print statements to check that some useful work is being done.
In both cases I've changed the total number of additions to 10000; with fewer additions the cost of starting up processes becomes more significant (but you can experiment with the parameters). And you can experiment with num_processes also. On my machine I found that compared to running in the same process with num_processes=1 I got just under a 2x speedup spawning four processes with num_processes=4.
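The earlier remark that threads mainly pay off when they spend their time blocked can be illustrated with a small sketch. Here time.sleep stands in for a blocking call such as waiting on input; blocking_task and run_threaded are hypothetical names, and the delays are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_task(seconds):
    """Stand-in for a call that waits (I/O, network); sleeping does not hold the GIL."""
    time.sleep(seconds)
    return seconds


def run_threaded(delays, max_workers):
    """Run blocking_task for each delay on a thread pool; return (results, elapsed)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(blocking_task, delays))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    delays = [0.2] * 4
    results, elapsed = run_threaded(delays, max_workers=4)
    print(f"{sum(delays):.1f}s of waiting finished in {elapsed:.2f}s")
```

Because the four waits overlap, the elapsed time should be close to a single delay rather than their sum; a pure-Python CPU-bound loop in place of the sleep would not be expected to show the same gain.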
