I was playing with multiprocessing in Python. I'm trying to distribute calculations on arrays to multiple CPU cores. To do that, I fork as many processes as multiprocessing.cpu_count() returns and pass subsets of the array to the processes (by partitioning the array indices). The array is operated on as a shared-memory object.
However, regardless of the array size I see no runtime improvement. Why is that?
This is just a toy example; I'm not trying to achieve anything with these calculations.
import multiprocessing as mp
import numpy as np
import time
import sharedmem

def some_function_mult(q, arr, index, width):
    q.put((sum(arr[index:index+width])/np.amax(arr[index:index+width])**2)/40)

def some_function(arr, index, width):
    return sum((arr[index:index+width])/np.amax(arr[index:index+width])**2)/40

def main():
    num = mp.cpu_count()
    n = 200000000
    width = n/num
    random_array = np.random.randint(0,255,n)
    shared = sharedmem.empty(n)
    shared[:] = random_array
    print (shared)

    queue = mp.Queue()
    processes = [mp.Process(target=some_function_mult, args=(queue, shared, i*width, width)) for i in xrange(num)]

    start_time = time.time()
    for p in processes:
        p.start()
    result = []
    for p in processes:
        result.append(queue.get())
    for p in processes:
        p.join()
    end_time = time.time()
    print ('Multiprocessing execution time = ' + str(end_time-start_time))
    print (result)

    result = []
    start_time = time.time()
    for i in range(num):
        result.append(some_function(random_array, i*width, width))
    end_time = time.time()
    print ('Sequential processing time = ' + str(end_time-start_time))
    print (result)

if __name__ == '__main__':
    main()
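As an aside (not an answer to the timing question above), the same partitioning idea can be written with the standard-library multiprocessing.shared_memory module (Python 3.8+) instead of the third-party sharedmem package. A minimal sketch, with the array size and worker function chosen only for illustration:

import multiprocessing as mp
from multiprocessing import shared_memory
import numpy as np

def worker(q, shm_name, n, index, width):
    # attach to the existing shared block by name and view it as a numpy array
    shm = shared_memory.SharedMemory(name=shm_name)
    arr = np.ndarray((n,), dtype=np.float64, buffer=shm.buf)
    part = arr[index:index + width]
    q.put(part.sum() / np.amax(part) ** 2 / 40)
    shm.close()

if __name__ == '__main__':
    num = mp.cpu_count()
    n = 1_000_000
    width = n // num

    shm = shared_memory.SharedMemory(create=True, size=n * 8)  # 8 bytes per float64
    arr = np.ndarray((n,), dtype=np.float64, buffer=shm.buf)
    arr[:] = np.random.randint(0, 255, n)

    q = mp.Queue()
    procs = [mp.Process(target=worker, args=(q, shm.name, n, i * width, width))
             for i in range(num)]
    for p in procs:
        p.start()
    results = [q.get() for _ in procs]   # drain the queue before joining
    for p in procs:
        p.join()

    shm.close()
    shm.unlink()
    print(results)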
I'm trying to implement a task in parallel using concurrent.futures. Please find below a piece of code for it:
import os
import time
from concurrent.futures import ProcessPoolExecutor as PE
import concurrent.futures

# num CPUs
cpu_num = len(os.sched_getaffinity(0))
print("Number of cpu available : ", cpu_num)

# max_Worker = cpu_num
max_Worker = 1

# A fake input array
n = 1000000
array = list(range(n))
results = []

# A fake function being applied to each element of array
def task(i):
    return i**2

x = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=max_Worker) as executor:
    features = {executor.submit(task, j) for j in array}

    # the real function is heavy and we need to be sure of completeness of each run
    for future in concurrent.futures.as_completed(features):
        results.append(future.result())
    results = [future.result() for future in features]
y = time.time()

print('=========================================')
print(f"Train data preparation time (s): {(y-x)}")
print('=========================================')
And now my questions:
Although there is no error, is it correct/optimized?
While playing with the number of workers, there seems to be no improvement in speed (e.g., 1 vs. 16, no difference). What's the problem, and how can it be solved?
Thanks in advance,
See my comment to your question. To the overhead I mentioned in that comment you also need to add the overhead of creating the process pool itself.
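As a rough illustration of that fixed cost, here is a minimal sketch (not part of the benchmark below) that times nothing but pool start-up plus a single trivial task; the absolute number depends heavily on the platform and start method:

import time
from concurrent.futures import ProcessPoolExecutor

if __name__ == '__main__':
    t1 = time.time()
    with ProcessPoolExecutor() as executor:
        # submitting one trivial builtin forces at least one worker process to start
        executor.submit(abs, -1).result()
    print('Pool start-up + one trivial task:', time.time() - t1)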
The following is a benchmark with several results. The first is a timing from just calling the worker function task 100000 times and creating a results list and printing out the last element of that list. It will become apparent why I have reduced the number of times I am calling task from 1000000 to 100000.
The next attempt is to use multiprocessing to accomplish the same thing using a ProcessPoolExecutor with the submit method and then processing the Future instances that are returned.
The next attempt is to instead use the map method with the default chunksize argument of 1. It is important to understand this argument. With a chunksize value of 1, each element of the iterable that is passed to the map method is written individually to a queue of tasks as a chunk to be processed by the processes in the pool. When a pool process becomes idle looking for work, it pulls the next chunk of tasks from the queue, processes each task comprising the chunk and then becomes idle again. When a lot of tasks are being submitted via map, a chunksize value of 1 is inefficient. You would expect its performance to be equivalent to repeatedly issuing submit calls for each element of the iterable.
The next attempt specifies a chunksize value which approximates more or less the value that the map function used by the Pool class in the multiprocessing package would have used by default. As you can see, the improvement is dramatic, but still not an improvement over the non-multiprocessing case.
The final attempt uses the multiprocessing facility provided by package multiprocessing and its multiprocessing.pool.Pool class. The difference in this benchmark is that its map function uses a more intelligent default chunksize when no chunksize argument is specified.
import os
import time
from concurrent.futures import ProcessPoolExecutor as PE
from multiprocessing import Pool

# A fake function being applied to each element of array
def task(i):
    return i**2

# required for Windows:
if __name__ == '__main__':
    n = 100000

    t1 = time.time()
    results = [task(i) for i in range(n)]
    print('Non-multiprocessing time:', time.time() - t1, results[-1])

    # num CPUs
    cpu_num = os.cpu_count()
    print("Number of CPUs available: ", cpu_num)

    t1 = time.time()
    with PE(max_workers=cpu_num) as executor:
        futures = [executor.submit(task, i) for i in range(n)]
        results = [future.result() for future in futures]
    print('Multiprocessing time using submit:', time.time() - t1, results[-1])

    t1 = time.time()
    with PE(max_workers=cpu_num) as executor:
        results = list(executor.map(task, range(n)))
    print('Multiprocessing time using map:', time.time() - t1, results[-1])

    t1 = time.time()
    chunksize = n // (4 * cpu_num)
    with PE(max_workers=cpu_num) as executor:
        results = list(executor.map(task, range(n), chunksize=chunksize))
    print(f'Multiprocessing time using map: {time.time() - t1}, chunksize: {chunksize}', results[-1])

    t1 = time.time()
    with Pool(cpu_num) as executor:
        results = executor.map(task, range(n))
    print('Multiprocessing time using Pool.map:', time.time() - t1, results[-1])
Prints:
Non-multiprocessing time: 0.027019739151000977 9999800001
Number of CPUs available: 8
Multiprocessing time using submit: 77.34723353385925 9999800001
Multiprocessing time using map: 79.52981925010681 9999800001
Multiprocessing time using map: 0.30500149726867676, chunksize: 3125 9999800001
Multiprocessing time using Pool.map: 0.2799997329711914 9999800001
Update
The following benchmarks use a version of task that is very CPU-intensive and show the benefit of multiprocessing. It also seems that for this small iterable size (100), forcing a chunksize value of 1 for the Pool.map case (by default it would compute a chunksize value of 4) is slightly more performant.
import os
import time
from concurrent.futures import ProcessPoolExecutor as PE
from multiprocessing import Pool

# A fake function being applied to each element of array
def task(i):
    for _ in range(1_000_000):
        result = i ** 2
    return result

def compute_chunksize(iterable_size, pool_size):
    chunksize, remainder = divmod(iterable_size, pool_size * 4)
    if remainder:
        chunksize += 1
    return chunksize

# required for Windows:
if __name__ == '__main__':
    n = 100
    cpu_num = os.cpu_count()
    chunksize = compute_chunksize(n, cpu_num)

    t1 = time.time()
    results = [task(i) for i in range(n)]
    t2 = time.time()
    print('Non-multiprocessing time:', t2 - t1, results[-1])

    # num CPUs
    print("Number of CPUs available: ", cpu_num)

    t1 = time.time()
    with PE(max_workers=cpu_num) as executor:
        futures = [executor.submit(task, i) for i in range(n)]
        results = [future.result() for future in futures]
    t2 = time.time()
    print('Multiprocessing time using submit:', t2 - t1, results[-1])

    t1 = time.time()
    with PE(max_workers=cpu_num) as executor:
        results = list(executor.map(task, range(n)))
    t2 = time.time()
    print('Multiprocessing time using map:', t2 - t1, results[-1])

    t1 = time.time()
    with PE(max_workers=cpu_num) as executor:
        results = list(executor.map(task, range(n), chunksize=chunksize))
    t2 = time.time()
    print(f'Multiprocessing time using map: {t2 - t1}, chunksize: {chunksize}', results[-1])

    t1 = time.time()
    with Pool(cpu_num) as executor:
        results = executor.map(task, range(n))
    t2 = time.time()
    print('Multiprocessing time using Pool.map:', t2 - t1, results[-1])

    t1 = time.time()
    with Pool(cpu_num) as executor:
        results = executor.map(task, range(n), chunksize=1)
    t2 = time.time()
    print('Multiprocessing time using Pool.map (chunksize=1):', t2 - t1, results[-1])
Prints:
Non-multiprocessing time: 23.12758779525757 9801
Number of CPUs available: 8
Multiprocessing time using submit: 5.336004018783569 9801
Multiprocessing time using map: 5.364996671676636 9801
Multiprocessing time using map: 5.444890975952148, chunksize: 4 9801
Multiprocessing time using Pool.map: 5.400001287460327 9801
Multiprocessing time using Pool.map (chunksize=1): 4.698001146316528 9801
I have one stream of data that is coming in very fast, and when new data arrives, I would like to make 6 different calculations based on it.
I would like to make those calculations as fast as possible so I can update as soon as I receive new data.
The data can arrive as often as every millisecond, so my calculations must be very fast.
So the best thing I could think of was to run those calculations on 6 different threads at the same time.
I have never used threads before, so I don't know where to put them.
This is the code that describes my problem.
What can I do from here?
import numpy as np
import time

np.random.seed(0)

def calculation_1(data, multiplicator):
    r = np.log(data * (multiplicator+1))
    return r

start = time.time()
for ii in range(1000000):
    data_stream_main = [np.random.uniform(0, 2.0), np.random.uniform(10, 1000.0), np.random.uniform(0, 0.01)]

    # calculation that has to be done together
    calc_1 = calculation_1(data=data_stream_main[0], multiplicator=2)
    calc_2 = calculation_1(data=data_stream_main[0], multiplicator=3)
    calc_3 = calculation_1(data=data_stream_main[1], multiplicator=2)
    calc_4 = calculation_1(data=data_stream_main[1], multiplicator=3)
    calc_5 = calculation_1(data=data_stream_main[2], multiplicator=2)
    calc_6 = calculation_1(data=data_stream_main[2], multiplicator=3)

    print(calc_1)
    print(calc_2)
    print(calc_3)
    print(calc_4)
    print(calc_5)
    print(calc_6)
print("total time:", time.time() - start)
You can use either class multiprocessing.pool.Pool or concurrent.futures.ProcessPoolExecutor to create a multiprocessing pool of 6 processes to which you can submit your 6 tasks in your loop to execute in parallel and await the results. The following example uses multiprocessing.pool.Pool.
But, the result will be very disappointing.
The problem is that (1) there is overhead in initially creating the 6 processes and (2) there is overhead in queueing up each task to execute in the different address space in which the subprocesses live. This means that for multiprocessing to be advantageous, your worker function, calculation_1 in this case, needs to be a less trivial, longer-running, more CPU-intensive function. If you were to add the following "do-nothing", CPU-intensive loop to your worker function ...
cnt = 0
for i in range(100000):
    cnt += 1
... then the following multiprocessing code would run several times more quickly. As is, stick with what you have.
import numpy as np
import multiprocessing as mp
import time

def calculation_1(data, multiplicator):
    r = np.log(data * (multiplicator+1))
    """
    cnt = 0
    for i in range(100000):
        cnt += 1
    """
    return r

# required for Windows and other platforms that use spawn for creating new processes:
if __name__ == '__main__':
    np.random.seed(0)
    # no point in using more processes than processors:
    n_processors = min(6, mp.cpu_count())
    pool = mp.Pool(n_processors)

    start = time.time()
    for ii in range(1000000):
        data_stream_main = [np.random.uniform(0, 2.0), np.random.uniform(10, 1000.0), np.random.uniform(0, 0.01)]

        # calculation that has to be done together
        # submit tasks:
        result_1 = pool.apply_async(calculation_1, (data_stream_main[0], 2))
        result_2 = pool.apply_async(calculation_1, (data_stream_main[0], 3))
        result_3 = pool.apply_async(calculation_1, (data_stream_main[1], 2))
        result_4 = pool.apply_async(calculation_1, (data_stream_main[1], 3))
        result_5 = pool.apply_async(calculation_1, (data_stream_main[2], 2))
        result_6 = pool.apply_async(calculation_1, (data_stream_main[2], 3))

        # wait for results:
        calc_1 = result_1.get()
        calc_2 = result_2.get()
        calc_3 = result_3.get()
        calc_4 = result_4.get()
        calc_5 = result_5.get()
        calc_6 = result_6.get()

        print(calc_1)
        print(calc_2)
        print(calc_3)
        print(calc_4)
        print(calc_5)
        print(calc_6)

    print("total time:", time.time() - start)
You could factorize the calculation by separating the log(data) from the log(multiplicator).
Given that np.log(data * (multiplicator+1)) is the same as np.log(data) + np.log(multiplicator+1), you can compute and store the 2 possible values of np.log(multiplicator+1) in global variables, then only compute log(data) once per index (thus saving 50% on that part).
# global variables and calculation function:
multiplicator2 = np.log(3)
multiplicator3 = np.log(4)

def calculation_1(data):
    logData = np.log(data)
    return logData + multiplicator2, logData + multiplicator3

# in the loop: ...
calc_1, calc_2 = calculation_1(data_stream_main[0])
calc_3, calc_4 = calculation_1(data_stream_main[1])
calc_5, calc_6 = calculation_1(data_stream_main[2])
If you can afford to buffer several rows of data into a numpy matrix before outputting the result, you may get some performance improvement by using numpy's parallelism to perform the calculation on the whole matrix (or chunk) and output the results in chunks instead of one row at a time. Separating reception of the data from computation and output is where the use of threads may provide a benefit.
For example:
start = time.time()
chunk = []
multiplicators = np.array([2,2,2,3,3,3])
for ii in range(1000000):
    data_stream_main = [np.random.uniform(0, 2.0), np.random.uniform(10, 1000.0), np.random.uniform(0, 0.01)]
    chunk.append(data_stream_main*2)
    if len(chunk) < 1000: continue
    # process 1000 lines at a time and output results
    calcs = np.log(np.array(chunk)*multiplicators)
    calc_1, calc_4, calc_2, calc_5, calc_3, calc_6 = calcs[-1,:]
    chunk = []  # reset chunk
print("total time:", time.time() - start)  # 2.7 (compared to 6.6)
I want to translate a huge MATLAB model to Python. Therefore I need to work on the key functions first. One key function handles parallel processing. Basically, a matrix with parameters is the input, in which every row represents the parameters for one run. These parameters are used within a computation-heavy function. This computation-heavy function should run in parallel; I don't need the results of a previous run for any other run, so all processes can run independently of each other.
Why is starmap_async slower on my PC? Also: when I add more code (to test consecutive computation) my Python crashes (I use Spyder). Can you give me advice?
import time
import numpy as np
import multiprocessing as mp
from functools import partial

# Create simulated data matrix
data = np.random.random((100,3000))
data = np.column_stack((np.arange(1,len(data)+1,1),data))

def EAF_DGL(*z, package_num):
    sum_row = 0
    for i in range(1,np.shape(z)[0]):
        sum_row = sum_row + z[i]
    func_result = np.column_stack((package_num,z[0],sum_row))
    return func_result

t0 = time.time()

if __name__ == "__main__":
    package_num = 1
    help_EAF_DGL = partial(EAF_DGL, package_num=1)
    with mp.Pool() as pool:
        #result = pool.starmap(partial(EAF_DGL, package_num), [(data[i]) for i in range(0,np.shape(data)[0])])
        result = pool.starmap_async(help_EAF_DGL, [(data[i]) for i in range(0,np.shape(data)[0])]).get()
        pool.close()
        pool.join()

t1 = time.time()
calculation_time_parallel_async = t1-t0
print(calculation_time_parallel_async)

t2 = time.time()

if __name__ == "__main__":
    package_num = 1
    help_EAF_DGL = partial(EAF_DGL, package_num=1)
    with mp.Pool() as pool:
        #result = pool.starmap(partial(EAF_DGL, package_num), [(data[i]) for i in range(0,np.shape(data)[0])])
        result = pool.starmap(help_EAF_DGL, [(data[i]) for i in range(0,np.shape(data)[0])])
        pool.close()
        pool.join()

t3 = time.time()
calculation_time_parallel = t3-t2
print(calculation_time_parallel)
I am learning about python's multiprocessing module. I want to make my code use all my CPU resources. This is the code I wrote:
from multiprocessing import Process
import time

def work():
    for i in range(1000):
        x = 5
        y = 10
        z = x + y

if __name__ == '__main__':
    start1 = time.time()
    for i in range(100):
        p = Process(target=work)
        p.start()
        p.join()
    end1 = time.time()

    start = time.time()
    for i in range(100):
        work()
    end = time.time()

    print(f'With Parallel {end1-start1}')
    print(f'Without Parallel {end-start}')
The output I get is this:
With Parallel 0.8802454471588135
Without Parallel 0.00039649009704589844
I tried experimenting with different range values in the for loops and using a print statement only in the work function, but every time the non-parallel version runs faster. Is there something I am missing?
Thanks in advance!
Your benchmark method is problematic:
for i in range(100):
    p = Process(target=work)
    p.start()
    p.join()
I guess you want to run 100 processes in parallel, but Process.join() blocks until the process exits, so you effectively run in serial. Besides, running more busy processes than the number of CPU cores leads to high CPU contention, which is a performance penalty. And, as a comment pointed out, your work() function is too simple compared to the overhead of process creation.
A better version:
import multiprocessing
import time

def work():
    for i in range(2000000):
        pow(i, 10)

n_processes = multiprocessing.cpu_count()  # 8
total_runs = n_processes * 4

ps = []
n = total_runs
start1 = time.time()
while n:
    # ensure processes number limit
    ps = [p for p in ps if p.is_alive()]
    if len(ps) < n_processes:
        p = multiprocessing.Process(target=work)
        p.start()
        ps.append(p)
        n = n-1
    else:
        time.sleep(0.01)
# wait for all processes to finish
while any(p.is_alive() for p in ps):
    time.sleep(0.01)
end1 = time.time()

start = time.time()
for i in range(total_runs):
    work()
end = time.time()

print(f'With Parallel {end1-start1:.4f}s')
print(f'Without Parallel {end-start:.4f}s')
print(f'Acceleration factor {(end-start)/(end1-start1):.2f}')
result:
With Parallel 4.2835s
Without Parallel 33.0244s
Acceleration factor 7.71
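For comparison, the same cap on concurrent workers can be obtained with a multiprocessing.Pool, which does the bookkeeping for you. A minimal sketch reusing the work() function above (timings will of course differ from the numbers shown):

import multiprocessing
import time

def work():
    for i in range(2000000):
        pow(i, 10)

if __name__ == '__main__':
    n_processes = multiprocessing.cpu_count()
    total_runs = n_processes * 4

    start = time.time()
    with multiprocessing.Pool(n_processes) as pool:
        # submit all runs; the pool keeps at most n_processes busy at once
        results = [pool.apply_async(work) for _ in range(total_runs)]
        for r in results:
            r.get()              # wait for every run to finish
    print(f'With Pool {time.time() - start:.4f}s')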
I'm trying to update a shared variable (a numpy array in a namespace) when using the multiprocessing module. However, the variable is not updated and I don't understand why.
Here is a sample code to illustrate this:
from multiprocessing import Process, Manager
import numpy as np

chunk_size = 15
arr_length = 1000
jobs = []
namespace = Manager().Namespace()
namespace.arr = np.zeros(arr_length)
nb_chunk = arr_length/chunk_size + 1

def foo(i, ns):
    from_idx = chunk_size*i
    to_idx = min(arr_length, chunk_size*(i+1))
    ns.arr[from_idx:to_idx] = np.random.randint(0, 100, to_idx-from_idx)

for i in np.arange(nb_chunk):
    p = Process(target=foo, args=(i, namespace))
    p.start()
    jobs.append(p)

for i in np.arange(nb_chunk):
    jobs[i].join()

print namespace.arr[:10]
You cannot share built-in objects like list or dict across processes in Python. In order to share data between processes, Python's multiprocessing provides two data structures:
Queue()
Pipe()
Also read: Exchanging objects between processes
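A minimal sketch of the Queue approach (illustrative only; the worker function and values are made up, not taken from the question):

from multiprocessing import Process, Queue

def worker(q, i):
    q.put((i, i * i))          # send a result back to the parent

if __name__ == '__main__':
    q = Queue()
    processes = [Process(target=worker, args=(q, i)) for i in range(4)]
    for p in processes:
        p.start()
    results = [q.get() for _ in processes]   # collect one result per process
    for p in processes:
        p.join()
    print(results)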
The issue is that the Manager().Namespace() object doesn't notice that you're changing anything with ns.arr[from_idx:to_idx] = ... (as you're modifying an inner data structure), and thus the change doesn't propagate to the other processes.
This answer explains very well what's going on here.
To fix it, create the list as a Manager().list() and pass this list to the processes, so that ns[from_idx:to_idx] = ... is recognized as a change and is propagated to the processes:
from multiprocessing import Process, Manager
import numpy as np

chunk_size = 15
arr_length = 1000
jobs = []
arr = Manager().list([0] * arr_length)
nb_chunk = arr_length/chunk_size + 1

def foo(i, ns):
    from_idx = chunk_size*i
    to_idx = min(arr_length, chunk_size*(i+1))
    ns[from_idx:to_idx] = np.random.randint(0, 100, to_idx-from_idx)

for i in np.arange(nb_chunk):
    p = Process(target=foo, args=(i, arr))
    p.start()
    jobs.append(p)

for i in np.arange(nb_chunk):
    jobs[i].join()

print arr[:10]
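As an aside, if the goal is specifically a shared numpy array, a multiprocessing.Array backing a numpy view avoids the Manager proxy entirely. A minimal sketch (written for Python 3; the dtype and the __main__ guard are added here for illustration):

from multiprocessing import Process, Array
import numpy as np

chunk_size = 15
arr_length = 1000

def foo(i, shared):
    arr = np.frombuffer(shared.get_obj())          # float64 view onto the shared buffer
    from_idx = chunk_size * i
    to_idx = min(arr_length, chunk_size * (i + 1))
    arr[from_idx:to_idx] = np.random.randint(0, 100, to_idx - from_idx)

if __name__ == '__main__':
    shared = Array('d', arr_length)                # doubles, zero-initialised
    nb_chunk = arr_length // chunk_size + 1
    jobs = [Process(target=foo, args=(i, shared)) for i in range(nb_chunk)]
    for p in jobs:
        p.start()
    for p in jobs:
        p.join()
    print(np.frombuffer(shared.get_obj())[:10])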