I have created this simple code to check multiprocessing reading from a global dictionary object:
import numpy as np
import multiprocessing as mp
import psutil
from itertools import repeat
def computations_x( max_int ):
    #random selection
    mask_1 = np.random.randint( low=0, high=max_int, size=1000 )
    mask_2 = np.random.randint( low=0, high=max_int, size=1000 )
    exponent_1 = np.sqrt( np.pi )
    vector_1 = np.array( [ read_obj[ k ]**( exponent_1 ) for k in mask_1 ] )
    vector_2 = np.array( [ read_obj[ k ]**np.pi for k in mask_2 ] )
    result = []
    for j in range(100):
        res_col = []
        for i in range(100):
            c = np.multiply( vector_1, vector_2 ).sum( axis=0 )
            res_col.append(c)
        res_col = np.array( res_col )
        result.append( res_col )
    result = np.array( result )
    return result
global read_obj
total_items = 40000
max_int = 1000
keys = np.arange(0, max_int)
number_processors = psutil.cpu_count( logical=False )
#number_used_processors = 1
number_used_processors = number_processors - 1
number_tasks = number_used_processors
read_obj = { k: np.random.rand( 1000 ) for k in keys }
pool = mp.Pool( processes = number_used_processors )
args = list( repeat( max_int, number_tasks ) )
results = pool.map( computations_x, args )
pool.close()
pool.join()
However, when looking at CPU performance, I see that the CPUs are being switched by the OS while the computations run. I am running on Ubuntu 18.04; is this normal behaviour when using Python's multiprocessing module? Here is what I observe in the system monitor when debugging the code (I am using Eclipse 2019 for debugging).
Any help is appreciated. In my main project I need to share a global "read only" object across processes in the same spirit as here, and I want to be sure this is not hurting performance badly. I also want to make sure all tasks are executed concurrently within the Pool class. Thanks.
I'd say that is normal behaviour, as the OS has to make sure that other processes are not starving for CPU time.
Here's a nice article on the OS scheduler basics: https://www.ardanlabs.com/blog/2018/08/scheduling-in-go-part1.html
It's focusing on Golang but the first part is pretty general.
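If the core switching itself bothers you, one option (Linux only) is to pin each worker to a specific core from the Pool's initializer. This is just a sketch, not part of the original code; the helper name _pin_worker and the shared counter are my own additions, and whether pinning actually helps depends on the workload:

import os
import multiprocessing as mp
import psutil

def _pin_worker(counter, n_cores):
    # Each worker takes the next core index from the shared counter
    # and restricts itself to that single core (Linux only).
    with counter.get_lock():
        core = counter.value % n_cores
        counter.value += 1
    os.sched_setaffinity(0, {core})

if __name__ == '__main__':
    n_cores = psutil.cpu_count(logical=False)
    counter = mp.Value('i', 0)
    pool = mp.Pool(processes=n_cores - 1,
                   initializer=_pin_worker,
                   initargs=(counter, n_cores))
    # ... submit work with pool.map(...) exactly as before ...
    pool.close()
    pool.join()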
How do I update the multi_proc_parallel_functions function below to accept a list of args? This is using the multiprocess module.
Please note that I will be using this with AWS Lambda, and other multiprocessing modules can have issues on Lambda.
The adder functions below are simply toy functions used to demo the issue.
import multiprocess as mp
import numpy as np

def parallel_functions(function, send_end):
    send_end.send(function())

def multi_proc_parallel_functions(function_list, target_func):
    jobs = []
    pipe_list = []
    for function in function_list:
        recv_end, send_end = mp.Pipe(False)
        p = mp.Process(target=target_func, args=(function, send_end))
        jobs.append(p)
        pipe_list.append(recv_end)
        p.start()
    result_list = [x.recv() for x in pipe_list]
    for proc in jobs:
        proc.join()
    return result_list

def adder10():
    return np.random.randint(5) + 10

def adder1000():
    return np.random.randint(5) + 1000
Create a list of functions
function_list = [adder10,adder10,adder10,adder1000]
Run all functions
multi_proc_parallel_functions(function_list,parallel_functions)
[13, 13, 13, 1003]
How do I update multi_proc_parallel_functions to accept a variable-length list of args, which will vary per function, as follows:
def adder10(x, y):
    return np.random.randint(5) + 10 + x * y

def adder1000(a, b, c):
    return np.random.randint(5) + 1000 - a + b + c
I think this will require *args.
This would be one way of doing it (with positional argument support):
import multiprocessing as mp

def parallel_functions(function, send_end, *args):
    send_end.send(function(*args))

def multi_proc_parallel_functions(function_list, target_func):
    jobs = []
    pipe_list = []
    for (function, *args) in function_list:
        recv_end, send_end = mp.Pipe(False)
        p = mp.Process(target=target_func, args=(function, send_end, *args))
        jobs.append(p)
        pipe_list.append(recv_end)
        p.start()
    result_list = [x.recv() for x in pipe_list]
    for proc in jobs:
        proc.join()
    return result_list

import numpy as np

def adder10(x, y):
    return np.random.randint(5) + 10 + x * y

def adder1000(a, b, c):
    return np.random.randint(5) + 1000 - a + b + c

multi_proc_parallel_functions(
    [ (adder10, 5, 4),
      (adder10, 1, 2),
      (adder1000, 5, 6, 7) ],
    parallel_functions
)
Note that how the multiprocessing module works depends on whether you are on Windows, macOS or Linux.
On Linux, the default way of creating an mp.Process is the fork start method, which means the function and its arguments do not need to be serializable (picklable); the child process inherits memory from the parent. macOS supports fork, but Windows does not.
On Windows and macOS, the spawn start method is used by default instead (on macOS since Python 3.8). This requires that everything sent to the child process is picklable, which means you cannot send lambda expressions or dynamically created functions, for example.
Example of something that would work on Linux (with your original implementation), but not on Windows (or macOS by default):
multi_proc_parallel_functions(
    [ lambda: adder10(5, 4),
      lambda: adder10(1, 2),
      lambda: adder1000(5, 6, 7) ],
    parallel_functions
)

# spawn: _pickle.PicklingError: Can't pickle <function <lambda> at 0x7fce7cc43010>: attribute lookup <lambda> on __main__ failed
# fork:  [30, 12, 1008]
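If you want consistent behaviour across platforms, you can request a start method explicitly instead of relying on the platform default. A minimal sketch (not from the original answer; "fork" is only available on Unix-like systems, "spawn" works everywhere but requires picklable arguments):

import multiprocessing as mp

def work(x):
    return x * x

if __name__ == '__main__':
    # Ask for a specific start method via a context object.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        print(pool.map(work, range(4)))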
I would suggest using the operator module, which has functions for the math operations. This way, you can send a list of operators and values to modify the initial value in a flexible way.
Example where each argument is a tuple of (operator, value):
import operator
import numpy as np
np.random.seed(123)
def adder(*args):
    x = np.random.randint(5)
    print(x)
    for operator, value in args:
        x = operator(x, value)
        print(x)
    return x
adder((operator.add, 5), (operator.mul, 10))
This operation ((2 + 5) * 10) outputs:
2
7
70
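Another option, if lambdas are the only thing blocking you under spawn, is to pre-bind the arguments with functools.partial. A partial of a module-level function pickles fine, unlike a lambda. This is only a sketch reusing the two-argument adder10, the three-argument adder1000 and the original zero-argument calling convention of multi_proc_parallel_functions from above:

from functools import partial

# Each entry is a zero-argument callable, like in the original question,
# but the arguments are baked in with partial instead of a lambda.
function_list = [
    partial(adder10, 5, 4),
    partial(adder10, 1, 2),
    partial(adder1000, 5, 6, 7),
]

multi_proc_parallel_functions(function_list, parallel_functions)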
I want to solve a multi-objective optimization problem using DEAP, a Python-based framework. Because the processes are time consuming, I need to use all of my CPU power for the computation. So I used the multiprocessing library as suggested in the DEAP documentation and this example, but it results in PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed.
My full code is too long to write down here, but the following code is similar to mine and results in the same error. Can you please tell me where I am making a mistake?
Thanks in advance
import multiprocessing
from deap import creator, base, tools, algorithms
import random
import numpy
import matplotlib.pyplot as plt

def TEST(dec_var):
    return dec_var[0]**2 + dec_var[1]**2, (dec_var[0] - 2)**2 + dec_var[1]**2

def feasible(dec_var):
    if all(i > 0 for i in dec_var):
        return True
    return False

creator.create("FitnessMin", base.Fitness, weights=(-1.0, -1.0))
creator.create("Individual", list, fitness=creator.FitnessMin)

toolbox = base.Toolbox()
toolbox.register("uniform", random.uniform, 0.0, 7.0)
toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.uniform, n=2)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutGaussian, mu=0, sigma=1, indpb=0.1)
toolbox.register("select", tools.selNSGA2)
toolbox.register("evaluate", TEST)
toolbox.decorate("evaluate", tools.DeltaPenalty(feasible, (1000, 1000)))

def main(seed=None):
    random.seed(seed)

    NGEN = 250
    MU = 100
    CXPB = 0.9

    stats_func1 = tools.Statistics(key=lambda ind: ind.fitness.values[0])
    stats_func2 = tools.Statistics(key=lambda ind: ind.fitness.values[1])
    stats = tools.MultiStatistics(func1=stats_func1, func2=stats_func2)
    stats.register("avg", numpy.mean, axis=0)
    stats.register("std", numpy.std, axis=0)
    stats.register("min", numpy.min, axis=0)
    stats.register("max", numpy.max, axis=0)

    logbook = tools.Logbook()
    logbook.header = "gen", "evals", "func1", "func2"
    logbook.chapters["func1"].header = "min", "max"
    logbook.chapters["func2"].header = "min", "max"

    pop = toolbox.population(n=MU)

    invalid_ind = [ind for ind in pop if not ind.fitness.valid]
    fitnesses = toolbox.map(toolbox.evaluate, invalid_ind)
    for ind, fit in zip(invalid_ind, fitnesses):
        ind.fitness.values = fit

    pop = toolbox.select(pop, len(pop))

    record = stats.compile(pop)
    logbook.record(gen=0, evals=len(invalid_ind), **record)
    print(logbook.stream)

    for gen in range(1, NGEN):
        offspring = tools.selTournamentDCD(pop, len(pop))
        offspring = [toolbox.clone(ind) for ind in offspring]

        for ind1, ind2 in zip(offspring[::2], offspring[1::2]):
            if random.random() <= CXPB:
                toolbox.mate(ind1, ind2)

            toolbox.mutate(ind1)
            toolbox.mutate(ind2)
            del ind1.fitness.values, ind2.fitness.values

        invalid_ind = [ind for ind in offspring if not ind.fitness.valid]
        fitnesses = toolbox.map(toolbox.evaluate, invalid_ind)
        for ind, fit in zip(invalid_ind, fitnesses):
            ind.fitness.values = fit

        pop = toolbox.select(pop + offspring, MU)
        record = stats.compile(pop)
        logbook.record(gen=gen, evals=len(invalid_ind), **record)
        print(logbook.stream)

    return pop, logbook

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=4)
    toolbox.register("map", pool.map)
    pop, stats = main()
    pool.close()
    print(pop)
This is noted on the documentation page you linked:
The pickling of lambda function is not yet available in Python.
You are using lambda functions in stats_func1 and stats_func2 – just move them to global functions and try again:
def stats_key_1(ind):
    return ind.fitness.values[0]

def stats_key_2(ind):
    return ind.fitness.values[1]

# ... snip ...

def main(seed=None):
    # ... snip ...
    stats_func1 = tools.Statistics(key=stats_key_1)
    stats_func2 = tools.Statistics(key=stats_key_2)
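If you want to check up front whether the callables you hand to the pool will survive the trip to a worker, a quick sanity test with pickle can save a debugging round. A small sketch, not part of the original answer, reusing stats_key_1 defined above:

import pickle

def is_picklable(obj):
    # True if the object can be serialized for sending to a worker process.
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

print(is_picklable(stats_key_1))                         # True: module-level function
print(is_picklable(lambda ind: ind.fitness.values[0]))   # False: lambdas cannot be pickled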
I want to make calls to pool.apply_async(func) and accumulate the results as soon as they are available without waiting for each other.
import multiprocessing
import numpy as np

chrNames = ['chr1', 'chr2', 'chr3']
sims = [1, 2, 3]

def accumulate_chrBased_simBased_result(chrBased_simBased_result, accumulatedSignalArray, accumulatedCountArray):
    signalArray = chrBased_simBased_result[0]
    countArray = chrBased_simBased_result[1]
    accumulatedSignalArray += signalArray
    accumulatedCountArray += countArray

def func(chrName, simNum):
    print('%s %d' % (chrName, simNum))
    result = []
    signal_array = np.full((10000,), simNum, dtype=float)
    count_array = np.full((10000,), simNum, dtype=int)
    result.append(signal_array)
    result.append(count_array)
    return result

if __name__ == '__main__':
    accumulatedSignalArray = np.zeros((10000,), dtype=float)
    accumulatedCountArray = np.zeros((10000,), dtype=int)
    numofProcesses = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(numofProcesses)

    for chrName in chrNames:
        for simNum in sims:
            result = pool.apply_async(func, (chrName, simNum,))
            accumulate_chrBased_simBased_result(result.get(), accumulatedSignalArray, accumulatedCountArray)

    pool.close()
    pool.join()

    print(accumulatedSignalArray)
    print(accumulatedCountArray)
Written this way, each pool.apply_async call waits for the previous one to finish.
Is there a way to get rid of this waiting?
You are calling result.get() on each iteration, which makes the main process wait for that function to finish before submitting the next one.
Below is a working version, with prints showing that accumulation happens as soon as each call to func is ready, and with random sleeps added to produce sizable differences in execution time.
import multiprocessing
import numpy as np
from time import time, sleep
from random import random

chrNames = ['chr1', 'chr2', 'chr3']
sims = [1, 2, 3]

def accumulate_chrBased_simBased_result(chrBased_simBased_result, accumulatedSignalArray, accumulatedCountArray):
    signalArray = chrBased_simBased_result[0]
    countArray = chrBased_simBased_result[1]
    accumulatedSignalArray += signalArray
    accumulatedCountArray += countArray

def func(chrName, simNum):
    result = []
    sleep(random() * 5)
    signal_array = np.full((10000,), simNum, dtype=float)
    count_array = np.full((10000,), simNum, dtype=int)
    result.append(signal_array)
    result.append(count_array)
    print('%s %d' % (chrName, simNum))
    return result

if __name__ == '__main__':
    accumulatedSignalArray = np.zeros((10000,), dtype=float)
    accumulatedCountArray = np.zeros((10000,), dtype=int)
    numofProcesses = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(numofProcesses)

    results = []
    for chrName in chrNames:
        for simNum in sims:
            results.append(pool.apply_async(func, (chrName, simNum,)))

    for i in results:
        print(i)

    # poll until every AsyncResult is ready, accumulating as each one finishes
    while results:
        for r in results[:]:
            if r.ready():
                print('{} is ready'.format(r))
                accumulate_chrBased_simBased_result(r.get(), accumulatedSignalArray, accumulatedCountArray)
                results.remove(r)

    pool.close()
    pool.join()

    print(accumulatedSignalArray)
    print(accumulatedCountArray)
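An alternative to polling with ready() is to let the pool call back into the main process as each result arrives, via apply_async's callback parameter. The callback runs in the main process, so accumulating there is safe as long as it returns quickly. This is only a sketch, reusing func, chrNames, sims and accumulate_chrBased_simBased_result as defined above:

import multiprocessing
import numpy as np

if __name__ == '__main__':
    accumulatedSignalArray = np.zeros((10000,), dtype=float)
    accumulatedCountArray = np.zeros((10000,), dtype=int)

    def on_result(chrBased_simBased_result):
        # Runs in the main process as soon as a task finishes; keep it short.
        accumulate_chrBased_simBased_result(chrBased_simBased_result,
                                            accumulatedSignalArray,
                                            accumulatedCountArray)

    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    for chrName in chrNames:
        for simNum in sims:
            pool.apply_async(func, (chrName, simNum), callback=on_result)

    pool.close()
    pool.join()   # all callbacks have run once join() returns

    print(accumulatedSignalArray)
    print(accumulatedCountArray)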
I'm trying to alter a dictionary in Python inside a process pool environment, but the dictionary isn't changed when the pool finishes.
Here's a minimal example of the problem (the returned batch_input is all zeros, although per_batch_build does change the relevant values inside the workers):
from multiprocessing import Pool, freeze_support
import numpy as np
import itertools

def test_process():
    batch_size = 2
    batch_input = {'part_evecs': np.zeros((2, 10, 10)),
                   'model_evecs': np.zeros((2, 10, 10)),
                   }

    batch_model_dist = np.zeros((2, 10, 10))

    pool = Pool(4)
    batch_output = pool.map(per_batch_build, zip(itertools.repeat(batch_input),
                                                 itertools.repeat(batch_model_dist),
                                                 list(range(batch_size))))
    pool.close()
    pool.join()

    return batch_input, batch_model_dist

# #profile
# def per_batch_build(batch_input, batch_model_dist, batch_part_dist, dataset, i_batch):
def per_batch_build(tuple_input):
    batch_input, batch_model_dist, i_batch = tuple_input

    batch_model_dist[i_batch] = np.ones((10, 10))
    batch_input['part_evecs'][i_batch] = np.ones((10, 10))
    batch_input['model_evecs'][i_batch] = np.ones((10, 10))
Unfortunately, batch_input, batch_model_dist and batch_part_dist are all zeros, even though batch_input is not zero when printed inside per_batch_build.
Using the solutions provided in previous discussions, the result stays the same (the output arrays are all zeros):
from multiprocessing import Pool, freeze_support, Manager, Array
import numpy as np
import itertools
import ctypes

def test_process():
    manager = Manager()
    shared_array_base = Array(ctypes.c_double, [0] * (2 * 10 * 10))
    shared_array = np.ctypeslib.as_array(shared_array_base.get_obj())
    shared_array = shared_array.reshape((2, 10, 10))

    batch_size = 2
    batch_input = manager.dict({'part_evecs': shared_array,
                                # 'model_evecs': np.zeros((2, 10, 10)),
                                })

    batch_model_dist = np.zeros((2, 10, 10))

    pool = Pool(4)
    batch_output = pool.map(per_batch_build, zip(itertools.repeat(batch_input),
                                                 itertools.repeat(batch_model_dist),
                                                 list(range(batch_size))))
    pool.close()
    pool.join()

    return batch_input, batch_model_dist

# #profile
# def per_batch_build(batch_input, batch_model_dist, batch_part_dist, dataset, i_batch):
def per_batch_build(tuple_input):
    batch_input, batch_model_dist, i_batch = tuple_input

    batch_model_dist[i_batch] = np.ones((10, 10))
    batch_input['part_evecs'][i_batch] = np.ones((10, 10))
    # batch_input['model_evecs'][i_batch] = np.ones((10, 10))
You are changing a copy of the object: each worker process receives its own copy of the arguments, so changes made inside per_batch_build never reach the parent's objects. Since you name them identically in both functions, this is easy to miss.
Add
print(id(batch_model_dist))
inside both functions and see for yourself.
[Edit]
I should probably also link a related answer, for example:
Is shared readonly data copied to different processes for multiprocessing?
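The usual fix is to have the worker return the slices it computed and assemble them in the parent, rather than mutating the (copied) inputs in place. Below is only a sketch of that approach, keeping the names from the question; the merging loop and the final print are my own additions:

from multiprocessing import Pool
import numpy as np

def per_batch_build(i_batch):
    # Compute and return this batch's slices instead of mutating shared inputs.
    return i_batch, np.ones((10, 10)), np.ones((10, 10)), np.ones((10, 10))

def test_process():
    batch_size = 2
    batch_input = {'part_evecs': np.zeros((2, 10, 10)),
                   'model_evecs': np.zeros((2, 10, 10))}
    batch_model_dist = np.zeros((2, 10, 10))

    with Pool(4) as pool:
        for i_batch, dist, part, model in pool.map(per_batch_build, range(batch_size)):
            # Merge the workers' results back into the parent's arrays.
            batch_model_dist[i_batch] = dist
            batch_input['part_evecs'][i_batch] = part
            batch_input['model_evecs'][i_batch] = model

    return batch_input, batch_model_dist

if __name__ == '__main__':
    batch_input, batch_model_dist = test_process()
    print(batch_input['part_evecs'].sum(), batch_model_dist.sum())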
I am writing a simple python script that I need to scale to many threads. For simplicity, I have replaced the actual function I need to use with a matrix matrix multiply. I am having trouble getting my code to scale with the number of processors. Any advice to help me get the correct speedup would be helpful! My code and results are as follows:
import numpy as np
import time
import math
from multiprocessing.dummy import Pool

res = 4
# we must iterate over all of these values
wavektests = np.linspace(.1, 2.5, res)
omegaratios = np.linspace(.1, 2.5, res)

wavekmat, omegamat = np.meshgrid(wavektests, omegaratios)

def solve_for_omegaratio( ind ):
    # obtain the indices for this run
    x_ind = ind % res
    y_ind = math.floor(ind / res)

    # obtain the value for this run
    wavek = wavektests[x_ind]
    omega = omegaratios[y_ind]

    # do some work (I have replaced the real function with this)
    randmat = np.random.rand(4000, 4000)
    nop = np.linalg.matrix_power(randmat, 3)

    # obtain a scalar value
    value = x_ind + y_ind**2.0
    return value

list_ind = range(res**2)

# Serial code execution
t0_proc = time.clock()
t0_wall = time.time()
threads = 0
dispersion = map( solve_for_omegaratio , list_ind)
displist = list(dispersion)
t1_proc = time.clock()
t1_wall = time.time()

print('serial execution')
print('wall clock time = ', t1_wall - t0_wall)
print('processor clock time = ', t1_proc - t0_proc)
print('------------------------------------------------')

# Using pool defaults
t0_proc = time.clock()
t0_wall = time.time()
if __name__ == '__main__':
    pool = Pool()
    dispersion = pool.map( solve_for_omegaratio , list_ind)
    displist = list(dispersion)
t1_proc = time.clock()
t1_wall = time.time()
pool.close()

print('num of threads = default')
print('wall clock time = ', t1_wall - t0_wall)
print('processor clock time = ', t1_proc - t0_proc)
print('------------------------------------------------')

# Using 4 threads
t0_proc = time.clock()
t0_wall = time.time()
threads = 4
if __name__ == '__main__':
    pool = Pool(threads)
    dispersion = pool.map( solve_for_omegaratio , list_ind)
    displist = list(dispersion)
t1_proc = time.clock()
t1_wall = time.time()
pool.close()

print('num of threads = ' + str(threads))
print('wall clock time = ', t1_wall - t0_wall)
print('processor clock time = ', t1_proc - t0_proc)
print('------------------------------------------------')
Results:
serial execution
wall clock time = 66.1561758518219
processor clock time = 129.16376499999998
------------------------------------------------
num of threads = default
wall clock time = 81.86436200141907
processor clock time = 263.45369
------------------------------------------------
num of threads = 4
wall clock time = 77.63390111923218
processor clock time = 260.66285300000004
------------------------------------------------
Because Python has a GIL (https://wiki.python.org/moin/GlobalInterpreterLock), "Python-native" threads cannot execute truly concurrently, so they cannot improve the performance of CPU-bound tasks like math. They can be used to parallelize IO-bound tasks effectively (e.g. API calls that spend almost all of their time waiting for network I/O). Forking separate processes with multiprocessing, rather than multiprocessing.dummy's thread-backed implementation, creates multiple processes instead of threads, and those can run concurrently (at the cost of significant memory overhead).
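A minimal change that usually restores the expected speedup for CPU-bound work is to swap multiprocessing.dummy for a process-backed pool, keeping the Pool creation under the __main__ guard so it also works with the spawn start method. This is only a sketch with a smaller placeholder workload, not a rewrite of the full benchmark:

import time
import numpy as np
from multiprocessing import Pool   # processes, not multiprocessing.dummy's threads

def solve_for_omegaratio(ind):
    # Same kind of CPU-bound placeholder work as in the question, scaled down.
    randmat = np.random.rand(1000, 1000)
    np.linalg.matrix_power(randmat, 3)
    return ind

if __name__ == '__main__':
    t0 = time.time()
    with Pool(processes=4) as pool:
        displist = pool.map(solve_for_omegaratio, range(16))
    print('wall clock time = ', time.time() - t0)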