I'm trying to figure out the best way to parallelize a program like this:
global_data = some data
global_data2 = some data
data_store1 = np.empty(n)
data_store2 = np.empty(n)
.
.
.
def simulation(global_data):
retrieve values from global datasets and set element of global datastores
such that I do something like pass list(enumerate(global_data)) to a multiprocessing function, and each process sets elements of the global data stores corresponding to the received (index, vlaue) pair. I'm running on a high performance cluster with 128 cores, so I think parallelization is preferable to threading.
If you use a multiprocessing pool (e.g. a multiprocessing.Pool instance) with its map method, then your worker function, simulation, just needs to return its result back to the main process which will end up with a list of results that will be in the correct order. This will be less expensive than using a managed list to which the worker function adds its result:
import multiprocessing
def simulation(global_data_elem):
# We are passed a single element of global_data
# Do calculation with global_data_elem and return result.
# The CPU resources required to do this calculation must be sufficiently
# high to justify the additional overhead of multiprocessing (which is
# not the case for this demo):
return global_data_elem * 2
def main():
# global_data is some data (not necesarilly at global scope)
global_data = ['a', 'b', 'c']
# create pool of the correct size
# (not larger than the number of cores we have nor the number of tasks being submitted):
pool = multiprocessing.Pool(min(len(global_data), multiprocessing.cpu_count()))
# results are returned in the correct order (task submission order):
results = pool.map(simulation, global_data)
print(results)
# Required for Windows:
if __name__ == '__main__':
main()
Prints:
['aa', 'bb', 'cc']
Related
I want to use Pool to split a task among n workers. What happens is that when I'm using map with one argument in the task function, I observe that all the cores are used, all tasks are launched simultaneously.
On the other hand, when I'm using starmap, task launch is one by one and I never reach 100% CPU load.
I want to use starmap for my case because I want to pass a second argument, but there's no use if it doesn't take advantage of multiprocessing.
This is the code that works
import numpy as np
from multiprocessing import Pool
# df_a = just a pandas dataframe which I split in n parts and I
# feed each part to a task. Each one may have a few
# thousand rows
n_jobs = 16
def run_parallel(df_a):
dfs_a = np.array_split(df_a, n_jobs)
print("done split")
pool = Pool(n_jobs)
result = pool.map(task_function, dfs_a)
return result
def task_function(left_df):
print("in task function")
# execute task...
return result
result = run_parallel(df_a)
in this case, "in task function" is printed at the same time, 16 times.
This is the code that doesn't work
n_jobs = 16
# df_b: a big pandas dataframe (~1.7M rows, ~20 columns) which I
# want to send to each task as is
def run_parallel(df_a, df_b):
dfs_a = np.array_split(df_a, n_jobs)
print("done split")
pool = Pool(n_jobs)
result = pool.starmap(task_function, zip(dfs_a, repeat(df_b)))
return result
def task_function(left_df, right_df):
print("in task function")
# execute task
return result
result = run_parallel(df_a, df_b)
Here, "in task function" is printed sequentially and the processors never reach 100% capacity. I also tried workarounds based on this answer:
https://stackoverflow.com/a/5443941/6941970
but no luck. Even when I used map in this way:
from functools import partial
pool.map(partial(task_function, b=df_b), dfs_a)
considering that maybe repeat(*very big df*) would introduce memory issues, still there wasn't any real parallelization
I've a global NumPy array ys_final and have defined a function that generates an array ys. The ys array will be generated based on an input parameter and I want to add these ys arrays to the global array, i.e ys_final = ys_final + ys.
The order of addition doesn't matter so I want to use Pool.apply_async() from the multiprocessing library but I can't write to the global array. The code for reference is:
import multiprocessing as mp
ys_final = np.zeros(len)
def ys_genrator(i):
#code to generate ys array
return ys
pool = mp.Pool(mp.cpu_count())
for i in range(3954):
ys_final = ys_final + pool.apply_async(ys_genrator, args=(i)).get()
pool.close()
pool.join()
The above block of code keeps on running forever and nothing happens. I've tried *mp.Process also and still I face the same problem. There I defined a target function that directly adds to the global array but it is also not working as the block keeps running forever. Reference:
def func(i):
#code to generate ys
global ys_final
ys_final = ys_final + ys
for i in range(3954):
p = mp.Process(target=func, args=(i,))
p.start()
p.join()
Any suggestions will be really helpful.
EDIT:
My ys_genrator is a function for linear interpolation. Based on the parameter i which is an index for rows in a 2D image, the function creates an array of interpolated amplitudes that will be superimposed with all the interpolated amplitudes from the image, so ys need to be added to ys_final
The variable len is the length of the interpolated array, which is same for all rows.
For reference, a simpler version of ys_genrator(i) is as follows:
def ys_genrator(i):
ys = np.ones(10)*i
return ys
A few points:
pool.apply_async(ys_genrator, args=(i)) needs to be pool.apply_async(ys_genrator, args=(i,)). Note the comma after the i.
pool.apply_async(ys_genrator, args=(i,)).get() is exactly equivalent to pool.apply(ys.genrator, args=(i,)). That is, you will block because of your immediate call to get and you will have absolutely no parallism. You would need to do all your calls to pool.apply_async and save the returned AsyncResult instances and only then call get on these instances.
If you are running under Windows, you will have a problem. The code that creates new processes must be within a block governed by if __name__ == '__main__':
If you are running under something like Jupyter Notebook or iPython you will have a problem. The worker function, ys_genrator, would need to be in an external file and imported.
Using apply_async for submitting a lot of tasks is inefficient. You are better of using imap or imap_unordered where the tasks get submitted in "chunks" and you can process the results one by one as they become available. But you must choose a "suitable" chunksize argument.
Any code you have at the global level, such as ys_final = np.zeros(len) will be executed by every sub-process if you are running under Windows, and this can be wasteful if the subprocesses do not need to "see" this variable. If they do need to see this variable, be aware that each process in the pool will be working with its own copy of the variable so it better be a read-only usage. Even then, it can be very wasteful of storage if the variable is large. There are ways of sharing such a variable across the processes but it is not perfectly clear whether you need to (you haven't even defined variable len). So it is difficult to give you improved code. However, it appears that your worker function does not need to "see" ys_final, so I will take a shot at an improved solution.
But be aware that if your function ys_genrator is very trivial, nothing will be gained by using multiprocessing because there is overhead in both creating the processing pool and in passing arguments from one process to another. Also, if ys_genrator is using numpy, this can also be a source of problems since numpy uses multiprocessing for some of its own functions and you are better off not mixing numpy with your own multiprocessing.
import multiprocessing as mp
import numpy as np
SIZE = 3
def ys_genrator(i):
#code to generate ys array
# for this dummy example all SIZE entries will end up with the same result:
ys = [i] * SIZE # for example: [1, 1, 1]
return ys
def compute_chunksize(poolsize, iterable_size):
chunksize, remainder = divmod(iterable_size, 4 * poolsize)
if remainder:
chunksize += 1
return chunksize
if __name__ == '__main__':
ys_final = np.zeros(SIZE)
n_iterations = 3954
poolsize = min(mp.cpu_count(), n_iterations)
chunksize = compute_chunksize(poolsize, n_iterations)
print('poolsize =', poolsize, 'chunksize =', chunksize)
pool = mp.Pool(poolsize)
for result in pool.imap_unordered(ys_genrator, range(n_iterations), chunksize):
ys_final += result
print(ys_final)
Prints:
poolsize = 8 chunksize = 124
[7815081. 7815081. 7815081.]
Update
You can also just use:
for result in pool.map(ys_genrator, range(n_iterations)):
ys_final += result
The issue is that when you use method map, the method wants to compute an efficient chunksize argument based on the size of the iterable argument (see my compute_chunksize function above, which is essentially what pool.map will use). But to do this, is will have to first convert the iterable to a list to get its size. If n_iterations is very large, this is not very efficient, although it's probably not an major issue for a size of 3954. Still, you would be better off using my compute_chunksize function in this case since you know the size of the iterable and then pass the chunksize argument explicitly to map as I have done in the code using imap_unordered.
I'm have implemented an Evolutionary Algorithm process in Python 3.8, and am attempting to optimise/reduce its runtime. Due to the heavy constraints upon valid solutions, it can take a few minutes to generate valid chromosomes. To avoid spending hours just generating the initial population, I want to use Multiprocessing to generate multiple at a time.
My code at this point in time is:
populationCount = 500
def readDistanceMatrix():
# code removed
def generateAvailableValues():
# code removed
def generateAvailableValuesPerColumn():
# code removed
def generateScheduleTemplate():
# code removed
def generateChromosome():
# code removed
if __name__ == '__main__':
# Data type = DataFrame
distanceMatrix = readDistanceMatrix()
# Data type = List of Integers
availableValues = generateAvailableValues()
# Data type = List containing Lists of Integers
availableValuesPerColumn = generateAvailableValuesPerColumn(availableValues)
# Data type = DataFrame
scheduleTemplate = generateScheduleTemplate(distanceMatrix)
# Data type = List containing custom class (with Integer and DataFrame)
population = []
while len(population) < populationCount:
chrmSolution = generateChromosome(availableValuesPerColumn, scheduleTemplate, distanceMatrix)
population.append(chrmSolution)
Where the population list is filled in with the while loop at the end. I would like to replace the while loop with a Multiprocessing solution that can use up to a pre-set number of cores. For example:
population = []
availableCores = 6
while len(population) < populationCount:
while usedCores < availableCores:
# start generating another chromosome as 'chrmSolution'
population.append(chrmSolution)
However, after reading and watching hours worth of tutorials, I'm unable to get a loop up-and-running. How should I go about doing this?
It sounds like a simple multiprocessing.Pool should do the trick, or at least be a place to start. Here's a simple example of how that might look:
from multiprocessing import Pool, cpu_count
child_globals = {} #mutable object at the `module` level acts as container for globals (constants)
if __name__ == '__main__':
# ...
def init_child(availableValuesPerColumn, scheduleTemplate, distanceMatrix):
#passing variables to the child process every time is inefficient if they're
# constant, so instead pass them to the initialization function, and let
# each child re-use them each time generateChromosome is called
child_globals['availableValuesPerColumn'] = availableValuesPerColumn
child_globals['scheduleTemplate'] = scheduleTemplate
child_globals['distanceMatrix'] = distanceMatrix
def child_work(i):
#child_work simply wraps generateChromosome with inputs, and throws out dummy `i` from `range()`
return generateChromosome(child_globals['availableValuesPerColumn'],
child_globals['scheduleTemplate'],
child_globals['distanceMatrix'])
with Pool(cpu_count(),
initializer=init_child, #init function to stuff some constants into the child's global context
initargs=(availableValuesPerColumn, scheduleTemplate, distanceMatrix)) as p:
#imap_unordered doesn't make child processes wait to ensure order is preserved,
# so it keeps the cpu busy more often. it returns a generator, so we use list()
# to store the results into a list.
population = list(p.imap_unordered(child_work, range(populationCount)))
I'm performing analyses of time-series of simulations. Basically, it's doing the same tasks for every time steps. As there is a very high number of time steps, and as the analyze of each of them is independant, I wanted to create a function that can multiprocess another function. The latter will have arguments, and return a result.
Using a shared dictionnary and the lib concurrent.futures, I managed to write this :
import concurrent.futures as Cfut
def multiprocess_loop_grouped(function, param_list, group_size, Nworkers, *args):
# function : function that is running in parallel
# param_list : list of items
# group_size : size of the groups
# Nworkers : number of group/items running in the same time
# **param_fixed : passing parameters
manager = mlp.Manager()
dic = manager.dict()
executor = Cfut.ProcessPoolExecutor(Nworkers)
futures = [executor.submit(function, param, dic, *args)
for param in grouper(param_list, group_size)]
Cfut.wait(futures)
return [dic[i] for i in sorted(dic.keys())]
Typically, I can use it like this :
def read_file(files, dictionnary):
for file in files:
i = int(file[4:9])
#print(str(i))
if 'bz2' in file:
os.system('bunzip2 ' + file)
file = file[:-4]
dictionnary[i] = np.loadtxt(file)
os.system('bzip2 ' + file)
Map = np.array(multiprocess_loop_grouped(read_file, list_alti, Group_size, N_thread))
or like this :
def autocorr(x):
result = np.correlate(x, x, mode='full')
return result[result.size//2:]
def find_lambda_finger(indexes, dic, Deviation):
for i in indexes :
#print(str(i))
# Beach = Deviation[i,:] - np.mean(Deviation[i,:])
dic[i] = Anls.find_first_max(autocorr(Deviation[i,:]), valmax = True)
args = [Deviation]
Temp = Rescal.multiprocess_loop_grouped(find_lambda_finger, range(Nalti), Group_size, N_thread, *args)
Basically, it is working. But it is not working well. Sometimes it crashes. Sometimes it actually launches a number of python processes equal to Nworkers, and sometimes there is only 2 or 3 of them running at a time while I specified Nworkers = 15.
For example, a classic error I obtain is described in the following topic I raised : Calling matplotlib AFTER multiprocessing sometimes results in error : main thread not in main loop
What is the more Pythonic way to achieve what I want ? How can I improve the control this function ? How can I control more the number of running python process ?
One of the basic concepts for Python multi-processing is using queues. It works quite well when you have an input list that can be iterated and which does not need to be altered by the sub-processes. It also gives you a good control over all the processes, because you spawn the number you want, you can run them idle or stop them.
It is also a lot easier to debug. Sharing data explicitly is usually an approach that is much more difficult to setup correctly.
Queues can hold anything as they are iterables by definition. So you can fill them with filepath strings for reading files, non-iterable numbers for doing calculations or even images for drawing.
In your case a layout could look like that:
import multiprocessing as mp
import numpy as np
import itertools as it
def worker1(in_queue, out_queue):
#holds when nothing is available, stops when 'STOP' is seen
for a in iter(in_queue.get, 'STOP'):
#do something
out_queue.put({a: result}) #return your result linked to the input
def worker2(in_queue, out_queue):
for a in iter(in_queue.get, 'STOP'):
#do something differently
out_queue.put({a: result}) //return your result linked to the input
def multiprocess_loop_grouped(function, param_list, group_size, Nworkers, *args):
# your final result
result = {}
in_queue = mp.Queue()
out_queue = mp.Queue()
# fill your input
for a in param_list:
in_queue.put(a)
# stop command at end of input
for n in range(Nworkers):
in_queue.put('STOP')
# setup your worker process doing task as specified
process = [mp.Process(target=function,
args=(in_queue, out_queue), daemon=True) for x in range(Nworkers)]
# run processes
for p in process:
p.start()
# wait for processes to finish
for p in process:
p.join()
# collect your results from the calculations
for a in param_list:
result.update(out_queue.get())
return result
temp = multiprocess_loop_grouped(worker1, param_list, group_size, Nworkers, *args)
map = multiprocess_loop_grouped(worker2, param_list, group_size, Nworkers, *args)
It can be made a bit more dynamic when you are afraid that your queues will run out of memory. Than you need to fill and empty the queues while the processes are running. See this example here.
Final words: it is not more Pythonic as you requested. But it is easier to understand for a newbie ;-)
I have a function which I am applying to different chunks of my data. Since each chunk is independent of the rest, I wish to execute the function for all chunks in parallel.
I have a result dictionary which should hold the output of calculations for each chunk.
Here is how I did it:
from joblib import Parallel, delayed
import multiprocessing
cpu_count = multiprocessing.cpu_count()
# I have 8 cores, so I divide the data into 8 chunks.
endIndeces = divideIndecesUniformly(myData.shape[0], cpu_count) # e.g., [0, 125, 250, ..., 875, 1000]
# initialize result dictionary with empty lists.
result = dict()
for i in range(cpu_count):
result[i] = []
# Parallel execution for 8 chunks
Parallel(n_jobs=cpu_count)(delayed(myFunction)(myData, start_idx=endIndeces[i], end_idx=endIndeces[i+1]-1, result, i) for i in range(cpu_count))
However, when the execution is finished result has all initial empty lists. I figured that if I execute the function serially over each chunk of data, it works just fine. For example, if I replace the last line with the following, result will have all the calculated values.
# Instead of parallel execution, call the function in a for-loop.
for i in range(cpu_count):
myFunction(myData, start_idx=endIndeces[i], end_idx=endIndeces[i+1]-1, result, i)
In this case, result values are updated.
It seems that when the function is executed in parallel, it cannot write on the given dictionary (result). So, I was wondering how I can obtain the output of function for each chunk of data?
joblib, by default uses the multiprocessing module in python. According to this SO Answer, when arguments are passed to new Processes they create a fork, which copies the memory space of the current process. This means that myFunction is essentially working on a copy of result and does not modify the original.
My suggestion is to have myFunction return the desired data as a list. The call to Process will then return a list of the lists generated by myFunction. From there, it is simple to add them to results. It could look something like this:
from joblib import Parallel, delayed
import multiprocessing
if __name__ == '__main__':
cpu_count = multiprocessing.cpu_count()
endIndeces = divideIndecesUniformly(myData.shape[0], cpu_count)
# make sure myFunction returns the grouped results in a list
r = Parallel(n_jobs=cpu_count)(delayed(myFunction)(myData, start_idx=endIndeces[i], end_idx=endIndeces[i+1]-1, result, i) for i in range(cpu_count))
result = dict()
for i, data in enumerate(r): # cycles through each resultant chunk, numbered and in the original order
result[i] = data