Nesting parallel processes using multiprocessing - python

Is there a way to run a function in parallel within an already parallelised function? I know that this is not possible with multiprocessing.Pool(), because a daemonic process cannot create child processes. I am fairly new to parallel computing and am struggling to find a workaround.
I currently have several thousand calculations that need to be run in parallel using a commercially available quantum mechanical code that I interface to. Each calculation has three subsequent calculations that need to be executed in parallel once the parent calculation terminates normally; if the parent calculation does not terminate normally, that is the end of the calculation for that point. I could always combine these three subsequent calculations into one big calculation and run it normally, although I would much prefer to run them separately in parallel.
Main currently looks like this: run() is the parent calculation that is first run in parallel for a series of points, and par_nacmes() is the function that I want to run in parallel for the three child calculations following normal termination of the parent.
def par_nacmes(nacme_input_data):
    nacme_dir, nacme_input, index = nacme_input_data  # Unpack info in tuple for the calculation
    axes_index = get_axis_index(nacme_input)
    [norm_term, nacme_outf] = util.run_calculation(molpro_keys, pwd, nacme_dir, nacme_input, index)  # Submit child calculation
    if norm_term:
        data.extract_nacme(nacme_outf, molpro_keys['nacme_regex'], index, axes_index)
    else:
        with open('output.log', 'w+') as f:
            f.write('NACME Crashed for GP%s - axis %s' % (index, axes_index))

def run(grid_point):
    index, geom = grid_point
    if inputs['code'] == 'molpro':
        [spe_dir, spe_input] = molpro.setup_spe(inputs, geom, pwd, index)
        [norm_term, spe_outf] = util.run_calculation(molpro_keys, pwd, spe_dir, spe_input, index)  # Run each parent calculation
        if norm_term:  # If parent calculation terminates normally - Extract data and continue with subsequent calculations for each point
            data.extract_energies(spe_dir+spe_outf, inputs['spe'], molpro_keys['energy_regex'],
                                  molpro_keys['cas_prog'], index)
            if inputs['nacme'] == 'yes':
                [nacme_dir, nacmes_inputs] = molpro.setup_nacme(inputs, geom, spe_dir, index)
                nacmes_data = [(nacme_dir, nacme_inp, index) for nacme_inp in nacmes_inputs]  # List of three tuples - each with three elements. Each tuple describes a child calculation to be run in parallel
                nacme_pool = multiprocessing.Pool()
                nacme_pool.map(par_nacmes, [nacme_input for nacme_input in nacmes_data])  # Run each calculation in list of tuples in parallel
            if inputs['grad'] == 'yes':
                pass
        else:
            with open('output.log', 'w+') as f:
                f.write('SPE crashed for GP%s' % index)
    elif inputs['code'] == 'molcas':  # TO DO
        pass

if __name__ == "__main__":
    try:
        pwd = os.getcwd()  # parent dir
        f = open(inp_geom, 'r')
        ref_geom = np.genfromtxt(f, skip_header=2, usecols=(1, 2, 3), encoding=None)
        f.close()
        geom_list = coordinate_generator(ref_geom)  # Generate nuclear coordinates
        if inputs['code'] == 'molpro':
            couplings = molpro.coupled_states(inputs['states'][-1])
        elif inputs['code'] == 'molcas':
            pass
        data = setup.global_data(ref_geom, inputs['states'][-1], couplings, len(geom_list))
        run_pool = multiprocessing.Pool()
        run_pool.map(run, [(k, v) for k, v in enumerate(geom_list)])  # Run each parent calculation for each set of coordinates
    except StopIteration:
        print('Please ensure goemetry file is correct.')
Any insight on how to run these child calculations in parallel for each point would be a great help. I have seen some people suggest using multi-threading instead or to set daemon to false, although I am unsure if this is the best way to do this.

First, I don't know why you need to run par_nacmes in parallel, but if you do, you could:
a) use threads to run the child calculations instead of processes, or
b) use multiprocessing.Process to run run(); however, that would involve a lot of overhead, so I personally wouldn't do it.
For a), all you have to do is replace
nacme_pool = multiprocessing.Pool()
nacme_pool.map(par_nacmes, [nacme_input for nacme_input in nacmes_data])
in run() with
# requires: from threading import Thread
threads = []
for nacme_input in nacmes_data:
    t = Thread(target=par_nacmes, args=(nacme_input,))
    t.start()
    threads.append(t)
for t in threads:
    t.join()
or, if you don't care whether the threads have finished or not:
for nacme_input in nacmes_data:
    t = Thread(target=par_nacmes, args=(nacme_input,))
    t.start()
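As a rough sketch of option a), the three child calculations for one grid point could also be started through concurrent.futures.ThreadPoolExecutor, which takes care of starting and joining the threads. Here child_calculation and run_children are hypothetical stand-ins for par_nacmes and for the relevant part of run(); the sketch assumes that each child calculation mostly waits on an external program, so threads are enough to overlap them.

```python
from concurrent.futures import ThreadPoolExecutor


def child_calculation(task):
    """Hypothetical stand-in for par_nacmes: handles one (index, axis) task."""
    index, axis = task
    # A real implementation would launch the external code here and wait for it.
    return (index, axis, axis * axis)


def run_children(index, axes=(0, 1, 2)):
    """Run one child calculation per axis in its own thread and wait for all."""
    tasks = [(index, axis) for axis in axes]
    with ThreadPoolExecutor(max_workers=len(tasks)) as executor:
        # executor.map returns the results in the same order as the tasks.
        return list(executor.map(child_calculation, tasks))
```

Threads can be started from inside a daemonic pool worker, so a call like run_children(index) could take the place of the nested Pool in run().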

Related

Python multiprocessing write to file with starmap_async()

I'm currently setting up an automated simulation pipeline for OpenFOAM (CFD library) using the PyFoam library within Python to create a large database for machine learning purposes. The database will have around 500k distinct simulations. To run this pipeline on multiple machines, I'm using multiprocessing.Pool.starmap_async(args), which continually starts a new simulation once an old simulation has completed.
However, since some of the simulations might (and will) crash, I want to generate a text file listing all cases that have crashed.
I've already found this thread, which uses multiprocessing.Manager.Queue() and adds a listener, but I failed to get it running with starmap_async(). For my testing I'm trying to print the case name for every simulation that has completed, but currently only one entry is written to the text file instead of all of them (the simulations all complete successfully).
So my question is: how can I write a message to a file for each simulation that has completed?
The current code layout looks roughly like this. Only the important snippet has been included, as the remaining code can't be run without OpenFOAM and additional custom scripts which were created for the automation.
Any help is highly appreciated! :)
from PyFoam.Execution.BasicRunner import BasicRunner
from PyFoam.Execution.ParallelExecution import LAMMachine
import numpy as np
import multiprocessing
import itertools
import psutil

# Defining global variables
manager = multiprocessing.Manager()
queue = manager.Queue()

def runCase(airfoil, angle, velocity):
    # define simulation name
    newCase = str(airfoil) + "_" + str(angle) + "_" + str(velocity)
    '''
    A lot of pre-processing commands to prepare the simulation
    which has been removed from snipped such as generate geometry, create mesh etc...
    '''
    # run simulation
    machine = LAMMachine(nr=4)  # set number of cores for parallel execution
    simulation = BasicRunner(argv=[solver, "-case", case.name], silent=True, lam=machine, logname="solver")
    simulation.start()  # start simulation
    # check if simulation has completed
    if simulation.runOK():
        # write message into queue
        queue.put(newCase)
    if not simulation.runOK():
        print("Simulation did not run successfully")

def listener(queue):
    fname = 'errors.txt'
    msg = queue.get()
    while True:
        with open(fname, 'w') as f:
            if msg == 'complete':
                break
            f.write(str(msg) + '\n')

def main():
    # Create parameter list
    angles = np.arange(-5, 0, 1)
    machs = np.array([0.15])
    nacas = ['0012']
    paramlist = list(itertools.product(nacas, angles, np.round(machs, 9)))
    # create number of processes and keep 2 cores idle for other processes
    nCores = psutil.cpu_count(logical=False) - 2
    nProc = 4
    nProcs = int(nCores / nProc)
    with multiprocessing.Pool(processes=nProcs) as pool:
        pool.apply_async(listener, (queue,))  # start the listener
        pool.starmap_async(runCase, paramlist).get()  # run parallel simulations
    queue.put('complete')
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
First, when your with multiprocessing.Pool(processes=nProcs) as pool: block exits, there will be an implicit call to pool.terminate(), which will kill all pool processes and with them any running or queued-up tasks. There is no point in calling queue.put('complete') afterwards, since nobody is listening.
Second, your "listener" task gets only a single message from the queue. If it is "complete", it terminates immediately. If it is something else, it just loops continuously, writing the same message to the output file. This cannot be right, can it? Did you forget an additional call to queue.get() in your loop?
Third, I do not quite follow your computation for nProcs. Why the division by 4? If you had 5 physical processors, nProcs would be computed as 0. Do you mean something like:
nProcs = psutil.cpu_count(logical=False) // 4
if nProcs == 0:
    nProcs = 1
elif nProcs > 1:
    nProcs -= 1  # Leave a core free
Fourth, why do you need a separate "listener" task? Have your runCase task return whatever message is appropriate, according to how it completes, back to the main process. In the code below, multiprocessing.pool.Pool.imap is used so that results can be processed as the tasks complete and their results are returned:
from PyFoam.Execution.BasicRunner import BasicRunner
from PyFoam.Execution.ParallelExecution import LAMMachine
import numpy as np
import multiprocessing
import itertools
import psutil

def runCase(tpl):
    # Unpack tuple:
    airfoil, angle, velocity = tpl
    # define simulation name
    newCase = str(airfoil) + "_" + str(angle) + "_" + str(velocity)
    ...  # Code omitted for brevity
    # check if simulation has completed
    if simulation.runOK():
        return ''  # No error
    # Simulation did not run successfully:
    return f"Simulation {newCase} did not run successfully"

def main():
    # Create parameter list
    angles = np.arange(-5, 0, 1)
    machs = np.array([0.15])
    nacas = ['0012']
    # There is no reason to convert this into a list; it
    # can be lazily computed:
    paramlist = itertools.product(nacas, angles, np.round(machs, 9))
    # create number of processes and keep 1 core idle for main process
    nCores = psutil.cpu_count(logical=False) - 1
    nProc = 4
    nProcs = max(1, nCores // nProc)  # never ask for 0 processes
    with multiprocessing.Pool(processes=nProcs) as pool:
        with open('errors.txt', 'w') as f:
            # Process message results as soon as the task ends.
            # Use method imap_unordered if you do not care about the order
            # of the messages in the output.
            # imap passes a single argument to the worker; each element of
            # paramlist is already an (airfoil, angle, velocity) tuple:
            for msg in pool.imap(runCase, paramlist):
                if msg != '':  # Error completion
                    print(msg)
                    print(msg, file=f)
    pool.join()  # Not really necessary here

if __name__ == '__main__':
    main()
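If a separate listener is still wanted, the loop from the second point could be sketched roughly as follows. This is only an illustration: it uses queue.Queue and a thread as stand-ins for manager.Queue() and a separate process so that it runs on its own, and demo is a hypothetical driver that is not part of the question.

```python
import queue
import threading

SENTINEL = "complete"  # same stop marker as in the question


def listener(msg_queue, fname):
    """Write every queued message to fname until the sentinel arrives."""
    # Open the file once; reopening it with 'w' inside the loop would
    # truncate it on every pass.
    with open(fname, "w") as f:
        # iter(callable, sentinel) calls msg_queue.get() repeatedly and
        # stops as soon as it returns SENTINEL.
        for msg in iter(msg_queue.get, SENTINEL):
            f.write(str(msg) + "\n")
            f.flush()


def demo(fname, cases):
    """Hypothetical driver: queue some case names, then stop the listener."""
    q = queue.Queue()
    t = threading.Thread(target=listener, args=(q, fname))
    t.start()
    for case in cases:
        q.put(case)
    q.put(SENTINEL)  # sent while the listener is still alive
    t.join()
```

The important differences from the question are that get() is called once per message and that the sentinel is sent before the listener is shut down.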

RuntimeError:freeze_support() on Mac

I'm new to Python and want to learn how to do parallel processing. I saw the following example:
import numpy as np
import multiprocessing as mp

np.random.RandomState(100)
arr = np.random.randint(0, 10, size=[20, 5])
data = arr.tolist()

def howmany_within_range_rowonly(row, minimum=4, maximum=8):
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count = count + 1
    return count

pool = mp.Pool(mp.cpu_count())
results = pool.map(howmany_within_range_rowonly, [row for row in data])
pool.close()
print(results[:10])
but when I run it, this error happened:
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.
        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:
            if __name__ == '__main__':
                freeze_support()
                ...
        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
What should I do?
If you place everything in global scope inside this if __name__ == "__main__" block as follows, you should find that your program behaves as you expect:
import numpy as np
import multiprocessing as mp

def howmany_within_range_rowonly(row, minimum=4, maximum=8):
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count = count + 1
    return count

if __name__ == "__main__":
    np.random.RandomState(100)
    arr = np.random.randint(0, 10, size=[20, 5])
    data = arr.tolist()
    pool = mp.Pool(mp.cpu_count())
    results = pool.map(howmany_within_range_rowonly, [row for row in data])
    pool.close()
    print(results[:10])
Without this protection, the code at module level runs every time the module is imported, not only when it is run as a script. When child processes are started with the spawn method rather than fork, each child re-imports the main module, so an unguarded mp.Pool(...) call would be executed again inside every child while that child is still starting up. That is the situation the RuntimeError reports, and the guard prevents it.
I had a real-world case where I hit the same RuntimeError when running a specific tool on macOS machines (on Linux machines it was fine). I'm not sure about the exact cause, though, because the if __name__ == "__main__" guard seemed to be properly in place.
Following a comment on this Stack Overflow entry, I suspected that the problem was Python >= 3.8, which uses spawn as the default start method for child processes on macOS.
My solution:
Using python=3.7 did the trick.
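To check which start method is actually in use before changing Python versions, something along these lines may help. It is a minimal sketch: start_method_report, count_in_range and count_rows are hypothetical names, and whether pinning a start method resolves a particular tool's problem is not something this snippet can promise.

```python
import multiprocessing as mp


def start_method_report():
    """Return the default start method and the ones this platform supports."""
    return {
        "default": mp.get_start_method(),
        "available": mp.get_all_start_methods(),
    }


def count_in_range(row, minimum=4, maximum=8):
    """Count the values in row that lie within [minimum, maximum]."""
    return sum(1 for n in row if minimum <= n <= maximum)


def count_rows(data, method="spawn", processes=2):
    """Run count_in_range over data in a pool that uses the given start method."""
    # A context pins the start method for this pool only, without
    # changing the global default.
    ctx = mp.get_context(method)
    with ctx.Pool(processes) as pool:
        return pool.map(count_in_range, data)
```

count_rows should itself be called from inside an if __name__ == "__main__": block, since with spawn every child re-imports the main module.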

Function that multiprocesses another function

I'm performing analyses of time series from simulations. Basically, it's doing the same tasks for every time step. As there is a very high number of time steps, and as the analysis of each of them is independent, I wanted to create a function that can multiprocess another function. The latter will have arguments and return a result.
Using a shared dictionary and the concurrent.futures library, I managed to write this:
import concurrent.futures as Cfut
import multiprocessing as mlp  # assumed: mlp is the multiprocessing module

def multiprocess_loop_grouped(function, param_list, group_size, Nworkers, *args):
    # function : function that is running in parallel
    # param_list : list of items
    # group_size : size of the groups
    # Nworkers : number of group/items running in the same time
    # *args : extra parameters passed on to function
    # grouper : helper that is not shown in the question; presumably it
    #           splits param_list into groups of group_size
    manager = mlp.Manager()
    dic = manager.dict()
    executor = Cfut.ProcessPoolExecutor(Nworkers)
    futures = [executor.submit(function, param, dic, *args)
               for param in grouper(param_list, group_size)]
    Cfut.wait(futures)
    return [dic[i] for i in sorted(dic.keys())]
Typically, I can use it like this:
def read_file(files, dictionnary):
    for file in files:
        i = int(file[4:9])
        #print(str(i))
        if 'bz2' in file:
            os.system('bunzip2 ' + file)
            file = file[:-4]
        dictionnary[i] = np.loadtxt(file)
        os.system('bzip2 ' + file)

Map = np.array(multiprocess_loop_grouped(read_file, list_alti, Group_size, N_thread))
or like this:
def autocorr(x):
    result = np.correlate(x, x, mode='full')
    return result[result.size//2:]

def find_lambda_finger(indexes, dic, Deviation):
    for i in indexes:
        #print(str(i))
        # Beach = Deviation[i,:] - np.mean(Deviation[i,:])
        dic[i] = Anls.find_first_max(autocorr(Deviation[i,:]), valmax = True)

args = [Deviation]
Temp = Rescal.multiprocess_loop_grouped(find_lambda_finger, range(Nalti), Group_size, N_thread, *args)
Basically, it is working. But it is not working well. Sometimes it crashes. Sometimes it actually launches a number of Python processes equal to Nworkers, and sometimes only 2 or 3 of them are running at a time even though I specified Nworkers = 15.
For example, a classic error I get is described in the following topic I raised: Calling matplotlib AFTER multiprocessing sometimes results in error : main thread not in main loop
What is the more Pythonic way to achieve what I want? How can I improve control over this function? How can I better control the number of running Python processes?
One of the basic concepts of Python multiprocessing is using queues. It works quite well when you have an input list that can be iterated over and that does not need to be altered by the sub-processes. It also gives you good control over all the processes, because you spawn the number you want and you can let them idle or stop them.
It is also a lot easier to debug. Sharing data explicitly is usually an approach that is much more difficult to set up correctly.
Queues can hold any picklable object. So you can fill them with file path strings for reading files, plain numbers for doing calculations, or even images for drawing.
In your case a layout could look like this:
import multiprocessing as mp
import numpy as np
import itertools as it

def worker1(in_queue, out_queue):
    # blocks when nothing is available, stops when 'STOP' is seen
    for a in iter(in_queue.get, 'STOP'):
        result = None  # placeholder: do something with a
        out_queue.put({a: result})  # return your result linked to the input

def worker2(in_queue, out_queue):
    for a in iter(in_queue.get, 'STOP'):
        result = None  # placeholder: do something different with a
        out_queue.put({a: result})  # return your result linked to the input

def multiprocess_loop_grouped(function, param_list, group_size, Nworkers, *args):
    # your final result
    result = {}
    in_queue = mp.Queue()
    out_queue = mp.Queue()
    # fill your input
    for a in param_list:
        in_queue.put(a)
    # stop command at end of input, one per worker
    for n in range(Nworkers):
        in_queue.put('STOP')
    # set up your worker processes doing the task as specified
    process = [mp.Process(target=function,
                          args=(in_queue, out_queue), daemon=True) for x in range(Nworkers)]
    # run processes
    for p in process:
        p.start()
    # collect your results from the calculations; drain the output queue
    # before joining, because a process with data still queued cannot exit
    for a in param_list:
        result.update(out_queue.get())
    # wait for processes to finish
    for p in process:
        p.join()
    return result

temp = multiprocess_loop_grouped(worker1, param_list, group_size, Nworkers, *args)
map = multiprocess_loop_grouped(worker2, param_list, group_size, Nworkers, *args)
It can be made a bit more dynamic if you are afraid that your queues will run out of memory. Then you need to fill and empty the queues while the processes are running. See this example here.
Final words: it is not more Pythonic, as you requested. But it is easier to understand for a newbie ;-)
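The more dynamic variant, where the queues are filled and emptied while the processes are running, might look something like the sketch below. It is an illustration under assumptions rather than a drop-in replacement: square_worker, feed and run_bounded are hypothetical names, the task (squaring numbers) is a placeholder, and the input queue is bounded with maxsize so that only a few items are pending at any time.

```python
import multiprocessing as mp
import threading

STOP = "STOP"


def square_worker(in_queue, out_queue):
    """Hypothetical task: square each number until STOP is seen."""
    for a in iter(in_queue.get, STOP):
        out_queue.put((a, a * a))


def feed(in_queue, items, n_workers):
    """Put the work items, then one STOP per worker.

    put() blocks while a bounded queue is full, which throttles the feeder.
    """
    for a in items:
        in_queue.put(a)
    for _ in range(n_workers):
        in_queue.put(STOP)


def run_bounded(items, n_workers=2, maxsize=4):
    """Process items with n_workers processes and a bounded input queue."""
    items = list(items)
    in_queue = mp.Queue(maxsize=maxsize)  # at most maxsize pending inputs
    out_queue = mp.Queue()
    workers = [mp.Process(target=square_worker, args=(in_queue, out_queue))
               for _ in range(n_workers)]
    for p in workers:
        p.start()
    # Feed from a thread so the parent can drain results at the same time.
    feeder = threading.Thread(target=feed, args=(in_queue, items, n_workers))
    feeder.start()
    results = dict(out_queue.get() for _ in items)  # drain before joining
    feeder.join()
    for p in workers:
        p.join()
    return results
```

As with the layout above, run_bounded should be called from inside an if __name__ == "__main__": block.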

Processing a long list using Multiprocessing

output = mp.Queue()

def endScoreList(all_docs, query, pc, output):
    score_list = []
    for doc in all_docs:
        print "In process", pc
        score_list.append(some_score(doc, query))
        print "size of score_list is", len(score_list)
    output.put((doc, score_list))

if __name__ == '__main__':
    mp.freeze_support()
    num_of_workers = mp.cpu_count()
    doc_list = getDocuments(query)
    ## query is a list of strings.
    ## doc_list is a list of document names
    processes = [mp.Process(target = endScoreList, args = (doc_list, x, query, output)) for x in range(num_of_workers)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    results = [output.get() for p in processes]
    print results
I have a list of document names, all_docs, whose data I have to compare with an input query. This is done using a score that I get from some_score(doc, query). The list contains ~100k documents, and I need the scores for all of them. How can I write a program so that the scores are generated in parallel? The scores are independent of each other, so in the end I just have to merge all the returned lists of (doc, score). I tried to write such a program, but I don't think it is running in parallel.
Please help me out.
I am using Windows 64-Bit/i7.
It's a bit hard to suggest what is going wrong with your current code, as the example you've shown has a number of issues (for instance, you're using // to introduce a comment, creating processes that call a finalScore function and pass doc_list as a parameter, neither of which are defined).
Rather than try to figure out what is going on with your code, I'd like to suggest an alternative solution that is likely to be much simpler. If you use multiprocessing.Pool's map method, you'll get your work distributed over however many processes are in the pool.
import multiprocessing as mp

def worker(doc):
    return doc, some_score(doc, "query")

if __name__ == "__main__":
    mp.freeze_support()
    p = mp.Pool()  # default is a number of processes equal to the number of CPU cores
    scores = p.map(worker, all_docs)
    p.close()
    p.join()
This simple version assumes that the query string is a constant. If that's not the case, you could pass it as an argument in the map call (or consider using starmap instead).
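One way to pass a non-constant query through the map call is functools.partial, sketched below. Note that some_score here is a hypothetical placeholder (it just counts how many query terms occur in the document name), and score_all is an illustrative wrapper rather than anything from the question.

```python
import multiprocessing as mp
from functools import partial


def some_score(doc, query):
    """Hypothetical scoring function: count the query terms found in doc."""
    return sum(term in doc for term in query)


def worker(doc, query):
    """Score a single document and return it together with its score."""
    return doc, some_score(doc, query)


def score_all(all_docs, query, processes=None):
    """Score every document in a pool, binding query once with partial."""
    with mp.Pool(processes) as pool:
        # partial(worker, query=query) is a one-argument callable, so it
        # fits Pool.map; it is picklable because worker is a module-level
        # function.
        return pool.map(partial(worker, query=query), all_docs)
```

The starmap alternative would be pool.starmap(worker, [(doc, query) for doc in all_docs]), which sends the query along with every task instead of binding it once. On Windows, score_all has to be called from inside the if __name__ == "__main__": block.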

How to utilize all cores with python multiprocessing

I have been fiddling with Python's multiprocessing functionality for upwards of an hour now, trying to parallelize a rather complex graph traversal function using multiprocessing.Process and multiprocessing.Manager:
import networkx as nx
import csv
import time
from operator import itemgetter
import os
import multiprocessing as mp

cutoff = 1
exclusionlist = ["cpd:C00024"]
DG = nx.read_gml("KeggComplete.gml", relabel=True)
for exclusion in exclusionlist:
    DG.remove_node(exclusion)
# checks if 'memorizedPaths exists, and if not, creates it
fn = os.path.join(os.path.dirname(__file__),
                  'memorizedPaths' + str(cutoff+1))
if not os.path.exists(fn):
    os.makedirs(fn)
manager = mp.Manager()
memorizedPaths = manager.dict()
filepaths = manager.dict()
degreelist = sorted(DG.degree_iter(),
                    key=itemgetter(1),
                    reverse=True)

def _all_simple_paths_graph(item, DG, cutoff, memorizedPaths, filepaths):
    source = item[0]
    uniqueTreePaths = []
    if cutoff < 1:
        return
    visited = [source]
    stack = [iter(DG[source])]
    while stack:
        children = stack[-1]
        child = next(children, None)
        if child is None:
            stack.pop()
            visited.pop()
        elif child in memorizedPaths:
            for path in memorizedPaths[child]:
                newPath = (tuple(visited) + tuple(path))
                if ((len(newPath) <= cutoff) and
                        (len(set(visited) & set(path)) == 0)):
                    uniqueTreePaths.append(newPath)
            continue
        elif len(visited) < cutoff:
            if child not in visited:
                visited.append(child)
                stack.append(iter(DG[child]))
                if visited not in uniqueTreePaths:
                    uniqueTreePaths.append(tuple(visited))
        else:  # len(visited) == cutoff:
            if ((visited not in uniqueTreePaths) and
                    (child not in visited)):
                uniqueTreePaths.append(tuple(visited + [child]))
            stack.pop()
            visited.pop()
    # writes the absolute path of the node path file into the hash table
    filepaths[source] = str(fn) + "/" + str(source) + "path.txt"
    with open (filepaths[source], "wb") as csvfile2:
        writer = csv.writer(csvfile2, delimiter=" ", quotechar="|")
        for path in uniqueTreePaths:
            writer.writerow(path)
    memorizedPaths[source] = uniqueTreePaths

############################################################################
if __name__ == '__main__':
    start = time.clock()
    for item in degreelist:
        test = mp.Process(target=_all_simple_paths_graph,
                          args=(DG, cutoff, item, memorizedPaths, filepaths))
        test.start()
        test.join()
    end = time.clock()
    print (end-start)
Currently - through luck and magic - it works (sort of). My problem is that I'm only using 12 of my 24 cores.
Can someone explain why this might be the case? Perhaps my code isn't the best multiprocessing solution, or is it a feature of my architecture (Intel Xeon CPU E5-2640 @ 2.50GHz x18, running Ubuntu 13.04 x64)?
EDIT:
I managed to get:
p = mp.Pool()
for item in degreelist:
    p.apply_async(_all_simple_paths_graph,
                  args=(DG, cutoff, item, memorizedPaths, filepaths))
p.close()
p.join()
Working, however, it's VERY SLOW! So I assume I'm using the wrong function for the job. Hopefully it helps clarify exactly what I'm trying to accomplish!
EDIT2: .map attempt:
partialfunc = partial(_all_simple_paths_graph,
                      DG=DG,
                      cutoff=cutoff,
                      memorizedPaths=memorizedPaths,
                      filepaths=filepaths)
p = mp.Pool()
for item in processList:
    processVar = p.map(partialfunc, xrange(len(processList)))
p.close()
p.join()
Works, is slower than singlecore. Time to optimize!
Too much piling up here to address in comments, so, where mp is multiprocessing:
mp.cpu_count() should return the number of processors. But test it. Some platforms are funky, and this info isn't always easy to get. Python does the best it can.
If you start 24 processes, they'll do exactly what you tell them to do ;-) Looks like mp.Pool() would be most convenient for you. You pass the number of processes you want to create to its constructor. mp.Pool(processes=None) will use mp.cpu_count() for the number of processes.
Then you can use, for example, .imap_unordered(...) on your Pool instance to spread your degreelist across processes. Or maybe some other Pool method would work better for you - experiment.
If you can't bash the problem into Pool's view of the world, you could instead create an mp.Queue to create a work queue, .put()'ing nodes (or slices of nodes, to reduce overhead) to work on in the main program, and write the workers to .get() work items off that queue. Ask if you need examples. Note that you need to put sentinel values (one per process) on the queue, after all the "real" work items, so that worker processes can test for the sentinel to know when they're done.
FYI, I like queues because they're more explicit. Many others like Pools better because they're more magical ;-)
Pool Example
Here's an executable prototype for you. This shows one way to use imap_unordered with Pool and chunksize that doesn't require changing any function signatures. Of course you'll have to plug in your real code ;-) Note that the init_worker approach allows passing "most of" the arguments only once per process, not once for every item in your degreelist. Cutting the amount of inter-process communication can be crucial for speed.
import multiprocessing as mp

def init_worker(mps, fps, cut):
    global memorizedPaths, filepaths, cutoff
    global DG
    print("process initializing", mp.current_process())
    memorizedPaths, filepaths, cutoff = mps, fps, cut
    DG = 1##nx.read_gml("KeggComplete.gml", relabel = True)

def work(item):
    _all_simple_paths_graph(DG, cutoff, item, memorizedPaths, filepaths)

def _all_simple_paths_graph(DG, cutoff, item, memorizedPaths, filepaths):
    pass # print("doing " + str(item))

if __name__ == "__main__":
    m = mp.Manager()
    memorizedPaths = m.dict()
    filepaths = m.dict()
    cutoff = 1 ##
    # use all available CPUs
    p = mp.Pool(initializer=init_worker, initargs=(memorizedPaths,
                                                   filepaths,
                                                   cutoff))
    degreelist = range(100000) ##
    for _ in p.imap_unordered(work, degreelist, chunksize=500):
        pass
    p.close()
    p.join()
I strongly advise running this exactly as-is, so you can see that it's blazing fast. Then add things to it a bit at a time, to see how that affects the time. For example, just adding
memorizedPaths[item] = item
to _all_simple_paths_graph() slows it down enormously. Why? Because the dict gets bigger and bigger with each addition, and this process-safe dict has to be synchronized (under the covers) among all the processes. The unit of synchronization is "the entire dict" - there's no internal structure the mp machinery can exploit to do incremental updates to the shared dict.
If you can't afford this expense, then you can't use a Manager.dict() for this. Opportunities for cleverness abound ;-)
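One of those opportunities, sketched here purely as an illustration, is to have each worker return its result and let the parent build an ordinary dict, so nothing has to be synchronized between processes. compute_paths, merge_results and run_all are hypothetical names, and the sketch assumes the workers do not need to read each other's results while they run, which is not true of the memoized traversal in the question as written.

```python
import multiprocessing as mp


def compute_paths(item):
    """Hypothetical worker: return (key, value) instead of writing to a shared dict."""
    return item, [item, item + 1]


def merge_results(pairs):
    """Collect (key, value) pairs into a plain dict in the parent process."""
    merged = {}
    for key, value in pairs:
        merged[key] = value
    return merged


def run_all(items, processes=None, chunksize=500):
    """Spread items over a pool and merge the returned pairs in the parent."""
    with mp.Pool(processes) as pool:
        # imap_unordered yields results as they finish; the order does not
        # matter because every result carries its own key.
        return merge_results(
            pool.imap_unordered(compute_paths, items, chunksize=chunksize))
```

Each result crosses the process boundary only once, on its way back to the parent.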
