setting cpu affinity using multiprocessing in python windows - python

I am trying to run 1200 iterations of a function with different values using multiprocessing.
Is there a way I can set the priority and affinities of the processors within the function itself ?
Here is an example of what I am doing :
with multiprocessing.Pool(processes=3) as pool:
r = pool.map(func, (c for c in combinations))
I want each of the 3 processes to have high priority using psutil, and the cpu_affinity to be specified. While I can use: psutil.Process().HIGH_PRIORITY_CLASS withing func, how should I specify different affinities for the three processors?

I would use the initializer function in mp.Pool:
#same prio for each child
def init_priority(prio_level):
set_prio(prio_level)
if __name__ == "__main__":
with Pool(nprocs, init_priority, (prio_level,)) as p:
p.map(...)
#Different prio for each child: (this may not be very useful
#because you cannot choose which child will accept each "task").
def init_priority(q):
prio_level = q.get()
set_prio(prio_level)
if __name__ == "__main__":
q = mp.Queue()
for _ in range(nprocs): #put one prio_level for each process
q.put(prio_level)
with Pool(nprocs, init_priority, (q,)) as p:
p.map(...)
If you need to have some high priority child processes and some low priority, and will need to be able to discern easily between them, I would skip mp.Pool, and just use your own Process's.

Related

Nesting parallel processes using multiprocessing

Is there a way to run a function in parallel within an already parallelised function? I know that using multiprocessing.Pool() this is not possible as a daemonic process can not create a child process. I am fairly new to parallel computing and am struggling to find a workaround.
I currently have several thousand calculations that need to be run in parallel using some other commercially available quantum mechanical code I interface to. Each calculation, has three subsequent calculations that need to be executed in parallel on normal termination of the parent calculation, if the parent calculation does not terminate normally, that is the end of the calculation for that point. I could always combine these three subsequent calculations into one big calculation and run normally - although I would much prefer to run separately in parallel.
Main currently looks like this, run() is the parent calculation that is first run in parallel for a series of points, and par_nacmes() is the function that I want to run in parallel for three child calculations following normal termination of the parent.
def par_nacmes(nacme_input_data):
nacme_dir, nacme_input, index = nacme_input_data # Unpack info in tuple for the calculation
axes_index = get_axis_index(nacme_input)
[norm_term, nacme_outf] = util.run_calculation(molpro_keys, pwd, nacme_dir, nacme_input, index) # Submit child calculation
if norm_term:
data.extract_nacme(nacme_outf, molpro_keys['nacme_regex'], index, axes_index)
else:
with open('output.log', 'w+') as f:
f.write('NACME Crashed for GP%s - axis %s' % (index, axes_index))
def run(grid_point):
index, geom = grid_point
if inputs['code'] == 'molpro':
[spe_dir, spe_input] = molpro.setup_spe(inputs, geom, pwd, index)
[norm_term, spe_outf] = util.run_calculation(molpro_keys, pwd, spe_dir, spe_input, index) # Run each parent calculation
if norm_term: # If parent calculation terminates normally - Extract data and continue with subsequent calculations for each point
data.extract_energies(spe_dir+spe_outf, inputs['spe'], molpro_keys['energy_regex'],
molpro_keys['cas_prog'], index)
if inputs['nacme'] == 'yes':
[nacme_dir, nacmes_inputs] = molpro.setup_nacme(inputs, geom, spe_dir, index)
nacmes_data = [(nacme_dir, nacme_inp, index) for nacme_inp in nacmes_inputs] # List of three tuples - each with three elements. Each tuple describes a child calculation to be run in parallel
nacme_pool = multiprocessing.Pool()
nacme_pool.map(par_nacmes, [nacme_input for nacme_input in nacmes_data]) # Run each calculation in list of tuples in parallel
if inputs['grad'] == 'yes':
pass
else:
with open('output.log', 'w+') as f:
f.write('SPE crashed for GP%s' % index)
elif inputs['code'] == 'molcas': # TO DO
pass
if __name__ == "__main__":
try:
pwd = os.getcwd() # parent dir
f = open(inp_geom, 'r')
ref_geom = np.genfromtxt(f, skip_header=2, usecols=(1, 2, 3), encoding=None)
f.close()
geom_list = coordinate_generator(ref_geom) # Generate nuclear coordinates
if inputs['code'] == 'molpro':
couplings = molpro.coupled_states(inputs['states'][-1])
elif inputs['code'] == 'molcas':
pass
data = setup.global_data(ref_geom, inputs['states'][-1], couplings, len(geom_list))
run_pool = multiprocessing.Pool()
run_pool.map(run, [(k, v) for k, v in enumerate(geom_list)]) # Run each parent calculation for each set of coordinates
except StopIteration:
print('Please ensure goemetry file is correct.')
Any insight on how to run these child calculations in parallel for each point would be a great help. I have seen some people suggest using multi-threading instead or to set daemon to false, although I am unsure if this is the best way to do this.
firstly I dont know why you have to run par_nacmes in paralel but if you have to you could:
a use threads to run them instead of processes
or b use multiprocessing.Process to run run however that would involve a lot of overhead so I personally wouldn't do it.
for a all you have to do is
replace
nacme_pool = multiprocessing.Pool()
nacme_pool.map(par_nacmes, [nacme_input for nacme_input in nacmes_data])
in run()
with
threads = []
for nacme_input in nacmes_data:
t = Thread(target=par_nacmes, args=(nacme_input,)); t.start()
threads.append(t)
for t in threads: t.join()
or if you dont care if the treads have finished or not
for nacme_input in nacmes_data:
t = Thread(target=par_nacmes, args=(nacme_input,)); t.start()

Running different Python functions in separate CPUs

Using multiprocessing.pool I can split an input list for a single function to be processed in parallel along multiple CPUs. Like this:
from multiprocessing import Pool
def f(x):
return x*x
if __name__ == '__main__':
pool = Pool(processes=4)
results = pool.map(f, range(100))
pool.close()
pool.join()
However, this does not allow to run different functions on different processors. If I want to do something like this, in parallel / simultaneously:
foo1(args1) --> Processor1
foo2(args2) --> Processor2
How can this be done?
Edit: After Darkonaut remarks, I do not care about specifically assigning foo1 to Processor number 1. It can be any processor as chosen by the OS. I am just interested in running independent functions in different/ parallel Processes. So rather:
foo1(args1) --> process1
foo2(args2) --> process2
I usually find it easiest to use the concurrent.futures module for concurrency. You can achieve the same with multiprocessing, but concurrent.futures has (IMO) a much nicer interface.
Your example would then be:
from concurrent.futures import ProcessPoolExecutor
def foo1(x):
return x * x
def foo2(x):
return x * x * x
if __name__ == '__main__':
with ProcessPoolExecutor(2) as executor:
# these return immediately and are executed in parallel, on separate processes
future_1 = executor.submit(foo1, 1)
future_2 = executor.submit(foo2, 2)
# get results / re-raise exceptions that were thrown in workers
result_1 = future_1.result() # contains foo1(1)
result_2 = future_2.result() # contains foo2(2)
If you have many inputs, it is better to use executor.map with the chunksize argument instead:
from concurrent.futures import ProcessPoolExecutor
def foo1(x):
return x * x
def foo2(x):
return x * x * x
if __name__ == '__main__':
with ProcessPoolExecutor(4) as executor:
# these return immediately and are executed in parallel, on separate processes
future_1 = executor.map(foo1, range(10000), chunksize=100)
future_2 = executor.map(foo2, range(10000), chunksize=100)
# executor.map returns an iterator which we have to consume to get the results
result_1 = list(future_1) # contains [foo1(x) for x in range(10000)]
result_2 = list(future_2) # contains [foo2(x) for x in range(10000)]
Note that the optimal values for chunksize, the number of processes, and whether process-based concurrency actually leads to increased performance depends on many factors:
The runtime of foo1 / foo2. If they are extremely cheap (as in this example), the communication overhead between processes might dominate the total runtime.
Spawning a process takes time, so the code inside with ProcessPoolExecutor needs to run long enough for this to amortize.
The actual number of physical processors in the machine you are running on.
Whether your application is IO bound or compute bound.
Whether the functions you use in foo are already parallelized (such as some np.linalg solvers, or scikit-learn estimators).

Function that multiprocesses another function

I'm performing analyses of time-series of simulations. Basically, it's doing the same tasks for every time steps. As there is a very high number of time steps, and as the analyze of each of them is independant, I wanted to create a function that can multiprocess another function. The latter will have arguments, and return a result.
Using a shared dictionnary and the lib concurrent.futures, I managed to write this :
import concurrent.futures as Cfut
def multiprocess_loop_grouped(function, param_list, group_size, Nworkers, *args):
# function : function that is running in parallel
# param_list : list of items
# group_size : size of the groups
# Nworkers : number of group/items running in the same time
# **param_fixed : passing parameters
manager = mlp.Manager()
dic = manager.dict()
executor = Cfut.ProcessPoolExecutor(Nworkers)
futures = [executor.submit(function, param, dic, *args)
for param in grouper(param_list, group_size)]
Cfut.wait(futures)
return [dic[i] for i in sorted(dic.keys())]
Typically, I can use it like this :
def read_file(files, dictionnary):
for file in files:
i = int(file[4:9])
#print(str(i))
if 'bz2' in file:
os.system('bunzip2 ' + file)
file = file[:-4]
dictionnary[i] = np.loadtxt(file)
os.system('bzip2 ' + file)
Map = np.array(multiprocess_loop_grouped(read_file, list_alti, Group_size, N_thread))
or like this :
def autocorr(x):
result = np.correlate(x, x, mode='full')
return result[result.size//2:]
def find_lambda_finger(indexes, dic, Deviation):
for i in indexes :
#print(str(i))
# Beach = Deviation[i,:] - np.mean(Deviation[i,:])
dic[i] = Anls.find_first_max(autocorr(Deviation[i,:]), valmax = True)
args = [Deviation]
Temp = Rescal.multiprocess_loop_grouped(find_lambda_finger, range(Nalti), Group_size, N_thread, *args)
Basically, it is working. But it is not working well. Sometimes it crashes. Sometimes it actually launches a number of python processes equal to Nworkers, and sometimes there is only 2 or 3 of them running at a time while I specified Nworkers = 15.
For example, a classic error I obtain is described in the following topic I raised : Calling matplotlib AFTER multiprocessing sometimes results in error : main thread not in main loop
What is the more Pythonic way to achieve what I want ? How can I improve the control this function ? How can I control more the number of running python process ?
One of the basic concepts for Python multi-processing is using queues. It works quite well when you have an input list that can be iterated and which does not need to be altered by the sub-processes. It also gives you a good control over all the processes, because you spawn the number you want, you can run them idle or stop them.
It is also a lot easier to debug. Sharing data explicitly is usually an approach that is much more difficult to setup correctly.
Queues can hold anything as they are iterables by definition. So you can fill them with filepath strings for reading files, non-iterable numbers for doing calculations or even images for drawing.
In your case a layout could look like that:
import multiprocessing as mp
import numpy as np
import itertools as it
def worker1(in_queue, out_queue):
#holds when nothing is available, stops when 'STOP' is seen
for a in iter(in_queue.get, 'STOP'):
#do something
out_queue.put({a: result}) #return your result linked to the input
def worker2(in_queue, out_queue):
for a in iter(in_queue.get, 'STOP'):
#do something differently
out_queue.put({a: result}) //return your result linked to the input
def multiprocess_loop_grouped(function, param_list, group_size, Nworkers, *args):
# your final result
result = {}
in_queue = mp.Queue()
out_queue = mp.Queue()
# fill your input
for a in param_list:
in_queue.put(a)
# stop command at end of input
for n in range(Nworkers):
in_queue.put('STOP')
# setup your worker process doing task as specified
process = [mp.Process(target=function,
args=(in_queue, out_queue), daemon=True) for x in range(Nworkers)]
# run processes
for p in process:
p.start()
# wait for processes to finish
for p in process:
p.join()
# collect your results from the calculations
for a in param_list:
result.update(out_queue.get())
return result
temp = multiprocess_loop_grouped(worker1, param_list, group_size, Nworkers, *args)
map = multiprocess_loop_grouped(worker2, param_list, group_size, Nworkers, *args)
It can be made a bit more dynamic when you are afraid that your queues will run out of memory. Than you need to fill and empty the queues while the processes are running. See this example here.
Final words: it is not more Pythonic as you requested. But it is easier to understand for a newbie ;-)

Processing a long list using Multiprocessing

output = mp.Queue()
def endScoreList(all_docs, query, pc, output):
score_list = []
for doc in all_docs:
print "In process", pc
score_list.append(some_score(doc, query))
print "size of score_list is", len(score_list)
output.put((doc, score_list))
if __name__ == '__main__':
mp.freeze_support()
num_of_workers = mp.cpu_count()
doc_list = getDocuments(query)
## query is a list of strings.
## doc_list is a list of document names
processes = [mp.Process(target = endScoreList, args = (doc_list, x, query, output)) for x in range(num_of_workers)]
for p in processes:
p.start()
for p in processes:
p.join()
results = [output.get() for p in processes]
print results
I have a list of document names all_docs whose data I have to compare with an input query. This is done using score that I get from some_score(doc, query). The list of documents is ~100k. I have to get scores of all the documents. How can I make a program so that the scores are generated parallelly. The scores are independent of each other so in the end I just have to merge all the returned list of (doc, score). I tried to make a program, but I don't think it is running parallelly.
Please help me out.
I am using Windows 64-Bit/i7.
It's a bit hard to suggest what is going wrong with your current code, as the example you've shown has a number of issues (for instance, you're using // to introduce a comment, creating processes that call a finalScore function and pass doc_list as a parameter, neither of which are defined).
Rather than try to figure out what is going on with your code, I'd like to suggest an alternative solution that is likely to be much simpler. If you use multiprocessing.Pool's map method, you'll get your work distributed over however many processes are in the pool.
import multiprocessing as mp
def worker(doc):
return doc, some_score(doc, "query")
if __name__ == "__main__":
mp.freeze_support()
p = mp.Pool() # default is a number of processes equal to the number of CPU cores
scores = p.map(worker, all_docs)
p.close()
p.join()
This simple version assumes that the query string is a constant. If that's not the case, you could pass it as an argument in the map call (or consider using starmap instead).

How to utilize all cores with python multiprocessing

I have been fiddling with Python's multiprocessing functionality for upwards of an hour now, trying to parallelize a rather complex graph traversal function using multiprocessing.Process and multiprocessing.Manager:
import networkx as nx
import csv
import time
from operator import itemgetter
import os
import multiprocessing as mp
cutoff = 1
exclusionlist = ["cpd:C00024"]
DG = nx.read_gml("KeggComplete.gml", relabel=True)
for exclusion in exclusionlist:
DG.remove_node(exclusion)
# checks if 'memorizedPaths exists, and if not, creates it
fn = os.path.join(os.path.dirname(__file__),
'memorizedPaths' + str(cutoff+1))
if not os.path.exists(fn):
os.makedirs(fn)
manager = mp.Manager()
memorizedPaths = manager.dict()
filepaths = manager.dict()
degreelist = sorted(DG.degree_iter(),
key=itemgetter(1),
reverse=True)
def _all_simple_paths_graph(item, DG, cutoff, memorizedPaths, filepaths):
source = item[0]
uniqueTreePaths = []
if cutoff < 1:
return
visited = [source]
stack = [iter(DG[source])]
while stack:
children = stack[-1]
child = next(children, None)
if child is None:
stack.pop()
visited.pop()
elif child in memorizedPaths:
for path in memorizedPaths[child]:
newPath = (tuple(visited) + tuple(path))
if (len(newPath) <= cutoff) and
(len(set(visited) & set(path)) == 0):
uniqueTreePaths.append(newPath)
continue
elif len(visited) < cutoff:
if child not in visited:
visited.append(child)
stack.append(iter(DG[child]))
if visited not in uniqueTreePaths:
uniqueTreePaths.append(tuple(visited))
else: # len(visited) == cutoff:
if (visited not in uniqueTreePaths) and
(child not in visited):
uniqueTreePaths.append(tuple(visited + [child]))
stack.pop()
visited.pop()
# writes the absolute path of the node path file into the hash table
filepaths[source] = str(fn) + "/" + str(source) + "path.txt"
with open (filepaths[source], "wb") as csvfile2:
writer = csv.writer(csvfile2, delimiter=" ", quotechar="|")
for path in uniqueTreePaths:
writer.writerow(path)
memorizedPaths[source] = uniqueTreePaths
############################################################################
if __name__ == '__main__':
start = time.clock()
for item in degreelist:
test = mp.Process(target=_all_simple_paths_graph,
args=(DG, cutoff, item, memorizedPaths, filepaths))
test.start()
test.join()
end = time.clock()
print (end-start)
Currently - though luck and magic - it works (sort of). My problem is I'm only using 12 of my 24 cores.
Can someone explain why this might be the case? Perhaps my code isn't the best multiprocessing solution, or is it a feature of my architecture Intel Xeon CPU E5-2640 # 2.50GHz x18 running on Ubuntu 13.04 x64?
EDIT:
I managed to get:
p = mp.Pool()
for item in degreelist:
p.apply_async(_all_simple_paths_graph,
args=(DG, cutoff, item, memorizedPaths, filepaths))
p.close()
p.join()
Working, however, it's VERY SLOW! So I assume I'm using the wrong function for the job. hopefully it helps clarify exactly what I'm trying to accomplish!
EDIT2: .map attempt:
partialfunc = partial(_all_simple_paths_graph,
DG=DG,
cutoff=cutoff,
memorizedPaths=memorizedPaths,
filepaths=filepaths)
p = mp.Pool()
for item in processList:
processVar = p.map(partialfunc, xrange(len(processList)))
p.close()
p.join()
Works, is slower than singlecore. Time to optimize!
Too much piling up here to address in comments, so, where mp is multiprocessing:
mp.cpu_count() should return the number of processors. But test it. Some platforms are funky, and this info isn't always easy to get. Python does the best it can.
If you start 24 processes, they'll do exactly what you tell them to do ;-) Looks like mp.Pool() would be most convenient for you. You pass the number of processes you want to create to its constructor. mp.Pool(processes=None) will use mp.cpu_count() for the number of processors.
Then you can use, for example, .imap_unordered(...) on your Pool instance to spread your degreelist across processes. Or maybe some other Pool method would work better for you - experiment.
If you can't bash the problem into Pool's view of the world, you could instead create an mp.Queue to create a work queue, .put()'ing nodes (or slices of nodes, to reduce overhead) to work on in the main program, and write the workers to .get() work items off that queue. Ask if you need examples. Note that you need to put sentinel values (one per process) on the queue, after all the "real" work items, so that worker processes can test for the sentinel to know when they're done.
FYI, I like queues because they're more explicit. Many others like Pools better because they're more magical ;-)
Pool Example
Here's an executable prototype for you. This shows one way to use imap_unordered with Pool and chunksize that doesn't require changing any function signatures. Of course you'll have to plug in your real code ;-) Note that the init_worker approach allows passing "most of" the arguments only once per processor, not once for every item in your degreeslist. Cutting the amount of inter-process communication can be crucial for speed.
import multiprocessing as mp
def init_worker(mps, fps, cut):
global memorizedPaths, filepaths, cutoff
global DG
print "process initializing", mp.current_process()
memorizedPaths, filepaths, cutoff = mps, fps, cut
DG = 1##nx.read_gml("KeggComplete.gml", relabel = True)
def work(item):
_all_simple_paths_graph(DG, cutoff, item, memorizedPaths, filepaths)
def _all_simple_paths_graph(DG, cutoff, item, memorizedPaths, filepaths):
pass # print "doing " + str(item)
if __name__ == "__main__":
m = mp.Manager()
memorizedPaths = m.dict()
filepaths = m.dict()
cutoff = 1 ##
# use all available CPUs
p = mp.Pool(initializer=init_worker, initargs=(memorizedPaths,
filepaths,
cutoff))
degreelist = range(100000) ##
for _ in p.imap_unordered(work, degreelist, chunksize=500):
pass
p.close()
p.join()
I strongly advise running this exactly as-is, so you can see that it's blazing fast. Then add things to it a bit a time, to see how that affects the time. For example, just adding
memorizedPaths[item] = item
to _all_simple_paths_graph() slows it down enormously. Why? Because the dict gets bigger and bigger with each addition, and this process-safe dict has to be synchronized (under the covers) among all the processes. The unit of synchronization is "the entire dict" - there's no internal structure the mp machinery can exploit to do incremental updates to the shared dict.
If you can't afford this expense, then you can't use a Manager.dict() for this. Opportunities for cleverness abound ;-)

Categories

Resources