Processing a long list using Multiprocessing - python

output = mp.Queue()
def endScoreList(all_docs, query, pc, output):
score_list = []
for doc in all_docs:
print "In process", pc
score_list.append(some_score(doc, query))
print "size of score_list is", len(score_list)
output.put((doc, score_list))
if __name__ == '__main__':
mp.freeze_support()
num_of_workers = mp.cpu_count()
doc_list = getDocuments(query)
## query is a list of strings.
## doc_list is a list of document names
processes = [mp.Process(target = endScoreList, args = (doc_list, x, query, output)) for x in range(num_of_workers)]
for p in processes:
p.start()
for p in processes:
p.join()
results = [output.get() for p in processes]
print results
I have a list of document names all_docs whose data I have to compare with an input query. This is done using score that I get from some_score(doc, query). The list of documents is ~100k. I have to get scores of all the documents. How can I make a program so that the scores are generated parallelly. The scores are independent of each other so in the end I just have to merge all the returned list of (doc, score). I tried to make a program, but I don't think it is running parallelly.
Please help me out.
I am using Windows 64-Bit/i7.

It's a bit hard to suggest what is going wrong with your current code, as the example you've shown has a number of issues (for instance, you're using // to introduce a comment, creating processes that call a finalScore function and pass doc_list as a parameter, neither of which are defined).
Rather than try to figure out what is going on with your code, I'd like to suggest an alternative solution that is likely to be much simpler. If you use multiprocessing.Pool's map method, you'll get your work distributed over however many processes are in the pool.
import multiprocessing as mp
def worker(doc):
return doc, some_score(doc, "query")
if __name__ == "__main__":
mp.freeze_support()
p = mp.Pool() # default is a number of processes equal to the number of CPU cores
scores = p.map(worker, all_docs)
p.close()
p.join()
This simple version assumes that the query string is a constant. If that's not the case, you could pass it as an argument in the map call (or consider using starmap instead).

Related

setting cpu affinity using multiprocessing in python windows

I am trying to run 1200 iterations of a function with different values using multiprocessing.
Is there a way I can set the priority and affinities of the processors within the function itself ?
Here is an example of what I am doing :
with multiprocessing.Pool(processes=3) as pool:
r = pool.map(func, (c for c in combinations))
I want each of the 3 processes to have high priority using psutil, and the cpu_affinity to be specified. While I can use: psutil.Process().HIGH_PRIORITY_CLASS withing func, how should I specify different affinities for the three processors?
I would use the initializer function in mp.Pool:
#same prio for each child
def init_priority(prio_level):
set_prio(prio_level)
if __name__ == "__main__":
with Pool(nprocs, init_priority, (prio_level,)) as p:
p.map(...)
#Different prio for each child: (this may not be very useful
#because you cannot choose which child will accept each "task").
def init_priority(q):
prio_level = q.get()
set_prio(prio_level)
if __name__ == "__main__":
q = mp.Queue()
for _ in range(nprocs): #put one prio_level for each process
q.put(prio_level)
with Pool(nprocs, init_priority, (q,)) as p:
p.map(...)
If you need to have some high priority child processes and some low priority, and will need to be able to discern easily between them, I would skip mp.Pool, and just use your own Process's.

Multiple iterations of function with multiple arguments returning multiple values using Multiprocessing in python

I doing 100 iterations of the function model so, i tried using multiprocessing to distribute the tasks and for getting the final output I tried using queue but it takes too much time, failing the purpose of multiprocessing. How to solve this problem?
def model(X,Y):
ada_clf={}
pred1={}
auc_final=[]
for iteration in range(100):
ada_clf[iteration] = AdaBoostClassifier(DecisionTreeClassifier(),n_estimators=1000,learning_rate=0.001)
ada_clf[iteration].fit(X,Y)
pred1[iteration]=ada_clf[iteration].predict(test1)
individuallabelsfromada1=[]
for i in range(len(test1)):
individuallabelsfromada1.append([])
for j in range(100):
individuallabelsfromada1[i].append(pred1[j][i])
final_labels_ada1=[]
for each in individuallabelsfromada1:
final_labels_ada1.append(find_majority(each))
final=pd.Series(final_labels_ada1)
temp_arr=np.array(final)
total_labels2=pd.Series(temp_arr)
fpr, tpr, thresholds = roc_curve(y_test, total_labels2, pos_label=1)
auc_final.append(auc(fpr,tpr))
q.put(total_labels2)
q1.put(auc_final)
q2.put(ada_clf)
print('done')
overall_labels={}
final_auc={}
final_ada_clf={}
processes=[]
q=Queue()
q1=Queue()
q2=Queue()
for iteration in range(100):
if __name__=='__main__':
p=multiprocessing.Process(target=model,args=(x_train,y_labels,q,q1,q2,))
overall_labels[iteration]=q.get()
final_auc[iteration]=q1.get()
final_ada_clf[iteration]=q2.get()
p.start()
processes.append(p)
for each in processes:
each.join()
Below is my edited version, but returns only single output, i tried using multiple output but could not get it, so settled for only single output i.e. total_labels2:-
##code before this is same as before, only thing changed is arguments of model from def model(X,Y) to def model(repeat,X,Y)
total_labels2 = pd.Series(temp_arr)
return (repeat,total_labels2)
def get_result(total_labels2):
global testover_forall
testover_forall.append(total_labels2)
if __name__ == '__main__':
import multiprocessing as mp
testover_forall = []
pool = mp.Pool(40)
for repeat in range(100):
pool.apply_async(bound_model, args= repeat, x_train, y_train), callback= get_result)
pool.close()
pool.join()
repetations_index=[]
for i in range(100):
repetations_index.append(testover_forall[i][0])
final_last_labels = {}
for i in range(100):
temp = str(i)
final_last_labels[temp] = testover_forall[repetations_index[i]][1]
totally_last_labels=[]
for each in final_last_labels:
temp=np.array(final_last_labels[each])
totally_last_labels.append(temp)
See my comments (actually questions) to your post.
You should be using a multiprocessing pool to limit the number of processes that you create to the number of CPU cores that you have. This will also make it easier to get return values back from your model function instead of writing results to 3 different queues (and you could have written a tuple of 3 values to just one queue). You will, of course, require other import statements and code. Given your use of numpy and other libraries, which may be implemented in C Language, you could also retry running this using threading to see if that helps or hurts performance. Do this by replacing ProcessPoolExecutor with ThreadPoolExecutor in the two places it is referenced.
Note
Any changes that model makes to passed arguments X and Y will not be reflected back to the main process. So if model is called repeatedly with the same arguments over and over, as it appears to be, it's not clear whether each call will return different values, especially if the calls are being done in parallel.
from concurrent.futures import ProcessPoolExecutor
def model(X,Y):
ada_clf={}
pred1={}
auc_final=[]
for iteration in range(100):
ada_clf[iteration] = AdaBoostClassifier(DecisionTreeClassifier(),n_estimators=1000,learning_rate=0.001)
ada_clf[iteration].fit(X,Y)
pred1[iteration]=ada_clf[iteration].predict(test1)
individuallabelsfromada1=[]
for i in range(len(test1)):
individuallabelsfromada1.append([])
for j in range(100):
individuallabelsfromada1[i].append(pred1[j][i])
final_labels_ada1=[]
for each in individuallabelsfromada1:
final_labels_ada1.append(find_majority(each))
final=pd.Series(final_labels_ada1)
temp_arr=np.array(final)
total_labels2=pd.Series(temp_arr)
fpr, tpr, thresholds = roc_curve(y_test, total_labels2, pos_label=1)
auc_final.append(auc(fpr,tpr))
#q.put(total_labels2)
#q1.put(auc_final)
#q2.put(ada_clf)
return total_labels2, auc_final, ada_clf
#print('done')
if __name__ == '__main__':
with ProcessPoolExecutor() as executor:
futures = [executor.submit(model, x_train, y_labels) for iteration in range(100)]
# simple lists will suffice:
overall_labels = []
final_auc = []
final_ada_clf = []
for future in futures:
# get return value and store
total_labels2, auc_final, ada_clf = future.result()
overall_labels.append(total_labels2)
final_auc.append(auc_final)
final_ada_clf.append(ada_clf)
Update
It wasn't clear from the problem specification that the returned results are based on a random number generator and if successive calls to the worker function, model, do not employ a single random number generator across all processes in the multiprocessing pool, then the multiprocessing implementation will clearly return different results than the non-multiprocessing version. And it is not clear from the code provided where the random number generator is being used; it may be in library code that you have no access to. If that is the case, you have two options: (1) Use multithreading instead by changing the import statement as I have indicated in the code below; you may still achieve performance benefits as I have already mentioned or (2) Update the signature to model as follows. You will be passed a new argument, random_generator, that currently supports two methods, randint (like random.randint and random (like random.random), although it should be easy enough to modify the code if you need a different method from module random. You will use this random number generator in place of module random if you are able to. But note that this random generator will run much more slowly than the standard one; this is the price you pay.
Since we are also adding a repetition argument to model (it now has to be the final argument -- note the updated signature below), we can now use method map (no need to use a callback):
def model(X,Y, random_generator, repetition):
...
etc.
from multiprocessing import Pool
# or use the following import instead to use multithreading (but then use standard random generator):
# from multiprocessing.dummy import Pool
import random
from functools import partial
from multiprocessing.managers import BaseManager
class RandomGeneratorManager(BaseManager):
pass
class RandomGenerator:
def __init__(self):
random.seed(0)
def randint(self, a, b):
return random.randint(a, b)
def random(self):
return random.random()
# add other functions if needed
if __name__ == '__main__':
RandomGeneratorManager.register('RandomGenerator', RandomGenerator)
with RandomGeneratorManager() as manager:
random_generator = manager.RandomGenerator()
# why 40? why not use default, which is the number of cpu cores you have?:
pool = Pool(40):
worker = partial(model, x_train, y_labels, random_generator)
results = pool.map(worker, range(100))

Nesting parallel processes using multiprocessing

Is there a way to run a function in parallel within an already parallelised function? I know that using multiprocessing.Pool() this is not possible as a daemonic process can not create a child process. I am fairly new to parallel computing and am struggling to find a workaround.
I currently have several thousand calculations that need to be run in parallel using some other commercially available quantum mechanical code I interface to. Each calculation, has three subsequent calculations that need to be executed in parallel on normal termination of the parent calculation, if the parent calculation does not terminate normally, that is the end of the calculation for that point. I could always combine these three subsequent calculations into one big calculation and run normally - although I would much prefer to run separately in parallel.
Main currently looks like this, run() is the parent calculation that is first run in parallel for a series of points, and par_nacmes() is the function that I want to run in parallel for three child calculations following normal termination of the parent.
def par_nacmes(nacme_input_data):
nacme_dir, nacme_input, index = nacme_input_data # Unpack info in tuple for the calculation
axes_index = get_axis_index(nacme_input)
[norm_term, nacme_outf] = util.run_calculation(molpro_keys, pwd, nacme_dir, nacme_input, index) # Submit child calculation
if norm_term:
data.extract_nacme(nacme_outf, molpro_keys['nacme_regex'], index, axes_index)
else:
with open('output.log', 'w+') as f:
f.write('NACME Crashed for GP%s - axis %s' % (index, axes_index))
def run(grid_point):
index, geom = grid_point
if inputs['code'] == 'molpro':
[spe_dir, spe_input] = molpro.setup_spe(inputs, geom, pwd, index)
[norm_term, spe_outf] = util.run_calculation(molpro_keys, pwd, spe_dir, spe_input, index) # Run each parent calculation
if norm_term: # If parent calculation terminates normally - Extract data and continue with subsequent calculations for each point
data.extract_energies(spe_dir+spe_outf, inputs['spe'], molpro_keys['energy_regex'],
molpro_keys['cas_prog'], index)
if inputs['nacme'] == 'yes':
[nacme_dir, nacmes_inputs] = molpro.setup_nacme(inputs, geom, spe_dir, index)
nacmes_data = [(nacme_dir, nacme_inp, index) for nacme_inp in nacmes_inputs] # List of three tuples - each with three elements. Each tuple describes a child calculation to be run in parallel
nacme_pool = multiprocessing.Pool()
nacme_pool.map(par_nacmes, [nacme_input for nacme_input in nacmes_data]) # Run each calculation in list of tuples in parallel
if inputs['grad'] == 'yes':
pass
else:
with open('output.log', 'w+') as f:
f.write('SPE crashed for GP%s' % index)
elif inputs['code'] == 'molcas': # TO DO
pass
if __name__ == "__main__":
try:
pwd = os.getcwd() # parent dir
f = open(inp_geom, 'r')
ref_geom = np.genfromtxt(f, skip_header=2, usecols=(1, 2, 3), encoding=None)
f.close()
geom_list = coordinate_generator(ref_geom) # Generate nuclear coordinates
if inputs['code'] == 'molpro':
couplings = molpro.coupled_states(inputs['states'][-1])
elif inputs['code'] == 'molcas':
pass
data = setup.global_data(ref_geom, inputs['states'][-1], couplings, len(geom_list))
run_pool = multiprocessing.Pool()
run_pool.map(run, [(k, v) for k, v in enumerate(geom_list)]) # Run each parent calculation for each set of coordinates
except StopIteration:
print('Please ensure goemetry file is correct.')
Any insight on how to run these child calculations in parallel for each point would be a great help. I have seen some people suggest using multi-threading instead or to set daemon to false, although I am unsure if this is the best way to do this.
firstly I dont know why you have to run par_nacmes in paralel but if you have to you could:
a use threads to run them instead of processes
or b use multiprocessing.Process to run run however that would involve a lot of overhead so I personally wouldn't do it.
for a all you have to do is
replace
nacme_pool = multiprocessing.Pool()
nacme_pool.map(par_nacmes, [nacme_input for nacme_input in nacmes_data])
in run()
with
threads = []
for nacme_input in nacmes_data:
t = Thread(target=par_nacmes, args=(nacme_input,)); t.start()
threads.append(t)
for t in threads: t.join()
or if you dont care if the treads have finished or not
for nacme_input in nacmes_data:
t = Thread(target=par_nacmes, args=(nacme_input,)); t.start()

Function that multiprocesses another function

I'm performing analyses of time-series of simulations. Basically, it's doing the same tasks for every time steps. As there is a very high number of time steps, and as the analyze of each of them is independant, I wanted to create a function that can multiprocess another function. The latter will have arguments, and return a result.
Using a shared dictionnary and the lib concurrent.futures, I managed to write this :
import concurrent.futures as Cfut
def multiprocess_loop_grouped(function, param_list, group_size, Nworkers, *args):
# function : function that is running in parallel
# param_list : list of items
# group_size : size of the groups
# Nworkers : number of group/items running in the same time
# **param_fixed : passing parameters
manager = mlp.Manager()
dic = manager.dict()
executor = Cfut.ProcessPoolExecutor(Nworkers)
futures = [executor.submit(function, param, dic, *args)
for param in grouper(param_list, group_size)]
Cfut.wait(futures)
return [dic[i] for i in sorted(dic.keys())]
Typically, I can use it like this :
def read_file(files, dictionnary):
for file in files:
i = int(file[4:9])
#print(str(i))
if 'bz2' in file:
os.system('bunzip2 ' + file)
file = file[:-4]
dictionnary[i] = np.loadtxt(file)
os.system('bzip2 ' + file)
Map = np.array(multiprocess_loop_grouped(read_file, list_alti, Group_size, N_thread))
or like this :
def autocorr(x):
result = np.correlate(x, x, mode='full')
return result[result.size//2:]
def find_lambda_finger(indexes, dic, Deviation):
for i in indexes :
#print(str(i))
# Beach = Deviation[i,:] - np.mean(Deviation[i,:])
dic[i] = Anls.find_first_max(autocorr(Deviation[i,:]), valmax = True)
args = [Deviation]
Temp = Rescal.multiprocess_loop_grouped(find_lambda_finger, range(Nalti), Group_size, N_thread, *args)
Basically, it is working. But it is not working well. Sometimes it crashes. Sometimes it actually launches a number of python processes equal to Nworkers, and sometimes there is only 2 or 3 of them running at a time while I specified Nworkers = 15.
For example, a classic error I obtain is described in the following topic I raised : Calling matplotlib AFTER multiprocessing sometimes results in error : main thread not in main loop
What is the more Pythonic way to achieve what I want ? How can I improve the control this function ? How can I control more the number of running python process ?
One of the basic concepts for Python multi-processing is using queues. It works quite well when you have an input list that can be iterated and which does not need to be altered by the sub-processes. It also gives you a good control over all the processes, because you spawn the number you want, you can run them idle or stop them.
It is also a lot easier to debug. Sharing data explicitly is usually an approach that is much more difficult to setup correctly.
Queues can hold anything as they are iterables by definition. So you can fill them with filepath strings for reading files, non-iterable numbers for doing calculations or even images for drawing.
In your case a layout could look like that:
import multiprocessing as mp
import numpy as np
import itertools as it
def worker1(in_queue, out_queue):
#holds when nothing is available, stops when 'STOP' is seen
for a in iter(in_queue.get, 'STOP'):
#do something
out_queue.put({a: result}) #return your result linked to the input
def worker2(in_queue, out_queue):
for a in iter(in_queue.get, 'STOP'):
#do something differently
out_queue.put({a: result}) //return your result linked to the input
def multiprocess_loop_grouped(function, param_list, group_size, Nworkers, *args):
# your final result
result = {}
in_queue = mp.Queue()
out_queue = mp.Queue()
# fill your input
for a in param_list:
in_queue.put(a)
# stop command at end of input
for n in range(Nworkers):
in_queue.put('STOP')
# setup your worker process doing task as specified
process = [mp.Process(target=function,
args=(in_queue, out_queue), daemon=True) for x in range(Nworkers)]
# run processes
for p in process:
p.start()
# wait for processes to finish
for p in process:
p.join()
# collect your results from the calculations
for a in param_list:
result.update(out_queue.get())
return result
temp = multiprocess_loop_grouped(worker1, param_list, group_size, Nworkers, *args)
map = multiprocess_loop_grouped(worker2, param_list, group_size, Nworkers, *args)
It can be made a bit more dynamic when you are afraid that your queues will run out of memory. Than you need to fill and empty the queues while the processes are running. See this example here.
Final words: it is not more Pythonic as you requested. But it is easier to understand for a newbie ;-)

How to utilize all cores with python multiprocessing

I have been fiddling with Python's multiprocessing functionality for upwards of an hour now, trying to parallelize a rather complex graph traversal function using multiprocessing.Process and multiprocessing.Manager:
import networkx as nx
import csv
import time
from operator import itemgetter
import os
import multiprocessing as mp
cutoff = 1
exclusionlist = ["cpd:C00024"]
DG = nx.read_gml("KeggComplete.gml", relabel=True)
for exclusion in exclusionlist:
DG.remove_node(exclusion)
# checks if 'memorizedPaths exists, and if not, creates it
fn = os.path.join(os.path.dirname(__file__),
'memorizedPaths' + str(cutoff+1))
if not os.path.exists(fn):
os.makedirs(fn)
manager = mp.Manager()
memorizedPaths = manager.dict()
filepaths = manager.dict()
degreelist = sorted(DG.degree_iter(),
key=itemgetter(1),
reverse=True)
def _all_simple_paths_graph(item, DG, cutoff, memorizedPaths, filepaths):
source = item[0]
uniqueTreePaths = []
if cutoff < 1:
return
visited = [source]
stack = [iter(DG[source])]
while stack:
children = stack[-1]
child = next(children, None)
if child is None:
stack.pop()
visited.pop()
elif child in memorizedPaths:
for path in memorizedPaths[child]:
newPath = (tuple(visited) + tuple(path))
if (len(newPath) <= cutoff) and
(len(set(visited) & set(path)) == 0):
uniqueTreePaths.append(newPath)
continue
elif len(visited) < cutoff:
if child not in visited:
visited.append(child)
stack.append(iter(DG[child]))
if visited not in uniqueTreePaths:
uniqueTreePaths.append(tuple(visited))
else: # len(visited) == cutoff:
if (visited not in uniqueTreePaths) and
(child not in visited):
uniqueTreePaths.append(tuple(visited + [child]))
stack.pop()
visited.pop()
# writes the absolute path of the node path file into the hash table
filepaths[source] = str(fn) + "/" + str(source) + "path.txt"
with open (filepaths[source], "wb") as csvfile2:
writer = csv.writer(csvfile2, delimiter=" ", quotechar="|")
for path in uniqueTreePaths:
writer.writerow(path)
memorizedPaths[source] = uniqueTreePaths
############################################################################
if __name__ == '__main__':
start = time.clock()
for item in degreelist:
test = mp.Process(target=_all_simple_paths_graph,
args=(DG, cutoff, item, memorizedPaths, filepaths))
test.start()
test.join()
end = time.clock()
print (end-start)
Currently - though luck and magic - it works (sort of). My problem is I'm only using 12 of my 24 cores.
Can someone explain why this might be the case? Perhaps my code isn't the best multiprocessing solution, or is it a feature of my architecture Intel Xeon CPU E5-2640 # 2.50GHz x18 running on Ubuntu 13.04 x64?
EDIT:
I managed to get:
p = mp.Pool()
for item in degreelist:
p.apply_async(_all_simple_paths_graph,
args=(DG, cutoff, item, memorizedPaths, filepaths))
p.close()
p.join()
Working, however, it's VERY SLOW! So I assume I'm using the wrong function for the job. hopefully it helps clarify exactly what I'm trying to accomplish!
EDIT2: .map attempt:
partialfunc = partial(_all_simple_paths_graph,
DG=DG,
cutoff=cutoff,
memorizedPaths=memorizedPaths,
filepaths=filepaths)
p = mp.Pool()
for item in processList:
processVar = p.map(partialfunc, xrange(len(processList)))
p.close()
p.join()
Works, is slower than singlecore. Time to optimize!
Too much piling up here to address in comments, so, where mp is multiprocessing:
mp.cpu_count() should return the number of processors. But test it. Some platforms are funky, and this info isn't always easy to get. Python does the best it can.
If you start 24 processes, they'll do exactly what you tell them to do ;-) Looks like mp.Pool() would be most convenient for you. You pass the number of processes you want to create to its constructor. mp.Pool(processes=None) will use mp.cpu_count() for the number of processors.
Then you can use, for example, .imap_unordered(...) on your Pool instance to spread your degreelist across processes. Or maybe some other Pool method would work better for you - experiment.
If you can't bash the problem into Pool's view of the world, you could instead create an mp.Queue to create a work queue, .put()'ing nodes (or slices of nodes, to reduce overhead) to work on in the main program, and write the workers to .get() work items off that queue. Ask if you need examples. Note that you need to put sentinel values (one per process) on the queue, after all the "real" work items, so that worker processes can test for the sentinel to know when they're done.
FYI, I like queues because they're more explicit. Many others like Pools better because they're more magical ;-)
Pool Example
Here's an executable prototype for you. This shows one way to use imap_unordered with Pool and chunksize that doesn't require changing any function signatures. Of course you'll have to plug in your real code ;-) Note that the init_worker approach allows passing "most of" the arguments only once per processor, not once for every item in your degreeslist. Cutting the amount of inter-process communication can be crucial for speed.
import multiprocessing as mp
def init_worker(mps, fps, cut):
global memorizedPaths, filepaths, cutoff
global DG
print "process initializing", mp.current_process()
memorizedPaths, filepaths, cutoff = mps, fps, cut
DG = 1##nx.read_gml("KeggComplete.gml", relabel = True)
def work(item):
_all_simple_paths_graph(DG, cutoff, item, memorizedPaths, filepaths)
def _all_simple_paths_graph(DG, cutoff, item, memorizedPaths, filepaths):
pass # print "doing " + str(item)
if __name__ == "__main__":
m = mp.Manager()
memorizedPaths = m.dict()
filepaths = m.dict()
cutoff = 1 ##
# use all available CPUs
p = mp.Pool(initializer=init_worker, initargs=(memorizedPaths,
filepaths,
cutoff))
degreelist = range(100000) ##
for _ in p.imap_unordered(work, degreelist, chunksize=500):
pass
p.close()
p.join()
I strongly advise running this exactly as-is, so you can see that it's blazing fast. Then add things to it a bit a time, to see how that affects the time. For example, just adding
memorizedPaths[item] = item
to _all_simple_paths_graph() slows it down enormously. Why? Because the dict gets bigger and bigger with each addition, and this process-safe dict has to be synchronized (under the covers) among all the processes. The unit of synchronization is "the entire dict" - there's no internal structure the mp machinery can exploit to do incremental updates to the shared dict.
If you can't afford this expense, then you can't use a Manager.dict() for this. Opportunities for cleverness abound ;-)

Categories

Resources