Python Multiprocessing cannot Join for Large Data Set - python

I am trying to compute the cosine similarity of a 260774x96 data matrix. I first passed the full-size DataFrame to cdist, using this code:
similarities = cdist(Dataframe, Dataframe, metric='cosine')
However, the Jupyter Notebook runs out of memory (64 GB) and the kernel crashes. So I decided to compute one row against the DataFrame at a time, then sort, trim, and save the result in each loop iteration. The problem is that this takes far too long, about 12 hours.
So I applied multiprocessing to speed it up. It works for small data, for example computing a single row against a 500-row DataFrame. However, when computing a single row against the full-size DataFrame, the processes run and put data on the queue, but join() never returns. The main function keeps running even though the child processes have ended with no error. When I print the queue in another cell, it holds the full data.
Another problem is that the input data queue cannot take the full-size DataFrame; it raises a Full error.
I don't know what the problem is; the program just hangs at the join() step. Or is there another practical way to do this computation in parallel?
from multiprocess import Lock, Process, Queue, current_process, Array
from time import time
import queue  # imported for using queue.Empty exception
import numpy as np
from scipy.spatial.distance import cdist
from scipy.spatial import distance
# from cos_process import cos_process
from sklearn.metrics import pairwise_distances

def cos_process(idx, result_queue, index_queue, df_data_only_numpy):
    print(f"process start {current_process().name}")
    # to shrink the data size, so that the multiprocessing join can work.
    # df_data_only_numpy = df_data_only_numpy[:500]
    while True:
        try:
            task_index_list = index_queue.get_nowait()
        except queue.Empty:
            break
        else:
            cosine_dict = {}
            for i in task_index_list:
                similarities = cdist([df_data_only_numpy[i]], df_data_only_numpy, metric='cosine')
                sorted_save_data = np.sort(similarities[0])[:20]  # the sorted cosine similarity values
                sorted_save_key = np.argsort(similarities[0])[:20]  # the indices of the sorted similarities
                # make a dictionary {index:{index:cosine,index:cosine},....}
                cosine_dict[i] = {int(i): float(data) if data == data else data for i, data in zip(sorted_save_key, sorted_save_data)}
            # print(cosine_dict)
            try:
                # put the dictionary data to queue
                result_queue.put_nowait(cosine_dict)
            except queue.Full:
                print("queue full in process")
    print(f"end process {current_process().name}")
    return True

def main():
    number_of_processes = 10  # create 10 processes for the computation
    process_compute_time = 1000  # each process handles 1000 rows per pass of the while loop
    df_data_only_numpy = df_data_only.to_numpy()
    index_queue = Queue()  # index of the df
    result_queue = Queue()  # store the output of the similarity
    processes = []
    # for idx in df_data_only.T:
    # only test 1000 rows data in this case, the dataframe has 260774 rows and 96 columns
    for idx in range(0, 1000, process_compute_time):
        try:
            index_queue.put_nowait(list(range(idx, idx + process_compute_time)))
        except queue.Full:
            print("full")
    print("creating process")
    # creating processes
    for w in range(number_of_processes):
        # the first argument (idx) is unused by cos_process; pass the worker number
        p = Process(target=cos_process, args=(w, result_queue, index_queue, df_data_only_numpy))
        processes.append(p)
        p.start()
    # completing process
    for p in processes:
        print("before join")
        p.join()
        print("finish join")
    # print the output
    while not result_queue.empty():
        print(result_queue.get())
    return True

if __name__ == '__main__':
    main()
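One documented behaviour of multiprocessing queues may explain the hang: a process that has put items on a Queue does not exit until its buffered items have been flushed to the underlying pipe, so calling join() before the parent has read the results can block once the results are large. The sketch below is a minimal illustration under stated assumptions, not the asker's code: it uses the standard-library multiprocessing module, small random data in place of the real DataFrame, and a normalised dot product in place of cdist. The names top_k_for_rows, worker and the None sentinel are made up for the example. The key point is that the parent drains the result queue before joining.

```python
import multiprocessing as mp

import numpy as np


def top_k_for_rows(unit, rows, k=20):
    """Return {row: {neighbour_index: cosine_distance}} for the k nearest rows.

    `unit` must already have unit-length rows (an assumption of this sketch).
    """
    dist = 1.0 - unit[rows] @ unit.T            # shape (len(rows), n_rows)
    order = np.argsort(dist, axis=1)[:, :k]     # k smallest distances per row
    return {
        int(r): {int(j): float(dist[n, j]) for j in order[n]}
        for n, r in enumerate(rows)
    }


def worker(unit, index_queue, result_queue):
    # Blocking get() plus a None sentinel replaces the get_nowait() loop.
    for rows in iter(index_queue.get, None):
        result_queue.put(top_k_for_rows(unit, rows))


def main(n_rows=2000, n_cols=96, n_procs=4, chunk=250):
    data = np.random.default_rng(0).random((n_rows, n_cols))
    unit = data / np.linalg.norm(data, axis=1, keepdims=True)

    index_queue, result_queue = mp.Queue(), mp.Queue()
    chunks = [list(range(i, min(i + chunk, n_rows)))
              for i in range(0, n_rows, chunk)]
    for rows in chunks:
        index_queue.put(rows)
    for _ in range(n_procs):
        index_queue.put(None)                   # one sentinel per worker

    procs = [mp.Process(target=worker, args=(unit, index_queue, result_queue))
             for _ in range(n_procs)]
    for p in procs:
        p.start()

    results = {}
    for _ in chunks:                            # drain the queue BEFORE join()
        results.update(result_queue.get())
    for p in procs:
        p.join()
    return results


if __name__ == "__main__":
    print(len(main()))
```

Because exactly one result is read per submitted chunk, the parent knows when everything has arrived without polling empty(), and the workers can then exit and be joined.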

Related

Queue (multiprocessing) where you can get each value twice

Is there any option to have a multiprocessing Queue where each value can be accessed twice?
My problem is that I have one "Generator process" creating a constant flux of data, and I would like to access this data in two different processes, each doing its own thing with it.
A minimal "example" of the issue.
import multiprocessing as mp
import numpy as np

class Process1(mp.Process):
    def __init__(self, Data_Queue):
        mp.Process.__init__(self)
        self.Data_Queue = Data_Queue

    def run(self):
        while True:
            self.Data_Queue.get()
            # Do stuff with
            self.Data_Queue.task_done()

class Process2(mp.Process):
    def __init__(self, Data_Queue):
        mp.Process.__init__(self)
        self.Data_Queue = Data_Queue

    def run(self):
        while True:
            self.Data_Queue.get()
            # Do stuff with
            self.Data_Queue.task_done()

if __name__ == "__main__":
    data_Queue = mp.JoinableQueue()  # task_done() is only available on JoinableQueue
    P1 = Process1(data_Queue)
    P1.start()
    P2 = Process2(data_Queue)
    P2.start()
    while True:  # Generate data
        data_Queue.put(np.random.rand(1000))
The idea is that I would like both Process1 and Process2 to access all of the generated data in this example. As written, each one only gets some random portion of it.
Thanks for the help!
Update 1: As pointed out in some of the questions and answers, this becomes a little more complicated for two reasons I did not include in the initial question.
The data is generated externally on a non-constant schedule (I may receive tons of data for a few seconds, then wait minutes for more to come).
As such, data may arrive faster than it can be processed, so it needs to be "queued" in some way while it waits for its turn to be processed.
One way to solve your problem is, first, to use multiprocessing.Array to share, let's say, a numpy array with your data between worker processes. Second, use a multiprocessing.Barrier to synchronize the main process and the workers when generating and processing data batches. And, finally, provide each process worker with its own queue to signal them when the next data batch is ready for processing. Below is the complete working example just to show you the idea:
#!/usr/bin/env python3
import os
import time
import ctypes
import multiprocessing as mp
import numpy as np

WORKERS = 5
DATA_SIZE = 10
DATA_BATCHES = 10

def process_data(data, queue, barrier):
    proc = os.getpid()
    print(f'[WORKER: {proc}] Started')
    while True:
        data_batch = queue.get()
        if data_batch is None:
            break
        arr = np.frombuffer(data.get_obj())
        print(f'[WORKER: {proc}] Started processing data {arr}')
        time.sleep(np.random.randint(0, 2))
        print(f'[WORKER: {proc}] Finished processing data {arr}')
        barrier.wait()
    print(f'[WORKER: {proc}] Finished')

def generate_data_array(i):
    print(f'[DATA BATCH: {i}] Start generating data... ', end='')
    time.sleep(np.random.randint(0, 2))
    data = np.random.randint(0, 10, size=DATA_SIZE)
    print(f'Done! {data}')
    return data

if __name__ == '__main__':
    data = mp.Array(ctypes.c_double, DATA_SIZE)
    data_barrier = mp.Barrier(WORKERS + 1)
    workers = []
    # Start workers:
    for _ in range(WORKERS):
        data_queue = mp.Queue()
        p = mp.Process(target=process_data, args=(data, data_queue, data_barrier))
        p.start()
        workers.append((p, data_queue))
    # Generate data batches in the main process:
    for i in range(DATA_BATCHES):
        arr = generate_data_array(i + 1)
        data_arr = np.frombuffer(data.get_obj())
        np.copyto(data_arr, arr)
        for _, data_queue in workers:
            # Signal workers that the new data batch is ready:
            data_queue.put(True)
        data_barrier.wait()
    # Stop workers:
    for worker, data_queue in workers:
        data_queue.put(None)
        worker.join()
Here, you start with the definition of the shared data array data and the barrier data_barrier used for process synchronization. Then, in the loop, you instantiate a queue data_queue, then create and start a worker process p, passing it the shared data array, the queue instance, and the shared barrier instance data_barrier as its parameters. Once the workers have been started, you generate data batches in the loop, copy each generated numpy array into the shared data array, and signal the processes via their queues that the next data batch is ready for processing. Then you wait on the barrier until all the worker processes have finished their work before generating the next data batch. In the end, you send a None signal to all the processes to make them quit the infinite processing loop.
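Given the constraints in Update 1 (bursty input that may arrive faster than it can be processed), a simpler variant may also be worth considering: give each consumer its own queue and have the producer put every item on all of them, so each item waits in each queue until that consumer is ready. The following is only a sketch of that idea; fan_out, consumer and the None stop signal are names chosen for this example, and plain Python lists stand in for the numpy batches.

```python
import multiprocessing as mp
import random


def fan_out(item, queues):
    """Put the same item on every consumer queue."""
    for q in queues:
        q.put(item)


def consumer(name, in_queue, out_queue):
    total = 0.0
    while True:
        batch = in_queue.get()          # blocks until the next batch arrives
        if batch is None:               # None is the stop signal
            break
        total += sum(batch)             # stand-in for the real processing
    out_queue.put((name, total))


def main(n_batches=5, batch_size=1000):
    queues = [mp.Queue(), mp.Queue()]   # one queue per consumer
    results = mp.Queue()
    procs = [mp.Process(target=consumer, args=(f"P{i + 1}", q, results))
             for i, q in enumerate(queues)]
    for p in procs:
        p.start()

    rng = random.Random(0)
    for _ in range(n_batches):
        fan_out([rng.random() for _ in range(batch_size)], queues)
    fan_out(None, queues)               # tell every consumer to stop

    totals = dict(results.get() for _ in procs)   # read results before join()
    for p in procs:
        p.join()
    return totals


if __name__ == "__main__":
    print(main())
```

Every consumer sees every batch, and a slow consumer only delays itself, at the cost of each item being pickled once per queue.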

Python multiprocessing write to file with starmap_async()

I'm currently setting up an automated simulation pipeline for OpenFOAM (a CFD library) using the PyFoam library within Python to create a large database for machine learning purposes. The database will have around 500k distinct simulations. To run this pipeline on multiple machines, I'm using multiprocessing.Pool.starmap_async(args), which continually starts a new simulation once the previous one has completed.
However, since some of the simulations might or will crash, I want to generate a text file listing all cases that have crashed.
I've already found this thread, which uses multiprocessing.Manager.Queue() and adds a listener, but I failed to get it running with starmap_async(). For testing, I'm trying to write the case name of every completed simulation, but currently only one entry is written to the text file instead of all of them (the simulations all complete successfully).
So my question is: how can I write a message to a file for each simulation that has completed?
The current code layout looks roughly like this. Only the important snippet is included, as the remaining code can't be run without OpenFOAM and additional custom scripts that were created for the automation.
Any help is highly appreciated! :)
from PyFoam.Execution.BasicRunner import BasicRunner
from PyFoam.Execution.ParallelExecution import LAMMachine
import numpy as np
import multiprocessing
import itertools
import psutil

# Defining global variables
manager = multiprocessing.Manager()
queue = manager.Queue()

def runCase(airfoil, angle, velocity):
    # define simulation name
    newCase = str(airfoil) + "_" + str(angle) + "_" + str(velocity)
    '''
    A lot of pre-processing commands to prepare the simulation
    which has been removed from snipped such as generate geometry, create mesh etc...
    '''
    # run simulation
    machine = LAMMachine(nr=4)  # set number of cores for parallel execution
    simulation = BasicRunner(argv=[solver, "-case", case.name], silent=True, lam=machine, logname="solver")
    simulation.start()  # start simulation
    # check if simulation has completed
    if simulation.runOK():
        # write message into queue
        queue.put(newCase)
    if not simulation.runOK():
        print("Simulation did not run successfully")

def listener(queue):
    fname = 'errors.txt'
    msg = queue.get()
    while True:
        with open(fname, 'w') as f:
            if msg == 'complete':
                break
            f.write(str(msg) + '\n')

def main():
    # Create parameter list
    angles = np.arange(-5, 0, 1)
    machs = np.array([0.15])
    nacas = ['0012']
    paramlist = list(itertools.product(nacas, angles, np.round(machs, 9)))
    # create number of processes and keep 2 cores idle for other processes
    nCores = psutil.cpu_count(logical=False) - 2
    nProc = 4
    nProcs = int(nCores / nProc)
    with multiprocessing.Pool(processes=nProcs) as pool:
        pool.apply_async(listener, (queue,))  # start the listener
        pool.starmap_async(runCase, paramlist).get()  # run parallel simulations
    queue.put('complete')
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
First, when your with multiprocessing.Pool(processes=nProcs) as pool: block exits, there will be an implicit call to pool.terminate(), which will kill all pool processes and, with them, any running or queued-up tasks. There is no point in calling queue.put('complete') since nobody is listening.
Second, your "listener" task gets only a single message from the queue. If it is "complete", it terminates immediately. If it is something else, it just loops continuously, writing the same message to the output file. This cannot be right, can it? Did you forget an additional call to queue.get() in your loop?
Third, I do not quite follow your computation for nProcs. Why the division by 4? If you had 5 physical processors nProcs would be computed as 0. Do you mean something like:
nProcs = psutil.cpu_count(logical=False) // 4
if nProcs == 0:
    nProcs = 1
elif nProcs > 1:
    nProcs -= 1  # Leave a core free
Fourth, why do you need a separate "listener" task? Have your runCase task return whatever message is appropriate according to how it completes back to the main process. In the code below, multiprocessing.pool.Pool.imap is used so that results can be processed as the tasks complete and results returned:
from PyFoam.Execution.BasicRunner import BasicRunner
from PyFoam.Execution.ParallelExecution import LAMMachine
import numpy as np
import multiprocessing
import itertools
import psutil

def runCase(tpl):
    # Unpack tuple:
    airfoil, angle, velocity = tpl
    # define simulation name
    newCase = str(airfoil) + "_" + str(angle) + "_" + str(velocity)
    ...  # Code omitted for brevity
    # check if simulation has completed
    if simulation.runOK():
        return ''  # No error
    # Simulation did not run successfully:
    return f"Simulation {newCase} did not run successfully"

def main():
    # Create parameter list
    angles = np.arange(-5, 0, 1)
    machs = np.array([0.15])
    nacas = ['0012']
    # There is no reason to convert this into a list; it
    # can be lazily computed:
    paramlist = itertools.product(nacas, angles, np.round(machs, 9))
    # create number of processes and keep 1 core idle for main process
    nCores = psutil.cpu_count(logical=False) - 1
    nProc = 4
    nProcs = max(1, nCores // nProc)  # never ask for a pool of 0 processes
    with multiprocessing.Pool(processes=nProcs) as pool:
        with open('errors.txt', 'w') as f:
            # Process message results as soon as the task ends.
            # Use method imap_unordered if you do not care about the order
            # of the messages in the output.
            # imap passes a single argument per task; each element of
            # paramlist is already an (airfoil, angle, velocity) tuple:
            for msg in pool.imap(runCase, paramlist):
                if msg != '':  # Error completion
                    print(msg)
                    print(msg, file=f)
        pool.close()  # join() requires the pool to be closed first
        pool.join()  # Not really necessary here

if __name__ == '__main__':
    main()
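If a separate listener is still preferred, the second point above suggests what its loop would need: a fresh queue.get() on every pass, with the file opened once rather than once per message. The following is a minimal sketch of that loop only, assuming the 'complete' sentinel from the question. It is exercised here with the standard-library queue.Queue; a multiprocessing.Manager().Queue() proxy offers the same get() and put() methods.

```python
import os
import queue
import tempfile


def listener(msg_queue, fname="errors.txt"):
    """Write every message to fname until the 'complete' sentinel arrives."""
    with open(fname, "w") as f:          # open once, not once per message
        while True:
            msg = msg_queue.get()        # fetch a NEW message on every pass
            if msg == "complete":
                break
            f.write(str(msg) + "\n")
            f.flush()                    # keep the file current during long runs


if __name__ == "__main__":
    q = queue.Queue()
    for case in ["0012_-5_0.15", "0012_-4_0.15", "complete"]:
        q.put(case)
    path = os.path.join(tempfile.mkdtemp(), "errors.txt")
    listener(q, path)
    with open(path) as f:
        print(f.read())
```

In a pool-based setup the sentinel would have to be put on the queue while the pool is still alive, that is, inside the with block and before it exits.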

Nesting parallel processes using multiprocessing

Is there a way to run a function in parallel within an already parallelised function? I know that using multiprocessing.Pool() this is not possible as a daemonic process can not create a child process. I am fairly new to parallel computing and am struggling to find a workaround.
I currently have several thousand calculations that need to be run in parallel using a commercially available quantum mechanical code that I interface to. Each calculation has three subsequent calculations that need to be executed in parallel on normal termination of the parent calculation. If the parent calculation does not terminate normally, that is the end of the calculation for that point. I could always combine these three subsequent calculations into one big calculation and run it normally, although I would much prefer to run them separately in parallel.
Main currently looks like this, run() is the parent calculation that is first run in parallel for a series of points, and par_nacmes() is the function that I want to run in parallel for three child calculations following normal termination of the parent.
def par_nacmes(nacme_input_data):
    nacme_dir, nacme_input, index = nacme_input_data  # Unpack info in tuple for the calculation
    axes_index = get_axis_index(nacme_input)
    [norm_term, nacme_outf] = util.run_calculation(molpro_keys, pwd, nacme_dir, nacme_input, index)  # Submit child calculation
    if norm_term:
        data.extract_nacme(nacme_outf, molpro_keys['nacme_regex'], index, axes_index)
    else:
        with open('output.log', 'w+') as f:
            f.write('NACME Crashed for GP%s - axis %s' % (index, axes_index))

def run(grid_point):
    index, geom = grid_point
    if inputs['code'] == 'molpro':
        [spe_dir, spe_input] = molpro.setup_spe(inputs, geom, pwd, index)
        [norm_term, spe_outf] = util.run_calculation(molpro_keys, pwd, spe_dir, spe_input, index)  # Run each parent calculation
        if norm_term:  # If parent calculation terminates normally - Extract data and continue with subsequent calculations for each point
            data.extract_energies(spe_dir+spe_outf, inputs['spe'], molpro_keys['energy_regex'],
                                  molpro_keys['cas_prog'], index)
            if inputs['nacme'] == 'yes':
                [nacme_dir, nacmes_inputs] = molpro.setup_nacme(inputs, geom, spe_dir, index)
                nacmes_data = [(nacme_dir, nacme_inp, index) for nacme_inp in nacmes_inputs]  # List of three tuples - each with three elements. Each tuple describes a child calculation to be run in parallel
                nacme_pool = multiprocessing.Pool()
                nacme_pool.map(par_nacmes, [nacme_input for nacme_input in nacmes_data])  # Run each calculation in list of tuples in parallel
            if inputs['grad'] == 'yes':
                pass
        else:
            with open('output.log', 'w+') as f:
                f.write('SPE crashed for GP%s' % index)
    elif inputs['code'] == 'molcas':  # TO DO
        pass

if __name__ == "__main__":
    try:
        pwd = os.getcwd()  # parent dir
        f = open(inp_geom, 'r')
        ref_geom = np.genfromtxt(f, skip_header=2, usecols=(1, 2, 3), encoding=None)
        f.close()
        geom_list = coordinate_generator(ref_geom)  # Generate nuclear coordinates
        if inputs['code'] == 'molpro':
            couplings = molpro.coupled_states(inputs['states'][-1])
        elif inputs['code'] == 'molcas':
            pass
        data = setup.global_data(ref_geom, inputs['states'][-1], couplings, len(geom_list))
        run_pool = multiprocessing.Pool()
        run_pool.map(run, [(k, v) for k, v in enumerate(geom_list)])  # Run each parent calculation for each set of coordinates
    except StopIteration:
        print('Please ensure goemetry file is correct.')
Any insight on how to run these child calculations in parallel for each point would be a great help. I have seen some people suggest using multi-threading instead or to set daemon to false, although I am unsure if this is the best way to do this.
First, I don't know why you need to run par_nacmes in parallel, but if you do, you could:
a) use threads instead of processes to run them, or
b) use multiprocessing.Process to run run(); however, that would involve a lot of overhead, so I personally wouldn't do it.
For a), all you have to do is replace
nacme_pool = multiprocessing.Pool()
nacme_pool.map(par_nacmes, [nacme_input for nacme_input in nacmes_data])
in run() with
from threading import Thread  # at the top of the module

threads = []
for nacme_input in nacmes_data:
    t = Thread(target=par_nacmes, args=(nacme_input,))
    t.start()
    threads.append(t)
for t in threads:
    t.join()
or, if you don't care whether the threads have finished:
for nacme_input in nacmes_data:
    t = Thread(target=par_nacmes, args=(nacme_input,))
    t.start()
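The thread-based option can also be written with concurrent.futures.ThreadPoolExecutor, which handles the start/join bookkeeping and returns results in input order. This is a sketch only: run_child is a hypothetical stand-in for par_nacmes, and the sleep stands in for waiting on the external program. It assumes the child calculations spend their time waiting on that external code, which is the situation where threads inside a pool worker are a reasonable fit.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_child(job):
    """Hypothetical stand-in for par_nacmes: wait on external work, report status."""
    nacme_dir, nacme_input, index = job
    time.sleep(0.05)                     # stands in for the external calculation
    return nacme_input, True             # (input name, terminated normally)


def run_children(jobs, max_workers=3):
    # The with block waits for all submitted jobs before returning.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_child, jobs))


if __name__ == "__main__":
    jobs = [("gp0", f"axis{i}.inp", 0) for i in range(3)]
    print(run_children(jobs))
```

Called from inside run(), this would not hit the restriction on daemonic processes creating children, since no new process is created by the pool worker itself.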

Function that multiprocesses another function

I'm analysing time series from simulations. Basically, the same tasks are done for every time step. As there is a very large number of time steps, and as the analysis of each one is independent, I wanted to create a function that can multiprocess another function. The latter takes arguments and returns a result.
Using a shared dictionary and the concurrent.futures library, I managed to write this:
import concurrent.futures as Cfut
import multiprocessing as mlp

def multiprocess_loop_grouped(function, param_list, group_size, Nworkers, *args):
    # function : function that is running in parallel
    # param_list : list of items
    # group_size : size of the groups
    # Nworkers : number of groups/items running at the same time
    # *args : extra parameters passed on to function
    # (grouper, which splits param_list into groups, is defined elsewhere)
    manager = mlp.Manager()
    dic = manager.dict()
    executor = Cfut.ProcessPoolExecutor(Nworkers)
    futures = [executor.submit(function, param, dic, *args)
               for param in grouper(param_list, group_size)]
    Cfut.wait(futures)
    return [dic[i] for i in sorted(dic.keys())]
Typically, I can use it like this :
def read_file(files, dictionnary):
    for file in files:
        i = int(file[4:9])
        #print(str(i))
        if 'bz2' in file:
            os.system('bunzip2 ' + file)
            file = file[:-4]
        dictionnary[i] = np.loadtxt(file)
        os.system('bzip2 ' + file)

Map = np.array(multiprocess_loop_grouped(read_file, list_alti, Group_size, N_thread))
or like this :
def autocorr(x):
    result = np.correlate(x, x, mode='full')
    return result[result.size//2:]

def find_lambda_finger(indexes, dic, Deviation):
    for i in indexes:
        #print(str(i))
        # Beach = Deviation[i,:] - np.mean(Deviation[i,:])
        dic[i] = Anls.find_first_max(autocorr(Deviation[i,:]), valmax = True)

args = [Deviation]
Temp = Rescal.multiprocess_loop_grouped(find_lambda_finger, range(Nalti), Group_size, N_thread, *args)
Basically, it works, but not well. Sometimes it crashes. Sometimes it launches a number of Python processes equal to Nworkers, and sometimes only 2 or 3 of them run at a time even though I specified Nworkers = 15.
For example, a classic error I get is described in the following topic I raised: Calling matplotlib AFTER multiprocessing sometimes results in error : main thread not in main loop
What is the more Pythonic way to achieve what I want? How can I get better control over this function, and over the number of running Python processes?
One of the basic concepts of Python multiprocessing is using queues. It works quite well when you have an input list that can be iterated over and that does not need to be altered by the sub-processes. It also gives you good control over all the processes, because you spawn the number you want and can let them idle or stop them.
It is also a lot easier to debug. Sharing data explicitly is usually much more difficult to set up correctly.
Queues can hold any picklable object. So you can fill them with file path strings for reading files, numbers for doing calculations, or even images for drawing.
In your case a layout could look like this:
import multiprocessing as mp
import numpy as np
import itertools as it

def worker1(in_queue, out_queue):
    # blocks when nothing is available, stops when 'STOP' is seen
    for a in iter(in_queue.get, 'STOP'):
        result = ...  # do something
        out_queue.put({a: result})  # return your result linked to the input

def worker2(in_queue, out_queue):
    for a in iter(in_queue.get, 'STOP'):
        result = ...  # do something differently
        out_queue.put({a: result})  # return your result linked to the input

def multiprocess_loop_grouped(function, param_list, group_size, Nworkers, *args):
    # your final result
    result = {}
    in_queue = mp.Queue()
    out_queue = mp.Queue()
    # fill your input
    for a in param_list:
        in_queue.put(a)
    # stop command at end of input
    for n in range(Nworkers):
        in_queue.put('STOP')
    # setup your worker process doing task as specified
    process = [mp.Process(target=function,
                          args=(in_queue, out_queue), daemon=True) for x in range(Nworkers)]
    # run processes
    for p in process:
        p.start()
    # collect your results from the calculations; read them before join(),
    # because a process cannot exit while its queued results are unread
    for a in param_list:
        result.update(out_queue.get())
    # wait for processes to finish
    for p in process:
        p.join()
    return result

temp = multiprocess_loop_grouped(worker1, param_list, group_size, Nworkers, *args)
Map = multiprocess_loop_grouped(worker2, param_list, group_size, Nworkers, *args)
It can be made a bit more dynamic if you are afraid that your queues will run out of memory. Then you need to fill and empty the queues while the processes are running. See this example here.
Final words: it is not more Pythonic as you requested. But it is easier to understand for a newbie ;-)
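For comparison, here is a sketch of a variant that stays with concurrent.futures but has the workers return their results instead of writing to a shared dictionary. It assumes each item can be analysed independently; parallel_map and first_peak_lag are illustrative names, max_workers caps the number of worker processes, and chunksize plays the role of group_size.

```python
import concurrent.futures as cfut

import numpy as np


def autocorr(x):
    result = np.correlate(x, x, mode="full")
    return result[result.size // 2:]


def first_peak_lag(row):
    """Illustrative analysis: lag (>= 1) with the largest autocorrelation."""
    ac = autocorr(row - row.mean())
    return int(np.argmax(ac[1:]) + 1)


def parallel_map(function, items, n_workers, group_size=1):
    # executor.map returns results in input order, so no sorting by key is needed.
    with cfut.ProcessPoolExecutor(max_workers=n_workers) as executor:
        return list(executor.map(function, items, chunksize=group_size))


if __name__ == "__main__":
    t = np.arange(200)
    deviation = np.array([np.sin(2 * np.pi * t / period) for period in (10, 20, 40)])
    print(parallel_map(first_peak_lag, deviation, n_workers=3))
```

The function passed in must be defined at module level so it can be pickled, and the call should sit under the `if __name__ == "__main__":` guard on platforms that start workers by spawning.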

Python multiprocessing on function with return value

I am trying to do principal component analysis on each element of an array. Each element of the array is a matrix. I have 500 such arrays, and each array contains around 50 matrices. Overall it is taking too much time, so I want to speed up the process using multiprocessing.
I've written the following code:
from multiprocessing import Pool

def princomp(A):
    A = A.T
    # computing eigenvalues and eigenvectors of covariance matrix
    M = (A-mean(A.T,axis=1)).T  # subtract the mean (along columns)
    [latent,coeff] = linalg.eig(cov(M))  # attention: not always sorted
    score = dot(coeff.T,M)  # projection of the data in the new space
    top_eigen = [i[0] for i in sorted(enumerate(latent), key=lambda x:x[1])]
    #output = random.rand(990,3)
    score = score.real
    output = []
    for a in range(0, 128):
        output += [score[top_eigen[a]]]
    return numpy.array(output)

def func1(...):
    ...
    ...
    for i in range(0, nframes):  # nframes is around 50
        fc_array[i] = normalize(fc_array[i], axis=1, norm='l1')  # fc_array[i] is a 2D numpy array
        pool = Pool()
        #ans_array[i] = princomp(fc_array[i])  # normal principal component analysis
        result = pool.apply_async(princomp, [fc_array[i]])  # principal component analysis using multiprocessing
        print "Frame No.: ", i
        ans_array[i] = result.get()
    pool.close()
    pool.join()
    return ans_array
It gets stuck inside the for loop of func1: the print statement is never executed. Am I even doing multiprocessing correctly?
I hope I'll get some help on this.
Thanks in advance.
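Whatever the cause of the hang, two things in the posted loop limit any speed-up: a new Pool is created for every frame, and result.get() is called right after apply_async, so only one frame is ever in flight. The sketch below shows one possible restructuring under stated assumptions: each frame is a 2D array with samples in rows and features in columns, "top" components means those with the largest eigenvalues, and pca_all_frames is a hypothetical helper name. It uses np.linalg.eigh, which suits a symmetric covariance matrix and returns real values.

```python
from multiprocessing import Pool

import numpy as np


def princomp(A, n_components=128):
    """Project the rows of A onto the leading principal components."""
    M = A - A.mean(axis=0)                         # centre each column
    latent, coeff = np.linalg.eigh(np.cov(M, rowvar=False))
    order = np.argsort(latent)[::-1][:n_components]  # largest eigenvalues first
    return M @ coeff[:, order]                     # shape (n_samples, n_kept)


def pca_all_frames(frames, n_components=128, processes=None):
    # One pool for all frames; starmap hands the frames out to the workers
    # and returns the results in the same order as the input.
    with Pool(processes) as pool:
        return pool.starmap(princomp, [(f, n_components) for f in frames])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = [rng.random((990, 200)) for _ in range(8)]
    scores = pca_all_frames(frames)
    print(len(scores), scores[0].shape)
```

The `if __name__ == "__main__":` guard matters on platforms that start worker processes by spawning, since each worker re-imports the main module.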
