I'm diving into the multiprocessing world in python.
After watching some videos I came up with a question due to the nature of my function.
This function takes 4 arguments:
The 1st argument is a file to be read, hence, this is a list of files to read.
The following 2 arguments are two different dictionaries.
The last argument is an optional argument "debug_mode" which is needed to be set to "True"
# process_data(file, signals_dict, parameter_dict, debug_mode=False)
file_list = [...]
t1 = time.time()
with concurrent.futures.ProcessPoolExecutor() as executor:
executor.map(process_data, file_list)
t2 = time.time()
The question is:
How can I specify the remaining parameters to the function?
Thanks in advance
ProcessPoolExecutor.map documentation is weak. The worker accepts a single parameter. If your target has a different call signature, you need to write an intermediate worker that is passed a container and knows how to expand that into the paramter list. The documention also fails to make it clear that you need to wait for the job to complete before closing the pool. If you start the jobs and exit the pool context with clause, the pool is terminated.
import concurrent.futures
import os
def process_data(a,b,c,d):
print(os.getpid(), a, b, c, d)
return a
def _process_data_worker(p):
return process_data(*p)
if __name__ == "__main__":
file_list = [["fooa", "foob", "fooc", "food"],
["bara", "barb", "barc", "bard"]]
with concurrent.futures.ProcessPoolExecutor() as executor:
results = executor.map(_process_data_worker, file_list)
for result in results:
print('result', result)
You need to create a list of lists containing parameters for each process:
params_list = [[file1, dict1_1, dict2_1, True],
[file2, dict1_2, dict2_2, True],
[file3, dict1_3, dict2_3]]
Then, you can create processes like this:
executor.map(process_data, params_list)
Related
I want to use Pool to split a task among n workers. What happens is that when I'm using map with one argument in the task function, I observe that all the cores are used, all tasks are launched simultaneously.
On the other hand, when I'm using starmap, task launch is one by one and I never reach 100% CPU load.
I want to use starmap for my case because I want to pass a second argument, but there's no use if it doesn't take advantage of multiprocessing.
This is the code that works
import numpy as np
from multiprocessing import Pool
# df_a = just a pandas dataframe which I split in n parts and I
# feed each part to a task. Each one may have a few
# thousand rows
n_jobs = 16
def run_parallel(df_a):
dfs_a = np.array_split(df_a, n_jobs)
print("done split")
pool = Pool(n_jobs)
result = pool.map(task_function, dfs_a)
return result
def task_function(left_df):
print("in task function")
# execute task...
return result
result = run_parallel(df_a)
in this case, "in task function" is printed at the same time, 16 times.
This is the code that doesn't work
n_jobs = 16
# df_b: a big pandas dataframe (~1.7M rows, ~20 columns) which I
# want to send to each task as is
def run_parallel(df_a, df_b):
dfs_a = np.array_split(df_a, n_jobs)
print("done split")
pool = Pool(n_jobs)
result = pool.starmap(task_function, zip(dfs_a, repeat(df_b)))
return result
def task_function(left_df, right_df):
print("in task function")
# execute task
return result
result = run_parallel(df_a, df_b)
Here, "in task function" is printed sequentially and the processors never reach 100% capacity. I also tried workarounds based on this answer:
https://stackoverflow.com/a/5443941/6941970
but no luck. Even when I used map in this way:
from functools import partial
pool.map(partial(task_function, b=df_b), dfs_a)
considering that maybe repeat(*very big df*) would introduce memory issues, still there wasn't any real parallelization
I have a multiprocessing code, and each process have to analyse same data differently.
The input data is always the same, it is not changeable.
Input data is a data frame 20 columns and 60k rows.
How to efficiently 'put' this data to each process?
On single process application I have used global variable, but in multiprocessing it's not working.
When I try to transfer this as a function argument, I have only the first element of the table
Welcome to Stack Overflow. You need to take the time and give reproducible minimal working examples to get specific answers and help the society in general.
Anyway, you shouldn't use global variables if you need to change them with each iteration/process/etc.
Multiprocessing works like that in rough easily-digestible terms:
import concurrent.futures
import glob
def manipulate_data_function(data):
result = torture_data(data)
return result
# ProcessPoolExecutor for CPU bound stuff
with concurrent.futures.ThreadPoolExecutor(max_workers = None) as executor:
futures = []
for file in glob.glob('*txt'):
futures.append(executor.submit(manipulate_data_function, data))
thank you for the answer, I don't change this date each iteration. I use the same data to each process, data how to change the data is given throw function argument
with concurrent.futures.ProcessPoolExecutor() as executor:
res = executor.map(goal_fcn, p)
for f in concurrent.futures.as_completed(res):
fp = res
and next
def goal_fcn(x):
return heavy_calculation(x, global_DataFrame, global_String)
EDIT:
it work with:
with concurrent.futures.ProcessPoolExecutor() as executor:
res = executor.map(goal_fcn, p, [global_DataFrame], [global_String])
for f in concurrent.futures.as_completed(res):
fp = res
def goal_fcn(x, DataFrame, String):
return heavy_calculation(x, DataFrame, String)
Each function (func1, etc) makes a request to a different url:
def thread_map(ID):
func_switch = \
{
0: func1,
1: func2,
2: func3,
3: func4
}
with ThreadPoolExecutor(max_workers=len(func_switch)) as threads:
futures = [threads.submit(func_switch[i], ID) for i in func_switch]
results = [f.result() for f in as_completed(futures)]
for df in results:
if not df.empty and df['x'][0] != '':
return df
return pd.DataFrame()
This is much faster (1.75 sec) compared to a for loop (4 sec), but the results are unordered.
How can each function be executed parallely while allowing to check the results by order of execution?
Preferably as background processes/threads returning the according dataframes starting with func1. So if the conditions for func1 are not met, check func2 and so on for the criteria given the results have already been fetched in the background. Each dataframe is different, but they all contain the same common column x.
Any suggestions are highly appreciated plus I hope ThreadPoolExecutor is appropriate for this scenario. Thanks!
First, let's do it as you are asking:
with ThreadPoolExecutor(max_workers=len(func_switch)) as threads:
futures = [threads.submit(func_switch[i], ID) for i in func_switch]
results = [f.result() for f in futures]
That was simple enough.
To process the futures as they are completed and place the results in the list in the futures were created, you need to associate with each future the order in which the future was created:
futures = {} # this time a dictionary
creation_order = 0
with ThreadPoolExecutor(max_workers=len(func_switch)) as threads:
for i in func_switch:
future = threads.submit(func_switch[i], ID)
futures[future] = creation_order # map the future to this value or any other values you want, such as the arguments being passed to the function, which happens to be the creation order
creation_order += 1
results = [None] * creation_order # preallocate results
for f in as_completed(futures):
result = f.result()
index = futures[f] # recover original creation_order:
results[index] = result
Of course, if you are waiting for all the futures to complete before you do anything with them, there is no point in using the as_completed method. I just wanted to show how if that weren't the case the method for associating the completed future back with the original creation order (or perhaps more useful, the original arguments used in the call to the worker function that created the future). An alternative is for the processing function to return the passed arguments as part of its result.
I'm performing analyses of time-series of simulations. Basically, it's doing the same tasks for every time steps. As there is a very high number of time steps, and as the analyze of each of them is independant, I wanted to create a function that can multiprocess another function. The latter will have arguments, and return a result.
Using a shared dictionnary and the lib concurrent.futures, I managed to write this :
import concurrent.futures as Cfut
def multiprocess_loop_grouped(function, param_list, group_size, Nworkers, *args):
# function : function that is running in parallel
# param_list : list of items
# group_size : size of the groups
# Nworkers : number of group/items running in the same time
# **param_fixed : passing parameters
manager = mlp.Manager()
dic = manager.dict()
executor = Cfut.ProcessPoolExecutor(Nworkers)
futures = [executor.submit(function, param, dic, *args)
for param in grouper(param_list, group_size)]
Cfut.wait(futures)
return [dic[i] for i in sorted(dic.keys())]
Typically, I can use it like this :
def read_file(files, dictionnary):
for file in files:
i = int(file[4:9])
#print(str(i))
if 'bz2' in file:
os.system('bunzip2 ' + file)
file = file[:-4]
dictionnary[i] = np.loadtxt(file)
os.system('bzip2 ' + file)
Map = np.array(multiprocess_loop_grouped(read_file, list_alti, Group_size, N_thread))
or like this :
def autocorr(x):
result = np.correlate(x, x, mode='full')
return result[result.size//2:]
def find_lambda_finger(indexes, dic, Deviation):
for i in indexes :
#print(str(i))
# Beach = Deviation[i,:] - np.mean(Deviation[i,:])
dic[i] = Anls.find_first_max(autocorr(Deviation[i,:]), valmax = True)
args = [Deviation]
Temp = Rescal.multiprocess_loop_grouped(find_lambda_finger, range(Nalti), Group_size, N_thread, *args)
Basically, it is working. But it is not working well. Sometimes it crashes. Sometimes it actually launches a number of python processes equal to Nworkers, and sometimes there is only 2 or 3 of them running at a time while I specified Nworkers = 15.
For example, a classic error I obtain is described in the following topic I raised : Calling matplotlib AFTER multiprocessing sometimes results in error : main thread not in main loop
What is the more Pythonic way to achieve what I want ? How can I improve the control this function ? How can I control more the number of running python process ?
One of the basic concepts for Python multi-processing is using queues. It works quite well when you have an input list that can be iterated and which does not need to be altered by the sub-processes. It also gives you a good control over all the processes, because you spawn the number you want, you can run them idle or stop them.
It is also a lot easier to debug. Sharing data explicitly is usually an approach that is much more difficult to setup correctly.
Queues can hold anything as they are iterables by definition. So you can fill them with filepath strings for reading files, non-iterable numbers for doing calculations or even images for drawing.
In your case a layout could look like that:
import multiprocessing as mp
import numpy as np
import itertools as it
def worker1(in_queue, out_queue):
#holds when nothing is available, stops when 'STOP' is seen
for a in iter(in_queue.get, 'STOP'):
#do something
out_queue.put({a: result}) #return your result linked to the input
def worker2(in_queue, out_queue):
for a in iter(in_queue.get, 'STOP'):
#do something differently
out_queue.put({a: result}) //return your result linked to the input
def multiprocess_loop_grouped(function, param_list, group_size, Nworkers, *args):
# your final result
result = {}
in_queue = mp.Queue()
out_queue = mp.Queue()
# fill your input
for a in param_list:
in_queue.put(a)
# stop command at end of input
for n in range(Nworkers):
in_queue.put('STOP')
# setup your worker process doing task as specified
process = [mp.Process(target=function,
args=(in_queue, out_queue), daemon=True) for x in range(Nworkers)]
# run processes
for p in process:
p.start()
# wait for processes to finish
for p in process:
p.join()
# collect your results from the calculations
for a in param_list:
result.update(out_queue.get())
return result
temp = multiprocess_loop_grouped(worker1, param_list, group_size, Nworkers, *args)
map = multiprocess_loop_grouped(worker2, param_list, group_size, Nworkers, *args)
It can be made a bit more dynamic when you are afraid that your queues will run out of memory. Than you need to fill and empty the queues while the processes are running. See this example here.
Final words: it is not more Pythonic as you requested. But it is easier to understand for a newbie ;-)
I am struggling for a while with Multiprocessing in Python. I would like to run 2 independent functions simultaneously, wait until both calculations are finished and then continue with the output of both functions. Something like this:
# Function A:
def jobA(num):
result=num*2
return result
# Fuction B:
def jobB(num):
result=num^3
return result
# Parallel process function:
{resultA,resultB}=runInParallel(jobA(num),jobB(num))
I found other examples of multiprocessing however they used only one function or didn't returned an output. Anyone knows how to do this? Many thanks!
I'd recommend creating processes manually (rather than as part of a pool), and sending the return values to the main process through a multiprocessing.Queue. These queues can share almost any Python object in a safe and relatively efficient way.
Here's an example, using the jobs you've posted.
def jobA(num, q):
q.put(num * 2)
def jobB(num, q):
q.put(num ^ 3)
import multiprocessing as mp
q = mp.Queue()
jobs = (jobA, jobB)
args = ((10, q), (2, q))
for job, arg in zip(jobs, args):
mp.Process(target=job, args=arg).start()
for i in range(len(jobs)):
print('Result of job {} is: {}'.format(i, q.get()))
This prints out:
Result of job 0 is: 20
Result of job 1 is: 1
But you can of course do whatever further processing you'd like using these values.