I am trying to call this function [1] in parallel. To do that, I have created this function [2], and I call it like this [3][4]. The problem is that when I execute this code, the execution hangs and I never see the result, but if I execute run_simple_job serially, everything works fine. Why can't I execute this function in parallel? Any advice?
[1] function that I am trying to call
#make_verbose
def run_simple_job(job_params):
    """
    Execute a job remotely and get the digests.
    The output comes as a JSON file containing info about the input and output paths and the generated digest.
    :param job_params: (namedtuple) contains several attributes important for the job during execution.
        client_id (string) id of the client.
        command (string) command to execute the job
        cluster (string) where the job will run
        task_type (TypeTask) contains information about the job that will run
        should_tamper (Boolean) tells whether this job should tamper with the digests or not
    :return: output (string) the output of the job execution
    """
    client_id = job_params.client_id
    _command = job_params.command
    cluster = job_params.cluster
    task_type = job_params.task_type
    output = ...  # execute the job
    return output
[2] function that calls in parallel
def spawn(f):
    # 1 - how do the pipe and x arguments end up here?
    def fun(pipe, x):
        pipe.send(f(x))
        pipe.close()
    return fun

def parmap2(f, X):
    pipe = [Pipe() for x in X]
    # 2 - what is happening with the tuples (c, x) and (p, c)?
    proc = [Process(target=spawn(f), args=(c, x))
            for x, (p, c) in izip(X, pipe)]
    for p in proc:
        logging.debug("Spawn")
        p.start()
    for p in proc:
        logging.debug("Joining")
        p.join()
    return [p.recv() for (p, c) in pipe]
[3] Wrapper class
class RunSimpleJobWrapper:
    """ Wrapper used when running a job """
    def __init__(self, params):
        self.params = params
[4] How I call the function to run in parallel
for cluster in clusters:
    task_type = task_type_by_cluster[cluster]
    run_wrapper_list.append(RunSimpleJobWrapper(get_job_parameter(client_id, cluster, job.command, majority(FAULTS), task_type)))

jobs_output = parmap2(run_simple_job_wrapper, run_wrapper_list)
You could simply use multiprocessing:
from multiprocessing import Pool
pool = Pool()  # with no argument, Pool uses all the available CPUs
param_list = [...]  # generate a list of your parameters
results = pool.map(run_simple_job, param_list)
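For the concrete call in [4], a minimal sketch (assuming run_simple_job_wrapper and run_wrapper_list from the snippets above are defined at module level and picklable) could look like this:
from multiprocessing import Pool

if __name__ == '__main__':
    with Pool() as pool:
        # same shape as the parmap2 call, but backed by a worker pool
        jobs_output = pool.map(run_simple_job_wrapper, run_wrapper_list)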
Related
I have a task function like this:
def task(s):
    # doing some thing
    return res
The original program is:
res = []
for i in data:
    res.append(task(i))
# using pickle to save res every 30s
I need to process a lot of data and I don't care about the output order of the results. Because of the long running time, I need to save the current progress regularly. Now I'm changing it to multiprocessing:
pool = Pool(4)
status = []
res = []
for i in data:
    status.append(pool.apply_async(task, (i,)))
for i in status:
    res.append(i.get())
# using pickle to save res every 30s
Suppose I have processes p0, p1, p2, p3 in the Pool and 10 tasks, task(0) ... task(9), and p0 takes a very long time to finish task(0).
Will the main process be blocked at the first "res.append(i.get())"?
If p1 has finished task(1) while p0 is still working on task(0), will p1 go on to task(4) or a later one?
If the answer to the first question is yes, how can I get the other results earlier, and the result of task(0) last?
I updated my code, but the main process got blocked somewhere while the other processes were still working on tasks. What's wrong? Here is the core of the code:
with concurrent.futures.ProcessPoolExecutor(4) as ex:
    for i in self.inBuffer:
        futuresList.append(ex.submit(warpper, i))
    for i in concurrent.futures.as_completed(futuresList):
        (word, r) = i.result()
        self.resDict[word] = r
        self.logger.info("{} --> {}".format(word, r))
        cur = datetime.now()
        if (cur - self.timeStmp).total_seconds() > 30:
            self.outputPickle()
            self.timeStmp = datetime.now()
The length of self.inBuffer is about 100000. self.logger.info writes the info to a log file. For some special inputs i, the warpper function prints auxiliary information with print. self.resDict is a dict that stores the results. self.outputPickle() writes a .pkl file using pickle.dump.
At first the code ran normally: both the log file and the output printed by warpper were updated. But at some point I found that the log file had not been updated for a long time (several hours, although a single call to warpper should never take more than 120 s), while warpper was still printing information (until I killed the process it printed about 100 messages without any update of the log file). The timestamp of the output .pkl file also did not change. Here is the implementation of outputPickle():
def outputPickle(self):
    if os.path.exists(os.path.join(self.wordDir, self.outFile)):
        if os.path.exists(os.path.join(self.wordDir, "{}_backup".format(self.outFile))):
            os.remove(os.path.join(self.wordDir, "{}_backup".format(self.outFile)))
        shutil.copy(os.path.join(self.wordDir, self.outFile), os.path.join(self.wordDir, "{}_backup".format(self.outFile)))
    with open(os.path.join(self.wordDir, self.outFile), 'wb') as f:
        pickle.dump(self.resDict, f)
Then I added three print calls:
print("getting res of something")
(word, r) = i.result()
print("finishing i.result")
self.resDict[word] = r
print("finished getting res of {}".format(word))
Here is the log:
getting res of something
finishing i.result
finished getting res of CNICnanotubesmolten
getting res of something
finishing i.result
finished getting res of CNN0
getting res of something
message by warpper
message by warpper
message by warpper
message by warpper
message by warpper
The log "message by warpper" can be printed at most once every time the warpper is called
Yes
Yes, because the tasks are submitted asynchronously. p1 (or another worker) will also pick up another chunk of data if the input iterable is larger than the maximum number of processes/workers.
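A minimal sketch (with a hypothetical slow_task standing in for your task) showing both effects: get() blocks in submission order on the slow task(0), while the idle workers keep pulling later tasks in the meantime:
import time
from multiprocessing import Pool

def slow_task(i):
    time.sleep(10 if i == 0 else 1)  # task(0) is much slower than the rest
    return i

if __name__ == '__main__':
    with Pool(4) as pool:
        status = [pool.apply_async(slow_task, (i,)) for i in range(10)]
        for s in status:
            # blocks on task(0) first, even though later tasks finish earlier
            print(s.get())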
"... how to get other results in advance"
One of the convenient options is to rely on concurrent.futures.as_completed which will return the results as they are completed:
import time
import concurrent.futures

def func(x):
    time.sleep(3)
    return x ** 2

if __name__ == '__main__':
    data = range(1, 5)
    results = []
    with concurrent.futures.ProcessPoolExecutor(4) as ex:
        futures = [ex.submit(func, i) for i in data]
        # process the earlier results as they are completed
        for fut in concurrent.futures.as_completed(futures):
            res = fut.result()
            results.append(res)
            print(res)
Sample output:
4
1
9
16
Another option is to use the callback argument of apply_async(func[, args[, kwds[, callback[, error_callback]]]]); the callback accepts a single argument, the value returned by the function. In that callback you can process the result in a minimal way (keeping in mind that it is tied to a single result of one concrete call). The general scheme looks as follows:
from multiprocessing import Pool

def res_callback(v):
    # ... process the result
    with open('test.txt', 'a') as f:  # just an example
        f.write(str(v))
    print(v, flush=True)

if __name__ == '__main__':
    data = range(1, 5)
    results = []
    with Pool(4) as pool:
        tasks = [pool.apply_async(func, (i,), callback=res_callback) for i in data]
        # wait for the submitted tasks to finish
But that scheme still requires you to somehow await (get() the results of) the submitted tasks, for example as sketched below.
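A minimal sketch of one way to do that wait, continuing the snippet above (func and res_callback as defined earlier):
from multiprocessing import Pool

if __name__ == '__main__':
    data = range(1, 5)
    with Pool(4) as pool:
        tasks = [pool.apply_async(func, (i,), callback=res_callback) for i in data]
        for t in tasks:
            t.wait()  # block until each submitted task has finished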
I would like to parallelize a process in python which needs read access to several large, non-array data structures. What would be a recommended way to do this without copying all of the large data structures into every new process?
Thank you
The multiprocessing package provides two ways of sharing state: shared memory objects and server process managers. You should use server process managers as they support arbitrary object types.
The following program makes use of a server process manager:
#!/usr/bin/env python3
from multiprocessing import Process, Manager

# Simple data structure
class DataStruct:
    data_id = None
    data_str = None

    def __init__(self, data_id, data_str):
        self.data_id = data_id
        self.data_str = data_str

    def __str__(self):
        return f"{self.data_str} has ID {self.data_id}"

    def __repr__(self):
        return f"({self.data_id}, {self.data_str})"

    def set_data_id(self, data_id):
        self.data_id = data_id

    def set_data_str(self, data_str):
        self.data_str = data_str

    def get_data_id(self):
        return self.data_id

    def get_data_str(self):
        return self.data_str

# Function to manipulate the data
def manipulate_data_structs(data_structs, find_str):
    for ds in data_structs:
        if ds.get_data_str() == find_str:
            print(ds)

# Create manager context, modify the data
with Manager() as manager:
    # List of DataStruct objects
    l = manager.list([
        DataStruct(32, "Andrea"),
        DataStruct(45, "Bill"),
        DataStruct(21, "Claire"),
    ])

    # Processes that look for DataStructs with a given string
    procs = [
        Process(target=manipulate_data_structs, args=(l, "Andrea")),
        Process(target=manipulate_data_structs, args=(l, "Claire")),
        Process(target=manipulate_data_structs, args=(l, "David")),
    ]

    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
For more information, see Sharing state between processes in the documentation.
I am attempting to dynamically open and parse several text files (~10) to extract a particular value for a key, and I am using multiprocessing in Python to do this. My issue is that the function I am calling writes data to a class-level list, which I can see inside the method, but outside the method that list is empty. Refer to the following:
class:
class MyClass(object):
    __id_list = []

    def __init__(self):
        self.process_wrapper()
Caller Method:
def process_wrapper(self):
    from multiprocessing import Pool
    import multiprocessing

    info_file = 'info*'
    file_list = []

    p = Pool(processes=multiprocessing.cpu_count() - 1)
    for file_name in Path('c:/').glob('**/*/' + info_file):
        file_list.append(str(os.path.join('c:/', file_name)))
    p.map_async(self.get_ids, file_list)
    p.close()
    p.join()

    print(self.__id_list)  # this is showing as empty
Worker method:
def get_ids(self, file_name):
    try:
        with open(file_name) as data:
            for line in data:
                temp_split = line.split()
                for item in temp_split:
                    value_split = str(item).split('=')
                    if 'id' == value_split[0].lower():
                        if int(value_split[1]) not in self._id_list:
                            self.__id_list.append(int(value_split[1]))
    except:
        raise FileReadError(f'There was an issue parsing "{file_name}".')

    print(self.__id_list)  # here the list prints fine
The map_async call returns an AsyncResult object. You should use it to wait for the processing to finish before checking self.__id_list. You might also consider having each worker return a local list, then collecting those lists and aggregating them into the final list.
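A minimal sketch of that suggestion, assuming a hypothetical module-level worker get_ids_from_file that parses one file and returns the ids it found instead of appending to self.__id_list:
from multiprocessing import Pool

def collect_ids(file_list):
    with Pool() as p:
        result = p.map_async(get_ids_from_file, file_list)
        per_file_ids = result.get()  # waits for all workers to finish
    # flatten the per-file lists and drop duplicates
    return sorted({an_id for ids in per_file_ids for an_id in ids})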
1. It looks like you have a typo in your get_ids method (self._id_list instead of self.__id_list). You can see it if you wait for the result:
result = p.map_async(self.get_ids, file_list)
result.get()
2. When a new child process is created, it gets a copy of the parent's address space; however, any subsequent changes (by either the parent or the child) are not reflected in the memory of the other process. They each have their own private address space.
Example:
$ cat fork.py
import os

l = []
l.append('global')

# Returns 0 in the child and the child's process id in the parent
pid = os.fork()

if pid == 0:
    l.append('child')
    print(f'Child PID: {os.getpid()}, {l}')
else:
    l.append('parent')
    print(f'Parent PID: {os.getpid()}, {l}')

print(l)
$ python3 fork.py
Parent PID: 9933, ['global', 'parent']
['global', 'parent']
Child PID: 9934, ['global', 'child']
['global', 'child']
Now back to your problem: you can use multiprocessing.Manager().list() to create a list that is shared between processes:
from multiprocessing import Manager, Pool
m = Manager()
self.__id_list = m.list()
Docs: Sharing state between processes
Or use threads, as your workload seems to be I/O-bound anyway:
from multiprocessing.dummy import Pool as ThreadPool
p = ThreadPool(processes = multiprocessing.cpu_count() - 1)
Alternatively, check out concurrent.futures, for example as sketched below.
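A minimal concurrent.futures sketch using threads (again with a hypothetical module-level parse_ids worker that returns the ids found in one file):
import concurrent.futures

def collect_ids(file_list):
    id_list = []
    with concurrent.futures.ThreadPoolExecutor() as ex:
        # map runs parse_ids on each file and yields the returned lists
        for ids in ex.map(parse_ids, file_list):
            id_list.extend(ids)
    return id_list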
I'm having issues using r2pipe, radare2's API, with the multiprocessing Pool.map function in Python. The problem I am facing is that the application hangs on pool.join().
My hope was to use multithreading via the multiprocessing.dummy class in order to evaluate functions quickly through r2pipe. I have tried passing my r2pipe object as a namespace using the Manager class. I have also attempted using events, but none of these seem to work.
class Test:
    def __init__(self, filename=None):
        if filename:
            self.r2 = r2pipe.open(filename)
        else:
            self.r2 = r2pipe.open()
        self.r2.cmd('aaa')

    def t_func(self, args):
        f = args[0]
        r2_ns = args[1]
        print('afbj # {}'.format(f['name']))
        try:
            bb = r2_ns.cmdj('afbj # {}'.format(f['name']))
            if bb:
                return bb[0]['addr']
            else:
                return None
        except Exception as e:
            print(e)
            return None

    def thread(self):
        funcs = self.r2.cmdj('aflj')

        mgr = ThreadMgr()
        ns = mgr.Namespace()
        ns.r2 = self.r2

        pool = ThreadPool(2)
        results = pool.map(self.t_func, product(funcs, [ns.r2]))
        pool.close()
        pool.join()

        print(list(results))
This is the class I am using. I call the Test.thread function from my main function.
I expect the application to print out the command it is about to run in r2pipe (afbj # entry0, etc.) and then print out the list of results containing the first basic block addresses ([40000, 50000, ...]).
The application does print out the command it is about to run, but then hangs before printing the results.
ENVIRONMENT
radare2: radare2 4.2.0-git 23712 # linux-x86-64 git.4.1.1-97-g5a48a4017
commit: 5a48a401787c0eab31ecfb48bebf7cdfccb66e9b build: 2020-01-09__21:44:51
r2pipe: 1.4.2
python: Python 3.6.9 (default, Nov 7 2019, 10:44:02)
system: Ubuntu 18.04.3 LTS
SOLUTION
This may be due to passing the same instance of r2pipe.open() to every call of t_func in the pool. One solution is to move the following lines of code into t_func:
r2 = r2pipe.open('filename')
r2.cmd('aaa')
This works; however, it is terribly slow to re-analyze the binary for each thread/process.
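For reference, a minimal sketch of that per-worker variant (illustrative only; it reuses the command format from the question and assumes the filename is passed along with each function entry):
import r2pipe

def t_func(args):
    f, filename = args
    r2 = r2pipe.open(filename)  # each call gets its own r2 instance
    r2.cmd('aaa')               # the analysis is repeated per call, hence the slowness
    bb = r2.cmdj('afbj # {}'.format(f['name']))
    return bb[0]['addr'] if bb else None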
Also, it is often faster to allow radare2 to do as much of the work as possible and limit the number of commands we need to send using r2pipe.
This problem is solved by using the command: afbj ##f
afbj # List basic blocks of given function and show results in json
##f # Execute the command for each function
EXAMPLE
Longer Example
import r2pipe
R2: r2pipe.open_sync = r2pipe.open('/bin/ls')
R2.cmd("aaaa")
FUNCS: list = R2.cmd('afbj ##f').split("\n")[:-1]
RESULTS: list = []
for func in FUNCS:
    basic_block_info: list = eval(func)
    first_block: dict = basic_block_info[0]
    address_first_block: int = first_block['addr']
    RESULTS.append(hex(address_first_block))

print(RESULTS)
'''
['0x4a56', '0x1636c', '0x3758', '0x15690', '0x15420', '0x154f0', '0x15420',
'0x154f0', '0x3780', '0x3790', '0x37a0', '0x37b0', '0x37c0', '0x37d0', '0x0',
...,
'0x3e90', '0x6210', '0x62f0', '0x8f60', '0x99e0', '0xa860', '0xc640', '0x3e70',
'0xd200', '0xd220', '0x133a0', '0x14480', '0x144e0', '0x145e0', '0x14840', '0x15cf0']
'''
Shorter Example
import r2pipe
R2 = r2pipe.open('/bin/ls')
R2.cmd("aaaa")
print([hex(eval(func)[0]['addr']) for func in R2.cmd('afbj ##f').split("\n")[:-1]])
I'm fairly new to multiprocessing and I have written the script below, but the methods are not getting called. I don't understand what I'm missing.
What I want to do is the following:
call two different methods asynchronously.
call one method before the other.
# import all necessary modules
import Queue
import logging
import multiprocessing
import time, sys
import signal

debug = True

def init_worker():
    signal.signal(signal.SIGINT, signal.SIG_IGN)

research_name_id = {}
ids = [55, 125, 428, 429, 430, 895, 572, 126, 833, 502, 404]

# declare all the static variables
num_threads = 2  # number of parallel threads
minDelay = 3  # minimum delay
maxDelay = 7  # maximum delay

# declare an empty queue which will hold the publication ids
queue = Queue.Queue(0)

proxies = []
#print (proxies)

def split(a, n):
    """Function to split data evenly among threads"""
    k, m = len(a) / n, len(a) % n
    return (a[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
            for i in xrange(n))

def run_worker(
        i,
        data,
        queue,
        research_name_id,
        proxies,
        debug,
        minDelay,
        maxDelay):
    """ Function to pull out all publication links from nist
    data - research ids pulled using a different script
    queue - add the publication urls to the list
    research_name_id - dictionary with research id as key and name as value
    proxies - scraped proxies
    """
    print 'getLinks', i
    for d in data:
        print d
        queue.put(d)

def fun_worker(i, queue, proxies, debug, minDelay, maxDelay):
    print 'publicationData', i
    try:
        print queue.pop()
    except:
        pass

def main():
    print "Initializing workers"
    pool = multiprocessing.Pool(num_threads, init_worker)
    distributed_ids = list(split(list(ids), num_threads))
    for i in range(num_threads):
        data_thread = distributed_ids[i]
        print data_thread
        pool.apply_async(run_worker, args=(i + 1,
                                           data_thread,
                                           queue,
                                           research_name_id,
                                           proxies,
                                           debug,
                                           minDelay,
                                           maxDelay,
                                           ))
        pool.apply_async(fun_worker,
                         args=(
                             i + 1,
                             queue,
                             proxies,
                             debug,
                             minDelay,
                             maxDelay,
                         ))

    try:
        print "Waiting 10 seconds"
        time.sleep(10)
    except KeyboardInterrupt:
        print "Caught KeyboardInterrupt, terminating workers"
        pool.terminate()
        pool.join()
    else:
        print "Quitting normally"
        pool.close()
        pool.join()

if __name__ == "__main__":
    main()
The only output that I get is
Initializing workers
[55, 125, 428, 429, 430, 895]
[572, 126, 833, 502, 404]
Waiting 10 seconds
Quitting normally
There are a couple of issues:
You're not using multiprocessing.Queue
If you want to share a queue with a subprocess via apply_async etc., you need to use a manager (see the example sketched below).
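A minimal sketch of the manager approach (worker and variable names here are illustrative, not taken from the original script):
import multiprocessing

def producer(i, q):
    q.put(i)

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    q = manager.Queue()  # a proxy queue that can be passed to pool workers
    pool = multiprocessing.Pool(2)
    for i in range(10):
        pool.apply_async(producer, args=(i, q))
    pool.close()
    pool.join()
    while not q.empty():
        print(q.get())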
However, you should take a step back and ask yourself what you are trying to do. Is apply_async really the way to go? You have a list of items that you want to map over, applying some long-running, compute-intensive transformations (because if they were just blocking on I/O, you might as well use threads). It seems to me that imap_unordered is actually what you want:
pool = multiprocessing.Pool(num_threads, init_worker)
links = pool.imap_unordered(run_worker1, ids)
output = pool.imap_unordered(fun_worker1, links)
run_worker1 and fun_worker1 need to be modified to take a single argument. If you need to share other data, then you should pass it in the initializer instead of passing it to the subprocesses over and over again.
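A minimal sketch of that initializer pattern (names are illustrative; num_threads, research_name_id, proxies and ids are assumed to be the globals from the original script): the shared, read-only data is installed once per worker process instead of being pickled with every task.
import multiprocessing

shared = {}

def init_pool(research_name_id, proxies):
    # runs once in each worker process
    shared['research_name_id'] = research_name_id
    shared['proxies'] = proxies

def run_worker1(publication_id):
    # the worker reads shared['research_name_id'] / shared['proxies'] here
    return publication_id

if __name__ == '__main__':
    pool = multiprocessing.Pool(num_threads, init_pool, (research_name_id, proxies))
    for link in pool.imap_unordered(run_worker1, ids):
        print(link)
    pool.close()
    pool.join()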