I'm trying to use a defaultdict with multiprocessing, as described in Using defaultdict with multiprocessing?.
Example code:
from collections import defaultdict
from multiprocessing import Pool
from multiprocessing.managers import BaseManager, DictProxy
class DictProxyManager(BaseManager):
"""Support a using a defaultdict with multiprocessing"""
DictProxyManager.register('defaultdict', defaultdict, DictProxy)
class Test:
my_dict: defaultdict
def run(self):
for i in range(10):
self.my_dict['x'] += 1
def main():
test = Test()
mgr = DictProxyManager()
mgr.start()
test.my_dict = mgr.defaultdict(int)
p = Pool(processes=5)
for _ in range(10):
p.apply_async(test.run)
p.close()
p.join()
print(test.my_dict['x'])
if __name__ == '__main__':
main()
Expected output: 100
Actual output: Varies per run, usually somewhere in the 40-50 range.
For certain reasons I need to set the dict on an object rather than passing it as a parameter to the function in the Pool, but I don't think that should matter.
Why is it behaving this way? Thank you in advance!
The problem has nothing to do with defaultdict per se running as a managed object. The problem is that the operation method run performs on the defaultdict instance, namely self.my_dict['x'] += 1, is not atomic: it first fetches the current value of key 'x' (or the default if the key does not exist), then increments it, and finally stores it back. That is two separate method calls on the managed dictionary, and in between those two calls another process can fetch the same value, increment it, and store the same result, losing an update.
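To make the decomposition concrete, here is a minimal sketch (reusing the question's manager setup) of what the augmented assignment amounts to on the proxy:
from collections import defaultdict
from multiprocessing.managers import BaseManager, DictProxy

class DictProxyManager(BaseManager):
    pass

DictProxyManager.register('defaultdict', defaultdict, DictProxy)

if __name__ == '__main__':
    mgr = DictProxyManager()
    mgr.start()
    my_dict = mgr.defaultdict(int)
    # my_dict['x'] += 1 is really two separate calls to the manager:
    value = my_dict['x']      # call 1: __getitem__ (another process can read the same value here)
    my_dict['x'] = value + 1  # call 2: __setitem__ (may overwrite another process's increment)
    print(my_dict['x'])       # -> 1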
You need to perform this non-atomic operation under a lock to ensure it is serialized across all processes, as done below. I have also moved the call to DictProxyManager.register inside function main because if you are running under Windows (you did not specify your platform, but I inferred that possibility), that call would otherwise be issued needlessly by every process in the pool.
from collections import defaultdict
from multiprocessing import Pool, Lock
from multiprocessing.managers import BaseManager, DictProxy
class DictProxyManager(BaseManager):
"""Support a using a defaultdict with multiprocessing"""
def init_pool(the_lock):
global lock
lock = the_lock
class Test:
my_dict: defaultdict
def run(self):
for i in range(10):
with lock:
self.my_dict['x'] += 1
def main():
DictProxyManager.register('defaultdict', defaultdict, DictProxy)
test = Test()
mgr = DictProxyManager()
mgr.start()
test.my_dict = mgr.defaultdict(int)
lock = Lock()
p = Pool(processes=5, initializer=init_pool, initargs=(lock,))
for _ in range(10):
p.apply_async(test.run)
p.close()
p.join()
print(test.my_dict['x'])
if __name__ == '__main__':
main()
Prints:
100
I have some code that farms out work to tasks. The tasks put their results on a queue, and the main thread reads these results from the queue and deals with them.
from multiprocessing import Process, Queue, Pool, Manager
import uuid
def handle_task(arg, queue, end_marker):
... add some number of results to the queue . . .
queue.put(end_marker)
def main(tasks):
manager = Manager()
queue = manager.Queue()
count = len(tasks)
end_marker = uuid.uuid4()
with Pool() as pool:
pool.starmap(handle_task, ((task, queue, end_marker) for task in tasks))
while count > 0:
value = queue.get()
if value == end_marker:
count -= 1
else:
... deal with value ...
This code works, but it is incredibly kludgy and inelegant. What if tasks is an iterator? Why do I need to know how many tasks there are ahead of time and keep track of each of them?
Is there a cleaner way of reading from a Queue and knowing that every process that will write to that queue is done, and that you've read everything they've written?
First of all, operations on a managed queue are very slow compared to a multiprocessing.Queue instance. But why are you even using an additional queue to return results when a multiprocessing pool already uses such a queue for returning results? Instead of having handle_task write some number of result values to a queue, it could simply return a list of these values. For example,
from multiprocessing import Pool
def handle_task(arg):
results = []
# Add some number of results to the results list:
results.append(arg + arg)
results.append(arg * arg)
return results
def main(tasks):
with Pool() as pool:
map_results = pool.map(handle_task, tasks)
for results in map_results:
for value in results:
# Deal with value:
print(value)
if __name__ == '__main__':
main([7, 2, 3])
Prints:
14
49
4
4
6
9
As a side benefit, the results returned will be in task-submission order, which one day might be important. If you want to be able to process the returned values as they become available, then you can use pool.imap or pool.imap_unordered (if you don't care about the order of the returned values, which seems to be the case):
from multiprocessing import Pool
def handle_task(arg):
results = []
# Add some number of results to the results list:
results.append(arg + arg)
results.append(arg * arg)
return results
def main(tasks):
with Pool() as pool:
for results in pool.imap_unordered(handle_task, tasks):
for value in results:
# Deal with value:
print(value)
if __name__ == '__main__':
main([7, 2, 3])
If the number of tasks being submitted is "large", then you should probably use the chunksize argument of the imap_unordered method. A reasonable value would be len(tasks) / (4 * pool_size), rounded up, where by default pool_size is multiprocessing.cpu_count(). This is more or less how a chunksize value is computed when you use the map or starmap methods and have not specified the chunksize argument.
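As a sketch of that default computation (this mirrors how CPython's Pool currently divides work when no chunksize is given; treat the exact formula as an implementation detail rather than a guarantee):
from multiprocessing import cpu_count

def default_chunksize(n_tasks, pool_size=None):
    # Roughly how Pool.map picks a chunksize when none is specified.
    pool_size = pool_size or cpu_count()
    chunksize, extra = divmod(n_tasks, pool_size * 4)
    if extra:
        chunksize += 1
    return chunksize

print(default_chunksize(1000, pool_size=8))  # -> 32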
Using a multiprocessing.Queue instance
from multiprocessing import Pool, Queue
from queue import Empty
def init_pool_processes(q):
global queue
queue = q
def handle_task(arg):
    # Put some number of results on the queue:
queue.put(arg + arg) # Referencing the global queue
queue.put(arg * arg)
def main(tasks):
queue = Queue()
with Pool(initializer=init_pool_processes, initargs=(queue,)) as pool:
pool.map(handle_task, tasks)
try:
while True:
value = queue.get_nowait()
print(value)
except Empty:
pass
if __name__ == '__main__':
main([7, 2, 3])
Although calling queue.empty() is not supposed to be reliable for a multiprocessing.Queue instance, as long as you are doing this after all the tasks have finished processing it seems no more unreliable than relying on non-blocking get_nowait calls raising an Empty exception only after all items have been retrieved:
from multiprocessing import Pool, Queue
def init_pool_processes(q):
global queue
queue = q
def handle_task(arg):
    # Put some number of results on the queue:
queue.put(arg + arg) # Referencing the global queue
queue.put(arg * arg)
def main(tasks):
queue = Queue()
with Pool(initializer=init_pool_processes, initargs=(queue,)) as pool:
pool.map(handle_task, tasks)
while not queue.empty():
value = queue.get_nowait()
print(value)
if __name__ == '__main__':
main([7, 2, 3])
But if you want to do everything strictly according to what the documentation implies is the only reliable method when using a multiprocessing.Queue instance, that would be to use sentinels, as you are already doing:
from multiprocessing import Pool, Queue
class Sentinel:
pass
SENTINEL = Sentinel()
def init_pool_processes(q):
global queue
queue = q
def handle_task(arg):
    # Put some number of results on the queue:
queue.put(arg + arg) # Referencing the global queue
queue.put(arg * arg)
queue.put(SENTINEL)
def main(tasks):
queue = Queue()
with Pool(initializer=init_pool_processes, initargs=(queue,)) as pool:
pool.map_async(handle_task, tasks) # Does not block
sentinel_count = len(tasks)
while sentinel_count != 0:
value = queue.get()
if isinstance(value, Sentinel):
sentinel_count -= 1
else:
print(value)
if __name__ == '__main__':
main([7, 2, 3])
Conclusion
If you need to use a queue for output, I would recommend a multiprocessing.Queue. In this case using sentinels is really the only 100% correct way of proceeding. I would also use the map_async method so that you can start processing results as they are returned.
Using a Managed Queue
This is Pingu's answer, which remains deleted for now:
from multiprocessing import Pool, Manager
from random import randint
def process(n, q):
for x in range(randint(1, 10)):
q.put((n, x))
def main():
with Manager() as manager:
queue = manager.Queue()
with Pool() as pool:
pool.starmap(process, [(n, queue) for n in range(5)])
while not queue.empty():
print(queue.get())
if __name__ == '__main__':
main()
I have a function in which I create a pool of processes. Moreover, I use multiprocessing.Value() and multiprocessing.Lock() in order to manage some shared values between processes.
I want to do the same thing with an array of objects in order to share it between processes, but I don't know how to do it. I will only read from that array.
This is the function:
import glob
import os
from multiprocessing import Value, Pool, Lock, cpu_count
def predict(matches_path, unknown_path, files_path, imtodetect_path, num_query_photos, use_top3, uid, workbook, excel_file_path,modelspath,email_address):
shared_correct_matched_imgs = Value('i', 0)
shared_unknown_matched_imgs = Value('i', 0)
shared_tot_imgs = Value('i', 0)
counter = Value('i', 0)
shared_lock = Lock()
num_workers = cpu_count()
feature = load_feature(modelspath)
pool = Pool(initializer=init_globals,
initargs=[counter, shared_tot_imgs, shared_correct_matched_imgs, shared_unknown_matched_imgs,
shared_lock], processes=num_workers)
for img in glob.glob(os.path.join(imtodetect_path, '*g')):
pool.apply_async(predict_single_img, (img,imtodetect_path,excel_file_path,files_path,use_top3,uid,matches_path,unknown_path,num_query_photos,index,modelspath))
index+=increment
pool.close()
pool.join()
The array is created with the instruction feature = load_feature(modelspath). This is the array that I want to share.
In init_globals I initialize the shared values:
def init_globals(counter, shared_tot_imgs, shared_correct_matched_imgs, shared_unknown_matched_imgs, shared_lock):
global cnt, tot_imgs, correct_matched_imgs, unknown_matched_imgs, lock
cnt = counter
tot_imgs = shared_tot_imgs
correct_matched_imgs = shared_correct_matched_imgs
unknown_matched_imgs = shared_unknown_matched_imgs
lock = shared_lock
The easy way of providing shared static data is simply to make it a global variable accessible to the function you want to call. If you're using an operating system which supports "fork", it is very straightforward to use global variables in child processes as long as they're constant (if you modify them, the changes won't be reflected in the other processes):
import multiprocessing as mp
from random import randint
shared = ['some', 'shared', 'data', f'{randint(0,1e6)}']
def foo():
print(' '.join(shared))
if __name__ == "__main__":
mp.set_start_method("fork")
#defining "shared" here would be valid also
p = mp.Process(target=foo)
p.start()
p.join()
print(' '.join(shared)) #same random number means "shared" is same object
This won't work when using "spawn" as the start method (the only one available on Windows), because the memory of the parent is not shared in any way with the child, so the child must "import" the main file to gain access to whatever the target function is (this is also why you can run into problems with decorators). If you define your data outside the if __name__ == "__main__": block, it will kinda work, but you will have made separate copies of the data, which can be undesirable if it's big, slow to create, or can change each time it's created.
import multiprocessing as mp
from random import randint
shared = ['some', 'shared', 'data', f'{randint(0,1e6)}']
def foo():
print(' '.join(shared))
if __name__ == "__main__":
mp.set_start_method("spawn")
p = mp.Process(target=foo)
p.start()
p.join()
    print(' '.join(shared)) # a different number means a different copy of "shared" (a one-in-a-million chance of being the same, I guess...)
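If you are stuck with "spawn" (for example on Windows), a common pattern that fits the question's existing init_globals approach is to pass the read-only data to a Pool initializer, so each worker receives its own copy once at startup rather than once per task. A minimal sketch under that assumption (predict_single_img and the data list here are stand-ins, not the question's actual code):
import multiprocessing as mp

feature = None  # set in each worker by the initializer

def init_globals(shared_feature):
    # Runs once per worker process; each worker keeps its own read-only copy.
    global feature
    feature = shared_feature

def predict_single_img(img):
    # 'feature' is available here as a global (hypothetical task function).
    return (img, len(feature))

if __name__ == '__main__':
    data = ['some', 'large', 'read-only', 'array']  # stands in for load_feature(modelspath)
    with mp.Pool(initializer=init_globals, initargs=(data,)) as pool:
        print(pool.map(predict_single_img, ['a.jpg', 'b.jpg']))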
TL;DR I want to collect the accumulated data in the globals of each worker when the pool is finished processing
Description of what I think I'm missing
As I'm new to multiprocessing, I don't know all the features that exist. I am looking for a way to make a worker return the value it was initialized with (after manipulating that value millions of times). Then I hope I can collect and merge all these values at the end of the program when all the 'jobs' are done.
import multiprocessing as mp
from collections import defaultdict, Counter
from customtools import load_regexes #, . . .
import gzip
import nltk
result_dict = None
regexes = None
def create_worker():
global result_dict
global regexes
result_dict = defaultdict(Counter) # I want to return this at the end
# these are a bunch of huge regexes
regexes = load_regexes()
These functions represent the way I load and process data. The data is a big gzip file with articles.
def load_data(semaphore):
with gzip.open('some10Gbfile') as f:
        for line in f:
semaphore.acquire()
yield str(line, 'utf-8')
def worker_job(line):
global regexes
global result_dict
hits = defaultdict(Counter)
for sent in nltk.sent_tokenize(line[3:]):
        for rename, regex in regexes.items():
for hit in regex.finditer(sent):
hits[rename][hit.group(0)]+=1
    # and more and more... filtered_hits = _filter(_extract(hits))
    # store some data in result_dict here . . .
return filtered_hits
class ResultEater:
def __init__(self):
self.wordscounts=defaultdict(Counter)
self.filtered=Counter()
    def eat_results(self, filtered_hits):
        for k, v in filtered_hits.items():
            for i, c in v.items():
                self.wordscounts[k][i] += c
This is the main program
if __name__ == '__main__':
pool = mp.Pool(mp.cpu_count(), initializer=create_worker)
semaphore = mp.Semaphore(50)
loader = load_data(semaphore)
results = ResultEater()
for intermediate_result in pool.imap_unordered(worker_job, loader, chunksize=10):
results.eat_results(intermediate_result)
semaphore.release()
# results.eat_workers(the_leftover_workers_or_something)
results.print()
I don't really understand why exactly returning the data incrementally isn't sufficient, but it kinda seems like you need some sort of finalization function to send the data, similar to how you have an initialization function. Unfortunately, I don't think this sort of thing exists for mp.Pool, so it'll require you to use a couple of mp.Process instances, sending input arguments and returning results with a couple of mp.Queue instances.
On a side note, your use of Semaphore is unnecessary, as the call to the "load_data" iterator always happens on the main process. I have moved that to a separate "producer" process, which puts inputs on a queue that is already synchronized by default. This allows you to have one process gathering inputs, several processes transforming the inputs into outputs, and leaves the main (parent) process to gather outputs. If the "producer" generating the inputs is limited by file read speed (very likely), it could also be a thread rather than a process, but in this case the difference is probably minimal.
I have created an example of a custom "Pool" which allows you to return some data at the end of each worker's "life" using the aforementioned "producer-consumer" scheme. There are print statements to track what is going on in each process, but please also read the comments to track what's going on and why:
import multiprocessing as mp
from time import sleep
from queue import Empty
class ExitFlag:
def __init__(self, exit_value=None):
self.exit_value = exit_value #optionally pass value along with exit flag
def producer_func(input_q, n_workers):
for i in range(100): #100 lines of some long file
print(f"put {i}")
input_q.put(i) #put each line of the file to the work queue
print('stopping consumers')
for i in range(n_workers):
input_q.put(ExitFlag()) #send shut down signal to each of the workers
print('producer exiting')
def consumer_func(input_q, output_q, work_func):
counter = 0
while True:
try:
            item = input_q.get(timeout=.1)  # never wait forever on a "get"; it's a recipe for deadlock.
except Empty:
continue
print(f"get {item}")
if isinstance(item, ExitFlag):
break
else:
counter += 1
output_q.put(work_func(item))
output_q.put(ExitFlag(exit_value=counter))
print('consumer exiting')
def work_func(number):
sleep(.1) #some heavy nltk work...
return number*2
if __name__ == '__main__':
input_q = mp.Queue(maxsize=10) #only bother limiting size if you have memory usage constraints
output_q = mp.Queue(maxsize=10)
n_workers = mp.cpu_count()
producer = mp.Process(target=producer_func, args=(input_q, n_workers)) #generate the input from another process. (this could just as easily be a thread as it seems it will be IO limited anyway)
producer.start()
consumers = [mp.Process(target=consumer_func, args=(input_q, output_q, work_func)) for _ in range(n_workers)]
for c in consumers: c.start()
total = 0
stop_signals = 0
exit_values = []
while True:
try:
            item = output_q.get(timeout=.1)
except Empty:
continue
if isinstance(item, ExitFlag):
stop_signals += 1
if item.exit_value is not None:
exit_values.append(item.exit_value) #do something with the return at the end
if stop_signals >= n_workers: #stop waiting for more results once all consumers finish
break
else:
total += item #do something with the incremental return values
print(total)
print(exit_values)
#cleanup
producer.join()
print("producer joined")
for c in consumers: c.join()
print("consumers joined")
Here is my code below. I put strings in a queue and expect dowork2 to do some work and return characters in shared_queue,
but I always get nothing at while not shared_queue.empty().
Please give me some pointers, thanks.
import time
import multiprocessing as mp
class Test(mp.Process):
def __init__(self, **kwargs):
mp.Process.__init__(self)
self.daemon = False
print('dosomething')
def run(self):
manager = mp.Manager()
queue = manager.Queue()
shared_queue = manager.Queue()
# shared_list = manager.list()
pool = mp.Pool()
results = []
results.append(pool.apply_async(self.dowork2,(queue,shared_queue)))
while True:
time.sleep(0.2)
t =time.time()
queue.put('abc')
queue.put('def')
l = ''
while not shared_queue.empty():
l = l + shared_queue.get()
print(l)
print( '%.4f' %(time.time()-t))
pool.close()
pool.join()
def dowork2(queue,shared_queue):
while True:
path = queue.get()
shared_queue.put(path[-1:])
if __name__ == '__main__':
t = Test()
t.start()
# t.join()
# t.run()
I managed to get it to work by moving your dowork2 outside the class. If you declare dowork2 as a function before the Test class and call it as
results.append(pool.apply_async(dowork2, (queue, shared_queue)))
it works as expected. I am not 100% sure, but it probably goes wrong because your Test class is already subclassing Process. When your pool then creates a subprocess and initialises the same class in that subprocess, something gets overridden somewhere.
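A minimal sketch of that rearrangement (the same logic as the question's code, just with dowork2 moved to module level and the endless loop bounded so the example terminates):
import time
import multiprocessing as mp

def dowork2(queue, shared_queue):
    # Module-level function: picklable, so the pool can send it to a worker.
    while True:
        path = queue.get()
        shared_queue.put(path[-1:])

class Test(mp.Process):
    def run(self):
        manager = mp.Manager()
        queue = manager.Queue()
        shared_queue = manager.Queue()
        pool = mp.Pool()
        pool.apply_async(dowork2, (queue, shared_queue))
        for _ in range(3):      # a few iterations instead of an endless loop
            queue.put('abc')
            queue.put('def')
            time.sleep(0.2)     # give the worker a moment to respond
            l = ''
            while not shared_queue.empty():
                l += shared_queue.get()
            print(l)
        pool.terminate()        # the worker loops forever, so terminate it

if __name__ == '__main__':
    t = Test()
    t.start()
    t.join()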
Overall, I wonder if Pool is really what you want to use here. Your worker seems to be in an infinite loop, indicating that you do not expect a return value from the worker, only the results in the return queue. If this is the case, you can remove Pool.
I also managed to get it to work while keeping your worker function within the class when I scrapped the Pool and replaced it with another subprocess:
foo = mp.Process(group=None, target=self.dowork2, args=(queue, shared_queue))
foo.start()
# results.append(pool.apply_async(Test.dowork2, (queue, shared_queue)))
while True:
....
(you need to add self to your worker, though, or declare it as a static method:)
def dowork2(self, queue, shared_queue):
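Or, as a small sketch of the static-method alternative mentioned above (the body is unchanged from the question's code):
    @staticmethod
    def dowork2(queue, shared_queue):
        while True:
            path = queue.get()
            shared_queue.put(path[-1:])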
We are trying to access data between two threads, but are unable to accomplish this. We are looking for an easy (and elegant) way.
This is our current code.
Goal: after the second thread/process is done, the listHolder in instance B must contain 2 items.
class A:
self.name = "MyNameIsBlah"
class B:
# Contains a list of A Objects. Is now empty.
self.listHolder = []
def add(self, obj):
self.listHolder.append(obj)
def remove(self, obj):
self.listHolder.remove(obj)
def process(list):
# Create our second instance of A in process/thread
secondItem = A()
# Add our new instance to the list, so that we can access it out of our process/thread.
list.append(secondItem)
# Create new instance of B which is the manager. Our listHolder is empty here.
manager = B()
# Create new instance of A which is our first item
firstItem = A()
# Add our first item to the manager. Our listHolder now contains one item now.
manager.add(firstItem)
# Start a new separate process.
p = Process(target=process, args=(manager.listHolder,))
# Now start the thread
p.start()
# We now want to access our second item here from the listHolder, which was initiated in the separate process/thread.
print len(manager.listHolder)  # prints 1
print manager.listHolder[1]    # ERROR
Expected output: 2 A instances in listHolder.
Got output: 1 A instance in listHolder.
How can we access our objects in the manager from a separate process/thread, so the two functions can run simultaneously in a non-blocking way?
Currently we are trying to accomplish this with processes, but if threads can accomplish this goal in an easier way, then that's not a problem. Python 2.7 is used.
Update 1:
@James Mills replied with using ".join()". However, this will block the main thread until the second process is done. I tried using it, but the process in this example never stops execution (while True); it acts as a timer, which must be able to iterate over a list and remove objects from the list.
Anyone has any suggestion how to accomplish this and fix the current cPickle error?
If James Mills's answer doesn't work for you, here's a write-up of how to use queues to explicitly send data back and forth to a worker process:
#!/usr/bin/env python
import logging, multiprocessing, sys
def myproc(arg):
return arg*2
def worker(inqueue, outqueue):
logger = multiprocessing.get_logger()
logger.info('start')
while True:
job = inqueue.get()
logger.info('got %s', job)
outqueue.put( myproc(job) )
def beancounter(inqueue):
while True:
print 'done:', inqueue.get()
def main():
logger = multiprocessing.log_to_stderr(
level=logging.INFO,
)
logger.info('setup')
data_queue = multiprocessing.Queue()
out_queue = multiprocessing.Queue()
for num in range(5):
data_queue.put(num)
worker_p = multiprocessing.Process(
target=worker, args=(data_queue, out_queue),
name='worker',
)
worker_p.start()
bean_p = multiprocessing.Process(
target=beancounter, args=(out_queue,),
name='beancounter',
)
bean_p.start()
worker_p.join()
bean_p.join()
logger.info('done')
if __name__=='__main__':
main()
from: Django multiprocessing and empty queue after put
Another example of using multiprocessing Manager to handle the data is here:
http://johntellsall.blogspot.com/2014/05/code-multiprocessing-producerconsumer.html
One of the simplest ways of sharing state between processes is to use the multiprocessing.Manager class to synchronize data between processes (which internally uses a Queue):
Example:
from multiprocessing import Process, Manager
def f(d, l):
d[1] = '1'
d['2'] = 2
d[0.25] = None
l.reverse()
if __name__ == '__main__':
manager = Manager()
d = manager.dict()
l = manager.list(range(10))
p = Process(target=f, args=(d, l))
p.start()
p.join()
print d
print l
Output:
bash-4.3$ python -i foo.py
{0.25: None, 1: '1', '2': 2}
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
>>>
Note: Please be careful with the types of objects you are sharing and attaching to your Process classes, as you may end up with issues with pickling. See: Python multiprocessing pickling error
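A tiny illustration of the kind of object that trips this up (a sketch; the exact exception text varies between Python versions): module-level functions pickle by reference, while lambdas, nested functions and other non-importable callables do not, which is a common cause of multiprocessing pickling errors.
import pickle

def top_level(x):
    return x * 2

pickle.dumps(top_level)  # fine: pickled by reference to its module and name

try:
    pickle.dumps(lambda x: x * 2)  # lambdas cannot be looked up by name
except (pickle.PicklingError, AttributeError, TypeError) as exc:
    print('pickling failed: %r' % exc)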