Sharing local data across threads inside class method - python

I have a method defined inside a class as follows:
from Queue import Queue
from threading import Thread, Lock

class Example(object):
    def f(self):
        things = []  # list of some objects to process
        my_set = set()
        my_dict = {}
        my_list = []

        def update(q, t_lock, my_set, my_dict, my_list):
            while True:
                thing = q.get()
                new_set, new_dict, new_list = otherclass.method(thing)
                with t_lock:
                    my_set = set_overlay(my_set, new_set)
                    my_dict.update(new_dict)
                    my_list += new_list
                q.task_done()

        q = Queue()
        num_threads = 10
        thread_lock = Lock()
        for i in range(num_threads):
            worker = Thread(target=update, args=(q, thread_lock, my_set, my_dict, my_list))
            worker.setDaemon(True)
            worker.start()
        for t in things:
            q.put(t)
        q.join()
As you can see I'm trying to update the variables my_set, my_dict, and my_list with some results from the threaded update method defined inside the f method. I can pass them all to the update function, but this only works for mutable datatypes.
I updated this to use threading.Lock() as user maxywb suggested, but I would like to keep the same questions below open to be answered now that the lock is included.
My questions are:
Is this threadsafe? Can I guarantee that no updates to any of the variables will be lost?
What if I wanted to throw an immutable variable into the mix like an int, that was added to from the results of otherclass.method(thing)? How would I go about doing that?
Is there a better way to do this/architect this? The idea here is to update variables local to a class method from a thread, while sharing a (hopefully) threadsafe reference to that variable across threads.
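On the second question, one possible approach is sketched below in Python 3 (where the module is queue rather than Queue). It keeps the integer in a mutable holder and mutates the set in place, always under the lock, so the worker never rebinds a name that the enclosing function also uses. Note that a line such as my_set = set_overlay(my_set, new_set) only rebinds the worker's local name and leaves the outer set untouched. fake_method is a hypothetical stand-in for otherclass.method, and process_all and totals are illustrative names that do not appear in the original code.

```python
import threading
from queue import Queue


def fake_method(thing):
    # Hypothetical stand-in for otherclass.method: returns a set and an int.
    return {thing % 3}, thing


def process_all(things, num_threads=4):
    seen = set()
    totals = {"count": 0}  # mutable holder for an otherwise immutable int
    lock = threading.Lock()
    q = Queue()

    def update():
        while True:
            thing = q.get()
            try:
                new_items, increment = fake_method(thing)
                with lock:
                    seen.update(new_items)        # mutate in place, do not rebind
                    totals["count"] += increment  # read-modify-write under the lock
            finally:
                q.task_done()

    for _ in range(num_threads):
        threading.Thread(target=update, daemon=True).start()
    for t in things:
        q.put(t)
    q.join()  # returns once every queued item has been marked done
    return seen, totals["count"]


seen, count = process_all(range(100))
print(sorted(seen), count)
```

In Python 3 a nonlocal declaration inside update would be another way to rebind an outer int; the holder shown here also works in Python 2.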

Related

Why is multiprocessing.managers.DictProxy with a defaultdict not multiprocess-safe?

I'm trying to use a defaultdict with multiprocessing, as described in Using defaultdict with multiprocessing?.
Example code:
from collections import defaultdict
from multiprocessing import Pool
from multiprocessing.managers import BaseManager, DictProxy

class DictProxyManager(BaseManager):
    """Support using a defaultdict with multiprocessing"""

DictProxyManager.register('defaultdict', defaultdict, DictProxy)

class Test:
    my_dict: defaultdict

    def run(self):
        for i in range(10):
            self.my_dict['x'] += 1

def main():
    test = Test()
    mgr = DictProxyManager()
    mgr.start()
    test.my_dict = mgr.defaultdict(int)
    p = Pool(processes=5)
    for _ in range(10):
        p.apply_async(test.run)
    p.close()
    p.join()
    print(test.my_dict['x'])

if __name__ == '__main__':
    main()
Expected output: 100
Actual output: Varies per run, usually somewhere in the 40-50 range.
For certain reasons I need to set the dict on an object rather than passing it as a parameter to the function in the Pool, but I don't think that should matter.
Why is it behaving this way? Thank you in advance!
The problem has nothing to do with defaultdict per se running as a managed object. The problem is that the operation performed by method run on the defaultdict instance, namely self.my_dict['x'] += 1, is not atomic: it first fetches the current value of key 'x' (if it exists), then increments it, and finally stores it back. That is two separate method calls on the managed dictionary. Between those two calls another process could run, retrieve the same value, increment it, and store the same result.
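The two separate calls can be made visible with a small sketch. LoggingDict is a hypothetical helper written for this illustration only (it is not part of multiprocessing); it records every item read and write on an ordinary in-process dict.

```python
class LoggingDict(dict):
    """A dict that records each item read and write (illustrative only)."""

    def __init__(self, *args, **kwargs):
        self.calls = []
        super().__init__(*args, **kwargs)

    def __getitem__(self, key):
        self.calls.append(("get", key))
        return super().__getitem__(key)

    def __setitem__(self, key, value):
        self.calls.append(("set", key, value))
        super().__setitem__(key, value)


d = LoggingDict(x=0)
d["x"] += 1      # one statement, but two method calls on the dict
print(d.calls)   # [('get', 'x'), ('set', 'x', 1)]
```

With a managed dictionary each of those calls is a separate round trip to the manager process, which is where another worker can interleave.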
You need to perform this non-atomic operation under a lock to ensure it is serialized across all processes, as done below. I have also moved the call to DictProxyManager.register inside function main because, if you are running under Windows (you did not specify your platform, but I inferred that possibility), that call would otherwise be issued needlessly by every process in the pool.
from collections import defaultdict
from multiprocessing import Pool, Lock
from multiprocessing.managers import BaseManager, DictProxy

class DictProxyManager(BaseManager):
    """Support using a defaultdict with multiprocessing"""

def init_pool(the_lock):
    global lock
    lock = the_lock

class Test:
    my_dict: defaultdict

    def run(self):
        for i in range(10):
            with lock:
                self.my_dict['x'] += 1

def main():
    DictProxyManager.register('defaultdict', defaultdict, DictProxy)
    test = Test()
    mgr = DictProxyManager()
    mgr.start()
    test.my_dict = mgr.defaultdict(int)
    lock = Lock()
    p = Pool(processes=5, initializer=init_pool, initargs=(lock,))
    for _ in range(10):
        p.apply_async(test.run)
    p.close()
    p.join()
    print(test.my_dict['x'])

if __name__ == '__main__':
    main()
Prints:
100

python - multiprocessing with queue

Here is my code. I put strings in queue and expect dowork2 to do some work and return a character in shared_queue, but I always get nothing at while not shared_queue.empty().
Please give me some pointers, thanks.
import time
import multiprocessing as mp

class Test(mp.Process):
    def __init__(self, **kwargs):
        mp.Process.__init__(self)
        self.daemon = False
        print('dosomething')

    def run(self):
        manager = mp.Manager()
        queue = manager.Queue()
        shared_queue = manager.Queue()
        # shared_list = manager.list()
        pool = mp.Pool()
        results = []
        results.append(pool.apply_async(self.dowork2, (queue, shared_queue)))
        while True:
            time.sleep(0.2)
            t = time.time()
            queue.put('abc')
            queue.put('def')
            l = ''
            while not shared_queue.empty():
                l = l + shared_queue.get()
            print(l)
            print('%.4f' % (time.time() - t))
        pool.close()
        pool.join()

    def dowork2(queue, shared_queue):
        while True:
            path = queue.get()
            shared_queue.put(path[-1:])

if __name__ == '__main__':
    t = Test()
    t.start()
    # t.join()
    # t.run()
I managed to get it to work by moving your dowork2 outside the class. If you declare dowork2 as a function before the Test class and call it as
    results.append(pool.apply_async(dowork2, (queue, shared_queue)))
it works as expected. I am not 100% sure, but it probably goes wrong because your Test class already subclasses Process. When your pool creates a subprocess and initialises the same class in the subprocess, something gets overridden somewhere.
Overall I wonder if Pool is really what you want to use here. Your worker runs in an infinite loop, which suggests you do not expect a return value from the worker, only the results in the return queue. If this is the case, you can remove Pool.
I also managed to get it to work while keeping your worker function within the class when I scrapped the Pool and replaced it with another subprocess:
    foo = mp.Process(group=None, target=self.dowork2, args=(queue, shared_queue))
    foo.start()
    # results.append(pool.apply_async(Test.dowork2, (queue, shared_queue)))
    while True:
        ....
(You need to add self to your worker, though, or declare it as a static method.)
    def dowork2(self, queue, shared_queue):

Python Thread with Queue

I am trying to get a Python thread working with a queue, but when I put a value in the queue, I cannot find it in the other thread.
from Queue import Queue
from threading import Thread
import time

class ThreadWorker(object):
    verbose = True
    thread = None
    queue = Queue()

    def __init__(self, workerId, queueMaxSize = 50, emptyQueuewaitTime = 1):
        self.queue.maxsize = queueMaxSize
        self.thread = Thread(target=self.__work, args=(workerId, emptyQueuewaitTime))
        self.thread.setDaemon(True)
        self.thread.start()

    def __work(self, workerId, sl):
        while(True):
            if self.queue.empty:
                print '[THREAD_WORKER] id: {}, EMPTY QUEUE sleeping: {}'.format(workerId, sl)
                time.sleep(sl)
                continue
            if self.verbose:
                print '[THREAD_WORKER] id: {}, queueSize: {}'.format(workerId, self.queue.qsize())
            d = self.queue.get()
            self.queue.task_done()

    def put(self, item, waitIfFull = True):
        self.queue.put(item, waitIfFull)
        if self.verbose:
            print "Add to queue, current queue size: {}".format(self.queue.qsize())
Create instance and fill the queue ...
t = ThreadWorker("t1")
t.put("item1")
t.put("item2")
t.put("item3")
Output from thread with name t1 is: [THREAD_WORKER] id: t1, EMPTY QUEUE sleeping: 1
But there are three items in the queue.
queue.empty is a method; you need to call it. That's the immediate issue.
If you create two ThreadWorkers, you're going to find that they're sharing their queues. Unlike other languages, an assignment like queue = Queue() at class level doesn't declare an instance variable; it declares a class attribute. To create an instance attribute, you would instead assign self.queue = Queue() in the __init__ method. There is no need for any sort of declaration of this attribute's existence at class level.
Finally, checking whether a Queue is empty is very prone to race conditions, since whether or not it's empty might change between empty() and get(). It's generally better to just call get, and let get wait for a put if the queue is empty.
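Putting the three points together, a reworked worker might look like the sketch below. It is written for Python 3 (module queue, print as a function), which is an assumption since the question uses Python 2. The processed list and the join method are additions made only so the result can be inspected; they are not part of the original class.

```python
from queue import Queue
from threading import Thread


class ThreadWorker:
    def __init__(self, worker_id, queue_max_size=50):
        # Instance attribute: each worker gets its own queue.
        self.queue = Queue(maxsize=queue_max_size)
        self.processed = []  # illustrative: records what the thread handled
        self.thread = Thread(target=self._work, args=(worker_id,), daemon=True)
        self.thread.start()

    def _work(self, worker_id):
        while True:
            # get() blocks until an item arrives, so no empty() check is needed.
            item = self.queue.get()
            self.processed.append((worker_id, item))
            self.queue.task_done()

    def put(self, item, wait_if_full=True):
        self.queue.put(item, wait_if_full)

    def join(self):
        # Wait until every queued item has been processed.
        self.queue.join()


t1 = ThreadWorker("t1")
t2 = ThreadWorker("t2")
t1.put("item1")
t1.put("item2")
t2.put("other")
t1.join()
t2.join()
print(t1.processed, t2.processed)
```

Because the queue is created in __init__, items put on t1 never show up in t2.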

Access data between two threading processes

We are trying to access data between two threads, but are unable to accomplish this. We are looking for an easy (and elegant) way.
This is our current code.
Goal: after the second thread/process is done, the listHolder in instance B must contain 2 items.
from multiprocessing import Process

class A:
    def __init__(self):
        self.name = "MyNameIsBlah"

class B:
    def __init__(self):
        # Contains a list of A objects. Is now empty.
        self.listHolder = []

    def add(self, obj):
        self.listHolder.append(obj)

    def remove(self, obj):
        self.listHolder.remove(obj)

def process(list):
    # Create our second instance of A in the process/thread
    secondItem = A()
    # Add our new instance to the list, so that we can access it outside our process/thread.
    list.append(secondItem)

# Create a new instance of B, which is the manager. Our listHolder is empty here.
manager = B()
# Create a new instance of A, which is our first item
firstItem = A()
# Add our first item to the manager. Our listHolder now contains one item.
manager.add(firstItem)
# Start a new separate process.
p = Process(target=process, args=(manager.listHolder,))
# Now start the process
p.start()
# We now want to access our second item here from the listHolder, which was created in the separate process/thread.
print len(manager.listHolder)  # prints 1
print manager.listHolder[1]  # ERROR
Expected output: 2 A instances in listHolder.
Got output: 1 A instance in listHolder.
How can we access the objects in our manager from separate processes/threads, so that two functions can run simultaneously without blocking each other?
Currently we are trying to accomplish this with processes, but if threads can accomplish this goal in an easier way, that is not a problem. Python 2.7 is used.
Update 1:
@James Mills replied suggesting ".join()". However, this blocks the main thread until the second Process is done. I tried it, but the Process used in this example never stops executing (while True). It acts as a timer, which must be able to iterate over a list and remove objects from it.
Does anyone have a suggestion for how to accomplish this and fix the current cPickle error?
If James Mills's answer doesn't work for you, here's a writeup of how to use queues to explicitly send data back and forth to a worker process:
#!/usr/bin/env python
import logging, multiprocessing, sys

def myproc(arg):
    return arg*2

def worker(inqueue, outqueue):
    logger = multiprocessing.get_logger()
    logger.info('start')
    while True:
        job = inqueue.get()
        logger.info('got %s', job)
        outqueue.put( myproc(job) )

def beancounter(inqueue):
    while True:
        print 'done:', inqueue.get()

def main():
    logger = multiprocessing.log_to_stderr(
        level=logging.INFO,
    )
    logger.info('setup')
    data_queue = multiprocessing.Queue()
    out_queue = multiprocessing.Queue()
    for num in range(5):
        data_queue.put(num)
    worker_p = multiprocessing.Process(
        target=worker, args=(data_queue, out_queue),
        name='worker',
    )
    worker_p.start()
    bean_p = multiprocessing.Process(
        target=beancounter, args=(out_queue,),
        name='beancounter',
    )
    bean_p.start()
    worker_p.join()
    bean_p.join()
    logger.info('done')

if __name__=='__main__':
    main()
from: Django multiprocessing and empty queue after put
Another example of using multiprocessing Manager to handle the data is here:
http://johntellsall.blogspot.com/2014/05/code-multiprocessing-producerconsumer.html
One of the simplest ways of sharing state between processes is to use multiprocessing.Manager, which keeps the shared objects in a server process and gives each process a proxy to them:
Example:
from multiprocessing import Process, Manager

def f(d, l):
    d[1] = '1'
    d['2'] = 2
    d[0.25] = None
    l.reverse()

if __name__ == '__main__':
    manager = Manager()
    d = manager.dict()
    l = manager.list(range(10))
    p = Process(target=f, args=(d, l))
    p.start()
    p.join()
    print d
    print l
Output:
bash-4.3$ python -i foo.py
{0.25: None, 1: '1', '2': 2}
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
>>>
Note: Please be careful with the types of objects you are sharing and attaching to your Process classes, as you may end up with pickling issues. See: Python multiprocessing pickling error

Should I protect built-in data structures (list, dict) when using multiple threads?

I think I should use a Lock object to protect a custom class when using multiple threads. However, because Python uses the GIL to ensure that only one thread is running at any given time, does that mean there is no need to use a Lock to protect built-in types like list? For example:
import threading

num_list = []

def consumer():
    while True:
        if len(num_list) > 0:
            num = num_list.pop()
            print num
            return

def producer():
    num_list.append(1)

consumer_thread = threading.Thread(target = consumer)
producer_thread = threading.Thread(target = producer)
consumer_thread.start()
producer_thread.start()
The GIL protects the interpreter state, not yours. There are some operations that are effectively atomic - they require a single bytecode and thus effectively do not require locking. (see is python variable assignment atomic? for an answer from a very reputable Python contributor).
There isn't really any good documentation on this, though, so I wouldn't rely on it in general unless you plan on disassembling bytecode to test your assumptions. If you plan on modifying state from multiple contexts (or modifying and accessing complex state), then you should plan on using some sort of locking/synchronization mechanism.
If you're interested in approaching this class of problem from a different angle you should look into the Queue module. A common pattern in Python code is to use a synchronized queue to communicate among thread contexts rather than working with shared state.
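A minimal sketch of that queue-based pattern follows, assuming Python 3, where the module is named queue. The _STOP sentinel and the run helper are illustrative choices, not something the answer prescribes.

```python
import threading
from queue import Queue

_STOP = object()  # sentinel telling the consumer there is nothing more to read


def producer(q, count):
    for n in range(count):
        q.put(n)
    q.put(_STOP)


def consumer(q, results):
    while True:
        item = q.get()  # blocks until the producer has put something
        if item is _STOP:
            break
        results.append(item)


def run(count=5):
    q = Queue()
    results = []
    threads = [
        threading.Thread(target=consumer, args=(q, results)),
        threading.Thread(target=producer, args=(q, count)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run())  # [0, 1, 2, 3, 4]
```

The consumer never inspects shared state to decide whether data is available; the blocking get() does the waiting, so there is no check-then-act window.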
@jeremy-brown explains it in words (see below), but if you want a counterexample:
The interpreter lock (GIL) isn't protecting your state. The following example doesn't use locks, and as a result, if the xrange value is high enough, it will fail with IndexError: pop from empty list.
import threading
import time

con1_list = []
con2_list = []
stop = 10000
total = 500000
num_list = []
done1 = done2 = None  # placeholders: consumer never uses doneFlag

def consumer(name, doneFlag):
    while True:
        # Unsafe: another thread may pop between this check and the pop below.
        if len(num_list) > 0:
            if name == 'nix':
                con2_list.append(num_list.pop())
                if len(con2_list) == stop:
                    print 'done b'
                    return
            else:
                con1_list.append(num_list.pop())
                if len(con1_list) == stop:
                    print 'done a'
                    return

def producer():
    for x in xrange(total):
        num_list.append(x)

def test():
    while not (len(con2_list) >= stop and len(con1_list) >= stop):
        time.sleep(1)
    print set(con1_list).intersection(set(con2_list))

consumer_thread = threading.Thread(target = consumer, args=('nick', done1))
consumer_thread2 = threading.Thread(target = consumer, args=('nix', done2))
producer_thread = threading.Thread(target = producer)
watcher = threading.Thread(target = test)
consumer_thread.start()
consumer_thread2.start()
producer_thread.start()
watcher.start()
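For contrast, here is a sketch of one way to close that window: the emptiness check and the pop happen under the same lock. It is a simplified Python 3 variant with a pre-filled list instead of a running producer, and drain and run_locked are hypothetical names chosen for this illustration.

```python
import threading


def drain(shared, lock, out):
    while True:
        with lock:
            # Check and pop are one critical section, so the list cannot
            # become empty between the test and the pop.
            if not shared:
                return
            item = shared.pop()
        out.append(item)  # out is private to this thread


def run_locked(total=10000):
    shared = list(range(total))
    lock = threading.Lock()
    a, b = [], []
    threads = [
        threading.Thread(target=drain, args=(shared, lock, a)),
        threading.Thread(target=drain, args=(shared, lock, b)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a, b


a, b = run_locked()
print(len(a) + len(b), set(a).isdisjoint(b))
```

How the items split between the two consumers varies from run to run, but every item is taken exactly once and no IndexError is raised.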
