Python threading.local() not working in Thread class - python

In Python 3.6, I use threading.local() to store some per-thread state.
Here is a simple example to explain my question:
import threading

class Test(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        self.local = threading.local()
        self.local.test = 123

    def run(self):
        print(self.local.test)
When I start this thread:
t = Test()
t.start()
Python gives me an error:
AttributeError: '_thread._local' object has no attribute 'test'
It seems the test attribute cannot be accessed outside the scope of __init__: I can print the value inside __init__ right after setting test = 123 on the local object.
Is it necessary to use a threading.local object inside a Thread subclass? I would think the instance attributes of a Thread instance are enough to keep the attributes thread safe.
Anyway, why does the threading.local object not work as expected across instance methods?

When you constructed your thread object, you were running in a different thread. When the run method executes, it does so in a new thread, and that thread does not yet have the thread-local attribute set. That is why the attribute is missing: it was set in the thread that constructed the Thread object, not in the thread that runs it.

As stated in https://docs.python.org/3.6/library/threading.html#thread-local-data:
The instance’s values will be different for separate threads.
Test.__init__ executes in the caller's thread (i.e. the thread where t = Test() executes). Yes, it's a good place to create the thread-local storage (TLS) object.
But when t.run executes, the object will have completely different contents: the contents accessible only within the thread t.
TLS is good when you need to share data within the scope of the current thread. It is like a local variable inside a function, but for threads. When the thread finishes execution, its TLS disappears.
For inter-thread communication, Futures can be a good choice. Other options are condition variables, events, etc. See the threading docs page.
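A minimal sketch of the fix implied by the answers above: create the threading.local object in __init__, but set its attributes inside run, which executes in the new thread. The seen attribute is a hypothetical addition used only to pass the value back to the caller.

```python
import threading


class Test(threading.Thread):
    def __init__(self):
        super().__init__()
        # Created in the caller's thread; only the container is shared.
        self.local = threading.local()
        # Plain instance attribute (hypothetical), visible to every thread.
        self.seen = None

    def run(self):
        # Runs in the new thread, so this value belongs to that thread.
        self.local.test = 123
        self.seen = self.local.test


t = Test()
t.start()
t.join()
print(t.seen)                    # 123
print(hasattr(t.local, "test"))  # False: the main thread has its own empty view
```

The second print illustrates the point of the question: the same local object shows different contents depending on which thread looks at it.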

Related

How to periodically call instance method from a separate process

I'm trying to write a class to help with buffering some data that takes a while to read in, and which needs to be periodically updated. The python version is 3.7.
There are 3 criteria I would like the class to satisfy:
Manual update: An instance of the class should have an 'update' function, which reads in new data.
Automatic update: An instance's update method should be periodically run, so the buffered data never gets too old. As reading takes a while, I'd like to do this without blocking the main process.
Self contained: Users should be able to inherit from the class and overwrite the method for refreshing data, i.e. the automatic updating should work out of the box.
I've tried having instances create their own subprocess for running the updates. This causes problems because simply passing the instance to another process seems to create a copy, so the desired instance is not updated automatically.
Below is an example of the approach I'm trying. Can anyone help getting the automatic update to work?
import multiprocessing as mp
import random
import time

def refresh_helper(buffer, lock):
    """Periodically calls refresh method in a buffer instance."""
    while True:
        with lock:
            buffer._refresh_data()
        time.sleep(10)

class Buffer:
    def __init__(self):
        # Set up a helper process to periodically update data
        self.lock = mp.Lock()
        self.proc = mp.Process(target=refresh_helper, args=(self, self.lock), daemon=True)
        self.proc.start()
        # Do an initial update
        self.data = None
        self.update()

    def _refresh_data(self):
        """Pretends to read in some data. This would take a while for real data"""
        numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]
        data = [random.choice(numbers) for _ in range(3)]
        self.data = data

    def update(self):
        with self.lock:
            self._refresh_data()

    def get_data(self):
        return self.data

if __name__ == '__main__':
    buffer = Buffer()
    data_first = buffer.get_data()
    time.sleep(11)
    data_second = buffer.get_data()  # should be different from first
Here is an approach that makes use of a multiprocessing queue. It's similar to what you had implemented, but your implementation assigned to self within Buffer._refresh_data in both processes. Because self refers to a different Buffer object in each process, the assignments did not affect each other.
To send data from one process to another you need to use shared memory, pipes, or some other such mechanism. Python's multiprocessing library provides multiprocessing.Queue, which simplifies this for us.
To send data from the refresh helper to the main process we need only use queue.put in the helper process, and queue.get in the main process. The data being sent must be serializable using Python's pickle module to be sent between the processes through a multiprocessing.Queue.
Using a multiprocessing.Queue also saves us from having to use locks ourselves, since the queue handles that internally.
To handle the helper process starting and stopping cleanly for the example, I have added __enter__ and __exit__ methods to make Buffer into a context manager. They can be removed if you would rather manually stop the helper process.
I have also changed your _refresh_data method into _get_new_data, which returns new data half the time, and has no new data to give the other half of the time (i.e. it returns None). This was done to make it more similar to what I imagine a real application for this class would be.
It is important that only static/class methods or external functions are called from the other process, as otherwise they may operate on a self attribute that refers to a completely different instance. The exception is if the attribute is meant to be sent across the process barrier, like with self.queue. That is why the update method can use self.queue to send data to the main process despite self being a different Buffer instance in the other process.
The method get_next_data will return the oldest item found in the queue. If there is nothing in the queue, it will wait until something is added to the queue. You can change this behaviour by giving the call to self.queue.get a timeout, or by using self.queue.get_nowait. Both raise queue.Empty when no item is available (after the timeout expires, or immediately, respectively).
from __future__ import annotations
import multiprocessing as mp
import random
import time

class Buffer:
    def __init__(self):
        self.queue = mp.Queue()
        self.proc = mp.Process(target=self._refresh_helper, args=(self,))
        self.update()

    def __enter__(self):
        self.proc.start()
        return self

    def __exit__(self, ex_type, ex_val, ex_tb):
        self.proc.kill()
        self.proc.join()

    @staticmethod
    def _refresh_helper(buffer: "Buffer", period: float = 1.0) -> None:
        """Periodically calls refresh method in a buffer instance."""
        while True:
            buffer.update()
            time.sleep(period)

    @staticmethod
    def _get_new_data() -> list[int] | None:
        """Pretends to read in some data. This would take a while for real data"""
        if random.randint(0, 1):
            return random.choices(range(10), k=3)
        return None

    def update(self) -> None:
        new_data = self._get_new_data()
        if new_data is not None:
            self.queue.put(new_data)

    def get_next_data(self):
        return self.queue.get()

if __name__ == '__main__':
    with Buffer() as buffer:
        for _ in range(5):
            print(buffer.get_next_data())
Running this code will, as an example, start the helper process, then print out the first 5 pieces of data it gets from the buffer. The first one will be from the update that is performed when the buffer is initialized. The others will all be provided by the helper process running update.
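If blocking forever in get_next_data is not wanted, the queue can be polled instead. The following is only a sketch: poll is a hypothetical helper, not part of the Buffer class above, and it relies on multiprocessing.Queue raising queue.Empty when nothing is available.

```python
import multiprocessing as mp
import queue


def poll(q, timeout=None):
    """Return the next item, or None if nothing arrives in time.

    Hypothetical helper: with timeout=None it checks once without blocking.
    """
    try:
        if timeout is None:
            return q.get_nowait()
        return q.get(timeout=timeout)
    except queue.Empty:
        # Raised by both get_nowait() and get(timeout=...) on an empty queue.
        return None


if __name__ == "__main__":
    q = mp.Queue()
    print(poll(q))               # nothing queued yet
    q.put([1, 2, 3])
    # A short timeout gives the queue's feeder thread time to deliver the item.
    print(poll(q, timeout=1.0))
```

Returning None for "no data" is a design choice of this sketch; callers that need to distinguish a queued None from an empty queue should catch queue.Empty themselves.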
Let's review your criteria:
Manual update: An instance of the class should have an 'update' function, which reads in new data.
The Buffer.update method can be used for this.
Automatic update: An instance's update method should be periodically run, so the buffered data never gets too old. As reading takes a while, I'd like to do this without blocking the main process.
This is done by a helper process which adds data to a queue for later processing. If you would rather throw away old data, and only process the newest data, then the queue can be swapped out for a multiprocessing.Array, or whatever other multiprocessing-compatible shared memory wrapper you prefer.
Self contained: Users should be able to inherit from the class and overwrite the method for refreshing data, i.e. the automatic updating should work out of the box.
This works by overriding the _get_new_data method. So long as it's a static or class method which returns the data, automatic updating should work with it without any changes.
All processes exist in different areas of memory from one another, each of which is meant to be fully separate from all others. As you pointed out, the additional process creates a copy of the instance on which it operates, meaning the updated version exists in a separate memory space from the instance you're running get_data() on. Because of this there is no easy way to perform this operation on this specific instance from a different process.
Given that you want the updating of the data to not block the checking of the data, threading may not be enough: in CPython the global interpreter lock lets only one thread execute Python bytecode at a time in any given process (although threads blocked on I/O do release it). Instead, you can use an object which exists in a memory space shared between all processes. To do this, you can use a multiprocessing.Value object or a multiprocessing.Array, both of which store ctypes objects. Both of these objects exist in 3.7 (appropriate documentation attached).
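A rough sketch of the shared-memory idea, assuming each reading is a fixed number of integers. SLOTS, write_latest, read_latest and refresher are hypothetical names chosen for illustration; only multiprocessing.Array and its get_lock method come from the standard library.

```python
import multiprocessing as mp
import random
import time

SLOTS = 3  # hypothetical fixed size of one reading


def write_latest(shared, values):
    """Overwrite the shared array with the newest reading."""
    with shared.get_lock():
        shared[:] = values


def read_latest(shared):
    """Return a plain list copy of the current reading."""
    with shared.get_lock():
        return list(shared[:])


def refresher(shared, rounds, period):
    """Runs in the helper process and replaces the data a few times."""
    for _ in range(rounds):
        write_latest(shared, [random.randrange(10) for _ in range(SLOTS)])
        time.sleep(period)


if __name__ == "__main__":
    shared = mp.Array("i", SLOTS)  # zero-initialised ints in shared memory
    proc = mp.Process(target=refresher, args=(shared, 5, 0.2), daemon=True)
    proc.start()
    proc.join()
    print(read_latest(shared))
```

Unlike the queue approach, this keeps only the most recent reading: older values are overwritten rather than buffered.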
If this approach does not work, consider examining these similar threads:
Sharing a complex object between processes?
multiprocessing: sharing a large read-only object between processes?
Good luck with your project!

How to have dedicated variable for multiprocessing worker, which keeps its value between calls?

I have the following code:
pool = Pool(cpu_count())
pool.imap(process_item, items, chunksize=100)
In the process_item() function I am using structures which are expensive to create but reusable (though not concurrently shareable). Currently each call of process_item() creates the resource anew in a local variable. It would be a great performance benefit to create it once per worker and then reuse it.
Question
How can I have cpu_count() dedicated instances of that resource, and how should I implement process_item() so that it accesses the instance belonging to that particular worker?
If you cannot use anything outside the standard library, I would suggest using an initializer when creating the pool:
from multiprocessing import Pool, Manager, Process
import os
import random

class A:
    def __init__(self):
        self.var = random.randint(0, 1000)

    def get(self):
        print(self.var, os.getpid())

def worker(some_arg):
    global expensive_var
    expensive_var.get()

def initializer(*args):
    global expensive_var
    expensive_var = A()

if __name__ == "__main__":
    pool = Pool(8, initializer=initializer, initargs=())
    for result in pool.imap(worker, range(100)):
        continue
Create your local variables inside the initializer, and make them global. Then you can use them inside the function you are passing to the pool. This works because the initializer is executed when each process of the pool starts. Making them global therefore creates a global variable in the scope of that child process only, and the function you passed to the pool can access it while it runs there.
There was a stackoverflow answer that explained all this better, but I can't seem to find it for now. But this is basically the gist of it.

Python - is `threading.Event` "set" during garbage collection?

The title of this post pretty much sums up my question - will threads waiting on an Event be notified if that event has been garbage collected? In my particular case I have a class whose instances have an Event as an attribute, and I'm wondering whether I should implement a __del__ method on this class that calls self.event.set() before it's garbage collected.
I'm new to asynchronicity, so if events don't set() when they're garbage collected, perhaps it's bad practice to do so, and better to let threads hang? Thanks in advance for any responses.
Since other objects hold a reference to the event, the event itself won't be deleted or garbage collected. It has no idea that your object is being deleted. Whether you want your class to have a __del__ that sets the event when the object is deleted (either naturally through having its ref count go to zero or through garbage collection) is entirely dependent on your event system design. Suppose I have a dozen objects referencing the event. Do I want the event fired when each one goes away? Depends!
Note that it's not necessarily the case that waiting for an Event implies the Event isn't in trash. Cyclic trash is one possibility, and here's another:
import threading

class C(object):
    def __init__(self):
        self.e = threading.Event()

    def __del__(self):
        print("going away")

def f():
    C().e.wait()

t = threading.Thread(target=f)
t.start()
print("main ending")
That prints:
going away
main ending
and then it hangs forever, as Python attempts to .join() the thread as part of interpreter shutdown processing.
The function f(), run in a thread, creates an instance of C that becomes trash immediately after its e attribute is retrieved. So its __del__ method is called, and "going away" is displayed.
You can infer from the behavior that, no, a trash Event does not get set by magic. But it's not going to come up in practice, so don't worry about it ;-)
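The same experiment can be run without hanging the interpreter by giving wait a timeout. This is only a bounded variant of the example above; the results list is a hypothetical addition used to capture the return value.

```python
import threading


class C:
    def __init__(self):
        self.e = threading.Event()


results = []  # hypothetical: collects what wait() returned


def f():
    # The C instance is garbage as soon as .e has been fetched,
    # yet nothing sets the Event, so wait() simply times out.
    results.append(C().e.wait(timeout=0.2))


t = threading.Thread(target=f)
t.start()
t.join()
print(results)  # [False]
```

Event.wait returns False when the timeout elapses without the flag being set, which confirms that discarding the owning object does not set the event.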

Python: Holding a reference to a subclass of threading

I have a subclass of threading.Thread. After instantiating it, it runs forever in the background.
class MyThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        self.daemon = True
        self.start()

    def run(self):
        while True:
            <do something>
If I were to instantiate the thread from within another class, I would normally do so with
self.my_thread = MyThread()
In cases when I never thereafter have to access the thread, I have long wondered whether I can instead instantiate it simply with
MyThread()
(i.e., instantiate it without holding a reference). Will the thread eventually be garbage collected because there is no reference holding it?
It doesn't matter. You can test it easily with del self.my_thread: you should see the thread continue running even though you deleted the only reference and forced garbage collection. The threading module itself keeps a reference to every running thread, which is why threading.enumerate() can still list it. That said, it is usually a good idea to hold a reference, so that you can set flags and whatnot for the other thread, although shared memory may be sufficient.
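A small sketch of that test, assuming Python 3. The ticks list is a hypothetical stand-in for the thread's real work, used only to observe that the thread keeps running.

```python
import gc
import threading
import time

ticks = []  # hypothetical: records activity of the background thread


class MyThread(threading.Thread):
    def __init__(self):
        super().__init__(daemon=True)
        self.start()

    def run(self):
        while True:
            ticks.append(time.monotonic())
            time.sleep(0.05)


MyThread()      # no reference is kept by this code
gc.collect()    # force a collection; the running thread survives it
before = len(ticks)
time.sleep(0.3)

still_listed = any(isinstance(th, MyThread) for th in threading.enumerate())
print(still_listed, len(ticks) > before)  # True True
```

Because the thread is a daemon, it is abandoned when the main thread exits, so the script still terminates.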

Python threading.Thread, scopes and garbage collection

Say I derive from threading.Thread:
from threading import Thread

class Worker(Thread):
    def start(self):
        self.running = True
        Thread.start(self)

    def terminate(self):
        self.running = False
        self.join()

    def run(self):
        import time
        while self.running:
            print "running"
            time.sleep(1)
Any instance of this class whose thread has been started must have its thread actively terminated before it can be garbage collected (the thread holds a reference to itself). This is a problem, because it completely defeats the purpose of garbage collection. What I want is an object that encapsulates a thread, such that when the last reference to the object goes out of scope, the destructor is called to terminate the thread and clean up. Thus a destructor
def __del__(self):
    self.terminate()
will not do the trick.
The only way I see to nicely encapsulate threads is by using low level thread builtin module and weakref weak references. Or I may be missing something fundamental. So is there a nicer way than tangling things up in weakref spaghetti code?
How about using a wrapper class (which has-a Thread rather than is-a Thread)?
eg:
class WorkerWrapper:
    def __init__(self):
        self.worker = Worker()

    def __del__(self):
        self.worker.terminate()
And then use these wrapper classes in client code, rather than threads directly.
Or perhaps I miss something (:
To add an answer inspired by @datenwolf's comment, here is another way to do it that deals with the object being deleted or the parent thread ending:
import threading
import time
import weakref

class Foo(object):
    def __init__(self):
        self.main_thread = threading.current_thread()
        self.initialised = threading.Event()
        self.t = threading.Thread(target=Foo.threaded_func,
                                  args=(weakref.proxy(self), ))
        self.t.start()
        while not self.initialised.is_set():
            # This loop is necessary to stop the main threading doing anything
            # until the exception handler in threaded_func can deal with the
            # object being deleted.
            pass

    def __del__(self):
        print 'self:', self, self.main_thread.is_alive()
        self.t.join()

    def threaded_func(self):
        self.initialised.set()
        try:
            while True:
                print time.time()
                if not self.main_thread.is_alive():
                    print('Main thread ended')
                    break
                time.sleep(1)
        except ReferenceError:
            print('Foo object deleted')

foo = Foo()
del foo
foo = Foo()
I guess you are a convert from C++ where a lot of meaning can be attached to scopes of variables, equalling lifetimes of variables. This is not the case for Python, and garbage collected languages in general.
Scope != Lifetime, simply because garbage collection occurs whenever the interpreter gets around to it, not on scope boundaries. Especially as you are trying to do asynchronous stuff with it, the raised hairs on your neck should vibrate to the clamour of all the warning bells in your head!
You can do stuff with the lifetime of objects, using '__del__'.
(In fact, if you read the sources of the CPython garbage collector module, the obvious (and somewhat funny) disdain for objects with finalizers (__del__ methods) expressed there should tell everybody to use even the lifetime of an object only if necessary.)
You could use sys.getrefcount(self) to find out when to leave the loop in your thread. But I can hardly recommend that (just try out what numbers it returns. You won't be happy. To see who holds what just check gc.get_referrers(self)).
The reference count may/will depend on garbage collection as well.
Besides, tying the runtime of a thread of execution to scopes/lifetimes of objects is an error 99% of the time. Not even Boost does it. It goes out of its RAII way to define something called a 'detached' thread.
http://www.boost.org/doc/libs/1_55_0/doc/html/thread/thread_management.html
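One way to avoid tying the thread to garbage collection at all is to make its lifetime explicit. The following Python 3 sketch is not taken from any of the answers: it swaps __del__ for a stop flag plus a context manager, and StoppableWorker and its attributes are hypothetical names.

```python
import threading
import time


class StoppableWorker:
    """Owns a thread (has-a) and stops it explicitly instead of via __del__."""

    def __init__(self, period=0.05):
        self._stop_requested = threading.Event()
        self._period = period
        self.iterations = 0
        self._thread = threading.Thread(target=self._run)

    def _run(self):
        # Event.wait doubles as an interruptible sleep: it returns True
        # as soon as a stop has been requested.
        while not self._stop_requested.wait(self._period):
            self.iterations += 1

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._stop_requested.set()
        self._thread.join()


with StoppableWorker() as w:
    time.sleep(0.3)

print(w.iterations > 0, w._thread.is_alive())
```

The with block defines exactly when the thread starts and stops, so nothing depends on when, or whether, the interpreter finalizes the object.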
