Is a Python dictionary thread-safe when keys are thread IDs?

Is a Python dictionary thread-safe when each thread uses only its own thread ID as the key to read or write? Like
import thread
import threading

class Thread(threading.Thread):
    def __init__(self, data):
        super(Thread, self).__init__()
        self.data = data

    def run(self):
        data = self.data[thread.get_ident()]
        # ...

If data is a standard Python dictionary, the __getitem__ call is implemented entirely in C, as is the __hash__ method on the integer value returned by thread.get_ident(). At that point the data.__getitem__(<thread identifier>) call is thread safe. The same applies to writing to data; the data.__setitem__() call is entirely handled in C.
The moment any of these hooks are implemented in Python code, the GIL can be released between bytecodes and all bets are off.
This all assumes you are using CPython; Jython, IronPython, PyPy and other Python implementations may make different decisions on when to switch threads.
You'd be better off using the threading.local() mapping object instead, as it is guaranteed to provide you with a thread-local namespace. It only supports attribute access, though.
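For illustration, a minimal sketch of the threading.local() approach; the attribute name data and the worker class are just examples, using the Python 3 spelling threading.get_ident():
import threading

local = threading.local()  # one shared object, but per-thread contents

class Worker(threading.Thread):
    def run(self):
        local.data = [threading.get_ident()]  # visible only to this thread
        print(local.data)

threads = [Worker() for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
Each thread sees its own local.data, so no locking is needed for this access pattern.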

Related

How to periodically call instance method from a separate process

I'm trying to write a class to help with buffering some data that takes a while to read in and which needs to be periodically updated. The Python version is 3.7.
There are 3 criteria I would like the class to satisfy:
Manual update: An instance of the class should have an 'update' function, which reads in new data.
Automatic update: An instance's update method should be periodically run, so the buffered data never gets too old. As reading takes a while, I'd like to do this without blocking the main process.
Self contained: Users should be able to inherit from the class and overwrite the method for refreshing data, i.e. the automatic updating should work out of the box.
I've tried having instances create their own subprocess for running the updates. This causes problems because simply passing the instance to another process seems to create a copy, so the desired instance is not updated automatically.
Below is an example of the approach I'm trying. Can anyone help getting the automatic update to work?
import multiprocessing as mp
import random
import time

def refresh_helper(buffer, lock):
    """Periodically calls the refresh method on a buffer instance."""
    while True:
        with lock:
            buffer._refresh_data()
        time.sleep(10)

class Buffer:
    def __init__(self):
        # Set up a helper process to periodically update data
        self.lock = mp.Lock()
        self.proc = mp.Process(target=refresh_helper, args=(self, self.lock), daemon=True)
        self.proc.start()
        # Do an initial update
        self.data = None
        self.update()

    def _refresh_data(self):
        """Pretends to read in some data. This would take a while for real data."""
        numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]
        data = [random.choice(numbers) for _ in range(3)]
        self.data = data

    def update(self):
        with self.lock:
            self._refresh_data()

    def get_data(self):
        return self.data

if __name__ == '__main__':
    buffer = Buffer()
    data_first = buffer.get_data()
    time.sleep(11)
    data_second = buffer.get_data()  # should be different from the first
Here is an approach that makes use of a multiprocessing queue. It's similar to what you had implemented, but your implementation was trying to assign to self within Buffer._refresh_data in both processes. Because self refers to a different Buffer object in each process, the two did not affect each other.
To send data from one process to another you need to use shared memory, pipes, or some other such mechanism. Python's multiprocessing library provides multiprocessing.Queue, which simplifies this for us.
To send data from the refresh helper to the main process we need only use queue.put in the helper process and queue.get in the main process. The data being sent must be serializable with Python's pickle module to pass between processes through a multiprocessing.Queue.
Using a multiprocessing.Queue also saves us from having to manage locks ourselves, since the queue handles that internally.
To handle the helper process starting and stopping cleanly for the example, I have added __enter__ and __exit__ methods to make Buffer into a context manager. They can be removed if you would rather manually stop the helper process.
I have also changed your _refresh_data method into _get_new_data, which returns new data half the time, and has no new data to give the other half of the time (i.e. it returns None). This was done to make it more similar to what I imagine a real application for this class would be.
It is important that only static/class methods or external functions are called from the other process, as otherwise they may operate on a self attribute that refers to a completely different instance. The exception is if the attribute is meant to be sent across the process barrier, like with self.queue. That is why the update method can use self.queue to send data to the main process despite self being a different Buffer instance in the other process.
The method get_next_data will return the oldest item found in the queue. If there is nothing in the queue, it will wait until something is added to the queue. You can change this behaviour by giving the call to self.queue.get a timeout (which will cause an exception to be raised if it times out), or using self.queue.get_nowait (which will return None immediately if the queue is empty).
from __future__ import annotations

import multiprocessing as mp
import random
import time

class Buffer:
    def __init__(self):
        self.queue = mp.Queue()
        self.proc = mp.Process(target=self._refresh_helper, args=(self,))
        self.update()

    def __enter__(self):
        self.proc.start()
        return self

    def __exit__(self, ex_type, ex_val, ex_tb):
        self.proc.kill()
        self.proc.join()

    @staticmethod
    def _refresh_helper(buffer: "Buffer", period: float = 1.0) -> None:
        """Periodically calls the update method on a buffer instance."""
        while True:
            buffer.update()
            time.sleep(period)

    @staticmethod
    def _get_new_data() -> list[int] | None:
        """Pretends to read in some data. This would take a while for real data."""
        if random.randint(0, 1):
            return random.choices(range(10), k=3)
        return None

    def update(self) -> None:
        new_data = self._get_new_data()
        if new_data is not None:
            self.queue.put(new_data)

    def get_next_data(self):
        return self.queue.get()

if __name__ == '__main__':
    with Buffer() as buffer:
        for _ in range(5):
            print(buffer.get_next_data())
Running this code will, as an example, start the helper process, then print out the first 5 pieces of data it gets from the buffer. The first one will be from the update that is performed when the buffer is initialized. The others will all be provided by the helper process running update.
Let's review your criteria:
Manual update: An instance of the class should have an 'update' function, which reads in new data.
The Buffer.update method can be used for this.
Automatic update: An instance's update method should be periodically run, so the buffered data never gets too old. As reading takes a while, I'd like to do this without blocking the main process.
This is done by a helper process which adds data to a queue for later processing. If you would rather throw away old data and only process the newest, the queue can be swapped out for a multiprocessing.Array, or whatever other multiprocessing-compatible shared-memory wrapper you prefer.
Self contained: Users should be able to inherit from the class and overwrite the method for refreshing data, i.e. the automatic updating should work out of the box.
This works by overwriting the _get_new_data method. So long as it's a static or class method which returns the data, automatic updating should work with it without any changes.
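For example, a subclass only needs to supply its own _get_new_data, extending the Buffer class above; the FileBuffer name and the numbers.txt data source here are hypothetical, a minimal sketch:
class FileBuffer(Buffer):
    @staticmethod
    def _get_new_data() -> list[int] | None:
        # Hypothetical data source: whitespace-separated integers in a file.
        # Returning None means there is no new data this cycle.
        try:
            with open('numbers.txt') as f:
                return [int(tok) for tok in f.read().split()]
        except FileNotFoundError:
            return None
The helper process will then call this method through update without further changes.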
All processes exist in different areas of memory from one another, each of which is meant to be fully separate from all others. As you pointed out, the additional process creates a copy of the instance on which it operates, meaning the updated version exists in a separate memory space from the instance you're running get_data() on. Because of this there is no easy way to perform this operation on this specific instance from a different process.
Given that you want the updating of the data not to block the checking of the data, threading is a poor fit: under CPython's global interpreter lock, only one thread executes Python bytecode at a time in a given process. Instead, you need an object which exists in a memory space shared between all processes. For that you can use a multiprocessing.Value object or a multiprocessing.Array, both of which store ctypes objects. Both of these exist in Python 3.7.
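A minimal sketch of that shared-memory idea, assuming the buffered data can be reduced to a ctypes-compatible type (here a single int):
import multiprocessing as mp
import time

def refresher(value):
    # Runs in the helper process; writes into memory shared with the parent.
    while True:
        with value.get_lock():  # a Value carries its own lock
            value.value += 1
        time.sleep(1)

if __name__ == '__main__':
    shared = mp.Value('i', 0)  # 'i' = C int, visible to both processes
    proc = mp.Process(target=refresher, args=(shared,), daemon=True)
    proc.start()
    time.sleep(2.5)
    print(shared.value)  # reflects updates made by the helper process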
If this approach does not work, consider examining these similar threads:
Sharing a complex object between processes?
multiprocessing: sharing a large read-only object between processes?
Good luck with your project!

Do mutable class attributes require a lock when reading or updating?

I'm using a couple of class attributes to keep track of aggregate task completion across multiple instances of a class. When reading or updating the class attributes, do I need to use a lock of some sort?
class ClassAttrExample:
    of_type_list = []
    of_type_int = 0

    def __init__(self, name):
        self.name = name

    def do_task(self):
        # does some stuff
        # do I need a lock context here???
        self.of_type_list.append(self.name)
        self.of_type_int += 1
If no threads are involved, no locks are required just because class instances share data. As long as the operations are performed in the same thread, everything is safe.
If threads are involved, you'll want locks.
For the specific case of CPython (the reference interpreter), as an implementation detail, the .append call does not require a lock. The GIL can only be switched out between bytecodes (or when a bytecode calls into C code that explicitly releases it, which list never does), and list.append is effectively atomic as a result (all the work it does occurs within a single CALL_METHOD bytecode which never calls back into Python level code, so the GIL is definitely held the whole time).
By contrast, += involves reading the current value, performing the addition, and then storing the result back, and the GIL can be swapped between those steps, leading to missed increments when two threads read the value before either writes back.
So if multithreaded access is possible, for the int case, the lock is required. And given you need the lock anyway, you may as well lock around the append call too, ensuring the code remains portable to GIL-free Python interpreters.
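A minimal sketch of the lost-increment race; the counter value and thread count are arbitrary, and whether you actually observe a shortfall depends on the interpreter version and switch interval:
import threading

class Counter:
    value = 0

def bump():
    for _ in range(100_000):
        Counter.value += 1  # read, add, store back: not atomic

threads = [threading.Thread(target=bump) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(Counter.value)  # often less than 400000 without a lock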
A fully portable thread-safe version of your class would look something like:
import threading

class ClassAttrExample:
    _lock = threading.Lock()
    of_type_list = []
    of_type_int = 0

    def __init__(self, name):
        self.name = name

    def do_task(self):
        # does some stuff
        with self._lock:
            # Can't use a bare name to refer to a class attribute; must access
            # it through the class or an instance thereof
            self.of_type_list.append(self.name)  # load-only access to of_type_list,
                                                 # so can use self directly
            type(self).of_type_int += 1          # must use type(self) to avoid creating
                                                 # an instance attribute that shadows the
                                                 # class attribute on store

Python on-the-fly function to script conversion

Is there a reasonably natural way of converting a Python function to a standalone script? Something like:
def f():
    # some long and involved computation
    ...

script = function_to_script(f)  # now script is some sort of closure,
                                # which can be run in a separate process
                                # or even shipped over the network to a
                                # different host
and NOT like:
script = open("script.py", "wt")
script.write("#!/usr/bin/env python")
...
You can turn any object into a callable by defining the __call__ method on it (see here). Hence, if you want to bundle some state with the computation: as long as everything the class holds, top to bottom, can be pickled, then the object itself can be pickled.
class MyPickledFunction(object):
    def __init__(self, *state):
        self.__state = state

    def __call__(self, *args, **kwargs):
        # stuff in here
        pass
That's the easy cheater way. Why pickling? Anything that can be pickled can be sent to another process without fear. You're forming a "poor man's closure" by using an object like this.
(There's a nice post about the "marshal" library here on SO if you want to truly pickle a function.)
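A minimal sketch of that object-as-closure idea in action; the Scale class and the use of multiprocessing.Pool here are illustrative assumptions, not part of the question:
import multiprocessing as mp
import pickle

class Scale:
    """A picklable 'poor man's closure': state plus a __call__."""
    def __init__(self, factor):
        self.factor = factor

    def __call__(self, x):
        return self.factor * x

if __name__ == '__main__':
    triple = Scale(3)
    blob = pickle.dumps(triple)  # the callable survives a pickle round trip
    assert pickle.loads(blob)(7) == 21
    with mp.Pool(2) as pool:
        print(pool.map(triple, [1, 2, 3]))  # [3, 6, 9], computed in workers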

How to share object tree with process fork?

I don't have much experience with multithreading, and I'm trying to get something like the below working:
from multiprocessing import Process

class Node:
    def __init__(self):
        self.children = {}

class Test(Process):
    def __init__(self, tree):
        super().__init__()
        self.tree = tree

    def run(self):
        # infinite loop which does stuff to the tree
        self.tree.children[1] = Node()
        self.tree.children[2] = Node()

x = Node()
t = Test(x)
t.start()
print(x.children)  # random access to tree
I realize this shouldn't (and doesn't) work for a variety of very sensible reasons, but I'm not sure how to get it to work. Referring to the documentation, it seems that I need to do something with Managers and Proxies, but I honestly have no idea where to start, or whether that is actually what I'm looking for. Could someone provide an example of the above that works?
multiprocessing has limited support for implicitly shared objects, which can even include shared lists and dicts.
In general, multiprocessing is shared-nothing (after the initial fork) and relies on explicit communication between the processes. This adds overhead (how much really depends on the kind of interaction between the processes), but it neatly avoids a lot of the pitfalls of multithreaded programming. The high-level building blocks of multiprocessing favor master/slave models (especially the Pool class), with the master handing out work items and the slaves operating on them and returning results.
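As an aside, a minimal sketch of that Pool model (the squaring task is arbitrary):
import multiprocessing as mp

def work(item):
    return item * item  # runs in a worker process

if __name__ == '__main__':
    with mp.Pool(4) as pool:
        # the master hands out items; the workers return results
        print(pool.map(work, range(10)))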
Keeping state in sync across several processes may, depending on how often it changes, incur prohibitive overhead.
TL;DR: It can be done, but probably shouldn't.
import multiprocessing
import time

class Test(multiprocessing.Process):
    def __init__(self, manager):
        super().__init__()
        self.quit = manager.Event()
        self.dict = manager.dict()

    def stop(self):
        self.quit.set()
        self.join()

    def run(self):
        self.dict['item'] = 0
        while not self.quit.is_set():
            time.sleep(1)
            self.dict['item'] += 1

m = multiprocessing.Manager()
t = Test(m)
t.start()
for x in range(10):
    time.sleep(1.2)
    print(t.dict)
t.stop()
The multiprocessing examples show how to create proxies for more complicated objects, which should allow you to implement the tree structure in your question.
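As a starting point, a minimal sketch of registering your own class with a custom manager; the accessor methods on Node are illustrative assumptions, since a proxy exposes methods rather than attributes:
from multiprocessing.managers import BaseManager

class Node:
    def __init__(self):
        self.children = {}

    def set_child(self, key, value):
        self.children[key] = value

    def get_children(self):
        return self.children

class TreeManager(BaseManager):
    pass

TreeManager.register('Node', Node)

if __name__ == '__main__':
    with TreeManager() as manager:
        root = manager.Node()  # a proxy; the real Node lives in the manager process
        root.set_child(1, 'leaf')
        print(root.get_children())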
It seems to me that what you want is actual multithreading, rather than multiprocessing. With threads rather than processes, you can do precisely that, since threads run in the same process, sharing all memory and therefore data with each other.
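A thread-based sketch of the code from the question, assuming shared memory between the workers is acceptable:
import threading

class Node:
    def __init__(self):
        self.children = {}

class Test(threading.Thread):
    def __init__(self, tree):
        super().__init__()
        self.tree = tree

    def run(self):
        # mutates the very same tree object the main thread sees
        self.tree.children[1] = Node()
        self.tree.children[2] = Node()

x = Node()
t = Test(x)
t.start()
t.join()  # wait so the print below reliably sees the additions
print(x.children)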

Python threading.Thread, scopes and garbage collection

Say I derive from threading.Thread:
from threading import Thread

class Worker(Thread):
    def start(self):
        self.running = True
        Thread.start(self)

    def terminate(self):
        self.running = False
        self.join()

    def run(self):
        import time
        while self.running:
            print("running")
            time.sleep(1)
Any instance of this class whose thread has been started must have that thread actively terminated before it can be garbage collected (the running thread itself holds a reference to the instance). That is a problem, because it defeats the purpose of garbage collection: I would like an object that encapsulates a thread, so that when the last reference to it goes out of scope, a destructor terminates the thread and cleans up. Thus a destructor
def __del__(self):
    self.terminate()
will not do the trick.
The only way I see to nicely encapsulate threads is by using the low-level thread builtin module and weakref weak references. Or I may be missing something fundamental. So is there a nicer way than tangling things up in weakref spaghetti code?
How about using a wrapper class (which has-a Thread rather than is-a Thread)?
eg:
class WorkerWrapper:
    def __init__(self):
        self.worker = Worker()

    def __del__(self):
        self.worker.terminate()
And then use these wrapper classes in client code, rather than threads directly.
Or perhaps I miss something (:
To add an answer inspired by @datenwolf's comment, here is another way to do it that deals with the object being deleted or the parent thread ending:
import threading
import time
import weakref

class Foo(object):
    def __init__(self):
        self.main_thread = threading.current_thread()
        self.initialised = threading.Event()
        self.t = threading.Thread(target=Foo.threaded_func,
                                  args=(weakref.proxy(self), ))
        self.t.start()
        while not self.initialised.is_set():
            # This loop is necessary to stop the main thread doing anything
            # until the exception handler in threaded_func can deal with the
            # object being deleted.
            pass

    def __del__(self):
        print('self:', self, self.main_thread.is_alive())
        self.t.join()

    def threaded_func(self):
        self.initialised.set()
        try:
            while True:
                print(time.time())
                if not self.main_thread.is_alive():
                    print('Main thread ended')
                    break
                time.sleep(1)
        except ReferenceError:
            print('Foo object deleted')

foo = Foo()
del foo
foo = Foo()
I guess you are a convert from C++, where a lot of meaning can be attached to the scope of a variable, equating it with the variable's lifetime. This is not the case in Python, or in garbage-collected languages in general.
Scope != lifetime, simply because garbage collection occurs whenever the interpreter gets around to it, not at scope boundaries. Especially since you are trying to do asynchronous work with this, the raised hairs on your neck should vibrate to the clamour of all the warning bells in your head!
You can do stuff with the lifetime of objects, using 'del'.
(In fact, if you read the source of CPython's garbage collector module, the obvious (and somewhat funny) disdain expressed there for objects with finalizers (__del__ methods) should tell everybody to rely on object lifetimes only when necessary.)
You could use sys.getrefcount(self) to decide when to leave the loop in your thread, but I can hardly recommend it (just try out what numbers it returns; you won't be happy. To see who holds what, check gc.get_referrers(self)).
The reference count may also depend on garbage collection itself.
Besides, tying the runtime of a thread of execution to scopes/lifetimes of objects is an error 99% of the time. Not even Boost does it. It goes out of its RAII way to define something called a 'detached' thread.
http://www.boost.org/doc/libs/1_55_0/doc/html/thread/thread_management.html
