I am trouble figuring out how to make a synchronized Python object. I have a class called Observation and a class called Variable that basically looks like (code is simplified to show the essence):
class Observation:
def __init__(self, date, time_unit, id, meta):
self.date = date
self.time_unit = time_unit
self.id = id
self.count = 0
self.data = 0
def add(self, value):
if isinstance(value, list):
if self.count == 0:
self.data = []
self.data.append(value)
else:
self.data += value
self.count += 1
class Variable:
def __init__(self, name, time_unit, lock):
self.name = name
self.lock = lock
self.obs = {}
self.time_unit = time_unit
def get_observation(self, id, date, meta):
self.lock.acquire()
try:
obs = self.obs.get(id, Observation(date, self.time_unit, id, meta))
self.obs[id] = obs
finally:
self.lock.release()
return obs
def add(self, date, value, meta={}):
self.lock.acquire()
try:
obs = self.get_observation(id, date, meta)
obs.add(value)
self.obs[id] = obs
finally:
self.lock.release()
This is how I setup the multiprocessing part:
plugin = function defined somewhere else
tasks = JoinableQueue()
result = JoinableQueue()
mgr = Manager()
lock = mgr.RLock()
var = Variable('foobar', 'year', lock)
for person in persons:
tasks.put(Task(plugin, var, person))
Example of how the code is supposed to work:
I have an instance of Variable called var and I want to add an observation to var:
today = datetime.datetime.today()
var.add(today, 1)
So, the add function of Variable looks whether there already exists an observation for that date, if it does then it returns that observation else it creates a new instance of Observation. Having found an observation than the actual value is added by the call obs.add(value). My main concern is that I want to make sure that different processes are not creating multiple instances of Observation for the same date, that's why I lock it.
One instance of Variable is created and is shared between different processes using the multiprocessing library and is the container for numerous instances of Observation. The above code does not work, I get the error:
RuntimeError: Lock objects should only
be shared between processes through
inheritance
However, if I instantiate a Lock object before launching the different processes and supply it to the constructor of Variable then it seems that I get a race condition as all processes seem to be waiting for each other.
The ultimate goal is that different processes can update the obs variable in the object Variable. I need this to be threadsafe because I am not just modifying the dictionary in place but adding new elements and incrementing existing variables. the obs variable is a dictionary that contains a bunch of instances of Observation.
How can I make this synchronized where I share one single instance of Variable between numerous multiprocessing processes? Thanks so much for your cognitive surplus!
UPDATE 1:
* I am using multiprocessing Locks and I have changed the source code to show this.
* I have changed the title to more accurately capture the problem
* I have replaced theadsafe with synchronization where I was confusing the two terms.
Thanks to Dmitry Dvoinikov for pointing me out!
One question that I am still not sure about is where do I instantiate Lock? Should this happen inside the class or before initializing the multiprocesses and give it as an argument? ANSWER: Should happen outside the class.
UPDATE 2:
* I fixed the 'Lock objects should only be shared between processes through inheritance' error by moving the initialization of the Lock outside the class definition and using a manager.
* Final question, now everything works except that it seems that when I put my Variable instance in the queue then it does not get updated, and everytime I get it from the queue it does not contain the observation I added in the previous iteration. This is the only thing that is confusing me :(
UPDATE 3:
The final solution was to set the var.obs dictionary to an instance of mgr.dict() and then to have a custom serializer. Happy tho share the code with somebody who is struggling with this as well.
You are talking not about thread safety but about synchronization between separate processes and that's entirely different thing. Anyway, to start
different processes can update the obs variable in the object Variable.
implies that Variable is in shared memory, and you have to explicitly store objects there, by no magic a local instance becomes visible to separate process. Here:
Data can be stored in a shared memory map using Value or Array
Then, your code snippet is missing crucial import section. No way to tell whether you instantiate the right multiprocessing.Lock, not multithreading.Lock. Your code doesn't show the way you create processes and pass data around.
Therefore, I'd suggest that you realize the difference between threads and processes, whether you truly need a shared memory model for an application which contains multiple processes and examine the spec.
Related
I'm trying to write a class to help with buffering some data that takes a while to read in, and which needs to be periodically updated. The python version is 3.7.
There are 3 criteria I would like the class to satisfy:
Manual update: An instance of the class should have an 'update' function, which reads in new data.
Automatic update: An instance's update method should be periodically run, so the buffered data never gets too old. As reading takes a while, I'd like to do this without blocking the main process.
Self contained: Users should be able to inherit from the class and overwrite the method for refreshing data, i.e. the automatic updating should work out of the box.
I've tried having instances create their own subprocess for running the updates. This causes problems because simply passing the instance to another process seems to create a copy, so the desired instance is not updated automatically.
Below is an example of the approach I'm trying. Can anyone help getting the automatic update to work?
import multiprocessing as mp
import random
import time
def refresh_helper(buffer, lock):
"""Periodically calls refresh method in a buffer instance."""
while True:
with lock.acquire():
buffer._refresh_data()
time.sleep(10)
class Buffer:
def __init__(self):
# Set up a helper process to periodically update data
self.lock = mp.Lock()
self.proc = mp.Process(target=refresh_helper, args=(self, self.lock), daemon=True)
self.proc.start()
# Do an initial update
self.data = None
self.update()
def _refresh_data(self):
"""Pretends to read in some data. This would take a while for real data"""
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]
data = [random.choice(numbers) for _ in range(3)]
self.data = data
def update(self):
with self.lock.acquire():
self._refresh_data()
def get_data(self):
return self.data
#
if __name__ == '__main__':
buffer = Buffer()
data_first = buffer.get_data()
time.sleep(11)
data_second = buffer.get_data() # should be different from first
Here is an approach that makes use a of a multiprocessing queue. It's similar to what you had implemented, but your implementation was trying to assign to self within Buffer._refresh_data in both processes. Because self refers to a different Buffer object in each process, they did not affect each other.
To send data from one process to another you need to use shared memory, pipes, or some other such mechanism. Python's multiprocessing library provides multiprocess.Queue, which simplifies this for us.
To send data from the refresh helper to the main process we need only use queue.put in the helper process, and queue.get in the main process. The data being sent must be serializable using Python's pickle module to be sent between the processes through a multiprocess.Queue.
Using a multiprocess.Queue also saves us from having to use locks ourselves, since the queue handles that internally.
To handle the helper process starting and stopping cleanly for the example, I have added __enter__ and __exit__ methods to make Buffer into a context manager. They can be removed if you would rather manually stop the helper process.
I have also changed your _refresh_data method into _get_new_data, which returns new data half the time, and has no new data to give the other half of the time (i.e. it returns None). This was done to make it more similar to what I imagine a real application for this class would be.
It is important that only static/class methods or external functions are called from the other process, as otherwise they may operate on a self attribute that refers to a completely different instance. The exception is if the attribute is meant to be sent across the process barrier, like with self.queue. That is why the update method can use self.queue to send data to the main process despite self being a different Buffer instance in the other process.
The method get_next_data will return the oldest item found in the queue. If there is nothing in the queue, it will wait until something is added to the queue. You can change this behaviour by giving the call to self.queue.get a timeout (which will cause an exception to be raised if it times out), or using self.queue.get_nowait (which will return None immediately if the queue is empty).
from __future__ import annotations
import multiprocessing as mp
import random
import time
class Buffer:
def __init__(self):
self.queue = mp.Queue()
self.proc = mp.Process(target=self._refresh_helper, args=(self,))
self.update()
def __enter__(self):
self.proc.start()
return self
def __exit__(self, ex_type, ex_val, ex_tb):
self.proc.kill()
self.proc.join()
#staticmethod
def _refresh_helper(buffer: "Buffer", period: float = 1.0) -> None:
"""Periodically calls refresh method in a buffer instance."""
while True:
buffer.update()
time.sleep(period)
#staticmethod
def _get_new_data() -> list[int] | None:
"""Pretends to read in some data. This would take a while for real data"""
if random.randint(0, 1):
return random.choices(range(10), k=3)
return None
def update(self) -> None:
new_data = self._get_new_data()
if new_data is not None:
self.queue.put(new_data)
def get_next_data(self):
return self.queue.get()
if __name__ == '__main__':
with Buffer() as buffer:
for _ in range(5):
print(buffer.get_next_data())
Running this code will, as an example, start the helper process, then print out the first 5 pieces of data it gets from the buffer. The first one will be from the update that is performed when the buffer is initialized. The others will all be provided by the helper process running update.
Let's review your criteria:
Manual update: An instance of the class should have an 'update' function, which reads in new data.
The Buffer.update method can be used for this.
Automatic update: An instance's update method should be periodically run, so the buffered data never gets too old. As reading takes a while, I'd like to do this without blocking the main process.
This is done by a helper process which adds data to a queue for later processing. If you would rather throw away old data, and only process the newest data, then the queue can be swapped out for a multiprocess.Array, or whatever other multiprocessing compatible shared memory wrapper you prefer.
Self contained: Users should be able to inherit from the class and overwrite the method for refreshing data, i.e. the automatic updating should work out of the box.
This works by overwriting the _get_new_data method. So long as it's a static or class method which returns the data, automatic updating should work with it without any changes.
All processes exist in different areas of memory from one another, each of which is meant to be fully separate from all others. As you pointed out, the additional process creates a copy of the instance on which it operates, meaning the updated version exists in a separate memory space from the instance you're running get_data() on. Because of this there is no easy way to perform this operation on this specific instance from a different process.
Given that you want the updating of the data to not block the checking of the data, you may not use threading, as only 1 thread may operate at a time in any given process. Instead, you need to use an object which exists in a memory space shared between all processes. To do this, you can use a multiprocessing.Value object or a multiprocessing.Array, both of which store ctypes objects. Both of these objects existed in 3.7 (appropriate documentation attached.)
If this approach does not work, consider examining these similar threads:
Sharing a complex object between processes?
multiprocessing: sharing a large read-only object between processes?
Good luck with your project!
I'm using a couple of class attributes to keep track of aggregate task completion across multiple instances of class. When reading or updating the class attributes do I need to use a lock of some sort?
class ClassAttrExample:
of_type_list = []
of_type_int = 0
def __init__(self, name):
self.name = name
def do_task(self):
# does some stuff
# do I need a lock context here???
self.of_type_list.append(self.name)
self.of_type_int += 1
If not threads are involved, no locks are required just because class instances share data. As long as the operations are performed in the same thread, everything is safe.
If threads are involved, you'll want locks.
For the specific case of CPython (the reference interpreter), as an implementation detail, the .append call does not require a lock. The GIL can only be switched out between bytecodes (or when a bytecode calls into C code that explicitly releases it, which list never does), and list.append is effectively atomic as a result (all the work it does occurs within a single CALL_METHOD bytecode which never calls back into Python level code, so the GIL is definitely held the whole time).
By contrast, += involves reading the input operand, then performing the increment, then reassigning the input, and the GIL can be swapped between those operations, leading to missed increments when two threads read the value before either writes back to it.
So if multithreaded access is possible, for the int case, the lock is required. And given you need the lock anyway, you may as well lock around the append call too, ensuring the code remains portable to GIL-free Python interpreters.
A fully portable thread-safe version of your class would look something like:
import threading
class ClassAttrExample:
_lock = threading.Lock()
of_type_list = []
of_type_int = 0
def __init__(self, name):
self.name = name
def do_task(self):
# does some stuff
with self._lock:
# Can't use bare name to refer to class attribute, must access
# through class or instance thereof
self.of_type_list.append(self.name) # Load-only access to of_type_list
# can use self directly
type(self).of_type_int += 1 # Must use type(self) to avoid creating
# instance attribute that shadows class
# attribute on store
I'm trying to cache attributes to get a behavior like this:
ply = Player(id0=1)
ply.name = 'Bob'
# Later on, even in a different file
ply = Player(id0=1)
print(ply.name) # outputs: Bob
So basically I want to retain the value between different objects if only their id0 is equal.
Here's what I attempted:
class CachedAttr(object):
_cached_attrs = {}
def __init__(self, defaultdict_factory=int):
type(self)._cached_attrs[id(self)] = defaultdict(defaultdict_factory)
def __get__(self, instance, owner):
if instance:
return type(self)._cached_attrs[id(self)][instance.id0]
def __set__(self, instance, value):
type(self)._cached_attrs[id(self)][instance.id0] = value
And you'd use the class like so:
class Player(game_engine.Player):
name = CachedAttr(str)
health = CachedAttr(int)
It seems to work. However, a friend of mine (somewhat) commented about this:
You are storing objects by their id (memory address) which is most likely going to leaks or get values of garbage collected objects from a new one which reused the pointer. This is dangerous since the id itself is not a reference but only an integer independent of the object itself (which means you will most likely store freed pointers and grow in size till you hit a MemoryError).
And I've been experiencing some random crashes, could this be the reason of the crashes?
If so, is there a better way to cache the values other than their id?
Edit: Just to make sure; my Player class inherits from game_engine.Player, which is not created by me (I'm only creating a mod for an other game), and the game_engine.Player is used by always getting a new instance of the player from his id0. So this isn't a behavior defined by me.
Instead of __init__, look at __new__. There, look up the unique object in dict and return that instead of cerating a new one. That way, you avoid the unnecessary allocations for the object wherever you need it. Also, you avoid the problem that the different objects have different IDs.
I am trying to understand Threads in Python.
The code
And now I have a problem, which I have surrounded in one simple class:
# -*- coding: utf-8 -*-
import threading
class myClassWithThread(threading.Thread):
__propertyThatShouldNotBeShared = []
__id = None
def __init__(self, id):
threading.Thread.__init__(self)
self.__id = id
def run(self):
while 1:
self.dummy1()
self.dummy2()
def dummy1(self):
if self.__id == 2:
self.__propertyThatShouldNotBeShared.append("Test value")
def dummy2(self):
for data in self.__propertyThatShouldNotBeShared:
print self.__id
print data
self.__propertyThatShouldNotBeShared.remove(data)
obj1 = myClassWithThread(1)
obj2 = myClassWithThread(2)
obj3 = myClassWithThread(3)
obj1.start()
obj2.start()
obj3.start()
Description
Here is what the class does :
The class has two attributes :
__id which is an identifier for the object, given when the constructor is called
__propertyThatShouldNotBeShared is a list and will contain a text value
Now the methods
run() contains an infinite loop in which I call dummy1() and then dummy2()
dummy1() which adds to attribute (list) __propertyThatShouldNotBeShared the value "Test value" only IF the __id of the object is equal to 2
dummy2() checks if the size of the list __propertyThatShouldNotBeShared is strictly superior to 0, then
for each value in __propertyThatShouldNotBeShared it prints the id of
the object and the value contained in __propertyThatShouldNotBeShared
then it removes the value
Here is the output that I get when I launch the program :
21
Test valueTest value
2
Test value
Exception in thread Thread-2:
Traceback (most recent call last):
File "E:\PROG\myFace\python\lib\threading.py", line 808, in __bootstrap_inner
self.run()
File "E:\PROG\myFace\myProject\ghos2\src\Tests\threadDeMerde.py", line 15, in run
self.dummy2()
File "E:\PROG\myFace\myProject\ghos2\src\Tests\threadDeMerde.py", line 27, in dummy2
self.__propertyThatShouldNotBeShared.remove(data)
ValueError: list.remove(x): x not in list
The problem
As you can see in the first line of the output I get this "1"...which means that, at some point, the object with the id "1" tries to print something on the screen...and actually it does!
But this should be impossible!
Only object with id "2" should be able to print anything!
What is the problem in this code ? Or what is the problem with my logic?
The problem is this:
class myClassWithThread(threading.Thread):
__propertyThatShouldNotBeShared = []
It defines one list for all objects which is shared. You should do this:
class myClassWithThread(threading.Thread):
def __init__(self, id):
self.__propertyThatShouldNotBeShared = []
# the other code goes here
There are two problems hereāthe one you asked about, thread-safety, and the one you didn't, the difference between class and instance attributes.
It's the latter that's causing your actual problem. A class attribute is shared by all instances of the class. It has nothing to do with whether those instances are accessed on a single thread or on multiple threads; there's only one __propertyThatShouldNotBeShared that's shared by everyone. If you want an instance attribute, you have to define it on the instance, not on the class. Like this:
class myClassWithThread(threading.Thread):
def __init__(self, id):
self.__propertyThatShouldNotBeShared = []
Once you do that, each instance has its own copy of __propertyThatShouldNotBeShared, and each lives on its own thread, so there is no thread-safety issue possible.
However, your original code does have a thread-safety problem.
Almost nothing is automatically thread-safe (aka "synchronized"); exceptions (like queue.Queue) will say so explicitly, and be meant specifically for threaded programming.
You can avoid this in three ways:
Don't share anything.
Don't mutate anything you share.
Don't mutate anything you share unless it's protected by an appropriate synchronization object.
The last one is of course the most flexible, but also the most complicated. In fact, it's at the center of why people consider threaded programming hard.
The short version is, everywhere you modify or access shared mutable data like self.__propertyThatShouldNotBeShared, you need to be holding some kind of synchronization object, like a Lock. For example:
class myClassWithThread(threading.Thread):
__lock = threading.Lock()
# etc.
def dummy1(self):
if self.__id == 2:
with self.__lock:
self.__propertyThatShouldNotBeShared.append("Test value")
If you stick to CPython, and to built-in types, you can often get away with ignoring locks. But "often" in threaded programming is just a synonym for "always during testing and debugging, right up until the release or big presentation, when it suddenly begins failing". Unless you want to learn the rules for how the Global Interpreter Lock and the built-in types work in CPython, don't rely on this.
Class variables in Python are just that: shared by all instances of the class. You need an instance variable, which you usually define inside __init__. Remove the class-level declarations (and the double leading underscores, they're for name mangling which you don't need here.)
I am new to Python. I am writing a simulation in SimPy to model a production line, which looks like: Machine 1 -> Buffer 1 -> Machine 2 -> Buffer 2 -> and so on..
My question:
I have a class, Machine, of which there are several instances. Suppose that the current instance is Machine 2. The methods of this instance affect the states of Machines 1 and 3. For ex: if Buffer 2 was empty then Machine 3 is idle. But when Machine 2 puts a part in Buffer 2, Machine 3 should be activated.
So, what is the way to refer to different instances of the same class from any given instance of that class?
Also, slightly different question: What is the way to call an object (Buffers 1 and 2, in this case) from the current instance of another class?
Edit: Edited to add more clarity about the system.
It is not common for instances of a class to know about other instances of the class.
I would recommend you keep some sort of collection of instances in your class itself, and use the class to look up the instances:
class Machine(object):
lst = []
def __init__(self, name):
self.name = name
self.id = len(Machine.lst)
Machine.lst.append(self)
m0 = Machine("zero")
m1 = Machine("one")
print(Machine.lst[1].name) # prints "one"
This is a silly example that I cooked up where you put some data into the first machine which then moves it to the first buffer which then moves it to the second machine ...
Each machine just tags the data with it's ID number and passes it along, but you could make the machines do anything. You could even register a function to be called at each machine when it gets data.
class Machine(object):
def __init__(self,number,next=None):
self.number=number
self.register_next(next)
def register_next(self,next):
self.next=next
def do_work(self,data):
#do some work here
newdata='%s %d'%(str(data),self.number)
if(self.next is not None):
self.next.do_work(newdata)
class Buffer(Machine):
def __init__(self,number,next=None):
Machine.__init__(self,number,next=next)
data=None
def do_work(self,data):
if(self.next is not None):
self.next.do_work(data)
else:
self.data=data
#Now, create an assembly line
assembly=[Machine(0)]
for i in xrange(1,20):
machine=not i%2
assembly.append(Machine(i) if machine else Buffer(i))
assembly[-2].register_next(assembly[-1])
assembly[0].do_work('foo')
print (assembly[-1].data)
EDIT
Buffers are now Machines too.
Now that you added more info about the problem, I'll suggest an alternate solution.
After you have created your machines, you might want to link them together.
class Machine(object):
def __init__(self):
self.handoff = None
def input(self, item):
item = do_something(item) # machine processes item
self.handoff(item) # machine hands off item to next machine
m0 = Machine()
m1 = Machine()
m0.handoff = m1.input
m2 = Machine()
m1.handoff = m2.input
def output(item):
print(item)
m2.handoff = output
Now when you call m0.input(item) it will do its processing, then hand off the item to m1, which will do the same and hand off to m2, which will do its processing and call output(). This example shows synchronous processing (an item will go all the way through the chain before the function calls return) but you could also have the .input() method put the item on a queue for processing and then return immediately; in this way, you could make the machines process in parallel.
With this system, the connections between the machines are explicit, and each machine only knows about the one that follows it (the one it needs to know about).
I use the word "threading" to describe the process of linking together objects like this. An item being processed follows the thread from machine to machine before arriving at the output. Its slightly ambiguous because it has nothing to do with threads of execution, so that term isn't perfect.