How to periodically call instance method from a separate process - python

I'm trying to write a class to help with buffering some data that takes a while to read in, and which needs to be periodically updated. The python version is 3.7.
There are 3 criteria I would like the class to satisfy:
Manual update: An instance of the class should have an 'update' function, which reads in new data.
Automatic update: An instance's update method should be periodically run, so the buffered data never gets too old. As reading takes a while, I'd like to do this without blocking the main process.
Self contained: Users should be able to inherit from the class and overwrite the method for refreshing data, i.e. the automatic updating should work out of the box.
I've tried having instances create their own subprocess for running the updates. This causes problems because simply passing the instance to another process seems to create a copy, so the desired instance is not updated automatically.
Below is an example of the approach I'm trying. Can anyone help getting the automatic update to work?
import multiprocessing as mp
import random
import time


def refresh_helper(buffer, lock):
    """Periodically calls refresh method in a buffer instance."""
    while True:
        # A Lock is itself a context manager; lock.acquire() returns a bool
        with lock:
            buffer._refresh_data()
        time.sleep(10)


class Buffer:
    def __init__(self):
        # Set up a helper process to periodically update data
        self.lock = mp.Lock()
        self.proc = mp.Process(target=refresh_helper, args=(self, self.lock), daemon=True)
        self.proc.start()
        # Do an initial update
        self.data = None
        self.update()

    def _refresh_data(self):
        """Pretends to read in some data. This would take a while for real data"""
        numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]
        data = [random.choice(numbers) for _ in range(3)]
        self.data = data

    def update(self):
        with self.lock:
            self._refresh_data()

    def get_data(self):
        return self.data


if __name__ == '__main__':
    buffer = Buffer()
    data_first = buffer.get_data()
    time.sleep(11)
    data_second = buffer.get_data()  # should be different from first

Here is an approach that makes use of a multiprocessing queue. It's similar to what you had implemented, but your implementation was trying to assign to self within Buffer._refresh_data in both processes. Because self refers to a different Buffer object in each process, they did not affect each other.
To send data from one process to another you need to use shared memory, pipes, or some other such mechanism. Python's multiprocessing library provides multiprocessing.Queue, which simplifies this for us.
To send data from the refresh helper to the main process we need only use queue.put in the helper process, and queue.get in the main process. The data being sent must be serializable using Python's pickle module to be sent between the processes through a multiprocessing.Queue.
Using a multiprocessing.Queue also saves us from having to use locks ourselves, since the queue handles that internally.
To handle the helper process starting and stopping cleanly for the example, I have added __enter__ and __exit__ methods to make Buffer into a context manager. They can be removed if you would rather manually stop the helper process.
I have also changed your _refresh_data method into _get_new_data, which returns new data half the time, and has no new data to give the other half of the time (i.e. it returns None). This was done to make it more similar to what I imagine a real application for this class would be.
It is important that only static/class methods or external functions are called from the other process, as otherwise they may operate on a self attribute that refers to a completely different instance. The exception is if the attribute is meant to be sent across the process barrier, like with self.queue. That is why the update method can use self.queue to send data to the main process despite self being a different Buffer instance in the other process.
The method get_next_data will return the oldest item found in the queue. If there is nothing in the queue, it will wait until something is added to the queue. You can change this behaviour by giving the call to self.queue.get a timeout (which will cause a queue.Empty exception to be raised if it times out), or by using self.queue.get_nowait (which raises queue.Empty immediately if the queue is empty).
from __future__ import annotations

import multiprocessing as mp
import random
import time


class Buffer:
    def __init__(self):
        self.queue = mp.Queue()
        self.proc = mp.Process(target=self._refresh_helper, args=(self,))
        self.update()

    def __enter__(self):
        self.proc.start()
        return self

    def __exit__(self, ex_type, ex_val, ex_tb):
        self.proc.kill()
        self.proc.join()

    @staticmethod
    def _refresh_helper(buffer: "Buffer", period: float = 1.0) -> None:
        """Periodically calls refresh method in a buffer instance."""
        while True:
            buffer.update()
            time.sleep(period)

    @staticmethod
    def _get_new_data() -> list[int] | None:
        """Pretends to read in some data. This would take a while for real data"""
        if random.randint(0, 1):
            return random.choices(range(10), k=3)
        return None

    def update(self) -> None:
        new_data = self._get_new_data()
        if new_data is not None:
            self.queue.put(new_data)

    def get_next_data(self):
        return self.queue.get()


if __name__ == '__main__':
    with Buffer() as buffer:
        for _ in range(5):
            print(buffer.get_next_data())
Running this code will, as an example, start the helper process, then print out the first 5 pieces of data it gets from the buffer. The first one will be from the update that is performed when the buffer is initialized. The others will all be provided by the helper process running update.
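If blocking in get_next_data is not what you want, the non-blocking variants can be wrapped in a small helper. This is only a sketch: get_or_none is a made-up name, and returning None for "nothing available" is an assumption that only works if None is never a real data item.

```python
import multiprocessing as mp
import queue  # multiprocessing queues raise queue.Empty when nothing is available


def get_or_none(q, timeout=None):
    """Return the next item from q, or None if nothing is available.

    timeout=None means "do not wait at all" (uses get_nowait);
    otherwise wait up to `timeout` seconds.
    """
    try:
        if timeout is None:
            return q.get_nowait()
        return q.get(timeout=timeout)
    except queue.Empty:
        return None


if __name__ == "__main__":
    q = mp.Queue()
    print(get_or_none(q))               # queue is empty, no waiting
    print(get_or_none(q, timeout=0.1))  # queue is still empty after 0.1 s
    q.put([1, 2, 3])
    print(get_or_none(q, timeout=1.0))  # waits briefly for the item to arrive
```

Inside Buffer, the same try/except could go directly into get_next_data.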
Let's review your criteria:
Manual update: An instance of the class should have an 'update' function, which reads in new data.
The Buffer.update method can be used for this.
Automatic update: An instance's update method should be periodically run, so the buffered data never gets too old. As reading takes a while, I'd like to do this without blocking the main process.
This is done by a helper process which adds data to a queue for later processing. If you would rather throw away old data, and only process the newest data, then the queue can be swapped out for a multiprocessing.Array, or whatever other multiprocessing-compatible shared memory wrapper you prefer.
Self contained: Users should be able to inherit from the class and overwrite the method for refreshing data, i.e. the automatic updating should work out of the box.
This works by overriding the _get_new_data method. So long as it's a static or class method which returns the data, automatic updating should work with it without any changes.
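For the "only keep the newest data" variant, here is a rough sketch using multiprocessing.Array instead of a queue. The names LatestBuffer, SIZE and refresh_helper are made up for illustration, and it assumes the data is a fixed-length list of small integers, which is what lets it fit in a ctypes array.

```python
import multiprocessing as mp
import random
import time


class LatestBuffer:
    """Keeps only the most recent reading, stored in shared memory."""

    SIZE = 3  # assumed fixed length of one reading

    def __init__(self):
        # "i" = C int. The Array carries its own lock, available via get_lock().
        self.latest = mp.Array("i", self.SIZE)
        self.update()

    @staticmethod
    def _get_new_data():
        """Stand-in for a slow read; subclasses would override this."""
        return [random.randrange(10) for _ in range(LatestBuffer.SIZE)]

    def update(self):
        new_data = self._get_new_data()
        with self.latest.get_lock():
            self.latest[:] = new_data  # overwrite the previous reading in place

    def get_data(self):
        with self.latest.get_lock():
            return self.latest[:]  # copy out as a plain list


def refresh_helper(buffer, period, rounds):
    """Runs in the helper process: refresh a fixed number of times."""
    for _ in range(rounds):
        buffer.update()
        time.sleep(period)


if __name__ == "__main__":
    buffer = LatestBuffer()
    proc = mp.Process(target=refresh_helper, args=(buffer, 0.1, 5))
    proc.start()
    proc.join()
    print(buffer.get_data())  # reflects the helper's last update
```

Because the Array lives in shared memory, writes made through the copied LatestBuffer in the helper process are visible to the original instance.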

All processes exist in different areas of memory from one another, each of which is meant to be fully separate from all others. As you pointed out, the additional process creates a copy of the instance on which it operates, meaning the updated version exists in a separate memory space from the instance you're running get_data() on. Because of this there is no easy way to perform this operation on this specific instance from a different process.
Given that you want the updating of the data to not block the checking of the data, threading may not be enough: in CPython only one thread may execute Python bytecode at a time in any given process, so a CPU-heavy refresh would still hold up the main thread (a refresh that mostly waits on I/O is less affected, since the GIL is released while blocking). Instead, you can use an object which exists in a memory space shared between all processes. To do this, you can use a multiprocessing.Value object or a multiprocessing.Array, both of which store ctypes objects. Both of these objects existed in 3.7 (appropriate documentation attached.)
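As a minimal illustration of the difference from a plain attribute, the sketch below has a child process write into a multiprocessing.Value and the parent read the result. The function names are invented for this example, and a single double stands in for the real data.

```python
import multiprocessing as mp


def write_reading(shared, new_value):
    """Store new_value in the shared Value, holding its built-in lock."""
    with shared.get_lock():
        shared.value = new_value


def read_after_child_write(new_value=42.5):
    """Let a child process write, then read the value back in the parent."""
    shared = mp.Value("d", 0.0)  # "d" = C double, initial value 0.0
    proc = mp.Process(target=write_reading, args=(shared, new_value))
    proc.start()
    proc.join()
    return shared.value  # the child's write is visible here


if __name__ == "__main__":
    print(read_after_child_write())
```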
If this approach does not work, consider examining these similar threads:
Sharing a complex object between processes?
multiprocessing: sharing a large read-only object between processes?
Good luck with your project!

Related

Do mutable class attributes require a lock when reading or updating?

I'm using a couple of class attributes to keep track of aggregate task completion across multiple instances of class. When reading or updating the class attributes do I need to use a lock of some sort?
class ClassAttrExample:
    of_type_list = []
    of_type_int = 0

    def __init__(self, name):
        self.name = name

    def do_task(self):
        # does some stuff
        # do I need a lock context here???
        self.of_type_list.append(self.name)
        self.of_type_int += 1
If no threads are involved, no locks are required just because class instances share data. As long as the operations are performed in the same thread, everything is safe.
If threads are involved, you'll want locks.
For the specific case of CPython (the reference interpreter), as an implementation detail, the .append call does not require a lock. The GIL can only be switched out between bytecodes (or when a bytecode calls into C code that explicitly releases it, which list never does), and list.append is effectively atomic as a result (all the work it does occurs within a single CALL_METHOD bytecode which never calls back into Python level code, so the GIL is definitely held the whole time).
By contrast, += involves reading the input operand, then performing the increment, then reassigning the input, and the GIL can be swapped between those operations, leading to missed increments when two threads read the value before either writes back to it.
So if multithreaded access is possible, for the int case, the lock is required. And given you need the lock anyway, you may as well lock around the append call too, ensuring the code remains portable to GIL-free Python interpreters.
A fully portable thread-safe version of your class would look something like:
import threading


class ClassAttrExample:
    _lock = threading.Lock()
    of_type_list = []
    of_type_int = 0

    def __init__(self, name):
        self.name = name

    def do_task(self):
        # does some stuff
        with self._lock:
            # Can't use bare name to refer to class attribute, must access
            # through class or instance thereof
            self.of_type_list.append(self.name)  # Load-only access to of_type_list
                                                 # can use self directly
            type(self).of_type_int += 1  # Must use type(self) to avoid creating
                                         # instance attribute that shadows class
                                         # attribute on store
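To see the locked version hold up under contention, here is a small harness. It is a sketch only: Counter is a trimmed-down stand-in for the class above, and run_threads is a hypothetical helper name.

```python
import threading


class Counter:
    _lock = threading.Lock()
    names = []   # shared by all instances
    total = 0    # shared by all instances

    def __init__(self, name):
        self.name = name

    def do_task(self):
        with self._lock:
            self.names.append(self.name)
            type(self).total += 1  # store on the class, not the instance


def run_threads(n_threads=8, n_calls=1000):
    """Call do_task from several threads and return (total, len(names))."""
    # Reset the shared class state so repeated runs start from zero
    Counter.names = []
    Counter.total = 0

    def worker(index):
        instance = Counter(f"worker-{index}")
        for _ in range(n_calls):
            instance.do_task()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return Counter.total, len(Counter.names)


if __name__ == "__main__":
    print(run_threads())
```

With the lock in place both numbers should equal n_threads * n_calls; removing the lock makes lost increments possible, though they may not show up on every run.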

How to control access to a file from multiple processes in python

I am stuck trying to find a solution to the following multiprocessing issue.
I have a class Record in the record.py module. The responsibility of the Record class is to process the input data and save it into a JSON file.
The Record class has a put() method to update the JSON file.
The Record class is initialized in a class decorator. The decorator is applied to most of the classes in various sub-modules.
The decorator extracts information about each method it decorates and sends the data to the put() method of the Record class.
The put() method of the Record class then updates the JSON file.
The problem is that when different processes run, each process creates its own instance of the Record object, and the JSON data gets corrupted because multiple processes try to update the same JSON file.
Also, each process may have threads running that try to access and update the same JSON file.
Please let me know how I can resolve this problem.
from multiprocessing import Pool


class Record:
    def put(self, data):
        # read json file
        # update json file with new data
        # close json file
        ...


def decorate_method(theMethod):
    # Extract method details (extract_method_details is defined elsewhere)
    data = extract_method_details(theMethod)
    # Initialize record object
    rec = Record()
    rec.put(data)


def ClassDeco(cls):
    # This class decorator decorates all methods of the target class
    for method in vars(cls).values():  # <---- This is just pseudocode
        if callable(method):
            decorate_method(method)
    return cls


@ClassDeco
class Test:
    def __init__(self):
        pass

    def run(self, a):
        # some function calls
        ...


if __name__ == "__main__":
    t = Test()
    p = Pool(processes=len(process_pool))  # process_pool is defined elsewhere
    p.apply_async(t.run, args=(10,))
    p.apply_async(t.run, args=(20,))
    p.close()
You should lock the file prior to reading and writing it. Check another question related to file locking in python: Locking a file in Python
Have you heard of the critical section concept in multiprocessing/multithreading programming?
If so, think about using multiprocessing locks to allow only one process at a time to write to the file.
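One possible shape for this, as a sketch rather than a drop-in fix: create a single multiprocessing.Lock in the parent and hand it to every pool worker through the pool initializer, then hold it around the whole read-modify-write of the JSON file. The names init_worker, put and run_demo are invented here, and the file is assumed to contain one JSON object.

```python
import json
import multiprocessing as mp
import os
import tempfile

_file_lock = None  # set once per worker process by init_worker


def init_worker(lock):
    """Pool initializer: remember the shared lock in this process."""
    global _file_lock
    _file_lock = lock


def put(path, key, value):
    """Read, update and rewrite the JSON file as one critical section."""
    with _file_lock:
        with open(path) as f:
            records = json.load(f)
        records[key] = value
        with open(path, "w") as f:
            json.dump(records, f)


def run_demo(n_tasks=8):
    """Update one JSON file from several pool workers; return its contents."""
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w") as f:
        json.dump({}, f)
    lock = mp.Lock()
    with mp.Pool(processes=4, initializer=init_worker, initargs=(lock,)) as pool:
        pool.starmap(put, [(path, f"task{i}", i) for i in range(n_tasks)])
    with open(path) as f:
        result = json.load(f)
    os.remove(path)
    return result


if __name__ == "__main__":
    print(run_demo())
```

The same lock also serializes threads inside each worker, since a multiprocessing.Lock can be acquired from threads as well. It only protects processes that share this particular lock object; unrelated programs writing the same file would need OS-level file locking instead.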

How would I write an object to a file for later use?

In a program I am creating, I have to write a threading.Thread object to a file, so I can use it later. How would I go about doing this?
You can use the pickle module, although you have to implement some functions to make it work. This is assuming you want to save the state of the things being done in the thread, instead of the thread itself, which is handled by the operating system and can't be serialized in a meaningful way.
import pickle
import threading
...
class MyThread(threading.Thread):
    def run(self):
        # Add the functionality. You have to keep track of your state in a
        # manner that is visible to other functions by using "self." in front
        # of the variables that should be saved
        ...

    def __getstate__(self):
        ...  # Return a picklable object representing the state

    def __setstate__(self, state):
        # Restore the state. You may have to call the "__init__" method, but
        # you have to test it, as I am not sure if this is required to make the
        # resulting object function as expected. You might run the thread from
        # here as well; if you don't, it has to be started manually.
        ...
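To save the state (pickle.dump needs an open binary file object, not a path):
with open("/path/to/file", "wb") as f:
    pickle.dump(thread, f)
To load the state:
with open("/path/to/file", "rb") as f:
    thread = pickle.load(f)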
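A filled-in version of that skeleton might look like the following. It is only a sketch: CounterThread and its two integer attributes are invented for illustration, and the state is assumed to be small and picklable. Unpickling does not call __init__, so __setstate__ calls the Thread initializer itself before restoring the attributes.

```python
import pickle
import threading


class CounterThread(threading.Thread):
    """Toy thread whose only state is a counter and a step count."""

    def __init__(self, start_at=0, steps=3):
        super().__init__()
        self.count = start_at
        self.steps = steps

    def run(self):
        for _ in range(self.steps):
            self.count += 1

    def __getstate__(self):
        # Save only our own data, not the thread's internal locks
        return {"count": self.count, "steps": self.steps}

    def __setstate__(self, state):
        # Rebuild the Thread machinery, then restore our data
        super().__init__()
        self.count = state["count"]
        self.steps = state["steps"]


if __name__ == "__main__":
    first = CounterThread(steps=3)
    first.start()
    first.join()

    payload = pickle.dumps(first)      # could be written to a file instead
    restored = pickle.loads(payload)   # a new, not-yet-started thread
    restored.start()
    restored.join()
    print(first.count, restored.count)
```

The restored object is a fresh thread that continues from the saved counter; the original OS thread itself is never saved.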
Use the pickle module. It allows saving of python types.

Python on-the-fly function to script conversion

Is there a reasonably natural way of converting python function to standalone scripts? Something like:
def f():
    # some long and involved computation
    ...

script = function_to_script(f)  # now script is some sort of closure,
                                # which can be run in a separate process
                                # or even shipped over the network to a
                                # different host
and NOT like:
script = open("script.py", "wt")
script.write("#!/usr/bin/env python")
...
You can turn any "object" into a function by defining the __call__ method on it (see here.) Hence, if you want to compartmentalize some state with the computations, as long as what you've provided from the very top to the bottom of a class can be pickled, then that object can be pickled.
class MyPickledFunction(object):
    def __init__(self, *state):
        self.__state = state

    def __call__(self, *args, **kwargs):
        # stuff in here
        ...
That's the easy cheater way. Why pickling? Anything that can be pickled can be sent to another process without fear. You're forming a "poor man's closure" by using an object like this.
(There's a nice post about the "marshal" library here on SO if you want to truly pickle a function.)
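As a concrete sketch of the callable-object idea, the example below round-trips a small callable through pickle and also hands it to a worker pool. Scale is a made-up class. Note that pickle stores only a reference to the class plus the instance state, so the receiving process or host must be able to import the same class definition.

```python
import multiprocessing as mp
import pickle


class Scale:
    """A 'poor man's closure': the captured state is the factor."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, x):
        return self.factor * x


def round_trip(func):
    """Serialize and deserialize a callable, as sending it elsewhere would."""
    return pickle.loads(pickle.dumps(func))


if __name__ == "__main__":
    triple = round_trip(Scale(3))
    print(triple(5))
    # The same object can be shipped to worker processes, which pickle it
    # behind the scenes.
    with mp.Pool(processes=2) as pool:
        print(pool.map(Scale(3), [1, 2, 3]))
```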

How to create a synchronized object with Python multiprocessing?

I am having trouble figuring out how to make a synchronized Python object. I have a class called Observation and a class called Variable that basically look like this (code is simplified to show the essence):
class Observation:
    def __init__(self, date, time_unit, id, meta):
        self.date = date
        self.time_unit = time_unit
        self.id = id
        self.count = 0
        self.data = 0

    def add(self, value):
        if isinstance(value, list):
            if self.count == 0:
                self.data = []
            self.data.append(value)
        else:
            self.data += value
        self.count += 1


class Variable:
    def __init__(self, name, time_unit, lock):
        self.name = name
        self.lock = lock
        self.obs = {}
        self.time_unit = time_unit

    def get_observation(self, id, date, meta):
        self.lock.acquire()
        try:
            obs = self.obs.get(id, Observation(date, self.time_unit, id, meta))
            self.obs[id] = obs
        finally:
            self.lock.release()
        return obs

    def add(self, date, value, meta={}):
        self.lock.acquire()
        try:
            obs = self.get_observation(id, date, meta)
            obs.add(value)
            self.obs[id] = obs
        finally:
            self.lock.release()
This is how I setup the multiprocessing part:
plugin = ...  # function defined somewhere else
tasks = JoinableQueue()
result = JoinableQueue()
mgr = Manager()
lock = mgr.RLock()
var = Variable('foobar', 'year', lock)
for person in persons:
    tasks.put(Task(plugin, var, person))
Example of how the code is supposed to work:
I have an instance of Variable called var and I want to add an observation to var:
today = datetime.datetime.today()
var.add(today, 1)
So, the add function of Variable checks whether an observation already exists for that date; if it does, it returns that observation, otherwise it creates a new instance of Observation. Having found an observation, the actual value is then added by the call obs.add(value). My main concern is that I want to make sure that different processes are not creating multiple instances of Observation for the same date; that's why I lock it.
One instance of Variable is created and is shared between different processes using the multiprocessing library and is the container for numerous instances of Observation. The above code does not work, I get the error:
RuntimeError: Lock objects should only be shared between processes through inheritance
However, if I instantiate a Lock object before launching the different processes and supply it to the constructor of Variable then it seems that I get a race condition as all processes seem to be waiting for each other.
The ultimate goal is that different processes can update the obs variable in the object Variable. I need this to be threadsafe because I am not just modifying the dictionary in place but adding new elements and incrementing existing variables. the obs variable is a dictionary that contains a bunch of instances of Observation.
How can I make this synchronized where I share one single instance of Variable between numerous multiprocessing processes? Thanks so much for your cognitive surplus!
UPDATE 1:
* I am using multiprocessing Locks and I have changed the source code to show this.
* I have changed the title to more accurately capture the problem
* I have replaced threadsafe with synchronization where I was confusing the two terms.
Thanks to Dmitry Dvoinikov for pointing this out!
One question that I am still not sure about is where do I instantiate Lock? Should this happen inside the class or before initializing the multiprocesses and give it as an argument? ANSWER: Should happen outside the class.
UPDATE 2:
* I fixed the 'Lock objects should only be shared between processes through inheritance' error by moving the initialization of the Lock outside the class definition and using a manager.
* Final question: now everything works, except that when I put my Variable instance in the queue it does not seem to get updated, and every time I get it from the queue it does not contain the observation I added in the previous iteration. This is the only thing that is confusing me :(
UPDATE 3:
The final solution was to set the var.obs dictionary to an instance of mgr.dict() and then to have a custom serializer. Happy to share the code with anybody who is struggling with this as well.
You are talking not about thread safety but about synchronization between separate processes, and that's an entirely different thing. Anyway, to start:
"different processes can update the obs variable in the object Variable."
implies that Variable is in shared memory, and you have to explicitly store objects there; a local instance does not magically become visible to a separate process. From the multiprocessing documentation:
"Data can be stored in a shared memory map using Value or Array"
Then, your code snippet is missing the crucial import section, so there is no way to tell whether you instantiate the right lock, multiprocessing.Lock rather than threading.Lock. Your code also doesn't show how you create processes and pass data around.
Therefore, I'd suggest that you get clear on the difference between threads and processes, consider whether you truly need a shared memory model for an application which contains multiple processes, and examine the spec.
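For reference, the manager-based approach mentioned in UPDATE 3 can be sketched roughly as follows. This is not the asker's actual code: the function names are invented, and plain integer counts stand in for Observation instances. The value is reassigned through obs[key] = ... rather than mutated in place, because changes made to an object fetched from a manager dict are not sent back unless it is stored again.

```python
import multiprocessing as mp


def add_observation(obs, lock, key, value):
    """Create or update the entry for key as one locked step."""
    with lock:
        obs[key] = obs.get(key, 0) + value  # reassign so the proxy sees it


def worker(obs, lock, keys):
    """Runs in each process: add one observation per key."""
    for key in keys:
        add_observation(obs, lock, key, 1)


def count_observations(n_procs=4, keys=("2024-01-01", "2024-01-02")):
    """Update one shared dict from several processes and return a copy."""
    with mp.Manager() as mgr:
        obs = mgr.dict()   # lives in the manager's server process
        lock = mgr.Lock()  # lock proxy that can be passed to any process
        procs = [mp.Process(target=worker, args=(obs, lock, keys))
                 for _ in range(n_procs)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return obs.copy()  # plain dict snapshot before the manager shuts down


if __name__ == "__main__":
    print(count_observations())
```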
