I am stuck trying to find a solution to the multiprocessing issue below.
I have a class Record in the record.py module. The responsibility of the Record class is to process the input data and save it into a JSON file.
The Record class has a put() method to update the JSON file.
The Record class is initialized in a class decorator. The decorator is applied to most of the classes in various sub-modules.
The decorator extracts information about each method it decorates and sends that data to the put() method of the Record class.
The put() method then updates the JSON file.
The problem is that when different processes run, each process creates its own Record instance and the JSON data gets corrupted, since
multiple processes try to update the same JSON file.
Also, each process may have threads running that try to access and update the same JSON file.
Please let me know how I can resolve this problem.
from multiprocessing import Pool


class Record:
    def put(self, data):
        # read json file
        # update json file with new data
        # close json file
        pass


def decorate_method(theMethod):
    # Extract method details
    data = extract_method_details(theMethod)
    # Initialize record object
    rec = Record()
    rec.put(data)


def ClassDeco(cls):
    # This class decorator decorates all methods of the target class
    for method in vars(cls).values():  # <-- this is just pseudocode
        decorate_method(method)
    return cls


@ClassDeco
class Test:
    def __init__(self):
        pass

    def run(self, a):
        # some function calls
        pass


if __name__ == "__main__":
    t = Test()
    p = Pool(processes=len(process_pool))  # process_pool is defined elsewhere in the real code
    p.apply_async(t.run, args=(10,))
    p.apply_async(t.run, args=(20,))
    p.close()
    p.join()
You should lock the file prior to reading and writing it. Check another question related to file locking in Python: Locking a file in Python
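One way to do that on POSIX systems is an fcntl.flock advisory lock held for the whole read-modify-write cycle. A rough sketch (the function and file names here are only illustrative):

import fcntl
import json
import os

def put(data, path="records.json"):
    # Create the file if it does not exist, then hold an exclusive lock
    # for the entire read-modify-write cycle.
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    with os.fdopen(fd, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            raw = f.read()
            current = json.loads(raw) if raw else {}
            current.update(data)
            f.seek(0)
            f.truncate()
            json.dump(current, f)
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)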
Have you ever heard about the concept of a critical section in multiprocessing/multithreading programming?
If so, think about using multiprocessing locks to allow only one process at a time to write to the file.
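A rough sketch of that idea, assuming the workers are created through a Pool as in your snippet (the initializer pattern is one common way to hand a Lock to pool workers; the names are illustrative):

from multiprocessing import Pool, Lock

write_lock = None  # set in each worker by init_worker

def init_worker(lock):
    # Store the lock created in the parent so every worker shares the same one
    global write_lock
    write_lock = lock

def record_put(data):
    # Critical section: only one process at a time may update the JSON file
    with write_lock:
        rec = Record()
        rec.put(data)

if __name__ == "__main__":
    lock = Lock()
    pool = Pool(processes=2, initializer=init_worker, initargs=(lock,))
    pool.apply_async(record_put, args=({"key": "value"},))
    pool.close()
    pool.join()

If threads inside each process also write to the file, the same lock can be acquired around those writes too; a multiprocessing.Lock also serializes threads within a single process.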
I'm trying to write a class to help with buffering some data that takes a while to read in, and which needs to be periodically updated. The Python version is 3.7.
There are 3 criteria I would like the class to satisfy:
Manual update: An instance of the class should have an 'update' function, which reads in new data.
Automatic update: An instance's update method should be periodically run, so the buffered data never gets too old. As reading takes a while, I'd like to do this without blocking the main process.
Self contained: Users should be able to inherit from the class and overwrite the method for refreshing data, i.e. the automatic updating should work out of the box.
I've tried having instances create their own subprocess for running the updates. This causes problems because simply passing the instance to another process seems to create a copy, so the desired instance is not updated automatically.
Below is an example of the approach I'm trying. Can anyone help getting the automatic update to work?
import multiprocessing as mp
import random
import time


def refresh_helper(buffer, lock):
    """Periodically calls refresh method in a buffer instance."""
    while True:
        with lock:
            buffer._refresh_data()
        time.sleep(10)


class Buffer:
    def __init__(self):
        # Set up a helper process to periodically update data
        self.lock = mp.Lock()
        self.proc = mp.Process(target=refresh_helper, args=(self, self.lock), daemon=True)
        self.proc.start()

        # Do an initial update
        self.data = None
        self.update()

    def _refresh_data(self):
        """Pretends to read in some data. This would take a while for real data"""
        numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]
        data = [random.choice(numbers) for _ in range(3)]
        self.data = data

    def update(self):
        with self.lock:
            self._refresh_data()

    def get_data(self):
        return self.data


if __name__ == '__main__':
    buffer = Buffer()
    data_first = buffer.get_data()
    time.sleep(11)
    data_second = buffer.get_data()  # should be different from first
Here is an approach that makes use of a multiprocessing queue. It's similar to what you had implemented, but your implementation was trying to assign to self within Buffer._refresh_data in both processes. Because self refers to a different Buffer object in each process, they did not affect each other.
To send data from one process to another you need to use shared memory, pipes, or some other such mechanism. Python's multiprocessing library provides multiprocessing.Queue, which simplifies this for us.
To send data from the refresh helper to the main process we need only use queue.put in the helper process and queue.get in the main process. The data being sent must be serializable using Python's pickle module in order to pass between processes through a multiprocessing.Queue.
Using a multiprocessing.Queue also saves us from having to use locks ourselves, since the queue handles that internally.
To handle the helper process starting and stopping cleanly for the example, I have added __enter__ and __exit__ methods to make Buffer into a context manager. They can be removed if you would rather manually stop the helper process.
I have also changed your _refresh_data method into _get_new_data, which returns new data half the time, and has no new data to give the other half of the time (i.e. it returns None). This was done to make it more similar to what I imagine a real application for this class would be.
It is important that only static/class methods or external functions are called from the other process, as otherwise they may operate on a self attribute that refers to a completely different instance. The exception is if the attribute is meant to be sent across the process barrier, like with self.queue. That is why the update method can use self.queue to send data to the main process despite self being a different Buffer instance in the other process.
The method get_next_data will return the oldest item found in the queue. If there is nothing in the queue, it will wait until something is added to the queue. You can change this behaviour by giving the call to self.queue.get a timeout (which will cause an exception to be raised if it times out), or using self.queue.get_nowait (which will return None immediately if the queue is empty).
from __future__ import annotations

import multiprocessing as mp
import random
import time


class Buffer:
    def __init__(self):
        self.queue = mp.Queue()
        self.proc = mp.Process(target=self._refresh_helper, args=(self,))
        self.update()

    def __enter__(self):
        self.proc.start()
        return self

    def __exit__(self, ex_type, ex_val, ex_tb):
        self.proc.kill()
        self.proc.join()

    @staticmethod
    def _refresh_helper(buffer: "Buffer", period: float = 1.0) -> None:
        """Periodically calls refresh method in a buffer instance."""
        while True:
            buffer.update()
            time.sleep(period)

    @staticmethod
    def _get_new_data() -> list[int] | None:
        """Pretends to read in some data. This would take a while for real data"""
        if random.randint(0, 1):
            return random.choices(range(10), k=3)
        return None

    def update(self) -> None:
        new_data = self._get_new_data()
        if new_data is not None:
            self.queue.put(new_data)

    def get_next_data(self):
        return self.queue.get()


if __name__ == '__main__':
    with Buffer() as buffer:
        for _ in range(5):
            print(buffer.get_next_data())
Running this code will, as an example, start the helper process, then print out the first 5 pieces of data it gets from the buffer. The first one will be from the update that is performed when the buffer is initialized. The others will all be provided by the helper process running update.
Let's review your criteria:
Manual update: An instance of the class should have an 'update' function, which reads in new data.
The Buffer.update method can be used for this.
Automatic update: An instance's update method should be periodically run, so the buffered data never gets too old. As reading takes a while, I'd like to do this without blocking the main process.
This is done by a helper process which adds data to a queue for later processing. If you would rather throw away old data and only process the newest data, the queue can be swapped out for a multiprocessing.Array, or whatever other multiprocessing-compatible shared memory wrapper you prefer.
Self contained: Users should be able to inherit from the class and overwrite the method for refreshing data, i.e. the automatic updating should work out of the box.
This works by overwriting the _get_new_data method. So long as it's a static or class method which returns the data, automatic updating should work with it without any changes.
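For instance, a hypothetical subclass that reads lines from a file only needs to replace that one method (the file name and parsing below are placeholders); everything else, including the helper process and the queue, is inherited unchanged:

class FileBuffer(Buffer):
    @staticmethod
    def _get_new_data():
        # Stand-in for a slow read; return None when there is nothing new to report
        try:
            with open("readings.txt") as f:
                lines = f.read().splitlines()
        except FileNotFoundError:
            return None
        return lines or None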
All processes exist in different areas of memory from one another, each of which is meant to be fully separate from all others. As you pointed out, the additional process creates a copy of the instance on which it operates, meaning the updated version exists in a separate memory space from the instance you're running get_data() on. Because of this there is no easy way to perform this operation on this specific instance from a different process.
Given that you want the updating of the data not to block the checking of the data, you cannot rely on threading here, since only one thread may execute Python code at a time in any given process (because of the GIL). Instead, you need to use an object which exists in a memory space shared between all processes. To do this, you can use a multiprocessing.Value object or a multiprocessing.Array, both of which store ctypes objects. Both of these were already available in Python 3.7 (see the corresponding documentation).
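A bare-bones sketch of that idea (not the poster's original code; sizes and timings are arbitrary):

import multiprocessing as mp
import random
import time

def refresher(shared):
    """Overwrites the shared array in place so the main process sees new values."""
    while True:
        for i in range(len(shared)):
            shared[i] = random.randint(0, 9)
        time.sleep(1)

if __name__ == "__main__":
    shared = mp.Array('i', 3)  # three C ints living in shared memory
    proc = mp.Process(target=refresher, args=(shared,), daemon=True)
    proc.start()

    time.sleep(2)
    print(list(shared))  # reflects values written by the helper process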
If this approach does not work, consider examining these similar threads:
Sharing a complex object between processes?
multiprocessing: sharing a large read-only object between processes?
Good luck with your project!
I have a time-consuming database lookup (it downloads data from an online source) which I want to avoid doing constantly, so I would like to pickle the data if I don't already have it.
This data is being used by the class which has this classmethod.
Is this a 'proper' or expected use of a classmethod? I feel like I could fairly easily refactor it to be an instance method but it feels like it should be a classmethod due to what it's doing. Below is a mockup of the relevant parts of the class.
import os
import pickle


class Example:
    def __init__(self):
        self.records = self.get_records()

    @classmethod
    def get_records(cls):
        """
        If the records aren't already downloaded from the server,
        get them and add to a pickle file.
        Otherwise, just load the pickle file.
        """
        if not os.path.exists('records.pkl'):
            # Slow request
            records = get_from_server()
            with open('records.pkl', 'wb') as rec_file:
                pickle.dump(records, rec_file)
        else:
            with open('records.pkl', 'rb') as rec_file:
                records = pickle.load(rec_file)
        return records

    def use_records(self):
        for item in self.records:
            ...
Is there also an easy way to refactor this so that I can retrieve the data on request, even if the pickle file exists? Is that as simple as just adding another argument to the classmethod?
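Something along these lines is what I have in mind (a rough sketch):

import os
import pickle

class Example:
    def __init__(self, force_refresh=False):
        self.records = self.get_records(force_refresh=force_refresh)

    @classmethod
    def get_records(cls, force_refresh=False):
        """Load cached records, hitting the server only when needed."""
        if force_refresh or not os.path.exists('records.pkl'):
            records = get_from_server()  # slow request
            with open('records.pkl', 'wb') as rec_file:
                pickle.dump(records, rec_file)
        else:
            with open('records.pkl', 'rb') as rec_file:
                records = pickle.load(rec_file)
        return records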
Thanks for any help.
I'm trying to get the attributes of a FlowFile in my Python script. I have done the following:
class TransformCallback(StreamCallback):
    def __init__(self):
        pass

    def process(self, inputStream, outputStream):
        try:
            # Read input FlowFile content
            input_text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
            input_obj = json.loads(input_text)
            # ... transform input_obj and write to outputStream ...
        except Exception:
            # ... (error handling omitted from this snippet)
            raise
But how can I access my FlowFile attributes in the process method?
They won't be immediately available in the process method unless you do something like pass a reference to the FlowFile into your TransformCallback constructor. Another option is to split up the reading and writing (since you are using IOUtils.toString() to read the whole thing in at once) into two separate calls, then you can do the attribute manipulation outside the process() methods.
By the way, if you just need to read in the whole content as a string, you don't need a StreamCallback or InputStreamCallback, you can use session.read(flowFile) which returns an InputStream (rather than executing a provided callback). You can call IOUtils.toString() on that (and don't forget to close it afterwards), thereby avoiding the callback and allowing easier access to the flow file attributes using your current FlowFile reference (and the getAttribute() or getAttributes() methods).
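A rough sketch of that callback-free pattern, assuming this runs inside an ExecuteScript processor where session and REL_SUCCESS are already bound (the attribute names are placeholders):

import json
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets

flowFile = session.get()
if flowFile is not None:
    # Attributes are available straight from the FlowFile reference
    filename = flowFile.getAttribute('filename')
    all_attributes = flowFile.getAttributes()

    # Read the whole content without a StreamCallback
    input_stream = session.read(flowFile)
    input_text = IOUtils.toString(input_stream, StandardCharsets.UTF_8)
    input_stream.close()

    input_obj = json.loads(input_text)
    # ... transform input_obj using the attributes ...

    session.transfer(flowFile, REL_SUCCESS)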
In a program I am creating, I have to write a threading.Thread object to a file, so I can use it later. How would I go about doing this?
You can use the pickle module, although you have to implement some functions to make it work. This is assuming you want to save the state of the things being done in the thread, instead of the thread itself, which is handled by the operating system and can't be serialized in a meaningful way.
import pickle
import threading
...

class MyThread(threading.Thread):
    def run(self):
        ...  # Add the functionality. Keep your state on "self." so other methods can see and save it.

    def __getstate__(self):
        ...  # Return a picklable object representing the state

    def __setstate__(self, state):
        # Restore the state. You may have to call "__init__" here (test it, as I am not sure it is
        # required to make the resulting object work as expected). You might also start the thread
        # from here; if you don't, it has to be started manually.
        ...
To save the state:
with open("/path/to/file", "wb") as f:
    pickle.dump(thread, f)
To load the state:
with open("/path/to/file", "rb") as f:
    thread = pickle.load(f)
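As a concrete illustration of that skeleton, here is a hypothetical thread that counts and can resume roughly where it left off after being reloaded (file name and timings are arbitrary):

import pickle
import threading
import time

class CounterThread(threading.Thread):
    def __init__(self, start_at=0):
        super().__init__()
        self.count = start_at  # the state we want to survive a save/load

    def run(self):
        while self.count < 100:
            self.count += 1
            time.sleep(0.1)

    def __getstate__(self):
        # Only the counter is saved; the OS-level thread itself cannot be pickled.
        return {'count': self.count}

    def __setstate__(self, state):
        # Rebuild a fresh, not-yet-started thread that resumes from the saved count.
        self.__init__(start_at=state['count'])

if __name__ == '__main__':
    t = CounterThread()
    t.start()
    time.sleep(1)

    with open('thread_state.pkl', 'wb') as f:
        pickle.dump(t, f)

    with open('thread_state.pkl', 'rb') as f:
        restored = pickle.load(f)
    restored.start()  # continues counting from roughly where the original was saved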
Use the pickle module. It allows saving of Python types.
I am working in Python with an Email() class that I would like to extend into a SerializeEmail() class, which simply adds two further methods, .write_email() and .read_email(). I would like this sort of behaviour:
# define email
my_email = SerializeEmail()
my_email.recipients = 'link@hyrule.com'
my_email.subject = 'RE: Master sword'
my_email.body = "Master using it and you can have this."
# write email to file system for hand inspection
my_email.write_email('my_email.txt')
...
# Another script reads in email
my_verified_email = SerializeEmail()
my_verified_email.read_email('my_email.txt')
my_verified_email.send()
I have navigated the JSON encode/decode process, and I can successfully write my SerializeEmail() object and read it back in. However, I can't find a satisfactory way to recreate my object via a SerializeEmail.read_email() call.
class SerializeEmail(Email):
    def write_email(self, file_name):
        with open(file_name, "w") as f:
            json.dump(self, f, cls=SerializeEmailJSONEncoder, sort_keys=True, indent=4)

    def read_email(self, file_name):
        with open(file_name, "r") as f:
            json.load(f, cls=SerializeEmailJSONDecoder)
The problem here is that the json.load() call in my read_email() method returns an instance of my SerializeEmail object, but doesn't assign that object to the current instance that I'm using to call it. So right now I'd have to do something like this,
another_email = my_verified_email.read_email('my_email.txt')
when what I want is for the call to my_verified_email.read_email() to populate the current instance of my_verified_email with the data in the file. I've tried
self = json.load(f,cls=SerializeEmailJSONDecoder)
but that doesn't work. I could just assign each individual element of my returned object to my "self" object, but that seems ad-hoc and inelegant, and I'm looking for the "right way" to do this, if it exists. Any suggestions? If you think that my whole approach is flawed and recommend a different way of accomplishing this task, please sketch it out for me.
While you could jump through a number of hoops to load serialized content into an existing instance, I wouldn't recommend doing so. It's an unnecessary complication which really gains you nothing; it means that the extra step of creating a dummy instance is required every time you want to load an e-mail from JSON. I'd recommend using either a factory class or a factory method which loads the e-mail from the serialized JSON and returns it as a new instance. My personal preference would be a factory method, which you'd accomplish as follows:
class SerializeEmail(Email):
    def write_email(self, file_name):
        with open(file_name, "w") as f:
            json.dump(self, f, cls=SerializeEmailJSONEncoder, sort_keys=True, indent=4)

    @staticmethod
    def read_email(file_name):
        with open(file_name, "r") as f:
            return json.load(f, cls=SerializeEmailJSONDecoder)
# You can now create a new instance by simply doing the following:
new_email = SerializeEmail.read_email('my_email.txt')
Note the @staticmethod decorator, which allows you to call the method on the class without any implicit first argument being passed in. Normally factory methods would be @classmethods, but since you're loading the object from JSON, the implicit class argument is unnecessary.
Notice how, with this modification, you don't need to instantiate a SerializeEmail object before you can load another one from JSON. You simply call the method directly on the class and get the desired behavior.
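For reference, the encoder/decoder pair referred to in the question could be as simple as the following sketch; it assumes SerializeEmail can be constructed without arguments and that the e-mail's state lives in plain instance attributes:

import json

class SerializeEmailJSONEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, SerializeEmail):
            return obj.__dict__  # serialize the instance as a dict of its attributes
        return super().default(obj)

class SerializeEmailJSONDecoder(json.JSONDecoder):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, object_hook=self._as_email, **kwargs)

    @staticmethod
    def _as_email(d):
        # Rebuild a SerializeEmail from the attribute dict
        email = SerializeEmail()
        email.__dict__.update(d)
        return email

Note that object_hook runs for every JSON object in the document, so this simple version is only appropriate when the file contains a single, flat e-mail object.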