I'm trying to get the attributes of a FlowFile in my Python script. I have done the following:
class TransformCallback(StreamCallback):
    def __init__(self):
        pass

    def process(self, inputStream, outputStream):
        try:
            # Read input FlowFile content
            input_text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
            input_obj = json.loads(input_text)
But how can I access my FlowFile attributes in the process method?
They won't be immediately available in the process method unless you do something like pass a reference to the FlowFile into your TransformCallback constructor. Another option is to split up the reading and writing (since you are using IOUtils.toString() to read the whole thing in at once) into two separate calls, then you can do the attribute manipulation outside the process() methods.
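A minimal sketch of that first option, assuming the standard ExecuteScript Jython setup (session and REL_SUCCESS are provided by the processor) and using the 'filename' attribute purely as an example:

import json
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback


class TransformCallback(StreamCallback):
    def __init__(self, flowFile):
        # keep a reference so attributes are reachable inside process()
        self.flowFile = flowFile

    def process(self, inputStream, outputStream):
        input_text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        input_obj = json.loads(input_text)
        # attributes are available through the stored reference
        input_obj['filename'] = self.flowFile.getAttribute('filename')
        outputStream.write(bytearray(json.dumps(input_obj).encode('utf-8')))


flowFile = session.get()
if flowFile is not None:
    flowFile = session.write(flowFile, TransformCallback(flowFile))
    session.transfer(flowFile, REL_SUCCESS)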
By the way, if you just need to read in the whole content as a string, you don't need a StreamCallback or InputStreamCallback, you can use session.read(flowFile) which returns an InputStream (rather than executing a provided callback). You can call IOUtils.toString() on that (and don't forget to close it afterwards), thereby avoiding the callback and allowing easier access to the flow file attributes using your current FlowFile reference (and the getAttribute() or getAttributes() methods).
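And a minimal sketch of the session.read() route, again assuming the standard ExecuteScript Jython setup and using 'filename' only as an example attribute:

import json
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets

flowFile = session.get()
if flowFile is not None:
    # Read the whole content in one call, no callback needed
    inputStream = session.read(flowFile)
    input_text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
    inputStream.close()
    input_obj = json.loads(input_text)

    # Attributes are available directly on the FlowFile reference
    filename = flowFile.getAttribute('filename')
    all_attributes = flowFile.getAttributes()

    session.transfer(flowFile, REL_SUCCESS)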
I'm trying to write a class to help with buffering some data that takes a while to read in, and which needs to be periodically updated. The Python version is 3.7.
There are 3 criteria I would like the class to satisfy:
Manual update: An instance of the class should have an 'update' function, which reads in new data.
Automatic update: An instance's update method should be periodically run, so the buffered data never gets too old. As reading takes a while, I'd like to do this without blocking the main process.
Self contained: Users should be able to inherit from the class and overwrite the method for refreshing data, i.e. the automatic updating should work out of the box.
I've tried having instances create their own subprocess for running the updates. This causes problems because simply passing the instance to another process seems to create a copy, so the desired instance is not updated automatically.
Below is an example of the approach I'm trying. Can anyone help getting the automatic update to work?
import multiprocessing as mp
import random
import time


def refresh_helper(buffer, lock):
    """Periodically calls refresh method in a buffer instance."""
    while True:
        with lock.acquire():
            buffer._refresh_data()
        time.sleep(10)


class Buffer:
    def __init__(self):
        # Set up a helper process to periodically update data
        self.lock = mp.Lock()
        self.proc = mp.Process(target=refresh_helper, args=(self, self.lock), daemon=True)
        self.proc.start()
        # Do an initial update
        self.data = None
        self.update()

    def _refresh_data(self):
        """Pretends to read in some data. This would take a while for real data"""
        numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]
        data = [random.choice(numbers) for _ in range(3)]
        self.data = data

    def update(self):
        with self.lock.acquire():
            self._refresh_data()

    def get_data(self):
        return self.data


#
if __name__ == '__main__':
    buffer = Buffer()
    data_first = buffer.get_data()
    time.sleep(11)
    data_second = buffer.get_data()  # should be different from first
Here is an approach that makes use of a multiprocessing queue. It's similar to what you had implemented, but your implementation was trying to assign to self within Buffer._refresh_data in both processes. Because self refers to a different Buffer object in each process, the two processes did not affect each other.
To send data from one process to another you need to use shared memory, pipes, or some other such mechanism. Python's multiprocessing library provides multiprocessing.Queue, which simplifies this for us.
To send data from the refresh helper to the main process we need only use queue.put in the helper process and queue.get in the main process. The data being sent must be serializable with Python's pickle module in order to pass between processes through a multiprocessing.Queue.
Using a multiprocessing.Queue also saves us from having to use locks ourselves, since the queue handles that internally.
To handle the helper process starting and stopping cleanly for the example, I have added __enter__ and __exit__ methods to make Buffer into a context manager. They can be removed if you would rather manually stop the helper process.
I have also changed your _refresh_data method into _get_new_data, which returns new data half the time, and has no new data to give the other half of the time (i.e. it returns None). This was done to make it more similar to what I imagine a real application for this class would be.
It is important that only static/class methods or external functions are called from the other process, as otherwise they may operate on a self attribute that refers to a completely different instance. The exception is if the attribute is meant to be sent across the process barrier, like with self.queue. That is why the update method can use self.queue to send data to the main process despite self being a different Buffer instance in the other process.
The method get_next_data will return the oldest item found in the queue. If there is nothing in the queue, it will wait until something is added to the queue. You can change this behaviour by giving the call to self.queue.get a timeout (which will cause queue.Empty to be raised if it times out), or by using self.queue.get_nowait (which will raise queue.Empty immediately if the queue is empty).
from __future__ import annotations

import multiprocessing as mp
import random
import time


class Buffer:
    def __init__(self):
        self.queue = mp.Queue()
        self.proc = mp.Process(target=self._refresh_helper, args=(self,))
        self.update()

    def __enter__(self):
        self.proc.start()
        return self

    def __exit__(self, ex_type, ex_val, ex_tb):
        self.proc.kill()
        self.proc.join()

    @staticmethod
    def _refresh_helper(buffer: "Buffer", period: float = 1.0) -> None:
        """Periodically calls refresh method in a buffer instance."""
        while True:
            buffer.update()
            time.sleep(period)

    @staticmethod
    def _get_new_data() -> list[int] | None:
        """Pretends to read in some data. This would take a while for real data"""
        if random.randint(0, 1):
            return random.choices(range(10), k=3)
        return None

    def update(self) -> None:
        new_data = self._get_new_data()
        if new_data is not None:
            self.queue.put(new_data)

    def get_next_data(self):
        return self.queue.get()


if __name__ == '__main__':
    with Buffer() as buffer:
        for _ in range(5):
            print(buffer.get_next_data())
Running this code will, as an example, start the helper process, then print out the first 5 pieces of data it gets from the buffer. The first one will be from the update that is performed when the buffer is initialized. The others will all be provided by the helper process running update.
Let's review your criteria:
Manual update: An instance of the class should have an 'update' function, which reads in new data.
The Buffer.update method can be used for this.
Automatic update: An instance's update method should be periodically run, so the buffered data never gets too old. As reading takes a while, I'd like to do this without blocking the main process.
This is done by a helper process which adds data to a queue for later processing. If you would rather throw away old data, and only process the newest data, then the queue can be swapped out for a multiprocessing.Array, or whatever other multiprocessing-compatible shared memory wrapper you prefer.
Self contained: Users should be able to inherit from the class and overwrite the method for refreshing data, i.e. the automatic updating should work out of the box.
This works by overwriting the _get_new_data method. So long as it's a static or class method which returns the data, automatic updating should work with it without any changes.
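For example, a subclass might override it like this (a minimal sketch assuming the Buffer class above; FileBuffer and data.txt are just illustrative names):

class FileBuffer(Buffer):
    @staticmethod
    def _get_new_data():
        # Illustrative override: read lines from a file instead of random numbers
        try:
            with open("data.txt") as f:
                lines = f.read().splitlines()
        except FileNotFoundError:
            return None
        return lines or None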
All processes exist in different areas of memory from one another, each of which is meant to be fully separate from all others. As you pointed out, the additional process creates a copy of the instance on which it operates, meaning the updated version exists in a separate memory space from the instance you're running get_data() on. Because of this there is no easy way to perform this operation on this specific instance from a different process.
Given that you want the updating of the data not to block the checking of the data, threading is not a good fit here: the GIL means only one thread executes Python code at a time within a given process. Instead, you need an object which exists in a memory space shared between all processes. To do this, you can use a multiprocessing.Value object or a multiprocessing.Array, both of which store ctypes objects. Both of these objects exist in Python 3.7.
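Here is a minimal sketch of that shared-memory idea, mirroring the shape of the question's code and assuming the data fits in a fixed-size array of integers (the 'i' typecode and size 3 are just for illustration):

import multiprocessing as mp
import random
import time


def refresh_helper(shared, lock):
    # Child process: periodically overwrite the shared array in place
    while True:
        with lock:
            for i in range(len(shared)):
                shared[i] = random.randint(1, 9)
        time.sleep(10)


if __name__ == '__main__':
    lock = mp.Lock()
    shared = mp.Array('i', 3)  # lives in shared memory, visible to both processes
    proc = mp.Process(target=refresh_helper, args=(shared, lock), daemon=True)
    proc.start()

    time.sleep(11)
    with lock:
        print(shared[:])  # reflects the child's update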
If this approach does not work, consider examining these similar threads:
Sharing a complex object between processes?
multiprocessing: sharing a large read-only object between processes?
Good luck with your project!
I am stuck finding a solution to the multiprocessing issue below.
I have a class Record in the record.py module. The responsibility of the Record class is to process the input data and save it into a JSON file.
The Record class has a put() method to update the JSON file.
The Record class is initialized in a class decorator. The decorator is applied to most of the classes in various sub-modules.
The decorator extracts information about each method it decorates and sends that data to the put() method of the Record class.
The put() method of the Record class then updates the JSON file.
The problem is that when different processes run, each process creates its own instance of the Record object, and the JSON data gets corrupted because multiple processes try to update the same JSON file.
Also, each process may have threads running that try to access and update the same JSON file.
Please let me know how I can resolve this problem.
class Record():
    def put(self, data):
        # read json file
        # update json file with new data
        # close json file


def decorate_method(theMethod):
    # Extract method details
    data = extract_method_details(theMethod)
    # Initialize record object
    rec = Record()
    rec.put(data)


class ClassDeco(cls):
    # This class decorator decorates all methods of the target class
    for method in cls():  # <---- This is just a pseudo codebase
        decorate_method()


@ClassDeco
class Test():
    def __init__(self):
        pass

    def run(self, a):
        # some function calls


if __name__ == "__main__":
    t = Test()
    p = Pool(processes=len(process_pool))
    p.apply_async(t.run, args=(10,))
    p.apply_async(t.run, args=(20,))
    p.close()
You should lock the file prior to reading and writing it. Check another question related to file locking in Python: Locking a file in Python
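A minimal sketch of the file-locking route on Linux/macOS, using fcntl.flock with a separate lock file; the put helper and file names are illustrative, not the asker's actual Record.put:

import fcntl
import json


def put(data, path="record.json"):
    # The advisory lock makes the read-modify-write atomic with respect
    # to other cooperating processes that take the same lock.
    with open(path + ".lock", "w") as lockfile:
        fcntl.flock(lockfile, fcntl.LOCK_EX)
        try:
            try:
                with open(path) as f:
                    contents = json.load(f)
            except (IOError, ValueError):
                contents = {}
            contents.update(data)
            with open(path, "w") as f:
                json.dump(contents, f)
        finally:
            fcntl.flock(lockfile, fcntl.LOCK_UN)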
Have you ever heard about the critical section concept in multiprocessing/multithreading programming?
If so, think about using multiprocessing locks to allow only one process at a time to write to the file.
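A minimal sketch of that idea, assuming a multiprocessing.Pool as in the question; the shared lock is handed to every worker through the pool initializer (the init_worker and put names are illustrative, not the asker's actual Record class):

import json
import multiprocessing as mp


def init_worker(lock):
    # Illustrative initializer: store the shared lock as a global in each worker
    global record_lock
    record_lock = lock


def put(data, path="record.json"):
    # The read-modify-write of the shared JSON file is the critical section
    with record_lock:
        try:
            with open(path) as f:
                contents = json.load(f)
        except (IOError, ValueError):
            contents = {}
        contents.update(data)
        with open(path, "w") as f:
            json.dump(contents, f)


if __name__ == "__main__":
    lock = mp.Lock()
    pool = mp.Pool(processes=2, initializer=init_worker, initargs=(lock,))
    pool.apply_async(put, ({"method_a": "called"},))
    pool.apply_async(put, ({"method_b": "called"},))
    pool.close()
    pool.join()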
I am working on a project where I have a number of custom classes to interface with a varied collection of data on a user's system. These classes only have properties as user-facing attributes. Some of these properties are decently resource intensive, so I want to only run the generation code once, and store the returned value on disk (cache it, that is) for faster retrieval on subsequent runs. As it stands, this is how I am accomplishing this:
import functools


def stored_property(func):
    """This ``decorator`` adds on-disk functionality to the `property`
    decorator. This decorator is also a Method Decorator.

    Each key property of a class is stored in a settings JSON file with
    a dictionary of property names and values (e.g. :class:`MyClass`
    stores its properties in `my_class.json`).
    """
    @property
    @functools.wraps(func)
    def func_wrapper(self):
        print('running decorator...')
        try:
            var = self.properties[func.__name__]
            if var:
                # property already written to disk
                return var
            else:
                # property written to disk as `null`
                return func(self)
        except AttributeError:
            # `self.properties` does not yet exist
            return func(self)
        except KeyError:
            # `self.properties` exists, but property is not a key
            return func(self)
    return func_wrapper


class MyClass(object):
    def __init__(self, wf):
        self.wf = wf
        self.properties = self._properties()

    def _properties(self):
        # get name of class in underscore format
        class_name = convert(self.__class__.__name__)
        # this is a library used (in Alfred workflows) for interacting with data stored on disk
        properties = self.wf.stored_data(class_name)
        # if no file on disk, or one of the properties has a null value
        if properties is None or None in properties.values():
            # get names of all properties of this class
            propnames = [k for (k, v) in self.__class__.__dict__.items()
                         if isinstance(v, property)]
            properties = dict()
            for prop in propnames:
                # generate dictionary of property names and values
                properties[prop] = getattr(self, prop)
            # use the external library to save that dictionary to disk in JSON format
            self.wf.store_data(class_name, properties,
                               serializer='json')
        # return either the data read from file, or data generated in situ
        return properties

    # this decorator ensures that the generating code is only run if necessary
    @stored_property
    def only_property(self):
        # some code to get data
        return 'this is my property'
This code works precisely as I need it, but it still forces me to manually add the _properties(self) method to each class wherein I need this functionality (currently, I have 3). What I want is a way to "insert" this functionality into any class I please. I think that a Class Decorator could get this job done, but try as I might, I can't quite figure out how to wrangle it.
For the sake of clarity (and in case a decorator is not the best way to get what I want), I will try to explain the overall functionality I am after. I want to write a class that contains some properties. The values of these properties are generated via various degrees of complex code (in one instance, I'm searching for a certain app's pref file, then searching for 3 different preferences (any of which may or may not exist) and determining the best single result from those preferences). I want the body of the properties' code only to contain the algorithm for finding the data. But I don't want to run that algorithmic code each time I access that property. Once I generate the value once, I want to write it to disk and then simply read that on all subsequent calls. However, I don't want each value written to its own file; I want a dictionary of all the values of all the properties of a single class to be written to one file (so, in the example above, my_class.json would contain a JSON dictionary with one key, value pair).
When accessing the property directly, it should first check to see if it already exists in the dictionary on disk. If it does, simply read and return that value. If it exists, but has a null value, then try to run the generation code (i.e. the code actually written in the property method) and see if you can find it now (if not, the method will return None and that will once again be written to file). If the dictionary exists and that property is not a key (my current code doesn't really make this possible, but better safe than sorry), run the generation code and add the key, value pair. If the dictionary doesn't exist (i.e. on the first instantiation of the class), run all generation code for all properties and create the JSON file. Ideally, the code would be able to update one property in the JSON file without rerunning all of the generation code (i.e. running _properties() again).
I know this is a bit peculiar, but I need the speed, human-readable content, and elegant code all together. I would really prefer not to have to compromise on my goal. Hopefully, the description of what I want is clear enough. If not, let me know in a comment what doesn't make sense and I will try to clarify. But I do think that a Class Decorator could probably get me there (essentially by inserting the _properties() method into any class, running it on instantiation, and mapping its value to the properties attribute of the class).
Maybe I'm missing something, but it doesn't seem that your _properties method is specific to the properties that a given class has. I'd put that in a base class and have each of your classes with @stored_property methods subclass that. Then you don't need to duplicate the _properties method.
class PropertyBase(object):
    def __init__(self, wf):
        self.wf = wf
        self.properties = self._properties()

    def _properties(self):
        # As before...


class MyClass(PropertyBase):
    @stored_property
    def expensive_to_calculate(self):
        # Calculate it here
If for some reason you can't subclass PropertyBase directly (maybe you already need to have a different base class), you can probably use a mixin. Failing that, make _properties accept an instance/class and a workflow object and call it explicitly in __init__ for each class.
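A rough sketch of the mixin route, under the assumption that the real code already has some other base class; PropertyMixin, SomeOtherBase and init_properties are illustrative names, and stored_property is the decorator from the question:

class SomeOtherBase(object):
    # stand-in for an existing base class the real code might already require
    pass


class PropertyMixin(object):
    def init_properties(self, wf):
        # call this from __init__ of any class that needs cached properties
        self.wf = wf
        self.properties = self._properties()

    def _properties(self):
        # same body as PropertyBase._properties above
        pass


class MyClass(SomeOtherBase, PropertyMixin):
    def __init__(self, wf):
        super(MyClass, self).__init__()
        self.init_properties(wf)

    @stored_property
    def expensive_to_calculate(self):
        return 'expensive value'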
Is there a reasonably natural way of converting a Python function into a standalone script? Something like:
def f():
    # some long and involved computation


script = function_to_script(f)  # now script is some sort of closure,
                                # which can be run in a separate process
                                # or even shipped over the network to a
                                # different host
and NOT like:
script = open("script.py", "wt")
script.write("#!/usr/bin/env python")
...
You can turn any "object" into a function by defining the __call__ method on it (see here). Hence, if you want to compartmentalize some state with the computation, then as long as everything the class holds, from top to bottom, can be pickled, the instance itself can be pickled.
class MyPickledFunction(object):
    def __init__(self, *state):
        self.__state = state

    def __call__(self, *args, **kwargs):
        # stuff in here
That's the easy cheater way. Why pickling? Anything that can be pickled can be sent to another process without fear. You're forming a "poor man's closure" by using an object like this.
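For illustration, here is a minimal sketch of that idea in use; the Adder and run_pickled names are made up, and the instance is pickled and invoked in a worker process via multiprocessing:

import pickle
import multiprocessing as mp


class Adder(object):
    # illustrative callable: state captured in __init__, work done in __call__
    def __init__(self, offset):
        self.offset = offset

    def __call__(self, x):
        return x + self.offset


def run_pickled(blob, arg):
    # unpickle the callable on the other side and invoke it
    func = pickle.loads(blob)
    return func(arg)


if __name__ == "__main__":
    payload = pickle.dumps(Adder(10))  # bytes that could also be sent over a socket
    pool = mp.Pool(1)
    print(pool.apply(run_pickled, (payload, 5)))  # prints 15
    pool.close()
    pool.join()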
(There's a nice post about the "marshal" library here on SO if you want to truly pickle a function.)
I am working in Python with an Email() class that I would like to extend into a SerializeEmail() class, which simply adds two further methods, .write_email() and .read_email(). I would like this sort of behaviour:
# define email
my_email = SerializeEmail()
my_email.recipients = 'link@hyrule.com'
my_email.subject = 'RE: Master sword'
my_email.body = "Master using it and you can have this."
# write email to file system for hand inspection
my_email.write_email('my_email.txt')
...
# Another script reads in email
my_verified_email = SerializeEmail()
my_verified_email.read_email('my_email.txt')
my_verified_email.send()
I have navigated the json encode/decode process, and I can successfully write my SerializeEmail() object, and read it in, however, I can't find a satisfactory way to recreate my object via a SerializeEmail.read_email() call.
class SerializeEmail(Email):

    def write_email(self, file_name):
        with open(file_name, "w") as f:
            json.dump(self, f, cls=SerializeEmailJSONEncoder, sort_keys=True, indent=4)

    def read_email(self, file_name):
        with open(file_name, "r") as f:
            json.load(f, cls=SerializeEmailJSONDecoder)
The problem here is that the json.load() call in my read_email() method returns an instance of my SerializeEmail object, but doesn't assign that object to the current instance that I'm using to call it. So right now I'd have to do something like this,
another_email = my_verified_email.read_email('my_email.txt')
when what I want is for the call to my_verified_email.read_email() to populate the current instance of my_verified_email with the data in the file. I've tried
self = json.load(f,cls=SerializeEmailJSONDecoder)
but that doesn't work. I could just assign each individual element of my returned object to my "self" object, but that seems ad-hoc and inelegant, and I'm looking for the "right way" to do this, if it exists. Any suggestions? If you think that my whole approach is flawed and recommend a different way of accomplishing this task, please sketch it out for me.
While you could jump through a number of hoops to load serialized content into an existing instance, I wouldn't recommend doing so. It's an unnecessary complication which really gains you nothing; it means that the extra step of creating a dummy instance is required every time you want to load an e-mail from JSON. I'd recommend using either a factory class or a factory method which loads the e-mail from the serialized JSON and returns it as a new instance. My personal preference would be a factory method, which you'd accomplish as follows:
class SerializeEmail(Email):

    def write_email(self, file_name):
        with open(file_name, "w") as f:
            json.dump(self, f, cls=SerializeEmailJSONEncoder, sort_keys=True, indent=4)

    @staticmethod
    def read_email(file_name):
        with open(file_name, "r") as f:
            return json.load(f, cls=SerializeEmailJSONDecoder)


# You can now create a new instance by simply doing the following:
new_email = SerializeEmail.read_email('my_email.txt')
Note the @staticmethod decorator, which allows you to call the method on the class without any implicit first argument being passed in. Normally factory methods would be @classmethods, but since you're loading the object from JSON, the implicit class argument is unnecessary.
Notice how, with this modification, you don't need to instantiate a SerializeEmail object before you can load another one from JSON. You simply call the method directly on the class and get the desired behavior.