Twisted inlineCallbacks and remote generators - python

I have used defer.inlineCallbacks in my code as I find it much easier to read and debug than using addCallbacks.
I am using PB and I have hit a problem when returning data to the client. The data is about 18 MB in size and I get a BananaError failure because of the length of the string being returned.
What I want to do is to write a generator so I can just keep calling the function and return some of the data each time the function is called.
How would I write that with inlineCallbacks already being used? Is it actually possible? If I yield values instead of returning one, would something like the following work?
@defer.inlineCallbacks
def getLatestVersions(self):
    return_list = []
    try:
        latest_versions = yield self.cur.runQuery(
            """SELECT id, filename, path, attributes, MAX(version), deleted, snapshot,
               modified, size, hash, chunk_table, added, isDir, isSymlink, enchash
               FROM files GROUP BY filename, path""")
    except:
        logger.exception("problem querying latest versions")
    for result in latest_versions:
        return_list.append(result)
        if len(return_list) >= 10:
            yield return_list
            return_list = []
    yield return_list

A generator function decorated with inlineCallbacks returns a Deferred - not a generator. This is always the case. You can never return a generator from a function decorated with inlineCallbacks.
See the pager classes in twisted.spread.util for ideas about another approach you can take.
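For example, a rough sketch of the pager approach might look like the following. This assumes the usual twisted.spread.util.StringPager(collector, string, chunkSize=...) and util.getAllPages(referenceable, methodName) signatures; the method name, chunk size, and the pickling step are illustrative, not taken from the question.
from twisted.spread import pb, util
import pickle

class VersionRoot(pb.Root):
    def remote_getLatestVersions(self, collector):
        # Run the query, serialize the rows, and stream them back in
        # chunk-sized pages instead of one ~18 MB string.
        d = self.cur.runQuery("SELECT ... FROM files GROUP BY filename, path")
        d.addCallback(lambda rows: util.StringPager(collector, pickle.dumps(rows),
                                                    chunkSize=65536))
        # Nothing large is returned to the caller; the pager pushes the pages.

# Client side: getAllPages drives the pager and fires with the list of pages.
def got_pages(pages):
    rows = pickle.loads(b"".join(pages))

d = util.getAllPages(remote_ref, "getLatestVersions")
d.addCallback(got_pages)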

Related

Is there a Python standard library function to create a generator from repeatedly calling a function?

I have a method I want to call repeatedly to iterate over, and which raises StopIteration when it's done (in this case an instance of pyarrow.csv.CSVStreamingReader looping over a large file). I can use it in a for loop like this:
def batch_generator():
    while True:
        try:
            yield reader.read_next_batch()
        except StopIteration:
            return

for batch in batch_generator():
    writer.write_table(batch)
It can be done in a generic way with a user-defined function:
def make_generator(f):
    def gen():
        while True:
            try:
                yield f()
            except StopIteration:
                return
    return gen()

for batch in make_generator(reader.read_next_batch):
    writer.write_table(batch)
...but I wondered if something like this was possible with standard library functions or with some obscure syntax?
I would assume that the normal iter() function with its second argument should do what you want. As in:
for batch in iter(reader.read_next_batch, None):
    ...
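For a quick, self-contained illustration of the two-argument form, here is an example that uses io.StringIO instead of pyarrow (purely so it runs anywhere): iter(callable, sentinel) keeps calling the callable until it returns the sentinel value.
import io

buf = io.StringIO("a\nb\nc\n")
for line in iter(buf.readline, ""):   # readline returns "" at EOF, which is the sentinel
    print(line.strip())               # prints a, b, c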
The answer to your underlying question of how to iterate a CSVStreamingReader is: The CSVStreamingReader is iterable and does just the thing you want:
reader = pyarrow.csv.open_csv(...)
for batch in reader:
    ...
In general it is really rare for python libraries to return "iterable" things that are not python-iterable. That is always a sensible first thing to try.

Python: How to return data more than once during a function call

Is there any way that a function can be called once and then return data multiple times, at distinct times?
For example, suppose I had the following code:
def do_something():
    for i in range(1, 10):
        return 1
However, I want to be able to return more than one piece of data from a single function call, but at asynchronous times. Is this possible?
For context, I have a program that generates word documents, converts them into pdfs and then combines them into a single pdf document. I want to be able to call an external function from the GUI to create the documents, and then display a progress bar that displays the current progress through the function.
Edit:
I am already aware of the yield keyword. I thought my specific problem at the bottom of the question would help. To be clearer, what I am looking for is a way to return multiple values from a function and cause a different event for each value returned. Although it may be a poor example, what I want is to be able to do something similar to a .then(){} in Javascript, but to perform the .then(){} for each of multiple returned values.
yield is, as almost everyone has mentioned, the way to return or produce multiple values from a function.
Having read your problem statement, here is the solution I would devise for you.
Create a function that updates the status bar, with the current progress held in a global variable. Start with a global x = 0; each call to the update function first increments x and then advances the status bar accordingly.
def do_something():
    for i in range(1, 10):
        # fetch and perform the operation on that Doc for PDF
        update_status_bar()
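As a rough sketch (the counter name x and the print stand-in for the real progress widget are only illustrative), the update function described above could look like this:
x = 0   # global progress counter

def update_status_bar():
    global x
    x = x + 1                      # first advance the counter
    print("progress: %d/9" % x)    # then refresh the progress bar (stand-in shown here)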
You want a generator:
def do_something():
    for i in range(1, 10):
        yield i

nums = do_something()
Each time you call next on nums, the body of do_something will continue executing up to the next yield statement, at which point it returns that value.
>>> print(next(nums)) # Outputs 1
>>> print(next(nums)) # Outputs 2
>>> ...
You are looking for generators.
Instead of returning you yield (read What does the "yield" keyword do in Python?) from your function.
def do_something():
    for i in range(1, 10):
        yield i
If you want this function to be called repeatedly you will need to have a wrapper that calls this function repeatedly. Somewhat similar to:
from threading import Thread
from time import sleep

def worker():
    for i in do_something():
        UpdateProgress(i)
        sleep(progressInterval)

thread = Thread(target=worker)
thread.start()

Python Generator/Iterator

I am working on improving my Python and getting up to speed on generators. I have an object that processes a series of events, and I want the list of events to be pulled sequentially and through various methods. I want to use generators for this purpose (I know I can write something else to do this without them).
Here is the sample code that I've been working on:
def _get_next_event():
    def gen():
        for i, event in enumerate(range(1, 10)):
            yield event
    iterator = gen()
    def run():
        return iterator
    run.next = iterator.__next__
    return run

t = _get_next_event()
t.next()
for x in t():
    if x < 5:
        print(x)
    else:
        break
t.next()
This lets me do a for loop over the events as well as pull the next one individually via the function's next method.
I am implementing this in my class, it looks like this:
def _get_next_event(self):
    def gen():
        print(self.all_sessions)
        for event in self.all_sessions:
            yield event['event']
    iterator = gen()
    def run():
        return iterator
    run.next = iterator.__next__
    return run
However, before it works in the class I have to run it, for example before the for loop I have one of these:
self._get_next_event = self._get_next_event()
I think there should be a more elegant way of doing this... what am I missing?
Usually, generators are... not written like that.
Ordinarily, you'd just use yield in the top-level function:
def _get_next_event():
    for i, event in enumerate(range(1, 10)):
        yield event
You can then just write this:
for event in _get_next_event():
    # do something with event
Perhaps you had some reason for doing it the way you've shown, but that reason is not evident from the code you've posted.
(for the record, I'm assuming your generator does not literally look like that, or else I'd tell you to change the whole function body to return range(1, 10))

__iter__() implemented as a generator

I have an object subclass which implements a dynamic-dispatch __iter__ using a caching generator (I also have a method for invalidating the iter cache) like so:
def __iter__(self):
    print("iter called")
    if self.__iter_cache is None:
        iter_seen = {}
        iter_cache = []
        for name in self.__slots:
            value = self.__slots[name]
            iter_seen[name] = True
            item = (name, value)
            iter_cache.append(item)
            yield item
        for d in self.__dc_list:
            for name, value in iter(d):
                if name not in iter_seen:
                    iter_seen[name] = True
                    item = (name, value)
                    iter_cache.append(item)
                    yield item
        self.__iter_cache = iter_cache
    else:
        print("iter cache hit")
        for item in self.__iter_cache:
            yield item
It seems to be working... Are there any gotchas I may not be aware of? Am I doing something ridiculous?
container.__iter__() returns an iterator object. The iterator objects themselves are required to support the two following methods, which together form the iterator protocol:
iterator.__iter__()
Returns the iterator object itself.
iterator.next()
Return the next item from the container.
That's exactly what every generator has, so don't be afraid of any side-effects.
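As a quick check (my own illustration, not from the original answer), a generator object does satisfy both methods:
def g():
    yield 1

it = g()
assert iter(it) is it     # __iter__() returns the iterator object itself
assert next(it) == 1      # __next__() returns the next item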
It seems like a very fragile approach. It is enough to change any of __slots, __dc_list, __iter_cache during active iteration to put the object into an inconsistent state.
You need either to forbid changing the object during iteration or generate all cache items at once and return a copy of the list.
It might be better to separate the iteration of the object from the caching of the values it returns. That would simplify the iteration process and allow you to easily control how the caching is accomplished as well as whether it is enabled or not, for example.
Another possibly important consideration is the fact that your code would not predictively handle the situation where the object being iterated over gets changed between successive calls to the method. One simple way to deal with that would be to populate the cache's contents completely on the first call, and then just yield what it contains for each call -- and document the behavior.
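Here is a sketch of that "fill the cache completely on the first call" variant, reusing the attribute names from the question; returning a copy of the cached list is my own assumption about how to keep callers isolated from later invalidation:
def __iter__(self):
    if self.__iter_cache is None:
        seen = {}
        cache = []
        for name in self.__slots:
            seen[name] = True
            cache.append((name, self.__slots[name]))
        for d in self.__dc_list:
            for name, value in iter(d):
                if name not in seen:
                    seen[name] = True
                    cache.append((name, value))
        self.__iter_cache = cache            # the cache is complete before anything is handed out
    return iter(list(self.__iter_cache))     # iterate over a copy of the cached items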
What you're doing is valid albeit weird. What is a __slots or a __dc_list ?? Generally it's better to describe the contents of your object in an attribute name, rather than its type (eg: self.users rather than self.u_list).
You can use my LazyProperty decorator to simplify this substantially.
Just decorate your method with @LazyProperty. It will be called the first time, and the decorator will then replace the attribute with the results. The only requirement is that the value is repeatable; it doesn't depend on mutable state. You also have that requirement in your current code, with your self.__iter_cache.
def __iter__(self):
    return iter(self.__iter)   # self.__iter is the cached tuple; wrap it in an iterator

@LazyProperty
def __iter(self):
    def my_generator():
        yield whatever
    return tuple(my_generator())
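The answer refers to the author's own LazyProperty decorator without showing it; a minimal sketch of that kind of decorator (not the author's actual code) is a non-data descriptor that computes the value once and then caches it on the instance:
class LazyProperty:
    def __init__(self, func):
        self.func = func

    def __set_name__(self, owner, name):
        self.attrname = name                 # attribute name as bound in the class

    def __get__(self, obj, owner=None):
        if obj is None:
            return self
        value = self.func(obj)               # computed on first access only
        obj.__dict__[self.attrname] = value  # cached value now shadows the descriptor
        return value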

Hashing a python function to regenerate output when the function is modified

I have a python function that has a deterministic result. It takes a long time to run and generates a large output:
def time_consuming_function():
    # lots_of_computing_time to come up with the_result
    return the_result
I modify time_consuming_function from time to time, but I would like to avoid having it run again while it's unchanged. [time_consuming_function only depends on functions that are immutable for the purposes considered here; i.e. it might have functions from Python libraries but not from other pieces of my code that I'd change.] The solution that suggests itself to me is to cache the output and also cache some "hash" of the function. If the hash changes, the function will have been modified, and we have to re-generate the output.
Is this possible or ridiculous?
Updated: based on the answers, it looks like what I want to do is to "memoize" time_consuming_function, except instead of (or in addition to) arguments passed into an invariant function, I want to account for a function that itself will change.
If I understand your problem, I think I'd tackle it like this. It's a touch evil, but I think it's more reliable and on-point than the other solutions I see here.
import inspect
import functools
import json

def memoize_zeroadic_function_to_disk(memo_filename):
    def decorator(f):
        try:
            with open(memo_filename, 'r') as fp:
                cache = json.load(fp)
        except IOError:
            # file doesn't exist yet
            cache = {}

        source = inspect.getsource(f)

        @functools.wraps(f)
        def wrapper():
            if source not in cache:
                cache[source] = f()
                with open(memo_filename, 'w') as fp:
                    json.dump(cache, fp)
            return cache[source]
        return wrapper
    return decorator

@memoize_zeroadic_function_to_disk(...SOME PATH HERE...)
def time_consuming_function():
    # lots_of_computing_time to come up with the_result
    return the_result
Rather than putting the function in a string, I would put the function in its own file. Call it time_consuming.py, for example. It would look something like this:
def time_consuming_method():
    # your existing method here

# Is the cached data older than this file?
if (not os.path.exists(data_file_name)
        or os.stat(data_file_name).st_mtime < os.stat(__file__).st_mtime):
    data = time_consuming_method()
    save_data(data_file_name, data)
else:
    data = load_data(data_file_name)

# redefine method
def time_consuming_method():
    return data
While testing the infrastructure for this to work, I'd comment out the slow parts. Make a simple function that just returns 0, get all of the save/load stuff working to your satisfaction, then put the slow bits back in.
The first part is memoization and serialization of your lookup table. That should be straightforward enough based on some Python serialization library. The second part is that you want to delete your serialized lookup table when the source code changes. Perhaps this is being overthought into some fancy solution. Presumably when you change the code you check it in somewhere? Why not add a hook to your checkin routine that deletes your serialized table? Or if this is not research data and is in production, make it part of your release process: if the revision number of your file (put this function in its own file) has changed, your release script deletes the serialized lookup table.
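One concrete way to realise the "invalidate when the source changes" idea without a check-in hook is to key the cache on a hash of the function's source. This is my own sketch, not part of the original answer, and the cache file name is illustrative:
import hashlib
import inspect
import os
import pickle

def run_with_source_cache(func, cache_file="result.cache"):
    digest = hashlib.sha256(inspect.getsource(func).encode()).hexdigest()
    if os.path.exists(cache_file):
        with open(cache_file, "rb") as fp:
            saved_digest, result = pickle.load(fp)
        if saved_digest == digest:       # source unchanged: reuse the saved result
            return result
    result = func()                      # source changed (or no cache yet): recompute
    with open(cache_file, "wb") as fp:
        pickle.dump((digest, result), fp)
    return result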
So, here is a really neat trick using decorators:
def memoize(f):
    cache = {}
    def result(*args):
        if args not in cache:
            cache[args] = f(*args)
        return cache[args]
    return result
With the above, you can then use:
@memoize
def myfunc(x, y, z):
    # Some really long running computation
When you invoke myfunc, you will actually be invoking the memoized version of it. Pretty neat, huh? Whenever you want to redefine your function, simply use "@memoize" again, or explicitly write:
myfunc = memoize(new_definition_for_myfunc)
Edit
I didn't realize that you wanted to cache between multiple runs. In that case, you can do the following:
import os
import os.path
import cPickle

class MemoizedFunction(object):
    def __init__(self, f):
        self.function = f
        self.filename = str(hash(f)) + ".cache"
        self.cache = {}
        if os.path.exists(self.filename):
            with open(self.filename, 'rb') as file:
                self.cache = cPickle.load(file)

    def __call__(self, *args):
        if args not in self.cache:
            self.cache[args] = self.function(*args)
        return self.cache[args]

    def __del__(self):
        with open(self.filename, 'wb') as file:
            cPickle.dump(self.cache, file, cPickle.HIGHEST_PROTOCOL)

def memoize(f):
    return MemoizedFunction(f)
What you describe is effectively memoization. Most common functions can be memoized by defining a decorator.
An (overly simplified) example:
def memoized(f):
    cache = {}
    def memo(*args):
        if args in cache:
            return cache[args]
        else:
            ret = f(*args)
            cache[args] = ret
            return ret
    return memo

@memoized
def time_consuming_method():
    # lots_of_computing_time to come up with the_result
    return the_result
Edit:
From Mike Graham's comment and the OP's update, it is now clear that values need to be cached over different runs of the program. This can be done by using some form of persistent storage for the cache (e.g. something as simple as Pickle or a plain text file, a full-blown database, or anything in between). The choice of method depends on what the OP needs. Several other answers already give some solutions, so I'm not going to repeat that here.
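For completeness, a small sketch of one such persistent cache using the standard shelve module (the decorator name, file name, and key scheme are illustrative, not from the answers above):
import functools
import shelve

def disk_memoized(filename):
    def decorator(f):
        @functools.wraps(f)
        def wrapper(*args):
            key = repr(args)                 # shelve keys must be strings
            with shelve.open(filename) as db:
                if key not in db:
                    db[key] = f(*args)       # compute and persist on the first call
                return db[key]
        return wrapper
    return decorator

@disk_memoized("results")
def time_consuming_method():
    # lots_of_computing_time to come up with the_result
    return the_result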
