Exporting duplicated code from inside a yielding generator function - python

Observe the following method:
def _locate(self, text):
    """
    This method accesses preceding locators if these exist, it then calls an overridable helper method called _relocate
    which receives text with readjusted boundaries and searches inside, the basic implemented behaviour is that of a logical or
    """
    if not self.precedents:
        for sub_segment in self._relocate(text, Segment(0, len(text), 1)):
            if self._multiple:
                yield sub_segment
            elif self.max_segment.prob > self._prob_threshold:
                yield self.max_segment
                return
    else:
        for precedent in self.precedents:
            for segment in precedent.locate(text):
                for sub_segment in self._relocate(text, segment):
                    if self._multiple:
                        yield sub_segment
                    elif self.max_segment.prob > self._prob_threshold:
                        yield self.max_segment
                        return
    # if we haven't found a good enough segment return the best one we came across while locating
    if not self._multiple:
        yield self.max_segment
It contains a block of code which appears twice:
for sub_segment in self._relocate(text, segment):
    if self._multiple:
        yield sub_segment
    elif self.max_segment.prob > self._prob_threshold:
        yield self.max_segment
        return
I naively thought I could define a single helper method and have the code appear just once, so I started to implement it. However, this proved next to impossible (because the code uses both yields and returns) and caused me much more pain, in terms of code length and run-time, than it was worth.
I'm not sure exactly what I'm asking (perhaps whether anyone has an idea of a general approach to sharing generator code that yields, or else sees how it can be done here?), but in any case, as topics about generators go, I found this experience quite telling and interesting, so I thought I'd share.

I think you can remove the code duplication by defining a generator of segments outside the loop
def _locate(self, text):
    """
    This method accesses preceding locators if these exist, it then calls an overridable helper method called _relocate
    which receives text with readjusted boundaries and searches inside, the basic implemented behaviour is that of a logical or
    """
    if self.precedents:
        segments = (seg for precedent in self.precedents for seg in precedent.locate(text))
    else:
        segments = (Segment(0, len(text), 1),)
    for segment in segments:
        for sub_segment in self._relocate(text, segment):
            if self._multiple:
                yield sub_segment
            elif self.max_segment.prob > self._prob_threshold:
                yield self.max_segment
                return
    # if we haven't found a good enough segment return the best one we came across while trying
    if not self._multiple:
        yield self.max_segment
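Another option, closer to what the question originally attempted, is to factor the duplicated yield logic into a helper generator and let it report back whether the caller should stop, via the return value that yield from captures (Python 3.3+). This is only a sketch under the same assumed attributes; the helper name _emit is made up:

def _emit(self, text, segment):
    # Yields what the duplicated block used to yield; returns True when
    # the enclosing _locate generator should stop entirely.
    for sub_segment in self._relocate(text, segment):
        if self._multiple:
            yield sub_segment
        elif self.max_segment.prob > self._prob_threshold:
            yield self.max_segment
            return True
    return False

def _locate(self, text):
    if self.precedents:
        segments = (seg for precedent in self.precedents for seg in precedent.locate(text))
    else:
        segments = (Segment(0, len(text), 1),)
    for segment in segments:
        # "yield from" forwards the yielded values and hands back _emit's return value
        if (yield from self._emit(text, segment)):
            return
    # fall back to the best segment seen so far
    if not self._multiple:
        yield self.max_segment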

Related

Defining `fac` with generators. And: Why no stack overflow with generators?

Is there a way we can define the following code (a classic example for recursion) via generators in Python? I am using Python 3.
def fac(n):
    if n == 0:
        return 1
    else:
        return n * fac(n-1)
I tried this, no success:
In [1]: def fib(n):
   ...:     if n == 0:
   ...:         yield 1
   ...:     else:
   ...:         n * yield (n-1)
  File "<ipython-input-1-bb0068f2d061>", line 5
    n * yield (n-1)
        ^
SyntaxError: invalid syntax
Classic recursion in Python leads to Stack Overflow
This classic example leads to a stack overflow on my machine for an input of n=3000. In the Lisp dialect "Scheme" I'd use tail recursion and avoid stack overflow. Not possible in Python. That's why generators come in handy in Python. But I wonder:
Why no stack overflow with generators?
Why is there no stack overflow with generators in Python? How do they work internally? Doing some research leads me always to examples showing how generators are used in Python, but not much about the inner workings.
Update 1: yield from my_function(...)
As I tried to explain in the comments section, maybe my example above was a poor choice for making a point. My actual question was targeted at the inner workings of generators used recursively in yield from statements in Python 3.
Below is an (incomplete) example of code that I use to process JSON files generated by Firefox bookmark backups. At several points I use yield from process_json(...) to recursively call the function again via generators.
Exactly in this example, how is stack overflow avoided? Or is it?
# (snip)
FOLDERS_AND_BOOKMARKS = {}
FOLDERS_DATES = {}

def process_json(json_input, folder_path=""):
    global FOLDERS_AND_BOOKMARKS
    # Process the json with a generator
    # (to avoid recursion use generators)
    # https://stackoverflow.com/a/39016088/5115219

    # Is node a dict?
    if isinstance(json_input, dict):
        # we have a dict
        guid = json_input['guid']
        title = json_input['title']
        idx = json_input['index']
        date_added = to_datetime_applescript(json_input['dateAdded'])
        last_modified = to_datetime_applescript(json_input['lastModified'])

        # do we have a container or a bookmark?
        #
        # is there a "uri" in the dict?
        # if not, we have a container
        if "uri" in json_input.keys():
            uri = json_input['uri']
            # return URL with folder or container (= prev_title)
            # bookmark = [guid, title, idx, uri, date_added, last_modified]
            bookmark = {'title': title,
                        'uri': uri,
                        'date_added': date_added,
                        'last_modified': last_modified}
            FOLDERS_AND_BOOKMARKS[folder_path].append(bookmark)
            yield bookmark
        elif "children" in json_input.keys():
            # So we have a container (aka folder).
            #
            # Create a new folder
            if title != "":  # we are not at the root
                folder_path = f"{folder_path}/{title}"
                if folder_path in FOLDERS_AND_BOOKMARKS:
                    pass
                else:
                    FOLDERS_AND_BOOKMARKS[folder_path] = []
                    FOLDERS_DATES[folder_path] = {'date_added': date_added, 'last_modified': last_modified}
            # run process_json on list of children
            # json_input['children'] : list of dicts
            yield from process_json(json_input['children'], folder_path)
    # Or is node a list of dicts?
    elif isinstance(json_input, list):
        # Process children of container.
        dict_list = json_input
        for d in dict_list:
            yield from process_json(d, folder_path)
Update 2: yield vs yield from
Ok, I get it. Thanks to all the comments.
So generators via yield create iterators. That has nothing to do with recursion, so no stack overflow here.
But generators via yield from my_function(...) are indeed recursive calls of my function, albeit delayed, and only evaluated if demanded.
This second example can indeed cause a stack overflow.
OK, after your comments I have completely rewritten my answer.
How does recursion work and why do we get a stack overflow?
Recursion is often an elegant way to solve a problem. In most programming languages, every time you call a function, all the information and state needed for the function is put on the stack, in a so-called "stack frame". The stack is a special per-thread memory region and is limited in size.
Now recursive functions implicitly use these stack frames to store state/intermediate results. E.g., the factorial function is n * (n-1) * ((n-1) -1)... 1 and all these "n-1" are stored on the stack.
An iterative solution has to store these intermediate results explicitly in a variable (that often sits in a single stack frame).
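In CPython the "stack overflow" from the question shows up as a RecursionError once the interpreter's recursion limit is exceeded. A quick demonstration (not part of the original answer):

import sys

def fac(n):
    return 1 if n == 0 else n * fac(n - 1)

print(sys.getrecursionlimit())   # typically 1000 in CPython
try:
    fac(3000)                    # each recursive call adds a stack frame
except RecursionError as err:
    print(err)                   # "maximum recursion depth exceeded"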
How do generators avoid stack overflow?
Simply: They are not recursive. They are implemented like iterator objects. They store the current state of the computation and return a new result every time you request it (implicitly or with next()).
If it looks recursive, that's just syntactic sugar. "Yield" is not like return. It yields the current value and then "pauses" the computation. That's all wrapped up in one object and not in a gazillion stack frames.
This will give you a series from `1` to `n!`:
def fac(n):
    if n <= 0:
        yield 1
    else:
        v = 1
        for i in range(1, n+1):
            v = v * i
            yield v
There is no recursion; the intermediate results are stored in v, which most likely lives in a single object (on the heap, probably).
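For instance, consuming that generator lazily gives one factorial per step (a small usage illustration, not from the original answer):

>>> list(fac(5))      # factorials of 1..5, produced one at a time
[1, 2, 6, 24, 120]
>>> next(fac(3000))   # no deep call stack involved, just a loop inside one generator object
1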
What about yield from?
OK, that's interesting, since that was only added in Python 3.3.
yield from can be used to delegate to another generator.
You gave an example like:
def process_json(json_input, folder_path=""):
# Some code
yield from process_json(json_input['children'], folder_path)
This looks recursive, but instead it's a combination of two generator objects. You have your "inner" generator (which only uses the space of one object) and with yield from you say "I'd like to forward all the values from that generator to my caller".
So it doesn't generate one stack frame per generator result; instead, it creates one object per generator used.
In this example, you are creating one generator object per child JSON-object. That would probably be the same number of stack frames needed if you did it recursively. You won't see a stack overflow though, because objects are allocated on the heap and you have a very different size limit there - depending on your operating system and settings. On my laptop, using Ubuntu Linux, ulimit -s gives me 8 MB for the default stack size, while my process memory size is unlimited (although I have only 8GB of physical memory).
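A minimal illustration of that delegation, with toy generators (not from the question):

def inner():
    yield 1
    yield 2

def outer():
    yield 0
    yield from inner()   # forwards every value produced by inner() to outer()'s caller
    yield 3

print(list(outer()))     # [0, 1, 2, 3] -- two cooperating generator objects, not a pile of call frames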
Look at this documentation page on generators: https://wiki.python.org/moin/Generators
And this QA: Understanding generators in Python
Some nice examples, also for yield from:
https://www.python-course.eu/python3_generators.php
TL;DR: Generators are objects, they don't use recursion. Not even yield from, which just delegates to another generator object. Recursion is only practical when the number of calls is bounded and small, or your compiler supports tail call optimization.

python recursion not working with yield

I made a tree data structure and a function which gives out all its leaves, but the recursive algorithm never seems to work for any of the child nodes. The function gets called once using the root node
def get_files(self, initials):
    for child in self.children:
        name = initials + os.sep + child.name
        if child.children == []:
            yield name
        else:
            child.get_files(name)
full class: https://pastebin.com/4eukaVWx
if child.children == []:
    yield name
else:
    child.get_files(name)
Here you're yielding only in the if. In the other branch, the data is lost. You need to yield the elements returned by child.get_files(name). I'd do:
if not child.children:
    yield name
else:
    yield from child.get_files(name)
yield from is available in "recent" python versions. An alternative for older versions is a loop:
for item in child.get_files(name):
    yield item
(a similar issue happens a lot with functions: Why does my function return None?)
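The underlying gotcha: calling a generator function only creates a generator object; nothing in its body runs until the object is iterated, so the bare call in the else branch silently discards everything. A tiny illustration:

def g():
    print("running")
    yield 1

g()           # creates a generator object and throws it away; "running" never prints
list(g())     # actually iterates it: prints "running" and returns [1]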
Not a solution but an observation:
I guess you are printing something in the pastebin code and trimmed the print statement down just to post a minimal example in the question. It works completely fine without the print statements, but as soon as you put a single print statement in the method, the recursion stops happening.

Workarounds to suspend (serialize) and resume a recursive generator stack?

I have a recursive generator function that creates a tree of ChainMap contexts, and finally does something with the context at the end of the tree. It looks like this (parent_context is a ChainMap, hierarchy is a list):
def recursive_generator(parent_context, hierarchy):
    next_level = hierarchy[0]
    next_level_contexts = get_contexts(next_level)  # returns a list of dicts
    for context in next_level_contexts:
        child_context = parent_context.new_child().update(context)
        if next_level == hierarchy[-1]:
            yield do_something(**child_context)
        else:
            yield from recursive_generator(child_context, hierarchy[1:])
Now I'd like to flag one level of the hierarchy such that the operation suspends after finishing that level, serializes the state to disk to be picked up later where it left off. Is there a way to do this without losing the elegance of the recursion?
I know that you can't pickle generators, so I thought about refactoring into an iterator object. But I think yield from is necessary for the recursion here (edit: at least without some tedious management of the stack), so I think it needs to be a generator, no? Is there a workaround for this?
You seem to be exploring a tree with DFS, so you could construct the tree in memory and make the DFS explicit, then just store the tree and restart at the left-most node (I think?).
That's effectively "tedious management of the stack", but it has a nice picture that would help implement it (at least for me, looking at your problem as DFS of a tree makes the implementation seem fairly obvious - before I thought of it like that, it seemed quite complicated - but I may be missing something).
Sorry if that's obvious and insufficient...
[edit]
class Inner:
    def __init__(self, parent_context, hierarchy):
        self.children = []
        next_level = hierarchy[0]
        next_level_contexts = get_contexts(next_level)
        for context in next_level_contexts:
            child_context = parent_context.new_child().update(context)
            if next_level == hierarchy[-1]:
                self.children.append(Leaf(child_context))
            else:
                self.children.append(Inner(child_context, hierarchy[1:]))

    def do_something(self):
        # this will do something on the left-most leaf
        self.children[0].do_something()

    def prune(self):
        # this will remove the left-most leaf
        if isinstance(self.children[0], Leaf):
            self.children.pop(0)
        else:
            self.children[0].prune()
            if not self.children[0]:
                self.children.pop(0)

    def __bool__(self):
        return bool(self.children)

class Leaf:
    def __init__(self, context):
        self.context = context

    def do_something(self):
        do_something(**self.context)
The code above hasn't been tested. I ended up using classes for nodes, as a tuple seemed too confusing. You create the tree by creating the parent node; then you can "do something" by calling do_something, after which you will want to remove the "done" leaf with prune:
tree = Inner(initial_context, initial_hierarchy)
while tree:
    tree.do_something()
    tree.prune()
I am pretty sure it will contain bugs, but hopefully it's enough to show the idea. Sorry I can't do more, but I need to repot plants....
PS It's amusing that you can write code with generators but didn't know what DFS was. You might enjoy reading "The Algorithm Design Manual" - it's part textbook and part reference, and it doesn't treat you like an idiot (I too have no formal education in computer science, and I thought it was a good book).
[edited to change to left-most first, which is what you had before, I think]
And alko has a good point...
Here's what I ended up doing:
from collections import ChainMap

def recursive_generator(parent_context, hierarchy):
    next_level = hierarchy[0]
    next_level_contexts = get_contexts(next_level)  # returns a list of dicts
    for context in next_level_contexts:
        child_context = parent_context.new_child().update(context)
        if next_level == hierarchy[-1]:
            yield child_context
        else:
            yield from recursive_generator(child_context, hierarchy[1:])

def traverse_tree(hierarchy):
    return list(recursive_generator(ChainMap(), hierarchy))

def do_things(contexts, start, stop):
    for context in contexts[start:stop]:
        yield do_something(**context)
Then I can pickle the list returned by traverse_tree and later load it and run it in pieces with do_things. This is all in a class with a lot more going on of course, but this gets to the gist of it.
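In other words, the picklable state is just a flat list of contexts. Persisting and resuming might then look roughly like this (a sketch with a hypothetical file name, assuming the contexts themselves pickle cleanly):

import pickle

contexts = traverse_tree(hierarchy)
with open("contexts.pkl", "wb") as f:
    pickle.dump(contexts, f)                 # suspend: remaining work is now on disk

# ... later, possibly in a fresh process ...
with open("contexts.pkl", "rb") as f:
    contexts = pickle.load(f)
for result in do_things(contexts, start=0, stop=100):   # resume one slice at a time
    pass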

How to write a PLY interface for hand-written lexer?

I'm writing a compiler in Python, and I made a hand-written lexer, because I can't figure out how to parse indentation in PLY. Also, my lexer uses some yield statements like so:
def scan(self):
    ...
    for i in tokens:
        if i[0]:
            yield Token(self.line, i[0] if i[0] in keywords else "ident", i[0])
        elif i[1]:
            if "e" in i[1]:
                base, exp = i[1].split("e")
                val = float(base) * 10 ** int(exp)
            else:
                val = float(i[1])
            yield Token(self.line, "float", val)
    # ... other cases ...
However, I realized that the PLY parser requires a token method, so I made one that looks like this:
def token(self):
    return next(self.scan())
The actual scanning using scan() takes an average of 124 ms, according to my tests, but when I use the PLY parser, the parsing still hasn't started after a few minutes. It appears that my token() method has a problem.
Also, I tried renaming the scan() method so that it could serve as the interface directly. Python raises something like
AttributeError: 'generator' object has no attribute 'type'
So it appears that PLY needs a method that will return a single token at a time.
Is there any way to rewrite the token() method so that it would return the next iteration of scan() and not be that slow?
As written, token() calls self.scan() on every call, which creates a brand-new generator each time and only ever returns its first token, so the parser never advances. You need to create the generator once and save it somewhere, like:
def start(...):
    self.lexer = self.scan()

def token(...):
    return next(self.lexer)
Disclaimer: I don't know anything about PLY.
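To sketch what "saving the generator" might look like as a small adapter (class and attribute names here are hypothetical; it also assumes the common convention that a PLY-style token() method returns None at end of input):

class LexerAdapter:
    """Wraps the generator-based scanner behind a pull-style token() interface."""

    def __init__(self, scanner):
        self.scanner = scanner
        self.gen = None

    def start(self):
        self.gen = self.scanner.scan()   # create the generator once and keep it

    def token(self):
        try:
            return next(self.gen)        # resume the saved generator: one token per call
        except StopIteration:
            return None                  # signal end of input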

using function attributes to store results for lazy (potential) processing

I'm doing some collision detection and I would very much like to use the same function in two different contexts. In one context, I would like for it to be something like
def detect_collisions(item, others):
    return any(collides(item, other) for other in others)
and in another, I would like it to be
def get_collisions(item, others):
    return [other for other in others if collides(item, other)]
I really hate the idea of writing two functions here. Just keeping their names straight is one turnoff, and complicating the interface to the collision detection is another. So I was thinking:
import itertools as it

def peek(gen):
    try:
        first = next(gen)
    except StopIteration:
        return False
    else:
        return it.chain((first,), gen)

def get_collisions(item, others):
    get_collisions.all = peek(other for other in others if collides(item, other))
    return get_collisions.all
Now when I just want to do a check, I can say:
if get_collisions(item, others):
    # aw snap
or
if not get_collisions(item, others):
    # w00t
and in the other context where I actually want to examine them, I can do:
if get_collisions(item, others):
    for collision in get_collisions.all:
        # fix it
and in both cases, I don't do any more processing than I need to.
I recognize that this is more code than the first two functions but it also has the advantage of:
1. Keeping my interface to the collision detection as a tree with the node at the top level instead of a mid level. This seems simpler.
2. Hooking myself up with a handy peek function. If I use it one other time, then I'm actually writing less code. (In response to YAGNI: if I have it, I will.)
So. If you were the proverbial homicidal maniac that knows where I live, would I be expecting a visit from you if I wrote the above code? If so, how would you approach this situation?
Just make get_collisions return a generator:
def get_collisions(item, others):
    return (other for other in others if collides(item, other))
Then, if you want to do a check:
for collision in get_collisions(item, others):
    print 'Collision!'
    break
else:
    print 'No collisions!'
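And when you actually want to inspect the collisions, the same generator can be materialized or peeked at; a short usage sketch (not part of the original answer):

hits = list(get_collisions(item, others))             # every colliding object, for the "fix it" case
first_hit = next(get_collisions(item, others), None)  # cheap yes/no check, stops at the first hit
if first_hit is not None:
    pass  # aw snap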
This is very similar to what we were discussing in "pythonic way to rewrite an assignment in an if statement", but this version only handles one positional or one keyword argument per call, so one can always retrieve the actual value cached (not just whether it is a true value in the Pythonic boolean sense or not).
I never really cared for your proposal to accept multiple keywords and have differently named functions depending on whether you wanted all the results put through any() or all() -- but I liked the idea of using keyword arguments, which would allow a single function to be used in two or more spots simultaneously. Here's what I ended up with:
# can be called with a single unnamed value or a single named value
def cache(*args, **kwargs):
    if len(args) + len(kwargs) == 1:
        if args:
            name, value = 'value', args[0]  # default attr name 'value'
        else:
            name, value = kwargs.items()[0]
    else:
        raise NotImplementedError('"cache" calls require either a single value argument '
                                  'or a name=value argument identifying an attribute.')
    setattr(cache, name, value)
    return value
# add a sub-function to clear the cache attributes (optional and a little weird)
cache.clear = lambda: cache.func_dict.clear()
# you could then use it either of these two ways
if get_collisions(item, others):
    # no cached value

if cache(collisions=get_collisions(item, others)):
    for collision in cache.collisions:
        # fix them
By putting all the ugly details in a separate function, it doesn't affect the code in get_collisions() one way or the other, and is also available for use elsewhere.
