I'm trying to manually step through a Windows folder/file structure using os.walk(). I'm working in Jupyter notebooks.
If I execute:
next(os.walk(path))
I get a result that makes sense the first time, but then I keep getting exactly the same response every time I execute that statement.
However, if I do:
g=os.walk(path)
next(g)
then I do get the next logical record each time I execute:
next(g)
Note that both:
type(g) and type(os.walk(path))
return 'generator'.
Please explain why 'next' behaves differently depending on whether it is applied to g or os.walk()
Thank you--
Because every time you call os.walk, you get a new generator which starts at the top (or bottom with topdown=False). If you call next repeatedly on the same generator, on the other hand, you will iterate through all the values it generates.
In principle, this is no different from the operation of range (with the caveat that a range object is not itself an iterator, so you have to wrap it in iter first): next(iter(range(42))) always produces 0. If that were not the case, range would be pretty useless, since there would be no way of knowing where a given for i in range(x): iteration would start.
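For instance, a quick sketch of the same distinction using only builtins:

r = range(42)
next(iter(r))  # 0 - iter() builds a fresh iterator each time
it = iter(r)
next(it)  # 0
next(it)  # 1 - the same iterator remembers its position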
Every time you call os.walk(path) you create a new generator, one which is ready to walk through all the nodes accessible from the path, starting at the first one.
When you do next(os.walk(path)) you:
1. Create a new generator.
2. Extract the first item from the generator using next.
3. Drop the generator, which subsequently gets garbage collected and disappears, along with any knowledge of how many items you have extracted from it.
Repeating next(os.walk(path)) takes you back to step 1, which gets you a fresh generator starting at the first element yet again.
When you do g = os.walk(path); next(g) you:
1. Create a new generator.
2. Store the generator in the variable g. This prevents it from being garbage collected and preserves its internal state.
3. Extract the first element from the generator (using next) and advance its internal state.
Repeating next(g) gets you the next item from the generator you just used.
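Putting both patterns side by side, a minimal sketch (assuming path names an existing directory):

import os

path = "."  # any existing directory

next(os.walk(path))  # first (dirpath, dirnames, filenames) triple
next(os.walk(path))  # the same triple again - each call builds a fresh generator

g = os.walk(path)
next(g)  # first triple
next(g)  # second triple - g preserves its position between calls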
os.walk is a generator function. Each time it is called it returns a new iterable object.
When you write g = os.walk(path), you create a new iterable object and bind it to the name g. Each time you call next(g), the iterator takes one step.
When you write next(os.walk(path)), you create a new iterable object but do not give it a name. You have advanced it once, but you have no way of advancing it again because it was never bound to a name. That's the difference.
Hi, I'm trying to wrap my head around the concept of generators in Python, specifically using spaCy.
As far as I understand, a generator can be run through only once, and nlp.pipe(list) returns a generator to use the machine efficiently.
And the generator worked as I predicted, like below:
matches = ['one', 'two', 'three']
docs = nlp.pipe(matches)
type(docs)

for x in docs:
    print(x)
# First iteration, worked
one
two
three

for x in docs:
    print(x)
# Nothing is printed this time
But a strange thing happened when I tried to make a list using the generator:
for things in nlp.pipe(example1):
    print(things)
# First iteration prints things
a is something
b is other thing
c is new thing
d is extra

for things in nlp.pipe(example1):
    print(things)
# Second iteration prints things again!
a is something
b is other thing
c is new thing
d is extra
Why does this generator seem to run indefinitely? I tried several times and it never gets exhausted.
Thank you
I think you're confused because the term "generator" can be used to mean two different things in Python.
The first thing it can mean is a "generator object", which is a kind of iterator. The docs variable you created in your first example is a reference to one of these. A generator object can only be iterated once; after that it's exhausted, and you'll need to create another one if you want to do more iteration.
The other thing "generator" can mean is a "generator function". A generator function is a function that returns a generator object when you call it. Indeed, the term "generator" is sometimes sloppily used for functions that return iterators generally, even when that's not technically correct. A real generator function is implemented using the yield keyword, but from the caller's perspective, it doesn't really matter how the function is implemented, just that it returns some kind of iterator.
I don't know anything about the library you're using, but it seems like nlp.pipe returns an iterator, so in at least the loosest sense it can be called a generator function. The iterator it returns is (presumably) a generator object.
Generator objects are single-use, like all iterators are supposed to be. Generator functions, on the other hand, can be called as many times as you find appropriate (some might have side effects). Each time you call the generator function, you'll get a new generator object. This is why your second code block works: you're calling nlp.pipe once for each loop, rather than iterating on the same iterator for both loops.
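A minimal sketch of that distinction with a plain generator function (no spaCy needed):

def gen_func():  # generator function: calling it builds a generator object
    yield 'one'
    yield 'two'

g = gen_func()           # g is a generator object
print(list(g))           # ['one', 'two']
print(list(g))           # [] - the object is exhausted
print(list(gen_func()))  # ['one', 'two'] - a fresh object from a new call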
for things in nlp.pipe(example1) creates a new instance of the nlp.pipe() generator (i.e. an iterator).
If you had assigned the generator to a variable and used the variable multiple times, then you would have seen the effect that you were expecting:
pipeGen = nlp.pipe(example1)
for things in pipeGen:
    print(things)
# First iteration will print things
for things in pipeGen:
    print(things)
# Second iteration will print nothing
In other words nlp.pipe() returns a NEW iterator whereas pipeGen IS an iterator.
I have a set of points in space, each of them linked to some others: http://molview.org/?q=Decane
For each point I need to find three other points:
One to form a bond: first neighbors
Second to form an angle: second neighbors
Third to form a dihedral angle: third neighbors are best, but second if none exist
I have a working algorithm:
def search_and_build(index, neighbor):
    # index is the currently selected point; neighbor is a list containing all the connected points...
    # if the index is already done, return directly
    if is_done(index):
        return
    set_done(index)
    for i, j in enumerate(neighbor):
        # the add_* functions store data in dictionaries; the search_* functions
        # look up second and third neighbors in the bonding dict
        add_bond(j, index)
        add_angle(j, search_angle(j))
        add_dihedral(j, search_dihedral(j))
        search_and_build(j, get_sorted_neighbors(j))
This algorithm uses recursion inside a for loop. I used it because I thought recursion was cool, and also because it instantly worked. I assumed that Python would finish the for loop first and then run the recursive call, but after some debugging I realized it doesn't work like that: sometimes the for loop runs multiple times before another call, sometimes not.
I googled, and it's apparently bad practice to use such algorithms. Could someone explain why?
Each time your for loop gets to the last line, it calls the function again, starting the for loop again, and so on.
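A minimal sketch of that interleaving, with toy print statements instead of the bonding code:

def visit(depth):
    # recurse two levels deep, two iterations per level
    if depth == 2:
        return
    for i in range(2):
        print("start", depth, i)
        visit(depth + 1)  # the recursive call runs before this iteration ends
        print("end", depth, i)

visit(0)
# start 0 0, start 1 0, end 1 0, start 1 1, end 1 1, end 0 0, start 0 1, ...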
The issue is that the for loop in all of those function calls has not finished executing; it has executed once and put a new search_and_build call on the stack, and each search_and_build execution will do the same while there's still something in your dict.
By the time you get back to the first for loop, the dict being iterated over no longer exists, or has shrunk a lot, even though there was supposed to be something (or more of something) to iterate over when it first started.
Basically, recursion is cool, but it makes things pretty hard to get your head around or debug, even more so if you involve other loops inside each step of your recursion.
TL;DR: Mutating an iterable while looping over it is very bad.
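A minimal sketch of why that's bad, using a plain dict in place of the bonding data:

d = {'a': 1, 'b': 2}
for k in d:
    del d[k]  # mutating the dict mid-iteration
# RuntimeError: dictionary changed size during iteration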
I've been learning about generators in Python recently and have a question. I've used iterators before when learning Java, so I know how they basically work.
So I understand what's going on here in this question: Python for loop and iterator behavior
Essentially, once the for loop traverses through the iterator, it stops there, so doing another for loop would continue the iterator at the end of it (and result in nothing being printed out). That is clear to me.
I'm also aware of the tee function from itertools, which lets me "duplicate" a generator. I found this to be helpful when I want to check whether a generator is empty before doing anything with it (as I can check whether the duplicate, in list form, is empty).
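For reference, a minimal sketch of that tee-based emptiness check (some_generator here is a hypothetical stand-in):

from itertools import tee

def some_generator():  # stand-in for any generator function
    yield from [1, 2, 3]

gen, probe = tee(some_generator())
if not list(probe):  # materialise the duplicate to test for emptiness
    print("generator is empty")
else:
    for item in gen:  # gen itself has not been advanced
        print(item)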
In some code I'm writing, I need to create many of the same generators at different points throughout the code, so my line of thought was: why don't I write a method that makes a generator? Then every time I need a new one, I can call that method. Maybe my misunderstanding has to do with this "generator creation" process, but that seemed right to me.
Here is the code I'm using. When I first call the method and duplicate it using tee, everything works fine, but then once I call it again after looping through it, the method returns an empty generator. Does this "using a method" workaround not work?
node_list = []
generate_hash_2, temp = tee(generate_nodes(...))
for node in list(temp):
    node_list.append(...)
print("Generate_hash_2:{}".format(node_list))

for node in generate_hash_2:
    if node.hash_value == x:
        print(x)

node_list2 = []
generate_hash_3, temp2 = tee(generate_nodes(...))  # exact same parameters as before
for node in list(temp2):
    node_list2.append(...)
print("Generate_hash_3:{}".format(node_list2))

def generate_nodes(nodes, type):
    for node in nodes:
        if isinstance(node.type, type):
            yield node
Please ignore the poor variable name choices, but the Generate_hash_2 list prints out fine while the Generate_hash_3 list prints out empty, despite the calls taking identical parameters :( Note that the inside of the for loop doesn't modify any of the items or anything. Any help or explanation would be great!
For a sample XML file, I have this:
<top></top>
For a sample output, I'm getting:
Generate_hash_2:["XPath:/*[name()='top'][1], Node Type:UnknownNode, Tag:top, Text:"]
Generate_hash_3:[]
If you are interested in helping me understand this further, I've been writing these methods to get an understanding of the files in here: https://github.com/mmoosstt/XmlXdiff/tree/master/lib/diffx , specifically the differ.py file
The code in that file constantly calls the _gen_dx_nodes() method (with the same parameters), which is a method that creates a generator. But that code's generator never seems to "end" and force the writer to do something to reset it. So I'm confused why this happens to me (because I've been running into my problem when calling that method from different methods in succession). I've also been using the same test cases, so I'm pretty lost here on how to fix this. Any help would be great!
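One plausible cause, offered as an assumption since the full calling code isn't shown: if the nodes argument passed to generate_nodes is itself an iterator rather than a list, the first full pass exhausts it, and every later generator built from it yields nothing. A minimal sketch:

def generate_nodes(nodes):  # simplified stand-in for the real method
    for node in nodes:
        yield node

source = iter([1, 2, 3])  # an iterator, not a list
print(list(generate_nodes(source)))  # [1, 2, 3]
print(list(generate_nodes(source)))  # [] - the underlying iterator is spent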
I've got a django app that, at a very pseudo-codey level, does something like this:
class SerialisedContentItem():
    def __init__(self, item):
        self.__output = self.__jsonify(item)

    def fetch_output(self):
        return self.__output

    def __jsonify(self, item):
        serialized = do_a_bunch_of_serialisey_stuff()
        return json.dumps(serialized)
So basically - as soon as the class is instantiated, it:
runs an internal function to generate an output string of JSON
stores it in an internal variable
exposes a public function that can be called later to retrieve the JSON
It's then being used to generate a page something like this:
for item in page.items:
    json_item = SerialisedContentItem(item)
    yield json_item.fetch_output()
This, to me, seems a bit pointless. And it's also causing issues with some business logic changes we need to make.
What I'd prefer to do is defer the calling of the "jsonify" function until I actually want it. Roughly speaking, changing the above to:
class SerialisedContentItem():
    def __init__(self, item):
        self.__item = item

    def fetch_output(self):
        return self.__jsonify(self.__item)
This seems simpler, and mucks with my logic slightly less.
But: is there a downside I'm not seeing? Is my change less performant, or not a good way of doing things?
As long as you only call fetch_output once per item, there's no performance hit (there would be one, obviously, if you called fetch_output twice on the same SerializedContentItem instance). And not doing useless operations is usually a good thing too (you don't expect open("/path/to/some/file.ext") to read the file's content, do you?)
The only caveat is that, with the original version, if item is mutated between the initialization of SerializedContentItem and the call to fetch_output, the change won't be reflected in the json output (since it's created right at initialisation time), while with your "lazy" version those changes WILL be reflected in the json. Whether this is a no-go, a potential issue or actually just what you want depends on the context, so only you can tell.
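A minimal sketch of that caveat, with json.dumps standing in for the real serialisation and the class names invented for illustration:

import json

class EagerItem:  # serialises at init time, like the original version
    def __init__(self, item):
        self.__output = json.dumps(item)
    def fetch_output(self):
        return self.__output

class LazyItem:  # serialises on demand, like the proposed version
    def __init__(self, item):
        self.__item = item
    def fetch_output(self):
        return json.dumps(self.__item)

item = {"title": "draft"}
eager, lazy = EagerItem(item), LazyItem(item)
item["title"] = "final"
print(eager.fetch_output())  # {"title": "draft"} - snapshot taken at init time
print(lazy.fetch_output())   # {"title": "final"} - sees the later mutation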
EDIT:
what's prompted my question: according to my (poor) understanding of yield, using it here makes sense: I only need to iterate over the page items once, so do so in a way that minimises the memory footprint. But the bulk of the work isn't currently being done in the yield expression; it's being done on the line above it, when the class is instantiated, which makes yield a bit pointless. Or am I misunderstanding how it works?
I'm afraid you are indeed misunderstanding yield. Deferring the json serialization until the yield json_item.fetch_output() will change nothing (nada, zero, zilch, shunya) about memory consumption wrt/ the original version.
yield is not a function, it's a keyword. What it does is turn the function containing it into a "generator function": a function that returns a generator (a lazy iterator) object that you can then iterate over. It will not change anything about the memory used to jsonify an item, and whether this jsonification happens "on the same line" as the yield keyword or not is totally irrelevant.
What a generator brings you (wrt/ memory use) is that you don't have to create a whole list of contents at once, i.e.:
def eager():
    result = []
    for i in range(1000):
        result.append("foo {}\n".format(i))
    return result

with open("file.txt", "w") as outfile:
    for item in eager():
        outfile.write(item)
This FIRST creates a 1000-item list in memory, then iterates over it.
vs
def lazy():
    for i in range(1000):
        yield "foo {}\n".format(i)

with open("file.txt", "w") as outfile:
    for item in lazy():
        outfile.write(item)
this lazily generates string after string on each iteration, so you don't end up with a 1000-item list in memory - BUT you still generated 1000 strings, each of them using the same amount of space as in the first solution. The difference is that since (in this example) you don't keep any reference to those strings, they can be garbage collected on each iteration, while storing them in a list prevents them from being collected until there's no more reference to the list itself.
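You can see the container-level difference with sys.getsizeof (exact numbers vary by Python version and platform, and the list figure counts only the pointer array, not the strings themselves):

import sys

eager_list = ["foo {}\n".format(i) for i in range(1000)]
lazy_gen = ("foo {}\n".format(i) for i in range(1000))

print(sys.getsizeof(eager_list))  # thousands of bytes for the list object alone
print(sys.getsizeof(lazy_gen))    # a couple hundred bytes, regardless of length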
I understand that a generator produces one value at a time, which can save a lot of memory, unlike a list, which stores all its values in memory.
I want to know how, in Python, yield knows which value should be returned during iteration without storing all the data in memory at once.
In my understanding, if I want to print 1 to 100 using yield, doesn't yield need to know or store 1 to 100 first, and then move a pointer one by one to return each value?
If not, then how does yield return one value at a time without storing all the values in memory?
Simply put, yield delays execution but remembers where it left off. More specifically, when a yield is reached, the local variables of the generator function are saved in a "frozen" state. When next is called on the generator again, execution resumes from that point and the next value in line is sent back. When there is no more data to yield (and hence a StopIteration is raised), the generator's "frozen" state is discarded.
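A minimal sketch for the 1-to-100 case: only the counter lives in the frozen state, never the full sequence:

def count_up_to(n):
    i = 1
    while i <= n:
        yield i  # execution pauses here; only i and n are stored, not a list
        i += 1

c = count_up_to(100)
print(next(c))  # 1
print(next(c))  # 2 - resumed right after the yield; i was remembered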
Each time a generator yields, the stack frame of the generator is saved off inside the generator object, so it can be restored when execution resumes (when the next value is requested).
You can see the structure definition on CPython here.
If you want to see more, generators are somewhat introspectable, so you can take a look at, say, the progression of the locals, the line number it's currently on, etc.:
def myrange(n):
    for i in range(n):
        yield i

mr = myrange(10)

# Before any values consumed:
print(mr.gi_frame.f_locals)  # Outputs {'n': 10}
print(mr.gi_frame.f_lineno)  # Outputs 1

next(mr)  # Advance one
print(mr.gi_frame.f_locals)  # Outputs {'n': 10, 'i': 0}
print(mr.gi_frame.f_lineno)  # Outputs 3

list(mr)  # Consumes the generator
print(mr.gi_frame)  # Outputs None; the frame is discarded when the generator finishes