I'm learning Python's generators from these slides: http://www.dabeaz.com/generators/Generators.pdf
There is an example in it, which can be described like this:
You have a log file called log.txt. Write a program that watches its content: if new lines are added to it, print them. Two solutions:
1. With a generator:

import time

def follow(thefile):
    while True:
        line = thefile.readline()
        if not line:
            time.sleep(0.1)
            continue
        yield line

logfile = open("log.txt")
loglines = follow(logfile)
for line in loglines:
    print line
2. Without a generator:

import time

logfile = open("log.txt")
while True:
    line = logfile.readline()
    if not line:
        time.sleep(0.1)
        continue
    print line
What's the benefit of using a generator here?
If all you have is a hammer, everything looks like a nail
I'd almost like to answer this question with just the above quote. Just because you can does not mean you need to do it all the time.
But conceptually, the generator version separates functionality: the follow function encapsulates the continuous reading from a file while waiting for new input, which frees you to do anything you want with each new line in your loop. In the second version, the code to read from the file and to print it out is intermingled with the control loop. This might not really be an issue in this small example, but it is something you might want to think about.
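For instance, here is a minimal sketch of plugging the same follow generator into a different consumer (the "ERROR" filter is just an illustrative assumption, not something from the slides):

import time

def follow(thefile):
    # Same follow generator as in example 1 above
    while True:
        line = thefile.readline()
        if not line:
            time.sleep(0.1)
            continue
        yield line

# The same generator can feed a completely different consumer,
# e.g. a filter that only reacts to error lines:
logfile = open("log.txt")
errors = (line for line in follow(logfile) if "ERROR" in line)
for line in errors:
    print line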
One benefit is the ability to pass your generator around (say to different functions) and iterate manually by calling .next(). Here is a slightly modified version of your initial generator example:
import time

def follow(file_name):
    with open(file_name, 'rb') as f:
        for line in f:
            if not line:
                time.sleep(0.1)
                continue
            yield line

loglines = follow("log.txt")
first_line = loglines.next()
second_line = loglines.next()
for line in loglines:
    print line
First of all, I opened the file with a context manager (the with statement, which auto-closes the file when you're done with it, or on an exception). Next, at the bottom, I've demonstrated using the .next() method, which allows you to step through manually. This can be useful sometimes if you need to break logic out of a simple for item in gen loop.
A generator function is defined like a normal function, but whenever it needs to produce a value, it does so with the yield keyword rather than return. Its main advantage is that it allows its code to produce a series of values over time, rather than computing them all at once and returning them as a list. For example:
# A Python program to generate squares from 1
# to 100 using yield and therefore a generator

# An infinite generator function that yields the
# next square number. It starts with 1
def nextSquare():
    i = 1
    # An infinite loop to generate squares
    while True:
        yield i * i
        i += 1  # Next execution resumes from this point

# Driver code to test the above generator function
for num in nextSquare():
    if num > 100:
        break
    print(num)
return sends a specified value back to its caller, whereas yield can produce a sequence of values. We should use yield when we want to iterate over a sequence but don't want to store the entire sequence in memory.
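A small sketch of that trade-off (the names squares_list and squares_gen are purely illustrative):

def squares_list(n):
    # Builds and returns the whole list in memory
    result = []
    for i in xrange(1, n + 1):
        result.append(i * i)
    return result

def squares_gen(n):
    # Yields one square at a time; nothing is stored up front
    for i in xrange(1, n + 1):
        yield i * i

print sum(squares_list(1000000))  # holds a million squares at once
print sum(squares_gen(1000000))   # holds only one square at a time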
Ideally most loops are roughly of the form:
for element in get_the_next_value():
    process(element)
However, sometimes (as in your example #2) the loop is actually more complex, because you sometimes get an element and sometimes don't. That means in your example without the generator, you have mixed up the code for generating an element with the code for processing it. It doesn't show up too clearly here, because the code to generate the next value isn't actually very complex and the processing is just one line, but example 1 separates these two concepts more cleanly.
A better example might be one that processes variable length paragraphs from a file with blank lines separating each paragraph: try writing code for that with and without generators and you should see the benefit.
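For what it's worth, a minimal sketch of the generator half of that exercise (the paragraphs helper and the file name are illustrative, not from the question):

def paragraphs(lines):
    # Group consecutive non-blank lines into one paragraph string
    paragraph = []
    for line in lines:
        if line.strip():
            paragraph.append(line)
        elif paragraph:
            yield ''.join(paragraph)
            paragraph = []
    if paragraph:
        yield ''.join(paragraph)

with open("log.txt") as f:
    for para in paragraphs(f):
        print para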
While your example might be a bit too simple to take full advantage of generators, I prefer to use generators to encapsulate the generation of any sequence of data where there is also some kind of filtering of the data. It keeps the 'what I'm doing with the data' code separated from the 'how I get the data' code.
For instance:
import time

def read_file(f):
    with open(f, 'r') as file_to_read:
        while True:
            line = file_to_read.readline()
            if line:
                yield line
            else:
                time.sleep(0.1)
The generator is consumed by another function:
def fun_function(f):
    l = read_file(f)
    for line in l:
        do_fun_stuff()
A use case would be reading an infinitely updating text file like a log where new lines are added every second or so.
As far as I understand, the read_file() function blocks everything else until something is yielded. But since nothing should be done unless a new line is present in the file, this seems to be okay in this case. My question would be whether there are other reasons not to prefer this blocking pattern (like performance)?
I am using the seek function to extract new lines from an updated file. My code looks like this:
import time

read_data = open('path-to-myfile', 'r')
read_data.seek(0, 2)  # jump to the end of the file
while True:
    time.sleep(sometime)
    new_data = read_data.readlines()
    # do something with new_data
myfile is a CSV file that is constantly being updated.
The problem is that, usually after several loops inside the while, new_data returns nothing. It can happen after different numbers of loops. When I check myfile, it is still updating... So is there any problem with my code? Or is there another way to do this?
Any help appreciated!
Do you have two programs accessing the same file on disk? If that is the case, then the resource may be getting locked. I set up an example script that writes to a file, and another script that reads it for changes, based on the code you provided.
So in one instance of python:
import time

while True:
    time.sleep(2)
    with open('test.txt', 'a') as read_data:
        read_data.seek(0, 2)
        read_data.write("bibbity boopity\n")
And in another instance of python
import time

read_data = open('test.txt', 'r')
read_data.seek(0, 2)
while True:
    time.sleep(1)
    new_data = read_data.readlines()
    print(new_data)
In this case, the resource is being updated more slowly than it is being read, so some of the reads printed by the second program will be blank. If I speed up the number of changes per second, I still see them, but there are some instances where not all the updates are seen.
You may want to use asynchronous file reading to catch all the changes. Python 3 asyncio library doesn't support async file read/write, but curio does.
See also this question
I have the following function:
def getInput():
    # define buffer (list of lines)
    buffer = []
    run = True
    while run:
        # loop through each line of user input, adding it to buffer
        for line in sys.stdin.readlines():
            if line == 'quit\n':
                run = False
            else:
                buffer.append(line.replace('\n', ''))
    # return list of lines
    return buffer
which is called in my function takeCommands(), which is called to actually run my program.
However, this doesn't do anything. I'm hoping to add each line to an array, and once a line == 'quit' to stop taking user input. I've tried both for line in sys.stdin.readlines() and for line in sys.stdin, but neither of them register any of my input (I'm running it in the Windows Command Prompt). Any ideas? Thanks.
So, I took your code out of the function and ran some tests.
import sys

buffer = []
run = True
while run:
    line = sys.stdin.readline().rstrip('\n')
    if line == 'quit':
        run = False
    else:
        buffer.append(line)

print buffer
Changes:
Removed the 'for' loop
Using 'readline' instead of 'readlines'
Stripped out the '\n' after input, so all processing afterwards is much easier.
Another way:
import sys

buffer = []
while True:
    line = sys.stdin.readline().rstrip('\n')
    if line == 'quit':
        break
    else:
        buffer.append(line)

print buffer
Takes out the 'run' variable, as it is not really needed.
I'd use itertools.takewhile for this:
import sys
import itertools
print list(itertools.takewhile(lambda x: x.strip() != 'quit', sys.stdin))
Another way to do this would be to use the 2-argument iter form:
print list(iter(raw_input,'quit'))
This has the advantage that raw_input takes care of all of the line-buffering issues and it will strip the newlines for you already -- But it will loop until you run out of memory if the user forgets to add a quit to the script.
Both of these pass the test:
python test.py <<EOF
foo
bar
baz
quit
cat
dog
cow
EOF
There are multiple separate problems with this code:
while run:
    # loop through each line of user input, adding it to buffer
    for line in sys.stdin.readlines():
        if line == 'quit':
            run = False
First, you have an inner loop that won't finish until all lines have been processed, even if you type "quit" at some point. Setting run = False doesn't break out of that loop. Instead of quitting as soon as you type "quit", it will keep going until it's looked at all of the lines, and then quit if you typed "quit" at any point.
You can fix this one pretty easily by adding a break after the run = False.
But, with or without that fix, if you didn't type "quit" during that first time through the outer loop, since you've already read all input, there's nothing else to read, so you'll just keep running an empty inner loop over and over forever that you can never exit.
You have a loop that means "read and process all the input". You want to do that exactly once. So, what should the outer loop be? It shouldn't be a loop at all; the way to do something once is to not use a loop. So, to fix this one, get rid of run and the while run: loop; just use the inner loop.
Then, if you type "quit", line will actually be "quit\n", because readlines does not strip newlines.
You fix this one by either testing for "quit\n", or stripping the lines.
Finally, even if you fix all of these problems, you're still waiting forever before doing anything. readlines returns a list of lines. The only way it can possibly do that is by reading all of the lines that will ever be on stdin. You can't even start looping until you've read all those lines.
When standard input is a file, that happens when the file ends, so it's not too terrible. But when standard input is the Windows command prompt, the command prompt never ends.* So, this takes forever. You don't get to start processing the list of lines, because it takes forever to wait for the list of lines.
The solution is to not use readlines(). Really, there is never a good reason to call readlines() on anything, stdin or not. Anything that readlines works on is already an iterable full of lines, just like the list that readlines would give you, except that it's "lazy": it can give you the lines one at a time, instead of waiting and giving you all of them at once. (And even if you really need the list, just do list(f) instead of f.readlines().)
So, instead of for line in sys.stdin.readlines():, just do for line in sys.stdin: (Or, better, replace the explicit loop completely and use a sequence of iterator transformations, as in mgilson's answer.)
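Putting those fixes together, a minimal sketch of what the function could look like (this is just one way to apply the points above, not the only one):

import sys

def getInput():
    buffer = []
    # Iterate over stdin lazily, one line at a time
    for line in sys.stdin:
        line = line.rstrip('\n')   # strip the trailing newline
        if line == 'quit':
            break                  # stop as soon as the user types "quit"
        buffer.append(line)
    return buffer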
The fixes JBernardo, Wing Tang Wong, etc. proposed are all correct, and necessary. The reason none of them fixed your problems is that if you have 4 bugs and fix 1, your code still doesn't work. That's exactly why "doesn't work" isn't a useful measure of anything in programming, and you have to debug what's actually going wrong to know whether you're making progress.
* I lied a bit about stdin never being finished. If you type a control-Z (you may or may not need to follow it with a return), then stdin is finished. But if your assignment is to make it quit as soon as the user types "quit", turning in something that only quits when the user types "quit" and then return, control-Z, return again probably won't be considered successful.
I have got stuck with a problem. It goes like this:
A function normally returns a single result. What I want is for it to return a continuous stream of results for a certain time frame (optional).
Is it feasible for a function to repeatedly return results for a single function call?
While browsing the net I came across gevent and threading. Will they work? If so, any heads up on how to solve it?
I just need to call the function, have it carry out the work, and return results immediately after every task is completed.
Why you need this is not specified in the question, so it is hard to know what you need, but I will give you a general idea, and code too.
You could return in that way: return var1, var2, var3 (but that's not what you need I think)
You have multiple options: either blocking or non-blocking. Blocking means your code will no longer execute while you are calling the function. Non-blocking means that it will run in parallel. You should also know that you will definitely need to modify the code calling that function.
That's if you want it in a thread (non-blocking):
def your_function(callback):
    # This is a function defined inside of it, just for convenience; it can be any function.
    def what_it_is_doing(callback):
        import time
        total = 0
        while True:
            time.sleep(1)
            total += 1
            # Here it is a callback function, but if you are using a
            # GUI application (not only), for example (wx, Qt, GTK, ...), they usually have
            # events/signals; you should be using that system.
            callback(time_spent=total)

    import thread
    thread.start_new_thread(what_it_is_doing, (callback,))
# The way you would use it:
def what_I_want_to_do_with_each_bit_of_result(time_spent):
    print "Time is:", time_spent

your_function(what_I_want_to_do_with_each_bit_of_result)

# Continue your code normally
The other option (blocking) involves a special kind of function: generators, which are technically treated as iterators. You define one like a function, but it acts as an iterator. Here is an example, using the same dummy logic as the other one:
def my_generator():
    import time
    total = 0
    while True:
        time.sleep(1)
        total += 1
        yield total

# And here's how you use it:
# You need it to be in a loop!
for time_spent in my_generator():
    print "Time spent is:", time_spent

# Or, you could use it this way, and call .next() manually:
my_gen = my_generator()
# When you need something from it:
time_spent = my_gen.next()
Note that in the second example, the timing would make no sense, because the generator is not really called at 1-second intervals: the other code runs each time it yields something or .next() is called, and that may take time. But I hope you got the point.
Again, it depends on what you are doing, if the app you are using has an "event" framework or similar you would need to use that, if you need it blocking/non-blocking, if time is important, how your calling code should manipulate the result...
Your gevent and threading ideas are on the right track, because a function does what it is programmed to do: it accepts one variable at a time, or takes a set, and returns either a set or a single value. The function has to be called to return either kind of result, and the continuous stream of processing is probably already taking place somewhere, or else you are asking about a loop over a kernel pointer or something similar, which you are not, so...
So your calling code, which encapsulates your function, is important. A function, any function, e.g. even a true/false boolean function, only executes until it is done with its variables, so there must be a calling function which listens indefinitely in your case. If it doesn't exist you should write one ;)
Calling code which encapsulates is certainly very important.
Folks aren't going to have enough info to help much, except in the super generic sense: we can tell you that you are, or should be, within some framework's event loop or some other code's loop already, and that is what you want to be listening to and preparing data for.
For this sort of thing I like functional programming's map function, I think. I can't comment at my rep level or I would restrict my speculation to that. :)
To get a better answer from another person post some example code and reveal your API if possible.
I have a python generator that does work that produces a large amount of data, which uses up a lot of ram. Is there a way of detecting if the processed data has been "consumed" by the code which is using the generator, and if so, pause until it is consumed?
def multi_grab(urls, proxy=None, ref=None, xpath=False, compress=True, delay=10, pool_size=50, retries=1, http_obj=None):
    if proxy is not None:
        proxy = web.ProxyManager(proxy, delay=delay)
        pool_size = len(pool_size.records)
    work_pool = pool.Pool(pool_size)
    partial_grab = partial(grab, proxy=proxy, post=None, ref=ref, xpath=xpath, compress=compress, include_url=True, retries=retries, http_obj=http_obj)
    for result in work_pool.imap_unordered(partial_grab, urls):
        if result:
            yield result
run from:
if __name__ == '__main__':
    links = set(link for link in grab('http://www.reddit.com', xpath=True).xpath('//a/@href') if link.startswith('http') and 'reddit' not in link)
    print '%s links' % len(links)
    counter = 1
    for url, data in multi_grab(links, pool_size=10):
        print 'got', url, counter, len(data)
        counter += 1
A generator simply yields values. There's no way for the generator to know what's being done with them.
But the generator also pauses constantly, as the caller does whatever it does. It doesn't execute again until the caller invokes it to get the next value. It doesn't run on a separate thread or anything. It sounds like you have a misconception about how generators work. Can you show some code?
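A tiny illustrative sketch of that pausing behaviour (not taken from the question's code):

def gen():
    for i in range(3):
        print 'producing', i
        yield i

for value in gen():
    print 'consuming', value

# The output interleaves "producing 0 / consuming 0 / producing 1 / ..."
# because the generator is suspended after every yield until the caller
# asks for the next value.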
The point of a generator in Python is to get rid of extra, unneeded objects after each iteration. The only time it will keep those extra objects (and thus extra ram) is when the objects are being referenced somewhere else (such as adding them to a list). Make sure you aren't saving these variables unnecessarily.
If you're dealing with multithreading/processing, then you probably want to implement a Queue that you could pull data from, keeping track of the number of tasks you're processing.
I think you may be looking for the yield keyword. It is explained in another StackOverflow question: What does the "yield" keyword do in Python?
A solution could be to use a Queue to which the generator adds data, while another part of the code gets data from it and processes it. This way you could ensure that there are no more than n items in memory at the same time.
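A minimal sketch of that bounded-queue idea (the names and sizes here are illustrative, not taken from the question's code):

import threading
import Queue  # "queue" on Python 3

def producer(q):
    # Stand-in for the generator's work; q.put blocks while the queue is
    # full, so at most maxsize unconsumed items ever sit in memory.
    for i in xrange(100):
        q.put(i * i)
    q.put(None)  # sentinel: no more data

def consumer(q):
    while True:
        item = q.get()
        if item is None:
            break
        print item  # process the item here

q = Queue.Queue(maxsize=10)
t = threading.Thread(target=producer, args=(q,))
t.start()
consumer(q)
t.join()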