Pause Python Generator - python

I have a python generator that does work that produces a large amount of data, which uses up a lot of ram. Is there a way of detecting if the processed data has been "consumed" by the code which is using the generator, and if so, pause until it is consumed?
def multi_grab(urls,proxy=None,ref=None,xpath=False,compress=True,delay=10,pool_size=50,retries=1,http_obj=None):
    if proxy is not None:
        proxy = web.ProxyManager(proxy,delay=delay)
        pool_size = len(pool_size.records)
    work_pool = pool.Pool(pool_size)
    partial_grab = partial(grab,proxy=proxy,post=None,ref=ref,xpath=xpath,compress=compress,include_url=True,retries=retries,http_obj=http_obj)
    for result in work_pool.imap_unordered(partial_grab,urls):
        if result:
            yield result
run from:
if __name__ == '__main__':
    links = set(link for link in grab('http://www.reddit.com',xpath=True).xpath('//a/@href') if link.startswith('http') and 'reddit' not in link)
    print '%s links' % len(links)
    counter = 1
    for url, data in multi_grab(links,pool_size=10):
        print 'got', url, counter, len(data)
        counter += 1

A generator simply yields values. There's no way for the generator to know what's being done with them.
But the generator also pauses constantly, as the caller does whatever it does. It doesn't execute again until the caller invokes it to get the next value. It doesn't run on a separate thread or anything. It sounds like you have a misconception about how generators work. Can you show some code?
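As a quick illustration of that pausing behaviour (a minimal sketch, unrelated to the code above): nothing in the generator body runs until the caller asks for the next value.
def gen():
    for i in range(3):
        print 'producing', i   # runs only when the caller asks for a value
        yield i

g = gen()         # nothing printed yet; the body has not started
first = next(g)   # prints 'producing 0', then pauses at the yield
second = next(g)  # prints 'producing 1', then pauses again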

The point of a generator in Python is to get rid of extra, unneeded objects after each iteration. The only time it will keep those extra objects (and thus extra RAM) is when the objects are referenced somewhere else (such as being appended to a list). Make sure you aren't saving these variables unnecessarily.
If you're dealing with multithreading/processing, then you probably want to implement a Queue that you could pull data from, keeping track of the number of tasks you're processing.

I think you may be looking for the yield keyword. It is explained in another Stack Overflow question: What does the "yield" keyword do in Python?

A solution could be to use a Queue to which the generator would add data, while another part of the code would get data from it and process it. This way you could ensure that there is no more than n items in memory at the same time.
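For example, here is a minimal sketch of that bounded-queue idea, reusing multi_grab and links from the question (those names are assumed from the code above); Queue(maxsize=...) makes the producer thread block as soon as n results are waiting to be consumed:
import threading
import Queue  # the module is called 'queue' on Python 3

def producer(urls, q):
    for result in multi_grab(urls, pool_size=10):
        q.put(result)   # blocks here once maxsize results are waiting
    q.put(None)         # sentinel: tell the consumer we're done

work_queue = Queue.Queue(maxsize=100)  # at most 100 unconsumed results in RAM
threading.Thread(target=producer, args=(links, work_queue)).start()

while True:
    item = work_queue.get()
    if item is None:
        break
    url, data = item
    print 'got', url, len(data)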

Related

trio.Event(): Which is “better”: setting and initializing a new Event or checking if someone is waiting for it beforehand?

import trio

work_available = trio.Event()

async def get_work():
    while True:
        work = check_for_work()
        if not work:
            await work_available.wait()
        else:
            return work

def add_work_to_pile(...):
    ...
    if work_available.statistics().tasks_waiting:
        global work_available
        work_available.set()
        work_available = trio.Event()
In this Python-like code example I get work in bursts via add_work_to_pile(). The workers which get work via get_work() are slow. So most of the time add_work_to_pile() is called there will be no one waiting on work_available.
Which is better/cleaner/simpler/more pythonic/more trionic/more intended by the trio developers?
checking if someone is looking for the Event() via statistics().tasks_waiting, like in the example code, ...or...
unconditionally set() setting the Event() and creating a new one each time? (Most of them in vain.)
Furthermore... the API does not really seem to expect regular code to check if someone is waiting via this statistics() call...
I don’t mind spending a couple more lines to make things clearer. But that goes both ways: a couple CPU cycles more are fine for simpler code...
Creating a new Event is roughly the same cost as creating the _EventStatistics object within the statistics method. You'll need to profile your own code to pick out any small difference in performance. However, although it is safe and performant, the intent of statistics() across trio's classes is debugging rather than core logic. Creating and discarding many Event instances is closer to what the devs intended.
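So the version without the statistics() check would just be (a sketch using the question's names; pile stands in for wherever the work items actually live):
import trio

pile = []                      # assumed storage for the work items themselves
work_available = trio.Event()

def add_work_to_pile(new_work):
    global work_available
    pile.append(new_work)
    work_available.set()           # wake every task currently waiting (there may be none)
    work_available = trio.Event()  # fresh, unset Event for the next round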
A more trionic pattern would be to load each work item into a buffered memory channel in place of your add_work_to_pile() method and then iterate on that in the task that awaits get_work. I feel the amount of code is comparable to your example:
import trio

send_chan, recv_chan = trio.open_memory_channel(float('inf'))

async def task_that_uses_work_items():
    # # compare
    # while True:
    #     work = await get_work()
    #     handle_work(work)
    async for work in recv_chan:
        handle_work(work)

def add_work_to_pile():
    ...
    for work in new_work_set:
        send_chan.send_nowait(work)

# maybe your work is coming in from a thread?
def add_work_from_thread():
    ...
    for work in new_work_set:
        trio_token.run_sync_soon(send_chan.send_nowait, work)
Furthermore, it's performant because the work items are efficiently rotated through a deque internally. This code would checkpoint for every work item so you may have to jump through some hoops if you want to avoid that.
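For instance, one way to batch the receives (and so checkpoint only once per batch rather than once per item) is a sketch like this, using receive_nowait() to drain whatever is already buffered:
import trio

async def get_work_batch(recv_chan):
    batch = [await recv_chan.receive()]  # wait (and checkpoint) for the first item
    while True:
        try:
            batch.append(recv_chan.receive_nowait())  # drain the rest without waiting
        except trio.WouldBlock:
            return batch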
I think you might want a trio.ParkingLot. It gives more control over parking (which is like Event.wait()) and unparking (which is like Event.set(), except that it doesn't stop future parkers from waiting). But it doesn't have any notion of being "set" at all, so you would need to store that information separately. If your work is naturally truthy when present (e.g. a non-empty list) then that might be easy anyway. Example:
available_work = []
available_work_pl = trio.ParkingLot()

async def get_work():
    while not available_work:
        await available_work_pl.park()
    result = list(available_work)
    available_work.clear()
    return result

def add_work_to_pile():
    available_work.append(foo)
    available_work_pl.unpark()
Edit: Replaced "if" with "while" in get_work(). I think if has a race condition: if there are two parked tasks and then add_work_to_pile() gets called twice, then one get_work() would get both work items but the other would still be unparked and return an empty list. Using while instead will make it loop back around until more data is added.
IMHO you don't want an event in the first place. The combination of an array and something that tells the reader there's work in the array is already available as memory channels. They have the additional advantage that you can tell them how much work to accept before the sender stalls.
send_channel, recv_channel = trio.open_memory_channel(10)
get_work = recv_channel.receive
add_work_to_pile = send_channel.send
# both are async functions or use the _nowait() versions
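Put together as a runnable sketch (the producer and consumer here are made up, just to show the back-pressure of the bounded buffer):
import trio

async def main():
    send_channel, recv_channel = trio.open_memory_channel(10)

    async def producer():
        for i in range(25):
            await send_channel.send(i)   # blocks whenever 10 items sit unconsumed
        await send_channel.aclose()      # closing ends the consumer's async-for loop

    async def consumer():
        async for item in recv_channel:
            print('got', item)

    async with trio.open_nursery() as nursery:
        nursery.start_soon(producer)
        nursery.start_soon(consumer)

trio.run(main)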

Yielding a value from a coroutine in Python, a.k.a. convert callback to generator

I'm new to Python and functional programming. I'm using version 2.7.6
I'm using the Tornado framework to make async network requests. From what I learned about functional programming, I want my data to stream through my code by using generators. I have done most of what I need using generators and transforming the data as they stream through my function calls.
At the very end of my stream, I want to make a REST request for some data. I have one for-loop just before I submit my data to Tornado, to initiate the pull, and then send the http request. The http object provided by Tornado takes a callback function as an option, and always returns a Future--which is actually a Tornado Future object, and not the official Python Future.
My problem is that since I'm now using generators to pull my data through my code, I no longer want to use the callback function. My reasoning for this is that after I get my data back from the callback, my data is now being pushed through my code, and I can no longer make use of generators.
My goal is to create an interface that appears like so:
urls = (...generated urls...)
responses = fetch(urls)
Where responses is a generator over the completed urls.
What I attempted to do--among many things--is convert the results from the callback into a generator. I was thinking about something like this, although I'm far from implementing it for other issues I will soon explain. However, I wanted my fetch function to look something like this:
def fetch(urls):
    def url_generator():
        while True:
            val = yield
            yield val

    @curry
    def handler(gen, response):
        gen.send(response)

    gen = url_generator()
    for u in urls:
        http.fetch(u, callback=handler(gen))
    return gen
I simplified the code and syntax to focus on the problem, but I figured this was going to work fine. My strategy was to define a coroutine/generator which I will then send the responses to, as I receive them.
What I'm having the most trouble with is the coroutine/generator. Even if I define a generator in the above manner and perform the following, then I get an infinite loop--this is one of my main problems.
def gen():
    while True:
        val = yield
        print 'val', val
        yield val
        print 'after', val
        break

g = gen()
g.send(None)
g.send(10)
for e in g:
    print e
This prints val 10 after 10 in the coroutine as expected with the break, but the for-loop never gets the value of 10. It doesn't print anything while the break is there. If I remove the break, then I get the infinite loop:
val None
None
after None
None
val None
None
after None
None
...
If I remove the for-loop, then the coroutine will only print val 10 as it waits on the second yield. I expect this. However, using it doesn't produce anything.
Similarly, if I remove the for-loop and replace it with print next(g), then I get a StopIteration error, which I assume means I called next on a generator that had no more values.
Anywho, I am at a complete loss while I plunge into more depth on Python. I figure this is such a common situation in Python that somebody knows a great approach. I searched for 'convert callback into generator' and such, but didn't have much luck.
On another note, I could possibly yield each future from the http request, but I didn't have much luck "waiting" on the yield for the future to complete. I read a lot about 'yield from', but it seems to be Python 3 specific and Tornado doesn't seem to work on Python 3 yet.
Thanks for viewing, and thanks for any help you can provide.
Tornado works great on Python 3.
The problem with your simplified code above is that this isn't doing what you expect:
val = yield
You expect the generator to pause there (blocking your for-loop) until some other function calls g.send(value), but that's not what happens. Instead, the code behaves like:
val = yield None
So the for-loop receives None values as fast as it can process them. After it receives each None, it implicitly calls g.next(), which is the same as g.send(None). So, your code is equivalent to this:
def gen():
    while True:
        val = yield None
        print 'val', val
        yield val
        print 'after', val

g = gen()
g.send(None)
g.send(10)
while True:
    try:
        e = g.send(None)
        print e
    except StopIteration:
        break
Reading this version of the code, where the implicit behaviors are made explicit, I hope it's clear why it's just generating None in an infinite loop.
What you need is some way for one function to add items to the head of a queue, while another function blocks waiting for items, and pulls them off the tail of the queue when they're ready. Starting in Tornado 4.2 we have exactly that:
http://www.tornadoweb.org/en/stable/queues.html
The web spider example is close to what you want to do, I'm sure you can adapt it:
http://www.tornadoweb.org/en/stable/guide/queues.html
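A rough adaptation of that spider pattern to the "fetch a set of urls" case might look like the sketch below (Tornado 4.2+; fetch_all, worker and concurrency are made-up names, and it collects results into a list rather than yielding them lazily):
from tornado import gen, queues
from tornado.httpclient import AsyncHTTPClient
from tornado.ioloop import IOLoop

@gen.coroutine
def fetch_all(urls, concurrency=10):
    q = queues.Queue()
    client = AsyncHTTPClient()
    results = []

    for url in urls:
        q.put_nowait(url)

    @gen.coroutine
    def worker():
        while True:
            url = yield q.get()
            try:
                response = yield client.fetch(url)
                results.append((url, response.body))
            except Exception as e:
                results.append((url, e))
            finally:
                q.task_done()

    for _ in range(concurrency):
        worker()        # start the workers; they stay parked on q.get() once the queue drains
    yield q.join()      # wait until every queued url has been processed
    raise gen.Return(results)

# responses = IOLoop.current().run_sync(lambda: fetch_all(urls))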

Free up memory by deleting numpy arrays

I have written a fatigue analysis program with a GUI. The program takes strain information for unit loads for each element of a finite element model, reads in a load case using np.genfromtxt('loadcasefilename.txt') and then does some fatigue analysis and saves the result for each element in another array.
The load cases are about 32Mb as text files and there are 40 or so which get read and analysed in a loop. The loads for each element are interpolated by taking slices of the load case array.
The GUI and fatigue analysis run in separate threads. When you click 'Start' on the fatigue analysis it starts the loop over the load cases in the fatigue analysis.
This brings me onto my problem. If I have a lot of elements, the analysis will not finish. How early it quits depends on how many elements there are, which makes me think it might be a memory problem. I've tried fixing this by deleting the load case array at the end of each loop (after deleting all the arrays which are slices of it) and running gc.collect() but this has not had any success.
In MatLab, I'd use the 'pack' function to write the workspace to disk, clear it, and then reload it at the end of each loop. I know this isn't good practice but it would get the job done! Can I do the equivalent in Python somehow?
Code below:
for LoadCaseNo in range(len(LoadCases[0]['LoadCaseLoops'])):#range(1):#xxx
    #Get load case data
    self.statustext.emit('Opening current load case file...')
    LoadCaseFilePath=LoadCases[0]['LoadCasePaths'][LoadCaseNo][0]
    #TK: load case paths may be different
    try:
        with open(LoadCaseFilePath):
            pass
    except Exception as e:
        self.statustext.emit(str(e))
    LoadCaseLoops=LoadCases[0]['LoadCaseLoops'][LoadCaseNo,0]
    LoadCase=np.genfromtxt(LoadCaseFilePath,delimiter=',')
    LoadCaseArray=np.array(LoadCases[0]['LoadCaseLoops'])
    LoadCaseArray=LoadCaseArray/np.sum(LoadCaseArray,axis=0)
    #Loop through sections
    for SectionNo in range(len(Sections)):#range(100):#xxx
        SectionCount=len(Sections)
        #Get section data
        Elements=Sections[SectionNo]['elements']
        UnitStrains=Sections[SectionNo]['strains'][:,1:]
        Nodes=Sections[SectionNo]['nodes']
        rootdist=Sections[SectionNo]['rootdist']
        #Interpolate load case data at this section
        NeighbourFind=rootdist-np.reshape(LoadCase[0,1:],(1,-1))
        NeighbourFind[NeighbourFind<0]=1e100
        nearest=np.unravel_index(NeighbourFind.argmin(), NeighbourFind.shape)
        nearestcol=int(nearest[1])
        Distance0=LoadCase[0,nearestcol+1]
        Distance1=LoadCase[0,nearestcol+7]
        MxLow=LoadCase[1:,nearestcol+1]
        MxHigh=LoadCase[1:,nearestcol+7]
        MyLow=LoadCase[1:,nearestcol+2]
        MyHigh=LoadCase[1:,nearestcol+8]
        MzLow=LoadCase[1:,nearestcol+3]
        MzHigh=LoadCase[1:,nearestcol+9]
        FxLow=LoadCase[1:,nearestcol+4]
        FxHigh=LoadCase[1:,nearestcol+10]
        FyLow=LoadCase[1:,nearestcol+5]
        FyHigh=LoadCase[1:,nearestcol+11]
        FzLow=LoadCase[1:,nearestcol+6]
        FzHigh=LoadCase[1:,nearestcol+12]
        InterpFactor=(rootdist-Distance0)/(Distance1-Distance0)
        Mx=MxLow+(MxHigh-MxLow)*InterpFactor[0,0]
        My=MyLow+(MyHigh-MyLow)*InterpFactor[0,0]
        Mz=MzLow+(MzHigh-MzLow)*InterpFactor[0,0]
        Fx=-FxLow+(FxHigh-FxLow)*InterpFactor[0,0]
        Fy=-FyLow+(FyHigh-FyLow)*InterpFactor[0,0]
        Fz=FzLow+(FzHigh-FzLow)*InterpFactor[0,0]
        #Loop through section coordinates
        for ElementNo in range(len(Elements)):
            MaterialID=int(Elements[ElementNo,1])
            if Materials[MaterialID]['curvefit'][0,0]!=3:
                StrainHist=UnitStrains[ElementNo,0]*Mx+UnitStrains[ElementNo,1]*My+UnitStrains[ElementNo,2]*Fz
            elif Materials[MaterialID]['curvefit'][0,0]==3:
                StrainHist=UnitStrains[ElementNo,3]*Fx+UnitStrains[ElementNo,4]*Fy+UnitStrains[ElementNo,5]*Mz
            EndIn=len(StrainHist)
            Extrema=np.bitwise_or(np.bitwise_and(StrainHist[1:EndIn-1]<=StrainHist[0:EndIn-2] , StrainHist[1:EndIn-1]<=StrainHist[2:EndIn]),np.bitwise_and(StrainHist[1:EndIn-1]>=StrainHist[0:EndIn-2] , StrainHist[1:EndIn-1]>=StrainHist[2:EndIn]))
            Extrema=np.concatenate((np.array([True]),Extrema,np.array([True])),axis=0)
            Extrema=StrainHist[np.where(Extrema==True)]
            del StrainHist
            #Do fatigue analysis
        self.statustext.emit('Analysing load case '+str(LoadCaseNo+1)+' of '+str(len(LoadCases[0]['LoadCaseLoops']))+' - '+str(((SectionNo+1)*100)/SectionCount)+'% complete')
    del MxLow,MxHigh,MyLow,MyHigh,MzLow,MzHigh,FxLow,FxHigh,FyLow,FyHigh,FzLow,FzHigh,Mx,My,Mz,Fx,Fy,Fz,Distance0,Distance1
    gc.collect()
There's obviously a retain cycle or other leak somewhere, but without seeing your code, it's impossible to say more than that. But since you seem to be more interested in workarounds than solutions…
In MatLab, I'd use the 'pack' function to write the workspace to disk, clear it, and then reload it at the end of each loop. I know this isn't good practice but it would get the job done! Can I do the equivalent in Python somehow?
No, Python doesn't have any equivalent to pack. (Of course if you know exactly what set of values you want to keep around, you can always np.savetxt or pickle.dump or otherwise stash them, then exec or spawn a new interpreter instance, then np.loadtxt or pickle.load or otherwise restore those values. But then if you know exactly what set of values you want to keep around, you probably aren't going to have this problem in the first place, unless you've actually hit an unknown memory leak in NumPy, which is unlikely.)
But it has something that may be better. Kick off a child process to analyze each element (or each batch of elements, if they're small enough that the process-spawning overhead matters), send the results back in a file or over a queue, then quit.
For example, if you're doing this:
def analyze(thingy):
    a = build_giant_array(thingy)
    result = process_giant_array(a)
    return result

total = 0
for thingy in thingies:
    total += analyze(thingy)
You can change it to this:
def wrap_analyze(thingy, q):
    q.put(analyze(thingy))

total = 0
for thingy in thingies:
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=wrap_analyze, args=(thingy, q))
    p.start()
    p.join()
    total += q.get()
(This assumes that each thingy and result is both smallish and pickleable. If it's a huge NumPy array, look into NumPy's shared memory wrappers, which are designed to make things much easier when you need to share memory directly between processes instead of passing it.)
But you may want to look at what multiprocessing.Pool can do to automate this for you (and to make it easier to extend the code to, e.g., use all your cores in parallel). Notice that it has a maxtasksperchild parameter, which you can use to recycle the pool processes every, say, 10 thingies, so they don't run out of memory.
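For example, a sketch of the Pool variant, reusing the hypothetical analyze and thingies from above:
import multiprocessing

if __name__ == '__main__':
    # each worker process is replaced after 10 tasks, so leaked memory can't pile up
    pool = multiprocessing.Pool(processes=4, maxtasksperchild=10)
    total = sum(pool.imap_unordered(analyze, thingies))
    pool.close()
    pool.join()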
But back to actually trying to solve things briefly:
I've tried fixing this by deleting the load case array at the end of each loop (after deleting all the arrays which are slices of it) and running gc.collect() but this has not had any success.
None of that should make any difference at all. If you're just reassigning all the local variables to new values each time through the loop, and aren't keeping references to them anywhere else, then they're just going to get freed up anyway, so you'll never have more than 2 at a (brief) time. And gc.collect() only helps if there are reference cycles. So, on the one hand, it's good news that these had no effect—it means there's nothing obviously stupid in your code. On the other hand, it's bad news—it means that whatever's wrong isn't obviously stupid.
Usually people see this because they keep growing some data structure without realizing it. For example, maybe you vstack all the new rows onto the old version of giant_array instead of onto an empty array, then delete the old version… but it doesn't matter, because each time through the loop, giant_array isn't 5*N, it's 5*N, then 10*N, then 15*N, and so on. (That's just an example of something stupid I did not long ago… Again, it's hard to give more specific examples while knowing nothing about your code.)

Return continuous result from a single function call

I have got stuck with a problem.
It goes like this,
A function normally returns a single result. What I want is for it to return a continuous stream of results over a certain (optional) time frame.
Is it feasible for a function to repeatedly return results for a single function call?
While browsing through the net I did come across gevent and threading. Will it work if so any heads up how to solve it?
I just need to call the function, have it carry out the work, and return results immediately after each task is completed.
Why you need this is not specified in the question, so it is hard to know what you need, but I will give you a general idea, and code too.
You could return in that way: return var1, var2, var3 (but that's not what you need I think)
You have multiple options: either blocking or non-blocking. Blocking means your code will no longer execute while you are calling the function. Non-blocking means that it will run in parallel. You should also know that you will definitely need to modify the code calling that function.
That's if you want it in a thread (non-blocking):
def your_function(callback):
    # This is a function defined inside of it, just for convenience; it can be any function.
    def what_it_is_doing(callback):
        import time
        total = 0
        while True:
            time.sleep(1)
            total += 1
            # Here it is a callback function, but if you are using a
            # GUI application (not only) for example (wx, Qt, GTK, ...) they usually have
            # events/signals, you should be using this system.
            callback(time_spent=total)
    import thread
    # args must be a tuple; the original tuple(callback) would raise a TypeError
    thread.start_new_thread(what_it_is_doing, (callback,))

# The way you would use it:
def what_I_want_to_do_with_each_bit_of_result(time_spent):
    print "Time is:", time_spent

your_function(what_I_want_to_do_with_each_bit_of_result)
# Continue your code normally
The other option (blocking) involves a special kind of function: a generator, which is technically treated as an iterator. You define it like a function, but it acts as an iterator. Here's an example, using the same dummy task as the previous one:
def my_generator():
    import time
    total = 0
    while True:
        time.sleep(1)
        total += 1
        yield total

# And here's how you use it:
# You need it to be in a loop !!
for time_spent in my_generator():
    print "Time spent is:", time_spent

# Or, you could use it that way, and call .next() manually:
my_gen = my_generator()
# When you need something from it:
time_spent = my_gen.next()
Note that in the second example the timing would not really be exact: the values are not produced at one-second intervals, because the other code runs each time the generator yields something (or .next() is called), and that may take time. But I hope you got the point.
Again, it depends on what you are doing, if the app you are using has an "event" framework or similar you would need to use that, if you need it blocking/non-blocking, if time is important, how your calling code should manipulate the result...
Your gevent and threading ideas are on the right track, in the sense that a function only does what it is programmed to do: it accepts its arguments, runs until it is done, and returns a result, so it has to be called again to return another result. Any continuous stream of processing is driven by the code that calls the function.
So the calling code which encapsulates your function is important. Any function, even a simple true/false boolean function, only executes until it is done with its arguments, so in your case there must be a calling function which listens indefinitely. If it doesn't exist, you should write one ;)
Calling code which encapsulates the function is certainly very important.
Folks aren't going to have enough info to help much, except in the super generic sense: you are (or should be) inside some framework's event loop, or some other loop of your own, already, and that loop is what you want to be listening to and preparing data for.
I like functional programming's map function for this sort of thing, I think. I can't comment at my rep level or I would restrict my speculation to that. :)
To get a better answer from another person post some example code and reveal your API if possible.

What's the benefit of using generator in this case?

I'm learning Python's generator from this slide: http://www.dabeaz.com/generators/Generators.pdf
There is an example in it, which can be describe like this:
You have a log file called log.txt; write a program to watch its content, and if new lines are added to it, print them. Two solutions:
1. with generator:
import time

def follow(thefile):
    while True:
        line = thefile.readline()
        if not line:
            time.sleep(0.1)
            continue
        yield line

logfile = open("log.txt")
loglines = follow(logfile)
for line in loglines:
    print line
2. Without generator:
import time

logfile = open("log.txt")
while True:
    line = logfile.readline()
    if not line:
        time.sleep(0.1)
        continue
    print line
What's the benefit of using generator here?
If all you have is a hammer, everything looks like a nail
I'd almost just like to answer this question with just the above quote. Just because you can does not mean you need to all the time.
But conceptually the generator version separates functionality: the follow function encapsulates the continuous reading from the file while waiting for new input, which frees you to do anything you want in your loop with each new line. In the second version the code that reads from the file and the code that prints it out are intermingled with the control loop. This might not really be an issue in this small example, but it is something you might want to think about.
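For instance, the same follow() generator could be reused unchanged with a completely different consumer, say one that only reports error lines ('ERROR' here is just an assumed marker):
logfile = open("log.txt")
for line in follow(logfile):
    if 'ERROR' in line:    # the consumer decides what to do; follow() doesn't change
        print line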
One benefit is the ability to pass your generator around (say to different functions) and iterate manually by calling .next(). Here is a slightly modified version of your initial generator example:
import time

def follow(file_name):
    with open(file_name, 'rb') as f:
        for line in f:
            if not line:
                time.sleep(0.1)
                continue
            yield line

loglines = follow('log.txt')
first_line = loglines.next()
second_line = loglines.next()
for line in loglines:
    print line
First of all I opened the file with a context manager (with statement, which auto-closes the file when you're done with it, or on exception). Next, at the bottom I've demonstrated using the .next() method, allowing you to manually step through. This can be useful sometimes if you need to break logic out from a simple for item in gen loop.
A generator function is defined like a normal function, but whenever it needs to produce a value, it does so with the yield keyword rather than return. Its main advantage is that it allows its code to produce a series of values over time, rather than computing them all at once and sending them back in a list. For example:
# A Python program to generate squares from 1
# to 100 using yield and therefore generator

# An infinite generator function that prints
# next square number. It starts with 1
def nextSquare():
    i = 1
    # An infinite loop to generate squares
    while True:
        yield i*i
        i += 1  # Next execution resumes from this point

# Driver code to test above generator function
for num in nextSquare():
    if num > 100:
        break
    print(num)
Return sends a specified value back to its caller whereas Yield can produce a sequence of values. We should use yield when we want to iterate over a sequence, but don’t want to store the entire sequence in memory.
Ideally most loops are roughly of the form:
for element in get_the_next_value():
    process(element)
However sometimes (as in your example #2) the loop is actually more complex, because you sometimes get an element and sometimes don't. That means in your example without the generator, the code for generating an element is mixed up with the code for processing it. It doesn't show too clearly in the example, because the code to generate the next value isn't actually very complex and the processing is just one line, but example number 1 separates these two concepts more cleanly.
A better example might be one that processes variable length paragraphs from a file with blank lines separating each paragraph: try writing code for that with and without generators and you should see the benefit.
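A sketch of the generator half of that exercise (the paragraphs name and the file name are made up here) might look like:
def paragraphs(lines):
    para = []
    for line in lines:
        if line.strip():
            para.append(line)
        elif para:                # a blank line ends the current paragraph
            yield ''.join(para)
            para = []
    if para:                      # the last paragraph may not be followed by a blank line
        yield ''.join(para)

with open('input.txt') as f:
    for para in paragraphs(f):
        print para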
While your example might be a bit simple to fully take advantage of generators, I prefer to use generators to encapsulate the generation of any sequence data where there is also some kind of filtering of the data. It keeps the 'what I'm doing with the data' code separated from the 'how I get the data' code.

Categories

Resources