Python for loop iteration not completing, skipping steps

I got a "wrong answer" error in my "queue from stacks" algorithm when I expected it to work. For those not familiar with the algorithm, the solution requires two stacks of list type: a "push stack" and a "pop stack". The pop stack is in effect a queue buffer that the push stack dumps itself into whenever a dequeue is requested and the pop stack is empty. See if you can determine what's going on and where the problem is.
def pop(self):
    self.stack_to_push_to = [1, 2]      # sample hard coding
    self.queue_to_pop = []              # sample hard coding
    if len(self.queue_to_pop) == 0:     # trigger a dump to form a new queue buffer
        for _ in self.stack_to_push_to:
            self.queue_to_pop.append(self.stack_to_push_to.pop())
    print(self.queue_to_pop)            # [2] but expected [2, 1]

Too much was being done on the append line, clever and concise as it seemed. When you pop from a list that is currently being iterated over, the removal shrinks the list underneath the iterator and throws off its internal index, or at least that's my understanding of what's happening. A similar thing happens in Excel when you delete rows while traversing down them (though not when going up). I had just assumed Python would be able to handle this on its own somehow.
Problematic code:
if len(self.queue_to_pop) == 0:  # trigger a dump to form a new queue buffer
    for _ in self.stack_to_push_to:
        self.queue_to_pop.append(self.stack_to_push_to.pop())  #!!!!!! pops from the list being iterated
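Here is a minimal, standalone demonstration of the same effect with a plain list (no class involved):
items = [1, 2]
out = []
for _ in items:              # the list iterator tracks its position by index
    out.append(items.pop())  # pop() shrinks the list underneath that iterator

print(out)    # [2]  -- after the first pass the iterator's next index (1) is already
print(items)  # [1]     past the end of the shrunken list, so the loop stops early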
The pop method served the interest of automatically working backwards through the stack, which is what I want, and I also thought it would save time and space complexity, but it ultimately didn't work. I found two alternatives that do. I'd be interested in learning whether there's a way to reduce the complexity of my solutions.
Option 1:
for i in range(len(self.stack_to_push_to) - 1, -1, -1):
    self.queue_buffer.append(self.stack_to_push_to[i])
self.stack_to_push_to = []
Option 2:
for item in reversed(self.stack_to_push_to):
    self.queue_buffer.append(item)
self.stack_to_push_to = []
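For what it's worth, here is a sketch of a slightly shorter variant with the same O(n) behaviour (the method name dump_stack is just for illustration; it assumes the same attributes as above):
def dump_stack(self):
    # same result as Option 1/2: reverse the push stack into the buffer, then empty it
    self.queue_buffer.extend(reversed(self.stack_to_push_to))
    self.stack_to_push_to.clear()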
I didn't see anyone else post this issue so I thought it was worth sharing and hope it enlightens others.

Related

trio.Event(): Which is “better”: setting and initializing a new Event or checking if someone is waiting for it beforehand?

import trio

work_available = trio.Event()

async def get_work():
    while True:
        work = check_for_work()
        if not work:
            await work_available.wait()
        else:
            return work

def add_work_to_pile(...):
    global work_available
    ...
    if work_available.statistics().tasks_waiting:
        work_available.set()
        work_available = trio.Event()
In this Python-like code example I get work in bursts via add_work_to_pile(). The workers which get work via get_work() are slow. So most of the time add_work_to_pile() is called there will be no one waiting on work_available.
Which is better/cleaner/simpler/more pythonic/more trionic/more intended by the trio developers?
checking if someone is looking for the Event() via statistics().tasks_waiting, like in the example code, ...or...
unconditionally set() setting the Event() and creating a new one each time? (Most of them in vain.)
Furthermore... the API does not really seem to expect regular code to check if someone is waiting via this statistics() call...
I don’t mind spending a couple more lines to make things clearer. But that goes both ways: a couple CPU cycles more are fine for simpler code...
Creating a new Event costs roughly the same as creating the _EventStatistics object inside the statistics() method, so you'll need to profile your own code to pick out any small performance difference. However, although the statistics() call is safe and performant, its intent across trio's classes is debugging rather than core logic. Creating and discarding many Event instances is more in line with what the devs intend.
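For reference, a sketch of what the unconditional set-and-replace version might look like, reusing the names from your example (work_pile is a hypothetical container for the work itself):
work_pile = []                       # hypothetical storage for the work items
work_available = trio.Event()

def add_work_to_pile(new_work):
    global work_available
    work_pile.append(new_work)
    work_available.set()             # wake anyone currently waiting (possibly no one)
    work_available = trio.Event()    # fresh Event for the next round of waiters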
A more trionic pattern would be to load each work item into a buffered memory channel in place of your add_work_to_pile() method and then iterate on that in the task that awaits get_work. I feel the amount of code is comparable to your example:
import trio

send_chan, recv_chan = trio.open_memory_channel(float('inf'))

async def task_that_uses_work_items():
    # # compare
    # while True:
    #     work = await get_work()
    #     handle_work(work)
    async for work in recv_chan:
        handle_work(work)

def add_work_to_pile():
    ...
    for work in new_work_set:
        send_chan.send_nowait(work)

# maybe your work is coming in from a thread?
def add_work_from_thread():
    ...
    for work in new_work_set:
        trio_token.run_sync_soon(send_chan.send_nowait, work)
Furthermore, it's performant because the work items are efficiently rotated through a deque internally. This code would checkpoint for every work item so you may have to jump through some hoops if you want to avoid that.
I think you might want a trio.ParkingLot. It gives more control over parking (which is like Event.wait()) and unparking (which is like Event.set(), except that it doesn't stop future parkers from waiting). But it doesn't have any notion of being "set" at all, so you would need to store that information separately. If your work is naturally truthy when set (e.g. a non-empty list), that might be easy anyway. Example:
available_work = []
available_work_pl = trio.ParkingLot()

async def get_work():
    while not available_work:
        await available_work_pl.park()
    result = list(available_work)
    available_work.clear()
    return result

def add_work_to_pile(foo):
    available_work.append(foo)
    available_work_pl.unpark()
Edit: Replaced "if" with "while" in get_work(). I think "if" has a race condition: if there are two parked tasks and add_work_to_pile() then gets called twice, one get_work() could grab both work items while the other would still be unparked and return an empty list. Using "while" makes it loop back around until more data is actually available.
IMHO you don't want an Event in the first place. The combination of a list and something that tells the reader there's work in it is exactly what a memory channel already gives you, with the additional advantage that you can tell it how much work to accept before the sender stalls.
send_channel, recv_channel = trio.open_memory_channel(10)
get_work = recv_channel.receive
add_work_to_pile = send_channel.send
# both are async functions or use the _nowait() versions
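For completeness, a rough sketch of wiring those two aliases into a running program (handle_work and produce_work_somehow are placeholders I've made up, not part of the answer):
async def worker():
    while True:
        try:
            work = await get_work()        # waits until add_work_to_pile() sends something
        except trio.EndOfChannel:
            break                          # producer closed the channel: no more work
        handle_work(work)

async def main():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(worker)
        for item in produce_work_somehow():
            await add_work_to_pile(item)   # stalls once 10 items are queued unread
        await send_channel.aclose()        # lets the worker finish cleanly

trio.run(main)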

Removing items from list while iterating over it

There are two separate processes running in a Python script; both interact with a global variable POST_QUEUE = [].
Process 1 (P1) adds items to POST_QUEUE every 60 seconds. This can be anywhere from 0 to 50 items at a time.
Process 2 (P2) iterates over POST_QUEUE via a for-loop at set intervals and performs an operation on the list items one at a time. After performing said operation, the process removes the item from the list.
Below is a generalized version of P2:
def Process_2():
    for post in POST_QUEUE:
        if perform_operation(post):
            print("Success!")
        else:
            print("Failure.")
        POST_QUEUE.remove(post)
Understandably, I've run into an issue: removing items from a list that a for-loop is iterating over screws up the indexing and terminates the loop earlier than expected (i.e., before it has performed the necessary operation on each post and removed it from POST_QUEUE).
Is there a better way to do this than just creating a copy of POST_QUEUE and having P2 iterate over that while removing items from the original POST_QUEUE object? For example:
def Process_2():
    POST_QUEUE_COPY = POST_QUEUE[:]
    for post in POST_QUEUE_COPY:
        if perform_operation(post):
            print("Success!")
        else:
            print("Failure.")
        POST_QUEUE.remove(post)
Since you do not really need the indexes of the elements I would suggest something like this as an easy solution:
def Process_2():
    while len(POST_QUEUE):
        if perform_operation(POST_QUEUE[0]):
            print("Success!")
        else:
            print("Failure.")
        POST_QUEUE.pop(0)
However, this solution has a runtime of O(n^2) per pass over the queue, since Python has to shift every remaining element each time the first one is removed.
So IMO a better implementation would be:
def Process_2():
    reversed_post_queue = POST_QUEUE[::-1]
    POST_QUEUE.clear()                    # the reversed copy now holds the pending posts
    while len(reversed_post_queue):
        post = reversed_post_queue.pop()  # O(1) removal from the end == front of the original order
        if perform_operation(post):
            print("Success!")
        else:
            print("Failure.")
That way you keep the order (which I assume is important to you throughout this answer) while only moving the elements of the list once, giving a runtime of O(n).
Finally, the best implementation IMO is to create or import a proper queue type so that you can easily use the data as a FIFO.
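For example, a sketch of that FIFO approach using collections.deque (my choice of queue type; the answer doesn't name one):
from collections import deque

POST_QUEUE = deque()

def Process_2():
    while POST_QUEUE:
        post = POST_QUEUE.popleft()   # O(1) removal from the front
        if perform_operation(post):
            print("Success!")
        else:
            print("Failure.")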
How about this:
while POST_QUEUE_COPY:
    post = POST_QUEUE_COPY.pop(0)
    if perform_operation(post):
        print("Success!")
    else:
        print("Failure.")
Also, separate processes don't share data the way threads do. So unless you are using something like multiprocessing.Manager or another shared-memory construct, I don't think your current logic would work as written.
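For illustration, a hedged sketch of what sharing POST_QUEUE between two real processes could look like with multiprocessing.Manager (the structure and names here are mine, not the poster's):
import time
import multiprocessing

def p1(queue):
    for i in range(5):
        queue.append("post-%d" % i)   # proxy list, visible to the other process
        time.sleep(1)

def p2(queue):
    for _ in range(10):               # poll a few times, at set intervals
        while len(queue):
            post = queue.pop(0)
            print("processing", post)
        time.sleep(1)

if __name__ == "__main__":
    with multiprocessing.Manager() as manager:
        post_queue = manager.list()
        a = multiprocessing.Process(target=p1, args=(post_queue,))
        b = multiprocessing.Process(target=p2, args=(post_queue,))
        a.start(); b.start()
        a.join(); b.join()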
You can loop through your list from right to left. This way, removing items will not cause issues for the loop. It's generally not a good idea to remove items from a list while looping over it, but if you need to, going from right to left is the best option:
def Process_2():
    for i in range(len(POST_QUEUE) - 1, -1, -1):
        if perform_operation(POST_QUEUE[i]):
            print("Success!")
        else:
            print("Failure.")
        POST_QUEUE.pop(i)

Free up memory by deleting numpy arrays

I have written a fatigue analysis program with a GUI. The program takes strain information for unit loads for each element of a finite element model, reads in a load case using np.genfromtxt('loadcasefilename.txt') and then does some fatigue analysis and saves the result for each element in another array.
The load cases are about 32Mb as text files and there are 40 or so which get read and analysed in a loop. The loads for each element are interpolated by taking slices of the load case array.
The GUI and fatigue analysis run in separate threads. When you click 'Start' on the fatigue analysis it starts the loop over the load cases in the fatigue analysis.
This brings me onto my problem. If I have a lot of elements, the analysis will not finish. How early it quits depends on how many elements there are, which makes me think it might be a memory problem. I've tried fixing this by deleting the load case array at the end of each loop (after deleting all the arrays which are slices of it) and running gc.collect() but this has not had any success.
In MatLab, I'd use the 'pack' function to write the workspace to disk, clear it, and then reload it at the end of each loop. I know this isn't good practice but it would get the job done! Can I do the equivalent in Python somehow?
Code below:
for LoadCaseNo in range(len(LoadCases[0]['LoadCaseLoops'])):#range(1):#xxx
    #Get load case data
    self.statustext.emit('Opening current load case file...')
    LoadCaseFilePath=LoadCases[0]['LoadCasePaths'][LoadCaseNo][0]
    #TK: load case paths may be different
    try:
        with open(LoadCaseFilePath):
            pass
    except Exception as e:
        self.statustext.emit(str(e))
    LoadCaseLoops=LoadCases[0]['LoadCaseLoops'][LoadCaseNo,0]
    LoadCase=np.genfromtxt(LoadCaseFilePath,delimiter=',')
    LoadCaseArray=np.array(LoadCases[0]['LoadCaseLoops'])
    LoadCaseArray=LoadCaseArray/np.sum(LoadCaseArray,axis=0)
    #Loop through sections
    for SectionNo in range(len(Sections)):#range(100):#xxx
        SectionCount=len(Sections)
        #Get section data
        Elements=Sections[SectionNo]['elements']
        UnitStrains=Sections[SectionNo]['strains'][:,1:]
        Nodes=Sections[SectionNo]['nodes']
        rootdist=Sections[SectionNo]['rootdist']
        #Interpolate load case data at this section
        NeighbourFind=rootdist-np.reshape(LoadCase[0,1:],(1,-1))
        NeighbourFind[NeighbourFind<0]=1e100
        nearest=np.unravel_index(NeighbourFind.argmin(), NeighbourFind.shape)
        nearestcol=int(nearest[1])
        Distance0=LoadCase[0,nearestcol+1]
        Distance1=LoadCase[0,nearestcol+7]
        MxLow=LoadCase[1:,nearestcol+1]
        MxHigh=LoadCase[1:,nearestcol+7]
        MyLow=LoadCase[1:,nearestcol+2]
        MyHigh=LoadCase[1:,nearestcol+8]
        MzLow=LoadCase[1:,nearestcol+3]
        MzHigh=LoadCase[1:,nearestcol+9]
        FxLow=LoadCase[1:,nearestcol+4]
        FxHigh=LoadCase[1:,nearestcol+10]
        FyLow=LoadCase[1:,nearestcol+5]
        FyHigh=LoadCase[1:,nearestcol+11]
        FzLow=LoadCase[1:,nearestcol+6]
        FzHigh=LoadCase[1:,nearestcol+12]
        InterpFactor=(rootdist-Distance0)/(Distance1-Distance0)
        Mx=MxLow+(MxHigh-MxLow)*InterpFactor[0,0]
        My=MyLow+(MyHigh-MyLow)*InterpFactor[0,0]
        Mz=MzLow+(MzHigh-MzLow)*InterpFactor[0,0]
        Fx=-FxLow+(FxHigh-FxLow)*InterpFactor[0,0]
        Fy=-FyLow+(FyHigh-FyLow)*InterpFactor[0,0]
        Fz=FzLow+(FzHigh-FzLow)*InterpFactor[0,0]
        #Loop through section coordinates
        for ElementNo in range(len(Elements)):
            MaterialID=int(Elements[ElementNo,1])
            if Materials[MaterialID]['curvefit'][0,0]!=3:
                StrainHist=UnitStrains[ElementNo,0]*Mx+UnitStrains[ElementNo,1]*My+UnitStrains[ElementNo,2]*Fz
            elif Materials[MaterialID]['curvefit'][0,0]==3:
                StrainHist=UnitStrains[ElementNo,3]*Fx+UnitStrains[ElementNo,4]*Fy+UnitStrains[ElementNo,5]*Mz
            EndIn=len(StrainHist)
            Extrema=np.bitwise_or(np.bitwise_and(StrainHist[1:EndIn-1]<=StrainHist[0:EndIn-2] , StrainHist[1:EndIn-1]<=StrainHist[2:EndIn]),np.bitwise_and(StrainHist[1:EndIn-1]>=StrainHist[0:EndIn-2] , StrainHist[1:EndIn-1]>=StrainHist[2:EndIn]))
            Extrema=np.concatenate((np.array([True]),Extrema,np.array([True])),axis=0)
            Extrema=StrainHist[np.where(Extrema==True)]
            del StrainHist
            #Do fatigue analysis
        self.statustext.emit('Analysing load case '+str(LoadCaseNo+1)+' of '+str(len(LoadCases[0]['LoadCaseLoops']))+' - '+str(((SectionNo+1)*100)/SectionCount)+'% complete')
        del MxLow,MxHigh,MyLow,MyHigh,MzLow,MzHigh,FxLow,FxHigh,FyLow,FyHigh,FzLow,FzHigh,Mx,My,Mz,Fx,Fy,Fz,Distance0,Distance1
        gc.collect()
There's obviously a reference cycle or other leak somewhere, but without seeing your code, it's impossible to say more than that. But since you seem to be more interested in workarounds than solutions…
In MatLab, I'd use the 'pack' function to write the workspace to disk, clear it, and then reload it at the end of each loop. I know this isn't good practice but it would get the job done! Can I do the equivalent in Python somehow?
No, Python doesn't have any equivalent to pack. (Of course if you know exactly what set of values you want to keep around, you can always np.savetxt or pickle.dump or otherwise stash them, then exec or spawn a new interpreter instance, then np.loadtxt or pickle.load or otherwise restore those values. But then if you know exactly what set of values you want to keep around, you probably aren't going to have this problem in the first place, unless you've actually hit an unknown memory leak in NumPy, which is unlikely.)
But it has something that may be better. Kick off a child process to analyze each element (or each batch of elements, if they're small enough that the process-spawning overhead matters), send the results back in a file or over a queue, then quit.
For example, if you're doing this:
def analyze(thingy):
    a = build_giant_array(thingy)
    result = process_giant_array(a)
    return result

total = 0
for thingy in thingies:
    total += analyze(thingy)
You can change it to this:
def wrap_analyze(thingy, q):
    q.put(analyze(thingy))

total = 0
for thingy in thingies:
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=wrap_analyze, args=(thingy, q))
    p.start()
    p.join()
    total += q.get()
(This assumes that each thingy and result is both smallish and pickleable. If it's a huge NumPy array, look into NumPy's shared memory wrappers, which are designed to make things much easier when you need to share memory directly between processes instead of passing it.)
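As a hedged illustration, here is roughly how you could hand a large array to the child without pickling it, using the standard library's multiprocessing.shared_memory (Python 3.8+; this is my substitution for the "NumPy shared memory wrappers" mentioned above, and the names are illustrative):
import numpy as np
from multiprocessing import Process, shared_memory

def worker(shm_name, shape, dtype):
    shm = shared_memory.SharedMemory(name=shm_name)
    a = np.ndarray(shape, dtype=dtype, buffer=shm.buf)  # a view, no copy
    print(a.sum())                                      # the real analysis would go here
    shm.close()

if __name__ == '__main__':
    data = np.arange(1_000_000, dtype=np.float64)
    shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
    buf = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
    buf[:] = data                                       # copy once into shared memory
    p = Process(target=worker, args=(shm.name, data.shape, data.dtype))
    p.start()
    p.join()
    shm.close()
    shm.unlink()                                        # free the shared block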
But you may want to look at what multiprocessing.Pool can do to automate this for you (and to make it easier to extend the code to, e.g., use all your cores in parallel). Notice that it has a maxtasksperchild parameter, which you can use to recycle the pool processes every, say, 10 thingies, so they don't run out of memory.
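A minimal sketch of that Pool-based version (reusing analyze and thingies from the example above; the process count is arbitrary):
import multiprocessing

if __name__ == '__main__':
    # each worker process is recycled after 10 tasks, so leaked memory is returned to the OS
    with multiprocessing.Pool(processes=4, maxtasksperchild=10) as pool:
        total = sum(pool.map(analyze, thingies))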
But back to actually trying to solve things briefly:
I've tried fixing this by deleting the load case array at the end of each loop (after deleting all the arrays which are slices of it) and running gc.collect() but this has not had any success.
None of that should make any difference at all. If you're just reassigning all the local variables to new values each time through the loop, and aren't keeping references to them anywhere else, then they're just going to get freed up anyway, so you'll never have more than 2 at a (brief) time. And gc.collect() only helps if there are reference cycles. So, on the one hand, it's good news that these had no effect—it means there's nothing obviously stupid in your code. On the other hand, it's bad news—it means that whatever's wrong isn't obviously stupid.
Usually people see this because they keep growing some data structure without realizing it. For example, maybe you vstack all the new rows onto the old version of giant_array instead of onto an empty array, then delete the old version… but it doesn't matter, because each time through the loop, giant_array isn't 5*N, it's 5*N, then 10*N, then 15*N, and so on. (That's just an example of something stupid I did not long ago… Again, it's hard to give more specific examples while knowing nothing about your code.)
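For that specific trap, the usual fix is to collect the new rows in a plain Python list and stack them once at the end, rather than calling np.vstack inside the loop; a tiny sketch with made-up shapes:
import numpy as np

rows = []
for _ in range(1000):
    rows.append(np.random.rand(5))   # whatever new data the iteration produces

giant_array = np.vstack(rows)        # one allocation at the end, shape (1000, 5)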

How can I detect infinite loops in Python

I am learning Python 3 and working on an exercise that calls for writing a Python program which simulates/reads a BASIC program as input. I am stuck on writing the part of the Python program that should detect infinite loops. Here is the code I have so far:
def execute(prog):
    while True:
        location = 0
        if prog[location] == len(prog) - 1:
            break
            return "success"
        getT = prog[location].split()
        T = len(getT) - 1
        location = findLine(prog, T)
    visited = [False] * len(prog)
Here, prog is a list of strings containing the BASIC program (strings are in the form of 5 GOTO 30, 10 GOTO 20, etc.).
T is the target string indicated in prog[location].
If the BASIC program has an infinite loop, then my Python program will have an infinite loop. I know that if any line is visited twice, then it loops forever, and my program should return "infinite loop".
A hint given by the tutorial assistant says "initialize a list visited = [False] * len(prog) and change visited[i] to True when prog[i] is visited. Each time through the loop, one value updates in visited[]. Think about how you change a single value in a list. Then think about how you identify which value in visited[] needs to change."
So this is the part I am stuck on. How do I keep track of which strings in prog have been visited/looped through?
I'm not sure I agree that visiting a line twice proves an infinite loop. See the comments under the question. But I can answer the actual question.
Here's the hint:
A hint given by the tutorial assistant says "initialize a list visited = [False] * len(prog) and change visited[i] to True when prog[i] is visited. Each time through the loop, one value updates in visited[]. Think about how you change a single value in a list. Then think about how you identify which value in visited[] needs to change."
This is saying you should have two lists, one that contains the program, and one that contains true/false flags. The second one is to be named visited and initially contains False values.
The Python code is just like the hint says:
visited = [False] * len(prog)
This uses the * list operator, "list repetition", to repeat a length-1 list and make a new list of a longer length.
To change visited[i] to True is simple:
visited[i] = True
Then you can do something like this:
if visited[i]:
    print("We have already visited line {}".format(i))
    print("Infinite loop? Exiting.")
    sys.exit(1)
Note that we are testing for the True value by simply saying if visited[i]:
We could also write if visited[i] == True: but the shorter form is sufficient and is customary in the Python community. This and other customary idioms are documented here: http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html
For a program this small, it's not too bad to keep two lists like this. For larger and complex programs, I prefer to keep everything together in one place. This would use a "class" which you might not have learned yet. Something like this:
class ProgramCode(object):
    def __init__(self, statement):
        self.code = statement
        self.visited = False

prog = []
with open(input_basic_program_file, "rt") as f:
    for line in f:
        prog.append(ProgramCode(line))
Now instead of two lists, we have a single list where each item is a bit of BASIC code and a visited flag.
P.S. The above shows an explicit for loop that repeatedly uses .append() to add to a list. An experienced Python developer would likely use a "list comprehension" instead, but I wanted to make this as easy to follow as possible.
Here's the list comprehension. Don't worry if it looks weird now; your class will teach this to you eventually.
with open(input_basic_program_file, "rt") as f:
    prog = [ProgramCode(line) for line in f]
I know of no automatic way of infinite loop detection in Python, but by using divide and conquer methods and testing individual functions, you can find the offending function or block of code and then proceed to debug further.
If the Python program outputs data but you never see that output, that's a good indicator you have an infinite loop. You can test all your functions in the REPL, and the function that does not "come back" to the prompt is a likely suspect.
You can put output behind a debug variable of some sort, to be shut off when everything works. This could be a member variable of a Python class your code has access to at any time, or a module-scoped variable like Debug=1, with debug levels controlling how much is printed: 1 a little, 2 more, 3 even more, and 4 verbose.
As an example, if you printed the value of a loop counter in a suspected function, then eventually that loop counter would keep printing well beyond the count of data (test records) you were using to test.
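A small sketch of that module-scoped debug switch (the names Debug and debug_print are illustrative, not from the answer):
Debug = 2   # 0 = silent, 1 = a little, 2 = more, 3 = even more, 4 = verbose

def debug_print(level, *args):
    if Debug >= level:
        print(*args)

# e.g. inside the interpreter loop:
# debug_print(2, "visiting line", location, prog[location])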
Here is a combination I came up with using parts of J. Carlos P.'s answer, the hints that steveha gave, and the hint from the instructions:
def execute(prog):
    location = 0
    visited = [False] * len(prog)
    while True:
        if location == len(prog) - 1:
            return "success"
        findT = prog[location].split()
        T = findT[-1]
        if visited[location]:
            return "infinite loop"
        visited[location] = True
        location = findLine(prog, T)

Pause Python Generator

I have a Python generator that does work producing a large amount of data, which uses up a lot of RAM. Is there a way of detecting if the processed data has been "consumed" by the code that is using the generator, and if so, pausing until it is consumed?
def multi_grab(urls,proxy=None,ref=None,xpath=False,compress=True,delay=10,pool_size=50,retries=1,http_obj=None):
    if proxy is not None:
        proxy = web.ProxyManager(proxy,delay=delay)
        pool_size = len(pool_size.records)
    work_pool = pool.Pool(pool_size)
    partial_grab = partial(grab,proxy=proxy,post=None,ref=ref,xpath=xpath,compress=compress,include_url=True,retries=retries,http_obj=http_obj)
    for result in work_pool.imap_unordered(partial_grab,urls):
        if result:
            yield result
run from:
if __name__ == '__main__':
    links = set(link for link in grab('http://www.reddit.com',xpath=True).xpath('//a/@href') if link.startswith('http') and 'reddit' not in link)
    print '%s links' % len(links)
    counter = 1
    for url, data in multi_grab(links,pool_size=10):
        print 'got', url, counter, len(data)
        counter += 1
A generator simply yields values. There's no way for the generator to know what's being done with them.
But the generator also pauses constantly. It doesn't execute again until the caller asks it for the next value, and it doesn't run on a separate thread or anything. It sounds like you have a misconception about how generators work. Can you show some code?
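A tiny illustration of that pausing behaviour (not from the original answer):
def gen():
    for i in range(3):
        print("producing", i)   # runs only when the consumer asks for the next item
        yield i

g = gen()
print("nothing produced yet")
print("got", next(g))   # "producing 0" happens here, then the generator pauses
print("got", next(g))   # "producing 1" happens here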
The point of a generator in Python is to get rid of extra, unneeded objects after each iteration. The only time it will keep those extra objects (and thus extra RAM) is when the objects are referenced somewhere else (such as being added to a list). Make sure you aren't saving these variables unnecessarily.
If you're dealing with multithreading/processing, then you probably want to implement a Queue that you could pull data from, keeping track of the number of tasks you're processing.
I think you may be looking for the yield keyword. It's explained in another Stack Overflow question: What does the "yield" keyword do in Python?
A solution could be to use a Queue to which the generator would add data, while another part of the code would get data from it and process it. This way you could ensure that there is no more than n items in memory at the same time.
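A sketch of that bounded-queue idea using a thread and queue.Queue, reusing multi_grab and links from the question; put() blocks once maxsize items are waiting, which gives the "pause until consumed" behaviour:
import threading
import queue

q = queue.Queue(maxsize=10)              # at most 10 unconsumed results in memory

def producer():
    for url, data in multi_grab(links, pool_size=10):
        q.put((url, data))               # blocks here while the queue is full
    q.put(None)                          # sentinel: no more work

threading.Thread(target=producer, daemon=True).start()

while True:
    item = q.get()
    if item is None:
        break
    url, data = item
    print('got', url, len(data))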
