Background
=============
Let's say I'm writing some unit tests, and I want to test the re-opening of a log file (it was corrupted for some reason outside or inside my program). I currently have a TextIOWrapper from originally running open(), which I want to fully delete or "clean up". Once it's cleaned up, I want to re-run open(), and I want the ID of that new TextIOWrapper to be something new.
Problem
=============
It seems to re-appear with the same ID. How do I fully clean this thing up? Is it a lost cause for some reason hidden in the docs?
Debug
=============
My actual code has more try/except blocks for various edge cases, but here's the gist:
import gc # I don't want to do this
# create log
log = open("log", "w")
id(log) # result = 01111311110
# close log and delete everything I can think to delete
log.close()
log.__del__()
del log
gc.collect()
# TODO clean up some special way?
# re-open the log
log = open("log", "a")
id(log) # result = 01111311110
Why is that resulting ID still the same?
Theory 1: Due to the way the IO stream works, the TextIOWrapper will end up in the same place in memory for a given file, and my method of testing this function needs re-work.
Theory 2: Somehow I am not properly cleaning this up.
I think you do enough cleanup simply by calling log.close(). My hypothesis (now confirmed; see below) is based on the fact that my example below delivers the result you were expecting from the code in your question.
It seems that Python reuses the id numbers for some reason.
Try this example:
log = open("log", "w")
print(id(log)) # result = 01111311110
# close log and delete everything I can think to delete
log.close()
log = open("log", "a")
print(id(log))
log.close()
[edit]
I found proof of my hypothesis:
The id is unique only as long as an object is alive. Objects that have no references left to them are removed from memory, allowing the id() value to be re-used for another object, hence the non-overlapping lifetimes wording.
In CPython, id() is the memory address. New objects will be slotted into the next available memory space, so if a specific memory address has enough space to hold the next new object, the memory address will be reused.
The moment all references to an object are gone, the reference count on the object drops to 0 and it is deleted, there and then.
Garbage collection only is needed to break cyclic references, objects that reference one another only, with no further references to the cycle. Because such a cycle will never reach a reference count of 0 without help, the garbage collector periodically checks for such cycles and breaks one of the references to help clear those objects from memory.
More info on Python's reuse of id values: How unique is Python's id()?
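If you hold on to the old wrapper while opening the new one, the two objects are alive at the same time, so CPython cannot hand out the same address twice. A minimal sketch to confirm it (the printed values will be whatever addresses your interpreter happens to use):
old_log = open("log", "w")
old_log.close()            # closed, but still alive because we keep a reference

new_log = open("log", "a")
print(id(old_log))         # two live objects, so the ids are guaranteed to differ
print(id(new_log))
new_log.close()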
Related
=============
I've made a map editor in Python2.7.9 for a small project and I'm looking for ways to preserve the data I edit in the event of some unhandled exception. My editor already has a method for saving out data, and my current solution is to have the main loop wrapped in a try..finally block, similar to this example:
import os, datetime #..and others.

if __name__ == '__main__':
    DataMgr = DataManager() # initializes the editor.
    save_note = None
    try:
        MainLoop() # unsurprisingly, this calls the main loop.
    except Exception as e: # I am of the impression this will catch every type of exception.
        save_note = "Exception dump: %s : %s." % (type(e).__name__, e) # A memo appended to the comments in the save file.
    finally:
        exception_fp = DataMgr.cwd + "dump_%s.kmap" % str(datetime.datetime.now())
        DataMgr.saveFile(exception_fp, memo = save_note) # saves out to a dump file using a familiar method with a note outlining what happened.
This seems like the best way to make sure that, no matter what happens, an attempt is made to preserve the editor's current state (to the extent that saveFile() is equipped to do so) in the event that it should crash. But I wonder if encapsulating my entire main loop in a try block is actually safe and efficient and good form. Is it? Are there risks or problems? Is there a better or more conventional way?
Wrapping the main loop in a try...finally block is the accepted pattern when you need something to happen no matter what. In some cases it's logging and continuing, in others it's saving everything possible and quitting.
So your code is fine.
If your file isn't that big, I would suggest reading the entire input file into memory, closing the file, and then doing your data processing on the in-memory copy. That removes the risk of corrupting the file on disk, at the cost of potentially slowing down your runtime.
Alternatively, take a look at the atexit Python module. It lets you register one or more functions to be called automatically when the program exits.
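For example, a minimal sketch (emergency_save is a stand-in for whatever DataMgr.saveFile call you want to guarantee; note that atexit handlers run on normal interpreter shutdown, but not if the process is killed by an untrapped signal):
import atexit

def emergency_save():
    # placeholder for DataMgr.saveFile(...)
    print("saving editor state before exit")

atexit.register(emergency_save)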
That being said, what you have should work reasonably well.
I have written a fatigue analysis program with a GUI. The program takes strain information for unit loads for each element of a finite element model, reads in a load case using np.genfromtxt('loadcasefilename.txt') and then does some fatigue analysis and saves the result for each element in another array.
The load cases are about 32Mb as text files and there are 40 or so which get read and analysed in a loop. The loads for each element are interpolated by taking slices of the load case array.
The GUI and fatigue analysis run in separate threads. When you click 'Start' on the fatigue analysis it starts the loop over the load cases in the fatigue analysis.
This brings me onto my problem. If I have a lot of elements, the analysis will not finish. How early it quits depends on how many elements there are, which makes me think it might be a memory problem. I've tried fixing this by deleting the load case array at the end of each loop (after deleting all the arrays which are slices of it) and running gc.collect() but this has not had any success.
In MatLab, I'd use the 'pack' function to write the workspace to disk, clear it, and then reload it at the end of each loop. I know this isn't good practice but it would get the job done! Can I do the equivalent in Python somehow?
Code below:
for LoadCaseNo in range(len(LoadCases[0]['LoadCaseLoops'])):#range(1):#xxx
    #Get load case data
    self.statustext.emit('Opening current load case file...')
    LoadCaseFilePath=LoadCases[0]['LoadCasePaths'][LoadCaseNo][0]
    #TK: load case paths may be different
    try:
        with open(LoadCaseFilePath):
            pass
    except Exception as e:
        self.statustext.emit(str(e))
    LoadCaseLoops=LoadCases[0]['LoadCaseLoops'][LoadCaseNo,0]
    LoadCase=np.genfromtxt(LoadCaseFilePath,delimiter=',')
    LoadCaseArray=np.array(LoadCases[0]['LoadCaseLoops'])
    LoadCaseArray=LoadCaseArray/np.sum(LoadCaseArray,axis=0)
    #Loop through sections
    for SectionNo in range(len(Sections)):#range(100):#xxx
        SectionCount=len(Sections)
        #Get section data
        Elements=Sections[SectionNo]['elements']
        UnitStrains=Sections[SectionNo]['strains'][:,1:]
        Nodes=Sections[SectionNo]['nodes']
        rootdist=Sections[SectionNo]['rootdist']
        #Interpolate load case data at this section
        NeighbourFind=rootdist-np.reshape(LoadCase[0,1:],(1,-1))
        NeighbourFind[NeighbourFind<0]=1e100
        nearest=np.unravel_index(NeighbourFind.argmin(), NeighbourFind.shape)
        nearestcol=int(nearest[1])
        Distance0=LoadCase[0,nearestcol+1]
        Distance1=LoadCase[0,nearestcol+7]
        MxLow=LoadCase[1:,nearestcol+1]
        MxHigh=LoadCase[1:,nearestcol+7]
        MyLow=LoadCase[1:,nearestcol+2]
        MyHigh=LoadCase[1:,nearestcol+8]
        MzLow=LoadCase[1:,nearestcol+3]
        MzHigh=LoadCase[1:,nearestcol+9]
        FxLow=LoadCase[1:,nearestcol+4]
        FxHigh=LoadCase[1:,nearestcol+10]
        FyLow=LoadCase[1:,nearestcol+5]
        FyHigh=LoadCase[1:,nearestcol+11]
        FzLow=LoadCase[1:,nearestcol+6]
        FzHigh=LoadCase[1:,nearestcol+12]
        InterpFactor=(rootdist-Distance0)/(Distance1-Distance0)
        Mx=MxLow+(MxHigh-MxLow)*InterpFactor[0,0]
        My=MyLow+(MyHigh-MyLow)*InterpFactor[0,0]
        Mz=MzLow+(MzHigh-MzLow)*InterpFactor[0,0]
        Fx=-FxLow+(FxHigh-FxLow)*InterpFactor[0,0]
        Fy=-FyLow+(FyHigh-FyLow)*InterpFactor[0,0]
        Fz=FzLow+(FzHigh-FzLow)*InterpFactor[0,0]
        #Loop through section coordinates
        for ElementNo in range(len(Elements)):
            MaterialID=int(Elements[ElementNo,1])
            if Materials[MaterialID]['curvefit'][0,0]!=3:
                StrainHist=UnitStrains[ElementNo,0]*Mx+UnitStrains[ElementNo,1]*My+UnitStrains[ElementNo,2]*Fz
            elif Materials[MaterialID]['curvefit'][0,0]==3:
                StrainHist=UnitStrains[ElementNo,3]*Fx+UnitStrains[ElementNo,4]*Fy+UnitStrains[ElementNo,5]*Mz
            EndIn=len(StrainHist)
            Extrema=np.bitwise_or(np.bitwise_and(StrainHist[1:EndIn-1]<=StrainHist[0:EndIn-2] , StrainHist[1:EndIn-1]<=StrainHist[2:EndIn]),np.bitwise_and(StrainHist[1:EndIn-1]>=StrainHist[0:EndIn-2] , StrainHist[1:EndIn-1]>=StrainHist[2:EndIn]))
            Extrema=np.concatenate((np.array([True]),Extrema,np.array([True])),axis=0)
            Extrema=StrainHist[np.where(Extrema==True)]
            del StrainHist
            #Do fatigue analysis
        self.statustext.emit('Analysing load case '+str(LoadCaseNo+1)+' of '+str(len(LoadCases[0]['LoadCaseLoops']))+' - '+str(((SectionNo+1)*100)/SectionCount)+'% complete')
        del MxLow,MxHigh,MyLow,MyHigh,MzLow,MzHigh,FxLow,FxHigh,FyLow,FyHigh,FzLow,FzHigh,Mx,My,Mz,Fx,Fy,Fz,Distance0,Distance1
        gc.collect()
There's obviously a retain cycle or other leak somewhere, but without seeing your code, it's impossible to say more than that. But since you seem to be more interested in workarounds than solutions…
In MatLab, I'd use the 'pack' function to write the workspace to disk, clear it, and then reload it at the end of each loop. I know this isn't good practice but it would get the job done! Can I do the equivalent in Python somehow?
No, Python doesn't have any equivalent to pack. (Of course if you know exactly what set of values you want to keep around, you can always np.savetxt or pickle.dump or otherwise stash them, then exec or spawn a new interpreter instance, then np.loadtxt or pickle.load or otherwise restore those values. But then if you know exactly what set of values you want to keep around, you probably aren't going to have this problem in the first place, unless you've actually hit an unknown memory leak in NumPy, which is unlikely.)
But it has something that may be better. Kick off a child process to analyze each element (or each batch of elements, if they're small enough that the process-spawning overhead matters), send the results back in a file or over a queue, then quit.
For example, if you're doing this:
def analyze(thingy):
    a = build_giant_array(thingy)
    result = process_giant_array(a)
    return result

total = 0
for thingy in thingies:
    total += analyze(thingy)
You can change it to this:
def wrap_analyze(thingy, q):
    q.put(analyze(thingy))

total = 0
for thingy in thingies:
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=wrap_analyze, args=(thingy, q))
    p.start()
    p.join()
    total += q.get()
(This assumes that each thingy and result is both smallish and pickleable. If it's a huge NumPy array, look into NumPy's shared memory wrappers, which are designed to make things much easier when you need to share memory directly between processes instead of passing it.)
But you may want to look at what multiprocessing.Pool can do to automate this for you (and to make it easier to extend the code to, e.g., use all your cores in parallel). Notice that it has a maxtasksperchild parameter, which you can use to recycle the pool processes every, say, 10 thingies, so they don't run out of memory.
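For example, a minimal sketch (reusing the hypothetical analyze and thingies names from above; the pool size and recycling interval are arbitrary):
import multiprocessing

if __name__ == '__main__':
    # each worker process is replaced after 10 tasks, so any memory it
    # accumulated is returned to the OS when it exits
    pool = multiprocessing.Pool(processes=4, maxtasksperchild=10)
    total = sum(pool.map(analyze, thingies))
    pool.close()
    pool.join()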
But back to actually trying to solve things briefly:
I've tried fixing this by deleting the load case array at the end of each loop (after deleting all the arrays which are slices of it) and running gc.collect() but this has not had any success.
None of that should make any difference at all. If you're just reassigning all the local variables to new values each time through the loop, and aren't keeping references to them anywhere else, then they're going to get freed up anyway, so you'll never have more than two copies alive at a time (and only briefly). And gc.collect() only helps if there are reference cycles. So, on the one hand, it's good news that these had no effect: it means there's nothing obviously stupid in your code. On the other hand, it's bad news: it means that whatever's wrong isn't obviously stupid.
Usually people see this because they keep growing some data structure without realizing it. For example, maybe you vstack all the new rows onto the old version of giant_array instead of onto an empty array, then delete the old version… but it doesn't matter, because each time through the loop, giant_array isn't 5*N, it's 5*N, then 10*N, then 15*N, and so on. (That's just an example of something stupid I did not long ago… Again, it's hard to give more specific examples while knowing nothing about your code.)
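In NumPy terms, that kind of accidental growth looks something like this (purely illustrative; chunks and build_rows are placeholder names):
import numpy as np

giant_array = np.empty((0, 5))
for chunk in chunks:                                   # chunks is a placeholder
    new_rows = build_rows(chunk)                       # returns an (N, 5) array
    # bug: stacking onto the previous result, so the array grows every pass
    giant_array = np.vstack((giant_array, new_rows))
    # intended: giant_array = new_rows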
Original situation:
The application I'm working on at the moment will receive notification from another application when a particular file has had data added and is ready to be read. At the moment I have something like this:
class Foo(object):
    def __init__(self):
        self.myFile = open("data.txt", "r")
        self.myFile.seek(0, 2) #seeks to the end of the file

        self.mainWindow = JFrame("Foo",
                                 defaultCloseOperation = JFrame.EXIT_ON_CLOSE,
                                 size = (640, 480))
        self.btn = JButton("Check the file", actionPerformed=self.CheckFile)
        self.mainWindow.add(self.btn)
        self.mainWindow.visible = True

    def CheckFile(self, event):
        while True:
            line = self.myFile.readline()
            if not line:
                break
            print line

foo = Foo()
Eventually, CheckFile() will be triggered when a certain message is received on a socket. At the moment, I'm triggering it from a JButton.
Despite the fact that the file is not touched anywhere else in the program, and I'm not using with on the file, I keep on getting ValueError: I/O operation on closed file when I try to readline() it.
Initial Solution:
In trying to figure out when exactly the file was being closed, I changed my application code to:
foo = Foo()

while True:
    if foo.myFile.closed == True:
        print "File is closed!"
But then the problem went away! Or if I change it to:
foo = Foo()
foo.CheckFile()
then the initial CheckFile(), happening straight away, works. But then when I click the button ~5 seconds later, the exception is raised again!
After changing the infinite loop to just pass, and discovering that everything was still working, my conclusion was that initially, with nothing left to do after instantiating a Foo, the application code was ending, foo was going out of scope, and thus foo.myFile was going out of scope and the file was being closed. Despite this, swing was keeping the window open, which was then causing errors when I tried to operate on an unopened file.
Why I'm still confused:
The odd part is, if foo had gone out of scope, why then, was swing still able to hook into foo.CheckFile() at all? When I click on the JButton, shouldn't the error be that the object or method no longer exists, rather than the method being called successfully and giving an error on the file operation?
My next idea was that maybe, when the JButton attempted to call foo.CheckFile() and found that foo no longer existed, it created a new Foo, somehow skipped its __init__ and went straight to its CheckFile(). However, this doesn't seem to be the case either. If I modify Foo.__init__ to take a parameter, store that in self.myNum, and print it out in CheckFile(), the value I pass in when I instantiate the initial object is always there. This would seem to suggest that foo isn't going out of scope at all, which puts me right back where I started!!!
EDIT: Tidied question up with relevant info from comments, and deleted a lot of said comments.
* Initial, Partial Answer (Added to Question) *
I think I just figured this out. After foo = Foo(), with no code left to keep the module busy, it would appear that the object ceases to exist, despite the fact that the application is still running, with a Swing window doing stuff.
If I do this:
foo = Foo()

while True:
    pass
Then everything works as I would expect.
I'm still confused though, as to how foo.CheckFile() was being called at all. If the problem was that foo.myFile was going out of scope and being closed, then how come foo.CheckFile() was able to be called by the JButton?
Maybe someone else can provide a better answer.
I think the problem arises from memory being partitioned into two types in Java, heap and non-heap. Your class instance foo gets stored in heap memory, while its method CheckFile is loaded into the method area of non-heap memory. After your script finishes, there are no more references to foo, so it gets marked for garbage collection, while the Swing interface is still referring to CheckFile, so it gets marked as in-use. I am assuming that foo.myFile is not considered static, so it is also stored in heap memory. As for the Swing interface, it's presumably still tracked as in-use as long as the window is open and being updated by the window manager.
Edit: your solution of using a while True loop is a correct one, in my opinion. Use it to monitor for events and when the window closes or the last line is read, exit the loop and let the program finish.
Edit 2: alternative solution - try having foo inherit from JFrame to make Swing keep a persistent pointer to it in its main loop as long as the window is open.
I have a class that looks like this:
class A:
    def __init__(self, filename, sources):
        # gather info from file
        # info is updated during lifetime of the object

    def close(self):
        # save info back to file
Now, this is in a server program, so it might be shut down without prior notice by a signal. Is it safe to define this to make sure the class saves its info, if possible?
def __del__(self):
    self.close()
If not, what would you suggest as a solution instead?
Waiting until later is just not the way to make something reliable. In fact, you have to go in the complete opposite direction. As soon as you know something should be persistent, you need to take action to persist it. If you want to make it reliable, you need to first write to disk the steps needed to recover from the failure that might happen while you are trying to commit the change. Pseudo-Python:
class A:
    def __init__(self, filename, sources):
        self.recover()
        # gather info from file
        # info is updated during lifetime of the object

    def update_info(self, info):
        # append 'info' to recovery_log
        # recovery_log.flush()
        # write 'info' to file
        # file.flush()
        # append 'info-SUCCESS' to recovery_log
        # recovery_log.flush()

    def recover(self):
        # open recovery_log
        # skip to last 'info-SUCCESS'
        # read 'info' from recovery_log
        # write 'info' to file
        # file.flush()
        # append 'info-SUCCESS' to recovery_log
        # recovery_log.flush()
The important bit is that recover() happens every time, and that every step is followed by a flush() to make sure the data makes it out to disk before the next step occurs. Another important thing is that only appends ever occur on the recovery log itself; nothing is overwritten in such a way that the data in the log can become corrupted.
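A rough, concrete sketch of the same idea (this is not the answer's actual implementation; the file layout, the json encoding, and the use of os.fsync are all assumptions):
import json, os

class RecoverableStore(object):
    # illustrative only: a write-ahead pattern like the pseudocode above
    def __init__(self, data_path, log_path):
        self.data_path = data_path
        self.log = open(log_path, 'a')

    def update_info(self, info):
        # 1. record the intent in the append-only log and force it to disk
        self.log.write(json.dumps(info) + '\n')
        self.log.flush()
        os.fsync(self.log.fileno())
        # 2. apply the change to the real data file and force it to disk
        with open(self.data_path, 'a') as f:
            f.write(json.dumps(info) + '\n')
            f.flush()
            os.fsync(f.fileno())
        # 3. only then mark the step as committed in the log
        self.log.write('info-SUCCESS\n')
        self.log.flush()
        os.fsync(self.log.fileno())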
No. You are NEVER safe.
If the operating system wants to kill you without prior notice, it will. You can do nothing about it. Your program can stop running after any instruction, at any time, and have no opportunity to execute any additional code.
There is just no way of protecting your server from a killing signal.
You can, if you want, trap lesser signals and manually delete your objects, forcing the calls to close().
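For example, a minimal sketch of trapping SIGTERM and SIGINT (SIGKILL cannot be trapped; a is a hypothetical instance of the A class above):
import signal
import sys

a = A("data.txt", sources=[])      # hypothetical instance of the class above

def handle_signal(signum, frame):
    a.close()                      # persist the info while we still can
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_signal)
signal.signal(signal.SIGINT, handle_signal)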
For orderly cleanup you can use the atexit module. Register a function there that calls your close method. The destructor of an object may not be called at exit.
The __del__ method is not guaranteed to ever be called for objects that still exist when the interpreter exits.
Even if __del__ is called, it can be called too late. In particular, it can run after the modules it wants to call have already been unloaded. As pointed out by Keith, the atexit module is much safer.
I was wondering if Python has similar issues as C regarding the order of execution of certain elements of code.
For example, I know in C there are times say when it's not guaranteed that some variable is initialized before another. Or just because one line of code is above another it's not guaranteed that it is implemented before all the ones below it.
Is it the same for Python? Like if I open a file of data, read in the data, close the file, then do other stuff do I know for sure that the file is closed before the lines after I close the file are executed??
The reason I ask is because I'm trying to read in a large file of data (1.6GB) and use this python module specific to the work I do on the data. When I run this module I get this error message:
File "/glast01/software/ScienceTools/ScienceTools-v9r15p2-SL4/sane/v3r18p1/python/GtApp.py", line 57, in run
input, output = self.runWithOutput(print_command)
File "/glast01/software/ScienceTools/ScienceTools-v9r15p2-SL4/sane/v3r18p1/python/GtApp.py", line 77, in runWithOutput
return os.popen4(self.command(print_command))
File "/Home/eud/jmcohen/.local/lib/python2.5/os.py", line 690, in popen4
stdout, stdin = popen2.popen4(cmd, bufsize)
File "/Home/eud/jmcohen/.local/lib/python2.5/popen2.py", line 199, in popen4
inst = Popen4(cmd, bufsize)
File "/Home/eud/jmcohen/.local/lib/python2.5/popen2.py", line 125, in __init__
self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
>>>
Exception exceptions.AttributeError: AttributeError("Popen4 instance has no attribute 'pid'",) in <bound method Popen4.__del__ of <popen2.Popen4 instance at 0x9ee6fac>> ignored
I assume it's related to the size of the data I read in (it has 17608310 rows and 22 columns).
I thought perhaps if I closed the file I opened right after I read in the data this would help, but it didn't. This led me to thinking about the order that lines of code are executed in, hence my question.
Thanks
The only thing I can think of that may surprise some people is:
def test():
    try:
        return True
    finally:
        return False

print test()
Output:
False
finally clauses really are executed last, even if a return statement precedes them. However, this is not specific to Python.
Execution of C certainly is sequential, for actual statements. There are even rules that define the sequence points, so you can know how individual expressions evaluate.
CPython itself is written in such a way that any effects like those you mention are minimized; code always executes top to bottom barring literal evaluation during compilation, objects are GCed as soon as their refcount hits 0, etc.
Execution in the cpython vm is very linear. I do not think whatever problem you have has to do with order of execution.
One thing you should be careful about in Python but not C: exceptions can be raised everywhere, so just because you see a close() call below the corresponding open() call does not mean that call is actually reached. Use try/finally everywhere (or the with statement in new enough pythons) to make sure opened files are closed (and other kinds of resources that can be freed explicitly are freed).
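For example (a minimal sketch of the two equivalent patterns just mentioned; data.txt is a placeholder filename):
f = open('data.txt')
try:
    data = f.read()
finally:
    f.close()               # runs even if read() raises

# or, with the with statement (Python 2.6+, or 2.5 with "from __future__ import with_statement"):
with open('data.txt') as f:
    data = f.read()         # the file is closed automatically when the block exits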
If your problem is with memory usage, not some other kind of resource, debugging it can be harder. Memory cannot be freed explicitly in python. The cpython vm (which you are most likely using) does release memory as soon as the last reference to it goes away, but sometimes cannot free memory trapped in cycles with objects that have a __del__ method. If you have any __del__ methods of your own or use classes that have them this may be part of your problem.
Your actual question (the memory one, not the order of execution one) is hard to answer without seeing more code, though. It may be something obvious (or there may at least be some obvious way to reduce the amount of memory you need).
"if I open a file of data, read in the data, close the file, then do other stuff do I know for sure that the file is closed before the lines after I close the file are executed??"
Closed yes.
Released from memory. No. No guarantees about when garbage collection will occur.
Further, closing a file says nothing about all the other variables you've created and the other objects you've left laying around attached to those variables.
There's no "order of operations" issue.
I'll bet that you have too many global variables with too many copies of the data.
If the data consists of columns and rows, why not use the built in file iterator to fetch one line at a time?
f = open('file.txt')
first_line = f.next()
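For example, to process the file one row at a time instead of holding all 17 million rows in memory (process_line is a placeholder for whatever per-row work you do):
f = open('file.txt')
for line in f:              # the file object yields one line at a time
    process_line(line)
f.close()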
popen2.py:
class Popen4(Popen3):
    childerr = None

    def __init__(self, cmd, bufsize=-1):
        _cleanup()
        self.cmd = cmd
        p2cread, p2cwrite = os.pipe()
        c2pread, c2pwrite = os.pipe()
        self.pid = os.fork()
        if self.pid == 0:
            # Child
            os.dup2(p2cread, 0)
            os.dup2(c2pwrite, 1)
            os.dup2(c2pwrite, 2)
            self._run_child(cmd)
        os.close(p2cread)
        self.tochild = os.fdopen(p2cwrite, 'w', bufsize)
        os.close(c2pwrite)
        self.fromchild = os.fdopen(c2pread, 'r', bufsize)
man 2 fork:
The fork() function may fail if:
[ENOMEM]
Insufficient storage space is available.
os.popen4 eventually calls popen2.Popen4.__init__, which must fork in order to create the child process that you try to read from/write to. This underlying call is failing, likely due to resource exhaustion.
You may be using too much memory elsewhere, causing fork to attempt to use more than the RLIMIT_DATA or RLIMIT_RSS limit given to your user. As recommended by Python memory profiler - Stack Overflow, Heapy can help you determine whether this is the case.
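A minimal sketch of what checking the heap with Heapy looks like (this assumes the guppy package is installed; Python 2 syntax to match the traceback above):
from guppy import hpy

hp = hpy()
print hp.heap()    # prints a size breakdown of the objects currently alive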