How to create uncollectable garbage in Python?

I have a large long-running server, and, over weeks the memory usage steadily climbs.
Generally, as pointed out below, it's unlikely that leaks are my problem; however, I don't have a lot to go on, so I want to see if there are any leaks.
Getting at console output is tricky so I'm not running with gc.set_debug(). This is not a big problem though, as I have easily added an API to get it to run gc.collect() and then iterate through gc.garbage and send the results back out to me over HTTP.
My problem is that when I run it locally for a short time, gc.garbage is always empty. I can't test the bit of code that lists the leaks before I deploy it.
Is there a trivial recipe for creating an uncollectable bit of garbage so I can test my code that lists the garbage?

Any cycle of finalizable objects (that is, objects with a __del__ method) is uncollectable on Python 2 and on Python 3 before 3.4, because the garbage collector does not know which order to run the finalizers in (since Python 3.4, PEP 442 lets such cycles be collected, so this recipe no longer populates gc.garbage there):
>>> class Finalizable:
...     def __del__(self): pass
...
>>> a = Finalizable()
>>> b = Finalizable()
>>> a.x = b
>>> b.x = a
>>> del a
>>> del b
>>> import gc
>>> gc.collect()
4
>>> gc.garbage
[<__main__.Finalizable instance at 0x1004e0b48>,
<__main__.Finalizable instance at 0x1004e73f8>]
But as a general point, it seems unlikely to me that your problem is due to uncollectable garbage, unless you are in the habit of using finalizers. It's more likely due to the accumulation of live objects, or to fragmentation of memory (since Python uses a non-moving collector).
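If you are testing on Python 3.4 or later, the __del__-cycle recipe above will no longer leave anything in gc.garbage. A hedged workaround for populating gc.garbage purely for testing is gc.set_debug(gc.DEBUG_SAVEALL), which tells the collector to append every unreachable object it finds to gc.garbage instead of freeing it (the helper function and names below are just illustrative):
import gc

def make_cycle():
    # two dicts that reference each other; unreachable once we return
    a, b = {}, {}
    a["other"] = b
    b["other"] = a

gc.set_debug(gc.DEBUG_SAVEALL)   # save unreachable objects instead of freeing them
make_cycle()
print(gc.collect())              # number of unreachable objects found
print(gc.garbage)                # the two dicts now show up here
gc.set_debug(0)                  # turn the flag off again afterwards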

Related

Strange behavior with `weakref` in IPython

While coding a cache class for one of my projects I wanted to try out the weakref package as its functionality seems to fit this purpose very well. The class is supposed to cache blocks of data from disk as readable and writable buffers for ctypes.Structures. The blocks of data are supposed to be discarded when no structure is pointing to them, unless the buffer was modified due to some change to the structures.
To prevent dirty blocks from being garbage collected my idea was to set block.some_attr_name = block in the structures' __setattr__. Even when all structures are eventually garbage collected, the underlying block of data still has a reference count of at least 1 because block.some_attr_name references block.
I wanted to test this idea, so I opened up an IPython session and typed
import weakref

class Test:
    def __init__(self):
        self.self = self

ref = weakref.ref(Test(), lambda r: print("Test was trashed"))
As expected, this didn't print Test was trashed. But when I went to type del ref().self to see whether the referent would be discarded, Test was trashed appeared while I was still typing, before I hit Enter. Oddly enough, even hitting the arrow keys or resizing the command-line window after assigning ref will cause the referent to be trashed, even though the referent's reference count cannot drop to zero, because it references itself. This behavior persists even if I artificially increase the reference count by replacing self.self = self with self.refs = [self for i in range(20)].
I couldn't reproduce this behavior in the standard python.exe interpreter (interactive session) which is why I assume this behavior to be tied to IPython (but I am not actually sure about this).
Is this behavior expected with the devil hiding somewhere in the details of IPython's implementation or is this behavior a bug?
Edit 1: It gets stranger. If in the IPython session I run
import weakref

class Test:
    def __init__(self):
        self.self = self

test = Test()
ref = weakref.ref(test, lambda r: print("Aaaand it's gone...", flush=True))
del test
the referent is not trashed immediately. But if I hold down any key, "typing" out "aaaa..." (~200 a's), suddenly Aaaand it's gone... appears. And since I added flush = True I can rule out buffering for the late response. I definitely wouldn't expect IPython to be decreasing reference counts just because I was holding down a key. Maybe Python itself checks for circular references in some garbage collection cycles?
(tested with IPython 7.30.1 running Python 3.10.1 on Windows 10 x64)
In Python's documentation on Extending and Embedding the Python Interpreter under subsection 1.10 Reference Counts the second to last paragraph reads:
While Python uses the traditional reference counting implementation, it also offers a cycle detector that works to detect reference cycles. This allows applications to not worry about creating direct or indirect circular references; these are the weakness of garbage collection implemented using only reference counting. Reference cycles consist of objects which contain (possibly indirect) references to themselves, so that each object in the cycle has a reference count which is non-zero. Typical reference counting implementations are not able to reclaim the memory belonging to any objects in a reference cycle, or referenced from the objects in the cycle, even though there are no further references to the cycle itself.
So I guess my idea of circular references to prevent garbage collection from eating my objects won't work out.
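That reading seems right: the cycle detector can reclaim self-referencing objects even though their reference counts never reach zero, and when it does, weak references to them are cleared and their callbacks fire. The collector runs when allocation thresholds are crossed, and IPython's interactive machinery allocates objects while you type, which plausibly explains the mid-keystroke trashing. A minimal sketch in plain CPython, forcing the collection explicitly instead of waiting for a threshold:
import gc
import weakref

class Test:
    def __init__(self):
        self.self = self   # reference cycle: the refcount never drops to zero

ref = weakref.ref(Test(), lambda r: print("Test was trashed"))

# Reference counting alone cannot reclaim the instance, but the cyclic
# collector can; it clears the weak reference and fires the callback.
gc.collect()               # prints "Test was trashed"
print(ref())               # None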

What does the return value of gc.collect() actually mean?

When I do gc.collect() in my Python script, it returns values like 86, 14, etc.
I understand that this call performs garbage collection and I've already gone through the documentation here. But can someone explain through an example what do the numbers actually mean?
As you're being chided for not reading the docs yourself ;-), it returns "the number of unreachable objects". But the docs aren't really detailed enough to know exactly what that means.
It's really the sum of two numbers: the number of objects that were identified as garbage and actually freed, plus the number of objects that were identified as garbage but could not be freed. For an example of the latter, objects directly involved in unreachable ("garbage") reference cycles containing at least one object with a __del__ method could not be freed automatically before Python 3.4.
Here's an example under Python 3.6.5:
>>> import gc
>>> gc.collect()    # no trash to begin with
0
>>> a = []
>>> a.append(a)     # create an object that references itself
>>> gc.collect()    # but it's not trash, because the name "a" is bound to it
0
>>> a = None        # break the binding; _now_ it's trash
>>> # but refcounting alone can't discover that it's trash
>>> gc.collect()    # .collect() finds this cyclic trash
1                   # and reports that one trash object was collected
In general, there's scant use for this return value.
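If you do want the two components separately rather than their sum, one option on Python 3.4+ is gc.get_stats(), which exposes cumulative per-generation counts of collected and uncollectable objects. A small sketch:
import gc

gc.collect()   # run a full collection first
for generation, stats in enumerate(gc.get_stats()):
    # cumulative counters since interpreter start, one dict per generation
    print(generation, stats["collected"], stats["uncollectable"])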

Will a Python generator be garbage collected if it will not be used any more but hasn't reached StopIteration yet?

When a generator is not used any more, it should be garbage collected, right? I tried the following code but I am not sure which part I got wrong.
import weakref
import gc

def countdown(n):
    while n:
        yield n
        n -= 1

cd = countdown(10)
cdw = weakref.ref(cd)()
print cd.next()
gc.collect()
print cd.next()
gc.collect()
print cdw.next()
On the second-to-last line I call the garbage collector, and since there are no more uses of cd, gc should free cd, right? But when I call cdw.next(), it still prints 8. I tried a few more cdw.next() calls, and it successfully printed everything remaining until StopIteration.
I tried this because I wanted to understand how generators and coroutines work. On slide 28 of David Beazley's PyCon presentation "A Curious Course on Coroutines and Concurrency", he said that a coroutine might run indefinitely and that we should use .close() to shut it down. Then he said that the garbage collector will call .close(). In my understanding, once we have called .close() ourselves, gc will call .close() again. Will gc receive a warning that it can't call .close() on an already closed coroutine?
Thanks for any inputs.
Due to the dynamic nature of Python, the reference to cd isn't freed until you reach the end of the current routine, because (at least) the CPython implementation doesn't "read ahead". (If you don't know which Python implementation you're using, it's almost certainly CPython.) There are a number of subtleties that make it virtually impossible for the interpreter to determine, in the general case, whether an object that still exists in the current namespace can be freed (e.g. you can still reach it via a call to locals()).
In some less general cases, other Python implementations may be able to free an object before the end of the current stack frame, but CPython doesn't bother.
Try this code instead, which demonstrates that the generator is free to be cleaned up in CPython:
import weakref

def countdown(n):
    while n:
        yield n
        n -= 1

def func():
    a = countdown(10)
    b = weakref.ref(a)
    print next(a)
    print next(a)
    return b

c = func()
print c()
Objects (including generators) are garbage collected when their reference count reaches 0 (in CPython; other implementations may work differently). In CPython, a reference count is decremented when a reference goes away: when a name is deleted with del, rebound to something else, or goes out of scope because the current namespace is discarded.
The important thing is that once there are no more references to an object, it is free to be cleaned up by the garbage collector. The details of how the implementation determines that there are no more references are left to the implementers of the particular python distribution you're using.
In your example, the generator won't get garbage collected until the end of the script. Python doesn't know if you're going to be using cd again, so it can't throw it away. To put it precisely, there's still a reference to your generator in the global namespace.
A generator will get GCed when its reference count drops to zero, just like any other object. Even if the generator is not exhausted.
This can happen under lots of normal circumstances: if it's bound to a local name that falls out of scope, if it's explicitly del'd, or if its owner gets GCed. But if any live objects (including namespaces) hold strong references to it, it won't get GCed.
The Python garbage collector isn't quite that smart. Even though you don't refer to cd any more after that line, the reference is still live in local variables, so it can't be collected. (In fact, it's possible that some code you're using might dig around in your local variables and resurrect it. Unlikely, but possible. So Python can't make any assumptions.)
If you want to make the garbage collector actually do something here, try adding:
del cd
This will remove the local variable, allowing the object to be collected.
The other answers have explained that gc.collect() won't garbage collect anything that still has references to it. There is still a live reference cd to the generator, so it will not be gc'ed until cd is deleted.
However in addition, the OP is creating a SECOND strong reference to the object using this line, which calls the weak reference object:
cdw = weakref.ref(cd)()
So if one were to do del cd and call gc.collect(), the generator would still not be gc'ed because cdw is also a reference.
To obtain an actual weak reference, do not call the weakref.ref object. Simply do this:
cdw = weakref.ref(cd)
Now when cd is deleted and garbage collected, the reference count will be zero and calling the weak reference will result in None, as expected.
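Putting both fixes together (keep only a genuine weak reference, then drop the strong name), a hedged Python 3 sketch of the test the OP seems to be after:
import gc
import weakref

def countdown(n):
    while n:
        yield n
        n -= 1

cd = countdown(10)
cdw = weakref.ref(cd)   # a real weak reference: do not call the ref object here

print(next(cd))         # 10
del cd                  # drop the only strong reference to the generator
gc.collect()            # not strictly needed in CPython, but makes the point
print(cdw())            # None: the unfinished generator has been collected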

Should I worry about circular references in Python?

Suppose I have code that maintains a parent/children structure. In such a structure I get circular references, where a child points to a parent and a parent points to a child. Should I worry about them? I'm using Python 2.5.
I am concerned that they will not be garbage collected and the application will eventually consume all memory.
"Worry" is misplaced, but if your program turns out to be slow, consume more memory than expected, or have strange inexplicable pauses, the cause is indeed likely to be in those garbage reference loops -- they need to be garbage collected by a different procedure than "normal" (acyclic) reference graphs, and that collection is occasional and may be slow if you have a lot of objects tied up in such loops (the cyclical-garbage collection is also inhibited if an object in the loop has a __del__ special method).
So, reference loops will not affect your program's correctness, but may affect its performance and/or footprint.
If and when you want to remove unwanted loops of references, you can often use the weakref module in Python's standard library.
If and when you want to exert more direct control (or perform debugging, see what exactly is happening) regarding cyclical garbage collection, use the gc module in Python's standard library.
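For the parent/child case specifically, a common pattern is to keep strong references in one direction only and make the back-pointer weak, so no cycle is formed in the first place. A hedged sketch (class and attribute names are illustrative):
import weakref

class Node:
    def __init__(self, parent=None):
        self.children = []          # strong references go parent -> child
        # the back-pointer is weak, so a child does not keep its parent alive
        self._parent = weakref.ref(parent) if parent is not None else None
        if parent is not None:
            parent.children.append(self)

    @property
    def parent(self):
        return self._parent() if self._parent is not None else None

root = Node()
child = Node(root)
assert child.parent is root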
Experimentally: you're fine:
import itertools

for i in itertools.count():
    a = {}
    b = {"a": a}
    a["b"] = b
It consistently stays at using 3.6 MB of RAM.
Python will detect the cycle and release the memory when there are no outside references.
Circular references are a normal thing to do, so I don't see a reason to be worried about them. Many tree algorithms require that each node have links to its children and its parent. They're also required to implement something like a doubly linked list.
I don't think you should worry. Try the following program and you will see that it won't consume all memory:
while True:
    a = range(100)
    b = range(100)
    a.append(b)
    b.append(a)
    a.append(a)
    b.append(b)
There seems to be an issue with references to bound methods held in a list on the instance. Here are two examples. The first one does not call __del__. The second one, using weakref, is OK for __del__. However, in this latter case the problem is that you cannot usefully take weak references to bound methods: http://docs.python.org/2/library/weakref.html
import sys, weakref

class One():
    def __init__(self):
        self.counters = [self.count]
    def __del__(self):
        print("__del__ called")
    def count(self):
        print(sys.getrefcount(self))

sys.getrefcount(One)
one = One()
sys.getrefcount(One)
del one
sys.getrefcount(One)

class Two():
    def __init__(self):
        self.counters = [weakref.ref(self.count)]
    def __del__(self):
        print("__del__ called")
    def count(self):
        print(sys.getrefcount(self))

sys.getrefcount(Two)
two = Two()
sys.getrefcount(Two)
del two
sys.getrefcount(Two)
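On Python 3.4+ the standard library has weakref.WeakMethod for exactly this situation: it holds a bound method weakly, without keeping the instance alive, so the cycle in the first example is avoided. A hedged sketch (the class name is illustrative):
import weakref

class Three:
    def __init__(self):
        # WeakMethod does not keep `self` alive, so no reference cycle is formed
        self.counters = [weakref.WeakMethod(self.count)]
    def __del__(self):
        print("__del__ called")
    def count(self):
        print("counting")

three = Three()
three.counters[0]()()   # dereference the WeakMethod, then call the method
del three               # "__del__ called" is printed immediately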

Is it possible to have an actual memory leak in Python because of your code?

I don't have a code example, but I'm curious whether it's possible to write Python code that results in essentially a memory leak.
It is possible, yes.
It depends on what kind of memory leak you are talking about. Within pure python code, it's not possible to "forget to free" memory such as in C, but it is possible to leave a reference hanging somewhere. Some examples of such:
an unhandled traceback object that is keeping an entire stack frame alive, even though the function is no longer running
while game.running():
    try:
        key_press = handle_input()
    except SomeException:
        etype, evalue, tb = sys.exc_info()
        # do something with tb, like inspecting or printing the traceback
In this (admittedly silly) game-loop example, we assigned the traceback to the local tb. We had good intentions, but tb holds frame information about the stack of whatever was happening in handle_input, all the way down to whatever that called. Presuming your game continues, tb is kept alive even during your next call to handle_input, and maybe forever. The docs for exc_info now mention this potential circular-reference issue and recommend simply not assigning tb if you don't absolutely need it. If you need the traceback as text, consider traceback.format_exc.
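A hedged sketch of the safer pattern, keeping only the formatted text and never storing the traceback object (the surrounding names are stand-ins, not real APIs):
import traceback

def handle_input():
    raise ValueError("bad key")   # stand-in for the real input handler

def game_tick():
    try:
        handle_input()
    except ValueError:
        # keep only the formatted string; no traceback or frame objects stay alive
        print("input failed:\n" + traceback.format_exc())

game_tick()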
storing values in a class or global scope instead of instance scope, and not realizing it.
This one can happen in insidious ways, but it often happens when you define mutable values at class scope.
class Money(object):
    name = ''
    symbols = []   # this is the dangerous line here

    def set_name(self, name):
        self.name = name

    def add_symbol(self, symbol):
        self.symbols.append(symbol)
In the above example, say you did
m = Money()
m.set_name('Dollar')
m.add_symbol('$')
You'll probably find this particular bug quickly, but in this case you put a mutable value at class scope and even though you correctly access it at instance scope, it's actually "falling through" to the class object's __dict__.
Used in certain contexts, such as holding objects, this can make your application's heap grow forever, and it would cause issues in, say, a production web application that doesn't restart its processes occasionally.
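A hedged sketch of the usual fix, giving each instance its own list in __init__ so nothing accumulates on the class object:
class Money(object):
    def __init__(self):
        self.name = ''
        self.symbols = []          # a fresh list per instance, never shared

    def set_name(self, name):
        self.name = name

    def add_symbol(self, symbol):
        self.symbols.append(symbol)

m = Money()
m.add_symbol('$')
assert 'symbols' not in Money.__dict__   # nothing piles up on the class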
Cyclic references in classes which also have a __del__ method.
Ironically, the existence of a __del__ makes it impossible for the cyclic garbage collector to clean an instance up (on Python 2 and on Python 3 before 3.4; later versions handle this). Say you had something where you wanted to write a destructor for finalization purposes:
class ClientConnection(...):
    def __del__(self):
        if self.socket is not None:
            self.socket.close()
            self.socket = None
Now this works fine on its own, and you may be led to believe it's being a good steward of OS resources to ensure the socket is 'disposed' of.
However, if ClientConnection kept a reference to say, User and User kept a reference to the connection, you might be tempted to say that on cleanup, let's have user de-reference the connection. This is actually the flaw, however: the cyclic GC doesn't know the correct order of operations and cannot clean it up.
The solution to this is to ensure you do cleanup on say, disconnect events by calling some sort of close, but name that method something other than __del__.
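A hedged sketch of that suggestion, using an explicit close() (optionally wired up as a context manager) instead of relying on __del__; the class and attributes are illustrative:
class ClientConnection(object):
    def __init__(self, sock, user):
        self.socket = sock
        self.user = user            # User may hold a reference back to us

    def close(self):
        # explicit, deterministic cleanup: release the OS resource and break the cycle
        if self.socket is not None:
            self.socket.close()
            self.socket = None
        self.user = None

    # optional: allow `with ClientConnection(...) as conn:` usage
    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()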
poorly implemented C extensions, or not properly using C libraries as they are supposed to be.
In Python, you trust in the garbage collector to throw away things you aren't using. But if you use a C extension that wraps a C library, the majority of the time you are responsible for making sure you explicitly close or de-allocate resources. Mostly this is documented, but a python programmer who is used to not having to do this explicit de-allocation might throw away the handle (like returning from a function or whatever) to that library without knowing that resources are being held.
Scopes which contain closures which contain a whole lot more than you could've anticipated
class User:
    def set_profile(self, profile):
        def on_completed(result):
            if result.success:
                self.profile = profile

        self._db.execute(
            change={'profile': profile},
            on_complete=on_completed
        )
In this contrived example, we appear to be using some sort of 'async' call that will call us back at on_completed when the DB call is done (the implementation could have been promises; it ends up with the same outcome).
What you may not realize is that the on_completed closure binds a reference to self in order to perform the self.profile assignment. Now, perhaps the DB client keeps track of active queries and of the closures to call when they're done (since it's async), and say it crashes for whatever reason. If the DB client doesn't correctly clean up its callbacks, then it now holds a reference to on_completed, which holds a reference to the User, which keeps a _db - you've created a circular reference that may never get collected.
(Even without a circular reference, the fact that closures bind locals and even instances sometimes may cause values you thought were collected to be living for a long time, which could include sockets, clients, large buffers, and entire trees of things)
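A small sketch just to show that a stored closure on its own is enough to keep the instance alive (the names are illustrative, not a real DB client):
import gc
import weakref

class User:
    pass

pending_callbacks = []                  # stands in for the DB client's callback list

def set_profile(user, profile):
    def on_completed(result):
        user.profile = profile          # the closure captures `user`
    pending_callbacks.append(on_completed)

u = User()
set_profile(u, {"name": "x"})
ref = weakref.ref(u)
del u
gc.collect()
print(ref())    # still a live User object: the stored closure keeps it alive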
Default parameters which are mutable types
import time

def foo(a=[]):
    a.append(time.time())
    return a
This is a contrived example, but one could be led to believe that each call starts with a fresh empty list for a, when in fact every call appends to the very same list object. This again might cause unbounded growth without you realizing it.
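A hedged sketch of the conventional fix, using None as the default and creating the list inside the function:
import time

def foo(a=None):
    if a is None:
        a = []              # a fresh list on every call; nothing is shared
    a.append(time.time())
    return a

print(len(foo()))   # 1
print(len(foo()))   # 1 again, not 2: the default is no longer shared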
The classic definition of a memory leak is memory that was used once, and now is not, but has not been reclaimed. That is nearly impossible with pure Python code. But as Antoine points out, you can easily have the effect of consuming all your memory inadvertently by allowing data structures to grow without bound, even if you don't need to keep all of the data around.
With C extensions, of course, you are back in unmanaged territory, and anything is possible.
Of course you can. The typical example of a memory leak is if you build a cache that you never flush manually and that has no automatic eviction policy.
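A hedged sketch of that leaky pattern next to one common mitigation, a bounded cache via functools.lru_cache (the function names are illustrative):
import functools

_cache = {}                              # leaky: this module-level dict only ever grows

def expensive_lookup(key):
    if key not in _cache:
        _cache[key] = compute(key)       # entries are never evicted
    return _cache[key]

@functools.lru_cache(maxsize=1024)       # bounded: old entries are evicted past maxsize
def expensive_lookup_bounded(key):
    return compute(key)

def compute(key):
    return key * 2                       # stand-in for real work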
In the sense of orphaning allocated objects after they go out of scope because you forgot to deallocate them, no; Python will automatically deallocate objects that go out of scope (garbage collection). But in the sense that @Antoine is talking about, yes.
Since many modules are written in C, yes, it is possible to have memory leaks.
Imagine you are using a GUI drawing context (e.g. with wxPython): you can create memory buffers, but if you forget to release them you will have memory leaks. In this case, the C++ functions of the wx API are wrapped for Python.
A bigger misuse: imagine you override these wx widget methods within Python... memory leaks assured.
I create an object with a heavy attribute to show it in the process memory usage.
Then I create a dictionary which refers to itself a large number of times.
Then I delete the object and ask the GC to collect garbage. It collects none.
Then I check the process RAM footprint - it is the same.
Here you go, memory leak!
α python
Python 2.7.15 (default, Oct 2 2018, 11:47:18)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import gc
>>> class B(object):
...     b = list(range(1 * 10 ** 8))
...
>>>
[1]+ Stopped python
~/Sources/plan9port [git branch:master]
α ps aux | grep python
alexander.pugachev 85164 0.0 19.0 7562952 3188184 s010 T 2:08pm 0:03.78 /usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python
~/Sources/plan9port [git branch:master]
α fg
python
>>> b = B()
>>> for i in range(1000):
...     b.a = {'b': b}
...
>>>
[1]+ Stopped python
~/Sources/plan9port [git branch:master]
α ps aux | grep python
alexander.pugachev 85164 0.0 19.0 7579336 3188264 s010 T 2:08pm 0:03.79 /usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python
~/Sources/plan9port [git branch:master]
α fg
python
>>> b.a['b'].a
{'b': <__main__.B object at 0x109204950>}
>>> del(b)
>>> gc.collect()
0
>>>
[1]+ Stopped python
~/Sources/plan9port [git branch:master]
α ps aux | grep python
alexander.pugachev 85164 0.0 19.0 7579336 3188268 s010 T 2:08pm 0:05.13 /usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python
