Python: cracking the gc enigma

I am trying to understand gc because I have a large list in a program that I need to delete to free up some badly needed memory. The basic question I want to answer is: how can I find what is being tracked by gc and what has been freed? The following code illustrates my problem:
import gc
old=gc.get_objects()
a=1
new=gc.get_objects()
b=[e for e in new if e not in old]
print "Problem 1: len(new)-len(old)>1 :", len(new), len(old)
print "Problem 2: none of the element in b contain a or id(a): ", a in b, id(a) in b
print "Problem 3: The reference counts are insanely high, WHY?? "
IMHO this is weird behavior that isn't addressed in the docs. For starters, why does assigning a single variable create multiple entries for the gc? And why is none of them the variable I made? Where is the entry for the variable I created in get_objects()?
EDIT: In response to Martijn's first response I checked the following:
a="foo"
print a in gc.get_objects()
Still no go :( How can I check whether a is being tracked by gc?

The result of gc.get_objects() is itself not tracked; it would create a circular reference otherwise:
>>> import gc
>>> print gc.get_objects.__doc__
get_objects() -> [...]
Return a list of objects tracked by the collector (excluding the list
returned).
You do not see a listed because it references one of the small-integer singletons. CPython re-uses the same set of int objects for values between -5 and 256. As such, a = 1 does not create a new object for the GC to track. Nor will you see any other primitive types listed.
CPython garbage collection only needs to track container types, i.e. types that can reference other values, because the only thing the GC needs to do is break circular references.
Note that by the time any Python script starts, some automatic code has already been run. site.py sets up your Python path, for example, which involves lists, mappings, etc. Then there are the memoized int values mentioned above; CPython also caches tuple() objects for re-use, etc. As a result, on start-up, easily 5k+ objects are already alive before a single line of your code has run.
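To check whether a particular object is tracked, gc.is_tracked() (available in CPython 2.7 and later) answers this directly; a minimal sketch:
>>> import gc
>>> gc.is_tracked(1)       # small ints are immutable non-containers: never tracked
False
>>> gc.is_tracked("foo")   # same for strings
False
>>> gc.is_tracked([1])     # container types such as lists are tracked
True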

Related

Garbage collection for a simple python class

I am writing a python class like this:
class MyImageProcessor:
    def __init__(self, image, metadata):
        self.image = image
        self.metadata = metadata
Both image and metadata are objects of a class written by a
colleague. Now I need to make sure there is no waste of memory. I am thinking of defining a quit() method like this,
def quit(self):
    self.image = None
    self.metadata = None
    import gc
    gc.collect()
and suggest that users call quit() systematically. I would like to know whether this is the right way. In particular, do the instructions in quit() above guarantee that unused memory is properly collected?
Alternatively, I could rename quit() to the built-in __exit__() and suggest that users use the "with" syntax. But my question is more about whether the instructions in quit() indeed fulfill the garbage-collection work one would need in this situation.
Thank you for your help.
In Python every object has a built-in reference count; the variables (names) you create are only pointers to objects. There are mutable and immutable objects (for example, if you change the value of an integer, the name is re-pointed to a different integer object, while changing an element of a list does not rebind the list's name).
The reference count basically records how many references use that data, and it is incremented/decremented automatically.
The garbage collector will destroy the objects with zero references (actually not always, it takes extra steps to save time). You should check out this article.
Similarly to object constructors (__init__()), which are called on object creation, you can define destructors (__del__()), which are executed on object deletion (usually when the reference count drops to 0). According to this article, in Python they are not needed as much as in C++, because Python has a garbage collector that handles memory management automatically. You can check out those examples too.
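As a small illustrative sketch (the class name here is made up), in CPython the destructor runs as soon as the last reference disappears:
class Tracker:
    def __del__(self):
        # runs when the object is about to be destroyed
        print("Tracker instance destroyed")

t = Tracker()
del t   # the only reference is gone, so in CPython __del__ runs immediately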
Hope it helps :)
No need for quit() (assuming you're using the C-based Python, CPython).
Python uses two methods of garbage collection, as alluded to in the other answers.
First, there's reference counting. Essentially each time you add a reference to an object it gets incremented & each time you remove the reference (e.g., it goes out of scope) it gets decremented.
From https://devguide.python.org/garbage_collector/:
When an object’s reference count becomes zero, the object is deallocated. If it contains references to other objects, their reference counts are decremented. Those other objects may be deallocated in turn, if this decrement makes their reference count become zero, and so on.
You can get information about current reference counts for an object using sys.getrefcount(x), but really, why bother.
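For what it's worth, a quick sketch of what that looks like (the counts include a temporary reference created by passing the object to getrefcount itself):
import sys

data = [1, 2, 3]
print(sys.getrefcount(data))  # typically 2: the name data plus getrefcount's own argument
alias = data                  # bind a second name to the same list
print(sys.getrefcount(data))  # typically 3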
The second way is through garbage collection (gc). [Reference counting is a type of garbage collection, but python specifically calls this second method "garbage collection" -- so we'll also use this terminology. ] This is intended to find those places where reference count is not zero, but the object is no longer accessible. ("Reference cycles") For example:
class MyObj:
    pass

x = MyObj()
x.self = x
Here, x refers to itself, so the actual reference count for x is more than 1. You can call del x but that merely removes it from your scope: it lives on because "someone" still has a reference to it.
gc, and specifically gc.collect() goes through objects looking for cycles like this and, when it finds an unreachable cycle (such as your x post deletion), it will deallocate the whole lot.
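Continuing the MyObj example above, a sketch of what that looks like (the exact number reported can vary between Python versions, because the instance's __dict__ is counted as well):
import gc

class MyObj:
    pass

x = MyObj()
x.self = x           # reference cycle: the instance refers to itself
del x                # the name is gone, but the cycle keeps the object alive
print(gc.collect())  # non-zero: the collector found and freed the unreachable cycle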
Back to your question: You don't need to have a quit() object because as soon as your MyImageProcessor object goes out of scope, it will decrement reference counters for image and metadata. If that puts them to zero, they're deallocated. If that doesn't, well, someone else is using them.
Setting them to None first merely decrements the reference counts right then; when the MyImageProcessor instance later goes out of scope, it won't decrement those reference counts again, because it no longer holds the image or metadata objects. So you're just doing explicitly what Python already does for you for free: no more, no less.
You didn't create a cycle, so your calling gc.collect() is unlikely to change anything.
Check out https://devguide.python.org/garbage_collector/ if you are interested in more earthy details.
Not sure if it makes sense, but to my logic you could use gc.get_count() before and after gc.collect() to see if something has been removed.
what are count0, count1 and count2 values returned by the Python gc.get_count()
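For example (a sketch; the actual numbers differ from run to run):
import gc

print(gc.get_count())  # e.g. (475, 9, 2): current collection counts for generations 0, 1 and 2
gc.collect()           # force a full collection
print(gc.get_count())  # e.g. (6, 0, 0): the counts drop after the collection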

Is there any usage of self-referential lists or circular reference in list, eg. appending a list to itself

So if I have a list a and append a to it, I will get a list that contains its own reference.
>>> a = [1,2]
>>> a.append(a)
>>> a
[1, 2, [...]]
>>> a[-1][-1][-1]
[1, 2, [...]]
And this basically results in seemingly infinite recursions.
And not only in lists, dictionaries as well:
>>> b = {'a':1,'b':2}
>>> b['c'] = b
>>> b
{'a': 1, 'b': 2, 'c': {...}}
It could have been a good way to store the list in its last element and modify the other elements, but that doesn't work, as any change is seen through every recursive reference.
I get why this happens, i.e. due to their mutability. However, I am interested in actual use-cases of this behavior. Can somebody enlighten me?
The use case is that Python is a dynamically typed language, where anything can reference anything, including itself.
List elements are references to other objects, just like variable names and attributes and the keys and values in dictionaries. The references are not typed; variables and lists are not restricted to referencing only, say, integers or floating point values. Every reference can reference any valid Python object. (Python is also strongly typed, in that the objects have a specific type that won't just change; strings remain strings, lists stay lists.)
So, because Python is dynamically typed, the following:
foo = []
# ...
foo = False
is valid, because foo isn't restricted to a specific type of object, and the same goes for Python list objects.
The moment your language allows this, you have to account for recursive structures, because containers are allowed to reference themselves, directly or indirectly. The list representation takes this into account by not blowing up when you do this and ask for a string representation. Instead, it shows you a [...] entry when there is a circular reference. This happens not just for direct references either; you can create an indirect reference too:
>>> foo = []
>>> bar = []
>>> foo.append(bar)
>>> bar.append(foo)
>>> foo
[[[...]]]
foo is the outermost pair of brackets together with the [...] entry; bar is the pair of brackets in the middle.
There are plenty of practical situations where you'd want a self-referencing (circular) structure. The built-in OrderedDict object uses a circular linked list to track item order, for example. This is not normally easily visible as there is a C-optimised version of the type, but we can force the Python interpreter to use the pure-Python version (you want to use a fresh interpreter, this is kind-of hackish):
>>> import sys
>>> class ImportFailedModule:
...     def __getattr__(self, name):
...         raise ImportError
...
>>> sys.modules["_collections"] = ImportFailedModule() # block the extension module from being loaded
>>> del sys.modules["collections"] # force a re-import
>>> from collections import OrderedDict
now we have a pure-python version we can introspect:
>>> od = OrderedDict()
>>> vars(od)
{'_OrderedDict__hardroot': <collections._Link object at 0x10a854e00>, '_OrderedDict__root': <weakproxy at 0x10a861130 to _Link at 0x10a854e00>, '_OrderedDict__map': {}}
Because this ordered dict is empty, the root references itself:
>>> od._OrderedDict__root.next is od._OrderedDict__root
True
just like a list can reference itself. Add a key or two and the linked list grows, but remains linked to itself, eventually:
>>> od["foo"] = "bar"
>>> od._OrderedDict__root.next is od._OrderedDict__root
False
>>> od._OrderedDict__root.next.next is od._OrderedDict__root
True
>>> od["spam"] = 42
>>> od._OrderedDict__root.next.next is od._OrderedDict__root
False
>>> od._OrderedDict__root.next.next.next is od._OrderedDict__root
True
The circular linked list makes it easy to alter the key ordering without having to rebuild the whole underlying hash table.
However, I am interested in actual use-cases of this behavior. Can somebody enlighten me?
I don't think there are many useful use-cases for this. The reason this is allowed is because there could be some actual use-cases for it and forbidding it would make the performance of these containers worse or increase their memory usage.
Python is dynamically typed and you can add any Python object to a list. That means one would need to make special precautions to forbid adding a list to itself. This is different from (most) typed-languages where this cannot happen because of the typing-system.
So in order to forbid such recursive data structures one would need to check, on every addition/insertion/mutation, whether the newly added object already participates in a higher layer of the data structure. In the worst case that means checking whether the newly added element appears anywhere it could create a recursive data structure. The problem here is that the same list can be referenced in multiple places, can already be part of multiple data structures, and data structures such as list/dict can be (almost) arbitrarily deep. That detection would either be slow (e.g. a linear search) or take quite a bit of memory (a lookup table). So it's cheaper to simply allow it.
The reason Python detects this when printing is that you don't want the interpreter entering an infinite loop or hitting a RecursionError or stack overflow. That's why, for some operations like printing (but also deepcopy), Python temporarily creates a lookup table to detect these recursive data structures and handles them appropriately.
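For example, both repr() and copy.deepcopy() keep such a lookup (a memo of objects already seen), so they terminate instead of recursing forever; a small sketch:
import copy

a = [1, 2]
a.append(a)            # a now contains itself
print(a)               # [1, 2, [...]] -- repr detects the cycle instead of recursing
b = copy.deepcopy(a)   # deepcopy keeps a memo of objects it has already copied
print(b[2] is b)       # True: the copy is self-referential in the same way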
Consider building a state machine that parses a string of digits and checks whether the number is divisible by 25. You could model each node as a list with 10 outgoing edges, with some connections going back to the node itself:
def canDiv25(s):
    n0, n1, n1g, n2 = [], [], [], []
    n0.extend((n1, n0, n2, n0, n0, n1, n0, n2, n0, n0))
    n1.extend((n1g, n0, n2, n0, n0, n1, n0, n2, n0, n0))
    n1g.extend(n1)
    n2.extend((n1, n0, n2, n0, n0, n1g, n0, n2, n0, n0))
    cn = n0
    for c in s:
        cn = cn[int(c)]
    return cn is n1g

for i in range(144):
    print("%d %d" % (i, canDiv25(str(i))), end='\t')
While this state machine by itself has little practical use, it shows what could happen. Alternatively, you could have a simple adventure game where each room is represented as a dictionary: you can go, for example, NORTH, but that room of course has a back link to SOUTH. Also, sometimes game developers make it so that, for example, to simulate a tricky path in some dungeon, the way in the NORTH direction points back to the room itself.
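A tiny sketch of that room idea (all names here are made up for illustration):
cave = {"name": "Cave"}                  # each room is a dict; exits reference other rooms
hall = {"name": "Hall", "SOUTH": cave}
cave["NORTH"] = hall                     # back link: the two rooms reference each other
cave["EAST"] = cave                      # a tricky path leading straight back to the same room

room = cave["NORTH"]                     # walk north
print(room["name"])                      # Hall
print(room["SOUTH"] is cave)             # True: going back south returns to the cave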
A very simple application of this would be a circular linked list where the last node in a list references the first node. These are useful for creating infinite resources, state machines or graphs in general.
def to_circular_list(items):
    head, *tail = items
    first = {"elem": head}
    current = first
    for item in tail:
        current['next'] = {"elem": item}
        current = current['next']
    current['next'] = first
    return first

to_circular_list([1, 2, 3, 4])
If it's not obvious how that relates to having a self-referencing object, think about what would happen if you only called to_circular_list([1]): you would end up with a data structure that looks like
item = {
    "elem": 1,
    "next": item   # conceptually; this is not literal runnable Python
}
If the language didn't support this kind of direct self referencing, it would be impossible to use circular linked lists and many other concepts that rely on self references as a tool in Python.
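As a quick check, walking the structure returned by to_circular_list above wraps around after the last element (a sketch; output shown in the comment):
node = to_circular_list([1, 2, 3, 4])
for _ in range(6):                     # take six steps around the four-element ring
    print(node["elem"], end=" ")       # prints: 1 2 3 4 1 2
    node = node["next"]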
The reason this is possible is simply because the syntax of Python doesn't prohibit it, much in the way any C or C++ object can contain a reference to itself. An example might be: https://www.geeksforgeeks.org/self-referential-structures/
As @MSeifert said, you will generally get a RecursionError at some point if you try to access the list repeatedly from itself. Code that uses this pattern, like this:
a = [1, 2]
a.append(a)

def loop(l):
    for item in l:
        if isinstance(item, list):
            loop(l)
        else:
            print(item)
will eventually crash without some sort of termination condition. (Note that print(a) does not crash, though, since repr detects the cycle and shows [...].) However:
a = [1, 2]
while True:
    for item in a:
        print(item)
will run infinitely with the same expected output as the above. Very few recursive problems don't unravel into a simple while loop. For an example of recursive problems that do require a self-referential structure, look up Ackermann's function: http://mathworld.wolfram.com/AckermannFunction.html. This function could be modified to use a self-referential list.
There is certainly precedent for self-referential containers or tree structures, particularly in math, but on a computer they are all limited by the size of the call stack and CPU time, making it impractical to investigate them without some sort of constraint.

Possible to get "value of address" in python?

In the following, I can see that when adding an integer in python, it adds the integer, assigns that resultant value to a new memory address and then sets the variable to point to that memory address:
>>> a=100
>>> id(a)
4304852448
>>> a+=5
>>> id(a)
4304852608
Is there a way to see what the value at the (old) memory address 4304852448 (0x10096d5e0) is? For example: value_of(0x10096d5e0)
it adds the integer, assigns that resultant value to a new memory address
No; the object that represents the resulting value has a different memory address. Typically this object is created on the fly, but specifically for integers (which are immutable, and for some other types applying a similar optimization) the object may already exist and be reused.
Is there a way to see what the value at the (old) memory address 4304852448 (0x10096d5e0) is?
This question is not well posed. First off, "memory addresses" in this context are virtualized, possibly more than once. Second, in the Python model, addresses do not "have values". They are potentially the location of objects, which in turn represent values.
That said: at the location of the previous id(a), the old object will still be present if and only if it has not been garbage-collected. For this, it is sufficient that some other reference is held to the object. (The timing of garbage collection is implementation-defined, but the reference CPython implementation implements garbage collection by reference counting.)
You are not, in general, entitled to examine the underlying memory directly (except perhaps with some kind of process spy that has the appropriate permissions), because Python is just not that low-level of a language. (As in #xprilion's answer, the CPython implementation provides a lower-level memory interface via ctypes; however, the code there is effectively doing an unsafe cast.)
If you did (and assuming CPython), you would not see a binary representation of the integer 100 in the memory location indicated by calling id(a) the first time - you would see instead the first word of the PyObject struct used to implement objects in the C code, which would usually be the refcount, but could be a pointer to the next live heap object in a reference-tracing Python build (if I'm reading it correctly).
Once it's garbage-collected, the contents of that memory are again undefined. They depend on the implementation, and even specifically for the C implementation, they depend on what the standard library free() does with the memory. (Normally it will be left untouched, because there is no reason to zero it out and it takes time to do so. A debug build might write special values into that memory, because it helps detect memory corruption.)
You are trying to dereference the memory address, much like the pointers used in C/C++. The following code might help you -
import ctypes

a = 100
b = id(a)                   # remember the address of the int object currently bound to a
print(b, ": ", a)
a += 5                      # a is now bound to a different int object
print(id(a), ": ", a)
print(ctypes.cast(b, ctypes.py_object).value)  # dereference the old address
OUTPUT:
140489243334080 : 100
140489243334240 : 105
100
The above example establishes that the value stored in the previous address of a remains the same.
Hope this answers the question. Read #Karl's answer for more information about what's happening in the background - https://stackoverflow.com/a/58468810/2650341

What does the return value of gc.collect() actually mean?

When I do gc.collect() in my Python script, it returns values like 86, 14, etc.
I understand that this call performs garbage collection and I've already gone through the documentation here. But can someone explain through an example what do the numbers actually mean?
As the docs you're being chided for not reading say ;-) , it returns "the number of unreachable objects". But the docs aren't really detailed enough to know exactly what that means.
It's really the sum of two numbers: the number of objects that were identified as garbage and actually freed, plus the number of objects that were identified as garbage but could not be freed. For an example of the latter, objects directly involved in unreachable ("garbage") reference cycles containing at least one object with a __del__ method could not be freed automatically before Python 3.4.
Here's an example under Python 3.6.5:
>>> gc.collect() # no trash to begin with
0
>>> a = []
>>> a.append(a) # create an object that references itself
>>> gc.collect() # but it's not trash because name "a" is bound to it
0
>>> a = None # break the binding; _now_ it's trash
# but refcounting alone can't discover that it's trash
>>> gc.collect() # .collect() finds this cyclic trash
1 # and reports that one trash object was collected
In general, there's scant use for this return value.

Python: how to "kill" a class instance/object?

I want a Roach class to "die" when it reaches a certain amount of "hunger", but I don't know how to delete the instance. I may be making a mistake with my terminology, but what I mean to say is that I have a ton of "roaches" on the window and I want specific ones to disappear entirely.
I would show you the code, but it's quite long. I have the Roach class being appended into a Mastermind classes roach population list.
In general:
Each binding of a variable -> object increases the object's internal reference counter.
There are several usual ways to decrease a reference (i.e. to drop a variable -> object binding):
exiting the block of code where the variable was declared (used for the first time)
destroying an object, which releases the references held by all of its attributes/method variables
calling del variable, which also deletes the reference in the current context
After all references to an object are removed (counter == 0) it becomes a good candidate for garbage collection, but it is not guaranteed that it will be processed (reference here):
CPython currently uses a reference-counting scheme with (optional)
delayed detection of cyclically linked garbage, which collects most
objects as soon as they become unreachable, but is not guaranteed to
collect garbage containing circular references. See the documentation
of the gc module for information on controlling the collection of
cyclic garbage. Other implementations act differently and CPython may
change. Do not depend on immediate finalization of objects when they
become unreachable (ex: always close files).
To see how many references to an object exist, use sys.getrefcount.
The module for configuring/checking garbage collection is gc.
The GC will call the object's __del__ method when destroying the object (additional reference here).
Some immutable objects like strings are handled in a special way: e.g. if two variables contain the same string, they may reference the same object, but not always - see "identifying objects, why does the returned value from id(...) change?"
The id of an object can be found with the built-in function id().
The memory_profiler module looks interesting: a module for monitoring memory usage of a Python program.
There are a lot of useful resources on the topic; one example: Find all references to an object in Python.
You cannot force a Python object to be deleted; it will be deleted when nothing references it (or when it's in a cycle referred to only by the items in the cycle). You will have to tell your "Mastermind" to erase its reference.
del somemastermind.roaches[n]
for i, roach in enumerate(roachpopulation_list):
    if roach.hunger == 100:
        del roachpopulation_list[i]
        break
Remove the instance by deleting it from your population list (the list containing all the roach instances).
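If every starved roach should disappear in one pass, an alternative sketch (the attribute and variable names here are assumed, not from the original code) is to rebuild the list rather than deleting while iterating:
# Keep only the roaches that are still alive; the dropped instances lose their
# last reference here and are freed by reference counting.
mastermind.roaches = [r for r in mastermind.roaches if r.hunger < 100]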
If your Roaches are Sprites created in Pygame, then a simple call to .kill() would remove the instance.
