What does CPython do to help detect object cycles (reference counting)?

From what I've read about CPython, it seems like it does reference counting plus something extra to detect/free objects pointing to each other (correct me if I'm wrong). Could someone explain the something extra? Also, does this guarantee* no cycle leaking? If not, is there any research into an algorithm proven to augment reference counting so that it never leaks*? Would this just be running a non-reference-counting tracing GC every so often?
*discounting bugs and problems with modules using foreign function interface

As explained in the documentation for gc.garbage, there is no guarantee that no leaks occur; specifically, cyclic objects with __del__ methods are not collected by default (this changed in Python 3.4 with PEP 442, which lets the collector finalize and free such cycles). For such objects, the cyclic links have to be broken manually to enable further GC.
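The gc.garbage list can be observed directly with the collector's debug flags; a minimal sketch (gc.DEBUG_SAVEALL diverts everything the collector finds into gc.garbage instead of freeing it):

```python
import gc

# DEBUG_SAVEALL makes the collector keep everything it finds in
# gc.garbage instead of freeing it, so we can inspect what would
# otherwise have been collected.
gc.set_debug(gc.DEBUG_SAVEALL)

l = []
l.append(l)   # a list that refers to itself: a reference cycle
del l         # the refcount never reaches zero; only the cycle detector can find it

gc.collect()
assert len(gc.garbage) >= 1   # the self-referencing list ends up in gc.garbage

gc.set_debug(0)   # restore normal behavior
```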
From what I understand by browsing the CPython source code, the interpreter keeps references to all objects under its control. The "extra" garbage collector runs a mark-and-sweep-like algorithm through the heap, remembers for each object whether it is reachable from the "outside", and, if not, deletes it. (The GC is generational, but it may be run explicitly from the gc module with a generation argument.)
The only efficient algorithm that I could think of that satisfies your criteria would indeed be a "full" GC algorithm to augment reference counting and this is what seems to be implemented in Python. I'm not an expert in these matters though.
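A small sketch of the "something extra" in action, using a weakref to observe when a cycle is actually freed:

```python
import gc
import weakref

class Node:
    pass

gc.disable()           # keep collection deterministic for this demo

a = Node()
b = Node()
a.partner = b
b.partner = a          # a <-> b form a reference cycle

probe = weakref.ref(a) # observe a's lifetime without keeping it alive
del a, b               # each refcount stays at 1; refcounting alone can't free them
assert probe() is not None

gc.collect()           # the cycle detector finds the unreachable pair
assert probe() is None

gc.enable()
```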

Related

When to Py_INCREF?

I'm working on a C extension and am at the point where I want to track down memory leaks. From reading Python's documentation it's hard to understand when to increment/decrement the reference count of Python objects. Also, after a couple of days spent trying to embed the Python interpreter (in order to compile the extension as a standalone program), I had to give up on that endeavor. So tools like Valgrind are helpless here.
So far, by trial and error I learned that, for example, Py_DECREF(Py_None) is a bad thing... but is this true of any constant? I don't know.
My major confusions so far can be listed like this:
Do I have to decrement refcount on anything created by PyWhatever_New() if it doesn't outlive the procedure that created it?
Does every Py_INCREF need to be matched by Py_DECREF, or should there be one more of one / the other?
If a call to a Python procedure results in a PyObject*, do I need to increment its refcount to ensure that I can still use it (forever), or decrement it to ensure that it will eventually be garbage-collected, or neither?
Are Python objects created through the C API allocated on the stack or on the heap? (It is possible that Py_INCREF reallocates them on the heap, for example.)
Do I need to do anything special to Python objects created in C code before passing them to Python code? What if Python code outlives C code that created Python objects?
Finally, I understand that Python has both reference counting and a garbage collector: if that's the case, how critical is it if I mess up the reference count (i.e., don't decrement enough)? Will the GC eventually figure out what to do with those objects?
Most of this is covered in Reference Count Details, and the rest is covered in the docs on the specific questions you're asking about. But, to get it all in one place:
Py_DECREF(Py_None) is a bad thing... but is this true of any constant?
The more general rule is that calling Py_DECREF on anything you didn't get a new/stolen reference to, and didn't call Py_INCREF on, is a bad thing. Since you never call Py_INCREF on anything accessible as a constant, this means you never call Py_DECREF on them.
Do I have to decrement refcount on anything created by PyWhatever_New()
Yes. Anything that returns a "new reference" has to be decremented. By convention, anything that ends in _New should return a new reference, but it should be documented anyway (e.g., see PyList_New).
Does every Py_INCREF need to be matched by Py_DECREF, or should there be one more of one / the other?
The number in your own code may not necessarily balance. The total number has to balance, but there are increments and decrements happening inside Python itself. For example, anything that returns a "new reference" has already done an inc, while anything that "steals" a reference will do the dec on it.
Are Python objects created through the C API allocated on the stack or on the heap? (It is possible that Py_INCREF reallocates them on the heap, for example.)
There's no way to create objects on the stack through the C API. The C API only has functions that return pointers to objects.
Most of these objects are allocated on the heap. Some are actually in static memory.
But your code should not care anyway. You never allocate or delete them; they get allocated in the PySpam_New and similar functions, and deallocate themselves when you Py_DECREF them to 0, so it doesn't matter to you where they are.
(The exception is constants that you can access via their global names, like Py_None. Those, you obviously know, are in static storage.)
Do I need to do anything special to Python objects created in C code before passing them to Python code?
No.
What if Python code outlives C code that created Python objects?
I'm not sure what you mean by "outlives" here. Your extension module is not going to get unloaded while any objects depend on its code. (In fact, until at least 3.8, your module is probably never going to get unloaded until shutdown.)
If you just mean the function that _New'd up an object returning, that's not an issue. You have to go very far out of your way to allocate any Python objects on the stack. And there's no way to pass things like a C array of objects, or a C string, into Python code without converting them to a Python tuple of objects, or a Python bytes or str. There are a few cases where, e.g., you could stash a pointer to something on the stack in a PyCapsule and pass that - but that's the same as in any C program, and… just don't do it.
Finally, I understand that Python has both reference counting and garbage collector
The garbage collector is just a cycle breaker. If you have objects that are keeping each other alive with a reference cycle, you can rely on the GC. But if you've leaked references to an object, the GC will never clean it up.
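A sketch of that last point, using ctypes.pythonapi.Py_IncRef to simulate a C extension that forgets the matching Py_DECREF (a deliberate leak, for illustration only):

```python
import ctypes
import gc
import weakref

class Thing:
    pass

# Simulate a buggy C extension: take a reference and never release it.
Py_IncRef = ctypes.pythonapi.Py_IncRef
Py_IncRef.argtypes = [ctypes.py_object]
Py_IncRef.restype = None

t = Thing()
probe = weakref.ref(t)
Py_IncRef(t)   # leaked reference: no matching Py_DECREF will ever happen
del t

gc.collect()
# The object is not part of a cycle, just leaked, so the cycle
# detector will never reclaim it.
assert probe() is not None
```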

How does Python handle cycles in GC?

I know that Python uses reference counting for garbage collection.
Every object allocated on the heap has a counter that counts the number of objects referring to it; when the counter hits zero, the object is deleted.
But how does Python handle circular pointers?
If two objects point at each other and one of them is deleted, the second is left with a count of 1, but it still needs to be deleted.
The way this is handled is dependent on the Python implementation. The reference implementation, the one you're probably using, is called CPython, because it is written in C.
CPython uses reference counting to clean up objects which are obviously no longer used. However, every once in a while it pauses execution of the program and begins with the objects directly referenced by variables alive in the program. Then it follows all references as far as it can, marking which objects have been visited. Once it has followed all references, it finds all the objects which aren't reachable from the main program and deletes them. This is called tracing garbage collection, of which mark-and-sweep is a particular implementation.
If you want, and you're sure your program has no circular references, you can turn this feature off to improve performance. If you have circular references, however, you'll accidentally cause memory leaks, so it's usually not worth doing unless you're really worried about performance.
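A minimal sketch of turning the cycle collector off and running it manually through the gc module:

```python
import gc

gc.disable()             # automatic cyclic collection is now off
assert not gc.isenabled()

for _ in range(100):
    d = {}
    d["self"] = d        # each iteration abandons one cyclic dict
del d                    # drop the last reference too

# With gc disabled, refcounting alone can't free the cyclic dicts;
# they linger until a manual collection.
freed = gc.collect()     # gc.collect() still works while disabled
assert freed >= 100      # all 100 cyclic dicts were found and freed

gc.enable()
```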

Memory deallocation in Linked list Python

While deleting the end node from a linked list, we just set the link of the node pointing to the end node to None. Does that mean that the end node is destroyed and the memory occupied by it has been released?
You ask: "Does that mean that the end node is destroyed and memory occupied by it has been released?"
With the little information you have given the answer to your question as you posed it is definitely not a plain unqualified "yes".
The simplest example of why "yes" is wrong is that if there is any other reference to that end node, then it can't be released immediately - if that weren't the case then nothing much would work, would it? However, that doesn't mean the node won't ever be regarded as deletable.
Moreover, even once releasable, that doesn't mean the memory "has been released" - this is implementation-dependent and may well not be deterministic, i.e. you can't necessarily rely on the memory having been immediately released, or predict when (if ever) it is actually released.
The "garbage collector" metaphor is used to refer to recovering unused memory because IRL garbage collection happens every now and then but can't be relied on to happen (or have happened) at a particular time.
What happens to unreferenced data has nothing to do with the language specification, which is another reason why the answer is not a plain "yes". It is completely implementation-dependent. You don't say if you are using CPython or Jython, or some other flavour. You need to refer to the documentation for the implementation you are using. CPython does expose its garbage collector; refer to e.g. https://docs.python.org/2/library/gc.html and https://docs.python.org/3/library/gc.html. Jython uses the Java garbage collector. You may or may not be able to influence their behaviour; refer to the documentation for the interpreter you are using.
The reasons for not necessarily immediately recycling releasable memory are usually to do with performance - why do work which isn't needed? - but if your interpreter does postpone recycling then it will at some point, when based on some criteria resources become limited, make some effort to tidy up - do the garbage collection - and this means that 99.9...% of the time you don't need to concern yourself with the recycling because it is automatically handled (with corresponding overhead cost) once the interpreter implementation considers it necessary.
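On CPython specifically, the release of an unlinked node can be observed with a weakref; a small sketch (the Node class here is made up for illustration):

```python
import weakref

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

head = Node(1)
head.next = Node(2)           # the end node

probe = weakref.ref(head.next)
head.next = None              # drop the only reference to the end node

# On CPython, reference counting frees the node immediately; other
# implementations (e.g. Jython, PyPy) may defer the actual release.
assert probe() is None
```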
Yes.
Python has a garbage collector so objects that cannot be reached in any way are automatically destroyed and their memory will be reused for other objects created in the future.

Usage of Cython directive no_gc

In Cython 0.25 the no_gc directive was added. The documentation for this new directive (as well as for a related no_gc_clear directive) can be found here, but the only thing I really understand about it is that it can speed up your code be disabling certain aspects of garbage collection.
I am interested because I have some high performance Cython code which uses extension types, and I understand that no_gc can speed things up further. In my code, instances of extension types are always left alive until the very end when the program closes, which makes me think that disabling garbage collection for these might be OK.
I guess what I really need is an example where the usage of no_gc goes bad and leads to memory leaks, together with an explanation of exactly why that happens.
It's to do with circular references - when instance a holds a reference to a Python object that in turn references a, then a can never be freed through reference counting alone, so Python tries to detect the cycle.
A very trivial example of a class that could cause issues is:
# Cython code:
cdef class A:
    cdef param
    def __init__(self):
        self.param = self
(and some Python code to run it)
import cython_module
while True:
    cython_module.A()
This is fine as is (the cycles are detected and the objects get deallocated every so often), but if you add no_gc then you will run out of memory.
A more realistic example might be a parent/child pair that store a reference to each other.
It's worth adding that the performance gains are likely to be small. The garbage collector is only run occasionally, in situations where a lot of objects have been allocated and few have been freed (https://docs.python.org/3/library/gc.html - see set_threshold). It's hopefully unlikely that this describes your high performance code.
There's probably also a small performance cost on allocation and deallocation of your objects with GC, to add/remove them from the list of tracked objects (but again, hopefully you aren't allocating/deallocating huge numbers).
Finally, if your class never stores any references to Python objects then it's effectively no_gc anyway. Setting the option will do no harm but also do no good.
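The collector's scheduling mentioned above can be inspected from Python (the default thresholds documented for CPython are (700, 10, 10), though this is version-dependent):

```python
import gc

# Three generations; collection of a generation is triggered when its
# allocations-minus-deallocations count exceeds its threshold.
thresholds = gc.get_threshold()   # e.g. (700, 10, 10) by default on CPython
counts = gc.get_count()           # current per-generation counts

assert len(thresholds) == 3
assert len(counts) == 3
```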

Are all Python objects tracked by the garbage collector?

I'm trying to debug a memory leak (see question Memory leak in Python Twisted: where is it?).
When the garbage collector is running, does it have access to all Python objects created by the Python interpreter? If we suppose Python C libraries are not leaking, should RSS memory usage grow linearly with respect to the GC object count? What about sys.getobjects?
CPython uses two mechanisms to clean up garbage. One is reference counting, which affects all objects but which can't clean up objects that (directly or indirectly) refer to each other. That's where the actual garbage collector comes in: python has the gc module, which searches for cyclic references in objects it knows about. Only objects that can potentially be part of a reference cycle need to worry about participating in the cyclic gc. So, for example, lists do, but strings do not; strings don't reference any other objects. (In fact, the story is a little more complicated, as there's two ways of participating in cyclic gc, but that isn't really relevant here.)
All Python classes (and instances thereof) automatically get tracked by the cyclic gc. Types defined in C aren't, unless they put in a little effort. All the builtin types that could be part of a cycle do. But this does mean the gc module only knows about the types that play along.
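Which objects "play along" can be checked directly with gc.is_tracked; a small sketch:

```python
import gc

# Atomic objects can never participate in a cycle, so the collector
# doesn't track them at all.
assert not gc.is_tracked(42)
assert not gc.is_tracked("hello")
assert not gc.is_tracked(object())

# Containers can hold references, so they are tracked.
assert gc.is_tracked([])
assert gc.is_tracked({"key": []})

# CPython untracks some containers it can prove are cycle-free,
# e.g. a dict holding only atomic values.
assert not gc.is_tracked({"key": 1})
```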
Apart from the collection mechanism there's also the fact that Python has its own aggregating memory allocator (obmalloc), which allocates entire memory arenas and uses the memory for most of the smaller objects it creates. Python now does free these arenas when they're completely empty (for a long time it didn't), but actually emptying an arena is fairly rare: because CPython objects aren't movable, you can't just move some stragglers to another arena.
The RSS does not grow linearly with the number of Python objects, because Python objects can vary in size. An int object is usually much smaller than a big list.
I suppose that you mean gc.get_objects when you wrote sys.getobjects. This function gives you a list of all reachable objects. If you suppose a leak, you can iterate this list and try to find objects that should have been freed already. (For instance you might know that all objects of a certain type are to be freed at a certain point.)
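Assuming gc.get_objects is what was meant, a minimal sketch of that leak hunt (the Suspect class is made up for illustration):

```python
import gc

class Suspect:
    pass

kept = [Suspect() for _ in range(10)]

# Scan everything the collector knows about for instances of one type -
# a common way to look for objects that should already have been freed.
found = [o for o in gc.get_objects() if isinstance(o, Suspect)]
assert len(found) >= 10
```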
A Python class designed to be unable to be involved in cycles is not tracked by the GC.
class V(object):
__slots__ = ()
Instances of V cannot have any attribute. Its size is 16, like the size of object().
sys.getsizeof(V()) and V().__sizeof__() return the same value: 16.
V isn't useful, but I imagine that other classes derived from base types (e.g. tuple) that only add methods can be crafted so that reference counting is enough to manage them in memory.
