I have this code:
import gc

def hacerciclo():
    l = [0]
    l[0] = l

recolector = gc.collect()
print("Garbage collector %d" % recolector)
for i in range(10):
    hacerciclo()
recolector = gc.collect()
print("Garbage collector %d" % recolector)
This is example code for the use of gc.collect(). The problem is that the same code shows different output on different computers.
Some computers show:
Garbage collector 1
Garbage collector 10
others show:
Garbage collector 0
Garbage collector 10
Why does this happen?
The current version of Python uses reference counting to keep track of allocated memory. Each object in Python has a reference count which indicates how many objects are pointing to it. When this reference count reaches zero the object is freed. This works well for most programs. However, there is one fundamental flaw with reference counting and it is due to something called reference cycles. The simplest example of a reference cycle is one object that refers to itself. For example:
>>> l = []
>>> l.append(l)
>>> del l
The reference count for the list created is now one. However, since it can no longer be reached from inside Python and cannot possibly be used again, it should be considered garbage. With reference counting alone, this list will never be freed.
Creating reference cycles is usually not good programming practice and can almost always be avoided. However, sometimes it is difficult to avoid creating reference cycles and other times the programmer does not even realize it is happening. For long running programs such as servers this is especially troublesome. People do not want their servers to run out of memory because reference counting failed to free unreachable objects. For large programs it is difficult to find how reference cycles are being created.
Source: http://arctrix.com/nas/python/gc/
The link below has the same example you are using and it also explains:
http://www.digi.com/wiki/developer/index.php/Python_Garbage_Collection
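The code in the question is exercising exactly that collector: the gc module's cycle detector can reclaim such a list even though reference counting alone cannot. A minimal sketch of that, reusing the cycle from above:

import gc

# Recreate the cycle from the example above.
l = []
l.append(l)   # the list now refers to itself
del l         # the name is gone, but the cycle keeps the reference count at 1

# The cyclic collector can still find and free it; collect() returns
# the number of unreachable objects it found.
print(gc.collect())   # typically 1 here: the single self-referencing list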
Related
I am running some memory-heavy scripts which iterate over documents in a database, and due to memory constraints on the server I manually delete references to the large object at the conclusion of each iteration:
for document in database:
    initial_function_calls()
    big_object = memory_heavy_operation(document)
    save_to_file(big_object)
    del big_object
    additional_function_calls()
initial_function_calls() and additional_function_calls() are each slightly memory-heavy. Do I see any benefit from explicitly deleting the reference to the large object so it can be garbage collected? Alternatively, is it enough to leave it and let the name point to a new object in the next iteration?
As so often in these cases: it depends. :-/
I'm assuming we're talking about CPython here.
Using del or re-assigning a name reduces the reference count for an object. Only if that reference count reaches 0 can the object be de-allocated. So if you inadvertently stashed a reference to big_object away somewhere, using del won't help.
When garbage collection is triggered depends on the number of allocations and de-allocations; see the documentation for gc.set_threshold().
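For illustration, a small sketch of inspecting and raising those thresholds (the exact default values depend on your interpreter version):

import gc

# Current (gen0, gen1, gen2) thresholds; CPython's defaults are typically (700, 10, 10).
print(gc.get_threshold())

# Raising the generation-0 threshold makes automatic collections rarer;
# only worthwhile if profiling shows the collector itself is a cost.
gc.set_threshold(10000, 10, 10)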
If you're pretty sure that there are no further references, you could use gc.collect() to force a garbage collection run. That might help if your code doesn't do a lot of other allocations.
One thing to keep in mind is that if the big_object is created by a C extension module (like e.g. numpy), it could manage its own memory. In that case the garbage collection won't affect it! Also small integers and small strings are pre-allocated and won't be garbage collected. You can use gc.is_tracked() to check if an object is managed by the garbage collector.
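For example (the exact results are CPython-specific, but this is the general pattern):

import gc

print(gc.is_tracked(42))          # small int: not tracked by the cyclic collector
print(gc.is_tracked("hello"))     # a str can't reference other objects: not tracked
print(gc.is_tracked([1, 2, 3]))   # a list can form cycles: tracked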
What I would suggest is that you run your program with and without del+gc.collect(), and monitor the amount of RAM used. On UNIX-like systems, look at the resident set size. You could also use sys._debugmallocstats().
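One way to take those measurements from inside the process on a UNIX-like system (assuming the standard library resource module is available):

import resource
import sys

# Peak resident set size so far (kilobytes on Linux, bytes on macOS).
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

# Detailed statistics from CPython's small-object allocator, printed to stderr.
sys._debugmallocstats()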
Unless you see the resident set size grow and grow, I wouldn't worry about it.
Consider the following code for illustration purposes:
import mod
f1s = ["A1", "B1", "C1"]
f2s = ["A2", "B2", "C2"]
for f1, f2 in zip(f1s, f2s):
    # Creating an object
    acumulator = mod.AcumulatorObject()
    # Using object
    acumulator.append(f1)
    acumulator.append(f2)
    # Output of object
    acumulator.print()
So, I create an instance of a class at the beginning of the for loop to perform an operation. For each tuple I need to perform the same action, but I cannot reuse the same object because it would carry over the effect of the previous iteration. Therefore, at the beginning of every iteration I create a new instance.
My question is whether doing this creates a memory leak. What do I have to do with each object I create? (Delete it, maybe? Or is assigning the new object to the same name enough to clear the old one?)
tl;dr: no
The reference implementation of Python uses reference counting for garbage collection. There are other implementations that use different GC strategies, and this affects the precise time at which __del__ methods are called, which may or may not be reliable or timely in PyPy, Jython or IronPython. These differences are not important unless you are dealing with resources like file pointers and other expensive system resources.
In CPython, objects are wiped out as soon as their reference count reaches zero. For example, when you do acumulator = mod.AcumulatorObject() inside a for loop, a new object replaces the old one at the next iteration, and since no other variable references the old object, its reference count drops to zero and it is freed. The reference implementation CPython will spoil you with things like releasing resources automatically when they go out of scope, but YMMV regarding other implementations.
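A small sketch of that behaviour, using a hypothetical stand-in for mod.AcumulatorObject and a weak reference to watch the old instance:

import weakref

class AcumulatorObject:               # hypothetical stand-in for mod.AcumulatorObject
    def __init__(self):
        self.items = []
    def append(self, x):
        self.items.append(x)

acumulator = AcumulatorObject()
watcher = weakref.ref(acumulator)     # keep an eye on the first instance

acumulator = AcumulatorObject()       # rebinding the name drops the old count to zero
print(watcher())                      # None in CPython: the old instance is already gone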
That is why many people commented that memory leaks are not a concern in Python.
You have complete control over CPython's garbage collector using the gc module. The default settings are pretty conservative, and in 10 years of doing Python for a living I never had to fire a GC cycle manually; but I've seen a situation where delaying it helped performance:
Yes, I had previously played with sys.setcheckinterval. I changed it to 1000 (from its default of 100), but it didn't do any measurable difference. Disabling Garbage Collection has helped - thanks. This has been the biggest speedup so far - saving about 20% (171 minutes for the whole run, down to 135 minutes) - I'm not sure what the error bars are on that, but it must be a statistically significant increase.
Just follow best practices like wrapping system resources in with statements (or try/finally blocks) and you should have no problems.
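For example, with a file handle (the filename is just a placeholder):

# Preferred: the file is closed as soon as the block exits, even on error.
with open("results.txt", "w") as fh:
    fh.write("done\n")

# Equivalent try/finally form.
fh = open("results.txt", "w")
try:
    fh.write("done\n")
finally:
    fh.close()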
I know that Python uses reference counting for garbage collection.
Every object that is allocated on the heap has a counter that counts the number of objects referring to it; when the counter hits zero, the object is deleted.
But how does Python handle circular references?
If one of them is deleted, the other is left with a counter of 1, but it still needs to be deleted.
The way this is handled is dependent on the python implementation. The reference implementation, the one you're probably using, is sometimes called CPython, because it is written in C.
CPython uses reference counting to clean up objects which are obviously no longer used. However, every once in a while, it pauses execution of the program and begins with the objects directly referenced by variables alive in the program. Then, it follows all references as far as it can, marking which objects have been visited. Once it has followed all references, it finds all the objects which aren't reachable from the main program and deletes them. This is called tracing garbage collection, of which mark and sweep is a particular implementation.
If you want, and you're sure your program has no circular references, you can turn this feature off to improve performance. If you have circular references, however, you'll accidentally cause memory leaks, so it's usually not worth doing unless you're really worried about performance.
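A minimal pure-Python sketch of the trade-off (the Node class is just for illustration):

import gc

class Node:
    def __init__(self):
        self.other = None

def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a   # the two nodes refer to each other

gc.disable()                  # turn the cycle detector off
for _ in range(1000):
    make_cycle()              # each pair is now unreachable, but never freed

gc.enable()
print(gc.collect())           # the collector finds and frees the stranded cycles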
In Cython 0.25 the no_gc directive was added. The documentation for this new directive (as well as for a related no_gc_clear directive) can be found here, but the only thing I really understand about it is that it can speed up your code by disabling certain aspects of garbage collection.
I am interested because I have some high performance Cython code which uses extension types, and I understand that no_gc can speed things up further. In my code, instances of extension types are always left alive until the very end when the program closes, which makes me think that disabling garbage collection for these might be OK.
I guess what I really need is an example where the usage of no_gc goes bad and leads to memory leaks, together with an explanation of exactly why that happens.
It's to do with circular references: when instance a holds a reference to a Python object that references a again, a can never be freed through reference counting alone, so Python tries to detect the cycle.
A very trivial example of a class that could cause issues is:
# Cython code:
cdef class A:
    cdef object param
    def __init__(self):
        self.param = self
(and some Python code to run it)
import cython_module
while True:
    cython_module.A()
This is fine as is (the cycles are detected and they get deallocated every so often) but if you add no_gc then you will run out of memory.
A more realistic example might be a parent/child pair that store a reference to each other.
It's worth adding that the performance gains are likely to be small. The garbage collector is only run occasionally, in situations when a lot of objects have been allocated and few have been freed (https://docs.python.org/3/library/gc.html - see set_threshold). It's hopefully unlikely that this describes your high performance code.
There's probably also a small performance cost on allocation and deallocation of your objects with GC, to add/remove them from the list of tracked objects (but again, hopefully you aren't allocating/deallocating huge numbers).
Finally, if your class never stores any references to Python objects then it's effectively no_gc anyway. Setting the option will do no harm but also do no good.
I'm trying to debug a memory leak (see question Memory leak in Python Twisted: where is it?).
When the garbage collector is running, does it have access to all Python objects created by the Python interpreter? If we suppose Python C libraries are not leaking, should RSS memory usage grow linearly with respect to the GC object count? What about sys.getobjects?
CPython uses two mechanisms to clean up garbage. One is reference counting, which affects all objects but which can't clean up objects that (directly or indirectly) refer to each other. That's where the actual garbage collector comes in: Python has the gc module, which searches for cyclic references in objects it knows about. Only objects that can potentially be part of a reference cycle need to worry about participating in the cyclic gc. So, for example, lists do, but strings do not; strings don't reference any other objects. (In fact, the story is a little more complicated, as there are two ways of participating in cyclic gc, but that isn't really relevant here.)
All Python classes (and instances thereof) automatically get tracked by the cyclic gc. Types defined in C aren't, unless they put in a little effort. All the builtin types that could be part of a cycle do. But this does mean the gc module only knows about the types that play along.
Apart from the collection mechanism there's also the fact that Python has its own aggregating memory allocator (obmalloc), which allocates entire memory arenas and uses the memory for most of the smaller objects it creates. Python now does free these arenas when they're completely empty (for a long time it didn't), but actually emptying an arena is fairly rare: because CPython objects aren't movable, you can't just move some stragglers to another arena.
The RSS does not grow linearly with the number of Python objects, because Python objects can vary in size. An int object is usually much smaller than a big list.
I suppose that you mean gc.get_objects when you wrote sys.getobjects. This function gives you a list of all reachable objects. If you suppose a leak, you can iterate this list and try to find objects that should have been freed already. (For instance you might know that all objects of a certain type are to be freed at a certain point.)
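A rough sketch of how you might use it when hunting a leak:

import gc
from collections import Counter

# Tally the live, GC-tracked objects by type; a type whose count keeps
# growing between snapshots is a good leak candidate.
counts = Counter(type(obj).__name__ for obj in gc.get_objects())
for name, count in counts.most_common(10):
    print(count, name)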
A Python class designed so that its instances cannot be involved in cycles is not tracked by the GC.
class V(object):
    __slots__ = ()
Instances of V cannot have any attributes. Their size is 16, like the size of object().
sys.getsizeof(V()) and V().__sizeof__() return the same value: 16.
V isn't useful, but I imagine that other classes derived from base types (e.g. tuple), that only add methods, can be crafted so that reference counting is enough to manage them in memory.
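A quick way to check both the size and the tracking status on your own interpreter:

import gc
import sys

class V(object):
    __slots__ = ()

v = V()
print(sys.getsizeof(v), sys.getsizeof(object()))   # compare the two sizes
print(gc.is_tracked(v))                            # whether the cyclic GC tracks it on this build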