When does garbage collection occur? - Python

>>> a = 6
>>> b = 5
>>> c = 4
>>> d = c
>>> print(d)
>>> del b
>>> # are a and b "definitely" garbage collected, or only "maybe"?
Are a and b definitely garbage collected, or only maybe garbage collected?
How can I prove it?

CPython uses reference counting. Jython and IronPython use their underlying VM's GC. Having said that, CPython interns small integers, including those used in your code, and therefore those specifically will never be GCed.
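A quick way to see the interning at work (a minimal sketch; the exact interned range and the reported counts are CPython implementation details):

import sys

# CPython interns small integers; equal values in the interned range
# share a single object, so identity comparison succeeds.
a = 6
b = 6
print(a is b)              # True on CPython (implementation detail)

# Interned objects are referenced from all over the interpreter, so
# their reference count never drops to zero.
print(sys.getrefcount(6))  # some large number, never 0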

Python variables are just names referring to objects. In your example you have three objects, the integers 4, 5 and 6.
The integer 6 is referenced by a, 5 is initially referenced by b, and 4 is referenced by both c and d. Then you execute del b. This removes the reference from the integer 5. So at this point, 6 and 4 are still referenced, while 5 is not.
Exactly how garbage collection is handled is an implementation detail.
Now look here:
The current implementation keeps an array of integer objects for all integers between -5 and 256, when you create an int in that range you actually just get back a reference to the existing object
So the numbers you used in this example will never be garbage collected.
As to when garbage is collected is described in the documentation of gc.set_threshold(threshold0, threshold1, threshold2):
In order to decide when to run, the collector keeps track of the number of object allocations and deallocations since the last collection. When the number of allocations minus the number of deallocations exceeds threshold0, collection starts. Initially only generation 0 is examined. If generation 0 has been examined more than threshold1 times since generation 1 has been examined, then generation 1 is examined as well. Similarly, threshold2 controls the number of collections of generation 1 before collecting generation 2.
The default values for the thresholds are:
In [2]: gc.get_threshold()
Out[2]: (700, 10, 10)
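For illustration, a short sketch of inspecting and tuning those thresholds (the specific numbers here are arbitrary examples, not recommendations):

import gc

print(gc.get_threshold())  # (700, 10, 10) by default on CPython
print(gc.get_count())      # allocations tracked per generation so far

# Make generation-0 collections half as frequent, for example.
gc.set_threshold(1400, 10, 10)
print(gc.get_threshold())  # (1400, 10, 10)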

The way garbage collection works is an implementation detail. See also the question “My class defines __del__ but it is not called when I delete the object” in the Python FAQ.
In CPython, reference counting is the default, which implies that objects are deleted the moment the last reference to them goes away. So if you have an object called a, which is only referenced in the current scope, del a will delete it completely.
However, CPython also maintains a list of cyclic objects, to deal with the special cases where reference counting fails. You cannot tell when objects which end up in this list will get deleted, but eventually they will.
In other Python implementations there might be a full garbage collector for all objects, so you should never rely on del a actually deleting the object. This is also why you should always close file descriptors explicitly with .close(), to prevent resources from leaking until the program shuts down.
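As a minimal sketch of that advice (the file name is just a placeholder), a context manager closes the file deterministically on every implementation, with no reliance on the collector:

# The file is closed when the block exits, whether or not the file
# object itself has been garbage collected yet.
with open("data.txt", "w") as f:
    f.write("hello\n")

print(f.closed)  # True, even though f still references the file object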

Related

Garbage collection for a simple python class

I am writing a Python class like this:

class MyImageProcessor:
    def __init__(self, image, metadata):
        self.image = image
        self.metadata = metadata
Both image and metadata are objects of a class written by a
colleague. Now I need to make sure there is no waste of memory. I am thinking of defining a quit() method like this,
def quit(self):
    self.image = None
    self.metadata = None
    import gc
    gc.collect()
and suggest that users call quit() systematically. I would like to know whether this is the right way. In particular, do the instructions in quit() above guarantee that the unused memory is actually collected?
Alternatively, I could rename quit() to the built-in __exit__(), and suggest that users use the "with" syntax. But my question is more about whether the instructions in quit() actually accomplish the garbage collection one would need in this situation.
Thank you for your help.
In Python every object has a built-in reference count; the variables (names) you create are only pointers to objects. There are mutable and immutable objects (for example, if you change the value of an integer, the name is simply re-pointed at a different integer object, whereas changing a list element in place does not rebind the list's name).
The reference count basically counts how many names refer to that object, and it is incremented/decremented automatically.
The garbage collector destroys objects whose reference count has dropped to zero (actually not always immediately; it takes shortcuts to save time). You should check out this article.
Similarly to object constructors (__init__()), which are called on object creation, you can define destructors (__del__()), which are executed on object deletion (usually when the reference count drops to 0). According to this article, in Python they are not needed as much as in C++, because Python has a garbage collector that handles memory management automatically. You can check out those examples too.
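A minimal sketch of a destructor in action (the Tracker class is made up for illustration; on CPython __del__ runs as soon as the reference count hits zero, while other implementations may delay it):

class Tracker:
    def __init__(self, name):
        self.name = name

    def __del__(self):
        # Runs when the object is about to be destroyed.
        print("destroying", self.name)

t = Tracker("t")
del t  # on CPython this prints "destroying t" immediately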
Hope it helps :)
No need for quit() (assuming you're using CPython, the C implementation of Python).
Python uses two methods of garbage collection, as alluded to in the other answers.
First, there's reference counting. Essentially, each time you add a reference to an object its count is incremented, and each time you remove a reference (e.g. a name goes out of scope) it is decremented.
From https://devguide.python.org/garbage_collector/:
When an object’s reference count becomes zero, the object is deallocated. If it contains references to other objects, their reference counts are decremented. Those other objects may be deallocated in turn, if this decrement makes their reference count become zero, and so on.
You can get information about current reference counts for an object using sys.getrefcount(x), but really, why bother.
The second way is through garbage collection (gc). (Reference counting is a type of garbage collection, but Python specifically calls this second mechanism "garbage collection", so we'll also use that terminology.) This is intended to find those places where the reference count is not zero but the object is no longer accessible ("reference cycles"). For example:
class MyObj:
    pass

x = MyObj()
x.self = x
Here, x refers to itself, so the actual reference count for x is more than 1. You can call del x but that merely removes it from your scope: it lives on because "someone" still has a reference to it.
gc, and specifically gc.collect(), goes through objects looking for cycles like this and, when it finds an unreachable cycle (such as your x post deletion), it will deallocate the whole lot.
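A small sketch of that behaviour, using a weak reference from the standard library to observe the deallocation without keeping the object alive:

import gc
import weakref

class MyObj:
    pass

x = MyObj()
x.self = x              # reference cycle: the object refers to itself
probe = weakref.ref(x)  # lets us check liveness without a strong reference

del x                   # the cycle keeps the refcount above zero
print(probe() is None)  # False: the object is still alive

gc.collect()            # the cycle detector finds and frees the cycle
print(probe() is None)  # True: now it is gone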
Back to your question: You don't need a quit() method, because as soon as your MyImageProcessor object goes out of scope, it will decrement the reference counters for image and metadata. If that puts them at zero, they're deallocated. If it doesn't, well, someone else is using them.
Setting them to None first merely decrements the reference counts right then; when the MyImageProcessor instance later goes out of scope, it won't decrement those counts again, because it no longer holds the image or metadata objects. So you're just doing explicitly what Python already does for you for free: no more, no less.
You didn't create a cycle, so your calling gc.collect() is unlikely to change anything.
Check out https://devguide.python.org/garbage_collector/ if you are interested in more earthy details.
Not sure if it makes sense, but to my logic you could call gc.get_count() before and after gc.collect() to see whether something has been removed.
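Something like this minimal sketch (gc.collect() also returns the number of unreachable objects it found, which is often the more direct signal):

import gc

before = gc.get_count()   # objects tracked per generation
collected = gc.collect()  # returns the number of unreachable objects found
after = gc.get_count()

print(before, "->", after, "; collected:", collected)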
what are count0, count1 and count2 values returned by the Python gc.get_count()

Assignment of a variable

I am puzzled by an issue with variable assignment in Python. I tried looking for pages on the internet about this issue, but I still do not understand it.
Let's say I assign a certain value to the variable:
v = 2
Python reserves a place in some memory location holding the value 2. We now know how to access that particular memory location because we have bound the variable name v to it.
If then, we do this:
v = v + 4
A new memory location will be filled with the value 6 (2+4), and the name v is now bound to this new location, so that is how we access this value.
Is this correct, or completely incorrect?
And what happens with the initial value 2 in the memory location? It is no longer bound to any variable, so this value can not be accessed any more? Does this mean that in the next couple of minutes, this memory location will be freed by the garbage collector?
Thank you for the reply.
EDIT:
The v = 2 and v = v + 2 are just simple examples. In my case, instead of 2 and v + 2 I have non-Python-native objects which can weigh from tens of megabytes up to 400 megabytes.
I apologize for not clarifying this at the very start. I did not know there was a difference between integers and objects of other types.
You're correct, as far as this goes. Actually, what happens is that Python constructs integer objects 2 and 4. In the first assignment, the interpreter makes v point to the 2 object.
In the second assignment, we construct an object 6 and make v point to that, as you suspected.
What happens next is dependent on the implementation. Most interpreters and compilers that use objects even for atomic values will recognize the 2 object as a small integer, likely to be used again and very cheap to maintain, so it won't get collected. In general, a literal appearing in the code will not be turned over for garbage collection.
Thanks for updating the question. Yes, your large object, with no references to it, is now available for GC.
The example with small integers (-5 to 256) is not the best one because these are represented in a special way (see this for example).
You are generally correct about the flow of things, so if you had something like

v = f(v)

the following things happen:

- a new object is created as the result of evaluating f(v) (modulo the small-integer or literal optimization)
- this new object is bound to the name v
- whatever v previously referred to is now inaccessible and, generally, may get garbage collected soon (see the sketch below)
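A minimal sketch of those three steps, observable through id() (the exact id values vary from run to run):

v = 2
print(id(v))  # identity of the int object 2

v = v + 4     # step 1: a new object (6) is created
print(id(v))  # step 2: v is rebound, so a different id is printed

# Step 3: nothing refers to the old value through v anymore. Here the
# old value was the interned literal 2, so CPython keeps it around
# anyway; a large non-interned object would become collectable.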

Python: how to "kill" a class instance/object?

I want a Roach class to "die" when it reaches a certain amount of "hunger", but I don't know how to delete the instance. I may be making a mistake with my terminology, but what I mean to say is that I have a ton of "roaches" on the window and I want specific ones to disappear entirely.
I would show you the code, but it's quite long. I have the Roach instances being appended into a Mastermind class's roach population list.
In general:

- Each binding of a variable to an object increases the object's internal reference counter.
- There are several usual ways a reference (a variable -> object binding) is dropped:
  - exiting the block of code where the variable was defined
  - destroying an object, which releases the references held by its attributes
  - calling del variable, which deletes the binding in the current scope
- After all references to an object are removed (counter == 0) it becomes a good candidate for garbage collection, but it is not guaranteed that it will be processed immediately (reference here):

  CPython currently uses a reference-counting scheme with (optional) delayed detection of cyclically linked garbage, which collects most objects as soon as they become unreachable, but is not guaranteed to collect garbage containing circular references. See the documentation of the gc module for information on controlling the collection of cyclic garbage. Other implementations act differently and CPython may change. Do not depend on immediate finalization of objects when they become unreachable (ex: always close files).

- To find out how many references to an object exist, use sys.getrefcount (see the sketch after this list).
- The module for configuring and inspecting garbage collection is gc.
- The GC will call an object's __del__ method when destroying it (additional reference here).
- Some immutable objects like strings are handled in a special way: two variables containing the same string may reference the same object, but not always; see "identifying objects, why does the returned value from id(...) change?"
- The id of an object can be found out with the built-in function id.
- The module memory_profiler looks interesting: "A module for monitoring memory usage of a python program".
- There are lots of useful resources on the topic; one example: "Find all references to an object in python".
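Here is the sys.getrefcount sketch referred to above; note that the count it reports is one higher than you might expect, because the argument passed to the function is itself a temporary reference:

import sys

data = []
print(sys.getrefcount(data))  # 2: the name `data` plus the call's argument

alias = data                  # bind one more name to the same list
print(sys.getrefcount(data))  # 3

del alias                     # drop that extra reference again
print(sys.getrefcount(data))  # 2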
You cannot force a Python object to be deleted; it will be deleted when nothing references it (or when it's in a cycle referred to only by the items in the cycle). You will have to tell your "Mastermind" to erase its reference.
del somemastermind.roaches[n]
for i, roach in enumerate(roachpopulation_list):
    if roach.hunger == 100:
        del roachpopulation_list[i]
        break
Remove the instance by deleting it from your population list (the list containing all the roach instances).
If your Roaches are Sprites created in Pygame, then a simple call to .kill() would remove the instance from its groups.
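As a sketch of the list-based approach (the Roach class and the hunger threshold below are assumptions reconstructed from the question, not the asker's actual code):

class Roach:
    def __init__(self):
        self.hunger = 0

roach_population = [Roach() for _ in range(10)]
roach_population[3].hunger = 100

# Rebuild the list without the "dead" roaches; once no other reference
# to a removed Roach exists, CPython reclaims it right away.
roach_population = [r for r in roach_population if r.hunger < 100]
print(len(roach_population))  # 9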

Why do Python variables take a new address (id) every time they're modified?

Just wondering what the logic behind this is. On the surface it seems kind of inefficient that every time you do something simple like x = x + 1 it has to take a new address and discard the old one.
A Python variable (called an identifier or name in Python) is a reference to a value. The id() function tells you something about the value, not the name.
Many values are not mutable; integers, strings and floats do not change in place. When you add 1 to an integer, a new integer is produced, and the reference to the old value is replaced.
You can look at Python names as labels tied to values. If you imagine values as balloons, you are retying the label to a new balloon each time you assign to that name. If there are no other labels attached to a balloon anymore, it simply drifts away in the wind, never to be seen again. The id() function gives you a unique number for that balloon.
See this previous answer of mine where I talk a little bit more about that idea of values-as-balloons.
This may seem inefficient. For many often-used and small values, Python actually uses a process called interning, where it caches a stash of these values for re-use. None is such a value, as are small integers and the empty tuple (()). You can use the sys.intern() function (plain intern() in Python 2) to do the same with strings you expect to use a lot.
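A minimal sketch of string interning (the outcome of the first identity check is a CPython implementation detail, not a language guarantee):

import sys

part = "runtime"
a = "built at " + part  # constructed at run time: not interned
b = "built at " + part
print(a is b)           # False on CPython: two distinct string objects

a = sys.intern(a)       # intern() lives in the sys module in Python 3
b = sys.intern(b)
print(a is b)           # True: both names now share one interned object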
But note that values are only cleaned up when their reference count (the number of 'labels') drops to 0. Loads of values are reused all over the place all the time, especially those interned integers and singletons.
Because the basic types are immutable, every time you modify one it has to be instantiated again
...which is perfectly fine, especially for thread-safe functions.
The = operator doesn't modify an object; it binds the name to a completely different object, which may or may not already exist (and hence may already have an id).
For your example, integers are immutable; there's no way to add something to one and keep the same id.
And, in fact, small integers are interned, at least in CPython, so if you do:
x = 1
y = 2
x = x + 1
Then x and y may have the same id.
In Python, "primitive" types like ints and strings are immutable, which means they cannot be modified.
Python is actually quite efficient here because, as @Wooble commented, «Very short strings and small integers are interned»: if two variables reference the same (small) immutable value, their id is the same (reducing duplicated immutables).
>>> a = 42
>>> b = 5
>>> id(a) == id(b)
False
>>> b += 37
>>> id(a) == id(b)
True
The reason behind the use of immutable types is that they are a safe approach to concurrent access to those values.
At the end of the day it comes down to a design choice: depending on your needs, one implementation may suit you better than another.
For instance, a different philosophy can be found in a somewhat similar language, Ruby, where the types that are immutable in Python are mutable.
To be accurate, the assignment x = x + 1 doesn't modify the object that x references; it just makes x point to another object whose value is x + 1.
To understand the logic behind, one needs to understand the difference between value semantics and reference semantics.
An object with value semantics means only its value matters, not its identity. An object with reference semantics, on the other hand, is characterized by its identity (in Python, the identity can be obtained from id(obj)).
Typically, value semantics implies immutability of the object. Conversely, if an object is mutable (i.e. supports in-place change), it has reference semantics.
Let's briefly explain the rationale behind this immutability.
Objects with reference semantics can be changed in-place without losing their original addresses/identities. This makes sense in that it's the identity of an object with reference semantics that makes itself distinguishable from other objects.
In contrast, an object with value-semantics should never change itself.
First, this is possible and reasonable in theory. Since only the value (not the identity) is significant, when a change is needed it is safe to swap in another identity carrying the different value. This is called referential transparency. Note that this is impossible for objects with reference semantics.
Secondly, this is beneficial in practice. As the OP thought, it seems inefficient to discard the old object each time it is changed, but most of the time it is more efficient than the alternative. For one thing, Python (like many other languages) has interning/caching schemes so that fewer objects need to be created. What's more, if objects with value semantics were designed to be mutable, they would take up much more space in most cases.
For example, Date has value semantics. If it were designed to be mutable, any method returning a date from an internal field would expose a handle to the outside world, which is risky (e.g. outside code could directly modify the internal field without going through the public interface). Similarly, if a date object were passed by reference into some function or method, it could be modified there, which may not be what one expects. To avoid these kinds of side effects, one has to program defensively: instead of directly returning the inner date field, return a clone of it; instead of passing by reference, pass by value, which means extra copies are made. As one can imagine, this creates more objects than necessary. What's worse, the code becomes more complicated with all the extra cloning.
In a word, immutability enforces value semantics; it usually involves less object creation, fewer side effects and fewer hassles, and it is more test-friendly. Besides, immutable objects are inherently thread-safe, which means fewer locks and better efficiency in multithreaded environments.
That's the reason why basic data types with value semantics, like number, string, date and time, are all immutable (well, string in C++ is an exception, which is why there are so many const string& parameters around to keep strings from being modified unexpectedly). As a lesson, Java made the mistake of designing the value-semantic classes Date, Point, Rectangle and Dimension as mutable.
As we know, objects in OOP have three characteristics: state, behavior and identity. Objects with value semantics are not typical objects in that their identities do not matter at all. Usually they are passive, and mostly used to describe other real, active objects (i.e. those with reference semantics). This is a good hint for distinguishing between value semantics and reference semantics.
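A short sketch contrasting the two semantics through id() (the exact id values vary per run, but the two comparisons come out as commented on CPython):

# Reference semantics: a list mutates in place and keeps its identity.
nums = [1, 2, 3]
before = id(nums)
nums.append(4)
print(id(nums) == before)  # True: same object, new contents

# Value semantics: "changing" an int rebinds the name to another object.
n = 1
before = id(n)
n += 1
print(id(n) == before)     # False: n now names a different int object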

Why does one version leak memory but not the other? (Python)

Both these functions compute the same thing (the number of integers whose associated Collatz sequence is no longer than n) in essentially the same way. The only difference is that the first one uses sets exclusively, whereas the second uses both sets and lists.
The second one leaks memory (in IDLE with Python 3.2, at least), the first one does not, and I have no idea why. I have tried a few "tricks" (such as adding del statements) but nothing seems to help (which is not surprising, those tricks should be useless).
I would be grateful to anybody who could help me understand what goes on.
If you want to test the code, you should probably use a value of n in the 55 to 65 range, anything above 75 will almost certainly result in a (totally expected) memory error.
def disk(n):
    """Uses sets for explored, current and to_explore. Does not leak."""
    explored = set()
    current = {1}
    for i in range(n):
        to_explore = set()
        for x in current:
            if not (x-1) % 3 and ((x-1)//3) % 2 and not ((x-1)//3) in explored:
                to_explore.add((x-1)//3)
            if not 2*x in explored:
                to_explore.add(2*x)
        explored.update(current)
        current = to_explore
    return len(explored)
def disk_2(n):
    """Does exactly the same thing, but uses a set for explored and lists
    for current and to_explore.

    Leaks (like a sieve :)).
    """
    explored = set()
    current = [1]
    for i in range(n):
        to_explore = []
        for x in current:
            if not (x-1) % 3 and ((x-1)//3) % 2 and not ((x-1)//3) in explored:
                to_explore.append((x-1)//3)
            if not 2*x in explored:
                to_explore.append(2*x)
        explored.update(current)
        current = to_explore
    return len(explored)
EDIT: This also happens when using the interactive mode of the interpreter (without IDLE), but not when running the script directly from a terminal (in that case, memory usage goes back to normal some time after the function has returned, or as soon as there is an explicit call to gc.collect()).
CPython allocates small objects (obmalloc.c, 3.2.3) out of 4 KiB pools that it manages in 256 KiB blocks called arenas. Each active pool has a fixed block size ranging from 8 bytes up to 256 bytes, in steps of 8. For example, a 14-byte object is allocated from the first available pool that has a 16-byte block size.
There's a potential problem if arenas are allocated on the heap instead of using mmap (this is tunable via mallopt's M_MMAP_THRESHOLD), in that the heap cannot shrink below the highest allocated arena, which will not be released so long as 1 block in 1 pool is allocated to an object (CPython doesn't float objects around in memory).
Given the above, the following version of your function should probably solve the problem. Replace the line return len(explored) with the following 3 lines:
result = len(explored)
del i, x, to_explore, current, explored
return result + 0
After deallocating the containers and all referenced objects (releasing arenas back to the system), this returns a new int with the expression result + 0. The heap cannot shrink as long as there's a reference to the first result object. In this case that gets automatically deallocated when the function returns.
If you're testing this interactively without the "plus 0" step, remember that the REPL (Read, Eval, Print, Loop) keeps a reference to the last result accessible via the pseudo-variable "_".
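For illustration, a plain interactive session mirrors the behaviour described above (note that _ is only rebound when an expression produces a non-None result):

>>> sum(range(10))
45
>>> _    # the REPL's pseudo-variable still references the last result
45
>>> 0    # evaluating a new expression rebinds _, releasing the old result
0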
In Python 3.3 this shouldn't be an issue since the object allocator was modified to use anonymous mmap for arenas, where available. (The upper limit on the object allocator was also bumped to 512 bytes to accommodate 64-bit platforms, but that's inconsequential here.)
Regarding manual garbage collection, gc.collect() does a full collection of tracked container objects, but it also clears freelists of objects that are maintained by built-in types (e.g. frames, methods, floats). Python 3.3 added additional API functions to clear freelists used by lists (PyList_ClearFreeList), dicts (PyDict_ClearFreeList), and sets (PySet_ClearFreeList). If you'd prefer to keep the freelists intact, use gc.collect(1).
I doubt it leaks; I bet it is just that garbage collection hasn't kicked in yet, so the memory used keeps growing. Every round of the outer loop, the previous current list becomes eligible for garbage collection, but it will not be collected until whenever the collector decides to run.
Furthermore, even if it is garbage collected, memory isn't normally released back to the OS, so you have to use whatever Python facilities exist to inspect the current heap usage.
If you add garbage collection at end of every outer loop iteration, that may reduce memory use a bit, or not, depending on how exactly Python handles its heap and garbage collection without that.
You do not have a memory leak. Processes on Linux generally do not release heap memory back to the OS until they exit. Accordingly, the stats you see in e.g. top will only ever go up.
You only have a memory leak if, after running the same or a smaller job, Python grabs more memory from the OS when it "should" have been able to reuse the memory it was using for objects that "should" have been garbage collected.
