As we (or at least I) learned in this answer, simple tuples that contain only immutable values are not tracked by Python's garbage collector once it figures out that they can never be involved in reference cycles:
>>> import gc
>>> x = (1, 2)
>>> gc.is_tracked(x)
True
>>> gc.collect()
0
>>> gc.is_tracked(x)
False
Why isn't this the case for namedtuples, which are a subclass of tuple from the collections module that features named fields?
>>> import gc
>>> from collections import namedtuple
>>> foo = namedtuple('foo', ['x', 'y'])
>>> x = foo(1, 2)
>>> gc.is_tracked(x)
True
>>> gc.collect()
0
>>> gc.is_tracked(x)
True
Is there something inherent in their implementation that prevents this or was it just overlooked?
The only comment about this that I could find is in the gcmodule.c file of the Python sources:
NOTE: about untracking of mutable objects.
Certain types of container cannot participate in a reference cycle, and so do not need to be tracked by the garbage collector. Untracking these objects reduces the cost of garbage collections. However, determining which objects may be untracked is not free, and the costs must be weighed against the benefits for garbage collection.
There are two possible strategies for when to untrack a container:
1. When the container is created.
2. When the container is examined by the garbage collector.
Tuples containing only immutable objects (integers, strings etc., and, recursively, tuples of immutable objects) do not need to be tracked. The interpreter creates a large number of tuples, many of which will not survive until garbage collection. It is therefore not worthwhile to untrack eligible tuples at creation time. Instead, all tuples except the empty tuple are tracked when created. During garbage collection it is determined whether any surviving tuples can be untracked. A tuple can be untracked if all of its contents are already not tracked. Tuples are examined for untracking in all garbage collection cycles. It may take more than one cycle to untrack a tuple.
Dictionaries containing only immutable objects also do not need to be tracked. Dictionaries are untracked when created. If a tracked item is inserted into a dictionary (either as a key or value), the dictionary becomes tracked. During a full garbage collection (all generations), the collector will untrack any dictionaries whose contents are not tracked.
The module provides the python function is_tracked(obj), which returns the current tracking status of the object. Subsequent garbage collections may change the tracking status of the object.
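The multi-pass behavior described for tuples can be observed interactively. A quick sketch (how many collections it takes depends on the order in which the collector happens to visit the tuples, so treat the output as illustrative):

import gc

inner = (1, 2)
outer = (inner, (3, 4))   # tuples of immutables, eligible for untracking
gc.collect()              # each collection examines surviving tuples...
gc.collect()              # ...so untracking can take more than one pass
print(gc.is_tracked(inner), gc.is_tracked(outer))   # eventually: False False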
Untracking of certain containers was introduced in issue #4688, and the algorithm was refined in response to issue #14775.
(See the linked issues to see the real code that was introduced to allow untracking)
This comment is a bit ambiguous; however, note that it does not say the algorithm to choose which objects to "untrack" applies to generic containers. This means that the code checks only exact tuples (and dicts), not their subclasses.
You can see this in the code of the file:
/* Try to untrack all currently tracked dictionaries */
static void
untrack_dicts(PyGC_Head *head)
{
    PyGC_Head *next, *gc = head->gc.gc_next;
    while (gc != head) {
        PyObject *op = FROM_GC(gc);
        next = gc->gc.gc_next;
        if (PyDict_CheckExact(op))
            _PyDict_MaybeUntrack(op);
        gc = next;
    }
}
Note the call to PyDict_CheckExact, and:
static void
move_unreachable(PyGC_Head *young, PyGC_Head *unreachable)
{
    PyGC_Head *gc = young->gc.gc_next;
    /* omissis */
        if (PyTuple_CheckExact(op)) {
            _PyTuple_MaybeUntrack(op);
        }
Note the call to PyTuple_CheckExact.
Also note that a subclass of tuple need not be immutable. This means that if you wanted to extend this mechanism beyond tuple and dict, you'd need a generic is_immutable function. This would be really expensive, if possible at all given Python's dynamism (e.g. the methods of a class may change at runtime, while this is not possible for tuple because it is a built-in type). Hence the devs chose to special-case only a few well-known built-ins.
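A minimal sketch of why a generic check would be unsafe (the FlexibleTuple name is mine, purely for illustration): a plain tuple subclass without __slots__ gets a __dict__, so an instance can acquire mutable state, even a reference cycle, after creation:

import gc

class FlexibleTuple(tuple):
    pass                    # no __slots__, so instances get a __dict__

t = FlexibleTuple((1, 2))
t.loop = t                  # a reference cycle through an instance attribute
print(gc.is_tracked(t))     # True, and it must stay tracked to be collectable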
This said, I believe they could special-case namedtuples too, since they are pretty simple classes. There would be some issues: for example, when you call namedtuple you are creating a new class, so the GC would have to check for a subclass.
And this might be a problem with code like:
class MyTuple(namedtuple('A', 'a b')):
# whatever code you want
pass
Because the MyTuple class need not be immutable, the GC would have to check that the class is a direct subclass of a namedtuple to be safe. However, I'm pretty sure there are workarounds for this situation.
They probably didn't because namedtuples are part of the standard library, not the Python core, and maybe the devs didn't want to make the core depend on a module of the standard library.
So, to answer your question:
No, there is nothing in their implementation that inherently prevents untracking for namedtuples.
No, I believe they did not simply overlook this, though only the Python devs could give a definitive answer as to why they chose not to include them. My guess is that they didn't think it would provide a big enough benefit to justify the change, and they didn't want to make the core dependent on the standard library.
@Bakunu gave an excellent answer - accept it :-)
A gloss here: No untracking gimmick is "free": there are real costs, in both runtime and explosion of tricky code to maintain. The base tuple and dict types are very heavily used, both by user programs and by the CPython implementation, and it's very often possible to untrack them. So special-casing them is worth some pain, and benefits "almost all" programs. While it's certainly possible to find examples of programs that would benefit from untracking namedtuples (or ...) too, it wouldn't benefit the CPython implementation or most user programs. But it would impose costs on all programs (more conditionals in the gc code to ask "is this a namedtuple?", etc).
Note that all container objects benefit from CPython's "generational" cyclic gc gimmicks: the more collections a given container survives, the less often that container is scanned (because the container is moved to an "older generation", which is scanned less often). So there's little potential gain unless a container type occurs in great numbers (often true of tuples, rarely true of dicts) or a container contains a great many objects (often true of dicts, rarely true of tuples).
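For reference, the generational thresholds are visible and tunable from the gc module. A quick sketch (the exact numbers vary by version and configuration):

import gc

# Thresholds for generations 0, 1 and 2; objects surviving a collection are
# promoted to an older generation, which is scanned less often.
print(gc.get_threshold())   # typically (700, 10, 10)
print(gc.get_count())       # current allocation counts per generation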
I want a Roach class to "die" when it reaches a certain amount of "hunger", but I don't know how to delete the instance. I may be making a mistake with my terminology, but what I mean is that I have a ton of "roaches" on the window and I want specific ones to disappear entirely.
I would show you the code, but it's quite long. I have the Roach instances being appended into a Mastermind class's roach-population list.
In general:
Each binding of a variable to an object increases the object's internal reference counter.
There are several usual ways a reference is dropped (the variable -> object binding is removed):
exiting the block of code where the variable was declared (first used)
destroying an object releases the references held by its attributes/method variables
calling del variable also deletes the binding in the current context
After all references to an object are removed (counter == 0) it becomes a good candidate for garbage collection, but it is not guaranteed that it will be processed (reference here):
CPython currently uses a reference-counting scheme with (optional) delayed detection of cyclically linked garbage, which collects most objects as soon as they become unreachable, but is not guaranteed to collect garbage containing circular references. See the documentation of the gc module for information on controlling the collection of cyclic garbage. Other implementations act differently and CPython may change. Do not depend on immediate finalization of objects when they become unreachable (ex: always close files).
to find out how many references to an object exist, use sys.getrefcount (see the sketch after this list)
the module for configuring/inspecting garbage collection is gc
the GC will call an object's __del__ method when destroying it (additional reference here)
some immutable objects like strings are handled in a special way - e.g. if two variables contain the same string, they may reference the same object, but not always - see identifying objects, why does the returned value from id(...) change?
the id of an object can be found with the builtin function id
the memory_profiler module looks interesting - a module for monitoring memory usage of a Python program
there are lots of useful resources on the topic; one example: Find all references to an object in python
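A small sketch of the reference counter in action (note that getrefcount itself temporarily adds one reference, via its argument):

import sys

x = []
print(sys.getrefcount(x))   # 2: the name x plus getrefcount's own argument
y = x
print(sys.getrefcount(x))   # 3: the extra binding y -> object
del y
print(sys.getrefcount(x))   # back to 2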
You cannot force a Python object to be deleted; it will be deleted when nothing references it (or when it's in a cycle only referred to by the items in the cycle). You will have to tell your "Mastermind" to erase its reference.
del somemastermind.roaches[n]
for i, roach in enumerate(roachpopulation_list):
    if roach.hunger == 100:
        del roachpopulation_list[i]
        break
Remove the instance by deleting it from your population list (the list containing all the roach instances).
If your Roaches are Sprites created in Pygame, then a simple call to .kill() would remove the instance.
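If you instead want to purge every starved roach in one pass, rebuilding the list sidesteps the classic pitfall of deleting from a list while iterating over it. A sketch reusing the hunger attribute and list from the question:

roachpopulation_list = [r for r in roachpopulation_list
                        if r.hunger < 100]   # keep only roaches still alive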
Just wondering what the logic behind this one is? On the surface it seems kind of inefficient, that every time you do something simple like "x=x+1" that it has to take a new address and discard the old one.
A Python variable (called an identifier or name in Python) is a reference to a value. The id() function tells you something about that value, not the name.
Many values are not mutable; integers, strings, floats all do not change in place. When you add 1 to another integer, you return a new integer that then replaces the reference to the old value.
You can look at Python names as labels, tied to values. If you imagine values as balloons, you are retying the label to a new balloon each time you assign to that name. If no other labels are attached to a balloon anymore, it simply drifts away in the wind, never to be seen again. The id() function gives you a unique number for that balloon.
See this previous answer of mine where I talk a little bit more about that idea of values-as-balloons.
This may seem inefficient. For many often used and small values, Python actually uses a process called interning, where it will cache a stash of these values for re-use. None is such a value, as are small integers and the empty tuple (()). You can use the intern() function to do the same with strings you expect to use a lot.
But note that values are only cleaned up when their reference count (the number of 'labels') drops to 0. Loads of values are reused all over the place all the time, especially those interned integers and singletons.
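A quick illustration of interning (in Python 3 the intern() function lives in the sys module):

import sys

a = sys.intern('a longish string ' * 4)   # too long to be interned automatically
b = sys.intern('a longish string ' * 4)
print(a is b)                             # True: one shared, interned copy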
Because the basic types are immutable, every time you "modify" one, a new object has to be instantiated...
...which is perfectly fine, especially for thread-safe functions.
The = operator doesn't modify an object, it assigns the name to a completely different object, which may or may not already have an id.
For your example, integers are immutable; there's no way to add something to one and keep the same id.
And, in fact, small integers are interned, at least in CPython, so if you do:
x = 1
y = 2
x = x + 1
Then x and y may have the same id.
In Python, "primitive" types like ints and strings are immutable, which means they cannot be modified.
Python is actually quite efficient because, as @Wooble commented, "Very short strings and small integers are interned": if two variables reference the same (small) immutable value, their id is the same (reducing duplicated immutables).
>>> a = 42
>>> b = 5
>>> id(a) == id(b)
False
>>> b += 37
>>> id(a) == id(b)
True
The reason behind the use of immutable types is that it is a safe approach to concurrent access to those values.
At the end of the day it depends on a design choice.
Depending on your needs you can take more advantage of an implementation instead of another.
For instance, a different philosophy can be found in a somewhat similar language, Ruby, where those types that in Python are immutable, are not.
To be accurate, the assignment x = x + 1 doesn't modify the object that x is referencing; it just makes x point to another object whose value is x + 1.
To understand the logic behind, one needs to understand the difference between value semantics and reference semantics.
An object with value semantics means only its value matters, not its identity, while an object with reference semantics focuses on its identity (in Python, identity can be obtained from id(obj)).
Typically, value semantics implies immutability of the object. Conversely, if an object is mutable (i.e. can be changed in place), it has reference semantics.
Let's briefly explain the rationale behind this immutability.
Objects with reference semantics can be changed in-place without losing their original addresses/identities. This makes sense in that it's the identity of an object with reference semantics that makes itself distinguishable from other objects.
In contrast, an object with value-semantics should never change itself.
First, this is possible and reasonable in theory. Since only the value(not its identity) is significant, when a change is needed, it's safe to swap it to another identity with different value. This is called referential transparency. Be noted that this is impossible for the objects with reference semantics.
Secondly, this is beneficial in practice. As the OP thought, it seems inefficient to discard the old object each time a value changes, but most of the time it is more efficient than not doing so. For one thing, Python (like other languages) has interning/caching schemes so that fewer objects are created. What's more, if objects with value semantics were designed to be mutable, they would take much more space in most cases.
For example, Date has value semantics. If it were designed to be mutable, any method returning a date from an internal field would expose a handle to the outside world, which is risky (e.g. outside code could directly modify the internal field without going through the public interface). Similarly, if one passed a date object by reference to some function or method, the object could be modified there, which may not be expected. To avoid these kinds of side effects, one has to program defensively: instead of directly returning the inner date field, return a clone of it; instead of passing by reference, pass by value, which means extra copies are made. As one can imagine, this creates more objects than necessary. What's worse, the code becomes more complicated with all this extra cloning.
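In Python terms the same point might look like this (a hypothetical Meeting class; datetime.date is immutable, so the getter needs no defensive copy):

import datetime

class Meeting(object):
    def __init__(self, when):
        self._when = when       # a datetime.date, which is immutable

    @property
    def when(self):
        return self._when       # safe to hand out directly: no defensive copy,
                                # since callers cannot mutate a date in place

m = Meeting(datetime.date(2024, 1, 1))
d = m.when                      # the very object stored inside m, yet m's
                                # state cannot be corrupted through it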
In a word, immutability enforces the value-semantics, it usually involves less object creation, has less side-effects and less hassles, and is more test-friendly. Besides, immutable objects are inherently thread-safe, which means less locks and better efficiency in multithreading environment.
That's the reason why basic value-semantics data types like number, string, date and time are all immutable (well, string in C++ is an exception, which is why there is so much const string& stuff around to avoid strings being modified unexpectedly). As a lesson, Java made the mistake of designing the value-semantic classes Date, Point, Rectangle and Dimension as mutable.
As we know, objects in OOP have three characteristics: state, behavior and identity. Objects with value semantics are not typical objects in that their identities do not matter at all. Usually they are passive, and mostly used to describe other real, active objects(i.e. those with reference semantics). This is a good hint to distinguish between value semantics and reference semantics.
According to the official Python documentation for the weakref module the "primary use for weak references is to implement caches or mappings holding large objects,...". So, I used a WeakValueDictionary to implement a caching mechanism for a long running function. However, as it turned out, values in the cache never stayed there until they would actually be used again, but needed to be recomputed almost every time. Since there were no strong references between accesses to the values stored in the WeakValueDictionary, the GC got rid of them (even though there was absolutely no problem with memory).
Now, how am I then supposed to use the weak reference stuff to implement a cache? If I keep strong references somewhere explicitly to keep the GC from deleting my weak references, there would be no point using a WeakValueDictionary in the first place. There should probably be some option to the GC that tells it: delete everything that has no references at all and everything with weak references only when memory is running out (or some threshold is exceeded). Is there something like that? Or is there a better strategy for this kind of cache?
I'll attempt to answer your inquiry with an example of how to use the weakref module to implement caching. We'll keep our cache's weak references in a weakref.WeakValueDictionary, and the strong references in a collections.deque because it has a maxlen property that controls how many objects it holds on to. Implemented in function closure style:
import weakref, collections

def createLRUCache(factory, maxlen=64):
    weak = weakref.WeakValueDictionary()
    strong = collections.deque(maxlen=maxlen)
    notFound = object()
    def fetch(key):
        value = weak.get(key, notFound)
        if value is notFound:
            weak[key] = value = factory(key)
        strong.append(value)
        return value
    return fetch
The deque object will only keep the last maxlen entries, simply dropping references to the old entries once it reaches capacity. When the old entries are dropped and garbage collected by python, the WeakValueDictionary will remove those keys from the map. Hence, the combination of the two objects helps us keep only maxlen entries in our LRU cache.
class Silly(object):
    def __init__(self, v):
        self.v = v

def fib(i):
    if i > 1:
        return Silly(_fibCache(i-1).v + _fibCache(i-2).v)
    elif i:
        return Silly(1)
    else:
        return Silly(0)

_fibCache = createLRUCache(fib)
It looks like there is no way to overcome this limitation, at least in CPython 2.7 and 3.0.
Reflecting on the createLRUCache() solution:
The solution with createLRUCache(factory, maxlen=64) does not meet my expectations. The idea of binding to maxlen is something I would like to avoid: it would force me to specify some non-scalable constant here, or to invent a heuristic for deciding which constant suits this or that host's memory limits.
I would prefer that the GC eliminate unreferenced values from the WeakValueDictionary not straight away, but only on the condition used for a regular GC run:
When the number of allocations minus the number of deallocations exceeds threshold0, collection starts.
Why are mutable strings slower than immutable strings?
EDIT:
>>> import UserString
... def test():
...     s = UserString.MutableString('Python')
...     for i in range(3):
...         s[0] = 'a'
...
... if __name__=='__main__':
...     from timeit import Timer
...     t = Timer("test()", "from __main__ import test")
...     print t.timeit()
13.5236170292
>>> import UserString
... def test():
...     s = UserString.MutableString('Python')
...     s = 'abcd'
...     for i in range(3):
...         s = 'a' + s[1:]
...
... if __name__=='__main__':
...     from timeit import Timer
...     t = Timer("test()", "from __main__ import test")
...     print t.timeit()
6.24725079536
>>> import UserString
... def test():
...     s = UserString.MutableString('Python')
...     for i in range(3):
...         s = 'a' + s[1:]
...
... if __name__=='__main__':
...     from timeit import Timer
...     t = Timer("test()", "from __main__ import test")
...     print t.timeit()
38.6385951042
I think it is obvious why I put s = UserString.MutableString('Python') in the second test.
In a hypothetical language that offers both mutable and immutable, otherwise equivalent, string types (I can't really think of one offhand -- e.g., Python and Java both have immutable strings only, and other ways to make one through mutation which add indirectness and therefore can of course slow things down a bit;-), there's no real reason for any performance difference -- for example, in C++, interchangeably using a std::string or a const std::string I would expect to cause no performance difference (admittedly a compiler might be able to optimize code using the latter better by counting on the immutability, but I don't know any real-world ones that do perform such theoretically possible optimizations;-).
Having immutable strings may and does in fact allow very substantial optimizations in Java and Python. For example, if the strings get hashed, the hash can be cached, and will never have to be recomputed (since the string can't change) -- that's especially important in Python, which uses hashed strings (for look-ups in sets and dictionaries) so lavishly and even "behind the scenes". Fresh copies never need to be made "just in case" the previous one has changed in the meantime -- references to a single copy can always be handed out systematically whenever that string is required. Python also copiously uses "interning" of (some) strings, potentially allowing constant-time comparisons and many other similarly fast operations -- think of it as one more way, a more advanced one to be sure, to take advantage of strings' immutability to cache more of the results of operations often performed on them.
That's not to say that a given compiler is going to take advantage of all possible optimizations, of course. For example, when a slice of a string is requested, there is no real need to make a new object and copy the data over -- the new slice might refer to the old one with an offset (and an independently stored length), potentially a great optimization for big strings out of which many slices are taken. Python doesn't do that because, unless particular care is taken in memory management, this might easily result in the "big" string being all kept in memory when only a small slice of it is actually needed -- but it's a tradeoff that a different implementation might definitely choose to perform (with that burden of extra memory management, to be sure -- more complex, harder-to-debug compiler and runtime code for the hypothetical language in question).
I'm just scratching the surface here -- and many of these advantages would be hard to keep if otherwise interchangeable string types could exist in both mutable and immutable versions (which I suspect is why, to the best of my current knowledge at least, C++ compilers actually don't bother with such optimizations, despite being generally very performance-conscious). But by offering only immutable strings as the primitive, fundamental data type (and thus implicitly accepting some disadvantage when you'd really need a mutable one;-), languages such as Java and Python can clearly gain all sorts of advantages -- performance issues being only one group of them (Python's choice to allow only immutable primitive types to be hashable, for example, is not a performance-centered design decision -- it's more about clarity and predictability of behavior for sets and dictionaries!-).
I don't know if they are really a lot slower, but they make thinking about programming easier a lot of the time, because the state of the object/string can't change. That's the most important property of immutability to me.
Furthermore, you might assume that immutable strings are faster because they have less state (which could change), which can mean lower memory consumption and fewer CPU cycles.
I also found this interesting article while googling which I would like to quote:
knowing that a string is immutable makes it easy to lay it out at construction time — fixed and unchanging storage requirements
with an immutable string, Python can intern it and refer to it internally by its address in memory. This means that to compare two strings, it only has to compare their addresses in memory (unless one of them isn't interned). Also, keep in mind that not all strings are interned; I've seen examples of constructed strings that are not.
with mutable strings, string comparison would involve comparing them character by character and would also require either storing identical strings in different locations (malloc is not free) or adding logic to keep track of how many times a given string is referred to and making a copy for every mutation if there were more than one referrer.
It seems like Python is optimized for string comparison. This makes sense because even string manipulation involves string comparison in most cases, so for most use cases it's the lowest common denominator.
Another advantage of immutable strings is that it makes it possible for them to be hashable, which is a requirement for using them as dictionary keys. Imagine a scenario where they were mutable:
s = 'a'
d = {s : 1}
s = s + 'b'
d[s] = ?
I suppose Python could keep track of which dicts have which strings as keys and update all of their hash tables when a string was modified, but that would just add more overhead to dict insertion. It's not too far off the mark to say that you can't do anything in Python without a dict insertion/lookup, so that would be very, very bad. It also adds overhead to string manipulation.
The obvious answer to your question is that normal strings are implemented in C, while MutableString is implemented in Python.
Not only does every operation on a mutable string have the overhead of going through one or more Python function calls, but the implementation is essentially a wrapper round an immutable string - when you modify the string it creates a new immutable string and throws the old one away. You can read the source in the UserString.py file in your Python lib directory.
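The pattern inside UserString.py is roughly this (paraphrased, not a verbatim copy; MutableString was removed entirely in Python 3):

class MutableString(object):
    def __init__(self, string=''):
        self.data = string      # an ordinary immutable str underneath

    def __setitem__(self, index, sub):
        # "mutating" one character really builds a brand-new immutable string
        # and rebinds self.data to it; the old string is simply discarded
        self.data = self.data[:index] + sub + self.data[index + 1:]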
To quote the Python docs:
Note: This UserString class from this module is available for backward compatibility only. If you are writing code that does not need to work with versions of Python earlier than Python 2.2, please consider subclassing directly from the built-in str type instead of using UserString (there is no built-in equivalent to MutableString).
This module defines a class that acts as a wrapper around string objects. It is a useful base class for your own string-like classes, which can inherit from them and override existing methods or add new ones. In this way one can add new behaviors to strings.
It should be noted that these classes are highly inefficient compared to real string or Unicode objects; this is especially the case for MutableString.
(Emphasis added).
In my case, I have a dictionary of about 6000 instantiated classes, where each class has 1000 attributed variables all of type string or list of strings. As I build this dictionary up, my RAM goes up super high. Is there a way to write the dictionary as it is being built to the harddrive rather than the RAM so that I can save some memory? I've heard of something called "pickle" but I don't know if this is a feasible method for what I am doing.
Thanks for your help!
Maybe you should be using a database, but check out the shelve module
If shelve isn't powerful enough for you, there is always the industrial strength ZODB
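A minimal shelve sketch (Record is a hypothetical stand-in for the classes from the question; note that pickled classes must be importable when the shelf is reopened):

import shelve

class Record(object):            # hypothetical stand-in for the OP's classes
    def __init__(self, name):
        self.name = name

d = shelve.open('bigdict.db')    # a persistent, dict-like object on disk
d['item-1'] = Record('first')    # the value is pickled to disk, not held in RAM
print(d['item-1'].name)          # unpickled again on access
d.close()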
shelve, as @gnibbler recommends, is what I would no doubt be using, but watch out for two traps: a simple one (all keys must be strings) and a subtle one (as the values don't normally exist in memory, calling mutators on them may not work as you expect).
For the simple problem, it's normally easy to find a workaround (and you do get a clear exception if you forget and try e.g. using an int or whatever as the key, so it's not hard to remember that you do need a workaround).
For the subtle problem, consider for example:
x = d['foo']
x.amutatingmethod()
...much later...
y = d['foo']
# is y "mutated" or not now?
The answer to the question in the last comment depends on whether d is a real dict (in which case y will be mutated, and in fact exactly the same object as x) or a shelf (in which case y will be a distinct object from x, in exactly the state you last saved to d['foo']!).
To get your mutations to persist, you need to "save them to disk" by doing
d['foo'] = x
after calling any mutators you want on x (so in particular you cannot just do
d['foo'].mutator()
and expect the mutation to "stick", as you would if d were a dict).
shelve does have an option to cache all fetched items in memory, but of course that can fill up the memory again, and result in long delays when you finally close the shelf object (since all the cached items must be saved back to disk then, just in case they had been mutated). That option was something I originally pushed for (as a Python core committer), but I've since changed my mind and I now apologize for getting it in (ah well, at least it's not the default!-), since the cases it should be used in are rare, and it can often trap the unwary user... sorry.
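That option is the writeback flag; a short sketch of the trade-off:

import shelve

d = shelve.open('data.db', writeback=True)   # cache fetched entries in memory
d['foo'] = []
d['foo'].append(1)   # with writeback=True this mutation does "stick"...
d.close()            # ...because every cached entry is written back here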
BTW, in case you don't know what a mutator, or "mutating method", is, it's any method that alters the state of the object you call it on -- e.g. .append if the object is a list, .pop if the object is any kind of container, and so on. No need to worry if the object is immutable, of course (numbers, strings, tuples, frozensets, ...), since it doesn't have mutating methods in that case;-).
Pickling an entire hash over and over again is bound to run into the same memory pressures that you're facing now -- maybe even worse, with all the data marshaling back and forth.
Instead, using an on-disk database that acts like a hash is probably the best bet; see this page for a quick introduction to using dbm-style databases in your program: http://docs.python.org/library/dbm
They act enough like hashes that it should be a simple transition for you.
"""I have a dictionary of about 6000 instantiated classes, where each class has 1000 attributed variables all of type string or list of strings""" ... I presume that you mean: """I have a class with about 1000 attributes all of type str or list of str. I have a dictionary mapping about 6000 keys of unspecified type to corresponding instances of that class.""" If that's not a reasonable translation, please correct it.
For a start, 1000 attributes in a class is mindboggling. You must be treating the vast majority generically using value = getattr(obj, attr_name) and setattr(obj, attr_name, value). Consider using a dict instead of an instance: value = obj[attr_name] and obj[attr_name] = value.
Secondly, what percentage of those 6 million attributes are ""? If sufficiently high, you might like to consider implementing a sparse dict which doesn't physically have entries for those attributes, using the __missing__ hook -- docs here.
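A minimal sketch of that idea (SparseAttrs is a hypothetical name; __missing__ is invoked by dict.__getitem__ for absent keys):

class SparseAttrs(dict):
    """Missing 'attributes' behave as empty strings, without an entry each."""
    def __missing__(self, key):
        return ''

obj = SparseAttrs(name='widget')
print(obj['name'])     # 'widget'
print(obj['colour'])   # '' -- nothing is physically stored for this key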