Problems with the GC when using a WeakValueDictionary for caches - python

According to the official Python documentation for the weakref module the "primary use for weak references is to implement caches or mappings holding large objects,...". So, I used a WeakValueDictionary to implement a caching mechanism for a long running function. However, as it turned out, values in the cache never stayed there until they would actually be used again, but needed to be recomputed almost every time. Since there were no strong references between accesses to the values stored in the WeakValueDictionary, the GC got rid of them (even though there was absolutely no problem with memory).
Now, how am I then supposed to use the weak reference stuff to implement a cache? If I keep strong references somewhere explicitly to keep the GC from deleting my weak references, there would be no point using a WeakValueDictionary in the first place. There should probably be some option to the GC that tells it: delete everything that has no references at all and everything with weak references only when memory is running out (or some threshold is exceeded). Is there something like that? Or is there a better strategy for this kind of cache?

I'll attempt to answer your inquiry with an example of how to use the weakref module to implement caching. We'll keep our cache's weak references in a weakref.WeakValueDictionary, and the strong references in a collections.deque because it has a maxlen property that controls how many objects it holds on to. Implemented in function closure style:
import weakref, collections

def createLRUCache(factory, maxlen=64):
    weak = weakref.WeakValueDictionary()
    strong = collections.deque(maxlen=maxlen)
    notFound = object()

    def fetch(key):
        value = weak.get(key, notFound)
        if value is notFound:
            weak[key] = value = factory(key)
        strong.append(value)
        return value

    return fetch
The deque object will only keep the last maxlen entries, simply dropping references to the old entries once it reaches capacity. When the old entries are dropped and garbage collected by python, the WeakValueDictionary will remove those keys from the map. Hence, the combination of the two objects helps us keep only maxlen entries in our LRU cache.
class Silly(object):
    def __init__(self, v):
        self.v = v

def fib(i):
    if i > 1:
        return Silly(_fibCache(i-1).v + _fibCache(i-2).v)
    elif i:
        return Silly(1)
    else:
        return Silly(0)

_fibCache = createLRUCache(fib)
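A quick check of the cache in action (assuming the definitions above):

print(_fibCache(10).v)  # 55 -- computed via the factory on the first call
print(_fibCache(10).v)  # 55 -- served straight from the cache afterwards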

It looks like there is no way to overcome this limitation, at least in CPython 2.7 and 3.0.
Reflecting on the createLRUCache() solution:
The createLRUCache(factory, maxlen=64) solution does not match my expectations. Binding to maxlen is something I would like to avoid: it forces me to either pick some non-scalable constant or invent a heuristic for deciding which constant suits a given host's memory limits.
I would prefer that the GC remove unreferenced values from the WeakValueDictionary not straight away, but under the same condition that triggers a regular GC run:
When the number of allocations minus the number of deallocations exceeds threshold0, collection starts.
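There is no such option: CPython's cyclic collector is triggered by allocation counts, not by memory pressure, and a weakly-referenced object dies as soon as its reference count hits zero regardless of any threshold. The thresholds themselves can at least be inspected and tuned; a small sketch (the new value below is only illustrative):

import gc

print(gc.get_threshold())        # defaults to (700, 10, 10)
gc.set_threshold(10000, 10, 10)  # fewer generation-0 collections, but this
                                 # does not keep weakly-referenced objects alive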

Strange behavior with `weakref` in IPython

While coding a cache class for one of my projects I wanted to try out the weakref package as its functionality seems to fit this purpose very well. The class is supposed to cache blocks of data from disk as readable and writable buffers for ctypes.Structures. The blocks of data are supposed to be discarded when no structure is pointing to them, unless the buffer was modified due to some change to the structures.
To prevent dirty blocks from being garbage collected my idea was to set block.some_attr_name = block in the structures' __setattr__. Even when all structures are eventually garbage collected, the underlying block of data still has a reference count of at least 1 because block.some_attr_name references block.
I wanted to test this idea, so I opened up an IPython session and typed
import weakref

class Test:
    def __init__(self):
        self.self = self

ref = weakref.ref(Test(), lambda r: print("Test was trashed"))
As expected, this didn't print Test was trashed. But when I went to type del ref().self to see whether the referent will be discarded, while typing, before hitting Enter, Test was trashed appeared. Oddly enough, even hitting the arrow keys or resizing the command line window after assigning ref will cause the referent to be trashed, even though the referent's reference count cannot drop to zero because it is referencing itself. This behavior persists even if I artificially increase the reference count by replacing self.self = self with self.refs = [self for i in range(20)].
I couldn't reproduce this behavior in the standard python.exe interpreter (interactive session) which is why I assume this behavior to be tied to IPython (but I am not actually sure about this).
Is this behavior expected with the devil hiding somewhere in the details of IPython's implementation or is this behavior a bug?
Edit 1: It gets stranger. If in the IPython session I run
import weakref

class Test:
    def __init__(self):
        self.self = self

test = Test()
ref = weakref.ref(test, lambda r: print("Aaaand it's gone...", flush=True))
del test
the referent is not trashed immediately. But if I hold down any key, "typing" out "aaaa..." (~200 a's), suddenly Aaaand it's gone... appears. And since I added flush = True I can rule out buffering for the late response. I definitely wouldn't expect IPython to be decreasing reference counts just because I was holding down a key. Maybe Python itself checks for circular references in some garbage collection cycles?
(tested with IPython 7.30.1 running Python 3.10.1 on Windows 10 x64)
In Python's documentation on Extending and Embedding the Python Interpreter under subsection 1.10 Reference Counts the second to last paragraph reads:
While Python uses the traditional reference counting implementation, it also offers a cycle detector that works to detect reference cycles. This allows applications to not worry about creating direct or indirect circular references; these are the weakness of garbage collection implemented using only reference counting. Reference cycles consist of objects which contain (possibly indirect) references to themselves, so that each object in the cycle has a reference count which is non-zero. Typical reference counting implementations are not able to reclaim the memory belonging to any objects in a reference cycle, or referenced from the objects in the cycle, even though there are no further references to the cycle itself.
So I guess my idea of circular references to prevent garbage collection from eating my objects won't work out.
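That is indeed what happens: the cyclic collector runs when allocation counts cross its thresholds, and IPython's interactive machinery presumably allocates enough objects per key event to trip them. The effect can be reproduced deterministically by forcing a collection:

import gc
import weakref

class Test:
    def __init__(self):
        self.self = self  # cycle keeps the refcount above zero

ref = weakref.ref(Test(), lambda r: print("Test was trashed"))
gc.collect()  # the cycle detector reclaims the unreachable instance
# -> prints "Test was trashed"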

Garbage collection for a simple python class

I am writing a python class like this:
class MyImageProcessor:
    def __init__(self, image, metadata):
        self.image = image
        self.metadata = metadata
Both image and metadata are objects of a class written by a colleague. Now I need to make sure there is no waste of memory. I am thinking of defining a quit() method like this,
def quit(self):
    self.image = None
    self.metadata = None
    import gc
    gc.collect()
and suggest users call quit() systematically. I would like to know whether this is the right way. In particular, do the instructions in quit() above guarantee that the unused memory is properly collected?
Alternatively, I could rename quit() to the built-in __exit__() and suggest users use the "with" syntax. But my question is more about whether the instructions in quit() indeed fulfill the garbage collection work one would need in this situation.
Thank you for your help.
In Python every object has a built-in reference count; the variables (names) you create are only pointers to objects. Objects can be mutable or immutable (for example, if you change the value of an integer, the name is pointed at a different integer object, while changing a list element does not rebind the list's name).
The reference count basically counts how many names refer to that data, and it is incremented/decremented automatically.
The garbage collector will destroy objects with zero references (actually not always, as it takes extra steps to save time). You should check out this article.
Similarly to object constructors (__init__()), which are called on object creation, you can define destructors (__del__()), which are executed on object deletion (usually when the reference count drops to 0). According to this article, in Python they are not needed as much as in C++, because Python has a garbage collector that handles memory management automatically. You can check out those examples too.
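A minimal destructor illustration (plain CPython, where reclamation happens as soon as the refcount hits zero):

class Resource:
    def __del__(self):
        # runs when the instance's reference count drops to zero
        print("Resource reclaimed")

r = Resource()
del r  # -> prints "Resource reclaimed" immediately in CPython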
Hope it helps :)
No need for quit() (assuming you're using CPython).
Python uses two methods of garbage collection, as alluded to in the other answers.
First, there's reference counting. Essentially each time you add a reference to an object it gets incremented & each time you remove the reference (e.g., it goes out of scope) it gets decremented.
From https://devguide.python.org/garbage_collector/:
When an object’s reference count becomes zero, the object is deallocated. If it contains references to other objects, their reference counts are decremented. Those other objects may be deallocated in turn, if this decrement makes their reference count become zero, and so on.
You can get information about current reference counts for an object using sys.getrefcount(x), but really, why bother.
The second way is through garbage collection (gc). [Reference counting is a type of garbage collection, but Python specifically calls this second method "garbage collection" -- so we'll also use this terminology.] This is intended to find those places where the reference count is not zero, but the object is no longer accessible ("reference cycles"). For example:
class MyObj:
    pass

x = MyObj()
x.self = x
Here, x refers to itself, so the actual reference count for x is more than 1. You can call del x but that merely removes it from your scope: it lives on because "someone" still has a reference to it.
gc, and specifically gc.collect() goes through objects looking for cycles like this and, when it finds an unreachable cycle (such as your x post deletion), it will deallocate the whole lot.
Back to your question: You don't need to have a quit() method, because as soon as your MyImageProcessor object goes out of scope, it will decrement the reference counters for image and metadata. If that puts them to zero, they're deallocated. If that doesn't, well, someone else is using them.
Setting them to None first merely decrements the reference counts right then; when MyImageProcessor goes out of scope, it won't decrement those reference counts again, because MyImageProcessor no longer holds the image or metadata objects! So you're just explicitly doing what Python already does for you for free: no more, no less.
You didn't create a cycle, so your calling gc.collect() is unlikely to change anything.
Check out https://devguide.python.org/garbage_collector/ if you are interested in more earthy details.
Not sure if it makes sense, but to my logic you could use gc.get_count() before and after gc.collect() to see whether something has been removed. See also:
what are count0, count1 and count2 values returned by the Python gc.get_count()
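A small sketch of that check, using a self-referencing object so that only the cyclic collector can reclaim it:

import gc

class Node:
    def __init__(self):
        self.self = self  # reference cycle

n = Node()
del n  # refcount never reaches zero, so only the collector can free this

print(gc.get_count())  # objects tracked per generation, e.g. (451, 1, 0)
print(gc.collect())    # number of unreachable objects found and freed
print(gc.get_count())  # counters are reset after the collection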

Is it safe to give a python WeakSet to a list constructor?

The question Safely iterating over WeakKeyDictionary and WeakValueDictionary did not put me at ease as I had hoped, and it's old enough that it's worth asking again rather than commenting.
Suppose I have a class MyHashable that's hashable, and I want to build a WeakSet:
from weakref import WeakSet

obj1 = MyHashable()
obj2 = MyHashable()
obj3 = MyHashable()
obj2.cycle_sibling = obj3
obj3.cycle_sibling = obj2
ws = WeakSet([obj1, obj2, obj3])
Then I delete some local variables, and convert to a list in preparation for a later loop:
del obj2
del obj3
list_remaining = list(ws)
The question I cite seems to claim this is just fine, but even without any kind of explicit for loop, have I not already risked the cyclic garbage collector kicking in during the constructor of list_remaining and changing the size of the set? I would expect this problem to be rare enough that it would be difficult to detect experimentally, but could crash my program once in a blue moon.
I don't even feel like the various commenters on that post really came to an agreement whether something like
for obj in list(ws):
...
was ok, but they did all seem to assume that list(ws) itself can run all the way through without crashing, and I'm not even convinced of that. Does the list constructor avoid using iterators somehow and thus not care about set size changes? Can garbage collection not occur during a list constructor because list is built-in?
For the moment I've written my code to destructively pop items out of the WeakSet, thus avoiding iterators altogether. I don't mind doing it destructively because at that point in my code I'm done with the WeakSet anyway. But I don't know if I'm being paranoid.
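For concreteness, the destructive drain looks something like this (a self-contained sketch; the print stands in for the real per-object work):

from weakref import WeakSet

class MyHashable:
    pass

objs = [MyHashable() for _ in range(3)]  # strong refs keep them alive
ws = WeakSet(objs)

while True:
    try:
        obj = ws.pop()  # removes and returns an arbitrary live member
    except KeyError:
        break           # the set is empty
    print(obj)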
The docs are frustratingly lacking in information on this, but looking at the implementation, we can see that WeakSet.__iter__ has a guard against this kind of problem.
During iteration over a WeakSet, weakref callbacks will add references to a list of pending removals rather than removing references from the underlying set directly. If an element dies before iteration reaches it, the iterator won't yield the element, but you're not going to get a segfault or a RuntimeError: Set changed size during iteration or anything.
Here's the guard (not threadsafe, despite what the comment says):
class _IterationGuard:
    # This context manager registers itself in the current iterators of the
    # weak container, such as to delay all removals until the context manager
    # exits.
    # This technique should be relatively thread-safe (since sets are).

    def __init__(self, weakcontainer):
        # Don't create cycles
        self.weakcontainer = ref(weakcontainer)

    def __enter__(self):
        w = self.weakcontainer()
        if w is not None:
            w._iterating.add(self)
        return self

    def __exit__(self, e, t, b):
        w = self.weakcontainer()
        if w is not None:
            s = w._iterating
            s.remove(self)
            if not s:
                w._commit_removals()
Here's where __iter__ uses the guard:
class WeakSet:
    ...
    def __iter__(self):
        with _IterationGuard(self):
            for itemref in self.data:
                item = itemref()
                if item is not None:
                    # Caveat: the iterator will keep a strong reference to
                    # `item` until it is resumed or closed.
                    yield item
And here's where the weakref callback checks the guard:
def _remove(item, selfref=ref(self)):
    self = selfref()
    if self is not None:
        if self._iterating:
            self._pending_removals.append(item)
        else:
            self.data.discard(item)
You can also see the same guard used in WeakKeyDictionary and WeakValueDictionary.
On old Python versions (3.0, or 2.6 and earlier), this guard is not present. If you need to support 2.6 or earlier, it looks like it should be safe to use keys, values, and items with the weak dict classes; I list no option for WeakSet because WeakSet didn't exist back then. If there's a safe, non-destructive option on 3.0, I haven't found one, but hopefully no one needs to support 3.0.

Why are named tuples always tracked by python's GC?

As we (or at least I) learned in this answer simple tuples that only contain immutable values are not tracked by python's garbage collector, once it figures out that they can never be involved in reference cycles:
>>> import gc
>>> x = (1, 2)
>>> gc.is_tracked(x)
True
>>> gc.collect()
0
>>> gc.is_tracked(x)
False
Why isn't this the case for namedtuples, which are a subclass of tuple from the collections module that features named fields?
>>> import gc
>>> from collections import namedtuple
>>> foo = namedtuple('foo', ['x', 'y'])
>>> x = foo(1, 2)
>>> gc.is_tracked(x)
True
>>> gc.collect()
0
>>> gc.is_tracked(x)
True
Is there something inherent in their implementation that prevents this or was it just overlooked?
The only comment about this that I could find is in the gcmodule.c file of the Python sources:
NOTE: about untracking of mutable objects.
Certain types of container cannot participate in a reference cycle, and so do not need to be tracked by the garbage collector.
Untracking these objects reduces the cost of garbage collections.
However, determining which objects may be untracked is not free,
and the costs must be weighed against the benefits for garbage
collection.
There are two possible strategies for when to untrack a container:

1. When the container is created.
2. When the container is examined by the garbage collector.
Tuples containing only immutable objects (integers, strings etc,
and recursively, tuples of immutable objects) do not need to be
tracked. The interpreter creates a large number of tuples, many of
which will not survive until garbage collection. It is therefore
not worthwhile to untrack eligible tuples at creation time.
Instead, all tuples except the empty tuple are tracked when
created. During garbage collection it is determined whether any
surviving tuples can be untracked. A tuple can be untracked if all
of its contents are already not tracked. Tuples are examined for
untracking in all garbage collection cycles. It may take more than
one cycle to untrack a tuple.
Dictionaries containing only immutable objects also do not need to
be tracked. Dictionaries are untracked when created. If a tracked
item is inserted into a dictionary (either as a key or value), the
dictionary becomes tracked. During a full garbage collection (all
generations), the collector will untrack any dictionaries whose
contents are not tracked.
The module provides the python function is_tracked(obj), which
returns the current tracking status of the object. Subsequent
garbage collections may change the tracking status of the object.
Untracking of certain containers was introduced in issue #4688, and the algorithm was refined in response to issue #14775.
(See the linked issues to see the real code that was introduced to allow untracking)
This comment is a bit ambiguous; however, it does not state that the algorithm for choosing which objects to "untrack" applies to generic containers. This means that the code checks only tuples (and dicts), not their subclasses.
You can see this in the code of the file:
/* Try to untrack all currently tracked dictionaries */
static void
untrack_dicts(PyGC_Head *head)
{
    PyGC_Head *next, *gc = head->gc.gc_next;
    while (gc != head) {
        PyObject *op = FROM_GC(gc);
        next = gc->gc.gc_next;
        if (PyDict_CheckExact(op))
            _PyDict_MaybeUntrack(op);
        gc = next;
    }
}
Note the call to PyDict_CheckExact, and:
static void
move_unreachable(PyGC_Head *young, PyGC_Head *unreachable)
{
    PyGC_Head *gc = young->gc.gc_next;
    /* omissis */
    if (PyTuple_CheckExact(op)) {
        _PyTuple_MaybeUntrack(op);
    }
Note the call to PyTuple_CheckExact.
Also note that a subclass of tuple need not be immutable. This means that if you wanted to extend this mechanism beyond tuple and dict you'd need a generic is_immutable function. This would be really expensive, if at all possible, due to Python's dynamism (e.g. methods of a class may change at runtime, while this is not possible for tuple because it is a built-in type). Hence the devs chose to special-case only a few well-known built-ins.
That said, I believe they could special-case namedtuples too, since they are pretty simple classes. There would be some issues: for example, when you call namedtuple you are creating a new class, hence the GC would have to check for a subclass.
And this might be a problem with code like:
class MyTuple(namedtuple('A', 'a b')):
    # whatever code you want
    pass
The MyTuple class need not be immutable, so the GC would have to check that the class is a direct subclass of namedtuple to be safe. However, I'm pretty sure there are workarounds for this situation.
They probably didn't because namedtuples are part of the standard library, not the python core. Maybe the devs didn't want to make the core dependent on a module of the standard library.
So, to answer your question:

1. No, there is nothing in their implementation that inherently prevents untracking for namedtuples.
2. No, I believe they did not "simply overlook" this. However, only the Python devs could give a clear answer as to why they chose not to include them. My guess is that they didn't think it would provide a big enough benefit for the change, and they didn't want to make the core dependent on the standard library.
@Bakunu gave an excellent answer - accept it :-)
A gloss here: No untracking gimmick is "free": there are real costs, in both runtime and explosion of tricky code to maintain. The base tuple and dict types are very heavily used, both by user programs and by the CPython implementation, and it's very often possible to untrack them. So special-casing them is worth some pain, and benefits "almost all" programs. While it's certainly possible to find examples of programs that would benefit from untracking namedtuples (or ...) too, it wouldn't benefit the CPython implementation or most user programs. But it would impose costs on all programs (more conditionals in the gc code to ask "is this a namedtuple?", etc).
Note that all container objects benefit from CPython's "generational" cyclic gc gimmicks: the more collections a given container survives, the less often that container is scanned (because the container is moved to an "older generation", which is scanned less often). So there's little potential gain unless a container type occurs in great numbers (often true of tuples, rarely true of dicts) or a container contains a great many objects (often true of dicts, rarely true of tuples).
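The generational behavior described above can be observed directly in CPython:

import gc

print(gc.get_threshold())  # (700, 10, 10) by default; generation 0 is scanned most often
print(gc.get_stats())      # per-generation counts of collections and collected objects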

I need to free up RAM by storing a Python dictionary on the hard drive, not in RAM. Is it possible?

In my case, I have a dictionary of about 6000 instantiated classes, where each class has 1000 attributed variables all of type string or list of strings. As I build this dictionary up, my RAM goes up super high. Is there a way to write the dictionary as it is being built to the harddrive rather than the RAM so that I can save some memory? I've heard of something called "pickle" but I don't know if this is a feasible method for what I am doing.
Thanks for your help!
Maybe you should be using a database, but check out the shelve module
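A minimal shelve sketch (the filename is arbitrary; note that keys must be strings, as a later answer explains):

import shelve

with shelve.open('instances.db') as d:  # dict-like, but backed by a file
    d['obj1'] = ['some', 'strings']     # pickled and written to disk
    print(d['obj1'])                    # unpickled back on demand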
If shelve isn't powerful enough for you, there is always the industrial strength ZODB
shelve, as @gnibbler recommends, is what I would no doubt be using, but watch out for two traps: a simple one (all keys must be strings) and a subtle one (as the values don't normally exist in memory, calling mutators on them may not work as you expect).
For the simple problem, it's normally easy to find a workaround (and you do get a clear exception if you forget and try e.g. using an int or whatever as the key, so it's not hard to remember that you do need a workaround).
For the subtle problem, consider for example:
x = d['foo']
x.amutatingmethod()
# ...much later...
y = d['foo']
# is y "mutated" or not now?
The answer to the question in the last comment depends on whether d is a real dict (in which case y will be mutated, and in fact exactly the same object as x) or a shelf (in which case y will be a distinct object from x, in exactly the state you last saved to d['foo']!).
To get your mutations to persist, you need to "save them to disk" by doing
d['foo'] = x
after calling any mutators you want on x (so in particular you cannot just do
d['foo'].mutator()
and expect the mutation to "stick", as you would if d were a dict).
shelve does have an option to cache all fetched items in memory, but of course that can fill up the memory again, and result in long delays when you finally close the shelf object (since all the cached items must be saved back to disk then, just in case they had been mutated). That option was something I originally pushed for (as a Python core committer), but I've since changed my mind and I now apologize for getting it in (ah well, at least it's not the default!-), since the cases it should be used in are rare, and it can often trap the unwary user... sorry.
BTW, in case you don't know what a mutator, or "mutating method", is, it's any method that alters the state of the object you call it on -- e.g. .append if the object is a list, .pop if the object is any kind of container, and so on. No need to worry if the object is immutable, of course (numbers, strings, tuples, frozensets, ...), since it doesn't have mutating methods in that case;-).
Pickling an entire hash over and over again is bound to run into the same memory pressures that you're facing now -- maybe even worse, with all the data marshaling back and forth.
Instead, using an on-disk database that acts like a hash is probably the best bet; see this page for a quick introduction to using dbm-style databases in your program: http://docs.python.org/library/dbm
They act enough like hashes that it should be a simple transition for you.
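A quick sketch of that approach with Python 3's dbm module (keys and values are stored as bytes):

import dbm

with dbm.open('attrs.db', 'c') as db:  # 'c' creates the file if needed
    db[b'name'] = b'spam'              # written straight to disk
    print(db[b'name'])                 # b'spam'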
"""I have a dictionary of about 6000 instantiated classes, where each class has 1000 attributed variables all of type string or list of strings""" ... I presume that you mean: """I have a class with about 1000 attributes all of type str or list of str. I have a dictionary mapping about 6000 keys of unspecified type to corresponding instances of that class.""" If that's not a reasonable translation, please correct it.
For a start, 1000 attributes in a class is mindboggling. You must be treating the vast majority generically using value = getattr(obj, attr_name) and setattr(obj, attr_name, value). Consider using a dict instead of an instance: value = obj[attr_name] and obj[attr_name] = value.
Secondly, what percentage of those 6 million attributes are ""? If sufficiently high, you might like to consider implementing a sparse dict which doesn't physically have entries for those attributes, using the __missing__ hook -- docs here.
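A sketch of that idea, assuming the empty string is the common default value:

class SparseAttrs(dict):
    # Attributes that were never set fall back to a default and are
    # never physically stored, saving memory for mostly-empty records.
    def __missing__(self, key):
        return ""

row = SparseAttrs()
row["name"] = "spam"
print(row["name"])     # 'spam'
print(row["comment"])  # ''  (no entry is stored for this key)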
