Are all Python objects tracked by the garbage collector?

I'm trying to debug a memory leak (see question Memory leak in Python Twisted: where is it?).
When the garbage collector is running, does it have access to all Python objects created by the Python interpreter? If we suppose Python C libraries are not leaking, should RSS memory usage grow linearly with respect to the GC object count? What about sys.getobjects?

CPython uses two mechanisms to clean up garbage. One is reference counting, which affects all objects but which can't clean up objects that (directly or indirectly) refer to each other. That's where the actual garbage collector comes in: Python has the gc module, which searches for cyclic references in objects it knows about. Only objects that can potentially be part of a reference cycle need to participate in the cyclic gc. So, for example, lists do, but strings do not; strings don't reference any other objects. (In fact, the story is a little more complicated, as there are two ways of participating in cyclic gc, but that isn't really relevant here.)
All Python classes (and instances thereof) automatically get tracked by the cyclic gc. Types defined in C aren't, unless they put in a little effort. All the builtin types that could be part of a cycle do. But this does mean the gc module only knows about the types that play along.
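The tracked/untracked distinction can be checked directly with gc.is_tracked(). A quick sketch (the class name Plain is made up for illustration):

```python
import gc

# Atomic objects that can never reference anything else are not tracked:
print(gc.is_tracked(42))        # False
print(gc.is_tracked("hello"))   # False

# Containers that could be part of a reference cycle are tracked:
print(gc.is_tracked([]))        # True

# Instances of Python classes play along automatically:
class Plain:
    pass

print(gc.is_tracked(Plain()))   # True
```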
Apart from the collection mechanism there's also the fact that Python has its own aggregating memory allocator (obmalloc), which allocates entire memory arenas and uses the memory for most of the smaller objects it creates. Python now does free these arenas when they're completely empty (for a long time it didn't), but actually emptying an arena is fairly rare: because CPython objects aren't movable, you can't just move some stragglers to another arena.

The RSS does not grow linearly with the number of Python objects, because Python objects can vary in size. An int object is usually much smaller than a big list.
I suppose that you mean gc.get_objects when you wrote sys.getobjects. This function gives you a list of all objects tracked by the collector. If you suspect a leak, you can iterate over this list and try to find objects that should already have been freed. (For instance, you might know that all objects of a certain type are to be freed at a certain point.)
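One way to put gc.get_objects() to work when hunting a leak is to tally live objects by type and watch which counts keep growing between snapshots. A minimal sketch:

```python
import gc
from collections import Counter

# Tally every object the collector knows about, by type name.
# Taking such a snapshot periodically and diffing the counters is a
# cheap way to spot a type whose population only ever grows.
counts = Counter(type(o).__name__ for o in gc.get_objects())
for name, n in counts.most_common(5):
    print(name, n)
```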

A Python class designed so that its instances cannot be involved in reference cycles is not tracked by the GC.
class V(object):
    __slots__ = ()
Instances of V cannot have any attributes. Their size is 16 bytes (on a 64-bit build), the same as object(). sys.getsizeof(V()) and V().__sizeof__() return the same value: 16.
V isn't useful by itself, but I imagine that other classes derived from base types (e.g. tuple) that only add methods can be crafted so that reference counting alone is enough to manage their memory.

Related

Accessing the memory heap in python

Is there a way to access the memory heap in Python? I'm interested in being able to access all of the objects allocated in memory of the running instance.
You can't get direct access, but the gc module should do most of what you want. A simple gc.get_objects() call will return all the objects tracked by the collector. This isn't everything, since the CPython garbage collector is only concerned with potential reference cycles; built-in types that can't refer to other objects (e.g. int, float, str) won't appear in the resulting list. But they will all be referenced by something in that list (if they weren't, their reference count would be zero and they'd already have been disposed of).
Aside from that, you might get some more targeted use out of the inspect module, especially stack-frame inspection: use the traceback module for easy formatting, or manually dig into the semi-documented frame objects themselves. Either approach lets you narrow the scope down to the objects reachable from a particular active frame on the call stack.
For the closest thing to a heap view, you could use the tracemalloc module to trace and record allocations as they happen, or the pdb debugger to do live introspection from the outside (possibly adding breakpoint() calls to your code to make it stop automatically at that point and let you look around).
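A minimal tracemalloc sketch (the workload list is a made-up stand-in for real allocations):

```python
import tracemalloc

tracemalloc.start()

# Made-up workload standing in for the application's real allocations:
data = [bytes(1000) for _ in range(1000)]

snapshot = tracemalloc.take_snapshot()
# Group the recorded allocations by source line, biggest first:
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)
```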

What is the reasoning behind Python using identifiers rather than variables, if there is any?

My previous question probably wasn't phrased the clearest. I didn't want to know what happened, I wanted to know why the language itself uses this philosophy. Was it an arbitrary choice, or is there some interesting history between this design choice?
The answers that the question was marked as a duplicate to simply stated that Python does this but didn't explain whether there was any reasoning behind it.
If you know C and C++, you know what pointers and references are. In Java (and, conceptually, in Python), you have two kinds of things: on one side the native numeric types (integers, characters and floating-point numbers), and on the other the complex types, which derive from a basic empty type, object.
In fact, the native types are the ones that can fit into a CPU register, and for that reason they are processed as values. But object (sub-)types often require a complex memory layout. For that reason, a register can only contain a pointer to them, and so they are processed as references. The nice thing about references, in languages that provide a garbage collector, is that they can be handled much like a C++ shared_ptr: the system maintains a reference count, and when the reference count reaches 0, the object can be freed.
C has a very limited notion of an object (the struct), and in the early K&R versions from the 1970s you could only process structs element by element, or as a whole with memcpy; you could neither return them from a function, nor assign them, nor pass them by value. The ability to pass structs by value was added in ANSI C during the 1980s, to make the language more object-friendly. C++, being an object-oriented language from the beginning, allowed passing objects by value, and the smart pointers shared_ptr and unique_ptr were added to the standard library to make it easy to use references to objects, because copying a large object is an expensive operation.
Python (like Java) is a post-C++ language, and it decided from the beginning that objects would be handled as references; in CPython they carry a reference counter, and an object is deleted as soon as its reference count reaches 0 (with a garbage collector as backup for cycles). That way assigning objects is a cheap operation, and the programmer never has to explicitly delete anything.
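The reference semantics described above can be observed directly:

```python
import sys

a = [1, 2, 3]
b = a                 # no copy is made: b is a second name for the same object
b.append(4)
print(a)              # [1, 2, 3, 4]
print(a is b)         # True

# CPython exposes the count directly; the reported value is one higher
# than you might expect, because the function argument is itself a
# temporary extra reference.
print(sys.getrefcount(a))
```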

Is there any benefit to deleting a reference to a large Python object before overwriting that reference?

I am running some memory-heavy scripts which iterate over documents in a database, and due to memory constraints on the server I manually delete references to the large object at the conclusion of each iteration:
for document in database:
    initial_function_calls()
    big_object = memory_heavy_operation(document)
    save_to_file(big_object)
    del big_object
    additional_function_calls()
The initial_function_calls() and additional_function_calls() are each slightly memory-heavy. Do I see any benefit by explicitly deleting the reference to the large object for garbage collection? Alternatively, does leaving it and having it point to a new object in the next iteration suffice?
As so often in these cases: it depends. :-/
I'm assuming we're talking about CPython here.
Using del or re-assigning a name reduces the reference count for an object. Only when that count reaches 0 can the object be de-allocated. So if you inadvertently stashed a reference to big_object away somewhere, using del won't help.
When garbage collection is triggered depends on the amount of allocations and de-allocations. See the documentation for gc.set_threshold().
If you're pretty sure that there are no further references, you could use gc.collect() to force a garbage collection run. That might help if your code doesn't do a lot of other allocations.
One thing to keep in mind is that if big_object is created by a C extension module (e.g. numpy), it may manage its own memory. In that case garbage collection won't affect it! Also, small integers and small strings are pre-allocated and won't be garbage collected. You can use gc.is_tracked() to check whether an object is managed by the garbage collector.
What I would suggest is that you run your program with and without del+gc.collect(), and monitor the amount of RAM used. On UNIX-like systems, look at the resident set size. You could also use sys._debugmallocstats().
Unless you see the resident set size grow and grow, I wouldn't worry about it.
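Besides watching RSS, you can make the before/after comparison inside the process with sys.getallocatedblocks(), which counts blocks currently held by CPython's allocator. A sketch (the workload is a made-up stand-in; exact numbers will vary):

```python
import gc
import sys

def blocks():
    """Number of memory blocks currently held by CPython's allocator."""
    gc.collect()
    return sys.getallocatedblocks()

baseline = blocks()
big_object = [bytes(100) for _ in range(10_000)]  # stand-in workload
held = blocks()

del big_object
released = blocks()

print(held - baseline)      # thousands of blocks while the list is alive
print(released - baseline)  # back near the baseline after del
```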

In Python 3, does replacing a long list by None free up the memory?

In Python 3.x, say a variable contains a very long list. Once I know I no longer need this list, does setting the variable to None free up the memory?
This is what I mean:
a = [x for x in range(10**10)]
a = None
I know that the example above could use an iterator instead, but let's assume the list actually contains relevant data.
It decrements the list's reference count, and if that count reaches zero, the list is freed; in CPython that happens immediately, not at some later GC pass.
In CPython (the standard Python interpreter), reference counting is the primary means of automatic memory management. Which means that the memory used by an object is immediately freed when there are no references to it. In your example, if the list has only the name a, then the memory it uses is freed immediately when a is set to None (or any other value, in fact). You can also del a to remove the name entirely, to similar effect. Not only the list but all its contents (assuming there are no other references to individual items) are freed.
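This immediate freeing can be observed with the weakref module, which lets you watch an object without keeping it alive (the class name Big is made up; plain lists can't be weak-referenced directly):

```python
import weakref

class Big:          # made-up stand-in for a large object
    pass

a = Big()
r = weakref.ref(a)  # observe the object without adding a real reference

print(r() is None)  # False: the object is alive
a = None            # the only real reference goes away...
print(r() is None)  # True: ...and CPython freed it immediately
```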
There are a couple of caveats. First, if an object contains a reference to itself, or to another object that refers back to the first object (etc.) its reference count never reaches zero even when all its names are deleted, so it won't be freed automatically by this process.
a = [x for x in range(10**10)]
a.append(a)
a = None
In this case, the list named a contains a reference to itself, so the last line does not free the memory used by the list, because its reference count does not reach zero when that happens. Python has a garbage collector that is run periodically to find and free such structures that refer only to themselves (or cyclic structures that mutually refer to each other) and that cannot be reached from any namespace.
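To see the collector reclaim such a cycle, here is a small sketch (the class Node is made up; a class is used because lists don't support weak references):

```python
import gc
import weakref

class Node:              # made-up stand-in for a cyclic structure
    pass

a = Node()
a.self_ref = a           # the object now references itself
r = weakref.ref(a)

a = None                 # refcount never reaches zero because of the cycle
print(r() is None)       # False: still alive, reachable only from itself

collected = gc.collect() # force a cyclic-garbage pass
print(r() is None)       # True: the collector broke the cycle and freed it
print(collected)         # at least 1 unreachable object was found
```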
The other caveat is that freed memory is made available again for use by your Python script, but it may not be released back to the operating system immediately, or ever.
Finally, some objects are never freed: small integers (in the range -5 to 256), interned strings, and the empty tuple. These are the same instance any time you use them (instead of new instances being created), so they are kept around.
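These caches are CPython implementation details, not language guarantees, but they are easy to see:

```python
# CPython implementation details, not language guarantees:
# ints from -5 to 256 are pre-allocated singletons, and the empty
# tuple is shared, so every way of producing them yields the same object.
print(int("256") is int("256"))  # True: cached small int
print(int("257") is int("257"))  # False: two distinct objects
print(tuple() is tuple())        # True: shared empty tuple
```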
Other Python implementations may use other memory management strategies. Jython runs on the Java virtual machine and uses the JVM's memory management; ditto for IronPython and the Microsoft .NET CLR.

Deleted objects still referenced in pickle

In my project, I periodically use pickling to represent the internal state of the process for persistence. As a part of normal operation, references to objects are added to and removed from multiple other objects.
For example Person might have an attribute called address_list (a list) that contains the Address objects representing all the properties they are trying to sell. Another object, RealEstateAgent, might have an attribute called addresses_for_sale (also a list) which contains the same type of Address objects, but only those ones that are listed at their agency.
If a seller takes their property off the market, or it is sold, the Address is removed from both lists.
Both Persons and RealEstateAgents are members of a central object (Masterlist) list for pickling. My problem is that as I add and remove properties and pickle the Masterlist object repeatedly over time, the size of the pickle file grows, even when I have removed (del actually) more properties than I have added. I realize that, in pickling Masterlist, there is a circular reference. There are many circular references in my application.
I examined the pickle file using pickletools.dis(), and while it's hard to human-read, I see references to Addresses that have been removed. I am sure they are removed, because, even after unpickling, they do not exist in their respective lists.
While the application functions correctly before and after pickling/unpickling, the growing filesize is an issue as the process is meant to be long running, and reinitializing it is not an option.
My example is notional, and it might be a stretch to ask, but I'm wondering if anyone has experience with either garbage collection issues using pickles, when they contain circular references or anything else that might point me in the right direction to debugging this. Maybe some tools that would be helpful.
Many thanks
You might want to try objgraph; it can seriously aid you in tracking down memory leaks, circular references, and pointer relationships between objects.
http://mg.pov.lt/objgraph/
I use it when debugging pickles (in my own pickling package called dill).
Also, certain pickled objects will (down the pickle chain) pickle globals, and this is often a cause of circular references within pickled objects.
I also have a suite of pickle debugging tools in dill. See dill.detect at https://github.com/uqfoundation, where there are several methods that can be used to diagnose objects you are trying to pickle. For instance, if you set dill.detect.trace(True), it will print out all the internal calls to pickle objects while your object is being dumped.
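objgraph and dill are third-party packages; as a stdlib-only sketch, gc.get_referrers() can answer the narrower question "what still holds this object alive?" (the names below mirror the question and are made up):

```python
import gc

class Address:                    # made-up stand-in mirroring the question
    pass

addr = Address()
addresses_for_sale = [addr]       # the RealEstateAgent's list
address_list = [addr]             # the Person's list

addresses_for_sale.remove(addr)   # removed in one place, forgotten in the other

# Ask the collector which containers still keep addr alive:
holders = [ref for ref in gc.get_referrers(addr) if isinstance(ref, list)]
print(len(holders))               # the forgotten address_list still shows up
```

A lingering referrer found this way is exactly the kind of object that keeps re-entering your pickles.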
