Deleted objects still referenced in pickle - python

In my project, I periodically use pickling to represent the internal state of the process for persistence. As a part of normal operation, references to objects are added to and removed from multiple other objects.
For example Person might have an attribute called address_list (a list) that contains the Address objects representing all the properties they are trying to sell. Another object, RealEstateAgent, might have an attribute called addresses_for_sale (also a list) which contains the same type of Address objects, but only those ones that are listed at their agency.
If a seller takes their property off the market, or it is sold, the Address is removed from both lists.
Both Persons and RealEstateAgents are members of a central object (Masterlist) list for pickling. My problem is that as I add and remove properties and pickle the Masterlist object repeatedly over time, the size of the pickle file grows, even when I have removed (del actually) more properties than I have added. I realize that, in pickling Masterlist, there is a circular reference. There are many circular references in my application.
I examined the pickle file using pickletools.dis(), and while it's hard to human-read, I see references to Addresses that have been removed. I am sure they are removed, because, even after unpickling, they do not exist in their respective lists.
While the application functions correctly before and after pickling/unpickling, the growing filesize is an issue as the process is meant to be long running, and reinitializing it is not an option.
My example is notional, and it might be a stretch to ask, but I'm wondering if anyone has experience with either garbage collection issues using pickles, when they contain circular references or anything else that might point me in the right direction to debugging this. Maybe some tools that would be helpful.
Many thanks

You might want to try objgraph… it can seriously aid you in tracking down memory leaks and circular references and pointer relationships between objects.
http://mg.pov.lt/objgraph/
I use it when debugging pickles (in my own pickling package called dill).
Also, certain pickled objects will (down the pickle chain) pickle globals, and is often a cause of circular references within pickled objects.
I also have a suite of pickle debugging tools in dill. See dill.detect at https://github.com/uqfoundation, where there are several methods that can be used to diagnose objects you are tying to pickle. For instance, if you set dill.detect.trace(True), it will print out all the internal calls to pickle objects while your object is being dumped.

Related

Accessing the memory heap in python

Is there a way to access the memory heap in Python? I'm interested in being able to access all of the objects allocated in memory of the running instance.
You can't get direct access, but the gc module should do most of what you want. A simple gc.get_objects() call will return all the objects tracked by the collector. This isn't everything since the CPython garbage collector is only concerned with potential reference cycles (so built-in types that can't refer to other objects, e.g. int, float, str, etc.) won't appear in the resulting list, but they'll all be referenced by something in that list (if they weren't, their reference count would be zero and they'd have been disposed of).
Aside from that, you might get some more targeted use out of the inspect module, especially stack frame inspection, using the traceback module for "easy formatting" or manually digging into the semi-documented frame objects themselves, either of which would allow you to narrow the scope down to a particular active scope on the stack frame.
For the closest to the heap solution, you could use the tracemalloc module to trace and record allocations as they happen, or the pdb debugger to do live introspection from the outside (possibly adding breakpoint() calls to your code to make it stop automatically when you reach that point to let you look around).

diagnosing memory leak in python

I've been trying to debug a memory leak in the Coopr package using objgraph: https://gist.github.com/3855150
I have it pinned down to a _SetContainer object, but can't seem to figure out why that object is persisting in memory. Here's the result of objgraph.show_refs:
How do I go about finding the circular reference and how can I get the the garbage collector to collect the _SetContainer instance?
I previously thought that the class itself might have a self-reference (the tuple just below the class on the right in the image above). But objgraph always shows inherited classes always as having a self-referencing tuple. You can see a very simple test case here.
It's mostly guessing from the output of objgraph, but it seems that the instance is in a cycle and its class has a __del__. In this situation, the object is kept alive forever in CPython. Check it with:
import gc; gc.collect(); print gc.garbage
http://docs.python.org/library/gc.html#gc.garbage

Store and load a large number linked objects in Python

I have a lot of objects which form a network by keeping references to other objects. All objects (nodes) have a dict which is their properties.
Now I'm looking for a fast way to store these objects (in a file?) and reload all of them into memory later (I don't need random access). The data is about 300MB in memory which takes 40s to load from my SQL format, but I now want to cache it to have faster access.
Which method would you suggest?
(my pickle attempt failed due to recursion errors despite trying to mess around with getstate :( maybe there is something fast anyway? :))
Pickle would be my first choice. But since you say that it didn't work, you might want to try shelve, even thought it's not shelve's primary purpose.
Really, you should be using Pickle for this. Perhaps you could post some code so that we can take a look and figure out why it doesn't work
"The pickle module keeps track of the objects it has already serialized, so that later references to the same object won’t be serialized again." So it IS possible. Perhaps increase the recursion limit with sys.setrecursionlimit.
Hitting Maximum Recursion Depth Using Python's Pickle / cPickle
Perhaps you could set up some layer of indirection where the objects are actually held within, say, another dictionary, and an object referencing another object will store the key of the object being referenced and then access the object through the dictionary. If the object for the stored key is not in the dictionary, it will be loaded into the dictionary from your SQL database, and when it doesn't seem to be needed anymore, the object can be removed from the dictionary/memory (possibly with an update to its state in the database before the version in memory is removed).
This way you don't have to load all the data from your database at once, and can keep a number of the objects cached in memory for quicker access to those. The downside would be the additional overhead required for each access to the main dict.

Are all Python objects tracked by the garbage collector?

I'm trying to debug a memory leak (see question Memory leak in Python Twisted: where is it?).
When the garbage collector is running, does it have access to all Python objects created by the Python interpreter? If we suppose Python C libraries are not leaking, should RSS memory usage grow linearly with respect to the GC object count? What about sys.getobjects?
CPython uses two mechanisms to clean up garbage. One is reference counting, which affects all objects but which can't clean up objects that (directly or indirectly) refer to each other. That's where the actual garbage collector comes in: python has the gc module, which searches for cyclic references in objects it knows about. Only objects that can potentially be part of a reference cycle need to worry about participating in the cyclic gc. So, for example, lists do, but strings do not; strings don't reference any other objects. (In fact, the story is a little more complicated, as there's two ways of participating in cyclic gc, but that isn't really relevant here.)
All Python classes (and instances thereof) automatically get tracked by the cyclic gc. Types defined in C aren't, unless they put in a little effort. All the builtin types that could be part of a cycle do. But this does mean the gc module only knows about the types that play along.
Apart from the collection mechanism there's also the fact that Python has its own aggregating memory allocator (obmalloc), which allocates entire memory arenas and uses the memory for most of the smaller objects it creates. Python now does free these arenas when they're completely empty (for a long time it didn't), but actually emptying an arena is fairly rare: because CPython objects aren't movable, you can't just move some stragglers to another arena.
The RSS does not grow linearly with the number of Python objects, because Python objects can vary in size. An int object is usually much smaller than a big list.
I suppose that you mean gc.get_objects when you wrote sys.getobjects. This function gives you a list of all reachable objects. If you suppose a leak, you can iterate this list and try to find objects that should have been freed already. (For instance you might know that all objects of a certain type are to be freed at a certain point.)
A Python class designed to be unable to be involved in cycles is not tracked by the GC.
class V(object):
__slots__ = ()
Instances of V cannot have any attribute. Its size is 16, like the size of object().
sys.getsizeof(V()) and v().sizeof() return the same value: 16.
V isn't useful, but I imagine that other classes derived from base types (e.g. tuple), that only add methods, can be crafted so that reference counting is enough to manage them in memory.

What does it mean for an object to be picklable (or pickle-able)?

Python docs mention this word a lot and I want to know what it means.
It simply means it can be serialized by the pickle module. For a basic explanation of this, see What can be pickled and unpickled?. Pickling Class Instances provides more details, and shows how classes can customize the process.
Things that are usually not pickable are, for example, sockets, file(handler)s, database connections, and so on. Everything that's build up (recursively) from basic python types (dicts, lists, primitives, objects, object references, even circular) can be pickled by default.
You can implement custom pickling code that will, for example, store the configuration of a database connection and restore it afterwards, but you will need special, custom logic for this.
All of this makes pickling a lot more powerful than xml, json and yaml (but definitely not as readable)
These are all great answers, but for anyone who's new to programming and still confused here's the simple answer:
Pickling an object is making it so you can store it as it currently is, long term (to often to hard disk). A bit like Saving in a video game.
So anything that's actively changing (like a live connection to a database) can't be stored directly (though you could probably figure out a way to store the information needed to create a new connection, and that you could pickle)
Bonus definition: Serializing is packaging it in a form that can be handed off to another program. Unserializing it is unpacking something you got sent so that you can use it
Pickling is the process in which the objects in python are converted into simple binary representation that can be used to write that object in a text file which can be stored. This is done to store the python objects and is also called as serialization. You can infer from this what de-serialization or unpickling means.
So when we say an object is picklable it means that the object can be serialized using the pickle module of python.

Categories

Resources