Diagnosing a memory leak in Python

I've been trying to debug a memory leak in the Coopr package using objgraph: https://gist.github.com/3855150
I have it pinned down to a _SetContainer object, but can't seem to figure out why that object is persisting in memory. Here's the result of objgraph.show_refs:
How do I go about finding the circular reference, and how can I get the garbage collector to collect the _SetContainer instance?
I previously thought that the class itself might have a self-reference (the tuple just below the class on the right in the image above). But objgraph always shows inherited classes as having a self-referencing tuple. You can see a very simple test case here.

It's mostly guessing from the output of objgraph, but it seems that the instance is in a cycle and its class has a __del__. In this situation, the object is kept alive forever in CPython. Check it with:
import gc; gc.collect(); print(gc.garbage)
http://docs.python.org/library/gc.html#gc.garbage
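For reference, here is a slightly fuller version of that check (a sketch in current Python syntax; _SetContainer stands in for whatever class is leaking, and objgraph must be installed for the last step):

import gc
import objgraph

gc.set_debug(gc.DEBUG_UNCOLLECTABLE)   # report objects the collector cannot free
gc.collect()                           # force a full collection pass

# Anything left in gc.garbage sits in a reference cycle the collector refused to
# break (for example because one of the participating classes defines __del__).
for obj in gc.garbage:
    print(type(obj), repr(obj)[:80])

# objgraph can then draw the chain of references keeping such an object alive.
leaked = objgraph.by_type('_SetContainer')
if leaked:
    objgraph.show_backrefs(leaked[0], max_depth=5, filename='backrefs.png')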

Related

Accessing the memory heap in python

Is there a way to access the memory heap in Python? I'm interested in being able to access all of the objects allocated in memory of the running instance.
You can't get direct access, but the gc module should do most of what you want. A simple gc.get_objects() call will return all the objects tracked by the collector. This isn't everything, since the CPython garbage collector is only concerned with potential reference cycles, so instances of built-in types that can't refer to other objects (int, float, str, etc.) won't appear in the resulting list. They will, however, all be referenced by something in that list (if they weren't, their reference count would be zero and they would already have been disposed of).
Aside from that, you might get more targeted use out of the inspect module, in particular stack-frame inspection: either via the traceback module for easy formatting, or by digging into the semi-documented frame objects themselves. Either approach lets you narrow the scope down to a particular active frame on the call stack.
For the closest thing to a heap-level view, you could use the tracemalloc module to trace and record allocations as they happen, or the pdb debugger to do live introspection from the outside (possibly adding breakpoint() calls to your code so it stops automatically at that point and lets you look around).
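A small sketch of both approaches (the type counting is only illustrative; any predicate over gc.get_objects() will do, and the list comprehension is a stand-in for the real workload):

import gc
import tracemalloc
from collections import Counter

# Count live, GC-tracked objects by type -- a quick way to see what is piling up.
counts = Counter(type(o).__name__ for o in gc.get_objects())
print(counts.most_common(10))

# tracemalloc records where allocations happened, which is the closest thing
# to a heap profile in the standard library.
tracemalloc.start()
data = [list(range(1000)) for _ in range(100)]   # stand-in for the real workload
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('lineno')[:5]:
    print(stat)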

Usage of Cython directive no_gc

In Cython 0.25 the no_gc directive was added. The documentation for this new directive (as well as for a related no_gc_clear directive) can be found here, but the only thing I really understand about it is that it can speed up your code by disabling certain aspects of garbage collection.
I am interested because I have some high performance Cython code which uses extension types, and I understand that no_gc can speed things up further. In my code, instances of extension types are always left alive until the very end when the program closes, which makes me think that disabling garbage collection for these might be OK.
I guess what I really need is an example where the usage of no_gc goes bad and leads to memory leaks, together with an explanation of exactly why that happens.
It's to do with circular references: when an instance a holds a reference to a Python object that in turn references a, then a can never be freed through reference counting alone, so Python runs a cycle detector to find and collect such groups.
A very trivial example of a class that could cause issues is:
# Cython code:
cdef class A:
    cdef object param
    def __init__(self):
        self.param = self
(and some Python code to run it)
import cython_module
while True:
    cython_module.A()
This is fine as is (the cycles are detected and they get deallocated every so often) but if you add no_gc then you will run out of memory.
A more realistic example might be a parent/child pair that store a reference to each other.
It's worth adding that the performance gains are likely to be small. The garbage collector only runs occasionally, in situations where a lot of objects have been allocated and few have been freed (https://docs.python.org/3/library/gc.html - see gc.set_threshold). It's hopefully unlikely that this describes your high performance code.
There's probably also a small performance cost on allocation and deallocation of your objects with GC enabled, to add/remove them from the list of tracked objects (but again, hopefully you aren't allocating/deallocating huge numbers).
Finally, if your class never stores any references to Python objects then it's effectively no_gc anyway. Setting the option will do no harm but also do no good.
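As a rough way of checking how much the cycle collector matters for a given workload, you can inspect its thresholds, counters, and per-object tracking from Python (a sketch; the exact defaults may vary between interpreter versions):

import gc

print(gc.get_threshold())   # typically (700, 10, 10): collections trigger by allocation excess
print(gc.get_count())       # allocations minus deallocations per generation since the last run

# Objects that cannot hold references to other Python objects are never tracked
# by the collector, so they add nothing to its cost -- the "effectively no_gc" case.
print(gc.is_tracked(42))           # False: an int holds no references
print(gc.is_tracked("abc"))        # False: neither does a str
print(gc.is_tracked([1, 2, 3]))    # True: a list can hold anything, so it is tracked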

Memory leak when using pickle in python

I have a big pickle file containing hundreds of trained R models in Python: these are statistical models built with the rpy2 library.
I have a class that loads the pickle file every time one of its methods is called (this method is called several times in a loop).
It happens that the memory required to load the pickle file content (around 100 MB) is never freed, even if there is no reference pointing to the loaded content. I correctly open and close the input file. I have also tried to reload the pickle module (and even rpy) at every iteration. Nothing changes. It seems that just the fact of loading the content permanently locks some memory.
I can reproduce the issue, and this is now an open issue in the rpy2 issue tracker: https://bitbucket.org/rpy2/rpy2/issues/321/memory-leak-when-unpickling-r-objects
edit: The issue is resolved and the fix is included in rpy2-2.7.5 (just released).
If you follow this advice, please do so tentatively; I am not 100% sure of this solution, but I wanted to try to help you if I could.
In CPython, reference counting is still the primary mechanism: the interpreter counts how many references point to each object and frees it as soon as that count drops to zero.
On top of that, Python runs a supplemental, scheduled garbage collector to deal with reference cycles. This collector is triggered by allocation counters rather than running immediately, because scanning for cycles all the time would slow programs down (especially when it isn't needed).
In the case of your program, even though you no longer point to certain objects, that cycle collector might not have come around to freeing them yet, so you can trigger a pass manually using:
gc.enable() # make sure automatic garbage collection is switched on
gc.collect() # force a full collection pass right now
If you would like to read more, here is the link to the Python garbage collection documentation: https://docs.python.org/3/library/gc.html. I hope this helps, Marco!
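A small way to check whether a manual collection actually helps in this situation (a sketch; the file name and loop count are placeholders, and note that tracemalloc only sees Python-level allocations, so memory held on the C/R side by rpy2 will not show up here):

import gc
import pickle
import tracemalloc

tracemalloc.start()

for _ in range(5):
    with open('models.pkl', 'rb') as f:   # placeholder for the real pickle file
        models = pickle.load(f)
    del models                            # drop the only reference to the loaded content
    gc.collect()                          # break any cycles the loaded objects formed
    current, peak = tracemalloc.get_traced_memory()
    print('current=%.1f MB  peak=%.1f MB' % (current / 1e6, peak / 1e6))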

Deleted objects still referenced in pickle

In my project, I periodically use pickling to represent the internal state of the process for persistence. As a part of normal operation, references to objects are added to and removed from multiple other objects.
For example Person might have an attribute called address_list (a list) that contains the Address objects representing all the properties they are trying to sell. Another object, RealEstateAgent, might have an attribute called addresses_for_sale (also a list) which contains the same type of Address objects, but only those ones that are listed at their agency.
If a seller takes their property off the market, or it is sold, the Address is removed from both lists.
Both Person and RealEstateAgent objects are members of a central object (Masterlist) that gets pickled. My problem is that as I add and remove properties and pickle the Masterlist object repeatedly over time, the size of the pickle file grows, even when I have removed (del, actually) more properties than I have added. I realize that, in pickling Masterlist, there is a circular reference. There are many circular references in my application.
I examined the pickle file using pickletools.dis(), and while it's hard to human-read, I see references to Addresses that have been removed. I am sure they are removed, because, even after unpickling, they do not exist in their respective lists.
While the application functions correctly before and after pickling/unpickling, the growing filesize is an issue as the process is meant to be long running, and reinitializing it is not an option.
My example is notional, and it might be a stretch to ask, but I'm wondering if anyone has experience with garbage collection issues when pickling objects that contain circular references, or anything else that might point me in the right direction for debugging this. Maybe some tools that would be helpful.
Many thanks
You might want to try objgraph… it can seriously aid you in tracking down memory leaks, circular references, and pointer relationships between objects.
http://mg.pov.lt/objgraph/
I use it when debugging pickles (in my own pickling package called dill).
Also, certain pickled objects will (down the pickle chain) pickle globals, and this is often a cause of circular references within pickled objects.
I also have a suite of pickle debugging tools in dill. See dill.detect at https://github.com/uqfoundation, where there are several methods that can be used to diagnose the objects you are trying to pickle. For instance, if you set dill.detect.trace(True), it will print out all the internal calls to pickle objects while your object is being dumped.
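A sketch of how those two tools might be combined here (Address and masterlist are toy stand-ins for the objects in the question; objgraph needs graphviz installed to write the image):

import objgraph
import dill

class Address:                      # toy stand-in for the class in the question
    pass

masterlist = {'addresses': [Address() for _ in range(3)]}

# 1. See what is still holding on to Address objects that should already be gone.
objgraph.show_backrefs(objgraph.by_type('Address')[:3], max_depth=4,
                       filename='address_backrefs.png')

# 2. Watch what actually gets serialized when the master list is dumped.
dill.detect.trace(True)             # print every internal pickling call during the dump
with open('masterlist.pkl', 'wb') as f:
    dill.dump(masterlist, f)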

How can I protect a logging object from the garbage collector in a multiprocessing process?

I create a couple of worker processes using Python 2.6's multiprocessing module.
In each worker I use the standard logging module (with log rotation and file per worker)
to keep an eye on the worker. I've noticed that after a couple of hours that no more
events are written to the log. The process doesn't appear to crash and still responds
to commands via my queue. Using lsof I can see that the log file is no longer open.
I suspect the log object may be killed by the garbage collector, if so is there a way
that I can mark it to protect it?
I agree with @THC4k. This doesn't seem like a GC issue. I'll give you my reasons why, and I'm sure somebody will vote me down if I'm wrong (if so, please leave a comment pointing out my error!).
If you're using CPython, it primarily uses reference counting, and objects are destroyed immediately when the ref count goes to zero (since 2.0, supplemental garbage collection is also provided to handle the case of circular references). Keep a reference to your log object and it won't be destroyed.
If you're using Jython or IronPython, the underlying VM does the garbage collection. Again, keep a reference and the GC shouldn't touch it.
Either way, it seems that either you're not keeping a reference to an object you need to keep alive, or you have some other error.
http://docs.python.org/reference/datamodel.html#object.__del__
According to this documentation, the __del__() method is called on object destruction, and at that point you can create a new reference to the object to prevent it from being collected. I am not sure how to do this, but hopefully this gives you some food for thought.
You could run gc.collect() immediately after fork() to see if that causes the log to be closed. But it's not likely garbage collection would take effect only after a few hours.
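Building on the "keep a reference" advice above, a minimal sketch of a worker that binds its logger and handler to names that live for the whole loop (written in current Python syntax rather than 2.6; the file names and the None sentinel protocol are just illustrative):

import logging
import logging.handlers
import multiprocessing

def worker(worker_id, queue):
    # The logger and handler are referenced by these locals for as long as the
    # loop runs, so neither can be reclaimed while the worker is alive.
    logger = logging.getLogger('worker-%d' % worker_id)
    handler = logging.handlers.RotatingFileHandler(
        'worker-%d.log' % worker_id, maxBytes=1000000, backupCount=3)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    for item in iter(queue.get, None):    # run until a None sentinel arrives
        logger.info('processing %r', item)

if __name__ == '__main__':
    q = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker, args=(i, q)) for i in range(2)]
    for p in procs:
        p.start()
    for item in ['a', 'b', 'c']:
        q.put(item)
    for _ in procs:
        q.put(None)                       # tell each worker to exit
    for p in procs:
        p.join()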
