How to make Python ignore an object for garbage collection?

At the start of my code, I load in a huge (33GB) pickled object. This object is essentially a huge graph with many interconnected nodes.
Periodically, I run gc.collect(). When the huge object is loaded, this takes 100 seconds; when I change my code to not load it, gc.collect() takes 0.5 seconds. I assume this is because Python checks every sub-object of the huge object for reference cycles each time I call gc.collect().
I know that neither the huge object nor any of the objects it references when it is loaded in at the start will ever need to be garbage collected. How do I tell Python this, so that I can avoid the 100-second GC time?

In Python 3.7 you might be able to hack something together using https://docs.python.org/3/library/gc.html#gc.freeze:
import gc

allocate_a_lot()
gc.freeze()   # move all objects allocated so far to a permanent generation; none of them will be collected
allocate_some_more()
gc.collect()  # collects only the non-frozen objects
gc.unfreeze() # return to sanity
That said, I don't think Python offers the tools for exactly what you want. In general, garbage-collected languages are not designed to let you do manual memory management.
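For example, a minimal sketch of how that could look around the pickle load (Python 3.7+ only; the file name is a placeholder, and disabling automatic collection during the load is an extra precaution beyond gc.freeze() itself):

import gc
import pickle

gc.disable()                             # no automatic collections while the big graph is unpickled
with open("huge_graph.pkl", "rb") as f:  # placeholder file name
    graph = pickle.load(f)
gc.freeze()                              # move everything loaded so far into the permanent generation
gc.enable()                              # later collections will skip the frozen objects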

Related

Is there any benefit to deleting a reference to a large Python object before overwriting that reference?

I am running some memory-heavy scripts which iterate over documents in a database, and due to memory constraints on the server I manually delete references to the large object at the conclusion of each iteration:
for document in database:
    initial_function_calls()
    big_object = memory_heavy_operation(document)
    save_to_file(big_object)
    del big_object
    additional_function_calls()
The initial_function_calls() and additional_function_calls() are each slightly memory-heavy. Do I see any benefit from explicitly deleting the reference to the large object for garbage collection? Alternatively, does leaving it and having it point to a new object in the next iteration suffice?
As so often in these cases: it depends. :-/
I'm assuming we're talking about CPython here.
Using del or re-assigning a name reduces the reference count for an object. Only if that reference count reaches 0 can it be de-allocated. So if you inadvertently stashed a reference to big_object away somewhere, using del won't help.
When garbage collection is triggered depends on the number of allocations and de-allocations. See the documentation for gc.set_threshold().
If you're pretty sure that there are no further references, you could use gc.collect() to force a garbage collection run. That might help if your code doesn't do a lot of other allocations.
One thing to keep in mind is that if the big_object is created by a C extension module (like e.g. numpy), it could manage its own memory. In that case the garbage collection won't affect it! Also small integers and small strings are pre-allocated and won't be garbage collected. You can use gc.is_tracked() to check if an object is managed by the garbage collector.
What I would suggest is that you run your program with and without del+gc.collect(), and monitor the amount of RAM used. On UNIX-like systems, look at the resident set size. You could also use sys._debugmallocstats().
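A minimal sketch of that measurement on Linux (memory_heavy_operation and save_to_file are the placeholders from the question; reading /proc/self/status is Linux-specific):

import gc

def rss_kb():
    # current resident set size in kB, read from /proc/self/status (Linux only)
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])

print("before:", rss_kb(), "kB")
big_object = memory_heavy_operation(document)
save_to_file(big_object)
del big_object
gc.collect()
print("after del + gc.collect():", rss_kb(), "kB")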
Unless you see the resident set size grow and grow, I wouldn't worry about it.

Memory leak when using pickle in Python

I have a big pickle file containing hundreds of trained R models in Python: these are statistical models built with the rpy2 library.
I have a class that loads the pickle file every time one of its methods is called (this method is called several times in a loop).
It happens that the memory required to load the pickle file's content (around 100 MB) is never freed, even if there is no reference pointing to the loaded content. I correctly open and close the input file. I have also tried to reload the pickle module (and even rpy) at every iteration. Nothing changes. It seems that just the fact of loading the content permanently locks some memory.
I can reproduce the issue, and this is now an open issue in the rpy2 issue tracker: https://bitbucket.org/rpy2/rpy2/issues/321/memory-leak-when-unpickling-r-objects
edit: The issue is resolved and the fix is included in rpy2-2.7.5 (just released).
If you follow this advice, please do so tentatively because I am not 100% sure of this solution but I wanted to try to help you if I could.
In CPython, most objects are freed by reference counting: the interpreter tracks how many references point to each object and de-allocates it as soon as that count drops to zero.
On top of that, Python runs a cyclic garbage collector for objects that reference counting alone cannot reclaim (reference cycles). That collector does not run immediately; it is triggered on a schedule based on allocation counts, because scanning for cycles constantly would slow programs down (especially when it isn't needed).
In the case of your program, even though you no longer point to certain objects, Python might not have come around to freeing them from memory yet, so you can trigger a collection manually:
import gc

gc.enable()  # make sure automatic collection is switched on (it is by default)
gc.collect() # force a full collection pass right now
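Spelled out against the method described in the question, a minimal sketch of that pattern (do_something is a placeholder for whatever the method really does with the models):

import gc
import pickle

def process(path):
    with open(path, "rb") as f:
        models = pickle.load(f)    # the ~100 MB of rpy2 models from the question
    result = do_something(models)  # placeholder for the real work
    del models                     # drop the last reference to the loaded content...
    gc.collect()                   # ...and run the collector to break any cycles right away
    return result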
If you would like to read more, see the Python garbage collection documentation (https://docs.python.org/3/library/gc.html). I hope this helps Marco!

Python 3.4 memory usage

Consider these two snippets, which I run in the Python console:
l = []
for i in range(0, 1000): l.append("." * 1000000)
# if you check your task manager now, Python is using nearly 900 MB
del l
# now Python 3 immediately freed the memory
Now consider this:
l = []
for i in range(0, 1000): l.append("." * 1000000)
l.append(l)
# if you check your task manager now, Python is using nearly 900 MB
del l
# now Python 3 won't free the memory
Since I am working with these kinds of objects and need to free them from memory, I need to know what to do so that Python recognizes it can release the corresponding memory.
PS: I am using Windows 7.
Because you've created a circular reference, the memory won't be freed until the garbage collector runs, detects the cycle, and cleans it up. You can trigger that manually:
import gc
gc.collect() # Memory usage will drop once you run this.
The collector will automatically run occasionally, but only if certain conditions related to the number of object allocations/deallocations are met:
gc.set_threshold(threshold0[, threshold1[, threshold2]])
Set the garbage collection thresholds (the collection frequency).
Setting threshold0 to zero disables collection.
The GC classifies objects into three generations depending on how many
collection sweeps they have survived. New objects are placed in the
youngest generation (generation 0). If an object survives a collection
it is moved into the next older generation. Since generation 2 is the
oldest generation, objects in that generation remain there after a
collection. In order to decide when to run, the collector keeps track
of the number of object allocations and deallocations since the last
collection. When the number of allocations minus the number of
deallocations exceeds threshold0, collection starts.
So if you continued creating more objects in the interpreter, eventually the garbage collector would kick on by itself. You can make that happen more often by lowering threshold0, or you can just manually call gc.collect when you know you've deleted one of the objects containing a reference cycle.
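As a small illustration of those knobs (not part of the quoted documentation):

import gc

print(gc.get_threshold())  # the three thresholds, (700, 10, 10) by default
print(gc.get_count())      # current allocation counters for generations 0, 1 and 2
gc.set_threshold(100)      # lower threshold0 so generation-0 collections run more often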

Python reclaiming memory after deleting items in a dictionary

I have a relatively large dictionary in Python and would like to be able to not only delete items from it, but actually reclaim the memory back from these deletions in my program. I am running across a problem whereby although I delete items from the dictionary and even run the garbage collector manually, Python does not appear to be freeing the memory itself.
A simple example of this:
>>> tupdict = {}
# consumes around 2 GB of memory
>>> for i in xrange(12500000):
...     tupdict[i] = (i,i)
...
# delete over half the entries, no drop in consumed memory
>>> for i in xrange(7500000):
...     del tupdict[i]
...
>>> import gc
# manually garbage collect, still no drop in consumed memory after this
>>> gc.collect()
0
>>>
I imagine what is happening is that although the entries are deleted and garbage collector run, Python does not go ahead and resize the dictionary. My question is, is there any simple way around this, or am I likely to require a more serious rethink about how I write my program?
A lot of factors go into whether Python returns this memory to the underlying OS or not, which is probably how you're trying to tell if memory is being freed. CPython has a pooled allocator system that tends to hold on to freed memory so that it can be reused in an efficient manner (but these subsequent allocations won't increase your memory footprint from the perspective of the OS), which might be what you're seeing.
Also, on some unix platforms processes don't release freed memory back to the OS until the application closes (or some other significant event occurs). Even if you are in a situation where an entire pool has been freed (and thus Python might decide to free() it rather than holding it open for future objects), the OS still won't release this memory to be used by other processes (but can be used for further reallocation within the original process). In general this is good for reducing memory fragmentation and doesn't have too much of a downside, as the unused process memory will get paged out to disk. Windows does release process memory back to the OS for use by any new allocation (which you can then see in the Task Manager), so trying this on Windows will likely appear to give you a different result.
In the end, how to manage deallocated process memory is the purview of the operating system, and there are various schemes (with upsides and downsides) used such that just looking in your system information tool of choice won't necessarily tell you the whole truth.
You're right that Python doesn't shrink the dictionary when items are deleted from it. This has nothing to do with OS memory management or garbage collection; it is an implementation detail of Python's dict data structure.
A workaround is to create a new dictionary by copying the old dictionary. Check this great video for more info: http://pyvideo.org/video/276/the-mighty-dictionary-55 (around 26:30 there is an answer).
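A sketch of that workaround applied to tupdict from the question (copying rebuilds the hash table at a size that fits the remaining entries):

import gc

tupdict = dict(tupdict)  # copy the surviving entries into a new, right-sized dict; the old oversized dict is dropped here
gc.collect()             # optional: run a collection pass as in the example above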

Python: Behavior of the garbage collector

I have a Django application that exhibits some strange garbage collection behavior. There is one view in particular that will just keep growing the VM size significantly every time it is called - up to a certain limit, at which point usage drops back again. The problem is that it's taking considerable time until that point is reached, and in fact the virtual machine running my app doesn't have enough memory for all FCGI processes to take as much memory as they then sometimes do.
I've spent the last two days investigating this and learning about Python garbage collection, and I think I do understand what is happening now - for the most part. When using
gc.set_debug(gc.DEBUG_STATS)
I see the following output for a single request:
>>> c = django.test.Client()
>>> c.get('/the/view/')
gc: collecting generation 0...
gc: objects in each generation: 724 5748 147341
gc: done.
gc: collecting generation 0...
gc: objects in each generation: 731 6460 147341
gc: done.
[...more of the same...]
gc: collecting generation 1...
gc: objects in each generation: 718 8577 147341
gc: done.
gc: collecting generation 0...
gc: objects in each generation: 714 0 156614
gc: done.
[...more of the same...]
gc: collecting generation 0...
gc: objects in each generation: 715 5578 156612
gc: done.
So essentially, a huge number of objects are allocated, but are initially moved to generation 1, and when gen 1 is swept during the same request, they are moved to generation 2. If I do a manual gc.collect(2) afterwards, they are removed. And, as I mentioned, they are also removed when the next automatic gen 2 sweep happens, which, if I understand correctly, would in this case be something like every 10 requests (at that point the app needs about 150 MB).
Alright, so initially I thought that there might be some cyclic referencing going on within the processing of one request that prevents any of these objects from being collected while that request is handled. However, I've spent hours trying to find one using pympler.muppy and objgraph, both after and while debugging inside the request processing, and there don't seem to be any. Rather, it seems the 14,000 or so objects that are created during the request are all within a reference chain to some request-global object, i.e. once the request goes away, they can be freed.
That has been my attempt at explaining it, anyway. However, if that's true and there are indeed no cyclic dependencies, shouldn't the whole tree of objects be freed once whatever request object that causes them to be held goes away, without the garbage collector being involved, purely by virtue of the reference counts dropping to zero?
With that setup, here are my questions:
Does the above even make sense, or do I have to look for the problem elsewhere? Is it just an unfortunate accident that significant data is kept around for so long in this particular use case?
Is there anything I can do to avoid the issue? I already see some potential to optimize the view, but that appears to be a solution with limited scope - although I am not sure what a generic one would be, either; how advisable is it, for example, to call gc.collect() or gc.set_threshold() manually?
In terms of how the garbage collector itself works:
Do I understand correctly that an object is always moved to the next generation if a sweep looks at it and determines that the references it has are not cyclic, but can in fact be traced to a root object?
What happens if the gc does a, say, generation 1 sweep, and finds an object that is referenced by an object within generation 2; does it follow that relationship inside generation 2, or does it wait for a generation 2 sweep to occur before analyzing the situation?
When using gc.DEBUG_STATS, I care primarily about the "objects in each generation" info; however, I keep getting hundreds of "gc: 0.0740s elapsed.", "gc: 1258233035.9370s elapsed." messages; they are totally inconvenient - it takes considerable time for them to be printed out, and they make the interesting things a lot harder to find. Is there a way to get rid of them?
I don't suppose there is a way to do a gc.get_objects() by generation, i.e. only retrieve the objects from generation 2, for example?
Does the above even make sense, or do I have to look for the problem elsewhere? Is it just an unfortunate accident that significant data is kept around for so long in this particular use case?
Yes, it does make sense. And yes, there are other issues worth to consider. Django uses threading.local as base for DatabaseWrapper (and some contribs use it to make request object accessible from places where it's not passed explicitly). These global objects survive requests and can keep references to objects till some other view is handled in the thread.
Is there anything I can do to avoid the issue. I already see some potential to optimize the view, but that appears to be a solution with limited scope - although I am not sure what I generic one would be, either; how advisable is it for example to call gc.collect() or gc.set_threshold() manually?
General advice (probably you know it, but anyway): avoid circular references and globals (including threading.local). Try to break cycles and clear globals when Django's design makes it hard to avoid them. gc.get_referrers(obj) might help you find the places requiring attention. Another way is to disable the garbage collector and call it manually after each request, which is the best place to do it (this will also prevent objects from moving to the next generation).
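As a sketch, the "collect after each request" idea could be wired in as a piece of middleware (the class name is made up, and this uses the old-style process_response hook, so adapt it to your Django version):

import gc

class ExplicitGCMiddleware(object):
    # hypothetical middleware: run one full collection at the end of every request
    def process_response(self, request, response):
        gc.collect()
        return response

Listed in MIDDLEWARE_CLASSES and combined with gc.disable() at interpreter startup, this approximates the "manual collection only" approach described above.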
I don't suppose there is a way to do a gc.get_objects() by generation, i.e. only retrieve the objects from generation 2, for example?
Unfortunately this is not possible with the gc interface. But there are several ways to go. You can consider only the end of the list returned by gc.get_objects(), since objects in this list are sorted by generation. You can compare the list with the one returned from a previous call by storing weak references to the objects (e.g. in a WeakKeyDictionary) between calls. You can rewrite gc.get_objects() in your own C module (it's easy, mostly copy-paste programming!) since the objects are stored by generation internally, or even access the internal structures with ctypes (requires quite deep ctypes understanding).
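A small illustrative sketch of the weak-reference comparison (not from the original answer; note that many built-in containers such as lists, dicts and tuples cannot be weakly referenced, so they are skipped here):

import gc
import weakref

def snapshot():
    refs = []
    for obj in gc.get_objects():
        try:
            refs.append(weakref.ref(obj))
        except TypeError:  # objects that don't support weak references are skipped
            pass
    return refs

before = snapshot()
# ... handle one request here ...
survivors = [r() for r in before if r() is not None]  # objects from the old snapshot that are still alive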
I think your analysis looks sound. I'm not an expert on the gc, so whenever I have a problem like this I just add a call to gc.collect() in an appropriate, non time critical place, and forget about it.
I'd suggest you call gc.collect() in your view(s) and see what effect it has on your response time and your memory usage.
Note also this question which suggests that setting DEBUG=True eats memory like it is nearly past its sell by date.
