How to do proper memory management with ZODB? - python

I have read several ZODB tutorials, but there is one thing I still don't get: how do you free the memory held by objects that have already been serialized (and committed) to, say, a FileStorage?
More specifically, I want the following code to stop eating all my memory:
for i in xrange(bignumber):
    iobtree[i] = Bigobject()  # Bigobject is about 1 MB
    if i % 10 == 0:
        transaction.commit()  # or savepoint(True)
transaction.commit()
How can this be achieved? Is it possible to release the references stored by the iobtree and replace them with 'weak references' that would be accessible on demand?

Creating savepoints and committing the transaction already frees a lot of your memory.
You'll need to check what your ZODB cache parameters are set to, and tune these as necessary. The cache size parameter indicates the number of objects cached, not bytes, so you'll have to adjust this based on the size of your objects.
You can try calling .cacheMinimize() on the ZODB connection object; this explicitly deactivates any unmodified (or already committed) objects in the cache.
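For example, here is a rough sketch of the question's loop with an explicit cache limit and periodic cache minimization. The cache_size value and the data.fs path are placeholders, and Bigobject is assumed to be a persistent.Persistent subclass so its state can be ghosted:

import transaction
from ZODB import DB
from ZODB.FileStorage import FileStorage
from BTrees.IOBTree import IOBTree

# cache_size is a count of objects per connection, not bytes; with ~1 MB
# objects even a small count adds up quickly.
db = DB(FileStorage('data.fs'), cache_size=100)
connection = db.open()
root = connection.root()
root['tree'] = iobtree = IOBTree()

for i in xrange(bignumber):
    iobtree[i] = Bigobject()           # assumed persistent.Persistent subclass
    if i % 10 == 0:
        transaction.savepoint(True)    # flush pending changes out of memory
        connection.cacheMinimize()     # deactivate objects already saved
transaction.commit()
db.close()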
Other than that, do note that even when Python frees objects from memory, the OS doesn't always reclaim that freed memory until it is needed for something else. OS-reported memory usage doesn't necessarily reflect actual memory requirements for a Python process.

Related

How to forcibly free memory used by dictionary?

I am working on a Python script which queries several different databases to collate data and persist said data to another database. This script collects data from potentially millions of records across about 15 different databases. To attempt to speed up the script I have included some caching functionality, which boils down to having a dictionary which holds some frequently queried data. The dictionary holds key value pairs where the key is a hash generated based on the database name, collection name and query conditions and the value is the data retrieved from the database. For example:
{123456789: {_id: '1', someField: 'someValue'}} where 123456789 is the hash and {_id: '1', someField: 'someValue'} is the data retrieved from the database.
Holding this data in a local dictionary means that instead of having to query the databases each time, which is likely slow, I can access some frequently queried data locally. As mentioned, there are a lot of queries so the dictionary can grow pretty large (several gigabytes). I have some code which uses psutil to look at how much memory is available on the machine running the script and if the available memory gets below a certain threshold I clear the dictionary. The code to clear the dictionary is:
cached_documents.clear()
cached_documents = None
gc.collect()
cached_documents = {}
I should point out that cached_documents is a local variable which gets passed into all the methods that either access or add to the cache. Unfortunately, it seems that this isn't enough to free the memory properly, as Python is still holding onto a lot of extra memory even after calling the above code. A memory profile of the run (graph not reproduced here) shows the following:
The first few times the dictionary is cleared, a lot of memory is released back to the system, but each subsequent clear releases less, and eventually the memory usage flatlines. At that point the cache gets cleared extremely frequently, because the available memory stays at or below the threshold while Python holds onto the memory it has already allocated.
Is there a way to force Python to free the memory properly when clearing the dictionary so that I can avoid this flatlining? Any tips are appreciated.
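For illustration, the threshold check described above might look roughly like this; the psutil call is real, but the function name and the 2 GB threshold are hypothetical, not the script's exact code:

import gc
import psutil

MIN_AVAILABLE_BYTES = 2 * 1024 ** 3   # hypothetical 2 GB threshold

def maybe_clear_cache(cached_documents):
    # Clear the cache when available system memory drops below the threshold.
    if psutil.virtual_memory().available < MIN_AVAILABLE_BYTES:
        cached_documents.clear()
        gc.collect()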
Based on the comments on my original post, I made some changes.
As mentioned in the comments, Python does not seem to reliably return memory to the operating system until the process ends. In some applications, this means that you could spin up a separate process to do your memory-intensive work. See Releasing memory in Python for more details.
Unfortunately, this isn't applicable in my case, since the whole point is to have the data in memory when it's required.
Since Python holds on to some of the allocated memory and makes it available to other Python objects, I updated the criteria my script uses to clear the cache. Instead of basing this on available system memory, I now clear the cache based on its size. The rationale is that I can keep filling the cache and reuse the memory that Python is already holding. I found the cache-size threshold by taking a rough average of the first couple of clears in the graph from my question, then reduced the number slightly to add some leeway (e.g. a cache of size 10 can use different amounts of memory depending on what's inside it).
This is less safe than clearing the cache based on available memory, because the cache could grow bigger than the memory available on the system and cause out-of-memory errors, especially if other memory-hungry processes are running on the same machine; for my use case, however, this was a suitable trade-off.
Now that the cache is cleared based on its size rather than on available system memory, I seem to be able to take advantage of the memory Python holds onto. This may not be a perfect answer, but in my case it seems to work.
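A minimal sketch of that size-based approach; the entry-count threshold is an arbitrary placeholder to be tuned from profiling:

MAX_CACHE_ENTRIES = 100000   # hypothetical threshold derived from the graph

def maybe_clear_cache(cached_documents):
    # Clear the cache once it reaches the size threshold; Python can then
    # reuse the memory it is already holding for subsequent entries.
    if len(cached_documents) >= MAX_CACHE_ENTRIES:
        cached_documents.clear()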

Is there any benefit to deleting a reference to a large Python object before overwriting that reference?

I am running some memory-heavy scripts which iterate over documents in a database, and due to memory constraints on the server I manually delete references to the large object at the conclusion of each iteration:
for document in database:
    initial_function_calls()
    big_object = memory_heavy_operation(document)
    save_to_file(big_object)
    del big_object
    additional_function_calls()
The initial_function_calls() and additional_function_calls() are each slightly memory-heavy. Do I see any benefit by explicitly deleting the reference to the large object for garbage collection? Alternatively, does leaving it and having it point to a new object in the next iteration suffice?
As so often in these cases: it depends. :-/
I'm assuming we're talking about CPython here.
Using del or re-assigning a name reduces the reference count for an object. Only when that reference count reaches 0 can it be de-allocated. So if you inadvertently stashed a reference to big_object away somewhere, using del won't help.
When garbage collection is triggered depends on the amount of allocations and de-allocations. See the documentation for gc.set_threshold().
If you're pretty sure that there are no further references, you could use gc.collect() to force a garbage collection run. That might help if your code doesn't do a lot of other allocations.
One thing to keep in mind is that if the big_object is created by a C extension module (like e.g. numpy), it could manage its own memory. In that case the garbage collection won't affect it! Also small integers and small strings are pre-allocated and won't be garbage collected. You can use gc.is_tracked() to check if an object is managed by the garbage collector.
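For example, the gc module documentation gives these results for is_tracked:

>>> import gc
>>> gc.is_tracked(0)
False
>>> gc.is_tracked("a")
False
>>> gc.is_tracked([])
True
>>> gc.is_tracked({})
False
>>> gc.is_tracked({"a": 1})
True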
What I would suggest is that you run your program with and without del+gc.collect(), and monitor the amount of RAM used. On UNIX-like systems, look at the resident set size. You could also use sys._debugmallocstats().
Unless you see the resident set size grow and grow, I wouldn't worry about it.
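For example, here is a sketch of that comparison on a UNIX-like system, reusing the question's loop and the standard resource module. Note that ru_maxrss reports the peak resident set size (kilobytes on Linux, bytes on macOS), so the idea is to compare runs with and without the del and gc.collect() lines:

import gc
import resource

def peak_rss():
    # Peak resident set size so far for this process.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

for document in database:
    initial_function_calls()
    big_object = memory_heavy_operation(document)
    save_to_file(big_object)
    del big_object        # drop the reference...
    gc.collect()          # ...and force a collection run
    additional_function_calls()
    print('peak RSS so far: %d' % peak_rss())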

Does this mechanism use the Buffer or the Cache?

As far as I know, a buffer is something that has yet to be "written" to disk, while a cache is something that has been "read" from the disk and stored for later use.
But what about this mechanism: in Python, when a piece of memory is no longer being used, there is an area that is kept around for the next use instead of being released immediately.
I am wondering whether this area belongs to the buffer or the cache.
Thanks.
As far as I understand, the mechanism you mentioned is related to Python's memory management and garbage collection.
This isn't related to buffering or caching data. The cache and the buffer are different things, both used to reduce disk-related operations (reading data from or writing data to disk).
Python's memory mechanism is about how the interpreter allocates memory from the operating system.
You can read more about Python's garbage collector here and about the difference between cache and buffer here.

Find out how many uncommitted items are in the session

I developed an application which parses a lot of data, but if I commit only after parsing all of the data, it consumes too much memory. However, I cannot commit after every item either, because that causes too much hard-disk I/O.
Therefore, my question is how can I know how many uncommitted items are in the session?
You can use session.new. It's a collection of newly created, uncommitted objects. session.dirty can also be useful. To quote the docs:
# pending objects recently added to the Session
session.new
# persistent objects which currently have changes detected
# (this collection is now created on the fly each time the property is called)
session.dirty
# persistent objects that have been marked as deleted via session.delete(obj)
session.deleted
# dictionary of all persistent objects, keyed on their
# identity key
session.identity_map
You can keep track of uncommitted changes using session.new, session.dirty and session.deleted, in combination with tracking flushes (which cause .new, .dirty and .deleted to be reset) via the events system. See the conversation here for more details: https://groups.google.com/forum/#!topic/sqlalchemy/eGxpQBChXQw
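For the original question, the number of uncommitted (strictly, pending and unflushed) changes can then be counted and used to decide when to commit; the batch size of 1000 here is a placeholder:

pending = len(session.new) + len(session.dirty) + len(session.deleted)
if pending >= 1000:   # hypothetical batch size
    session.commit()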
This is a classic case of buffering. Try a reasonably large chunk: reduce it if it uses too much memory (or if you don't like it causing long pauses, etc.) and increase it if your profile shows too much CPU time spent in I/O calls.
To implement this, use a list: on each "write", append an item to the list, and keep a separate "flush" function that writes out the whole thing. On each append, check whether the list has reached its maximum size; if it has, write all the items and clear the list. At the end, call the flush function once more to write out the partially filled list. A sketch of this pattern follows.
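A minimal sketch of that buffering pattern in SQLAlchemy terms; the batch size and parse_all_data are placeholders:

BATCH_SIZE = 1000   # hypothetical chunk size; tune it by profiling

buffer = []

def write(item):
    buffer.append(item)
    if len(buffer) >= BATCH_SIZE:
        flush()

def flush():
    if buffer:
        session.add_all(buffer)   # stage the buffered items
        session.commit()          # one commit per chunk
        del buffer[:]             # clear the list in place

for record in parse_all_data():   # stand-in for the parsing loop
    write(record)
flush()   # write out whatever is left in the partially filled buffer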

Python reclaiming memory after deleting items in a dictionary

I have a relatively large dictionary in Python and would like to be able to not only delete items from it, but actually reclaim the memory back from these deletions in my program. I am running across a problem whereby although I delete items from the dictionary and even run the garbage collector manually, Python does not appear to be freeing the memory itself.
A simple example of this:
>>> tupdict = {}
# consumes around 2 GB of memory
>>> for i in xrange(12500000):
...     tupdict[i] = (i, i)
...
# delete over half the entries, no drop in consumed memory
>>> for i in xrange(7500000):
...     del tupdict[i]
...
>>> import gc
# manually garbage collect, still no drop in consumed memory after this
>>> gc.collect()
0
>>>
I imagine what is happening is that although the entries are deleted and garbage collector run, Python does not go ahead and resize the dictionary. My question is, is there any simple way around this, or am I likely to require a more serious rethink about how I write my program?
A lot of factors go into whether Python returns this memory to the underlying OS or not, which is probably how you're trying to tell if memory is being freed. CPython has a pooled allocator system that tends to hold on to freed memory so that it can be reused in an efficient manner (but these subsequent allocations won't increase your memory footprint from the perspective of the OS), which might be what you're seeing.
Also, on some unix platforms processes don't release freed memory back to the OS until the application closes (or some other significant event occurs). Even if you are in a situation where an entire pool has been freed (and thus Python might decide to free() it rather than holding it open for future objects), the OS still won't release this memory to be used by other processes (but can be used for further reallocation within the original process). In general this is good for reducing memory fragmentation and doesn't have too much of a downside, as the unused process memory will get paged out to disk. Windows does release process memory back to the OS for use by any new allocation (which you can then see in the Task Manager), so trying this on Windows will likely appear to give you a different result.
In the end, how to manage deallocated process memory is the purview of the operating system, and there are various schemes (with upsides and downsides) used such that just looking in your system information tool of choice won't necessarily tell you the whole truth.
You're right that Python doesn't shrink the dictionary when items are deleted from it. This has nothing to do with OS memory management or garbage collection; it is an implementation detail of Python's dict data structure.
A workaround is to create a new dictionary by copying the old dictionary. Check this great video for more info: http://pyvideo.org/video/276/the-mighty-dictionary-55 (around 26:30 there is an answer).
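A minimal sketch of that workaround, continuing the example above: rebuilding the dict moves the surviving entries into a hash table sized for their current number, so the oversized old table can be released or reused by the allocator.

# Rebuild the dictionary from the surviving entries; the old, oversized
# hash table is then free to be released or reused.
tupdict = dict(tupdict)
gc.collect()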
