How to forcibly free memory used by dictionary? - python

I am working on a Python script which queries several different databases to collate data and persist said data to another database. This script collects data from potentially millions of records across about 15 different databases. To attempt to speed up the script I have included some caching functionality, which boils down to having a dictionary which holds some frequently queried data. The dictionary holds key value pairs where the key is a hash generated based on the database name, collection name and query conditions and the value is the data retrieved from the database. For example:
{123456789: {_id: '1', someField: 'someValue'}} where 123456789 is the hash and {_id: '1', someField: 'someValue'} is the data retrieved from the database.
Holding this data in a local dictionary means that instead of having to query the databases each time, which is likely slow, I can access some frequently queried data locally. As mentioned, there are a lot of queries so the dictionary can grow pretty large (several gigabytes). I have some code which uses psutil to look at how much memory is available on the machine running the script and if the available memory gets below a certain threshold I clear the dictionary. The code to clear the dictionary is:
cached_documents.clear()
cached_documents = None
gc.collect()
cached_documents = {}
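For illustration, a minimal sketch of this kind of psutil check (the names and the 2 GB threshold here are placeholders, not the exact code I'm running):
import hashlib
import psutil

MEMORY_THRESHOLD_BYTES = 2 * 1024 ** 3  # illustrative: clear when < 2 GB is free

def make_cache_key(db_name, collection_name, query):
    # hash the database name, collection name and query conditions into one key
    raw = f"{db_name}|{collection_name}|{sorted(query.items())}"
    return hashlib.md5(raw.encode()).hexdigest()

def maybe_clear_cache(cached_documents):
    # empty the cache in place when available system memory drops too low
    if psutil.virtual_memory().available < MEMORY_THRESHOLD_BYTES:
        cached_documents.clear()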
I should point out that cached_documents is a local variable which gets passed into all the methods that either access or add to the cache. Unfortunately, it seems that this isn't enough to free the memory properly as Python is still holding onto a lot of extra memory, even after calling the above code. You can see a profile of the memory usage here:
Of note: the first few times the dictionary is cleared we release a lot of memory back to the system, but each subsequent clear seems to release less, until the memory usage flatlines. At that point the cache gets cleared extremely frequently, because the available memory stays near the threshold while Python keeps holding onto the extra memory.
Is there a way to force Python to free the memory properly when clearing the dictionary, so that I avoid this flatlining? Any tips are appreciated.

Based on the comments on my original post, I made some changes.
As mentioned in the comments, Python does not seem to reliably return memory to the operating system until a process ends. In some applications this means you can spin up a separate process to do the memory-intensive work. See Releasing memory in Python for more details.
Unfortunately, this isn't applicable in my case, since the whole point is to have the data in memory when it's required.
Since Python holds on to some of the allocated memory and makes it available to other Python objects, I updated the criteria my script uses to clear the cache. Instead of basing it on available system memory, I now clear the cache based on its size. The rationale is that I can keep filling the cache and reuse the memory Python is holding. I found the cache-size threshold by taking a rough average of the first couple of clears shown in the graph in my question, then reduced the number slightly to add a bit of leeway (e.g. a cache of size 10 can use different amounts of memory depending on what's inside it).
This is less safe than clearing the cache based on available memory, because the cache could grow larger than the memory available on the system and cause out-of-memory errors, especially if other memory-hungry processes run on the same machine; for my use case, however, this was a suitable trade-off.
Now with the cache being cleared based on its size rather than available system memory, I seem to be able to take advantage of Python holding onto memory. Although this may not be a perfect answer, in my case, it seems to work.
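For illustration, clearing on cache size rather than on available memory can be as simple as the sketch below; the entry-count threshold is a placeholder that would be tuned from the graph above.
MAX_CACHED_ENTRIES = 500_000  # placeholder threshold, found empirically from the graph

def maybe_clear_cache(cached_documents):
    # clear the cache once it holds more than a fixed number of entries
    if len(cached_documents) > MAX_CACHED_ENTRIES:
        cached_documents.clear()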

Related

Deallocation of dict value in for-loop [duplicate]

I wrote a Python program that acts on a large input file to create a few million objects representing triangles. The algorithm is:
read an input file
process the file and create a list of triangles, represented by their vertices
output the vertices in the OFF format: a list of vertices followed by a list of triangles. The triangles are represented by indices into the list of vertices
The requirement of OFF that I print out the complete list of vertices before I print out the triangles means that I have to hold the list of triangles in memory before I write the output to the file. In the meantime, I'm getting memory errors because of the size of the lists.
What is the best way to tell Python that I no longer need some of the data, and it can be freed?
According to the official Python documentation, you can explicitly invoke the garbage collector to release unreferenced memory with gc.collect(). Example:
import gc
gc.collect()
You should do that after marking what you want to discard using del:
del my_array
del my_object
gc.collect()
Unfortunately (depending on your version and release of Python) some types of objects use "free lists" which are a neat local optimization but may cause memory fragmentation, specifically by making more and more memory "earmarked" for only objects of a certain type and thereby unavailable to the "general fund".
The only really reliable way to ensure that a large but temporary use of memory DOES return all resources to the system when it's done, is to have that use happen in a subprocess, which does the memory-hungry work then terminates. Under such conditions, the operating system WILL do its job, and gladly recycle all the resources the subprocess may have gobbled up. Fortunately, the multiprocessing module makes this kind of operation (which used to be rather a pain) not too bad in modern versions of Python.
In your use case, it seems that the best way for the subprocesses to accumulate some results and yet ensure those results are available to the main process is to use semi-temporary files (by semi-temporary I mean, NOT the kind of files that automatically go away when closed, just ordinary files that you explicitly delete when you're all done with them).
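As a rough sketch of that pattern (not the asker's actual code): do the memory-hungry work in a child process started with multiprocessing and pass the results back through an ordinary file that the parent deletes once it has read it. The file name and the stand-in workload below are illustrative.
import multiprocessing
import os

def build_triangles(output_path):
    # stand-in for the memory-hungry step: build a big list, then write it out
    triangles = [((i, 0.0, 0.0), (0.0, i, 0.0), (0.0, 0.0, i)) for i in range(1_000_000)]
    with open(output_path, "w") as fh:
        for tri in triangles:
            fh.write(f"{tri}\n")

if __name__ == "__main__":
    out = "triangles.tmp"                  # the "semi-temporary" result file
    p = multiprocessing.Process(target=build_triangles, args=(out,))
    p.start()
    p.join()                               # when the child exits, the OS reclaims all its memory
    with open(out) as fh:
        result_count = sum(1 for _ in fh)  # the parent reads the results back
    os.remove(out)                         # and deletes the file when done with it
    print(result_count)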
The del statement might be of use, but IIRC it isn't guaranteed to free the memory. The docs are here, and an explanation of why the memory isn't released is here.
I have heard of people on Linux and Unix-type systems forking a Python process to do some work, getting the result and then killing it.
This article has notes on the Python garbage collector, but I think the lack of memory control is the downside of managed memory.
Python is garbage-collected, so if you reduce the size of your list, it will reclaim memory. You can also use the "del" statement to get rid of a variable completely:
biglist = [blah,blah,blah]
#...
del biglist
(del can be your friend, as it marks objects as being deletable when there are no other references to them. Note, however, that the CPython interpreter often keeps this memory for later use, so your operating system might not see the "freed" memory.)
Maybe you would not run into any memory problem in the first place by using a more compact structure for your data.
For example, lists of numbers are much less memory-efficient than the format used by the standard array module or the third-party numpy module. You would save memory by putting your vertices in a NumPy 3xN array and your triangles in an N-element array.
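For example, a sketch of that layout with illustrative sizes (assuming numpy is installed):
import numpy as np

num_vertices, num_triangles = 1_000_000, 2_000_000  # illustrative sizes

# three float coordinates per vertex, stored contiguously instead of as Python tuples
vertices = np.zeros((num_vertices, 3), dtype=np.float64)

# each triangle is three integer indices into the vertex array
triangles = np.zeros((num_triangles, 3), dtype=np.int32)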
You can't explicitly free memory. What you need to do is to make sure you don't keep references to objects. They will then be garbage collected, freeing the memory.
In your case, when you need large lists, you typically need to reorganize the code, typically using generators/iterators instead. That way you don't need to have the large lists in memory at all.
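A sketch of that idea: yield triangles lazily instead of accumulating them in a list. The file name and the one-triangle-per-line format are assumptions for illustration.
def read_triangles(path):
    # yield one triangle per line instead of building a full list in memory
    with open(path) as fh:
        for line in fh:
            x1, y1, z1, x2, y2, z2, x3, y3, z3 = map(float, line.split())
            yield (x1, y1, z1), (x2, y2, z2), (x3, y3, z3)

# consumers iterate lazily, so only one triangle is held in memory at a time
for triangle in read_triangles("mesh.txt"):
    pass  # process the triangle here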
I had a similar problem in reading a graph from a file. The processing included the computation of a 200 000x200 000 float matrix (one line at a time) that did not fit into memory. Trying to free the memory between computations using gc.collect() fixed the memory-related aspect of the problem but it resulted in performance issues: I don't know why but even though the amount of used memory remained constant, each new call to gc.collect() took some more time than the previous one. So quite quickly the garbage collecting took most of the computation time.
To fix both the memory and performance issues I switched to a multithreading trick I read once somewhere (I'm sorry, I cannot find the related post anymore). Before, I was reading each line of the file in a big for loop, processing it, and running gc.collect() every once in a while to free memory space. Now I call a function that reads and processes a chunk of the file in a new thread. Once the thread ends, the memory is automatically freed without the strange performance issue.
Practically it works like this:
from dask import delayed  # this module wraps the multithreading

def f(storage, index, chunk_size):  # the processing function
    # read the chunk of size chunk_size starting at index in the file
    # process it using data in storage if needed
    # append data needed for further computations to storage
    return storage

partial_result = delayed([])  # put the constructor for your data structure inside delayed()
# I personally use "delayed(nx.Graph())" since I am creating a networkx Graph

chunk_size = 100  # ideally as big as possible while still letting the computations fit in memory

for index in range(0, len(file), chunk_size):
    # we indicate to dask that we will want to apply f to the parameters partial_result, index, chunk_size
    partial_result = delayed(f)(partial_result, index, chunk_size)
    # no computations are done yet!
    # dask will spawn a thread to run f(partial_result, index, chunk_size) once we call partial_result.compute()
    # passing the previous "partial_result" variable as a parameter ensures a chunk is only processed after the previous one is done
    # it also lets you use the results from processing the previous chunks of the file if needed

# this launches all the computations
result = partial_result.compute()
# one thread is spawned for each "delayed", one at a time, to compute its result
# dask then closes the thread, which solves the memory-freeing issue
# the strange performance issue with gc.collect() is also avoided
Others have posted some ways that you might be able to "coax" the Python interpreter into freeing the memory (or otherwise avoid having memory problems). Chances are you should try their ideas out first. However, I feel it important to give you a direct answer to your question.
There isn't really any way to directly tell Python to free memory. The fact of that matter is that if you want that low a level of control, you're going to have to write an extension in C or C++.
That said, there are some tools to help with this:
cython
swig
boost python
As other answers already say, Python can refrain from releasing memory back to the OS even when it is no longer used by Python code (so gc.collect() doesn't free anything), especially in a long-running program. Anyway, if you're on Linux you can try to release memory by invoking the libc function malloc_trim directly (see its man page).
Something like:
import ctypes
libc = ctypes.CDLL("libc.so.6")
libc.malloc_trim(0)  # ask glibc to return fully free heap pages to the OS
If you don't care about vertex reuse, you could have two output files--one for vertices and one for triangles. Then append the triangle file to the vertex file when you are done.

Persistent in-memory Python object for nginx/uwsgi server

I doubt this is even possible, but here is the problem and proposed solution (the feasibility of the proposed solution is the object of this question):
I have some "global data" that needs to be available for all requests. I'm persisting this data to Riak and using Redis as a caching layer for access speed (for now...). The data is split into about 30 logical chunks, each about 8 KB.
Each request is required to read 4 of these 8KB chunks, resulting in 32KB of data read in from Redis or Riak. This is in ADDITION to any request-specific data which would also need to be read (which is quite a bit).
Assuming even 3000 requests per second (this isn't a live server, so I don't have real numbers, but 3000 rps is a reasonable assumption; it could be more), this means roughly 96 MB/s of transfer from Redis or Riak, in ADDITION to the already not-insignificant other calls being made from the application logic. Also, Python is parsing the JSON of these 8KB objects 3000 times every second.
All of this - especially Python having to repeatedly deserialize the data - seems like an utter waste, and a perfectly elegant solution would be to just have the deserialized data cached in an in-memory native object in Python, which I can refresh periodically as and when all this "static" data becomes stale. Once in a few minutes (or hours), instead of 3000 times per second.
But I don't know if this is even possible. You'd realistically need an "always running" application for it to cache any data in its memory. And I know this is not the case in the nginx+uwsgi+python combination (versus something like node) - python in-memory data will NOT be persisted across all requests to my knowledge, unless I'm terribly mistaken.
Unfortunately this is a system I have "inherited" and therefore can't make too many changes in terms of the base technology, nor am I knowledgeable enough of how the nginx+uwsgi+python combination works in terms of starting up Python processes and persisting Python in-memory data - which means I COULD be terribly mistaken with my assumption above!
So, direct advice on whether this solution would work + references to material that could help me understand how the nginx+uwsgi+python would work in terms of starting new processes and memory allocation, would help greatly.
P.S:
I have gone through some of the documentation for nginx, uwsgi, etc., but haven't fully understood the ramifications for my use case yet. I hope to make some progress on that going forward.
If the in-memory approach COULD work out, I would chuck Redis, since the only thing I'm caching in it is the static data mentioned above. That makes an in-process persistent in-memory Python cache even more attractive for me, removing one moving part from the system and at least FOUR network round-trips per request.
What you're suggesting isn't directly feasible. Since new processes can be spun up and down outside of your control, there's no way to keep native Python data in memory.
However, there are a few ways around this.
Often, one level of key-value storage is all you need. And sometimes, fixed-size buffers for values (which you can use directly as str/bytes/bytearray objects; anything else you need to pack in with struct or otherwise serialize) are all you need. In that case, uWSGI's built-in caching framework will take care of everything for you.
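A hedged sketch of what that can look like from application code, assuming a cache named "mycache" has been configured in the uWSGI ini file; the key scheme and the load_chunk_from_backend helper are hypothetical.
import json

import uwsgi  # only importable when running under uWSGI

def get_chunk(chunk_id):
    # fetch a chunk from the shared uWSGI cache, falling back to the backend on a miss
    raw = uwsgi.cache_get(str(chunk_id), "mycache")
    if raw is None:
        raw = load_chunk_from_backend(chunk_id)              # hypothetical Riak/Redis call returning JSON bytes
        uwsgi.cache_set(str(chunk_id), raw, 300, "mycache")  # expire after 300 seconds
    return json.loads(raw)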
If you need more precise control, you can look at how the cache is implemented on top of SharedArea and do something custom. However, I wouldn't recommend that. It basically gives you the same kind of API you get with a file, and the only real advantages over just using a file are that the server will manage the file's lifetime; it works in all uWSGI-supported languages, even those that don't allow files; and it makes it easier to migrate your custom cache to a distributed (multi-computer) cache if you later need to. I don't think any of those are relevant to you.
Another way to get flat key-value storage, but without the fixed-size buffers, is Python's stdlib anydbm module (dbm in Python 3). The key-value lookup is as Pythonic as it gets: it looks just like a dict, except that it's backed by an on-disk BDB (or similar) database and cached as appropriate in memory, instead of being stored in an in-memory hash table.
If you need to handle a few other simple types—anything that's blazingly fast to un/pickle, like ints—you may want to consider shelve.
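A minimal sketch with shelve (the file name and keys are illustrative):
import shelve

# looks like a dict, but values are pickled to an on-disk database as they are assigned
with shelve.open("global_data.db") as store:
    store["chunk:1"] = {"counter": 42, "labels": ["a", "b"]}
    print(store["chunk:1"]["counter"])  # -> 42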
If your structure is rigid enough, you can use a key-value database for the top level but access the values through a ctypes.Structure, or de/serialize them with struct. But usually, if you can do that, you can also eliminate the top level, at which point your whole thing is just one big Structure or Array.
At that point, you can just use a plain file for storage—either mmap it (for ctypes), or just open and read it (for struct).
Or use multiprocessing's Shared ctypes Objects to access your Structure directly out of a shared memory area.
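A small, self-contained sketch of the shared-ctypes approach (the field choices are illustrative):
from multiprocessing import Array, Process, Value

def worker(counter, samples):
    # mutate shared memory; the parent sees the changes after join()
    with counter.get_lock():
        counter.value += 1
    samples[0] = 3.14  # element access on Array is synchronized by default

if __name__ == "__main__":
    counter = Value("i", 0)           # a shared 32-bit int
    samples = Array("d", [0.0] * 10)  # a shared array of doubles
    p = Process(target=worker, args=(counter, samples))
    p.start()
    p.join()
    print(counter.value, samples[0])  # -> 1 3.14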
Meanwhile, if you don't actually need all of the cache data all the time, just bits and pieces every once in a while, that's exactly what databases are for. Again, anydbm, etc. may be all you need, but if you've got complex structure, draw up an ER diagram, turn it into a set of tables, and use something like MySQL.
"python in-memory data will NOT be persisted across all requests to my knowledge, unless I'm terribly mistaken."
You are mistaken.
The whole point of using uwsgi over, say, the CGI mechanism is to persist data across threads and save the overhead of initialization for each call. You must set processes = 1 in your .ini file; otherwise, depending on how uwsgi is configured, it might launch more than one worker process on your behalf. Log the env and look for 'wsgi.multiprocess': False and 'wsgi.multithread': True; all uwsgi.core threads for the single worker should then show the same data.
You can also see how many worker processes, and how many "core" threads under each, you have by using the built-in stats server.
That's why uwsgi provides lock and unlock functions for manipulating data stores from multiple threads.
You can easily test this by adding a /status route to your app that just dumps a JSON representation of your global data object, and viewing it every so often after actions that update the store.
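A bare-bones sketch of that test as a plain WSGI app (the global_data contents and route name are illustrative):
import json

# module-level state; it survives between requests handled by the same worker
global_data = {"chunks_loaded": 0}

def application(environ, start_response):
    if environ["PATH_INFO"] == "/status":
        payload = global_data             # dump the current in-process state
    else:
        global_data["chunks_loaded"] += 1 # mutate the in-process cache
        payload = {"ok": True}
    body = json.dumps(payload).encode()
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]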
You said nothing about writing this data back: is it static? In that case, the solution is very simple, and I have no clue what is up with all the "it's not feasible" responses.
uWSGI workers are always-running applications, so data absolutely gets persisted between requests. All you need to do is store the stuff in a global variable; that is it. Remember it's per-worker, and workers do restart from time to time, so you need proper loading/invalidation strategies.
If the data is updated very rarely (rarely enough to restart the server when it does), you can save even more. Just create the objects during app construction. This way, they will be created exactly once, and then all the workers will fork off the master and reuse the same data. Of course, it's copy-on-write, so if you update it you will lose the memory benefits (and the same can happen when the garbage collector touches those objects during a run, so it's not super predictable).
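A sketch of that preload-then-fork variant: build the objects at module import time so that, with default (non-lazy-apps) uWSGI settings, workers inherit them via copy-on-write. The load function and file name are hypothetical.
import json

def load_static_chunks(path):
    # deserialize the ~30 static chunks exactly once, at import time
    with open(path) as fh:
        return json.load(fh)

# runs in the uWSGI master before workers are forked (assuming lazy-apps is off),
# so every worker shares the same copy-on-write pages
STATIC_CHUNKS = load_static_chunks("static_chunks.json")

def application(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [str(len(STATIC_CHUNKS)).encode()]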
I have never actually tried it myself, but could you possibly use uWSGI's SharedArea to accomplish what you're after?

How to do proper memory management with ZODB?

I read several ZODB tutorials but here is one thing I still don't get: How do you free memory that is already serialized (and committed) to the (say) FileStorage?
More specifically, I want the following code to stop eating all my memory:
for i in xrange(bignumber):
    iobtree[i] = Bigobject()  # Bigobject is about 1Mb
    if i % 10 == 0:
        transaction.commit()  # or savepoint(True)
transaction.commit()
How can this be achieved? Is it possible to release references stored by iobtree and replace them by 'weak references' that would be accessible on demand?
Creating savepoints and committing the transaction already clears a lot of your memory.
You'll need to check what your ZODB cache parameters are set to, and tune these as necessary. The cache size parameter indicates the number of objects cached, not bytes, so you'll have to adjust this based on the size of your objects.
You can try calling .cacheMinimize() on the ZODB connection object; this explicitly deactivates any unmodified (or already committed) objects in the cache.
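For instance, combined with the loop from the question, that might look like the sketch below; connection is assumed to be the ZODB connection that iobtree was loaded from.
# continuing the snippet from the question; db/connection setup is assumed
for i in xrange(bignumber):
    iobtree[i] = Bigobject()          # Bigobject is about 1Mb, as in the question
    if i % 10 == 0:
        transaction.commit()          # persist the batch
        connection.cacheMinimize()    # evict committed, unmodified objects from the cache
transaction.commit()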
Other than that, do note that even when Python frees objects from memory, the OS doesn't always reclaim that freed memory until it is needed for something else. OS-reported memory usage doesn't necessarily reflect actual memory requirements for a Python process.

Creating an in-memory cache that persists between executions

I'm developing a Python command line utility that potentially involves rather large queries against a set of files. It's a reasonably finite list of queries (think indexed DB columns). To improve performance in-process, I can generate sorted/structured lists, maps and trees once and hit those repeatedly, rather than hitting the file system each time.
However, these caches are lost when the process ends, and need to be rebuilt every time the script runs, which dramatically increases the runtime of my program. I'd like to identify the best way to share this data between multiple executions of my command, which may be concurrent, one after another, or with significant delays between executions.
Requirements:
Must be fast - any sort of per-execution processing should be minimized, this includes disk IO and object construction.
Must be OS agnostic (or at least be able to hook into similar underlying behaviors on Unix/Windows, which is more likely).
Must allow reasonably complex querying / filtering - I don't think a key/value map will be good enough
Does not need to be up-to-date - (briefly) stale data is perfectly fine, this is just a cache, the actual data is being written to disk separately.
Can't use a heavyweight daemon process, like MySQL or MemCached - I want to minimize installation costs, and asking each user to install these services is too much.
Preferences:
I'd like to avoid any sort long running daemon process at all, if possible.
While I'd like to be able to update the cache quickly, rebuilding the whole cache on update isn't the end of the world, fast reads are much more important than fast writes.
In my ideal fantasy world, I'd be able to directly keep Python objects around between executions, sort of like Java threads (like Tomcat requests) sharing singleton data store objects, but I realize that may not be possible. The closer I can get to that though, the better.
Candidates:
SQLite in memory
SQLite on its own doesn't seem fast enough for my use case, since it's backed by disk and therefore will have to read from the file on every execution. Perhaps this isn't as bad as it seems, but it seems necessary to persistently store the database in memory. SQLite allows DBs to use memory as storage, but these DBs are destroyed upon program exit and cannot be shared between instances.
Flat file database loaded into memory with mmap
On the opposite end of the spectrum, I could write the caches to disk, then load them into memory with mmap, which can share the same memory space between separate executions. It's not clear to me what happens to the mmap if all processes exit, however. It's OK if the mmap is eventually flushed from memory, but I'd want it to stick around for a little bit (30 seconds? a few minutes?) so a user can run commands one after another and the cache can be reused. This example seems to imply that there needs to be an open mmap handle, but I haven't found any exact description of when memory-mapped files get dropped from memory and need to be reloaded from disk.
I think I could implement this, if mmap objects do stick around after exit, but it feels very low level, and I imagine someone's already got a more elegant solution implemented. I'd hate to start building this only to realize I've been rebuilding SQLite. On the other hand, it feels like it would be very fast, and I could make optimizations given my specific use case.
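For what it's worth, the read side of that idea is only a few lines (the file name is illustrative); whether the pages stay in the OS page cache after every process has exited is entirely up to the operating system.
import mmap

# map the cache file read-only; pages are loaded lazily and shared, via the OS
# page cache, with any other process that maps the same file
with open("cache.bin", "rb") as fh:
    mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    header = mm[:16]  # slicing reads straight from the mapping
    mm.close()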
Share Python objects between processes using Processing
The Processing package indicates "Objects can be shared between processes using ... shared memory". Looking through the rest of the docs, I didn't see any further mention of this behavior, but that sounds very promising. Can anyone direct me to more information?
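The Processing package is the pre-2.6 ancestor of the stdlib multiprocessing module. The closest modern stdlib feature for sharing raw bytes between separately started processes is multiprocessing.shared_memory (Python 3.8+); below is a hedged sketch with an illustrative segment name, noting that segment lifetime across executions is platform-dependent.
from multiprocessing import shared_memory

# first execution: create a named segment and write into it
seg = shared_memory.SharedMemory(name="my_cli_cache", create=True, size=1024)
seg.buf[:5] = b"hello"
seg.close()

# a later execution: attach to the same segment by name and read it back
seg2 = shared_memory.SharedMemory(name="my_cli_cache")
print(bytes(seg2.buf[:5]))  # b'hello'
seg2.close()
seg2.unlink()               # free the segment once nobody needs it any more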
Store data on a RAM disk
My concern here is OS-specific capabilities, but I could create a RAM disk and then simply read/write to it as I please (SQLite?). The fs.memoryfs package seems like a promising alternative to work with multiple OSs, but the comments imply a fair number of limitations.
I know pickle is an efficient way to store Python objects, so it might have speed advantages over any sort of manual data storage. Can I hook pickle into any of the above options? Would that be better than flat files or SQLite?
I know there's a lot of questions related to this, but I did a fair bit of digging and couldn't find anything directly addressing my question with regards to multiple command line executions.
I fully admit, I may be way overthinking this. I'm just trying to get a feel for my options, and if they're worthwhile or not.
Thank you so much for your help!
I would just do the simplest thing that might possibly work. ...which in your case would likely just be to dump to a pickle file. If you find it's not fast enough, try something more involved (like memcached or SQLite). Donald Knuth says "Premature optimization is the root of all evil"!
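A minimal sketch of the dump-to-a-pickle-file approach (the file name is illustrative):
import os
import pickle

CACHE_PATH = "query_cache.pickle"

def load_cache():
    # load the prebuilt lookup structures, or return None so the caller rebuilds them
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "rb") as fh:
            return pickle.load(fh)
    return None

def save_cache(cache):
    # persist the structures so the next execution can skip the rebuild
    with open(CACHE_PATH, "wb") as fh:
        pickle.dump(cache, fh, protocol=pickle.HIGHEST_PROTOCOL)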

Python reclaiming memory after deleting items in a dictionary

I have a relatively large dictionary in Python and would like to be able to not only delete items from it, but actually reclaim the memory back from these deletions in my program. I am running across a problem whereby although I delete items from the dictionary and even run the garbage collector manually, Python does not appear to be freeing the memory itself.
A simple example of this:
>>> tupdict = {}
# consumes around 2 GB of memory
>>> for i in xrange(12500000):
...     tupdict[i] = (i, i)
...
# delete over half the entries, no drop in consumed memory
>>> for i in xrange(7500000):
...     del tupdict[i]
...
>>> import gc
# manually garbage collect, still no drop in consumed memory after this
>>> gc.collect()
0
>>>
I imagine what is happening is that although the entries are deleted and garbage collector run, Python does not go ahead and resize the dictionary. My question is, is there any simple way around this, or am I likely to require a more serious rethink about how I write my program?
A lot of factors go into whether Python returns this memory to the underlying OS or not, which is probably how you're trying to tell if memory is being freed. CPython has a pooled allocator system that tends to hold on to freed memory so that it can be reused in an efficient manner (but these subsequent allocations won't increase your memory footprint from the perspective of the OS), which might be what you're seeing.
Also, on some unix platforms processes don't release freed memory back to the OS until the application closes (or some other significant event occurs). Even if you are in a situation where an entire pool has been freed (and thus Python might decide to free() it rather than holding it open for future objects), the OS still won't release this memory to be used by other processes (but can be used for further reallocation within the original process). In general this is good for reducing memory fragmentation and doesn't have too much of a downside, as the unused process memory will get paged out to disk. Windows does release process memory back to the OS for use by any new allocation (which you can then see in the Task Manager), so trying this on Windows will likely appear to give you a different result.
In the end, how to manage deallocated process memory is the purview of the operating system, and there are various schemes (with upsides and downsides) used such that just looking in your system information tool of choice won't necessarily tell you the whole truth.
You're right that Python doesn't shrink the dictionary when items are deleted from it. This has nothing to do with OS memory management or garbage collection; it is an implementation detail of Python's dict data structure.
A workaround is to create a new dictionary by copying the old dictionary. Check this great video for more info: http://pyvideo.org/video/276/the-mighty-dictionary-55 (around 26:30 there is an answer).
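A sketch of that workaround applied to the example above: copy the survivors into a new dict so the old, oversized hash table becomes garbage.
# after deleting entries, copy the survivors into a fresh, right-sized dict
tupdict = dict(tupdict)
# the old, oversized hash table is now unreferenced and can be reclaimed
# (subject to the allocator caveats described in the other answer)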
