I developed an application which parses a lot of data. If I commit only after all the data has been parsed, it consumes too much memory; but if I commit each item as soon as it is parsed, it costs too much hard disk I/O.
Therefore, my question is: how can I know how many uncommitted items are in the session?
You can use session.new. It's a collection of newly created and uncommitted objects. Also, session.dirty can be useful. To quote the docs:
# pending objects recently added to the Session
session.new
# persistent objects which currently have changes detected
# (this collection is now created on the fly each time the property is called)
session.dirty
# persistent objects that have been marked as deleted via session.delete(obj)
session.deleted
# dictionary of all persistent objects, keyed on their
# identity key
session.identity_map
You can keep track of uncommitted changes using session.new, session.dirty and session.deleted, in combination with tracking flushes (which cause .new, .dirty and .deleted to be reset) via the events system. See the conversation here for more details: https://groups.google.com/forum/#!topic/sqlalchemy/eGxpQBChXQw
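A minimal sketch of that approach, assuming a Session instance named session already exists (the counter and function names are illustrative, not part of SQLAlchemy):

from sqlalchemy import event

def pending_count(session):
    # objects created, modified or deleted since the last flush
    return len(session.new) + len(session.dirty) + len(session.deleted)

flushed_since_commit = 0

@event.listens_for(session, "after_flush")
def count_flushed(session, flush_context):
    # after_flush still sees the pre-flush collections, so this counts what was just flushed
    global flushed_since_commit
    flushed_since_commit += pending_count(session)

@event.listens_for(session, "after_commit")
def reset_counter(session):
    global flushed_since_commit
    flushed_since_commit = 0

# total uncommitted items = flushed-but-uncommitted + still pending in the session
uncommitted = flushed_since_commit + pending_count(session)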
This is a classic case of buffering. Try a reasonably large chunk size and reduce it if there is too much disk I/O at once (or if you don't like the long pauses it causes, etc.), or increase it if your profile shows too much CPU time spent in I/O calls.
To implement this, use a list: each "write" appends an item to the list. Have a separate "flush" function that writes out the whole thing. On each append, check whether the list has reached its maximum size; if so, write everything out and clear the list. At the end, call the flush function once more to write the partially filled list, as in the sketch below.
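A minimal sketch of that buffering pattern; the names and the flush threshold are chosen for illustration, and commit_items stands in for whatever actually persists a batch (e.g. a bulk insert plus commit):

class BufferedWriter(object):
    def __init__(self, commit_items, max_size=1000):
        self.commit_items = commit_items  # callable that persists a list of items
        self.max_size = max_size
        self.buffer = []

    def write(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.max_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.commit_items(self.buffer)
            self.buffer = []

# usage: call writer.write(item) for each parsed item, then writer.flush() once at the end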
I am working on a Python script which queries several different databases to collate data and persist said data to another database. This script collects data from potentially millions of records across about 15 different databases. To attempt to speed up the script I have included some caching functionality, which boils down to having a dictionary which holds some frequently queried data. The dictionary holds key value pairs where the key is a hash generated based on the database name, collection name and query conditions and the value is the data retrieved from the database. For example:
{123456789: {_id: '1', someField: 'someValue'}} where 123456789 is the hash and {_id: '1', someField: 'someValue'} is the data retrieved from the database.
Holding this data in a local dictionary means that instead of having to query the databases each time, which is likely slow, I can access some frequently queried data locally. As mentioned, there are a lot of queries so the dictionary can grow pretty large (several gigabytes). I have some code which uses psutil to look at how much memory is available on the machine running the script and if the available memory gets below a certain threshold I clear the dictionary. The code to clear the dictionary is:
cached_documents.clear()
cached_documents = None
gc.collect()
cached_documents = {}
I should point out that cached_documents is a local variable which gets passed into all the methods that either access or add to the cache. Unfortunately, it seems that this isn't enough to free the memory properly as Python is still holding onto a lot of extra memory, even after calling the above code. You can see a profile of the memory usage here:
Of note is the fact that the first few times the dictionary is cleared, a lot of memory is released back to the system, but each subsequent clear seems to release less. Eventually the memory usage flatlines: the cache gets cleared extremely frequently because the available memory stays near the threshold, since Python is holding onto so much memory.
Is there a way to force Python to free the memory properly when clearing the dictionary so that I avoid flatlining? Any tips are appreciated.
Based on the comments on my original post, I made some changes.
As mentioned in the comments, Python does not seem to reliably return memory to the operating system until a process ends. In some applications, this means that you could spin up a separate process to do your memory-intensive work. See Releasing memory in Python for more details.
Unfortunately, this isn't applicable in my case, since the whole point is to have the data in memory when it's required.
Since Python holds some of the allocated memory and makes it available for other Python objects, I updated the criteria my script uses to clear the cache. Instead of basing it on available system memory, I clear the cache based on its size. The rationale is that I can keep filling the cache and reusing the memory that Python is already holding. I found the cache-size threshold by taking a rough average of the first couple of cache clears in the graph in my question, then reduced the number slightly to add a bit of leeway (e.g. a cache of size 10 can use different amounts of memory depending on what's inside it).
This is less safe than clearing the cache based on available memory, because there is the possibility that the cache grows to be bigger than the available memory on the system, causing out-of-memory errors, especially if other processes on the system require lots of memory. However, for my use case this was a suitable trade-off.
Now with the cache being cleared based on its size rather than available system memory, I seem to be able to take advantage of Python holding onto memory. Although this may not be a perfect answer, in my case, it seems to work.
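As a sketch of what the size-based clearing looks like (the threshold is an assumed value you would derive from your own profiling, and cache_put is an illustrative helper, not the asker's actual code):

MAX_CACHE_ENTRIES = 500000  # rough entry count at which memory became a problem

def cache_put(cached_documents, key, value):
    # clear by entry count rather than by available system memory
    if len(cached_documents) >= MAX_CACHE_ENTRIES:
        cached_documents.clear()  # the freed memory stays with Python and gets reused
    cached_documents[key] = value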
The Python documentation says this about the sync() method:
Write back all entries in the cache if the shelf was opened with
writeback set to True. Also empty the cache and synchronize the
persistent dictionary on disk, if feasible. This is called
automatically when the shelf is closed with close().
I am really having a hard time understanding this.
How does accessing data from cache differ from accessing data from disk?
And does emptying the cache affect how we can access the data stored in a shelve?
For whoever is using the data in the Shelve object, it is transparent whether the data is cached or on disk. If a value is not in the cache, the file is read, the cache is filled, and the value is returned. Otherwise, the value as it is in the cache is used.
If the cache is emptied when sync is called, that only means that the next value fetched from the same Shelve instance will be read from the file again. Since it is all automatic, there is no difference for the caller; the documentation is mostly describing how it is implemented.
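A minimal sketch of that behaviour (the filename and keys are arbitrary):

import shelve

shelf = shelve.open("data.db", writeback=True)
shelf["config"] = {"threshold": 10}
shelf["config"]["threshold"] = 20   # with writeback=True this only mutates the in-memory cache
shelf.sync()                        # cached entries are written back to disk, then the cache is emptied
print(shelf["config"]["threshold"]) # transparently re-read from the file: prints 20
shelf.close()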
If you are trying to open the same "shelve" file with two concurrent apps, or even two instances of shelve in the same program, chances are you are headed for big problems. Other than that, it just behaves as a "persistent dictionary", and that is it.
This pattern of writing to disk and re-reading from a single file makes no difference for a workload of a single user in an interactive program. For a Python program running as a server with tens to thousands of clients, or even a single big-data processing script, where this could impact actual performance, Shelve is hardly a usable thing anyway.
I doubt this is even possible, but here is the problem and proposed solution (the feasibility of the proposed solution is the object of this question):
I have some "global data" that needs to be available for all requests. I'm persisting this data to Riak and using Redis as a caching layer for access speed (for now...). The data is split into about 30 logical chunks, each about 8 KB.
Each request is required to read 4 of these 8KB chunks, resulting in 32KB of data read in from Redis or Riak. This is in ADDITION to any request-specific data which would also need to be read (which is quite a bit).
Assuming even 3000 requests per second (this isn't a live server so I don't have real numbers, but 3000 rps is a reasonable assumption, could be more), this means roughly 96 MB/s of transfer from Redis or Riak, in ADDITION to the already not-insignificant other calls being made by the application logic. Also, Python is parsing the JSON of these 8 KB objects 3000 times every second.
All of this - especially Python having to repeatedly deserialize the data - seems like an utter waste, and a perfectly elegant solution would be to just have the deserialized data cached in an in-memory native object in Python, which I can refresh periodically as and when all this "static" data becomes stale. Once in a few minutes (or hours), instead of 3000 times per second.
But I don't know if this is even possible. You'd realistically need an "always running" application for it to cache any data in its memory. And I know this is not the case in the nginx+uwsgi+python combination (versus something like node) - python in-memory data will NOT be persisted across all requests to my knowledge, unless I'm terribly mistaken.
Unfortunately this is a system I have "inherited" and therefore can't make too many changes in terms of the base technology, nor am I knowledgeable enough of how the nginx+uwsgi+python combination works in terms of starting up Python processes and persisting Python in-memory data - which means I COULD be terribly mistaken with my assumption above!
So, direct advice on whether this solution would work + references to material that could help me understand how the nginx+uwsgi+python would work in terms of starting new processes and memory allocation, would help greatly.
P.S:
Have gone through some of the documentation for nginx, uwsgi etc. but haven't fully understood the ramifications for my use case yet. Hope to make some progress on that going forward.
If the in-memory thing COULD work out, I would chuck Redis, since I'm caching ONLY the static data I mentioned above, in it. This makes an in-process persistent in-memory Python cache even more attractive for me, reducing one moving part in the system and at least FOUR network round-trips per request.
What you're suggesting isn't directly feasible. Since new processes can be spun up and down outside of your control, there's no way to keep native Python data in memory.
However, there are a few ways around this.
Often, one level of key-value storage is all you need. And sometimes, fixed-size buffers for values (which you can use directly as str/bytes/bytearray objects; anything else has to be packed in with struct or otherwise serialized) are all you need. In that case, uWSGI's built-in caching framework will take care of everything you need.
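For illustration, a hedged sketch of the uWSGI caching framework; it assumes a cache configured in the uWSGI config (e.g. cache2 = name=chunks,items=100), and fetch_chunk_from_riak is a hypothetical loader. Values come back as byte strings, so the JSON still gets deserialized, but no network round-trip is involved:

import json
import uwsgi  # only importable when running under uWSGI

def get_chunk(chunk_id):
    raw = uwsgi.cache_get(chunk_id, "chunks")
    if raw is None:
        raw = fetch_chunk_from_riak(chunk_id)          # hypothetical loader returning JSON bytes
        uwsgi.cache_set(chunk_id, raw, 300, "chunks")  # cache for 300 seconds
    return json.loads(raw)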
If you need more precise control, you can look at how the cache is implemented on top of SharedArea and do something custom. However, I wouldn't recommend that. It basically gives you the same kind of API you get with a file, and the only real advantages over just using a file are that the server will manage the file's lifetime; it works in all uWSGI-supported languages, even those that don't allow files; and it makes it easier to migrate your custom cache to a distributed (multi-computer) cache if you later need to. I don't think any of those are relevant to you.
Another way to get flat key-value storage, but without the fixed-size buffers, is with Python's stdlib anydbm. The key-value lookup is as pythonic as it gets: it looks just like a dict, except that it's backed up to an on-disk BDB (or similar) database, cached as appropriate in memory, instead of being stored in an in-memory hash table.
If you need to handle a few other simple types—anything that's blazingly fast to un/pickle, like ints—you may want to consider shelve.
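A minimal sketch of both options (anydbm was renamed dbm in Python 3; the filenames and keys are arbitrary):

import dbm

db = dbm.open("cache.db", "c")       # "c": create the file if it does not exist
db[b"chunk:1"] = b'{"_id": "1"}'     # keys and values are byte strings
print(db[b"chunk:1"])
db.close()

import shelve

with shelve.open("cache-shelf") as shelf:   # shelve layers pickling on top of dbm
    shelf["chunk:1"] = {"_id": "1", "someField": "someValue"}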
If your structure is rigid enough, you can use a key-value database for the top level, but access the values through a ctypes.Structure, or de/serialize them with struct. But usually, if you can do that, you can also eliminate the top level, at which point your whole thing is just one big Structure or Array.
At that point, you can just use a plain file for storage—either mmap it (for ctypes), or just open and read it (for struct).
Or use multiprocessing's Shared ctypes Objects to access your Structure directly out of a shared memory area.
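As a hedged sketch of the fixed-layout variant: a plain file mapped with mmap and viewed through a ctypes.Structure (the record layout and filename are invented for illustration, and the file is assumed to already exist and be non-empty):

import ctypes
import mmap

class Record(ctypes.Structure):
    _fields_ = [("id", ctypes.c_uint64),
                ("value", ctypes.c_double)]

with open("records.bin", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)
    count = len(mm) // ctypes.sizeof(Record)
    records = (Record * count).from_buffer(mm)   # no parsing, just struct access
    print(records[0].id, records[0].value)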
Meanwhile, if you don't actually need all of the cache data all the time, just bits and pieces every once in a while, that's exactly what databases are for. Again, anydbm, etc. may be all you need, but if you've got complex structure, draw up an ER diagram, turn it into a set of tables, and use something like MySQL.
"python in-memory data will NOT be persisted across all requests to my knowledge, unless I'm terribly mistaken."
You are mistaken.
The whole point of using uWSGI over, say, the CGI mechanism is to persist data across threads and save the overhead of initialization for each call. You must set processes = 1 in your .ini file, or, depending on how uWSGI is configured, it might launch more than one worker process on your behalf. Log the env and look for 'wsgi.multiprocess': False and 'wsgi.multithread': True; then all uwsgi.core threads for the single worker should show the same data.
You can also see how many worker processes, and "core" threads under each, you have by using the built-in stats server.
That's why uWSGI provides lock and unlock functions for manipulating data stores from multiple threads.
You can easily test this by adding a /status route to your app that just dumps a JSON representation of your global data object, and viewing it every so often after actions that update the store.
You said nothing about writing this data back; is it static? In that case, the solution is very simple, and I have no clue what is up with all the "it's not feasible" responses.
uWSGI workers are always-running applications, so data absolutely gets persisted between requests. All you need to do is store stuff in a global variable; that is it. Remember that it's per worker, and workers do restart from time to time, so you need proper loading/invalidation strategies.
If the data is updated very rarely (rarely enough to restart the server when it does), you can save even more. Just create the objects during app construction. This way, they will be created exactly once, and then all the workers will fork off the master and reuse the same data. Of course, it's copy-on-write, so if you update it, you will lose the memory benefits (the same thing will happen if Python decides to compact its memory during a GC run, so it's not super predictable).
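A hedged sketch of that per-worker pattern (load_global_data is a hypothetical function standing in for the Riak/Redis fetch, and the refresh interval is illustrative):

import json
import time

GLOBAL_DATA = load_global_data()    # runs once per worker (or once in the master, pre-fork)
LOADED_AT = time.time()
REFRESH_SECONDS = 300

def application(environ, start_response):
    global GLOBAL_DATA, LOADED_AT
    if time.time() - LOADED_AT > REFRESH_SECONDS:
        GLOBAL_DATA = load_global_data()   # periodic refresh instead of 3000 reads per second
        LOADED_AT = time.time()
    start_response("200 OK", [("Content-Type", "application/json")])
    return [json.dumps(GLOBAL_DATA).encode("utf-8")]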
I have never actually tried it myself, but could you possibly use uWSGI's SharedArea to accomplish what you're after?
A while ago I wrote a Markov chain text generator for IRC in Python. It would consume all of my VPS's free memory after running for a month or two, and I would need to purge its data and start over. Now I'm rewriting it and I want to tackle the memory issue as elegantly as possible.
The data I have to keep trimmed down is generally a dictionary that maps strings to lists of strings. More specifically, each word in a message is mapped to all the possible subsequent words. This is still an oversimplification, but it's sufficient for contextualizing my problem.
Currently, the solution I'm wrestling with involves managing "buckets" of data. It would keep track of each bucket's apparent size, "archive" a bucket once it's reached a certain size and move on to a new one, and after 5 or so buckets it would delete the oldest "archived" bucket every time a new one is created. This has the advantage of simplicity: removing an entire bucket doesn't create any dead-ends or unreachable words because the words from each message all go into the same bucket.
The problem is that "keeping track of each bucket's apparent size" is easier said than done.
I first tried using sys.getsizeof, but quickly found that it's impractical for determining the object's actual size in memory. I've also looked into guppy / heapy / various other memory usage modules, but none of them seem to do what I'm looking for (i.e. benchmark a single object). Currently I'm experimenting with the lower-level psutil module. Here's an excerpt from the current state of the application:
import copy
import os

import psutil


class Markov(object):
    # (constants declared here)

    def __init__(self):
        self.proc = psutil.Process(os.getpid())
        self.buckets = []
        self._newbucket()

    def _newbucket(self):
        self.buckets.append(copy.deepcopy(self.EMPTY_BUCKET))

    def _checkmemory(f):
        def checkmemory(self, *args, **kwargs):
            # Check memory usage of the process and the entire system
            if (self.proc.get_memory_percent() > self.MAX_MEMORY
                    or psutil.virtual_memory().percent > self.MAX_TOTAL_MEMORY):
                # Expire the oldest bucket
                self.buckets.pop(0)
                # If we just removed the last bucket, add a new one
                if not self.buckets:
                    self._newbucket()
            return f(self, *args, **kwargs)
        return checkmemory

    @_checkmemory
    def process(self, msg):
        # generally, this adds the words in msg to self.buckets[-1]
        pass

    @_checkmemory
    def generate(self, keywords):
        # generally, this uses the words in all the buckets to create a sentence
        pass
The problem here is that this will only expire buckets; I have no idea when to "archive" the current bucket because Python's overhead memory prevents me from accurately determining how far I am from hitting self.MAX_MEMORY. Not to mention that the Markov class is actually one of many "plugins" being managed by a headless IRC client (another detail I omitted for brevity's sake), so the overhead is not only present, but unpredictable.
In short: is there a way to accurately benchmark single Python objects? Alternatively, if you can think of a better way to 'expire' old data than my bucket-based solution, I'm all ears.
This might be a bit of a hacky solution, but if your bucket objects are pickleable (and it sounds like they are), you could pickle them and measure the byte-length of the pickled object string. It may not be exactly the size of the unpacked object in memory, but it should grow linearly as the object grows and give you a fairly good idea of relative size between objects.
To prevent having to pickle really large objects, you can measure the size of each entry added to the bucket by pickling it on its own, and adding its byte length to the bucket's running total.
Bear in mind, though, that if you do this there will be some overhead memory used in the internal bindings of the entry and the bucket that will not be reflected by the independent size of the entry itself, but you can run some tests to profile this and figure out what the %memory overhead is going to be for each new entry beyond its actual size.
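A minimal sketch of that idea (the threshold and class layout are illustrative, not the asker's actual code):

import pickle

MAX_BUCKET_BYTES = 10 * 1024 * 1024   # assumed per-bucket budget

class Bucket(object):
    def __init__(self):
        self.words = {}          # word -> list of possible following words
        self.approx_bytes = 0

    def add(self, word, following):
        # track relative size by pickled length; not exact RAM usage, but it grows with the data
        self.approx_bytes += len(pickle.dumps((word, following)))
        self.words.setdefault(word, []).append(following)

    def full(self):
        return self.approx_bytes >= MAX_BUCKET_BYTES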
I read several ZODB tutorials but here is one thing I still don't get: How do you free memory that is already serialized (and committed) to the (say) FileStorage?
More specifically, I want the following code to stop eating all my memory:
for i in xrange(bignumber):
    iobtree[i] = Bigobject()  # Bigobject is about 1 MB
    if i % 10 == 0:
        transaction.commit()  # or savepoint(True)
transaction.commit()
How can this be achieved? Is it possible to release references stored by iobtree and replace them by 'weak references' that would be accessible on demand?
Creating savepoints and committing the transaction already clears a lot of your memory.
You'll need to check what your ZODB cache parameters are set to, and tune these as necessary. The cache size parameter indicates the number of objects cached, not bytes, so you'll have to adjust this based on the size of your objects.
You can try calling .cacheMinimize() on the ZODB connection object; this explicitly deactivates any unmodified (or already committed) objects in the cache.
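A hedged sketch of both knobs together (the cache_size value and filename are illustrative):

from ZODB import DB, FileStorage
import transaction

storage = FileStorage.FileStorage("data.fs")
db = DB(storage, cache_size=50)     # per-connection cache limit, counted in objects, not bytes
connection = db.open()
root = connection.root()

# ... fill the IOBTree here, committing every few objects ...
transaction.commit()
connection.cacheMinimize()          # deactivate unmodified objects in the cache right now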
Other than that, do note that even when Python frees objects from memory, the OS doesn't always reclaim that freed memory until it is needed for something else. OS-reported memory usage doesn't necessarily reflect actual memory requirements for a Python process.