I have a module which supports creation of geographic objects using a company-standard interface. After these objects are created, the update_db() method is called and all of the objects are inserted into a database.
It is important that all of the objects be inserted in a single session, in order to keep counters and statistics consistent before updating the production database.
The problem is that sometimes there are simply too many objects and memory fills up.
Is there a way to create a cached list in Python, to handle lists that do not fit into memory?
My general thought was:
class CachedList(object):
    def __init__(self, max_memory_size, directory):
        pass
    def get_item(self, index):
        pass
    def set_item(self, index, value):
        pass
    def del_item(self, index):
        pass
    def append(self, item):
        pass
An ordinary list would be created upon initialization. When the list's size exceeds max_memory_size, the list elements would be pickled and stored in a file in directory. get_item(), set_item() and del_item() would operate on the data held in memory, or 'swap' it in from disk when needed.
Is this a good design? Are there any standard alternatives?
How can I force garbage collection after pickling parts of the list?
Thanks,
Adam
Use shelve. Your keys are the indices to your list.
I think your first question is answered. On the second, forcing garbage collection: call gc.collect(). See http://docs.python.org/library/gc.html.
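To make the shelve suggestion concrete, here is a minimal sketch of a disk-backed, list-like wrapper; the class name and the string-key scheme are my own assumptions, not a standard recipe:

import gc
import shelve

class DiskBackedList(object):
    """Hypothetical list-like container whose items live in a shelve file,
    so the full list never has to fit in memory."""

    def __init__(self, path):
        self.db = shelve.open(path)      # keys are the list indices, as strings
        self.length = len(self.db)

    def append(self, item):
        self.db[str(self.length)] = item
        self.length += 1

    def __getitem__(self, index):
        return self.db[str(index)]

    def __setitem__(self, index, value):
        self.db[str(index)] = value

    def __len__(self):
        return self.length

    def close(self):
        self.db.close()
        gc.collect()                     # optionally reclaim freed objects right away

Each assignment goes straight to disk, so Python is free to discard the in-memory copies as soon as nothing else references them.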
Related
In my project, I periodically use pickling to represent the internal state of the process for persistence. As a part of normal operation, references to objects are added to and removed from multiple other objects.
For example Person might have an attribute called address_list (a list) that contains the Address objects representing all the properties they are trying to sell. Another object, RealEstateAgent, might have an attribute called addresses_for_sale (also a list) which contains the same type of Address objects, but only those ones that are listed at their agency.
If a seller takes their property off the market, or it is sold, the Address is removed from both lists.
Both Person and RealEstateAgent objects are members of a central Masterlist object, which is what gets pickled. My problem is that as I add and remove properties and pickle the Masterlist object repeatedly over time, the size of the pickle file grows, even when I have removed (del, actually) more properties than I have added. I realize that, in pickling Masterlist, there is a circular reference; there are many circular references in my application.
I examined the pickle file using pickletools.dis(), and while it's hard to human-read, I see references to Addresses that have been removed. I am sure they are removed, because, even after unpickling, they do not exist in their respective lists.
While the application functions correctly before and after pickling/unpickling, the growing filesize is an issue as the process is meant to be long running, and reinitializing it is not an option.
My example is notional, and it might be a stretch to ask, but I'm wondering whether anyone has experience with garbage-collection issues when pickling objects that contain circular references, or with anything else that might point me in the right direction for debugging this. Suggestions for helpful tools would also be appreciated.
Many thanks
You might want to try objgraph; it can seriously aid you in tracking down memory leaks, circular references, and the pointer relationships between objects.
http://mg.pov.lt/objgraph/
I use it when debugging pickles (in my own pickling package called dill).
Also, certain objects will (down the pickle chain) pickle globals, which is often a cause of circular references within pickled objects.
I also have a suite of pickle debugging tools in dill. See dill.detect at https://github.com/uqfoundation, where there are several methods that can be used to diagnose objects you are trying to pickle. For instance, if you set dill.detect.trace(True), it will print out all the internal calls to pickle objects while your object is being dumped.
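As a rough illustration of how these tools could be pointed at the problem above (the masterlist and Address names come from the question; everything else is an assumption):

import objgraph
import dill

# Count how many Address objects are still alive; if this is larger than the
# lists suggest, something is still holding on to "removed" addresses.
addresses = objgraph.by_type('Address')
print(len(addresses))

# Pick one suspicious instance and draw what still refers to it
# (writes a PNG; requires graphviz to be installed).
objgraph.show_backrefs([addresses[-1]], max_depth=3,
                       filename='address_backrefs.png')

# Trace every internal pickling call while dumping the central object, to see
# which objects get dragged in along the reference chain.
dill.detect.trace(True)
dill.dumps(masterlist)   # masterlist: the central object from the question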
In Perl there is the idea of the tie operator, where writing to or modifying a variable can run arbitrary code (such as updating an underlying Berkeley DB file). I'm quite sure a similar concept of overloading exists in Python too.
I'm interested to know what the most idiomatic way is to basically consider a local JSON file as the canonical source of needed hierarchical information throughout the running of a python script, so that changes in a local dictionary are automatically reflected in the JSON file. I'll leave it to the OS to optimise writes and cache (I don't mind if the file is basically updated dozens of times throughout the running of the script), but ultimately this is just about a kilobyte of metadata that I'd like to keep around. It's not necessary to address concurrent access to this. I'd just like to be able to access a hierarchical structure (like nested dictionary) within the python process and have reads (and writes to) that structure automatically result in reads from (and changes to) a local JSON file.
Well, since Python itself has no signals and slots, I guess you could instead make your own dictionary class by inheriting from the built-in dict. The class would behave exactly like a Python dict, except that every method that can change the dict's values would also dump your JSON file.
You could also use something like PyQt4's QAbstractItemModel, which has signals. When its data-changed signal is emitted, do your dumping; that way the dump happens in only one place, which is nice.
I know these are somewhat crude approaches. If anyone knows a better way, please go ahead and share!
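As a rough sketch of the first suggestion (the class and file names are made up, and only the most common mutating methods are shown):

import json

class JSONBackedDict(dict):
    """Hypothetical dict that rewrites a JSON file on every mutation."""

    def __init__(self, path, *args, **kwargs):
        self._path = path
        super(JSONBackedDict, self).__init__(*args, **kwargs)
        self._dump()

    def _dump(self):
        with open(self._path, 'w') as f:
            json.dump(self, f, indent=2)

    def __setitem__(self, key, value):
        super(JSONBackedDict, self).__setitem__(key, value)
        self._dump()

    def __delitem__(self, key):
        super(JSONBackedDict, self).__delitem__(key)
        self._dump()

    def update(self, *args, **kwargs):
        super(JSONBackedDict, self).update(*args, **kwargs)
        self._dump()

meta = JSONBackedDict('metadata.json', version=1)
meta['source'] = 'survey'   # metadata.json is rewritten here

Note that this only catches mutations of the top-level dict; changing a nested list or dict in place would not trigger a dump, which is the gap the next answer tries to close by wrapping nested containers as well.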
This is a development of aspect_mkn8rd's answer taking into account Gerrat's comments, but it is too long for a true comment.
You will need two special container classes emulating a list and a dictionary. In both, you keep a reference to a top-level object and override the following methods:
__setitem__(self, key, value)
__delitem__(self, key)
__reversed__(self)
All these methods are called on modification and should cause the top-level object to be written to disk.
In addition, __setitem__(self, key, value) should check whether value is a list and wrap it in the special list class, or whether it is a dictionary and wrap it in the special dictionary class. In both cases, the method should set the top-level object on the new container. If the value is neither of these but defines __setitem__, it should raise an exception saying the object type is not supported. Of course, you would then have to modify the method to take any newly supported class into account.
Of course, there is a good deal of code to write and test, but it should work - left to the reader as an exercise :-)
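To make the exercise a bit more concrete, here is a minimal sketch of the dictionary half of the idea; the class names and the save() callback are my own assumptions, and the list counterpart would follow the same pattern:

import json

class TrackedDict(dict):
    """Hypothetical dict wrapper that notifies a top-level owner on every change."""

    def __init__(self, owner, *args, **kwargs):
        self._owner = owner              # the top-level object that knows how to save
        super(TrackedDict, self).__init__(*args, **kwargs)

    def _wrap(self, value):
        # Wrap nested dicts so their mutations are tracked too; a full version
        # would do the same for lists with a TrackedList class (omitted here).
        if isinstance(value, dict) and not isinstance(value, TrackedDict):
            return TrackedDict(self._owner, value)
        return value

    def __setitem__(self, key, value):
        super(TrackedDict, self).__setitem__(key, self._wrap(value))
        self._owner.save()

    def __delitem__(self, key):
        super(TrackedDict, self).__delitem__(key)
        self._owner.save()

class JSONStore(object):
    """Hypothetical top-level object that owns the data and writes the file."""

    def __init__(self, path):
        self.path = path
        self.data = TrackedDict(self)

    def save(self):
        with open(self.path, 'w') as f:
            json.dump(self.data, f, indent=2)

store = JSONStore('metadata.json')
store.data['house'] = {'price': 100000}   # nested dict gets wrapped on the way in
store.data['house']['price'] = 95000      # the nested write also triggers a save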
If concurrency is not required, maybe consider writing two functions to read and write the data to a shelve file? Or is the idea to have the dictionary 'aware' of changes, so that it updates the file without that kind of explicit call?
In Python 2.7 (or in programming languages in general), is it faster to create a new instance of a class/variable or to set an existing one to something new?
For example, which is faster for loading another_pic.png? This:
my_img = Image.open(cur_directory_path + '\\my_pic.png') # don't need this anymore
new_img = Image.open(cur_directory_path + '\\another_pic.png') # but need this new pic
or this:
my_img = Image.open(cur_directory_path + '\\my_pic.png') # don't need this anymore
my_img = Image.open(cur_directory_path + '\\another_pic.png') # but need this new pic
I ask because I have one Image variable which 'gets around', so to speak, in my code by constantly being rebound to various things, and I am wondering if this affects performance at all.
In both cases, you're creating two completely new objects at the exact same speed, so to that end I don't think either one is faster than the other. You're never really "resetting" an object; you're just reassigning a name. All that's happening is you're changing an existing pointer to a new memory location, which is a fraction of a fraction of a second.
The main difference is that with the bottom option, you have left an unused object for the garbage collector to pick up, but deallocating memory is not a very speed-intensive task. It's possible (depending on the number of free objects you have lying around) that it won't even happen before your program ends. But you're also using more memory by keeping two objects lying around. So if you're constantly importing new images, to the degree that it may impact your memory, it's probably best to be resetting the same pointer. Or you could even invoke the garbage collector manually if you're concerned about running out of memory, but it doesn't sound like you are.
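To make the 'reassigning a name' point concrete, here is a tiny sketch (plain lists stand in for the Image objects):

# Assignment only rebinds a name; it does not copy or "reset" the object itself.
a = [1, 2, 3]
print(id(a))        # identity of the first object

a = [4, 5, 6]       # rebinds the name 'a' to a brand-new object
print(id(a))        # a different object; the old list is now unreferenced garbage

import gc
gc.collect()        # optional: reclaim unreferenced objects immediately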
They're exactly the same. Both go through the process of importing the image. The variable assignment is only storing a reference to the object. The only difference is that the latter may begin garbage collecting the my_pic.png image sooner since there are no more references to the object.
Technically it is faster to reuse a variable, as long as it stores objects of the same type, than it is to constantly create a new one. This boils down to memory addressing: if you already have a variable (an address in memory is associated with it), it is easy to access that slot in memory and update the object located there. The reason the object types should be the same is how memory is allocated for classes and objects when they are created at run time. Creating a new variable to store an object is slower because the interpreter has to find enough free space in memory for the object and then assign that address to the variable; this involves accessing address lookup tables and, depending on the table configuration, adds time. The difference is so small, though, that in any normal application you shouldn't notice it.
A while ago I wrote a Markov chain text generator for IRC in Python. It would consume all of my VPS's free memory after running for a month or two, and I would need to purge its data and start over. Now I'm rewriting it and I want to tackle the memory issue as elegantly as possible.
The data I have to keep trimmed down is generally a dictionary that maps strings to lists of strings. More specifically, each word in a message is mapped to all the possible subsequent words. This is still an oversimplification, but it's sufficient for contextualizing my problem.
Currently, the solution I'm wrestling with involves managing "buckets" of data. It would keep track of each bucket's apparent size, "archive" a bucket once it's reached a certain size and move on to a new one, and after 5 or so buckets it would delete the oldest "archived" bucket every time a new one is created. This has the advantage of simplicity: removing an entire bucket doesn't create any dead-ends or unreachable words because the words from each message all go into the same bucket.
The problem is that "keeping track of each bucket's apparent size" is easier said than done.
I first tried using sys.getsizeof, but quickly found that it's impractical for determining the object's actual size in memory. I've also looked into guppy / heapy / various other memory usage modules, but none of them seem to do what I'm looking for (i.e. benchmark a single object). Currently I'm experimenting with the lower-level psutil module. Here's an excerpt from the current state of the application:
import copy
import os

import psutil

class Markov(object):
    # (constants such as EMPTY_BUCKET, MAX_MEMORY and MAX_TOTAL_MEMORY
    # are declared here)

    def __init__(self):
        self.proc = psutil.Process(os.getpid())
        self.buckets = []
        self._newbucket()

    def _newbucket(self):
        self.buckets.append(copy.deepcopy(self.EMPTY_BUCKET))

    def _checkmemory(f):
        def checkmemory(self, *args, **kwargs):
            # Check memory usage of the process and the entire system
            if (self.proc.get_memory_percent() > self.MAX_MEMORY
                    or psutil.virtual_memory().percent > self.MAX_TOTAL_MEMORY):
                self.buckets.pop(0)
                # If we just removed the last bucket, add a new one
                if not self.buckets:
                    self._newbucket()
            return f(self, *args, **kwargs)
        return checkmemory

    @_checkmemory
    def process(self, msg):
        # generally, this adds the words in msg to self.buckets[-1]
        pass

    @_checkmemory
    def generate(self, keywords):
        # generally, this uses the words in all the buckets to create a sentence
        pass
The problem here is that this will only expire buckets; I have no idea when to "archive" the current bucket because Python's overhead memory prevents me from accurately determining how far I am from hitting self.MAX_MEMORY. Not to mention that the Markov class is actually one of many "plugins" being managed by a headless IRC client (another detail I omitted for brevity's sake), so the overhead is not only present, but unpredictable.
In short: is there a way to accurately benchmark single Python objects? Alternatively, if you can think of a better way to 'expire' old data than my bucket-based solution, I'm all ears.
This might be a bit of a hacky solution, but if your bucket objects are pickleable (and it sounds like they are), you could pickle them and measure the byte-length of the pickled object string. It may not be exactly the size of the unpacked object in memory, but it should grow linearly as the object grows and give you a fairly good idea of relative size between objects.
To prevent having to pickle really large objects, you can measure the size of each entry added to the bucket by pickling it on its own, and adding its bytelength to the bucket's total bytelength attribute.
Bear in mind, though, that if you do this there will be some overhead memory used in the internal bindings of the entry and the bucket that will not be reflected by the independent size of the entry itself, but you can run some tests to profile this and figure out what the %memory overhead is going to be for each new entry beyond its actual size.
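A minimal sketch of that bookkeeping might look like this (the Bucket class and the size threshold are made up for illustration):

import pickle

class Bucket(object):
    """Hypothetical bucket that tracks an approximate size via pickled byte length."""

    def __init__(self, max_bytes=5 * 1024 * 1024):
        self.words = {}          # word -> list of possible next words
        self.approx_bytes = 0
        self.max_bytes = max_bytes

    def add(self, word, next_word):
        # Measure only the new entry, not the whole bucket, to keep it cheap.
        self.approx_bytes += len(pickle.dumps((word, next_word), -1))
        self.words.setdefault(word, []).append(next_word)

    def full(self):
        return self.approx_bytes >= self.max_bytes

The number is only a proxy for real memory use (it ignores per-object and dict overhead), but it grows roughly linearly with the data and gives a stable threshold for deciding when to archive the current bucket.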
I have a lot of objects which form a network by keeping references to other objects. All objects (nodes) have a dict holding their properties.
Now I'm looking for a fast way to store these objects (in a file?) and reload all of them into memory later (I don't need random access). The data is about 300MB in memory which takes 40s to load from my SQL format, but I now want to cache it to have faster access.
Which method would you suggest?
(My pickle attempt failed with recursion errors despite trying to mess around with __getstate__ :( Maybe there is something fast anyway? :))
Pickle would be my first choice. But since you say that it didn't work, you might want to try shelve, even though that's not shelve's primary purpose.
Really, you should be using pickle for this. Perhaps you could post some code so that we can take a look and figure out why it doesn't work.
"The pickle module keeps track of the objects it has already serialized, so that later references to the same object won’t be serialized again." So it IS possible. Perhaps increase the recursion limit with sys.setrecursionlimit.
Hitting Maximum Recursion Depth Using Python's Pickle / cPickle
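If raising the recursion limit is enough, a hedged sketch could look like this; the limit value is arbitrary and graph is a stand-in name for the in-memory network of nodes:

import pickle
import sys

# The default limit is about 1000; raising it too far risks a C stack overflow,
# so increase it in moderation.
sys.setrecursionlimit(50000)

# Dump with the highest (binary) protocol, which is both faster and smaller.
with open('nodes.cache', 'wb') as f:
    pickle.dump(graph, f, pickle.HIGHEST_PROTOCOL)   # graph: hypothetical name

# Later: reload the whole network in one go (no random access needed).
with open('nodes.cache', 'rb') as f:
    graph = pickle.load(f)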
Perhaps you could set up some layer of indirection where the objects are actually held within, say, another dictionary. An object referencing another object would store the key of the object being referenced and access it through that dictionary. If the object for the stored key is not in the dictionary, it would be loaded into the dictionary from your SQL database; when it no longer seems to be needed, the object can be removed from the dictionary and from memory (possibly with an update to its state in the database before the in-memory version is dropped).
This way you don't have to load all the data from your database at once, and can keep a number of the objects cached in memory for quicker access to those. The downside would be the additional overhead required for each access to the main dict.
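A very small sketch of that indirection, with a made-up NodeStore class and load/save callables standing in for the real SQL access:

class NodeStore(object):
    """Hypothetical registry: nodes refer to each other by key, and the store
    loads or evicts the actual objects on demand."""

    def __init__(self, load_from_sql):
        self._cache = {}               # key -> node object currently in memory
        self._load = load_from_sql     # callable: key -> node (hits the database)

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._load(key)
        return self._cache[key]

    def evict(self, key, save_to_sql=None):
        node = self._cache.pop(key, None)
        if node is not None and save_to_sql is not None:
            save_to_sql(key, node)     # persist state before dropping it from memory

class Node(object):
    def __init__(self, store, properties, neighbour_keys):
        self.store = store
        self.properties = properties            # the per-node dict of properties
        self.neighbour_keys = neighbour_keys    # keys of related nodes, not references

    def neighbours(self):
        # Resolve keys to objects only when they are actually needed.
        return [self.store.get(k) for k in self.neighbour_keys]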