I have a lot of objects which form a network by keeping references to other objects. All objects (nodes) have a dict holding their properties.
Now I'm looking for a fast way to store these objects (in a file?) and reload all of them into memory later (I don't need random access). The data is about 300MB in memory and takes 40s to load from my SQL storage, but I now want to cache it for faster access.
Which method would you suggest?
(my pickle attempt failed due to recursion errors despite trying to mess around with __getstate__ :( maybe there is something fast anyway? :))
Pickle would be my first choice. But since you say that it didn't work, you might want to try shelve, even though that's not shelve's primary purpose.
Really, you should be using Pickle for this. Perhaps you could post some code so that we can take a look and figure out why it doesn't work.
"The pickle module keeps track of the objects it has already serialized, so that later references to the same object won’t be serialized again." So it IS possible. Perhaps increase the recursion limit with sys.setrecursionlimit.
Hitting Maximum Recursion Depth Using Python's Pickle / cPickle
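A minimal sketch of that suggestion (the limit value is a guess; tune it to the depth of your graph):

import pickle
import sys

# Build a tiny self-referential graph just to have something to dump.
a = {'name': 'a', 'links': []}
b = {'name': 'b', 'links': [a]}
a['links'].append(b)                      # circular reference: a <-> b

# Deeply nested graphs can exceed the default recursion limit (usually 1000)
# inside pickle, so raise it before dumping.
sys.setrecursionlimit(10000)

with open('graph.pkl', 'wb') as f:
    pickle.dump([a, b], f, protocol=pickle.HIGHEST_PROTOCOL)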
Perhaps you could set up some layer of indirection where the objects are actually held within, say, another dictionary, and an object referencing another object will store the key of the object being referenced and then access the object through the dictionary. If the object for the stored key is not in the dictionary, it will be loaded into the dictionary from your SQL database, and when it doesn't seem to be needed anymore, the object can be removed from the dictionary/memory (possibly with an update to its state in the database before the version in memory is removed).
This way you don't have to load all the data from your database at once, and can keep a number of the objects cached in memory for quicker access to those. The downside would be the additional overhead required for each access to the main dict.
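A rough sketch of that indirection layer, purely illustrative: NodeRegistry, the loader and saver callables are made-up names, and the lambda stands in for your own SQL access code.

class NodeRegistry(object):
    """Holds nodes by key, loading them lazily and evicting idle ones."""
    def __init__(self, loader, saver=None):
        self._loader = loader          # callable: key -> node, backed by your SQL database
        self._saver = saver            # optional callable: (key, node), writes state back
        self._cache = {}

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._loader(key)   # load from SQL on first access
        return self._cache[key]

    def evict(self, key):
        node = self._cache.pop(key, None)
        if node is not None and self._saver is not None:
            self._saver(key, node)     # persist before dropping it from memory


class Node(object):
    """Nodes store keys of their neighbours instead of direct references."""
    def __init__(self, registry, properties, neighbour_keys):
        self.registry = registry
        self.properties = properties
        self.neighbour_keys = neighbour_keys

    def neighbours(self):
        return [self.registry.get(k) for k in self.neighbour_keys]


registry = NodeRegistry(loader=lambda key: {'id': key})   # stand-in for a real SQL lookup
node = Node(registry, {'name': 'n1'}, neighbour_keys=['n2', 'n3'])
print(node.neighbours())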
In my project, I periodically use pickling to represent the internal state of the process for persistence. As a part of normal operation, references to objects are added to and removed from multiple other objects.
For example Person might have an attribute called address_list (a list) that contains the Address objects representing all the properties they are trying to sell. Another object, RealEstateAgent, might have an attribute called addresses_for_sale (also a list) which contains the same type of Address objects, but only those ones that are listed at their agency.
If a seller takes their property off the market, or it is sold, the Address is removed from both lists.
Both Person and RealEstateAgent objects are held in lists on a central Masterlist object, which is what gets pickled. My problem is that as I add and remove properties and pickle the Masterlist object repeatedly over time, the size of the pickle file grows, even when I have removed (del, actually) more properties than I have added. I realize that, in pickling Masterlist, there are circular references; there are many circular references in my application.
I examined the pickle file using pickletools.dis(), and while it's hard to human-read, I see references to Addresses that have been removed. I am sure they are removed, because, even after unpickling, they do not exist in their respective lists.
While the application functions correctly before and after pickling/unpickling, the growing filesize is an issue as the process is meant to be long running, and reinitializing it is not an option.
My example is notional, and it might be a stretch to ask, but I'm wondering if anyone has experience with garbage-collection issues when pickling objects that contain circular references, or anything else that might point me in the right direction for debugging this. Maybe some tools that would be helpful.
Many thanks
You might want to try objgraph… it can seriously aid you in tracking down memory leaks and circular references and pointer relationships between objects.
http://mg.pov.lt/objgraph/
I use it when debugging pickles (in my own pickling package called dill).
Also, certain pickled objects will (down the pickle chain) pickle globals, which is often a cause of circular references within pickled objects.
I also have a suite of pickle debugging tools in dill. See dill.detect at https://github.com/uqfoundation, where there are several methods that can be used to diagnose objects you are trying to pickle. For instance, if you set dill.detect.trace(True), it will print out all the internal calls to pickle objects while your object is being dumped.
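For what it's worth, a small sketch of how those two tools might be pointed at the problem; Address and Masterlist are the names from the question, used here as stand-ins:

import dill
import objgraph

objgraph.show_most_common_types(limit=10)  # quick census of what is alive right now

# Find Address instances that should have been garbage collected, and draw
# what is still holding on to them (writes a Graphviz-rendered image).
leaked = objgraph.by_type('Address')
if leaked:
    objgraph.show_backrefs(leaked[:3], max_depth=4, filename='address_backrefs.png')

# Trace every internal pickling call while the master list is dumped.
dill.detect.trace(True)
# dill.dumps(masterlist)   # masterlist = your Masterlist instance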
The Python docs mention this word ("picklable") a lot and I want to know what it means.
It simply means it can be serialized by the pickle module. For a basic explanation of this, see What can be pickled and unpickled?. Pickling Class Instances provides more details, and shows how classes can customize the process.
Things that are usually not picklable are, for example, sockets, file handles, database connections, and so on. Everything that's built up (recursively) from basic Python types (dicts, lists, primitives, objects, object references, even circular ones) can be pickled by default.
You can implement custom pickling code that will, for example, store the configuration of a database connection and restore it afterwards, but you will need special, custom logic for this.
All of this makes pickling a lot more powerful than XML, JSON and YAML (but definitely not as readable).
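For instance, a sketch of the custom logic described above using __getstate__/__setstate__; the Catalog class and the sqlite3 connection are just illustrative stand-ins for "a database connection":

import pickle
import sqlite3

class Catalog(object):
    """Illustrative only: the live connection is dropped on dump and rebuilt on load."""
    def __init__(self, db_path):
        self.db_path = db_path                      # the 'configuration' we keep
        self.conn = sqlite3.connect(db_path)        # the unpicklable handle

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['conn']                           # strip the database connection
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.conn = sqlite3.connect(self.db_path)   # restore it afterwards

clone = pickle.loads(pickle.dumps(Catalog(':memory:')))
print(clone.conn)                                   # a fresh, working connection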
These are all great answers, but for anyone who's new to programming and still confused here's the simple answer:
Pickling an object is making it so you can store it as it currently is, long term (often to hard disk). A bit like saving in a video game.
So anything that's actively changing (like a live connection to a database) can't be stored directly (though you could probably figure out a way to store the information needed to create a new connection, and that you could pickle)
Bonus definition: Serializing is packaging it in a form that can be handed off to another program. Unserializing it is unpacking something you got sent so that you can use it
Pickling is the process in which objects in Python are converted into a simple binary representation that can be written to a file and stored. This is done to persist Python objects and is also called serialization. You can infer from this what de-serialization or unpickling means.
So when we say an object is picklable it means that the object can be serialized using the pickle module of python.
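A tiny illustration of both halves; the dict and the socket are arbitrary examples:

import pickle
import socket

data = {'id': 7, 'tags': ['a', 'b'], 'nested': {'ok': True}}
blob = pickle.dumps(data)                # serialize to a byte string
assert pickle.loads(blob) == data        # deserialize and get an equal object back

try:
    pickle.dumps(socket.socket())        # OS-level state cannot be serialized
except TypeError as e:
    print('not picklable:', e)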
I am storing a table using Python and I need persistence.
Essentially I am storing the table as a dictionary mapping strings to numbers, and the whole thing is stored with shelve:
self.DB = shelve.open("%s%sMoleculeLibrary.shelve" % (directory, os.sep), writeback=True)
I set writeback to True as I found the system tends to be unstable if I don't.
After the computations the system needs to close the database, and store it back. Now the database (the table) is about 540MB, and it is taking ages. The time exploded after the table grew to about 500MB. But I need a much bigger table. In fact I need two of them.
I am probably using the wrong form of persistence. What can I do to improve performance?
For storing a large dictionary of string : number key-value pairs, I'd suggest a JSON-native storage solution such as MongoDB. It has a wonderful API for Python, PyMongo. MongoDB itself is lightweight and incredibly fast, and JSON objects will natively be dictionaries in Python. This means that you can use your string key as the object ID, allowing for compressed storage and quick lookup.
As an example of how easy the code would be, see the following:
from pymongo import MongoClient

d = {'string1': 1, 'string2': 2, 'string3': 3}

client = MongoClient()                   # Connection was removed in PyMongo 3; use MongoClient
db = client['example-database']
collection = db['example-collection']

for string, num in d.items():
    # upsert keyed on the string, mirroring the old collection.save() behaviour
    collection.replace_one({'_id': string}, {'_id': string, 'value': num}, upsert=True)

# testing
newD = {}
for obj in collection.find():
    newD[obj['_id']] = obj['value']

print(newD)
# output is something like: {'string1': 1, 'string2': 2, 'string3': 3}
The keys come back as ordinary strings under Python 3 (under Python 2 they are unicode, but converting back is trivial).
Based on my experience, I would recommend using SQLite3, which comes with Python. It works well with larger databases and key counts. Millions of keys and gigabytes of data are not a problem. Shelve is totally wasted at that point. Also, having a separate db process isn't beneficial; it just requires more context switches. In my tests I found that SQLite3 was the preferred option when handling larger data sets locally. Running a local database engine like mongo, mysql or postgresql doesn't provide any additional value, and they were also slower.
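A minimal sketch of that approach for a string-to-number table (the table and file names are made up):

import sqlite3

conn = sqlite3.connect('molecules.db')   # single local file, no separate server process
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value REAL)')

def put_many(pairs):
    with conn:                           # one transaction per batch of inserts
        conn.executemany('INSERT OR REPLACE INTO kv VALUES (?, ?)', pairs)

def get(key):
    row = conn.execute('SELECT value FROM kv WHERE key = ?', (key,)).fetchone()
    return row[0] if row else None

put_many([('H2O', 18.015), ('CO2', 44.01)])
print(get('H2O'))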
I think your problem is due to the fact that you use writeback=True. The documentation says (emphasis is mine):
Because of Python semantics, a shelf cannot know when a mutable persistent-dictionary entry is modified. By default modified objects are written only when assigned to the shelf (see Example). If the optional writeback parameter is set to True, all entries accessed are also cached in memory, and written back on sync() and close(); this can make it handier to mutate mutable entries in the persistent dictionary, but, if many entries are accessed, it can consume vast amounts of memory for the cache, and it can make the close operation very slow since all accessed entries are written back (there is no way to determine which accessed entries are mutable, nor which ones were actually mutated).
You could avoid writeback=True and write entries explicitly by reassigning them to the shelf (you have to pay attention that in-place modifications to mutable entries will otherwise be lost).
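A small sketch of that pattern, with writeback left at its default of False (the file and key names are illustrative):

import shelve

db = shelve.open('MoleculeLibrary.shelve')   # writeback left at its default (False)

entry = db.get('H2O', {})                    # in-place edits to this dict are NOT tracked...
entry['weight'] = 18.015
db['H2O'] = entry                            # ...so reassign explicitly to persist the change

db.close()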
If you believe this is not the right storage option (it's difficult to say without knowing how the data is structured), I suggest sqlite3; it's integrated in Python (thus very portable) and has very good performance, though it's somewhat more complicated than a simple key-value store.
See other answers for alternatives.
How much larger? What are the access patterns? What kinds of computation do you need to do on it?
Keep in mind that you are going to have some performance limits if you can't keep the table in memory no matter how you do it.
You may want to look at going to SQLAlchemy, or directly using something like bsddb, but both of those will sacrifice simplicity of code. However, with SQL you may be able to offload some of the work to the database layer depending on the workload.
I have a module which supports creation of geographic objects using a company-standard interface. After these objects are created, the update_db() method is called, and all objects are updated into a database.
It is important to have all objects inserted in one session, in order to keep counters and statistics before updating a production database.
The problem is that sometimes there are just too many objects, and the memory gets full.
Is there a way to create a cached list in Python, in order to handle lists that do not fit into memory?
My general thought was:
class CachedList(object):
    def __init__(self, max_memory_size, directory):
        ...
    def get_item(self, index):
        ...
    def set_item(self, index, item):
        ...
    def del_item(self, index):
        ...
    def append(self, item):
        ...
An ordinary list would be created upon initialization. When the list's size exceeds max_memory_size, the list elements are pickled and stored in a file in directory. get_item(), set_item() and del_item() would handle the data stored in memory, or 'swap' it in from disk to access it.
Is this a good design? Are there any standard alternatives?
How can I force garbage collection after pickling parts of the list?
Thanks,
Adam
Use shelve. Your keys are the indices to your list.
I think your first question is answered. On the second, forcing GC: use gc.collect. http://docs.python.org/library/gc.html.
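A barebones sketch of that idea (the class and file names are invented; shelve keys must be strings, so the list index is str()-ed):

import gc
import shelve

class ShelvedList(object):
    """List-like access backed by shelve; the keys are the indices."""
    def __init__(self, path):
        self._db = shelve.open(path)

    def append(self, item):
        self._db[str(len(self._db))] = item      # shelve keys must be strings

    def __getitem__(self, index):
        return self._db[str(index)]

    def __setitem__(self, index, item):
        self._db[str(index)] = item

    def __len__(self):
        return len(self._db)

    def flush(self):
        self._db.sync()
        gc.collect()                             # force collection of dropped in-memory objects

items = ShelvedList('cached_list.shelve')
items.append({'name': 'first'})
print(items[0], len(items))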
I want to process a Python dict object in batches between two requests. I was wondering what's the best way to do it.
I want to do that because my dict is big and I couldn't do the whole processing in 30s.
thanks
You can serialize your object (perhaps with pickle, though there may be more efficient and specific ways if your object's nature is well-constrained) and save the serialized byte string to the datastore and to memcache (I don't recommend using just memcache, because it just might occasionally happen that the cache is "flushed" between the two requests -- in that case, you definitely want to be able to fetch your serialized byte string from the datastore!).
memcache will do the pickling for you, if you pass the original object -- but, since you need the serialized string anyway to put it in the datastore, I think it's better to do your own explicit serialization. Once you memcache.add a string, the fact that the latter gets pickled (and later unpickled on retrieval) is not a big deal -- the overhead of time and space is really quite modest.
There are limits to this approach -- you can't memcache more than 1MB per key, for example, so if your object's truly huge you need to split up the serialized bytestring onto multiple keys (and for more than a few such megabyte-slices, things get very unwieldy).
Also, of course, the first and the second request must "agree" on a key to use for the serialized data's storage and retrieval -- i.e. there must be a simple way to get that key without confusion (for example, it might be based on the name of the current user).
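As a rough illustration only, using the old-style App Engine APIs: the StateChunk model, the key scheme and the chunk size are all invented for this sketch, and (as above) the two requests must agree on the key and on the chunk count.

import pickle

from google.appengine.api import memcache
from google.appengine.ext import ndb

CHUNK = 1000000                          # stay under the ~1MB-per-key/entity limits

class StateChunk(ndb.Model):             # invented model: one slice of the pickled blob
    data = ndb.BlobProperty()

def save_state(key, obj):
    blob = pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)
    chunks = [blob[i:i + CHUNK] for i in range(0, len(blob), CHUNK)]
    memcache.set_multi({'%s:%d' % (key, i): c for i, c in enumerate(chunks)})
    ndb.put_multi([StateChunk(id='%s:%d' % (key, i), data=c)
                   for i, c in enumerate(chunks)])   # authoritative copy in the datastore
    return len(chunks)

def load_state(key, n_chunks):
    names = ['%s:%d' % (key, i) for i in range(n_chunks)]
    got = memcache.get_multi(names)
    if len(got) != n_chunks:             # cache was flushed: fall back to the datastore
        entities = ndb.get_multi([ndb.Key(StateChunk, name) for name in names])
        got = dict(zip(names, (e.data for e in entities)))
    return pickle.loads(b''.join(got[name] for name in names))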