I want to process a Python dict object in batches across two requests, and I was wondering what's the best way to do it.
I need to do this because my dict is big and I couldn't finish the whole processing within the 30-second request limit.
Thanks
You can serialize your object (perhaps with pickle, though there may be more efficient and specific ways if your object's nature is well-constrained) and save the serialized byte string to both the datastore and memcache. I don't recommend using just memcache, because it may occasionally happen that the cache is flushed between the two requests -- in that case, you definitely want to be able to fetch your serialized byte string from the datastore!
memcache will do the pickling for you if you pass the original object -- but, since you need the serialized string anyway to put it in the datastore, I think it's better to do your own explicit serialization. Once you memcache.add a string, the fact that the latter gets pickled (and later unpickled on retrieval) is not a big deal -- the overhead of time and space is really quite modest.
There are limits to this approach -- you can't memcache more than 1MB per key, for example, so if your object's truly huge you need to split up the serialized bytestring onto multiple keys (and for more than a few such megabyte-slices, things get very unwieldy).
Also, of course, the first and the second request must "agree" on a key to use for the serialized data's storage and retrieval -- i.e. there must be a simple way to get that key without confusion (for example, it might be based on the name of the current user).
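For concreteness, here is a minimal sketch of that approach on the classic App Engine Python runtime; the BatchState model, the key-naming scheme, and the helper names are illustrative assumptions, not a prescribed API:
import pickle
from google.appengine.api import memcache
from google.appengine.ext import db

class BatchState(db.Model):
    payload = db.BlobProperty()   # the pickled dict

def save_state(key_name, my_dict):
    blob = pickle.dumps(my_dict, pickle.HIGHEST_PROTOCOL)
    BatchState(key_name=key_name, payload=db.Blob(blob)).put()   # durable copy in the datastore
    memcache.set(key_name, blob)                                 # fast copy in memcache

def load_state(key_name):
    blob = memcache.get(key_name)
    if blob is None:                                             # cache may have been flushed
        blob = BatchState.get_by_key_name(key_name).payload
    return pickle.loads(blob)
The second request only needs to call load_state with the same agreed-upon key name.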
I wrote a Python script that loads a user/artist/playcount dataset and predicts which artists I might like. However, the database (a .tsv file I downloaded) is big, so it takes time to read it and store the information I want in a dictionary. How can I optimize this? Is there a way to preserve the loaded database so each time I want to make predictions I don't have to load it again?
Thank you very much.
You could store and load your dictionary using the shelve module. This is likely to yield a benefit if the processing time to create the dictionary is large relative to the amount of time it takes to load it into memory - that is, if your algorithm is complicated or your dictionary is small.
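For example, a minimal sketch (the file name 'artists.shelf' and the parsed_dict variable are placeholders for your own data):
import shelve

# One-time build: copy the parsed dictionary into a shelf on disk.
db = shelve.open('artists.shelf')
for artist, playcount in parsed_dict.items():
    db[artist] = playcount
db.close()

# Later runs: open the shelf and look entries up lazily,
# instead of re-reading and re-parsing the whole TSV file.
db = shelve.open('artists.shelf')
plays = db['Some Artist']
db.close()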
If your dictionary is still going to be large, one trick you could use is to store file pointer offsets as the dictionary values. That is, when you want a dictionary value to be some information about a song (for example), instead of storing the information itself in the dictionary, store the byte offset in the TSV file where the corresponding line starts. Then, when you want to access that information, open the TSV file, seek to the offset, read a line, and parse it to construct the object representing that song. Seeks are fast, or at least much faster than reading through the whole file. Alternatively, you could use the mmap module to memory-map the file and effectively treat it as an array of bytes, which is especially useful if you know how many bytes you'll need (or at least have a reasonably low upper bound).
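A rough sketch of the offset-index idea (the file name and the assumption that the first tab-separated column is the key are mine, not from your data):
offsets = {}
with open('dataset.tsv', 'rb') as f:
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        key = line.split(b'\t', 1)[0]
        offsets[key] = pos                 # byte offset where this record starts

def lookup(key):
    with open('dataset.tsv', 'rb') as f:
        f.seek(offsets[key])               # jump straight to the record
        return f.readline().rstrip(b'\r\n').split(b'\t')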
If you want to maintain compatibility with other systems written in other programming languages, or if you just want something human-readable, you could store your dictionary as JSON instead, using the json module. I would recommend this only if your dictionary is not too large.
Another solution you could try is just storing the information from your dictionary in a database in the first place. Databases are organized in a way that makes accessing them fast. Python's standard library includes the sqlite3 module that you can use to access an SQLite database. This should be fine. But if you already have a database server running, or you have special needs that make using a separate database server advantageous (like multiple processes accessing the database simultaneously), you can use SQLAlchemy to store and load data in any SQL database.
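As a sketch of the SQLite route (the table layout and the rows iterable of (user, artist, plays) tuples are assumptions about your data):
import sqlite3

conn = sqlite3.connect('playcounts.db')
conn.execute('CREATE TABLE IF NOT EXISTS playcounts '
             '(user TEXT, artist TEXT, plays INTEGER, '
             'PRIMARY KEY (user, artist))')
conn.executemany('INSERT OR REPLACE INTO playcounts VALUES (?, ?, ?)', rows)
conn.commit()

# Later: fetch only what a single prediction needs, instead of the whole file.
cur = conn.execute('SELECT artist, plays FROM playcounts WHERE user = ?',
                   ('some_user',))
user_profile = dict(cur.fetchall())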
For completeness I would also mention the pickle module, which can be used to store pretty much any Python object, but I don't think you need to use it directly. There are more streamlined ways to store and load dictionary-type data.
Each of my mappers needs access to a very large dictionary. Is there some way I can avoid the overhead of each mapper opening its own copy, and instead have all of them point to one global shared object?
Any suggestions specific to DISCO or to the MapReduce paradigm would be helpful.
Use Redis key-value store
It can be installed quickly on Linux, and compiled Windows versions are also available.
The Python redis package will then allow you to write, read and update values very easily.
The hash data type is what will serve you best: you can add or edit values under so-called fields (keys, in Python dictionary terminology), and it is both very fast and very straightforward.
This solution works even for independent processes. You can even share data in Redis over the network, so for a map/reduce scenario this can be a great option.
The only thing you have to take care of when storing and restoring values is that the values can only be strings, so you have to serialize and deserialize them; json.dumps and json.loads work very well for this.
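A minimal sketch with the redis package (the hash name 'shared_dict' and the example value are just illustrations):
import json
import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)

# Every mapper process talks to the same Redis hash.
r.hset('shared_dict', 'some_key', json.dumps({'count': 3, 'tags': ['a', 'b']}))

raw = r.hget('shared_dict', 'some_key')     # values come back as strings/bytes
value = json.loads(raw)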
I doubt this is even possible, but here is the problem and proposed solution (the feasibility of the proposed solution is the object of this question):
I have some "global data" that needs to be available for all requests. I'm persisting this data to Riak and using Redis as a caching layer for access speed (for now...). The data is split into about 30 logical chunks, each about 8 KB.
Each request is required to read 4 of these 8KB chunks, resulting in 32KB of data read in from Redis or Riak. This is in ADDITION to any request-specific data which would also need to be read (which is quite a bit).
Assuming even 3000 requests per second (this isn't a live server so I don't have real numbers, but 3000 per second is a reasonable assumption, could be more), this means roughly 96MB/s of transfer from Redis or Riak in ADDITION to the already not-insignificant other calls being made from the application logic. Also, Python is parsing the JSON of these 8KB objects thousands of times every second.
All of this - especially Python having to repeatedly deserialize the data - seems like an utter waste, and a perfectly elegant solution would be to just have the deserialized data cached in an in-memory native object in Python, which I can refresh periodically as and when all this "static" data becomes stale. Once in a few minutes (or hours), instead of 3000 times per second.
But I don't know if this is even possible. You'd realistically need an "always running" application for it to cache any data in its memory. And I know this is not the case in the nginx+uwsgi+python combination (versus something like node) - python in-memory data will NOT be persisted across all requests to my knowledge, unless I'm terribly mistaken.
Unfortunately this is a system I have "inherited" and therefore can't make too many changes in terms of the base technology, nor am I knowledgeable enough of how the nginx+uwsgi+python combination works in terms of starting up Python processes and persisting Python in-memory data - which means I COULD be terribly mistaken with my assumption above!
So, direct advice on whether this solution would work, plus references to material that could help me understand how the nginx+uwsgi+python combination works in terms of starting new processes and memory allocation, would help greatly.
P.S.:
I have gone through some of the documentation for nginx, uwsgi, etc., but haven't fully understood the ramifications for my use case yet. I hope to make some progress on that going forward.
If the in-memory thing COULD work out, I would chuck Redis, since the only thing I'm caching in it is the static data I mentioned above. This makes an in-process persistent in-memory Python cache even more attractive for me, reducing one moving part in the system and at least FOUR network round-trips per request.
What you're suggesting isn't directly feasible. Since new processes can be spun up and down outside of your control, there's no way to keep native Python data in memory.
However, there are a few ways around this.
Often, one level of key-value storage is all you need. And sometimes, having fixed-size buffers for values (which you can use directly as str/bytes/bytearray objects; anything else you need to pack in there with struct or otherwise serialize) is all you need. In that case, uWSGI's built-in caching framework will take care of everything you need.
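For illustration, a hedged sketch of using the uWSGI caching framework from Python; it assumes a cache has been configured in the .ini file, and fetch_chunk_from_riak is a hypothetical loader of yours, not a real API:
import json
import uwsgi   # only available when running under uWSGI

def get_chunk(chunk_id):
    raw = uwsgi.cache_get(chunk_id)            # shared across all workers
    if raw is None:
        raw = fetch_chunk_from_riak(chunk_id)  # hypothetical fallback loader
        uwsgi.cache_set(chunk_id, raw)
    return json.loads(raw)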
If you need more precise control, you can look at how the cache is implemented on top of SharedArea and do something custom. However, I wouldn't recommend that. It basically gives you the same kind of API you get with a file, and the only real advantages over just using a file are that the server will manage the file's lifetime; it works in all uWSGI-supported languages, even those that don't allow files; and it makes it easier to migrate your custom cache to a distributed (multi-computer) cache if you later need to. I don't think any of those are relevant to you.
Another way to get flat key-value storage, but without the fixed-size buffers, is with Python's stdlib anydbm. The key-value lookup is as pythonic as it gets: it looks just like a dict, except that it's backed up to an on-disk BDB (or similar) database, cached as appropriate in memory, instead of being stored in an in-memory hash table.
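A tiny sketch (anydbm was renamed to dbm in Python 3; keys and values must be strings, and the file name is arbitrary):
import anydbm

db = anydbm.open('chunks.db', 'c')     # 'c' creates the file if it doesn't exist
db['chunk:4'] = '{"some": "json"}'     # store an already-serialized chunk
raw = db['chunk:4']                    # cheap lookup, backed by the on-disk DB
db.close()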
If you need to handle a few other simple types—anything that's blazingly fast to un/pickle, like ints—you may want to consider shelve.
If your structure is rigid enough, you can use a key-value database for the top level, but access the values through a ctypes.Structure, or de/serialize them with struct. But usually, if you can do that, you can also eliminate the top level, at which point your whole thing is just one big Structure or Array.
At that point, you can just use a plain file for storage—either mmap it (for ctypes), or just open and read it (for struct).
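A sketch of that mmap-plus-ctypes idea; the Record layout and the file name are invented for the example, not taken from your data:
import ctypes
import mmap

class Record(ctypes.Structure):
    _fields_ = [('user_id', ctypes.c_uint32),
                ('score', ctypes.c_double)]

f = open('records.bin', 'r+b')
mm = mmap.mmap(f.fileno(), 0)

# Overlay an array of Records directly onto the mapped bytes: no parsing step.
n = len(mm) // ctypes.sizeof(Record)
records = (Record * n).from_buffer(mm)
print(records[0].score)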
Or use multiprocessing's Shared ctypes Objects to access your Structure directly out of a shared memory area.
Meanwhile, if you don't actually need all of the cache data all the time, just bits and pieces every once in a while, that's exactly what databases are for. Again, anydbm, etc. may be all you need, but if you've got complex structure, draw up an ER diagram, turn it into a set of tables, and use something like MySQL.
"python in-memory data will NOT be persisted across all requests to my knowledge, unless I'm terribly mistaken."
You are mistaken.
The whole point of using uWSGI over, say, the CGI mechanism is to persist data across threads and save the overhead of initialization for each call. You must set processes = 1 in your .ini file, or, depending on how uWSGI is configured, it might launch more than one worker process on your behalf. Log the env and look for 'wsgi.multiprocess': False and 'wsgi.multithread': True, and all uwsgi.core threads for the single worker should show the same data.
You can also see how many worker processes, and "core" threads under each, you have by using the built-in stats server.
That's why uWSGI provides lock and unlock functions for manipulating data stores from multiple threads.
You can easily test this by adding a /status route in your app that just dumps a JSON representation of your global data object, and view it every so often after actions that update the store.
You said nothing about writing this data back; is it static? In that case, the solution is very simple, and I have no clue what is up with all the "it's not feasible" responses.
uWSGI workers are always-running applications, so data absolutely gets persisted between requests. All you need to do is store stuff in a global variable; that's it. And remember it's per-worker, and workers do restart from time to time, so you need proper loading/invalidation strategies.
If the data is updated very rarely (rarely enough to restart the server when it does), you can save even more. Just create the objects during app construction. This way, they will be created exactly once, and then all the workers will fork off the master and reuse the same data. Of course, it's copy-on-write, so if you update it, you will lose the memory benefits (the same thing will happen if Python decides to compact its memory during a GC run, so it's not super predictable).
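Roughly, the per-worker global cache could look like the sketch below; load_global_data() and the refresh interval stand in for your own loading and invalidation logic:
import time

REFRESH_SECONDS = 300                 # refresh every few minutes, not per request

_CACHE = load_global_data()           # runs once per worker (or once, pre-fork)
_LOADED_AT = time.time()

def get_global_data():
    global _CACHE, _LOADED_AT
    if time.time() - _LOADED_AT > REFRESH_SECONDS:
        _CACHE = load_global_data()   # periodic refresh of the stale data
        _LOADED_AT = time.time()
    return _CACHE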
I have never actually tried it myself, but could you possibly use uWSGI's SharedArea to accomplish what you're after?
I have a lot of objects which form a network by keeping references to other objects. All objects (nodes) have a dict holding their properties.
Now I'm looking for a fast way to store these objects (in a file?) and reload all of them into memory later (I don't need random access). The data is about 300MB in memory which takes 40s to load from my SQL format, but I now want to cache it to have faster access.
Which method would you suggest?
(my pickle attempt failed due to recursion errors despite trying to mess around with getstate :( maybe there is something fast anyway? :))
Pickle would be my first choice. But since you say that it didn't work, you might want to try shelve, even though it's not shelve's primary purpose.
Really, you should be using Pickle for this. Perhaps you could post some code so that we can take a look and figure out why it doesn't work.
"The pickle module keeps track of the objects it has already serialized, so that later references to the same object won’t be serialized again." So it IS possible. Perhaps increase the recursion limit with sys.setrecursionlimit.
Hitting Maximum Recursion Depth Using Python's Pickle / cPickle
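If you want to try raising the recursion limit, a quick sketch (root_node stands in for your own top-level object; raising the limit too far can crash the interpreter if the C stack overflows):
import sys
import cPickle as pickle   # plain pickle on Python 3

sys.setrecursionlimit(20000)          # default is usually 1000
with open('network.pickle', 'wb') as f:
    pickle.dump(root_node, f, pickle.HIGHEST_PROTOCOL)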
Perhaps you could set up some layer of indirection where the objects are actually held within, say, another dictionary, and an object referencing another object will store the key of the object being referenced and then access the object through the dictionary. If the object for the stored key is not in the dictionary, it will be loaded into the dictionary from your SQL database, and when it doesn't seem to be needed anymore, the object can be removed from the dictionary/memory (possibly with an update to its state in the database before the version in memory is removed).
This way you don't have to load all the data from your database at once, and can keep a number of the objects cached in memory for quicker access to those. The downside would be the additional overhead required for each access to the main dict.
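A rough sketch of that indirection layer; NodeStore, load_node_from_sql and friend_key are made-up names for the example:
class NodeStore(object):
    def __init__(self):
        self._nodes = {}

    def get(self, key):
        if key not in self._nodes:
            self._nodes[key] = load_node_from_sql(key)   # fetch on first use
        return self._nodes[key]

    def evict(self, key):
        self._nodes.pop(key, None)                       # free memory when done

store = NodeStore()
node = store.get('node-42')
friend = store.get(node.friend_key)    # references are stored as keys, not objects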
I am storing a table using Python and I need persistence.
Essentially I am storing the table as a dictionary mapping strings to numbers, and the whole thing is stored with shelve:
self.DB = shelve.open("%s%sMoleculeLibrary.shelve" % (directory, os.sep), writeback=True)
I set writeback to True as I found the system tends to be unstable if I don't.
After the computations the system needs to close the database, and store it back. Now the database (the table) is about 540MB, and it is taking ages. The time exploded after the table grew to about 500MB. But I need a much bigger table. In fact I need two of them.
I am probably using the wrong form of persistence. What can I do to improve performance?
For storing a large dictionary of string : number key-value pairs, I'd suggest a JSON-native storage solution such as MongoDB. It has a wonderful API for Python, Pymongo. MongoDB itself is lightweight and incredibly fast, and json objects will natively be dictionaries in Python. This means that you can use your string key as the object ID, allowing for compressed storage and quick lookup.
As an example of how easy the code would be, see the following:
d = {'string1' : 1, 'string2' : 2, 'string3' : 3}

from pymongo import Connection
conn = Connection()
db = conn['example-database']
collection = db['example-collection']

for string, num in d.items():
    collection.save({'_id' : string, 'value' : num})

# testing
newD = {}
for obj in collection.find():
    newD[obj['_id']] = obj['value']

print newD
# output is: {u'string2': 2, u'string3': 3, u'string1': 1}
You'd just have to convert back from unicode, which is trivial.
Based on my experience, I would recommend using SQLite3, which comes with Python. It works well with larger databases and key counts; millions of keys and gigabytes of data are not a problem. Shelve is totally wasted at that point. Also, having a separate database process isn't beneficial; it just requires more context switches. In my tests I found that SQLite3 was the preferred option when handling larger data sets locally. Running a local database engine like Mongo, MySQL or PostgreSQL doesn't provide any additional value, and they were also slower.
I think your problem is due to the fact that you use the writeback=True. The documentation says (emphasis is mine):
Because of Python semantics, a shelf cannot know when a mutable persistent-dictionary entry is modified. By default modified objects are written only when assigned to the shelf (see Example). If the optional writeback parameter is set to True, all entries accessed are also cached in memory, and written back on sync() and close(); this can make it handier to mutate mutable entries in the persistent dictionary, but, if many entries are accessed, it can consume vast amounts of memory for the cache, and it can make the close operation very slow since all accessed entries are written back (there is no way to determine which accessed entries are mutable, nor which ones were actually mutated).
You could avoid using writeback=True and make sure each entry is written only when it is assigned to the shelf (you have to pay attention that in-place modifications of retrieved entries will be lost unless you reassign them).
If you believe this is not the right storage option (it's difficult to say without knowing how the data is structured), I suggest sqlite3: it's included in Python (thus very portable) and performs very well. It's somewhat more complicated than a simple key-value store, though.
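For example, a small sketch of working without writeback (the key and field names are illustrative):
import shelve

db = shelve.open('MoleculeLibrary.shelve')   # writeback defaults to False

entry = db['some_molecule']     # read a copy
entry['energy'] = -42.0         # mutate the in-memory copy
db['some_molecule'] = entry     # explicit re-assignment is what persists it

db.close()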
See other answers for alternatives.
How much larger? What are the access patterns? What kinds of computation do you need to do on it?
Keep in mind that you are going to have some performance limits if you can't keep the table in memory no matter how you do it.
You may want to look at going to SQLAlchemy, or directly using something like bsddb, but both of those will sacrifice simplicity of code. However, with SQL you may be able to offload some of the work to the database layer depending on the workload.