What is a good on-disk "set" implementation for Python?

I'm working on a program in Python that needs to store a persistent "set" data structure containing many fixed-size hash values (SHA256, but that's not important). The critical operations are insert and lookup. Delete is not needed for regular operation. The set will grow over time and eventually may not all fit in memory.
I have considered:
a set stored on disk using pickle (slow [several seconds] to write new file to disk, eventually won't fit in memory)
a SQLite database (additional dependency not available by default)
custom disk-based balanced tree structure, such as B-tree or similar
Ideally, there would be a built-in Python module that provides something that can support these operations. What's a good option here?
After I composed this I found Fast disk-based hashtables? which has some good ideas. I like the mmap/bucket accepted answer there.
(This is for a rewrite of shaback if you're curious.)

Another option is to use shelve. I know it's the same as pickle under the hood, but I think it's a good option (one that I didn't see in your list of options :-)). Or, if you don't mind using a third-party lib, you can take a look at shove (it's like a shelve++).
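For illustration, here's a rough sketch of shelve used as a persistent "set" by storing a throwaway value per key (the file name and the hash_hex placeholder are just illustrative):

import shelve

db = shelve.open('seen_hashes')          # shelve keys must be strings
hash_hex = 'deadbeef' * 8                # placeholder for a hex-encoded SHA256 digest
db[hash_hex] = True                      # insert
found = hash_hex in db                   # lookup
db.close()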

I think this is what databases like sqlite are made for. Is there a reason you can't use it?
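Note that sqlite3 ships in Python's standard library, so there is no additional dependency. A minimal sketch of an on-disk "set" of hash digests, with illustrative table and file names:

import sqlite3

conn = sqlite3.connect('hashes.db')
conn.execute('CREATE TABLE IF NOT EXISTS hashes (h BLOB PRIMARY KEY)')

def insert(h):
    # h is the raw digest bytes
    with conn:  # commits the transaction on success
        conn.execute('INSERT OR IGNORE INTO hashes VALUES (?)', (h,))

def contains(h):
    cur = conn.execute('SELECT 1 FROM hashes WHERE h = ? LIMIT 1', (h,))
    return cur.fetchone() is not None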

You could use a DBM-style database. I'm doing a similar thing with dbm, just storing all the keys with a value of '1'. Since it's BSD, the dbhash module should work (it's deprecated and gone in Python 3, so it's not a great idea for long-term use). Otherwise, use the modules gdbm (dbm.gnu in Python 3) or dbm (dbm.ndbm in Python 3). There's also the module dumbdbm (dbm.dumb in Python 3), which is pure Python and always works, but is a bit slower. Also, if you are going to have multiple simultaneous reads and writes, definitely do not use the dumbdbm module.
The various dbm modules all work just like a Python dictionary, except the keys and the values need to be strings. You can use the "in" keyword just as you would with a set or a dict.
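A quick sketch of the dbm-as-set idea on Python 3 (file name is illustrative; on Python 2 the anydbm module plays the same role):

import dbm

db = dbm.open('hashes.dbm', 'c')    # 'c' creates the file if it doesn't exist
db[b'somehash'] = b'1'              # insert: key with a dummy value
present = b'somehash' in db         # lookup works like dict/set membership
db.close()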

Dbm, setting the second value to an arbitrary value of 1 as Brian Minton suggested, is a convenient solution. cPickle is good too.
However, you should also consider using json. Check Google, but AFAIK it seems that the json parser is faster than pickle/cPickle (e.g., http://kovshenin.com/2010/pickle-vs-json-which-is-faster/).

Related

Python text file instead of dictionary

I'm working on a project where I crawl and re-organize a huge amount of data into a result text file. Previously I used a dictionary to store temporary data, but as the data volume increased the process slowed down because of memory usage, and the dictionary became useless.
Since processing speed is not so important in my case, I'm trying to replace the dictionary with a file, but I'm not sure how I can easily move the file pointer to the appropriate position and read the required data. In a dictionary I can easily refer to any piece of data; I would like to achieve the same with a file.
I'm thinking of using mmap and writing my own functions to move the file pointer where I want. Does Python have a built-in or third-party module for such operations?
Any other theoretical approach is welcome.
I think you are now trying to reinvent a key-value database.
Maybe the easiest thing would be to check whether the sqlite3 module offers what you need. Using a ready-made database is easier than rolling your own!
Of course, sqlite3 is not a key-value DB (on the surface), so if you need something even simpler, have a look at LMDB and its Python bindings: http://lmdb.readthedocs.org/en/release/
It is as lightweight and fast as it gets, and is probably close to the fastest way to achieve what you want.
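For reference, a minimal sketch with the lmdb Python bindings (path, map_size and keys are illustrative):

import lmdb  # third-party: pip install lmdb

env = lmdb.open('mydata.lmdb', map_size=10 * 1024 ** 3)  # reserve up to ~10 GB

with env.begin(write=True) as txn:    # write transaction
    txn.put(b'some-key', b'some-value')

with env.begin() as txn:              # read-only transaction
    value = txn.get(b'some-key')      # returns None if the key is absent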
It should be noted that there is no such thing as an optimal key-value database. There are several aspects to consider. At least:
Do you read a lot or write a lot?
What are the key and value sizes?
Do you need transactions/crash-proof?
Do you have duplicate keys (one key, several values)?
Do you want to have sorted keys?
Do you want to read the data out in the same order it is inserted?
What is your database size (MB, GB, TB, PB)?
Are you constrained on IO or CPU?
For example, LMDB, which I suggested above, is very good for read-intensive tasks but not so good for write-intensive ones. It offers transactions, keeps keys in sorted order and is crash-proof (limited by the underlying file system). However, if you need to write to your database often, LMDB may not be the best choice.
On the other hand, SQLite is not the perfect choice for this task, theoretically speaking. In practice, it is built into the standard Python distribution and is thus easy to use. It may provide adequate performance, and it may thus be the best choice.
There are numerous high-quality databases out there. By not mentioning them I do not try to give the impression that the DBs mentioned in this answer are the only good alternatives. Most database managers have a very good reason for their existence. While there are some that are a bit outdated, most have their own sweet spots in the application area.
The field is constantly changing: completely new databases appear and old database systems get updated. This should be kept in mind when reading old benchmarks. Also, the hardware used has an impact; a computer with an SSD, a cloud computing instance, and a traditional computer with an HDD behave quite differently performance-wise.

Serialize a tuple of numpy arrays

I have a couple of numpy matrices (3-dimensional to be exact) which are stored in tuples
(a1,b1,c1)
(a2,b2,c2)
...
(an,bn,cn)
I would like to serialize each tuple into a file that can be read back into Python on another machine (Linux => Windows, both are x86-64). What would be a pythonic way to accomplish this?
numpy.savez or numpy.savez_compressed is the way to go. I've heard of, but never experienced, issues with certain types of arrays not pickling well.
I'm recalling this post (it doesn't seem to have been much of an issue), as well as something about numpy.void not pickling. Likely not an issue, but there it is.
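A short sketch of the savez round trip, with illustrative file and array names:

import numpy as np

a1, b1, c1 = np.zeros((2, 3, 4)), np.ones((2, 3, 4)), np.arange(24).reshape(2, 3, 4)

np.savez_compressed('tuple1.npz', a=a1, b=b1, c=c1)   # .npz files are portable across platforms

loaded = np.load('tuple1.npz')
a1_back, b1_back, c1_back = loaded['a'], loaded['b'], loaded['c']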
Pickle will probably work well
I also saw this: http://thsant.blogspot.com/2007/11/saving-numpy-arrays-which-is-fastest.html
Use shelve, pickle, cPickle, or shove. Each of these will let you store most kinds of python objects in a file; shove and shelve focus on dictionary-like objects that map keys to values, and shove will let you use a variety of database-like backends. If you find yourself exceeding the performance limitations on these libraries, consider going the database route, e.g. through SQLAlchemy.
I've used each of these libraries, and they work reasonably well within their own niche. I'd start with pickle or shelve, which are in the standard library.
I generally use cPickle, although I haven't done a formal comparison with other methods. Additionally, I always open the file in binary mode and use the highest protocol setting:
import cPickle  # Python 2; on Python 3 use pickle instead

with open('fname.pkl', 'wb') as f:
    cPickle.dump(array_tuple, f, -1)  # protocol -1 = highest available

Keeping in-memory data in sync with a file for long running Python script

I have a Python (2.7) script that acts as a server and it will therefore run for very long periods of time. This script has a bunch of values to keep track of which can change at any time based on client input.
What I'm ideally after is something that can keep a Python data structure (with values of types dict, list, unicode, int and float – JSON, basically) in memory, letting me update it however I want (except referencing any of the reference type instances more than once) while also keeping this data up-to-date in a human-readable file, so that even if the power plug was pulled, the server could just start up and continue with the same data.
I know I'm basically talking about a database, but the data I'm keeping will be very simple and probably less than 1 kB most of the time, so I'm looking for the simplest solution possible that can provide me with the described data integrity. Are there any good Python (2.7) libraries that let me do something like this?
Well, since you know we're basically talking about a database, albeit a very simple one, you probably won't be surprised that I suggest you have a look at the sqlite3 module.
I agree that you don't need a full-blown database, as it seems that all you want is atomic file writes. You need to solve this problem in two parts: serialisation/deserialisation, and atomic writing.
For the first part, json or pickle are probably suitable formats for you. JSON has the advantage of being human readable. It doesn't seem as though this is the primary problem you are facing, though.
Once you have serialised your object to a string, use the following procedure to write a file to disk atomically, assuming a single concurrent writer (at least on POSIX, see below):
import os, platform, json

backup_filename = "output.back.json"
filename = "output.json"
serialised_str = json.dumps(...)

with open(backup_filename, 'wb') as f:
    f.write(serialised_str)

if platform.system() == 'Windows':
    os.unlink(filename)
os.rename(backup_filename, filename)
While os.rename will overwrite an existing file and is atomic on POSIX, this is sadly not the case on Windows. On Windows, there is the possibility that os.unlink will succeed but os.rename will fail, meaning that you have only backup_filename and no filename. If you are targeting Windows, you will need to consider this possibility when you are checking for the existence of filename.
If there is a possibility of more than one concurrent writer, you will have to consider a synchronisation construct.
Any reason for the human readable requirement?
I would suggest looking at sqlite for a simple database solution, or at pickle for a simple way to serialise objects and write them to disk. Neither is particularly human readable though.
Other options are JSON or XML, as you hinted at: use the built-in json module to serialize the objects, then write that to disk. When you start up, check for the presence of that file and load the data if required.
From the docs:
>>> import json
>>> print json.dumps({'4': 5, '6': 7}, sort_keys=True, indent=4)
{
"4": 5,
"6": 7
}
Since you mentioned your data is small, I'd go with a simple solution and use the pickle module, which lets you dump a Python object to a file very easily.
Then you just set up a thread that saves your object to a file at defined time intervals.
Not a "libraried" solution, but - if I understand your requirements - simple enough for you not to really need one.
EDIT: you mentioned you wanted to cover the case where a problem occurs during the write itself, effectively making it an atomic transaction. In this case, the traditional way to go is "log-based recovery". It essentially means writing a record to a log file saying "write transaction started" and then writing "write transaction committed" when you're done. If a "started" has no corresponding "commit", then you roll back.
In this case, I agree that you might be better off with a simple database like SQLite. It might be a slight overkill, but on the other hand, implementing atomicity yourself might be reinventing the wheel a little (and I didn't find any obvious libraries that do it for you).
If you do decide to go the crafty way, this topic is covered on the Process Synchronization chapter of Silberschatz's Operating Systems book, under the section "atomic transactions".
A very simple (though maybe not "transactionally perfect") alternative would be just to write to a new file every time, so that if one gets corrupted you have a history. You can even add a checksum to each file to automatically determine whether it's broken.
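A sketch of the checksum idea, assuming JSON payloads (the helper names are hypothetical):

import hashlib
import json

def save_snapshot(path, data):
    payload = json.dumps(data).encode('utf-8')
    digest = hashlib.sha256(payload).hexdigest().encode('ascii')
    with open(path, 'wb') as f:
        f.write(digest + b'\n' + payload)   # checksum stored on the first line

def load_snapshot(path):
    with open(path, 'rb') as f:
        digest, payload = f.read().split(b'\n', 1)
    if hashlib.sha256(payload).hexdigest().encode('ascii') != digest:
        raise ValueError('snapshot is corrupted')
    return json.loads(payload.decode('utf-8'))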
You are asking how to implement a database which provides ACID guarantees, but you haven't provided a good reason why you can't use one off-the-shelf. SQLite is perfect for this sort of thing and gives you those guarantees.
However, there is KirbyBase. I've never used it and I don't think it makes ACID guarantees, but it does have some of the characteristics you're looking for.

Comparing persistent storage solutions in python

I'm starting on a new scientific project which has a lot of data (millions of entries) I'd like to store in an easily and quickly accessible format. I've come across a number of different potential options, but I'm not sure how to pick amongst them. My data can probably just be stored as a dictionary, or potentially a dictionary of dictionaries. Some potential considerations:
Speed. I can't load all the data off disk every time I start a new script, and I'd like as quick access to random entries as possible.
Ease-of-use. This is python. The storage should feel like python.
Stability/maturity. I'd like something that's currently supported, although something that works well but is still in development would be fine.
Ease of installation. My sysadmin should be able to get this running on our cluster.
I don't really care that much about the size of the storage, but it could be a consideration if an option is really terrible on this front. Also, if it matters, I'll most likely be creating the database once, and thereafter only reading from it.
Some potential options that I've started looking at (see this post):
pyTables
ZopeDB
shove
shelve
redis
durus
Any suggestions on which of these might be better for my purposes? Any better ideas? Some of these have a back-end; any suggestions on which file-system back-end would be best?
Might want to give mongodb a shot - the PyMongo library works with dictionaries and supports most Python types. Easy to install, very performant + scalable. MongoDB (and PyMongo) is also used in production at some big names.
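For a feel of the API, a hedged sketch with PyMongo (the database/collection names and the document are illustrative, and a MongoDB server is assumed to be running locally):

from pymongo import MongoClient  # third-party: pip install pymongo

client = MongoClient('localhost', 27017)
coll = client.science_db.entries

coll.insert_one({'sample_id': 42, 'values': [1.0, 2.5, 3.7]})   # store a plain dict
doc = coll.find_one({'sample_id': 42})                          # fast random access by key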
An RDBMS.
Nothing is more reliable than using tables on a well-known RDBMS. PostgreSQL comes to mind.
That automatically gives you some choices for the future like clustering. Also you automatically have a lot of tools to administer your database, and you can use it from other software written in virtually any language.
It is really fast.
In the "feel like python" point, I might add that you can use an ORM. A strong name is sqlalchemy. Maybe with the elixir "extension".
Using sqlalchemy you can leave your user/sysadmin choose which database backend he wants to use. Maybe they already have MySql installed - no problem.
RDBMSs are still the best choice for data storage.
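As a rough sketch of what that looks like with SQLAlchemy Core (the table layout and the sqlite:// URL are illustrative; swap in a postgresql:// or mysql:// URL for a server-backed database):

from sqlalchemy import create_engine, MetaData, Table, Column, String, LargeBinary

engine = create_engine('sqlite:///store.db')
metadata = MetaData()
kv = Table('kv', metadata,
           Column('key', String, primary_key=True),
           Column('value', LargeBinary))
metadata.create_all(engine)

with engine.begin() as conn:   # transactional block, committed on success
    conn.execute(kv.insert().values(key='abc', value=b'\x01\x02'))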
I'm working on such a project and I'm using SQLite.
SQLite stores everything in one file and is part of Python's standard library. Hence, installation and configuration are virtually free (ease of installation).
You can easily manage the database file with small Python scripts or via various tools. There is also a Firefox plugin (ease of installation / ease-of-use).
I find it very convenient to use SQL to filter/sort/manipulate/... the data. Although, I'm not an SQL expert. (ease-of-use)
I'm not sure if SQLite is the fastest DB system for this work, and it lacks some features you might need, e.g. stored procedures.
Anyway, SQLite works for me.
if you really just need dictionary-like storage, some of the new key/value or column stores like Cassandra or MongoDB might provide a lot more speed than you'd get with a relational database. Of course if you decide to go with RDBMS, SQLAlchemy is the way to go (disclaimer: I am its creator), but your desired featurelist seems to lean in the direction of "I just want a dictionary that feels like Python" - if you aren't interested in relational queries or strong ACIDity, those facets of RDBMS will probably feel cumbersome.
SQLite -- it comes with Python, is fast, widely available and easy to maintain.
If you only need simple (dict-like) access mechanisms and need efficiency for processing a lot of data, then HDF5 might be a good option. If you are going to be using numpy then it is really worth considering.
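A small sketch using h5py, one of the HDF5 bindings (pyTables from the question's list is another); file and dataset names are illustrative:

import h5py
import numpy as np

with h5py.File('data.h5', 'w') as f:
    f['temperature'] = np.random.rand(1000, 1000)   # store a large array

with h5py.File('data.h5', 'r') as f:
    block = f['temperature'][10:20, :]   # reads only the requested slice from disk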
Go with an RDBMS: it is reliable, scalable and fast.
If you need a more scalable solution and don't need the features of an RDBMS, you can go with a key-value store like CouchDB, which has a good Python API.
The NEMO collaboration (building a cosmic neutrino detector underwater) had many of the same problems, and they used MySQL and PostgreSQL without major problems.
It really depends on what you're trying to do. An RDBMS is designed for relational data, so if your data is relational, then use one of the various SQL options. But it sounds like your data is more oriented towards a key-value store with very fast random GET operations. If that's the case, compare the benchmarks of the various key-value stores, focusing on GET speed. The ideal key-value store will keep or cache requests in memory, and be able to handle many GET requests concurrently. You may actually want to create your own benchmark suite so you can effectively compare random concurrent GET operations.
Why do you need a cluster? Is the size of each value very large? If not, you shouldn't need a cluster to handle storage of a million entries. But if you're storing large blobs of data, that matters, and you may need something that easily supports read slaves and/or transparent partitioning. Some of the key-value stores are document oriented and/or optimized for storing larger values. Redis is technically less storage efficient for larger values due to the indexing overhead required for fast GETs, but that doesn't necessarily mean it's slower. In fact, the extra indexing makes lookups faster.
You're the only one that can truly answer this question, and I strongly recommend putting together a custom benchmark suite to test available options with actual usage scenarios. The data you get from that will give you more insight than anything else.

Best way to save complex Python data structures across program sessions (pickle, json, xml, database, other)

Looking for advice on the best technique for saving complex Python data structures across program sessions.
Here's a list of techniques I've come up with so far:
pickle/cpickle
json
jsonpickle
xml
database (like SQLite)
Pickle is the easiest and fastest technique, but my understanding is that there is no guarantee that pickle output will work across various versions of Python 2.x/3.x or across 32 and 64 bit implementations of Python.
Json only works for simple data structures. Jsonpickle seems to correct this AND seems to be written to work across different versions of Python.
Serializing to XML or to a database is possible, but represents extra effort since we would have to do the serialization ourselves manually.
Thank you,
Malcolm
You have a misconception about pickles: they are guaranteed to work across Python versions. You simply have to choose a protocol version that is supported by all the Python versions you care about.
The technique you left out is marshal, which is not guaranteed to work across Python versions (and btw, is how .pyc files are written).
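For example, a sketch that pins the protocol so the file stays readable across the interpreters mentioned (protocol 2 is understood by Python 2.3+ and all Python 3.x; the data structure is illustrative):

import pickle

data = {'values': [1, 2.5, u'text'], 'nested': {'ok': True}}

with open('state.pkl', 'wb') as f:
    pickle.dump(data, f, protocol=2)

with open('state.pkl', 'rb') as f:
    restored = pickle.load(f)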
You left out the marshal and shelve modules.
Also, this Python docs page covers persistence.
Have you looked at PySyck or pyYAML?
What are your criteria for "best" ?
pickle can do most Python structures, deeply nested ones too
sqlite dbs can be easily queried (if you know sql :)
speed / memory ? trust no benchmarks that you haven't faked yourself.
(Fine print:
cPickle.dump(protocol=-1) produces much more compact output (in one case, a 15M pickle vs. 60M in sqlite), but can break.
Strings that occur many times, e.g. country names, may take more memory than you expect;
see the builtin intern().
)
