Data structure that is partially filesystem-backed?

Data structure that is partially filesystem-backed? - python

Is there an existing (Python) implementation of a hash-like data structure that exists partially on disk? Or, can persist specific keys to some secondary storage, based on some criteria (like last-accessed-time)?
ex: "data at key K has not been accessed in M milliseconds; serialize it to persistent storage (disk?), and delete it".
I was referred to this, but I'm not sure I can digest it.
EDIT:
I've recd two excellent answers (sqlite; gdb)m; in order to determine a winner, I'll have to wait until I've tested both. Thank you!!

Go for SQLITE. A big problem that you will face down the road is concurrency/file corruption, etc and SQLite makes these very easy to avoid as it provides transactional integrity. Just define a single table with schema (primary key string key, string value). SQLite is insanely fast, especially if you wrap bunches of writes into a transaction.
GDBM IMHO also has licence problems depending on what you want to do, whereas SQLite is public domain.

Sounds like you are looking for gdbm:
The gdbm module provides an interface to the GNU DBM library. gdbm objects behave like mappings (dictionaries), except that keys and values are always strings.
That's basically a dictionary on disk. You might have to do a bit of serialization depending on your use.

Related

Python: Array vs Database for storage of key/value

Q: Which is quicker for this scenario?
My scenario: my application will be storing either in either an array or postgresql db a list of links, so it might look like:
1) mysite.com
a) /users/login
b) /users/registration/
c) /contact/
d) /locate/search
e) /priv/admin-login
The above entries under 1) - I will be doing string searches on these urls to find for example any path that contains:
'login'
for example.
The above letters a) through e) could maybe have anywhere from 5-100 more entries for a given domain.
*The usage: * This data structure can change potentially as much as everyday, but only once per day. Some key/values will be removed, others will be modified. An individual set like:
dict2 = { 'thesite.com': 123, 98.6: 37 };
Each key will represent 1 and only 1 domain.
I've tried searching a bit on this, but cannot seem to find a real good answer to : when should an array be used and when should a db like postgresql be used?
I've always used a db to handle data (using mysql, not postgresql), but I'm now trying to do it better from now on, so I wondered if an array or other data structure would work better within a loop, and while trying tomatch a given string while looping.
As always, thank you!

A full SQL database would probably be overkill. If you can fit everything in memory, put it all in a dict and then use the pickle module to serialize it and write it to the disk.
Another good option would be to use one of the dbm modules (dbm/dbm.ndbm, gdbm or anydbm) to store the data in a disk-bound hash table. It will have O(1) lookup times without the need to connect and form a query like in a bigger database.
edit: If you have multiple values per key and you don't want a full-blown database, SQLite would be a good choice. There is already a built-in module for it, sqlite3 (as mentioned in the comments)

Test it. It's your dataset, your hardware, your available disk and network IO, your usage pattern. There's no one true answer here. We don't even know how many queries are you planning - are we talking about one per minute or thousands per second?
If your data fits nicely in memory and doesn't take a massive amount of time to load the first time, sticking it into a dictionary in memory will probably be faster.
If you're always looking for full words (like in the login case), you will gain some speed too from splitting the url into parts and indexing those separately.

From MongoDB to PostgreSQL - Django

Could any one shed some light on how to migrate my MongoDB to PostgreSQL? What tools do I need, what about handling primary keys and foreign key relationships, etc?
I had MongoDB set up with Django, but would like to convert it back to PostgreSQL.

Whether the migration is easy or hard depends on a very large number of things including how many different versions of data structures you have to accommodate. In general you will find it a lot easier if you approach this in stages:
Ensure that all the Mongo data is consistent in structure with your RDBMS model and that the data structure versions are all the same.
Move your data. Expect that problems will be found and you will have to go back to step 1.
The primary problems you can expect are data validation problems because you are moving from a less structured data platform to a more structured one.
Depending on what you are doing regarding MapReduce you may have some work there as well.

In the mean time, Postgres Foreign Data Wrapper for MongoDB has emerged (versions 9.1-9.4). With it, one can set up a view to MongoDB, via the PostgreSQL, and then handle the data as SQL.
This would probably mean rather easy way to copy data as well.
Limitations of FDW that I have faced:
objects within arrays (in MongoDB) do not seem to be addressable
objects with dynamic key names do not seem to be addressable
I know it's 2015 now. :)

Using DVCS for an RDBMS audit trail

I'm looking to implement an audit trail for a reasonably complicated relational database, whose schema is prone to change. One avenue I'm thinking of is using a DVCS to track changes.
(The benefits I can imagine are: schemaless history, snapshots of entire system's state, standard tools for analysis, playback and migration, efficient storage, separate system, keeping DB clean. The database is not write-heavy and history is not not a core feature, it's more for the sake of having an audit trail. Oh and I like trying crazy new approaches to problems.)
I'm not an expert with these systems (I only have basic git familiarity), so I'm not sure how difficult it would be to implement. I'm thinking of taking mercurial's approach, but possibly storing the file contents/manifests/changesets in a key-value data store, not using actual files.
Data rows would be serialised to json, each "file" could be an row. Alternatively an entire table could be stored in a "file", with each row residing on the line number equal to its primary key (assuming the tables aren't too big, I'm expecting all to have less than 4000 or so rows. This might mean that the changesets could be automatically generated, without consulting the rest of the table "file".
(But I doubt it, because I think we need a SHA-1 hash of the whole file. The files could perhaps be split up by a predictable number of lines, eg 0 < primary key < 1000 in file 1, 1000 < primary key < 2000 in file 2 etc, keeping them smallish)
Is there anyone familiar with the internals of DVCS' or data structures in general who might be able to comment on an approach like this? How could it be made to work, and should it even be done at all?
I guess there are two aspects to a system like this: 1) mapping SQL data to a DVCS system and 2) storing the DVCS data in a key/value data store (not files) for efficiency.
(NB the json serialisation bit is covered by my ORM)

I've looked into this a little on my own, and here are some comments to share.
Although I had thought using mercurial from python would make things easier, there's a lot of functionality that the DVCS's have that aren't necessary (esp branching, merging). I think it would be easier to simply steal some design decisions and implement a basic system for my needs. So, here's what I came up with.
Blobs
The system makes a json representation of the record to be archived, and generates a SHA-1 hash of this (a "node ID" if you will). This hash represents the state of that record at a given point in time and is the same as git's "blob".
Changesets
Changes are grouped into changesets. A changeset takes note of some metadata (timestamp, committer, etc) and links to any parent changesets and the current "tree".
Trees
Instead of using Mercurial's "Manifest" approach, I've gone for git's "tree" structure. A tree is simply a list of blobs (model instances) or other trees. At the top level, each database table gets its own tree. The next level can then be all the records. If there are lots of records (there often are), they can be split up into subtrees.
Doing this means that if you only change one record, you can leave the untouched trees alone. It also allows each record to have its own blob, which makes things much easier to manage.
Storage
I like Mercurial's revlog idea, because it allows you to minimise the data storage (storing only changesets) and at the same time keep retrieval quick (all changesets are in the same data structure). This is done on a per record basis.
I think a system like MongoDB would be best for storing the data (It has to be key-value, and I think Redis is too focused on keeping everything in memory, which is not important for an archive). It would store changesets, trees and revlogs. A few extra keys for the current HEAD etc and the system is complete.
Because we're using trees, we probably don't need to explicitly link foreign keys to the exact "blob" it's referring to. Justing using the primary key should be enough. I hope!
Use case: 1. Archiving a change
As soon as a change is made, the current state of the record is serialised to json and a hash is generated for its state. This is done for all other related changes and packaged into a changeset. When complete, the relevant revlogs are updated, new trees and subtrees are generated with the new object ("blob") hashes and the changeset is "committed" with meta information.
Use case 2. Retrieving an old state
After finding the relevant changeset (MongoDB search?), the tree is then traversed until we find the blob ID we're looking for. We go to the revlog and retrieve the record's state or generate it using the available snapshots and changesets. The user will then have to decide if the foreign keys need to be retrieved too, but doing that will be easy (using the same changeset we started with).
Summary
None of these operations should be too expensive, and we have a space efficient description of all changes to a database. The archive is kept separately to the production database allowing it to do its thing and allowing changes to the database schema to take place over time.

If the database is not write-heavy (as you say), why not just implement the actual database tables in a way that achieves your goal? For example, add a "version" column. Then never update or delete rows, except for this special column, which you can set to NULL to mean "current," 1 to mean "the oldest known", and go up from there. When you want to update a row, set its version to the next higher one, and insert a new one with no version. Then when you query, just select rows with the empty version.

Take a look at cqrs and Greg Young's event sourcing. I also have a blog post about working in meta events that pin point schema changes within the river of business events.
http://adventuresinagile.blogspot.com/2009/09/rewind-button-for-your-application.html
If you look through my blog, you'll also find version script schemes and you can source code control those.

Storing a python set in a database with django

I have a need to store a python set in a database for accessing later. What's the best way to go about doing this? My initial plan was to use a textfield on my model and just store the set as a comma or pipe delimited string, then when I need to pull it back out for use in my app I could initialize a set by calling split on the string. Obviously if there is a simple way to serialize the set to store it in the db so I can pull it back out as a set when I need to use it later that would be best.

If your database is better at storing blobs of binary data, you can pickle your set. Actually, pickle stores data as text by default, so it might be better than the delimited string approach anyway. Just pickle.dumps(your_set) and unpickled = pickle.loads(database_string) later.

There are a number of options here, depending on what kind of data you wish to store in the set.
If it's regular integers, CommaSeparatedIntegerField might work fine, although it often feels like a clumsy storage method to me.
If it's other kinds of Python objects, you can try pickling it before saving it to the database, and unpickling it when you load it again. That seems like a good approach.
If you want something human-readable in your database though, you could even JSON-encode it into a TextField, as long as the data you're storing doesn't include Python objects.

Redis natively stores sets (as well as other data structures (lists, dicts, queue)) and provides set operations - and its rocket fast too. I find it's the swiss army knife for python development.
I know its not a relational database per se, but it does solve this problem very concisely.

What about CommaSeparatedIntegerField?
If you need other type (string for example) you can create your own field which would work like CommaSeparatedIntegerField but will use strings (without commas).
Or, if you need other type, probably a better way of doing it: have a dictionary which maps integers to your values.

Is it a good practice to use pickled data instead of additional tables?

Many times while creating database structure, I get stuck at the question, what would be more effective, storing data in pickled format in a column in the same table or create additional table and then use JOIN.
Which path should be followed, any advice ?
For example:
There is a table of Customers, containing fields like Name, Address
Now for managing Orders (each customer can have many), you can either create an Order table or store the orders in a serialized format in a separate column in the Customers table only.

It's usually better to create seperate tables. If you go with pickling and later find you want to query the data in a different way, it could be difficult.
See Database normalization.

Usually it's best to keep your data normalized (i.e. create more tables). Storing data 'pickled' as you say, is acceptable, when you don't need to perform relational operations on them.

Mixing SQL databases and pickling seems to ask for trouble. I'd go with either sticking all data in the SQL databases or using only pickling, in the form of the ZODB, which is a Python only OO database that is pretty damn awesome.
Mixing makes case sometimes, but is usually just more trouble than it's worth.

I agree with Mchi, there is no problem storing "pickled" data if you don't need to search or do relational type operations.
Denormalisation is also an important tool that can scale up database performance when applied correctly.
It's probably a better idea to use JSON instead of pickles. It only uses a little more space, and makes it possible to use the database from languages other than Python

I agree with #Lennart Regebro. You should probably see whether you need a Relational DB or an OODB. If RDBMS is your choice, I would suggest you stick with more tables. IMHO, pickling may have issues with scalability. If thats what you want, you should look at ZODB. It is pretty good and supports caching etc for better performance

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.