Separate Database table vs serialized key:value pair - python

I am working on designing a relational database for a meal scheduler web application.
I have it 99% set up, but I am wondering whether to use a separate table for the "meal type" entries.
To sum it up, users can add their own meal types (breakfast, snack, dinner) arbitrarily, in any order, and I am currently storing them in a simple list (ordered with JavaScript in the frontend for convenience).
It won't have more than half a dozen elements at worst (who even plans more than 6 meals a day anyway), so I am saving it all in the database's settings table, which contains rows as key:value pairs.
In this case, it's 'meals': [json string representing the python list]
The problem is that every scheduled recipe needs to be qualified by meal type.
id_scheduled_meal
id_recipe
meal_type
Right now, I'd have to use the exact string saved in the key:value pair in order to associate it with a specific meal type, so meal_type would be "Breakfast" or "Snack", rather than an id. It feels like too much redundant data.
At the same time, I am not sure it would be good to create a separate object (Meal) with a separate table (meal), only to add 4-6 entries and 1-3 columns (id, name, position).
Any suggestions? I am happy to clarify; I realize the explanation might not be as clear as it could be.
Thanks in advance

I feel like this is pretty opinion-based, and the answer will depend on how you want to interact with the data. If you plan on writing queries that include the meal type, then you might save yourself some pain and just add the extra table, though managing/saving items will be more complex. If it's just a list that you plan on handling entirely in Python (or whatever), then serialising the list and saving the text might be the better choice. Whether the extra redundant space will adversely affect you will depend on your application and requirements.
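If you do go the separate-table route, the schema stays tiny. Here is a minimal sketch with sqlite3; the table and column names are guesses based on your description, not your actual schema:

import sqlite3

conn = sqlite3.connect("scheduler.db")
cur = conn.cursor()

# A small lookup table for the user-defined meal types.
cur.execute("""
    CREATE TABLE IF NOT EXISTS meal (
        id       INTEGER PRIMARY KEY,
        name     TEXT NOT NULL,
        position INTEGER NOT NULL
    )
""")

# Scheduled recipes reference the meal type by id instead of by name.
cur.execute("""
    CREATE TABLE IF NOT EXISTS scheduled_meal (
        id_scheduled_meal INTEGER PRIMARY KEY,
        id_recipe         INTEGER NOT NULL,
        id_meal           INTEGER NOT NULL REFERENCES meal(id)
    )
""")
conn.commit()

Renaming a meal type then means updating one row in meal, rather than rewriting the serialized list and every scheduled_meal row that embeds the string.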

Related

Is it possible to generate hash from a queryset?

My idea is to create a hash of a queryset result. For example, product inventory.
Each update of this stock would generate a hash.
The idea is that the API would only be asked for this queryset again when there is a change (example: a new product in inventory).
Example for this use:
No change, same hash: no request to get the queryset.
There was a change, different hash: then a request will be made.
This would be a feature designed for those who are consuming the data, not for the Django app that is serving it.
Does this make any sense? I saw that in Python there is a way to generate a hash from a tuple; in my case it would be to use a frozenset and generate the hash from that. I don't know if it's a good idea.
I would comment, but I'm waiting on the 50 rep to be able to do that. It sounds like you're trying to cache results so you aren't querying on data that hasn't been changed. If you're not familiar with caching, the idea is to save hard-to-compute answers in memory for frequently queried endpoints/functions.
For example, if I had a program that calculated the first n digits of pi, I may choose to save a map of [digit count -> value] so that if 10 people asked me for the first thousand, I would only calculate it once. Redis is a popular option for caching, and I believe it exists for Django. It allows you to cache some information, set a time before expiration on it, and then wipe specific parts of that information (to force it to recalculate) every time something specific changes (like a new product in inventory).
Everybody should try writing their own cache at least once, like what you're describing, but the de facto professional option is to use a caching library. Your idea is good, it will definitely work, and you will probably want a dict of [hash -> result] for each hash, where result is the information you would send back over your API. If you plan to save data so it persists across multiple program starts, remember that Python randomizes the hash seed for strings by default, so built-in hash() values are not consistent across interpreter runs. Check out this post for more info.
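For a fingerprint that stays stable across processes and restarts, hashlib is the usual choice instead of the built-in hash(). A minimal sketch; the model fields and queryset here are assumptions for illustration only:

import hashlib
import json

def inventory_fingerprint(queryset):
    # Serialize the rows in a deterministic order so the same data
    # always produces the same digest.
    rows = list(queryset.values("id", "name", "quantity").order_by("id"))
    payload = json.dumps(rows, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

Clients can remember the last digest they saw and only re-fetch the full queryset when the digest returned by the API changes.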

Is it more efficient to store id values in dictionary or re-query database

I have a script that repopulates a large database and needs to look up id values from other tables when inserting records.
Example would be recording order information when given customer names only. I would check to see if the customer exists in a CUSTOMER table. If so, SELECT query to get his ID and insert the new record. Else I would create a new CUSTOMER entry and get the Last_Insert_Id().
Since these values duplicate a lot and I don't always need to generate a new ID -- would it be better for me to store the ID => CUSTOMER relationship as a dictionary that gets checked before reaching the database, or should I make the script constantly re-query the database? I'm thinking the first approach is the better one since it reduces load on the database, but I'm concerned about how large the ID dictionary would get and the impact of that.
The script is running on the same box as the database, so network delays are negligible.
"Is it more efficient"?
Well, a dictionary is storing the values in a hash table. This should be quite efficient for looking up a value.
The major downside is maintaining the dictionary. If you know the database is not going to be updated, then you can load it once and the in-application memory operations are probably going to be faster than anything you can do with a database.
However, if the data is changing, then you have a real challenge. How do you keep the memory version aligned with the database version? This can be very tricky.
My advice would be to keep the work in the database, with an index on the column you would otherwise use as the dictionary key. This should be fast enough for your application. If you need to eke out further speed, then using a dictionary is one possibility -- but no doubt, one possibility out of many -- for improving the application performance.
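If you do go the in-memory route, the usual pattern is a dictionary keyed by customer name that falls back to the database on a miss. A rough sketch with sqlite3; the table and column names are assumptions:

import sqlite3

conn = sqlite3.connect("orders.db")
customer_ids = {}  # name -> id, filled lazily

def get_customer_id(name):
    # Check the in-memory cache first.
    if name in customer_ids:
        return customer_ids[name]
    cur = conn.cursor()
    row = cur.execute(
        "SELECT id FROM customer WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        cur.execute("INSERT INTO customer (name) VALUES (?)", (name,))
        conn.commit()
        customer_id = cur.lastrowid
    else:
        customer_id = row[0]
    customer_ids[name] = customer_id
    return customer_id

The cache only grows by one small entry per distinct customer, so memory is rarely the problem; the real risk is the staleness issue described above if anything else writes to the table while the script runs.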

Are there serious performance differences between using pickleType and relationships?

Let's say there is a table of People, and let's say there are 1,000+ in the system. Each People item has the following fields: name, email, occupation, etc.
And we want to allow a People item to have a list of names (nicknames & such) where no other data is associated with the name - a name is just a string.
Is this exactly what PickleType is for? What kind of performance differences are there between using PickleType and creating a Name table, so that the name field of People is a one-to-many kind of relationship?
Yes, this is one good use case of sqlalchemy's PickleType field, documented very well here. There are obvious performance advantages to using this.
Using your example, assume you have a People item which uses a one-to-many database lookup. This requires the database to perform a JOIN to collect the sub-elements; in this case, the Person's nicknames, if any. However, you have the benefit of having native objects ready to use in your Python code, without the cost of deserializing pickles.
In comparison, the list of strings can be pickled and stored as a PickleType in the database, which is stored internally as a LargeBinary. Querying for a Person will only require the database to hit a single table, with no JOINs, which results in an extremely fast return of data. However, you now incur the "cost" of un-pickling each item back into a Python object, which can be significant if you're not storing native datatypes; e.g. string, int, list, dict.
Additionally, by storing pickles in the database, you also lose the ability for the underlying database to filter results given a WHERE condition; especially with integers and datetime objects. A native database call can return values within a given numeric or date range, but will have no concept of what the string representing these items really is.
Lastly, a simple change to a single pickle could allow arbitrary code execution within your application. It's unlikely, but must be stated.
IMHO, storing pickles is a nice way to store certain types of data, but how well it works will vary greatly with the type of data. I can tell you we use it pretty extensively in our schema, even on several tables with over half a billion records, quite nicely.
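For reference, a minimal sketch of the two options (SQLAlchemy 1.4+ style; the model and column names are my own, not from the question):

from sqlalchemy import Column, ForeignKey, Integer, PickleType, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

# Option 1: pickle the list of nicknames into a single column.
class PersonPickled(Base):
    __tablename__ = "person_pickled"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    nicknames = Column(PickleType)  # e.g. ["Bob", "Bobby"]

# Option 2: a proper one-to-many relationship.
class Person(Base):
    __tablename__ = "person"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    nicknames = relationship("Nickname", back_populates="person")

class Nickname(Base):
    __tablename__ = "nickname"
    id = Column(Integer, primary_key=True)
    person_id = Column(Integer, ForeignKey("person.id"))
    value = Column(String)
    person = relationship("Person", back_populates="nicknames")

Option 2 is the one that lets you filter on nicknames in SQL (e.g. WHERE nickname.value = 'Bob'); option 1 trades that away for a single-table read plus the un-pickling cost.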

Using DVCS for an RDBMS audit trail

I'm looking to implement an audit trail for a reasonably complicated relational database, whose schema is prone to change. One avenue I'm thinking of is using a DVCS to track changes.
(The benefits I can imagine are: schemaless history, snapshots of the entire system's state, standard tools for analysis, playback and migration, efficient storage, a separate system, and keeping the DB clean. The database is not write-heavy and history is not a core feature; it's more for the sake of having an audit trail. Oh, and I like trying crazy new approaches to problems.)
I'm not an expert with these systems (I only have basic git familiarity), so I'm not sure how difficult it would be to implement. I'm thinking of taking mercurial's approach, but possibly storing the file contents/manifests/changesets in a key-value data store, not using actual files.
Data rows would be serialised to json; each "file" could be a row. Alternatively an entire table could be stored in a "file", with each row residing on the line number equal to its primary key (assuming the tables aren't too big; I'm expecting all of them to have fewer than 4000 or so rows). This might mean that the changesets could be automatically generated, without consulting the rest of the table "file".
(But I doubt it, because I think we need a SHA-1 hash of the whole file. The files could perhaps be split up by a predictable number of lines, eg 0 < primary key < 1000 in file 1, 1000 < primary key < 2000 in file 2 etc, keeping them smallish)
Is there anyone familiar with the internals of DVCS' or data structures in general who might be able to comment on an approach like this? How could it be made to work, and should it even be done at all?
I guess there are two aspects to a system like this: 1) mapping SQL data to a DVCS system and 2) storing the DVCS data in a key/value data store (not files) for efficiency.
(NB the json serialisation bit is covered by my ORM)
I've looked into this a little on my own, and here are some comments to share.
Although I had thought using Mercurial from Python would make things easier, there's a lot of functionality that DVCSs have that isn't necessary (especially branching and merging). I think it would be easier to simply steal some design decisions and implement a basic system for my needs. So, here's what I came up with.
Blobs
The system makes a json representation of the record to be archived, and generates a SHA-1 hash of this (a "node ID" if you will). This hash represents the state of that record at a given point in time and is the same as git's "blob".
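A rough sketch of generating such a blob ID in Python (the record dict here is just an illustration):

import hashlib
import json

def blob_id(record):
    # Serialize the record deterministically, then hash it, much like
    # git hashes a blob's content.
    payload = json.dumps(record, sort_keys=True, default=str)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()

blob_id({"id": 42, "name": "Widget", "price": "9.99"})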
Changesets
Changes are grouped into changesets. A changeset takes note of some metadata (timestamp, committer, etc) and links to any parent changesets and the current "tree".
Trees
Instead of using Mercurial's "Manifest" approach, I've gone for git's "tree" structure. A tree is simply a list of blobs (model instances) or other trees. At the top level, each database table gets its own tree. The next level can then be all the records. If there are lots of records (there often are), they can be split up into subtrees.
Doing this means that if you only change one record, you can leave the untouched trees alone. It also allows each record to have its own blob, which makes things much easier to manage.
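As a simplified sketch (real git trees store more metadata), a tree can just be a mapping from names to child hashes that is itself hashed; the layout below is hypothetical:

import hashlib
import json

def tree_id(entries):
    # entries maps a name (table, bucket, or primary key) to a child
    # blob/tree hash; hashing the mapping gives the tree's own ID.
    payload = json.dumps(entries, sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()

# One subtree per table, records keyed by primary key.
recipe_tree = tree_id({"1": "blob-hash-of-record-1", "2": "blob-hash-of-record-2"})
root_tree = tree_id({"recipe": recipe_tree, "meal": "tree-hash-of-meal-table"})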
Storage
I like Mercurial's revlog idea, because it allows you to minimise the data storage (storing only changesets) and at the same time keep retrieval quick (all changesets are in the same data structure). This is done on a per record basis.
I think a system like MongoDB would be best for storing the data (It has to be key-value, and I think Redis is too focused on keeping everything in memory, which is not important for an archive). It would store changesets, trees and revlogs. A few extra keys for the current HEAD etc and the system is complete.
Because we're using trees, we probably don't need to explicitly link foreign keys to the exact "blob" they refer to. Just using the primary key should be enough. I hope!
Use case: 1. Archiving a change
As soon as a change is made, the current state of the record is serialised to json and a hash is generated for its state. This is done for all other related changes and packaged into a changeset. When complete, the relevant revlogs are updated, new trees and subtrees are generated with the new object ("blob") hashes and the changeset is "committed" with meta information.
Use case 2. Retrieving an old state
After finding the relevant changeset (MongoDB search?), the tree is then traversed until we find the blob ID we're looking for. We go to the revlog and retrieve the record's state or generate it using the available snapshots and changesets. The user will then have to decide if the foreign keys need to be retrieved too, but doing that will be easy (using the same changeset we started with).
Summary
None of these operations should be too expensive, and we have a space efficient description of all changes to a database. The archive is kept separately to the production database allowing it to do its thing and allowing changes to the database schema to take place over time.
If the database is not write-heavy (as you say), why not just implement the actual database tables in a way that achieves your goal? For example, add a "version" column. Then never update or delete rows, except for this special column, which you can set to NULL to mean "current", 1 to mean "the oldest known", and go up from there. When you want to update a row, bump its version to the next higher number and insert a new row with no version. Then when you query, just select the rows with an empty version.
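A rough sketch of that versioning scheme, using sqlite3 and hypothetical table/column names:

import sqlite3

conn = sqlite3.connect("audit.db")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS recipe (
        id      INTEGER NOT NULL,
        name    TEXT,
        version INTEGER          -- NULL = current, 1 = oldest, 2, 3, ...
    )
""")

def update_recipe(recipe_id, new_name):
    # Demote the current row to the next archive version...
    cur.execute(
        "UPDATE recipe SET version = "
        "(SELECT IFNULL(MAX(version), 0) + 1 FROM recipe WHERE id = ?) "
        "WHERE id = ? AND version IS NULL",
        (recipe_id, recipe_id),
    )
    # ...and insert the new current row with version = NULL.
    cur.execute(
        "INSERT INTO recipe (id, name, version) VALUES (?, ?, NULL)",
        (recipe_id, new_name),
    )
    conn.commit()

# Current rows are simply: SELECT * FROM recipe WHERE version IS NULL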
Take a look at cqrs and Greg Young's event sourcing. I also have a blog post about working in meta events that pin point schema changes within the river of business events.
http://adventuresinagile.blogspot.com/2009/09/rewind-button-for-your-application.html
If you look through my blog, you'll also find version script schemes and you can source code control those.

Can reading a list from a disk be better than loading a dictionary?

I am building an application where I am trying to allow users to submit a list of company and date pairs and find out whether or not there was a news event on that date. The news events are stored in a dictionary with a company identifier and a date as a key.
newsDict[('identifier', 'MM/DD/YYYY')] = [list of news events for that date]
The dictionary turned out to be much larger than I thought-too big even to build it in memory so I broke it down into three pieces, each piece is limited to a particular range of company identifiers.
My plan was to take the user-submitted list, use a dictionary to group the company identifiers by which newsDict their events would be expected to be found in, and then load the newsDicts one after another to get the values.
Well, now I am wondering if it would not be better to keep the news events in a list, with each item of the list being a two-element sublist: a tuple and another list.
[('identifier','MM/DD/YYYY'),[list of news events for that date]]
My thought then is that I would have a dictionary holding the range of the list for each company identifier:
companyDict['identifier']=(begofRangeinListforComp,endofRangeinListforComp)
I would use the user input to look up the ranges I needed and construct a list of the identifiers and ranges sorted by the ranges. Then I would just read the appropriate section of the list to get the data and construct the output.
The biggest reason I see for this is that, even with the dictionary broken into thirds, each section takes about two minutes to load on my machine, and the dictionary ends up taking about 600 to 750 MB of RAM.
I was surprised to note that a list of eight million lines took only about 15 seconds to load and used about 1/3 of the memory of the dictionary that had 1/3 the entries.
Further, since I can discard the lines in the list as I work through it, I will be freeing memory as I work down the user list.
I am surprised, as I thought a dictionary would be the most efficient way to do this, but my poking at it suggests that the dictionary requires significantly more memory than a list. My reading of other posts on SO and elsewhere suggests that any other structure is going to require pointer allocations that are more expensive than list pointers. Am I missing something here, and is there a better way to do this?
After reading Alberto's answer and response to my comment I spent some time trying to figure out how to write the function if I were to use a db. Now I might be hobbled here because I don't know much about db programming but
I think the code to implement using a db would be much more complicated than:
outList = []
with open('theFile', 'r') as massiveFile:
    allLines = massiveFile.readlines()  # a file object can't be sliced directly

for identifier in sortedUserList:  # the user list, sorted by the key of the dictionary
    begin = theDict[identifier]['beginPosit']
    end = theDict[identifier]['endPosit'] + 1
    identifierList = allLines[begin:end]
    for item in identifierList:
        if item.startswith(identifier):  # placeholder for whatever manipulation of the identifier is needed
            outList.append(item)
I would have to wrap this in a function, but I didn't see anything that would be comparably simple if I converted the list to a db.
Of course simpler was not the reason to bring me to this forum. I still don't see that using another structure will cost less memory. I have 30000 company identifiers and approximately 3600 dates. Each item in my list is an object in the parlance of OOD. That is where I am struggling: I spent six hours this morning organizing the data for a dictionary before I gave up. Spending that amount of time to implement a database, only to find that I am using half a gig or more of someone else's memory to load it, seems problematic.
With such a large amount of data, you should be using a database. This would be far better than looking at a list, and would be the most appropriate way of storing your data anyway. If you're using Python, it has SQLite built in I believe.
The dictionary will take more memory because it is effectively a hash table.
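To give a feel for how little code the SQLite route takes, here is a rough sketch; the table and column names are made up for illustration:

import sqlite3

conn = sqlite3.connect("news.db")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS news_event (
        identifier TEXT,
        event_date TEXT,   -- 'MM/DD/YYYY'
        event      TEXT
    )
""")
cur.execute(
    "CREATE INDEX IF NOT EXISTS idx_news ON news_event (identifier, event_date)"
)

def events_for(identifier, date):
    # Returns the list of news events for that company/date pair, if any.
    rows = cur.execute(
        "SELECT event FROM news_event WHERE identifier = ? AND event_date = ?",
        (identifier, date),
    ).fetchall()
    return [r[0] for r in rows]

The data stays on disk and only the matching rows are pulled into memory, which sidesteps the 600-750 MB problem entirely.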
You don't have to go so far as using a database, since your lookup requirements are so simple. Just use the file system.
Create a directory structure based on the company name (or ticker), with subdirectories for each date. To find whether data exists and load it up, just form the name of the subdirectory where the data would be, and see if it exists.
E.g., IBM news for May 21 would be in C:\db\IBM\20090521\news.txt, if in fact there were news for that day. You just check if the file exists; no searches.
If you want to try and boost speed from there, come up with a scheme to cache a limited amount of results that are likely to be frequently requested (assuming you're operating a server). For that, you'd use a hash.
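A minimal sketch of that lookup, following the directory layout in the example above (the paths are assumptions):

import os

def news_for(ticker, yyyymmdd, root=r"C:\db"):
    # The path itself is the index: no searching, just an existence check.
    path = os.path.join(root, ticker, yyyymmdd, "news.txt")
    if not os.path.exists(path):
        return []  # no news that day
    with open(path, "r") as f:
        return f.read().splitlines()

news_for("IBM", "20090521")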
