I have a database table that is populated by a long-running process. This process reads external data and updates the records in the database. Instead of updating the records in place, it is easier to cascade-delete them and recreate them; that way all the dependencies get cleaned up too.
Each record has a unique name. I need a way to generate identifiers for these records such that the same name is always mapped to the same identifier, so that the identifier stays the same when the record is deleted and recreated. I tried using slugs, but they can become very long, and Django's SlugField does not always work.
Is it reasonable to use a secure hash as the key? I could create a hash from the slug and use that. Or is that too expensive?
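Concretely, I mean something like this -- a rough sketch using SHA-256 from Python's hashlib (the function name is just illustrative):

import hashlib

def stable_key(name):
    # Deterministic: the same name always produces the same digest,
    # so the identifier survives a delete-and-recreate cycle.
    return hashlib.sha256(name.encode("utf-8")).hexdigest()

stable_key("some-unique-name")  # same 64-character hex string on every run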
Related
I have a script that repopulates a large database and needs to look up or generate ID values from other tables as it goes.
An example would be recording order information when given only customer names. I check whether the customer exists in a CUSTOMER table; if so, I run a SELECT query to get the ID and insert the new record. Otherwise, I create a new CUSTOMER entry and use Last_Insert_Id().
Since these values repeat a lot and I don't always need to generate a new ID -- would it be better to store the CUSTOMER => ID relationship in a dictionary that gets checked before reaching the database, or should the script requery the database every time? I'm thinking the first approach is best since it reduces load on the database, but I'm concerned about how large the ID dictionary would get and what the impact of that would be.
The script is running on the same box as the database, so network delays are negligible.
"Is it more efficient"?
Well, a dictionary stores its values in a hash table, which should make looking up a value quite efficient.
The major downside is maintaining the dictionary. If you know the database is not going to be updated, then you can load it once and the in-application memory operations are probably going to be faster than anything you can do with a database.
However, if the data is changing, then you have a real challenge. How do you keep the memory version aligned with the database version? This can be very tricky.
My advice would be to keep the work in the database, with an index on the lookup key. This should be fast enough for your application. If you need to eke out further speed, then a dictionary is one possibility -- but no doubt one possibility out of many -- for improving the application's performance.
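If you do try the dictionary route, a minimal sketch of the check-cache-then-query pattern could look like this (the table and column names are assumptions, and sqlite3 merely stands in for your database driver):

import sqlite3

_customer_ids = {}  # name -> id, consulted before touching the database

def get_customer_id(conn, name):
    # Serve repeated names from memory.
    if name in _customer_ids:
        return _customer_ids[name]
    row = conn.execute("SELECT id FROM customer WHERE name = ?", (name,)).fetchone()
    if row is not None:
        customer_id = row[0]
    else:
        cur = conn.execute("INSERT INTO customer (name) VALUES (?)", (name,))
        conn.commit()
        customer_id = cur.lastrowid
    _customer_ids[name] = customer_id
    return customer_id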
I have a web application that accesses large amounts of JSON data.
I want to use a key value database for storing JSON data owned/shared by different users of the web application (not users of the database). Each user should only be able to access the records they own or share.
In a relational database, I would add an Owner column to the record table, or manage shared ownership in a separate table, and check access on the application side (Python). For key-value stores, two approaches come to mind.
User ID as part of the key
What if I use keys like USERID_RECORDID and then write code to check the USERID before accessing the record? Is that a good idea? It wouldn't work with records that are shared between users.
User ID as part of the value
I could store one or more USERIDs in the value data and check if the data contains the ID of the user trying to access the record. Performance is probably slower than having the user ID as part of the key, but shared ownerships are possible.
What are typical patterns to do what I am trying to do?
Both of the solutions you described have some limitations.
You point out yourself that including the owner ID in the key does not solve the problem of shared data. However, this solution may be acceptable if you add another key/value pair containing the IDs of the contents shared with each user (key: userId:shared, value: [id1, id2, id3...]).
Your second proposal, in which you include the list of users who were granted access to a given content, is fine if and only if your application needs to query for the list of users who have access to a particular content. If what you need is to list all the contents a given user can access, this design will lead to poor performance, as the K/V store will have to scan every record, and this type of database engine usually doesn't let you create an index to optimise that kind of request.
From a more general point of view, with NoSQL databases and especially key/value stores, the model has to be designed around the requests the application will make. This may lead you to duplicate some information, and the application then has the responsibility of keeping that data consistent.
For example, if you need to get all contents for a given user, whether that user owns them or they were shared with him, I suggest creating a key for the user containing the list of content IDs for that user, as already described. But if your app also needs the list of users allowed to access a given content, you should add their IDs to a field of that content. This would result in something like:
key: contentId, value: { ..., [userId1, userId2, ...] }
When you revoke a user's access to a given content, your app (and not the datastore) has to remove the userId from the content's value, and the contentId from that user's list of contents.
This design may require your app to make multiple requests: for example, one to get the list of userIds allowed to access a given content, and one or more to get those users' profiles. However, this should not really be a problem, as K/V stores usually offer very high performance.
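For example, a rough sketch of this double bookkeeping with the redis-py client -- the key layout (user:&lt;id&gt;:contents and content:&lt;id&gt;:users) is only an assumption:

import redis

r = redis.Redis()

def share(content_id, user_id):
    # The application, not the datastore, keeps both directions in sync.
    r.sadd("user:%s:contents" % user_id, content_id)
    r.sadd("content:%s:users" % content_id, user_id)

def revoke(content_id, user_id):
    r.srem("user:%s:contents" % user_id, content_id)
    r.srem("content:%s:users" % content_id, user_id)

def contents_for(user_id):
    return r.smembers("user:%s:contents" % user_id)

def users_for(content_id):
    return r.smembers("content:%s:users" % content_id)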
Background
I am looking for a way to dump the results of MySQL queries made with Python & Peewee to an excel file, including database column headers. I'd like the exported content to be laid out in a near-identical order to the columns in the database. Furthermore, I'd like a way for this to work across multiple similar databases that may have slightly differing fields. To clarify, one database may have a user table containing "User, PasswordHash, DOB, [...]", while another has "User, PasswordHash, Name, DOB, [...]".
The Problem
My primary problem is getting the column headers out in an ordered fashion. All attempts thus far have resulted in unordered results, and all of them are less than elegant.
Second, my methodology thus far has resulted in code which I'd (personally) hate to maintain, which I know is a bad sign.
Work so far
At present, I have used Peewee's pwiz.py script to generate the models for each of the preexisting tables in the target databases, then went through and entered all the primary and foreign keys. The relations are set up, and some brief tests showed they're associating properly.
Code: I've managed to get the column headers out using something similar to:
for i, column in enumerate(User._meta.get_field_names()):
    ws.cell(row=0, column=i).value = column
As mentioned, this is unordered. Also, doing it this way forces me to do something along the lines of
getattr(some_object, title)
to dynamically populate the fields accordingly.
Thoughts and Possible Solutions
Manually write out the order I want in an array, and use that to loop through and populate the data. The pro is very strict/granular control. The con is that I'd need to specify this for every database.
Create (whether manually or via a method) a hash of fields with an associated weight for every field I might encounter, then write a method that sorts "_meta.get_field_names()" by weight (see the sketch below). The con is that the columns may not end up 100% in the right order, such as Name coming before DOB in one DB while after it in another.
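Roughly, the weighting idea would look like this (a sketch with a made-up weight table; fields I haven't listed would simply sort last):

FIELD_WEIGHTS = {"User": 0, "PasswordHash": 1, "Name": 2, "DOB": 3}

def sorted_field_names(model):
    names = model._meta.get_field_names()
    # Unknown fields get a large default weight so they sort last.
    return sorted(names, key=lambda n: FIELD_WEIGHTS.get(n, len(FIELD_WEIGHTS)))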
Feel free to tell me I'm doing it all wrong or to suggest completely different ways of doing this; I'm all ears. I'm very much new to Python and Peewee (and ORMs in general, actually). I could switch back to Perl and do the database querying via DBI with little to no hassle, but its Excel libraries would cause me just as many problems, and I'd like to take this as a chance to expand my knowledge.
There is a method on the model meta you can use:
for field in User._meta.get_sorted_fields():
    print(field.name)
This will print the field names in the order they are declared on the model.
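Combined with the header loop from the question, a sketch might look like this (assuming openpyxl, whose rows and columns are 1-indexed, and reusing getattr for the cell values):

fields = User._meta.get_sorted_fields()
# Headers in declaration order.
for col, field in enumerate(fields, start=1):
    ws.cell(row=1, column=col).value = field.name

# Data rows, one per model instance, columns in the same order.
for row_num, user in enumerate(User.select(), start=2):
    for col, field in enumerate(fields, start=1):
        ws.cell(row=row_num, column=col).value = getattr(user, field.name)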
For example, I have a user object stored in a database (Redis).
It has several fields:
String nick
String password
String email
List posts
List comments
Set followers
and so on...
In my Python program I have a class (User) with the same fields for this object. Instances of this class map to objects in the database. The question is how to get the data from the DB for the best performance:
1. Load the values for every field when the instance is created, and initialize the fields with them.
2. Load a field's value each time that field is requested.
3. Like the second option, but after loading, replace the field property with the loaded value.
P.S. Redis runs on localhost.
The method entirely depends on the requirements.
If there is only one client reading and modifying the properties, this is a rather simple problem. When modifying data, just change the instance attributes in your current Python program and -- at the same time -- keep the DB in sync while keeping your program responsive. To that end, you should outsource blocking calls to another thread or make use of greenlets. If there is only one client, there definitely is no need to fetch a property from the DB on each value lookup.
If there are multiple clients reading the data and only one client modifying it, you have to think about which level of synchronization you need. If you need 100% synchronization, you will have to fetch the data from the DB on each value lookup.
If there are multiple clients changing the data in the database, you had better look into a rock-solid, industry-standard solution rather than writing your own DB cache/mapper.
Your distinction between (2) and (3) does not really make sense: if you fetch the data on every lookup, there is no need to 'store' it. You see, once multiple clients can be involved, these things quickly become quite complex, and it's really hard to get right.
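For completeness, your option (3) -- fetch on first access, then cache on the instance -- can be sketched with redis-py like this (the user:&lt;id&gt; hash layout is an assumption):

import redis

r = redis.Redis()

class User(object):
    def __init__(self, user_id):
        self.user_id = user_id

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, i.e. on first access.
        if name == "user_id":
            raise AttributeError(name)  # guard against recursion
        value = r.hget("user:%s" % self.user_id, name)
        if value is None:
            raise AttributeError(name)
        setattr(self, name, value)  # cache it: later lookups bypass Redis
        return value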
I'm designing a Python application that works with a database. I'm planning to use SQLite.
There are 15,000 objects, and each object has a few attributes. Every day I need to add some data for each object (maybe by creating a column with the date as its name).
However, I would also like to easily delete data that is too old, and it is very hard to delete columns in SQLite (it might also be slow, because I would need to copy the required columns into a new table and then drop the old one).
Is there a better way to organize this data than creating a column for every date? Or should I use something other than SQLite?
It'll probably be easiest to separate your data into two tables like so:
CREATE TABLE object(
id INTEGER PRIMARY KEY,
...
);
CREATE TABLE extra_data(
objectid INTEGER,
date DATETIME,
...
FOREIGN KEY(objectid) REFERENCES object(id)
);
This way, when you need to delete all of your entries from a given date, it's as easy as:
DELETE FROM extra_data WHERE date = curdate;
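From Python that might look like this (a sketch using the standard sqlite3 module; the database file name and cutoff date are made up):

import sqlite3

conn = sqlite3.connect("objects.db")
# Delete everything recorded before the cutoff in a single statement.
conn.execute("DELETE FROM extra_data WHERE date < ?", ("2015-01-01",))
conn.commit()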
I would try to avoid altering tables all the time; it usually indicates a bad design.
For a database of that size, I would use something else. I once used SQLite for a media library with about 10k objects and it was slow: around 5 minutes to query and display everything, and searches were painful. Switching to Postgres made life so much easier. This is on the performance issue alone.
It might also be better to create an indexed table that contains the date, the data/column you want to add, and a PK reference to the object it belongs to, and use that for your deletions instead of altering the table all the time. In SQLite you can do this by giving that column an INT type and storing the object's PK in it yourself, instead of relying on a declared foreign key the way you would with MySQL/Postgres.
If your database is pretty much a collection of almost-homogeneous data, you could just as well go for a simpler key-value database. If the main operation you perform is scanning through everything, it would perform significantly better.
Python's standard library has bindings for the popular ones via "anydbm". There is also a dict-imitating proxy over anydbm in shelve. You could pickle your objects with their attributes using any serializer you want (simplejson, yaml, pickle).
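A tiny sketch of the shelve route (the file name and stored attributes are made up):

import shelve

db = shelve.open("objects_db")
# Values are pickled transparently; keys must be strings.
db["object-1"] = {"name": "foo", "2015-06-01": 42.0}
print(db["object-1"])  # {'name': 'foo', '2015-06-01': 42.0}
db.close()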