There are two databases, Db_A and Db_B, each with their own data dictionary. Most of the data in my database, Db_A, will fit in some field or another of the target database Db_B. Many values from Db_A will require reformatting before being inserted into fields in Db_B, and some values to be inserted into Db_B will need to be derived from multiple fields in Db_A. Very few fields in Db_A will be transferable to Db_B without at least some processing. Some fields may require a lot of processing (especially those which are derived). Unfortunately the processing steps are not very consistent. Each field will essentially require its own unique conversion.
In other words, I have a large set of fields. Each field needs to be processed in a specific way. These fields may change and the way they need to be processed may change. What is the best way of implementing this system?
One way I've done this in the past was to have a central function which loops through each field, calling that field's conversion function. I created one function per field and used a CSV file to map fields to functions and to the parameters those functions need. That way, if a new field is created, I can just update the CSV file and write a function to handle its conversion. If the way a field is converted needs to change, I can just change the corresponding function.
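Roughly, a sketch of what I mean; the field names, converter functions, and CSV layout below are made up purely for illustration:

# Each row of the mapping file names a target Db_B field, the converter to use,
# and the Db_A source fields it reads from (semicolon-separated).
import csv

def convert_date(row, source_fields):
    # reformat a single source value, e.g. 'DD/MM/YYYY' -> 'YYYY-MM-DD'
    d, m, y = row[source_fields[0]].split('/')
    return '%s-%s-%s' % (y, m, d)

def convert_full_name(row, source_fields):
    # derive one target value from several source fields
    return ' '.join(row[f] for f in source_fields)

CONVERTERS = {'convert_date': convert_date, 'convert_full_name': convert_full_name}

def convert_record(row, mapping_path='field_map.csv'):
    """Build one Db_B record from one Db_A record using the CSV mapping."""
    target = {}
    with open(mapping_path) as f:
        # mapping columns: target_field, converter, source_fields
        for spec in csv.DictReader(f):
            func = CONVERTERS[spec['converter']]
            target[spec['target_field']] = func(row, spec['source_fields'].split(';'))
    return target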
Is this a good way of doing it? Any suggestions? I'm using Python.
Related
We have the following database schema to store different types of data.
DataDefinition: basic information about the new data.
*FieldDefinition: Every DataDefinition has some fields. Every field has a type, title, etc.; that information is stored here. Every DataDefinition has more than one FieldDefinition associated with it. I have put '*' because we have a lot of different models, one for every kind of field supported.
DataValue, *FieldValues: we store the definition and the values in different models.
With this setup, to retrieve a data object from our database we need to do a lot of queries:
Retrieve the DataDefinition.
Retrieve the DataValue.
Retrieve the *FieldDefinition associated to that DataDefinition.
Retrieve all the *FieldValues associated to those *FieldDefinition.
So, if n is the average number of fields of a DataDefinition, we need to make 2*n+2 queries to the database to retrieve a single value.
We cannot change this setup, but the queries are quite slow. So, to speed things up, I have thought of the following: storing a joined version of the tables. I do not know if this is possible, but I cannot think of any other way. Any suggestions?
Update: we are already using prefetch_related and select_related and it's still slow.
Use case right now: get an entire data object starting from one value:
someValue = SomeTypeValue.objects.filter(value=value).select_related('data_value', 'data_value__data_definition')[0]
# repeated for each *FieldDefinition/*FieldValue model (lookup names are placeholders)
definition = SomeFieldDefinition.objects.filter(some_field_definition__id=someValue.data_value.data_definition.id)[0]
value = SomeFieldValue.objects.filter(some_field_definition__id=definition.id)
And with that info you can now build the entire data object.
Django: 1.11.20
Python: 2.7
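One hedged way to realize the "store a joined version" idea would be a denormalized snapshot table that caches each fully assembled object as serialized text and is refreshed whenever the underlying rows change (via signals or a periodic job); the model and field names here are hypothetical, not taken from the real schema:

import json
from django.db import models

class DataValueSnapshot(models.Model):
    # one row per DataValue, holding the already-joined definition + field values
    data_value = models.OneToOneField('app.DataValue', related_name='snapshot', on_delete=models.CASCADE)
    payload = models.TextField()  # JSON blob assembled once, read back in a single query

    @classmethod
    def rebuild(cls, data_value, assembled_dict):
        # call this whenever the underlying definition or values change
        cls.objects.update_or_create(
            data_value=data_value,
            defaults={'payload': json.dumps(assembled_dict)},
        )

Reading then becomes a single query plus a json.loads() instead of 2*n+2 queries, at the cost of keeping the snapshot in sync.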
I have a model, Reading, which has a foreign key to Type. I'm trying to get a reading for each type that I have, using the following code:
for type in Type.objects.all():
    readings = Reading.objects.filter(type=type.pk)
    if readings.exists():
        reading_list.append(readings[0])
The problem with this, of course, is that it hits the database for each sensor type. I've played around with some queries to try to optimize this to a single database call, but none of them seem efficient. .values, for instance, will provide me a list of readings grouped by type, but it will give me EVERY reading for each type, and I have to filter them with Python in memory. This is out of the question, as we're dealing with potentially millions of readings.
If you use PostgreSQL as your DB backend, you can do this in one line with something like:
Reading.objects.order_by('type__pk', 'any_other_order_field').distinct('type__pk')
Note that the field on which distinct happens must always be the first argument in the order_by method. Feel free to replace type__pk with the actual field you want to order types on (e.g. type__name if the Type model has a name field). You can read more about distinct here: https://docs.djangoproject.com/en/dev/ref/models/querysets/#distinct.
If you do not use PostgreSQL, you could use the prefetch_related method for this purpose:
# reading_set could be replaced with whatever your reverse relation name actually is
for type in Type.objects.prefetch_related('reading_set').all():
    readings = type.reading_set.all()
    if len(readings):
        reading_list.append(readings[0])
The above will perform only 2 queries in total. Note that I use len() so that no extra query is performed when counting the objects. You can read more about prefetch_related here: https://docs.djangoproject.com/en/dev/ref/models/querysets/#prefetch-related.
The downside of this approach is that you first retrieve all related objects from the DB and then just take the first one.
The above code is not tested, but I hope it will at least point you towards the right direction.
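If retrieving every related Reading just to keep the first is a concern, another hedged option (assuming Django 1.11 or newer, which added Subquery/OuterRef, and a placeholder ordering field) is to annotate each Type with the id of its first reading:

from django.db.models import OuterRef, Subquery

# the id of the "first" Reading per Type; replace the ordering with whatever you need
first_reading = (Reading.objects
                 .filter(type=OuterRef('pk'))
                 .order_by('any_other_order_field')
                 .values('pk')[:1])

first_ids = (Type.objects
             .annotate(first_reading_id=Subquery(first_reading))
             .values_list('first_reading_id', flat=True))

# one query for the ids, one for the readings themselves
reading_list = list(Reading.objects.filter(pk__in=[pk for pk in first_ids if pk is not None]))

This is not tested against your models either, so treat it as a sketch rather than a drop-in replacement.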
I am creating an application that needs to allow users to create custom fields, which I think would be best stored in a document-based (basically a serialized dictionary) model field.
I am concerned that I would run into performance issues storing these potentially very large documents in a SQL database, so I thought instead of storing the documents in the SQL database, I would just store pointers in the SQL database to the documents. The documents themselves would then be stored in a separate NoSQL database.
Assuming this structure makes sense, what is the best way to go about constructing a field that stores custom data fields in this manner? Optimally, these custom fields would be accessible on the object as attributes and would be denoted as custom with a "_c" appended to the name. E.g. created_date would become created_date_c on the django model object. I'm thinking custom managers would be best to tackle this: https://docs.djangoproject.com/en/dev/topics/db/managers/
EDIT: My SQL database is MySQL. Also, documents could make more sense as columns (as e.g. in Cassandra). Thoughts on this would be helpful.
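To make the "_c" idea concrete, here is a rough sketch of the kind of thing I have in mind; get_document and the field names are hypothetical, not an existing API:

from django.db import models

def get_document(doc_id):
    """Hypothetical lookup into the separate NoSQL document store."""
    raise NotImplementedError

class Record(models.Model):
    name = models.CharField(max_length=100)
    custom_doc_id = models.CharField(max_length=64)  # pointer to the NoSQL document

    def __getattr__(self, attr):
        # only attributes ending in "_c" are treated as custom fields
        if attr.endswith('_c'):
            document = get_document(self.custom_doc_id)
            try:
                return document[attr[:-2]]  # e.g. created_date_c -> 'created_date'
            except KeyError:
                raise AttributeError(attr)
        raise AttributeError(attr)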
As far as I know, there is no best approach to this task, but I advise you to look into two other options, each with its own pros and cons:
Store your user-defined data in a JSONField or pickled field. This will save you a lot of effort in writing custom managers and maintaining a separate NoSQL store. If you are worried about keeping it alongside your fixed-structure data, store it in a separate model with a one-to-one relationship, in a separate InnoDB file for example.
Store your user data in generalized (field_id, field_name, content_type) and (object_id, field_id, field_value) tables. They can be split by field type (i.e. int, string, float, etc.). This approach won't give you a well-performing data model from scratch, but smart indexing and partitioning can make it worthwhile. And querying your data and enforcing model structure will be a lot easier than with the other approaches.
If you consider using NoSQL or another DB for your mutable content, be sure to choose one that provides means for querying your data efficiently; see this discussion and wiki.
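A compact sketch of both options; the model names, field sizes, and the 'app.Item' relation are assumptions, and on MySQL you would need a third-party JSONField (or a TextField with manual serialization) rather than the PostgreSQL one:

from django.db import models

# Option 1: one serialized field, linked one-to-one to the fixed-structure row.
class CustomData(models.Model):
    owner = models.OneToOneField('app.Item', related_name='custom', on_delete=models.CASCADE)
    data = models.TextField(default='{}')  # JSON-encoded dict of user-defined fields

# Option 2: a generalized, EAV-style layout split into definition and value tables.
class CustomFieldDefinition(models.Model):
    name = models.CharField(max_length=100)
    content_type = models.CharField(max_length=20)  # e.g. 'int', 'string', 'float'

class CustomFieldValue(models.Model):
    field = models.ForeignKey(CustomFieldDefinition, related_name='values', on_delete=models.CASCADE)
    object_id = models.PositiveIntegerField(db_index=True)
    value = models.TextField()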
Background
I am looking for a way to dump the results of MySQL queries made with Python & Peewee to an Excel file, including database column headers. I'd like the exported content to be laid out in a near-identical order to the columns in the database. Furthermore, I'd like a way for this to work across multiple similar databases that may have slightly differing fields. To clarify, one database may have a user table containing "User, PasswordHash, DOB, [...]", while another has "User, PasswordHash, Name, DOB, [...]".
The Problem
My primary problem is getting the column headers out in an ordered fashion. All attempts thus far have produced unordered results, all of which are less than elegant.
Second, my methodology thus far has resulted in code which I'd (personally) hate to maintain, which I know is a bad sign.
Work so far
At present, I have used Peewee's pwiz.py script to generate the models for each of the preexisting database tables in the target databases, then went through and entered all primary and foreign keys. The relations are set up, and some brief tests showed they're associating properly.
Code: I've managed to get the column headers out using something similar to:
for i, column in enumerate(User._meta.get_field_names()):
    ws.cell(row=0, column=i).value = column
As mentioned, this is unordered. Also, doing it this way forces me to do something along the lines of
getattr(some_object, title)
to dynamically populate the fields accordingly.
Thoughts and Possible Solutions
Manually write out the order I want the fields in as an array, and use that for looping through and populating the data. The pro of this is very strict/granular control. The con is that I'd need to specify this for every database.
Create (whether manually or via a method) a hash of fields with an associated weight for every field that might be encountered, then write a method for sorting "_meta.get_field_names()" according to weight. The con of this is that the columns may not be 100% in the right order, such as Name coming before DOB in one DB but after it in another.
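A rough sketch of that weighting idea, with made-up weights; unknown fields would simply sort to the end:

# lower weight = earlier column; anything not listed goes last
FIELD_WEIGHTS = {'User': 0, 'PasswordHash': 1, 'Name': 2, 'DOB': 3}

ordered_fields = sorted(User._meta.get_field_names(),
                        key=lambda name: FIELD_WEIGHTS.get(name, len(FIELD_WEIGHTS)))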
Feel free to tell me I'm doing it all wrong or to suggest completely different ways of doing this; I'm all ears. I'm very much new to Python and Peewee (and ORMs in general, actually). I could switch back to Perl and do the database querying via DBI with little to no hassle; however, its libraries for Excel would cause me just as many problems, and I'd like to take this as a chance to expand my knowledge.
There is a method on the model meta you can use:
for field in User._meta.get_sorted_fields():
    print field.name
This will print the field names in the order they are declared on the model.
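Building on that, a sketch of an ordered export with openpyxl; it assumes get_sorted_fields() yields field objects (as used above) and that the whole table fits in memory:

from openpyxl import Workbook

def dump_model(model, worksheet):
    fields = model._meta.get_sorted_fields()  # declaration order
    for col, field in enumerate(fields, start=1):
        worksheet.cell(row=1, column=col).value = field.name
    for row, instance in enumerate(model.select(), start=2):
        for col, field in enumerate(fields, start=1):
            worksheet.cell(row=row, column=col).value = getattr(instance, field.name)

wb = Workbook()
dump_model(User, wb.active)
wb.save('users.xlsx')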
For example, I have a user object stored in a database (Redis).
It has several fields:
String nick
String password
String email
List posts
List comments
Set followers
and so on...
In my Python program I have a class (User) with the same fields for this object. Instances of this class map to objects in the database. The question is how to get the data from the DB for the best performance:
Load the values for every field when the instance is created, and initialize the fields with them.
Load a field's value each time that field is requested.
As in the second option, but after loading a value, replace the field property with the loaded value (i.e. cache it).
P.S. Redis runs on localhost.
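To make the three options concrete, a minimal sketch using redis-py, assuming the scalar fields live in a hash keyed like 'user:<nick>' (the key layout is made up for illustration):

import redis

r = redis.StrictRedis(host='localhost', decode_responses=True)

class User(object):
    SCALAR_FIELDS = ('password', 'email')

    def __init__(self, nick, eager=False):
        self.nick = nick
        if eager:  # option 1: load everything up front
            data = r.hgetall('user:%s' % nick)
            for field in self.SCALAR_FIELDS:
                setattr(self, field, data.get(field))

    def __getattr__(self, name):
        # options 2/3: load a field on first access...
        if name in self.SCALAR_FIELDS:
            value = r.hget('user:%s' % self.nick, name)
            setattr(self, name, value)  # ...and cache it (option 3); drop this line for option 2
            return value
        raise AttributeError(name)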
The method entirely depends on the requirements.
If there is only one client reading and modifying the properties, this is a rather simple problem. When modifying data, just change the instance attributes in your current Python program and -- at the same time -- keep the DB in sync while keeping your program responsive. To that end, you should outsource blocking calls to another thread or make use of greenlets. If there is only one client, there definitely is no need to fetch a property from the DB on each value lookup.
If there are multiple clients reading the data and only one client modifying the data, you have to think about which level of synchronization you need. If you need 100 % synchronization, you will have to fetch data from the DB on each value lookup.
If there are multiple clients changing the data in the database, you should look into a rock-solid, industry-standard solution rather than writing your own DB cache/mapper.
Your distinction between (2) and (3) does not really make sense: if you fetch data on every lookup, there is no need to 'store' it. You see, once multiple clients can be involved, these things quickly become quite complex, and it's really hard to get right.