I'm designing a python application which works with a database. I'm planning to use sqlite.
There are 15000 objects, and each object has a few attributes. every day I need to add some data for each object.(Maybe create a column with the date as its name).
However, I would like to easily delete the data which is too old but it is very hard to delete columns using sqlite(and it might be slow because I need to copy the required columns and then delete the old table)
Is there a better way to organize this data other than creating a column for every date? Or should I use something other than sqlite?
It'll probably be easiest to separate your data into two tables like so:
CREATE TABLE object(
id INTEGER PRIMARY KEY,
...
);
CREATE TABLE extra_data(
objectid INTEGER,
date DATETIME,
...
FOREIGN KEY(objectid) REFERENCES object(id)
);
This way when you need to delete all of your entries from a date it'll be an easy:
DELETE FROM extra_data WHERE date = curdate;
I would try and avoid altering tables all the time, usually indicates a bad design.
For that size of a db, I would use something else. I've used sqlite once for a media library with about 10k objects and it was slow, like 5 minutes to query it all and display, searches were :/, switching to postgres made life so much easier. This is just on the performance issue only.
It also might be better to create an index that contains the date and the data/column you want to add and a pk reference to the object it belongs and use that for your deletions instead of altering the table all the time. This can be done in sqlite if you give the pk an int type and save the pk of the object to it, instead of using a Foreign Key like you would with mysql/postgres.
If your database is pretty much a collection of almost-homogenic data, you could as well go for a simpler key-value database. If the main action you perform on the data is scanning through everything, it would perform significantly better.
Python library has bindings for popular ones as "anydbm". There is also a dict-imitating proxy over anydbm in shelve. You could pickle your objects with the attributes using any serializer you want (simplejson, yaml, pickle)
Related
we have the following data base schema to store different types of data.
DataDefinition: basic information about the new data.
*FieldDefinition: Every DataDefinition has some fields. Every field has a type, title, etc, that information is stored here. Every DataDefinition has more than one FieldDefinition associated. I have put '' because we have a lot of different models, one for every kind of field supported.
DataValue, *FieldValues: we store the definition and the values in different models.
With this setup, to retrieve a data from our database we need to do a lot of queries:
Retrieve the DataDefinition.
Retrieve the DataValue.
Retrieve the *FieldDefinition associated to that DataDefinition.
Retrieve all the *FieldValues associated to those *FieldDefinition.
So, if n is the average number of fields of a DataDefinition, we need to make 2*n+2 queries to the database to retrieve a single value.
We cannot change this setup, but queries are quite slow. So to speed it up I have thought the following: store a joined version of the tables. I do not know if this is possible but I cannot think of any other way. Any suggestion?
Update: we are already using prefetch_related and select_related and it's still slow.
Use Case Right now: get an entire data object from the one object value:
someValue = SomeTypeValue.objects.filter(value=value).select_related('DataValue', 'DataDefinition')[0]
# for each *FieldDefinition/*FieldValue model
definition = SomeFieldDefinition.objects.filter(*field_definition__id=someValue.data_value.data_definition.id)
value = SomeFieldValue.objects.filter(*field_definition__id=definition.id)
And with that info you can now build the entire data object.
Django: 1.11.20
Python: 2.7
I have a script that repopulates a large database and would generate id values from other tables when needed.
Example would be recording order information when given customer names only. I would check to see if the customer exists in a CUSTOMER table. If so, SELECT query to get his ID and insert the new record. Else I would create a new CUSTOMER entry and get the Last_Insert_Id().
Since these values duplicate a lot and I don't always need to generate a new ID -- Would it be better for me to store the ID => CUSTOMER relationship as a dictionary that gets checked before reaching the database or should I make the script constantly requery the database? I'm thinking the first approach is the best approach since it reduces load on the database, but I'm concerned for how large the ID Dictionary would get and the impacts of that.
The script is running on the same box as the database, so network delays are negligible.
"Is it more efficient"?
Well, a dictionary is storing the values in a hash table. This should be quite efficient for looking up a value.
The major downside is maintaining the dictionary. If you know the database is not going to be updated, then you can load it once and the in-application memory operations are probably going to be faster than anything you can do with a database.
However, if the data is changing, then you have a real challenge. How do you keep the memory version aligned with the database version? This can be very tricky.
My advice would be to keep the work in the database, using indexes for the dictionary key. This should be fast enough for your application. If you need to eke out further speed, then using a dictionary is one possibility -- but no doubt, one possibility out of many -- for improving the application performance.
I have a model, Reading, which has a foreign key, Type. I'm trying to get a reading for each type that I have, using the following code:
for type in Type.objects.all():
readings = Reading.objects.filter(
type=type.pk)
if readings.exists():
reading_list.append(readings[0])
The problem with this, of course, is that it hits the database for each sensor reading. I've played around with some queries to try to optimize this to a single database call, but none of them seem efficient. .values for instance will provide me a list of readings grouped by type, but it will give me EVERY reading for each type, and I have to filter them with Python in memory. This is out of the question, as we're dealing with potentially millions of readings.
if you use PostgreSQL as your DB backend you can do this in one-line with something like:
Reading.objects.order_by('type__pk', 'any_other_order_field').distinct('type__pk')
Note that the field on which distinct happens must always be the first argument in the order_by method. Feel free to change type__pk with the actuall field you want to order types on (e.g. type__name if the Type model has a name property). You can read more about distinct here https://docs.djangoproject.com/en/dev/ref/models/querysets/#distinct.
If you do not use PostgreSQL you could use the prefetch_related method for this purpose:
#reading_set could be replaced with whatever your reverse relation name actually is
for type in Type.objects.prefetch_related('reading_set').all():
readings = type.reading_set.all()
if len(readings):
reading_list.append(readings[0])
The above will perform only 2 queries in total. Note I use len() so that no extra query is performed when counting the objects. You can read more about prefetch_related here https://docs.djangoproject.com/en/dev/ref/models/querysets/#prefetch-related.
On the downside of this approach is you first retrieve all related objects from the DB and then just get the first.
The above code is not tested, but I hope it will at least point you towards the right direction.
Let's say there is a table of People. and let's say that are 1000+ in the system. Each People item has the following fields: name, email, occupation, etc.
And we want to allow a People item to have a list of names (nicknames & such) where no other data is associated with the name - a name is just a string.
Is this exactly what the pickleType is for? what kind of performance benefits are there between using pickle type and creating a Name table to have the name field of People be a one-to-many kind of relationship?
Yes, this is one good use case of sqlalchemy's PickleType field, documented very well here. There are obvious performance advantages to using this.
Using your example, assume you have a People item which uses a one to many database look. This requires the database to perform a JOIN to collect the sub-elements; in this case, the Person's nicknames, if any. However, you have the benefit of having native objects ready to use in your python code, without the cost of deserializing pickles.
In comparison, the list of strings can be pickled and stored as a PickleType in the database, which are internally stores as a LargeBinary. Querying for a Person will only require the database to hit a single table, with no JOINs which will result in an extremely fast return of data. However, you now incur the "cost" of de-pickling each item back into a python object, which can be significant if you're not storing native datatypes; e.g. string, int, list, dict.
Additionally, by storing pickles in the database, you also lose the ability for the underlying database to filter results given a WHERE condition; especially with integers and datetime objects. A native database call can return values within a given numeric or date range, but will have no concept of what the string representing these items really is.
Lastly, a simple change to a single pickle could allow arbitrary code execution within your application. It's unlikely, but must be stated.
IMHO, storing pickles is a nice way to store certain types of data, but will vary greatly on the type of data. I can tell you we use it pretty extensively in our schema, even on several tables with over half a billions records quite nicely.
Background
I am looking for a way to dump the results of MySQL queries made with Python & Peewee to an excel file, including database column headers. I'd like the exported content to be laid out in a near-identical order to the columns in the database. Furthermore, I'd like a way for this to work across multiple similar databases that may have slightly differing fields. To clarify, one database may have a user table containing "User, PasswordHash, DOB, [...]", while another has "User, PasswordHash, Name, DOB, [...]".
The Problem
My primary problem is getting the column headers out in an ordered fashion. All attempts thus far have resulted in unordered results, and all of which are less then elegant.
Second, my methodology thus far has resulted in code which I'd (personally) hate to maintain, which I know is a bad sign.
Work so far
At present, I have used Peewee's pwiz.py script to generate the models for each of the preexisting database tables in the target databases, then went and entered all primary and foreign keys. The relations are setup, and some brief tests showed they're associating properly.
Code: I've managed to get the column headers out using something similar to:
for i, column in enumerate(User._meta.get_field_names()):
ws.cell(row=0,column=i).value = column
As mentioned, this is unordered. Also, doing it this way forces me to do something along the lines of
getattr(some_object, title)
to dynamically populate the fields accordingly.
Thoughts and Possible Solutions
Manually write out the order that I want stuff in an array, and use that for looping through and populating data. The pros of this is very strict/granular control. The cons are that I'd need to specify this for every database.
Create (whether manually or via a method) a hash of fields with an associated weighted value for all possibly encountered fields, then write a method for sorting "_meta.get_field_names()" according to weight. The cons of this is that the columns may not be 100% in the right order, such as Name coming before DOB in one DB, while after it in another.
Feel free to tell me I'm doing it all wrong or suggest completely different ways of doing this, I'm all ears. I'm very much new to Python and Peewee (ORMs in general, actually). I could switch back to Perl and do the database querying via DBI with little to no hassle. However, it's libraries for excel would cause me as many problems, and I'd like to take this as a time to expand my knowledge.
There is a method on the model meta you can use:
for field in User._meta.get_sorted_fields():
print field.name
This will print the field names in the order they are declared on the model.