I'm asking whether Django creates database indexes for a ManyToManyField, and if so, whether it also does this for the model I provide in through?
I just want to make sure that the database stays fast when it holds a lot of data.
It does for plain m2m fields - I can't tell for sure for "through" tables since I don't have any in my current project and don't have access to other projects ATM, but it should still be the case since this index is necessary for the UNIQUE constraint.
FWIW you can easily check it yourself by looking at the table definitions in your database (in MySQL / MariaDB, using SHOW CREATE TABLE yourtablename).
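If you'd rather check from Python, here is a minimal sketch using Django's introspection API (the table name myapp_book_authors stands in for your actual m2m join table and is an assumption):

from django.db import connection

with connection.cursor() as cursor:
    # get_constraints() reports indexes, unique constraints and FKs for a table
    constraints = connection.introspection.get_constraints(cursor, 'myapp_book_authors')

for name, info in constraints.items():
    print(name, info['columns'],
          'unique' if info['unique'] else '',
          'index' if info['index'] else '')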
This being said, db indexes are only useful when the column holds mostly distinct values, and can actually degrade performance (depending on your db vendor etc.) when that's not the case - for example, if you have a field with only 3 or 4 possible values (i.e. "gender" or something similar), indexing it might not yield the expected results. This should not be an issue for m2m tables, since the (table1_id, table2_id) pair is supposed to be unique, but the point is that you should not assume that just adding indexes will automagically boost your performance - tuning SQL database performance is a trade in and of itself, and depends a lot on how you actually use your data.
I have customized my Django project to work with MariaDB (MySQL).
It works fine, however I have issues (or concerns) with models.
First of all - I am not sure why I should need them, since for me (personally) it's much easier to use SQL statements to get the data.
Using an API for DB queries might be useful for people who do not know SQL, but for me it's less flexible.
Can anybody explain the main benefits of using models?
Here is one of the issues I have. See the code below.
class Quotes(models.Model):
    updated = models.DateTimeField()
    tdate = models.DateField(default='1900-01-01')
    ticker = models.CharField(max_length=15)
    open = models.FloatField(default=0)
    vol = models.BigIntegerField(default=0)
Why doesn't the program consider 'default' when the DB table and fields are created?
Why is what I define as a FloatField created as 'double' in the DB and not 'float' (I checked this using phpMyAdmin)?
How can I properly set a default value?
My table will have at least 1 million entries.
Do I need to be concerned about performance when using the API instead of direct SQL queries? Usually one query will select 700-800 entries.
Is it a good approach to use MySQLdb and direct SQL instead of models?
Sorry if some questions sound too simple, but I just started with Django. Before this I worked with PHP. The main reason I want to use Python for web page development is a library which I have developed.
Question zero, i.e. why models: Django's models are a nice abstraction on top of relational database tables – most (if not all) web apps end up having (or being) some sort of CRUD where you manipulate objects or graphs of objects saved in the database, so an object-oriented approach is nice to work with there.
In addition, many features in Django (and libraries that work with Django) are built around models (such as the admin, ModelForms, serialization, etc.).
Question 1: That date should preferably be datetime.date(1900, 1, 1), not a string, but that aside, Django deals with defaults on model instantiation, not necessarily in the database.
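A minimal sketch of that fix, reusing the model from the question:

import datetime
from django.db import models

class Quotes(models.Model):
    updated = models.DateTimeField()
    # a real date object instead of the string '1900-01-01'
    tdate = models.DateField(default=datetime.date(1900, 1, 1))
    ticker = models.CharField(max_length=15)
    open = models.FloatField(default=0)
    vol = models.BigIntegerField(default=0)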
Question 2: Because that's how it's mapped, presumably to avoid programmers accidentally losing floating-point precision (since MySQL is rather notorious about doing precision-losing conversions "behind your back").
Question 3: Django's ORM is, to be absolutely honest, not the fastest when it generates queries and instantiates model instances. Most of the time, in regular operations, that's not a problem. Depending on what you're doing with those 700 to 800 instances, you may be able to work around that anyway; for instance, using .values() or .values_list() on a queryset if you don't need the actual instances, just the data.
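For instance, a hedged sketch using .values_list() on the question's model (the ticker value is made up):

rows = Quotes.objects.filter(ticker='IBM').values_list('tdate', 'open', 'vol')
# each row is a plain tuple - no model instances are created
for tdate, open_price, vol in rows:
    print(tdate, open_price, vol)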
Regarding direct SQL, please don't hard-code any MySQLdb calls in a Django app though; Django has very nice "escape hatches" for doing raw SQL:
You can perform .raw() SQL queries that map into models, or if that's not enough,
you can just execute SQL against the database connection like you would with MySQLdb.
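A hedged sketch of both escape hatches (the table name myapp_quotes assumes Django's default app_model naming):

from django.db import connection

# 1. raw() maps the result rows back onto model instances:
for quote in Quotes.objects.raw('SELECT * FROM myapp_quotes WHERE ticker = %s', ['IBM']):
    print(quote.tdate, quote.open)

# 2. Plain SQL against Django's connection, much like MySQLdb:
with connection.cursor() as cursor:
    cursor.execute('SELECT ticker, vol FROM myapp_quotes WHERE vol > %s', [1000000])
    rows = cursor.fetchall()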
Oh, and one more thing: your model name should be singular (Quote) :)
Background
I am looking for a way to dump the results of MySQL queries made with Python & Peewee to an excel file, including database column headers. I'd like the exported content to be laid out in a near-identical order to the columns in the database. Furthermore, I'd like a way for this to work across multiple similar databases that may have slightly differing fields. To clarify, one database may have a user table containing "User, PasswordHash, DOB, [...]", while another has "User, PasswordHash, Name, DOB, [...]".
The Problem
My primary problem is getting the column headers out in an ordered fashion. All attempts thus far have resulted in unordered results, all of which are less than elegant.
Second, my methodology thus far has resulted in code which I'd (personally) hate to maintain, which I know is a bad sign.
Work so far
At present, I have used Peewee's pwiz.py script to generate the models for each of the preexisting database tables in the target databases, then went and entered all primary and foreign keys. The relations are set up, and some brief tests showed they're associating properly.
Code: I've managed to get the column headers out using something similar to:
for i, column in enumerate(User._meta.get_field_names()):
    ws.cell(row=0, column=i).value = column
As mentioned, this is unordered. Also, doing it this way forces me to do something along the lines of
getattr(some_object, title)
to dynamically populate the fields accordingly.
Thoughts and Possible Solutions
Manually write out the order that I want stuff in an array, and use that for looping through and populating data. The pro of this is very strict/granular control. The con is that I'd need to specify this for every database.
Create (whether manually or via a method) a hash of fields with an associated weighted value for all possibly encountered fields, then write a method for sorting "_meta.get_field_names()" according to weight; a rough sketch follows below. The con of this is that the columns may not be 100% in the right order, such as Name coming before DOB in one DB, while after it in another.
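A rough sketch of what option 2 could look like (the weight map is entirely hypothetical):

# unknown fields sort last rather than raising
FIELD_WEIGHTS = {'User': 0, 'PasswordHash': 1, 'Name': 2, 'DOB': 3}

ordered = sorted(User._meta.get_field_names(),
                 key=lambda name: FIELD_WEIGHTS.get(name, len(FIELD_WEIGHTS)))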
Feel free to tell me I'm doing it all wrong or suggest completely different ways of doing this, I'm all ears. I'm very much new to Python and Peewee (ORMs in general, actually). I could switch back to Perl and do the database querying via DBI with little to no hassle. However, its libraries for Excel would cause me as many problems, and I'd like to take this as a time to expand my knowledge.
There is a method on the model meta you can use:
for field in User._meta.get_sorted_fields():
    print(field.name)
This will print the field names in the order they are declared on the model.
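Tying that back to the openpyxl snippet from the question, a hedged sketch (assuming, as the loop above does, that get_sorted_fields() yields field objects with a .name):

for i, field in enumerate(User._meta.get_sorted_fields()):
    ws.cell(row=0, column=i).value = field.name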
I've just started learning Python Django and have a lot of experience building high-traffic websites using PHP and MySQL. What worries me so far is Python's overly optimistic approach: that you will never need to write custom SQL, and that it automatically creates all these Foreign Key relationships in your database. The one thing I've learned in the last few years of building Chess.com is that it's impossible to NOT write custom SQL when you're dealing with something like MySQL, which frequently needs to be told what indexes it should use (or avoid), and that Foreign Keys are a death sentence. Percona's strongest recommendation was for us to remove all FKs for optimal performance.
Is there a way in Django to do this in the models file? create relationships without creating actual DB FKs? Or is there a way to start at the database level, design/create my database, and then have Django reverse engineer the models file?
If you don't want foreign keys, then avoid using
models.ForeignKey(),
models.ManyToManyField(), and
models.OneToOneField().
Django will automatically create an auto-increment int field named id that you can use to refer to individual records, or you can override that by marking a field as primary_key=True.
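A hedged sketch of the resulting pattern (the Book model and author_id_ref field are made up for illustration):

from django.db import models

class Book(models.Model):
    title = models.CharField(max_length=100)
    # plain integer holding the related row's id: no FK constraint is
    # created, but you also lose ORM-level joins and cascades
    author_id_ref = models.IntegerField(db_index=True)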
There is also documentation on running raw SQL queries on the database.
Raw SQL is as easy as this:
for obj in MyModel.objects.raw('SELECT * FROM myapp_mymodel'):
    print(obj)
Denormalizing a database is up to you at model definition time.
You can use non-relational databases (MongoDB, ...) too with Django NonRel
django-admin inspectdb (e.g. python manage.py inspectdb > models.py) allows you to reverse engineer a models file from existing tables. That is only a very partial response to your question ;)
You can just create the models.py and avoid having Django automatically create the tables, leaving it up to you to define the actual tables as you please. So although there are foreign key relationships in models.py, this does not mean that they must exist in the actual tables. This is a very good thing considering how ludicrously foreign key constraints are implemented in MySQL: MyISAM just ignores them and InnoDB creates a non-optional index on every single one regardless of whether it makes sense.
I concur with the 'no foreign keys' advice (with the disclaimer: I also work for Percona).
The reason it is recommended is concurrency / reducing locking internally.
It can be a difficult "optimization" to sell, but if you consider that the database has transactions (and is more or less ACID compliant) then it should only be application-logic errors that cause foreign-key violations. Not to say they don't exist, but if you enable foreign keys in development hopefully you should find at least a few bugs.
In terms of whether or not you need to write custom SQL:
The explanation I usually give is that "optimization rarely decreases complexity". I think it is okay to stick with an ORM by default, but if in a profiler it looks like one particular piece of functionality is taking a lot more time than you suspect it would when written by hand, then you need to be prepared to fix it (assuming the code is called often enough).
The real secret here is that you need good instrumentation / profiling in order to be frugal with your complexity-adding optimization(s).
I have a lot of model classes with relations between them and a CRUD interface to edit. The problem is that some objects can't be deleted since there are other objects referring to them. Sometimes I can set up an ON DELETE rule to handle this case, but in most cases I don't want automatic deletion of related objects until they are unbound manually. Either way, I'd like to present the editor with a list of objects referring to the currently viewed one and highlight those that prevent its deletion due to a FOREIGN KEY constraint. Is there a ready solution to automatically discover referers?
Update
The task seems to be quite common (e.g. the django ORM shows all dependencies), so I am surprised that there is no solution to it yet.
There are two directions suggested:
Enumerate all relations of the current object and go through their backref. But there is no guarantee that all relations have a backref defined. Moreover, there are some cases when a backref is meaningless. Although I could define one everywhere, I don't like doing it that way and it's not reliable.
(Suggested by van and stephan) Check all tables of the MetaData object and collect dependencies from their foreign_keys property (the code of sqlalchemy_schemadisplay can be used as an example, thanks to stephan's comments). This will allow catching all dependencies between tables, but what I need is dependencies between model classes. Some foreign keys are defined in intermediate tables and have no models corresponding to them (used as secondary in relations). Sure, I can go further and find the related model (I have yet to find a way to do it), but it looks too complicated.
Solution
Below is a method of the base model class (designed for the declarative extension) that I use as a solution. It is not perfect and doesn't meet all my requirements, but it works for the current state of my project. The result is collected as a dictionary of dictionaries, so I can show them grouped by objects and their properties. I haven't decided yet whether that's a good idea, since the list of referers is sometimes huge and I'm forced to limit it to some reasonable number.
from sqlalchemy.orm import object_session, class_mapper
from sqlalchemy.orm.util import identity_key
from sqlalchemy.orm.properties import PropertyLoader  # RelationshipProperty in later SQLAlchemy

def _get_referers(self):
    db = object_session(self)
    cls, ident = identity_key(instance=self)
    metadata = cls.__table__.metadata
    result = {}
    # _mapped_models is my extension. It is collected by metaclass, so I didn't
    # look for other ways to find all model classes.
    for other_class in metadata._mapped_models:
        queries = {}
        for prop in class_mapper(other_class).iterate_properties:
            if not (isinstance(prop, PropertyLoader) and
                    issubclass(cls, prop.mapper.class_)):
                continue
            query = db.query(prop.parent)
            comp = prop.comparator
            if prop.uselist:
                query = query.filter(comp.contains(self))
            else:
                query = query.filter(comp == self)
            count = query.count()
            if count:
                queries[prop] = (count, query)
        if queries:
            result[other_class] = queries
    return result
Thanks to all who helped me, especially stephan and van.
SQL: I have to absolutely disagree with S.Lott's answer.
I am not aware of an out-of-the-box solution, but it is definitely possible to discover all the tables that have ForeignKey constraints to a given table. One needs to properly use the INFORMATION_SCHEMA views such as REFERENTIAL_CONSTRAINTS, KEY_COLUMN_USAGE, TABLE_CONSTRAINTS, etc. See the SQL Server example. With some limitations and extensions, most modern relational databases support the INFORMATION_SCHEMA standard. When you have all the FK information and the object (row) in the table, it is a matter of running a few SELECT statements to get all the other rows in other tables that refer to the given row and prevent it from being deleted.
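A hedged sketch of that approach against MySQL/MariaDB (the view and column names follow MySQL's INFORMATION_SCHEMA and vary slightly by vendor; the connection URL is made up):

from sqlalchemy import create_engine, text

engine = create_engine('mysql://user:password@localhost/mydb')
with engine.connect() as conn:
    # every FK column, in any table, that references the given table
    rows = conn.execute(text("""
        SELECT TABLE_NAME, COLUMN_NAME, CONSTRAINT_NAME
        FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE
        WHERE REFERENCED_TABLE_NAME = :target
    """), {'target': 'mytable'})
    for table_name, column_name, constraint_name in rows:
        print(table_name, column_name, constraint_name)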
SqlAlchemy: As noted by stephan in his comment, if you use the orm with backref for relations, then it should be quite easy for you to get the list of parent objects that keep a reference to the object you are trying to delete, because those objects are basically mapped properties of your object (child1.Parent).
If you work with SQLAlchemy's Table objects (or do not always use backref for relations), then you would have to get the values of foreign_keys for all the tables, and then for all those ForeignKeys call the references(...) method, providing your table as a parameter. In this way you will find all the FKs (and tables) that have a reference to the table your object maps to. Then you can query all the objects that keep a reference to your object by constructing a query for each of those FKs.
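A minimal sketch of that Table-level scan (metadata is the SQLAlchemy MetaData object holding your tables):

def tables_referring_to(target_table, metadata):
    referers = []
    for table in metadata.sorted_tables:
        for fk in table.foreign_keys:
            # ForeignKey.references() is true when the FK points at target_table
            if fk.references(target_table):
                referers.append((table, fk))
    return referers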
In general, there's no way to "discover" all of the references in a relational database.
In some databases, they may use declarative referential integrity in the form of explicit Foreign Key or Check constraints.
But there's no requirement to do this. It can be incomplete or inconsistent.
Any query can include a FK relationship that is not declared. Without the universe of all queries, you can't know the relationships which are used but not declared.
To find "referers" in general, you must actually know the database design and have all queries.
For each model class, you can easily see if all its one-to-many relations are empty simply by asking for the list in each case and seeing how many entries it contains. (There is probably a more efficient way implemented in terms of COUNT, too.) If there are any foreign keys relating to the object, and you have your object relations set up correctly, then at least one of these lists will be non-zero in length.
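A hedged sketch of that check in SQLAlchemy terms (mapper.relationships is the modern accessor; loading each collection is the simple approach, and a COUNT would indeed be cheaper):

from sqlalchemy.orm import class_mapper

def has_referers(obj):
    mapper = class_mapper(type(obj))
    for rel in mapper.relationships:
        value = getattr(obj, rel.key)
        # collections have a length; scalar references are None when unset
        if (len(value) if rel.uselist else value is not None):
            return True
    return False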
I have a project with 2 applications ( books and reader ).
Books application has a table with 4 million rows with these fields:
book_title = models.CharField(max_length=40)
book_description = models.CharField(max_length=400)
To avoid querying the database with 4 million rows, I am thinking of dividing it by subject (20 models with 20 tables of 200,000 rows each: book_horror, book_drammatic, etc.).
In the "reader" application, I am thinking of inserting these fields:
reader_name = models.CharField(max_length=20, blank=True)
book_subject = models.IntegerField()
book_id = models.IntegerField()
So instead of a ForeignKey, I am thinking of using an integer "book_subject" (which allows access to the appropriate table) and "book_id" (which allows access to the book in the table specified by "book_subject").
Is this a good solution to avoid querying a table with 4 million rows?
Is there an alternative solution?
Like many have said, it's a bit premature to split your table up into smaller tables (horizontal partitioning or even sharding). Databases are made to handle tables of this size, so your performance problem is probably somewhere else.
Indexes are the first step; it sounds like you've done this though. 4 million rows should be OK for the db to handle with an index.
Second, check the number of queries you're running. You can do this with something like the django debug toolbar, and you'll often be surprised how many unnecessary queries are being made.
Caching is the next step: use memcached for pages or parts of pages that are unchanged for most users. This is where you will see your biggest performance boost for the little effort required.
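A hedged sketch of Django's per-view cache (assuming memcached is configured as the cache backend in settings; the Books model name and view are guesses based on the question):

from django.http import HttpResponse
from django.views.decorators.cache import cache_page

@cache_page(60 * 15)  # serve the cached page for 15 minutes
def book_list(request):
    books = Books.objects.all()[:50]
    return HttpResponse(', '.join(b.book_title for b in books))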
If you really, really need to split up the tables, the latest version of django (1.2 alpha) can handle sharding (e.g. multi-db), and you should be able to hand-write a horizontal partitioning solution (postgres offers an in-db way to do this). Please don't use genre to split the tables! Pick something that you won't ever, ever change and that you'll always know when making a query, like author, and divide by the first letter of the surname or something. This is a lot of effort and has a number of drawbacks for a database which isn't particularly big: this is why most people here are advising against it!
[edit]
I left out denormalisation! Put common counts, sums etc. in, e.g., the author table to prevent joins on common queries. The downside is that you have to maintain it yourself (until django adds a DenormalizedField). I would look at this during development for clear, straightforward cases, or after caching has failed you, but well before sharding or horizontal partitioning.
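A hedged sketch of that idea (the Author model and book_count counter are assumptions; application code must keep the counter in sync):

from django.db import models

class Author(models.Model):
    name = models.CharField(max_length=100)
    # denormalised counter so common listings avoid a JOIN + COUNT
    book_count = models.IntegerField(default=0)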
ForeignKey is implemented as IntegerField in the database, so you save little to nothing at the cost of crippling your model.
Edit:
And for pete's sake, keep it in one table and use indexes as appropriate.
You haven't mentioned which database you're using. Some databases - like MySQL and PostgreSQL - have extremely conservative settings out-of-the-box, which are basically unusable for anything except tiny databases on tiny servers.
If you tell us which database you're using, and what hardware it's running on, and whether that hardware is shared with other applications (is it also serving the web application, for example) then we may be able to give you some specific tuning advice.
For example, with MySQL, you will probably need to tune the InnoDB settings; for PostgreSQL, you'll need to alter shared_buffers and a number of other settings.
I'm not familiar with Django, but I have a general understanding of DB.
When you have large databases, it's pretty normal to index your database. That way, retrieving data should be pretty quick.
When it comes to associate a book with a reader, you should create another table, that links reader to books.
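In Django terms, that link table is what a ManyToManyField gives you; a minimal sketch (model names follow the question loosely):

from django.db import models

class Reader(models.Model):
    reader_name = models.CharField(max_length=20, blank=True)
    # creates the reader<->book join table automatically
    books = models.ManyToManyField('Books')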
It's not a bad idea to divide the books into subjects. But I'm not sure what you mean by having 20 applications.
Are you having performance problems? If so, you might need to add a few indexes.
One way to get an idea where an index would help is by looking at your db server's query log (instructions here if you're on MySQL).
If you're not having performance problems, then just go with it. Databases are made to handle millions of records, and django is pretty good at generating sensible queries.
A common approach to this type of problem is Sharding. Unfortunately it's mostly up to the ORM to implement it (Hibernate does it wonderfully) and Django does not support this. However, I'm not sure 4 million rows is really all that bad. Your queries should still be entirely manageable.
Perhaps you should look in to caching with something like memcached. Django supports this quite well.
You can use a server-side DataTable. With server-side processing (paging, sorting and filtering done by the database rather than the client), you can work with more than 4 million records and still respond in under a second.