I am asking whether Django creates database indexes for a ManyToManyField ... and if yes, does it also do this for the model I provide in through?
I just want to make sure the database will stay fast when it holds a lot of data.
It does for plain m2m fields - I can't tell for sure for "through" tables since I don't have any in my current project and don't have access to other projects ATM, but it should still be the case since this index is necessary for the UNIQUE constraint.
FWIW you can easily check it yourself by looking at the table definitions in your database (in MySQL / MariaDB, using SHOW CREATE TABLE yourtablename).
This being said, db indexes are only useful when the column holds mostly distinct values, and can actually degrade performance (depending on your db vendor etc.) if that's not the case - for example, if you have a field with only 3 or 4 possible values (e.g. "gender" or something similar), indexing it might not yield the expected results. This should not be an issue for m2m tables, since the (table1_id, table2_id) pair is supposed to be unique, but the point is that you should not believe that just adding indexes will automagically boost your performance - tuning SQL database performance is a trade in and of itself, and depends a lot on how you actually use your data.
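For a plain ManyToManyField Django builds the join table and its UNIQUE constraint itself; on an explicit through model you can declare the constraint/index yourself if you want to be certain it exists. A minimal sketch (model names are made up; on_delete is only required on newer Django versions):

from django.db import models

class Tag(models.Model):
    name = models.CharField(max_length=50)

class Article(models.Model):
    title = models.CharField(max_length=100)
    tags = models.ManyToManyField(Tag, through="ArticleTag")

class ArticleTag(models.Model):
    article = models.ForeignKey(Article, on_delete=models.CASCADE)  # ForeignKey columns get an index by default
    tag = models.ForeignKey(Tag, on_delete=models.CASCADE)

    class Meta:
        unique_together = ("article", "tag")  # composite UNIQUE constraint, usable as an index on the pair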
I know that QuerySets are lazy and are only evaluated under certain conditions, to avoid hitting the database all the time.
What I don't know is whether taking a generic queryset (retrieving all the items) and then using it to build a more refined queryset (adding a filter, for example) leads to multiple SQL queries or not.
Example:
all_items = MyModel.objects.all()
subset1 = all_items.filter(**some_conditions)
subset2 = subset1.filter(**other_condition)
1) Would this create 3 different SQL queries? Or does it all depend on whether the 3 variables are evaluated (for example by iterating over them)?
2) Is this efficient, or would it be better to fetch all the items, convert them into a list, and filter them in Python?
1) If you evaluate only the final queryset subset2, then only one database query is executed, which is optimal.
2) Avoid premature optimization (i.e. optimizing before measuring on a reasonable amount of data, after most of the application code is written). You never know what will end up being the most important problem once the database gets bigger. For example, if you ask only for a subset, the query is usually faster thanks to caching inside the database. Memory usage also competes with other optimizations: maybe you won't be able to hold all the data in memory later, and users will only access it one page at a time. Clean, readable code is more valuable for later optimization than a 20% speed-up that has to be removed later in order to move on.
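To illustrate the point with the example from the question (the filter arguments are made up):

all_items = MyModel.objects.all()            # no query yet
subset1 = all_items.filter(status="open")    # still no query
subset2 = subset1.filter(owner_id=42)        # still no query

for item in subset2:                         # evaluation happens here: a single query
    print(item.pk)                           # whose WHERE clause combines both filters

print(subset2.query)                         # shows the one SQL statement Django runs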
Other important paragraphs about (lazy) evaluation of queries:
When QuerySets are evaluated
QuerySets are lazy
Laziness in Django
This is simply a question of curiosity. I have a script that loads a specific queryset without evaluating it, and then I print the count(). I understand that count() has to go through the rows, so depending on the size it could take some time, but it took over a minute to return 0 as the count of an empty queryset. Why is that taking so long? Is it Django or my server?
notes:
the queryset was all one type.
It all depends on the query that you're running. If you're running a SELECT COUNT(*) FROM foo on a table that has ten rows, it's going to be very fast; but if your query involves a dozen joins, sub-selects, filters on un-indexed columns - or if the target table simply has a lot of rows - the query can take an arbitrary amount of time. In all likelihood, the bottleneck is not Django (although its ORM has some quirks), but rather the database and your query. Just because no rows meet the criteria doesn't mean the database didn't need to look at the other rows in the table.
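If you want to see for yourself what the ORM is sending and how long it takes, a rough way to check (requires DEBUG = True; the model and filter are placeholders):

from django.db import connection

qs = MyModel.objects.filter(category="fiction")
print(qs.query)                  # the SELECT Django would issue for the queryset itself
print(qs.count())                # runs a separate SELECT COUNT(*) ... query
print(connection.queries[-1])    # the raw SQL of that COUNT and the time it took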
I have legacy code which uses a nested ORM query that produces a SQL SELECT with a JOIN, and whose conditions also contain a SELECT and a JOIN. Executing this query takes an enormous amount of time. By the way, when I execute the same query as raw SQL, taken from Django_ORM_query.query, it runs in reasonable time.
What are best practices for optimization in such cases?
Would the query perform faster if I used ManyToMany and ForeignKey relations?
Performance issues in Django are usually caused by following relations in a loop, which causes multiple database queries. If you have django-debug-toolbar installed, you can check how many queries you're making and figure out which ones need to be optimized. The debug toolbar also shows you the time of each query, which is essential for optimizing Django; you're missing out on a lot if you don't have it installed or don't use it.
You'd generally solve the problem of following relations by using select_related() or prefetch_related().
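A rough sketch of the difference (Book and Author are placeholder models with a ForeignKey between them):

# N+1 pattern: one query for the books, plus one extra query per book
for book in Book.objects.all():
    print(book.author.name)

# select_related() follows the ForeignKey with a JOIN, so it's a single query
for book in Book.objects.select_related("author"):
    print(book.author.name)

# prefetch_related() handles reverse and many-to-many relations in a second batched query
for author in Author.objects.prefetch_related("book_set"):
    print([b.title for b in author.book_set.all()])   # uses the prefetched cache, no extra queries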
A page generally should have at most 20-30 queries; any more and it's going to seriously affect performance. Most pages should only have 5-10 queries. You want to reduce the number of queries because round trips are the number one killer of database performance. In general, one big query is faster than 100 small queries.
The number two killer of database performance is a much rarer problem, though it sometimes arises from techniques that reduce the number of queries: your query might simply be too big. If this is the case, you should use defer() or only() so you don't load large fields that you know you won't be using.
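For example, assuming a large text column you don't need on a listing page (the field names are illustrative):

titles = Book.objects.only("book_title")          # load only the named columns (plus the pk)
books = Book.objects.defer("book_description")    # load everything except the deferred columns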
When in doubt, use raw SQL. That's a completely valid optimization in Django world.
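Two common escape hatches, sketched under the assumption of a books_book table (the SQL and table name are illustrative):

# Manager.raw() still gives you model instances back; the pk column must be included
books = Book.objects.raw(
    "SELECT id, book_title FROM books_book WHERE book_title LIKE %s", ["%django%"]
)

# Or drop down to a plain cursor when you don't need model instances at all
from django.db import connection
with connection.cursor() as cursor:
    cursor.execute("SELECT COUNT(*) FROM books_book")
    total = cursor.fetchone()[0]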
I have a project with 2 applications (books and reader).
The books application has a table with 4 million rows and these fields:
book_title = models.CharField(max_length=40)
book_description = models.CharField(max_length=400)
To avoid querying a table with 4 million rows, I am thinking of dividing it by subject (20 models with 20 tables of 200,000 rows each: book_horror, book_drammatic, etc.).
In the "reader" application, I am thinking of adding these fields:
reader_name = models.CharField(max_length=20, blank=True)
book_subject = models.IntegerField()
book_id = models.IntegerField()
So instead of a ForeignKey, I am thinking of using an integer "book_subject" (which lets me pick the appropriate table) and "book_id" (which lets me find the book in the table specified by "book_subject").
Is this a good solution to avoid querying a table with 4 million rows?
Is there an alternative solution?
Like many have said, it's a bit premature to split your table up into smaller tables (horizontal partitioning or even sharding). Databases are made to handle tables of this size, so your performance problem is probably somewhere else.
Indexes are the first step, it sounds like you've done this though. 4 million rows should be ok for the db to handle with an index.
Second, check the number of queries you're running. You can do this with something like the django debug toolbar, and you'll often be surprised how many unnecessary queries are being made.
Caching is the next step, use memcached for pages or parts of pages that are unchanged for most users. This is where you will see your biggest performance boost for the little effort required.
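A minimal sketch of what that can look like (the backend path varies with your Django version and memcached client; the timeout and names are arbitrary):

# settings.py
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.memcached.PyMemcacheCache",
        "LOCATION": "127.0.0.1:11211",
    }
}

# views.py
from django.shortcuts import render
from django.views.decorators.cache import cache_page

@cache_page(60 * 15)            # serve the cached page for 15 minutes
def book_list(request):
    books = Book.objects.only("book_title")[:50]
    return render(request, "books/list.html", {"books": books})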
If you really, really need to split up the tables, the latest version of Django (1.2 alpha) can handle sharding (e.g. multi-db), and you should be able to hand-write a horizontal partitioning solution (Postgres offers an in-db way to do this). Please don't use genre to split the tables! Pick something that you won't ever, ever change and that you'll always know when making a query - like author, divided by the first letter of the surname or something. This is a lot of effort and has a number of drawbacks for a database which isn't particularly big - this is why most people here are advising against it!
[edit]
I left out denormalisation! Put common counts, sums, etc. in (for example) the author table to prevent joins on common queries. The downside is that you have to maintain it yourself (until Django adds a DenormalizedField). I would look at this during development for clear, straightforward cases, or after caching has failed you - but well before sharding or horizontal partitioning.
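A sketch of the kind of thing meant here (model and field names are invented; it assumes a Book model with a ForeignKey to Author, and the counter is kept in step by hand):

from django.db import models, transaction
from django.db.models import F

class Author(models.Model):
    name = models.CharField(max_length=100)
    book_count = models.PositiveIntegerField(default=0)   # denormalised: duplicates COUNT of related books

def add_book(author, title):
    # run the insert and the counter update together so they can't drift apart
    with transaction.atomic():
        book = author.book_set.create(book_title=title)
        Author.objects.filter(pk=author.pk).update(book_count=F("book_count") + 1)
    return book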
ForeignKey is implemented as IntegerField in the database, so you save little to nothing at the cost of crippling your model.
Edit:
And for pete's sake, keep it in one table and use indexes as appropriate.
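In other words, a minimal sketch of keeping the original design (names taken from the question; on_delete is only required on newer Django versions):

from django.db import models

class Book(models.Model):
    book_title = models.CharField(max_length=40)
    book_description = models.CharField(max_length=400)
    subject = models.CharField(max_length=30, db_index=True)   # just a filterable column, not a table per genre

class Reader(models.Model):
    reader_name = models.CharField(max_length=20, blank=True)
    book = models.ForeignKey(Book, on_delete=models.CASCADE)   # stored as an indexed integer column anyway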
You haven't mentioned which database you're using. Some databases - like MySQL and PostgreSQL - have extremely conservative settings out-of-the-box, which are basically unusable for anything except tiny databases on tiny servers.
If you tell us which database you're using, and what hardware it's running on, and whether that hardware is shared with other applications (is it also serving the web application, for example) then we may be able to give you some specific tuning advice.
For example, with MySQL, you will probably need to tune the InnoDB settings; for PostgreSQL, you'll need to alter shared_buffers and a number of other settings.
I'm not familiar with Django, but I have a general understanding of databases.
When you have large databases, it's pretty normal to index them. That way, retrieving data should be pretty quick.
When it comes to associate a book with a reader, you should create another table, that links reader to books.
It's not a bad idea to divide the books into subjects. But I'm not sure what you mean by having 20 applications.
Are you having performance problems? If so, you might need to add a few indexes.
One way to get an idea where an index would help is by looking at your db server's query log (instructions here if you're on MySQL).
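If the log shows frequent filtering or sorting on a column that has no index, adding one is usually just a field option plus a schema change (the field name is borrowed from the question above):

from django.db import models

class Book(models.Model):
    book_title = models.CharField(max_length=40, db_index=True)   # index the column the slow queries filter on
    book_description = models.CharField(max_length=400)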
If you're not having performance problems, then just go with it. Databases are made to handle millions of records, and django is pretty good at generating sensible queries.
A common approach to this type of problem is Sharding. Unfortunately it's mostly up to the ORM to implement it (Hibernate does it wonderfully) and Django does not support this. However, I'm not sure 4 million rows is really all that bad. Your queries should still be entirely manageable.
Perhaps you should look into caching with something like memcached. Django supports this quite well.
You can use a server-side DataTable. If you implement server-side processing, you will be able to handle more than 4 million records and still respond in under a second.