I'm learning Django and its ORM data access methodology and there is something that I'm curious about. In one particular endpoint, I'm making a number of database calls (to Postgres) - below is an example of one:
projects = Project.objects\
.filter(Q(first_appointment_scheduled=True) | (Q(active=True) & Q(phase=ProjectPhase.meet.value)))\
.select_related('customer__first_name', 'customer__last_name',
'lead_designer__user__first_name', 'lead_designer__user__last_name')\
.values('id')\
.annotate(project=F('name'),
buyer=Concat(F('customer__first_name'), Value(' '), F('customer__last_name')),
designer=Concat(F('lead_designer__user__first_name'), Value(' '), F('lead_designer__user__last_name')),
created=F('created_at'),
meeting=F('first_appointment_date'))\
.order_by('id')[:QUERY_SIZE]
As you can see, that's not a small query - I'm pulling in a lot of specific, related data and doing some string manipulation. I'm relatively concerned with performance so I'm doing the best I can to make things more efficient by using select_related() and values() to only get exactly what I need.
The question I have is, conceptually and in broad terms, at what point does it become faster to just write my queries using parameterized SQL instead of using the ORM (since the ORM has to first "translate" the above "mess")? At what approximate level of query complexity should I switch over to raw SQL?
Any insight would be helpful. Thanks!
The question I have is, conceptually and in broad terms, at what point
does it become faster to just write my queries using parameterized SQL
instead of using the ORM (since the ORM has to first "translate" the
above "mess")?
If you are asking about performance, Never.
The time taken to convert the ORM query into SQL will be very small compared to the time taken to actually execute that query. Brain cells are irreplaceable, servers are cheap.
If you are really do have performance issues the first place to look at is the your indexes in your models. Try printing out each of the queries generated by the ORM and run them in your psql console by prefixing EXPLAIN ANALYSE.
You can also use the django-debug-toolbar to automate this. In fact django-debug toolbar is an essential tool to hunt down bottlenecks. You will be surprised to note how often you have missed a simple select_related and how that causes hundreds of additional queries to be executed.
At what approximate level of query complexity should I switch over to
raw SQL?
if you are asking about the ease of coding, it depends.
If the query is very very hard to write using the ORM and it's unreadable, yes, then it's perfectly fine to use a raw query. For example a query that has multiple aggregations, uses common table expressions, multiple joins etc can sometimes be hard to write as an ORM query, in that case if you are comfortable with raw sql writing it that way is fine.
Agreed with what #e4c5 said .
Additional translation layer for converting an ORM query to raw SQL query will effect performance.
However, this effect will depend on how much complex your query is?
When you use ORM, you can control the load on DB by increasing the processing in the app. In addition, this gives the opportunity to cache the result in the application itself.
At last, It totally depends on your schema , how complex your queries can be and how are you scaling your DB(Indices, replicas etc .)
For more read here
Related
I am kind of new to sql and database & currently developing website in django framework.
During my reading of django documentation I have read about raw sql queries which are executed using Manager.raw() like below.
for p in Person.objects.raw('SELECT * FROM myapp_person'):
Manager.raw(raw_query, params=None, translations=None)
How does raw queries differes from normal sql queries & when should I use raw sql queries instead of Django ORM ?
Django (like other similar ORM tools) is a connection between relational databases and object-oriented programming. One of the very important functions that it implements is providing a uniform interface to the database -- regardless of the underlying database.
When you use underlying Django functionality, the code should be supported on any database (there may be specific limits on this). This makes it particularly easy to port to another database. It also helps ensure that the generated queries do what you intend.
When you use raw SQL, the code is likely to be specific to one database (creating a porting problem). The code is also not checked, which can result in hard-to-understand errors.
I have a strong preference for using SQL directly -- but that is because I am not a programmer using an ORM framework. If you are going to use such a framework, it is probably better to use the built-in functionality wherever possible.
This is a borderline opinion question so might get flagged, but it is a good point. Essentially the raw SQL queries are intended to only be used for the edge cases where the Django ORM does not fulfil your needs (and with each new version of Django it support more and more query types so raw becomes less useful).
In general I would suggest using the ORM for the more helpful error messages, maintainability, and plain ease of use, and only use raw as a last-resort
I have a legacy code which uses nested ORM query, which produces SQL SELECT query with JOIN, and conditions which also contains SELECT and JOIN. Execution of this query takes enormous time. By the way, when I execute this query in raw SQL, taken from Django_ORM_query.query, it performs with reasonable time.
What are best practices for optimization in such cases?
Would the query perform faster if I will use ManyToMany and ForeignKey relations?
Performance issue in Django is usually caused by following relations in a loop, which causes multiple database queries. If you have django-debug-toolbar installed, you can check for how many queries you're doing and figure out which query needs to be optimized. The debug toolbar also shows you the time of each queries, which is essential for optimizing django, you're missing out a lot if you didn't have it installed or didn't use it.
You'd generally solve the problem of following relations by using select_related() or prefetch_related().
A page generally should have at most 20-30 queries, any more and it's going to seriously affect performance. Most pages should only have 5-10 queries. You want to reduce the number of queries because round trip is the number one killer of database performance. In general one big query is faster than 100 small queries.
The number two killer of database performance is much rarer a problem, though it sometimes arises because of techniques that reduces the number of queries. Your query might simply be too big, if this is the case, you should use defer() or only() so you don't load large fields that you know you won't be using.
When in doubt, use raw SQL. That's a completely valid optimization in Django world.
i'm working on a project (written in Django) which has only a few entities, but many rows for each entity.
In my application i have several static "reports", directly written in plain SQL. The users can also search the database via a generic filter form. Since the target audience is really tech-savvy and at some point the filter doesn't fit their needs, i think about creating a query language for my database like YQL or Jira's advanced search.
I found http://sourceforge.net/projects/littletable/ and http://www.quicksort.co.uk/DeeDoc.html, but it seems that they only operate on in-memory objects. Since the database can be too large for holding it in-memory, i would prefer that the query is translated in SQL (or better a Django query) before doing the actual work.
Are there any library or best practices on how to do this?
Writing such a DSL is actually surprisingly easy with PLY, and what ho—there's already an example available for doing just what you want, in Django. You see, Django has this fancy thing called a Q object which make the Django querying side of things fairly easy.
At DjangoCon EU 2012, Matthieu Amiguet gave a session entitled Implementing Domain-specific Languages in Django Applications in which he went through the process, right down to implementing such a DSL as you desire. His slides, which include all you need, are available on his website. The final code (linked to from the last slide, anyway) is available at http://www.matthieuamiguet.ch/media/misc/djangocon2012/resources/compiler.html.
Reinout van Rees also produced some good comments on that session. (He normally does!) These cover a little of the missing context.
You see in there something very similar to YQL and JQL in the examples given:
groups__name="XXX" AND NOT groups__name="YYY"
(modified > 1/4/2011 OR NOT state__name="OK") AND groups__name="XXX"
It can also be tweaked very easily; for example, you might want to use groups.name rather than groups__name (I would). This modification could be made fairly trivially (allow . in the FIELD token, by modifying t_FIELD, and then replacing . with __ before constructing the Q object in p_expression_ID).
So, that satisfies simple querying; it also gives you a good starting point should you wish to make a more complex DSL.
I've faced exactly this problem - a large database which needs searching. I made some static reports and several fancy filters using django (very easy with django) just like you have.
However the power users were clamouring for more. I decided that there already was a DSL that they all knew - SQL. The question was how to make it secure enough.
So I used django permissions to give the power users permission to make SQL queries in a new table. I then made a view for the not-quite-so-power users to use these queries. I made them take optional parameters. The queries were run using Python's lower level DB-API which django is using under the hood for its ORM anyway.
The real trick was opening a read only database connection to run these queries just to make sure that no updates were ever run. I made a read only connection by creating a different user in the database with lower permissions and opening a specific connection for that in the view.
TL;DR - SQL is the way to go!
Depending on the form of your data, the types of queries your users need to use, and the frequency that your data is updated, an alternative to the pure SQL solution suggested by Nick Craig-Wood is to index your data in Solr and then run queries against it.
Solr is an added layer of complexity (configuration, data synchronization) but it is super-fast, can handle large datasets, and provides a (relatively) intuitive query language.
You could write your own SQL-ish language using pyparsing, actually. There is even pretty verbose example you could extend.
ORM tools are great when the queries we need are simple select or insert clauses.
But sometimes we may have to fall back to use raw SQL queries, because we may need to make queries so complex that simply using the ORM API can not give us an efficient and effective solution.
What do you do to deal with the difference between objects returned from raw queries and orm queries?
I personally strive to design my models so I don't have to deffer to writing raw SQL queries, or fallback to mixing in the ContentTypes framework for complex relationships, so I have no experience on the topic.
The documentation covers the topic of the APIs for performing raw SQL queries. You can either use the Manager.raw() on your models (MyModel.objects.raw()), for queries where you can map columns back to actual model fields, or user cursor to query raw rows on your database connection.
If your going to use Manager.raw(), you'll work with the RawQuerySet instead of the usual QuerySet. For all things concerned, when working with result objects the two emulate containers identically, but the QuerySet is a more feature-packed monad.
I can imagine that performing raw SQL queries in Django would be more rewarding than working with a framework with no ORM support—Django can manage your database schema and provide you with a database connection and you'll only have to manually create queries and position query arguments. The resulting rows can be accessed as lists or dictionaries, both of which make it suitable for displaying in templates or performing additional lifting.
SQLAlchemy allows a fair bit of complexity in formulating queries so you can usually get away without raw sql. If you do need to dip down to raw sql, you can use connection.execute. But there are helpers like the text and select functions to make this easier when programming. As far as dealing with the returned objects, you get a list of tuples which are simple to deal with in a pythonic way.
In general, if you need to treat rows (list of tuples, etc) as what your ORM returns, one approach would be to write an adapter class which mimics a queryset interface. This could be initialized with the "schema" of the returned tuples, and then iterate and return objects with properties instead of tuples. I haven't really needed this, but I can see how it might be useful if, for example, you have a framework which relies on querysets being passed around.
Due to performance reasons I can't use the ORM query methods of Django and I have to use raw SQL for some complex questions. I want to find a way to map the results of a SQL query to several models.
I know I can use the following statement to map the query results to one model, but I can't figure how to use it to be able to map to related models (like I can do by using the select_related statement in Django).
model_instance = MyModel(**dict(zip(field_names, row_data)))
Is there a relatively easy way to be able to map fields of related tables that are also in the query result set?
First, can you prove the ORM is stopping your performance? Sometimes performance problems are simply poor database design, or improper indexes. Usually this comes from trying to force-fit Django's ORM onto a legacy database design. Stored procedures and triggers can have adverse impact on performance -- especially when working with Django where the trigger code is expected to be in the Python model code.
Sometimes poor performance is an application issue. This includes needless order-by operations being done in the database.
The most common performance problem is an application that "over-fetches" data. Casually using the .all() method and creating large in-memory collections. This will crush performance. The Django query sets have to be touched as little as possible so that the query set iterator is given to the template for display.
Once you choose to bypass the ORM, you have to fight out the Object-Relational Impedance Mismatch problem. Again. Specifically, relational "navigation" has no concept of "related": it has to be a first-class fetch of a relational set using foreign keys. To assemble a complex in-memory object model via SQL is simply hard. Circular references make this very hard; resolving FK's into collections is hard.
If you're going to use raw SQL, you have two choices.
Eschew "select related" -- it doesn't exist -- and it's painful to implement.
Invent your own ORM-like "select related" features. A common approach is to add stateful getters that (a) check a private cache to see if they've fetched the related object and if the object doesn't exist, (b) fetch the related object from the database and update the cache.
In the process of inventing your own stateful getters, you'll be reinventing Django's, and you'll probably discover that it isn't the ORM layer, but a database design or an application design issue.