Django Optimization - Python

I have a query written in raw SQL in Django. Suppose the result of that query is assigned to a variable queryResult. I then loop over queryResult and, for each row, retrieve data from other tables using the Django ORM. For example:
for item in queryResult:
    a = table1.objects.get(id=item[0])
    b = table2.objects.get(id=item[1])
    c = table2.objects.get(id=item[2])
    z = a.result
    x = a.result1
    v = c.result
    # Based on some condition, the data is stored into a list as a dictionary.
    recentDocsList.append({'PurchaseType': item[0],
                           'CaseName': z,
                           'DocketNumber': x,
                           'CourtID': item[2],
                           'PacerCmecf': v,
                           'DID': item[3]})
After completing the loop this recentDocsList is returned back...
After completing the loop, this recentDocsList is returned. But the entire thing is making my page render slowly. Does anybody have a method to resolve this issue?
PS: The entire thing is inside a while loop. Only 50 results are retrieved at a time. Control comes out of the while loop if fewer than 50 results are retrieved or recentDocsList reaches a length of 10.
Thanks in advance.

Don't optimize too early - this can create obfuscation and confusion.
Even using SQLite3 you should be able to pull 50 chained querysets without taxing the DB (upping to a higher-performance DB like PostgreSQL would improve this further). This suggests that your problem is elsewhere. To debug it, try calling your models / queries / views in
$ ./manage.py debugshell
and this will print out your SQL queries so you can see what is actually being called. Even better would be to install django-debug-toolbar, as this will show you where the SQL / rendering slowdowns are.
But! Unless you have a really good reason to do so, DON'T WRITE CUSTOM SQL to be executed in Django - the ORM can take care of almost everything. One of the dangers of custom SQL is terrible performance - as you're probably experiencing.
Further - a while loop in a performance-sensitive place (like page rendering) sounds like a disaster waiting to happen - are you sure you can't rewrite this in a safer way?
Without seeing more code it's difficult to help - how large are your querysets? Are they efficient? Do you have indexes on your tables? (Django will provide these if you allow it, but it sounds like you're doing something different.)
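The usual fix for a loop like the one in the question is to batch the per-row get() calls before the loop. Below is a minimal sketch, assuming table1 and table2 are the models shown above and that the id columns are their primary keys; the unused b lookup is dropped:

# Sketch only: two batched lookups replace the per-row .get() calls.
table1_ids = [item[0] for item in queryResult]
table2_ids = [item[2] for item in queryResult]

table1_map = table1.objects.in_bulk(table1_ids)  # one query, returns {id: obj}
table2_map = table2.objects.in_bulk(table2_ids)  # one query, returns {id: obj}

recentDocsList = []
for item in queryResult:
    a = table1_map.get(item[0])
    c = table2_map.get(item[2])
    if a is None or c is None:
        continue  # id from the raw query had no matching row
    recentDocsList.append({'PurchaseType': item[0],
                           'CaseName': a.result,
                           'DocketNumber': a.result1,
                           'CourtID': item[2],
                           'PacerCmecf': c.result,
                           'DID': item[3]})

This turns the roughly 150 ORM queries per batch of 50 rows into two, which is usually exactly the kind of difference django-debug-toolbar will point you at.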


Django QuerySet vs. raw SQL performance considerations

I'm learning Django and its ORM data access methodology and there is something that I'm curious about. In one particular endpoint, I'm making a number of database calls (to Postgres) - below is an example of one:
projects = Project.objects\
    .filter(Q(first_appointment_scheduled=True) | (Q(active=True) & Q(phase=ProjectPhase.meet.value)))\
    .select_related('customer', 'lead_designer__user')\
    .values('id')\
    .annotate(project=F('name'),
              buyer=Concat(F('customer__first_name'), Value(' '), F('customer__last_name')),
              designer=Concat(F('lead_designer__user__first_name'), Value(' '), F('lead_designer__user__last_name')),
              created=F('created_at'),
              meeting=F('first_appointment_date'))\
    .order_by('id')[:QUERY_SIZE]
As you can see, that's not a small query - I'm pulling in a lot of specific, related data and doing some string manipulation. I'm relatively concerned with performance so I'm doing the best I can to make things more efficient by using select_related() and values() to only get exactly what I need.
The question I have is, conceptually and in broad terms, at what point does it become faster to just write my queries using parameterized SQL instead of using the ORM (since the ORM has to first "translate" the above "mess")? At what approximate level of query complexity should I switch over to raw SQL?
Any insight would be helpful. Thanks!
The question I have is, conceptually and in broad terms, at what point does it become faster to just write my queries using parameterized SQL instead of using the ORM (since the ORM has to first "translate" the above "mess")?
If you are asking about performance, Never.
The time taken to convert the ORM query into SQL will be very small compared to the time taken to actually execute that query. Brain cells are irreplaceable, servers are cheap.
If you really do have performance issues, the first place to look is the indexes in your models. Try printing out each of the queries generated by the ORM and running them in your psql console prefixed with EXPLAIN ANALYSE.
You can also use django-debug-toolbar to automate this. In fact, django-debug-toolbar is an essential tool for hunting down bottlenecks. You will be surprised how often you have missed a simple select_related and how that causes hundreds of additional queries to be executed.
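For the manual route, a minimal sketch (assuming projects is the queryset from the question):

# Sketch: print the SQL the ORM generates for the queryset above, then
# profile it by hand in psql.
print(projects.query)  # the SELECT statement Django will send to Postgres

# In psql, paste the printed statement behind EXPLAIN ANALYSE, e.g.
#   EXPLAIN ANALYSE SELECT ... ;
# and check whether the plan is using your indexes or doing sequential scans.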
At what approximate level of query complexity should I switch over to raw SQL?
If you are asking about ease of coding, it depends.
If the query is very hard to write using the ORM and the result is unreadable, then it's perfectly fine to use a raw query. For example, a query that has multiple aggregations, common table expressions, multiple joins, etc. can sometimes be hard to express as an ORM query; in that case, if you are comfortable with raw SQL, writing it that way is fine.
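If you do drop down to raw SQL, Django's Manager.raw() keeps the query parameterized. Here is a hypothetical sketch; the table name projects_project and the column list are assumptions:

# Hypothetical sketch: a hand-written query executed through raw(), so the
# parameters are still escaped by the database driver.
projects = Project.objects.raw(
    """
    SELECT p.id, p.name, p.created_at
    FROM projects_project AS p
    WHERE p.first_appointment_scheduled = TRUE
       OR (p.active = TRUE AND p.phase = %s)
    ORDER BY p.id
    LIMIT %s
    """,
    [ProjectPhase.meet.value, QUERY_SIZE],
)
for p in projects:  # executed lazily; rows are mapped onto Project instances
    print(p.id, p.name)

Note that raw() requires the primary key column to be included in the SELECT.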
Agreed with what e4c5 said.
The additional translation layer for converting an ORM query into raw SQL will affect performance. However, how big that effect is depends on how complex your query is.
When you use the ORM, you can control the load on the DB by doing more of the processing in the app. That also gives you the opportunity to cache results in the application itself.
In the end, it depends on your schema, how complex your queries can be, and how you are scaling your DB (indices, replicas, etc.).
For more read here

How to make Django QuerySet bulk delete() more efficient

Setup:
Django 1.1.2, MySQL 5.1
Problem:
Blob.objects.filter(foo = foo) \
.filter(status = Blob.PLEASE_DELETE) \
.delete()
This snippet results in the ORM first generating a SELECT * FROM xxx_blob WHERE ... query, then doing a DELETE FROM xxx_blob WHERE id IN (BLAH); where BLAH is a ridiculously long list of IDs. Since I'm deleting a large number of blobs, this makes both me and the DB very unhappy.
Is there a reason for this? I don't see why the ORM can't convert the above snippet into a single DELETE query. Is there a way to optimize this without resorting to raw SQL?
For those who are still looking for an efficient way to bulk delete in django, here's a possible solution:
The reason delete() may be so slow is twofold: 1) Django has to ensure cascade deletes function properly, thus looking for foreign key references to your models; 2) Django has to handle pre- and post-delete signals for your models.
If you know your models don't have cascade deleting or signals to be handled, you can accelerate this process by resorting to the private API _raw_delete as follows:
queryset._raw_delete(queryset.db)
More details here. Please note that Django already tries to handle these events well, though using the raw delete is, in many situations, much more efficient.
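Putting that together for the queryset from the question (a sketch; only appropriate when there are no cascades or signal handlers that need to run):

# Sketch: one DELETE statement, skipping Django's cascade collection and
# pre/post-delete signals. _raw_delete is a private API, so test it against
# the Django version you ship.
qs = Blob.objects.filter(foo=foo, status=Blob.PLEASE_DELETE)
qs._raw_delete(qs.db)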
Not without writing your own custom SQL or managers or something; they are apparently working on it though.
http://code.djangoproject.com/ticket/9519
Bulk delete is already part of Django:
Keep in mind that this will, whenever possible, be executed purely in SQL

Dynamically Created Top Articles List in Django?

I'm creating a Django-powered site for my newspaper-ish publication. The least obvious and intuitive task that I have come across in getting the site together is how best to generate a "top articles" list for the sidebar of the page.
The first thing that came to mind was some sort of database column that is updated (based on what?) with every view. That seems (to my instincts) ridiculously database-intensive and impractical, so I'd like to find another solution.
Thanks all.
I would give celery a try (with django-celery). While it's not as easy to configure and use as the cache, it enables you to queue tasks like incrementing counters and run them in the background. It can even be combined with the cache technique: increment counters in the cache in your views, and define a PeriodicTask that runs every now and then, resetting the counters and writing them to the database.
I just remembered - I once found this blog entry which provides a nice way of incrementing a 'viewed_count' (or similar) column in the database with an AJAX JS call. If you don't have heavy traffic, maybe it's a good idea?
Also mentioned in that post is django-tracking, but I don't know much about it; I've never used it myself (yet).
Premature optimization - first try the DB way and then see if it really is too database-intensive. Any decent database has caches good enough that it probably won't matter very much. And even if it is a problem, take a look at the other DB/cache suggestions here.
Most likely, by the way, you will have many more intensive DB queries with each view than a simple view-count update.
If you do something like sorting by top views, it would be fast if you index the views column in the DB. Another option is to only collect the top X articles every hour or so and toss that value into Django's cache framework.
The nice thing about caching the list is that the algorithm you use to determine top articles can be as complex as you like without hitting the DB hard with every page view. Django's cache framework can use memory, db, or file system. I prefer DB, but many others prefer memory. I believe it uses pickle, so you can also store Python objects directly. It's easy to use, recommended.
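A minimal sketch of that pattern, assuming an Article model with a views field (both names are placeholders):

# Sketch: recompute the "top articles" list at most once per hour and serve
# it from Django's cache framework in between. Article and views are
# placeholder names.
from django.core.cache import cache

TOP_ARTICLES_KEY = 'top_articles'

def get_top_articles(limit=10):
    articles = cache.get(TOP_ARTICLES_KEY)
    if articles is None:
        articles = list(Article.objects.order_by('-views')[:limit])
        cache.set(TOP_ARTICLES_KEY, articles, 60 * 60)  # cache for one hour
    return articles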
An index wouldn't help, as the main problem, I believe, is not so much getting the sorted list as having a DB write with every page view of an article. Another index actually makes that problem worse, albeit only a little.
So I'd go with the cache. I think Django's cache layer is a problem here because it requires timeouts on all keys. I'm not sure if that's imposed by memcached; if not, then go with redis. Actually, just go with redis anyway - the Python library is great, I've used it from Django projects before, and it has atomic increments and powerful sorting - everything you need.
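A sketch of the redis variant using redis-py and a sorted set, so both the increment and the "top N" read are single atomic commands (the key name and connection details are assumptions):

# Sketch: count article views in a redis sorted set and read the leaders
# straight out of it. 'article_views' and the connection are assumptions.
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

def record_view(article_id):
    r.zincrby('article_views', 1, article_id)  # atomic increment of the score

def top_article_ids(limit=10):
    # Highest scores first, e.g. [(b'42', 317.0), (b'7', 290.0), ...]
    return r.zrevrange('article_views', 0, limit - 1, withscores=True)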

Reverse Search Best Practices?

I'm making an app that has a need for reverse searches. By this, I mean that users of the app will enter search parameters and save them; then, when any new objects get entered onto the system, if they match the existing search parameters that a user has saved, a notification will be sent, etc.
I am having a hard time finding solutions for this type of problem.
I am using Django and thinking of building the searches and pickling them using Q objects as outlined here: http://www.djangozen.com/blog/the-power-of-q
The way I see it, when a new object is entered into the database, I will have to load every single saved query from the db and somehow run it against this one new object to see if it would match that search query... This doesn't seem ideal - has anyone tackled such a problem before?
At the database level, many databases offer 'triggers'.
Another approach is to have timed jobs that periodically fetch all items from the database that have a last-modified date since the last run; then these get filtered and alerts issued. You can perhaps put some of the filtering into the query statement in the database. However, this is a bit trickier if notifications need to be sent if items get deleted.
You can also put triggers manually into the code that submits data to the database, which is perhaps more flexible and certainly doesn't rely on specific features of the database.
A nice way for the triggers and the alerts to communicate is through message queues - queues such as RabbitMQ and other AMQP implementations will scale with your site.
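A minimal sketch of the timed-job approach described above, assuming an Item model with a last_modified field; the SavedSearch model and the matches/notify helpers are hypothetical:

# Sketch: a periodic job that picks up rows changed since the last run and
# checks them against the saved searches. Item, SavedSearch, matches and
# notify are all hypothetical names.
from django.utils import timezone

def run_saved_searches(last_run):
    now = timezone.now()
    changed = Item.objects.filter(last_modified__gte=last_run)
    for item in changed:
        for search in SavedSearch.objects.all():
            if matches(search, item):       # hypothetical matcher
                notify(search.user, item)   # hypothetical notifier
    return now  # persist this and pass it in as last_run next time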
The amount of effort you use to solve this problem is directly related to the number of stored queries you are dealing with.
Over 20 years ago we handled stored queries by treating them as minidocs and indexing them based on all of their must-have and may-have terms. A new doc's term list was used as a sort of query against this "database of queries", and that built a list of possibly interesting searches to run; then only those searches were run against the new docs. This may sound convoluted, but when there are more than a few stored queries (say anywhere from 10,000 to 1,000,000 or more) and you have a complex query language that supports a hybrid of Boolean and similarity-based searching, it substantially reduced the number we had to execute as full-on queries -- often no more than 10 or 15 queries.
One thing that helped was that we were in control of the horizontal and the vertical of the whole thing. We used our query parser to build a parse tree and that was used to build the list of must/may have terms we indexed the query under. We warned the customer away from using certain types of wildcards in the stored queries because it could cause an explosion in the number of queries selected.
Update for comment:
Short answer: I don't know for sure.
Longer answer: We were dealing with a custom-built text search engine, and part of its query syntax allowed slicing the doc collection in certain ways very efficiently, with special emphasis on date_added. We played a lot of games because we were ingesting 4-10,000,000 new docs a day and running them against up to 1,000,000+ stored queries on DEC Alphas with 64MB of main memory. (This was in the late '80s/early '90s.)
I'm guessing that filtering on something equivalent to date_added could be used in combination with the date of the last time you ran your queries, or maybe the highest id at last query run time. If you need to re-run the queries against a modified record, you could use its id as part of the query.
For me to get any more specific, you're going to have to get a lot more specific about exactly what problem you are trying to solve and the scale of the solution you are trying to accomplish.
If you stored the type(s) of object(s) involved in each stored search as a generic relation, you could add a post-save signal to all involved objects. When the signal fires, it looks up only the searches that involve its object type and runs those. That probably will still run into scaling issues if you have a ton of writes to the db and a lot of saved searches, but it would be a straightforward Django approach.
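A rough sketch of that idea, following the pickled-Q approach mentioned in the question; SavedSearch, its fields, and notify_user are all hypothetical:

# Rough sketch: when any watched object is saved, run only the saved
# searches registered for its content type against that single new row.
import pickle

from django.contrib.contenttypes.models import ContentType
from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save)
def match_saved_searches(sender, instance, created, **kwargs):
    if not created:
        return
    ct = ContentType.objects.get_for_model(sender)
    for search in SavedSearch.objects.filter(content_type=ct):
        q = pickle.loads(search.pickled_q)  # the stored Q object
        # Re-run the saved search restricted to just this one row.
        if sender.objects.filter(q, pk=instance.pk).exists():
            notify_user(search.user, instance)  # hypothetical helper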

Overhead of a Round-trip to MySql?

So I've been building Django applications for a while now, drinking the Kool-Aid and all: only using the ORM and never writing custom SQL.
The main page of the site (the primary interface where users will spend 80% - 90% of their time) was getting slow once there was a large amount of user-specific content (i.e. photos, friends, other data, etc.).
So I popped in the sql logger (was pre-installed with pinax, I just enabled it in the settings) and imagine my surprise when it reported over 500 database queries!! With hand coded sql I hardly ever ran more than 50 on the most complex pages.
In hindsight it's not altogether surprising, but it seems that this can't be good.
...even if only a dozen or so of the queries take 1ms+
So I'm wondering, how much overhead is there in a round trip to MySQL? Django and MySQL are running on the same server, so there shouldn't be any networking-related overhead.
Just because you are using an ORM doesn't mean that you shouldn't do performance tuning.
I had - like you - a home page of one of my applications that had low performance. I saw that I was doing hundreds of queries to display that page. I went looking at my code and realized that with some careful use of select_related() my queries would bring more of the data I needed - I went from hundreds of queries to tens.
You can also run a SQL profiler and see if there aren't indices that would help your most common queries - you know, standard database stuff.
Caching is also your friend, I would think. If a lot of a page is not changing, do you need to query the database every single time?
If all else fails, remember: the ORM is great, and yes - you should try to use it because it is the Django philosophy; but you are not married to it.
If you really have a use case where studying and tuning the ORM navigation didn't help, and you are sure that you could do it much better with a hand-written query, use raw SQL for that case.
The overhead of each query is only part of the picture. The actual round-trip time between your Django and MySQL servers is probably very small, since most of your queries are coming back in less than one millisecond. The bigger problem is that the number of queries issued to your database can quickly overwhelm it. 500 queries for a page is way too much; even 50 seems like a lot to me. If ten users view complicated pages, you're now up to 5,000 queries.
The round trip time to the database server is more of a factor when the caller is accessing the database from a Wide Area Network, where roundtrips can easily be between 20ms and 100ms.
I would definitely look into using some kind of caching.
There are some ways to reduce the query volume.
Use .filter() and .all() to get a bunch of things; pick and choose in the view function (or template via {%if%}). Python can process a batch of rows faster than MySQL.
"But I could send too much to the template". True, but you'll execute fewer SQL requests. Measure to see which is better.
This is what you used to do when you wrote SQL. It's not wrong -- it doesn't break the ORM -- but it optimizes the underlying DB work and puts the processing into the view function and the template.
Avoid query navigation in the template. When you do {{foo.bar.baz.quux}}, SQL is used to get the bar associated with foo, then the baz associated with the bar, then the quux associated with baz. You may be able to reduce this query business with some careful .filter() and Python processing to assemble a useful tuple in the view function.
Again, this was something you used to do when you hand-crafted SQL. In this case, you gather larger batches of ORM-managed objects in the view function and do your filtering in Python instead of via a lot of individual ORM requests.
This doesn't break the ORM. It changes the usage profile from lots of little queries to a few bigger queries.
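As a concrete illustration of both points (the models and field names are made up):

# Sketch: instead of letting the template walk photo.album.owner and fire a
# query per row, pull the related rows up front and assemble what the
# template needs in the view. Photo, Album and owner are made-up names.
from django.shortcuts import render

def photo_list(request):
    photos = (Photo.objects
              .filter(owner=request.user)
              .select_related('album', 'album__owner'))  # joins, not per-row queries
    # "Pick and choose" in Python; the template just iterates over rows.
    rows = [(p, p.album.title, p.album.owner.username) for p in photos]
    return render(request, 'photos/list.html', {'rows': rows})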
There is always overhead in database calls. In your case the overhead is not that bad, because the application and database are on the same machine so there is no network latency, but there is still a significant cost.
When you make a request to the database, it has to prepare to service that request by doing a number of things, including:
Allocating resources (memory buffers, temp tables, etc.) to the database server connection/thread that will handle the request;
De-serializing the SQL and parameters (this is necessary even on one machine, since it is an inter-process request, unless you are using an embedded database);
Checking whether the query exists in the query cache and, if not, optimising it and putting it in the cache.
Note also that if your queries are not parametrised (that is, the values are not separated from the SQL), this may result in cache misses for statements that should be the same, meaning that each request results in the query being analysed and optimised each time.
Processing the query;
Preparing and returning the results to the client.
This is just an overview of the kinds of things most database management systems do to process an SQL request. You incur this overhead 500 times, even if the query itself runs relatively quickly. Bottom line: database interactions, even with a local database, are not as cheap as you might expect.
