Django haystack whoosh super slow

Django haystack whoosh super slow - python

I have a simple setup with django-haystack and whoosh engine. A search yielding 19 objects took me 8 seconds. I used the django-debug-toolbar to determine that i had a bunch of repeated queries.
I then updated my search view to prefetch relations, so that duplicate queries would not happen:
class MySearchView(SearchView):
template_name = 'search_results.html'
form_class = SearchForm
queryset = RelatedSearchQuerySet().load_all().load_all_queryset(
models.Customer, models.Customer.objects.all().select_related('customer_number').prefetch_related(
'keywords'
)
).load_all_queryset(
models.Contact, models.Contact.objects.all().select_related('customer')
).load_all_queryset(
models.Account, models.Account.objects.all().select_related(
'customer', 'account_number', 'main_contact', 'main_contact__customer'
)
).load_all_queryset(
models.Invoice, models.Invoice.objects.all().select_related(
'customer', 'end_customer', 'customer__original', 'end_customer__original', 'quote_number', 'invoice_number'
)
).load_all_queryset(
models.File, models.File.objects.all().select_related('file_number', 'customer').prefetch_related(
'keywords'
)
).load_all_queryset(
models.Import, models.Import.objects.all().select_related('import_number', 'customer').prefetch_related(
'keywords'
)
).load_all_queryset(
models.Event, models.Event.objects.all().prefetch_related('customers', 'contracts', 'accounts', 'keywords')
)
But even then, the search still takes 5 seconds. I then used the profiler from django-debug-toolbar, which gave me this information:
From what I can tell, the issue lies in haystack/query:779::__getitem__, which is hit twice, each costing 1.5 seconds. I have glanced through the code in question, but cannot make sense of it. So where do I go from here?

You say in the question:
I then updated my search view to prefetch relations […]
The code you present, though, does not use QuerySet.prefetch_related for most of them. Instead, your sample code uses QuerySet.select_related for most of them; this does not pre-fetch the objects.
The documentation for each of those methods is extensive and can help to decide which is correct for your case.
In particular, the QuerySet.prefetch_related documentation says:
select_related works by creating an SQL join and including the fields of the related object in the SELECT statement. For this reason, select_related gets the related objects in the same database query. However, to avoid the much larger result set that would result from joining across a ‘many’ relationship, select_related is limited to single-valued relationships - foreign key and one-to-one.
prefetch_related, on the other hand, does a separate lookup for each relationship, and does the ‘joining’ in Python. This allows it to prefetch many-to-many and many-to-one objects, which cannot be done using select_related, in addition to the foreign key and one-to-one relationships that are supported by select_related. It also supports prefetching of GenericRelation and GenericForeignKey, however, it must be restricted to a homogeneous set of results. For example, prefetching objects referenced by a GenericForeignKey is only supported if the query is restricted to one ContentType.

Try adding
HAYSTACK_LIMIT_TO_REGISTERED_MODELS = False
to your settings.py. As per the docks,
'If your search index is never used for anything other than the models registered with Haystack, you can turn this off and get a small to moderate performance boost.'
It knocked 3-4 seconds off for my project

Related

How to optimize DRF number queries when using properties in serializers fields?

I am working on a DRF API and I am not completely familiar with django properties.
The DB relationships are classic. Companies have different jobs to which candidates can apply. Each job has several matches, match being a joined table between a job and a candidate. Matches have different statuses representing different phases of an application process.
So here is the deal:
I am using a drf viewset to get data from the api. This viewset uses a serializer to get specific fields, specifically the number of matches per status for a job. The simplified version of the serializer looks something like this.
class Team2AmBackofficeSerializer(Normal2JobSerializer):
class Meta:
model = Job
fields = (
'pk',
'name',
'company',
'company_name',
'job__nb_matches_proposition',
'job__nb_matches_preselection',
'job__nb_matches_valides',
'job__nb_matches_pitches',
'job__nb_matches_entretiens',
'job__nb_matches_offre',
)
The job__xxx fields are using the decorator #property, for instance:
#property
def job__nb_matches_offre(self):
return self.matches.filter(current_status__step_name='Offre').count()
The problem is each time I add one of these properties to my serializer's fields, the number of DB queries increases significantly. This is of course due to the fact that each property calls the DB multiple times. So here is my question:
Is there a way to optimize the number of queries made to the DB, either by changing something in the serializer or by getting the number of matches for a specific status in a different manner ?
I have had a look at select_related and prefetch_related. This allows me to reduce the numbers of queries when getting information about the company but not really for the number of matches.
Any help is greatly appreciated :)

What you want is to annotate your queryset with these values which will result in the database doing all the counting in just one query. The result is significantly faster than your current solution.
Example:
from django.db.models import Count, Q
Job.objects.annotate(
'nb_matches_offre'=Count(
'pk',
filter=Q(current_status__step_name='Offre')
),
'nb_matches_entretiens'=Count(...)
).all()
The resulting queryset will contain Job objects that have the properties job_obj.nb_matches_offre and job_obj.nb_matches_entretiens with the count.
See also https://docs.djangoproject.com/en/3.0/topics/db/aggregation/

Prefech many to many relation for one class instance

I want to limit the queries for a detail view. I want to access multiple many to many fields for one class instance in less query. It seems prefetch_related doesn't work with get and the server hits he database for every manytomany field.
JobInstance = Job.objects.get(pk=id).prefetch_related('cities').prefetch_related('experience_level')

You can let it work, by reordering it, like:
job_instance = Job.objects.prefetch_related('cities', 'experience_level').get(pk=id)
A .prefetch_related(..) is defined on a QuerySet, when you perform a .get(..) then you fetch the object, and you are no longer working with a queryset.
But for a single object, .prefetch_related(..) will not improve efficiency. After all, .prefetch_related(..) will make here two extra queries to fetch the related objects, exactly as much as not prefetching, and later evaluating the related objects of the job_instance.
.prefetch_related(..) is therefore useful when you want to fetch the related objects of multiple objects in bulk.

Tastypie: queryset = Model.objects.all()

I am newbie to Tastypie. I see that tastypie call Django Models using queryset and displays data.
My question is: if Tastypie builds the statement queryset = < DJANGO-MODEL >.objects.all(),
will it put a tremendous load on the database/backend if there are 100 million objects?
class RestaurentsResource(ModelResource):
class Meta:
queryset = Restaurents.objects.all()
print queryset
resource_name = 'restaurents'

Django querysets are lazy: https://docs.djangoproject.com/en/dev/topics/db/queries/#querysets-are-lazy, so no database activity will be carried out until the queryset is evaluated.
If you return all 1000 objects from your REST interface, then a 'tremendous' load will be placed on your server, usually pagination: http://django-tastypie.readthedocs.org/en/latest/paginator.html or similar is used to prevent this.
Calling print on the queryset as in the example class above, will force evaluation. Doing this in production code is a bad idea, although it can be handy when debugging or as a learning tool.

The two other answers are correct in terms of QuerySets being lazy. But on top of that, the queryset you specify in the Meta class is the base for the query. In Django, a QuerySet is essentially the representation of a database query, but is not executed. QuerySets can be additionally filtered before a query is executed.
So you could have code that looks like:
Restaurant.objects.all().filter(attribute1=something).filter(attribute2=somethindelse
Tastypie just uses the QuerySet you provide as the base. On each API access, it adds additional filters to the base before executing the new query. Tastypie also handles some pagination, so you can get paginated results so not every row is returned.
While using all() is very normal, this feature is most useful if you want to limit your Tastypie results. Ie, if your Restaurant resource has a 'hidden' field, you might set:
class Meta:
queryset = Restaurant.objects.filter(hidden=False)
All queries generated by the API will use the given queryset as the base, and won't show any rows where 'hidden=True'.

Django QuerySet objects are evaluated lazily, that is - the result is fetched from the db when it is really needed. In this case, queryset = Restaurents.objects.all() create a QuerySet that has not yet been evaluated.
The default implementation of ModelResource usually forces the queryset to be evaluated at dehydration time or paging. The first one requires model objects to be passed, the other one slices the queryset.
Custom views, authorization, or filtering methods can force the evaluation earlier.
That said, after doing all the filtering and paging, the results' list fetched is considerably smaller that the overall amount of data in the database.

Django many-to-many generic relationship

I think I need to create a 'many-to-many generic relationship'.
I have two types of Participants:
class MemberParticipant(AbstractParticipant):
class Meta:
app_label = 'participants'
class FriendParticipant(AbstractParticipant):
"""
Abstract participant common information shared for all rewards.
"""
pass
These Participants can have 1 or more rewards of 2 different kinds (rewards model is from another app):
class SingleVoucherReward(AbstractReward):
"""
Single-use coupons are coupon codes that can only be used once
"""
pass
class MultiVoucherReward(AbstractReward):
"""
A multi-use coupon code is a coupon code that can be used unlimited times.
"""
So now I need to link these all up. This is how I was thinking of creating the relationship (see below) would this work, any issues you see?
Proposed linking model below:
class ParticipantReward(models.Model):
participant_content_type = models.ForeignKey(ContentType, editable=False,
related_name='%(app_label)s_%(class)s_as_participant',
)
participant_object_id = models.PositiveIntegerField()
participant = generic.GenericForeignKey('participant_content_type', 'participant_object_id')
reward_content_type = models.ForeignKey(ContentType, editable=False,
related_name='%(app_label)s_%(class)s_as_reward',
)
reward_object_id = models.PositiveIntegerField()
reward = generic.GenericForeignKey('reward_content_type', 'reward_object_id')
Note: I'm using Django 1.6

Your approach is exactly the right way to do it given your existing tables. While there's nothing official (this discussion, involving a core developer in 2007, appears not to have gone anywhere), I did find this blog post which takes the same approach (and offers it in a third-party library), and there's also a popular answer here which is similar, except only one side of the relationship is generic.
I'd say the reason this functionality has never made it into django's trunk is that while it's a rare requirement, it's fairly easy to implement using the existing tools. Also, the chance of wanting a custom "through" table is probably quite high so most end-user implementations are going to involve a bit of custom code anyway.
The only other potentially simpler approach would be to have base Participant and Reward models, with the ManyToMany relationship between those, and then use multi-table inheritance to extend these models as Member/Friend etc.
Ultimately, you'll just need to weigh up the complexity of a generic relation versus that of having your object's data spread across two models.

Late reply, but I found this conversation when looking for a way to implement generic m2m relations and felt my 2 cents would be helpful for future googlers.
As Greg says, the approach you chose is a good way to do it.
However, I would not qualify generic many to many as 'easy to implement using existing tools' when you want to use features such as reverse relations or prefetching.
The 3rd party app django-genericm2m is nice but has several shortcomings in my opinion (the fact that the 'through' objects are all in the same database table by default and that you don't have 'add' / 'remove' methods - and therefore bulk add/remove).
With that in view, because I needed something to implement generic many-to-many relations 'the django way' and also because I wanted to learn a little bit about django internals, I recently released django-gm2m. It has a very similar API to django's built-in GenericForeignKey and ManyToManyField (with prefetching, through models ...) and adds deletion behavior customisation. The only thing it lacks for the moment is a suitable django admin interface.

List of parents objects and their children with fewer queries

I've got a Django view that I'm trying to optimise. It shows a list of parent objects on a page, along with their children. The child model has the foreign key back to the parent, so select_related doesn't seem to apply.
class Parent(models.Model):
name = models.CharField(max_length=31)
class Child(models.Model):
name = models.CharField(max_length=31)
parent = models.ForeignKey(Parent)
A naive implementation uses n+1 queries, where n is the number of parent objects, ie. one query to fetch the parent list, then one query to fetch the children of each parent.
I've written a view that does the job in two queries - one to fetch the parent objects, another to fetch the related children, then some Python (that I'm far too embarrassed to post here) to put it all back together again.
Once I found myself importing the standard library's collections module I realised that I was probably doing it wrong. There is probably a much easier way, but I lack the Django experience to find it. Any pointers would be much appreciated!

Add a related_name to the foreign key, then use the prefetch_related method which added to Django 1.4:
Returns a QuerySet that will automatically retrieve, in a single
batch, related objects for each of the specified lookups.
This has a similar purpose to select_related, in that both are
designed to stop the deluge of database queries that is caused by
accessing related objects, but the strategy is quite different:
select_related works by creating a SQL join and including the fields
of the related object in the SELECT statement. For this reason,
select_related gets the related objects in the same database query.
However, to avoid the much larger result set that would result from
joining across a 'many' relationship, select_related is limited to
single-valued relationships - foreign key and one-to-one.
prefetch_related, on the other hand, does a separate lookup for each
relationship, and does the 'joining' in Python. This allows it to
prefetch many-to-many and many-to-one objects, which cannot be done
using select_related, in addition to the foreign key and one-to-one
relationships that are supported by select_related. It also supports
prefetching of GenericRelation and GenericForeignKey.
class Parent(models.Model):
name = models.CharField(max_length=31)
class Child(models.Model):
name = models.CharField(max_length=31)
parent = models.ForeignKey(Parent, related_name='children')
>>> Parent.objects.all().prefetch_related('children')
All the relevant children will be fetched in a single query, and used
to make QuerySets that have a pre-filled cache of the relevant
results. These QuerySets are then used in the self.children.all()
calls.
Note 1 that, as always with QuerySets, any subsequent chained methods which imply a different database query will ignore previously
cached results, and retrieve data using a fresh database query.
Note 2 that if you use iterator() to run the query, prefetch_related() calls will be ignored since these two
optimizations do not make sense together.

If you ever need to work with more than 2 levels at once, you can consider a different approach to storing trees in db using MPTT
In a nutshell, it adds data to your model which are updated during updates and allow a much more efficient retrieval.

Actually, select_related is what you are looking for. select_related creates a JOIN so that all the data that you need is fetched in one statement. prefetch_related runs all the queries at once then caches them.
The trick here is to "join in" only what you absolutely need to in order to reduce the performance penalty of the join. "What you absolutely need to" is the long way of saying that you should pre-select only the fields that you will read later in your view or template. There is good documentation here: https://docs.djangoproject.com/en/1.4/ref/models/querysets/#select-related
This is a snippet from one of my models where I faced a similar problem:
return QuantitativeResult.objects.select_related(
'enrollment__subscription__configuration__analyte',
'enrollment__subscription__unit',
'enrollment__subscription__configuration__analyte__unit',
'enrollment__subscription__lab',
'enrollment__subscription__instrument_model'
'enrollment__subscription__instrument',
'enrollment__subscription__configuration__method',
'enrollment__subscription__configuration__reagent',
'enrollment__subscription__configuration__reagent__manufacturer',
'enrollment__subscription__instrument_model__instrument__manufacturer'
).filter(<snip, snip - stuff edited out>)
In this pathological case, I went down from 700+ queries to just one. The django debug toolbar is your friend when it comes to this sort of issue.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Django haystack whoosh super slow - python

Related

How to optimize DRF number queries when using properties in serializers fields?

Prefech many to many relation for one class instance

Tastypie: queryset = Model.objects.all()

Django many-to-many generic relationship

List of parents objects and their children with fewer queries

Categories

Resources