Querying a list in mongoengine; contains vs in - python

I have a ListField in a model holding ids (ReferenceField), and I need to query whether a certain id is in that list. AFAIK I have two options for this:
Model.objects.filter(refs__contains='59633cad9d4bc6543aab2f39')
or:
Model.objects.filter(refs__in=['59633cad9d4bc6543aab2f39'])
Which one is the most efficient for this use case?
The model looks like:
class Model(mongoengine.Document):
    refs = mongoengine.ListField(mongoengine.ReferenceField(SomeOtherModel))
From what I can read in the mongoengine documentation, contains is really a string query, but surprisingly it works here as well. I'm guessing that __in is more efficient since it should be optimized for lists, or am I wrong?

Under the covers, string queries are normally all regex queries, so they would be less efficient. However, reference fields are an exception! The two queries compile as follows:
Model.objects.filter(refs__contains="5305c92956c02c3f391fcaba")._query
{'refs': ObjectId('5305c92956c02c3f391fcaba')}
Which is a direct lookup.
Model.objects.filter(refs__in=["5305c92956c02c3f391fcaba"])._query
{'refs': {'$in': [ObjectId('5305c92956c02c3f391fcaba')]}}
This is probably less efficient, but the difference would be extremely marginal. The biggest impact comes from the number of documents and whether or not the refs field has an index.
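If lookups on refs are frequent, declaring an index on the field is what makes either form fast. A minimal sketch, assuming the same models as in the question (the meta index declaration is standard MongoEngine; treat the rest as illustrative):

import mongoengine

class SomeOtherModel(mongoengine.Document):
    name = mongoengine.StringField()

class Model(mongoengine.Document):
    refs = mongoengine.ListField(mongoengine.ReferenceField(SomeOtherModel))
    meta = {'indexes': ['refs']}  # index the list of references

# ._query shows the raw MongoDB filter each form compiles to
print(Model.objects.filter(refs__contains='5305c92956c02c3f391fcaba')._query)
print(Model.objects.filter(refs__in=['5305c92956c02c3f391fcaba'])._query)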

Related

Different behavior between multiple nested lookups inside .filter and .exclude

What's the difference between having multiple nested lookups inside queryset.filter and queryset.exclude?
For example, car ratings: a user can create ratings of multiple types for any car.
class Car(Model):
    ...

class Rating(Model):
    type = ForeignKey('RatingType')  # names like engine, design, handling
    user = ...  # user
Let's try to get a list of cars without a rating by user "A" of type "design".
Approach 1
car_ids = Car.objects.filter(
    rating__user="A", rating__type__name="design"
).values_list('id', flat=True)
Car.objects.exclude(id__in=car_ids)
Approach 2
Car.objects.exclude(
    rating__user="A", rating__type__name="design"
)
Approach 1 works well for me, whereas Approach 2 appears to exclude more cars. My suspicion is that a nested lookup inside exclude does not behave like AND (for the rating), but rather like OR.
Is that true? If not, why do these two approaches result in different querysets?
Regarding filter, "multiple parameters are joined via AND in the underlying SQL statement". Your first approach results not in one but in two SQL queries roughly equivalent to:
SELECT ... WHERE rating.user='A' AND rating.type.name='design';
SELECT ... WHERE car.id NOT IN (id1, id2, id3 ...);
Here's the part of the documentation that answers your question very precisely regarding exclude:
https://docs.djangoproject.com/en/stable/ref/models/querysets/#exclude
The evaluated SQL query would look like:
SELECT ... WHERE NOT (rating.user='A' AND rating.type.name='design')
Nested lookups inside filter and exclude behave similarly and use AND conditions. At the end of the day, most of the time, your two approaches are indeed equivalent... except that the Car table might have been updated between the first and second query of Approach 1.
Are you sure that's not the case? To be sure, maybe try wrapping the two lines of Approach 1 in a transaction.atomic block? In any case, your second approach is probably the best (the fewer queries, the better).
If you have any doubt, you can also have a look at the SQL each queryset evaluates to.
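A minimal sketch of that check, using the standard str(queryset.query) representation and the model names from the question:

qs1 = Car.objects.exclude(
    id__in=Car.objects.filter(
        rating__user="A", rating__type__name="design"
    ).values_list('id', flat=True)
)
qs2 = Car.objects.exclude(rating__user="A", rating__type__name="design")

# str(qs.query) prints the SQL Django will run, making any
# difference between the two approaches directly visible.
print(qs1.query)
print(qs2.query)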

Index of row looping over django queryset [duplicate]

I have a QuerySet, let's call it qs, which is ordered by some attribute which is irrelevant to this problem. Then I have an object, let's call it obj. Now I'd like to know at what index obj is in qs, as efficiently as possible. I know that I could use .index() from Python or possibly loop through qs comparing each object to obj, but what is the best way to go about doing this? I'm looking for high performance and that's my only criterion.
Using Python 2.6.2 with Django 1.0.2 on Windows.
If you're already iterating over the queryset and just want to know the index of the element you're currently on, the compact and probably the most efficient solution is:
for index, item in enumerate(your_queryset):
    ...
However, don't use this if you have a queryset and an object obtained by some unrelated means, and want to learn the position of this object in the queryset (if it's even there).
If you just want to know where your object sits amongst all others (e.g. when determining rank), you can do it quickly by counting the objects before it:
index = MyModel.objects.filter(sortField__lt=myObject.sortField).count()
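Note that this assumes sortField values are unique. If ties are possible, a hedged variant that breaks ties on the primary key (Q is standard Django; the field names are taken from the snippet above):

from django.db.models import Q

# Count rows sorting strictly before myObject, breaking ties on pk
index = MyModel.objects.filter(
    Q(sortField__lt=myObject.sortField)
    | Q(sortField=myObject.sortField, pk__lt=myObject.pk)
).count()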
Assuming for the purpose of illustration that your models are standard with a primary key id, then evaluating
list(qs.values_list('id', flat=True)).index(obj.id)
will find the index of obj in qs. While the use of list evaluates the queryset, it evaluates not the original queryset but a derived queryset. This evaluation runs a SQL query to get the id fields only, not wasting time fetching other fields.
Django QuerySets are lazy and behave like generators rather than lists (for further details, see the Django documentation on QuerySets).
As such, there is no shortcut to get the index of an element, and I think a plain iteration is the best way to do it.
For starters, I would implement your requirement in the simplest way possible (like iterating); if you really have performance issues, then I would use some different approach, like building a queryset with a smaller number of fields, or whatever.
In any case, the idea is to leave such tricks as late as possible, when you definitely know you need them.
Update: you may want to use some SQL statement directly to get the row number (something like ROW_NUMBER()). However, Django's ORM does not support this natively and you have to use a raw SQL query (see the documentation). I think this could be the best option, but again, only if you really see a real performance issue.
A simple Pythonic way to find the index of an element in a queryset:
(*qs,).index(instance)
This unpacks the queryset into a list, then uses Python's built-in index method to determine its position.
You can do this using queryset.extra(…) and some raw SQL like so:
queryset = queryset.order_by("id")
record500 = queryset[500]

numbered_qs = queryset.extra(select={
    'queryset_row_number': 'ROW_NUMBER() OVER (ORDER BY "id")'
})

from django.db import connection
cursor = connection.cursor()
cursor.execute(
    "WITH OrderedQueryset AS (" + str(numbered_qs.query) + ") "
    "SELECT queryset_row_number FROM OrderedQueryset WHERE id = %s",
    [record500.id]
)
index = cursor.fetchall()[0][0]
index == 501  # because ROW_NUMBER() is 1-indexed, not 0-indexed
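On newer Django versions (2.0+), window expressions can express the same row numbering without extra() or hand-written SQL. A hedged sketch under the same assumptions as above (older Django cannot filter on a window annotation directly, so this finds the row in Python):

from django.db.models import F, Window
from django.db.models.functions import RowNumber

numbered = MyModel.objects.annotate(
    queryset_row_number=Window(expression=RowNumber(), order_by=F('id').asc())
)

# ROW_NUMBER() is 1-indexed; scan for the object's position
index = next(m.queryset_row_number for m in numbered if m.pk == record500.pk)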

Django: queryset.count() is significantly slower on chained filters than single filters regardless of returned query size--is there a solution?

EDIT: Best solution thanks to Hakan--
queriedForms.filter(
    pk__in=list(
        formtype.form_set.all().filter(
            formrecordattributevalue__record_value__contains=constraint['TVAL'],
            formrecordattributevalue__record_attribute_type__pk=rtypePK
        ).values_list('pk', flat=True)
    )
).count()
I tried more of his suggestions, but I can't avoid an INNER JOIN--this seems to be a stable solution that does get me small but predictable speed increases across the board. Look through his answer for more details!
I've been struggling with a problem I haven't seen an answer to online.
When chaining two filters in Django, e.g.:
masterQuery = bigmodel.relatedmodel_set.all()
masterQuery = masterQuery.filter(name__contains="test")
masterQuery.count()
# returns 100,000 results in < 1 second
# test filter--all 100,000+ names have "test x" where x is 0-9

storedCount = masterQuery.filter(name__contains="9").count()
# returns ~50,000 results but takes 5-6 seconds
Trying a slightly different way:
masterQuery = masterQuery.filter(name__contains="9")
masterQuery.count()
# also returns ~50,000 results in 5-6 seconds
Performing an & merge seems to improve performance ever so slightly, e.g.:
masterQuery = bigmodel.relatedmodel_set.all()
masterQuery = masterQuery.filter(name__contains="test")
(masterQuery & masterQuery.filter(name__contains="9")).count()
It seems as if count() takes significantly longer beyond a single filter in a queryset.
I assume it may have something to do with MySQL, which apparently doesn't optimize nested statements well--and I assume that the two filters are creating a nested query that slows MySQL down, regardless of the SELECT COUNT(*) Django uses.
So my question is: is there any way to speed this up? I'm getting ready to do a lot of regular nested querying using only queryset counts (I don't need the actual model values) and without database hits to load the models. E.g. I don't need to load 100,000 models from the database, I just need to know there are 100,000 there. It's obviously much faster to do this through querysets than len(), but even at 5 secs a count, when I'm running 40 counts for an entire complex query, that's 3+ minutes--I'd prefer it be under a minute. Am I just fantasizing, or does someone have a suggestion as to how this could be accomplished outside of increasing the server's processor speed?
EDIT: If it's helpful--the time.clock() speed is .3 secs for the chained filter() count--the actual time to console and Django view output is 5-6 s.
EDIT 2: To answer any questions about indexing, the filters use both an indexed and a non-indexed value for each link in the chain:
mainQuery = bigmodel.relatedmodel_set.all()
mainQuery = mainQuery.filter(
    reverseforeignkeytestmodel__record_value__contains="test",
    reverseforeignkeytestmodel__record_attribute_type__pk=1
)
# where "record_attribute_type" is another foreign key being used as a filter
mainQuery.count()  # produces 100,000 results in < 1 sec

mainQuery.filter(
    reverseforeignkeytestmodel__record_value__contains="9",
    reverseforeignkeytestmodel__record_attribute_type__pk=5
).count()
# produces ~50,000 results in 5-6 secs
So each filter in the chain is functionally similar: it is an AND filter(condition, condition) where one condition is indexed and the other is not. I can't index both conditions.
Edit 3:
Similar queries that produce smaller result sets, e.g. < 10,000, are much faster regardless of the nesting--e.g. the first filter in the chain produces 10,000 results in ~<1 sec, and the second filter in the chain will produce 5,000 results in ~<1 sec.
Edit 4:
Still not working based on #Hakan's solution
mainQuery = bigmodel.relatedmodel_set.all()

# Set up the first filter as normal
mainQuery = mainQuery.filter(
    reverseforeignkeytestmodel__record_value__contains="test",
    reverseforeignkeytestmodel__record_attribute_type__pk=1
)

# Grab a values list for the second chained filter instead of chaining it
values = bigmodel.relatedmodel_set.all().filter(
    reverseforeignkeytestmodel__record_value__contains="test",
    reverseforeignkeytestmodel__record_attribute_type__pk=8
).values_list('pk', flat=True)

# Filter the first query with the values_list rather than a second chained filter
mainQuery = mainQuery.filter(pk__in=values)
mainQuery.count()
# Still takes on average the same amount of time after enough test runs--
# seems to be slightly faster than average--similar to the
# (querysetA & querysetB) merge solution I tried.
It's possible I did this wrong--but the counts are consistent with the new values_list filter technique, e.g. I'm getting the same number of results. So it's definitely working--but seemingly taking the same amount of time.
EDIT 5:
Also based on #Hakan's solution with some slight tweaks
mainQuery.filter(
    pk__in=list(
        formtype.form_set.all().filter(
            formrecordattributevalue__record_value__contains=constraint['TVAL'],
            formrecordattributevalue__record_attribute_type__pk=rtypePK
        ).values_list('pk', flat=True)
    )
).count()
This seems to operate faster for larger result sets, e.g. > 50,000, but is actually much slower on smaller result sets, e.g. < 50,000--where they used to be < 1 sec--sometimes 2-3 running in 1 second with chained filtering, they now all take 1 second individually. Essentially the speed gains in the larger querysets have been nullified by the speed losses in the smaller ones.
I'm still going to try to break up the queries further, as per his suggestion--but I'm not sure I'm able to. I'll update again (possibly on Monday) when I figure that out and let everyone interested know the progress.
Not sure if this helps, since I don't have a MySQL project to test with.
The QuerySet API reference contains a section about the performance of nested queries.
Performance considerations

Be cautious about using nested queries and understand your database server's performance characteristics (if in doubt, benchmark!). Some database backends, most notably MySQL, don't optimize nested queries very well. It is more efficient, in those cases, to extract a list of values and then pass that into the second query. That is, execute two queries instead of one:

values = Blog.objects.filter(
    name__contains='Cheddar').values_list('pk', flat=True)
entries = Entry.objects.filter(blog__in=list(values))

Note the list() call around the Blog QuerySet to force execution of the first query. Without it, a nested query would be executed, because QuerySets are lazy.
So, maybe you can improve the performance by trying something like this:
masterQuery = bigmodel.relatedmodel_set.all()
pks = list(masterQuery.filter(name__contains="test").values_list('pk', flat=True))
count = masterQuery.filter(pk__in=pks, name__contains="9").count()
Since your initial MySQL performance is so slow, it might even be faster to do the second step in Python instead of in the database.
names = masterQuery.filter(name__contains='test').values_list('name', flat=True)
count = sum('9' in n for n in names)
Edit:
From your updates, I see that you are querying fields on related models, which results in multiple SQL JOIN operations. That's likely a big reason why the query is slow.
To avoid joins, you could try something like this. The goal is to avoid doing deeply chained lookups across relations.
# Query only RelatedModel, avoiding a JOIN
related_pks = RelatedModel.objects.filter(
    record_value__contains=constraint['TVAL'],
    record_attribute_type=rtypePK,
).values_list('pk', flat=True)

# list(queryset) runs the database query, giving a list of integers
pks_list = list(related_pks)

# Use that result to filter the main model
count = MainModel.objects.filter(
    formrecordattributevalue__in=pks_list
).count()
I'm assuming that the relation is defined as a foreign key from MainModel to RelatedModel.
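To confirm that the JOIN is actually gone from the final count, a quick hedged check using the standard queryset.query representation:

# Should show a single-table WHERE ... IN (...) query, with no JOIN
print(MainModel.objects.filter(formrecordattributevalue__in=pks_list).query)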

Most Efficient Way to get list of values from Django Queryset

I can see quite a few different options for doing this and would like some feedback on the most efficient or 'best practice' method.
I get a Django Queryset with filter()
c_layer_points = models.layer_points.objects.filter(
    location_id=c_location.pk,
    season_id=c_season.pk,
    line_path_id=c_line_path.pk,
    radar_id=c_radar.pk,
    layer_id__in=c_layer_pks,
    gps_time__gte=start_gps,
    gps_time__lte=stop_gps
)
This queryset could be very large (hundreds of thousands of rows).
Now what needs to happen is a conversion to lists and encoding to JSON.
Options (that I've seen in my searches):
Loop over the queryset
Example:
gps_time = [lp.gps_time for lp in c_layer_points];
twtt = [lp.twtt for lp in c_layer_points];
Use values() or values_list()
Use iterator()
In the end I would like to encode as json something like this format:
{'gps_time': [list of all gps times], 'twtt': [list of all twtt]}
Any hints on the best way to do this would be great, Thanks!
You might not be able to get the required format from the ORM. However, you can efficiently do something like this:
c_layer_points = models.layer_points.objects.filter(
    location_id=c_location.pk,
    season_id=c_season.pk,
    line_path_id=c_line_path.pk,
    radar_id=c_radar.pk,
    layer_id__in=c_layer_pks,
    gps_time__gte=start_gps,
    gps_time__lte=stop_gps
).values_list('gps_time', 'twtt')
and now split the tuples into two lists via tuple unpacking:
gps_time, twtt = zip(*c_layer_points)
result = dict(gps_time=list(gps_time), twtt=list(twtt))
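From there, the JSON string in the requested shape is a single json.dumps call; a short usage sketch (payload is an illustrative name):

import json

payload = json.dumps({'gps_time': list(gps_time), 'twtt': list(twtt)})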
I suggest you iterate through the queryset and build the JSON dictionary element by element.
Normally, Django's QuerySets are lazy, which means they are only loaded into memory when they are accessed. If you build the entire list, e.g. gps_time = [lp.gps_time for lp in c_layer_points], you will have all those objects in memory (thousands). You'll be better off doing a simple iteration:
for item in c_layer_points:
    # convert item to JSON and add it to the json dict
As an aside, you don't need the ; character at the end of lines in Python :)
Hope this helps!

Cleaner way to query on a dynamic number of columns in Django?

In my case, I have a number of column names coming from a form. I want to filter to make sure they're all true. Here's how I currently do it:
for op in self.cleaned_data['options']:
    cars = cars.filter((op, True))
Now it works, but there are a possible ~40 columns to be tested, and it therefore doesn't appear very efficient to keep querying.
Is there a way I can condense this into one filter query?
Build the query as a dictionary and use the ** operator to unpack the options as keyword arguments to the filter method.
op_kwargs = {}
for op in self.cleaned_data['options']:
    op_kwargs[op] = True
cars = CarModel.objects.filter(**op_kwargs)
This is covered in the Django documentation and has been covered on SO as well.
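The same idea fits in a single statement with a dict comprehension; a small usage sketch under the same assumptions (CarModel and cleaned_data as above):

cars = CarModel.objects.filter(
    **{op: True for op in self.cleaned_data['options']}
)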
Django's query sets are lazy, so what you're currently doing is actually pretty efficient. The database won't be hit until you try to access one of the fields in the QuerySet... assuming, that is, that you didn't edit out some code, and it is effectively like this:
cars = CarModel.objects.all()
for op in self.cleaned_data['options']:
    cars = cars.filter((op, True))
More information here.
