Django: optimizing a query with spread data

I have Order objects and OrderOperation objects that represent an action on an Order (creation, modification, cancellation).
Conceptually, an order has 1 to many order operations. Each time there is an operation on the order, the total is computed in that operation. This means that when I need the total of an order, I just read the last order operation's total.
The simplified code
from decimal import Decimal
from typing import Optional

from django.db import models


class OrderOperation(models.Model):
    order = models.ForeignKey('Order', on_delete=models.CASCADE)
    total = models.DecimalField(max_digits=9, decimal_places=2)


class Order(models.Model):
    @property
    def last_operation(self) -> Optional['OrderOperation']:
        try:
            qs = self.orderoperation_set.all()
            return qs[len(qs) - 1]
        except AssertionError:
            # raised on negative indexing (no operation); IndexError cannot happen
            return None

    @property
    def total(self) -> Optional[Decimal]:
        last_operation = self.last_operation
        return last_operation.total if last_operation else None
The issue
Since I have lots of orders, each time I want to do a simple filter like "orders that have a total lower than 5€", it takes a long time, because I have to browse all orders with the following, obviously bad, query:
all_objects = Order.objects.all()
Order.objects.prefetch_related('orderoperation_set').filter(
    pk__in=[o.pk for o in all_objects if o.total <= some_value])
My current ideas / what I tried
Data denormalization?
I could simply create a total attribute on Order, and copy the operation total to the order total every time an operation is created.
Then, Order.objects.filter(total__lte=some_value) would work.
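If you do go that route, here is a minimal sketch of the idea, assuming a nullable total column on Order that replaces the property and is refreshed whenever an operation is saved (names here are illustrative, not from the original code):
class Order(models.Model):
    # denormalized copy of the latest operation's total; None until the first operation
    total = models.DecimalField(max_digits=9, decimal_places=2, null=True)

class OrderOperation(models.Model):
    order = models.ForeignKey(Order, on_delete=models.CASCADE)
    total = models.DecimalField(max_digits=9, decimal_places=2)

    def save(self, *args, **kwargs):
        super().save(*args, **kwargs)
        # a single UPDATE keeps the parent in sync without recursing into Order.save()
        Order.objects.filter(pk=self.order_id).update(total=self.total)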
However, before duplicating data in my database, I'd like to be sure there is not an easier/cleaner solution.
Using annotate() method?
I somehow expected to be able to do: Order.objects.annotate(total=something_magical_here).filter(total__lte=some_value). It seems it's not possible.
Filtering separately then matching?
order_operations = OrderOperation.objects.filter(total__lte=some_value)
orders = Order.objects.filter(orderoperation__in=order_operations)
This is very fast, but the filtering is bad, since I filtered on all operations here, not just each order's last operation. This is wrong.
Any other idea? Thanks.

Using annotate() method
It seems it's not possible.
Of course, it is possible ;) You can use subqueries or some clever conditional expressions. Assuming that you want to get the total amount from the last order operation, here is an example with a subquery:
from django.db.models import Subquery, OuterRef
orders = Order.objects.annotate(
    total=Subquery(  # [1]
        OrderOperation.objects
        .filter(order_id=OuterRef("pk"))  # [2]
        .order_by('-id')  # [3]
        .values('total')  # [4]
        [:1]  # [5]
    )
)
Explanation of the code above:
[1] We are adding a new field to the result rows, called total, that will be filled in by the subquery. You can access it like any other field of the Order model in this queryset (after evaluating it, on model instances, or in filtering and other annotations). You can learn how annotation works from the Django docs.
[2] The subquery should only be invoked for the operations of the current order. OuterRef will be replaced with a reference to the selected field in the resulting SQL query.
[3] We order by operation id descending, because we want the latest one. If there is another field in your operations that you want to order by instead (like a creation date), use it here.
[4] The subquery should only return the total value of the operation.
[5] We want only one element. It is fetched using slice notation instead of a plain index, because indexing a Django queryset evaluates it immediately. Slicing only adds a LIMIT clause to the SQL query without executing it, and that is what we want.
Now you can use:
orders.filter(total__lte=some_value)
to fetch only the orders that you want. You can also use that annotation in ordering or in further annotations.
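For example, the annotated field also works in ordering and other lookups (a sketch):
cheapest_first = orders.order_by('total')
no_operations = orders.filter(total__isnull=True)  # orders without any operation yet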

Related

Index of row looping over django queryset [duplicate]

I have a QuerySet, let's call it qs, which is ordered by some attribute that is irrelevant to this problem. Then I have an object, let's call it obj. Now I'd like to know what index obj has in qs, as efficiently as possible. I know that I could use .index() from Python or possibly loop through qs comparing each object to obj, but what is the best way to go about doing this? I'm looking for high performance and that's my only criterion.
Using Python 2.6.2 with Django 1.0.2 on Windows.
If you're already iterating over the queryset and just want to know the index of the element you're currently on, the compact and probably the most efficient solution is:
for index, item in enumerate(your_queryset):
...
However, don't use this if you have a queryset and an object obtained by some unrelated means, and want to learn the position of this object in the queryset (if it's even there).
If you just want to know where your object sits amongst all others (e.g. when determining rank), you can do it quickly by counting the objects before it:
index = MyModel.objects.filter(sortField__lt = myObject.sortField).count()
Assuming for the purpose of illustration that your models are standard with a primary key id, then evaluating
list(qs.values_list('id', flat=True)).index(obj.id)
will find the index of obj in qs. While the use of list evaluates the queryset, it evaluates not the original queryset but a derived queryset. This evaluation runs a SQL query to get the id fields only, not wasting time fetching other fields.
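If the object might not be in the queryset at all, here is a hedged variant of the same idea (list.index raises ValueError for a missing value):
def index_in_queryset(qs, obj):
    # fetch only the primary keys, then locate the object's position;
    # returns None when obj is not in qs
    ids = list(qs.values_list('id', flat=True))
    try:
        return ids.index(obj.id)
    except ValueError:
        return None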
QuerySets in Django are actually generators, not lists (for further details, see Django documentation on QuerySets).
As such, there is no shortcut to get the index of an element, and I think a plain iteration is the best way to do it.
For starters, I would implement your requirement in the simplest way possible (like iterating); if you really have performance issues, then I would use some different approach, like building a queryset with a smaller number of fields, or whatever.
In any case, the idea is to leave such tricks as late as possible, when you definitely know you need them.
Update: you may want to use some SQL statement directly to get the row number (something like ROW_NUMBER()). However, Django's ORM does not support this natively and you have to use a raw SQL query (see the documentation). I think this could be the best option, but again - only if you really see a real performance issue.
There is a simple Pythonic way to get the index of an element in a queryset:
(*qs,).index(instance)
This unpacks the queryset into a list, then uses the built-in Python index function to determine its position.
You can do this using queryset.extra(…) and some raw SQL like so:
queryset = queryset.order_by("id")
record500 = queryset[500]

numbered_qs = queryset.extra(select={
    'queryset_row_number': 'ROW_NUMBER() OVER (ORDER BY "id")'
})

from django.db import connection
cursor = connection.cursor()
cursor.execute(
    "WITH OrderedQueryset AS (" + str(numbered_qs.query) + ") "
    "SELECT queryset_row_number FROM OrderedQueryset WHERE id = %s",
    [record500.id]
)
index = cursor.fetchall()[0][0]

index == 501  # because ROW_NUMBER() is 1-indexed, not 0-indexed
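On Django 2.0+ you can get the same row numbers without extra() or raw SQL by using window functions; a sketch, assuming the same ordering by id:
from django.db.models import F, Window
from django.db.models.functions import RowNumber

numbered_qs = queryset.annotate(
    queryset_row_number=Window(expression=RowNumber(), order_by=F('id').asc())
)
Note that most databases do not let you filter directly on a window expression, so to look up a single row by id you may still need a wrapper query like the CTE above.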

Django compare queryset from different databases

I need to compare 2 querysets from the same model from 2 different databases.
I expect the difference between them. In this case I grab only one column (a CharField) from two databases and want to compare these "lists"; ideally I would work with sets and set difference methods.
But I can't simply subtract querysets, and set(queryset) and list(queryset) also give me nothing (not an error), i.e.
diff_set = set(articles1) - set(articles2)
I switch DBs on the fly, make 2 querysets and try to compare them (filter, or exclude):
articles1 = list(Smdocuments.objects.using('tmp1').only('id').filter(doctype__exact='CQ'))
# right connection
connections.databases['tmp2']['HOST'] = db2.host
connections.databases['tmp2']['NAME'] = db2.name
articles2 = list(Smdocuments.objects.using('tmp2').only('id').filter(doctype__exact='CQ'))
# okay to chain Smdocuments objects, gives all the entries
all = list(chain(articles1, articles2))
# got nothing, even len(diff_set) is none
diff_set = set(articles1) - set(articles2)
# this one raise error Subqueries aren't allowed across different databases.
articles_exclude = Smdocuments.objects.using('tmp1').only('id').filter(doctype__exact='CQ')
len(articles1)
diff_ex = Smdocuments.objects.using('tmp2').only('id').filter(doctype__exact='CQ').exclude(id__in=articles_exclude)
len(diff_ex)
diff_ex raise an error
Subqueries aren't allowed across different databases. Force the inner
query to be evaluated using list(inner_query).
So, "Model objects" not so easy to manipulate, and querysets between difference databases as well.
I see, thats not a good db scheme, but it's another application with distributed db, and I need to compare them.
It's would be enough to compare by one column, but probably compare full queryset will work for future.
Or, should I convert queryset to list and compare raw data?
Your question is really unclear about what you actually expect, but here are a couple hints anyway:
First, model instances (assuming they are instances of the same model of course) compare on their primary key value, which is also used as hash for dicts and sets, so if you want to compare the underlying database records you should not work on model instances but on the raw db values as lists of tuples or dicts. You can get those using (resp.) Queryset.values_list() or Queryset.values() - not forgetting to list() them so you really get a list and not a queryset.
Which brings us to the second important point: while presenting themselves as list-likes (in that they support len(), iteration, subscripting and - with some restrictions - slicing), QuerySets are NOT lists. You can't compare two querysets (well, you can, but they compare on identity, which means that two querysets will only be equal if they are actually the very same object), and, more importantly, using a queryset as an argument to a 'field__in=' lookup will result in a SQL subquery, where passing a proper list results in a plain 'field IN (...)' where clause. This explains the error you get with the exclude(...) approach.
To make a long story short, if you want to effectively compare database rows, you want:
# the fields you want to compare records on
fields = ('field1', 'field2', 'fieldN')
rows1 = list(YourModel.objects.using('tmp1').filter(...).values_list(*fields))
rows2 = list(YourModel.objects.using('tmp2').filter(...).values_list(*fields))
# now you have two lists of tuples, so you can apply ordinary python comparisons / set operations
print(rows1 == rows2)
print(set(rows1) - set(rows2))
# etc
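Applied to the model in the question, comparing just the id column could look like this (a sketch):
ids1 = set(Smdocuments.objects.using('tmp1')
           .filter(doctype__exact='CQ')
           .values_list('id', flat=True))
ids2 = set(Smdocuments.objects.using('tmp2')
           .filter(doctype__exact='CQ')
           .values_list('id', flat=True))
only_in_db1 = ids1 - ids2  # ids present in tmp1 but missing from tmp2
only_in_db2 = ids2 - ids1  # and vice versa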

Django: queryset.count() is significantly slower on chained filters than single filters regardless of returned query size--is there a solution?

EDIT: Best solution thanks to Hakan--
queriedForms.filter(pk__in=list(formtype.form_set.all().filter(formrecordattributevalue__record_value__contains=constraint['TVAL'], formrecordattributevalue__record_attribute_type__pk=rtypePK).values_list('pk', flat=True))).count()
I tried more of his suggestions but I can't avoid an INNER JOIN--this seems to be a stable solution that does get me small but predictable speed increases across the board. Look through his answer for more details!
I've been struggling with a problem I haven't seen an answer to online.
When chaining two filters in Django e.g.
masterQuery = bigmodel.relatedmodel_set.all()
masterQuery = masterQuery.filter(name__contains="test")
masterQuery.count()
#returns 100,000 results in < 1 second
#test filter--all 100,000+ names have "test x" where x is 0-9
storedCount = masterQuery.filter(name__contains="9").count()
#returns ~50,000 results but takes 5-6 seconds
Trying a slightly different way:
masterQuery = masterQuery.filter(name__contains="9")
masterQuery.count()
#also returns ~50,000 results in 5-6 seconds
Performing an & merge seems to improve performance ever so slightly, e.g.
masterQuery = bigmodel.relatedmodel_set.all()
masterQuery = masterQuery.filter(name__contains="test")
(masterQuery & masterQuery.filter(name__contains="9")).count()
It seems as if count() takes a significantly longer time beyond a single filter in a queryset.
I assume it may have something to do with MySQL, which apparently doesn't like nested statements--and I assume that two filters are creating a nested query that slows MySQL down, regardless of the SELECT COUNT(*) Django uses.
So my question is: is there any way to speed this up? I'm getting ready to do a lot of regular nested querying using only queryset counts (I don't need the actual model values) without database hits to load the models. E.g. I don't need to load 100,000 models from the database, I just need to know there are 100,000 there. It's obviously much faster to do this through querysets than len(), but even at 5 secs a count, when I'm running 40 counts for an entire complex query, that's 3+ minutes--I'd prefer it to be under a minute. Am I just fantasizing, or does someone have a suggestion as to how this could be accomplished outside of increasing the server's processor speed?
EDIT: If it's helpful--the time.clock() speed is .3 secs for the chained filter() count--the actual time to console and django view output is 5-6s
EDIT2: To answer any questions about indexing, the filters use both an indexed and a non-indexed value for each link in the chain:
mainQuery = bigmodel.relatedmodel_set.all()
mainQuery = mainQuery.filter(reverseforeignkeytestmodel__record_value__contains="test", reverseforeignkeytestmodel__record_attribute_type__pk=1)
#Where "record_attribute_type" is another foreign key being used as a filter
mainQuery.count() #produces 100,000 results in < 1sec
mainQuery.filter(reverseforeignkeytestmodel__record_value__contains="9", reverseforeignkeytestmodel__record_attribute_type__pk=5).count()
#produces ~50,000 results in 5-6 secs
So each filter in the chain is functionally similar, it is an AND filter(condition,condition) where one condition is indexed, and the other is not. I can't index both conditions.
Edit 3:
Similar queries that result in smaller results, e.g. < 10,000 are much faster, regardless of the nesting--e.g. the first filter in the chain produces 10,000 results in ~<1sec but the second filter in the chain will produce 5,000 results in ~<1sec
Edit 4:
Still not working based on #Hakan's solution
mainQuery = bigmodel.relatedmodel_set.all()
#Setup the first filter as normal
mainQuery = mainQuery.filter(reverseforeignkeytestmodel__record_value__contains="test", reverseforeignkeytestmodel__record_attribute_type__pk=1)
#Grab a values list for the second chained filter instead of chaining it
values = bigmodel.relatedmodel_set.all().filter(reverseforeignkeytestmodel__record_value__contains="test", reverseforeignkeytestmodel__record_attribute_type__pk=8).values_list('pk', flat=True)
#filter the first query based on the values_list rather than a second filter
mainQuery = mainQuery.filter(pk__in=values)
mainQuery.count()
#Still takes on average the same amount of time after enough test runs--seems to be slightly faster than average--similar to the (quersetA & querysetB) merge solution I tried.
It's possible I did this wrong--but the counts from the new values_list filter technique are consistent, i.e. I'm getting the same number of results. So it's definitely working--but seemingly taking the same amount of time.
EDIT 5:
Also based on #Hakan's solution with some slight tweaks
mainQuery.filter(pk__in=list(formtype.form_set.all().filter(formrecordattributevalue__record_value__contains=constraint['TVAL'], formrecordattributevalue__record_attribute_type__pk=rtypePK).values_list('pk', flat=True))).count()
This seems to operate faster for larger querysets, e.g. > 50,000 results, but is actually much slower on smaller ones, e.g. < 50,000--where they used to be <1 sec, sometimes with 2-3 of them running in 1 second with chained filtering, they now all take 1 second individually. Essentially the speed gains on the larger querysets have been nullified by the speed losses on the smaller ones.
I'm still going to try to break up the queries further, as per his suggestion--but I'm not sure I'm able to. I'll update again (possibly on Monday) when I figure that out, and let everyone interested know the progress.
Not sure if this helps, since I don't have a MySQL project to test with.
The QuerySet API reference contains a section about the performance of nested queries.
Performance considerations
Be cautious about using nested queries and understand your database
server’s performance characteristics (if in doubt, benchmark!). Some
database backends, most notably MySQL, don’t optimize nested queries
very well. It is more efficient, in those cases, to extract a list of
values and then pass that into the second query. That is, execute two
queries instead of one:
values = Blog.objects.filter(
    name__contains='Cheddar').values_list('pk', flat=True)
entries = Entry.objects.filter(blog__in=list(values))
Note the list() call around the Blog QuerySet to force execution of the first query.
Without it, a nested query would be executed, because QuerySets are
lazy.
So, maybe you can improve the performance by trying something like this:
masterQuery = bigmodel.relatedmodel_set.all()
pks = list(masterQuery.filter(name__contains="test").values_list('pk', flat=True))
count = masterQuery.filter(pk__in=pks, name__contains="9").count()
Since your initial MySQL performance is so slow, it might even be faster to do the second step in Python instead of in the database.
names = masterQuery.filter(name__contains='test').values_list('name', flat=True)
count = sum('9' in n for n in names)
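If the name list is large, .iterator() streams the rows instead of caching them all in memory; a sketch of the same count:
names = (masterQuery.filter(name__contains='test')
         .values_list('name', flat=True)
         .iterator())
count = sum('9' in n for n in names)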
Edit:
From your updates, I see that you are querying fields in related models, which results in multiple SQL JOIN operations. That's likely a big reason why the query is slow.
To avoid joins, you could try something like this. The goal is to avoid doing deeply chained lookups across relations.
# query only RelatedModel, avoid JOIN
related_pks = RelatedModel.objects.filter(
    record_value__contains=constraint['TVAL'],
    record_attribute_type=rtypePK,
).values_list('pk', flat=True)

# list(queryset) will do a database query, resulting in a list of integers.
pks_list = list(related_pks)

# use that result to filter your main model.
count = MainModel.objects.filter(
    formrecordattributevalue__in=pks_list
).count()
I'm assuming that the relation is defined as a foreign key from MainModel to RelatedModel.

django - filter after slice / filter on queryset where results have been limited

Having trouble understanding why I can't filter after a slice on a queryset, and what is happening.
stuff = Stuff.objects.all()
stuff.count()  # returns 7
If I then go
extra_stuff = stuff.filter(stuff_flag=id)
extra_stuff.count()  # returns 6
everything is all good and I have my new queryset in extra_stuff, no issues. But with
stuff = Stuff.objects.all()[:3]
extra_stuff = stuff.filter(stuff_flag=id)
I get the error "Cannot filter a query once a slice has been taken."
How can I filter further on a queryset where I have limited the number of results?
You can't use filter() after you have sliced the queryset. The error is pretty explicit.
Cannot filter a query once a slice has been taken.
You could do the filter in Python
stuff = stuff.objects.all()[:3]
extra_stuff = [s for s in stuff if s.stuff_flag=='flag']
To get the number of items in extra_stuff, just use len():
extra_stuff_count = len(extra_stuff)
Doing the filtering in Python will work fine when the size of stuff is very small, as in this case. If you had a much larger slice, you could use a subquery instead, though that might have performance issues as well; you would have to test.
extra_stuff = Stuff.objects.filter(id__in=stuff, stuff_flag='flag')
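One caveat with that subquery form: MySQL does not support LIMIT inside an IN subquery, so on that backend you would force the slice to a plain list first (a sketch):
sliced_ids = list(Stuff.objects.values_list('id', flat=True)[:3])
extra_stuff = Stuff.objects.filter(id__in=sliced_ids, stuff_flag='flag')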
Django gives you that error because it's already retrieved the items from the database by that point. The filter method is only useful to refine the database query before actually executing it.
Since you're only getting three objects, you could just do the extra filtering in Python:
extra_stuff = [s for s in stuff if s.stuff_flag==id]
but I wonder why you don't do the filter before slicing.
Just do the filtering first, then create another variable and slice it, like this:
extra_stuff = stuff.objects.filter(stuff_flag=id)
the_sliced_stuff = extra_stuff[:3]
It works well
Just do 2 queries.
total_stuff = StuffClass.objects.count()
extra_stuff = StuffClass.objects.filter(stuff_flag=id)[:3]
extra_stuff_count = len(StuffClass.objects.filter(stuff_flag=id))
Note: this is fine only when extra_stuff_count is small, like 3 or 300; a bigger count needs more memory, and in that case it is better to make one more request instead.
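If you only need the number and the filtered set may be large, it is safer to let the database count instead of len(); a sketch:
extra_stuff = StuffClass.objects.filter(stuff_flag=id)[:3]
extra_stuff_count = StuffClass.objects.filter(stuff_flag=id).count()  # one extra query, no objects loaded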

django queryset runtime - get nth entry in constant time

I'm using multiple ways to get data from the DB via different Django querysets, but I would like to know the runtime for each queryset and, if possible, a better way (to maybe get data in constant time!!).
qs = MyModel.objects.order_by('-time')
qs = qs.filter(blah = blah)
to get the first entry I'm doing this:
entry = list(qs[:1])
first_entry = entry[0]
or to get 10th and last entry:
entry = list(qs)
some_entry = entry[9]
last_entry = entry[-1]
but I believe this will take O(n) time; is there any way to get the nth entry in constant time?
I don't want to use get(), as I don't know the id or any other value of the entry (it's sorted), only the position.
I may also use annotate, but this also takes O(n) runtime.
MyModel.objects.values('date').annotate(min_value=Min('value')).order_by('min_value')[0]
I know the position; I just need that entry in constant time.
From the docs:
Use a subset of Python’s array-slicing syntax to limit your QuerySet to a certain number of results. This is the equivalent of SQL’s LIMIT and OFFSET clauses.
Generally, slicing a QuerySet returns a new QuerySet – it doesn’t evaluate the query. An exception is if you use the “step” parameter of Python slice syntax.
To retrieve a single object rather than a list (e.g. SELECT foo FROM bar LIMIT 1), use a simple index instead of a slice.
https://docs.djangoproject.com/en/dev/topics/db/queries/#limiting-querysets
The part about not evaluating the queryset as you slice it is the important part.
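In other words, for the question above, a single index or a one-element slice translates to LIMIT/OFFSET and fetches just one row; a sketch reusing the question's queryset:
qs = MyModel.objects.filter(blah=blah).order_by('-time')

first_entry = qs.first()  # LIMIT 1; None if the queryset is empty
tenth_entry = qs[9]       # LIMIT 1 OFFSET 9; raises IndexError if there is no 10th row
last_entry = qs.last()    # reverses the ordering and takes LIMIT 1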
