Is it possible to use queryset in the FROM clause - python

I have a model for user's points collection:
class Rating(models.Model):
user = models.ForeignKey(User, on_delete=models.CASCADE, related_name='rating')
points = models.IntegerField()
Each user could have multiple records in this model. I need to calculate a rank of each user by sum of collected points. For the listing it's easy:
Rating.objects.values('user__username').annotate(
total_points=Sum('points')
).order_by('-total_points')
But how to get rank for a single user by his user_id? I added annotation with numbers of rows:
Rating.objects.values('user__username').annotate(
total_points=Sum('points')
).annotate(
rank=Window(
expression=RowNumber(),
order_by=[F('total_points').desc()]
)
)
it really added correct ranking numbers, but when I try to get a single user by user_id it returns a row with rank=1. It's because the filter condition goes to the WHERE clause and there is a single row with the number 1. I mean this:
Rating.objects.values('user__username').annotate(
total_points=Sum('points')
).annotate(
rank=Window(
expression=RowNumber(),
order_by=[F('total_points').desc()]
)
).filter(user_id=1)
I got the SQL query of this queryset (qs.query) like
SELECT ... FROM rating_rating WHERE ...
and inserted it into another SQL query as "rank_table" and added a condition into the outside WHERE clause:
SELECT * FROM (SELECT ... FROM rating_rating WHERE ...) AS rank_table WHERE user_id = 1;
and executed within the MySQL console. And this works exactly as I need. The question is: how to implement the same using Django ORM?

I have one solution to get what I need. I could add another field to mark records as "correct" or "incorrect" user, sort result by this field and then get the first row:
qs.annotate(
required_user=Case(
When(user_id=1, then=1),
default=0,
output_field=IntegerField(),
)
).order_by('-required_user').first()
This works. But SELECT within another SELECT seems more elegant and I would like to know is it possible with Django.

somehow someone just recently asked something about filtering on windows functions. While what you want is basically subquery (select in select), using annotation with the window function is not supported :
https://code.djangoproject.com/ticket/28333 because the annotated fields will inside the subquery :'(. One provides raw sql with query_with_params, but it is not really elegant.

Related

Django Query where one field is duplicate and another is different

I want to know if I can create a query where one field is duplicate and another one is different.
Basically I want to get all UsersNames where First Name is the same and user_id is different.
I did this
UserNames.objects.values("first_name", "user_id").annotate(ct=Count("first_name")).filter(ct__gt=0)
This will retrieve a list whit all Users
After tis, I make some post processing and create another query where I filter just the users with first_name__in=['aaa'] & user_id__in=[1, 2] to get the users with the same first_name but different user_id
Can I do this just in one query? or in a better way?
You can work with a subquery here, but it will not matter much in terms of performance I think:
from django.db.models import Exists, OuterRef, Q
UserNames.objects.filter(
Exists(UserNames.objects.filter(
~Q(user_id=OuterRef('user_id')),
first_name=OuterRef('first_name')
))
)
or prior to django-3.0:
from django.db.models import Exists, OuterRef, Q
UserNames.objects.annotate(
has_other=Exists(UserNames.objects.filter(
~Q(user_id=OuterRef('user_id')),
first_name=OuterRef('first_name')
))
).filter(has_other=True)
We thus retain UserNames objects for which there exists a UserNames object with the same first_name, and with a different user_id.

Optimise Django query with large subquery

I have a database containing Profile and Relationship models. I haven't explicitly linked them in the models (because they are third party IDs and they may not yet exist in both tables), but the source and target fields map to one or more Profile objects via the id field:
from django.db import models
class Profile(models.Model):
id = models.BigIntegerField(primary_key=True)
handle = models.CharField(max_length=100)
class Relationship(models.Model):
id = models.AutoField(primary_key=True)
source = models.BigIntegerField(db_index=True)
target = models.BigIntegerField(db_index=True)
My query needs to get a list of 100 values from the Relationship.source column which don't yet exist as a Profile.id. This list will then be used to collect the necessary data from the third party. The query below works, but as the table grows (10m+), the SubQuery is getting very large and slow.
Any recommendations for how to optimise this? Backend is PostgreSQL but I'd like to use native Django ORM if possible.
EDIT: There's an extra level of complexity that will be contributing to the slow query. Not all IDs are guaranteed to return success, which would mean they continue to "not exist" and get the program in an infinite loop. So I've added a filter and order_by to input the highest id from the previous batch of 100. This is going to be causing some of the problem so apologies for missing it initially.
from django.db.models import Subquery
user = Profile.objects.get(handle="philsheard")
qs_existing_profiles = Profiles.objects.all()
rels = TwitterRelationship.objects.filter(
target=user.id,
).exclude(
source__in=Subquery(qs_existing_profiles.values("id"))
).values_list(
"source", flat=True
).order_by(
"source"
).filter(
source__gt=max_id_from_previous_batch # An integer representing a previous `Relationship.source` id
)
Thanks in advance for any advice!
For future searchers, here's how I bypassed the __in query and was able to speed up the results.
from django.db.models import Subquery
from django.db.models import Count # New
user = Profile.objects.get(handle="philsheard")
subq = Profile.objects.filter(profile_id=OuterRef("source")) # New queryset to use within Subquery
rels = Relationship.objects.order_by(
"source"
).annotate(
# Annotate each relationship record with a Count of the times that the "source" ID
# appears in the `Profile` table. We can then filter on those that have a count of 0
# (ie don't appear and therefore haven't yet been connected)
prof_count=Count(Subquery(subq.values("id")))
).filter(
target=user.id,
prof_count=0
).filter(
source__gt=max_id_from_previous_batch # An integer representing a previous `Relationship.source` id
).values_list(
"source", flat=True
)
I think this is faster because the query will complete once it reaches it's required 100 items (rather than comparing against a list of 1m+ IDs each time).

Django postgres order_by distinct on field

We have a limitation for order_by/distinct fields.
From the docs: "fields in order_by() must start with the fields in distinct(), in the same order"
Now here is the use case:
class Course(models.Model):
is_vip = models.BooleanField()
...
class CourseEvent(models.Model):
date = models.DateTimeField()
course = models.ForeignKey(Course)
The goal is to fetch the courses, ordered by nearest date but vip goes first.
The solution could look like this:
CourseEvent.objects.order_by('-course__is_vip', '-date',).distinct('course_id',).values_list('course')
But it causes an error since the limitation.
Yeah I understand why ordering is necessary when using distinct - we get the first row for each value of course_id so if we don't specify an order we would get some arbitrary row.
But what's the purpose of limiting order to the same field that we have distinct on?
If I change order_by to something like ('course_id', '-course__is_vip', 'date',) it would give me one row for course but the order of courses will have nothing in common with the goal.
Is there any way to bypass this limitation besides walking through the entire queryset and filtering it in a loop?
You can use a nested query using id__in. In the inner query you single out the distinct events and in the outer query you custom-order them:
CourseEvent.objects.filter(
id__in=CourseEvent.objects\
.order_by('course_id', '-date').distinct('course_id')
).order_by('-course__is_vip', '-date')
From the docs on distinct(*fields):
When you specify field names, you must provide an order_by() in the QuerySet, and the fields in order_by() must start with the fields in distinct(), in the same order.

Sort table by a column and set other column to sequential value to persist ordering

So I'm not sure whether to pose this as a Django or SQL question however I have the following model:
class Picture(models.Model):
weight = models.IntegerField(default=0)
taken_date = models.DateTimeField(blank=True, null=True)
album = models.ForeignKey(Album, db_column="album_id", related_name='pictures')
I may have a subset of Picture records numbering in the thousands, and I'll need to sort them by taken_date and persist the order by setting the weight value.
For instance in Django:
pictures = Picture.objects.filter(album_id=5).order_by('taken_date')
for weight, picture in enumerate(list(pictures)):
picture.weight = weight
picture.save()
Now for 1000s of records as I'm expecting to have, this could take way too long. Is there a more efficient way of performing this task? I'm assuming I might need to resort to SQL as I've recently come to learn Django's not necessarily "there yet" in terms of database bulk operations.
Ok I put together the following in MySQL which works fine, however I'm gonna guess there's no way to simulate this using Django ORM?
UPDATE picture p
JOIN (SELECT #inc := #inc + 1 AS new_weight, id
FROM (SELECT #inc := 0) temp, picture
WHERE album_id = 5
ORDER BY taken_date) pw
ON p.id = pw.id
SET p.weight = pw.new_weight;
I'll leave the question open for a while just in case there's some awesome solution or app that solves this, however the above query for ~6000 records takes 0.11s.
NOTE that the above query will generate warnings in MySQL if you have the following setting in MySQL:
binlog_format=statement
In order to fix this, you must change the binlog_format setting to either mixed or row. mixed is probably better as it means you'll still use statement for everything except in cases where row is required to avoid a warning like the above.

Django - SQL Query - Timestamp

Can anyone turn me to a tutorial, code or some kind of resource that will help me out with the following problem.
I have a table in a mySQL database. It contains an ID, Timestamp, another ID and a value. I'm passing it the 'main' ID which can uniquely identify a piece of data. However, I want to do a time search on this piece of data(therefore using the timestamp field). Therefore what would be ideal is to say: between the hours of 12 and 1, show me all the values logged for ID = 1987.
How would I go about querying this in Django? I know in mySQL it'd be something like less than/greater than etc... but how would I go about doing this in Django? i've been using Object.Filter for most of database handling so far. Finally, I'd like to stress that I'm new to Django and I'm genuinely stumped!
If the table in question maps to a Django model MyModel, e.g.
class MyModel(models.Model):
...
primaryid = ...
timestamp = ...
secondaryid = ...
valuefield = ...
then you can use
MyModel.objects.filter(
primaryid=1987
).exclude(
timestamp__lt=<min_timestamp>
).exclude(
timestamp__gt=<max_timestamp>
).values_list('valuefield', flat=True)
This selects entries with the primaryid 1987, with timestamp values between <min_timestamp> and <max_timestamp>, and returns the corresponding values in a list.
Update: Corrected bug in query (filter -> exclude).
I don't think Vinay Sajip's answer is correct. The closest correct variant based on his code is:
MyModel.objects.filter(
primaryid=1987
).exclude(
timestamp__lt=min_timestamp
).exclude(
timestamp__gt=max_timestamp
).values_list('valuefield', flat=True)
That's "exclude the ones less than the minimum timestamp and exclude the ones greater than the maximum timestamp." Alternatively, you can do this:
MyModel.objects.filter(
primaryid=1987
).filter(
timestamp__gte=min_timestamp
).exclude(
timestamp__gte=max_timestamp
).values_list('valuefield', flat=True)
exclude() and filter() are opposites: exclude() omits the identified rows and filter() includes them. You can use a combination of them to include/exclude whichever you prefer. In your case, you want to exclude() those below your minimum time stamp and to exclude() those above your maximum time stamp.
Here is the documentation on chaining QuerySet filters.

Categories

Resources