Django Query where one field is duplicate and another is different - python

I want to know if I can create a query where one field is duplicate and another one is different.
Basically I want to get all UsersNames where First Name is the same and user_id is different.
I did this
UserNames.objects.values("first_name", "user_id").annotate(ct=Count("first_name")).filter(ct__gt=0)
This will retrieve a list whit all Users
After tis, I make some post processing and create another query where I filter just the users with first_name__in=['aaa'] & user_id__in=[1, 2] to get the users with the same first_name but different user_id
Can I do this just in one query? or in a better way?

You can work with a subquery here, but it will not matter much in terms of performance I think:
from django.db.models import Exists, OuterRef, Q
UserNames.objects.filter(
Exists(UserNames.objects.filter(
~Q(user_id=OuterRef('user_id')),
first_name=OuterRef('first_name')
))
)
or prior to django-3.0:
from django.db.models import Exists, OuterRef, Q
UserNames.objects.annotate(
has_other=Exists(UserNames.objects.filter(
~Q(user_id=OuterRef('user_id')),
first_name=OuterRef('first_name')
))
).filter(has_other=True)
We thus retain UserNames objects for which there exists a UserNames object with the same first_name, and with a different user_id.

Related

Is it possible to use queryset in the FROM clause

I have a model for user's points collection:
class Rating(models.Model):
user = models.ForeignKey(User, on_delete=models.CASCADE, related_name='rating')
points = models.IntegerField()
Each user could have multiple records in this model. I need to calculate a rank of each user by sum of collected points. For the listing it's easy:
Rating.objects.values('user__username').annotate(
total_points=Sum('points')
).order_by('-total_points')
But how to get rank for a single user by his user_id? I added annotation with numbers of rows:
Rating.objects.values('user__username').annotate(
total_points=Sum('points')
).annotate(
rank=Window(
expression=RowNumber(),
order_by=[F('total_points').desc()]
)
)
it really added correct ranking numbers, but when I try to get a single user by user_id it returns a row with rank=1. It's because the filter condition goes to the WHERE clause and there is a single row with the number 1. I mean this:
Rating.objects.values('user__username').annotate(
total_points=Sum('points')
).annotate(
rank=Window(
expression=RowNumber(),
order_by=[F('total_points').desc()]
)
).filter(user_id=1)
I got the SQL query of this queryset (qs.query) like
SELECT ... FROM rating_rating WHERE ...
and inserted it into another SQL query as "rank_table" and added a condition into the outside WHERE clause:
SELECT * FROM (SELECT ... FROM rating_rating WHERE ...) AS rank_table WHERE user_id = 1;
and executed within the MySQL console. And this works exactly as I need. The question is: how to implement the same using Django ORM?
I have one solution to get what I need. I could add another field to mark records as "correct" or "incorrect" user, sort result by this field and then get the first row:
qs.annotate(
required_user=Case(
When(user_id=1, then=1),
default=0,
output_field=IntegerField(),
)
).order_by('-required_user').first()
This works. But SELECT within another SELECT seems more elegant and I would like to know is it possible with Django.
somehow someone just recently asked something about filtering on windows functions. While what you want is basically subquery (select in select), using annotation with the window function is not supported :
https://code.djangoproject.com/ticket/28333 because the annotated fields will inside the subquery :'(. One provides raw sql with query_with_params, but it is not really elegant.

Django: remove duplicates (group by) from queryset by related model field

I have a Queryset with a couple of records, and I wan't to remove duplicates using the related model field. For example:
class User(models.Model):
group = models.ForeignKey('Group')
...
class Address(models.Model):
...
models.ForeignKey('User')
addresses = Address.objects.filter(user__group__id=1).order_by('-id')
This returns a QuerySet of Address records, and I want to group by the User ID.
I can't use .annotate because I need all fields from Address, and the relationship between Address and User
I can't use .distinct() because it doesn't work, since all addresses are distinct, and I want distinct user addresses.
I could:
addresses = Address.objects.filter(user__group__id=1).order_by('-id')
unique_users_ids = []
unique_addresses = []
for address in addresses:
if address.user.id not in unique_users_ids:
unique_addresses.append(address)
unique_users_ids.append(address.user.id)
print unique_addresses # TA-DA!
But it seems too much for a simple thing like a group by (damn you Django).
Is there a easy way to achieve this?
By using .distinct() with a field name
Django has also a .distinct(..) function that takes as input column the column names that should be unique. Alas most database systems do not support this (only PostgreSQL to the best of my knowledge). But in PostgreSQL we can thus perform:
# Limited number of database systems support this
addresses = (Address.objects
.filter(user__group__id=1)
.order_by('-id')
.distinct('user_id'))
By using two queries
Another way to handle this is by first having a query that works over the users, and for each user obtains the largest address_id:
from django.db.models import Max
address_ids = (User.objects
.annotate(address_id=Max('address_set__id'))
.filter(address_id__isnull=False)
.values_list('address_id'))
So now for every user, we have calculated the largest corresponding address_id, and we eliminate Users that have no address. We then obtain the list of ids.
In a second step, we then fetch the addresses:
addresses = Address.objects.filter(pk__in=address_ids)

Optimise Django query with large subquery

I have a database containing Profile and Relationship models. I haven't explicitly linked them in the models (because they are third party IDs and they may not yet exist in both tables), but the source and target fields map to one or more Profile objects via the id field:
from django.db import models
class Profile(models.Model):
id = models.BigIntegerField(primary_key=True)
handle = models.CharField(max_length=100)
class Relationship(models.Model):
id = models.AutoField(primary_key=True)
source = models.BigIntegerField(db_index=True)
target = models.BigIntegerField(db_index=True)
My query needs to get a list of 100 values from the Relationship.source column which don't yet exist as a Profile.id. This list will then be used to collect the necessary data from the third party. The query below works, but as the table grows (10m+), the SubQuery is getting very large and slow.
Any recommendations for how to optimise this? Backend is PostgreSQL but I'd like to use native Django ORM if possible.
EDIT: There's an extra level of complexity that will be contributing to the slow query. Not all IDs are guaranteed to return success, which would mean they continue to "not exist" and get the program in an infinite loop. So I've added a filter and order_by to input the highest id from the previous batch of 100. This is going to be causing some of the problem so apologies for missing it initially.
from django.db.models import Subquery
user = Profile.objects.get(handle="philsheard")
qs_existing_profiles = Profiles.objects.all()
rels = TwitterRelationship.objects.filter(
target=user.id,
).exclude(
source__in=Subquery(qs_existing_profiles.values("id"))
).values_list(
"source", flat=True
).order_by(
"source"
).filter(
source__gt=max_id_from_previous_batch # An integer representing a previous `Relationship.source` id
)
Thanks in advance for any advice!
For future searchers, here's how I bypassed the __in query and was able to speed up the results.
from django.db.models import Subquery
from django.db.models import Count # New
user = Profile.objects.get(handle="philsheard")
subq = Profile.objects.filter(profile_id=OuterRef("source")) # New queryset to use within Subquery
rels = Relationship.objects.order_by(
"source"
).annotate(
# Annotate each relationship record with a Count of the times that the "source" ID
# appears in the `Profile` table. We can then filter on those that have a count of 0
# (ie don't appear and therefore haven't yet been connected)
prof_count=Count(Subquery(subq.values("id")))
).filter(
target=user.id,
prof_count=0
).filter(
source__gt=max_id_from_previous_batch # An integer representing a previous `Relationship.source` id
).values_list(
"source", flat=True
)
I think this is faster because the query will complete once it reaches it's required 100 items (rather than comparing against a list of 1m+ IDs each time).

Django: Order by evaluation of whether or not a date is empty

In Django, is it possible to order by whether or not a field is None, instead of the value of the field itself?
I know I can send the QuerySet to python sorted() but I want to keep it as a QuerySet for subsequent filtering. So, I'd prefer to order in the QuerySet itself.
For example, I have a termination_date field and I want to first sort the ones without a termination_date, then I want to order by a different field, like last_name, first_name.
Is this possible or am I stuck using sorted() and then having to do an entire new Query with the included ids and run sorted() on the new QuerySet? I can do this, but would prefer not to waste the overhead and use the beauty of QuerySets that they don't run until evaluated.
Translation, how can I get this SQL from Django assuming my app is employee, my model is Employee and it has three fields 'first_name (varchar)', 'last_name (varchar)', and 'termination_date (date)':
SELECT
"employee_employee"."last_name",
"employee_employee"."first_name",
"employee_employee"."termination_date"
FROM "employee_employee"
ORDER BY
"employee_employee"."termination_date" IS NOT NULL,
"employee_employee"."last_name",
"employee_employee"."first_name"
You should be able to order by query expressions, like this:
from django.db.models import IntegerField, Case, Value, When
MyModel.objects.all().order_by(
Case(
When(some_field=None, then=Value(1)),
default=Value(0),
output_field=IntegerField(),
).asc(),
'some_other_field'
)
I cannot test here so it might require a bit a fiddling around, but this should put rows that have a NULL some_field after those that have a some_field. And each set of rows should be sorted by some_other_field.
Granted, the CASE/WHEN is be a bit more cumbersome that what you put in your question, but I don't know how to get Django ORM to output that. Maybe someone else will have a better answer.
Spectras' answer works fine, but it only orders your records by 'null or not'. There is a shorter way that allows you to put empty dates wherever you want them in your date ordering - Coalesce:
from django.db.models import Value
from django.db.models.functions import Coalesce
wayback = datetime(year=1, month=1, day=1) # or whatever date you want
MyModel.objects
.annotate(null_date=Coalesce('date_field', Value(wayback)))
.order_by('null_date')
This will essentially sort by the field 'date_field' with all records with date_field == None will be in the order as if they had the date wayback. This works perfectly with PostgreSQL, but might need some raw sql casting in MySQL as described in the documentation.

GQL - Select all parents where specific child is not in children

I'd like to have a parent class (Group) where any number of User may join. I want to display all Groups where the User is not already in. How do I model this data and how do I query? Sorry for not providing any code, but I simply have no idea.
Edit:
In SQL, this would be done with a User table, a Group table and a GroupUser cross ref table. And querying would go:
select *
from Group
where Group.ID not in
(
select GroupID
from GroupUser
where UserID = #userid
)
There are going to be thousands of groups and user is going to be a member of tens of them?
You are going to have a user experience issue: how do you make user choose from thousand of groups? A pagination table with 50+ pages?
About your problem: when you solve the above problem, then you can just properly mark groups user is already a member of:
You can simply have all users groups in memory (it's only ten IDs, right?) and simply filter them as you display the groups.
As Wooble says, there's no way to construct a query like this for the App Engine datastore. If your number of groups greatly outnumbers the number of groups a user is actually in, your best option is just to select all groups, then filter out the ones the user is already in - which is exactly what an SQL database would do given that query.
Maybe I put my question unclearly, I am obviously new to GAE and my terminology may be wrong. Anyway here is my solution:
class User(db.Model):
username = db.StringProperty()
class Group(db.Model):
users = db.ListProperty(db.Key)
To find a group and join (somewhat simplified):
groups = db.GqlQuery("SELECT * "
"FROM Group")
for g in groups:
if user.key() not in g.users:
group = g
break
group.users.append(user.key())

Categories

Resources