Django: remove duplicates (group by) from queryset by related model field


I have a QuerySet with a couple of records, and I want to remove duplicates using a related model field. For example:
class User(models.Model):
    group = models.ForeignKey('Group')
    ...
class Address(models.Model):
    ...
    user = models.ForeignKey('User')
addresses = Address.objects.filter(user__group__id=1).order_by('-id')
This returns a QuerySet of Address records, and I want to group by the User ID.
I can't use .annotate because I need all fields from Address, and the relationship between Address and User.
I can't use .distinct() because every address row is already distinct; what I want is one address per distinct user.
I could:
addresses = Address.objects.filter(user__group__id=1).order_by('-id')
unique_users_ids = []
unique_addresses = []
for address in addresses:
    # Keep only the first (newest) address seen for each user.
    if address.user.id not in unique_users_ids:
        unique_addresses.append(address)
        unique_users_ids.append(address.user.id)
print(unique_addresses)  # TA-DA!
But that seems like too much for something as simple as a group by (damn you Django).
Is there an easy way to achieve this?

By using .distinct() with a field name
Django also has a .distinct(*fields) method that takes the names of the fields that should be unique. Alas, most database systems do not support this (only PostgreSQL, to the best of my knowledge). On PostgreSQL we can thus write:
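# Only a limited number of database systems support this (PostgreSQL does).
# order_by() must start with the distinct() fields, so order by user_id first;
# '-id' then keeps the newest address for each user.
addresses = (Address.objects
    .filter(user__group__id=1)
    .order_by('user_id', '-id')
    .distinct('user_id'))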
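DISTINCT ON keeps the first row of each group in the query's ordering, so the ordering decides which address survives for each user. The following is a plain-Python model of that behaviour, not Django code; the (address_id, user_id) rows are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (address_id, user_id) rows standing in for Address records.
rows = [(1, 10), (2, 10), (3, 11), (4, 12), (5, 11)]

# ORDER BY user_id, id DESC: sort by id descending, then stable-sort by user_id.
ordered = sorted(sorted(rows, key=itemgetter(0), reverse=True), key=itemgetter(1))

# DISTINCT ON (user_id): keep the first row of each user_id group.
latest_per_user = [next(group) for _, group in groupby(ordered, key=itemgetter(1))]

print(latest_per_user)
```

Each user appears once, paired with that user's highest address id.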
By using two queries
Another way to handle this is by first having a query that works over the users, and for each user obtains the largest address_id:
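from django.db.models import Max
address_ids = (User.objects
    .filter(group_id=1)  # same group restriction as in the question
    # The reverse lookup name is the lowercase model name ('address'),
    # not the manager name ('address_set').
    .annotate(address_id=Max('address__id'))
    .filter(address_id__isnull=False)
    .values_list('address_id', flat=True))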
So now for every user, we have calculated the largest corresponding address_id, and we eliminate Users that have no address. We then obtain the list of ids.
In a second step, we then fetch the addresses:
addresses = Address.objects.filter(pk__in=address_ids)
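The two steps can be tried out without a Django project. This is a minimal sketch using Python's sqlite3 module; the table names app_user and app_address and the sample rows are assumptions, and the SQL is a hand-written equivalent rather than what Django actually emits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE app_user (id INTEGER PRIMARY KEY, group_id INTEGER);
    CREATE TABLE app_address (id INTEGER PRIMARY KEY, user_id INTEGER REFERENCES app_user(id));
    INSERT INTO app_user (id, group_id) VALUES (10, 1), (11, 1), (12, 1);
    INSERT INTO app_address (id, user_id) VALUES (1, 10), (2, 10), (3, 11);
""")

# Step 1: largest address id per user; the INNER JOIN drops users without addresses.
address_ids = [row[0] for row in conn.execute("""
    SELECT MAX(a.id)
    FROM app_user u
    JOIN app_address a ON a.user_id = u.id
    WHERE u.group_id = 1
    GROUP BY u.id
    ORDER BY u.id
""")]

# Step 2: fetch the full address rows for those ids.
placeholders = ", ".join("?" for _ in address_ids)
addresses = conn.execute(
    f"SELECT id, user_id FROM app_address WHERE id IN ({placeholders}) ORDER BY id DESC",
    address_ids,
).fetchall()

print(addresses)
```

User 12 has no address and therefore contributes nothing to either step.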

Related

How to annotate on a Django model's M2M field and get a list of distinct instances?

I have two Django models Profile and Device with a ManyToMany relationship with one another like so:
class Profile(models.Model):
    devices = models.ManyToManyField(Device, related_name='profiles')
I am trying to use annotate() and Count() to query on all profiles that have 1 or more devices like this:
profiles = Profile.objects.annotate(dev_count=Count('devices')).filter(dev_count__gt=1)
This is great, it gives me a QuerySet with all the profiles (4500+) with one or more devices, as expected.
Next, because of the M2M relationship, I would like to get a list of all the distinct devices among all the profiles from the previous queryset.
All of my failed attempts below return an empty queryset. I have read the documentation on values, values_list, and annotate but I still can't figure out how to make the correct query here.
devices = profiles.values('devices').distinct()
devices = profiles.values_list('devices', flat=True).distinct()
I have also tried to do it in one go:
devices = (
    Profile.objects.values_list('devices', flat=True)
    .annotate(dev_count=Count('devices'))
    .filter(dev_count__gt=1)
    .distinct()
)

You cannot use .values('devices') here: the devices column then appears in both the SELECT clause and the GROUP BY clause, so every group contains exactly one device and COUNT(devices) returns 1 for each group. The dev_count__gt=1 filter therefore matches nothing.
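The grouping effect can be reproduced in plain SQL. This is a simplified sketch using Python's sqlite3 module; profile_devices is a hypothetical stand-in for the many-to-many join table, and the queries are hand-written rather than the ones Django generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE profile_devices (profile_id INTEGER, device_id INTEGER);
    INSERT INTO profile_devices VALUES (1, 100), (1, 101), (2, 100), (3, 102);
""")

# Grouping by profile only: the count is the number of devices per profile.
per_profile = conn.execute("""
    SELECT profile_id, COUNT(device_id) FROM profile_devices
    GROUP BY profile_id ORDER BY profile_id
""").fetchall()

# Grouping by profile AND device: every group holds one row, so each count is 1.
per_pair = conn.execute("""
    SELECT profile_id, device_id, COUNT(device_id) FROM profile_devices
    GROUP BY profile_id, device_id ORDER BY profile_id, device_id
""").fetchall()

print(per_profile)
print(per_pair)
```

With the device in the GROUP BY, a condition such as COUNT(...) > 1 can never be satisfied.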
You can filter on the Devices that are linked to at least one of these Profiles with:
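from django.db.models import Count
profiles = Profile.objects.annotate(
    dev_count=Count('devices')
).filter(dev_count__gt=1)
# The M2M field sets related_name='profiles', so the reverse lookup is 'profiles'.
devices = Device.objects.filter(profiles__in=profiles).distinct()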
For some SQL dialects (typically MySQL) it is better to materialize the list of profiles first instead of using a subquery:
profiles = Profile.objects.annotate(
    dev_count=Count('devices')
).filter(dev_count__gt=1)
profiles_list = list(profiles)  # runs the profile query now
# The M2M field sets related_name='profiles', so the reverse lookup is 'profiles'.
devices = Device.objects.filter(profiles__in=profiles_list).distinct()
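To see why .distinct() is needed on the device query, here is a minimal sketch in plain SQL via Python's sqlite3 module. The profile_devices table is a hypothetical join table and the query is a hand-written approximation of the subquery variant.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE profile_devices (profile_id INTEGER, device_id INTEGER);
    INSERT INTO profile_devices VALUES
        (1, 100), (1, 101), (2, 100), (2, 101), (3, 102);
""")

# Devices linked to any profile that has more than one device.
# Devices 100 and 101 are each shared by two such profiles, so without
# DISTINCT they would be returned twice.
device_ids = [row[0] for row in conn.execute("""
    SELECT DISTINCT device_id
    FROM profile_devices
    WHERE profile_id IN (
        SELECT profile_id FROM profile_devices
        GROUP BY profile_id
        HAVING COUNT(device_id) > 1
    )
    ORDER BY device_id
""")]

print(device_ids)
```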

Django Query where one field is duplicate and another is different

I want to know if I can create a query where one field is duplicate and another one is different.
Basically, I want to get all UserNames where first_name is the same and user_id is different.
I did this
UserNames.objects.values("first_name", "user_id").annotate(ct=Count("first_name")).filter(ct__gt=0)
This retrieves a list with all users.
After this, I do some post-processing and create another query where I filter just the users with first_name__in=['aaa'] & user_id__in=[1, 2] to get the users with the same first_name but a different user_id.
Can I do this in just one query, or in a better way?

You can work with a subquery here, but it will not matter much in terms of performance I think:
from django.db.models import Exists, OuterRef, Q
UserNames.objects.filter(
    Exists(UserNames.objects.filter(
        ~Q(user_id=OuterRef('user_id')),
        first_name=OuterRef('first_name')
    ))
)
Or, prior to Django 3.0, where an Exists expression had to be annotated before filtering on it:
from django.db.models import Exists, OuterRef, Q
UserNames.objects.annotate(
    has_other=Exists(UserNames.objects.filter(
        ~Q(user_id=OuterRef('user_id')),
        first_name=OuterRef('first_name')
    ))
).filter(has_other=True)
We thus retain UserNames objects for which there exists a UserNames object with the same first_name, and with a different user_id.
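The same condition can be written as a correlated EXISTS in SQL. This is a minimal sketch using Python's sqlite3 module; the usernames table and its rows are assumptions for illustration, not the asker's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE usernames (id INTEGER PRIMARY KEY, first_name TEXT, user_id INTEGER);
    INSERT INTO usernames (first_name, user_id) VALUES
        ('aaa', 1), ('aaa', 2), ('bbb', 3), ('ccc', 4), ('ccc', 4);
""")

# Keep rows for which another row has the same first_name but a different user_id.
rows = conn.execute("""
    SELECT u.first_name, u.user_id
    FROM usernames u
    WHERE EXISTS (
        SELECT 1 FROM usernames other
        WHERE other.first_name = u.first_name
          AND other.user_id <> u.user_id
    )
    ORDER BY u.user_id
""").fetchall()

print(rows)
```

The two 'ccc' rows share a user_id, so they are not reported even though the name is duplicated.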

How to query a model by a related object and get the related object in the queryset using the Django ORM

I know it's possible to query a model using a reverse related field using the Django ORM. But is it possible to also get all the fields of the reverse related model for which the query matched?
For example, if we have the following models:
class Location(models.Model):
    name = models.CharField(max_length=50)
class Availability(models.Model):
    location = models.ForeignKey(Location, on_delete=models.CASCADE)
    start_datetime = models.DateTimeField()
    end_datetime = models.DateTimeField()
    price = models.PositiveIntegerField()
Would it be possible to find all Locations that are available in a specific timeframe AND also get the price of the Location during that availability? We assume that Availability objects for the same location cannot have overlapping start/end datetimes.
If start_datetime and end_datetime are supplied by the user, we could do something like the following:
Location.objects.filter(
    availability__start_datetime__lte=start_datetime,
    availability__end_datetime__gte=end_datetime)
But I'm not sure how to also get the price field for the specific availability that did result in a match for the query.
In raw SQL, the behavior I'm talking about might be achievable via something like this:
SELECT l.id, l.name, a.price
FROM Location l
INNER JOIN Availability a
ON a.location_id = l.id
WHERE /* availability is within user-inputted timeframe */
I've considered using something like prefetch_related('availability_set'), but that would just give me all the availabilities for the Location objects that matched the query. I just want the one availability that was within the timeframe that was queried, and more specifically, the price of that availability.

When you are using an ORM, in general you fetch results from one model class at a time. Since Location and Availability are separate models, you can simply do the following:
availabilities = Availability.objects.filter(
    start_datetime__lte=start_datetime,
    end_datetime__gte=end_datetime)
for availability in availabilities:
    print(availability.location.id, availability.location.name, availability.price)
This is easy to read.
However, accessing the Location through availability.location triggers an additional SQL query for each availability. You can avoid this with select_related, which the Django documentation describes as follows:
"This is a performance booster which results in a single more complex query but means later use of foreign-key relationships won’t require database queries."
Simply append it to your original query, i.e.:
availabilities = Availability.objects.select_related('location').filter(...
This will create an SQL join statement in the background and the Location objects will not require an extra query.
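The join that select_related produces is comparable to the raw SQL from the question. This is a minimal sketch using Python's sqlite3 module; the table layout mirrors the models above, while the sample rows and the ISO-formatted text timestamps are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE location (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE availability (
        id INTEGER PRIMARY KEY,
        location_id INTEGER REFERENCES location(id),
        start_datetime TEXT, end_datetime TEXT, price INTEGER
    );
    INSERT INTO location VALUES (1, 'Loft'), (2, 'Studio');
    INSERT INTO availability (location_id, start_datetime, end_datetime, price) VALUES
        (1, '2024-01-01 00:00', '2024-01-31 23:59', 120),
        (1, '2024-02-01 00:00', '2024-02-28 23:59', 150),
        (2, '2024-01-10 00:00', '2024-01-12 23:59', 80);
""")

# Requested timeframe; ISO-formatted strings compare correctly as text.
start_datetime, end_datetime = "2024-01-05 10:00", "2024-01-08 10:00"

# One query returns the location columns together with the matching price.
rows = conn.execute("""
    SELECT l.id, l.name, a.price
    FROM availability a
    INNER JOIN location l ON l.id = a.location_id
    WHERE a.start_datetime <= ? AND a.end_datetime >= ?
""", (start_datetime, end_datetime)).fetchall()

print(rows)
```

Only the availability that fully covers the requested timeframe is returned, along with its own price.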

Django postgres order_by distinct on field

We have a limitation for order_by/distinct fields.
From the docs: "fields in order_by() must start with the fields in distinct(), in the same order"
Now here is the use case:
class Course(models.Model):
    is_vip = models.BooleanField()
    ...
class CourseEvent(models.Model):
    date = models.DateTimeField()
    course = models.ForeignKey(Course)
The goal is to fetch the courses, ordered by nearest date but vip goes first.
The solution could look like this:
CourseEvent.objects.order_by('-course__is_vip', '-date',).distinct('course_id',).values_list('course')
But it raises an error because of that limitation.
Yeah I understand why ordering is necessary when using distinct - we get the first row for each value of course_id so if we don't specify an order we would get some arbitrary row.
But what's the purpose of limiting order to the same field that we have distinct on?
If I change order_by to something like ('course_id', '-course__is_vip', 'date',) it would give me one row for course but the order of courses will have nothing in common with the goal.
Is there any way to bypass this limitation besides walking through the entire queryset and filtering it in a loop?

You can use a nested query using id__in. In the inner query you single out the distinct events and in the outer query you custom-order them:
CourseEvent.objects.filter(
    id__in=CourseEvent.objects
        .order_by('course_id', '-date')
        .distinct('course_id')
).order_by('-course__is_vip', '-date')
From the docs on distinct(*fields):
When you specify field names, you must provide an order_by() in the QuerySet, and the fields in order_by() must start with the fields in distinct(), in the same order.
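The nested query works in two passes: first pick one event per course, then sort the survivors freely. This is a plain-Python model of those two passes, not Django code; the event tuples are hypothetical sample data.

```python
from datetime import date
from itertools import groupby

# Hypothetical events: (event_id, course_id, is_vip, date).
events = [
    (1, 1, False, date(2024, 5, 1)),
    (2, 1, False, date(2024, 6, 1)),
    (3, 2, True, date(2024, 4, 1)),
    (4, 2, True, date(2024, 3, 1)),
    (5, 3, False, date(2024, 7, 1)),
]

# Inner query: ORDER BY course_id, date DESC, then DISTINCT ON (course_id).
by_course = sorted(sorted(events, key=lambda e: e[3], reverse=True), key=lambda e: e[1])
one_per_course = [next(group) for _, group in groupby(by_course, key=lambda e: e[1])]

# Outer query: ORDER BY is_vip DESC, date DESC (stable sorts, least significant key first).
by_date = sorted(one_per_course, key=lambda e: e[3], reverse=True)
result = sorted(by_date, key=lambda e: e[2], reverse=True)

print([event_id for event_id, *_ in result])
```

Because the outer ordering is applied after the per-course selection, it no longer has to start with course_id.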

django model object filter

I have tables called 'has_location' and 'locations'. 'has_location' has user_has and location_id and its own id which is given by django itself.
'locations' have more columns.
Now I want to get all locations of some certain user. What I did is..(user.id is known):
users_locations_id = has_location.objects.filter(user_has__exact=user.id)
locations = Location.objects.filter(id__in=users_locations_id)
print(len(locations))
But this prints 0, even though I have data in the database. I have the feeling that __in does not accept the model's ids. Does it?
thanks

Using __in for this kind of query is a common anti-pattern in Django: it's tempting because of its simplicity, but it scales poorly in most databases. See slides 66ff in this presentation by Christophe Pettus.
You have a many-to-many relationship between users and locations, represented by the has_location table. You would normally describe this to Django using a ManyToManyField with a through table, something like this:
class Location(models.Model):
    # ...
class User(models.Model):
    locations = models.ManyToManyField(Location, through='LocationUser')
    # ...
class LocationUser(models.Model):
    location = models.ForeignKey(Location)
    user = models.ForeignKey(User)
    class Meta:
        db_table = 'has_location'
Then you can fetch the locations for a user like this:
user.locations.all()
You can query the locations in your filter operations:
User.objects.filter(locations__name = 'Barcelona')
And you can request that users' related locations be fetched efficiently using the prefetch_related() method on a query set.

You are using has_location's own id to filter locations. You have to use location_ids to filter locations:
user_haslocations = has_location.objects.filter(user_has=user)
locations = Location.objects.filter(id__in=user_haslocations.values('location_id'))
You can also filter the locations directly through the reverse relation:
location = Location.objects.filter(has_location__user_has=user.id)
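The difference between the two id columns is easy to see in plain SQL. This is a minimal sketch using Python's sqlite3 module; the column names follow the question, and the sample rows are made up so that the link table's own ids never coincide with location ids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE locations (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE has_location (id INTEGER PRIMARY KEY, user_has INTEGER, location_id INTEGER);
    INSERT INTO locations VALUES (7, 'Barcelona'), (8, 'Madrid'), (9, 'Lisbon');
    INSERT INTO has_location VALUES (1, 42, 7), (2, 42, 9), (3, 43, 8);
""")

# Filtering on has_location.id (1 and 2) matches none of the location ids (7, 8, 9).
wrong = conn.execute("""
    SELECT name FROM locations
    WHERE id IN (SELECT id FROM has_location WHERE user_has = 42)
""").fetchall()

# Filtering on has_location.location_id returns the user's locations.
right = conn.execute("""
    SELECT name FROM locations
    WHERE id IN (SELECT location_id FROM has_location WHERE user_has = 42)
    ORDER BY id
""").fetchall()

print(wrong, right)
```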

What do your models look like?
For your doubt, __in does accept filtered ids.
For your current code, the solution:
locations = Location.objects.filter(id__in=has_location.objects.filter(user_has=user).values('location_id'))
# If you only need the number of locations, let the database count them:
locations.count()
# If you are going to iterate over the locations afterwards anyway, len() evaluates the queryset and caches the results:
len(locations)
