django distinct queryset from queryset-values


I have a User that can have multiple Accounts. There can be multiple Units on each Account.
I build a filter dictionary and get relevant units:
units = Unit.objects.filter(**unit_filter)
However, I would also like to get distinct users. I can easily get their ids:
user_dicts = units.values('account__user').distinct()
or to be more exact:
user_ids = [rec.get('account__user') for rec in
units.values('account__user').distinct()]
So then I can filter Users using User.objects.filter(id__in=user_ids). (I could also use a generator expression instead of the list comprehension, but that is not the point.)
I am not sure, but evaluating id__in does not seem very efficient to me. Is there a better way to get the unique users from the filtered units?
Edit:
I am adding SQL queries (untested, so they may be wrong) to make clear what I am trying to do in the Django ORM. Essentially, I am trying to use a JOIN instead of a WHERE clause.
I hope for this:
WITH selected_units AS
(SELECT id, account_id FROM unit)
SELECT DISTINCT u.id FROM user u
JOIN account a ON a.user_id=u.id
JOIN unit ut ON ut.account_id=a.id
JOIN selected_units s ON s.id=ut.id;
But with id__in I get this:
WITH selected_units AS
(SELECT id, account_id FROM unit)
SELECT DISTINCT u.id FROM user u
JOIN account a ON a.user_id=u.id
JOIN unit ut ON ut.account_id=a.id
WHERE ut.id IN (SELECT id FROM selected_units);

I think you can do it using subqueries, something like this:
User.objects.filter(account__unit__in=units).distinct()
Or less pretty:
User.objects.filter(id__in=units.values_list('account__user__id', flat=True))
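To see that the JOIN form and the IN-subquery form select the same users, here is a minimal sketch at the SQL level using Python's built-in sqlite3 module. The table names, columns, and sample rows are assumptions made up for illustration; they are not the schema Django would generate for the models in the question.

```python
import sqlite3

# Hypothetical minimal schema mirroring User -> Account -> Unit.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (id INTEGER PRIMARY KEY);
    CREATE TABLE account (id INTEGER PRIMARY KEY, user_id INTEGER);
    CREATE TABLE unit (id INTEGER PRIMARY KEY, account_id INTEGER);
    INSERT INTO user (id) VALUES (1), (2), (3);
    INSERT INTO account (id, user_id) VALUES (10, 1), (11, 1), (20, 2), (30, 3);
    INSERT INTO unit (id, account_id) VALUES
        (100, 10), (101, 10), (102, 11), (200, 20), (300, 30);
""")

# Stand-in for the ids of the filtered units.
selected_unit_ids = (100, 101, 102, 200)
placeholders = ", ".join("?" for _ in selected_unit_ids)

# Join through account and unit, then deduplicate the users.
join_sql = f"""
    SELECT DISTINCT u.id FROM user u
    JOIN account a ON a.user_id = u.id
    JOIN unit ut ON ut.account_id = a.id
    WHERE ut.id IN ({placeholders})
    ORDER BY u.id
"""

# Filter users by a subquery over the owners of the selected units.
subquery_sql = f"""
    SELECT u.id FROM user u
    WHERE u.id IN (
        SELECT a.user_id FROM unit ut
        JOIN account a ON a.id = ut.account_id
        WHERE ut.id IN ({placeholders})
    )
    ORDER BY u.id
"""

users_via_join = [row[0] for row in conn.execute(join_sql, selected_unit_ids)]
users_via_subquery = [row[0] for row in conn.execute(subquery_sql, selected_unit_ids)]
print(users_via_join, users_via_subquery)
```

User 1 owns three of the selected units, so the JOIN form needs DISTINCT to return that user once, while the subquery form never multiplies the user rows in the first place.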

Related

Django ORM: Get latest record for distinct field

I'm having loads of trouble translating some SQL into Django.
Imagine we have some cars, each with a unique VIN, and we record the dates that they are in the shop with some other data. (Please ignore the reason one might structure the data this way. It's specifically for this question. :-) )
class ShopVisit(models.Model):
    vin = models.CharField(...)
    date_in_shop = models.DateField(...)
    mileage = models.DecimalField(...)
    boolfield = models.BooleanField(...)
We want a single query to return a Queryset with the most recent record for each vin and update it!
special_vins = [...]
# Doesn't work
ShopVisit.objects.filter(vin__in=special_vins).annotate(max_date=Max('date_in_shop')).filter(date_in_shop=F('max_date')).update(boolfield=True)
# Distinct doesn't work with update
ShopVisit.objects.filter(vin__in=special_vins).order_by('vin', '-date_in_shop').distinct('vin').update(boolfield=True)
Yes, I could iterate over a queryset. But that's not very efficient and it takes a long time when I'm dealing with around 2M records. The SQL that could do this is below (I think!):
SELECT *
FROM cars
INNER JOIN (
SELECT MAX(dateInShop) as maxtime, vin
FROM cars
GROUP BY vin
) AS latest_record ON (cars.dateInShop= maxtime)
AND (latest_record.vin = cars.vin)
So how can I make this happen with Django?

This is somewhat untested, and relies on Django 1.11 for Subqueries, but perhaps something like:
latest_visits = Subquery(ShopVisit.objects.filter(id=OuterRef('id')).order_by('-date_in_shop').values('id')[:1])
ShopVisit.objects.filter(id__in=latest_visits)
I had a similar model, so went to test it but got an error of:
"This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery"
The SQL it generated looked reasonably like what you want, so I think the idea is sound. If you use PostgreSQL, it may support that type of subquery.
Here's the SQL it produced (trimmed up a bit and replaced actual names with fake ones):
SELECT `mymodel_activity`.* FROM `mymodel_activity` WHERE `mymodel_activity`.`id` IN (SELECT U0.`id` FROM `mymodel_activity` U0 WHERE U0.`id` = (`mymodel_activity`.`id`) ORDER BY U0.`date_in_shop` DESC LIMIT 1)

I wonder if you found the solution yourself.
I could only come up with a raw SQL query (see the Django manual on performing raw SQL queries):
UPDATE "yourapplabel_shopvisit"
SET boolfield = True WHERE (vin, date_in_shop)
IN (SELECT vin, MAX(date_in_shop) FROM "yourapplabel_shopvisit" GROUP BY vin);
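The raw-SQL idea can be tried out on an in-memory SQLite database. This is only a sketch: the table name shopvisit, the sample rows, and the special_vins values are assumptions for illustration. It uses a correlated subquery so that each row is compared only against the latest date of its own vin.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shopvisit (
        id INTEGER PRIMARY KEY,
        vin TEXT,
        date_in_shop TEXT,
        boolfield INTEGER DEFAULT 0
    );
    INSERT INTO shopvisit (id, vin, date_in_shop) VALUES
        (1, 'A', '2017-01-01'),
        (2, 'A', '2017-03-01'),
        (3, 'B', '2017-03-01'),
        (4, 'B', '2017-05-01'),
        (5, 'C', '2017-02-01');
""")

# Hypothetical list of VINs to restrict the update to.
special_vins = ("A", "B")
placeholders = ", ".join("?" for _ in special_vins)

# Flag the row with the latest date for each selected vin in one statement.
conn.execute(f"""
    UPDATE shopvisit SET boolfield = 1
    WHERE vin IN ({placeholders})
      AND date_in_shop = (
          SELECT MAX(s2.date_in_shop) FROM shopvisit AS s2
          WHERE s2.vin = shopvisit.vin
      )
""", special_vins)

updated_ids = [
    row[0]
    for row in conn.execute("SELECT id FROM shopvisit WHERE boolfield = 1 ORDER BY id")
]
print(updated_ids)
```

Comparing dates alone, without also matching the vin, would flag row 3 as well, because its date happens to equal the latest date of vin A.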

Django ORM values_list with '__in' filter performance

What is the preferred way to filter query set with '__in' in Django?
providers = Provider.objects.filter(age__gt=10)
consumers = Consumer.objects.filter(consumer__in=providers)
or
providers_ids = Provider.objects.filter(age__gt=10).values_list('id', flat=True)
consumers = Consumer.objects.filter(consumer__in=providers_ids)

These should be totally equivalent. Underneath the hood Django will optimize both of these to a subselect query in SQL. See the QuerySet API reference on in:
This queryset will be evaluated as subselect statement:
SELECT ... WHERE consumer.id IN (SELECT id FROM ... WHERE _ IN _)
However you can force a lookup based on passing in explicit values for the primary keys by calling list on your values_list, like so:
providers_ids = list(Provider.objects.filter(age__gt=10).values_list('id', flat=True))
consumers = Consumer.objects.filter(consumer__in=providers_ids)
This could be more performant in some cases, for example, when you have few providers, but it will be totally dependent on what your data is like and what database you're using. See the "Performance Considerations" note in the link above.
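The difference between the two strategies can be sketched at the SQL level with the standard sqlite3 module. The provider and consumer tables and their rows below are assumptions for illustration, not the schema of the models in the question. The first query lets the database evaluate the inner SELECT; the second fetches the ids first and sends them back as explicit parameters, which is roughly what wrapping values_list in list() leads to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE provider (id INTEGER PRIMARY KEY, age INTEGER);
    CREATE TABLE consumer (id INTEGER PRIMARY KEY, provider_id INTEGER);
    INSERT INTO provider (id, age) VALUES (1, 5), (2, 12), (3, 40);
    INSERT INTO consumer (id, provider_id) VALUES (1, 1), (2, 2), (3, 3), (4, 3);
""")

# One round trip: the database evaluates the inner SELECT itself.
subselect_ids = [row[0] for row in conn.execute("""
    SELECT id FROM consumer
    WHERE provider_id IN (SELECT id FROM provider WHERE age > 10)
    ORDER BY id
""")]

# Two round trips: fetch the ids first, then pass them as literal parameters.
provider_ids = [row[0] for row in conn.execute("SELECT id FROM provider WHERE age > 10")]
placeholders = ", ".join("?" for _ in provider_ids)
explicit_ids = [row[0] for row in conn.execute(
    f"SELECT id FROM consumer WHERE provider_id IN ({placeholders}) ORDER BY id",
    provider_ids,
)]

print(subselect_ids, explicit_ids)
```

Both forms return the same rows; which one is faster depends on the number of ids and on the database, as the answer above says.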

I agree with Wilduck. However, a couple of notes:
You can combine filters such as these into one, like this:
consumers = Consumer.objects.filter(consumer__age__gt=10)
This would give you the same result set in a single query.
Second, to analyze the generated query, you can print the queryset's .query attribute.
Example:
print(Provider.objects.filter(age__gt=10).query)
This prints the query the ORM would generate to fetch the result set.

Django query based on FK — get all, not any

I need to find an order with all order items with status = completed. It looks like this:
FINISHED_STATUSES = [17,18,19]
if active_tab == 'outstanding':
    orders = orders.exclude(items__status__in=FINISHED_STATUSES)
However, this query only gives me orders with any order item with a completed status. How would I do the query such that I retrieve only those orders with ALL order items with a completed status?

I think that you need a raw query here.
Assuming your order and item models are named Orders and Items:
# raw query: keep only orders whose item count equals their finished-item count
sql = """\
select `orders`.* from `%(orders_table)s` as `orders`
join `%(items_table)s` as `items`
on `items`.`%(item_order_fk)s` = `orders`.`%(order_pk)s`
group by `orders`.`%(order_pk)s`
having count(*) = sum(`items`.`%(status_field)s` in (%(status_list)s));
""" % {
    "orders_table": Orders._meta.db_table,
    "items_table": Items._meta.db_table,
    "order_pk": Orders._meta.pk.column,
    "item_order_fk": Items._meta.get_field("order").column,
    "status_field": Items._meta.get_field("status").column,
    "status_list": str(FINISHED_STATUSES)[1:-1],
}
orders = Orders.objects.raw(sql)

I was able to get this done by a sort of hackish way. First, I added an additional Boolean column, is_finished. Then, to find an order with at least one non-finished item:
orders = orders.filter(items__status__is_finished=False)
This gives me all un-finished orders.
Doing the opposite of that gets the finished orders:
orders = orders.exclude(items__status__is_finished=False)

Adding the boolean field is a good idea. That way you have your business rules clearly defined in the model.
Now, let's say that you still wanted to do it without resorting to adding fields. This may very well be a requirement given a different set of circumstances. Unfortunately, Django versions that predate the Subquery expressions added in 1.11 don't really let you use subqueries or arbitrary joins in the ORM. You could, however, build up Q objects and make an implicit join in the having clause using filter() and annotate().
from django.db.models.aggregates import Count
from django.db.models import Q
from functools import reduce
from operator import or_
total_items_by_orders = Orders.objects.annotate(
    item_count=Count('items'))
finished_items_by_orders = Orders.objects.filter(
    items__status__in=FINISHED_STATUSES).annotate(
    item_count=Count('items'))
orders = total_items_by_orders.exclude(
    reduce(or_, (Q(id=o.id, item_count=o.item_count)
                 for o in finished_items_by_orders)))
Note that using raw SQL, while less elegant, would usually be more efficient.
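The "all, not any" condition can also be written in SQL as "has at least one item, and has no item outside the finished statuses". The sketch below tries that on an in-memory SQLite database; the orders and items tables and their rows are assumptions for illustration, not the real schema behind the models in the question.

```python
import sqlite3

FINISHED_STATUSES = (17, 18, 19)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE items (id INTEGER PRIMARY KEY, order_id INTEGER, status INTEGER);
    INSERT INTO orders (id) VALUES (1), (2), (3);
    INSERT INTO items (order_id, status) VALUES
        (1, 17), (1, 18),   -- order 1: every item finished
        (2, 17), (2, 5),    -- order 2: one item still open
        (3, 5);             -- order 3: nothing finished
""")

placeholders = ", ".join("?" for _ in FINISHED_STATUSES)

# Orders with at least one item and no item outside the finished statuses.
finished_sql = f"""
    SELECT o.id FROM orders o
    WHERE EXISTS (SELECT 1 FROM items i WHERE i.order_id = o.id)
      AND NOT EXISTS (
          SELECT 1 FROM items i
          WHERE i.order_id = o.id AND i.status NOT IN ({placeholders})
      )
    ORDER BY o.id
"""

finished_order_ids = [row[0] for row in conn.execute(finished_sql, FINISHED_STATUSES)]
print(finished_order_ids)
```

Order 2 has one finished item, so an "any" filter would match it, but the NOT EXISTS clause rejects it because of its open item.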

How to filter by joinloaded table in SqlAlchemy?

Let's say I have two models, Document and Person. Document has a relationship to Person via the "owner" property. Now:
session.query(Document)\
.options(joinedload('owner'))\
.filter(Person.is_deleted!=True)
This joins the Person table twice. One Person table is selected and the duplicate is filtered, which is not what I want because the document rows are not filtered.
How can I apply a filter to the joinedloaded table/model?

You are right, table Person will be used twice in the resulting SQL, but each of them serves different purpose:
one is to apply the filter condition: filter(Person.is_deleted != True)
the other is to eager load the relationship: options(joinedload('owner'))
But the reason your query returns wrong results is because your filter condition is not complete. In order to make it produce the right results, you also need to JOIN the two models:
qry = (session.query(Document).
join(Document.owner). # THIS IS IMPORTANT
options(joinedload(Document.owner)).
filter(Person.is_deleted != True)
)
This will return the correct rows, even though it will still have 2 references (JOINs) to the Person table. The real solution to your query is to use contains_eager instead of joinedload:
qry = (session.query(Document).
join(Document.owner). # THIS IS STILL IMPORTANT
options(contains_eager(Document.owner)).
filter(Person.is_deleted != True)
)
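Why the missing join leaves the document rows unfiltered can be shown at the SQL level. The sketch below uses the standard sqlite3 module with made-up tables and rows, and the SQL only approximates the shape SQLAlchemy would emit (the exact statements depend on the version). In the first query the filtered person table is not tied to document, so every document is paired with every non-deleted person; in the second, the explicit join ties the filter to the owner.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, is_deleted INTEGER);
    CREATE TABLE document (id INTEGER PRIMARY KEY, owner_id INTEGER);
    INSERT INTO person (id, is_deleted) VALUES (1, 0), (2, 1);
    INSERT INTO document (id, owner_id) VALUES (10, 1), (20, 2);
""")

# Eager-load join is aliased (person_1); the filter hits an unrelated person reference.
without_join = sorted({row[0] for row in conn.execute("""
    SELECT document.id
    FROM person, document
    LEFT OUTER JOIN person AS person_1 ON person_1.id = document.owner_id
    WHERE person.is_deleted != 1
""")})

# Explicit join: the filtered person row is the document's owner.
with_join = sorted({row[0] for row in conn.execute("""
    SELECT document.id
    FROM document
    JOIN person ON person.id = document.owner_id
    WHERE person.is_deleted != 1
""")})

print(without_join, with_join)
```

Document 20 belongs to a deleted person, yet it survives the first query because the filter only requires that some non-deleted person exists.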

SQLAlchemy filter query by related object

Using SQLAlchemy, I have a one to many relation with two tables - users and scores. I am trying to query the top 10 users sorted by their aggregate score over the past X amount of days.
users:
id
user_name
score
scores:
user
score_amount
created
My current query is:
top_users = DBSession.query(User).options(eagerload('scores')).filter_by(User.scores.created > somedate).order_by(func.sum(User.scores).desc()).all()
I know this is clearly not correct, it's just my best guess. However, after looking at the documentation and googling I cannot find an answer.
EDIT:
Perhaps it would help if I sketched what the MySQL query would look like:
SELECT user.*, SUM(scores.amount) as score_increase
FROM user LEFT JOIN scores ON scores.user_id = user.user_id
WITH scores.created_at > someday
ORDER BY score_increase DESC

The single-joined-row way, with a group_by added in for all user columns although MySQL will let you group on just the "id" column if you choose:
sess.query(User, func.sum(Score.amount).label('score_increase')).\
join(User.scores).\
filter(Score.created_at > someday).\
group_by(User).\
order_by("score_increase desc")
Or if you just want the users in the result:
sess.query(User).\
join(User.scores).\
filter(Score.created_at > someday).\
group_by(User).\
order_by(func.sum(Score.amount))
The above two have an inefficiency in that you're grouping on all columns of "user" (or you're using MySQL's "group on only a few columns" thing, which is MySQL only). To minimize that, the subquery approach:
subq = sess.query(Score.user_id, func.sum(Score.amount).label('score_increase')).\
filter(Score.created_at > someday).\
group_by(Score.user_id).subquery()
sess.query(User).join((subq, subq.c.user_id==User.user_id)).order_by(subq.c.score_increase)
An example of the identical scenario is in the ORM tutorial at: http://docs.sqlalchemy.org/en/latest/orm/tutorial.html#selecting-entities-from-subqueries
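The SQL that the subquery approach aims for can be sketched with the standard sqlite3 module. The user and scores tables, the sample rows, and the someday cutoff are assumptions for illustration; the descending order and the LIMIT of 10 follow the "top 10 users" goal stated in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (user_id INTEGER PRIMARY KEY, user_name TEXT);
    CREATE TABLE scores (user_id INTEGER, amount INTEGER, created_at TEXT);
    INSERT INTO user VALUES (1, 'ann'), (2, 'bob'), (3, 'cat');
    INSERT INTO scores VALUES
        (1, 5, '2012-01-10'), (1, 7, '2012-01-12'),
        (2, 30, '2011-12-01'), (2, 4, '2012-01-11'),
        (3, 20, '2012-01-15');
""")

someday = "2012-01-01"  # hypothetical cutoff date

# Aggregate per user in a subquery, then join the totals back to the user table.
top_users = conn.execute("""
    SELECT u.user_name, s.score_increase
    FROM user u
    JOIN (
        SELECT user_id, SUM(amount) AS score_increase
        FROM scores
        WHERE created_at > ?
        GROUP BY user_id
    ) AS s ON s.user_id = u.user_id
    ORDER BY s.score_increase DESC
    LIMIT 10
""", (someday,)).fetchall()

print(top_users)
```

Grouping happens only on scores.user_id inside the subquery, so the outer query never has to group on every column of the user table.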

You will need to use a subquery in order to compute the aggregate score for each user. Subqueries are described here: http://www.sqlalchemy.org/docs/05/ormtutorial.html?highlight=subquery#using-subqueries

I am assuming the column (not the relation) you're using for the join is called Score.user_id, so change it if this is not the case.
You will need to do something like this:
DBSession.query(Score.user_id, func.sum(Score.score_amount).label('total_score')).group_by(Score.user_id).filter(Score.created > somedate).order_by('total_score DESC')[:10]
However this will result in tuples of (user_id, total_score). I'm not sure if the computed score is actually important to you, but if it is, you will probably want to do something like this:
users_scores = []
q = DBSession.query(Score.user_id, func.sum(Score.score_amount).label('total_score')).group_by(Score.user_id).filter(Score.created > somedate).order_by('total_score DESC')[:10]
for user_id, total_score in q:
    user = DBSession.query(User).get(user_id)
    users_scores.append((user, total_score))
This will result in 11 queries being executed, however. It is possible to do it all in a single query, but due to various limitations in SQLAlchemy, it will likely create a very ugly multi-join query or subquery (dependent on engine) and it won't be very performant.
If you plan on doing something like this often and you have a large amount of scores, consider denormalizing the current score onto the user table. It's more work to upkeep, but will result in a single non-join query like:
DBSession.query(User).order_by(User.computed_score.desc())
Hope that helps.
