Django ORM: Get latest record for distinct field - python

I'm having loads of trouble translating some SQL into Django.
Imagine we have some cars, each with a unique VIN, and we record the dates that they are in the shop with some other data. (Please ignore the reason one might structure the data this way. It's specifically for this question. :-) )
class ShopVisit(models.Model):
vin = models.CharField(...)
date_in_shop = models.DateField(...)
mileage = models.DecimalField(...)
boolfield = models.BooleanField(...)
We want a single query to return a Queryset with the most recent record for each vin and update it!
special_vins = [...]
# Doesn't work
ShopVisit.objects.filter(vin__in=special_vins).annotate(max_date=Max('date_in_shop').filter(date_in_shop=F('max_date')).update(boolfield=True)
# Distinct doesn't work with update
ShopVisit.objects.filter(vin__in=special_vins).order_by('vin', '-date_in_shop).distinct('vin').update(boolfield=True)
Yes, I could iterate over a queryset. But that's not very efficient and it takes a long time when I'm dealing with around 2M records. The SQL that could do this is below (I think!):
SELECT *
FROM cars
INNER JOIN (
SELECT MAX(dateInShop) as maxtime, vin
FROM cars
GROUP BY vin
) AS latest_record ON (cars.dateInShop= maxtime)
AND (latest_record.vin = cars.vin)
So how can I make this happen with Django?

This is somewhat untested, and relies on Django 1.11 for Subqueries, but perhaps something like:
latest_visits = Subquery(ShopVisit.objects.filter(id=OuterRef('id')).order_by('-date_in_shop').values('id')[:1])
ShopVisit.objects.filter(id__in=latest_visits)
I had a similar model, so went to test it but got an error of:
"This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery"
The SQL it generated looked reasonably like what you want, so I think the idea is sound. If you use PostGres, perhaps it has support for that type of subquery.
Here's the SQL it produced (trimmed up a bit and replaced actual names with fake ones):
SELECT `mymodel_activity`.* FROM `mymodel_activity` WHERE `mymodel_activity`.`id` IN (SELECT U0.`id` FROM `mymodel_activity` U0 WHERE U0.`id` = (`mymodel_activity`.`id`) ORDER BY U0.`date_in_shop` DESC LIMIT 1)

I wonder if you found the solution yourself.
I could come up with only raw query string. Django Raw SQL query Manual
UPDATE "yourapplabel_shopvisit"
SET boolfield = True WHERE date_in_shop
IN (SELECT MAX(date_in_shop) FROM "yourapplabel_shopvisit" GROUP BY vin);

Related

SQL query "translated" to one that Django will accept, anyone? Please (python)

Can anyone "translate" this MySQL query to Django chain or Q(). This is just an example (but valid) query, but I want to see it with my own eyes, because Django documentation doesn't look very noob friendly in this aspect and I couldn't get those chains and stuff to work.
mysql -> SELECT position, COUNT(position) FROM
-> (SELECT * FROM log WHERE (aspect LIKE 'es%' OR brand LIKE '%pj%')
-> AND tag IN ('in','out')) AS list
-> GROUP BY position ORDER BY COUNT(position) DESC;
While I think chaining filters would be more convenient for me in the future, this below just seems way more straightforward at the moment.
query = "the query from above"
cursor.execute(query)
[new_list.append([item for item in row ]) for row in cursor]
...or should I just quit()
from django.db.models import Count,Q
myfilter = [Q(aspect__startswith="es")|Q(brand__contains="pj"),tag__in=['in','out']]
# condition 1 OR condition2 AND condition3
qry = Log.objects.filter(*myfilter).values('position').annotate({"count":Count('position')})
print(qry.values())
I think maybe gets you the same answer using django ... close anyway I think
whether it is better or not i suppose is in the eye of the beholder
myfilter = Q(aspect__startswith="es") | Q(brand__contains="pj"), Q(tag__in=['in', 'out'])
qry = Log.objects.filter(*myfilter).values('position').annotate(Count('position')).order_by('-position__count')
had to add Q before tag__inbrackets on myfilter did not make a difference
annotate({"count":Count('position')}) gave an error QuerySet.annotate() received non-expression(s)
and to sort in descending order had to add
.order_by('-position__count')
output is a dictionary, but that's perfect for displaying on the website
So You were close. Now the data is matching mysql output and I understand it better how to do the QuerySets and filters. Thanks

Sort table by a column and set other column to sequential value to persist ordering

So I'm not sure whether to pose this as a Django or SQL question however I have the following model:
class Picture(models.Model):
weight = models.IntegerField(default=0)
taken_date = models.DateTimeField(blank=True, null=True)
album = models.ForeignKey(Album, db_column="album_id", related_name='pictures')
I may have a subset of Picture records numbering in the thousands, and I'll need to sort them by taken_date and persist the order by setting the weight value.
For instance in Django:
pictures = Picture.objects.filter(album_id=5).order_by('taken_date')
for weight, picture in enumerate(list(pictures)):
picture.weight = weight
picture.save()
Now for 1000s of records as I'm expecting to have, this could take way too long. Is there a more efficient way of performing this task? I'm assuming I might need to resort to SQL as I've recently come to learn Django's not necessarily "there yet" in terms of database bulk operations.
Ok I put together the following in MySQL which works fine, however I'm gonna guess there's no way to simulate this using Django ORM?
UPDATE picture p
JOIN (SELECT #inc := #inc + 1 AS new_weight, id
FROM (SELECT #inc := 0) temp, picture
WHERE album_id = 5
ORDER BY taken_date) pw
ON p.id = pw.id
SET p.weight = pw.new_weight;
I'll leave the question open for a while just in case there's some awesome solution or app that solves this, however the above query for ~6000 records takes 0.11s.
NOTE that the above query will generate warnings in MySQL if you have the following setting in MySQL:
binlog_format=statement
In order to fix this, you must change the binlog_format setting to either mixed or row. mixed is probably better as it means you'll still use statement for everything except in cases where row is required to avoid a warning like the above.

prefetch limited number of related objects in django

I want to display list of Posts with 5 latest Comments for each of them. How do I do that with minimum number of db queries?
Post.objects.filter(...).prefetch_related('comment_set')
retrieves all comments while I need only few of them.
I would go with two queries. First get posts:
posts = list(Post.objects.filter(...))
Now run raw SQL query with UNION (NOTE: omitted ordering for simplicity):
sql = "SELECT * FROM comments WHERE post_id=%s LIMIT 5"
query = []
for post in posts:
query.append( sql % post.id )
query = " UNION ".join(query)
and run it:
comments = Comments.objects.raw(query)
After that you can loop over comments and group them on the Python side.
I haven't tried it, but it looks ok.
There are other possible solutions for your problem (possibly getting down to one query), have a look at this:
http://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/

SQLAlchemy filter query by related object

Using SQLAlchemy, I have a one to many relation with two tables - users and scores. I am trying to query the top 10 users sorted by their aggregate score over the past X amount of days.
users:
id
user_name
score
scores:
user
score_amount
created
My current query is:
top_users = DBSession.query(User).options(eagerload('scores')).filter_by(User.scores.created > somedate).order_by(func.sum(User.scores).desc()).all()
I know this is clearly not correct, it's just my best guess. However, after looking at the documentation and googling I cannot find an answer.
EDIT:
Perhaps it would help if I sketched what the MySQL query would look like:
SELECT user.*, SUM(scores.amount) as score_increase
FROM user LEFT JOIN scores ON scores.user_id = user.user_id
WITH scores.created_at > someday
ORDER BY score_increase DESC
The single-joined-row way, with a group_by added in for all user columns although MySQL will let you group on just the "id" column if you choose:
sess.query(User, func.sum(Score.amount).label('score_increase')).\
join(User.scores).\
filter(Score.created_at > someday).\
group_by(User).\
order_by("score increase desc")
Or if you just want the users in the result:
sess.query(User).\
join(User.scores).\
filter(Score.created_at > someday).\
group_by(User).\
order_by(func.sum(Score.amount))
The above two have an inefficiency in that you're grouping on all columns of "user" (or you're using MySQL's "group on only a few columns" thing, which is MySQL only). To minimize that, the subquery approach:
subq = sess.query(Score.user_id, func.sum(Score.amount).label('score_increase')).\
filter(Score.created_at > someday).\
group_by(Score.user_id).subquery()
sess.query(User).join((subq, subq.c.user_id==User.user_id)).order_by(subq.c.score_increase)
An example of the identical scenario is in the ORM tutorial at: http://docs.sqlalchemy.org/en/latest/orm/tutorial.html#selecting-entities-from-subqueries
You will need to use a subquery in order to compute the aggregate score for each user. Subqueries are described here: http://www.sqlalchemy.org/docs/05/ormtutorial.html?highlight=subquery#using-subqueries
I am assuming the column (not the relation) you're using for the join is called Score.user_id, so change it if this is not the case.
You will need to do something like this:
DBSession.query(Score.user_id, func.sum(Score.score_amount).label('total_score')).group_by(Score.user_id).filter(Score.created > somedate).order_by('total_score DESC')[:10]
However this will result in tuples of (user_id, total_score). I'm not sure if the computed score is actually important to you, but if it is, you will probably want to do something like this:
users_scores = []
q = DBSession.query(Score.user_id, func.sum(Score.score_amount).label('total_score')).group_by(Score.user_id).filter(Score.created > somedate).order_by('total_score DESC')[:10]
for user_id, total_score in q:
user = DBSession.query(User)
users_scores.append((user, total_score))
This will result in 11 queries being executed, however. It is possible to do it all in a single query, but due to various limitations in SQLAlchemy, it will likely create a very ugly multi-join query or subquery (dependent on engine) and it won't be very performant.
If you plan on doing something like this often and you have a large amount of scores, consider denormalizing the current score onto the user table. It's more work to upkeep, but will result in a single non-join query like:
DBSession.query(User).order_by(User.computed_score.desc())
Hope that helps.

Django - SQL Query - Timestamp

Can anyone turn me to a tutorial, code or some kind of resource that will help me out with the following problem.
I have a table in a mySQL database. It contains an ID, Timestamp, another ID and a value. I'm passing it the 'main' ID which can uniquely identify a piece of data. However, I want to do a time search on this piece of data(therefore using the timestamp field). Therefore what would be ideal is to say: between the hours of 12 and 1, show me all the values logged for ID = 1987.
How would I go about querying this in Django? I know in mySQL it'd be something like less than/greater than etc... but how would I go about doing this in Django? i've been using Object.Filter for most of database handling so far. Finally, I'd like to stress that I'm new to Django and I'm genuinely stumped!
If the table in question maps to a Django model MyModel, e.g.
class MyModel(models.Model):
...
primaryid = ...
timestamp = ...
secondaryid = ...
valuefield = ...
then you can use
MyModel.objects.filter(
primaryid=1987
).exclude(
timestamp__lt=<min_timestamp>
).exclude(
timestamp__gt=<max_timestamp>
).values_list('valuefield', flat=True)
This selects entries with the primaryid 1987, with timestamp values between <min_timestamp> and <max_timestamp>, and returns the corresponding values in a list.
Update: Corrected bug in query (filter -> exclude).
I don't think Vinay Sajip's answer is correct. The closest correct variant based on his code is:
MyModel.objects.filter(
primaryid=1987
).exclude(
timestamp__lt=min_timestamp
).exclude(
timestamp__gt=max_timestamp
).values_list('valuefield', flat=True)
That's "exclude the ones less than the minimum timestamp and exclude the ones greater than the maximum timestamp." Alternatively, you can do this:
MyModel.objects.filter(
primaryid=1987
).filter(
timestamp__gte=min_timestamp
).exclude(
timestamp__gte=max_timestamp
).values_list('valuefield', flat=True)
exclude() and filter() are opposites: exclude() omits the identified rows and filter() includes them. You can use a combination of them to include/exclude whichever you prefer. In your case, you want to exclude() those below your minimum time stamp and to exclude() those above your maximum time stamp.
Here is the documentation on chaining QuerySet filters.

Categories

Resources