Django / Postgres Group By Aggregate - python

I have an issue with queries that use a GROUP BY clause.
Let's assume I have the following Django model:
class SomeModel(models.Model):
    date = models.DateField()
    value = models.FloatField()
    relation = models.ForeignKey('OtherModel')
I want a query that groups SomeModel instances by OtherModel and annotates each group with the latest date:
SomeModel.objects.values('relation').annotate(Max('date'))
This is all great, but as soon as I want to add a filter on the already annotated queryset I am getting nowhere:
SomeModel.objects.values('relation').annotate(Max('date')).filter(value__gt=0)
This does filter on value, but it keeps only the rows with value > 0 before the grouping happens, whereas I want the filter applied after the annotation. If the row with the latest date for a relation has the value 0, I want that relation to be filtered out!

You need to add the value in the annotate() call, and then you can filter on it. Your ORM query becomes:
SomeModel.objects.values('relation').annotate(Max('date'), value=F('value')).filter(value__gt=0)
This should give you the value you require.
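What the question describes is a "latest row per group, then filter" query. As a rough illustration of that logic at the SQL level (not the ORM call itself), the sketch below uses Python's built-in sqlite3 module instead of Django and Postgres. The table name some_model, the column relation_id, and the sample rows are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for SomeModel.
conn.execute("CREATE TABLE some_model (relation_id INTEGER, date TEXT, value REAL)")
conn.executemany(
    "INSERT INTO some_model VALUES (?, ?, ?)",
    [
        (1, "2020-01-01", 5.0),
        (1, "2020-02-01", 0.0),  # latest row for relation 1 has value 0
        (2, "2020-01-01", 0.0),
        (2, "2020-03-01", 7.5),  # latest row for relation 2 has value 7.5
    ],
)

# Step 1 (inner query): find the latest date per relation.
# Step 2 (join): fetch the row that carries that latest date.
# Step 3 (WHERE): only then filter on the value of that row.
latest_positive = conn.execute(
    """
    SELECT m.relation_id, m.date, m.value
    FROM some_model AS m
    JOIN (
        SELECT relation_id, MAX(date) AS max_date
        FROM some_model
        GROUP BY relation_id
    ) AS latest
      ON latest.relation_id = m.relation_id AND latest.max_date = m.date
    WHERE m.value > 0
    ORDER BY m.relation_id
    """
).fetchall()

print(latest_positive)
```

Relation 1 drops out because its latest row has the value 0, even though an older row has a positive value. Filtering on value before grouping would instead keep relation 1 with the older date.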

Related

Optimise Django query with large subquery

I have a database containing Profile and Relationship models. I haven't explicitly linked them in the models (because they are third party IDs and they may not yet exist in both tables), but the source and target fields map to one or more Profile objects via the id field:
from django.db import models

class Profile(models.Model):
    id = models.BigIntegerField(primary_key=True)
    handle = models.CharField(max_length=100)

class Relationship(models.Model):
    id = models.AutoField(primary_key=True)
    source = models.BigIntegerField(db_index=True)
    target = models.BigIntegerField(db_index=True)
My query needs to get a list of 100 values from the Relationship.source column which don't yet exist as a Profile.id. This list will then be used to collect the necessary data from the third party. The query below works, but as the table grows (10m+), the SubQuery is getting very large and slow.
Any recommendations for how to optimise this? Backend is PostgreSQL but I'd like to use native Django ORM if possible.
EDIT: There's an extra level of complexity that will be contributing to the slow query. Not all IDs are guaranteed to return successfully, which means they would continue to "not exist" and send the program into an infinite loop. So I've added a filter and order_by that take the highest id from the previous batch of 100 as input. This is probably causing some of the problem, so apologies for missing it initially.
from django.db.models import Subquery

user = Profile.objects.get(handle="philsheard")
qs_existing_profiles = Profile.objects.all()
rels = Relationship.objects.filter(
    target=user.id,
).exclude(
    source__in=Subquery(qs_existing_profiles.values("id"))
).values_list(
    "source", flat=True
).order_by(
    "source"
).filter(
    source__gt=max_id_from_previous_batch  # An integer representing a previous `Relationship.source` id
)
Thanks in advance for any advice!
For future searchers, here's how I bypassed the __in query and was able to speed up the results.
from django.db.models import OuterRef, Subquery
from django.db.models import Count  # New

user = Profile.objects.get(handle="philsheard")
# New queryset to use within Subquery; it matches on Profile.id, the field defined in the model above
subq = Profile.objects.filter(id=OuterRef("source"))
rels = Relationship.objects.order_by(
    "source"
).annotate(
    # Annotate each relationship record with a Count of the times that the "source" ID
    # appears in the `Profile` table. We can then filter on those that have a count of 0
    # (i.e. don't appear and therefore haven't yet been connected)
    prof_count=Count(Subquery(subq.values("id")))
).filter(
    target=user.id,
    prof_count=0
).filter(
    source__gt=max_id_from_previous_batch  # An integer representing a previous `Relationship.source` id
).values_list(
    "source", flat=True
)
I think this is faster because the query can complete once it reaches its required 100 items (rather than comparing against a list of 1m+ IDs each time).
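The same idea (check each candidate row against the profile table, resume after the last id seen, and stop after a fixed batch size) can be sketched in plain SQL as a NOT EXISTS anti-join. This is only an illustration using Python's built-in sqlite3 module rather than the Django ORM on PostgreSQL. The table names, the sample rows, and the missing_sources helper are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE profile (id INTEGER PRIMARY KEY, handle TEXT);
    CREATE TABLE relationship (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        source INTEGER,
        target INTEGER
    );
    CREATE INDEX relationship_source_idx ON relationship (source);
    """
)
conn.executemany("INSERT INTO profile VALUES (?, ?)", [(1, "a"), (3, "c"), (99, "me")])
conn.executemany(
    "INSERT INTO relationship (source, target) VALUES (?, ?)",
    [(1, 99), (2, 99), (3, 99), (4, 99), (5, 99), (6, 42)],
)


def missing_sources(conn, target_id, after_id, batch_size):
    """Return up to batch_size source ids greater than after_id that have no profile row."""
    rows = conn.execute(
        """
        SELECT r.source
        FROM relationship AS r
        WHERE r.target = ?
          AND r.source > ?
          AND NOT EXISTS (SELECT 1 FROM profile AS p WHERE p.id = r.source)
        ORDER BY r.source
        LIMIT ?
        """,
        (target_id, after_id, batch_size),
    ).fetchall()
    return [source for (source,) in rows]


# First batch starts from 0; the next batch resumes after the highest id already seen.
first_batch = missing_sources(conn, target_id=99, after_id=0, batch_size=2)
next_batch = missing_sources(conn, target_id=99, after_id=first_batch[-1], batch_size=2)
print(first_batch, next_batch)
```

Passing the last id of the previous batch as after_id is what keeps ids that failed to resolve from being returned again.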

Update multiple objects at once in Django?

I am using Django 1.9. I have a Django table that represents the value of a particular measure, by organisation by month, with raw values and percentiles:
class MeasureValue(models.Model):
    org = models.ForeignKey(Org, null=True, blank=True)
    month = models.DateField()
    calc_value = models.FloatField(null=True, blank=True)
    percentile = models.FloatField(null=True, blank=True)
There are typically 10,000 or so per month. My question is about whether I can speed up the process of setting values on the models.
Currently, I calculate percentiles by retrieving all the measurevalues for a month using a Django filter query, converting it to a pandas dataframe, and then using scipy's rankdata to set ranks and percentiles. I do this because pandas and rankdata are efficient, able to ignore null values, and able to handle repeated values in the way that I want, so I'm happy with this method:
records = MeasureValue.objects.filter(month=month).values()
df = pd.DataFrame.from_records(records)
# use calc_value to set percentile on each row, using scipy's rankdata
However, I then need to retrieve each percentile value from the dataframe, and set it back onto the model instances. Right now I do this by iterating over the dataframe's rows, and updating each instance:
for i, row in df.iterrows():
    mv = MeasureValue.objects.get(org=row.org, month=month)
    if (row.percentile is None) or np.isnan(row.percentile):
        row.percentile = None
    mv.percentile = row.percentile
    mv.save()
This is unsurprisingly quite slow. Is there any efficient Django way to speed it up, by making a single database write rather than tens of thousands? I have checked the documentation, but can't see one.
Atomic transactions can reduce the time spent in the loop:
from django.db import transaction

with transaction.atomic():
    for i, row in df.iterrows():
        mv = MeasureValue.objects.get(org=row.org, month=month)
        if (row.percentile is None) or np.isnan(row.percentile):
            # if it's already None, why set it to None?
            row.percentile = None
        mv.percentile = row.percentile
        mv.save()
Django's default behavior is to run in autocommit mode. Each query is immediately committed to the database, unless a transaction is active.
By using with transaction.atomic(), all the writes are grouped into a single transaction. The time needed to commit the transaction is amortized over all the enclosed statements, so the time per statement is greatly reduced.
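The amortized-commit idea is not specific to Django. As a minimal sketch, the block below uses Python's built-in sqlite3 module (not the ORM) to apply many updates and commit once. The measure_value table, the computed mapping, and the nan_to_none helper are hypothetical names invented for the illustration.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measure_value (org_id INTEGER PRIMARY KEY, percentile REAL)")
conn.executemany("INSERT INTO measure_value (org_id) VALUES (?)", [(1,), (2,), (3,)])
conn.commit()

# Hypothetical result of the percentile calculation: org id -> percentile (NaN = no value).
computed = {1: 25.0, 2: float("nan"), 3: 75.0}


def nan_to_none(x):
    """Map None/NaN to None so the database stores NULL instead of NaN."""
    return None if x is None or math.isnan(x) else x


params = [(nan_to_none(p), org_id) for org_id, p in computed.items()]

# Using the connection as a context manager commits once at the end of the block
# (or rolls back if an exception is raised), instead of once per statement.
with conn:
    conn.executemany("UPDATE measure_value SET percentile = ? WHERE org_id = ?", params)

stored = conn.execute(
    "SELECT org_id, percentile FROM measure_value ORDER BY org_id"
).fetchall()
print(stored)
```

Each row still gets its own UPDATE statement here; only the commit is shared, which is the same trade-off as wrapping the save() loop in transaction.atomic().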
As of Django 2.2, you can use the bulk_update() queryset method to efficiently update the given fields on the provided model instances, generally with one query:
objs = [
    Entry.objects.create(headline='Entry 1'),
    Entry.objects.create(headline='Entry 2'),
]
objs[0].headline = 'This is entry 1'
objs[1].headline = 'This is entry 2'
Entry.objects.bulk_update(objs, ['headline'])
In older versions of Django you could use update() with Case/When, e.g.:
from django.db.models import Case, When

Entry.objects.filter(
    pk__in=headlines  # `headlines` is a pk -> headline mapping
).update(
    headline=Case(*[When(pk=entry_pk, then=headline)
                    for entry_pk, headline in headlines.items()]))
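At the SQL level, a Case/When update corresponds roughly to a single UPDATE with a CASE expression. The sketch below builds such a statement by hand with Python's built-in sqlite3 module to show the shape of the query; it is not the SQL Django generates verbatim. The entry table and its rows are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, headline TEXT)")
conn.executemany(
    "INSERT INTO entry VALUES (?, ?)",
    [(1, "Entry 1"), (2, "Entry 2"), (3, "Entry 3")],
)

# pk -> headline mapping, as in the answer above.
headlines = {1: "This is entry 1", 2: "This is entry 2"}

# One "WHEN id = ? THEN ?" branch per mapping entry, plus an IN list so that
# rows outside the mapping are left untouched.
when_clauses = " ".join("WHEN id = ? THEN ?" for _ in headlines)
in_placeholders = ", ".join("?" for _ in headlines)
sql = (
    f"UPDATE entry SET headline = CASE {when_clauses} END "
    f"WHERE id IN ({in_placeholders})"
)
# Parameters: (pk, headline) pairs for the CASE branches, then the pks for the IN list.
params = [x for pair in headlines.items() for x in pair] + list(headlines)

with conn:
    cursor = conn.execute(sql, params)

rows = conn.execute("SELECT id, headline FROM entry ORDER BY id").fetchall()
print(cursor.rowcount, rows)
```

Only placeholders are interpolated into the SQL string; the values travel as bound parameters. Databases limit the number of bound parameters per statement, so a very large mapping would probably need to be split into chunks.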
Actually, when I tried @Eugene Yarmash's answer I got this error:
FieldError: Joined field references are not permitted in this query
But I believe iterating over update() calls is still quicker than multiple saves, and I expect that using a transaction should also speed things up.
So, for versions of Django that don't offer bulk_update, assuming the same data used in Eugene's answer, where headlines is a pk -> headline mapping:
from django.db import transaction

with transaction.atomic():
    for entry_pk, headline in headlines.items():
        Entry.objects.filter(pk=entry_pk).update(headline=headline)

Extract OneToOne Field in django model

class Post(models.Model):
    created_time = models.DateTimeField()
    comment_count = models.IntegerField(default=0)
    like_count = models.IntegerField(default=0)
    group = models.ForeignKey(Group)

class MonthPost(models.Model):
    created_time = models.DateTimeField()
    comment_count = models.IntegerField(default=0)
    like_count = models.IntegerField(default=0)
    group = models.ForeignKey(Group)
    post = models.OneToOneField(Post)
I use these two models. MonthPost holds a subset of Post.
I want to use MonthPost when the filtered date range is shorter than a month.
# Model is either Post or MonthPost
_models = Model.objects.extra(
    select={'score': 'like_count + comment_count'},
    order_by=('-score',)
)
I use extra() on both of the models above. It works well for Post, but not for MonthPost:
django.db.utils.ProgrammingError: column reference "like_count" is ambiguous
LINE 1: ... ("archive_post"."is_show" = false)) ORDER BY (like_count...
This is the error message.
_models.values_list("post", flat=True)
I then want to extract the OneToOne field (post) from MonthPost.
I tried values_list("post", flat=True), but it returns only a list of ids.
I need a list of Post objects for Django REST framework.
I don't quite understand what you are trying to achieve with your MonthPost model and why it duplicates the Post fields. That being said, I think you can get the results you want with this info.
First of all, extra() is an old API that the Django docs discourage and aim to deprecate (see the docs on extra). In either case, your select is not valid SQL syntax; your query should look more like this:
annotate(val=RawSQL(
    "select col from sometable where othercol =%s",
    (someparam,)))
However, what you are after here requires neither extra() nor RawSQL. These methods should only be used when there is no built-in way to achieve the desired results. When using RawSQL or extra(), you must tailor the SQL to your specific backend. Django has built-in methods for such queries:
from django.db.models import F
qs = Post.objects.all().annotate(
    score=F('like_count') + F('comment_count'))
A values_list() query needs to explicitly list all fields from related models and extra or annotated fields. For MonthPost it should look like this:
MonthPost.objects.all().values_list('post', 'post__score', 'post__created_time')
Finally, if the purpose of MonthPost is simply to list the posts with the greatest score for a given month, you can eliminate the MonthPost model entirely and query your Post model for this.
import datetime
from django.db.models import F

today = datetime.date.today()
# Filter for posts this month
# Annotate the score as the sum of the two counter fields
# Order the results by the score field, highest first
qs = Post.objects\
    .filter(created_time__year=today.year, created_time__month=today.month)\
    .annotate(score=F('like_count') + F('comment_count'))\
    .order_by('-score')
# Slice the top ten posts for the month
qs = qs[:10]
The code above is not tested, but should give you a better handle on how to perform these types of queries.
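The "column reference is ambiguous" error in the question is what a database reports when a joined query mentions a column name that exists in more than one of the joined tables, which fits MonthPost and Post both defining like_count. The sketch below reproduces that situation with Python's built-in sqlite3 module and then qualifies the columns with a table alias. The table layouts and rows are simplified assumptions, not the asker's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE post (id INTEGER PRIMARY KEY, like_count INTEGER, comment_count INTEGER);
    CREATE TABLE month_post (
        id INTEGER PRIMARY KEY,
        post_id INTEGER REFERENCES post (id),
        like_count INTEGER,
        comment_count INTEGER
    );
    INSERT INTO post VALUES (1, 10, 2), (2, 3, 30), (3, 1, 1);
    INSERT INTO month_post VALUES (1, 1, 10, 2), (2, 2, 3, 30), (3, 3, 1, 1);
    """
)

join = "FROM month_post AS mp JOIN post AS p ON p.id = mp.post_id"

# Unqualified column names: the database cannot tell which table is meant.
try:
    conn.execute(f"SELECT mp.post_id, like_count + comment_count AS score {join}")
    unqualified_error = None
except sqlite3.OperationalError as exc:
    unqualified_error = str(exc)

# Qualified column names: the score is computed from month_post explicitly.
top_posts = conn.execute(
    f"SELECT mp.post_id, mp.like_count + mp.comment_count AS score {join} "
    "ORDER BY score DESC LIMIT 2"
).fetchall()

print(unqualified_error)
print(top_posts)
```

Expressions built by the ORM are qualified with the table name automatically, whereas a raw fragment passed to extra() is inserted as written, which would explain why the raw version fails once a join is involved.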

Ordering a list of objects by the Year element of a DateTimeField

I have the following model defined in models.py
class SchoolClass(models.Model):
    ...
    date_created = models.DateTimeField('date created')
In my views.py I want to order the list of SchoolClass objects by only the year of date_created. I specifically don't want the ordering to take any other element of the DateTimeField into account, because I want secondary ordering to happen on a different field.
This is what I can come up with but it doesn't work.
def index(request):
    class_list = SchoolClass.objects.order_by('date_created__year')
In case it helps I get the following error when I run the above code:
Join on field 'date_created' not permitted. Did you misspell 'year' for the lookup type?
The problem is that the year field only exists in Python, not in the database. I guess you'll have to use extra() and make it explicit to the database that it should order by the year part of the datetime field. How exactly to do that will depend on your database.
This should work for MySQL:
def index(request):
    class_list = SchoolClass.objects.extra(select={'year_created': 'YEAR(date_created)'},
                                           order_by=['year_created'])
If you're using another database, you'll have to replace YEAR(date_created) with the equivalent operation for extracting the year from a datetime field.
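As an example of such a backend-specific replacement, SQLite has no YEAR() function, and strftime('%Y', ...) plays that role instead. The sketch below runs the equivalent ordering directly through Python's built-in sqlite3 module, with a secondary ordering on another column. The schoolclass table, its name column, and the rows are assumptions for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schoolclass (name TEXT, date_created TEXT)")
conn.executemany(
    "INSERT INTO schoolclass VALUES (?, ?)",
    [
        ("Physics", "2011-09-01 09:00:00"),
        ("Art", "2011-01-15 14:30:00"),
        ("Biology", "2010-12-31 23:59:59"),
        ("Chemistry", "2011-05-20 08:00:00"),
    ],
)

# Order by the year only, then by name; month, day and time play no part.
ordered = conn.execute(
    """
    SELECT name, CAST(strftime('%Y', date_created) AS INTEGER) AS year_created
    FROM schoolclass
    ORDER BY year_created, name
    """
).fetchall()

print(ordered)
```

Within 2011 the rows come out alphabetically by name rather than by creation date, which is the secondary-ordering behaviour the question asks for.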

Django - SQL Query - Timestamp

Can anyone point me to a tutorial, code or some other resource that will help me with the following problem?
I have a table in a MySQL database. It contains an ID, a timestamp, another ID and a value. I'm passing it the 'main' ID, which uniquely identifies a piece of data. However, I want to search this data by time (using the timestamp field). Ideally I could say: between the hours of 12 and 1, show me all the values logged for ID = 1987.
How would I go about querying this in Django? I know that in MySQL it would be something like less than/greater than, but how would I do this in Django? I've been using objects.filter for most of my database handling so far. Finally, I'd like to stress that I'm new to Django and I'm genuinely stumped!
If the table in question maps to a Django model MyModel, e.g.
class MyModel(models.Model):
    ...
    primaryid = ...
    timestamp = ...
    secondaryid = ...
    valuefield = ...
then you can use
MyModel.objects.filter(
    primaryid=1987
).exclude(
    timestamp__lt=<min_timestamp>
).exclude(
    timestamp__gt=<max_timestamp>
).values_list('valuefield', flat=True)
This selects entries with the primaryid 1987, with timestamp values between <min_timestamp> and <max_timestamp>, and returns the corresponding values in a list.
Update: Corrected bug in query (filter -> exclude).
I don't think Vinay Sajip's answer is correct. The closest correct variant based on his code is:
MyModel.objects.filter(
    primaryid=1987
).exclude(
    timestamp__lt=min_timestamp
).exclude(
    timestamp__gt=max_timestamp
).values_list('valuefield', flat=True)
That's "exclude the ones less than the minimum timestamp and exclude the ones greater than the maximum timestamp." Alternatively, you can do this:
MyModel.objects.filter(
    primaryid=1987
).filter(
    timestamp__gte=min_timestamp
).exclude(
    timestamp__gte=max_timestamp
).values_list('valuefield', flat=True)
exclude() and filter() are opposites: exclude() omits the identified rows and filter() includes them. You can use a combination of them to include/exclude whichever you prefer. In your case, you want to exclude() those below your minimum time stamp and to exclude() those above your maximum time stamp.
Here is the documentation on chaining QuerySet filters.
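The second variant above describes a half-open range: timestamps at or after the minimum and strictly before the maximum. A minimal sketch of that range check at the SQL level, using Python's built-in sqlite3 module rather than Django and MySQL, might look like this. The readings table, the concrete dates, and the sample rows are assumptions made up for the example.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (primaryid INTEGER, timestamp TEXT, valuefield REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        (1987, "2024-01-01 11:59:59", 1.0),  # just before the window
        (1987, "2024-01-01 12:00:00", 2.0),  # lower bound, included
        (1987, "2024-01-01 12:30:00", 3.0),
        (1987, "2024-01-01 13:00:00", 4.0),  # upper bound, excluded
        (2000, "2024-01-01 12:15:00", 5.0),  # different id
    ],
)

# "Between the hours of 12 and 1" expressed as two datetime bounds.
min_timestamp = datetime(2024, 1, 1, 12, 0, 0)
max_timestamp = datetime(2024, 1, 1, 13, 0, 0)
fmt = "%Y-%m-%d %H:%M:%S"

values = [
    value
    for (value,) in conn.execute(
        """
        SELECT valuefield
        FROM readings
        WHERE primaryid = ? AND timestamp >= ? AND timestamp < ?
        ORDER BY timestamp
        """,
        (1987, min_timestamp.strftime(fmt), max_timestamp.strftime(fmt)),
    )
]

print(values)
```

In the ORM the same bounds can be written as filter(timestamp__gte=min_timestamp, timestamp__lt=max_timestamp). The first variant above, with timestamp__gt in the second exclude(), keeps a row that falls exactly on the maximum as well.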
