Optimise Django query with large subquery

I have a database containing Profile and Relationship models. I haven't explicitly linked them in the models (because they are third party IDs and they may not yet exist in both tables), but the source and target fields map to one or more Profile objects via the id field:
from django.db import models

class Profile(models.Model):
    id = models.BigIntegerField(primary_key=True)
    handle = models.CharField(max_length=100)

class Relationship(models.Model):
    id = models.AutoField(primary_key=True)
    source = models.BigIntegerField(db_index=True)
    target = models.BigIntegerField(db_index=True)
My query needs to get a list of 100 values from the Relationship.source column which don't yet exist as a Profile.id. This list will then be used to collect the necessary data from the third party. The query below works, but as the table grows (10m+ rows), the subquery is getting very large and slow.
Any recommendations for how to optimise this? Backend is PostgreSQL but I'd like to use native Django ORM if possible.
EDIT: There's an extra level of complexity that will be contributing to the slow query. Not all IDs are guaranteed to return success, which would mean they continue to "not exist" and put the program into an infinite loop. So I've added a filter and order_by to pass in the highest id from the previous batch of 100. This will be causing some of the problem, so apologies for missing it initially.
from django.db.models import Subquery

user = Profile.objects.get(handle="philsheard")
qs_existing_profiles = Profile.objects.all()
rels = Relationship.objects.filter(
    target=user.id,
).exclude(
    source__in=Subquery(qs_existing_profiles.values("id"))
).values_list(
    "source", flat=True
).order_by(
    "source"
).filter(
    source__gt=max_id_from_previous_batch  # an integer: the highest `Relationship.source` id from the previous batch
)
Thanks in advance for any advice!

For future searchers, here's how I bypassed the __in query and was able to speed up the results.
from django.db.models import Count, OuterRef, Subquery  # Count and OuterRef are new

user = Profile.objects.get(handle="philsheard")
subq = Profile.objects.filter(id=OuterRef("source"))  # new queryset to use within Subquery; Profile's pk field is `id`
rels = Relationship.objects.order_by(
    "source"
).annotate(
    # Annotate each relationship record with a count of the times that the "source" ID
    # appears in the `Profile` table. We can then filter on those that have a count of 0
    # (i.e. don't appear and therefore haven't yet been connected).
    prof_count=Count(Subquery(subq.values("id")))
).filter(
    target=user.id,
    prof_count=0
).filter(
    source__gt=max_id_from_previous_batch  # an integer: the highest `Relationship.source` id from the previous batch
).values_list(
    "source", flat=True
)
I think this is faster because the query will complete once it reaches its required 100 items (rather than comparing against a list of 1m+ IDs each time).
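For completeness: on Django 3.0+ the same anti-join intent can be written with a correlated Exists instead of Count, which lets the database stop scanning per row at the first match. A minimal sketch, assuming the same models and variables as above:

from django.db.models import Exists, OuterRef

user = Profile.objects.get(handle="philsheard")
has_profile = Profile.objects.filter(id=OuterRef("source"))
rels = Relationship.objects.filter(
    target=user.id,
    source__gt=max_id_from_previous_batch,
).filter(
    ~Exists(has_profile)  # keep only sources with no matching Profile row
).order_by("source").values_list("source", flat=True)[:100]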

Related

Is it possible to use a queryset in the FROM clause?

I have a model for users' points collection:
class Rating(models.Model):
    user = models.ForeignKey(User, on_delete=models.CASCADE, related_name='rating')
    points = models.IntegerField()
Each user could have multiple records in this model. I need to calculate the rank of each user by the sum of their collected points. For the listing it's easy:
Rating.objects.values('user__username').annotate(
    total_points=Sum('points')
).order_by('-total_points')
But how do I get the rank for a single user by user_id? I added an annotation with row numbers:
Rating.objects.values('user__username').annotate(
    total_points=Sum('points')
).annotate(
    rank=Window(
        expression=RowNumber(),
        order_by=[F('total_points').desc()]
    )
)
It really added correct ranking numbers, but when I try to get a single user by user_id it returns a row with rank=1. That's because the filter condition goes into the WHERE clause, leaving a single row which is then numbered 1. I mean this:
Rating.objects.values('user__username').annotate(
    total_points=Sum('points')
).annotate(
    rank=Window(
        expression=RowNumber(),
        order_by=[F('total_points').desc()]
    )
).filter(user_id=1)
I got the SQL of this queryset (via qs.query), something like
SELECT ... FROM rating_rating WHERE ...
and inserted it into another SQL query as "rank_table" and added a condition into the outside WHERE clause:
SELECT * FROM (SELECT ... FROM rating_rating WHERE ...) AS rank_table WHERE user_id = 1;
and executed it in the MySQL console. This works exactly as I need. The question is: how do I implement the same thing using the Django ORM?
I have one solution to get what I need. I could add another field to mark records as the "correct" or "incorrect" user, sort the result by this field and then take the first row:
qs.annotate(
    required_user=Case(
        When(user_id=1, then=1),
        default=0,
        output_field=IntegerField(),
    )
).order_by('-required_user').first()
This works. But a SELECT within another SELECT seems more elegant, and I would like to know whether it's possible with Django.
Someone recently asked something similar about filtering on window functions. While what you want is basically a subquery (a SELECT within a SELECT), filtering on an annotation that uses a window function is not supported: https://code.djangoproject.com/ticket/28333 because the annotated fields end up inside the subquery. One answer there provides raw SQL with the query and its params, but that is not really elegant.
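Until that's supported, one pure-ORM workaround for the single-user rank is to count how many users have a strictly higher total, avoiding the window function entirely. A rough sketch (the user_rank helper is hypothetical, built only from the Rating model above):

from django.db.models import Sum

def user_rank(user_id):
    # one row per user: their summed points (GROUP BY user_id)
    totals = Rating.objects.values('user_id').annotate(total_points=Sum('points'))
    my_total = totals.get(user_id=user_id)['total_points']
    # rank = 1 + number of users with a strictly higher total
    return totals.filter(total_points__gt=my_total).count() + 1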

Django Query where one field is duplicate and another is different

I want to know if I can create a query where one field is duplicated and another one is different.
Basically I want to get all UserNames where first_name is the same and user_id is different.
I did this
UserNames.objects.values("first_name", "user_id").annotate(ct=Count("first_name")).filter(ct__gt=0)
This will retrieve a list with all users.
After this, I do some post-processing and create another query where I filter just the users with first_name__in=['aaa'] & user_id__in=[1, 2] to get the users with the same first_name but different user_id.
Can I do this in just one query, or in a better way?
You can work with a subquery here, but it will not matter much in terms of performance I think:
from django.db.models import Exists, OuterRef, Q

UserNames.objects.filter(
    Exists(UserNames.objects.filter(
        ~Q(user_id=OuterRef('user_id')),
        first_name=OuterRef('first_name')
    ))
)
or, prior to Django 3.0:
from django.db.models import Exists, OuterRef, Q

UserNames.objects.annotate(
    has_other=Exists(UserNames.objects.filter(
        ~Q(user_id=OuterRef('user_id')),
        first_name=OuterRef('first_name')
    ))
).filter(has_other=True)
We thus retain UserNames objects for which there exists a UserNames object with the same first_name, and with a different user_id.
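If it's only the duplicated names (and their rows) you're after, an aggregation over first_name is another option. A sketch under the same model assumptions; the lazy dup_names queryset is inlined as a subselect:

from django.db.models import Count

dup_names = UserNames.objects.values('first_name').annotate(
    n_users=Count('user_id', distinct=True)  # distinct users sharing this name
).filter(n_users__gt=1).values_list('first_name', flat=True)

duplicates = UserNames.objects.filter(first_name__in=dup_names)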

Django Query, distinct on foreign key

Given these models
class User(models.Model):
    pass

class Post(models.Model):
    by = models.ForeignKey(User, on_delete=models.CASCADE)
    posted_on = models.DateTimeField(auto_now=True)
I want to get the latest Posts, but only one per User. I have something like this:
posts = Post.objects.filter(public=True) \
    .order_by('posted_on') \
    .distinct('by')
But distinct on specific fields doesn't work on MySQL; I'm wondering if there is another way to do it?
I have seen some solutions using values(), but values() doesn't work for me because I need to do more things with the objects themselves.
Since distinct() will not work on MySQL with fields other than the model id, this is a possible workaround using Subquery:
from django.db.models import OuterRef, Subquery
...
sub_qs = Post.objects.filter(by_id=OuterRef('pk')).order_by('posted_on')  # the question's FK field is `by`
# here you get users annotated with the id of their last post
qs = User.objects.annotate(last_post=Subquery(sub_qs.values('pk')[:1]))  # Subquery must select a single column
# next you can limit the number of users
Also note that the ordering on the posted_on field depends on what you need - perhaps you'll want to change it to -posted_on so the newest post is picked.
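Since the asker needs the Post objects themselves, the annotation can be turned back into posts with one more query. A sketch, assuming the corrected field names above:

latest_ids = qs.filter(last_post__isnull=False).values_list('last_post', flat=True)
posts = Post.objects.filter(id__in=latest_ids)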
order_by should match the distinct(). In your case, you should be doing this:
posts = Post.objects.filter(public=True) \
    .order_by('by') \
    .distinct('by')
.distinct([*fields]) only works on PostgreSQL.
For the MySQL engine, this is what the Django documentation says:
Here's the difference. For a normal distinct() call, the database
compares each field in each row when determining which rows are
distinct. For a distinct() call with specified field names, the
database will only compare the specified field names.
For MySQL, the workaround could be this:
from django.db.models import OuterRef, Subquery

user_post = Post.objects.filter(by_id=OuterRef('pk')).order_by('posted_on')  # the question's FK field is `by`
post_ids = User.objects.filter(related_posts__isnull=False).annotate(  # assumes related_name='related_posts' on the FK
    post=Subquery(user_post.values_list('id', flat=True)[:1])
).values_list('post', flat=True)
posts = Post.objects.filter(id__in=post_ids)

Update multiple objects at once in Django?

I am using Django 1.9. I have a Django table that represents the value of a particular measure, by organisation and month, with raw values and percentiles:
class MeasureValue(models.Model):
    org = models.ForeignKey(Org, null=True, blank=True)
    month = models.DateField()
    calc_value = models.FloatField(null=True, blank=True)
    percentile = models.FloatField(null=True, blank=True)
There are typically 10,000 or so per month. My question is about whether I can speed up the process of setting values on the models.
Currently, I calculate percentiles by retrieving all the measurevalues for a month using a Django filter query, converting it to a pandas dataframe, and then using scipy's rankdata to set ranks and percentiles. I do this because pandas and rankdata are efficient, able to ignore null values, and able to handle repeated values in the way that I want, so I'm happy with this method:
records = MeasureValue.objects.filter(month=month).values()
df = pd.DataFrame.from_records(records)
# use calc_value to set percentile on each row, using scipy's rankdata
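(For completeness, that commented step might look roughly like this; the exact ranking method is a choice, so treat it as a sketch:)

import numpy as np
from scipy.stats import rankdata

vals = df['calc_value']
mask = vals.notnull()
df['percentile'] = np.nan
# rank the non-null values, then scale the ranks into 0-100 percentiles
df.loc[mask, 'percentile'] = rankdata(vals[mask]) / mask.sum() * 100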
However, I then need to retrieve each percentile value from the dataframe, and set it back onto the model instances. Right now I do this by iterating over the dataframe's rows, and updating each instance:
for i, row in df.iterrows():
    mv = MeasureValue.objects.get(org=row.org, month=month)
    if (row.percentile is None) or np.isnan(row.percentile):
        row.percentile = None
    mv.percentile = row.percentile
    mv.save()
This is unsurprisingly quite slow. Is there any efficient Django way to speed it up, by making a single database write rather than tens of thousands? I have checked the documentation, but can't see one.
Atomic transactions can reduce the time spent in the loop:
from django.db import transaction

with transaction.atomic():
    for i, row in df.iterrows():
        mv = MeasureValue.objects.get(org=row.org, month=month)
        if (row.percentile is None) or np.isnan(row.percentile):
            # if it's already None, why set it to None?
            row.percentile = None
        mv.percentile = row.percentile
        mv.save()
Django's default behavior is to run in autocommit mode: each query is immediately committed to the database unless a transaction is active.
By using with transaction.atomic(), all the updates are grouped into a single transaction. The time needed to commit the transaction is amortized over all the enclosed statements, so the time per statement is greatly reduced.
As of Django 2.2, you can use the bulk_update() queryset method to efficiently update the given fields on the provided model instances, generally with one query:
objs = [
    Entry.objects.create(headline='Entry 1'),
    Entry.objects.create(headline='Entry 2'),
]
objs[0].headline = 'This is entry 1'
objs[1].headline = 'This is entry 2'
Entry.objects.bulk_update(objs, ['headline'])
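Applied to the question's percentile loop, that might look like the sketch below. It assumes one MeasureValue per org per month, and that the dataframe came from .values(), which exposes the FK column as org_id:

mvs = list(MeasureValue.objects.filter(month=month))  # one query to fetch
by_org = {mv.org_id: mv for mv in mvs}
for i, row in df.iterrows():
    mv = by_org[row.org_id]
    pct = row.percentile
    mv.percentile = None if pct is None or np.isnan(pct) else pct
MeasureValue.objects.bulk_update(mvs, ['percentile'], batch_size=1000)  # one write per batch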
In older versions of Django you could use update() with Case/When, e.g.:
from django.db.models import Case, Value, When

Entry.objects.filter(
    pk__in=headlines  # `headlines` is a pk -> headline mapping
).update(
    headline=Case(*[When(pk=entry_pk, then=Value(headline))  # Value() stops the string being parsed as a field reference
                    for entry_pk, headline in headlines.items()])
)
Actually, attempting @Eugene Yarmash's answer I found I got this error:
FieldError: Joined field references are not permitted in this query
(This likely happens because a bare string passed to then= is interpreted as a field reference; wrapping it in Value(), as above, avoids it.) But I believe an iterated update() is still quicker than multiple save() calls, and I expect using a transaction should also help.
So, for versions of Django that don't offer bulk_update, assuming the same data used in Eugene's answer, where headlines is a pk -> headline mapping:
from django.db import transaction

with transaction.atomic():
    for entry_pk, headline in headlines.items():
        Entry.objects.filter(pk=entry_pk).update(headline=headline)

Django - SQL Query - Timestamp

Can anyone point me to a tutorial, code or some kind of resource that will help me out with the following problem?
I have a table in a MySQL database. It contains an ID, a timestamp, another ID and a value. I'm passing it the 'main' ID, which can uniquely identify a piece of data. However, I want to do a time search on this piece of data (therefore using the timestamp field). What would be ideal is to say: between the hours of 12 and 1, show me all the values logged for ID = 1987.
How would I go about querying this in Django? I know in MySQL it'd be something like less than/greater than etc., but how would I do this in Django? I've been using objects.filter() for most of my database handling so far. Finally, I'd like to stress that I'm new to Django and genuinely stumped!
If the table in question maps to a Django model MyModel, e.g.
class MyModel(models.Model):
    ...
    primaryid = ...
    timestamp = ...
    secondaryid = ...
    valuefield = ...
then you can use
MyModel.objects.filter(
    primaryid=1987
).exclude(
    timestamp__lt=<min_timestamp>
).exclude(
    timestamp__gt=<max_timestamp>
).values_list('valuefield', flat=True)
This selects entries with the primaryid 1987, with timestamp values between <min_timestamp> and <max_timestamp>, and returns the corresponding values in a list.
Update: Corrected bug in query (filter -> exclude).
I don't think Vinay Sajip's answer is correct. The closest correct variant based on his code is:
MyModel.objects.filter(
    primaryid=1987
).exclude(
    timestamp__lt=min_timestamp
).exclude(
    timestamp__gt=max_timestamp
).values_list('valuefield', flat=True)
That's "exclude the ones less than the minimum timestamp and exclude the ones greater than the maximum timestamp." Alternatively, you can do this:
MyModel.objects.filter(
    primaryid=1987
).filter(
    timestamp__gte=min_timestamp
).exclude(
    timestamp__gte=max_timestamp
).values_list('valuefield', flat=True)
exclude() and filter() are opposites: exclude() omits the identified rows and filter() includes them. You can use a combination of them to include/exclude whichever you prefer. In your case, you want to exclude() those below your minimum timestamp and to exclude() those above your maximum timestamp.
Here is the documentation on chaining QuerySet filters.
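As a footnote, the two chained lookups can also be collapsed into a single filter() with the __range lookup, which is equivalent to the first variant (inclusive on both ends):

MyModel.objects.filter(
    primaryid=1987,
    timestamp__range=(min_timestamp, max_timestamp),  # inclusive bounds
).values_list('valuefield', flat=True)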
