I am using Django 1.9. I have a Django table that represents the value of a particular measure, by organisation by month, with raw values and percentiles:
class MeasureValue(models.Model):
    org = models.ForeignKey(Org, null=True, blank=True)
    month = models.DateField()
    calc_value = models.FloatField(null=True, blank=True)
    percentile = models.FloatField(null=True, blank=True)
There are typically 10,000 or so per month. My question is about whether I can speed up the process of setting values on the models.
Currently, I calculate percentiles by retrieving all the measurevalues for a month using a Django filter query, converting it to a pandas dataframe, and then using scipy's rankdata to set ranks and percentiles. I do this because pandas and rankdata are efficient, able to ignore null values, and able to handle repeated values in the way that I want, so I'm happy with this method:
records = MeasureValue.objects.filter(month=month).values()
df = pd.DataFrame.from_records(records)
# use calc_value to set percentile on each row, using scipy's rankdata
However, I then need to retrieve each percentile value from the dataframe, and set it back onto the model instances. Right now I do this by iterating over the dataframe's rows, and updating each instance:
for i, row in df.iterrows():
    mv = MeasureValue.objects.get(org=row.org, month=month)
    if (row.percentile is None) or np.isnan(row.percentile):
        row.percentile = None
    mv.percentile = row.percentile
    mv.save()
This is unsurprisingly quite slow. Is there any efficient Django way to speed it up, by making a single database write rather than tens of thousands? I have checked the documentation, but can't see one.
Atomic transactions can reduce the time spent in the loop:
from django.db import transaction

with transaction.atomic():
    for i, row in df.iterrows():
        mv = MeasureValue.objects.get(org=row.org, month=month)
        if (row.percentile is None) or np.isnan(row.percentile):
            # if it's already None, why set it to None?
            row.percentile = None
        mv.percentile = row.percentile
        mv.save()
Django’s default behavior is to run in autocommit mode: each query is immediately committed to the database unless a transaction is active.
By using with transaction.atomic(), all the writes are grouped into a single transaction. The time needed to commit the transaction is amortized over all the enclosed statements, so the time per statement is greatly reduced.
As of Django 2.2, you can use the bulk_update() queryset method to efficiently update the given fields on the provided model instances, generally with one query:
objs = [
    Entry.objects.create(headline='Entry 1'),
    Entry.objects.create(headline='Entry 2'),
]
objs[0].headline = 'This is entry 1'
objs[1].headline = 'This is entry 2'
Entry.objects.bulk_update(objs, ['headline'])
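Applied to the original MeasureValue problem, a minimal sketch might look like the following (assuming Django 2.2+, and assuming the dataframe still carries each row's primary key in an id column from the earlier .values() call):
import numpy as np

mvs = []
for _, row in df.iterrows():
    # An unsaved instance with only the pk set is enough for bulk_update().
    mv = MeasureValue(pk=row['id'])
    # Normalise NaN back to None before it is written to the database.
    value = row['percentile']
    mv.percentile = None if (value is None or np.isnan(value)) else value
    mvs.append(mv)

MeasureValue.objects.bulk_update(mvs, ['percentile'], batch_size=1000)
With batch_size set, Django splits the work into a handful of UPDATE statements rather than one per row.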
In older versions of Django you could use update() with Case/When, e.g.:
from django.db.models import Case, When

Entry.objects.filter(
    pk__in=headlines  # `headlines` is a pk -> headline mapping
).update(
    headline=Case(*[When(pk=entry_pk, then=headline)
                    for entry_pk, headline in headlines.items()])
)
Actually, attempting @Eugene Yarmash's answer, I found I got this error:
FieldError: Joined field references are not permitted in this query
But I believe iterating update() is still quicker than multiple save() calls, and wrapping the loop in a transaction should speed it up further.
So, for versions of Django that don't offer bulk_update, assuming the same data used in Eugene's answer, where headlines is a pk -> headline mapping:
from django.db import transaction

with transaction.atomic():
    for entry_pk, headline in headlines.items():
        Entry.objects.filter(pk=entry_pk).update(headline=headline)
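For the original MeasureValue problem on an older Django, the same pattern might look like the sketch below (again assuming the dataframe keeps each row's primary key in an id column from the earlier .values() call). It issues one UPDATE per row instead of a SELECT followed by an UPDATE, all inside a single transaction:
import numpy as np
from django.db import transaction

with transaction.atomic():
    for _, row in df.iterrows():
        # Normalise NaN back to None before writing it to the database.
        value = row['percentile']
        if value is None or np.isnan(value):
            value = None
        MeasureValue.objects.filter(pk=row['id']).update(percentile=value)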
I have an issue regarding queries with a group by clause.
Let's assume I have the following Django model:
class SomeModel(models.Model):
    date = models.DateField()
    value = models.FloatField()
    relation = models.ForeignKey('OtherModel')
If I want to do a query where I group SomeModel instances by OtherModel and annotate the latest date:
SomeModel.objects.values('relation').annotate(Max('date'))
This is all great, but as soon as I want to add a filter on the already annotated queryset I am getting nowhere:
SomeModel.objects.values('relation').annotate(Max('date')).filter(value__gt=0)
This would indeed filter out all the rows where value is 0, but it does so before the grouping; I only want the filter applied after the annotation has taken place. If the latest-dated row of a relation has the value 0, I want that relation to be filtered out!
You need to add the value as an annotation (under a name that doesn't clash with the existing model field) and then you can filter on it. Your ORM query becomes:
SomeModel.objects.values('relation').annotate(Max('date'), val=F('value')).filter(val__gt=0)
This should give you the value you require.
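If the requirement is strictly "keep only relations whose latest-dated row has a non-zero value", another option (a sketch on my part, assuming Django 1.11+ for Subquery/OuterRef and that OtherModel is the model you ultimately want back) is to annotate each OtherModel with the value of its most recent SomeModel row and filter on that:
from django.db.models import OuterRef, Subquery

# Value of the most recent SomeModel row for each OtherModel.
latest_value = SomeModel.objects.filter(
    relation=OuterRef('pk')
).order_by('-date').values('value')[:1]

OtherModel.objects.annotate(
    latest_value=Subquery(latest_value)
).filter(latest_value__gt=0)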
I have a database containing Profile and Relationship models. I haven't explicitly linked them in the models (because they are third party IDs and they may not yet exist in both tables), but the source and target fields map to one or more Profile objects via the id field:
from django.db import models

class Profile(models.Model):
    id = models.BigIntegerField(primary_key=True)
    handle = models.CharField(max_length=100)

class Relationship(models.Model):
    id = models.AutoField(primary_key=True)
    source = models.BigIntegerField(db_index=True)
    target = models.BigIntegerField(db_index=True)
My query needs to get a list of 100 values from the Relationship.source column which don't yet exist as a Profile.id. This list will then be used to collect the necessary data from the third party. The query below works, but as the table grows (10m+), the SubQuery is getting very large and slow.
Any recommendations for how to optimise this? Backend is PostgreSQL but I'd like to use native Django ORM if possible.
EDIT: There's an extra level of complexity that contributes to the slow query. Not all IDs are guaranteed to return success, which means they would continue to "not exist" and send the program into an infinite loop. So I've added a filter and order_by to input the highest id from the previous batch of 100. This is going to be causing some of the problem, so apologies for missing it initially.
from django.db.models import Subquery

user = Profile.objects.get(handle="philsheard")
qs_existing_profiles = Profile.objects.all()
rels = Relationship.objects.filter(
    target=user.id,
).exclude(
    source__in=Subquery(qs_existing_profiles.values("id"))
).values_list(
    "source", flat=True
).order_by(
    "source"
).filter(
    source__gt=max_id_from_previous_batch  # An integer representing a previous `Relationship.source` id
)
Thanks in advance for any advice!
For future searchers, here's how I bypassed the __in query and was able to speed up the results.
from django.db.models import Subquery
from django.db.models import Count, OuterRef  # New

user = Profile.objects.get(handle="philsheard")
subq = Profile.objects.filter(id=OuterRef("source"))  # New queryset to use within Subquery
rels = Relationship.objects.order_by(
    "source"
).annotate(
    # Annotate each relationship record with a Count of the times that the "source" ID
    # appears in the `Profile` table. We can then filter on those that have a count of 0
    # (ie don't appear and therefore haven't yet been connected)
    prof_count=Count(Subquery(subq.values("id")))
).filter(
    target=user.id,
    prof_count=0
).filter(
    source__gt=max_id_from_previous_batch  # An integer representing a previous `Relationship.source` id
).values_list(
    "source", flat=True
)
I think this is faster because the query can complete once it reaches its required 100 items (rather than comparing against a list of 1m+ IDs each time).
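One detail the answer leaves implicit: the batch of 100 from the question would come from slicing the queryset, which adds a LIMIT to the SQL and lets the database stop as soon as it has enough rows. A small usage sketch reusing the rels queryset above:
# First 100 missing source IDs for this batch (LIMIT 100 in SQL).
batch = list(rels[:100])

# Because the queryset is ordered by "source", the last item becomes the
# cursor for the next batch.
if batch:
    max_id_from_previous_batch = batch[-1]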
Is it possible to filter a queryset by casting an hstore value to int or float?
I've run into an issue where we need to add more robust queries to an existing data model. The data model uses the HStoreField to store the majority of the building data, and we need to be able to query/filter against them, and some of the values need to be treated as numeric values.
However, since the values are treated as strings, they're compared character by character and results in incorrect queries. For example, '700' > '1000'.
So if I want to query for all items with a sqft value between 700 and 1000, I get back zero results, even though I can plainly see there are hundreds of items with values within that range. If I just query for items with sqft value >= 700, I only get results where the sqft value starts with 7, 8 or 9.
I also tried testing this using a JsonField from django-pgjson (since we're not yet on Django 1.9), but it appears to have the same issue.
Setup
Django==1.8.9
django-pgjson==0.3.1 (for jsonfield functionality)
Postgres==9.4.7
models.py
from django.contrib.postgres.fields import HStoreField
from django.db import models

class Building(models.Model):
    address1 = models.CharField(max_length=50)
    address2 = models.CharField(max_length=20, default='', blank=True)
    city = models.CharField(max_length=50)
    state = models.CharField(max_length=2)
    zipcode = models.CharField(max_length=10)
    data = HStoreField(blank=True, null=True)
Example Data
This is an example of what some of the data on the hstore field looks like.
address1: ...
address2: ...
city: ...
state: ...
zipcode: ...
data: {
    'year_built': '1995',
    'building_type': 'residential',
    'building_subtype': 'single-family',
    'bedrooms': '2',
    'bathrooms': '1',
    'total_sqft': '958',
}
Example Query which returns incorrect results
queryset = Building.objects.filter(data__total_sqft__gte=700)
I've tried playing around with the annotate feature to see if I can coerce it to cast to a numeric value but I have not had any luck getting that to work. I always get an error saying the field I'm querying against does not exist. This is an example I found elsewhere which doesn't seem to work.
queryset = Building.objects.all().annotate(
    sqft=RawSQL("((data->>total_sqft)::numeric)")
).filter(sqft__gte=700)
Which results in this error:
FieldError: Cannot resolve keyword 'sqft' into field. Choices are: address1, address2, city, state, zipcode, data
One thing that complicates this setup a little further is that we're building the queries dynamically and using Q() objects to and/or them together.
So, trying to do something sort of like this, given a key, value and operator type (gte, lte, iexact):
queryset = queryset.annotate(**{key: RawSQL("((data->>%s)::numeric)", (key,))})
queries.append(Q(**{'{}__{}'.format(key, operator): value}))
queryset = queryset.filter(reduce(operator.and_, queries))
However, I'd be happy even just getting the first query working without dynamically building them out.
I've thought about the possibility of having to create a separate model for the building data with the fields explicitly defined, however there are over 600 key-value pairs in the data hstore. It seems like changing that into a concrete data model would be a nightmare to set up and potentially maintain.
So I had a very similar problem and ended up using the Cast function (Django 1.10+) together with KeyTextTransform:
my_query = queryset.annotate(
    as_numeric=Cast(KeyTextTransform('my_json_fieldname', 'metadata'),
                    output_field=DecimalField(max_digits=6, decimal_places=2))
).filter(as_numeric=2)
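Mapping that back onto the Building model from the question is untested on my side, but a sketch might look like the following, assuming Django 1.10+ (for Cast) and using the hstore KeyTransform (hstore values are already text, so there is no separate KeyTextTransform for it):
from django.db.models import FloatField
from django.db.models.functions import Cast
from django.contrib.postgres.fields.hstore import KeyTransform

queryset = Building.objects.annotate(
    sqft=Cast(KeyTransform('total_sqft', 'data'), output_field=FloatField())
).filter(sqft__gte=700, sqft__lte=1000)
On Django 1.8 (as in the question's setup), the RawSQL route from the question may still work if the key is quoted and a params list is passed, e.g. RawSQL("(data->>'total_sqft')::numeric", []).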
So I'm not sure whether to pose this as a Django or SQL question however I have the following model:
class Picture(models.Model):
    weight = models.IntegerField(default=0)
    taken_date = models.DateTimeField(blank=True, null=True)
    album = models.ForeignKey(Album, db_column="album_id", related_name='pictures')
I may have a subset of Picture records numbering in the thousands, and I'll need to sort them by taken_date and persist the order by setting the weight value.
For instance in Django:
pictures = Picture.objects.filter(album_id=5).order_by('taken_date')
for weight, picture in enumerate(list(pictures)):
    picture.weight = weight
    picture.save()
Now for 1000s of records as I'm expecting to have, this could take way too long. Is there a more efficient way of performing this task? I'm assuming I might need to resort to SQL as I've recently come to learn Django's not necessarily "there yet" in terms of database bulk operations.
OK, I put together the following in MySQL, which works fine. However, I'm going to guess there's no way to simulate this using the Django ORM?
UPDATE picture p
JOIN (SELECT @inc := @inc + 1 AS new_weight, id
      FROM (SELECT @inc := 0) temp, picture
      WHERE album_id = 5
      ORDER BY taken_date) pw
ON p.id = pw.id
SET p.weight = pw.new_weight;
I'll leave the question open for a while just in case there's some awesome solution or app that solves this, however the above query for ~6000 records takes 0.11s.
NOTE that the above query will generate warnings if you have the following setting in MySQL:
binlog_format=statement
In order to fix this, you must change the binlog_format setting to either mixed or row. mixed is probably better as it means you'll still use statement for everything except in cases where row is required to avoid a warning like the above.
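If staying in the ORM is acceptable and a newer Django is available, bulk_update() (shown earlier on this page, Django 2.2+) could be a middle ground between per-row save() calls and raw SQL; a minimal sketch:
# Re-number an album's pictures by taken_date and persist the weights in batched UPDATEs.
pictures = list(Picture.objects.filter(album_id=5).order_by('taken_date'))
for weight, picture in enumerate(pictures):
    picture.weight = weight

Picture.objects.bulk_update(pictures, ['weight'], batch_size=1000)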
Can anyone point me to a tutorial, code or some kind of resource that will help me out with the following problem?
I have a table in a MySQL database. It contains an ID, a timestamp, another ID and a value. I'm passing it the 'main' ID, which uniquely identifies a piece of data. However, I want to do a time-based search on this piece of data (hence the timestamp field). Ideally I'd like to say: between the hours of 12 and 1, show me all the values logged for ID = 1987.
How would I go about querying this in Django? I know in MySQL it'd be something like less than/greater than etc., but how would I do this in Django? I've been using objects.filter() for most of my database handling so far. Finally, I'd like to stress that I'm new to Django and genuinely stumped!
If the table in question maps to a Django model MyModel, e.g.
class MyModel(models.Model):
    ...
    primaryid = ...
    timestamp = ...
    secondaryid = ...
    valuefield = ...
then you can use
MyModel.objects.filter(
    primaryid=1987
).exclude(
    timestamp__lt=<min_timestamp>
).exclude(
    timestamp__gt=<max_timestamp>
).values_list('valuefield', flat=True)
This selects entries with the primaryid 1987, with timestamp values between <min_timestamp> and <max_timestamp>, and returns the corresponding values in a list.
Update: Corrected bug in query (filter -> exclude).
I don't think Vinay Sajip's answer is correct. The closest correct variant based on his code is:
MyModel.objects.filter(
    primaryid=1987
).exclude(
    timestamp__lt=min_timestamp
).exclude(
    timestamp__gt=max_timestamp
).values_list('valuefield', flat=True)
That's "exclude the ones less than the minimum timestamp and exclude the ones greater than the maximum timestamp." Alternatively, you can do this:
MyModel.objects.filter(
    primaryid=1987
).filter(
    timestamp__gte=min_timestamp
).exclude(
    timestamp__gte=max_timestamp
).values_list('valuefield', flat=True)
exclude() and filter() are opposites: exclude() omits the identified rows and filter() includes them. You can use a combination of them to include/exclude whichever you prefer. In your case, you want to exclude() those below your minimum time stamp and to exclude() those above your maximum time stamp.
Here is the documentation on chaining QuerySet filters.
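For completeness, the same result could also be written with a single range lookup instead of chained exclude() calls; a small sketch assuming the field names above (note that __range is inclusive at both ends):
MyModel.objects.filter(
    primaryid=1987,
    timestamp__range=(min_timestamp, max_timestamp),
).values_list('valuefield', flat=True)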