Django numeric comparison of hstore or json data?

Is it possible to filter a queryset by casting an hstore value to int or float?
I've run into an issue where we need to add more robust queries to an existing data model. The data model uses the HStoreField to store the majority of the building data, and we need to be able to query/filter against them, and some of the values need to be treated as numeric values.
However, since the values are stored as strings, they're compared character by character, which produces incorrect results: for example, '700' > '1000'.
So if I want to query for all items with a sqft value between 700 and 1000, I get back zero results, even though I can plainly see there are hundreds of items with values within that range. If I just query for items with sqft value >= 700, I only get results where the sqft value starts with 7, 8 or 9.
I also tried testing this using a JsonField from django-pgjson (since we're not yet on Django 1.9), but it appears to have the same issue.
Setup
Django==1.8.9
django-pgjson==0.3.1 (for jsonfield functionality)
Postgres==9.4.7
models.py
from django.contrib.postgres.fields import HStoreField
from django.db import models

class Building(models.Model):
    address1 = models.CharField(max_length=50)
    address2 = models.CharField(max_length=20, default='', blank=True)
    city = models.CharField(max_length=50)
    state = models.CharField(max_length=2)
    zipcode = models.CharField(max_length=10)
    data = HStoreField(blank=True, null=True)
Example Data
This is an example of what some of the data on the hstore field looks like.
address1: ...
address2: ...
city: ...
state: ...
zipcode: ...
data: {
    'year_built': '1995',
    'building_type': 'residential',
    'building_subtype': 'single-family',
    'bedrooms': '2',
    'bathrooms': '1',
    'total_sqft': '958',
}
Example Query which returns incorrect results
queryset = Building.objects.filter(data__total_sqft__gte=700)
I've tried playing around with the annotate feature to see if I can coerce it to cast to a numeric value, but I have not had any luck getting that to work: I always get an error saying the field I'm querying against does not exist. This is an example I found elsewhere, which doesn't seem to work:
queryset = Building.objects.all().annotate(
    sqft=RawSQL("((data->>total_sqft)::numeric)")
).filter(sqft__gte=700)
Which results in this error:
FieldError: Cannot resolve keyword 'sqft' into field. Choices are: address1, address2, city, state, zipcode, data
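For what it's worth, that attempt can be made to execute: RawSQL takes both a SQL string and a params sequence, and the hstore key has to be quoted or passed as a parameter. A minimal corrected sketch (untested on this exact Django 1.8 setup) would be:

from django.db.models.expressions import RawSQL

queryset = Building.objects.annotate(
    # pass the hstore key as a query parameter and cast the text to numeric
    sqft=RawSQL("((data->>%s)::numeric)", ('total_sqft',))
).filter(sqft__gte=700)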
One thing that complicates this setup a little further is that we're building the queries dynamically, using Q() objects to AND/OR them together.
So, trying to do something sort of like this, given a key, a value and a lookup type (gte, lte, iexact):
queryset = queryset.annotate(**{key: RawSQL("((data->>%s)::numeric)", (key,))})
queries.append(Q(**{'{}__{}'.format(key, lookup): value}))
queryset = queryset.filter(reduce(operator.and_, queries))
However, I'd be happy even just getting the first query working without dynamically building them out.
I've thought about the possibility of creating a separate model for the building data with the fields explicitly defined, however there are over 600 key-value pairs in the data hstore. Turning that into a concrete data model would be a nightmare to set up and potentially maintain.

So I had a very similar problem and ended up using the Cast function (available since Django 1.10) together with KeyTextTransform:

from django.contrib.postgres.fields.jsonb import KeyTextTransform
from django.db.models import DecimalField
from django.db.models.functions import Cast

my_query = MyModel.objects.annotate(
    as_numeric=Cast(
        KeyTextTransform('my_json_fieldname', 'metadata'),
        output_field=DecimalField(max_digits=6, decimal_places=2),
    )
).filter(as_numeric=2)
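Applied to the Building model from the question, the same idea might look like the sketch below (assuming Django 1.10+; for an HStoreField the key transform lives in django.contrib.postgres.fields.hstore, and FloatField is an arbitrary choice of numeric output):

from django.contrib.postgres.fields.hstore import KeyTransform
from django.db.models import FloatField
from django.db.models.functions import Cast

queryset = Building.objects.annotate(
    # cast the text value of data->'total_sqft' to a number
    sqft=Cast(KeyTransform('total_sqft', 'data'), output_field=FloatField())
).filter(sqft__gte=700, sqft__lte=1000)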

Related

How to keep a property value in django.values()?

I have my model as:
class Subs(models.Model):
    ...
    created_at = models.DateTimeField(auto_now_add=True, db_column="order_date", null=True, blank=True)

    @property
    def created_date(self):
        return self.created_at.strftime('%B %d %Y')
I want to get created_date in my views.py:
data = Subs.objects.values('somevalues', 'created_date')
This throws an error. How can I access created_date so that I can use it here?
values() works at the database level, so it can only return model fields, not Python properties. And although your approach works, it's not best practice performance-wise: iterating over the whole of Model.objects.all() is generally a bad idea because it loads all rows into memory.
In such cases you have several options:
if you just need some simple Python logic on your data (like the date formatting here), do it in the presentation layer (e.g. template filter tags)
if you need to apply some heavy business logic, it's better to run it at create/update time (e.g. by overriding .save()), or in cron jobs at off-peak times, and store the result in an extra column in the DB
if your manipulation needs a DB-layer query and depends on several columns or tables, use .annotate() to add it to your queryset, as sketched below
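For this particular case (formatting a datetime as text), the annotate route could lean on PostgreSQL's to_char via Func. A sketch, assuming PostgreSQL and that the 'FMMonth DD YYYY' format is close enough to strftime('%B %d %Y'); the annotation is named created_date_str to avoid colliding with the property:

from django.db.models import CharField, F, Func, Value

data = Subs.objects.annotate(
    created_date_str=Func(
        F('created_at'),
        Value('FMMonth DD YYYY'),  # PostgreSQL to_char format string
        function='to_char',
        output_field=CharField(),
    )
).values('created_date_str')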
Since values() did not work, I used a loop instead.
Instead of doing this:
data = Subs.objects.values('somevalues', 'created_date')
I did this:
newarr = [{'created_date': i.created_date} for i in Subs.objects.all()]

Python/Django - How to annotate a QuerySet with a value determined by another query

Django 1.11, python 2.7, postgresql
I have a set of models that look like this:
class Book(Model):
    released_at = DateTimeField()

class BookPrice(Model):
    price = DecimalField()
    created_at = DateTimeField()
Assuming multiple entries for Book and BookPrice (created at different points in time), I want to get a QuerySet of Book annotated with the BookPrice.price value that was current at the time the Book was released. Something like:
books = Book.objects.annotate(
    old_price=Subquery(
        BookPrice.objects.filter(
            created_at__lt=OuterRef('released_at')
        )
        .order_by('created_at')
        .last()
        .price
    )
)
When I try something like this, I get an error: This queryset contains a reference to an outer query and may only be used in a subquery.
I could get the data with a for loop easily enough, but I'm trying to prepare a large chunk of data for a CSV download and I don't want to iterate through every book if I can help it.
Your problem is that you are calling .last().price. This resolves (executes) the queryset and tries to return a Python object, and since the queryset contains an OuterRef it cannot be executed on its own; hence the error you are getting.
You should transform your query into something like the following:

from django.db.models import OuterRef, Subquery

last_price_before_release_query = BookPrice.objects.filter(
    created_at__lt=OuterRef('released_at')
).order_by('-created_at').values('price')  # note the reversed ordering

books = Book.objects.annotate(
    old_price=Subquery(last_price_before_release_query[:1])
)

Slicing with [:1] keeps the queryset unevaluated and ensures the subquery returns a single row.
You can find more information in the Django documentation on Subquery expressions.
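Since the motivation was a CSV export, the annotation can then be consumed like any regular field. A minimal sketch (writing to stdout; in a view you would pass an HttpResponse to csv.writer):

import csv
import sys

writer = csv.writer(sys.stdout)
writer.writerow(['id', 'released_at', 'old_price'])
for book_id, released_at, old_price in books.values_list('id', 'released_at', 'old_price'):
    writer.writerow([book_id, released_at, old_price])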

How to query a model by a related object and get the related object in the queryset using the Django ORM

I know it's possible to query a model using a reverse related field using the Django ORM. But is it possible to also get all the fields of the reverse related model for which the query matched?
For example, if we have the following models:
class Location(models.Model):
    name = models.CharField(max_length=50)

class Availability(models.Model):
    location = models.ForeignKey(Location, on_delete=models.CASCADE)
    start_datetime = models.DateTimeField()
    end_datetime = models.DateTimeField()
    price = models.PositiveIntegerField()
would it be possible to find all Locations that are available in a specific timeframe AND also get the price of the Location during that availability? We are working under the assumption that Availability objects with the same location cannot have overlapping start/end datetimes.
If user_start_datetime and user_end_datetime are provided by the user, then we could do something like the following:
Location.objects.filter(
    availability__start_datetime__lte=user_start_datetime,
    availability__end_datetime__gte=user_end_datetime)
But I'm not sure how to also get the price field for the specific availability that did result in a match for the query.
In raw SQL, the behavior I'm talking about might be achievable via something like this:
SELECT l.id, l.name, a.price
FROM Location l
INNER JOIN Availability a
ON a.location_id = l.id
WHERE /* availability is within user-inputted timeframe */
I've considered using something like prefetch_related('availability_set'), but that would just give me all the availabilities for the Location objects that matched the query. I just want the one availability that was within the timeframe that was queried, and more specifically, the price of that availability.
When you are using an ORM, you generally fetch results from one model class at a time. Since Location and Availability are separate models, you can simply do the following:
availabilities = Availability.objects.filter(
    start_datetime__lte=user_start_datetime,
    end_datetime__gte=user_end_datetime)

for availability in availabilities:
    print(availability.location.id, availability.location.name, availability.price)

This is an easy-to-read implementation.
Now, accessing Location from an Availability object (in availability.location) requires a second SQL query. You can optimise this using select_related:
This is a performance booster which results in a single more complex query but means later use of foreign-key relationships won’t require database queries.
Simply append it to your original query, i.e.:
availabilities = Availability.objects.select_related('location').filter(...
This will create an SQL join statement in the background and the Location objects will not require an extra query.
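If you specifically want Location objects with the matching Availability attached, as in the prefetch_related idea from the question, a filtered Prefetch is one way to sketch it (the to_attr name is illustrative; by the question's no-overlap assumption, each location has at most one match):

from django.db.models import Prefetch

matching = Availability.objects.filter(
    start_datetime__lte=user_start_datetime,
    end_datetime__gte=user_end_datetime,
)

locations = Location.objects.filter(
    availability__in=matching
).prefetch_related(
    # attach only the availabilities that matched the timeframe
    Prefetch('availability_set', queryset=matching, to_attr='matching_availabilities')
)

for location in locations:
    print(location.name, location.matching_availabilities[0].price)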

Django postgres order_by distinct on field

There is a limitation on combining order_by() and distinct() fields.
From the docs: "fields in order_by() must start with the fields in distinct(), in the same order"
Now here is the use case:
class Course(models.Model):
    is_vip = models.BooleanField()
    ...

class CourseEvent(models.Model):
    date = models.DateTimeField()
    course = models.ForeignKey(Course)
The goal is to fetch the courses ordered by nearest event date, but with VIP courses first.
The solution could look like this:
CourseEvent.objects.order_by('-course__is_vip', '-date').distinct('course_id').values_list('course')
But it raises an error because of that limitation.
Yeah, I understand why ordering is necessary when using distinct: we get the first row for each value of course_id, so if we don't specify an order we would get some arbitrary row.
But what's the purpose of limiting the order to the same fields we have distinct on?
If I change order_by to something like ('course_id', '-course__is_vip', 'date') it gives me one row per course, but the order of the courses has nothing in common with the goal.
Is there any way to bypass this limitation besides walking through the entire queryset and filtering it in a loop?
You can use a nested query via id__in: the inner query singles out the distinct events, and the outer query custom-orders them:
CourseEvent.objects.filter(
    id__in=CourseEvent.objects
        .order_by('course_id', '-date')
        .distinct('course_id')
).order_by('-course__is_vip', '-date')
From the docs on distinct(*fields):
When you specify field names, you must provide an order_by() in the QuerySet, and the fields in order_by() must start with the fields in distinct(), in the same order.

Update multiple objects at once in Django?

I am using Django 1.9. I have a Django table that represents the value of a particular measure, by organisation and month, with raw values and percentiles:
class MeasureValue(models.Model):
    org = models.ForeignKey(Org, null=True, blank=True)
    month = models.DateField()
    calc_value = models.FloatField(null=True, blank=True)
    percentile = models.FloatField(null=True, blank=True)
There are typically 10,000 or so per month. My question is about whether I can speed up the process of setting values on the models.
Currently, I calculate percentiles by retrieving all the measurevalues for a month using a Django filter query, converting it to a pandas dataframe, and then using scipy's rankdata to set ranks and percentiles. I do this because pandas and rankdata are efficient, able to ignore null values, and able to handle repeated values in the way that I want, so I'm happy with this method:
records = MeasureValue.objects.filter(month=month).values()
df = pd.DataFrame.from_records(records)
# use calc_value to set percentile on each row, using scipy's rankdata
However, I then need to retrieve each percentile value from the dataframe, and set it back onto the model instances. Right now I do this by iterating over the dataframe's rows, and updating each instance:
for i, row in df.iterrows():
    mv = MeasureValue.objects.get(org=row.org, month=month)
    if (row.percentile is None) or np.isnan(row.percentile):
        row.percentile = None
    mv.percentile = row.percentile
    mv.save()
This is unsurprisingly quite slow. Is there any efficient Django way to speed it up, by making a single database write rather than tens of thousands? I have checked the documentation, but can't see one.
Atomic transactions can reduce the time spent in the loop:
from django.db import transaction

with transaction.atomic():
    for i, row in df.iterrows():
        mv = MeasureValue.objects.get(org=row.org, month=month)
        if (row.percentile is None) or np.isnan(row.percentile):
            # if it's already None, why set it to None?
            row.percentile = None
        mv.percentile = row.percentile
        mv.save()
Django's default behavior is to run in autocommit mode: each query is immediately committed to the database unless a transaction is active.
By using with transaction.atomic(), all the writes are grouped into a single transaction. The time needed to commit the transaction is amortized over all the enclosed statements, so the time per statement is greatly reduced.
As of Django 2.2, you can use the bulk_update() queryset method to efficiently update the given fields on the provided model instances, generally with one query:
objs = [
    Entry.objects.create(headline='Entry 1'),
    Entry.objects.create(headline='Entry 2'),
]
objs[0].headline = 'This is entry 1'
objs[1].headline = 'This is entry 2'
Entry.objects.bulk_update(objs, ['headline'])
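Applied to the MeasureValue example from the question, a sketch (assuming the dataframe's org column holds the Org primary key; fetching the month's rows once avoids a .get() per row):

import numpy as np

mvs = list(MeasureValue.objects.filter(month=month))
percentiles = {row.org: row.percentile for _, row in df.iterrows()}

for mv in mvs:
    p = percentiles.get(mv.org_id)
    # treat missing or NaN percentiles as NULL
    mv.percentile = None if p is None or np.isnan(p) else p

MeasureValue.objects.bulk_update(mvs, ['percentile'], batch_size=1000)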
In older versions of Django you could use update() with Case/When, e.g.:

from django.db.models import Case, Value, When

Entry.objects.filter(
    pk__in=headlines  # `headlines` is a pk -> headline mapping
).update(
    headline=Case(*[When(pk=entry_pk, then=Value(headline))
                    for entry_pk, headline in headlines.items()])
)

Note that string values for then= need the Value() wrapper; a plain string is interpreted as a field reference.
Actually, attempting @Eugene Yarmash's answer I found I got this error:
FieldError: Joined field references are not permitted in this query
But I believe an iterated update() is still quicker than multiple save() calls, and I expect wrapping it in a transaction should speed things up further.
So, for versions of Django that don't offer bulk_update, assuming the same data used in Eugene's answer, where headlines is a pk -> headline mapping:
from django.db import transaction

with transaction.atomic():
    for entry_pk, headline in headlines.items():
        Entry.objects.filter(pk=entry_pk).update(headline=headline)
