Django query to retrieve rows that meet aggregate criteria on related table - python

I'm really stuck on a Django query and am hoping you've got a couple minutes to help me figure it out.
I have a very simple model:
class Task(models.Model):
    # a tuple representing a specific item to be searched for a specific URL
    instructions = models.TextField()

ASSIGNMENT_STATUS_CHOICES = (
    ('a', 'assigned'),
    ('c', 'submitted for review'),
    ('f', 'finished'),
    ('r', 'rejected'),
)

class Assignment(models.Model):
    # the overall container representing a collection of terms for a page found
    # by a user
    user = models.ForeignKey(User)
    task = models.ForeignKey(Task, related_name='assignments')
    status = models.CharField(max_length=1, choices=ASSIGNMENT_STATUS_CHOICES, default='a')
What I want to do is randomly select a task that has fewer than N assignments with status != 'r'. In other words, I want to make sure each task gets successfully completed N times, so when a worker requests a task it needs one that has fewer than N assignments in a state that could lead to completion.
I'm just totally lost trying to figure out the query that would return such tasks. For a given task, I can test:
task.assignments.exclude(status='r').count() < N
and if that is true, it's a candidate. But how do I query Task.objects so that it returns all candidates in a single database query, letting me randomly choose one:
Task.objects.<some magic filter>.order_by('?')[0]
Any help would be appreciated!

You can annotate each task with the number of non-rejected assignments and filter on that. Note that a plain .exclude(assignments__status='r') would drop any task that has even one rejected assignment, so use a filtered Count instead (available since Django 2.0):

from django.db.models import Count, Q

Task.objects.annotate(
    non_rejected=Count('assignments', filter=~Q(assignments__status='r'))
).filter(non_rejected__lt=N)
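One caveat: order_by('?') makes the database shuffle every candidate row, which can get slow on large tables. A lighter-weight sketch, assuming the annotated queryset above, picks a random offset instead:

import random

from django.db.models import Count, Q

candidates = Task.objects.annotate(
    non_rejected=Count('assignments', filter=~Q(assignments__status='r'))
).filter(non_rejected__lt=N)

# Two small queries: one COUNT, then one LIMIT/OFFSET fetch.
count = candidates.count()
task = candidates[random.randrange(count)] if count else None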

Django auto increment fields

I have 2 columns named Serial and Bag. I need them to be auto-incremented, but based on each other and on the user that will update the record: every bag should hold 100 serials, and the serial should reset automatically after reaching 100, then start again with Bag number 2 and put 100 serials in it, and so on.
For example:
when the user updates the first record, Bag starts at number 1 and Serial is also number 1; for the second record, Bag is still 1 and Serial changes to 2, until one Bag reaches 100 Serials; then we start again with Bag number 2 and Serial number 1, etc ...
Thanks
The way you explain your example is a bit confusing but I'll try to give you an answer.
I assume the "2 columns named Serial and Bag" are fields of the same model and as you replied in the comments "the record is already existing but it has empty serial and bag", which means the auto-increment begins when the record is updated. Lastly, you mentioned first and second records implying that there are multiple records in this model. Based on these criteria, what you can do is add a save method in your model:
# Sample model
class Record(models.Model):
    bag = models.IntegerField(default=0, null=True)
    serial = models.IntegerField(default=0, null=True)
    # auto_now_add so the timestamp is set once, on creation
    created_at = models.DateTimeField(auto_now_add=True, null=True)

    def save(self, *args, **kwargs):
        # Ensures the record will only auto-increment during an update
        if self.created_at:
            # Retrieves the object with the highest bag & serial value
            latest_record = Record.objects.all().order_by('bag', 'serial').last()
            # Incrementing logic
            if latest_record.serial + 1 <= 100:
                self.bag = latest_record.bag if latest_record.bag > 0 else 1
                self.serial = latest_record.serial + 1
            else:
                self.bag = latest_record.bag + 1
                self.serial = 1
        super(Record, self).save(*args, **kwargs)
Now, each time you save an existing record, like:

record = Record.objects.first()
record.save()

the model's save method runs the incrementing logic.
Rather than doing the incrementing logic in Python, where it is subject to race conditions if multiple updates can happen concurrently, it should be possible to push it down into the database.
Something like:
update foop set
    bag = vala,
    ser = valb
from (
    select
        case when ser >= 100 then bag + 1 else bag end as vala,
        case when ser >= 100 then 1 else ser + 1 end as valb
    from foop
    order by bag desc nulls last,
             ser desc nulls last
    limit 1) as tt
where some_primarykey = %s;
It might be possible to translate that into the Django ORM, but it might also be easier and more readable to just drop into raw SQL, or sneak it in via .extra() on a queryset, than attempt to shoehorn it in.
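For completeness, a rough sketch of executing such a statement from Django with a raw cursor; the table and column names (foop, bag, ser, some_primarykey) are the same placeholders as above, and the 100 threshold comes from the question:

from django.db import connection

def bump_bag_and_serial(pk):
    # Runs the UPDATE ... FROM (SELECT ... LIMIT 1) statement above in a
    # single round-trip, letting the database do the increment atomically.
    with connection.cursor() as cursor:
        cursor.execute("""
            UPDATE foop SET bag = vala, ser = valb
            FROM (SELECT
                      CASE WHEN ser >= 100 THEN bag + 1 ELSE bag END AS vala,
                      CASE WHEN ser >= 100 THEN 1 ELSE ser + 1 END AS valb
                  FROM foop
                  ORDER BY bag DESC NULLS LAST, ser DESC NULLS LAST
                  LIMIT 1) AS tt
            WHERE some_primarykey = %s
        """, [pk])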

How do I find each instance of a model in Django

I am trying to iterate through each instance of a model I have defined.
Say I have the following in models.py, under the people django app:
class foo(models.Model):
    name = models.CharField(max_length=20, default="No Name")
    age = models.PositiveSmallIntegerField(default=0)
And I have populated the database to have the following data in foo:
name="Charley", age=17
name="Matthew", age=63
name="John", age=34
Now I want to work out the average age of my dataset. In another part of the program (outside the people app, inside the project folder), in a file called bar.py that will be set up to run automatically every day at a specific time, I calculate this average.
from people.models import foo

def findAverageAge():
    total = 0
    for instance in //What goes here?//:
        total += instance.age
    length = //What goes here, to find the number of instances?//
    average = total / length
    return average

print(findAverageAge())
In the code above, the //What goes here?// signifies I don't know what goes there and need help.
You can retrieve all elements with .all() [Django-doc]:

from people.models import foo

def findAverageAge():
    total = 0
    qs = foo.objects.all()
    for instance in qs:
        total += instance.age
    length = len(qs)
    average = total / length
    return average

print(findAverageAge())
But you should not calculate the average at the Django/Python level. This requires retrieving all records, and the dataset can be quite large. The database itself can calculate the average, and this is (normally) done in a more efficient way. You can .aggregate(…) [Django-doc] on the queryset with:
from django.db.models import Avg

from people.models import foo

def findAverageAge():
    return foo.objects.aggregate(
        avg_age=Avg('age')
    )['avg_age'] or 0

print(findAverageAge())
You should make this a management command, however: that will load the Django apps properly, and it also makes it more convenient to run the command with parameters, etc.
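A minimal sketch of such a command, assuming the people app from the question (the file path and command name are illustrative):

# people/management/commands/average_age.py  (hypothetical location)
from django.core.management.base import BaseCommand
from django.db.models import Avg

from people.models import foo

class Command(BaseCommand):
    help = 'Print the average age across all foo records.'

    def handle(self, *args, **options):
        # Same database-level aggregate as above.
        avg = foo.objects.aggregate(avg_age=Avg('age'))['avg_age'] or 0
        self.stdout.write(str(avg))

It can then be scheduled (e.g. via cron) as python manage.py average_age.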

Python/Django loop optimization

I am currently working on a Django website and at a certain point need to generate a dataset for a graph out of model data. Over the last few days I have been trying to optimize the code that generates that data, but it is still relatively slow with a dataset that will definitely get a lot larger once the website is up.
The model i am working with looks like this:
class Prices(models.Model):
    name = models.CharField(max_length=200, blank=False)
    website = models.CharField(max_length=200, blank=False, choices=WEBSITES)
    # pass the callable date.today, not date.today(), so the default is
    # evaluated per row instead of once at class-definition time
    date = models.DateField('date', default=date.today, blank=False, null=False)
    price = models.IntegerField(default=0, blank=False, null=False)

    class Meta:
        indexes = [
            models.Index(fields=['website']),
            models.Index(fields=['date']),
        ]
It stores prices for different items (defined by the name) on different websites.
For the graph representation of the model I need two arrays: one with all the dates for one axis, and one with the prices of the item at each date (which might not be available; not every item has an entry for every day on every website). And all of that for every website.
My code for generating that data looks like this atm:
for website in websites:
    iterator = Prices.objects.filter(website=website).iterator()
    data = []
    entry = next(iterator)
    for date in labels:
        if entry is not None and date == entry.date:
            data.append(entry.price * 0.01)
            entry = next(iterator, None)
        else:
            data.append(None)
    ...  # do stuff with data (not relevant for performance)
I loop over each website and retrieve all price data from my model. Then I loop over all dates (which are in the array labels), check whether a price entry for that date is available and, if so, add it to the array, otherwise None.
Does anybody have tips or ideas on how to optimize the inner for loop, as that is what makes up 90% of my performance problem?
This might not be as efficient as you want, but you can try this. If this still does not meet your requirements, then all I can think of is doing this asynchronously.

from collections import defaultdict
from functools import reduce

iterator = Prices.objects.filter(website=website)
result = reduce(
    lambda acc, prices: acc[prices.date].append((prices.price * 0.01, prices.website)) or acc,
    filter(lambda x: x.date in labels, iterator),
    defaultdict(list),
)
One way to do this asynchronously is with the concurrent.futures module; you can cap the number of workers with ThreadPoolExecutor(max_workers=10). Moreover, if you want multiple processes instead of threads, you can replace ThreadPoolExecutor with ProcessPoolExecutor (though with separate processes the shared result dict below would no longer be visible to the workers; you would need to collect returned values instead).
import concurrent.futures
from collections import defaultdict

result = defaultdict(list)

def reducer_function(price):
    if price.date in labels:
        result[price.date].append((price.price * 0.01, price.website))

with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(reducer_function, iterator)
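If threads feel like overkill, the linear scan over the iterator can also be avoided with a plain dict keyed by date. A minimal sketch, assuming websites and labels are the same variables as in the question (if a site has several prices on one date, the dict keeps the last row returned):

# One query per website; each date lookup afterwards is O(1).
for website in websites:
    prices_by_date = dict(
        Prices.objects.filter(website=website).values_list('date', 'price')
    )
    data = [
        prices_by_date[d] * 0.01 if d in prices_by_date else None
        for d in labels
    ]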

Making calculations with database columns

I have the following models in Django:

class Job(models.Model):
    cost = models.FloatField()

class Account(models.Model):
    job = models.ManyToManyField(Job, through='HasJob')

class HasJob(models.Model):
    account = models.ForeignKey(Account, related_name='hasjobs')
    job = models.ForeignKey(Job, related_name='hasjobs')
    quantity = models.IntegerField()
So an Account can have many jobs in different quantities. I want to be able to sum up the total cost of an account. Is that possible at the database level, or should I do it in Python? Like:
account = Account.objects.get(pk=1)
sum = 0
for hasjob in account.hasjobs.all():
    sum += hasjob.quantity * hasjob.job.cost
I know it's a very "starters" way to do that, and I am guessing it involves many hits on the database. So is there a better way?
AFAIK, aggregation can't sum over F() expressions (at least in older Django versions), so you have to calculate the sum in Python code.
But you can reduce the number of db hits to one: just add a select_related() call to the queryset:
total_sum = sum(hasjob.quantity * hasjob.job.cost
                for hasjob in account.hasjobs.all().select_related('job'))
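For what it's worth, newer Django versions can push the whole thing into the database: since Django 1.8, aggregates accept expressions. A sketch (output_field is needed because quantity is an integer while cost is a float):

from django.db.models import F, FloatField, Sum

# One query: the database computes SUM(quantity * cost).
total = account.hasjobs.aggregate(
    total=Sum(F('quantity') * F('job__cost'), output_field=FloatField())
)['total'] or 0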

Loading datasets from the datastore and merging them into a single dictionary. Resource problem

I have a product database that contains products, parts and labels for each part based on langcodes.
The problem I'm having, and haven't gotten around, is the huge amount of resources used to get the different datasets and merge them into a dict to suit my needs.
The products in the database are based on a number of parts of a certain type (i.e. color, size), and each part has a label for each language. I created 4 different models for this: Products, ProductParts, ProductPartTypes and ProductPartLabels.
I've narrowed it down to about 10 lines of code that seem to generate the problem. Currently I have 3 products, 3 types, 3 parts for each type, and 2 languages, and the request takes a whopping 5500 ms to generate.
for product in productData:
    productDict = {}
    typeDict = {}
    productDict['productName'] = product.name

    cache_key = 'productparts_%s' % (slugify(product.key()))
    partData = memcache.get(cache_key)

    if not partData:
        for type in typeData:
            typeDict[type.typeId] = {'default': '', 'optional': []}

        ## Start of problem lines ##
        for defaultPart in product.defaultPartsData:
            for label in labelsForLangCode:
                if label.key() in defaultPart.partLabelList:
                    typeDict[defaultPart.type.typeId]['default'] = label.partLangLabel

        for optionalPart in product.optionalPartsData:
            for label in labelsForLangCode:
                if label.key() in optionalPart.partLabelList:
                    typeDict[optionalPart.type.typeId]['optional'].append(label.partLangLabel)
        ## end problem lines ##

        memcache.add(cache_key, typeDict, 500)
        partData = memcache.get(cache_key)

    productDict['parts'] = partData
    productList.append(productDict)
I guess the problem lies in the number of for loops: there are too many, and I have to iterate over the same data over and over again. labelsForLangCode gets all labels from ProductPartLabels that match the current langCode.
All parts for a product are stored in a db.ListProperty(db.Key). The same goes for all labels for a part.
The reason I need the somewhat complex dict is that I want to display all data for a product with its default parts and show a selector for the optional ones.
The defaultPartsData and optionalPartsData are properties on the Product model that look like this:
@property
def defaultPartsData(self):
    return ProductParts.gql('WHERE __key__ IN :key', key=self.defaultParts)

@property
def optionalPartsData(self):
    return ProductParts.gql('WHERE __key__ IN :key', key=self.optionalParts)
When the completed dict is in memcache it works smoothly, but isn't memcache reset if the application goes into hibernation? Also, I would like to show the page to first-time users (memcache empty) without the enormous delay.
Also, as I said above, this is only a small number of parts/products. What will the result be when it's 30 products with 100 parts?
Is one solution to create a scheduled task that caches this in memcache every hour? Is that efficient?
I know this is a lot to take in, but I'm stuck. I've been at this for about 12 hours straight and can't figure out a solution.
..fredrik
EDIT:
An AppStats screenshot here.
From what I can read, the queries seem fine in AppStats, only taking about 200-400 ms. How can the difference be that big?
EDIT 2:
I implemented dound's solution and added a bit. Now it looks like this:
langCode = 'en'

typeData = Products.ProductPartTypes.all()
productData = Products.Product.all()
labelsForLangCode = Products.ProductPartLabels.gql('WHERE partLangCode = :langCode', langCode=langCode)

productList = []

label_cache_key = 'productpartslabels_%s' % (slugify(langCode))
labelData = memcache.get(label_cache_key)

if labelData is None:
    langDict = {}
    for langLabel in labelsForLangCode:
        langDict[str(langLabel.key())] = langLabel.partLangLabel
    memcache.add(label_cache_key, langDict, 500)
    labelData = memcache.get(label_cache_key)

GQL_PARTS_BY_PRODUCT = Products.ProductParts.gql('WHERE products = :1')

for product in productData:
    productDict = {}
    typeDict = {}
    productDict['productName'] = product.name

    cache_key = 'productparts_%s' % (slugify(product.key()))
    partData = memcache.get(cache_key)

    if partData is None:
        for type in typeData:
            typeDict[type.typeId] = {'default': '', 'optional': []}

        GQL_PARTS_BY_PRODUCT.bind(product)
        parts = GQL_PARTS_BY_PRODUCT.fetch(1000)
        for part in parts:
            for lb in part.partLabelList:
                if str(lb) in labelData:
                    label = labelData[str(lb)]
                    break
            if part.key() in product.defaultParts:
                typeDict[part.type.typeId]['default'] = label
            elif part.key() in product.optionalParts:
                typeDict[part.type.typeId]['optional'].append(label)

        memcache.add(cache_key, typeDict, 500)
        partData = memcache.get(cache_key)

    productDict['parts'] = partData
    productList.append(productDict)
The result is much better. I now have about 3000 ms without memcache and about 700 ms with.
I'm still a bit worried about the 3000 ms, and on the local app_dev server memcache gets filled up on each reload. Shouldn't it put everything in there and then read from it?
Last but not least, does anyone know why the request takes about 10x as long on the production server as on app_dev?
EDIT 3:
I noticed that none of the db.Model classes are indexed; could this make a difference?
EDIT 4:
After consulting AppStats (and understanding it, which took some time), it seems that the big problem lies within part.type.typeId, where part.type is a db.ReferenceProperty. I should have seen it before, and maybe explained it better :) I'll rethink that part and get back to you.
..fredrik
A few simple ideas:
1) Since you need all the results, instead of doing a for loop like you have, call fetch() explicitly to just go ahead and get all the results at once. Otherwise, the for loop may result in multiple queries to the datastore as it only gets so many items at once. For example, perhaps you could try:
return ProductParts.gql('WHERE __key__ IN :key', key = self.defaultParts).fetch(1000)
2) Maybe only load part of the data in the initial request. Then use AJAX techniques to load additional data as needed. For example, start by returning the product information, and then make additional AJAX requests to get the parts.
3) Like Will pointed out, IN queries perform one query PER argument.
Problem: An IN query does one equals query for each argument you give it. So key IN self.defaultParts actually does len(self.defaultParts) queries.
Possible Improvement: Try denormalizing your data more. Specifically, store on each part a list of the products it is used in. You could structure your ProductParts model like this:
class ProductParts(db.Model):
    ...
    products = db.ListProperty(db.Key)  # product keys
    ...
Then you can do ONE query per product instead of N queries per product. For example, you could do this:
parts = ProductParts.all().filter("products =", product).fetch(1000)
The trade-off? You have to store more data in each ProductParts entity. Also, when you write a ProductParts entity, it will be a little slower because it will cause 1 row to be written in the index for each element in your list property. However, you stated that you only have 100 products so even if a part was used in every product the list still wouldn't be too big (Nick Johnson mentions here that you won't get in trouble until you try to index a list property with ~5,000 items).
Less critical improvement idea:
4) You can create the GqlQuery object ONCE and then reuse it. This isn't your main performance problem by any stretch, but it will help a little. Example:
GQL_PROD_PART_BY_KEYS = ProductParts.gql('WHERE __key__ IN :1')

@property
def defaultPartsData(self):
    return GQL_PROD_PART_BY_KEYS.bind(self.defaultParts)
You should also use AppStats so you can see exactly why your request is taking so long. You might even consider posting a screenshot of appstats info about your request along with your post.
Here is what the code might look like if you rewrote it to fetch the data with fewer round-trips to the datastore (these changes are based on ideas #1, #3, and #4 above).
GQL_PARTS_BY_PRODUCT = ProductParts.gql('WHERE products = :1')

for product in productData:
    productDict = {}
    typeDict = {}
    productDict['productName'] = product.name

    cache_key = 'productparts_%s' % (slugify(product.key()))
    partData = memcache.get(cache_key)

    if not partData:
        for type in typeData:
            typeDict[type.typeId] = {'default': '', 'optional': []}

        # here's a new approach that does just ONE datastore query (for each product)
        GQL_PARTS_BY_PRODUCT.bind(product)
        parts = GQL_PARTS_BY_PRODUCT.fetch(1000)
        for part in parts:
            if part.key() in product.defaultParts:
                part_type = 'default'
            else:
                part_type = 'optional'
            for label in labelsForLangCode:
                if label.key() in part.partLabelList:
                    if part_type == 'default':
                        typeDict[part.type.typeId]['default'] = label.partLangLabel
                    else:
                        typeDict[part.type.typeId]['optional'].append(label.partLangLabel)
        # (end new code)

        memcache.add(cache_key, typeDict, 500)
        partData = memcache.get(cache_key)

    productDict['parts'] = partData
    productList.append(productDict)
One important thing to be aware of is the fact that IN queries (along with != queries) result in multiple subqueries being spawned behind the scenes, and there's a limit of 30 subqueries.
So your ProductParts.gql('WHERE __key__ IN :key', key = self.defaultParts) query will actually spawn len(self.defaultParts) subqueries behind the scenes, and it will fail if len(self.defaultParts) is greater than 30.
Here's the relevant section from the GQL Reference:
Note: The IN and != operators use multiple queries behind the scenes. For example, the IN operator executes a separate underlying datastore query for every item in the list. The entities returned are a result of the cross-product of all the underlying datastore queries and are de-duplicated. A maximum of 30 datastore queries are allowed for any single GQL query.
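If you do need a by-key lookup for more than 30 keys, one hedged alternative is to skip GQL for key lookups entirely: db.get() accepts a list of keys and performs a single batch fetch, with no subquery limit. A minimal sketch, using the defaultParts key list from the question:

from google.appengine.ext import db

# Batch fetch by key: one datastore round-trip, no 30-subquery cap.
default_parts = db.get(self.defaultParts)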
You might try installing AppStats for your app to see where else it might be slowing down.
I think the problem is one of design: wanting to construct a relational join table in memcache when the framework specifically abhors that.
GAE will toss your job out because it takes too long, but you shouldn't be doing it in the first place. I'm a GAE tyro myself, so I cannot specify how it should be done, unfortunately.
