I'm trying to figure out if there's a way to do a somewhat-complex aggregation in Django using its ORM, or if I'm going to have to use extra() to stick in some raw SQL.
Here are my object models (stripped to show just the essentials):
class Submission(models.Model):
    favorite_of = models.ManyToManyField(User, related_name="favorite_submissions")

class Response(models.Model):
    submission = models.ForeignKey(Submission)
    voted_up_by = models.ManyToManyField(User, related_name="voted_up_responses")
What I want to do is sum all the votes for a given submission: that is, all of the votes for any of its responses, plus the number of people who marked the submission as a favorite.
I have the first part working using the following code; this returns the total votes for all responses of each submission:
submission_list = Response.objects\
    .values('submission')\
    .annotate(votes=Count('voted_up_by'))\
    .filter(votes__gt=0)\
    .order_by('-votes')[:TOP_NUM]
(So after getting the vote total, I sort in descending order and return the top TOP_NUM submissions, to get a "best of" listing.)
That part works. Is there any way you can suggest to include the number of people who have favorited each submission in its votes? (I'd prefer to avoid extra() for portability, but I'm thinking it may be necessary, and I'm willing to use it.)
EDIT: I realized after reading the suggestions below that I should have been clearer in my description of the problem. The ideal solution would be one that allowed me to sort by total votes (the sum of voted_up_by and favorited) and then pick just the top few, all within the database. If that's not possible then I'm willing to load a few of the fields of each response and do the processing in Python; but since I'll be dealing with 100,000+ records, it'd be nice to avoid that overhead. (Also, to Adam and Dmitry: I'm sorry for the delay in responding!)
One possibility would be to re-arrange your current query slightly. What if you tried something like the following:
submission_list = Response.objects\
    .annotate(votes=Count('voted_up_by'))\
    .filter(votes__gt=0)\
    .order_by('-votes')[:TOP_NUM]
submission_list.query.group_by = ['submission_id']
This will return a queryset of Response objects (objects with the same Submission will be lumped together). In order to access the related submission and/or the favorite_of list/count, you have two options:
num_votes = submission_list[0].votes
submission = submission_list[0].submission
num_favorite = submission.favorite_of.count()
or...
submissions = []
for response in submission_list:
    submission = response.submission
    submission.votes = response.votes
    submissions.append(submission)

num_votes = submissions[0].votes
submission = submissions[0]
num_favorite = submission.favorite_of.count()
Basically, the first option has the benefit of still being a queryset, but you have to be sure to access the submission object in order to get any info about the submission (since each object in the queryset is technically a Response). The second option has the benefit of being a list of the submissions with both the favorite_of list and the votes, but it is no longer a queryset (so be sure you don't need to alter the query afterwards).
You can count favorites in another query, like
favorite_list = Submission.objects.annotate(favorites=Count('favorite_of'))
After that you combine the values from the two lists (note that since submission_list was built with values('submission'), each item is a dict):
total_votes = {}
for item in submission_list:
    total_votes[item['submission']] = item['votes']
for item in favorite_list:
    has_votes = total_votes.get(item.id, 0)
    total_votes[item.id] = has_votes + item.favorites
I am using ids in the dictionary because Submission objects fetched by different queries will not be identical. If you need the Submissions themselves, you may use one more dictionary, or store a tuple (submission, votes) instead of just the votes.
Added: this solution is better than the previous because you have only two DB requests.
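If the pure-database route doesn't work out, the asker's follow-up (sorting by combined total and taking the top few) can be finished in Python once total_votes is built. A minimal sketch, with hypothetical totals standing in for the real dictionary:

```python
# Hypothetical combined totals, submission id -> votes, as built above.
total_votes = {1: 12, 2: 30, 3: 7, 4: 30}

TOP_NUM = 3
# Sort by combined total, descending, and keep the top TOP_NUM entries.
top = sorted(total_votes.items(), key=lambda pair: pair[1], reverse=True)[:TOP_NUM]
```

Bear in mind this sorts the whole dictionary in memory, which is exactly the overhead the asker hoped to avoid with 100,000+ records, so it's a fallback rather than a first choice.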
I have 2 classes, with many-to-many relationship, my goal is to fill an 'item' list with data from that 2 models, here are my models:
class Bakery(models.Model):
    title = models.CharField('restaurant_name', max_length=100)

class DeliveryService(models.Model):
    title = models.CharField('deliveryservice_name', max_length=100)
    bakery = models.ManyToManyField(Bakery)
Here is the logic on my 'views' file:
item = []
bakerys = Bakery.objects.all()
for i in bakerys:
    item.append(i.title)
    item.append(i.deliveryservice.title)
I hope it's clear what exactly I want to accomplish. I know my current 'views' logic is wrong; I just don't know what to do to solve this problem. Thank you for your time.
The following seems to do what you're asking for. But it seems odd that you want to create a single list with the titles of different objects all mixed together, likely with duplicates (if a delivery service is linked to more than one bakery, its title will be added twice).
item = []
bakerys = Bakery.objects.all()
for i in bakerys:
    item.append(i.title)
    for j in i.deliveryservice_set.all():
        item.append(j.title)
You should really read up on the many-to-many functionality of the ORM. The documentation is pretty clear on how to do these things.
Sayse had a good answer too if you really just want all the titles. Their answer also keeps the related values grouped together and is more efficient, using fewer DB queries: Bakery.objects.values('title', 'deliveryservice__title')
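For illustration, here is roughly what that values() call yields and how it maps onto the asker's item list. The sample rows below are made-up data standing in for the queryset; values() produces one dict per (bakery, delivery service) pair, keyed by the lookup strings, so a bakery's title repeats once per linked service:

```python
# Sample rows standing in for the values() queryset (hypothetical names).
rows = [
    {'title': 'Sunrise Bakery', 'deliveryservice__title': 'FastBike'},
    {'title': 'Sunrise Bakery', 'deliveryservice__title': 'CityVan'},
    {'title': 'Corner Bakery', 'deliveryservice__title': 'FastBike'},
]

# Flatten into the single `item` list the question asks for.
item = []
for row in rows:
    item.append(row['title'])
    item.append(row['deliveryservice__title'])
```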
So I'm currently working on a project where I'm trying to improve the code.
Currently, I have this as my views.py
def home1(request):
    if request.user.is_authenticated():
        location = request.GET.get('location', request.user.profile.location)
        users = User.objects.filter(profile__location=location)
        print users
        matchesper = Match.objects.get_matches_with_percent(request.user)
        print matchesper
        matches = [match for match in matchesper if match[0] in users][:20]
Currently, users gives me back a list of users that have the same location as request.user, and matchesper gives me a match percentage for all users. matches then uses these two lists to give me back a list of users, with their match percentages, who match request.user's location.
This works perfectly; however, as soon as the number of users on the website increases, this will become very slow. I could add [:50] at the end of matchesper, for example, but that means you would never match with older users that have the same location as request.user.
My question is: is there not a way to compute matchesper for only the users that have the same location? Could I use an if statement before matchesper, or a for loop?
I haven't written this code but I do understand it; however, when trying to improve it I get very stuck. I hope my explanation and question make sense.
Thank you in advance for any help; I'm very stuck!
(I'm assuming you're using the matchmaker project.)
In Django, you can chain QuerySet methods. You'll notice that the models.py file you're working from defines both a MatchQuerySet and a MatchManager. You might also notice that get_matches_with_percent is only defined on the Manager, not the QuerySet.
This is a problem, but not an insurmountable one. One way around it is to modify which QuerySet our manager method actually works on. We can do this by creating a new method that is basically a copy of get_matches_with_percent, but with some additional filtering.
class MatchManager(models.Manager):
    [...]

    def get_matches_with_percent_by_location(self, user, location=None):
        if location is None:
            location = user.profile.location
        user_a = Q(user_a__profile__location=location)
        user_b = Q(user_b__profile__location=location)
        qs = self.get_queryset().filter(user_a | user_b).matches(user).order_by('-match_decimal')
        matches = []
        for match in qs:
            if match.user_a == user:
                items_wanted = [match.user_b, match.get_percent]
                matches.append(items_wanted)
            elif match.user_b == user:
                items_wanted = [match.user_a, match.get_percent]
                matches.append(items_wanted)
            else:
                pass
        return matches
Note the repeated chaining in the `qs = ...` line! That's the magic.
Other notes:
Q objects are a way of doing complex queries, like multiple "OR" conditions.
An even better solution would factor out the elements that are common to get_matches_with_percent and get_matches_with_percent_by_location to keep the code "DRY", but this is good enough for now ;)
Be mindful of the fact that get_matches_with_percent returns a vanilla list instead of a Django QuerySet; it's a "terminal" method. Thus, you can't use any other QuerySet methods (like filter) after invoking get_matches_with_percent.
I have a data model like this:
class Post(models.Model):
    name = models.CharField(max_length=255)

class Tag(models.Model):
    name = models.CharField(max_length=255)
    rating = models.FloatField(max_length=255)
    parent = models.ForeignKey(Post, related_name="tags")
I want to get Posts that have a given tag, and order them by that tag's rating,
something like:
Posts.objects.filter(tags__name="exampletag").order_by("tags(name=exampletag)__rating")
Currently, I am thinking it makes sense to do something like
tags = Tags.objects.filter(name="sometagname").order_by("rating")[0:10]
posts = [t.parent for t in tags]
But I'd like to know if there is a better way, preferably querying Post and getting back a queryset.
Edit:
I don't think this: (Edit 2 - this does give the correct sorting!)
Posts.objects.filter(tags__name="exampletag").order_by("tags__rating")
will give the correct sorting, as it does not sort only by the related item with name "exampletag"
Something like the following would be needed
Posts.objects.filter(tags__name="exampletag").order_by("tags(name=exampletag)__rating")
I've been looking over the Django docs, and it seems "annotate" nearly works - but I don't see a way to use it to select a tag by name.
Edit 2
Both of the answers are correct! See my comments to observe some epic brain-farts (in one test the results WERE in order; in the other I filtered and sorted by different tags!).
How it works:
the query
Posts.objects.filter(tags__name="exampletag").order_by("tags__rating")
and
Posts.objects.filter(tags__name="exampletag").filter(tags__name="someothertag").order_by("tags__rating")
will work correctly and be sorted by the rating of "exampletag".
It seems the tag (from a ForeignKey back-reference set) used for sorting when calling order_by is the one in the first filter.
You can do it like this:
tags = Tags.objects.filter(name="sometagname")
posts = Post.objects.filter(tags__in=tags).order_by('tags__rating')
Even shorter than Anush's, with a JOIN rather than a subquery:
Post.objects.filter(tags__name='exampletag').order_by('tags__rating')
I have a model similar to this one:
class MyModel(models.Model):
    name = models.CharField(max_length=30)
    a = models.ForeignKey(External)
    b = models.ForeignKey(External, related_name='MyModels_a')

    def __unicode__(self):
        return self.name + self.a.name + self.b.name
So when I query it I get something like this:
>>> MyModel.objects.all()
[<MyModel: Name1AB>,<MyModel: Name2AC>,<MyModel: Name3CB>,<MyModel: Name4BA>,<MyModel: Name5BA>]
And I'd like to represent this data like the following:
[[ [] , [Name1AB] , [Name2AC] ]
[ [Name4BA, Name5BA] , [] , [] ]
[ [] , [Name3CB] , [] ]]
As you can see, the rows would be 'a' in the model, and the columns would be 'b'.
I can do this, but it takes a long time because the real database has a lot of data. I'd like to know if there's a built-in Django way to do this.
I'm doing it like this:
mymodel_list = MyModel.objects.all()
external_list = External.objects.all()
for i in external_list:
    for j in external_list:
        print(mymodel_list.filter(a=i).filter(b=j).all(), end='')
    print()
Thanks
Three ways of doing it, but you will have to research a bit more. The third option may be the most suitable for what you are looking for.
1) Django queries
The reason it is taking a long time is because you are constantly accessing the database in this line:
print(mymodel_list.filter(a=i).filter(b=j).all(), end='')
You may have to start reading what the Django documentation says about the way of working with queries. For what you are doing, you have to design the algorithm to avoid the per-cell filters. Using MyModel.objects.order_by('a') may help you build an efficient algorithm.
2) {% ifchanged ...%} tag
I suppose you are using print to post your answer but you probably need it in html. In that case, you may want to read about the ifchanged tag. It will allow you to build your matrix in html with just one db access.
3) Many to many relations
It seems you are sort of simulating a many to many relation in a very peculiar way. Django has support for many to many relations. You will need an extra field, so you will also have to read this.
Finally, for doing what you are trying with just one access to the database, you will need to read prefetch_related
There's no built-in way (because what you need is not common). You should write it manually, but I'd recommend retrieving the full dataset (or at least the dataset for one table row) and processing it in Python, instead of hitting the DB in each table cell.
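To make the "one query, then process in Python" idea concrete, here is a sketch with plain tuples standing in for the MyModel rows and hypothetical letters standing in for the External objects:

```python
from collections import defaultdict

# (a, b, name) tuples stand in for the rows of MyModel.objects.all();
# 'A', 'B', 'C' stand in for the External objects.
rows = [
    ('A', 'B', 'Name1AB'),
    ('A', 'C', 'Name2AC'),
    ('C', 'B', 'Name3CB'),
    ('B', 'A', 'Name4BA'),
    ('B', 'A', 'Name5BA'),
]
externals = ['A', 'B', 'C']

# One pass over the dataset: group names by their (a, b) cell.
cells = defaultdict(list)
for a, b, name in rows:
    cells[(a, b)].append(name)

# Rows are 'a', columns are 'b', matching the layout in the question.
matrix = [[cells[(a, b)] for b in externals] for a in externals]
```

This touches the database once (the single query that produces rows) instead of once per cell.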
I am updating and adding items from a feed (which can have about 40,000 items) to the datastore, 200 items at a time. The problem is that the feed can change, and some items might be deleted from it.
I have this code:
class FeedEntry(db.Model):
    name = db.StringProperty(required=True)

def updateFeed(offset, number=200):
    response = fetchFeed(offset, number)
    feedItems = parseFeed(response)
    feedEntriesToAdd = []
    for item in feedItems:
        feedEntriesToAdd.append(
            FeedEntry(key_name=item.id, name=item.name)
        )
    db.put(feedEntriesToAdd)
How do I find out which items were not in the feed and delete them from the datastore?
I thought about creating a list of items (in the datastore) and removing from it all the items that I updated; the ones left would be the ones to delete - but that seems rather slow.
PS: All item.id values are unique for that feed and are consistent.
If you add a DateTimeProperty with auto_now=True, it will record the last modified time of each entity. Since you update every item in the feed, by the time you've finished they will all have times after the moment you started, so anything with a date before then isn't in the feed any more.
Xavier's generation counter is just as good - all we need is something guaranteed to increase between refreshes, and never decrease during a refresh.
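Either way, the check reduces to a simple comparison. A sketch of the timestamp variant, with plain tuples standing in for the stored entities (the key names and dates are made up):

```python
from datetime import datetime, timedelta

# Record the moment the refresh starts; with auto_now=True, every entity
# written during the refresh ends up with a later timestamp.
refresh_start = datetime(2010, 1, 1, 12, 0)

# (key_name, last_modified) pairs stand in for stored FeedEntry entities.
entries = [
    ('item-1', refresh_start + timedelta(minutes=5)),   # touched this refresh
    ('item-2', refresh_start - timedelta(days=1)),      # dropped from the feed
]

# Anything last modified before the refresh started is no longer in the
# feed; in the datastore this would be a filter on the DateTimeProperty.
stale = [key for key, modified in entries if modified < refresh_start]
```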
Not sure from the docs, but I expect a DateTimeProperty is bigger than an IntegerProperty. The latter is a 64-bit integer, so they might be the same size, or DateTimeProperty may store several integers. A group post suggests it may be 10 bytes as opposed to 8.
But remember that by adding an extra property that you do queries on, you're adding another index anyway, so the difference in size of the field is diluted as a proportion of the overhead. Further, 40k times a few bytes isn't much even at $0.24/G/month.
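To put rough numbers on that claim (the 10-byte figure is the estimate above, not a documented size):

```python
# Back-of-the-envelope for the extra property, using the figures above.
entities = 40000
bytes_per_property = 10       # the suggested DateTimeProperty size
price_per_gb_month = 0.24     # storage price quoted above

total_gb = entities * bytes_per_property / float(1024 ** 3)
monthly_cost = total_gb * price_per_gb_month
```

That works out to well under a tenth of a cent per month, before counting index overhead.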
With either a generation or a datetime, you don't necessarily have to delete the data immediately. Your other queries could filter on date/generation of the most recent refresh, meaning that you don't have to delete data immediately. If the feed (or your parsing of it) goes funny and fails to produce any items, or only produces a few, it might be useful to have the last refresh lying around as a backup. Depends entirely on the app whether it's worth having.
I would add a generation counter
class FeedEntry(db.Model):
    name = db.StringProperty(required=True)
    generation = db.IntegerProperty(required=True)

def updateFeed(offset, generation, number=200):
    response = fetchFeed(offset, number)
    feedItems = parseFeed(response)
    feedEntriesToAdd = []
    for item in feedItems:
        feedEntriesToAdd.append(
            FeedEntry(key_name=item.id, name=item.name, generation=generation)
        )
    db.put(feedEntriesToAdd)

def deleteOld(generation):
    q = db.GqlQuery("SELECT * FROM FeedEntry " +
                    "WHERE generation != :1", generation)
    db.delete(q)